Module 4.1: Multiple Regressors
Old Dominion University
In the last module, we developed and estimated quite a few models to explain housing prices. However, as you can intuit, many factors influence the price of a house. While an important feature of economic models is that they are relatively simple, our previous models were definitely too simple. To make our model a bit more realistic, we can include other factors that might influence price. First, let’s read in the data, remove some columns, and rename the remaining ones; then we will see how to add more factors into our models.
In addition, we are going to create a new variable for age.
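As a rough sketch of those steps (the file name and original column names here are assumptions based on the standard Ames housing data and may differ from the course files), the preparation might look something like this:

ames <- read.csv("AmesHousing.csv")                   # hypothetical file name
ames <- ames[, c("SalePrice", "Gr.Liv.Area",          # keep only the columns we need
                 "Bedroom.AbvGr", "Year.Built", "Yr.Sold")]
names(ames) <- c("price", "sqft", "bedrooms",         # shorter, friendlier names
                 "year_built", "year_sold")
ames$age <- ames$year_sold - ames$year_built          # new variable: age of the home when it sold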
We’ll start by estimating a simple model where price is determined by age. Next, we’ll estimate a model where price is determined by age and square footage. Before we estimate anything, though, let’s first establish some of the correlations between the three variables in our data.
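One quick way to check those correlations, assuming the `ames` data frame prepared above, is a correlation matrix:

round(cor(ames[, c("price", "age", "sqft")]), 2)   # pairwise correlations among the three variables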
Why is this so important? Let’s first estimate a simple model:
\[\text{Price}_i = \alpha_0 + \alpha_1 \times \text{Age}_i + \epsilon_i\]
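As a sketch, this regression can be estimated with lm(), again using the `ames` data frame from above:

coef(lm(price ~ age, ames))   # intercept (alpha_0) and slope on age (alpha_1)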
\(\alpha_1\) = -1474.96 suggests that for every one year increase in property age, housing price decreases by $1474.96. However, remember, when we are talking about an older house, we are also talking about a generally smaller home (as evidenced via the correlation from before). This means that the regression coefficient of -1474.96 is actually a “composite” effect of both increased age and decreased square footage! In other words, when we increase age, two things happen: not only does age increase, but square footage also decreases. So which of these two effects are we really picking up with this coefficient?
Obviously, in reality, houses do not shrink as they age, but the model doesn’t know that. It only understands the correlations in the data. In these data, older houses are also smaller, so it’s important we control for the size of a home to get a true measure of the effect of age on prices.
Another way to think about this is that we are trying to compare houses that are the same in every way except for their ages. Controlling for square footage gets us one step closer to keeping everything about the home constant while we vary only its age.
How, then, do we control for square footage?
To get an estimate of the isolated effect of age on price, we need some extra information. First, we need the initial regression where price is a function of age. Next, we need to know how square footage will respond to changes in age. To get this information, we’ll estimate a regression where square footage is the outcome and age is the explanatory variable. Lastly, we need to know how price responds to changes in square footage.
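A minimal sketch of those three regressions, assuming the same `ames` data frame used throughout this module:

total_eff   <- lm(price ~ age, ames)    # price as a function of age (the original regression)
age_on_sqft <- lm(sqft ~ age, ames)     # how square footage responds to age
sqft_eff    <- lm(price ~ sqft, ames)   # how price responds to square footage

# Back-of-envelope "isolated" effect of age: the total effect minus the part
# operating through square footage (only a rough approximation)
coef(total_eff)["age"] - coef(age_on_sqft)["age"] * coef(sqft_eff)["sqft"]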
The results:
The above is only a rough approximation because all of those regressions are biased for the same reason the first one is. So, rather than estimating three regressions and doing some heavy algebra, we can do this all in one step. We are now going to estimate the following model:
\[\text{Price}_i = \beta_0 + \beta_1 \times \text{Age}_i + \beta_2 \times \text{Sq. Ft.}_i + \epsilon_i\]
Without going through all the math, OLS will simultaneously fit both slopes. You can think of the result as a surface, or a “plane”, of best fit rather than a single line, since there are now two explanatory variables instead of one. The estimates \(\beta_0, \beta_1, \text{and} \ \beta_2\) will still minimize the sum of squared residuals and still produce residuals that average to zero.
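In symbols, OLS now chooses the three coefficients to solve:

\[\min_{\beta_0, \beta_1, \beta_2} \sum_{i=1}^{N} \left(\text{Price}_i - \beta_0 - \beta_1 \times \text{Age}_i - \beta_2 \times \text{Sq. Ft.}_i\right)^2\]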
Before estimating this specific model, however, let’s visualize the data in three dimensions. On the \(Z\)-axis (the one going up and down), we’ll plot the home’s sale price. Then, on the \(X\)- and \(Y\)-axes, we are going to plot age and square footage. Feel free to click on and spin the following visualization:
The following figures are 2D representations of what you would see if you were to spin the 3D figure in certain directions. The first panel shows us the relationship between price and square footage. The second panel depicts the relationship between age and price. The final panel shows the relationship between square footage and age.
Now let’s estimate the model I mentioned before. To include another variable in lm(), simply add it to the right side of the formula like you would a math equation.
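The call that produces the coefficients below likely looks something like this (a sketch assuming the same `ames` data frame; the original code may differ slightly):

coef(lm(price ~ age + sqft, ames))   # price regressed on both age and square footage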
(Intercept) age sqft
79973.60078 -1087.23673 95.96949
This model also makes some intuitive sense: it says that older houses are worth less and larger houses are worth more.
Recall from earlier when I said a regression with two \(X\) variables creates a plane rather than a single line. See the below visualization of this regression in action. Given any combination of age and square footage, we can find a point on the plane that gives us the predicted price.
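For example, predict() evaluates the plane at any combination of age and square footage we choose (the values below are hypothetical):

plane <- lm(price ~ age + sqft, ames)                        # the two-regressor model
predict(plane, newdata = data.frame(age = 30, sqft = 1500))  # predicted price for a 30-year-old, 1,500 sq. ft. home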
Let’s practice with another variable: bedrooms. We’ll estimate a model of the following form. Think for a minute: what do you think the sign of each coefficients will be?
\[\text{Price}_i = \gamma_0 + \gamma_1 \times \text{Sq. Ft.}_i + \gamma_2 \times \text{Bedrooms}_i + \epsilon_i\]
Let’s estimate three models:
r1 <- lm(price ~ sqft, ames)
r2 <- lm(price ~ bedrooms, ames)
r3 <- lm(price ~ sqft + bedrooms, ames)
regz <- list(`Price` = r1,
`Price` = r2,
`Price` = r3)
coefz <- c("sqft" = "Square Footage",
"bedrooms" = "Bedrooms",
"(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
library("modelsummary")
options("modelsummary_factory_default" = "kableExtra")
modelsummary(regz,
title = "Effect of Sq. Ft. and Bedrooms on Sale Price",
estimate = "{estimate}{stars}",
coef_map = coefz,
gof_map = gofz)
|  | Price | Price | Price |
|---|---|---|---|
| Square Footage | 111.694*** |  | 136.361*** |
|  | (2.066) |  | (2.247) |
| Bedrooms |  | 13889.495*** | −29149.110*** |
|  |  | (1765.042) | (1372.135) |
| Constant | 13289.634*** | 141151.743*** | 59496.236*** |
|  | (3269.703) | (5245.395) | (3741.249) |
| Num.Obs. | 2930 | 2930 | 2930 |
| R2 | 0.500 | 0.021 | 0.566 |
In the first model, we find that an increase in square footage increases price. In the second model, we find that an increase in bedrooms increases price, too. However, in the third model, we find:
Wait a minute… An extra bedroom decreases sale price? Initially, this might be confusing, so let’s think about this carefully. In words, this coefficient’s interpretation is:
Increasing the number of bedrooms by one, holding square footage constant, decreases sale price by $29,149.11.
What does it mean to increase the number of bedrooms in a property while holding square footage constant? Suppose a home is 2,000 square feet with four bedrooms, so each bedroom is 500 square feet on average. If we add another bedroom without changing the square footage, each bedroom now averages only 400 square feet. Therefore, adding an extra bedroom while holding square footage constant means cramming more, smaller rooms into the same space. That is not something buyers typically seek in the housing market, hence the negative coefficient.
We can add more than just two variables to our model. Below is a progression of models that culminate in a model with three explanatory variables:
r1 <- lm(log(price) ~ log(sqft), ames)
r2 <- lm(log(price) ~ log(sqft) + bedrooms, ames)
r3 <- lm(log(price) ~ log(sqft) + bedrooms + age, ames)
regz <- list(`log(Price)` = r1,
`log(Price)` = r2,
`log(Price)` = r3)
coefz <- c("log(sqft)" = "log(Square Footage)",
"bedrooms" = "Bedrooms",
"age" = "Age")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
title = "Determinants of Sale Price",
estimate = "{estimate}{stars}",
coef_map = coefz,
gof_map = gofz)
|  | log(Price) | log(Price) | log(Price) |
|---|---|---|---|
| log(Square Footage) | 0.908*** | 1.094*** | 0.875*** |
|  | (0.016) | (0.018) | (0.015) |
| Bedrooms |  | −0.138*** | −0.082*** |
|  |  | (0.007) | (0.006) |
| Age |  |  | −0.006*** |
|  |  |  | (0.000) |
| Num.Obs. | 2930 | 2930 | 2930 |
| R2 | 0.523 | 0.580 | 0.731 |
Interpreting these coefficients is similar to the case where you have only two explanatory variables. For example, in the third model, holding bedrooms and age constant, a 1% increase in square footage is associated with roughly a 0.875% increase in sale price.
Something else to note is how the \(R^2\) changes from model to model as we add explanatory/control variables. Each new control variable adds a bit more information, which improves the model’s ability to explain the variation in prices. One way to visualize this is to plot the fitted values against the true, observed values. If the model were perfect at explaining prices, all of the points would fall on the red 45°, \(y = x\), line.
# Three panels side by side, one per model
par(mfrow = c(1, 3))
# Common y-axis range; fitted values are on the log scale, so exponentiate back to dollars
ylim <- exp(range(r1$fitted.values, r2$fitted.values, r3$fitted.values))
plot(ames$price, exp(r1$fitted.values), ylim = ylim,
     ylab = "Fitted/Predicted Price", xlab = "",
     main = "Controls: sqft",
     col = scales::alpha("black", 0.2), pch = 19)
abline(0, 1, col = "tomato")   # 45-degree line: points on it are predicted perfectly
legend("bottomright", bty = "n", cex = 2,
       legend = paste0("R2: ", round(summary(r1)$r.squared, 3)))
plot(ames$price, exp(r2$fitted.values), ylim = ylim,
     xlab = "Observed Price", ylab = "",
     main = "Controls: sqft + bedrooms",
     col = scales::alpha("black", 0.2), pch = 19)
abline(0, 1, col = "tomato")
legend("bottomright", bty = "n", cex = 2,
       legend = paste0("R2: ", round(summary(r2)$r.squared, 3)))
plot(ames$price, exp(r3$fitted.values), ylim = ylim,
     ylab = "", xlab = "",
     main = "Controls: sqft + bedrooms + age",
     col = scales::alpha("black", 0.2), pch = 19)
abline(0, 1, col = "tomato")
legend("bottomright", bty = "n", cex = 2,
       legend = paste0("R2: ", round(summary(r3)$r.squared, 3)))
As control variables are included, the \(R^2\) increases and the points cluster more tightly around the diagonal line. This means the predicted values are getting closer to the actual values.
Points above the 45° line are homes the model predicts to sell for more than they actually did; points below the line sold for more than the model predicted.
ECON 311: Economics, Causality, and Analytics