Many X Variables

Module 4.1: Multiple Regressors

Author: Alex Cardazzi

Affiliation: Old Dominion University

All materials can be found at alexcardazzi.github.io.

Housing Price Models

In the last module, we developed and estimated quite a few models to explain housing prices. However, as you can intuit, there are many factors that influence the price of a house. While an important feature of economic models is that they are relatively simple, our previous models were definitely too simple. To make our model a bit more realistic, we can include other factors that might influence price. First, let’s read in the data, remove some columns, and change the names of the remaining columns, and then we will see how to add more factors into our models.

Code
ames <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/ames.csv")
keep_columns <- c("price", "area", "Bedroom.AbvGr", "Year.Built", "Overall.Cond")
ames <- ames[,keep_columns]
colnames(ames) <- c("price", "sqft", "bedrooms", "yr_built", "condition")

This code has been run in the background for you to facilitate your use of the following WebR chunks.

In addition, we are going to create a new variable for age.

Code
ames$age <- 2011 - ames$yr_built

Simple Correlations

We’ll start by estimating a simple model where price is determined by age. Next, we’ll estimate a model where price is determined by age and square footage. Before we estimate anything, though, let’s first establish some of the correlations between the three variables in our data.

  1. The correlation between price and age is -0.558. This suggests that newer homes are typically more expensive.
  2. The correlation between price and square footage is 0.707. We saw this last module – as square footage increases, so does price. This makes sense; bigger houses typically command higher prices.
  3. The correlation between age and square footage is -0.242. This suggests that newer houses are bigger than older homes. This is an important data artifact!

Why is this so important? Let’s first estimate a simple model:

\[\text{Price}_i = \alpha_0 + \alpha_1 \times \text{Age}_i + \epsilon_i\]

\(\alpha_1\) = -1474.96 suggests that for every one year increase in property age, housing price decreases by $1474.96. However, remember, when we are talking about an older house, we are also talking about a generally smaller home (as evidenced via the correlation from before). This means that the regression coefficient of -1474.96 is actually a “composite” effect of both increased age and decreased square footage! In other words, when we increase age, two things happen: not only does age increase, but square footage also decreases. So which of these two effects are we really picking up with this coefficient?

Obviously, in reality, houses do not shrink as they age, but the model doesn’t know that. It only understands the correlations in the data. In these data, older houses are also smaller, so it’s important that we control for the size of a home to get a cleaner measure of the effect of age on prices.

Another way to think about this is that we are trying to compare houses that are the same in every way except for their ages. Controlling for square footage gets us one step closer to keeping everything about the home constant while we vary only its age.

How, then, do we control for square footage?

Breaking Down the Composite

To get an estimate of the isolated effect of age on price, we need some extra information. First, we need the initial regression where price is a function of age. Next, we need to know how square footage will respond to changes in age. To get this information, we’ll estimate a regression where square footage is the outcome and age is the explanatory variable. Lastly, we need to know how price responds to changes in square footage.

Solution
Code
coef(lm(price ~ age, ames)); cat("\n")
coef(lm(sqft ~ age, ames)); cat("\n")
coef(lm(price ~ sqft, ames))
Output
(Intercept)         age 
 239269.065   -1474.964 

(Intercept)         age 
1659.855230   -4.040108 

(Intercept)        sqft 
  13289.634     111.694 
Explanation of Results

The results:

  1. According to the first regression, when we increase age by one year, sale price drops by $1474.96. Remember, this is a “composite” effect.
  2. According to the second regression, when we increase age by one year, square footage drops by 4.04 ft².
  3. How much would the price decrease because of the associated reduction in square footage? From the third regression, $111.69 is the change in price for a one square foot increase, but we have a square footage change of -4.04 ft² since age increased by one year. Therefore, multiplying these two numbers together gives us -$451.26. You can think of this as the “indirect” effect of decreasing property size due to increasing age.
  4. The effect of age alone would then be the composite effect (-$1474.96) minus the effect attributable to the change in square footage (-$451.26). This comes out to -$1023.70, which is certainly smaller in magnitude than what the first regression tells us.
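The back-of-the-envelope arithmetic in the list above can be reproduced in a few lines. This is only a sketch: the three numbers are copied by hand from the regression output above, and the variable names are made up for illustration.

```r
# Coefficients copied from the three regressions above
composite      <- -1474.964   # price ~ age: dollars per extra year of age
sqft_per_year  <- -4.040108   # sqft ~ age: square feet per extra year of age
price_per_sqft <- 111.694     # price ~ sqft: dollars per extra square foot

# Indirect effect: the price change that comes from older homes being smaller
indirect <- sqft_per_year * price_per_sqft

# Rough direct effect of age: composite effect minus the indirect piece
direct <- composite - indirect

round(c(indirect = indirect, direct = direct), 2)
```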

The above is a rough approximation because all of those regressions are inherently biased for the same reason that the first one is. So, rather than estimating three regressions and doing some major algebratics¹, we can do this all in one step. We are now going to estimate the following model:

\[\text{Price}_i = \beta_0 + \beta_1 \times \text{Age}_i + \beta_2 \times \text{Sq. Ft.}_i + \epsilon_i\]

Without going through all the math, OLS will simultaneously find two lines of best fit. You can even think of this as a surface, or a “plane”, since there are two dimensions instead of just one. \(\beta_0, \beta_1, \text{and} \ \beta_2\) will still minimize the sum of squared residuals, and the residuals will still average to zero.
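As a quick check on these two properties, the sketch below fits a two-variable regression on simulated data. The variables x1, x2, and y are invented for illustration and are not the Ames data. It compares lm() against the textbook closed-form least-squares solution and computes the average residual.

```r
set.seed(42)  # make the simulated data reproducible

# Invented data: two correlated explanatory variables and one outcome
n  <- 500
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# With an intercept in the model, OLS residuals average to (numerically) zero
mean_resid <- mean(resid(fit))

# Closed-form OLS: solve (X'X) b = X'y for all three coefficients at once
X <- cbind(1, x1, x2)
b_manual <- solve(crossprod(X), crossprod(X, y))

# Compare lm()'s coefficients with the matrix solution
cbind(lm = coef(fit), manual = as.numeric(b_manual))
```

The two columns should agree up to floating-point error, which is one way to see that all three coefficients are chosen jointly rather than one at a time.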

Before estimating this specific model, however, let’s visualize the data in three dimensions. On the \(Z\)-axis (the one going up and down), we’ll plot the home’s sale price. Then, on the \(X\)- and \(Y\)-axes, we are going to plot age and square footage. Feel free to click on and spin the following visualization:

The following figures are 2D representations of what you would see if you were to spin the 3D figure in certain directions. The first panel shows us the relationship between price and square footage. The second panel depicts the relationship between age and price. The final panel shows the relationship between square footage and age.

Plot

Estimation with Multiple Variables

Now let’s estimate the model I mentioned before. To include another variable in lm(), simply add it to the right side of the formula like you would a math equation.

Code
reg_beta <- lm(price ~ age + sqft, ames)
coef(reg_beta)
Output
(Intercept)         age        sqft 
79973.60078 -1087.23673    95.96949 
Coefficient Interpretation
  1. \(\beta_0\)’s interpretation is now the price of a property where all \(X\) variables are equal to zero. In this case, that is the price of a house with an age of zero and zero square footage. Of course, a house like this does not exist, so \(\beta_0\) is not very meaningful.
  2. \(\beta_1\) is the effect of one additional year of age on price, holding square footage constant. The model has removed the effect of square footage from this coefficient since we have “controlled” for it by adding square footage into the model.²
  3. \(\beta_2\) is the effect of an additional square foot of size on final sale price, holding age constant.

This model also makes some intuitive sense: it says that older houses are worth less and larger houses are worth more.
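To connect this estimate to the composite coefficient from earlier, it may help to write out the decomposition. Here \(\delta_1\) is a label introduced only for this illustration; it denotes the slope from the regression of square footage on age (about -4.04 above). For OLS estimates computed on the same data, the simple-regression slope on age equals the direct effect of age plus the indirect effect that runs through square footage:

\[\alpha_1 = \beta_1 + \beta_2 \times \delta_1\]

\[-1474.96 \approx -1087.24 + 95.97 \times (-4.04)\]

The earlier back-of-the-envelope calculation used 111.69 in place of 95.97, because the regression of price on square footage did not control for age. That is why it produced -1023.7 rather than -1087.24.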

Recall from earlier when I said a regression with two \(X\) variables creates a plane rather than a single line. See the below visualization of this regression in action. Given any combination of age and square footage, we can find a point on the plane that gives us the predicted price.
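Finding a point on the plane is just a matter of plugging values into the estimated equation. The sketch below hard-codes the coefficients printed above; the helper name predict_price and the example house (20 years old, 1,500 square feet) are made up for illustration.

```r
# Coefficients copied from the output of lm(price ~ age + sqft, ames) above
b0     <- 79973.60078
b_age  <- -1087.23673
b_sqft <- 95.96949

# Hypothetical helper: height of the fitted plane at a given age and size
predict_price <- function(age, sqft) {
  b0 + b_age * age + b_sqft * sqft
}

# Example: a 20-year-old home with 1,500 square feet
pred <- predict_price(age = 20, sqft = 1500)
round(pred, 2)
```

With the fitted model object itself, `predict(reg_beta, newdata = data.frame(age = 20, sqft = 1500))` should return the same point on the plane.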

Square Footage and Bedrooms

Let’s practice with another variable: bedrooms. We’ll estimate a model of the following form. Think for a minute: what do you think the sign of each coefficient will be?

\[\text{Price}_i = \gamma_0 + \gamma_1 \times \text{Sq. Ft.}_i + \gamma_2 \times \text{Bedrooms}_i + \epsilon_i\]

Hypotheses
  1. We should expect that \(\gamma_1>0\), since bigger houses should be more valuable.
  2. We should also expect \(\gamma_2>0\), since houses with more bedrooms should also be more valuable.

Let’s estimate three models:

Code
r1 <- lm(price ~ sqft, ames)
r2 <- lm(price ~ bedrooms, ames)
r3 <- lm(price ~ sqft + bedrooms, ames)

regz <- list(`Price` = r1,
             `Price` = r2,
             `Price` = r3)
coefz <- c("sqft" = "Square Footage",
           "bedrooms" = "Bedrooms",
           "(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
library("modelsummary")
options("modelsummary_factory_default" = "kableExtra")
modelsummary(regz,
             title = "Effect of Sq. Ft. and Bedrooms on Sale Price",
             estimate = "{estimate}{stars}",
             coef_map = coefz,
             gof_map = gofz)
Effect of Sq. Ft. and Bedrooms on Sale Price
                  (1) Price        (2) Price        (3) Price
Square Footage    111.694***                        136.361***
                  (2.066)                           (2.247)
Bedrooms                           13889.495***     −29149.110***
                                   (1765.042)       (1372.135)
Constant          13289.634***     141151.743***    59496.236***
                  (3269.703)       (5245.395)       (3741.249)
Num.Obs.          2930             2930             2930
R2                0.500            0.021            0.566

In the first model, we find that an increase in square footage increases price. In the second model, we find that an increase in bedrooms increases price, too. However, in the third model, we find:

  1. Each additional square foot increases price by $136.36.
  2. Each additional bedroom decreases price by $29,149.11.

Wait a minute! An extra bedroom decreases sale price? Initially, this might be confusing, so let’s think about this carefully. In words, this coefficient’s interpretation is:

Increasing the number of bedrooms by one, holding square footage constant, decreases sale price by $29,149.11.

What does it mean to increase the number of bedrooms in a property while holding square footage constant? Suppose a home is 2,000 square feet with four rooms. Each room is 500 square feet (on average). If we add an additional room, but do not change the square footage, each room would now only be 400 square feet (on average). Therefore, adding an extra bedroom, while holding square footage constant, makes for a bunch of small rooms. This is not something that is typically sought after in the housing market, hence the negative coefficient.
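The sign flip can be reproduced on purpose with made-up data. The sketch below is a simulation, not the Ames data: every number in it (the size range, the 130 dollars per square foot, the 25,000 dollar penalty per bedroom) is an assumption chosen only to mimic the story above.

```r
set.seed(123)  # reproducible simulated data (not the Ames data)

n    <- 2000
sqft <- runif(n, 800, 3000)
# Bigger homes tend to have more bedrooms, with some noise; at least one bedroom
bedrooms <- pmax(1, round(sqft / 500 + rnorm(n, sd = 0.7)))
# Assumed "truth": size adds value, while carving a fixed size into more rooms subtracts it
price <- 50000 + 130 * sqft - 25000 * bedrooms + rnorm(n, sd = 30000)

# Without controlling for size, bedrooms act as a stand-in for a bigger house
b_simple <- unname(coef(lm(price ~ bedrooms))["bedrooms"])

# Holding size constant, an extra bedroom means smaller rooms
b_controlled <- unname(coef(lm(price ~ sqft + bedrooms))["bedrooms"])

c(simple = b_simple, controlled = b_controlled)
```

In this simulation the bedroom coefficient is positive on its own and negative once square footage is held constant, which is the same pattern as in the table above.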

We can add more than just two variables to our model. Below is a progression of models that culminates in a model with three explanatory variables:

Code
r1 <- lm(log(price) ~ log(sqft), ames)
r2 <- lm(log(price) ~ log(sqft) + bedrooms, ames)
r3 <- lm(log(price) ~ log(sqft) + bedrooms + age, ames)
regz <- list(`log(Price)` = r1,
             `log(Price)` = r2,
             `log(Price)` = r3)
coefz <- c("log(sqft)" = "log(Square Footage)",
           "bedrooms" = "Bedrooms",
           "age" = "Age")
gofz <- c("nobs", "r.squared")

modelsummary(regz,
             title = "Determinants of Sale Price",
             estimate = "{estimate}{stars}",
             coef_map = coefz,
             gof_map = gofz)
Determinants of Sale Price
                       (1) log(Price)   (2) log(Price)   (3) log(Price)
log(Square Footage)    0.908***         1.094***         0.875***
                       (0.016)          (0.018)          (0.015)
Bedrooms                                −0.138***        −0.082***
                                        (0.007)          (0.006)
Age                                                      −0.006***
                                                         (0.000)
Num.Obs.               2930             2930             2930
R2                     0.523            0.580            0.731

Interpreting these coefficients is similar to the case where you have only two explanatory variables.

Coefficient Interpretation for Model 3
  1. A 1% increase in square footage, holding bedrooms and age constant, increases sale price by about 0.875%.
  2. An additional bedroom, holding age and square footage constant, reduces sale price by approximately 8.2%.
  3. An additional year of age, holding square footage and bedrooms constant, reduces sale price by approximately 0.6%.
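The percentage readings in the list above are the usual approximations for log models. As a rough check, the sketch below computes the exact percent changes implied by the rounded coefficients in the table; the helper name pct_change is hypothetical.

```r
# Hypothetical helper: exact percent change in price implied by a coefficient b
# on a variable that enters a log(price) model in levels (not logged)
pct_change <- function(b) 100 * (exp(b) - 1)

# Rounded coefficients copied from the third column of the table above
pct_bedroom <- pct_change(-0.082)  # one more bedroom
pct_age     <- pct_change(-0.006)  # one more year of age

# For the log-log term, a 1% increase in square footage scales price by 1.01^0.875
pct_sqft <- 100 * (1.01^0.875 - 1)

round(c(bedroom = pct_bedroom, age = pct_age, sqft = pct_sqft), 2)
```

For small coefficients like the one on age, the approximation and the exact value are nearly identical; the gap grows as the coefficient gets larger in magnitude.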

Evaluating \(R^2\)

Something else to note is how the \(R^2\) changes from model to model as we add explanatory/control variables. Each new control variable adds a little bit more information, which improves the model’s ability to explain. As a way to visualize this, we can plot the fitted values against the true, observed values. If the model was perfect at explaining prices, all of the points would fall on the red 45°, \(y = x\) line.

Code
par(mfrow = c(1, 3))
ylim <- exp(range(r1$fitted.values, r2$fitted.values, r3$fitted.values))
plot(ames$price, exp(r1$fitted.values), ylim = ylim,
     ylab = "Fitted/Predicted Price", xlab = "",
     main = "Controls: sqft",
     col = scales::alpha("black", 0.2), pch = 19)
abline(0, 1, col = "tomato")
legend("bottomright", bty = "n", cex = 2,
       legend = paste0("R2: ", round(summary(r1)$r.squared, 3)))
plot(ames$price, exp(r2$fitted.values), ylim = ylim,
     xlab = "Observed Price", ylab = "",
     main = "Controls: sqft + bedrooms",
     col = scales::alpha("black", 0.2), pch = 19)
abline(0, 1, col = "tomato")
legend("bottomright", bty = "n", cex = 2,
       legend = paste0("R2: ", round(summary(r2)$r.squared, 3)))
plot(ames$price, exp(r3$fitted.values), ylim = ylim,
     ylab = "", xlab = "",
     main = "Controls: sqft + bedrooms + age",
     col = scales::alpha("black", 0.2), pch = 19)
abline(0, 1, col = "tomato")
legend("bottomright", bty = "n", cex = 2,
       legend = paste0("R2: ", round(summary(r3)$r.squared, 3)))
Plot

As control variables are included, the \(R^2\) increases and the points start to get tighter to the diagonal line. This means that the predicted values are getting closer to the actual values.

A note about \(R^2\): You should not choose which variables are (or are not) important based on changes in \(R^2\). Technically, it is impossible for \(R^2\) to decrease after you add another variable to a model estimated on the same data. Of course, variables that add a lot of explanatory power to your model should be considered, but \(R^2\) does not tell you which model is best. As we will see later, there are trade-offs faced when including/excluding variables, so you should rely on theory and intuition to guide your modeling decisions.
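To see why a rising \(R^2\) is weak evidence on its own, consider the sketch below. It uses simulated data (not the Ames data) and adds a variable, here called junk, that is pure noise by construction.

```r
set.seed(2024)  # reproducible simulated data (not the Ames data)

n    <- 300
x    <- rnorm(n)
y    <- 2 + 1.5 * x + rnorm(n)
junk <- rnorm(n)  # pure noise, unrelated to y by construction

r2_small <- summary(lm(y ~ x))$r.squared
r2_big   <- summary(lm(y ~ x + junk))$r.squared

# R-squared does not fall, even though junk carries no real information
c(without_junk = r2_small, with_junk = r2_big)
```

The second value is at least as large as the first, even though nobody would argue that junk belongs in the model.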

Points that are above the 45° line are expected to have higher sale prices than what they actually sold for. Points below the line sold for higher prices than what the model predicted.

Footnotes

  1. I’m inventing a new word for “Algebra” + “Acrobatics”. I’m doing my best to make econometrics fun, so you have to cut me some slack.

  2. Notice how close this number is to -1023.7, from before.