Module 4.2: Choosing Regressors
All materials can be found at alexcardazzi.github.io.
Hopefully, after seeing what happened in the bedrooms and square footage example, you are convinced that omitting variables can bias our coefficients. In typical (and boring) economist fashion, we call this omitted variable bias.
Unfortunately, sometimes we cannot directly observe every variable that we’d like to control for in our regression. However, even when a variable must be omitted, we can often sign the resulting bias in our coefficients if we can anticipate the signs of the relevant correlations.
Consider the age and square footage example again. Initially, our omitted variable was square footage. We knew that square footage was positively correlated with price but negatively correlated with age. When we added square footage to the model, the coefficient on age moved from \(-1474.96\) to \(-1087.24\). The omitted variable bias was therefore negative, since it was pushing the coefficient down in the negative direction. In other words, the bias pushed \(-1087.24\) down to \(-1474.96\).
The same thing happened with square footage and bedrooms. Bedrooms and price are negatively correlated, but bedrooms and square footage are positively correlated. When we added bedrooms, the coefficient on square footage moved from \(111.69\) to \(136.36\), meaning the bias from omitting bedrooms had been pushing it in the same (negative) direction as before. Therefore, omitting bedrooms also caused negative omitted variable bias in this regression.
If you have a model \(Y = \delta_0 + \delta_1\times A + \epsilon\) and \(B\) is your omitted variable, the bias in \(\delta_1\) will be:
| | \(cor(A, B) > 0\) | \(cor(A, B) < 0\) |
|---|---|---|
| \(B\)'s effect on \(Y\) is positive | \(+\) | \(-\) |
| \(B\)'s effect on \(Y\) is negative | \(-\) | \(+\) |
Remember: positive (negative) bias means the coefficient is artificially pushed upwards (downwards), regardless of the coefficient’s sign.
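To see the table in action, here is a minimal simulation sketch (the data, variable names, and coefficient values are invented for illustration, not taken from the ames data). \(B\) has a positive effect on \(Y\) and is positively correlated with \(A\), so the top-left cell of the table predicts positive bias when \(B\) is omitted:

```r
set.seed(123)
n <- 1000
A <- rnorm(n)
B <- 0.7 * A + rnorm(n)        # cor(A, B) > 0
Y <- 2 * A + 3 * B + rnorm(n)  # B's true effect on Y is positive

coef(lm(Y ~ A))["A"]      # omitting B: estimate well above the true value of 2
coef(lm(Y ~ A + B))["A"]  # controlling for B: estimate close to 2
```

Flipping the sign of either the correlation or \(B\)'s effect flips the direction of the bias, matching the other cells of the table.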
Now we know the effect of omitting important variables on our coefficients. But what is the effect of the opposite: including unimportant variables?
Including unimportant variables will not bias your other coefficients, because if a variable is truly unimportant, its true coefficient is zero. Take a look at the above table – if \(B\) does not affect \(Y\) whatsoever, then the bias will be neither positive nor negative, regardless of \(B\)’s correlation with \(A\).
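A quick sketch of this point, again with simulated, illustrative data: \(Z\) is correlated with \(A\) but has no effect on \(Y\), so including it leaves the coefficient on \(A\) unbiased:

```r
set.seed(456)
n <- 5000
A <- rnorm(n)
Z <- 0.8 * A + rnorm(n)  # Z is correlated with A...
Y <- 2 * A + rnorm(n)    # ...but Z plays no role in the DGP

coef(lm(Y ~ A))["A"]      # close to the true value of 2
coef(lm(Y ~ A + Z))["A"]  # still close to 2: no bias from including Z
```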
If this is the case, why don’t we throw every single variable we can think of into the model?! Well, there’s no such thing as a free lunch. Each time we include a new variable, especially when the new variable is highly correlated with one or more other variables, we lose a little bit of our precision. In other words, including new variables generally increases the standard errors of the coefficients for the other variables in the model. This makes it more difficult to effectively test our hypotheses.
Intuitively, adding two variables into a model that are highly correlated supplies the model with overlapping and redundant information. This makes it difficult for the model to parse out the effect of each variable individually. The problem of redundant information in a model is called multicollinearity. If two variables contain exactly the same information, the regression cannot differentiate between the two variables and therefore cannot be estimated without removing one of the variables. This is called perfect multicollinearity.
Multicollinearity occurs for one of two reasons: either one variable is an exact linear function of others (perfect multicollinearity), or variables are highly, though not exactly, correlated. For example, in the ames data, age is an exact linear function of yr_built, so including both in the same model creates perfect multicollinearity:
```r
lm(price ~ age + yr_built, ames)
```

```
Call:
lm(formula = price ~ age + yr_built, data = ames)

Coefficients:
(Intercept)          age     yr_built
     239269        -1475           NA
```

Notice that the coefficient on yr_built is `NA`. This is because R could not estimate it and drops it from the model.
To demonstrate the effects of multicollinearity on model output, we are going to simulate some data. Basically, we are going to simulate the following data generating process:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\]
where \(\beta_0 = 0\) and \(\beta_1=\beta_2=1\).
The first time we run the model, \(X_1\) and \(X_2\) will be completely uncorrelated. However, we are going to use a for loop to re-run the model over and over, increasing the correlation between the two variables after each iteration. The code is not necessarily important, but you can still hopefully follow along.
```r
set.seed(757)
N <- 100               # sample size
x1 <- rnorm(N, 0, 1)   # create X1
e <- rnorm(N, 0, 3)    # create unobserved error
all1 <- list()
for(a in seq(0, .99, 0.01)){
  # X2 is a combination of randomness and X1
  # When a = 0, X2 and X1 are completely uncorrelated
  x2 <- ((1-a)*rnorm(N, 0, 1)) + (a*x1)
  rho <- cor(x1, x2)   # calculate correlation
  y <- x1 + x2 + e     # Data Generating Process
  # Estimate regression and store coefficients
  tmp <- as.data.frame(coef(summary(lm(y ~ x1 + x2))))
  # Store correlation
  tmp$a <- rho
  all1[[length(all1) + 1]] <- tmp # save results
}
# Combine results
all <- do.call(rbind, all1)
# Round off correlation
all$a <- round(all$a, 2)
```
```r
# Plot the correlation path, coefficient estimates, and t-statistics
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
par(mar = c(4.1, 4.1, 1.1, 1.1))
plot(all$a[c(T,F,F)], las = 1, pch = 19,
     col = scales::alpha("black", 0.6), cex = 1.25,
     xlab = "Loop Iteration Number",
     ylab = "Corr. Between X1 and X2")
abline(h = c(0, 0.5, 0.9), lty = 2)
plot(all$a[c(F,T,F)], all$Estimate[c(F,T,F)], las = 1,
     xlab = "Correlation", cex = 1.25,
     ylab = "Coefficient Estimate", pch = 19,
     ylim = range(all$Estimate[c(F,T,F)][-which(all$Estimate[c(F,T,F)] == max(all$Estimate[c(F,T,F)]))]),
     col = scales::alpha("black", 0.6))
abline(h = 1)
plot(all$a[c(F,T,F)], all$`t value`[c(F,T,F)], las = 1,
     xlab = "Correlation", cex = 1.25,
     ylab = "t-Statistic", pch = 19,
     col = scales::alpha("black", 0.6))
abline(h = 1.96, lty = 2)
layout(matrix(c(1), 1, 1, byrow = TRUE))
```
All three of these plots provide evidence that multicollinearity does not bias coefficients (until it becomes extreme), but does hinder the precision of the estimates.
To summarize the effect of including (or not including) additional regressors:

- Omitting an important variable biases the coefficients on the included variables.
- Including an unimportant or redundant variable does not bias the coefficients, but it inflates their standard errors.
The way you should think about adding new variables to your model is through a bias-variance tradeoff. We know that coefficients can be biased if we omit important variables, but our standard errors get larger as the number of parameters we have to estimate increases.
The goal, then, is to include only the most important variables in your model, balancing bias against variance. You might ask: how can I tell which variables are most important? To that I would say: good question, and the best answer is economic theory.
Economic theory should guide your choices about which variables to include or exclude and which functional forms you should use (`log()`, `x + x^2`, etc.). This might seem like an unsatisfying answer, but econometrics is as much an art as it is a science. As you become more comfortable with econometrics, selecting variables and functional forms will become more natural.
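As a sketch of what these functional forms look like in R (the data below are simulated, and the variable names and coefficient values are invented for illustration):

```r
set.seed(789)
n <- 500
sqft <- runif(n, 800, 3000)
age <- runif(n, 0, 100)
# An invented DGP with a log outcome and a quadratic in age
price <- exp(11 + 0.0004 * sqft - 0.01 * age + 0.00005 * age^2 + rnorm(n, 0, 0.2))

# Log outcome: coefficients are approximately proportional (percent) effects
lm(log(price) ~ sqft + age)

# I() lets us add a quadratic term, so age's effect can vary with age
lm(log(price) ~ sqft + age + I(age^2))
```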
In the next portion of the module, we will discuss how to incorporate binary/categorical variables into our models.
Age is a linear function of year built: `ames$age <- 2011 - ames$yr_built`