Many X Variables

Module 4.2: Choosing Regressors

Author: Alex Cardazzi

Affiliation: Old Dominion University

All materials can be found at alexcardazzi.github.io.

Omitted Variables

Hopefully, after seeing what happened in the bedrooms and square footage example, you are convinced that omitting variables can bias our coefficients. In typical (and boring) economist fashion, we call this omitted variable bias.

Unfortunately, sometimes we cannot directly observe every variable that we’d like to control for in our regression. However, we can still sign the bias of our coefficients if we can anticipate the signs of the relevant correlations.

Consider the age and square footage example again. Initially, our omitted variable was square footage. We knew that square footage was positively correlated with price but negatively correlated with age. When we added square footage to the model, the coefficient for age moved from \(-1474.96\) to \(-1087.24\). The omitted variable bias was therefore negative, since it was pushing the coefficient down in the negative direction. In other words, the bias pushed \(-1087.24\) down to \(-1474.96\).

The same thing happened with square footage and bedrooms. Bedrooms and price are negatively correlated, but bedrooms and square footage are positively correlated. When we included bedrooms, the coefficient on square footage moved from \(111.69\) to \(136.36\), the same direction of movement as before. Therefore, omitting bedrooms also caused negative omitted variable bias in this regression.
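
If you want to check for this pattern in your own work, a useful habit is to estimate the model with and without the suspected omitted variable and watch how the coefficient of interest moves. Below is a minimal sketch using the ames data; the square footage column is written here as sqft, which is a placeholder name, so substitute whatever your data actually calls it.

Code
# "sqft" is a hypothetical column name standing in for square footage
short_model <- lm(price ~ age, ames)         # omits square footage
long_model  <- lm(price ~ age + sqft, ames)  # includes square footage

# Compare the age coefficient across the two models; if it rises after
# adding sqft, the omitted variable bias in the short model was negative
coef(short_model)["age"]
coef(long_model)["age"]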

If you have a model \(Y = \delta_0 + \delta_1\times A + \epsilon\) and \(B\) is your omitted variable, the bias in \(\delta_1\) will be:

Omitted Variable Bias Table

                                   cor(A, B) > 0    cor(A, B) < 0
B’s effect on Y is positive             +                 -
B’s effect on Y is negative             -                 +

Remember: positive (negative) bias means the estimated coefficient is artificially pushed up (down) relative to its true value.
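
For those curious about the algebra behind this table, the standard omitted variable bias result (stated here without derivation, under the usual OLS assumptions) goes as follows. Suppose the true model and the relationship between the two regressors are

\[Y = \beta_0 + \beta_1 A + \beta_2 B + \epsilon, \qquad B = \gamma_0 + \gamma_1 A + u.\]

Then the short regression of \(Y\) on \(A\) alone delivers

\[E[\hat{\delta}_1] = \beta_1 + \beta_2\gamma_1,\]

so the bias is \(\beta_2\gamma_1\): the sign of \(B\)’s effect on \(Y\) times the sign of the relationship between \(A\) and \(B\), which is exactly what the table summarizes.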

Unimportant Variables

Now we know the effect of omitting important variables on our coefficients. But what is the effect of the opposite: including unimportant variables?

Including unimportant variables will not bias your other coefficients because, if a variable is truly unimportant, its true coefficient is zero. Take another look at the table above: if \(B\) does not affect \(Y\) whatsoever, then the bias will be neither positive nor negative, regardless of \(B\)’s correlation with \(A\).

If this is the case, why don’t we throw every single variable we can think of into the model?! Well, there’s no such thing as a free lunch. Each time we include a new variable, especially when the new variable is highly correlated with one or more other variables, we lose a little bit of our precision. In other words, including new variables generally increases the standard errors of the coefficients for the other variables in the model. This makes it more difficult to effectively test our hypotheses.
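
To see this precision cost directly, here is a small simulated sketch (separate from the ames example, with made-up variable names): z has no true effect on y but is strongly correlated with x, and including it inflates the standard error on x’s coefficient without changing the estimate much.

Code
set.seed(1)
n <- 200
x <- rnorm(n)
z <- 0.9*x + sqrt(1 - 0.9^2)*rnorm(n) # unimportant, but correlated with x
y <- 2*x + rnorm(n)                   # z's true coefficient is zero

# Compare x's estimate and standard error with and without z
coef(summary(lm(y ~ x)))["x", ]
coef(summary(lm(y ~ x + z)))["x", ]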

Multicollinearity

Intuitively, adding two variables into a model that are highly correlated supplies the model with overlapping and redundant information. This makes it difficult for the model to parse out the effect of each variable individually. The problem of redundant information in a model is called multicollinearity. If two variables contain exactly the same information, the regression cannot differentiate between the two variables and therefore cannot be estimated without removing one of the variables. This is called perfect multicollinearity.

Multicollinearity occurs for one of two reasons:

  1. Structural: Sometimes we create variables from already existing variables. For example, we created an “age” variable from the “year built” variable. Year built and age are highly correlated. In fact, since one is a linear function of the other1, including both in a regression will create perfect multicollinearity. Put differently, since the correlation between the two is -1, they contain exactly the same information.
Age and Year Built in a Regression
Code
lm(price ~ age + yr_built, ames)
Output

Call:
lm(formula = price ~ age + yr_built, data = ames)

Coefficients:
(Intercept)          age     yr_built  
     239269        -1475           NA  
Notice how the resulting “yr_built” coefficient is NA. This is because R could not estimate it, so it dropped the redundant variable from the model.
  2. Data-Based: Other times, variables are correlated simply because we are working with observational data. For example, in reality, the number of hours of SAT tutoring a student receives is highly correlated with their family income. If we had an RCT that randomized the amount of tutoring students got, we wouldn’t have to worry about family income. However, in an observational setting, this correlation is simply a feature of the data.
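
In practice, a quick way to screen for either kind of multicollinearity before estimating a model is to look at the pairwise correlations among your regressors, or to compute variance inflation factors (VIFs). Here is a hedged sketch: it assumes the car package is installed, and sqft and bedrooms are placeholder column names for whatever the ames data actually uses.

Code
# Pairwise correlations among candidate regressors
# ("sqft" and "bedrooms" are hypothetical column names)
cor(ames[, c("age", "sqft", "bedrooms")])

# Variance inflation factors; values far above 5 or 10 are a
# common rule-of-thumb warning sign for troublesome multicollinearity
car::vif(lm(price ~ age + sqft + bedrooms, ames))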

To demonstrate the effects of multicollinearity on model output, we are going to simulate some data. Basically, we are going to simulate the following data generating process:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\]

where \(\beta_0 = 0\) and \(\beta_1=\beta_2=1\).

The first time we run the model, \(X_1\) and \(X_2\) will be completely uncorrelated. However, we are going to use a for loop to re-run the model over and over, increasing the correlation between the two variables after each iteration. The code is not necessarily important, but you can still hopefully follow along.

Code
set.seed(757)
N <- 100 # sample size
x1 <- rnorm(N, 0, 1) # create X1
e <-  rnorm(N, 0, 3) # create unobserved error
all1 <- list()
for(a in seq(0, .99, 0.01)){
  
  # X2 is a combination of randomness and X1
  # When a = 0, X2 and X1 are completely uncorrelated
  x2 <- ((1-a)*rnorm(N, 0, 1)) + (a*x1)
  rho <- cor(x1, x2) # calculate correlation
  y <- x1 + x2 + e # Data Generating Process
  # Estimate regression and store coefficients
  tmp <- as.data.frame(coef(summary(lm(y ~ x1 + x2))))
  # Store correlation
  tmp$a <- rho
  all1[[length(all1) + 1]] <- tmp # save results
}

# Combine results
all <- do.call(rbind, all1)
# Round off correlation
all$a <- round(all$a, 2)

layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
par(mar = c(4.1, 4.1, 1.1, 1.1))
# Panel 1: correlation between X1 and X2 at each loop iteration
plot(all$a[c(T,F,F)], las = 1, pch = 19,
     col = scales::alpha("black", 0.6), cex = 1.25,
     xlab = "Loop Iteration Number",
     ylab = "Corr. Between X1 and X2")
abline(h = c(0, 0.5, 0.9), lty = 2)
# Panel 2: X1's coefficient estimate at each correlation level
plot(all$a[c(F,T,F)], all$Estimate[c(F,T,F)], las = 1,
     xlab = "Correlation", cex = 1.25,
     ylab = "Coefficient Estimate", pch = 19,
     ylim = range(all$Estimate[c(F,T,F)][-which(all$Estimate[c(F,T,F)] == max(all$Estimate[c(F,T,F)]))]),
     col = scales::alpha("black", 0.6))
abline(h = 1) # horizontal line at the true coefficient value
# Panel 3: t-statistic on X1's coefficient at each correlation level
plot(all$a[c(F,T,F)], all$`t value`[c(F,T,F)], las = 1,
     xlab = "Correlation", cex = 1.25,
     ylab = "t-Statistic", pch = 19,
     col = scales::alpha("black", 0.6))
abline(h = 1.96, lty = 2) # dashed line at the usual 5% critical value
layout(matrix(c(1), 1, 1, byrow = TRUE))
Plot

Explanation of these figures
  1. As expected, the first few iterations have a very low correlation coefficient, hovering between 0 and 0.2. Then, the correlation coefficient reaches 0.5 by around iteration 40. Finally, the correlation reaches 0.9 by iteration 70 or so.
  2. Up until the correlation reaches \(0.95\), the coefficient estimate is extremely stable and hovers around 1.
  3. However, the t-statistic signals that, once the correlation reaches about 0.5, the regression is unable to differentiate between the coefficient estimate (which is still about 1) and 0 at the 95% confidence level.

All three of these plots provide evidence that multicollinearity does not bias coefficients, but it does hinder the precision of the estimates (and, at extreme correlations, individual estimates can swing wildly).
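
To put numbers on that loss of precision, you can pull the \(X_1\) rows out of the all data frame created above and compare the standard errors at the lowest and highest simulated correlations (columns 1 and 2 of the stored results hold the estimate and its standard error):

Code
# Keep only the rows for X1's coefficient (the second of every three rows)
x1_rows <- all[c(FALSE, TRUE, FALSE), ]

# Estimate and standard error when X1 and X2 are (nearly) uncorrelated...
x1_rows[which.min(abs(x1_rows$a)), 1:2]
# ...versus when they are almost perfectly correlated
x1_rows[which.max(x1_rows$a), 1:2]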

Choosing Variables

To summarize the effect of including (or not including) additional regressors:

  1. Omitting important variables biases coefficient estimates (and therefore model predictions, too).
  2. Including unimportant variables reduces the precision of the coefficient estimates.

The way you should think about adding new variables to your model is through a bias-variance tradeoff. We know that coefficients can be biased if we omit important variables, but our standard errors get larger as the number of parameters we have to estimate increases.

It’s important to include only the most relevant variables in your model so that you keep both bias and variance in check. You might ask: how can I tell which variables are most important? To that I would say: good question, and the best answer is economic theory.

Economic theory should guide your choices about which variables to include/exclude and which functional forms you should use (log(), x + x^2, etc.). Unfortunately, this might seem like an unsatisfying answer, but econometrics is as much an art as it is a science. As you become more comfortable with econometrics, selecting variables and functional forms will become more natural.
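
For reference, here is how a couple of common functional-form choices look in R’s formula syntax, sketched with the ames data (which transformations are appropriate is, again, a question for theory):

Code
# Log-transforming the outcome
lm(log(price) ~ age, ames)

# Adding a quadratic term; I() tells R to square age inside the formula
lm(price ~ age + I(age^2), ames)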

In the next portion of the module, we will discuss how to incorporate binary/categorical variables into our models.

Footnotes

  1. Age is a linear function of year built: ames$age <- 2011 - ames$yr_built