Many X Variables

Module 4.3: Binary Variables and Interactions

Alex Cardazzi

Old Dominion University

Guidance Counselors

To start this portion of the module, we are going to revisit our beloved SAT - GPA data (data, documentation).

Last module, we began by estimating the relationship between HS GPA and the sum of SAT Percentiles. We then split the sample into males and females, and estimated two models separately for each group. Let’s re-do these three models so we can have them as baselines.

Code
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")
df <- df[df$hs_gpa <= 4,]
r1 <- lm(sat_sum ~ hs_gpa, df)
r2 <- lm(sat_sum ~ hs_gpa, df, subset = df$sex == 1)
r3 <- lm(sat_sum ~ hs_gpa, df, subset = df$sex == 2)

library("modelsummary")
options("modelsummary_factory_default" = "kableExtra")
regz <- list(`All Students` = r1,
             `Female` = r2,
             `Male` = r3)
coefz <- c("hs_gpa" = "High School GPA",
           "(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
             title = "Effect of GPA on SAT Scores",
             estimate = "{estimate}{stars}",
             coef_map = coefz,
             gof_map = gofz)
Effect of GPA on SAT Scores

                     All Students   Female      Male
High School GPA      11.538***      11.392***   13.715***
                     (0.753)        (0.999)     (1.079)
Constant             66.468***      70.196***   55.851***
                     (2.440)        (3.167)     (3.579)
Num.Obs.             999            515         484
R2                   0.191          0.202       0.251
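
Before moving on, note that the male and female slopes in the table above are not identical. As a quick sketch (assuming the r2 and r3 objects estimated above), we can compute the gap between the two group-specific slopes directly; we will see this number again when we introduce interaction terms later in the module.

Code
# difference between the male-only and female-only GPA slopes
coef(r3)["hs_gpa"] - coef(r2)["hs_gpa"]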

Guidance Counselors

Next, let’s recreate some visualizations of these regression lines.

Code
plot(df$hs_gpa, df$sat_sum, las = 1, pch = 19,
     col = scales::alpha(ifelse(df$sex == 1, "tomato", "dodgerblue"), 0.4),
     ylab = "SAT Percentile Sum",
     xlab = "High School GPA")
abline(r2, col = "tomato", lwd = 3)
abline(r3, col = "dodgerblue", lwd = 3)
abline(r1, col = "black", lwd = 2, lty = 2)
legend("bottomright", bty = "n", lty = c(2, 1, 1),
       col = c("black", "tomato", "dodgerblue"),
       legend = c("All", "Female", "Male"), lwd = 2)
Plot

Splitting our sample like this is convenient because it allows for a lot of flexibility – each group gets its own parameters. However, remember: the more parameters you need to estimate, the more difficult inference becomes.1

If we wanted to reduce the number of parameters in our model (which is currently four: two intercepts and two slopes), we could simply not differentiate by gender. Then we would only have two parameters (an intercept and a slope) to estimate; in doing so, we would accept a bit of bias in exchange for a reduction in variance. Is there a compromise we can make, and maybe estimate just three parameters?

Guidance Counselors

The short answer is: yes! Let’s explore this a bit.

In the two-model situation, we have two intercepts and two slopes. In order to reduce the number of parameters, we would need to make an assumption about the setting we’re working with: we could have male and female students share either a slope or an intercept. If they share a slope, we would only have to estimate a single slope parameter plus two intercept parameters. Looking at the two regression lines above, they appear roughly parallel, so the shared-slope assumption seems reasonable.

We are going to estimate the following regression equation:

\[\text{SAT}_i = \beta_0 + \beta_1 \times \text{Female}_i + \beta_2 \times \text{GPA}_i + \epsilon_i\]

where \(\text{Female}_i\) is a binary variable equal to 1 if the observation is for a female student and 0 otherwise, and \(\text{GPA}_i\) is the student’s high school GPA.

Guidance Counselors

Before we estimate the model, let’s run through some of the algebra of this model. Suppose we are working with a male student, and they have a 3.0 GPA. What do we expect their SAT to be?

\[\text{SAT}_i = \widehat{\beta}_0 + (\widehat{\beta}_1 \times 0) + (\widehat{\beta}_2 \times 3) = \widehat{\beta}_0 + 3\widehat{\beta}_2\]

What about a female student with the same GPA?

\[\text{SAT}_i = \widehat{\beta}_0 + (\widehat{\beta}_1 \times 1) + (\widehat{\beta}_2 \times 3) = \widehat{\beta}_0 + \widehat{\beta}_1 + 3\widehat{\beta}_2\]

The only difference in expected SAT scores is \(\widehat{\beta}_1\), and this is independent of \(\text{GPA}\). In other words, we would interpret \(\beta_1\) as an intercept shifter/modifier, or the average difference between female and male scores.

Let’s estimate the model and plot the resulting lines.

Guidance Counselors

Code
# create binary indicator
df$female <- ifelse(df$sex == 1, 1, 0)
r4 <- lm(sat_sum ~ female + hs_gpa, df)
summary(r4)
Output

Call:
lm(formula = sat_sum ~ female + hs_gpa, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.642  -8.401  -0.083   8.540  34.198 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  59.9776     2.4687  24.295   <2e-16 ***
female        6.8978     0.7927   8.702   <2e-16 ***
hs_gpa       12.4561     0.7335  16.982   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.39 on 996 degrees of freedom
Multiple R-squared:  0.248, Adjusted R-squared:  0.2465 
F-statistic: 164.2 on 2 and 996 DF,  p-value: < 2.2e-16
Coefficient Interpretations
  1. The intercept is the expected sum of SAT percentiles when hs_gpa and female are both equal to zero. In other words, it is the expected sum of SAT percentiles for a male student with a GPA of zero.
  2. Keeping hs_gpa at zero, the expected sum of SAT percentiles for female students is 59.98 + 6.9 = 66.88 (the quick check after this list confirms this arithmetic).
  3. Increasing hs_gpa has the same effect for both groups, given our initial assumption: a one unit increase in hs_gpa leads to an increase of 12.46 in the sum of SAT percentiles.
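
As a quick check of these interpretations, we can ask R for fitted values directly. This is a minimal sketch assuming r4 and df from above; new_students is just an illustrative name.

Code
# hypothetical students: one male and one female, both with a 3.0 GPA
new_students <- data.frame(female = c(0, 1), hs_gpa = c(3, 3))
predict(r4, newdata = new_students)
# the two predictions differ by exactly the female coefficient,
# no matter which GPA we plug in
diff(predict(r4, newdata = new_students))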

Guidance Counselors

Visualizing these results on a set of axes:

Code
plot(df$hs_gpa, df$sat_sum, las = 1, pch = 19,
     col = scales::alpha(ifelse(df$sex == 1, "tomato", "dodgerblue"), 0.4),
     ylab = "SAT Percentile Sum",
     xlab = "High School GPA")
abline(a = coef(r4)[1], b = coef(r4)[3], col = "dodgerblue", lwd = 3)
abline(a = sum(coef(r4)[1:2]), b = coef(r4)[3], col = "tomato", lwd = 3)
legend("bottomright", bty = "n", lty = c(1, 1),
       col = c("tomato", "dodgerblue"),
       legend = c("Female", "Male"), lwd = 2)
Plot

Note how these lines are completely parallel. This is a result of our assumption that the two groups share the same slope, and in this case it does not seem to be a bad assumption. When we include a binary variable like this in a regression, we call it a dummy variable.
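
As an aside, we built the female dummy by hand with ifelse(), but R will create dummy variables automatically whenever you put a factor into lm(). A minimal sketch, assuming df and r4 from above (sex_factor is just an illustrative name); the coefficient on the automatically created dummy should match the female coefficient in r4.

Code
# recode sex as a labeled factor; the first level ("Male") becomes the reference group
df$sex_factor <- factor(df$sex, levels = c(2, 1), labels = c("Male", "Female"))
r4_factor <- lm(sat_sum ~ sex_factor + hs_gpa, df)
coef(r4_factor)  # "sex_factorFemale" plays the same role as our female dummy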

Now, what if we wanted to allow the slopes to differ but not the intercept? Well, we can do this too!

Guidance Counselors

To allow for differing slopes by group, we can specify the following functional form:

\[\text{SAT}_i = \beta_0 + (\beta_1 \times \text{GPA}_i) + (\beta_2 \times \text{GPA}_i \times \text{Female}_i) + \epsilon_i\]

If a male student had a GPA of 0, we would expect their SAT to be:

\[\text{SAT}_i = \widehat{\beta}_0 + (\widehat{\beta}_1 \times 0) + (\widehat{\beta}_2 \times 0 \times 0) = \widehat{\beta}_0\]

Similarly, if a female student had a GPA of zero, their expected SAT would be:

\[\text{SAT}_i = \widehat{\beta}_0 + (\widehat{\beta}_1 \times 0) + (\widehat{\beta}_2 \times 0 \times 1) = \widehat{\beta}_0\]

So, in other words, the intercepts would be exactly the same, which matches our assumption.

Guidance Counselors

What if these two students had 3.0 GPAs? How would their expected SAT scores change?

For the male student:

\[\text{SAT}_i = \widehat{\beta}_0 + (\widehat{\beta}_1 \times 3) + (\widehat{\beta}_2 \times 3 \times 0) = \widehat{\beta}_0 + 3\widehat{\beta}_1\]

For the female student:

\[\begin{aligned}\text{SAT}_i = \widehat{\beta}_0 + (\widehat{\beta}_1 \times 3) + (\widehat{\beta}_2 \times 3 \times 1) &= \widehat{\beta}_0 + 3\widehat{\beta}_1 + 3\widehat{\beta}_2 \\&= \widehat{\beta}_0 + 3(\widehat{\beta}_1 + \widehat{\beta}_2)\end{aligned}\]

This makes the difference between the two groups equal to \(3\widehat{\beta}_2\), or, more generally, \(\text{GPA} \times \widehat{\beta}_2\). If \(\widehat{\beta}_2\) is positive, female students pull further ahead of male students as \(\text{GPA}\) increases; if it is negative, they fall further behind.

Guidance Counselors

To estimate this regression in R, we will write the equation as follows: sat_sum ~ hs_gpa + hs_gpa:female. The colon (:) tells R to multiply the two variables in the model like we have written above.

Code
r5 <- lm(sat_sum ~ hs_gpa + hs_gpa:female, df)
summary(r5)

plot(df$hs_gpa, df$sat_sum, las = 1, pch = 19,
     col = scales::alpha(ifelse(df$sex == 1, "tomato", "dodgerblue"), 0.4),
     ylab = "SAT Percentile Sum",
     xlab = "High School GPA")
abline(a = coef(r5)[1], b = coef(r5)[2], col = "dodgerblue", lwd = 3)
abline(a = coef(r5)[1], b = sum(coef(r5)[2:3]), col = "tomato", lwd = 3)
legend("bottomright", bty = "n", lty = c(1, 1),
       col = c("tomato", "dodgerblue"),
       legend = c("Female", "Male"), lwd = 2)
Output

Call:
lm(formula = sat_sum ~ hs_gpa + hs_gpa:female, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-43.884  -8.284  -0.255   8.600  34.832 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    63.9518     2.3804  26.866  < 2e-16 ***
hs_gpa         11.3040     0.7288  15.511  < 2e-16 ***
hs_gpa:female   2.0289     0.2446   8.294 3.54e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.43 on 996 degrees of freedom
Multiple R-squared:  0.2431,    Adjusted R-squared:  0.2415 
F-statistic: 159.9 on 2 and 996 DF,  p-value: < 2.2e-16
Plot

Here, the two lines certainly diverge as hs_gpa increases, which matches up with the positive regression coefficient (2.03). To interpret this coefficient: for each one-point increase in GPA, female students’ SAT percentile sums grow by about two points more than male students’.

The coefficient for the product (multiplication) of two variables in a regression is called an interaction term.
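
If the colon syntax feels like a black box, you can always construct the interaction by hand as a product column and confirm you get the same fit. A sketch, assuming df, r5, and the female dummy from above (gpa_x_female and r5_manual are illustrative names):

Code
# build the product variable manually and re-estimate
df$gpa_x_female <- df$hs_gpa * df$female
r5_manual <- lm(sat_sum ~ hs_gpa + gpa_x_female, df)
cbind(colon = coef(r5), manual = coef(r5_manual))  # identical estimates

# implied female-minus-male gap at a few GPA levels: GPA times the interaction coefficient
coef(r5)["hs_gpa:female"] * c(2, 3, 4)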

Guidance Counselors

Econometricians like to use dummy variables and interaction terms because they allow us to consider the setting’s inherent heterogeneity in our models. However, another convenient thing about them is that they allow us to estimate differences.

For example, what if we wanted to know: do female students perform better on SAT exams than male students? Before this module, we might collect SAT scores for female and male students and then do a difference-in-means hypothesis test (from Module 2). If we found a statistically significant difference, however, we could not be sure whether the difference was really due to gender or simply due to the female students in the sample being stronger students overall. By using regression, we can partial out the effect of intelligence (via GPA) and then re-calculate the difference in means. In fact, our female dummy variable from r4 is exactly that: a difference-in-means test, holding GPA constant. The difference we found was 6.9, and it was statistically significant even after controlling for hs_gpa.
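
To make this connection concrete, we can compare the raw (unadjusted) difference in means to the GPA-adjusted one. A sketch, assuming df and r4 from above (r_raw is an illustrative name); a dummy-only regression and a two-sample t-test recover the same unadjusted gap.

Code
# unadjusted female-male gap: a regression on the dummy alone
r_raw <- lm(sat_sum ~ female, df)
coef(r_raw)["female"]

# the classic difference-in-means test; note it reports
# mean(group 0) - mean(group 1), the negative of the coefficient above
t.test(sat_sum ~ female, data = df, var.equal = TRUE)

# the gap after partialling out high school GPA
coef(r4)["female"]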

Guidance Counselors

Now that we’ve estimated these two models with three parameters each, we can combine them into one giant model. This model will look like the following:

\[\begin{aligned}\text{SAT}_i = \ &\beta_0 + (\beta_1 \times \text{Female}_i) + (\beta_2 \times \text{GPA}_i) \\ &+ (\beta_3 \times \text{GPA}_i \times \text{Female}_i) + \epsilon_i\end{aligned}\]

Coefficient Interpretations
  1. \(\beta_0\) is the expected SAT for male students with a GPA of 0.
  2. \(\beta_1\) is the difference in the expected SAT for female students relative to male students with a GPA of 0.
  3. \(\beta_2\) is the change in expected SAT for male students if they increase their GPA by 1.0.
  4. \(\beta_3\) is the difference in the change in expected SAT for female students relative to male students when they increase their GPA by 1.0.

Guidance Counselors

Let’s take a look at the parameter estimates and compare them to the previous gender-specific models:

Code
r6 <- lm(sat_sum ~ female + hs_gpa + hs_gpa:female, df)
# As a note, we could have used "female*hs_gpa" in the model.
#   "*" is a short-hand way to include both variables
#   by themselves in addition to their interaction.
#   However, when learning, I think it's best to write out
#   the full equation.
regz <- list(`Female` = r2,
             `Male` = r3,
             `All Students` = r6)
coefz <- c("hs_gpa" = "HS GPA",
           "female" = "Female",
           "female:hs_gpa" = "HS GPA x Female",
           "(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
             title = "Effect of GPA on SAT Scores",
             estimate = "{estimate}{stars}",
             coef_map = coefz,
             gof_map = gofz)
Effect of GPA on SAT Scores

                     Female       Male         All Students
HS GPA               11.392***    13.715***    13.715***
                     (0.999)      (1.079)      (1.083)
Female                                         14.345**
                                               (4.782)
HS GPA x Female                                −2.323
                                               (1.471)
Constant             70.196***    55.851***    55.851***
                     (3.167)      (3.579)      (3.593)
Num.Obs.             515          484          999
R2                   0.202        0.251        0.250

Guidance Counselors

Let’s break down the model’s output:

  • First, notice how the coefficients for “HS GPA” and “Constant” in the third column match the coefficients in the second column. This is because those two coefficients in the third column are, in essence, male-specific. In effect, we have matched the estimates from the male-only model!
  • You should also note that even though the coefficients match, the standard errors are slightly larger in the third column. This is due to having to estimate additional parameters.
  • If we replicated the male-specific model, what about the female-specific model? Well, if we add the constant and the female dummy, we get 70.196, which is the constant in the female-specific regression. This is similar for HS GPA and HS GPA x Female – the sum of these two equals the HS GPA coefficient in the first column. (The sketch after this list verifies this arithmetic.)
  • The benefit of estimating all of these parameters in a single model is that now we can look to see if the differences between coefficients in the models are statistically different. For example, the difference in constants across the models is indeed statistically different according to the stars next to the Female parameter in column 3. We also find that the two slopes across the models are not statistically different.
  • The results of this model suggest that female students do perform better on the SAT than their male counterparts. However, the difference between male and female students is relatively constant at all levels of intelligence (i.e. GPA) given the insignificant interaction term.
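
We can verify this bookkeeping directly in R. A minimal sketch, assuming r2, r3, and r6 from above, rebuilding the group-specific intercepts and slopes from the pooled model:

Code
b <- coef(r6)
# the male (reference group) line comes straight from the pooled coefficients
c(male_intercept = unname(b["(Intercept)"]), male_slope = unname(b["hs_gpa"]))
coef(r3)  # male-only regression

# the female line adds the dummy and the interaction term
c(female_intercept = unname(b["(Intercept)"] + b["female"]),
  female_slope     = unname(b["hs_gpa"] + b["female:hs_gpa"]))
coef(r2)  # female-only regression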

Guidance Counselors

Initially, we made an arbitrary choice by selecting males to be the “reference” group in our setting. We could have made the exact opposite decision and created a dummy variable for males. We can re-run these models by using a male dummy:

Code
df$male <- ifelse(df$sex == 2, 1, 0)
r7 <- lm(sat_sum ~ male*hs_gpa, df)
regz <- list(`Female` = r2,
             `Male` = r3,
             `All Students (M ref)` = r6,
             `All Students (F ref)` = r7)
coefz <- c("hs_gpa" = "HS GPA",
           "female" = "Sex",
           "female:hs_gpa" = "HS GPA x Sex",
           "male" = "Sex",
           "male:hs_gpa" = "HS GPA x Sex",
           "(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
             title = "Effect of GPA on SAT Scores",
             estimate = "{estimate}{stars}",
             coef_map = coefz,
             gof_map = gofz)
Effect of GPA on SAT Scores

                     Female       Male         All Students (M ref)   All Students (F ref)
HS GPA               11.392***    13.715***    13.715***              11.392***
                     (0.999)      (1.079)      (1.083)                (0.996)
Sex                                            14.345**               −14.345**
                                               (4.782)                (4.782)
HS GPA x Sex                                   −2.323                 2.323
                                               (1.471)                (1.471)
Constant             70.196***    55.851***    55.851***              70.196***
                     (3.167)      (3.579)      (3.593)                (3.155)
Num.Obs.             515          484          999                    999
R2                   0.202        0.251        0.250                  0.250

Guidance Counselors

Numerically, it doesn’t matter which group is the reference group. However, it does change the interpretation of your coefficients, so the choice should be considered carefully.
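
One way to convince yourself that the choice is only cosmetic: the two parameterizations produce identical fitted values. A quick check, assuming r6 and r7 from above:

Code
# same fitted values (and therefore the same residuals and R-squared),
# just different labels on the coefficients
all.equal(unname(fitted(r6)), unname(fitted(r7)))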

As a final note, it is important to mention that you cannot include both the Female and Male dummies in the same regression. The two variables are perfectly collinear with the constant (Female + Male = 1 for every observation), so the three coefficients cannot all be separately estimated; this is sometimes called the dummy variable trap.
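
You can see this for yourself in the sketch below (assuming df with both dummies defined above; r_trap is an illustrative name): R detects the perfect collinearity and reports one of the coefficients as NA, effectively dropping it.

Code
# female + male equals 1 for every student, the same as the constant term,
# so the three coefficients cannot all be identified
r_trap <- lm(sat_sum ~ female + male + hs_gpa, df)
coef(r_trap)  # one of the dummies comes back as NA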

Linear Probability Model

So far in this module, we have learned about binary explanatory variables. However, what if your outcome variable is binary? We can still use OLS for this!

Let’s consider some data on police treatment of individuals arrested in Toronto for simple possession of small quantities of marijuana (data, documentation). In these data, we can observe whether the arrestee was released, their race, age, sex, employment status, citizen status and previous police interactions.

Let’s build a model to explain whether the arrestee was released given some of these controls.

Code
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/carData/Arrests.csv")
df$released <- ifelse(df$released == "Yes", 1, 0)
df$black <- ifelse(df$colour == "Black", 1, 0)
df$female <- ifelse(df$sex == "Female", 1, 0)
df$citizen <- ifelse(df$citizen == "Yes", 1, 0)
df$employed <- ifelse(df$employed == "Yes", 1, 0)
summary(r_lpm <- lm(released ~ female + citizen + employed + age + black + checks, df))
Output

Call:
lm(formula = released ~ female + citizen + employed + age + black + 
    checks, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.98272  0.03631  0.09415  0.19252  0.60548 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.7405159  0.0238808  31.009  < 2e-16 ***
female      -0.0034265  0.0179731  -0.191    0.849    
citizen      0.0892898  0.0143901   6.205 5.89e-10 ***
employed     0.1236378  0.0125926   9.818  < 2e-16 ***
age          0.0004879  0.0006049   0.807    0.420    
black       -0.0573499  0.0119889  -4.784 1.77e-06 ***
checks      -0.0499783  0.0033967 -14.714  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3581 on 5219 degrees of freedom
Multiple R-squared:  0.09551,   Adjusted R-squared:  0.09447 
F-statistic: 91.85 on 6 and 5219 DF,  p-value: < 2.2e-16
Coefficient Interpretations
  1. The coefficient for female suggests that female arrestees are about 0.3 percentage points less likely than male arrestees to be released following an arrest for possession of a small amount of marijuana. However, this coefficient is not statistically different from zero, so we do not have evidence that this difference is meaningful.
  2. The estimate for citizen suggests that, relative to non-citizens, citizens are about 8.9 percentage points more likely to be released.
  3. Similarly, those who are employed, relative to those who are unemployed, are about 12.4 percentage points more likely to be released.
  4. Each additional year of age is associated with an (insignificant) 0.05 percentage point increase in the probability of being released.
  5. Individuals who are black, controlling for the other factors, are about 5.7 percentage points less likely to be released following an arrest.
  6. Each additional unit of checks reduces the probability of being released by about 5 percentage points.
  7. The intercept suggests a baseline probability of being released of 74.1%. Note that this applies when all other variables are equal to zero: the predicted probability of release for an unemployed, white, male non-citizen with zero checks and an age of zero is 74.1%. (The sketch after this list shows how to compute predicted probabilities for more realistic profiles.)
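
Because this is a linear probability model, a predicted probability is just a linear combination of the coefficients. A minimal sketch, assuming r_lpm and df from above (profiles is an illustrative name), computing predictions for two made-up arrestees; note that nothing in the LPM forces predictions to stay between 0 and 1, which is a well-known limitation of this approach.

Code
# two hypothetical arrestees: an employed 25-year-old white male citizen with no checks,
# and an unemployed 25-year-old black male non-citizen with four checks
profiles <- data.frame(female = c(0, 0), citizen = c(1, 0),
                       employed = c(1, 0), age = c(25, 25),
                       black = c(0, 1), checks = c(0, 4))
predict(r_lpm, newdata = profiles)
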
Are these results causal?

An interesting, albeit unsurprising, result of this regression is that black Canadians are less likely to be released than white Canadians. Can we interpret this coefficient as causal? Recall the four reasons why we might find a correlation:

  1. The causal interpretation: race causes release. To settle on this one, we need to rule out the other three.
  2. The reverse causality interpretation: release causes race. This one is clearly implausible, and can be promptly ruled out.
  3. The third variable interpretation: is there some third variable that is causing both race and release? Unfortunately, given this setting/regression, we cannot rule out there being some third factor that is correlated with race and whether an individual is released following an arrest.
  4. The random chance interpretation: we found this correlation by “accident”. This is unlikely given the number of observations we have and our statistical tests. We can probably rule this out.

If we could figure out a clever way (e.g., a Randomized Controlled Trial) to rule out all possible third variables, we could make causal claims. Unfortunately, we will not cover that in this course. However, if you want to stick around and take Econ 400 in the Spring, we’ll get to things like this!