Many X Variables

Module 4.5: Fixed Effects

Author
Affiliation

Alex Cardazzi

Old Dominion University

All materials can be found at alexcardazzi.github.io.

Fixed Effects

One of the most common uses for dummy variables is for fixed effects. A common use case for fixed effects is when you are working with a time-by-unit panel. For example: a state-by-year panel.

By including state dummies, or fixed effects, we are removing / controlling for any and all constant (time-invariant) factors. As an example, if the outcome variable in our state-by-year panel is housing prices, we can reasonably expect California to have high prices and West Virginia to have low prices. Why? For a variety of reasons: weather and other local amenities, economic factors, etc. To make the states more comparable to one another, we can remove the effect of the idiosyncratic differences intrinsic to each state.

We can also include year fixed effects, which control for any and all factors that impact each state at the same time in the same way. For example, the 2007 financial crisis reduced housing prices across the country. When we include dummies for each year (except one, the reference group), we can remove the influence of these events on our outcome.

When using fixed effects, we write our equations a bit differently. If we have a state-by-year panel, you will see equations written like the following:

\[Y_{st} = \beta_s + \beta_t + \beta_1 X_{st} + \beta_2 Z_{st} + \epsilon_{st}\]

where \(\beta_s\) represents state (\(s\)) fixed effects and \(\beta_t\) represents time (\(t\)) fixed effects. Notice how we do not include \(\beta_0\) (the intercept). This is because the fixed effects take the place of the intercept in practice.

Income and Housing

To demonstrate fixed effects, we are going to compare per-capita personal income and housing price indices for different counties/cities in Hampton Roads over time. Let’s take a look at the data files we have.

First, we have the FIPS codes for each of the areas of interest. FIPS codes are unique identifies for counties/cities in the United States. The first two digits denote the state and the final three numbers denote the county or city within that state.

Code
hr <- read.csv("https://alexcardazzi.github.io/econ311/data/hr_fips.csv")
hr # hr is "hampton roads"
Output
                                    name  fips
1                      Gloucester County 51073
2                   Isle of Wight County 51093
3  James City County + Williamsburg City 51931
4                         Mathews County 51115
5                            York County 51199
6                        Chesapeake City 51550
7                           Hampton City 51650
8                      Newport News City 51700
9                           Norfolk City 51710
10                       Portsmouth City 51740
11                          Suffolk City 51800
12                   Virginia Beach City 51810

The other two datasets we are going to use contain per-capita personal income and house price index values. Both contain observations by FIPS code and year.

Code
pi <- read.csv("https://alexcardazzi.github.io/econ311/data/pcpi.csv")
head(pi, 12) # pi is "personal income"
Output
                    series_id value  fips year
1  Per Capita Personal Income  3696 51073 1969
2  Per Capita Personal Income  3916 51073 1970
3  Per Capita Personal Income  4278 51073 1971
4  Per Capita Personal Income  4812 51073 1972
5  Per Capita Personal Income  5115 51073 1973
6  Per Capita Personal Income  5329 51073 1974
7  Per Capita Personal Income  5815 51073 1975
8  Per Capita Personal Income  6493 51073 1976
9  Per Capita Personal Income  6986 51073 1977
10 Per Capita Personal Income  7819 51073 1978
11 Per Capita Personal Income  8297 51073 1979
12 Per Capita Personal Income  9478 51073 1980
Code
hp <- read.csv("https://alexcardazzi.github.io/econ311/data/athpi.csv")
head(hp, 12) # hp is "housing price"
Output
                            series_id  value  fips year
1  All-Transactions House Price Index  83.51 51073 1993
2  All-Transactions House Price Index  83.21 51073 1994
3  All-Transactions House Price Index  85.19 51073 1995
4  All-Transactions House Price Index  86.72 51073 1996
5  All-Transactions House Price Index  88.11 51073 1997
6  All-Transactions House Price Index  92.83 51073 1998
7  All-Transactions House Price Index  95.75 51073 1999
8  All-Transactions House Price Index 100.00 51073 2000
9  All-Transactions House Price Index 104.24 51073 2001
10 All-Transactions House Price Index 109.65 51073 2002
11 All-Transactions House Price Index 115.22 51073 2003
12 All-Transactions House Price Index 131.49 51073 2004

We will need to “merge” these three data files in order to run our regression. In other words, we need to pull information from one file and place it in another file.

Merging Datasets

To begin merging, we should check out which data set has the most “coverage”. For example, the pi dataset contains the range of years {1969, 2021} whereas hp contains {1975, 2022}. This can also be seen graphically:

Code
par(mfrow = c(1, 2))
plot(table(pi$year), las = 1,
     main = "Personal Income",
     xlab = "Year", ylab = "Frequency")
plot(table(hp$year), las = 1,
     main = "Housing Price Index",
     xlab = "Year", ylab = "Frequency")
par(mfrow = c(1, 1))
Plot

These figures suggest we should merge into pi, or else we would potentially be dropping data. It’s always a good idea to have too much than too little.

Next, we are going to merge in the names from hr into pi so we can have a better way to decipher which FIPS code is which county/city. To do this, we are going to match the entries in fips in pi to the entries in fips in hr. We are going to use match() to help us with this.

match() accepts two arguments. The first is the values to be matched, and the second is the values to match against. The resulting vector tells us the location of each element of the first vector (pi$fips) in the second vector (hr$fips).

As a simple example, consider the following. I have a short vector of colleges/universities (v1). We have a second data source (ex) that contains the state these institutions are located. Use the match function below to pull the state for each institution in the first vector from the second data source. The result should be "VA" "WVU" "NJ".

Solution
Code
v1 <- c("ODU", "WVU", "RCNJ")
ex <- data.frame(name = c("RCNJ", "RU", "Union", "ODU", "WVU", "WCU"),
                 state = c("NJ", "NJ", "NY", "VA", "WV", "NC"))
the_match <- match(v1, ex$name)
the_match; cat("\n")
ex$state[the_match]
Output
[1] 4 5 1

[1] "VA" "WV" "NJ"

Merging by One Variable

Now let’s use the same code to pull the FIPS names from hr.

Code
m <- match(pi$fips, hr$fips)
pi$name <- hr$name[m]
View(pi[,c("value", "fips", "year", "name")])

It worked! We have successfully pulled over the county name for each FIPS code.

Next, we need to pull in the housing price index data from hp. We can try a similar process:

Code
m <- match(pi$fips, hp$fips)
pi$hp <- hp$value[m]
View(pi[,c("value", "fips", "year", "hp")])

While we were able to pull a number from the dataset, the code did not work as we intended. There should be variation in hp across years. When using match(), R will always return the first instance of a match, but that is not necessarily what we want. Since we did not specify anything about years, R pulled the first value for each FIPS which is not what we’re looking for.

Merging by Two Variables

We need to create a variable that contains information about FIPS and year. To do this, we are going to use paste(). This function takes two or more variables and “pastes” them together into one. As an example: paste(c("ODU", "WVU"), c("Monarchs", "Mountaineers")): ODU Monarchs, WVU Mountaineers.

Code
m <- match(paste(pi$fips, pi$year),
           paste(hp$fips, hp$year))
pi$hp <- hp$value[m]
View(pi[,c("value", "fips", "year", "hp")])

You might notice that there is quite a bit of missing data in the hp column of pi. That is because, if you recall, we have more personal income data than housing price index data. Now that we have the two variables lined up in the same data frame, let’s explore their relationship a bit. First, let’s plot the two variables in a scatterplot:

Code
pi$value000 <- pi$value/1000
plot(pi$value000, pi$hp, las = 1,
     xlab = "Personal Income (in Thousands)", pch = 19,
     ylab = "Housing Price Index (100 is 2000 $'s)",
     col = scales::alpha("mediumseagreen", 0.4))
Plot

Estimating Fixed Effects

There appears to be a pretty strong, positive relationship between these variables. Now let’s take a look at what some regressions tell us about the strength of the relationships. First, we are going to estimate a model where housing price index is the outcome and personal income is the explanatory variable. Next, we are going to include FIPS and year fixed effects.

Code
coef(lm(hp ~ value000, pi))[2]; cat("\n")
coef(lm(hp ~ value000 + as.factor(fips) + as.factor(year), pi))[2]
Output
value000 
3.698684 

 value000 
0.4945878 

Why are these two coefficients so different?

We have already investigated what happens when we include control variables into our regressions: their effects get removed from both the outcome variable and the other explanatory variables. In this case, it is likely that time is an important third variable that drives both housing prices and personal income. Let’s revisit the coefficient bias table. Since time likely has a positive effect on both housing prices and personal income, the bias is likely positive. In other words, by not accounting for time, the coefficient magnitudes are being artificially pushed in positive direction.

FWL Theorem

Let’s take a look at what happens when we partial out the effect of our “fixed effects”.

We are going to use a trick where we run two regressions with only these fixed effects. One of these regressions will have housing price as the outcome variable and the other regression will have personal income as the outcome. After each estimation, we are going to collect the residuals, which is the variation left over after having removed the influence of the fixed effects. Then, we are going to plot the two sets of residuals against each other and run a third regression between them.

Code
# First two regressions:
r_hp <- lm(hp ~ as.factor(fips) + as.factor(year),
           pi, subset = !is.na(pi$hp))
r_pi <- lm(value000 ~ as.factor(fips) + as.factor(year),
           pi, subset = !is.na(pi$hp))

plot(r_pi$residuals, r_hp$residuals, las = 1)

# Third regression, using only the residuals:
r_resid <- lm(r_hp$residuals ~ r_pi$residuals)
coef(r_resid)[2]
Output
r_pi$residuals 
     0.4945878 
Plot

The coefficient from the fixed effects model matches the coefficient from this funky residual model! Taking residuals from two equations like this is called the Frisch–Waugh–Lovell theorem.

The point you should take away is that often times there are unobserved idiosyncratic factors that need to be controlled for. As an example, our coefficient estimate shrunk by an order of magnitude, or almost 10x, after we included fixed effects.

Reporting Coefficients

When reporting regression models, people are usually uninterested in fixed effects. Rather, people like to include rows in their tables that indicate which fixed effects are included in which models. As an example, let’s consider the housing price and personal income regressions.

Code
library("modelsummary")
options("modelsummary_factory_default" = "kableExtra")

fe1 <- lm(hp ~ value000, pi)
fe2 <- lm(hp ~ value000 + as.factor(fips) + as.factor(year), pi)

regz <- list(`Housing Price Index` = fe1,
             `Housing Price Index` = fe2)
coefz <- c("value000" = "Personal Income (in Thousands)")
gofz <- c("nobs", "r.squared")

rowz <- as.data.frame(matrix(c('Fixed Effects', "No", "Yes",
                               "Something Else", "A", "B"),
                             byrow = T, nrow = 2)) # 2 rows
attr(rowz, 'position') <- c(3, 4) # rows 3 and 4
# since rows 1 and 2 are coefficient and standard error

modelsummary(regz,
             title = "Effect of Personal Income on Housing Price Index",
             estimate = "{estimate}{stars}",
             coef_map = coefz,
             gof_map = gofz,
             add_rows = rowz)
Effect of Personal Income on Housing Price Index
 Housing Price Index  Housing Price Index
Personal Income (in Thousands) 3.699*** 0.495***
(0.077) (0.130)
Fixed Effects No Yes
Something Else A B
Num.Obs. 391 391
R2 0.856 0.991