Difference-in-Differences is one of the most widely used research designs in the social sciences. This is likely due to how (relatively) easy it is to set up and to understand. While economists tend to prefer designs like Regression Discontinuity (Module 7) in terms of credibility, DiD remains a much more popular method. There have been many econometric innovations in the past decade or so, but none as big as the DiD one. Certainly, no other method has had a “revolution” in the way DiD has.
In these notes, we are going to discuss the innovations that have come to light since ~2020.
When treatment timing is staggered (for example: policies are rolled out in different places at different times), and treatment effects are heterogeneous (for example: different across time and/or units), the DiD coefficient can fail to represent what we think it should represent. In short, this is due to improper comparisons of groups at different points in time. With staggered timing, TWFE implicitly compares:
Newly treated units to never treated units (good!)
Newly treated units to not-yet treated units (good!)
Newly treated units to already treated units (bad!)
This last group, where newly treated units are compared to already treated units, can result in negative estimates of positive treatment effects and vice versa. This issue puzzled social scientists and stalled research across the field. For a few years, economists everywhere were scrambling to figure out what was wrong with two-way fixed effects models, who Andrew Goodman-Bacon was, and how to do his difference-in-differences thing. This was such a FAQ that in late 2019, Goodman-Bacon wrote a paper titled So You’ve Been Told to Do My Difference-in-Differences Thing: A Guide.
Let’s spend some time motivating and understanding the problem with TWFE before moving on to some solutions to the problem.
TWFE
Consider the following example that we will simulate data for. We have three groups over ten time periods. The first group is never treated throughout the sample. The second and third groups are treated at \(t = 4.5\) and \(t = 7.5\), but we only observe the units at \(t \in \{1, 2, ..., 10\}\). Put more simply: at \(t = 4\), the second group is 100% untreated and at \(t = 5\), the group is 100% treated. The treatment is either going to change the trajectory of growth for each group, or simply be a static level shift, depending on the arguments you choose for the simulation. You should experiment with both, however.
Below are the simulation functions that create the data and then plot the data. I am providing the code so you can see what it looks like under the hood.2
Code
create_data <- function(dynamic = FALSE, rnd = FALSE){
  set.seed(757)
  df <- data.frame(id = rep(1:3, times = 10),
                   tt = rep(1:10, each = 3))
  df$col <- ifelse(df$id == 1, "black",
                   ifelse(df$id == 2, "tomato", "dodgerblue"))
  # Setting the binary treatment variable
  df$D <- ifelse((df$id == 2 & df$tt >= 5) | (df$id == 3 & df$tt >= 8), 1, 0)
  df$btrue <- ifelse(df$D == 1 & df$id == 3, 5,
                     ifelse(df$D == 1 & df$id == 2, 3, 0))
  # Decide if treatment is dynamic or static.
  if(dynamic){
    df$rel_time <- df$tt - c(NA, 5, 8)
    df$rel_time <- ifelse(is.na(df$rel_time), -100, df$rel_time)
  } else {
    df$rel_time <- 1
  }
  # Setting the "true" treatment effect for each time period
  # This will be nice for later comparison
  df$treatment_effect <- df$btrue*df$D*df$rel_time
  # Create outcome variable
  df$y <- df$id + df$tt + df$treatment_effect +
    if(rnd) rnorm(nrow(df), 0, 0.01) else 0
  return(df)
}

plot_data <- function(df){
  plot(df$tt, df$y, col = df$col, pch = 19,
       xlab = "Time", ylab = "Y", las = 1)
  for(i in 1:3){
    lines(df$tt[df$id == i], df$y[df$id == i],
          col = df$col[df$id == i])
  }
  abline(v = c(4.5, 7.5), lty = 2)
}
Compare the static and dynamic data by switching between TRUE and FALSE below:
Since we simulated these data, and we know the true treatment effect for each observation (via treatment_effect), we should be able to average them across treated units. Of course, since the treatments are inherently different for groups 2 and 3, there are many ways we can summarize these treatment effects. For example, since the two treatment effects are 3 and 5 in the static case, we can use the simple average (4) for our treatment effect. We could also weight 3 and 5 by how many time periods they’ve been observed for (6 and 3, respectively). This would give us an average of 3.67. We could also find an average treatment effect by time-since-treatment, like in an event study.
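For concreteness, the two static summaries described above are just a simple and a weighted average (a base-R sketch of the arithmetic):

```r
# Two ways to summarize the static effects: 3 (group 2, observed treated
# for 6 periods) and 5 (group 3, observed treated for 3 periods).
simple_avg <- mean(c(3, 5))                    # (3 + 5) / 2 = 4
time_wtd   <- weighted.mean(c(3, 5), c(6, 3))  # (3*6 + 5*3) / 9 = 3.67
c(simple_avg, time_wtd)
```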
Let’s think about what a regression would give us. Suppose we have a TWFE model of the following form3:
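Written out explicitly, with the fixed-effect symbols used later in these notes (unit effects \(\alpha_i\), time effects \(\lambda_t\)), the standard TWFE specification is:

\[Y_{it} = \alpha_i + \lambda_t + \delta^{DD} D_{it} + \epsilon_{it}\]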
Intuitively, OLS should return an average of all the treatment effects in the data as \(\delta^{DD}\). This would be consistent with a time-weighted treatment effect of 3.67 in the static case (or 6.67 in the dynamic case).
Let’s compare our time-weighted treatment effect with what TWFE returns.
Spoiler: our coefficient is (seemingly) too large in the static case and too small in the dynamic case.4
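To see the static case concretely, here is a base-R sketch that rebuilds the static data inline (so it runs standalone, without the create_data() helper) and estimates the TWFE model with explicit dummies:

```r
# Static simulation: y = id + tt, plus an effect of 3 for group 2
# (treated from t = 5) and 5 for group 3 (treated from t = 8).
df <- data.frame(id = rep(1:3, times = 10), tt = rep(1:10, each = 3))
df$D <- ifelse((df$id == 2 & df$tt >= 5) | (df$id == 3 & df$tt >= 8), 1, 0)
df$y <- df$id + df$tt + ifelse(df$id == 2, 3, ifelse(df$id == 3, 5, 0)) * df$D

# TWFE via explicit dummies (equivalent to feols(y ~ D | id + tt, df))
twfe <- lm(y ~ D + factor(id) + factor(tt), data = df)
unname(coef(twfe)["D"])  # ~3.91: larger than the time-weighted 3.67
```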
Goodman-Bacon
Wait, so what gives? Well, it’s complicated and we (they: the econometricians) just figured this all out.
As simple as this example is, this was a non-trivial observation to econometricians. Andrew Goodman-Bacon, the most recent scapegoat for grumbling economists, demonstrated in an immediately seminal work that the DiD coefficient in a TWFE framework (\(\delta^{DD}\)) is a (sometimes strangely) weighted average of all possible 2x2 DiD comparisons.
What does “all possible” 2x2 DiD comparisons mean?
In this three-group case with one untreated group, there are four possible 2x2 DiD comparisons.
The first two comparisons are the two treated groups each separately compared to the never-treated group.
The third comparison is the earlier treated group compared to the later treated group while the later treated group remains untreated.
The final comparison is the later treated group compared to the earlier treated group while the earlier treated group is always treated.
It’s easier to visualize this as the following:
Code
df <- create_data(TRUE)
par(mfrow = c(2, 2), mar = c(2.1, 2.1, 2.1, 0.6))

## Comp. 1 -- Early vs Never
lim <- df$id %in% c(1, 2)
r1 <- fixest::feols(y ~ D | id + tt, df[lim,])
plot(df$tt[lim], df$y[lim], pch = 19,
     ylim = range(df$y), xlim = range(df$tt),
     xlab = "", ylab = "", las = 1,
     main = "I. Early vs Never",
     col = df$col[lim])
lines(df$tt[df$id == 1], df$y[df$id == 1], col = df$col[df$id == 1], lty = 2)
lines(df$tt[df$id == 2], df$y[df$id == 2], col = df$col[df$id == 2])
abline(v = c(4.5, 7.5), lty = 2)
legend("topleft", legend = c("Treatment", "Control"), lty = 1:2, bty = "n")

## Comp. 2 -- Late vs Never
lim <- df$id %in% c(1, 3)
r2 <- fixest::feols(y ~ D | id + tt, df[lim,])
plot(df$tt[lim], df$y[lim], pch = 19,
     ylim = range(df$y), xlim = range(df$tt),
     xlab = "", ylab = "", las = 1,
     main = "II. Late vs Never",
     col = df$col[lim])
lines(df$tt[df$id == 1], df$y[df$id == 1], col = df$col[df$id == 1], lty = 2)
lines(df$tt[df$id == 3], df$y[df$id == 3], col = df$col[df$id == 3])
abline(v = c(4.5, 7.5), lty = 2)

## Comp. 3 -- Early vs Late
lim <- df$id %in% c(2, 3) & df$tt < min(df$tt[df$id == 3 & df$D == 1])
r3 <- fixest::feols(y ~ D | id + tt, df[lim,])
plot(df$tt[lim], df$y[lim], pch = 19,
     ylim = range(df$y), xlim = range(df$tt),
     xlab = "", ylab = "", las = 1,
     main = "III. Early vs Late",
     col = df$col[lim])
lim2 <- lim & df$id == 2
lines(df$tt[lim2], df$y[lim2], col = df$col[lim2])
lim2 <- lim & df$id == 3
lines(df$tt[lim2], df$y[lim2], col = df$col[lim2], lty = 2)
abline(v = c(4.5, 7.5), lty = 2)

## Comp. 4 -- Late vs Early
lim <- df$id %in% c(2, 3) & df$tt >= min(df$tt[df$id == 2 & df$D == 1])
r4 <- fixest::feols(y ~ D | id + tt, df[lim,])
plot(df$tt[lim], df$y[lim], pch = 19,
     ylim = range(df$y), xlim = range(df$tt),
     xlab = "", ylab = "", las = 1,
     main = "IV. Late vs Early",
     col = df$col[lim])
lim2 <- lim & df$id == 2
lines(df$tt[lim2], df$y[lim2], col = df$col[lim2], lty = 2)
lim2 <- lim & df$id == 3
lines(df$tt[lim2], df$y[lim2], col = df$col[lim2])
abline(v = c(4.5, 7.5), lty = 2)

par(mfrow = c(1, 1))
Plot
Each of these 2x2 comparisons will be given positive weight when making up \(\delta^{DD}\). However, we only want to consider comparisons I-III. Comparison IV is flawed because the “control” group is already tainted by treatment dynamics.
One of the innovations in Goodman-Bacon was the ability to decompose \(\delta^{DD}\) into its parts. Using the bacondecomp package, we can apply Goodman-Bacon’s diagnostic methodology to split up \(\delta^{DD}\) into its different 2x2 comparisons and their resulting estimates and weights.
Code
library("bacondecomp")
Loading required package: fixest
Code
bacon(formula = y ~ D, data = df,
      id_var = "id", time_var = "tt",
      quietly = FALSE) -> our_decomp
Output
type weight avg_est
1 Earlier vs Later Treated 0.18182 3.00000
2 Later vs Earlier Treated 0.13636 -4.00000
3 Treated vs Untreated 0.68182 6.33333
This output summarizes the three possible comparisons: treated vs never-treated (good!), early vs late treated (good!), and late vs early treated (maybe bad?). What should raise an eyebrow, though, is the late vs early comparison group with an average estimate of -4! This is, of course, not representative of what we want OLS to reflect. Looking at the plot of our dynamic simulation data, it’s crystal clear that our treatment effects are positive.5
An aside about bacondecomp::bacon()
If we print the result of our decomposition, we can observe each individual 2x2. We can also use this to re-aggregate back to the summary output that bacon() gave us before.
treated untreated estimate weight type
2 5 Inf 7.5 0.3636364 Treated vs Untreated
3 8 Inf 5.0 0.3181818 Treated vs Untreated
6 8 5 -4.0 0.1363636 Later vs Earlier Treated
8 5 8 3.0 0.1818182 Earlier vs Later Treated
########## Summary ##########
type weight avg_est
1 Earlier vs Later Treated 0.1818182 3.000000
2 Later vs Earlier Treated 0.1363636 -4.000000
3 Treated vs Untreated 0.6818182 6.333333
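The re-aggregation is just a weighted average. Using the numbers from the printed 2x2 table, the weight-weighted average of the four estimates is exactly what the full-sample TWFE regression returns (a base-R check):

```r
# Estimates and weights copied from the printed decomposition above
est <- c(7.5, 5.0, -4.0, 3.0)
wt  <- c(0.3636364, 0.3181818, 0.1363636, 0.1818182)
sum(wt * est)  # ~4.32, the TWFE coefficient on the dynamic data
```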
This is no mistake. When I plotted the four comparisons, I also calculated the TWFE regressions for each. Displaying these coefficients, we can see how Goodman-Bacon’s decomposition gets its estimates.6
Code
library("modelsummary")
Code
options(modelsummary_factory_default = 'kableExtra')
options("modelsummary_format_numeric_latex" = "plain")
modelsummary(list("E vs U" = r1, "L vs U" = r2,
                  "E vs L" = r3, "L vs E" = r4),
             gof_map = "", estimate = "{estimate}",
             title = "All 2x2 DiD Coefficients")
All 2x2 DiD Coefficients

      E vs U    L vs U    E vs L    L vs E
D      7.500     5.000     3.000    −4.000
     (0.000)   (0.000)   (0.000)   (0.000)
With this very simple simulation, we’ve found a case where all treatment effects are positive (visually), but part of the decomposed \(\delta^{DD}\) is negative.7 In fact, it’s possible for this comparison to turn the entire treatment effect negative. See below for an example.8
##### Goodman-Bacon Decomp #####
type weight avg_est
1 Earlier vs Later Treated 0.5 51.80009
2 Later vs Earlier Treated 0.5 -65.18046
##### TWFE Results #####
OLS estimation, Dep. Var.: y
Observations: 30,000
Fixed-effects: id: 1,000, year: 30
Standard-errors: Clustered (id)
Estimate Std. Error t value Pr(>|t|)
treat -6.69019 0.670213 -9.98218 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 29.1 Adj. R2: 0.803436
Within R2: 0.003954
Plot
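As a sanity check on that output: averaging the two decomposition pieces with their (equal) weights reproduces the negative TWFE coefficient exactly, even though every true effect is positive.

```r
# Numbers copied from the Goodman-Bacon decomposition printed above
0.5 * 51.80009 + 0.5 * (-65.18046)  # -6.690185, matching the TWFE estimate
```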
As a note: unfortunately, the weights are not computed in any intuitive way. According to Andrew Baker9, “we can think of the TWFE estimator as a weighted average of individual 2x2 DiD estimators, with the weights proportional to group sizes and the variance of the treatment indicators in each pair, which is highest for units treated in the middle of the panel.” Therefore, \(\delta^{DD}\) is sensitive to both panel length and which groups appear towards the center. These are not necessarily things we want to influence our estimates. If you remain interested in how weights are derived, I encourage you to read Goodman-Bacon (2019).
Solutions
So, now what?
There have been many estimators proposed that solve this problem. In a blog post, Scott Cunningham notes three categories of these estimators. While this is a helpful taxonomy to have, these estimators are all quite similar in practice. Of course, each has its own strengths and weaknesses, so it’s important to be familiar with more than just one, but the main thread through all of them is fixing the issues associated with wonky comparisons.
Some of the proposed estimators, just to name a few, are as follows:
We do not have enough time to cover each method, so instead we are going to focus on two: Gardner (2022) and Dube et al. (2023).
Gardner (2022) proposes something called “Two-Stage Difference-in-Differences”, while Dube et al. (2023) propose a method they call a “Local Projections” approach to Difference-in-Differences. While these two are less well known relative to, say, Callaway & Sant’Anna (2021), I think their methodologies are more intuitive.
Two-Stage DiD
The first solution we will discuss is, in my opinion, the most intuitive. This is one of those papers that make you think, “I wish I had thought of that!” Gardner (2022) takes a clever approach towards estimating the TWFE model. The idea starts like this:
If the parallel trends assumption holds (is true), then the untreated potential outcome \(Y_{it}^0\) is simply the sum of unit \(i\)’s unit-specific effect and time period \(t\)’s common time-specific effect.10 Perhaps we should re-frame the problem and switch our priority to accurately measuring these parameters, since they provide information about the unobservable counterfactual.
Gardner goes through quite a bit of math to effectively (albeit more elegantly) say the following: since we only have a single, static \(\delta^{DD}\) in our equation, any heterogeneity in treatment effects gets absorbed by the fixed effects \(\alpha_i\) and \(\lambda_t\). Informally, OLS is trying to fit a square peg in a round hole. Forcing the peg produces biased estimates of \(\alpha_i\) and \(\lambda_t\), which in turn bias \(\widehat{\delta^{DD}}\)!
The solution Gardner proposes is rather simple: if estimating \(\alpha_i\) and \(\lambda_t\) causes us problems, let’s remove them. Wait… how can we just remove them?! If we can somehow get unbiased estimates of our fixed effects, we can simply subtract them from our outcome before estimating \(\delta^{DD}\)!
Before moving on, let’s think about what fixed effects are. Suppose the data we have contain Virginia from 2010 to 2019. A Virginia-fixed-effect is a time-invariant effect. So, if we estimate a Virginia-fixed-effect, it shouldn’t matter whether we use data from 2010-2014, 2015-2019, strictly odd years, or any other combination of years. In theory, all estimations should yield the same (or similar) result. The same logic holds for time-period-fixed-effects: it shouldn’t matter which units you use to estimate them since the effect is invariant across units.
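We can verify this invariance claim on our (noiseless) simulated data: the estimated unit effects are identical whether we use odd or even years, so long as we stick to untreated observations. A base-R sketch, rebuilding the untreated data inline so it runs standalone:

```r
# Untreated outcomes in the simulation are y = id + tt exactly, so the
# unit fixed effects should be recoverable from any subset of years.
df <- data.frame(id = rep(1:3, times = 10), tt = rep(1:10, each = 3))
df$D <- ifelse((df$id == 2 & df$tt >= 5) | (df$id == 3 & df$tt >= 8), 1, 0)
df$y <- df$id + df$tt
untr <- df[df$D == 0, ]

fe_odd  <- coef(lm(y ~ factor(id) + factor(tt), untr[untr$tt %% 2 == 1, ]))
fe_even <- coef(lm(y ~ factor(id) + factor(tt), untr[untr$tt %% 2 == 0, ]))
fe_odd["factor(id)2"]   # 1: the id = 2 unit effect from odd years
fe_even["factor(id)2"]  # 1 again from even years
```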
Since we know treatment is messing up our fixed effects, we can estimate \(\alpha_i\) and \(\lambda_t\) using only untreated observations. We will call this our first stage. The resulting fixed effect estimates will necessarily be unbiased (assuming parallel trends) since there is no treatment involved. Then, in a second stage, we can fit the following model using the entire dataset:
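With first-stage estimates \(\hat{\alpha}_i\) and \(\hat{\lambda}_t\) in hand, the second stage regresses the adjusted outcome on treatment alone (this is my rendering of Gardner’s two stages):

\[Y_{it} - \hat{\alpha}_i - \hat{\lambda}_t = \delta^{DD} D_{it} + \epsilon_{it}\]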
Let’s take a look at the mechanics of this. We are going to start with the first stage and regress our outcome variable on the fixed effects, using only untreated observations.
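A base-R sketch of the first stage (the dynamic data are rebuilt inline so the snippet is self-contained): fit the fixed effects on the D == 0 observations, predict for every observation, and subtract.

```r
# Dynamic simulation: effect grows by 3/period for group 2 (from t = 5)
# and by 5/period for group 3 (from t = 8).
df <- data.frame(id = rep(1:3, times = 10), tt = rep(1:10, each = 3))
df$D <- ifelse((df$id == 2 & df$tt >= 5) | (df$id == 3 & df$tt >= 8), 1, 0)
df$effect <- ifelse(df$D == 1 & df$id == 2, 3 * (df$tt - 5),
                    ifelse(df$D == 1 & df$id == 3, 5 * (df$tt - 8), 0))
df$y <- df$id + df$tt + df$effect

# First stage: fixed effects estimated on untreated observations only
first <- lm(y ~ factor(id) + factor(tt), data = df[df$D == 0, ])

# Subtract the predicted fixed effects from the outcome
df$y_2sdid <- df$y - predict(first, newdata = df)
```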
Finally, estimate the second-stage regression without fixed effects.
Code
lm(y_2sdid ~ D, df)
Output
Call:
lm(formula = y_2sdid ~ D, data = df)
Coefficients:
(Intercept) D
-3.243e-16 6.667e+00
Notice how this is exactly equivalent to our average treatment effect of 6.67 for the dynamic group! The only issue with doing this two-step procedure is that the standard errors can get weird. This is true for all two-stage procedures. There’s an R package called did2s that can help implement two-stage DiD for you. I highly recommend this package.
In addition to estimating \(\delta^{DD}\), we can also estimate an event study using 2SDiD. To do this, simply replace the treatment variable with the fixest notation for an event study. See Module 3.1 for this information.
As a final note, keep in mind that in order to estimate the unit and time fixed effects, you need to observe some untreated data in both dimensions. What I mean is that there cannot be any units that are always treated, and there cannot be any time periods that contain only treated units. Such fixed effects will, of course, not be estimated in the first stage, so those observations automatically get removed from the second stage.
Stacked Difference-in-Differences
The next approach we’ll discuss is stacked difference-in-differences. There is no single paper associated with this estimator, but the intuition has been used in multiple studies. One of the most famous studies that uses a variation of this method is Cengiz et al. (2019).
First, let’s define some terms:
Event: an event is going to be any time some policy is changed or some shock is experienced within a unit (e.g. state, etc.) For example, New Jersey changing its minimum wage would constitute an event.
Sub-experiment: a sub-experiment is a collection of treatment and clean control groups. We’ll define “clean” in a moment. An event might be New Jersey changing its minimum wage. However, if New Jersey and Michigan change their minimum wages at the same time, then their events, in addition to their controls, would constitute a single sub-experiment. Together, New Jersey’s and Michigan’s events would be referred to as a cohort.
Clean control: a clean control group is made up of units that do not experience any treatment, or residual treatment dynamics, during a sub-experiment’s treatment periods.
Let’s start with a fresh df so we can experiment with stacking.
Code
df <- create_data(TRUE)
# Since this dataset is so small, I need to increase
# the number of observations for this demo.
# You would **not** do this in real life.
df <- rbind(df, df)
df <- df[order(df$id, df$tt),]
Since there are two treated units in this example, we would have two sub-experiments (one for each event). To focus in on the treatment, we can create a window of \(\pm 2\) time periods relative to treatment for each sub-experiment.
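The subsetting step itself can be sketched as follows (the exact windows and clean-control logic here are my reading of the setup, and the doubled data are rebuilt inline so the snippet runs standalone):

```r
# Rebuild the doubled dynamic data
df <- data.frame(id = rep(1:3, times = 10), tt = rep(1:10, each = 3))
df$D <- ifelse((df$id == 2 & df$tt >= 5) | (df$id == 3 & df$tt >= 8), 1, 0)
df$y <- df$id + df$tt + ifelse(df$D == 1 & df$id == 2, 3 * (df$tt - 5),
                               ifelse(df$D == 1 & df$id == 3, 5 * (df$tt - 8), 0))
df <- rbind(df, df)
df <- df[order(df$id, df$tt), ]

# Sub-experiment 1: event at t = 5 (id 2), window t = 3..7.
# ids 1 and 3 are untreated throughout this window, so both are clean.
df1 <- df[df$tt %in% 3:7, ]

# Sub-experiment 2: event at t = 8 (id 3), window t = 6..10.
# id 2 is already treated by t = 6, so only id 1 is a clean control.
df2 <- df[df$tt %in% 6:10 & df$id != 2, ]

c(nrow(df1), nrow(df2))  # 30 and 20
```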
Now that we have our two separately defined sub-experiments, with clean control units, we can create variables to enumerate each experiment and denote relative time within each experiment. Finally, we’ll stack the two subsets of data on top of each other in relative time. By “stack”, I mean that we will literally be vertically concatenating the data. In R, this is done via rbind().
Code
# Enumerate the sub-experiments
df1$experiment <- 1
df2$experiment <- 2
# Add in relative time
df1$rel_time <- df1$tt - 5
df2$rel_time <- df2$tt - 8
# Stack the data with rbind()
stacks <- rbind(df1, df2)
# Create a treatment variable
stacks$ever_treat <- ave(stacks$D, paste0(stacks$id, stacks$experiment), FUN = max)
Using these stacked data, we can now estimate a simple 2x2 TWFE specification. Note: the fixed effects should be interacted with the sub-experiment identifier. With fixest, we can do this by using FE1^subexperiment (here, id^experiment and tt^experiment).
Code
feols(y ~ D | id^experiment + tt^experiment, stacks)
Output
OLS estimation, Dep. Var.: y
Observations: 50
Fixed-effects: id^experiment: 5, tt^experiment: 10
Standard-errors: Clustered (id^experiment)
Estimate Std. Error t value Pr(>|t|)
D 3.85714 0.613813 6.28391 0.0032753 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.23889 Adj. R2: 0.870041
Within R2: 0.351834
As I see it, there are two main advantages of stacking. First, after stacking, we’re back to the familiar 2x2 case that we’re already comfortable with. Moreover, and this is the second benefit, we can also estimate \(\delta^{DD}\) for each sub-experiment and summarize/combine the results however we’d like. See below for an example of how to do this with fixest.
Code
# Estimate an event study by sub-experiment using split
feols(y ~ i(rel_time, ever_treat, -1) | id^experiment + tt^experiment,
      stacks, split = ~experiment) -> r2
iplot(r2, grid = F)
# Estimate the aggregate event study
feols(y ~ i(rel_time, ever_treat, -1) | id^experiment + tt^experiment,
      stacks) -> r1
iplot(r1, add = TRUE, col = "dodgerblue")
Plot
While this seems easy enough with just three groups, the complexity increases when the number of sub-experiments and/or the size of the data becomes large. This becomes even more difficult when treatment is non-absorbing. Treatments that are absorbing forever alter the trajectory of the unit. Non-absorbing treatments, by contrast, “wear off”: after some time, the effect of the treatment levels off, a new equilibrium is reached, and the unit effectively returns to being untreated. Whether a treatment is absorbing or non-absorbing says nothing about the magnitude of the effect; rather, it determines whether the treated units can eventually return to the control group after some time. This is an assumption that necessarily depends on the institutional setting.
An example of non-absorbing treatment would be a pipe in my apartment bursting. I would likely have to move out for a few days or so while the pipe is fixed, but eventually I would be able to return to my normal life. An example of absorbing treatment would be my landlord evicting me from my apartment. I would need to find a new place to live and (not to be dramatic) my life would be changed forever (in a causal inference sense).
More concrete examples would be minimum wage laws compared to texting-and-driving laws. Minimum wages are increased every few years to account for inflation, which necessarily makes the changes non-absorbing. On the other hand, once texting-and-driving has been made illegal, it will always be illegal.
Unfortunately, there are not any well-developed R packages that address stacked difference-in-differences. However, that sentence may be outdated by the time you read it, given how quickly theory and software are developing around this topic. At the time of writing, I am working on developing a package to do this.
Local Projections DiD
While there are, at the time of writing, no packages that handle stacked DiD in the way I have outlined, there is a closely related method: Dube et al. (2023). The general idea of this estimator is as follows. Let’s consider a simple 2x2 TWFE model. The equation would, of course, look like the following:
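Written with the time-effect symbol \(\tau_t\) used in the generalization below, the 2x2 TWFE equation is:

\[Y_{it} = \alpha_i + \tau_t + \delta^{DD} D_{it} + \epsilon_{it}\]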
Instead of estimating this equation, we can first do some algebra. We are going to subtract \(Y_{i,t-1}\) from \(Y_{i,t}\). This will yield the following:
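The differenced equation (note that the unit effect \(\alpha_i\) drops out entirely, which is the point):

\[Y_{i,t} - Y_{i,t-1} = \delta^{DD} \Delta D_{it} + \tau_{t} + \epsilon_{it}\]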
Estimating \(\delta^{DD}\) in this equation yields a parameter numerically equivalent to the simple TWFE estimate when there is exactly one pre-period and one post-period. In fact, this is the way Card and Krueger (1994) set up their data.
What if we have more pre- and post-periods than just one on both sides of treatment? If we want to include a total of \(H\) additional post-periods, we can adjust this formula to look like the following:
\[Y_{i,t+h} - Y_{i,t-1} = \delta_h^{DD}\Delta D_{it} + \tau_{t}^h + \epsilon_{it}^h\] where \(h \in \{0, 1, ..., H\}\). Of course, this also will work for \(h \in \{-H, ..., -3, -2\}\). We are unable to estimate a value for \(h = -1\) because this is our baseline, and all values would be \(0\).
So, what is this equation saying? \(\delta_{h}^{DD}\) is akin to an event-study estimate of the treatment at time \(t + h\), while \(\tau_{t}^h\) controls for the effect of treatment going into place at time \(t\). However, to avoid the wonky comparison problem, we are going to restrict our estimation sample for each \(h\). We are only going to use observations that meet one of the following criteria:
Newly treated: \(\Delta D_{it} = 1\) (treated units at time \(t\))
Clean control: \(D_{i,t+h} = 0\) (control units that have not been treated at \(t+h\))
This is a bit math-y, so let’s take a look at how to manually estimate this. Warning: estimation is, in fact, a non-trivial procedure. Moreover, even after we estimate the model, summarizing/plotting the results can become even more tedious.
Code
library("plm") # Use this package for leads and lags.

# We need to set rnd = TRUE to introduce some randomness.
# This is because of the small sample.
df <- create_data(TRUE, TRUE)
df <- pdata.frame(df, index = c("id", "tt"))

window <- -2:2
# Drop -1 because this is going to be mechanically set
# to zero like in an event study.
window <- window[window != -1]

coefz <- list()
for(time in window){
  # Figure out treated obs.
  treat <- which(df$D - lag(df$D) == 1)
  # Figure out control obs.
  ctrl <- which(lag(df$D, -time) == 0)
  lim <- unique(c(treat, ctrl))
  # Create outcome and treatment.
  df$y_lp <- lag(df$y, -time) - lag(df$y)
  df$D_lp <- df$D - lag(df$D)
  r <- feols(y_lp ~ D_lp | tt, df[lim,])$coeftable
  r$h <- time
  coefz[[length(coefz) + 1]] <- r
}
coefz <- do.call(rbind, coefz)
coefz
Again, while there is no stacking package, there is an LPDiD package available called lpdid. Unfortunately, this cannot be installed by install.packages() because the developer is lazy and did not want to be bothered with putting it on CRAN.11 To install this package, you will need to run the following code:
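Installation is from GitHub, something like the following (the repository path below is my assumption; check the package’s page before running it):

```r
# install.packages("remotes")  # once, if you don't already have it
remotes::install_github("alexCardazzi/lpdid")  # assumed repository path
library("lpdid")
```

Among other things, lpdid can: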
Find an equally weighted treatment effect (rather than perhaps odd OLS weights) using reweight = TRUE
Remove units that are treated soon after from the control group using composition_correction = TRUE
Estimate non-absorbing treatments using nonabsorbing = TRUE, setting nonabsorbing_lag (which tells how long after treatment a unit can be considered a control again) and nonabsorbing_treat_status.
Wait a minute – this package does exactly what a stacking package would do and more! Why can’t we just always use LPDiD? Unfortunately, lpdid, as written, can only handle panel data and not pooled cross sections. For example, this package works for data that contains one observation per state-year. However, it does not work if there are many observations per state-year, such as survey data. This is because the first-differencing is ambiguous in a pooled cross section environment.
Comparisons
Finally, let’s revisit that dataset where all the treatment effects were clearly positive but the TWFE estimate turned out negative. We can compare the results of 2SDiD, LPDiD, and TWFE. We did not discuss it, but Sun and Abraham (2021) is also very easy to implement in fixest using the sunab() function. This might also be something you’re interested in toying with once you’ve read up on it. First, let’s try all of these with a pooled effect.
Note: it’s not the case that static \(\rightarrow\) too small and dynamic \(\rightarrow\) too large. This is just an artifact of our simulated data. My point is that the coefficient estimate can be either too large or too small and it’s not necessarily clear which will happen.↩︎
Note: in real life, things are hardly ever as clear!↩︎
This is not to say that some parts cannot be negative. In real life, this will likely be the case. However, in this illustration, it’s clear that the negative is unintended.↩︎
Note: in these data, there is no never-treated group. Therefore, the decomposition only contains two groups: early vs late and late vs early.↩︎
This is an excellent blog post, and would be worth a read.↩︎
I am using \(Y^0\) to denote \(Y\) as untreated.↩︎
Other developers who host packages via GitHub are not necessarily lazy, but this one is.↩︎