Module 4.1: Basics


Alex Cardazzi

Old Dominion University

All materials can be found at alexcardazzi.github.io.

Matching in Econometrics

Thus far, we have talked about simple regression techniques, why you might use certain controls, and something called Difference-in-Differences. As I will stress in nearly every module, Randomized Controlled Trials are the gold standard in causal inference, and we’re always trying to mimic their random assignment. DiD tries to do this by subtracting unit- and time-specific effects from our outcomes. This is only really possible if you observe treatment and control units both before and after treatment.

What if our data is only a cross section and we only observe post-period outcomes? What if you cannot credibly assume parallel trends? Thankfully, not all is lost. In these cases, we can try a relatively intuitive approach to causal inference called matching.

Suppose we are interested in the effects of a new government program that provides grants to health clinics in a developing country. Unfortunately, the political regime in power is corrupt, and they provide grants to those clinics in areas who most align with the government’s political ideology. At the same time, those areas that support the current political regime tend to have better health outcomes. If we were trying to study the effects of these grants, we’d have to worry about the backdoor from treatment (getting a grant) to the outcome (health) via this confounding variable (political ideology).

So, what are our options? If we observed clinic-level outcomes before and after the treatment, we could use DiD. What if we only observed outcomes post-treatment? We could try to control for political ideology, the backdoor causing us issues, via regression.

However, this isn’t the only way we could “control” for this! Suppose the political regime had \(k\) grants to hand out but \(n > k\) clinics in areas that supported them; they would fall short of funding all of their preferred clinics. In this case, we could subset our data, which presumably has \(N > n\) clinic observations, to include only the \(n\) clinics that support the current regime. By choosing to look at this subset of clinics, we are removing all of the variation associated with political affiliation, thus closing the backdoor!

Wait, when does the matching come into play? There are two forms of matching we’ll discuss:

  1. Matching treated units to observationally similar non-treated units, and dropping everything else. Dropping observations will reduce the bias in our estimates by focusing on only the most relevant control units. We are implicitly assuming that being observationally similar means the unit is a good representation of its match’s counterfactual. However, reducing the number of observations also reduces our statistical precision, resulting in larger standard errors. This is an example of the bias-variance tradeoff.
  2. Applying weights to both groups so they appear more comparable. In this strategy, we will weight non-treated units more heavily the more they “look like” the units in the treated group. In addition, we’ll also weight treated units more heavily the more they resemble control units. This method allows us to use the entire dataset which should reduce the estimate variance relative to the smaller, matched sample. However, by using the entire dataset, we still may be using less than ideal comparison units (even if we weight them less), which could still allow bias to creep in.

A question you may have is “if we can control for something by just including the variable in the regression, why would we ever use matching?” First, you’ll generally use matching and regression together, so you shouldn’t think of them as an either-or proposition. Second, an advantage of matching over regression for controlling for confounders is that matching is agnostic1 to functional form. For example, what if the government regime instead allocated grants to clinics that most or least supported its ideology? They might do this to reward loyalty from their allies and reduce hostility from their political enemies. Putting ideology into the regression model as a linear variable would completely miss this U-shaped relationship, while matching allows us additional flexibility.
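To see why functional form matters, here is a toy sketch (made-up numbers, purely illustrative) in which treatment goes to the most and least aligned clinics. A linear ideology term picks up almost none of that U-shaped assignment rule:

Code
# toy illustration: treatment goes to the most and least aligned clinics
set.seed(1)
ideology <- runif(500)
treat <- as.numeric(ideology < 0.2 | ideology > 0.8)
# the linear slope on ideology is roughly zero despite ideology driving treatment
coef(lm(treat ~ ideology))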

Let’s return to the initial premise where being closely aligned in political ideology caused treatment. What would matching look like in words?

If we were creating a matched sample, we would take an inventory of the political ideology of each treated clinic. Then, for each of these clinics, we would find the untreated clinic with the “closest” political ideology (among other variables, if we had them). We would consider these two clinics “matched”. Then, once we do this for every treated clinic, we would drop all of the unmatched clinics from the sample.

If we were creating weights, they would differ within the treated and control groups. For the treated group, we would want to weight the less aligned clinics more than the more aligned clinics. For the control group, we would weight the more aligned clinics more than the less aligned clinics. Why? Remember, the control (treatment) group is probably dominated by less (more) aligned clinics. Therefore, we apply these weights to balance out the groups so they appear more similar.


Funding Clinics

Before moving onto implementing matching, let’s simulate some data to practice. The following function will generate a total of 1,000 clinics, of which 250 get funding grants. The way the code works is that the 1,000 clinics are considered in random order, and the value of their political ideology (a number between 0 and 1) is a stand-in for the probability that the clinic gets funding. Funding is given out probabilistically until the 250 grants are exhausted, and then all remaining clinics get nothing. The simulation function allows for two measures of ideology: binary and continuous. The reason for this will become clearer as you work through these notes.

Code
genr_data <- function(type = "binary"){
  
  N <- 1000 # number of clinics
  n <- N/2 # number of politically aligned clinics
  k <- n/2 # number of grants
  
  if(type == "continuous"){
    
    # aligned clinics are drawn with higher average ideology than unaligned clinics
    ideology <- c(rnorm(n, mean = 0.65, sd = 0.15),
                  rnorm(N-n, mean = 0.35, sd = 0.15))
    # clip ideology to the [0, 1] interval
    ideology <- ifelse(ideology > 1, 1,
                       ifelse(ideology < 0, 0,
                              ideology))
    # shuffle the clinic order
    ideology <- ideology[sample(1:N, N, F)]
  } else {
    
    # half of the clinics are aligned (1), half are not (0)
    ideology <- c(rep(1, n),
                  rep(0, N-n))
    # shuffle the clinic order
    ideology <- ideology[sample(1:N, N, F)]
  }
  
  treat <- 0; grant_count <- k # track how many grants remain
  for(i in 1:N){

    if(grant_count == 0){
        
      # no grants left; remaining clinics go unfunded
      grant <- 0
    } else {
        
      # a clinic's ideology is its probability of receiving a grant
      grant <- sample(0:1, 1, TRUE, prob = c(1-ideology[i], ideology[i]))
      if(grant == 1) grant_count <- grant_count - 1
    }
      
    treat[i] <- grant
  }
  
  # health improves with ideology; the true treatment effect is 0.5
  health <- 5 + 2*ideology + 0.5*treat
  df <- data.frame(index = 1:N,
                   health = health,
                   treat = treat,
                   ideology = ideology)
  return(df)
}

Let’s begin by generating data that is binary.

Code
df <- genr_data("binary")

Next, try generating data on your end and then tabulate the treatment by ideology.
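One way to check your work (a minimal sketch using base R):

Code
# cross-tabulate treatment status against (binary) ideology
table(treat = df$treat, ideology = df$ideology)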


    

Now that we have our data configured, let’s move onto matching.

Exact Matching

Let’s suppose that you’re convinced and ready to use matching. Next you’ll have to decide the following:

  1. What are you going to match on?
  2. What is the criteria to consider something a valid match?
  3. Do you want to create a matched sample or a weighted sample?

In our example, we already stated that we want to match on the confounding variable that is political ideology. Now, we need to come up with some criteria to ensure we’re not matching two vastly different clinics. So, what criteria should we use? Well, let’s just start with perfect matches, or in other words, exact matches. Then, finally, let’s try using a matched sample.

Using exact matches to construct a matched sample is (uninspiringly) called exact matching. For each treated observation, we are going to find a control observation with exactly the same observational characteristics. In our example, especially when ideology is binary, this should be very easy to do. For each treated clinic, we just need to find a control clinic with the same value of ideology.

To perform our matching techniques, we are going to use a package called MatchIt. The main function in this package is called matchit() and it accepts formula and data arguments like when we run regressions with lm(). The difference is that with matchit(), the outcome variable will be the treatment variable. Another argument this function accepts is method. This is where we are going to specify the type of matching we want to use. For the matching types we’ll discuss in this course, we are always going to use method = "nearest". Then, when using exact matching, we also need to specify which variables we want to exactly match. Check out the following code:

Code
library("MatchIt")
m.out <- matchit(treat ~ ideology, data = df,
                 method = "nearest", exact = c("ideology"))

This code might be a bit opaque at the moment, but we’ll return to what it’s exactly doing momentarily. Once we perform our matching, we need to pull out the matched data subset. To do this, we will use the match.data() function with drop.unmatched set to TRUE.

Code
df_m <- match.data(m.out, drop.unmatched = TRUE)

Initially, we started with 1,000 clinics. We can verify this by using dim(df). Try loading in MatchIt, running the matching algorithm, and checking out the resulting data dimensions:
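A sketch of what that might look like, reusing the matching code from above:

Code
library("MatchIt")
m.out <- matchit(treat ~ ideology, data = df,
                 method = "nearest", exact = c("ideology"))
df_m <- match.data(m.out, drop.unmatched = TRUE)
dim(df)   # the full data: 1,000 clinics
dim(df_m) # the matched subset: treated clinics plus their matches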


    

Now that we have our two sets of data, let’s estimate some models to explain health:

  1. A naïve specification where treat is the lone explanatory variable.
  2. The correct specification where treat and ideology are the explanatory variables.
  3. The naïve specification, but using the matched data.
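Here is a sketch using the object names referenced in the next paragraph:

Code
r_naive <- lm(health ~ treat, data = df)           # omits ideology
r_real <- lm(health ~ treat + ideology, data = df) # controls for the confounder
r_matched <- lm(health ~ treat, data = df_m)       # naive model, matched data
coef(r_naive); coef(r_real); coef(r_matched)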

    

Take a look at the generated coefficients. Of course, the coefficient in r_naive suffers from omitted variable bias due to the open backdoor path via ideology. r_real recovers the coefficients we built into the simulation by closing this backdoor. Finally, r_matched also gives us the correct coefficient since we’ve closed the backdoor through matching rather than controlling.

Next, let’s try matching when our data is continuous. In the following code chunk, re-generate the data such that type = "continuous" and create matched data.
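One possible attempt, reusing the exact matching code from before:

Code
df <- genr_data("continuous")
m.out <- matchit(treat ~ ideology, data = df,
                 method = "nearest", exact = c("ideology"))
df_m <- match.data(m.out, drop.unmatched = TRUE)
dim(df_m) # with continuous ideology, exact matches are rare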


    

Do you notice anything about the matched data?

Since ideology is continuous, the probability that another clinic has exactly the same value is practically zero. This means we’ll have almost no matches even though it’s probably the case that we can find very close, albeit imperfect, matches.

What should we do about this? One thought might be to round ideology to the nearest tenth or so and then use exact matching. In fact, this type of matching is called coarsened exact matching. We won’t go into detail about this method, but it is an option.

Another way we could do this is via distances or differences. For example, if we have a treated clinic with an ideology of 0.842, we can find the clinic with the smallest difference in ideology to the treated one. This is called distance matching.

Distance Matching

Let’s think through what matching via distances would be like. Suppose we have some variable \(A\) that we want to match on. For treated unit \(i\), the value of \(A\) is \(A_i\). To measure the distance between unit \(i\) and every other unit, we can use the expression \((A_i - A_j)^2\). We square the difference in order to ensure the distance is positive (negative distances do not make much intuitive sense). Then, we choose unit \(j\) as a match for unit \(i\) so long as \(j\) minimizes the expression.

What if there are two variables? If we have variables \(A\) and \(B\), we would minimize the sum of squared differences. Sound familiar? This is what OLS does! The formula would then look like the following: \(\sqrt{(A_i - A_j)^2 + (B_i - B_j)^2}\). Look familiar? This is the Pythagorean theorem. Yeah, let that sink in. Again, we would still choose unit \(j\) as a match for unit \(i\) if it minimizes the distance. This type of distance matching is called "euclidean".
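For instance, with made-up values for two hypothetical units:

Code
# hypothetical units i and j measured on variables A and B
unit_i <- c(A = 0.8, B = 3.1)
unit_j <- c(A = 0.6, B = 2.5)
sqrt(sum((unit_i - unit_j)^2)) # euclidean distance between i and j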

Euclidean distances make intuitive sense, but there exists a problem. What happens when the units of \(A\) and \(B\) are different? For example, consider matching on GDP (in $s) and unemployment rate (in %s). Differences in unemployment rates will probably top out around ten, but differences in GDP could be in the trillions! Without accounting for this, differences in units of measurement will cause one variable to carry much more weight than another for no reason other than its scale. To adjust for differences in units, our friends over at MatchIt have created the "scaled_euclidean" distance! This will scale each variable by its standard deviation before calculating the euclidean distance, accounting for differences in units amongst variables.

But wait, there’s more! We also need to consider redundant information. Typically, when GDP is high, unemployment is low, and vice versa. Essentially, there’s some correlation between GDP and unemployment that we also need to account for to avoid double counting. This is sort of like what regressions do when we include multiple explanatory variables. We can account for these correlations by using the "mahalanobis" distance. This option will both scale and remove redundant information when computing euclidean distances.
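To get a feel for what this distance computes, base R's mahalanobis() uses the covariance matrix of the matching variables, which handles both the scaling and the correlation. A minimal sketch with made-up GDP and unemployment numbers:

Code
set.seed(1)
# made-up data: unemployment falls as GDP rises
gdp <- rnorm(100, mean = 2e4, sd = 5e3)
unemp <- 10 - gdp/4e3 + rnorm(100)
X <- cbind(gdp, unemp)
S <- cov(X) # the covariance matrix captures both scale and correlation
# mahalanobis() returns a squared distance, so take the square root
sqrt(mahalanobis(X[1, ], center = X[2, ], cov = S))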

How can we implement the mahalanobis distance algorithm? Easy! Instead of using exact = c(...), we can use distance = "mahalanobis" inside of matchit().

Code
m.out <- matchit(treat ~ ideology, data = df,
                 method = "nearest", distance = "mahalanobis")
df_m <- match.data(m.out, drop.unmatched = TRUE)

Let me now give an explanation of what the matchit function is doing when we use method = "nearest". First, the function splits the data into treatment and control groups. Then, it goes down the list of treated units and finds the best match in the control group for each. Note: control units are not reused, meaning each control unit is matched to only one treated unit. As another note, the "nearest" algorithm uses “greedy” matching rather than “optimal” matching. To illustrate, suppose there are treated units with values of 0.8, 0.86, and 0.9, and control units with values of 0.9, 0.8, and 0.8. Of course, the first treated unit of 0.8 will be matched with the 0.8 in the control group. Then, the 0.86 has to choose between 0.8 and 0.9. Since this is “greedy”, it will select the 0.9 which is the better match. Finally, the 0.9 will then be left to match with the remaining 0.8. The optimal approach, on the other hand, would be smart enough to match the 0.8’s and 0.9’s, and then figure out the best way to allocate the remaining treatment (0.86) and control (0.8).
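To make the “greedy” logic concrete, here is a toy sketch (not MatchIt’s actual code) that walks down the treated list and grabs the closest control that has not yet been used, with the numbers from the example above:

Code
treated <- c(0.80, 0.86, 0.90)
controls <- c(0.90, 0.80, 0.80)
available <- rep(TRUE, length(controls))
for(t in treated){
  
  # among the still-available controls, pick the closest one to this treated unit
  idx <- which(available)
  j <- idx[which.min(abs(controls[idx] - t))]
  cat("treated", t, "matched to control", controls[j], "\n")
  available[j] <- FALSE
}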

Finally, let’s see how this matching did. Let’s estimate the same regressions we ran after exact matching:

Code
lm(health ~ treat, data = df)
lm(health ~ treat + ideology, data = df)
lm(health ~ treat, data = df_m)
Output

Call:
lm(formula = health ~ treat, data = df)

Coefficients:
(Intercept)        treat  
     5.9246       0.7701  


Call:
lm(formula = health ~ treat + ideology, data = df)

Coefficients:
(Intercept)        treat     ideology  
        5.0          0.5          2.0  


Call:
lm(formula = health ~ treat, data = df_m)

Coefficients:
(Intercept)        treat  
     6.1853       0.5094  

For creating matched samples, exact matching and mahalanobis distance matching are the most popular methods.


Propensity Scores

One potentially negative aspect of distance matching is the “curse of dimensionality”. Essentially, the more variables you give R to match on, the larger the distances between units become, making close matches harder to find. Even though it may seem like providing as many variables as you can is a good idea, in reality, it might increase the difficulty of finding good matches. This is not to say that you should omit important variables, but rather be picky about the ones you choose.

One way to get around this issue is by using something called a propensity score. A propensity score is the probability that a unit receives treatment. You can also think about it as a measure of how much a given unit “looks like” a treated unit. So, what do propensity scores do? Remember, we want to shut down backdoors that cause both the treatment and the outcome. We are going to use these propensity scores to close the backdoors.

The most common way of calculating propensity scores is to use logistic regression, but you can use any prediction algorithm. We calculate propensity scores with regression by estimating the model and generating fitted or predicted values. Let’s calculate our propensity scores via logistic regression:

Code
# Estimate Model
r <- glm(treat ~ ideology, data = df,
         family = binomial(link = "logit"))
# Generate Propensity Scores
df$ps <- predict(r, type = "response")

Next, let’s compare these propensity scores to the propensity scores generated via matchit(). To do this, we are going to set distance = "glm". Then, we’ll use match.data(), except we do not want to drop the unmatched observations. It has been shown that using propensity scores to create a matched sample is inefficient. Rather, this is where weighting comes into play, and we’ll see this in a moment.

Code
m.out <- matchit(treat ~ ideology, data = df,
                 method = "nearest", distance = "glm")
df_m <- match.data(m.out, drop.unmatched = FALSE)

Now let’s compare the propensity scores from the glm() logistic regression to the ones generated via matchit():

Code
cat("Percent of propensity scores equal: ",
    100 * round(mean(df$ps == df_m$distance), 2), "%")
Output
Percent of propensity scores equal:  100 %

They’re the same! Now you might be curious what the distributions of the propensity scores look like between the treated and control groups. Let’s plot both distributions.

Code
dens_t <- density(df$ps[df$treat == 1])
dens_c <- density(df$ps[df$treat == 0])

plot(0, 0, type = "n",
     xlab = "Propensity Score",
     ylab = "Density",
     ylim = range(dens_t$y, dens_c$y),
     xlim = range(dens_t$x, dens_c$x))
lines(dens_t$x, dens_t$y, col = "tomato", lwd = 2)
lines(dens_c$x, dens_c$y, col = "dodgerblue", lwd = 2)
abline(v = mean(df$ps[df$treat == 1]), col = "tomato", lwd = 2)
abline(v = mean(df$ps[df$treat == 0]), col = "dodgerblue", lwd = 2)
legend("topright", legend = c("Treatment", "Control"),
       col = c("tomato", "dodgerblue"), bty = "n", lty = 1)
Plot

How should we think about these distributions?

  1. You might notice that the propensity scores for the treated group are not reaching very high values. Think about this from the binary perspective. There are 500 clinics that support the political regime, but only 250 grants. This suggests, at best, the probability of treatment among the supporters is only 50%. This is going to vary when ideology is continuous, but hopefully the point remains.
  2. Having propensity scores that are very close to 0 or 1 is generally not great, anyway. This is due to two reasons. First, high (or low) propensity scores indicate particularly high (or low) probabilities of treatment. Finding counterfactual observations for these units is unlikely. Second, we will be constructing weights by taking the inverse (the reciprocal) of these probabilities. Dividing by near zero values will make for particularly large weights which can also be suspicious.
  3. It’s helpful when the propensity scores have significant overlap. This signals that we will be able to find plenty of good matches in the data.

Before moving onto actually using these propensity scores, let’s take a look at a summary of our matching. Essentially, how did we do?

First, we can use summary() to check out some of the output. The two main items you should care about are the Summary of Balance for All Data and Summary of Balance for Matched Data. The first of these gives some summary statistics of the matching variables across treatment and control groups. In other words, it tells us how much the treatment and control look like each other. The second table replicates the first but uses the matched data. This is important – ideally, we’d like for the two groups to be indistinguishable after matching. If the two groups remain very different, we need to continue adjusting our matching algorithm.

Code
summary(m.out)
Output

Call:
matchit(formula = treat ~ ideology, data = df, method = "nearest", 
    distance = "glm")

Summary of Balance for All Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance        0.3105        0.2298          0.6198     1.3119    0.1828
ideology        0.5973        0.4623          0.6756     0.9792    0.1828
         eCDF Max
distance     0.28
ideology     0.28

Summary of Balance for Matched Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance        0.3105        0.3065          0.0309     1.1041    0.0033
ideology        0.5973        0.5924          0.0248     1.0730    0.0033
         eCDF Max Std. Pair Dist.
distance    0.044          0.0325
ideology    0.044          0.0265

Sample Sizes:
          Control Treated
All           750     250
Matched       250     250
Unmatched     500       0
Discarded       0       0

Since summary statistics do not always tell the full story, it’s also helpful to visualize the distributions of the covariates (the matching variables) before and after. We can use plot() with type = "density" and interactive = FALSE.2

Code
plot(m.out, type = "density", interactive = FALSE)
Plot

From this visualization, it’s (hopefully) clear that the matching did a good job balancing out the ideology variable across observations. Since these lines are on top of one another, we can say that we’ve done a good job getting our propensity scores.

Inverse Probability Weighting

Now that we’ve generated propensity scores, we need to construct weights for our regression. If treatment is denoted by \(D\), and propensity scores are \(P(D)\), then the inverse probability weights (IPW) can be calculated using the following formula:

\[\text{Weight} = \frac{\text{Treatment}}{\text{Propensity}} + \frac{1 - \text{Treatment}}{1 - \text{Propensity}} = \frac{D}{P(D)} + \frac{1 - D}{1 - P(D)}\]

Treated units (\(D = 1\)) will be given a weight of \(\frac{1}{P(D)}\). Note that as the propensity score \(P(D)\) increases, the weight will decrease. This is because we want to discount units that are especially prone to being treated since they’re less comparable to the control group. By the same logic, untreated units (\(D = 0\)) will be given a weight of \(\frac{1}{1-P(D)}\).

We can calculate the IPW via:

Code
df$weight <- (df$treat / df$ps) + ((1 - df$treat) / (1 - df$ps))

Now that we have our weights, let’s plug them into our regression. We can do this by supplying the vector name into the weights argument in lm() (or feols()).

Code
reg <- lm(health ~ treat, data = df,
          weights = weight)

Try estimating the regression with inverse probability weights on your own, and see what coefficient you get on treat. Of course, feel free to use MatchIt or glm():
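One possible answer, stringing together the pieces from above:

Code
# propensity scores via logistic regression, then inverse probability weights
r <- glm(treat ~ ideology, data = df,
         family = binomial(link = "logit"))
df$ps <- predict(r, type = "response")
df$weight <- (df$treat / df$ps) + ((1 - df$treat) / (1 - df$ps))
reg <- lm(health ~ treat, data = df, weights = weight)
coef(reg)["treat"] # should land near 0.5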


    

If this worked correctly, you should get a coefficient around 0.5.


Dispensary Lottery

Now for a “real life” use case of propensity score matching (PSM) and inverse probability weighting (IPW):

There are often difficulties in identifying the causal effects of place-based policies on local outcomes, one of which has to do with the endogeneity of the policy implementation. It is not often that place-based policies are located where they are due to random assignment. Rather, they are often used to target or correct poor outcomes. As an example, consider Opportunity Zones. These zones give preferential tax treatment to new investment in particularly distressed communities. If you were to estimate the effect of these zones on investment in a DiD framework, there would likely be violations of the parallel trends assumption.

As another example, consider the effect of marijuana dispensaries on property prices, crime, health, etc. Dispensaries will endogenously locate to different areas for many possible reasons. Since you’ll never convince everyone that you’ve controlled for all of these confounding factors, we need some way to mitigate these endogeneity concerns. Let’s take a look at a short, interesting paper by Conyers & Ayres (2020).

At this point in the course, you should be able to entirely decipher this paper and provide critiques.

In 2012, the Arizona Department of Health Services (ADHS) conducted a lottery to allocate medical marijuana dispensaries throughout the state. ADHS planned to locate one dispensary within each of the 126 Community Health Analysis Areas (CHAA). There were 435 dispensary applications across 99 of the CHAAs (27 received zero applications). 30 of the 99 CHAAs had only a single application, meaning a lottery was only needed in 69 CHAAs. The average number of applications in these CHAAs was 5.94.

Conyers & Ayres (2020) then looked at zip codes, which do not perfectly line up with CHAAs. Importantly, a zip code can fall within more than one CHAA, so they can participate in multiple allocation lotteries. If all applications for a CHAA originate from a single zip code, this zip code would have a probability of allocation of 1.3

The probability of a zip code in multiple CHAAs winning at least one lottery is equal to 1 minus the probability of winning 0 lotteries. Suppose a zip code will participate in two lotteries where the probability of allocation in each one is 0.4 and 0.3. Then, the probability of being allocated at least one is equal to \(1 - (0.6\times0.7) = 0.58\). 84 zip codes had a probability of dispensary allocation between 0 and 1. Of these 84, 36 were allocated dispensaries and 48 were not.
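A quick sanity check of that arithmetic:

Code
p <- c(0.4, 0.3) # hypothetical allocation probabilities for the two lotteries
1 - prod(1 - p)  # probability of winning at least one lottery: 0.58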

Since each zip code has a directly calculable probability of treatment, the authors use this as their propensity score. They then generate regression weights as follows:

\[\begin{aligned}w_\text{Allocated} &= \frac{1}{P(\text{Allocation})}\\ w_\text{Non-Allocated} &= \frac{1}{1-P(\text{Allocation})}\end{aligned}\]

Quote from Conyers & Ayres (2020) about IPW

The inverse probability procedure weights allocated dispensaries with a low probability of allocation higher than allocated dispensaries with a high probability of allocation. It weights non-allocated dispensaries with a high probability of allocation higher than non-allocated dispensaries with a low probability of allocation. This procedure reduces possible bias associated with the endogeneity of the probability of allocation, producing a set of allocated and non-allocated zip codes that, when weighted, are statistically identical on prelottery characteristics.

Why do the authors use weighting in this setting? Zip codes with especially high allocation probabilities are likely different from zip codes with particularly low allocation probabilities. Weighting the zip codes by their different propensity scores is an attempt to produce better comparisons. After computing these weights, the authors estimate the following model:

\[\begin{aligned}O_{it} = \beta_0 + \beta_1 D + \beta_2 T + \beta_3 (D\times T) + \rho_i + \omega_t + e_{it}\end{aligned}\]

where \(i\) and \(t\) represent zip code and time indices, \(O\) represents the outcome measure, \(D\) represents the number of allocated dispensaries, \(T\) represents the number of quarters since the lottery, and \(\rho\) and \(\omega\) represent zip code and time fixed effects.

Notice how this is really just a variation of difference-in-differences. Instead of \(D\) being a binary treatment variable, it is a count of the number of dispensaries. Then, instead of \(T\) being a binary post variable, it is a linear time variable.
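If you had the zip-code panel in hand, a rough sketch of estimating a model like this with feols() might look like the following. The data frame zips and its column names are hypothetical; this is not the authors’ code:

Code
library("fixest")
# hypothetical columns: outcome, D (allocated dispensaries), T (quarters since
# the lottery), zip and quarter identifiers, and the inverse probability weight
feols(outcome ~ D * T | zip + quarter,
      data = zips, weights = ~weight)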

If this were my paper, I would have shown the more standard DiD estimation for transparency. Also, looking at their time series plots, I would be concerned about changes in their control group.

Cleaning Up

As another note about implementing matching, it’s possible to use a combination of PSM and exact matching. If you run the matchit() code above with distance = "glm", you can still use exact = c(...) as well. For example, in the clinics example, it might make sense to match on both ideology and region. Using this code, you could find the control observation with the closest propensity score within the same region. This is also true for creating matched samples with distance = "mahalanobis".
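For instance, if the clinic data also contained a region variable (hypothetical here; it is not in our simulated data), the call might look like:

Code
# region is a hypothetical variable included for illustration only
m.out <- matchit(treat ~ ideology + region, data = df,
                 method = "nearest", distance = "glm",
                 exact = c("region"))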

Lastly, matching, for all its cleverness, has generally fallen out of favor with econometricians. The main complaint about matching is the following. Of course, econometricians are worried about confounding factors. However, confounding factors are often unobservable. Matching only closes backdoors with observable confounding variables. Compare this to difference-in-differences, which attempts to remove confounding factors via time and unit fixed effects. Is DiD perfect? Of course not, but it’s generally preferred to matching.

That said, there is a new class of estimators called doubly-robust estimators. What these estimators do is, essentially, combine DiD and IPW. We won’t cover this, but I want to advertise the frontier of econometrics when I can.


Footnotes

  1. Agnostic is too strong of a word and my using it is just for illustrative purposes. Maybe “less sensitive” is a better way to describe it.↩︎

  2. Note: we use interactive = FALSE because of the following. MatchIt wants to save space and plot three covariates, before and after, at a time. When we turn interactive off, all plots get generated at once. So, if you have 5 covariates, you’ll have two plots (one with 3 covariates and one with 2). This space saving also makes for sometimes crowded plots. As you’ll see below, since we only have one variable, it gets squished up towards the top. Is this the best use of space? Unfortunately, no, but everything else in MatchIt is so good that we forgive.↩︎

  3. This was the case for 51 zip codes, which were subsequently dropped from the analysis.↩︎