Module 2.1: Identification
Old Dominion University
Let’s begin this module by defining a few terms. Be on the lookout for these terms, and more, throughout the notes, bolded and underlined.
Second, here is a flow chart to help you visualize the relationship between these terms:
When you’re first taught econometrics, much focus is placed on estimation and, perhaps to a lesser extent, inference. Your professor, maybe even me, probably tried to hammer home “a one-unit change in \(X\) is associated with a \(\widehat{\beta}\)-unit change in \(Y\).” This is estimation. Then, you were probably asked to test whether \(\beta = 0\) by considering \(\widehat{\beta}\)’s standard error. This is inference.
Once you’re familiar with how to estimate and interpret \(\beta\)s, which admittedly is not always easy, you can start thinking more about your modeling decisions. Estimation and interpretation are the spelling and grammar of modern econometrics, and you need strong foundations as you take this next step.
The next step I’m talking about is called identification. In this module, we are going to start thinking about how to estimate a plausibly causal relationship between \(X\) and \(Y\) rather than a correlational relationship between \(X\) and \(Y\). Before jumping into identification, though, we need to lay out some more foundations.
Economists use econometric models to examine the effects of some policy, shock, or treatment on some outcome. For example, suppose you are in charge of figuring out if a new pill successfully alleviates headaches. Perhaps the best way to test the effectiveness of the drug would be to recruit a number of participants who have headaches. Half of these participants would be given the new pill while the other half would not be allowed to take anything. Consider individual \(i\) who is part of this study.
Individual \(i\)’s outcome, when treated, can be written as \(Y_i|_{D = 1}\) (or \(Y_i|_{D=0}\) if untreated). However, as shorthand, you’ll see this written as either \(Y_i(1)\) or \(Y_{i}^1\). If \(Y_i(1) = 1\) and \(Y_i(0) = 0\), then we’d say that the new pill worked. On the other hand, if \(Y_i(1) = 0\) and \(Y_i(0) = 0\), we’d probably say that the pill did not work. Therefore, the treatment effect can be written as the difference between \(Y_i(1)\) and \(Y_i(0)\): \(Y_i(1) - Y_i(0)\).
If the treatment effect for person/unit \(i\) is \(Y_i(1) - Y_i(0)\), then we can write the average treatment effect (or ATE) as \(E[Y_i(1) - Y_i(0)] = E[Y_i(1)] - E[Y_i(0)]\). Remember, this is shorthand for \(E[Y_i|_{D = 1}] - E[Y_i|_{D = 0}]\). This statement is our causal estimand – the quantity we want to estimate. We need to find a statistical representation of this so we can take it to data.
In statistics, the expectation of \(Y\), \(E[Y]\), simply represents the average of \(Y\). The expectation of \(Y\) given \(X\), \(E[Y|X]\), is the conditional average of \(Y\). The connection between this and econometrics is that you can think of OLS as a way to calculate conditional expectations. This represents our statistical estimand.
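As a quick illustration, here is a minimal sketch in R (the numbers and seed are made up) showing that OLS with a binary regressor reproduces the two conditional means:

```r
set.seed(708)  # hypothetical seed
n <- 1000
D <- rbinom(n, size = 1, prob = 0.5)  # binary treatment
Y <- 2 + 3 * D + rnorm(n)             # outcome with a true effect of 3

# The OLS intercept is E[Y | D = 0]; the slope is E[Y | D = 1] - E[Y | D = 0]
coef(lm(Y ~ D))
mean(Y[D == 0])                    # matches the intercept
mean(Y[D == 1]) - mean(Y[D == 0])  # matches the slope on D
```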
Back to the ATE – in words, this is the average outcome of \(Y\) for treated units minus the average outcome of \(Y\) for untreated units.
Okay, cool, but there’s a big issue with all of this… We are talking about \(Y_i(1)\) and \(Y_i(0)\). How can we observe both \(Y_i(1)\) and \(Y_i(0)\)? Either individual \(i\) is treated or they’re not!
In other words, there are two possible states of the world. In one, individual \(i\) takes the pill, and their outcome is \(Y_i(1)\). In the other version of the world, where they don’t take the pill, their outcome is \(Y_i(0)\). Finding the treatment effect as described above only works if we can simultaneously observe both of these versions. It’s not even enough for us to believe in parallel universes – we would need to observe each outcome in each universe. This is the fundamental problem of causal inference, and this issue is going to motivate the rest of this class.
So, how can we plausibly estimate treatment effects? Since, at the time of writing, we’re limited to only a single universe, we usually have to rely on observational data. Like in the new pill example, there might be a group of treated individuals and a group of untreated individuals. If the only difference between the groups is their treatment status, then we can attribute any differences in outcomes to the difference in treatment status. In other words, we can use \(E[Y|D=1]\) to approximate \(E[Y(1)]\). Think of \(E[Y|D=1]\) as the average of \(Y\) for the subset of individuals who are treated. This is subtly different from \(E[Y(1)]\), which is the average of \(Y\) for all individuals in the specific universe in which they are treated. If the treated subset of individuals is a random sample of all individuals, then \(E[Y|D=1] \approx E[Y(1)]\). This is why randomized controlled trials (RCTs) are considered the gold standard in causal inference. With randomization, we can be sure that the treated subset is representative of the rest of the population.
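To see why randomization works, here is a minimal simulation sketch (the data-generating process is an assumption for illustration). Because we generate both potential outcomes ourselves, we know the true ATE and can check that a randomized difference in means recovers it:

```r
set.seed(708)  # hypothetical seed
n <- 100000

# Generate BOTH potential outcomes for every unit (impossible with
# real data, easy in a simulation)
Y0 <- rnorm(n, mean = 0, sd = 1)
Y1 <- Y0 + 2  # true treatment effect of 2 for everyone

# Randomly assign treatment; we then observe only one potential outcome
D <- rbinom(n, size = 1, prob = 0.5)
Y <- D * Y1 + (1 - D) * Y0

mean(Y1 - Y0)                      # true ATE: exactly 2
mean(Y[D == 1]) - mean(Y[D == 0])  # difference in means: approximately 2
```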
However, it’s usually impossible to run RCTs in real life. For example, if I want to know the effect of speed limits on crash rates, I cannot go around randomly changing the speed limits of different highways. Rather, I have to rely on natural experiments that occur out in the wild. Usually, these natural experiments are composed of one group that experiences some change and another very similar group that does not. For example, maybe a few states change their speed limits but others do not, or a law change affects interstates but not state routes. Natural experiments can also occur when some institutional quirk allows us to isolate and leverage random variation in treatment. As an example, lotteries of all shapes and sizes (e.g. military drafts, health insurance lotteries, housing voucher lotteries, etc.) can mimic the randomness of an RCT. Using data in a way that allows us to pick out the variation we care about is called identification. We’ll discuss different identification strategies to estimate causal treatment effects in observational data as we progress throughout this course.
Before working with natural experiments, let’s first consider some toy models to demonstrate what identification really means. Suppose I have the following hypothesis:
Sleeping with shoes on leads to headaches in the morning.
How could I test this hypothesis?
The first option would be to run an RCT. In this RCT, I would sneak into homes at random as some Nike-Santa and put shoes onto people who are asleep. Then, I would record information on whether the participants had headaches the next morning. Of course, this is next to impossible, but let’s think through this. If I were to estimate a treatment effect, I would be able to interpret it as causal since the only thing differing between the groups is their treatment status. In this setting, the randomization allows me to ensure that I am identifying the treatment effect and not some other factor.
In the absence of an RCT or natural experiment, I need to rely on observational data. In this case, I would have to find some people who slept with their shoes on and some people who did not. Suppose the data I collect are as follows.
| Individual \((i)\) | Treatment \((D)\) | Outcome \((Y)\) |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 1 | 1 |
| 3 | 1 | 0 |
| 4 | 1 | 1 |
| 5 | 0 | 0 |
| 6 | 0 | 1 |
| 7 | 0 | 0 |
| 8 | 0 | 0 |
Remember, we can only observe a single universe, so this table is really a condensed version of the following:
| Individual \((i)\) | Treatment \((D)\) | Outcome \((Y(0))\) | Outcome \((Y(1))\) |
|---|---|---|---|
| 1 | 1 | ? | 1 |
| 2 | 1 | ? | 1 |
| 3 | 1 | ? | 0 |
| 4 | 1 | ? | 1 |
| 5 | 0 | 0 | ? |
| 6 | 0 | 1 | ? |
| 7 | 0 | 0 | ? |
| 8 | 0 | 0 | ? |
Here we observe that most of the treated group have an outcome of 1 while most of the control group have an outcome of 0. Using these data, we can calculate \(E[Y|D = 0]\) and \(E[Y|D = 1]\) (the statistical estimands) with the parts of the data that we can observe as stand-ins for \(E[Y(0)]\) and \(E[Y(1)]\) (the causal estimands). Since \(E[Y | D = 1] = 0.75\) and \(E[Y | D = 0] = 0.25\), it seems that wearing shoes when you sleep does lead to more headaches! If we do this, our strategy to identify the treatment effect of shoes on headaches is simply throwing our hands up and assuming (hoping?) that treatment was randomly assigned.
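A quick sketch of this calculation in R, entering the observed column of the table by hand:

```r
# Toy data from the table above
D <- c(1, 1, 1, 1, 0, 0, 0, 0)
Y <- c(1, 1, 0, 1, 0, 1, 0, 0)

mean(Y[D == 1])  # E[Y | D = 1] = 0.75
mean(Y[D == 0])  # E[Y | D = 0] = 0.25

# Equivalently, the OLS slope on D is the difference: 0.75 - 0.25 = 0.5
coef(lm(Y ~ D))
```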
Before we run to tell people not to wear shoes when they sleep, let’s think about this a bit deeper. Is assuming random assignment of treatment a good assumption? Put differently, might there be something causing both sleeping with shoes on and waking up with a headache? Is there something about the treated individuals that makes them different from the untreated individuals? Since this is observational, there could be many reasons!
One possible reason would be drinking the night before. Drunk people are much more likely to fall asleep with their shoes on and wake up with a headache, or so I’ve been told. This is an example of a confounding variable: something that causes both the outcome and the treatment. When we have confounding variables, we need to control for them. In other words, we want to account for the factor that differs between the treatment and control groups. This is what we call our identification strategy. When we had our RCT, that randomization was our identification strategy. In this case, our identification strategy is controlling for this confounding variable.
How do we implement identification strategies? In short: econometrics. Using OLS, we can account for, or control for, confounding variables by including them in our regressions. For example, we can write the initial regression as the following:
\[\text{Headache}_i = \alpha + \delta\text{Shoes}_i + \epsilon_i\]
Here, \(\widehat{\delta}\) will be biased. In other econometrics courses, you might have heard of this as omitted variable bias. To address omitted variable bias, we need to control for the omitted, or confounding, variable. We can modify the regression as follows:
\[\text{Headache}_i = \alpha + \delta\text{Shoes}_i + \beta\text{Drink}_i + \epsilon_i\]
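Here is a minimal simulation sketch of this omitted variable bias (the data-generating process is entirely assumed, with drinking causing both shoes and headaches and a true shoe effect of zero):

```r
set.seed(708)  # hypothetical seed
n <- 10000

# Drinking causes both sleeping with shoes on and waking up with a
# headache; shoes have no true effect on headaches
drink    <- rbinom(n, size = 1, prob = 0.3)
shoes    <- rbinom(n, size = 1, prob = 0.1 + 0.6 * drink)
headache <- rbinom(n, size = 1, prob = 0.1 + 0.5 * drink)

# Omitting drink: the coefficient on shoes is biased upward
coef(lm(headache ~ shoes))

# Controlling for drink: the coefficient on shoes is approximately zero
coef(lm(headache ~ shoes + drink))
```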
In this regression, we will get a plausibly causal estimate of \(\delta\) since we’ve controlled for the influence of drinking. In reality, it’s very likely that \(\delta \approx 0\), but now we can think of this estimate as causal rather than correlational.
In the examples above, it was fairly easy to pick out confounding variables and think through the logic of how to control for them. However, as models get more complicated, we need a systematic way to think through them. Many economists have begun to use Directed Acyclic Graphs, more commonly known as DAGs, to illustrate or graphically represent causality. Let’s break down the meaning of DAGs:
Every DAG is made up of nodes, which are causal factors or variables. Each node will be connected to at least one other node by an edge. Edges establish the flow of causality between nodes via arrows.
To do this in R, we are going to install and load the `dagitty` package.

To start, let’s re-imagine our RCT where we randomly put shoes on people as they sleep. We can illustrate this setting with a DAG. First, let’s label our variables. Our treatment, wearing shoes, will be denoted by \(D\); our outcome, headaches, will be labeled \(Y\); and drinking the night before will be \(X\). Second, we are going to use the `dagitty()` function. In this function, we need to do the following:

1. Tell `dagitty` that we are trying to make a DAG. We do this by typing `dag` at the very beginning of our string.
2. Tell `dagitty` what nodes we’ll be using for our outcome and exposure (treatment).
3. Tell `dagitty` how each node is connected (see the sketch after the regression below).

In this DAG, we have two factors that cause \(Y\): \(D\) and \(X\). Since there’s only one path from \(D\) to \(Y\), we are free to estimate a regression of the following form:
\[Y_i = \alpha + \delta D_i + \epsilon_i\]
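Below is a minimal sketch of what this `dagitty()` call might look like for the RCT setting, using the node names defined above:

```r
library(dagitty)

# RCT DAG: treatment D is randomly assigned, so nothing causes D;
# drinking X affects headaches Y but not D
rct_dag <- dagitty("dag {
  D [exposure]
  Y [outcome]
  D -> Y
  X -> Y
}")

plot(rct_dag)  # draws the DAG (dagitty picks a layout automatically)
```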
In this DAG, there is not much use in controlling for \(X\) since it is independent of \(D\). If we were to control for \(X\), though, we’d be removing something from the error term. This should increase the precision of our estimate (i.e. shrink our standard errors), but leave \(\widehat{\delta}\) essentially unchanged.
What would the DAG look like if we did not have an RCT to assign treatment? Now, drinking would cause both shoes and headaches. We’d have to re-write the DAG as the following:
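Assuming the same node names, a sketch of the observational DAG:

```r
library(dagitty)

# Observational setting: drinking X now causes both shoes D and headaches Y
obs_dag <- dagitty("dag {
  D [exposure]
  Y [outcome]
  D -> Y
  X -> D
  X -> Y
}")

plot(obs_dag)
paths(obs_dag)  # lists every path from D to Y and whether each is open
```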
Note that there are now two paths from \(D\) to \(Y\):

1. \(D \rightarrow Y\)
2. \(D \leftarrow X \rightarrow Y\)
I know this is a bit weird with the way the arrows are drawn, but this is the norm. The second path, where you see two arrows pointing away from one another \((\leftarrow X \rightarrow)\), is called a backdoor path. Any time you see an arrow pointing back at the treatment variable, you have a backdoor path. For every backdoor path we see, we need to control for something along that path (that isn’t the treatment or the outcome, of course).
In this case, since \(X\) is the only variable on this backdoor path, we need to control for it. This will block/close the backdoor path from \(D\) to \(Y\), which is not the path we’re interested in. Once \(X\) is controlled for, and assuming we’re satisfied with this DAG, we can interpret \(\widehat{\delta}\) as the causal effect of wearing shoes while asleep on waking up with a headache.
Translating from DAG to regression, we would write \(Y_i = \alpha + \delta D_i + \beta X_i + \epsilon_i\). Without \(\beta X_i\) in the regression, our analysis would suffer from confounding or omitted variable bias.
We can also use a function from `dagitty`, `adjustmentSets()`, to help us identify which variables we need to control for to identify a causal effect of the treatment on the outcome.
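For example, applied to the observational DAG sketched above:

```r
# Using obs_dag from the sketch above
adjustmentSets(obs_dag)
#> { X }
```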
Of course, estimating this regression is only possible if we can observe and measure \(X\). If \(X\) were either unobservable or unmeasurable, we would not be able to control for it, and we would thus be unable to interpret \(\delta\) as causal. Let’s modify our DAG such that \(X\) is unobservable and see what the output of `adjustmentSets()` gives us.
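A sketch of this modification, marking \(X\) as latent so `dagitty` knows it cannot be adjusted for:

```r
library(dagitty)

latent_dag <- dagitty("dag {
  D [exposure]
  Y [outcome]
  X [latent]
  D -> Y
  X -> D
  X -> Y
}")

adjustmentSets(latent_dag)  # prints nothing: no observable set closes the backdoor
```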
When \(X\) was observed, `dagitty` told us to control for it. However, now that \(X\) is unobserved, it no longer tells us to control for it. This is not because we don’t have to, but rather because `dagitty` won’t tell us to control for something that we physically can’t. Unfortunately, given the current DAG, there’s no way to isolate the effect of \(D\) on \(Y\).
However, if \(X\) causes another observable thing, \(W\), before causing \(D\), then we could just condition on \(W\) instead of \(X\) because \(W\) is on the same backdoor path.
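A sketch of code that could produce output like the block below, assuming node names \(D\), \(Y\), \(X\), and \(W\):

```r
library(dagitty)

w_dag <- dagitty("dag {
  D [exposure]
  Y [outcome]
  X [latent]
  X -> W
  W -> D
  D -> Y
  X -> Y
}")

cat("Paths in DAG:\n")
print(paths(w_dag))
cat("Adjustment Set for DAG:\n")
print(adjustmentSets(w_dag))
```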
```
Paths in DAG:
$paths
[1] "D -> Y"           "D <- W <- X -> Y"

$open
[1] TRUE TRUE

Adjustment Set for DAG:
{ W }
```
These are the basics of handling confounding variables and sketching DAGs. Of course, there’s more to it, but in the interest of not being too verbose, I will let Nick Huntington-Klein fill in some of those gaps with the following two (optional) videos:
The opposite of a backdoor path is a collider path. To identify collider paths, look somewhere along the path for two arrows pointing into the same node, like this: \(\rightarrow X \leftarrow\). These paths are already closed. When we have backdoor paths, we need to control for at least one node along the path to close the backdoor. If we condition on the collider node of a collider path, however, we actually open a path that was already closed to begin with.
This might be a bit confusing at first, but think through this example. Suppose you’re interested in the effect of someone’s IQ (intelligence) on the IQ of their significant other. In other words, do intelligent people tend to match with people of similar intelligence? To investigate this, we could collect data on the IQ of each subject \((D)\), their significant other \((Y)\), and their first-born child \((X)\). Let’s draw the DAG.
In this DAG, both the subject and their significant other cause the IQ of their child. The edge connecting \(D\) and \(Y\) is then the relationship we’d like to test. This should make some intuitive sense, as IQ likely has a genetic component.
Since the relationship between \(D\) and \(Y\) is what we’re testing, \(D\) will appear in our regression: \(Y_i = \alpha + \delta D_i + \epsilon_i\). There are two reasons why \(X\), the IQ of the child, should not appear in the regression:

1. There is no backdoor path from \(D\) to \(Y\) running through \(X\), so there is nothing for controlling for \(X\) to close.
2. \(X\) is a collider, so conditioning on it would open the path \(D \rightarrow X \leftarrow Y\), which was closed to begin with, and bias \(\widehat{\delta}\).
Let’s explore a simple simulation of this setting.
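Here is a minimal simulation sketch (the data-generating process and numbers are assumptions for illustration). Subject IQ and partner IQ are generated independently, so the true effect is zero; conditioning on the child’s IQ, a collider, manufactures a spurious relationship:

```r
set.seed(708)  # hypothetical seed
n <- 10000

# Subject IQ (D) and partner IQ (Y) are independent: true effect is zero
D <- rnorm(n, mean = 100, sd = 15)
Y <- rnorm(n, mean = 100, sd = 15)

# Child IQ (X) is caused by BOTH parents (a collider)
X <- 0.5 * D + 0.5 * Y + rnorm(n, mean = 0, sd = 5)

# Leaving the collider out: delta-hat is approximately zero, as it should be
coef(lm(Y ~ D))

# Conditioning on the collider opens the closed path: delta-hat is now
# spuriously negative
coef(lm(Y ~ D + X))
```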
For another example, take a look at this discussion about Beauty and Talent from *The Mixtape*.
For practice, let’s build a DAG to explain headaches and to study the effects of taking medicine. The variables and paths we’ll consider are as follows:

- `Medicine`: our treatment, taking medicine.
- `Headache`: our outcome, having a headache.
- `Sick`: being sick causes headaches and also causes people to take medicine, making it a confounder.
- `Doctor`: visiting the doctor; both being sick and having a headache lead to doctor visits, making `Doctor` a collider.
Drawing this setting out in a DAG would look like the following:
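A sketch of this DAG in `dagitty`; the exact edges into `Doctor` (from `Sick` and `Headache`) are my assumption for illustration:

```r
library(dagitty)

med_dag <- dagitty("dag {
  Medicine [exposure]
  Headache [outcome]
  Medicine -> Headache
  Sick -> Medicine
  Sick -> Headache
  Sick -> Doctor
  Headache -> Doctor
}")

plot(med_dag)
```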
Let’s check on the paths of this DAG.
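Using `med_dag` from the sketch above:

```r
paths(med_dag)  # three paths: the direct causal path (open), the backdoor
                # through Sick (open), and the path through the Doctor
                # collider (already closed)

adjustmentSets(med_dag)
#> { Sick }
```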
- The path through `Doctor` is already closed, thanks to `Doctor` being a collider.
- The backdoor path through `Sick` is open, so we need to control for `Sick` in a regression.

An equation to identify the impact of medicine on headaches would be the following:
\[\text{Headache}_{i} = \alpha + \delta \times \text{Medicine}_{i} + \beta\times \text{Sick}_i + \epsilon_i\]
Here, our estimate of \(\delta\), \(\widehat{\delta}\), can be interpreted as causal.
In upcoming modules, we will be exposed to different research designs that, like RCTs, aim to identify causal effects, but that do so by leveraging variation in observational data. Each of these designs relies on its own set of assumptions, data requirements, and so on. Causal inference is the umbrella term for these designs, and it has become a rapidly growing field.
ECON 708: Econometrics III