Module 5.1: Special Topics
Old Dominion University
In each of the models we’ve estimated thus far, we’ve included an error term. Error terms are, effectively, all the other things we cannot observe or account for that still affect the outcome, as well as sheer randomness.
Sometimes in social science, business, etc., we are unable to perfectly measure our variables, which we call measurement error. Measurement error can arise for many different reasons such as sampling error, numbers being rounded or truncated, misreporting, etc.
What are the effects of measurement error?
Measurement error in explanatory variables biases coefficients towards zero. Note the difference between this and omitted variable bias. Omitted variable bias pushes coefficients in a specific direction (positive or negative). If the omitted variable bias is negative, true negative coefficients become more negative and true positive coefficients become less positive.
However, with measurement error in explanatory variables, coefficients get squished down towards zero rather than pushed up or down.
Random measurement error in \(Y\) gets absorbed into the error term. In reality, this just makes inference more difficult (by increasing the standard errors of the coefficient estimates).
As a broad, descriptive summary, measurement error in \(X\) variables will bias coefficients towards zero. Measurement error in \(Y\) does not bias your coefficients, but does inflate their standard errors. In the following, I will demonstrate some of the effects of measurement error using some simulated data.
In this section of notes, I have simulated two datasets of 50 observations each. Both datasets are simulated in exactly the same way, just at different times, so the random draws differ. The data are generated by:
\[Y_i = X_i + e_i\]
To generate measurement error in \(X\), I create a new variable called \(\widetilde{X}\), which is simply \(X\) plus some random noise that progressively gets larger. To generate measurement error in \(Y\), I slowly increase the magnitude of the error term \(e\) in the equation above. The figures below are gifs that demonstrate what happens as measurement error in either \(X\) or \(Y\) increases.
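If you would like to experiment on your own, here is a minimal sketch (not the exact code behind the figures) of how measurement error in \(X\) attenuates the slope estimate:

```r
set.seed(1)
n <- 50
x <- rnorm(n)
y <- x + rnorm(n)              # true model: Y = X + e

# Add random noise to X (measurement error in the explanatory variable)
x_tilde <- x + rnorm(n, sd = 2)

# The second slope should be squished towards zero relative to the first
coef(lm(y ~ x))["x"]
coef(lm(y ~ x_tilde))["x_tilde"]
```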
In the panel on the left, you’ll see that the points spread out horizontally but do not change their vertical positions. Notice how the regression line, depicted in solid black, seems to quickly flatten out as the measurement error increases. This is what I mean by saying measurement error in \(X\) pushes the slope coefficient towards zero.
In the panel on the right, the opposite movement is occurring for the points. Here, they are spreading out vertically, but not changing their position horizontally. In this example, the regression line becomes more vertical over time (i.e., the slope coefficient is pushed away from zero), but this is due to the particularities of this specific simulation. With measurement error in \(Y\), the slope coefficient is as likely to increase as it is to decrease.
See below for the second set of simulations with similar results.
The following figure tracks both the coefficient estimates and standard errors for a similar simulation as measurement error increases. Again, the slope coefficient plummets towards zero as measurement error in \(X\) increases, whereas the effect of measurement error in \(Y\) is much less stark. However, the opposite is true for the standard error estimates. As the coefficient heads towards zero with increasing measurement error in \(X\), the standard error also decreases, which indicates an increase in precision. On the other hand, while the coefficient estimate is less affected by measurement error in \(Y\), the standard error steadily climbs.
Generally speaking, you will be unable to determine exactly how much measurement error you are dealing with and whether it is in \(X\), \(Y\), or both. Typically, people will write things like, “this estimate is a lower bound, since measurement error is pushing the coefficient towards zero,” or, “the estimate is insignificant, but this may be due to measurement error in the outcome variable.” Do these arguments work? Sometimes. Like most things, it really just depends.
Another point worth making is that these simulations only consider random measurement error rather than non-random error. Random error is when measurements are equally likely to be high relative to low. However, measurement error is considered non-random when the measurements are consistently, or more likely to be, inflated or deflated. Take this article from The Economist as an example. Some clever economists got their hands on satellite data that measured how bright a given country or city was at night. Their hypothesis was that areas with more economic activity should have more ambient light at night compared to less developed areas. As a simple example, see this image of the United States at night. Anyway, these economists used luminosity to predict GDP, and found that autocratic countries consistently reported higher levels of GDP than what their luminosities would suggest. This was not the case for democratic countries. The point here is that this represents non-random measurement error in reported GDP.
When we’re first learning econometrics, much is made of estimating the \(\beta\)’s. Make no mistake – this is incredibly important. However, as you learn more econometrics, you begin to realize that standard errors are as important, if not more important, than the coefficients.
Ultimately, we are using econometrics to build models and test hypotheses. We can build the best model in the world, but if our standard errors are wrong, we’ll draw false conclusions.
We briefly touched on the term heteroskedasticity in previous modules. However, we did not really discuss its implications. To use “normal” standard errors, we assume that the errors in the model are homoskedastic, meaning their variance is constant across observations; this is one of the classical assumptions behind OLS inference.
So, what happens if we have heteroskedasticity (errors whose variance is not constant), which is a violation of this important assumption?
Let’s explore once again with some simulations. We are going to simulate data from the following equation:
\[Y_i = 10 + X_i + e_i\]
Each time we simulate data, however, we are going to force different distributions onto \(e_i\). Below, I plot the simulated \(X\) and \(Y\) data with different error distributions in addition to the estimated regression line.
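As an illustration (again, not the exact code used for the figures), one simple way to force heteroskedasticity is to let the error’s standard deviation grow with \(X\):

```r
set.seed(1)
n <- 500
x <- runif(n, 0, 10)

# Homoskedastic baseline: constant error variance
y_homo <- 10 + x + rnorm(n, sd = 2)

# Heteroskedastic version: error variance increases with x
y_het <- 10 + x + rnorm(n, sd = 0.5 * x)

par(mfrow = c(1, 2))
plot(x, y_homo, main = "Homoskedastic"); abline(lm(y_homo ~ x))
plot(x, y_het, main = "Heteroskedastic"); abline(lm(y_het ~ x))
```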
I will explain what you are looking at after the next set of images, where we visualize the relationship between \(X\) and each regression’s residuals.
Now, what do the parameter estimates and standard errors look like from these regressions? Remember, the slope should be about equal to \(1\).
| | (a) | (b) | (c) | (d) | (e) |
|---|---|---|---|---|---|
| X | 0.946*** | 0.902*** | 0.981*** | 0.906*** | 0.997*** |
| | (0.047) | (0.069) | (0.046) | (0.082) | (0.041) |
| Num.Obs. | 500 | 500 | 500 | 500 | 500 |
| R2 | 0.454 | 0.258 | 0.483 | 0.196 | 0.548 |
Column (a) in this table is the homoskedastic version, and should be thought of as a baseline. Since the errors are different, the estimates will also be a bit different, so I want you to focus just on the standard errors. Heteroskedasticity can artificially inflate or deflate standard errors, and it is not always clear which way they will move. The coefficients are all close to \(1\), but some of the standard errors are larger and some are smaller than the homoskedastic one.
To address the issue of heteroskedasticity, we can use heteroskedasticity-robust standard errors. Essentially, without getting too deep into the math, we are going to correct for the non-constant variance of the errors. This will give us better (more accurate) standard errors for testing our hypotheses.
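For reference only (you will not need to memorize this), the heteroskedasticity-robust (“White”) estimator replaces the usual variance formula with a “sandwich” in which each observation’s squared residual enters on its own:

\[\widehat{\text{Var}}(\hat{\beta}) = (X'X)^{-1}\left(\sum_{i=1}^{n} \hat{e}_i^2 x_i x_i'\right)(X'X)^{-1}\]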
To apply this correction, we are going to use an R package called `fixest`.
At this point, the main benefits of this package are twofold. First, fixed effects are estimated extremely quickly. We’ll do some “speed tests” in just a bit. Second, we can apply different standard error corrections to our models fairly easily.
feols
`fixest` comes with a few different functions, but we will mostly be interested in `feols()`, which stands for “fixed effects ordinary least squares”. The important arguments to this function are:

- `data`: the data for the regressions.
- `subset`: a vector that tells R which observations out of `data` to use.
- `split`/`fsplit`: sometimes we want to estimate regressions on different subsamples of the data. For example, in an early module we estimated the effect of GPA on SAT scores by gender. `split` will automatically estimate the model separately for each unique value of the vector passed to it. `fsplit` will do the same, except it will also estimate a model for the full sample.
- `vcov`: this is how you will specify the variance-covariance matrix (or, in English, the standard errors). For our purposes, “normal” standard errors are `"iid"` and heteroskedasticity-robust standard errors are `"hetero"`. Note: `feols` defaults to `"cluster"` when using fixed effects, so you will need to manually override this with one of the two options above.
- `fml`: this is the formula to be estimated. For the purposes of this course, `fml` will look something like the following: `c(y1, y2, y3) ~ x1 + x2 + x3*x4 | FE1 + FE2`. `feols` can estimate models with multiple outcomes at once, which is written as `c(y1, y2, y3)`. The `x` variables follow the squiggle `~`, and fixed effects follow the vertical bar `|`.

feols and fsplit
library("fixest")
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")
df <- df[df$hs_gpa <= 4,]
r <- feols(sat_sum ~ hs_gpa, df, fsplit = ~sex)
# Note that r is now a "list" of three regressions.
regz <- list(`All Students` = r[[1]],
`Female` = r[[2]],
`Male` = r[[3]])
coefz <- c("hs_gpa" = "High School GPA",
"(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
title = "Effect of GPA on SAT Scores",
estimate = "{estimate}{stars}",
coef_map = coefz,
gof_map = gofz)
| | All Students | Female | Male |
|---|---|---|---|
| High School GPA | 11.538*** | 11.392*** | 13.715*** |
| | (0.753) | (0.999) | (1.079) |
| Constant | 66.468*** | 70.196*** | 55.851*** |
| | (2.440) | (3.167) | (3.579) |
| Num.Obs. | 999 | 515 | 484 |
| R2 | 0.191 | 0.202 | 0.251 |
feols
For speed comparisons between `lm` and `feols`, we are going to simulate data of varying sizes with a proportional number of fixed effects (e.g. 1,000 observations and 10 FEs; 10,000 obs and 100 FEs) and see how long each function takes.
# Store timings for each sample size
lm_time <- c()
feols_time <- c()
nz <- c(1000, 5000, 10000, 20000, 35000, 50000, 75000, 100000)
for(n in nz){
  # Simulate data with one regressor and n/100 fixed-effect groups
  x <- rnorm(n)
  fe <- sample(1:(n/100), n, TRUE)
  y <- x + rnorm(n)
  df <- data.frame(y = y, x = x, fe = fe)
  # Record elapsed time (third element of system.time()) for each estimator
  lm_time <- c(lm_time, system.time(lm(y ~ x + as.factor(fe), data = df))[3])
  feols_time <- c(feols_time, system.time(feols(y ~ x | fe, df))[3])
}
plot(nz, lm_time, pch = 19, col = "tomato", type = "b",
     xlab = "Sample Size", ylab = "Time in Seconds", las = 1)
lines(nz, feols_time, pch = 19, col = "dodgerblue", type = "b")
legend("topleft", legend = c("lm()", "feols()"), bty = "n",
       col = c("tomato", "dodgerblue"), lty = 1)
cat("Maximum lm() time:", max(lm_time), "\n")
cat("Maximum feols() time:", max(feols_time))
Maximum lm() time: 164.28
Maximum feols() time: 0.04
With the largest dataset, `lm()` needed nearly three minutes to estimate while `feols()` did not even take a tenth of a second. When datasets are small, this is not necessarily an issue, but hopefully you can see the advantage of using a faster function.
feols
Finally, let’s apply those standard error corrections. To do this, we are going to use `vcov = "hetero"` in our function call, as sketched below. The table of the new estimates follows:
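This is a minimal sketch rather than the exact code behind the table; the object and data names (`r_het`, `sim_df`) are placeholders:

```r
# "Normal" (iid) standard errors
r_iid <- feols(y ~ x, data = sim_df, vcov = "iid")

# Heteroskedasticity-robust standard errors
r_het <- feols(y ~ x, data = sim_df, vcov = "hetero")

# You can also swap standard errors after estimation:
summary(r_iid, vcov = "hetero")
```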
| | (a) | (b) | (c) | (d) | (e) |
|---|---|---|---|---|---|
| **No Correction** | | | | | |
| X | 0.946*** | 0.902*** | 0.981*** | 0.906*** | 0.997*** |
| | (0.047) | (0.069) | (0.046) | (0.082) | (0.041) |
| **Robust Standard Errors** | | | | | |
| X | 0.946*** | 0.902*** | 0.981*** | 0.906*** | 0.997*** |
| | (0.047) | (0.079) | (0.076) | (0.043) | (0.055) |
| Num.Obs. | 500 | 500 | 500 | 500 | 500 |
| R2 | 0.454 | 0.258 | 0.483 | 0.196 | 0.548 |
In some cases, the standard errors increased. In other cases, the standard errors decreased. So what? When the corrected standard errors turn out larger, it means the uncorrected ones were too small, so we would have been rejecting the null too often and making our relationship seem stronger than it really is. The same is true in reverse, of course. Sometimes, students get caught up in wanting significant results, but this is not the be-all, end-all. It is more important that you come to the correct conclusion, even if that conclusion is inconclusive, rather than an incorrect one.
Dates in R
As a final note for this course, we will discuss an important package that does not take much time to learn. Oftentimes, the data we are working with has some time dimension (e.g., yearly, monthly, quarterly, daily), so learning how to work with dates is helpful. The `lubridate` package in R is outfitted for this exact purpose, so we are going to learn our way around some of its functions.
Let’s go through some of the important functions / capabilities in `lubridate`.
Note: There is a great cheatsheet available online.
Parsing Dates in R
`ymd()`, `dmy()`, `mdy()`: when you read data into R and one of the columns has entries like `"January 2 2014"`, we can use these functions to help us out. Note that `y` stands for year, `d` stands for day, and `m` stands for month, so just choose the function that matches the order in your data.
Output from these functions appears in year-month-day format, but is actually numeric under the hood. We can see this by executing `as.numeric(mdy("January 2 2014"))`: 16072.
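A quick sketch of these parsers in action (the date strings below are made-up examples):

```r
library("lubridate")

mdy("January 2 2014")   # "2014-01-02"
dmy("2 January 2014")   # "2014-01-02"
ymd("2014-01-02")       # "2014-01-02"

# Dates are numeric under the hood: days since 1970-01-01
as.numeric(mdy("January 2 2014"))   # 16072
```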
Manipulating Dates in R
Once your data is converted to well-behaved date objects, we can begin to manipulate them. We can extract the year, month, day, etc. using lubridate’s helpfully named `year()`, `month()`, and `day()` functions. We can also grab the weekday by using `wday(df$date, label = TRUE)`.
Next, we can use `floor_date(df$date, "quarter")` to “round” our dates down to the start of the quarter, month, year, whatever.
Finally, we can calculate differences in dates by using simple subtraction. However, oftentimes we want to know the number of, say, months between two dates. For this, we can use the following code: `interval(date1, date2) %/% months(1)`.
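Putting these pieces together, a sketch might look like the following (the two dates are chosen arbitrarily):

```r
library("lubridate")

d1 <- ymd("2014-01-02")
d2 <- ymd("2016-07-15")

year(d1); month(d1); day(d1)     # 2014, 1, 2
wday(d1, label = TRUE)           # day of the week (Thu)
floor_date(d2, "quarter")        # "2016-07-01"

d2 - d1                          # difference in days
interval(d1, d2) %/% months(1)   # number of whole months between the dates
```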
It’s also often the case that quantifiable information is recorded as textual data. There is a robust set of tools for handling text in R, but these are outside the scope of this course, and you will not be tested on them. However, I wanted to include some information just in case it’s needed for a project or perhaps some students are just interested. See below.
Note: The rest of these slides are purely bonus material.
Date data is a very particular type of text data. However, working with text more generally is equally important when programming. We will not use any new packages for this, as base R has plenty of functions for us.
The first function we’ll take a look at is `paste()`, which has a sibling function `paste0()`.
`paste()` takes whatever vectors you give it and smushes them together. `paste()` will recycle the shorter vector(s) until they match the length of the longest vector. `paste()` will separate text with spaces by default, but this can be changed via the `sep` argument. `paste0()` defaults to `sep = ""`, or nothing in between text. Some examples are below:
paste("alex", "cardazzi"); cat("\n")
paste("alex", c("cardazzi", "trebek", "hamilton")); cat("\n")
paste(c("alexander", "alex"), c("cardazzi", "trebek", "hamilton")); cat("\n")
paste(c("alex"), "middle-name", c("cardazzi", "trebek")); cat("\n")
paste("alex", c("cardazzi", "trebek", "hamilton"), sep = "-"); cat("\n")
paste0("alex", c("cardazzi", "trebek", "hamilton")); cat("\n")
paste(c("alex", "christine"), c("cardazzi", "strong"), sep = " abcxyz ")
[1] "alex cardazzi"
[1] "alex cardazzi" "alex trebek" "alex hamilton"
[1] "alexander cardazzi" "alex trebek" "alexander hamilton"
[1] "alex middle-name cardazzi" "alex middle-name trebek"
[1] "alex-cardazzi" "alex-trebek" "alex-hamilton"
[1] "alexcardazzi" "alextrebek" "alexhamilton"
[1] "alex abcxyz cardazzi" "christine abcxyz strong"
Next, rather than combining character strings, we can break them up. There are a few ways to do this, but let’s work with `substr()` first. `substr()` accepts three arguments: `x`, `start`, and `stop`. Effectively, you tell R the “full” character vector and then the start and stop positions of the characters you want. Some examples are below:
char <- c("julius randle", "jalen brunson")
substr(char, 1, 3)
# use `regexpr()` to give you the position of
# a certain sub-string in another string.
substr(char, 1, regexpr(" ", char)); cat("\n")
# use -1 to *not* get the substring you're looking for
substr(char, 1, regexpr(" ", char) - 1); cat("\n")
# when using regexpr(), if the substring does not appear,
# it will return -1, and nothing will be returned.
substr(char, 1, regexpr("b", char)); cat("\n")
# use nchar() to get the length of the string.
substr(char, regexpr(" ", char), nchar(char)); cat("\n")
# here, use +1 to avoid the space
substr(char, regexpr(" ", char) + 1, nchar(char))
[1] "jul" "jal"
[1] "julius " "jalen "
[1] "julius" "jalen"
[1] "" "jalen b"
[1] " randle" " brunson"
[1] "randle" "brunson"
To check if a certain substring exists in another string, we can use `grepl()`. This accepts two arguments: a single substring and a vector of other strings. Examples follow:
[1] TRUE TRUE FALSE
[1] TRUE TRUE TRUE
Similarly, we can use `gsub()` to find and replace things.
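Here is a small sketch of both functions (the strings are my own examples, not the ones behind the output above):

```r
namez <- c("alex cardazzi", "alex trebek", "christine strong")

# Does each string contain the substring "alex"?
grepl("alex", namez)    # TRUE TRUE FALSE

# Find and replace: swap every space for an underscore
gsub(" ", "_", namez)   # "alex_cardazzi" "alex_trebek" "christine_strong"
```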
Working with text patterns can get a bit crazy. There’s a whole mini-language for this that we won’t fully dive into, but I want to expose you to some of it. Here are some important ones:
- `\\d`: all digits 0-9. This is very helpful if you want to quickly find and replace all numeric values.
- `\\D`: the opposite – all non-digit characters.
- `[[:alpha:]]`: all alphabetic characters.
- `\\s`: space, tab, new line, etc. This is helpful when there’s a bunch of spaces you want to delete.
- `[[:punct:]]`: all punctuation characters.
- `.`: any character.
- `+`: match the previous character at least once. This is helpful when there are many, for example, spaces in a row. You can use `gsub("\\s+", "", txt)`.
- `^`: indicates the beginning of a string.
- `$`: indicates the end of a string.

Below are some examples of how to use regular expressions:
namez <- c("margot elise robbie",
"samuel l jackson",
"jennifer lawrence")
# find space-something-space and replace with nothing.
gsub("\\s.+\\s", "", namez); cat("\n")
# find letters-space and replace with nothing
gsub("[[:alpha:]]+\\s", "", namez); cat("\n")
# find start_of_string-letters-space and replace with nothing
gsub("^[[:alpha:]]+\\s", "", namez); cat("\n")
# find space-letters-end_of_string and replace with nothing
gsub("\\s[[:alpha:]]+$", "", namez)
[1] "margotrobbie" "samueljackson" "jennifer lawrence"
[1] "robbie" "jackson" "lawrence"
[1] "elise robbie" "l jackson" "lawrence"
[1] "margot elise" "samuel l" "jennifer"
Regular expressions are very difficult, but very convenient once you get the hang of them. Just like `lubridate`, there’s a fantastic cheatsheet online you should check out.
Let’s begin by practicing on the Knicks roster (data, source):
X No. Player Pos Ht Wt Birth.Date Var.7 Exp College bbrefID
1 1 51 Ryan Arcidiacono PG 6-3 195 March 26, 1994 us 5 Villanova /players/a/arcidry01.html
2 2 9 RJ Barrett SG 6-6 214 June 14, 2000 ca 3 Duke /players/b/barrerj01.html
3 3 11 Jalen Brunson PG 6-2 190 August 31, 1996 us 4 Villanova /players/b/brunsja01.html
4 4 13 Evan Fournier SG 6-7 205 October 29, 1992 fr 10 /players/f/fournev01.html
5 5 6 Quentin Grimes SG 6-5 205 May 8, 2000 us 1 Kansas, Houston /players/g/grimequ01.html
6 6 3 Josh Hart SF 6-5 215 March 6, 1995 us 5 Villanova /players/h/hartjo01.html
7 7 55 Isaiah Hartenstein C 7-0 250 May 5, 1998 us 4 /players/h/harteis01.html
8 8 8 DaQuan Jeffries SG 6-5 230 August 30, 1997 us 3 Oral Roberts, Tulsa /players/j/jeffrda01.html
9 9 0, 3 Trevor Keels SG 6-5 221 August 26, 2003 us R Duke /players/k/keelstr01.html
10 10 2 Miles McBride PG 6-2 200 September 8, 2000 us 1 West Virginia /players/m/mcbrimi01.html
11 11 17 Svi Mykhailiuk SF 6-7 205 June 10, 1997 ua 4 Kansas /players/m/mykhasv01.html
12 12 5 Immanuel Quickley SG 6-3 190 June 17, 1999 us 2 Kentucky /players/q/quickim01.html
13 13 30 Julius Randle PF 6-8 250 November 29, 1994 us 8 Kentucky /players/r/randlju01.html
14 14 0 Cam Reddish SF 6-8 218 September 1, 1999 us 3 Duke /players/r/reddica01.html
15 15 23 Mitchell Robinson C 7-0 240 April 1, 1998 us 4 Western Kentucky /players/r/robinmi01.html
16 16 4 Derrick Rose PG 6-3 200 October 4, 1988 us 13 Memphis /players/r/rosede01.html
17 17 45 Jericho Sims C 6-10 245 October 20, 1998 us 1 Texas /players/s/simsje01.html
18 18 1 Obi Toppin PF 6-9 220 March 4, 1998 us 2 Dayton /players/t/toppiob01.html
Let’s calculate the following items:
First, let’s examine which players are guards. To do this, we are going to use `grepl()` to search for `"G"` in the `Pos` column. If we find it, we are going to give this player a `1`. If we don’t, we are going to give them a `0`.
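The original code chunk is not shown here, but a sketch of this step might look like the following (assuming the roster is stored in a data frame called `df` with the columns printed above):

```r
# 1 if the position string contains "G" (PG, SG), 0 otherwise
df$guard <- ifelse(grepl("G", df$Pos), 1, 0)

head(df[, c("Player", "Pos", "guard")])
cat("Percent of the roster that are guards:",
    round(100 * mean(df$guard)), "%")
```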
Player Pos guard
1 Ryan Arcidiacono PG 1
2 RJ Barrett SG 1
3 Jalen Brunson PG 1
4 Evan Fournier SG 1
5 Quentin Grimes SG 1
6 Josh Hart SF 0
Percent of the roster that are guards: 56 %
Next, let’s calculate the team’s average experience in the league. However, as you might have noticed, rookie players (meaning players who have never played in the league before) have an “R” for their experience. If we calculate an average of this vector, R will return `NA` because of this. If we simply drop the “R”/`NA` values, this will overstate the average experience since these rookies should have values equal to zero. Let’s replace “R” with 0.
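A sketch of this step (again assuming the roster data frame is called `df`):

```r
# Replace "R" (rookies) with 0, then convert experience to numeric
df$Exp <- ifelse(df$Exp == "R", "0", df$Exp)
df$Exp <- as.numeric(df$Exp)

mean(df$Exp)   # team's average experience in years
```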
Next, we want to calculate player heights. To do this, we are going to grab the first part of their height (feet), multiply by 12 to get inches, and then add in the second part of their height (inches).
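One way to sketch this with `substr()` and `regexpr()` (the new column names mirror the output below, but the exact code is my own):

```r
# Split heights like "6-3" into feet and inches around the dash
dash <- regexpr("-", df$Ht)
df$foot <- as.numeric(substr(df$Ht, 1, dash - 1))
df$inch <- as.numeric(substr(df$Ht, dash + 1, nchar(df$Ht)))
df$Ht_inch <- df$foot * 12 + df$inch

head(df[, c("Player", "Ht", "foot", "inch", "Ht_inch")])
```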
Player Ht foot inch Ht_inch
1 Ryan Arcidiacono 6-3 6 3 75
2 RJ Barrett 6-6 6 6 78
3 Jalen Brunson 6-2 6 2 74
4 Evan Fournier 6-7 6 7 79
5 Quentin Grimes 6-5 6 5 77
6 Josh Hart 6-5 6 5 77
Now we are going to calculate a player’s age in days at the start of the season. To do this, we have to convert their birthday text into a proper date object. Then, we subtract the birthdays from the specific date, which returns the difference in days. We can also take the difference in months (or years, quarters, etc.), which is demonstrated below.
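A sketch of this calculation (the reference date is an assumption on my part; here I use the approximate start of the 2022-23 season):

```r
library("lubridate")

df$bday <- mdy(df$Birth.Date)        # e.g., "March 26, 1994" -> 1994-03-26
season_start <- ymd("2022-10-18")    # assumed season start date

df$age <- as.numeric(season_start - df$bday)                 # age in days
df$age_m <- interval(df$bday, season_start) %/% months(1)    # age in months

head(df[, c("Player", "Birth.Date", "bday", "age", "age_m")])
```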
Player Birth.Date bday age age_m
1 Ryan Arcidiacono March 26, 1994 1994-03-26 10434 342
2 RJ Barrett June 14, 2000 2000-06-14 8162 268
3 Jalen Brunson August 31, 1996 1996-08-31 9545 313
4 Evan Fournier October 29, 1992 1992-10-29 10947 359
5 Quentin Grimes May 8, 2000 2000-05-08 8199 269
6 Josh Hart March 6, 1995 1995-03-06 10089 331
Finally, we are going to calculate each player’s difference in age to Jalen Brunson. First, we are going to calculate JB’s age, and then subtract his age from everyone else’s age. To finish, we are going to sort the data by this difference.
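A sketch, building on the `bday` column created above:

```r
jb_bday <- df$bday[df$Player == "Jalen Brunson"]

# Absolute difference in age (in days) relative to Jalen Brunson
df$jb_age_diff <- abs(as.numeric(df$bday - jb_bday))

# Sort by the difference and show the closest players
head(df[order(df$jb_age_diff), c("Player", "jb_age_diff")])
```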
Player jb_age_diff
3 Jalen Brunson 0
11 Svi Mykhailiuk 283
8 DaQuan Jeffries 364
6 Josh Hart 544
18 Obi Toppin 550
15 Mitchell Robinson 578
Next, we are going to explore some crime data. The Minneapolis police record information on stops they conduct such as location and date/time (data, documentation). Let’s generate some numeric data from both of these columns.
First, let’s read in the data and view it.
id date problem race gender location precinct
1 17-036337 2017-02-01T00:00:12Z traffic White Male 44.95134737 -93.28133076 5
2 17-036349 2017-02-01T00:07:38Z suspicious Black Female 44.9474742 -93.29829195 5
3 17-036351 2017-02-01T00:11:39Z traffic Other Male 44.89233 -93.28067 5
4 17-036357 2017-02-01T00:18:50Z suspicious Latino Male 45.01497 -93.24734 2
5 17-036360 2017-02-01T00:22:41Z traffic Black Male 45.00951934 -93.28989378 4
6 17-036378 2017-02-01T00:39:23Z suspicious Unknown Unknown 44.94047824 -93.26763786 3
7 17-036380 2017-02-01T00:39:53Z suspicious Black Male 44.9784072 -93.27881908 1
8 17-036381 2017-02-01T00:40:11Z traffic White Female 44.97429462 -93.27996035 1
9 17-036388 2017-02-01T00:46:54Z traffic Black Female 44.94659 -93.2797 5
10 17-036390 2017-02-01T00:48:37Z traffic White Male 44.97526008 -93.26992525 1
Next, let’s convert the datetime text into something usable with `lubridate`. Since there’s time information in this, we need to use the still-conveniently named `ymd_hms()` function. Below we’ll generate some distribution plots of the weekday, hour, and minute of these police stops.
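A sketch of the conversion and the extracted components (the plots here are simplified relative to the actual figures):

```r
library("lubridate")

df$dt <- ymd_hms(df$date)   # e.g., "2017-02-01T00:00:12Z" -> a date-time object

# Distributions of when stops occur
barplot(table(wday(df$dt, label = TRUE)), main = "Stops by Weekday")
barplot(table(hour(df$dt)), main = "Stops by Hour")
barplot(table(minute(df$dt)), main = "Stops by Minute")
```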
Finally, we can extract latitude and longitude from the `location` column, and plot the resulting points as a map.
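A sketch of the extraction step (I assume the coordinates sit in a single text column named `location`, as in the printed data; the actual column name may differ):

```r
# Split "44.95134737 -93.28133076" into latitude and longitude
spc <- regexpr(" ", df$location)
df$lat <- as.numeric(substr(df$location, 1, spc - 1))
df$lon <- as.numeric(substr(df$location, spc + 1, nchar(df$location)))

# A bare-bones "map": longitude on the x-axis, latitude on the y-axis
plot(df$lon, df$lat, pch = 19, cex = 0.3,
     xlab = "Longitude", ylab = "Latitude", las = 1)
```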
There’s a whole subset of data that we have not talked about yet that has to do with spatial features. Here, I am going to plot Minneapolis neighborhoods with the stops data on top. There are tons of things you can do with this, but it is outside the scope of this course.
ECON 311: Economics, Causality, and Analytics