Special Topics in R

Module 5.1: Special Topics

Author: Alex Cardazzi

Affiliation: Old Dominion University

All materials can be found at alexcardazzi.github.io.

Housing Price Models

Measurement Error

In each of the models we’ve estimated thus far, we’ve included an error term. Error terms are, effectively, all the other things we cannot observe or account for that still affect the outcome, as well as sheer randomness.

Sometimes in social science, business, etc., we are unable to perfectly measure our variables, a problem we call measurement error. Measurement error can arise for many different reasons, such as sampling error, numbers being rounded or truncated, misreporting, etc.

What are the effects of measurement error?

Measurement Error in X

Measurement error in explanatory variables biases coefficients towards zero. Note the difference between this and omitted variable bias. Omitted variable bias pushes coefficients in a specific direction (positive or negative). If the omitted variable bias is negative, true negative coefficients become more negative and true positive coefficients become less positive.

True positive coefficients can even become negative altogether. See the bedrooms and square footage example from Module 4.

However, with measurement error in explanatory variables, coefficients get squished down towards zero rather than pushed up or down.

Measurement Error in Y

Random measurement error in \(Y\) gets absorbed into the error term. In reality, this just makes inference more difficult (by increasing the standard errors of the coefficient estimates).

As a broad, descriptive summary, measurement error in \(X\) variables will bias coefficients towards zero. Measurement error in \(Y\) does not bias your coefficients, but does inflate their standard errors. In the following, I will demonstrate some of the effects of measurement error using some simulated data.

Effects of Measurement Error

In this section of notes, I have simulated two datasets of 50 observations each. Both datasets are simulated exactly the same way, except at different times so the random numbers will be different. The data is generated by:

\[Y_i = X_i + e_i\]

To generate measurement error in \(X\), I create a new variable called \(\widetilde{X}\), which is simply \(X\) plus some random noise that progressively gets larger. To generate measurement error in \(Y\), I slowly increase the magnitude of the error term \(e\) in the equation above. The figures below are gifs that demonstrate what happens as measurement error in either \(X\) or \(Y\) increases.
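If you would like to tinker with this yourself before looking at the figures, here is a rough sketch of how data like this could be simulated. The exact simulation code behind the figures is not shown in these notes, so the seed, the amount of noise, and the variable names (x_tilde, noise_sd) are my own choices for illustration.

Code
set.seed(311) # for reproducibility
n <- 50
x <- rnorm(n)
y <- x + rnorm(n)                     # true model: Y = X + e

# Measurement error in X: we observe a noisy version of X.
noise_sd <- 1                         # crank this up to add more measurement error
x_tilde <- x + rnorm(n, sd = noise_sd)
coef(lm(y ~ x))["x"]                  # slope near 1
coef(lm(y ~ x_tilde))["x_tilde"]      # slope attenuated towards 0

# Measurement error in Y: inflate the error term instead.
y_noisy <- x + rnorm(n, sd = 1 + noise_sd)
summary(lm(y_noisy ~ x))$coefficients # slope roughly unbiased, standard error inflated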

Example of Measurement Error

Measurement Error in X, Simulation #1

Measurement Error in Y, Simulation #1

In the panel on the left, you’ll see that the points spread out horizontally but do not change their vertical positions. Notice how the regression line, depicted in solid black, seems to quickly flatten out as the measurement error increases. This is what I mean by saying measurement error in \(X\) pushes the slope coefficient towards zero.

In the panel on the right, the opposite movement is occurring for the points. Here, they are spreading out vertically, but not changing their position horizontally. In this example, the regression line becomes more vertical over time (i.e., the slope coefficient is pushed away from zero), but this is due to the particularities of this specific simulation. With measurement error in \(Y\), the slope coefficient is as likely to increase as it is to decrease.

See below for the second set of simulations with similar results.

Example of Measurement Error

Measurement Error in X, Simulation #2

Measurement Error in Y, Simulation #2

The following figure tracks both the coefficient estimates and standard errors for a similar simulation as measurement error increases. Again, the slope coefficient plummets towards zero as measurement error in \(X\) increases, whereas the effect of measurement error in \(Y\) is much less stark. However, the opposite is true for the standard error estimates. As measurement error in \(X\) increases, the coefficient heads towards zero and the standard error also shrinks, which (misleadingly) suggests increased precision around a biased estimate. On the other hand, while the coefficient estimate is less affected by measurement error in \(Y\), the standard error steadily climbs.

Plot

Generally speaking, you will be unable to determine exactly how much measurement error you are dealing with and whether it is in \(X\), \(Y\), or both. Typically, people will write things like, “this estimate is a lower bound, since measurement error is pushing the coefficient towards zero.” Or, “the estimate is insignificant, but this may be due to measurement error in the outcome variable.” Do these arguments work? Sometimes. Like most things, it really just depends.

Another point worth making is that these simulations only consider random measurement error rather than non-random error. Random error means measurements are equally likely to be too high as too low. Measurement error is considered non-random when the measurements are consistently, or more likely to be, inflated or deflated. Take this article from The Economist as an example. Some clever economists got their hands on satellite data that measured how bright a given country or city was at night. Their hypothesis was that areas with more economic activity should have more ambient light at night compared to less developed areas. As a simple example, see this image of the United States at night. Anyway, these economists used luminosity to predict GDP, and found that autocratic countries consistently reported higher levels of GDP than their luminosities would suggest. This was not the case for democratic countries. The point here is that this represents non-random measurement error in reported GDP.

Non-random measurement error can be thought of as a form of omitted variable bias. In this example about GDP and luminosity, we can also observe levels of freedom within a country, which can therefore be used as a control. However, if low GDP countries inflated their reported GDPs but high GDP countries did not, then using this variable in a model would not be particularly helpful.
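To see why non-random measurement error acts like omitted variable bias, here is a small, made-up simulation in the spirit of the GDP example. The autocracy indicator, the amount of over-reporting, and the link between darkness and autocracy are all hypothetical numbers chosen purely for illustration.

Code
set.seed(311)
n <- 500
luminosity <- rnorm(n, mean = 10, sd = 2)
# hypothetical: darker (poorer) places are more likely to be autocratic
autocrat <- rbinom(n, 1, plogis(10 - luminosity))
true_gdp <- 2 * luminosity + rnorm(n)
reported_gdp <- true_gdp + 3 * autocrat    # autocracies inflate reported GDP

coef(lm(true_gdp ~ luminosity))["luminosity"]                 # close to the true value of 2
coef(lm(reported_gdp ~ luminosity))["luminosity"]             # biased, just like omitting a variable
coef(lm(reported_gdp ~ luminosity + autocrat))["luminosity"]  # controlling for autocracy helps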

Standard Errors

When we’re first learning econometrics, much is made of estimating the \(\beta\)’s. Don’t get it wrong: this is incredibly important. However, as you learn more econometrics, you begin to realize that standard errors are just as important as, if not more important than, the coefficients.

Ultimately, we are using econometrics to build models and test hypotheses. We can build the best model in the world, but if our standard errors are wrong, we’ll draw false conclusions.

We briefly touched on the word heteroskedasticity in previous modules. However, we did not really discuss its implications. In order to use “normal” standard errors, we assume that the errors in the model are homoskedastic, meaning their variance is constant. This assumption, together with the CLT, is what justifies the usual standard error formulas.

Normal standard errors are called analytical or iid (independent and identically distributed) standard errors.

So, what happens if we have heteroskedasticity, meaning errors whose variance is not constant, in violation of this important assumption?

Let’s explore once again with some simulations. We are going to simulate data from the following equation:

\[Y_i = 10 + X_i + e_i\]

Each time we simulate data, however, we are going to force different distributions onto \(e_i\). Below, I plot the simulated \(X\) and \(Y\) data with different error distributions in addition to the estimated regression line.

Plot

I will explain what you are looking at after the next set of images, where we visualize the relationship between \(X\) and each regression’s residuals.

Plot

Explanation of Figures
  1. Figure (a) depicts a homoskedastic error process. In plain language, this means that the distribution of the errors is normal and does not change as the value of \(X\) changes. Essentially, there is no pattern whatsoever in the errors. If this is still confusing, don’t worry. Homoskedasticity is one of those things that only starts to make sense once you see examples of things that are not homoskedastic. Each of the following figures contains heteroskedastic errors.
  2. Figure (b) contains a “fan” shape. Essentially, the variance of the errors is increasing as \(X\) increases.
  3. Figure (c) has an hourglass shape, meaning that the variance is related to the magnitude of \(X\).
  4. Figure (d) also depicts errors whose variance is related to the magnitude of \(X\).
  5. Figure (e) has the strangest distribution of errors. Honestly, I was just being creative.
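The exact error processes behind these figures are not shown in the notes, but as a rough sketch, a homoskedastic process like (a) and a “fan” shaped process like (b) could be generated along these lines:

Code
set.seed(311)
n <- 500
x <- runif(n, 0, 10)

e_homo <- rnorm(n, sd = 2)       # homoskedastic: constant variance
e_fan  <- rnorm(n, sd = 0.5*x)   # "fan" shaped: variance grows with x

y_homo <- 10 + x + e_homo
y_fan  <- 10 + x + e_fan

par(mfrow = c(1, 2))
plot(x, resid(lm(y_homo ~ x)), pch = 19, las = 1,
     main = "Homoskedastic", xlab = "X", ylab = "Residuals")
plot(x, resid(lm(y_fan ~ x)), pch = 19, las = 1,
     main = "Fan Shaped", xlab = "X", ylab = "Residuals")
par(mfrow = c(1, 1))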

Now, what do the parameter estimates and standard errors look like from these regressions? Remember, the slope should be about equal to \(1\).

Table of Regression Estimates and Standard Errors
Effect of Heteroskedasticity
              (a)        (b)        (c)        (d)        (e)
X             0.946***   0.902***   0.981***   0.906***   0.997***
              (0.047)    (0.069)    (0.046)    (0.082)    (0.041)
Num.Obs.      500        500        500        500        500
R2            0.454      0.258      0.483      0.196      0.548

Column (a) in this table is the homoskedastic version, and should be thought of as a baseline. Since the errors differ across columns, the estimates will also be a bit different, so I want you to focus on the standard errors. Heteroskedasticity can artificially inflate or deflate standard errors, and it is not always clear in which direction. In general, the coefficients are all close to \(1\), but some of the standard errors are larger and some are smaller than the homoskedastic one.

To address the issue of heteroskedasticity, we can use heteroskedasticity-robust standard errors. Essentially, without getting too deep into the math, we are going to correct for the non-constant variance of the errors. This will give us better (more accurate) standard errors for testing our hypotheses.
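For the curious, the correction is built on a “sandwich” formula that weights each observation by its squared residual. Below is a minimal by-hand sketch of the idea using simulated data of my own (this is the simplest, HC0, version; the correction we apply shortly also includes a small-sample adjustment). In practice, we will let the package introduced next do this work for us.

Code
set.seed(311)
n <- 500
x <- runif(n, 0, 10)
y <- 10 + x + rnorm(n, sd = 0.5*x)   # heteroskedastic errors

m <- lm(y ~ x)
X <- model.matrix(m)                 # design matrix: intercept and x
u <- resid(m)

bread <- solve(t(X) %*% X)           # (X'X)^(-1)
meat  <- t(X) %*% diag(u^2) %*% X    # X' diag(u^2) X
vcov_robust <- bread %*% meat %*% bread

sqrt(diag(vcov(m)))                  # "normal" (iid) standard errors
sqrt(diag(vcov_robust))              # heteroskedasticity-robust standard errors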

To apply this correction, we are going to use an R package called fixest.

At this point, the main benefits of this package are twofold. First, fixed effects are estimated extremely quickly. We’ll do some “speed tests” in just a bit. Second, we can apply different standard error corrections to our models fairly easily.

feols

fixest comes with a few different functions, but we will mostly be interested in feols(), which stands for “fixed effects ordinary least squares”. The important arguments to this function are:

  • data: the data for the regressions.
  • subset: a vector that tells R which observations out of data to use.
  • split/fsplit: sometimes we want to estimate regressions on different subsamples of data. For example, in an earlier module we estimated the effect of GPA on SAT scores by gender. split will automatically estimate the model on each subsample defined by the unique values of the vector passed to it. fsplit will do the same, except it will also estimate a model for the full sample.
  • vcov: this is how you will specify the variance-covariance matrix (or, in English, the standard errors). For our purposes, “normal” standard errors are "iid" and heteroskedasticity-robust standard errors will be "hetero". Note: feols defaults to "cluster" when using fixed effects, so you will need to manually override this with one of the two options above.
  • fml: this is the formula to be estimated. For the purposes of this course, fml will look something like the following: c(y1, y2, y3) ~ x1 + x2 + x3*x4 | FE1 + FE2. feols can estimate models with multiple outcomes at once, and you would write this as: c(y1, y2, y3). x variables follow the squiggle ~, and fixed effects follow the vertical bar |.
An example of feols and fsplit
Code
library("fixest")
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")
df <- df[df$hs_gpa <= 4,]

r <- feols(sat_sum ~ hs_gpa, df, fsplit = ~sex)

# Note that r is now a "list" of three regressions.
regz <- list(`All Students` = r[[1]],
             `Female` = r[[2]],
             `Male` = r[[3]])
coefz <- c("hs_gpa" = "High School GPA",
           "(Intercept)" = "Constant")
gofz <- c("nobs", "r.squared")
modelsummary(regz,
             title = "Effect of GPA on SAT Scores",
             estimate = "{estimate}{stars}",
             coef_map = coefz,
             gof_map = gofz)
Effect of GPA on SAT Scores
                   All Students   Female       Male
High School GPA    11.538***      11.392***    13.715***
                   (0.753)        (0.999)      (1.079)
Constant           66.468***      70.196***    55.851***
                   (2.440)        (3.167)      (3.579)
Num.Obs.           999            515          484
R2                 0.191          0.202        0.251

This example comes directly from the beginning of Module 4.2

For speed comparisons between lm and feols, we are going to simulate data of varying sizes with a proportional number of fixed effects (e.g. 1,000 observations and 10 FEs; 10,000 obs and 100 FEs) and see how long each function takes.

Code
lm_time <- c()
feols_time <- c()
nz <- c(1000, 5000, 10000, 20000, 35000, 50000, 75000, 100000)
for(n in nz){
  
  # simulate n observations with n/100 fixed effect groups
  x <- rnorm(n)
  fe <- sample(1:(n/100), n, TRUE)
  y <- x + rnorm(n)
  df <- data.frame(y = y, x = x, fe = fe)
  # system.time()[3] is the elapsed ("wall clock") time in seconds
  lm_time <- c(lm_time, system.time(lm(y ~ x + as.factor(fe), data = df))[3])
  feols_time <- c(feols_time, system.time(feols(y ~ x | fe, df))[3])
}

plot(nz, lm_time, pch = 19, col = "tomato", type = "b",
     xlab = "Sample Size", ylab = "Time in Seconds", las = 1)
lines(nz, feols_time, pch = 19, col = "dodgerblue", type = "b")
legend("topleft", legend = c("lm()", "feols()"), bty = "n",
       col = c("tomato", "dodgerblue"), lty = 1)
cat("Maximum lm() time:", max(lm_time), "\n")
cat("Maximum feols() time:", max(feols_time))
Output
Maximum lm() time: 129.11 
Maximum feols() time: 0.05
Plot

With the largest dataset, lm() needed over two minutes to estimate while feols() did not even take a tenth of a second. When datasets are small, this is not necessarily an issue, but hopefully you can see the advantage of using a faster function.

Finally, let’s apply those standard error corrections. To do this, we are going to use vcov = "hetero" in our function call. The table of the new estimates can be found below:
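The code that generated the simulated heteroskedasticity data is not shown in these notes, but as a sketch of what the call looks like, here is the GPA regression from earlier estimated with and without the correction (re-reading the data, since df was overwritten in the speed test above):

Code
df <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/satgpa.csv")
df <- df[df$hs_gpa <= 4,]

r_iid    <- feols(sat_sum ~ hs_gpa, df, vcov = "iid")     # "normal" standard errors
r_robust <- feols(sat_sum ~ hs_gpa, df, vcov = "hetero")  # robust standard errors

modelsummary(list(`IID SEs` = r_iid, `Robust SEs` = r_robust),
             estimate = "{estimate}{stars}",
             gof_map = c("nobs", "r.squared"))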

Regression Table
Effect of Heteroskedasticity
                         (a)        (b)        (c)        (d)        (e)
No Correction
X                        0.946***   0.902***   0.981***   0.906***   0.997***
                         (0.047)    (0.069)    (0.046)    (0.082)    (0.041)
Robust Standard Errors
X                        0.946***   0.902***   0.981***   0.906***   0.997***
                         (0.047)    (0.079)    (0.076)    (0.043)    (0.055)
Num.Obs.                 500        500        500        500        500
R2                       0.454      0.258      0.483      0.196      0.548

In some cases, the standard errors increased. In other cases, the standard errors decreased. So what? When the corrected standard errors are larger, the uncorrected ones were too small, which means we would have been rejecting the null too often and making relationships look stronger than they really are. The reverse holds when the corrected standard errors are smaller. Sometimes, students get caught up in wanting significant results, but this is not the be-all and end-all. It is more important that you come to the correct conclusion, even if that conclusion is inconclusive, rather than an incorrect one.

Dates in R

As a final note for this course, we will discuss an important package that does not take much time to learn. Oftentimes, the data we are working with has some time dimension to it (e.g., yearly, monthly, quarterly, daily), so learning how to work with dates is helpful. The lubridate package in R is outfitted for this exact purpose, so we are going to learn our way around some of its functions.

Yeah, the name is hilarious. Mount Rushmore of package names.

lubridate

Let’s go through some of the important functions / capabilities in lubridate.

Note: There is a great cheatsheet available online.

ymd(), dmy(), mdy(): when you read data into R and one of the columns has entries like "January 2 2014", we can use these functions to help us out. Note that y stands for year, d stands for day, and m stands for month, so just choose the function that makes sense for your situation.

Output from these functions appears in year-month-day format, but is actually numeric under the hood. We can see this by executing as.numeric(mdy("January 2 2014")): 16072.

Once your data is converted to well-behaved date objects, we can begin to manipulate them. We can extract the year, month, day, etc. using lubridate’s helpfully named year(), month(), and day() functions. We can also grab the weekday by using wday(df$date, label = TRUE).

Next, we can use floor_date(df$date, "quarter") to “round” our dates down to the nearest quarter, month, year, whatever.

Finally, we can calculate differences in dates by using simple subtraction. However, oftentimes we want to know the number of, say, months between two dates. For this, we can use the following code: interval(date1, date2) %/% months(1).
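Here is a short, self-contained example of these functions in action (the dates themselves are arbitrary):

Code
library("lubridate")

d1 <- mdy("January 2 2014")
d2 <- ymd("2022-10-19")

year(d2); month(d2); day(d2)     # 2022, 10, 19
wday(d2, label = TRUE)           # day of the week
floor_date(d2, "quarter")        # "2022-10-01"
d2 - d1                          # difference in days
interval(d1, d2) %/% months(1)   # number of whole months between the dates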

Text as Data

It’s also often the case that quantifiable information is recorded as textual data. There is a robust set of tools for handling text in R, but these are outside the scope of this course, and you will not be tested on them. However, I wanted to include some information just in case it’s needed for a project or perhaps some students are just interested. See below.

Text as Data

Date data is a very particular type of text data. However, working with text more generally is equally important when programming. We will not use any new packages for this, as base R has plenty of functions for us.

The first function we’ll take a look at is paste(), which has a sibling function paste0().

paste() takes whatever vectors you give it and smushes them together. paste() will recycle the shorter vector(s) until they match the length of the longest vector. paste() will separate text with spaces by default, but this can be changed via the sep argument. paste0() defaults to sep = "", i.e., nothing in between the pieces of text. Some examples are below:

Code
paste("alex", "cardazzi"); cat("\n")
paste("alex", c("cardazzi", "trebek", "hamilton")); cat("\n")
paste(c("alexander", "alex"), c("cardazzi", "trebek", "hamilton")); cat("\n")
paste(c("alex"), "middle-name", c("cardazzi", "trebek")); cat("\n")
paste("alex", c("cardazzi", "trebek", "hamilton"), sep = "-"); cat("\n")
paste0("alex", c("cardazzi", "trebek", "hamilton")); cat("\n")
paste(c("alex", "christine"), c("cardazzi", "strong"), sep = " abcxyz ")
Output
[1] "alex cardazzi"

[1] "alex cardazzi" "alex trebek"   "alex hamilton"

[1] "alexander cardazzi" "alex trebek"        "alexander hamilton"

[1] "alex middle-name cardazzi" "alex middle-name trebek"  

[1] "alex-cardazzi" "alex-trebek"   "alex-hamilton"

[1] "alexcardazzi" "alextrebek"   "alexhamilton"

[1] "alex abcxyz cardazzi"    "christine abcxyz strong"

Next, rather than combining character strings, we can break them up. There are a few ways to do this, but let’s work with substr() first. substr() accepts three arguments: x, start, and stop. Effectively, you tell R the “full” character vector and then the start and stop positions of the characters you want. Some examples are below:

Code
char <- c("julius randle", "jalen brunson")
substr(char, 1, 3)
# use `regexpr()` to give you the position of
#   a certain sub-string in another string.
substr(char, 1, regexpr(" ", char)); cat("\n")
# use -1 to *not* get the substring you're looking for
substr(char, 1, regexpr(" ", char) - 1); cat("\n")
# when using regexpr(), if the substring does not appear,
#   it will return -1, and nothing will be returned.
substr(char, 1, regexpr("b", char)); cat("\n")
# use nchar() to get the length of the string.
substr(char, regexpr(" ", char), nchar(char)); cat("\n")
# here, use +1 to avoid the space
substr(char, regexpr(" ", char) + 1, nchar(char))
Output
[1] "jul" "jal"
[1] "julius " "jalen " 

[1] "julius" "jalen" 

[1] ""        "jalen b"

[1] " randle"  " brunson"

[1] "randle"  "brunson"

To check whether a certain substring (or pattern) exists in another string, we can use grepl(). This accepts two main arguments: the pattern to search for and a vector of strings to search in. Examples follow:

Code
str <- c("mitch robinson", "rj barrett", "patrick ewing")
grepl("b", str); cat("\n")
# You can also search for multiple things, like "b or r", by:
grepl("b|r", str)
Output
[1]  TRUE  TRUE FALSE

[1] TRUE TRUE TRUE

Similarly, we can use gsub() to find and replace things.

Code
str <- c("mitch robinson", "rj barrett", "patrick ewing")
gsub("b", "z", str); cat("\n")
# to delete things, replace it with ""
grepl("b|r", "", str)
Output
[1] "mitch rozinson" "rj zarrett"     "patrick ewing" 

[1] "mitch oinson" "j aett"       "patick ewing"

Regular Expressions

Working with text patterns can get a bit crazy. There’s a whole mini-language for this that we won’t fully dive into, but I want to expose you to some of it. Here are some important ones:

  • \\d: this is all digits 0-9. This is very helpful if you want to quickly find and replace all numeric values.
  • \\D: this is the opposite – all non-digit characters.
  • [[:alpha:]]: all alphabetic characters.
  • \\s: space, tab, new line, etc. This is helpful when there’s a bunch of spaces you want to delete.
  • [[:punct:]]: all punctuation characters.
  • .: this means any character.
  • +: means match the previous character at least once. This is helpful when there are many, for example, spaces in a row. You can use gsub("\\s+", "", txt).
  • ^: indicates the beginning of a string.
  • $: indicates the end of a string.

Below are some examples of how to use regular expressions:

Code
namez <- c("margot elise robbie",
           "samuel l jackson",
           "jennifer lawrence")
# find space-something-space and replace with nothing.
gsub("\\s.+\\s", "", namez); cat("\n")
# find letters-space and replace with nothing
gsub("[[:alpha:]]+\\s", "", namez); cat("\n")
# find start_of_string-letters-space and replace with nothing
gsub("^[[:alpha:]]+\\s", "", namez); cat("\n")
# find space-letters-end_of_string and replace with nothing
gsub("\\s[[:alpha:]]+$", "", namez)
Output
[1] "margotrobbie"      "samueljackson"     "jennifer lawrence"

[1] "robbie"   "jackson"  "lawrence"

[1] "elise robbie" "l jackson"    "lawrence"    

[1] "margot elise" "samuel l"     "jennifer"    

Regular expressions are very difficult, but very convenient once you get the hang of them. Just like lubridate, there’s a fantastic cheatsheet online you should check out.

Knicks Roster

Let’s begin by practicing on the Knicks roster (data, source):

Code
knicks <- read.csv("https://alexcardazzi.github.io/econ311/data/knicks23.csv")
knicks
Output
    X  No.             Player Pos   Ht  Wt        Birth.Date Var.7 Exp             College                   bbrefID
1   1   51   Ryan Arcidiacono  PG  6-3 195    March 26, 1994    us   5           Villanova /players/a/arcidry01.html
2   2    9         RJ Barrett  SG  6-6 214     June 14, 2000    ca   3                Duke /players/b/barrerj01.html
3   3   11      Jalen Brunson  PG  6-2 190   August 31, 1996    us   4           Villanova /players/b/brunsja01.html
4   4   13      Evan Fournier  SG  6-7 205  October 29, 1992    fr  10                     /players/f/fournev01.html
5   5    6     Quentin Grimes  SG  6-5 205       May 8, 2000    us   1     Kansas, Houston /players/g/grimequ01.html
6   6    3          Josh Hart  SF  6-5 215     March 6, 1995    us   5           Villanova  /players/h/hartjo01.html
7   7   55 Isaiah Hartenstein   C  7-0 250       May 5, 1998    us   4                     /players/h/harteis01.html
8   8    8    DaQuan Jeffries  SG  6-5 230   August 30, 1997    us   3 Oral Roberts, Tulsa /players/j/jeffrda01.html
9   9 0, 3       Trevor Keels  SG  6-5 221   August 26, 2003    us   R                Duke /players/k/keelstr01.html
10 10    2      Miles McBride  PG  6-2 200 September 8, 2000    us   1       West Virginia /players/m/mcbrimi01.html
11 11   17     Svi Mykhailiuk  SF  6-7 205     June 10, 1997    ua   4              Kansas /players/m/mykhasv01.html
12 12    5  Immanuel Quickley  SG  6-3 190     June 17, 1999    us   2            Kentucky /players/q/quickim01.html
13 13   30      Julius Randle  PF  6-8 250 November 29, 1994    us   8            Kentucky /players/r/randlju01.html
14 14    0        Cam Reddish  SF  6-8 218 September 1, 1999    us   3                Duke /players/r/reddica01.html
15 15   23  Mitchell Robinson   C  7-0 240     April 1, 1998    us   4    Western Kentucky /players/r/robinmi01.html
16 16    4       Derrick Rose  PG  6-3 200   October 4, 1988    us  13             Memphis  /players/r/rosede01.html
17 17   45       Jericho Sims   C 6-10 245  October 20, 1998    us   1               Texas  /players/s/simsje01.html
18 18    1         Obi Toppin  PF  6-9 220     March 4, 1998    us   2              Dayton /players/t/toppiob01.html

Let’s calculate the following items:

  • The fraction of players who are either a point guard (PG) or shooting guard (SG).
  • Each player’s experience in the league in years.
  • Each player’s height in inches.
  • Each player’s exact age in days on the day of the first game of the 2022-23 season (Oct. 19, 2022)
  • The player who is closest in age to Jalen Brunson, the indisputable best player on the team.

First, let’s examine which players are guards. To do this, we are going to use grepl() to search for "G" in the Pos column. If we find it, we are going to give this player a 1. If we don’t, we are going to give them a 0.

Code
knicks$guard <- ifelse(grepl("G", knicks$Pos), 1, 0)
head(knicks[,c("Player", "Pos", "guard")])
cat("Percent of the roster that are guards:",
    round(mean(knicks$guard), 2)*100, "%")
Output
            Player Pos guard
1 Ryan Arcidiacono  PG     1
2       RJ Barrett  SG     1
3    Jalen Brunson  PG     1
4    Evan Fournier  SG     1
5   Quentin Grimes  SG     1
6        Josh Hart  SF     0
Percent of the roster that are guards: 56 %

Next, let’s calculate the team’s average experience in the league. However, as you might have noticed, rookie players (meaning players who have never played in the league before) have an “R” for their experience. If we convert this column to numeric and calculate an average, R will return NA because those “R” entries become NA. If we simply drop the NA values, we will overstate the average experience, since these rookies should have values equal to zero. Let’s replace “R” with 0 instead.

Code
mean(as.numeric(knicks$Exp))
mean(as.numeric(knicks$Exp), na.rm = TRUE)
knicks$Exp <- gsub("R", 0, knicks$Exp)
mean(as.numeric(knicks$Exp))
Output
[1] NA
[1] 4.294118
[1] 4.055556

The next bullet wants us to calculate player heights. To do this, we are going to grab the first part of their height (feet), multiply by 12 to get inches, and then add in the second part of their height (inches).

Code
knicks$foot <- as.numeric(gsub("-.+", "", knicks$Ht))
knicks$inch <- as.numeric(gsub(".-", "", knicks$Ht))
knicks$Ht_inch <- (knicks$foot*12) + knicks$inch
head(knicks[,c("Player", "Ht", "foot", "inch", "Ht_inch")])
Output
            Player  Ht foot inch Ht_inch
1 Ryan Arcidiacono 6-3    6    3      75
2       RJ Barrett 6-6    6    6      78
3    Jalen Brunson 6-2    6    2      74
4    Evan Fournier 6-7    6    7      79
5   Quentin Grimes 6-5    6    5      77
6        Josh Hart 6-5    6    5      77

Now we are going to calculate a player’s age in days at the start of the season. To do this, we have to convert their birthday text into a proper date object. Then, we subtract the birthday vector from that specific date, which returns the difference in days. We can also take the difference in months (or years, quarters, etc.), which will be demonstrated below.

Code
library("lubridate")
knicks$bday <- mdy(knicks$Birth.Date)
knicks$age <- as.numeric(ymd("2022-10-19") - knicks$bday)
knicks$age_m <- interval(knicks$bday, ymd("2022-10-19")) %/% months(1)
head(knicks[,c("Player", "Birth.Date", "bday", "age", "age_m")])
Output
            Player       Birth.Date       bday   age age_m
1 Ryan Arcidiacono   March 26, 1994 1994-03-26 10434   342
2       RJ Barrett    June 14, 2000 2000-06-14  8162   268
3    Jalen Brunson  August 31, 1996 1996-08-31  9545   313
4    Evan Fournier October 29, 1992 1992-10-29 10947   359
5   Quentin Grimes      May 8, 2000 2000-05-08  8199   269
6        Josh Hart    March 6, 1995 1995-03-06 10089   331

Finally, we are going to calculate each player’s difference in age to Jalen Brunson. First, we are going to calculate JB’s age, and then subtract his age from everyone else’s age. To finish, we are going to sort the data by this difference.

Code
knicks2 <- knicks
jb_age <- knicks2$age[knicks2$Player == "Jalen Brunson"]
knicks2$jb_age_diff <- abs(knicks2$age - jb_age)
knicks2 <- knicks2[order(knicks2$jb_age_diff),]
head(knicks2[,c("Player", "jb_age_diff")])
Output
              Player jb_age_diff
3      Jalen Brunson           0
11    Svi Mykhailiuk         283
8    DaQuan Jeffries         364
6          Josh Hart         544
18        Obi Toppin         550
15 Mitchell Robinson         578

Police Stops

Next, we are going to explore some crime data. The Minneapolis police record information on stops they conduct such as location and date/time (data, documentation). Let’s generate some numeric data from both of these columns.

First, let’s read in the data and view it.

Code
pd <- read.csv("https://alexcardazzi.github.io/econ311/data/minn_stops.csv")
print(head(pd, 10))
Output
          id                 date    problem    race  gender                 location precinct
1  17-036337 2017-02-01T00:00:12Z    traffic   White    Male 44.95134737 -93.28133076        5
2  17-036349 2017-02-01T00:07:38Z suspicious   Black  Female  44.9474742 -93.29829195        5
3  17-036351 2017-02-01T00:11:39Z    traffic   Other    Male       44.89233 -93.28067        5
4  17-036357 2017-02-01T00:18:50Z suspicious  Latino    Male       45.01497 -93.24734        2
5  17-036360 2017-02-01T00:22:41Z    traffic   Black    Male 45.00951934 -93.28989378        4
6  17-036378 2017-02-01T00:39:23Z suspicious Unknown Unknown 44.94047824 -93.26763786        3
7  17-036380 2017-02-01T00:39:53Z suspicious   Black    Male  44.9784072 -93.27881908        1
8  17-036381 2017-02-01T00:40:11Z    traffic   White  Female 44.97429462 -93.27996035        1
9  17-036388 2017-02-01T00:46:54Z    traffic   Black  Female        44.94659 -93.2797        5
10 17-036390 2017-02-01T00:48:37Z    traffic   White    Male 44.97526008 -93.26992525        1

Next, let’s convert the datetime text into something usable with lubridate. Since there’s time information in this, we need to use the still-conveniently named ymd_hms function. Below we’ll generate some distribution plots of the weekday, hour, and minute of these police stops.

Code
pd$timestamp <- ymd_hms(pd$date)

par(mfrow = c(1, 3))
plot(table(wday(pd$timestamp, label = T)),
     xlab = "Weekday", ylab = "Frequency")
plot(table(hour(pd$timestamp)),
     xlab = "Hour", ylab = "")
plot(table(minute(pd$timestamp)),
     xlab = "Minute", ylab = "")
par(mfrow = c(1, 1))
Plot

Finally, we can extract latitude and longitude from the location column, and plot the resulting points as a map.

Code
pd$lon <- as.numeric(gsub(".+\\s", "", pd$location)) # keep everything after the space (longitude)
pd$lat <- as.numeric(gsub("\\s.+", "", pd$location)) # keep everything before the space (latitude)
plot(pd$lon, pd$lat, pch = 19, las = 1,
     col = scales::alpha("black", 0.1),
     xlab = "", ylab = "")
Plot

There’s a whole area of data analysis that we have not talked about yet that deals with the spatial features of data. Here, I am going to plot Minneapolis neighborhoods with the stops data on top. There are tons of things you can do with this, but it is outside the scope of this course.

Code
minn <- sf::read_sf("https://raw.githubusercontent.com/blackmad/neighborhoods/master/minneapolis.geojson")
par(mar = c(0, 0, 0, 0))
plot(minn$geometry, col = scales::alpha("dodgerblue", 0.3))
points(pd$lon, pd$lat, pch = 20,
       col = scales::alpha("black", 0.2))
Plot