When you open up a blank R script, there are a few things you may want to put at the very top of your code. I like to lead with any libraries I will need. While there is a lot you can do with base R, people write packages to make our lives easier. For example, here are a few packages I use in almost all of my scripts:
lubridate: this package is for working with date objects. Dates are not always easy to work with, but I cannot recommend this package enough. Cheatsheet.
stargazer: this package makes regression output beautiful and simple. Cheatsheet.
data.table: this is for reading in large CSVs.¹
scales: this is for changing the opacity of plot colors. Purely aesthetic, but I love it.
fixest: this package is excellent for when you have many fixed effects and/or observations. There is too much to cover on this, but see here for more details.
modelsummary, kableExtra: unfortunately, fixest does not work with stargazer, so you will need to use these. Here is a code snippet (by me) you might be interested in using. Here is the package's vignette.
Note that before you load a library (library("scales")), you need to install the library (install.packages("scales")). Installing a package only has to be done the very first time you use it, then it will be on your machine forever.
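For example, a script preamble might look like this (a minimal sketch, assuming each package has already been installed):
library("lubridate") # dates
library("stargazer") # regression tables
library("data.table") # fast reading of large CSVs
library("scales") # color opacity for plots
library("fixest") # many fixed effects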
Next, try to always set up a working directory. Suppose you have a folder for your project, and let's say that file path is: C:/Users/alexc/Desktop/sharkAttacks. Within this folder, I might have subfolders like raw_data, cleaned_data, literature, code, etc. It is likely that I'll want to save my script in the code folder but read/write data to the raw_data/cleaned_data folders. To make this simple, I can do this:
setwd("C:/Users/alexc/Desktop/sharkAttacks/")
#read in data
# do stuff ...
setwd("../cleaned_data")
#write out data
This allows for more "flexible" code since we only specify the full filepath once. Notice how I used ../cleaned_data in the second setwd(). These two dots mean "go back one level": they tell the computer to back out of the raw_data folder and go inside the cleaned_data folder. If you are unsure of your working directory, you can use getwd() to see what R is currently using.
The last thing is to read in the datasets you need. Oftentimes, you need to read in more than one dataset, merge them, etc. Let's read in both the knicks.csv and knicks_stats.csv datasets. As you have seen before, knicks is simply the roster, but knicks_stats.csv is new. This file contains data on season totals for each player.
#library("enter library here")
setwd("C:/Users/alexc/Desktop/Empirical Workshop 2021/data")
rm(list = ls()) #this clears my workspace ... be careful with this
roster <- read.csv("knicks.csv", stringsAsFactors = F)
stats <- read.csv("knicks_stats.csv", stringsAsFactors = F)
You can check out the data by clicking on the object in the Environment tab, or by typing View(roster) into the console. Before we get much further, we need to explore the data. Here is a list of some invaluable functions. Mastering these few functions takes you from level 1 to level 2, and can save you hours of time.
table()
aggregate()
match()
First, in the roster
data frame, we might want to see the distribution of player experience. We can easily do this by calling:
table(roster$Exp)
##
## 1 10 2 3 4 5 7 9 R
## 6 2 2 2 1 3 2 1 4
Notice the odd sorting in this table. Because the rookies are given an "R", the whole column is stored as character, so the values sort alphabetically ("1", "10", "2", ...) rather than numerically. Let's change this and remake the table.
roster$Exp[roster$Exp == "R"] <- 0
roster$Exp <- as.numeric(roster$Exp)
table(roster$Exp)
##
## 0 1 2 3 4 5 7 9 10
## 4 6 2 2 1 3 2 1 2
It can also be helpful to plot these for visualization purposes. You can simply use plot(table(roster$Exp))
. Now, what if we wanted to look at the distribution of experience by position? We can add in a second vector to table.
table(roster$Exp, roster$Pos)
##
## C PF PG SF SG
## 0 1 1 0 0 2
## 1 0 2 2 0 2
## 2 0 1 0 0 1
## 3 0 0 1 1 0
## 4 0 1 0 0 0
## 5 0 0 1 0 2
## 7 1 1 0 0 0
## 9 0 0 0 0 1
## 10 1 0 0 0 1
table() is especially helpful when you are counting events per some unit. For example, suppose you have a dataset where every entry is a crime with a date as a timestamp. Well, you can use table(crime$date) and this will generate the number of crimes per day. You might need to be careful, though: if a day has no crimes at all, it will simply be missing from the table, and R will not know this!
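If you need those zeros, one workaround (a small sketch, assuming a hypothetical crime data frame with a Date column called date) is to convert the dates to a factor whose levels span every day in the range:
all_days <- seq(min(crime$date), max(crime$date), by = "day") # every day in the range
table(factor(crime$date, levels = as.character(all_days))) # days with no crimes now show 0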
What if we are not interested in counts but rather averages, or some other function? This is where aggregate() comes in. This function takes three arguments: first, the vector(s) you want to aggregate; second, the vector(s) you want to collapse to; and finally, the function. For example, suppose we want to know the average weight by position.
agg1 <- aggregate(roster$Wt, by = list(roster$Pos), FUN = mean)
agg1
## Group.1 x
## 1 C 250
## 2 PF 237
## 3 PG 190
## 4 SF 201
## 5 SG 210
Notice the column names of the data frame. Terrible. Instead, we can name them by passing named lists: aggregate(list(Wt = roster$Wt), by = list(Pos = roster$Pos), FUN = mean)
What if, instead, we are interested in average height and weight by position? Let's remember how dataframes are special cases of named lists.
agg2 <- aggregate(list(Wt = roster$Wt, Ht_inches = roster$Ht_inches), by = list(Pos = roster$Pos), FUN = mean)
agg2
## Pos Wt Ht_inches
## 1 C 250 83.33333
## 2 PF 237 81.66667
## 3 PG 190 74.50000
## 4 SF 201 80.00000
## 5 SG 210 76.22222
agg3 <- aggregate(roster[,c("Wt", "Ht_inches")], by = list(Pos = roster$Pos), FUN = mean)
agg3
## Pos Wt Ht_inches
## 1 C 250 83.33333
## 2 PF 237 81.66667
## 3 PG 190 74.50000
## 4 SF 201 80.00000
## 5 SG 210 76.22222
What if we wanted average weight by position and country? This is also easy, but we need to be careful to remember that not every single position-country combo exists in the data. For example, there is a PG-be combo, but there is no C-be combo. If we want to keep the non-existent combos, we need to use drop = FALSE. table() will count the zero combinations, but aggregate() cannot take a mean of something that does not exist, so it assumes you don't want that observation and drops it.
agg4 <- aggregate(roster$Wt, by = list(Pos = roster$Pos, roster$X), FUN = mean)
agg4
## Pos Group.2 x
## 1 PG be 190
## 2 PG cd 200
## 3 C ch 262
## 4 SF hr 201
## 5 C us 244
## 6 PF us 237
## 7 PG us 185
## 8 SG us 210
agg5 <- aggregate(roster$Wt, by = list(Pos = roster$Pos, roster$X), FUN = mean, drop = FALSE)
agg5
## Pos Group.2 x
## 1 C be NA
## 2 PF be NA
## 3 PG be 190
## 4 SF be NA
## 5 SG be NA
## 6 C cd NA
## 7 PF cd NA
## 8 PG cd 200
## 9 SF cd NA
## 10 SG cd NA
## 11 C ch 262
## 12 PF ch NA
## 13 PG ch NA
## 14 SF ch NA
## 15 SG ch NA
## 16 C hr NA
## 17 PF hr NA
## 18 PG hr NA
## 19 SF hr 201
## 20 SG hr NA
## 21 C us 244
## 22 PF us 237
## 23 PG us 185
## 24 SF us NA
## 25 SG us 210
As a final note about aggregate(), you can use any function with it! Instead of mean, you can use sd, sum, or any other function that collapses a vector down to a single value.
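For instance, to get the standard deviation of weight by position instead of the mean:
aggregate(list(Wt_sd = roster$Wt), by = list(Pos = roster$Pos), FUN = sd)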
Okay, now you're probably thinking: why did I load in the second dataset??? Well, we want to merge/join these two datasets! You are going to google merging in R at some point in your life, and I am here to tell you: all you need is the match() function. That's it! Don't use merge, join, or anything else. Before I even pull anything from the stats data frame, I need to know where each row of roster falls within stats.
m <- match(roster$Player, stats$Var.2)
m
## [1] 16 20 13 2 18 21 NA 8 23 17 15 9 1 11 19 22 4 10 7 NA 12 6 3
What does this tell us? It tells us that Kadeem Allen is the 16th row in the stats data frame. So now, we can use m to reorder the stats dataset to line up with roster. You might also notice that there are NA values in m. This is what happens when there is no match. Here, it is due to quirks in how the names are typed, which we will ignore for now, but it's important to know. Now, if we want to know how many games each player appeared in, we use the following:
roster$G <- stats$G[m]
If we want to match in many columns:
roster[,c("G", "GS", "PTS")] <- stats[m,c("G", "GS", "PTS")]
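A quick sanity check (a small sketch using the m from above) is to see which players failed to match before trusting the merged columns:
roster$Player[is.na(m)] # players with no match in stats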
Before moving on, I want to mention something people often get stuck on. Hopefully you see how to match based on one variable, but what about two? Or three? We will talk about strings later, but suppose you want to match by both city and year. You would do something like: match(paste(abc$city, abc$year), paste(xyz$city, xyz$year))
.
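A hedged refinement (abc and xyz are hypothetical data frames, each with city and year columns): adding an explicit separator to paste() guards against two different city-year pairs accidentally pasting to the same string.
m2 <- match(paste(abc$city, abc$year, sep = "_"), paste(xyz$city, xyz$year, sep = "_"))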
These are some really important things in R. I would encourage you to go over this a few times and really understand what is going on. The power of these three functions cannot be overstated.
Let's talk about regressions and econometrics. Let's use a dataset native to R called quakes
. This contains a bunch of earthquake readings in Fiji. Documentation here. To read it in, use:
data("quakes")
str(quakes)
## 'data.frame': 1000 obs. of 5 variables:
## $ lat : num -20.4 -20.6 -26 -18 -20.4 ...
## $ long : num 182 181 184 182 182 ...
## $ depth : int 562 650 42 626 649 195 82 194 211 622 ...
## $ mag : num 4.8 4.2 5.4 4.1 4 4 4.8 4.4 4.7 4.3 ...
## $ stations: int 41 15 43 19 11 12 43 15 35 19 ...
str()
gives the structure of the object. There are 1000 rows and 5 variables.
The first thing you'll want to show is summary stats. There are two ways (that I use) to show summary stats. The first one uses a library called stargazer
and the second uses both modelsummary
and kableExtra
. Stargazer is much easier, but modelsummary is way more customizable. I will only focus on stargazer
here:
library("stargazer")
stargazer(quakes, type = "text", #make this "latex" if you want that output
summary.stat = c("n", "mean", "sd"))
##
## ================================
## Statistic N Mean St. Dev.
## --------------------------------
## lat 1,000 -20.643 5.029
## long 1,000 179.462 6.069
## depth 1,000 311.371 215.535
## mag 1,000 4.620 0.403
## stations 1,000 33.418 21.900
## --------------------------------
The variable "stations" tells the number of stations that reported the earthquake. It would be interesting to see if depth or magnitude impacted this. Obviously, one of the first steps would be to plot or to run correlations. Let's try magnitude and stations.
plot(quakes$mag, quakes$stations)
cor.test(quakes$mag, quakes$stations)
##
## Pearson's product-moment correlation
##
## data: quakes$mag and quakes$stations
## t = 51.231, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8331527 0.8674048
## sample estimates:
## cor
## 0.8511824
We will do much more with plotting later, but for now, we can see a relationship. To run a univariate regression:
reg1 <- lm(stations ~ mag, data = quakes)
Pretty simple! To check out the diagnostics:
summary(reg1)
##
## Call:
## lm(formula = stations ~ mag, data = quakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.871 -7.102 -0.474 6.783 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -180.4243 4.1899 -43.06 <2e-16 ***
## mag 46.2822 0.9034 51.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.5 on 998 degrees of freedom
## Multiple R-squared: 0.7245, Adjusted R-squared: 0.7242
## F-statistic: 2625 on 1 and 998 DF, p-value: < 2.2e-16
To add in multiple variables:
reg2 <- lm(stations ~ mag + depth, data = quakes)
summary(reg2)
##
## Call:
## lm(formula = stations ~ mag + depth, data = quakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.506 -6.996 -0.453 6.643 47.259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.920e+02 4.332e+00 -44.335 < 2e-16 ***
## mag 4.791e+01 9.016e-01 53.135 < 2e-16 ***
## depth 1.318e-02 1.685e-03 7.822 1.32e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.17 on 997 degrees of freedom
## Multiple R-squared: 0.7404, Adjusted R-squared: 0.7399
## F-statistic: 1422 on 2 and 997 DF, p-value: < 2.2e-16
Now, if you want to see all the models together, you can use stargazer
once again.
stargazer(reg1, reg2,
type = "text")
##
## =========================================================================
## Dependent variable:
## -----------------------------------------------------
## stations
## (1) (2)
## -------------------------------------------------------------------------
## mag 46.282*** 47.909***
## (0.903) (0.902)
##
## depth 0.013***
## (0.002)
##
## Constant -180.424*** -192.043***
## (4.190) (4.332)
##
## -------------------------------------------------------------------------
## Observations 1,000 1,000
## R2 0.725 0.740
## Adjusted R2 0.724 0.740
## Residual Std. Error 11.501 (df = 998) 11.169 (df = 997)
## F Statistic 2,624.656*** (df = 1; 998) 1,422.047*** (df = 2; 997)
## =========================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Of course, we can also use logistic regression to model a binary outcome or a poisson to model a count. For this, we use glm()
.
# making a binary variable:
quakes$stations_30plus <- quakes$stations >= 30
reg3 <- glm(stations_30plus ~ mag + depth, data = quakes, family = "binomial")
reg4 <- glm(stations ~ mag + depth, data = quakes, family = "poisson")
stargazer(reg1, reg2, reg3, reg4,
type = "text",
omit.stat = c("f", "ser"))
##
## ====================================================================
## Dependent variable:
## --------------------------------------------------
## stations stations_30plus stations
## OLS logistic Poisson
## (1) (2) (3) (4)
## --------------------------------------------------------------------
## mag 46.282*** 47.909*** 7.315*** 1.189***
## (0.903) (0.902) (0.472) (0.012)
##
## depth 0.013*** 0.002*** 0.0003***
## (0.002) (0.0004) (0.00003)
##
## Constant -180.424*** -192.043*** -34.563*** -2.205***
## (4.190) (4.332) (2.227) (0.059)
##
## --------------------------------------------------------------------
## Observations 1,000 1,000 1,000 1,000
## R2 0.725 0.740
## Adjusted R2 0.724 0.740
## Log Likelihood -366.572 -4,023.375
## Akaike Inf. Crit. 739.143 8,052.749
## ====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Next, within regressions, there are shortcuts for interactions and other manipulations. There are three you should know: *
, :
, and I()
. If you want a full set of interactions (A + B + AxB), use *
between two variables. If you want just the interaction term, use :
. Finally, if you want to do calculations within the regression, you can do I(x * 100)
, I(x^2)
, etc.
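For example, on the quakes data:
reg_int <- lm(stations ~ mag * depth, data = quakes) # mag + depth + mag:depth
reg_int2 <- lm(stations ~ mag:depth, data = quakes) # interaction term only
reg_sq <- lm(stations ~ mag + I(mag^2), data = quakes) # adds a squared term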
fixest
Extremely large datasets usually require alternate solutions. Since R is open source, very kind individuals have written highly optimized packages for us to use. fixest
is one of these packages. The way to write formulas for fixest
is very similar to how you would write them for lm()
. Below is an example:
reg_x <- feols(y1 ~ x1 + x2 | FE1 + FE2, data = df)
y1 is an outcome variable. This can be changed to list(y1, y2) if you have more than one outcome variable.
x1, x2 are independent variables. If you want to run multiple models, you can do sw(x1, x2, x1 + x2). This will do one model with x1, one with x2, and one with both. sw0() will do all three plus one without any of them.
FE1, FE2 are fixed effects. By default, standard errors are clustered by FE1, but you can change this in post.
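Here is a minimal runnable sketch on the quakes data (purely illustrative: I bin depth into five groups so the bins can play the role of a fixed effect):
library("fixest")
quakes$depth_bin <- cut(quakes$depth, breaks = 5) # illustrative fixed-effect groups
reg_fe <- feols(stations ~ mag | depth_bin, data = quakes)
summary(reg_fe) # standard errors are clustered by depth_bin by default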
¹ Look into readRDS() and saveRDS(). Essentially, these save an R object to your file directory. This provides two benefits over a CSV: the file is zipped, so it is much smaller, and R can read it in way quicker than a CSV. The drawback is that you cannot open it with Excel or Notepad. Should be used for LARGE datasets.↩