Setting Up R for Use

When you open a blank R script, there are a few things you may want to put at the very top of your code. I like to lead with any libraries I will need. While you can do a lot with base R, people write packages to make our lives easier, and I load a handful of them (scales, for example) in almost all of my scripts.

Note that before you load a library (library("scales")), you need to install it (install.packages("scales")). Installing a package only has to be done the very first time you use it; after that, it lives on your machine, though library() must be called again in every new R session.
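In script form, the top of a file might look like this (using MASS, a package that ships with R, so the example runs anywhere; swap in whatever packages you actually need):

```r
# Run once, the very first time you use the package (then comment it out):
# install.packages("MASS")

# Run in every new R session:
library("MASS")
```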

Next, always set a working directory. Suppose you have a folder for your project, and let's say its file path is: C:/Users/alexc/Desktop/sharkAttacks. Within this folder, I might have subfolders like raw_data, cleaned_data, literature, code, etc. I'll likely want to save my script in the code folder but read raw data from raw_data and write cleaned data to cleaned_data. To make this simple, I can do this:

setwd("C:/Users/alexc/Desktop/sharkAttacks/raw_data")
#read in data

# do stuff ...

setwd("../cleaned_data")
#write out data

This makes the code more "flexible," since we only spell out the full file path once. Notice how I used ../cleaned_data in the second setwd(). The two dots mean "go up one level": this tells the computer to back out of the raw_data folder and then go into the cleaned_data folder. If you are unsure of your working directory, use getwd() to see what R is currently using.

The last thing is to read in the datasets you need. Often, you will need to read in more than one dataset, merge them, and so on. Let's read in both the knicks.csv and knicks_stats.csv datasets. As you have seen before, knicks.csv is simply the roster, but knicks_stats.csv is new: it contains season totals for each player.

#library("enter library here")

setwd("C:/Users/alexc/Desktop/Empirical Workshop 2021/data")
rm(list = ls()) #this clears my workspace ... be careful with this

roster <- read.csv("knicks.csv", stringsAsFactors = F)
stats <- read.csv("knicks_stats.csv", stringsAsFactors = F)

Important Functions

You can check out the data by clicking on the object in the Environment tab, or by typing View(roster) into the console. Before we get much further, we need to explore the data. Here is a list of some invaluable functions. Mastering these few functions takes you from level 1 to level 2 and can save you hours of time.

  • table()
  • aggregate()
  • match()

First, in the roster data frame, we might want to see the distribution of player experience. We can easily do this by calling:

table(roster$Exp)
## 
##  1 10  2  3  4  5  7  9  R 
##  6  2  2  2  1  3  2  1  4

Notice the odd sorting in this table. This is due to the rookies being given an "R". Let's change this and remake the table.

roster$Exp[roster$Exp == "R"] <- 0
roster$Exp <- as.numeric(roster$Exp)
table(roster$Exp)
## 
##  0  1  2  3  4  5  7  9 10 
##  4  6  2  2  1  3  2  1  2

It can also be helpful to plot these for visualization purposes. You can simply use plot(table(roster$Exp)). Now, what if we wanted to look at the distribution of experience by position? We can add in a second vector to table.

table(roster$Exp, roster$Pos)
##     
##      C PF PG SF SG
##   0  1  1  0  0  2
##   1  0  2  2  0  2
##   2  0  1  0  0  1
##   3  0  0  1  1  0
##   4  0  1  0  0  0
##   5  0  0  1  0  2
##   7  1  1  0  0  0
##   9  0  0  0  0  1
##   10 1  0  0  0  1

table() is especially helpful when you are counting events per unit of something. For example, suppose you have a dataset where every entry is a crime with a date as a timestamp. You can use table(crime$date) to get the number of crimes per day. Be careful, though: if a day has no events at all, it simply will not appear in the table, and R will not warn you!
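To see the missing-day issue concretely, here is a toy example (made-up dates) together with the usual fix: turn the dates into a factor whose levels cover every day in the range, so that empty days show up as zeros.

```r
# Toy crime data: two crimes on Jan 1, one on Jan 3, none on Jan 2
dates <- as.Date(c("2021-01-01", "2021-01-01", "2021-01-03"))

table(dates)  # Jan 2 is silently missing from this table

# Fix: spell out every day in the range as factor levels
all_days <- seq(min(dates), max(dates), by = "day")
table(factor(dates, levels = as.character(all_days)))  # Jan 2 now shows a 0
```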

What if we are not interested in counts, but rather averages or some other function? This is where aggregate() comes in. This function takes three arguments: first, the vector(s) you want to aggregate; second, the grouping vector(s) you want to collapse to; and finally, the function to apply. For example, suppose we want to know the average weight by position.

agg1 <- aggregate(roster$Wt, by = list(roster$Pos), FUN = mean)
agg1
##   Group.1   x
## 1       C 250
## 2      PF 237
## 3      PG 190
## 4      SF 201
## 5      SG 210

Notice the column names of the resulting data frame. Terrible. Instead, we can name them like: aggregate(list(Wt = roster$Wt), by = list(Pos = roster$Pos), FUN = mean)

What if, instead, we are interested in average height and weight by position? Remember that data frames are special cases of named lists.

agg2 <- aggregate(list(Wt = roster$Wt, Ht_inches = roster$Ht_inches), by = list(Pos = roster$Pos), FUN = mean)
agg2
##   Pos  Wt Ht_inches
## 1   C 250  83.33333
## 2  PF 237  81.66667
## 3  PG 190  74.50000
## 4  SF 201  80.00000
## 5  SG 210  76.22222
agg3 <- aggregate(roster[,c("Wt", "Ht_inches")], by = list(Pos = roster$Pos), FUN = mean)
agg3
##   Pos  Wt Ht_inches
## 1   C 250  83.33333
## 2  PF 237  81.66667
## 3  PG 190  74.50000
## 4  SF 201  80.00000
## 5  SG 210  76.22222

What if we wanted average weight by position and country? This is also easy, but we need to remember that not every position-country combination appears in the data. For example, there is a PG-be combo, but there is no C-be combo. table() will count the zero combinations, but aggregate() cannot take a mean of something that does not exist, so it assumes you don't want that observation and drops it. If we want to keep the non-existent combos, we need to use drop = FALSE.

agg4 <- aggregate(roster$Wt, by = list(Pos = roster$Pos, roster$X), FUN = mean)
agg4
##   Pos Group.2   x
## 1  PG      be 190
## 2  PG      cd 200
## 3   C      ch 262
## 4  SF      hr 201
## 5   C      us 244
## 6  PF      us 237
## 7  PG      us 185
## 8  SG      us 210
agg5 <- aggregate(roster$Wt, by = list(Pos = roster$Pos, roster$X), FUN = mean, drop = FALSE)
agg5
##    Pos Group.2   x
## 1    C      be  NA
## 2   PF      be  NA
## 3   PG      be 190
## 4   SF      be  NA
## 5   SG      be  NA
## 6    C      cd  NA
## 7   PF      cd  NA
## 8   PG      cd 200
## 9   SF      cd  NA
## 10  SG      cd  NA
## 11   C      ch 262
## 12  PF      ch  NA
## 13  PG      ch  NA
## 14  SF      ch  NA
## 15  SG      ch  NA
## 16   C      hr  NA
## 17  PF      hr  NA
## 18  PG      hr  NA
## 19  SF      hr 201
## 20  SG      hr  NA
## 21   C      us 244
## 22  PF      us 237
## 23  PG      us 185
## 24  SF      us  NA
## 25  SG      us 210

As a final note about aggregate(), you can use any function with it! Instead of mean, you can use sd, sum, or anything.
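For example, with a small made-up roster, the same aggregate() call gives spreads, totals, or counts just by swapping FUN:

```r
df <- data.frame(pos = c("C", "C", "PG", "PG", "PG"),
                 wt  = c(250, 260, 180, 190, 200))

aggregate(list(wt = df$wt), by = list(pos = df$pos), FUN = sd)      # spread by position
aggregate(list(wt = df$wt), by = list(pos = df$pos), FUN = sum)     # total weight
aggregate(list(n = df$wt),  by = list(pos = df$pos), FUN = length)  # counts, like table()
```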

Okay, now you're probably thinking: why did I load in that second dataset? Well, we want to merge/join these two datasets! You are going to google merging in R at some point in your life, and I am here to tell you: all you need is the match() function. That's it! Don't use merge(), join, or anything else. Before I pull anything from the stats data frame, I need to know where each row of roster sits within stats.

m <- match(roster$Player, stats$Var.2)
m
##  [1] 16 20 13  2 18 21 NA  8 23 17 15  9  1 11 19 22  4 10  7 NA 12  6  3

What does this tell us? It tells us that Kadeem Allen, the first player in roster, is the 16th row of the stats data frame. So now we can use m to reorder the stats dataset however we need it for merging. You might also notice the NA values in m: that is what happens when there is no match. Here it is due to quirks in how the names are typed, which we will ignore for now, but it is important to know about. Now, if we want to know how many games each player appeared in, we use the following:

roster$G <- stats$G[m]

If we want to match in many columns:

roster[,c("G", "GS", "PTS")] <- stats[m,c("G", "GS", "PTS")]

Before moving on, I want to mention something people often get stuck on. Hopefully you see how to match based on one variable, but what about two? Or three? We will talk about strings later, but suppose you want to match by both city and year. You would do something like: match(paste(abc$city, abc$year), paste(xyz$city, xyz$year)).
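Here is a minimal sketch with two made-up data frames (abc and xyz are just placeholder names):

```r
abc <- data.frame(city = c("Boston", "Boston", "Denver"),
                  year = c(2020, 2021, 2020))
xyz <- data.frame(city = c("Denver", "Boston", "Boston"),
                  year = c(2020, 2021, 2020),
                  pop  = c(715, 676, 689))

# paste() glues the two keys into one string per row, so match() can compare them
m <- match(paste(abc$city, abc$year), paste(xyz$city, xyz$year))
abc$pop <- xyz$pop[m]  # pull pop over in the right order
```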

These are some really important things in R. I would encourage you to go over this a few times and really understand what is going on. The power of these three functions cannot be overstated.

Econometrics

Let's talk about regressions and econometrics. We'll use a dataset built into R called quakes, which contains 1,000 earthquake readings from around Fiji. See ?quakes for the documentation. To load it, use:

data("quakes")
str(quakes)
## 'data.frame':    1000 obs. of  5 variables:
##  $ lat     : num  -20.4 -20.6 -26 -18 -20.4 ...
##  $ long    : num  182 181 184 182 182 ...
##  $ depth   : int  562 650 42 626 649 195 82 194 211 622 ...
##  $ mag     : num  4.8 4.2 5.4 4.1 4 4 4.8 4.4 4.7 4.3 ...
##  $ stations: int  41 15 43 19 11 12 43 15 35 19 ...

str() gives the structure of the object. There are 1000 rows and 5 variables.

The first thing you'll want to show is summary statistics. There are two ways (that I use) to do this: the first uses a library called stargazer, and the second uses both modelsummary and kableExtra. stargazer is much easier, but modelsummary is far more customizable. I will only focus on stargazer here:

library("stargazer")
stargazer(quakes, type = "text", #make this "latex" if you want that output
          summary.stat = c("n", "mean", "sd"))
## 
## ================================
## Statistic   N    Mean   St. Dev.
## --------------------------------
## lat       1,000 -20.643  5.029  
## long      1,000 179.462  6.069  
## depth     1,000 311.371 215.535 
## mag       1,000  4.620   0.403  
## stations  1,000 33.418   21.900 
## --------------------------------

The variable "stations" gives the number of stations that reported the earthquake. It would be interesting to see whether depth or magnitude affects this. Obviously, one of the first steps is to plot the data or run correlations. Let's try magnitude and stations.

plot(quakes$mag, quakes$stations)

cor.test(quakes$mag, quakes$stations)
## 
##  Pearson's product-moment correlation
## 
## data:  quakes$mag and quakes$stations
## t = 51.231, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8331527 0.8674048
## sample estimates:
##       cor 
## 0.8511824

We will do much more with plotting later, but for now, we can see a relationship. To run a univariate regression:

reg1 <- lm(stations ~ mag, data = quakes)

Pretty simple! To check out the diagnostics:

summary(reg1)
## 
## Call:
## lm(formula = stations ~ mag, data = quakes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.871  -7.102  -0.474   6.783  50.244 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -180.4243     4.1899  -43.06   <2e-16 ***
## mag           46.2822     0.9034   51.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.5 on 998 degrees of freedom
## Multiple R-squared:  0.7245, Adjusted R-squared:  0.7242 
## F-statistic:  2625 on 1 and 998 DF,  p-value: < 2.2e-16

To add in multiple variables:

reg2 <- lm(stations ~ mag + depth, data = quakes)
summary(reg2)
## 
## Call:
## lm(formula = stations ~ mag + depth, data = quakes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.506  -6.996  -0.453   6.643  47.259 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.920e+02  4.332e+00 -44.335  < 2e-16 ***
## mag          4.791e+01  9.016e-01  53.135  < 2e-16 ***
## depth        1.318e-02  1.685e-03   7.822 1.32e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.17 on 997 degrees of freedom
## Multiple R-squared:  0.7404, Adjusted R-squared:  0.7399 
## F-statistic:  1422 on 2 and 997 DF,  p-value: < 2.2e-16

Now, if you want to see all the models together, you can use stargazer once again.

stargazer(reg1, reg2,
          type = "text")
## 
## =========================================================================
##                                      Dependent variable:                 
##                     -----------------------------------------------------
##                                           stations                       
##                                (1)                        (2)            
## -------------------------------------------------------------------------
## mag                         46.282***                  47.909***         
##                              (0.903)                    (0.902)          
##                                                                          
## depth                                                   0.013***         
##                                                         (0.002)          
##                                                                          
## Constant                   -180.424***                -192.043***        
##                              (4.190)                    (4.332)          
##                                                                          
## -------------------------------------------------------------------------
## Observations                  1,000                      1,000           
## R2                            0.725                      0.740           
## Adjusted R2                   0.724                      0.740           
## Residual Std. Error     11.501 (df = 998)          11.169 (df = 997)     
## F Statistic         2,624.656*** (df = 1; 998) 1,422.047*** (df = 2; 997)
## =========================================================================
## Note:                                         *p<0.1; **p<0.05; ***p<0.01

Of course, we can also use logistic regression to model a binary outcome, or a Poisson regression to model a count. For both, we use glm().

# making a binary variable:
quakes$stations_30plus <- quakes$stations >= 30
reg3 <- glm(stations_30plus ~ mag + depth, data = quakes, family = "binomial")
reg4 <- glm(stations ~ mag + depth, data = quakes, family = "poisson")
stargazer(reg1, reg2, reg3, reg4,
          type = "text",
          omit.stat = c("f", "ser"))
## 
## ====================================================================
##                                  Dependent variable:                
##                   --------------------------------------------------
##                          stations         stations_30plus  stations 
##                             OLS              logistic      Poisson  
##                       (1)         (2)           (3)          (4)    
## --------------------------------------------------------------------
## mag                46.282***   47.909***     7.315***      1.189*** 
##                     (0.903)     (0.902)       (0.472)      (0.012)  
##                                                                     
## depth                          0.013***      0.002***     0.0003*** 
##                                 (0.002)      (0.0004)     (0.00003) 
##                                                                     
## Constant          -180.424*** -192.043***   -34.563***    -2.205*** 
##                     (4.190)     (4.332)       (2.227)      (0.059)  
##                                                                     
## --------------------------------------------------------------------
## Observations         1,000       1,000         1,000        1,000   
## R2                   0.725       0.740                              
## Adjusted R2          0.724       0.740                              
## Log Likelihood                               -366.572     -4,023.375
## Akaike Inf. Crit.                             739.143     8,052.749 
## ====================================================================
## Note:                                    *p<0.1; **p<0.05; ***p<0.01

Next, within regression formulas, there are shortcuts for interactions and other manipulations. There are three you should know: *, :, and I(). If you want the full set of interactions (A + B + A:B), use * between two variables. If you want just the interaction term, use :. Finally, if you want to do arithmetic within the formula, wrap it in I(), e.g. I(x * 100) or I(x^2); otherwise symbols like * and ^ are given their special formula meanings.
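Using the quakes data from above, the three shortcuts look like this:

```r
data("quakes")

# * gives the full set: mag + depth + mag:depth
summary(lm(stations ~ mag * depth, data = quakes))

# : gives only the interaction term (here, added next to mag)
summary(lm(stations ~ mag + mag:depth, data = quakes))

# I() does the arithmetic before the formula is interpreted
summary(lm(stations ~ mag + I(mag^2), data = quakes))
```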

fixest

Extremely large datasets usually require alternate solutions. Since R is open source, very kind individuals have written highly optimized packages for us to use. fixest is one of these packages. The way to write formulas for fixest is very similar to how you would write them for lm(). Below is an example:

library("fixest")
reg_x <- feols(y1 ~ x1 + x2 | FE1 + FE2, data = df)
  • y1 is the outcome variable. This can be changed to c(y1, y2) if you have more than one outcome variable.
  • x1, x2 are independent variables. If you want to run multiple models, you can do sw(x1, x2, x1 + x2). This will estimate one model with x1, one with x2, and one with both. sw0() will do all three plus one with none of them.
  • FE1, FE2 are fixed effects. By default, standard errors are clustered by FE1, but you can change this afterwards (e.g., in summary()).
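Putting the pieces together on the quakes data (this assumes fixest is installed; the depth bins are just an artificial fixed effect invented for the demo):

```r
library("fixest")

data("quakes")
quakes$depth_bin <- cut(quakes$depth, breaks = 3)  # artificial FE for illustration

# sw() runs three models: stations ~ mag, ~ depth, and ~ mag + depth,
# each with depth_bin fixed effects
regs <- feols(stations ~ sw(mag, depth, mag + depth) | depth_bin, data = quakes)
etable(regs)  # side-by-side comparison table
```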

  1. Look into readRDS() and saveRDS(). Essentially, these save an R object to your file directory. This provides two benefits over a CSV: the file is compressed, so it is much smaller, and R can read it back in far quicker than a CSV. The drawback is that you cannot open it with Excel or Notepad. These should be used for LARGE datasets.
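A quick sketch of the round trip, using a temporary file:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

path <- tempfile(fileext = ".rds")
saveRDS(df, path)        # compressed, R-only format

df2 <- readRDS(path)     # comes back exactly as it was saved
identical(df, df2)       # TRUE
```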