Introduction to R

Author

Alexander Cardazzi

Published

January 1, 2020

#What is R?

R is an open source (OSS), object-oriented scripting language. It is mostly used by data scientists as an environment for statistical computing and graphics. Its source code is written in C, Fortran, and R, and it descends from the mostly defunct language S. R ranks 12th in overall popularity (according to TIOBE 2019). Because of this popularity, the answer to nearly any question you might have about R can be found through Google. Specifically, Stack Overflow, RStudio’s Community Website, or a Reddit community (r/rstats, r/Rlanguage) should have what you are looking for. The hardest part about programming is knowing what and how to google.

For those of you who are curious, R differs from STATA in a few ways, though the differences are ultimately minor. In my opinion, R has a steeper learning curve, but it is more flexible and powerful once you are over that hump.

When using R, most people use RStudio. I like to think of R as the brain and RStudio as the body. It is likely that you will never use the default R GUI (Graphical User Interface) alone.

#Download R

To download R, follow this link. R asks you to choose a location close to you when you download. Then, select the version of R that works for your operating system (see footnote 1). Next, click on Install R for the first time. Finally, download whatever version of R is displayed. Download R before RStudio!

#Download RStudio

To download RStudio, go here. The page should detect what OS you have and make the appropriate suggestion. However, in the event that the suggestion is incorrect, scrolling down will reveal other versions you can download.

RStudio is organized into four distinct panels. The bottom left is the console. This is where you can view your code’s output. I also use it for “test code” that I do not want in my saved file. The top left is where you write code that you want to save, which we’ll call a script. The upper right is your environment (along with history and connections, which we will not discuss right now). This is where you can see all the data you have loaded or created. The bottom right is where you can view plots, browse files, and use the help functionality. The help in R is an excellent place to start.

To use help, type a question mark followed by the name of the function into the console and hit enter. For example, type ?mean and check out what pops up. There will be information on Usage, Arguments, and Examples.

#Data Structures

Some basic objects, or data structures, in R are as follows:

#Logic

Before the development of Object Oriented Programming (OOP), most notably in C++, programming languages were more focused on logic than on data. Even now with OOP, logic plays a big part in how programs are written. Here are some examples of how the “and” (&) and “or” (|) operators work with logical values. We can also use inequalities as in written math, and these are evaluated the way you might expect.

A note about the semicolon used throughout this tutorial: it is there only to keep the code concise. You do not need to end a line with a semicolon (as you would in C++), but a semicolon lets you put two statements on one line. I only ever do this if I have multiple short lines that “go together”. Here it is purely aesthetic.

TRUE & TRUE; TRUE & FALSE; FALSE & FALSE;
[1] TRUE
[1] FALSE
[1] FALSE
TRUE | TRUE; TRUE | FALSE; FALSE | FALSE;
[1] TRUE
[1] TRUE
[1] FALSE
5 < 3; 5 > 3; 5 <= 3; 5 >= 3; 5 == 3; 5 != 3
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE

Hint: When used in logical operations, all numerical values except 0 are treated as TRUE. I would hesitate to rely on this unless you have binary variables coded as 1/0 instead of TRUE/FALSE, as it is unintuitive.

1 & 5; 5 & 0
[1] TRUE
[1] FALSE
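
As a small sketch of the hint above, as.logical() shows how numbers map to logical values, and sum() shows the reverse direction (TRUE counts as 1 and FALSE as 0). The variable names and values here are made up for illustration.

```r
# Hypothetical example values: any non-zero number becomes TRUE
nums <- c(0, 1, 5, -2)
as_flags <- as.logical(nums)
print(as_flags)   # FALSE  TRUE  TRUE  TRUE

# Going the other way, logical values act like 1/0 in arithmetic
flags <- c(TRUE, FALSE, TRUE)
n_true <- sum(flags)
print(n_true)     # 2
```

Counting TRUE values with sum() is handy later on, for example to count how many elements of a vector satisfy a condition.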

#Arithmetic

Since we aren’t interested in just logic, we can also use R as a calculator. An incredibly elaborate calculator, but a calculator nonetheless. The typical mathematical operators are what you would expect in R: +, -, *, and / are addition, subtraction, multiplication, and division, respectively. Other useful operators are integer division (%/%), modulo (%%), and exponentiation (^ or **). Integer division chops off the remainder after division, and modulo returns the remainder after division.

1 + 2; 3 - 4; 5 * 6; 7 / 8; 5%/%3; 5%%3
[1] 3
[1] -1
[1] 30
[1] 0.875
[1] 1
[1] 2
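
The block above does not exercise the exponent operators, so here is a short sketch of them, along with a check that integer division and modulo fit together. The numbers are arbitrary and chosen only for illustration.

```r
# Both exponent operators give the same result
pow_caret <- 2^3      # 8
pow_stars <- 2**3     # 8

# Integer division and modulo fit together:
# dividend == divisor * quotient + remainder
quotient  <- 17 %/% 5                    # 3
remainder <- 17 %% 5                     # 2
rebuilt   <- 5 * quotient + remainder    # 17

print(c(pow_caret, pow_stars, quotient, remainder, rebuilt))
```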

#Variables / Objects

Further, because we aren’t interested in just incredibly elaborate calculators, we can explore how to store data in R. Much like in algebra, we can use a name (a single character or a string of characters) to represent a value. We can assign values with three operators: <-, ->, and =. The first two are identical apart from the direction in which they “push” data, though the first is far more popular than the second. The final one is different, but in a nuanced way that is beyond the scope of this tutorial (see footnote 2).

x <- 1 + 1; y <- 2 + 2
print(x); print(y)
[1] 2
[1] 4
#I am going to create a variable that is the sum of two other variables:
x + y -> z; print(z)
[1] 6
#Notice how changing the value of x does not change z. This may seem trivial, but it is good to note.
x <- 100; print(z)
[1] 6
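
To make the three assignment operators concrete, here is a minimal sketch showing that, at the top level of a script, they all store the same value. The variable names a, b, and d are arbitrary.

```r
a <- 10      # leftward assignment (the most common form)
10 -> b      # rightward assignment
d = 10       # equals sign; behaves like <- at the top level of a script

print(c(a, b, d))

# All three variables hold exactly the same value
same <- identical(a, b) && identical(b, d)
print(same)  # TRUE
```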

Typically, printing things in R is done with the print() function. However, when working interactively (for example, in the RStudio console), simply typing the name of a variable will print it to the console just the same. Unless you are building a larger program that requires printing, this is not necessary to think about. That said, printing is often a good debugging tool when developing any script, and I encourage littering your scripts with print statements while you are learning.

#Vectors

Rarely do we work with single values. Instead, we have collections of data, or variables, which we will treat as vectors. The first thing to know is c(). It stands for combine (often read as concatenate), and it is very important. Essentially, it combines multiple values into a single vector. The rep() function repeats its first argument the number of times specified by its second argument. The seq() function creates a sequence between its first two arguments, with the third argument giving the increment. Subsetting vectors by index and by logic is covered a little further below.

#Note how we do not need to type every integer here by using the colon
c(1, 2, 3, 4, 5); c(1:5)
[1] 1 2 3 4 5
[1] 1 2 3 4 5
rep(1, 10); rep(c(1:5), 10)
 [1] 1 1 1 1 1 1 1 1 1 1
 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3
[39] 4 5 1 2 3 4 5 1 2 3 4 5
seq(1, 10, 2); seq(1, 10, .1)
[1] 1 3 5 7 9
 [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4
[16]  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9
[31]  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4
[46]  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
[61]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3  8.4
[76]  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7  9.8  9.9
[91] 10.0
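
The calls above pass arguments by position. As a sketch, the same functions can be called with named arguments, which makes the intent easier to read; times, each, by, and length.out are standard arguments of rep() and seq(), while the variable names are made up.

```r
rep_times <- rep(c(1, 2), times = 3)   # repeat the whole vector: 1 2 1 2 1 2
rep_each  <- rep(c(1, 2), each = 3)    # repeat each element:     1 1 1 2 2 2

seq_by  <- seq(from = 0, to = 1, by = 0.25)        # fixed step size
seq_len <- seq(from = 0, to = 1, length.out = 5)   # fixed number of points

print(rep_times); print(rep_each)
print(seq_by);    print(seq_len)
print(length(seq_by))   # number of elements in the vector
```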

Now that we can make vectors, we will want to manipulate them. We can perform arithmetic and logical operations on vectors just as we did on scalars; the operations are applied element by element.

x <- c(1:5)
x > 2; x + 2; x*3; x + x
[1] FALSE FALSE  TRUE  TRUE  TRUE
[1] 3 4 5 6 7
[1]  3  6  9 12 15
[1]  2  4  6  8 10

Suppose now that we have a long vector but want to look at, modify, or remove only certain elements. We will use square brackets to do this.

#Vector subsetting by indexing and logic
y <- c(1:100)
y[10:20] #by index, give me the 10th through 20th elements
 [1] 10 11 12 13 14 15 16 17 18 19 20
y[y < 5] #by logic, give me the elements of y where y is less than 5
[1] 1 2 3 4
y[y %% 6 == 0] #by logic, give me the elements in y such that dividing them by 6 does not yield a remainder
 [1]  6 12 18 24 30 36 42 48 54 60 66 72 78 84 90 96
z <- c(x, y[90:100]); z #combining two vectors, x and a subset of y.
 [1]   1   2   3   4   5  90  91  92  93  94  95  96  97  98  99 100
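
The block above only looks at elements. As a sketch of the other two uses of square brackets, modifying and removing, here is a short example on a made-up vector v.

```r
v <- c(10, 20, 30, 40, 50)

v[2] <- 99            # modify by index: v is now 10 99 30 40 50
without_first <- v[-1] # a negative index drops that element: 99 30 40 50
                       # (v itself is unchanged by this line)

v[v > 35] <- 0        # modify by logic: every element above 35 becomes 0
print(v)              # 10  0 30  0  0
print(without_first)
```

Note that a negative index returns a shortened copy; to actually remove the element from v, assign the result back with v <- v[-1].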

We can also run some of these operations on their own, outside of square brackets, and get vectors of logical values. For example, y %% 6 == 0 returns 100 TRUE or FALSE values. Sometimes we want to know the indices where these are TRUE, and for that we can use the which() function. This is an important function.
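
Here is a minimal sketch of which() on a small made-up vector, chosen so that the positions differ from the values and the distinction is easy to see.

```r
v <- c(12, 7, 18, 5, 24)

is_mult6 <- v %% 6 == 0    # logical vector, one value per element
idx      <- which(is_mult6) # positions where the condition is TRUE

print(is_mult6)   # TRUE FALSE  TRUE FALSE  TRUE
print(idx)        # 1 3 5
print(v[idx])     # 12 18 24, the same result as v[is_mult6]
```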

You might be able to see how this can become complicated. For example, suppose we had 3 vectors: GDP, Country, and Continent, and we only want the GDP for countries in Europe. It may look something like this:

GDP <- c(47, 23, 61, 29, 80, 48, 92, 42)
Country <- c("USA", "Italy", "Egypt", "Mexico", "Japan", "UK", "Germany", "Brazil")
Continent <- c("NA", "EU", "AF", "NA", "AS", "EU", "EU", "SA")
print(GDP[Continent == "EU"])
[1] 23 48 92

I like to think of it like: “Give me GDP where/given Continent equals EU”. Talk to yourself when you program! If you wanted Continent to be equal to Europe OR North America, this can be done with Continent == "EU" | Continent == "NA" or with Continent %in% c("EU", "NA").
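
As a sketch, the two ways of writing the “Europe OR North America” condition can be checked against each other using the same three vectors defined earlier (repeated here so the block runs on its own). The names keep_or and keep_in are made up.

```r
GDP <- c(47, 23, 61, 29, 80, 48, 92, 42)
Country <- c("USA", "Italy", "Egypt", "Mexico", "Japan", "UK", "Germany", "Brazil")
Continent <- c("NA", "EU", "AF", "NA", "AS", "EU", "EU", "SA")

keep_or <- Continent == "EU" | Continent == "NA"   # spelled out with "or"
keep_in <- Continent %in% c("EU", "NA")            # the same test with %in%

print(identical(keep_or, keep_in))   # TRUE
print(GDP[keep_in])                  # 47 23 29 48 92
print(Country[GDP > 50])             # "Egypt" "Japan" "Germany"
```

The %in% form scales better: adding a third continent only means adding one more string to the vector.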

Overall, vectors are of extreme interest to us. I want to take a moment to show some helpful, prebuilt statistical functions in R. For example, we can use mean(), sum(), sd(), var(), summary(), t.test(), etc. There are also normal distribution functions rnorm(), qnorm(), dnorm(), and pnorm().

N <- 100
set.seed(789) #this is so we can replicate randomness
x <- rnorm(N, mean = 1, sd = 4)
y <- rnorm(N, mean = 0, sd = 4)
summary(x)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-11.3505  -1.8126   0.3370   0.8459   3.6653  10.3941 
summary(y)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-11.91203  -2.92133  -0.07153  -0.28624   2.27216  12.20789 
t.test(x,y)

    Welch Two Sample t-test

data:  x and y
t = 1.9832, df = 197.21, p-value = 0.04873
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.006345714 2.257987623
sample estimates:
 mean of x  mean of y 
 0.8459310 -0.2862356 
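
The normal distribution functions mentioned above are not used in that block apart from rnorm(), so here is a hedged sketch of the other three, plus a mean computed by hand for comparison with mean(). The seed and sample size are arbitrary choices for illustration.

```r
p <- pnorm(1.96)    # cumulative probability P(Z <= 1.96), roughly 0.975
q <- qnorm(0.975)   # the inverse of pnorm: roughly 1.96
d <- dnorm(0)       # density of the standard normal at 0: 1 / sqrt(2 * pi)

set.seed(123)       # arbitrary seed, used only so the draws are reproducible
draws <- rnorm(1000, mean = 1, sd = 4)

# The mean "by hand" agrees with the built-in function
manual_mean <- sum(draws) / length(draws)
print(c(p, q, d))
print(all.equal(manual_mean, mean(draws)))   # TRUE
```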

#Matrices

In the following section, we will learn a little bit about how matrix operations work in R. Typically, most data is held in data frames, but first it is valuable to look into matrices. We will begin with two vectors of 5 randomly generated numbers.

#rnorm generates n numbers from a normal distribution ... use ?rnorm for help
#rbind - combine by row, or stack on top
#cbind - combine by column, or squish next to each other
set.seed(456) #set.seed allows us to replicate randomness
x1 <- rnorm(5); x1
[1] -1.3435214  0.6217756  0.8008747 -1.3888924 -0.7143569
x2 <- rnorm(5); x2
[1] -0.3240611  0.6906430  0.2505479  1.0073523  0.5732347
rbindEx <- rbind(x1, x2); rbindEx
         [,1]      [,2]      [,3]      [,4]       [,5]
x1 -1.3435214 0.6217756 0.8008747 -1.388892 -0.7143569
x2 -0.3240611 0.6906430 0.2505479  1.007352  0.5732347
cbindEx <- cbind(x1, x2); cbindEx
             x1         x2
[1,] -1.3435214 -0.3240611
[2,]  0.6217756  0.6906430
[3,]  0.8008747  0.2505479
[4,] -1.3888924  1.0073523
[5,] -0.7143569  0.5732347
rbindEx * 10
         [,1]     [,2]     [,3]      [,4]      [,5]
x1 -13.435214 6.217756 8.008747 -13.88892 -7.143569
x2  -3.240611 6.906430 2.505479  10.07352  5.732347

Now, we have created two matrices with our data. You can transpose matrices (t()), multiply them (%*%), or, if they are square and invertible, invert them (solve()). Instead of going too far into matrices, we’re going to move on to lists.
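
As a brief sketch of those three operations, here is a small 2x2 matrix with made-up values. A square matrix is used because solve() can only invert square, non-singular matrices.

```r
# Rows are (2, 1) and (0, 3); the values are arbitrary
A <- rbind(c(2, 1), c(0, 3))

A_t   <- t(A)         # transpose: rows become columns
A_sq  <- A %*% A      # matrix product (A * A would be element-wise instead)
A_inv <- solve(A)     # matrix inverse
check <- A %*% A_inv  # should be the 2x2 identity, up to rounding error

print(A_t)
print(A_sq)
print(round(check, 10))
```

With the matrices from the previous block, rbindEx %*% cbindEx would also work, since a 2x5 matrix times a 5x2 matrix gives a 2x2 result.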

#Lists

Lists are a super flexible way to collect things. Here is an example of a list:

x <- list(c("a", "b", "c"),
          "d",
          list(1,
               c(2, 3),
               "alex")
          )

Here, my list x has three elements: a character vector, a single character string, and then another list. To access the elements of a list, we need double square brackets, like x[[1]], which will yield the entire vector c("a", "b", "c"). x[[1]][1] will yield "a".
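
The access patterns described above can be sketched as follows, using the same list x (redefined here so the block runs on its own). The last two lines illustrate the difference between single and double brackets.

```r
x <- list(c("a", "b", "c"),
          "d",
          list(1, c(2, 3), "alex"))

print(length(x))     # 3 elements
print(x[[1]])        # the whole character vector: "a" "b" "c"
print(x[[1]][1])     # first item of that vector: "a"
print(x[[3]][[2]])   # second item of the nested list: 2 3

print(class(x[1]))   # "list": single brackets return a smaller list
print(class(x[[1]])) # "character": double brackets return the element itself
```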

We can also name elements in a list and use these names to access certain elements (see footnote 3).

atm <- list(
  
  balance = c(5, 2, 3),
  name = list(
    first = c("alex", "bryan", "brad"),
    last = c("cardazzi", "mccannon", "humphreys")
    ),
  favoriteSport = c("basketball", "nascar", "wvu football")
  )

atm$favoriteSport
[1] "basketball"   "nascar"       "wvu football"
atm$name$first
[1] "alex"  "bryan" "brad" 
atm[[2]][[1]]
[1] "alex"  "bryan" "brad" 
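
Footnote 3 contrasts = and <- inside list(). A minimal sketch of that difference is below; the names balance_eq and balance_arrow are hypothetical and chosen so they are unlikely to clash with anything already in your environment.

```r
# Using = inside list() names the element; no separate variable is created
named <- list(balance_eq = c(5, 2, 3))
print(names(named))            # "balance_eq"
print(exists("balance_eq"))    # FALSE, assuming no such variable was defined elsewhere

# Using <- inside list() creates a variable AND stores the values unnamed
unnamed <- list(balance_arrow <- c(5, 2, 3))
print(names(unnamed))          # NULL: the list element has no name
print(exists("balance_arrow")) # TRUE: the variable now exists outside the list
```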

You might imagine taking this list and converting it into something that looks like a table. This is where data frames come in.

#Data Frames

Data frames are just a special case of a list. You can think of a data frame as a matrix, or an Excel sheet, where each column has a name. Typically, we read in data from CSV, Excel, or JSON files, among others. In this tutorial, I will strictly use CSV files. Here is how to save your Excel files as CSVs if you are unsure how. I am going to use a file called knicks.csv, which can be found here. We will talk more about reading data into R later.

knicks <- read.csv("C:/Users/alexc/Desktop/Empirical Workshop/data/knicks.csv", stringsAsFactors = FALSE)
dim(knicks) #returns the rows and columns of a data frame
[1] 23 10
head(knicks) #head will display the first 6 rows just so you can take a peek at the data
  No.                    Player Pos     Ht Ht_inches  Wt       Birth.Date  X
1   0   Kadeem Allen\\allenka01  SG  1-Jun        73 200  January 15 1993 us
2  31      Ron Baker\\bakerro01  SG  4-Jun        76 220    March 30 1993 us
3  23     Trey Burke\\burketr01  PG Jun-00        73 175 November 12 1992 us
4  21 Damyean Dotson\\dotsoda01  SG  5-Jun        77 202       May 6 1994 us
5  13 Henry Ellenson\\ellenhe01  PF 10-Jun        82 240  January 13 1997 us
6   3  Billy Garrett\\garrebi01  SG  6-Jun        78 213  October 16 1994 us
  Exp        College
1   1        Arizona
2   2  Wichita State
3   5       Michigan
4   1 Oregon Houston
5   2      Marquette
6   R         DePaul

We used square brackets when subsetting vectors, and we will do the same here. However, there are now two dimensions (rows and columns) as opposed to the one-dimensional vectors from before. To obtain just the weight column, we can use knicks$Wt. To get the weights of players who did not attend college, we can use knicks$Wt[knicks$College == ""]. Note the column knicks$Exp. It contains numbers, but all of them have quotes around them. This is because some players have an “R” in their row, which means the player is a rookie and this is their first year in the NBA. In other words, if a vector has numeric elements and even one character element, the whole vector is converted to character. You can force a vector to be numeric with as.numeric(). Here, the “R” will be replaced with NA, since the program can turn "9" into 9 but cannot convert "R" into a number.
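
Since knicks.csv may not be at hand, here is a sketch of the same ideas on a tiny, made-up data frame called roster. The players and values are invented; only the column names mirror the real file.

```r
# Hypothetical stand-in for the knicks data
roster <- data.frame(
  Player  = c("Player A", "Player B", "Player C", "Player D"),
  Pos     = c("SG", "PG", "C", "PF"),
  Wt      = c(200, 175, 250, 240),
  Exp     = c("1", "5", "R", "2"),
  College = c("Arizona", "Michigan", "", "Marquette"),
  stringsAsFactors = FALSE
)

no_college_wt <- roster$Wt[roster$College == ""]   # one column, filtered by logic
guards <- roster[roster$Pos %in% c("PG", "SG"), c("Player", "Wt")]  # [rows, columns]

print(class(roster$Exp))   # "character", because of the "R"
# as.numeric() warns that NAs were introduced; the warning is silenced here
exp_num <- suppressWarnings(as.numeric(roster$Exp))
print(exp_num)             # 1  5 NA  2
```

In the two-dimensional form, the part before the comma selects rows and the part after it selects columns; leaving either side empty keeps everything along that dimension.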

Homework 1

For this homework, use knicks.csv. Try your best not to “hard code” anything. For example, there are 23 Knicks players, but try not to type the number 23 in your code. This keeps the script as flexible as possible, in case you have to repeat this code for a completely different roster. Use View(knicks) to look at the dataset.

  • Find the average weight of the Knicks roster.
  • Find the range of the weights.
  • Find the standard deviation of the weights.
  • Calculate the standard deviation of the weights without using either sd() or var().
  • Test if there is a significant difference in the weights of point guards & shooting guards relative to the rest of the players. Hint: point guard = PG, shooting guard = SG in the Pos column.
  • Test if guards (PG, SG) tend to have lower jersey numbers than other positions, but remove the centers (C).
  • Find the average experience of the players.
  • Generate a vector of the names of the 5 lightest players.
  • Generate a vector of the names of the 5 heaviest players.
  • Drop the rows where players did not go to college or they are foreign. Save this as knicks2
  • Drop the College column from the dataset.
  • Check if there is a significant correlation between jersey number and experience.
  • Make a correlation matrix for weight, jersey number and experience.

Footnotes

  1. These tutorials assume you have Windows. The differences between Mac, Windows, and Linux mostly have to do with reading in data, since file paths are different. For example, in Windows, you might see a file path like C:/Users/alexc/Documents, whereas on a Mac it might look like ~/Documents. We will cover this later.↩︎

  2. If you are interested, a link↩︎

  3. This is a good example of when to use <- vs =. Take “balance” as an example. The equal sign makes “balance” exist as a named element inside the list, but not in our Global Environment pane. Had I instead written balance <- c(5, 2, 3), the values would still be stored in the atm list (as an unnamed element), and a separate variable called balance would also be created in our Global Environment. Usually, this is not desired, but there may be occasions where it is.↩︎