Module 1.3: Using R

Author: Alex Cardazzi
Affiliation: Old Dominion University

All materials can be found at alexcardazzi.github.io.

More R

By now, you can probably use R like a calculator – adding and subtracting single numbers, etc.

Calling this thing a ‘phone’ is like calling a Lamborghini a… cupholder. An incredibly elaborate cupholder.

Gary Gulman, referencing an iPhone

However, there are a lot of features that make R the Lamborghini of calculators.

Variables

In R, you can assign names to values (remember, objects). You do this by using either <- or =. Online, when Googling, you may find solutions with both. Despite what you might read, there are differences between the two, but we can ignore those differences for right now.

Why would you want to assign names to values? This allows your code to be much more flexible. Consider the following example.

Code
5 + 3
5 / 10
as.character(5)
Output
[1] 8
[1] 0.5
[1] "5"
Code
x <- 5
x + 3
x / 10
as.character(x)
Output
[1] 8
[1] 0.5
[1] "5"

Naming 5 as x allows us to change x only once, and the entire code will run with the new value. This will (1) reduce our effort, (2) decrease typos and bugs, and (3) increase readability. Now, x is not a readable name, per se, but this is just an example.

This may seem like a simple point, but it is very important. If you manipulate a variable in any way, but do not re-assign it to a name (same or different), it does not get updated/saved. Consider the following example.

Code
x <- 10
x + 5

x
Output
[1] 15
[1] 10
Code
x <- 10
x <- x + 5

x
Output
[1] 15
Code
x <- 10
y <- x + 5
x
y
Output
[1] 10
[1] 15

In the following WebR chunk, calculate your age in months. Assign birth a value equal to your birth year times twelve plus the number corresponding to your birth month. Next, assign now the current year times twelve plus the number corresponding to this month. Finally, subtract these two numbers, assign the result to age_in_months, and print age_in_months.

Solution
# Suppose it is September 2024 and you were born in April 2004.
birth <- (2004*12) + 4
now <- (2024*12) + 9
age_in_months <- now - birth
print(age_in_months)
Output
[1] 245

Now, if you wanted, you could modify this code to calculate ages in months for your friends and family by simply changing the value of birth! This should hopefully highlight the advantages of using variables instead of numeric values whenever possible.

Naming Variables

It is important to choose informative names for your variables. Generally, single (or few) character names are easy to type, but can easily lose meaning. Too-long names aren’t great if you need to type them over and over. You will figure out a sweet spot for yourself.

There are some names you cannot use for your variable names, and other names that you simply shouldn’t. For example, you cannot start a variable name with a number. You cannot start names with certain punctuation either. On the other hand, you should not name things after already-used words that are native to R. This will just lead to confusing code. For example, do not name anything mean, because that is already a function name that is native to R.

Learning what is and what is not a good variable name takes time and practice.

Collections

So far, we have only worked with single values. Data tends to come in sets of multiple values, like large spreadsheets with columns and rows. Let’s build up to the R version of “spreadsheets”, which are called data.frames. We will touch on each of the following ways to store multiple values:

  • Vectors
  • Matrices
  • Lists
  • data.frame

Vectors

In R, a vector is a collection of values that are all of the same type. We use c() to create vectors. The c stands for combine. Once we have our vector, we can apply different operations to it like we did before. For example, we know how to add two values, but what about a vector and a single value? Or two vectors?
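
Here is a minimal sketch of both cases. The name vec1 comes from the discussion here; vec2 and the specific values are assumptions chosen for illustration.

```r
# Two small numeric vectors (values are made up for illustration)
vec1 <- c(1, 2, 3)
vec2 <- c(10, 20, 30)

# A vector plus a single value
vec1 + 10    # 11 12 13

# A vector plus another vector of the same length
vec1 + vec2  # 11 22 33
```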

Notice how when we added 10 to vec1, 10 was added to each element of vec1. However, when we added the two vectors, addition was element-wise. If two vectors are of different lengths, R will “recycle” the shorter one to match the longer one.
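
A small sketch of recycling, using the two vectors of unequal length that the explanation below refers to (the names short and long are my own):

```r
# Vectors of different lengths (names are hypothetical)
short <- c(1, 2, 3)
long <- c(1, 2, 3, 4)

# R recycles the shorter vector and warns that the lengths do not line up
short + long  # 2 4 6 5, along with a warning
```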

Explanation: The results of this are 1 + 1, 2 + 2, 3 + 3, and 1 + 4. Notice how the first vector has to loop back around to the beginning to match the length of the second vector. This is effectively just adding c(1, 2, 3, 1) and c(1, 2, 3, 4). Also, again note the warning generated by R. It’s not often that you will add two vectors of unequal length, so this should be a flag to you that maybe there’s an issue.

As a quick aside, R has good help functionality. To access this, you need to put a ? in front of whatever you want help with. For example, suppose you need help with the mean function from before.
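
The line in question would look something like this:

```r
# Open the documentation for the mean function
?mean

# help("mean") is an equivalent way to make the same request
```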

Running this line (as a reminder: ctrl + enter in RStudio) will bring you to the function’s documentation.

In the documentation, the usage of mean is shown as: mean(x, trim = 0, na.rm = FALSE, ...)

  • x, trim, and na.rm are the function’s arguments. These are inputs, and the function gives you an output.
    • x is the vector you want the mean of. For example, x <- c(1, 4, 8, 7, 2).
    • trim is the fraction of observations (elements in the vector) to be removed from each end of the sorted vector before taking the mean. You might want to remove the top and bottom 5% of observations since they might be outliers.
    • na.rm is a boolean. When it is TRUE, NA values are removed before the mean is calculated.
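
To see these arguments in action, here is a short sketch using the x from above. The trim value of 0.2 and the name x_missing are assumptions picked for illustration.

```r
x <- c(1, 4, 8, 7, 2)

mean(x)              # 4.4

# trim = 0.2 drops 20% of the observations from each end.
# With five values, that removes the smallest (1) and largest (8),
# leaving the mean of 2, 4, and 7.
mean(x, trim = 0.2)

# A single NA makes the mean NA unless na.rm = TRUE
x_missing <- c(x, NA)
mean(x_missing)                # NA
mean(x_missing, na.rm = TRUE)  # 4.4
```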

R also has different ways (functions) to generate vectors. Explore some of them below:
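
The notes do not say which functions are meant here, so the following is just my pick of a few common ones from base R: the : operator, seq(), and rep().

```r
# Integers from 1 to 5
one_to_five <- 1:5
one_to_five      # 1 2 3 4 5

# A sequence with a custom step size
quarters <- seq(from = 0, to = 1, by = 0.25)
quarters         # 0.00 0.25 0.50 0.75 1.00

# Repeat a value, or repeat each element of a vector
rep("a", times = 3)     # "a" "a" "a"
rep(c(1, 2), each = 2)  # 1 1 2 2
```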

What happens if you have a vector of elements that are of different types?
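
A small sketch you can start from (the name mixed and the values are made up):

```r
# Mixing numbers and text in one vector
mixed <- c(1, "two", 3)
mixed         # "1" "two" "3"
class(mixed)  # "character"

# Mixing numbers and booleans
c(1, TRUE, FALSE)  # 1 1 0
```

Since a vector can only hold one type, R converts every element to a type that can represent all of them.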

Experiment with the chunk above. Does the resulting vector change depending on the number of character values vs numeric values? Does the resulting vector change depending on the first entry of the vector?

Let’s suppose you only want a part of a vector. You can select elements from vectors by index (position within the vector) or by boolean values. You do this by typing the vector’s name, followed by square brackets containing another vector of indices or boolean values. Experiment with the following examples.
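
Here are a few sketches of each approach, assuming a made-up vector called vec:

```r
vec <- c(10, 20, 30, 40, 50)

# By index (position)
vec[2]        # 20
vec[c(1, 3)]  # 10 30

# By boolean values: keep the positions marked TRUE
vec[c(TRUE, FALSE, TRUE, FALSE, TRUE)]  # 10 30 50

# Booleans usually come from a logical test on the vector itself
vec[vec > 25]  # 30 40 50
```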

Another important way to subset vectors is with the %in% operator. Suppose you have a vector of years as follows: c(2006, 2006, 2003, 2005, 2012, 2002, 2016, 2006, 2008). If you were to subset the vector where you only kept elements where years were equal to 2006, 2007, or 2008, you would have to write the following:
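
A sketch of the long way, using the vector above and assuming it is stored under the name years:

```r
years <- c(2006, 2006, 2003, 2005, 2012, 2002, 2016, 2006, 2008)

# One comparison per year, joined with | (which means "or")
years[years == 2006 | years == 2007 | years == 2008]  # 2006 2006 2006 2008
```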

This can be very tedious, is prone to typos, and becomes infeasible if the list is much longer (i.e., not just three years). As a shortcut, R has the following:
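
The same subset written with %in% might look like this (again assuming the vector is named years):

```r
years <- c(2006, 2006, 2003, 2005, 2012, 2002, 2016, 2006, 2008)

# Keep elements that appear anywhere in the second vector
years[years %in% c(2006, 2007, 2008)]  # 2006 2006 2006 2008

# The set of values to match can be generated instead of typed out
years[years %in% 2006:2008]            # 2006 2006 2006 2008
```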

Matrices

A collection of vectors (of similar type and length) is called a matrix. Matrices have two dimensions: rows and columns. Matrices look like: example_mat[rows,cols]. To create matrices from vectors, you can use rbind() (to stack row-wise) or cbind() (column-wise). Let’s start by assuming you have a few vectors to work with.
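
As a sketch, suppose the vectors are the following (the names v1, v2, by_row, and by_col are made up for illustration):

```r
v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)

# Stack the vectors as rows: 2 rows, 3 columns
by_row <- rbind(v1, v2)
print(by_row); cat("\n")

# Place the vectors side by side as columns: 3 rows, 2 columns
by_col <- cbind(v1, v2)
print(by_col); cat("\n")

dim(by_row)  # 2 3
dim(by_col)  # 3 2
```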

I am using ; cat("\n") to break up the output. This is purely for aesthetics and should be ignored.

Another way to generate matrices would be to put one giant vector into the matrix function. Of course, you will need to give matrix() a bit of help. You need to tell it something about the dimensions you’d like. This could be ncol for number of columns or nrow for number of rows. In addition, you should specify whether the vector is “by row” or not (i.e. “by column”).

If r1 denotes an element belonging on the first row, etc.:
A “by row” vector would be c(r1, r1, r1, r2, r2, r2, r3, r3, r3)
A “by column” vector would be c(r1, r2, r3, r1, r2, r3, r1, r2, r3)
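
A minimal sketch of the difference, using the numbers 1 through 6 as the giant vector (the names values, m_row, and m_col are my own):

```r
values <- 1:6

# Fill the matrix one row at a time
m_row <- matrix(values, nrow = 2, byrow = TRUE)
m_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

# Fill the matrix one column at a time (this is R's default)
m_col <- matrix(values, nrow = 2, byrow = FALSE)
m_col
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
```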

Once you have the matrix of your dreams, you may need to access certain columns or rows. Remember: example_mat[rows, columns]. For vectors, if you want the first element, you would use example_vec[1]. For a matrix, example_mat[1,] will give you the first row, example_mat[,1] will give you the first column, and example_mat[i,j] will give you the element in the i-th row and j-th column. To select multiple rows, you can use logic or indices, much like vectors. Explore the code below:
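
Here is a sketch to explore, assuming example_mat is a 3 by 3 matrix filled by row with the numbers 1 through 9:

```r
example_mat <- matrix(1:9, nrow = 3, byrow = TRUE)

example_mat[1, ]   # first row: 1 2 3
example_mat[, 1]   # first column: 1 4 7
example_mat[2, 3]  # second row, third column: 6

# Multiple rows by index
example_mat[c(1, 3), ]

# Multiple rows by logic: rows whose first column is greater than 1
example_mat[example_mat[, 1] > 1, ]
```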

Lists

Lists are similar to vectors in that they allow for the collection of elements. However, with lists, each element can be of a different type. In fact, each element of a list can be an entire vector! Lists, for this reason, are incredibly flexible. In fact, this flexibility can actually make it difficult to work with lists. Try exploring the lists below.
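
A small sketch to start with (the list stuff and its contents are made up). Note that double square brackets pull out the element itself, while single square brackets return a smaller list.

```r
# Each element has a different type, and some are entire vectors
stuff <- list(42, "hello", c(TRUE, FALSE), c(1.5, 2.5, 3.5))

length(stuff)      # 4
stuff[[2]]         # "hello"
stuff[[4]][2]      # 2.5

class(stuff[2])    # "list"
class(stuff[[2]])  # "character"
```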

An interesting feature of lists is that you can name the elements within the list. This is possible with vectors as well, but not as useful. Here are some examples of naming and using the names within lists.
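
For instance, a sketch with a hypothetical list called person:

```r
person <- list(name = "alex",
               albums = 0,
               favorite_years = c(2006, 2008))

names(person)  # "name" "albums" "favorite_years"

# Access elements by name with $ or with [[ ]]
person$name                    # "alex"
person[["favorite_years"]][1]  # 2006
```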

Changing the format of the list a little bit:
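
As a sketch, consider a list made of two named vectors (the name people and its contents are assumptions for illustration):

```r
people <- list(first = c("alex", "jalen", "thom"),
               last = c("cardazzi", "brunson", "yorke"))
people

# lengths() reports the number of elements in each piece of the list
lengths(people)  # first: 3, last: 3
```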

This is a special list because both vectors of the list have the same number of elements, or observations. When this happens, we have something called a data.frame. Really, this is just how R represents spreadsheets – a collection of columns all with the same number of rows!

data.frame

So, what do data.frames look like?

Observations can be accessed in data.frames via $ or [. These objects combine features of lists and matrices, which makes them a good match for the types of data that are most common in the real world.

Code
df <- data.frame(first = c("alex", "jalen", "thom"),
                 last = c("cardazzi", "brunson", "yorke"),
                 num_of_albums = c(0, 0, 10),
                 nba_seasons = c(0, 5, 0),
                 phds = c(1, 0, 0),
                 birth_country = c("us", "us", "uk"))
df
Output
  first     last num_of_albums nba_seasons phds birth_country
1  alex cardazzi             0           0    1            us
2 jalen  brunson             0           5    0            us
3  thom    yorke            10           0    0            uk

Suppose you want to subset the df object that you’ve created. Again, there are different ways to do this. Like matrices, to get some rows and all columns, you would use df[lim,] where lim is a vector of boolean values or indices. Leaving nothing following the comma indicates to R that you want everything in that dimension. To get columns, you can reverse this (df[,3:4]) or use names (df[,c("nba_seasons", "phds")]). If you only want a single column, of course, you can use df$phds.
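
Here is a sketch of each of these, rebuilding df so the chunk runs on its own. The particular conditions used for lim and in the last line are made up for illustration.

```r
df <- data.frame(first = c("alex", "jalen", "thom"),
                 last = c("cardazzi", "brunson", "yorke"),
                 num_of_albums = c(0, 0, 10),
                 nba_seasons = c(0, 5, 0),
                 phds = c(1, 0, 0),
                 birth_country = c("us", "us", "uk"))

# Rows: lim can be a vector of boolean values...
lim <- df$birth_country == "us"
df[lim, ]

# ...or a vector of indices
df[c(1, 3), ]

# Columns: by position or by name
df[, 3:4]
df[, c("nba_seasons", "phds")]

# A single column
df$phds  # 1 0 0

# Rows and columns at the same time
df[df$nba_seasons > 0, "last"]  # "brunson"
```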

Please re-read this last part. Subsetting data is one of the most important and most used operations you will learn throughout this course. If I had a dollar every time a student asked me to remind them how to do this, I would be able to retire tomorrow.

As a final note about data.frames, here are a few important functions:

  • nrow(): Returns the number of rows in a data.frame.
  • ncol(): Returns the number of columns in a data.frame.
  • colnames(): Returns the names of columns in a data.frame.
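
Applied to the df from above (rebuilt here so the chunk runs on its own), these functions give:

```r
df <- data.frame(first = c("alex", "jalen", "thom"),
                 last = c("cardazzi", "brunson", "yorke"),
                 num_of_albums = c(0, 0, 10),
                 nba_seasons = c(0, 5, 0),
                 phds = c(1, 0, 0),
                 birth_country = c("us", "us", "uk"))

nrow(df)      # 3
ncol(df)      # 6
colnames(df)  # "first" "last" "num_of_albums" "nba_seasons" "phds" "birth_country"
```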

Using R

Now that we have some knowledge about how to work with data in R, how do we access data from different sources (e.g. spreadsheets, etc.)?

Before we can answer that question, we first need to set up a file where we can write code. Most of the time, people will use .R files to write their code. .R is simply the file extension, much like how Word files end in .docx or Excel files end in .xlsx. So, your file might be named something like analysis.R. However, in this class, almost all of the code you will write will be inside .qmd files. .qmd stands for “Quarto Markdown”, and you can think of it as a way of merging R with a text editor (like Word, etc.). In other words, you will be able to generate everything (code, text, tables, figures) inside of a single file. This is helpful because you will never have to manually change numbers, etc., as all you will need to do is modify the underlying code!

.R vs .qmd

You can think of a .qmd as a text document with a bunch of mini .R files embedded throughout. Every time you want to switch from text to code, you just need to make a “code chunk” where you can type in your code. Sound familiar? These notes are all generated using .qmd files! Aside from this, there are not too many more differences between these two file types. Moreover, for each assignment you turn in, there will be an accompanying .qmd template you can use. Please see the class videos for more information on how to use the templates. As I discuss how to use R in the rest of the submodule, I will point out the differences between .R and .qmd when they exist.

Writing a Script

To write a new script, click on the top left button underneath “File”. You should be able to see a white paper icon with a green +. This will open up a menu of different files. Just select “R Script” for now, but note the ability to select “Quarto Document”.

So, you now have your first .R file open… now what?

For beginners, as a template, the top of your script should look like the following:

Code
# library("")

rm(list = ls())
setwd(".")
  • library(""): This is where you load any packages or libraries you might want. We will discuss packages later. This is commented out for now since we are not yet ready to load any packages/libraries.
  • rm(list = ls()): This is how you clear your environment to “start fresh”. It is a good idea to start with an empty environment so you don’t get confused between what is old and what is new.
  • setwd("."): This is where you set your working directory.

Working Directories

  • Computers have different folders, sometimes called directories. For example, you might have a folder on your computer called “Documents”. To stay organized, you might make a folder inside “Documents” called “Econ 311” where you will put all of your “stuff” for this course. You might make more folders inside this folder called “HW”, “Lectures”, etc.
  • Suppose you save a file to your “HW” folder which you’ve named “HW01.R”. This file’s path looks like Documents/Econ 311/HW/HW01.R.
  • Now, suppose you have some data in this folder, and you want R to find it and read it. Well, you can’t tell R just the name of the file because it doesn’t know what folder, or directory, to look in.
  • So, to tell R, you can do one of two things:
    1. Supply the file’s entire path along with its name.
    2. Tell R where all the files you’re working on will live.

R has two helpful functions for dealing with working directories. First, getwd() tells you where R is currently looking.

Code
getwd()
Output
[1] "C:/Users/alexc/Dropbox/teaching/Fall 2024/econ311/module01"

Then, if I wanted to change this, I would use setwd(). Here, I could either:

  1. Type in the entire new working directory
  2. Navigate to the new working directory
Code
# Two periods means "go back one level"
# So, if we were in "Documents/Econ 311/HW01"
# setwd("..") would bring us to "Documents/Econ 311"
setwd("..")

# If there was another folder inside "HW01",
# (for example, suppose you have a folder "Data" inside "HW01")
# From "HW01", you can navigate to "Data" like:
setwd("Data")

# Maybe you want to go from "HW01" to "HW02/Data"
# You would have to back out from HW01 (using "..")
# Then go into HW02 and Data
setwd("../HW02/Data")

Of course, to use the ".." trick, you need to know where you’re starting from (i.e. with getwd()). If you are unsure, you could always type in your entire working directory in one go.
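
If you want to practice the ".." trick without touching your real folders, here is a sketch that builds a hypothetical folder layout inside R's temporary directory. The folder names mirror the comments above; the names base and old_wd are my own, and the temporary location is an assumption made only so the chunk can run anywhere.

```r
# Build a throwaway layout: Econ 311/HW01/Data and Econ 311/HW02/Data
base <- file.path(tempdir(), "Econ 311")
dir.create(file.path(base, "HW01", "Data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(base, "HW02", "Data"), recursive = TRUE, showWarnings = FALSE)

old_wd <- getwd()  # remember where we started

setwd(file.path(base, "HW01"))
basename(getwd())  # "HW01"

# Back out of HW01, then go into HW02 and Data
setwd("../HW02/Data")
basename(getwd())           # "Data"
basename(dirname(getwd()))  # "HW02"

setwd(old_wd)  # return to the original working directory
```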

Code
# This is my working directory:
setwd("C:/Users/alexc/Dropbox/teaching/Fall 2024/econ311/HW02/Data")

A note about .R vs .qmd: RStudio’s default working directory is likely in your basic “Documents” folder. So, when using .R scripts, you almost always need to use setwd() and point R to the folder with your data. However, .qmd files are a bit smarter than .R files, and they know where they’re saved. The advantage of this is that if your data and .qmd file are in the same folder, you do not need to use setwd(), or you can easily navigate to the folder via the “dots” method.

Reading Data

Now that R knows where to look for data, we need to import the data so we can use it. Most of the time in this course, we will use files that end in .csv. This stands for “comma separated values”. .csv files are very common and require relatively small amounts of storage. .csv files can also be opened in Excel (you just might get a warning that any Excel formulas you write will not be saved). To create a .csv from an .xlsx (Excel) file, just use “Save As” in Excel, and change the file type to “Comma Separated Values (.csv)”.

Now, let’s read in a file called “ford_escort.csv”. On my machine, the file lives in a folder called C:/Users/alexc/Dropbox/teaching/Fall 2024/econ311/data. Since these notes are generated via .qmd, and this file is saved in C:/Users/alexc/Dropbox/teaching/Fall 2024/econ311/module01, ford_escort.csv’s relative filepath is ../data/ford_escort.csv. Rather than changing my working directory, I can just use this relative filepath. To import this data, we will use the read.csv() function.

Code
# Since my working directory is in "econ311/module01",
# But the file is in "econ311/data",
# I need to back out of "module01" and navigate to the data folder
ford <- read.csv("../data/ford_escort.csv")
dim(ford); cat("\n") # dim() gives the number of rows and columns
head(ford) # the head() function displays the first 6 rows.

# I could have also done:
# setwd("../data")
# ford <- read.csv("ford_escort.csv")
Output
[1] 23  3

  Year Mileage..thousands. Price
1 1998                  27  9991
2 1997                  17  9925
3 1998                  28 10491
4 1998                   5 10990
5 1997                  38  9493
6 1997                  36  9991

You can also read data straight from a URL if it has been uploaded correctly. Most of the data for this class can be read into R this way. Try using read.csv() in the following chunk. Be sure to put quotes around the URL!

Solution
ford <- read.csv("https://alexcardazzi.github.io/econ311/data/ford_escort.csv")

Manipulating Data

Now that we have our data read into R, let’s manipulate it a bit. This might feel like a lot at once, but try to stay with me. You can experiment in the previous WebR chunk or in RStudio if you’d like.

  1. Rename the second column from Mileage..thousands. to mileage.
  2. Calculate the number of Ford Escorts that were priced below $9000.
  3. Currently, mileage is in thousands of miles. Multiply it by 1000 to convert it to just miles, and save over the original variable.
  4. Calculate the price per mile ($/mi) for each vehicle, and save it as a new variable.
  5. Calculate the average price per mile.
  6. Use the range() function to find the minimum and maximum price per mile.

As a note, you can use cat() to combine text and code. Put "\n" at the end to make a new line. You can experiment with this on your own.

Code
# colnames(ford) # Take a look at the column names.
colnames(ford)[2] <- "mileage" # change the second name

# ford$Price < 9000 # this gives boolean (T/F) values.
# Since R treats TRUE as 1 and FALSE as 0, use sum()
cat("Number of Escorts less than $9,000:", sum(ford$Price < 9000), "\n")

ford$mileage <- ford$mileage * 1000 # Multiply by 1000 and save
ford$cost_per_mile <- ford$Price / ford$mileage # Create $/mi

cat("Average Cost per Mile:", mean(ford$cost_per_mile), "\n") # average
cat("Range of Cost per Mile:", range(ford$cost_per_mile)) # min and max
Output
Number of Escorts less than $9,000: 5 
Average Cost per Mile: 0.4315369 
Range of Cost per Mile: 0.08325 2.198

Plotting

An especially attractive feature of R (that does not differ between .R and .qmd) is its powerful graphics. Just Google “Best R Plots”, or something, and you’ll see what I mean.

To start, we’ll learn some of the basics. We will begin by generating two scatter plots using data from ford.

  1. Plot mileage vs Price.
  2. Plot mileage vs cost_per_mile.
Code
plot(ford$mileage, ford$Price)

# This is the same as:
# plot(x = ford$mileage, y = ford$Price)
Plot

Below are some examples of additional arguments for the plot() function.

  1. las = 1 makes the axis tick labels horizontal, which rotates the text on the y-axis. Other values (0, 2, or 3) give other orientations.
  2. col sets the colors used in the plot. This can take a vector of colors.
  3. pch sets the type of point used.
  4. cex sets the size of the points. The default is 1.
  5. main sets the title of the plot.
  6. xlab sets the name of the x-axis.
  7. ylab sets the name of the y-axis.
Code
plot(ford$mileage, ford$cost_per_mile, las = 1,
     pch = 23, cex = 1.2,
     col = "tomato", main = "Cost vs Mileage",
     xlab = "Mileage", ylab = "Price per Mile")
Plot

Try tweaking some of these options in the following WebR chunk:

We can also add reference lines to the plot, and also make the colors a bit more complex.

Code
# Set all colors as "tomato"
ford$point_color <- "tomato"
# If the Year is less than the mean year, color it "dodgerblue"
# Of course, these are therefore the "older" cars
ford$point_color[ford$Year < mean(ford$Year)] <- "dodgerblue"
plot(ford$mileage, ford$cost_per_mile, las = 1,
     pch = 19, cex = 1.2,
     col = ford$point_color, main = "Cost vs Mileage",
     xlab = "Mileage", ylab = "Price per Mile")
abline(h = 1) # horiz. line at Y = 1
abline(v = mean(ford$mileage)) # vert. line at the mean of X
Plot

Of course, whenever you choose to add some differences in shapes, colors, etc., it’s helpful to add a legend to your plot. To do this, we can use the legend() function. This function accepts a few important arguments:

  • bty: setting this to "n" removes the box around the legend. I always use this option.
  • legend: this is the actual text to be displayed in the legend. It accepts a character vector, so if you colored your plot by men and women, you would use c("Men", "Women").
  • x, y: You can specify the exact coordinates of your legend, or you can specify things like: "topleft", "topright", "bottomleft", or "bottomright".
  • horiz: this accepts a boolean value, and turns the legend from vertical to horizontal.
  • Then, you will need to specify either pch or lty options to tell R if you want to display points or lines next to your legend.

Below is a plot with two legends (which is certainly redundant) to show off some of the different ways to customize the output.

Code
plot(ford$mileage, ford$cost_per_mile, las = 1,
     pch = 19, cex = 1.2,
     col = ford$point_color, main = "Cost vs Mileage",
     xlab = "Mileage", ylab = "Price per Mile")
legend("topright", pch = 19, bty = "n", horiz = TRUE,
       legend = c("Old Ford", "New Ford"), cex = 1.5,
       col = c("dodgerblue", "tomato"))
legend("bottomleft", lty = c(1, 2), pch = c(2, 19),
       legend = c("Old Ford", "New Ford"),
       col = c("dodgerblue", "tomato"))
Plot

When generating figures, you will sometimes need to add data from a different source to the same set of axes. As an example, let’s simply plot the data above, but in two steps instead of one.

To do this, we will use points(). This function accepts nearly every argument plot() does, except you are unable to impact the axes/labels of the plot.

Code
plot(ford$mileage[ford$point_color == "tomato"],
     ford$cost_per_mile[ford$point_color == "tomato"],
     las = 1, pch = 19, cex = 1.2,
     col = "tomato", main = "Cost vs Mileage",
     xlab = "Mileage", ylab = "Price per Mile")
points(ford$mileage[ford$point_color != "tomato"],
       ford$cost_per_mile[ford$point_color != "tomato"],
       pch = 19, cex = 1.2, col = "dodgerblue")
Plot

Notice how I am subsetting the data when plotting. This is an important thing to learn!

Once you get the hang of using plot() and points() in tandem, you’ll find it convenient that points() does not impact the axes. However, to start, this will be annoying. For example, let’s switch the order of the data in plot() and points().

Code
plot(ford$mileage[ford$point_color != "tomato"],
     ford$cost_per_mile[ford$point_color != "tomato"],
     las = 1, pch = 19, cex = 1.2,
     col = "dodgerblue", main = "Cost vs Mileage",
     xlab = "Mileage", ylab = "Price per Mile")
points(ford$mileage[ford$point_color == "tomato"],
       ford$cost_per_mile[ford$point_color == "tomato"],
       pch = 19, cex = 1.2, col = "tomato")
Plot

The plot is different because when plot() is setting the axes, it doesn’t know that you’re planning on using points() next. So, it scales the axes so the data fed into plot() “fits” the space.

To overcome this issue, we can use the following trick. The idea is to plot the point (0, 0) (or any point, really!), but use type = "n" so the point is not displayed. Then, within this plot() call, we can set ylim and xlim equal to range() of the variables we’ll plot so axes fit the data perfectly.

Code
plot(0, 0, type = "n",
     ylim = range(ford$cost_per_mile),
     xlim = range(ford$mileage), # range can include multiple vectors
     main = "Cost vs Mileage", las = 1,
     xlab = "Mileage", ylab = "Price per Mile")
points(ford$mileage[ford$point_color != "tomato"],
       ford$cost_per_mile[ford$point_color != "tomato"],
       pch = 19, cex = 1.2, col = "dodgerblue")
points(ford$mileage[ford$point_color == "tomato"],
       ford$cost_per_mile[ford$point_color == "tomato"],
       pch = 19, cex = 1.2, col = "tomato")
Plot

Of course, this is a lot more coding than the initial plot’s code. The idea of showing you this is that, now, you can always make sure your data “fits”. This is one of the little things that I use constantly, but it took me a long time to figure out.

Finally, adding lines to a plot is very similar in that one needs to use lines(). To illustrate, we will examine panel data on cigarette consumption by state (documentation).

Read in the data, and plot sales on the y-axis and year on the x-axis below. Be sure to clean the data where appropriate.

Solution
cig <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Cigar.csv")
cig$year <- cig$year + 1900
plot(cig$year, cig$sales, las = 1,
     ylab = "Sales", xlab = "Year")
Plot

This figure is very difficult to understand. Let’s trim it down to just a few states. In addition, we can add colors to the figure.

Unfortunately, the states seem to just be numbered instead of labeled, so we’ll just pick 1 through 5. In addition, there does not seem to be a state number 2.

Code
cig <- cig[cig$state %in% 1:5,]
plot(cig$year, cig$sales, las = 1,
     # since state is a number,
     #  we can just use this as the color
     col = cig$state,
     ylab = "Sales", xlab = "Year")
Plot

This plot can still be improved. It’d be a lot more natural to see the data as lines instead of points. To do this, we can use type = "l".

Code
plot(cig$year, cig$sales, las = 1,
     col = cig$state, type = "l",
     ylab = "Sales", xlab = "Year")
Plot

Notice two things about this plot. First, there’s only a single color. In R, you should think of a line as a single object, much like a single point. R cannot color different parts of a line differently, so it will just take the first color it’s given (here, it’s 1, which is black). Second, there are these three crazy diagonal lines that dash across the plot. This is because R is trying to connect all of the data into a single line. If you look closely, R is connecting the last year of one state to the first year of the next state.

To fix this, we need to use lines like we used points before. This is another example case of a time where we’ll want to set up the axes before we plot anything.

Code
# before, I plotted 0, 0
# now, I am simply keeping the data
#   in plot().
# this way, I don't need to set the axes
#   via the ylim and xlim arguments
plot(cig$year, cig$sales,
     las = 1, type = "n",
     ylab = "Sales", xlab = "Year")
lines(cig$year[cig$state == 1],
      cig$sales[cig$state == 1],
      col = 1)
lines(cig$year[cig$state == 3],
      cig$sales[cig$state == 3],
      col = 3)
lines(cig$year[cig$state == 4],
      cig$sales[cig$state == 4],
      col = 4)
lines(cig$year[cig$state == 5],
      cig$sales[cig$state == 5],
      col = 5)
legend("bottomleft", ncol = 2,
       legend = c("State 1", "State 3", "State 4", "State 5"),
       bty = "n", col = c(1, 3, 4, 5), lty = 1)
Plot

Next module, you’ll learn about “loops”, which will significantly cut down on the amount of code we need to write to generate these lines.