Output
[1] 8
[1] 0.5
[1] "5"
Module 1.3: Using R
Old Dominion University
By now, you can probably use R like a calculator – adding and subtracting single numbers, etc.
Calling this thing a ‘phone’ is like calling a Lamborghini a… cupholder. An incredibly elaborate cupholder.
However, there are a lot of features that make R the Lamborghini of calculators.
In R, you can assign names to values (remember, objects). You do this by using either <-
or =
. Online, when Googling, you may find solutions with both. Despite what you might read, there are differences between the two, but we can ignore those differences for right now.
Why would you want to assign names to values? This allows your code to be much more flexible. Consider the following example.
Naming 5
as x
allows us to change x
only once, and the entire code will run. This will 1) reduce our effort 2) decrease typos / bugs and 3) increase readability. Now, x
is not readable, per se, but this is just an example.
This may seem like a simple point, but it is very important. If you manipulate a variable in any way, but do not re-assign it to a name (same or different), it does not get updated/saved. Consider the following example.
In the following WebR chunk, calculate your age in months. Assign birth
a value equal to your birth year times twelve plus the number corresponding to your birth month. Next, assign now
the current year times twelve plus the number corresponding to this month. Finally, subtract these two numbers, assign the result to age_in_months
, and print age_in_months
.
Now, if you wanted, you could modify this code to calculate ages in months for your friends and family by simply changing the value of birth
! This should hopefully highlight the advantages of using variables instead of numeric values whenever possible.
It is important to choose informative names for your variables. Generally, single (or few) character names are easy to type, but can easily lose meaning. Too-long names aren’t great if you need to type them over and over. You will figure out a sweet spot for yourself.
There are some names you cannot use for your variable names, and other names that you simply shouldn’t. For example, you cannot start a variable name with a number. You cannot start names with certain punctuation either. On the other hand, you should not name things after already-used words that are native to R. This will just lead to confusing code. For example, do not name anything mean
, because that is already a function name that is native to R.
Learning what is and what is not a good variable name takes time and practice.
So far, we have only worked with single values. Data tends to come in sets of multiple values, like large spreadsheets with columns and rows. Let’s built up to the R version of “spreadsheets”, which are called data.frame
s. We will touch on each of the following ways to store multiple values:
data.frame
In R, the definition of a vector is a collection of values that are all of the same type. We use a c()
to denote vectors. The c
stands for combine. Once we have our vector, we apply different operations to it like we did before. For example, we know how to add two values, but what about a vector and a single value? Or two vectors?
Notice how when we added 10
to vec1
, 10
was added to each element of vec1
. However, when we added the two vectors, addition was element-wise. If two vectors are of different lengths, R will “recycle” the shorter one to match the longer one.
1 + 1
, 2 + 2
, 3 + 3
, and 1 + 4
. Notice how the first vector has to loop back around to the beginning to match the length of the second vector. This is effectively just adding c(1, 2, 3, 1)
and c(1, 2, 3, 4)
. Also, again note the warning generated by R. It’s not often that you will add two vectors of unequal length, so this should be a flag to you that maybe there’s an issue.
?
)As a quick aside, R has good help functionality. To access this, you need to put a ?
in front of whatever you want help with. For example, suppose you need help with the mean
function from before.
Running this line (as a reminder: ctrl
+ enter
in RStudio) will bring you to the function’s documentation.
Here, mean
becomes: mean(x, trim = 0, na.rm = FALSE, ...)
x
, trim
, and na.rm
are the function’s arguments. These are inputs, and the function gives you an output.
x
is the vector, x <- c(1, 4, 8, 7, 2)
, you want the mean of.trim
is the fraction of observations (elements in the vector) to be removed before taking the mean. You might want to remove the top and bottom 5% of observations since they might be outliers.na.rm
is a boolean that will remove NA
values for you.R also has different ways (functions) to generate vectors. Explore some of them below:
What happens if you have a vector of elements that are of different types?
Experiment with the chunk above. Does the resulting vector change depending on the number of character values vs numeric values? Does the resulting vector change depending on the first entry of the vector?
Let’s suppose you only want a part of a vector. You can select elements from vectors by index (position within the vector) or by boolean values. You do this by typing the vectors name, followed by a square bracket, followed by another vector that containing indices or boolean values. Experiment with the following examples.
Another important way to subset vectors is with the %in%
operator. Suppose you have a vector of years as follows: c(2006, 2006, 2003, 2005, 2012, 2002, 2016, 2006, 2008)
. If you were to subset the vector where you only kept elements where years were equal to 2006, 2007, or 2008, you would have to write the following:
This can be very tedious, is prone to error/typo, and infeasible if the list were much longer (i.e., not just three years). As a shortcut, R has the following:
A collection of vectors (of similar type and length) is called a matrix. Matrices have two dimensions: rows and columns. Matrices look like: example_mat[rows,cols]
. To create matrices from vectors, you can use rbind()
(to stack row-wise) or cbind()
(column-wise). Let’s start by assuming you have a few vectors to work with.
Another way to generate matrices would be to put one giant vector into the matrix
function. Of course, you will need to give matrix()
a bit of help. You need to tell it something about the dimensions you’d like. This could be ncol
for number of columns or nrow
for number of rows. In addition, you should specify whether the vector is “by row” or not (i.e. “by column”).
r1
denotes an element belonging on the first row, etc.:c(r1, r1, r1, r2, r2, r2, r3, r3, r3)
c(r1, r2, r3, r1, r2, r3, r1, r2, r3)
Once you have the matrix of your dreams, you may need to access certain columns or rows. Remember: example_mat[rows, columns]
. For vectors, if you want the first element, you would use example_vec[1]
. For a matrix, example_max[1,]
will give you the first row, example_max[,1]
will give you the first column, and example_max[i,j]
will give you the i\(^{th}\) row and j\(^{th}\) column. To select multiple rows, you can use logic or indices, much like vectors. Explore the code below:
Lists are similar to vectors in that they allow for the collection of elements. However, with lists, each element can be of a different type. In fact, each element of a list can be an entire vector! Lists, for this reason, are incredibly flexible. In fact, this flexibility can actually make it difficult to work with lists. Try exploring the lists below.
An interesting feature of lists, is that you can name the elements within the list. This is possible with vectors as well, but not as useful. Here are some examples of naming and using the names within lists.
Changing the format of the list a little bit:
This is a special list because both vectors of the list have the same number of elements, or observations. When this happens, we have something called a data.frame
. Really, this is just how R
represents spreadsheets – a collection of columns all with the same number of rows!
data.frame
So, what do data.frame
’s look like?
data.frame
Observations can be accessed in data.frame
s via the $
or [
. These objects combine lists and matrices to make a more realistic view of the types of data that are most common in the real world.
first last num_of_albums nba_seasons phds birth_country
1 alex cardazzi 0 0 1 us
2 jalen brunson 0 5 0 us
3 thom yorke 10 0 0 uk
data.frame
Suppose you want to subset the df
object that you’ve created. Again, there are different ways to do this. Like matrices, to get some rows and all columns, you would use df[lim,]
where lim
is a vector of boolean values or indices. Leaving nothing following the comma indicates to R that you want everything in that dimension. To get columns, you can reverse this (df[,3:4]
) or use names (df[,c("nba_seasons", "phds")]
). If you only want a single column, of course, you can use df$phds
.
Please re-read this last part. Subsetting data is one of the most important and most used operations you will learn throughout this course. If I had a dollar every time a student asked me to remind them how to do this, I would be able to retire tomorrow.
data.frame
As a final note about data.frame
s, here are a few important functions:
nrow()
: Returns the number of rows in a data.frame
.ncol()
: Returns the number of columns in a data.frame
.colnames()
: Returns the names of columns in a data.frame
.Now that we have some knowledge about how to work with data in R, how do we access data from different sources (e.g. spreadsheets, etc.)?
Before we can answer that question, we first need to set up a file where we can write code. Most of the time, people will use .R
files to write their code. .R
is simply the file extension, much like how Word files end in .docx
or Excel files end in .xlsx
. So, your file might be named something like analysis.R
. However, in this class, almost all of the code you will write will be inside .qmd
files. .qmd
stands for “Quarto Markdown”, and you can think of it as a way of merging R with a text editor (like Word, etc.). In other words, you will be able to generate everything (code, text, tables, figures) inside of a single file. This is helpful because you will never have to manually change numbers, etc., as all you will need to do is modify the underlying code!
.R
vs .qmd
You can think of a .qmd
as a text document with a bunch of mini .R
files embedded throughout. Every time you want to switch from text to code, you just need to make a “code chunk” where you can type in your code. Sound familiar? These notes are all generated using .qmd
files! Aside from this, there are not too many more differences between these two file types. Moreover, for each assignment you turn in, there will be an accompanying .qmd
template you can use. Please see the class videos for more information on how to use the templates. As I discuss how to use R in the rest of the submodule, I will point out the differences between .R
and .qmd
when they exist.
To write a new script, click on the top left button underneath “File”. You should be able to see a white paper icon with a green +. This will open up a menu of different files. Just select “R Script” for now, but note the ability to select “Quarto Document”.
So, you now have your first .R
file open… now what?
For beginners, as a template, the top of your script should look like the following:
library("")
: This is where you load any packages or libraries you might want. We will discuss packages later. This is commented out for now since we are not yet ready to load any packages/libraries.rm(list = ls())
: This is how you clear your environment to “start fresh”. It is a good idea to start with an empty environment so you don’t get confused between what is old and what is new.setwd(".")
: This is where you set your working directory.Documents/Econ 311/HW/HW01.R
.R has two helpful functions for dealing with working directories. First, getwd()
tells you where R is currently looking.
Then, if I wanted to change this, I would use setwd()
. Here, I could either:
# Two periods means "go back one level"
# So, if we were in "Documents/Econ 311/HW01"
# setwd("..") would bring us to "Documents/Econ 311"
setwd("..")
# If there was another folder inside "HW01",
# (for example, suppose you have a folder "Data" inside "HW01")
# From "HW01", you can navigate to "Data" like:
setwd("Data")
# Maybe you want to go from "HW01" to "HW02/Data"
# You would have to back out from HW01 (using "..")
# Then go into HW02 and Data
setwd("../HW02/Data")
Of course, to use the ".."
trick, you need to know where you’re starting from (i.e. with getwd()
). If you are unsure, you could always type in your entire working directory in one go.
A note about .R
vs .qmd
: RStudio’s default working directory is likely in your basic “Documents” folder. So, when using .R
scripts, you almost always need to use setwd()
and point R to the folder with your data. However, .qmd
files are a bit smarter than .R
files, and they know where they’re saved. The advantage of this is that if your data and .qmd
file are in the same folder, you do not need to use setwd()
, or you can easily navigate to the folder via the “dots” method.
Now that R knows where to look for data, we need it to import it so we can use it. Most of the time in this course, we will use files that end in .csv
. This stands for “comma separated values”. .csv
files are very common and require relatively small amounts of storage. .csv
files are also open-able in Excel (you just might get some warning about how any Excel formulas you write will not be saved). To create a .csv
from an .xlsx
(Excel) file, just use “Save As” in Excel, and change the file extension to “Comma Separated Values (.csv)”.
Now, let’s read in a file called “ford_escort.csv”. On my machine, the file lives in a folder called C:/Users/alexc/Dropbox/teaching/Fall 2024/econ311/data
. Since these notes are generated via .qmd
, and this file is saved in C:/Users/alexc/Dropbox/teaching/Fall 2024/econ311/module01
, fold_escort.csv
’s relative filepath is ../data/fold_escort.csv
. Rather than changing my working directory, I can just use this relative filepath. To import this data, we will use the read.csv()
function.
# Since my working directory is in "econ311/module01",
# But the file is in "econ311/data",
# I need to back out of "module01" and navigate to the data folder
ford <- read.csv("../data/ford_escort.csv")
dim(ford); cat("\n") # dim() gives the number of columns and rows
head(ford) # the head() function displays the first 6 rows.
# I could have also done:
# setwd("../data")
# ford <- read.csv("ford_escort.csv")
[1] 23 3
Year Mileage..thousands. Price
1 1998 27 9991
2 1997 17 9925
3 1998 28 10491
4 1998 5 10990
5 1997 38 9493
6 1997 36 9991
You can also read data straight from a URL if it has been uploaded correctly. Most of the data for this class can be read into R this way. Try using read.csv()
in the following chunk. Be sure to put quotes around the URL!
Now that we have our data read into R, let’s manipulate it a bit. This might feel like a lot at once, but try to stay with me. You can experiment in the previous WebR chunk or in RStudio if you’d like.
Mileage..thousands.
to mileage
.range()
function to find the minimum and maximum price per mile.As a note, you can use cat()
to combine text and code. Put "\n"
at the end to make a new line. You can experiment with this on your own.
# colnames(ford) # Take a look at the column names.
colnames(ford)[2] <- "mileage" # change the second name
# ford$Price < 9000 # this gives boolean (T/F) values.
# Since R treats TRUE as 1 and FALSE as 0, use sum()
cat("Number of Escapes less than $9,000:", sum(ford$Price < 9000), "\n")
ford$mileage <- ford$mileage * 1000 # Multiply by 1000 and save
ford$cost_per_mile <- ford$Price / ford$mileage # Create $/mi
cat("Average Cost per Mile:", mean(ford$cost_per_mile), "\n") # average
cat("Range of Cost per Mile:", range(ford$cost_per_mile)) # min and max
Number of Escapes less than $9,000: 5
Average Cost per Mile: 0.4315369
Range of Cost per Mile: 0.08325 2.198
An especially attractive feature of R (that does not differ between .R
and .qmd
) is its powerful graphics. Just Google “Best R Plots”, or something, and you’ll see what I mean.
To start, we’ll learn some of the basics. We will begin by generating two scatter plots using data from ford
.
mileage
vs Price
.mileage
vs cost_per_mile
.Below contains some examples of additional arguments for the plot()
function.
las = 1
rotates the text on the y-axis. Different numbers will rotate it more or lesscol
sets the colors used in the plot. This can take a vector of colors.pch
sets the type of point used.cex
sets the size of the points. The default is 1.main
sets the title of the plot.xlab
sets the name of the x-axisylab
sets the name of the y-axisTry tweaking some of these options in the following WebR chunk:
We can also add reference lines to the plot, and also make the colors a bit more complex.
# Set all colors as "tomato"
ford$point_color <- "tomato"
# If the Year is less than the mean year, color it "dodgerblue"
# Of course, these are therefore the "older" cars
ford$point_color[ford$Year < mean(ford$Year)] <- "dodgerblue"
plot(ford$mileage, ford$cost_per_mile, las = 1,
pch = 19, cex = 1.2,
col = ford$point_color, main = "Cost vs Mileage",
xlab = "Mileage", ylab = "Price per Mile")
abline(h = 1) # horiz. line at Y = 1
abline(v = mean(ford$mileage)) # vert. line at the mean of X
Of course, whenever you choose to add some differences in shapes, colors, etc., it’s helpful to add a legend to your plot. To do this, we can use the legend()
function. This function accepts a few important arguments:
bty
: setting this to "n"
removes the box around the legend. I always use this option.legend
: this is the actual text to be displayed in the legend. It accepts a character vector, so if you colored your plot by men and women, you would use c("Men", "Women")
.x
, y
: You can specify the exact coordinates of your legend, or you can specify things like: "topleft"
, "topright"
, "bottomleft"
, or "bottomright"
.horiz
: this accepts a boolean value, and turns the legend from vertical to horizontal.pch
or lty
options to tell R if you want to display points or lines next to your legend.Below is a plat with two legends (which is certainly redundant) to show off some of the different ways to customize the output.
plot(ford$mileage, ford$cost_per_mile, las = 1,
pch = 19, cex = 1.2,
col = ford$point_color, main = "Cost vs Mileage",
xlab = "Mileage", ylab = "Price per Mile")
legend("topright", pch = 19, bty = "n", horiz = TRUE,
legend = c("Old Ford", "New Ford"), cex = 1.5,
col = c("dodgerblue", "tomato"))
legend("bottomleft", lty = c(1, 2), pch = c(2, 19),
legend = c("Old Ford", "New Ford"),
col = c("dodgerblue", "tomato"))
When generating figures, you will sometimes need to add data from a different source to the same set of axes. As an example, let’s simply plot the data above, but in two steps instead of one.
To do this, we will use points()
. This function accepts nearly every argument plot()
does, except you are unable to impact the axes/labels of the plot.
plot(ford$mileage[ford$point_color == "tomato"],
ford$cost_per_mile[ford$point_color == "tomato"],
las = 1, pch = 19, cex = 1.2,
col = "tomato", main = "Cost vs Mileage",
xlab = "Mileage", ylab = "Price per Mile")
points(ford$mileage[ford$point_color != "tomato"],
ford$cost_per_mile[ford$point_color != "tomato"],
pch = 19, cex = 1.2, col = "dodgerblue")
Notice how I am subsetting the data when plotting. This is an important thing to learn!
Once you get the hang of using plot()
and points()
in tandem, you’ll find it convenient that points()
does not impact the axes. However, to start, this will be annoying. For example, let’s switch the order of the data in plot()
and points()
.
plot(ford$mileage[ford$point_color != "tomato"],
ford$cost_per_mile[ford$point_color != "tomato"],
las = 1, pch = 19, cex = 1.2,
col = "dodgerblue", main = "Cost vs Mileage",
xlab = "Mileage", ylab = "Price per Mile")
points(ford$mileage[ford$point_color == "tomato"],
ford$cost_per_mile[ford$point_color == "tomato"],
pch = 19, cex = 1.2, col = "tomato")
The plot is different because when plot()
is setting the axes, it doesn’t know that you’re planning on using points()
next. So, it scales the axes so the data fed into plot()
“fits” the space.
To overcome this issue, we can use the following trick. The idea is to plot the point (0, 0) (or any point, really!), but use type = "n"
so the point is not displayed. Then, within this plot()
call, we can set ylim
and xlim
equal to range()
of the variables we’ll plot so axes fit the data perfectly.
plot(0, 0, type = "n",
ylim = range(ford$cost_per_mile),
xlim = range(ford$mileage), # range can include multiple vectors
main = "Cost vs Mileage", las = 1,
xlab = "Mileage", ylab = "Price per Mile")
points(ford$mileage[ford$point_color != "tomato"],
ford$cost_per_mile[ford$point_color != "tomato"],
pch = 19, cex = 1.2, col = "dodgerblue")
points(ford$mileage[ford$point_color == "tomato"],
ford$cost_per_mile[ford$point_color == "tomato"],
pch = 19, cex = 1.2, col = "tomato")
Of course, this is a lot more coding than the initial plot’s code. The idea of showing you this is that, now, you can always make sure your data “fits”. This is one of the little things that I use constantly, but it took me a long time to figure out.
Finally, adding lines to a plot is very similar in that one needs to use lines()
. To illustrate, we will examine panel data on cigarette consumption by state (documentation).
Read in the data, and plot sales on the y-axis and year on the x-axis below. Be sure to clean the data where appropriate.
This figure is very difficult to understand. Let’s trim it down to just a few states. In addition, we can add colors to the figure.
This plot can still be improved. It’d be a lot more natural to see the data as lines instead of points. To do this, we can use type = "l"
.
Notice two things about this plot. First, there’s only a single color. In R, you should think of a line as a single point. R cannot color different parts of line differently, so it will just take the first color it’s given (here, it’s 1, which is black). Second, there are these three crazy diagonal lines that dash across the plot. This is because R is trying to connect each line into a single one. If you look closely, R is connecting the last year of one state to the first year of another state.
To fix this, we need to use lines
like we used points
before. This is another example case of a time where we’ll want to set up the axes before we plot anything.
# before, I plotted 0, 0
# now, I am simply keeping the data
# in plot().
# this way, I don't need to set the axes
# via ylim() and xlim()
plot(cig$year, cig$sales,
las = 1, type = "n",
ylab = "Sales", xlab = "Year")
lines(cig$year[cig$state == 1],
cig$sales[cig$state == 1],
col = 1)
lines(cig$year[cig$state == 3],
cig$sales[cig$state == 3],
col = 3)
lines(cig$year[cig$state == 4],
cig$sales[cig$state == 4],
col = 4)
lines(cig$year[cig$state == 5],
cig$sales[cig$state == 5],
col = 5)
legend("bottomleft", ncol = 2,
legend = c("State 1", "State 3", "State 4", "State 5"),
bty = "n", col = c(1, 3, 4, 5), lty = 1)
Next module, you’ll learn about “loops”, which will significantly cut down on the amount of code we need to write to generate these lines.
ECON 311: Economics, Causality, and Analytics