Introduction

Web crawling and web scraping are two tools that are important for collecting data. In this tutorial, I will go over how to crawl websites, how to scrape websites, the different types of websites (in terms of crawling), and a little bit about HTML. However, I will first go over some important string functions, for loops and if statements that will be necessary for crawling & scraping.

Strings

Strings are what the computer science community calls text data.
* paste() / paste0(): combining two strings * substr(): take a portion of a string * gsub(): find and replace within a string * grepl(): testing if one string is found inside another string * regexpr(): finding the location of one string inside another string. * strsplit(), unlist(): splitting up strings by some delimiter.

paste("thom", "yorke")

## [1] "thom yorke"

paste("thom", "yorke", sep = "-")

## [1] "thom-yorke"

paste0("thom", "yorke")

## [1] "thomyorke"

paste0(c("thom", "jonny"), "_", c("yorke", "greenwood"))

## [1] "thom_yorke"      "jonny_greenwood"

substr("thom yorke", 1, 4)

## [1] "thom"

radiohead <- c("thom yorke", "jonny greenwood")

gsub("o", "_", radiohead)

## [1] "th_m y_rke"      "j_nny greenw__d"

grepl("m", radiohead); grepl("y", radiohead)

## [1]  TRUE FALSE

## [1] TRUE TRUE

regexpr("m", radiohead); regexpr("y", radiohead)

## [1]  4 -1
## attr(,"match.length")
## [1]  1 -1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

## [1] 6 5
## attr(,"match.length")
## [1] 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

unlist(strsplit(radiohead, " "))

## [1] "thom"      "yorke"     "jonny"     "greenwood"

For Loop

Computers are really good at doing the same thing over and over again. Anything we tell the computer to repeat, we write as a loop. A trivial example is printing something over and over. We can write print("hello world") N times, once per line. Or, we can write it once and loop over it. Here is how we would do this:

for(i in 1:4){
  
  print("hello world")
}

## [1] "hello world"
## [1] "hello world"
## [1] "hello world"
## [1] "hello world"

We can also incorporate the index variable (i in the previous snippet) into our loop:

namez <- c("alex", "brad", "bryan", "adam")
for(name in namez){
  
  print(paste0("hello ", name, "!"))
}

## [1] "hello alex!"
## [1] "hello brad!"
## [1] "hello bryan!"
## [1] "hello adam!"

If statements

We can also program into our scripts some decision making logic. If some condition is met, we can make the computer do something different than if the condition is not met. We saw this in earlier tutorials where we chose colors based on specific conditions. Let’s have the computer introduce itself to people who have names starting with “b”, and ignore everyone else.

for(name in namez){
  
  if(substr(name, 1, 1) == "b"){
    
    print(paste0("Nice to meet you, ", name))
  }
}

## [1] "Nice to meet you, brad"
## [1] "Nice to meet you, bryan"

However, ignoring people isn’t nice! What if we want the computer to talk to people with names that doesn’t start with “b”.

for(name in namez){
  
  if(substr(name, 1, 1) == "b"){
    
    print(paste0("Nice to meet you ", name))
  } else {
    
    print(paste0("hi ", name))
  }
}

## [1] "hi alex"
## [1] "Nice to meet you brad"
## [1] "Nice to meet you bryan"
## [1] "hi adam"

I can also compose multiple conditions, and have one part of code that always executes at the end:

for(name in namez){
  
  if(substr(name, 1, 1) == "b"){
    
    print(paste0("Nice to meet you ", name))
  } else if(name == "adam") {
    
    print(paste0("hi ", name))
  } else {
    
    print(paste0("bye ", name))
  }
  
  print("okay, i am going to meet the next person now")
}

## [1] "bye alex"
## [1] "okay, i am going to meet the next person now"
## [1] "Nice to meet you brad"
## [1] "okay, i am going to meet the next person now"
## [1] "Nice to meet you bryan"
## [1] "okay, i am going to meet the next person now"
## [1] "hi adam"
## [1] "okay, i am going to meet the next person now"

`ifelse()`

Loops and if statements are great, but they are notoriously “slow”" in R. Rather, we have this nifty function called ifelse(). If you are familiar with using if statements in excel, this is exactly that, but for the whole vector. This is enormously helpful for all sorts of things. These functions can also be nested.

ifelse(substr(namez, 1, 1) == "b", 1, 0)
#if first letter is b, 1, otherwise if last letter is m, 2, otherwise 0.
ifelse(substr(namez, 1, 1) == "b", 1, ifelse(substr(namez, nchar(namez), nchar(namez)) == "m", 2, 0))

Sometimes loops are necessary, but try to avoid them if you can.

Before we move on, I also want to quickly show off the magrittr package. This package seems mostly cosmetic, but it can seriously simplify your code and make it much more readable. The package allows one to unpack complicated lines of code. See the following example.

Suppose we want to take the square root of \(e\) raised to the mean of some vector. This is clearly a multistep process, and I have outlined two ways to compute these. The first one is the way you may have done this if you never heard of magrittr. The second way uses %>%, which is called a “pipe”, to spread out the calculation. I prefer to read the code from top to bottom rather than from the inside-out on one line.

library("magrittr")
set.seed(678)
x <- rnorm(100)

print(sqrt(exp(mean(x))))

## [1] 0.9350935

x %>%
  mean() %>%
  exp() %>%
  sqrt() %>%
  print()

## [1] 0.9350935

One important point is that the pipe operator %>% places the left hand side into the FIRST argument of the right hand side. For example, let’s look at the rnorm function.

5 %>% rnorm() #n = 5, mean = 0, sd = 1

## [1]  0.8553342 -0.7376205  2.3744332  0.2241076 -1.4781254

5 %>% rnorm(4, .) #n = 4, mean = 5, sd = 1

## [1] 5.949385 4.747885 5.840820 6.295963

5 %>% rnorm(4, -10, .) #n = 4, mean = -10, sd = 5

## [1]  -8.267217  -4.772570  -7.242423 -10.397127

5 %>% rnorm(., ., .) #n = 5, mean = 5, sd = 5

## [1] 12.137663 -5.090002  6.119723  8.945774 -1.371957

The first line put the 5 into the N argument. The second line put the 5 into the mean argument. The third line put the 5 into the SD argument. The fourth line put 5 into each argument.

Now we have the programming requisites to talk about crawling and scraping! But first, we need some knowledge of HTML.

HTML

HTML is a coding language that is one of three standards for building websites. Extensive knowledge of HTML is not needed for this tutorial, but some points would be worth knowing. HTML objects (ex: div, table, tr) “contain” more HTML objects. We will call these subobjects “children”. For an illustration, imagine a row (tr) of a table. Within this row, there are data points (td). These data point are children of the tr. This is how HTML is structured or “nested”. Below is an example of what HTML code might look like. In this example, the <body> tag has children <h1> and <p>. This should remind you of what a list in R looks like.

We will identify “nodes” in html, using Cascading Style Sheets (“CSS”) identifiers. This is how we will tell the computer what we want from the HTML. The CSS selector for <h1> would be “body > h1”. There is an easier way to obtain these tags. You can right click on the element you are interested in, and click on Inspect from the list that appears. In Google Chrome, you can hover over the HTML code that pops up and it will highlight the part of the website it belongs to. Once you find the right element, you can write click and select “copy selector”. This takes a little bit of experience. In addition, CSS Selectors can be rather simple and inuitive (#per_game_stats), or they can be random and complex (#pj_6d153fea112bcee1 > div.sjcl > div:nth-child(1) > a > span.slNoUnderline).

Crawling

What is crawling? Crawling is simply downloadin/saving a webpage’s HTML code so you can extract, or scrape, the data afterwards.. This is important when your scripts start to crawl hundreds of thousands of pages, or if you want record data from a site over time. Unfortunately, some websites do not like to be crawled, and you can find their preferences on their robots.txt pages. This is the time I will obligitorily say to respect a website’s wishes and wait times as you crawl. Information on how to read these can be found here.

basketball-reference.com
CDC
Nike (Scroll to the bottom)

Setting up our script

Similar to how we set up our script before, we are going to load libraries, clean our workspace, set our working directory, etc. We are going to load a few new packages for this. Make sure you install them first!

library("rvest")
library("magrittr")
library("jsonlite")

FOLDER <- "C:/Users/alexc/Desktop/Empirical Workshop/scraped_html" #this might look slightly different for mac users
setwd(FOLDER)

Crawl One Page

Crawling a single page is easy, especially if there aren’t any frills on the website. You will tell R to visit the website, read the html, and save the html. Done! Let’s get the 2019 New York Knicks roster.

link <- "https://www.basketball-reference.com/teams/NYK/2019.html"
link %>%
  read_html() -> myHTML

myHTML %>%
  write_html("knicks2019.html")

Now this file is saved to your folder, and you can pop it open and check it out. It will likely look a little weird, but everything on the page at the moment of the crawl will be there!

Crawl Multiple Pages

It’s rare that data is on a single page. Rather, we need to traverse several pages to collect data. We can combine some string functions, for loops and the last snippet of crawling code to do multiple pages! Lucky for us, there is a natural pattern to the URL where we can just change the year.

for(year in 2010:2019){
  
  paste0("https://www.basketball-reference.com/teams/NYK/",year,".html") %>%
    read_html() %>%
    write_html(paste0("knicks",year,".html"))
}

This is the easiest, base level case for crawling. While there are a good number of websites that are this easy, there are more that are not. Let’s take a look at some dynamically generated websites now.

Dynamically Generated Websites

This website has the number of people on government websites at the moment. Let’s scrape this number!

"https://analytics.usa.gov/" %>%
  read_html() %>%
  html_nodes("#current_visitors") %>%
  html_text()

## [1] "..."

It gives us “…”. This is because it is generated after the website is loaded. So, we need to see where the data is coming from. Instead of looking at the “elements” tab in inspect mode, we will look at the network tab. If we reload the page by refreshing, we should see a whole bunch of things “waterfall” and it should look crazy. We are going to sort by type, scroll down and check out all the “XHR” files. Essentially, the website sees that you’ve loaded it and sends out its own code to load data. read_html() takes the original code before any loading happens once you’re on the page. In fact, if you refresh the page, you may be able to see the “…” before the code replaces it with data.

After exploring the XHR files, I found this one called realtime.json. Right click on this one and open it in a new tab. This is a JSON (JavaScript Object Notation) format, is becoming increasingly popular. Typically, we are taught to think of data in table format, but JSON is more of a “notebook”, list-type format. Luckily, we have a package to handle this for us! We can crawl these websites by saving the important JSON files.

Use their API

"https://analytics.usa.gov/data/live/realtime.json" %>%
  fromJSON() -> stuff
stuff$data$active_visitors

## [1] "435845"

#save json here

Log into websites

Some websites require users to fill out forms. These forms might be dropdown menus, logins, search bars, etc. We can deal with most of these with our rvest package. Let’s log into my stackoverflow account and check out the most recent posts.

url <- "https://stackoverflow.com/users/login?ssrc=head"
url %>%
  html_session() -> sesh
#we use html_session() instead of read_html()

sesh %>%
  html_form()

## [[1]]
## <form> 'search' (GET /search)
##   <input text> 'q': 
## 
## [[2]]
## <form> 'login-form' (POST /users/login?ssrc=head)
##   <input hidden> 'fkey': 3e87cd64c05f84833c97e611f61d69383c58faa4627b35d46979b1e41a9cbc54
##   <input hidden> 'ssrc': head
##   <input email> 'email': 
##   <input password> 'password': 
##   <button > 'submit-button
##   <input hidden> 'oauth_version': 
##   <input hidden> 'oauth_server':

#we need to fill in a usename and password...

#save the unfilled form
sesh %>%
  html_form() -> form.unfilled

#there are two forms on this page! We want the second one.  The first is the search bar.
form.unfilled[[2]] -> form.unfilled

#set the filled form
form.unfilled %>%
  set_values("email" = "alex.cardazzi@gmail.com",
             "password" = "welcome2Chilis") -> form.filled

#this step is not common, but for some reason R is having trouble recognizing stackoverflow's button as a button
form.filled$fields[[5]]$type <- "button"
#now, we have manually set the button to be a button, which R will now "click"

#log in ...
sesh <- submit_form(sesh, form.filled, POST = url)

## Submitting with 'submit-button'

#go to stackoverflow's homepage and let's get the text for the first question
sesh %>%
  jump_to("https://stackoverflow.com") -> sesh

#this is where we would save the HTML for a crawl...
#write_html(...)

#the rest of this is scraping, we will go over it later ...

# sesh %>%
#   html_node("#question-mini-list > div") %>%
#   html_children() -> questions
# 
# questions[1] %>% #get the first question
#   html_children() %>% #get it's children
#   `[`(2) %>% #select the second child
#   html_children() %>% #get the children of the second child
#   `[`(1) %>% #select the first child of the second child
#   html_text() #get the text of the first child of the second child ...
# 
# #or an easier way...
# questions[1] %>%
#   html_nodes("a") %>%
#   html_text()

Scraping

We have now reviewed a few ways to crawl websites, but now we need to actually get the data from them.

Extract Table: One Page

read_html("knicks2019.html") -> myHTML

myHTML %>%
  html_nodes("#roster") %>%
  html_table() -> roster
roster %>%
  as.data.frame() -> roster

Explaining how to get this node is not easy, but here goes. Go to the Knicks 2019 Roster. Right click on the “SG” next to Kadeem Allen’s name and click inspect. The website’s HTML should open up and a “td” should be highlighted. Hover your mouse over this, and slowly move your mouse up. You should see what’s highlighted on the webpage changing (e.g. starting on the SG, moving to Kadeem Allen’s name, then to his jersey number, etc). Eventually, you will reach an element called “table”. Right click on this, and click on “copy selector” from the dropdown. This should give you “#roster”.

Extract Table: Multiple Pages

Since we have multiple roster pages, we need to run this extraction code for each one. Obviously, we aren’t going to write the code over and over, we are instead going to loop the code. So, to keep the code as flexible as possible, we are going to introduce a new function called list.files(). This will return a vector of the names of the files in the current working directory (or any given filepath). Second, we are going to make a list of rosters where we can store our scraped data each time through the loop. After we save all of the data frames to a list, we can use rbind to stack all of them. Another thing to note is that I am going to create a new vector for each new roster that tells the file this data comes from. This is an important thing so we can track our data.

list.files() -> filez
roster <- list()
for(file in filez){
  
  file %>%
    read_html() %>%
    html_nodes("#roster") %>%
    html_table() %>%
    as.data.frame() -> temp #same roster
  
  temp$file <- file #add file column
  
  roster[[length(roster) + 1]] <- temp
}

# for(i in 1:length(filez)){
#   
#   filez[i] %>%
#     read_html() %>%
#     html_nodes("#roster") %>%
#     html_table() %>%
#     as.data.frame() -> temp
# 
#   temp$file <- filez[i]
#   
#   roster[[length(roster) + 1]] <- temp
# }

rosterFilled <- do.call(rbind, roster)

Extract Non-Tables, Hyperlinks

Let’s suppose now we want to get the twitter accounts of each player on the Knicks. If you notice, on most player’s page (DeAndre Jordan), they have their twitter account handle. Let’s augment the current roster with their handles.

To do this, we first need to access the HTML. Usually, I would get it from the already crawled HTML, but I will do the crawling and scraping in one step here.

link <- "https://www.basketball-reference.com/teams/NYK/2019.html"
link %>%
  read_html() -> myHTML

Now, let’s just get the roster and take a look.

myHTML %>%
  html_nodes("#roster") %>%
  html_table() %>%
  as.data.frame() -> knicks
head(knicks)

##   No.         Player Pos   Ht  Wt        Birth.Date Var.7 Exp
## 1   0   Kadeem Allen  SG  6-1 200  January 15, 1993    us   1
## 2  31      Ron Baker  SG  6-4 220    March 30, 1993    us   2
## 3  23     Trey Burke  PG  6-0 175 November 12, 1992    us   5
## 4  21 Damyean Dotson  SG  6-5 202       May 6, 1994    us   1
## 5  13 Henry Ellenson  PF 6-10 240  January 13, 1997    us   2
## 6   3  Billy Garrett  SG  6-6 213  October 16, 1994    us   R
##           College
## 1         Arizona
## 2   Wichita State
## 3        Michigan
## 4 Oregon, Houston
## 5       Marquette
## 6          DePaul

You might notice that the html_table() function does not keep the hyperlinks! This is annoying since these are the links we want to follow. So, let’s go back to the HTML and investigate if we can find “href” items. “href” is HTML-speak for hyperlink.

myHTML %>%
  html_nodes("#roster") %>%
  html_nodes("a") %>% #a nodes are where hrefs are kept.  you need an <a> to give text, href, and anything else.
  html_attr("href") -> linkz

head(linkz)

## [1] "/players/a/allenka01.html"            
## [2] "/friv/colleges.fcgi?college=arizona"  
## [3] "/players/b/bakerro01.html"            
## [4] "/friv/colleges.fcgi?college=wichitast"
## [5] "/players/b/burketr01.html"            
## [6] "/friv/colleges.fcgi?college=michigan"

#Remove the NA ones, keep only the ones with "players" in it.
#Next, add the front half of the website to it.
linkz <- linkz[grepl("players", linkz)]
head(linkz)

## [1] "/players/a/allenka01.html" "/players/b/bakerro01.html"
## [3] "/players/b/burketr01.html" "/players/d/dotsoda01.html"
## [5] "/players/e/ellenhe01.html" "/players/g/garrebi01.html"

linkz <- paste0("https://www.basketball-reference.com", linkz)
head(linkz)

## [1] "https://www.basketball-reference.com/players/a/allenka01.html"
## [2] "https://www.basketball-reference.com/players/b/bakerro01.html"
## [3] "https://www.basketball-reference.com/players/b/burketr01.html"
## [4] "https://www.basketball-reference.com/players/d/dotsoda01.html"
## [5] "https://www.basketball-reference.com/players/e/ellenhe01.html"
## [6] "https://www.basketball-reference.com/players/g/garrebi01.html"

knicks$playerLinkz <- linkz

Now we have all of the player pages! Notice how there are more href elements in the table. This can become tricky in two ways. First, if there is no easy way to tell between the links you want and links you do not want. Luckily for us, we were able to tell the difference since the roots of the links were different. Second, sometimes not all rows have a hyperlink. This means that the vector of hyperlinks will be shorter than the table, and you will have to be very clever about how to find where those “holes” are located.

Twitter Accounts

So now, we have to visit each player’s page, check for a twitter account, and pull it if it exists.

knicks$twitter <- NA #making a new column of all NA values
#for(i in 1:length(linkz)){ #normally, you would run this for loop, but for brevity I am running only the first 4 times through the loop
  
for(i in 1:4){
  linkz[i] %>%
    read_html() %>%
    html_nodes("#meta") %>%
    html_nodes("a") %>%
    html_attr("href") -> temp #this is where the twitter account will be located 
  #I will leave finding this as practice
  
  temp <- temp[grepl("twitter.com", temp)] #eliminate all of the links that do not contain "twitter.com"
  
  #if the player does not have a twitter account, we will have removed all links!
  #use an if statement to be sure we are only replacing the NA in the table with a twitter account.
  if(length(temp)>0){
    
    knicks$twitter[i] <- temp
  }
}
head(knicks, 4)

##   No.         Player Pos  Ht  Wt        Birth.Date Var.7 Exp
## 1   0   Kadeem Allen  SG 6-1 200  January 15, 1993    us   1
## 2  31      Ron Baker  SG 6-4 220    March 30, 1993    us   2
## 3  23     Trey Burke  PG 6-0 175 November 12, 1992    us   5
## 4  21 Damyean Dotson  SG 6-5 202       May 6, 1994    us   1
##           College
## 1         Arizona
## 2   Wichita State
## 3        Michigan
## 4 Oregon, Houston
##                                                     playerLinkz
## 1 https://www.basketball-reference.com/players/a/allenka01.html
## 2 https://www.basketball-reference.com/players/b/bakerro01.html
## 3 https://www.basketball-reference.com/players/b/burketr01.html
## 4 https://www.basketball-reference.com/players/d/dotsoda01.html
##                            twitter
## 1  https://twitter.com/AllenKadeem
## 2   https://twitter.com/RonBaker31
## 3    https://twitter.com/TreyBurke
## 4 https://twitter.com/wholeteamDot

Now we have the 2019 roster with everyone’s twitter accounts (if they have one). Using this, you might want to scrape twitter or something for player news or something.

tryCatch - Error Handling

Code breaks. In my case, a lot. When code breaks, the entire program ends. Sometimes this is okay, but other times it isn’t what we want. If we can anticipate code breaking, we might just want R to skip these cases. For example, suppose we are crawling over those twitter accounts. If a player changes their handle and basketball-reference does not have a chance to update it, our code might fail. To have R jump over this, we can use tryCatch(). Check out the following example.

"https://www.basketball-reference.com/teams/NYK/3019.html" %>%
  read_html()

## Error in open.connection(x, "rb"): HTTP error 404.

tryCatch({
  
  "https://www.basketball-reference.com/teams/NYK/3019.html" %>%
    read_html()
}, error = function(e){
  
  print("404 error ... try again")
})

## [1] "404 error ... try again"

tryCatch({
  
  "https://www.basketball-reference.com/teams/NYK/2019.html" %>%
    read_html()
}, error = function(e){
  
  print("404 error ... try again")
})

## {xml_document}
## <html data-version="klecko-" data-root="/home/bbr/build" itemscope="" itemtype="https://schema.org/WebSite" lang="en" class="no-js">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="bbr">\n<div id="wrap">\n  \n  <div id="header" role="ba ...

RSelenium - Notes

Sometimes, you cannot use read_html() due to the website being dynamic, and it is either too hard to crawl via html_session() or data is generated by JavaScript rather than those XHR files, there is a tool called RSelenium.
This package simulates a user experience by actually opening a browser, clicking and typing.
Setting this up is difficult, but it can be handy.
Personally, this is my last resort. It is constantly being worked on and changed, so you will have to constantly adapt.

Scraping PDFs

There is a library called “pdftools”
There is a function called “pdf_text()”
It converts each page of the pdf into a SINGLE character element
Example of a PDF with enough of a “pattern” or “structure” to scrape
Advice: stay away!

Web Crwaling & Scraping in R

Alexander Cardazzi

Quarantine, 2020