Web crawling and web scraping are important tools for collecting unique data. In this tutorial, we will go over how to crawl websites, how to scrape websites, the different types of websites (in terms of crawling), and a little bit about HTML.
Before we get there, though, I also want to quickly show off the magrittr package. This package seems mostly cosmetic, but it can significantly simplify your code and make it much more readable. The package allows one to declutter complicated lines of code. See the following example.
Suppose we want to take the square root of \(e\) raised to the mean of some vector. This is clearly a multistep process, and I have outlined two ways to compute it. The first is the way you may have done it if you have never heard of magrittr. The second uses %>%, which is called a “pipe”, to spread out the calculation. I prefer to read the code from top to bottom rather than from the inside out on one line.
library("magrittr")
set.seed(678)
x <- rnorm(100)
print(sqrt(exp(mean(x))))
## [1] 0.9350935
x %>%
  mean() %>%
  exp() %>%
  sqrt() %>%
  print()
## [1] 0.9350935
One important point is that the pipe operator %>% places the left-hand side into the FIRST argument of the right-hand side. For example, let’s look at the rnorm() function.
5 %>% rnorm() #n = 5, mean = 0, sd = 1
## [1] 0.8553342 -0.7376205 2.3744332 0.2241076 -1.4781254
5 %>% rnorm(4, .) #n = 4, mean = 5, sd = 1
## [1] 5.949385 4.747885 5.840820 6.295963
5 %>% rnorm(4, -10, .) #n = 4, mean = -10, sd = 5
## [1] -8.267217 -4.772570 -7.242423 -10.397127
5 %>% rnorm(., ., .) #n = 5, mean = 5, sd = 5
## [1] 12.137663 -5.090002 6.119723 8.945774 -1.371957
The first line puts the 5 into the n argument. The second line puts it into the mean argument, using the . placeholder to mark exactly where the left-hand side goes. The third line puts it into the sd argument, and the fourth line puts 5 into every argument.
As a final note about magrittr, R (version 4.1.0 and later) now has a native pipe operator, |>. See this link for more discussion about it.
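For example, the earlier calculation can also be written with the native pipe (the result is the same as above):
x |> mean() |> exp() |> sqrt() |> print()
## [1] 0.9350935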
Now we have the programming prerequisites to talk about crawling and scraping! Next, we need some knowledge of HTML.
HTML is a markup language and one of the three core standards for building websites (alongside CSS and JavaScript). Extensive knowledge of HTML is not needed for this tutorial, but a few points are worth knowing. HTML objects (ex: div, table, tr) “contain” more HTML objects. We will call these subobjects “children” or “nodes”. For an illustration, imagine a row (tr) of a table. Within this row, there are data cells (td). These data cells are children of the tr. This is how HTML is structured, or “nested”. Below is an example of what HTML code might look like. In this example, the <body> tag has children <h1> and <p>. This should remind you of what a list in R looks like.
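<html>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
  </body>
</html>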
We will identify “nodes” in HTML using (mostly) Cascading Style Sheets (“CSS”) selectors. This is how we will tell the computer what we want from the HTML. The CSS selector for <h1> above would be “body > h1”. There is an easier way to obtain these tags: right-click on the element you are interested in and click Inspect from the menu that appears. In Google Chrome, you can hover over the HTML code that pops up and it will highlight the part of the website it belongs to. Once you find the right element, you can right-click it and select “Copy selector”. There is a moderate learning curve for this, but it’s not so bad. In addition, CSS selectors can be rather simple and intuitive (#per_game_stats), or they can be random and complex (#pj_6d153fea112bcee1 > div.sjcl > div:nth-child(1) > a > span.slNoUnderline).
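As a quick preview of how we will use these selectors (the rvest functions here are introduced properly below), we can read the tiny snippet above as a string and pull out the heading with its CSS selector:
library("rvest")
library("magrittr")
"<html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>" %>%
  read_html() %>%
  html_nodes("body > h1") %>%
  html_text() # should return "My First Heading"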
What is crawling? Crawling is simply downloading/saving a webpage’s HTML code so you can extract, or scrape, the data afterwards. This matters when your scripts start to crawl hundreds of thousands of pages, or when you want to record data from a site over time. Unfortunately, some websites do not like to be crawled, and you can find their preferences on their robots.txt pages. This is where I will obligatorily say: respect a website’s wishes and wait times as you crawl. Information on how to read these files can be found here.
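For instance, a site’s robots.txt is just a text file at the root of the domain, so you can peek at it directly. A quick sketch, using a site we crawl later in this tutorial:
"https://www.basketball-reference.com/robots.txt" %>%
  readLines() %>%
  head(10) # show the first few crawl rules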
To begin, we are going to load libraries, clean our workspace, set our working directory, etc. We are going to load a few new packages for this. Make sure you install them first!
library("xml2")
library("rvest")
library("magrittr")
library("jsonlite")
FOLDER <- "C:/Users/alexc/Desktop/Empirical Workshop 2022/scraped_html" #this might look slightly different for mac/linux users
setwd(FOLDER)
Crawling a single page is easy, especially if there aren’t any frills on the website. You will tell R to visit the website, read the html, and save the html. Done! Let’s get the 2019 New York Knicks roster.
link <- "https://www.basketball-reference.com/teams/NYK/2019.html"
link %>%
read_html() -> myHTML
myHTML %>%
write_html("knicks2019.html")
Now this file is saved to your folder, and you can pop it open and check it out. It will likely look a little weird, but everything on the page at the moment of the crawl will be there!
It’s rare that all the data you want is on a single page. Rather, we will likely need to traverse several pages to collect data. We can combine some string functions, for loops and the last snippet of crawling code to do multiple pages! Lucky for us, there is a natural pattern to the URL where we can just change the year.
for(year in 2010:2019){
  paste0("https://www.basketball-reference.com/teams/NYK/", year, ".html") %>%
    read_html() %>%
    write_html(paste0("knicks", year, ".html"))
}
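As mentioned above, you should respect a site’s wait times. Here is a minimal variation of the loop that pauses between requests (the 2-second pause is an arbitrary placeholder; check the site’s robots.txt for its actual crawl delay):
for(year in 2010:2019){
  paste0("https://www.basketball-reference.com/teams/NYK/", year, ".html") %>%
    read_html() %>%
    write_html(paste0("knicks", year, ".html"))
  Sys.sleep(2) # be polite: wait 2 seconds between requests
}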
This is the simplest case for crawling. While there are a good number of websites that are this easy, there are many that are not. Let’s take a look at some dynamically generated websites now.
"https://analytics.usa.gov/" %>%
read_html() %>%
html_nodes("#current_visitors") %>%
html_text()
## [1] "..."
It gives us “…”. This is because the value is generated after the website is loaded. So, we need to see where the data is coming from. Instead of looking at the “Elements” tab in inspect mode, we will look at the “Network” tab. If we reload the page by refreshing, we should see a whole bunch of things “waterfall” and it should look crazy. We are going to sort by type, scroll down, and check out all the “XHR” files. Essentially, the website sees that you’ve loaded it and sends out its own code to load the data. read_html() grabs the original code before any of that loading happens. In fact, if you refresh the page, you may be able to see the “…” before the code replaces it with data.
After exploring the XHR files, I found one called realtime.json. Right-click on it and open it in a new tab. This is JSON (JavaScript Object Notation), a format that is becoming increasingly popular. Typically, we are taught to think of data in table format, but JSON is more of a “notebook”, list-type format. Luckily, we have a package to handle this for us! We can crawl these websites by saving the important JSON files.
"https://analytics.usa.gov/data/live/realtime.json" %>%
fromJSON() -> stuff
stuff$data$active_visitors
## [1] "6914"
Some websites require users to fill out forms. These forms might be dropdown menus, logins, search bars, etc. We can deal with most of these with our rvest package.
Let’s take a look at the CDC’s Wonder database. This is a great resource for datasets, but can be difficult to download from en masse. The CDC has provided an API for some of these databases (but not all), and some very nice people have written R packages to access these APIs. What happens if you need data that isn’t covered by the APIs / packages?
Here is an example of one we can pull data from: https://wonder.cdc.gov/nasa-pm.html
The following few code chunks will show how you might build a crawler step-by-step:
# Suppose we are given the following FIPS codes
FIPZ <- c(36087, 36093, 36083, 36075, 36119, 36061, 36081, # New York
51059, 51710, 51740, # Virginia
42001, 42003, 42077, 42101) # Pennsylvania
# For some reason, the website will not accept fips from different states in one go
# We need to create an identifier:
FIPZ_st <- substr(FIPZ, 1, 2)
print(FIPZ_st)
## [1] "36" "36" "36" "36" "36" "36" "36" "51" "51" "51" "42" "42" "42" "42"
for(fip_st in unique(FIPZ_st)){
  cat(FIPZ[FIPZ_st == fip_st], "\t\t")
}
## 36087 36093 36083 36075 36119 36061 36081 51059 51710 51740 42001 42003 42077 42101
for(fip_st in unique(FIPZ_st)){
  # save the URL
  "https://wonder.cdc.gov/nasa-pm.html" -> url
  # begin a "session"
  url %>% session() -> sesh
  # find the "forms"
  sesh %>%
    html_form() -> unfilled
  if(fip_st == unique(FIPZ_st)[1]) print(unfilled)
}
## [[1]]
## <form> '<unnamed>' (GET https://search.cdc.gov/search)
## <field> (button) :
## <field> (text) query:
## <field> (button) :
## <field> (hidden) sitelimit:
## <field> (hidden) utf8: ✓
## <field> (hidden) affiliate: cdc-main
##
## [[2]]
## <form> 'wonderform' (POST https://wonder.cdc.gov/controller/datarequest/D73)
## <field> (hidden) saved_id:
## <field> (hidden) dataset_code: D73
## <field> (hidden) dataset_label: Fine Particulate ...
## <field> (hidden) dataset_vintage: 2011
## <field> (hidden) stage: request
## <field> (hidden) O_javascript: off
## <field> (hidden) M_1: D73.M1
## <field> (hidden) dataset_id: D73
## <field> (button) : Request Form
## <field> (submit) tab-results: Results
## <field> (submit) tab-map: Map
## <field> (submit) tab-chart: Chart
## <field> (submit) tab-about: About
## <field> (submit) action-Save: Save
## <field> (submit) action-Reset: Reset
## <field> (submit) action-Send: Send
## <field> (select) B_1:
## <field> (select) B_2:
## <field> (select) B_3:
## <field> (select) B_4:
## <field> (select) B_5:
## <field> (checkbox) M_11: D73.M11
## <field> (checkbox) M_12: D73.M12
## <field> (checkbox) M_14: D73.M14
## <field> (text) O_title:
## <field> (submit) action-Send: Send
## <field> (radio) O_location: D73.V2
## <field> (radio) O_location: D73.V1
## <field> (hidden) finder-stage-D73.V2: codeset
## <field> (hidden) O_V2_fmode: freg
## <field> (textarea) V_D73.V2:
## <field> (button) : Clear
## <field> (button) :
## <field> (button) : Browse
## <field> (submit) finder-action-D73.V2-Search: Search
## <field> (submit) finder-action-D73.V2-Details: Details
## <field> (select) F_D73.V2: *All*
## <field> (submit) finder-action-D73.V2-Open: Open
## <field> (submit) finder-action-D73.V2-Close: Close
## <field> (submit) finder-action-D73.V2-Close All: Close All
## <field> (hidden) finder-stage-D73.V1: codeset
## <field> (hidden) O_V1_fmode: freg
## <field> (textarea) V_D73.V1:
## <field> (button) : Clear
## <field> (button) :
## <field> (button) : Browse
## <field> (submit) finder-action-D73.V1-Search: Search
## <field> (submit) finder-action-D73.V1-Details: Details
## <field> (select) F_D73.V1: *All*
## <field> (submit) finder-action-D73.V1-Open: Open
## <field> (submit) finder-action-D73.V1-Open Fully: Open Fully
## <field> (submit) finder-action-D73.V1-Close: Close
## <field> (submit) finder-action-D73.V1-Close All: Close All
## <field> (submit) action-Send: Send
## <field> (radio) O_dates: D73.V7_range
## <field> (radio) O_dates: D73.V3
## <field> (radio) O_dates: D73.V7
## <field> (select) RD1_M_D73.V7: 01
## <field> (select) RD1_D_D73.V7: 01
## <field> (select) RD1_Y_D73.V7: 2003
## <field> (select) RD2_M_D73.V7: 12
## <field> (select) RD2_D_D73.V7: 31
## <field> (select) RD2_Y_D73.V7: 2011
## <field> (select) V_D73.V3: *All*
## <field> (radio) O_dates_2: D73.V8
## <field> (radio) O_dates_2: D73.V6
## <field> (select) V_D73.V4: *All*
## <field> (select) V_D73.V8: *All*
## <field> (select) V_D73.V6: *All*
## <field> (hidden) finder-stage-D73.V7: codeset
## <field> (hidden) O_V7_fmode: freg
## <field> (textarea) V_D73.V7:
## <field> (button) : Clear
## <field> (button) :
## <field> (button) : Browse
## <field> (submit) finder-action-D73.V7-Search: Search
## <field> (submit) finder-action-D73.V7-Details: Details
## <field> (select) F_D73.V7: *All*
## <field> (submit) finder-action-D73.V7-Open: Open
## <field> (submit) finder-action-D73.V7-Open Fully: Open Fully
## <field> (submit) finder-action-D73.V7-Close: Close
## <field> (submit) finder-action-D73.V7-Close All: Close All
## <field> (submit) action-Send: Send
## <field> (radio) O_pm: pm_range
## <field> (radio) O_pm: pm_list
## <field> (text) R1_D73.V10:
## <field> (text) R2_D73.V10:
## <field> (select) V_D73.V10: *All*
## <field> (submit) action-Send: Send
## <field> (checkbox) O_change_action-Send-Export Results: Export Results
## <field> (checkbox) O_show_totals: true
## <field> (checkbox) O_show_zeros: true
## <field> (select) O_precision: 2
## <field> (select) O_timeout: 600
## <field> (submit) action-Send: Send
## <field> (submit) action-Reset: Reset
for(fip_st in unique(FIPZ_st)){
  "https://wonder.cdc.gov/nasa-pm.html" -> url
  url %>% session() -> sesh
  sesh %>%
    html_form() -> unfilled
  # only keep the second form
  unfilled <- unfilled[[2]]
  # fill in the form with the relevant information
  html_form_set(unfilled,
                "B_1" = "D73.V7-level3", # by month/year
                "B_2" = "D73.V2-level2", # by county
                "F_D73.V2" = as.list(FIPZ[FIPZ_st == fip_st]),
                "RD1_M_D73.V7" = "01", # from month
                "RD1_D_D73.V7" = "01", # from day
                "RD1_Y_D73.V7" = "2010", # from year
                "RD2_M_D73.V7" = "12", # to month
                "RD2_D_D73.V7" = "31", # to day
                "RD2_Y_D73.V7" = "2010") %>% # to year
    # finally, submit your form to the session
    session_submit(sesh, form = ., submit = "action-Send") -> k
  # now, check out the response generated, and save it!
  k$response %>% read_html() -> tmp
  tmp %>% write_html(paste0("cdc_", fip_st, ".html"))
  if(fip_st == unique(FIPZ_st)[1]) print(tmp)
}
## {html_document}
## <html>
## [1] <body><p>"Notes"\t"Month Day, Year"\t"Month Day, Year Code"\t"County"\t"C ...
paste0("cdc_", fip_st, ".html") %>%
read_html() %>%
html_nodes("p") %>%
html_text() %>%
strsplit("\n") -> tmp
tmp <- tmp[[1]]
tmp <- tmp[!grepl("^\"Total", tmp)]
tmp <- tmp[1:(which(tmp == "\"---\"\r")[1] - 1)]
colz <- tmp[1]
colz <- strsplit(colz, "\t")[[1]]
colz <- colz[2:length(colz)]
cat(paste(gsub("\"", "", colz), collapse = "\t\t"))
## Month Day, Year Month Day, Year Code County County Code Avg Fine Particulate Matter (µg/m³) # of Observations for Fine Particulate Matter Min Fine Particulate Matter Max Fine Particulate Matter Avg Fine Particulate Matter Standard Deviation
tmp <- tmp[-1]
tmp <- gsub("\\r|\"|^\t", "", tmp)
tmp <- strsplit(tmp, "\t")
tmp <- data.frame(matrix(unlist(tmp), ncol = 9, byrow = T))
head(tmp, 10)
## X1 X2 X3 X4 X5 X6 X7 X8
## 1 Jan 01, 2010 2010/01/01 Adams County, PA 42001 9.16 16 9.00 9.40
## 2 Jan 01, 2010 2010/01/01 Allegheny County, PA 42003 10.03 22 9.80 10.20
## 3 Jan 01, 2010 2010/01/01 Lehigh County, PA 42077 10.22 8 10.10 10.40
## 4 Jan 01, 2010 2010/01/01 Philadelphia County, PA 42101 9.32 5 9.20 9.40
## 5 Jan 02, 2010 2010/01/02 Adams County, PA 42001 7.04 16 7.00 7.10
## 6 Jan 02, 2010 2010/01/02 Allegheny County, PA 42003 6.84 22 6.80 6.90
## 7 Jan 02, 2010 2010/01/02 Lehigh County, PA 42077 6.85 8 6.80 6.90
## 8 Jan 02, 2010 2010/01/02 Philadelphia County, PA 42101 6.94 5 6.90 7.00
## 9 Jan 03, 2010 2010/01/03 Adams County, PA 42001 10.46 16 10.30 10.60
## 10 Jan 03, 2010 2010/01/03 Allegheny County, PA 42003 9.54 22 9.50 9.60
## X9
## 1 0.14
## 2 0.12
## 3 0.10
## 4 0.11
## 5 0.05
## 6 0.05
## 7 0.05
## 8 0.05
## 9 0.08
## 10 0.05
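As an optional last step, you can attach the cleaned-up header we extracted earlier as column names (a small sketch; it assumes the header has the same number of fields as the table, nine in this case):
names(tmp) <- gsub("\"", "", colz) # replace X1..X9 with the real column names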
Some CDC Wonder datasets (ex: Multiple Cause of Death, 1999-2020) make you “agree” to data use restrictions. However, this is just another form! You can also automate this. Below is a minor example.
session("https://wonder.cdc.gov/mcd-icd10.html") -> sesh
sesh %>% html_form() -> unfilled
session_submit(sesh, unfilled[[2]], submit = "action-I Agree") -> sesh
sesh %>%
html_form() -> unfilled
unfilled <- unfilled[[2]]
A lot of sites are using a new type of query language for their APIs called GraphQL. This is a bit complicated and I suggest you check out some other resources. For example: here and here.
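Just to give a flavor: a GraphQL API usually exposes a single endpoint that you POST a JSON body to, with the query written as a string. Here is a rough, hedged sketch using the httr package (the endpoint and query below are purely hypothetical):
library("httr")
POST("https://example.com/graphql", # hypothetical endpoint
     body = list(query = "{ players { name team } }"), # hypothetical query
     encode = "json") -> resp
content(resp, as = "text") %>% fromJSON()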
read_html("knicks2019.html") -> myHTML
myHTML %>%
html_nodes("#roster") %>%
html_table() -> roster
roster %>%
as.data.frame() -> roster
Explaining how to get this node is not easy, but here goes. Go to the Knicks 2019 Roster. Right click on the “PG” next to Kadeem Allen’s name and click inspect. The website’s HTML should open up and a “td” should be highlighted. Hover your mouse over this, and slowly move your mouse up. You should see what’s highlighted on the webpage changing (e.g. starting on the PG, moving to Kadeem Allen’s name, then to his jersey number, etc). Eventually, you will reach an element called “table”. Right click on this, and click on “copy selector” from the dropdown. This should give you “#roster”.
Since we have multiple roster pages, we need to run this extraction code for each one. Obviously, we aren’t going to write the code over and over; instead, we are going to loop it. To keep the code as flexible as possible, we are going to introduce a new function called list.files(). This returns a vector of the names of the files in the current working directory (or any given filepath). Second, we are going to make a list of rosters where we can store our scraped data each time through the loop. After we save all of the data frames to this list, we can use rbind to stack them. One more thing to note: for each roster, I am going to add a column recording the file the data came from, so we can keep track of our data.
list.files(pattern = "^knicks") -> filez # only the Knicks pages; the folder also holds the CDC files we saved earlier
roster <- list()
for(file in filez){
  file %>%
    read_html() %>%
    html_nodes("#roster") %>%
    html_table() %>%
    as.data.frame() -> temp # same roster code as before
  temp$file <- file # add file column
  roster[[length(roster) + 1]] <- temp
}
# for(i in 1:length(filez)){
#
# filez[i] %>%
# read_html() %>%
# html_nodes("#roster") %>%
# html_table() %>%
# as.data.frame() -> temp
#
# temp$file <- filez[i]
#
# roster[[length(roster) + 1]] <- temp
# }
rosterFilled <- do.call(rbind, roster)
Let’s suppose now we want to get the Twitter accounts of each player on the Knicks. If you look at most players’ pages (ex: DeAndre Jordan), you will notice their Twitter handle listed there. Let’s augment the current roster with these handles.
To do this, we first need to access the HTML. Usually, I would get it from the already crawled HTML, but I will do the crawling and scraping in one step here.
link <- "https://www.basketball-reference.com/teams/NYK/2019.html"
link %>%
read_html() -> myHTML
Now, let’s just get the roster and take a look.
myHTML %>%
  html_nodes("#roster") %>%
  html_table() %>%
  as.data.frame() -> knicks
head(knicks)
## No. Player Pos Ht Wt Birth.Date Var.7 Exp College
## 1 0 Kadeem Allen PG 6-1 200 January 15, 1993 us 1 Arizona
## 2 31 Ron Baker SG 6-4 220 March 30, 1993 us 2 Wichita State
## 3 23 Trey Burke PG 6-0 185 November 12, 1992 us 5 Michigan
## 4 21 Damyean Dotson SG 6-5 210 May 6, 1994 us 1 Oregon, Houston
## 5 13 Henry Ellenson PF 6-10 240 January 13, 1997 us 2 Marquette
## 6 0 Enes Freedom C 6-10 250 May 20, 1992 ch 7
You might notice that the html_table() function does not keep the hyperlinks! This is annoying since these are the links we want to follow. So, let’s go back to the HTML and investigate if we can find “href” items. “href” is HTML-speak for hyperlink.
myHTML %>%
  html_nodes("#roster") %>%
  html_nodes("a") %>% # <a> nodes are where hrefs are kept;
  # you need an <a> to give text, href, and anything else.
  html_attr("href") -> linkz
head(linkz)
## [1] "/players/a/allenka01.html"
## [2] "/friv/colleges.fcgi?college=arizona"
## [3] "/players/b/bakerro01.html"
## [4] "/friv/colleges.fcgi?college=wichitast"
## [5] "/players/b/burketr01.html"
## [6] "/friv/colleges.fcgi?college=michigan"
# Remove the NA ones, keep only the ones with "players" in it.
# Next, add the front half of the website to it.
linkz <- linkz[grepl("players", linkz)]
head(linkz)
## [1] "/players/a/allenka01.html" "/players/b/bakerro01.html"
## [3] "/players/b/burketr01.html" "/players/d/dotsoda01.html"
## [5] "/players/e/ellenhe01.html" "/players/k/kanteen01.html"
linkz <- paste0("https://www.basketball-reference.com", linkz)
head(linkz)
## [1] "https://www.basketball-reference.com/players/a/allenka01.html"
## [2] "https://www.basketball-reference.com/players/b/bakerro01.html"
## [3] "https://www.basketball-reference.com/players/b/burketr01.html"
## [4] "https://www.basketball-reference.com/players/d/dotsoda01.html"
## [5] "https://www.basketball-reference.com/players/e/ellenhe01.html"
## [6] "https://www.basketball-reference.com/players/k/kanteen01.html"
knicks$playerLinkz <- linkz
Now we have all of the player pages! Notice that the table contains more href elements than just the player links (the college links, for example). This can become tricky in two ways. First, there may be no easy way to tell the links you want apart from the links you do not want. Luckily for us, we were able to tell the difference since the roots of the links were different. Second, sometimes not all rows have a hyperlink. This means that the vector of hyperlinks will be shorter than the table, and you will have to be clever about finding where those “holes” are located.
So now, we have to visit each player’s page, check for a twitter account, and pull it if it exists.
knicks$twitter <- NA #making a new column of all NA values
#for(i in 1:length(linkz)){ #normally, you would run this for loop, but for brevity I am running only the first 4 times through the loop
for(i in 1:4){
  linkz[i] %>%
    read_html() %>%
    html_nodes("#meta") %>%
    html_nodes("a") %>%
    html_attr("href") -> temp # this is where the twitter account will be located
  # I will leave finding this as practice
  temp <- temp[grepl("twitter.com", temp)] # eliminate all of the links that do not contain "twitter.com"
  # if the player does not have a twitter account, we will have removed all links!
  # use an if statement to be sure we are only replacing the NA in the table with a twitter account.
  if(length(temp) > 0){
    knicks$twitter[i] <- temp
  }
}
head(knicks, 4)
## No. Player Pos Ht Wt Birth.Date Var.7 Exp College
## 1 0 Kadeem Allen PG 6-1 200 January 15, 1993 us 1 Arizona
## 2 31 Ron Baker SG 6-4 220 March 30, 1993 us 2 Wichita State
## 3 23 Trey Burke PG 6-0 185 November 12, 1992 us 5 Michigan
## 4 21 Damyean Dotson SG 6-5 210 May 6, 1994 us 1 Oregon, Houston
## playerLinkz
## 1 https://www.basketball-reference.com/players/a/allenka01.html
## 2 https://www.basketball-reference.com/players/b/bakerro01.html
## 3 https://www.basketball-reference.com/players/b/burketr01.html
## 4 https://www.basketball-reference.com/players/d/dotsoda01.html
## twitter
## 1 https://twitter.com/AllenKadeem
## 2 https://twitter.com/RonBaker31
## 3 https://twitter.com/TreyBurke
## 4 https://twitter.com/wholeteamDot
Now we have the 2019 roster with everyone’s Twitter account (if they have one). Using this, you might go on to scrape Twitter for player news.
Code breaks. In my case, a lot. When code breaks, the entire program ends. Sometimes this is okay, but other times it isn’t what we want. If we can anticipate code breaking, we might just want R to skip these cases. For example, suppose we are crawling over those Twitter accounts. If a player changes their handle and basketball-reference does not have a chance to update it, our code might fail. To have R jump over this, we can use tryCatch(). Check out the following example.
"https://www.basketball-reference.com/teams/NYK/3019.html" %>%
read_html()
## Error in open.connection(x, "rb"): HTTP error 404.
tryCatch({
  "https://www.basketball-reference.com/teams/NYK/3019.html" %>%
    read_html()
}, error = function(e){
  print("404 error ... try again")
})
## [1] "404 error ... try again"
tryCatch({
  "https://www.basketball-reference.com/teams/NYK/2019.html" %>%
    read_html()
}, error = function(e){
  print("404 error ... try again")
})
## {html_document}
## <html data-version="klecko-" data-root="/home/bbr/build" lang="en" class="no-js">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="bbr">\n\n \n\n<!-- Google Tag Manager (noscript) -->\n<no ...
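In practice, you would wrap the read_html() call inside your crawl loop so that one bad page does not stop the entire crawl. A small sketch (the year range is arbitrary):
for(year in 2015:2019){
  tryCatch({
    paste0("https://www.basketball-reference.com/teams/NYK/", year, ".html") %>%
      read_html() %>%
      write_html(paste0("knicks", year, ".html"))
  }, error = function(e){
    print(paste("Problem crawling", year, "... skipping"))
  })
}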
Sometimes you cannot get what you need with read_html() due to the website being dynamic, and it is either too hard to crawl via session() or the data is generated by JavaScript rather than those XHR files or GraphQL. For these cases, there is a tool called RSelenium.
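A bare-bones sketch of how RSelenium might be used (this assumes the RSelenium package plus a working browser driver, which takes some setup; the port and wait time are arbitrary):
library("RSelenium")
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr <- rD$client
remDr$navigate("https://analytics.usa.gov/")
Sys.sleep(5) # give the page's JavaScript time to run
remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes("#current_visitors") %>%
  html_text()
remDr$close()
rD$server$stop()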
Suppose you need to scrape many, many pages. Sometimes we can speed things up by using parallel programming!
library("foreach")
library("doParallel")
corez <- floor(detectCores() * 3 / 4); print(corez) # round down to a whole number of cores (about 3/4 of what is available)
yearz <- c(2010:2019)
teamz <- c("NYK", "DAL", "LAL", "LAC", "BOS", "CHI")
registerDoParallel(corez)
foreach(tm = teamz,
        .packages = c("rvest", "magrittr")) %dopar%
  {
    for(year in yearz){
      paste0("https://www.basketball-reference.com/teams/", tm, "/", year, ".html") %>%
        read_html() %>%
        write_html(paste0(tm, "_", year, ".html"))
    }
  }
closeAllConnections(); stopImplicitCluster()