In this exercise, we will learn how to scrape tables from a webpage using the rvest
package in R.
This time, we will scrape an HTML table from the hockey teams page of the www.scrapethissite.com website.
url <- "https://www.scrapethissite.com/pages/forms/"
The steps are the same as in the previous exercise.
We use the read_html() function to download and parse the HTML file of the webpage. Instead of using the html_nodes() function, we will use the html_table() function to extract the table from the HTML file. Since there is only a single table on the webpage, we will pluck the first element of the list returned by html_table().
hockey_teams <- url %>% read_html() %>% html_table() %>% .[[1]]
head(hockey_teams)
# A tibble: 6 x 9
`Team Name` Year Wins Losses `OT Losses` `Win %` `Goals For (GF)` `Goals Against (GA)` `+ / -`
<chr> <int> <int> <int> <lgl> <dbl> <int> <int> <int>
1 Boston Bruins 1990 44 24 NA 0.55 299 264 35
2 Buffalo Sabres 1990 31 30 NA 0.388 292 278 14
3 Calgary Flames 1990 46 26 NA 0.575 344 263 81
4 Chicago Blackhawks 1990 49 23 NA 0.613 284 211 73
5 Detroit Red Wings 1990 34 38 NA 0.425 273 298 -25
6 Edmonton Oilers 1990 37 37 NA 0.463 272 272 0
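If the page ever contained more than one table, blindly taking the first list element could silently grab the wrong one. As a minimal alternative sketch, you can target the table node with a CSS selector instead; the plain "table" selector is an assumption based on this page containing a single <table> element (html_element() is the rvest 1.0 replacement for html_node()).
library(rvest)

hockey_teams <- url %>%
  read_html() %>%
  html_element("table") %>%  # first node matching the "table" selector
  html_table()               # convert that single node to a tibble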
By default, the webpage only shows 25 entries per page. To scrape more entries, we need to select a higher number in the dropdown menu at the bottom right of the page. This allows us to scrape up to 100 entries. Note that the URL changes when you select a different number of entries. The URL for 100 entries is as follows.
url <- "https://www.scrapethissite.com/pages/forms/?per_page=100"
We can now scrape the table again, and we will see that the table has 100 rows.
hockey_teams <- url %>% read_html() %>% html_table() %>% .[[1]]
nrow(hockey_teams)
[1] 100
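Since only the per_page query parameter changes between these URLs, you can also build the URL programmatically instead of hard-coding it. A small sketch (base_url and per_page are just illustrative names):
base_url <- "https://www.scrapethissite.com/pages/forms/"
per_page <- 100
url <- paste0(base_url, "?per_page=", per_page)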
Play around with the webpage and try to find a way to scrape the first 200 entries. Store the data frame in hockey_teams.
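If you get stuck, here is one possible approach, sketched under the assumption that the page also accepts a page_num query parameter (you can verify this by clicking through the pagination links at the bottom of the page and watching the address bar): request two pages of 100 entries each and stack them.
library(rvest)
library(dplyr)

base_url <- "https://www.scrapethissite.com/pages/forms/"

# Scrape pages 1 and 2 (100 entries each) and combine them into one tibble.
hockey_teams <- lapply(1:2, function(page) {
  paste0(base_url, "?per_page=100&page_num=", page) %>%
    read_html() %>%
    html_table() %>%
    .[[1]]
}) %>%
  bind_rows()

nrow(hockey_teams)  # should be 200 if the page_num assumption holds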
Assume that: