Scraping Tables with rvest

In this exercise, we will learn how to scrape tables from a webpage using the rvest package in R.

Defining the URL

This time, we will scrape an HTML table from the hockey teams page of the www.scrapethissite.com website.

url <- "https://www.scrapethissite.com/pages/forms/"

Downloading and Parsing the HTML File

The steps are the same as in the previous exercise: we use the read_html() function to download and parse the HTML file of the webpage. Instead of the html_nodes() function, however, we use the html_table() function, which extracts every table in the HTML file and returns them as a list of data frames. Since there is only a single table on the webpage, we pluck the first element of that list.

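library(rvest)  # also makes the %>% pipe available
# Download and parse the page, extract all tables, and keep the first one
hockey_teams <- url %>% read_html() %>% html_table() %>% .[[1]]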
head(hockey_teams)
# A tibble: 6 x 9
  `Team Name`         Year  Wins Losses `OT Losses` `Win %` `Goals For (GF)` `Goals Against (GA)` `+ / -`
  <chr>              <int> <int>  <int> <lgl>         <dbl>            <int>                <int>   <int>
1 Boston Bruins       1990    44     24 NA            0.55               299                  264      35
2 Buffalo Sabres      1990    31     30 NA            0.388              292                  278      14
3 Calgary Flames      1990    46     26 NA            0.575              344                  263      81
4 Chicago Blackhawks  1990    49     23 NA            0.613              284                  211      73
5 Detroit Red Wings   1990    34     38 NA            0.425              273                  298     -25
6 Edmonton Oilers     1990    37     37 NA            0.463              272                  272       0
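
To make the list behaviour of html_table() concrete without relying on the network, here is a minimal sketch on a tiny hand-written page. The HTML snippet and the ids "teams" and "other" are made up for illustration and are not taken from the real site; the sketch also assumes rvest 1.0.0 or later, where minimal_html() and html_element() are available.

```r
library(rvest)

# A small hand-written page (NOT the real site) containing two tables.
page <- minimal_html('
  <table id="teams">
    <tr><th>Team Name</th><th>Wins</th></tr>
    <tr><td>Boston Bruins</td><td>44</td></tr>
    <tr><td>Buffalo Sabres</td><td>31</td></tr>
  </table>
  <table id="other">
    <tr><th>x</th></tr>
    <tr><td>1</td></tr>
  </table>
')

# Called on a whole document, html_table() returns a list
# with one tibble per <table> element.
all_tables <- html_table(page)
length(all_tables)

# Selecting a single element first yields one tibble directly,
# so no list indexing is needed.
teams <- page %>% html_element("#teams") %>% html_table()
teams
```

When a page contains several tables, selecting the one you want with a CSS selector is usually more robust than relying on its position in the list.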

Scraping More Entries

By default, the webpage shows only 25 entries per page. To scrape more entries, we need to select a higher number in the dropdown menu at the bottom right of the page, which offers up to 100 entries per page. Note that the URL changes when you select a different number of entries: the choice is passed to the server as a query parameter. The URL for 100 entries is as follows.

url <- "https://www.scrapethissite.com/pages/forms/?per_page=100"

We can now scrape the table again, and we will see that the table has 100 rows.

hockey_teams <- url %>% read_html() %>% html_table() %>% .[[1]]
nrow(hockey_teams)
[1] 100
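
Because the number of entries is just a query parameter in the URL, the URL can also be built programmatically instead of being copied from the browser. The following is a minimal sketch; the helper name forms_url() is hypothetical, and nothing here claims which per_page values the site accepts beyond those offered in the dropdown menu.

```r
base_url <- "https://www.scrapethissite.com/pages/forms/"

# Hypothetical helper: append the per_page query parameter to the base URL.
forms_url <- function(per_page = 25) {
  paste0(base_url, "?per_page=", per_page)
}

forms_url(100)
# [1] "https://www.scrapethissite.com/pages/forms/?per_page=100"
```

The result can be passed to read_html() exactly like the hard-coded URL above.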

Exercise

Play around with the webpage and try to find a way to scrape the first 200 entries. Store the resulting data frame in hockey_teams.


Assume that: