Web Scraping HTML Tables
By Nicholas Vietto
March 12, 2024
If you share my passion for the NFL, you may have considered finding some NFL data to analyze. Instead of manually inputting data into Excel, you can use a more efficient method: web scraping with the {rvest} package in R.
Let’s scrape the career stats of Jared Goff from my favorite team - the Detroit Lions 🦁
Set up & Scrape
Let's install and load our packages.
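We will need the tidyverse, rvest, and janitor packages. p_load() from {pacman} installs any that are missing and then loads them, so {pacman} itself needs to be installed first.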
pacman::p_load(tidyverse, rvest, janitor)
Next, we need the URL of the page we want to scrape, in this case Jared Goff's Wikipedia page. Copy it and store it in an object.
# here we are creating the URL object
goffstats_url <- 'https://en.wikipedia.org/wiki/Jared_Goff'
# this reads the html and passes it through the html_table function
goffstats <- read_html(goffstats_url) |>
  html_table()
Note: If you’re curious what html_table() does, copy and paste the following code into your console and press enter: ?html_table
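If you want to see the shape of what html_table() hands back without hitting a real website, here is a minimal sketch. The two tiny tables are made up for illustration, and minimal_html() is just a convenient way to build a small page in memory.

```r
library(rvest)

# A stand-in page with two made-up tables (not real NFL data)
page <- minimal_html('
  <table>
    <tr><th>Name</th><th>Score</th></tr>
    <tr><td>a</td><td>1</td></tr>
  </table>
  <table>
    <tr><th>Stat</th><th>Value</th><th>Unit</th></tr>
    <tr><td>x</td><td>10</td><td>m</td></tr>
    <tr><td>y</td><td>20</td><td>s</td></tr>
  </table>
')

# Called on a whole page, html_table() returns a list
# with one tibble per <table> element it finds
tables <- html_table(page)

length(tables)   # number of tables found on the page
tables[[1]]      # pull out a single table with [[ ]]
```

This is why goffstats shows up in the environment as a list rather than a single data frame.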
One effective way to find the right HTML table is to pull up the object (i.e., goffstats) in the Environment pane and examine the Value column, looking for the dimensions of the table you want from the page you scraped (i.e., the wiki page). For example, head to the Wikipedia page, scroll down to NFL career statistics, then Regular season, and count the rows and columns. Once you have a decent idea of the dimensions, go back to your object (i.e., goffstats) and look for a match. Element [[7]] with the value 10 x 23 is the one we want, so we create a new object from it. Keep in mind that the index can shift whenever the Wikipedia page is edited, so check the dimensions rather than relying on the number.
career_stats <- goffstats[[7]]
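If eyeballing the environment pane gets tedious, the dimension matching can also be done in code. This is only a sketch on a made-up two-table page; the object names (tables, target, wanted) are mine, not part of the scrape above.

```r
library(rvest)

# A stand-in page with two made-up tables of different sizes
page <- minimal_html('
  <table>
    <tr><th>Name</th><th>Score</th></tr>
    <tr><td>a</td><td>1</td></tr>
  </table>
  <table>
    <tr><th>Stat</th><th>Value</th><th>Unit</th></tr>
    <tr><td>x</td><td>10</td><td>m</td></tr>
    <tr><td>y</td><td>20</td><td>s</td></tr>
  </table>
')
tables <- html_table(page)

# Row and column counts for every table, in list order
n_rows <- vapply(tables, nrow, integer(1))
n_cols <- vapply(tables, ncol, integer(1))

# Position of the table that is 2 rows by 3 columns
target <- which(n_rows == 2 & n_cols == 3)
wanted <- tables[[target]]
```

On the real scrape, the same idea would be something like which(vapply(goffstats, ncol, integer(1)) == 23), assuming only one table on the page has 23 columns.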
🎉 Alright, we have officially scraped the website and got our HTML table! 🎉
Cleaning
Alright, here's where it gets tricky. We want to use the {janitor} package and its clean_names() function, which converts all the column names into a consistent, code-friendly format (lowercase snake_case, with repeated names numbered).
career_stats1 <- career_stats |>
  clean_names() |>
  as_tibble() |> # I prefer to look at things in tibble format, it's just a bit nicer
  select(-c(rushing, rushing_2, rushing_4, sacked_2)) # this removes the columns that I don't really find important for the QB position
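The numbered names (rushing_2, rushing_4, and so on) come from clean_names() de-duplicating repeated headers. Here is a small sketch of that behavior on a made-up data frame that mimics a table with a merged "Passing" header; the values are invented.

```r
library(janitor)

# A made-up table where one header spans several columns,
# so the same name repeats and the real sub-headers land in row 1
raw <- data.frame(
  Year    = c("Year", "2020"),
  Passing = c("Cmp", "10"),
  Passing = c("Att", "20"),
  Passing = c("Yds", "1,234"),
  check.names = FALSE   # keep the duplicated names as they are
)

names(raw)

# clean_names() lowercases the names and numbers the repeats
cleaned <- clean_names(raw)
names(cleaned)
```

The suffix only tells you the position of the repeat, so you still need the sub-header row (or the original page) to know what each numbered column actually holds.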
Next, we are going to rearrange the table so the character columns sit at the front of our tibble.
# here I'm creating an object that holds the columns I want to move up front
# we will find out soon why I want these up front
career_statsM <- c("year",
                   "team",
                   "games_3")
# this creates a new object with the non-numeric columns in front and everything else after them
career_stats2 <- career_stats1 |>
  select(all_of(career_statsM), everything())
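Here is the same reordering trick on a tiny made-up tibble, in case you want to see it in isolation. The column names and values are placeholders.

```r
library(dplyr)

# A made-up tibble with the character columns scattered around
toy <- tibble(
  games = c(16, 17),
  year  = c("2020", "2021"),
  yards = c(100, 200),
  team  = c("AAA", "BBB")
)

# Columns to move to the front, stored as a character vector
front <- c("year", "team")

# all_of() selects the named columns first (and errors if one is missing);
# everything() then adds the remaining columns in their original order
reordered <- toy |>
  select(all_of(front), everything())

names(reordered)
```

Because all_of() fails loudly on a missing name, a typo in the vector shows up right away instead of silently dropping a column.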
Now we have to do some major cleaning of this data. We are going to rename, slice, and mutate. Examples of these functions can be found in the dplyr documentation.
career_stats3 <- career_stats2 |>
  rename(games_played = games, # making these columns a bit easier to understand and work with
         games_started = games_2,
         record = games_3,
         completions = passing,
         attempts = passing_2,
         comp_pct = passing_3,
         yards = passing_4,
         ypera = passing_5,
         longest_pass = passing_6,
         tds = passing_7,
         intercept = passing_8,
         pass_rate = passing_9,
         rush_ypera = rushing_3,
         rush_td = rushing_5,
         fumbles_lost = fumbles_2) |>
  slice_tail(n = -1) |> # negative n keeps everything except the first row (i.e., the extra row of column names)
  slice_head(n = -1) |> # negative n keeps everything except the last row (i.e., career stats)
  mutate(yards = str_remove_all(yards, ",")) |> # remove the thousands separator; with the "," in place, as.numeric() would return NA
  mutate(across(4:19, as.numeric)) # here's why we moved those other columns to the front, so we can easily convert the rest to numerics
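The slice and mutate steps are easier to follow on a small example. This sketch uses a made-up three-column tibble with the same problems as the scraped table: a stray header row on top, a totals row at the bottom, and commas inside numbers.

```r
library(dplyr)
library(stringr)

# Made-up data: row 1 repeats the headers, the last row is a total
toy <- tibble(
  year  = c("Year", "2020", "2021", "Career"),
  yards = c("Yds", "1,234", "987", "2,221"),
  tds   = c("TD", "10", "8", "18")
)

cleaned <- toy |>
  slice_tail(n = -1) |>                          # keep all rows except the first
  slice_head(n = -1) |>                          # keep all rows except the last
  mutate(yards = str_remove_all(yards, ",")) |>  # "1,234" becomes "1234"
  mutate(across(2:3, as.numeric))                # convert columns 2 and 3 to numeric

cleaned
```

The order matters here: the header and totals rows go first, then the commas, and only then the numeric conversion, so as.numeric() never sees text it cannot parse.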
Optional
The next code chunk builds a formatted table with knitr::kable(). The output is static rather than interactive, and it displays best when rendered to an HTML file with R Markdown or Quarto.
stats_table <- career_stats3 |>
  knitr::kable()
stats_table
That's it! The data is clean, and now we can analyze it! 🏈
Analyze
You can do anything with the data now, but make sure you are working with the numeric columns; mean() on a character column returns NA with a warning.
mean(career_stats3$yards)
[1] 3803.625
mean(career_stats3$tds)
[1] 23.125
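If you want averages for several columns at once, one option is summarise() with across(). This is a sketch on made-up numbers, not Goff's actual stats, and it assumes you want missing values skipped.

```r
library(dplyr)

# Made-up season totals, with one missing value
toy <- tibble(
  year  = c("2020", "2021", "2022"),
  yards = c(4000, 3000, NA),
  tds   = c(20, 22, 24)
)

# Average every numeric column, skipping missing values;
# the character column (year) is left out by where(is.numeric)
summary_tbl <- toy |>
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))

summary_tbl
```

Selecting with where(is.numeric) means a column that was accidentally left as character simply drops out of the summary, which is a handy signal that the conversion step missed it.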
Final Remarks
There are legal considerations before scraping any website. Wikipedia's text is openly licensed (Creative Commons Attribution-ShareAlike) rather than public domain, so you're probably OK as long as you follow the license terms. Please refer to the web scraping chapter of the R4DS book for more information; it explains things much better than I can.
If you want to learn more about {rvest} check out R for Data Science (2e).