---
title: "Web scraping in R with rvest"
author: "Tony D"
date: "2025-03-12"
categories:
- Tool
- Webscrap
- R
execute:
warning: false
error: false
image: "logo.png"
---
This guide walks through web scraping in R with the `rvest` package: reading HTML from a URL, then extracting specific elements such as text, links, and tables. It also covers `read_html_live()` for dynamic web pages that load content only after interaction, such as scrolling. This document is also part of the [R handbook](https://jcfly3000.github.io/Into-R/other/2%20web%20scrap/1%20web%20scrap%20with%20rvest.html).
# Load packages
```{r}
library(rvest)
library(tidyverse)
```
# Read HTML
```{r}
# Download the page and parse it into an HTML document
url <- "https://www.r-project.org/"
page <- read_html(url)
```
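To experiment with the later extraction steps without hitting a live site, a small offline document can stand in for `page`. The sketch below uses `minimal_html()` from rvest; the markup is invented for illustration and is not taken from r-project.org.

```{r}
library(rvest)

# Build a tiny document from a string; the markup is made up for this example
demo <- minimal_html('
  <h1>Demo page</h1>
  <p>First paragraph</p>
  <p>Second paragraph</p>
')

# The result is an xml2 document, the same kind of object read_html(url) returns
class(demo)
```

Anything that works on `demo` should work the same way on a page fetched with `read_html()`.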
# Get HTML text
```{r}
# Text of the first <h1> element, with surrounding whitespace removed
page |> html_element(css = "h1") |> html_text(trim = TRUE)
```
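`html_element()` returns only the first match, while `html_elements()` returns every match. A minimal sketch of the difference, using made-up HTML built with `minimal_html()` rather than the real page:

```{r}
library(rvest)

# Invented markup, for illustration only
demo <- minimal_html('
  <h1>  Demo page  </h1>
  <p>First paragraph</p>
  <p>Second paragraph</p>
')

# First match only; trim = TRUE drops the padding around the heading
heading <- demo |> html_element("h1") |> html_text(trim = TRUE)

# Every match, returned as a character vector
all_p <- demo |> html_elements("p") |> html_text(trim = TRUE)

# A selector with no match yields NA instead of an error
no_match <- demo |> html_element("h2") |> html_text(trim = TRUE)

heading
all_p
no_match
```

The `NA` for a missing element is useful when scraping many pages, because one page without the element does not stop the whole run.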
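# Get HTML link
```{r}
# Visible text of the first link inside a <strong> element
page |> html_element(css = "strong a") |> html_text(trim = TRUE)
```
```{r}
# Target URL of the same link, read from its href attribute
page |> html_element(css = "strong a") |> html_attr("href")
```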
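The same two calls extend to every link on a page. If the `href` values are relative, `xml2::url_absolute()` can resolve them against the page URL. The sketch below uses invented markup and assumes `https://www.r-project.org/` as the base URL purely for illustration.

```{r}
library(rvest)

base_url <- "https://www.r-project.org/"

# Invented markup with one relative and one absolute link
demo <- minimal_html('
  <a href="about.html">About R</a>
  <a href="https://cran.r-project.org/">CRAN</a>
')

links <- demo |> html_elements("a")

# One row per link: its visible text and its raw href
link_data <- data.frame(
  text = html_text(links, trim = TRUE),
  href = html_attr(links, "href"),
  stringsAsFactors = FALSE
)

# Resolve relative links against the page URL; absolute links pass through
link_data$url <- xml2::url_absolute(link_data$href, base_url)
link_data
```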
# Get table
```{r}
url <- "https://en.wikipedia.org/wiki/List_of_Formula_One_drivers"
page <- read_html(url)
```
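## Get 3rd table
Find the table's XPath (for example, with the browser's developer tools) and pass it to `html_element()`.
```{r}
# Third <table> inside the main content block: the list of drivers
drivers <- page |>
  html_element(xpath = '//*[@id="mw-content-text"]/div[1]/table[3]') |>
  html_table()
drivers |> head()
```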
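A positional XPath such as `table[3]` can break when the page layout changes. One alternative, sketched here on invented markup, is to collect all tables with a CSS selector, inspect their headers, and pick the one needed. The `wikitable` class is an assumption about how the target page marks its data tables, so check it against the real page before relying on it.

```{r}
library(rvest)

# Invented markup with two small tables
demo <- minimal_html('
  <table class="wikitable">
    <tr><th>Driver</th><th>Wins</th></tr>
    <tr><td>A</td><td>3</td></tr>
  </table>
  <table class="wikitable">
    <tr><th>Country</th><th>Drivers</th></tr>
    <tr><td>Italy</td><td>2</td></tr>
    <tr><td>Belgium</td><td>1</td></tr>
  </table>
')

# All tables carrying the (assumed) class
tables <- demo |> html_elements("table.wikitable")
length(tables)

# Column names of each table, to decide which one is wanted
lapply(tables, function(tbl) names(html_table(tbl)))

# Pick a table by position within the matched set
second <- html_table(tables[[2]])
second
```

Note that the position within the matched set is not necessarily the same as the position used in the XPath, which counts only `table` children of one specific `div`.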
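## Get 4th table
Find the table's XPath in the same way; only the table index changes.
```{r}
# Fourth <table> inside the main content block: the summary by country
by_country <- page |>
  html_element(xpath = '//*[@id="mw-content-text"]/div[1]/table[4]') |>
  html_table()
by_country |> head()
```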
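Scraped Wikipedia tables often arrive with footnote markers in column names (for example `Points[a]`) and with numbers stored as text. A minimal cleaning sketch follows; the rows are made up, and only the column names mirror the drivers table.

```{r}
library(tidyverse)

# Made-up rows; every column is character, as it often is after html_table()
raw <- tibble(
  `Driver name` = c("Driver A", "Driver B"),
  `Race entries` = c("3", "120"),
  `Points[a]` = c("0", "12.5")
)

clean <- raw |>
  # Drop bracketed footnote markers such as "[a]" from the column names
  rename_with(\(x) str_remove_all(x, "\\[.*?\\]")) |>
  # Convert the numeric-looking columns from text to numbers
  mutate(across(c(`Race entries`, Points), parse_number))

clean
```

`parse_number()` ignores non-numeric characters around the first number it finds, so columns that hold ranges or lists (such as seasons) are better left as text.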
# Using read_html_live() for more advanced web scraping
```{r}
library(rvest)
library(tidyverse)
```
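```{r}
# read_html_live() drives a real browser session; it needs the chromote
# package and a local Chrome installation
url <- "https://www.whiskybase.com/whiskies/"
web <- read_html_live(url)
```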
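```{r}
# Open the live page in a browser window to inspect it interactively
web$view()
```
```{r}
# Current scroll offsets of the page
web$get_scroll_position()
```
```{r}
# Scroll until the element matching the selector is visible
web$scroll_into_view(css = ".bs5__col-lg-3:nth-child(1) li:nth-child(5) a")
```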
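For pages that keep loading items as you scroll, a common pattern is to scroll, wait, and stop once the number of matched elements no longer grows. The helper below is a hypothetical sketch, not part of rvest: `scroll_until_stable()` and its arguments are names invented here, and the selector in the commented usage is a placeholder.

```{r}
# Hypothetical helper: repeat a scroll action until the item count stops growing.
# count_items and scroll_once are functions supplied by the caller.
scroll_until_stable <- function(count_items, scroll_once,
                                max_rounds = 10, pause = 1) {
  previous <- count_items()
  for (round in seq_len(max_rounds)) {
    scroll_once()
    Sys.sleep(pause)              # give the page time to load new content
    current <- count_items()
    if (current <= previous) break  # nothing new appeared, so stop
    previous <- current
  }
  previous
}

# Usage with a live session (not run; ".some-item" is a placeholder selector):
# scroll_until_stable(
#   count_items = function() length(html_elements(web, ".some-item")),
#   scroll_once = function() web$scroll_by(top = 2000)
# )
```

Passing the two actions in as functions keeps the loop independent of any particular site, and `max_rounds` caps the run on pages that never stop loading.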
```{r}
url_rating <- "https://www.whiskybase.com/profile/georges"
content <- read_html_live(url_rating)

# Labels and values shown on the profile page
info_title <- content |> html_elements(".title") |> html_text(trim = TRUE)
info_value <- content |> html_elements(".value") |> html_text(trim = TRUE)

# data.frame() requires both vectors to have the same length
info_data <- data.frame(info_title, info_value)
```
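Collecting titles and values in two separate passes works only when every title has a value. If one is missing, the vectors have different lengths and the rows no longer line up. A sketch of a more defensive pattern is to select each containing element first and then call `html_element()` inside it, which returns `NA` for a missing child. The `.stat` container class and the markup below are assumptions for illustration, not the real Whiskybase structure.

```{r}
library(rvest)

# Invented markup: the second entry has no value
demo <- minimal_html('
  <div class="stat"><span class="title">Ratings</span><span class="value">120</span></div>
  <div class="stat"><span class="title">Member since</span></div>
  <div class="stat"><span class="title">Bottles</span><span class="value">35</span></div>
')

# One node per entry (".stat" is a hypothetical container class)
stats <- demo |> html_elements(".stat")

# html_element() returns exactly one result per entry, NA when absent
info <- tibble::tibble(
  title = stats |> html_element(".title") |> html_text(trim = TRUE),
  value = stats |> html_element(".value") |> html_text(trim = TRUE)
)
info
```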
```{r}
#| eval: false
web$view()
```
```{r}
# Text of the article widget on the live page
web |> html_elements(".widget-article-content") |> html_text(trim = TRUE)
```
<br><br><br>
```{r, attr.output='.details summary="sessionInfo()"'}
sessionInfo()
```
# Reference
- [R for Data Science (2e): Web scraping](https://r4ds.hadley.nz/webscraping.html)
- [rvest reference: read_html_live()](https://rvest.tidyverse.org/reference/read_html_live.html)