Web scraping in R with rvest

Tool
Webscrap
R
Author

Tony D

Published

March 12, 2025

Basic web scraping in R with rvest.

This material has also been added to the R handbook.

load packages

Code
library(rvest)
library(tidyverse)

read html

Code
url='https://www.r-project.org/'
page=read_html(url)

get HTML text

Code
page |> html_element(css = "h1") |> html_text(trim = TRUE)
[1] "The R Project for Statistical Computing"
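
The same selector calls can be tried without network access by running them on an inline document. The sketch below uses rvest's minimal_html() helper; the HTML content and the "intro" class are made up for illustration and are not taken from the R project page.

```r
library(rvest)

# Hypothetical inline page, so the example runs without network access
page <- minimal_html('
  <h1> The R Project </h1>
  <p class="intro">First paragraph</p>
  <p>Second paragraph</p>
')

# html_element() returns only the first match
heading <- page |> html_element("h1") |> html_text(trim = TRUE)

# html_elements() returns every match
paragraphs <- page |> html_elements("p") |> html_text(trim = TRUE)

# A CSS class selector narrows the match to one paragraph
intro <- page |> html_element("p.intro") |> html_text(trim = TRUE)
```

Here trim = TRUE strips the spaces around the heading text, which is why the example above also uses it.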

get table

Code
url='https://en.wikipedia.org/wiki/List_of_Formula_One_drivers'
page=read_html(url)

get 3rd table

find table xpath

Code
drivers_table <- page |> html_element(xpath = '//*[@id="mw-content-text"]/div[1]/table[3]') |> html_table()
drivers_table |> head()
# A tibble: 6 × 11
  `Driver name`     Nationality    `Seasons competed` `Drivers' Championships`
  <chr>             <chr>          <chr>              <chr>                   
1 Carlo Abate       Italy          1962–1963          0                       
2 George Abecassis  United Kingdom 1951–1952          0                       
3 Kenny Acheson     United Kingdom 1983, 1985         0                       
4 Andrea de Adamich Italy          1968, 1970–1973    0                       
5 Philippe Adams    Belgium        1994               0                       
6 Walt Ader         United States  1950               0                       
# ℹ 7 more variables: `Race entries` <chr>, `Race starts` <chr>,
#   `Pole positions` <chr>, `Race wins` <chr>, Podiums <chr>,
#   `Fastest laps` <chr>, `Points[a]` <chr>
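
A positional XPath such as table[3] can break when the page layout changes. One possible alternative is to collect every table matching a CSS class and then pick one by index. The sketch below runs on a made-up inline page; the "wikitable" class and the table contents are assumptions for illustration, so check the real page's markup before relying on them.

```r
library(rvest)

# Hypothetical page with two small tables
page <- minimal_html('
  <table class="wikitable">
    <tr><th>Driver</th><th>Wins</th></tr>
    <tr><td>A</td><td>1</td></tr>
  </table>
  <table class="wikitable">
    <tr><th>Country</th><th>Drivers</th></tr>
    <tr><td>X</td><td>26</td></tr>
    <tr><td>Y</td><td>18</td></tr>
  </table>
')

# html_elements() plus html_table() gives a list with one tibble per table
tables <- page |> html_elements("table.wikitable") |> html_table()

# Pick a table by its position in the list
second <- tables[[2]]
```

Inspecting names() of each list element is a quick way to confirm that the right table was picked.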

get 4th table

find table xpath

Code
countries_table <- page |> html_element(xpath = '//*[@id="mw-content-text"]/div[1]/table[4]') |> html_table()
countries_table |> head()
# A tibble: 6 × 7
  Country     Totaldrivers Champions Championships `Race wins` `First driver(s)`
  <chr>       <chr>        <chr>     <chr>         <chr>       <chr>            
1 Argentinad… 26           1(Fangio… 5(1951, 1954… "38\n(Fang… Juan Manuel Fang…
2 Australiad… 18           2(Brabha… 4(1959, 1960… "45\n(Brab… Tony Gaze(1952 B…
3 Austriadet… 16           2(Rindt,… 4(1970, 1975… "41\n(Rind… Jochen Rindt(196…
4 Belgiumdet… 24           0         0             "11\n(Ickx… Johnny Claes(195…
5 Brazildeta… 33           3(Fittip… 8(1972, 1974… "101\n(Fit… Chico Landi(1951…
6 Canadadeta… 15           1(J. Vil… 1(1997)       "17\n(G. V… Peter Ryan(1961 …
# ℹ 1 more variable: `Most recent driver(s)/Current driver(s)` <chr>
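
The columns above come back as character strings that mix a count with extra text, for example "1(Fangio…". A minimal cleaning sketch in base R is shown below. The data frame is a hand-made stand-in shaped like the scraped output, and leading_int() is a hypothetical helper, not part of rvest.

```r
# Hypothetical values shaped like the scraped columns above
scraped <- data.frame(
  Country = c("Argentina", "Belgium"),
  Totaldrivers = c("26", "24"),
  Champions = c("1(Fangio)", "0"),
  Wins = c("38\n(Fangio)", "11\n(Ickx)"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the leading run of digits, NA when there is none
leading_int <- function(x) {
  m <- regexpr("^[0-9]+", x)
  out <- rep(NA_integer_, length(x))
  out[m > 0] <- as.integer(regmatches(x, m))
  out
}

# Convert the count columns to integers, leaving Country untouched
cleaned <- transform(
  scraped,
  Totaldrivers = leading_int(Totaldrivers),
  Champions = leading_int(Champions),
  Wins = leading_int(Wins)
)
```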

using read_html_live() for more advanced web scraping

Code
library(rvest)
library(tidyverse)
Code
url="https://www.whiskybase.com/whiskies/"
web <- read_html_live(url)
Code
url_rating <- 'https://www.whiskybase.com/profile/georges'
content <- read_html_live(url_rating)
# Collect the text of elements with the classes .title and .value
info_title <- content |> html_elements('.title') |> html_text(trim = TRUE)
info_value <- content |> html_elements('.value') |> html_text(trim = TRUE)
info_data <- data.frame(info_title, info_value)
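
Combining two separately scraped vectors with data.frame() only works when both have the same length, which fails as soon as one entry lacks a value. A possible safeguard is to select each containing row first and then use html_element() (singular) inside it, which returns one result per row and NA where nothing matches. The sketch below uses a made-up inline page; the ".stat" wrapper class is an assumption and may not match the real site's markup.

```r
library(rvest)

# Hypothetical markup: the second row has a title but no value
page <- minimal_html('
  <div class="stat"><span class="title">Ratings</span><span class="value">120</span></div>
  <div class="stat"><span class="title">Notes</span></div>
  <div class="stat"><span class="title">Bottles</span><span class="value">45</span></div>
')

# One node per row
rows <- page |> html_elements(".stat")

# html_element() keeps one output per input row, so the columns stay aligned
info_data <- data.frame(
  info_title = rows |> html_element(".title") |> html_text(trim = TRUE),
  info_value = rows |> html_element(".value") |> html_text(trim = TRUE)
)
```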
Code
web$view()
Code
web %>% html_elements(".widget-article-content") |> html_text(trim = TRUE)
[1] "Welcome to WHISKYBASEIn the last 10 years whiskybase has been building into the platform that we are right now. It started as a small project by Menno but has become the main resource for whiskies. In this anniversary year we have been changing a lot within Whiskybase. This new release with a complete new design is the biggest release so far.There are lot's of new functionalities to be discovered. We are proud of the result but also have big plans for the coming years. And the past months have inspired us to build out Whiskybase even moreWe hope you will be with us for a long time and welcome new members to Whiskybase."




Code
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 15.3.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.4 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
 [5] purrr_1.0.4     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
 [9] ggplot2_3.5.1   tidyverse_2.0.0 rvest_1.0.4    

loaded via a namespace (and not attached):
 [1] utf8_1.2.4          generics_0.1.3      xml2_1.3.7         
 [4] stringi_1.8.4       hms_1.1.3           digest_0.6.37      
 [7] magrittr_2.0.3      evaluate_1.0.3      grid_4.4.1         
[10] timechange_0.3.0    fastmap_1.2.0       jsonlite_1.9.1     
[13] processx_3.8.6      chromote_0.4.0.9000 ps_1.9.0           
[16] promises_1.3.2      httr_1.4.7          selectr_0.4-2      
[19] scales_1.3.0        cli_3.6.4           rlang_1.1.5        
[22] munsell_0.5.1       withr_3.0.2         yaml_2.3.10        
[25] tools_4.4.1         tzdb_0.4.0          colorspace_2.1-1   
[28] curl_6.2.1          vctrs_0.6.5         R6_2.6.1           
[31] lifecycle_1.0.4     htmlwidgets_1.6.4   pkgconfig_2.0.3    
[34] pillar_1.10.1       later_1.4.1         gtable_0.3.6       
[37] glue_1.8.0          Rcpp_1.0.14         xfun_0.51          
[40] tidyselect_1.2.1    rstudioapi_0.17.1   knitr_1.49         
[43] htmltools_0.5.8.1   websocket_1.4.2     rmarkdown_2.29     
[46] compiler_4.4.1     

References

https://r4ds.hadley.nz/webscraping.html

https://rvest.tidyverse.org/reference/read_html_live.html