---
title: "Web scraping in R with rvest"
author: "Tony D"
date: "2025-03-12"
categories:
- Tool
- Webscrap
- R
execute:
warning: false
error: false
image: "logo.png"
---
A basic introduction to web scraping in R with rvest.
It also serves as an update to the [R handbook](https://jcfly3000.github.io/Into-R/other/2%20web%20scrap/1%20web%20scrap%20with%20rvest.html).
# load packages
```{r}
library(rvest)
library(tidyverse)
```
# read html
```{r}
# Download and parse the page once, then query the parsed document
url <- "https://www.r-project.org/"
page <- read_html(url)
```
# get HTML text
```{r}
# Text of the first <h1> element, with surrounding whitespace trimmed
page |> html_element(css = "h1") |> html_text(trim = TRUE)
```
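`html_element()` returns only the first match, while `html_elements()` returns every match. The sketch below illustrates the difference on a small made-up page built with `minimal_html()`, so it runs without network access. The HTML is an assumption for illustration, not the actual r-project.org markup.

```{r}
library(rvest)

# A small made-up page, so the example runs offline
demo_page <- minimal_html('
  <h1>  First heading </h1>
  <p>Intro paragraph</p>
  <h1>Second heading</h1>
')

# html_element() returns only the first match
first_heading <- demo_page |> html_element("h1") |> html_text(trim = TRUE)

# html_elements() returns every match as a character vector
all_headings <- demo_page |> html_elements("h1") |> html_text(trim = TRUE)

# A selector with no match yields NA rather than an error
missing_heading <- demo_page |> html_element("h2") |> html_text(trim = TRUE)
```

Getting `NA` for a missing element is convenient when the same selector is applied to many pages, some of which may lack the element.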
# get HTML link
```{r}
# Visible text of the first link inside a <strong> element
page |> html_element(css = "strong a") |> html_text(trim = TRUE)
```
```{r}
# Target URL (href attribute) of the same link
page |> html_element(css = "strong a") |> html_attr("href")
```
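To collect every link on a page rather than only the first, the text and the `href` attribute can be gathered side by side. This is a minimal sketch on made-up HTML. The base URL passed to `xml2::url_absolute()` is an assumption chosen for the demo, and it is only needed when a page uses relative links.

```{r}
library(rvest)

# Made-up page with one absolute and one relative link
demo_page <- minimal_html('
  <p><strong><a href="https://cran.r-project.org/">CRAN</a></strong></p>
  <p><a href="about.html">About</a></p>
')

links <- demo_page |> html_elements("a")

# One row per link: visible text and raw href attribute
link_data <- data.frame(
  text = html_text(links, trim = TRUE),
  href = html_attr(links, "href")
)

# Resolve relative links against a base URL (hypothetical base for this demo)
link_data$url <- xml2::url_absolute(link_data$href, base = "https://www.r-project.org/")
```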
# get table
```{r}
url <- "https://en.wikipedia.org/wiki/List_of_Formula_One_drivers"
page <- read_html(url)
```
## get 3rd table
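Find the table's XPath with the browser's developer tools (inspect the table element, then Copy > Copy XPath) and pass it to `html_element()`.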
```{r}
# Named `drivers` rather than `table` to avoid masking base::table()
drivers <- page |>
  html_element(xpath = '//*[@id="mw-content-text"]/div[1]/table[3]') |>
  html_table()
drivers |> head()
```
## get 4th table
The fourth table uses the same XPath with the index changed to `table[4]`.
```{r}
# Named `countries` rather than `table` to avoid masking base::table()
countries <- page |>
  html_element(xpath = '//*[@id="mw-content-text"]/div[1]/table[4]') |>
  html_table()
countries |> head()
```
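A positional XPath such as `table[3]` stops working if the page gains or loses a table, and scraped columns often arrive as text with footnote markers in their names. One possible alternative, sketched here on a made-up table, is to select tables by CSS class and clean the result afterwards. The class name `wikitable` and the sample rows are assumptions for illustration.

```{r}
library(rvest)

# Made-up table mimicking a header with a footnote marker
demo_page <- minimal_html('
  <table class="wikitable">
    <tr><th>Driver name</th><th>Race wins</th><th>Points[a]</th></tr>
    <tr><td>Driver A</td><td>3</td><td>47.5</td></tr>
    <tr><td>Driver B</td><td>0</td><td>1,047.5</td></tr>
  </table>
')

# html_elements() + html_table() returns a list with one tibble per table
tables <- demo_page |> html_elements("table.wikitable") |> html_table()
drivers <- tables[[1]]

# Drop footnote markers such as "[a]" from the column names
names(drivers) <- gsub("\\[[^]]*\\]", "", names(drivers))

# Convert a text column to numeric, removing thousands separators first
drivers$Points <- as.numeric(gsub(",", "", drivers$Points))
```

Selecting by class and then indexing into the list keeps the choice of table visible in the code, although the index can still shift if the page changes.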
# using read_html_live() for more advanced web scraping
```{r}
library(rvest)
library(tidyverse)
```
```{r}
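# read_html_live() loads the page in a headless browser, so content rendered by JavaScript is available
url <- "https://www.whiskybase.com/whiskies/"
web <- read_html_live(url)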
```
```{r}
url_rating <- "https://www.whiskybase.com/profile/georges"
content <- read_html_live(url_rating)
# Collect the label/value pairs shown on the profile page
info_title <- content |> html_elements(".title") |> html_text(trim = TRUE)
info_value <- content |> html_elements(".value") |> html_text(trim = TRUE)
# data.frame() requires both vectors to have the same length
info_data <- data.frame(info_title, info_value)
```
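Pairing two independent `html_elements()` results assumes every title has a matching value. If one is missing, `data.frame()` either fails or misaligns rows. A more defensive sketch is to select each containing item first and then take one child per item with `html_element()`. The `.stat` wrapper class and the markup below are hypothetical; the real Whiskybase page structure has not been checked here.

```{r}
library(rvest)

# Made-up markup: the ".stat" wrapper is hypothetical, and the second item
# deliberately has no ".value" element
demo_page <- minimal_html('
  <div class="stat"><span class="title">Ratings</span><span class="value">120</span></div>
  <div class="stat"><span class="title">Notes</span></div>
  <div class="stat"><span class="title">Member since</span><span class="value">2015</span></div>
')

items <- demo_page |> html_elements(".stat")

# html_element() (singular) returns exactly one result per item,
# with NA where the child is missing, so rows stay aligned
info_data <- data.frame(
  info_title = items |> html_element(".title") |> html_text(trim = TRUE),
  info_value = items |> html_element(".value") |> html_text(trim = TRUE)
)
```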
```{r}
#| eval: false
# Open the live page in a browser window to inspect what was loaded (interactive use only)
web$view()
```
```{r}
# Extract the text of the welcome article from the rendered page
web |> html_elements(".widget-article-content") |> html_text(trim = TRUE)
```
<br><br><br>
```{r, attr.output='.details summary="sessionInfo()"'}
sessionInfo()
```
# Reference
- [R for Data Science (2e), "Web scraping" chapter](https://r4ds.hadley.nz/webscraping.html)
- [rvest reference: `read_html_live()`](https://rvest.tidyverse.org/reference/read_html_live.html)