Code
library(tidyverse)
The tidyverse
is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. This guide focuses on dplyr
, tidyr
, stringr
, and lubridate
, the core packages for data manipulation.
dplyr
: A Grammar of Data Manipulationdplyr
provides a consistent and intuitive grammar for data manipulation. Its functions, or “verbs,” are easy to read and can be chained together using the pipe operator (%>%
), which passes the result of one function as the first argument to the next.
First, we load the tidyverse
suite of packages.
We will use the built-in mtcars
dataset. For clarity, we convert the row names to a proper column called car_name
.
# A tibble: 6 × 12
car_name mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 W… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 Dr… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Spor… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
The names()
function returns a character vector of the column names in a data frame.
select()
: Pick columns by nameselect()
allows you to subset your data by choosing specific columns.
# A tibble: 32 × 4
car_name mpg hp cyl
<chr> <dbl> <dbl> <dbl>
1 Mazda RX4 21 110 6
2 Mazda RX4 Wag 21 110 6
3 Datsun 710 22.8 93 4
4 Hornet 4 Drive 21.4 110 6
5 Hornet Sportabout 18.7 175 8
6 Valiant 18.1 105 6
7 Duster 360 14.3 245 8
8 Merc 240D 24.4 62 4
9 Merc 230 22.8 95 4
10 Merc 280 19.2 123 6
# ℹ 22 more rows
Use helper functions like starts_with()
, ends_with()
, and contains()
for powerful selections.
# A tibble: 32 × 2
disp drat
<dbl> <dbl>
1 160 3.9
2 160 3.9
3 108 3.85
4 258 3.08
5 360 3.15
6 225 2.76
7 360 3.21
8 147. 3.69
9 141. 3.92
10 168. 3.92
# ℹ 22 more rows
Use the -
sign to deselect or drop columns.
# A tibble: 32 × 10
car_name mpg cyl disp hp drat wt qsec gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 4 4
2 Mazda RX4 Wag 21 6 160 110 3.9 2.88 17.0 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 4 1
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.22 19.4 3 1
5 Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 4 4
# ℹ 22 more rows
You can also select columns by their position.
Use the -
sign to deselect or drop columns.
# A tibble: 32 × 11
car_name mpg disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 Wag 21 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 Drive 21.4 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Sportabout 18.7 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 360 14.3 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
The rename()
verb changes the name of a column.
# A tibble: 32 × 12
car_name mpg new_cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 7… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet S… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 3… 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
filter()
: Pick rows by conditionfilter()
subsets rows based on logical conditions. Only rows where the condition is TRUE
are kept.
# A tibble: 14 × 12
car_name mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
2 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
3 Merc 450SE 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
4 Merc 450SL 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
5 Merc 450SLC 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
6 Cadillac F… 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
7 Lincoln Co… 10.4 8 460 215 3 5.42 17.8 0 0 3 4
8 Chrysler I… 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
9 Dodge Chal… 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
10 AMC Javelin 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
11 Camaro Z28 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
12 Pontiac Fi… 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2
13 Ford Pante… 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
14 Maserati B… 15 8 301 335 3.54 3.57 14.6 0 1 5 8
Combine conditions with logical operators: - ,
or &
for AND - |
for OR - !
for NOT
# A tibble: 7 × 12
car_name mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
2 Cadillac Fl… 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
3 Lincoln Con… 10.4 8 460 215 3 5.42 17.8 0 0 3 4
4 Chrysler Im… 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
5 Camaro Z28 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
6 Ford Panter… 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
7 Maserati Bo… 15 8 301 335 3.54 3.57 14.6 0 1 5 8
# A tibble: 14 × 12
car_name mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
2 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
3 Merc 450SE 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
4 Merc 450SL 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
5 Merc 450SLC 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
6 Cadillac F… 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
7 Lincoln Co… 10.4 8 460 215 3 5.42 17.8 0 0 3 4
8 Chrysler I… 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
9 Dodge Chal… 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
10 AMC Javelin 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
11 Camaro Z28 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
12 Pontiac Fi… 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2
13 Ford Pante… 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
14 Maserati B… 15 8 301 335 3.54 3.57 14.6 0 1 5 8
arrange()
: Reorder rowsarrange()
sorts the rows of a data frame by one or more columns.
# A tibble: 32 × 12
car_name mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Cadillac F… 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
2 Lincoln Co… 10.4 8 460 215 3 5.42 17.8 0 0 3 4
3 Camaro Z28 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
4 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
5 Chrysler I… 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
6 Maserati B… 15 8 301 335 3.54 3.57 14.6 0 1 5 8
7 Merc 450SLC 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
8 AMC Javelin 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
9 Dodge Chal… 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
10 Ford Pante… 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
# ℹ 22 more rows
Use desc()
to sort in descending order.
# A tibble: 32 × 12
car_name mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Pontiac Fi… 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2
2 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
3 Merc 450SL 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
4 Merc 450SE 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
5 Ford Pante… 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
6 Dodge Chal… 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
7 Merc 450SLC 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
8 AMC Javelin 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
9 Maserati B… 15 8 301 335 3.54 3.57 14.6 0 1 5 8
10 Chrysler I… 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
# ℹ 22 more rows
mutate()
: Create new columnsmutate()
adds new columns or transforms existing ones, while keeping all existing columns.
# A tibble: 32 × 13
car_name mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
# ℹ 1 more variable: hp_per_wt <dbl>
Use case_when()
for complex conditional logic, which is like a more powerful version of an if-else statement.
# A tibble: 32 × 13
car_name mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
# ℹ 1 more variable: cyl_group <chr>
summarise()
and group_by()
: Collapse rows to a summaryThis is one of the most powerful combinations in dplyr
. group_by()
groups the data by one or more variables. Subsequent operations are then performed on each group independently. summarise()
then collapses each group into a single-row summary.
# A tibble: 3 × 6
cyl avg_mpg median_hp min_wt max_wt car_count
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 4 26.7 91 1.51 3.19 11
2 6 19.7 110 2.62 3.46 7
3 8 15.1 192. 3.17 5.42 14
dplyr
provides a family of join functions to combine data from different tables.
inner_join()
: Returns only the rows where the key exists in both tables.left_join()
: Returns all rows from the left table, and matching rows from the right table.full_join()
: Returns all rows from both tables.anti_join()
: Returns all rows from the left table that do not have a match in the right table.# A tibble: 3 × 3
name band plays
<chr> <chr> <chr>
1 Mick Stones vocals
2 John Beatles guitar
3 Paul Beatles <NA>
# A tibble: 1 × 2
name band
<chr> <chr>
1 Paul Beatles
tidyr
: Tidy Your Datatidyr
provides tools for reshaping data. The goal is to create “tidy” data, which has a specific structure: 1. Every column is a variable. 2. Every row is an observation. 3. Every cell is a single value.
pivot_longer()
: Wide to Longpivot_longer()
makes data “longer” by gathering multiple columns into key-value pairs. This is often the first step to making data “tidy.”
# A tibble: 6 × 4
subject sex condition measurement
<int> <chr> <chr> <dbl>
1 1 M control 7.9
2 1 M cond1 12.3
3 1 M cond2 10.7
4 2 F control 6.3
5 2 F cond1 10.6
6 2 F cond2 11.1
pivot_wider()
: Long to Widepivot_wider()
does the opposite, making data “wider” by spreading a key-value pair into multiple columns.
stringr
: String Manipulationstringr
provides a consistent and user-friendly interface for common string operations, built on top of the stringi
package.
str_detect()
: Check for the presence of a pattern.str_replace()
: Replace the first match of a pattern.str_extract()
: Extract the first match of a pattern.str_split()
: Split a string into pieces.[1] TRUE FALSE FALSE TRUE
[1] "12345"
[[1]]
[1] "a" "b" "c"
lubridate
: Date and Time Manipulationlubridate
simplifies working with dates and times in R, which can otherwise be quite complex.
today:
ymd()
, mdy()
, dmy()
parse strings into dates reliably, automatically handling different separators.year()
, month()
, day()
, wday()
extract parts of a date.duration
and period
objects.[1] Friday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
[1] "2023-11-11"
[1] 66
Window functions are functions that operate on a “window” of data (e.g., within a group) but, unlike summarise()
, they return a value for every row.
row_number
row_number()
assigns a unique rank to each row within a group.
lag
and lead
lag()
and lead()
are useful for comparing a value to its predecessor or successor.
cumsum()
calculates the cumulative sum.
arrange()
sorts the rows of a data frame by one or more columns.
car_name cyl mpg hp
1 Datsun 710 4 22.8 93
2 Mazda RX4 6 21.0 110
3 Mazda RX4 Wag 6 21.0 110
4 Hornet 4 Drive 6 21.4 110
5 Valiant 6 18.1 105
6 Hornet Sportabout 8 18.7 175
Sort in descending order using desc()
.
Let’s create a sample data frame with missing values (NA
).
is.na()
returns a logical vector indicating which values are missing.
You can use filter()
with !is.na()
to remove rows with missing values in a specific column.
tidyr::drop_na()
removes rows with any missing values.
car_name cyl mpg hp
[1,] 4 6 21.0 110
[2,] 5 6 21.0 110
[3,] 1 4 22.8 93
[4,] 2 6 21.4 110
[5,] 3 8 18.7 175
[6,] 6 6 18.1 105
$car_name
[1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
[4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
$cyl
[1] 6 6 4 6 8 6
$mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1
$hp
[1] 110 110 93 110 175 105
---
title: "Data Manipulation with tidyverse"
execute:
warning: false
error: false
format:
html:
toc: true
toc-location: right
code-fold: show
code-tools: true
number-sections: false
code-block-bg: true
code-block-border-left: "#31BAE9"
---
The `tidyverse` is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. This guide focuses on `dplyr`, `tidyr`, `stringr`, and `lubridate`, the core packages for data manipulation.
{width="600"}
# 1. `dplyr`: A Grammar of Data Manipulation
`dplyr` provides a consistent and intuitive grammar for data manipulation. Its functions, or "verbs," are easy to read and can be chained together using the pipe operator (`%>%`), which passes the result of one function as the first argument to the next.
{width="231"}
## Load Packages and Data
First, we load the `tidyverse` suite of packages.
```{r}
library(tidyverse)
```
We will use the built-in `mtcars` dataset. For clarity, we convert the row names to a proper column called `car_name`.
```{r}
data(mtcars)
small_mtcars <- mtcars %>%
select(cyl, mpg, hp) %>%
head()
small_mtcars <- rownames_to_column(small_mtcars, var = "car_name")
mtcars_df <- mtcars %>%
rownames_to_column(var = "car_name") %>%
as_tibble() # Convert to a tibble for better printing
head(mtcars_df)
```
## Get Column Names
The `names()` function returns a character vector of the column names in a data frame.
```{r}
names(mtcars_df)
```
## `select()`: Pick columns by name
`select()` allows you to subset your data by choosing specific columns.
```{r}
# Select a few specific columns
mtcars_df %>% select(car_name, mpg, hp, cyl)
```
Use helper functions like `starts_with()`, `ends_with()`, and `contains()` for powerful selections.
```{r}
# Select all columns that start with the letter "d"
mtcars_df %>% select(starts_with("d"))
```
Use the `-` sign to deselect or drop columns.
```{r}
# Select all columns except `vs` and `am`
mtcars_df %>% select(-vs, -am)
```
## Select by Index
You can also select columns by their position.
```{r}
mtcars_df %>% select(1, 2)
```
## Drop Columns
Use the `-` sign to deselect or drop columns.
```{r}
mtcars_df %>% select(-cyl)
```
## Rename Columns
The `rename()` verb changes the name of a column.
```{r}
mtcars_df %>% rename(new_cyl = cyl)
```
## `filter()`: Pick rows by condition
`filter()` subsets rows based on logical conditions. Only rows where the condition is `TRUE` are kept.
```{r}
# Filter for cars with 8 cylinders
mtcars_df %>% filter(cyl == 8)
```
Combine conditions with logical operators:
- `,` or `&` for AND
- `|` for OR
- `!` for NOT
```{r}
# Filter for 8-cylinder cars with more than 200 horsepower
mtcars_df %>% filter(cyl == 8 & hp > 200)
# Filter for cars that are either 8-cylinder OR have more than 300 horsepower
mtcars_df %>% filter(cyl == 8 | hp > 300)
```
## `arrange()`: Reorder rows
`arrange()` sorts the rows of a data frame by one or more columns.
```{r}
# Sort cars by miles per gallon (mpg) in ascending order (the default)
mtcars_df %>% arrange(mpg)
```
Use `desc()` to sort in descending order.
```{r}
# Sort by cylinder count (desc) and then by mpg (desc) for tie-breaking
mtcars_df %>% arrange(desc(cyl), desc(mpg))
```
## `mutate()`: Create new columns
`mutate()` adds new columns or transforms existing ones, while keeping all existing columns.
```{r}
# Create a new column for horsepower-to-weight ratio
mtcars_df %>% mutate(hp_per_wt = hp / wt)
```
Use `case_when()` for complex conditional logic, which is like a more powerful version of an if-else statement.
```{r}
# Create a column for cylinder category
mtcars_df %>% mutate(cyl_group = case_when(
cyl == 4 ~ "Four-Cylinder",
cyl == 6 ~ "Six-Cylinder",
cyl == 8 ~ "Eight-Cylinder",
TRUE ~ "Other" # Fallback for any other case
))
```
## `summarise()` and `group_by()`: Collapse rows to a summary
This is one of the most powerful combinations in `dplyr`. `group_by()` groups the data by one or more variables. Subsequent operations are then performed on each group independently. `summarise()` then collapses each group into a single-row summary.
```{r}
# Calculate summary statistics for each cylinder group
mtcars_df %>%
group_by(cyl) %>%
summarise(
avg_mpg = mean(mpg, na.rm = TRUE),
median_hp = median(hp, na.rm = TRUE),
min_wt = min(wt, na.rm = TRUE),
max_wt = max(wt, na.rm = TRUE),
car_count = n() # n() counts the number of rows in the group
)
```
## Joining Tables
`dplyr` provides a family of join functions to combine data from different tables.
```{r}
band_members_df <- tribble(
~name, ~band,
"Mick", "Stones",
"John", "Beatles",
"Paul", "Beatles"
)
band_instruments_df <- tribble(
~name, ~plays,
"Mick", "vocals",
"John", "guitar",
"Keith", "guitar"
)
```
- `inner_join()`: Returns only the rows where the key exists in both tables.
- `left_join()`: Returns all rows from the left table, and matching rows from the right table.
- `full_join()`: Returns all rows from both tables.
- `anti_join()`: Returns all rows from the left table that do not have a match in the right table.
```{r}
# Left join to keep all members and see who plays an instrument
left_join(band_members_df, band_instruments_df, by = "name")
# Anti join to find which members are not in the instruments table
anti_join(band_members_df, band_instruments_df, by = "name")
```
# 2. `tidyr`: Tidy Your Data
`tidyr` provides tools for reshaping data. The goal is to create "tidy" data, which has a specific structure:
1. Every column is a variable.
2. Every row is an observation.
3. Every cell is a single value.
```{r}
# Create a sample wide-format (untidy) dataset
wide_data <- read.table(header = TRUE, text = '
subject sex control cond1 cond2
1 M 7.9 12.3 10.7
2 F 6.3 10.6 11.1
')
```
### `pivot_longer()`: Wide to Long
`pivot_longer()` makes data "longer" by gathering multiple columns into key-value pairs. This is often the first step to making data "tidy."
```{r}
long_data <- wide_data %>%
pivot_longer(
cols = c(control, cond1, cond2), # The columns to pivot
names_to = 'condition', # New column for the old column names
values_to = 'measurement' # New column for the old cell values
)
long_data
```
### `pivot_wider()`: Long to Wide
`pivot_wider()` does the opposite, making data "wider" by spreading a key-value pair into multiple columns.
```{r}
long_data %>%
pivot_wider(
names_from = condition, # Column to get new column names from
values_from = measurement # Column to get cell values from
)
```
# 3. `stringr`: String Manipulation
`stringr` provides a consistent and user-friendly interface for common string operations, built on top of the `stringi` package.

```{r}
text_vector <- c("apple", "banana", "pear", "pineapple")
```
- `str_detect()`: Check for the presence of a pattern.
- `str_replace()`: Replace the first match of a pattern.
- `str_extract()`: Extract the first match of a pattern.
- `str_split()`: Split a string into pieces.
```{r}
# Find elements containing "apple"
str_detect(text_vector, "apple")
# Extract numbers from a string using a regular expression
# \d+ is a regex for one or more digits
str_extract("Order_ID_12345", "\\d+")
# Split a string by a delimiter
str_split("a-b-c", "-")
```
# 4. `lubridate`: Date and Time Manipulation
`lubridate` simplifies working with dates and times in R, which can otherwise be quite complex.

```{r}
library(lubridate)
```
today:
```{r}
today()
```
- **Parsing:** `ymd()`, `mdy()`, `dmy()` parse strings into dates reliably, automatically handling different separators.
- **Component Extraction:** `year()`, `month()`, `day()`, `wday()` extract parts of a date.
- **Arithmetic:** Perform calculations with date and time objects using `duration` and `period` objects.
```{r}
# Parse a date
date_obj <- ymd("2023-10-27")
# Get the day of the week
wday(date_obj, label = TRUE, abbr = FALSE)
# Add 15 days to the date
date_obj + days(15)
# Calculate the time difference between two dates
new_year <- ymd("2024-01-01")
interval(date_obj, new_year) / days(1)
```
# 4. Window Functions
Window functions are functions that operate on a "window" of data (e.g., within a group) but, unlike `summarise()`, they return a value for every row.
## Ranking with `row_number`
`row_number()` assigns a unique rank to each row within a group.
```{r}
small_mtcars %>%
group_by(cyl) %>%
mutate(rank = row_number(desc(mpg))) %>%
select(cyl, mpg, rank)
```
## Accessing Previous/Next Values with `lag` and `lead`
`lag()` and `lead()` are useful for comparing a value to its predecessor or successor.
```{r}
small_mtcars %>%
select(cyl, mpg) %>%
mutate(mpg_previous = lag(mpg, n = 1))
```
## Cumulative Summaries
`cumsum()` calculates the cumulative sum.
```{r}
small_mtcars %>%
select(cyl, mpg) %>%
mutate(mpg_running_total = cumsum(mpg))
```
## Order Rows
`arrange()` sorts the rows of a data frame by one or more columns.
```{r}
small_mtcars %>% arrange(cyl)
```
Sort in descending order using `desc()`.
```{r}
small_mtcars %>% arrange(desc(cyl))
```
# 5. Handle Missing Data
Let's create a sample data frame with missing values (`NA`).
```{r}
missing_df <- data.frame(
x = c(1, 2, NA, 4),
y = c("a", NA, "c", "d")
)
```
## Find Missing Data
`is.na()` returns a logical vector indicating which values are missing.
```{r}
is.na(missing_df)
```
## Filter Out Missing Data
You can use `filter()` with `!is.na()` to remove rows with missing values in a specific column.
```{r}
missing_df %>% filter(!is.na(y))
```
`tidyr::drop_na()` removes rows with any missing values.
```{r}
missing_df %>% drop_na()
```
# 6. dataframe to other data format
## dataframe to vector
```{r}
data=small_mtcars$cyl
data
```
```{r}
class(data)
```
## dataframe to matrix
```{r}
data=data.matrix(small_mtcars)
data
```
```{r}
class(data)
```
## dataframe to list
```{r}
data=as.list(small_mtcars)
data
```
```{r}
class(data)
```
# 5. References
- [R for Data Science, 2nd Edition](https://r4ds.hadley.nz/)
- [dplyr Documentation](https://dplyr.tidyverse.org/)
- [tidyr Documentation](https://tidyr.tidyverse.org/)
- [stringr Documentation](https://stringr.tidyverse.org/)
- [lubridate Documentation](https://lubridate.tidyverse.org/)