Code
library(tidyverse)The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. This guide focuses on dplyr, tidyr, stringr, and lubridate, the core packages for data manipulation.

dplyr: A Grammar of Data Manipulationdplyr provides a consistent and intuitive grammar for data manipulation. Its functions, or “verbs,” are easy to read and can be chained together using the pipe operator (%>%), which passes the result of one function as the first argument to the next.

First, we load the tidyverse suite of packages.
We will use the built-in mtcars dataset. For clarity, we convert the row names to a proper column called car_name.
# A tibble: 6 × 12
car_name mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 W… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 Dr… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Spor… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
The names() function returns a character vector of the column names in a data frame.
select(): Pick columns by nameselect() allows you to subset your data by choosing specific columns.
# A tibble: 32 × 4
car_name mpg hp cyl
<chr> <dbl> <dbl> <dbl>
1 Mazda RX4 21 110 6
2 Mazda RX4 Wag 21 110 6
3 Datsun 710 22.8 93 4
4 Hornet 4 Drive 21.4 110 6
5 Hornet Sportabout 18.7 175 8
6 Valiant 18.1 105 6
7 Duster 360 14.3 245 8
8 Merc 240D 24.4 62 4
9 Merc 230 22.8 95 4
10 Merc 280 19.2 123 6
# ℹ 22 more rows
Use helper functions like starts_with(), ends_with(), and contains() for powerful selections.
# A tibble: 32 × 2
disp drat
<dbl> <dbl>
1 160 3.9
2 160 3.9
3 108 3.85
4 258 3.08
5 360 3.15
6 225 2.76
7 360 3.21
8 147. 3.69
9 141. 3.92
10 168. 3.92
# ℹ 22 more rows
Use the - sign to deselect or drop columns.
# A tibble: 32 × 10
car_name mpg cyl disp hp drat wt qsec gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 4 4
2 Mazda RX4 Wag 21 6 160 110 3.9 2.88 17.0 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 4 1
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.22 19.4 3 1
5 Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 4 4
# ℹ 22 more rows
You can also select columns by their position.
Use the - sign to deselect or drop columns.
# A tibble: 32 × 11
car_name mpg disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 Wag 21 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 Drive 21.4 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Sportabout 18.7 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 360 14.3 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
The rename() verb changes the name of a column.
# A tibble: 32 × 12
car_name mpg new_cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 7… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet S… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 3… 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
filter(): Pick rows by conditionfilter() subsets rows based on logical conditions. Only rows where the condition is TRUE are kept.
# A tibble: 14 × 12
car_name mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
2 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
3 Merc 450SE 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
4 Merc 450SL 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
5 Merc 450SLC 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
6 Cadillac F… 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
7 Lincoln Co… 10.4 8 460 215 3 5.42 17.8 0 0 3 4
8 Chrysler I… 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
9 Dodge Chal… 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
10 AMC Javelin 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
11 Camaro Z28 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
12 Pontiac Fi… 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2
13 Ford Pante… 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
14 Maserati B… 15 8 301 335 3.54 3.57 14.6 0 1 5 8
Combine conditions with logical operators: - , or & for AND - | for OR - ! for NOT
# A tibble: 7 × 12
car_name mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
2 Cadillac Fl… 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
3 Lincoln Con… 10.4 8 460 215 3 5.42 17.8 0 0 3 4
4 Chrysler Im… 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
5 Camaro Z28 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
6 Ford Panter… 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
7 Maserati Bo… 15 8 301 335 3.54 3.57 14.6 0 1 5 8
# A tibble: 14 × 12
car_name mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
2 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
3 Merc 450SE 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
4 Merc 450SL 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
5 Merc 450SLC 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
6 Cadillac F… 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
7 Lincoln Co… 10.4 8 460 215 3 5.42 17.8 0 0 3 4
8 Chrysler I… 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
9 Dodge Chal… 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
10 AMC Javelin 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
11 Camaro Z28 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
12 Pontiac Fi… 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2
13 Ford Pante… 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
14 Maserati B… 15 8 301 335 3.54 3.57 14.6 0 1 5 8
arrange(): Reorder rowsarrange() sorts the rows of a data frame by one or more columns.
# A tibble: 32 × 12
car_name mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Cadillac F… 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
2 Lincoln Co… 10.4 8 460 215 3 5.42 17.8 0 0 3 4
3 Camaro Z28 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
4 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
5 Chrysler I… 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
6 Maserati B… 15 8 301 335 3.54 3.57 14.6 0 1 5 8
7 Merc 450SLC 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
8 AMC Javelin 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
9 Dodge Chal… 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
10 Ford Pante… 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
# ℹ 22 more rows
Use desc() to sort in descending order.
# A tibble: 32 × 12
car_name mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Pontiac Fi… 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2
2 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
3 Merc 450SL 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
4 Merc 450SE 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
5 Ford Pante… 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
6 Dodge Chal… 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
7 Merc 450SLC 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
8 AMC Javelin 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
9 Maserati B… 15 8 301 335 3.54 3.57 14.6 0 1 5 8
10 Chrysler I… 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
# ℹ 22 more rows
mutate(): Create new columnsmutate() adds new columns or transforms existing ones, while keeping all existing columns.
# A tibble: 32 × 13
car_name mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
# ℹ 1 more variable: hp_per_wt <dbl>
Use case_when() for complex conditional logic, which is like a more powerful version of an if-else statement.
# A tibble: 32 × 13
car_name mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
# ℹ 1 more variable: cyl_group <chr>
summarise() and group_by(): Collapse rows to a summaryThis is one of the most powerful combinations in dplyr. group_by() groups the data by one or more variables. Subsequent operations are then performed on each group independently. summarise() then collapses each group into a single-row summary.
# A tibble: 3 × 6
cyl avg_mpg median_hp min_wt max_wt car_count
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 4 26.7 91 1.51 3.19 11
2 6 19.7 110 2.62 3.46 7
3 8 15.1 192. 3.17 5.42 14
dplyr provides a family of join functions to combine data from different tables.
inner_join(): Returns only the rows where the key exists in both tables.left_join(): Returns all rows from the left table, and matching rows from the right table.full_join(): Returns all rows from both tables.anti_join(): Returns all rows from the left table that do not have a match in the right table.# A tibble: 3 × 3
name band plays
<chr> <chr> <chr>
1 Mick Stones vocals
2 John Beatles guitar
3 Paul Beatles <NA>
# A tibble: 1 × 2
name band
<chr> <chr>
1 Paul Beatles
tidyr: Tidy Your Datatidyr provides tools for reshaping data. The goal is to create “tidy” data, which has a specific structure: 1. Every column is a variable. 2. Every row is an observation. 3. Every cell is a single value.
pivot_longer(): Wide to Longpivot_longer() makes data “longer” by gathering multiple columns into key-value pairs. This is often the first step to making data “tidy.”
# A tibble: 6 × 4
subject sex condition measurement
<int> <chr> <chr> <dbl>
1 1 M control 7.9
2 1 M cond1 12.3
3 1 M cond2 10.7
4 2 F control 6.3
5 2 F cond1 10.6
6 2 F cond2 11.1
pivot_wider(): Long to Widepivot_wider() does the opposite, making data “wider” by spreading a key-value pair into multiple columns.
stringr: String Manipulationstringr provides a consistent and user-friendly interface for common string operations, built on top of the stringi package.

str_detect(): Check for the presence of a pattern.str_replace(): Replace the first match of a pattern.str_extract(): Extract the first match of a pattern.str_split(): Split a string into pieces.[1] TRUE FALSE FALSE TRUE
[1] "12345"
[[1]]
[1] "a" "b" "c"
lubridate: Date and Time Manipulationlubridate simplifies working with dates and times in R, which can otherwise be quite complex.

today:
ymd(), mdy(), dmy() parse strings into dates reliably, automatically handling different separators.year(), month(), day(), wday() extract parts of a date.duration and period objects.[1] Friday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
[1] "2023-11-11"
[1] 66
Window functions are functions that operate on a “window” of data (e.g., within a group) but, unlike summarise(), they return a value for every row.
row_numberrow_number() assigns a unique rank to each row within a group.
lag and leadlag() and lead() are useful for comparing a value to its predecessor or successor.
cumsum() calculates the cumulative sum.
arrange() sorts the rows of a data frame by one or more columns.
car_name cyl mpg hp
1 Datsun 710 4 22.8 93
2 Mazda RX4 6 21.0 110
3 Mazda RX4 Wag 6 21.0 110
4 Hornet 4 Drive 6 21.4 110
5 Valiant 6 18.1 105
6 Hornet Sportabout 8 18.7 175
Sort in descending order using desc().
Let’s create a sample data frame with missing values (NA).
is.na() returns a logical vector indicating which values are missing.
You can use filter() with !is.na() to remove rows with missing values in a specific column.
tidyr::drop_na() removes rows with any missing values.
car_name cyl mpg hp
[1,] 4 6 21.0 110
[2,] 5 6 21.0 110
[3,] 1 4 22.8 93
[4,] 2 6 21.4 110
[5,] 3 8 18.7 175
[6,] 6 6 18.1 105
$car_name
[1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
[4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
$cyl
[1] 6 6 4 6 8 6
$mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1
$hp
[1] 110 110 93 110 175 105
---
title: "Data Manipulation with tidyverse"
execute:
warning: false
error: false
format:
html:
toc: true
toc-location: right
code-fold: show
code-tools: true
number-sections: false
code-block-bg: true
code-block-border-left: "#31BAE9"
---
The `tidyverse` is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. This guide focuses on `dplyr`, `tidyr`, `stringr`, and `lubridate`, the core packages for data manipulation.
{width="600"}
# 1. `dplyr`: A Grammar of Data Manipulation
`dplyr` provides a consistent and intuitive grammar for data manipulation. Its functions, or "verbs," are easy to read and can be chained together using the pipe operator (`%>%`), which passes the result of one function as the first argument to the next.
{width="231"}
## Load Packages and Data
First, we load the `tidyverse` suite of packages.
```{r}
library(tidyverse)
```
We will use the built-in `mtcars` dataset. For clarity, we convert the row names to a proper column called `car_name`.
```{r}
data(mtcars)
small_mtcars <- mtcars %>%
select(cyl, mpg, hp) %>%
head()
small_mtcars <- rownames_to_column(small_mtcars, var = "car_name")
mtcars_df <- mtcars %>%
rownames_to_column(var = "car_name") %>%
as_tibble() # Convert to a tibble for better printing
head(mtcars_df)
```
## Get Column Names
The `names()` function returns a character vector of the column names in a data frame.
```{r}
names(mtcars_df)
```
## `select()`: Pick columns by name
`select()` allows you to subset your data by choosing specific columns.
```{r}
# Select a few specific columns
mtcars_df %>% select(car_name, mpg, hp, cyl)
```
Use helper functions like `starts_with()`, `ends_with()`, and `contains()` for powerful selections.
```{r}
# Select all columns that start with the letter "d"
mtcars_df %>% select(starts_with("d"))
```
Use the `-` sign to deselect or drop columns.
```{r}
# Select all columns except `vs` and `am`
mtcars_df %>% select(-vs, -am)
```
## Select by Index
You can also select columns by their position.
```{r}
mtcars_df %>% select(1, 2)
```
## Drop Columns
Use the `-` sign to deselect or drop columns.
```{r}
mtcars_df %>% select(-cyl)
```
## Rename Columns
The `rename()` verb changes the name of a column.
```{r}
mtcars_df %>% rename(new_cyl = cyl)
```
## `filter()`: Pick rows by condition
`filter()` subsets rows based on logical conditions. Only rows where the condition is `TRUE` are kept.
```{r}
# Filter for cars with 8 cylinders
mtcars_df %>% filter(cyl == 8)
```
Combine conditions with logical operators:
- `,` or `&` for AND
- `|` for OR
- `!` for NOT
```{r}
# Filter for 8-cylinder cars with more than 200 horsepower
mtcars_df %>% filter(cyl == 8 & hp > 200)
# Filter for cars that are either 8-cylinder OR have more than 300 horsepower
mtcars_df %>% filter(cyl == 8 | hp > 300)
```
## `arrange()`: Reorder rows
`arrange()` sorts the rows of a data frame by one or more columns.
```{r}
# Sort cars by miles per gallon (mpg) in ascending order (the default)
mtcars_df %>% arrange(mpg)
```
Use `desc()` to sort in descending order.
```{r}
# Sort by cylinder count (desc) and then by mpg (desc) for tie-breaking
mtcars_df %>% arrange(desc(cyl), desc(mpg))
```
## `mutate()`: Create new columns
`mutate()` adds new columns or transforms existing ones, while keeping all existing columns.
```{r}
# Create a new column for horsepower-to-weight ratio
mtcars_df %>% mutate(hp_per_wt = hp / wt)
```
Use `case_when()` for complex conditional logic, which is like a more powerful version of an if-else statement.
```{r}
# Create a column for cylinder category
mtcars_df %>% mutate(cyl_group = case_when(
cyl == 4 ~ "Four-Cylinder",
cyl == 6 ~ "Six-Cylinder",
cyl == 8 ~ "Eight-Cylinder",
TRUE ~ "Other" # Fallback for any other case
))
```
## `summarise()` and `group_by()`: Collapse rows to a summary
This is one of the most powerful combinations in `dplyr`. `group_by()` groups the data by one or more variables. Subsequent operations are then performed on each group independently. `summarise()` then collapses each group into a single-row summary.
```{r}
# Calculate summary statistics for each cylinder group
mtcars_df %>%
group_by(cyl) %>%
summarise(
avg_mpg = mean(mpg, na.rm = TRUE),
median_hp = median(hp, na.rm = TRUE),
min_wt = min(wt, na.rm = TRUE),
max_wt = max(wt, na.rm = TRUE),
car_count = n() # n() counts the number of rows in the group
)
```
## Joining Tables
`dplyr` provides a family of join functions to combine data from different tables.
```{r}
band_members_df <- tribble(
~name, ~band,
"Mick", "Stones",
"John", "Beatles",
"Paul", "Beatles"
)
band_instruments_df <- tribble(
~name, ~plays,
"Mick", "vocals",
"John", "guitar",
"Keith", "guitar"
)
```
- `inner_join()`: Returns only the rows where the key exists in both tables.
- `left_join()`: Returns all rows from the left table, and matching rows from the right table.
- `full_join()`: Returns all rows from both tables.
- `anti_join()`: Returns all rows from the left table that do not have a match in the right table.
```{r}
# Left join to keep all members and see who plays an instrument
left_join(band_members_df, band_instruments_df, by = "name")
# Anti join to find which members are not in the instruments table
anti_join(band_members_df, band_instruments_df, by = "name")
```
# 2. `tidyr`: Tidy Your Data
`tidyr` provides tools for reshaping data. The goal is to create "tidy" data, which has a specific structure:
1. Every column is a variable.
2. Every row is an observation.
3. Every cell is a single value.
```{r}
# Create a sample wide-format (untidy) dataset
wide_data <- read.table(header = TRUE, text = '
subject sex control cond1 cond2
1 M 7.9 12.3 10.7
2 F 6.3 10.6 11.1
')
```
### `pivot_longer()`: Wide to Long
`pivot_longer()` makes data "longer" by gathering multiple columns into key-value pairs. This is often the first step to making data "tidy."
```{r}
long_data <- wide_data %>%
pivot_longer(
cols = c(control, cond1, cond2), # The columns to pivot
names_to = 'condition', # New column for the old column names
values_to = 'measurement' # New column for the old cell values
)
long_data
```
### `pivot_wider()`: Long to Wide
`pivot_wider()` does the opposite, making data "wider" by spreading a key-value pair into multiple columns.
```{r}
long_data %>%
pivot_wider(
names_from = condition, # Column to get new column names from
values_from = measurement # Column to get cell values from
)
```
# 3. `stringr`: String Manipulation
`stringr` provides a consistent and user-friendly interface for common string operations, built on top of the `stringi` package.

```{r}
text_vector <- c("apple", "banana", "pear", "pineapple")
```
- `str_detect()`: Check for the presence of a pattern.
- `str_replace()`: Replace the first match of a pattern.
- `str_extract()`: Extract the first match of a pattern.
- `str_split()`: Split a string into pieces.
```{r}
# Find elements containing "apple"
str_detect(text_vector, "apple")
# Extract numbers from a string using a regular expression
# \d+ is a regex for one or more digits
str_extract("Order_ID_12345", "\\d+")
# Split a string by a delimiter
str_split("a-b-c", "-")
```
# 4. `lubridate`: Date and Time Manipulation
`lubridate` simplifies working with dates and times in R, which can otherwise be quite complex.

```{r}
library(lubridate)
```
today:
```{r}
today()
```
- **Parsing:** `ymd()`, `mdy()`, `dmy()` parse strings into dates reliably, automatically handling different separators.
- **Component Extraction:** `year()`, `month()`, `day()`, `wday()` extract parts of a date.
- **Arithmetic:** Perform calculations with date and time objects using `duration` and `period` objects.
```{r}
# Parse a date
date_obj <- ymd("2023-10-27")
# Get the day of the week
wday(date_obj, label = TRUE, abbr = FALSE)
# Add 15 days to the date
date_obj + days(15)
# Calculate the time difference between two dates
new_year <- ymd("2024-01-01")
interval(date_obj, new_year) / days(1)
```
# 4. Window Functions
Window functions are functions that operate on a "window" of data (e.g., within a group) but, unlike `summarise()`, they return a value for every row.
## Ranking with `row_number`
`row_number()` assigns a unique rank to each row within a group.
```{r}
small_mtcars %>%
group_by(cyl) %>%
mutate(rank = row_number(desc(mpg))) %>%
select(cyl, mpg, rank)
```
## Accessing Previous/Next Values with `lag` and `lead`
`lag()` and `lead()` are useful for comparing a value to its predecessor or successor.
```{r}
small_mtcars %>%
select(cyl, mpg) %>%
mutate(mpg_previous = lag(mpg, n = 1))
```
## Cumulative Summaries
`cumsum()` calculates the cumulative sum.
```{r}
small_mtcars %>%
select(cyl, mpg) %>%
mutate(mpg_running_total = cumsum(mpg))
```
## Order Rows
`arrange()` sorts the rows of a data frame by one or more columns.
```{r}
small_mtcars %>% arrange(cyl)
```
Sort in descending order using `desc()`.
```{r}
small_mtcars %>% arrange(desc(cyl))
```
# 5. Handle Missing Data
Let's create a sample data frame with missing values (`NA`).
```{r}
missing_df <- data.frame(
x = c(1, 2, NA, 4),
y = c("a", NA, "c", "d")
)
```
## Find Missing Data
`is.na()` returns a logical vector indicating which values are missing.
```{r}
is.na(missing_df)
```
## Filter Out Missing Data
You can use `filter()` with `!is.na()` to remove rows with missing values in a specific column.
```{r}
missing_df %>% filter(!is.na(y))
```
`tidyr::drop_na()` removes rows with any missing values.
```{r}
missing_df %>% drop_na()
```
# 6. dataframe to other data format
## dataframe to vector
```{r}
data=small_mtcars$cyl
data
```
```{r}
class(data)
```
## dataframe to matrix
```{r}
data=data.matrix(small_mtcars)
data
```
```{r}
class(data)
```
## dataframe to list
```{r}
data=as.list(small_mtcars)
data
```
```{r}
class(data)
```
# 5. References
- [R for Data Science, 2nd Edition](https://r4ds.hadley.nz/)
- [dplyr Documentation](https://dplyr.tidyverse.org/)
- [tidyr Documentation](https://tidyr.tidyverse.org/)
- [stringr Documentation](https://stringr.tidyverse.org/)
- [lubridate Documentation](https://lubridate.tidyverse.org/)