Data Manipulation with tidyverse

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. This guide focuses on dplyr, tidyr, stringr, and lubridate, the core packages for data manipulation.

1. `dplyr`: A Grammar of Data Manipulation

dplyr provides a consistent and intuitive grammar for data manipulation. Its functions, or “verbs,” are easy to read and can be chained together using the pipe operator (%>%), which passes the result of one function as the first argument to the next.

Load Packages and Data

First, we load the tidyverse suite of packages.

Code

library(tidyverse)

We will use the built-in mtcars dataset. For clarity, we convert the row names to a proper column called car_name.

Code

data(mtcars)

small_mtcars <- mtcars %>% 
  select(cyl, mpg, hp) %>% 
  head()

small_mtcars <- rownames_to_column(small_mtcars, var = "car_name")

mtcars_df <- mtcars %>% 
  rownames_to_column(var = "car_name") %>% 
  as_tibble() # Convert to a tibble for better printing

head(mtcars_df)

# A tibble: 6 × 12
  car_name       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4 Hornet 4 Dr…  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5 Hornet Spor…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
6 Valiant       18.1     6   225   105  2.76  3.46  20.2     1     0     3     1

Get Column Names

The names() function returns a character vector of the column names in a data frame.

Code

names(mtcars_df)

 [1] "car_name" "mpg"      "cyl"      "disp"     "hp"       "drat"    
 [7] "wt"       "qsec"     "vs"       "am"       "gear"     "carb"

`select()`: Pick columns by name

select() allows you to subset your data by choosing specific columns.

Code

# Select a few specific columns
mtcars_df %>% select(car_name, mpg, hp, cyl)

# A tibble: 32 × 4
   car_name            mpg    hp   cyl
   <chr>             <dbl> <dbl> <dbl>
 1 Mazda RX4          21     110     6
 2 Mazda RX4 Wag      21     110     6
 3 Datsun 710         22.8    93     4
 4 Hornet 4 Drive     21.4   110     6
 5 Hornet Sportabout  18.7   175     8
 6 Valiant            18.1   105     6
 7 Duster 360         14.3   245     8
 8 Merc 240D          24.4    62     4
 9 Merc 230           22.8    95     4
10 Merc 280           19.2   123     6
# ℹ 22 more rows

Use helper functions like starts_with(), ends_with(), and contains() for powerful selections.

Code

# Select all columns that start with the letter "d"
mtcars_df %>% select(starts_with("d"))

# A tibble: 32 × 2
    disp  drat
   <dbl> <dbl>
 1  160   3.9 
 2  160   3.9 
 3  108   3.85
 4  258   3.08
 5  360   3.15
 6  225   2.76
 7  360   3.21
 8  147.  3.69
 9  141.  3.92
10  168.  3.92
# ℹ 22 more rows

Use the - sign to deselect or drop columns.

Code

# Select all columns except `vs` and `am`
mtcars_df %>% select(-vs, -am)

# A tibble: 32 × 10
   car_name            mpg   cyl  disp    hp  drat    wt  qsec  gear  carb
   <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4          21       6  160    110  3.9   2.62  16.5     4     4
 2 Mazda RX4 Wag      21       6  160    110  3.9   2.88  17.0     4     4
 3 Datsun 710         22.8     4  108     93  3.85  2.32  18.6     4     1
 4 Hornet 4 Drive     21.4     6  258    110  3.08  3.22  19.4     3     1
 5 Hornet Sportabout  18.7     8  360    175  3.15  3.44  17.0     3     2
 6 Valiant            18.1     6  225    105  2.76  3.46  20.2     3     1
 7 Duster 360         14.3     8  360    245  3.21  3.57  15.8     3     4
 8 Merc 240D          24.4     4  147.    62  3.69  3.19  20       4     2
 9 Merc 230           22.8     4  141.    95  3.92  3.15  22.9     4     2
10 Merc 280           19.2     6  168.   123  3.92  3.44  18.3     4     4
# ℹ 22 more rows

Select by Index

You can also select columns by their position.

Code

mtcars_df %>% select(1, 2)

# A tibble: 32 × 2
   car_name            mpg
   <chr>             <dbl>
 1 Mazda RX4          21  
 2 Mazda RX4 Wag      21  
 3 Datsun 710         22.8
 4 Hornet 4 Drive     21.4
 5 Hornet Sportabout  18.7
 6 Valiant            18.1
 7 Duster 360         14.3
 8 Merc 240D          24.4
 9 Merc 230           22.8
10 Merc 280           19.2
# ℹ 22 more rows

Drop Columns

Use the - sign to deselect or drop columns.

Code

mtcars_df %>% select(-cyl)

# A tibble: 32 × 11
   car_name            mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4          21    160    110  3.9   2.62  16.5     0     1     4     4
 2 Mazda RX4 Wag      21    160    110  3.9   2.88  17.0     0     1     4     4
 3 Datsun 710         22.8  108     93  3.85  2.32  18.6     1     1     4     1
 4 Hornet 4 Drive     21.4  258    110  3.08  3.22  19.4     1     0     3     1
 5 Hornet Sportabout  18.7  360    175  3.15  3.44  17.0     0     0     3     2
 6 Valiant            18.1  225    105  2.76  3.46  20.2     1     0     3     1
 7 Duster 360         14.3  360    245  3.21  3.57  15.8     0     0     3     4
 8 Merc 240D          24.4  147.    62  3.69  3.19  20       1     0     4     2
 9 Merc 230           22.8  141.    95  3.92  3.15  22.9     1     0     4     2
10 Merc 280           19.2  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows

Rename Columns

The rename() verb changes the name of a column.

Code

mtcars_df %>% rename(new_cyl = cyl)

# A tibble: 32 × 12
   car_name    mpg new_cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>     <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4  21         6  160    110  3.9   2.62  16.5     0     1     4     4
 2 Mazda RX…  21         6  160    110  3.9   2.88  17.0     0     1     4     4
 3 Datsun 7…  22.8       4  108     93  3.85  2.32  18.6     1     1     4     1
 4 Hornet 4…  21.4       6  258    110  3.08  3.22  19.4     1     0     3     1
 5 Hornet S…  18.7       8  360    175  3.15  3.44  17.0     0     0     3     2
 6 Valiant    18.1       6  225    105  2.76  3.46  20.2     1     0     3     1
 7 Duster 3…  14.3       8  360    245  3.21  3.57  15.8     0     0     3     4
 8 Merc 240D  24.4       4  147.    62  3.69  3.19  20       1     0     4     2
 9 Merc 230   22.8       4  141.    95  3.92  3.15  22.9     1     0     4     2
10 Merc 280   19.2       6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows

`filter()`: Pick rows by condition

filter() subsets rows based on logical conditions. Only rows where the condition is TRUE are kept.

Code

# Filter for cars with 8 cylinders
mtcars_df %>% filter(cyl == 8)

# A tibble: 14 × 12
   car_name      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 2 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 3 Merc 450SE   16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
 4 Merc 450SL   17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
 5 Merc 450SLC  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
 6 Cadillac F…  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
 7 Lincoln Co…  10.4     8  460    215  3     5.42  17.8     0     0     3     4
 8 Chrysler I…  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4
 9 Dodge Chal…  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2
10 AMC Javelin  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2
11 Camaro Z28   13.3     8  350    245  3.73  3.84  15.4     0     0     3     4
12 Pontiac Fi…  19.2     8  400    175  3.08  3.84  17.0     0     0     3     2
13 Ford Pante…  15.8     8  351    264  4.22  3.17  14.5     0     1     5     4
14 Maserati B…  15       8  301    335  3.54  3.57  14.6     0     1     5     8

Combine conditions with logical operators: - , or & for AND - | for OR - ! for NOT

Code

# Filter for 8-cylinder cars with more than 200 horsepower
mtcars_df %>% filter(cyl == 8 & hp > 200)

# A tibble: 7 × 12
  car_name       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Duster 360    14.3     8   360   245  3.21  3.57  15.8     0     0     3     4
2 Cadillac Fl…  10.4     8   472   205  2.93  5.25  18.0     0     0     3     4
3 Lincoln Con…  10.4     8   460   215  3     5.42  17.8     0     0     3     4
4 Chrysler Im…  14.7     8   440   230  3.23  5.34  17.4     0     0     3     4
5 Camaro Z28    13.3     8   350   245  3.73  3.84  15.4     0     0     3     4
6 Ford Panter…  15.8     8   351   264  4.22  3.17  14.5     0     1     5     4
7 Maserati Bo…  15       8   301   335  3.54  3.57  14.6     0     1     5     8

Code

# Filter for cars that are either 8-cylinder OR have more than 300 horsepower
mtcars_df %>% filter(cyl == 8 | hp > 300)

# A tibble: 14 × 12
   car_name      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 2 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 3 Merc 450SE   16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
 4 Merc 450SL   17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
 5 Merc 450SLC  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
 6 Cadillac F…  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
 7 Lincoln Co…  10.4     8  460    215  3     5.42  17.8     0     0     3     4
 8 Chrysler I…  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4
 9 Dodge Chal…  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2
10 AMC Javelin  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2
11 Camaro Z28   13.3     8  350    245  3.73  3.84  15.4     0     0     3     4
12 Pontiac Fi…  19.2     8  400    175  3.08  3.84  17.0     0     0     3     2
13 Ford Pante…  15.8     8  351    264  4.22  3.17  14.5     0     1     5     4
14 Maserati B…  15       8  301    335  3.54  3.57  14.6     0     1     5     8

`arrange()`: Reorder rows

arrange() sorts the rows of a data frame by one or more columns.

Code

# Sort cars by miles per gallon (mpg) in ascending order (the default)
mtcars_df %>% arrange(mpg)

# A tibble: 32 × 12
   car_name      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Cadillac F…  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
 2 Lincoln Co…  10.4     8  460    215  3     5.42  17.8     0     0     3     4
 3 Camaro Z28   13.3     8  350    245  3.73  3.84  15.4     0     0     3     4
 4 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 5 Chrysler I…  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4
 6 Maserati B…  15       8  301    335  3.54  3.57  14.6     0     1     5     8
 7 Merc 450SLC  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
 8 AMC Javelin  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2
 9 Dodge Chal…  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2
10 Ford Pante…  15.8     8  351    264  4.22  3.17  14.5     0     1     5     4
# ℹ 22 more rows

Use desc() to sort in descending order.

Code

# Sort by cylinder count (desc) and then by mpg (desc) for tie-breaking
mtcars_df %>% arrange(desc(cyl), desc(mpg))

# A tibble: 32 × 12
   car_name      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Pontiac Fi…  19.2     8  400    175  3.08  3.84  17.0     0     0     3     2
 2 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 3 Merc 450SL   17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
 4 Merc 450SE   16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
 5 Ford Pante…  15.8     8  351    264  4.22  3.17  14.5     0     1     5     4
 6 Dodge Chal…  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2
 7 Merc 450SLC  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
 8 AMC Javelin  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2
 9 Maserati B…  15       8  301    335  3.54  3.57  14.6     0     1     5     8
10 Chrysler I…  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4
# ℹ 22 more rows

`mutate()`: Create new columns

mutate() adds new columns or transforms existing ones, while keeping all existing columns.

Code

# Create a new column for horsepower-to-weight ratio
mtcars_df %>% mutate(hp_per_wt = hp / wt)

# A tibble: 32 × 13
   car_name      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows
# ℹ 1 more variable: hp_per_wt <dbl>

Use case_when() for complex conditional logic, which is like a more powerful version of an if-else statement.

Code

# Create a column for cylinder category
mtcars_df %>% mutate(cyl_group = case_when(
    cyl == 4 ~ "Four-Cylinder",
    cyl == 6 ~ "Six-Cylinder",
    cyl == 8 ~ "Eight-Cylinder",
    TRUE ~ "Other" # Fallback for any other case
  ))

# A tibble: 32 × 13
   car_name      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows
# ℹ 1 more variable: cyl_group <chr>

`summarise()` and `group_by()`: Collapse rows to a summary

This is one of the most powerful combinations in dplyr. group_by() groups the data by one or more variables. Subsequent operations are then performed on each group independently. summarise() then collapses each group into a single-row summary.

Code

# Calculate summary statistics for each cylinder group
mtcars_df %>% 
  group_by(cyl) %>% 
  summarise(
    avg_mpg = mean(mpg, na.rm = TRUE),
    median_hp = median(hp, na.rm = TRUE),
    min_wt = min(wt, na.rm = TRUE),
    max_wt = max(wt, na.rm = TRUE),
    car_count = n() # n() counts the number of rows in the group
  )

# A tibble: 3 × 6
    cyl avg_mpg median_hp min_wt max_wt car_count
  <dbl>   <dbl>     <dbl>  <dbl>  <dbl>     <int>
1     4    26.7       91    1.51   3.19        11
2     6    19.7      110    2.62   3.46         7
3     8    15.1      192.   3.17   5.42        14

Joining Tables

dplyr provides a family of join functions to combine data from different tables.

Code

band_members_df <- tribble(
  ~name,     ~band, 
  "Mick",    "Stones",
  "John",    "Beatles",
  "Paul",    "Beatles"
)

band_instruments_df <- tribble(
  ~name,     ~plays, 
  "Mick",    "vocals",
  "John",    "guitar",
  "Keith",   "guitar"
)

inner_join(): Returns only the rows where the key exists in both tables.
left_join(): Returns all rows from the left table, and matching rows from the right table.
full_join(): Returns all rows from both tables.
anti_join(): Returns all rows from the left table that do not have a match in the right table.

Code

# Left join to keep all members and see who plays an instrument
left_join(band_members_df, band_instruments_df, by = "name")

# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  vocals
2 John  Beatles guitar
3 Paul  Beatles <NA>

Code

# Anti join to find which members are not in the instruments table
anti_join(band_members_df, band_instruments_df, by = "name")

# A tibble: 1 × 2
  name  band   
  <chr> <chr>  
1 Paul  Beatles

2. `tidyr`: Tidy Your Data

tidyr provides tools for reshaping data. The goal is to create “tidy” data, which has a specific structure: 1. Every column is a variable. 2. Every row is an observation. 3. Every cell is a single value.

Code

# Create a sample wide-format (untidy) dataset
wide_data <- read.table(header = TRUE, text = '
 subject sex control cond1 cond2
       1   M     7.9  12.3  10.7
       2   F     6.3  10.6  11.1
')

`pivot_longer()`: Wide to Long

pivot_longer() makes data “longer” by gathering multiple columns into key-value pairs. This is often the first step to making data “tidy.”

Code

long_data <- wide_data %>% 
  pivot_longer(
    cols = c(control, cond1, cond2), # The columns to pivot
    names_to = 'condition',         # New column for the old column names
    values_to = 'measurement'       # New column for the old cell values
  )

long_data

# A tibble: 6 × 4
  subject sex   condition measurement
    <int> <chr> <chr>           <dbl>
1       1 M     control           7.9
2       1 M     cond1            12.3
3       1 M     cond2            10.7
4       2 F     control           6.3
5       2 F     cond1            10.6
6       2 F     cond2            11.1

`pivot_wider()`: Long to Wide

pivot_wider() does the opposite, making data “wider” by spreading a key-value pair into multiple columns.

Code

long_data %>% 
  pivot_wider(
    names_from = condition,   # Column to get new column names from
    values_from = measurement # Column to get cell values from
  )

# A tibble: 2 × 5
  subject sex   control cond1 cond2
    <int> <chr>   <dbl> <dbl> <dbl>
1       1 M         7.9  12.3  10.7
2       2 F         6.3  10.6  11.1

3. `stringr`: String Manipulation

stringr provides a consistent and user-friendly interface for common string operations, built on top of the stringi package.

Code

text_vector <- c("apple", "banana", "pear", "pineapple")

str_detect(): Check for the presence of a pattern.
str_replace(): Replace the first match of a pattern.
str_extract(): Extract the first match of a pattern.
str_split(): Split a string into pieces.

Code

# Find elements containing "apple"
str_detect(text_vector, "apple")

[1]  TRUE FALSE FALSE  TRUE

Code

# Extract numbers from a string using a regular expression
# \d+ is a regex for one or more digits
str_extract("Order_ID_12345", "\\d+")

[1] "12345"

Code

# Split a string by a delimiter
str_split("a-b-c", "-")

[[1]]
[1] "a" "b" "c"

4. `lubridate`: Date and Time Manipulation

lubridate simplifies working with dates and times in R, which can otherwise be quite complex.

Code

library(lubridate)

today:

Code

today()

[1] "2025-07-01"

Parsing: ymd(), mdy(), dmy() parse strings into dates reliably, automatically handling different separators.
Component Extraction: year(), month(), day(), wday() extract parts of a date.
Arithmetic: Perform calculations with date and time objects using duration and period objects.

Code

# Parse a date
date_obj <- ymd("2023-10-27")

# Get the day of the week
wday(date_obj, label = TRUE, abbr = FALSE)

[1] Friday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

Code

# Add 15 days to the date
date_obj + days(15)

[1] "2023-11-11"

Code

# Calculate the time difference between two dates
new_year <- ymd("2024-01-01")
interval(date_obj, new_year) / days(1)

[1] 66

4. Window Functions

Window functions are functions that operate on a “window” of data (e.g., within a group) but, unlike summarise(), they return a value for every row.

Ranking with `row_number`

row_number() assigns a unique rank to each row within a group.

Code

small_mtcars %>% 
  group_by(cyl) %>% 
  mutate(rank = row_number(desc(mpg))) %>% 
  select(cyl, mpg, rank)

# A tibble: 6 × 3
# Groups:   cyl [3]
    cyl   mpg  rank
  <dbl> <dbl> <int>
1     6  21       2
2     6  21       3
3     4  22.8     1
4     6  21.4     1
5     8  18.7     1
6     6  18.1     4

Accessing Previous/Next Values with `lag` and `lead`

lag() and lead() are useful for comparing a value to its predecessor or successor.

Code

small_mtcars %>% 
  select(cyl, mpg) %>% 
  mutate(mpg_previous = lag(mpg, n = 1))

  cyl  mpg mpg_previous
1   6 21.0           NA
2   6 21.0         21.0
3   4 22.8         21.0
4   6 21.4         22.8
5   8 18.7         21.4
6   6 18.1         18.7

Cumulative Summaries

cumsum() calculates the cumulative sum.

Code

small_mtcars %>% 
  select(cyl, mpg) %>% 
  mutate(mpg_running_total = cumsum(mpg))

  cyl  mpg mpg_running_total
1   6 21.0              21.0
2   6 21.0              42.0
3   4 22.8              64.8
4   6 21.4              86.2
5   8 18.7             104.9
6   6 18.1             123.0

Order Rows

arrange() sorts the rows of a data frame by one or more columns.

Code

small_mtcars %>% arrange(cyl)

           car_name cyl  mpg  hp
1        Datsun 710   4 22.8  93
2         Mazda RX4   6 21.0 110
3     Mazda RX4 Wag   6 21.0 110
4    Hornet 4 Drive   6 21.4 110
5           Valiant   6 18.1 105
6 Hornet Sportabout   8 18.7 175

Sort in descending order using desc().

Code

small_mtcars %>% arrange(desc(cyl))

           car_name cyl  mpg  hp
1 Hornet Sportabout   8 18.7 175
2         Mazda RX4   6 21.0 110
3     Mazda RX4 Wag   6 21.0 110
4    Hornet 4 Drive   6 21.4 110
5           Valiant   6 18.1 105
6        Datsun 710   4 22.8  93

5. Handle Missing Data

Let’s create a sample data frame with missing values (NA).

Code

missing_df <- data.frame(
  x = c(1, 2, NA, 4),
  y = c("a", NA, "c", "d")
)

Find Missing Data

is.na() returns a logical vector indicating which values are missing.

Code

is.na(missing_df)

         x     y
[1,] FALSE FALSE
[2,] FALSE  TRUE
[3,]  TRUE FALSE
[4,] FALSE FALSE

Filter Out Missing Data

You can use filter() with !is.na() to remove rows with missing values in a specific column.

Code

missing_df %>% filter(!is.na(y))

   x y
1  1 a
2 NA c
3  4 d

tidyr::drop_na() removes rows with any missing values.

Code

missing_df %>% drop_na()

  x y
1 1 a
2 4 d

6. dataframe to other data format

dataframe to vector

Code

data=small_mtcars$cyl
data

[1] 6 6 4 6 8 6

Code

class(data)

[1] "numeric"

dataframe to matrix

Code

data=data.matrix(small_mtcars)
data

     car_name cyl  mpg  hp
[1,]        4   6 21.0 110
[2,]        5   6 21.0 110
[3,]        1   4 22.8  93
[4,]        2   6 21.4 110
[5,]        3   8 18.7 175
[6,]        6   6 18.1 105

Code

class(data)

[1] "matrix" "array"

dataframe to list

Code

data=as.list(small_mtcars)
data

$car_name
[1] "Mazda RX4"         "Mazda RX4 Wag"     "Datsun 710"       
[4] "Hornet 4 Drive"    "Hornet Sportabout" "Valiant"          

$cyl
[1] 6 6 4 6 8 6

$mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1

$hp
[1] 110 110  93 110 175 105

Code

class(data)

[1] "list"

1. dplyr: A Grammar of Data Manipulation

Load Packages and Data

Get Column Names

select(): Pick columns by name

Select by Index

Drop Columns

Rename Columns

filter(): Pick rows by condition

arrange(): Reorder rows

mutate(): Create new columns

summarise() and group_by(): Collapse rows to a summary

Joining Tables

2. tidyr: Tidy Your Data

pivot_longer(): Wide to Long

pivot_wider(): Long to Wide

3. stringr: String Manipulation

4. lubridate: Date and Time Manipulation

4. Window Functions

Ranking with row_number

Accessing Previous/Next Values with lag and lead

Cumulative Summaries

Order Rows

5. Handle Missing Data

Find Missing Data

Filter Out Missing Data

6. dataframe to other data format

dataframe to vector

dataframe to matrix

dataframe to list

5. References

1. `dplyr`: A Grammar of Data Manipulation

`select()`: Pick columns by name

`filter()`: Pick rows by condition

`arrange()`: Reorder rows

`mutate()`: Create new columns

`summarise()` and `group_by()`: Collapse rows to a summary

2. `tidyr`: Tidy Your Data

`pivot_longer()`: Wide to Long

`pivot_wider()`: Long to Wide

3. `stringr`: String Manipulation

4. `lubridate`: Date and Time Manipulation

Ranking with `row_number`

Accessing Previous/Next Values with `lag` and `lead`