Working with SQL Databases in R

Author

Tony Duan

This document provides a comprehensive guide on how to connect to and interact with SQL databases directly from R. We will explore three primary methods: writing raw SQL queries, using the modern dbplyr package for a tidyverse-like experience, and using sqldf to query local data frames with SQL syntax.

1. Why Use a Database with R?

While R is excellent at in-memory data manipulation, connecting to a database is essential when:

Data size is too large: The dataset is too big to fit into your computer’s RAM. Databases are optimized for storing and querying massive datasets efficiently.
Data is centralized: The data is stored in a central location, accessed by multiple users or applications, ensuring data consistency and integrity.
Production environments: You need to query or write to a production database system as part of an analytical workflow. The database handles tasks like security, backups, and concurrent access.

2. Connecting to a Database

The DBI (Database Interface) package provides a universal, consistent interface for connecting to any database. Specific driver packages like RSQLite, RPostgres, or odbc handle the connection to different database systems.

For these examples, we will use RSQLite, which creates a lightweight, file-based SQL database that is perfect for learning and demonstration.

Code

# Load the necessary libraries
library(DBI)
library(RSQLite)
library(tidyverse)

Create and Populate a Database

First, we establish a connection to a database file. If the file doesn’t exist, dbConnect() will create it. Then, we’ll copy the mtcars and iris R data frames into the database, creating new tables.

Code

# Add car names as a column in mtcars, as they are originally row names
mtcars_df <- mtcars %>% 
  rownames_to_column(var = "car_name")

# Create a connection to an SQLite database file in the 'data' subdirectory
con <- dbConnect(RSQLite::SQLite(), "data/my_sql_database.db")

# Write the R data frames to the database
# `overwrite = TRUE` will replace the table if it already exists
dbWriteTable(con, "mtcars", mtcars_df, overwrite = TRUE)
dbWriteTable(con, "iris", iris, overwrite = TRUE)

We can list the tables to confirm they were successfully created.

Code

dbListTables(con)

[1] "car_parts" "iris"      "mtcars"    "part_info"

3. Method 1: Writing Raw SQL Queries

This approach involves sending SQL statements as text strings to the database. The dbGetQuery() function executes the query and returns the result as a data frame in R.

SELECT Statement

Select all columns (*) from the mtcars table. LIMIT restricts the output to the first 3 rows.

Code

sql_select <- "SELECT * FROM mtcars LIMIT 3"
dbGetQuery(con, sql_select)

       car_name  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1     Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3    Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

Column Aliasing and Creation

Use AS to rename a column in the query’s output. You can also create new columns based on calculations on existing ones.

Code

sql_create_col <- "SELECT mpg, mpg * 0.425 AS kpl FROM mtcars" # Convert MPG to KPL
head(dbGetQuery(con, sql_create_col))

   mpg    kpl
1 21.0 8.9250
2 21.0 8.9250
3 22.8 9.6900
4 21.4 9.0950
5 18.7 7.9475
6 18.1 7.6925

Filtering with WHERE

The WHERE clause filters rows based on one or more conditions. AND requires both conditions to be true.

Code

sql_filter <- "SELECT * FROM mtcars WHERE hp > 200 AND cyl = 8"
head(dbGetQuery(con, sql_filter))

             car_name  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1          Duster 360 14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
2  Cadillac Fleetwood 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
3 Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
4   Chrysler Imperial 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
5          Camaro Z28 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
6      Ford Pantera L 15.8   8  351 264 4.22 3.170 14.50  0  1    5    4

Aggregation with GROUP BY

GROUP BY is a powerful tool for summarizing data. It collapses multiple rows into a single summary row based on a grouping variable. It is almost always used with aggregate functions like AVG() (average), COUNT() (count), MIN() (minimum), and MAX() (maximum).

This query calculates the average horsepower and the number of cars for each cylinder group.

Code

sql_groupby <- "SELECT cyl, AVG(hp) AS avg_hp, COUNT(*) AS num_cars FROM mtcars GROUP BY cyl"
dbGetQuery(con, sql_groupby)

  cyl    avg_hp num_cars
1   4  82.63636       11
2   6 122.28571        7
3   8 209.21429       14

Sorting with ORDER BY

ORDER BY sorts the result set. The default is ascending order (ASC), but you can specify DESC for descending order.

Code

sql_orderby <- "SELECT car_name, mpg, hp FROM mtcars ORDER BY hp DESC LIMIT 5"
dbGetQuery(con, sql_orderby)

           car_name  mpg  hp
1     Maserati Bora 15.0 335
2    Ford Pantera L 15.8 264
3        Duster 360 14.3 245
4        Camaro Z28 13.3 245
5 Chrysler Imperial 14.7 230

Joins

Joins are used to combine rows from two or more tables based on a related column. We’ll create two new tables to demonstrate.

Code

dbWriteTable(con, "car_parts", data.frame(car_name = c("Mazda RX4", "Datsun 710", "Hornet 4 Drive"), part_id = c(1, 2, 3)), overwrite = TRUE)
dbWriteTable(con, "part_info", data.frame(part_id = c(1, 2, 4), part_name = c("Engine", "Tire", "Brake")), overwrite = TRUE)

INNER JOIN

Returns only the rows where the join key (part_id) exists in both tables.

Code

sql_inner_join <- "
SELECT cp.car_name, pi.part_name
FROM car_parts cp
INNER JOIN part_info pi ON cp.part_id = pi.part_id"
dbGetQuery(con, sql_inner_join)

    car_name part_name
1  Mazda RX4    Engine
2 Datsun 710      Tire

LEFT JOIN

Returns all rows from the left table (car_parts) and any matched rows from the right table (part_info). If there is no match, the columns from the right table will be NA.

Code

sql_left_join <- "
SELECT cp.car_name, pi.part_name
FROM car_parts cp
LEFT JOIN part_info pi ON cp.part_id = pi.part_id"
dbGetQuery(con, sql_left_join)

        car_name part_name
1      Mazda RX4    Engine
2     Datsun 710      Tire
3 Hornet 4 Drive      <NA>

Modifying Data

For SQL commands that modify data but do not return a result set (like UPDATE, CREATE, DROP), use the dbExecute() function.

UPDATE Table

UPDATE modifies existing records in a table.

Code

# Set the horsepower to 120 for the 'Mazda RX4'
dbExecute(con, "UPDATE mtcars SET hp = 120 WHERE car_name = 'Mazda RX4'")

[1] 1

Code

# Verify the change
dbGetQuery(con, "SELECT car_name, hp FROM mtcars WHERE car_name = 'Mazda RX4'")

   car_name  hp
1 Mazda RX4 120

CREATE and DROP Table

CREATE TABLE builds a new table, and DROP TABLE permanently deletes one.

Code

# Create a copy of the mtcars table
sql_create <- "CREATE TABLE mtcars_copy AS SELECT * FROM mtcars"
dbExecute(con, sql_create)

[1] 0

Code

dbListTables(con)

[1] "car_parts"   "iris"        "mtcars"      "mtcars_copy" "part_info"

Code

# Drop the copy
sql_drop <- "DROP TABLE mtcars_copy"
dbExecute(con, sql_drop)

[1] 0

Code

dbListTables(con)

[1] "car_parts" "iris"      "mtcars"    "part_info"

4. Method 2: Using `dbplyr` (Recommended)

dbplyr is a tidyverse backend for databases. It allows you to write familiar dplyr code, which dbplyr translates into SQL for you. This is the recommended modern approach because it is:

Easier to write and read: No need to remember the specific SQL dialect of your database.
Less error-prone: Avoids syntax errors from manually writing SQL strings.
More secure: Protects against SQL injection attacks automatically.
Lazy evaluation: dbplyr only sends the query to the database when you explicitly ask for the results, which can save a lot of computation.

First, we create a reference to the mtcars table in the database. This object is a pointer to the remote table; the data is not pulled into R yet.

Code

mtcars_db <- tbl(con, "mtcars")

Now, we can use dplyr verbs on this mtcars_db object. The R code is not executed in R; it is translated to SQL and sent to the database for execution.

Code

query <- mtcars_db %>%
  select(car_name, mpg, hp, cyl) %>%
  filter(hp > 150) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE), .groups = 'drop') %>%
  arrange(desc(avg_mpg))

We can use show_query() to inspect the SQL that dbplyr generated. This is a great way to learn SQL!

Code

show_query(query)

<SQL>
SELECT `cyl`, AVG(`mpg`) AS `avg_mpg`
FROM (
  SELECT `car_name`, `mpg`, `hp`, `cyl`
  FROM `mtcars`
  WHERE (`hp` > 150.0)
) AS `q01`
GROUP BY `cyl`
ORDER BY `avg_mpg` DESC

To execute the query and pull the results back into a standard R data frame, we use collect().

Code

results <- collect(query)
results

# A tibble: 2 × 2
    cyl avg_mpg
  <dbl>   <dbl>
1     6    19.7
2     8    15.1

5. Best Practice: Parameterized Queries

When you must write raw SQL, it is critical to avoid using functions like paste() or glue() to insert values directly into a query string. This practice makes you vulnerable to SQL injection, a serious security risk.

The correct and safe method is to use parameterized queries. You use a placeholder (?) in the SQL string and provide the values in a separate list.

Code

# Define the R variable to be used in the query
hp_threshold <- 150

# Write the SQL with a placeholder (?)
sql_safe <- "SELECT car_name, hp FROM mtcars WHERE hp > ?"

# Provide the variable in the `params` list. DBI handles the safe substitution.
dbGetQuery(con, sql_safe, params = list(hp_threshold)) %>% head()

            car_name  hp
1  Hornet Sportabout 175
2         Duster 360 245
3         Merc 450SE 180
4         Merc 450SL 180
5        Merc 450SLC 180
6 Cadillac Fleetwood 205

6. Method 3: Using `sqldf` on Local Data Frames

The sqldf package offers a unique functionality: it allows you to run SQL queries on local R data frames, not on an external database. This can be a convenient tool for users who are more comfortable with SQL syntax than with dplyr for data manipulation tasks.

Code

library(sqldf)

# Note the use of the local R data frame `mtcars_df`, not a database connection
sqldf("SELECT car_name, mpg FROM mtcars_df WHERE cyl = 8 LIMIT 5")

           car_name  mpg
1 Hornet Sportabout 18.7
2        Duster 360 14.3
3        Merc 450SE 16.4
4        Merc 450SL 17.3
5       Merc 450SLC 15.2

7. Database Transactions

A transaction is a sequence of operations performed as a single logical unit of work. All the operations in a transaction must succeed; if any of them fail, the entire transaction is rolled back, and the database is left unchanged. This ensures data integrity.

Code

# Start a transaction
dbBegin(con)

# Try to execute two statements
# The first will succeed, the second will fail because the table doesn't exist
dbExecute(con, "UPDATE mtcars SET hp = 999 WHERE car_name = 'Mazda RX4'")
dbExecute(con, "UPDATE non_existent_table SET col = 1")

# If there was an error, dbCommit will fail and you should rollback
# In a real script, you would wrap this in tryCatch()
# For this example, we manually rollback
dbRollback(con)

# Check the value - it should NOT be 999 because the transaction was rolled back
dbGetQuery(con, "SELECT hp FROM mtcars WHERE car_name = 'Mazda RX4'")

8. Disconnecting from the Database

It is very important to close the database connection when you are finished. This releases resources on both your machine and the database server.

Code

dbDisconnect(con)

1. Why Use a Database with R?

2. Connecting to a Database

Create and Populate a Database

3. Method 1: Writing Raw SQL Queries

SELECT Statement

Column Aliasing and Creation

Filtering with WHERE

Aggregation with GROUP BY

Sorting with ORDER BY

Joins

INNER JOIN

LEFT JOIN

Modifying Data

UPDATE Table

CREATE and DROP Table

4. Method 2: Using dbplyr (Recommended)

5. Best Practice: Parameterized Queries

6. Method 3: Using sqldf on Local Data Frames

7. Database Transactions

8. Disconnecting from the Database

4. Method 2: Using `dbplyr` (Recommended)

6. Method 3: Using `sqldf` on Local Data Frames