Code
# Load the necessary libraries
library(DBI)
library(RSQLite)
library(tidyverse)
Tony Duan
This document provides a comprehensive guide on how to connect to and interact with SQL databases directly from R. We will explore three primary methods: writing raw SQL queries, using the modern dbplyr
package for a tidyverse
-like experience, and using sqldf
to query local data frames with SQL syntax.
While R is excellent at in-memory data manipulation, connecting to a database is essential when:
The DBI
(Database Interface) package provides a universal, consistent interface for connecting to any database. Specific driver packages like RSQLite
, RPostgres
, or odbc
handle the connection to different database systems.
For these examples, we will use RSQLite
, which creates a lightweight, file-based SQL database that is perfect for learning and demonstration.
First, we establish a connection to a database file. If the file doesn’t exist, dbConnect()
will create it. Then, we’ll copy the mtcars
and iris
R data frames into the database, creating new tables.
# Add car names as a column in mtcars, as they are originally row names
mtcars_df <- mtcars %>%
rownames_to_column(var = "car_name")
# Create a connection to an SQLite database file in the 'data' subdirectory
con <- dbConnect(RSQLite::SQLite(), "data/my_sql_database.db")
# Write the R data frames to the database
# `overwrite = TRUE` will replace the table if it already exists
dbWriteTable(con, "mtcars", mtcars_df, overwrite = TRUE)
dbWriteTable(con, "iris", iris, overwrite = TRUE)
We can list the tables to confirm they were successfully created.
This approach involves sending SQL statements as text strings to the database. The dbGetQuery()
function executes the query and returns the result as a data frame in R.
Select all columns (*
) from the mtcars
table. LIMIT
restricts the output to the first 3 rows.
Use AS
to rename a column in the query’s output. You can also create new columns based on calculations on existing ones.
The WHERE
clause filters rows based on one or more conditions. AND
requires both conditions to be true.
car_name mpg cyl disp hp drat wt qsec vs am gear carb
1 Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
2 Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
3 Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
4 Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
5 Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
6 Ford Pantera L 15.8 8 351 264 4.22 3.170 14.50 0 1 5 4
GROUP BY
is a powerful tool for summarizing data. It collapses multiple rows into a single summary row based on a grouping variable. It is almost always used with aggregate functions like AVG()
(average), COUNT()
(count), MIN()
(minimum), and MAX()
(maximum).
This query calculates the average horsepower and the number of cars for each cylinder group.
ORDER BY
sorts the result set. The default is ascending order (ASC
), but you can specify DESC
for descending order.
Joins are used to combine rows from two or more tables based on a related column. We’ll create two new tables to demonstrate.
Returns only the rows where the join key (part_id
) exists in both tables.
Returns all rows from the left table (car_parts
) and any matched rows from the right table (part_info
). If there is no match, the columns from the right table will be NA
.
For SQL commands that modify data but do not return a result set (like UPDATE
, CREATE
, DROP
), use the dbExecute()
function.
UPDATE
modifies existing records in a table.
CREATE TABLE
builds a new table, and DROP TABLE
permanently deletes one.
[1] 0
[1] "car_parts" "iris" "mtcars" "mtcars_copy" "part_info"
dbplyr
(Recommended)dbplyr
is a tidyverse
backend for databases. It allows you to write familiar dplyr
code, which dbplyr
translates into SQL for you. This is the recommended modern approach because it is:
dbplyr
only sends the query to the database when you explicitly ask for the results, which can save a lot of computation.First, we create a reference to the mtcars
table in the database. This object is a pointer to the remote table; the data is not pulled into R yet.
Now, we can use dplyr
verbs on this mtcars_db
object. The R code is not executed in R; it is translated to SQL and sent to the database for execution.
We can use show_query()
to inspect the SQL that dbplyr
generated. This is a great way to learn SQL!
<SQL>
SELECT `cyl`, AVG(`mpg`) AS `avg_mpg`
FROM (
SELECT `car_name`, `mpg`, `hp`, `cyl`
FROM `mtcars`
WHERE (`hp` > 150.0)
) AS `q01`
GROUP BY `cyl`
ORDER BY `avg_mpg` DESC
To execute the query and pull the results back into a standard R data frame, we use collect()
.
When you must write raw SQL, it is critical to avoid using functions like paste()
or glue()
to insert values directly into a query string. This practice makes you vulnerable to SQL injection, a serious security risk.
The correct and safe method is to use parameterized queries. You use a placeholder (?
) in the SQL string and provide the values in a separate list.
# Define the R variable to be used in the query
hp_threshold <- 150
# Write the SQL with a placeholder (?)
sql_safe <- "SELECT car_name, hp FROM mtcars WHERE hp > ?"
# Provide the variable in the `params` list. DBI handles the safe substitution.
dbGetQuery(con, sql_safe, params = list(hp_threshold)) %>% head()
car_name hp
1 Hornet Sportabout 175
2 Duster 360 245
3 Merc 450SE 180
4 Merc 450SL 180
5 Merc 450SLC 180
6 Cadillac Fleetwood 205
sqldf
on Local Data FramesThe sqldf
package offers a unique functionality: it allows you to run SQL queries on local R data frames, not on an external database. This can be a convenient tool for users who are more comfortable with SQL syntax than with dplyr
for data manipulation tasks.
A transaction is a sequence of operations performed as a single logical unit of work. All the operations in a transaction must succeed; if any of them fail, the entire transaction is rolled back, and the database is left unchanged. This ensures data integrity.
# Start a transaction
dbBegin(con)
# Try to execute two statements
# The first will succeed, the second will fail because the table doesn't exist
dbExecute(con, "UPDATE mtcars SET hp = 999 WHERE car_name = 'Mazda RX4'")
dbExecute(con, "UPDATE non_existent_table SET col = 1")
# If there was an error, dbCommit will fail and you should rollback
# In a real script, you would wrap this in tryCatch()
# For this example, we manually rollback
dbRollback(con)
# Check the value - it should NOT be 999 because the transaction was rolled back
dbGetQuery(con, "SELECT hp FROM mtcars WHERE car_name = 'Mazda RX4'")
It is very important to close the database connection when you are finished. This releases resources on both your machine and the database server.
---
title: "Working with SQL Databases in R"
author: "Tony Duan"
execute:
warning: false
error: false
format:
html:
toc: true
toc-location: right
code-fold: show
code-tools: true
number-sections: false
code-block-bg: true
code-block-border-left: "#31BAE9"
---
This document provides a comprehensive guide on how to connect to and interact with SQL databases directly from R. We will explore three primary methods: writing raw SQL queries, using the modern `dbplyr` package for a `tidyverse`-like experience, and using `sqldf` to query local data frames with SQL syntax.
{width="600"}
# 1. Why Use a Database with R?
While R is excellent at in-memory data manipulation, connecting to a database is essential when:
- **Data size is too large:** The dataset is too big to fit into your computer's RAM. Databases are optimized for storing and querying massive datasets efficiently.
- **Data is centralized:** The data is stored in a central location, accessed by multiple users or applications, ensuring data consistency and integrity.
- **Production environments:** You need to query or write to a production database system as part of an analytical workflow. The database handles tasks like security, backups, and concurrent access.
# 2. Connecting to a Database
The `DBI` (Database Interface) package provides a universal, consistent interface for connecting to any database. Specific driver packages like `RSQLite`, `RPostgres`, or `odbc` handle the connection to different database systems.
For these examples, we will use `RSQLite`, which creates a lightweight, file-based SQL database that is perfect for learning and demonstration.
```{r}
# Load the necessary libraries
library(DBI)
library(RSQLite)
library(tidyverse)
```
## Create and Populate a Database
First, we establish a connection to a database file. If the file doesn't exist, `dbConnect()` will create it. Then, we'll copy the `mtcars` and `iris` R data frames into the database, creating new tables.
```{r}
# Add car names as a column in mtcars, as they are originally row names
mtcars_df <- mtcars %>%
rownames_to_column(var = "car_name")
# Create a connection to an SQLite database file in the 'data' subdirectory
con <- dbConnect(RSQLite::SQLite(), "data/my_sql_database.db")
# Write the R data frames to the database
# `overwrite = TRUE` will replace the table if it already exists
dbWriteTable(con, "mtcars", mtcars_df, overwrite = TRUE)
dbWriteTable(con, "iris", iris, overwrite = TRUE)
```
We can list the tables to confirm they were successfully created.
```{r}
dbListTables(con)
```
# 3. Method 1: Writing Raw SQL Queries
This approach involves sending SQL statements as text strings to the database. The `dbGetQuery()` function executes the query and returns the result as a data frame in R.
## SELECT Statement
Select all columns (`*`) from the `mtcars` table. `LIMIT` restricts the output to the first 3 rows.
```{r}
sql_select <- "SELECT * FROM mtcars LIMIT 3"
dbGetQuery(con, sql_select)
```
## Column Aliasing and Creation
Use `AS` to rename a column in the query's output. You can also create new columns based on calculations on existing ones.
```{r}
sql_create_col <- "SELECT mpg, mpg * 0.425 AS kpl FROM mtcars" # Convert MPG to KPL
head(dbGetQuery(con, sql_create_col))
```
## Filtering with WHERE
The `WHERE` clause filters rows based on one or more conditions. `AND` requires both conditions to be true.
```{r}
sql_filter <- "SELECT * FROM mtcars WHERE hp > 200 AND cyl = 8"
head(dbGetQuery(con, sql_filter))
```
## Aggregation with GROUP BY
`GROUP BY` is a powerful tool for summarizing data. It collapses multiple rows into a single summary row based on a grouping variable. It is almost always used with aggregate functions like `AVG()` (average), `COUNT()` (count), `MIN()` (minimum), and `MAX()` (maximum).
This query calculates the average horsepower and the number of cars for each cylinder group.
```{r}
sql_groupby <- "SELECT cyl, AVG(hp) AS avg_hp, COUNT(*) AS num_cars FROM mtcars GROUP BY cyl"
dbGetQuery(con, sql_groupby)
```
## Sorting with ORDER BY
`ORDER BY` sorts the result set. The default is ascending order (`ASC`), but you can specify `DESC` for descending order.
```{r}
sql_orderby <- "SELECT car_name, mpg, hp FROM mtcars ORDER BY hp DESC LIMIT 5"
dbGetQuery(con, sql_orderby)
```
## Joins
Joins are used to combine rows from two or more tables based on a related column. We'll create two new tables to demonstrate.
```{r}
dbWriteTable(con, "car_parts", data.frame(car_name = c("Mazda RX4", "Datsun 710", "Hornet 4 Drive"), part_id = c(1, 2, 3)), overwrite = TRUE)
dbWriteTable(con, "part_info", data.frame(part_id = c(1, 2, 4), part_name = c("Engine", "Tire", "Brake")), overwrite = TRUE)
```
### INNER JOIN
Returns only the rows where the join key (`part_id`) exists in **both** tables.
```{r}
sql_inner_join <- "
SELECT cp.car_name, pi.part_name
FROM car_parts cp
INNER JOIN part_info pi ON cp.part_id = pi.part_id"
dbGetQuery(con, sql_inner_join)
```
### LEFT JOIN
Returns **all** rows from the left table (`car_parts`) and any matched rows from the right table (`part_info`). If there is no match, the columns from the right table will be `NA`.
```{r}
sql_left_join <- "
SELECT cp.car_name, pi.part_name
FROM car_parts cp
LEFT JOIN part_info pi ON cp.part_id = pi.part_id"
dbGetQuery(con, sql_left_join)
```
## Modifying Data
For SQL commands that modify data but do not return a result set (like `UPDATE`, `CREATE`, `DROP`), use the `dbExecute()` function.
### UPDATE Table
`UPDATE` modifies existing records in a table.
```{r}
# Set the horsepower to 120 for the 'Mazda RX4'
dbExecute(con, "UPDATE mtcars SET hp = 120 WHERE car_name = 'Mazda RX4'")
# Verify the change
dbGetQuery(con, "SELECT car_name, hp FROM mtcars WHERE car_name = 'Mazda RX4'")
```
### CREATE and DROP Table
`CREATE TABLE` builds a new table, and `DROP TABLE` permanently deletes one.
```{r}
# Create a copy of the mtcars table
sql_create <- "CREATE TABLE mtcars_copy AS SELECT * FROM mtcars"
dbExecute(con, sql_create)
dbListTables(con)
```
```{r}
# Drop the copy
sql_drop <- "DROP TABLE mtcars_copy"
dbExecute(con, sql_drop)
dbListTables(con)
```
# 4. Method 2: Using `dbplyr` (Recommended)
`dbplyr` is a `tidyverse` backend for databases. It allows you to write familiar `dplyr` code, which `dbplyr` translates into SQL for you. This is the **recommended modern approach** because it is:
- **Easier to write and read:** No need to remember the specific SQL dialect of your database.
- **Less error-prone:** Avoids syntax errors from manually writing SQL strings.
- **More secure:** Protects against SQL injection attacks automatically.
- **Lazy evaluation:** `dbplyr` only sends the query to the database when you explicitly ask for the results, which can save a lot of computation.
First, we create a reference to the `mtcars` table in the database. This object is a pointer to the remote table; the data is not pulled into R yet.
```{r}
mtcars_db <- tbl(con, "mtcars")
```
Now, we can use `dplyr` verbs on this `mtcars_db` object. The R code is not executed in R; it is translated to SQL and sent to the database for execution.
```{r}
query <- mtcars_db %>%
select(car_name, mpg, hp, cyl) %>%
filter(hp > 150) %>%
group_by(cyl) %>%
summarise(avg_mpg = mean(mpg, na.rm = TRUE), .groups = 'drop') %>%
arrange(desc(avg_mpg))
```
We can use `show_query()` to inspect the SQL that `dbplyr` generated. This is a great way to learn SQL!
```{r}
show_query(query)
```
To execute the query and pull the results back into a standard R data frame, we use `collect()`.
```{r}
results <- collect(query)
results
```
# 5. Best Practice: Parameterized Queries
When you must write raw SQL, it is critical to avoid using functions like `paste()` or `glue()` to insert values directly into a query string. This practice makes you vulnerable to **SQL injection**, a serious security risk.
The correct and safe method is to use **parameterized queries**. You use a placeholder (`?`) in the SQL string and provide the values in a separate list.
```{r}
# Define the R variable to be used in the query
hp_threshold <- 150
# Write the SQL with a placeholder (?)
sql_safe <- "SELECT car_name, hp FROM mtcars WHERE hp > ?"
# Provide the variable in the `params` list. DBI handles the safe substitution.
dbGetQuery(con, sql_safe, params = list(hp_threshold)) %>% head()
```
# 6. Method 3: Using `sqldf` on Local Data Frames
The `sqldf` package offers a unique functionality: it allows you to run SQL queries on **local R data frames**, not on an external database. This can be a convenient tool for users who are more comfortable with SQL syntax than with `dplyr` for data manipulation tasks.
```{r}
library(sqldf)
# Note the use of the local R data frame `mtcars_df`, not a database connection
sqldf("SELECT car_name, mpg FROM mtcars_df WHERE cyl = 8 LIMIT 5")
```
# 7. Database Transactions
A transaction is a sequence of operations performed as a single logical unit of work. All the operations in a transaction must succeed; if any of them fail, the entire transaction is rolled back, and the database is left unchanged. This ensures data integrity.
```{r}
#| eval: false
# Start a transaction
dbBegin(con)
# Try to execute two statements
# The first will succeed, the second will fail because the table doesn't exist
dbExecute(con, "UPDATE mtcars SET hp = 999 WHERE car_name = 'Mazda RX4'")
dbExecute(con, "UPDATE non_existent_table SET col = 1")
# If there was an error, dbCommit will fail and you should rollback
# In a real script, you would wrap this in tryCatch()
# For this example, we manually rollback
dbRollback(con)
# Check the value - it should NOT be 999 because the transaction was rolled back
dbGetQuery(con, "SELECT hp FROM mtcars WHERE car_name = 'Mazda RX4'")
```
# 8. Disconnecting from the Database
It is very important to close the database connection when you are finished. This releases resources on both your machine and the database server.
```{r}
dbDisconnect(con)
```