This document provides a comprehensive guide to performing common data manipulation tasks using SQL, R, and Python. It serves as a reference for understanding how to achieve similar outcomes across these three popular data analysis tools.
1 Setup and Configuration
Code
library(reticulate)py_require(c("pandas","Great-Tables","polars","pyarrow"))# Load libraries for database interaction, data manipulation, and connections.library(DBI)library(tidyverse)library(RSQLite)library(connections)library(duckdb)# Prepare the R dataframes.# Remove the existing mtcars dataset if it exists.rm(mtcars)# Create the mtcars dataframe from the base R dataset, adding the row names as a new column.mtcars =cbind(model_name =rownames(mtcars), mtcars) |>head(10)# Create the iris dataframe from the base R dataset.iris = iris |>head(10)# Remove the database file if it already exists to start with a clean slate.if (file.exists("my-db.duckdb")) {file.remove("my-db.duckdb")}# Establish a connection to the DuckDB database.con <-dbConnect(duckdb(), dbdir ="my-db.duckdb", read_only =FALSE)# Write the iris and mtcars dataframes to the database as tables.dbWriteTable(con, "iris_table", iris, overwrite =TRUE)dbWriteTable(con, "mtcars_table", mtcars, overwrite =TRUE)
Code
import pandas as pdimport polars as plimport osfrom platform import python_version#print(python_version())cities_pd = pd.DataFrame({'country': ['NL', 'NL', 'NL', 'US', 'US', 'US', 'US', 'US', 'US'],'name': ['Amsterdam', 'Amsterdam', 'Amsterdam', 'Seattle', 'Seattle', 'Seattle', 'New York City', 'New York City', 'New York City'],'year': [2000, 2010, 2020, 2000, 2010, 2020, 2000, 2010, 2020],'population': [1005, 1065, 1158, 564, 608, 738, 8015, 8175, 8772]})# Make the R dataframes available in the Python environment.mtcars_pd = r.mtcarsiris_pd = r.irismtcars_pl = pl.from_pandas(mtcars_pd)iris_pl = pl.from_pandas(iris_pd)cities_pl=pl.from_pandas(cities_pd)
There are tables iris_table and mtcars_table in the DuckDB database my-db.duckdb.
There are dataframes iris and mtcars in the R environment.
There are dataframes iris and mtcars in the Python environment.
2 Show All Tables
This section demonstrates how to list all available tables or dataframes in each environment.
# List all dataframes in the current R environment.dflist <-Filter(is.data.frame, as.list(.GlobalEnv))names(dflist)
[1] "mtcars" "iris"
Code
# List all pandas dataframes in the current Python environment.import pandas as pdalldfs = [var for var indir() ifisinstance(eval(var), pd.core.frame.DataFrame)]print(alldfs)
['cities_pd', 'iris_pd', 'mtcars_pd']
Code
import polars as plalldfs = [name for name, val inglobals().items() ifisinstance(val, pl.DataFrame)]print(alldfs)
['mtcars_pl', 'iris_pl', 'cities_pl']
3 Describe a Table
This section shows how to get a summary of a table’s structure and statistics.
-- Describe the columns and data types of the mtcars_table.DESCRIBE mtcars_table;
Displaying records 1 - 10
column_name
column_type
null
key
default
extra
model_name
VARCHAR
YES
NA
NA
NA
mpg
DOUBLE
YES
NA
NA
NA
cyl
DOUBLE
YES
NA
NA
NA
disp
DOUBLE
YES
NA
NA
NA
hp
DOUBLE
YES
NA
NA
NA
drat
DOUBLE
YES
NA
NA
NA
wt
DOUBLE
YES
NA
NA
NA
qsec
DOUBLE
YES
NA
NA
NA
vs
DOUBLE
YES
NA
NA
NA
am
DOUBLE
YES
NA
NA
NA
Code
# Provide a detailed summary of the mtcars dataframe.skimr::skim(mtcars)
Data summary
Name
mtcars
Number of rows
10
Number of columns
12
_______________________
Column type frequency:
character
1
numeric
11
________________________
Group variables
None
Variable type: character
skim_variable
n_missing
complete_rate
min
max
empty
n_unique
whitespace
model_name
0
1
7
17
0
10
0
Variable type: numeric
skim_variable
n_missing
complete_rate
mean
sd
p0
p25
p50
p75
p100
hist
mpg
0
1
20.37
2.91
14.30
18.82
21.00
22.45
24.40
▂▂▅▇▇
cyl
0
1
5.80
1.48
4.00
4.50
6.00
6.00
8.00
▅▁▇▁▃
disp
0
1
208.61
90.37
108.00
150.02
163.80
249.75
360.00
▇▇▅▁▅
hp
0
1
122.80
51.45
62.00
97.50
110.00
119.75
245.00
▅▇▁▂▂
drat
0
1
3.54
0.44
2.76
3.16
3.77
3.90
3.92
▁▃▁▁▇
wt
0
1
3.13
0.41
2.32
2.94
3.20
3.44
3.57
▂▂▂▆▇
qsec
0
1
18.58
2.14
15.84
17.02
18.45
19.86
22.90
▇▃▃▂▂
vs
0
1
0.60
0.52
0.00
0.00
1.00
1.00
1.00
▅▁▁▁▇
am
0
1
0.30
0.48
0.00
0.00
0.00
0.75
1.00
▇▁▁▁▃
gear
0
1
3.60
0.52
3.00
3.00
4.00
4.00
4.00
▅▁▁▁▇
carb
0
1
2.50
1.35
1.00
1.25
2.00
4.00
4.00
▆▆▁▁▇
Code
# Generate descriptive statistics for the mtcars dataframe.mtcars_pd.describe(include='all')
model_name mpg cyl ... am gear carb
count 10 10.000000 10.00000 ... 10.000000 10.000000 10.000000
unique 10 NaN NaN ... NaN NaN NaN
top Mazda RX4 NaN NaN ... NaN NaN NaN
freq 1 NaN NaN ... NaN NaN NaN
mean NaN 20.370000 5.80000 ... 0.300000 3.600000 2.500000
std NaN 2.907099 1.47573 ... 0.483046 0.516398 1.354006
min NaN 14.300000 4.00000 ... 0.000000 3.000000 1.000000
25% NaN 18.825000 4.50000 ... 0.000000 3.000000 1.250000
50% NaN 21.000000 6.00000 ... 0.000000 4.000000 2.000000
75% NaN 22.450000 6.00000 ... 0.750000 4.000000 4.000000
max NaN 24.400000 8.00000 ... 1.000000 4.000000 4.000000
[11 rows x 12 columns]
Code
mtcars_pl.describe()
shape: (9, 13)
statistic
model_name
mpg
cyl
disp
hp
drat
wt
qsec
vs
am
gear
carb
str
str
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
f64
"count"
"10"
10.0
10.0
10.0
10.0
10.0
10.0
10.0
10.0
10.0
10.0
10.0
"null_count"
"0"
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
"mean"
null
20.37
5.8
208.61
122.8
3.538
3.128
18.581
0.6
0.3
3.6
2.5
"std"
null
2.907099
1.47573
90.373896
51.449652
0.440651
0.40649
2.135845
0.516398
0.483046
0.516398
1.354006
"min"
"Datsun 710"
14.3
4.0
108.0
62.0
2.76
2.32
15.84
0.0
0.0
3.0
1.0
"25%"
null
18.7
4.0
146.7
95.0
3.15
2.875
17.02
0.0
0.0
3.0
1.0
"50%"
null
21.0
6.0
167.6
110.0
3.85
3.215
18.61
1.0
0.0
4.0
2.0
"75%"
null
22.8
6.0
258.0
123.0
3.9
3.44
20.0
1.0
1.0
4.0
4.0
"max"
"Valiant"
24.4
8.0
360.0
245.0
3.92
3.57
22.9
1.0
1.0
4.0
4.0
4 Show Column Names
This section demonstrates how to retrieve the column names of a table.
-- Group by model_name and calculate the sum of mpg and the mean of cyl.SELECT model_name, sum(mpg) AS total_mpg, mean(cyl) AS cyl_mean FROM mtcars_tableGROUPBY1LIMIT5;
5 records
model_name
total_mpg
cyl_mean
Merc 240D
24.4
4
Duster 360
14.3
8
Merc 230
22.8
4
Mazda RX4
21.0
6
Datsun 710
22.8
4
Code
# Group by model_name and summarize the data.mtcars |>group_by(model_name) |>summarise(total_mpg =sum(mpg), cyl_mean =mean(cyl)) |>head(5)
This command creates a new table or overwrites an existing one.
Code
-- Create a temporary table, replacing it if it already exists.CREATEORREPLACE TEMP TABLE mtcars_table_group ASSELECT model_name, sum(mpg) AS total_mpg, mean(cyl) AS cyl_mean FROM mtcars_tableGROUPBY1;
11.2 CREATE TABLE IF NOT EXISTS
This command creates a table only if it does not already exist.
Code
-- Create a new table only if it does not already exist.CREATETABLEIFNOTEXISTS new_mtcars_table_group ASSELECT model_name, sum(mpg) AS total_mpg, mean(cyl) AS cyl_mean FROM mtcars_tableGROUPBY1;
12 Unique
12.1 Check Unique
This section demonstrates how to verify the uniqueness of values in a column.
-- Perform a left join between mtcars_table and mtcars_table_group.SELECT t1.model_name, t1.mpg, t1.cyl, t2.*FROM mtcars_table t1LEFTJOIN mtcars_table_group t2 ON t1.model_name = t2.model_nameLIMIT5;
5 records
model_name
mpg
cyl
model_name
total_mpg
cyl_mean
Mazda RX4
21.0
6
Mazda RX4
21.0
6
Mazda RX4 Wag
21.0
6
Mazda RX4 Wag
21.0
6
Datsun 710
22.8
4
Datsun 710
22.8
4
Hornet 4 Drive
21.4
6
Hornet 4 Drive
21.4
6
Hornet Sportabout
18.7
8
Hornet Sportabout
18.7
8
Code
# Perform a left join on the mtcars dataframe with itself.mtcars |>left_join(mtcars, by =join_by(model_name == model_name)) |>head(5)
-- Count the rows in the original table.SELECTcount(*) FROM mtcars_table;
1 records
count_star()
10
Code
-- Create a new table by appending mtcars_table to itself, including duplicates.CREATE TEMP TABLE double_mtcars_table ASSELECT*FROM mtcars_tableUNIONALLSELECT*FROM mtcars_table;
Code
-- Count the rows in the new table.SELECTcount(*) FROM double_mtcars_table;
1 records
count_star()
20
Code
# Append rows using bind_rows.mtcars %>%bind_rows(mtcars)
This section demonstrates how to analyze the execution plan of a query.
Code
-- Show the logical execution plan of a query.EXPLAINSELECT t1.model_name, t1.mpg, t1.cyl, t2.*FROM mtcars_table t1INNERJOIN (SELECT*FROM mtcars_table LIMIT5) t2 ON t1.model_name = t2.model_name;
Code
-- Show the detailed execution plan, including timing information.EXPLAINANALYZESELECT t1.model_name, t1.mpg, t1.cyl, t2.*FROM mtcars_table t1INNERJOIN (SELECT*FROM mtcars_table LIMIT5) t2 ON t1.model_name = t2.model_name;
Code
# Disconnect from the database and shut it down.dbDisconnect(con, shutdown =TRUE)
21 Display Beautiful Table
This section shows how to create nicely formatted tables for presentation.
---title: "SQL handbook"author: "Tony Duan"execute: warning: false error: falseformat: html: toc: true toc-location: right code-fold: show code-tools: true number-sections: true code-block-bg: true code-block-border-left: "#31BAE9"---This document provides a comprehensive guide to performing common data manipulation tasks using SQL, R, and Python. It serves as a reference for understanding how to achieve similar outcomes across these three popular data analysis tools.# Setup and Configuration```{r}#| code-fold: true#| output: falselibrary(reticulate)py_require(c("pandas","Great-Tables","polars","pyarrow"))# Load libraries for database interaction, data manipulation, and connections.library(DBI)library(tidyverse)library(RSQLite)library(connections)library(duckdb)# Prepare the R dataframes.# Remove the existing mtcars dataset if it exists.rm(mtcars)# Create the mtcars dataframe from the base R dataset, adding the row names as a new column.mtcars =cbind(model_name =rownames(mtcars), mtcars) |>head(10)# Create the iris dataframe from the base R dataset.iris = iris |>head(10)# Remove the database file if it already exists to start with a clean slate.if (file.exists("my-db.duckdb")) {file.remove("my-db.duckdb")}# Establish a connection to the DuckDB database.con <-dbConnect(duckdb(), dbdir ="my-db.duckdb", read_only =FALSE)# Write the iris and mtcars dataframes to the database as tables.dbWriteTable(con, "iris_table", iris, overwrite =TRUE)dbWriteTable(con, "mtcars_table", mtcars, overwrite =TRUE)``````{python}#| code-fold: true#| output: falseimport pandas as pdimport polars as plimport osfrom platform import python_version#print(python_version())cities_pd = pd.DataFrame({'country': ['NL', 'NL', 'NL', 'US', 'US', 'US', 'US', 'US', 'US'],'name': ['Amsterdam', 'Amsterdam', 'Amsterdam', 'Seattle', 'Seattle', 'Seattle', 'New York City', 'New York City', 'New York City'],'year': [2000, 2010, 2020, 2000, 2010, 2020, 2000, 2010, 2020],'population': [1005, 1065, 1158, 564, 608, 738, 8015, 8175, 8772]})# Make the R dataframes available in the Python environment.mtcars_pd = r.mtcarsiris_pd = r.irismtcars_pl = pl.from_pandas(mtcars_pd)iris_pl = pl.from_pandas(iris_pd)cities_pl=pl.from_pandas(cities_pd)```- There are tables `iris_table` and `mtcars_table` in the DuckDB database `my-db.duckdb`.- There are dataframes `iris` and `mtcars` in the R environment.- There are dataframes `iris` and `mtcars` in the Python environment.# Show All TablesThis section demonstrates how to list all available tables or dataframes in each environment.::: panel-tabset## SQL```{sql}#| connection: con-- Show all tables in the connected database.SHOWALLTABLES;```## R```{r}# List all dataframes in the current R environment.dflist <-Filter(is.data.frame, as.list(.GlobalEnv))names(dflist)```## Python pandas```{python}# List all pandas dataframes in the current Python environment.import pandas as pdalldfs = [var for var indir() ifisinstance(eval(var), pd.core.frame.DataFrame)]print(alldfs)```## Python Polars```{python}import polars as plalldfs = [name for name, val inglobals().items() ifisinstance(val, pl.DataFrame)]print(alldfs)```:::# Describe a TableThis section shows how to get a summary of a table's structure and statistics.::: panel-tabset## SQL```{sql}#| connection: con-- Describe the columns and data types of the mtcars_table.DESCRIBE mtcars_table;```## R```{r}# Provide a detailed summary of the mtcars dataframe.skimr::skim(mtcars)```## Python pandas```{python}# Generate descriptive statistics for the mtcars dataframe.mtcars_pd.describe(include='all')```## Python Polars```{python}mtcars_pl.describe()```:::# Show Column NamesThis section demonstrates how to retrieve the column names of a table.::: panel-tabset## SQL```{sql}#| connection: con-- Show information about the columns in mtcars_table.PRAGMA table_info(mtcars_table);```## R```{r}# Get the names of the columns in the mtcars dataframe.names(mtcars)```## Python pandas```{python}# Get a list of column names from the mtcars dataframe.import pandas as pdlist(mtcars_pd.columns.values)```## Python Polars```{python}# Get a list of column names from the mtcars dataframe.mtcars_pl.columns```:::# Select## Select Top 6 and RenameThis section shows how to select a subset of columns, rename them, and limit the number of rows returned.::: panel-tabset## SQL```{sql}#| connection: con-- Select and rename columns, limiting the result to the top 6 rows.SELECT model_name as model, mpg, cyl FROM mtcars_table LIMIT6;```## R```{r}# Select and rename columns from the first 6 rows of the mtcars dataframe.head(mtcars, 6) |>select(model = model_name, mpg, cyl)```## Python pandas```{python}# Rename a column in the mtcars dataframe.mtcars_pd.rename(columns={'model_name': 'model'})```## Python Polars```{python}mtcars_pl.select(pl.col("model_name").alias("model"), pl.col("mpg"), pl.col("cyl")).head(6)```:::## Select DistinctThis section demonstrates how to retrieve unique rows based on specified columns.::: panel-tabset## SQL```{sql}#| connection: con-- Select distinct combinations of mpg and cyl.SELECTDISTINCT mpg, cyl FROM mtcars_table;```## R```{r}# Get distinct rows based on mpg and cyl.mtcars |>distinct(mpg, cyl)```## Python pandas```{python}# Select specific columns and drop duplicate rows.df = mtcars_pd[["mpg", "cyl"]]print(df.drop_duplicates())```## Python Polars```{python}mtcars_pl.select(pl.col("mpg"), pl.col("cyl")).unique()```:::# Check Row and Column NumberThis section shows how to find the dimensions of a table.::: panel-tabset## SQL```{sql}#| connection: con-- Get the total number of rows.SELECTcount(*) AS row_number FROM mtcars_table;``````{sql}#| connection: con-- Get the total number of columns.SELECTcount(*) AS column_number FROM (DESCRIBE mtcars_table);```## R```{r}# Get the number of rows.nrow(mtcars)``````{r}# Get the number of columns.ncol(mtcars)```## Python pandas```{python}# Get the number of rows.mtcars_pd.shape[0]``````{python}# Get the number of columns.mtcars_pd.shape[1]```## Python Polars```{python}# Get the number of rows.mtcars_pl.shape[0]``````{python}# Get the number of columns.mtcars_pl.shape[1]```:::# Create ColumnThis section demonstrates how to add a new column to a table based on existing data.::: panel-tabset## SQL```{sql}#| connection: con-- Create a new column by performing a calculation on an existing column.SELECT*, mpg +1AS new_mpg FROM mtcars_table;```## R```{r}# Add a new column to the dataframe.mtcars |>mutate(new_mpg = mpg +1)```## Python pandas```{python}# Create a new column in the dataframe.mtcars_pd["new_mpg"] = mtcars_pd["mpg"] +1mtcars_pd```## Python Polars```{python}mtcars_pl.with_columns((pl.col("mpg") +1).alias("new_mpg"))```:::# FilterThis section shows how to select rows that meet specific criteria.::: panel-tabset## SQL```{sql}#| connection: con-- Use the AND operator to filter rows based on multiple conditions.SELECT*FROM mtcars_table WHERE mpg =21AND cyl =6;``````{sql}#| connection: con-- Use the OR operator to filter rows.SELECT*FROM mtcars_table WHERE mpg =21OR cyl =6;```## R```{r}# Use the & operator for "and" filtering.mtcars |>filter(mpg ==21& cyl ==6)``````{r}# Use the | operator for "or" filtering.mtcars |>filter(mpg ==21| cyl ==6)```## Python pandas```{python}# Use the query method for "and" filtering.mtcars_pd.query('mpg == 21 and cyl == 6')``````{python}# Use the query method for "or" filtering.mtcars_pd.query('mpg == 21 or cyl == 6')```## Python Polars```{python}# "and" filtering with polarsmtcars_pl.filter((pl.col("mpg") ==21) & (pl.col("cyl") ==6))``````{python}# "or" filtering with polarsmtcars_pl.filter((pl.col("mpg") ==21) | (pl.col("cyl") ==6))```:::# OrderThis section demonstrates how to sort a table based on one or more columns.::: panel-tabset## SQL```{sql}#| connection: con-- Order the results by mpg in descending order and show the top 3.SELECT model_name AS model, mpg, cyl FROM mtcars_table ORDERBY mpg DESCLIMIT3;```## R```{r}# Arrange the dataframe by mpg in descending order.mtcars |>select(model = model_name, mpg, cyl) |>arrange(desc(mpg)) |>head(3)```## Python pandas```{python}# Sort the dataframe by mpg in descending order.mtcars_pd[["model_name", "mpg", "cyl"]].sort_values(by='mpg', ascending=False).head(3)```## Python Polars```{python}mtcars_pl.select("model_name", "mpg", "cyl").sort("mpg", descending=True).head(3)```:::# Group ByThis section shows how to group rows and perform aggregate calculations.::: panel-tabset## SQL```{sql}#| connection: con-- Group by model_name and calculate the sum of mpg and the mean of cyl.SELECT model_name, sum(mpg) AS total_mpg, mean(cyl) AS cyl_mean FROM mtcars_tableGROUPBY1LIMIT5;```## R```{r}# Group by model_name and summarize the data.mtcars |>group_by(model_name) |>summarise(total_mpg =sum(mpg), cyl_mean =mean(cyl)) |>head(5)```## Python pandas```{python}# Group by model_name and aggregate the data.mtcars_pd.groupby('model_name').agg({'mpg': 'sum', 'cyl': 'mean'}).head(5)```## Python Polars```{python}mtcars_pl.group_by("model_name").agg([ pl.sum("mpg").alias("total_mpg"), pl.mean("cyl").alias("cyl_mean")]).head(5)```:::# Create Table## CREATE OR REPLACEThis command creates a new table or overwrites an existing one.```{sql}#| connection: con-- Create a temporary table, replacing it if it already exists.CREATEORREPLACE TEMP TABLE mtcars_table_group ASSELECT model_name, sum(mpg) AS total_mpg, mean(cyl) AS cyl_mean FROM mtcars_tableGROUPBY1;```## CREATE TABLE IF NOT EXISTSThis command creates a table only if it does not already exist.```{sql}#| connection: con-- Create a new table only if it does not already exist.CREATETABLEIFNOTEXISTS new_mtcars_table_group ASSELECT model_name, sum(mpg) AS total_mpg, mean(cyl) AS cyl_mean FROM mtcars_tableGROUPBY1;```# Unique## Check UniqueThis section demonstrates how to verify the uniqueness of values in a column.::: panel-tabset### SQL```{sql}#| connection: con-- Count the total number of rows and the number of distinct model names.SELECTcount(*), count(DISTINCT model_name) FROM mtcars_table;```### Python pandas```{python}# Count the total number of rows and the number of distinct model names.print(len(mtcars_pd), mtcars_pd.model_name.nunique())```### Python Polars```{python}# Count the total number of rows and the number of distinct model names.mtcars_pl.select(pl.count(), pl.col("model_name").n_unique())```:::## Get Duplicate and Non-Duplicate DataThis section demonstrates how to get duplicate and non-duplicate data based on a single column.::: panel-tabset### SQL#### Show all Duplicate```{sql}#| connection: con-- Get duplicate rows based on the mpg column.SELECT*FROM mtcars_table WHERE mpg IN (SELECT mpg FROM mtcars_table GROUPBY mpg HAVINGcount(*) >1);```keep non Duplicate```{sql}#| connection: con-- Get non-duplicate rows based on the mpg column.SELECT*FROM mtcars_table WHERE mpg IN (SELECT mpg FROM mtcars_table GROUPBY mpg HAVINGcount(*) =1);```### R#### Show all Duplicate```{r}# Get duplicate rows based on the mpg column.mtcars |>filter(duplicated(mpg) |duplicated(mpg, fromLast =TRUE))```#### keep non Duplicate```{r}# Get non-duplicate rows based on the mpg column.mtcars |>filter(!duplicated(mpg) &!duplicated(mpg, fromLast =TRUE))```### Python pandas#### Show all Duplicate```{python}# Get duplicate rows based on the mpg column.mtcars_pd[mtcars_pd.duplicated(subset=['mpg'], keep=False)]```#### keep non Duplicate```{python}# Get non-duplicate rows based on the mpg column.mtcars_pd[~mtcars_pd.duplicated(subset=['mpg'], keep=False)]```### Python Polars```{python}# Get duplicate rows based on the mpg column.mtcars_pl.filter(pl.col("mpg").is_duplicated())```#### keep non Duplicate```{python}# Get non-duplicate rows based on the mpg column.mtcars_pl.filter(pl.col("mpg").is_unique())```:::# Join```{sql}#| connection: con-- Select all data from the newly created temporary table.SELECT*FROM mtcars_table_group t1;```## Left JoinThis section shows how to perform a left join to combine data from two tables.::: panel-tabset# SQL```{sql}#| connection: con-- Perform a left join between mtcars_table and mtcars_table_group.SELECT t1.model_name, t1.mpg, t1.cyl, t2.*FROM mtcars_table t1LEFTJOIN mtcars_table_group t2 ON t1.model_name = t2.model_nameLIMIT5;```# R```{r}# Perform a left join on the mtcars dataframe with itself.mtcars |>left_join(mtcars, by =join_by(model_name == model_name)) |>head(5)```# Python pandas```{python}# Perform a left join using pandas.pd.merge(mtcars_pd, mtcars_pd, left_on='model_name', right_on='model_name', how='left').head(5)```# Python Polars```{python}mtcars_pl.join(mtcars_pl, on="model_name", how="left").head(5)```:::## Inner JoinThis section demonstrates how to perform an inner join.::: panel-tabset# SQL```{sql}#| connection: con-- Perform an inner join with a subquery.SELECT t1.model_name, t1.mpg, t1.cyl, t2.*FROM mtcars_table t1INNERJOIN (SELECT*FROM mtcars_table_group LIMIT5) t2 ON t1.model_name = t2.model_name;```# R```{r}# Perform an inner join.mtcars |>inner_join(mtcars, by =join_by(model_name == model_name)) |>head(5)```# Python pandas```{python}# Perform an inner join with pandas.pd.merge(mtcars_pd, mtcars_pd, left_on='model_name', right_on='model_name', how='inner').head(5)```# Python Polars```{python}mtcars_pl.join(mtcars_pl, on="model_name", how="inner").head(5)```:::# Append Rows## Append Without Duplicate Elimination (`union all`)This section shows how to combine rows from two tables, keeping all duplicates.::: panel-tabset### SQL```{sql}#| connection: con-- Count the rows in the original table.SELECTcount(*) FROM mtcars_table;``````{sql}#| connection: con-- Create a new table by appending mtcars_table to itself, including duplicates.CREATE TEMP TABLE double_mtcars_table ASSELECT*FROM mtcars_tableUNIONALLSELECT*FROM mtcars_table;``````{sql}#| connection: con-- Count the rows in the new table.SELECTcount(*) FROM double_mtcars_table;```### R```{r}# Append rows using bind_rows.mtcars %>%bind_rows(mtcars)```### Python pandas```{python}# Concatenate dataframes, keeping all rows.pd.concat([mtcars_pd, mtcars_pd], ignore_index=True)```:::## Append With Duplicate Elimination (`union`)This section demonstrates how to combine rows while removing duplicate entries.::: panel-tabset### SQL```{sql}#| connection: con-- Create a table, removing duplicate rows.CREATEORREPLACE TEMP TABLE double_mtcars_table ASSELECT*FROM mtcars_tableUNIONSELECT*FROM mtcars_table;``````{sql}#| connection: con-- Count the rows in the new table.SELECTcount(*) FROM double_mtcars_table;``````{sql}#| connection: con-- Verify the count of distinct rows.SELECTcount(*) FROM (SELECTDISTINCT*FROM double_mtcars_table);```### R```{r}# Append rows and then remove duplicates.mtcars %>%bind_rows(mtcars) |>distinct()```### Python pandas```{python}# Concatenate dataframes and then drop duplicates.pd.concat([mtcars_pd, mtcars_pd], ignore_index=True).drop_duplicates()```:::# Delete RowsThis section shows how to remove rows from a table based on a condition.::: panel-tabset## SQL```{sql}#| connection: con-- Delete a specific row from the table.DELETEFROM mtcars_table WHERE model_name ='Mazda RX4';``````{sql}#| connection: con-- Verify the row count after deletion.SELECTcount(*) FROM mtcars_table;```## R```{r}# Filter out the specified row.mtcars |>filter(model_name !='Mazda RX4')```## Python pandas```{python}# Select all rows except the one specified.mtcars_pd[mtcars_pd['model_name'] !='Mazda RX4']```## Python Polars```{python}mtcars_pl.filter(pl.col("model_name") !='Mazda RX4')```:::# Update RowsThis section demonstrates how to modify existing data in a table.::: panel-tabset## SQL```{sql}#| connection: con-- Select the row to be updated.SELECT model_name, mpg FROM mtcars_table WHERE model_name ='Mazda RX4 Wag';``````{sql}#| connection: con-- Update the value of a column for a specific row.UPDATE mtcars_tableSET mpg =999WHERE model_name ='Mazda RX4 Wag';``````{sql}#| connection: con-- Verify the update.SELECT model_name, mpg FROM mtcars_table WHERE model_name ='Mazda RX4 Wag';```## R```{r}# Update a value within a filtered selection.mtcars |>filter(model_name =='Mazda RX4 Wag') |>mutate(mpg =999)```## Python pandas```{python}# Update a specific value using .loc.mtcars_pd.loc[mtcars_pd['model_name'] =='Mazda RX4 Wag', 'mpg'] =999mtcars_pd```## Python Polars```{python}mtcars_pl.with_columns( pl.when(pl.col("model_name") =='Mazda RX4 Wag') .then(999) .otherwise(pl.col("mpg")) .alias("mpg"))```:::# Drop TableThis section shows how to remove a table from the database.Before dropping the table:```{sql}#| connection: con-- Show all tables before dropping one.SHOWALLTABLES;```Dropping the table:```{sql}#| connection: con-- Drop the specified table if it exists.DROPTABLEIFEXISTS mtcars_table_group;```After dropping the table:```{sql}#| connection: con-- Show all tables again to confirm the deletion.SHOWALLTABLES;```# PIVOTThis section demonstrates how to transform data from a long to a wide format.```{sql}#| connection: con-- Create a temporary table for the pivot demonstration.DROPTABLEIFEXISTS cities;CREATE TEMP TABLE cities ( country VARCHAR, name VARCHAR, yearINTEGER, population INTEGER);INSERTINTO cities VALUES ('NL', 'Amsterdam', 2000, 1005), ('NL', 'Amsterdam', 2010, 1065), ('NL', 'Amsterdam', 2020, 1158), ('US', 'Seattle', 2000, 564), ('US', 'Seattle', 2010, 608), ('US', 'Seattle', 2020, 738), ('US', 'New York City', 2000, 8015), ('US', 'New York City', 2010, 8175), ('US', 'New York City', 2020, 8772);``````{sql}#| connection: con-- View the raw data.SELECT*FROM cities;```## PIVOT on One Column::: panel-tabset### SQL```{sql}#| connection: con-- Pivot the table on the 'year' column.PIVOT citiesONyearUSINGsum(population)GROUPBY country;```### Python pandas```{python}cities_pd.pivot_table(index='country', columns='year', values='population', aggfunc='sum').reset_index()```### Python Polars```{python}cities_pl.pivot(index='country', columns='year', values='population', aggregate_function='sum')```:::## PIVOT on Two Columns::: panel-tabset### SQL```{sql}#| connection: con-- Pivot on both 'country' and 'name'.PIVOT citiesON country, nameUSINGsum(population);```### Python pandas```{python}cities_pd.pivot_table(index=['country', 'name'], columns='year', values='population', aggfunc='sum').reset_index()```### Python Polars```{python}cities_pl.pivot(index=['country', 'name'], columns='year', values='population', aggregate_function='sum')```:::# UNPIVOTThis section shows how to transform data from a wide to a long format.::: panel-tabset### SQL```{sql}#| connection: con-- Unpivot the previously pivoted table.UNPIVOT( PIVOT citiesONyearUSINGsum(population)GROUPBY country)ONCOLUMNS(* EXCLUDE (country))INTO NAME year VALUE population;```### Python pandas```{python}cities_pivot_pd = cities_pd.pivot_table(index='country', columns='year', values='population', aggfunc='sum').reset_index()cities_pivot_pd.melt(id_vars='country', value_vars=[2000, 2010, 2020], var_name='year', value_name='population')```### Python Polars```{python}cities_pl = pl.from_pandas(cities_pd)``````{python}cities_pivot_pl = cities_pl.pivot(index='country', columns='year', values='population', aggregate_function='sum')cities_pivot_pl.unpivot(index='country', on=['2000', '2010', '2020'], variable_name='year', value_name='population')```:::# EXPLAINThis section demonstrates how to analyze the execution plan of a query.```{sql}#| eval: false#| connection: con-- Show the logical execution plan of a query.EXPLAINSELECT t1.model_name, t1.mpg, t1.cyl, t2.*FROM mtcars_table t1INNERJOIN (SELECT*FROM mtcars_table LIMIT5) t2 ON t1.model_name = t2.model_name;``````{sql}#| eval: false#| connection: con-- Show the detailed execution plan, including timing information.EXPLAINANALYZESELECT t1.model_name, t1.mpg, t1.cyl, t2.*FROM mtcars_table t1INNERJOIN (SELECT*FROM mtcars_table LIMIT5) t2 ON t1.model_name = t2.model_name;``````{r}# Disconnect from the database and shut it down.dbDisconnect(con, shutdown =TRUE)```# Display Beautiful TableThis section shows how to create nicely formatted tables for presentation.::: panel-tabset## R```{r}# Use the gt library to create a formatted table.library(gt)mtcars |>gt()```## Python pandas```{python}# Use the great_tables library to create a formatted table.from great_tables import GTGT(mtcars_pd)```## Python Polars```{python}# Use the great_tables library to create a formatted table.from great_tables import GTGT(mtcars_pl)```:::# Reference:- [DuckDB R Client Documentation](https://duckdb.org/docs/stable/clients/r.html)- [DuckDB SQL Statements](https://duckdb.org/docs/stable/sql/statements)