Code
Sys.setenv(RETICULATE_PYTHON = "/Library/Frameworks/Python.framework/Versions/3.11/bin/python3.11")
library(reticulate)
use_python("/Library/Frameworks/Python.framework/Versions/3.11/bin/python3.11")
Tony Duan
because siuba only work with python 3.11 for now
pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
Show package’s dependencies
['python-dateutil>=2.8.2', 'pytz>=2020.1', 'tzdata>=2022.1', 'numpy>=1.21.0; python_version >= "3.10"', 'numpy>=1.23.2; python_version >= "3.11"']
mpg cyl disp hp drat ... qsec vs am gear carb
Mazda RX4 21.0 6.0 160.0 110.0 3.90 ... 16.46 0.0 1.0 4.0 4.0
Mazda RX4 Wag 21.0 6.0 160.0 110.0 3.90 ... 17.02 0.0 1.0 4.0 4.0
Datsun 710 22.8 4.0 108.0 93.0 3.85 ... 18.61 1.0 1.0 4.0 1.0
Hornet 4 Drive 21.4 6.0 258.0 110.0 3.08 ... 19.44 1.0 0.0 3.0 1.0
Hornet Sportabout 18.7 8.0 360.0 175.0 3.15 ... 17.02 0.0 0.0 3.0 2.0
[5 rows x 11 columns]
row name to column
newhead mpg cyl disp hp ... qsec vs am gear carb
0 Mazda RX4 21.0 6.0 160.0 110.0 ... 16.46 0.0 1.0 4.0 4.0
1 Mazda RX4 Wag 21.0 6.0 160.0 110.0 ... 17.02 0.0 1.0 4.0 4.0
2 Datsun 710 22.8 4.0 108.0 93.0 ... 18.61 1.0 1.0 4.0 1.0
3 Hornet 4 Drive 21.4 6.0 258.0 110.0 ... 19.44 1.0 0.0 3.0 1.0
4 Hornet Sportabout 18.7 8.0 360.0 175.0 ... 17.02 0.0 0.0 3.0 2.0
[5 rows x 12 columns]
cyl mpg hp
0 6.0 21.0 110.0
1 6.0 21.0 110.0
2 4.0 22.8 93.0
3 6.0 21.4 110.0
4 8.0 18.7 175.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cyl 5 non-null float64
1 mpg 5 non-null float64
2 hp 5 non-null float64
dtypes: float64(3)
memory usage: 252.0 bytes
cyl mpg hp
0 6.0 21.0 110.0
1 6.0 21.0 110.0
2 4.0 22.8 93.0
3 6.0 21.4 110.0
4 8.0 18.7 175.0
other way:
mpg hp
0 21.0 110.0
1 21.0 110.0
2 22.8 93.0
3 21.4 110.0
4 18.7 175.0
other way:
cyl mpg hp mpg2 mpg3 mpg4
0 6.0 21.0 110.0 22.0 long Medium
1 6.0 21.0 110.0 22.0 long Medium
2 4.0 22.8 93.0 23.8 long long
3 6.0 21.4 110.0 22.4 long Medium
4 8.0 18.7 175.0 19.7 short short
mpg2
0 22.0
1 22.0
2 23.8
3 22.4
4 19.7
newhead mpg cyl disp hp ... qsec vs am gear carb
0 Mazda RX4 21.0 6.0 160.0 110.0 ... 16.46 0.0 1.0 4.0 4.0
1 Mazda RX4 Wag 21.0 6.0 160.0 110.0 ... 17.02 0.0 1.0 4.0 4.0
2 Datsun 710 22.8 4.0 108.0 93.0 ... 18.61 1.0 1.0 4.0 1.0
7 Merc 240D 24.4 4.0 146.7 62.0 ... 20.00 1.0 0.0 4.0 2.0
8 Merc 230 22.8 4.0 140.8 95.0 ... 22.90 1.0 0.0 4.0 2.0
9 Merc 280 19.2 6.0 167.6 123.0 ... 18.30 1.0 0.0 4.0 4.0
10 Merc 280C 17.8 6.0 167.6 123.0 ... 18.90 1.0 0.0 4.0 4.0
17 Fiat 128 32.4 4.0 78.7 66.0 ... 19.47 1.0 1.0 4.0 1.0
18 Honda Civic 30.4 4.0 75.7 52.0 ... 18.52 1.0 1.0 4.0 2.0
19 Toyota Corolla 33.9 4.0 71.1 65.0 ... 19.90 1.0 1.0 4.0 1.0
25 Fiat X1-9 27.3 4.0 79.0 66.0 ... 18.90 1.0 1.0 4.0 1.0
31 Volvo 142E 21.4 4.0 121.0 109.0 ... 18.60 1.0 1.0 4.0 2.0
[12 rows x 12 columns]
other way: ::: {.cell}
newhead mpg cyl disp hp ... qsec vs am gear carb
0 Mazda RX4 21.0 6.0 160.0 110.0 ... 16.46 0.0 1.0 4.0 4.0
1 Mazda RX4 Wag 21.0 6.0 160.0 110.0 ... 17.02 0.0 1.0 4.0 4.0
2 Datsun 710 22.8 4.0 108.0 93.0 ... 18.61 1.0 1.0 4.0 1.0
7 Merc 240D 24.4 4.0 146.7 62.0 ... 20.00 1.0 0.0 4.0 2.0
8 Merc 230 22.8 4.0 140.8 95.0 ... 22.90 1.0 0.0 4.0 2.0
9 Merc 280 19.2 6.0 167.6 123.0 ... 18.30 1.0 0.0 4.0 4.0
10 Merc 280C 17.8 6.0 167.6 123.0 ... 18.90 1.0 0.0 4.0 4.0
17 Fiat 128 32.4 4.0 78.7 66.0 ... 19.47 1.0 1.0 4.0 1.0
18 Honda Civic 30.4 4.0 75.7 52.0 ... 18.52 1.0 1.0 4.0 2.0
19 Toyota Corolla 33.9 4.0 71.1 65.0 ... 19.90 1.0 1.0 4.0 1.0
25 Fiat X1-9 27.3 4.0 79.0 66.0 ... 18.90 1.0 1.0 4.0 1.0
31 Volvo 142E 21.4 4.0 121.0 109.0 ... 18.60 1.0 1.0 4.0 2.0
[12 rows x 12 columns]
:::
newhead mpg cyl disp hp ... qsec vs am gear carb
28 Ford Pantera L 15.8 8.0 351.0 264.0 ... 14.5 0.0 1.0 5.0 4.0
29 Ferrari Dino 19.7 6.0 145.0 175.0 ... 15.5 0.0 1.0 5.0 6.0
30 Maserati Bora 15.0 8.0 301.0 335.0 ... 14.6 0.0 1.0 5.0 8.0
[3 rows x 12 columns]
other way: ::: {.cell}
newhead mpg cyl disp hp ... qsec vs am gear carb
28 Ford Pantera L 15.8 8.0 351.0 264.0 ... 14.5 0.0 1.0 5.0 4.0
29 Ferrari Dino 19.7 6.0 145.0 175.0 ... 15.5 0.0 1.0 5.0 6.0
30 Maserati Bora 15.0 8.0 301.0 335.0 ... 14.6 0.0 1.0 5.0 8.0
[3 rows x 12 columns]
:::
newhead mpg cyl disp hp ... qsec vs am gear carb
0 Mazda RX4 21.0 6.0 160.0 110.0 ... 16.46 0.0 1.0 4.0 4.0
1 Mazda RX4 Wag 21.0 6.0 160.0 110.0 ... 17.02 0.0 1.0 4.0 4.0
3 Hornet 4 Drive 21.4 6.0 258.0 110.0 ... 19.44 1.0 0.0 3.0 1.0
5 Valiant 18.1 6.0 225.0 105.0 ... 20.22 1.0 0.0 3.0 1.0
9 Merc 280 19.2 6.0 167.6 123.0 ... 18.30 1.0 0.0 4.0 4.0
10 Merc 280C 17.8 6.0 167.6 123.0 ... 18.90 1.0 0.0 4.0 4.0
26 Porsche 914-2 26.0 4.0 120.3 91.0 ... 16.70 0.0 1.0 5.0 2.0
27 Lotus Europa 30.4 4.0 95.1 113.0 ... 16.90 1.0 1.0 5.0 2.0
28 Ford Pantera L 15.8 8.0 351.0 264.0 ... 14.50 0.0 1.0 5.0 4.0
29 Ferrari Dino 19.7 6.0 145.0 175.0 ... 15.50 0.0 1.0 5.0 6.0
30 Maserati Bora 15.0 8.0 301.0 335.0 ... 14.60 0.0 1.0 5.0 8.0
[11 rows x 12 columns]
other way: ::: {.cell}
newhead mpg cyl disp hp ... qsec vs am gear carb
0 Mazda RX4 21.0 6.0 160.0 110.0 ... 16.46 0.0 1.0 4.0 4.0
1 Mazda RX4 Wag 21.0 6.0 160.0 110.0 ... 17.02 0.0 1.0 4.0 4.0
3 Hornet 4 Drive 21.4 6.0 258.0 110.0 ... 19.44 1.0 0.0 3.0 1.0
5 Valiant 18.1 6.0 225.0 105.0 ... 20.22 1.0 0.0 3.0 1.0
9 Merc 280 19.2 6.0 167.6 123.0 ... 18.30 1.0 0.0 4.0 4.0
10 Merc 280C 17.8 6.0 167.6 123.0 ... 18.90 1.0 0.0 4.0 4.0
26 Porsche 914-2 26.0 4.0 120.3 91.0 ... 16.70 0.0 1.0 5.0 2.0
27 Lotus Europa 30.4 4.0 95.1 113.0 ... 16.90 1.0 1.0 5.0 2.0
28 Ford Pantera L 15.8 8.0 351.0 264.0 ... 14.50 0.0 1.0 5.0 4.0
29 Ferrari Dino 19.7 6.0 145.0 175.0 ... 15.50 0.0 1.0 5.0 6.0
30 Maserati Bora 15.0 8.0 301.0 335.0 ... 14.60 0.0 1.0 5.0 8.0
[11 rows x 12 columns]
:::
newhead mpg cyl disp hp ... qsec vs am gear carb
0 Mazda RX4 21.0 6.0 160.0 110.0 ... 16.46 0.0 1.0 4.0 4.0
1 Mazda RX4 Wag 21.0 6.0 160.0 110.0 ... 17.02 0.0 1.0 4.0 4.0
2 Datsun 710 22.8 4.0 108.0 93.0 ... 18.61 1.0 1.0 4.0 1.0
3 Hornet 4 Drive 21.4 6.0 258.0 110.0 ... 19.44 1.0 0.0 3.0 1.0
[4 rows x 12 columns]
newhead mpg cyl disp hp ... qsec vs am gear carb
29 Ferrari Dino 19.7 6.0 145.0 175.0 ... 15.50 0.0 1.0 5.0 6.0
15 Lincoln Continental 10.4 8.0 460.0 215.0 ... 17.82 0.0 0.0 3.0 4.0
24 Pontiac Firebird 19.2 8.0 400.0 175.0 ... 17.05 0.0 0.0 3.0 2.0
17 Fiat 128 32.4 4.0 78.7 66.0 ... 19.47 1.0 1.0 4.0 1.0
8 Merc 230 22.8 4.0 140.8 95.0 ... 22.90 1.0 0.0 4.0 2.0
[5 rows x 12 columns]
newhead mpg cyl disp hp ... qsec vs am gear carb
0 Mazda RX4 21.0 6.0 160.0 110.0 ... 16.46 0.0 1.0 4.0 4.0
1 Mazda RX4 Wag 21.0 6.0 160.0 110.0 ... 17.02 0.0 1.0 4.0 4.0
2 Datsun 710 22.8 4.0 108.0 93.0 ... 18.61 1.0 1.0 4.0 1.0
3 Hornet 4 Drive 21.4 6.0 258.0 110.0 ... 19.44 1.0 0.0 3.0 1.0
4 Merc 280C 17.8 6.0 167.6 123.0 ... 18.90 1.0 0.0 4.0 4.0
[5 rows x 12 columns]
cyl
4.0 82.636364
6.0 122.285714
8.0 209.214286
Name: hp, dtype: float64
cyl mpg hp mpg2 mpg3 mpg4
2 4.0 22.8 93.0 23.8 long long
0 6.0 21.0 110.0 22.0 long Medium
1 6.0 21.0 110.0 22.0 long Medium
3 6.0 21.4 110.0 22.4 long Medium
4 8.0 18.7 175.0 19.7 short short
id price_x price_y price_z
0 1 0.1 0.4 0.7
1 2 0.2 0.5 0.8
id variable value
0 1 price_x 0.1
1 2 price_x 0.2
2 1 price_y 0.4
3 2 price_y 0.5
4 1 price_z 0.7
5 2 price_z 0.8
text num
0 abc 3
1 DDD 4
2 1243c 7
3 aeEe 8
Use str.extract() with a regular expression to pull out a matching piece of text.
For example the regular expression “^(.*) ” contains the following pieces:
a matches the literal letter “a”
.* has a . which matches anything, and * which modifies it to apply 0 or more times.
dates raw
0 2021-01-02 2023-04-05 06:07:08
1 2021-02-03 2024-05-06 07:08:09
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dates 2 non-null datetime64[ns]
1 raw 2 non-null object
dtypes: datetime64[ns](1), object(1)
memory usage: 164.0+ bytes
https://pandas.pydata.org/docs/user_guide
---
title: "Data manipulation with Pandas"
author: "Tony Duan"
execute:
warning: false
error: false
format:
html:
toc: true
toc-location: right
code-fold: show
code-tools: true
number-sections: true
code-block-bg: true
code-block-border-left: "#31BAE9"
---
# Set up Python:Using python 3.11 instead of 3.13
because siuba only work with python 3.11 for now
```{r}
Sys.setenv(RETICULATE_PYTHON = "/Library/Frameworks/Python.framework/Versions/3.11/bin/python3.11")
library(reticulate)
use_python("/Library/Frameworks/Python.framework/Versions/3.11/bin/python3.11")
```
```{r}
repl_python()
```
```{python}
from platform import python_version
print(python_version())
```
{width="600"}
pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.
Show package's dependencies
```{python}
import os
#os.system('pip show pandas')
from pip._vendor import pkg_resources
_package_name = 'pandas'
_package = pkg_resources.working_set.by_key[_package_name]
print([str(r) for r in _package.requires()]) # retrieve deps from setup.py
```
# install pacakge
```{python}
#| eval: false
!pip3.11 install siuba seaborn pandasql nbformat nbclient
```
# load package
```{python}
import pandas as pd
from pandasql import sqldf
import numpy as np
import matplotlib.pylab as plt
import seaborn as sns
from siuba.siu import call
from siuba import _, mutate, filter, group_by, summarize,show_query
from siuba import *
from siuba.data import mtcars,penguins
```
```{r}
mtcars=mtcars
```
```{python}
mtcars=r.mtcars
```
```{python}
mtcars.head(5)
```
row name to column
```{python}
mtcars.index.name = 'newhead'
mtcars.reset_index(inplace=True)
```
```{python}
mtcars.head(5)
```
```{python}
small_mtcars = mtcars[["cyl", "mpg",'hp']]
small_mtcars=small_mtcars.head(5)
small_mtcars
```
## get info on the data
```{python}
small_mtcars.info()
```
# select column
## get column names
```{python}
list(small_mtcars)
```
## select by name
::: panel-tabset
### pandas
```{python}
small_mtcars [["cyl", "mpg",'hp']]
```
other way:
```{python}
small_mtcars.filter(items=['cyl', 'mpg','hp'])
```
### siuba
```{python}
small_mtcars >> select(_.cyl, _.mpg, _.hp)
```
:::
### select columns by name match with 'p'
```{python}
small_mtcars.loc[:,small_mtcars.columns.str.contains("p")]
```
other way:
```{python}
small_mtcars.filter(regex='p.*', axis=1)
```
### select columns by index
#### select first and 3rd columns
```{python}
small_mtcars.iloc[[0,2]]
```
#### select first to 3rd columns
```{python}
small_mtcars[0:3]
```
# drop column
::: panel-tabset
## pandas
```{python}
small_mtcars.drop('cyl', axis=1)
```
## siuba
```{python}
small_mtcars>>select(-_.cyl)
```
:::
# Renaming column
::: panel-tabset
## pandas
```{python}
small_mtcars.rename(columns={'mpg':"new_name_mpg", 'cyl':'new_name_cyl'})
```
## siuba
```{python}
small_mtcars>> rename(new_name_mpg=_.mpg)
```
:::
# Create column
::: panel-tabset
## pandas
```{python}
small_mtcars['mpg2'] = small_mtcars['mpg']+1
small_mtcars['mpg3'] = np.where(small_mtcars['mpg']> 20, "long", "short")
small_mtcars['mpg4'] =np.where(small_mtcars["mpg"]<19, "short",
np.where(small_mtcars["mpg"]<=22, "Medium",
np.where(small_mtcars["mpg"]>22, "long","else")))
small_mtcars
```
## siuba
```{python}
small_mtcars >> mutate(mpg2 = _.mpg+1
,mpg3=if_else(_.mpg > 20, "long", "short")
)
```
:::
# Transmute,create column and only keep this column
```{python}
small_mtcars['mpg2'] = small_mtcars['mpg']+1
new_data=small_mtcars[['mpg2']]
new_data
```
## Filter rows
```{python}
mtcars[(mtcars['gear'] ==4)]
```
other way:
```{python}
mtcars.query('gear==4')
```
### Filters with AND conditions
```{python}
mtcars[(mtcars['cyl'] >4)&(mtcars['gear'] ==5) ]
```
other way:
```{python}
mtcars.query('cyl>4 and gear==5')
```
### Filters with OR conditions
```{python}
mtcars[(mtcars['cyl'] ==6) |(mtcars['gear'] ==5) ]
```
other way:
```{python}
mtcars.query('cyl==6 or gear==5')
```
### filter row with index
#### 5th rows
```{python}
# not in siuba, in pandas
mtcars.iloc[[4]]
```
#### 1 and 5tj rows
```{python}
# not in siuba, in pandas
mtcars.iloc[[0,4]]
```
#### 1 to 5th rows
```{python}
# not in siuba, in pandas
mtcars.iloc[0:4]
```
#### get ramdon 5 rows
```{python}
mtcars.sample(5, random_state=42)
```
## Append
### append by row
```{python}
# not available in siuba yet
#from siuba import bind_rows
```
```{python}
# using pandas
# get 1 to 4 rows
data1=mtcars.iloc[0:4]
# get 9 rows
data2=mtcars.iloc[10:11]
data3=pd.concat([data1, data2], ignore_index = True,axis=0)
data3
```
### append by column
```{python}
# using pandas
data1=small_mtcars[["cyl", "mpg"]]
data2=small_mtcars[['hp']]
data3=pd.concat([data1, data2], axis=1).reindex(data2.index)
data3
```
### Dropping NA values
### keep NA values
## group by
### average,min,max,sum
```{python}
mtcars.groupby("cyl")["hp"].mean()
```
```{python}
mtcars.groupby("cyl")["hp"].min()
```
```{python}
mtcars.groupby("cyl")["hp"].max()
```
```{python}
mtcars.groupby("cyl")["hp"].sum()
```
### count record and count distinct record
```{python}
mtcars.groupby("cyl")["hp"].count()
```
```{python}
mtcars.groupby("cyl")["hp"].nunique()
```
## order rows
```{python}
small_mtcars.sort_values('hp')
```
### Sort in descending order
```{python}
small_mtcars.sort_values('hp',ascending=False)
```
### Arrange by multiple variables
```{python}
small_mtcars.sort_values(by=['cyl','mpg'])
```
## join
```{python}
lhs = pd.DataFrame({'id': [1,2,3], 'val': ['lhs.1', 'lhs.2', 'lhs.3']})
rhs = pd.DataFrame({'id': [1,2,4], 'val': ['rhs.1', 'rhs.2', 'rhs.3']})
```
```{python}
lhs
```
```{python}
rhs
```
### inner_join
```{python}
result=pd.merge(lhs, rhs, on='id', how='inner')
result
```
### full join
```{python}
result=pd.merge(lhs, rhs, on='id', how='outer')
result
```
### left join
```{python}
result=pd.merge(lhs, rhs, on='id', how='left')
result
```
## Reshape tables
```{python}
costs = pd.DataFrame({
'id': [1,2],
'price_x': [.1, .2],
'price_y': [.4, .5],
'price_z': [.7, .8]
})
costs
```
### Gather data long(wide to long)
```{python}
# selecting each variable manually
long_date=pd.melt(costs,id_vars=['id'], value_vars=['price_x', 'price_y','price_z'])
long_date
#costs >> gather('measure', 'value', _.price_x, _.price_y, _.price_z)
```
### Spread data wide (long to wide)
```{python}
long_date.pivot(index="id", columns="variable", values="value")
```
## string
```{python}
df = pd.DataFrame({'text': ['abc', 'DDD','1243c','aeEe'], 'num': [3, 4,7,8]})
df
```
### upper case
### lower case
### match
### concatenation
### replace
### extract
Use str.extract() with a regular expression to pull out a matching piece of text.
For example the regular expression “^(.*) ” contains the following pieces:
- a matches the literal letter “a”
- .* has a . which matches anything, and * which modifies it to apply 0 or more times.
## date
```{python}
df_dates = pd.DataFrame({
"dates": pd.to_datetime(["2021-01-02", "2021-02-03"]),
"raw": ["2023-04-05 06:07:08", "2024-05-06 07:08:09"],
})
df_dates
```
```{python}
df_dates.info()
```
## dataframe to numpy array
```{python}
df_dates.to_numpy()
```
# Reference:
https://pandas.pydata.org/docs/user_guide
```{python}
#| eval: false
#| include: false
this is hidding code chunk
```