Data manipulation with Pandas

Author

Tony Duan

1 Set Up Python:

Code
#Sys.setenv(RETICULATE_PYTHON = "/Library/Frameworks/Python.framework/Versions/3.11/bin/python3.11")
library(reticulate)
py_require(c('pandas','pandasql','matplotlib','seaborn'))
#use_python("/Library/Frameworks/Python.framework/Versions/3.11/bin/python3.11")
Code
repl_python()
exit
Code
from platform import python_version
print(python_version())
3.11.11

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

Show package’s dependencies

Code
# Import the os module
import os
# Import pkg_resources from pip._vendor to get package information
from pip._vendor import pkg_resources
# Define the package name
_package_name = 'pandas'
# Get the package object
_package = pkg_resources.working_set.by_key[_package_name]
# Print the dependencies of the package
print([str(r) for r in _package.requires()])  # retrieve deps from setup.py
['numpy>=1.23.2; python_version == "3.11"', 'python-dateutil>=2.8.2', 'pytz>=2020.1', 'tzdata>=2022.7']

2 Install Package

Code
!pip3.11 install seaborn pandasql nbformat nbclient

3 Load Package

Code
# Import the pandas library, commonly aliased as pd
import pandas as pd
# Import sqldf from pandasql for SQL-like queries on pandas DataFrames
from pandasql import sqldf
# Import numpy for numerical operations, commonly aliased as np
import numpy as np
# Import matplotlib.pylab for plotting, commonly aliased as plt
import matplotlib.pylab as plt
# Import seaborn for statistical data visualization
import seaborn as sns
Code
mtcars=mtcars
Code
# Assign the R dataset mtcars to a pandas DataFrame named mtcars
mtcars=r.mtcars
Code
# Display the first 5 rows of the mtcars DataFrame
mtcars.head(5)
                    mpg  cyl   disp     hp  drat  ...   qsec   vs   am  gear  carb
Mazda RX4          21.0  6.0  160.0  110.0  3.90  ...  16.46  0.0  1.0   4.0   4.0
Mazda RX4 Wag      21.0  6.0  160.0  110.0  3.90  ...  17.02  0.0  1.0   4.0   4.0
Datsun 710         22.8  4.0  108.0   93.0  3.85  ...  18.61  1.0  1.0   4.0   1.0
Hornet 4 Drive     21.4  6.0  258.0  110.0  3.08  ...  19.44  1.0  0.0   3.0   1.0
Hornet Sportabout  18.7  8.0  360.0  175.0  3.15  ...  17.02  0.0  0.0   3.0   2.0

[5 rows x 11 columns]

row name to column

Code
# Set the name of the index to 'newhead'
mtcars.index.name = 'newhead'
# Reset the index of the DataFrame, making the old index a new column
mtcars.reset_index(inplace=True)
Code
# Display the first 5 rows of the mtcars DataFrame after index manipulation
mtcars.head(5)
             newhead   mpg  cyl   disp     hp  ...   qsec   vs   am  gear  carb
0          Mazda RX4  21.0  6.0  160.0  110.0  ...  16.46  0.0  1.0   4.0   4.0
1      Mazda RX4 Wag  21.0  6.0  160.0  110.0  ...  17.02  0.0  1.0   4.0   4.0
2         Datsun 710  22.8  4.0  108.0   93.0  ...  18.61  1.0  1.0   4.0   1.0
3     Hornet 4 Drive  21.4  6.0  258.0  110.0  ...  19.44  1.0  0.0   3.0   1.0
4  Hornet Sportabout  18.7  8.0  360.0  175.0  ...  17.02  0.0  0.0   3.0   2.0

[5 rows x 12 columns]
Code
# Create a new DataFrame small_mtcars with selected columns and the first 5 rows
small_mtcars = mtcars[["cyl", "mpg",'hp']]
small_mtcars=small_mtcars.head(5)
small_mtcars
   cyl   mpg     hp
0  6.0  21.0  110.0
1  6.0  21.0  110.0
2  4.0  22.8   93.0
3  6.0  21.4  110.0
4  8.0  18.7  175.0

3.1 Get Info on the Data

Code
# Print a concise summary of the small_mtcars DataFrame, including data types and non-null values
small_mtcars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   cyl     5 non-null      float64
 1   mpg     5 non-null      float64
 2   hp      5 non-null      float64
dtypes: float64(3)
memory usage: 252.0 bytes

4 Select Columns

4.1 Get Column Names

Code
# Get a list of column names from the small_mtcars DataFrame
list(small_mtcars)
['cyl', 'mpg', 'hp']

4.2 Select by Name

Code
# Select specific columns ('cyl', 'mpg', 'hp') from the small_mtcars DataFrame
small_mtcars [["cyl", "mpg",'hp']]
   cyl   mpg     hp
0  6.0  21.0  110.0
1  6.0  21.0  110.0
2  4.0  22.8   93.0
3  6.0  21.4  110.0
4  8.0  18.7  175.0

other way:

Code
# Select columns by providing a list of items to keep
small_mtcars.filter(items=['cyl', 'mpg','hp'])
   cyl   mpg     hp
0  6.0  21.0  110.0
1  6.0  21.0  110.0
2  4.0  22.8   93.0
3  6.0  21.4  110.0
4  8.0  18.7  175.0

4.2.1 Select columns by name matching with ‘p’

Code
# Select columns where the column name contains the letter 'p'
small_mtcars.loc[:,small_mtcars.columns.str.contains("p")]
    mpg     hp
0  21.0  110.0
1  21.0  110.0
2  22.8   93.0
3  21.4  110.0
4  18.7  175.0

other way:

Code
# Select columns whose names match the regular expression 'p.*' (starts with 'p')
small_mtcars.filter(regex='p.*', axis=1)
    mpg     hp
0  21.0  110.0
1  21.0  110.0
2  22.8   93.0
3  21.4  110.0
4  18.7  175.0

4.2.2 Select columns by index

4.2.2.1 Select first and 3rd columns

Code
# Select rows at index 0 and 2 (first and third rows)
small_mtcars.iloc[[0,2]]
   cyl   mpg     hp
0  6.0  21.0  110.0
2  4.0  22.8   93.0

4.2.2.2 Select first to 3rd columns

Code
# Select rows from index 0 up to (but not including) index 3
small_mtcars[0:3]
   cyl   mpg     hp
0  6.0  21.0  110.0
1  6.0  21.0  110.0
2  4.0  22.8   93.0

5 Drop Column

Code
# Drop the column named 'cyl' from the DataFrame (axis=1 indicates column)
small_mtcars.drop('cyl', axis=1)
    mpg     hp
0  21.0  110.0
1  21.0  110.0
2  22.8   93.0
3  21.4  110.0
4  18.7  175.0

6 Renaming Columns

Code
# Rename columns 'mpg' to 'new_name_mpg' and 'cyl' to 'new_name_cyl'
small_mtcars.rename(columns={'mpg':"new_name_mpg", 'cyl':'new_name_cyl'})
   new_name_cyl  new_name_mpg     hp
0           6.0          21.0  110.0
1           6.0          21.0  110.0
2           4.0          22.8   93.0
3           6.0          21.4  110.0
4           8.0          18.7  175.0

7 Create Column

Code
# Create a new column 'mpg2' by adding 1 to the 'mpg' column
small_mtcars['mpg2'] = small_mtcars['mpg']+1

# Create a new column 'mpg3'. If 'mpg' is greater than 20, assign 'long', otherwise 'short'
small_mtcars['mpg3']  = np.where(small_mtcars['mpg']> 20, "long", "short")

# Create a new column 'mpg4' with nested np.where conditions for categorical assignment
small_mtcars['mpg4'] =np.where(small_mtcars["mpg"]<19, "short",
                   np.where(small_mtcars["mpg"]<=22, "Medium",
                   np.where(small_mtcars["mpg"]>22, "long","else")))

# Display the DataFrame with the new columns
small_mtcars
   cyl   mpg     hp  mpg2   mpg3    mpg4
0  6.0  21.0  110.0  22.0   long  Medium
1  6.0  21.0  110.0  22.0   long  Medium
2  4.0  22.8   93.0  23.8   long    long
3  6.0  21.4  110.0  22.4   long  Medium
4  8.0  18.7  175.0  19.7  short   short

8 Transmute: Create Column and Only Keep This Column

Code
# Create a new column 'mpg2' by adding 1 to the 'mpg' column
small_mtcars['mpg2'] = small_mtcars['mpg']+1

# Create a new DataFrame containing only the 'mpg2' column
new_data=small_mtcars[['mpg2']]

# Display the new DataFrame
new_data
   mpg2
0  22.0
1  22.0
2  23.8
3  22.4
4  19.7

8.1 Filter Rows

Code
# Filter rows where the 'gear' column is equal to 4
mtcars[(mtcars['gear'] ==4)]
           newhead   mpg  cyl   disp     hp  ...   qsec   vs   am  gear  carb
0        Mazda RX4  21.0  6.0  160.0  110.0  ...  16.46  0.0  1.0   4.0   4.0
1    Mazda RX4 Wag  21.0  6.0  160.0  110.0  ...  17.02  0.0  1.0   4.0   4.0
2       Datsun 710  22.8  4.0  108.0   93.0  ...  18.61  1.0  1.0   4.0   1.0
7        Merc 240D  24.4  4.0  146.7   62.0  ...  20.00  1.0  0.0   4.0   2.0
8         Merc 230  22.8  4.0  140.8   95.0  ...  22.90  1.0  0.0   4.0   2.0
9         Merc 280  19.2  6.0  167.6  123.0  ...  18.30  1.0  0.0   4.0   4.0
10       Merc 280C  17.8  6.0  167.6  123.0  ...  18.90  1.0  0.0   4.0   4.0
17        Fiat 128  32.4  4.0   78.7   66.0  ...  19.47  1.0  1.0   4.0   1.0
18     Honda Civic  30.4  4.0   75.7   52.0  ...  18.52  1.0  1.0   4.0   2.0
19  Toyota Corolla  33.9  4.0   71.1   65.0  ...  19.90  1.0  1.0   4.0   1.0
25       Fiat X1-9  27.3  4.0   79.0   66.0  ...  18.90  1.0  1.0   4.0   1.0
31      Volvo 142E  21.4  4.0  121.0  109.0  ...  18.60  1.0  1.0   4.0   2.0

[12 rows x 12 columns]

other way: ::: {.cell}

Code
# Filter rows where the 'gear' column is equal to 4 using the query method
mtcars.query('gear==4')
           newhead   mpg  cyl   disp     hp  ...   qsec   vs   am  gear  carb
0        Mazda RX4  21.0  6.0  160.0  110.0  ...  16.46  0.0  1.0   4.0   4.0
1    Mazda RX4 Wag  21.0  6.0  160.0  110.0  ...  17.02  0.0  1.0   4.0   4.0
2       Datsun 710  22.8  4.0  108.0   93.0  ...  18.61  1.0  1.0   4.0   1.0
7        Merc 240D  24.4  4.0  146.7   62.0  ...  20.00  1.0  0.0   4.0   2.0
8         Merc 230  22.8  4.0  140.8   95.0  ...  22.90  1.0  0.0   4.0   2.0
9         Merc 280  19.2  6.0  167.6  123.0  ...  18.30  1.0  0.0   4.0   4.0
10       Merc 280C  17.8  6.0  167.6  123.0  ...  18.90  1.0  0.0   4.0   4.0
17        Fiat 128  32.4  4.0   78.7   66.0  ...  19.47  1.0  1.0   4.0   1.0
18     Honda Civic  30.4  4.0   75.7   52.0  ...  18.52  1.0  1.0   4.0   2.0
19  Toyota Corolla  33.9  4.0   71.1   65.0  ...  19.90  1.0  1.0   4.0   1.0
25       Fiat X1-9  27.3  4.0   79.0   66.0  ...  18.90  1.0  1.0   4.0   1.0
31      Volvo 142E  21.4  4.0  121.0  109.0  ...  18.60  1.0  1.0   4.0   2.0

[12 rows x 12 columns]

:::

8.1.1 Filters with AND conditions

Code
# Filter rows where 'cyl' is greater than 4 AND 'gear' is equal to 5
mtcars[(mtcars['cyl'] >4)&(mtcars['gear'] ==5) ]
           newhead   mpg  cyl   disp     hp  ...  qsec   vs   am  gear  carb
28  Ford Pantera L  15.8  8.0  351.0  264.0  ...  14.5  0.0  1.0   5.0   4.0
29    Ferrari Dino  19.7  6.0  145.0  175.0  ...  15.5  0.0  1.0   5.0   6.0
30   Maserati Bora  15.0  8.0  301.0  335.0  ...  14.6  0.0  1.0   5.0   8.0

[3 rows x 12 columns]

other way: ::: {.cell}

Code
# Filter rows where 'cyl' is greater than 4 AND 'gear' is equal to 5 using the query method
mtcars.query('cyl>4 and gear==5')
           newhead   mpg  cyl   disp     hp  ...  qsec   vs   am  gear  carb
28  Ford Pantera L  15.8  8.0  351.0  264.0  ...  14.5  0.0  1.0   5.0   4.0
29    Ferrari Dino  19.7  6.0  145.0  175.0  ...  15.5  0.0  1.0   5.0   6.0
30   Maserati Bora  15.0  8.0  301.0  335.0  ...  14.6  0.0  1.0   5.0   8.0

[3 rows x 12 columns]

:::

8.1.2 Filters with OR conditions

Code
# Filter rows where 'cyl' is equal to 6 OR 'gear' is equal to 5
mtcars[(mtcars['cyl'] ==6) |(mtcars['gear'] ==5) ]
           newhead   mpg  cyl   disp     hp  ...   qsec   vs   am  gear  carb
0        Mazda RX4  21.0  6.0  160.0  110.0  ...  16.46  0.0  1.0   4.0   4.0
1    Mazda RX4 Wag  21.0  6.0  160.0  110.0  ...  17.02  0.0  1.0   4.0   4.0
3   Hornet 4 Drive  21.4  6.0  258.0  110.0  ...  19.44  1.0  0.0   3.0   1.0
5          Valiant  18.1  6.0  225.0  105.0  ...  20.22  1.0  0.0   3.0   1.0
9         Merc 280  19.2  6.0  167.6  123.0  ...  18.30  1.0  0.0   4.0   4.0
10       Merc 280C  17.8  6.0  167.6  123.0  ...  18.90  1.0  0.0   4.0   4.0
26   Porsche 914-2  26.0  4.0  120.3   91.0  ...  16.70  0.0  1.0   5.0   2.0
27    Lotus Europa  30.4  4.0   95.1  113.0  ...  16.90  1.0  1.0   5.0   2.0
28  Ford Pantera L  15.8  8.0  351.0  264.0  ...  14.50  0.0  1.0   5.0   4.0
29    Ferrari Dino  19.7  6.0  145.0  175.0  ...  15.50  0.0  1.0   5.0   6.0
30   Maserati Bora  15.0  8.0  301.0  335.0  ...  14.60  0.0  1.0   5.0   8.0

[11 rows x 12 columns]

other way: ::: {.cell}

Code
# Filter rows where 'cyl' is equal to 6 OR 'gear' is equal to 5 using the query method
mtcars.query('cyl==6 or gear==5')
           newhead   mpg  cyl   disp     hp  ...   qsec   vs   am  gear  carb
0        Mazda RX4  21.0  6.0  160.0  110.0  ...  16.46  0.0  1.0   4.0   4.0
1    Mazda RX4 Wag  21.0  6.0  160.0  110.0  ...  17.02  0.0  1.0   4.0   4.0
3   Hornet 4 Drive  21.4  6.0  258.0  110.0  ...  19.44  1.0  0.0   3.0   1.0
5          Valiant  18.1  6.0  225.0  105.0  ...  20.22  1.0  0.0   3.0   1.0
9         Merc 280  19.2  6.0  167.6  123.0  ...  18.30  1.0  0.0   4.0   4.0
10       Merc 280C  17.8  6.0  167.6  123.0  ...  18.90  1.0  0.0   4.0   4.0
26   Porsche 914-2  26.0  4.0  120.3   91.0  ...  16.70  0.0  1.0   5.0   2.0
27    Lotus Europa  30.4  4.0   95.1  113.0  ...  16.90  1.0  1.0   5.0   2.0
28  Ford Pantera L  15.8  8.0  351.0  264.0  ...  14.50  0.0  1.0   5.0   4.0
29    Ferrari Dino  19.7  6.0  145.0  175.0  ...  15.50  0.0  1.0   5.0   6.0
30   Maserati Bora  15.0  8.0  301.0  335.0  ...  14.60  0.0  1.0   5.0   8.0

[11 rows x 12 columns]

:::

8.1.3 Filter rows with index

8.1.3.1 5th rows

Code
# Select the row at index 4 (which is the 5th row)
mtcars.iloc[[4]]
             newhead   mpg  cyl   disp     hp  ...   qsec   vs   am  gear  carb
4  Hornet Sportabout  18.7  8.0  360.0  175.0  ...  17.02  0.0  0.0   3.0   2.0

[1 rows x 12 columns]

8.1.3.2 1st and 5th rows

Code
# Select rows at index 0 (1st row) and 4 (5th row)
mtcars.iloc[[0,4]]
             newhead   mpg  cyl   disp     hp  ...   qsec   vs   am  gear  carb
0          Mazda RX4  21.0  6.0  160.0  110.0  ...  16.46  0.0  1.0   4.0   4.0
4  Hornet Sportabout  18.7  8.0  360.0  175.0  ...  17.02  0.0  0.0   3.0   2.0

[2 rows x 12 columns]

8.1.3.3 1st to 5th rows

Code
# Select rows from index 0 up to (but not including) index 4
mtcars.iloc[0:4]
          newhead   mpg  cyl   disp     hp  ...   qsec   vs   am  gear  carb
0       Mazda RX4  21.0  6.0  160.0  110.0  ...  16.46  0.0  1.0   4.0   4.0
1   Mazda RX4 Wag  21.0  6.0  160.0  110.0  ...  17.02  0.0  1.0   4.0   4.0
2      Datsun 710  22.8  4.0  108.0   93.0  ...  18.61  1.0  1.0   4.0   1.0
3  Hornet 4 Drive  21.4  6.0  258.0  110.0  ...  19.44  1.0  0.0   3.0   1.0

[4 rows x 12 columns]

8.1.3.4 Get random 5 rows

Code
# Select 5 random rows from the mtcars DataFrame, with a fixed random_state for reproducibility
mtcars.sample(5, random_state=42)
                newhead   mpg  cyl   disp     hp  ...   qsec   vs   am  gear  carb
29         Ferrari Dino  19.7  6.0  145.0  175.0  ...  15.50  0.0  1.0   5.0   6.0
15  Lincoln Continental  10.4  8.0  460.0  215.0  ...  17.82  0.0  0.0   3.0   4.0
24     Pontiac Firebird  19.2  8.0  400.0  175.0  ...  17.05  0.0  0.0   3.0   2.0
17             Fiat 128  32.4  4.0   78.7   66.0  ...  19.47  1.0  1.0   4.0   1.0
8              Merc 230  22.8  4.0  140.8   95.0  ...  22.90  1.0  0.0   4.0   2.0

[5 rows x 12 columns]

8.2 Append

8.2.1 Append by row

Code
# using pandas

# get 1 to 4 rows
data1=mtcars.iloc[0:4]

# get 9 rows
data2=mtcars.iloc[10:11]

# Concatenate data1 and data2 DataFrames by row, ignoring the original index
data3=pd.concat([data1, data2], ignore_index = True,axis=0)

# Display the concatenated DataFrame
data3
          newhead   mpg  cyl   disp     hp  ...   qsec   vs   am  gear  carb
0       Mazda RX4  21.0  6.0  160.0  110.0  ...  16.46  0.0  1.0   4.0   4.0
1   Mazda RX4 Wag  21.0  6.0  160.0  110.0  ...  17.02  0.0  1.0   4.0   4.0
2      Datsun 710  22.8  4.0  108.0   93.0  ...  18.61  1.0  1.0   4.0   1.0
3  Hornet 4 Drive  21.4  6.0  258.0  110.0  ...  19.44  1.0  0.0   3.0   1.0
4       Merc 280C  17.8  6.0  167.6  123.0  ...  18.90  1.0  0.0   4.0   4.0

[5 rows x 12 columns]

8.2.2 Append by column

Code
# using pandas
# Select 'cyl' and 'mpg' columns from small_mtcars
data1=small_mtcars[["cyl", "mpg"]]

# Select 'hp' column from small_mtcars
data2=small_mtcars[['hp']]

# Concatenate data1 and data2 DataFrames by column, and reindex to match data2's index
data3=pd.concat([data1, data2], axis=1).reindex(data2.index)

# Display the concatenated DataFrame
data3
   cyl   mpg     hp
0  6.0  21.0  110.0
1  6.0  21.0  110.0
2  4.0  22.8   93.0
3  6.0  21.4  110.0
4  8.0  18.7  175.0

8.2.3 Dropping NA values

Missing values (NaN) can be handled by either removing rows/columns with missing data or by filling them with appropriate values.

8.2.3.1 Drop rows with any NA values

# Create a sample DataFrame with missing values
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})
print("Original DataFrame:")
print(df_missing)

# Drop rows that contain any NaN values
df_dropped_all = df_missing.dropna()
print("\nDataFrame after dropping rows with any NA:")
print(df_dropped_all)

8.2.3.2 Drop rows with NAs in specific columns

# Drop rows that have NaN values in column 'A'
df_dropped_col_a = df_missing.dropna(subset=['A'])
print("\nDataFrame after dropping rows with NA in column 'A':")
print(df_dropped_col_a)

8.2.4 Filling NA values

Missing values can be filled with a specific value, the mean, median, or previous/next valid observation.

8.2.4.1 Fill with a specific value

# Fill all NaN values with 0
df_filled_zero = df_missing.fillna(0)
print("\nDataFrame after filling NA with 0:")
print(df_filled_zero)

8.2.4.2 Fill with mean of the column

# Fill NaN values in column 'B' with the mean of column 'B'
df_filled_mean = df_missing.copy()
df_filled_mean['B'] = df_filled_mean['B'].fillna(df_filled_mean['B'].mean())
print("\nDataFrame after filling NA in column 'B' with its mean:")
print(df_filled_mean)

8.2.4.3 Forward fill (ffill)

# Forward fill NaN values (propagate last valid observation forward to next valid observation)
df_ffill = df_missing.fillna(method='ffill')
print("\nDataFrame after forward fill:")
print(df_ffill)

8.2.4.4 Backward fill (bfill)

# Backward fill NaN values (propagate next valid observation backward to next valid observation)
df_bfill = df_missing.fillna(method='bfill')
print("\nDataFrame after backward fill:")
print(df_bfill)

8.3 Group By

8.3.1 Average, Min, Max, Sum

Code
# Group the mtcars DataFrame by the 'cyl' column and calculate the mean of the 'hp' column for each group
mtcars.groupby("cyl")["hp"].mean()
cyl
4.0     82.636364
6.0    122.285714
8.0    209.214286
Name: hp, dtype: float64
Code
# Group the mtcars DataFrame by the 'cyl' column and find the minimum value of the 'hp' column for each group
mtcars.groupby("cyl")["hp"].min()
cyl
4.0     52.0
6.0    105.0
8.0    150.0
Name: hp, dtype: float64
Code
# Group the mtcars DataFrame by the 'cyl' column and find the maximum value of the 'hp' column for each group
mtcars.groupby("cyl")["hp"].max()
cyl
4.0    113.0
6.0    175.0
8.0    335.0
Name: hp, dtype: float64
Code
# Group the mtcars DataFrame by the 'cyl' column and calculate the sum of the 'hp' column for each group
mtcars.groupby("cyl")["hp"].sum()
cyl
4.0     909.0
6.0     856.0
8.0    2929.0
Name: hp, dtype: float64

8.3.2 Count Record and Count Distinct Record

Code
# Group the mtcars DataFrame by the 'cyl' column and count the number of non-null 'hp' values for each group
mtcars.groupby("cyl")["hp"].count()
cyl
4.0    11
6.0     7
8.0    14
Name: hp, dtype: int64
Code
# Group the mtcars DataFrame by the 'cyl' column and count the number of unique 'hp' values for each group
mtcars.groupby("cyl")["hp"].nunique()
cyl
4.0    10
6.0     4
8.0     9
Name: hp, dtype: int64

8.4 Order Rows

Code
# Sort the small_mtcars DataFrame by the 'hp' column in ascending order
small_mtcars.sort_values('hp')
   cyl   mpg     hp  mpg2   mpg3    mpg4
2  4.0  22.8   93.0  23.8   long    long
0  6.0  21.0  110.0  22.0   long  Medium
1  6.0  21.0  110.0  22.0   long  Medium
3  6.0  21.4  110.0  22.4   long  Medium
4  8.0  18.7  175.0  19.7  short   short

8.4.1 Sort in descending order

Code
# Sort the small_mtcars DataFrame by the 'hp' column in descending order
small_mtcars.sort_values('hp',ascending=False)
   cyl   mpg     hp  mpg2   mpg3    mpg4
4  8.0  18.7  175.0  19.7  short   short
0  6.0  21.0  110.0  22.0   long  Medium
1  6.0  21.0  110.0  22.0   long  Medium
3  6.0  21.4  110.0  22.4   long  Medium
2  4.0  22.8   93.0  23.8   long    long

8.4.2 Arrange by multiple variables

Code
# Sort the small_mtcars DataFrame by 'cyl' then by 'mpg' in ascending order
small_mtcars.sort_values(by=['cyl','mpg'])
   cyl   mpg     hp  mpg2   mpg3    mpg4
2  4.0  22.8   93.0  23.8   long    long
0  6.0  21.0  110.0  22.0   long  Medium
1  6.0  21.0  110.0  22.0   long  Medium
3  6.0  21.4  110.0  22.4   long  Medium
4  8.0  18.7  175.0  19.7  short   short

8.5 Join

Code

# Create a DataFrame named lhs
lhs = pd.DataFrame({'id': [1,2,3], 'val': ['lhs.1', 'lhs.2', 'lhs.3']})
# Create a DataFrame named rhs
rhs = pd.DataFrame({'id': [1,2,4], 'val': ['rhs.1', 'rhs.2', 'rhs.3']})
Code
# Display the lhs DataFrame
lhs
   id    val
0   1  lhs.1
1   2  lhs.2
2   3  lhs.3
Code
# Display the rhs DataFrame
rhs
   id    val
0   1  rhs.1
1   2  rhs.2
2   4  rhs.3

8.5.1 Inner Join

Code
# Perform an inner merge of lhs and rhs DataFrames on the 'id' column
result=pd.merge(lhs, rhs, on='id', how='inner')
# Display the result
result
   id  val_x  val_y
0   1  lhs.1  rhs.1
1   2  lhs.2  rhs.2

8.5.2 Full Join

Code
# Perform a full outer merge of lhs and rhs DataFrames on the 'id' column
result=pd.merge(lhs, rhs, on='id', how='outer')
# Display the result
result
   id  val_x  val_y
0   1  lhs.1  rhs.1
1   2  lhs.2  rhs.2
2   3  lhs.3    NaN
3   4    NaN  rhs.3

8.5.3 Left Join

Code
# Perform a left merge of lhs and rhs DataFrames on the 'id' column
result=pd.merge(lhs, rhs, on='id', how='left')
# Display the result
result
   id  val_x  val_y
0   1  lhs.1  rhs.1
1   2  lhs.2  rhs.2
2   3  lhs.3    NaN

8.6 Reshape Tables

Code
# Create a DataFrame named costs
costs = pd.DataFrame({
    'id': [1,2],
    'price_x': [.1, .2],
    'price_y': [.4, .5],
    'price_z': [.7, .8]
})

# Display the DataFrame
costs
   id  price_x  price_y  price_z
0   1      0.1      0.4      0.7
1   2      0.2      0.5      0.8

8.6.1 Gather data long(wide to long)

Code
# selecting each variable manually
# Melt the costs DataFrame from wide to long format
long_date=pd.melt(costs,id_vars=['id'], value_vars=['price_x', 'price_y','price_z'])

# Display the long format DataFrame
long_date
   id variable  value
0   1  price_x    0.1
1   2  price_x    0.2
2   1  price_y    0.4
3   2  price_y    0.5
4   1  price_z    0.7
5   2  price_z    0.8
Code
#costs >> gather('measure', 'value', _.price_x, _.price_y, _.price_z)

8.6.2 Spread data wide (long to wide)

Code
# Pivot the long_date DataFrame from long to wide format
long_date.pivot(index="id", columns="variable", values="value")
variable  price_x  price_y  price_z
id                                 
1             0.1      0.4      0.7
2             0.2      0.5      0.8

8.7 string

Code
# Create a DataFrame named df with 'text' and 'num' columns
df = pd.DataFrame({'text': ['abc', 'DDD','1243c','aeEe'], 'num': [3, 4,7,8]})

# Display the DataFrame
df
    text  num
0    abc    3
1    DDD    4
2  1243c    7
3   aeEe    8

8.7.1 upper case

# Convert the 'text' column to uppercase
df['text'].str.upper()

8.7.2 lower case

# Convert the 'text' column to lowercase
df['text'].str.lower()

8.7.3 match

# Check if the 'text' column contains 'a'
df['text'].str.contains('a')

8.7.4 concatenation

# Concatenate the 'text' column with itself, separated by a space
df['text'] + " " + df['text']

8.7.5 replace

# Replace 'a' with 'X' in the 'text' column
df['text'].str.replace('a', 'X')

8.7.6 regular expression

https://regex101.com/

Code
# Import the re module for regular expressions
import re

# Example string
text = "The rain in Spain falls mainly in the plain. My email is example@email.com"
print(text)
The rain in Spain falls mainly in the plain. My email is example@email.com
Code
print("")
Code
print("Regular expression to find all words ending with 'ain'")
Regular expression to find all words ending with 'ain'
Code
# Define a regex pattern to find words ending with 'ain'
pattern1 = r'\b\w*ain\b'
# Find all matches of pattern1 in the text
matches1 = re.findall(pattern1, text)
print("Words ending with 'ain':", matches1)
Words ending with 'ain': ['rain', 'Spain', 'plain']
Code
print("")
Code
print("Regular expression to extract an email address")
Regular expression to extract an email address
Code
# Define a regex pattern to extract email addresses
pattern2 = r"[\w\.-]+@[\w\.-]+"
# Find all matches of pattern2 in the text
matches2 = re.findall(pattern2, text)
print("Email addresses found:", matches2)
Email addresses found: ['example@email.com']
Code
print("")
Code
print("Replace all 'ain' endings with 'ANE'")
Replace all 'ain' endings with 'ANE'
Code
# Replace all occurrences of 'ain' at the end of a word with 'ANE'
replaced_text = re.sub(r'ain\b', 'ANE', text)
print("Text after replacement:", replaced_text)
Text after replacement: The rANE in SpANE falls mainly in the plANE. My email is example@email.com
Code
print("")

8.7.7 extract

Use str.extract() with a regular expression to pull out a matching piece of text.

For example the regular expression “^(.*) ” contains the following pieces:

  • a matches the literal letter “a”

  • .* has a . which matches anything, and * which modifies it to apply 0 or more times.

8.8 date

Pandas provides powerful tools for working with dates and times, including converting to datetime objects, extracting components, and formatting.

Code
# Create a DataFrame with 'dates' and 'raw' columns
df_dates = pd.DataFrame({
    "dates": pd.to_datetime(["2021-01-02", "2021-02-03"]),
    "raw": ["2023-04-05 06:07:08", "2024-05-06 07:08:09"],
})
# Display the DataFrame
df_dates
       dates                  raw
0 2021-01-02  2023-04-05 06:07:08
1 2021-02-03  2024-05-06 07:08:09

8.8.1 Extracting Date Components

You can extract various components like year, month, day, hour, minute, second from datetime objects.

# Extract year, month, and day from the 'dates' column
df_dates['dates'].dt.year
df_dates['dates'].dt.month
df_dates['dates'].dt.day

8.8.2 Formatting Dates

Dates can be formatted into different string representations using strftime().

# Format the 'dates' column as YYYY-MM-DD
df_dates['dates'].dt.strftime('%Y-%m-%d')
Code
# Print a concise summary of the df_dates DataFrame
df_dates.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   dates   2 non-null      datetime64[ns]
 1   raw     2 non-null      object        
dtypes: datetime64[ns](1), object(1)
memory usage: 164.0+ bytes

8.9 dataframe to numpy array

Code
# Convert the df_dates DataFrame to a NumPy array
df_dates.to_numpy()
array([[Timestamp('2021-01-02 00:00:00'), '2023-04-05 06:07:08'],
       [Timestamp('2021-02-03 00:00:00'), '2024-05-06 07:08:09']],
      dtype=object)

9 Reference:

https://pandas.pydata.org/docs/user_guide

Back to top