Clustering: Income Data with Python

Using sklearn

1 Introduction

This document demonstrates how to perform clustering in Python using the scikit-learn library. Clustering is an unsupervised learning technique that groups similar data points together based on their inherent characteristics. We will use the adult_income_dataset.csv for this demonstration.

2 Load Data

First, we load the necessary libraries and the income dataset.

Code
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.cluster.hierarchy as shc
from sklearn.decomposition import PCA

# Load the income dataset
income_df = pd.read_csv("../data/adult_income_dataset.csv")
Next, we preprocess the data: drop the income label so the clustering stays fully unsupervised, remove rows with missing values, and take a 1,000-row sample to keep the later hierarchical clustering tractable. Categorical features are one-hot encoded and numerical features standardized.

Code
# Drop the target, remove rows with missing values, and subsample 1,000 rows
# (agglomerative clustering scales poorly with the number of samples)
income_df_clean = income_df.drop('income', axis=1).dropna().sample(n=1000, random_state=42)

# Separate numerical and categorical columns
numerical_cols = income_df_clean.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = income_df_clean.select_dtypes(include=['object']).columns

# One-hot encode categorical features
income_df_encoded = pd.get_dummies(income_df_clean, columns=categorical_cols, drop_first=True)

# Standardize numerical features
scaler = StandardScaler()
income_df_encoded[numerical_cols] = scaler.fit_transform(income_df_encoded[numerical_cols])

scaled_income = income_df_encoded.values.astype(float)  # ensure a purely numeric array (one-hot columns may be bool)
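As a quick sanity check (a minimal sketch, assuming the preprocessing above ran as-is), we can confirm the matrix shape and that the numerical columns now have mean roughly 0 and standard deviation roughly 1:

Code
# 1,000 sampled rows by (numerical + one-hot) columns
print(scaled_income.shape)

# After StandardScaler: means ~0, standard deviations ~1
print(income_df_encoded[numerical_cols].mean().round(3))
print(income_df_encoded[numerical_cols].std().round(3))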

3 Elbow Method

The Elbow Method is a heuristic for choosing the number of clusters. We plot the total within-cluster sum of squares (WCSS) against the number of clusters and look for the "elbow": the point beyond which adding more clusters yields only marginal reductions in WCSS.
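Concretely, WCSS (exposed by scikit-learn's KMeans as the inertia_ attribute) is the sum of squared Euclidean distances from each point to its assigned centroid:

$$\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

where $\mu_k$ is the centroid (mean) of cluster $C_k$.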

Code
# Fit K-Means for k = 1..10 and record the WCSS at each k
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(scaled_income)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS for this k

plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.xticks(range(1, 11))
plt.show()
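As a rough programmatic companion to eyeballing the plot, the sharpest bend in the curve corresponds to the largest second difference of the WCSS values. This is only a heuristic sketch and should be checked against the plot:

Code
import numpy as np

# Second differences of the WCSS curve; index i corresponds to k = i + 2
second_diffs = np.diff(wcss, 2)
elbow_k = int(np.argmax(second_diffs)) + 2
print(f"Heuristic elbow at k = {elbow_k}")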

4 K-Means Clustering

K-Means partitions the data into k clusters by alternating two steps: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points. We will use it to group the income data, taking the number of clusters from the Elbow Method plot above.

Code
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10) # Assuming 5 clusters from Elbow Method
income_df_encoded['kmeans_cluster'] = kmeans.fit_predict(scaled_income)

# Visualize the clusters (using first two principal components for visualization)
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled_income)
# Note: the scatter plot only shows the variance captured by the first two
# components; pca.explained_variance_ratio_ reports how much that is.
principal_df = pd.DataFrame(principal_components, columns=['principal component 1', 'principal component 2'])
principal_df['kmeans_cluster'] = income_df_encoded['kmeans_cluster'].values

plt.figure(figsize=(10, 6))
sns.scatterplot(x='principal component 1', y='principal component 2', hue='kmeans_cluster', data=principal_df, palette='viridis', s=100)
plt.title('K-Means Clustering of Income Data (PCA)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
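The elbow heuristic can be cross-checked with the silhouette score, which measures how well each point fits its own cluster relative to the nearest other cluster (range -1 to 1; higher is better). A minimal sketch using sklearn.metrics.silhouette_score:

Code
from sklearn.metrics import silhouette_score

score = silhouette_score(scaled_income, income_df_encoded['kmeans_cluster'])
print(f"Silhouette score for k=5: {score:.3f}")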

5 Hierarchical Clustering

Agglomerative hierarchical clustering starts with every point in its own cluster and repeatedly merges the closest pair of clusters; with Ward linkage, "closest" means the merge that least increases the total within-cluster variance. The merge history can be visualized as a dendrogram.

Code
plt.figure(figsize=(10, 7))
plt.title("Income Data Dendrogram")
# Ward linkage merges the pair of clusters that least increases within-cluster variance
dend = shc.dendrogram(shc.linkage(scaled_income, method='ward'))
plt.xlabel('Samples')
plt.ylabel('Ward distance')
plt.show()

# 5 clusters, consistent with the elbow analysis; linkage='ward' matches the dendrogram above
hierarchical = AgglomerativeClustering(n_clusters=5, linkage='ward')
income_df_encoded['hierarchical_cluster'] = hierarchical.fit_predict(scaled_income)

# Visualize the clusters (using first two principal components for visualization)
principal_df['hierarchical_cluster'] = income_df_encoded['hierarchical_cluster'].values

plt.figure(figsize=(10, 6))
sns.scatterplot(x='principal component 1', y='principal component 2', hue='hierarchical_cluster', data=principal_df, palette='viridis', s=100)
plt.title('Hierarchical Clustering of Income Data (PCA)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
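To see how the two methods carve up the same sample, we can cross-tabulate the two label vectors. The label numbers themselves are arbitrary, so only the pattern of overlap is meaningful:

Code
# Rows: K-Means labels; columns: hierarchical labels
print(pd.crosstab(income_df_encoded['kmeans_cluster'],
                  income_df_encoded['hierarchical_cluster'],
                  rownames=['kmeans'], colnames=['hierarchical']))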

6 Comparison of K-Means and Hierarchical Clustering

Here’s a comparison of K-Means and Hierarchical Clustering:

| Feature | K-Means Clustering | Hierarchical Clustering |
|---|---|---|
| Approach | Partitioning (divides data into k clusters) | Agglomerative (bottom-up) or divisive (top-down) |
| Number of clusters | Must be specified in advance (k) | Not required up front; the dendrogram helps choose |
| Computational cost | Faster for large datasets | Slower for large datasets (typically O(n^2) memory and up to O(n^3) time) |
| Cluster shape | Tends to form spherical clusters | Can discover arbitrarily shaped clusters, depending on linkage |
| Sensitivity to outliers | Sensitive to outliers | Less sensitive to outliers |
| Interpretability | Easy to interpret | Dendrogram can be complex for large datasets |
| Reproducibility | Can vary with initial centroids (unless the seed is fixed) | Deterministic |
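The agreement between the two partitions can also be summarized in a single number with the adjusted Rand index (1.0 means identical partitions; values near 0 mean chance-level agreement). A short sketch using sklearn.metrics.adjusted_rand_score:

Code
from sklearn.metrics import adjusted_rand_score

ari = adjusted_rand_score(income_df_encoded['kmeans_cluster'],
                          income_df_encoded['hierarchical_cluster'])
print(f"Adjusted Rand index (K-Means vs. hierarchical): {ari:.3f}")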

7 Conclusion

This document provided a brief overview of clustering in Python using scikit-learn. We demonstrated both K-Means and hierarchical clustering on the income dataset and compared the two approaches.