Clustering in Python

1 Introduction

This document demonstrates how to perform clustering in Python using the scikit-learn library. Clustering is an unsupervised learning technique that groups similar data points together based on their inherent characteristics. We will use the iris dataset for this demonstration.

2 Load Data

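First, we import the necessary libraries, load the iris dataset into a DataFrame, and standardize the features. Both algorithms used here rely on Euclidean distances, so standardizing keeps features with larger numeric ranges from dominating the result.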

Code
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

# Load the iris dataset
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Standardize the data
scaler = StandardScaler()
scaled_iris = scaler.fit_transform(iris_df)
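
As a quick sanity check on the standardization step, the sketch below confirms that each scaled column has a mean of roughly 0 and a standard deviation of roughly 1. The variable names `col_means` and `col_stds` are illustrative and not part of the original code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Rebuild the scaled matrix so this snippet runs on its own
X = load_iris().data
scaled = StandardScaler().fit_transform(X)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for std()
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)

print("column means:", np.round(col_means, 6))
print("column stds: ", np.round(col_stds, 6))
```

Values that are only approximately 0 and 1 are expected because of floating-point rounding.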

3 K-Means Clustering

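K-Means partitions the data into a fixed number of clusters by repeatedly assigning each point to its nearest centroid and then recomputing the centroids. We will use it to group the iris data into 3 clusters, matching the three species in the dataset.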

Code
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
iris_df['kmeans_cluster'] = kmeans.fit_predict(scaled_iris)

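# Visualize the clusters on the first two standardized features
# (sepal length vs. sepal width); the model itself was fit on all four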
plt.figure(figsize=(10, 6))
sns.scatterplot(x=scaled_iris[:, 0], y=scaled_iris[:, 1], hue=iris_df['kmeans_cluster'], palette='viridis', s=100)
plt.title('K-Means Clustering of Iris Data')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.show()
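
Because the iris dataset ships with known species labels, one way to judge the K-Means result is to cross-tabulate clusters against species and compute the adjusted Rand index. The following is a minimal sketch rather than part of the original walkthrough; the names `table` and `ari` are illustrative, and the cluster numbers themselves are arbitrary, so only the grouping matters.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

iris = load_iris()
scaled = StandardScaler().fit_transform(iris.data)

# Same settings as the K-Means model above
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(scaled)

# Rows: true species; columns: cluster assigned by K-Means
species = pd.Series(iris.target_names[iris.target], name='species')
table = pd.crosstab(species, pd.Series(labels, name='kmeans_cluster'))
print(table)

# Adjusted Rand index: 1.0 means identical groupings, values near 0 mean chance-level agreement
ari = adjusted_rand_score(iris.target, labels)
print(f"Adjusted Rand index: {ari:.3f}")
```

The species labels are used here only for evaluation; they play no role in fitting the clusters.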

4 Hierarchical Clustering

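Hierarchical clustering builds a tree of nested clusters instead of fixing the centroids up front. Here we use the agglomerative (bottom-up) variant: every point starts as its own cluster, and the two closest clusters are merged repeatedly until 3 remain. scikit-learn's AgglomerativeClustering uses Ward linkage by default, which merges the pair of clusters that gives the smallest increase in within-cluster variance.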

Code
hierarchical = AgglomerativeClustering(n_clusters=3)
iris_df['hierarchical_cluster'] = hierarchical.fit_predict(scaled_iris)

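# Visualize the clusters on the first two standardized features
# (sepal length vs. sepal width); the model itself was fit on all four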
plt.figure(figsize=(10, 6))
sns.scatterplot(x=scaled_iris[:, 0], y=scaled_iris[:, 1], hue=iris_df['hierarchical_cluster'], palette='viridis', s=100)
plt.title('Hierarchical Clustering of Iris Data')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.show()
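
To see how closely the two methods agree, their label vectors can be compared directly. The sketch below refits both models with the settings used in this document and reports the adjusted Rand index between them; the names `kmeans_labels`, `hier_labels`, and `agreement` are illustrative.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

scaled = StandardScaler().fit_transform(load_iris().data)

# Refit both models with the settings used in this document
kmeans_labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(scaled)
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(scaled)

# The adjusted Rand index ignores how clusters are numbered,
# so it compares the groupings rather than the raw label values
agreement = adjusted_rand_score(kmeans_labels, hier_labels)
print(f"Agreement between K-Means and hierarchical clustering: {agreement:.3f}")
```

A score well below 1.0 would suggest that the two algorithms draw the cluster boundaries differently, which is worth inspecting in the scatter plots.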

5 Conclusion

This document provided a brief overview of clustering in Python using scikit-learn. We standardized the iris measurements and then applied both K-Means and hierarchical (agglomerative) clustering, each configured to find 3 clusters.