K-Means Clustering
K-Means is one of the most popular unsupervised learning algorithms. It partitions data into K clusters by iteratively assigning points to the nearest centroid and updating centroids until convergence.
What is Clustering?
Clustering is an unsupervised learning task - we have data but no labels. The goal is to discover natural groupings or structure in the data.
Applications of clustering:
- Customer segmentation (group customers by behavior)
- Document clustering (group similar articles)
- Image compression (reduce colors by clustering pixels)
- Anomaly detection (points far from any cluster are anomalies)
- Gene expression analysis in bioinformatics
The K-Means Algorithm
K-Means partitions n data points into K clusters. The algorithm is elegantly simple:
- Step 1: Initialize - Randomly place K centroids in the feature space.
- Step 2: Assign - Assign each data point to the nearest centroid (using Euclidean distance).
- Step 3: Update - Recompute each centroid as the mean of all points assigned to it.
- Step 4: Repeat - Repeat steps 2-3 until centroids stop moving (convergence).
The algorithm minimizes the Within-Cluster Sum of Squares (WCSS): the sum of squared distances from each point to its assigned centroid.
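The four steps above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation (for instance, it assumes no cluster ever becomes empty, which scikit-learn handles explicitly):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch: initialize, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids as k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each centroid to the mean of its assigned points
        # (assumes no cluster is empty; real implementations re-seed empties)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # WCSS: sum of squared distances from points to their centroid
    wcss = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, wcss
```

Each iteration can only decrease (or hold) the WCSS, which is why the loop is guaranteed to converge, though possibly to a local minimum.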
Choosing K: The Elbow Method
The biggest challenge with K-Means is choosing the right number of clusters K. The Elbow Method plots WCSS against K:
- As K increases, WCSS always decreases (more clusters = tighter fit).
- The "elbow" point, where the rate of decrease sharply slows, marks the optimal K.
Other methods include the Silhouette Score (measures how similar a point is to its own cluster vs other clusters) and the Gap Statistic.
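As a sketch of the Silhouette approach, the loop below scores a range of candidate K values and picks the one with the highest score (note that the silhouette is only defined for K >= 2). The exact data and range here are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 true clusters (illustrative choice)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Silhouette requires at least 2 clusters, so start K at 2
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"Best K by silhouette: {best_k}")
```

Unlike WCSS, the silhouette score does not monotonically improve with K, so its maximum can be read off directly instead of hunting for an elbow.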
K-Means++ Initialization
Standard K-Means uses random initialization, which can lead to poor results. K-Means++ improves initialization:
1. Choose the first centroid uniformly at random from the data points.
2. Choose each subsequent centroid from the data points with probability proportional to its squared distance from the nearest existing centroid.
3. Repeat until K centroids are chosen.
This spreads centroids out, leading to faster convergence and better results.
Scikit-learn uses K-Means++ by default (init='k-means++').
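The seeding procedure itself is short enough to sketch. This toy version follows the steps above (distance-weighted sampling); it omits the "greedy" candidate trials scikit-learn layers on top:

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """K-Means++ seeding sketch: spread initial centroids apart."""
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # Sample the next centroid with probability proportional to d^2,
        # so far-away points are far more likely to be picked
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```

Because points already near a centroid have near-zero squared distance, they are almost never chosen again, which is exactly what pushes the seeds apart.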
Implementing K-Means in Python
Complete implementation with elbow method and visualization:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
# Generate synthetic clustered data
X, y_true = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)
# Scale features (important for distance-based algorithms)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# ---- Elbow Method ----
wcss = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X_scaled)
    wcss.append(km.inertia_)
print("WCSS values for K=1 to 10:")
for k, w in zip(K_range, wcss):
    print(f"  K={k}: {w:.2f}")
# ---- Fit with optimal K ----
optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, init='k-means++', n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)
# ---- Evaluate ----
sil_score = silhouette_score(X_scaled, labels)
print(f"\nSilhouette Score (K={optimal_k}): {sil_score:.4f}")
print(f"Cluster sizes: {np.bincount(labels)}")
print(f"Inertia (WCSS): {kmeans.inertia_:.2f}")
# ---- Cluster centers (in original scale) ----
centers_original = scaler.inverse_transform(kmeans.cluster_centers_)
print(f"\nCluster centers (original scale):\n{centers_original.round(2)}")
Limitations and Alternatives
K-Means has important limitations to be aware of:
- Assumes spherical clusters - Struggles with elongated, irregular, or nested cluster shapes.
- Sensitive to outliers - Outliers can pull centroids away from true cluster centers.
- Requires K upfront - You must specify the number of clusters before training.
When these limitations matter, consider the following alternatives:
- DBSCAN - Density-based clustering that finds arbitrary shapes and automatically detects outliers.
- Gaussian Mixture Models (GMM) - Probabilistic clustering that handles elliptical clusters and provides soft assignments.
- Hierarchical Clustering - Builds a dendrogram; no need to specify K in advance.
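To see the spherical-cluster limitation concretely, the sketch below compares K-Means and DBSCAN on scikit-learn's two-moons dataset, a standard example of non-spherical clusters. The `eps` and `noise` values are illustrative choices, not tuned recommendations:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moons: clusters that are not spherical
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Compare against the true moons; the max over the two label orderings
# handles the arbitrary cluster numbering (DBSCAN marks noise as -1)
for name, labels in [("KMeans", km_labels), ("DBSCAN", db_labels)]:
    acc = max((labels == y_true).mean(), (labels != y_true).mean())
    print(f"{name} agreement with true moons: {acc:.2f}")
```

K-Means cuts the moons with a straight boundary because each point goes to its nearest centroid, while DBSCAN follows the density of each moon and recovers them regardless of shape.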
Key Takeaways
- K-Means iteratively assigns points to the nearest centroid and updates centroids until convergence.
- Always scale features before K-Means since it is distance-based and sensitive to feature magnitudes.
- Use the Elbow Method or Silhouette Score to determine the optimal number of clusters K.
- K-Means++ initialization produces better, more consistent results than random initialization.
- K-Means assumes spherical clusters of similar size - use DBSCAN or GMM for complex cluster shapes.