K-Means Clustering
K-Means is one of the most popular unsupervised learning algorithms. It partitions data into K clusters by iteratively assigning points to the nearest centroid and updating centroids until convergence.
What is Clustering?
Clustering is an unsupervised learning task - we have data but no labels. The goal is to discover natural groupings or structure in the data.
Applications of clustering:
- Customer segmentation (group customers by behavior)
- Document clustering (group similar articles)
- Image compression (reduce colors by clustering pixels)
- Anomaly detection (points far from any cluster are anomalies)
- Gene expression analysis in bioinformatics
The K-Means Algorithm
K-Means partitions n data points into K clusters. The algorithm is elegantly simple:
- Step 1: Initialize - Randomly place K centroids in the feature space.
- Step 2: Assign - Assign each data point to the nearest centroid (using Euclidean distance).
- Step 3: Update - Recompute each centroid as the mean of all points assigned to it.
- Step 4: Repeat - Repeat steps 2-3 until centroids stop moving (convergence).
The algorithm minimizes the Within-Cluster Sum of Squares (WCSS): the sum of squared distances from each point to its assigned centroid.
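The four steps above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation (for instance, it assumes no cluster ever becomes empty, which scikit-learn handles explicitly):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch: initialize, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids as k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each centroid to the mean of its assigned points
        # (assumes no cluster is empty; real implementations re-seed empties)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # WCSS: sum of squared distances from points to their centroid
    wcss = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, wcss
```

Each iteration can only decrease (or hold) the WCSS, which is why the loop is guaranteed to converge, though possibly to a local minimum.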
Choosing K: The Elbow Method
The biggest challenge with K-Means is choosing the right number of clusters K. The Elbow Method plots WCSS against K:
- As K increases, WCSS always decreases (more clusters = tighter fit).
- The "elbow" point, where the rate of decrease sharply slows, marks the optimal K.
Other methods include the Silhouette Score (measures how similar a point is to its own cluster vs other clusters) and the Gap Statistic.
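As a sketch of the Silhouette approach, the loop below scores a range of candidate K values and picks the one with the highest score (note that the silhouette is only defined for K >= 2). The exact data and range here are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 true clusters (illustrative choice)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Silhouette requires at least 2 clusters, so start K at 2
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"Best K by silhouette: {best_k}")
```

Unlike WCSS, the silhouette score does not monotonically improve with K, so its maximum can be read off directly instead of hunting for an elbow.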
K-Means++ Initialization
Standard K-Means uses random initialization, which can lead to poor results. K-Means++ improves initialization:
1. Choose the first centroid uniformly at random from the data points.
2. Choose each subsequent centroid from the data points with probability proportional to its squared distance from the nearest existing centroid.
3. Repeat until K centroids are chosen.
This spreads centroids out, leading to faster convergence and better results.
Scikit-learn uses K-Means++ by default (init='k-means++').
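The seeding procedure itself is short enough to sketch. This toy version follows the steps above (distance-weighted sampling); it omits the "greedy" candidate trials scikit-learn layers on top:

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """K-Means++ seeding sketch: spread initial centroids apart."""
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # Sample the next centroid with probability proportional to d^2,
        # so far-away points are far more likely to be picked
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```

Because points already near a centroid have near-zero squared distance, they are almost never chosen again, which is exactly what pushes the seeds apart.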
Implementing K-Means in Python
Complete implementation with elbow method and visualization:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
# Generate synthetic clustered data
X, y_true = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)
# Scale features (important for distance-based algorithms)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# ---- Elbow Method ----
wcss = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X_scaled)
    wcss.append(km.inertia_)
print("WCSS values for K=1 to 10:")
for k, w in zip(K_range, wcss):
    print(f"  K={k}: {w:.2f}")
# ---- Fit with optimal K ----
optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, init='k-means++', n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)
# ---- Evaluate ----
sil_score = silhouette_score(X_scaled, labels)
print(f"\nSilhouette Score (K={optimal_k}): {sil_score:.4f}")
print(f"Cluster sizes: {np.bincount(labels)}")
print(f"Inertia (WCSS): {kmeans.inertia_:.2f}")
# ---- Cluster centers (in original scale) ----
centers_original = scaler.inverse_transform(kmeans.cluster_centers_)
print(f"\nCluster centers (original scale):\n{centers_original.round(2)}")
Limitations and Alternatives
K-Means has important limitations to be aware of:
- Assumes spherical clusters - Struggles with elongated, irregular, or nested cluster shapes.
- Sensitive to outliers - Outliers can pull centroids away from true cluster centers.
- Requires K upfront - You must specify the number of clusters before training.
When these limitations matter, consider the following alternatives:
- DBSCAN - Density-based clustering that finds arbitrary shapes and automatically detects outliers.
- Gaussian Mixture Models (GMM) - Probabilistic clustering that handles elliptical clusters and provides soft assignments.
- Hierarchical Clustering - Builds a dendrogram; no need to specify K in advance.
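To see the spherical-cluster limitation concretely, the sketch below compares K-Means and DBSCAN on scikit-learn's two-moons dataset, a standard example of non-spherical clusters. The `eps` and `noise` values are illustrative choices, not tuned recommendations:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moons: clusters that are not spherical
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Compare against the true moons; the max over the two label orderings
# handles the arbitrary cluster numbering (DBSCAN marks noise as -1)
for name, labels in [("KMeans", km_labels), ("DBSCAN", db_labels)]:
    acc = max((labels == y_true).mean(), (labels != y_true).mean())
    print(f"{name} agreement with true moons: {acc:.2f}")
```

K-Means cuts the moons with a straight boundary because each point goes to its nearest centroid, while DBSCAN follows the density of each moon and recovers them regardless of shape.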
Key Takeaways
- K-Means iteratively assigns points to the nearest centroid and updates centroids until convergence.
- Always scale features before K-Means since it is distance-based and sensitive to feature magnitudes.
- Use the Elbow Method or Silhouette Score to determine the optimal number of clusters K.
- K-Means++ initialization produces better, more consistent results than random initialization.
- K-Means assumes spherical clusters of similar size - use DBSCAN or GMM for complex cluster shapes.