Support Vector Machines
Support Vector Machines find the optimal hyperplane that maximizes the margin between classes. With the kernel trick, SVMs can classify non-linearly separable data by mapping it into higher-dimensional spaces.
The Core Idea: Maximum Margin
A Support Vector Machine (SVM) is a supervised learning algorithm that finds the optimal decision boundary - the hyperplane that maximizes the margin between two classes.
The margin is the distance between the decision boundary and the nearest data points from each class. These nearest points are called support vectors - they are the only points that define the boundary.
Why maximize the margin? A larger margin means the model is more confident in its predictions and generalizes better to unseen data.
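For a linear SVM with weight vector w, the geometric margin width works out to 2/||w||. A minimal sketch of extracting it with scikit-learn (the toy blob dataset and the large-C "hard margin" approximation are illustrative choices, not requirements):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Toy, well-separated two-class data
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=0)

clf = SVC(kernel='linear', C=1e6)  # very large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]
margin_width = 2 / np.linalg.norm(w)  # distance between the two margin boundaries
print(f"Margin width: {margin_width:.3f}")
print(f"Support vectors per class: {clf.n_support_}")
```

Only the points reported in `n_support_` determine the boundary; every other training point could be removed without changing it.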
Hard Margin vs Soft Margin
Hard Margin SVM assumes data is perfectly linearly separable - no misclassifications allowed. This is rarely the case in real data.
Soft Margin SVM introduces slack variables that allow some misclassifications, controlled by the hyperparameter C:
- High C - small margin, fewer misclassifications (risk of overfitting)
- Low C - large margin, more misclassifications allowed (better generalization)
Finding the right C is crucial and is typically done via cross-validation.
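One quick way to see the C tradeoff in action (a sketch on synthetic data, not a tuning recipe): as C grows, the margin tightens and fewer training points remain support vectors.

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, flip_y=0.05, random_state=42)

counts = {}
for C in [0.01, 1, 100]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # Lower C tolerates more margin violations, so more points become support vectors
    counts[C] = clf.n_support_.sum()
    print(f"C={C:<6} support vectors: {counts[C]}")
```

In practice you would pick C by cross-validation, as the grid search later in this section shows.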
The Kernel Trick
When data is not linearly separable, the kernel trick maps data into a higher-dimensional space where a linear boundary can be found - without explicitly computing the transformation.
Common kernels:
- Linear Kernel - No transformation. Best for linearly separable data and high-dimensional text data.
- RBF (Radial Basis Function) / Gaussian Kernel - Maps to infinite dimensions. Most versatile, works for most problems. Controlled by gamma parameter.
- Polynomial Kernel - Maps to polynomial feature space. Degree parameter controls complexity.
- Sigmoid Kernel - Similar to neural network activation. Less commonly used.
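To make the "without explicitly computing the transformation" point concrete, here is a sketch that evaluates the RBF kernel directly as K(x, z) = exp(-gamma * ||x - z||^2) and checks it against scikit-learn's `rbf_kernel`; neither path ever materializes the (infinite-dimensional) feature map:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
gamma = 0.5

# Manual RBF: exponentiated negative squared Euclidean distance
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_manual = np.exp(-gamma * sq_dists)

K_sklearn = rbf_kernel(X, gamma=gamma)
print(np.allclose(K_manual, K_sklearn))  # True
```

The SVM only ever needs these pairwise similarities, which is why the trick works.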
SVM for Regression (SVR)
SVMs can also perform regression using Support Vector Regression (SVR). Instead of maximizing margin between classes, SVR fits a tube of width epsilon around the data:
- Points inside the tube incur zero loss.
- Points outside the tube are penalized in proportion to their distance from the tube.
This makes SVR robust to outliers compared to linear regression.
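The tube corresponds to the epsilon-insensitive loss, L(r) = max(0, |r| - epsilon). A minimal sketch (the helper function name is my own, not a scikit-learn API):

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero loss inside the epsilon tube, linear growth outside it."""
    residual = np.abs(y_true - y_pred)
    return np.maximum(0.0, residual - epsilon)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 3.0])
print(epsilon_insensitive_loss(y_true, y_pred))  # [0.  0.4 0. ]
```

The first and last residuals fall inside the tube and contribute nothing; only the 0.5 residual is penalized, and only by the amount it exceeds epsilon.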
Implementing SVM in Python
Complete SVM implementation with kernel comparison:
import numpy as np
from sklearn.svm import SVC, SVR
from sklearn.datasets import make_classification, make_moons
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
# ---- Non-linear data (moons) ----
X, y = make_moons(n_samples=500, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# IMPORTANT: Always scale features for SVM
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# ---- Compare kernels ----
kernels = ['linear', 'rbf', 'poly']
for kernel in kernels:
    svm = SVC(kernel=kernel, C=1.0, random_state=42)
    svm.fit(X_train, y_train)
    acc = accuracy_score(y_test, svm.predict(X_test))
    print(f"Kernel: {kernel:8s} | Accuracy: {acc:.4f} | Support Vectors: {svm.n_support_}")
# ---- Hyperparameter tuning with GridSearchCV ----
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.01, 0.1],
    'kernel': ['rbf']
}
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)
print(f"\nBest params: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")
# ---- SVR for regression ----
from sklearn.datasets import make_regression
Xr, yr = make_regression(n_samples=200, n_features=1, noise=15, random_state=42)
svr = SVR(kernel='rbf', C=100, epsilon=0.1)
svr.fit(Xr, yr)
print(f"\nSVR R^2 score: {svr.score(Xr, yr):.4f}")

When to Use SVMs
SVMs are particularly effective in these scenarios:
- High-dimensional data (e.g., text classification, genomics), especially when the number of features exceeds the number of training examples.
- Binary classification problems where a clear margin of separation exists.
- When you need a non-probabilistic classifier with strong theoretical guarantees.
Note, however, that SVMs scale poorly to very large datasets (>100K samples); prefer tree-based methods or neural networks in that case.
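When dataset size rules out a kernel SVM, a linear SVM trained with a specialized solver often remains practical. A sketch using scikit-learn's `LinearSVC` (liblinear-based), which scales far better than kernel `SVC`; the dataset size and parameters here are illustrative:

```python
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, n_features=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# LinearSVC avoids the quadratic/cubic training cost of kernel SVC
clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, dual=False))
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")
```

`dual=False` is the recommended setting here since n_samples greatly exceeds n_features; for truly streaming-scale data, `SGDClassifier(loss='hinge')` is another linear-SVM option.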
Key Takeaways
- SVMs find the hyperplane that maximizes the margin between classes, defined by support vectors.
- The C parameter controls the bias-variance tradeoff: high C = low bias, high variance.
- The kernel trick enables SVMs to classify non-linearly separable data without explicit feature mapping.
- RBF kernel is the most versatile choice; always scale features before training an SVM.
- SVMs are powerful for high-dimensional, small-to-medium datasets but scale poorly to millions of samples.