Cross-Validation
Cross-validation provides a robust estimate of model performance by training and evaluating the model multiple times on different data subsets. K-Fold and Stratified K-Fold are the standard techniques for reliable model selection and hyperparameter tuning.
Why Cross-Validation?
A single train/validation split has a critical weakness: the performance estimate depends heavily on which samples happen to end up in the validation set.
With a small dataset, this variance can be enormous - you might get 85% accuracy on one split and 72% on another, purely due to randomness.
Cross-validation solves this by averaging performance across multiple splits, giving a much more reliable and stable estimate of true generalization performance.
K-Fold Cross-Validation
K-Fold CV divides the dataset into K equal-sized folds. The model is trained K times, each time using K-1 folds for training and 1 fold for validation:
- Split data into K equal folds (typically K=5 or K=10).
- For each fold i: train on all folds except i, evaluate on fold i.
- Record the validation score for each fold.
- Final score = mean of all K scores. Standard deviation measures stability.
- Every sample is used for validation exactly once, and for training K-1 times.
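The steps above can be sketched by hand (a minimal illustration fitting a logistic regression on the breast-cancer dataset; in practice you would use scikit-learn's KFold rather than splitting manually):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
K = 5

# Shuffle indices once, then cut into K roughly equal folds
rng = np.random.default_rng(42)
indices = rng.permutation(len(X))
folds = np.array_split(indices, K)

scores = []
for i in range(K):
    val_idx = folds[i]                                   # fold i validates
    train_idx = np.concatenate(
        [f for j, f in enumerate(folds) if j != i])      # the rest train
    model = LogisticRegression(max_iter=5000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

print(f"Mean: {np.mean(scores):.4f}, Std: {np.std(scores):.4f}")
```

Every index appears in exactly one validation fold, so concatenating all folds recovers the full dataset once.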
Stratified K-Fold
Stratified K-Fold ensures each fold has the same class distribution as the full dataset. This is essential for:
- Imbalanced classification (e.g., 1% fraud, 99% normal).
- Small datasets where random splits might create folds with no minority-class examples.
- Getting reliable per-class metrics.
Scikit-learn's cross_val_score uses StratifiedKFold automatically for classifiers.
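A quick way to see what stratification guarantees, using made-up imbalanced labels (10% positives): every validation fold keeps exactly that class proportion.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 10 positives out of 100 samples
y = np.array([1] * 10 + [0] * 90)
X = np.zeros((100, 1))  # features are irrelevant to the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pos_counts = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print(pos_counts)  # -> [2, 2, 2, 2, 2]: each fold of 20 gets 2 positives
```

With plain KFold on the same labels, a fold could easily end up with zero positives, making per-fold metrics like recall undefined.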
Leave-One-Out and Other Variants
Beyond standard K-Fold, several specialized variants exist:
- Leave-One-Out (LOO) - K = n (one sample per fold). Nearly unbiased, but the estimate has high variance and training n models is computationally expensive for large datasets.
- Repeated K-Fold - Run K-Fold multiple times with different random splits. More stable estimates.
- Group K-Fold - Ensures samples from the same group (e.g., same patient) never appear in both train and validation.
- Time Series Split - For temporal data: always train on past, validate on future. Never shuffle time series data.
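The temporal rule in the last bullet is what scikit-learn's TimeSeriesSplit implements: training indices always precede validation indices, and folds are never shuffled (toy index data for illustration).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 10 time-ordered samples (index stands in for a timestamp)
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for train_idx, val_idx in splits:
    # Every training index is earlier than every validation index
    print(f"train: {train_idx.tolist()}  val: {val_idx.tolist()}")
```

The training window grows with each split while the validation window slides forward, mimicking how the model would be retrained over time in production.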
Implementing Cross-Validation in Python
Complete cross-validation workflow including nested CV for hyperparameter tuning:
import numpy as np
from sklearn.model_selection import (
    cross_val_score, KFold, StratifiedKFold,
    GridSearchCV, cross_validate
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
data = load_breast_cancer()
X, y = data.data, data.target
# ---- Basic K-Fold CV ----
kf = KFold(n_splits=5, shuffle=True, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(rf, X, y, cv=kf, scoring='accuracy')
print(f"5-Fold CV Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
print(f"Individual folds: {scores.round(4)}")
# ---- Stratified K-Fold (recommended for classification) ----
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_strat = cross_val_score(rf, X, y, cv=skf, scoring='accuracy')
print(f"\nStratified 5-Fold: {scores_strat.mean():.4f} (+/- {scores_strat.std():.4f})")
# ---- Multiple metrics at once ----
results = cross_validate(
    rf, X, y, cv=skf,
    scoring=['accuracy', 'precision', 'recall', 'f1'],
    return_train_score=True
)
print(f"\nMultiple metrics (mean across 5 folds):")
for metric in ['accuracy', 'precision', 'recall', 'f1']:
    print(f"  {metric}: {results[f'test_{metric}'].mean():.4f}")
# ---- Nested CV: Hyperparameter tuning + evaluation ----
# Inner CV: tune hyperparameters
# Outer CV: evaluate the tuned model
pipe = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
param_grid = {'svm__C': [0.1, 1, 10], 'svm__kernel': ['rbf', 'linear']}
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring='accuracy')
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv)
print(f"\nNested CV (unbiased estimate): {nested_scores.mean():.4f} (+/- {nested_scores.std():.4f})")
Cross-Validation Best Practices
Key rules for reliable cross-validation:
- Never use the test set during CV - CV is for model selection; the test set is for final reporting only.
- Include preprocessing in the CV pipeline - Use sklearn Pipeline to prevent data leakage from scaling/encoding.
- Use nested CV for hyperparameter tuning - Outer CV evaluates, inner CV tunes. Prevents optimistic bias.
- K=5 or K=10 is the standard - K=10 gives slightly better estimates; K=5 is faster. LOO is only for very small datasets.
- Report mean AND standard deviation - High std means the model is sensitive to data split; consider more data or simpler model.
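The leakage rule above can be demonstrated by comparing a scaler fitted once on the full dataset against one refit inside each training fold via Pipeline (a sketch; the size of the gap depends on the data and model, and here may be small):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Leaky: scaler sees the validation folds' statistics before splitting
X_leaky = StandardScaler().fit_transform(X)
leaky = cross_val_score(SVC(), X_leaky, y, cv=cv)

# Correct: scaler is refit on each training fold only
pipe = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
clean = cross_val_score(pipe, X, y, cv=cv)

print(f"leaky:   {leaky.mean():.4f}")
print(f"correct: {clean.mean():.4f}")
```

Only the pipeline version is an honest estimate: with the leaky variant, fold statistics from the validation data influence the scaling applied to the training data.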
Key Takeaways
- K-Fold CV trains and evaluates K times on different splits, giving a more reliable performance estimate than a single split.
- Stratified K-Fold preserves class distribution in each fold - always use it for classification.
- Use sklearn Pipeline to include preprocessing inside CV, preventing data leakage.
- Nested CV (inner for tuning, outer for evaluation) gives an unbiased estimate of a tuned model's performance.
- Report both mean and standard deviation of CV scores - high variance indicates instability.