Cross-Validation
Cross-validation provides a robust estimate of model performance by training and evaluating the model multiple times on different data subsets. K-Fold and Stratified K-Fold are the standard techniques for reliable model selection and hyperparameter tuning.
Why Cross-Validation?
A single train/validation split has a critical weakness: the performance estimate depends heavily on which samples happen to end up in the validation set.
With a small dataset, this variance can be enormous - you might get 85% accuracy on one split and 72% on another, purely due to randomness.
Cross-validation solves this by averaging performance across multiple splits, giving a much more reliable and stable estimate of true generalization performance.
K-Fold Cross-Validation
K-Fold CV divides the dataset into K equal-sized folds. The model is trained K times, each time using K-1 folds for training and 1 fold for validation:
- Split data into K equal folds (typically K=5 or K=10).
- For each fold i: train on all folds except i, evaluate on fold i.
- Record the validation score for each fold.
- Final score = mean of all K scores. Standard deviation measures stability.
- Every sample is used for validation exactly once, and for training K-1 times.
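The steps above can be sketched by hand (a minimal illustration fitting a logistic regression on the breast-cancer dataset; in practice you would use scikit-learn's KFold rather than splitting manually):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
K = 5

# Shuffle indices once, then cut into K roughly equal folds
rng = np.random.default_rng(42)
indices = rng.permutation(len(X))
folds = np.array_split(indices, K)

scores = []
for i in range(K):
    val_idx = folds[i]                                   # fold i validates
    train_idx = np.concatenate(
        [f for j, f in enumerate(folds) if j != i])      # the rest train
    model = LogisticRegression(max_iter=5000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

print(f"Mean: {np.mean(scores):.4f}, Std: {np.std(scores):.4f}")
```

Every index appears in exactly one validation fold, so concatenating all folds recovers the full dataset once.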
Stratified K-Fold
Stratified K-Fold ensures each fold has the same class distribution as the full dataset. This is essential for:
- Imbalanced classification (e.g., 1% fraud, 99% normal).
- Small datasets where random splits might create folds with no minority-class examples.
- Getting reliable per-class metrics.
Scikit-learn's cross_val_score uses StratifiedKFold automatically for classifiers.
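A quick way to see what stratification guarantees, using made-up imbalanced labels (10% positives): every validation fold keeps exactly that class proportion.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 10 positives out of 100 samples
y = np.array([1] * 10 + [0] * 90)
X = np.zeros((100, 1))  # features are irrelevant to the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pos_counts = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print(pos_counts)  # -> [2, 2, 2, 2, 2]: each fold of 20 gets 2 positives
```

With plain KFold on the same labels, a fold could easily end up with zero positives, making per-fold metrics like recall undefined.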
Leave-One-Out and Other Variants
Beyond standard K-Fold, several specialized variants exist:
- Leave-One-Out (LOO) - K = n (one sample per fold). Nearly unbiased, but the estimate has high variance and training n models is computationally expensive for large datasets.
- Repeated K-Fold - Run K-Fold multiple times with different random splits. More stable estimates.
- Group K-Fold - Ensures samples from the same group (e.g., same patient) never appear in both train and validation.
- Time Series Split - For temporal data: always train on past, validate on future. Never shuffle time series data.
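The temporal rule in the last bullet is what scikit-learn's TimeSeriesSplit implements: training indices always precede validation indices, and folds are never shuffled (toy index data for illustration).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 10 time-ordered samples (index stands in for a timestamp)
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for train_idx, val_idx in splits:
    # Every training index is earlier than every validation index
    print(f"train: {train_idx.tolist()}  val: {val_idx.tolist()}")
```

The training window grows with each split while the validation window slides forward, mimicking how the model would be retrained over time in production.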
Implementing Cross-Validation in Python
Complete cross-validation workflow including nested CV for hyperparameter tuning:
import numpy as np
from sklearn.model_selection import (
    cross_val_score, KFold, StratifiedKFold,
    GridSearchCV, cross_validate
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
data = load_breast_cancer()
X, y = data.data, data.target
# ---- Basic K-Fold CV ----
kf = KFold(n_splits=5, shuffle=True, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(rf, X, y, cv=kf, scoring='accuracy')
print(f"5-Fold CV Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
print(f"Individual folds: {scores.round(4)}")
# ---- Stratified K-Fold (recommended for classification) ----
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_strat = cross_val_score(rf, X, y, cv=skf, scoring='accuracy')
print(f"\nStratified 5-Fold: {scores_strat.mean():.4f} (+/- {scores_strat.std():.4f})")
# ---- Multiple metrics at once ----
results = cross_validate(
    rf, X, y, cv=skf,
    scoring=['accuracy', 'precision', 'recall', 'f1'],
    return_train_score=True
)
print(f"\nMultiple metrics (mean across 5 folds):")
for metric in ['accuracy', 'precision', 'recall', 'f1']:
    print(f"  {metric}: {results[f'test_{metric}'].mean():.4f}")
# ---- Nested CV: Hyperparameter tuning + evaluation ----
# Inner CV: tune hyperparameters
# Outer CV: evaluate the tuned model
pipe = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
param_grid = {'svm__C': [0.1, 1, 10], 'svm__kernel': ['rbf', 'linear']}
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring='accuracy')
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv)
print(f"\nNested CV (unbiased estimate): {nested_scores.mean():.4f} (+/- {nested_scores.std():.4f})")
Cross-Validation Best Practices
Key rules for reliable cross-validation:
- Never use the test set during CV - CV is for model selection; the test set is for final reporting only.
- Include preprocessing in the CV pipeline - Use sklearn Pipeline to prevent data leakage from scaling/encoding.
- Use nested CV for hyperparameter tuning - Outer CV evaluates, inner CV tunes. Prevents optimistic bias.
- K=5 or K=10 is the standard - K=10 gives slightly better estimates; K=5 is faster. LOO is only for very small datasets.
- Report mean AND standard deviation - High std means the model is sensitive to data split; consider more data or simpler model.
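The leakage rule above can be demonstrated by comparing a scaler fitted once on the full dataset against one refit inside each training fold via Pipeline (a sketch; the size of the gap depends on the data and model, and here may be small):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Leaky: scaler sees the validation folds' statistics before splitting
X_leaky = StandardScaler().fit_transform(X)
leaky = cross_val_score(SVC(), X_leaky, y, cv=cv)

# Correct: scaler is refit on each training fold only
pipe = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
clean = cross_val_score(pipe, X, y, cv=cv)

print(f"leaky:   {leaky.mean():.4f}")
print(f"correct: {clean.mean():.4f}")
```

Only the pipeline version is an honest estimate: with the leaky variant, fold statistics from the validation data influence the scaling applied to the training data.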
Key Takeaways
- K-Fold CV trains and evaluates K times on different splits, giving a more reliable performance estimate than a single split.
- Stratified K-Fold preserves class distribution in each fold - always use it for classification.
- Use sklearn Pipeline to include preprocessing inside CV, preventing data leakage.
- Nested CV (inner for tuning, outer for evaluation) gives an unbiased estimate of a tuned model's performance.
- Report both mean and standard deviation of CV scores - high variance indicates instability.