Machine Learning · Beginner · 8 min read · Updated March 2025

Train/Validation/Test Split

Properly splitting your dataset into training, validation, and test sets is fundamental to building ML models that generalize. This article covers the why, how, and common pitfalls of dataset splitting.

Why Split the Dataset?

When training a machine learning model, we need to evaluate how well it will perform on new, unseen data, not just the data it was trained on.

Without splitting:

  • A model could memorize all training examples (overfitting)
  • It would appear to perform perfectly but fail on real-world data
  • We would have no honest estimate of generalization performance

The split creates an honest evaluation framework by keeping some data completely hidden from the training process.

The Three Splits Explained

A proper ML workflow uses three separate data splits:

  • Training Set (60-80%) - Used to fit the model parameters (weights, coefficients). The model sees and learns from this data.
  • Validation Set (10-20%) - Used to tune hyperparameters (learning rate, depth, regularization). The model does NOT train on this but it influences model selection.
  • Test Set (10-20%) - Used ONCE at the very end to report final performance. Never used for any decisions during development.

The test set must remain completely untouched until final evaluation; using it multiple times leads to optimistic, unreliable estimates.
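Because the second split operates on what remains after the test set is removed, its fraction must be computed relative to the remainder. A minimal sketch (the helper name `second_split_fraction` is illustrative, not from the article):

```python
def second_split_fraction(val_frac: float, test_frac: float) -> float:
    """Fraction of the *remaining* data that yields val_frac of the total.

    E.g. for a 60/20/20 split: remove 20% for test, then take
    0.2 / (1 - 0.2) = 0.25 of the remainder for validation.
    """
    return val_frac / (1.0 - test_frac)

print(second_split_fraction(0.2, 0.2))  # 0.25
```

This is the arithmetic behind passing `test_size=0.25` to the second `train_test_split` call in the workflow below.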

Stratified Splitting

For classification problems, stratified splitting ensures each split has the same class distribution as the original dataset.

This is critical when:

  • Classes are imbalanced (e.g., 95% negative, 5% positive)
  • The dataset is small
  • You need reliable performance estimates per class

Without stratification, a random split might put all rare class examples in training, leaving none in the test set.
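A quick sketch of this guarantee on a synthetic imbalanced dataset (the data here is made up for illustration): with `stratify=y`, a 20% test split of 100 samples containing 5 positives is guaranteed to receive exactly one positive.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced labels: 95 negatives, 5 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 95 + [1] * 5)

# Stratified split preserves the 95:5 ratio in both halves.
_, _, _, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(np.bincount(y_test))  # [19  1] -> the rare class appears in the test set
```

Dropping `stratify=y` would make the positive count in the test set a matter of luck; with only 5 positives, some random seeds would leave the test set with none at all.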

Implementing Dataset Splits in Python

Complete splitting workflow with stratification and data leakage prevention:

python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Class distribution: {np.bincount(y)}")

# ---- Step 1: Split off test set first ----
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify preserves class ratio
)

# ---- Step 2: Split remaining into train/validation ----
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
    # 0.25 of 0.8 = 0.2 of total -> 60/20/20 split
)

print(f"\nSplit sizes:")
print(f"  Train:      {len(X_train)} ({len(X_train)/len(X)*100:.0f}%)")
print(f"  Validation: {len(X_val)} ({len(X_val)/len(X)*100:.0f}%)")
print(f"  Test:       {len(X_test)} ({len(X_test)/len(X)*100:.0f}%)")

# ---- CRITICAL: Fit scaler on TRAIN only, transform all ----
# This prevents data leakage from validation/test into training
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)   # fit + transform
X_val_s   = scaler.transform(X_val)          # transform only
X_test_s  = scaler.transform(X_test)         # transform only

# ---- Train and evaluate ----
model = LogisticRegression(max_iter=1000)
model.fit(X_train_s, y_train)

print(f"\nTrain accuracy:      {model.score(X_train_s, y_train):.4f}")
print(f"Validation accuracy: {model.score(X_val_s, y_val):.4f}")
# Only report test accuracy at the very end!
print(f"Test accuracy:       {model.score(X_test_s, y_test):.4f}")

Data Leakage: The Silent Killer

Data leakage occurs when information from the validation or test set "leaks" into the training process, producing overly optimistic results that don't hold in production.

Common sources of leakage:

  • Fitting a scaler/normalizer on the full dataset before splitting
  • Feature engineering using statistics computed on the full dataset
  • Target encoding computed on the full dataset
  • Splitting time series data randomly instead of chronologically

Always fit preprocessing transformers on training data only to prevent data leakage.
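One robust way to enforce this rule is to bundle preprocessing and model into a scikit-learn `Pipeline`, so the scaler is automatically fit only on whatever data `.fit()` receives. A sketch using the same breast-cancer dataset as above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler inside the pipeline is fit on training data only;
# test (or cross-validation fold) statistics can never leak in.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print(f"Test accuracy: {pipe.score(X_test, y_test):.4f}")
```

The same pipeline object can be passed to `cross_val_score` or `GridSearchCV`, where it refits the scaler inside each fold, preventing the subtle leakage that fold-external scaling introduces.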

Key Takeaways

  • Split data into train (60-80%), validation (10-20%), and test (10-20%) sets before any modeling.
  • Use stratified splitting for classification to preserve class distribution across all splits.
  • The test set must be used only ONCE at the very end - never for hyperparameter tuning.
  • Fit all preprocessing (scalers, encoders) on training data only to prevent data leakage.
  • For time series data, always split chronologically - never randomly - to avoid future data leaking into training.
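The chronological rule from the last point can be sketched in a few lines: simply cut the ordered series at an index rather than shuffling, so every test observation comes after every training observation (the toy series below is invented for illustration).

```python
import numpy as np

# Hypothetical time-ordered series: 100 observations, index = time.
values = np.arange(100)

# Chronological split: last 20% is the test set; no shuffling.
split = int(len(values) * 0.8)
train, test = values[:split], values[split:]
print(train[-1], test[0])  # 79 80 -> test strictly follows train in time
```

For rolling evaluation over multiple windows, scikit-learn's `TimeSeriesSplit` generalizes this idea while preserving temporal order.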
