Decision Trees & Random Forests
Decision Trees split data using feature thresholds to make predictions. Random Forests combine hundreds of trees via bagging and feature randomness to create a powerful, robust ensemble that reduces overfitting.
How Decision Trees Work
A Decision Tree is a flowchart-like model that makes predictions by asking a series of yes/no questions about the input features. Each internal node represents a feature test, each branch represents an outcome, and each leaf node represents a prediction.
For example, to predict if a loan will default:
1. Is income > $50,000? If No -> High Risk
2. Is credit score > 700? If Yes -> Low Risk, else -> Medium Risk
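The loan example above can be sketched as nested conditionals; the thresholds are the illustrative ones from the example, not values learned from data:

```python
def loan_risk(income: float, credit_score: int) -> str:
    """Walk the example tree: each `if` is an internal node, each return a leaf."""
    if income <= 50_000:        # first split: income > $50,000?
        return "High Risk"
    if credit_score > 700:      # second split: credit score > 700?
        return "Low Risk"
    return "Medium Risk"

print(loan_risk(40_000, 720))  # High Risk
print(loan_risk(80_000, 720))  # Low Risk
print(loan_risk(80_000, 650))  # Medium Risk
```

A fitted decision tree is exactly this kind of structure, except the features and thresholds are chosen automatically from the training data.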
The tree is built by recursively splitting the data to maximize purity - each split should separate classes as cleanly as possible.
Splitting Criteria
Decision trees use impurity measures to decide the best split at each node:
- Gini Impurity - Measures the probability of misclassifying a randomly chosen element if it were labeled according to the node's class distribution. Gini = 1 - sum(p_i^2). Used by the CART algorithm.
- Entropy / Information Gain - Measures the reduction in uncertainty after a split. Used by ID3 and C4.5 algorithms.
- Mean Squared Error - Used for regression trees; minimizes variance within each leaf.
- The feature and threshold that produce the highest information gain (or lowest impurity) are chosen for each split.
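As a sketch of how these criteria work in practice, Gini impurity and the gain from a candidate split can be computed by hand (the helper names `gini` and `gini_gain` are ours, not from any library):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class probabilities."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(parent, left, right):
    """Impurity reduction from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

labels = np.array([0, 0, 0, 1, 1, 1])
print(gini(labels))                               # 0.5 for a 50/50 node
print(gini_gain(labels, labels[:3], labels[3:]))  # 0.5: a perfect split
```

At each node the tree builder evaluates this gain for every candidate feature/threshold pair and keeps the split with the largest value.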
Overfitting and Pruning
Decision trees are prone to overfitting - they can grow deep enough to memorize the training data perfectly but generalize poorly.
Solutions:
- max_depth - Limit how deep the tree can grow
- min_samples_split - Minimum samples required to split a node
- min_samples_leaf - Minimum samples required at a leaf node
- Post-pruning - Grow the full tree, then remove branches that don't improve validation performance
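A quick way to see the effect of depth control is to compare an unrestricted tree with a shallow one on held-out data. This is a sketch on a synthetic dataset (the dataset parameters are illustrative), so exact scores will vary:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic dataset, not from the article
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for depth in (None, 3):  # None = grow until pure; 3 = regularized
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={scores[depth][0]:.2f}, "
          f"test={scores[depth][1]:.2f}")
```

The unrestricted tree scores 1.00 on training data (it memorizes it), while the capped tree trades a little training accuracy for better generalization.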
Random Forests: Ensemble of Trees
A Random Forest is an ensemble of many decision trees, each trained on a random subset of the data and features. Predictions are made by majority vote (classification) or averaging (regression).
Two key sources of randomness:
1. Bootstrap sampling (Bagging) - Each tree trains on a random sample drawn with replacement
2. Feature randomness - At each split, only a random subset of features is considered
This diversity decorrelates the trees, which reduces variance and curbs overfitting - the hallmark weakness of single decision trees.
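To make the two sources of randomness concrete, here is a minimal hand-rolled bagging ensemble - a sketch of the idea, not a replacement for RandomForestClassifier, which does this plus out-of-bag scoring and other bookkeeping:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(42)
trees = []
for _ in range(25):
    # 1. Bootstrap sampling: draw rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # 2. Feature randomness: consider sqrt(n_features) features per split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Majority vote across the ensemble
votes = np.stack([t.predict(X_test) for t in trees])
pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
accuracy = (pred == y_test).mean()
print(f"ensemble accuracy: {accuracy:.3f}")
```

Even this bare-bones version typically beats a single tree on the same split, because the bootstrap samples and per-split feature subsets make the trees' errors partly independent.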
Implementing Decision Trees and Random Forests
Full implementation with feature importance analysis:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score
# Load dataset
data = load_wine()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# ---- Decision Tree ----
dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=42)
dt.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt.predict(X_test))
print(f"Decision Tree Accuracy: {dt_acc:.4f}")
print("\nTree structure (first 3 levels):")
print(export_text(dt, feature_names=data.feature_names, max_depth=3))
# ---- Random Forest ----
rf = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
rf_acc = accuracy_score(y_test, rf.predict(X_test))
print(f"\nRandom Forest Accuracy: {rf_acc:.4f}")
# Cross-validation
cv_scores = cross_val_score(rf, X, y, cv=5)
print(f"5-Fold CV Mean: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
# Feature importance
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]
print("\nTop 5 Most Important Features:")
for i in range(5):
    print(f"  {data.feature_names[indices[i]]}: {importances[indices[i]]:.4f}")
Gradient Boosting vs Random Forests
Both are tree ensembles but differ fundamentally:
- Random Forest - Trees are built in parallel, independently. Fast to train. Reduces variance.
- Gradient Boosting (XGBoost, LightGBM) - Trees are built sequentially, each correcting the errors of the previous. Reduces bias. Often more accurate but slower.
- Random Forests are more robust to hyperparameter choices and less prone to overfitting.
- Gradient Boosting typically achieves higher accuracy on structured/tabular data with proper tuning.
- For most production use cases, start with Random Forest for speed and reliability.
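The trade-off can be seen in a rough side-by-side run on the same wine dataset; the scores and timings below are machine- and version-dependent, so treat this as a sketch rather than a benchmark:

```python
import time
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

results = {}
for model in (RandomForestClassifier(n_estimators=100, random_state=42),
              GradientBoostingClassifier(n_estimators=100, random_state=42)):
    name = type(model).__name__
    start = time.perf_counter()
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: 5-fold CV mean={results[name]:.3f}, "
          f"time={time.perf_counter() - start:.2f}s")
```

Note that the forest's trees could be fit in parallel (n_jobs=-1), while boosting is inherently sequential - each tree needs the previous tree's residuals.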
Key Takeaways
- Decision Trees split data recursively using Gini impurity or information gain to maximize class purity.
- Single decision trees overfit easily - control their complexity with max_depth, min_samples_leaf, and pruning.
- Random Forests combine hundreds of trees using bagging and feature randomness to reduce variance.
- Feature importance from Random Forests is a powerful tool for understanding which variables drive predictions.
- Random Forests are robust, require minimal preprocessing (no feature scaling needed), and handle mixed feature types well; missing-value support varies by implementation.