Decision Trees & Random Forests
Decision Trees split data using feature thresholds to make predictions. Random Forests combine hundreds of trees via bagging and feature randomness to create a powerful, robust ensemble that reduces overfitting.
How Decision Trees Work
A Decision Tree is a flowchart-like model that makes predictions by asking a series of yes/no questions about the input features. Each internal node represents a feature test, each branch represents an outcome, and each leaf node represents a prediction.
For example, to predict if a loan will default:
1. Is income > $50,000? If No -> High Risk
2. Is credit score > 700? If Yes -> Low Risk, else -> Medium Risk
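The loan example above can be sketched as nested conditionals; the thresholds are the illustrative ones from the example, not values learned from data:

```python
def loan_risk(income: float, credit_score: int) -> str:
    """Walk the example tree: each `if` is an internal node, each return a leaf."""
    if income <= 50_000:        # first split: income > $50,000?
        return "High Risk"
    if credit_score > 700:      # second split: credit score > 700?
        return "Low Risk"
    return "Medium Risk"

print(loan_risk(40_000, 720))  # High Risk
print(loan_risk(80_000, 720))  # Low Risk
print(loan_risk(80_000, 650))  # Medium Risk
```

A fitted decision tree is exactly this kind of structure, except the features and thresholds are chosen automatically from the training data.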
The tree is built by recursively splitting the data to maximize purity - each split should separate classes as cleanly as possible.
Splitting Criteria
Decision trees use impurity measures to decide the best split at each node:
- Gini Impurity - Measures the probability of misclassifying a randomly chosen element if it were labeled according to the node's class distribution. Gini = 1 - sum(p_i^2). Used by the CART algorithm.
- Entropy / Information Gain - Measures the reduction in uncertainty after a split. Used by ID3 and C4.5 algorithms.
- Mean Squared Error - Used for regression trees; minimizes variance within each leaf.
- The feature and threshold that produce the highest information gain (or lowest impurity) are chosen for each split.
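As a sketch of how these criteria work in practice, Gini impurity and the gain from a candidate split can be computed by hand (the helper names `gini` and `gini_gain` are ours, not from any library):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class probabilities."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(parent, left, right):
    """Impurity reduction from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

labels = np.array([0, 0, 0, 1, 1, 1])
print(gini(labels))                               # 0.5 for a 50/50 node
print(gini_gain(labels, labels[:3], labels[3:]))  # 0.5: a perfect split
```

At each node the tree builder evaluates this gain for every candidate feature/threshold pair and keeps the split with the largest value.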
Overfitting and Pruning
Decision trees are prone to overfitting - they can grow deep enough to memorize the training data perfectly but generalize poorly.
Solutions:
- max_depth - Limit how deep the tree can grow
- min_samples_split - Minimum samples required to split a node
- min_samples_leaf - Minimum samples required at a leaf node
- Post-pruning - Grow the full tree, then remove branches that don't improve validation performance
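A quick way to see the effect of depth control is to compare an unrestricted tree with a shallow one on held-out data. This is a sketch on a synthetic dataset (the dataset parameters are illustrative), so exact scores will vary:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic dataset, not from the article
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for depth in (None, 3):  # None = grow until pure; 3 = regularized
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={scores[depth][0]:.2f}, "
          f"test={scores[depth][1]:.2f}")
```

The unrestricted tree scores 1.00 on training data (it memorizes it), while the capped tree trades a little training accuracy for better generalization.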
Random Forests: Ensemble of Trees
A Random Forest is an ensemble of many decision trees, each trained on a random subset of the data and features. Predictions are made by majority vote (classification) or averaging (regression).
Two key sources of randomness:
1. Bootstrap sampling (Bagging) - Each tree trains on a random sample drawn with replacement
2. Feature randomness - At each split, only a random subset of features is considered
This diversity decorrelates the trees, which reduces variance and curbs overfitting - the hallmark weakness of single decision trees.
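To make the two sources of randomness concrete, here is a minimal hand-rolled bagging ensemble - a sketch of the idea, not a replacement for RandomForestClassifier, which does this plus out-of-bag scoring and other bookkeeping:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(42)
trees = []
for _ in range(25):
    # 1. Bootstrap sampling: draw rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # 2. Feature randomness: consider sqrt(n_features) features per split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Majority vote across the ensemble
votes = np.stack([t.predict(X_test) for t in trees])
pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
accuracy = (pred == y_test).mean()
print(f"ensemble accuracy: {accuracy:.3f}")
```

Even this bare-bones version typically beats a single tree on the same split, because the bootstrap samples and per-split feature subsets make the trees' errors partly independent.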
Implementing Decision Trees and Random Forests
Full implementation with feature importance analysis:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score
# Load dataset
data = load_wine()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# ---- Decision Tree ----
dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=42)
dt.fit(X_train, y_train)
dt_acc = accuracy_score(y_test, dt.predict(X_test))
print(f"Decision Tree Accuracy: {dt_acc:.4f}")
print("\nTree structure (first 3 levels):")
print(export_text(dt, feature_names=data.feature_names, max_depth=3))
# ---- Random Forest ----
rf = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
rf_acc = accuracy_score(y_test, rf.predict(X_test))
print(f"\nRandom Forest Accuracy: {rf_acc:.4f}")
# Cross-validation
cv_scores = cross_val_score(rf, X, y, cv=5)
print(f"5-Fold CV Mean: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
# Feature importance
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]
print("\nTop 5 Most Important Features:")
for i in range(5):
    print(f"  {data.feature_names[indices[i]]}: {importances[indices[i]]:.4f}")
Gradient Boosting vs Random Forests
Both are tree ensembles but differ fundamentally:
- Random Forest - Trees are built in parallel, independently. Fast to train. Reduces variance.
- Gradient Boosting (XGBoost, LightGBM) - Trees are built sequentially, each correcting the errors of the previous. Reduces bias. Often more accurate but slower.
- Random Forests are more robust to hyperparameter choices and less prone to overfitting.
- Gradient Boosting typically achieves higher accuracy on structured/tabular data with proper tuning.
- For most production use cases, start with Random Forest for speed and reliability.
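The trade-off can be seen in a rough side-by-side run on the same wine dataset; the scores and timings below are machine- and version-dependent, so treat this as a sketch rather than a benchmark:

```python
import time
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

results = {}
for model in (RandomForestClassifier(n_estimators=100, random_state=42),
              GradientBoostingClassifier(n_estimators=100, random_state=42)):
    name = type(model).__name__
    start = time.perf_counter()
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: 5-fold CV mean={results[name]:.3f}, "
          f"time={time.perf_counter() - start:.2f}s")
```

Note that the forest's trees could be fit in parallel (n_jobs=-1), while boosting is inherently sequential - each tree needs the previous tree's residuals.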
Key Takeaways
- Decision Trees split data recursively using Gini impurity or information gain to maximize class purity.
- Single decision trees overfit easily - control their complexity with max_depth, min_samples_leaf, and pruning.
- Random Forests combine hundreds of trees using bagging and feature randomness to reduce variance.
- Feature importance from Random Forests is a powerful tool for understanding which variables drive predictions.
- Random Forests are robust, require minimal preprocessing (no feature scaling needed), and handle mixed feature types well; missing-value support varies by implementation.