
7. Random Forest

Random forests are a popular ensemble learning algorithm in machine learning, primarily used for classification and regression tasks. Introduced by Leo Breiman in 2001, they build upon decision trees by combining multiple trees into a "forest" to improve accuracy, reduce overfitting, and enhance generalization. The key idea is to create diversity among the trees through randomness, which helps in averaging out errors from individual trees.

Figure: Random Forest algorithm overview. Source: GeeksforGeeks

Key Concepts:

  • Ensemble Method: Random forests use bagging (bootstrap aggregating), where each decision tree is trained on a random subset of the training data (with replacement). This reduces variance.
  • Random Feature Selection: At each node split in a tree, only a random subset of features is considered, which decorrelates the trees and further reduces overfitting.
  • Prediction (a minimal voting sketch follows this list):
    • For classification: The final output is the majority vote (mode) from all trees.
    • For regression: The final output is the average (mean) of predictions from all trees.
  • Advantages: Robust to noise, handles missing values well, provides feature importance scores, and works well with high-dimensional data.
  • Disadvantages: Can be computationally intensive, less interpretable than single decision trees, and may require tuning hyperparameters like number of trees (n_estimators), maximum depth (max_depth), and number of features per split (max_features).
  • Applications: Used in finance (credit scoring), healthcare (disease prediction), e-commerce (recommendation systems), and more.
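
The prediction rules above are simple to state in code. Below is a minimal sketch with hypothetical per-tree outputs (the numbers are made up purely for illustration):

from collections import Counter
from statistics import mean

# Hypothetical predictions from T = 5 trees for one sample
class_votes = [1, 0, 1, 1, 0]                  # classification: per-tree class labels
regression_preds = [2.3, 2.9, 2.5, 3.1, 2.7]   # regression: per-tree numeric outputs

# Classification: majority vote (mode) across trees
print(Counter(class_votes).most_common(1)[0][0])  # 1

# Regression: mean of the per-tree predictions
print(mean(regression_preds))                     # 2.7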

Random forests also offer out-of-bag (OOB) error estimation, where each tree is evaluated on the data not included in its bootstrap sample, providing a built-in cross-validation metric.
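
A useful fact behind OOB estimation: the probability that a given sample is absent from a bootstrap dataset of size \( N \) is

\[ \left(1 - \frac{1}{N}\right)^N \approx e^{-1} \approx 0.368, \]

so each tree leaves out roughly 37% of the training samples, and those samples serve as its out-of-bag evaluation set.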

Formulas and Mathematical Foundations

Random forests build on decision trees, where each tree minimizes impurity (e.g., Gini or entropy for classification, MSE for regression) at splits.

  1. Bootstrap Sampling:

    • Given a dataset \( D \) with \( N \) samples, for each tree \( t = 1 \) to \( T \):
      • Sample \( D_t \) (bootstrap dataset) with replacement from \( D \), typically of size \( N \).
  2. Feature Subset Selection:

    • At each node, select a random subset of \( m \) features from total \( p \) features (often \( m = \sqrt{p} \) for classification or \( m = p/3 \) for regression).
  3. Tree Construction:

    • For classification, impurity measures:

      • Gini Impurity


        \[ G = \sum_{k=1}^K p_k (1 - p_k) \]

        where \( p_k \) is the proportion of class \( k \) in the node.

      • Entropy


        \[ E = -\sum_{k=1}^K p_k \log_2(p_k) \]

        where \( p_k \) is the proportion of class \( k \) in the node.

    • Choose the split that maximizes the impurity decrease (equivalently, minimizes the weighted impurity of the children):

      \[ \Delta I = I(\text{parent}) - \sum_{\text{child}} \frac{N_{\text{child}}}{N_{\text{parent}}} I(\text{child}) \]
  4. Ensemble Prediction:

    • For classification:

      \[ \hat{y} = \arg\max_k \left( \frac{1}{T} \sum_{t=1}^T I(\hat{y}_t = k) \right) \]

      where \( I \) is the indicator function, and \( \hat{y}_t \) is the prediction from tree \( t \).

    • For regression:

      \[ \hat{y} = \frac{1}{T} \sum_{t=1}^T \hat{y}_t \]
  5. Out-of-Bag (OOB) Error:

    • For each sample \( i \), predict using only trees where \( i \) was not in the bootstrap sample.
    • OOB error = average error over all samples (e.g., misclassification rate or MSE).
  6. Feature Importance:

    • Often measured by mean decrease in impurity (MDI): Sum of \( \Delta I \) across all splits using that feature, averaged over trees.
    • Or permutation importance: Decrease in model score when feature values are randomly shuffled.

Together, these components control the forest's bias-variance tradeoff: individual deep trees keep bias low, while averaging many decorrelated trees reduces variance.
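
A small numeric sketch of the impurity formulas, using a hypothetical parent node of 10 samples (5 per class) split into children of 6 and 4 samples:

import math

def gini(props):
    # G = sum_k p_k * (1 - p_k)
    return sum(p * (1 - p) for p in props)

def entropy(props):
    # E = -sum_k p_k * log2(p_k), with 0 * log(0) treated as 0
    return -sum(p * math.log2(p) for p in props if p > 0)

# Parent: 5 of class 0 and 5 of class 1; the left child gets 5 class-0 and 1 class-1 samples,
# the right child gets the remaining 4 class-1 samples
parent, left, right = [0.5, 0.5], [5/6, 1/6], [0.0, 1.0]

# Impurity decrease: Delta I = I(parent) - sum_child (N_child / N_parent) * I(child)
delta_gini = gini(parent) - (6/10) * gini(left) - (4/10) * gini(right)
print(round(gini(parent), 3), round(delta_gini, 3))  # 0.5 0.333
print(round(entropy(parent), 3))                     # 1.0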

From Scratch

Implementing a random forest from scratch means building a basic decision tree first and then ensembling several of them. Below is a simplified classification version that uses only the Python standard library (plain lists instead of NumPy arrays). It is intended for educational purposes; real implementations rely on optimized libraries.

We'll assume a binary classification problem, with features given as a list of lists and labels as a list of 0s and 1s. Each tree uses Gini impurity and a random feature subset at every split.

import random
from collections import Counter

# Helper: Calculate Gini impurity
def gini_impurity(y):
    if not y:
        return 0
    counts = Counter(y)
    impurity = 1
    for count in counts.values():
        prob = count / len(y)
        impurity -= prob ** 2
    return impurity

# Helper: Split dataset based on feature index and value
def split_dataset(X, y, feature_idx, value):
    left_X, left_y, right_X, right_y = [], [], [], []
    for i in range(len(X)):
        if X[i][feature_idx] <= value:
            left_X.append(X[i])
            left_y.append(y[i])
        else:
            right_X.append(X[i])
            right_y.append(y[i])
    return left_X, left_y, right_X, right_y

# Decision Tree Node
class Node:
    def __init__(self, feature_idx=None, value=None, left=None, right=None, label=None):
        self.feature_idx = feature_idx
        self.value = value
        self.left = left
        self.right = right
        self.label = label  # Leaf node label

# Build a single decision tree (recursive)
def build_tree(X, y, max_depth, min_samples_split, max_features):
    if len(y) < min_samples_split or max_depth == 0:
        return Node(label=Counter(y).most_common(1)[0][0])

    n_features = len(X[0])
    features = random.sample(range(n_features), max_features)  # Random subset

    best_gini = float('inf')
    best_feature_idx, best_value = None, None
    best_left_X, best_left_y, best_right_X, best_right_y = None, None, None, None

    for feature_idx in features:
        values = sorted(set(row[feature_idx] for row in X))
        for value in values:
            left_X, left_y, right_X, right_y = split_dataset(X, y, feature_idx, value)
            if not left_y or not right_y:
                continue
            p_left = len(left_y) / len(y)
            gini = p_left * gini_impurity(left_y) + (1 - p_left) * gini_impurity(right_y)
            if gini < best_gini:
                best_gini = gini
                best_feature_idx = feature_idx
                best_value = value
                best_left_X, best_left_y = left_X, left_y
                best_right_X, best_right_y = right_X, right_y

    if best_gini == float('inf'):
        return Node(label=Counter(y).most_common(1)[0][0])

    left = build_tree(best_left_X, best_left_y, max_depth - 1, min_samples_split, max_features)
    right = build_tree(best_right_X, best_right_y, max_depth - 1, min_samples_split, max_features)
    return Node(best_feature_idx, best_value, left, right)

# Predict with a single tree
def predict_tree(node, x):
    if node.label is not None:
        return node.label
    if x[node.feature_idx] <= node.value:
        return predict_tree(node.left, x)
    else:
        return predict_tree(node.right, x)

# Random Forest class
class RandomForest:
    def __init__(self, n_estimators=10, max_depth=5, min_samples_split=2, max_features='sqrt'):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.max_features = max_features
        self.trees = []

    def fit(self, X, y):
        n_samples = len(X)
        n_features = len(X[0])
        max_features = int(n_features ** 0.5) if self.max_features == 'sqrt' else self.max_features

        for _ in range(self.n_estimators):
            # Bootstrap sample
            bootstrap_idx = [random.randint(0, n_samples - 1) for _ in range(n_samples)]
            X_boot = [X[i] for i in bootstrap_idx]
            y_boot = [y[i] for i in bootstrap_idx]
            tree = build_tree(X_boot, y_boot, self.max_depth, self.min_samples_split, max_features)
            self.trees.append(tree)

    def predict(self, X):
        predictions = []
        for x in X:
            tree_preds = [predict_tree(tree, x) for tree in self.trees]
            predictions.append(Counter(tree_preds).most_common(1)[0][0])
        return predictions

# Example usage
# X = [[0, 0], [1, 1], [1, 0], [0, 1]]  # Features
# y = [0, 1, 0, 1]  # Labels
# rf = RandomForest(n_estimators=3, max_depth=2)
# rf.fit(X, y)
# print(rf.predict([[0.5, 0.5]]))  # Output: [1] or similar

This implementation is intentionally basic: it tries every unique feature value as a split threshold (slow for continuous features), performs no input validation, and has no OOB estimation. For real use, add error handling and rely on optimized libraries.
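
One possible way to add the missing OOB estimate is to record which samples each tree never saw and vote only with those trees. The sketch below reuses build_tree, predict_tree, and the imports from above; it is an illustrative extension, not part of the original class:

def oob_error(X, y, n_estimators=10, max_depth=5, min_samples_split=2):
    n_samples = len(X)
    max_features = int(len(X[0]) ** 0.5)
    trees, oob_sets = [], []
    for _ in range(n_estimators):
        # Bootstrap sample, remembering which indices were drawn
        idx = [random.randint(0, n_samples - 1) for _ in range(n_samples)]
        trees.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                max_depth, min_samples_split, max_features))
        oob_sets.append(set(range(n_samples)) - set(idx))  # samples this tree never saw

    errors, counted = 0, 0
    for i in range(n_samples):
        # Vote only with trees for which sample i was out-of-bag
        votes = [predict_tree(t, X[i]) for t, oob in zip(trees, oob_sets) if i in oob]
        if not votes:
            continue  # sample ended up in every bootstrap; skip it
        counted += 1
        if Counter(votes).most_common(1)[0][0] != y[i]:
            errors += 1
    return errors / counted if counted else 0.0  # OOB misclassification rate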

With Library

For practical applications, use scikit-learn's RandomForestClassifier or RandomForestRegressor. They handle bootstrapping, feature subsampling, and prediction efficiently, and can build trees in parallel via the n_jobs parameter.

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load example dataset (Iris for classification)
iris = load_iris()
X = iris.data
y = iris.target

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
rf = RandomForestClassifier(n_estimators=100,  # Number of trees
                            max_depth=5,       # Max depth of trees
                            max_features='sqrt',  # Features per split
                            random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
predictions = rf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions)}")

# Feature importances
print(f"Feature Importances: {rf.feature_importances_}")

This uses the Iris dataset for demonstration. You can replace it with your data. scikit-learn handles bootstrapping, randomness, and predictions automatically. For regression, swap to RandomForestRegressor and use metrics like MSE.
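
Two related features are worth knowing: passing oob_score=True makes the classifier compute the built-in OOB accuracy estimate described earlier, and sklearn.inspection.permutation_importance computes the permutation-based importances. A minimal sketch, continuing from the variables defined above:

from sklearn.inspection import permutation_importance

# Out-of-bag accuracy estimate (no separate validation set needed)
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(X_train, y_train)
print(f"OOB score: {rf_oob.oob_score_:.3f}")

# Permutation importance: average drop in test score when each feature is shuffled
perm = permutation_importance(rf_oob, X_test, y_test, n_repeats=10, random_state=42)
print(f"Permutation importances: {perm.importances_mean}")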

Additional


Exercise

Submission

📆 Oct 28 🕒 23:59

Individual

Submit the link via Canvas.

From the available datasets, choose one whose goal is to predict a categorical variable (classification). Use the Random Forest algorithm to train a model and evaluate its performance.

Use the pandas, numpy, matplotlib, and scikit-learn libraries to support the development of the project.

Submission is via Canvas - Exercício Random Forest. Only links to public GitHub repositories containing the project documentation (report) and code will be accepted, following the template-projeto-integrador example. THIS EXERCISE IS INDIVIDUAL.

The submission must include the following steps:

Step | Criterion | Description | Points
-----|-----------|-------------|-------
1 | Data Exploration | Initial analysis of the dataset, including an explanation of the nature of the data, visualizations, and descriptive statistics. | 20
2 | Preprocessing | Data cleaning, handling of missing values, and normalization. | 10
3 | Data Split | Splitting the dataset into training and test sets. | 20
4 | Model Training | Implementation of the Random Forest model. | 10
5 | Model Evaluation | Assessment of model performance using appropriate metrics. | 20
6 | Final Report | Documentation of the process, the results obtained, and possible improvements. Mandatory: use of the template-projeto-integrador; individual. | 20