USAAIO 2024 R1 · Problem 3 · Decision trees & ensembles [reconstructed]

Contest: 2024 USA-NA-AIO Round 1 · Round: Round 1 (online) · Category: Classical ML / trees / bagging / boosting.

Official sources: usaaio.org/past-problems · USAAIO syllabus.

Reconstruction notice. [reconstructed] from the syllabus.

1. Problem restatement

Multi-part: define Gini impurity and entropy; compute both for a small synthetic split; implement a decision-tree splitter from scratch; train and prune a tree on a tabular dataset; build a small random-forest by bagging; add boosting (AdaBoost or gradient boosting); compare all four (single tree, pruned tree, RF, GBM) on the same data.

2. What's being tested

Information-theoretic intuition. Entropy as expected bits-of-surprise; Gini as expected misclassification.
Recursive algorithm design. Tree building is the cleanest "recursive function with a base case" any student will write before college.
Variance reduction. Why bagging works and what bootstrap sampling does.
Boosting intuition. Why fitting residuals beats fitting raw labels with a deeper tree.

3. Data exploration / setup

from sklearn.datasets import load_breast_cancer
data = load_breast_cancer(as_frame=True)
X, y = data.data.values, data.target.values
print(X.shape, y.mean())

4. Baseline approach

import numpy as np

def gini(y):
    if len(y) == 0: return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - (p ** 2).sum()

def best_split(X, y):
    best = None
    base = gini(y)
    for j in range(X.shape[1]):
        thresholds = np.unique(X[:, j])
        for t in thresholds[1:]:
            left = X[:, j] < t
            if left.sum() == 0 or left.sum() == len(y): continue
            g = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
            gain = base - g
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best

def build_tree(X, y, depth=0, max_depth=4):
    if depth == max_depth or len(np.unique(y)) == 1:
        return {"leaf": np.bincount(y).argmax()}
    s = best_split(X, y)
    if s is None: return {"leaf": np.bincount(y).argmax()}
    _, j, t = s
    left = X[:, j] < t
    return {"j": j, "t": t,
            "L": build_tree(X[left], y[left], depth + 1, max_depth),
            "R": build_tree(X[~left], y[~left], depth + 1, max_depth)}

5. Improvements that move the needle

5.1 · Vectorise the split scan

The naive loop above is O(n·d·k). Sort each column once, scan in cumulative-sum fashion, get O(n·d·log n) total. Late parts of the problem stress this; mention complexity in the write-up.

5.2 · Pruning by minimal-cost-complexity

Build the full tree, then prune leaves whose removal least increases validation error. Equivalent to sklearn's ccp_alpha knob.

5.3 · Bagging

def bag(X, y, n_trees=50, max_depth=8):
    rng = np.random.default_rng(0)
    n = len(y); forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, n)             # bootstrap
        forest.append(build_tree(X[idx], y[idx], 0, max_depth))
    return forest

Average tree predictions. Reduces variance without increasing bias.

5.4 · Feature subsampling per split

At each split, consider only sqrt(d) random features. Decorrelates trees and lifts ensemble accuracy 1–2 points over plain bagging.

5.5 · Gradient boosting on residuals

Fit shallow trees sequentially on residuals from previous predictions with a learning rate of 0.05. Plot training loss vs number of trees.

6. Submission format & gotchas

Show the Gini / entropy calculation step by step in the hand parts.
From-scratch implementation parts forbid sklearn; sklearn is allowed for the comparison parts only.
Random seeds: bagging is variance-dependent, so seed your RNG and report seed in the write-up.
Discuss why a single deep tree overfits while a forest of equally deep trees does not.

7. What top solutions did

[reconstructed] Full-marks pattern: clean recursive tree code; vectorised split scan; pruning + bagging + boosting all implemented from scratch; an honest comparison table of (model, train acc, val acc, parameter count); short paragraph on bias-variance.

8. Drill

D · Why does bagging reduce variance but not bias?

Each tree has roughly the same bias (they're all deep enough to fit the data on their bootstrap sample). Averaging predictions of B independent trees reduces variance by 1/B for fully independent trees, and by (1 − ρ)/B + ρ for correlated trees with correlation ρ. So bagging directly attacks variance but doesn't touch bias — that's why bagged deep trees beat single deep trees and bagged shallow trees do not beat single shallow trees by much.

D2 · Gradient boosting with depth-1 trees ("stumps"). Why does it work?

Each stump is a high-bias, low-variance learner. Boosting fits stumps to the gradient of the loss with respect to the current prediction (the residuals for squared loss). After B rounds, the cumulative ensemble approximates the negative-gradient direction in function space, like a coordinate-descent step in feature space. Bias drops fast because each round adds a new direction; variance stays low because each stump only sees one feature. Final accuracy often beats deep trees with no boosting.

← USAAIO 2024 Round 1 set