USAAIO 2024 R1 · Problem 3 · Decision trees & ensembles [reconstructed]
Contest: 2024 USA-NA-AIO Round 1 · Round: Round 1 (online) · Category: Classical ML / trees / bagging / boosting.
Official sources: usaaio.org/past-problems · USAAIO syllabus.
1. Problem restatement
Multi-part: define Gini impurity and entropy; compute both for a small synthetic split; implement a decision-tree splitter from scratch; train and prune a tree on a tabular dataset; build a small random-forest by bagging; add boosting (AdaBoost or gradient boosting); compare all four (single tree, pruned tree, RF, GBM) on the same data.
2. What's being tested
- Information-theoretic intuition. Entropy as expected bits-of-surprise; Gini as expected misclassification.
- Recursive algorithm design. Tree building is the cleanest "recursive function with a base case" any student will write before college.
- Variance reduction. Why bagging works and what bootstrap sampling does.
- Boosting intuition. Why fitting residuals beats fitting raw labels with a deeper tree.
3. Data exploration / setup
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer(as_frame=True)
X, y = data.data.values, data.target.values
print(X.shape, y.mean())
4. Baseline approach
import numpy as np
def gini(y):
if len(y) == 0: return 0.0
p = np.bincount(y) / len(y)
return 1.0 - (p ** 2).sum()
def best_split(X, y):
best = None
base = gini(y)
for j in range(X.shape[1]):
thresholds = np.unique(X[:, j])
for t in thresholds[1:]:
left = X[:, j] < t
if left.sum() == 0 or left.sum() == len(y): continue
g = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
gain = base - g
if best is None or gain > best[0]:
best = (gain, j, t)
return best
def build_tree(X, y, depth=0, max_depth=4):
if depth == max_depth or len(np.unique(y)) == 1:
return {"leaf": np.bincount(y).argmax()}
s = best_split(X, y)
if s is None: return {"leaf": np.bincount(y).argmax()}
_, j, t = s
left = X[:, j] < t
return {"j": j, "t": t,
"L": build_tree(X[left], y[left], depth + 1, max_depth),
"R": build_tree(X[~left], y[~left], depth + 1, max_depth)}
5. Improvements that move the needle
5.1 · Vectorise the split scan
The naive loop above is O(n·d·k). Sort each column once, scan in cumulative-sum fashion, get O(n·d·log n) total. Late parts of the problem stress this; mention complexity in the write-up.
5.2 · Pruning by minimal-cost-complexity
Build the full tree, then prune leaves whose removal least increases validation error.
Equivalent to sklearn's ccp_alpha knob.
5.3 · Bagging
def bag(X, y, n_trees=50, max_depth=8):
rng = np.random.default_rng(0)
n = len(y); forest = []
for _ in range(n_trees):
idx = rng.integers(0, n, n) # bootstrap
forest.append(build_tree(X[idx], y[idx], 0, max_depth))
return forest
Average tree predictions. Reduces variance without increasing bias.
5.4 · Feature subsampling per split
At each split, consider only sqrt(d) random features. Decorrelates trees and lifts ensemble accuracy 1–2 points over plain bagging.
5.5 · Gradient boosting on residuals
Fit shallow trees sequentially on residuals from previous predictions with a learning rate of 0.05. Plot training loss vs number of trees.
6. Submission format & gotchas
- Show the Gini / entropy calculation step by step in the hand parts.
- From-scratch implementation parts forbid sklearn; sklearn is allowed for the comparison parts only.
- Random seeds: bagging is variance-dependent, so seed your RNG and report seed in the write-up.
- Discuss why a single deep tree overfits while a forest of equally deep trees does not.
7. What top solutions did
[reconstructed] Full-marks pattern: clean recursive tree code; vectorised split scan; pruning + bagging + boosting all implemented from scratch; an honest comparison table of (model, train acc, val acc, parameter count); short paragraph on bias-variance.
8. Drill
D · Why does bagging reduce variance but not bias?
Each tree has roughly the same bias (they're all deep enough to fit the data on their bootstrap sample). Averaging predictions of B independent trees reduces variance by 1/B for fully independent trees, and by (1 − ρ)/B + ρ for correlated trees with correlation ρ. So bagging directly attacks variance but doesn't touch bias — that's why bagged deep trees beat single deep trees and bagged shallow trees do not beat single shallow trees by much.
D2 · Gradient boosting with depth-1 trees ("stumps"). Why does it work?
Each stump is a high-bias, low-variance learner. Boosting fits stumps to the gradient of the loss with respect to the current prediction (the residuals for squared loss). After B rounds, the cumulative ensemble approximates the negative-gradient direction in function space, like a coordinate-descent step in feature space. Bias drops fast because each round adds a new direction; variance stays low because each stump only sees one feature. Final accuracy often beats deep trees with no boosting.