USAAIO 2024 R1 · Problem 4 · MLP for MNIST-style classification [reconstructed]
Contest: 2024 USA-NA-AIO Round 1 · Round: Round 1 (online) · Category: Deep learning / MLP from scratch.
Official sources: usaaio.org/past-problems · USAAIO syllabus.
1. Problem restatement
Multi-part: build a 2-layer MLP from scratch in NumPy (forward + backward), train on MNIST (or a small fashion-MNIST subset), report training/validation curves, then re-implement in PyTorch and compare. Late parts ask for regularisation (dropout, weight decay) and an analysis of the effect of hidden-layer width.
2. What's being tested
- From-scratch backprop. Cross-entropy + softmax derivative shortcut.
- Numerical stability. Why softmax requires subtracting the max.
- SGD with momentum. Implementing and tuning it.
- Regularisation. Dropout, weight decay, early stopping.
3. Data exploration / setup
import numpy as np
from sklearn.datasets import fetch_openml
mnist = fetch_openml("mnist_784", as_frame=False, parser="liac-arff")
X = mnist.data.astype(np.float32) / 255.0
y = mnist.target.astype(int)
print(X.shape, y.shape, np.bincount(y))
4. Baseline approach
def softmax(z):
z = z - z.max(1, keepdims=True)
e = np.exp(z)
return e / e.sum(1, keepdims=True)
def init(D, H, K, seed=0):
rng = np.random.default_rng(seed)
W1 = rng.normal(0, np.sqrt(2/D), (D, H)); b1 = np.zeros(H)
W2 = rng.normal(0, np.sqrt(2/H), (H, K)); b2 = np.zeros(K)
return W1, b1, W2, b2
def train(X, y, H=128, lr=0.1, epochs=20, batch=128):
D, K = X.shape[1], int(y.max()) + 1
W1, b1, W2, b2 = init(D, H, K)
Y = np.eye(K)[y]
n = len(X); rng = np.random.default_rng(0)
for _ in range(epochs):
idx = rng.permutation(n)
for i in range(0, n, batch):
ix = idx[i:i+batch]
Xb, Yb = X[ix], Y[ix]
Z1 = Xb @ W1 + b1; A1 = np.maximum(0, Z1)
P = softmax(A1 @ W2 + b2)
dZ2 = (P - Yb) / batch
dW2 = A1.T @ dZ2; db2 = dZ2.sum(0)
dZ1 = (dZ2 @ W2.T) * (Z1 > 0)
dW1 = Xb.T @ dZ1; db1 = dZ1.sum(0)
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
return W1, b1, W2, b2
Baseline accuracy: ~97% on MNIST with H=128. [illustrative]
5. Improvements that move the needle
5.1 · Momentum
Maintain a velocity buffer per parameter. v ← μv − lr·g; θ ← θ + v. μ=0.9 is the
standard. Often 2× faster convergence to the same accuracy.
5.2 · Dropout
During training only, mask hidden activations at probability p=0.2–0.5. Rescale by 1/(1-p) so the inference-time forward pass doesn't need the mask. Reliable 0.5–1 point lift on small datasets.
5.3 · Weight decay (L2)
Add lambda·W to the gradient of W. lambda=1e-4 to 1e-3 is typical.
5.4 · Learning-rate schedule
Step decay (×0.1 every 10 epochs) or cosine annealing. Plot val accuracy with and without to illustrate.
5.5 · Compare with PyTorch nn.Sequential
Late parts ask for a PyTorch port. Cross-check accuracy is within 0.5 points of your NumPy implementation; differences usually trace back to optimiser defaults (PyTorch SGD has no momentum by default; specify it).
6. Submission format & gotchas
- Softmax must subtract max for numerical stability. Without it, large logits overflow to inf.
- Loss numerical stability: combine softmax + cross-entropy into a single
−log P_correctcomputation rather than computing each separately. - Track training and validation loss separately; plot both curves.
- Random seeds for reproducibility — graders rerun.
7. What top solutions did
[reconstructed] Full-marks pattern: from-scratch MLP with numerical stability tricks, momentum, dropout, weight decay; PyTorch port with matching accuracy; loss curves; short analysis on hidden width.
8. Drill
D · Derive ∂L/∂Z₂ for softmax + cross-entropy.
Loss is L = −Σ Y_k log P_k with P = softmax(Z₂). Compute
∂L/∂Z₂_j = Σ_k (∂L/∂P_k)(∂P_k/∂Z₂_j) = Σ_k (−Y_k / P_k) · P_k(δ_{kj} − P_j)
= −Y_j + P_j Σ_k Y_k = P_j − Y_j (since labels sum to 1). So
∂L/∂Z₂ = P − Y — beautifully clean and the reason softmax + CE is the canonical
classification loss. Dividing by batch size N (if loss is averaged) gives the SGD gradient.
D2 · Your training accuracy is 99.9% but val accuracy is 92%. List three fixes.
Classical overfitting. (1) Add dropout (p=0.3 in the hidden layer); (2) add L2 weight decay (lambda=1e-3); (3) early-stop on val accuracy. Optional bonus: reduce hidden width slightly, or augment the data with small random translations/rotations. Don't increase model capacity — that almost certainly makes the gap worse.