USAAIO 2024 R1 · Problem 4 · MLP for MNIST-style classification [reconstructed]

Contest: 2024 USA-NA-AIO Round 1 · Round: Round 1 (online) · Category: Deep learning / MLP from scratch.

Official sources: usaaio.org/past-problems · USAAIO syllabus.

Reconstruction notice. [reconstructed] from the syllabus.

1. Problem restatement

Multi-part: build a 2-layer MLP from scratch in NumPy (forward + backward), train on MNIST (or a small fashion-MNIST subset), report training/validation curves, then re-implement in PyTorch and compare. Late parts ask for regularisation (dropout, weight decay) and an analysis of the effect of hidden-layer width.

2. What's being tested

From-scratch backprop. Cross-entropy + softmax derivative shortcut.
Numerical stability. Why softmax requires subtracting the max.
SGD with momentum. Implementing and tuning it.
Regularisation. Dropout, weight decay, early stopping.

3. Data exploration / setup

import numpy as np
from sklearn.datasets import fetch_openml
mnist = fetch_openml("mnist_784", as_frame=False, parser="liac-arff")
X = mnist.data.astype(np.float32) / 255.0
y = mnist.target.astype(int)
print(X.shape, y.shape, np.bincount(y))

4. Baseline approach

def softmax(z):
    z = z - z.max(1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(1, keepdims=True)

def init(D, H, K, seed=0):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, np.sqrt(2/D), (D, H)); b1 = np.zeros(H)
    W2 = rng.normal(0, np.sqrt(2/H), (H, K)); b2 = np.zeros(K)
    return W1, b1, W2, b2

def train(X, y, H=128, lr=0.1, epochs=20, batch=128):
    D, K = X.shape[1], int(y.max()) + 1
    W1, b1, W2, b2 = init(D, H, K)
    Y = np.eye(K)[y]
    n = len(X); rng = np.random.default_rng(0)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for i in range(0, n, batch):
            ix = idx[i:i+batch]
            Xb, Yb = X[ix], Y[ix]
            Z1 = Xb @ W1 + b1; A1 = np.maximum(0, Z1)
            P  = softmax(A1 @ W2 + b2)
            dZ2 = (P - Yb) / batch
            dW2 = A1.T @ dZ2; db2 = dZ2.sum(0)
            dZ1 = (dZ2 @ W2.T) * (Z1 > 0)
            dW1 = Xb.T @ dZ1; db1 = dZ1.sum(0)
            W1 -= lr * dW1; b1 -= lr * db1
            W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2

Baseline accuracy: ~97% on MNIST with H=128. [illustrative]

5. Improvements that move the needle

5.1 · Momentum

Maintain a velocity buffer per parameter. v ← μv − lr·g; θ ← θ + v. μ=0.9 is the standard. Often 2× faster convergence to the same accuracy.

5.2 · Dropout

During training only, mask hidden activations at probability p=0.2–0.5. Rescale by 1/(1-p) so the inference-time forward pass doesn't need the mask. Reliable 0.5–1 point lift on small datasets.

5.3 · Weight decay (L2)

Add lambda·W to the gradient of W. lambda=1e-4 to 1e-3 is typical.

5.4 · Learning-rate schedule

Step decay (×0.1 every 10 epochs) or cosine annealing. Plot val accuracy with and without to illustrate.

5.5 · Compare with PyTorch nn.Sequential

Late parts ask for a PyTorch port. Cross-check accuracy is within 0.5 points of your NumPy implementation; differences usually trace back to optimiser defaults (PyTorch SGD has no momentum by default; specify it).

6. Submission format & gotchas

Softmax must subtract max for numerical stability. Without it, large logits overflow to inf.
Loss numerical stability: combine softmax + cross-entropy into a single −log P_correct computation rather than computing each separately.
Track training and validation loss separately; plot both curves.
Random seeds for reproducibility — graders rerun.

7. What top solutions did

[reconstructed] Full-marks pattern: from-scratch MLP with numerical stability tricks, momentum, dropout, weight decay; PyTorch port with matching accuracy; loss curves; short analysis on hidden width.

8. Drill

D · Derive ∂L/∂Z₂ for softmax + cross-entropy.

Loss is L = −Σ Y_k log P_k with P = softmax(Z₂). Compute ∂L/∂Z₂_j = Σ_k (∂L/∂P_k)(∂P_k/∂Z₂_j) = Σ_k (−Y_k / P_k) · P_k(δ_{kj} − P_j) = −Y_j + P_j Σ_k Y_k = P_j − Y_j (since labels sum to 1). So ∂L/∂Z₂ = P − Y — beautifully clean and the reason softmax + CE is the canonical classification loss. Dividing by batch size N (if loss is averaged) gives the SGD gradient.

D2 · Your training accuracy is 99.9% but val accuracy is 92%. List three fixes.

Classical overfitting. (1) Add dropout (p=0.3 in the hidden layer); (2) add L2 weight decay (lambda=1e-3); (3) early-stop on val accuracy. Optional bonus: reduce hidden width slightly, or augment the data with small random translations/rotations. Don't increase model capacity — that almost certainly makes the gap worse.

← USAAIO 2024 Round 1 set