USAAIO 2025 Round 1 · Problem 2 · Affine transformations & NN fundamentals
Contest: 2025 USA-NA-AIO Round 1 · Round: Round 1 (online) · Category: Linear algebra / neural network basics.
Official sources: usaaio.org/past-problems · P2 Part 1 forum thread.
1. Problem restatement
The full problem is split across many parts (13+ on the forum). Part 1 is small and verbatim from
the thread: given a weight matrix W and bias b, identify the input/output
dimensions of the affine transformation y = Wx + b, then compute y by
hand for a specific x. Later parts build from this base: compose two affine layers,
add a nonlinearity, derive a gradient by hand, implement forward/backward in NumPy, and finally
train a tiny classifier on a supplied dataset.
2. What's being tested
- Manual matrix arithmetic. Multiply small matrices and add biases without a calculator. Sounds trivial; many students lose easy points by sign errors.
- Composition of affine maps. Two affine layers without a nonlinearity collapse to a single affine layer — a fact later parts hammer on with an exercise.
- Backprop by hand. Given a tiny network, write out
dL/dWanddL/db. - NumPy implementation. Stand up a 2-layer MLP with forward, backward, and an SGD loop in ≤ 50 lines.
Connects directly to Math (matrices, gradients) and Deep Learning (forward/backward, SGD).
3. Data exploration / setup
Most parts of P2 do not need a dataset — they're pencil-and-paper. The late parts use a small synthetic 2D dataset (two-moons, three-class blob, similar). Generate it locally:
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
4. Baseline approach
Part 1 is a one-liner. Read the matrix shapes off the dimensions, then compute.
import numpy as np
W = np.array([[1, -2, 0], [3, 1, -1]]) # 2x3
b = np.array([0, 1]) # 2
x = np.array([1, 2, -1]) # 3
y = W @ x + b
print(y) # [-3, 5+1] -> [-3, 6]
Input dim N = 3, output dim M = 2. The same conventions extend to every later part — keep dimensions written down explicitly.
5. Improvements that move the needle
5.1 · Annotate shapes everywhere
Comment every line of code with shapes: # X: (N, D), W1: (D, H), Z: (N, H). This single
habit eliminates 90% of off-by-one transpose bugs that cost partial credit.
5.2 · Implement forward and backward as pure NumPy
def forward(X, W1, b1, W2, b2):
Z1 = X @ W1 + b1 # (N, H)
A1 = np.maximum(0, Z1) # ReLU
Z2 = A1 @ W2 + b2 # (N, K)
P = softmax(Z2) # (N, K)
return Z1, A1, Z2, P
def backward(X, Y, Z1, A1, Z2, P, W2):
N = X.shape[0]
dZ2 = (P - Y) / N
dW2 = A1.T @ dZ2
db2 = dZ2.sum(0)
dA1 = dZ2 @ W2.T
dZ1 = dA1 * (Z1 > 0)
dW1 = X.T @ dZ1
db1 = dZ1.sum(0)
return dW1, db1, dW2, db2
5.3 · Cross-check gradient with finite differences
For one weight at a time, perturb by ε = 1e-5, compute (L(+) − L(−))/(2ε), compare to your analytic gradient. If max relative error < 1e-5 you're correct. This is a standard sanity check and the grader rewards code that includes it.
5.4 · Initialise weights properly (Glorot / He)
Zero-initialising weights kills training. For a ReLU network use He init:
W ~ N(0, 2/fan_in). Late parts of P2 fail silently with bad init.
5.5 · Show the training curve
Even if the question only asks for final accuracy, plot loss vs epoch. Graders for theory-coding problems often allocate a "write-up quality" portion, and a plot scores it cheaply.
6. Submission format & gotchas
- Each part is graded separately on the forum. Post your answer to each part inline in the designated thread — don't lump multiple parts into one reply.
- For hand-computation parts, show intermediate steps. Wrong final answer with correct method earns partial credit; correct answer alone with no work shown sometimes does not.
- For code parts, paste the function plus a usage example that prints expected output.
- Use the variable names the problem uses. The grader pattern-matches.
7. What top solutions did
Public solutions on the forum show that perfect scores on P2 come from disciplined documentation: shape annotations on every line, a gradient-check cell included, He-initialised weights, and a saved loss curve. Mathematical write-ups use LaTeX in the forum's MathJax. No clever tricks — the problem rewards mechanical correctness. [illustrative]
8. Drill
D · Two affine layers without a nonlinearity. Why is this a single affine layer?
Let z = W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2). Define
W' = W2 W1, b' = W2 b1 + b2. Then z = W' x + b', exactly
one affine map. This is why you need a nonlinearity between linear layers — without it, no matter
how deep, the network is a single matrix in disguise.
D2 · Your gradient check fails by a factor of N. What's wrong?
You almost certainly forgot to divide by batch size in the loss or in the gradient.
If L = mean(...) then dL/dZ = (P - Y) / N. If L = sum(...)
then dL/dZ = P - Y. Match the loss reduction and the gradient reduction. A factor of
N is the canonical "off-by-batch-size" bug.