← USAAIO 2025 Round 1 set

USAAIO 2025 Round 1 · Problem 5 · Transformer attention & a tiny encoder [reconstructed]

Contest: 2025 USA-NA-AIO Round 1 · Round: Round 1 (online) · Category: Transformers / attention mechanism.

Official sources: usaaio.org/past-problems · 2025 Round 1 forum.

Reconstruction notice. P5 here is a [reconstructed] skeleton based on the Round-1 syllabus, since only Problems 1–3 are visible on the forum. Update once the official PDF publishes.

1. Problem restatement

The problem progresses from the math of scaled dot-product attention through a hand-built single-head attention layer, on to multi-head attention and finally a 2-layer encoder applied to a small text-classification dataset. Sub-parts ask:

  1. Given Q, K, V (small matrices), compute the attention output by hand.
  2. Explain the role of the 1/√d_k scaling factor.
  3. Implement single-head attention in NumPy.
  4. Extend to multi-head attention in PyTorch.
  5. Build a 2-layer encoder with residuals + LayerNorm.
  6. Train on a tiny corpus and report accuracy + attention-map plots.

2. What's being tested

3. Data exploration / setup

import torch, torch.nn as nn, torch.nn.functional as F

# tiny synthetic classification dataset
torch.manual_seed(0)
vocab = 1000; seq_len = 24; nc = 5
X = torch.randint(0, vocab, (2000, seq_len))
y = torch.randint(0, nc, (2000,))

4. Baseline approach

Single-head attention in NumPy, by hand:

import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)              # (Lq, Lk)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V                              # (Lq, d_v)

Q = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
K = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
V = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
print(attention(Q, K, V))

Then in PyTorch with multi-head:

class MHA(nn.Module):
    def __init__(self, d, h):
        super().__init__()
        self.h, self.d = h, d // h
        self.qkv = nn.Linear(d, 3 * d)
        self.proj = nn.Linear(d, d)
    def forward(self, x):
        B, L, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, L, self.h, self.d).transpose(1, 2) for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / self.d ** 0.5
        att = att.softmax(-1)
        out = (att @ v).transpose(1, 2).contiguous().view(B, L, -1)
        return self.proj(out)

5. Improvements that move the needle

5.1 · Pre-norm rather than post-norm

Apply LayerNorm before each sublayer rather than after. Trains more stably without learning-rate warmup at the small scale Round-1 problems use.

5.2 · Sinusoidal positional encoding done from scratch

def pos_enc(L, d):
    pe = torch.zeros(L, d)
    pos = torch.arange(L).unsqueeze(1)
    div = torch.exp(torch.arange(0, d, 2) * -(np.log(10000.0) / d))
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

5.3 · Plot the attention weights

After training, take a batch and visualise the attention matrix from layer 2, head 0 as a heatmap. Graders reward visualisations that confirm the model "attends to" the parts you'd expect.

5.4 · Sanity-check on a copy task before the real data

Train your encoder on a "copy" task (output the input sequence). If it converges to near-zero loss, your attention + residual plumbing is correct. If not, fix that before touching the real task.

5.5 · Print parameter counts and per-layer activations

A short paragraph reporting head dim, layer count, FFN expansion ratio, parameter count, and a histogram of attention entropies is the kind of writeup that earns the analysis points.

6. Submission format & gotchas

7. What top solutions did

[reconstructed — verify against published solutions] Top P5 submissions on similar problems: hand-computations done with explicit intermediate matrices; clean from-scratch MHA + pre-norm encoder; copy-task sanity check; attention-map plots; a short paragraph linking the 1/√d scaling to gradient stability. Parameter counts kept under 100k — there's no need for a BERT-sized model.

8. Drill

D · Why divide by √d_k in scaled dot-product attention?

When Q and K are vectors of dimension d_k with entries drawn i.i.d. with variance 1, the dot product q · k has variance d_k — so its standard deviation grows like √d_k. Without scaling, softmax over large scores saturates: one position gets ~1 weight, others ~0, gradients vanish. Dividing by √d_k normalises the variance back to 1 and keeps the softmax in its useful regime. Try it: implement attention without the scaling, check that for d=64 training stalls in a few hundred steps.

D2 · Why is LayerNorm preferred over BatchNorm in transformers?

LayerNorm normalises across feature dimensions for each token independently — sequence-length changes don't break it, and there's no batch-statistics machinery. BatchNorm normalises across the batch, which (a) breaks on tiny batches, (b) needs running-mean bookkeeping at inference, and (c) couples tokens in a batch that should be independent. For autoregressive decoders it also leaks information across positions. LayerNorm dodges all three issues.

← USAAIO 2025 Round 1 set