Problem sets by topic
Original USAAIO-style theory questions and short coding tasks, organized by topic. Use these between mock contests to drill a specific weakness. For the actual past problems, always cross-reference usaaio.org.
Math
M-1 · Linear algebra
Let A be a 3×3 matrix with eigenvalues 1, 2, and 4. What is det(A)? What is trace(A)?
Answer: det(A) = 1 · 2 · 4 = 8; trace(A) = 1 + 2 + 4 = 7. The determinant equals the product of eigenvalues and the trace equals the sum.
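Both identities are easy to check numerically. A minimal sketch, using a diagonal matrix as the simplest concrete matrix with those eigenvalues:

```python
import numpy as np

# Any matrix with eigenvalues 1, 2, 4 works; diagonal is the simplest instance.
A = np.diag([1.0, 2.0, 4.0])

det_A = np.linalg.det(A)   # product of eigenvalues -> 8
trace_A = np.trace(A)      # sum of eigenvalues -> 7
```

The same check passes for any similarity transform B = P A P⁻¹, since eigenvalues are invariant under change of basis.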
M-2 · PCA
You compute the covariance matrix of a centered dataset and find eigenvalues 12, 6, 1.5, 0.5. What fraction of the total variance is captured by the top 2 principal components?
Answer: (12 + 6) / (12 + 6 + 1.5 + 0.5) = 18 / 20 = 0.9, i.e. 90%.
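A one-line sketch of the explained-variance ratio, assuming the eigenvalues are already sorted in descending order:

```python
import numpy as np

eigvals = np.array([12.0, 6.0, 1.5, 0.5])  # covariance-matrix eigenvalues
explained = eigvals[:2].sum() / eigvals.sum()  # fraction captured by top 2 PCs
```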
M-3 · Probability
A disease occurs in 1% of a population. A test is 95% sensitive (true-positive rate) and 90% specific (true-negative rate). A randomly selected person tests positive. What is the probability that they actually have the disease?
Answer: Use Bayes' rule.
P(D | +) = P(+ | D) P(D) / P(+).
Numerator = 0.95 × 0.01 = 0.0095.
P(+) = 0.95 × 0.01 + 0.10 × 0.99 = 0.0095 + 0.099 = 0.1085.
P(D | +) ≈ 0.0876, i.e. about 8.8%.
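The arithmetic above can be checked directly:

```python
p_disease = 0.01
sensitivity = 0.95   # P(+ | D)
specificity = 0.90   # P(- | no D), so the false-positive rate is 0.10

# Law of total probability for P(+), then Bayes' rule.
p_pos = sensitivity * p_disease + (1 - specificity) * (1 - p_disease)
p_disease_given_pos = sensitivity * p_disease / p_pos
```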
M-4 · Gradient descent
For f(x, y) = x² + 4y², compute the gradient at (1, 1). If you take one gradient-descent step with learning rate η = 0.1, what is the new point?
Answer: ∇f = (2x, 8y) = (2, 8) at (1, 1). New point: (1, 1) − 0.1 (2, 8) = (0.8, 0.2).
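A sketch of the same single step in NumPy:

```python
import numpy as np

def grad_f(p):
    # gradient of f(x, y) = x^2 + 4 y^2
    x, y = p
    return np.array([2 * x, 8 * y])

p = np.array([1.0, 1.0])
eta = 0.1
p_new = p - eta * grad_f(p)  # one gradient-descent step
```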
M-5 · Maximum likelihood
You observe n i.i.d. samples x₁, …, xₙ from a Bernoulli(p) distribution. What is the maximum-likelihood estimate of p?
Answer: p̂ = (1/n) Σ xᵢ — the sample mean. Derive by maximizing the log-likelihood Σ xᵢ log p + (1 − xᵢ) log(1 − p).
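A numerical sanity check of the derivation: the sample mean should coincide with a brute-force maximization of the log-likelihood over a grid (the true p = 0.3 here is an arbitrary choice for the simulation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=10_000)   # i.i.d. Bernoulli(0.3) samples

p_hat = x.mean()                        # closed-form MLE

# Brute force: evaluate the log-likelihood on a grid and take the argmax.
grid = np.linspace(0.01, 0.99, 999)
loglik = x.sum() * np.log(grid) + (len(x) - x.sum()) * np.log(1 - grid)
p_grid = grid[np.argmax(loglik)]
```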
Python / data wrangling
P-1 · NumPy: row-wise softmax
Implement a numerically stable row-wise softmax for a 2-D NumPy array X of shape (n, k).
```python
import numpy as np

def softmax_rows(X):
    # subtract per-row max for numerical stability
    X = X - X.max(axis=1, keepdims=True)
    expX = np.exp(X)
    return expX / expX.sum(axis=1, keepdims=True)
```
Why subtract the per-row max? Without it, exp(X) overflows for inputs above ~709.
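A quick self-contained check of both claims (no overflow for large inputs, rows sum to 1), redefining the function so the snippet runs on its own:

```python
import numpy as np

def softmax_rows(X):
    # subtract per-row max for numerical stability
    X = X - X.max(axis=1, keepdims=True)
    expX = np.exp(X)
    return expX / expX.sum(axis=1, keepdims=True)

X = np.array([[1000.0, 1001.0, 1002.0],   # naive exp(X) would overflow here
              [0.0, 0.0, 0.0]])           # uniform row -> probabilities 1/3
P = softmax_rows(X)
```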
P-2 · pandas: group-and-rank
Given a DataFrame df with columns store, product, sales, compute the top 3 products by total sales within each store.
```python
top3 = (
    df.groupby(["store", "product"], as_index=False)["sales"]
    .sum()
    .sort_values(["store", "sales"], ascending=[True, False])
    .groupby("store")
    .head(3)
)
```
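A small worked example (made-up data) showing what the chain produces. Note that `.head(3)` after a groupby keeps the first three rows of each group in the already-sorted frame order:

```python
import pandas as pd

df = pd.DataFrame({
    "store":   ["A"] * 4 + ["B"] * 4,
    "product": ["p1", "p2", "p3", "p4"] * 2,
    "sales":   [10, 30, 20, 5, 1, 2, 3, 4],
})

top3 = (
    df.groupby(["store", "product"], as_index=False)["sales"]
    .sum()
    .sort_values(["store", "sales"], ascending=[True, False])
    .groupby("store")
    .head(3)
)
# Store A keeps p2, p3, p1; store B keeps p4, p3, p2.
```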
P-3 · pandas: missing-value strategy
A dataset has a numerical column income with 8% missing values. Describe two valid strategies and one strategy that leaks information.
Valid: (1) fill missing with the training-set median, then apply the same value to validation/test. (2) Use a model (e.g. HistGradientBoostingRegressor) that natively handles NaN.
Leak: Fill with the median computed across train + validation + test combined — the fill value contains future information.
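A sketch of the first valid strategy with hypothetical `train`/`test` DataFrames; the key point is that the fill value comes from the training split only:

```python
import numpy as np
import pandas as pd

# Hypothetical split; in practice these come from train_test_split or similar.
train = pd.DataFrame({"income": [40_000, 52_000, np.nan, 61_000, np.nan]})
test = pd.DataFrame({"income": [np.nan, 48_000]})

fill_value = train["income"].median()               # training data only
train["income"] = train["income"].fillna(fill_value)
test["income"] = test["income"].fillna(fill_value)  # same value reused
```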
Classical ML
ML-1 · Overfitting diagnosis
You train a random forest and observe: training accuracy = 0.99, validation accuracy = 0.72. Name three concrete fixes.
Fixes: (1) Reduce max_depth or increase min_samples_leaf so each tree memorizes less. (2) Add more training data, or prune noisy features (lowering max_features also decorrelates the trees). (3) Use stronger regularization, e.g. switch to gradient boosting with early stopping. Note that reducing n_estimators is not a fix: a random forest does not overfit as trees are added, even though each individual tree can overfit on its own.
ML-2 · Cross-validation pitfall
A student does the following: scaler.fit(X), then cross_val_score(model, scaler.transform(X), y, cv=5). What's wrong, and what's the fix?
Problem: The scaler was fit on all data including each fold's validation set — leak. The reported score is optimistically biased.
Fix: Wrap scaler and model in a Pipeline: Pipeline([("scale", StandardScaler()), ("clf", model)]), then pass the pipeline to cross_val_score. Each fold refits the scaler on its training half.
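A minimal sketch of the corrected version, using a toy synthetic dataset and LogisticRegression as a stand-in model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# The pipeline refits the scaler on each fold's training portion only,
# so the validation fold never influences the scaling parameters.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())])
scores = cross_val_score(pipe, X, y, cv=5)
```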
ML-3 · Imbalanced classification
You're classifying fraudulent vs. legitimate transactions. The positive (fraud) class is 0.5% of the data. Why is accuracy a poor metric, and what should you use instead?
Why: Predicting "not fraud" for every transaction gives 99.5% accuracy and zero useful information.
Use instead: Precision, recall, F1, ROC-AUC, precision-recall AUC. Often the relevant business metric is recall at a fixed precision (or vice versa).
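A worked illustration with made-up counts: a classifier that misses 40% of fraud cases and raises some false alarms still posts near-perfect accuracy, while precision and recall expose the real trade-off:

```python
import numpy as np

# Hypothetical imbalanced labels: 50 fraud cases out of 10,000.
y_true = np.zeros(10_000, dtype=int)
y_true[:50] = 1
y_pred = np.zeros(10_000, dtype=int)
y_pred[:30] = 1        # catches 30 of the 50 fraud cases
y_pred[50:70] = 1      # plus 20 false alarms

accuracy = (y_true == y_pred).mean()          # looks excellent
tp = ((y_true == 1) & (y_pred == 1)).sum()
fp = ((y_true == 0) & (y_pred == 1)).sum()
fn = ((y_true == 1) & (y_pred == 0)).sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```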
ML-4 · K-Means initialization
What problem does k-means++ initialization solve?
Answer: Random initialization can place initial centroids close together, producing poor local minima and slow convergence. K-means++ picks each new initial centroid with probability proportional to its squared distance from the nearest already-chosen centroid, spreading centroids out and dramatically improving the chance of converging to a good clustering.
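The seeding rule described above can be sketched in a few lines of NumPy (a naive O(nk) implementation for clarity, not scikit-learn's optimized one):

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """k-means++ seeding: each new centroid is drawn with probability
    proportional to squared distance from the nearest chosen centroid."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen centroid
        d2 = np.min(
            ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(-1),
            axis=1,
        )
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
# three well-separated 2-D blobs
X = np.vstack([rng.normal(loc, 0.1, size=(50, 2)) for loc in (0.0, 5.0, 10.0)])
C = kmeanspp_init(X, 3, rng)
```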
Deep learning
DL-1 · Forward pass dimensions
A batch of 32 images, each 3 channels × 64 × 64, is fed to a Conv2d layer with in_channels=3, out_channels=16, kernel_size=3, padding=1, stride=1. What is the output shape?
Answer: (32, 16, 64, 64). Same-padded 3×3 convolution preserves spatial dimensions; channels go from 3 to 16; batch size unchanged.
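The general formula behind this answer, as a tiny helper worth memorizing:

```python
def conv_out(size, kernel, stride=1, padding=0):
    # standard conv output-size formula: floor((H + 2p - k) / s) + 1
    return (size + 2 * padding - kernel) // stride + 1

h = conv_out(64, kernel=3, stride=1, padding=1)
out_shape = (32, 16, h, h)  # (batch, out_channels, H, W)
```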
DL-2 · Cross-entropy loss
For a 3-class classifier, a sample has true label 1 (one-hot [0, 1, 0]) and predicted probabilities [0.2, 0.5, 0.3]. Compute the cross-entropy loss for this sample.
Answer: −log(0.5) ≈ 0.693. Cross-entropy = −Σ yᵢ log(pᵢ) = −log(p_true_class).
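The computation in plain Python:

```python
import math

y = [0, 1, 0]          # one-hot true label
p = [0.2, 0.5, 0.3]    # predicted probabilities

# Cross-entropy: only the true class's term survives the sum.
loss = -sum(yi * math.log(pi) for yi, pi in zip(y, p))
```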
DL-3 · NaN loss
Your training loss is finite for the first 20 steps, then jumps to NaN and stays there. Name three things to check.
Check: (1) Learning rate too high — try lowering by 10×. (2) Numerical instability: are you using BCEWithLogitsLoss instead of BCELoss(sigmoid(...))? Did you take log(0) somewhere? (3) Exploding gradients on a deep net — apply gradient clipping with torch.nn.utils.clip_grad_norm_.
DL-4 · Train / eval mode
Your model evaluates well during training but produces very different numbers when you save the weights and run inference on a single sample. Why?
Answer: You probably forgot to call model.eval() before inference. BatchNorm computes per-batch statistics in training mode but uses its running averages in eval mode; Dropout zeros random activations in training mode but is the identity in eval mode. A batch size of 1 with BatchNorm in training mode is especially broken, since a single sample has degenerate batch statistics.
Transformers
T-1 · Attention complexity
For an input sequence of length n and model dimension d, what is the time and memory complexity of standard scaled dot-product self-attention (single head)?
Answer: Time = O(n² · d); memory = O(n²) for the attention matrix plus O(n · d) for Q, K, V. The quadratic-in-n cost is why long-context models use sparse / linear / flash variants.
T-2 · Why scale by √d_k?
The attention formula is softmax(Q Kᵀ / √dₖ) V. Why divide by √dₖ?
Answer: Q · K is a sum of dₖ products of (assumed standard normal) random variables, so its variance grows as dₖ and its standard deviation as √dₖ. Without scaling, the softmax inputs grow with dₖ, pushing the softmax into a saturated regime where one entry is ~1 and all others ~0 — gradients vanish. Dividing by √dₖ keeps the variance ~1.
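The variance claim is easy to verify empirically by sampling random query/key vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 256
n_trials = 100_000

q = rng.normal(size=(n_trials, d_k))   # standard normal queries
k = rng.normal(size=(n_trials, d_k))   # standard normal keys
scores = (q * k).sum(axis=1)           # dot products, one per trial

unscaled_var = scores.var()                      # grows like d_k
scaled_var = (scores / np.sqrt(d_k)).var()       # stays near 1
```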
T-3 · Causal mask
For a decoder-only language model, what shape and pattern should the attention mask have, and how is it applied?
Answer: Shape (n, n). Lower-triangular (including the diagonal) of 1s; upper triangle of 0s. Applied by setting masked positions in the pre-softmax score matrix to −∞ (in practice, a large negative number) so the softmax assigns them zero weight. This prevents each position from attending to future positions.
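A NumPy sketch of building and applying the mask (using −1e9 as the "large negative number" stand-in for −∞):

```python
import numpy as np

n = 4
mask = np.tril(np.ones((n, n)))   # 1 = may attend, 0 = masked (future)

scores = np.zeros((n, n))         # stand-in pre-softmax score matrix
scores = np.where(mask == 1, scores, -1e9)

# row-wise softmax
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
# Row i now puts zero weight on all positions j > i.
```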
T-4 · Positional encoding
Why is positional encoding necessary in a transformer?
Answer: Self-attention is permutation-equivariant — shuffling the input tokens shuffles the outputs in the same way. The model has no notion of token order. Positional encoding (sinusoidal, learned, or rotary) injects position information into the token embeddings so the model can distinguish "cat sat on mat" from "mat sat on cat."
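A sketch of the sinusoidal variant (the formulation from "Attention Is All You Need", assuming an even model dimension d):

```python
import numpy as np

def sinusoidal_pe(n, d):
    """Positions 0..n-1 encoded into d dimensions: even columns get sin,
    odd columns get cos, with geometrically spaced wavelengths."""
    pos = np.arange(n)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10_000 ** (2 * i / d))
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(128, 64)   # added to the token embeddings
```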
Next
- Mock contests — Two full mini-mocks in USAAIO style.
- After every miss: log it, identify the gap, drill that topic this week.