Original mock contests

Two original USAAIO-style mocks: a theory mini-mock (10 short-answer questions, 45 min) and a coding mini-mock (one end-to-end ML task on a synthetic dataset, 90 min). Use these between major Kaggle attempts.

How to take a mock. Run a strict timer and keep the answer key closed. When the timer expires, score honestly, log every miss, then re-attempt the failed problems with no time limit.

Mock 1 · Theory mini-mock (10 questions, 45 min)

  1. Linear algebra. For matrix A = [[2, 1], [1, 2]], find both eigenvalues.
  2. Probability. Two fair dice are rolled. What is the probability that the sum is at least 10?
  3. Calculus. For f(x, y) = x² y + 3y, compute ∂f/∂y at (2, 1).
  4. Statistics. A sample has values 4, 8, 10, 14, 14. Compute the median and the (population) standard deviation.
  5. Numpy / ML. Given a 2-D array X of shape (100, 5), write one line of NumPy that standardizes each column (zero mean, unit variance).
  6. Classical ML. A model has 95% training accuracy and 70% validation accuracy. Is this overfitting or underfitting? Name two interventions.
  7. Loss functions. Why use BCEWithLogitsLoss instead of BCELoss applied to a sigmoid output?
  8. Deep learning. A Conv2d with in_channels=16, out_channels=32, kernel_size=5, no bias. How many parameters?
  9. Transformers. A self-attention layer has model dimension d_model = 512 and n_heads = 8. What is the dimension per head?
  10. Optimization. Name one situation where you'd prefer SGD with momentum over AdamW.

Answer key

  1. Eigenvalues: λ = 1 and λ = 3. (Characteristic polynomial: (2 − λ)² − 1 = 0.)
  2. Outcomes with sum ≥ 10: (4,6), (5,5), (5,6), (6,4), (6,5), (6,6) → 6 out of 36 → 1/6.
  3. ∂f/∂y = x² + 3 = 4 + 3 = 7.
  4. Median = 10. Mean = 10. Squared deviations: 36, 4, 0, 16, 16 → sum 72 → variance 14.4 → SD ≈ 3.79.
  5. (X - X.mean(axis=0)) / X.std(axis=0).
  6. Overfitting. Interventions: more regularization (weight decay, dropout, smaller model); more training data; data augmentation; early stopping.
  7. Numerical stability: BCEWithLogitsLoss uses the log-sum-exp trick to handle very large or very small logits without intermediate overflow.
  8. 16 × 32 × 5 × 5 = 12 800 parameters.
  9. 512 / 8 = 64 dimensions per head.
  10. Large-scale vision training (e.g. ImageNet ResNet) with a well-tuned learning-rate schedule — SGD+momentum often generalizes slightly better than Adam in this regime.
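
To double-check the numerical answers, a quick NumPy sanity check (for review only, not part of the timed mock; the constants mirror the questions above):

import numpy as np

# Q1: eigenvalues of the symmetric matrix [[2, 1], [1, 2]]
print(np.linalg.eigvalsh(np.array([[2.0, 1.0], [1.0, 2.0]])))  # [1. 3.]

# Q2: P(sum of two dice >= 10) by brute-force enumeration
print(sum(1 for a in range(1, 7) for b in range(1, 7) if a + b >= 10) / 36)  # 0.1666...

# Q4: median and population SD (np.std defaults to ddof=0, i.e. population)
sample = np.array([4, 8, 10, 14, 14])
print(np.median(sample), sample.std())  # 10.0 3.7947...

# Q8: Conv2d parameter count without bias: in_channels * out_channels * k * k
print(16 * 32 * 5 * 5)  # 12800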

Mock 2 · Coding mini-mock (90 min)

Task: "Synthetic species classification"

You are given a small tabular dataset with 8 numerical features and a 3-class label. The training set has 2 000 rows; the held-out test set has 500 rows. Your task is to produce a notebook that trains a model and outputs predictions for the test set, maximizing macro-F1.

Synthetic data generator (use this to create your dataset locally)

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# 2 500 rows total: 2 000 for train.csv + 500 for test.csv, 3 imbalanced classes
X, y = make_classification(
    n_samples=2500, n_features=8, n_informative=5, n_redundant=2,
    n_classes=3, class_sep=1.2, weights=[0.5, 0.3, 0.2], random_state=42,
)

feature_cols = [f"f{i}" for i in range(8)]
df = pd.DataFrame(X, columns=feature_cols)
df["label"] = y

# stratified split preserves the class balance between train and test
train_df, test_df = train_test_split(df, test_size=500, random_state=0, stratify=df["label"])
train_df.to_csv("train.csv", index=False)
test_df.drop(columns=["label"]).to_csv("test.csv", index=False)
test_df[["label"]].to_csv("test_labels.csv", index=False)  # for your own scoring

Deliverables

  1. Your notebook or script, runnable end to end.
  2. predictions.csv with one predicted label per test row, in the same order as test.csv.
  3. Your self-scored macro-F1, computed against test_labels.csv.

Scoring rubric

Macro-F1 achieved    Score (out of 100)
< 0.55                 0
0.55 – 0.65           40
0.65 – 0.75           70
0.75 – 0.82           90
≥ 0.82               100

Reference baseline (open after your attempt)

import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, cross_val_score

train = pd.read_csv("train.csv")
test  = pd.read_csv("test.csv")

X = train.drop(columns=["label"])
y = train["label"]

# scale (HGB doesn't need it, but it doesn't hurt)
scaler = StandardScaler().fit(X)
X_s = scaler.transform(X)
test_s = scaler.transform(test)

model = HistGradientBoostingClassifier(
    max_depth=6, learning_rate=0.05, max_iter=400, random_state=42,
)

# 5-fold stratified CV to estimate generalization before touching the test set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_s, y, cv=cv, scoring="f1_macro")
print("CV macro-F1:", scores.mean(), "±", scores.std())

# refit on all training data, then predict the held-out test rows
model.fit(X_s, y)
preds = model.predict(test_s)
pd.DataFrame({"label": preds}).to_csv("predictions.csv", index=False)

The reference baseline typically scores around macro-F1 ≈ 0.78–0.82 on this generator with the default seed. Beating it usually requires careful, CV-driven hyperparameter tuning or a small ensemble (e.g., averaging the probabilities of HGB, logistic regression, and an MLP).
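
As one possible shape for that ensemble, a minimal soft-voting sketch with scikit-learn's VotingClassifier; the hyperparameters are untuned placeholders, and X_s, y, test_s are the scaled arrays from the baseline above:

from sklearn.ensemble import HistGradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# soft voting averages predicted class probabilities across model families
ensemble = VotingClassifier(
    estimators=[
        ("hgb", HistGradientBoostingClassifier(random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42)),
    ],
    voting="soft",
)
ensemble.fit(X_s, y)
preds = ensemble.predict(test_s)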

Tips during the mock

  1. Get a minimal end-to-end pipeline (load → fit → write predictions.csv) working in the first 20–30 minutes; improve from there.
  2. Set up stratified cross-validation before tuning anything, and trust CV scores over a single split.
  3. Timebox each experiment and jot down what you tried, so the post-mock review has material.

After each mock

  1. Score honestly using the rubric.
  2. For each missed theory item: re-derive without looking, then read the explanation.
  3. For the coding mock: identify your single biggest score-leak (was it feature engineering? model choice? CV?), and drill that next week.
  4. Log everything into the same error log you use for problem sets.