Contest-day cheatsheets

One page, ten sections, every API and shape rule I want at my fingertips during a proctored USAAIO round. The goal is recognition speed: skim once a day until you don't need it open.

How to use. The contest is on Colab without internet, so practise these from memory. If you can read each table and immediately picture the snippet, you're ready. If a cell makes you pause, that's your next drill.

1. NumPy

Broadcasting rules

Compare shapes right-to-left. For each pair: equal, or one is 1, or one is missing → OK. Otherwise error.

| A shape | B shape | Result | Why |
|---|---|---|---|
| (4, 3) | (3,) | (4, 3) | row vector added to each row |
| (4, 3) | (4, 1) | (4, 3) | column vector added to each column |
| (4, 3) | (4,) | error | reshape to (4, 1) first |
| (n, 1, d) | (1, n, d) | (n, n, d) | pairwise broadcast trick |
| (b, h, n, d) | (d,) | (b, h, n, d) | per-feature scaling (one value per feature) |
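
The rows above can be checked without allocating big arrays: `np.broadcast_shapes` applies the same right-to-left rule to bare shape tuples. This is a minimal sketch; the concrete sizes (5 and 2 standing in for n and d) are arbitrary values picked for illustration.

```python
import numpy as np

# Shapes from the table, with arbitrary concrete sizes.
print(np.broadcast_shapes((4, 3), (3,)))          # (4, 3)
print(np.broadcast_shapes((4, 3), (4, 1)))        # (4, 3)
print(np.broadcast_shapes((5, 1, 2), (1, 5, 2)))  # (5, 5, 2)

# (4, 3) with (4,) fails: the trailing dims 3 and 4 do not match.
try:
    np.broadcast_shapes((4, 3), (4,))
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# The fix: give the vector an explicit column axis.
A = np.zeros((4, 3))
v = np.arange(4.0)
fixed = A + v[:, None]   # (4, 3) + (4, 1) -> (4, 3)
```

If the shape error shows up in a real expression, printing `.shape` of both operands and lining them up from the right usually finds the mismatch in seconds.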

Shape ops

a.reshape(2, -1)              # -1 means "infer"
a.ravel()                     # flatten, returns view if possible
np.squeeze(a, axis=1)         # drop axis 1 (it must have length 1)
np.expand_dims(a, axis=0)     # equivalent to a[None]
a.transpose(0, 2, 1)          # permute axes
np.stack([a, b], axis=0)      # new axis, equal shapes
np.concatenate([a, b], axis=1)  # join on existing axis
np.tile(a, (2, 3))            # repeat the whole array as a 2 x 3 grid of copies
np.repeat(a, 3, axis=0)       # repeat each element

Axis-aware reductions

| Call | Input | Output | Note |
|---|---|---|---|
| X.mean(axis=0) | (N, D) | (D,) | per-column mean |
| X.mean(axis=1) | (N, D) | (N,) | per-row mean |
| X.sum(axis=-1, keepdims=True) | (N, D) | (N, 1) | preserve rank for broadcasting |
| X.argmax(axis=1) | (N, K) | (N,) | predicted class indices |
| np.linalg.norm(X, axis=1) | (N, D) | (N,) | L2 norm per row |
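
The `keepdims=True` row is the one that most often trips people up, so here is a small sketch of why it matters. The array values are made up for illustration.

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])           # (N=2, D=3)

col_mean = X.mean(axis=0)                 # (3,)
centered = X - col_mean                   # (2, 3) - (3,) broadcasts per row

row_sum = X.sum(axis=-1, keepdims=True)   # (2, 1)
row_frac = X / row_sum                    # each row now sums to 1

# Without keepdims the result is (2,), which does not line up with the
# rows of a (2, 3) array, so X / row_sum_flat would raise.
row_sum_flat = X.sum(axis=-1)             # (2,)
```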

Slicing & indexing

a[1:5:2]                      # start:stop:step
a[::-1]                       # reverse
a[mask]                       # boolean mask; mask.shape must match a.shape (or its leading dims)
a[[0, 2, 5]]                  # fancy index — returns a copy
a[np.arange(N), y]            # pick a[i, y[i]] for each i (gather)
a[:, None] - b[None, :]       # pairwise diff via broadcasting
np.where(cond, x, y)          # vectorized ternary
np.clip(a, 0, 1)              # element-wise clip
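
Two of these idioms come up constantly in ML code: the gather `a[np.arange(N), y]` and the pairwise difference. The sketch below uses toy numbers; `probs`, `P`, and `Q` are illustrative names, not anything from a specific task.

```python
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])        # (N=3, K=3) predicted probabilities
y = np.array([0, 1, 2])                    # true class per row

picked = probs[np.arange(len(y)), y]       # (3,) -> probability of the true class
nll = -np.log(picked).mean()               # mean negative log-likelihood

# Pairwise Euclidean distances between rows of P (n, d) and Q (m, d).
P = np.array([[0.0, 0.0], [3.0, 4.0]])
Q = np.array([[0.0, 0.0], [0.0, 4.0], [3.0, 0.0]])
diff = P[:, None, :] - Q[None, :, :]       # (2, 1, 2) - (1, 3, 2) -> (2, 3, 2)
dist = np.sqrt((diff ** 2).sum(axis=-1))   # (2, 3)
```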

Random (modern API)

rng = np.random.default_rng(0)
rng.standard_normal((4, 3))   # N(0, 1)
rng.normal(loc=0, scale=1, size=(4, 3))
rng.uniform(0, 1, size=10)
rng.integers(0, 10, size=5)   # exclusive of high
rng.choice(N, size=k, replace=False)
rng.permutation(N)
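
A common use of `rng.permutation` is a reproducible train/validation split when scikit-learn is not at hand. A minimal sketch, assuming a toy feature matrix and a 20% hold-out:

```python
import numpy as np

rng = np.random.default_rng(0)       # same seed -> same split every run
N = 10
X = np.arange(N * 2).reshape(N, 2)   # toy features, row i is [2i, 2i + 1]

perm = rng.permutation(N)            # shuffled row indices 0..N-1
n_val = N // 5                       # hold out 20%
val_idx, train_idx = perm[:n_val], perm[n_val:]

X_train, X_val = X[train_idx], X[val_idx]
```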

2. pandas

read_csv common args

| Arg | Use |
|---|---|
| sep="," | or "\t"; r"\s+" for whitespace |
| header=0 / None | row index of header, or no header |
| names=[...] | supply column names (with header=None) |
| index_col="id" | use this column as the row index |
| usecols=["a","b"] | only load these columns (saves memory) |
| dtype={"x": "float32"} | force column dtypes |
| parse_dates=["ts"] | parse to datetime64 |
| na_values=["?", "NA"] | extra strings to treat as NaN |
| nrows=1000 | peek at file |
| chunksize=10000 | iterator of DataFrames for big files |

loc vs iloc

df.loc["a":"c", ["x", "y"]]   # LABEL slicing, INCLUSIVE on both ends
df.iloc[0:3, [0, 1]]          # POSITION slicing, exclusive right
df.loc[df["x"] > 0, "y"] = 1  # safe conditional write
df.at["row3", "x"]            # scalar label access (fastest)
df.iat[2, 0]                  # scalar position access
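
The inclusive/exclusive difference is easiest to remember after seeing it once. A small sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1, -2, 3, -4], "y": [0, 0, 0, 0]},
    index=["a", "b", "c", "d"],
)

by_label = df.loc["a":"c", ["x"]]   # rows a, b, c (end label included)
by_pos = df.iloc[0:3, [0]]          # rows 0, 1, 2 (end position excluded)
narrow = df.iloc[0:2, [0]]          # rows 0, 1 only

df.loc[df["x"] > 0, "y"] = 1        # one indexing call, so the write reaches df
```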

GroupBy + agg

df.groupby("user").size()                       # count per group
df.groupby("user")["amt"].mean()                 # series result
df.groupby(["user", "day"])["amt"].sum()         # multi-key, MultiIndex

df.groupby("user").agg(
    n=("amt", "size"),
    mean_amt=("amt", "mean"),
    max_amt=("amt", "max"),
).reset_index()                                  # named columns, flat

df.groupby("user")["amt"].transform("mean")      # broadcast back to row shape
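
The difference between `agg` and `transform` is the output shape: one row per group versus one value per original row. A sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u2"],
    "amt":  [10.0, 30.0, 5.0, 5.0, 20.0],
})

# agg: one row per group.
per_user = df.groupby("user").agg(
    n=("amt", "size"),
    mean_amt=("amt", "mean"),
).reset_index()

# transform: one value per ORIGINAL row, aligned to df's index.
df["user_mean"] = df.groupby("user")["amt"].transform("mean")
df["amt_centered"] = df["amt"] - df["user_mean"]
```

`transform` is usually the shortest route to group-level features such as "amount relative to this user's average".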

Merge / join

pd.merge(a, b, on="user_id", how="left")        # how: left/right/inner/outer
pd.merge(a, b, left_on="uid", right_on="user_id")
pd.merge(a, b, on="k", how="outer", indicator=True)  # _merge col shows source
a.join(b, on="user_id", how="left")              # join uses b's index
pd.concat([a, b], axis=0, ignore_index=True)     # stack rows
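
`indicator=True` is a quick way to audit a join before trusting it. A sketch on two tiny made-up frames:

```python
import pandas as pd

a = pd.DataFrame({"k": [1, 2, 3], "left_val": ["a1", "a2", "a3"]})
b = pd.DataFrame({"k": [2, 3, 4], "right_val": ["b2", "b3", "b4"]})

m = pd.merge(a, b, on="k", how="outer", indicator=True)
counts = m["_merge"].value_counts()          # both / left_only / right_only
left_only_keys = m.loc[m["_merge"] == "left_only", "k"].tolist()
```

Checking `counts` right after a merge catches silent key mismatches (wrong dtype, stray whitespace) before they turn into NaN-filled features.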

value_counts & pivot_table

df["label"].value_counts()                       # counts, sorted desc
df["label"].value_counts(normalize=True)         # proportions
pd.crosstab(df["a"], df["b"])                    # 2-way frequency

df.pivot_table(
    index="date", columns="product",
    values="sales", aggfunc="sum", fill_value=0,
)

Datetime

df["ts"] = pd.to_datetime(df["ts"], errors="coerce")
df["hour"] = df["ts"].dt.hour
df["dow"]  = df["ts"].dt.dayofweek
df["month"] = df["ts"].dt.month
df["is_weekend"] = df["dow"] >= 5
df.set_index("ts").resample("1D")["amt"].sum()

3. matplotlib

Figure / axes anatomy

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))   # one axes
fig, axes = plt.subplots(2, 3, figsize=(12, 6), sharex=True)
ax = axes[0, 1]                          # row, col indexing
fig.suptitle("Diagnostics")
plt.tight_layout()
fig.savefig("out.png", dpi=150, bbox_inches="tight")

The six plots

| Plot | Call | When |
|---|---|---|
| line | ax.plot(x, y, label="train") | loss curves, time series |
| scatter | ax.scatter(x, y, c=labels, s=8, alpha=0.6) | 2-D feature view, residuals |
| histogram | ax.hist(x, bins=50, density=True) | feature distribution |
| bar | ax.bar(names, values) | class counts, model comparison |
| image | ax.imshow(img, cmap="gray") | image samples, weight maps |
| heatmap | im = ax.imshow(M); fig.colorbar(im, ax=ax) | confusion matrix, attention |
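
Two details are easy to forget under time pressure: a 1 x 2 grid gives a 1-D `axes` array, and `fig.colorbar` needs the handle returned by `imshow`. A minimal sketch with toy data; the `Agg` backend line is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")              # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 2, figsize=(8, 3))   # axes has shape (2,), not (1, 2)

epochs = np.arange(1, 21)
axes[0].plot(epochs, 1.0 / epochs, label="train")   # toy loss curve
axes[0].set_xlabel("epoch"); axes[0].set_ylabel("loss"); axes[0].legend()

M = np.array([[8, 2], [1, 9]])                      # toy confusion matrix
im = axes[1].imshow(M, cmap="Blues")
fig.colorbar(im, ax=axes[1])                        # colorbar needs the imshow handle

fig.tight_layout()
```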

Style fixes

ax.set_xlabel("epoch"); ax.set_ylabel("loss"); ax.set_title("...")
ax.set_yscale("log")          # long loss curves
ax.set_xlim(0, 100); ax.set_ylim(0, 1)
ax.legend(loc="best"); ax.grid(alpha=0.3)
ax.tick_params(axis="x", rotation=45)
fig.colorbar(im, ax=ax, shrink=0.8)   # im is the handle returned by ax.imshow(...)

4. scikit-learn

Estimator API

est.fit(X, y)            # learn parameters
est.predict(X)           # supervised: discrete labels or regression values
est.predict_proba(X)     # classifiers (where supported)
est.transform(X)         # transformers (scalers, encoders, PCA)
est.fit_transform(X)     # both in one call (on TRAIN only)
est.score(X, y)          # default metric (accuracy / R^2)

Preprocessing

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "plan"]),
])
X_train_t = pre.fit_transform(X_train)
X_val_t   = pre.transform(X_val)        # NO refit on val

Pipeline

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("pre", pre),
    ("clf", LogisticRegression(max_iter=1000, C=1.0)),
])
pipe.fit(X_train, y_train)
pipe.predict(X_val)

model_selection

from sklearn.model_selection import (
    train_test_split, KFold, StratifiedKFold,
    cross_val_score, GridSearchCV,
)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1_macro")

grid = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.1, 1, 10]},
    cv=cv, scoring="f1_macro", n_jobs=-1,
)
grid.fit(X, y)
grid.best_params_, grid.best_score_
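
To see the pieces working together without a real dataset, here is a sketch on synthetic data. The columns `age` and `plan` and the labelling rule are invented for illustration. Because the preprocessing sits inside the pipeline, the scaler and encoder are refit on the training part of every fold.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "plan": rng.choice(["free", "pro"], n),
})
# Toy label: a linear rule over both columns, so the model can learn it.
y = ((X["age"] + 10 * (X["plan"] == "pro")) > 45).astype(int)

pre = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print(scores.mean())
```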

Metrics

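| Task | Metric | Call |
|---|---|---|
| classification | accuracy | accuracy_score(y, yp) |
| classification | F1 (macro / binary) | f1_score(y, yp, average="macro") |
| classification | ROC-AUC | roc_auc_score(y, yp_proba) (binary: probability of the positive class) |
| classification | confusion matrix | confusion_matrix(y, yp) |
| regression | MSE / RMSE | mean_squared_error(y, yp) / root_mean_squared_error(y, yp) (sklearn ≥ 1.4; older versions use mean_squared_error(y, yp, squared=False)) |
| regression | MAE | mean_absolute_error(y, yp) |
| regression | R² | r2_score(y, yp) |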
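
Because the RMSE call differs between scikit-learn versions, taking the square root of the MSE yourself is a version-independent fallback. A sketch with hand-checkable toy numbers:

```python
import numpy as np
from sklearn.metrics import f1_score, mean_squared_error

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 7.0])
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 4 + 0) / 4 = 1.25
rmse = np.sqrt(mse)                        # works on any sklearn version

labels = np.array([0, 0, 1, 1, 2, 2])
preds = np.array([0, 0, 1, 0, 2, 2])
# Per-class F1 is 0.8, 2/3 and 1.0; macro F1 is their unweighted mean.
macro_f1 = f1_score(labels, preds, average="macro")
```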

Common estimators

| Estimator | Module (sklearn.*) | Key defaults |
|---|---|---|
| LogisticRegression | linear_model | C=1.0, penalty="l2", max_iter=100 (bump to 1000) |
| Ridge / Lasso | linear_model | alpha=1.0 |
| KNeighborsClassifier | neighbors | n_neighbors=5 |
| DecisionTreeClassifier | tree | max_depth=None, min_samples_split=2 |
| RandomForestClassifier | ensemble | n_estimators=100, max_features="sqrt" |
| GradientBoostingClassifier | ensemble | n_estimators=100, learning_rate=0.1, max_depth=3 |
| SVC | svm | C=1.0, kernel="rbf", gamma="scale" |
| KMeans | cluster | n_clusters=8, n_init="auto" |
| PCA | decomposition | n_components=None |

5. PyTorch shape operations

view vs reshape

x.view(B, -1)        # requires contiguous memory; errors if not
x.reshape(B, -1)     # works on non-contiguous (may copy)
x.contiguous().view(B, -1)   # canonical fix after transpose/permute
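
The failure mode is worth triggering once on purpose so the error message is familiar. A minimal sketch on a tiny tensor:

```python
import torch

x = torch.arange(6).reshape(2, 3)
xt = x.t()                           # (3, 2) view with swapped strides
is_contig = xt.is_contiguous()       # False

try:
    xt.view(-1)                      # strides are incompatible -> RuntimeError
    view_failed = False
except RuntimeError:
    view_failed = True

flat_a = xt.reshape(-1)              # copies when it has to
flat_b = xt.contiguous().view(-1)    # explicit copy, then view
```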

permute vs transpose

x.transpose(1, 2)            # swap exactly two dims
x.permute(0, 2, 3, 1)        # arbitrary reorder, e.g. NCHW -> NHWC
# Both return a view with strided memory; follow with .contiguous() before .view().

unsqueeze / squeeze / expand / repeat

x.unsqueeze(0)               # add axis at position 0:  (D,) -> (1, D)
x.squeeze(1)                 # remove axis 1 if it is length 1
x.expand(B, -1, -1)          # broadcast view, no memory copy; do not write to it in place
x.repeat(B, 1, 1)            # physical copy along each dim
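
The difference between `expand` and `repeat` shows up in the strides and in what happens when the source changes. A small sketch:

```python
import torch

x = torch.tensor([[1.0, 2.0, 3.0]])   # (1, 3)

e = x.expand(4, -1)                   # (4, 3), stride 0 on dim 0: no copy
r = x.repeat(4, 1)                    # (4, 3), real copy

shares_memory = e.data_ptr() == x.data_ptr()   # True
e_stride = e.stride()                          # (0, 1)

x[0, 0] = 99.0                        # visible through e, not through r
```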

matmul broadcasting

| A | B | A @ B | Note |
|---|---|---|---|
| (n, k) | (k, m) | (n, m) | plain matmul |
| (B, n, k) | (B, k, m) | (B, n, m) | batched |
| (B, n, k) | (k, m) | (B, n, m) | right operand broadcasts |
| (B, H, n, k) | (B, H, k, m) | (B, H, n, m) | attention block |
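
A quick shape check of the broadcasting rows, assuming arbitrary small sizes. The manual loop at the end is only there to show what the broadcast is equivalent to.

```python
import torch

torch.manual_seed(0)
B, H, n, k, m = 2, 4, 5, 3, 6         # arbitrary sizes
A3 = torch.randn(B, n, k)
W = torch.randn(k, m)
out_shared = A3 @ W                    # (B, n, m): W is reused for every batch item

Q = torch.randn(B, H, n, k)
Kt = torch.randn(B, H, k, m)
out_heads = Q @ Kt                     # (B, H, n, m)

# Same result as looping over the batch by hand.
manual = torch.stack([A3[b] @ W for b in range(B)])
```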

einsum patterns

import torch
# batched matmul
torch.einsum("bij,bjk->bik", A, B)

# multi-head attention QK^T:  Q (B, H, N, d), K (B, H, N, d)
attn = torch.einsum("bhid,bhjd->bhij", Q, K) / d**0.5
out  = torch.einsum("bhij,bhjd->bhid", attn.softmax(-1), V)

# dot product of every row pair
torch.einsum("nd,md->nm", X, Y)

# weighted sum across feature dim
torch.einsum("nd,d->n", X, w)
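
If an einsum string looks suspicious, comparing it against the plain matmul form is a fast sanity check. A sketch with random tensors of arbitrary size:

```python
import torch

torch.manual_seed(0)
B, H, N, d = 2, 3, 4, 8               # arbitrary sizes
Q, K, V = (torch.randn(B, H, N, d) for _ in range(3))

scores_e = torch.einsum("bhid,bhjd->bhij", Q, K) / d**0.5
scores_m = Q @ K.transpose(-2, -1) / d**0.5          # same thing with matmul

weights = scores_e.softmax(dim=-1)                   # each row sums to 1 over keys j
out_e = torch.einsum("bhij,bhjd->bhid", weights, V)
out_m = weights @ V
```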

6. PyTorch training loop boilerplate

import torch, random, numpy as np
from torch.utils.data import DataLoader

def set_seed(s=42):
    random.seed(s); np.random.seed(s)
    torch.manual_seed(s); torch.cuda.manual_seed_all(s)

set_seed(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True,  num_workers=2, pin_memory=True)
val_loader   = DataLoader(val_ds,   batch_size=128, shuffle=False, num_workers=2, pin_memory=True)

model = MyModel().to(device)
optim = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(optim, T_max=EPOCHS)
loss_fn = torch.nn.CrossEntropyLoss()

best_val = float("inf")
for epoch in range(EPOCHS):
    model.train()
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        optim.zero_grad()
        logits = model(xb)
        loss = loss_fn(logits, yb)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optim.step()
    sched.step()

    model.eval()
    tot, n = 0.0, 0
    with torch.no_grad():
        for xb, yb in val_loader:
            xb, yb = xb.to(device), yb.to(device)
            tot += loss_fn(model(xb), yb).item() * xb.size(0)
            n += xb.size(0)
    val_loss = tot / n
    if val_loss < best_val:
        best_val = val_loss
        torch.save({"model": model.state_dict(),
                    "optim": optim.state_dict(),
                    "epoch": epoch}, "best.pt")

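Resume: ckpt = torch.load("best.pt", map_location=device); model.load_state_dict(ckpt["model"]); optim.load_state_dict(ckpt["optim"]). Training then continues from epoch ckpt["epoch"] + 1.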
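
A save-and-resume round trip can be rehearsed without touching disk. In this sketch an in-memory buffer stands in for "best.pt", and the tiny `Linear` model and the epoch number are placeholders chosen for illustration.

```python
import io
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

buf = io.BytesIO()                     # stands in for "best.pt"
torch.save({"model": model.state_dict(),
            "optim": opt.state_dict(),
            "epoch": 3}, buf)
buf.seek(0)

ckpt = torch.load(buf, map_location="cpu")
restored = torch.nn.Linear(4, 2)       # rebuild the architecture first
restored.load_state_dict(ckpt["model"])
restored_opt = torch.optim.AdamW(restored.parameters(), lr=3e-4)
restored_opt.load_state_dict(ckpt["optim"])
start_epoch = ckpt["epoch"] + 1        # continue after the saved epoch
```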

7. Loss / activation / optimizer reference

Losses

| Loss | Use | Target shape & dtype |
|---|---|---|
| nn.CrossEntropyLoss | multi-class, integer labels | target (B,) long; input is raw logits (B, K) |
| nn.BCEWithLogitsLoss | binary / multi-label | target (B,) or (B, K) float in {0, 1}; input is raw logits of the same shape |
| nn.NLLLoss | after log_softmax | same as CE, but input is log-probs |
| nn.MSELoss | regression | float, same shape as input |
| nn.L1Loss / SmoothL1Loss | robust regression | float, same shape |
| nn.KLDivLoss | distillation | input is log-probs, target is probs |
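
The "raw logits" notes can be verified numerically: cross-entropy on logits equals NLL on log-softmax, and the with-logits BCE equals plain BCE on sigmoid outputs. A sketch with random toy inputs, using the functional forms of the same losses:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)                  # (B, K) raw scores, no softmax applied
target = torch.tensor([0, 2, 1, 2])         # (B,) int64 class indices

ce = F.cross_entropy(logits, target)
nll = F.nll_loss(F.log_softmax(logits, dim=1), target)   # same value as ce

# Binary / multi-label: float targets with the same shape as the logits.
bin_logits = torch.randn(4)
bin_target = torch.tensor([0.0, 1.0, 1.0, 0.0])
bce = F.binary_cross_entropy_with_logits(bin_logits, bin_target)
bce_manual = F.binary_cross_entropy(torch.sigmoid(bin_logits), bin_target)
```

The practical consequence is that a model feeding `CrossEntropyLoss` or `BCEWithLogitsLoss` should not end in a softmax or sigmoid layer.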

Activations

| Activation | When to use |
|---|---|
| ReLU | default hidden activation for CNN/MLP |
| GELU | transformers (BERT, GPT) |
| SiLU / Swish | modern CNNs, EfficientNet |
| LeakyReLU | when ReLU is dying (lots of zeros) |
| Tanh | RNN cells, bounded output |
| Sigmoid | binary head only; never as a hidden activation |
| Softmax(dim=-1) | turn logits into probabilities (manual, not for CE loss) |

Optimizers

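| Optimizer | Typical settings | Notes |
|---|---|---|
| SGD | lr (task-dependent), momentum=0.9 (set explicitly; PyTorch's default is 0) | CV from scratch with LR schedule |
| Adam | lr=1e-3, betas=(0.9, 0.999) | safe default; faster early progress |
| AdamW | lr=3e-4, weight_decay=1e-2 | transformers / modern default (PyTorch's own default lr is 1e-3) |
| RMSprop | lr=1e-2 | RNNs, RL |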

Schedulers

torch.optim.lr_scheduler.StepLR(optim, step_size=10, gamma=0.1)            # lr *= 0.1 every 10 epochs
torch.optim.lr_scheduler.CosineAnnealingLR(optim, T_max=EPOCHS)           # step once per epoch
torch.optim.lr_scheduler.OneCycleLR(optim, max_lr=1e-3, total_steps=N)    # step once per BATCH; N = total batches
torch.optim.lr_scheduler.ReduceLROnPlateau(optim, mode="min", patience=3) # call sched.step(val_loss)
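
To see what a scheduler actually does, log the learning rate for a few epochs. This sketch uses a dummy parameter and a short `StepLR` (step_size=2 is an arbitrary choice) so the schedule is visible within five epochs.

```python
import torch

params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.SGD(params, lr=0.1)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=2, gamma=0.1)

lrs = []
for epoch in range(5):
    lrs.append(opt.param_groups[0]["lr"])
    opt.step()          # optimizer first ...
    sched.step()        # ... then scheduler, once per epoch

# lrs is approximately [0.1, 0.1, 0.01, 0.01, 0.001]
```

The same logging trick works for any scheduler. Note that `OneCycleLR` is stepped once per batch and `ReduceLROnPlateau` takes the monitored value as an argument to `step`.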

8. Layer shape rules

Conv2d

nn.Conv2d(in_channels=C, out_channels=F,
          kernel_size=k, stride=s, padding=p)
# Input  : (B, C, H, W)
# Output : (B, F, H_out, W_out)
# H_out = floor((H + 2p - k) / s) + 1
# Params : F * (C * k * k + 1)  (the +1 is the bias)
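
The two formulas fit in a pair of helper functions, which makes it easy to drill typical configurations. The helper names are hypothetical, and dilation is assumed to be 1 as in the formula above.

```python
def conv2d_out(size, k, s=1, p=0):
    """Output length along one spatial dim (dilation 1)."""
    return (size + 2 * p - k) // s + 1

def conv2d_params(c_in, c_out, k, bias=True):
    """Weights plus optional per-filter bias."""
    return c_out * (c_in * k * k + (1 if bias else 0))

same = conv2d_out(32, k=3, s=1, p=1)      # 32: padding k // 2 keeps the size
half = conv2d_out(32, k=3, s=2, p=1)      # 16: stride 2 halves it
stem = conv2d_out(224, k=7, s=2, p=3)     # 112: a 7x7 stride-2 stem
n_params = conv2d_params(3, 64, k=3)      # 64 * (3 * 9 + 1) = 1792
```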

Other layers

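| Layer | Input → Output | Param count |
|---|---|---|
| Linear(in, out) | (*, in) → (*, out) | in * out + out |
| BatchNorm2d(C) | (B, C, H, W) → same | 2C (γ, β) + 2C buffers (running mean, var) |
| LayerNorm(d) | (*, d) → same | 2d |
| MaxPool2d(k, s) | (B, C, H, W) → (B, C, ≈H/s, ≈W/s); exact: floor((H − k) / s) + 1 | 0 |
| AdaptiveAvgPool2d(1) | (B, C, H, W) → (B, C, 1, 1) | 0 |
| Dropout(p) | same shape | 0 |
| Embedding(V, d) | (B, L) long → (B, L, d) | V * d |
| MultiheadAttention(d, h, batch_first=True) | (B, L, d) → (B, L, d) | 4 * d * d + 4 * d (Q, K, V, out projections with biases) |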
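
Parameter counts are easy to confirm by summing `numel()` over a module's parameters. A sketch with small arbitrary sizes; the helper `n_params` is a hypothetical name. Buffers such as BatchNorm running statistics are not parameters and are not counted, while bias terms are.

```python
import torch.nn as nn

def n_params(module):
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())

d, h, V = 16, 4, 100
linear = n_params(nn.Linear(8, 5))                 # 8 * 5 + 5 = 45
bn = n_params(nn.BatchNorm2d(6))                   # 2 * 6 = 12
ln = n_params(nn.LayerNorm(d))                     # 2 * 16 = 32
emb = n_params(nn.Embedding(V, d))                 # 100 * 16 = 1600
# Q, K, V and output projections, each d x d weights plus d biases.
mha = n_params(nn.MultiheadAttention(d, h, batch_first=True))  # 4*d*d + 4*d = 1088
```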

MultiheadAttention call

attn = nn.MultiheadAttention(embed_dim=d, num_heads=h, batch_first=True)
out, weights = attn(query, key, value,
                    key_padding_mask=pad_mask,  # (B, L) bool, True = ignore
                    attn_mask=causal_mask)      # (L, L) bool/float
# query/key/value: (B, L, d). out: (B, L, d). weights: (B, L, L).
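
Building the two masks is where most bugs hide, so here is a self-attention sketch with both. The sizes and the choice of which tokens count as padding are invented for illustration. For boolean masks, True marks positions that must not be attended to.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, L, d, h = 2, 5, 16, 4
attn = nn.MultiheadAttention(embed_dim=d, num_heads=h, batch_first=True)
x = torch.randn(B, L, d)

# Causal mask: True above the diagonal blocks attention to future tokens.
causal_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)

# Padding mask: pretend the last two tokens of sequence 1 are padding.
pad_mask = torch.zeros(B, L, dtype=torch.bool)
pad_mask[1, -2:] = True

out, weights = attn(x, x, x, key_padding_mask=pad_mask, attn_mask=causal_mask)
# out: (B, L, d); weights: (B, L, L), averaged over heads by default.
```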

Conv output rule of thumb

"Same" padding for odd k: padding = k // 2 with stride=1 keeps H, W. Halving spatial dims is stride=2.

9. Hugging Face datasets & transformers

datasets

from datasets import load_dataset

ds = load_dataset("imdb")                        # DatasetDict {train, test}
ds = load_dataset("csv", data_files="train.csv") # local
small = ds["train"].shuffle(seed=0).select(range(2000))

def add_len(ex):
    ex["n_words"] = len(ex["text"].split())
    return ex
ds = ds.map(add_len)
ds = ds.filter(lambda ex: ex["n_words"] < 256)
# Call set_format only after tokenizing, once input_ids / attention_mask exist:
# ds_tok.set_format("torch", columns=["input_ids", "attention_mask", "label"])

AutoTokenizer

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=256, padding=False)

ds_tok = ds.map(tokenize, batched=True, remove_columns=["text"])

AutoModelForSequenceClassification + Trainer

from transformers import (
    AutoModelForSequenceClassification, TrainingArguments,
    Trainer, DataCollatorWithPadding,
)
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2,
)

# Note: evaluation_strategy was renamed eval_strategy in transformers >= 4.41;
# on older versions use the old name.
args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

def metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {"accuracy": accuracy_score(p.label_ids, preds),
            "f1": f1_score(p.label_ids, preds, average="macro")}

trainer = Trainer(
    model=model, args=args,
    train_dataset=ds_tok["train"], eval_dataset=ds_tok["test"],
    tokenizer=tok,   # newer transformers releases name this argument processing_class
    data_collator=DataCollatorWithPadding(tok),
    compute_metrics=metrics,
)
trainer.train()
results = trainer.predict(ds_tok["test"])

10. Complexity quick reference

Attention & transformers

| Quantity | Cost | Note |
|---|---|---|
| Self-attention compute | O(n² · d) | n = seq len, d = model dim |
| Self-attention memory (attn matrix) | O(n²) | per head; FlashAttention avoids materialising the full n × n matrix, so memory drops but the FLOPs do not |
| FFN block | O(n · d²) | two linears with hidden 4d |
| KV cache (autoregressive) | 2 · L · n · d elements in total | K and V for each of L layers; multiply by bytes per element (fp16 is half of fp32) |
| Decode step with cache | O(n · d) per token | attention only; vs O(n² · d) to recompute without cache |
| Multi-head Q,K,V,O projections | 4 · d² params | independent of head count |
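
The KV-cache row turns into a one-line calculation. In this sketch the model dimensions (12 layers, d = 768, 1024 cached tokens) are hypothetical numbers chosen only to make the arithmetic concrete, and `kv_cache_bytes` is an illustrative helper name.

```python
def kv_cache_bytes(n_layers, seq_len, d_model, bytes_per_elem=2, batch=1):
    """K and V for every layer: 2 * layers * tokens * d_model elements."""
    return 2 * n_layers * seq_len * d_model * bytes_per_elem * batch

# Hypothetical model: 12 layers, d_model = 768, 1024 cached tokens.
fp16 = kv_cache_bytes(12, 1024, 768, bytes_per_elem=2)   # 37,748,736 bytes = 36 MiB
fp32 = kv_cache_bytes(12, 1024, 768, bytes_per_elem=4)   # twice the fp16 size
```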

CNN layer math

| Quantity | Formula |
|---|---|
| Conv2d params (with bias) | F · (C · k · k + 1) |
| Conv2d FLOPs (per image) | 2 · H_out · W_out · F · C · k · k (mult + add; half that when counted as MACs) |
| Dense layer params | in · out + out |
| Dense layer FLOPs | 2 · in · out (mult + add) |
| Embedding params | V · d (usually the biggest matrix) |

Classical ML algorithms

| Algorithm | Train | Predict (per sample) |
|---|---|---|
| Linear / logistic regression | O(n · d² + d³) closed form (linear regression only); O(n · d) per epoch of SGD | O(d) |
| k-NN | O(1) (just store) | O(n · d) brute; O(d · log n) with KD-tree (low d) |
| Decision tree | O(n · d · log n) | O(depth) |
| Random forest (T trees) | O(T · n · d · log n) | O(T · depth) |
| Gradient boosting (T trees) | O(T · n · d · log n) sequential | O(T · depth) |
| SVM (RBF kernel) | O(n² · d) to O(n³ · d) | O(SV · d) |
| K-means (k clusters, I iters) | O(I · n · k · d) | O(k · d) |
| PCA via SVD | O(min(n · d², n² · d)) | O(k · d) |

Sanity checks for the contest

Where to go next

Once this page feels boring, you have the syntax internalised. Run the mock contests closed-book and only open this page when you actually get stuck — that's the contest-day workflow. The end-to-end notebooks show every API here in context.