Model evaluation — metrics, validation, leaderboards

A model is only as good as the number you score it with. Picking the wrong metric or the wrong validation split is the most common way strong contestants lose to weaker ones — and it is also the single highest-leverage skill you can sharpen before the contest.

TL;DR. Training loss alone says little about real-world performance; you need a held-out estimator. The two knobs are which metric (accuracy / F1 / AUC / RMSE / MAE / NDCG / BLEU / FID …) and which split (random k-fold, stratified, grouped, rolling-origin, nested). The wrong combination silently overfits the leaderboard or the validation set itself. This page catalogues the metrics, the validation strategies, and the leaderboard tactics (probing, ensembling, calibration, pseudo-labels) that win USAAIO and IOAI scoreboards.

1. The intuition

Training loss measures how well the model memorised the data it has already seen. Generalisation error — the quantity you actually care about — is the loss on data the model has not seen. The whole point of evaluation is to estimate that generalisation error before the contest grader does it for you.

A held-out split is a Monte Carlo estimator of expected loss over the data distribution. Its variance shrinks like 1 / n_val, so tiny validation sets give noisy estimates and tempt you to chase noise. Cross-validation averages several such estimators to reduce variance, at the cost of training the model k times. Stratification preserves class balance across folds; grouping prevents the same patient / user / image showing up in both train and val (leakage); rolling-origin splits respect time so the model never "sees the future."
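To make the 1 / n_val claim concrete, here is a minimal sketch. It assumes the metric is accuracy and that validation examples are independent, so the number of correct predictions is binomial; the helper name `accuracy_std_error` and the 0.85 accuracy are illustrative choices, not part of any library.

```python
import math


def accuracy_std_error(p: float, n_val: int) -> float:
    """Standard error of an accuracy estimate from n_val independent examples.

    The variance of the estimate is p * (1 - p) / n_val, so the standard
    error (its square root) shrinks like 1 / sqrt(n_val).
    """
    return math.sqrt(p * (1.0 - p) / n_val)


# Assumed true accuracy of 0.85, evaluated on validation sets of growing size.
for n_val in (100, 1_000, 10_000):
    se = accuracy_std_error(0.85, n_val)
    print(f"n_val={n_val:>6}  std error={se:.4f}  ~95% band=+/-{1.96 * se:.4f}")
```

With 100 validation examples the 95% band is roughly +/-0.07, so a 2-point accuracy gain between two models is well inside the noise.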

The metric encodes what counts as a correct prediction. Accuracy weights every error equally, so under class imbalance a majority-class predictor can score well. F1 focuses on the positive class and ignores true negatives. ROC AUC is threshold-free but can look deceptively high under extreme imbalance. RMSE punishes outliers; MAE shrugs at them. Choosing badly means you optimise the wrong objective even with a perfect pipeline.

2. The math & content

Classification metrics

From a confusion matrix with TP, FP, FN, TN:

Accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision = TP / (TP + FP), Recall = TP / (TP + FN)
F1 = 2 * P * R / (P + R) = 2 TP / (2 TP + FP + FN)

Macro F1 averages per-class F1 with equal weight (each class counts the same regardless of frequency). Micro F1 aggregates TP/FP/FN globally and in multi-class single-label settings collapses to accuracy. Weighted F1 averages per-class F1 weighted by class support. USAAIO classification tasks usually report macro-F1 because they want minority classes to count.
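The three averaging modes can be sketched in plain Python. This is a minimal illustration for single-label multi-class data, not a replacement for a library implementation; the function names are hypothetical, and returning 0.0 when a class has no TP, FP or FN is an assumed convention.

```python
def per_class_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)} for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts


def f1_from_counts(tp, fp, fn):
    """F1 = 2 TP / (2 TP + FP + FN); 0.0 if the denominator is zero (assumption)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_weighted_f1(y_true, y_pred):
    counts = per_class_counts(y_true, y_pred)
    f1s = {c: f1_from_counts(*v) for c, v in counts.items()}
    support = {c: tp + fn for c, (tp, fp, fn) in counts.items()}
    macro = sum(f1s.values()) / len(f1s)                      # equal weight per class
    weighted = sum(f1s[c] * support[c] for c in f1s) / len(y_true)
    tp, fp, fn = (sum(v[i] for v in counts.values()) for i in range(3))
    micro = f1_from_counts(tp, fp, fn)                        # pooled counts
    return macro, micro, weighted


# Imbalanced toy data: 8 of class 0, 2 of class 1, one minority example missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
macro, micro, weighted = macro_micro_weighted_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"macro={macro:.3f}  micro={micro:.3f}  weighted={weighted:.3f}  acc={accuracy:.3f}")
```

On this toy data micro F1 equals accuracy (0.9), while macro F1 is lower because the single missed minority example drags class 1's F1 down to 2/3.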

ROC AUC integrates the true-positive rate against the false-positive rate over all decision thresholds; it equals the probability that a random positive is ranked above a random negative. PR AUC integrates precision against recall and is the right curve when positives are rare: ROC AUC can stay close to 1 on heavily imbalanced data because the FPR denominator (FP + TN) is dominated by the huge number of true negatives, so even many false positives barely move FPR and the model looks better than it is.
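The ranking interpretation of ROC AUC can be checked directly by counting positive/negative pairs. The sketch below is a brute-force illustration (quadratic in the data size, so suitable only for small examples); the function name is hypothetical and ties are counted as half a win, which is the usual convention.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC AUC as P(score of a random positive > score of a random negative).

    Ties count as half a win. Cost is O(n_pos * n_neg).
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
# Of the 4 positive/negative pairs, 3 are ranked correctly (0.35 loses to 0.4).
print(roc_auc_pairwise(y_true, scores))  # 0.75
```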

Log loss (cross-entropy) penalises confidently wrong probabilities:

LogLoss = -(1/n) * sum_i sum_k y_{i,k} * log(p_{i,k})

Use it when the contest scores calibrated probabilities, not labels.
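A small sketch of why log loss separates models that accuracy cannot. With one-hot labels the inner sum over k reduces to the log-probability of the true class, which is what the hypothetical helper below computes; the clipping constant `eps` is an assumption to avoid log(0).

```python
import math


def log_loss_multiclass(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: integer class indices; probs: one row of class probabilities per example.
    Probabilities are clipped to [eps, 1 - eps] (eps is an illustrative choice).
    """
    total = 0.0
    for y, row in zip(y_true, probs):
        p = min(max(row[y], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(y_true)


y_true = [1, 1]
hedged = [[0.4, 0.6], [0.6, 0.4]]          # one right, one mildly wrong
confident = [[0.01, 0.99], [0.99, 0.01]]   # one right, one confidently wrong

loss_hedged = log_loss_multiclass(y_true, hedged)
loss_confident = log_loss_multiclass(y_true, confident)
print(f"hedged: {loss_hedged:.4f}  confident: {loss_confident:.4f}")
```

Both sets of predictions have 50% accuracy at a 0.5 threshold, yet the confident one scores roughly three times worse on log loss because of the single 0.01 assigned to the true class.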

Regression metrics

MSE = (1/n) sum (y - yhat)^2, RMSE = sqrt(MSE)
MAE = (1/n) sum |y - yhat|
R^2 = 1 - SS_res / SS_tot = 1 - sum (y - yhat)^2 / sum (y - mean(y))^2
MAPE = (100/n) sum |(y - yhat) / y|

RMSE is in the same units as the target and punishes large errors quadratically — pick it when outlier predictions are catastrophic. MAE is robust to outliers and easier to interpret as "average miss." R² compares your model to a constant-mean baseline; negative R² means you do worse than predicting the mean. MAPE breaks when y is near zero. IOAI chicken-counting and similar counting tasks are scored with MAE.
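The contrast between RMSE and MAE, and the meaning of a negative R², is easiest to see on a toy example. This is a minimal sketch with made-up numbers; the helper `regression_metrics` is hypothetical and implements the formulas above directly.

```python
import math


def regression_metrics(y, yhat):
    """Return RMSE, MAE and R^2 computed from the definitions above."""
    n = len(y)
    errs = [a - b for a, b in zip(y, yhat)]
    ss_res = sum(e * e for e in errs)
    mae = sum(abs(e) for e in errs) / n
    mean_y = sum(y) / n
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return {"rmse": math.sqrt(ss_res / n), "mae": mae, "r2": 1.0 - ss_res / ss_tot}


y = [10.0, 12.0, 11.0, 13.0, 14.0]
uniform_miss = [10.5, 11.5, 11.5, 12.5, 14.5]   # every prediction off by 0.5
one_outlier = [10.0, 12.0, 11.0, 13.0, 24.0]    # four perfect, one off by 10

m_uniform = regression_metrics(y, uniform_miss)
m_outlier = regression_metrics(y, one_outlier)
print(m_uniform)   # rmse 0.5, mae 0.5, r2 0.875
print(m_outlier)   # rmse ~4.47, mae 2.0, r2 -9.0
```

When every miss is the same size RMSE and MAE agree; a single large miss pushes RMSE to more than twice the MAE and drives R² to -9, far worse than predicting the mean.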

Ranking / retrieval metrics (brief)

Generative metrics (brief catalogue)

Validation strategies

Bias–variance and learning curves

For squared-error loss, the expected generalisation error at a point decomposes into three pieces:

E[(y - yhat)^2] = Bias(yhat)^2 + Var(yhat) + sigma^2

where sigma^2 is irreducible noise. A learning curve plots train and validation error against the number of training samples. High train error tracking high val error = high bias (underfit): pick a more expressive model. Low train error with a big gap to val = high variance (overfit): regularise, augment, get more data. The two curves converging at high error and staying flat means more data will not help this model; either a more expressive model is needed or you are already near the noise floor.

Leaderboard tactics

3. Python reference implementation (scikit-learn)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    train_test_split, StratifiedKFold, cross_val_score, learning_curve,
)
from sklearn.metrics import (
    accuracy_score, precision_recall_fscore_support, f1_score,
    confusion_matrix, classification_report,
    roc_curve, roc_auc_score, precision_recall_curve, average_precision_score,
    log_loss,
)


# ---------- 1. data + stratified train / val / test split ----------
X, y = make_classification(
    n_samples=4000, n_features=20, n_informative=8,
    weights=[0.85, 0.15], random_state=0,            # imbalanced
)

X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.1765,        # ~15% of total
    stratify=y_trainval, random_state=0,
)
print("split sizes:", len(X_train), len(X_val), len(X_test))


# ---------- 2. manual stratified k-fold loop + sklearn parity ----------
def manual_cv(model_fn, X, y, k=5, seed=0):
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for fold, (tr, va) in enumerate(skf.split(X, y)):
        m = model_fn()
        m.fit(X[tr], y[tr])
        p = m.predict_proba(X[va])[:, 1]
        scores.append(roc_auc_score(y[va], p))
    return np.array(scores)

manual_scores = manual_cv(
    lambda: LogisticRegression(max_iter=1000), X_trainval, y_trainval,
)
sk_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_trainval, y_trainval,
    cv=StratifiedKFold(5, shuffle=True, random_state=0), scoring="roc_auc",
)
print(f"manual AUC: {manual_scores.mean():.4f} +/- {manual_scores.std():.4f}")
print(f"sklearn AUC: {sk_scores.mean():.4f} +/- {sk_scores.std():.4f}")


# ---------- 3. confusion matrix + classification_report ----------
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
y_val_hat = clf.predict(X_val)
y_val_prob = clf.predict_proba(X_val)[:, 1]

print("confusion matrix:\n", confusion_matrix(y_val, y_val_hat))
print(classification_report(y_val, y_val_hat, digits=3))
print(f"log loss: {log_loss(y_val, y_val_prob):.4f}")


# ---------- 4. ROC + PR curves ----------
fpr, tpr, _ = roc_curve(y_val, y_val_prob)
prec, rec, pr_thr = precision_recall_curve(y_val, y_val_prob)
auc, ap = roc_auc_score(y_val, y_val_prob), average_precision_score(y_val, y_val_prob)

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].plot(fpr, tpr, label=f"AUC = {auc:.3f}")
ax[0].plot([0, 1], [0, 1], "--", color="grey")
ax[0].set(xlabel="FPR", ylabel="TPR", title="ROC")
ax[0].legend()
ax[1].plot(rec, prec, label=f"AP = {ap:.3f}")
ax[1].set(xlabel="Recall", ylabel="Precision", title="PR")
ax[1].legend()
fig.tight_layout()


# ---------- 5. learning curve helper ----------
def plot_learning_curve(estimator, X, y, cv=5):
    sizes, tr, va = learning_curve(
        estimator, X, y, cv=cv, scoring="roc_auc",
        train_sizes=np.linspace(0.1, 1.0, 8), random_state=0,
    )
    plt.figure()
    plt.plot(sizes, tr.mean(axis=1), "o-", label="train")
    plt.plot(sizes, va.mean(axis=1), "o-", label="val")
    plt.xlabel("training samples"); plt.ylabel("AUC")
    plt.legend(); plt.title("learning curve")

plot_learning_curve(LogisticRegression(max_iter=1000), X_trainval, y_trainval)
plt.show()  # render the ROC/PR and learning-curve figures when run as a script


# ---------- 6. threshold tuning to maximise F1 ----------
def best_f1_threshold(y_true, y_prob):
    prec, rec, thr = precision_recall_curve(y_true, y_prob)
    # precision_recall_curve returns one extra prec/rec point; align lengths
    f1 = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-12)
    j = int(np.nanargmax(f1))
    return float(thr[j]), float(f1[j])

t_star, f1_star = best_f1_threshold(y_val, y_val_prob)
y_test_prob = clf.predict_proba(X_test)[:, 1]
y_test_hat  = (y_test_prob >= t_star).astype(int)
print(f"tuned threshold: {t_star:.3f}  val F1: {f1_star:.3f}")
print(f"test F1 @ tuned threshold: {f1_score(y_test, y_test_hat):.3f}")
print(f"test F1 @ 0.5 default:     {f1_score(y_test, (y_test_prob >= 0.5).astype(int)):.3f}")

The threshold is tuned on the validation split, then frozen and applied to the test set — never tune a decision threshold on the same data you report on. For multi-fold setups, tune the threshold on the concatenated out-of-fold probabilities instead.

4. Common USAAIO / IOAI applications

5. Drills

D1 · Derive F1 from precision and recall

You have precision P = 0.6 and recall R = 0.9. Compute F1 and verify it lies between min and max of P and R.

Solution

F1 = 2 * 0.6 * 0.9 / (0.6 + 0.9) = 1.08 / 1.5 = 0.72. Since F1 is the harmonic mean it is always between min(P, R) and the arithmetic mean (P + R) / 2 = 0.75; 0.6 ≤ 0.72 ≤ 0.75. The harmonic mean is biased toward the smaller of the two, which is why F1 punishes imbalance between precision and recall.

D2 · Why ROC AUC fails on extreme imbalance

Positives are 0.1% of the data. A model gets ROC AUC = 0.98 but PR AUC = 0.12. Why can both be true and which one should you trust?

Solution

FPR = FP / (FP + TN). When negatives dominate, the denominator is enormous and FPR stays tiny even for many false positives, inflating ROC AUC. PR AUC uses precision = TP / (TP + FP), whose denominator scales with FP, so it reflects the actual cost of false alarms. Trust PR AUC (or average precision) when the positive class is rare.
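A quick numeric check of the argument, using hypothetical confusion-matrix counts chosen to match the 0.1% positive rate in the question (the counts themselves are invented for illustration):

```python
def fpr_precision_recall(tp, fp, fn, tn):
    """Return (FPR, precision, recall) from confusion-matrix counts."""
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return fpr, precision, recall


# Hypothetical: 1,000,000 rows, 1,000 positives (0.1%), at one fixed threshold.
tp, fn = 900, 100          # the model catches 90% of positives
fp, tn = 9_000, 990_000    # and raises 9,000 false alarms

fpr, precision, recall = fpr_precision_recall(tp, fp, fn, tn)
print(f"FPR={fpr:.4f}  recall={recall:.2f}  precision={precision:.4f}")
```

The operating point looks excellent on the ROC plane (FPR below 1% at 90% recall), yet fewer than one flagged row in ten is a true positive.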

D3 · Time-series CV vs random k-fold gotcha

You forecast daily sales. Random 5-fold CV gives RMSE 120; the contest test set (future months) gives RMSE 410. What happened and how do you fix the CV?

Solution

Random k-fold mixes future rows into the training folds, leaking trend and seasonality that the model could not have known at prediction time. Switch to rolling-origin CV (TimeSeriesSplit): each fold trains on [0, t] and validates on [t + 1, t + h]. The CV RMSE should rise toward the test RMSE; that is the honest estimate you should have been optimising all along.
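The rolling-origin scheme can be sketched as a small index generator. This is an illustrative helper with a hypothetical name and parameters, assuming rows are already sorted by time and using an expanding training window; it is not a reimplementation of scikit-learn's TimeSeriesSplit.

```python
def rolling_origin_splits(n_samples, n_folds, horizon):
    """Yield (train_idx, val_idx) pairs with an expanding training window.

    Each fold validates on `horizon` consecutive rows, and every training row
    comes strictly before the validation rows. Assumes rows are time-ordered.
    """
    first_val_start = n_samples - n_folds * horizon
    if first_val_start <= 0:
        raise ValueError("not enough samples for the requested folds and horizon")
    for i in range(n_folds):
        val_start = first_val_start + i * horizon
        train_idx = list(range(0, val_start))
        val_idx = list(range(val_start, val_start + horizon))
        yield train_idx, val_idx


splits = list(rolling_origin_splits(n_samples=10, n_folds=3, horizon=2))
for tr, va in splits:
    print(f"train 0..{tr[-1]}  val {va[0]}..{va[-1]}")
# train 0..3  val 4..5
# train 0..5  val 6..7
# train 0..7  val 8..9
```

Because the training window always ends before the validation window starts, no fold can borrow trend or seasonality from the future.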

D4 · Expected public → private leaderboard gap

Your CV is 0.842, public LB is 0.861, private LB ends up 0.838. Which numbers were you overfitting and what is the lesson?

Solution

Public LB is well above CV, private LB drops back near CV. The model was overfitting the public split — either by repeated probing (each submission leaks a bit of public-set information) or by tuning a threshold against public LB feedback. CV was the honest estimate; the private board confirmed it. Lesson: trust your CV, submit sparingly, and freeze hyperparameters from CV not from public LB.
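The probing effect can be reproduced with a toy simulation. The sketch below is deliberately simplified: every "submission" is a coin-flip model with true accuracy 0.5, each prediction is treated as an independent correct/incorrect draw, and the sizes and the function name are arbitrary assumptions. Picking the best public score then measures selection noise only.

```python
import random


def simulate_probing(n_public=100, n_private=1000, n_submissions=200, seed=0):
    """Pick the best of many coin-flip submissions by public accuracy.

    Every submission has true accuracy 0.5, so any public score above 0.5 is
    pure selection noise. Returns (best public accuracy, private accuracy of
    that same submission).
    """
    rng = random.Random(seed)
    best_public, private_of_best = -1.0, 0.0
    for _ in range(n_submissions):
        public = sum(rng.random() < 0.5 for _ in range(n_public)) / n_public
        private = sum(rng.random() < 0.5 for _ in range(n_private)) / n_private
        if public > best_public:
            best_public, private_of_best = public, private
    return best_public, private_of_best


best_public, private_of_best = simulate_probing()
print(f"best public acc: {best_public:.3f}   its private acc: {private_of_best:.3f}")
```

The selected submission's public score typically lands well above 0.5 while its private score stays near 0.5, which is the same pattern as the 0.861 to 0.838 drop in the question.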

D5 · When to ensemble vs ship a single model

You have two models: model A with CV 0.81, model B with CV 0.80. Their predictions are correlated 0.96. Is averaging worth it?

Solution

Averaging k predictors with equal variance and pairwise correlation rho leaves a fraction (1 + (k - 1) * rho) / k of the single-model variance. With k = 2 and rho = 0.96 that fraction is 0.98, a reduction of only ~2%, which is within the noise of CV. The ensemble is unlikely to beat A on the private board and adds inference cost. Ensembling pays off when predictors are diverse (rho < 0.9) and individually strong; otherwise ship A.
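The factor (1 + (k - 1) * rho) / k is easy to tabulate. The sketch below assumes all k predictors have the same variance and the same pairwise correlation rho; the function name is hypothetical.

```python
def ensemble_variance_factor(k: int, rho: float) -> float:
    """Variance of an equal-weight average of k predictors, as a fraction of
    a single predictor's variance (equal variances, pairwise correlation rho)."""
    return (1.0 + (k - 1) * rho) / k


for rho in (0.96, 0.9, 0.5, 0.0):
    factor = ensemble_variance_factor(2, rho)
    print(f"rho={rho:.2f}  remaining variance={factor:.2f}  reduction={1 - factor:.0%}")
```

For two models the reduction is 2% at rho = 0.96, 5% at rho = 0.9, and 25% at rho = 0.5, which is why diversity matters more than adding another near-duplicate model.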

Next step

Now that you can score models honestly, loop back to Classical ML for the model families themselves, walk through the pitfalls checklist (leakage, target encoding, scaler-after-split, etc.), and run a full mock contest end-to-end with the right metric and the right CV scheme.