Model evaluation — metrics, validation, leaderboards

A model is only as good as the number you score it with. Picking the wrong metric or the wrong validation split is the most common way strong contestants lose to weaker ones — and it is also the single highest-leverage skill you can sharpen before the contest.

TL;DR. Training loss tells you little about real-world performance; you need a held-out estimator. The two knobs are which metric (accuracy / F1 / AUC / RMSE / MAE / NDCG / BLEU / FID …) and which split (random k-fold, stratified, grouped, rolling-origin, nested). The wrong combination silently overfits the leaderboard or the validation set itself. This page catalogues the metrics, the validation strategies, and the leaderboard tactics (probing, ensembling, calibration, pseudo-labels) that win USAAIO and IOAI scoreboards.

1. The intuition

Training loss measures how well the model memorised the data it has already seen. Generalisation error — the quantity you actually care about — is the loss on data the model has not seen. The whole point of evaluation is to estimate that generalisation error before the contest grader does it for you.

A held-out split is a Monte Carlo estimator of expected loss over the data distribution. Its variance shrinks like 1 / n_val (so its standard error shrinks like 1 / sqrt(n_val)), which means tiny validation sets give noisy estimates and tempt you to chase noise. Cross-validation averages several such estimators to reduce variance, at the cost of training the model k times. Stratification preserves class balance across folds; grouping prevents the same patient / user / image showing up in both train and val (leakage); rolling-origin splits respect time so the model never "sees the future."
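
To make the 1 / n_val claim concrete, here is a minimal simulation sketch. It assumes a model whose true accuracy is 0.80 (a number chosen only for illustration) and treats each validation example as an independent coin flip, which is the IID idealisation rather than a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
true_acc = 0.80      # assumed true accuracy, chosen only for illustration
n_repeats = 2000     # simulated validation splits per size


def simulated_val_std(n_val):
    """Std of the measured accuracy across many validation sets of size n_val."""
    # Each example is correct with probability true_acc, so the number of
    # correct predictions on one split is a binomial draw.
    acc = rng.binomial(n_val, true_acc, size=n_repeats) / n_val
    return float(acc.std())


stds = {n: simulated_val_std(n) for n in (100, 400, 1600)}
for n, s in stds.items():
    theory = np.sqrt(true_acc * (1 - true_acc) / n)
    print(f"n_val={n:5d}  empirical std={s:.4f}  binomial theory={theory:.4f}")
```

Quadrupling the validation set roughly halves the standard error, so a 0.5-point accuracy gain measured on a few hundred rows is usually indistinguishable from noise.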

The metric encodes what counts as a correct prediction. Accuracy weights every error equally, so on imbalanced data it is dominated by the majority class. F1 focuses on the positive class and ignores true negatives. ROC AUC is threshold-free but can look deceptively good under extreme imbalance. RMSE punishes outliers; MAE shrugs at them. Choosing badly means you optimise the wrong objective even with a perfect pipeline.

2. The math & content

Classification metrics

From a confusion matrix with TP, FP, FN, TN:

Accuracy = (TP + TN) / (TP + FP + FN + TN)

Precision = TP / (TP + FP), Recall = TP / (TP + FN)

F1 = 2 * P * R / (P + R) = 2 TP / (2 TP + FP + FN)

Macro F1 averages per-class F1 with equal weight (each class counts the same regardless of frequency). Micro F1 aggregates TP/FP/FN globally and in multi-class single-label settings collapses to accuracy. Weighted F1 averages per-class F1 weighted by class support. USAAIO classification tasks usually report macro-F1 because they want minority classes to count.
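
The three averaging modes are easiest to tell apart on a tiny imbalanced example. The sketch below uses made-up labels (8 / 1 / 1 across three classes) in which the classifier never predicts class 1; only scikit-learn's f1_score and accuracy_score are used.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (illustrative only): class 0 dominates, class 1 is always missed.
y_true = np.array([0] * 8 + [1] + [2])
y_pred = np.array([0] * 8 + [0] + [2])

scores = {
    avg: f1_score(y_true, y_pred, average=avg, zero_division=0)
    for avg in ("macro", "micro", "weighted")
}
acc = accuracy_score(y_true, y_pred)

for avg, s in scores.items():
    print(f"{avg:>8} F1 = {s:.3f}")
print(f"accuracy    = {acc:.3f}")
```

Micro F1 matches accuracy, weighted F1 sits just below it, and macro F1 drops sharply because the missed minority class contributes a per-class F1 of zero with full weight.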

ROC AUC integrates the true-positive rate against the false-positive rate over all decision thresholds; it equals the probability that a random positive is ranked above a random negative. PR AUC integrates precision against recall and is the right curve when positives are rare. ROC AUC can stay close to 1 on imbalanced data because the FPR denominator (FP + TN) is huge, so even many false positives barely move it and the model looks better than it is.
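
The ranking interpretation of ROC AUC can be checked directly. This is a small sketch on synthetic scores (the data and the noise level are assumptions for illustration): it counts the fraction of positive / negative pairs where the positive scores higher, with ties counted as half, and compares that to roc_auc_score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)          # synthetic binary labels
s = rng.normal(loc=y, scale=1.0)          # positives score higher on average

pos, neg = s[y == 1], s[y == 0]
diff = pos[:, None] - neg[None, :]        # every positive-vs-negative pair
pairwise = (diff > 0).mean() + 0.5 * (diff == 0).mean()

auc = roc_auc_score(y, s)
print(f"pairwise ranking probability: {pairwise:.6f}")
print(f"roc_auc_score:                {auc:.6f}")
```

The pairwise version is O(n_pos * n_neg) in memory, so it is a sanity check for small arrays rather than a replacement for the library call.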

Log loss (cross-entropy) penalises confidently wrong probabilities:

LogLoss = -(1/n) * sum_i sum_k y_{i,k} * log(p_{i,k})

Use it when the contest scores calibrated probabilities, not labels.
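
A hand-rolled version of the formula shows why confidence matters. In this sketch the two hypothetical prediction sets make exactly the same mistake on the third example, so their accuracy is identical, but one is cautious and the other is confidently wrong. The clipping constant eps is an assumption added to avoid log(0).

```python
import numpy as np


def log_loss_manual(y_onehot, p, eps=1e-15):
    """Mean cross-entropy: -(1/n) * sum_i sum_k y[i, k] * log(p[i, k])."""
    p = np.clip(p, eps, 1 - eps)          # guard against log(0)
    return float(-np.mean(np.sum(y_onehot * np.log(p), axis=1)))


y_onehot = np.array([[1, 0], [0, 1], [1, 0]])

# Both predictors get example 3 wrong; only their confidence differs.
cautious = np.array([[0.60, 0.40], [0.40, 0.60], [0.40, 0.60]])
confident = np.array([[0.99, 0.01], [0.01, 0.99], [0.01, 0.99]])

loss_cautious = log_loss_manual(y_onehot, cautious)
loss_confident = log_loss_manual(y_onehot, confident)
print(f"cautious:  {loss_cautious:.3f}")
print(f"confident: {loss_confident:.3f}")
```

The confident predictor is better on the two examples it gets right, yet the single confident miss dominates its average, which is the behaviour calibration is meant to protect against.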

Regression metrics

MSE = (1/n) sum (y - yhat)^2, RMSE = sqrt(MSE)

MAE = (1/n) sum |y - yhat|

R^2 = 1 - SS_res / SS_tot = 1 - sum (y - yhat)^2 / sum (y - mean(y))^2

MAPE = (100/n) sum |(y - yhat) / y|

RMSE is in the same units as the target and punishes large errors quadratically — pick it when outlier predictions are catastrophic. MAE is robust to outliers and easier to interpret as "average miss." R² compares your model to a constant-mean baseline; negative R² means you do worse than predicting the mean. MAPE breaks when y is near zero. IOAI chicken-counting and similar counting tasks are scored with MAE.
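
The difference between RMSE and MAE is clearest when two sets of predictions have the same total absolute error distributed differently. The sketch below uses small made-up targets: one predictor is off by 1 everywhere, the other is perfect except for a single miss of 5.

```python
import numpy as np


def regression_metrics(y, yhat):
    """RMSE, MAE, R^2 and MAPE exactly as defined in the formulas above."""
    err = y - yhat
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "R2": float(1 - ss_res / ss_tot),
        "MAPE": float(100 * np.mean(np.abs(err / y))),   # needs y != 0
    }


y = np.array([10.0, 12.0, 9.0, 11.0, 10.0])              # toy targets
spread = y + np.array([1.0, -1.0, 1.0, -1.0, 1.0])       # off by 1 everywhere
outlier = y + np.array([0.0, 0.0, 0.0, 0.0, 5.0])        # one miss of 5

m_spread = regression_metrics(y, spread)
m_outlier = regression_metrics(y, outlier)
print("spread :", m_spread)
print("outlier:", m_outlier)
```

Both predictors have MAE = 1, but the single large miss more than doubles RMSE and pushes R² negative, so the choice of metric decides which one ranks higher.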

Ranking / retrieval metrics (brief)

NDCG@k — Normalised Discounted Cumulative Gain. Rewards high-relevance items appearing near the top of a ranked list; logarithmic position discount.
MAP — Mean Average Precision; the mean over queries of average precision (the precision at each relevant item's rank, averaged), which approximates the area under the per-query precision-recall curve.
MRR — Mean Reciprocal Rank of the first relevant result; cheap and useful when only the top hit matters.
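
For reference, NDCG@k and MRR fit in a few lines of NumPy. This sketch assumes the linear-gain form of DCG (relevance divided by log2(rank + 1)); some contests use the exponential gain 2^rel - 1 instead, so check the task statement. The function names are illustrative, not from any library.

```python
import numpy as np


def dcg_at_k(rels, k):
    """Discounted cumulative gain of the first k items (linear gain variant)."""
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rels.size + 2))   # log2(rank + 1)
    return float(np.sum(rels / discounts))


def ndcg_at_k(rels, k):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


def mrr(rankings):
    """Mean reciprocal rank; each ranking is a list of 0/1 relevance flags."""
    rr = []
    for rels in rankings:
        hits = np.flatnonzero(rels)                    # positions of relevant items
        rr.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(rr))


perfect = ndcg_at_k([3, 2, 1, 0], k=4)
reversed_order = ndcg_at_k([0, 1, 2, 3], k=4)
example_mrr = mrr([[0, 1, 0], [1, 0, 0]])
print(perfect, reversed_order, example_mrr)
```

A perfectly ordered list scores 1.0, the same items in reverse order score well below that, and the MRR example averages reciprocal ranks 1/2 and 1.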

Generative metrics (brief catalogue)

BLEU — n-gram precision against reference translations with a brevity penalty. Standard for MT; weak for open-ended text.
ROUGE-N / ROUGE-L — recall-flavoured n-gram and longest-common-subsequence overlap; standard for summarisation.
Perplexity — exp of the mean per-token negative log-likelihood on a held-out corpus; the standard intrinsic LM metric. Lower is better.
FID — Fréchet distance between Inception feature distributions of real and generated images. Lower is better.
IS — Inception Score; rewards high-confidence class predictions with diverse class marginals. Largely superseded by FID.
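
Of the metrics in this catalogue, perplexity is the one you can compute by hand in a line. The sketch below takes the probabilities a language model assigned to the correct next tokens (the numbers are invented for illustration) and exponentiates the mean negative log-likelihood.

```python
import numpy as np


def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the correct tokens."""
    nll = -np.log(np.asarray(token_probs, dtype=float))
    return float(np.exp(nll.mean()))


# A model that guesses uniformly over 4 tokens has perplexity 4.
uniform_ppl = perplexity([0.25] * 8)
# A sharper (hypothetical) model that puts most mass on the right token.
sharp_ppl = perplexity([0.9, 0.8, 0.95, 0.7])
print(uniform_ppl, sharp_ppl)
```

Perplexity can be read as the effective number of equally likely choices the model is hesitating between at each token.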

Validation strategies

Train / val / test — single random split, e.g. 70 / 15 / 15. Fast but high variance; use only when the dataset is large.
k-fold CV — partition the data into k equal folds, train on k - 1, validate on the held-out one, rotate. Report mean and std across folds. k = 5 is the default.
Stratified k-fold — preserve the class distribution in each fold. Mandatory for imbalanced classification.
Group k-fold — keep all rows sharing a group id (patient, user, image) on the same side of each split. Without it, near-duplicates leak across train and val and your CV score lies.
Time-series / rolling-origin CV — never train on the future. Each fold uses [0, t] for training and [t + 1, t + h] for validation; t grows or slides. sklearn.model_selection.TimeSeriesSplit.
Nested CV — outer loop estimates generalisation, inner loop selects hyperparameters. Avoids the optimistic bias of using the same CV for both tuning and reporting.
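
The leakage guarantees of the grouped and time-ordered splitters can be verified mechanically. This sketch builds a toy array with four hypothetical patients (three rows each) and checks that GroupKFold never shares a group across train and val, and that TimeSeriesSplit never trains on an index later than the earliest validation index.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(24).reshape(12, 2)          # 12 rows, in time order
y = np.array([0, 1] * 6)
groups = np.repeat(np.arange(4), 3)       # 4 hypothetical patients, 3 rows each

# Group k-fold: the same group id must never be on both sides of a split.
group_overlap = []
for tr, va in GroupKFold(n_splits=4).split(X, y, groups):
    group_overlap.append(set(groups[tr]) & set(groups[va]))

# Rolling-origin: every training index must precede every validation index.
future_leak = []
for tr, va in TimeSeriesSplit(n_splits=3).split(X):
    print("train", tr.tolist(), "-> val", va.tolist())
    future_leak.append(bool(tr.max() >= va.min()))

print("group overlaps:", group_overlap)
print("any future leak:", any(future_leak))
```

Running the same two checks against a plain shuffled KFold on your own data is a quick way to see whether the default splitter would have leaked.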

Bias–variance and learning curves

Generalisation error decomposes into three pieces:

E[(y - yhat)^2] = Bias(yhat)^2 + Var(yhat) + sigma^2

where sigma^2 is irreducible noise. A learning curve plots train and validation error against the number of training samples. High train error tracking high val error = high bias (underfit): pick a more expressive model. Low train error with a big gap to val = high variance (overfit): regularise, augment, get more data. If the two curves have converged and stay flat, more data will not help: either the model is too simple (the plateau drops when you switch to a more expressive model) or you are near the noise floor (it does not).
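
The decomposition can be estimated by simulation when the true function is known. The sketch below is a toy setup, not a recipe for real data: it assumes a sine target, Gaussian noise with sigma = 0.3, and a fixed training grid of 20 points, then refits a degree-1 and a degree-7 polynomial on many noisy resamples to measure squared bias and variance separately.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3                                   # assumed noise level
x_train = np.linspace(0, 1, 20)               # fixed design, noise is resampled
x_test = np.linspace(0.05, 0.95, 50)


def f(x):
    """Assumed ground-truth function for the simulation."""
    return np.sin(2 * np.pi * x)


def bias2_and_variance(degree, n_repeats=300):
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        y_train = f(x_train) + rng.normal(0, sigma, x_train.size)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[r] = np.polyval(coefs, x_test)
    # Bias: gap between the average prediction and the truth.
    bias2 = float(np.mean((preds.mean(axis=0) - f(x_test)) ** 2))
    # Variance: spread of predictions across training sets.
    variance = float(np.mean(preds.var(axis=0)))
    return bias2, variance


b1, v1 = bias2_and_variance(degree=1)
b7, v7 = bias2_and_variance(degree=7)
print(f"degree 1: bias^2={b1:.4f}  variance={v1:.4f}")
print(f"degree 7: bias^2={b7:.4f}  variance={v7:.4f}")
```

The straight line is dominated by bias and the degree-7 polynomial by variance, which is the same trade-off the learning curve shows from the outside.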

Leaderboard tactics

Public vs private split — contests score submissions on a public fraction during the contest and a held-out private fraction at the end. Shake-up at the end is normal; trust your CV over the public board.
Probing for overfitting — if your CV improves but public LB stays flat, you are overfitting CV. If public LB improves but CV does not, you are overfitting the public split — the private board will punish you.
Ensembling / stacking — averaging diverse model predictions reduces variance. Rank-average for AUC tasks; logit-average for log-loss tasks; weight by CV score. Stacking trains a meta-learner on out-of-fold predictions, which is what prevents leakage; keep the meta-learner simple (linear or logistic) so it does not overfit the OOF predictions.
Pseudo-labelling — predict on unlabeled / test data, keep the most confident predictions as extra training labels, retrain. Cheap semi-supervision; risky if your base model is biased.
Calibration & threshold tuning — Platt scaling or isotonic regression on out-of-fold probabilities for log-loss tasks. For F1 / accuracy, sweep the decision threshold on OOF predictions and pick the one that maximises the metric.
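
The two averaging schemes named in the ensembling item can be sketched in NumPy as follows. The helper names and the clipping constant eps are illustrative assumptions, and the rank helper breaks ties by position, which is fine for continuous scores but not for heavily tied ones.

```python
import numpy as np


def rank_average(pred_list):
    """Average of per-model ranks scaled to [0, 1]; suited to AUC-style metrics."""
    ranks = [np.argsort(np.argsort(p)) / (len(p) - 1) for p in pred_list]
    return np.mean(ranks, axis=0)


def logit_average(pred_list, eps=1e-6):
    """Average in logit space, then map back to probabilities."""
    clipped = [np.clip(p, eps, 1 - eps) for p in pred_list]
    logits = [np.log(p / (1 - p)) for p in clipped]
    return 1 / (1 + np.exp(-np.mean(logits, axis=0)))


# Two hypothetical models with different score scales but the same ordering.
p_a = np.array([0.10, 0.40, 0.35, 0.80])
p_b = np.array([0.20, 0.90, 0.60, 0.95])

blend_rank = rank_average([p_a, p_b])
blend_logit = logit_average([p_a, p_b])
print("rank-average: ", blend_rank)
print("logit-average:", blend_logit)
```

Rank averaging discards the score scale entirely, which is harmless for AUC but destroys calibration, so it should not be used when the metric is log loss.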

3. Python reference implementation (scikit-learn)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    train_test_split, StratifiedKFold, cross_val_score, learning_curve,
)
from sklearn.metrics import (
    accuracy_score, precision_recall_fscore_support, f1_score,
    confusion_matrix, classification_report,
    roc_curve, roc_auc_score, precision_recall_curve, average_precision_score,
    log_loss,
)


# ---------- 1. data + stratified train / val / test split ----------
X, y = make_classification(
    n_samples=4000, n_features=20, n_informative=8,
    weights=[0.85, 0.15], random_state=0,            # imbalanced
)

X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.1765,        # ~15% of total
    stratify=y_trainval, random_state=0,
)
print("split sizes:", len(X_train), len(X_val), len(X_test))


# ---------- 2. manual stratified k-fold loop + sklearn parity ----------
def manual_cv(model_fn, X, y, k=5, seed=0):
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for fold, (tr, va) in enumerate(skf.split(X, y)):
        m = model_fn()
        m.fit(X[tr], y[tr])
        p = m.predict_proba(X[va])[:, 1]
        scores.append(roc_auc_score(y[va], p))
    return np.array(scores)

manual_scores = manual_cv(
    lambda: LogisticRegression(max_iter=1000), X_trainval, y_trainval,
)
sk_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_trainval, y_trainval,
    cv=StratifiedKFold(5, shuffle=True, random_state=0), scoring="roc_auc",
)
print(f"manual AUC: {manual_scores.mean():.4f} +/- {manual_scores.std():.4f}")
print(f"sklearn AUC: {sk_scores.mean():.4f} +/- {sk_scores.std():.4f}")


# ---------- 3. confusion matrix + classification_report ----------
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
y_val_hat = clf.predict(X_val)
y_val_prob = clf.predict_proba(X_val)[:, 1]

print("confusion matrix:\n", confusion_matrix(y_val, y_val_hat))
print(classification_report(y_val, y_val_hat, digits=3))
print(f"log loss: {log_loss(y_val, y_val_prob):.4f}")


# ---------- 4. ROC + PR curves ----------
fpr, tpr, _ = roc_curve(y_val, y_val_prob)
prec, rec, pr_thr = precision_recall_curve(y_val, y_val_prob)
auc, ap = roc_auc_score(y_val, y_val_prob), average_precision_score(y_val, y_val_prob)

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].plot(fpr, tpr, label=f"AUC = {auc:.3f}")
ax[0].plot([0, 1], [0, 1], "--", color="grey")
ax[0].set(xlabel="FPR", ylabel="TPR", title="ROC")
ax[0].legend()
ax[1].plot(rec, prec, label=f"AP = {ap:.3f}")
ax[1].set(xlabel="Recall", ylabel="Precision", title="PR")
ax[1].legend()
fig.tight_layout()


# ---------- 5. learning curve helper ----------
def plot_learning_curve(estimator, X, y, cv=5):
    sizes, tr, va = learning_curve(
        estimator, X, y, cv=cv, scoring="roc_auc",
        train_sizes=np.linspace(0.1, 1.0, 8),
        shuffle=True, random_state=0,   # random_state only takes effect with shuffle=True
    )
    plt.figure()
    plt.plot(sizes, tr.mean(axis=1), "o-", label="train")
    plt.plot(sizes, va.mean(axis=1), "o-", label="val")
    plt.xlabel("training samples"); plt.ylabel("AUC")
    plt.legend(); plt.title("learning curve")

plot_learning_curve(LogisticRegression(max_iter=1000), X_trainval, y_trainval)


# ---------- 6. threshold tuning to maximise F1 ----------
def best_f1_threshold(y_true, y_prob):
    prec, rec, thr = precision_recall_curve(y_true, y_prob)
    # precision_recall_curve returns one extra prec/rec point; align lengths
    f1 = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-12)
    j = int(np.nanargmax(f1))
    return float(thr[j]), float(f1[j])

t_star, f1_star = best_f1_threshold(y_val, y_val_prob)
y_test_prob = clf.predict_proba(X_test)[:, 1]
y_test_hat  = (y_test_prob >= t_star).astype(int)
print(f"tuned threshold: {t_star:.3f}  val F1: {f1_star:.3f}")
print(f"test F1 @ tuned threshold: {f1_score(y_test, y_test_hat):.3f}")
print(f"test F1 @ 0.5 default:     {f1_score(y_test, (y_test_prob >= 0.5).astype(int)):.3f}")

The threshold is tuned on the validation split, then frozen and applied to the test set — never tune a decision threshold on the same data you report on. For multi-fold setups, tune the threshold on the concatenated out-of-fold probabilities instead.
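
For the multi-fold case, one possible sketch uses cross_val_predict to collect out-of-fold probabilities and then sweeps a threshold grid over them. The dataset, the model, and the grid of 19 thresholds are assumptions for illustration, not part of the reference implementation above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=1000, weights=[0.85, 0.15], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each row's probability comes from the fold model that did not train on it.
oof_prob = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=cv, method="predict_proba",
)[:, 1]

thresholds = np.linspace(0.05, 0.95, 19)
f1s = [
    f1_score(y, (oof_prob >= t).astype(int), zero_division=0)
    for t in thresholds
]
t_star = float(thresholds[int(np.argmax(f1s))])
print(f"OOF-tuned threshold: {t_star:.2f}  OOF F1: {max(f1s):.3f}")
```

The tuned threshold is then frozen and applied to the test predictions of a model refit on all training data (or to the average of the fold models).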

4. Common USAAIO / IOAI applications

Classification rounds — usually scored by macro-F1 or plain accuracy. Macro-F1 is the standard for imbalanced multi-class problems.
Tabular rounds — frequently scored by ROC AUC (binary) or log loss for probability tasks. Rank-averaging ensembles dominate AUC leaderboards.
Counting / regression — IOAI chicken-counting and similar use MAE. RMSE shows up when outlier predictions must be punished.
Segmentation — pixel-wise Dice / IoU, often macro-averaged across classes. See U-Net.
NLP / generation — BLEU / ROUGE for MT and summarisation; exact-match / F1 for extractive QA.
Cross-validation choice — image and patient data almost always need group k-fold; time-series tabular needs rolling-origin. Default random k-fold is correct only for IID rows.

5. Drills

D1 · Derive F1 from precision and recall

You have precision P = 0.6 and recall R = 0.9. Compute F1 and verify it lies between min and max of P and R.

Solution

F1 = 2 * 0.6 * 0.9 / (0.6 + 0.9) = 1.08 / 1.5 = 0.72. Since F1 is the harmonic mean it is always between min(P, R) and the arithmetic mean (P + R) / 2 = 0.75; 0.6 ≤ 0.72 ≤ 0.75. The harmonic mean is biased toward the smaller of the two, which is why F1 punishes imbalance between precision and recall.

D2 · Why ROC AUC fails on extreme imbalance

Positives are 0.1% of the data. A model gets ROC AUC = 0.98 but PR AUC = 0.12. Why can both be true and which one should you trust?

Solution

FPR = FP / (FP + TN). When negatives dominate, the denominator is enormous and FPR stays tiny even for many false positives, inflating ROC AUC. PR AUC uses precision = TP / (TP + FP), whose denominator scales with FP, so it reflects the actual cost of false alarms. Trust PR AUC (or average precision) when the positive class is rare.
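
The effect is easy to reproduce on synthetic scores. In this sketch the class sizes (100 positives against 100,000 negatives) and the score distributions are assumptions chosen to mimic the 0.1% setting; the exact numbers will differ from the drill's 0.98 / 0.12.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_neg, n_pos = 100_000, 100                       # 0.1% positives

# Positives score 2.5 standard deviations higher on average.
scores = np.concatenate([rng.normal(0.0, 1.0, n_neg), rng.normal(2.5, 1.0, n_pos)])
labels = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

auc = roc_auc_score(labels, scores)
ap = average_precision_score(labels, scores)
print(f"ROC AUC = {auc:.3f}   average precision = {ap:.3f}")
```

The separation between classes is strong enough for an excellent ROC AUC, yet the upper tail of 100,000 negatives still outnumbers the positives at most thresholds, which is what average precision reports.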

D3 · Time-series CV vs random k-fold gotcha

You forecast daily sales. Random 5-fold CV gives RMSE 120; the contest test set (future months) gives RMSE 410. What happened and how do you fix the CV?

Solution

Random k-fold mixes future rows into the training folds, leaking trend and seasonality that the model could not have known at prediction time. Switch to rolling-origin CV (TimeSeriesSplit): each fold trains on [0, t] and validates on [t + 1, t + h]. The CV RMSE will jump and should land much closer to the test RMSE. That is the honest estimate you should have been optimising all along.

D4 · Expected public → private leaderboard gap

Your CV is 0.842, public LB is 0.861, private LB ends up 0.838. Which numbers were you overfitting and what is the lesson?

Solution

Public LB is well above CV, private LB drops back near CV. The model was overfitting the public split — either by repeated probing (each submission leaks a bit of public-set information) or by tuning a threshold against public LB feedback. CV was the honest estimate; the private board confirmed it. Lesson: trust your CV, submit sparingly, and freeze hyperparameters from CV not from public LB.

D5 · When to ensemble vs ship a single model

You have two models: model A with CV 0.81, model B with CV 0.80. Their predictions are correlated 0.96. Is averaging worth it?

Solution

Averaging k equally accurate predictors whose errors have pairwise correlation rho multiplies the error variance by roughly (1 + (k - 1) * rho) / k. With k = 2 and rho = 0.96 that factor is 0.98, a reduction of only ~2%, which is within the noise of CV. The ensemble is unlikely to beat A on the private board and adds inference cost. Ensembling pays off when predictors are diverse (as a rule of thumb, rho < 0.9) and individually strong; otherwise ship A.
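
The arithmetic can be checked both in closed form and by simulation. The sketch below assumes two predictors with unit-variance Gaussian errors and correlation 0.96, which is an idealisation of the drill rather than a model of real prediction errors.

```python
import numpy as np


def variance_factor(k, rho):
    """Variance of the mean of k unit-variance errors with pairwise correlation rho."""
    return (1 + (k - 1) * rho) / k


rng = np.random.default_rng(0)
rho = 0.96
cov = np.array([[1.0, rho], [rho, 1.0]])

# Simulated errors of two correlated models; averaging them is the ensemble.
errors = rng.multivariate_normal(np.zeros(2), cov, size=200_000)
empirical = float(errors.mean(axis=1).var())

print(f"closed form (k=2, rho=0.96): {variance_factor(2, rho):.3f}")
print(f"simulated:                   {empirical:.3f}")
print(f"closed form (k=2, rho=0.50): {variance_factor(2, 0.5):.3f}")
```

At rho = 0.5 the same two-model average would cut variance by a quarter, which is why diversity matters more than adding another near-copy of your best model.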

Next step

Now that you can score models honestly, loop back to Classical ML for the model families themselves, walk through the pitfalls checklist (leakage, target encoding, scaler-after-split, etc.), and run a full mock contest end-to-end with the right metric and the right CV scheme.