Mock A · Tabular ML — “Exam Outcome Prediction”
Format: Round-1 style · Time: 90 minutes · Compute: CPU only · Submission: predictions.csv · Metric: weighted F1 across 3 classes.
A self-contained tabular classification mock modeled on a USAAIO Round-1 problem. You receive a small synthetic training set and an unlabeled test set; your job is to ship a runnable notebook and a submission CSV inside the 90-minute window, scored by the rubric below.
- Time budget: 90 minutes, hard stop. Set a timer before you read the problem.
- Compute: CPU only. No GPU, no internet model downloads.
- Libraries: anything in the standard scientific-Python stack (numpy, pandas, scikit-learn, lightgbm/xgboost if installed locally). No pretrained foundation models.
- Submission: a single file `predictions.csv` with header `id,label`, one row per test row, label ∈ {low, mid, high}.
- Write-up: 4–6 sentence markdown cell at the end of the notebook describing your preprocessing, features, model, and CV score.
Problem statement
You are given anonymized records for 650 high-school students preparing for a standardized exam
(500 for training, 150 held out for testing). Each row contains four numerical features describing
the student's study habits and prior performance. Training rows also carry a 3-class outcome label
(low, mid, high) that encodes the final exam tier. The labels are imbalanced (roughly 30 / 45 / 25
percent), and one feature (attendance) contains ~8% missing values injected at random.
Train a classifier on the provided training set and predict the tier of every student in the held-out
test set. The grader will compute weighted F1 (sklearn:
f1_score(y_true, y_pred, average="weighted")) on the hidden test labels. Your raw weighted-F1
score does not directly become your grade — the rubric below scores you on the process
you demonstrate (preprocessing, baseline, feature engineering, model choice, write-up), not on a single
leaderboard number.
The dataset is fully synthetic and generated by the snippet in the next section; cite it as such if you publish your solution.
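To make the metric concrete, here is a minimal sketch showing that weighted F1 is the support-weighted mean of the per-class F1 scores. The six toy labels are invented for illustration and do not come from the dataset.

```python
from sklearn.metrics import f1_score

# Toy labels, invented for illustration only.
y_true = ["low", "mid", "mid", "high", "mid", "low"]
y_pred = ["low", "mid", "high", "high", "mid", "low"]

labels = ["low", "mid", "high"]
per_class = f1_score(y_true, y_pred, labels=labels, average=None)
weighted = f1_score(y_true, y_pred, labels=labels, average="weighted")

# Recompute by hand: weight each class F1 by its number of true samples.
support = [y_true.count(c) for c in labels]
manual = sum(f * s for f, s in zip(per_class, support)) / sum(support)

print(dict(zip(labels, per_class.round(3))))
print(round(weighted, 4), round(manual, 4))
```

Because the weights come from the true class counts, mistakes on the majority class (mid) move the score more than mistakes on the smaller classes.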
Data dictionary
Training set: train.csv — 500 rows, 5 columns (4 features + label).
Test set: test.csv — 150 rows, 5 columns (id + 4 features, no label).
| Column | Type | Range / values | Notes |
|---|---|---|---|
| id | int | 500–649 | Row identifier; test-only column for re-joining predictions. |
| study_hours | float | 0.0–12.0 | Daily self-reported study hours, last 30 days. |
| prior_score | float | 0–100 | Previous standardized-exam score, percentile-scaled to 0–100. |
| sleep | float | 3.0–10.0 | Average hours of sleep per night. |
| attendance | float (nullable) | 0.0–1.0 | Fraction of classes attended last semester. ~8% missing (MCAR). |
| label | category | low, mid, high | Train-only. Class balance ~30 / 45 / 25 %. |
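A quick sanity check against the data dictionary can be sketched as follows. The helper name `summarize` and the four-row demo frame are hypothetical; in the notebook you would pass the frame loaded from train.csv instead.

```python
import numpy as np
import pandas as pd


def summarize(train: pd.DataFrame) -> dict:
    """Hypothetical helper: attendance missing rate and class shares."""
    missing_rate = float(train["attendance"].isna().mean())
    class_share = train["label"].value_counts(normalize=True).to_dict()
    return {"missing_rate": missing_rate, "class_share": class_share}


# Tiny invented frame with the documented columns, only to show the call.
demo = pd.DataFrame({
    "study_hours": [2.0, 6.5, 9.0, 4.0],
    "prior_score": [35.0, 60.0, 88.0, 52.0],
    "sleep": [6.0, 7.5, 8.0, 7.0],
    "attendance": [0.6, np.nan, 0.9, 0.8],
    "label": ["low", "mid", "high", "mid"],
})
summary = summarize(demo)
print(summary)
```

On the real training set, compare the printed missing rate and class shares with the documented ~8% and ~30 / 45 / 25 % before choosing an imputation strategy or a CV scheme.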
Synthetic data generator (use to materialize the dataset locally)
```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2026)
N = 650  # 500 training rows + 150 test rows

study_hours = rng.uniform(0, 12, size=N)
prior_score = rng.uniform(0, 100, size=N)
sleep = rng.normal(7.0, 1.2, size=N).clip(3, 10)
attendance = rng.beta(6, 2, size=N)  # left-skewed (mean 0.75), in (0, 1)

# latent score drives the label
latent = (
    0.35 * (study_hours / 12)
    + 0.40 * (prior_score / 100)
    + 0.15 * (sleep / 10)
    + 0.10 * attendance
    + rng.normal(0, 0.08, size=N)
)
labels = np.where(latent < 0.45, "low",
                  np.where(latent < 0.70, "mid", "high"))

# inject MCAR missingness on attendance (after the label is computed)
mask = rng.random(N) < 0.08
attendance[mask] = np.nan

df = pd.DataFrame({
    "study_hours": study_hours,
    "prior_score": prior_score,
    "sleep": sleep,
    "attendance": attendance,
    "label": labels,
})

# first 500 rows are train; the remaining 150 get ids 500..649
train = df.iloc[:500].reset_index(drop=True)
test = df.iloc[500:].reset_index(drop=True)
test.insert(0, "id", np.arange(500, 650))

train.to_csv("train.csv", index=False)
test.drop(columns=["label"]).to_csv("test.csv", index=False)
test[["id", "label"]].to_csv("test_labels.csv", index=False)  # for your own scoring
```
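Since the generator also writes test_labels.csv for self-scoring, a local scorer could look like the sketch below. The function name `score_submission` is hypothetical, and the two small frames stand in for test_labels.csv and predictions.csv so the example runs without any files.

```python
import pandas as pd
from sklearn.metrics import f1_score


def score_submission(pred: pd.DataFrame, truth: pd.DataFrame) -> float:
    """Hypothetical scorer: join on id, then compute weighted F1."""
    merged = truth.merge(
        pred, on="id", how="left",
        suffixes=("_true", "_pred"), validate="one_to_one",
    )
    if merged["label_pred"].isna().any():
        raise ValueError("some test ids have no prediction")
    return f1_score(merged["label_true"], merged["label_pred"], average="weighted")


# Invented stand-ins for test_labels.csv and predictions.csv.
truth = pd.DataFrame({"id": [500, 501, 502, 503],
                      "label": ["mid", "high", "low", "mid"]})
pred = pd.DataFrame({"id": [503, 502, 501, 500],
                     "label": ["mid", "low", "high", "low"]})

score = score_submission(pred, truth)
print(round(score, 4))
```

Joining on `id` rather than relying on row position keeps the score correct even when the prediction rows are shuffled, as in the demo above.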
Submission format
A single CSV named predictions.csv with exactly 150 data rows and the header below.
id,label
500,mid
501,high
502,low
...
649,mid
- The `id` column must match the test set order; the grader joins on `id`.
- Labels must be one of `low`, `mid`, `high` (case-sensitive).
- No extra columns, no index column, UTF-8, Unix newlines.
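The format rules above can be checked before submitting with a small validator. This is only a sketch: `check_submission` is a hypothetical helper, the all-`mid` predictions are placeholders, and the encoding and newline rules are not checked here.

```python
import io
import pandas as pd

ALLOWED_LABELS = {"low", "mid", "high"}


def check_submission(sub: pd.DataFrame, test_ids: list) -> list:
    """Hypothetical validator: return a list of problems (empty means OK)."""
    if list(sub.columns) != ["id", "label"]:
        return [f"columns must be exactly id,label; got {list(sub.columns)}"]
    problems = []
    if len(sub) != len(test_ids):
        problems.append(f"expected {len(test_ids)} rows, got {len(sub)}")
    elif list(sub["id"]) != list(test_ids):
        problems.append("ids do not match the test set order")
    bad = set(sub["label"]) - ALLOWED_LABELS
    if bad:
        problems.append(f"unexpected labels: {sorted(map(str, bad))}")
    return problems


test_ids = list(range(500, 650))
# Placeholder predictions; replace with your model's output.
sub = pd.DataFrame({"id": test_ids, "label": ["mid"] * len(test_ids)})
problems = check_submission(sub, test_ids)

# index=False keeps the pandas index out of the file, as the rules require.
buf = io.StringIO()
sub.to_csv(buf, index=False)
header = buf.getvalue().splitlines()[0]
print(problems, header)
```

In the notebook, the same `to_csv(..., index=False)` call would target `predictions.csv` instead of the in-memory buffer.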
Scoring rubric (100 points)
| Section | Points | What earns credit |
|---|---|---|
| Preprocessing | 20 | Reasonable train/val split (or k-fold), handling of missing attendance (impute or model-native NaN handling), label encoding, no leakage from test into fit. |
| Baseline | 30 | A simple, runnable first model (logistic regression or single decision tree) with a reported CV weighted-F1 before any tuning. Numbers are reproducible (random_state pinned). |
| Feature engineering | 25 | At least 2 derived features motivated by EDA (e.g. study_hours × prior_score, sleep buckets, attendance imputed + missing-indicator). Improvement over baseline shown via CV. |
| Model choice | 15 | Final model is appropriate for small tabular data with mixed-scale features (gradient boosting, random forest, or stacked ensemble) with a brief justification. |
| Write-up | 10 | 4–6 sentence summary at the end of the notebook describing the pipeline, CV score, and one limitation. |
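The Preprocessing and Baseline rows can be illustrated with one leak-free pipeline. This is a sketch of one reasonable approach, not the reference solution: `make_train` is a hypothetical stand-in for train.csv that follows the generator's recipe (it does not reproduce the same rows), and the median imputer, fold count, and seeds are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["study_hours", "prior_score", "sleep", "attendance"]


def make_train(n=500, seed=0):
    """Stand-in for train.csv using the generator's recipe (different rows)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "study_hours": rng.uniform(0, 12, size=n),
        "prior_score": rng.uniform(0, 100, size=n),
        "sleep": rng.normal(7.0, 1.2, size=n).clip(3, 10),
        "attendance": rng.beta(6, 2, size=n),
    })
    latent = (0.35 * df["study_hours"] / 12 + 0.40 * df["prior_score"] / 100
              + 0.15 * df["sleep"] / 10 + 0.10 * df["attendance"]
              + rng.normal(0, 0.08, size=n))
    df["label"] = np.where(latent < 0.45, "low",
                           np.where(latent < 0.70, "mid", "high"))
    df.loc[rng.random(n) < 0.08, "attendance"] = np.nan  # MCAR missingness
    return df


train = make_train()  # in the notebook: pd.read_csv("train.csv")
X, y = train[FEATURES], train["label"]

# Imputer and scaler live inside the pipeline, so each CV fold fits them
# on its own training part only (no leakage from validation rows).
baseline = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(baseline, X, y, cv=cv, scoring="f1_weighted")
print(f"baseline CV weighted F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```

scikit-learn classifiers accept string labels directly, so an explicit label encoder is only needed for libraries that require integer targets. Stratified folds keep the class imbalance similar across splits.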
Note: the rubric scores process. A high leaderboard score with no baseline, no CV, and no write-up caps at ~55. A clean baseline + tidy CV with a mid-tier leaderboard score routinely lands ~85.