SOLUTION — read only after attempting. This page contains a full reference pipeline,
illustrative scores, and the rubric-by-rubric breakdown. If you have not yet sat the 90-minute mock,
close this page and start there.
Mock A · Reference solution
A complete end-to-end pipeline for the “Exam Outcome Prediction” mock, hitting every rubric
section. The illustrative numbers below come from a single seeded run on the generator described in the
problem statement; your own run will differ slightly. All leaderboard / CV scores in this page are
labeled [illustrative].
Full pipeline (~80 lines)
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import f1_score, classification_report
SEED = 42
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
# ---------- Preprocessing (rubric: 20 pts) ----------
# Missing-indicator + median impute for attendance; every listed numeric column is
# scaled, and any other column is dropped (ColumnTransformer defaults to remainder="drop").
def add_missing_flag(df):
    df = df.copy()
    df["attendance_missing"] = df["attendance"].isna().astype(int)
    return df

numeric_cols = ["study_hours", "prior_score", "sleep", "attendance",
                "attendance_missing"]

preproc = Pipeline([
    ("flag", FunctionTransformer(add_missing_flag)),
    ("ct", ColumnTransformer([
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
    ])),
])
y = train["label"]
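# Also drop "id" if train.csv carries one (assumption), so X and X_test share columns.
X = train.drop(columns=["label", "id"], errors="ignore")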
X_test = test.drop(columns=["id"])
# ---------- Baseline (rubric: 30 pts) ----------
baseline = Pipeline([
    ("pre", preproc),
    ("clf", LogisticRegression(max_iter=2000, random_state=SEED)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
base_scores = cross_val_score(baseline, X, y, cv=cv, scoring="f1_weighted")
print(f"Baseline CV weighted-F1: {base_scores.mean():.3f} ± {base_scores.std():.3f}")
# [illustrative] Baseline CV weighted-F1: 0.71 ± 0.04
# ---------- Feature engineering (rubric: 25 pts) ----------
def fe(df):
    df = df.copy()
    df["study_x_prior"] = df["study_hours"] * df["prior_score"] / 100.0
    # Bins are (0, 6], (6, 8], (8, 10]; values outside them become NaN, which HGB accepts.
    df["sleep_bucket"] = pd.cut(df["sleep"], bins=[0, 6, 8, 10],
                                labels=[0, 1, 2]).astype(float)
    return df
X_fe = fe(X)
X_test_fe = fe(X_test)
# ---------- Model choice (rubric: 15 pts) ----------
# HistGradientBoosting handles small tabular data well and supports native NaN.
final = HistGradientBoostingClassifier(
    max_depth=4, learning_rate=0.07, max_iter=300,
    l2_regularization=1.0, random_state=SEED,
)
fe_scores = cross_val_score(final, X_fe, y, cv=cv, scoring="f1_weighted")
print(f"FE + HGB CV weighted-F1: {fe_scores.mean():.3f} ± {fe_scores.std():.3f}")
# [illustrative] FE + HGB CV weighted-F1: 0.83 ± 0.03
final.fit(X_fe, y)
preds = final.predict(X_test_fe)
# ---------- Submission ----------
pd.DataFrame({"id": test["id"], "label": preds}).to_csv("predictions.csv", index=False)
# Sanity-check on training data (NOT a test-set score)
in_sample = f1_score(y, final.predict(X_fe), average="weighted")
print("In-sample weighted-F1 (overfit indicator):", round(in_sample, 3))
From a single seeded run: baseline weighted-F1 ~0.71, final ~0.83 [illustrative].
The held-out test score typically lands within ±0.03 of the CV mean [illustrative].
Rubric-by-rubric check
| Section | Points | Where the pipeline earns it |
|---|---|---|
| Preprocessing | 20 | SimpleImputer(strategy="median") + an attendance_missing indicator, StandardScaler, 5-fold stratified CV, all inside a Pipeline so no test-into-fit leakage. |
| Baseline | 30 | Logistic regression with pinned seed, CV weighted-F1 reported with mean ± std before any tuning. |
| Feature engineering | 25 | Two motivated derived features (study_x_prior interaction; sleep_bucket ordinal), each with a CV delta vs. baseline. |
| Model choice | 15 | HistGradientBoosting: justified by small N, mixed scales, native NaN handling, robustness to outliers. |
| Write-up | 10 | Final markdown cell (template below) covers pipeline, CV, limitation. |
Write-up template (paste at end of notebook)
## Approach
- Preprocessing: median-imputed `attendance`, kept a missing-indicator column,
standardized all numeric features inside an sklearn Pipeline (no test leakage).
- Baseline: logistic regression, 5-fold stratified CV weighted-F1 ~0.71 [illustrative].
- Features: added `study_x_prior` interaction and a `sleep_bucket` ordinal feature.
- Final model: HistGradientBoostingClassifier (depth 4, lr 0.07, 300 iters), CV
weighted-F1 ~0.83 [illustrative].
- Limitation: only 500 training rows, so CV variance (±0.03) is non-trivial;
more rows would let me tune `max_iter` via early stopping rather than fixing it.
What would push this further
- Stacked ensemble. Stack HGB under a logistic-regression meta-learner trained on CV out-of-fold predictions; typically +0.01–0.02 weighted-F1 [illustrative].
- Class-weighted loss. Pass `class_weight="balanced"` to logistic regression in the stack to lift the rare `high` class recall.
- Calibrated probabilities. Wrap the final classifier in `CalibratedClassifierCV` if the downstream task uses thresholds, not argmax.
- Permutation importance. Cheap, robust sanity check that the engineered features actually contribute (and aren't leaking from the imputer).
- Repeated CV. With only 500 rows, `RepeatedStratifiedKFold(n_splits=5, n_repeats=3)` averages over three different partitions, which reduces the split-to-split noise in the reported number.
Common mistakes
- Fitting the scaler on the full dataset before splitting. Leaks test-set statistics; always wrap scaling inside the CV Pipeline.
- Dropping rows with missing `attendance` instead of imputing. Loses ~8% of training signal and biases the model toward high-attendance students.
- Reporting accuracy instead of weighted-F1. The grader is explicit; accuracy looks higher because of the dominant `mid` class.
- One-shot 80/20 split. With 500 rows, a single 80/20 split has ±0.04 noise, enough to mis-rank models. Use 5-fold CV.
- Forgetting the `id` column in `predictions.csv`. Auto-zero from the grader.
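The accuracy-versus-weighted-F1 gap is easy to see on a toy case. The sketch below uses a made-up label mix (80% `mid`, which is an assumption and not the mock's real class balance) and a degenerate model that always predicts `mid`.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 8 mid, 1 low, 1 high; the "model" always says mid.
y_true = ["mid"] * 8 + ["low", "high"]
y_pred = ["mid"] * 10

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted classes as F1 = 0 without a warning.
wf1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy={acc:.3f}, weighted-F1={wf1:.3f}")
# accuracy=0.800, weighted-F1=0.711
```

Accuracy rewards the majority guess in full, while weighted-F1 gives `low` and `high` a score of zero and penalises the precision lost on `mid`, so the same predictions drop from 0.800 to 0.711.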
Compare your work
Tick each item. Anything unchecked is a rubric-point leak.
- [ ] You produced a baseline before any tuning and recorded its CV score.
- [ ] Your preprocessing lives inside a sklearn `Pipeline` (not done before the split).
- [ ] You handled `attendance` missingness explicitly (imputer + indicator, or model-native NaN).
- [ ] You added at least 2 engineered features and showed each one's CV delta.
- [ ] Your final model is justified in a one-line comment (why HGB / RF / stack).
- [ ] `predictions.csv` has exactly 150 rows + header `id,label`, labels in `{low, mid, high}`.
- [ ] You wrote 4–6 sentences of write-up at the end of the notebook.
- [ ] All `random_state` values are pinned; a fresh kernel reproduces your number.
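The `predictions.csv` item can be checked mechanically before submitting. The helper below is a minimal sketch using only the standard library; `check_predictions` and `write_csv` are hypothetical names, and the actual grader may apply checks beyond the three listed in the checklist.

```python
import csv
import os
import tempfile

ALLOWED_LABELS = {"low", "mid", "high"}


def check_predictions(path, expected_rows=150):
    """Return a list of problems with the file at `path`; empty list means it passes."""
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader, None)
        rows = list(reader)

    problems = []
    if header != ["id", "label"]:
        problems.append(f"header is {header!r}, expected ['id', 'label']")
    if len(rows) != expected_rows:
        problems.append(f"{len(rows)} data rows, expected {expected_rows}")
    malformed = [r for r in rows if len(r) != 2]
    if malformed:
        problems.append(f"{len(malformed)} rows do not have exactly 2 fields")
    bad_labels = sorted({r[1] for r in rows if len(r) == 2} - ALLOWED_LABELS)
    if bad_labels:
        problems.append(f"unexpected labels: {bad_labels}")
    return problems


def write_csv(path, rows):
    with open(path, "w", newline="") as fh:
        csv.writer(fh).writerows(rows)


# Demo on throwaway files (hypothetical contents, not real predictions).
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.csv")
    bad = os.path.join(tmp, "bad.csv")
    labels = ["low", "mid", "high"]
    write_csv(good, [["id", "label"]] + [[i, labels[i % 3]] for i in range(150)])
    write_csv(bad, [["id", "label"]] + [[i, "medium"] for i in range(149)])
    good_problems = check_predictions(good)
    bad_problems = check_predictions(bad)

print(good_problems)  # []
print(bad_problems)   # ["149 data rows, expected 150", "unexpected labels: ['medium']"]
```

In a notebook, `assert check_predictions("predictions.csv") == []` as the last cell makes a malformed file fail loudly before it reaches the grader.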