MLOps & submission packaging
The boring engineering moat between a notebook that works on your laptop and a tarball that scores points on the grader. Seeds, checkpoints, environments, inference scripts, CSV gotchas, and a 10-item pre-submit checklist.
In short: (1) seed every RNG (torch, numpy,
random, CUDA, cudnn.deterministic); (2) pin the environment with
pip freeze or uv pip compile; (3) checkpoint model + optimiser +
scheduler + RNG state so resume is bit-exact; (4) write a dumb, side-effect-free inference
script that emits a perfectly formatted submission CSV; (5) run a 10-min sanity checklist
before every upload. Skip any one and you risk losing a placement.
1. The intuition
There is code that works on your laptop and there is code that scores points on a grader. Both compute predictions; only one of them gets credit. The gap is almost never the model — it is the dozen small environmental assumptions your notebook silently inherited from your shell.
Your Colab session has a particular CUDA driver, a particular torch wheel,
a Python that imports pandas 2.x, an OMP_NUM_THREADS set by the
runtime, and an RNG that was seeded by the first cell you ran two hours ago. The grader
has none of that. It runs your script in a fresh container, on a different GPU, with the
exact requirements file you handed it, expects a CSV with the exact column order it asked
for, and compares bytes — sometimes literally, after sorting by id.
MLOps for contests is the art of removing those hidden inputs. Every randomness gets a seed. Every dependency gets a pinned version. Every output gets a deterministic format. Every artifact gets tagged with a git SHA so "the model that scored 0.87" can be rebuilt on demand. None of it is glamorous. All of it converts variance into points.
2. Reproducibility — seed every RNG
Calling torch.manual_seed(42) is not enough. A modern training loop touches
at least five independent sources of randomness:
- random.seed(s): Python's stdlib, used by libraries like albumentations and by any random.shuffle in your data pipeline.
- numpy.random.seed(s): used by classical ML, by older PyTorch dataloaders, and by every np.random.* call in your augmentation code.
- torch.manual_seed(s): CPU tensor ops, weight init, dropout masks. Recent PyTorch releases also seed every CUDA generator from this call.
- torch.cuda.manual_seed_all(s): every visible GPU's RNG. Redundant after torch.manual_seed on recent releases, but cheap, explicit, and silently ignored when no GPU is present.
- torch.backends.cudnn.deterministic = True plus torch.backends.cudnn.benchmark = False: forces cuDNN to pick algorithms that produce bit-exact output across runs (at the cost of ~10-30% throughput).
For full determinism you additionally need
torch.use_deterministic_algorithms(True) and the env var
CUBLAS_WORKSPACE_CONFIG=:4096:8 (some cuBLAS matmul kernels require it). DataLoader
workers need their own per-worker seeding via worker_init_fn, because each
worker is a separate process with its own copy of the NumPy and random state. Pass a seeded
torch.Generator to the loader as well so the shuffle order is reproducible.
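A minimal sketch of per-worker seeding, following the pattern in the PyTorch reproducibility notes. The toy TensorDataset, the batch size, and the helper names make_loader and batch_order are illustrative assumptions, not part of any grader's API. num_workers=0 keeps the sketch runnable anywhere; seed_worker only fires once you raise it.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id: int) -> None:
    """Derive NumPy and stdlib seeds from the seed torch assigned to this worker."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def make_loader(seed: int) -> DataLoader:
    """Build a shuffling loader whose order depends only on `seed`."""
    dataset = TensorDataset(torch.arange(32).float().unsqueeze(1))  # toy data
    g = torch.Generator()
    g.manual_seed(seed)  # drives the shuffle order and the per-worker base seed
    return DataLoader(
        dataset,
        batch_size=8,
        shuffle=True,
        num_workers=0,              # raise in real training; seed_worker then runs per worker
        worker_init_fn=seed_worker,
        generator=g,
    )


def batch_order(loader: DataLoader) -> list:
    """Return the sample values batch by batch, to compare two runs."""
    return [batch[0].squeeze(1).tolist() for batch in loader]
```

Two loaders built with the same seed should yield the same batches in the same order; if they do not, something upstream is consuming randomness you have not pinned.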
Determinism costs throughput. In practice: train with benchmark = True for
speed, then re-run the final epoch with deterministic = True to lock the
checkpoint you ship. Or just accept the tax for contest runs — the placement is worth
the wall clock.
3. Environment pinning
"Works on my machine" loses points. The grader needs to install the same versions you
trained on, or your saved weights can silently load under different op semantics
(F.scaled_dot_product_attention, for example, has gained arguments and backends across
releases). The three common artefacts:
- requirements.txt: flat list of name==version pins. Generate with pip freeze > requirements.txt, then hand-edit out junk (system packages, pkg-resources==0.0.0 on Ubuntu).
- environment.yml: for conda graders. Specifies channel, Python version, and pinned conda + pip deps. Heavier but captures non-Python deps (cudatoolkit).
- uv pip compile pyproject.toml -o requirements.txt: modern, fast resolver from Astral. Produces a fully pinned lockfile from loose constraints. Strongly preferred if the grader accepts it.
Always pin Python itself. python --version goes into the
README. A pin like torch==2.3.1+cu121 resolves to a different wheel on each Python version because
wheels are ABI-tagged (cp310 vs cp311). If you trained on Colab's
Python 3.10 and ship a cp310 wheelhouse to a grader that runs 3.11, the install fails and your submission is
a zero.
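One way to catch version drift before the grader does is to compare the pins against the interpreter you are actually running. This is a rough sketch using only the standard library; parse_pins and check_environment are hypothetical helper names, and the parser assumes plain name==version lines (no extras, markers, or URLs).

```python
import sys
from importlib import metadata


def parse_pins(requirements_text: str) -> dict:
    """Return {name: version} for every `name==version` line; skip everything else."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def check_environment(pins: dict, python_prefix: str) -> list:
    """Compare pins with the running interpreter; return human-readable mismatches."""
    problems = []
    running = "{}.{}".format(*sys.version_info[:2])
    if running != python_prefix:
        problems.append(f"python {running} != pinned {python_prefix}")
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} not installed (pinned {wanted})")
            continue
        if found != wanted:
            problems.append(f"{name} {found} != pinned {wanted}")
    return problems
```

Calling check_environment(parse_pins(open("requirements.txt").read()), "3.10") at the top of a training script turns a silent mismatch into a visible list of problems.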
4. Checkpointing
A checkpoint is not "the weights". A checkpoint is enough state to resume training such that the next epoch is bit-identical to what would have happened without the interruption. Minimum payload:
- model.state_dict(): weights and buffers (BN running stats).
- optimizer.state_dict(): Adam moments, SGD momentum buffers. Without these, restart erases your adaptive learning rates.
- scheduler.state_dict(): current LR, step count, warmup phase.
- epoch and global_step: where to resume the loop and the LR schedule.
- best_metric: so the resumed run does not overwrite a better checkpoint with a worse one on epoch 0 of the resume.
- RNG state: torch.get_rng_state(), torch.cuda.get_rng_state_all(), numpy state, random state. Without these, the data order on resume differs from a clean run.
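Save with torch.save(payload, path), load with torch.load(path,
map_location="cpu") then push state dicts back into the live objects. On PyTorch 2.6 and
later, torch.load defaults to weights_only=True and rejects the NumPy RNG state in this
payload; pass weights_only=False for checkpoints you wrote yourself. Always save
to a temporary filename and os.replace to the final name; otherwise
a crash mid-save corrupts your only good checkpoint. Keep at least two rotating slots
(last.pt, best.pt).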
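The write-then-rename pattern and the two rotating slots can be sketched without any framework. Here pickle stands in for torch.save so the example runs anywhere; atomic_save and save_slots are hypothetical names, and "higher metric is better" is an assumption you may need to flip.

```python
import os
import pickle
from pathlib import Path


def atomic_save(payload: dict, path) -> None:
    """Write to <name>.tmp, flush to disk, then rename over the final name."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "wb") as f:
        pickle.dump(payload, f)   # stand-in for torch.save(payload, f)
        f.flush()
        os.fsync(f.fileno())      # make sure the bytes hit the disk before the rename
    os.replace(tmp, path)         # the old file stays intact until this single step


def save_slots(payload: dict, run_dir, metric: float, best_so_far: float) -> float:
    """Always refresh last.pt; refresh best.pt only on improvement. Returns the new best."""
    run_dir = Path(run_dir)
    atomic_save(payload, run_dir / "last.pt")
    if metric > best_so_far:      # assumption: higher is better
        atomic_save(payload, run_dir / "best.pt")
        return metric
    return best_so_far
```

Start with best_so_far = float("-inf") on a fresh run, or with the best_metric restored from the checkpoint on a resume, so a resumed run cannot clobber a better best.pt.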
5. Inference script structure
The grader runs one script: predict.py. It must load weights, iterate a
test set, and write a CSV. Nothing else. Rules:
- No print spam. Graders sometimes capture stdout as part of the submission; an extra log line corrupts the diff. Use logging at WARNING+ to a file if you must.
- Deterministic batching: shuffle=False on the test loader, no augmentation, num_workers=0 if the grader is single-threaded.
- Switch the model to evaluation mode (model.eval(), equivalent to model.train(False)) before the first forward pass. Forgetting this leaves Dropout active and makes BatchNorm use batch statistics on the test set, so predictions change with batch size and composition.
- Wrap inference in a with torch.inference_mode(): block. It disables autograd and skips version-counter bookkeeping, so it is usually a little faster than no_grad.
- Stable column ordering. The submission spec gives a literal column list; emit exactly those columns in exactly that order: df[["id", "prediction"]], not df.columns.tolist(), which only echoes whatever order the columns were inserted in.
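The rules above can be sketched as a skeleton for predict.py. Everything model-related here is a placeholder: predict_one, the toy test rows, the COLUMNS list, and the --out / --log flags are assumptions for illustration, to be replaced by the real forward pass and the column list from your task spec.

```python
import argparse
import csv
import logging
from pathlib import Path

COLUMNS = ["id", "prediction"]  # copy the literal column list from the task spec


def predict_one(features: list) -> float:
    """Placeholder model: replace with the real forward pass."""
    return sum(features) / len(features)


def run(test_rows: list, out_path: Path, log_path: Path) -> None:
    """Predict every row, sort by id, and write the CSV. Nothing goes to stdout."""
    logging.basicConfig(filename=str(log_path), level=logging.WARNING)
    rows = [{"id": r["id"], "prediction": predict_one(r["x"])} for r in test_rows]
    rows.sort(key=lambda r: r["id"])
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS, lineterminator="\n")
        writer.writeheader()
        for r in rows:
            writer.writerow({"id": r["id"], "prediction": f"{r['prediction']:.6f}"})


def main(argv=None) -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--out", default="submission.csv")
    parser.add_argument("--log", default="predict.log")
    args = parser.parse_args(argv)
    test_rows = [{"id": 2, "x": [0.0, 1.0]}, {"id": 1, "x": [0.2, 0.4]}]  # toy input
    run(test_rows, Path(args.out), Path(args.log))


if __name__ == "__main__":
    main()
```

The script has one job and one output file; warnings go to the log file, and the column order comes from a constant rather than from whatever the DataFrame or dict happens to hold.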
6. Submission CSV format gotchas
A surprising share of lost contest points comes from CSV format mistakes, not from model quality. The pathologies:
- Trailing newline. Some graders require exactly one trailing \n; pandas writes one by default, but if you concatenate strings yourself you might forget. Some reject a trailing newline. Read the spec.
- Header row. Did the spec say "with header" or "no header"? The default df.to_csv(path) writes one; pass header=False if not. Also pass index=False unless the index is the id column.
- Sort order. Many graders compare row-by-row after sorting by the id column. Sort explicitly with df.sort_values("id").reset_index(drop=True) before writing; never rely on whatever order the dataloader produced.
- Trailing comma. A row written as "42,0.7," has an empty extra column. to_csv never does this; only buggy hand-rolled writers do. Don't hand-roll CSV.
- Encoding. UTF-8 vs UTF-8-BOM. Excel saves with a BOM (\xef\xbb\xbf at the start of the file); Python open(..., encoding="utf-8") reads it as a literal first character on the header. Always write with encoding="utf-8" (no BOM); never round-trip through Excel.
- Float formatting. 0.30000000000000004 vs 0.3 can fail an exact-match grader. Pass float_format="%.6f" to to_csv.
- Line endings. Windows CRLF (\r\n) vs Unix LF (\n). Write with lineterminator="\n" (pandas 1.5+; the argument was line_terminator before that) to force LF.
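Most of these pathologies are visible in the raw bytes, so they can be checked mechanically. The sketch below is one possible checker, not a grader's actual logic: csv_format_problems is a hypothetical name, and it assumes the spec wants a header, exactly one trailing newline, and fields that never contain quoted commas.

```python
def csv_format_problems(data: bytes, expected_header: str) -> list:
    """Inspect raw file bytes and return a list of format problems (empty if clean)."""
    problems = []
    if data.startswith(b"\xef\xbb\xbf"):
        problems.append("UTF-8 BOM at start of file")
        data = data[3:]
    if b"\r" in data:
        problems.append("CR characters found (CRLF line endings?)")
    if not data.endswith(b"\n"):
        problems.append("no trailing newline")
    elif data.endswith(b"\n\n"):
        problems.append("more than one trailing newline")

    lines = data.decode("utf-8").split("\n")
    if lines and lines[-1] == "":
        lines = lines[:-1]  # drop the empty piece after the final newline
    if not lines or lines[0].rstrip("\r") != expected_header:
        problems.append(f"header is not exactly {expected_header!r}")

    width = expected_header.count(",") + 1
    for number, line in enumerate(lines[1:], start=2):
        line = line.rstrip("\r")
        if line.endswith(","):
            problems.append(f"line {number}: trailing comma")
        elif line.count(",") + 1 != width:
            problems.append(f"line {number}: expected {width} fields")
    return problems
```

Run it on open("submission.csv", "rb").read(); reading in binary mode matters, because text mode would hide both the BOM and the CRLF.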
7. Dockerfile basics
A five-line Dockerfile is more reproducible than a five-page README. If the grader accepts containers, ship one:
# Dockerfile
FROM pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "predict.py", "--weights", "weights/best.pt", "--out", "submission.csv"]
Use the official pytorch/pytorch tag rather than nvidia/cuda +
a manual pip install torch — the official image already has matched CUDA
and cuDNN. Avoid :latest; pin the full tag (2.3.1-cuda12.1-cudnn8-runtime).
runtime images are ~3 GB smaller than devel because they omit
the CUDA compiler — fine for inference.
8. Logging during training
Three options, in increasing weight:
- A tiny CSV log. Open a file, append one row per epoch with epoch, train_loss, val_loss, val_metric, lr, wall_time. Zero dependencies, survives crashes, plots cleanly in pandas.
- TensorBoard. SummaryWriter writes events.out.tfevents.* files. Native to PyTorch (torch.utils.tensorboard), no account needed, scalars and images both supported.
- Weights & Biases. wandb.init(); wandb.log({...}). Best for team contests and long sweeps; needs an API key and an internet connection during training, which is a problem on Round 2 firewalled boxes.
Whichever you pick, also save the training curves as PNG at the end of the run
(plt.savefig("curves.png"), with matplotlib.pyplot imported as plt) and bundle them with the checkpoint.
When you compare runs two weeks later, you will not want to spin up tensorboard.
9. Versioning model artifacts
"Which checkpoint scored 0.87 on the public leaderboard?" is the wrong question to be answering at 11pm the night before close. Tag every artifact:
- Git SHA. Capture git rev-parse HEAD at training start and store it inside the checkpoint payload as meta["git_sha"].
- Config hash. Hash the YAML / JSON config that produced the run (hashlib.sha256(json.dumps(cfg, sort_keys=True).encode()).hexdigest()[:8]). Use it as the run directory name: runs/2026-05-18_abc12345/.
- Weight hash. Hash the saved .pt file (hashlib.sha256(open(path,"rb").read()).hexdigest()[:8]) and put it in the submission filename: sub_abc12345_w7f3c2a1.csv.
- Training metadata. Inside the checkpoint, save Python version, torch version, CUDA version, command-line args, and start/end timestamps. Future-you will thank present-you.
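A small sketch of how the tags above might be collected into the meta dict at training start. git_sha, config_hash, and build_meta are hypothetical helper names; the fallback to "unknown" outside a git checkout is an assumption about what you would want, and torch / CUDA versions are left out so the sketch stays standard-library only.

```python
import hashlib
import json
import platform
import subprocess
import time


def git_sha() -> str:
    """Current commit hash, or "unknown" if git or a repository is not available."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def config_hash(cfg: dict) -> str:
    """Short hash of the config; sort_keys makes it independent of key order."""
    blob = json.dumps(cfg, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:8]


def build_meta(cfg: dict, argv: list) -> dict:
    """Everything needed to answer 'which run produced this file?' later."""
    return {
        "git_sha": git_sha(),
        "config_hash": config_hash(cfg),
        "python": platform.python_version(),
        "argv": list(argv),
        "started_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
```

Pass the result as the meta argument when saving a checkpoint, and reuse config_hash(cfg) in the run directory name so the directory and the checkpoint agree.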
10. Pre-submission sanity checks (10-min checklist)
- Does python predict.py run on an empty validation file without crashing (writes a header-only CSV)?
- Do the output column names match the spec character-for-character (case, spaces, underscores)?
- Is the row count exactly equal to the number of test ids?
- Are all ids in the submission present in the test set (no duplicates, no missing)?
- Is the prediction range sane? Probabilities in [0, 1]; regression values in the same order of magnitude as the training labels.
- Any NaN or inf in the output? df.isna().any().any() must be False. isna() does not flag inf, so also check np.isfinite on the prediction column.
- Is the file encoded UTF-8 without BOM? file submission.csv should not say "UTF-8 Unicode (with BOM)".
- Line endings LF only? cat -A submission.csv | head shows $ not ^M$.
- File size plausible? A 4 KB submission for a 100 K-row test set means a silent truncation.
- Hash the file, write it down, then re-run predict.py from scratch in a fresh Python and confirm the hash matches. If it does not, you have a determinism bug; fix before submitting.
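The content-level items on the checklist (columns, row count, ids, NaN/inf, range) are easy to automate. The function below is a sketch under stated assumptions: check_submission is a hypothetical name, the first column is the id, the second is a probability expected in [lo, hi], and ids are compared as strings.

```python
import csv
import io
import math


def check_submission(csv_text: str, test_ids: list, columns=("id", "prediction"),
                     lo: float = 0.0, hi: float = 1.0) -> list:
    """Return a list of content problems in a submission (empty list if clean)."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != list(columns):
        return [f"columns {header} != {list(columns)}"]

    rows = list(reader)
    ids = [r[0] for r in rows if r]
    expected = [str(i) for i in test_ids]
    if len(rows) != len(expected):
        problems.append(f"{len(rows)} rows, expected {len(expected)}")
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")
    if set(ids) != set(expected):
        problems.append("ids do not match the test set")

    for r in rows:
        try:
            value = float(r[1])
        except (ValueError, IndexError):
            problems.append(f"row {r}: prediction is not a number")
            continue
        if not math.isfinite(value):
            problems.append(f"id {r[0]}: non-finite prediction")
        elif not lo <= value <= hi:
            problems.append(f"id {r[0]}: {value} outside [{lo}, {hi}]")
    return problems
```

For regression tasks, widen lo and hi to the range of the training labels instead of [0, 1].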
11. Python reference implementation
import csv
import hashlib
import os
import random
import time
from pathlib import Path
import numpy as np
import pandas as pd
import torch

def set_seed(seed: int = 42) -> None:
    """Seed every RNG that can affect a PyTorch training run."""
    # Set before any CUDA work so cuBLAS sees it.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    # Affects child processes only; export it in the shell to cover this interpreter.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Stronger but slower: prefer deterministic kernels everywhere, warn where none exists.
    torch.use_deterministic_algorithms(True, warn_only=True)

def save_checkpoint(path, model, optimizer, scheduler, epoch, best_metric, meta=None):
    """Atomic checkpoint write: includes everything needed for bit-exact resume."""
    payload = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict() if scheduler is not None else None,
        "epoch": epoch,
        "best_metric": best_metric,
        "rng": {
            "torch": torch.get_rng_state(),
            "torch_cuda": torch.cuda.get_rng_state_all(),
            "numpy": np.random.get_state(),
            "python": random.getstate(),
        },
        "meta": meta or {},
    }
    tmp = Path(path).with_suffix(".pt.tmp")
    torch.save(payload, tmp)
    os.replace(tmp, path)  # atomic on POSIX; crash-safe.

def load_checkpoint(path, model, optimizer=None, scheduler=None, map_location="cpu"):
    """Restore model + optimizer + scheduler + RNG state. Returns (epoch, best_metric)."""
    # weights_only=False: the payload holds NumPy/Python RNG state, not just tensors.
    # Only load checkpoints you created yourself.
    ckpt = torch.load(path, map_location=map_location, weights_only=False)
    model.load_state_dict(ckpt["model"])
    if optimizer is not None and ckpt.get("optimizer") is not None:
        optimizer.load_state_dict(ckpt["optimizer"])
    if scheduler is not None and ckpt.get("scheduler") is not None:
        scheduler.load_state_dict(ckpt["scheduler"])
    rng = ckpt.get("rng", {})
    # RNG states must be CPU byte tensors, whatever map_location was used.
    if "torch" in rng:
        torch.set_rng_state(rng["torch"].cpu())
    if "torch_cuda" in rng:
        torch.cuda.set_rng_state_all([s.cpu() for s in rng["torch_cuda"]])
    if "numpy" in rng:
        np.random.set_state(rng["numpy"])
    if "python" in rng:
        random.setstate(rng["python"])
    return ckpt.get("epoch", 0), ckpt.get("best_metric", float("-inf"))

@torch.inference_mode()
def write_submission(model, dataloader, output_csv_path, id_col="id", pred_col="prediction"):
    """Run inference and write a properly formatted submission CSV.

    Rules: model.train(False), no shuffle, stable column order, sort by id, UTF-8 no BOM,
    LF line endings, six-decimal float format, no extra prints.
    """
    model.train(False)  # same effect as model.eval(): Dropout off, BatchNorm uses running stats
    device = next(model.parameters()).device
    ids, preds = [], []
    for batch in dataloader:
        x, batch_ids = batch["x"].to(device), batch["id"]
        y = model(x).float().cpu().numpy().ravel()
        preds.extend(y.tolist())
        ids.extend(batch_ids if isinstance(batch_ids, list) else batch_ids.tolist())
    df = pd.DataFrame({id_col: ids, pred_col: preds})
    df = df.sort_values(id_col).reset_index(drop=True)
    assert not df.isna().any().any(), "NaN in submission"
    assert np.isfinite(df[pred_col]).all(), "inf in submission"
    df.to_csv(
        output_csv_path,
        index=False,
        encoding="utf-8",      # no BOM
        lineterminator="\n",   # LF
        float_format="%.6f",
        quoting=csv.QUOTE_MINIMAL,
    )

class CSVLogger:
    """Tiny append-only CSV training log. Zero dependencies, crash-safe."""

    def __init__(self, path, fieldnames):
        self.path = Path(path)
        self.path.parent.mkdir(parents=True, exist_ok=True)  # e.g. create runs/ on first use
        self.fieldnames = list(fieldnames)
        write_header = not self.path.exists()
        self._f = open(self.path, "a", newline="", encoding="utf-8")
        self._w = csv.DictWriter(self._f, fieldnames=self.fieldnames)
        if write_header:
            self._w.writeheader()
            self._f.flush()

    def log(self, **row):
        self._w.writerow({k: row.get(k, "") for k in self.fieldnames})
        self._f.flush()  # survive a KeyboardInterrupt mid-epoch

    def close(self):
        self._f.close()

def sha8(path):
    """First 8 hex chars of a file's sha256: short, unique enough for filenames."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:8]

if __name__ == "__main__":
    set_seed(42)
    logger = CSVLogger("runs/log.csv",
                       ["epoch", "train_loss", "val_loss", "val_metric", "lr", "wall"])
    logger.log(epoch=0, train_loss=0.51, val_loss=0.49, val_metric=0.83, lr=1e-3, wall=time.time())
    logger.close()
12. Common USAAIO / IOAI applications
- Round 2 needs deterministic results. Two graders compare your predict.py output across re-runs. If your inference is non-deterministic, even a strong model can disagree with itself and lose tiebreaker points.
- Some IOAI tasks compare submission CSVs byte-by-byte after sorting. A stray BOM, a Windows CRLF, or an unsorted id column turns a perfect prediction into a zero. The format is part of the answer.
- Competitive Kaggle teams version every run. "Sub 47, weights w7f3c2a1, config abc12345, public 0.872" is a sentence you want to be able to write four weeks into a competition without grepping six notebooks.
- Round 2 firewalls block pip install and W&B. Ship requirements that resolve from a local wheelhouse, and use file-based logging instead of cloud loggers.
- Compute time caps. IOAI tasks often cap inference at 10-30 minutes. A non-deterministic dataloader that re-augments on the test set silently doubles runtime; deterministic inference settings are also speed insurance.
13. Drills
D1 · Why setting torch.manual_seed alone isn't enough
You set torch.manual_seed(42) at the top of your script and your two
runs still produce different validation curves. List four other sources of randomness
you forgot.
Solution
(1) numpy.random: your augmentation pipeline probably uses
np.random.*. (2) Python's random: used by
random.shuffle and many third-party libs. (3) Non-deterministic CUDA kernels:
with deterministic = False some kernels use atomic floating-point reductions
whose result changes run to run even under a fixed seed. (4) cuDNN's algorithm selection:
torch.backends.cudnn.benchmark = True picks kernels by timing them, so the
choice can differ per run. Also: DataLoader workers (need
worker_init_fn) and PYTHONHASHSEED for set-order and hash-order
sensitivity. torch.cuda.manual_seed_all is not on this list: recent
torch.manual_seed already seeds every GPU generator, though calling it anyway is harmless.
D2 · cudnn.deterministic = True vs False
What changes operationally when you flip torch.backends.cudnn.deterministic
from False (default) to True?
Solution
cuDNN ships multiple kernels per op (different tiling, reduction order, atomic
adds). With deterministic = False, cuDNN picks the fastest available
kernel for the current shape, some of which use non-deterministic atomic floating-point
reductions — the result varies bit-for-bit across runs. With deterministic =
True, cuDNN is restricted to kernels whose output is reproducible across runs;
these are typically 10-30% slower. You should also set benchmark = False;
benchmark mode re-tunes kernel choice per shape on the first call, which adds
cold-start variance.
D3 · Checkpoint scheme that survives KeyboardInterrupt
Design a checkpoint protocol such that hitting Ctrl-C at any point during training leaves you with a usable checkpoint and no corrupted files.
Solution
Save to a temporary path (last.pt.tmp) and os.replace
to the final name (last.pt) — atomic on POSIX, so an interrupt either
leaves the old last.pt untouched or fully replaces it. Maintain two
slots: last.pt (every epoch) and best.pt (only on metric
improvement) — losing one to corruption still leaves a fallback. Wrap the training
loop in try/except KeyboardInterrupt and run one final
save_checkpoint before exit. Flush the CSV log on every row so the
training history survives. Never write directly to the final path.
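The interrupt handling can be sketched as a thin wrapper around the epoch loop. The names train, run_epoch, and save are hypothetical: run_epoch stands for one epoch of real training and save for an atomic checkpoint write such as save_checkpoint.

```python
def train(num_epochs: int, run_epoch, save):
    """Run epochs, checkpointing after each one.

    On Ctrl-C, write one final checkpoint and stop cleanly.
    Returns (epochs_completed, interrupted).
    """
    completed = 0
    try:
        for epoch in range(num_epochs):
            run_epoch(epoch)
            completed = epoch + 1
            save(completed)      # normal end-of-epoch checkpoint
    except KeyboardInterrupt:
        save(completed)          # final save, labelled with the last finished epoch
        return completed, True
    return completed, False
```

Because save goes through a temporary file and os.replace, an interrupt that lands inside a save still leaves the previous checkpoint intact, and the final save in the except branch then overwrites it cleanly.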
D4 · Validation accuracy differs between Colab and Kaggle
Identical code, identical seed, identical weights. Colab reports val accuracy 0.873; Kaggle reports 0.871. What are the candidate causes?
Solution
(1) Different PyTorch / CUDA / cuDNN versions: kernel implementations differ at
the last few ULP. (2) Different GPU (T4 vs P100 vs A100): different SM counts cause
different reduction trees, and on Ampere TF32 is on by default for cuDNN convolutions
(torch.backends.cudnn.allow_tf32) and, before PyTorch 1.12, for matmuls
(torch.backends.cuda.matmul.allow_tf32). (3) A
num_workers value derived from the machine (CPU count): dataloader sharding changes batch composition.
(4) Different Python version or hash seed: code that relies on set or hash ordering can enumerate classes or files differently.
(5) BatchNorm in training mode during eval (forgot model.train(False))
makes accuracy depend on batch size, which may differ between platforms. Fix: pin
versions, set both allow_tf32 flags to False, force eval mode, fix batch size,
compute the metric on CPU.
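Why a difference in the last few ULP can move an accuracy number: floating-point addition is not associative, so a different reduction order can push a score across a decision threshold. A tiny CPU-only illustration in plain Python; the three values and the 0.6 threshold are arbitrary choices for the demo.

```python
import math

values = [0.1, 0.2, 0.3]

# Same numbers, two summation orders (think: two different GPU reduction trees).
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right == right_to_left)                 # False
print(left_to_right, right_to_left)                   # 0.6000000000000001 0.6
print(math.isclose(left_to_right, right_to_left))     # True: they differ by about one ULP

# A hard threshold turns that one-ULP gap into a different predicted label.
threshold = 0.6
print(left_to_right > threshold, right_to_left > threshold)  # True False
```

On a validation set of a few thousand rows, a handful of such borderline flips is enough to explain 0.873 vs 0.871.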
D5 · Design a 10-item pre-submit checklist
You have 10 minutes before the upload window closes. Write the 10 checks you run on
your submission.csv.
Solution
(1) Row count equals test-set size. (2) Column names match spec exactly. (3) Ids
are unique. (4) Ids are an exact match of the test ids. (5) No
NaN/inf. (6) Prediction range plausible
(df.describe()). (7) Encoding UTF-8, no BOM
(file submission.csv). (8) Line endings LF only
(cat -A submission.csv | head). (9) File size in expected order of magnitude.
(10) Re-run inference from a fresh Python and confirm SHA-256 matches; this catches
determinism regressions. Bonus: diff against your previous submission to see what
actually changed.
Next step
MLOps is the engineering layer underneath every other topic on this site. Pair it with engineering survival (the fourteen bugs that eat points), end-to-end Colab notebooks (where these patterns live in runnable form), and contest cheatsheets (one-page recall of the API calls used here).