IOAI 2024 · ML · Feature engineering on matrix-shaped samples
Contest: IOAI 2024 (Bulgaria) · Round: Scientific, at-home stage · Category: Classical ML / feature engineering. On-site sibling: same idea, switched task type (classification ↔ regression) and a different matrix dataset.
Official sources: ioai-official.org/2024-tasks · IOAI-official/IOAI-2024 · at-home problems (zip).
1. Problem restatement
Each training example is a matrix X_i ∈ R^{r×c} (small — typical
sizes range r,c ∈ [4, 64]) paired with a scalar target y_i. The organisers fix a downstream
model (a plain gradient-boosted tree or a small MLP — the exact estimator is named in the notebook and
is not to be modified). The contestant's job is to design a feature extractor
φ(X) → R^d so that, when the fixed estimator is trained on φ(X_i), y_i, the
test score is as high as possible.
The interesting part is that the input is structured (a matrix, not a flat vector), so flattening throws away geometry. The on-site sibling keeps the same format but swaps the dataset and possibly the task type (e.g. at-home is regression, on-site is classification or vice-versa) — your feature extractor needs to be schema-portable, not over-tuned to a single column meaning.
2. What's being tested
- Feature engineering literacy. Can you reason about which statistics of a matrix are informative — row/column means, rank, singular values, off-diagonal energy, spectral gap?
- Bias-variance under a frozen model. You can't compensate for a poor feature set with a deeper model. Every regularisation knob is locked.
- Generalisation under distribution shift. If on-site flips the task type, features derived from the matrix itself (eigenvalues, Frobenius norm, sparsity pattern) transfer; features derived from "column 3 is age" do not.
- Computational budget. SVD on small matrices is fine; on 64×64 it's ~0.1 ms each, so even 100k samples cost ~10 s. Anything you can compute in NumPy is fair game.
See Classical ML for the underlying scikit-learn workflows and Math for the linear-algebra intuition behind eigenvalue features.
3. Data exploration / setup
The data ships as a directory of .npy files plus a labels.csv:
import numpy as np, pandas as pd, pathlib
root = pathlib.Path("ml_task_2024")
labels = pd.read_csv(root / "train_labels.csv") # columns: id, y
def load(i): return np.load(root / "train" / f"{i}.npy")
X0 = load(labels.id.iloc[0])
print(X0.shape, X0.dtype, labels.y.describe())
EDA checklist before you write a feature:
- Shape variability. Are all
X_ithe same shape, or jagged? Jagged inputs forbid most "flatten then PCA" recipes. - Scale. Are entries roughly in [-1, 1], or do magnitudes span 6 orders? Log-scale or row-standardise before computing distance features.
- Rank. Most matrix tasks hide a low-rank structure. Check
np.linalg.matrix_rankand the singular-value decay — if the top 3 SVs hold > 90% of the energy, your feature is "top-3 singular values + their ratio". - Target distribution. Long-tailed regression target ⇒ log-transform y; class imbalance ⇒ stratified split.
Metric: typically RMSE for regression or macro-F1 for classification, scaled to 0–100 on the leaderboard. [verify against source]
4. Baseline approach
A baseline that ships in 30 lines: flatten every matrix, run sklearn's StandardScaler, then
fit the fixed model. This works astonishingly well as a sanity check and gives you a number to beat.
import numpy as np, pandas as pd, pathlib
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import root_mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor
root = pathlib.Path("ml_task_2024")
labels = pd.read_csv(root / "train_labels.csv")
val = pd.read_csv(root / "val_labels.csv")
def phi_flat(X):
return X.reshape(-1)
Xtr = np.stack([phi_flat(np.load(root / "train" / f"{i}.npy")) for i in labels.id])
Xva = np.stack([phi_flat(np.load(root / "val" / f"{i}.npy")) for i in val.id])
sc = StandardScaler().fit(Xtr)
Xtr, Xva = sc.transform(Xtr), sc.transform(Xva)
# the "fixed" downstream model — replicate whatever the official notebook ships
m = GradientBoostingRegressor(n_estimators=300, max_depth=3, random_state=0)
m.fit(Xtr, labels.y)
print("val RMSE:", root_mean_squared_error(val.y, m.predict(Xva)))
# illustrative baseline score: ~ 50/100 [illustrative]
Score band: flatten-baseline typically lands in the lower-middle of the leaderboard. Real numbers are [illustrative] — IOAI does not publish per-task leaderboards.
5. Improvements that move the needle
5.1 · Replace flatten with permutation-invariant summaries
If row order is arbitrary (a common IOAI twist), flatten is broken: the same data permuted gives
different features. Use statistics that are invariant under row/column permutation: per-column mean,
std, min, max, median; per-row equivalents; the full sorted singular-value spectrum
(np.linalg.svd returns SVs in descending order regardless of input row order).
def phi_invariant(X):
col = np.concatenate([X.mean(0), X.std(0), X.min(0), X.max(0)])
row = np.concatenate([X.mean(1), X.std(1)])
sv = np.linalg.svd(X, compute_uv=False)
return np.concatenate([col, np.sort(row), sv,
[np.linalg.norm(X, "fro"),
np.linalg.matrix_rank(X, tol=1e-6)]])
5.2 · Add spectral and correlation features
For each matrix compute its covariance (or Gram matrix if rows are observations), take its top-k eigenvalues, and feed both the eigenvalues and the eigenvector entropies as features. This captures "is the matrix near-isotropic vs concentrated along one direction" — a signal that flatten-then-tree completely misses on small samples.
5.3 · Random-feature kernel (RFF) on the flattened matrix
If the downstream model is shallow (e.g. linear regression or small MLP), an RBF random-feature map on top of your invariant features approximates a Gaussian-kernel ridge regression — typically a free 5–10 RMSE points when the surface is non-linear.
from sklearn.kernel_approximation import RBFSampler
rff = RBFSampler(n_components=512, gamma=0.05, random_state=0)
Xtr_rff = rff.fit_transform(Xtr)
5.4 · Cross-validate the feature recipe, not the model
Because the downstream model is frozen, do not tune its hyperparameters. Spend your CV budget choosing which feature subset to include. A clean 5-fold loop over feature blocks (column-stats / row-stats / SVD / spectral / RFF) usually shows 1–2 blocks dominating and 1 block actively hurting — drop it.
5.5 · For the on-site sibling: keep features schema-agnostic
The on-site task switches the dataset. Any feature that names a specific column ("ratio of column 2
to column 5") will silently break. Stick to global invariants (norms, eigenvalues, entropies) and you
only need to retrain the frozen estimator on the new data, not rewrite φ.
6. Submission format & gotchas
- Submit
submission.csvwithid,y(orid,label) matchingtest.csvids exactly. - The grader retrains the fixed model from scratch on your features — so save
φas code, not as a fitted scaler. Anything stateful (PCA, StandardScaler) must be fit on train only. - Be careful not to leak: do not fit your scaler or PCA on the test set's matrices.
- Submit a fallback every hour. A working flatten-baseline that scores 50 is infinitely better than a half-finished spectral pipeline that crashes the grader.
7. What top solutions did
The official "best solutions" archive (linked above) bundles community write-ups. Common ingredients: permutation-invariant summaries (column/row stats + sorted SVs) form ~70% of the feature vector; spectral features on the Gram matrix add another chunk; an explicit "matrix is approximately rank-r" indicator (binary, derived from the SV gap) gives a couple of leaderboard points. No top solution used a deep network on the raw matrices — the frozen-estimator constraint kills that route. [verify against source]
8. Drill
D · Your matrix samples are 16×16, all real-valued. Pick 8 features that you'd compute first.
A defensible set: Frobenius norm, nuclear norm (sum of SVs),
top SV, top-1 / top-2 SV ratio (a "condition number" proxy),
trace, off-diagonal energy (||X − diag(X)||_F), mean of |X|,
fraction of entries below 0.1·max (sparsity proxy). Eight features, all permutation-aware
where the structure asks for it, all O(r·c) to compute. Add the full SV spectrum (16 numbers) if
your budget allows; this alone usually beats flatten.
D2 · The fixed model is a gradient-boosted tree. Why does this constrain your feature design?
Trees split on one feature at a time and don't combine features linearly. So interactions you might encode in a linear model (a polynomial cross-term) are pointless — the tree can recover them given enough depth. But trees are terrible at smooth functions of many features (RMS, sums). So features should be either (a) raw "look-up" statistics the tree can split on directly, or (b) pre-computed nonlinear aggregates (norms, entropies, eigenvalues) the tree can't easily synthesize on its own.