Classical machine learning

Before reaching for deep learning, classical ML solves most tabular contest problems faster, with fewer bugs, and with less compute. This page covers the scikit-learn families you must own: regression, classification, ensembles, cross-validation, clustering, and dimensionality reduction.

The contest workflow

  1. Load & inspect. pd.read_csv, .info(), .describe(), scan for NaNs, look at value distributions.
  2. Train/val/test split. Stratified for classification; time-based for time series. Never let information leak between splits.
  3. Baseline model. Logistic regression or random forest with default params. Whatever you build later must beat this (see the sketch after this list).
  4. Feature engineering. Numerical: standardize / log-transform. Categorical: one-hot or target encode. Date: extract year / month / day-of-week.
  5. Model search. Try 2–3 model families, tune each with cross-validation.
  6. Ensemble. Average top models. Usually +1–3% on the leaderboard.
  7. Reproduce. Fix random seeds, save the trained model, log the val score.
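
A minimal sketch of steps 2–3, assuming X (features) and y (labels) were already built in step 1:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 2: stratified split so class proportions match across train and val.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 3: default-parameter baseline that every later model must beat.
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)
print("baseline val accuracy:", accuracy_score(y_val, baseline.predict(X_val)))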

Regression

Linear regression

Closed-form solution; assumes linear relationship + Gaussian errors. The baseline every other model is judged against.

sklearn.linear_model.LinearRegression

Ridge & Lasso

Linear regression + L2 (Ridge) or L1 (Lasso) regularization. Lasso also performs feature selection by zeroing weights.

Ridge(alpha=1.0) · Lasso(alpha=0.1)
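
A quick sketch of that zeroing behavior on synthetic data (illustrative only, not contest data):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# 20 features, only 5 informative: Lasso should discard most of the rest.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("ridge coefficients exactly zero:", np.sum(ridge.coef_ == 0))
print("lasso coefficients exactly zero:", np.sum(lasso.coef_ == 0))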

Elastic Net

L1 + L2 combined. Good default when you don't know which kind of regularization helps.

ElasticNet(alpha=0.1, l1_ratio=0.5)

Gradient-boosted trees

Sequential additive trees. State-of-the-art on tabular data — usually beats neural nets on small/medium structured datasets.

HistGradientBoostingRegressor() or external xgboost / lightgbm

Metrics: MAE, MSE, RMSE, R². Pick based on the problem — RMSE penalizes large errors more; MAE is robust to outliers.
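
A sketch wiring the boosted-tree regressor to these metrics, assuming a train/val split with a continuous target (X_train, X_val, y_train, y_val as in the workflow above):

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# X_train, X_val, y_train, y_val assumed from an earlier split; y is numeric here.
reg = HistGradientBoostingRegressor(random_state=42)
reg.fit(X_train, y_train)
pred = reg.predict(X_val)

print("MAE :", mean_absolute_error(y_val, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_val, pred)))   # square root of MSE
print("R²  :", r2_score(y_val, pred))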

Classification

Logistic regression

Linear decision boundary, probabilistic output via sigmoid. Fast, interpretable, hard to overfit.

LogisticRegression(C=1.0, max_iter=1000)

k-Nearest Neighbors

Lazy learner: fit just stores the training data; prediction looks at the k closest training points. Sensitive to feature scaling.

KNeighborsClassifier(n_neighbors=5)
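
Because prediction is distance-based, put the scaler and classifier in one Pipeline so the scaler is fit on training data only; a minimal sketch assuming the split from the workflow:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Scaling inside the pipeline prevents leakage from validation data into the scaler.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)        # X_train, y_train assumed from the earlier split
print("val accuracy:", knn.score(X_val, y_val))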

Decision trees

Recursive feature splits. Interpretable but high variance — single trees overfit. The building block for ensembles.

DecisionTreeClassifier(max_depth=8)

Random forest

Average of many decorrelated trees. Strong default; robust to outliers and irrelevant features.

RandomForestClassifier(n_estimators=200)
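
A sketch that fits the forest and inspects impurity-based feature importances (assumes X_train is a pandas DataFrame so column names are available):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)         # X_train, y_train assumed from the earlier split

# Impurity-based importances: a quick, if biased, view of which features the forest uses.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))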

Gradient boosting

Boosted trees that fit residuals sequentially. Best-in-class for most tabular classification.

HistGradientBoostingClassifier()

SVM

Maximum-margin linear classifier; with kernels (RBF, polynomial) it handles non-linear data. Slow at large n.

SVC(kernel='rbf', C=1.0)

Metrics: accuracy, precision, recall, F1, ROC-AUC, log loss. For imbalanced classes, accuracy is misleading — always check precision/recall and the confusion matrix.
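
A sketch of the usual metric readout for a binary problem, assuming clf is a fitted probabilistic classifier (e.g. the random forest above) and X_val, y_val come from the earlier split:

from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, log_loss)

pred = clf.predict(X_val)                       # hard class predictions
proba = clf.predict_proba(X_val)[:, 1]          # probability of the positive class

print(confusion_matrix(y_val, pred))            # rows: true class, columns: predicted
print(classification_report(y_val, pred))       # precision, recall, F1 per class
print("ROC-AUC :", roc_auc_score(y_val, proba))
print("log loss:", log_loss(y_val, proba))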

Cross-validation

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

# X, y: the full training features and labels from the workflow above.
model = LogisticRegression(max_iter=1000)

# Stratified folds preserve the class balance of y in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())

Ensembles
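
Step 6 of the workflow averages top models. One minimal sketch is scikit-learn's soft-voting wrapper over models from this page (it averages predicted probabilities; stacking and rank averaging are common alternatives):

from sklearn.ensemble import (VotingClassifier, RandomForestClassifier,
                              HistGradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression

# Soft voting averages each member's predicted probabilities.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
        ("gb", HistGradientBoostingClassifier(random_state=42)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)   # X_train, y_train assumed from the earlier split
print("ensemble val accuracy:", ensemble.score(X_val, y_val))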

Unsupervised learning

Clustering

Dimensionality reduction

Pitfalls