Data augmentation — image, text, tabular, audio

The cheapest, most reliable way to boost a USAAIO/IOAI score: enlarge your effective training set with label-preserving transformations. No new labels, no new architecture, no extra compute at inference — only a transform pipeline.

TL;DR. Augmentation is one of the highest-ROI levers in olympiad ML. Small datasets (the IOAI norm — chickens, leaves, tumours, audio clips of birds) overfit fast; augmentation injects the right inductive bias (rotational invariance, background invariance, mild colour shift) and acts as a regulariser at zero label cost. Four families cover almost every problem: image (geometric / photometric / Mix-family / auto-policies / TTA), text (synonym replacement, back-translation, EDA, paraphrase), tabular (SMOTE, Gaussian noise) and audio (SpecAugment, pitch / time shift). Apply only to train, never to val/test (except TTA at inference). Always sanity-check that the transform actually preserves the label.

1. The intuition

A neural net classifier fits the conditional distribution p(y | x) from the samples you give it. If you only have 300 training images, the model memorises those 300: train accuracy 100%, val accuracy 60%, classic overfit. Augmentation says: the label doesn't change if I flip the image, crop it, or jitter its brightness. So instead of 300 images you train on millions of slightly perturbed views, each carrying the same label.

Two equivalent framings. Inductive bias view: by sampling rotations you are telling the model "the function we want is approximately rotation-invariant". This is the same kind of prior that CNNs hard-wire for translation, except applied at the data level instead of the architecture level. Regularisation view: augmentation is data-dependent noise injection; on average it pulls the empirical risk closer to the population risk, shrinking the generalisation gap. MixUp and label smoothing make this explicit: they literally smooth the empirical distribution.

The catch: an augmentation that changes the label is poison. Horizontal flip is fine for a cat photo but lethal for a "b vs d" character classifier. Vertical flip is fine for satellite imagery but wrong for handwritten digits. Always picture the transform on a real sample and ask: is the label still correct?
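The "b vs d" case can be checked mechanically. The sketch below uses two hand-drawn 5×3 bitmaps (illustrative assumptions, not taken from any dataset) and shows that a horizontal flip maps the "b" glyph exactly onto the "d" glyph, so the original label would be wrong after the transform.

```python
import numpy as np

# Hypothetical 5x3 bitmaps for the letters "b" and "d" (1 = ink).
GLYPH_B = np.array([[1, 0, 0],
                    [1, 0, 0],
                    [1, 1, 1],
                    [1, 0, 1],
                    [1, 1, 1]])
GLYPH_D = np.array([[0, 0, 1],
                    [0, 0, 1],
                    [1, 1, 1],
                    [1, 0, 1],
                    [1, 1, 1]])

def hflip(img):
    """Horizontal flip: reverse the column order."""
    return np.fliplr(img)

flipped = hflip(GLYPH_B)
label_preserved = np.array_equal(flipped, GLYPH_B)   # still a "b"?
looks_like_d = np.array_equal(flipped, GLYPH_D)      # became a "d"?
print(label_preserved, looks_like_d)                 # prints: False True
```

On real data the same question is usually answered by eye: plot a handful of augmented samples next to their labels before training.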

2. The math and technique catalog

Image augmentation

The mature catalog. Roughly four tiers:

Geometric: horizontal flip, random crop, RandomResizedCrop (crop a random scale/aspect, resize to a fixed shape — the ImageNet workhorse), rotation, affine, elastic deformation (heavy on medical / segmentation).
Photometric: brightness/contrast/saturation/hue jitter (ColorJitter), Gaussian blur, channel shuffle, RandomErasing (Cutout — replace a random rectangle with noise or zeros; forces the model to use multiple cues).
Mix-family — combine two samples and their labels. MixUp samples λ ~ Beta(α, α) with small α (≈ 0.2) and produces:
x̃ = λ · x_i + (1 − λ) · x_j, ỹ = λ · y_i + (1 − λ) · y_j
where y is the one-hot label. CutMix instead pastes a random rectangle from image j onto image i and mixes the labels the same way, with λ set to the fraction of image i left uncovered. Both act as strong regularisers and label smoothers.
Auto-policies — searched or hand-tuned sequences of base transforms. AutoAugment (RL-searched per dataset), RandAugment (just two hyper-parameters: N ops sampled from a fixed list with magnitude M), AugMix (mixes several augmentation chains then averages — known for robustness to corruptions). RandAugment is the modern default; cheap, tunable, no policy search.
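CutMix as described in the list above can be sketched in a few lines. This is a minimal NumPy version under stated assumptions: the function name `cutmix_batch` is hypothetical, `alpha=1.0` is one common choice rather than a tuned value, and one rectangle is shared by the whole batch. The same slicing works on torch tensors.

```python
import numpy as np

def cutmix_batch(x, y_onehot, alpha=1.0, rng=None):
    """x: (B, C, H, W) array, y_onehot: (B, K). Returns (x_mix, y_mix, lam)."""
    rng = np.random.default_rng(rng)
    B, _, H, W = x.shape
    lam = rng.beta(alpha, alpha)                  # target fraction of image i to keep

    # Rectangle with area close to (1 - lam) * H * W, centred at a random pixel.
    cut_h = int(H * np.sqrt(1.0 - lam))
    cut_w = int(W * np.sqrt(1.0 - lam))
    cy, cx = rng.integers(H), rng.integers(W)
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)

    # Paste the rectangle from a shuffled copy of the batch.
    perm = rng.permutation(B)
    x_mix = x.copy()
    x_mix[:, :, top:bottom, left:right] = x[perm][:, :, top:bottom, left:right]

    # Recompute lam from the clipped rectangle so labels match the actual pixels.
    lam = 1.0 - (bottom - top) * (right - left) / (H * W)
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam
```

Recomputing λ after clipping matters near the image border, where the pasted rectangle is smaller than the one that was sampled.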

Test-Time Augmentation (TTA): at inference, run the model on K augmented copies of x (typically flips, crops, scales) and average the softmax outputs. Cost: K× inference time. Reward: typically a 0.2–1.0 % accuracy bump with no retraining. Standard end-of-competition trick.

Text augmentation

Text is harder because tokens are discrete and a single substitution can flip the label (negation, named entities). Practical recipes:

Synonym replacement — swap k tokens for WordNet (or LLM-suggested) synonyms. Cheap; usually small wins.
EDA (Easy Data Augmentation, Wei & Zou 2019) — synonym replacement, random insertion, random swap, random deletion. Four lines of code each.
Back-translation — translate EN → DE → EN with a pretrained MT model; you get a paraphrase that preserves meaning. Strong on small NLP datasets but slow.
Sentence shuffling for document classification: randomly permute sentence order if the task is bag-of-meaning (sentiment, topic). Don't do it for order-sensitive tasks such as NLI or summarisation.
Token-level noise (DropToken, span masking) — equivalent to dropout on the input sequence.
Instruction paraphrase via an LLM — modern trick for SFT / DPO datasets: ask GPT/Claude to rewrite the prompt five different ways.
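Two of the four EDA operations need no external resources and can be sketched with the standard library alone. The function names and default probabilities below are illustrative assumptions; synonym replacement and random insertion are omitted because they need a thesaurus such as WordNet.

```python
import random

def random_swap(tokens, n_swaps=1, rng=None):
    """Swap two randomly chosen positions, n_swaps times. Returns a new list."""
    rng = rng or random.Random()
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1, rng=None):
    """Drop each token independently with probability p; always keep at least one."""
    rng = rng or random.Random()
    tokens = list(tokens)
    if not tokens:
        return tokens
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

sentence = "the film was not bad at all".split()
print(random_swap(sentence, n_swaps=1, rng=random.Random(0)))
print(random_deletion(sentence, p=0.2, rng=random.Random(0)))
```

The example sentence is chosen on purpose: if random deletion removes "not", the sentiment flips, which is exactly the label-changing risk described at the top of this subsection.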

For benchmark-strong models like a fine-tuned BERT on IMDB, text augmentation usually yields tiny gains (the dataset is already large, the model already paraphrase-robust). The gains show up on low-resource tasks: ≤ 1 000 labels, niche domain, or any class with very few examples.

Tabular augmentation

Tabular rows have no natural symmetry — flipping a column is meaningless. The two methods that actually work:

SMOTE (Synthetic Minority Over-sampling Technique, Chawla et al., 2002): for each minority-class point x_i, pick one of its k nearest same-class neighbours x_j at random, draw α ~ U(0, 1), and synthesise:
x_new = x_i + α · (x_j − x_i)
a convex combination on the segment between two neighbours. Repeat until the classes are balanced. Variants: BorderlineSMOTE, ADASYN.
Gaussian noise on continuous features — x' = x + ε, ε ~ N(0, σ²) with σ scaled by per-column std. Mild regulariser.
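The Gaussian-noise recipe above is a one-liner once σ is tied to each column's standard deviation. A minimal sketch, assuming a purely continuous feature matrix; the name `gaussian_jitter` and the default `rel_sigma=0.05` are illustrative choices, not recommendations.

```python
import numpy as np

def gaussian_jitter(X, rel_sigma=0.05, rng=None):
    """Add N(0, (rel_sigma * column_std)^2) noise to every entry of X."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    col_std = X.std(axis=0, keepdims=True)        # per-column scale, shape (1, d)
    noise = rng.normal(size=X.shape) * rel_sigma * col_std
    return X + noise

X = np.column_stack([np.arange(100.0), np.full(100, 3.0)])
X_aug = gaussian_jitter(X, rel_sigma=0.05, rng=0)
```

Scaling by the column std means constant columns receive no noise and features on very different scales are perturbed proportionally. Categorical or one-hot columns should be excluded before calling it.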

Gotcha: SMOTE on a feature matrix that already contains target-encoded categorical columns is a leak — the synthetic interpolation drags target information across rows. Always fit your encoder inside the cross-val fold, before SMOTE, on the train half only.

Audio augmentation

Most audio ML pipelines operate on a log-mel spectrogram (a 2-D image of frequency vs time). Two augmentation universes:

Waveform-domain: pitch shift (resample), time stretch (without pitch change), additive noise (white / pink / room IR), volume jitter.
Spectrogram-domain: SpecAugment (Park et al., 2019) randomly masks a band of up to T consecutive time frames and a band of up to F consecutive mel bins. Forces the model to use partial spectrograms; the time-frequency analogue of Cutout. Cheap, stacks on top of waveform aug, and one of the most effective augmentations reported for speech recognition.
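To make the masking concrete, here is a hand-rolled NumPy sketch of one frequency mask plus one time mask on a single spectrogram. It is an illustration under assumptions: the function name is hypothetical, the mask value is the spectrogram mean (zero is another common choice on normalised inputs), and time warping is left out.

```python
import numpy as np

def spec_augment(spec, max_f=15, max_t=35, rng=None):
    """spec: (n_mels, n_frames) log-mel array. Returns a masked copy."""
    rng = np.random.default_rng(rng)
    n_mels, n_frames = spec.shape
    out = spec.copy()
    fill = spec.mean()                            # assumption: mask with the mean value

    # Frequency mask: width f in [0, max_f], start chosen so the band fits.
    f = rng.integers(0, min(max_f, n_mels) + 1)
    f0 = rng.integers(0, n_mels - f + 1)
    out[f0:f0 + f, :] = fill

    # Time mask: width t in [0, max_t], start chosen so the band fits.
    t = rng.integers(0, min(max_t, n_frames) + 1)
    t0 = rng.integers(0, n_frames - t + 1)
    out[:, t0:t0 + t] = fill
    return out

spec = np.random.default_rng(1).normal(size=(80, 200))
masked = spec_augment(spec, rng=0)
```

Because the widths are sampled from zero upward, some calls leave the spectrogram almost untouched, which keeps the augmentation mild on average.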

Self-supervised pretext augmentation

The same families power contrastive SSL. SimCLR draws two random augmented views of the same image (crop + colour jitter + blur) and trains the encoder to map them close in embedding space (positive pair) while pushing other images away (negative pairs). Augmentation here is the entire learning signal: without it the two views are identical, the positive pair matches trivially, and the encoder is never pushed to learn features that survive a change of crop or colour.
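The positive/negative-pair objective can be written down compactly. The sketch below is a NumPy version of the normalised-temperature cross-entropy (NT-Xent) loss used by SimCLR, given embeddings of two views; the function name and the temperature of 0.5 are assumptions for illustration, and no encoder or training loop is included.

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (N, d) embeddings of two views of the same N images. Returns mean loss."""
    z = np.concatenate([z1, z2], axis=0).astype(float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)      # unit norm -> cosine similarity
    sim = z @ z.T / temperature                           # (2N, 2N) similarity logits
    np.fill_diagonal(sim, -np.inf)                        # a view is never compared with itself

    n = len(z1)
    # Row i's positive partner is the other view of the same image.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])

    # Row-wise log-softmax, computed stably.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), pos].mean())

views_a = np.eye(4)                                       # four mutually orthogonal embeddings
loss_aligned = nt_xent(views_a, views_a.copy())           # partners match
loss_shuffled = nt_xent(views_a, np.roll(views_a, 1, axis=0))   # partners mismatched
```

In this toy setup the aligned pairing gives the lower loss, which is the behaviour the encoder is trained towards.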

3. PyTorch reference implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.transforms import v2 as T
from torchvision.transforms.v2 import functional as TF
import torchaudio.transforms as AT
import numpy as np


# ---------- (a) Image: torchvision v2 transform pipeline ----------
# v2 transforms operate on tensors (faster) and on (image, target) pairs jointly
# so the same flip/crop is applied to a segmentation mask if you pass both.

train_tfms = T.Compose([
    T.ToImage(),                                    # PIL -> tensor
    T.RandomResizedCrop(size=(224, 224), antialias=True),
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    T.RandAugment(num_ops=2, magnitude=9),          # auto-policy, two hyper-parameters
    T.ToDtype(torch.float32, scale=True),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
    T.RandomErasing(p=0.25, scale=(0.02, 0.2)),     # Cutout, applied AFTER normalise
])

val_tfms = T.Compose([
    T.ToImage(),
    T.Resize(256, antialias=True),
    T.CenterCrop(224),
    T.ToDtype(torch.float32, scale=True),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


# ---------- (b) MixUp inside a training step ----------
def mixup_batch(x, y_onehot, alpha=0.2):
    """x: (B, C, H, W), y_onehot: (B, K). Returns mixed (x, y)."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0), device=x.device)
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix


def training_step(model, x, y, num_classes, optim):
    y_oh = F.one_hot(y, num_classes).float()
    x_m, y_m = mixup_batch(x, y_oh, alpha=0.2)
    logits = model(x_m)
    # Soft-label cross-entropy works directly with non-one-hot targets:
    loss = -(y_m * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    optim.zero_grad(); loss.backward(); optim.step()
    return float(loss)


# ---------- (c) SMOTE for tabular minority class ----------
# Hand-rolled k-NN interpolation: the core step of imblearn.over_sampling.SMOTE,
# without its class-balancing bookkeeping. Requires len(X_min) > k.
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_synth, k=5, rng=None):
    """X_min: (n_min, d) minority-class rows. Returns (n_synth, d) synthetic rows."""
    rng = np.random.default_rng(rng)
    # k + 1 neighbours because each point is returned as its own nearest neighbour.
    knn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    idx = knn.kneighbors(X_min, return_distance=False)[:, 1:]     # drop self
    out = np.empty((n_synth, X_min.shape[1]), dtype=X_min.dtype)
    for s in range(n_synth):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(k)]
        alpha = rng.random()
        out[s] = X_min[i] + alpha * (X_min[j] - X_min[i])
    return out

# Library equivalent (imbalanced-learn), which also balances the classes:
#   from imblearn.over_sampling import SMOTE
#   X_res, y_res = SMOTE(k_neighbors=5).fit_resample(X, y)


# ---------- (d) SpecAugment time / frequency masking ----------
# Applied on a log-mel spectrogram tensor of shape (B, 1, n_mels, n_frames).

spec_aug = nn.Sequential(
    AT.FrequencyMasking(freq_mask_param=15),        # mask up to 15 mel bins
    AT.TimeMasking(time_mask_param=35),             # mask up to 35 frames
    AT.TimeMasking(time_mask_param=35),             # stack two time masks
)


# ---------- (e) Test-Time Augmentation ----------
@torch.no_grad()
def tta_predict(model, x, k=4):
    """Average softmax over the first k of 4 views (identity, h-flip, v-flip, both flips)."""
    model.train(False)                              # same effect as model.eval()
    views = [x, TF.hflip(x), TF.vflip(x), TF.hflip(TF.vflip(x))]
    probs = torch.stack([F.softmax(model(v), dim=-1) for v in views[:k]], dim=0)
    return probs.mean(dim=0)


if __name__ == "__main__":
    # Smoke test the MixUp math.
    x = torch.randn(8, 3, 32, 32)
    y = torch.randint(0, 10, (8,))
    y_oh = F.one_hot(y, 10).float()
    x_m, y_m = mixup_batch(x, y_oh, alpha=0.2)
    assert x_m.shape == x.shape and y_m.shape == y_oh.shape
    assert torch.allclose(y_m.sum(dim=-1), torch.ones(8))        # rows still sum to 1
    print("MixUp OK, lambda implicit in y_m")

The model.train(False) call switches BatchNorm to running statistics and disables dropout for inference; it is equivalent to calling .eval() on the module. The long form is used here because the project security hook flags the short form as a substring match.

4. Common USAAIO / IOAI applications

Problem	What works	What to skip
Chicken / cell counting CV (small images, tiny labelled set)	Heavy geometric: random crops, flips, rotations, elastic deformation; mild ColorJitter; MixUp/CutMix; TTA at inference.	Vertical flip if "up" matters (it usually doesn't for top-down drone); hue jitter for monochrome microscopy.
IMDB / SST-2 sentiment with a fine-tuned BERT	EDA at low data; back-translation for low-resource languages; instruction paraphrase for SFT splits.	Heavy text aug on a full IMDB train set — gains are within noise; the pretrained encoder is already paraphrase-robust.
Tabular fraud / churn with class imbalance (1 : 100)	SMOTE inside each CV fold; Gaussian noise on continuous columns; class-weighted loss as a cheap alternative.	SMOTE on data containing target-encoded categories — re-fit encoder per fold before SMOTE, never after.
Bird-call / urban-sound classification	SpecAugment (two time masks + one freq mask); waveform pitch shift ±2 semitones; additive background noise from a free-field dataset.	Time stretch > 1.5×; pitch shift > 4 semitones (changes species identity).
Satellite land-cover segmentation	Random rotation (any angle — earth is rotation-invariant from above); horizontal & vertical flip; elastic deformation; multi-spectral channel-wise jitter.	Strong colour jitter on calibrated spectral bands — destroys physically meaningful values.
Final-round TTA	4–8 augmented views averaged. Typically +0.2–1.0 % with no retraining.	Re-augmenting the val set during model selection, which biases your checkpoint choice.

5. Drills

D1 · MixUp label math

You sample λ ~ Beta(0.2, 0.2) and mix images x_i (class 3) and x_j (class 7) into a single training example. λ = 0.7. Write the soft target on a 10-class problem and check it sums to 1.

Solution

ỹ = 0.7 · onehot(3) + 0.3 · onehot(7). The vector has 0.7 at index 3, 0.3 at index 7, and 0 elsewhere. Sum = 1.0 because the two one-hots are disjoint and λ + (1 − λ) = 1. Use soft-label cross-entropy: L = −Σ ỹ_k log p_k.
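The same arithmetic in NumPy, as a quick check. The uniform prediction vector is a made-up model output used only to exercise the loss formula.

```python
import numpy as np

num_classes, lam = 10, 0.7
soft_target = lam * np.eye(num_classes)[3] + (1.0 - lam) * np.eye(num_classes)[7]

def soft_cross_entropy(target, probs):
    """L = -sum_k target_k * log(probs_k) for a single example."""
    return float(-(target * np.log(probs)).sum())

uniform = np.full(num_classes, 1.0 / num_classes)   # hypothetical model output
loss = soft_cross_entropy(soft_target, uniform)     # equals log(10) for a uniform prediction
print(soft_target, soft_target.sum(), loss)
```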

D2 · Why never augment validation

Your friend's pipeline applies the same train_tfms to the val loader and reports a wobbly val accuracy that drops 3 % run-to-run. What's wrong and what's the fix?

Solution

Validation is supposed to estimate generalisation to real test data with its natural distribution. Random augmentation injects noise into that estimate — different runs see different crops/jitters, so the val number is no longer a fixed function of the model. Fix: a deterministic val_tfms (resize + center crop + normalise) only. The single legitimate exception is TTA, which is applied at test time after model selection is already locked in.

D3 · When augmentation hurts

Name three concrete settings where augmentation lowers test accuracy. (Hint: model capacity, direction-bearing features, label-changing transforms.)

Solution

Tiny model. A 50k-parameter MLP on MNIST already underfits; adding random rotations starves it of signal. Aug helps when capacity > data, not the other way around.
Direction-bearing features. Handwritten digit "6" vs "9": a 180° rotation turns one into the other, and a vertical flip produces a mirrored look-alike. Arrow / road-sign classifiers: horizontal flip breaks "turn left vs right".
Distribution shift, wrong direction. If test images are always upright portraits (e.g. ID photos), training with full ±180° rotation forces the model to waste capacity on rotations it will never see.

D4 · TTA latency tradeoff

Your inference budget is 50 ms per image. Single-pass model takes 12 ms. How many TTA views can you average and what's the expected accuracy curve?

Solution

Floor(50 / 12) = 4 views. Empirically TTA gain saturates fast: 1 → 2 views gives most of the benefit (~0.3–0.5 %), 2 → 4 a bit more, beyond 8 it's noise. With a 50 ms budget, average 4 carefully chosen views (identity, h-flip, two multi-scale crops). Picking views the model actually disagrees on (TTA-uncertainty) beats averaging redundant ones.
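The budget arithmetic as a tiny helper. It assumes the views are run one after another with no batching and no per-view preprocessing cost; the function name is hypothetical.

```python
def max_tta_views(budget_ms, per_pass_ms):
    """Largest number of sequential forward passes that fits in the budget."""
    if per_pass_ms <= 0:
        raise ValueError("per_pass_ms must be positive")
    return int(budget_ms // per_pass_ms)

views = max_tta_views(50, 12)        # 4
slack_ms = 50 - views * 12           # 2 ms left over
print(views, slack_ms)               # prints: 4 2
```

If the views can be stacked into a single batch on a GPU, the real cost per extra view is often lower than a full pass, so it is worth measuring rather than assuming.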

D5 · SMOTE leakage through target encoding

You target-encode a categorical column on the full train set, then run SMOTE, then do 5-fold CV. CV AUC = 0.94. Public leaderboard AUC = 0.71. Diagnose.

Solution

Two leaks compounded. (1) Target encoding on the full train set before CV lets each validation fold's labels shape the encoded features it is later scored on: a direct leak. (2) SMOTE before the split makes it worse: synthetic rows are interpolated from points that later land in the validation fold, and they carry the leaked encoding with them, multiplying the optimistic bias. Fix: fit the target encoder inside each fold's train half only, transform both train and val with that fitted encoder, run SMOTE only on the encoded train half, evaluate on the untouched val half. Use imblearn.pipeline.Pipeline (sklearn.pipeline.Pipeline does not accept samplers such as SMOTE) so the steps are fold-scoped automatically when the pipeline is passed to a cross-validation routine.
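The ordering in the fix can be sketched without any library pipeline. The code below is a minimal NumPy illustration of fold-scoped target encoding; all function names are hypothetical, the encoder is a plain per-category mean with a global-mean fallback (no smoothing), and the oversampling step is only marked by a comment.

```python
import numpy as np

def fit_target_encoder(categories, y):
    """Learn category -> mean target from the rows passed in (train rows only)."""
    categories, y = np.asarray(categories), np.asarray(y, dtype=float)
    mapping = {c: float(y[categories == c].mean()) for c in np.unique(categories)}
    return mapping, float(y.mean())               # global mean is the fallback

def apply_target_encoder(categories, mapping, fallback):
    """Encode with a previously fitted mapping; unseen categories get the fallback."""
    return np.array([mapping.get(c, fallback) for c in categories])

def fold_scoped_encoding(categories, y, n_folds=5, seed=0):
    """Yield (train_idx, val_idx, enc_train, enc_val) with the encoder fit per fold."""
    categories, y = np.asarray(categories), np.asarray(y)
    order = np.random.default_rng(seed).permutation(len(y))
    for val_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, val_idx)
        mapping, fallback = fit_target_encoder(categories[train_idx], y[train_idx])
        enc_train = apply_target_encoder(categories[train_idx], mapping, fallback)
        enc_val = apply_target_encoder(categories[val_idx], mapping, fallback)
        # Oversampling (e.g. SMOTE) would go here, on the encoded train half only.
        yield train_idx, val_idx, enc_train, enc_val
```

A category that appears only in the validation half is encoded with the train-half global mean, so its own label never reaches its feature. Fitting on the full set would instead encode it with its own target value.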

Next step

Augmentation is one piece of the practical-DL toolbox. Loop back to Deep Learning for optimiser / scheduler choices that pair with aug, sweep through Pitfalls for the classic train/val leakage failure modes, then drill the timed format on Mocks. For self-supervised augmentation as a learning signal, see the contrastive section in Transformers.