USAAIO 2025 Round 1 · Problem 4 · CNN on a small image set [reconstructed]

Contest: 2025 USA-NA-AIO Round 1 · Round: Round 1 (online) · Category: Deep learning / convolutional networks.

Official sources: usaaio.org/past-problems · 2025 Round 1 forum.

Reconstruction notice. The forum publicly hosts Problems 1–3 in detail. P4 here is a [reconstructed] skeleton consistent with the Round-1 syllabus published on usaaio.org/syllabus. Cross-check against the official PDF when released and update wording / scoring weights.

1. Problem restatement

Given a small labelled image dataset (likely a downsampled CIFAR-style 10-class set, ~5 000 training images), train a CNN from scratch. The problem walks through ~12 parts: load and visualise the data, build the model architecture in PyTorch, train for a few epochs, plot loss curves, evaluate, then discuss the effect of data augmentation, normalisation, and learning-rate scheduling. Everything is sized to run on a CPU or a small GPU.

2. What's being tested

PyTorch literacy. Datasets, DataLoaders, nn.Module, optimisers.
Conv arithmetic. Output sizes after stride/padding/kernel, parameter counts.
Training-loop hygiene. Move tensors to device, zero_grad, backward, step.
Generalisation reasoning. Why does data augmentation help when the dataset is small?

3. Data exploration / setup

import torch, torchvision
from torchvision import transforms as T
from torch.utils.data import DataLoader

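# Scale pixels to [0, 1], then map each RGB channel to [-1, 1].
tfm = T.Compose([T.ToTensor(),
                 T.Normalize((0.5,) * 3, (0.5,) * 3)])

# ImageFolder expects one sub-directory per class under each split.
train = torchvision.datasets.ImageFolder("data/train", transform=tfm)
val   = torchvision.datasets.ImageFolder("data/val",   transform=tfm)

print(len(train), len(val), train.classes)
# Shuffle only the training loader; validation order does not matter.
tr = DataLoader(train, batch_size=128, shuffle=True, num_workers=2)
va = DataLoader(val,   batch_size=256, shuffle=False)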

4. Baseline approach

import torch.nn as nn, torch.nn.functional as F

class SmallCNN(nn.Module):
    def __init__(self, nc=10):
        super().__init__()
        self.c1 = nn.Conv2d(3, 32, 3, padding=1)
        self.c2 = nn.Conv2d(32, 64, 3, padding=1)
        self.c3 = nn.Conv2d(64, 128, 3, padding=1)
        self.fc = nn.Linear(128 * 4 * 4, nc)
    def forward(self, x):
        x = F.max_pool2d(F.relu(self.c1(x)), 2)
        x = F.max_pool2d(F.relu(self.c2(x)), 2)
        x = F.max_pool2d(F.relu(self.c3(x)), 2)
        return self.fc(x.flatten(1))

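device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
m = SmallCNN().to(device)
opt = torch.optim.AdamW(m.parameters(), lr=3e-3, weight_decay=1e-4)
for epoch in range(10):
    m.train()
    for x, y in tr:
        x, y = x.to(device), y.to(device)   # move the batch to the model's device
        opt.zero_grad()                      # clear gradients from the previous step
        loss = F.cross_entropy(m(x), y)      # takes raw logits and integer class labels
        loss.backward()
        opt.step()
    print(f"epoch {epoch}: last-batch loss {loss.item():.3f}")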

Baseline accuracy: ~60–70% on a CIFAR-10-style problem at 32×32. [illustrative]
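The training loop above never measures accuracy, so the baseline figure needs an evaluation pass. A minimal sketch, assuming a classifier that returns logits and a loader that yields (image, label) batches; the helper name `accuracy` and the random stand-in data are hypothetical, not part of the problem.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


@torch.no_grad()  # no gradients are needed for evaluation
def accuracy(model: nn.Module, loader: DataLoader, device: torch.device) -> float:
    """Fraction of correctly classified samples in `loader`."""
    model.eval()  # put BatchNorm/Dropout layers into inference mode
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        pred = model(x).argmax(dim=1)        # predicted class per sample
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / max(total, 1)


# Smoke test on random tensors standing in for the real validation set.
device = torch.device("cpu")
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
toy_data = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
toy_acc = accuracy(toy_model, DataLoader(toy_data, batch_size=16), device)
print(f"toy accuracy: {toy_acc:.3f}")
```

With the real objects this would be called once per epoch as accuracy(m, va, device), followed by m.train() before the next epoch.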

5. Improvements that move the needle

5.1 · Data augmentation

Random crop plus horizontal flip is a reliable lift of roughly 5–10 points. Add T.RandomCrop(32, padding=4) and T.RandomHorizontalFlip() to the train transform only; the validation transform stays deterministic.

5.2 · Batch normalisation between conv and ReLU

Insert nn.BatchNorm2d after every Conv2d, before the ReLU. This stabilises gradients and allows a higher learning rate; expect roughly +3–5 points.
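A sketch of the baseline with BatchNorm added in the conv → BN → ReLU → pool order. The class name `SmallCNNBN` and the helper `conv_bn` are hypothetical, and dropping the conv bias is a common convention rather than something the problem specifies.

```python
import torch
import torch.nn as nn


def conv_bn(c_in: int, c_out: int) -> nn.Sequential:
    """3x3 conv -> BatchNorm -> ReLU -> 2x2 max-pool."""
    return nn.Sequential(
        # bias=False: BatchNorm's learned shift makes the conv bias redundant
        nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )


class SmallCNNBN(nn.Module):
    def __init__(self, nc: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            conv_bn(3, 32), conv_bn(32, 64), conv_bn(64, 128)
        )
        self.fc = nn.Linear(128 * 4 * 4, nc)  # assumes 32x32 inputs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.features(x).flatten(1))


logits = SmallCNNBN()(torch.randn(8, 3, 32, 32))
print(logits.shape)
```

Because BatchNorm behaves differently in training and inference, call model.eval() before measuring validation accuracy.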

5.3 · Cosine LR schedule + warmup

Start the learning rate at 1e-4, ramp to 3e-3 over the first epoch, then cosine-decay to 0 over the remaining epochs. A sensible schedule often delivers most of the gain that people attribute to hyperparameter tuning.
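The schedule can be written as a plain function of the optimiser step. This is a sketch, assuming a linear warmup (the text only says "ramp") and per-step rather than per-epoch updates; `lr_at` and the step counts are hypothetical.

```python
import math


def lr_at(step: int, total_steps: int, warmup_steps: int,
          base_lr: float = 3e-3, start_lr: float = 1e-4) -> float:
    """Linear warmup from start_lr to base_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return start_lr + (base_lr - start_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))


# Assumed sizes: ~5000 images / batch 128 -> 40 steps per epoch, 30 epochs.
steps_per_epoch, epochs = 40, 30
total = steps_per_epoch * epochs
for s in (0, 20, 40, total // 2, total):
    print(s, f"{lr_at(s, total, steps_per_epoch):.6f}")
```

In the training loop the value would be written into the optimiser before each step, for example with `for g in opt.param_groups: g["lr"] = lr_at(step, total, steps_per_epoch)`.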

5.4 · Test-time augmentation

Average predictions over the original and the horizontally flipped test image. Typically worth about +1 point with no retraining, at the cost of a second forward pass per image.
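A minimal sketch of flip-based test-time augmentation, assuming NCHW image batches and a model that returns logits. Averaging softmax probabilities rather than logits is one reasonable choice among several; `predict_tta` and the toy model are hypothetical.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def predict_tta(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Average class probabilities over the image and its horizontal mirror."""
    model.eval()
    p = model(x).softmax(dim=1)
    p_flip = model(torch.flip(x, dims=[3])).softmax(dim=1)  # dim 3 is width in NCHW
    return 0.5 * (p + p_flip)


# Toy model and random batch standing in for the trained CNN and test images.
toy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
probs = predict_tta(toy, torch.randn(4, 3, 32, 32))
print(probs.shape, probs.sum(dim=1))
```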

5.5 · Explicit parameter-count and FLOP report

Round-1 problems often allocate points for "describe your model's complexity". Print parameter count and a rough FLOP estimate; mention it in the write-up.
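For the baseline SmallCNN the numbers can be worked out by hand. The sketch below counts parameters and multiply-accumulates (MACs) per layer, assuming 32×32 inputs and ignoring ReLU and pooling costs; the helper names are hypothetical, and the 2 FLOPs per MAC convention is only one of several in use.

```python
def conv_cost(c_in: int, c_out: int, k: int, h_out: int, w_out: int):
    """(parameters, MACs) for a k x k conv with bias."""
    params = c_in * c_out * k * k + c_out
    macs = c_in * c_out * k * k * h_out * w_out  # one dot product per output pixel
    return params, macs


def linear_cost(n_in: int, n_out: int):
    """(parameters, MACs) for a fully connected layer with bias."""
    return n_in * n_out + n_out, n_in * n_out


# Each conv keeps the spatial size (padding=1); the pool after it halves it.
layers = [
    conv_cost(3, 32, 3, 32, 32),     # c1 runs at 32x32
    conv_cost(32, 64, 3, 16, 16),    # c2 runs at 16x16
    conv_cost(64, 128, 3, 8, 8),     # c3 runs at 8x8
    linear_cost(128 * 4 * 4, 10),    # fc on the flattened 4x4 map
]
total_params = sum(p for p, _ in layers)
total_macs = sum(m for _, m in layers)
print(f"params: {total_params:,}")
print(f"MACs per image: {total_macs:,} (about {2 * total_macs:,} FLOPs)")
```

The parameter total should agree with sum(p.numel() for p in m.parameters()) on the model itself, which is a quick cross-check worth including in the write-up.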

6. Submission format & gotchas

Post code + final accuracy + training-loss plot for each part on the forum.
Seed PyTorch (torch.manual_seed(0)) and CUDA (torch.cuda.manual_seed_all(0)) for reproducibility.
Don't accidentally train on the val set; use the train/val splits the problem provides.
If running on CPU, batch size and epoch count will likely need to be smaller; note this in the write-up.
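The seeding advice above can be gathered into one helper. This is a sketch; the name `seed_everything` is hypothetical, and seeding Python's own `random` module is an addition beyond the two calls listed. Seeding alone does not guarantee bit-identical runs on GPU, since some CUDA kernels are nondeterministic.

```python
import random

import torch


def seed_everything(seed: int = 0) -> None:
    """Seed Python and PyTorch random number generators."""
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable


# Re-seeding reproduces the same random draw.
seed_everything(0)
a = torch.rand(3)
seed_everything(0)
b = torch.rand(3)
print(torch.equal(a, b))
```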

7. What top solutions did

[reconstructed — verify against published solutions] The expected full-marks recipe for a CIFAR-style P4: small CNN with BatchNorm, augmentations (random crop + flip), cosine LR with warmup, weight decay 1e-4, ~30 epochs. End on ~85% val accuracy. A more ambitious team would add Mixup or RandAugment for another 2 points. ResNet-18 is overkill at this dataset size and was rarely the winning move on similar problems.

8. Drill

D1 · Your training loss is decreasing but val accuracy plateaus at 50%. What's the first thing to try?

Add data augmentation. A plateau with continuing training-loss decrease is textbook overfitting. Random crop + horizontal flip alone often closes most of the gap; if that's already enabled, reach for stronger augmentation (Mixup, CutMix, RandAugment) and increase weight decay. Reducing model capacity is usually the wrong move — a slightly bigger model with stronger augmentation generalises better than a smaller model with none.
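As an illustration of the stronger augmentation mentioned above, here is a minimal Mixup sketch. It assumes integer class labels and a Beta(alpha, alpha) mixing weight; alpha = 0.2 is an arbitrary illustrative value, and `mixup_batch` and `mixup_loss` are hypothetical helper names.

```python
import torch
import torch.nn.functional as F


def mixup_batch(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
    """Blend each image with a randomly chosen partner from the same batch."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0), device=x.device)
    x_mix = lam * x + (1.0 - lam) * x[perm]
    return x_mix, y, y[perm], lam


def mixup_loss(logits, y_a, y_b, lam: float) -> torch.Tensor:
    """Cross-entropy against both label sets, weighted like the images."""
    return lam * F.cross_entropy(logits, y_a) + (1.0 - lam) * F.cross_entropy(logits, y_b)


# Random tensors standing in for a real batch and real model outputs.
torch.manual_seed(0)
x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
x_mix, y_a, y_b, lam = mixup_batch(x, y)
loss = mixup_loss(torch.randn(8, 10), y_a, y_b, lam)
print(x_mix.shape, round(lam, 3), loss.item())
```

In a real loop the logits would come from m(x_mix), and Mixup would be applied to training batches only.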

D2 · Compute the output spatial size of a 3×3 conv with stride 1, padding 1, on a 32×32 input.

Output size = floor((H + 2·pad − kernel) / stride) + 1 = floor((32 + 2 − 3) / 1) + 1 = 32. So a 3×3 kernel with stride 1 and padding 1 preserves spatial dimensions exactly. This is "same" padding, the usual choice for 3×3 convs in CNN designs; note that nn.Conv2d defaults to padding=0, so it has to be set explicitly. Knowing this formula cold saves 10 minutes of debugging when your flatten dimension doesn't match.
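The same formula traces the baseline's feature-map sizes and explains the 128 * 4 * 4 input to the final linear layer. A small sketch, assuming square 32×32 inputs; `conv_out` is a hypothetical helper, and max-pooling follows the same formula with kernel 2, stride 2, padding 0.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length along one spatial axis of a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 32
for block in range(1, 4):
    size = conv_out(size, kernel=3, stride=1, padding=1)  # 3x3 conv keeps the size
    size = conv_out(size, kernel=2, stride=2)             # 2x2 max-pool halves it
    print(f"after block {block}: {size}x{size}")

flat = 128 * size * size  # channels of the last conv times the final map area
print("flatten dimension:", flat)
```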
