Fine-tuning — adapting pretrained models on a budget

A pretrained ResNet, CLIP, or LLM already encodes most of what your task needs. The competition skill is to move it to your distribution cheaply — without melting a Colab L4, without forgetting the pretraining, and without paying for the full optimiser state of a 7B-parameter model.

TL;DR. Full fine-tuning updates every weight and stores an Adam moment pair per parameter, so a 7B model needs ~84 GB for weights, gradients, and optimiser state in mixed precision (about 56 GB of that is the Adam moments alone). Parameter-efficient fine-tuning (PEFT) freezes the base model and trains a tiny side path: a linear probe (one head), adapters (bottleneck MLPs), or LoRA (a low-rank update BA added to each target linear layer). QLoRA stacks 4-bit base weights underneath, which brings 7B-class LLMs onto a single 16 GB GPU. For instruction following you do SFT (next-token CE on instruction/response pairs, with the loss taken on the response tokens); for preference alignment you do DPO (a closed-form replacement for the RLHF objective that needs no reward model). On IOAI/USAAIO Round-2 Colab budgets, LoRA on a frozen backbone is almost always the right call.

1. The intuition

A pretrained model is a very, very good initialisation. ImageNet-trained ResNets already see edges, textures, parts; a pretrained CLIP already aligns images and captions; a pretrained DistilBERT already speaks English. Your task (chickens vs. ducks, sentiment, math word problems) usually lives in a small neighbourhood of those pretrained weights. You do not need to re-learn what an edge is. You need to nudge the weights a little.

Geometrically, the claim behind PEFT is that the difference W - W_pretrained needed for your task is approximately low-rank: a few directions in weight space do most of the work. If that is true, you can parameterise the update as a product of two thin matrices of rank r << d and train ~99% fewer parameters. The pretrained weights stay frozen, so the original behaviour is always recoverable by switching the side path off (the adapted model can still drift on inputs far from your task), and you only ever train ~0.1 - 1% of the original parameter count.

The other axis is what loss to optimise. For classification on a frozen backbone, plain cross-entropy on a new head is enough — that is linear probing. For chat-style instruction following, you do supervised fine-tuning (SFT): standard next-token cross-entropy, but only on the response tokens. For aligning outputs to human preferences without a reward model, you use DPO, which rewrites the RLHF objective into a closed-form classification loss over preferred / dispreferred response pairs. All of these compose with LoRA underneath.

2. The math

Full fine-tuning

Every parameter receives a gradient update; nothing is frozen.

W ← W - eta * grad_W L, for every layer 0..L-1

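Memory cost with Adam, before activations: parameters + gradients + two moments ≈ 16 * N bytes for an N-parameter model in fp32 (4 + 4 + 8), or ~12 * N bytes with fp16 weights and gradients plus fp32 moments (2 + 2 + 8; an fp32 master copy of the weights, if kept, adds 4 more). For 7B params the 12-byte figure is ~84 GB before activations, far past a single consumer GPU.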
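
As a quick check of the arithmetic above, the per-parameter byte counts can be tallied in a few lines. This is a back-of-envelope sketch: the helper names are hypothetical, activations are ignored, and the mixed-precision case assumes fp16 weights and gradients with fp32 Adam moments and no fp32 master copy.

```python
def bytes_per_param(weight_bytes: int, grad_bytes: int, moment_bytes: int = 4) -> int:
    """Weights + gradients + two Adam moments (activations not counted)."""
    return weight_bytes + grad_bytes + 2 * moment_bytes


def full_ft_gb(n_params: float, per_param: int) -> float:
    """Static training memory in GB (1 GB = 1e9 bytes)."""
    return n_params * per_param / 1e9


fp32_bytes = bytes_per_param(4, 4)    # 4 + 4 + 8 = 16
mixed_bytes = bytes_per_param(2, 2)   # 2 + 2 + 8 = 12

print(f"fp32 : {full_ft_gb(7e9, fp32_bytes):.0f} GB")    # 112 GB
print(f"mixed: {full_ft_gb(7e9, mixed_bytes):.0f} GB")   # 84 GB
```

Real usage is higher once activations, temporary buffers, and allocator fragmentation are included, so treat these numbers as a floor.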

Linear probing — only the head

Freeze the backbone f_phi and train only a new classifier g_theta on top of its features:

y_hat = g_theta(f_phi(x)), grad_phi := 0

Trainable params = d_feat * K + K. This is the cheapest possible adaptation and is the right default when your task is close to pretraining (e.g. CIFAR-10 with a frozen CLIP image encoder).

Adapters — bottleneck modules

Houlsby et al. (2019) insert a small MLP after each transformer sublayer:

h_out = h + W_up * sigma(W_down * h), W_down in R^(r x d), W_up in R^(d x r)

Only W_down, W_up (and biases) are trainable, ~2 d r params per inserted block. Adapters add inference latency because they sit in the forward path serially.
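
A minimal sketch of such a bottleneck block in PyTorch follows. The class name `BottleneckAdapter`, the GELU non-linearity, and the zero initialisation of the up-projection (so the block starts as an identity map) are illustrative assumptions, not details taken from the text above.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """h_out = h + W_up * sigma(W_down * h), inserted after a frozen sublayer."""

    def __init__(self, d: int, r: int):
        super().__init__()
        self.down = nn.Linear(d, r)      # W_down: d -> r
        self.up = nn.Linear(r, d)        # W_up:   r -> d
        self.act = nn.GELU()             # assumption: any smooth non-linearity works
        # Assumption: zero-init the up-projection so the adapter starts as identity.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))


adapter = BottleneckAdapter(d=768, r=64)
h = torch.randn(2, 10, 768)                       # (batch, tokens, d)
n_params = sum(p.numel() for p in adapter.parameters())
print(n_params)                                   # 2*d*r weights + (r + d) biases
```

Because the block sits in series with the sublayer it follows, it cannot be folded into an existing weight matrix the way a LoRA update can.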

LoRA — low-rank update

Hu et al. (2021). For each target linear layer with frozen weight W_0 in R^(d x k), parameterise the update as a rank-r product:

W_new = W_0 + (alpha / r) * B * A, B in R^(d x r), A in R^(r x k), r << min(d, k)

A is initialised from a small Gaussian, B from zero, so the initial update is exactly zero and training starts from the pretrained model. Trainable params per layer = r * (d + k), versus d * k for full FT: a ratio of r * (d + k) / (d * k), which is 2r / d for a square layer. At inference you can fold (alpha/r) B A back into W_0 for zero added latency.

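Typical values for a 7B LLM: target q_proj and v_proj (sometimes all linears), r = 8 .. 64, alpha = 16 .. 32, dropout 0.05. Result: roughly 4-34M trainable params for q_proj and v_proj alone (assuming 32 blocks of 4096 x 4096 projections), more if you target every linear, instead of 7B.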
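
To see where totals of this size come from, the per-layer count r * (d + k) can be summed over a model. The sketch below assumes a hypothetical 7B-like shape (32 blocks, square 4096 x 4096 projections, two target linears per block); the function names are illustrative.

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d x k linear: B is d x r, A is r x k."""
    return r * (d + k)


def lora_total(n_blocks: int, d_model: int, r: int, targets_per_block: int = 2) -> int:
    """Total over all blocks, assuming every target is a square d_model x d_model linear."""
    return n_blocks * targets_per_block * lora_params(d_model, d_model, r)


for r in (8, 16, 64):
    total = lora_total(n_blocks=32, d_model=4096, r=r)
    print(f"r={r:>2}: {total:,} trainable params")
# r= 8: 4,194,304 trainable params
# r=16: 8,388,608 trainable params
# r=64: 33,554,432 trainable params
```

Targeting more linears (k_proj, o_proj, the MLP projections) scales the total roughly in proportion to the number and size of the wrapped layers.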

QLoRA — quantised base, LoRA on top

Dettmers et al. (2023). Store W_0 in 4-bit NF4 format (a quantile-quantisation grid optimised for Gaussian weights), de-quantise on the fly for the forward pass, but keep A, B in fp16/bf16:

W_new(x) = dequant_NF4(W_0_q) * x + (alpha / r) * B * (A * x)

Memory: 4 bits per base param + 16-bit LoRA weights + Adam state on the LoRA weights only. The 4-bit weights of a 7B model are ~3.5 GB on paper and around 5 GB in practice once quantisation constants, unquantised layers, and runtime overhead are counted; with LoRA on top the total training footprint can stay under 16 GB at modest batch size and sequence length. Double quantisation further compresses the per-block scale factors. Paged optimisers spill Adam state to CPU memory when GPU memory spikes, so a single L4 / T4 can fine-tune 7B.
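
The store-small, de-quantise-on-the-fly idea can be illustrated with a simplified block-wise quantiser. Note the assumptions: this sketch uses a uniform 16-level grid rather than the real NF4 quantile grid, keeps each 4-bit index in a uint8 instead of packing two per byte, and uses hypothetical function names. It is not the bitsandbytes implementation.

```python
import torch


def quantize_blockwise(w: torch.Tensor, block: int = 64):
    """Absmax-scale each block to [-1, 1], then snap to the nearest of 16 levels."""
    levels = torch.linspace(-1.0, 1.0, 16)                 # 16 levels = 4 bits (uniform, NOT NF4)
    flat = w.reshape(-1, block)                            # assumes w.numel() % block == 0
    scale = flat.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    normed = flat / scale
    idx = (normed.unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return idx.to(torch.uint8), scale, levels


def dequantize_blockwise(idx, scale, levels, shape):
    """Look up each level and undo the per-block scaling."""
    return (levels[idx.long()] * scale).reshape(shape)


torch.manual_seed(0)
w = torch.randn(128, 64)
idx, scale, levels = quantize_blockwise(w)
w_hat = dequantize_blockwise(idx, scale, levels, w.shape)
print("max abs error:", (w - w_hat).abs().max().item())
```

With a uniform grid the per-element error is bounded by scale / 15 within each block; NF4 instead places its levels at quantiles of a normal distribution, which spends more levels where Gaussian-like weights are dense.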

Prompt and prefix tuning

Instead of changing weights at all, learn a sequence of soft tokens prepended to the input embeddings:

x' = [p_1, p_2, ..., p_m, e(t_1), e(t_2), ...], p_i in R^d trainable

Trainable params = m * d for a single prompt; for prefix tuning, the prefix is injected into the key/value caches of every transformer layer (~m * 2 * L * d params). Cheapest of all PEFT methods but typically underperforms LoRA below ~10B base model size.
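
Prompt tuning at the input layer can be sketched as a thin wrapper around a frozen embedding table. The class name `SoftPrompt` and the 0.02 initialisation scale are assumptions for illustration; prefix tuning, which injects vectors at every layer, is not shown.

```python
import torch
import torch.nn as nn


class SoftPrompt(nn.Module):
    """Prepends m trainable vectors to the (frozen) token embeddings."""

    def __init__(self, embed: nn.Embedding, m: int):
        super().__init__()
        self.embed = embed
        for p in self.embed.parameters():
            p.requires_grad = False                        # base embeddings stay frozen
        # p_1 .. p_m in R^d; the init scale is an arbitrary small value
        self.prompt = nn.Parameter(torch.randn(m, embed.embedding_dim) * 0.02)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        tok = self.embed(input_ids)                                   # (B, T, d)
        prefix = self.prompt.unsqueeze(0).expand(tok.size(0), -1, -1) # (B, m, d)
        return torch.cat([prefix, tok], dim=1)                        # (B, m + T, d)


soft = SoftPrompt(nn.Embedding(1000, 64), m=20)
out = soft(torch.randint(0, 1000, (2, 5)))
n_trainable = sum(p.numel() for p in soft.parameters() if p.requires_grad)
print(out.shape, n_trainable)    # torch.Size([2, 25, 64]) 1280
```

In a real model the attention mask and any position ids would also need to be extended by m positions to cover the prepended vectors.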

SFT — supervised instruction tuning

Given an instruction x and a desired response y, train next-token cross-entropy on the response tokens only:

L_SFT = - sum_{t in response_tokens} log p_theta(y_t | x, y_{<t})

The instruction / system prompt tokens are masked out of the loss (loss weight 0) — you do not want the model to learn to generate the user's prompt, only to generate good responses to it.
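
The masked loss can be written directly with `F.cross_entropy` and `ignore_index`. This is a sketch under stated assumptions: `sft_loss` and `response_mask` are hypothetical names, the logits are random stand-ins for a model's output, and `reduction="sum"` is chosen to match the formula above (training code more often averages over response tokens).

```python
import torch
import torch.nn.functional as F


def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor,
             response_mask: torch.Tensor) -> torch.Tensor:
    """logits (B, T, V); input_ids (B, T); response_mask (B, T) bool, True on response tokens."""
    labels = input_ids.clone()
    labels[~response_mask] = -100            # prompt tokens contribute no loss
    shift_logits = logits[:, :-1, :]         # position t predicts token t + 1
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
        reduction="sum",
    )


torch.manual_seed(0)
logits = torch.randn(1, 6, 11)               # stand-in for model output, vocab size 11
input_ids = torch.randint(0, 11, (1, 6))
response_mask = torch.tensor([[False, False, False, True, True, True]])
loss = sft_loss(logits, input_ids, response_mask)
print(loss.item())
```

Only the three response positions enter the sum, so changing the targets at prompt positions leaves the loss unchanged.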

DPO — direct preference optimisation

Rafailov et al. (2023). Given a pair (x, y_w, y_l) where y_w is the preferred response and y_l the rejected one, and a frozen reference policy pi_ref (usually the SFT model):

L_DPO = - log sigma( beta * log(pi_theta(y_w|x) / pi_ref(y_w|x)) - beta * log(pi_theta(y_l|x) / pi_ref(y_l|x)) )

This is a binary cross-entropy that pushes pi_theta to assign more relative log-probability to y_w than to y_l (measured against the reference). It optimises the same KL-regularised reward objective as RLHF under the Bradley-Terry preference model, but with no separate reward model and no PPO rollouts. beta (~0.1) controls how far pi_theta is allowed to drift from pi_ref: a smaller beta permits more drift.
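
Given summed sequence log-probabilities from the policy and the frozen reference, the loss is a few lines of PyTorch. The function name `dpo_loss` and the toy log-probability values are assumptions for illustration; computing the log-probabilities from a real model is left out.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_w: torch.Tensor, policy_l: torch.Tensor,
             ref_w: torch.Tensor, ref_l: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Each input has shape (B,): log pi(y|x) summed over response tokens."""
    # beta * [log(pi_theta/pi_ref)(y_w) - log(pi_theta/pi_ref)(y_l)]
    margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    return -F.logsigmoid(margin).mean()


# Toy numbers: two preference pairs.
ref_w = torch.tensor([-12.0, -9.5])
ref_l = torch.tensor([-11.0, -10.0])

loss_at_ref = dpo_loss(ref_w, ref_l, ref_w, ref_l)               # policy == reference
loss_better = dpo_loss(ref_w + 2.0, ref_l - 2.0, ref_w, ref_l)   # policy favours y_w more
print(loss_at_ref.item(), loss_better.item())
```

When the policy equals the reference the margin is zero and the loss is log 2; raising the preferred response's log-ratio relative to the rejected one lowers it.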

3. PyTorch reference implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models


# ---------- 1. Linear probing: freeze ResNet, train head ----------

def build_linear_probe(num_classes: int) -> nn.Module:
    """Frozen ResNet-50 backbone + new classifier head."""
    net = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    for p in net.parameters():
        p.requires_grad = False                     # freeze every backbone weight
    in_feat = net.fc.in_features                    # 2048 for resnet50
    net.fc = nn.Linear(in_feat, num_classes)        # head is fresh + trainable
    return net


def trainable_count(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# ---------- 2. Hand-rolled LoRA layer wrapping nn.Linear ----------

class LoRALinear(nn.Module):
    """y = W_0 x + (alpha / r) * B (A x); W_0 frozen, A/B trainable."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16, dropout: float = 0.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze pretrained linear
        d_out, d_in = base.weight.shape
        self.r = r
        self.scale = alpha / r
        # A: r x d_in (small Gaussian)   B: d_out x r (zeros) — initial update = 0
        self.A = nn.Parameter(torch.randn(r, d_in) * (1.0 / r))
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.drop = nn.Dropout(dropout) if dropout > 0 else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path + low-rank update
        base_out = self.base(x)
        lora_out = F.linear(self.drop(x), self.A)   # (..., r)
        lora_out = F.linear(lora_out, self.B)       # (..., d_out)
        return base_out + self.scale * lora_out

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        """Fold (alpha/r) B A back into W_0 for zero-latency inference."""
        # Build the merged layer on the same device / dtype as the frozen base.
        merged = nn.Linear(self.base.in_features, self.base.out_features,
                           bias=self.base.bias is not None,
                           device=self.base.weight.device, dtype=self.base.weight.dtype)
        merged.weight.copy_(self.base.weight + self.scale * self.B @ self.A)
        if self.base.bias is not None:
            merged.bias.copy_(self.base.bias)
        return merged


def inject_lora(module: nn.Module, target_names=("q_proj", "v_proj"),
                r: int = 8, alpha: int = 16) -> nn.Module:
    """Walk a transformer; replace named nn.Linear children with LoRALinear."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name in target_names:
            setattr(module, name, LoRALinear(child, r=r, alpha=alpha))
        else:
            inject_lora(child, target_names, r, alpha)
    return module


# ---------- 3. SFT loop sketch with Hugging Face Trainer + PEFT ----------
# Sketch — assumes `transformers`, `peft`, `datasets`, `bitsandbytes` installed.
"""
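from transformers import (
    AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig,
    TrainingArguments, Trainer, DataCollatorForSeq2Seq,
)
# DataCollatorForSeq2Seq pads `labels` with -100 and keeps the prompt mask built below;
# DataCollatorForLanguageModeling would rebuild labels from input_ids and discard it.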
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL = "meta-llama/Llama-3.2-1B"

# Load base in 4-bit NF4 (QLoRA setup)
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                          bnb_4bit_compute_dtype=torch.bfloat16,
                          bnb_4bit_use_double_quant=True)
tok = AutoTokenizer.from_pretrained(MODEL)
base = AutoModelForCausalLM.from_pretrained(MODEL, quantization_config=bnb, device_map="auto")
base = prepare_model_for_kbit_training(base)

# Attach LoRA adapters
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # e.g. 0.4% trainable

# Format prompts; mask loss on the instruction tokens, keep loss on the response
def format_example(ex):
    prompt = f"### Instruction:\n{ex['instruction']}\n\n### Response:\n"
    full = prompt + ex["response"] + tok.eos_token
    ids = tok(full, truncation=True, max_length=1024)["input_ids"]
    labels = list(ids)
    # Tokenise the prompt the same way as `full` so any BOS token the tokenizer
    # prepends is counted and the mask boundary lines up with `ids`.
    # Assumes the tokenizer adds only leading special tokens (as Llama's does).
    n_prompt = len(tok(prompt)["input_ids"])
    for i in range(min(n_prompt, len(labels))):
        labels[i] = -100                  # ignore_index = no loss on prompt
    return {"input_ids": ids, "labels": labels}

# trainer = Trainer(model=model, args=TrainingArguments(...), train_dataset=..., ...)
# trainer.train(); model.save_pretrained("./lora-adapter")
"""


if __name__ == "__main__":
    probe = build_linear_probe(num_classes=10)
    total = sum(p.numel() for p in probe.parameters())
    train = trainable_count(probe)
    print(f"linear-probe: {train:,} / {total:,} trainable ({100*train/total:.3f}%)")

    # LoRA-wrap a 1024-dim linear: 1024*1024 = 1.05M params -> r=8 gives 16384.
    lin = nn.Linear(1024, 1024)
    wrapped = LoRALinear(lin, r=8, alpha=16)
    print("base params (frozen):", sum(p.numel() for p in wrapped.base.parameters()))
    print("lora params (train):", sum(p.numel() for p in [wrapped.A, wrapped.B]))
    # B starts at zero, so the update is zero and the merge check would pass trivially;
    # give B small random values first to make the comparison meaningful.
    nn.init.normal_(wrapped.B, std=0.02)
    merged = wrapped.merge()                          # fold for inference
    x = torch.randn(2, 1024)
    assert torch.allclose(wrapped(x), merged(x), atol=1e-4)
    print("merged forward matches lora forward.")

The peft snippet is shown as a docstring so the file is importable even without the heavy dependencies installed; copy it into a notebook cell when you actually run QLoRA. LoraConfig and get_peft_model are the real public API from Hugging Face's peft library.

4. Common USAAIO / IOAI applications

5. Drills

D1 · LoRA params vs full FT

A transformer linear is 4096 x 4096. You add LoRA with r = 16. How many trainable params does LoRA introduce, and what fraction of the original linear is that?

Solution

LoRA params = r*(d + k) = 16*(4096 + 4096) = 131,072. Full linear = 4096*4096 = 16,777,216. Ratio ≈ 0.78%. Multiply that across only q_proj and v_proj in every block and the total trainable footprint of a 7B model is well under 1%.

D2 · QLoRA memory savings

Full fine-tuning a 7B model in fp16 with Adam needs roughly how much GPU memory? Why does QLoRA fit on a 16 GB card?

Solution

Full fp16 FT: 2 bytes/param weights + 2 bytes grads + 8 bytes Adam (two fp32 moments) = 12 bytes/param * 7e9 ≈ 84 GB, before activations. QLoRA: base weights at 4 bits = 0.5 byte/param * 7e9 ≈ 3.5 GB, no grads / no Adam on the base (it is frozen and quantised), and LoRA adapters at <1% of params carry the only Adam state. Total well under 16 GB with paged optimisers and gradient checkpointing.
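
The same comparison as a back-of-envelope calculation. The 20M adapter size is a hypothetical figure chosen for illustration, the function name is made up, and the estimate covers only weights, gradients, and optimiser state: activations, quantisation constants, and framework overhead are left out.

```python
def qlora_static_gb(n_base: float, n_lora: float) -> float:
    """4-bit frozen base + 16-bit LoRA weights and grads + two fp32 Adam moments."""
    base_bytes = n_base * 0.5               # 4 bits = 0.5 byte per base param
    lora_bytes = n_lora * (2 + 2 + 8)       # weights + grads + Adam, adapters only
    return (base_bytes + lora_bytes) / 1e9


full_ft = 7e9 * 12 / 1e9                    # 12 bytes/param, as in the solution above
qlora = qlora_static_gb(n_base=7e9, n_lora=20e6)   # 20M adapter params: an assumed size

print(f"full FT: {full_ft:.0f} GB")         # 84 GB
print(f"QLoRA  : {qlora:.2f} GB")           # 3.74 GB
```

The gap between this static figure and the 16 GB card is what activations, gradient-checkpointing buffers, and runtime overhead have to fit into.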

D3 · Why DPO needs no reward model

RLHF trains a reward model r_phi(x, y) then optimises a policy against it with PPO. DPO skips both steps. Why is that valid?

Solution

The KL-constrained reward-maximisation objective used in RLHF has a closed-form optimum: pi*(y|x) = pi_ref(y|x) * exp(r(x,y) / beta) / Z(x). Inverting that gives r(x,y) = beta * log(pi*(y|x) / pi_ref(y|x)) + beta * log Z(x). Plugging this expression for r into the Bradley-Terry likelihood sigma(r(x,y_w) - r(x,y_l)), the beta * log Z(x) term cancels between the preferred and rejected response because it depends only on x, leaving a pure classification loss on policy log-ratios. No explicit reward model is ever materialised.

D4 · When prompt tuning beats LoRA

Lester et al. find prompt tuning is competitive with full fine-tuning only for very large base models. Why?

Solution

Prompt tuning has very few parameters (a handful of soft-token embeddings, ~m * d) and only acts at the input. Its expressiveness is dominated by the base model's in-context-learning capacity, which grows sharply with scale. Below ~10B params the base does not have enough ICL capacity to be steered by a short prefix, and LoRA — which can adjust every attention projection — wins. Above ~10B params prompt tuning closes the gap and is cheaper. For competition-scale (1-7B) bases, prefer LoRA.

D5 · Catastrophic forgetting

You full-fine-tune Llama-3-8B on 500 medical Q&A pairs. The model now answers medical questions but has lost its ability to do basic arithmetic. Diagnose and fix.

Solution

Classic catastrophic forgetting: with 500 examples and all 8B params trainable, the optimiser overfits the narrow distribution and overwrites capacity that encoded everything else. Fixes (in order of effort): (i) switch to LoRA so the base weights stay frozen and the new behaviour lives in a side path you can disable; (ii) lower learning rate by 5-10x; (iii) mix in a general-purpose SFT dataset (e.g. Tulu, OpenHermes) so each batch is partly "remember how to be a general assistant"; (iv) early stop on a held-out general-capability eval, not just medical loss.

Next step

Pair this with Transformers to see what you are inserting LoRA into (which linears matter — q_proj, v_proj, sometimes the MLPs). Then jump to Notebooks for runnable QLoRA + SFT recipes on Colab, and back to Deep Learning for the optimiser / mixed-precision plumbing that makes the memory math above actually hold.