Fine-tuning — adapting pretrained models on a budget
A pretrained ResNet, CLIP, or LLM already encodes most of what your task needs. The competition skill is to move it to your distribution cheaply — without melting a Colab L4, without forgetting the pretraining, and without paying for the full optimiser state of a 7B-parameter model.
LoRA trains only a low-rank update (Delta W = BA added to each linear).
QLoRA stacks 4-bit base weights underneath, bringing 7B-class
LLMs onto a single 16 GB GPU. For instruction following you do SFT (next-token
CE on instruction/response pairs); for preference alignment you do DPO (a closed-form
swap for RLHF that needs no reward model). On IOAI/USAAIO Round-2 Colab budgets,
LoRA on a frozen backbone is almost always the right call.
1. The intuition
A pretrained model is a very, very good initialisation. ImageNet-trained ResNets already see edges, textures, parts; a pretrained CLIP already aligns images and captions; a pretrained DistilBERT already speaks English. Your task — chickens vs. ducks, sentiment, math word problems — usually lives in a small neighbourhood of that pretrained weight. You do not need to re-learn what an edge is. You need to nudge the weights a little.
Geometrically, the claim behind PEFT is that the difference W - W_pretrained
needed for your task is low-rank: a few directions in weight space
do all the work. If that is true, you can parameterise the update as a product
of two thin matrices of rank r << d and skip about 99% of the trainable parameters.
The pretrained weights stay frozen, so the base model itself is never overwritten, and
you only ever train ~0.1 - 1% of the original parameter count.
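To make the low-rank claim concrete, here is a minimal sketch. It assumes a synthetic update that is exactly rank 4 by construction, which is far cleaner than any real fine-tuning delta. The sizes and the names `delta_w` and `rank_r_approx` are illustrative only.

```python
import torch

torch.manual_seed(0)
d, k, true_rank = 256, 256, 4

# Hypothetical task update: rank 4 by construction (an assumption for illustration).
delta_w = torch.randn(d, true_rank) @ torch.randn(true_rank, k)

# Truncated SVD gives the best rank-r approximation in Frobenius norm.
U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)


def rank_r_approx(r: int) -> torch.Tensor:
    """Rebuild delta_w from its top-r singular directions."""
    return (U[:, :r] * S[:r]) @ Vh[:r, :]


errors = {}
for r in (1, 2, 4, 8):
    rel = torch.linalg.norm(delta_w - rank_r_approx(r)) / torch.linalg.norm(delta_w)
    errors[r] = rel.item()
    print(f"r={r}: relative error {errors[r]:.6f}, "
          f"params {r * (d + k):,} vs dense {d * k:,}")
```

Once r reaches the true rank the error falls to numerical noise, while the factorised form stores r * (d + k) numbers instead of d * k.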
The other axis is what loss to optimise. For classification on a frozen backbone, plain cross-entropy on a new head is enough — that is linear probing. For chat-style instruction following, you do supervised fine-tuning (SFT): standard next-token cross-entropy, but only on the response tokens. For aligning outputs to human preferences without a reward model, you use DPO, which rewrites the RLHF objective into a closed-form classification loss over preferred / dispreferred response pairs. All of these compose with LoRA underneath.
2. The math
Full fine-tuning
Every parameter receives a gradient update; nothing is frozen.
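Memory cost with Adam: parameters + gradients + two moment estimates
≈ 16 * N bytes for an N-parameter model in fp32 (4 + 4 + 8 bytes per parameter),
before activations. With fp16/bf16 weights and gradients plus fp32 Adam moments it is
~12 * N bytes (2 + 2 + 8), or ~16 * N if an fp32 master copy of the weights is also kept.
For 7B params the 12-byte case is ~84 GB before activations, far past a single consumer GPU.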
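The arithmetic above is easy to check with a few lines of Python. This is a back-of-the-envelope sketch, not a profiler. The helper names are made up for illustration, and the optional fp32 master copy is an extra assumption that some mixed-precision setups make.

```python
def train_bytes_per_param(weight_bytes: int, grad_bytes: int,
                          moment_bytes: int = 4, master_copy: bool = False) -> int:
    """Bytes per parameter for Adam training, excluding activations."""
    total = weight_bytes + grad_bytes + 2 * moment_bytes  # two Adam moments
    if master_copy:
        total += 4  # optional fp32 master weights (assumption, setup-dependent)
    return total


def train_gb(n_params: int, bytes_per_param: int) -> float:
    """Total optimiser-side memory in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


N = 7_000_000_000
fp32 = train_bytes_per_param(4, 4)
mixed = train_bytes_per_param(2, 2)
mixed_master = train_bytes_per_param(2, 2, master_copy=True)

for name, b in [("fp32", fp32), ("mixed", mixed), ("mixed + master", mixed_master)]:
    print(f"{name}: {b} bytes/param -> {train_gb(N, b):.0f} GB for 7B")
```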
Linear probing — only the head
Freeze the backbone f_phi and train only a new classifier
g_theta on top of its features:
y_hat = g_theta(f_phi(x)) = softmax(W f_phi(x) + b), with W in R^(K x d_feat), b in R^K
Trainable params = d_feat * K + K. This is the cheapest possible
adaptation and is the right default when your task is close to pretraining
(e.g. CIFAR-10 with a frozen CLIP image encoder).
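Because the backbone is frozen, its features can be computed once and cached, and the probe reduces to logistic regression on those features. The sketch below uses synthetic Gaussian clusters as a stand-in for real backbone features; the sizes, learning rate and step count are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n, d_feat, K = 512, 64, 3

# Stand-in for cached frozen-backbone features f_phi(x): synthetic clusters.
centers = torch.randn(K, d_feat) * 3
labels = torch.randint(0, K, (n,))
feats = centers[labels] + torch.randn(n, d_feat)

# The only trainable module: a linear head with d_feat * K + K parameters.
head = nn.Linear(d_feat, K)
opt = torch.optim.Adam(head.parameters(), lr=1e-2)

losses = []
for step in range(200):
    loss = F.cross_entropy(head(feats), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

acc = (head(feats).argmax(dim=1) == labels).float().mean().item()
n_trainable = sum(p.numel() for p in head.parameters())
print(f"trainable params: {n_trainable}, train accuracy: {acc:.3f}")
```

With a real backbone you would fill `feats` by running the frozen encoder once under `torch.no_grad()`.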
Adapters — bottleneck modules
Houlsby et al. (2019) insert a small MLP after each transformer sublayer:
h <- h + W_up * sigma(W_down * h), with W_down in R^(r x d), W_up in R^(d x r), r << d
Only W_down, W_up (and biases) are trainable, ~2 d r
params per inserted block. Adapters add inference latency because they sit in the
forward path serially.
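A bottleneck adapter can be sketched in a few lines of PyTorch. The class name, the GELU nonlinearity and the zero initialisation of the up-projection are assumptions for illustration; the paper's exact initialisation and placement differ in detail.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """h <- h + W_up * act(W_down * h), a residual bottleneck of width r."""

    def __init__(self, d_model: int, r: int):
        super().__init__()
        self.down = nn.Linear(d_model, r)
        self.up = nn.Linear(r, d_model)
        self.act = nn.GELU()
        # Assumption: zero-init the up-projection so the adapter starts as identity.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))


adapter = BottleneckAdapter(d_model=768, r=64)
h = torch.randn(2, 5, 768)
out = adapter(h)
n_params = sum(p.numel() for p in adapter.parameters())
print(out.shape, n_params)  # weights contribute 2*d*r, biases add r + d
```

Unlike LoRA, this module cannot be folded into an existing weight matrix because of the nonlinearity, which is where the extra inference latency comes from.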
LoRA — low-rank update
Hu et al. (2021). For each target linear layer with frozen weight
W_0 in R^(d x k), parameterise the update as a rank-r
product:
h = W_0 x + (alpha / r) * B A x, with B in R^(d x r), A in R^(r x k), r << min(d, k)
A is initialised from a small Gaussian, B from zero, so
the initial update is exactly zero and training starts from the pretrained model.
Trainable params per layer = r * (d + k), versus d * k for full FT:
a fraction r * (d + k) / (d * k), which is 2r / d when d = k. At inference you can fold
(alpha/r) B A back into W_0 for zero added latency.
Typical values for a 7B LLM: target q_proj and v_proj
(sometimes all linears), r = 8 .. 64, alpha = 16 .. 32,
dropout 0.05. Result: roughly 4-34M trainable params (q_proj and v_proj only,
on a 32-layer model with 4096 x 4096 projections) instead of 7B.
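The parameter count is simple enough to compute by hand. The sketch below assumes a hypothetical 7B-style layout of 32 blocks with 4096 x 4096 q_proj and v_proj matrices; real checkpoints can differ, for example when grouped-query attention shrinks v_proj.

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d x k linear layer."""
    return r * (d + k)


def lora_total(n_layers: int, target_shapes, r: int) -> int:
    """Sum over every targeted linear in every transformer block."""
    return n_layers * sum(lora_params(d, k, r) for d, k in target_shapes)


# Assumed layout: 32 blocks, q_proj and v_proj both 4096 x 4096.
targets = [(4096, 4096), (4096, 4096)]
totals = {r: lora_total(32, targets, r) for r in (8, 16, 64)}

for r, n in totals.items():
    print(f"r={r}: {n:,} trainable params ({100 * n / 7e9:.3f}% of 7B)")
```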
QLoRA — quantised base, LoRA on top
Dettmers et al. (2023). Store W_0 in 4-bit NF4 format (a
quantile-quantisation grid optimised for Gaussian weights), de-quantise on the
fly for the forward pass, but keep A, B in fp16/bf16:
y = dequant(W_0^NF4) x + (alpha / r) * B A x
Memory: 4 bits per base param + fp16 LoRA weights + Adam state on the LoRA weights only. The 4-bit weights of a 7B model take 0.5 byte/param ≈ 3.5 GB, or ~5 GB with overheads; with LoRA on top the total training footprint is well under 16 GB. Double quantisation further compresses the per-block scale factors. Paged optimisers spill Adam state to CPU memory when activations spike, so a single L4 / T4 can fine-tune 7B.
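To see what a quantile grid with per-block scales means, here is a toy block-wise 4-bit quantiser. It is a sketch of the idea only and does not reproduce the NF4 table used by bitsandbytes: the 16 levels are plain normal quantiles rescaled to [-1, 1], and the function names and block size of 64 are assumptions.

```python
import torch


def make_levels(n_levels: int = 16) -> torch.Tensor:
    """16 code values placed at evenly spaced quantiles of N(0, 1), scaled to [-1, 1]."""
    probs = torch.linspace(0.5 / n_levels, 1 - 0.5 / n_levels, n_levels)
    levels = torch.distributions.Normal(0.0, 1.0).icdf(probs)
    return levels / levels.abs().max()


def quantise(w: torch.Tensor, levels: torch.Tensor, block: int = 64):
    """Per-block absmax scaling, then snap each value to the nearest level index."""
    flat = w.reshape(-1, block)
    scale = flat.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    idx = ((flat / scale).unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return idx.to(torch.uint8), scale  # 4 bits of information per weight


def dequantise(idx, scale, levels, shape):
    """Look up the level for each index and undo the per-block scaling."""
    return (levels[idx.long()] * scale).reshape(shape)


torch.manual_seed(0)
w = torch.randn(128, 64) * 0.02  # Gaussian-ish weights, as the grid assumes
levels = make_levels()
idx, scale = quantise(w, levels)
w_hat = dequantise(idx, scale, levels, w.shape)
print("max abs error:", (w - w_hat).abs().max().item())
```

In QLoRA the de-quantised weights are used only for the forward and backward matmuls; gradients flow to A and B, never to the 4-bit codes.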
Prompt and prefix tuning
Instead of changing weights at all, learn a sequence of soft tokens prepended to the input embeddings:
[p_1, ..., p_m, e(x_1), ..., e(x_n)], with each p_i in R^d trainable and every model weight frozen
Trainable params = m * d for a single prompt; for prefix tuning,
the prefix is injected into the key/value caches of every transformer layer
(~m * 2 * L * d params). Cheapest of all PEFT methods but typically
underperforms LoRA below ~10B base model size.
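Prompt tuning in its simplest form is a single trainable tensor concatenated in front of the token embeddings. The module below is a minimal sketch: the name `SoftPrompt` and the 0.02 initialisation scale are assumptions, and a real model would also need its attention mask extended by m positions.

```python
import torch
import torch.nn as nn


class SoftPrompt(nn.Module):
    """Holds m trainable prompt vectors of width d and prepends them to a batch."""

    def __init__(self, m: int, d: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(m, d) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, n, d) -> (batch, m + n, d)
        batch = token_embeds.shape[0]
        prefix = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, token_embeds], dim=1)


m, d = 20, 512
soft = SoftPrompt(m, d)
embeds = torch.randn(2, 7, d)  # stand-in for frozen embedding-layer output
extended = soft(embeds)
n_trainable = sum(p.numel() for p in soft.parameters())
print(extended.shape, n_trainable)
```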
SFT — supervised instruction tuning
Given an instruction x and a desired response y, train
next-token cross-entropy on the response tokens only:
L_SFT(theta) = - sum_{t=1..|y|} log p_theta(y_t | x, y_<t)
The instruction / system prompt tokens are masked out of the loss (loss weight 0) — you do not want the model to learn to generate the user's prompt, only to generate good responses to it.
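The masking described above is usually implemented by setting prompt labels to -100, the default ignore_index of PyTorch's cross-entropy. The function below is a minimal sketch with a hypothetical signature; it assumes each sequence is laid out as prompt followed by response and that `prompt_len` gives the prompt length per row.

```python
import torch
import torch.nn.functional as F


def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor,
             prompt_len: torch.Tensor) -> torch.Tensor:
    """Next-token CE over response tokens only.

    logits: (B, T, V), input_ids: (B, T), prompt_len: (B,)
    """
    labels = input_ids.clone()
    positions = torch.arange(input_ids.shape[1], device=input_ids.device).unsqueeze(0)
    labels[positions < prompt_len.unsqueeze(1)] = -100  # no loss on prompt tokens

    # Position t predicts token t + 1, so shift logits and labels by one.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.shape[-1]),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )


torch.manual_seed(0)
B, T, V = 1, 6, 11
logits = torch.randn(B, T, V)
input_ids = torch.randint(0, V, (B, T))
loss = sft_loss(logits, input_ids, prompt_len=torch.tensor([3]))
print(loss.item())
```

With a prompt of length 3 and T = 6, only the predictions for tokens 3, 4 and 5 contribute, and those come from the logits at positions 2, 3 and 4.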
DPO — direct preference optimisation
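Rafailov et al. (2023). Given a pair (x, y_w, y_l) where y_w
is the preferred response and y_l the rejected one, and a frozen
reference policy pi_ref (usually the SFT model):
L_DPO(theta) = - E_(x, y_w, y_l) [ log sigma( beta * ( log(pi_theta(y_w|x) / pi_ref(y_w|x)) - log(pi_theta(y_l|x) / pi_ref(y_l|x)) ) ) ]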
This is a binary cross-entropy that pushes pi_theta to assign more
relative log-probability to y_w than to y_l (measured
against the reference). It optimises the same KL-regularised objective as RLHF under the
Bradley-Terry preference model, but with no separate reward model and no PPO
rollouts. beta (~0.1) controls how far pi_theta is
allowed to drift from pi_ref: a smaller beta permits more drift.
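Given summed sequence log-probabilities from the policy and the frozen reference, the DPO loss is a one-liner. The function below is a minimal sketch with hypothetical argument names; computing the per-sequence log-probabilities (summing token log-probs over the response only) is assumed to happen elsewhere.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_logp_w: torch.Tensor, policy_logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected)), batch mean."""
    chosen_ratio = policy_logp_w - ref_logp_w      # log pi_theta / pi_ref on y_w
    rejected_ratio = policy_logp_l - ref_logp_l    # log pi_theta / pi_ref on y_l
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()


# Toy batch of sequence log-probs (made-up numbers for illustration).
ref_w = torch.tensor([-12.0, -15.0])
ref_l = torch.tensor([-11.0, -14.0])

at_reference = dpo_loss(ref_w, ref_l, ref_w, ref_l)            # policy == reference
improved = dpo_loss(ref_w + 2.0, ref_l - 2.0, ref_w, ref_l)    # policy prefers y_w more
print(at_reference.item(), improved.item())
```

When the policy equals the reference both log-ratios are zero, so the loss starts at log 2 and falls as the policy separates the pair.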
3. PyTorch reference implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models


# ---------- 1. Linear probing: freeze ResNet, train head ----------
def build_linear_probe(num_classes: int) -> nn.Module:
    """Frozen ResNet-50 backbone + new classifier head."""
    net = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    for p in net.parameters():
        p.requires_grad = False  # freeze every backbone weight
    in_feat = net.fc.in_features  # 2048 for resnet50
    net.fc = nn.Linear(in_feat, num_classes)  # head is fresh + trainable
    return net


def trainable_count(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# ---------- 2. Hand-rolled LoRA layer wrapping nn.Linear ----------
class LoRALinear(nn.Module):
    """y = W_0 x + (alpha / r) * B (A x); W_0 frozen, A/B trainable."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16, dropout: float = 0.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze pretrained linear
        d_out, d_in = base.weight.shape
        self.r = r
        self.scale = alpha / r
        # A: r x d_in (small Gaussian), B: d_out x r (zeros), so the initial update is 0
        self.A = nn.Parameter(torch.randn(r, d_in) * (1.0 / r))
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.drop = nn.Dropout(dropout) if dropout > 0 else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path + low-rank update
        base_out = self.base(x)
        lora_out = F.linear(self.drop(x), self.A)  # (..., r)
        lora_out = F.linear(lora_out, self.B)      # (..., d_out)
        return base_out + self.scale * lora_out

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        """Fold (alpha/r) B A back into W_0 for zero-latency inference."""
        merged = nn.Linear(self.base.in_features, self.base.out_features,
                           bias=self.base.bias is not None)
        merged.weight.copy_(self.base.weight + self.scale * self.B @ self.A)
        if self.base.bias is not None:
            merged.bias.copy_(self.base.bias)
        return merged


def inject_lora(module: nn.Module, target_names=("q_proj", "v_proj"),
                r: int = 8, alpha: int = 16) -> nn.Module:
    """Walk a transformer; replace named nn.Linear children with LoRALinear."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name in target_names:
            setattr(module, name, LoRALinear(child, r=r, alpha=alpha))
        else:
            inject_lora(child, target_names, r, alpha)
    return module
# ---------- 3. SFT loop sketch with Hugging Face Trainer + PEFT ----------
# Sketch: assumes `transformers`, `peft`, `datasets`, `bitsandbytes` installed.
"""
import torch
from transformers import (
    AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig,
    TrainingArguments, Trainer, DataCollatorForSeq2Seq,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL = "meta-llama/Llama-3.2-1B"

# Load base in 4-bit NF4 (QLoRA setup)
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16,
                         bnb_4bit_use_double_quant=True)
tok = AutoTokenizer.from_pretrained(MODEL)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # needed for padding if the tokenizer has no pad token
base = AutoModelForCausalLM.from_pretrained(MODEL, quantization_config=bnb, device_map="auto")
base = prepare_model_for_kbit_training(base)

# Attach LoRA adapters
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # e.g. 0.4% trainable

# Format prompts; mask loss on the instruction tokens, keep loss on the response
def format_example(ex):
    prompt = f"### Instruction:\n{ex['instruction']}\n\n### Response:\n"
    full = prompt + ex["response"] + tok.eos_token
    ids = tok(full, truncation=True, max_length=1024)["input_ids"]
    labels = list(ids)
    # Tokenise the prompt with the same special tokens (e.g. BOS) as `full`,
    # so the prompt length lines up with the start of `ids`.
    n_prompt = len(tok(prompt)["input_ids"])
    for i in range(min(n_prompt, len(labels))):
        labels[i] = -100  # ignore_index = no loss on prompt
    return {"input_ids": ids, "labels": labels}

# Pad labels with -100 so the prompt mask survives batching;
# DataCollatorForLanguageModeling would rebuild labels from input_ids and drop it.
# collator = DataCollatorForSeq2Seq(tok, label_pad_token_id=-100)
# trainer = Trainer(model=model, args=TrainingArguments(...), train_dataset=..., ...)
# trainer.train(); model.save_pretrained("./lora-adapter")
"""
if __name__ == "__main__":
    probe = build_linear_probe(num_classes=10)
    total = sum(p.numel() for p in probe.parameters())
    train = trainable_count(probe)
    print(f"linear-probe: {train:,} / {total:,} trainable ({100*train/total:.3f}%)")

    # LoRA-wrap a 1024-dim linear: 1024*1024 = 1.05M params -> r=8 gives 16384.
    lin = nn.Linear(1024, 1024)
    wrapped = LoRALinear(lin, r=8, alpha=16)
    print("base params (frozen):", sum(p.numel() for p in wrapped.base.parameters()))
    print("lora params (train):", sum(p.numel() for p in [wrapped.A, wrapped.B]))

    merged = wrapped.merge()  # fold for inference
    x = torch.randn(2, 1024)
    assert torch.allclose(wrapped(x), merged(x), atol=1e-5)
    print("merged forward matches lora forward.")
The peft snippet is wrapped in a module-level string literal so the file stays importable
even without the heavy dependencies installed; copy it into a notebook cell when
you actually run QLoRA. LoraConfig and get_peft_model are
the real public API from Hugging Face's peft library.
4. Common USAAIO / IOAI applications
- Round 2 Colab budget — you get an L4 (24 GB) for a few hours. Full fine-tuning anything past ~500M params is out; LoRA on a 1-3B base is realistic, QLoRA opens 7B.
- Fine-tune a CLIP head — for a few-shot classification task (chickens, satellites, plant disease), freeze the CLIP image encoder and train a single nn.Linear on top. Trains in seconds.
- Fine-tune DistilBERT for sentiment / NLI — full fine-tuning is fine (66M params, fits easily). Or LoRA on the attention projections if you want to ship five task adapters from one base.
- Instruction-tune a small LLM — QLoRA on Llama-3.2-1B or Qwen2.5-1.5B with a 1k-example SFT set is the canonical Round-2 LLM workflow.
- Preference tuning — DPO on top of an SFT checkpoint when you have a pairwise-preference dataset; cheaper and more stable than PPO/RLHF.
- Continual learning across tasks — train one LoRA adapter per task; swap them at inference time without ever touching the base weights.
5. Drills
D1 · LoRA params vs full FT
A transformer linear is 4096 x 4096. You add LoRA with
r = 16. How many trainable params does LoRA introduce, and what
fraction of the original linear is that?
Solution
LoRA params = r*(d + k) = 16*(4096 + 4096) = 131,072. Full
linear = 4096*4096 = 16,777,216. Ratio ≈ 0.78%.
Multiply that across only q_proj and v_proj in every
block and the total trainable footprint of a 7B model is well under 1%.
D2 · QLoRA memory savings
Full fine-tuning a 7B model in fp16 with Adam needs roughly how much GPU memory? Why does QLoRA fit on a 16 GB card?
Solution
Full fp16 FT: 2 bytes/param weights + 2 bytes grads + 8 bytes Adam (two fp32 moments) = 12 bytes/param * 7e9 ≈ 84 GB, before activations. QLoRA: base weights at 4 bits = 0.5 byte/param * 7e9 ≈ 3.5 GB, no grads / no Adam on the base (it is frozen and quantised), and LoRA adapters at <1% of params carry the only Adam state. Total well under 16 GB with paged optimisers and gradient checkpointing.
D3 · Why DPO needs no reward model
RLHF trains a reward model r_phi(x, y) then optimises a policy
against it with PPO. DPO skips both steps. Why is that valid?
Solution
For KL-regularised reward maximisation against a
reference policy, the optimal policy has a closed form:
pi*(y|x) = pi_ref(y|x) * exp(r(x,y) / beta) / Z(x). Inverting that gives
r(x,y) = beta * log(pi*(y|x) / pi_ref(y|x)) + beta * log Z(x). Plugging
this expression for r into the Bradley-Terry likelihood
sigma(r(x,y_w) - r(x,y_l)), the beta * log Z(x) term depends only on x, so it
cancels between the preferred and rejected response, leaving a pure
classification loss on policy log-ratios.
No explicit reward model is ever materialised.
D4 · When prompt tuning beats LoRA
Lester et al. find prompt tuning is competitive with full fine-tuning only for very large base models. Why?
Solution
Prompt tuning has very few parameters (a handful of soft-token embeddings,
~m * d) and only acts at the input. Its expressiveness is
dominated by the base model's in-context-learning capacity, which grows
sharply with scale. Below ~10B params the base does not have enough ICL
capacity to be steered by a short prefix, and LoRA — which can adjust every
attention projection — wins. Above ~10B params prompt tuning closes the gap
and is cheaper. For competition-scale (1-7B) bases, prefer LoRA.
D5 · Catastrophic forgetting
You full-fine-tune Llama-3-8B on 500 medical Q&A pairs. The model now answers medical questions but has lost its ability to do basic arithmetic. Diagnose and fix.
Solution
Classic catastrophic forgetting: with 500 examples and all 8B params trainable, the optimiser overfits the narrow distribution and overwrites capacity that encoded everything else. Fixes (in order of effort): (i) switch to LoRA so the base weights stay frozen and the new behaviour lives in a side path you can disable; (ii) lower learning rate by 5-10x; (iii) mix in a general-purpose SFT dataset (e.g. Tulu, OpenHermes) so each batch is partly "remember how to be a general assistant"; (iv) early stop on a held-out general-capability eval, not just medical loss.
Next step
Pair this with Transformers to see what
you are inserting LoRA into (which linears matter — q_proj,
v_proj, sometimes the MLPs). Then jump to
Notebooks for runnable QLoRA + SFT recipes on
Colab, and back to Deep Learning for the optimiser /
mixed-precision plumbing that makes the memory math above actually hold.