IOAI 2024 · CV · Make SDXL-mini swap zebras and giraffes — without touching the prompt

Contest: IOAI 2024 (Bulgaria) · Round: Scientific, at-home stage · Category: Computer Vision / model editing / diffusion internals.

Official sources: ioai-official.org/2024-tasks · IOAI-official/IOAI-2024 · at-home problems (zip).

1. Problem restatement

You are given a frozen SDXL-mini text-to-image diffusion checkpoint and a fixed list of evaluation prompts. The prompts mention animals — some say "a zebra eating grass", others say "a giraffe in the savanna". Your job is to edit the model weights so that, at inference time, the same prompts produce the swapped animal: "zebra" prompts must yield giraffes and "giraffe" prompts must yield zebras. Other categories (lions, elephants, backgrounds) must remain untouched.

Hard rules: you may not change the prompts, the tokenizer, the sampler, or the seeds. You may modify UNet/text-encoder weights in any way and ship the modified checkpoint. The grader runs a held-out prompt set and computes (a) a swap accuracy via a CLIP classifier ("did this image of a 'zebra' prompt show a giraffe?") and (b) a collateral-damage penalty on unrelated prompts.

The on-site sibling is the cow + hydrant composition task — see that walkthrough.

Source. Task summary paraphrased from the IOAI 2024 CV at-home description on open-cu/awesome-ioai-tasks. Exact CLIP-grader thresholds and prompt counts are not reproduced here; check them in the official notebook. [verify against source]

2. What's being tested

Conceptual model editing. "ROME / MEMIT" research on language models treats MLP layers as key-value memories: a key (a subject such as "zebra") is mapped to a value (what the model stores about it). The same idea carries over to diffusion UNets and text encoders, where the value is the visual features of a zebra. You must find the right layer and the right rank-1 update.
Locality. The graded penalty is heavy on collateral damage. A "swap" that also destroys "lion" images costs more than not swapping at all.
Open-weights tooling. diffusers, transformers, peft. You should be comfortable pulling out a single linear layer of the UNet's cross-attention.
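No-prompt-change rule. Rules out prompt rewriting outright. Swapping the token embeddings is still legal (it is the baseline in section 4), but it is brittle on held-out phrasings, which pushes you toward genuine internal edits.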

See Deep Learning for diffusion fundamentals and the UNet page for the architecture you'll be editing.

3. Data exploration / setup

The "data" is unusual — instead of a CSV, you get:

A frozen SDXL-mini checkpoint (~1 GB weights, distributed by the organisers).
A prompts.json with several hundred evaluation prompts split into "zebra prompts", "giraffe prompts", and "control prompts" (other animals + non-animal scenes).
A scoring script that calls a CLIP classifier and computes swap accuracy + collateral.
A small handful of reference images to sanity-check the CLIP grader's behaviour.

EDA you should do first:

from diffusers import StableDiffusionXLPipeline
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
    "path/to/sdxl-mini", torch_dtype=torch.float16).to("cuda")

# 1. Generate 3 zebra prompts and 3 giraffe prompts at fixed seed; eyeball.
# 2. Find the cross-attention layer index where "zebra" attention map peaks.
# 3. Run CLIP on baseline outputs and confirm the grader's accuracy on the
#    unmodified model is roughly 0 swap accuracy + 0 collateral (sanity).

Scoring: combined score = swap_accuracy − λ · collateral_damage, both in [0, 1], scaled to 0–100. [illustrative — verify weights in the official notebook]
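
A minimal sketch of the combined score as written above, useful for sanity-checking local numbers. The function name `combined_score` and the default `lam=1.0` are assumptions for illustration; the real weight is whatever the official notebook uses.

```python
def combined_score(swap_accuracy: float, collateral_damage: float, lam: float = 1.0) -> float:
    """Return 100 * (swap_accuracy - lam * collateral_damage).

    Both inputs are expected in [0, 1]. `lam` is a placeholder weight,
    not the official value.
    """
    for name, value in (("swap_accuracy", swap_accuracy),
                        ("collateral_damage", collateral_damage)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return 100.0 * (swap_accuracy - lam * collateral_damage)


# A perfect swap that damages 30% of the controls, versus an untouched model.
print(combined_score(1.0, 0.3))
print(combined_score(0.0, 0.0))
```

Note that with a large enough `lam` the score can go negative, which matches the remark that a destructive swap can cost more than not swapping at all.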

4. Baseline approach

The simplest legal edit: in the text encoder, swap the embedding rows for the tokens "zebra" and "giraffe". One line of code, no training. This works for prompts where the animal token is the only "animal-meaningful" token — a surprisingly good baseline.

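import torch
from transformers import CLIPTokenizer

tok = CLIPTokenizer.from_pretrained("path/to/sdxl-mini", subfolder="tokenizer")
enc = pipe.text_encoder  # `pipe` is the pipeline loaded in the EDA snippet above

# CLIP's BPE vocabulary marks word-final tokens with "</w>".
zebra_id   = tok.convert_tokens_to_ids("zebra</w>")
giraffe_id = tok.convert_tokens_to_ids("giraffe</w>")
# Unknown tokens map to unk_token_id; make sure both words are real single tokens.
assert tok.unk_token_id not in (zebra_id, giraffe_id)

with torch.no_grad():
    e = enc.get_input_embeddings().weight
    # Both right-hand sides are cloned before either row is overwritten.
    e[zebra_id], e[giraffe_id] = e[giraffe_id].clone(), e[zebra_id].clone()

# Now the text encoder treats "zebra" as if it were "giraffe" and vice versa.
# The edit is in place, so pipe.text_encoder already carries it.
# SDXL pipelines usually ship a second encoder (pipe.text_encoder_2 with
# pipe.tokenizer_2); if this checkpoint has one, repeat the swap there.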

This trivial edit typically scores ~40/100 on the combined metric: swap accuracy is high on simple prompts, but it drops wherever the tokenizer fragments the animal word (plurals, compounds, unusual spellings) into sub-tokens that you missed. [illustrative]

5. Improvements that move the needle

5.1 · Edit cross-attention K/V projections, not the token embedding

In SDXL the prompt influences the UNet via cross-attention: text embeddings become keys and values in each UNet block. Identify the cross-attention layer with the strongest "zebra" signal (run a few forward passes, dump attention maps, find the layer where the zebra-token attention concentrates on striped regions) and apply a rank-1 update to its to_k / to_v projection that maps the zebra key to the giraffe value.
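
A minimal sketch of how the candidate layers could be enumerated. It relies on the diffusers naming convention in which each transformer block calls its self-attention `attn1` and its cross-attention `attn2`. The helper name `cross_attention_projections` is an assumption, and the `SimpleNamespace` objects are offline stand-ins for real modules.

```python
from types import SimpleNamespace


def cross_attention_projections(named_modules):
    """Collect (to_k, to_v) for every module whose name ends in 'attn2'.

    `named_modules` is an iterable of (name, module) pairs, for example
    `pipe.unet.named_modules()`.
    """
    found = {}
    for name, module in named_modules:
        if name.endswith("attn2") and hasattr(module, "to_k") and hasattr(module, "to_v"):
            found[name] = (module.to_k, module.to_v)
    return found


# Offline stand-in for a UNet: one self-attention and one cross-attention module.
fake_modules = [
    ("down_blocks.1.attentions.0.transformer_blocks.0.attn1",
     SimpleNamespace(to_k="self_k", to_v="self_v")),
    ("down_blocks.1.attentions.0.transformer_blocks.0.attn2",
     SimpleNamespace(to_k="cross_k", to_v="cross_v")),
]
layers = cross_attention_projections(fake_modules)
print(list(layers))
```

On the real model, `cross_attention_projections(pipe.unet.named_modules())` should return one entry per cross-attention layer, which gives you the list to rank by attention-map sharpness.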

5.2 · Use ROME-style closed-form rank-1 updates

Take k_zebra to be the text-encoder output at the "zebra" token (the input to the projection W) and v_giraffe = W k_giraffe to be what the projection currently produces for "giraffe". Solve min ||(W' − W) K||_F s.t. W' k_zebra = v_giraffe, where the columns of K are encoder outputs collected on random prompts. The closed-form rank-1 patch is W' = W + (v_giraffe − W k_zebra) (C^{-1} k_zebra)^T / (k_zebra^T C^{-1} k_zebra), where C = K K^T is the (uncentered) covariance of those keys. This is the standard ROME trick from 2022 NLP work, adapted directly to UNet cross-attention.

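import torch

def rome_rank1_update(W, k, v_target, C_inv):
    # W: (d_out, d_in) linear weight; k: (d_in,) key; v_target: (d_out,) desired output;
    # C_inv: (d_in, d_in) pre-computed inverse of the key covariance (assumed symmetric).
    delta = (v_target - W @ k) / (k @ C_inv @ k)  # residual, scaled so that W' k = v_target
    return W + torch.outer(delta, C_inv @ k)      # rank-1 patch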
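
A quick numeric check of the update function above, as a sketch: NumPy stands in for torch to keep it light, and all shapes and random data are toy assumptions. It confirms that the patched weight sends the "zebra" key to the "giraffe" value and that the change really is rank 1.

```python
import numpy as np


def rome_rank1_update(W, k, v_target, C_inv):
    """NumPy version of the same rank-1 patch: W' = W + r (C^-1 k)^T / (k^T C^-1 k)."""
    residual = v_target - W @ k          # what the layer currently gets wrong
    direction = C_inv @ k                # key direction, whitened by the covariance
    return W + np.outer(residual, direction) / (k @ direction)


rng = np.random.default_rng(0)
d_in, d_out = 8, 4                       # toy sizes, not the real projection shape
W = rng.normal(size=(d_out, d_in))

K = rng.normal(size=(d_in, 256))         # stand-in for keys from random prompts
C = K @ K.T / K.shape[1]                 # uncentered covariance estimate
C_inv = np.linalg.inv(C)

k_zebra = rng.normal(size=d_in)
k_giraffe = rng.normal(size=d_in)
v_giraffe = W @ k_giraffe                # what the layer currently emits for "giraffe"

W_new = rome_rank1_update(W, k_zebra, v_giraffe, C_inv)

assert np.allclose(W_new @ k_zebra, v_giraffe)      # zebra key now yields giraffe value
assert np.linalg.matrix_rank(W_new - W) == 1        # the patch is rank 1
```

For a full swap the mirror-image update (giraffe key to zebra value) is needed as well. Computing both residuals against the original W before applying either patch avoids the second update depending on the first.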

5.3 · Add a locality constraint via random control prompts

Collect ~200 control prompts ("lion", "elephant", "city street") and require the edited model's outputs on those prompts to stay close (CLIP-similarity > 0.95) to the original outputs. If the constraint is violated, shrink the magnitude of your edit (a rank-1 patch has no lower rank to fall back to) or add a low-norm regulariser. This usually buys 10–15 points of collateral score.
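
The locality check can be sketched as a comparison of embeddings before and after the edit. The helper names and the toy 2-D vectors below are assumptions for illustration; in practice the embeddings would come from running CLIP on the generated control images.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def locality_violations(original, edited, threshold=0.95):
    """Return control prompts whose similarity is not strictly above `threshold`.

    `original` and `edited` map prompt -> embedding of the generated image
    before and after the weight edit.
    """
    return [prompt for prompt, emb in original.items()
            if not cosine_similarity(emb, edited[prompt]) > threshold]


# Toy embeddings: the lion output drifted a lot, the street output barely moved.
original = {"a lion": [1.0, 0.0], "a city street": [0.0, 1.0]}
edited = {"a lion": [0.6, 0.8], "a city street": [0.01, 1.0]}
print(locality_violations(original, edited))
```

An empty list means the edit passes. Otherwise one option is to scale the patch down and re-run the check until the list is empty.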

5.4 · Edit two layers, not one

Single-layer edits are brittle: the swap fails for unusual phrasings ("Equus quagga"). Applying the same key/value edit at two mid-UNet blocks (say block 6 and block 10) typically lifts swap accuracy by 5–10 points while costing only a small amount of collateral.

5.5 · Validate the CLIP grader is judging what you think

Before submitting, run the official scoring script on a handful of your generated images. CLIP can be fooled: a giraffe with vertical stripes can still read as "zebra" to CLIP. If your visual swap is perfect but the score is low, look at the grader, not your model.

6. Submission format & gotchas

Submit the modified checkpoint as a directory (unet/, text_encoder/, etc.) in the same layout as the original. Test by reloading via from_pretrained(...) before uploading.
Seeds matter: the grader fixes seeds. If you accidentally introduce randomness in your edit routine, your local score won't match the grader's.
Half-precision (fp16) edits sometimes accumulate error — cast to fp32, edit, cast back.
Disk quota is real: SDXL-mini is ~1 GB. Don't commit the original alongside the edited one in a submission zip.
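
A toy illustration of the half-precision gotcha above, with NumPy arrays standing in for torch tensors (the step size and count are arbitrary assumptions). Small updates added directly in fp16 can be rounded away entirely, while accumulating in fp32 and casting back once keeps them.

```python
import numpy as np

w16 = np.ones(4, dtype=np.float16)            # stand-in for an fp16 weight row
step = np.full(4, 1e-4, dtype=np.float16)     # one small update per edit step

# (a) Accumulate ten updates directly in fp16: each sum rounds back to the old value.
in_half = w16.copy()
for _ in range(10):
    in_half = in_half + step

# (b) Cast to fp32, accumulate there, and cast back once at the end.
in_single = w16.astype(np.float32)
for _ in range(10):
    in_single = in_single + step.astype(np.float32)
via_fp32 = in_single.astype(np.float16)

print(in_half[0], via_fp32[0])
```

The torch equivalent is `w.float()`, apply the edit, then `.to(w.dtype)` before writing the weight back.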

7. What top solutions did

The official best-solutions archive shows two converging recipes: (1) ROME-style rank-1 cross-attention edits in 2–3 mid-UNet blocks, with a locality term computed on a few hundred random prompts; (2) textual-inversion baselines that learn a small embedding-shift token and apply it to the text encoder. Recipe (1) dominates the leaderboard because it survives held-out prompts; recipe (2) is faster but over-fits the training prompt distribution. [verify against source]

8. Drill

D1 · Why does swapping token embeddings work at all? Why doesn't it work perfectly?

It works because a token's identity enters the text encoder only through the embedding lookup: once row i and row j are swapped, the encoder receives exactly the "giraffe" input whenever the token id is "zebra", so the prompt behaves as if the word itself had been replaced. It breaks because (a) the tokenizer may split variants of the word (plurals, compounds) into sub-tokens you didn't swap, (b) context words that describe the animal ("striped", "long-necked") are untouched and, through the encoder's attention layers, still pull the final embedding toward the original animal, and (c) some prompts describe a zebra without ever using the word ("striped equine"). A robust swap requires editing the visual representation, not the lexical one.

D2 · Where would you measure the "zebra" concept inside the UNet?

Inside each UNet block, the cross-attention map for the "zebra" token, averaged across heads, highlights pixels that this block believes are "zebra" pixels. Compute the entropy of that map for every block on a batch of zebra prompts. The block with the lowest entropy (sharpest map) is the block that most strongly encodes the concept — that's where a rank-1 ROME edit will land cleanest. Typically this is a mid-resolution block (16×16 feature map), not the highest or lowest resolution.
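
The entropy comparison described above can be sketched in a few lines. The function names and the toy 2×2 maps are assumptions for illustration; real maps would be the head-averaged cross-attention weights for the "zebra" token, reshaped to the block's spatial grid.

```python
import math


def attention_entropy(attn_map):
    """Shannon entropy (in nats) of a non-negative 2D map, normalised to sum to 1."""
    flat = [p for row in attn_map for p in row]
    total = sum(flat)
    return -sum((p / total) * math.log(p / total) for p in flat if p > 0)


def sharpest_block(maps_by_block):
    """Return the block name whose maps have the lowest mean entropy.

    `maps_by_block` maps block name -> list of 2D attention maps (one per prompt).
    """
    mean_entropy = {
        name: sum(attention_entropy(m) for m in maps) / len(maps)
        for name, maps in maps_by_block.items()
    }
    return min(mean_entropy, key=mean_entropy.get)


# Toy maps: "mid" concentrates on one pixel, "up" is spread evenly.
toy = {
    "mid": [[[0.97, 0.01], [0.01, 0.01]]],
    "up":  [[[0.25, 0.25], [0.25, 0.25]]],
}
print(sharpest_block(toy))
```

One caveat worth keeping in mind: the maximum possible entropy is log(H·W), so blocks with larger feature maps start at a disadvantage. Dividing each entropy by log(H·W) is one way to compare across resolutions.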
