IOAI 2024 Practical · Creative brief & storyboard
Contest: IOAI 2024 (Bulgaria) · Round: Practical, on-site (4 h) · Category: Creative AI / planning & brief.
Official source: ioai-official.org/2024-tasks. The on-site brief was based on a track by Maria Ilieva; the exact song and lyrics were distributed to teams during the round. [verify against on-site materials]
1. Problem restatement
Teams receive an audio file (a song) and are tasked with producing both an album cover and a short music video using existing generative AI tools, in 4 hours. The single biggest determinant of jury score is not generation quality — it's the consistency between the cover and the video, and the alignment of both with the song's mood. This task page is about the planning phase: turning a song you've never heard into a one-page brief tight enough that downstream prompt engineering becomes mechanical.
2. What's being tested
- Time discipline. If you start generating before you have a brief, you'll burn 2 of your 4 hours regenerating inconsistent images.
- Multimodal listening. Pulling visual cues from audio: tempo → cut pace, key → palette, lyrics → motifs, instrumentation → texture.
- Communication. A brief is for your team. Writing it down doubles as a forcing function to commit to one direction.
- Constraint-setting. Locking 3 colours, 1 aspect ratio, 1 era of visual reference saves you from "every image looks different".
3. Data exploration / setup
You have an audio.mp3 file and (typically) a lyric sheet. First-pass tooling:
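import librosa
import numpy as np

y, sr = librosa.load("audio.mp3", sr=22050)
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
tempo = float(np.atleast_1d(tempo)[0])  # newer librosa versions return a 1-element array
chroma = librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1)  # 12-dim pitch-class profile; bin 0 is C
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
# The strongest pitch class is only a rough tonic guess; it does not separate major from minor.
print("tempo:", round(tempo), " likely tonic:", NOTES[chroma.argmax()])
duration = librosa.get_duration(y=y, sr=sr)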
Use the tempo to set the video's cut pace (one cut per ~2 beats is a safe pop-video rule) and the duration to choose which part of the song the video covers. The aspect ratio is a separate decision: the assigned track was a Maria Ilieva pop number, and either 16:9 (horizontal) or 9:16 (vertical) works; commit to one early.
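The cut-pace rule above is simple arithmetic once the tempo is known. A minimal sketch, assuming the tempo and the target video length have already been chosen; `cut_plan` and its parameter names are illustrative and not part of librosa:

```python
def cut_plan(tempo_bpm: float, video_s: float, beats_per_cut: int = 2) -> dict:
    """Turn a tempo estimate into a cut interval and a shot count."""
    if tempo_bpm <= 0 or video_s <= 0 or beats_per_cut <= 0:
        raise ValueError("tempo, video length and beats_per_cut must be positive")
    seconds_per_beat = 60.0 / tempo_bpm
    cut_interval_s = seconds_per_beat * beats_per_cut
    # At least one shot, even if the video is shorter than one interval.
    n_shots = max(1, round(video_s / cut_interval_s))
    return {"cut_interval_s": cut_interval_s, "n_shots": n_shots}


# Hypothetical numbers: a 120 bpm track and a 30-second video.
plan = cut_plan(tempo_bpm=120, video_s=30)
print(plan)
```

Raising `beats_per_cut` slows the edit for ballads; lowering it suits faster tracks. The shot count is a planning aid, not a rule the jury checks.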
4. Baseline approach
Spend exactly 30 minutes on the brief. Write the following sections, in this order, in a shared doc:
- Song summary (3 sentences): genre, mood, lyrical theme.
- Visual concept (1 sentence): "neon-soaked nighttime city seen through rain on a car window".
- Three colours (hex): one primary, one accent, one neutral. These constrain every image you generate.
- Reference: name a known director or visual artist whose style you'll echo. This word goes in every prompt.
- Six storyboard frames for the video: a tiny doodle is enough. One frame per ~5 seconds for a 30-second video.
- Cover composition: subject, framing, where the title type will sit.
Sign off — out loud, to your team — before any generation tool is opened. Drift from the brief is the single biggest source of lost points.
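If the team prefers a machine-checkable brief over a free-form doc, the six sections above map onto a small record. This is only a sketch: the `Brief` class, its field names, and the checks are assumptions made for illustration, not an official template. The example values reuse the melancholic-song drill from section 8.

```python
import re
from dataclasses import dataclass

HEX_COLOUR = re.compile(r"^#[0-9A-Fa-f]{6}$")


@dataclass
class Brief:
    song_summary: str
    visual_concept: str
    colours: list[str]        # primary, accent, neutral
    reference: str            # director or visual artist named in every prompt
    storyboard: list[str]     # one entry per ~5 seconds of video
    cover_composition: str

    def problems(self) -> list[str]:
        """Return a list of issues; an empty list means the brief is ready to sign off."""
        issues = []
        if len(self.colours) != 3:
            issues.append("need exactly 3 colours")
        issues += [f"bad hex code: {c}" for c in self.colours if not HEX_COLOUR.match(c)]
        if len(self.storyboard) != 6:
            issues.append("need exactly 6 storyboard frames")
        for name in ("song_summary", "visual_concept", "reference", "cover_composition"):
            if not getattr(self, name).strip():
                issues.append(f"{name} is empty")
        return issues


brief = Brief(
    song_summary="Melancholic song in A minor at 72 bpm about rain and a goodbye.",
    visual_concept="a single empty train platform at dusk",
    colours=["#1B2540", "#D88A3A", "#E6E1D4"],
    reference="Wong Kar-wai",
    storyboard=["rain on glass", "platform wide shot", "clock", "bench", "train arriving", "figure boarding"],
    cover_composition="figure silhouette centred, title at bottom in serif type",
)
print(brief.problems())
```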
5. Improvements that move the needle
5.1 · Build a "prompt fragment library" before the round
Pre-write fragments for camera ("shot on Arri Alexa, 35 mm, shallow depth"), light ("soft volumetric rim light"), grade ("teal & orange grade"), and style ("Wong Kar-wai influence"). On the day, combine the fragments with the brief's three colour codes to assemble each prompt.
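One way to keep such a library is a plain dictionary plus a function that joins fragments with the palette. The sketch below is illustrative: `FRAGMENTS` and `build_prompt` are hypothetical names, and how well a given tool honours hex codes inside a prompt varies, so test that before the round.

```python
# Fragment library, written before the round (values taken from the text above).
FRAGMENTS = {
    "camera": "shot on Arri Alexa, 35 mm, shallow depth",
    "light": "soft volumetric rim light",
    "grade": "teal & orange grade",
    "style": "Wong Kar-wai influence",
}


def build_prompt(subject: str, colours: list[str], fragments: dict[str, str] = FRAGMENTS) -> str:
    """Join the subject, every fragment, and the locked palette into one prompt string."""
    palette = "colour palette " + ", ".join(colours)
    return ", ".join([subject, *fragments.values(), palette])


prompt = build_prompt("empty train platform at dusk", ["#1B2540", "#D88A3A", "#E6E1D4"])
print(prompt)
```

Because every prompt passes through the same function, the reference and palette cannot be silently dropped from one image.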
5.2 · Decide cover vs video order based on your strongest tool
If your team is faster with image generation, build the cover first and use its hero frame as the image-to-video seed. If you're faster with video, generate a 5-second loop and screenshot the most striking frame for the cover. Working against your stronger tool can easily cost an hour.
5.3 · Lock seed and aspect ratio across deliverables
With SDXL, the same seed and the same prompt reproduce the same image, and keeping the seed fixed while changing only small parts of the prompt tends to keep the character and composition close. Generate your "hero" character at one seed, then derive cover + video keyframes from that anchor.
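A tool-agnostic way to enforce the lock is to derive every generation job from one frozen anchor record. This is a sketch under assumptions: `GenSettings` and `derive` are hypothetical names, the seed and prompt values are placeholders, and how the seed and size are passed to the actual generator depends on the tool.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GenSettings:
    prompt: str
    seed: int
    width: int
    height: int


def derive(anchor: GenSettings, shot_detail: str) -> GenSettings:
    """Vary only the prompt suffix; seed and size stay locked to the anchor."""
    return replace(anchor, prompt=f"{anchor.prompt}, {shot_detail}")


# Placeholder anchor for the "hero" character.
hero = GenSettings(prompt="lone figure in a long coat", seed=1234, width=1024, height=1024)
jobs = [derive(hero, detail) for detail in ("close-up, rain on glass", "wide shot, empty platform")]
for job in jobs:
    print(job)
```

Freezing the dataclass means nobody can change the seed on a single job by accident at hour 3.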
5.4 · Build a 15-second highlight cut first, then expand
A polished 15-second cut beats a rough 45-second cut. Export the 15-second cut by the 90-minute mark of video work; if you have time left, add more frames. If not, the 15-second cut is your submission.
5.5 · Plan the title typography manually
Diffusion models still mangle text in 2024. Generate the image without title text, then composite the title in any vector tool (Figma, Inkscape). Treat the typeface as a brief decision: a single serif weight per concept.
6. Submission format & gotchas
- Submit a single zip with cover.png (square, ≥ 1024×1024) and video.mp4 (with audio muxed in, ≤ 60 s, H.264).
- Include a one-page brief.pdf; judges read it. Teams that submit without the brief lose communication points.
- Don't render at 4K; render at 1080p and call it a day. The render time is the constraint.
- Test that the video plays in a browser (Chrome) before submitting. AV1 / VP9 encodes sometimes fail in jury environments.
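Part of this checklist can be automated before upload. The sketch below uses only the standard library to confirm the three file names are in the zip and that the cover is a square PNG of sufficient size, read from the PNG header. `check_submission` and `png_size` are hypothetical helpers; the video's codec and duration are not checked here, since that needs an external tool such as ffprobe.

```python
import io
import struct
import zipfile

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
REQUIRED = ("cover.png", "video.mp4", "brief.pdf")


def png_size(data: bytes) -> tuple[int, int]:
    """Read (width, height) from the IHDR chunk that starts every PNG."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a valid PNG header")
    return struct.unpack(">II", data[16:24])


def check_submission(source, min_side: int = 1024) -> list[str]:
    """Return a list of problems; `source` is a path or a binary file-like object."""
    problems = []
    with zipfile.ZipFile(source) as zf:
        names = set(zf.namelist())
        problems += [f"missing {name}" for name in REQUIRED if name not in names]
        if "cover.png" in names:
            try:
                width, height = png_size(zf.read("cover.png"))
            except ValueError as err:
                problems.append(f"cover.png: {err}")
            else:
                if width != height:
                    problems.append(f"cover.png is {width}x{height}, not square")
                if min(width, height) < min_side:
                    problems.append(f"cover.png is smaller than {min_side} px")
    return problems
```

Usage is `check_submission("submission.zip")`, where the file name is a placeholder; an empty list means the structural checks passed.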
7. What top solutions did
Public write-ups for IOAI 2024 Practical are sparse — the round is judged subjectively and teams don't publish briefs. The pattern visible in the screened final-presentation videos: teams that committed to a single visual era (e.g. "1980s neon" or "watercolour storybook") for the full 4 hours scored higher than teams that explored multiple directions. The brief is the lock on that commitment. [illustrative]
8. Drill
D1 · The song is melancholic, in A minor, 72 bpm, mentions rain and goodbye. Sketch a brief in 60 seconds.
Visual concept: a single empty train platform at dusk, low-key blue + amber neon, slow camera drift. Colours: #1B2540 (deep blue), #D88A3A (amber), #E6E1D4 (warm grey). Reference: Wong Kar-wai. Six frames: (1) close-up rain on glass, (2) wide shot platform with one figure walking away, (3) clock close-up, (4) bench detail, (5) train arriving in motion blur, (6) figure boarding. Cover: figure silhouette centred, title at bottom in serif type. Cut pace: at 72 bpm a beat lasts about 0.83 s, so cutting every 6 beats gives one cut every 5 s, and six frames fill a 30-second video exactly. That is slower than the 2-beat pop rule, which suits slow, melancholic material.
D2 · Your team disagrees on the colour palette. How do you resolve in < 5 minutes?
Force a vote on a single reference image: each team member proposes one URL of a real-world photo whose palette fits the song. Pick the image with the most votes. Run a colour-picker (any palette tool) and lock the top 3 colours. The vote takes 90 seconds; the picker takes 30 seconds. The remaining 3 minutes go to documenting the choice so no one re-litigates at hour 3.
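If no palette tool is at hand, the colour-picker step can be approximated in a few lines. This sketch assumes the reference image has already been loaded into a list of (r, g, b) tuples by whatever image library the team uses; `top_colours` is a hypothetical helper, and it returns the centre of each coarse colour bucket rather than an exact pixel value.

```python
from collections import Counter


def top_colours(pixels, n=3, step=32):
    """Return the n most common colours as hex codes, after coarse bucketing."""
    # Group similar shades by integer-dividing each channel into buckets.
    buckets = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    hex_codes = []
    for bucket, _count in buckets.most_common(n):
        # Represent each bucket by its centre value, clamped to the 0-255 range.
        r, g, b = (min(255, c * step + step // 2) for c in bucket)
        hex_codes.append(f"#{r:02X}{g:02X}{b:02X}")
    return hex_codes


# Synthetic stand-in for a reference photo: mostly deep blue, some amber, some warm grey.
pixels = [(27, 37, 64)] * 50 + [(216, 138, 58)] * 30 + [(230, 225, 212)] * 20 + [(0, 255, 0)] * 5
print(top_colours(pixels))
```

A smaller `step` keeps colours closer to the original at the cost of splitting one shade across several buckets.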