IOAI 2025 · Concepts · Reason over abstract concepts from text
Contest: IOAI 2025 (Beijing) · Round: Individual Contest · Category: NLP / LLM prompting / classification.
Official sources: Individual-Contest/Concepts · Concepts_Solution.ipynb.
1. Problem restatement
Each example pairs a short text passage with one or more "concepts" drawn from a fixed vocabulary, and the task is to predict either (a) which concept(s) the passage instantiates, or (b) a structured relation between two concepts mentioned in the passage. The contest allows open-source LLMs as components but does not allow hosted API access — everything runs on the contestant's local box.
The interesting design choice is that classical text classification (TF-IDF + logistic regression) is a hard baseline for any model trained only on the labels to beat at small data scale, whereas a frozen LLM used carefully as a feature extractor or a few-shot classifier brings pretrained knowledge with it and gains ground as data shrinks.
2. What's being tested
- LLM-as-feature-extractor literacy. Using a frozen open-weights model (e.g. a 1B-parameter instruct model) to embed text or to vote on candidate labels.
- Prompt engineering for structured outputs. Coaxing reliable JSON / single-token classification answers from a small LLM.
- Few-shot vs fine-tune trade-off. Knowing when to spend GPU on a quick LoRA vs. when zero-shot prompting is enough.
- Calibration. Raw LLM probabilities are miscalibrated; a Platt-scaling layer fitted on validation usually adds points.
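A minimal sketch of the Platt-scaling step from the last bullet, under stated assumptions: the scores and labels are made-up toy values, and `fit_platt` / `apply_platt` are illustrative helper names, not part of the official notebook. It fits p = sigmoid(a·s + b) on validation scores by plain gradient descent; fitting sklearn's LogisticRegression on the one-dimensional score would be the usual shortcut.

```python
import math

def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit p = sigmoid(a * s + b) by gradient descent on the mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s   # d(loss)/da for one example
            grad_b += (p - y)       # d(loss)/db for one example
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def apply_platt(score, a, b):
    """Map a raw score to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Toy validation data (hypothetical): raw "yes" scores and true 0/1 labels.
val_scores = [0.9, 0.8, 0.75, 0.7, 0.6, 0.55, 0.4, 0.3]
val_labels = [1, 1, 0, 1, 0, 0, 0, 0]

a, b = fit_platt(val_scores, val_labels)
calibrated = [apply_platt(s, a, b) for s in val_scores]
```

The fitted (a, b) pair is learned on val only and then applied unchanged to test-time scores.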
3. Data exploration / setup
import pandas as pd
train = pd.read_csv("concepts/train.csv") # columns: id, text, concept(s)
val = pd.read_csv("concepts/val.csv")
print(train.head())
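# split + explode so pipe-separated multi-label rows are counted per concept
print(train.concept.str.split("|").explode().value_counts().head(20))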
print("median text length:", train.text.str.split().str.len().median())
Things to check:
- Vocabulary size of concepts. 10 concepts is a different problem from 1000. The notebook tells you which.
- Multi-label vs single-label. If a passage can map to multiple concepts, it's a multi-label classification problem and your output head and loss must change.
- Text length. Short texts (~20 tokens) favour embedding-based classifiers; long texts favour LLM-as-judge.
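The three checks above can be sketched on in-memory rows, without touching the real files. The rows, the concept names, and the pipe separator are assumptions for illustration; `summarise` is a hypothetical helper.

```python
from collections import Counter
from statistics import median

# Toy rows (hypothetical): (text, pipe-separated concept string).
rows = [
    ("He thanked the rain for ruining the picnic", "irony"),
    ("Time is a thief that steals our days", "metaphor|personification"),
    ("The smoke alarm went off because the toast burned", "causation"),
    ("What lovely weather, she said, wringing out her coat", "irony"),
]

def summarise(rows, sep="|"):
    """Report concept vocabulary size, multi-label share and median text length."""
    label_lists = [concepts.split(sep) for _, concepts in rows]
    counts = Counter(label for labels in label_lists for label in labels)
    return {
        "n_concepts": len(counts),
        "multi_label_share": sum(len(l) > 1 for l in label_lists) / len(rows),
        "median_tokens": median(len(text.split()) for text, _ in rows),
    }

stats = summarise(rows)
print(stats)
```

A non-zero multi-label share is the signal to switch to a binarized target and a one-vs-rest head.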
4. Baseline approach
TF-IDF + LogReg, multi-label-aware. 20 lines, no GPU.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score
mlb = MultiLabelBinarizer()
ytr = mlb.fit_transform(train.concept.str.split("|")) # if multi-label
yva = mlb.transform(val.concept.str.split("|"))
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True)
Xtr = vec.fit_transform(train.text); Xva = vec.transform(val.text)
clf = OneVsRestClassifier(LogisticRegression(C=4.0, max_iter=2000, class_weight="balanced"))
clf.fit(Xtr, ytr)
pred = clf.predict(Xva)
print("micro-F1:", f1_score(yva, pred, average="micro"),
"macro-F1:", f1_score(yva, pred, average="macro"))
# expected ~ 0.55 micro [illustrative]
5. Improvements that move the needle
5.1 · Sentence embeddings from a small instruct LLM
Replace TF-IDF with mean-pooled hidden states from a small open model: either a dedicated sentence encoder such as sentence-transformers/all-MiniLM-L6-v2, or a 1B-param instruct model whose hidden states you pool yourself. Train LogReg on top; this typically adds +5–10 micro-F1 points on conceptual text.
from sentence_transformers import SentenceTransformer
enc = SentenceTransformer("all-MiniLM-L6-v2")
Etr = enc.encode(train.text.tolist(), batch_size=64, show_progress_bar=True)
Eva = enc.encode(val.text.tolist(), batch_size=64)
clf = OneVsRestClassifier(LogisticRegression(C=4.0, max_iter=2000))
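clf.fit(Etr, ytr)
pred = clf.predict(Eva)
print("micro-F1:", f1_score(yva, pred, average="micro")) # compare with the TF-IDF baseline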
5.2 · Zero-shot LLM as a label scorer
For each (passage, candidate concept) pair, prompt a small instruct model with "Does this passage instantiate the concept C? Answer yes or no." and read the next-token logit of "yes" relative to "no". Threshold per concept (calibrate on val). This is slow, since it costs one forward pass per pair, but it often beats TF-IDF on rare classes where you have ~5 training examples.
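A rough sketch of the scoring loop, assuming the model call is hidden behind a function that maps a prompt to a yes-probability. The prompt wording beyond the quoted question, the concept names, and `fake_yes_prob` (a stub standing in for the real model) are all illustrative assumptions.

```python
PROMPT = (
    "Passage: {passage}\n"
    'Does this passage instantiate the concept "{concept}"? '
    "Answer yes or no.\nAnswer:"
)

def score_concepts(passage, concepts, yes_prob_fn):
    """Return {concept: yes-probability}, one model call per candidate concept."""
    return {
        c: yes_prob_fn(PROMPT.format(passage=passage, concept=c))
        for c in concepts
    }

def predict(scores, thresholds, default=0.5):
    """Keep every concept whose score clears its own threshold."""
    return sorted(c for c, s in scores.items() if s >= thresholds.get(c, default))

def fake_yes_prob(prompt):
    # Stub for illustration only: a real version would run the LLM on `prompt`
    # and read the next-token probability of "yes".
    return 0.9 if '"irony"' in prompt else 0.2

scores = score_concepts(
    "He thanked the rain for ruining the picnic.",
    ["irony", "metaphor"],
    fake_yes_prob,
)
labels = predict(scores, {"irony": 0.6, "metaphor": 0.3})
```

Keeping the model behind a single callable makes it easy to swap the stub for a real logit read without touching the thresholding logic.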
5.3 · Hybrid: embedding-shortlist + LLM rerank
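Use embeddings to retrieve the top-5 candidate concepts per passage, then have the LLM rerank only those 5. For a vocabulary of N concepts this cuts LLM calls by a factor of N/5 versus scoring every concept (100× at N = 500), and it keeps accuracy as long as the correct concept makes the shortlist.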
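The two-stage idea can be sketched with toy vectors. The 2-D embeddings, the concept names, and `fake_llm_score` (a stub for the expensive LLM call) are assumptions; real embeddings would come from the encoder of section 5.1.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def shortlist(passage_vec, concept_vecs, k=5):
    """Stage 1: cheap retrieval of the k concepts closest to the passage."""
    ranked = sorted(
        concept_vecs,
        key=lambda c: cosine(passage_vec, concept_vecs[c]),
        reverse=True,
    )
    return ranked[:k]

def rerank(passage, candidates, llm_score_fn):
    """Stage 2: expensive scoring, restricted to the shortlist."""
    return sorted(candidates, key=lambda c: llm_score_fn(passage, c), reverse=True)

# Toy 2-D concept embeddings (hypothetical values).
concept_vecs = {
    "irony": [1.0, 0.0],
    "sarcasm": [0.9, 0.1],
    "metaphor": [0.0, 1.0],
    "causation": [-1.0, 0.0],
}

calls = []  # records which concepts reached the expensive stage

def fake_llm_score(passage, concept):
    calls.append(concept)
    return {"sarcasm": 0.8, "irony": 0.6}.get(concept, 0.0)

candidates = shortlist([1.0, 0.0], concept_vecs, k=2)
ranked = rerank("Oh great, another meeting.", candidates, fake_llm_score)
```

Here only 2 of the 4 concepts reach the stub, which is the same saving the N/5 arithmetic describes at larger scale.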
5.4 · LoRA fine-tune on rare classes only
Train a small LoRA adapter on a 1B-param model, using only the training examples of the rarer half of the classes (ranked by frequency). The adapter learns "what does this rare concept look like" while the base weights, and with them the model's general representations, stay frozen. Combine with the embedding classifier for common classes.
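The LoRA training itself is not shown here; the sketch below covers only the data-selection step, picking the rarer half of the classes and the examples that carry them. The example rows, `rare_classes`, and `rare_subset` are hypothetical.

```python
from collections import Counter

def rare_classes(label_lists, fraction=0.5):
    """Return the least frequent `fraction` of classes (ties broken alphabetically)."""
    counts = Counter(label for labels in label_lists for label in labels)
    ordered = sorted(counts, key=lambda c: (counts[c], c))  # rarest first
    return set(ordered[: int(len(ordered) * fraction)])

def rare_subset(examples, rare):
    """Keep examples that carry at least one rare class."""
    return [(text, labels) for text, labels in examples if rare & set(labels)]

# Toy training set (hypothetical): (text, list of concepts).
examples = [
    ("t1", ["irony"]),
    ("t2", ["irony", "metaphor"]),
    ("t3", ["irony"]),
    ("t4", ["metaphor"]),
    ("t5", ["causation"]),
    ("t6", ["analogy"]),
]

rare = rare_classes([labels for _, labels in examples])
lora_train = rare_subset(examples, rare)
```

At inference time the same `rare` set decides which classes are routed to the adapter and which stay with the embedding classifier.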
5.5 · Calibrate thresholds per class
For multi-label problems, per-class decision thresholds can matter as much as the choice of model. Sweep thresholds on val to maximise micro-F1; this is typically worth about +2 F1 points at no extra training cost.
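One way to run the sweep is a single coordinate-wise pass: for each class in turn, try every threshold on a grid while holding the others fixed, and keep a change only if micro-F1 on val improves. This is a sketch on toy probabilities (made-up values), not the official notebook's procedure.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (example, class) cells of two 0/1 matrices."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def binarize(probs, thresholds):
    """Apply one threshold per class (column)."""
    return [[int(p >= th) for p, th in zip(row, thresholds)] for row in probs]

def sweep_thresholds(y_true, probs, grid=None):
    """Greedy per-class sweep; starts from 0.5 everywhere."""
    grid = grid or [i / 20 for i in range(1, 20)]
    thresholds = [0.5] * len(probs[0])
    best = micro_f1(y_true, binarize(probs, thresholds))
    for c in range(len(thresholds)):
        for th in grid:
            trial = thresholds[:c] + [th] + thresholds[c + 1:]
            score = micro_f1(y_true, binarize(probs, trial))
            if score > best:  # keep only strict improvements
                best, thresholds = score, trial
    return thresholds, best

# Toy validation set (hypothetical): 5 passages, 2 concepts.
y_val = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0]]
p_val = [[0.9, 0.1], [0.4, 0.2], [0.2, 0.35], [0.1, 0.3], [0.3, 0.1]]

baseline = micro_f1(y_val, binarize(p_val, [0.5, 0.5]))
thresholds, best = sweep_thresholds(y_val, p_val)
```

With few validation examples per class the tuned thresholds can overfit, so a coarse grid is a reasonable default.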
6. Submission format & gotchas
- submission.csv with id,concept (single label) or id,concept_list (multi-label, pipe-separated). Read the notebook header to confirm.
- If using an LLM, set a fixed temperature (0.0 for classification) and a fixed seed. Otherwise re-running gives different scores.
- Download the model weights ahead of time — contest network access may be restricted.
- Save per-class thresholds as a JSON file next to the model. The grader runs your inference notebook cold, so anything tuned on val must be loaded from disk rather than recomputed.
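A small sketch of the threshold round trip mentioned in the last bullet. The file name thresholds.json, the concept names, and the temporary directory are assumptions; in a real submission the file would sit next to the model weights.

```python
import json
import os
import tempfile

def save_thresholds(thresholds, path):
    """Write {concept: threshold} as human-readable JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(thresholds, f, indent=2, sort_keys=True)

def load_thresholds(path):
    """Read the thresholds back, coercing values to float."""
    with open(path, encoding="utf-8") as f:
        return {concept: float(th) for concept, th in json.load(f).items()}

# Temporary location for illustration only.
path = os.path.join(tempfile.mkdtemp(), "thresholds.json")
save_thresholds({"irony": 0.35, "metaphor": 0.25}, path)
thresholds = load_thresholds(path)
```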
7. What top solutions did
The official solution notebook (linked above) combines an embedding-based classifier with an LLM rerank step on borderline cases. Top community write-ups (when published) add per-class threshold calibration and a small LoRA on rare-class examples. Pure-TF-IDF and pure-LLM submissions both leave points on the table. [verify against official solution]
8. Drill
D1 · You have 50 concepts and 500 training examples — about 10 per concept. Pick a strategy.
Embeddings + LogReg, with class-weighting and per-class threshold calibration. 10 examples per class is too few for fine-tuning a transformer head, and TF-IDF will be brittle on rare concepts. An embedding classifier amortises learning across classes (the embedding model already knows English) and threshold calibration handles imbalance. Reach for LLM rerank only on the bottom decile of classes where even the embedding classifier underperforms.
D2 · A 1B-param instruct model returns "Y" sometimes and "Yes" other times. How do you get a clean signal?
Read the logits, not the decoded text. Find the token ids for "Yes" and "No" (plus "yes", "no", and short forms such as "Y") in the tokenizer, sum the probabilities across the variants of each class, and renormalise over the two classes. This gives one continuous score per (passage, concept) pair that you can threshold or calibrate on val. Decoded-text checks are brittle; logit reads aren't.
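The variant-summing step can be sketched on a toy next-token logit vector. The six-token vocabulary, its ids, and the logit values are made up; in practice the ids would be looked up in the model's tokenizer and the logits taken from the last position of a forward pass.

```python
import math

def yes_probability(logits, yes_ids, no_ids):
    """P(yes) renormalised over the yes/no token variants only."""
    m = max(logits)                               # subtract max for stability
    weights = [math.exp(x - m) for x in logits]   # unnormalised softmax
    yes_mass = sum(weights[i] for i in yes_ids)
    no_mass = sum(weights[i] for i in no_ids)
    return yes_mass / (yes_mass + no_mass)

# Hypothetical toy vocabulary and next-token logits.
vocab = {"Yes": 0, "yes": 1, "No": 2, "no": 3, "Y": 4, "the": 5}
logits = [2.0, 1.0, 0.5, 0.0, 1.5, 3.0]

yes_ids = [vocab["Yes"], vocab["yes"], vocab["Y"]]
no_ids = [vocab["No"], vocab["no"]]
p_yes = yes_probability(logits, yes_ids, no_ids)
```

Because the score is a ratio of softmax masses, probability spent on unrelated tokens such as "the" drops out, so a stray high-probability token does not distort the yes/no signal.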