IOAI 2025 · Concepts · Reason over abstract concepts from text

Contest: IOAI 2025 (Beijing) · Round: Individual Contest · Category: NLP / LLM prompting / classification.

Official sources: Individual-Contest/Concepts · Concepts_Solution.ipynb.

1. Problem restatement

Each example pairs a short text passage with one or more "concepts" drawn from a fixed vocabulary, and the task is to predict either (a) which concept(s) the passage instantiates, or (b) a structured relation between two concepts mentioned in the passage. The contest allows open-source LLMs as components but does not allow hosted API access — everything runs on the contestant's local box.

The interesting design tension is that classical text classification (TF-IDF + logistic regression) is a surprisingly hard baseline to beat on a small dataset, yet a frozen LLM used carefully as a feature extractor or a few-shot classifier tends to overtake it as the number of labelled examples per concept shrinks.

Source. Paraphrased from the Concepts task folder. The exact concept vocabulary, label schema, and evaluation metric are specified in the official notebook — treat the specifics below as [verify against the notebook].

2. What's being tested

LLM-as-feature-extractor literacy. Using a frozen open-weights model (e.g. a 1B-parameter instruct model) to embed text or to vote on candidate labels.
Prompt engineering for structured outputs. Coaxing reliable JSON / single-token classification answers from a small LLM.
Few-shot vs fine-tune trade-off. Knowing when to spend GPU on a quick LoRA vs. when zero-shot prompting is enough.
Calibration. Raw LLM probabilities are often miscalibrated; a Platt-scaling layer (a one-dimensional logistic regression on the raw score) fitted on the validation set usually adds points.
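
A minimal sketch of the Platt-scaling idea, assuming you already have one raw score per validation example (for instance a "yes" logit) and a 0/1 label for a single concept. The function names, the toy scores, and the plain gradient-descent fit are illustrative choices, not taken from the official notebook; a library logistic regression on the one-dimensional score would do the same job.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log-loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            grad_a += (p - y) * s   # d(log-loss)/da
            grad_b += (p - y)       # d(log-loss)/db
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def apply_platt(score, a, b):
    """Map a raw score to a calibrated probability."""
    return sigmoid(a * score + b)

# Toy validation data (hypothetical): raw scores and gold labels for one concept.
val_scores = [-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]
val_labels = [0, 0, 0, 1, 0, 1, 1, 1]
a, b = fit_platt(val_scores, val_labels)
calibrated = [apply_platt(s, a, b) for s in val_scores]
```

Fit a and b on validation data only, then apply them unchanged at test time.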

3. Data exploration / setup

import pandas as pd
train = pd.read_csv("concepts/train.csv")     # columns: id, text, concept(s)
val   = pd.read_csv("concepts/val.csv")
print(train.head())
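# Count individual concepts; explode() also handles pipe-separated multi-label rows
print(train.concept.str.split("|").explode().value_counts().head(20))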
print("median text length:", train.text.str.split().str.len().median())

Things to check:

Vocabulary size of concepts. 10 concepts is a different problem from 1000. The notebook tells you which.
Multi-label vs single-label. If a passage can map to multiple concepts, it's a multi-label classification problem and your output head and loss must change.
Text length. Short texts (~20 tokens) favour embedding-based classifiers; long texts favour LLM-as-judge.
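
The first two checks can be sketched in a few lines. This assumes labels arrive as strings with multiple concepts separated by "|" (the same separator the baseline below assumes); the helper name label_stats and the toy labels are hypothetical.

```python
from collections import Counter

def label_stats(label_strings, sep="|"):
    """Summarise the label column: vocabulary size, multi-label flag, rarest count."""
    label_sets = [s.split(sep) for s in label_strings]
    counts = Counter(label for labels in label_sets for label in labels)
    return {
        "n_concepts": len(counts),
        "is_multi_label": any(len(labels) > 1 for labels in label_sets),
        "min_examples_per_concept": min(counts.values()),
    }

# Toy labels; with the real data you might pass train.concept.tolist().
stats = label_stats(["justice|freedom", "justice", "time"])
```

A very small min_examples_per_concept is a hint that rare classes will need special handling.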

4. Baseline approach

TF-IDF + LogReg, multi-label-aware. 20 lines, no GPU.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score

mlb = MultiLabelBinarizer()
ytr = mlb.fit_transform(train.concept.str.split("|"))   # if multi-label
yva = mlb.transform(val.concept.str.split("|"))

vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True)
Xtr = vec.fit_transform(train.text); Xva = vec.transform(val.text)

clf = OneVsRestClassifier(LogisticRegression(C=4.0, max_iter=2000, class_weight="balanced"))
clf.fit(Xtr, ytr)
pred = clf.predict(Xva)
print("micro-F1:", f1_score(yva, pred, average="micro"),
      "macro-F1:", f1_score(yva, pred, average="macro"))
# expected ~ 0.55 micro [illustrative]

5. Improvements that move the needle

5.1 · Sentence embeddings from a small open model

Replace TF-IDF with dense sentence embeddings: either a dedicated sentence encoder (e.g. sentence-transformers/all-MiniLM-L6-v2) or mean-pooled hidden states from a 1B-param instruct model. Train LogReg on top. A gain of roughly +5–10 micro-F1 on conceptual text is typical [illustrative].

from sentence_transformers import SentenceTransformer
enc = SentenceTransformer("all-MiniLM-L6-v2")
Etr = enc.encode(train.text.tolist(), batch_size=64, show_progress_bar=True)
Eva = enc.encode(val.text.tolist(),   batch_size=64)

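# Reuses ytr / yva and the sklearn imports from the baseline in section 4
clf = OneVsRestClassifier(LogisticRegression(C=4.0, max_iter=2000))
clf.fit(Etr, ytr)
pred = clf.predict(Eva)
print("micro-F1:", f1_score(yva, pred, average="micro"))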

5.2 · Zero-shot LLM as a label scorer

For each (passage, candidate concept) pair, prompt a small instruct model with "Does this passage instantiate the concept C? Answer yes or no." and read the logit of "yes". Threshold per concept (calibrate on val). This is slow but often beats TF-IDF on rare classes where you have ~5 training examples.
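
A rough sketch of the scorer, assuming a Hugging Face causal LM and its tokenizer are already loaded from local weights. The prompt wording, the function names, and the choice to return the yes-minus-no logit margin (rather than the raw "yes" logit) are assumptions made for illustration; the first sub-token of " yes" / " no" also differs between tokenizers, so inspect the ids for your model.

```python
def build_prompt(passage, concept):
    """Format one (passage, concept) pair as a yes/no question."""
    return (
        f"Passage: {passage}\n"
        f'Does this passage instantiate the concept "{concept}"? '
        "Answer yes or no.\nAnswer:"
    )

def yes_no_margin(model, tokenizer, passage, concept):
    """Return logit(' yes') - logit(' no') for the token following the prompt."""
    import torch  # lazy import so build_prompt works without the GPU stack

    inputs = tokenizer(build_prompt(passage, concept), return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Assumption: the first sub-token of " yes" / " no" identifies the answer.
    yes_id = tokenizer.encode(" yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode(" no", add_special_tokens=False)[0]
    return (next_token_logits[yes_id] - next_token_logits[no_id]).item()

# Hypothetical usage; "<local-model-dir>" is a placeholder, not a real path:
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tok = AutoTokenizer.from_pretrained("<local-model-dir>")
# lm = AutoModelForCausalLM.from_pretrained("<local-model-dir>")
# score = yes_no_margin(lm, tok, "The jury acquitted her.", "justice")
example_prompt = build_prompt("The jury acquitted her.", "justice")
```

The margin is then thresholded per concept on val, as described above.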

5.3 · Hybrid: embedding-shortlist + LLM rerank

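Use embeddings to retrieve the top-5 candidate concepts per passage, then have the LLM rerank only those 5. This cuts LLM calls per passage from the vocabulary size to 5 (10× for 50 concepts, 100× for 500) and usually keeps accuracy, provided the true concept makes the shortlist.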
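
The two stages might look like the sketch below. It assumes each concept has its own embedding (for example the embedded concept name, or the centroid of its training passages; the source does not say which) and that llm_score is some callable returning a higher number for a better match. Both are hypothetical stand-ins, and the 2-D vectors are toy data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def shortlist(passage_vec, concept_vecs, k=5):
    """Stage 1: the k concepts whose embeddings are closest to the passage."""
    ranked = sorted(
        concept_vecs,
        key=lambda name: cosine(passage_vec, concept_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

def rerank(passage, candidates, llm_score):
    """Stage 2: order the shortlist by an LLM score (hypothetical callable)."""
    return sorted(candidates, key=lambda name: llm_score(passage, name), reverse=True)

# Toy example with 2-D "embeddings" and a fake LLM scorer.
concept_vecs = {"justice": [1.0, 0.0], "time": [0.0, 1.0], "freedom": [0.7, 0.7]}
top2 = shortlist([0.9, 0.1], concept_vecs, k=2)
fake_scores = {"justice": 0.2, "freedom": 0.9}
final = rerank("some passage", top2, lambda passage, name: fake_scores[name])
```

Only the shortlisted concepts ever reach the LLM, which is where the saving comes from.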

5.4 · LoRA fine-tune on rare classes only

Train a small LoRA adapter on a 1B-param model, using only the training examples for the rarer half of the classes by frequency. The adapter learns "what does this rare concept look like" while the frozen base weights keep the model's general representations intact. Combine with the embedding classifier for common classes.
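
The data-selection step can be sketched without any GPU code. This assumes examples are (text, label_string) pairs with "|" as the multi-label separator, and it breaks frequency ties alphabetically; the function name and toy examples are hypothetical. The adapter training itself (for example with a LoRA library such as peft) is left out.

```python
from collections import Counter

def rare_class_subset(examples, sep="|"):
    """Return (rare_labels, subset): the rarer half of the labels by frequency
    and the training examples that carry at least one of them."""
    counts = Counter(label for _, labels in examples for label in labels.split(sep))
    # Rarest first; ties broken alphabetically so the split is deterministic.
    ordered = sorted(counts, key=lambda label: (counts[label], label))
    rare = set(ordered[: len(ordered) // 2])
    subset = [(text, labels) for text, labels in examples if rare & set(labels.split(sep))]
    return rare, subset

examples = [
    ("a", "justice"), ("b", "justice"), ("c", "justice|time"),
    ("d", "freedom"), ("e", "time"), ("f", "irony"),
]
rare, subset = rare_class_subset(examples)
```

Only subset is used to train the adapter; the common classes stay with the embedding classifier.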

5.5 · Calibrate thresholds per class

For multi-label problems, per-class decision thresholds can matter as much as the choice of model. Sweep thresholds on val to maximise micro-F1; this is typically worth about +2 F1 points at no training cost [illustrative].
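
One simple way to run the sweep is shown below, under the assumption that Y and P are per-example rows of 0/1 labels and predicted probabilities in the same class order. It tunes each class independently to maximise that class's own F1, which is a greedy proxy for micro-F1 rather than an exact optimiser of it; the grid and function names are illustrative.

```python
def f1(y_true, y_pred):
    """Binary F1 from parallel sequences of truthy/falsy values."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(y_true, probs, grid=None):
    """Lowest grid threshold that achieves the best F1 for one class."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    return max(grid, key=lambda t: f1(y_true, [p >= t for p in probs]))

def fit_thresholds(Y, P, class_names):
    """Tune one threshold per class on validation labels Y and probabilities P."""
    return {
        name: best_threshold([row[j] for row in Y], [row[j] for row in P])
        for j, name in enumerate(class_names)
    }

# Toy validation set: 4 examples, 2 classes.
Y = [[1, 0], [1, 0], [0, 1], [0, 0]]
P = [[0.30, 0.10], [0.35, 0.20], [0.10, 0.80], [0.05, 0.60]]
thresholds = fit_thresholds(Y, P, ["a", "b"])
```

In this toy data the two classes end up with very different thresholds, which a single global 0.5 cut-off would miss.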

6. Submission format & gotchas

submission.csv with id,concept (single label) or id,concept_list (multi-label, pipe-separated). Read the notebook header to confirm.
If using an LLM, set a fixed temperature (0.0 for classification) and a fixed seed. Otherwise re-running gives different scores.
Download the model weights ahead of time — contest network access may be restricted.
Save the per-class thresholds as a JSON file next to the model. The grader runs your inference notebook cold, so nothing can depend on variables left over from training.
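
A small sketch of the persistence and submission steps, assuming the multi-label id,concept_list layout with pipe-separated labels mentioned above. The file names, the helper names, and the fallback to the single most probable concept when nothing clears its threshold are assumptions; confirm the real format in the notebook header.

```python
import csv
import json
import os
import tempfile

def save_thresholds(thresholds, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(thresholds, f, indent=2, sort_keys=True)

def load_thresholds(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def write_submission(ids, probs, class_names, thresholds, path, sep="|"):
    """probs holds one row of per-class probabilities for each id."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "concept_list"])
        for ex_id, row in zip(ids, probs):
            chosen = [c for c, p in zip(class_names, row) if p >= thresholds[c]]
            if not chosen:  # assumption: never submit an empty label list
                chosen = [class_names[max(range(len(row)), key=row.__getitem__)]]
            writer.writerow([ex_id, sep.join(chosen)])

# Round trip in a temporary directory with toy values.
with tempfile.TemporaryDirectory() as tmp:
    thr_path = os.path.join(tmp, "thresholds.json")
    sub_path = os.path.join(tmp, "submission.csv")
    save_thresholds({"justice": 0.4, "time": 0.6}, thr_path)
    loaded = load_thresholds(thr_path)
    write_submission([1, 2], [[0.9, 0.7], [0.1, 0.2]], ["justice", "time"], loaded, sub_path)
    with open(sub_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
```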

7. What top solutions did

The official solution notebook (linked above) combines an embedding-based classifier with an LLM rerank step on borderline cases. Top community write-ups (when published) add per-class threshold calibration and a small LoRA on rare-class examples. Pure-TF-IDF and pure-LLM submissions both leave points on the table. [verify against official solution]

8. Drill

D1 · You have 50 concepts and 500 training examples — about 10 per concept. Pick a strategy.

Embeddings + LogReg, with class-weighting and per-class threshold calibration. 10 examples per class is too few for fine-tuning a transformer head, and TF-IDF will be brittle on rare concepts. An embedding classifier amortises learning across classes (the embedding model already knows English) and threshold calibration handles imbalance. Reach for LLM rerank only on the bottom decile of classes where even the embedding classifier underperforms.

D2 · A 1B-param instruct model returns "Y" sometimes and "Yes" other times. How do you get a clean signal?

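Read the logits, not the decoded text. Find the token ids for "Yes" and "No" and their variants ("yes", "no", "Y", "N", and leading-space forms such as " Yes", which many tokenizers encode as separate tokens); sum the next-token probabilities across the variants for each class. This gives a continuous score per (passage, concept) that you can threshold or calibrate on val. Decoded-text checks are brittle; logit reads are far less so.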
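
The aggregation step can be sketched in pure Python, assuming you already have the next-token logits as a list and have looked up which vocabulary ids correspond to the "yes" and "no" variants for your tokenizer. The ids and logits below are toy values, and renormalising over yes + no is one reasonable choice rather than the only one.

```python
import math

def yes_no_probs(next_token_logits, yes_ids, no_ids):
    """Softmax over the whole vocabulary, then sum the mass on each answer's
    variant ids. Returns (p_yes, p_no, p_yes renormalised over yes + no)."""
    m = max(next_token_logits)                      # subtract max for stability
    exps = [math.exp(x - m) for x in next_token_logits]
    z = sum(exps)
    p_yes = sum(exps[i] for i in yes_ids) / z
    p_no = sum(exps[i] for i in no_ids) / z
    return p_yes, p_no, p_yes / (p_yes + p_no)

# Toy 5-token vocabulary: ids 0 and 1 stand for "Yes"/"yes", ids 2 and 3 for "No"/"no".
logits = [2.0, 1.0, 0.5, 0.0, -1.0]
p_yes, p_no, yes_share = yes_no_probs(logits, yes_ids=[0, 1], no_ids=[2, 3])
```

If p_yes + p_no is small, the model is putting most of its mass on something other than an answer token, which usually points to a prompt-format problem.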
