Modeling techniques catalogue

A working set of the modeling tools that show up again and again in HiMCM. Each one has a "when to use", a sketch of the math, a worked example, and a note on pitfalls. Most problems call for a combination of two or three of these.

How to choose a method

Read the problem twice and ask three questions:

  1. What's the output? A number, a ranking, a plan, a forecast, a probability, a yes/no?
  2. What's the structure of the input? One scalar, several criteria, a time series, a network, a population?
  3. What's the relationship? Linear, exponential, S-shaped, periodic, stochastic, equilibrium-driven?

If you need to…

  • Rank options against multiple criteria → AHP, TOPSIS, Entropy Weight Method, weighted sum
  • Forecast a quantity over time → regression, ARIMA, exponential smoothing, logistic growth
  • Model a population or compartments → ODEs (SIR, logistic, predator–prey), difference equations
  • Predict a yes/no outcome → logistic regression, classification trees
  • Allocate scarce resources optimally → linear programming, integer programming, greedy algorithms
  • Simulate behavior with randomness → Monte Carlo simulation, agent-based models, Markov chains
  • Find a route or schedule → graph algorithms (Dijkstra, MST, TSP heuristics), networkx
  • Quantify uncertainty → bootstrap, Monte Carlo, confidence intervals

1 · Analytic Hierarchy Process (AHP)

When to use. You have multiple criteria, each with subjective weights, and you need to rank alternatives. Classic for "pick the best host city" or "rank these candidate sports".

The mechanics.

  1. Build a pairwise comparison matrix $A$ of your criteria: $a_{ij}$ is how much more important criterion $i$ is than criterion $j$, on a 1–9 scale.
  2. Compute the principal eigenvector of $A$. Its entries (normalized) are your criterion weights $w$.
  3. Score each alternative on each criterion, multiply by $w$, sum.
  4. Check consistency: $CR = (\lambda_{\max} - n)/((n-1) \cdot RI)$. CR should be ≤ 0.10.
$w_i = $ normalized principal eigenvector of pairwise comparison matrix $A$.
$\text{score}(k) = \sum_i w_i \cdot r_{ki}$, where $r_{ki}$ is alternative $k$'s rating on criterion $i$.
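The eigenvector computation and consistency check take a few lines of NumPy. The pairwise values below are illustrative, and the random index $RI$ comes from Saaty's standard table:

```python
import numpy as np

# Illustrative 3x3 pairwise matrix: criterion 1 is moderately more
# important than criterion 2 (3) and strongly more than criterion 3 (5).
A = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 3.0],
              [1/5, 1/3, 1.0]])

# Principal eigenvector -> criterion weights
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)
w = np.abs(eigvecs[:, k].real)
w = w / w.sum()

# Consistency ratio: CR = (lambda_max - n) / ((n - 1) * RI)
n = A.shape[0]
RI = {3: 0.58, 4: 0.90, 5: 1.12}[n]   # Saaty's random index table
CR = (eigvals.real[k] - n) / ((n - 1) * RI)
```

For this matrix CR comes out well under 0.10, so the judgments are acceptably consistent.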

Pitfalls. The pairwise weights are subjective — be explicit about who set them and why. Always run a sensitivity check by perturbing the weights and seeing if the ranking changes. Many top papers combine AHP with the Entropy Weight Method to get a hybrid weight that has both subjective and data-driven components.

Sketch: AHP for choosing a Super Bowl host city

Criteria: renewable energy mix, transit access, hotel capacity, climate severity, existing venues. Build a 5×5 pairwise matrix (use scale 1=equal, 3=moderately more important, 5=strongly, 7=very strongly, 9=extremely). Compute weights. Score each candidate city 1–10 on each criterion (use real data). Multiply, sum, rank.


2 · TOPSIS

When to use. Same problem class as AHP — multi-criteria ranking — but you already have numeric scores for each alternative on each criterion and want a principled way to combine them.

The mechanics.

  1. Build the decision matrix: rows = alternatives, columns = criteria, values = scores.
  2. Normalize each column (vector normalization is standard: $\tilde r_{ki} = r_{ki} / \sqrt{\sum_k r_{ki}^2}$).
  3. Multiply each column by its weight $w_i$.
  4. Identify the "ideal best" $A^+$ (max in each column for "more is better" criteria, min for "less is better") and "ideal worst" $A^-$.
  5. For each alternative compute Euclidean distance to $A^+$ and $A^-$.
  6. Closeness coefficient $C_k = d_k^- / (d_k^+ + d_k^-)$. Rank by $C_k$, higher is better.
$C_k = \dfrac{d_k^-}{d_k^+ + d_k^-} \in [0, 1]$, where $d_k^\pm = \sqrt{\sum_i (v_{ki} - v_i^\pm)^2}$.
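Steps 2–6 translate directly into NumPy. The decision matrix, weights, and criterion directions below are toy values (a benefit criterion and a cost criterion):

```python
import numpy as np

def topsis(R, w, benefit):
    """R: (alternatives x criteria) scores; w: weights; benefit: True where more is better."""
    V = R / np.sqrt((R**2).sum(axis=0))          # vector normalization
    V = V * w                                    # weighted normalized matrix
    best  = np.where(benefit, V.max(axis=0), V.min(axis=0))   # ideal best A+
    worst = np.where(benefit, V.min(axis=0), V.max(axis=0))   # ideal worst A-
    d_plus  = np.sqrt(((V - best)**2).sum(axis=1))
    d_minus = np.sqrt(((V - worst)**2).sum(axis=1))
    return d_minus / (d_plus + d_minus)          # closeness coefficient, higher is better

# Toy data: 3 cities x 2 criteria (hotel capacity: benefit; climate severity: cost)
R = np.array([[80., 2.], [60., 5.], [90., 8.]])
C = topsis(R, w=np.array([0.6, 0.4]), benefit=np.array([True, False]))
```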

Pitfalls. TOPSIS depends entirely on the weights. Use AHP or Entropy Weight to derive them. Don't pretend uniform weights are "neutral" — they're a choice. Numbers must be comparable post-normalization; mind units.


3 · Entropy Weight Method (EWM)

When to use. You want data-driven weights for your criteria, with no subjective input. Often paired with TOPSIS or AHP.

The mechanics. Criteria with more variation across alternatives are more informative, so they get more weight.

  1. Normalize the decision matrix into proportions $p_{ki}$ (each column sums to 1).
  2. For each criterion $i$, compute its entropy: $e_i = -\frac{1}{\ln K} \sum_k p_{ki} \ln p_{ki}$ (where $K$ is the number of alternatives).
  3. The criterion's weight is proportional to $1 - e_i$, then normalized.
$w_i = \dfrac{1 - e_i}{\sum_j (1 - e_j)}$
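The three steps in code, on the same kind of toy decision matrix; note how the criterion with more relative spread earns the larger weight:

```python
import numpy as np

def entropy_weights(R):
    """R: (alternatives x criteria) positive decision matrix."""
    K = R.shape[0]
    P = R / R.sum(axis=0)                        # proportions; each column sums to 1
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(P > 0, P * np.log(P), 0.0)   # 0 * ln(0) taken as 0
    e = -plogp.sum(axis=0) / np.log(K)           # entropy per criterion
    return (1 - e) / (1 - e).sum()               # normalized weights

R = np.array([[80., 2.], [60., 5.], [90., 8.]])  # toy: 3 alternatives x 2 criteria
w = entropy_weights(R)
```

Here the second column varies far more across alternatives, so it dominates the weights.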

Pitfalls. EWM weights are sensitive to the choice of alternatives. Adding or removing an alternative shifts all weights — be honest about this. A common best practice is the Combined Weight Model: average AHP weights and EWM weights to get something with both expert judgment and data sensitivity.


4 · Regression (linear, polynomial, multivariate)

When to use. You have a numeric target and one or more numeric inputs, and you want to fit a relationship.

  • Linear: $y = a + bx$. Good baseline. Use when the scatter plot looks like a straight line.
  • Polynomial: $y = a_0 + a_1 x + a_2 x^2 + \dots$. Beware overfitting — degree 2 or 3 is usually enough.
  • Multivariate: $y = a_0 + \sum_i a_i x_i$. For multiple predictors.
  • Power / exponential / log: Linearize first (take log of both sides) and fit linearly.

What to report. The fit equation, $R^2$, residual plot, and a comment on whether residuals look random or have structure.
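Fitting and reporting can be done with numpy.polyfit. The data here are synthetic, generated from a known quadratic plus noise, just to show the workflow:

```python
import numpy as np

# Synthetic data with a quadratic trend plus noise (illustrative values)
rng = np.random.default_rng(0)
t = np.arange(30, dtype=float)
y = 315 + 0.8 * t + 0.012 * t**2 + rng.normal(0, 0.5, t.size)

coeffs = np.polyfit(t, y, deg=2)        # quadratic least-squares fit
y_hat = np.polyval(coeffs, t)

# R^2 and residuals: the two things to report alongside the fit equation
ss_res = ((y - y_hat)**2).sum()
ss_tot = ((y - y.mean())**2).sum()
r2 = 1 - ss_res / ss_tot
residuals = y - y_hat                   # plot these; they should look random
```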

Worked sketch: 2022-B CO₂ trend

Given annual CO₂ from 1959–2021, fit:

  • Linear: $C(t) = a + b(t-1959)$. Likely underestimates recent years.
  • Quadratic: $C(t) = a + b(t-1959) + c(t-1959)^2$. Captures the acceleration.
  • Exponential: $\ln(C - 280) = a + b(t-1959)$. Models "excess over pre-industrial".

Report all three with $R^2$ and prediction for 2050 + 2100. Pick the one with both good $R^2$ and physically reasonable extrapolation. Quadratic typically wins here.


5 · Logistic growth

When to use. Quantity grows fast at first, then saturates at a carrying capacity. Populations, market penetration, adoption curves.

$\dfrac{dP}{dt} = r P \left(1 - \dfrac{P}{K}\right) \quad \Rightarrow \quad P(t) = \dfrac{K}{1 + \left(\frac{K-P_0}{P_0}\right) e^{-rt}}$

Parameters to estimate: $r$ (intrinsic growth rate), $K$ (carrying capacity), $P_0$ (initial population). Fit using least squares to historical data, or estimate from physical reasoning.
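A least-squares fit with scipy.optimize.curve_fit, on synthetic data that does reach saturation (all parameter values and bounds below are illustrative):

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, K, r, P0):
    return K / (1 + ((K - P0) / P0) * np.exp(-r * t))

# Synthetic observations near saturation (needed to pin down K)
t = np.linspace(0, 20, 40)
rng = np.random.default_rng(1)
data = logistic(t, K=1000, r=0.5, P0=10) + rng.normal(0, 10, t.size)

# Initial guesses matter: K ~ max observed, r in a plausible range, P0 ~ first point
(K, r, P0), _ = curve_fit(logistic, t, data,
                          p0=[data.max(), 0.3, max(data[0], 1.0)],
                          bounds=([1, 0.01, 0.1], [1e4, 5, 100]))
```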

Pitfalls. Don't apply logistic to a quantity that's still in pure exponential phase — you need data near saturation to estimate $K$.


6 · Compartmental ODE models

When to use. A system has distinct states and flow between them (susceptible/infected/recovered, larvae/pupae/adults, queue/server). The dynamics are continuous and aggregate.

Template (3-compartment example):

$\dfrac{dS}{dt} = -\beta SI$    $\dfrac{dI}{dt} = \beta SI - \gamma I$    $\dfrac{dR}{dt} = \gamma I$

Solve numerically with SciPy's solve_ivp. Plot trajectories. Identify equilibria. Compute the basic reproduction number $R_0 = \beta / \gamma$ (this form assumes $S$, $I$, $R$ are population fractions; with raw counts it is $\beta N / \gamma$).

Pitfalls. Choose units carefully. Verify mass conservation ($S + I + R = N$ should hold). Stiff systems need stiff solvers (LSODA, BDF).
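A minimal solve_ivp run of the SIR system, with $S$, $I$, $R$ as population fractions and illustrative rates:

```python
import numpy as np
from scipy.integrate import solve_ivp

beta, gamma = 0.3, 0.1            # per-day rates; R0 = beta / gamma = 3

def sir(t, y):
    S, I, R = y
    return [-beta * S * I, beta * S * I - gamma * I, gamma * I]

# S, I, R as fractions of the population, so S + I + R = 1 throughout
sol = solve_ivp(sir, (0, 160), [0.99, 0.01, 0.0], rtol=1e-8, atol=1e-10)
S, I, R = sol.y                   # trajectories to plot and sanity-check
```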

Sketch: honeybee colony (2022-A)

Compartments: eggs $E$, larvae $L$, workers $W$, queens $Q$. Daily flows: eggs are laid at rate $\beta Q$; eggs mature into larvae at rate $\mu_E$; larvae become workers at rate $\mu_L$; workers die at rate $\mu_W$. Add seasonal multipliers on $\beta$ and $\mu_W$ to capture winter slowdown. Solve as a system of ODEs.


7 · Logistic regression (classification)

When to use. You're predicting a binary outcome (will this SDE be included? will the team be evacuated in time? will the species become invasive?). Inputs can be a mix of numeric and categorical.

$P(y=1 \mid \mathbf{x}) = \dfrac{1}{1 + \exp(-(\beta_0 + \beta^\top \mathbf{x}))}$

Fit by maximum likelihood (use statsmodels.Logit or sklearn.linear_model.LogisticRegression). Threshold at 0.5 to convert to a binary prediction, or interpret the probability directly.

Pitfalls. Always check for class imbalance. Report AUC or precision/recall, not just accuracy. Coefficients are interpretable as log-odds changes, which makes for good paper writing.
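To show the mechanics without a library, here is logistic regression fit by gradient ascent on the log-likelihood; in the paper you would use statsmodels or sklearn as noted above, and the data here are synthetic:

```python
import numpy as np

# Synthetic binary data with known coefficients
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
p_true = 1 / (1 + np.exp(-(0.5 + X @ np.array([1.5, -2.0]))))
y = rng.binomial(1, p_true)

Xb = np.hstack([np.ones((200, 1)), X])     # prepend intercept column
beta = np.zeros(3)
for _ in range(5000):
    p_hat = 1 / (1 + np.exp(-(Xb @ beta)))
    grad = Xb.T @ (y - p_hat) / len(y)     # gradient of mean log-likelihood
    beta += 0.5 * grad                     # ascent step

accuracy = ((1 / (1 + np.exp(-(Xb @ beta))) > 0.5) == y).mean()
```

The fitted coefficients recover the signs of the true effects, which is exactly the log-odds interpretation the pitfall note recommends writing up.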


8 · ARIMA / time-series forecasting

When to use. You have a time series with trend and/or seasonality, and need to forecast forward. The judges flag ARIMA as appropriate, but warn that you must explain what it is, not just call auto_arima.

ARIMA(p, d, q) means: $p$ autoregressive lags, $d$ differencing steps to make the series stationary, $q$ moving-average terms.

Workflow.

  1. Plot the series. Look for trend, seasonality, level shifts.
  2. Difference until stationary (test with ADF). $d$ is usually 0, 1, or 2.
  3. Inspect ACF/PACF plots to pick $p, q$.
  4. Fit, check residuals (should be white noise), report AIC.
  5. Forecast with prediction intervals.

Alternative for HiMCM: exponential smoothing (Holt-Winters) is simpler to explain and often just as good for short horizons.
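Holt's linear-trend method (the non-seasonal core of Holt-Winters) is short enough to implement and explain directly; the smoothing constants and series below are illustrative:

```python
import numpy as np

def holt(y, alpha=0.3, beta=0.1, horizon=5):
    """Holt's linear-trend exponential smoothing (no seasonality)."""
    level, trend = y[0], y[1] - y[0]
    for x in y[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)   # smooth the level
        trend = beta * (level - prev_level) + (1 - beta) * trend  # smooth the trend
    return level + trend * np.arange(1, horizon + 1)        # h-step-ahead forecast

y = np.array([10., 12., 13., 15., 16., 18., 19., 21.])      # toy trending series
forecast = holt(y, horizon=3)
```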


9 · Markov chains

When to use. Discrete states, transition between them depends only on the current state. Great for modeling progression, dynamics, or steady-state behavior.

The mechanics. Build transition matrix $P$ where $P_{ij}$ is the probability of going from state $i$ to state $j$. Starting distribution $\pi_0$. After $n$ steps: $\pi_n = \pi_0 P^n$. Steady-state distribution: eigenvector of $P$ with eigenvalue 1.

Use case: HPC energy mix evolution (2024-B)

States: coal-dominant, gas-dominant, nuclear-dominant, renewable-dominant. Each year there's a small probability of transitioning between states based on policy / investment. Iterate the matrix to see how the mix evolves over a decade. Steady-state tells you the long-run equilibrium under given transition probabilities.
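Iterating the matrix and extracting the steady state takes a few lines; the transition probabilities below are invented for illustration:

```python
import numpy as np

# Transition matrix over 4 energy-mix states (rows sum to 1; illustrative numbers)
# State order: coal, gas, nuclear, renewable
P = np.array([[0.90, 0.07, 0.01, 0.02],
              [0.02, 0.90, 0.02, 0.06],
              [0.00, 0.02, 0.95, 0.03],
              [0.00, 0.01, 0.01, 0.98]])

pi0 = np.array([0.5, 0.3, 0.1, 0.1])            # today's distribution
pi_10 = pi0 @ np.linalg.matrix_power(P, 10)     # distribution after a decade

# Steady state: left eigenvector of P with eigenvalue 1
vals, vecs = np.linalg.eig(P.T)
ss = np.abs(vecs[:, np.argmax(vals.real)].real)
ss = ss / ss.sum()
```

With these (made-up) probabilities the renewable state dominates the long run, because it has the strongest self-loop and net inflows.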


10 · Monte Carlo simulation

When to use. Your model has multiple uncertain inputs, and you want a distribution of outputs (not just a point estimate). Also good for sensitivity analysis and for problems with complex randomness (queues, evacuations, disease spread).

Recipe.

  1. Pick distributions for each uncertain input (e.g., walking speed ∼ Normal(1.4, 0.2)).
  2. Draw $N$ samples (e.g., $N = 10\,000$).
  3. Run the model on each sample.
  4. Report the distribution: mean, std, 5th/95th percentile, histogram.

Convergence check. Plot output mean vs. $N$. It should stabilize. If not, increase $N$.
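The recipe in code, for a toy evacuation-time model whose input distributions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10_000

# Step 1-2: uncertain inputs (assumed distributions) and N draws
speed = rng.normal(1.4, 0.2, N)            # walking speed, m/s
speed = np.clip(speed, 0.5, None)          # guard against nonphysical draws
length = rng.uniform(80, 120, N)           # corridor length, m

# Step 3: run the model on each sample
time = length / speed                      # evacuation time, seconds

# Step 4: report the distribution, not a point estimate
mean, std = time.mean(), time.std()
p5, p95 = np.percentile(time, [5, 95])
```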


11 · Optimization (LP, IP, heuristics)

When to use. "Find the best…", "Minimize total…", "Maximize coverage…".

  • Linear programming (LP): linear objective, linear constraints, continuous variables. Use scipy.optimize.linprog.
  • Integer / Mixed-integer (IP/MIP): same but with integer constraints. Use pulp or cvxpy with a free solver (CBC).
  • Nonlinear: use scipy.optimize.minimize with methods like SLSQP or COBYLA.
  • Heuristic search: for hard combinatorial problems (TSP, scheduling), greedy + local search or simulated annealing.

Standard LP form:   $\min c^\top x \quad \text{s.t.} \quad Ax \le b, \quad x \ge 0$
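A toy allocation problem in scipy.optimize.linprog, which minimizes, so a maximization objective is negated (the numbers are illustrative):

```python
from scipy.optimize import linprog

# Maximize 3x + 2y subject to x + y <= 4, x + 3y <= 6, x, y >= 0.
res = linprog(c=[-3, -2],                       # negate to maximize
              A_ub=[[1, 1], [1, 3]], b_ub=[4, 6],
              bounds=[(0, None)] * 2)
x, y = res.x
best = -res.fun                                 # undo the negation
```

The optimum sits at a vertex of the feasible region, here $(x, y) = (4, 0)$ with objective value 12.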

12 · Graphs & networks

When to use. Routing (evacuation, delivery), connectivity (power grid, transit), influence (social networks).

Useful algorithms (all in networkx):

  • Shortest path: Dijkstra (positive weights), Bellman-Ford (negative weights).
  • Connectivity / components: connected_components.
  • Centrality: degree, betweenness, eigenvector — for finding key nodes.
  • Minimum spanning tree: Kruskal, Prim — for building cheapest connected network.
  • Max flow / min cut: for capacity problems.
  • TSP approximations: nearest-neighbor heuristic, 2-opt.

Use case: evacuation sweep routing (2025-A)

Model the building as a graph: rooms and hallway junctions are nodes, doors are edges, edge weight = traversal time. Each responder starts at a known node. Goal: visit all "room" nodes in minimum total time across all responders. This is a vehicle-routing / multi-agent shortest-path variant — use a greedy assignment plus local search, or formulate as a MIP if the graph is small.
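networkx provides all of these, but Dijkstra is also easy to write dependency-free with heapq, which makes it simple to explain in the paper (the building graph below is a toy):

```python
import heapq

def dijkstra(graph, start):
    """Shortest-path distances from start; graph: {node: [(neighbor, weight), ...]}."""
    dist = {start: 0.0}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue                          # stale queue entry; skip
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

# Rooms/junctions as nodes, doors as edges, weights = traversal time (seconds)
building = {"entry": [("hall", 5)], "hall": [("roomA", 3), ("roomB", 4)],
            "roomA": [("roomB", 1)], "roomB": []}
d = dijkstra(building, "entry")
```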


13 · Agent-based simulation

When to use. Individuals (people, bees, buses) interact according to local rules, and you want emergent behavior. Especially good when an ODE doesn't capture spatial or behavioral heterogeneity.

Sketch. Define agent state (position, status). Define a step function that updates all agents per time tick. Run the simulation. Track aggregate quantities over time. Visualize.

Mesa (Python) is a clean ABM framework. For HiMCM-scale problems, writing a custom loop in NumPy is often simpler and faster.
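A custom-loop sketch in NumPy: agents with heterogeneous speeds move toward an exit, and the emergent quantity is the evacuation curve (all numbers are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n_agents = 100
pos = rng.uniform(0, 50, n_agents)              # agent state: meters from exit
speed = rng.normal(1.4, 0.2, n_agents).clip(0.5)  # heterogeneous speeds, m/s

evacuated_over_time = []
for tick in range(100):                          # 1-second ticks
    pos = np.maximum(pos - speed, 0.0)           # step rule: move toward the exit
    evacuated_over_time.append(int((pos == 0).sum()))  # aggregate to track
```

The evacuation curve (cumulative agents out vs. time) is the emergent output an ODE would smooth over.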


How top teams typically combine these

From the 2024 judges' commentary, here's the recipe Outstanding teams used on Problem A (Olympic SDE selection):

  1. Identify factors (criteria) — qualitative reasoning + literature.
  2. Get data for each factor across alternatives (some from datasets, some from scraping Google Trends / social media).
  3. Use AHP and Entropy Weight Method to compute two sets of weights; combine them.
  4. Score alternatives with TOPSIS using the combined weights.
  5. Apply Monte Carlo on the weights to get a distribution of rankings (MC-TOPSIS) — this is the sensitivity analysis.
  6. Pick best alternatives, write recommendations.

For Problem B (HPC environmental impact):

  1. Estimate current energy use using engineering reasoning + published reports.
  2. Build a base carbon model: emissions = energy × emission factor, weighted by energy mix.
  3. Extrapolate with logistic growth (recognizing limits) instead of pure exponential.
  4. Use Markov chain over energy-mix states to model long-term transition.
  5. Add a second model for a chosen secondary impact (water / e-waste).
  6. Sensitivity analysis on the assumed growth rate and mix transitions.

Pattern. Almost every winning paper uses one main model plus one secondary technique for forecasting/uncertainty. Don't try to use eight techniques. Use two well.