The math you actually need

Five areas: linear algebra, probability & statistics, single- & multivariable calculus, convex optimization, and concentration inequalities. You don't need a full undergraduate sequence — you need the parts that make gradient descent, PCA, and the bias / variance tradeoff legible.

Syllabus link. This page tracks the math block of the official USAAIO syllabus. If something here disagrees with the syllabus, the syllabus wins.

TL;DR. By the end of this page you should be able to (1) multiply matrices and read off rank/eigenvalues/SVD shapes, (2) apply Bayes' rule and compute expectations of common distributions, (3) compute derivatives/integrals and gradients/Hessians of small functions and chain-rule a 2-layer MLP, (4) state when gradient descent converges and (5) bound a tail probability with Hoeffding's inequality. USAAIO problems test these as quick MCQ ("what is the rank of this matrix?", "is this loss convex?") and as the derivations underneath the ML/DL pages.

1. Linear algebra

Concept

A vector x ∈ ℝⁿ is an arrow with magnitude and direction; a matrix A ∈ ℝᵐˣⁿ is a linear map taking ℝⁿ → ℝᵐ. The dot product xᵀy = Σ xᵢ yᵢ measures alignment and factors as ‖x‖‖y‖cos θ. Norms quantify size: ‖x‖₁ = Σ|xᵢ| (L1), ‖x‖₂ = √(Σ xᵢ²) (L2), ‖x‖∞ = max|xᵢ|.

The rank of A is the dimension of its column space — the number of linearly independent directions it can output. Eigenvalues satisfy A v = λ v; eigenvectors are directions that don't rotate under A. For a symmetric matrix, eigenvectors are orthogonal and eigenvalues are real — this is what makes PCA work, since the covariance matrix is symmetric.

SVD is the universal factorization: any X ∈ ℝᵐˣⁿ can be written X = U Σ Vᵀ with U, V orthogonal and Σ diagonal with non-negative singular values. SVD underlies low-rank approximation (keep the top-k singular values), PCA (right singular vectors of centered data = principal components), and the pseudo-inverse used in ridge regression.

Worked example — PCA by hand on 5 points

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0]])

# 1. center
Xc = X - X.mean(axis=0)

# 2. covariance matrix (n=5 samples, so divide by n-1=4)
C = (Xc.T @ Xc) / (Xc.shape[0] - 1)

# 3. eigendecomposition (symmetric -> use eigh)
evals, evecs = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# 4. project onto top component
pc1 = evecs[:, 0]
scores = Xc @ pc1
print("eigenvalues:", evals)        # variance per PC
print("PC1 direction:", pc1)        # ~ (0.677, 0.735) — the diagonal
print("variance kept:", evals[0] / evals.sum())

Drills

D1 · Determinant and trace from eigenvalues

A 3×3 matrix has eigenvalues 1, 2, 4. Find det(A) and trace(A).

Solution

det(A) = ∏λᵢ = 1·2·4 = 8. trace(A) = Σλᵢ = 7. These identities hold for any square matrix (over ℂ, counting multiplicity).

D2 · SVD shapes

X is 100×5. Give the shapes of U, Σ, Vᵀ in the thin SVD.

Solution

Thin SVD: U is 100×5, Σ is 5×5 diagonal, Vᵀ is 5×5. Reconstruction: X = U Σ Vᵀ. The "thin" form drops the (100−5) zero singular values.

D3 · Rank of a rank-1 update

Let A = u vᵀ where u ∈ ℝ⁵, v ∈ ℝ³, both non-zero. What is rank(A)?

Solution

rank(A) = 1. Every column of A is a scalar multiple of u, so the column space is the 1-dimensional span of u.

D4 · PCA variance fraction

Covariance eigenvalues are 12, 6, 1.5, 0.5. What fraction of variance is captured by the top two PCs?

Solution

(12 + 6) / (12 + 6 + 1.5 + 0.5) = 18/20 = 0.90 → 90%.

D5 · Orthogonality of symmetric eigenvectors

Show that if A = Aᵀ and A v₁ = λ₁ v₁, A v₂ = λ₂ v₂ with λ₁ ≠ λ₂, then v₁ᵀ v₂ = 0.

Solution

Compute v₁ᵀ A v₂ two ways. From A v₂ = λ₂ v₂: v₁ᵀ A v₂ = λ₂ v₁ᵀ v₂. Using A = Aᵀ: v₁ᵀ A v₂ = (A v₁)ᵀ v₂ = λ₁ v₁ᵀ v₂. So (λ₁ − λ₂) v₁ᵀ v₂ = 0, and since λ₁ ≠ λ₂, v₁ᵀ v₂ = 0.

2. Probability & statistics

This section covers core probability and statistics topics, from basic expectation and variance formulas to Maximum Likelihood Estimation and Concentration Inequalities. Through our four-layer "Core Concept ➔ Mathematical Deep Dive ➔ Geometric Intuition & Physical Meaning ➔ Common Pitfalls & Drills" structure, you will master these high-frequency USAIO syllabus topics.

1. Linearity of Expectation & Variance Formulas

Core Concept:

Linearity of Expectation: $\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]$ (holds for any random variables, no independence required)
Variance Formula: $\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$

# Expectation and Variance in PyTorch
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
mean_val = x.mean()               # Corresponds to mathematical expectation E[X]
var_pop = x.var(unbiased=False)   # Population variance (divided by N)
var_sample = x.var(unbiased=True) # Sample variance (divided by N-1, PyTorch default)

Expand to View Mathematical Deep Dive & Proofs

1. Proof of Linearity of Expectation (Continuous Case)

Let $X$ and $Y$ be continuous random variables with joint probability density function $p(x, y)$. By the definition of mathematical expectation:

$$\mathbb{E}[aX + bY] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (ax + by) p(x, y) \, dx \, dy$$

Using the linearity of integrals, we can split the double integral into two parts:

$$\mathbb{E}[aX + bY] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} ax p(x, y) \, dx \, dy + \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} by p(x, y) \, dx \, dy$$

Factor out the constants $a$ and $b$, and swap the order of integration:

$$\mathbb{E}[aX + bY] = a \int_{-\infty}^{\infty} x \left( \int_{-\infty}^{\infty} p(x, y) \, dy \right) dx + b \int_{-\infty}^{\infty} y \left( \int_{-\infty}^{\infty} p(x, y) \, dx \right) dy$$

Using the definition of marginal probability density functions:

$$p_X(x) = \int_{-\infty}^{\infty} p(x, y) \, dy, \quad p_Y(y) = \int_{-\infty}^{\infty} p(x, y) \, dx$$

Substituting these back into the equation:

$$\mathbb{E}[aX + bY] = a \int_{-\infty}^{\infty} x p_X(x) \, dx + b \int_{-\infty}^{\infty} y p_Y(y) \, dy = a\mathbb{E}[X] + b\mathbb{E}[Y]$$

Q.E.D. Note that we never assumed $X$ and $Y$ are independent (i.e., we did not use the condition $p(x, y) = p_X(x) p_Y(y)$). Therefore, Linearity of Expectation holds unconditionally for any random variables, regardless of their dependence.

2. Proof of the Variance Formula

Variance is defined as the expected squared deviation of a random variable from its mean (let $\mu = \mathbb{E}[X]$):

$$\text{Var}(X) = \mathbb{E}[(X - \mu)^2]$$

Expanding the squared term inside the expectation operator:

$$(X - \mu)^2 = X^2 - 2\mu X + \mu^2$$

By the linearity of expectation, we apply the expectation operator $\mathbb{E}$ to each term individually:

$$\text{Var}(X) = \mathbb{E}[X^2 - 2\mu X + \mu^2] = \mathbb{E}[X^2] - \mathbb{E}[2\mu X] + \mathbb{E}[\mu^2]$$

Since $\mu = \mathbb{E}[X]$ is a constant, the expectation of a constant is the constant itself, and we can factor constants out of the expectation operator:

$$\mathbb{E}[2\mu X] = 2\mu \mathbb{E}[X] = 2\mu \cdot \mu = 2\mu^2$$ $$\mathbb{E}[\mu^2] = \mu^2$$

Substitute these back into the equation:

$$\text{Var}(X) = \mathbb{E}[X^2] - 2\mu^2 + \mu^2 = \mathbb{E}[X^2] - \mu^2 = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

Q.E.D.

Expand to View Geometric Intuition & Physical Meaning

1. "Centroid Superposition" of Expectation

If we view a probability density function as a physical mass distribution, the mathematical expectation $\mathbb{E}[X]$ is the center of mass (centroid) of the system.

When you join two physical bodies with centroids at different coordinates, the centroid of the combined system depends only on the individual centroids and their relative mass weights.
Even if the two bodies are connected by a stiff spring (dependent), their combined center of mass is still the weighted sum of their individual centers of mass.
In machine learning, this property forms the basis of ensemble methods like Bagging and Random Forests. The expected prediction of an ensemble is simply the average of the individual models' expectations, regardless of the correlation between the models.

2. Variance and "Moment of Inertia"

In physics, the moment of inertia measures a body's resistance to rotational acceleration and is defined as $\int r^2 dm$, which is mathematically identical to the variance formula $\mathbb{E}[(X - \mu)^2]$:

The mean $\mu$ acts as the rotation axis (centroid).
The variance represents the moment of inertia about this axis, describing the spread of the probability mass.
The variance formula $\text{Var}(X) = \mathbb{E}[X^2] - \mu^2$ is the statistical equivalent of the Parallel Axis Theorem: the moment of inertia about any axis ($\mathbb{E}[X^2]$) is equal to the moment of inertia about the centroid axis ($\text{Var}(X)$) plus the total mass times the squared distance between the axes ($1 \cdot \mu^2$).

⚠️ Common Pitfalls & Exam Tips

Expectation of a Product: $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$ only holds if $X$ and $Y$ are independent (or uncorrelated). In general: $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y] + \text{Cov}(X, Y)$, where $\text{Cov}(X, Y)$ is the covariance.
Variance of a Sum: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$ only holds if $X$ and $Y$ are uncorrelated. In the general case: $$\text{Var}(aX + bY) = a^2\text{Var}(X) + b^2\text{Var}(Y) + 2ab\text{Cov}(X, Y)$$
PyTorch Bessel's Correction: By default, `x.var()` computes the sample variance (with Bessel's correction, dividing by $N-1$, i.e., `unbiased=True`). If a mathematical derivation or competitive programming test requires the population variance (divided by $N$), you must explicitly set `unbiased=False`.

2. Maximum Likelihood Estimation (MLE) & ML Loss Functions

Core Concept:

Maximum Likelihood Estimation (MLE) finds the parameters $\theta$ that maximize the likelihood of the observed dataset $\mathcal{D}$:
$$\theta_{\text{MLE}} = \arg\max_\theta L(\theta) = \arg\max_\theta \prod_{i=1}^n p(\mathbf{x}_i, y_i \mid \theta) = \arg\min_\theta - \sum_{i=1}^n \ln p(\mathbf{x}_i, y_i \mid \theta)$$ Thus, Maximizing Likelihood $\iff$ Minimizing Negative Log-Likelihood (NLL) Loss.

# Loss Functions in PyTorch and MLE equivalence
import torch.nn as nn

# 1. Mean Squared Error (Equivalent to MLE under Gaussian noise)
mse_loss = nn.MSELoss()

# 2. Binary Cross-Entropy (Equivalent to MLE under Bernoulli trials)
bce_loss = nn.BCELoss() # Requires manual Sigmoid activation beforehand
bce_logits_loss = nn.BCEWithLogitsLoss() # Incorporates Sigmoid and uses log-sum-exp for numerical stability

Expand to View Mathematical Deep Dive & Proofs

Derivation 1: Equivalence of Linear Regression (Gaussian Noise) and MSE Loss

Assume a linear regression model predicts $\hat{y}_i = \mathbf{w}^\top \mathbf{x}_i + b$. Let the model prediction error (noise) be $\epsilon_i = y_i - \hat{y}_i$, where $\epsilon_i$ are independent and identically distributed (i.i.d.) according to a zero-mean Gaussian distribution: $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.

Thus, given features $\mathbf{x}_i$ and parameters $\mathbf{w}, b$, the target label $y_i$ follows the conditional distribution:

$$y_i \mid \mathbf{x}_i; \mathbf{w}, b \sim \mathcal{N}(\mathbf{w}^\top \mathbf{x}_i + b, \sigma^2)$$

The probability density function for a single point is:

$$p(y_i \mid \mathbf{x}_i; \mathbf{w}, b) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - (\mathbf{w}^\top \mathbf{x}_i + b))^2}{2\sigma^2} \right)$$

The likelihood function $L(\mathbf{w}, b)$ over the entire dataset of $n$ samples is the product of individual densities:

$$L(\mathbf{w}, b) = \prod_{i=1}^n p(y_i \mid \mathbf{x}_i; \mathbf{w}, b) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - (\mathbf{w}^\top \mathbf{x}_i + b))^2}{2\sigma^2} \right)$$

Taking the natural logarithm of both sides gives the log-likelihood:

$$\ln L(\mathbf{w}, b) = \sum_{i=1}^n \ln \left( \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - (\mathbf{w}^\top \mathbf{x}_i + b))^2}{2\sigma^2} \right) \right)$$ $$\ln L(\mathbf{w}, b) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - (\mathbf{w}^\top \mathbf{x}_i + b))^2$$

To find the MLE parameters, we maximize $\ln L(\mathbf{w}, b)$. Since the term $-\frac{n}{2}\ln(2\pi\sigma^2)$ and the coefficient $\frac{1}{2\sigma^2}$ are constants with respect to the optimization parameters $\mathbf{w}$ and $b$:

$$\arg\max_{\mathbf{w}, b} \ln L(\mathbf{w}, b) = \arg\max_{\mathbf{w}, b} \left[ - \sum_{i=1}^n (y_i - (\mathbf{w}^\top \mathbf{x}_i + b))^2 \right] = \arg\min_{\mathbf{w}, b} \sum_{i=1}^n (y_i - (\mathbf{w}^\top \mathbf{x}_i + b))^2$$

Normalizing by the factor $1/n$ does not alter the optimal parameters. Hence, maximizing the likelihood is mathematically equivalent to minimizing the Mean Squared Error (MSE) Loss:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

Q.E.D.

Derivation 2: Equivalence of Logistic Regression (Bernoulli Trials) and Binary Cross-Entropy (BCE) Loss

In a binary classification task, labels $y_i \in \{0, 1\}$. Let the predicted probability be $p_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b) \in [0, 1]$ representing:

$$P(y_i = 1 \mid \mathbf{x}_i; \mathbf{w}, b) = p_i, \quad P(y_i = 0 \mid \mathbf{x}_i; \mathbf{w}, b) = 1 - p_i$$

This conditional probability can be written in the compact Bernoulli distribution form:

$$P(y_i \mid \mathbf{x}_i; \mathbf{w}, b) = p_i^{y_i} (1 - p_i)^{1 - y_i}$$

The likelihood function $L(\mathbf{w}, b)$ for $n$ i.i.d. samples is:

$$L(\mathbf{w}, b) = \prod_{i=1}^n p_i^{y_i} (1 - p_i)^{1 - y_i}$$

Taking the natural logarithm yields the log-likelihood:

$$\ln L(\mathbf{w}, b) = \sum_{i=1}^n \left[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right]$$

Maximizing the log-likelihood is equivalent to minimizing the Negative Log-Likelihood (NLL). Normalizing by $1/n$ gives:

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right]$$

This is exactly the standard Binary Cross-Entropy (BCE) Loss.

Q.E.D.

Bonus: What if the Noise is Laplacian? (Laplace Noise & L1 Loss)

If we assume the error follows a Laplace distribution $\epsilon_i \sim \text{Laplace}(0, b)$, its probability density is $p(\epsilon_i) = \frac{1}{2b} \exp\left(-\frac{|\epsilon_i|}{b}\right)$. The log-likelihood is:

$$\ln L(\mathbf{w}, b) = -n\ln(2b) - \frac{1}{b} \sum_{i=1}^n |y_i - (\mathbf{w}^\top \mathbf{x}_i + b)|$$

Maximizing this likelihood is equivalent to minimizing:

$$\sum_{i=1}^n |y_i - (\mathbf{w}^\top \mathbf{x}_i + b)|$$

This is the L1 Loss (Mean Absolute Error, MAE). Thus: Gaussian Noise $\iff$ L2 Loss (MSE), while Laplacian Noise $\iff$ L1 Loss (MAE). This is a classic conceptual question on the USAIO!

Expand to View Geometric Intuition & Physical Meaning

1. Gaussian Noise and the Quadratic Penalty of MSE

The Gaussian distribution assumes that small errors are common, whereas extremely large errors (outliers) are exponentially rare. Thus, if a model makes a massive prediction error, the likelihood of that observation under a Gaussian assumption is virtually zero.

To prevent this, the loss function uses a **quadratic (L2) penalty** that disproportionately punishes large errors.
This explains why MSE is highly sensitive to **outliers**: a single outlier with a large error can dominate the loss value and drag the regression line far away from the rest of the data.

2. Laplacian Noise and the Robust L1 Loss

The Laplace distribution is peakier and has fatter tails than the Gaussian distribution, meaning it assigns higher probability to larger deviations (outliers).

Its corresponding loss function uses a linear absolute (L1) penalty, which does not blow up for large errors.
This gives L1 Loss its **robustness**: it can tolerate extreme noise and outliers without destabilizing the fitted model.

⚠️ Common Pitfalls & Exam Tips

Sigmoid Saturation: In logistic regression, if we compute $p_i = \sigma(z)$ and then evaluate `loss = - (y * log(p) + (1-y) * log(1-p))` manually, numerical underflow/overflow can lead to `NaN` values when $p_i$ is very close to 0 or 1. **Best Practice**: In PyTorch, always use `nn.BCEWithLogitsLoss()`, which fuses the Sigmoid and Binary Cross-Entropy calculations using numerically stable log-sum-exp implementations.
Ignoring the Prior: MLE assumes the parameter vector $\theta$ is a fixed, unknown constant and ignores its prior probability. If we model $\theta$ as a random variable with a prior distribution $P(\theta)$, we use Maximum A Posteriori (MAP) estimation. In ML, MAP estimation directly corresponds to adding L2 regularization (Weight Decay) to the loss function.

3. Bayes' Theorem & "Belief Update" Dynamic Process

Core Concept:

The core of Bayesian inference is updating our belief about parameters as we observe data:
$$P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta) \, P(\theta)}{P(\mathcal{D})} \propto P(\mathcal{D} \mid \theta) \, P(\theta)$$ In short: $\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$.

# Discrete Bayesian Update: Disease Screening Simulator
def bayesian_update(prior, sensitivity, specificity, test_result):
    """
    prior: P(D) prior probability of having the disease
    sensitivity: P(Pos | D) true positive rate
    specificity: P(Neg | NoD) true negative rate
    test_result: 'positive' or 'negative'
    """
    p_pos_given_disease = sensitivity
    p_pos_given_no_disease = 1.0 - specificity # false positive rate
    
    # 1. Total probability P(Pos)
    p_pos = (p_pos_given_disease * prior) + (p_pos_given_no_disease * (1.0 - prior))
    
    # 2. Posterior probability P(D | Pos)
    if test_result == 'positive':
        posterior = (p_pos_given_disease * prior) / p_pos
    else:
        # P(Neg)
        p_neg = ((1.0 - sensitivity) * prior) + (specificity * (1.0 - prior))
        posterior = ((1.0 - sensitivity) * prior) / p_neg
    return posterior

# Example: Rare disease with prior P(D) = 0.001 (0.1%), test is 99% accurate (both ways)
post_pos = bayesian_update(prior=0.001, sensitivity=0.99, specificity=0.99, test_result='positive')
print(f"Posterior probability of disease given positive test: {post_pos:.4f}") # Output: 0.0901 (~9%)

Expand to View Mathematical Deep Dive & Proofs

1. Proof of Discrete Bayes' Rule & Total Probability

Let events $A_1, A_2, \dots, A_k$ partition the sample space (mutually exclusive and exhaustive). For any observable event $B$ with $P(B) > 0$:

$$P(A_i \mid B) = \frac{P(A_i \cap B)}{P(B)}$$

Using the conditional probability definition $P(A_i \cap B) = P(B \mid A_i)P(A_i)$ and expanding the denominator $P(B)$ using the Law of Total Probability:

$$P(B) = \sum_{j=1}^k P(B \mid A_j)P(A_j)$$

Substituting back yields the discrete Bayes' formula:

$$P(A_i \mid B) = \frac{P(B \mid A_i) P(A_i)}{\sum_{j=1}^k P(B \mid A_j) P(A_j)}$$

2. Continuous Case: Beta-Binomial Conjugacy for Parameter Estimation

Suppose we want to estimate the probability $\theta \in [0, 1]$ of a coin landing heads.

Prior Distribution: Let $\theta$ follow a Beta distribution, $\theta \sim \text{Beta}(\alpha, \beta)$, with density: $$p(\theta) = \frac{1}{\text{B}(\alpha, \beta)} \theta^{\alpha - 1} (1-\theta)^{\beta - 1} \propto \theta^{\alpha - 1} (1-\theta)^{\beta - 1}$$
Likelihood Function: We perform $n$ independent flips and observe $k$ heads and $n-k$ tails. Under a Binomial model: $$P(\mathcal{D} \mid \theta) = \binom{n}{k} \theta^k (1-\theta)^{n-k} \propto \theta^k (1-\theta)^{n-k}$$
Posterior Distribution: Applying continuous Bayes' rule: $$p(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta) p(\theta) \propto \left[ \theta^k (1-\theta)^{n-k} \right] \cdot \left[ \theta^{\alpha - 1} (1-\theta)^{\beta - 1} \right] = \theta^{k+\alpha-1} (1-\theta)^{n-k+\beta-1}$$

This reveals that the posterior distribution is also a Beta distribution:

$$\theta \mid \mathcal{D} \sim \text{Beta}(\alpha + k, \beta + n - k)$$

Because the prior and posterior belong to the same probability family, Beta is the conjugate prior to the Binomial likelihood. This property allows analytical updates without numerical integration.

Proof of Posterior Variance Shrinkage as $n \to \infty$:

For $\text{Beta}(\alpha + k, \beta + n - k)$, the posterior expectation and variance are:

$$\mathbb{E}[\theta \mid \mathcal{D}] = \frac{\alpha + k}{\alpha + \beta + n}, \quad \text{Var}(\theta \mid \mathcal{D}) = \frac{(\alpha + k)(\beta + n - k)}{(\alpha + \beta + n)^2(\alpha + \beta + n + 1)}$$

As the number of samples $n \to \infty$, the prior parameters $\alpha, \beta$ are overwhelmed:

$$\lim_{n\to\infty} \mathbb{E}[\theta \mid \mathcal{D}] = \lim_{n\to\infty} \frac{k}{n} = \hat{\theta}_{\text{empirical}}$$

For the variance, the denominator is $O(n^3)$ while the numerator is at most $O(n^2)$:

$$\lim_{n\to\infty} \text{Var}(\theta \mid \mathcal{D}) \le \lim_{n\to\infty} \frac{n^2}{n^3} = \lim_{n\to\infty} O\left(\frac{1}{n}\right) = 0$$

This mathematically proves that as the observed data grows to infinity, our posterior expectation converges to the empirical frequency, and the uncertainty (variance) shrinks to 0.

Expand to View Geometric Intuition & Physical Meaning

1. The "Spotlight Focusing" Intuition of Bayesian Updates

Bayesian parameter estimation can be visualized as focusing a spotlight on a target value:

Prior: A broad, dim wash of light cast onto the wall, representing our initial fuzzy guess about the parameters.
Likelihood: Each incoming data point acts as a physical filter that blocks out parameter values that are highly inconsistent with the observed facts.
Posterior: The shape of the light after passing through the filter. As data points pile up (multiplying filters), the fuzzy beam of light narrows and sharpens into a bright, precise laser point centered on the true parameter value.

2. The Base Rate Fallacy Explained via Grid Frequency

Why does a test with 99% accuracy on a rare disease (0.1% base rate) only result in a ~9% posterior probability of illness upon a positive result?

Imagine a population of 100,000 people. With a base rate of 0.1%, only 100 people actually have the disease.
If we test all 100,000 people: the 100 sick individuals will yield 99 positive tests (True Positives).
Among the 99,900 healthy individuals, the 1% false positive rate will yield $99,900 \times 1\% \approx 999$ positive tests (False Positives).
The total number of positive test results is $99 + 999 = 1,098$. Within this group, the probability of actually being sick is $99 / 1,098 \approx 9.01\%$.
Physical Essence: Because the healthy population is so massive, even a tiny false positive rate produces an absolute number of misdiagnoses that dwarfs the entire pool of true patients.

⚠️ Common Pitfalls & Exam Tips

Base Rate Fallacy: On the USAIO exam, students frequently skip the denominator calculation and assume the posterior equals the test accuracy (e.g., 99%). Always calculate the total probability of a positive test first (the sum of true positives and false positives).
Intractable Integrals in Deep Learning: In our coin example, the Beta prior and Binomial likelihood are conjugate, making the posterior analytical. In deep neural networks, however, we use complex, non-conjugate priors and likelihoods, making the denominator integral analytically intractable. Modern DL bypasses this by using Variational Inference (VI) or Markov Chain Monte Carlo (MCMC) methods.

4. Concentration Inequalities & Generalization Bounds

Core Concept:

Concentration inequalities bound the probability that a random variable deviates from its expectation:
Markov's Inequality: $P(X \ge a) \le \dfrac{\mathbb{E}[X]}{a}$ (requires $X \ge 0$)
Chebyshev's Inequality: $P(|X - \mu| \ge t) \le \dfrac{\text{Var}(X)}{t^2}$ (holds for any distribution)
Hoeffding's Inequality: $P\!\left( \left| \bar{X} - \mathbb{E}[\bar{X}] \right| \ge t \right) \le 2\exp\!\left( -\dfrac{2n^2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right)$ (requires independent, bounded variables $X_i \in [a_i, b_i]$)

Comparison of Concentration Inequalities:

Inequality	Conditions Required	Known Parameters	Tail Decay Rate	Typical Application in ML
Markov Inequality	$X \ge 0$ (non-negative)	Mean $\mathbb{E}[X]$	Linear $O(1/t)$ (very loose)	Foundation for proving tighter bounds (e.g., Chernoff bound)
Chebyshev Inequality	None (arbitrary distribution)	Mean $\mathbb{E}[X]$, Variance $\text{Var}(X)$	Quadratic $O(1/t^2)$ (moderate)	Proving the Weak Law of Large Numbers; basic error bounds
Hoeffding's Inequality	Independent and bounded $X_i \in [a_i, b_i]$	Sample size $n$, bounds $a_i, b_i$	Exponential $\exp(-2nt^2)$ (extremely tight)	Deriving generalization error bounds in PAC-Learning

Expand to View Mathematical Deep Dive & Proofs

1. Proof of Markov's Inequality

Let $X$ be a non-negative continuous random variable with PDF $p(x)$. For any constant $a > 0$, the expectation of $X$ is:

$$\mathbb{E}[X] = \int_{0}^{\infty} x p(x) \, dx$$

Split the domain of integration at $a$:

$$\mathbb{E}[X] = \int_{0}^{a} x p(x) \, dx + \int_{a}^{\infty} x p(x) \, dx$$

Since $X \ge 0$ and $p(x) \ge 0$, the first integral is non-negative: $\int_{0}^{a} x p(x) \, dx \ge 0$. Dropping it yields:

$$\mathbb{E}[X] \ge \int_{a}^{\infty} x p(x) \, dx$$

In the interval $[a, \infty)$, we have $x \ge a$. Replacing $x$ with the lower bound $a$ further reduces the integral:

$$\mathbb{E}[X] \ge \int_{a}^{\infty} a p(x) \, dx = a \int_{a}^{\infty} p(x) \, dx$$

By the definition of probability, the remaining integral is exactly $P(X \ge a)$:

$$\mathbb{E}[X] \ge a P(X \ge a) \implies P(X \ge a) \le \frac{\mathbb{E}[X]}{a}$$

Q.E.D.

2. Proof of Chebyshev's Inequality

Let $X$ have mean $\mu$ and variance $\sigma^2$. Define a new random variable $Y = (X - \mu)^2$.

Since $Y$ is a squared term, it is non-negative ($Y \ge 0$). The expectation of $Y$ is the variance of $X$:

$$\mathbb{E}[Y] = \mathbb{E}[(X - \mu)^2] = \text{Var}(X) = \sigma^2$$

Applying Markov's Inequality to $Y$ with threshold $t^2$ (for $t > 0$):

$$P(Y \ge t^2) \le \frac{\mathbb{E}[Y]}{t^2}$$

Substitute $Y = (X - \mu)^2$ and $\mathbb{E}[Y] = \sigma^2$ into the inequality:

$$P((X - \mu)^2 \ge t^2) \le \frac{\sigma^2}{t^2}$$

Since the events $(X-\mu)^2 \ge t^2$ and $|X-\mu| \ge t$ are logically equivalent:

$$P(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2}$$

Q.E.D.

3. Proof Sketch of Hoeffding's Inequality (Chernoff Bound Method)

To prove Hoeffding's inequality, we employ the exponential Chernoff transform:

To bound the probability $P(S_n - \mathbb{E}[S_n] \ge z)$, we multiply by a parameter $s > 0$ and exponentiate both sides:

$$P(S_n - \mathbb{E}[S_n] \ge z) = P\left(e^{s(S_n - \mathbb{E}[S_n])} \ge e^{sz}\right)$$

Applying **Markov's Inequality** to the non-negative exponential term:

$$P(e^{s(S_n - \mathbb{E}[S_n])} \ge e^{sz}) \le e^{-sz} \mathbb{E}\left[ e^{s(S_n - \mathbb{E}[S_n])} \right]$$

Since $S_n = \sum X_i$ and the random variables $X_i$ are independent, the expectation of their product factors:

$$\mathbb{E}\left[ e^{s(S_n - \mathbb{E}[S_n])} \right] = \mathbb{E}\left[ \prod_{i=1}^n e^{s(X_i - \mathbb{E}[X_i])} \right] = \prod_{i=1}^n \mathbb{E}\left[ e^{s(X_i - \mathbb{E}[X_i])} \right]$$

Here, we apply Hoeffding's Lemma: for any random variable bounded in $[a, b]$:

$$\mathbb{E}\left[ e^{s(X_i - \mathbb{E}[X_i])} \right] \le \exp\left( \frac{s^2 (b_i - a_i)^2}{8} \right)$$

Substituting this back into the product expression:

$$P(S_n - \mathbb{E}[S_n] \ge z) \le e^{-sz} \prod_{i=1}^n \exp\left( \frac{s^2 (b_i - a_i)^2}{8} \right) = \exp\left( -sz + \frac{s^2 \sum_{i=1}^n (b_i - a_i)^2}{8} \right)$$

To find the tightest upper bound, we minimize the exponent with respect to $s > 0$. Differentiating and setting to zero yields the optimal value $s^* = \frac{4z}{\sum (b_i - a_i)^2}$. Plugging this back in yields the single-sided bound:

$$P(S_n - \mathbb{E}[S_n] \ge z) \le \exp\left( -\frac{2z^2}{\sum_{i=1}^n (b_i - a_i)^2} \right)$$

To estimate the sample mean, we set $\bar{X} = \frac{1}{n} S_n$ and $z = nt$. Symmetrizing for both tails yields the two-sided Hoeffding's inequality:

$$P(|\bar{X} - \mathbb{E}[\bar{X}]| \ge t) \le 2 \exp\left( -\frac{2n^2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right)$$

Q.E.D.

Expand to View Geometric Intuition & Physical Meaning

1. The Information-Bound Hierarchy

These three inequalities showcase how incorporating more information leads to tighter bounds:

Markov (Mean only): We have minimal information. The bound relies solely on the centroid: to keep the center of mass in place, the distribution cannot place too much weight extremely far out. Because we know nothing about the shape, the bound decays slowly at a linear rate $O(1/t)$.
Chebyshev (Mean & Variance): We gain information about the spread ($\sigma$). Knowing how much the distribution fluctuates around the mean forces the probability of large deviations to contract at a quadratic rate $O(1/t^2)$.
Hoeffding's (Independence & Boundedness): The strongest assumption. Because the variables are **independent**, their fluctuations tend to cancel out. The Central Limit Theorem tells us that their sum concentrates tightly around the mean, yielding an **exponentially decaying** probability tail $\exp(-2nt^2)$.

2. The Generalization Gap Safety Rope

In machine learning, we calculate a model's empirical error $E_{\text{in}}$ on a test set, but we care about its true error $E_{\text{out}}$ on unseen data. Hoeffding's Inequality acts as a safety rope: it guarantees that as long as our test data is sampled independently, the probability of our test score lying far from the true generalization ability decays exponentially with the dataset size $n$. This is the foundation of **Probably Approximately Correct (PAC) Learning**.

⚠️ Common Pitfalls & Exam Tips

Independence is Critical: Hoeffding's inequality requires the random variables to be mutually independent. If your data points are correlated (e.g., in a time series or on graph nodes), this inequality is invalid, and you must use Martingale inequalities (like the Azuma-Hoeffding inequality).
Do Not Forget the Coefficient 2: In exam problems, pay close attention to whether you are calculating a one-sided bound ($P(\bar{X} - \mathbb{E}[\bar{X}] \ge t)$) or a two-sided bound ($P(|X - \mu| \ge t)$). Symmetrization adds a factor of 2 in front of the exponential. Omitting this will throw off your sample size calculation!

Probability & Statistics Drills

D1 · Arithmetic of Expectation and Variance

Suppose random variables $X$ and $Y$ satisfy: $\mathbb{E}[X] = 2$, $\mathbb{E}[Y] = -1$, $\text{Var}(X) = 4$, $\text{Var}(Y) = 9$, and their covariance is $\text{Cov}(X, Y) = -2$.
Calculate: (1) $\mathbb{E}[3X - 2Y]$, and (2) $\text{Var}(3X - 2Y)$.

Solution

Part 1: Expectation $\mathbb{E}[3X - 2Y]$

Apply the Linearity of Expectation (which holds regardless of correlation):

$$\mathbb{E}[3X - 2Y] = 3\mathbb{E}[X] - 2\mathbb{E}[Y]$$

Substitute the given values:

$$\mathbb{E}[3X - 2Y] = 3(2) - 2(-1) = 6 + 2 = 8$$

Part 2: Variance $\text{Var}(3X - 2Y)$

Recall the variance formula for a linear combination of random variables:

$$\text{Var}(aX + bY) = a^2\text{Var}(X) + b^2\text{Var}(Y) + 2ab\text{Cov}(X, Y)$$

Here, $a = 3$ and $b = -2$. Substitute these coefficients:

$$\text{Var}(3X - 2Y) = 3^2\text{Var}(X) + (-2)^2\text{Var}(Y) + 2(3)(-2)\text{Cov}(X, Y)$$ $$\text{Var}(3X - 2Y) = 9\text{Var}(X) + 4\text{Var}(Y) - 12\text{Cov}(X, Y)$$

Plug in the parameters $\text{Var}(X) = 4$, $\text{Var}(Y) = 9$, and $\text{Cov}(X, Y) = -2$:

$$\text{Var}(3X - 2Y) = 9(4) + 4(9) - 12(-2) = 36 + 36 + 24 = 96$$

Correct Answers: (1) $\mathbb{E}[3X - 2Y] = 8$; (2) $\text{Var}(3X - 2Y) = 96$.

D2 · Parameter Estimation of an Exponential Distribution via MLE

Let $X$ be a random variable distributed exponentially with PDF: $p(x \mid \lambda) = \lambda e^{-\lambda x}$ (where $x \ge 0, \lambda > 0$).
Given $n$ i.i.d. observations $\mathcal{D} = \{x_1, x_2, \dots, x_n\}$, derive the Maximum Likelihood Estimator $\hat{\lambda}_{\text{MLE}}$ of the parameter $\lambda$.

Solution

Step 1: Construct the Likelihood Function $L(\lambda)$

Since the observations are independent, the joint density is the product of individual densities:

$$L(\lambda) = \prod_{i=1}^n p(x_i \mid \lambda) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda \sum_{i=1}^n x_i}$$

Step 2: Construct the Log-Likelihood Function $\ln L(\lambda)$

$$\ln L(\lambda) = \ln \left( \lambda^n e^{-\lambda \sum_{i=1}^n x_i} \right) = n \ln \lambda - \lambda \sum_{i=1}^n x_i$$

Step 3: Differentiate with respect to $\lambda$ and set to 0

$$\frac{d}{d\lambda} \ln L(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^n x_i = 0$$

Step 4: Solve for $\hat{\lambda}_{\text{MLE}}$

$$\frac{n}{\lambda} = \sum_{i=1}^n x_i \implies \hat{\lambda}_{\text{MLE}} = \frac{n}{\sum_{i=1}^n x_i} = \frac{1}{\bar{x}}$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ is the empirical sample mean.

Conclusion: The MLE of $\lambda$ is the reciprocal of the sample mean ($1/\bar{x}$). This matches the theoretical expectation of the exponential distribution, which is $\mathbb{E}[X] = 1/\lambda$.

D3 · Hoeffding's Inequality and Test Set Sample Size Bounds

Suppose we evaluate a classifier on a test set of $n$ independent samples. For each sample $i$, the classification loss is $L_i \in [0, 1]$ ($0$ for correct, $1$ for incorrect). Let the true generalization error be $E_{\text{out}}$, and the empirical test error be $E_{\text{in}}$.
We want to guarantee that the generalization gap $|E_{\text{in}} - E_{\text{out}}| \le 0.05$ holds with a probability of at least $95\%$ (i.e., $95\%$ confidence level). Calculate the minimum number of test samples $n$ required.

Solution

Step 1: Write down the two-sided Hoeffding's Inequality

Each sample loss is bounded: $L_i \in [0, 1] \implies b_i - a_i = 1$.

The probability that the sample average $E_{\text{in}} = \frac{1}{n}\sum_{i=1}^n L_i$ deviates from its mean $E_{\text{out}}$ by more than $\epsilon$ is:

$$P(|E_{\text{in}} - E_{\text{out}}| \ge \epsilon) \le 2 \exp\left( -\frac{2 n^2 \epsilon^2}{\sum_{i=1}^n (1 - 0)^2} \right) = 2 \exp(-2 n \epsilon^2)$$

Step 2: Map to the confidence and error requirements

We want the gap to be small with high probability: $P(|E_{\text{in}} - E_{\text{out}}| \le \epsilon) \ge 1 - \delta = 0.95$.

This is equivalent to bounding the failure probability: $P(|E_{\text{in}} - E_{\text{out}}| \ge \epsilon) \le \delta = 0.05$.

Thus, we set the Hoeffding upper bound to be at most $\delta$:

$$2 \exp(-2 n \epsilon^2) \le \delta$$

Step 3: Solve for $n$ algebraically

1. Divide by 2:

$$\exp(-2 n \epsilon^2) \le \frac{\delta}{2}$$

2. Take the natural logarithm $\ln$ of both sides:

$$-2 n \epsilon^2 \le \ln\left( \frac{\delta}{2} \right)$$

3. Multiply by $-1$ (flips the inequality sign):

$$2 n \epsilon^2 \ge -\ln\left( \frac{\delta}{2} \right) = \ln\left( \frac{2}{\delta} \right)$$

4. Divide by $2\epsilon^2$ to find the lower bound on $n$:

$$n \ge \frac{1}{2\epsilon^2} \ln\left( \frac{2}{\delta} \right)$$

Step 4: Substitute the values and compute

Given $\epsilon = 0.05$ and $\delta = 0.05$:

$$n \ge \frac{1}{2 \times (0.05)^2} \ln\left( \frac{2}{0.05} \right) = \frac{1}{0.005} \ln(40) = 200 \ln(40)$$

Using standard logarithmic approximations: $\ln(40) \approx 3.688879$.

$$n \ge 200 \times 3.688879 = 737.78$$

Since the sample size $n$ must be an integer, we round up (using the ceiling function):

$$n \ge 738$$

Correct Answer: We need at least 738 independent test samples.

D4 · Independent ≠ uncorrelated (in general)

Let X ∼ Uniform(−1,1) and Y = X². Show Cov(X,Y) = 0 but X, Y are dependent.

Solution

E[X] = 0, E[XY] = E[X³] = 0 by symmetry, so Cov(X,Y) = 0. But Y is a deterministic function of X, so they cannot be independent. (For jointly Gaussian variables zero covariance does imply independence; in general it does not.)

D5 · Gaussian MLE for variance

Given i.i.d. x₁,…,xₙ ∼ 𝒩(μ, σ²) with μ known, derive the MLE of σ².

Solution

Log-likelihood (dropping constants): ℓ = −n/2 log σ² − (1/(2σ²)) Σ (xᵢ−μ)². Differentiate w.r.t. σ²: −n/(2σ²) + (1/(2σ⁴)) Σ(xᵢ−μ)² = 0 → σ̂² = (1/n) Σ(xᵢ−μ)². (This MLE is biased when μ is also estimated; the unbiased version divides by n−1.)

3. Calculus (Single & Multivariable)

Concept

Single-variable differentiation measures the rate of change of a function. The derivative f'(x) = df/dx = lim_{h→0} (f(x+h) − f(x))/h represents the slope of the tangent line at x. Key rules include the power rule (d/dx(xᵏ) = k xᵏ⁻¹), the product rule ((fg)' = f'g + fg'), the quotient rule ((f/g)' = (f'g − fg')/g²), and the chain rule (d/dx[f(g(x))] = f'(g(x)) g'(x)).

Integration is the reverse of differentiation. The definite integral ∫ₐᵇ f(x) dx computes the signed area under the curve f(x) from a to b. The Fundamental Theorem of Calculus states that if F'(x) = f(x), then ∫ₐᵇ f(x) dx = F(b) − F(a). In ML/AI, integration is essential for computing expectations of continuous random variables (E[X] = ∫ x p(x) dx), normalizing probability density functions, and working with continuous-time systems (e.g., diffusion models).

A partial derivative ∂f/∂xᵢ measures the rate of change of f when only xᵢ moves. The gradient ∇f = (∂f/∂x₁, …, ∂f/∂xₙ) is a vector pointing in the direction of steepest ascent; its magnitude is the slope in that direction. Gradient descent is just x ← x − η ∇f(x).

The Jacobian of f: ℝⁿ → ℝᵐ is the m × n matrix of partials Jᵢⱼ = ∂fᵢ/∂xⱼ. The Hessian H of a scalar function is the n × n matrix of second partials; if H ≻ 0 (positive definite) at a critical point, that point is a strict local minimum, and the function is locally convex.

The chain rule for composition z = f(g(x)) is ∂z/∂x = (∂f/∂g)(∂g/∂x). In vector form: J_{f∘g} = J_f · J_g. Backpropagation is nothing more than this chain rule, applied right-to-left through every layer of a neural network. A second-order Taylor expansion at x₀ is f(x₀+δ) ≈ f(x₀) + ∇f(x₀)ᵀ δ + ½ δᵀ H δ; Newton's method uses this to take steps that account for curvature.

Worked example 1 — Numerical differentiation and integration in Python

We approximate the derivative of f(x) = x² + 3x at x = 2 using finite differences: f'(x) ≈ (f(x+h) − f(x))/h. Then we approximate the definite integral of f(x) from 0 to 2 using the trapezoidal rule manually to avoid version dependencies.

import numpy as np

def f(x):
    return x**2 + 3*x

# 1. Numerical differentiation (Finite Difference)
x = 2.0
h = 1e-5
df_approx = (f(x + h) - f(x)) / h
df_exact  = 2 * x + 3
print(f"Derivative at x=2: Approx = {df_approx:.5f}, Exact = {df_exact:.5f}")

# 2. Numerical integration (Trapezoidal Rule manual implementation)
# Analytical integral of x^2 + 3x from 0 to 2 is [x^3/3 + 1.5*x^2] |0^2 = 8/3 + 6 = 8.66667
a, b = 0.0, 2.0
num_points = 1000
xs = np.linspace(a, b, num_points)
ys = f(xs)
dx = (b - a) / (num_points - 1)
integral_approx = dx * (0.5 * (ys[0] + ys[-1]) + np.sum(ys[1:-1]))
integral_exact  = 8.0/3.0 + 6.0
print(f"Integral from 0 to 2: Approx = {integral_approx:.5f}, Exact = {integral_exact:.5f}")

Worked example 2 — Backprop on a 2-layer MLP, by hand and by autograd

Network: z = w x + b, h = ReLU(z), ŷ = v h + c, loss L = ½(ŷ − y)². Compute ∂L/∂w.

Chain rule, layer by layer:

∂L/∂ŷ = (ŷ − y); ∂ŷ/∂h = v; ∂h/∂z = 𝟙[z > 0]; ∂z/∂w = x. Therefore ∂L/∂w = (ŷ − y) · v · 𝟙[z > 0] · x.

import torch

w = torch.tensor(0.7, requires_grad=True)
b = torch.tensor(-0.1, requires_grad=True)
v = torch.tensor(1.3, requires_grad=True)
c = torch.tensor(0.2, requires_grad=True)
x = torch.tensor(2.0)
y = torch.tensor(1.5)

z   = w * x + b
h   = torch.relu(z)
y_hat = v * h + c
L   = 0.5 * (y_hat - y) ** 2
L.backward()

# manual:
manual = (y_hat - y).item() * v.item() * (1.0 if z.item() > 0 else 0.0) * x.item()
print(w.grad.item(), manual)    # equal within FP error

Drills

D1 · Single-variable differentiation

Find the derivative of f(x) = x e⁻ˣ².

Solution

Apply the product rule (uv)' = u'v + uv' where u = x and v = e⁻ˣ². We have u' = 1, and by the chain rule, v' = -2x e⁻ˣ². Thus, f'(x) = 1 · e⁻ˣ² + x · (-2x e⁻ˣ²) = (1 − 2x²) e⁻ˣ².

D2 · Integration by substitution

Evaluate the definite integral ∫₀⁰⁰ x e⁻ˣ² dx (from 0 to infinity).

Solution

Use substitution. Let u = x², so du = 2x dx, which means x dx = ½ du. The limits of integration remain 0 to infinity (as x=0 → u=0 and x→∞ → u→∞). The integral becomes: ∫₀⁰⁰ ½ e⁻ᵘ du = [−½ e⁻ᵘ]₀⁰⁰ = 0 − (−½) = ½. (This integral is commonly used to normalize or find moments of continuous distributions like half-Normal.)

D3 · Gradient and one step

f(x,y) = x² + 4y². Compute ∇f(1,1). With η = 0.1, where do you land after one GD step?

Solution

∇f = (2x, 8y) = (2, 8). Step: (1,1) − 0.1·(2,8) = (0.8, 0.2).

D4 · Hessian and convexity

Is f(x,y) = x² + xy + y² convex?

Solution

Hessian H = [[2,1],[1,2]]. Eigenvalues: 2 ± 1 = 1, 3, both positive. So H ≻ 0 globally and f is strictly convex.

D5 · Chain rule through softmax + cross-entropy

For logits z and one-hot label y, with p = softmax(z) and L = − Σᵢ yᵢ log pᵢ, show ∂L/∂z = p − y.

Solution

∂pᵢ/∂zⱼ = pᵢ(δᵢⱼ − pⱼ). Then ∂L/∂zⱼ = −Σᵢ yᵢ (1/pᵢ)·pᵢ(δᵢⱼ − pⱼ) = −yⱼ + pⱼ Σᵢ yᵢ = pⱼ − yⱼ (using Σyᵢ = 1). This identity is why cross-entropy + softmax has such a clean gradient and is the canonical classifier loss.

D6 · Jacobian of a linear layer

y = Wx + b with W ∈ ℝᵐˣⁿ. What is ∂y/∂x? What about ∂L/∂W for a scalar loss L (express in terms of ∂L/∂y and x)?

Solution

∂y/∂x = W. ∂L/∂W = (∂L/∂y) xᵀ — an outer product of the upstream gradient with the layer's input. This is exactly what PyTorch's autograd computes.

D7 · Taylor expansion of log(1 + x)

Expand log(1+x) to second order around x = 0 and use it to approximate log(1.05).

Solution

log(1+x) ≈ x − x²/2. At x = 0.05: 0.05 − 0.00125 = 0.04875. True value ≈ 0.04879.

4. Convex optimization

Concept

A set is convex if it contains the line segment between any two of its points. A function is convex if its epigraph is convex; equivalently, for any t ∈ [0,1], f(t x + (1−t) y) ≤ t f(x) + (1−t) f(y) (Jensen). A twice-differentiable function is convex iff its Hessian is positive semidefinite everywhere. Squared error, logistic loss, hinge loss, and cross-entropy are all convex in their linear-model weights.

Gradient descent: x_{k+1} = x_k − η ∇f(x_k). For an L-smooth convex f, choosing η ≤ 1/L guarantees monotonic decrease and convergence at rate O(1/k). Momentum adds a velocity term: v ← βv + ∇f, x ← x − ηv; it accelerates along consistent directions and damps oscillation. Adam rescales each coordinate by its running RMS gradient, giving adaptive step sizes.

Constraints via Lagrange duality: to minimise f(x) subject to gᵢ(x) ≤ 0 and hⱼ(x) = 0, form the Lagrangian L(x, λ, ν) = f(x) + Σ λᵢ gᵢ(x) + Σ νⱼ hⱼ(x) with λᵢ ≥ 0. The KKT conditions (stationarity, primal feasibility, dual feasibility, complementary slackness) are necessary at an optimum; for convex problems they are also sufficient. SVMs are derived by writing the margin-maximisation problem and solving its KKT system.

Worked example — Constrained quadratic via Lagrange

Minimize f(x,y) = x² + y² subject to x + y = 1.

Lagrangian: L = x² + y² − λ(x + y − 1). Stationarity: 2x − λ = 0, 2y − λ = 0, so x = y. Constraint: 2x = 1 → x = y = ½. Minimum value = ½.

import numpy as np

# verify numerically via gradient descent on the unconstrained surrogate
# f(x) + (mu/2)(x+y-1)^2  with increasing penalty mu
x, y = 0.0, 0.0
lr, mu = 0.1, 50.0
for _ in range(500):
    gx = 2*x + mu*(x+y-1)
    gy = 2*y + mu*(x+y-1)
    x -= lr*gx; y -= lr*gy
print(x, y, x+y)     # ~ 0.5, 0.5, 1.0

Drills

D1 · Convexity of logistic loss

Show that ℓ(w) = log(1 + exp(−y wᵀ x)) is convex in w (with y ∈ {−1,+1}, x fixed).

Solution

log(1 + exp(z)) is convex in z (its second derivative is σ(z)(1−σ(z)) ≥ 0). Here z = −y wᵀ x is affine in w. Composition of a convex function with an affine map is convex. Done.

D2 · GD trajectory

Run gradient descent on f(x) = (x − 3)² from x₀ = 0 with η = 0.5. Compute x₁, x₂, x₃.

Solution

f'(x) = 2(x−3). Update x ← x − η·2(x−3) = x − (x−3) = 3. So x₁ = x₂ = x₃ = 3 — one step lands at the optimum because η = 1/L with L = 2.

D3 · Why divergence at large η

For the same f(x) = (x−3)², what is the smallest η that causes divergence?

Solution

Update factor on the deviation x − 3 is (1 − 2η). Divergence when |1 − 2η| > 1, i.e. η > 1. At η = 1 you oscillate around the optimum without converging.

D4 · Adam intuition

In one sentence each, explain why Adam often beats vanilla SGD on transformers but loses to SGD+momentum on ResNets.

Solution

Adam's per-parameter scaling helps when gradients are highly heterogeneous in magnitude (typical of transformer embeddings and layer norms). On convolutional vision nets with relatively uniform gradients and strong weight decay, SGD+momentum reaches a flatter minimum and generalizes better.

D5 · Jensen for the mean

Use Jensen's inequality to show E[X]² ≤ E[X²] for any random variable X with finite second moment.

Solution

g(t) = t² is convex. Jensen: g(E[X]) ≤ E[g(X)] → E[X]² ≤ E[X²]. Equivalently, Var(X) ≥ 0.

5. Hoeffding's inequality & generalisation

Concept

Hoeffding's inequality is the cleanest concentration bound on the cards. For independent X₁,…,Xₙ with Xᵢ ∈ [a, b] and sample mean X̄: P(|X̄ − E[X̄]| ≥ t) ≤ 2 exp(−2 n t² / (b − a)²). The exponential decay in n is the formal reason why test-set evaluation is reliable: the empirical accuracy on a held-out set is, with high probability, within O(1/√n) of the true accuracy. It also gives the simplest PAC-learning bound: a hypothesis class with |H| functions can achieve |train − test| ≤ √( (log|H| + log(2/δ)) / (2n) ) with probability 1 − δ.

Worked example — Confidence interval for accuracy

import math
n, acc, delta = 1000, 0.86, 0.05
# Hoeffding with bounded [0,1]:  t = sqrt(log(2/delta)/(2n))
t = math.sqrt(math.log(2/delta) / (2*n))
print(f"true accuracy in [{acc-t:.3f}, {acc+t:.3f}] with prob >= {1-delta}")
# -> roughly [0.817, 0.903]

Drills

D1 · Tighter bound from larger n

Quadruple n. By what factor does the confidence half-width t shrink?

Solution

t ∝ 1/√n; quadrupling n halves t.

D2 · Why "independent" matters

If your test set is constructed by oversampling each correct prediction, why does Hoeffding fail?

Solution

The samples are no longer independent of the model's behavior, and they are no longer i.i.d. with respect to the true data distribution. The bound assumes both. The empirical accuracy in this rigged sample is biased upward and concentrates around the wrong value.

D3 · Hoeffding vs. CLT

One sentence: how does Hoeffding differ from the Central Limit Theorem in what it guarantees?

Solution

CLT is asymptotic ("for large n, the sample mean is approximately Gaussian"). Hoeffding is a finite-sample, distribution-free upper bound on tail probability with explicit constants — usable for any n.

Checkpoint — answer out loud

Can you compute eigenvalues of a 2×2 matrix and state what they tell you about a linear map?
Can you write down Bayes' rule and plug in numbers without looking?
Can you find the derivative of x e⁻ˣ² and compute the area under x e⁻ˣ² from 0 to infinity?
Can you derive the gradient of MSE loss for linear regression and verify it with autograd?
Can you state when gradient descent converges and what happens if η is too large?
Given a held-out accuracy of 0.84 on 500 samples, can you give a 95% Hoeffding confidence interval?

Next step

Once the math feels mechanical, go install the toolchain on the Python data stack page and start translating the formulas in this section into runnable code. For short-answer practice, the Round 2 theory drills include softmax Jacobians, PCA-via-SVD, Hoeffding bounds, and a worked MLE derivation.