The math you actually need
Four areas: linear algebra, probability & statistics, multivariable calculus, convex optimization. You don't need a full undergraduate sequence — you need the parts that make gradient descent, PCA, and the bias / variance tradeoff legible.
How deep to go. Aim for fluency, not proofs. If you can re-derive backpropagation for a 2-layer MLP on paper and explain why PCA picks the top eigenvectors, you're past the bar.
Linear algebra
- Vectors and norms. Dot product, projection, L1 / L2 / L∞ norms. Geometric intuition matters: a dot product is "how much one vector points in the direction of another."
- Matrices as linear maps. Matrix-vector multiplication is "apply this transformation." Rotation, scaling, projection are all matrices.
- Matrix multiplication. Composing transformations. Know that (AB)x = A(Bx) and why matrix multiplication is not commutative (a quick NumPy check follows this list).
- Rank, null space, column space. What information a matrix preserves vs. destroys.
- Eigenvalues & eigenvectors. The directions a matrix scales without rotating. Foundation for PCA and spectral methods.
- Singular value decomposition (SVD). The factorization you'll keep meeting in dimensionality reduction, recommender systems, and low-rank approximation.
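A quick NumPy check of the two facts from this list that are easiest to get wrong: (AB)x = A(Bx) even though AB ≠ BA, and an eigenvector is only scaled, never rotated. The matrices here are arbitrary random examples, chosen just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
x = rng.normal(size=3)

# Composition: applying B then A is the same as applying the single matrix AB.
assert np.allclose((A @ B) @ x, A @ (B @ x))

# Order matters: AB and BA are different transformations in general.
print(np.allclose(A @ B, B @ A))   # almost surely False for random matrices

# Eigenvectors: the matrix scales them without rotating them.
S = A @ A.T                        # symmetric, so eigenvalues/eigenvectors are real
vals, vecs = np.linalg.eigh(S)
v = vecs[:, -1]                    # eigenvector for the largest eigenvalue
assert np.allclose(S @ v, vals[-1] * v)
```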
Key identity for ML: for matrix X with rows = data points, the covariance matrix is C = (1/n) Xᵀ X (after centering). Its top eigenvectors are the principal components.
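A minimal sketch of that identity on synthetic data (the data and its scaling are made up for illustration): center, form C = (1/n) XᵀX, eigendecompose, and confirm the top eigenvector agrees with the leading right singular vector from the SVD, up to sign.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])  # anisotropic synthetic data
Xc = X - X.mean(axis=0)                                   # center each feature

C = Xc.T @ Xc / len(Xc)            # covariance matrix (1/n) XᵀX
vals, vecs = np.linalg.eigh(C)     # eigenvalues in ascending order
pc1 = vecs[:, -1]                  # first principal component = top eigenvector

# Cross-check: the same direction is the leading right singular vector of Xc.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
assert np.allclose(np.abs(pc1 @ Vt[0]), 1.0)
```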
Probability & statistics
- Random variables, expectation, variance. The two summary statistics that drive almost every loss function.
- Common distributions. Bernoulli, binomial, Gaussian, uniform, Poisson. Recognize them from their PMFs / PDFs.
- Joint, marginal, conditional probability. P(A, B) = P(A | B) P(B). Bayes' rule.
- Independence vs. correlation. Independence implies zero correlation; the reverse is false in general (jointly Gaussian variables are the exception). A counterexample is simulated after this list.
- Maximum likelihood estimation (MLE). The probabilistic justification behind least-squares regression and cross-entropy classification.
- Central limit theorem. Why averages of many independent samples tend toward a Gaussian and why we trust mean estimates.
- Hypothesis testing basics. p-values, confidence intervals — useful for evaluating whether model A really beats model B.
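A small simulation of the independence-vs-correlation point above: Y = X² is completely determined by X, yet their correlation is near zero when X is symmetric around 0. The distribution and sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=100_000)
y = x ** 2                           # y is a deterministic function of x: fully dependent

corr = np.corrcoef(x, y)[0, 1]
print(f"correlation ~ {corr:.4f}")   # close to 0 despite total dependence
```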
Multivariable calculus
- Partial derivatives. ∂f/∂xᵢ — "how does f change if I wiggle xᵢ alone."
- Gradient. The vector of all partials. Points in the direction of steepest increase.
- Chain rule. The single most important calculus fact for ML: backpropagation is just the chain rule applied repeatedly. A numerical check of a chain-rule derivative follows this list.
- Jacobian & Hessian. Matrix of partials (Jacobian) and second-order partials (Hessian). The Hessian's eigenvalues tell you about local curvature and convexity.
- Taylor expansion. Approximate a function locally by its first or second derivative. Underlies Newton's method.
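A sketch of checking a chain-rule derivative numerically. For the illustrative choice f(x) = sin(x²), the chain rule gives f'(x) = 2x·cos(x²); a central finite difference should agree to many decimal places.

```python
import numpy as np

def f(x):
    return np.sin(x ** 2)

def f_prime(x):
    # Chain rule: d/dx sin(x²) = cos(x²) · 2x
    return 2 * x * np.cos(x ** 2)

x0, h = 1.3, 1e-5
numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)   # central difference
print(f_prime(x0), numeric)                    # agree to roughly 1e-9 or better
```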
Gradient descent: x ← x − η · ∇f(x). Reduce η if the loss oscillates; increase η if it crawls.
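A minimal gradient descent loop on a 2-D quadratic bowl (an illustrative function, not anything canonical), with the update rule written verbatim; changing eta shows the crawl / converge / diverge regimes described above.

```python
import numpy as np

def f(x):
    return float(x @ x)           # simple convex bowl: f(x) = ||x||²

def grad_f(x):
    return 2 * x

x = np.array([4.0, -3.0])
eta = 0.1                         # try 0.01 (crawl), 0.1 (fast), 1.1 (diverges)
for step in range(25):
    x = x - eta * grad_f(x)       # the update rule: x ← x − η · ∇f(x)

print(x, f(x))                    # near the minimum at the origin for eta = 0.1
```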
Convex optimization
- Convex sets and functions. A function is convex if every chord lies on or above the graph. For convex problems every local minimum is a global minimum (strictly convex problems have exactly one).
- Convex losses. Squared error, logistic loss, hinge loss, cross-entropy — all convex in the linear-model setting.
- Lagrange multipliers & KKT conditions. Handle constraints. Show up in SVMs explicitly.
- Gradient descent & variants. SGD, momentum, Adam. Know the update rules from memory (sketched after this list).
- Convergence intuition. Learning rate too high → divergence. Too low → slow crawl. Adam smooths over both by adapting per-parameter rates.
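The three update rules from the list above, written as a plain-NumPy sketch against a single gradient vector g. The hyperparameter values are the common defaults, not prescriptions, and the usage snippet at the bottom just exercises Adam on the toy bowl f(w) = ||w||².

```python
import numpy as np

def sgd_step(w, g, lr=0.01):
    return w - lr * g

def momentum_step(w, g, v, lr=0.01, beta=0.9):
    v = beta * v + g                      # exponentially decaying velocity
    return w - lr * v, v

def adam_step(w, g, m, s, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g             # first-moment (mean) estimate
    s = b2 * s + (1 - b2) * g ** 2        # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)             # bias correction for the zero initialization
    s_hat = s / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s

# Usage sketch on f(w) = ||w||², whose gradient is 2w.
w = np.array([1.0, -2.0])
m = s = np.zeros_like(w)
for t in range(1, 201):
    w, m, s = adam_step(w, 2 * w, m, s, t)
```

The momentum form here (v ← βv + g, then w ← w − ηv) is the convention PyTorch uses; some texts fold the learning rate into the velocity instead, which changes the bookkeeping but not the idea.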
Drills
- Implement matrix multiplication, transpose, and dot product from scratch in NumPy without using built-ins. Verify against np.matmul.
- Compute PCA by hand on a 5-point 2-D dataset: center, covariance, eigendecomposition, project.
- Derive the gradient of mean-squared-error loss for linear regression. Verify with PyTorch autograd (a minimal verification pattern is sketched after these drills).
- For a 2-layer MLP with one hidden ReLU and a sigmoid output, write out the four partial derivatives needed for backprop on a single sample.
- Run gradient descent on a 1-D convex function (e.g. f(x) = (x − 3)²) and plot the trajectory for three learning rates.
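For the autograd-verification drills, a minimal pattern, shown on a deliberately trivial function (f(x) = x³, so it doesn't give the exercises away): compute the gradient you derived by hand and compare it against what PyTorch autograd produces.

```python
import torch

x = torch.tensor([0.7, -1.2, 2.0], requires_grad=True)
loss = (x ** 3).sum()             # stand-in for whatever loss you derived by hand
loss.backward()                   # autograd fills x.grad

hand_grad = 3 * x.detach() ** 2   # the gradient derived on paper: d/dx x³ = 3x²
assert torch.allclose(x.grad, hand_grad)
```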