The math you actually need
Four areas: linear algebra, probability & statistics, multivariable calculus, convex optimization. You don't need a full undergraduate sequence — you need the parts that make gradient descent, PCA, and the bias / variance tradeoff legible.
How deep to go. Aim for fluency, not proofs. If you can re-derive backpropagation for a 2-layer MLP on paper and explain why PCA picks the top eigenvectors, you're past the bar.
Linear algebra
- Vectors and norms. Dot product, projection, L1 / L2 / L∞ norms. Geometric intuition matters: a dot product is "how much one vector points in the direction of another."
- Matrices as linear maps. Matrix-vector multiplication is "apply this transformation." Rotation, scaling, projection are all matrices.
- Matrix multiplication. Composing transformations. Know that (AB)x = A(Bx) and why matrix multiplication is not commutative (a quick NumPy check follows this list).
- Rank, null space, column space. What information a matrix preserves vs. destroys.
- Eigenvalues & eigenvectors. The directions a matrix scales without rotating. Foundation for PCA and spectral methods.
- Singular value decomposition (SVD). The factorization you'll keep meeting in dimensionality reduction, recommender systems, and low-rank approximation.
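A quick NumPy check of the two facts from this list that are easiest to get wrong: (AB)x = A(Bx) even though AB ≠ BA, and an eigenvector is only scaled, never rotated. The matrices here are arbitrary random examples, chosen just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
x = rng.normal(size=3)

# Composition: applying B then A is the same as applying the single matrix AB.
assert np.allclose((A @ B) @ x, A @ (B @ x))

# Order matters: AB and BA are different transformations in general.
print(np.allclose(A @ B, B @ A))   # almost surely False for random matrices

# Eigenvectors: the matrix scales them without rotating them.
S = A @ A.T                        # symmetric, so eigenvalues/eigenvectors are real
vals, vecs = np.linalg.eigh(S)
v = vecs[:, -1]                    # eigenvector for the largest eigenvalue
assert np.allclose(S @ v, vals[-1] * v)
```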
Key identity for ML: for matrix X with rows = data points, the covariance matrix is C = (1/n) Xᵀ X (after centering). Its top eigenvectors are the principal components.
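A minimal sketch of that identity on synthetic data (the data and its scaling are made up for illustration): center, form C = (1/n) XᵀX, eigendecompose, and confirm the top eigenvector agrees with the leading right singular vector from the SVD, up to sign.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])  # anisotropic synthetic data
Xc = X - X.mean(axis=0)                                   # center each feature

C = Xc.T @ Xc / len(Xc)            # covariance matrix (1/n) XᵀX
vals, vecs = np.linalg.eigh(C)     # eigenvalues in ascending order
pc1 = vecs[:, -1]                  # first principal component = top eigenvector

# Cross-check: the same direction is the leading right singular vector of Xc.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
assert np.allclose(np.abs(pc1 @ Vt[0]), 1.0)
```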
Probability & statistics
- Random variables, expectation, variance. The two summary statistics that drive almost every loss function.
- Common distributions. Bernoulli, binomial, Gaussian, uniform, Poisson. Recognize them from their PMFs / PDFs.
- Joint, marginal, conditional probability. P(A, B) = P(A | B) P(B). Bayes' rule.
- Independence vs. correlation. Independence implies zero correlation; the reverse is false in general (jointly Gaussian variables are the exception). A counterexample is simulated after this list.
- Maximum likelihood estimation (MLE). The probabilistic justification behind least-squares regression and cross-entropy classification.
- Central limit theorem. Why averages of many independent samples tend toward a Gaussian and why we trust mean estimates.
- Hypothesis testing basics. p-values, confidence intervals — useful for evaluating whether model A really beats model B.
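A small simulation of the independence-vs-correlation point above: Y = X² is completely determined by X, yet their correlation is near zero when X is symmetric around 0. The distribution and sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=100_000)
y = x ** 2                           # y is a deterministic function of x: fully dependent

corr = np.corrcoef(x, y)[0, 1]
print(f"correlation ~ {corr:.4f}")   # close to 0 despite total dependence
```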
Multivariable calculus
- Partial derivatives. ∂f/∂xᵢ — "how does f change if I wiggle xᵢ alone."
- Gradient. The vector of all partials. Points in the direction of steepest increase.
- Chain rule. The single most important calculus fact for ML: backpropagation is just the chain rule applied repeatedly. A numerical check of a chain-rule derivative follows this list.
- Jacobian & Hessian. Matrix of partials (Jacobian) and second-order partials (Hessian). The Hessian's eigenvalues tell you about local curvature and convexity.
- Taylor expansion. Approximate a function locally by its first or second derivative. Underlies Newton's method.
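A sketch of checking a chain-rule derivative numerically. For the illustrative choice f(x) = sin(x²), the chain rule gives f'(x) = 2x·cos(x²); a central finite difference should agree to many decimal places.

```python
import numpy as np

def f(x):
    return np.sin(x ** 2)

def f_prime(x):
    # Chain rule: d/dx sin(x²) = cos(x²) · 2x
    return 2 * x * np.cos(x ** 2)

x0, h = 1.3, 1e-5
numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)   # central difference
print(f_prime(x0), numeric)                    # agree to roughly 1e-9 or better
```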
Gradient descent: x ← x − η · ∇f(x). Reduce η if the loss oscillates; increase η if it crawls.
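A minimal gradient descent loop on a 2-D quadratic bowl (an illustrative function, not anything canonical), with the update rule written verbatim; changing eta shows the crawl / converge / diverge regimes described above.

```python
import numpy as np

def f(x):
    return float(x @ x)           # simple convex bowl: f(x) = ||x||²

def grad_f(x):
    return 2 * x

x = np.array([4.0, -3.0])
eta = 0.1                         # try 0.01 (crawl), 0.1 (fast), 1.1 (diverges)
for step in range(25):
    x = x - eta * grad_f(x)       # the update rule: x ← x − η · ∇f(x)

print(x, f(x))                    # near the minimum at the origin for eta = 0.1
```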
Convex optimization
- Convex sets and functions. A function is convex if every chord lies on or above the graph. For convex problems every local minimum is a global minimum (strictly convex problems have exactly one).
- Convex losses. Squared error, logistic loss, hinge loss, cross-entropy — all convex in the linear-model setting.
- Lagrange multipliers & KKT conditions. Handle constraints. Show up in SVMs explicitly.
- Gradient descent & variants. SGD, momentum, Adam. Know the update rules from memory (sketched after this list).
- Convergence intuition. Learning rate too high → divergence. Too low → slow crawl. Adam smooths over both by adapting per-parameter rates.
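The three update rules from the list above, written as a plain-NumPy sketch against a single gradient vector g. The hyperparameter values are the common defaults, not prescriptions, and the usage snippet at the bottom just exercises Adam on the toy bowl f(w) = ||w||².

```python
import numpy as np

def sgd_step(w, g, lr=0.01):
    return w - lr * g

def momentum_step(w, g, v, lr=0.01, beta=0.9):
    v = beta * v + g                      # exponentially decaying velocity
    return w - lr * v, v

def adam_step(w, g, m, s, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g             # first-moment (mean) estimate
    s = b2 * s + (1 - b2) * g ** 2        # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)             # bias correction for the zero initialization
    s_hat = s / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s

# Usage sketch on f(w) = ||w||², whose gradient is 2w.
w = np.array([1.0, -2.0])
m = s = np.zeros_like(w)
for t in range(1, 201):
    w, m, s = adam_step(w, 2 * w, m, s, t)
```

The momentum form here (v ← βv + g, then w ← w − ηv) is the convention PyTorch uses; some texts fold the learning rate into the velocity instead, which changes the bookkeeping but not the idea.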
Drills
- Implement matrix multiplication, transpose, and dot product from scratch in NumPy without using built-ins. Verify against np.matmul.
- Compute PCA by hand on a 5-point 2-D dataset: center, covariance, eigendecomposition, project.
- Derive the gradient of mean-squared-error loss for linear regression. Verify with PyTorch autograd (a minimal verification pattern is sketched after these drills).
- For a 2-layer MLP with one hidden ReLU and a sigmoid output, write out the four partial derivatives needed for backprop on a single sample.
- Run gradient descent on a 1-D convex function (e.g. f(x) = (x − 3)²) and plot the trajectory for three learning rates.
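For the autograd-verification drills, a minimal pattern, shown on a deliberately trivial function (f(x) = x³, so it doesn't give the exercises away): compute the gradient you derived by hand and compare it against what PyTorch autograd produces.

```python
import torch

x = torch.tensor([0.7, -1.2, 2.0], requires_grad=True)
loss = (x ** 3).sum()             # stand-in for whatever loss you derived by hand
loss.backward()                   # autograd fills x.grad

hand_grad = 3 * x.detach() ** 2   # the gradient derived on paper: d/dx x³ = 3x²
assert torch.allclose(x.grad, hand_grad)
```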