Q1 · Backprop through softmax
Let s = softmax(z) where z ∈ R^K, so s_i = exp(z_i) / Σ_j exp(z_j). Derive the Jacobian ∂s_i / ∂z_j.
Solution
Differentiate s_i = exp(z_i) / Z with Z = Σ_k exp(z_k). The numerator depends on z_j only when i = j; the denominator always depends on z_j.
Case i = j. Quotient rule gives ∂s_i/∂z_i = (exp(z_i)·Z − exp(z_i)·exp(z_i)) / Z² = s_i − s_i² = s_i(1 − s_i).
Case i ≠ j. The numerator's derivative is zero and ∂Z/∂z_j = exp(z_j), so the quotient rule gives ∂s_i/∂z_j = (0·Z − exp(z_i)·exp(z_j)) / Z² = −s_i s_j.
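Compact form: ∂s_i/∂z_j = s_i(δ_ij − s_j), i.e. J = diag(s) − s s^T. This matrix is symmetric and singular: J·1 = s − s(s^T 1) = s − s = 0 because Σ_i s_i = 1, so the all-ones vector lies in its null space, which expresses that shifting all logits by a constant does not change the softmax. Its rank is exactly K−1, since every s_i > 0.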
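The result can be sanity-checked numerically. The following is a minimal sketch in plain Python, not part of the original solution: it compares the analytic Jacobian from the compact form above with a central-difference estimate, and also checks symmetry and that every row sums to zero. The function names, the example logits and the step size h are illustrative assumptions.

```python
import math

def softmax(z):
    """Softmax of a list of floats."""
    m = max(z)  # subtracting the max does not change the result (shift invariance)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def softmax_jacobian(z):
    """Analytic Jacobian: J[i][j] = s_i * (delta_ij - s_j), i.e. diag(s) - s s^T."""
    s = softmax(z)
    K = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(K)]
            for i in range(K)]

def numeric_jacobian(z, h=1e-6):
    """Central-difference estimate of ds_i/dz_j (h is an assumed step size)."""
    K = len(z)
    J = [[0.0] * K for _ in range(K)]
    for j in range(K):
        zp = list(z)
        zp[j] += h
        zm = list(z)
        zm[j] -= h
        sp, sm = softmax(zp), softmax(zm)
        for i in range(K):
            J[i][j] = (sp[i] - sm[i]) / (2 * h)
    return J

z = [1.0, -0.5, 2.0, 0.3]  # arbitrary example logits
K = len(z)
J = softmax_jacobian(z)
J_num = numeric_jacobian(z)

# Largest gap between the analytic and finite-difference entries.
max_err = max(abs(J[i][j] - J_num[i][j]) for i in range(K) for j in range(K))
# Largest deviation from symmetry, |J[i][j] - J[j][i]|.
max_asym = max(abs(J[i][j] - J[j][i]) for i in range(K) for j in range(K))
# Row sums equal the entries of J @ ones, which should vanish.
max_row_sum = max(abs(sum(row)) for row in J)

print("max |J - J_num|:", max_err)
print("max asymmetry:  ", max_asym)
print("max |row sum|:  ", max_row_sum)
```

All three printed quantities should be close to zero, up to finite-difference and floating-point rounding error. For large K, forming the full K×K matrix is usually unnecessary in backprop: given an upstream gradient g = ∂L/∂s, the product J^T g = J g can be computed directly as s ⊙ (g − s^T g).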