Chain Rule = how you take derivatives when a value depends on another value, which itself depends on yet another value (i.e. a composition of functions)
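In symbols, a minimal statement (A, B, C and w, z, a, L below are generic placeholders, not notation from any particular source): if C depends on B and B depends on A, and if a weight w feeds a pre-activation z, which feeds an activation a, which feeds the loss L, then

$$
\frac{dC}{dA} = \frac{dC}{dB}\cdot\frac{dB}{dA},
\qquad
\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z}\cdot\frac{\partial z}{\partial w}
$$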
Intuition
- If A affects B and B affects C, then A affects C through B. The chain rule just says: total sensitivity of C to A = (how sensitive C is to B) × (how sensitive B is to A)
- A neural network is a long chain of computations, and the output loss depends on a weight through many intermediate values. What neural network training needs to answer is: _How does a tiny change in a weight change the [[Loss]]?_ (see the numeric check after this list)
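A tiny numeric check of the two bullets above (a sketch with made-up functions B = 2A and C = B², not anything from a real network):

```python
# Toy check that "sensitivity multiplies along the path" A -> B -> C.
# With B = 2*A and C = B**2: dC/dB = 2*B and dB/dA = 2, so dC/dA = (2*B) * 2.

def forward(A):
    B = 2 * A      # B depends on A
    C = B ** 2     # C depends on B, and therefore on A only through B
    return B, C

A = 3.0
B, C = forward(A)

chain_rule = (2 * B) * 2        # (how sensitive C is to B) * (how sensitive B is to A)

eps = 1e-6                      # finite-difference estimate of the same sensitivity
_, C_eps = forward(A + eps)
numeric = (C_eps - C) / eps

print(chain_rule, numeric)      # both ~24: a tiny nudge to A is amplified 24x by the time it reaches C
```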
The chain rule says influence flows through intermediate values:
- Start at the end with the loss’ sensitivity to the output (an upstream gradient).
- Move one operation backward at a time: multiply by that operation’s local derivative.
- Pass the product backward and repeat until you reach the weights (sketch below).
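A minimal sketch of those three steps on a one-neuron example (the names w, b, x, y_true and the squared-error loss are assumptions for illustration, not anyone's reference implementation):

```python
# Manual backward pass through loss = (w*x + b - y_true)**2.

# Forward pass: compute and keep the intermediate values.
w, b = 0.5, 0.1           # parameters (arbitrary example values)
x, y_true = 2.0, 1.0      # one training example

y_hat = w * x + b         # prediction
loss = (y_hat - y_true) ** 2

# Backward pass: start at the end with the upstream gradient,
# then multiply by each operation's local derivative and pass the product back.
dloss_dyhat = 2 * (y_hat - y_true)   # sensitivity of the loss to the output
dyhat_dw = x                         # local derivative of y_hat w.r.t. w
dyhat_db = 1.0                       # local derivative of y_hat w.r.t. b

dloss_dw = dloss_dyhat * dyhat_dw    # upstream * local, handed back to the weight
dloss_db = dloss_dyhat * dyhat_db    # upstream * local, handed back to the bias

print(loss, dloss_dw, dloss_db)      # 0.01, 0.4, 0.2 for these numbers
```

With more layers the same pattern repeats: each layer multiplies the incoming upstream gradient by its own local derivative and hands the product further back.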
Complex Stuff
- The “canceling fractions” mnemonic for the basic operation of the chain rule: just cancel the denominator of the first partial derivative with the corresponding numerator of the derivative to its right, and continue doing so moving rightward. Useful, but don’t take it literally because derivatives are not fractions. The real meaning is that we can multiply local rates of change along the dependency path.
- Vanishing/exploding gradients are side effects of the chain rule. If you repeatedly multiply many local derivatives that are slightly < 1 → the product shrinks toward 0 (vanishes); repeatedly multiplying numbers slightly > 1 → it blows up (explodes). This is why many architecture and training choices (activations, initialization, normalization, residuals) matter. (Toy demo below.)
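A toy demo of the repeated-multiplication effect (0.9 and 1.1 are arbitrary stand-ins for local derivatives along a 50-step chain, not values from any real model):

```python
# Repeatedly multiplying local derivatives along a deep chain.
depth = 50
vanishing, exploding = 1.0, 1.0

for _ in range(depth):
    vanishing *= 0.9   # each local derivative slightly < 1
    exploding *= 1.1   # each local derivative slightly > 1

print(vanishing)  # ~0.005 -- the gradient has all but disappeared
print(exploding)  # ~117   -- the gradient has blown up
```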
Primary Resources
- 3Blue1Brown’s Neural Networks course has Chapter 4, Backpropagation Calculus. Around 06:25–06:35 he says (verbatim) “This right here is the chain rule…” and explains it as multiplying those intermediate ratios/sensitivities along the path.