Note: I had stopped writing posts in 2017. I started again in late 2024, mostly for AI.

Think about local minima in thousands of dimensions

Aug 3, 2025 | LLM

When I first learned about gradient descent about two years ago, I pictured it in the most obvious 3D way: two input variables as the x and y axes of a plane, with the loss as the third (z) axis. A ‘local minimum’, in that picture, was the model getting stuck in a “false bottom” of this bowl-shaped landscape, unable to reach the true minimum, the lowest point.

But this WelchLabs video changed my mental model. Here is the key sentence from that excellent explanation:

“For gradient descent to become fully stuck in a local minimum it would have to get stuck in every dimension at once. And the chances of this happening become smaller and smaller as we add more and more parameters.”

The human brain struggles to picture anything beyond three dimensions, but these models do their math in thousands of dimensions. GPT-3.5, for example, worked with 1536-dimensional vectors, something I wrote about in January 2024. So this was a good reminder: our spatial reasoning breaks down in these higher-dimensional worlds.
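
To make the quoted sentence a bit more concrete, here is a tiny toy simulation of my own (not from the video): treat the curvature of the loss at a critical point as an independent coin flip in each dimension, positive (curving up) or negative (curving down). Being fully stuck requires every flip to come up positive, and that probability collapses as the number of dimensions grows. Real loss landscapes are not this independent, so take it purely as intuition.

```python
import numpy as np

# Toy illustration of the quoted idea: a critical point is a true local
# minimum only if the loss curves upward along *every* dimension at once.
# Each dimension's curvature sign is modelled as an independent coin flip,
# a big simplification of real loss landscapes, just to show how quickly
# "stuck in every dimension" becomes unlikely as dimensions grow.

rng = np.random.default_rng(0)

def fraction_of_true_minima(n_dims, n_trials=100_000):
    """Fraction of random critical points whose curvature is positive
    in all n_dims directions under the coin-flip model."""
    curvature_signs = rng.choice([-1, 1], size=(n_trials, n_dims))
    return float(np.mean(np.all(curvature_signs > 0, axis=1)))

for n in [1, 2, 5, 10, 20]:
    print(f"{n:>3} dims: simulated ~{fraction_of_true_minima(n):.5f}, "
          f"theory {0.5**n:.5f}")

# At 1536 dimensions the theoretical chance is 0.5**1536, effectively zero:
# almost every critical point is a saddle with at least one escape direction.
```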

PS: John's way of explaining it, that ‘a wormhole opens up’ (16:55), is a good analogy. I had never realized that the 3D loss landscape we visualize actually moves as the model learns.
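
As a rough sketch of what “the landscape moves” means (again my own toy example, not the one in the video): hold every parameter fixed except one, call it x, and look at the loss along x while a second parameter z, which is still being updated, changes. A well that traps x at one value of z can simply vanish at another.

```python
import numpy as np

# My own toy loss, not the one from the video: a double well along x,
# tilted by a second parameter z that is still being learned.
def loss_slice(x, z):
    return (x**2 - 1)**2 + z * x

def count_local_minima(z, xs=np.linspace(-2, 2, 4001)):
    """Count interior local minima of the loss along x at a fixed z."""
    y = loss_slice(xs, z)
    # A grid point is a local minimum if it is lower than both neighbours.
    return int(np.sum((y[1:-1] < y[:-2]) & (y[1:-1] < y[2:])))

for z in [0.0, 0.5, 2.0]:
    print(f"z = {z:3.1f}: {count_local_minima(z)} local minima along x")

# z = 0.0 gives two wells; by z = 2.0 the tilt has erased one of them,
# so a point that looked "stuck" in this slice now has a path downhill.
```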