I first wrapped my head around diffusion models in 2023, thanks to the MIT 6.S191 lecture ‘Deep Learning New Frontiers’. The idea of reverse denoising just clicked for me: it reminded me of how our brains pick out shapes and objects in clouds or random mosaics.
Yesterday my 3Blue1Brown subscription surfaced ‘But how do AI videos actually work?’, a guest video by @WelchLabsVideo. That video unlocked a whole new layer of understanding about diffusion models for me. There were a few things I hadn’t fully grasped until I saw this explanation:
- I used to think diffusion models worked like this: take a bunch of training images, add noise bit by bit until the images are unrecognizable, and then train a neural network to reverse that process. What I hadn’t realized is that during each generation step, after the model predicts a less noisy image, a bit of random noise is added back in before the next iteration. Counterintuitive, but it actually makes the final images sharper and more realistic: if you skip that extra noise step, you just get blurry, averaged-out images. The video’s tree-in-the-desert visual made this crystal clear! (See the sampling sketch after this list.)
- Another “aha” moment: the diffusion model isn’t actually trained to undo the noise one small step at a time. Instead, it’s trained to predict the total noise that was mixed into the original image. That’s a harder task for the model, but it makes training more efficient and the model more powerful. (The toy training example after this list shows how each input/target pair is built.)
- The video also touched on how models match images and captions using something called cosine similarity. Basically, both images and text are mapped to vectors in a shared embedding space, so aligning an image and its caption is just about making those vectors point in the same direction. It’s so simple and elegant. What really blew my mind is that this shared space captures actual concepts in the image. For example, if you subtract the “no hat” image vector from the “wearing hat” image vector and then look for the closest text vector, the best match is “hat.” The geometry of the space literally encodes differences in image content, like “hat or not,” as directions. That’s the core of OpenAI’s CLIP breakthrough back in 2021. (The toy cosine-similarity example at the end of this list plays with exactly that idea.)
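To make the first point concrete, here is a minimal sketch of a DDPM-style sampling loop. The function name `eps_model`, its `(x, t)` signature, and the linear beta schedule are my own placeholders rather than anything from the video; the part to notice is the line that adds fresh noise back in on every step except the last.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng=np.random.default_rng(0)):
    """Sketch of a DDPM-style sampling loop (assumed names, not the video's code).

    eps_model(x, t) is assumed to return the model's prediction of the noise
    present in x at timestep t.
    """
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(shape)              # start from pure noise
    for t in reversed(range(len(betas))):
        eps_hat = eps_model(x, t)               # predicted noise at step t
        # Estimate the less-noisy image (DDPM posterior mean).
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # The detail from the video: add a bit of fresh noise back in on every
        # step except the last, otherwise samples drift toward a blurry average.
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Toy usage with a dummy model that always predicts zero noise.
betas = np.linspace(1e-4, 0.02, 50)
sample = ddpm_sample(lambda x, t: np.zeros_like(x), (8, 8), betas)
```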
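And here is the matching training side as I understand it: each training example pairs a noised image with the full noise that produced it, so the network learns to predict all of the added noise at once. Again a sketch with assumed names, not code from the video.

```python
import numpy as np

def make_training_pair(x0, t, alpha_bars, rng):
    """Build one (input, target) pair for noise-prediction training (sketch).

    The target is the *total* noise eps mixed into x0, not a one-step delta.
    """
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps  # train so that eps_model(x_t, t) ≈ eps, e.g. with an MSE loss

# Toy usage: a fake 8x8 "image" and a mid-range timestep.
rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 50))
x_t, target = make_training_pair(rng.standard_normal((8, 8)), t=25,
                                 alpha_bars=alpha_bars, rng=rng)
```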
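Finally, a toy illustration of the cosine-similarity idea. These are hand-built vectors, not real CLIP embeddings: I plant a “hat” direction by construction, whereas CLIP learns that structure from huge numbers of image-caption pairs. The arithmetic is the same, though.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
dim = 512
hat_concept = rng.standard_normal(dim)        # pretend "hat" direction in the shared space

img_no_hat = rng.standard_normal(dim)         # stand-in embedding: person without a hat
img_with_hat = img_no_hat + hat_concept       # same scene, plus the "hat" concept
text_hat = hat_concept + 0.1 * rng.standard_normal(dim)   # stand-in embedding of the caption "hat"
text_dog = rng.standard_normal(dim)                        # unrelated caption, e.g. "dog"

diff = img_with_hat - img_no_hat              # subtracting the images isolates the concept
print(cosine_similarity(diff, text_hat))      # close to 1: "hat" is the nearest text vector
print(cosine_similarity(diff, text_dog))      # close to 0: an unrelated caption barely matches
```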
More than anything, I’m genuinely awed by the storytelling abilities of some YouTube creators. The way 3Blue1Brown and WelchLabsVideo break down complex math and AI concepts is nothing short of astounding. I’m thankful for the clarity and inspiration they bring; I honestly wish resources like these had existed when I was in high school.