A lot of data in the world is sequential, i.e. has a temporal aspect — e.g. music, spoken language. By treating each data point independently (in a purely feed-forward model) we lose the ability to model that sequential structure.
In a Recurrent Neural Network (RNN), each neuron's output is a function of both the current input and a memory of previous time steps. So the unit has a recurrent connection back to itself.
PS: This aspect reminds me of a blockchain (where each block header contains a hash of the previous block, preserving validity and sequence).
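A minimal NumPy sketch of this recurrence (the sizes, initialization, and toy sequence are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): 3-dim inputs, 4-dim hidden state.
input_size, hidden_size = 3, 4
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden: the link back to itself
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step: the new state mixes the current input with the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_size)                     # initial (empty) memory
for x_t in rng.normal(size=(5, input_size)):  # a toy 5-step sequence
    h = rnn_step(x_t, h)                      # h carries information forward through time
```

Note how the same `W_hh` is reused at every step — that shared recurrent weight is exactly what gets multiplied over and over during backpropagation.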
- With RNNs, a loss is computed at EVERY time step. These per-step losses are then summed at the end to give a total loss
- Now as we back-propagate this total loss all the way back to the input, we repeatedly compute the gradient at each time step. This can lead to:
- Diminishing gradients (they vanish) or
- Exploding gradients (become too large)
- ReLU activations help, but the main idea for avoiding this is to use gates that selectively add/remove information within each recurrent unit.
- Gates optionally forget or keep information in the recurrent cell (based on learned weight matrices)
- The common architecture for this is the LSTM (Long Short-Term Memory)
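The vanishing/exploding behavior can be seen with a toy scalar version of backprop through time: the gradient gets multiplied by the recurrent weight once per unrolled step (the weight values here are arbitrary picks to show the two regimes):

```python
def gradient_after(T, w):
    """Toy scalar model: gradient after back-propagating through T time steps."""
    g = 1.0
    for _ in range(T):
        g *= w  # one multiplication by the recurrent weight per time step
    return g

small = gradient_after(50, 0.5)  # |w| < 1: the gradient vanishes
large = gradient_after(50, 1.5)  # |w| > 1: the gradient explodes
```

Over 50 steps, 0.5**50 is on the order of 1e-15 while 1.5**50 is on the order of 1e8 — the gradient signal either dies out or blows up long before it reaches the early inputs.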
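A single LSTM step can be sketched like this — each gate is a sigmoid driven by its own weight matrix, and the cell state is updated by a gated add/remove (the concatenated-input layout and sizes are assumptions for brevity, not the only formulation):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4  # hidden/cell size (assumption); input size matches for brevity

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, applied to the concatenated [h_prev, x_t].
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(n, 2 * n)) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z)           # forget gate: what to drop from the cell
    i = sigmoid(W_i @ z)           # input gate: what new info to write
    o = sigmoid(W_o @ z)           # output gate: what to expose as output
    c_tilde = np.tanh(W_c @ z)     # candidate new content
    c = f * c_prev + i * c_tilde   # gated add/remove on the cell state
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(rng.normal(size=n), h, c)
```

The key design choice: the cell state `c` is updated mostly additively (`f * c_prev + i * c_tilde`) rather than being squashed through a nonlinearity at every step, which gives gradients a much cleaner path back through time.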