RNN

Nov 26, 2025 | LLM Concepts

A lot of data in the world is sequential, i.e. it has a temporal aspect — e.g. music, spoken language. By treating each data point independently (in a purely feed-forward model) we lose the ability to model that sequential structure.

In a Recurrent Neural Network (RNN), a neuron's output at each time step is a function of both the current input and a hidden state that carries memory of previous time steps. In other words, the unit has a recurrent connection back to itself.

PS: This aspect reminds me of a blockchain, where each block header contains the hash of the previous block, preserving validity and ordering.
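
To make the recurrence concrete, here is a minimal NumPy sketch (all names and shapes are my own, purely illustrative). The hidden state h is the network's memory, and every step mixes it with the current input; the sketch also sums a small per-step loss, which becomes relevant in the list below.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, W_hy, b_h, targets=None):
    """Vanilla RNN forward pass over a sequence.

    inputs : list of input vectors x_t, shape (input_dim,)
    W_xh   : input-to-hidden weights,  shape (hidden_dim, input_dim)
    W_hh   : hidden-to-hidden weights, shape (hidden_dim, hidden_dim)
    W_hy   : hidden-to-output weights, shape (output_dim, hidden_dim)
    b_h    : hidden bias,              shape (hidden_dim,)
    targets: optional list of target vectors, one per time step
    """
    h = np.zeros(W_hh.shape[0])              # initial hidden state = empty memory
    outputs, total_loss = [], 0.0
    for t, x_t in enumerate(inputs):
        # The recurrence: the neuron sees the current input AND the previous hidden state.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        y_hat = W_hy @ h                     # prediction at this time step
        outputs.append(y_hat)
        if targets is not None:
            # A loss is computed at every time step and summed into the total.
            total_loss += 0.5 * np.sum((y_hat - targets[t]) ** 2)
    return outputs, total_loss
```

Frameworks such as PyTorch wrap the same loop (batched and trained) in modules like torch.nn.RNN; the sketch above is just the bare recurrence.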

  • With RNNs, a loss is computed at EVERY time step. These per-step losses are then summed to get the total loss (the forward-pass sketch above accumulates it the same way).
  • As we back-propagate that total loss all the way back through the sequence (backpropagation through time), the gradient picks up a factor from the recurrent weights at every time step. This repeated multiplication can lead to (see the toy illustration after this list):
    • Vanishing gradients (they shrink toward zero), or
    • Exploding gradients (they blow up)
  • ReLU activations help, but the main idea for avoiding this is to use gates that selectively add/remove information within each recurrent unit.
    • Gates decide what to forget from and what to write into the recurrent cell (each gate is parameterized by its own weight matrix)
    • The most common architecture built on this idea is the LSTM (Long Short-Term Memory); a sketch of a single LSTM step follows the gradient illustration below
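
To see why backpropagation through time misbehaves, here is a deliberately over-simplified toy: a single scalar stands in for the recurrent Jacobian, so the gradient reaching the first time step is just that scalar multiplied into itself once per step.

```python
# Toy illustration only: a scalar stands in for the recurrent Jacobian,
# which multiplies the gradient once per time step during backprop.
for w_hh in (0.5, 1.5):
    grad = 1.0
    for _ in range(50):                      # backprop through 50 time steps
        grad *= w_hh
    print(f"w_hh={w_hh}: gradient after 50 steps ~ {grad:.2e}")

# w_hh=0.5: gradient after 50 steps ~ 8.88e-16   (vanishes)
# w_hh=1.5: gradient after 50 steps ~ 6.38e+08   (explodes)
```

In a real RNN the scalar is a Jacobian matrix, but the intuition carries over: factors smaller than 1 shrink the gradient toward zero, factors larger than 1 blow it up.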
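
And a sketch of a single LSTM step, assuming the standard forget/input/output gate equations (packing all gate parameters into one matrix W is my own convenience, not any particular library's layout):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W maps the concatenation [h_prev; x_t] to 4*hidden_dim pre-activations
    (forget, input, output gates and the candidate cell update); b is the bias.
    """
    n = h_prev.shape[0]                          # hidden_dim
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0*n:1*n])                      # forget gate: what to drop from the cell
    i = sigmoid(z[1*n:2*n])                      # input gate: what new info to let in
    o = sigmoid(z[2*n:3*n])                      # output gate: what to expose as h_t
    g = np.tanh(z[3*n:4*n])                      # candidate values to write
    c_t = f * c_prev + i * g                     # selectively forget old info, add new info
    h_t = o * np.tanh(c_t)                       # hidden state read out of the cell
    return h_t, c_t
```

The forget gate f and input gate i are exactly the "selectively add/remove information" from the bullets above: f scales down what survives from the old cell state, and i scales what new information gets written in.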