A lot of data in the world is sequential, i.e. has a temporal aspect — e.g. music, spoken language. By treating each data point independently (in a purely feed-forward model) we lose the ability to model that sequential structure.
In a Recurrent Neural Network (RNN), each neuron's output is a function of both the current input and a memory of previous time steps. So the unit has a recurrent connection back to itself.
PS: This aspect reminds me of a blockchain (where each block header contains a hash of the previous block, preserving validity and sequence).
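A minimal NumPy sketch of this recurrence (the sizes, initialization, and toy sequence are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): 3-dim inputs, 4-dim hidden state.
input_size, hidden_size = 3, 4
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden: the link back to itself
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step: the new state mixes the current input with the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_size)                     # initial (empty) memory
for x_t in rng.normal(size=(5, input_size)):  # a toy 5-step sequence
    h = rnn_step(x_t, h)                      # h carries information forward through time
```

Note how the same `W_hh` is reused at every step — that shared recurrent weight is exactly what gets multiplied over and over during backpropagation.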
- With RNNs, a loss is computed at EVERY time step. These per-step losses are then summed at the end to give a total loss
- Now as we back-propagate this total loss all the way back to the input, we repeatedly compute the gradient at each time step. This can lead to:
- Diminishing gradients (they vanish) or
- Exploding gradients (become too large)
- ReLU activations help, but the main idea for avoiding this is to use gates that selectively add/remove information within each recurrent unit.
- Gates optionally forget or keep information in the recurrent cell (based on learned weight matrices)
- The common architecture for this is the LSTM (Long Short-Term Memory)
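The vanishing/exploding behavior can be seen with a toy scalar version of backprop through time: the gradient gets multiplied by the recurrent weight once per unrolled step (the weight values here are arbitrary picks to show the two regimes):

```python
def gradient_after(T, w):
    """Toy scalar model: gradient after back-propagating through T time steps."""
    g = 1.0
    for _ in range(T):
        g *= w  # one multiplication by the recurrent weight per time step
    return g

small = gradient_after(50, 0.5)  # |w| < 1: the gradient vanishes
large = gradient_after(50, 1.5)  # |w| > 1: the gradient explodes
```

Over 50 steps, 0.5**50 is on the order of 1e-15 while 1.5**50 is on the order of 1e8 — the gradient signal either dies out or blows up long before it reaches the early inputs.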
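A single LSTM step can be sketched like this — each gate is a sigmoid driven by its own weight matrix, and the cell state is updated by a gated add/remove (the concatenated-input layout and sizes are assumptions for brevity, not the only formulation):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4  # hidden/cell size (assumption); input size matches for brevity

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, applied to the concatenated [h_prev, x_t].
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(n, 2 * n)) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z)           # forget gate: what to drop from the cell
    i = sigmoid(W_i @ z)           # input gate: what new info to write
    o = sigmoid(W_o @ z)           # output gate: what to expose as output
    c_tilde = np.tanh(W_c @ z)     # candidate new content
    c = f * c_prev + i * c_tilde   # gated add/remove on the cell state
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(rng.normal(size=n), h, c)
```

The key design choice: the cell state `c` is updated mostly additively (`f * c_prev + i * c_tilde`) rather than being squashed through a nonlinearity at every step, which gives gradients a much cleaner path back through time.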