Encoder
For each input word, we encode
- Its position (in the input text). Since the Transformer is not an RNN, the order of the sequence can get lost. So instead, the position is encoded explicitly here.
- Evaluate ‘self-attention’ – what other INPUT words might matter
- Put it through a feed-forward neural network
With the neural network approach, the model learns how to perform these attention steps: it iteratively gets better (learns) at deciding what to pay attention to, in order to predict the right output word more accurately.
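The encoder steps above can be sketched in a few lines of NumPy. This is a toy, single-head illustration only: the projections are identity matrices and the feed-forward weights are fixed, whereas a real Transformer learns separate W_q, W_k, W_v and FFN weights. The function names here are made up for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def positional_encoding(seq_len, d_model):
    # Sinusoidal position signal, so word order isn't lost.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(x):
    # Toy single-head attention with identity projections
    # (a real model learns W_q, W_k, W_v matrices).
    q, k, v = x, x, x
    scores = q @ k.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ v  # each INPUT word mixes in other INPUT words

def encoder_step(x):
    x = x + positional_encoding(*x.shape)  # 1. encode position explicitly
    x = self_attention(x)                  # 2. self-attention over INPUT words
    # 3. toy feed-forward network (fixed averaging weights, for illustration)
    w = np.ones((x.shape[-1], x.shape[-1])) / x.shape[-1]
    return np.maximum(0, x @ w)

embeddings = np.random.rand(5, 8)  # 5 "words", embedding dimension 8
out = encoder_step(embeddings)
print(out.shape)  # (5, 8): one encoded representation per input word
```

The shape stays (words × dimension) throughout, which is what lets encoder layers be stacked.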
Decoder
For each output word, we consider:
- The previous output word
- Its position (in the output text)
- Evaluate ‘self-attention’ – what other OUTPUT words it should pay attention to. And do it more than once (‘multiple heads of self-attention’)
- Evaluate attention to the encoded representations of the INPUT words – this is the extra step in the decoder (vs. the encoder)
- Put it through a feed-forward neural network
PS: The fact that this happens for every darn word 🤯. So the decoder is essentially doing the same sequence as the encoder, but with the extra step of also paying attention to the encoded representations of the INPUT words 🤯 🤯.
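The decoder sequence can be sketched the same way. Again a hedged toy: identity projections, fixed FFN weights, and a causal mask so each output word only sees earlier output words; the extra step is the second attention call, where queries come from the OUTPUT words but keys/values come from the encoder's representations of the INPUT words.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; identity projections for simplicity
    # (a real model learns W_q, W_k, W_v, and uses multiple heads).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed positions
    return softmax(scores) @ v

def decoder_step(y, enc_out):
    t = y.shape[0]
    causal = np.tril(np.ones((t, t), dtype=bool))  # only previous OUTPUT words
    y = attention(y, y, y, mask=causal)   # 1. self-attention over OUTPUT words
    y = attention(y, enc_out, enc_out)    # 2. extra step: attend to encoded INPUT words
    # 3. toy feed-forward network (fixed averaging weights, for illustration)
    w = np.ones((y.shape[-1], y.shape[-1])) / y.shape[-1]
    return np.maximum(0, y @ w)

enc_out = np.random.rand(5, 8)   # encoder output for 5 input words
outputs = np.random.rand(3, 8)   # embeddings of the 3 output words so far
dec = decoder_step(outputs, enc_out)
print(dec.shape)  # (3, 8): one representation per output word so far
```

Note the asymmetry in the cross-attention call: 3 output-word queries attend over 5 encoded input words, yet the result still has one row per output word.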
Primary Resources:
- In Harvard CS50’s Introduction to Artificial Intelligence with Python 2023, Brian Yu explains the transformer architecture at 00:54:15, and then the encoder-decoder part