Transformer

Nov 25, 2025 | LLM Concepts

A transformer is the major architectural breakthrough behind modern LLMs: a neural network that uses attention to process all tokens in a sequence in parallel, instead of step by step like an RNN.

  • Processes all tokens in a sequence in parallel (see the attention sketch after this list)
  • Built from stacked blocks of attention + feedforward layers
  • Scales well with data and compute (better for long texts and for GPUs/TPUs)
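
To make the parallelism concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation of a transformer. The shapes, the toy input, and the use of self-attention (Q = K = V) are illustrative assumptions, not a production implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Every token position is handled in one matrix multiply;
    there is no step-by-step recurrence, unlike an RNN.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted mix of value vectors

# Toy self-attention over 4 tokens with 8-dimensional vectors.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)   # (4, 8)
```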

Why are they better?

  • RNNs have a sequential bottleneck and struggle with long-range dependencies
  • CNNs only partially fix this (distant tokens still need many stacked layers to interact)

High-level structure

  • Embedding
  • Encoder and decoder stacks
  • Attention (a single block is sketched after this list)
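
Putting the pieces together, one encoder block might look like the sketch below: a minimal, single-head, post-norm layout, reusing scaled_dot_product_attention (and the numpy import) from the sketch above. The parameter names (Wq, W1, ...) are illustrative placeholders for learned weights.

```python
def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise MLP applied independently to every token (ReLU).
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, p):
    # 1) Self-attention, residual connection, layer norm.
    attn = scaled_dot_product_attention(x @ p["Wq"], x @ p["Wk"], x @ p["Wv"])
    x = layer_norm(x + attn)
    # 2) Feedforward, residual connection, layer norm.
    return layer_norm(x + feed_forward(x, p["W1"], p["b1"], p["W2"], p["b2"]))

# Toy dimensions: d_model = 8, hidden size d_ff = 32, 4 tokens.
rng = np.random.default_rng(1)
p = {"Wq": rng.normal(size=(8, 8)), "Wk": rng.normal(size=(8, 8)),
     "Wv": rng.normal(size=(8, 8)),
     "W1": rng.normal(size=(8, 32)), "b1": np.zeros(32),
     "W2": rng.normal(size=(32, 8)), "b2": np.zeros(8)}
x = rng.normal(size=(4, 8))
print(transformer_block(x, p).shape)   # (4, 8)
```

A real model stacks many such blocks (and the decoder adds masked self-attention and cross-attention), but the residual + norm wiring is the same.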

Position-aware encoding:  attention by itself ignores token order, so a positional encoding is added to each token embedding so the vectors preserve the positions in the sequence. Each position-aware embedding is then projected three ways inside every attention layer:

  • Query (Q)
  • Key (K)
  • Value (V)
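
A hedged sketch of that flow, assuming the sinusoidal positional encoding from the original transformer paper; the projection matrices Wq, Wk, Wv stand in for learned weights.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same).
    # Assumes d_model is even.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# Position information is added to the token embeddings once ...
rng = np.random.default_rng(2)
seq_len, d_model = 4, 8
emb = rng.normal(size=(seq_len, d_model))                 # token embeddings
x = emb + sinusoidal_positional_encoding(seq_len, d_model)

# ... and that one position-aware vector is projected three ways.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv                          # queries, keys, values
```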

Transformer Explainer

Primary Resources