What is a Transformer?
The revolutionary neural network architecture that powers GPT, BERT, Claude, and nearly every modern large language model. Introduced in the landmark 2017 paper "Attention Is All You Need."
The Big Idea: Attention Is All You Need
Before the Transformer, AI models that worked with language (translating between languages, generating text) relied on Recurrent Neural Networks (RNNs) and their variant, LSTMs. These models processed words one at a time, in order, like reading a book strictly left to right while remembering earlier words only through a single, gradually fading summary.
The Transformer, introduced by researchers at Google in 2017, abandoned this sequential approach entirely. Its core innovation is relying solely on the self-attention mechanism, which allows the model to look at every word in a sentence simultaneously and figure out which words are most relevant to each other. This single idea transformed the field of artificial intelligence.
"The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution." -- Vaswani et al., 2017
Self-Attention: Understanding Context
Self-attention allows the model to understand that words carry different meanings depending on their context. In the sentence "The animal didn't cross the street because it was too tired," for example, the model learns to associate "it" with "animal" rather than "street."
How the Transformer Works
The Transformer architecture has several key components that work together. Here is a simplified breakdown of each.
1. Input Embedding
Each word (or token) is converted into a dense numerical vector -- a list of numbers that captures its meaning in a high-dimensional space.
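Conceptually, this is just a table lookup. A minimal sketch with a toy vocabulary (the table here is random; real models learn these weights during training, and `d_model` is typically in the hundreds or thousands):

```python
import numpy as np

# Toy vocabulary and embedding table. Values are random here; in a real
# model the table is a learned parameter.
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Map each token to its dense vector by row lookup."""
    return embedding_table[[vocab[t] for t in tokens]]

x = embed(["the", "cat", "sat"])
print(x.shape)  # (3, 8): one 8-dimensional vector per token
```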
2. Positional Encoding
Since the Transformer processes all words in parallel (not sequentially), it needs a way to know word order. Positional encodings are mathematical patterns added to each embedding to encode where each word sits in the sentence.
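The original paper's pattern uses sine and cosine waves of different frequencies: PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)). A sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from the original paper:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    """
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = positional_encoding(seq_len=50, d_model=8)
# The encoding is simply added to the token embeddings:
#   x = embeddings + pe
```

Each position gets a unique fingerprint, and the fixed frequencies let the model reason about relative distances between positions.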
3. Multi-Head Attention
Instead of computing attention once, the model runs multiple attention operations in parallel -- each "head" learns to focus on different types of relationships (syntax, semantics, coreference, etc.).
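In code, "multiple heads" just means splitting the model dimension into chunks, attending within each chunk independently, and merging the results. A sketch with random projection matrices standing in for learned weights:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, n_heads, W_q, W_k, W_v, W_o):
    """Split d_model into n_heads independent attention heads,
    run scaled dot-product attention in each, then merge."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    # Project to queries, keys, values, then reshape into heads.
    q = (x @ W_q).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ W_k).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ W_v).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    out = softmax(scores) @ v                            # (heads, seq, d_head)
    # Concatenate heads back together and apply the output projection.
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ W_o

rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
x = rng.normal(size=(5, d_model))
W = [rng.normal(size=(d_model, d_model)) for _ in range(4)]
y = multi_head_attention(x, n_heads, *W)
print(y.shape)  # (5, 8): same shape as the input
```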
4. Feed-Forward Network
After attention, each position passes through a small fully connected neural network. This adds non-linearity and allows the model to transform the attended information into richer representations.
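This "position-wise" network is two linear layers with a non-linearity in between, applied to each token's vector independently. A sketch (the inner dimension `d_ff` is conventionally about 4x `d_model`; weights here are random placeholders):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: expand, apply ReLU,
    project back down. Applied independently to each token."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # d_ff is typically ~4x d_model
x = rng.normal(size=(5, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
y = feed_forward(x, W1, b1, W2, b2)
print(y.shape)  # (5, 8): per-token transformation, shape preserved
```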
5. Layer Normalization & Residual Connections
Each sub-layer (attention and feed-forward) is wrapped in a residual connection followed by layer normalization. This stabilizes training and allows models to go very deep -- sometimes 96+ layers.
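The wrapper around every sub-layer can be sketched in a few lines. This follows the original paper's post-norm arrangement, with the learned scale and shift parameters of layer normalization omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's vector to zero mean and unit variance
    (learned scale/shift parameters omitted for brevity)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sublayer_block(x, sublayer):
    """Residual connection followed by layer normalization: the
    wrapper around every attention and feed-forward sub-layer."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
y = sublayer_block(x, lambda h: 0.1 * h)  # stand-in for attention/FFN
```

The residual path (`x + sublayer(x)`) gives gradients a direct route through the network, which is what makes very deep stacks trainable.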
6. Encoder-Decoder Structure
The original Transformer has an encoder (reads the input and builds understanding) and a decoder (generates the output token by token). Models like BERT use only the encoder; GPT uses only the decoder.
The Attention Mechanism in Detail
At its mathematical core, self-attention computes three vectors for each token: a Query (Q), a Key (K), and a Value (V). Think of it as a retrieval system.
The Query asks "what am I looking for?" The Key says "here is what I represent." The dot product between them, scaled down by the square root of the key dimension to keep values numerically stable, produces an attention score: how relevant each token is to the current one. These scores are normalized via softmax, then used to form a weighted sum of the Values. The result is a context-aware representation of every token.
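The retrieval analogy maps directly to a few lines of code. A minimal sketch of scaled dot-product attention, with random vectors standing in for the learned Q/K/V projections:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # relevance of every token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax
    return weights @ V, weights      # weighted sum of the Values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # one query vector per token
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = attention(Q, K, V)
# Each row of w sums to 1: a probability distribution over tokens.
```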
Why Transformers Replaced RNNs
The shift from RNNs to Transformers was driven by three fundamental advantages.
| Property | RNN / LSTM | Transformer |
|---|---|---|
| Processing | Sequential (one word at a time) | Parallel (all words at once) |
| Training Speed | Slow -- cannot parallelize across time steps | Fast -- fully parallelizable on GPUs |
| Long-Range Dependencies | Struggles with sentences longer than ~50-100 tokens | Handles thousands of tokens via direct attention connections |
| Scalability | Diminishing returns with more parameters | Performance scales predictably with model size and data |
| Context Window | Limited by vanishing gradient problem | Defined by design (e.g., 4K, 128K, or 1M+ tokens) |
Encoder-Only, Decoder-Only, and Encoder-Decoder
The original Transformer paper described a full encoder-decoder model. In practice, three variants have emerged, each suited to different tasks.
Encoder-Only
Example: BERT, RoBERTa
Best for understanding tasks -- classification, sentiment analysis, named entity recognition. The encoder reads the entire input bidirectionally to build a rich representation.
Decoder-Only
Example: GPT-4, Claude, LLaMA
Best for generation tasks -- text completion, conversation, code generation. The decoder generates tokens left-to-right, attending only to previous tokens (causal masking).
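Causal masking is implemented by adding negative infinity to the attention scores for all future positions before the softmax, so they receive zero weight. A sketch with uniform scores:

```python
import numpy as np

def causal_mask(seq_len):
    """Upper-triangular mask: position i may only attend to j <= i."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# Uniform scores plus the mask: -inf entries become zero weight.
scores = np.zeros((4, 4)) + causal_mask(4)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# Row 0 attends only to token 0; row 3 attends to all four tokens.
```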
Encoder-Decoder
Example: T5, BART, the original Transformer
Best for sequence-to-sequence tasks -- translation, summarization. The encoder processes the input; the decoder generates the output while attending to the encoder's representations.
The Transformer's Impact: A Timeline
2017: "Attention Is All You Need"
Vaswani et al. at Google publish the Transformer paper. Originally designed for machine translation, it outperforms the best existing translation models while being faster to train.
2018: BERT and GPT-1
Google releases BERT (encoder-only), revolutionizing NLP benchmarks. OpenAI releases GPT-1 (decoder-only), demonstrating the power of generative pre-training.
2020: GPT-3 and Scaling Laws
OpenAI's GPT-3 with 175 billion parameters shows that Transformers exhibit emergent abilities when scaled up -- few-shot learning, reasoning, and code generation.
2020-2021: Beyond Language
Vision Transformers (ViT) bring attention to image processing. DALL-E applies Transformers to image generation. AlphaFold 2 uses attention for protein structure prediction.
2022-Present: The Foundation Model Era
ChatGPT, Claude, Gemini, and LLaMA bring Transformer-based models to billions of users. Transformers become the universal architecture for language, vision, audio, video, and multimodal AI.
Want to Go Deeper?
This lexicon entry covers the essentials. For a comprehensive walkthrough of the Transformer architecture with visual diagrams and code examples, read our full guide.
Read the Full Transformers Guide →