AI Architecture

What is a Transformer?

The revolutionary neural network architecture that powers GPT, BERT, Claude, and virtually every modern AI system. Introduced in the landmark 2017 paper "Attention Is All You Need."

The Big Idea: Attention Is All You Need

Before the Transformer, AI models that worked with language (like translating between languages or generating text) relied on Recurrent Neural Networks (RNNs) and their variant, LSTMs. These models processed words one at a time, in order, compressing everything seen so far into a single fixed-size hidden state, like reading a book while keeping only a running summary of what came before, and never glancing ahead.

The Transformer, introduced by researchers at Google in 2017, threw away this sequential approach entirely. Its core innovation is the self-attention mechanism, which allows the model to look at every word in a sentence simultaneously and figure out which words are most relevant to each other. This single idea transformed the entire field of artificial intelligence.

"The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution." -- Vaswani et al., 2017

Self-Attention: Understanding Context

Self-attention allows the model to understand that words carry different meanings depending on their context. Consider the sentence "The cat sat on the mat because it was tired." To interpret "it," the model must attend strongly to "cat" rather than "mat," since a mat cannot be tired. Every word's representation is built this way, as a blend of the words most relevant to it.

How the Transformer Works

The Transformer architecture has several key components that work together. Here is a simplified breakdown of each.

1. Input Embedding

Each word (or token) is converted into a dense numerical vector -- a list of numbers that captures its meaning in a high-dimensional space.
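As a minimal sketch of this lookup, here is a toy embedding table in numpy. The vocabulary, dimensions, and values are all illustrative assumptions, not from a trained model; in practice the table is learned during training and has tens of thousands of rows.

```python
import numpy as np

# Toy vocabulary and a random embedding table (illustrative only):
# each token id maps to one row, a d_model-dimensional vector.
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Look up a dense vector for each token."""
    return embedding_table[[vocab[t] for t in tokens]]

vectors = embed(["the", "cat", "sat"])
print(vectors.shape)  # (3, 4): one d_model-sized vector per token
```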

2. Positional Encoding

Since the Transformer processes all words in parallel (not sequentially), it needs a way to know word order. Positional encodings are mathematical patterns added to each embedding to encode where each word sits in the sentence.
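One common pattern, used in the original paper, is the sinusoidal encoding below; each position gets a unique combination of sine and cosine values at different frequencies. This sketch assumes that formulation (learned positional embeddings are another common choice).

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    div = 10000 ** (np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)              # even dimensions
    pe[:, 1::2] = np.cos(positions / div)              # odd dimensions
    return pe

pe = positional_encoding(seq_len=10, d_model=8)
# In a Transformer these are simply added to the token embeddings:
# x = embeddings + pe
```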

3. Multi-Head Attention

Instead of computing attention once, the model runs multiple attention operations in parallel -- each "head" learns to focus on different types of relationships (syntax, semantics, coreference, etc.).
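The "heads" are just slices of the model dimension. A minimal sketch of the reshape (each head would also get its own learned Q/K/V projections, omitted here for brevity):

```python
import numpy as np

def split_heads(x, num_heads):
    """Reshape (seq_len, d_model) into (num_heads, seq_len, d_head)
    so each head attends over its own slice of the representation."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    return x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

x = np.ones((6, 16))          # 6 tokens, d_model = 16
heads = split_heads(x, num_heads=4)
print(heads.shape)            # (4, 6, 4): 4 heads, each with d_head = 4
```

After attention runs independently in each head, the head outputs are concatenated back to shape (seq_len, d_model) and passed through a final linear projection.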

4. Feed-Forward Network

After attention, each position passes through a small fully connected neural network. This adds non-linearity and allows the model to transform the attended information into richer representations.
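A minimal numpy sketch of this position-wise network, with random illustrative weights (a real model learns them, and the inner dimension d_ff is typically about 4x d_model):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: two linear layers with a
    ReLU in between, applied independently at every token position."""
    hidden = np.maximum(0, x @ W1 + b1)   # expand to d_ff, apply ReLU
    return hidden @ W2 + b2               # project back to d_model

d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(5, d_model))          # 5 token positions
out = feed_forward(x, W1, b1, W2, b2)
print(out.shape)                           # (5, 8): same shape in and out
```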

5. Layer Normalization

Each sub-layer (attention and feed-forward) includes normalization and residual connections to stabilize training and allow the model to go very deep -- sometimes 96+ layers.
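The wiring described above can be sketched as follows, assuming the original paper's "post-norm" arrangement, LayerNorm(x + Sublayer(x)), with the learned scale and shift parameters omitted:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sublayer_block(x, sublayer):
    """Residual connection around a sub-layer, then normalization.
    The residual path lets gradients flow directly through deep stacks."""
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(1).normal(size=(5, 8))
out = sublayer_block(x, lambda h: h)   # identity sub-layer, for illustration
print(out.shape)                       # (5, 8)
```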

6. Encoder-Decoder Structure

The original Transformer has an encoder (reads the input and builds understanding) and a decoder (generates the output token by token). Models like BERT use only the encoder; GPT uses only the decoder.

The Attention Mechanism in Detail

At its mathematical core, self-attention computes three vectors for each token: a Query (Q), a Key (K), and a Value (V). Think of it as a retrieval system.

Query (Q): "What am I looking for?"
Key (K): "What do I contain?"
Value (V): "Here is my actual content"

Attention output: softmax(Q . K^T / sqrt(d_k)) . V

The Query asks "what am I looking for?" The Key says "here is what I represent." The dot product between them produces an attention score -- how relevant each token is to the current one. These scores are normalized via softmax, then used to create a weighted sum of the Values. The result is a context-aware representation of every token.
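This whole computation fits in a few lines of numpy. A minimal sketch of scaled dot-product attention (the Q, K, V matrices here are random stand-ins; in a real model they come from learned linear projections of the token embeddings):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- each output row is a
    weighted sum of the Value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # relevance of every token pair
    # Numerically stable softmax over the key dimension:
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): a context-aware vector per token
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the Values, weighted by Query-Key relevance.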

Why Transformers Replaced RNNs

The shift from RNNs to Transformers was driven by three fundamental advantages.

Processing. RNN/LSTM: sequential (one word at a time). Transformer: parallel (all words at once).
Training speed. RNN/LSTM: slow, cannot parallelize across time steps. Transformer: fast, fully parallelizable on GPUs.
Long-range dependencies. RNN/LSTM: struggles with sequences longer than roughly 50-100 tokens. Transformer: handles thousands of tokens via direct attention connections.
Scalability. RNN/LSTM: diminishing returns with more parameters. Transformer: performance scales predictably with model size and data.
Context window. RNN/LSTM: limited by the vanishing gradient problem. Transformer: set by design (e.g., 4K, 128K, or 1M+ tokens).

Encoder-Only, Decoder-Only, and Encoder-Decoder

The original Transformer paper described a full encoder-decoder model. In practice, three variants have emerged, each suited to different tasks.

Encoder-Only

Example: BERT, RoBERTa

Best for understanding tasks -- classification, sentiment analysis, named entity recognition. The encoder reads the entire input bidirectionally to build a rich representation.

Decoder-Only

Example: GPT-4, Claude, LLaMA

Best for generation tasks -- text completion, conversation, code generation. The decoder generates tokens left-to-right, attending only to previous tokens (causal masking).
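Causal masking is easy to show concretely. A minimal sketch: a lower-triangular boolean mask blocks attention to future positions by setting their scores to negative infinity, so they receive zero weight after the softmax.

```python
import numpy as np

seq_len = 4
# Causal mask: position i may attend only to positions <= i.
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

scores = np.zeros((seq_len, seq_len))     # illustrative attention scores
scores[~mask] = -np.inf                   # future positions: zero weight
                                          # after softmax
print(mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```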

Encoder-Decoder

Example: T5, BART, the original Transformer

Best for sequence-to-sequence tasks -- translation, summarization. The encoder processes the input; the decoder generates the output while attending to the encoder's representations.

The Transformer's Impact: A Timeline

2017

"Attention Is All You Need"

Vaswani et al. at Google publish the Transformer paper. Originally designed for machine translation, it outperforms all existing models while being faster to train.

2018

BERT and GPT-1

Google releases BERT (encoder-only), revolutionizing NLP benchmarks. OpenAI releases GPT-1 (decoder-only), demonstrating the power of generative pre-training.

2020

GPT-3 and Scaling Laws

OpenAI's GPT-3 with 175 billion parameters shows that Transformers exhibit emergent abilities when scaled up -- few-shot learning, reasoning, and code generation.

2020-2022

Beyond Language

Vision Transformers (ViT) bring attention to image processing. DALL-E applies Transformers to image generation. AlphaFold 2 uses attention for protein structure prediction.

2022-Present

The Foundation Model Era

ChatGPT, Claude, Gemini, and LLaMA bring Transformer-based models to billions of users. Transformers become the universal architecture for language, vision, audio, video, and multimodal AI.

Want to Go Deeper?

This lexicon entry covers the essentials. For a comprehensive walkthrough of the Transformer architecture with visual diagrams and code examples, read our full guide.

Read the Full Transformers Guide →