Causal Language Model
A language model that generates text left-to-right, predicting each token based only on the tokens that came before it (never looking ahead).
Architecture
Uses a decoder-only transformer with causal (masked) self-attention. Each position can attend only to itself and earlier positions, enforcing the autoregressive property. GPT, LLaMA, Claude, and Gemini are all causal language models.
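A minimal sketch of how the causal mask works, using numpy: future positions (column index greater than row index) get a score of negative infinity before the softmax, so they receive zero attention weight. The function name and the simplification to a single head with precomputed scores are illustrative, not any library's API.

```python
import numpy as np

def causal_attention_weights(scores: np.ndarray) -> np.ndarray:
    """Apply a causal mask to raw attention scores, then softmax.

    scores has shape (seq_len, seq_len); row i holds query i's
    scores against every key position j.
    """
    seq_len = scores.shape[0]
    # Upper-triangular positions (j > i) are "the future": set their
    # score to -inf so softmax assigns them exactly zero weight.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(mask, -np.inf, scores)
    # Numerically stable row-wise softmax over past/current positions.
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

weights = causal_attention_weights(np.zeros((4, 4)))
# Row i attends uniformly over positions 0..i; future weights are 0.
```

With uniform (all-zero) scores, row 0 puts all its weight on position 0, row 1 splits it 0.5/0.5, and so on: each token sees only its prefix.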
Training Objective
Trained on next-token prediction: given a prefix of tokens, the model outputs a probability distribution over the vocabulary for the next token. The loss is the cross-entropy between each predicted distribution and the token that actually comes next.
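The objective can be sketched in a few lines of numpy. This assumes the model has already produced `logits` (one score per vocabulary token at each position); the function name and shapes are illustrative.

```python
import numpy as np

def next_token_loss(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy between predicted next-token distributions
    and the actual next tokens.

    logits:  (seq_len, vocab_size) -- position t's scores for token t+1.
    targets: (seq_len,)            -- the actual next-token ids.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Log-probability the model assigned to each true next token.
    picked = log_probs[np.arange(len(targets)), targets]
    return float(-picked.mean())
```

As a sanity check, a model that is maximally uncertain (uniform logits over a vocabulary of size V) incurs a loss of ln V per token.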
Why Causal?
The causal constraint ensures the model can generate text token by token at inference time. It also makes training efficient: a single forward pass through a sequence provides training signal at every position simultaneously.
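The token-by-token generation this enables can be sketched as a greedy decoding loop. Here `model` is a stand-in for a real transformer forward pass: any callable mapping a list of token ids to per-position logits. The toy model at the bottom is purely illustrative.

```python
import numpy as np

def generate(model, prompt, max_new_tokens):
    """Greedy autoregressive decoding: run the model on the sequence
    so far, take the argmax of the final position's logits, append it,
    and repeat.

    model: callable(list[int]) -> np.ndarray of shape (len(ids), vocab).
    """
    ids = list(prompt)
    for _ in range(max_new_tokens):
        logits = model(ids)                   # forward pass over the prefix
        next_id = int(np.argmax(logits[-1]))  # last position predicts next
        ids.append(next_id)
    return ids

# Toy "model": one-hot logits that always predict (last token + 1) mod 5.
toy = lambda ids: np.eye(5)[[(t + 1) % 5 for t in ids]]
generate(toy, [0], 3)  # → [0, 1, 2, 3]
```

Note that this naive loop re-runs the model on the full prefix at every step; real implementations cache per-position attention state (a KV cache) so each new token costs only one incremental step, which is possible precisely because earlier positions never attend to later ones.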