AI Glossary

Next-Token Prediction

The fundamental training objective of autoregressive language models: given a sequence of tokens, output a probability distribution over the vocabulary for the next token.
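The model's raw output at each position is a vector of scores (logits), one per vocabulary entry, which a softmax converts into the probability distribution described above. A minimal sketch (the toy three-token logits are illustrative, not from any real model):

```python
import math

def softmax(logits):
    # Convert raw model scores (logits) into a probability distribution.
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-token vocabulary:
probs = softmax([2.0, 1.0, 0.1])
```

The resulting values are non-negative and sum to 1, so they can be read directly as next-token probabilities.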

The Core of LLMs

GPT and similar models are trained on one simple task: given all previous tokens, predict the next one. Despite this simplicity, scaling this objective to massive datasets and model sizes produces remarkably capable systems that can reason, code, translate, and more.
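The training signal for this task is cross-entropy: at each position, penalize the model by the negative log-probability it assigned to the token that actually came next. A minimal sketch, assuming we already have per-position predicted distributions (the toy numbers are illustrative):

```python
import math

def next_token_loss(probs_per_step, targets):
    # probs_per_step[t] is the model's predicted distribution at position t;
    # targets[t] is the index of the token that actually appeared next.
    # The loss is the average negative log-likelihood of the true next tokens.
    nll = [-math.log(p[t]) for p, t in zip(probs_per_step, targets)]
    return sum(nll) / len(nll)

# Two positions over a 2-token vocabulary; the model was confident and right.
loss = next_token_loss([[0.9, 0.1], [0.2, 0.8]], [0, 1])
```

Driving this loss down across trillions of tokens is, in essence, the entire pretraining recipe.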

How Generation Works

At inference time, the model predicts a probability for every possible next token, samples one, appends it to the sequence, and repeats. Temperature and top-p parameters control the randomness of sampling.
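The loop above can be sketched in a few lines. This is a simplified illustration of temperature and top-p (nucleus) sampling over raw logits, not any particular library's implementation:

```python
import math
import random

def sample_next(logits, temperature=1.0, top_p=1.0):
    # Temperature rescales logits before softmax: <1 sharpens the
    # distribution toward the top token, >1 flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p (nucleus) sampling: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then sample from that set.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Sample proportionally from the retained mass.
    r = random.random() * sum(probs[i] for i in kept)
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

# With a dominant logit and a small top_p, the nucleus collapses to
# the single most likely token, so sampling becomes deterministic.
token = sample_next([5.0, 1.0, 0.0], temperature=1.0, top_p=0.5)
```

Generation repeats this function: append the sampled token to the context, rerun the model, sample again, until an end-of-sequence token or a length limit.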

The Scaling Hypothesis

The surprising effectiveness of next-token prediction at scale suggests that predicting text well enough requires learning a compressed model of the world, which may help explain why LLMs exhibit emergent capabilities such as reasoning and tool use.


Last updated: March 5, 2026