Token
The basic unit of text that language models process -- typically a word, subword, or character, depending on the tokenizer used.
What Tokens Are
Tokenizers split text into pieces the model can process. 'Hello world' might be 2 tokens. 'Unbelievable' might be split into 'Un', 'believ', 'able' (3 tokens). Common words are single tokens; rare words are split into subwords.
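The subword splitting described above can be sketched with a toy greedy longest-match tokenizer (WordPiece-style matching). The vocabulary here is a hand-picked illustration, not a real model's vocabulary:

```python
def tokenize(word, vocab):
    """Split a word into subwords by greedily taking the longest vocab match."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest candidate substring first, shrinking until a match.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

# Toy vocabulary: common words are whole tokens, rare words split into pieces.
vocab = {"Hello", "world", "Un", "believ", "able"}
print(tokenize("Unbelievable", vocab))  # ['Un', 'believ', 'able']
```

Real tokenizers use learned vocabularies of tens of thousands of entries, but the longest-match idea is the same: frequent strings become single tokens, everything else decomposes into smaller pieces.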
Token Counts
A rough rule: 1 token is approximately 4 characters or 0.75 words in English. A page of text is roughly 500 tokens. Token limits determine how much text fits in a model's context window and affect API pricing.
Tokenizer Types
BPE (Byte-Pair Encoding): Used by GPT models. Iteratively merges the most frequent pairs of bytes or symbols into new vocabulary entries.
SentencePiece: Language-independent tokenization used by LLaMA.
WordPiece: Used by BERT.
All are subword tokenizers balancing vocabulary size and coverage.
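The core BPE training loop can be sketched in a few lines: count adjacent symbol pairs across a corpus, merge the most frequent pair, and repeat. The tiny corpus below is illustrative only:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most frequent."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word -> frequency, each word starting as individual characters.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for _ in range(10):  # learn 10 merges
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    corpus = merge_pair(corpus, pair)
```

In this corpus the first merge learned is ('e', 'r'), since "er" is the most frequent adjacent pair; production BPE implementations work on bytes and learn tens of thousands of merges, but follow the same greedy procedure.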