AI Glossary

Token

The basic unit of text that language models process -- typically a word, subword, or character, depending on the tokenizer used.

What Tokens Are

Tokenizers split text into pieces the model can process. 'Hello world' might be 2 tokens, while 'Unbelievable' might be split into 'Un', 'believ', 'able' (3 tokens). Common words are usually single tokens; rarer words are split into multiple subword pieces.
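The splitting idea can be sketched with a toy greedy longest-match tokenizer (illustrative only: the tokenize function and the hand-picked vocabulary below are made up for this example; real tokenizers learn their vocabularies from data):

```python
def tokenize(text, vocab):
    # Greedy longest-match: at each position, take the longest
    # vocabulary piece that fits, falling back to single characters.
    tokens = []
    for word in text.split():
        i = 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    tokens.append(word[i:j])
                    i = j
                    break
            else:
                tokens.append(word[i])  # unknown-character fallback
                i += 1
    return tokens

vocab = {"Un", "believ", "able", "Hello", "world"}
print(tokenize("Hello world", vocab))   # ['Hello', 'world']
print(tokenize("Unbelievable", vocab))  # ['Un', 'believ', 'able']
```

'Hello' and 'world' are in the vocabulary, so they stay whole; 'Unbelievable' is not, so it falls apart into known subword pieces.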

Token Counts

A rough rule: 1 token is approximately 4 characters or 0.75 words in English. A page of text is roughly 500 tokens. Token limits determine how much text fits in a model's context window, and API usage is typically billed per token.
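That rule of thumb can be turned into a quick back-of-the-envelope estimator (a rough heuristic only; estimate_tokens is a hypothetical helper, and real counts require the model's actual tokenizer):

```python
def estimate_tokens(text):
    # Rule of thumb from above: ~4 characters per token,
    # ~0.75 words per token. Average the two estimates.
    by_chars = len(text) / 4
    by_words = len(text.split()) / 0.75
    return round((by_chars + by_words) / 2)

print(estimate_tokens("word " * 100))  # 100 short words -> roughly 129
```

For pricing or context-window budgeting, treat this as a sanity check, not an exact count.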

Tokenizer Types

BPE (Byte-Pair Encoding): Used by GPT models. Iteratively merges frequent byte pairs.
SentencePiece: Language-independent tokenization used by LLaMA.
WordPiece: Used by BERT.
All are subword tokenizers balancing vocabulary size and coverage.
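The core BPE merge step can be sketched in a few lines (a simplified character-level sketch; real BPE operates on bytes and learns thousands of merges from a training corpus):

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent token pairs and return the most common one.
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge_pair(tokens, pair):
    # Replace every occurrence of the pair with one merged token.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("banana")  # start from characters for illustration
tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)  # ['b', 'an', 'an', 'a'] -- ('a', 'n') merged first
```

Training repeats this merge step until the vocabulary reaches a target size; frequent sequences end up as single tokens while rare ones remain split.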


Last updated: March 5, 2026