Token
The basic unit of text that language models process -- typically a word, subword, or character, depending on the tokenizer used.
What Tokens Are
Tokenizers split text into pieces the model can process. 'Hello world' might be 2 tokens. 'Unbelievable' might be split into 'Un', 'believ', 'able' (3 tokens). Common words are single tokens; rare words are split into subwords.
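The subword splitting described above can be sketched with a toy greedy longest-match tokenizer (WordPiece-style matching). The vocabulary here is a hand-picked illustration, not a real model's vocabulary:

```python
def tokenize(word, vocab):
    """Split a word into subwords by greedily taking the longest vocab match."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest candidate substring first, shrinking until a match.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

# Toy vocabulary: common words are whole tokens, rare words split into pieces.
vocab = {"Hello", "world", "Un", "believ", "able"}
print(tokenize("Unbelievable", vocab))  # ['Un', 'believ', 'able']
```

Real tokenizers use learned vocabularies of tens of thousands of entries, but the longest-match idea is the same: frequent strings become single tokens, everything else decomposes into smaller pieces.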
Token Counts
A rough rule: 1 token is approximately 4 characters or 0.75 words in English. A page of text is roughly 500 tokens. Token limits determine how much text fits in a model's context window and affect API pricing.
Tokenizer Types
BPE (Byte-Pair Encoding): Used by GPT models. Iteratively merges the most frequent pairs of bytes or symbols into new vocabulary entries.
SentencePiece: Language-independent tokenization used by LLaMA.
WordPiece: Used by BERT.
All are subword tokenizers balancing vocabulary size and coverage.
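The core BPE training loop can be sketched in a few lines: count adjacent symbol pairs across a corpus, merge the most frequent pair, and repeat. The tiny corpus below is illustrative only:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most frequent."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word -> frequency, each word starting as individual characters.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for _ in range(10):  # learn 10 merges
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    corpus = merge_pair(corpus, pair)
```

In this corpus the first merge learned is ('e', 'r'), since "er" is the most frequent adjacent pair; production BPE implementations work on bytes and learn tens of thousands of merges, but follow the same greedy procedure.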