Before an LLM can process any text, that text must be converted into numbers. This conversion -- tokenization -- is far more nuanced than simply splitting text into words. The choice of tokenization algorithm affects model performance, multilingual capability, vocabulary efficiency, and even the cost of using the model. Understanding tokenization is essential for anyone working with LLMs.
Why Not Just Use Words?
Word-level tokenization seems intuitive but has critical problems. English alone has over a million distinct word forms when you count inflections, compounds, and proper nouns. A word-level vocabulary would be enormous, making the embedding matrix and output layer impractically large. Worse, any word not in the vocabulary becomes an unknown [UNK] token, losing all information about it.
Character-level tokenization solves the vocabulary problem (you only need about 256 characters) but creates extremely long sequences. The sentence "attention mechanism" becomes 19 characters instead of 2 words, dramatically increasing computation and making it harder for the model to learn word-level patterns.
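The sequence-length blow-up is easy to check with the two-word example from the text:

```python
sentence = "attention mechanism"

word_tokens = sentence.split()   # ['attention', 'mechanism']: 2 tokens
char_tokens = list(sentence)     # one token per character, space included: 19 tokens
```

Nearly a 10x longer sequence for the same content, and attention cost grows quadratically with sequence length.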
Subword tokenization strikes the ideal balance: common words are kept as single tokens, while rare words are split into meaningful subword pieces. "Unhappiness" might become ["un", "happiness"] or ["un", "happi", "ness"], preserving morphological information while keeping the vocabulary manageable.
Subword tokenization is the Goldilocks solution: not too fine-grained like characters, not too coarse like words, but just right for capturing both common patterns and rare words.
Byte Pair Encoding (BPE)
BPE, originally a data compression algorithm, is the most widely used tokenization method for LLMs (used by GPT, LLaMA, and others). The training process is straightforward:
1. Start with a vocabulary of individual characters (or bytes)
2. Count every pair of adjacent tokens in the training corpus
3. Merge the most frequent pair into a new token
4. Repeat steps 2-3 until reaching the desired vocabulary size
For example, if "th" appears most frequently, it becomes a single token. Then if "the" is most frequent, it becomes one token. The process builds up from characters to common subwords to common words.
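The training loop above can be sketched in a few lines. This is a toy trainer on a hypothetical three-word corpus, not a production implementation (real tokenizers add pre-tokenization, byte fallback, and heavy optimization):

```python
from collections import Counter

def bpe_train(word_freqs, num_merges):
    """Toy BPE trainer. `word_freqs` maps word -> corpus frequency."""
    # represent each word as a tuple of symbols, starting from characters
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # step 2: count adjacent symbol pairs across the corpus
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # step 3: merge the most frequent pair into one symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# "th" is merged first, then "th"+"e", mirroring the example in the text
merges = bpe_train({"the": 5, "there": 2, "then": 1}, num_merges=2)
```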
GPT-2 introduced byte-level BPE, which starts from individual bytes (256 possible values) rather than Unicode characters. This guarantees that any text can be tokenized without ever producing an unknown token, since any byte sequence can be represented.
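The guarantee comes from UTF-8 itself: any string, in any script, decomposes into bytes in the range 0-255, so a 256-entry base vocabulary covers everything before any merges are learned:

```python
text = "héllo 世界"
byte_ids = list(text.encode("utf-8"))
# 'é' takes 2 bytes and each CJK character takes 3, so there are more
# ids than characters -- but every id fits the 256-entry base vocabulary
assert all(0 <= b < 256 for b in byte_ids)
```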
Key Takeaway
BPE builds vocabulary by iteratively merging the most frequent character pairs. Byte-level BPE ensures any text can be tokenized without unknown tokens, making it robust for multilingual and code inputs.
WordPiece
WordPiece, used by BERT and other Google models, is similar to BPE but differs in how it selects merges. Instead of choosing the most frequent pair, WordPiece selects the pair that maximally increases the likelihood of the training data; in practice each candidate is scored as the pair's frequency divided by the product of its parts' frequencies, so a pair is merged when its parts co-occur far more often than chance would predict. This prioritizes merges that make the overall encoding more statistically efficient, not just the most common ones.
WordPiece marks subword continuations with a special "##" prefix. The word "playing" might be tokenized as ["play", "##ing"], where "##" indicates that "ing" is a continuation of a previous token, not a standalone word.
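At inference time, WordPiece segments a word by greedy longest-match-first lookup against the learned vocabulary. A minimal sketch, using a tiny hand-picked vocabulary rather than a trained one (this shows the encoding algorithm, not the likelihood-based training):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match segmentation with '##' continuation pieces."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark non-initial pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched at this position
        tokens.append(piece)
        start = end
    return tokens

vocab = {"play", "##ing", "##ed", "walk"}
wordpiece_tokenize("playing", vocab)  # ['play', '##ing']
```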
SentencePiece
SentencePiece, developed by Google, takes a different approach: it treats the input as a raw stream of Unicode characters without assuming any pre-tokenization (such as splitting on spaces), handling whitespace as an ordinary symbol. This is critical for languages like Japanese and Chinese that do not use spaces between words.
SentencePiece implements both BPE and a unigram language model algorithm. The unigram approach starts with a large vocabulary and iteratively removes tokens that least affect the overall encoding probability, eventually reaching the target vocabulary size.
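Once trained, a unigram model encodes text by finding the segmentation with the highest total probability, typically via Viterbi search. A minimal sketch with illustrative hand-set log probabilities (not a trained model); note it recovers the "unhappiness" example from earlier:

```python
import math

def unigram_segment(text, logprobs):
    """Viterbi search for the highest-probability segmentation of `text`
    under a unigram model. `logprobs` maps token -> log probability."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)  # (score, backpointer) per position
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logprobs and best[start][0] > -math.inf:
                score = best[start][0] + logprobs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # backtrack from the end of the string
    tokens, pos = [], n
    while pos > 0:
        start = best[pos][1]
        tokens.append(text[start:pos])
        pos = start
    return tokens[::-1]

lp = {"un": -2.0, "happi": -3.0, "ness": -2.5, "happiness": -4.0}
unigram_segment("unhappiness", lp)  # ['un', 'happiness']
```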
SentencePiece is used by LLaMA, T5, and many multilingual models because of its language-agnostic design.
Vocabulary Size and Its Impact
The choice of vocabulary size involves trade-offs:
- Smaller vocabulary (e.g., 32K): Less memory for embeddings, but common words may be split into more tokens, increasing sequence length
- Larger vocabulary (e.g., 128K): More words are single tokens (shorter sequences), but the embedding matrix is larger and rare tokens get less training signal
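The memory side of this trade-off is plain arithmetic. A sketch assuming a hidden size of 4096 and fp16 (2-byte) weights; both values are illustrative, not tied to any specific model:

```python
def embedding_bytes(vocab_size, d_model, bytes_per_param=2):
    """Size of the input embedding matrix alone, in bytes.
    An untied output projection of the same shape doubles this."""
    return vocab_size * d_model * bytes_per_param

small = embedding_bytes(32_000, 4096)    # 32K vocab: ~262 MB
large = embedding_bytes(128_000, 4096)   # 128K vocab: ~1.05 GB
```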
Common vocabulary sizes: GPT-2 uses 50,257 tokens. LLaMA 2 uses 32,000 tokens. LLaMA 3 expanded to a roughly 128K-token vocabulary, significantly improving multilingual efficiency and code tokenization.
Why Tokenization Affects Cost and Performance
Tokenization has direct practical implications:
- Cost: API pricing is per token. Inefficient tokenization means more tokens for the same text, costing more money.
- Context usage: More tokens per message means less room in the context window for conversation history or documents.
- Multilingual equity: Tokenizers trained primarily on English can require 3-5x more tokens for the same text in other languages, making non-English usage more expensive and less efficient.
- Code handling: Poor code tokenization can split common patterns (like indentation) into many tokens, wasting context and compute.
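The cost and multilingual points can be made concrete with a toy calculation; the per-million-token price and the 4x token ratio below are illustrative placeholders from the ranges in the text, not any provider's actual numbers:

```python
def api_cost_usd(num_tokens, price_per_million_usd):
    """Cost of processing `num_tokens` at a given per-million-token price."""
    return num_tokens / 1_000_000 * price_per_million_usd

# same document content: 1,000 tokens in English vs ~4,000 tokens when an
# English-centric tokenizer is applied to another language (illustrative 4x)
english_cost = api_cost_usd(1_000, 10.0)  # 0.01 USD
other_cost = api_cost_usd(4_000, 10.0)    # 0.04 USD
```

The same 4x ratio also consumes four times the context window, which compounds the cost gap in long conversations.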
Key Takeaway
Tokenization is not just a preprocessing step -- it directly affects model cost, context efficiency, multilingual performance, and coding ability. Choosing the right tokenizer and vocabulary size is a critical design decision for any LLM.
The tokenization landscape continues to evolve. Research into more efficient encodings, better multilingual support, and specialized tokenizers for code and structured data continues to improve how LLMs interface with human language.
