Tokenizer
A component that encodes raw text into a sequence of token IDs (integers) that a language model can process, and decodes token IDs back into text.
Common Algorithms
BPE (Byte Pair Encoding): Used by GPT models. Starts from individual bytes or characters and iteratively merges the most frequent adjacent pair into a new vocabulary entry.
WordPiece: Used by BERT. Similar to BPE, but chooses merges that maximize the training data's likelihood rather than raw pair frequency.
SentencePiece: A language-independent tokenization framework used by LLaMA. Operates directly on raw text, so it needs no language-specific pre-tokenization.
Unigram: A probabilistic model that scores candidate subword segmentations of a word and keeps the most likely one.
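The BPE merge loop described above can be sketched in a few lines. This is a minimal, illustrative trainer over a toy word list (the corpus, function names, and merge count are invented for the example), not any model's production tokenizer:

```python
from collections import Counter

def get_pair_counts(words):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    # Rewrite every word, joining each occurrence of the chosen pair.
    new_words = {}
    for word, freq in words.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_words[" ".join(out)] = freq
    return new_words

def train_bpe(corpus, num_merges):
    # Start with each word split into single characters.
    words = Counter(" ".join(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(best, words)
        merges.append(best)
    return merges

merges = train_bpe(["low", "lower", "lowest", "low"], num_merges=3)
print(merges)  # first merge is ('l', 'o'), then ('lo', 'w')
```

The learned merge list is applied in the same order at inference time to segment new text. Real tokenizers add details this sketch omits, such as byte-level fallback and special tokens.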
Impact on Performance
Tokenizer quality affects encoding efficiency (tokens per word), multilingual capability, code handling, and context window utilization: fewer tokens per word means more text fits in the same context window and less compute per unit of text. A good tokenizer compresses text efficiently while keeping meaningful units intact.
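Tokens per word is easy to measure directly. The sketch below compares a character-level split against a hand-written subword segmentation (the example text and the subword split are hypothetical, chosen only to illustrate the metric):

```python
def tokens_per_word(tokens, text):
    # Average number of tokens emitted per whitespace-delimited word.
    # Lower values mean better compression and context-window use.
    return len(tokens) / len(text.split())

text = "unbelievable retokenization"

# Character-level tokenization: one token per character.
char_tokens = list(text.replace(" ", ""))

# A plausible subword segmentation (hypothetical, for illustration).
subword_tokens = ["un", "believ", "able", "re", "token", "ization"]

print(tokens_per_word(char_tokens, text))     # 13.0
print(tokens_per_word(subword_tokens, text))  # 3.0
```

The same metric, computed over a representative corpus per language, is a common way to compare how well a tokenizer serves multilingual or code-heavy workloads.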