AI Glossary

Tokenizer

A component that converts raw text into a sequence of tokens (integers) that a language model can process, and converts token IDs back into text.
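The encode/decode round trip can be sketched with a hypothetical toy tokenizer (the whitespace-split scheme and the vocabulary here are illustrative assumptions, not how production tokenizers work):

```python
# Hypothetical toy tokenizer: whitespace split against a fixed vocabulary.
vocab = {"<unk>": 0, "hello": 1, "world": 2}
inv_vocab = {i: t for t, i in vocab.items()}

def encode(text):
    # Map each whitespace-separated word to its token ID; unknowns map to <unk>.
    return [vocab.get(w, vocab["<unk>"]) for w in text.split()]

def decode(ids):
    # Convert a sequence of token IDs back into text.
    return " ".join(inv_vocab[i] for i in ids)

print(encode("hello world"))  # [1, 2]
print(decode([1, 2]))         # "hello world"
```

Real tokenizers replace the whitespace split with subword segmentation (see the algorithms below), so unknown words are broken into known pieces instead of collapsing to a single `<unk>` token.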

Common Algorithms

BPE (Byte Pair Encoding): Used by GPT models. Iteratively merges the most frequent adjacent symbol pairs into new tokens.

WordPiece: Used by BERT. Similar to BPE, but selects merges that maximize the likelihood of the training data.

Unigram: Probabilistic subword segmentation that starts from a large candidate vocabulary and prunes it down.

SentencePiece: A language-independent tokenization framework (implementing BPE and Unigram) that operates directly on raw text; used by LLaMA.
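The core BPE training loop can be sketched in a few lines (the toy corpus and merge count are illustrative; real implementations operate on bytes and train on large corpora):

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    # Replace every occurrence of the chosen pair with a single merged symbol.
    merged = pair[0] + pair[1]
    out = {}
    for word, freq in words.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(merged)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        out[tuple(new_word)] = out.get(tuple(new_word), 0) + freq
    return out

# Toy corpus: each word as a tuple of characters, mapped to its frequency.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
merges = []
for _ in range(4):
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges[0])  # ('w', 'e') — the most frequent pair is merged first
```

The learned merge list is then replayed in order at inference time to tokenize new text.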

Impact on Performance

Tokenizer quality affects model efficiency (tokens per word), multilingual capability, code handling, and context window utilization. A good tokenizer compresses text efficiently while keeping meaningful units intact.
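Efficiency in tokens per word is often measured as "fertility": the average number of tokens a tokenizer produces per word. A minimal sketch, using two hypothetical extremes (character-level and word-level splitting) for comparison:

```python
def fertility(tokenize, texts):
    # Average tokens produced per whitespace-separated word:
    # lower is better (more of the context window spent on content).
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# Illustrative extremes: character-level vs word-level tokenization.
char_tok = lambda t: list(t)
word_tok = lambda t: t.split()

texts = ["the cat sat", "tokenizers compress text"]
print(fertility(char_tok, texts))  # several tokens per word
print(fertility(word_tok, texts))  # 1.0
```

Subword tokenizers land between these extremes, and a fertility close to 1 on a target language or domain is one sign of good compression.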


Last updated: March 5, 2026