Tokenizer
A component that encodes raw text into a sequence of token IDs (integers) that a language model can process, and decodes token IDs back into text.
Common Algorithms
BPE (Byte Pair Encoding): Used by GPT models. Starts from individual bytes or characters and iteratively merges the most frequent adjacent pair into a new vocabulary entry.
WordPiece: Used by BERT. Similar to BPE, but chooses merges that maximize the training data's likelihood rather than raw pair frequency.
SentencePiece: A language-independent tokenization framework used by LLaMA. Operates directly on raw text, so it needs no language-specific pre-tokenization.
Unigram: A probabilistic model that scores candidate subword segmentations of a word and keeps the most likely one.
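The BPE merge loop described above can be sketched in a few lines. This is a minimal, illustrative trainer over a toy word list (the corpus, function names, and merge count are invented for the example), not any model's production tokenizer:

```python
from collections import Counter

def get_pair_counts(words):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    # Rewrite every word, joining each occurrence of the chosen pair.
    new_words = {}
    for word, freq in words.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_words[" ".join(out)] = freq
    return new_words

def train_bpe(corpus, num_merges):
    # Start with each word split into single characters.
    words = Counter(" ".join(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(best, words)
        merges.append(best)
    return merges

merges = train_bpe(["low", "lower", "lowest", "low"], num_merges=3)
print(merges)  # first merge is ('l', 'o'), then ('lo', 'w')
```

The learned merge list is applied in the same order at inference time to segment new text. Real tokenizers add details this sketch omits, such as byte-level fallback and special tokens.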
Impact on Performance
Tokenizer quality affects encoding efficiency (tokens per word), multilingual capability, code handling, and context window utilization: fewer tokens per word means more text fits in the same context window and less compute per unit of text. A good tokenizer compresses text efficiently while keeping meaningful units intact.
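Tokens per word is easy to measure directly. The sketch below compares a character-level split against a hand-written subword segmentation (the example text and the subword split are hypothetical, chosen only to illustrate the metric):

```python
def tokens_per_word(tokens, text):
    # Average number of tokens emitted per whitespace-delimited word.
    # Lower values mean better compression and context-window use.
    return len(tokens) / len(text.split())

text = "unbelievable retokenization"

# Character-level tokenization: one token per character.
char_tokens = list(text.replace(" ", ""))

# A plausible subword segmentation (hypothetical, for illustration).
subword_tokens = ["un", "believ", "able", "re", "token", "ization"]

print(tokens_per_word(char_tokens, text))     # 13.0
print(tokens_per_word(subword_tokens, text))  # 3.0
```

The same metric, computed over a representative corpus per language, is a common way to compare how well a tokenizer serves multilingual or code-heavy workloads.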