AI Glossary

Tokenization

The process of breaking text into smaller units (tokens) that AI models can process, a fundamental preprocessing step that significantly impacts model performance.

Methods

Word-level: split on whitespace; yields a large vocabulary and cannot handle unseen words.
Character-level: each character is a token; tiny vocabulary, but sequences get long and word-level meaning is lost.
Subword (BPE): Byte-Pair Encoding starts from characters and greedily merges the most frequent adjacent pairs into subword units.
SentencePiece: a language-agnostic tokenizer that operates on raw text without whitespace pre-splitting, typically using BPE or unigram models.
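The BPE merge step above can be sketched in a few lines of plain Python. This is a toy illustration of the training loop (count adjacent pairs, merge the most frequent one, repeat), not a production tokenizer; real implementations also build a reusable merge table and handle bytes rather than characters.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def bpe(text, num_merges):
    """Start from characters and greedily merge the most frequent pair."""
    tokens = list(text)
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        tokens = merge_pair(tokens, most_frequent_pair(tokens))
    return tokens

print(bpe("low lower lowest", 3))
```

After a few merges on this text, the repeated prefix "low" becomes a single token while rarer suffixes like "est" remain split, which is exactly the behavior that makes subword vocabularies compact yet able to cover unseen words.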

Impact on LLMs

Tokenizer choice affects model efficiency, multilingual capability, and cost. A single word may span 1-4 tokens depending on the tokenizer. API pricing is typically per-token, so efficient tokenization saves money, and because different models use different tokenizers, token counts are not directly comparable across them.

Considerations

Common English words are usually single tokens, while rare words, code, and non-English text may require many. Numbers are often tokenized inconsistently (each digit can become a separate token), which can hurt arithmetic performance.
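The digit-splitting behavior can be mimicked with a toy tokenizer. This regex-based sketch is purely illustrative (no real model's tokenizer works this simply), but it shows why a model may see "1234" as four unrelated pieces:

```python
import re

def naive_tokenize(text):
    """Toy tokenizer: words stay whole, but every digit is its own token,
    mimicking how some real tokenizers fragment numbers."""
    return re.findall(r"[A-Za-z]+|\d|[^\w\s]", text)

print(naive_tokenize("Add 1234 and 56"))
```

Here "Add" and "and" survive as single tokens, but "1234" splits into "1", "2", "3", "4", so the model must reassemble place value from separate tokens before it can do arithmetic.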


Last updated: March 5, 2026