AI Glossary

Byte-Pair Encoding

A subword tokenization algorithm that iteratively merges the most frequent character pairs.

Overview

Byte-Pair Encoding (BPE) is a subword tokenization algorithm originally developed for data compression, adapted for NLP by Sennrich et al. in 2016. It starts with individual characters and iteratively merges the most frequent adjacent pair of tokens until reaching a target vocabulary size.
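The training loop described above can be sketched in a few lines. This is a toy illustration, not a production tokenizer: it trains on a list of words and learns `num_merges` merge rules by repeatedly joining the most frequent adjacent token pair.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words (toy sketch)."""
    # Each word starts as a sequence of single-character tokens.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Apply the new merge to every word in the corpus.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges
```

On the classic toy corpus `["low", "low", "lower", "lowest"]` with two merges, the most frequent pairs are `("l", "o")` and then `("lo", "w")`, so the learned vocabulary grows subwords like `"low"` from raw characters.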

Key Details

BPE handles out-of-vocabulary words gracefully by breaking them into known subword units. For example, 'unhappiness' might be tokenized as ['un', 'happiness'] or ['un', 'happi', 'ness'], depending on the learned merges. GPT models use a byte-level variant of BPE, which operates on raw bytes rather than Unicode characters, so any string can be encoded without unknown tokens. BPE balances vocabulary size with the ability to represent arbitrary text, making it the dominant tokenization method for modern language models.
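Segmentation at inference time works by replaying the learned merges, in order, over the characters of the input word. The sketch below uses a hand-picked, purely illustrative merge list (not a real GPT vocabulary) to produce the ['un', 'happi', 'ness'] split mentioned above.

```python
def bpe_segment(word, merges):
    """Segment a word by applying learned merges in order (toy sketch)."""
    tokens = list(word)  # start from single characters
    for a, b in merges:
        out, i = [], 0
        while i < len(tokens):
            # Join every occurrence of the pair (a, b).
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# Illustrative merges; a real tokenizer learns thousands from data.
MERGES = [
    ("u", "n"),
    ("h", "a"), ("ha", "p"), ("hap", "p"), ("happ", "i"),
    ("n", "e"), ("ne", "s"), ("nes", "s"),
]
```

A word never seen during training still segments into known pieces, falling back to single characters in the worst case, which is why BPE has no true out-of-vocabulary failure mode.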

Related Concepts

tokenization, tokenizer, token

Last updated: March 5, 2026