WordPiece
A subword tokenization method used by BERT that maximizes language model likelihood.
Overview
WordPiece is a subword tokenization algorithm developed at Google, most famously used in BERT. Unlike BPE, which always merges the most frequent pair of symbols, WordPiece selects the merge that most increases the likelihood of the training data under a language model, which leads to slightly different subword boundaries.
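This likelihood-based selection is commonly implemented by scoring each candidate pair as its joint count divided by the product of its parts' counts, so a pair can win even if it is not the most frequent. The sketch below is illustrative (the function name and toy corpus are made up, not from any real training setup):

```python
from collections import Counter

def wordpiece_merge_scores(corpus_tokens):
    """Score candidate merges WordPiece-style: count(ab) / (count(a) * count(b)).

    `corpus_tokens` is a list of words, each already split into symbols.
    Higher score = bigger likelihood gain from merging that pair.
    """
    unit_counts = Counter()   # how often each symbol appears
    pair_counts = Counter()   # how often each adjacent pair appears
    for word in corpus_tokens:
        unit_counts.update(word)
        pair_counts.update(zip(word, word[1:]))
    return {
        pair: count / (unit_counts[pair[0]] * unit_counts[pair[1]])
        for pair, count in pair_counts.items()
    }

# Toy corpus: ('n','e') and ('e','w') are the most frequent pairs (count 3),
# so frequency-based BPE would merge one of them first...
corpus = [list("low"), list("low"), list("new"), list("new"), list("new")]
scores = wordpiece_merge_scores(corpus)

# ...but WordPiece prefers ('l','o'): its parts are rare, so merging them
# yields the largest relative likelihood gain (score 2/(2*2) = 0.5).
best = max(scores, key=scores.get)
```

Note how the denominator penalizes pairs built from very common symbols, which is exactly how the likelihood criterion diverges from raw frequency.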
Key Details
WordPiece tokens that continue a word are prefixed with '##' (e.g., 'playing' becomes ['play', '##ing']). The algorithm produces a fixed vocabulary that can represent any text as a sequence of subword tokens, striking a balance between character-level and word-level representations.
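At inference time, a word is split against the fixed vocabulary by greedy longest-match-first lookup, emitting the '##' prefix on every piece after the first. A minimal sketch of that procedure (the function name and tiny vocabulary are illustrative, not taken from any particular library):

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece splitting of a single word.

    Repeatedly take the longest vocabulary entry that matches the start
    of the remaining text; pieces after the first carry a '##' prefix.
    If no piece matches at some position, the whole word maps to UNK.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:                 # shrink the window until a match
            candidate = word[start:end]
            if start > 0:                  # continuation pieces get '##'
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                  # nothing matched: bail out to UNK
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, just enough to cover the examples below.
vocab = {"play", "##play", "##ing", "##ed", "un"}
print(wordpiece_tokenize("playing", vocab))   # ['play', '##ing']
print(wordpiece_tokenize("unplayed", vocab))  # ['un', '##play', '##ed']
```

Because matching is greedy from the left, the vocabulary alone determines the split; no search or scoring happens at tokenization time.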