WordPiece
A subword tokenization method used by BERT that maximizes language model likelihood.
Overview
WordPiece is a subword tokenization algorithm developed at Google, most famously used in BERT. Unlike BPE, which always merges the most frequent pair of symbols, WordPiece selects the merge that most increases the likelihood of the training data under a language model, which leads to slightly different subword boundaries.
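This likelihood-based selection is commonly implemented by scoring each candidate pair as its joint count divided by the product of its parts' counts, so a pair can win even if it is not the most frequent. The sketch below is illustrative (the function name and toy corpus are made up, not from any real training setup):

```python
from collections import Counter

def wordpiece_merge_scores(corpus_tokens):
    """Score candidate merges WordPiece-style: count(ab) / (count(a) * count(b)).

    `corpus_tokens` is a list of words, each already split into symbols.
    Higher score = bigger likelihood gain from merging that pair.
    """
    unit_counts = Counter()   # how often each symbol appears
    pair_counts = Counter()   # how often each adjacent pair appears
    for word in corpus_tokens:
        unit_counts.update(word)
        pair_counts.update(zip(word, word[1:]))
    return {
        pair: count / (unit_counts[pair[0]] * unit_counts[pair[1]])
        for pair, count in pair_counts.items()
    }

# Toy corpus: ('n','e') and ('e','w') are the most frequent pairs (count 3),
# so frequency-based BPE would merge one of them first...
corpus = [list("low"), list("low"), list("new"), list("new"), list("new")]
scores = wordpiece_merge_scores(corpus)

# ...but WordPiece prefers ('l','o'): its parts are rare, so merging them
# yields the largest relative likelihood gain (score 2/(2*2) = 0.5).
best = max(scores, key=scores.get)
```

Note how the denominator penalizes pairs built from very common symbols, which is exactly how the likelihood criterion diverges from raw frequency.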
Key Details
WordPiece tokens that continue a word are prefixed with '##' (e.g., 'playing' becomes ['play', '##ing']). The algorithm produces a fixed vocabulary that can represent any text as a sequence of subword tokens, striking a balance between character-level and word-level representations.
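At inference time, a word is split against the fixed vocabulary by greedy longest-match-first lookup, emitting the '##' prefix on every piece after the first. A minimal sketch of that procedure (the function name and tiny vocabulary are illustrative, not taken from any particular library):

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece splitting of a single word.

    Repeatedly take the longest vocabulary entry that matches the start
    of the remaining text; pieces after the first carry a '##' prefix.
    If no piece matches at some position, the whole word maps to UNK.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:                 # shrink the window until a match
            candidate = word[start:end]
            if start > 0:                  # continuation pieces get '##'
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                  # nothing matched: bail out to UNK
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, just enough to cover the examples below.
vocab = {"play", "##play", "##ing", "##ed", "un"}
print(wordpiece_tokenize("playing", vocab))   # ['play', '##ing']
print(wordpiece_tokenize("unplayed", vocab))  # ['un', '##play', '##ed']
```

Because matching is greedy from the left, the vocabulary alone determines the split; no search or scoring happens at tokenization time.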