SentencePiece
A language-independent tokenizer that treats text as a raw stream of Unicode characters, requiring no pre-tokenization.
Overview
SentencePiece is an unsupervised text tokenizer and detokenizer developed by Google that works directly on raw text without language-specific pre-processing. It treats the input as a raw sequence of Unicode characters, including whitespace, and learns subword units using either byte-pair encoding (BPE) or a unigram language model.
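To make the BPE side of this concrete, here is a minimal sketch of the core merge loop, not SentencePiece's actual implementation: repeatedly count adjacent symbol pairs across the corpus and merge the most frequent pair into a new subword unit. The corpus, merge count, and function names are illustrative assumptions.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    # Represent each word as a tuple of symbols; count word frequencies.
    # (Illustrative sketch only -- real SentencePiece training differs.)
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = train_bpe(["low", "lower", "lowest"], num_merges=2)
# First merge fuses 'l'+'o', second fuses 'lo'+'w'.
```

The unigram alternative works differently: it starts from a large seed vocabulary and prunes pieces to maximize the corpus likelihood under a unigram model.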
Key Details
Its language independence makes it well suited to multilingual models, and it is used by many large language models, including T5, LLaMA, and Gemini. Because whitespace is kept in the input stream, SentencePiece escapes it with the metasymbol '▁' (U+2581), which makes tokenization lossless and reversible, and it can be trained on any language without word segmentation tools.
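The whitespace convention above can be sketched in a few lines of plain Python; this is an illustrative approximation of the '▁' escaping and the dummy-prefix behavior, not SentencePiece's actual code.

```python
META = "\u2581"  # the '▁' metasymbol SentencePiece uses for whitespace

def escape(text):
    # Prepend a dummy prefix (as SentencePiece does by default) and
    # replace each space with the metasymbol so no information is lost.
    return META + text.replace(" ", META)

def detokenize(pieces):
    # Detokenization is pure string concatenation plus un-escaping:
    # join the pieces, restore spaces, and drop the dummy prefix space.
    return "".join(pieces).replace(META, " ").lstrip(" ")

escaped = escape("Hello world")          # '▁Hello▁world'
restored = detokenize(["\u2581Hello", "\u2581wor", "ld"])
```

Because the escaped text still contains the original whitespace, decoding never needs language-specific rules for where to reinsert spaces.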