SentencePiece
A language-independent tokenizer that treats text as a raw stream of Unicode characters, requiring no pre-tokenization.
Overview
SentencePiece is an unsupervised text tokenizer and detokenizer developed by Google that works directly on raw text without language-specific pre-processing. It treats the input as a raw sequence of Unicode characters, including whitespace, and learns subword units using either byte-pair encoding (BPE) or a unigram language model.
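To make the BPE side of this concrete, here is a minimal sketch of the core merge loop, not SentencePiece's actual implementation: repeatedly count adjacent symbol pairs across the corpus and merge the most frequent pair into a new subword unit. The corpus, merge count, and function names are illustrative assumptions.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    # Represent each word as a tuple of symbols; count word frequencies.
    # (Illustrative sketch only -- real SentencePiece training differs.)
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = train_bpe(["low", "lower", "lowest"], num_merges=2)
# First merge fuses 'l'+'o', second fuses 'lo'+'w'.
```

The unigram alternative works differently: it starts from a large seed vocabulary and prunes pieces to maximize the corpus likelihood under a unigram model.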
Key Details
Its language independence makes it well suited to multilingual models, and it is used by many large language models, including T5, LLaMA, and Gemini. Because whitespace is kept in the input stream, SentencePiece escapes it with the metasymbol '▁' (U+2581), which makes tokenization lossless and reversible, and it can be trained on any language without word segmentation tools.
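The whitespace convention above can be sketched in a few lines of plain Python; this is an illustrative approximation of the '▁' escaping and the dummy-prefix behavior, not SentencePiece's actual code.

```python
META = "\u2581"  # the '▁' metasymbol SentencePiece uses for whitespace

def escape(text):
    # Prepend a dummy prefix (as SentencePiece does by default) and
    # replace each space with the metasymbol so no information is lost.
    return META + text.replace(" ", META)

def detokenize(pieces):
    # Detokenization is pure string concatenation plus un-escaping:
    # join the pieces, restore spaces, and drop the dummy prefix space.
    return "".join(pieces).replace(META, " ").lstrip(" ")

escaped = escape("Hello world")          # '▁Hello▁world'
restored = detokenize(["\u2581Hello", "\u2581wor", "ld"])
```

Because the escaped text still contains the original whitespace, decoding never needs language-specific rules for where to reinsert spaces.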