BERT
Bidirectional Encoder Representations from Transformers -- a landmark language model from Google (2018) that learns deep bidirectional representations of text.
The Breakthrough
Before BERT, most language models processed text in a single direction (left-to-right or right-to-left). BERT reads in both directions simultaneously by training with masked language modeling: roughly 15% of input tokens are hidden, and the model must predict them from the surrounding context on both sides. A second pretraining objective, next-sentence prediction, teaches the model relationships between sentence pairs.
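The masking step can be sketched in a few lines of plain Python. This is a toy illustration, not BERT's actual preprocessing: the real procedure replaces 80% of selected tokens with [MASK], 10% with a random token, and leaves 10% unchanged, and operates on WordPiece subwords rather than whole words. The function name and 15% default are chosen here for illustration.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens, recording the originals as labels.

    Returns (masked_tokens, labels), where labels maps each masked
    position back to the original token the model must predict.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok       # remember the hidden token
            masked[i] = MASK      # hide it from the model
    return masked, labels

sentence = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3)
print(masked)
print(labels)
```

During pretraining, the model sees only the masked sequence and is scored on how well it recovers the tokens stored in the labels.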
Architecture
BERT is an encoder-only transformer. BERT-Base has 110M parameters (12 layers, hidden size 768, 12 attention heads); BERT-Large has 340M parameters (24 layers, hidden size 1024, 16 heads). Both produce contextual embeddings: the same word gets a different representation depending on its context, so "bank" in "river bank" and "bank account" maps to different vectors.
Legacy
BERT transformed NLP by establishing the pretrain-then-fine-tune paradigm. Variants include RoBERTa (optimized training), DistilBERT (smaller), and domain-specific versions like BioBERT, SciBERT, and LegalBERT.