BERT (Bidirectional Encoder Representations from Transformers) fundamentally changed how NLP models understand language. Released by Google in 2018, BERT demonstrated that bidirectional pre-training -- reading text from both left-to-right and right-to-left simultaneously -- produced dramatically better language understanding than the unidirectional approach used by GPT. While decoder-only models like GPT have since become dominant for generation tasks, BERT and its descendants remain critical for understanding tasks like search, classification, and information extraction.
The Key Insight: Bidirectional Context
Consider the word "bank" in two sentences: "I deposited money at the bank" and "I sat on the river bank." To correctly understand the meaning of "bank," a model needs context from both directions. The word "money" (before "bank") and the full sentence structure (after "bank") both contribute to disambiguation.
GPT-style models read left to right. When processing "bank," they can only see "I deposited money at the" -- they cannot peek ahead. BERT, by contrast, sees the entire sentence at once, with every token attending to every other token through bidirectional self-attention.
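The difference between the two attention patterns can be sketched as a pair of boolean masks. This is an illustrative toy (the function name and token list are invented for the example), not BERT's actual implementation:

```python
import numpy as np

def attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """Build a boolean mask where mask[i, j] is True when
    token i is allowed to attend to token j."""
    if causal:
        # GPT-style: token i sees only positions 0..i (no peeking ahead).
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # BERT-style: every token attends to every other token.
    return np.ones((seq_len, seq_len), dtype=bool)

tokens = ["I", "deposited", "money", "at", "the", "bank"]
causal = attention_mask(len(tokens), causal=True)
bidir = attention_mask(len(tokens), causal=False)

# "money" sits at position 2: under the causal mask it sees only
# 3 tokens (itself and what came before); bidirectionally it sees all 6.
print(int(causal[2].sum()))  # 3
print(int(bidir[2].sum()))   # 6
```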
GPT reads forward, predicting the next word. BERT reads everything at once, understanding the whole context. This is why BERT excels at understanding while GPT excels at generation.
Pre-training Objectives
BERT cannot use standard next-token prediction for pre-training because bidirectional attention would allow the model to "cheat" by seeing the token it is supposed to predict. Instead, BERT uses two innovative pre-training tasks:
Masked Language Modeling (MLM)
15% of input tokens are randomly selected for prediction. Of these, 80% are replaced with a special [MASK] token, 10% are replaced with a random token, and 10% are left unchanged. The model must predict the original token for each selected position.
This is like a fill-in-the-blank exercise: given "The [MASK] sat on the mat," the model must predict "cat." Because the model sees context from both sides of the masked position, it learns deep bidirectional representations.
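The 80/10/10 corruption scheme above can be sketched in a few lines. This is a simplified illustration (the function name and the toy vocabulary are invented; real BERT operates on WordPiece token IDs, not strings):

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["cat", "dog", "mat", "sat", "the", "on"]  # illustrative only

def mlm_corrupt(tokens, mask_prob=0.15, seed=0):
    """Apply BERT-style MLM corruption: select ~15% of positions; of
    those, 80% become [MASK], 10% become a random token, 10% stay
    unchanged. Returns (corrupted, labels) where labels holds the
    original token at selected positions and None elsewhere."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict this original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)          # 80%: masked
            elif r < 0.9:
                corrupted.append(rng.choice(TOY_VOCAB))  # 10%: random
            else:
                corrupted.append(tok)           # 10%: kept, still predicted
        else:
            labels.append(None)  # not selected; no loss at this position
            corrupted.append(tok)
    return corrupted, labels
```

The 10% random and 10% unchanged cases matter because [MASK] never appears at fine-tuning time; they force the model to maintain good representations for every input token, not just masked ones.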
Next Sentence Prediction (NSP)
The model receives pairs of sentences and must predict whether the second sentence follows the first in the original text. 50% of pairs are actual consecutive sentences; 50% are random pairings. This task was designed to help the model understand relationships between sentences.
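Constructing NSP training pairs from a document is straightforward; a minimal sketch (function name invented, and with the simplification that a "random" second sentence is drawn from the same small pool) might look like:

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build NSP examples from an ordered list of sentences.
    Label 1 = the pair is actually consecutive in the source text;
    label 0 = the second sentence was sampled at random."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # In real pre-training the negative comes from a different
            # document; here we just sample from the same list.
            pairs.append((sentences[i], rng.choice(sentences), 0))
    return pairs
```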
Later research showed that NSP contributed less than initially thought, and models like RoBERTa achieved better results by dropping it entirely and using longer training sequences instead.
Key Takeaway
BERT's masked language modeling objective enables bidirectional pre-training by randomly hiding tokens and asking the model to predict them using full context from both directions.
Architecture Details
BERT uses the transformer encoder architecture (no decoder). It comes in two sizes:
- BERT-base: 12 layers, 768 hidden dimensions, 12 attention heads, 110M parameters
- BERT-large: 24 layers, 1024 hidden dimensions, 16 attention heads, 340M parameters
Special tokens structure the input:
- [CLS] (classification): Added at the start; its final hidden state serves as the sequence-level representation for classification tasks
- [SEP] (separator): Separates two input sentences for tasks involving sentence pairs
- Segment embeddings distinguish tokens from the first and second sentences
BERT was trained on the BooksCorpus (800M words) and English Wikipedia (2,500M words) using WordPiece tokenization with a vocabulary of 30,000 tokens.
Fine-Tuning BERT for Downstream Tasks
BERT's pre-trained representations can be fine-tuned for specific tasks by adding a simple output layer:
- Text classification: Use the [CLS] token representation with a classification head
- Named entity recognition: Use each token's representation with a token classification head
- Question answering: Predict the start and end positions of the answer span in a passage
- Sentence similarity: Compare [CLS] representations of two sentences
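For the classification case, the "simple output layer" really is just one linear projection over the [CLS] vector. A minimal NumPy sketch (the encoder output here is random stand-in data; the shapes follow BERT-base's 768-dimensional hidden size, and the 3 labels are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, num_classes, seq_len = 768, 3, 7

# Stand-in for the encoder's final hidden states: (seq_len, hidden).
hidden_states = rng.normal(size=(seq_len, hidden))

# The classification head is a single learned linear layer; this is
# the only new component introduced at fine-tuning time.
W = rng.normal(scale=0.02, size=(hidden, num_classes))
b = np.zeros(num_classes)

cls_vector = hidden_states[0]        # position 0 is the [CLS] token
logits = cls_vector @ W + b
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over labels
print(probs.shape)  # (3,)
```

Token classification (NER) uses the same idea, but applies the linear head to every position's hidden state instead of only position 0.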
Fine-tuning typically takes only a few epochs on task-specific data, making BERT extremely practical for production applications. The pre-trained model handles the "understanding" while the thin fine-tuning layer handles the "task."
The BERT Family
BERT inspired numerous variants that addressed its limitations:
RoBERTa (2019)
Facebook's "Robustly Optimized BERT" showed that BERT was significantly undertrained. By training longer, on more data, with larger batches, and removing NSP, RoBERTa substantially outperformed the original BERT.
ALBERT (2019)
ALBERT reduced BERT's parameter count through two techniques: factorized embedding parameterization and cross-layer parameter sharing. An ALBERT model with 18x fewer parameters matched BERT-large's performance.
DistilBERT (2019)
DistilBERT used knowledge distillation to compress BERT into a model 40% smaller and 60% faster while retaining 97% of its language understanding capabilities. This made BERT practical for edge deployment.
DeBERTa (2021)
DeBERTa introduced disentangled attention that separates content and position representations, along with an enhanced mask decoder. It surpassed human performance on the SuperGLUE benchmark.
Key Takeaway
While GPT-style decoder models dominate text generation, BERT-family encoder models remain the standard for text understanding tasks like classification, search, entity recognition, and similarity matching in production systems.
BERT's Lasting Impact
BERT transformed Google Search in 2019, improving the understanding of roughly 10% of English queries. It remains deployed in search engines, recommendation systems, content moderation, and countless production NLP systems worldwide.
More broadly, BERT established the paradigm of "pre-train then fine-tune" that became the standard approach for NLP. While the specific architecture has been superseded by larger and more capable models, BERT's conceptual contributions -- bidirectional context, masked language modeling, and transfer learning for NLP -- remain foundational to the field.
