Before machines could understand language, they needed a way to represent words as numbers. Word embeddings -- dense vector representations that capture semantic meaning -- are the foundational technology that made modern NLP possible. From the early days of Word2Vec to the contextual revolution of BERT, the evolution of word representations tells the story of NLP's transformation from a rule-based discipline to a deep learning powerhouse.

Why Words Need Numbers

Computers operate on numbers, not words. The simplest way to represent words is one-hot encoding, where each word in the vocabulary gets a unique binary vector. In a vocabulary of 50,000 words, "king" might be [0, 0, ..., 1, ..., 0] -- a vector of 50,000 dimensions with a single 1. This representation has two critical problems: the vectors are enormous (high dimensionality), and they capture no semantic relationships (the distance between "king" and "queen" is the same as between "king" and "banana").
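A tiny sketch (a 3-word vocabulary instead of 50,000) makes the second problem concrete: every pair of distinct one-hot vectors is exactly the same distance apart, so the representation carries no similarity information at all.

```python
import numpy as np

VOCAB = ["king", "queen", "banana"]

def one_hot(word, vocab):
    """Return a binary vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

king, queen, banana = (one_hot(w, VOCAB) for w in VOCAB)

# Every pair of distinct one-hot vectors is sqrt(2) apart and has
# dot product 0, so "king" is no closer to "queen" than to "banana".
print(np.linalg.norm(king - queen))   # 1.4142...
print(np.linalg.norm(king - banana))  # 1.4142...
```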

Word embeddings solve both problems by mapping words to dense vectors of a few hundred dimensions, where semantically similar words are close together in vector space. The word "king" might be represented as [0.52, -0.31, 0.87, ...] -- a compact vector that encodes meaning through its position in the embedding space.
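With dense vectors, similarity becomes measurable. The standard metric is cosine similarity; the 4-dimensional vectors below are hand-picked purely for illustration (real embeddings have hundreds of learned dimensions):

```python
import numpy as np

# Illustrative toy embeddings, not learned from data.
emb = {
    "king":   np.array([0.9, 0.8, 0.1, 0.3]),
    "queen":  np.array([0.9, 0.7, 0.2, 0.9]),
    "banana": np.array([0.1, 0.0, 0.9, 0.2]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(emb["king"], emb["queen"]))   # high: related words
print(cosine(emb["king"], emb["banana"]))  # low: unrelated words
```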

"You shall know a word by the company it keeps." -- J.R. Firth, 1957. This distributional hypothesis is the philosophical foundation of all word embedding techniques.

Word2Vec: The Breakthrough

In 2013, Tomas Mikolov and colleagues at Google published Word2Vec, a method for learning word embeddings from large text corpora using shallow neural networks. Word2Vec made word embeddings practical by being fast enough to train on billions of words.

Two Architectures

  • CBOW (Continuous Bag of Words): Predicts a target word from its surrounding context words. Given the context "the cat sat on the ___," CBOW predicts "mat." Faster to train and works well for frequent words.
  • Skip-gram: The inverse of CBOW -- predicts context words from a target word. Given "mat," predict "the," "cat," "sat," "on." Works better for rare words and smaller datasets.
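The two architectures train on the same (target, context) pairs, just with the roles reversed. A minimal sketch of how skip-gram extracts those pairs from a sentence with a sliding context window:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs for skip-gram.

    CBOW uses the same pairs with the roles reversed: the context
    words jointly predict the target.
    """
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
print(skipgram_pairs(sentence, window=1))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ...]
```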

The most celebrated property of Word2Vec embeddings is their ability to capture analogical relationships through vector arithmetic. The famous example: vector("king") - vector("man") + vector("woman") lands closest to vector("queen"). This works because the offset from "man" to "king" encodes something like royalty, and the same offset separates "woman" from "queen" -- the model learns that the two relationships are parallel in vector space.
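The analogy mechanism can be shown with toy 2-D embeddings constructed by hand so that one axis encodes "royalty" and the other gender; real trained embeddings only approximate this clean structure:

```python
import numpy as np

# Hand-built 2-D embeddings: axis 0 = royalty, axis 1 = gender.
emb = {
    "king":  np.array([1.0,  1.0]),   # royal, male
    "queen": np.array([1.0, -1.0]),   # royal, female
    "man":   np.array([0.0,  1.0]),   # common, male
    "woman": np.array([0.0, -1.0]),   # common, female
}

query = emb["king"] - emb["man"] + emb["woman"]

# Nearest neighbor, excluding the three input words (as Word2Vec's
# analogy evaluation does).
candidates = {w: v for w, v in emb.items() if w not in {"king", "man", "woman"}}
answer = min(candidates, key=lambda w: np.linalg.norm(candidates[w] - query))
print(answer)  # queen
```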

Key Takeaway

Word2Vec demonstrated that simple neural network architectures trained on large text corpora could learn rich semantic representations. Its vector arithmetic property showed that embeddings capture meaningful relationships between concepts.

GloVe: Global Vectors for Word Representation

GloVe (Global Vectors), developed by Pennington, Socher, and Manning at Stanford in 2014, takes a different approach. Rather than learning from local context windows like Word2Vec, GloVe constructs a global word co-occurrence matrix and factorizes it to produce embeddings.

The key insight is that ratios of word co-occurrence probabilities encode meaning. In the paper's canonical example, a probe word like "solid" co-occurs far more often with "ice" than with "steam," while "gas" shows the reverse pattern, and neutral words like "water" co-occur with both about equally. The ratio P(probe | ice) / P(probe | steam) therefore isolates exactly the properties that distinguish the two words, and GloVe's objective function is designed so that differences between word vectors preserve these meaningful ratios.
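A small numeric sketch makes the ratio pattern visible. The co-occurrence counts below are made up for illustration (they roughly mimic the pattern reported in the GloVe paper):

```python
# Hypothetical co-occurrence counts, invented for illustration.
counts = {
    "ice":   {"solid": 190, "gas": 7,   "water": 300, "fashion": 3, "_total": 10000},
    "steam": {"solid": 4,   "gas": 150, "water": 280, "fashion": 3, "_total": 10000},
}

def p(context, word):
    """Co-occurrence probability P(context | word)."""
    return counts[word][context] / counts[word]["_total"]

for probe in ["solid", "gas", "water", "fashion"]:
    ratio = p(probe, "ice") / p(probe, "steam")
    print(f"P({probe}|ice) / P({probe}|steam) = {ratio:.2f}")

# "solid" yields a large ratio, "gas" a small one, and the neutral
# probes land near 1 -- the pattern GloVe's loss is built to preserve.
```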

In practice, GloVe and Word2Vec produce embeddings of comparable quality. GloVe has the advantage of being deterministic (same corpus always produces the same embeddings) and can be more efficient for very large corpora since the co-occurrence matrix only needs to be computed once.

FastText: Subword Embeddings

FastText, developed by Facebook in 2016, extends Word2Vec by representing each word as a bag of character n-grams. The word "where" would include the n-grams: <wh, whe, her, ere, re>, plus the full word <where>. The word's embedding is the sum of its n-gram embeddings.
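The n-gram decomposition is simple to sketch. FastText actually collects all n-grams for 3 <= n <= 6; the function below uses a single n for clarity:

```python
def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, FastText-style.

    (FastText uses all lengths from 3 to 6; this sketch uses one n.)
    The angle brackets distinguish prefixes and suffixes: "her" inside
    "where" differs from the standalone word "<her>".
    """
    padded = f"<{word}>"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams + [padded]  # the full word is kept as its own unit

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```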

This seemingly small change has profound implications. FastText can generate embeddings for out-of-vocabulary words by summing the n-grams that are in the vocabulary. It also handles morphologically rich languages better, since words with shared roots (like "teach," "teacher," "teaching") share n-gram components and thus have related embeddings.

The Contextual Revolution: ELMo and BERT

All the methods above produce static embeddings -- each word gets a single vector regardless of context. But words have different meanings in different contexts: "bank" means something different in "river bank" versus "bank account." This limitation drove the development of contextual embeddings.

ELMo (2018)

ELMo (Embeddings from Language Models) was the first widely successful contextual embedding model. It uses a bidirectional LSTM trained as a language model to produce word representations that change based on context. ELMo embeddings are computed as a weighted combination of the representations from different LSTM layers, capturing both syntactic and semantic information.
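The layer-mixing step can be sketched in a few lines. ELMo's final embedding is a softmax-weighted sum of the per-layer biLM representations, scaled by a learned scalar gamma; the layer activations and mixing logits below are random stand-ins, not real model weights:

```python
import numpy as np

rng = np.random.default_rng(0)
layers = rng.normal(size=(3, 1024))      # 3 biLM layers, 1024-dim each (stand-ins)
s = np.array([0.2, 1.5, 0.8])            # task-specific mixing logits (stand-ins)
gamma = 1.0                              # learned global scale

weights = np.exp(s) / np.exp(s).sum()    # softmax over layers
elmo_vec = gamma * (weights[:, None] * layers).sum(axis=0)

print(elmo_vec.shape)  # (1024,)
```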

BERT (2018)

BERT (Bidirectional Encoder Representations from Transformers) replaced LSTMs with the transformer architecture and introduced masked language modeling as a pre-training objective. Instead of predicting the next word, BERT masks random words and predicts them from their full bidirectional context.
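A minimal sketch of how masked-language-model inputs are constructed: select roughly 15% of positions, replace them with a [MASK] token, and keep the originals as prediction targets. (Real BERT additionally leaves 10% of selected tokens unchanged and swaps another 10% for random tokens; that refinement is omitted here.)

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Simplified BERT-style masking: hide ~mask_prob of the tokens
    and record the originals as the model's prediction targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # the model must recover this
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

tokens = "the bank approved the loan for the river bank".split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

Because the model sees context on both sides of each [MASK], it learns bidirectional representations -- unlike a left-to-right language model that only conditions on preceding words.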

BERT's contextual embeddings are dramatically more powerful than static embeddings. The same word "bank" receives different vector representations depending on its surrounding context, enabling much better performance on downstream NLP tasks. BERT's success spawned a family of models -- RoBERTa, ALBERT, DeBERTa, XLNet -- each refining the approach further.

"BERT didn't just improve word embeddings; it changed the entire paradigm. Pre-training on vast text corpora and fine-tuning on specific tasks became the standard recipe for NLP."

Choosing the Right Embedding for Your Task

Despite the advances in contextual embeddings, static embeddings still have their place. Here is a practical guide:

  • Word2Vec / GloVe: Best for lightweight applications, word similarity tasks, and when computational resources are limited. Pre-trained vectors are freely available and easy to use.
  • FastText: Ideal when handling morphologically complex languages, domain-specific vocabulary, or when out-of-vocabulary words are common.
  • BERT and contextual models: The clear choice for tasks where word sense disambiguation matters, such as question answering, NER, and sentiment analysis. Requires more computation but delivers superior results.
  • Sentence-level models: For tasks like semantic search and text similarity, sentence-level embeddings from models like Sentence-BERT are more appropriate than word-level embeddings.

Key Takeaway

The evolution from Word2Vec to BERT represents a shift from static, context-free word representations to dynamic, context-aware embeddings. Each generation built on the insights of the previous one, and understanding this progression is essential for any NLP practitioner.