What is a Word Embedding?

Computers do not understand words. They understand numbers. If you want a machine learning model to process text -- to understand a sentence, translate a document, or answer a question -- you first need to convert words into a numerical format that the model can work with. A word embedding is exactly that: a way of representing a word as a list of numbers (a vector) such that the numbers capture the word's meaning, context, and relationships with other words.

What makes word embeddings remarkable is not just that they turn words into numbers, but that they do so in a way that preserves meaning. Words with similar meanings end up with similar vectors, close together in a mathematical space. The word "happy" and the word "joyful" will have nearly identical embeddings, even though the words themselves look nothing alike as strings. This ability to encode semantic meaning as mathematical relationships is one of the foundational breakthroughs that made modern NLP possible, from search engines to chatbots to translation systems.

From Words to Vectors

Before embeddings, the standard way to represent words was called one-hot encoding. Each word in a vocabulary was represented as a sparse vector with a single 1 and all other positions set to 0. If your vocabulary had 50,000 words, each word was a vector of 50,000 dimensions with exactly one non-zero element. The word "cat" might be [0, 0, 1, 0, 0, ...] and the word "dog" might be [0, 0, 0, 1, 0, ...]. This representation has a critical problem: every word is equally distant from every other word. "Cat" and "dog" are just as different as "cat" and "skyscraper." There is no notion of similarity.
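The equal-distance problem is easy to demonstrate. The sketch below uses a tiny made-up vocabulary (real vocabularies have tens of thousands of entries) and shows that every pair of distinct one-hot vectors is exactly the same distance apart:

```python
import math

# Toy vocabulary, invented for illustration.
vocab = ["the", "cat", "dog", "skyscraper"]

def one_hot(word):
    """Sparse vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Every pair of distinct words is exactly sqrt(2) apart -- no notion of similarity.
print(euclidean(one_hot("cat"), one_hot("dog")))         # 1.4142135623730951
print(euclidean(one_hot("cat"), one_hot("skyscraper")))  # 1.4142135623730951
```

Because each vector has a single 1 in a unique position, the distance between any two words is always the square root of 2, regardless of how related the words actually are.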

Word embeddings solve this by compressing words into dense, low-dimensional vectors -- typically 100 to 300 dimensions. Instead of a sparse 50,000-dimensional vector, each word is represented by, say, 300 real-valued numbers. These numbers are not handcrafted; they are learned from data by training a model to predict words from their context or co-occurrence patterns. The result is a compact representation where each dimension captures some aspect of a word's meaning.

No single dimension has a clear, human-interpretable meaning. You cannot point to dimension 47 and say "this encodes whether the word is an animal." Instead, meaning emerges from the pattern across all dimensions together. This is called a distributed representation, and it is what gives embeddings their power. Because meaning is spread across many dimensions, the model can encode nuanced semantic distinctions that simpler representations cannot express.

The process of learning these vectors happens automatically during training. The model reads enormous amounts of text and adjusts each word's vector so that words appearing in similar contexts end up with similar vectors. The linguistic insight behind this -- called the distributional hypothesis -- states that "you shall know a word by the company it keeps." Words that share similar neighbors in sentences tend to have similar meanings, and embeddings capture this beautifully.

Semantic Similarity in Embedding Space

Once words are embedded as vectors, you can measure how similar two words are by calculating the distance or angle between their vectors. The most common metric is cosine similarity, which measures the angle between two vectors regardless of their magnitude. A cosine similarity of 1 means the vectors point in exactly the same direction (very similar words), 0 means the vectors are orthogonal (the words are typically unrelated), and -1 means they point in opposite directions.
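Cosine similarity is just the dot product divided by the product of the vector lengths. The sketch below implements it and applies it to toy 4-dimensional vectors invented for illustration; real embeddings have hundreds of dimensions and are learned from data, so the actual numbers would differ:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, hand-picked so that "happy" and "joyful"
# point in nearly the same direction while "toaster" does not.
happy   = [0.9, 0.1, 0.3, 0.7]
joyful  = [0.8, 0.2, 0.3, 0.6]
toaster = [0.0, 0.9, -0.5, 0.1]

print(cosine_similarity(happy, joyful))   # close to 1: very similar
print(cosine_similarity(happy, toaster))  # close to 0: unrelated
```

Note that dividing by the norms makes the metric ignore vector magnitude, which is why cosine similarity is usually preferred over raw Euclidean distance for comparing embeddings.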

In a well-trained embedding space, you will find that "king" and "queen" are close together because they share semantic properties (royalty, authority, personhood). "Paris" and "France" are close because they share a geographic relationship. "Running" and "jogging" are close because they describe similar activities. Meanwhile, "king" and "toaster" are far apart because they share almost no semantic overlap.

But the most famous property of word embeddings is their ability to capture analogies through arithmetic. The vector relationship "king - man + woman" produces a vector that is closest to "queen." This works because the model has learned that the difference between "king" and "man" (roughly, the concept of "royalty") is the same as the difference between "queen" and "woman." You can discover these relationships purely through vector subtraction and addition, without any explicit programming.

These analogies extend across many domains. "Paris - France + Japan" yields "Tokyo." "Walking - walked + swam" yields "swimming." The embedding space captures grammatical relationships (tense, plurality), geographic relationships (capital-country), and semantic relationships (synonym, antonym) all within the same vector space. This was a revelation when first demonstrated, because it showed that neural networks were learning structured, human-like representations of language without being explicitly taught grammar or geography.
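The analogy trick can be sketched in a few lines. The vectors below are toy 2-D embeddings invented so the arithmetic works out exactly (the two axes can be read roughly as "royalty" and "femaleness"); in real, learned embeddings the analogy holds only approximately, and the answer is found by a nearest-neighbor search over the whole vocabulary:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical 2-D embeddings, constructed for illustration only.
vectors = {
    "king":  [0.9, 0.1],
    "queen": [0.9, 0.9],
    "man":   [0.1, 0.1],
    "woman": [0.1, 0.9],
    "apple": [0.0, 0.5],
}

# king - man + woman, component by component
query = [k - m + w for k, m, w in
         zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest neighbor by cosine similarity, excluding the query words themselves
candidates = {w: v for w, v in vectors.items() if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(best)  # queen
```

Excluding the three input words from the search is standard practice, since the raw result vector is usually closest to one of them.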

Measuring Similarity

Cosine similarity between "happy" and "joyful" might be 0.89 (very similar). Between "happy" and "sad" it might be 0.35 (related but different). Between "happy" and "refrigerator" it might be 0.05 (unrelated). These numbers are what allow search engines, recommendation systems, and question-answering systems to rank candidates by meaning rather than by exact keyword overlap.

Word2Vec and GloVe

Two methods revolutionized word embeddings and made them a mainstream tool in NLP. Word2Vec, introduced by Tomas Mikolov and colleagues at Google in 2013, trains a shallow neural network to predict either a word from its surrounding context (the CBOW architecture) or the surrounding context from a word (the Skip-gram architecture). By training on billions of words, the network's hidden layer weights become the word embeddings. Word2Vec demonstrated that simple, scalable training on large text corpora could produce embeddings that captured rich semantic relationships.

The Skip-gram variant is particularly elegant. Given the word "cat," it tries to predict words like "fluffy," "meow," "pet," and "kitten" that often appear nearby. Through this prediction task, "cat" and "dog" end up with similar embeddings because they both predict many of the same context words ("pet," "food," "vet," "cute"). The model was never told that cats and dogs are related; it discovered this purely from co-occurrence patterns.
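The training data for Skip-gram is just (center word, context word) pairs extracted with a sliding window. The sketch below shows that extraction step only, not the neural network that is trained on the pairs; the window size of 2 is a typical but arbitrary choice:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs, as used by Word2Vec Skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        # Every word within `window` positions of the center is a context word.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the fluffy cat chased the dog".split()
for center, context in skipgram_pairs(sentence, window=2):
    print(center, "->", context)
# e.g. cat -> the, cat -> fluffy, cat -> chased, cat -> the, ...
```

Words that generate overlapping sets of context pairs ("cat" and "dog" both co-occurring with "pet", "vet", "food") are pushed toward similar embeddings by the shared prediction targets.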

GloVe (Global Vectors), developed at Stanford in 2014, takes a different approach. Instead of learning from local context windows like Word2Vec, GloVe constructs a global co-occurrence matrix that counts how often each pair of words appears near each other across the entire corpus. It then learns embeddings by factorizing this matrix, optimizing so that the dot product of two word vectors approximates the logarithm of their co-occurrence count. GloVe combines the advantages of global statistical methods with the representation quality of neural approaches.
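GloVe's objective can be written down for a single word pair: a weighted squared error between the dot product of the two word vectors (plus bias terms) and the log of their co-occurrence count. The sketch below uses the weighting function from the GloVe paper (with its standard defaults x_max = 100, alpha = 0.75) and made-up toy values for the vectors and counts; training adjusts the vectors and biases to drive this loss toward zero across all pairs:

```python
import math

def glove_weight(x, x_max=100, alpha=0.75):
    """GloVe weighting function f(X_ij): down-weights rare co-occurrences,
    caps the influence of very frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """One term of the GloVe objective:
    f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2"""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return glove_weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

# Hypothetical vectors, biases, and co-occurrence count for illustration.
w_ice, w_steam = [0.5, 1.2], [0.4, 1.0]
loss = glove_pair_loss(w_ice, w_steam, b_i=0.1, b_j=0.2, x_ij=50)
print(loss)
```

Summing this term over all co-occurring word pairs gives the full GloVe objective; minimizing it makes each dot product approximate the log co-occurrence count, exactly as described above.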

Both Word2Vec and GloVe produce static embeddings -- each word gets one fixed vector regardless of context. The word "bank" has the same embedding whether it refers to a river bank or a financial bank. This limitation was addressed by contextual embeddings in models like ELMo (2018) and later BERT and GPT, where each word's representation changes depending on the surrounding sentence. Modern language models produce embeddings that are dynamically generated for each word in each context, capturing polysemy and nuance far beyond what static embeddings can achieve.

Despite the rise of contextual embeddings, Word2Vec and GloVe remain widely used for tasks where simplicity, speed, and interpretability matter. They are easy to train, fast to look up, and the resulting vectors are compact and well-understood.

Key Takeaway

Word embeddings are the bridge between human language and machine computation. They convert words from opaque symbols into rich numerical representations where meaning is encoded as geometry -- similar words are nearby, analogies are vector arithmetic, and relationships are directions in space. This transformation is what allows AI models to process, compare, and generate language with remarkable fluency.

From the early days of Word2Vec and GloVe to the contextual embeddings produced by modern transformers, the core idea remains the same: represent words as dense vectors learned from data, and let the patterns in language reveal themselves through the structure of the resulting vector space. Every time you use a search engine, a translation tool, or a chatbot, word embeddings are working beneath the surface, converting your words into numbers and finding meaning in the mathematics.
