When faced with thousands or millions of documents, how do you discover what they are about without reading each one? Topic modeling answers this question by automatically identifying the underlying themes -- or topics -- that run through a collection of documents. From academic research to customer feedback analysis, topic modeling is one of the most widely used unsupervised NLP techniques for making sense of large text corpora.
What Is Topic Modeling?
Topic modeling is an unsupervised machine learning technique that discovers abstract "topics" within a collection of documents. Each topic is represented as a distribution over words, and each document is represented as a distribution over topics. For example, a news corpus might contain topics like "politics" (characterized by words like election, vote, candidate), "technology" (startup, AI, innovation), and "sports" (game, team, championship).
Unlike text classification, which requires labeled data and predefined categories, topic modeling discovers categories organically from the data itself. This makes it particularly valuable for exploratory analysis, where you do not know in advance what themes exist in your corpus.
"Topic modeling reveals the hidden thematic structure of a document collection -- the conversations happening across your data that you didn't know to look for."
Latent Dirichlet Allocation (LDA)
LDA, introduced by Blei, Ng, and Jordan in 2003, remains the most widely known topic modeling algorithm. It is a generative probabilistic model that assumes each document is a mixture of topics, and each topic is a mixture of words.
How LDA Works
LDA assumes the following generative process for each document: first, choose a distribution over topics (e.g., 60% politics, 30% economics, 10% sports). Then, for each word in the document, choose a topic according to that distribution, and then choose a word from that topic's word distribution. The algorithm works backward from observed words to infer the latent topic structure.
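The generative story above can be written compactly. Here \(\theta_d\) is document \(d\)'s topic mixture, \(\phi_k\) is topic \(k\)'s word distribution, and \(\alpha, \beta\) are the Dirichlet priors discussed below:

```latex
\begin{align*}
\phi_k   &\sim \mathrm{Dirichlet}(\beta)            && \text{word distribution for topic } k \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha)           && \text{topic mixture for document } d \\
z_{d,n}  &\sim \mathrm{Categorical}(\theta_d)       && \text{topic of the $n$-th word in } d \\
w_{d,n}  &\sim \mathrm{Categorical}(\phi_{z_{d,n}}) && \text{the observed word}
\end{align*}
```

Inference runs this story in reverse: given only the observed words \(w_{d,n}\), it estimates the latent \(\theta_d\), \(\phi_k\), and \(z_{d,n}\).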
Tuning LDA
- Number of topics (K): The most critical hyperparameter. Too few topics produce overly broad themes; too many produce fragmented, overlapping topics. Coherence scores help identify the optimal K.
- Alpha and Beta: Alpha controls document-topic density (lower alpha means documents have fewer topics). Beta controls topic-word density (lower beta means topics have fewer characteristic words).
- Preprocessing: LDA relies heavily on bag-of-words representation, making preprocessing (stopword removal, lemmatization) crucial for quality results.
Key Takeaway
LDA is a solid starting point for topic modeling, especially when interpretability matters. However, its bag-of-words assumption means it misses word order and semantic nuance -- limitations that newer methods address.
Non-Negative Matrix Factorization (NMF)
NMF is an alternative to LDA that decomposes the document-term matrix into two non-negative matrices: one mapping documents to topics and another mapping topics to words. Unlike LDA, NMF is not a probabilistic model, but it often produces more focused, coherent topics.
NMF works well with TF-IDF representations and is computationally efficient. It tends to produce topics that are more distinct from one another than LDA's, making it a strong choice when topic separation is important. The trade-off is the lack of a probabilistic framework, which limits its use in scenarios where uncertainty quantification matters.
BERTopic: Topic Modeling Meets Transformers
BERTopic, developed by Maarten Grootendorst, represents a paradigm shift in topic modeling by leveraging transformer-based embeddings. Instead of relying on bag-of-words representations, BERTopic uses sentence embeddings to capture semantic meaning.
The BERTopic Pipeline
- Document embedding: Each document is encoded into a dense vector using a pre-trained sentence transformer (like all-MiniLM-L6-v2).
- Dimensionality reduction: UMAP reduces the high-dimensional embeddings to a lower-dimensional space while preserving local and global structure.
- Clustering: HDBSCAN clusters the reduced embeddings into groups of semantically similar documents.
- Topic representation: c-TF-IDF (class-based TF-IDF) extracts the most representative words for each cluster, creating interpretable topic labels.
BERTopic's advantages include its ability to capture semantic similarity (documents about "automobiles" and "cars" end up in the same topic), dynamic topic modeling (tracking topics over time), and hierarchical topic structures. It also handles short texts much better than LDA, making it suitable for social media and review analysis.
"BERTopic bridges the gap between the interpretability of traditional topic models and the semantic understanding of modern language models."
Top2Vec and Other Modern Approaches
Top2Vec is another embedding-based topic modeling approach that jointly learns document and word embeddings, then uses density-based clustering to find topics. Unlike BERTopic, Top2Vec automatically determines the number of topics and provides word vectors that can be used for topic exploration.
Other notable modern approaches include:
- CTM (Contextualized Topic Models): Combines neural topic models with pre-trained language model embeddings for improved topic quality.
- ETM (Embedded Topic Model): Uses word embeddings within the LDA framework, combining the probabilistic rigor of LDA with the semantic power of embeddings.
- LLM-based topic modeling: Using GPT-4 or similar models to generate topic labels and descriptions, often as a post-processing step on top of clustering results.
Practical Tips and Applications
Choosing the right topic modeling approach depends on your specific use case. For traditional document analysis with well-defined corpora, LDA remains a strong baseline. For semantic analysis, short texts, or when you need dynamic topic tracking, BERTopic is often the best choice.
Common applications of topic modeling include:
- Customer feedback analysis: Discovering common themes in product reviews and support tickets
- Research literature surveys: Mapping the intellectual landscape of a research field
- Content recommendation: Matching users to relevant content based on topical interests
- Social media monitoring: Tracking emerging topics and trends in real time
- Legal discovery: Organizing large document collections by theme during litigation
Key Takeaway
Topic modeling has evolved from bag-of-words approaches like LDA to semantically aware methods like BERTopic. The choice depends on your data characteristics, computational resources, and whether you need probabilistic interpretation or semantic accuracy.
