Information Retrieval
The field of finding relevant documents or passages from a large collection in response to a user query, fundamental to search engines and RAG systems.
Traditional Methods
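TF-IDF: Scores a document by how often the query terms occur in it (term frequency), weighted by how rare each term is across the collection (inverse document frequency). BM25: A ranking function that refines TF-IDF with term-frequency saturation and document-length normalization; it is the default similarity in Elasticsearch. Both are fast and effective for exact term matching, but they miss synonyms and paraphrases that share no terms with the query.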
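To make the scoring concrete, here is a minimal sketch of BM25 over pre-tokenized documents. The function name bm25_scores, the toy corpus, and the parameter values k1=1.5 and b=0.75 are assumptions chosen for illustration; the IDF uses the always-non-negative variant, and real engines add tokenization, stemming, and an inverted index on top of this.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # only exact term matches contribute
            # IDF variant that never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation (k1) and length normalization (b).
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus, made up for illustration.
docs = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the park".split(),
    "the quick brown fox".split(),
]
scores = bm25_scores("cat mat".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Note that the second document scores zero even though it mentions "cats": without stemming, "cat" and "cats" are different terms, which is the exact-match behavior described above.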
Neural Retrieval
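Dense retrieval: Encode queries and documents as vectors with a neural encoder, then return the documents whose vectors are the nearest neighbors of the query vector. Cross-encoders: Score each query-document pair jointly in a single model pass, which is more accurate but too slow to run over a whole collection. Hybrid: Combine BM25 with dense retrieval, for example by merging their ranked lists.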
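The sketch below illustrates dense retrieval as a brute-force cosine-similarity search and one possible way to build a hybrid ranking. The embeddings are hypothetical 3-dimensional vectors rather than the output of a trained encoder, the keyword ranking is made up, and reciprocal rank fusion (RRF) is an assumed choice of fusion method, not one named in the text above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dense_top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query (brute force)."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

def rrf(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) over every list a doc appears in."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical embeddings; a real system would compute these with an encoder model.
doc_vecs = {"d1": [0.9, 0.1, 0.0], "d2": [0.1, 0.9, 0.0], "d3": [0.7, 0.3, 0.1]}
query_vec = [1.0, 0.0, 0.0]

dense_ranking = dense_top_k(query_vec, doc_vecs, k=3)
bm25_ranking = ["d3", "d2", "d1"]  # made-up keyword ranking for illustration
hybrid_ranking = rrf([bm25_ranking, dense_ranking])
```

Fusing ranks rather than raw scores sidesteps the fact that BM25 scores and cosine similarities live on different scales. At scale, the brute-force search would normally be replaced by an approximate nearest-neighbor index.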
Modern Pipeline
Retrieve candidates with fast methods (BM25 + dense retrieval), rerank the top-k candidates with a cross-encoder, then pass the highest-ranked passages to an LLM for answer generation. This retrieve-rerank-generate pipeline is the backbone of production RAG systems.
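The three stages can be sketched as a single function that takes each component as a callable. Everything here is a stand-in chosen for illustration: rag_answer, keyword_retrieve, placeholder_dense_retrieve, overlap_score, and template_generate are hypothetical names, the reranker is a word-overlap counter rather than a cross-encoder, and the generator is a string template rather than an LLM.

```python
def rag_answer(query, corpus, retrievers, rerank_score, generate,
               n_candidates=20, top_k=3):
    """Retrieve candidates, rerank them, then generate from the best passages."""
    # Stage 1: union of candidates from each fast retriever, in first-seen order.
    candidates = []
    for retrieve in retrievers:
        for doc_id in retrieve(query, corpus, n_candidates):
            if doc_id not in candidates:
                candidates.append(doc_id)
    # Stage 2: rerank the candidates with the slower, more accurate scorer.
    reranked = sorted(
        candidates, key=lambda d: rerank_score(query, corpus[d]), reverse=True
    )
    top_ids = reranked[:top_k]
    # Stage 3: hand the best passages to the generator.
    answer = generate(query, [corpus[d] for d in top_ids])
    return answer, top_ids

def keyword_retrieve(query, corpus, n):
    """Stand-in for BM25: any document sharing at least one query token."""
    q = set(query.lower().split())
    return [d for d, text in corpus.items() if q & set(text.lower().split())][:n]

def placeholder_dense_retrieve(query, corpus, n):
    """Stand-in for vector search: returns a fixed id."""
    return ["b"][:n]

def overlap_score(query, passage):
    """Stand-in for a cross-encoder: count of shared tokens."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def template_generate(query, passages):
    """Stand-in for an LLM call: formats the prompt it would receive."""
    return f"Q: {query} | context: {' / '.join(passages)}"

corpus = {
    "a": "bm25 ranks documents by keyword overlap",
    "b": "dense retrieval embeds queries and documents as vectors",
    "c": "cross-encoders rerank candidate documents",
}
answer, top_ids = rag_answer(
    "how does dense retrieval embed documents",
    corpus,
    retrievers=[keyword_retrieve, placeholder_dense_retrieve],
    rerank_score=overlap_score,
    generate=template_generate,
    top_k=2,
)
```

Passing the components in as callables keeps the control flow visible and lets each stage be swapped for a real retriever, reranker, or model client without touching the pipeline itself.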