Basic RAG systems follow a straightforward pattern: embed a query, retrieve the nearest document chunks from a vector store, and feed them to a language model. This approach works surprisingly well for many use cases, but it hits a ceiling when queries become complex, ambiguous, or require nuanced understanding. Advanced RAG techniques push past that ceiling by improving how queries are formed, how results are ranked, and how context is selected for generation.

These techniques represent the difference between a RAG system that works on demo queries and one that performs reliably in production. Each addresses a specific failure mode in the naive retrieval pipeline, and combining them strategically can dramatically improve answer quality.

Re-ranking: The Second Pass That Changes Everything

The fundamental insight behind re-ranking is that initial retrieval and precise relevance ranking are different problems requiring different solutions. Bi-encoder models used for initial retrieval embed queries and documents independently, making them fast but limited in their ability to capture fine-grained relevance. Cross-encoder models used for re-ranking process the query and document together, enabling much more accurate relevance judgments at the cost of speed.

A typical re-ranking pipeline works in two stages. First, the bi-encoder retrieves a broad set of candidates, perhaps the top 50 or 100 chunks. Then, the cross-encoder scores each candidate against the query and reorders them. Only the top results after re-ranking are passed to the generator.
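The two stages above can be sketched as follows. This is a minimal illustration, not a specific library's API: `vector_search` and `cross_encoder_score` are hypothetical stand-ins for your bi-encoder retriever and cross-encoder model.

```python
def rerank_pipeline(query, vector_search, cross_encoder_score,
                    n_candidates=50, n_final=5):
    # Stage 1: fast bi-encoder retrieval of a broad candidate set.
    candidates = vector_search(query, top_k=n_candidates)
    # Stage 2: slower but more precise cross-encoder scoring of each
    # (query, document) pair, then reorder by score.
    scored = [(cross_encoder_score(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Only the top results after re-ranking reach the generator.
    return [doc for _, doc in scored[:n_final]]
```

In practice you would plug in a real vector store for stage 1 and a model such as a BGE reranker for stage 2; the shape of the pipeline stays the same.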

Re-ranking consistently delivers some of the largest quality gains in RAG systems, and it is often the single most impactful upgrade you can make to a basic pipeline.

Popular cross-encoder models for re-ranking include Cohere Rerank, BGE Reranker, and models from the sentence-transformers library. These models are trained specifically on relevance judgment tasks and can dramatically improve context quality even when the initial retriever is mediocre.

Implementing Re-ranking Effectively

The key parameters to tune are the number of initial candidates and the number of final results after re-ranking. Retrieving too few candidates limits the cross-encoder's ability to find relevant documents. Retrieving too many increases latency without proportional quality gains. A good starting point is retrieving 20-50 candidates and keeping the top 3-5 after re-ranking.

Query Expansion: Casting a Wider Net

Users rarely phrase their queries in ways that match how information is stored in documents. Query expansion addresses this gap by generating multiple reformulations of the original query, each capturing a different aspect or phrasing.

Multi-Query Generation

The simplest form of query expansion uses an LLM to generate multiple versions of the user's query. For example, the query "How does photosynthesis work?" might be expanded into:

  • "What is the process of photosynthesis in plants?"
  • "Light reactions and Calvin cycle mechanism"
  • "How do plants convert sunlight into energy?"
  • "Chloroplast function in energy production"

Each expanded query is used independently for retrieval, and the results are merged and deduplicated. This approach captures relevant documents that might use different terminology or frame the topic from different angles.
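The merge-and-deduplicate step can be sketched like this. `generate_variants` and `retrieve` are hypothetical stand-ins for an LLM-based query rewriter and your retriever; documents are assumed to carry stable IDs for deduplication.

```python
def multi_query_retrieve(query, generate_variants, retrieve, top_k=5):
    # Run the original query plus each LLM-generated variant,
    # merging results and dropping duplicates by document ID.
    results = []
    seen = set()
    for q in [query] + generate_variants(query):
        for doc_id, text in retrieve(q, top_k=top_k):
            if doc_id not in seen:
                seen.add(doc_id)
                results.append((doc_id, text))
    return results
```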

Step-Back Prompting

Step-back prompting generates a more abstract version of the query to retrieve broader context. Instead of searching directly for "Why did revenue drop in Q3 2024?", the system might first retrieve context for "What factors affect quarterly revenue?" and use that broader context alongside specific results.
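A minimal sketch of this pattern, assuming a hypothetical `step_back` LLM call that abstracts the query and a generic `retrieve` function:

```python
def step_back_retrieve(query, step_back, retrieve, top_k=3):
    # Retrieve broad background for the abstracted query, plus
    # specific results for the original, and pass both along.
    broad = retrieve(step_back(query), top_k=top_k)
    specific = retrieve(query, top_k=top_k)
    return broad + specific
```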

Key Takeaway

Query expansion is especially valuable when users use informal language, domain-specific jargon, or when the corpus uses terminology different from what users naturally search for.

HyDE: Hypothetical Document Embeddings

Hypothetical Document Embeddings (HyDE) is a creative technique that flips the retrieval paradigm. Instead of embedding the query and finding similar documents, HyDE first asks an LLM to generate a hypothetical answer to the query, then embeds that hypothetical document and uses it for retrieval.

The intuition is elegant: a hypothetical answer, even if imperfect, will be closer in embedding space to the actual relevant documents than a short query would be. Queries and documents live in different linguistic spaces. A query like "What causes inflation?" is very different in structure from a document paragraph explaining monetary policy, but a hypothetical answer about inflation would be structurally similar to real documents about inflation.

HyDE works best when the LLM has some general knowledge about the topic, so its hypothetical answer is at least in the right semantic neighborhood. It can struggle when the topic is highly specialized or when the LLM's training data has gaps in the relevant domain.
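The HyDE flow reduces to three steps, sketched below with hypothetical stand-ins: `generate_answer` for the LLM, `embed` for the embedding model, and `search_by_vector` for a vector-store lookup that accepts a raw vector.

```python
def hyde_retrieve(query, generate_answer, embed, search_by_vector, top_k=5):
    # Draft a hypothetical answer, embed that answer instead of the
    # raw query, and search the vector store with the richer vector.
    hypothetical = generate_answer(query)
    return search_by_vector(embed(hypothetical), top_k=top_k)
```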

Recursive Retrieval and Query Decomposition

Complex questions often require information from multiple documents. Query decomposition breaks a complex query into simpler sub-queries, retrieves context for each, and then synthesizes a final answer. For example, "Compare the economic policies of France and Germany in response to the 2024 energy crisis" would be decomposed into sub-queries about each country's policies independently, then combined.

Recursive retrieval takes this further by using initial retrieval results to inform subsequent queries. The system might first retrieve high-level overview documents, extract key concepts, and then retrieve more detailed documents about those specific concepts.
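Query decomposition can be sketched as a simple fan-out-then-synthesize loop. `decompose`, `retrieve`, and `synthesize` are hypothetical LLM and retriever calls standing in for your own components.

```python
def decompose_and_answer(query, decompose, retrieve, synthesize):
    # Break the complex query into simpler sub-queries, retrieve
    # context for each independently, then synthesize one answer
    # from the combined contexts.
    sub_queries = decompose(query)
    contexts = {q: retrieve(q) for q in sub_queries}
    return synthesize(query, contexts)
```

A recursive variant would feed concepts extracted from `contexts` back into `decompose` for another round of retrieval.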

Sentence-Window Retrieval

Standard chunking can split relevant information across chunk boundaries. Sentence-window retrieval addresses this by embedding individual sentences for precise matching but expanding the context window during retrieval to include surrounding sentences. This gives you the precision of fine-grained matching with the context of larger chunks.
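A minimal sketch of the window-expansion step, assuming a hypothetical `embed_score` function that scores a (query, sentence) pair:

```python
def sentence_window_retrieve(query, sentences, embed_score, window=1, top_k=2):
    # Score each sentence individually for precise matching.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: embed_score(query, sentences[i]),
                    reverse=True)
    # Expand each top match to include its surrounding sentences.
    windows = []
    for i in ranked[:top_k]:
        lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
        windows.append(" ".join(sentences[lo:hi]))
    return windows
```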

Fusion Retrieval

Different retrieval methods have different strengths. Hybrid search combines dense vector retrieval with sparse keyword-based retrieval like BM25. Vector search excels at semantic similarity, while BM25 catches exact keyword matches that embedding models might miss, such as product codes, acronyms, or proper nouns.

Reciprocal Rank Fusion (RRF) is a popular method for combining results from multiple retrieval methods. It assigns scores based on rank position rather than raw similarity scores, making it robust to differences in score distributions across methods.
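RRF itself is small enough to show in full. Each document's fused score is the sum of 1 / (k + rank) over every ranked list it appears in, with k = 60 as the conventional constant:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    # Score each document by summing 1 / (k + rank) across lists;
    # rank positions start at 1 within each list.
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only rank positions matter, a BM25 list and a vector-search list can be fused directly without normalizing their raw scores.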

The best RAG systems rarely rely on a single retrieval strategy. Combining multiple approaches through fusion retrieval provides robustness against the weaknesses of any individual method.

Contextual Compression

Even after retrieving the most relevant chunks, not all content in those chunks is relevant to the query. Contextual compression uses an LLM to extract or summarize only the relevant portions of each retrieved chunk before passing them to the generator. This reduces noise in the context window and allows more relevant information to fit within token limits.
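The compression step can be sketched as a filter over retrieved chunks. `extract_relevant` is a hypothetical LLM call that returns only the query-relevant portion of a chunk, or an empty string when nothing in the chunk applies.

```python
def compress_context(query, chunks, extract_relevant):
    # Compress each chunk down to its query-relevant content and
    # drop chunks with nothing relevant left.
    compressed = [extract_relevant(query, chunk) for chunk in chunks]
    return [c for c in compressed if c]
```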

Choosing the Right Techniques

Not every advanced technique is appropriate for every use case. Here is a practical guide for selecting techniques based on your challenges:

  1. Low retrieval precision? Start with re-ranking. It provides the best quality-to-effort ratio.
  2. Vocabulary mismatch? Add query expansion or hybrid search to bridge terminology gaps.
  3. Complex multi-part questions? Implement query decomposition to handle each part separately.
  4. Short, ambiguous queries? Try HyDE to generate richer representations for retrieval.
  5. Long documents with sparse relevance? Use contextual compression to extract only what matters.

Key Takeaway

Advanced RAG techniques should be applied systematically based on identified failure modes, not adopted wholesale. Profile your system's errors, identify the root causes, and apply the technique that addresses each specific weakness.

The field of advanced RAG is evolving rapidly, with new techniques emerging regularly. The key is building a modular pipeline where components can be swapped and combined as better approaches become available. Start with a solid basic pipeline, add re-ranking first, then incrementally layer additional techniques based on measured improvements.