Retrieval-augmented generation (RAG) represents a leap forward in natural language processing. Well-crafted RAG systems deliver meaningful business value in a user-friendly form factor. However, these systems contain multiple complex components, and failure modes in any of them can undermine a system’s value and cause significant headaches.

In this article, we will explore some of the most common pitfalls encountered in RAG pipelines and provide actionable solutions to address them.

What is retrieval-augmented generation (RAG)?

RAG systems combine the strengths of reliable source documents with the generative capability of large language models (LLMs).

The simplest RAG system consists of a vector database, an LLM, a user interface, and an orchestrator such as LlamaIndex or LangChain. After a user enters their query, the system retrieves relevant documents or document chunks from the vector database and adds them to the initial request as context. This final prompt gives the LLM more context with which to answer the user’s question.
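To make that flow concrete, here is a minimal sketch of the request path in Python. The `embed`, `vector_store.search`, and `llm.complete` calls are hypothetical placeholders for whatever embedding model, vector database client, and LLM API your stack uses; orchestrators such as LlamaIndex and LangChain wire these steps together for you.

```python
# Minimal sketch of a RAG request flow. embed(), vector_store.search(),
# and llm.complete() are placeholders for your own embedding model,
# vector database client, and LLM API.

PROMPT_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}
Answer:"""

def answer(question: str, vector_store, llm, embed, top_k: int = 4) -> str:
    # 1. Embed the user's query and retrieve the most similar chunks.
    query_vector = embed(question)
    chunks = vector_store.search(query_vector, top_k=top_k)

    # 2. Assemble the retrieved chunks into the prompt as context.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)

    # 3. Ask the LLM to answer using the enriched prompt.
    return llm.complete(prompt)
```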

Learn more about retrieval-augmented generation in our guide.

Prompt engineering: crafting effective prompt templates

Poorly designed prompt templates can lead to off-target responses or outputs that lack the desired specificity. This is akin to giving a colleague or subordinate incomplete or poorly formed instructions; if they don’t understand the task, they can’t complete it.

Solving challenges with prompt templates

Begin by clearly defining the prompt’s objective and the desired characteristics of the output. Experiment with different prompt structures, starting with simple instructions and iteratively incorporating more complex directives as needed. Consider using prompt engineering techniques such as few-shot learning, where relevant examples are included to guide the model’s response.
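As an illustration, a few-shot template pairs explicit instructions with a couple of worked examples that show the model the desired output format. The support-ticket task below is purely hypothetical, chosen only to make the structure visible:

```python
# A hypothetical few-shot prompt template: explicit instructions plus two
# worked examples that demonstrate the desired output format.
FEW_SHOT_TEMPLATE = """You are a support assistant. Classify the ticket as
"billing", "technical", or "other", then answer in one sentence.

Example ticket: "I was charged twice this month."
Category: billing
Answer: We'll review the duplicate charge and refund the extra payment.

Example ticket: "The app crashes when I upload a file."
Category: technical
Answer: Please update to the latest version and try the upload again.

Ticket: "{ticket}"
Category:"""

prompt = FEW_SHOT_TEMPLATE.format(ticket="How do I export my data?")
```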

You may also need to add dynamic variables to your prompt template. For example, an LLM does not know the current date. If any intended task requires this knowledge, you should include a step in your pipeline that injects the current date and time into the prompt.
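A sketch of what that injection step might look like, assuming a simple string template:

```python
from datetime import datetime, timezone

TEMPLATE = """Today's date is {today}.

Using the context below, answer the user's question.

Context:
{context}

Question: {question}"""

def build_prompt(context: str, question: str) -> str:
    # Inject the current date at request time, since the LLM has no clock.
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    return TEMPLATE.format(today=today, context=context, question=question)
```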

Whatever approach you take, iterate. Adjust your prompt, re-run it on a consistent set of representative queries, and inspect the results. You should soon find a working template.
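One lightweight way to keep that iteration honest is a tiny harness that re-runs the same fixed query set after every template change. Here, `answer_fn` stands in for your pipeline’s entry point and the queries are purely hypothetical:

```python
# Hypothetical regression harness: re-run a fixed query set after each
# prompt-template change and inspect (or score) the outputs side by side.
REPRESENTATIVE_QUERIES = [
    "What is our refund policy for annual plans?",
    "When is the next payment due on the vendor contract?",
    "Summarize the termination clause in the services agreement.",
]

def run_prompt_regression(answer_fn):
    for query in REPRESENTATIVE_QUERIES:
        print(f"Q: {query}\nA: {answer_fn(query)}\n{'-' * 40}")
```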

Data coverage and quality

When data scientists build RAG systems, they do so with the intent of enriching users’ requests with high-quality, relevant context. Users assume the system is drawing on a comprehensive set of high-quality documents, but that’s not always the case.

A RAG system that lacks complete coverage could leave the LLM with no context to draw on, increasing the likelihood that it generates a “hallucination.” Additionally, a system that includes poorly curated documents could give the end user misleading, incorrect, or outdated information.

Solving challenges with data coverage and quality

If your RAG system’s responses seem unanchored in the right underlying information, check the context included in the prompts. If the prompts lack appropriate context, thoroughly search your vector database and then fill in any gaps you find.

Teams managing RAG applications should regularly audit and expand the system’s dataset to ensure comprehensive coverage. They should also check older documents and ensure that they remain accurate; organizational policies and product offerings will change over time. Automated tools can help identify gaps or biases in the data, enabling proactive adjustments.
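Part of that audit can be automated by running a set of expected user questions against the vector database and flagging queries whose best match is only weakly similar. As before, `embed` and `vector_store.search` are placeholders for your own stack, and the 0.7 threshold is just an illustrative starting point:

```python
# Flag probable coverage gaps: queries whose best-matching chunk is only
# weakly similar likely have no good supporting document in the store.
def find_coverage_gaps(expected_queries, vector_store, embed, threshold=0.7):
    gaps = []
    for query in expected_queries:
        hits = vector_store.search(embed(query), top_k=1)
        if not hits or hits[0].score < threshold:
            gaps.append(query)
    return gaps
```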

Document chunking: striking the balance

Ineffective document chunking can lead to information loss, noisy context, or irrelevant retrievals, hampering the performance of RAG pipelines. Chunking documents into sections that are too large may result in the system overlooking pertinent information. Overly granular chunks can separate pertinent information from important surrounding context.

If the chunks included in your RAG prompts are too long, too short, or cut off in the middle of vital information, you may have a chunking problem.

Improper document chunking is a common RAG failure mode. With a chunk size of 512 and an overlap of 0.2, a single section is spread across multiple chunks.
Snorkel AI's dynamic chunker overcomes this failure mode: the SnorkelDynamicChunker automatically chunks documents by section.

Improving RAG outcomes with better document chunking

While off-the-shelf RAG orchestration tools typically chunk documents according to a set number of tokens by default, they include other chunking options. Experimenting with different token windows and overlaps may yield usable results. If they don’t, switching to paragraph, page, or semantic segmentation might.
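As a rough sketch of the two approaches, the functions below implement a fixed-size word window with fractional overlap and a simple paragraph-based split. Orchestration frameworks ship tuned versions of both (including token-based and semantic splitters), so treat these as illustrations rather than drop-in replacements:

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: float = 0.2):
    """Naive fixed-size chunking over whitespace tokens with fractional overlap."""
    words = text.split()
    step = max(1, int(chunk_size * (1 - overlap)))
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

def paragraph_chunks(text: str, max_words: int = 512):
    """Structure-aware chunking: keep paragraphs intact, merging short ones."""
    chunks, current = [], []
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        if current and len(" ".join(current + [para]).split()) > max_words:
            chunks.append("\n\n".join(current))
            current = []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```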

In Snorkel Flow, we use a proprietary algorithm that leverages semantic similarity and document structure to approximate how a human might separate document sections. This yields chunks of different sizes that typically include enough information to feel self-contained. In practice, we have found that this algorithm yields production-grade results.

Embedding models: capturing semantics accurately

Challenges with embedding models initially appear similar to challenges with document coverage; the prompts lack the appropriate context.

If you have this problem and verify that the appropriate information exists in your vector database, the challenge likely springs from your embedding model. Generalist embedding models usually won’t capture the semantic nuances of domain-specific data, causing the system to struggle to prioritize chunks for retrieval at inference time. Some retrieved chunks may be relevant. Others may not be. An off-the-shelf embedding model likely can’t tell the difference.

Improving retrieval through custom embedding models

Selecting the right pre-trained embedding model for your domain can significantly improve the relevance of retrieved chunks. However, optimal embedding model performance requires custom fine-tuning on your proprietary corpus.

Fine-tuning an embedding model calls for three-part training examples, which consist of:

  1. A query.
  2. A chunk of context that is highly relevant to that query.
  3. A chunk of context that is irrelevant to that query.

This process works best if the “wrong” example is similar to the “right” example. These “near misses” (also known as “hard negatives”) highlight document chunks that look or behave similarly to the query but are not truly related. These examples can be hard to find, and sometimes the process is referred to as “hard negative mining.”

Data scientists can achieve results faster by randomly pairing correct chunks with unrelated ones. This may yield an embedding model less sensitive to important nuances than one trained with hard negatives, but the result may satisfy deployment needs for far less effort.
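Here is a minimal sketch of that triplet setup, assuming the sentence-transformers library and its classic `model.fit` training API (newer releases also offer a Trainer-based flow). The starting checkpoint and the example triplet are illustrative choices, not recommendations:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Each training example is a (query, relevant chunk, irrelevant chunk) triplet.
# Hard negatives (the "near misses" described above) work best; random
# negatives are an easier-to-build fallback.
triplets = [
    InputExample(texts=[
        "When is the first payment due?",                      # query
        "The initial installment is payable within 30 days.",  # relevant chunk
        "Late payments accrue interest at 1.5% per month.",    # near-miss negative
    ]),
    # ... many more triplets mined from your own corpus
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # example starting checkpoint
loader = DataLoader(triplets, shuffle=True, batch_size=16)
loss = losses.TripletLoss(model=model)

# Pulls queries closer to their relevant chunks and away from the negatives.
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("custom-embedding-model")
```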

Chunk separation with an off-the-shelf (left) and a customized (right) embedding model in a real-world, domain-specific use case.

Chunk enrichment: enhancing contextual relevance

Sometimes, vector-based retrieval will fall short of enabling your system to prioritize all of the correct information at all times, regardless of how well you customize your embedding model. Some tasks demand prioritizing specific information that may not appear semantically similar to a user’s query.

For example, an application built to find payment dates may need to surface all passages that include references to dates, which an embedding model may struggle to prioritize on its own.

Enriching chunks with metadata enables hybrid approaches that leverage categorical information as well as vector embeddings.

Better results through chunk tagging

Teams in charge of RAG systems can enrich document chunks with additional metadata or annotations to aid in retrieval and generation. This could include tagging chunks with key entities, summarizing sections, or linking related documents.

Data teams can add tags to chunks as a one-time exercise or build tagging into the pipeline. Supporting tools such as named entity recognition (NER) or topic models can automatically enrich incoming chunks with relevant contextual information.

Once they’ve added tags to the chunks, data scientists can use their orchestration framework to filter results on those tags, separately from and in addition to finding chunks with high relevance scores. This process might add two sets of chunks to the context (one optimized for relevance and one optimized for tags), ensuring that the LLM gets everything it needs to answer the user’s question.
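A rough sketch of that hybrid retrieval, using a simple regex-based date tag as the enrichment and a hypothetical `vector_store` with `search` and `filter` methods (in practice an NER model could populate richer metadata):

```python
import re

# Simple enrichment pass: tag each chunk that mentions a date-like string.
DATE_PATTERN = re.compile(
    r"\b\d{4}-\d{2}-\d{2}\b"                                           # 2024-01-31
    r"|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2}\b"
)

def enrich_chunk(chunk: dict) -> dict:
    chunk["metadata"] = {"mentions_date": bool(DATE_PATTERN.search(chunk["text"]))}
    return chunk

def hybrid_retrieve(question, vector_store, embed, top_k=4):
    # One set of chunks ranked purely by semantic relevance...
    relevant = vector_store.search(embed(question), top_k=top_k)
    # ...plus one set selected by the metadata tag, regardless of similarity.
    dated = vector_store.filter(metadata={"mentions_date": True}, limit=top_k)
    return relevant + dated
```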

LLM fine-tuning: aligning with task-specific needs

Off-the-shelf LLMs can produce generic or contextually inappropriate outputs, undermining the effectiveness of the RAG pipeline. If your RAG system retrieves the correct context, but the model returns a result that is out of line with your expectations—tonally, factually, or format-wise—you likely need to fine-tune your LLM.

Improving RAG outcomes with customized LLMs

Fine-tuning LLMs on domain- and task-specific datasets aligns their generative capabilities with the desired output. You may want to begin by finding an open-source LLM already fine-tuned to your domain. That may be enough to sharpen responses to meet your standards. If it isn’t, or you can’t find an LLM for your domain, the next step is to fine-tune your chosen open-source model with appropriate prompts and high-quality responses.

The team’s data scientists should work closely with subject matter experts to label a corpus of model responses (either historical or freshly generated) as high or low quality. We recommend that teams do this scalably with programmatic labeling, but other methods can also work.

Once the dataset is curated, use the prompt/response pairs to fine-tune the model and align its outputs with your expectations. This may require several iterations.
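The exact fine-tuning mechanics depend on your model and tooling, but the curated data typically ends up in a simple prompt/response format. Here is a sketch of writing SME-approved pairs to a JSONL file, a format most supervised fine-tuning workflows can consume; the example pair is hypothetical:

```python
import json

# Keep only responses that subject matter experts labeled as high quality.
curated_pairs = [
    {"prompt": "Summarize the refund policy for annual plans.",
     "response": "Annual plans can be refunded within 30 days of purchase..."},
    # ... many more SME-approved prompt/response pairs
]

# Write one JSON object per line; feed this file to whichever supervised
# fine-tuning workflow you use (e.g., a Hugging Face trainer or a hosted API).
with open("fine_tune_data.jsonl", "w") as f:
    for pair in curated_pairs:
        f.write(json.dumps(pair) + "\n")
```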

Debug your RAG system and enhance its enterprise value

RAG pipelines hold immense potential to transform how enterprises interact with information by providing contextually enriched, accurate responses. However, their success hinges on careful attention to the nuances of their implementation.

Organizations can harness the full power of RAG systems by understanding and addressing common failure modes in prompt engineering, data coverage, document chunking, embedding models, chunk enrichment, and LLM fine-tuning. Emphasizing a data-centric approach, continuous iteration, and user feedback will ensure these pipelines not only meet but exceed expectations, driving innovation and problem-solving to new heights.

Learn how to get more value from your PDF documents!

Transforming unstructured data such as text and documents into structured data is crucial for enterprise AI development. On December 17, we’ll hold a webinar that explains how to capture SME domain knowledge and use it to automate and scale PDF classification and information extraction tasks.

Sign up here!
