Retrieval-augmented generation (RAG) failure modes and how to fix them
Retrieval-augmented generation (RAG) represents a leap forward in natural language processing. Well-crafted RAG systems deliver meaningful business value in a user-friendly form factor. However, these systems contain multiple complex components. RAG failure modes can inhibit a system’s value and cause significant headaches.
In this article, we will explore some of the most common pitfalls encountered in RAG pipelines and provide actionable solutions to address them.
What is retrieval-augmented generation (RAG)?
RAG systems combine the strengths of reliable source documents with the generative capability of large language models (LLMs).
The simplest RAG system consists of a vector database, an LLM, a user interface, and an orchestrator such as LlamaIndex or LangChain. After a user enters their query, the system retrieves relevant documents or document chunks from the vector database and adds them to the initial request as context. This final prompt gives the LLM more context with which to answer the user’s question.
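As a rough illustration, the sketch below traces that retrieve-then-generate flow in Python. The `vector_db` and `llm` objects and their methods are placeholders for whatever clients your orchestrator wraps, not any specific library’s API.

```python
# Minimal sketch of the retrieve-then-generate flow described above.
# `vector_db` and `llm` stand in for whatever vector database and LLM
# clients your orchestrator provides; their names and methods are illustrative.

def answer_query(query: str, vector_db, llm, top_k: int = 3) -> str:
    # 1. Retrieve the chunks most similar to the user's query.
    chunks = vector_db.similarity_search(query, k=top_k)

    # 2. Assemble the final prompt: retrieved context plus the question.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

    # 3. Let the LLM generate a response grounded in that context.
    return llm.generate(prompt)
```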
Learn more about retrieval-augmented generation in our guide.
Prompt engineering: crafting effective prompt templates
Poorly designed prompt templates can lead to off-target responses or outputs that lack the desired specificity. This is akin to giving a colleague or subordinate incomplete or poorly-formed instructions; if they don’t understand the task, they can’t complete it.
Solving challenges with prompt templates
Begin by clearly defining the prompt’s objective and the desired characteristics of the output. Experiment with different prompt structures, starting with simple instructions and iteratively incorporating more complex directives as needed. Consider using prompt engineering techniques such as few-shot learning, where relevant examples are included to guide the model’s response.
You may also need to add dynamic variables to your prompt template. For example, an LLM does not know the current date. If any intended task requires this knowledge, you should include a step in your pipeline that injects the current date and time into the prompt.
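A minimal sketch of such a template is below. It combines one few-shot example with a dynamically injected date; the wording, example, and variable names are illustrative and should be adapted to your own task.

```python
from datetime import date

# Hypothetical prompt template combining a few-shot example with a dynamic
# date variable. Adapt the instructions and example to your own task.
PROMPT_TEMPLATE = """Today's date is {today}.

You are a support assistant. Answer using only the provided context.

Example:
Context: Refunds are processed within 14 days of the request.
Question: How long do refunds take?
Answer: Refunds are processed within 14 days of the request.

Context: {context}
Question: {question}
Answer:"""

def build_prompt(context: str, question: str) -> str:
    # Inject the current date at request time so the model never has to guess it.
    return PROMPT_TEMPLATE.format(today=date.today().isoformat(),
                                  context=context,
                                  question=question)
```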
Whatever approach you take, iterate. Adjust your prompt, re-run it on a consistent set of representative queries, and inspect the results. You should soon find a working template.
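One lightweight way to do this, assuming a `rag_answer` function that wraps your full pipeline, is to hold the query set fixed and re-run it after every template change:

```python
# Re-running each template candidate against a fixed set of representative
# queries makes iterations easy to compare. `rag_answer` is assumed to wrap
# your full pipeline (retrieval + generation); the queries are examples.

REPRESENTATIVE_QUERIES = [
    "What is our refund policy?",
    "When is the next invoice due?",
    "Who approves expense reports over $5,000?",
]

def evaluate_template(rag_answer, queries=REPRESENTATIVE_QUERIES):
    # Collect answers for manual inspection (or scoring) before the next tweak.
    return {query: rag_answer(query) for query in queries}
```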
Data coverage and quality
When data scientists build RAG systems, they do so with the intent of enriching users’ requests with high-quality, relevant context. A system’s users assume it draws on a comprehensive set of high-quality documents, but that’s not always the case.
A RAG system that lacks complete coverage could leave the LLM with no context to draw on, increasing the likelihood that it generates a “hallucination.” Additionally, a system that includes poorly curated documents could give the end user misleading, incorrect, or outdated information.
Solving challenges with data coverage and quality
If your RAG system’s responses seem unanchored in the right underlying information, check the context included in the prompts. If the prompts lack appropriate context, thoroughly search your vector database and then fill in any gaps you find.
Teams managing RAG applications should regularly audit and expand the system’s dataset to ensure comprehensive coverage. They should also check older documents and ensure that they remain accurate; organizational policies and product offerings will change over time. Automated tools can help identify gaps or biases in the data, enabling proactive adjustments.
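One simple automated check, sketched below under the assumption of a generic `vector_db` client that returns similarity scores, is to flag representative queries whose best retrieval score falls below a threshold; those queries often point to coverage gaps.

```python
# Flag representative queries whose best retrieval score falls below a
# threshold, which often signals a coverage gap in the vector database.
# `vector_db` and its scoring method are placeholders for your own client,
# and the threshold depends on your similarity metric.

def find_coverage_gaps(queries, vector_db, min_score=0.75, top_k=3):
    gaps = []
    for query in queries:
        hits = vector_db.similarity_search_with_score(query, k=top_k)
        best_score = max(score for _, score in hits) if hits else 0.0
        if best_score < min_score:
            gaps.append((query, best_score))
    # Queries in `gaps` likely need new or refreshed source documents.
    return gaps
```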
Document chunking: striking the balance
Ineffective document chunking can lead to information loss, noisy context, or irrelevant retrievals, hampering the performance of RAG pipelines. Chunks that are too large may cause the system to overlook pertinent information. Chunks that are too granular can strip key details of the surrounding context that gives them meaning.
If the chunks included in your RAG prompts are too long, too short, or cut off in the middle of vital information, you may have a chunking problem.
Improving RAG outcomes with better document chunking
While off-the-shelf RAG orchestration tools typically chunk documents according to a set number of tokens by default, they include other chunking options. Experimenting with different token windows and overlaps may yield usable results. If they don’t, switching to paragraph, page, or semantic segmentation might.
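For experimentation, a plain token-window chunker with overlap is easy to sketch. The version below treats whitespace-separated words as tokens for simplicity; in practice you would use your embedding model’s tokenizer or your orchestrator’s built-in splitters.

```python
# A plain token-window chunker with overlap, useful for experimenting with
# different window sizes before moving to paragraph or semantic splitting.
# Here "tokens" are whitespace-separated words for simplicity.

def chunk_by_tokens(text: str, window: int = 256, overlap: int = 32):
    tokens = text.split()
    step = window - overlap  # window must exceed overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunk = " ".join(tokens[start:start + window])
        if chunk:
            chunks.append(chunk)
    return chunks

# Paragraph-level splitting is often a reasonable next experiment:
def chunk_by_paragraph(text: str):
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```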
In Snorkel Flow, we use a proprietary algorithm that leverages semantic similarity and document structure to approximate how a human might separate document sections. This yields chunks of different sizes that typically include enough information to feel self-contained. In practice, we have found that this algorithm yields production-grade results.
Embedding models: capturing semantics accurately
Challenges with embedding models initially appear similar to challenges with document coverage; the prompts lack the appropriate context.
If you have this problem and verify that the appropriate information exists in your vector database, the challenge likely springs from your embedding model. Generalist embedding models usually won’t capture the semantic nuances of domain-specific data, causing the system to struggle to prioritize chunks for retrieval at inference time. Some retrieved chunks may be relevant. Others may not be. An off-the-shelf embedding model likely can’t tell the difference.
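A quick way to test this is to embed a query alongside chunks you know to be relevant and irrelevant and compare the similarity scores. The sketch below uses the sentence-transformers library with an off-the-shelf encoder as an example; if the scores come out nearly identical, the generalist model is probably not capturing your domain’s distinctions.

```python
from sentence_transformers import SentenceTransformer, util

# Quick diagnostic: embed a query alongside chunks you know are relevant and
# irrelevant, then compare similarity scores. If the generalist model scores
# them about the same, fine-tuning is likely worthwhile.
# (The model name is just an example of an off-the-shelf encoder.)
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What is the penalty for late payment?"
relevant_chunk = "A 2% fee is applied to invoices paid after the due date."
irrelevant_chunk = "Payments can be made by wire transfer or credit card."

embeddings = model.encode([query, relevant_chunk, irrelevant_chunk])
print("relevant  :", util.cos_sim(embeddings[0], embeddings[1]).item())
print("irrelevant:", util.cos_sim(embeddings[0], embeddings[2]).item())
```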
Improving retrieval through custom embedding models
Selecting the right pre-trained embedding model for your domain can significantly improve the relevance of retrieved chunks. However, optimal embedding model performance requires custom fine-tuning on your proprietary corpus.
Fine-tuning an embedding model calls for three-part training examples, which consist of:
- A query.
- A chunk of context that is highly relevant to that query.
- A chunk of context that is irrelevant to that query.
This process works best if the “wrong” example is similar to the “right” example. These “near misses” (also known as “hard negatives”) highlight document chunks that look or behave similarly to the query but are not truly related. These examples can be hard to find, and sometimes the process is referred to as “hard negative mining.”
Data scientists can achieve results faster by randomly pairing correct chunks with unrelated ones. This may yield an embedding model less sensitive to important nuances than one trained with hard negatives, but the result may satisfy deployment needs for far less effort.
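The sketch below shows what triplet-based fine-tuning can look like with the sentence-transformers library. The example triplet is invented for illustration; in practice you would load triplets mined from your own corpus.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Minimal fine-tuning sketch using (query, relevant chunk, negative chunk)
# triplets. The single example below is illustrative only.
model = SentenceTransformer("all-MiniLM-L6-v2")

train_examples = [
    InputExample(texts=[
        "What is the penalty for late payment?",                      # query
        "A 2% fee is applied to invoices paid after the due date.",   # relevant chunk
        "Payments can be made by wire transfer or credit card.",      # hard negative
    ]),
    # ... more triplets mined from your corpus ...
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

# A short run is often enough to see whether domain fine-tuning helps.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```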
Chunk enrichment: enhancing contextual relevance
Sometimes, no matter how well you customize your embedding model, vector-based retrieval will fall short of prioritizing all of the correct information at all times. Some tasks demand prioritizing specific information that may not appear semantically similar to a user’s query.
For example, an application built to find payment dates may need to surface all passages that include references to dates, which an embedding model may struggle to prioritize on its own.
Enriching chunks with metadata enables hybrid approaches that leverage categorical information as well as vector embeddings.
Better results through chunk tagging
Teams in charge of RAG systems can enrich document chunks with additional metadata or annotations to aid in retrieval and generation. This could include tagging chunks with key entities, summarizing sections, or linking related documents.
Data teams can add tags to chunks as a one-time exercise or build tagging into the pipeline. Supporting tools such as named entity recognition (NER) or topic models can automatically enrich incoming chunks with relevant contextual information.
Once they’ve added tags to the chunks, data scientists can use their orchestration framework to filter results accordingly. They can do this separately and in addition to finding chunks with high relevance scores. This process might, as a result, add two separate sets of chunks to the context—one optimized for relevancy and one optimized for tags—ensuring that the LLM gets everything it needs to answer the user’s question.
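The sketch below illustrates both steps: enriching chunks with entity-type tags via spaCy’s standard small English pipeline, then merging tag-filtered chunks with similarity-ranked ones. The `vector_db` object and its `filter` method are placeholders for your own store’s API.

```python
import spacy

# Sketch of metadata enrichment with named entity recognition, followed by a
# hybrid retrieval step that merges tag-filtered chunks with the usual
# similarity-ranked chunks. `vector_db`, its methods, and the `.id` attribute
# on chunk objects are placeholders for your own stack.
nlp = spacy.load("en_core_web_sm")

def enrich_chunk(chunk_text: str) -> dict:
    doc = nlp(chunk_text)
    return {
        "text": chunk_text,
        "entity_types": sorted({ent.label_ for ent in doc.ents}),  # e.g. ["DATE", "MONEY"]
    }

def hybrid_retrieve(query: str, vector_db, required_tag: str = "DATE", k: int = 3):
    by_similarity = vector_db.similarity_search(query, k=k)
    by_tag = vector_db.filter(metadata_contains={"entity_types": required_tag}, limit=k)
    # Merge both sets (deduplicated by chunk id) so the LLM sees semantically
    # similar chunks *and* chunks guaranteed to mention dates.
    return list({chunk.id: chunk for chunk in by_similarity + by_tag}.values())
```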
LLM fine-tuning: aligning with task-specific needs
Off-the-shelf LLMs can produce generic or contextually inappropriate outputs, undermining the effectiveness of the RAG pipeline. If your RAG system retrieves the correct context, but the model returns a result that is out of line with your expectations—tonally, factually, or format-wise—you likely need to fine-tune your LLM.
Improving RAG outcomes with customized LLMs
Fine-tuning LLMs on domain- and task-specific datasets aligns their generative capabilities with the desired output. You may want to begin by finding an open source LLM already fine-tuned to your domain. That may be enough to sharpen responses to meet your standards. If it isn’t—or you can’t find an LLM for your domain—the next step is to fine-tune your chosen open-source model with appropriate prompts and high-quality responses.
The team’s data scientists should work closely with subject matter experts to label a corpus of model responses (either historical or freshly generated) as high or low quality. We recommend that teams do this scalably with programmatic labeling, but other methods can also work.
Once the pairs are curated, use them to fine-tune the model and align its outputs with your expectations. This may require several iterations.
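As a sketch of the curation step, the snippet below keeps only the pairs labeled high quality and writes them to a JSONL file, one common input format for supervised fine-tuning; match the exact schema to whatever fine-tuning tooling you use.

```python
import json

# Turn SME-labeled responses into a supervised fine-tuning dataset.
# Only pairs labeled high quality are kept. The records here are invented
# examples; the JSONL schema should match your fine-tuning tooling.
labeled_responses = [
    {"prompt": "Summarize the late-payment policy.",
     "response": "Invoices paid after the due date incur a 2% fee.",
     "label": "high"},
    {"prompt": "Summarize the late-payment policy.",
     "response": "Late payments are fine, no action needed.",
     "label": "low"},
]

with open("sft_dataset.jsonl", "w") as f:
    for example in labeled_responses:
        if example["label"] == "high":
            f.write(json.dumps({"prompt": example["prompt"],
                                "completion": example["response"]}) + "\n")
```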
Debug your RAG system and enhance its enterprise value
RAG pipelines hold immense potential to transform how enterprises interact with information by providing contextually enriched, accurate responses. However, their success hinges on careful attention to the nuances of their implementation.
Organizations can harness the full power of RAG systems by understanding and addressing common failure modes—such as document chunking, embedding models, LLM fine-tuning, chunk enrichment, and prompt template engineering. Emphasizing a data-centric approach, continuous iteration, and user feedback will ensure these pipelines not only meet but exceed expectations, driving innovation and problem-solving to new heights.
Learn how to get more value from your PDF documents!
Transforming unstructured data such as text and documents into structured data is crucial for enterprise AI development. On December 17, we’ll hold a webinar that explains how to capture SME domain knowledge and use it to automate and scale PDF classification and information extraction tasks.
Featured image by Jessica Ruscello on Unsplash
Matt Casey leads content production at Snorkel AI. In prior roles, Matt built machine learning models and data pipelines as a data scientist. As a journalist, he produced written and audio content for outlets including The Boston Globe and NPR affiliates.