How we achieved 89% accuracy on contract question answering

Imagine sifting through a 500-page legal document for several hours to answer a simple question. Now, imagine a machine learning system that could perform contract question answering for you in a fraction of the time.

That’s what we at Snorkel AI recently created for a top 10 US bank. Our goal was to automate the process of extracting complex information from extensive legal PDFs, freeing up the bank’s subject matter experts (SMEs) to concentrate on the more enjoyable—and more valuable—parts of their jobs.

We began with an out-of-the-box solution using the GPT-4 large language model (LLM) and OpenAI’s text data embeddings. With an initial accuracy rate of just 25%, it was clear we had our work cut out for us. I recently talked with Matt Casey, Snorkel AI’s data science content lead, about how we leveraged the power of Snorkel Flow to boost our system’s performance to production levels in just a few weeks.

You can watch the full interview (embedded below), but I’ve summed up the journey here.

Optimizing pre-retrieval + retrieval for contract question answering

To improve the performance of our system, we initially concentrated our efforts on optimizing two key aspects: pre-retrieval and retrieval.

Pre-retrieval: chunking and tagging

During pre-retrieval, we focused on smarter “chunking,” an approach that prioritizes keeping bullet points and other structural components of the PDF’s text together. This kept the chunks better organized and let us enhance them with relevant metadata.
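
To make the chunking idea concrete, here is a minimal sketch of structure-aware splitting. It assumes the PDF text has already been extracted to plain text with blank lines between blocks; the `BULLET` pattern and the `chunk_text` function are hypothetical illustrations, not the code we ran in the project.

```python
import re

# Hypothetical pattern for lines that open a bullet or enumerated item,
# e.g. "- item", "• item", "(a) item", "1. item".
BULLET = re.compile(r"^\s*(?:[-*\u2022]|\(?[0-9a-zA-Z]{1,3}[.)])\s+")


def chunk_text(text: str) -> list[str]:
    """Split on blank lines, but keep bullet items attached to their lead-in."""
    chunks: list[list[str]] = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        if chunks and BULLET.match(lines[0]):
            # A block that opens with a bullet belongs to the previous chunk.
            chunks[-1].extend(lines)
        else:
            chunks.append(lines)
    return ["\n".join(c) for c in chunks]


sample = (
    "Section 1. Term.\nThe term begins on the Effective Date.\n\n"
    "The Borrower shall:\n\n(a) pay all fees;\n\n(b) deliver notices.\n\n"
    "Section 2. Definitions."
)
for chunk in chunk_text(sample):
    print(repr(chunk))
```

A naive blank-line splitter would return five chunks for this sample and separate “The Borrower shall:” from the obligations it introduces. The sketch returns three and keeps the list whole.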

The metadata included:

Identification of the document section where a paragraph was located.
Recognition of whether a paragraph was discussing a date.
Detection of whether a paragraph was providing legal definitions.

Attaching this metadata required us to create helper models that we called “custom extractors.” These binary classifiers identified whether a document chunk was likely to contain their targeted type of content—dates for one, definitions for the other. Creating these custom extractors would typically take months. However, with the help of Snorkel Flow, we were able to develop them in just a few days. This significantly sped up the process and improved the system’s overall efficiency.

This metadata enhancement allowed us to create a more robust foundation for the subsequent retrieval process.
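
As a rough illustration of what tagged chunks can look like, the sketch below attaches the three metadata fields to each chunk. The real custom extractors were trained binary classifiers; the regular expressions here are hypothetical stand-ins that only mimic the shape of their output, and the `Chunk` and `tag_chunk` names are assumptions made for this example.

```python
import re
from dataclasses import dataclass

# Stand-in rules (assumption): the project used trained classifiers instead.
DATE_RE = re.compile(
    r"\b(?:\d{1,2}/\d{1,2}/\d{2,4}|"
    r"(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s+\d{1,2},\s+\d{4})\b"
)
DEFINITION_RE = re.compile(
    r'["\u201c][A-Z][^"\u201d]*["\u201d]\s+(?:means|shall mean)\b'
)


@dataclass
class Chunk:
    text: str
    section: str          # document section the paragraph came from
    mentions_date: bool   # output of the "date" extractor
    is_definition: bool   # output of the "definition" extractor


def tag_chunk(text: str, section: str) -> Chunk:
    """Attach section and extractor outputs to a piece of contract text."""
    return Chunk(
        text=text,
        section=section,
        mentions_date=bool(DATE_RE.search(text)),
        is_definition=bool(DEFINITION_RE.search(text)),
    )


tagged = tag_chunk('"Maturity Date" means June 30, 2027.', "Definitions")
print(tagged)
```

With fields like these in place, a retriever can narrow its search before comparing embeddings, for example by looking only at chunks flagged as definitions when a question asks what a term means.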

Retrieval: customizing our embedding model

The retrieval aspect was equally crucial and required careful fine-tuning. The generic embedding model performed some tasks well—like telling legal documents from news articles—but it struggled to differentiate between chunks of legal text.

We fine-tuned the embedding model to the task’s narrow, specialized domain. This significantly improved the application’s ability to differentiate relevant content from irrelevant content.
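
This post does not go into how the fine-tuning data was assembled, so treat the following as one plausible setup rather than our exact recipe. A common way to teach an embedding model to separate similar passages is to train on (question, relevant chunk, irrelevant chunk) triplets, drawing the irrelevant chunk from the same contract so that the model has to tell legal text from other legal text. The `build_triplets` function and its input shapes are hypothetical.

```python
import random


def build_triplets(labeled, chunks_by_doc, seed=0):
    """Build (question, positive_chunk, negative_chunk) training triplets.

    labeled: list of (question, doc_id, index_of_relevant_chunk)
    chunks_by_doc: dict mapping doc_id to that document's list of chunks
    """
    rng = random.Random(seed)
    triplets = []
    for question, doc_id, pos_idx in labeled:
        chunks = chunks_by_doc[doc_id]
        # Negatives come from the same document, so they are "hard":
        # similar in style and vocabulary to the relevant chunk.
        negatives = [i for i in range(len(chunks)) if i != pos_idx]
        if not negatives:
            continue  # a single-chunk document offers no negative
        neg_idx = rng.choice(negatives)
        triplets.append((question, chunks[pos_idx], chunks[neg_idx]))
    return triplets


chunks_by_doc = {
    "contract_1": [
        '"Maturity Date" means June 30, 2027.',
        "The Borrower shall pay all fees when due.",
        "Notices must be delivered in writing.",
    ]
}
labeled = [("When does the loan mature?", "contract_1", 0)]
print(build_triplets(labeled, chunks_by_doc))
```

Triplets like these can then be fed to whichever contrastive or triplet-loss training routine the embedding library provides.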

Additional tweaks

While most of our early work focused on pre-retrieval and retrieval optimization, we also made other adjustments—for example, injecting relevant non-document information into the prompt. This may sound obvious, but GPT-4 doesn’t know the current date. So, we had to include the current date in our prompt template.
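
Injecting the date can be as simple as the sketch below. The template wording and the `build_prompt` helper are assumptions for illustration, not our production prompt.

```python
from datetime import date

# Hypothetical template: the key point is the explicit "today" slot.
PROMPT_TEMPLATE = (
    "Today's date is {today}.\n"
    "Answer the question using only the contract excerpts below.\n\n"
    "Excerpts:\n{context}\n\n"
    "Question: {question}\n"
)


def build_prompt(question, chunks, today=None):
    """Fill the template with retrieved chunks and the current date."""
    today = today or date.today()
    context = "\n---\n".join(chunks)
    return PROMPT_TEMPLATE.format(
        today=today.isoformat(), context=context, question=question
    )


print(
    build_prompt(
        "Has the agreement expired?",
        ["The term ends on June 30, 2027."],
    )
)
```

Without the first line of the template, the model has no reliable way to answer a question like “Has the agreement expired?”, because the answer depends on a fact that is not in the document.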

Arriving at these tweaks was an iterative process that involved running multiple experiments.

How Snorkel Flow’s programmatic and synthetic data helped

We leveraged the power of our Snorkel Flow AI data development platform to accelerate the process of developing data and iterating on our application. The platform allowed us to encode the knowledge and intuition of the bank’s SMEs into labeling functions that combined to swiftly generate a large quantity of programmatic labels.

Some key steps in this process included:

SME tracking: To develop our labeling functions, we recorded the pages the SMEs visited and the definitions they looked up when answering particular questions.
Synthetic data generation: To scale data development, we generated synthetic data guided by the metadata and filtered for quality according to models built using the original data. This synthetic data was important in allowing the embedding model to generalize well to many types of questions.
Rapid annotation: By combining programmatic labels from Snorkel Flow with a few manually labeled PDFs, we quickly scaled our training data to encompass 500 PDFs.
Custom extractors: Using Snorkel Flow, we created helper models that allowed us to tag document chunks as likely to include dates or likely to include legal definitions. We built them in a single day. Without Snorkel Flow, labeling the data to create these extractors could take months.
Iterative experiments: We conducted more than 40 experiments in three weeks during our first sprint. This rapid experimentation, augmented by Snorkel Flow, allowed us to understand what system components were holding us back at each point and efficiently improve upon them.
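
For readers new to programmatic labeling, the sketch below shows the general idea behind labeling functions: each one encodes a single SME heuristic and either votes or abstains. It is plain Python, not the Snorkel Flow API, and a simple majority vote stands in for the platform's label aggregation. The heuristics and function names are hypothetical.

```python
from collections import Counter

ABSTAIN, NOT_DEFINITION, DEFINITION = -1, 0, 1


def lf_has_means(chunk):
    # Heuristic: definitions usually read '"Term" means ...'.
    return DEFINITION if " means " in chunk["text"] else ABSTAIN


def lf_definitions_section(chunk):
    # Heuristic: SMEs look for definitions in the Definitions section.
    return DEFINITION if chunk["section"].lower() == "definitions" else ABSTAIN


def lf_obligation_language(chunk):
    # Heuristic: payment obligations are rarely definitions.
    return NOT_DEFINITION if " shall pay " in chunk["text"] else ABSTAIN


LABELING_FUNCTIONS = [lf_has_means, lf_definitions_section, lf_obligation_language]


def majority_label(chunk):
    """Combine labeling-function votes; abstain when no votes or a tie."""
    votes = [v for v in (lf(chunk) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes)
    top, top_count = counts.most_common(1)[0]
    if list(counts.values()).count(top_count) > 1:
        return ABSTAIN  # tie between labels
    return top


example = {"text": '"Agent" means the bank.', "section": "Definitions"}
print(majority_label(example))
```

Because each function is cheap to write and applies to every chunk at once, a handful of them can label far more data than hand annotation in the same amount of time, and chunks where the functions abstain or disagree point to where SME input is most useful.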

In the interview, Matt asked me how long this project would have taken without Snorkel Flow. The answer is that it probably wouldn’t have happened. When large companies find themselves staring down months of hand-labeling work, they often shelve the project and move on.

With Snorkel Flow, they don’t have to.

Minhajul Hoque discussed contract question answering with Matt Casey.

Results and ongoing work

Our first sprint, which lasted three weeks, saw the answer accuracy rate improve from 25% to 79%. In the second sprint, we further improved the accuracy to 89%.

However, the project is still ongoing. We’re currently in the third sprint, where we aim to expand the number of questions the system can answer.

Snorkel Flow: freeing up SME time

By automating a tedious and time-consuming task, we not only improved efficiency but also allowed SMEs to focus on higher-value tasks. The success of this project exemplifies the potential of AI in turning complex ideas into practical solutions, and we’re excited about the future possibilities.