As enterprises aim to deploy a growing number of generative AI applications, evaluating the performance of large language models has become a gateway to successful production deployment. We at Snorkel AI believe that customized and automated evaluation approaches present the best path to production—especially when paired with data slices.

Data scientists often lack the business context to judge model performance against real use cases. Subject matter experts (SMEs) typically don’t have the tools to scale their expertise to assess large language models (LLMs). Faced with this gap, some companies have deployed armies of human labelers. As an alternative, Snorkel built a platform where data scientists and SMEs can collaborate to build scalable, sustainable solutions.

I spoke about the value of data slices at a recent webinar. You can watch my full remarks (embedded below), but I have summarized the main points here.

Why is customized and automated LLM evaluation so critical?

Without customized LLM evaluation, enterprises can’t use LLM applications for business-critical tasks. The model must perform according to the standards an organization cares about and on the tasks it needs to complete.

While open source benchmarks and direct human evaluation can be useful, both have drawbacks. Open source benchmarks don’t reflect your organization’s specific challenges. Human evaluators might, but they can be expensive, and each round of model evaluation requires a fresh round of human annotation.


In their place, we recommend building custom evaluation models encoded with your organization’s data and your experts’ knowledge and experience.

This approach:

  1. Enables continuous improvement: Automated evaluations help track the performance of your LLMs as they undergo updates. They act like unit tests, guarding against regressions and highlighting areas for improvement (see the sketch after this list).
  2. Captures domain-specific nuances: The definition of “correctness” in LLMs is multi-faceted, encompassing factors such as toxicity, compliance, relevance, comprehensiveness, and the ability to provide substantiated responses. Properly customized, automated evaluations account for these nuances.
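
As a minimal sketch of that unit-test idea (plain Python, with hypothetical metric names and thresholds, not a Snorkel Flow feature), a regression check can compare a new model version’s scores against a stored baseline:

```python
# Minimal sketch of an automated evaluation "unit test" (hypothetical metrics).
# A new model version fails the check if any tracked metric regresses by more
# than an allowed tolerance relative to the previous baseline.

BASELINE = {"correctness": 0.86, "compliance": 0.97, "relevance": 0.91}
TOLERANCE = 0.02  # maximum allowed drop per metric

def check_for_regressions(new_scores: dict) -> list:
    """Return the metrics where the new model regressed beyond tolerance."""
    return [
        metric
        for metric, baseline_score in BASELINE.items()
        if new_scores.get(metric, 0.0) < baseline_score - TOLERANCE
    ]

new_scores = {"correctness": 0.88, "compliance": 0.93, "relevance": 0.92}
regressions = check_for_regressions(new_scores)
if regressions:
    raise AssertionError(f"Regression detected on: {', '.join(regressions)}")
```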

Introduction to data slices

Data slices are subsets of data that represent meaningful aspects of your business operations. In the context of LLM evaluation, they enable a fine-grained understanding of model performance across different business-critical areas. This granular approach of evaluating LLMs through data slices transforms the evaluation and iteration process.

In the Snorkel Flow AI data development platform, building a data slice works much like building a labeling function. Users employ LLM prompts, regular expressions, ontologies, embeddings, or any of the platform’s many other tools to define slice membership at scale.

Slicing your data allows you to measure performance on precise, enterprise-specific tasks. For instance, an e-commerce company’s LLM evaluation might include slices for customer service queries about shipping issues, product information, or transaction concerns.
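
As a rough illustration of slice membership (plain Python with made-up keyword patterns, not the Snorkel Flow API), slices can be expressed as predicate functions over each example:

```python
import re

# Illustrative slice-membership functions for an e-commerce chatbot.
# The keyword patterns are hypothetical; real slices might instead use
# LLM prompts, ontologies, or embedding similarity.

def shipping_issues(example: dict) -> bool:
    return bool(re.search(r"\b(shipping|delivery|tracking|late package)\b",
                          example["query"], re.IGNORECASE))

def product_information(example: dict) -> bool:
    return bool(re.search(r"\b(spec|size|material|compatib\w*)\b",
                          example["query"], re.IGNORECASE))

def transaction_concerns(example: dict) -> bool:
    return bool(re.search(r"\b(refund|charge|payment|invoice)\b",
                          example["query"], re.IGNORECASE))

SLICES = {
    "shipping_issues": shipping_issues,
    "product_information": product_information,
    "transaction_concerns": transaction_concerns,
}

def assign_slices(example: dict) -> list:
    """Return the names of every slice the example belongs to."""
    return [name for name, fn in SLICES.items() if fn(example)]

print(assign_slices({"query": "My refund never arrived and the tracking is stuck."}))
# ['shipping_issues', 'transaction_concerns']
```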


Leveraging programmatic techniques to evaluate data slices enables scalability. Rather than manually analyzing tens of thousands of data points, programmatic data slices allow for automated, robust, and consistent evaluations that can adapt to your changing business needs.

By evaluating each slice independently, you can easily identify which ones need further work. You can then improve the training data in that slice by generating new “golden” examples or writing additional labeling functions.

As you make these adjustments, you can see how changes to the model have improved performance in each slice.
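
Continuing the illustrative sketch above, per-slice scores can be computed by grouping evaluation records by the slices they belong to (the field names here are hypothetical):

```python
from collections import defaultdict

# Sketch: aggregate a correctness score per slice so weak slices stand out.
# Each record is assumed to carry the slices it belongs to and a 0/1 judgment
# (from ground truth or a quality model).

def score_by_slice(records: list) -> dict:
    totals, correct = defaultdict(int), defaultdict(int)
    for record in records:
        for slice_name in record["slices"]:
            totals[slice_name] += 1
            correct[slice_name] += record["is_correct"]
    return {name: correct[name] / totals[name] for name in totals}

records = [
    {"slices": ["shipping_issues"], "is_correct": 1},
    {"slices": ["transaction_concerns"], "is_correct": 0},
    {"slices": ["transaction_concerns", "product_information"], "is_correct": 1},
]
print(score_by_slice(records))
# {'shipping_issues': 1.0, 'transaction_concerns': 0.5, 'product_information': 1.0}
```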

Building customized evaluations using data slices

Snorkel researchers and engineers have already used scalable evaluation and data slices to improve performance on customers’ LLM applications. We’re now moving this capability into the platform.

Here’s how the process works.

Step 1: Create a golden data set

Begin by working with your subject matter experts to create a golden data set. This involves labeling a subset of data points and specifying acceptance criteria for what constitutes a good response.
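
As a hedged sketch of what one golden record might contain, each labeled example can pair an expert-approved reference response with explicit acceptance criteria (all field names and values below are hypothetical):

```python
# Hypothetical shape of one entry in a golden data set: the expert-approved
# response plus the acceptance criteria an answer must satisfy to count as good.
golden_example = {
    "query": "Where is my order? It was supposed to arrive last week.",
    "slice": "shipping_issues",
    "reference_response": "Apologize for the delay, look up the order status, and "
                          "share the latest tracking update with an estimated date.",
    "acceptance_criteria": [
        "acknowledges the delay",
        "asks for or uses the order number",
        "provides a concrete next step",
        "does not promise a specific refund amount",
    ],
}
```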

Step 2: Encode expert criteria

Capture the rationale behind SMEs’ judgments programmatically. Use these criteria to develop a custom quality model that can scale your experts’ insights across the entire dataset.
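
One way to begin encoding those judgments, shown here only as an illustrative sketch rather than the quality model Snorkel builds, is to turn each acceptance criterion into a programmatic check and combine the results into a score:

```python
# Sketch: each expert criterion becomes a check that returns True/False for a
# model response; the mean of the checks acts as a simple quality score.
# A real quality model might combine such signals with a trained classifier
# or LLM-based judges.

def acknowledges_delay(response: str) -> bool:
    return any(p in response.lower() for p in ("sorry", "apologize", "delay"))

def offers_next_step(response: str) -> bool:
    return any(p in response.lower() for p in ("tracking", "order number", "refund"))

CHECKS = [acknowledges_delay, offers_next_step]

def quality_score(response: str) -> float:
    """Fraction of expert-derived checks the response satisfies."""
    return sum(check(response) for check in CHECKS) / len(CHECKS)

print(quality_score("We're sorry for the delay; here is your latest tracking update."))
# 1.0
```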

Step 3: Slice and dice your data

Identify and define the data slices that matter most to your business. This might involve categorizing queries into different types of user interactions or product lines.

Step 4: Generate detailed evaluation reports

Combine your manual ground truth, programmatic quality model, metrics, and data slices to produce a comprehensive evaluation report. This report should detail performance across different slices, allowing for an actionable understanding of where improvements are necessary.
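
As a minimal sketch of such a report (using pandas, with illustrative column names), per-slice scores from ground truth and a quality model can be aggregated into a single table:

```python
import pandas as pd

# Sketch: combine slice assignments, ground-truth judgments, and quality-model
# scores into one per-slice report (all column names and values are illustrative).
results = pd.DataFrame([
    {"slice": "shipping_issues",      "gt_correct": 1, "quality_score": 0.9},
    {"slice": "shipping_issues",      "gt_correct": 0, "quality_score": 0.4},
    {"slice": "transaction_concerns", "gt_correct": 1, "quality_score": 0.8},
    {"slice": "product_information",  "gt_correct": 1, "quality_score": 1.0},
])

report = (
    results.groupby("slice")
    .agg(examples=("gt_correct", "size"),
         ground_truth_accuracy=("gt_correct", "mean"),
         avg_quality_score=("quality_score", "mean"))
    .sort_values("ground_truth_accuracy")
)
print(report)
```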


Practical applications of data slices

To illustrate how data slices and quality models can help in real-world settings, here are two examples.

Consider an e-commerce platform using an LLM-powered chatbot. By slicing data into categories such as “Shipping Issues” and “Product Information,” the platform can track which specific types of queries the model handles well and where it needs improvement. This ensures that critical areas like transaction concerns, which have higher stakes, receive focused attention for optimization.

In the financial sector, data slices can be used to evaluate LLM performance in areas with strict compliance requirements such as transaction inquiries, fraud detection, and customer service. Each slice represents a different type of customer interaction, enabling the company to maintain high standards of accuracy and customer support.

Data slices + custom, automated evaluation: a powerful pair

Data slices provide a powerful method for fine-grained, domain-specific LLM evaluation. They ensure that your models meet the unique needs and critical operations of your enterprise. By adopting a systematic approach to slicing and evaluating your data according to your custom, organization-specific standards, you can build robust, scalable evaluations that keep your LLM deployments on track and your business operations smooth.

At Snorkel, we understand the importance of fine-grained evaluations. Our approach of leveraging programmatic data development techniques helps enterprises build custom, scalable, and efficient LLM evaluations that ensure their models are production-ready and aligned with business goals.

Request a free, custom LLM evaluation from Snorkel AI

To ship LLMs with confidence, enterprises need custom evaluations that are purpose-built for their domains and use cases. For a limited time, Snorkel AI will provide select organizations with a complimentary LLM evaluation, powered by Snorkel’s programmatic data development technology.

To learn more, visit our Snorkel Custom Evaluation page.