Why GenAI evaluation requires SME-in-the-loop for validation and trust
Enterprises must be able to trust and rely on GenAI evaluation results, and that requires SME-in-the-loop workflows.
In my first blog post on enterprise GenAI evaluation, I discussed the importance of specialized evaluators as a scalable proxy for SMEs. It simply isn’t practical to task SMEs with performing manual evaluations – it can take weeks if not longer, unnecessarily slowing down GenAI development. However, if we can capture SME domain knowledge in the form of well-defined acceptance criteria, and scale it via automated, specialized evaluators, we can accelerate evaluation dramatically – from several weeks or more to a few hours or less.
However, that doesn’t mean evaluation should occur without SMEs. Their domain knowledge is critical throughout the process, but we need to use their time as efficiently as possible. This is what we mean by SME-in-the-loop.
Specifically, enterprise GenAI evaluation requires SMEs to:
- Define acceptance criteria – factors to consider when evaluating responses
- Provide ground truth – examples of accepted/rejected responses, and reasons why
- Resolve conflicts – explanations of why evaluator judgement may be wrong
- Provide ongoing feedback – reasons why some accepted responses should be rejected
The first two are required to support the initial development of specialized evaluators. The third is the focus of this blog post. The last is needed to ensure that evaluation continues to reflect the requirements of AI assistants and copilots as they evolve.
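To make this concrete, here is a minimal sketch of how SME ground truth might be captured as structured data. The `GroundTruthExample` record and its field names are illustrative assumptions rather than a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class GroundTruthExample:
    """One SME-labeled conversation (illustrative schema, not a prescribed format)."""
    conversation_id: str
    criterion: str          # e.g. "No refusal to help"
    prompt_category: str    # e.g. "Billing"
    sme_judgment: str       # "accept" or "reject"
    reason: str             # SME explanation, used to refine the evaluator

# Hypothetical example of a rejected response
example = GroundTruthExample(
    conversation_id="conv-0042",
    criterion="No refusal to help",
    prompt_category="Billing",
    sme_judgment="reject",
    reason="The assistant told the customer to call the billing department instead of helping.",
)
```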
GenAI evaluation with fine-grained metrics
In my second blog post, I mentioned there were three dimensions to GenAI evaluation metrics. I went into detail on two of them – acceptance criteria and prompt categories. Here are the evaluation results for our running example, a customer support AI assistant.
AI assistant evaluation results – round 2
| | Billing | Stolen card | Account status |
|---|---|---|---|
| No repetition | 85% | 90% | 95% |
| No refusal to help | 50% | 98% | 100% |
| Acknowledge request | 80% | 95% | 99% |
If the AI team had been provided with an overall metric such as a pass rate of 80%, they would not have enough insight to determine whether or not the AI assistant should be deployed to production, let alone where improvements must be made in order to increase the pass rate. On the other hand, fine-grained metrics for acceptance criteria and prompt categories go a long way when it comes to actionable insight for AI teams.
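As a rough sketch, fine-grained metrics like the ones above can be computed by grouping per-conversation evaluator judgments by acceptance criterion and prompt category. The `fine_grained_pass_rates` function and its record format below are assumptions for illustration, not a specific product API.

```python
from collections import defaultdict

def fine_grained_pass_rates(results):
    """Compute pass rate per (criterion, prompt category) cell.

    results: iterable of dicts with 'criterion', 'category', and 'passed' (bool).
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        key = (r["criterion"], r["category"])
        totals[key] += 1
        passes[key] += int(r["passed"])
    return {key: passes[key] / totals[key] for key in totals}

# Hypothetical usage:
results = [
    {"criterion": "No refusal to help", "category": "Billing", "passed": False},
    {"criterion": "No refusal to help", "category": "Billing", "passed": True},
]
print(fine_grained_pass_rates(results))  # {('No refusal to help', 'Billing'): 0.5}
```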
However, we’re assuming the “No refusal to help” evaluator is correct, and can be relied on. What if it’s not? What if it can’t be?
If we can’t validate the correctness of specialized evaluators, we can’t rely on them. And if we can’t rely on them, what good are they? AI teams may end up wasting significant time trying to correct a failure which isn’t occurring. Worse, they may deploy to production because evaluations failed to detect severe failures.
This brings me to the third dimension, SME-evaluator agreement (or alignment).
GenAI evaluation with SME-evaluator agreement
AI/ML engineers develop specialized evaluators with ground truth. Let’s consider an LLM-as-a-Judge (LLMAJ) which checks to see if an AI assistant has repeated itself. I find this to be a great example because there’s nothing worse than calling customer support and being passed around, with each representative asking the same questions as the previous one. If customers don’t like this when interacting with humans, then they’re not going to like it when interacting with an AI assistant either.
Let’s continue by assuming that SMEs have provided ground truth for 100 conversations.
First, an AI/ML engineer iterates on the judge prompt until the LLM’s judgments match the ground truth provided by SMEs. They test the LLMAJ evaluator against a small set of conversations, perhaps 10-20, both accepted and rejected, until its judgment is consistently correct.
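To illustrate, here is a minimal sketch of what an LLMAJ prompt for the “No repetition” criterion might look like. The wording, the PASS/FAIL output format, and the `build_judge_prompt` helper are illustrative assumptions, not the actual evaluator behind these results; in practice this is the text the AI/ML engineer keeps iterating on.

```python
# Illustrative LLM-as-a-Judge prompt template for the "No repetition" criterion.
JUDGE_PROMPT_TEMPLATE = """You are evaluating a customer support conversation.

Criterion: the assistant must not repeat questions or information the customer
has already provided earlier in the conversation.

Conversation:
{conversation}

Answer with exactly one word, PASS or FAIL, followed by a one-sentence reason."""

def build_judge_prompt(conversation: str) -> str:
    """Fill the template with a conversation transcript before sending it to the LLM."""
    return JUDGE_PROMPT_TEMPLATE.format(conversation=conversation)
```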
Next, the evaluator is run against a large sample of the evaluation dataset, including all 100 conversations with ground truth. As part of this process, we need to calculate a metric for the agreement between SMEs and the LLMAJ evaluator – the percentage of conversations where an SME and the LLMAJ evaluator came to the same conclusion.
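The agreement metric itself is straightforward once SME and evaluator judgments are available for the same conversations. The `sme_evaluator_agreement` function and its input format below are a minimal sketch for illustration.

```python
def sme_evaluator_agreement(ground_truth, evaluator_judgments):
    """Fraction of shared conversations where SME and LLMAJ reached the same conclusion.

    Both arguments map conversation_id -> 'accept' or 'reject'.
    """
    shared = set(ground_truth) & set(evaluator_judgments)
    if not shared:
        return 0.0
    matches = sum(1 for cid in shared if ground_truth[cid] == evaluator_judgments[cid])
    return matches / len(shared)

# Hypothetical usage: 100 SME-labeled conversations with 90 matching judgments -> 0.9
```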
AI assistant evaluation results – round 3a
| | Billing: Pass | Billing: Agree | Stolen card: Pass | Stolen card: Agree | Account status: Pass | Account status: Agree |
|---|---|---|---|---|---|---|
| No repetition | 85% | 95% | 90% | 90% | 95% | 98% |
| No refusal to help | 50% | 90% | 98% | 95% | 100% | 95% |
| Acknowledge request | 80% | 85% | 95% | 92% | 99% | 95% |
This third dimension now provides everything an AI team needs to understand how well its AI assistant is performing, and precisely where improvements must be made. With SME-evaluator agreement at 90% for “No refusal to help” on billing requests, the team can trust the 50% pass rate and be confident that this is where improvements need to be made.
However, what if it wasn’t 90%?
AI assistant evaluation results – round 3b
| | Billing: Pass | Billing: Agree | Stolen card: Pass | Stolen card: Agree | Account status: Pass | Account status: Agree |
|---|---|---|---|---|---|---|
| No repetition | 85% | 95% | 90% | 90% | 95% | 98% |
| No refusal to help | 50% | 40% | 98% | 95% | 100% | 95% |
| Acknowledge request | 80% | 85% | 95% | 92% | 99% | 95% |
Should the AI team focus on improving the quality of billing requests when it comes to helpfulness? I think not.
It’s clear the SMEs do not agree with the LLMAJ evaluator. It’s far more likely that the AI/ML engineer needs to go back and continue iterating on the prompt, though it’s also possible there are errors in the ground truth. Either way, the AI team should focus on collaborating with SMEs to improve SME-evaluator agreement before trying to improve the AI assistant itself.
There’s a third scenario.
AI assistant evaluation results – round 3c
| | Billing: Pass | Billing: Agree | Stolen card: Pass | Stolen card: Agree | Account status: Pass | Account status: Agree |
|---|---|---|---|---|---|---|
| No repetition | 85% | 95% | 90% | 90% | 95% | 98% |
| No refusal to help | 90% | 55% | 98% | 95% | 100% | 95% |
| Acknowledge request | 80% | 85% | 95% | 92% | 99% | 95% |
Without an SME-evaluator agreement metric, the AI team would have been led to believe the AI assistant met the “No refusal to help” criterion and was ready for production. Imagine the consequences if an AI team launched a customer-facing AI assistant which responded poorly to billing requests. I for one would rather not.
However, as before, SME-evaluator agreement at 55% indicates that the LLMAJ evaluator needs further prompt engineering. Afterwards, the results will likely reveal that the AI assistant is in fact struggling to be helpful when responding to billing requests. With this new insight, the AI team can make much-needed improvements before deploying to production – and prevent a negative outcome with potentially severe consequences for the business.
Finally, the AI team should request additional ground truth as evaluation datasets are updated with samples of production conversations. After updating the evaluation dataset and running a full evaluation, they should assign a small sample of accepted and rejected conversations to SMEs for judgement. This allows the AI team to ensure that their specialized evaluators are still correct and can still be relied on.
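A minimal sketch of how such a review sample might be drawn is shown below. The `sample_for_sme_review` function, its record format, and the default sample size are illustrative assumptions; the point is simply to give SMEs a small, balanced mix of accepted and rejected conversations to re-judge.

```python
import random

def sample_for_sme_review(evaluated, n_per_outcome=10, seed=0):
    """Pick a small, balanced sample of accepted and rejected conversations for SMEs.

    evaluated: list of dicts with 'conversation_id' and 'passed' (bool).
    """
    rng = random.Random(seed)
    accepted = [r for r in evaluated if r["passed"]]
    rejected = [r for r in evaluated if not r["passed"]]
    return (
        rng.sample(accepted, min(n_per_outcome, len(accepted)))
        + rng.sample(rejected, min(n_per_outcome, len(rejected)))
    )
```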
GenAI evaluation principles
Over the course of three blog posts, we’ve shared the key principles of successful GenAI evaluation:
- Specialized evaluators for speed, scale and relevance
- Fine-grained metrics for actionable insights
- SME-in-the-loop for validation and trust
In the real world, these principles have helped Snorkel customers achieve up to 7x faster evaluation and nearly 90% agreement between SMEs and automated evaluators – accelerating GenAI development and providing the confidence needed for production deployment.
Specialized evaluators
It isn’t practical to task SMEs with manually evaluating thousands of GenAI responses. It can take weeks if not months – time AI teams could have spent improving the system. The key to faster, scalable evaluation is creating specialized evaluators based on SME-defined acceptance criteria. They automate and scale evaluation without sacrificing accuracy and reliability.
Fine-grained metrics
The value of evaluation isn’t limited to determining whether an AI assistant or copilot should be deployed to production. It’s a critical tool which can provide AI teams with actionable insights, informing them of where and why repeated errors are occurring. You can’t improve a GenAI system if you don’t know exactly what it’s doing wrong.
SME-in-the-loop
We need automation to accelerate and scale evaluation so it doesn’t stall or block GenAI development. However, it must include participation from SMEs. Their domain knowledge is critical. Rather than asking SMEs to spend weeks reviewing reasoning traces and responses, AI teams should engage them more efficiently, soliciting targeted input and feedback which they can then act on and scale.
When it comes to evaluation, with the right approach, you can have your cake and eat it too.