Why enterprise GenAI evaluation requires fine-grained metrics to be insightful
GenAI evaluation is critical for enterprises deploying AI assistants and copilots. It must tell AI teams whether these applications adhere to SME-defined acceptance criteria; in other words, whether the AI assistant or copilot is responding the way the business requires. And when it isn't, evaluation must pinpoint precisely where and why failures are occurring.
Without actionable insights, AI teams are more or less asked to throw spaghetti at the wall and see what sticks. It’s neither practical nor effective, and it is most definitely frustrating.
However, when evaluations provide deep insights into the behavior of GenAI applications, AI/ML engineers can quickly identify what improvements are needed and correctly determine the best way to implement them – resulting in a much faster, and far more efficient, GenAI development process.
Evaluation dimensions
There are three dimensions to enterprise GenAI evaluation.
The first is acceptance criteria. I discussed this in my previous blog on the need for specialization when it comes to GenAI evaluation. Let’s consider an AI assistant for customer support. The same policies, guidelines and best practices followed by customer support representatives (CSRs) must be followed by the AI assistant – don’t repeat questions, don’t refuse to help, don’t recommend competitors, acknowledge the customer’s request and so on. The big takeaway here is that these acceptance criteria reflect business, domain and use-case requirements, and thus require specialized evaluators.
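To make this concrete, here is a minimal sketch of how SME-defined acceptance criteria might be expressed as named evaluator checks. The criterion names mirror the CSR policies above; the data structures and the lambda heuristics are hypothetical placeholders, not a specific evaluation framework – in practice each check would be a specialized evaluator (a rule, a classifier or an LLM judge) tuned to the business, domain and use case.

```python
# Hypothetical sketch: acceptance criteria as named evaluator checks.
# The criteria come from the CSR policies above; the lambda heuristics are
# placeholders for specialized evaluators (rules, classifiers, LLM judges).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str
    check: Callable[[str, list[str]], bool]  # (response, prior responses) -> pass/fail

ACCEPTANCE_CRITERIA = [
    Criterion("No repetition",
              lambda resp, history: resp not in history),
    Criterion("No refusal to help",
              lambda resp, _: "I can't help with that" not in resp),
    Criterion("No competitor recommendations",
              lambda resp, _: "competitor" not in resp.lower()),
    Criterion("Acknowledge request",
              lambda resp, _: resp.lower().startswith(("thanks", "i understand", "happy to help"))),
]

def evaluate_response(response: str, history: list[str]) -> dict[str, bool]:
    """Return a pass/fail verdict per acceptance criterion for one response."""
    return {c.name: c.check(response, history) for c in ACCEPTANCE_CRITERIA}
```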
The second is prompt categories. I’m going to assume we’ve all called customer support before – press 1 to make a payment, press 2 to report a lost or stolen card, press 3 to check the status of an order… press 0 to speak to an agent. We know requests handled by CSRs fall into one of N categories. And we can assume the business monitors the performance of CSRs within these categories, and should performance fall below a specific threshold, additional training may be required. The same is true for enterprise AI assistants and copilots.
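The second dimension can be sketched just as simply. The categories and keyword rules below are illustrative only; a production system might instead use a trained intent classifier or an LLM-based router.

```python
# Hypothetical sketch: assigning each prompt to one of N support categories.
# Categories and keywords are illustrative; real routing might be a trained
# intent classifier or an LLM call.
PROMPT_CATEGORIES = {
    "billing":        ["payment", "invoice", "charge", "refund"],
    "stolen card":    ["stolen", "lost card", "fraud"],
    "account status": ["order status", "account", "balance"],
}

def categorize_prompt(prompt: str) -> str:
    """Return the first category whose keywords appear in the prompt, else 'other'."""
    text = prompt.lower()
    for category, keywords in PROMPT_CATEGORIES.items():
        if any(kw in text for kw in keywords):
            return category
    return "other"
```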
I’ll cover the third in my next evaluation blog.
Visibility into these evaluation dimensions is critical for AI teams.
Overall evaluation metrics
Take the worst-case scenario: simply being provided an overall pass/fail metric after evaluating our new AI assistant for customer support.
AI assistant evaluation results – round 1
| | Pass | Fail |
| --- | --- | --- |
| Acceptable | 80% | 20% |
In this case, it turns out that 80% of our AI assistant’s responses have been judged acceptable. That’s pretty good, right? In fact, it may be good enough for production.
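For reference, an overall metric like this boils down to a single aggregation over per-response verdicts, along the lines of the sketch below (the `results` structure is a hypothetical placeholder):

```python
# Hypothetical sketch: collapsing per-response verdicts into one overall pass rate.
# 'results' is assumed to be a list of dicts like {"category": ..., "passed": True/False}.
def overall_pass_rate(results: list[dict]) -> float:
    passed = sum(1 for r in results if r["passed"])
    return passed / len(results)

# e.g. overall_pass_rate(results) -> 0.80, with no hint of where the 20% came from
```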
However, there are a few important questions to ask.
- What went wrong in the 20% that failed?
- What’s the negative impact of these failures?
- What improvements are needed to prevent them?
What if I said the 20% that failed were related to billing or stolen cards?
Would we still consider an 80% pass rate good enough for production?
But even if we knew that most of the failures were related to billing, we’d still lack the actionable insight needed to improve our AI assistant. We could review each failure one at a time and try to identify a pattern, but that approach is terribly inefficient. Even worse is sampling failures and trying to make an educated guess.
Fine-grained evaluation metrics
We can solve this problem by looking at the intersection between acceptance criteria and prompt categories.
AI assistant evaluation results – round 2
| Acceptance criteria | Billing | Stolen card | Account status |
| --- | --- | --- | --- |
| No repetition | 85% | 90% | 95% |
| No refusal to help | 50% | 98% | 100% |
| Acknowledge request | 80% | 95% | 99% |
This is actionable insight. An 80% pass rate provides AI teams with very little to act on, something akin to “just make it better”. Now, AI teams can focus on the real issue – the AI assistant is struggling to provide customers with helpful responses to billing inquiries. Rather than undertaking a broad improvement that demands significant time and effort, and praying it addresses the critical failures, they can implement a faster, targeted solution.
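A table like the one above can be produced directly from per-response verdicts once each record carries both a prompt category and a per-criterion result. Here is a minimal sketch using pandas; the record structure is a hypothetical placeholder.

```python
# Hypothetical sketch: pass rates at the intersection of acceptance criteria
# and prompt categories, computed from flat evaluation records with pandas.
import pandas as pd

# One record per (response, criterion) verdict, tagged with the prompt's category.
records = [
    {"category": "billing",     "criterion": "No refusal to help", "passed": False},
    {"category": "billing",     "criterion": "No repetition",      "passed": True},
    {"category": "stolen card", "criterion": "No repetition",      "passed": True},
    # ... one row per response per criterion
]

df = pd.DataFrame(records)
df["passed"] = df["passed"].astype(float)  # cast so the mean is a pass rate

fine_grained = df.pivot_table(
    index="criterion",    # acceptance criteria as rows
    columns="category",   # prompt categories as columns
    values="passed",
    aggfunc="mean",       # fraction of passing responses per cell
)
print((fine_grained * 100).round(1))  # pass rates (%) per criterion x category
```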
Of course, we can go one step further by adding additional acceptance criteria.
AI assistant evaluation results – round 3
| Acceptance criteria | Billing | Stolen card | Account status |
| --- | --- | --- | --- |
| No repetition | 85% | 65% | 95% |
| No refusal to help | 50% | 98% | 100% |
| Sufficient answer | 80% | 95% | 70% |
| Prompt adherence | 90% | 60% | 95% |
| Chunk relevance | 40% | 95% | 90% |
| Answer structure | 85% | 90% | 55% |
Generally speaking, there are three ways to improve the accuracy and quality of GenAI responses: better prompts (instructions, examples, etc.), better retrieval (chunking, indexing and search) and better generation (LLM fine-tuning and alignment). With the acceptance criteria and prompt categories above, we not only know what types of requests the AI assistant is struggling with, but likely the reason why, too.
- If billing requests suffer from poor chunk relevance, then fine-tune the embedding model with good and bad examples of billing question-chunk pairs (or triplets), as sketched after this list.
- If stolen card requests suffer from repetition, then refine the prompt template with clearer instructions and examples.
- If account status questions suffer from insufficient answers, then fine-tune the LLM with training data consisting of account status questions with properly structured responses.
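As an example of the first bullet, here is a minimal sketch of fine-tuning an embedding model on question-chunk triplets mined from failed billing evaluations. It assumes the sentence-transformers library; the model name, example data and training settings are placeholders, not a recommended recipe.

```python
# Hypothetical sketch: fine-tuning an embedding model with triplets of
# (billing question, relevant chunk, irrelevant chunk) mined from failed evaluations.
# Assumes sentence-transformers; model name, data and settings are placeholders.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder base model

# Each triplet: anchor question, a chunk that answers it, a chunk retrieved by mistake.
triplets = [
    InputExample(texts=[
        "Why was I charged twice this month?",            # anchor (billing question)
        "Duplicate charges are reversed within 3 days.",   # positive chunk
        "To report a stolen card, call the fraud line.",   # negative chunk
    ]),
    # ... more triplets mined from billing failures
]

train_dataloader = DataLoader(triplets, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,           # placeholder settings
    warmup_steps=10,
)
model.save("billing-tuned-embedder")
```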
Better evaluation
At the end of the day, AI teams need evaluations to answer important questions:
- What’s the risk of deployment?
- What types of failures are occurring, and why?
- What can we do to prevent these failures from occurring again?
- What is the impact of previous changes?
It isn’t practical for AI teams to improve GenAI applications by applying changes that address one failure at a time, nor by randomly sampling failures or trying different things until the results look better. AI teams deserve evaluations that provide actionable insights so they can not only deliver high-quality GenAI applications, but do so faster and more efficiently.