The purpose of GenAI evaluation within an enterprise is to ensure custom AI assistants and copilots respond in a way that meets business requirements and expectations.

Or, perhaps more specifically, that they’re responding the same way an experienced employee would. The challenge is that what makes a response correct is determined by company standards, policies, guidelines and so on. The same judgment expected of experienced employees, especially subject matter experts (SMEs) and/or those in customer-facing roles, must be demonstrated by GenAI assistants and copilots too.

We refer to these expectations as acceptance criteria. They are the characteristics which SMEs consider when determining whether or not a response is acceptable. However, it simply isn’t practical to have SMEs review every single response to see if it meets all of their acceptance criteria.

The problem with enterprise GenAI evaluation is that there is a gaping hole when it comes to applying SME acceptance criteria – that is, evaluating GenAI applications within their business context.

There are plenty of standard LLM benchmarks – MMLU PRO, IFEval, MBPP EvalPlus, MATH, GPQA Diamond and many others. However, in practice, they’re simply used by providers to prove that their latest model is better than competing ones at core capabilities such as coding and reasoning. They’re not particularly helpful for enterprise AI teams when it comes to evaluating their GenAI applications.

It should come as no surprise then to see the emergence of GenAI evaluation platforms for enterprises. They’re more helpful than LLM benchmarks, but they place a strong emphasis on the use of out-of-the-box (OOTB) evaluators – and it’s not enough. Yes, they’re helpful in identifying structural errors such as inefficient retrieval. However, these general evaluators can’t help enterprises understand whether or not their GenAI applications are responding as they should, and if they meet the requirements for production deployment.

Production requires confidence, and confidence requires specialized GenAI evaluators.

Specialized evaluators

Simply put, an evaluator is a function which checks whether a response (along with the prompt and context) meets a specific acceptance criterion. In this way, an evaluator acts as a proxy for SMEs – allowing AI teams to run automatic, comprehensive and trustworthy evaluations at scale (vs. asking SMEs to review every response one at a time).
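
To make this concrete, here’s a minimal sketch (in Python, with purely illustrative names and a toy rule – not any particular platform’s API) of what an evaluator boils down to:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalInput:
    prompt: str    # what the user asked
    context: str   # retrieved documents or other grounding material
    response: str  # what the assistant answered

@dataclass
class Verdict:
    passed: bool    # does the response meet this acceptance criterion?
    rationale: str  # why the evaluator reached that judgment

# An evaluator is just a function from (prompt, context, response) to a verdict.
Evaluator = Callable[[EvalInput], Verdict]

def answers_rather_than_deflects(item: EvalInput) -> Verdict:
    """Toy rule-based evaluator: fail if the assistant replies with a question."""
    if item.response.strip().endswith("?"):
        return Verdict(False, "Response ends with a question instead of an answer.")
    return Verdict(True, "Response provides an answer rather than deflecting.")
```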

There are OOTB evaluators included in every GenAI evaluation platform, often implemented with LLM-as-a-Judge (LLMAJ). However, while these evaluators can help identify structural errors such as poor chunk relevance, they’re not a proxy for SMEs – and can’t determine whether or not a response is acceptable to the business.

The most critical acceptance criteria are based on the specific domain, business or use case. Here are a few examples which can’t be addressed by OOTB evaluators:

  • [Domain] Adhering to industry regulations
  • [Business] Consistency with brand guidelines
  • [Use case] Following established best practices

What these examples actually enforce depends on the context. For example, when building an AI assistant to help with customer service, there may be best practices such as not repeating questions and finishing conversations by asking if there is anything else you can help with. There may be brand guidelines which prohibit certain language, or industry regulations which constrain what information can be requested or shared.
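
To illustrate, here’s a rough sketch of what a specialized LLMAJ evaluator for those customer service best practices might look like. The call_llm helper and the JSON verdict format are assumptions standing in for whichever model API and output convention a team already uses:

```python
import json

def call_llm(judge_prompt: str) -> str:
    """Assumed helper: wraps whatever model API the team uses and returns raw text."""
    raise NotImplementedError("Wire this to your model provider.")

JUDGE_TEMPLATE = """You are reviewing a customer-service assistant's reply.

Acceptance criteria (from our SMEs):
1. Do not repeat questions the customer has already answered.
2. Close by asking whether there is anything else you can help with.
3. Avoid language prohibited by brand guidelines: {banned_terms}.

Customer message:
{prompt}

Assistant reply:
{response}

Answer with JSON only: {{"passed": true or false, "rationale": "<one sentence>"}}"""

def judge_customer_service(prompt: str, response: str, banned_terms: list[str]) -> dict:
    """Specialized LLMAJ evaluator for the customer-service criteria above."""
    raw = call_llm(JUDGE_TEMPLATE.format(
        banned_terms=", ".join(banned_terms),
        prompt=prompt,
        response=response,
    ))
    return json.loads(raw)  # e.g. {"passed": false, "rationale": "..."}
```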

Regardless, OOTB evaluators which evaluate structural correctness (e.g., instruction and context adherence) are not enough to provide enterprises with the confidence needed to move forward, let alone identify where an AI assistant or copilot is failing to meet business requirements and expectations.

This is why specialized evaluators are required too.

However, creating specialized evaluators isn’t as simple as writing an LLM prompt, and it can’t be done without the help of SMEs.

Prompt engineering

Creating specialized LLMAJ evaluators is, in part, a prompt engineering exercise. It will almost certainly require multiple iterations before the prompt consistently induces the correct judgment from an LLM – and input from the SMEs whose judgment it’s trying to replicate will be necessary. The only way to know the evaluator is judging correctly is to test it against reference prompt, context and response triplets with known-good verdicts. Validation requires ground truth.
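
One way to keep that iteration honest – sketched below, reusing the hypothetical judge from the previous example – is to maintain a small set of SME-approved reference triplets and rerun them after every prompt change, much like unit tests for the evaluator:

```python
# Reference triplets: prompt, context, response, plus the verdict an SME expects.
# A handful of these, reviewed by SMEs, serve as ground truth for the judge prompt.
REFERENCE_SET = [
    {
        "prompt": "My package never arrived.",
        "context": "Order #1234, shipped 2024-05-01, marked delivered.",
        "response": "I'm sorry about that. I've opened a claim. Is there anything else I can help with?",
        "expected_passed": True,
    },
    {
        "prompt": "My package never arrived.",
        "context": "Order #1234, shipped 2024-05-01, marked delivered.",
        "response": "Can you tell me your order number again?",
        "expected_passed": False,
    },
]

def validate_judge(judge, reference_set) -> float:
    """Rerun the LLMAJ evaluator over the reference set and report the match rate."""
    matches = 0
    for item in reference_set:
        verdict = judge(item["prompt"], item["response"], banned_terms=[])
        if verdict["passed"] == item["expected_passed"]:
            matches += 1
        else:
            print(f"Mismatch on {item['prompt']!r}: {verdict['rationale']}")
    return matches / len(reference_set)
```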

It’s critical that LLMAJ evaluators align with SMEs. Otherwise, what good are they?

SME alignment

This may be the most critical aspect of evaluation. Because specialized evaluators are created to act as a proxy for SMEs, part of the validation process must include comparing LLMAJ judgments with SME judgments.

If an LLMAJ evaluator produces judgments which match a small amount of ground truth, that’s a positive signal. However, in practice, especially during the initial stages of development, enterprises often find that LLMs and SMEs are far from aligned in terms of judgment – and it’s important to understand where they’re misaligned and the degree to which they are.

The easiest way to do this is to run an evaluation which generates results from the LLMAJ evaluator, then assign a high-value subset of the evaluation data to SMEs for expert judgment – including their rationale. Next, compare their judgments with those of the LLMAJ. If alignment is around 50%, that’s not good: the LLMAJ evaluator needs improvement, and that will require input from SMEs. If it’s around 90%, the LLMAJ evaluator can be used as a reliable proxy for SMEs.
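
Here’s a minimal sketch of that comparison, assuming each record carries both the LLMAJ verdict and the SME’s verdict and rationale (the field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    item_id: str
    llmaj_passed: bool
    sme_passed: bool
    sme_rationale: str

def alignment_report(judgments: list[Judgment]) -> float:
    """Compare LLMAJ verdicts with SME verdicts and surface the disagreements."""
    agreed = [j for j in judgments if j.llmaj_passed == j.sme_passed]
    disagreed = [j for j in judgments if j.llmaj_passed != j.sme_passed]

    alignment = len(agreed) / len(judgments)
    print(f"Alignment: {alignment:.0%} ({len(agreed)}/{len(judgments)})")

    # Misaligned cases, with SME rationale, are the starting point for the next
    # round of prompt iteration with the SMEs.
    for j in disagreed:
        print(f"- {j.item_id}: LLMAJ={'pass' if j.llmaj_passed else 'fail'}, "
              f"SME={'pass' if j.sme_passed else 'fail'} – {j.sme_rationale}")
    return alignment
```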

Speed, scale and confidence

At the end of the day, the path to production for enterprise AI runs through specialized evaluation.

By creating automated proxies for SMEs, AI teams can dramatically accelerate the evaluation process without sacrificing quality. It also scales AI adoption by enabling teams not only to run trustworthy evaluations frequently, but to standardize on a framework which can be used to support evaluation of all GenAI applications – whether assistants or copilots, employee-facing or customer-facing.

Finally, specialized evaluation provides the confidence needed for production deployment. All too often, when enterprises find themselves in the news for less than positive reasons regarding AI, it’s because GenAI applications were not properly evaluated – and the risk was neither recognized nor mitigated.

Specialization is but one aspect of enterprise GenAI evaluation. Stay tuned as we’ll discuss another critical aspect in our next blog – insight.