CRFM’s HELM and enterprise LLM evaluation beyond accuracy

April 3, 2024

Foundation models, large language models, and generative AI have exploded in importance in recent years. Concurrently, researchers in academia and here at Snorkel AI increasingly understand that data scientists must apply rigorous evaluation methods to make these powerful tools valuable in any setting. That statement rings especially true in the enterprise.

I had this in mind when I had the pleasure of inviting Yifan Mai to speak with our engineers and researchers. Mai serves as the lead maintainer of the Holistic Evaluation of Language Models (HELM) project at Stanford’s Center for Research on Foundation Models (CRFM), and we were excited to hear his insights from the cutting edge of this discipline.

Snorkel aims to employ more and better evaluation metrics and embed evaluation tools into our Snorkel Flow AI data development platform this year, and we hoped Mai’s insights might guide our work. Mai generously visited us for more than an hour, giving a raw and unfiltered look into the difficulty and promise of evaluating large language models (LLMs).

The importance of model evaluation

In his presentation, Mai said that evaluating AI models serves as a compass, directing us toward a better understanding of what these models can achieve and where they might fall short.

Here’s his summary of why model evaluation is so crucial:

  • Understanding capabilities: Evaluations provide insight into what AI models can do. Testing these models across a range of scenarios identifies their strengths and weaknesses.
  • Identifying risks: Evaluations also play a critical role in risk assessment. AI models, particularly those used in enterprise settings, might be tasked with handling sensitive data, such as personally identifiable information or intellectual property. Through rigorous evaluation, we can ensure that these models handle such data responsibly.
  • Ensuring alignment with objectives and ethics: Foundation model evaluations, Mai said, encode values into measurable numbers. This helps ensure that organizations choose and develop models that align with technical objectives as well as ethical standards.
  • Refining the user experience: Even seemingly minor details, such as the chattiness of a model’s responses, can impact user experience. Evaluations can assess these aspects at scale and allow data scientists to fine-tune model outputs.
  • Guiding policy and legislation: The White House recently encouraged more scrutiny of LLMs. That’s not possible without scalable evaluation—which can also help inform industry standards.

In essence, Mai said, model evaluation is a multi-faceted process that goes beyond mere performance metrics. It’s about understanding the AI model in its entirety, from its technical capabilities to its alignment with ethical standards, and its potential impact on users and society at large.

HELM’s guiding principles

Mai said three key principles guide CRFM’s work on HELM:

  1. Broad Coverage: CRFM aimed to include a wide range of previous benchmark papers in NLP literature and build upon them. This principle encourages the consideration of a wide array of data sources and perspectives when building and training models, and recognizes where the evaluation suite is incomplete.
  2. Multi-Metric Measurement: Academic evaluation tends to focus on accuracy as a primary metric. CRFM’s HELM project assesses AI models based on multiple aspects such as alignment, quality, aesthetics, reasoning, knowledge, bias, toxicity, and more.
  3. Standardization: CRFM evaluates all models under the same setup to ensure comparability.
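
To make the second and third principles concrete, here is a minimal sketch of scoring several models on more than one metric under a single shared setup. It does not use HELM's code or API. The type aliases, the `exact_match` and `brevity` metrics, and the two toy models are all hypothetical stand-ins we invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration; these are not part of HELM's API.
Model = Callable[[str], str]          # prompt -> response
Metric = Callable[[str, str], float]  # (response, reference) -> score in [0, 1]
Example = Tuple[str, str]             # (prompt, reference answer)


def exact_match(response: str, reference: str) -> float:
    """Accuracy-style metric: 1.0 only if the normalized strings are identical."""
    return float(response.strip().lower() == reference.strip().lower())


def brevity(response: str, reference: str) -> float:
    """Toy 'chattiness' metric: 1.0 when the response is no longer than the
    reference (in words), shrinking toward 0 as the response grows."""
    response_len = max(len(response.split()), 1)
    reference_len = max(len(reference.split()), 1)
    return min(1.0, reference_len / response_len)


def evaluate(
    models: Dict[str, Model],
    metrics: Dict[str, Metric],
    examples: List[Example],
) -> Dict[str, Dict[str, float]]:
    """Score every model on every metric over the same examples."""
    results: Dict[str, Dict[str, float]] = {}
    for model_name, model in models.items():
        # Every model sees identical prompts, so the scores are comparable.
        responses = [model(prompt) for prompt, _ in examples]
        results[model_name] = {
            metric_name: sum(
                metric(response, reference)
                for response, (_, reference) in zip(responses, examples)
            ) / len(examples)
            for metric_name, metric in metrics.items()
        }
    return results


# Toy stand-ins for real models, used only to make the sketch runnable.
ANSWERS = {"Capital of France?": "Paris", "Capital of Japan?": "Tokyo"}


def terse_model(prompt: str) -> str:
    return ANSWERS.get(prompt, "")


def chatty_model(prompt: str) -> str:
    return f"The answer is certainly {ANSWERS.get(prompt, 'unknown')}"


examples = list(ANSWERS.items())
results = evaluate(
    {"terse_model": terse_model, "chatty_model": chatty_model},
    {"exact_match": exact_match, "brevity": brevity},
    examples,
)
for name, scores in results.items():
    print(name, scores)
```

Because both toy models run on the same prompts and the same metric functions, any difference in the table comes from the models rather than the setup. Note that the chatty model contains the right answer yet scores zero on exact match, which is one small illustration of why a single accuracy number can mislead.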

LLM evaluation in the enterprise

At Snorkel, we think that CRFM’s guiding principles can apply in the enterprise setting as well—with a twist.

CRFM’s HELM tools serve as a great starting point for enterprises to decide which open source model to use in their own work. The HELM leaderboard segments the group’s wide array of metrics into sensible groupings, such as bias, fairness, and summarization metrics. Enterprises can pick what’s most important to them, rank the models, and choose the one that best fits their purpose.
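
As a rough sketch of that "pick what matters, then rank" step, the snippet below combines per-metric scores with priority weights. Everything in it is an assumption for illustration: the `rank_models` helper is hypothetical, the model names and numbers are made-up placeholders rather than HELM leaderboard results, and it assumes every score has already been normalized to the range 0 to 1 with higher meaning better.

```python
from typing import Dict, List, Tuple


def rank_models(
    scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank models by the weighted average of the metrics named in `weights`.

    Assumes each score is normalized to [0, 1] and that higher is better.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    ranked = []
    for model_name, metric_scores in scores.items():
        missing = set(weights) - set(metric_scores)
        if missing:
            raise KeyError(f"{model_name} has no score for: {sorted(missing)}")
        combined = sum(
            weight * metric_scores[metric] for metric, weight in weights.items()
        ) / total_weight
        ranked.append((model_name, combined))

    # Best combined score first.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Placeholder numbers invented for this example, not real benchmark results.
placeholder_scores = {
    "model_a": {"summarization": 0.8, "fairness": 0.6, "bias": 0.5},
    "model_b": {"summarization": 0.6, "fairness": 0.9, "bias": 0.8},
}

# A team that cares twice as much about fairness as about summarization.
fairness_first = rank_models(
    placeholder_scores, {"fairness": 2.0, "summarization": 1.0}
)
# A team that only cares about summarization quality.
summary_only = rank_models(placeholder_scores, {"summarization": 1.0})

print(fairness_first)
print(summary_only)
```

With these placeholder inputs the two weightings put different models on top, which is the point: the "best" model depends on which axes an organization decides to prioritize.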

But LLM evaluation shouldn’t stop there. To customize robust language models, enterprise data science teams must adapt their chosen base model using LoRA (low-rank adaptation) or some other kind of fine-tuning. We believe that enterprises get the best performance boost from their LLMs when they adapt them in an iterative loop that includes developing and curating their training data and evaluating the model’s performance on customized and scalable metrics.

HELM’s open source tools can play a role in this iteration, but we think enterprises will want to develop bespoke evaluation metrics for their specific purposes.

At Snorkel, we are currently building tools to allow Snorkel Flow users to design custom performance metrics and apply them to specific slices of data. This will help our users ensure that their bespoke models perform well on multiple variations of their target tasks.
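
The idea of applying a custom metric to specific slices of data can be sketched in plain Python. This is not Snorkel Flow's API. The record fields (`text`, `prediction`, `label`), the slice names, and the `evaluate_slices` helper are hypothetical, and the four records are invented for the example.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]  # hypothetical fields: "text", "prediction", "label"


def accuracy(records: List[Record]) -> float:
    """Fraction of records whose prediction matches the label."""
    return sum(r["prediction"] == r["label"] for r in records) / len(records)


def evaluate_slices(
    records: List[Record],
    slices: Dict[str, Callable[[Record], bool]],
    metric: Callable[[List[Record]], float],
) -> Dict[str, Optional[float]]:
    """Apply one metric to the full dataset and to each named slice.

    A slice is a predicate over records. Empty slices report None
    instead of a misleading number.
    """
    report: Dict[str, Optional[float]] = {
        "overall": metric(records) if records else None
    }
    for slice_name, belongs_to_slice in slices.items():
        subset = [r for r in records if belongs_to_slice(r)]
        report[slice_name] = metric(subset) if subset else None
    return report


# Invented records standing in for model outputs on a ticket-routing task.
records = [
    {"text": "Refund my order", "prediction": "billing", "label": "billing"},
    {
        "text": "refund please, charged twice on my card and nobody has answered my emails",
        "prediction": "support",
        "label": "billing",
    },
    {"text": "App crashes on login", "prediction": "support", "label": "support"},
    {"text": "Reset password", "prediction": "support", "label": "support"},
]

# Hypothetical slices a team might care about.
slices = {
    "mentions_refund": lambda r: "refund" in r["text"].lower(),
    "long_input": lambda r: len(r["text"].split()) > 8,
}

report = evaluate_slices(records, slices, accuracy)
print(report)
```

In this toy data the overall number looks acceptable while the long-input slice fails entirely, which is the kind of gap that an aggregate metric hides and a sliced metric exposes.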

HELM and the future of enterprise LLM evaluation

CRFM’s work and insights will help enterprises and researchers navigate the rapidly evolving AI landscape. The group’s research and principles guide the development of AI models, while its open-source evaluation framework lets companies conduct their own evaluations and select the right model along multiple axes.

While there are still open questions and challenges to address, we are excited about the future of foundation model evaluation and development. We look forward to applying the learnings from CRFM in our work at Snorkel AI, understanding the capabilities and risks of AI models, and contributing to the development of safe and effective AI applications.

More Snorkel AI events coming!

Snorkel has more live online events coming. Look at our events page to sign up for research webinars, product overviews, and case studies.

If you're looking for more content immediately, check out our YouTube channel, where we keep recordings of our past webinars and online conferences.

Vivek Krishnamurthy
Applied Research Scientist

Vivek Krishnamurthy is currently conducting research at the intersection of Computer Vision and Natural Language. He is focused on fine-tuning multi-modal FMs with customer data to facilitate a range of downstream tasks, including classification and image retrieval.
