Foundation models, large language models, and generative AI have exploded in importance in recent years. Concurrently, researchers in academia and here at Snorkel AI increasingly understand that data scientists must apply rigorous evaluation methods to make these powerful tools valuable in any setting. That statement rings especially true in the enterprise.

I had this in mind when I had the pleasure of inviting Yifan Mai to speak with our engineers and researchers. Mai serves as the lead maintainer of the Holistic Evaluation of Language Models (HELM) project at Stanford’s Center for Research on Foundation Models (CRFM), and we were excited to hear his insights from the cutting edge of this discipline.

Snorkel aims to employ more and better evaluation metrics and embed evaluation tools into our Snorkel Flow AI data development platform this year, and we hoped Mai’s insights might guide our work. Mai generously visited us for more than an hour, giving a raw and unfiltered look into the difficulty and promise of evaluating large language models (LLMs).


The importance of model evaluation

In his presentation, Mai said that evaluating AI models serves as a compass, directing us toward a better understanding of what these models can achieve and where they might fall short.

Here’s his summary of why model evaluation is so crucial:

  • Understanding capabilities: Evaluations provide an insight into the capabilities of AI models. Testing these models across a range of scenarios identifies their strengths and weaknesses.
  • Identifying risks: Evaluations also play a critical role in risk assessment. AI models, particularly those used in enterprise settings, might be tasked with handling sensitive data, such as personally identifiable information or intellectual property. Through rigorous evaluation, we can ensure that these models handle such data responsibly.
  • Ensuring alignment with objectives and ethics: Foundation model evaluations, Mai said, encode values into measurable numbers. This helps ensure that organizations choose and develop models that align with technical objectives as well as ethical standards.
  • Refining the user experience: Even seemingly minor details, such as the chattiness of a model’s responses, can impact user experience. Evaluations can assess these aspects at scale and allow data scientists to fine-tune model outputs.
  • Guiding policy and legislation: The White House recently encouraged more scrutiny of LLMs. That’s not possible without scalable evaluation—which can also help inform industry standards.

In essence, Mai said, model evaluation is a multi-faceted process that goes beyond mere performance metrics. It’s about understanding the AI model in its entirety, from its technical capabilities to its alignment with ethical standards, and its potential impact on users and society at large.

HELM’s guiding principles

Mai said three key principles guide CRFM’s work on HELM:

  1. Broad Coverage: CRFM aimed to include a wide range of previous benchmark papers in NLP literature and build upon them. This principle encourages the consideration of a wide array of data sources and perspectives when building and training models, and recognizes where the evaluation suite is incomplete.
  2. Multi-Metric Measurement: Academic evaluation tends to focus on accuracy as a primary metric. CRFM’s HELM project assesses AI models based on multiple aspects such as alignment, quality, aesthetics, reasoning, knowledge, bias, toxicity, and more.
  3. Standardization: CRFM evaluates all models under the same setup to ensure comparability.
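The multi-metric principle can be sketched in a few lines: score every model output on several axes at once instead of accuracy alone. The metrics below are toy stand-ins for illustration, not HELM's actual implementations.

```python
# Illustrative multi-metric evaluation: each output is scored on several
# axes at once. These metrics are simplified stand-ins, not HELM's own.

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the prediction matches the reference (case-insensitive)."""
    return float(prediction.strip().lower() == reference.strip().lower())

def contains_blocklisted(prediction: str, blocklist: set) -> float:
    """Toy toxicity proxy: 1.0 if any blocklisted term appears."""
    return float(bool(set(prediction.lower().split()) & blocklist))

def verbosity(prediction: str) -> float:
    """Response length in words: a crude measure of chattiness."""
    return float(len(prediction.split()))

def evaluate(examples, blocklist):
    """Average each metric over (prediction, reference) pairs."""
    n = len(examples)
    return {
        "exact_match": sum(exact_match(p, r) for p, r in examples) / n,
        "toxicity": sum(contains_blocklisted(p, blocklist) for p, _ in examples) / n,
        "verbosity": sum(verbosity(p) for p, _ in examples) / n,
    }

examples = [("Paris", "paris"), ("I do not know", "Berlin")]
scores = evaluate(examples, blocklist={"damn"})
print(scores)  # {'exact_match': 0.5, 'toxicity': 0.0, 'verbosity': 2.5}
```

A real suite would swap in learned toxicity classifiers and task-specific quality metrics, but the shape is the same: one evaluation pass, many numbers out.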

LLM evaluation in the enterprise

At Snorkel, we think that CRFM’s guiding principles can apply in the enterprise setting as well—with a twist.

CRFM’s HELM tools serve as a great starting point for enterprises to decide which open source model to use in their own work. The HELM leaderboard segments the group’s wide array of metrics into sensible groupings, such as bias, fairness, and summarization metrics. Enterprises can pick what’s most important to them, rank the models, and choose the one that best fits their purpose.
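The "pick what's most important, rank, and choose" workflow amounts to a weighted sort. Here is a minimal sketch; the model names, scores, and weights are made up for illustration and are not real HELM leaderboard values.

```python
# Hypothetical per-model scores on a few evaluation axes (made-up numbers,
# not real HELM leaderboard results).
leaderboard = {
    "model-a": {"accuracy": 0.81, "fairness": 0.70, "summarization": 0.65},
    "model-b": {"accuracy": 0.77, "fairness": 0.88, "summarization": 0.71},
    "model-c": {"accuracy": 0.84, "fairness": 0.60, "summarization": 0.58},
}

# Hypothetical enterprise priorities: this team weights fairness most heavily.
weights = {"accuracy": 0.3, "fairness": 0.5, "summarization": 0.2}

def weighted_score(metrics: dict) -> float:
    """Combine a model's per-axis scores using the priority weights."""
    return sum(weights[axis] * value for axis, value in metrics.items())

# Rank models from best to worst under these priorities.
ranked = sorted(leaderboard, key=lambda m: weighted_score(leaderboard[m]), reverse=True)
print(ranked)  # ['model-b', 'model-a', 'model-c']
```

Changing the weights to favor accuracy over fairness would promote a different model, which is exactly the point: the "best" model depends on the axes an enterprise cares about.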

But evaluation shouldn’t stop there. To build robust, customized language models, enterprise data science teams must adapt their chosen base model using LoRA or another fine-tuning approach. We believe that enterprises get the best performance boost from their LLMs when they adapt them in an iterative loop that includes developing and curating their training data and evaluating the model’s performance on customized and scalable metrics.

HELM’s open source tools can play a role in this iteration, but we think enterprises will want to develop bespoke evaluation metrics for their specific purposes.

At Snorkel, we are currently building tools to allow Snorkel Flow users to design custom performance metrics and apply them to specific slices of data. This will help our users ensure that their bespoke models perform well on multiple variations of their target tasks.
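One way to picture slice-based evaluation: tag each example with the slice it belongs to, then report a custom metric per slice so a regression on one variation of the task stays visible instead of being averaged away. This is a generic sketch with hypothetical data and helper names, not the Snorkel Flow API.

```python
from collections import defaultdict

def accuracy(pairs):
    """Fraction of (prediction, reference) pairs that match exactly."""
    return sum(p == r for p, r in pairs) / len(pairs)

def per_slice_metric(examples, metric):
    """Group (slice_name, prediction, reference) triples by slice,
    then apply the metric to each slice separately."""
    slices = defaultdict(list)
    for slice_name, pred, ref in examples:
        slices[slice_name].append((pred, ref))
    return {name: metric(pairs) for name, pairs in slices.items()}

# Hypothetical data: the same task evaluated on two slices of documents.
examples = [
    ("short_docs", "approve", "approve"),
    ("short_docs", "deny", "approve"),
    ("long_docs", "deny", "deny"),
    ("long_docs", "deny", "deny"),
]
slice_scores = per_slice_metric(examples, accuracy)
print(slice_scores)  # {'short_docs': 0.5, 'long_docs': 1.0}
```

The aggregate accuracy here is 75%, which looks healthy; the per-slice view reveals the model fails half the time on short documents. Any custom metric with the same `pairs -> float` shape can be dropped in for `accuracy`.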

HELM and the future of enterprise LLM evaluation

CRFM’s work and insights will help enterprises and researchers navigate the rapidly evolving AI landscape. Their research and principles guide the development of AI models, while their open-source evaluation framework empowers companies to conduct their own evaluations and help them select the right model according to multiple axes.

While there are still open questions and challenges to address, we are excited about the future of foundation model evaluation and development. We look forward to applying the learnings from CRFM in our work at Snorkel AI, understanding the capabilities and risks of AI models, and contributing to the development of safe and effective AI applications.

Learn how to get more from foundation models without fine-tuning!

At noon Pacific on April 5, PhD student Dyah Adila from the University of Wisconsin-Madison will discuss how you can achieve higher model performance from foundation models such as CLIP without spending days, weeks, or months fine-tuning them.

Learn more (and register) here.