Blog

Ideas, updates, and practical guidance from the Snorkel team.

Research

Closing the Evaluation Gap in Agentic AI

Announcing a $3M commitment to launch Open Benchmarks Grants

Vincent Sunn Chen

February 11, 2026

Evaluating multi-agent systems in enterprise tool use

Interest in multi-agent systems has grown in recent months, particularly in how they can solve tasks too complex for a single agent to accomplish on its own. The topic raises several questions and ideas to consider: Anthropic’s blog post about how they architected a multi-agent deep research system is…

Oct 09, 2025

Bhavishya Pohani


Evaluating coding agent capabilities with Terminal-Bench: Snorkel’s role in building the next generation benchmark

Terminal-Bench, developed through a collaboration between Stanford University and Laude Institute, has quickly become the gold standard benchmark for evaluating AI agent capabilities in a command line environment. This comprehensive evaluation framework measures how effectively AI agents can perform complex, real-world tasks within terminal environments. At Snorkel AI, we’re excited to share that we’re one of the top collaborators contributing…

Sep 30, 2025

Kobie Crawford, Jeong Shin, Tom Walshe


Parsing isn’t neutral: why evaluation choices matter

Behind every AI benchmark is a hidden choice: how to read the model’s answers. That choice—parsing—can quietly tilt results more than the model itself. Parsing is where we take an AI system’s raw response and extract the “answer” we use for scoring. It sounds mechanical, but as our research shows, the choice of parser can dramatically change measured accuracy. In…

Sep 26, 2025

Justin Bauer
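To make the parsing point in this teaser concrete, here is a minimal sketch of how the same set of raw responses can score differently under different answer extractors. The responses, gold answers, and all three parser functions are hypothetical illustrations, not the parsers or data used in the post.

```python
import re

# Hypothetical raw model responses paired with gold answers (illustrative only).
RESPONSES = [
    ("The answer is B.", "B"),
    ("A is tempting, but the correct choice is C", "C"),
    ("Answer: D", "D"),
    ("b", "B"),
]

LETTER = re.compile(r"\b([A-D])\b", flags=re.IGNORECASE)


def strict_parser(text):
    """Accept only responses that are exactly 'Answer: <letter>'."""
    match = re.fullmatch(r"Answer: ([A-D])", text.strip())
    return match.group(1) if match else None


def first_letter_parser(text):
    """Take the first standalone A-D letter, ignoring case."""
    match = LETTER.search(text)
    return match.group(1).upper() if match else None


def last_letter_parser(text):
    """Take the last standalone A-D letter, ignoring case."""
    letters = LETTER.findall(text)
    return letters[-1].upper() if letters else None


def accuracy(parser, responses=RESPONSES):
    """Fraction of responses whose parsed answer equals the gold answer."""
    correct = sum(parser(text) == gold for text, gold in responses)
    return correct / len(responses)


if __name__ == "__main__":
    for parser in (strict_parser, first_letter_parser, last_letter_parser):
        print(f"{parser.__name__}: {accuracy(parser):.2f}")
```

On this toy data the model outputs never change, yet the measured accuracy moves from 0.25 to 0.75 to 1.00 depending only on which extractor is used.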


The science of rubric design

Part 3 of our rubric series explains the science of rubric design. We show why rubrics should be treated like models—structured, measured, and iterated—to maximize objective alignment and inter-rater agreement. Learn how to choose hierarchy and scale points, track agreement (IAA) and LLMAJ alignment, and refine with domain experts, with examples like PaperBench and HealthBench.

Sep 11, 2025

Charles Dickens, Chris Glaze
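As a rough illustration of what tracking agreement can look like in practice, the sketch below computes Cohen's kappa between two sets of rubric labels, for example a human rater and an LLM judge. Cohen's kappa is an assumption chosen here for illustration; the teaser does not say which agreement statistic the post uses, and the labels are made up.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty item set.")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled at random with their own label rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used one identical label throughout.
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail (1/0) judgments on one rubric criterion.
human_labels = [1, 1, 0, 1, 0, 0, 1, 1]
judge_labels = [1, 0, 0, 1, 0, 1, 1, 1]

if __name__ == "__main__":
    print(f"kappa = {cohen_kappa(human_labels, judge_labels):.3f}")
```

Here the raw agreement is 0.75, but kappa is lower because part of that agreement would be expected by chance given how often each rater assigns each label.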


The right tool for the job: An A-Z of rubrics

Rubrics turn fuzzy “good vs. bad” into measurable criteria for GenAI. In Part 2, we map what to measure (granularity and dataset-level vs instance-specific), where to measure (process vs outcome), and how to measure (humans, LLM-as-judge, code, reward models)—with examples like HHH, FLASK, HealthBench, and PaperBench.

Sep 02, 2025

Tom Walshe, Armin Parchami


Data quality and rubrics: how to build trust in your models

Rubrics aren’t just for evaluation—they’re a blueprint for better data annotation. In this post, we explore how structured rubrics enable scalable, high-quality labeling and evaluation of GenAI systems. Learn how Snorkel and leading labs use rubrics to align human and automated judgment and accelerate trusted AI development.

Jul 29, 2025

Armin Parchami


Building the benchmark: inside our agentic insurance underwriting dataset

In this post, we unpack how Snorkel built a realistic benchmark dataset to evaluate AI agents in commercial insurance underwriting. From expert-driven data design to multi-tool reasoning tasks, see how our approach surfaces actionable failure modes that generic benchmarks miss—revealing what it really takes to deploy AI in enterprise workflows.

Jul 10, 2025

Chris Glaze, Fred Sala


Evaluating AI agents for insurance underwriting

In this post, we will show you a specialized benchmark dataset we developed with our expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark uncovers several model-specific and actionable error modes, including basic tool use errors and a surprising number of insidious hallucinations from one provider. This is part of an ongoing series of benchmarks we are releasing across verticals…

Jun 26, 2025

Chris Glaze


LLM observability: key practices, tools, and challenges

LLM observability is crucial for monitoring, debugging, and improving large language models. Learn the key practices, tools, and strategies for LLM observability.

Jun 23, 2025

Snorkel Team

