Continual Learning Bench by Berkeley & Snorkel

We develop methods, benchmarks, and training systems that turn expert data into frontier AI

from the lab

Featured research

Research Paper

Accepted to MLSys 2026

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

Benchmark

Open Benchmarks Grants

Benchmarking Agents in Insurance Underwriting Environments

key research areas

Vision and impact

We help labs advance frontier models by working with domain experts to design and build complex, realistic datasets that drive model performance.

Benchmarking & Evaluation

Build benchmarks that define and advance the AI frontier

featured work

Continual Learning Bench
Co-published with Berkeley

Terminal-Bench 2.0 (+3.0)
Co-authored with Laude Institute

BigLaw Bench: Research
Co-released with Harvey

SlopCode Bench
Co-released with UW-Madison

Scaling Subject Matter Expertise

Define how subject matter experts encode their knowledge into data

featured work

Weak-to-Strong Generalization Through Data-Centric Lens
ICLR 2025

Rapid Data Creation with Weak Supervision
Best of VLDB 2017

RL, Training, & Data Valuation

Drive dataset development based on feedback from RL and model training

featured work

Learning from Less: Effectiveness of RLVR in Low Data and Compute Regimes
MLSys 2026

4B FinQA Model Outperforms 235B Model with the Right Data
Co-authored with Berkeley

RIFT: A Rubric Failure Mode Taxonomy and Automated Diagnostics
ICLR Workshop 2026

initiatives

Community and open science

Open benchmarks, conversations, and research for real-world AI performance.

Open Benchmarks Grants

Backed by a $3M commitment, the program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI systems are built and evaluated.

Benchtalks

Our podcast series at the intersection of AI evaluation, data quality, and real-world impact.

Reading Group

A recurring forum for researchers and practitioners to explore the latest frontier developments in AI while building meaningful connections within the community.

DEEP RESEARCH Expertise

Technical advisors and distinguished affiliates

Stephen Bach

Brown University

Eliot Horowitz Assistant Professor, Computer Science Department

Jason Fries

Stanford University

Assistant Professor of Biomedical Data Science and of Medicine

Jared Dunnmon

Co-Founder & Chief Scientist, Stealth Startup

Prev. Dir. of AI at DIU

Fred Sala

Chief Scientist

Snorkel AI

Assistant Professor @ University of Wisconsin-Madison

Chris Ré

Co-Founder

Snorkel AI

Professor @ Stanford University

Ludwig Schmidt

Stanford University · LAION

Stanford researcher and LAION collaborator

Karthik Narasimhan

Princeton University

Professor of Computer Science

Yu Su

Ohio State University

Associate Professor of Computer Science and Engineering

Lewis Tunstall

Hugging Face

Machine Learning Engineer

PUBLICATIONS

Browse research blogs and academic papers

Blog

Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration

At our latest Snorkel AI Reading Group, Yijia Shao (Stanford NLP) stopped by our San Francisco office to present Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration. As LLM agents get better at automating tasks on their own, a large class of real-world problems still needs a human in the loop – for their preferences, their domain expertise, or simply for control…

Jun 04, 2026 • Alexis Sobel

Blog

Benchtalks #2: The future of coding benchmarks

For our second Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with John Yang, a Stanford PhD student and creator of the SWE-bench franchise, SWE-smith, CodeClash, and most recently ProgramBench. More on ProgramBench: see the benchmark and the upcoming leaderboard at programbench.com. More from John Yang: publications and writing at john-b-yang.github.io. Snorkel…

Jun 03, 2026 • Vincent Sunn Chen

JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment

Two methodologies dominate current benchmarking practice: rubric-based scoring evaluates items against predefined criteria, whereas comparative judgment elicits pairwise preferences between outputs. Although both methodologies are widely used, the choice between them is rarely justified. We release JudgmentBench, a benchmark of 30 real-world legal tasks paired with 1,539 rubric scores and 1,530 pairwise preference judgments collected from practicing attorneys with substantial experience, including attorneys at major U.S. law firms. The annotations constitute the first publicly available dataset in a high-expertise domain in which both supervision signals are elicited from the same experts on the same items. Using LLM-generated outputs at three constructed quality...

Research Paper

May 26, 2026 • Charles Dickens

Blog

Code World Models and AutoHarness for LLM Agents

At our latest Snorkel AI Reading Group, Carter Wendelken of Google DeepMind walked us through two related papers he presented at ICLR: Code World Models for General Game Playing and AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness. Both ask the same question from opposite ends: when you want an LLM to act reliably in a complex, possibly…

May 14, 2026 • David Burch

Blog

Why coding agents need better data, evals, and environments

Coding agents have moved from tab-complete to teammate. They autonomously inspect repositories, edit files, run commands, diagnose failures, and work through multi-step engineering tasks. That creates a harder reliability problem. A model that only suggests code is easy for a human to evaluate. A coding agent refactoring your repository and testing its own changes is much harder to supervise –…

May 11, 2026 • Justin Bauer

Blog

Understanding Olmix: A Framework for Data Mixing Throughout Language Model Development

At our latest Snorkel AI Reading Group, Mayee Chen (Stanford, Hazy Research) stopped by our San Francisco office to walk us through Olmix: A Framework for Data Mixing Throughout LM Development — work she contributed to during her internship at Ai2 on OLMo 3. Olmix tackles one of the messiest, least-documented levers in LLM pre-training: how to set the ratios…

May 01, 2026 • David Burch

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

Fine-tuning Large Language Models (LLMs) typically relies on large quantities of high-quality annotated data, or questions with well-defined ground truth answers in the case of Reinforcement Learning with Verifiable Rewards (RLVR). While previous work has explored the benefits to model reasoning capabilities by scaling both data and compute used for RLVR, these results lack applicability in many real-world settings where annotated data and accessible compute may be scarce. In this work, we present a comprehensive empirical study of open-source Small Language Model (SLM) performance after RLVR in low data regimes. Across three novel datasets covering number counting problems, graph reasoning,...

Research Paper

Accepted to MLSys 2026

Apr 21, 2026 • Justin Bauer, Thomas Walsh, Derek Pham, Harit Vishwakarma, Armin Parchami, Fred Sala, Paroma Varma

Blog

Benchmarks should shape the frontier, not just measure it

Since launching the Open Benchmarks Grants, we’ve received more than 100 applications from academic groups and industry labs spanning a wide range of domains and capabilities. As the best benchmarks drive how the field allocates research effort, the bar for benchmarks has risen as well. Here, we share what’s now table stakes for useful benchmarks, and what separates the ones…

Apr 07, 2026 • Vincent Sunn Chen

RIFT: A Rubric Failure Mode Taxonomy and Automated Diagnostics

Rubric-based evaluation is widely used in LLM benchmarks and training pipelines for open-ended, less verifiable tasks. While prior work has demonstrated the effectiveness of rubrics using downstream signals such as reinforcement learning outcomes, there remains no principled way to diagnose rubric quality issues from such aggregated or downstream signals alone. To address this gap, we introduce RIFT: RubrIc Failure mode Taxonomy, a taxonomy for systematically characterizing failure modes in rubric composition and design. RIFT consists of eight failure modes organized into three high-level categories: Reliability Failures, Content Validity Failures, and Consequential Validity Failures. RIFT is developed using grounded theory by...

Research Paper

Accepted to ICLR Brazil 2026

Apr 03, 2026 • Zhengyang (Jason) Qi, Charles Dickens, Derek Pham, Amanda Dsouza, Armin Parchami, Fred Sala, Paroma Varma

Let’s research together

Join our team of leading researchers and help shape the future of AI.

View all careers
