We develop methods, benchmarks, and training systems that turn expert data into frontier AI

building benchmarks and collaborating with

key research areas

Vision and impact

We help labs advance frontier models by working with domain experts to design and build complex, realistic datasets that drive model performance.

initiatives

Community and open science

Open benchmarks, conversations, and research for real-world AI performance.

Open Benchmarks Grants

Backed by a $3M commitment, the program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI systems are built and evaluated.

Benchtalks

Our podcast series at the intersection of AI evaluation, data quality, and real-world impact.

Reading Group

A recurring forum for researchers and practitioners to explore the latest frontier developments in AI while building meaningful connections within the community.

DEEP RESEARCH Expertise

Technical advisors and distinguished affiliates

Stephen Bach

Brown University
Eliot Horowitz Assistant Professor, Computer Science Department

Jason Fries

Stanford University
Assistant Professor of Biomedical Data Science and of Medicine

Jared Dunnmon

Co-Founder & Chief Scientist, Stealth Startup
Prev. Dir. of AI at DIU

Fred Sala

Chief Scientist, Snorkel AI
Assistant Professor @ University of Wisconsin-Madison

Chris Ré

Co-Founder, Snorkel AI
Professor @ Stanford University

Ludwig Schmidt

Stanford University · LAION
Stanford researcher and LAION collaborator

Karthik Narasimhan

Princeton University
Professor of Computer Science

Yu Su

Ohio State University
Associate Professor of Computer Science and Engineering

Lewis Tunstall

Hugging Face
Machine Learning Engineer
PUBLICATIONS

Browse research blogs
and academic papers

Blog
Coding agents don’t need to be perfect, they need to recover

Error analysis of 8 models on Agentic Coding tasks

Successful completion of complex tasks doesn’t come from models always being right. It comes from models being resilient when things go wrong. To get a deeper understanding of model behavior in agentic environments, our team analyzed all of the errors found in the full traces of tasks from our Agentic Coding…

Feb 13, 2026

Blog
Closing the Evaluation Gap in Agentic AI

Today, AI is marked by a growing asymmetry: the excitement around agentic AI is real — backed by quantitative progress on model cards and genuine leaps forward, especially in coding. But ask individuals or enterprises where they feel ready to deploy agentic automation in high-stakes, domain-specific settings outside of coding… and you will find hesitation. The reason: our ability to…

Feb 11, 2026

Research Paper
Benchmarking Agents in Insurance Underwriting Environments

As AI agents integrate into enterprise applications, their evaluation demands benchmarks that reflect the complexity of real-world operations. Instead, existing benchmarks overemphasize open-domains such as code, use narrow accuracy metrics, and lack authentic complexity. We present UNDERWRITE, an expert-first, multi-turn insurance underwriting benchmark designed in close collaboration with domain experts to capture real-world enterprise challenges. UNDERWRITE introduces critical realism factors often absent in current benchmarks: proprietary business knowledge, noisy tool interfaces, and imperfect simulated users requiring careful information gathering. Evaluating 13 frontier models, we uncover significant gaps between research lab performance and enterprise readiness: the most accurate models are not...

Jan 31, 2026
Snorkel Team

Research Paper
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

AI agents may soon become capable of autonomously completing valuable, long horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65% on the benchmark and conduct an error analysis to identify areas for model and agent...

Jan 30, 2026
Snorkel Team

Blog
SlopCodeBench: Measuring Code Erosion as Agents Iterate

SlopCodeBench reveals how AI coding agents degrade code quality over time—measuring “slop,” technical debt, and architectural erosion across iterations.

Jan 20, 2026

Blog
Introducing the Snorkel Agentic Coding Benchmark

Today, we’re sharing details about the Snorkel Agentic Coding benchmark—a comprehensive evaluation suite designed to test whether agents can handle the full complexity of software engineering work.

Jan 09, 2026

Blog
2026: The year of environments

We just returned from NeurIPS 2025, and we’re still processing everything we saw. The energy around data-centric AI has never been stronger—and we couldn’t be more grateful to the research community for pushing these ideas forward.

Dec 10, 2025

Blog
Part V: Future direction and emerging trends

Explores how rubrics support agentic, multi-turn, tool-using, multimodal, and code-generating AI systems, and how they evolve with AI feedback and ensemble evaluation.

Dec 05, 2025

Blog
The self-critique paradox: Why AI verification fails where it’s needed most

TL;DR: We stress-tested the “generate → criticize → improve” loop on 50 visual reasoning tasks. The results were counterintuitive: self-critique acts as a corrosive agent on high-performance tasks, turning 98% accuracy into 57%. Yet, for tasks where models fail completely, it works like magic. This difficulty-dependent behavior poses a critical, hidden risk for RLFT pipelines. The promise vs. the reality…

Nov 26, 2025

Let’s research together

Join our team of leading researchers and help shape the future of AI.