Terminal-Bench 3.0 built with Harbor & Laude

We develop methods, benchmarks, and training systems that turn expert data into frontier AI

Browse research library


from the lab

Featured research

NEW

Benchmark

Senior SWE-bench built by Snorkel with Princeton University and UW-Madison

Benchmark

Open Benchmarks Grants

Continual Learning Bench: Evaluating agents that adapt and improve over time

Research Paper

Accepted to MLSys

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

Benchmark

Open Benchmarks Grants

Agents' Last Exam: Challenge and measure AI agents on economically valuable and real-world tasks

Benchmark

Open Benchmarks Grants

Agentic Coding Benchmark: Evaluating AI models on complex, real-world coding tasks

key research areas

Vision and impact

We help labs advance frontier models by working with domain experts to design and build complex, realistic datasets that drive model performance.

Benchmarking & Evaluation

Build benchmarks that define and advance the AI frontier

featured work

Senior SWE-bench
Built with Princeton & UW-Madison

OSWorld 2.0
Co-authored with XLANG Lab

Agents' Last Exam
Co-authored with UC Berkeley RDI

BigLaw Bench: Research
Co-released with Harvey

Scaling Subject Matter Expertise

Define how subject matter experts encode their knowledge into data

featured work

Weak-to-Strong Generalization Through Data-Centric Lens
ICLR 2025

Shrinking the Generation–Verification Gap with Weak Verifiers
NeurIPS 2025

Shoring Up the Foundations: Fusing Model Embeddings and Weak Supervision
UAI 2022, Best Paper Runner-Up

RL, Training, & Data Valuation

Drive dataset development based on feedback from RL and model training

featured work

Learning from Less: Effectiveness of RLVR in Low Data and Compute Regimes
MLSys 2026

4B FinQA Model Outperforms 235B Model with the Right Data
Co-authored with Berkeley

RIFT: A Rubric Failure Mode Taxonomy and Automated Diagnostics
ICLR Workshop 2026

initiatives

Community and open science

Open benchmarks, conversations, and research for real-world AI performance.

Open Benchmarks Grants

Backed by a $3M commitment, the program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI systems are built and evaluated.

Learn more

Benchtalks

Our podcast series at the intersection of AI evaluation, data quality, and real-world impact.

Watch the latest episode

Reading Group

A recurring forum for researchers and practitioners to explore the latest frontier developments in AI while building meaningful connections within the community.

Deep research expertise

Technical advisors and distinguished affiliates

Stephen Bach

Brown University

Eliot Horowitz Assistant Professor, Computer Science Department

Jason Fries

Stanford University

Assistant Professor of Biomedical Data Science and of Medicine

Jared Dunnmon

Co-Founder & Chief Scientist, Stealth Startup

Prev. Dir. of AI at DIU

Fred Sala

Chief Scientist

Snorkel AI

Assistant Professor @ University of Wisconsin-Madison

Chris Ré

Co-Founder

Snorkel AI

Professor @ Stanford University

Ludwig Schmidt

Stanford University · LAION

Stanford researcher and LAION collaborator

Karthik Narasimhan

Princeton University

Professor of Computer Science

Yu Su

Ohio State University

Associate Professor of Computer Science and Engineering

Lewis Tunstall

Hugging Face

Machine Learning Engineer

PUBLICATIONS

Browse research blogs and academic papers

Blog

NEW

Claude Opus 5: Performance and Error Analysis on Frontier Coding Tasks

Anthropic’s Claude Opus 5 recently debuted in second place overall on the current Senior SWE-bench leaderboard, behind Fable 5. It also achieves the highest score of any evaluated model in the benchmark’s Bug & Performance Investigation category, reinforcing the rapid progress frontier coding models continue to make on increasingly realistic software engineering tasks. Just as notable, Opus 5 reaches…

Jul 27, 2026 • Ankit Aich


Blog

Senior SWE-Bench: Evaluating Coding Agents Like Senior Engineers

At our latest Snorkel AI Reading Group, Henry Ehrenberg presented Senior SWE-Bench, an open-source, Harbor-compatible benchmark for evaluating coding agents on realistic, senior-level software engineering work. Its 100 tasks, with 50 public and 50 kept private to mitigate contamination, are sourced from real pull requests across 12 production repositories and cover complex features, migrations, bugs, and performance issues. Senior SWE-Bench…

Jul 16, 2026 • Snorkel Team


Blog

Grok 4.5 Testing Results: How SpaceXAI’s New Model Performs on Real Professional Work

We evaluated Grok 4.5 on GDPval+, Snorkel’s expert-created dataset of professional workplace reasoning tasks from across the economy. To compare performance against other frontier models, we ran the evaluation alongside GPT 5.5 and Claude Opus 4.8. Grok 4.5 demonstrated the strongest overall performance. GDPval+ is part of the Snorkel Data Series (SDS), Snorkel’s portfolio of expert-curated…

Jul 08, 2026 • Jacob Fleisig


Blog

Agents’ Last Exam: AI Benchmarking for Real Work

At our latest Snorkel AI Reading Group, Yiyou Sun and David (Xinyang) Han (UC Berkeley, Center for Responsible and Decentralized Intelligence) presented Agents’ Last Exam (ALE) — a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. ALE is a collaboration between Berkeley RDI, Snorkel AI, and 300+ expert contributors across 55 professional subfields. ALE asks a deceptively simple question: can…

Jun 30, 2026 • Snorkel Team


Blog

Continual learning and evaluating how AI agents learn across sequences of tasks

Most agent benchmarks evaluate each task as an independent episode. The agent receives a task, produces an answer, gets scored, and moves on. The next task starts as if the previous one never happened. That setup misses a core requirement for deployed agents. A coding agent, research assistant, data analyst, or workplace assistant should improve as it works across repeated…

Jun 29, 2026 • Chris Glaze


Research Paper

OSWorld 2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents. We introduce OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows across everyday and professional tasks, designed to capture complex and challenging real-world phenomena. Each task represents a realistic end-to-end workflow that takes human users a median of about 1.6 hours to complete and requires an average of 318 tool calls with Claude Opus 4.7 using maximum thinking, compared with about 30 in OSWorld 1.0. OSWorld 2.0 targets challenge phenomena that are common in real workflows yet underrepresented in...

Jun 26, 2026 • XLANG Lab, with contributions from Snorkel AI’s Zhengyang Qi, Vincent Sunn Chen, and Frederic Sala


Blog

Benchtalks #3: We taught AI everything except how to learn

For our third Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with Parth Asawa, a PhD student at UC Berkeley advised by Matei Zaharia and Joey Gonzalez. Parth leads research on continual learning and is the creator of Continual Learning Bench, developed in collaboration…

Jun 25, 2026 • Vincent Sunn Chen


Blog

Agentic AI evaluation: Closing the gap with better benchmarks and data

Alex Ratner, co-founder and CEO of Snorkel AI, spoke at @Scale: Systems & Reliability about one of the most underappreciated problems in AI deployment: our ability to measure agents has been outpaced — arguably for the first time in the history of the field — by our ability to build them. The talk digs into what it actually takes to close that…

Jun 23, 2026 • Snorkel Team


Blog

JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment

At our latest Snorkel AI Reading Group, Russell Yang (AI Engineering Fellow at Stanford Law) stopped by our San Francisco office to present JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment. As AI models improve at open-ended tasks, the field faces a harder problem: how to measure quality in domains where ground truth is contested. Two paradigms dominate: rubric-based…

Jun 18, 2026 • Snorkel Team
