We develop methods, benchmarks, and training systems that turn expert data into frontier AI

building benchmarks and collaborating with

agent-le-logo
rdi-foundation
Cua Logo
key research areas

Vision and impact

We help labs advance frontier models by working with domain experts to design and build complex, realistic datasets that drive model performance.

initiatives

Community and open science

Open benchmarks, conversations, and research for real-world AI performance.


Open Benchmarks Grants

Backed by a $3M commitment, the program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI systems are built and evaluated.


Benchtalks

Our podcast series at the intersection of AI evaluation, data quality, and real-world impact.

Reading Group

A recurring forum for researchers and practitioners to explore the latest frontier developments in AI while building meaningful connections within the community.

DEEP RESEARCH Expertise

Technical advisors and distinguished affiliates


Stephen Bach

Brown University
Eliot Horowitz Assistant Professor, Computer Science Department

Jason Fries

Stanford University
Assistant Professor of Biomedical Data Science and of Medicine

Jared Dunnmon

Co-Founder & Chief Scientist, Stealth Startup
Prev. Dir. of AI at DIU

Fred Sala

Chief Scientist, Snorkel AI
Assistant Professor @ University of Wisconsin-Madison

Chris Ré

Co-Founder, Snorkel AI
Professor @ Stanford University

Ludwig Schmidt

Stanford University · LAION
Stanford researcher and LAION collaborator

Karthik Narasimhan

Princeton University
Professor of Computer Science

Yu Su

Ohio State University
Associate Professor of Computer Science and Engineering

Lewis Tunstall

Hugging Face
Machine Learning Engineer
PUBLICATIONS

Browse research blogs
and academic papers

Blog
NEW
Agents’ Last Exam: AI Benchmarking for Real Work

At our latest Snorkel AI Reading Group, Yiyou Sun and David (Xinyang) Han (UC Berkeley, Center for Responsible and Decentralized Intelligence) presented Agents’ Last Exam (ALE) — a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. ALE is a collaboration between Berkeley RDI, Snorkel AI, and 300+ expert contributors across 55 professional subfields. ALE asks a deceptively simple question: can…

Jun 30, 2026
Blog
NEW
Continual learning and evaluating how AI agents learn across sequences of tasks

Most agent benchmarks evaluate each task as an independent episode. The agent receives a task, produces an answer, gets scored, and moves on. The next task starts as if the previous one never happened. That setup misses a core requirement for deployed agents. A coding agent, research assistant, data analyst, or workplace assistant should improve as it works across repeated…

Jun 29, 2026
Research Paper
NEW
OSWorld 2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents. We introduce OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows across everyday and professional tasks, designed to capture complex and challenging real-world phenomena. Each task represents a realistic end-to-end workflow that takes human users a median of about 1.6 hours to complete and requires an average of 318 tool calls with Claude Opus 4.7 using maximum thinking, compared with about 30 in OSWorld 1.0. OSWorld 2.0 targets challenge phenomena that are common in real workflows yet underrepresented in...

Jun 26, 2026
XLANG Lab and contributions from Snorkel AI's Zhengyang Qi, Vincent Sunn Chen, and Frederic Sala
Blog
NEW
Benchtalks #3: We taught AI everything except how to learn

For our third Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with Parth Asawa, a PhD student at UC Berkeley advised by Matei Zaharia and Joey Gonzalez. Parth leads research on continual learning and is the creator of Continual Learning Bench, developed in collaboration…

Jun 25, 2026
Blog
NEW
Agentic AI evaluation: Closing the gap with better benchmarks and data

Alex Ratner, co-founder and CEO of Snorkel AI, spoke at @Scale: Systems & Reliability about one of the most underappreciated problems in AI deployment: our ability to measure agents has been outpaced — arguably for the first time in the history of the field — by our ability to build them. The talk digs into what it actually takes to close that…

Jun 23, 2026
Blog
JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment

At our latest Snorkel AI Reading Group, Russell Yang (AI Engineering Fellow at Stanford Law) stopped by our San Francisco office to present JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment. As AI models improve at open-ended tasks, the field faces a harder problem: how to measure quality in domains where ground truth is contested. Two paradigms dominate: rubric-based…

Jun 18, 2026
Blog
The Art and Science of Building AI Benchmarks That Shape the Field

Vincent Sunn Chen spoke at AI Engineer London about what it actually takes to build AI benchmarks that move the field forward, not just measure it. The throughline is an asymmetry that keeps showing up across deployments and the 150+ proposals reviewed for the Open Benchmarks Grants: agent capabilities are climbing fast, but the ability to measure those agents with…

Jun 16, 2026
Blog
Cua-Bench: benchmarking computer-use agents on professional software

TL;DR We built a benchmark of 25 expert-authored KiCad schematic-editing tasks and ran a frontier computer-use agent against them. The headline numbers: 1. Why build a computer-use benchmark for electrical engineering? Most computer-use benchmarks today live in the same handful of apps: web browsers, file managers, generic productivity suites. Those evaluations are useful, but they share a structural weakness —…

Research Paper
Can Generalist Agents Automate Data Curation?

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce CURATION-BENCH, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent execution–research gap: agents mainly tune...

Jun 09, 2026
Feiyang Kang, Hanze Li, Adam Nguyen, Mahavir Dabas, Jiaqi W. Ma, Frederic Sala, Dawn Song, Ruoxi Jia

Let’s research together

Join our team of leading researchers and help shape the future of AI.