Resource library

Explore our complete library of resources including blogs, benchmarks, research papers, and more.

Blog

Why coding agents need better data, evals, and environments

Announcing a $3M commitment to launch Open Benchmarks Grants

Justin Bauer

May 11, 2026

Blog

Closing the Evaluation Gap in Agentic AI

Vincent Sunn Chen

February 11, 2026

Blog

Evaluating coding agent capabilities with Terminal-Bench: Snorkel’s role in building the next generation benchmark

Kobie Crawford

September 30, 2025

Blog

Benchtalks #2: The future of coding benchmarks

Featuring John Yang (SWE-bench, ProgramBench)

Vincent Sunn Chen

June 3, 2026

Blog

Building FinQA: An Open RL Environment for Financial Reasoning Agents

Bhavishya Pohani

March 30, 2026

Blog

The science of rubric design

Charles Dickens

September 11, 2025

Blog

Agentic AI evaluation: Closing the gap with better benchmarks and data

Alex Ratner, co-founder and CEO of Snorkel AI, spoke at @Scale: Systems & Reliability about one of the most underappreciated problems in AI deployment: our ability to measure agents has been outpaced — arguably for the first time in the history of the field — by our ability to build them. The talk digs into what it actually takes to close that…

Jun 23, 2026 • Snorkel Team

Blog

JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment

At our latest Snorkel AI Reading Group, Russell Yang (AI Engineering Fellow at Stanford Law) stopped by our San Francisco office to present JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment. As AI models improve at open-ended tasks, the field faces a harder problem: how to measure quality in domains where ground truth is contested. Two paradigms dominate: rubric-based…

Jun 18, 2026 • Snorkel Team

Blog

The Art and Science of Building AI Benchmarks That Shape the Field

Vincent Sunn Chen spoke at AI Engineer London about what it actually takes to build AI benchmarks that move the field forward, not just measure it. The throughline is an asymmetry that keeps showing up across deployments and the 150+ proposals reviewed for the Open Benchmarks Grants: agent capabilities are climbing fast, but the ability to measure those agents with…

Jun 16, 2026 • Snorkel Team

Blog

Cua-Bench: benchmarking computer-use agents on professional software

TL;DR We built a benchmark of 25 expert-authored KiCad schematic-editing tasks and ran a frontier computer-use agent against them. The headline numbers: 1. Why build a computer-use benchmark for electrical engineering? Most computer-use benchmarks today live in the same handful of apps: web browsers, file managers, generic productivity suites. Those evaluations are useful, but they share a structural weakness —…

Jun 15, 2026 • Zhengyang (Jason) Qi, Armin Parchami

Can Generalist Agents Automate Data Curation?

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce CURATION-BENCH, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent execution–research gap: agents mainly tune...

Research Paper

Jun 09, 2026 • Feiyang Kang, Hanze Li, Adam Nguyen, Mahavir Dabas, Jiaqi W. Ma, Frederic Sala, Dawn Song, Ruoxi Jia

Agents’ Last Exam

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents’ Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 sub...

Research Paper

Jun 08, 2026 • Yiyou Sun, Dawn Song, et al. (UC Berkeley RDI) with contributions from Snorkel AI's Amanda Dsouza and Vincent Sunn Chen

Blog

The standard for agents you can trust: Lessons from the federal front lines

In the first installment of Agentic in Action — a series about real AI deployments, not demos — Snorkel AI’s Kevin Olivieri sat down with three people who have spent their careers where trust isn’t optional: Chris Sniffen, Federal Applied AI Lead at Snorkel AI; John Hickey, President of August Schell; and Mike Baca, CIO of August Schell. The conversation focused on…

Jun 05, 2026 • Snorkel Team

Blog

Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration

At our latest Snorkel AI Reading Group, Yijia Shao (Stanford NLP) stopped by our San Francisco office to present Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration. As LLM agents get better at automating tasks on their own, a large class of real-world problems still needs a human in the loop – for their preferences, their domain expertise, or simply for control…

Jun 04, 2026 • Snorkel Team

Blog

Benchtalks #2: The future of coding benchmarks

For our second Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with John Yang, a Stanford PhD student and creator of the SWE-bench franchise, SWE-smith, CodeClash, and most recently ProgramBench. Highlights More on ProgramBench: See the benchmark and the upcoming leaderboard at programbench.com. More from John Yang: Publications and writing at john-b-yang.github.io. Snorkel…

Jun 03, 2026 • Vincent Sunn Chen
