The Art and Science of Building AI Benchmarks That Shape the Field

June 16, 2026
Snorkel Team

Vincent Sunn Chen spoke at AI Engineer London about what it actually takes to build AI benchmarks that move the field forward, not just measure it.

The throughline is an asymmetry that keeps showing up across deployments and the 150+ proposals reviewed for the Open Benchmarks Grants: agent capabilities are climbing fast, but the ability to measure those agents with rigorous AI benchmarks in realistic, high-stakes settings has lagged behind. Closing that evaluation gap is one of the most important problems in AI right now — and open benchmarks are one of the most powerful levers available to address it.

In the talk, Vincent breaks the problem into two halves. The first is the science of an effective measuring stick — rigorous task quality, deliberate distributional diversity, real model headroom, and a robust evaluation methodology — illustrated with benchmarks like GPQA, MMLU, ARC-AGI, and τ-bench. The second is the art that separates benchmarks that merely measure from the ones that reshape the field: a clear thesis on where things are going, a roadmap others can build on, and first-class researcher UX — think Terminal-Bench, SWE-bench, and HELM. The talk closes with a look at where the next great benchmarks may emerge: environment complexity, autonomy horizon, and output complexity.

If you want to go deeper than the talk, the two pieces below are the fuller written versions:


And if any of this maps to what you’re building, the Open Benchmarks Grants are open: a $3M commitment to fund open benchmarks, datasets, and evaluation artifacts for frontier agents. Share a proposal or reach out at benchmarks.snorkel.ai.
