Research

The Art and Science of Building AI Benchmarks That Shape the Field

June 16, 2026

•

2 min read

•

Snorkel Team

Vincent Sunn Chen spoke at AI Engineer London about what it actually takes to build AI benchmarks that move the field forward, not just measure it.

The throughline is an asymmetry that keeps showing up across deployments and the 150+ proposals reviewed for the Open Benchmarks Grants: agent capabilities are climbing fast, but the ability to measure those agents with rigorous AI benchmarks in realistic, high-stakes settings has lagged behind. Closing that evaluation gap is one of the most important problems in AI right now — and open benchmarks are one of the most powerful levers available to address it.

In the talk, Vincent breaks the problem into two halves. The first is the science of an effective measuring stick — rigorous task quality, deliberate distributional diversity, real model headroom, and a robust evaluation methodology — illustrated with benchmarks like GPQA, MMLU, ARC-AGI, and τ-bench. The second is the art that separates benchmarks that merely measure from the ones that reshape the field: a clear thesis on where things are going, a roadmap others can build on, and first-class researcher UX — think Terminal-Bench, SWE-bench, and HELM. The talk closes with a look at where the next great benchmarks may emerge: environment complexity, autonomy horizon, and output complexity.

If you want to go deeper than the talk, the two pieces below are the fuller written versions:

For the framework: what makes an AI benchmark a useful measuring stick, and what makes one shape the frontier (Benchmarks should shape the frontier, not just measure it).
For the forward-looking view: the three axes where tomorrow’s AI benchmarks need to push hardest (Closing the Evaluation Gap in Agentic AI).

And if any of this maps to what you’re building, the Open Benchmarks Grants are open: a $3M commitment to fund open benchmarks, datasets, and evaluation artifacts for frontier agents. Share a proposal or reach out at benchmarks.snorkel.ai.

Share this article

Recommended articles

View all articles

Claude Opus 5: Performance and Error Analysis on Frontier Coding Tasks

Anthropic’s Claude Opus 5 recently debuted as the second model overall on the current Senior SWE-bench leaderboard, behind Fable 5. It also achieves the highest score of any evaluated model on the benchmark’s Bug & Performance Investigation category, reinforcing the rapid progress frontier coding models continue to make on increasingly realistic software engineering tasks. Just as notable, Opus 5 reaches

July 27, 2026

•

Ankit Aich

Senior SWE-Bench: Evaluating Coding Agents Like Senior Engineers

At our latest Snorkel AI Reading Group, Henry Ehrenberg presented Senior SWE-Bench, an open-source, Harbor-compatible benchmark for evaluating coding agents on realistic, senior-level software engineering work. Its 100 tasks, with 50 public and 50 kept private to mitigate contamination, are sourced from real pull requests across 12 production repositories and cover complex features, migrations, bugs, and performance issues. Senior SWE-Bench

July 16, 2026

•

Snorkel Team

Grok 4.5 Testing Results: How SpaceXAI’s New Model Performs on Real Professional Work

We’ve evaluated Grok 4.5 on Snorkel’s GDPval+ dataset, Snorkel’s expert-created dataset of professional workplace reasoning tasks from across the economy. To compare performance against other frontier models, we ran the evaluation alongside GPT 5.5 and Claude Opus 4.8. Overall, Grok 4.5 demonstrated the strongest overall performance. Dataset GDPval+ is part of the Snorkel Data Series (SDS), Snorkel’s portfolio of expert-curated

July 8, 2026

•

Jacob Fleisig

The Art and Science of Building AI Benchmarks That Shape the Field

Recommended articles

Join our newsletter

How do you want to work with Snorkel?