
Benchmarks should shape the frontier, not just measure it

April 7, 2026

Since launching the Open Benchmarks Grants, we’ve received more than 100 applications from academic groups and industry labs spanning a wide range of domains and capabilities. Because the best benchmarks drive how the field allocates research effort, the bar for new benchmarks has risen as well. Here, we share what’s now table stakes for useful benchmarks, and what separates the ones that shape the frontier from the ones that just measure it.

Useful benchmarks are, first and foremost, effective measuring sticks

  • Rigorously validated tasks: The individual tasks are high quality (e.g., real-world complexity, well-structured instructions, verifiable solutions), as validated by domain experts. GPQA introduced adversarial quality-control mechanisms to ensure that tasks were not only well-posed but also tractable for other experts to solve.
  • Fine-grained distributional diversity: The benchmark defines a clear taxonomy for its domain and distributes tasks across it deliberately, so results are actionable. MMLU constructed an ambitious taxonomy of 57 academic subjects (across STEM, humanities, and professional domains).
  • Robust eval methodology: Metrics go beyond raw accuracy to capture cost, latency, reasoning quality, or whatever dimensions actually matter for real-world use of the capability. The benchmark measures what it claims to, and the methodology is reproducible and robust to contamination. TAU-bench evaluates both task completion and adherence to policy constraints: a model that books the right flight but violates fare class rules still fails.
  • Model headroom: The benchmark is unsaturated. It exposes real soft spots in model capabilities and reliably separates frontier models. At its release, ARC-AGI-3 had frontier models scoring below 1% on tasks that were 100% solvable by humans.
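
To make the scoring ideas in this list concrete, here is a minimal Python sketch of a pass criterion that requires both task completion and policy compliance, reported per taxonomy category alongside cost. It is an illustration only, not the scoring code of TAU-bench or any other benchmark named here: the `TaskResult` schema, the field names, the category labels, and the numbers are all assumptions made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one task attempt (hypothetical schema for illustration)."""
    category: str    # taxonomy bucket the task belongs to
    completed: bool  # did the model reach the goal state?
    policy_ok: bool  # did it respect every policy constraint?
    cost_usd: float  # spend for this attempt


def passed(result: TaskResult) -> bool:
    # A task counts only if it is completed AND policy-compliant.
    return result.completed and result.policy_ok


def summarize(results: list[TaskResult]) -> dict[str, dict[str, float]]:
    """Return per-category pass rate and mean cost."""
    buckets: dict[str, list[TaskResult]] = defaultdict(list)
    for r in results:
        buckets[r.category].append(r)
    return {
        cat: {
            "pass_rate": sum(passed(r) for r in rs) / len(rs),
            "mean_cost_usd": sum(r.cost_usd for r in rs) / len(rs),
        }
        for cat, rs in buckets.items()
    }


# Made-up results: the second attempt completes the task but breaks policy.
results = [
    TaskResult("airline", completed=True, policy_ok=True, cost_usd=0.25),
    TaskResult("airline", completed=True, policy_ok=False, cost_usd=0.75),
    TaskResult("retail", completed=True, policy_ok=True, cost_usd=0.5),
    TaskResult("retail", completed=False, policy_ok=True, cost_usd=0.5),
]
summary = summarize(results)
for category, stats in summary.items():
    print(category, stats)
```

Reporting by category rather than as a single number is what makes a result actionable: a team can see which slice of the taxonomy is weak, and whether a high pass rate was bought with high cost.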


Lasting benchmarks push the frontier

  • A thesis on the frontier: The benchmark defines a new subspace of capabilities for the frontier or revisits a previous research question with new assumptions. The most ambitious benchmarks have a thesis on where the world is going: Terminal-Bench was a bet on the CLI, not only for coding agents but for general-purpose computer use.
  • Roadmaps for the field: The benchmark produces new roadmaps. It inspires new attacks on important research problems, including follow-on benchmarks and methods that advance the field. SWE-bench spawned a whole family of benchmarks (e.g., Lite, Verified, Multilingual, Multimodal), and its evolution has shaped how teams build coding agents.
  • Researcher UX: The benchmark builders are committed to the “researcher experience”. This means the benchmark is simple to run models and agents against, simple to contribute to and extend, and simple to adapt as a supervision or reward signal for RL and tuning. HELM pioneered a standardized and modular harness for reproducible evals; Terminal-Bench 2.0 shipped with Harbor, which has become de facto tooling for teams building agents.

Every benchmark highlighted here has had a lasting impact, a reminder that individual researchers and small teams have enormous agency to define and advance the field. We’re excited to support the next ones with the Open Benchmarks Grants. Share your proposals or reach out at benchmarks.snorkel.ai!

Vincent Sunn Chen
Research Fellow & Founding Team

Vincent Sunn Chen is a Research Fellow on the founding team at Snorkel AI. His work centers on systems for high-quality AI evaluation and data development with experts in the loop. He currently leads the Open Benchmarks Grants, a $3M commitment to funding benchmarks and infrastructure for frontier agents. Prior to Snorkel, Vincent was a researcher at the Stanford AI Lab, where he studied the foundations of data-centric AI systems.
