
Benchmarks should shape the frontier, not just measure it

April 7, 2026

Since launching the Open Benchmarks Grants, we’ve received more than 100 applications from academic groups and industry labs spanning a wide range of domains and capabilities. As the best benchmarks drive how the field allocates research effort, the bar for benchmarks has risen as well. Here, we share what’s now table stakes for useful benchmarks, and what separates the ones that shape the frontier from the ones that just measure it.

Useful benchmarks are, first and foremost, effective measuring sticks

  • Rigorously validated tasks: The individual tasks are high quality (e.g. real-world complexity, well-structured instructions, verifiable solutions), as validated by real domain experts. GPQA introduced new adversarial quality control mechanisms to ensure that tasks were not only well-posed, but also tractable for other experts to solve.
  • Fine-grained distributional diversity: The benchmark defines a clear taxonomy for its domain and distributes tasks across it deliberately, so results are actionable. MMLU constructed an ambitious taxonomy of 57 academic subjects (across STEM, humanities, and professional domains).
  • Robust eval methodology: Metrics go beyond raw accuracy, capturing cost, latency, reasoning quality, or whatever dimensions actually matter for real-world use of the capability. The benchmark measures what it claims to, and the methodology is reproducible and robust to contamination. TAU-bench evaluates both task completion and adherence to policy constraints: a model that books the right flight but violates fare class rules still fails.
  • Model headroom: The benchmark is unsaturated. It exposes real soft spots in model capabilities and reliably separates frontier models. At its release, ARC-AGI-3 had frontier models scoring below 1% over tasks that were 100% solvable by humans.
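
To make the scoring ideas above concrete, here is a minimal sketch of a scorer that reports results per taxonomy category, tracks cost alongside accuracy, and counts a task as passed only when it is both completed and free of policy violations. Every name in it (`EpisodeResult`, `passed`, `summarize`, the category labels, and the numbers) is a hypothetical chosen for illustration. It is not the scoring code of TAU-bench or any other benchmark named here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeResult:
    """One model attempt at one task. All fields are hypothetical."""
    category: str           # taxonomy bucket the task belongs to
    task_completed: bool    # did the agent reach the goal state?
    policy_violations: int  # number of constraint violations observed
    cost_usd: float         # spend for this attempt


def passed(result: EpisodeResult) -> bool:
    """A task passes only if it is completed with zero policy violations."""
    return result.task_completed and result.policy_violations == 0


def summarize(results: list[EpisodeResult]) -> dict[str, dict[str, float]]:
    """Return the pass rate and mean cost for each taxonomy category."""
    buckets: dict[str, list[EpisodeResult]] = defaultdict(list)
    for r in results:
        buckets[r.category].append(r)
    return {
        category: {
            "pass_rate": sum(passed(r) for r in rs) / len(rs),
            "mean_cost_usd": sum(r.cost_usd for r in rs) / len(rs),
        }
        for category, rs in buckets.items()
    }


# Made-up attempts, for illustration only.
results = [
    EpisodeResult("airline", True, 0, 0.5),
    EpisodeResult("airline", True, 1, 0.25),  # goal reached, but a rule was broken
    EpisodeResult("retail", True, 0, 0.25),
    EpisodeResult("retail", False, 0, 0.75),
]
report = summarize(results)
```

Reporting per category rather than as one global number is what makes a result actionable: a reader can see which slice of the taxonomy is weak and what it costs to run.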


Lasting benchmarks push the frontier

  • A thesis on the frontier: The benchmark defines a new subspace of capabilities for the frontier or revisits a previous research question with new assumptions. The most ambitious benchmarks have a thesis on where the world is going: Terminal-Bench was a bet on the CLI, not only for coding agents but for general-purpose computer use.
  • Roadmaps for the field: The benchmark produces new roadmaps. It inspires new attacks against important research problems, including follow-on benchmarks and methods that advance the field. SWE-bench spawned a whole family of benchmarks (e.g. Lite, Verified, Multilingual, Multimodal), and its evolution has shaped how teams build coding agents.
  • Researcher UX: The benchmark builders are committed to the “researcher experience”. This means the benchmark is simple to run models/agents against, simple to contribute to/extend, and simple to adapt supervision/reward signals for RL/tuning. HELM pioneered a standardized and modular harness for reproducible evals; Terminal-Bench 2.0 shipped with Harbor, which has become de facto tooling for teams building agents.
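
One way to picture the "researcher experience" point is a task definition whose verifier serves both as an eval check and as a reward signal. The sketch below assumes a hypothetical `Task` type and the helper functions `reward` and `evaluate`. None of these come from HELM, Harbor, or any other tool named here, and the toy agent stands in for a real model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    """A hypothetical benchmark task: a prompt plus a programmatic verifier."""
    task_id: str
    prompt: str
    verify: Callable[[str], bool]  # returns True if the output solves the task


def reward(task: Task, output: str) -> float:
    """Binary reward from the verifier, reusable as an RL/tuning signal."""
    return 1.0 if task.verify(output) else 0.0


def evaluate(agent: Callable[[str], str], tasks: list[Task]) -> float:
    """Run an agent (any prompt -> output callable) and return its mean reward."""
    return sum(reward(t, agent(t.prompt)) for t in tasks) / len(tasks)


# Extending the benchmark means appending another Task.
tasks = [
    Task("add", "2+2", lambda out: out.strip() == "4"),
    Task("upper", "shout: hi", lambda out: out == "HI"),
]


def toy_agent(prompt: str) -> str:
    """Stand-in for a model: solves the first task and fails the second."""
    return "4" if prompt == "2+2" else "hi"


score = evaluate(toy_agent, tasks)
```

Because the agent is just a callable and the verifier is attached to the task, the same definitions can be run against a new model, extended with new tasks, or used as a reward function without changing the harness.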

Every benchmark highlighted here has had a lasting impact, a reminder that individual researchers and small teams have enormous agency to define and advance the field. We’re excited to support the next ones with the Open Benchmarks Grants. Share your proposals or reach out at benchmarks.snorkel.ai!

Vincent Sunn Chen
Research Fellow & Founding Team

Vincent Sunn Chen is a Research Fellow on the founding team at Snorkel AI. His work centers on systems for high quality AI evaluation & data development with experts in the loop. He currently leads the Open Benchmarks Grants, a $3M commitment to funding benchmarks and infrastructure for frontier agents. Prior to Snorkel, Vincent was a researcher at the Stanford AI Lab, where he studied the foundations of data-centric AI systems.
