Snorkel AI, Princeton, and UW-Madison researchers release Senior SWE-bench

July 1, 2026

Top-performing frontier models fail to complete tasks with senior-level correctness and taste over 75% of the time.

SAN FRANCISCO, July 2, 2026 — Snorkel AI, with researchers at Princeton University and the University of Wisconsin–Madison, released Senior SWE-bench, an open-source, Harbor-native benchmark that assesses agents as senior engineers on long-horizon tasks with realistically underspecified instructions.

The release lands as companies race to hand AI agents senior-level autonomy by default with no reliable way to check whether that trust is earned. Senior SWE-bench is built to answer that.

“Most existing benchmarks treat agents like junior engineers, providing overly specific requirements and assessing agents primarily on their ability to write correct code, rather than demonstrate a wide range of skills,” notes Henry Kiss Ehrenberg, Senior SWE-bench Project Lead.

Senior SWE-bench grades on what that approach misses: interpreting underspecified requests, investigating runtime behavior, and shipping code a maintainer would actually merge.

Leaderboard at launch

*Reward hacking (e.g. GitHub searches) detected, 26 tasks removed from score

What’s in the benchmark

Senior SWE-bench splits tasks into two types:

Design-and-Build (features, migrations): tasks read like a quick Slack message rather than a spec. A validation agent, working from an expert-designed spec the solving agent never sees, writes fresh behavioral tests tailored to whatever solution gets submitted.
Investigate-and-Fix (bugs, performance): agents get a plain-language issue report and must investigate on their own, from starting services to debugging subtle runtime issues. Tasks are sourced from real PRs that required serious runtime digging, including logs, profiling data, and reproduction steps.

Across both, a taste judge grades the resulting code for quality (minimality, approach, hygiene, fluency, craftsmanship) and for fit with the codebase’s own conventions, calibrated against human reviewers.
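To make the rubric concrete, here is a minimal sketch of how per-dimension scores might be recorded and compared against a human reviewer. It is an illustration only, not the benchmark's implementation: the `TasteScores` record, the `mean_abs_gap` function, the 0.0 to 1.0 scale, and the sample values are all assumptions introduced here.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TasteScores:
    """Per-dimension quality scores; the 0.0-1.0 scale is an assumption."""
    minimality: float
    approach: float
    hygiene: float
    fluency: float
    craftsmanship: float
    convention_fit: float  # fit with the codebase's own conventions


def mean_abs_gap(judge: TasteScores, human: TasteScores) -> float:
    """Average absolute difference between judge and human scores."""
    gaps = [
        abs(getattr(judge, f.name) - getattr(human, f.name))
        for f in fields(TasteScores)
    ]
    return sum(gaps) / len(gaps)


# Hypothetical scores for one submission, in field order.
judge = TasteScores(1.0, 0.5, 1.0, 0.5, 1.0, 0.5)
human = TasteScores(1.0, 1.0, 1.0, 0.5, 0.5, 0.5)
gap = mean_abs_gap(judge, human)  # smaller means closer agreement
```

A gap near zero across many submissions would suggest the judge tracks human reviewers closely; how the benchmark actually measures calibration is not specified in this release.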

The benchmark ships 100 tasks (50 public, 50 private), sourced from PRs across twelve open-source repositories, most authored by engineers with 100+ commits in their repos. Every task and every reward mechanism passed through multi-stage testing. The validation agent itself is vetted against known-good and known-bad solutions, with fewer than 5% of trials discarded for misalignment, and every task is reviewed by the contributing research team and SWE experts from Snorkel AI’s expert network (average experience: over 8 years).
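As a rough illustration of the vetting step described above, the sketch below checks a validator against solutions whose outcomes are already known and reports how often its verdict disagrees. It is a sketch under stated assumptions, not the benchmark's actual harness: the `Trial` record, the `vet_validator` function, and the toy validator are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Trial:
    """One reference solution with a known outcome (hypothetical record)."""
    solution: str
    expected_pass: bool  # True for known-good, False for known-bad


def vet_validator(validate: Callable[[str], bool], trials: List[Trial]) -> float:
    """Return the fraction of trials where the validator's verdict
    disagrees with the known outcome."""
    if not trials:
        raise ValueError("need at least one trial")
    misaligned = sum(1 for t in trials if validate(t.solution) != t.expected_pass)
    return misaligned / len(trials)


# Toy stand-in for a validator: accepts any solution that mentions a retry.
def toy_validator(solution: str) -> bool:
    return "retry" in solution


trials = [
    Trial("add retry with backoff", True),
    Trial("retry once, then fail loudly", True),
    Trial("swallow the exception", False),
    Trial("disable the failing test", False),
]

misalignment_rate = vet_validator(toy_validator, trials)
```

A validator that accepts everything would disagree with every known-bad solution, which is why both kinds of reference solutions are needed to catch it.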

About Senior SWE-bench

Senior SWE-bench is a collaboration between Snorkel AI, Princeton University, and the University of Wisconsin–Madison, led by Henry Kiss Ehrenberg and Vincent Sunn Chen (Snorkel AI); Austin W. Hanjie and Karthik R. Narasimhan (Princeton); and Gabriel Orlanski and Frederic Sala (UW–Madison). It is open source and Harbor-compatible. Learn more at senior-swe-bench.snorkel.ai.

About Snorkel AI

Snorkel AI is the frontier AI data lab, helping teams build the data and environments behind high-performing frontier and agentic AI. We combine platform technology with research-driven data development to create datasets, benchmarks, evals, and custom solutions for real-world AI systems. Founded out of the Stanford AI Lab in 2019, Snorkel works with leading AI labs and enterprises to move from better data to better outcomes. Snorkel led the development of Senior SWE-bench and launched Open Benchmarks Grants with a $3 million commitment to support open-source datasets, benchmarks, and evaluation research. Supported projects include Agents’ Last Exam, OSWorld 2.0, Terminal-Bench, Continual Learning Bench, and SlopCode Bench.

Media Contact: Snorkel AI Press Team, press@snorkel.ai
