Senior SWE-Bench
A benchmark for evaluating coding agents on senior-level engineering work: building features from realistic instructions, diagnosing bugs that require runtime investigation, and shipping code that aligns with existing codebase conventions.


    # Set env vars required for your model
    # export ANTHROPIC_API_KEY=sk-ant-...
    # export OPENAI_API_KEY=sk-proj-...
    # export MY_PROVIDERS_API_KEY=...

    # [Optional] Set the models for the test stage (defaults below)
    # export SSB_OVERRIDE_VA_HARNESS=miniswebench
    # export SSB_OVERRIDE_VA_MODEL=anthropic/claude-sonnet-4-6
    # export SSB_OVERRIDE_ALL_JUDGE_MODEL=anthropic/claude-sonnet-4-6
    # export SSB_OVERRIDE_CLASSIFIER_MODEL=anthropic/claude-haiku-4-5

    # Set depending on what you want to run
    MODEL=anthropic/claude-opus-4-8
    AGENT=mini-swe-agent

    # Run Harbor
    harbor run --repo snorkel-ai/senior-swe-bench -a $AGENT -m $MODEL
Most software-engineering benchmarks evaluate AI agents like junior engineers: over-specified requirements graded against a fixed test suite. Senior SWE-Bench reframes the problem around the senior-level work that deployed coding agents are actually expected to do, with tasks sourced from real pull requests across twelve open-source projects.
Headline finding: no frontier agent exceeds 25% tasteful solve rate. Claude Opus 4.8 leads at 24.0%. Brand-new Claude Sonnet 5 lands second at 19.7% (*flagged: reward hacking detected on 26 tasks, filtered). GPT-5.5 tops the basic-correctness axis at 55.0%. The rankings flip between metrics: frontier models pass runtime tests far more often than they pass with senior-level taste.
At a glance
100 tasks (50 public / 50 private)
12 source repositories
6 evaluation gates per task
10 frontier agents evaluated
100+ PR-author commits required
Leaderboard
| Rank | Model | Harness | Effort | Tasteful Solve Rate | Basic Solve Rate | Avg Steps | Avg Tokens |
|---|---|---|---|---|---|---|---|
| 1 | claude-opus-4-8 | Mini-SWE-Agent | max | 24% | 42% | 131 | 117.1K |
| 2* | claude-sonnet-5 | Mini-SWE-Agent | max | 19.7% | 45.5% | 262 | 304.4K |
| 2 | gpt-5-5 | Mini-SWE-Agent | xhigh | 16% | 55% | 89 | 36.3K |
| 3 | claude-opus-4-7 | Mini-SWE-Agent | max | 14.1% | 40.4% | 153 | 96.0K |
| 4 | gpt-5-4 | Mini-SWE-Agent | xhigh | 14% | 49% | 82 | 52.0K |
| 5 | glm-5-2 | Mini-SWE-Agent | max | 12.5% | 31.3% | 211 | 65.1K |
| 6 | kimi-k2-6 | Mini-SWE-Agent | default | 8.2% | 23.7% | 220 | 492.1K |
| 7 | claude-sonnet-4-6 | Mini-SWE-Agent | high | 8.2% | 31.6% | 173 | 60.6K |
| 8 | gemini-3-1-pro | Mini-SWE-Agent | high | 6.1% | 26.3% | 89 | 20.2K |
| 9 | gemini-3-5-flash | Mini-SWE-Agent | medium | 3% | 19% | 253 | 83.7K |
Basic solve rate is the share of an agent's runs that pass every pre-written verifier and automated validation test. Tasteful solve rate requires all of that plus clearing every additional quality gate: rubric, bloat, codebase practice, and relative taste vs. an expert reference.
*Reward hacking (e.g. GitHub searches) detected, 26 tasks removed from score.
Source repositories
Twelve open-source projects sampled across libraries, tools, services, and full applications. Most Senior SWE-Bench tasks are based on PRs authored by engineers with 100+ commits in the respective repository, with maintainer-authored PRs oversampled.
| Repository | Languages | Type | Description | LOC | Started | Stars |
|---|---|---|---|---|---|---|
| electric-sql/electric | Elixir, TypeScript | Service | Postgres real-time sync | 345k | 2022 | 10.2k |
| go-gitea/gitea | Go | Application | Self-hosted Git forge | 397k | 2016 | 56.3k |
| PostHog/posthog | Python, TypeScript | Application | Product analytics platform | 3.8M | 2020 | 35.1k |
| PrefectHQ/prefect | Python | Library | Workflow orchestration | 664k | 2018 | 22.6k |
| better-auth/better-auth | TypeScript | Library | Authentication framework | 289k | 2024 | 28.7k |
| gravitational/teleport | Go, TypeScript | Application | Infrastructure access platform | 2.8M | 2015 | 20.5k |
| vercel/turborepo | Rust, TypeScript | Tool | Monorepo build system | 215k | 2021 | 30.6k |
| plausible/analytics | Elixir | Application | Privacy-friendly web analytics | 228k | 2018 | 27.2k |
| firezone/firezone | Elixir, Rust | Application | Zero-trust access platform | 247k | 2020 | 8.7k |
| paperless-ngx/paperless-ngx | Python, TypeScript | Application | Document management system | 148k | 2022 | 42.2k |
| immich-app/immich | TypeScript | Application | Self-hosted photo backup | 542k | 2022 | 103.6k |
| harbor-framework/harbor | Python | Tool | Agent evaluation harness | 219k | 2025 | 2.5k |
Sample tasks
The 50 public task families span twelve open-source projects. Each task ships with a sandboxed environment, an expert-authored validation spec, and a reference solution. Tasteful Solve is the share of frontier-agent attempts that pass both the functional verifier and the taste review.
Realistic vs. over-specified instructions
For comparison, here are two bug-task instructions from real-world source PRs. Senior SWE-Bench frames bugs as natural-language behavioral reports; SWE-Bench Pro spells out full reproduction steps and expected behavior. Behavioral testing lets the realistic version stay short without sacrificing reliable grading.
Senior SWE-Bench instruction: 549 chars, ~0 code symbols
SWE-Bench Pro instruction: 5,888 chars, ~32 code symbols
Illustrative samples. Senior SWE-Bench instructions read like an issue report on Slack; verifier-driven benchmarks lean on rigid, over-specified reproduction steps. Note: the two instructions do not describe the same task.
Comparison to other benchmarks
Several recent benchmarks make progress on behavioral testing and instruction realism. The following table provides a brief comparison.
| Benchmark | Task style and source | Instruction realism | Reward mechanisms | Open source |
|---|---|---|---|---|
| Senior SWE-Bench | Real-world PRs | High (natural language message) | Verifiers (behavioral); Validation agent; Task rubrics; Taste judge | Yes |
| SWE-Bench Pro | Real-world PRs | Low (full specs) | Verifiers (implementation-specific); Rubric | Yes |
| DeepSWE | Invented tasks in real repos | Mixed (some full specs) | Verifiers (behavioral) | Yes |
| FrontierCode | Real-world PRs | Unknown (examples are mixed) | Verifiers (behavioral); LLM-adapted verifiers; Agent-written tests (reverse); Code quality judge | No |
| ProgramBench | Full program recreation | n/a | Verifiers (behavioral) | No |
Methodology
Metrics
Pass@1 on the Mini-SWE-Agent harness (Harbor-compatible). Tasteful Solve Rate requires all six gates (verifiers + validation + rubric + bloat + practice + relative taste) to pass simultaneously; Basic Solve Rate removes the taste-related gates and measures correctness only.
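The relationship between the two rates can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's scoring code: the `Trial` class, the gate keys, and `solve_rates` are hypothetical names chosen here, and the split into two correctness gates and four taste gates follows the description above.

```python
from dataclasses import dataclass

# Hypothetical gate keys mirroring the six gates named in the text.
BASIC_GATES = ("verifiers", "validation")
TASTE_GATES = ("rubric", "bloat", "practice", "relative_taste")


@dataclass(frozen=True)
class Trial:
    """One agent run on one task, with a pass/fail result per gate."""
    gates: dict[str, bool]


def basic_solve(trial: Trial) -> bool:
    # Correctness only: a missing gate result counts as a failure.
    return all(trial.gates.get(g, False) for g in BASIC_GATES)


def tasteful_solve(trial: Trial) -> bool:
    # All six gates must pass on the same trial.
    return all(trial.gates.get(g, False) for g in BASIC_GATES + TASTE_GATES)


def solve_rates(trials: list[Trial]) -> tuple[float, float]:
    """Return (basic solve rate, tasteful solve rate) over a list of trials."""
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    basic = sum(basic_solve(t) for t in trials) / n
    tasteful = sum(tasteful_solve(t) for t in trials) / n
    return basic, tasteful
```

Because a tasteful solve implies a basic solve, the tasteful rate computed this way can never exceed the basic rate.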
For feature tasks, an agent (Mini-SWE-Agent with Claude Sonnet 4.6) writes behavioral tests adapted to each submitted solution using an expert-authored recipe. Each task is calibrated with three runs on the oracle patch and three runs on a no-op patch; it is rejected if pass³ < 1 on the oracle or pass³ > 0 on the no-op. Measured on Claude Opus 4.8 trials, the wall-clock overhead is 6–20% (median 11%) and the token-cost overhead is 2–16% (median 6%). In practice, fewer than 5% of trials are discarded.
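The calibration rule can be sketched as a small filter. The sketch assumes that pass³ denotes the fraction of the three calibration runs that pass, which is one plausible reading of the notation and is not confirmed by the text; `pass_fraction` and `keep_task` are hypothetical helper names.

```python
def pass_fraction(results: list[bool]) -> float:
    """Fraction of calibration runs that passed (assumed meaning of pass³)."""
    if not results:
        raise ValueError("need at least one calibration run")
    return sum(results) / len(results)


def keep_task(oracle_runs: list[bool], noop_runs: list[bool]) -> bool:
    """Keep a task only if the oracle patch always passes and the no-op never does.

    Mirrors the stated rule: reject if pass³ < 1 on the oracle
    or pass³ > 0 on the no-op.
    """
    return pass_fraction(oracle_runs) == 1.0 and pass_fraction(noop_runs) == 0.0
```

Under this reading, a single flaky oracle run or a single spurious no-op pass is enough to reject the task.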
An LLM judge grades each patch against the expert reference solution along two axes: relative code quality (minimality, approach, hygiene, fluency, craftsmanship) and codebase practice alignment (style consistency, pattern adherence, library usage, abstraction level, documentation fit). Thresholds are set conservatively (any score > 2/5) and calibrated against human reviewers.
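One way to picture the thresholding is below. The criterion names come from the paragraph above, but the aggregation rule is an assumption made for illustration: the sketch treats each criterion as scored from 1 to 5 and passes an axis only when every criterion scores above 2. The real judge may combine scores differently, and `axis_passes` is a hypothetical helper.

```python
# Criterion names taken from the text; the dict keys themselves are hypothetical.
RELATIVE_QUALITY = ("minimality", "approach", "hygiene", "fluency", "craftsmanship")
CODEBASE_PRACTICE = (
    "style_consistency",
    "pattern_adherence",
    "library_usage",
    "abstraction_level",
    "documentation_fit",
)

PASS_ABOVE = 2  # "any score > 2/5" passes; assumed to apply per criterion


def axis_passes(scores: dict[str, int], criteria: tuple[str, ...],
                pass_above: int = PASS_ABOVE) -> bool:
    """Assumed rule: an axis passes only if every criterion exceeds the threshold."""
    missing = [c for c in criteria if c not in scores]
    if missing:
        raise KeyError(f"missing criteria: {missing}")
    return all(scores[c] > pass_above for c in criteria)
```

With this rule a patch scoring 3 on every criterion clears the axis, while a single score of 2 fails it.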
Quality control
Every task passes three layers of review: automated LLM-based checks, research-team review for overall design and implementation quality, and SWE-expert review via the Snorkel AI expert network using an extensive rubric. Each task includes a "guided" variant whose instruction adds optional hints (useful for performance diagnosis or curriculum learning) without prescribing the solution.
What the results show
Three patterns emerge from the leaderboard. None depends on the absolute scores; all concern the gap between correctness and senior-level taste.
Taste opens a 2–6× gap.
Frontier agents pass basic correctness (verifiers + validation tests) on 19–55% of tasks but earn a Tasteful Solve on only 3–24%. Every model loses 43–84% of its basic-solve credit when the taste, bloat, and codebase-practice gates are applied.
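The gap figures can be recomputed directly from the leaderboard. The snippet below is a small worked check using the published Tasteful and Basic Solve Rates; `taste_gap` is a hypothetical helper, and the only inputs are the percentages in the table.

```python
# (tasteful %, basic %) per model, copied from the leaderboard.
LEADERBOARD = {
    "claude-opus-4-8": (24.0, 42.0),
    "claude-sonnet-5": (19.7, 45.5),
    "gpt-5-5": (16.0, 55.0),
    "claude-opus-4-7": (14.1, 40.4),
    "gpt-5-4": (14.0, 49.0),
    "glm-5-2": (12.5, 31.3),
    "kimi-k2-6": (8.2, 23.7),
    "claude-sonnet-4-6": (8.2, 31.6),
    "gemini-3-1-pro": (6.1, 26.3),
    "gemini-3-5-flash": (3.0, 19.0),
}


def taste_gap(tasteful: float, basic: float) -> tuple[float, float]:
    """Return (basic/tasteful ratio, share of basic-solve credit lost to taste gates)."""
    ratio = basic / tasteful
    loss = 1.0 - tasteful / basic
    return ratio, loss


if __name__ == "__main__":
    for model, (tasteful, basic) in LEADERBOARD.items():
        ratio, loss = taste_gap(tasteful, basic)
        print(f"{model:20s} gap {ratio:.2f}x  credit lost {loss:.0%}")
```

The smallest gap is Claude Opus 4.8 (42 / 24 = 1.75×, about 43% of credit lost) and the largest is gemini-3-5-flash (19 / 3 ≈ 6.3×, about 84% lost), which matches the rounded 2–6× and 43–84% ranges quoted above.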
Correctness and taste are different skills.
GPT-5.5 wins on Basic Solve at 55.0%, but Claude Opus 4.8 wins on Tasteful Solve at 24.0%, so the rankings flip between the two metrics. Models that write the most runtime-passing patches don't always write the most senior-grade ones.
Even the top model misses senior taste 3 out of 4 times.
Claude Opus 4.8 leads at 24.0% Tasteful Solve, meaning even the best agent falls short of the bar a senior engineer would hold on 76% of tasks.
Resources
Blog
GitHub
Task viewer
Acknowledgments
Senior SWE-Bench is led by Henry Kiss Ehrenberg with contributions from Vincent Sunn Chen at Snorkel AI; Austin W. Hanjie and Karthik Narasimhan at Princeton University; and Gabriel Orlanski and Frederic Sala at the University of Wisconsin–Madison.
All tasks are created and reviewed by contributing research staff and software engineers from the Snorkel AI expert network, in concert with specialized coding and evaluation agents. The benchmark is open source and Harbor-compatible.

