Senior SWE-Bench

A benchmark for evaluating coding agents on senior-level engineering work: building features from realistic instructions, diagnosing bugs that require runtime investigation, and shipping code that follows existing codebase conventions.

# Set env vars required for your model
# export ANTHROPIC_API_KEY=sk-ant-...
# export OPENAI_API_KEY=sk-proj-...
# export MY_PROVIDERS_API_KEY=...
 
# [Optional] Set the models for the test stage (defaults below)
# export SSB_OVERRIDE_VA_HARNESS=miniswebench
# export SSB_OVERRIDE_VA_MODEL=anthropic/claude-sonnet-4-6
# export SSB_OVERRIDE_ALL_JUDGE_MODEL=anthropic/claude-sonnet-4-6
# export SSB_OVERRIDE_CLASSIFIER_MODEL=anthropic/claude-haiku-4-5
 
# Set depending on what you want to run
MODEL=anthropic/claude-opus-4-8
AGENT=mini-swe-agent
 
# Run Harbor
harbor run --repo snorkel-ai/senior-swe-bench -a $AGENT -m $MODEL
Run with Harbor
Overview

Most software-engineering benchmarks evaluate AI agents like junior engineers: they hand over over-specified requirements and grade against a fixed test suite. Senior SWE-Bench reframes the problem around the senior-level work that deployed coding agents are actually expected to do, with tasks sourced from real pull requests across twelve open-source projects.

Headline finding: no frontier agent exceeds 25% tasteful solve rate. Claude Opus 4.8 leads at 24.0%. Brand-new Claude Sonnet 5 lands second at 19.7% (*flagged: reward hacking detected on 26 tasks, filtered). GPT-5.5 tops the basic-correctness axis at 55.0%. The rankings flip between metrics: frontier models pass runtime tests far more often than they pass with senior-level taste.

At a glance

100 tasks (50 public / 50 private)
12 open-source repositories
6 evaluation gates per task
10 frontier agents evaluated
100+ PR-author commits required

Leaderboard

Rank Model Harness Effort Tasteful Solve Rate Basic Solve Rate Avg Steps Avg Tokens
1 claude-opus-4-8 Mini-SWE-Agent max 24% 42% 131 117.1K
2* claude-sonnet-5 Mini-SWE-Agent max 19.7% 45.5% 262 304.4K
2 gpt-5-5 Mini-SWE-Agent xhigh 16% 55% 89 36.3K
3 claude-opus-4-7 Mini-SWE-Agent max 14.1% 40.4% 153 96.0K
4 gpt-5-4 Mini-SWE-Agent xhigh 14% 49% 82 52.0K
5 glm-5-2 Mini-SWE-Agent max 12.5% 31.3% 211 65.1K
6 kimi-k2-6 Mini-SWE-Agent default 8.2% 23.7% 220 492.1K
7 claude-sonnet-4-6 Mini-SWE-Agent high 8.2% 31.6% 173 60.6K
8 gemini-3-1-pro Mini-SWE-Agent high 6.1% 26.3% 89 20.2K
9 gemini-3-5-flash Mini-SWE-Agent medium 3% 19% 253 83.7K

Basic solve rate is the share of an agent's runs that pass every pre-written verifier and automated validation test. Tasteful solve rate requires all of that plus passing every additional quality gate: rubric, bloat, codebase practice, and relative taste vs. an expert reference.

*Reward hacking (e.g. GitHub searches) detected, 26 tasks removed from score. 

Source repositories

Twelve open-source projects, sampled across libraries, tools, services, and full applications. Most Senior SWE-Bench tasks are based on PRs authored by engineers with at least 100 commits in the respective repository, with maintainer-authored PRs oversampled.

Repository Languages Type Description LOC Started Stars
electric-sql/electric Elixir, TypeScript Service Postgres real-time sync 345k 2022 10.2k
go-gitea/gitea Go Application Self-hosted Git forge 397k 2016 56.3k
PostHog/posthog Python, TypeScript Application Product analytics platform 3.8M 2020 35.1k
PrefectHQ/prefect Python Library Workflow orchestration 664k 2018 22.6k
better-auth/better-auth TypeScript Library Authentication framework 289k 2024 28.7k
gravitational/teleport Go, TypeScript Application Infrastructure access platform 2.8M 2015 20.5k
vercel/turborepo Rust, TypeScript Tool Monorepo build system 215k 2021 30.6k
plausible/analytics Elixir Application Privacy-friendly web analytics 228k 2018 27.2k
firezone/firezone Elixir, Rust Application Zero-trust access platform 247k 2020 8.7k
paperless-ngx/paperless-ngx Python, TypeScript Application Document management system 148k 2022 42.2k
immich-app/immich TypeScript Application Self-hosted photo backup 542k 2022 103.6k
harbor-framework/harbor Python Tool Agent evaluation harness 219k 2025 2.5k

Sample tasks

All 50 public task families, drawn from twelve open-source projects, are listed below. Each task ships with a sandboxed environment, an expert-authored validation spec, and a reference solution. Tasteful Solve is the share of frontier-agent attempts that pass both the functional verifier and the taste review.

Category Type Task Solve rate Repository
Investigate performance better-auth-fix-api-key-run 0% better-auth
Investigate bug better-auth-fix-api-return-response 0% better-auth
Investigate bug better-auth-fix-oauth-provider-return 0% better-auth
Investigate bug better-auth-fix-resolve-dynamic-baseurl 0% better-auth
Design feature electric-feat-add-variadic-function 0% electric
Design feature electric-feat-sync-service-start 33% electric
Investigate bug electric-fix-classify-admission-control 11% electric
Investigate bug electric-fix-elixir-client-cache 22% electric
Investigate bug electric-fix-resolve-pending-shapes 0% electric
Investigate performance electric-perf-array-filter-eval 0% electric
Design feature firezone-feat-portal-add-recent 0% firezone
Investigate bug firezone-fix-connlib-align-device 0% firezone
Design feature gitea-add-project-column-picker 0% gitea
Design feature gitea-feat-fast-forward-only 0% gitea
Investigate bug gitea-fix-codeql-code-scanning 0% gitea
Investigate bug gitea-fix-diff-highlight-overlap 0% gitea
Investigate bug gitea-fix-force-push-timeline 0% gitea
Investigate bug gitea-fix-incorrect-viewed-files 11% gitea
Design migration gitea-refactor-auth-middleware 78% gitea
Design feature harbor-add-agent-file-retention 11% harbor
Design feature harbor-add-multi-step-tasks 0% harbor
Design feature harbor-add-windows-tasks-support 11% harbor
Design feature harbor-refactor-optional-sandbox-deps 0% harbor
Design feature immich-feat-recently-added-assets 13% immich
Design feature immich-feat-release-candidate-detection 0% immich
Investigate bug immich-fix-server-live-photo 0% immich
Design feature paperless-ngx-feat-saved-view-sharing 29% paperless-ngx
Investigate performance paperless-ngx-perf-document-counts 11% paperless-ngx
Investigate performance paperless-ngx-perf-workflow-queries 89% paperless-ngx
Design feature paperless-ngx-refactor-task-system 0% paperless-ngx
Design feature plausible-feat-shared-dashboard-deeplink 0% plausible
Investigate bug plausible-fix-cross-site-resource-attach 0% plausible
Investigate bug plausible-fix-top-pages-comparison 33% plausible
Design feature posthog-feat-approval-gating 22% posthog
Design feature posthog-feat-llma-enable-tagger 67% posthog
Design feature posthog-feat-personhog-writer-add 0% posthog
Design feature posthog-feat-prompt-versioning 56% posthog
Design feature posthog-feat-schema-ingestion-block 0% posthog
Investigate bug posthog-fix-llm-gateway-add 0% posthog
Investigate bug posthog-fix-replay-buffering 0% posthog
Design feature prefect-add-isolated-workspace-resolver 0% prefect
Design feature prefect-feat-dbt-per-node-hardening 0% prefect
Investigate bug prefect-fix-resolve-race-condition 22% prefect
Design feature prefect-fix-subflow-cancellation 0% prefect
Design feature teleport-add-traits-matching-logic 44% teleport
Investigate bug teleport-fix-scoped-kube-clusters 0% teleport
Design feature turborepo-feat-add-circular-package 11% turborepo
Investigate bug turborepo-fix-preserve-package-json 0% turborepo
Investigate bug turborepo-fix-prune-missing-sources 0% turborepo
Investigate performance turborepo-perf-reuse-input-hashes 0% turborepo

Realistic vs. over-specified instructions

For comparison, here are two bug-task instructions from real-world source PRs. Senior SWE-Bench frames bugs as natural-language behavioral reports; SWE-Bench Pro spells out full reproduction steps and expected behavior. Behavioral testing lets the realistic version stay short without sacrificing reliable grading.




senior-swe-bench/instruction.md (549 chars, ~0 code symbols)

it looks like the PG replication slot lag is growing with no bound on prod stacks. the flush LSN we send back to the DB just stops advancing, but all shapes on the stack still seem to be working (clients are getting new data, storage is writing, nothing crashes, etc). Slot lag grows for hours until someone restarts the stack. always seems to start just after a txn whose changes spanned multiple WAL fragments. find why the global flush boundary is getting stuck and fix it. note: the upstream tracker should only see flush acks at txn boundaries.



swe-bench-pro/instruction.md (5,888 chars, ~32 code symbols)

# ImportAPI does not correctly split `publishers` and `publish_places` when the `publisher` field contains multiple locations

## Problem
When importing editions through `/api/import/ia` without a MARC record, if the Internet Archive `publisher` metadata contains several locations separated by `;` and a publisher separated by `:`, the entire string is stored in `publishers` and the `publish_places` field is left empty. In the Open Library data model:
* `publishers` should hold only the publisher name(s).
* `publish_places` should list the location(s).

## Reproducing the bug
1. Call the endpoint:
POST /api/import/ia
{ "identifier": "" }
2. View the created edition on Open Library.

* Expected behavior:
"publishers": ["Berlitz Publishing"],
"publish_places": ["London", "New York", "Paris"]
* Actual behavior:
"publishers": ["London ; New York ; Paris : Berlitz Publishing"]
// publish_places is missing

Requirements:
- The `get_ia_record` function should always return the `publishers` key as a list of strings, whether the original publisher value arrives as a single string or as a list, and should preserve the exact name(s) received.
- When processing the `isbn` field, `get_ia_record` should classify each value solely by length: 10-character entries go to `isbn_10`, 13-character entries go to `isbn_13`; any other length should be silently discarded, and leading or trailing spaces should be stripped.
- If the `publisher` value contains at least one `:`, `get_ia_record` should assign everything to the right of the first `:` to the `publishers` list and everything to the left (one or more locations separated by `;`) to `publish_places`, removing square brackets `[]` from both sides and preserving order. This split should be delegated to `openlibrary.plugins.upstream.utils.get_location_and_publisher`, which returns `(publish_places, publishers)`.
- The helper `get_colon_only_loc_pub` should return a tuple `(location, publisher)` when the input string contains exactly one `:`; if no `:` is present, the location should be an empty string and the entire trimmed input should be considered the publisher; if the input is empty, both elements should be empty strings. This helper should only trim characters listed in `STRIP_CHARS` and should not remove square brackets; its caller may handle bracket removal.
- `get_location_and_publisher` should return `([], [])` when the input is empty, not a string, or is a list, without raising exceptions in these cases.
- If the string includes the phrase “Place of publication not identified”, `get_location_and_publisher` should remove that phrase before further processing and then treat the remaining text normally.
- When the pattern is “location : publisher” and multiple segments are separated by `;`, `get_location_and_publisher` should collect all locations (segments before each `:`) into `publish_places` and each publisher name (segment immediately after each `:`) into `publishers`, maintaining original order. Square brackets `[]` should be removed from both locations and publishers.
- If a segment contains more than one `:` (an invalid case for the expected pattern), `get_location_and_publisher` should ignore anything after the second `:`, keeping only the first identified `location : publisher` pair extracted so far.
- When the string contains a comma `,` as the principal separator and lacks a `:`, `get_location_and_publisher` should assume no reliable location information is present and should return an empty locations list, assigning the portion after the comma (after removing square brackets and the unidentified-place phrase) to `publishers`.
- The utility `get_isbn_10_and_13` in `openlibrary/utils/isbn.py` should accept either a single string or a list of strings, strip any extra spaces, and classify values strictly by length (10 or 13 characters), returning both lists in a tuple; values of other lengths should not appear in the output. The function name should be imported from `openlibrary.utils.isbn` where used (e.g., in `openlibrary/plugins/importapi/code.py`), and should no longer be imported from `openlibrary.plugins.upstream.utils`.

New interfaces introduced:
1. Function `get_colon_only_loc_pub` — openlibrary/plugins/upstream/utils.py
Input: pair (str): a single “Location : Publisher” string.
Output: (location, publisher) (tuple[str, str]): part before the colon (trimmed with STRIP_CHARS) as location, part after as publisher.
Splits a simple “Location : Publisher” string into its two components. Returns `("", original_string_trimmed)` if no single colon is found. Leaves square brackets intact for the caller to handle.
2. Function `get_location_and_publisher` — openlibrary/plugins/upstream/utils.py
Input: loc_pub (str): an IA publisher metadata string, potentially containing multiple locations separated by `;` and one or more `location : publisher` pairs.
Output: (locations, publishers) (tuple[list[str], list[str]]): locations (trimmed, brackets removed) from before the colon(s); publishers (trimmed, brackets removed) from after.
Parses a compound “locations : publisher” string into ordered lists. Handles edge cases (empty/non-string/list input, the phrase “Place of publication not identified”, multiple colons) and falls back to the entire input as a single publisher when no `:` is present.
3. Function `get_isbn_10_and_13` — openlibrary/utils/isbn.py
Input: isbns (str | list[str]): an ISBN or list of ISBN strings with no hyphens.
Output: (isbn_10_list, isbn_13_list) (tuple[list[str], list[str]]): inputs of length 10 and 13 (after trimming).
Classifies raw ISBN metadata into ISBN-10 and ISBN-13 lists based solely on string length, without validation. Callers should import this from `openlibrary.utils.isbn`.

Illustrative samples. Senior SWE-Bench instructions read like an issue report on Slack; verifier-driven benchmarks lean on rigid, over-specified reproduction steps. Note: the two instructions shown are not for the same task.

Comparison to other benchmarks

Several recent benchmarks make progress on behavioral testing and instruction realism. The following table provides a brief comparison.

Senior SWE-Bench
  Task style and source: Real-world PRs
  Instruction realism: High (natural language message)
  Reward mechanisms:
  • Verifiers (behavioral)
  • Validation agent
  • Task rubrics
  • Taste judge
  Open source: Yes

SWE-Bench Pro
  Task style and source: Real-world PRs
  Instruction realism: Low (full specs)
  Reward mechanisms:
  • Verifiers (implementation-specific)
  • Rubric
  Open source: Yes

DeepSWE
  Task style and source: Invented tasks in real repos
  Instruction realism: Mixed (some full specs)
  Reward mechanisms:
  • Verifiers (behavioral)
  Open source: Yes

FrontierCode
  Task style and source: Real-world PRs
  Instruction realism: Unknown (examples are mixed)
  Reward mechanisms:
  • Verifiers (behavioral)
  • LLM-adapted verifiers
  • Agent-written tests (reverse)
  • Code quality judge
  Open source: No

ProgramBench
  Task style and source: Full program recreation
  Reward mechanisms:
  • Verifiers (behavioral)
  Open source: No

Methodology

Metrics

Scores are Pass@1 on the Mini-SWE-Agent harness (Harbor-compatible). Tasteful Solve Rate requires all six gates (verifiers, validation, rubric, bloat, practice, and relative taste) to pass simultaneously; Basic Solve Rate drops the taste-related gates and measures correctness only.
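The relationship between the two metrics can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scoring code: the gate names mirror the six gates listed above, while the `GateResults` record and the `solve_rates` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass

# Gate names follow the six gates named in the text; the record layout is an assumption.
CORRECTNESS_GATES = ("verifiers", "validation")
TASTE_GATES = ("rubric", "bloat", "practice", "relative_taste")


@dataclass(frozen=True)
class GateResults:
    """Pass/fail outcome of each gate for one trial (hypothetical structure)."""
    verifiers: bool
    validation: bool
    rubric: bool
    bloat: bool
    practice: bool
    relative_taste: bool

    def basic_solve(self) -> bool:
        # Basic Solve: correctness gates only.
        return all(getattr(self, gate) for gate in CORRECTNESS_GATES)

    def tasteful_solve(self) -> bool:
        # Tasteful Solve: all six gates must pass on the same trial.
        return self.basic_solve() and all(getattr(self, gate) for gate in TASTE_GATES)


def solve_rates(trials: list[GateResults]) -> tuple[float, float]:
    """Return (basic_rate, tasteful_rate) as fractions of all trials."""
    if not trials:
        raise ValueError("no trials to score")
    basic = sum(t.basic_solve() for t in trials)
    tasteful = sum(t.tasteful_solve() for t in trials)
    return basic / len(trials), tasteful / len(trials)


trials = [
    GateResults(True, True, True, True, True, True),        # passes all six gates
    GateResults(True, True, True, False, True, True),       # correct but fails the bloat gate
    GateResults(True, False, True, True, True, True),       # fails the validation tests
    GateResults(False, False, False, False, False, False),  # fails everything
]
print(solve_rates(trials))  # (0.5, 0.25)
```

Because a Tasteful Solve is defined as a Basic Solve plus four more gates, the tasteful rate can never exceed the basic rate for the same set of trials.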

Validation Agent

For feature tasks, an agent (Mini-SWE-Agent with Claude Sonnet 4.6) writes behavioral tests adapted to each submitted solution using an expert-authored recipe. Each task is calibrated by running the validation 3× on the oracle patch and 3× on a no-op patch; the task is rejected if pass³ < 1 on oracle or pass³ > 0 on no-op. The validation agent adds 6–20% wall-clock overhead (median 11%) and 2–16% token-cost overhead (median 6%), measured on Claude Opus 4.8 trials. In practice, fewer than 5% of trials are discarded.
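The calibration rule can be sketched as follows. This is an illustration under one assumed reading of the rule (every oracle run must pass and no no-op run may pass); `calibrate` and the `run_validation` callable are hypothetical names, not part of the benchmark's tooling.

```python
from typing import Callable

RUNS = 3  # the text specifies 3 runs per condition


def calibrate(run_validation: Callable[[str], bool], runs: int = RUNS) -> bool:
    """Return True if a task's validation recipe is accepted.

    `run_validation(patch_kind)` is a hypothetical callable that runs the
    validation agent once against the "oracle" or "noop" patch and reports
    whether the generated behavioral tests passed.
    """
    oracle = [run_validation("oracle") for _ in range(runs)]
    noop = [run_validation("noop") for _ in range(runs)]
    # Assumed reading: accept only if all oracle runs pass and no no-op run passes.
    return all(oracle) and not any(noop)


def good_recipe(patch_kind: str) -> bool:
    return patch_kind == "oracle"  # passes only when the real fix is applied


def lenient_recipe(patch_kind: str) -> bool:
    return True  # passes even when nothing was changed


print(calibrate(good_recipe))     # True
print(calibrate(lenient_recipe))  # False
```

The no-op check guards against recipes that produce tests too weak to notice a missing change; the oracle check guards against recipes that reject a known-good solution.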

Taste Judge

An LLM judge grades each patch against the expert reference solution along two axes: relative code quality (minimality, approach, hygiene, fluency, craftsmanship) and codebase practice alignment (style consistency, pattern adherence, library usage, abstraction level, documentation fit). Thresholds are set conservatively (any score > 2/5) and calibrated against human reviewers.
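A threshold check of this kind might look like the sketch below. It assumes each criterion is scored 1 to 5 and that an axis passes only when every criterion scores above 2; the text does not say whether the threshold applies per criterion or to an aggregate, so that reading, the dict-based score format, and the function names are all assumptions.

```python
# Criteria names come from the text; everything else here is illustrative.
RELATIVE_QUALITY = ("minimality", "approach", "hygiene", "fluency", "craftsmanship")
CODEBASE_PRACTICE = (
    "style_consistency",
    "pattern_adherence",
    "library_usage",
    "abstraction_level",
    "documentation_fit",
)
PASS_ABOVE = 2  # "any score > 2/5", read here as a per-criterion floor (assumption)


def axis_passes(scores: dict[str, int], criteria: tuple[str, ...]) -> bool:
    """Return True if every criterion on the axis scores above the threshold."""
    for name in criteria:
        score = scores[name]  # raises KeyError if the judge omitted a criterion
        if not 1 <= score <= 5:
            raise ValueError(f"{name} score {score} is outside the 1-5 scale")
    return all(scores[name] > PASS_ABOVE for name in criteria)


def taste_gates(scores: dict[str, int]) -> dict[str, bool]:
    """Map judge scores onto the two taste-judge gates."""
    return {
        "relative_taste": axis_passes(scores, RELATIVE_QUALITY),
        "practice": axis_passes(scores, CODEBASE_PRACTICE),
    }


scores = {name: 4 for name in RELATIVE_QUALITY + CODEBASE_PRACTICE}
scores["hygiene"] = 2  # one weak criterion on the quality axis
print(taste_gates(scores))  # {'relative_taste': False, 'practice': True}
```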

Quality Control

Every task passes three layers of review: automated LLM-based checks, research-team review for overall design and implementation quality, and SWE-expert review via the Snorkel AI expert network using an extensive rubric. Each task includes a "guided" variant whose instruction adds optional hints (useful for performance diagnosis or curriculum learning) without prescribing the solution.

What the results show

Three patterns emerge from the leaderboard. None depends on the absolute scores; all are about the gap between correctness and senior-level taste.

1. Taste opens a 2–6× gap.

Frontier agents pass basic correctness (verifiers + validation tests) on 19–55% of tasks but earn a Tasteful Solve on only 3–24%. Every model loses 43–84% of its basic-solve credit when the taste, bloat, and codebase-practice gates are applied.

2. Correctness and taste are different skills.

GPT-5.5 wins on Basic Solve at 55.0%, but Claude Opus 4.8 wins on Tasteful Solve at 24.0%: the rankings flip between the two metrics. Models that write the most runtime-passing patches don't always write the most senior-grade ones.

3. Even the top model misses senior taste 3 out of 4 times.

Claude Opus 4.8 leads at 24.0% Tasteful Solve, meaning even the best agent falls short of the bar a senior engineer would hold on 76% of tasks, whether by failing runtime tests or by passing them with a patch that fails a quality gate.
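The figures quoted in these findings can be re-derived from the leaderboard. The sketch below copies each model's Tasteful and Basic Solve Rates from the table; `credit_lost` is a helper name introduced here for illustration.

```python
# (tasteful %, basic %) per model, copied from the leaderboard.
LEADERBOARD = {
    "claude-opus-4-8": (24.0, 42.0),
    "claude-sonnet-5": (19.7, 45.5),
    "gpt-5-5": (16.0, 55.0),
    "claude-opus-4-7": (14.1, 40.4),
    "gpt-5-4": (14.0, 49.0),
    "glm-5-2": (12.5, 31.3),
    "kimi-k2-6": (8.2, 23.7),
    "claude-sonnet-4-6": (8.2, 31.6),
    "gemini-3-1-pro": (6.1, 26.3),
    "gemini-3-5-flash": (3.0, 19.0),
}


def credit_lost(tasteful: float, basic: float) -> float:
    """Fraction of basic solves that do not survive the taste-related gates."""
    return 1 - tasteful / basic


losses = {model: credit_lost(t, b) for model, (t, b) in LEADERBOARD.items()}
ratios = {model: b / t for model, (t, b) in LEADERBOARD.items()}

# Finding 1: the gap between basic and tasteful solve rates.
print(f"{min(ratios.values()):.2f}x to {max(ratios.values()):.2f}x")  # 1.75x to 6.33x
print(f"{min(losses.values()):.0%} to {max(losses.values()):.0%}")    # 43% to 84%

# Finding 2: a different model leads on each metric.
best_tasteful = max(LEADERBOARD, key=lambda m: LEADERBOARD[m][0])
best_basic = max(LEADERBOARD, key=lambda m: LEADERBOARD[m][1])
print(best_tasteful, best_basic)  # claude-opus-4-8 gpt-5-5
```

The smallest gap belongs to Claude Opus 4.8 (42% basic vs. 24% tasteful) and the largest to gemini-3-5-flash (19% vs. 3%), which is where the rounded 2–6× range comes from.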

Resources

Blog

Website

GitHub

Task viewer


Acknowledgments

Senior SWE-Bench is led by Henry Kiss Ehrenberg with contributions from Vincent Sunn Chen at Snorkel AI; Austin W. Hanjie and Karthik Narasimhan at Princeton University; and Gabriel Orlanski and Frederic Sala at the University of Wisconsin–Madison.

All tasks are created and reviewed by contributing research staff and software engineers from the Snorkel AI expert network, in concert with specialized coding and evaluation agents. The benchmark is open source and Harbor-compatible.
