Open Benchmarks Grants

SlopCode Bench

A benchmark measuring code quality degradation in AI-assisted codebases. It tracks checkpoint solve rates, erosion (the share of complexity concentrated in high-complexity functions), and verbosity to evaluate whether models produce correct, clean code under realistic conditions.

Overview

SlopCodeBench (SCBench) is a benchmark designed to evaluate coding agents the way real software actually gets built: through repeated requirement changes and extensions. Instead of treating the spec as a one-shot oracle, each task is a sequence of checkpoints where an agent implements an initial version, then extends its own solution multiple times as new requirements arrive.

The v1.0 release includes 36 problems with 196 total checkpoints, evaluated in a black-box setting where only a CLI or API contract is given. No architecture, function signatures, or module boundaries are prescribed, so early design decisions can meaningfully help or hurt later work.

Leaderboard

Showing best version by % Checkpoints.

Model	Harness	Version	Strict Solve %	Iso Solve %	Core Solve %	$/CKPT	Erosion	Verbosity	% AST-Grep	% Cloned
GPT 5.5 (High)	Codex	0.124.0	14.29	28.06	65.31	$1.51	0.494	0.269	0.249	0.047
GPT 5.3-Codex (High)	Codex	0.98.0	11.22	26.02	59.18	$0.69	0.644	0.336	0.314	0.069
GPT 5.4 (High)	Codex	0.110.0	10.71	23.47	61.22	$0.82	0.508	0.273	0.240	0.058
GPT 5.2-Codex (High)	Codex	0.93.0	9.69	21.94	54.59	$0.85	0.728	0.398	0.364	0.097
Opus 4.6 (High)	Claude Code	2.1.32	9.69	20.92	65.31	$3.17	0.737	0.318	0.288	0.103
Opus 4.7 (High)	Claude Code	2.1.111	8.16	20.92	64.29	$2.17	0.759	0.357	0.327	0.084
Kimi K2.6 (High)	Kimi CLI	1.37.0	10.71	18.88	51.02	$0.74	0.764	0.399	0.359	0.129
Opus 4.5 (High)	Claude Code	2.0.51	9.18	17.35	56.12	$2.53	0.691	0.297	0.294	0.091
Sonnet 4.6 (High)	Claude Code	2.1.44	7.14	16.84	56.12	$1.96	0.741	0.316	0.298	0.093
Composer 2	Cursor CLI	2026.04.13-a9d7fb5	6.12	16.33	51.53	$0.44	0.716	0.353	0.318	0.107
GLM 5.1 (High)	Claude Code	2.1.44	9.69	13.78	38.78	$1.47	0.684	0.322	0.301	0.096
GPT 5.4-Mini (High)	Codex	0.110.0	5.10	13.78	51.02	$0.45	0.655	0.330	0.305	0.076
Kimi K2.5 (High)	Kimi CLI	1.37.0	4.59	9.69	39.80	$0.33	0.712	0.309	0.306	0.094
Kimi K2.5	OpenCode	1.4.3	4.59	8.67	31.12	$0.53	0.702	0.319	0.297	0.117
GLM 5.1 (High)	OpenCode	1.4.3	5.61	8.16	20.41	$0.59	0.550	0.387	0.329	0.145
GPT 5.3-Codex-Spark (High)	Codex	0.100.0	3.06	8.16	29.08	$0.20	0.586	0.357	0.340	0.086
Kimi K2.5 (High)	Claude Code	2.1.44	3.57	7.14	28.06	$1.07	0.692	0.310	0.301	0.097
MiniMax M2.7 (High)	Claude Code	2.1.44	2.55	4.08	28.57	$0.33	0.500	0.265	0.227	0.108
MiniMax M2.7	OpenCode	1.4.3	1.53	3.57	20.92	$0.27	0.746	0.418	0.379	0.146

Why iterative evaluation

Aider and SWE-Bench evaluate an agent’s ability to solve an issue given a frozen repository. This is undoubtedly an important capability, but it captures a single point in time. An agent could produce a fix that is entirely viable yet utterly different from the ground truth, and that fix would fundamentally change how a developer approaches the inevitable extension. Measuring qualitative metrics at a single snapshot therefore yields a noisy signal that is scaffolded by prior human decisions. Furthermore, agents are not evaluated on long-horizon coding tasks, where they must either live with or redesign their original choices. Viewing agentic benchmarks as iterative processes is the only way to evaluate the true nature of software engineering.

We must adopt this framing both now and for the future of agentic coding. Much of the recent discourse on agentic coding tools has focused on the “slop” they generate (verbose comments, defensive coding, bloat). While “slop” is ill-defined, these grievances point squarely at the limitations of single-iteration benchmarks. Code riddled with these issues is hard to understand and maintain. The same holds for structural problems introduced by models: minor modifications often end in a rewrite of the entire codebase because rewriting is easier than extending agent-written code. Iterative benchmarks like SCBench are crucial for truly autonomous SWE agents. Without them, we would have no way to measure an agent’s ability to function autonomously given only specification updates, because it is impossible to know every required feature or extension from the outset.

Design principles

None of this would be possible without deliberate design choices in benchmark construction:

No prescribed interfaces

All that is provided is the external contract of either the CLI interface or the API endpoints and response formats. Agents select the underlying architecture and the approach to solving the problem. Providing a function signature or other internal hints would mask the signal we want to measure.

No explicit test cases or test suite

The model only sees the examples in the spec and the explanation of the behaviors. Part of eroding code quality is the inability to think of obvious edge cases for a spec. Thus, we require the agent to identify and handle the specified edge cases.

Black-box, language-agnostic evaluation

Solutions are judged purely on the outputs they produce, given an input. Each problem includes normalization code to ensure that minor arbitrary decisions, such as white-space formatting, do not affect the solution’s correctness.
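
To make the idea of normalization concrete, here is a minimal sketch of what such a step could look like. It is an illustration only, not the benchmark's actual normalization code: the function names `normalize_output` and `outputs_match` are hypothetical, and the two rules shown (canonical JSON, trailing-whitespace stripping) are assumed examples of "minor arbitrary decisions".

```python
import json


def normalize_output(raw):
    """Canonicalize program output so cosmetic differences do not affect grading.

    Hypothetical helper: real problems may normalize differently.
    """
    text = raw.strip()
    try:
        # JSON output: re-serialize with sorted keys and fixed separators,
        # so key order and spacing no longer matter.
        return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))
    except json.JSONDecodeError:
        # Plain text: drop trailing whitespace on every line.
        return "\n".join(line.rstrip() for line in text.splitlines())


def outputs_match(expected, actual):
    """Compare two outputs after normalization."""
    return normalize_output(expected) == normalize_output(actual)
```

With this sketch, `outputs_match('{"b": 1, "a": 2}', '{"a":2, "b":1}')` holds, while outputs that differ in content still fail the comparison.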

Problem catalog · All 36 v1.0 problems

Categories: developer-tools, web, data-processing, cli-tools, configuration-management, dsl, algorithms, simulation, databases, networking, file-systems

Difficulty: easy, medium, hard

cfgpipe

configuration-management

easy

CLI configuration resolver that reads a JSON schema, resolves typed parameters from prioritized sources (default, env, file, primary/secondary stores, args), supports nested groups, watch mode with structured change events, advanced types (duration, pattern, map, list, redacted), and store prefix composition.

Multi-language code search tool (inspired by ast-grep) that finds patterns and applies refactorings. Starts with regex search in Python, adds AST-based pattern matching with metavariables, then auto-fix with conflict resolution. Supports Python, JS, C++, Rust, Java, Go, Haskell.

HTTP service for ingesting tabular files from URLs/uploads into queryable datasets with pagination, sorting, filtering, export, caching, config-based runtime controls, access control, and optional enrichment metadata.

Declarative system provisioning planner CLI that validates module configs and generates deterministic execution plans across macOS and Linux. Includes package/app installs, file actions, preferences, dock configuration, language runtime environments, profile manifests, and standalone build script generation.

CLI that parses and executes ETL pipelines defined in JSON. Supports select, filter, map, rename, and limit operations with a custom expression language. Adds conditional branching, reusable sub-pipelines with parameters, and a library system for modular definitions.

HTTP server that executes shell commands and returns results. Adds file tracking with globs, multi-format output support, command chains with hooks, caching, persistent environments with concurrency modes, and job scheduling with queues, templates, and dependency graphs.

Backup scheduler CLI that reads YAML configs to run scheduled backup jobs (daily/weekly/once) with glob exclusions. Supports full backups, tar packing, verification mode, and incremental backups using SHA-256 to skip unchanged files.

Command-line resource broker inspired by OpenStack Cyborg concepts. Manages blueprints, allocations, units, modules, tags, revision gating, and admin workflows with strict JSON contracts.

Command-line tool that converts LaTeX source files to KaTeX-compatible Markdown.

Config migration CLI that applies transformation rules to JSON/YAML/TOML/INI files. Supports value replacement, key renaming, pattern matching, array filtering, config inheritance with cycle detection, file relocation, and pre-transformation validation.

Self-hosted text sharing HTTP service with strict boundary validation, markdown + metadata rendering, TOC/preview generation, static docs/assets, lifecycle auth/drain controls, signed per-user cookies, and pluggable local/object storage with startup/runtime failure contracts.

CLI query tool for XML/HTML/JSON with XPath and CSS selectors, text extraction modes, file input precedence, and smart XML/JSON output formatting.

CLI tool for digital circuit evaluation and optimization. Parses scalar and vector circuits in .circ, .json, and .bench formats. Evaluates with 2-valued and 3-valued logic, generates truth tables, checks equivalence, and optimizes circuits with configurable passes.

SQLite migration CLI. Starts with basic DDL (create table, add/drop columns), adds data transformations and backfills, then foreign keys/indexes/check constraints with rollback support, and finally dependency management with topological sorting and cycle detection.

By Albert Ge

dynamic-config-service-api

web

medium

REST API for versioned configs with inheritance and deep-merge. Adds JSON Schema validation and multi-format input (JSON/YAML/TOML), then approval workflows with drafts and quorum-based review, and finally OPA/Rego policy enforcement.

Jump Freighter route planner for EVE Online. Calculates optimal routes with fuel costs, jump fatigue, and 3D spatial distances. Adds cloak-and-jump mechanics for extended range and handles high-sec destinations by finding nearby low-sec entry points.

EVE Online route planner with realistic warp physics (acceleration/deceleration, gate locks). Adds cargo hauling with manifests and multi-trip planning, then contract optimization to select the most profitable jobs given time and ISK/jump constraints.

CLI that merges data files (CSV, TSV, JSONL, Parquet) into sorted, partitioned CSV output. Handles schema alignment, compression, and external sorting. Adds Hive-style partitioning, file sharding, and nested type support.

SQL engine for querying data files (CSV, Parquet, TSV, JSON). Supports joins, aggregations, filtering, glob patterns for sharded tables, window functions (ROW_NUMBER, RANK, etc.), CTEs, and subqueries.

By Gabriel Orlanski

layered-config-synthesizer

configuration-management

medium

CLI that merges layered YAML/JSON configs for ML training with deterministic conflict resolution. Adds fragment expansion, env var interpolation, multi-run manifest processing, and JSON Schema validation. Outputs canonical JSON with sorted keys.

NDJSON query engine with custom SQL-like syntax. Implements filtering, aggregations with GROUP BY, multi-source joins (CONFLATE), schema mapping via GLOSS labels, and subqueries with custom keywords (POCKET, BEHOLDS, etc.).

CLI tool for creating and maintaining local vaults that archive content metadata from an online media platform. Tracks timestamped field-level history across three catalog schema versions, supports selective sync with media downloads, format-aware digest reports, a local HTTP viewer, and annotation-triggered auto-migration.

Interactive CLI password manager with encrypted local vault storage, master key unlock flow, search/add/edit/delete operations, category management, clipboard integration, tab completion, import/export, and vault locking controls.

REST API for storing ML agent trajectories with token/cost tracking, search, and reports. Adds mutable trajectories with ETag-based concurrency, forking with lineage tracking, EBNF grammar parsing for tool call extraction, and sandboxed Python/Bash execution.

Workflow orchestration system with a custom DSL for defining DAGs of tasks with dependencies and parameters. Includes a parser, execution engine, and JSONL logging. Adds caching with content-hashing and time-based strategies, then dynamic cache overrides per-task.

Code generator that infers data transformations from input/output examples and emits working code in Python, JS, C++, or Rust. Handles filtering, column ops, stateful transforms (prefix sums, sliding windows), and window functions. Generated code streams data with fixed buffers.

EVE Online manufacturing planner that parses the SDE to compute recipes, material costs, and build times. Adds invention probability calculations, ME/TE efficiency with waste tracking, full build planning with job scheduling, and recursive build-all with automatic job splitting.

REST API for EVE Online market data. Ingests market orders, builds price books, and provides regional stats and hub comparisons. Adds reprocessing yield calculations, minimum-cost ore optimization across hubs, and profit-finding for arbitrage and hauling.

CLI tool for managing distributed cache mesh resources through declarative YAML specifications. Validates specs, applies defaults, persists state, and reports structured JSON to stdout.

By Gabriel Orlanski

metric-transform-lang

dsl

hard

Interpreter for MTL, a DSL for processing event streams. Handles CSV/TSV/Parquet input, aggregations, window functions (lag/lead), joins with temporal constraints, and resumable execution. Output is deterministic JSON.

Multi-protocol mock server with YAML-defined behaviors and admin controls.

CLI framework with hierarchical command dispatch, argument validation, YAML configuration with inheritance, aliases, output formatting, file caching, SQLite persistence, container orchestration, version upgrade infrastructure, and system requirement checks.

A synthetic data generation pipeline that maximizes throughput against a rate-limited LLM API. Supports multiple task types, generation schemes, in-context learning setups, agentic tool-call loops, and multi-provider routing.

CLI spreadsheet grader for .xlsx answer keys and student submissions. Supports typed literal checks, tolerance/alternates/penalties, formula grading, dependencies/fatal/concealed controls, minimum thresholds, check-mode scenario verification, and HTML report rendering.

Advanced Python code intelligence CLI for static and interpreter-assisted analysis. Supports completion, inference, goto-definition, references, signatures, project search, refactors (rename/inline/extract), syntax diagnostics, environment discovery, scope context reporting, and project-level configuration/settings overrides.

CLI test-harness translation engine that generates and runs language-specific tester files (Python, JavaScript, TypeScript) from a structured tests.py spec, with line-based test discovery, deep equality checking, JSON result output, and strict generate-before-test enforcement.

By Gabriel Orlanski

Methodology

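Strict Solve % (CKPT Solved)

The checkpoint and all prior checkpoints are solved.

Iso Solve % (Isolated Solved)

Passes the tests for the checkpoint itself, without counting regression tests from earlier checkpoints.

Core Solve % (Core Solved)

Passes the core tests for a checkpoint.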
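
The relationship between the three solve metrics can be sketched as follows. This is an assumption-laden illustration rather than the benchmark's scoring code: the `CheckpointResult` record and its three flags are hypothetical, and "solved" is taken to mean that a checkpoint passes both its own tests and the tests inherited from earlier checkpoints. The sketch scores one ordered checkpoint sequence, whereas the leaderboard aggregates over all checkpoints.

```python
from dataclasses import dataclass


@dataclass
class CheckpointResult:
    passed_core: bool        # core tests for this checkpoint
    passed_own: bool         # every test written for this checkpoint
    passed_regression: bool  # tests inherited from earlier checkpoints


def solve_rates(results):
    """Return (strict %, isolated %, core %) for one ordered checkpoint sequence."""
    strict = isolated = core = 0
    chain_intact = True  # True only while every checkpoint so far is solved
    for r in results:
        chain_intact = chain_intact and r.passed_own and r.passed_regression
        strict += chain_intact
        isolated += r.passed_own
        core += r.passed_core
    n = len(results)
    if n == 0:
        return (0.0, 0.0, 0.0)
    return tuple(round(100 * k / n, 2) for k in (strict, isolated, core))


example = [
    CheckpointResult(True, True, True),   # fully solved
    CheckpointResult(True, True, False),  # new behavior works, old behavior regressed
    CheckpointResult(True, False, True),  # only the core tests pass
]
print(solve_rates(example))  # (33.33, 66.67, 100.0)
```

Once one checkpoint fails, the strict count cannot recover for later checkpoints, which is why the strict figure is never higher than the isolated one under these assumptions.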

$ / CKPT

Average USD cost per checkpoint

Erosion

Fraction of total complexity mass in high-complexity functions (CC > 10), where mass(f) = CC(f) × √SLOC(f). 0 = no high-complexity functions, 1 = all mass in high-CC functions.
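
The definition above translates directly into a few lines of code. The sketch below assumes that cyclomatic complexity (CC) and source lines of code (SLOC) have already been measured per function by some external tool; the function name `erosion` and the `(cc, sloc)` pair format are illustrative choices, not the benchmark's API.

```python
import math


def erosion(functions, cc_threshold=10):
    """Share of complexity mass held by functions with CC above the threshold.

    functions: iterable of (cyclomatic_complexity, sloc) pairs, one per function.
    mass(f) = CC(f) * sqrt(SLOC(f)), as in the definition above.
    """
    total = 0.0
    high = 0.0
    for cc, sloc in functions:
        mass = cc * math.sqrt(sloc)
        total += mass
        if cc > cc_threshold:  # strictly greater than 10 counts as high-complexity
            high += mass
    return high / total if total else 0.0


# Three functions with masses 2*2 = 4, 3*3 = 9, and 12*4 = 48.
# Only the last exceeds CC 10, so erosion = 48 / 61.
print(round(erosion([(2, 4), (3, 9), (12, 16)]), 3))  # 0.787
```

Because mass grows with both CC and the square root of SLOC, a single long, branch-heavy function can dominate the score even in a codebase of many small helpers.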

Verbosity

Union of AST-Grep flagged lines and clone lines divided by LOC. Bounded [0, 1].

% AST-Grep

Percentage of lines flagged by AST-Grep rules for wasteful code patterns.

% Cloned

Percentage of lines that are structural duplicates (clone lines / LOC).
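
The three line-based metrics share a denominator, and Verbosity counts a line once even when it is both flagged and cloned. A minimal sketch under stated assumptions: lines are identified by plain integers for a single file (a real codebase would need file-and-line pairs), and the function name `verbosity_metrics` is hypothetical.

```python
def verbosity_metrics(flagged_lines, clone_lines, loc):
    """Compute % AST-Grep, % Cloned, and Verbosity as fractions of LOC.

    flagged_lines: line numbers flagged by AST-Grep rules
    clone_lines:   line numbers that belong to structural duplicates
    loc:           total lines of code
    """
    flagged = set(flagged_lines)
    cloned = set(clone_lines)
    if loc <= 0:
        return {"ast_grep": 0.0, "cloned": 0.0, "verbosity": 0.0}
    return {
        "ast_grep": len(flagged) / loc,
        "cloned": len(cloned) / loc,
        # Union, not sum: a line that is both flagged and cloned counts once.
        "verbosity": len(flagged | cloned) / loc,
    }


metrics = verbosity_metrics(flagged_lines=[3, 4, 10], clone_lines=[10, 11], loc=20)
print(metrics)  # ast_grep 0.15, cloned 0.1, verbosity 0.2
```

The union explains why Verbosity in the leaderboard is never larger than the sum of the % AST-Grep and % Cloned columns.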

What the AST-Grep rules look for

The % AST-Grep metric scores generated code against 341 named slop patterns (205 unique rule types after deduplication) defined in configs/slop_rules.yaml. Each rule pairs an AST-Grep pattern with a human-readable diagnosis. Diagnosis text is quoted verbatim from the YAML. (The file has 14 additional work-in-progress entries we exclude from these counts.)

Production patterns: 341

Unique rules: 205

Severity levels: warning (331), info, hint

Language scope: Python

chained-comparison-opportunity

warning

Use chained comparison (e.g., a < b < c) instead of 'and'

isinstance-return-ladder

warning

Long isinstance/elif ladder returning simple values; prefer a dispatch table or polymorphism.

json-dumps-then-loads

warning

json.loads(json.dumps(x)) is noisy; copy the structure directly

nested-if-no-else

warning

Nested if statements without else - consider flattening or combining conditions

manual-min-max

warning

Manual min/max logic - use built-in min() or max()

for-range-len

warning

range(len(seq)) loop suggests index juggling; prefer enumerate
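
To show what these diagnoses mean in practice, the sketch below pairs a verbose form with a tighter equivalent for four of the rules listed above. The function names are invented for illustration, and whether the actual AST-Grep rules fire on exactly these shapes is an assumption.

```python
# chained-comparison-opportunity
def in_range_verbose(lo, x, hi):
    return lo < x and x < hi

def in_range(lo, x, hi):
    return lo < x < hi


# manual-min-max
def at_least_verbose(value, floor):
    if value < floor:
        return floor
    return value

def at_least(value, floor):
    return max(value, floor)


# for-range-len
def label_verbose(items):
    out = []
    for i in range(len(items)):
        out.append(f"{i}: {items[i]}")
    return out

def label(items):
    return [f"{i}: {item}" for i, item in enumerate(items)]


# nested-if-no-else
def can_ship_verbose(order):
    if order["paid"]:
        if order["in_stock"]:
            return True
    return False

def can_ship(order):
    return bool(order["paid"] and order["in_stock"])
```

Each pair behaves identically on ordinary inputs; the second form simply expresses the same logic in fewer lines.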

Sample rule:

chained-comparison-opportunity

$A < $B and $B < $C
$A > $B and $B > $C

AST-Grep matches the pattern above; the rule fires on each match and contributes to the % AST-Grep score.
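
For readers unfamiliar with structural matching, the idea behind the pattern can be approximated with Python's standard `ast` module. This is a stand-in for illustration, not AST-Grep and not the benchmark's implementation: the helper names are hypothetical, and the metavariable `$B` is emulated by requiring the two middle operands to have identical syntax trees.

```python
import ast


def _is_single_compare(node):
    """True for a comparison with exactly one operator, such as `a < b`."""
    return isinstance(node, ast.Compare) and len(node.ops) == 1


def find_chained_comparison_opportunities(source):
    """Return line numbers of `A < B and B < C` or `A > B and B > C` expressions."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.BoolOp) and isinstance(node.op, ast.And)):
            continue
        # Check each adjacent pair of operands joined by `and`.
        for left, right in zip(node.values, node.values[1:]):
            if not (_is_single_compare(left) and _is_single_compare(right)):
                continue
            same_op = type(left.ops[0]) is type(right.ops[0])
            is_lt_or_gt = isinstance(left.ops[0], (ast.Lt, ast.Gt))
            # The shared middle operand plays the role of $B.
            shares_middle = ast.dump(left.comparators[0]) == ast.dump(right.left)
            if same_op and is_lt_or_gt and shares_middle:
                hits.append(node.lineno)
    return sorted(hits)


sample = (
    "x = 1 < a and a < 10\n"   # matches the first pattern
    "y = a > b and b > c\n"    # matches the second pattern
    "z = a < b and c < d\n"    # no shared middle operand
    "w = a < b and b > c\n"    # mixed operators
)
print(find_chained_comparison_opportunities(sample))  # [1, 2]
```

Matching on syntax trees rather than text is what lets such a rule ignore spacing and variable names while still requiring the same expression on both sides of `and`.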

Acknowledgments

The benchmark is led by Gabriel Orlanski (University of Wisconsin–Madison) with support from DARPA, NSF and Snorkel AI through the Open Benchmarks Grants Program.

FAQs

SlopCodeBench reports the average model and agent cost required to complete each checkpoint. Tracking the cost across a problem shows whether continued development becomes more resource-intensive as the codebase grows and earlier design decisions constrain later work. Cost should be read alongside solve rates and code-quality metrics because a cheaper trajectory may also complete less of the specification.

Extension robustness describes whether software can absorb new requirements without requiring major rewrites, breaking previous behaviour, or accumulating structural problems.

SlopCodeBench measures extension robustness by placing increasing architectural pressure on the agent’s original implementation. A design that appears adequate during an early checkpoint may become difficult to extend once later requirements introduce new formats, behaviours, or execution paths.

Strict Solve requires the current checkpoint and every previous checkpoint to pass. Isolated Solve evaluates the current checkpoint without counting failures in regression tests from earlier checkpoints. Core Solve checks only the behaviour explicitly described or demonstrated in the current specification.

The three scores distinguish regression-free progress from partial implementation and cascading failures caused by an earlier checkpoint.

Hidden tests prevent the evaluation from revealing internal interfaces, architectural hints, or expected implementation details. Agents receive the written specification and its examples, but they do not see the complete test suite or receive test feedback during implementation.

The setup requires agents to reason about edge cases from the external contract instead of designing their code around visible tests.

SWE-bench evaluates whether an agent can resolve an issue inside an existing open-source repository. The repository’s architecture and much of its development history were created by human engineers before the agent begins working.

SlopCodeBench starts the agent with an empty workspace and lets it choose the architecture. Later checkpoints then test whether the agent can build on its own earlier decisions and live with their consequences. The benchmark therefore targets iterative software development rather than a single repository-level repair.

The SlopCodeBench leaderboard reports results for complete model and agent-harness configurations. A model’s performance can change when it runs through a different harness, reasoning setting, prompt, or harness version.

Strict Solve measures regression-free completion, while Isolated Solve and Core Solve reveal partial progress. Erosion and verbosity describe the resulting code, and cost per checkpoint captures the expense of continuing the development trajectory. Readers should compare these dimensions together rather than treating one score as a complete ranking of coding-agent performance.
