SlopCode Bench

A benchmark measuring code quality degradation in AI-assisted codebases. Tracks checkpoint solve rates, erosion (code bloat ratio), and verbosity to evaluate whether models produce correct and clean code under realistic conditions.

Overview

SlopCodeBench (SCBench) is a benchmark designed to evaluate coding agents the way real software actually gets built: through repeated requirement changes and extensions. Instead of treating the spec as a one-shot oracle, each task is a sequence of checkpoints where an agent implements an initial version, then extends its own solution multiple times as new requirements arrive.

The v1.0 release includes 36 problems with 196 total checkpoints, evaluated in a black-box setting where only a CLI or API contract is given. There is no prescribed architecture, function signature, or module boundary, so early design decisions can meaningfully help or hurt later work.
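The checkpoint structure described above can be pictured with a minimal sketch. It is not the SCBench harness: the `Checkpoint`, `Problem`, and `run_problem` names, the string-based "solution", and the toy agent are all hypothetical and exist only to show how each step builds on the agent's own previous output.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Checkpoint:
    spec: str                      # requirement text revealed at this step
    check: Callable[[str], bool]   # black-box pass/fail check (hypothetical)


@dataclass
class Problem:
    name: str
    checkpoints: List[Checkpoint]


def run_problem(problem: Problem, agent: Callable[[str, str], str]) -> List[bool]:
    """Feed checkpoints in order; the agent always extends its own prior solution."""
    solution = ""
    results = []
    for ckpt in problem.checkpoints:
        solution = agent(solution, ckpt.spec)   # no reset between checkpoints
        results.append(ckpt.check(solution))
    return results


def toy_agent(previous: str, spec: str) -> str:
    # Stand-in for a real coding agent: it just records what it was asked to do.
    return previous + f"# handles: {spec}\n"


problem = Problem("demo", [
    Checkpoint("parse flags", lambda s: "parse flags" in s),
    Checkpoint("add watch mode",
               lambda s: "parse flags" in s and "add watch mode" in s),
])
results = run_problem(problem, toy_agent)
```

The second check also requires the first checkpoint's behavior. That is one plausible way to model the idea that later work rests on earlier decisions.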

Leaderboard

Showing best version by % Checkpoints.

Model	Harness	Version	Strict Solve %	Iso Solve %	Core Solve %	$/CKPT	Erosion	Verbosity	% AST-Grep	% Cloned
GPT 5.5 (High)	Codex	0.124.0	14.29	28.06	65.31	$1.51	0.494	0.269	0.249	0.047
GPT 5.3-Codex (High)	Codex	0.98.0	11.22	26.02	59.18	$0.69	0.644	0.336	0.314	0.069
GPT 5.4 (High)	Codex	0.110.0	10.71	23.47	61.22	$0.82	0.508	0.273	0.240	0.058
GPT 5.2-Codex (High)	Codex	0.93.0	9.69	21.94	54.59	$0.85	0.728	0.398	0.364	0.097
Opus 4.6 (High)	Claude Code	2.1.32	9.69	20.92	65.31	$3.17	0.737	0.318	0.288	0.103
Opus 4.7 (High)	Claude Code	2.1.111	8.16	20.92	64.29	$2.17	0.759	0.357	0.327	0.084
Kimi K2.6 (High)	Kimi CLI	1.37.0	10.71	18.88	51.02	$0.74	0.764	0.399	0.359	0.129
Opus 4.5 (High)	Claude Code	2.0.51	9.18	17.35	56.12	$2.53	0.691	0.297	0.294	0.091
Sonnet 4.6 (High)	Claude Code	2.1.44	7.14	16.84	56.12	$1.96	0.741	0.316	0.298	0.093
Composer 2	Cursor CLI	2026.04.13-a9d7fb5	6.12	16.33	51.53	$0.44	0.716	0.353	0.318	0.107
GLM 5.1 (High)	Claude Code	2.1.44	9.69	13.78	38.78	$1.47	0.684	0.322	0.301	0.096
GPT 5.4-Mini (High)	Codex	0.110.0	5.10	13.78	51.02	$0.45	0.655	0.330	0.305	0.076
Kimi K2.5 (High)	Kimi CLI	1.37.0	4.59	9.69	39.80	$0.33	0.712	0.309	0.306	0.094
Kimi K2.5	OpenCode	1.4.3	4.59	8.67	31.12	$0.53	0.702	0.319	0.297	0.117
GLM 5.1 (High)	OpenCode	1.4.3	5.61	8.16	20.41	$0.59	0.550	0.387	0.329	0.145
GPT 5.3-Codex-Spark (High)	Codex	0.100.0	3.06	8.16	29.08	$0.20	0.586	0.357	0.340	0.086
Kimi K2.5 (High)	Claude Code	2.1.44	3.57	7.14	28.06	$1.07	0.692	0.310	0.301	0.097
MiniMax M2.7 (High)	Claude Code	2.1.44	2.55	4.08	28.57	$0.33	0.500	0.265	0.227	0.108
MiniMax M2.7	OpenCode	1.4.3	1.53	3.57	20.92	$0.27	0.746	0.418	0.379	0.146
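As a rough aid for reading the table, the solve percentages can be converted back into checkpoint counts. The sketch below assumes each percentage is computed over all 196 v1.0 checkpoints. The page does not state this, but the published values are consistent with it. The helper name `solved_checkpoints` is made up for illustration.

```python
TOTAL_CHECKPOINTS = 196  # total checkpoints in the v1.0 release


def solved_checkpoints(percent: float, total: int = TOTAL_CHECKPOINTS) -> int:
    """Convert a solve percentage into a whole number of checkpoints.

    Assumes the percentage is taken over all checkpoints in the release.
    """
    return round(percent / 100 * total)


# Top leaderboard row: GPT 5.5 (High) on Codex 0.124.0
top_row = {"Strict Solve %": 14.29, "Iso Solve %": 28.06, "Core Solve %": 65.31}
counts = {name: solved_checkpoints(pct) for name, pct in top_row.items()}

# Round-trip check: 28 of 196 checkpoints is 14.29% to two decimals.
strict_roundtrip = round(counts["Strict Solve %"] / TOTAL_CHECKPOINTS * 100, 2)
```

Under that assumption, the top row corresponds to 28, 55, and 128 of the 196 checkpoints for the strict, iso, and core columns respectively.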

Performance scatter

[Figure: performance scatter plot; X-axis and Y-axis selections not preserved.]

Why iterative evaluation

Aider and SWE-Bench evaluate an agent’s ability to resolve an issue in a frozen repository. This is an important capability, but it captures only a single point in time. An agent could produce a fix that is entirely viable yet utterly different from the ground truth, and that fix would fundamentally change how a developer approaches the inevitable extension. Measuring qualitative metrics at a single snapshot therefore yields a noisy signal, one that is scaffolded by prior human decisions. Furthermore, these benchmarks do not evaluate agents on long-horizon coding tasks, where they must either live with or redesign their original choices. Treating agentic benchmarks as iterative processes is the only way to evaluate the true nature of software engineering.

We must adopt this framing both now and for the future of agentic coding. Much of the recent discourse on agentic coding tools has focused on the “slop” they generate (verbose comments, defensive coding, bloat). While “slop” is ill-defined, the core of these grievances points squarely at the limitations of single-iteration benchmarks. Code riddled with these issues is hard to understand and maintain. The same holds for structural problems in model-generated code: a minor modification often means rewriting the entire codebase, because that is easier than extending agent-written code. Iterative benchmarks like SCBench are therefore crucial for truly autonomous SWE agents. Without them, we would have no way to measure an agent’s ability to function autonomously given only specification updates, because it is impossible to know every required feature or extension from the outset.

Design principles

None of this would be possible without deliberate design choices in benchmark construction:

No prescribed interfaces

All that is provided is the external contract of either the CLI interface or the API endpoints and response formats. Agents select the underlying architecture and the approach to solving the problem. Providing a function signature or other internal hints would mask the signal we want to measure.

No explicit test cases or test suite

The model sees only the examples in the spec and the explanation of the behaviors. Part of eroding code quality is the failure to anticipate obvious edge cases for a spec. Thus, we require the agent to identify and handle the specified edge cases on its own.

Black-box, language-agnostic evaluation

Solutions are judged purely on the outputs they produce for a given input. Each problem includes normalization code to ensure that minor arbitrary decisions, such as whitespace formatting, do not affect the solution’s correctness.
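To make the normalization idea concrete, here is a minimal sketch of an output comparison that ignores whitespace differences. It is not SCBench's actual normalization code. The `normalize` and `outputs_match` helpers are hypothetical, and treating JSON key order as an arbitrary decision is an added assumption, since the text above names only whitespace.

```python
import json
from typing import Any


def normalize(raw: str) -> Any:
    """Hypothetical normalizer for a solution's output.

    Parses JSON when possible, so key order and spacing do not matter
    (an assumption); otherwise collapses runs of whitespace in plain text.
    """
    text = raw.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return " ".join(text.split())


def outputs_match(actual: str, expected: str) -> bool:
    """Black-box check: compare normalized outputs only, never the source code."""
    return normalize(actual) == normalize(expected)


same_json = outputs_match('{"a": 1,  "b": [1,2]}', '{"b":[1, 2],"a":1}\n')
same_text = outputs_match("hello   world\n", "hello world")
different = outputs_match("hello", "goodbye")
```

Because the comparison sees only output strings, the same check applies regardless of which language or architecture the agent chose.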

Problem catalog · All 36 v1.0 problems

Categories: developer-tools, web, data-processing, cli-tools, configuration-management, dsl, algorithms, simulation, databases, networking, file-systems

Difficulty: easy, medium, hard

cfgpipe (configuration-management, easy)

CLI configuration resolver that reads a JSON schema, resolves typed parameters from prioritized sources (default, env, file, primary/secondary stores, args), supports nested groups, watch mode with structured change events, advanced types (duration, pattern, map, list, redacted), and store prefix composition.
