SlopCode Bench
A benchmark measuring code quality degradation in AI-assisted codebases. Tracks checkpoint solve rates, erosion (code bloat ratio), and verbosity to evaluate whether models produce correct and clean code under realistic conditions.
SlopCodeBench (SCBench) is a benchmark designed to evaluate coding agents the way real software actually gets built: through repeated requirement changes and extensions. Instead of treating the spec as a one-shot oracle, each task is a sequence of checkpoints where an agent implements an initial version, then extends its own solution multiple times as new requirements arrive.
The v1.0 release includes 36 problems with 196 total checkpoints, evaluated in a black-box setting where only a CLI or API contract is given. There is no prescribed architecture, function signature, or module boundary, so early design decisions can meaningfully help or hurt later work.
Leaderboard
| Model | Harness | Version | Strict Solve % | Iso Solve % | Core Solve % | $/CKPT | Erosion | Verbosity | % AST-Grep | % Cloned |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT 5.5 (High) | Codex | 0.124.0 | 14.29 | 28.06 | 65.31 | $1.51 | 0.494 | 0.269 | 0.249 | 0.047 |
| GPT 5.3-Codex (High) | Codex | 0.98.0 | 11.22 | 26.02 | 59.18 | $0.69 | 0.644 | 0.336 | 0.314 | 0.069 |
| GPT 5.4 (High) | Codex | 0.110.0 | 10.71 | 23.47 | 61.22 | $0.82 | 0.508 | 0.273 | 0.240 | 0.058 |
| GPT 5.2-Codex (High) | Codex | 0.93.0 | 9.69 | 21.94 | 54.59 | $0.85 | 0.728 | 0.398 | 0.364 | 0.097 |
| Opus 4.6 (High) | Claude Code | 2.1.32 | 9.69 | 20.92 | 65.31 | $3.17 | 0.737 | 0.318 | 0.288 | 0.103 |
| Opus 4.7 (High) | Claude Code | 2.1.111 | 8.16 | 20.92 | 64.29 | $2.17 | 0.759 | 0.357 | 0.327 | 0.084 |
| Kimi K2.6 (High) | Kimi CLI | 1.37.0 | 10.71 | 18.88 | 51.02 | $0.74 | 0.764 | 0.399 | 0.359 | 0.129 |
| Opus 4.5 (High) | Claude Code | 2.0.51 | 9.18 | 17.35 | 56.12 | $2.53 | 0.691 | 0.297 | 0.294 | 0.091 |
| Sonnet 4.6 (High) | Claude Code | 2.1.44 | 7.14 | 16.84 | 56.12 | $1.96 | 0.741 | 0.316 | 0.298 | 0.093 |
| Composer 2 | Cursor CLI | 2026.04.13-a9d7fb5 | 6.12 | 16.33 | 51.53 | $0.44 | 0.716 | 0.353 | 0.318 | 0.107 |
| GLM 5.1 (High) | Claude Code | 2.1.44 | 9.69 | 13.78 | 38.78 | $1.47 | 0.684 | 0.322 | 0.301 | 0.096 |
| GPT 5.4-Mini (High) | Codex | 0.110.0 | 5.10 | 13.78 | 51.02 | $0.45 | 0.655 | 0.330 | 0.305 | 0.076 |
| Kimi K2.5 (High) | Kimi CLI | 1.37.0 | 4.59 | 9.69 | 39.80 | $0.33 | 0.712 | 0.309 | 0.306 | 0.094 |
| Kimi K2.5 | OpenCode | 1.4.3 | 4.59 | 8.67 | 31.12 | $0.53 | 0.702 | 0.319 | 0.297 | 0.117 |
| GLM 5.1 (High) | OpenCode | 1.4.3 | 5.61 | 8.16 | 20.41 | $0.59 | 0.550 | 0.387 | 0.329 | 0.145 |
| GPT 5.3-Codex-Spark (High) | Codex | 0.100.0 | 3.06 | 8.16 | 29.08 | $0.20 | 0.586 | 0.357 | 0.340 | 0.086 |
| Kimi K2.5 (High) | Claude Code | 2.1.44 | 3.57 | 7.14 | 28.06 | $1.07 | 0.692 | 0.310 | 0.301 | 0.097 |
| MiniMax M2.7 (High) | Claude Code | 2.1.44 | 2.55 | 4.08 | 28.57 | $0.33 | 0.500 | 0.265 | 0.227 | 0.108 |
| MiniMax M2.7 | OpenCode | 1.4.3 | 1.53 | 3.57 | 20.92 | $0.27 | 0.746 | 0.418 | 0.379 | 0.146 |
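The solve-rate columns can be read back as checkpoint counts. The sketch below assumes that each percentage is a fraction of all 196 v1.0 checkpoints; the page does not state the denominator, so treat this as an interpretation that happens to be consistent with the published figures, not as the benchmark's own scoring code. The helper names are hypothetical.

```python
# Assumption: every solve-rate column is (checkpoints solved / 196) * 100.
TOTAL_CHECKPOINTS = 196  # total checkpoints in the v1.0 release


def checkpoints_solved(percent: float, total: int = TOTAL_CHECKPOINTS) -> int:
    """Convert a leaderboard percentage back to a whole number of checkpoints."""
    return round(percent / 100 * total)


def solve_percent(solved: int, total: int = TOTAL_CHECKPOINTS) -> float:
    """Convert a checkpoint count to a percentage rounded like the table."""
    return round(100 * solved / total, 2)


# First leaderboard row: GPT 5.5 (High) with 14.29 / 28.06 / 65.31.
strict = checkpoints_solved(14.29)
iso = checkpoints_solved(28.06)
core = checkpoints_solved(65.31)
print(strict, iso, core)
```

Under this assumption, a one-checkpoint difference moves a score by roughly 0.51 percentage points, which is worth keeping in mind when comparing adjacent rows.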
Performance scatter
Why iterative evaluation
Aider and SWE-Bench evaluate an agent’s ability to resolve an issue in a frozen repository. This is an important capability, but it captures only a single point in time. An agent could produce a fix that is entirely viable yet utterly different from the ground truth, and that fix could fundamentally change how a developer would approach the inevitable extension. Measuring qualitative metrics at a single snapshot therefore yields a noisy signal that is scaffolded by prior human decisions. Furthermore, these benchmarks do not evaluate agents on long-horizon coding tasks, where they must either live with or redesign their original choices. Viewing agentic benchmarks as iterative processes is the only way to evaluate the true nature of software engineering.
We must adopt this framing both now and for the future of agentic coding. Much of the recent discourse on agentic coding tools has focused on the “slop” they generate (verbose comments, defensive coding, bloat). While “slop” is ill-defined, the core of these grievances points squarely at the limitations of single-iteration benchmarks. Code riddled with these issues is hard to understand and maintain. The same holds for structural issues introduced by models: a minor modification often ends in a rewrite of the entire codebase, because rewriting is easier than extending agent-written code. Iterative benchmarks like SCBench are crucial for truly autonomous SWE agents. Without them, we would have no way to measure an agent's ability to function autonomously given only specification updates, because it is impossible to know every required feature or extension from the outset.
Design principles
Problem catalog · All 36 v1.0 problems
Methodology
What the AST-Grep rules look for
The % AST-Grep metric scores generated code against 341 named slop patterns (205 unique rule types after deduplication) defined in configs/slop_rules.yaml. Each rule pairs an AST-Grep pattern with a human-readable diagnosis. Diagnosis text is quoted verbatim from the YAML. (The file has 14 additional work-in-progress entries we exclude from these counts.)
Production patterns
Unique rules
Language scope
$A < $B and $B < $C
$A > $B and $B > $C
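The two patterns above match a pair of comparisons joined by `and` that share a middle operand, which Python lets you write as a single chained comparison (`a < b < c`). As a rough illustration of what such a rule detects, here is a minimal sketch that uses Python's standard `ast` module in place of AST-Grep. It is not the benchmark's implementation, and the function names are hypothetical.

```python
import ast


def _is_simple_compare(node: ast.AST) -> bool:
    """True for a comparison with exactly one operator, e.g. `a < b`."""
    return isinstance(node, ast.Compare) and len(node.ops) == 1


def find_chainable_comparisons(source: str) -> list[int]:
    """Return sorted line numbers of `a < b and b < c` style expressions.

    Mirrors the two patterns above: both comparisons use the same
    operator (< or >) and the right operand of the first equals the
    left operand of the second.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.BoolOp) and isinstance(node.op, ast.And)):
            continue
        # Check each adjacent pair of operands of the `and` expression.
        for left, right in zip(node.values, node.values[1:]):
            if not (_is_simple_compare(left) and _is_simple_compare(right)):
                continue
            same_op = type(left.ops[0]) is type(right.ops[0])
            directional = isinstance(left.ops[0], (ast.Lt, ast.Gt))
            # ast.dump omits line/column info by default, so structurally
            # identical sub-expressions compare equal.
            shared = ast.dump(left.comparators[0]) == ast.dump(right.left)
            if same_op and directional and shared:
                hits.append(left.lineno)
    return sorted(hits)


sample = (
    "in_range = lo < x and x < hi\n"   # flagged: could be lo < x < hi
    "chained = lo < x < hi\n"          # already chained
    "unrelated = a < b and c < d\n"    # no shared operand
    "descending = a > b and b > c\n"   # flagged: could be a > b > c
)
print(find_chainable_comparisons(sample))
```

AST-Grep expresses the same idea declaratively: the metavariable `$B` appearing twice in a pattern requires both occurrences to match the same sub-expression, which is what the `shared` check approximates here.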
Acknowledgments
The benchmark is led by Gabriel Orlanski (University of Wisconsin–Madison) with support from DARPA, NSF and Snorkel AI through the Open Benchmarks Grants Program.