SlopCodeBench: Measuring Code Erosion as Agents Iterate

AI coding assistants have become remarkably capable at solving isolated programming tasks. They can implement features, fix bugs, and even navigate complex codebases. But there’s a growing frustration among developers: these models can write code, but they can’t maintain a codebase. They make mistakes, they add unnecessary complexity, they write redundant tests – in other words, they generate slop. And, since a non-trivial codebase isn’t written one-and-done, the slop accumulates with each iteration until the code becomes unmanageable.
SlopCodeBench (scbench.ai) is designed to fix this. We spoke with Gabriel Orlanski, the project lead for SCBench and a PhD student at the University of Wisconsin-Madison, to discuss SCBench and the future of agentic coding.
The problem: single-shot evaluation misses the real story
Current benchmarks like SWE-bench evaluate coding agents on whether they can solve a problem—full stop. Did the agent fix the bug? Did it implement the feature? Check. Move on.
But as Gabe points out, “that’s not really how you code, it’s not how you develop software.” Real development is iterative. We add features, refactor, debug, and maintain code over time. The impact of your architectural decisions compounds with each change.
“If the codebase is garbage, you’re better off rewriting the entire thing from scratch,” Gabe explains. “If you have to go through 10 if statements branched into all one function with seven levels of nesting—it’s more work to try and figure out what’s going on than it is to rewrite the entire thing.”
This is the gap that SlopCodeBench addresses. It doesn’t just measure whether an agent can solve a problem. It measures what happens to code quality as an agent makes multiple changes over time.
How SlopCodeBench works
Rather than evaluating agents on single, isolated tasks, SlopCodeBench presents multi-checkpoint problems that simulate real feature development:
- Checkpoint 1: Implement the core functionality
- Checkpoint 2: Add a new feature that builds on the first
- Checkpoint 3: Add another feature
- And so on…
Each checkpoint represents a realistic feature addition that a developer might face. Together, the checkpoints test not just whether a model can solve each step, but how much slop it produces, and how much code quality erodes, as the changes accumulate.
“The point of this benchmark is for models to make decisions regarding design and architecture that will have impacts later on,” Gabe says. “That’s our core thesis here.”
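To make the setup concrete, here is a toy sketch of what a multi-checkpoint problem could look like. This is not SCBench's actual problem format; the class, field, and checkpoint names are illustrative assumptions only.

```python
# Illustrative only -- NOT SCBench's real problem format. Names are made up.
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    prompt: str                                 # the feature request handed to the agent
    tests: list = field(default_factory=list)   # acceptance tests for this step


# One evolving repo, several feature requests layered on top of each other.
problem = [
    Checkpoint("Implement an inventory with add/remove/count operations."),
    Checkpoint("Add crafting: items can be built from other inventory items."),
    Checkpoint("Add a cost report that reuses the crafting cost calculation."),
]

# The key property: after checkpoint N, the tests from checkpoints 1..N must
# all still pass, so early design decisions keep paying off (or hurting).
for number, checkpoint in enumerate(problem, start=1):
    print(f"Checkpoint {number}: {checkpoint.prompt}")
```

The point is not the scaffolding itself but the constraint it encodes: the agent is graded on the same codebase it has to keep living with.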
Why AI coding needs SlopCodeBench
For all the progress in AI coding, there is still good reason to expect more. Even advanced models display consistent patterns of problematic behavior:
Selective amnesia
Models frequently ignore code they’ve already written. In one problem involving manufacturing logic from a video game, agents define a function at checkpoint one, then are explicitly told to use that calculation at checkpoint five. Instead of reusing their own code, models often try to reimplement the logic from scratch—and get it wrong.
“The number of times I’ve seen Opus just try and rewrite that logic and get it wrong has been more than enough,” Gabe notes. Ironically, this happens most frequently when models are in “high thinking” mode—they literally overthink the problem.
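Here is a hedged illustration of this failure mode. The function names are hypothetical, not taken from the actual benchmark problem, but the shape of the mistake is the one described above.

```python
# Hypothetical reconstruction of the "selective amnesia" pattern.
# Function names are illustrative, not from the real SCBench problem.

# --- Checkpoint 1: the agent writes (and passes tests with) this helper ---
def crafting_cost(recipe, prices):
    """Total cost of a recipe: quantity * unit price, summed over inputs."""
    return sum(qty * prices[item] for item, qty in recipe.items())


# --- Checkpoint 5: told to reuse that calculation, a sloppy agent instead
# re-derives it from memory and drops the quantities entirely ---
def report_cost_slop(recipe, prices):
    return sum(prices[item] for item in recipe)  # wrong: ignores quantities


# --- What we actually want: build on code that already exists and passes ---
def report_cost(recipe, prices):
    return crafting_cost(recipe, prices)


recipe = {"iron_plate": 2, "copper_wire": 3}
prices = {"iron_plate": 5, "copper_wire": 1}
print(report_cost_slop(recipe, prices))  # 6  -- silently wrong
print(report_cost(recipe, prices))       # 13 -- reuses the tested helper
```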
Library aversion
Models are “allergic to libraries,” preferring to hand-roll implementations rather than use well-tested code. “Why aren’t you using pandas?” Gabe asks rhetorically when the model decides to burn AI compute budget rewriting CSV parsing from scratch.
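The CSV example is paraphrased from the interview; the contrast below is a contrived sketch of the pattern, not code from the benchmark.

```python
# Contrived contrast -- not code from the benchmark.
import pandas as pd


def parse_csv_by_hand(path):
    """The hand-rolled parser an agent might write: breaks on quoted fields,
    embedded commas, odd encodings, and everything else pandas already handles."""
    rows = []
    with open(path) as f:
        header = f.readline().rstrip("\n").split(",")
        for line in f:
            rows.append(dict(zip(header, line.rstrip("\n").split(","))))
    return rows


def parse_csv(path):
    """The boring, well-tested alternative the prompt implicitly asked for."""
    return pd.read_csv(path)
```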
Deletion phobia
Perhaps most frustratingly, models refuse to delete unnecessary code. “These models refuse to ever delete code unless there’s no other choice,” Gabe observes. “That is huge bloat, huge slop. It’s just eroding your codebase.”
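A small, hypothetical example of the pattern: the superseded helper stays behind "just in case," and every future reader has to figure out for themselves that nothing calls it.

```python
# Illustrative of the deletion-phobia pattern, not taken from the benchmark.

def price_with_tax(price, rate=0.08):
    """Current implementation, used by every caller."""
    return round(price * (1 + rate), 2)


# Slop: instead of deleting the old helper when it was replaced, the agent
# keeps it (plus a flag nobody passes anymore). Dead weight in every review.
def price_with_tax_old(price, rate=0.08, legacy_rounding=False):  # unused
    total = price + price * rate
    return int(total) if legacy_rounding else round(total, 2)
```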
The complexity spiral
As agents patch and re-patch code without refactoring, complexity compounds. Functions balloon with nested conditionals. Logic that should be abstracted gets duplicated. The codebase becomes progressively harder to reason about.
“The erosion compounds at every single step,” Gabe explains. “It’s just taking the least resistance approach and patching solution after solution after solution.”
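An exaggerated but representative sketch of where repeated patching leads, and the kind of data-driven refactor agents rarely volunteer. This is not output from an actual SCBench run; the business logic is invented for illustration.

```python
# Exaggerated sketch of the "complexity spiral" -- invented example.

# After several rounds of "just patch it," discount logic ends up like this:
def discount_patched(user, cart_total, coupon):
    if user.get("is_member"):
        if cart_total > 100:
            if coupon:
                if coupon == "VIP":
                    return 0.30
                else:
                    return 0.20
            else:
                return 0.15
        else:
            return 0.10
    else:
        if coupon == "VIP":
            return 0.10
        return 0.0


# The refactor an agent rarely volunteers: flatten the branches into data.
RULES = [  # (predicate, discount); first match wins
    (lambda u, t, c: u.get("is_member") and t > 100 and c == "VIP", 0.30),
    (lambda u, t, c: u.get("is_member") and t > 100 and c,          0.20),
    (lambda u, t, c: u.get("is_member") and t > 100,                0.15),
    (lambda u, t, c: u.get("is_member"),                            0.10),
    (lambda u, t, c: c == "VIP",                                    0.10),
]


def discount_refactored(user, cart_total, coupon):
    return next((d for p, d in RULES if p(user, cart_total, coupon)), 0.0)
```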
Why this matters for production AI systems
For organizations deploying AI coding assistants, these patterns have real consequences. Code that passes tests but accumulates technical debt creates long-term maintenance burdens. When human developers need to intervene—to debug, extend, or refactor—they’re stuck with a codebase that’s harder to work with than necessary.
“If these models are entirely autonomous, or even if there is a human in the loop, what’s gonna happen when a human has to intervene?” Gabe asks. “If the codebase is garbage, you’re better off rewriting the entire thing from scratch.”
This is particularly relevant for teams thinking about technical debt remediation. While coding agents promise to help clean up legacy codebases, current models show a troubling tendency to add to the problem rather than solve it.
At the end of the day, the goal isn’t just autonomous coding. It’s autonomous coding that produces maintainable, high-quality software. SlopCodeBench helps us measure the gap between where we are and where we need to be.
A benchmark built for community
Following the successful model of Terminal-Bench, SlopCodeBench is designed for community contribution. The team is looking for developers who are opinionated about code quality to write additional problems.
“The best people that we’ve found for writing problems are those who, even if I disagree with the majority of their opinions, are really opinionated about coding and how coding should be done,” Gabe says.
Interested contributors can:
- Visit scbench.ai to review the design philosophy and contributing guidelines
- Check out existing problems for inspiration
- Join the Discord community
- Submit new problems to the repo via PR
SlopCodeBench evaluates model behavior in a way that single-shot benchmarks miss. And in doing so, it provides a path toward building coding agents that don’t just solve problems, but solve them in ways that humans can actually work with. Snorkel is proud to support SlopCodeBench and other open benchmarks. To find out more about Snorkel’s data development platform and our work with frontier AI labs, visit us at snorkel.ai and connect with our team.
Full interview on YouTube
Timestamps
[00:00:00] Introduction and Gabe’s journey into ML for code
[00:02:52] The frustration with “slop” and what’s missing in current benchmarks
[00:07:57] Design philosophy: Why hand-written problems matter
[00:10:22] SlopCodeBench and technical debt
[00:15:13] Benchmaxing: How to spot when models are over-optimized
[00:16:30] Where advanced models struggle most
[00:19:00] Recommendations for model builders and agent designers
[00:21:49] Gabe’s approach to AI coding tools (skills and subagents)
[00:23:54] Building the SlopCodeBench community
[00:25:45] Fred’s perspective on evaluation research
[00:26:56] Contributing to the benchmark
Transcript
Speakers:
- Fred Sala, Chief Scientist, Snorkel / Faculty, University of Wisconsin-Madison
- Kobie Crawford, Developer Advocate, Snorkel
- Gabriel Orlanski, Project Lead, SlopCodeBench / PhD student, University of Wisconsin-Madison
Gabriel Orlanski: And then essentially there’s been all these different SWE-bench variants: X-bench something, multi-SWE, multi-agent, multi-PR, whatever. I don’t wanna say we’ve hit the peak, but we’ve hit the moment where we’ve had these single-instance PR benchmarks, we’ve had these single-instance issue benchmarks, and it doesn’t really feel like there’s much juice left there in terms of what the future is. And so that’s how I got into this agent field, and that’s how I got into this project overall. So that’s the overarching story, very condensed, of how I’m in agents for code now.
[00:02:52] Fred Sala: Thanks for the background. So how’d you come up with the idea for SlopCodeBench, especially the initial idea? And in particular, what aspects of existing benchmarks did you feel were missing, or just weren’t doing the job that we’re really interested in?
[00:03:05] Gabriel Orlanski: I think it was really a combination of getting very frustrated with all these different toolings. Cursor’s amazing, Claude Code’s amazing, they’re all really amazing products in a vacuum. I was getting annoyed because they weren’t using the patterns that I like. And when I looked over the code, it was incredibly frustrating, because I was having to spend all this time being like, why is it doing this? Why is it doing that? This is stupid, right? Essentially that was the inner monologue going on. It was getting really annoying.
And then on Twitter, you see people like Lucas Beyer, you see influencers with different degrees of influence in the field, saying essentially the same thing I was feeling, and that was really validating: these models are generating slop, right? And obviously slop is not necessarily a scientific term, but it really hit home at what I was feeling, that they’re generating this very verbose, weird code that just isn’t doing what I wanted it to do. It’s not even doing what I would think it would be doing. It’s just missing something.
And it just leaves your codebase in a very bad state that’s very hard to recover from. It’d be as if I asked an undergrad, hey, can you implement an agentic tool-calling system, and I looked over their code and it’d just be incredibly frustrating to deal with. Except I’m paying 200 bucks a month for the privilege of getting frustrated with my code. It was something I felt was really missing, something that was not there, especially when you look at how benchmarks were. The issue with current benchmarks is that they’re all very single-moment-in-time. Very much this idea of, OK, once you solve the problem, that’s it. No one cares what happens after it. No one really cares what’s going to come of the solution you just submitted. It’s done. It’s solved. It gets the number, models get benchmaxed, everything’s perfect. Right? But that’s not really how it works. It is how you benchmark, but it’s not how you code, it’s not how you develop software.
And I think everyone kind of knows where these agents will converge, where the ultimate goal is, even if it’s uncomfortable to say: they will converge to completely autonomous coding, no human in the loop, eventually. At least in my opinion, that’s the only way this entire timeline goes. But there’s still the question of, well, okay, if they’re entirely autonomous, or even if there is a human in the loop, what is gonna happen when a human has to intervene? When a human has to try and debug because the models can’t do it? Well, if the codebase is garbage, you’re better off rewriting the entire thing from scratch. If you have to go through 10 if statements all branched into one function with seven levels of nesting, it’s more work to try and figure out what’s going on than it is to rewrite the entire thing. And that doesn’t seem like an optimal solution. It also doesn’t seem like a satisfying solution when you think of how much money is being invested in these agents, how they’re supposed to be, realistically, so much smarter than all of us because they have the entire internet.
They have however many multiples of the number of synapses that we have in our own brains. They should be better than us, but they’re not. It’s incredibly unsatisfying to think, okay, these models are making weird mistakes that I as a human would never make, that any software engineer I know, regardless of their core talent, would never make. Why are we okay with this?
Right? And so that’s, I guess, the motivation for SlopCodeBench. And then there’s also the motivation of, okay, code doesn’t live in a single shot. How do we simulate this overarching process of erosion? How do we simulate this setting where the models keep patching, keep going for essentially the laziest approach possible, and that keeps building up and building up and building up? And when you looked at the benchmarking landscape, when you looked at the code landscape, there was nothing that even came close to capturing this idea. So that essentially was the motivation, the story behind it: okay, there’s this issue of slop, and there’s the issue that no one’s actually properly reflecting the real software engineering process. Is there a way to combine them? And the answer is yes, a hundred percent yes, because the erosion compounds at every single step. Now, we found it’s not entirely the case that the erosion is bad for the models, almost certainly because what happens is that they just take the least-resistance approach and patch solution after solution after solution. But it still builds up.
It builds up very quickly, to astonishing degrees that we would not have thought. And the only way to evaluate this is if you do it in this iterative way, where each step is a new feature being added, like a real software engineer does, like real developers do. So marrying all these topics together kind of produced the perfect outcome, and, at least in my opinion, the only way to really evaluate this type of patchy behavior. Because if you evaluate patchy behavior on a single time step, you’re not given the full oracle for where it’s gonna go. You’re not given the full intuition of here’s where it should go. It doesn’t really matter where it should go; it’s how you get there. But you can only measure that if you have multiple steps. You can’t measure that in a single, isolated instance. So that’s the overarching motivation for this massive benchmark that we’ve created.
[00:07:57] Fred Sala: Given this motivation and the initial idea, and you were starting to describe this, how did you initially choose the design for the benchmark?
[00:08:05] Gabriel Orlanski: Part of it was definitely stubbornness: okay, how are we gonna get all these different problems? Well, I’m just gonna hand-write them myself. But the other part is that if you look at how to mine these, any type of problem where the first checkpoint is the overarching problem, the next checkpoint is feature one, the next checkpoint is feature two, et cetera, it’s very hard to mine, right? Especially because between PRs, which would be the most ideal kind of feature, so much else can change that’s orthogonal to the actual core feature you’re trying to implement; it just gets caught up in the noise. Really, the only way to do this is to start from scratch and write your own problems. It’s also just a stronger benchmark overall. So in that sense, we really chose to write our own problems, and in writing our own problems, the most illuminating perspective, even if that’s maybe not quite the right word, is that the point of this benchmark is for models to make decisions regarding design and architecture that will have impacts later on. That’s our core, core thesis here. It’s very difficult to just churn out problems like that. It requires a lot of very careful consideration. You have to really think, okay, where’s this problem gonna go? What’s my opinion as a relatively novice engineer, but someone who’s been coding a while and who’s very opinionated on things I probably shouldn’t be opinionated on? The overarching question is where this problem is going to go, and what approach, in my head, I would use. And even through writing the problems, new constraints come up where it’s like, okay, I should have thought about this. So really the only way to design really good problems for this benchmark is to think from scratch: okay, here’s a tool I’ve used, how would I build it, what are the iterative steps I would go through? That’s the overarching design, and how we came to the conclusion that you can’t just scrape problems, you can’t just use GitHub repos. As much as I would love that to be the case, it’d make my life a lot easier, but repos just have too many other things going on, and at the scale you would want to do it, it’s not economically feasible for any real amount of money unless you’re at Anthropic or OpenAI and get to use these models for free. So it’s really: is there a deterministic process to do this?
No. So you have to write it by hand to get really good, high quality problems.
[00:10:22] Kobie Crawford: So many organizations are dealing with significant tech debt in their codebases. What do you think SlopCodeBench is gonna be able to do to help us realize the promise of coding agents actually being able to help with that, compared to what you’re talking about right now, where people are seeing that a lot of the generated code is itself likely to become tech debt in the current state of affairs?
[00:10:41] Gabriel Orlanski: Yeah, so I think one of the biggest issues that I’ve noticed with models currently is, say they miss a test case, right? It’s not a crazy thing in the world that they miss a test case. But it’s never, ever discovered again; even if it’s used in other tests that fail, models have an allergy to trying to figure out something that’s broken in the code. They’re operating under the assumption that if the code’s broken, it’s not their fault, which makes sense to a degree, right? For a human being working on a legacy codebase, if something’s broken, it’s not their fault. They shouldn’t necessarily put all their effort into fixing it.
But at the same time, these models are this much better than us, and they really are; I’ve had AGI moments with these models. It should be that these models are exceptionally better than us at these types of things, at catching these little details throughout iterations, and they should fix them. Now, when it comes to technical debt, that’s… it is a difficult thing to measure. It’s very difficult to measure technical debt because of just how big the codebase needs to be, and also the question of what is debt versus what is something you can say, okay, it’s not necessarily needed, you can skip over it and make the compromise that it’s not really worth dealing with.
It works. Why do we need to rewrite it? Right? If something works, and it works to the standard you need it to, it’s not necessarily technical debt that you need to rewrite. It just works; there’s no need to overcomplicate it. I guess what I would consider technical debt is more: okay, there’s this hacky thing to get the core codebase to work, but we have to do it because we don’t have enough time.
That is, I guess, more of the technical debt that I would consider, and I think SlopCodeBench can really measure that, in the sense that we measure erosion through where all the complexity is going. Where are all the if statements going? Where are all the control blocks going? If they’re all going into one big function, then that is in itself technical debt, because you’re making something that’s harder to debug, something that’s too difficult to reason through quickly. Someone’s eventually gonna need to fix that. There’s no way that 30 different if statements in one function is gonna work forever. It’s gonna break; I would put a lot of money on that eventually breaking. So for people who have real technical debt, for people who are trying to figure out where they can deploy these agents best to actually save real dev hours, SlopCodeBench really does provide great insight into how they deal with the kind of exponential increase in complexity throughout a project’s lifecycle. Now, is it gonna be perfectly one-to-one? No, but I think it’s a very good proxy for how people really will interact with these models.
And I think part of that is driven by the fact that we are very hands-off in terms of how the agents are implemented and what they do. I’m fundamentally against the idea that everything needs to be evaluated in the same exact harness with the bare minimum. mini-swe-agent’s great, but you don’t need to evaluate all your models in mini-swe-agent. I think if you just throw in Claude Code and specify the version, you’re good to go; that is a more useful way of evaluating these models than any generic harness will ever be. If you throw in OpenHands, give it the version, and evaluate it, I think that’s exponentially more useful to actual developers than a mini-swe-agent or a bare-bones ReAct framework. So we are really hands-off in that sense. A lot of this design decision was because of the amazing work from Terminal-Bench. We saw how they did their agents, how they handled Claude Code, OpenCode, that stuff, and I love it. It was super simple, got the job done, the perfect design decision. So we piggybacked off that quite a bit.
I think it helps a lot in your benchmark design when you ask: how are people gonna actually use this? How are people gonna actually interact with these models? The closer you can get to true one-to-one with how a real developer is gonna interact with it, the more value your benchmark has, and the more you can avoid the issues of benchmaxing. It’s not a scientific term, but it’s a real thing that happens, and you feel it the moment you use one of these models that’s been maximized perfectly for benchmarks but not for real human use: they’re just miserable to interact with. That’s the overarching take on how we’ve designed these very core pieces, like how you pick the model harness and how you pick the prompting. It’s as close to how I, at least, use these models, which is kind of sloppy itself, but that’s how real people use them. So the closer we can get to real human use, the more value our benchmark has. That’s been one of the core principles of our design.
[00:15:13] Fred Sala: Thanks, Gabe. So I’m curious, actually: how can you tell a model is benchmaxed when you start using it? How quickly do you see that kind of behavior?
[00:15:21] Gabriel Orlanski: At least for me, I think the most recent one was GLM 4.6. It was really disappointing. It would just get very simple things wrong the moment you asked it to go outside of the very simple one-issue, one-solution setting. I mean, GLM is a great model, it just wasn’t great for me, maybe in how I’m using it, but it’s very clear that things fall apart with, I guess not the superhuman stuff, but with leveraging compute to replace menial work. That’s where it falls apart, on the very simple things, right? Another good example is GPT 5.2. Recently I tried to have it implement a spec. It did great at implementing part of the spec, but then it just skipped over everything else. Never wrote tests, didn’t do anything else. It’s like, okay, why? Cool, you got a good score. Where are my tests? Right? The attention to detail, I think, is the biggest issue plaguing these models now, and that, I think, is the core of the benchmaxing issue: they’re great within a very specific evaluation harness, but the moment things go a bit outside of it and you really need that attention to detail, it all kind of falls apart.
[00:16:30] Fred Sala: So speaking of that, and models getting simple things wrong, more broadly, where do you see advanced models having the biggest trouble with SlopCodeBench-style tasks?
[00:16:40] Gabriel Orlanski: Some of our tasks, I would say, are incredibly simple yet difficult, and not difficult in the sense of being all gotchas. They’re more difficult in terms of: okay, you’ve implemented something at checkpoint one, will you use it? And the interesting thing that we’ve seen, for some of our problems, like one that involves implementing manufacturing logic from a video game: the model has to define a function to get something correct in checkpoint one, and then in checkpoint five we tell it, not directly, but we tell it, okay, you need to use this calculation that you’ve done before. The number of times I’ve seen Opus just try and rewrite that logic and get it wrong has been more than enough that I’m like, okay, why is it not doing this? And interestingly, we’ve seen that it skips previously defined logic the most when it’s in high thinking mode. That’s obviously caveated by however the harness, or however Claude Code, gives high thinking to the model; it’s not necessarily always gonna be high thinking, it just gives it the ability to think longer. We’ve seen this across a few other problems as well, where it just ignores what it’s done before and tries to re-implement it based off what it thinks the answer is. It’s essentially overthought what’s going on.
Really, I think an emblematic issue that a lot of people have with these models is that they just overthink; they just do too much.
The idea with SlopCodeBench, in terms of its design, is that we would want the model to eventually refactor its own code.
That’s the type of overthinking that I would want.
[00:18:08] Gabriel Orlanski: Right. Once the complexity has grown big enough that the model can recognize it’s time to refactor, it refactors. I don’t want it overthinking how to calculate something when it’s already calculated it before, right? That should be reused by the model, a hundred percent. And it’s really frustrating that I’ve seen this in my own code and I see it in the benchmark: it just cannot remember what it’s done. And not even remember; it sees what it’s done and just decides, okay, I don’t need that. Right.
The most powerful model, Opus 4.5, which honestly is still the best that I’ve used for coding, still makes these mistakes, and it’s even worse in real applications. I can think of so many times, even developing the benchmark, where it’s just forgotten to update all the uses of something that has changed, and now my entire codebase raises errors every time I try to use it.
[00:18:53] Gabriel Orlanski: It’s these types of very simple, just boring work that I would hope these models can do better at. And SlopCodeBench brings that to the surface.
[00:19:00] Fred Sala: Yeah, so I think overthinking and this kind of selective amnesia you’ve described are fascinating, and very costly, model behaviors. Can you speculate on what model builders, agent designers, and so forth should really be thinking of doing to try to minimize these kinds of behaviors going forward?
[00:19:17] Gabriel Orlanski: So I think part of the issue, I don’t necessarily think this is solvable by pure continual learning, whatever the vague posting is suggesting at the time. I think this is an issue of documentation in the code. Now, obviously that could be considered a degree of continual learning, but it seems the pendulum has swung all the way toward minimizing comments and documentation, because the public was getting annoyed by how many comments there were, and I think the lack of comments hurts. I also think reasoning through prior code to understand the link between earlier concepts and the current task is the biggest issue. It’s not continual learning in the sense of remembering what you’ve done, because there have been so many times in my life where I’ve looked back at a codebase I wrote the previous day and it’s taken me a while to figure out what’s going on, right? I think realistically it’s purely about how you prioritize using what’s there versus trying to rewrite. I think this also connects really well with the issue that we’ve seen through SlopCodeBench, and through everyday life with these models: they’re allergic to libraries. They’re allergic to using any code that’s not hand-rolled from scratch. I really do not know why.
It’s incredibly frustrating when it’s something like, okay, why aren’t you using pandas? Well, I decided to write CSV parsing all by myself. It’s like, cool, I just wasted 1% of my usage for the week on that. Amazing. I think realistically the biggest thing these model builders can focus on is purely how to get these models to prioritize existing code versus rewriting, and also, to an extent, being okay with gutting. The other thing that we’ve noticed is that these models refuse to ever delete code unless there’s no other choice. That is huge bloat, huge slop, for lack of a better term. It’s just eroding your codebase. Why can’t these models delete things?
And I understand the counterpoint, which is that if models start deleting everything, that’s worse than deleting nothing. But, again, these models should be better than us. They should be much better coders than us; they should be much faster at processing all of these types of signals. Why isn’t that the case?
And I think it’s because benchmarks that evaluate this setting have not existed until ours. So now people can actually start looking into these questions in a really rigorous way.
[00:21:27] Kobie Crawford: Yeah, that really makes a lot of sense. And one of the things we’ve certainly seen over and over again is exactly that: until you actually provide some reward signal or something to indicate what’s better or worse in these kinds of contexts, the models don’t learn it. So that’s perfect. Just out of curiosity, there are so many good AI coding tools out there, what’s your approach to keeping an eye on all of them? How do you make time to try them all?
[00:21:49] Gabriel Orlanski: Well, okay. This is a bad answer, but this is the truth: I ignore most of the tools I see on Twitter. I just ignore them. I don’t trust most of them. All I’ve really used, ’cause it’s been good enough for me, is skills; recently I’ve been getting into subagents a lot more and figuring out hooks. I don’t think you need huge harnesses that are completely different. Anthropic and OpenAI have been really smart about making it difficult to use their subscription OAuth in other tools, but I think realistically all you need is skills and subagents. So honestly, the first AGI moment I had this past break: we completely redid how we’re doing tests in SlopCodeBench. We completely migrated the entire thing to a new system so it’s easier for people to contribute or add in problems. I did that entirely through skills and subagents. That would’ve easily taken me probably three to four weeks of manually going through each one and updating them. It was just pure skills and subagents figuring out how to do that. Really, you just need to define your tasks well, you need to define the skills well, you need to apply a bit of thinking, you need to turn on your brain a bit when you’re writing the skills, and that will pay off so much more later on. No amount of different terminal UIs and different colored text is gonna solve the issues you’re having if you’re not thinking through what you’re actually asking the model to do. The models suck at creating new things. They are amazing at doing the same thing over and over and over again in a structured manner. If you set it up like that, if you give them CLI tools, if you give them essentially an interface where they can just brute-force their way through, they will, and they will do it well enough to your liking. There were obviously a few issues at certain points where I was like, why did it do this? This is really annoying. But eventually it got the job done really well, because I essentially made a single CLI command that it can run with no extra output besides the JSON: are the tests the same or not? And it was able to do that really quickly, really well. It got what it estimated at seven weeks, which for me was realistically three to four weeks, done in like a weekend.
[00:23:54] Fred Sala: Awesome. So, returning to SlopCodeBench, Gabe, what are your hopes for the community that you’re trying to build around it? Do you have any particular goals or directions that are most interesting?
[00:24:04] Gabriel Orlanski: Yeah. I think the hope for the community is that people see this as an imperative problem, because we believe this is a core problem for building the future of autonomous coding. Beyond that, this is just a quality-of-life thing that everyone should want: better code that you have to review. ’Cause reviewing bad code is a miserable experience no matter how good your tooling is; it’s one of the worst experiences you can have. And so the goal is to get the community really excited about this. I think everyone we’ve shown this to, everyone who’s heard about this, has been excited about the ideas. It’s been more a question of how we can get them to contribute, and really we’re gonna follow Terminal-Bench, because they did such a great job with this; they’ve paved the way for how you build an excited community. That’s through our Discord, which you can find on the website. That’s gonna be through our contributing guidelines, where we’re gonna lay out explicitly what people need to do to get a problem approved, what type of problems we’re looking for, and, when we submit the final paper, how they can be part of this effort. Because we really think this is so core to the future success of these coding models. It’s core to the future of autonomous coding: are they generating high-quality code that other agents, that other humans, can work on, that they can easily add to without breaking everything? It’s such a core problem that we believe, fundamentally, once people start seeing this and seeing what this benchmark can do, they will look to see if they can contribute. And when they see how easy we’ve made it, hopefully they get as excited as we are. Hopefully they can just submit a single problem via PR, and we approve it, and it’s amazing, and we have a hundred problems. We’re gonna see how it goes. But it’s been such an important problem to us, and we hope that people see our enthusiasm and get excited the same way we are about this entire issue.
[00:25:45] Kobie Crawford: Fred, I wanted to ask you a question, since you’ve had the opportunity to work with Gabe at the University of Wisconsin-Madison. When you look at a project like SlopCodeBench and what Gabe is doing, what are the kinds of things, in how SlopCodeBench is growing right now, that you’d like to see more people in your team and your department doing and learning from?
[00:26:08] Fred Sala: Yeah, I think evaluation in general is something really exciting. Especially coming from academia, we typically really like to propose new algorithms, and maybe that means things like new models. That aspect of research has been very successful, but even just understanding the state of the field is something really critical, and in academia and beyond, we really have a pretty big role to play there. That’s a lot of why I really love the work that Gabe is doing. And of course, folks like Terminal-Bench as well have done a lot of really exciting work that has a lot of immediate impact, both within the general research community and beyond it, in industry as well. So I think continuing on this tack of, hey, how do we do the right kind of evaluations? How do we figure out very carefully what the status of models, agents, and so on is? These are really important directions, and I think we should all continue thinking more about them.
[00:26:56] Kobie Crawford: Awesome. Yeah. Well, Gabe, thank you very much for making this happen, for making time to be part of this conversation and tell us about your benchmark. We really are looking forward to seeing how far SlopCodeBench goes and what people do with it next. It’s a very exciting time in this space, and we want to help you get that enthusiasm growing. With that in mind, you mentioned the Discord; is there anything else we should flag as specific ways to contribute, that we can put in the links in our blog post and in the comments of the video here?
[00:27:27] Gabriel Orlanski: Yeah. So in our documentation we’ve put a lot of effort into how to write good problems. We’ve put a lot of effort into thinking through the easiest way to teach people our philosophy. On our website, scbench.ai, we have a design philosophy for this benchmark, and we have a contributing guide as well. Really, it’s all about: is there a tool that you like that you’d like to see cloned? That’s a great starting point. Is there a tool they like that they wanna see cloned? Write a problem about it and how you’d go about designing it, make a PR, and we’re gonna sit down and help you make that problem as good as possible. There are so many different ways to write problems; realistically, everyone has a different idea they can contribute. So if they go to our docs, if they go to our website, that’s a great jumping-off point. If they look at the problems we’ve written, which are all on our website as well, they can see what we’re looking for and get inspired to write their own problems, or to contribute more testing-framework stuff. We’re gonna have this in Harbor as well, so it’s gonna be super easy to run. But really, the biggest thing to getting this effort even further toward where we all believe it can be is: we want more problems.
We want people who are passionate about coding, who are really opinionated about coding, too. I think the best people that we’ve found for writing problems are those who, even if I disagree with the majority of their opinions, are really opinionated about coding and how coding should be done. Those types of problems really yield the best results. So just join the Discord, look at the contributing page, look at the repo, and you will see something that hopefully inspires you to write up a problem.
[00:29:01] Kobie Crawford: Awesome. Sounds good.