Snorkel Chief Scientist Fred Sala and Kobie Crawford had a chance to catch up with Mike Merrill and Alex Shaw on November 5, just before they released Terminal-Bench 2.0 and the Harbor framework. They gave us a peek into the thought process and design decisions underpinning the benchmark, and shared some important insights about how Harbor facilitates a wide range of use cases beyond benchmarking.

Kobie Crawford: I’m really glad to be here, really excited to have Alex and Mike here, from Laude Institute and Stanford. Guys, do you want to do just the quick intros, for yourselves, and let us know a little bit about where you come from for this call?

Alex Shaw: I’ll introduce myself first. Yeah, my name’s Alex, I work at Laude Institute. I’ve been at Laude for about a year. I was at Google for a while before that, working on, like, ad recommendations and conversion modeling stuff, and then joined Laude Institute right around when it was being founded, and I’m happy to talk more about what exactly Laude Institute is, if you guys are interested later. But, yeah, we work on a couple different directions, but we do kind of what we call impact research, which we define as research that doesn’t stop at the paper. So we love publishing papers, but we also love producing actual artifacts and products and projects that real users can use in order to adopt the inventions that made it into the research paper. So, yeah.

Mike Merrill: And I’m Mike Merrill, I’m a postdoc at Stanford, excited about all things agents and evals and autonomy, and yeah, really excited to chat with you guys today.

Kobie Crawford: Right on. Thanks so much. Fred, do you want to do a quick intro for yourself as well?

Fred Sala: Awesome. Yeah, hey folks, I’m Fred Sala. I’m a professor at the University of Wisconsin-Madison, and I’m also the chief scientist at Snorkel AI. I’m also super interested in all data-centric and agent-focused things.

Kobie Crawford: Super cool. Thank you, thank you. So, Kobie Crawford here, I’m Developer Advocate at Snorkel. 

Rationale for creating Terminal-Bench

Kobie Crawford: Let me throw this first question out: how did you come up with the initial idea for Terminal Bench? And, you know, what aspects of the existing benchmarks did you feel were missing?

Mike Merrill: So, I think there was one insight that started Terminal Bench, which is that if you go back a year or so, the primary way that people were interested in getting language models to control computers was through operating computers’ GUIs. So, you know, you had these systems which took visual reasoning models, and they would move a cursor around, click on drop-downs, and do things like configure EC2 instances, or go book you a flight, or something like this. And this made for very flashy demos, but if you ever actually tried using one of these systems, you found that they fell apart pretty quickly. Like, they weren’t able to maintain a long chain of thought, they weren’t very good at, like, precisely clicking on GUI elements, and you know, I think about a year ago, we were thinking, like, what’s a better way of doing this? And one of the insights that led us towards building Terminal Bench was that, like, even if reasoning visually is not so good, language models are really good at writing code, and they’ve been good at writing code for a number of years. And so, why don’t we write code to control a computer instead of relying on this dragging and dropping and clicking through GUIs that might not even be well-designed in the first place? And I think that’s where, like, the core heart of Terminal Bench came from: the terminal is this interface that allows us to control computers through text, which is the modality that still works best for language models. And Ludwig and I had been talking about promoting this as a way of doing agentic use. A benchmark seemed like a really good place to get started, and through some previous connections that Ludwig and I had, we were able to meet Andy, who is the… one of the founders of Laude Institute, who introduced us to Alex, who, like, really helped get the benchmark off the ground, and has been a great collaborator ever since. I don’t know if you’d add anything to that.

Alex Shaw: Yeah, I guess I would just add from Laude Institute’s perspective, or from our perspective, when I first joined Laude, we spent 3 months building this thing called the K Prize, I guess technically the Konwinski Prize, which is this $1 million prize around SWE-Bench, where we built, like, a continual… continually updating version of SWE-Bench that was contamination-free, that people could compete on for a $1 million prize. And SWE-Bench specifically is… you take GitHub issues that have been closed by some PR that added unit tests, and you give the language model a GitHub issue, and then you ask it to solve the problem, and then you run it against the unit tests that were contributed as part of the original pull request that solved it. And we just spent a lot of time looking at that benchmark and thinking about how they implemented it, and it became clear that there was maybe a broader abstraction that didn’t constrain tasks to be GitHub-specific, with open source Python repos and pull requests that had to close them. What if you actually just had, like, a generic instruction, a container, and some sort of test executable, like a test script? Then, in theory, since a container is essentially just, like, a simulated computer, you should be able to frame almost any task on a computer in this task format. And then that became the task format for Terminal Bench, and the terminal became the tool by which these agents solved these tasks. And, yeah, that was… that was right before the release of all these CLI agents, and then they started to come out. I think Devin had already come out at that point, or their demo had come out at that point, but then Claude Code, Codex CLI, and all of these other ones have come out since then. So I think it is a paradigm that is already powerful in terms of, like, adoption.
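
To make the task shape Alex describes concrete, here is a minimal illustrative sketch in Python: an instruction, a container image that defines the environment, and a test executable that decides pass or fail. The field names and values are hypothetical placeholders, not the actual Terminal-Bench task specification.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Illustrative shape of the task format described above (not the real spec)."""
    instruction: str   # natural-language description of what the agent should accomplish
    image: str         # container image that defines the starting environment
    test_script: str   # executable run after the agent finishes; its exit code decides pass/fail

# Example: almost anything you can do on a computer fits this shape.
example = Task(
    instruction="Compile the project in /app and make all unit tests pass.",
    image="ubuntu:24.04",
    test_script="/tests/run_tests.sh",
)
```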

Kobie Crawford: No doubt. Agreed 100% about that. I got to see Mike give a presentation, and he talked about, you know, that coder that everybody knows who does everything on the command line anyway. The fact that the OS happens to provide this really rich GUI is like, so what? I’m getting… I’m very productive like this; they sit in Vim and do everything from there and crush it. And it does feel like that remains the case, that, like, you know, the text modality does offer certain kinds of speed and certain kinds of efficiency. Like you said earlier, language models are built on text, so, you know, use the modality that they’re built for.

Alex Shaw: And even the original computers, right, were just terminals. Like, the GUI actually came later, so it’s kind of proof that that is a method of operating computers generally.

Kobie Crawford: It just so happens (laughs)

Benchmark design considerations

Fred Sala: And thanks for that description of, kind of, the origination of the Terminal Bench idea. Once you folks came up with that idea, we’re curious how you chose to design the benchmark, and, like, what inspired those choices.

Alex Shaw: I guess I kind of partially answered that: looking at how SWE-Bench framed their problems informed how we designed what we thought was a slightly more abstract task formulation. There were a lot of questions in the early days of, like, what actually is a headless terminal? Like, how do you give an agent a terminal? Like, what are the affordances of a terminal, right? You could give it the ability to execute bash commands, but actually a terminal is more powerful than that, because from Bash, you can activate other programs, like Vim and things like that, that are interactive. So, we spent a lot of time, yeah, really trying to dial in, like, what does it mean to give an agent a terminal, what are all of the affordances of a terminal, and how do we design that to make it possible for an agent to use? And then, specifically, one thing we cared a lot about was making it interactive, so the agent actually had a live container that was running that it would interact with by executing these commands. It could look at the feedback that it was getting from the terminal output and inform its decisions for what to do next. And I think these types of live, like, agent interactions with containers to generate rollouts and rewards have also become fairly popular for both training models and evaluating them.
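
As a rough sketch of what “giving an agent a terminal” can look like in practice, the snippet below drives a detached tmux session from Python: the agent types keystrokes (enough to operate interactive programs like Vim, not just run one-off Bash commands) and reads back whatever is on the screen. This is an illustrative toy, not the Terminal-Bench harness; `next_command` and the session name are hypothetical stand-ins.

```python
import subprocess
import time

SESSION = "agent-demo"  # hypothetical tmux session name

def start_terminal() -> None:
    """Start a detached tmux session for the agent to drive."""
    subprocess.run(["tmux", "new-session", "-d", "-s", SESSION], check=True)

def send_keys(command: str) -> None:
    """Type a command into the session, as a human would, then press Enter."""
    subprocess.run(["tmux", "send-keys", "-t", SESSION, command, "Enter"], check=True)

def read_screen() -> str:
    """Capture what is currently visible in the terminal, to feed back to the model."""
    out = subprocess.run(["tmux", "capture-pane", "-t", SESSION, "-p"],
                         check=True, capture_output=True, text=True)
    return out.stdout

def run(next_command, steps: int = 10) -> None:
    """Toy loop: the agent sees the screen, picks the next keystrokes, and repeats.
    `next_command` is a hypothetical call into your language model."""
    start_terminal()
    try:
        for _ in range(steps):
            command = next_command(read_screen())
            if command is None:
                break
            send_keys(command)
            time.sleep(1)  # crude: give the command a moment to produce output
    finally:
        subprocess.run(["tmux", "kill-session", "-t", SESSION], check=True)
```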

Mike Merrill: I’ll add one other thing to that, which I think was a great design decision that Alex really made early on in building the benchmark, which is to make it possible to evaluate any agent and any language model inside of our harness. So, the way that our harness works is that the agent gets installed directly into the container. And what this means is you don’t have to worry about the complexities of hooking up this external environment to your agent. You know, your agent doesn’t need to communicate with your environment over, like, an SSH connection, or over MCP, or over some other bit of plumbing. Like, all we do is just take the agent and put it directly into the environment. So this means you can run Claude Code, or Aider, or Codex CLI, or Gemini CLI, or Cursor, or OpenHands, or any of these other agents within our harness, because they just get installed directly into there. And really, any agent that can be run on a computer can be run in our harness. And we saw that as being huge for driving adoption, because anyone could just bring the code that they already had and plug it directly into the harness pretty quickly and get a number back out.
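
To illustrate the “install the agent into the environment” idea Mike describes, here is a minimal sketch using the Docker Python SDK. The install URL and the `my-agent` CLI name are hypothetical placeholders; the point is simply that the agent runs inside the task container rather than talking to it over SSH or MCP.

```python
import docker

client = docker.from_env()

# Start the task's environment container and keep it alive.
container = client.containers.run("ubuntu:24.04", command="sleep infinity", detach=True)

try:
    # Install the agent *inside* the container (hypothetical install script and CLI name;
    # a real setup would bake prerequisites like curl into the task image).
    container.exec_run("bash -lc 'curl -fsSL https://example.com/install-my-agent.sh | bash'")

    # Hand the agent the task instruction; it works entirely within the environment.
    exit_code, output = container.exec_run(
        ["my-agent", "--instruction", "Fix the failing build in /app"]
    )
    print(exit_code, output.decode())
finally:
    container.remove(force=True)
```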

Design constraints and impact on agent evaluation

Kobie Crawford: Yeah, that’s huge, super cool. I appreciate the idea that you set up, saying, like, this is most generally applicable and can be… any agent can run in there, but by installing an agent inside the container versus, you know, exposing, like, a uniform interface for those containers to connect to, are there any constraints that you feel like you ended up taking on that you’re thinking about now?

Mike Merrill: Yeah, yeah, like, I think probably the best example of this is how we handle the termination of a task. So, every task in Terminal Bench is given a time limit, and this time limit varies from one task to another. If it’s an easy task, it might only have a couple of minutes. If it’s a substantially harder frontier task, then it might have several hours. And one limitation of doing this is that if your API calls are particularly slow, or your agent runs particularly slow, or your hardware is slow, then you’re more likely to time out, because the agent just can’t operate as quickly in the environment. And this is… this is unfortunate. But the reality is that we do need some way of telling the container that it’s done. That the agent is probably stuck in some loop, that it’s not going to actually solve the problem, and we need to terminate and move on. You asked about the compromises that we had to make, and the affordances that we designed into the benchmark, and the reason that this is a good example is that there are probably more elegant ways of doing this that just wouldn’t be compatible with an arbitrary agent. So, like, one thing you might want to do is limit the number of turns that an agent takes, right? Like, if you said every agent only gets to make 100 turns, then it wouldn’t matter what hardware it was running on, it would still time out at a consistent and uniform place. The problem is that there’s not a good way of taking an arbitrary agent and getting a hook into it to tell you how many turns it’s taken. And it’s not even, like, well-defined what a turn is. So, if Claude Code calls some sub-agent, are we gonna count that as a turn, or are we just going to say that that’s, like, part of the main agent loop, right? Similarly, you could do something like limit the number of tokens that an agent was allowed to use to answer the task. But this would also require you to build hooks into each agent to get that information out while it’s executing. And so we just said, like, we don’t want people who use our benchmark to have to deal with this, and so we’re gonna make the compromise of just attaching a timeout to every single task. And this way, you know, you don’t need to write any code to tell us how many tokens you’ve used, you don’t need to write any code to tell us how many turns you’ve made, like, we’re just gonna cut you off after a set amount of time. And everyone’s gonna be on that same playing field.
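
A minimal sketch of the kind of wall-clock cutoff Mike is describing: wrap the whole rollout in a timeout and, when it fires, stop the agent and grade whatever state it left behind. This is generic Python, not the actual harness code; `run_agent` and `run_tests` are hypothetical async callables.

```python
import asyncio

async def run_task(run_agent, run_tests, timeout_sec: float) -> bool:
    """Run the agent with a per-task time limit, then grade the resulting state."""
    try:
        await asyncio.wait_for(run_agent(), timeout=timeout_sec)
    except asyncio.TimeoutError:
        pass  # agent is cut off; its partial work is still graded below
    return await run_tests()  # the test script decides pass/fail from the container's state

# Usage: an easy task might get a few minutes, a frontier task several hours, e.g.
# asyncio.run(run_task(my_agent_rollout, my_test_runner, timeout_sec=10 * 60))
```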

Kobie Crawford: That makes sense.

Alex Shaw: Can I add one more compromise to the list, actually? Because I think it illustrates an interesting part of our process. Early in the development of Terminal Bench, we had ideas around how agent developers were going to use the benchmark, and we had ideas around how many tasks people needed for the benchmark to be useful for them. And we took these assumptions, and we told Andy, my boss, the founder of Laude Institute. By the way, I should just explain Andy’s background really quickly. Andy Konwinski co-founded Databricks and then co-founded Perplexity, took his knowledge of how to turn research into startups and products and open source projects like Spark, and has embodied that in Laude Institute, so that’s kind of where this mission around impact research came from. And Andy is very, very much, like, a move fast, get early feedback type of person, and we told him these assumptions. And he was like, I think you could release a benchmark with 20 tasks. I don’t think people care how many tasks you need. We were like, no, that’s crazy, like, we need lots of tasks, and he was like, the bigger question is just how are they going to use it? Like, will they be able to adopt it? So we went and talked to a bunch of different agent developers, like OpenHands, even Manus, who has this general purpose agent that uses a terminal, and the consistent feedback we got from all of them is: we don’t want more tasks, we want to know how in the world we’re supposed to evaluate our agent with its current tools. And at that time, we had not allowed people to install agents into the container to evaluate them, and we ended up having to make that compromise, where being able to install and run agents inside a container constrains, to a certain degree, the types of tasks that you can have, because that means, like, you need to expect a certain style of environment. For us, it’s Linux, like Ubuntu-style environments, and then you can’t break things like internet connectivity inside the container, because the agent has to be able to make LLM calls. So, those are some compromises that we made based on feedback that we got from prospective users, because Andy told us to go out and talk to users, pretty much, and I think that that was critical in, like, adoption.

Kobie Crawford: Nice. And great to, like, recognize, you know, that these design decisions always have, like, that practical impact, right? So, like, in every situation, who your target user is has to be, like, front and center in the decision process, so that’s awesome.

Building the Terminal-Bench community

Fred Sala: I can ask the next one, and this is, like, near and dear to my heart, managing big academic projects as well. So I think, yeah, pretty much all the really successful benchmarks nowadays are these very broad collaborations, where there’s lots of folks who form a community and really, like, help drive things forward. I’m curious how you folks manage that aspect, how you built up that community for Terminal Bench, and how you kind of keep that going.

Mike Merrill: When people ask me about Terminal Bench, I always say, like, the hardest and most rewarding part of the project has been, like, building this community and sustaining it, and it’s, like, the thing that I would be least excited to do again. Not because it wasn’t rewarding, but because it’s very hard, and not something that I had done before, either. So, I think, like, initial traction to the community came because Ludwig, you know, my postdoc supervisor, gets probably one or two emails a day from undergrads or master’s students who want to work with him. I think a lot of professors are familiar with this, Fred, I’m sure you’ve got…

Fred Sala: Very familiar.

Mike Merrill: similar, similar inbound. And what I just started doing was every time one of these people would email Ludwig, I would get on a call with them. And so, for a period of time there, I was taking, like, 2 or 3 half-hour calls a day with people who had reached out to Ludwig, asking to help with just any research project. And I think something that made those calls successful was we had a very clear ask for what we wanted people to do. Like, if you wanted to help out with the project, you could go and make some tasks. You know, we said if you make three tasks, we’ll make you co-author on the paper, and if you make fewer than that, we’ll thank you, and we’ll be very grateful. And it’s just, it’s like an easy way to get involved. You know, so having a very clear, highly parallelizable, easy way for people to get involved with the project, I think was huge for our initial growth. There is also this phenomenon of growing an open source project, which actually Andy talked to me a lot about, which is this idea of your funnel, like your engagement funnel. And with any open source project, you’re getting a lot of people coming in the top. You know, people are gonna find you on Twitter, people might see you trending on GitHub, and they’re gonna come check in, maybe they’ll join your Discord, maybe they’ll ask a question. And very few of these people actually convert into serious contributors to the project. You know, probably, like, less than 5% of the people are going to actually go on and do something. I mean, we have a thousand stars on GitHub, and we have 100 contributors, and we have probably 5 active contributors, right? And so, you need to be able to identify these people, but also have a big enough funnel that, when you get down to that 0.5% of people who are going to be active, you still have a critical mass that’s going to be able to help out the project. So we made the funnel bigger by tweeting about it, you know, going and talking to, like, everyone we knew, you know, using these people who reached out to Ludwig, promoting in other Discords, like, sending out messages on mailing lists. One of our contributors posted on RedNote, you know, like, the Chinese TikTok, saying that if people came and helped and contributed to the project in a certain way, we’d make them co-authors. And this led to, like, 200 stars on GitHub, basically overnight, because people were excited and joined, and his video went kind of viral. So, you know, like, there have been many inflection points where the community’s grown very rapidly. But I think the fundamental thing that we did well was, like, in the beginning, when we had a small funnel, really hold people’s hands through the process, and give them clear ways to improve and act on the project. And then, at larger scales, the problem becomes about identifying talent and giving people responsibilities so that they feel like they have ownership in the project to make them more engaged members of the community. And that’s been our philosophy.

Fred Sala: Love it. It also resonates with me both on, like, managing academic projects, which have a very similar kind of funnel experience, often at smaller scale, because we’re not always building a big community around a project. But it’s also kind of true with the contributors for Snorkel, for example.

Mike Merrill: Sure.

Fred Sala: Lots of folks say they’ll help build data, and then you do end up funneling down to, like, you know, the three MVPs who are actually doing 80% of the work, or what have you.

Mike Merrill: Yeah, yeah, and those people are so valuable. I mean, like, every open source project has a core of people without whom the project would be impossible, you know? And there’s huge stacks of the internet that are held up on the backs of these, like, very valuable open source contributors, and so, you know, we love them.

Alex Shaw: We treat those people like our customers as well, right? Like, we try to think about what are the points of friction that make it harder for them to contribute to the project, and how can we get them onboarded most easily, with the fewest human blockers in their way for getting things merged and taken care of?

Laude Institute and the project philosophy

Kobie Crawford: That makes a lot of sense. Super cool. I love that vibe of treating them like customers, because it’s actually different than treating them like colleagues. It’s a different approach, thinking about, like, the customer point of view. I really like that a lot. And I guess that ties in with the way you described it in terms of the funnel as well, contributions coming in that way, because, you know, on the marketing side, people always talk that way, right? So, it’s very interesting, that parallel. That’s really cool. Alex, earlier you talked about Laude Institute’s approach to research, and, like, what Andy brought to the table on that. Would love to know, from your point of view, what… feels like the… sort of like the biggest win for Terminal Bench, based on working that way. And, you know, the energy around, like, having shipped, like, a research idea, and then actually saying it’s got to get out in the community’s hands, sooner. Are there things that you think about that are biggest there? Maybe you’ve already spoken about them, because this sort of, like, ties into, like, what’s already happened for the project, or are there other things that might come up about, like, that particular angle or approach that you might want to bring up?

Alex Shaw: Yeah, I think I’ve alluded a couple times to ways in which that mentality or style of research has helped us. I guess in general, just to, like, give the full motivation of how Laude Institute came to be: Andy was a PhD student at Berkeley. He and his lab mates built Spark, which was a massive open source success, and they ended up then co-founding Databricks to be kind of, like, the best place to run Spark, and obviously took that to… I mean, it’s gotta be hundreds of millions of users, if not a billion users at this point, right? And that was… that’s now a $100 billion company. And then Andy, like, 5 or 6 years later, stepped down from Databricks and went and teamed up with Aravind to create Perplexity, and that was a lot faster. I think they built 7 products in 4 months, so every 2 weeks, pretty much, a new product, new demo, until they got to product-market fit, to what Perplexity is today, and now that’s, I think, a $20 billion company. And Aravind had just graduated from his Berkeley PhD as well, so he came from this, like, research-y background, trying to then, like, take what he had been researching, which at the time was large language models, and turn it into a product that actually solved users’ problems. And I think both of these just highlight the fact that we believe there’s this inefficiency in researchers solving novel problems that real people experience, but then not delivering the actual solution to the user. And I think whenever there’s an inefficiency, there’s an opportunity to create value, and we believe that we will create value by not only participating in impact research ourselves, which is what I do at Laude Institute, but also, like, funding researchers, through quick and fast grants, like $20,000 grants that go to PhD students who want to ship a project that they’re working on, all the way up to, like, funding entire labs for multiple years. And I think for Terminal-Bench specifically, I guess for me, when I was doing my master’s degree, I spent a lot of time trying to make language models take surveys to see if they could accurately predict human opinions or responses to certain styles of questions, and I remember we built this cool framework for how you could, like, quickly import surveys and have language models take them, but then in my mind, I was like, well, we need to publish a paper, like, we’re not going to release this framework, because that’s not our purpose. Like, we’re not going to actually spend time writing quality code. And now I look back on that, and specifically with Terminal Bench, it was the exact opposite. We said, “We will write a paper, but we first and foremost want to build something that people are going to use and get value from.” And I look back at my experience as a master’s student, and I think to myself, oh, that would have been such a great opportunity to build and open-source this tool so other people can make language models take surveys as well. It’s a simple example, but I do think that thinking about users and use cases has generated a ton of adoption for Terminal Bench, as well as the novel research insights.

Kobie Crawford: Right on, right on. Yeah, Qualtrics and SurveyMonkey are busy, like, you know, saying, phew, we just dodged a bullet because Alex Shaw didn’t release his project. That’s actually a really great story. I like the point that you make about, like, actually beginning with practical application as the primary thing, and yes, we will publish, but seeing that the first step is actually making it useful and putting it in people’s hands. That’s awesome.

Introducing the Harbor execution framework

Fred Sala: Can you folks tell us about your new project, Harbor?

Alex Shaw: We’re still working out the right way to pitch this, you know, because it’s completely new, but I think this is also one of the fruits of the Laude Institute approach to research, which is we built Terminal Bench in such a way that it could be easily used, and because of that, it actually got used in ways that we didn’t even anticipate. So we released it as just… originally, we set out to build a dataset, right? Like a benchmark. And we saw people plugging Terminal Bench into their CI/CD pipelines as, like, unit tests for their agents before they deployed them, with tasks that they created themselves in-house. There was this popular project that came out called Terminal Bench RL, which was somebody who used synthetic Terminal Bench tasks to do reinforcement learning on a Qwen model. We saw GEPA integrate it as one of their adapters for prompt optimization. So, really, there was a broad range of use cases that I guess we didn’t necessarily anticipate, and they, at face value, seem like they’re maybe completely different use cases. You wouldn’t necessarily anticipate that there’s a common layer of abstraction between all of them. But there is, and the abstraction is containerized environments that you perform rollouts with, and then you return tokens and a reward. And it turns out that’s the exact same thing you need for an eval, as you need for RL, as you need for prompt optimization, and the container-based interactive approach means it works with agent scaffolds. And agents are the primary form in which language models are now being deployed as, like, solutions, or at least it’s the approach that people are most excited about, I think. That also requires the most tuning and optimization and confidence in terms of understanding your agent’s performance through evaluations. So, we have built a new package and repo called Harbor that is essentially everything you need to create and use these environments for all of these different use cases. And our goal was to make it as few lines of code as humanly possible to get up and running and scale to hundreds or thousands of containers quickly.
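
One way to picture the shared abstraction Alex describes: a containerized environment you can roll an agent out in, which hands back a trajectory and a reward. The interface below is an illustrative sketch of that shape, not Harbor’s actual API; evals, RL, and prompt optimization all consume the same pair of outputs.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Rollout:
    tokens: list[str]   # the trajectory the agent produced while working in the container
    reward: float       # e.g. 1.0 if the task's tests pass, 0.0 otherwise

class Environment(Protocol):
    """Illustrative interface for one containerized task instance (not Harbor's real API)."""
    def start(self) -> None: ...                          # boot the container
    def rollout(self, instruction: str) -> Rollout: ...   # run the agent, return tokens + reward
    def stop(self) -> None: ...                           # tear the container down

# The same loop serves evaluation (average the rewards), RL (update on tokens + reward),
# and prompt optimization (search for prompts that maximize reward).
def score(envs: list[Environment], instruction: str) -> float:
    rollouts = []
    for env in envs:
        env.start()
        rollouts.append(env.rollout(instruction))
        env.stop()
    return sum(r.reward for r in rollouts) / len(rollouts)
```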

Kobie Crawford: Nice. That’s fantastic. I mean, one, the idea that you can simplify what people have to do for that, you know, like, again, you sort of optimize for that use case, sounds like, what people typically want. And then my question that follows on from that is, how do you feel about the abstraction that you’re choosing right now to offer them, and what do you see as, like, the pluses and minuses of the choices that you’re making around the abstraction that you’re providing for that?

Alex Shaw: I think there’s two types of agents that are gaining popularity right now. There’s tool use agents, which constrain the action space to be a set of very specific tools that are intended for a very specific task. And then there are general purpose agents, which right now, mostly are CLI agents which have a huge action space, usually an action space as large as the terminal, plus maybe, like, a search or a browser tool. And they’re intended to accomplish all sorts of different tasks. I saw, like, the Anthropic head of developer relations tweeted and said, like, what non-coding tasks are you using Claude Code for? And there’s, like, this huge string of comments of all these things that people use it for. And then I’ve seen other agents, like Biomni, which is an agent that came out of Stanford recently, which is a computational bio agent, but it also is general purpose, right? It has this massive action space of the terminal. And we believe that these general purpose agents will be popular in many verticals, not just coding, and also that that will just be a dominant use case in general. And we’re specifically targeting those right now. So, rather than going after the tool use agents, I mean, maybe we will expand the framework to account for those as well. I think you still need to containerize those to scale rollouts. But, yeah, right now we’re really focused on providing infrastructure for general purpose agents that use the terminals.

Transitioning to Terminal-Bench 2 and Harbor

Kobie Crawford: Right on… that makes sense. And then from a configuration and usage standpoint, given what people already are comfortable with using Terminal Bench, what’s the delta between how they’ve worked with Terminal Bench before now to what they would do working with Harbor?

Mike Merrill: I think what we’ve done with Harbor is we just made it easier for people to do all the things they were doing already. Let me give a good concrete example of this. So, like, before, if you wanted to run Terminal Bench in parallel, or any of our adapted datasets (we also offer several other benchmarks that can be run inside Terminal Bench, like SWE-Bench, for example), you would have to go get the biggest EC2 instance that you can, that has, like, 128 cores and, like, a terabyte of RAM, and run 32 Docker containers at the same time on a single machine, you know? And this becomes necessary, because, you know, there’s 500 tasks in SWE-Bench. Like, what, you’re gonna run those one at a time locally on your laptop? That doesn’t really make sense, particularly if you have the API rate limits for it. But what Harbor offers is this abstraction where we can just deploy into arbitrary containers hosted somewhere in the cloud. So if you’re using one of these neoclouds, like Daytona, or E2B, or Modal, you can point the Harbor harness at these containers, and then deploy all your rollouts there with potentially thousands of containers at the same time. Or you can attach to your Kubernetes cluster, and you can run your deployment there. And, like, this is something that you could probably have hacked together using the Terminal Bench harness, but it wasn’t abstract enough to make this easy. So that’s… that’s how we think about it. It’s like, people are gonna keep doing the things that they were already doing with Terminal Bench, but in a way that’s, like, fully supported by the software that we’re offering, and makes it easier to experiment and evaluate and optimize your agent.
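
The parallelism Mike describes might look roughly like the sketch below: fan rollouts out across remote containers with a cap on concurrency, rather than packing Docker containers onto one oversized machine. `run_one_rollout` is a hypothetical coroutine that provisions a sandbox with your provider of choice, runs the agent, and returns the score; none of this is Harbor’s actual code.

```python
import asyncio

async def run_benchmark(tasks, run_one_rollout, max_concurrent: int = 200) -> list:
    """Fan tasks out to remote containers, with at most `max_concurrent` in flight at once."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(task):
        async with sem:
            # Hypothetical: provision a cloud sandbox, install/run the agent, grade the result.
            return await run_one_rollout(task)

    return await asyncio.gather(*(bounded(t) for t in tasks))

# e.g. results = asyncio.run(run_benchmark(all_tasks, run_one_rollout, max_concurrent=500))
```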

Kobie Crawford: That makes sense. That’s awesome. That’s fantastic. I saw in the Harbor repo that, like, a lot of it is still essentially gonna be, like, a YAML file for providing the config, and say, go get these things, and then do that stuff. And so, if people are able to see it as a config step like that, that’ll be super cool. When it comes to the containers themselves, are you starting with some assumptions about what will be in the container, or can people literally start from, like, a from-scratch image and, like, go from there, or… like, what… are there any, like, sort of, like, prerequisites that you want, kind of, like, most everybody to provide to make your container viable?

Alex Shaw: We’re not making any assumptions about the container environments. I guess most of the agent implementations assume that Ubuntu will be the operating system running inside the container. But, yeah, in Terminal Bench, we had some pre-installed packages, but we actually even pulled those out with Harbor. Yeah, I think… well, I was talking to one of the leads on SkyRL, which is a reinforcement learning framework out of Berkeley, the other day, and he said that, actually, often, when they have helped people start using SkyRL, the hardest part has been scaling container rollouts, rather than the actual weight updates using the SkyRL framework itself. So, hopefully… I mean, to be honest, we’re still working to make this as seamless as possible, because to some extent, you’re beholden to the reliability of the cloud provider that you choose to use. But I think it is… it’s a deceptively difficult problem to scale these container-based rollouts. Deceptive, I say, because at face value, it appears like it might be simple, but there are a lot of gotchas, I think, that can hit you, so our goal is, yeah, to abstract it away for all the users and let them actually just do their research and their development.

What’s exciting about Terminal-Bench 2.0?

Fred Sala: Awesome. Yeah, we should ask you guys the most important question. What’s exciting about Terminal Bench 2.0?

Mike Merrill: Yeah, yeah. I think there’s a few things that made us really excited to do Terminal Bench 2.0. The first is that, you know, we had always planned for this to be a live benchmark. You know, like, from the very beginning, we built versioning into our harness, because we anticipated that we’d be making changes to the benchmark as model capabilities changed. It’s really important to us that, like, this not become saturated. You know, you’re seeing frontier SWE-Bench performance being at, like, you know, 85%, and it’s, like, not clear if those final 15% of tasks are well specified, or, you know, maybe don’t have great test cases. So we wanted to make sure that, like, there was always going to be room to improve on Terminal Bench. And what this means is we need to increase the difficulty as frontier model performance increases, too. So, that’s the first thing about Terminal Bench 2, is that it’s harder. The second thing about Terminal Bench 2 is that it’s even better verified than Terminal Bench 1 was. And so, what does this mean? If you look at the tasks in Terminal Bench 1, you’ll find that several of them have bugs that are not the most desirable. So, several tasks in Terminal Bench 1 are unsolvable, or very difficult to solve for artificial reasons. Like, we may have thresholds that are set a little bit arbitrarily: an agent could, for example, train a model that does very well on a validation set, but not a test set, and just for reasons of random noise, it ends up failing. Like, this is not desirable, it’s not reproducible. Some tasks aren’t particularly robust. So one task in Terminal Bench 1 requires you to download a video from YouTube, and YouTube is constantly changing its anti-bot protections. And so, like, a solution that works on one day of the week may not work on another, and this is clearly not desirable for reproducibility. Some tasks are too easy in Terminal Bench 1. You know, we have this Hello World task that we included for debugging, and that was great, like, it’s useful for debugging, but it’s also not something that we want to be evaluating, like, the strongest, most badass models out there on, because it’s just kind of, like, an easy point that we give away to every good model. And some tasks also just weren’t, like, intrinsically valuable: you know, like, things that might say something interesting about model behavior, but don’t tell us anything about the impact of models on the economy and their ability to do valuable work. So, we revisited all of these things for Terminal Bench 2. We made sure that all of our tasks are much harder. We spent hours manually verifying each task. Like, Alex and I looked at every task in the benchmark and went through a checklist, and ran a million agents against it, and like…

Alex Shaw: It was excruciating. 

Mike Merrill: I mean, you guys are in the data business, like, you get it, right? It’s a lot of just, like, looking at spreadsheets and organizing people and making sure that everything is at the highest possible standard of quality. So we spent way more time on that for Terminal Bench 2. And I think… I think the end result is a benchmark that is harder, but also substantially higher quality. So, like, we have much more confidence in every task that’s part of the benchmark now, and, like, I can say with a high degree of certainty that the highest possible score a model or agent could get is pretty close to 100. You know, every task is possible, and it’s, like, reproducible and consistent. So, it just has better smells, you know? Like, Terminal Bench 2 just smells better than Terminal Bench 1 did, and we think when people take a look at the tasks that are in it, they’ll agree. 

Alex Shaw: And I will say, we have 89 tasks, and you might think to yourself, “That is such a random number, why would you have 89 tasks?” And it is literally because we wanted so badly to get to 100 tasks, despite Andy constantly telling us that we didn’t need 100 tasks. But for us, the quality threshold was the most important part of Terminal Bench 2, because we saw how that could negatively impact a benchmark; the quality of Terminal Bench 1 just wasn’t as good as we aspired for it to be. So, we… we filtered out every single task that didn’t live up to that standard, regardless of whether that meant we hit our 100 tasks or not. So, there’s 89 tasks, but I think it’s 89 really good tasks.

Mike Merrill: It’s 89, but you’re gonna like them.

Alex Shaw: It’s been really fantastic working with Snorkel, and Snorkel contributed tasks. And also, some Snorkel team members worked directly with us to help analyze a bunch of data that we generated while we were running all the evaluations. So, yeah, Snorkel data has been great, and the Snorkel employees have been great as well. So, just a 5-star experience.

Mike Merrill: Yeah, pleasure working with you guys.

Fred Sala: We appreciate you guys saying that!


Terminal-Bench and the Harbor framework are open-source projects led by Stanford University and Laude Institute, with contributions from a vibrant community of individuals and organizations, including Snorkel AI. To learn more, visit tbench.ai, harborframework.com, or join their Discord community. To find out more about Snorkel’s data development platform and our work with frontier AI labs, visit us at snorkel.ai and connect with our team.