Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration

At our latest Snorkel AI Reading Group, Yijia Shao (Stanford NLP) stopped by our San Francisco office to present Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration. As LLM agents get better at automating tasks on their own, a large class of real-world problems still needs a human in the loop – for their preferences, their domain expertise, or simply for control. Co-Gym is an open framework for building and evaluating agents that communicate bidirectionally with humans while working inside a task environment, using a flexible, non-turn-taking interaction paradigm, with an evaluation suite that measures both the outcome of a collaboration and the process behind it. Across travel planning, related-work writing, and tabular analysis, the best collaborative agents consistently beat their fully autonomous counterparts when evaluated by real users – and Yijia is candid about where today’s models still fall short, especially on communication and situational awareness.

Transcript

Lightly edited for readability.

Thank you so much for the introduction, and hi everyone, thank you for having me. It’s my first time visiting Snorkel’s office; it’s a really nice place, and it’s great to see a full room, with more people than I expected.

I’ll start with a quick introduction. My name is Yijia, and I’m a computer science PhD student working with Professor Diyi Yang. Last year I was also fortunate to work as the first intern at Thinking Machines, with John and Lilian. My PhD research focuses on human-agent collaboration. The work I’ll share today, Collaborative Gym, was actually envisioned back in 2024, and over the last two years we’ve put out several works in this direction, engaging not only AI researchers but also product builders and economists. So today I want to share not just the paper itself, but also the story behind why we thought beyond full automation early on, and why we believe human-agent collaboration will be a new frontier.

Looking beyond full automation

I think the most unprecedented change in our society over the last two years is that AI agents have finally crossed a capability threshold. This is a plot from METR, an organization focused on evaluation. On the x-axis is the model release date, and on the y-axis is the length of task these models can do with a high success rate. What they observe is that the length of task AI can do is almost doubling every seven months, which feels like a new Moore’s law.

To give you a sense of how dramatic this is: compared to five or six years ago, when I started working on language modeling, it took a lot of effort just to get these models to answer questions. By 2026, with some agent scaffolding, these models can do AI research on their own to a certain degree.

The direct consequence is that AI agents are now leaving the lab and entering concrete human work, and this goes beyond software engineering. Here’s a study from Anthropic analyzing what people use Claude Code for. Beyond software engineering, people are using these agents in domains like medicine and finance, which are highly specialized but also critical.

It’s great to see agents able to do more and more on their own, for longer and longer. But in practice we also see a lot of tension. These agents go off and work for several hours, and when they report back, people often say they’ve produced a lot of “AI slop”: content that’s very hard for a human to audit or verify. As a result, people sometimes have to spend even longer verifying or cleaning up the agent’s output. And members of the general public, who don’t have much information about AI, often just choose to do the task themselves instead of leveraging the agent.

Here’s a concrete example I found hilarious. When OpenAI released a new version of Operator, a browser-use agent that can click and type like a human, a user on Twitter said they were trying to use Operator to do some grocery shopping. Instead of asking what they actually wanted or where they were located, the agent went online and searched for milk at a grocery store in a random city.

Beyond individual friction like this, at the societal level we also see autonomous agents eroding human agency. Here’s a quote from Reddit: people say we developed AI to do the boring jobs and let humans be creative, but in reality the opposite is happening. Humans are reviewing what AI produces, and we haven’t found that we have a better life.

Motivated by this, early on I settled on human-agent collaboration as my PhD focus, with the goal of building agents that collaborate better with humans and preserve human agency. Along the way there were skeptical views, and this is the one I encountered most: people would ask whether my research would just be washed away. Suppose someone releases a larger model trained on more data; won’t the problem be solved? I don’t know whether GPT-6 will solve it, but at this moment, with GPT-5.5, a lot of the problems I just described still happen. And while these models do a good job in domains like software engineering or math, making them genuinely useful for collaborating with humans, especially the general public outside the AI bubble, is still an unsolved problem.

One thing I want to call out: besides simulated experiments, we’re strong believers in testing with real humans. Some people think human-in-the-loop studies involve a lot of fuzziness, but we think that fuzziness is not a bug. It’s a new frontier: we need new algorithms and methodology to figure out how to design AI that functions in this fuzzy environment and works with humans.

The Collaborative Gym framework

Motivated by all this, the work I want to share today, Collaborative Gym, is our first large piece of work in this direction. It provides a framework for enabling and evaluating human-agent collaboration in a dual-control, non-turn-taking environment that goes beyond normal chatbot interaction.

Before introducing what we did, a quick recap of the status quo of today’s fully autonomous agents. When people build agents today, we assume the agent works in a certain environment. At every step, the agent takes an observation of the environment and, based on that, decides the next action. This is a clean input-output situation, and we’ve seen a lot of progress in optimizing these agents to operate longer and longer. This continues until the agent decides it’s time to finish the task.
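
As a minimal illustration of the observe-then-act loop described above, here is a sketch in Python. It is not code from the talk or from Collaborative Gym: `CounterEnv`, `toy_policy`, and `run_autonomous` are hypothetical names, and the rule-based policy stands in for a model call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CounterEnv:
    """Toy environment (hypothetical): the task is to raise a counter to a target."""
    target: int = 3
    value: int = 0

    def observe(self) -> int:
        return self.value

    def step(self, action: str) -> None:
        if action == "increment":
            self.value += 1


def toy_policy(observation: int, target: int) -> str:
    """Stand-in for a model call: map the latest observation to the next action."""
    return "finish" if observation >= target else "increment"


def run_autonomous(env: CounterEnv,
                   policy: Callable[[int, int], str] = toy_policy,
                   max_steps: int = 20) -> List[str]:
    """Observe, act, repeat, until the agent itself decides the task is done."""
    trajectory: List[str] = []
    for _ in range(max_steps):
        action = policy(env.observe(), env.target)
        trajectory.append(action)
        if action == "finish":
            break
        env.step(action)
    return trajectory


trajectory = run_autonomous(CounterEnv())
print(trajectory)
```

Nothing in this loop lets anyone else touch the environment or talk to the agent between steps, which is the assumption the collaborative setup removes.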

Now it gets more interesting when you add a human component. First, as a human, besides using the agent to interact with the environment, you can also take actions and observe the environment at any time. Think about using a coding agent today: when it’s writing files, you can let the agent modify them again, but you can also review the files or edit them yourself. This poses a challenge, because even before the agent takes an action, the environment can change while the human is actively doing something. Second, the human and agent have bidirectional communication. Communication can be initiated by the human, like sending messages during a task, but the agent can also proactively message the human. Ideally, if the agent needs additional information, it should message the human before the human asks anything.

So compared to the fully autonomous setup, there are more components, and the challenge is how to define an environment API that makes it easy to develop agents for this situation.

Collaboration acts: communication and intelligent deferral

The first thing is the action space for the agent. Normally, to coordinate a human-agent team, people use approaches like turn-taking, where the model always responds to the human, or predefined agentic workflows, where each step is fixed. But suppose we give the agent the freedom to decide how to coordinate with the human. In this framework, we introduce two collaboration acts on top of the task-specific action space. This is inspired by how humans actually team up, through active communication and intelligent deferral. Intelligent deferral means that if there’s a part you can’t do, out of capability or out of scope, you can defer it to another member of the team.

The task-specific action space depends on the environment you deploy the agent in. If your agent works in a browser, the task-specific actions are click and type; in a coding environment, they’re terminal or bash commands. The collaboration acts, by contrast, are shared across tasks and provided by the Collaborative Gym framework.

These collaboration acts look simple, but they unlock a different way for humans to work with agents. Here’s an illustration where each worker, human or agent, is a separate bar. Suppose a session starts with the human sending a message to the agent. In a fully autonomous setup, the human would just wait for the agent to do everything, maybe browsing Twitter in the meantime, unable to interfere. But because our framework supports dual control, both the human and agent can work on the task environment simultaneously. The agent can use a collaboration act to respond to the human’s message, and even before the human sends an additional message, it can keep working on task actions. If it realizes it needs the human to answer before it can proceed, then instead of going into an endless loop doing random things, it can defer, skipping later actions and waiting for the human. So even with just this wrapper of collaboration acts, the human-agent team has far more flexibility than autonomous agents.
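
One way to picture the two collaboration acts layered on a task-specific action space is the sketch below. The class names `TaskAction`, `SendMessage`, and `WaitForTeammate` and the helper `run_until_deferral` are assumptions made for illustration; they are not the identifiers used by the Collaborative Gym package.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class TaskAction:
    """Environment-specific action, e.g. a browser click or a bash command."""
    command: str


@dataclass(frozen=True)
class SendMessage:
    """Collaboration act: communicate with the teammate."""
    text: str


@dataclass(frozen=True)
class WaitForTeammate:
    """Collaboration act: defer and wait for the other party."""
    reason: str = ""


Action = Union[TaskAction, SendMessage, WaitForTeammate]


def run_until_deferral(plan: List[Action]) -> Tuple[List[str], List[str]]:
    """Apply actions in order; a deferral skips everything planned after it."""
    env_log: List[str] = []
    chat_log: List[str] = []
    for action in plan:
        if isinstance(action, TaskAction):
            env_log.append(action.command)
        elif isinstance(action, SendMessage):
            chat_log.append(action.text)
        else:  # WaitForTeammate: stop acting until the human responds
            break
    return env_log, chat_log


plan: List[Action] = [
    SendMessage("Got it, starting on the trip plan."),
    TaskAction("search_flights"),
    SendMessage("Which dates work for you?"),
    WaitForTeammate("need travel dates"),
    TaskAction("book_flight"),  # never reached in this run
]
env_log, chat_log = run_until_deferral(plan)
print(env_log, chat_log)
```

The point of the sketch is only that the collaboration acts are ordinary entries in the agent’s action space, so the agent, not a fixed workflow, decides when to talk, act, or wait.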

The notification system

From a developer’s perspective, there’s still a key problem: how does the agent know when to take its next action? In the fully autonomous setup this is simple, the agent takes one action after another. But now the human is actively working in the environment, and the agent is ultimately a computer program that needs to be queried at certain times. So how does it know when to act?

This is the second major component of Collaborative Gym: a notification system that ensures the agent is notified at the right time. This draws inspiration from how real-time web frameworks work. Each component runs an event loop, and certain components emit events; from each component’s perspective, it just monitors events and acts when a new one arrives.

We adopt this idea. From the developer’s perspective, the agent just sits on an event loop and handles events as they are emitted; the notification system emits them in four situations. First, if there’s a shared component in the environment, like a text editor or a codebase, and it changes, the agent is notified. Second, if there’s a private observation update, the relevant party is notified; since this is a teamwork setup with access control, the human and agent may see different things. The remaining two event types relate to the collaboration acts. The third fires when a new message arrives: the recipient is notified, so the agent knows it needs to handle the human’s message.
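
The event-loop idea can be sketched with Python’s standard `asyncio.Queue`. This is an illustrative assumption about how such a loop might look, not the framework’s actual implementation: `EventType`, `Event`, and `agent_loop` are hypothetical names, the handlers only record what the agent would do, and the fourth event type that the talk turns to next is left out.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class EventType(Enum):
    SHARED_ENV_CHANGED = auto()           # e.g. the shared editor or codebase changed
    PRIVATE_OBSERVATION_UPDATED = auto()  # something only this party can see changed
    NEW_MESSAGE = auto()                  # the teammate sent a message


@dataclass
class Event:
    type: EventType
    payload: str


async def agent_loop(inbox: asyncio.Queue, handled: List[str]) -> None:
    """Sit on the queue and react whenever an event arrives."""
    while True:
        event = await inbox.get()
        if event is None:  # sentinel used by this sketch to end the session
            break
        if event.type is EventType.NEW_MESSAGE:
            handled.append(f"reply to: {event.payload}")
        elif event.type is EventType.SHARED_ENV_CHANGED:
            handled.append(f"re-read shared state: {event.payload}")
        else:
            handled.append(f"refresh private view: {event.payload}")


async def demo() -> List[str]:
    inbox: asyncio.Queue = asyncio.Queue()
    handled: List[str] = []
    worker = asyncio.create_task(agent_loop(inbox, handled))
    await inbox.put(Event(EventType.NEW_MESSAGE, "Plan a 3-day trip"))
    await inbox.put(Event(EventType.SHARED_ENV_CHANGED, "human edited the itinerary"))
    await inbox.put(None)
    await worker
    return handled


handled = asyncio.run(demo())
print(handled)
```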

The last event type is interesting. In our initial experiments, we found a failure case in human-agent collaboration: a livelock, which you may know from operating systems. Because it’s a teamwork setup and each member can defer parts as they see fit, there can be a situation where both the human and the agent are waiting for each other, and the team makes no progress. To prevent this from blocking collaboration, our framework can signal the agent when the team has made no progress for a certain period, ensuring the team keeps moving.
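
A no-progress signal of this kind could be approximated with a timeout on the team’s activity stream. The sketch below uses `asyncio.wait_for` from the standard library; the function name `watch_for_stall`, the timeout value, and the log strings are assumptions for illustration, and the talk does not say how the framework detects a stall internally.

```python
import asyncio
from typing import List


async def watch_for_stall(activity: asyncio.Queue,
                          idle_timeout: float,
                          max_nudges: int) -> List[str]:
    """Log team activity; if nothing happens for idle_timeout seconds, nudge the agent."""
    log: List[str] = []
    nudges = 0
    while nudges < max_nudges:
        try:
            item = await asyncio.wait_for(activity.get(), timeout=idle_timeout)
            log.append(f"activity: {item}")
        except asyncio.TimeoutError:
            nudges += 1
            log.append("no-progress signal sent to agent")
    return log


async def demo() -> List[str]:
    activity: asyncio.Queue = asyncio.Queue()
    # Both parties defer to each other, then nothing else happens.
    await activity.put("agent: wait for human")
    await activity.put("human: wait for agent")
    return await watch_for_stall(activity, idle_timeout=0.05, max_nudges=1)


log = asyncio.run(demo())
print(log)
```

A real deployment would use a much longer timeout and would route the signal into the agent’s event loop rather than a log.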

Agent design

As I mentioned, the API design aims to put as few restrictions on the agent as possible: whenever an event arrives, the agent can decide its next action however it likes. This resonates with findings in the industry: with a very strong model, you don’t need complicated agent scaffolding. If you look at open-source code like Codex or Gemini CLI, the agent runs on one large loop with a very large system prompt and no complicated components. In that spirit, in our experiments we mostly used a ReAct agent: when a new event arrives, the agent accesses its own memory and decides the next action. To test how much the collaboration acts matter, we also tested another agent that uses an additional language-model core to decide, at each point, whether to communicate with the human, take a task action, or defer. This variation was designed to test whether explicit planning is especially important for human-agent collaboration.
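
The second agent variant, which first decides between communicating, acting, and deferring, might be sketched as follows. Everything here is a hypothetical stand-in: `choose_mode` uses two boolean flags where the real agent would call a language model, and the returned action strings are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentState:
    memory: List[str] = field(default_factory=list)
    missing_info: bool = False         # a detail only the human can provide
    unanswered_question: bool = False  # the agent asked and has no reply yet


def choose_mode(state: AgentState) -> str:
    """Stand-in for the extra language-model call that picks a mode first."""
    if state.unanswered_question:
        return "defer"
    if state.missing_info:
        return "communicate"
    return "act"


def on_event(state: AgentState, event: str) -> str:
    """Handle one incoming event: update memory, pick a mode, then pick an action."""
    state.memory.append(event)
    if event.startswith("human_message:"):
        state.unanswered_question = False  # this sketch treats any message as the reply
    mode = choose_mode(state)
    if mode == "communicate":
        state.unanswered_question = True
        state.missing_info = False
        return "send_message: Which dates are you traveling?"
    if mode == "defer":
        return "wait_for_human"
    return "task_action: search_flights"


state = AgentState(missing_info=True)
events = [
    "human_message: plan my trip",
    "env_changed: itinerary opened",
    "human_message: June 3 to June 6",
]
decisions = [on_event(state, e) for e in events]
print(decisions)
```

A plain ReAct-style agent would fold the mode choice into a single model call per event; separating it out is what lets the experiment isolate the effect of explicit planning.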

Experiments and results

The most important question is whether human-agent collaboration actually helps: does it outperform fully autonomous agents?

Collaborative Gym supports three representative tasks. The first is travel planning, where the human completes a travel plan that must comply with certain constraints. The second is a literature survey, where the agent has access to an external database and search tools. The third is tabular analysis, where the agent has a Python executor to write code. These represent different scenarios where people use agentic systems today. Excitingly, the framework provides a harness for experiments in both simulated and real conditions. The field does a lot of simulated experiments, but it’s often still challenging to set up experiments with real humans, so our open-source package supports testing agents with real users too.

In simulated experiments with different model backbones, we saw significant improvement over fully autonomous agents when the best-performing collaborative agents worked with the human. We also tested an open-source model, Llama at the time, because in many scenarios people want smaller, cheaper models, and we wanted to see whether the benefits hold. The answer is yes.

Even more excitingly, we looked at how real users perform. We compared our collaborative agent with the autonomous agent, with real humans recruited on Upwork who were paired with the agents to do assigned tasks. The collaborative agents were also preferred by real users. When we looked at why, here’s a quote that captures a lot of our original motivation around preserving human agency and ensuring better outcomes: people found that collaboration not only improves task performance but also leads to more natural and flexible interaction, better than letting the agent do everything, waiting, coming back unsatisfied, and repeating the loop.

Where models still fall short

Finally, analyzing the trajectories, we identified limitations in both the underlying models and the agent scaffolding. The two most striking problems were communication and situational awareness. On communication, the agent sometimes hallucinates updates to the human, which is very confusing. In one example the agent says, “I’ve already organized all the material,” but when the human checks, it hasn’t. This kind of communication error really hurts collaboration, especially when the human doesn’t have high AI literacy. On situational awareness, meaning whether the agent can decide when to do the task versus wait for the human, we found that even models like Claude or GPT struggle to decide the right time to act in a teamwork setup. This is probably because today’s training pipelines mainly run a loop to make the agent execute a task, with task completion as the objective. But in human-agent collaboration, the optimal policy is sometimes to let the human do certain parts. Comparing real and simulated trajectories, we found a high correlation in error types, which is why, in a lot of our current work, we use simulated users to train models but test them with real users.

What’s next

That’s a quick summary of what we did in the Collaborative Gym paper. To end, I want to share what this work inspires and what excites us next.

First, more and more people in the field have realized that collaboration capability is a new frontier for today’s models. Anecdotally: the Co-Gym paper is now at ICLR this year, but last year we submitted to NeurIPS and ICML and got rejected. Both times there was a reviewer asking, “Why do we want a human in the loop? Why would an agent need to collaborate with a human?” Other scores were above seven while that reviewer gave a three, which made it hard to accept. But for follow-up work in this line, we submitted to COLM and got very high scores, and we no longer saw people asking why you’d need a human. Even at frontier labs, when I talk to people, they’re focusing more and more on collaboration capability rather than just training models to do code or math. One concrete example is the interaction model released by Thinking Machines, optimized exactly for multi-turn interaction, a big part of which is making the model know when to interrupt and when to let the human talk, and how to make multi-turn interaction smooth. And recently, in a technical report from Anthropic, they point out that even for their strongest models, multi-agent collaboration is still hard. Their setup is a bit different: they study whether multiple agents can do a larger project. They found that while it can speed things up through parallel execution, it’s still hard to achieve higher task performance because of coordination issues, very similar to what we observe in human-agent collaboration.

Moving forward, one thing the current framework doesn’t provide is a way to separate contributions: the final task performance is a combination of the human’s and the agent’s work. Even in the plots I showed, where collaborative agents do better, there’s a possibility the gains come from the human doing a lot of work. So in recent work we try to disentangle the contributions of humans and agents. I hope we can make more progress on upskilling human workers, because the performance difference between different agents, like Codex or Claude Code, is much smaller than the difference between different human workers paired with them. If you check our current Collaborative Gym framework, it has this update, and hopefully we’ll present the work at COLM, which will also be in San Francisco later this year. We’re grateful to be supported by the Laude Institute to focus more on the human side. For my PhD, my goal is to advance human-agent collaboration, but it’s a bit sad that over the last two years most of my work has still focused on building better agents and training better models to be collaborative. I think at least the same amount of effort should go into upskilling humans to work with agents more effectively.

For more information, we’ve put everything on our platform, and we host an interactive platform for people to try it out. Finally, I run a podcast with a friend at MIT and another at Stanford. Over the last few years, more and more people have started working on human-centered AI, so we started the Augmented Mind podcast this year to share work beyond my direct team, covering user modeling, building AI systems for imperfect humans, and infrastructure for interaction models and human-centered development. Feel free to check it out. Thank you so much for your attention, I’m happy to take any questions.

Q&A

Q: In your evaluations, did you look at what happens to the performance difference when using a more advanced model, does the gap get larger or smaller? And how does this relate to interacting in Claude Code, where you can add a prompt while it’s working?

Good question. The Co-Gym framework was first developed to help us evaluate different models, and of course larger or proprietary models like OpenAI’s GPT or Claude perform better than the Llama model. But we see the same trend: effective human-agent collaboration can outperform fully autonomous agents, and this holds even with the Llama model. If you’re interested, there’s a paper from MIT using our platform that looks at collaborative scaling. They found that different models scale this collaborative effect differently: models like Claude keep the user engaged, so with more tokens or turns the team gets a better outcome, while for some models they didn’t observe this collaborative effect when scaling up turns or compute.

Q: In real-world complex workflows, you’d have people with multiple skills and varying levels of expertise, likely multiple humans, not just one. How does the framework account for that? And second, to improve performance on this framework, would model training be done similarly, or would you need to make changes?

I’ll take those separately. On multi-party collaboration beyond one human and one agent: the framework has two major parts. The collaboration acts sit on top of the task environment and are shared, so they still apply when you increase the number of agents. The notification protocol also applies with more than one agent. So the framework can scale to one human with multiple agents. With multiple humans it gets more interesting, because the humans also need to coordinate and chat with each other; that becomes a human-coordination problem, which we haven’t really tested. So the short answer: the framework supports one human with multiple agents; multiple humans raises questions beyond the AI part.

On training: I can share how it’s done for the interaction model I contributed to last year. For the human-in-the-loop part, it’s hard to have a real human sit on the GPU to provide interaction, so people often use a simulated user, but you can still grade the whole rollout. This elicits capabilities like knowing when to ask a question or defer to the human, and it’s doable within today’s frameworks given the right task and a good user simulator. But the harder challenge, making multiple agents coordinate and amplify each other on complex tasks, is still open, and we’re actively working on better training methods beyond using a simulated user.
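
To make “grade the whole rollout” concrete, here is a toy sketch of an agent interacting with a simulated user and receiving a single outcome score at the end. It is an illustration only and does not describe any lab’s training pipeline: `simulated_user`, `asking_agent`, `guessing_agent`, `rollout`, and `grade` are hypothetical names, and the rule-based functions stand in for language models.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)


def simulated_user(agent_text: str) -> str:
    """Stand-in for a user simulator that holds a hidden preference."""
    return "I want a vegetarian restaurant." if "?" in agent_text else "ok"


def asking_agent(history: List[Turn]) -> str:
    """Asks before acting, then uses the answer."""
    if not history:
        return "Any dietary preferences?"
    if "vegetarian" in history[-1][1]:
        return "Booked: vegetarian restaurant."
    return "Booked: steakhouse."


def guessing_agent(history: List[Turn]) -> str:
    """Acts immediately without asking."""
    return "Booked: steakhouse."


def rollout(agent: Callable[[List[Turn]], str],
            user: Callable[[str], str],
            max_turns: int = 2) -> List[Turn]:
    history: List[Turn] = []
    for _ in range(max_turns):
        agent_text = agent(history)
        history.append(("agent", agent_text))
        if agent_text.startswith("Booked"):
            break
        history.append(("user", user(agent_text)))
    return history


def grade(history: List[Turn]) -> float:
    """Outcome score for the whole rollout: did the result match the hidden preference?"""
    return 1.0 if "vegetarian" in history[-1][1] else 0.0


reward_asking = grade(rollout(asking_agent, simulated_user))
reward_guessing = grade(rollout(guessing_agent, simulated_user))
print(reward_asking, reward_guessing)
```

Because the score depends on the final outcome, the agent that asks first is rewarded without any per-turn label saying that asking was the right move.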

Q: (Bruce, Northeastern University) This is exactly the capability I want for Claude Code, when I launch a long task and it goes wrong, I have to wait until it crashes. How do you make the agent aware of when to ask a question during task execution?

Since this work is aimed at serving as an evaluation platform and a base for building agents, we didn’t spend time training the agent to improve it. To test whether collaboration matters, we used a specific prompt that, given the current situation, lets the agent explicitly reason about whether it should communicate with the human, defer, or take a task action. Just having the agent spend more compute on that decision already helped. Moving forward, and this is relevant to the previous question, we need new training methods to improve that decision-making.

But it’s even beyond whether the agent knows when to communicate or defer; it’s also about the human-agent interface. In recent work, we used Co-Gym to compare commercial agents like Codex, Claude Code, and Cowork (another agentic product from Anthropic). Cowork and Claude Code share the same agent scaffolding and the same underlying model, but when we paired them with humans, general-public users did significantly better with Cowork than with Claude Code. The difference mainly comes from the interface: with Cowork you don’t need to use a terminal, which many in the general public aren’t familiar with, and it shows the agent’s progress, like a checklist of what it has done, which helps the human audit the system. So there are two parts: better methods to train models for situational awareness, and better human-agent interface design to keep the human in the loop. Human-in-the-loop isn’t just having a human present, it takes a lot of design.

Q: On the benchmark, was the evaluation process automated? How do you evaluate?

I didn’t go deep since I was asked to talk for about 20 minutes. In our paper, the three environments each come with their own grader, a combination of LLM-as-a-judge and rule-based checks. The framework supports both a simulated setup, where the user is simulated by a language model, and a real setup with a user interface, so you can hire people to pair with the agent.
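
A grader that mixes rule-based checks with an LLM-as-a-judge score could look roughly like the sketch below. The equal weighting, the constraint format, and the names `rule_based_score`, `stub_judge`, and `grade` are assumptions for illustration; the talk does not specify how the paper’s graders combine the two signals, and the judge here is a fixed placeholder rather than a model call.

```python
from typing import Callable, Dict


def rule_based_score(plan: Dict[str, float], limits: Dict[str, float]) -> float:
    """Fraction of hard constraints satisfied (each value must not exceed its limit)."""
    passed = sum(
        1 for key, limit in limits.items()
        if plan.get(key, float("inf")) <= limit
    )
    return passed / len(limits)


def stub_judge(summary: str) -> float:
    """Placeholder for an LLM-as-a-judge call; returns a fixed score in [0, 1]."""
    return 0.5


def grade(plan: Dict[str, float],
          limits: Dict[str, float],
          summary: str,
          judge: Callable[[str], float] = stub_judge,
          rule_weight: float = 0.5) -> float:
    """Weighted mix of the rule-based score and the judge score (weights are assumed)."""
    rules = rule_based_score(plan, limits)
    return rule_weight * rules + (1 - rule_weight) * judge(summary)


limits = {"cost_usd": 1000, "days": 3}
score_ok = grade({"cost_usd": 900, "days": 3}, limits, "Three days in Kyoto.")
score_over_budget = grade({"cost_usd": 1200, "days": 3}, limits, "Three days in Kyoto.")
print(score_ok, score_over_budget)
```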

Q: (Seth) It feels a little weird to frame this as human-agent rather than just collaboration, where the collaborator could also be an AI. Did you try simulating the human with another agent, like an agent overseer managing the agent through the checklist?

That’s actually what we do in the simulated setup: we use a language model to simulate the human, so you don’t need an actual person sitting there. We showed that the error types in the simulated and real setups have a high correlation, which is why, for model training, people use a simulator to do on-policy training that still has some transferability to real humans.

As for whether there were tasks where agent-agent did better than human-agent: there’s an interesting result I didn’t share much. With simulated humans, the team sometimes can’t deliver the task: the two agents have coordination issues, each thinking the other should do something, so within a compute limit they don’t finish. But with real humans (we got over 150 people to join the study), all the sessions we collected finished. So a major discrepancy between simulated and real experiments is that a real human will drive the task to completion even when the agent doesn’t behave well, which won’t happen if you use an agent to simulate the human.

Q: When you have an agent simulating a human, what biases does the agent-agent framework have versus human-agent? What is the human-agent fine-tuning on top of the agent-agent setup?

Starting this year there’s been a lot of investigation into user simulators, partly because the field realized human-agent collaboration is important and many labs are training against a user simulator. One thing people found is that, compared to real users, simulated users are more willing to cooperate with the agent: they’re easier to satisfy. As for comparing against training on real users: it’s very hard to train on-policy against real users. Recent successes, like Cursor training their Tab model, do live rollouts on consumer data, but that’s because the Tab model is simple, so you can do pseudo-on-policy training with real humans. For agentic tasks where each session takes an hour, it’s very hard to train on-policy against real humans. So I don’t personally know of a comparison in the training setup, but for evaluation, people find user simulators are usually easier to satisfy than real humans.