At our latest Snorkel AI Reading Group, Carter Wendelken of Google DeepMind walked us through two related papers he presented at ICLR: Code World Models for General Game Playing and AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness. Both ask the same question from opposite ends: when you want an LLM to act reliably in a complex, possibly novel environment, how much of the world should it have to model itself, and how much should be off-loaded to synthesized code? Code world models try to learn a full simulator of the game and hand it to a planner. AutoHarness keeps the LLM as the strategist and only learns the smallest patch of code needed to keep it from making invalid moves. Carter walks through both — including how to handle partial observability and stochasticity — and where each approach earns its keep.
Transcript
Lightly edited for readability.
Today I’m going to talk about some work we’ve done over the past year at Google with a large group of people, focused on code synthesis for agentic decision-making, and specifically on creating game-playing agents. We have two different approaches that I’m going to talk about: code world models and AutoHarness. A large group worked on this, the people shown here, scattered across lots of different places.
Standard approaches and their limits
Our goal is to create game-playing agents — to use LLMs to create better game-playing agents. The first question is, what are the standard ways you might go about this? What are the most obvious approaches to making a game-playing agent with an LLM?
The first approach would be to use the LLM as a policy directly. You have a game environment that produces some state, some observation, and you ask the LLM what to do. It’s a popular approach, definitely — this gets tried a lot, and it’s our main baseline that we want to compare against. The LLM gives you an action, you play it, and you repeat that loop. So that’s LLM as policy.
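As a rough sketch, that baseline amounts to the loop below; env and llm here are hypothetical stand-ins for a game environment and an LLM client, not the actual interfaces used in the paper.

```python
# Minimal sketch of the LLM-as-policy baseline loop. `env` and `llm` are
# hypothetical stand-ins, not the interfaces used in the paper.
def play_llm_as_policy(env, llm, max_steps=200):
    observation = env.reset()
    for _ in range(max_steps):
        prompt = f"Observation:\n{observation}\nWhat is your next move?"
        action = llm.complete(prompt)                  # ask the LLM for an action
        observation, reward, done = env.step(action)   # play it in the environment
        if done:
            return reward
    return 0.0
```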
Another approach you might take, and this is maybe the next most obvious thing to do with an LLM, is to ask the LLM to generate a policy. So the LLM can write code; it could write code that’s a policy for playing a game. You could train that policy with a bunch of trajectories and do something like behavioral cloning, or you can do it online with an environment and try to optimize the score that you get. Those are good things to do — a possibility — we’ve looked at that, but that’s not the focus here.
There are reasons why you might not want to do either of these. LLM as policy directly can work, for sure — for simpler games, for well-known games that are in distribution. But on complex adversarial multiplayer games it often doesn’t work that well, and games that are out of distribution for the LLM’s training probably won’t work very well either. It can also be really slow at test time: you have to call the LLM every time you want to play a move, and that could be a problem. And if you want to fine-tune the LLM to improve its performance, that can also be very slow and sample-inefficient. So that may not be the ideal thing, although it’s something you might try.
And then code as policy. That could be a good approach, and it also can work really well for a lot of simple environments. But the more complex the environment gets, the more you bring in multiplayer and players that are hard to predict, that becomes less and less effective.
So what’s an alternative? We have a couple different approaches that we looked into here. The first is to synthesize a world model — we call it a code world model. Here you want to use the LLM not to generate a policy, but to generate an entire world model of the game, and then use that with a planner to play. We know that if we have a world model, a game engine, we can hook that up to a planner and play a game. So here we just want to learn the world model. And the reason we want to do this is because we want to be able to play games that may not exist already. So if we have maybe a user interacting with an LLM, and they’re defining the game on the fly, we want the system to be able to learn to play that game. Being able to build a model of that game is one approach you can take there that we think is promising.
As a group, we were interested in this code world model approach more generally beyond games, and decided that games were probably a really good first case to explore this in, because games are well controlled in the way that some other environments are not.
The approach here is: you have your trajectories, and you can train the world model offline with the LLM. So you ask the LLM to generate a model, maybe you refine that, maybe you refine that over and over again until you get what you want. And then online, you can evaluate that by hooking it up to a planner. The paper here is Code World Models for General Game Playing, and that’s the author list.
Our second approach is a synthesized code harness. The idea is that maybe you don’t need to learn a full world model — maybe you can just learn something that supplements the LLM. A harness, generally, is some bit of code that works with the LLM to make it better. I’ll get into the details of what we’re actually learning later, but the basic idea is that you want to learn some code that works with the LLM so it does a better job of whatever you’re trying to do. And here we’re focused on a harness for gameplay.
Part 1: Code World Models
Alright, let me talk about the code world models work.
This has some background. For us it was inspired mainly by the WorldCoder paper, which took a similar approach: generate a code world model hypothesis, then use Thompson sampling for the refinement. That’s basically the approach we followed in the core part of our design.
Some other works I won’t get into detail on. The main difference for our work, expanding on WorldCoder and the other background in the literature, is that we have a much stronger focus on stochasticity and partial observability. I think that’s an important additional bit of work that we have. And then also, we have a focus on multiplayer gameplay and evaluating actual agent performance. A lot of prior work on code world models focuses on the model performance but doesn’t really take those models to the agent and look at the agent performance. So we have that emphasis with our games.
Method overview
This is just an overview of the code world model method. At a high level, we’re taking a natural language game description and game trajectories — in this case, random gameplay trajectories — and then passing that through a code generation, code refinement stage to produce a model. Then we take that model and integrate it with a planner, which here is MCTS, or ISMCTS, and that produces an agent.
Our actual model follows an API that we specify. This API is based on OpenSpiel, which is a large collection of games that we started out using and then added to for this project. We follow their basic API. That includes the core functions: the transition model, also labeled F in some places — the apply_action function, where you take a state and an action and get the next state, your standard transition function — and the observation model, where you take a state and get the player’s observation. Those two pieces are the core of what we’re learning, but there are also some other convenience functions that make it easier to test and work with the whole system, to make it more like a full game-playing environment.
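In sketch form, the interface the synthesized model has to fill in looks roughly like the following, loosely modeled on OpenSpiel’s API; the exact names and signatures in the paper may differ.

```python
# Simplified sketch of the code-world-model interface, loosely modeled on
# OpenSpiel's API; exact names and signatures in the paper may differ.
class CodeWorldModel:
    def new_initial_state(self):
        """Return the initial game state."""
        raise NotImplementedError

    def apply_action(self, state, action):
        """Transition model F: return the next state after taking `action`."""
        raise NotImplementedError

    def observation(self, state, player):
        """Observation model M: what `player` is allowed to see in `state`."""
        raise NotImplementedError

    def legal_actions(self, state, player):
        """Convenience: the actions available to `player` in `state`."""
        raise NotImplementedError

    def is_terminal(self, state):
        """Convenience: whether the game is over."""
        raise NotImplementedError

    def returns(self, state):
        """Convenience: final rewards per player once the game ends."""
        raise NotImplementedError
```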
Feedback and code evolution
We have this API, and when we’re learning, we’re also giving feedback to the system in order to improve it. This is not zero-shot generation — we go through many stages of refinement. The feedback the system gets has two parts. First, a score based on how many unit tests pass: we define a bunch of unit tests that verify the generated code against the trajectory data, and the score is the number of those tests that pass. Second, the failing unit tests themselves. Those two pieces are passed back to the LLM to generate the next round.
Our code evolution follows the REx tree search approach, which is also in WorldCoder, where they use Thompson sampling. We generate evolved code in a tree, so each tree node is a generated piece of code with an associated score. Thompson sampling involves choosing a next tree node to expand — a next piece of code to build on — based on both the score and how frequently it’s already been chosen. You want to pick things that are high-scoring and also not overly frequently expanded. So that’s basically Thompson sampling. Then the LLM’s going to improve based on the score and feedback I already described.
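A minimal sketch of that selection step is below, treating each node’s unit-test pass rate as a Beta-distributed quantity and penalizing nodes that have already been expanded often. The exact priors and bookkeeping in REx may differ, and node.passed, node.total, and node.expansions are illustrative fields, not the paper’s data structure.

```python
import random

def select_node_to_refine(nodes):
    # Thompson sampling over refinement-tree nodes: draw one sample per node
    # from a Beta distribution whose mean tracks the unit-test pass rate and
    # whose spread shrinks as the node gets expanded more often, then pick
    # the node with the highest draw.
    best, best_sample = None, -1.0
    for node in nodes:
        alpha = 1 + node.passed                                   # passing unit tests
        beta = 1 + (node.total - node.passed) + node.expansions   # failures + prior expansions
        sample = random.betavariate(alpha, beta)
        if sample > best_sample:
            best, best_sample = node, sample
    return best
```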
Partial observability and ISMCTS
Okay, so how do we deal with partial observability? Well, the first note is that for planning in the partially observable case, we use a variant of MCTS called ISMCTS (Information Set MCTS). ISMCTS is designed for dealing with hidden state, and what it requires is the estimation of hidden state from observed variables at each step — the ability to take whatever you’ve observed, your subset of observations and actions that you know about, and estimate the state from that.
To satisfy that requirement of ISMCTS, we add an additional function to the API for partially observable games: the resample_history function. It takes a history of observations and known actions and returns a full list of actions, filling in whatever you haven’t observed — other players’ actions. We call this the imputation model I in some of the later slides.
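A hedged sketch of how a determinizing planner like ISMCTS might use that function: sample a complete action history consistent with what the player has observed, replay it through the transition model, and plan from the resulting state. The helper names follow the sketches above and are not the paper’s exact code.

```python
def sample_root_state(model, observed_history, player):
    # Impute the hidden actions (e.g. other players' unseen moves), then
    # replay the full, now-complete history to get a concrete state that
    # this ISMCTS iteration can plan from.
    full_actions = model.resample_history(observed_history, player)
    state = model.new_initial_state()
    for action in full_actions:
        state = model.apply_action(state, action)
    return state
```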
Stochasticity via chance nodes
One note about OpenSpiel and its modeling of stochasticity. This is the model we follow for all our stochastic games. OpenSpiel defines what’s called chance nodes, where all the stochasticity of the game is absorbed by these chance nodes, chance events. Once those are present in the model, everything else is deterministic. A chance event could be shuffling a deck of cards, or rolling a die — rolling a die has six possible outcomes. Once you know the outcome of that chance event, then everything else is deterministic. It simplifies the modeling by being able to isolate the chance into those fixed chance nodes.
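A small sketch of what that buys you: the random step is owned by an explicit chance “player,” and every other transition is deterministic. The current_player and chance_outcomes helpers here are illustrative additions to the interface sketched earlier.

```python
import random

CHANCE_PLAYER = -1  # convention: a pseudo-player that owns all randomness

def step(model, state, choose_action, rng=random):
    if model.current_player(state) == CHANCE_PLAYER:
        # Chance node: e.g. a die roll with six equally likely outcomes.
        outcomes, probs = zip(*model.chance_outcomes(state))
        action = rng.choices(outcomes, weights=probs)[0]
    else:
        # Regular node: a player picks a move.
        action = choose_action(state)
    # Given the chosen outcome or move, the transition itself is deterministic.
    return model.apply_action(state, action)
```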
This is just a model that shows the actions and the states and the observations across different players. I think that’s straightforward.
Evaluating a synthesized model on trajectory data
How do we actually evaluate the synthesized CWM on trajectory data when we’re dealing with the partially observable case? You can think of this kind of like an encoder/decoder. There’s a first step, which is maybe like encoding — this resample_history step — where we take the initial bit of data that we know (the initial state, some of the actions, some of the observations) and we run resample_history, this imputation function. In this graph, the bold things are the things you already know. We know a few things, we impute all the actions.
Once we have that, we can run our transition function, apply_action, iteratively at each step: apply an action to get the next state, apply the next action to that, and by repeating this we get all the states. Then, to check it, we validate the predicted observations against the recorded ones. So this is about taking the model and testing it against trajectory data.
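Put together, checking a synthesized model against one recorded trajectory looks roughly like this sketch; observed_history, player, and observations are illustrative field names, not the paper’s data format.

```python
def validate_on_trajectory(model, trajectory):
    # "Encode": impute a complete action sequence from what the player saw.
    actions = model.resample_history(trajectory.observed_history, trajectory.player)

    # "Decode": replay the actions through the transition model and compare
    # the predicted observations against the ones actually recorded.
    state = model.new_initial_state()
    failures = 0
    for action, recorded_obs in zip(actions, trajectory.observations):
        state = model.apply_action(state, action)
        if model.observation(state, trajectory.player) != recorded_obs:
            failures += 1
    return failures  # 0 means the model reproduces the trajectory exactly
```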
Open deck vs. closed deck training
There’s another consideration here for the partially observable case, and that’s what the LLM actually sees during code evaluation. We have two cases: open deck and closed deck.
The closed deck case is the normal gameplay case: the player sees what they’re supposed to see and nothing else. In the open deck case, we allow the LLM, while it’s trying to build the world model during training, to see the full state — you get to see all the players’ cards. That’s an easier situation, but the idea is that it’s kind of like teaching someone else how to play: while they’re learning the game, you can show them your cards, you show them everything, but once it’s time to play, you hide your cards and play normally. Testing is always closed deck, but we allow these two different ways of training.
What that looks like in practice: for the open deck case, we have this algorithm. This is the algorithm of the evaluation that occurs during training — taking the model functions, which here are written as F, M, and I, and being able to test those against the data. For the open deck case, we go through each of the transitions and we can test: is the transition function correct for that transition? That’s straightforward. Is the observation function correct? That’s straightforward — but those both involve the state. And then we can go through and run the imputation function, the sequence I showed before, and we can also test that. So that’s what we have for the open deck setting.
For the closed deck setting, we don’t have the states, so we have to take out those tests. Basically, those assertions are removed from the algorithm — they’re not part of the score, and not part of the feedback the LLM receives when it’s wrong. But we still have the other part that involves the imputation function. In some ways it’s not as strong a test, and it’s more error-prone because it runs through imputed actions, but it still exercises all the generated code — it is testing everything.
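In pseudocode terms, the difference between the two settings is which assertions survive. This sketch reuses validate_on_trajectory from the earlier sketch, and the trajectory fields are again illustrative.

```python
def unit_tests(model, trajectory, open_deck):
    tests = []
    if open_deck:
        # Full states are visible during training, so the transition function F
        # and observation function M can be checked directly at every step.
        for step, (state, action, next_state) in enumerate(trajectory.transitions):
            tests.append(model.apply_action(state, action) == next_state)
            tests.append(model.observation(next_state, trajectory.player)
                         == trajectory.observations[step])
    # In both settings: impute hidden actions, replay them, and compare predicted
    # observations -- this exercises I, and indirectly F and M as well.
    tests.append(validate_on_trajectory(model, trajectory) == 0)
    return tests  # the score is the fraction that pass; failures become feedback
```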
Results: model accuracy
Okay, so what are our results like? The first set of results looks at just the accuracy of the world model.
We have transition accuracy for the training data. I didn’t mention this before, but when learning the world model we provide it with five different game trajectories — I think of length 50, it might be 100, I forget. Those five trajectories are the only ones it sees during training. For testing, we have 100 new trajectories that it never sees: they don’t go into the scoring or the feedback, but they’re what we run for the test evaluation.
For the perfect information games — we have five of them here, fully observed — it basically learns all of them very well: 100%, or 99-point-something, usually after far fewer than 20 refinement iterations of the LLM. Those are mostly pretty easy games, but some are more complicated. I should also point out that some of these games are part of OpenSpiel — well-known games — and a couple are things we created. Generalized Chess involves completely novel rules that wouldn’t be known to any LLM — it’s very novel in terms of the moves. Generalized Tic-Tac-Toe covers variations of tic-tac-toe with different sizes and lengths, so it’s still well understood but novel. But all of those worked, novel or in distribution alike.
For the imperfect information games, most of the games worked pretty well, but those results weren’t quite as strong. We have one game, Gin Rummy, which was quite complex in terms of the state representation and the game length, and that one was a mixed result. The others mostly learned quite well. Here again, we have three known games and two that are made-up novel games — Hand of War and Quadrant Probe are completely made up, definitely out of distribution for the LLM.
Okay, so I showed you transition accuracy — that’s the accuracy of the transition function. We also have inference accuracy, the accuracy of the imputation part. This is only relevant for the imperfect information games, the partial observability case. In the open deck case, most of the games — well, three out of five, I guess — work well, with Gin Rummy failing completely and Hand of War falling down a bit. And for the closed deck case, which is definitely the hardest case, results get a bit worse. It’s very spread out, from working quite well to somewhat to not at all.
Results: gameplay
Okay, so here are the gameplay results. Now we take the models we learned previously, attach them to MCTS — or later ISMCTS — and run an agent, and then we play that agent against Gemini 2.5 Pro as LLM-as-policy; our models were built with Gemini 2.5 Pro, so we’re competing against the same model as our main opponent. We also compete against MCTS using a ground-truth model — that’s the second set of bars — and against a random agent. The random agent obviously isn’t expected to be very good, but it does have the advantage that it never makes an invalid move, so it can win for that reason sometimes. It can even beat me, yeah.
The results here are pretty solid for the perfect information case — it’s almost always winning. The green bars are the most important: that’s the first column in each game, versus Gemini, the LLM as policy. Green means the code world model is winning, either outright or by forfeit because the LLM makes invalid moves, which happens sometimes. The only case where it’s not a clear win is Tic-Tac-Toe, and that’s because the game is too easy — they always tie. So that’s not surprising; it’s not a very interesting game. These games are mostly pretty easy, except for Backgammon perhaps.
We move on to the imperfect information case. I should note these graphs — some of these look a little different. Some of these games are zero-sum, and the standard win/lose calculation makes sense. Some of them are not zero-sum, and so instead of plotting win/loss we plot the actual reward. In these first graphs, larger blue lines mean the code world model is winning. On the second graph, again, green. For these games, we see results that kind of mirror the model quality you saw before. Three out of five games, we get pretty clear wins. Hand of War — actually, the open deck case was not nearly as good. That was maybe an unusual case. And then Gin Rummy is a problem case across the board.
Additional experiments
We have a few additional experiments. I’ll just mention these and come back to them in questions if we want to dive in more. These are not core parts of the paper, but these are things we did. We also looked at value functions — that’s an obvious thing to do here, and that can improve MCTS. It didn’t make a big difference in a lot of cases here, but it does generally work. We also did code as policy, which works, but the code world model was generally better. And we also did some reinforcement learning using the CWM models, which — like code as policy — worked in the sense that it could beat LLM as policy a lot of the time, but it was also not better than the code world model. I can get back to some detail there in questions if you care about that, but I’ll move on to AutoHarness for now.
Part 2: AutoHarness
I already talked about why we might not want to use LLM as policy or code as policy, but with our work on the code world model we also realized that’s a pretty complicated thing. It’s hard to learn a reliable world model for a complex environment. It’s really nice when you have a world model, but it may not be the easiest thing to learn. And if you’re depending on MCTS, that can be slow, which isn’t ideal. It’s maybe debatable whether it’s better to depend on an algorithm like MCTS at test time or just to depend on the LLM, but it would be nice if you could create an agent that doesn’t need the planner.
And really, we had this observation that when we’re competing against the LLMs, a really common failure mode is that they fail because they make invalid moves. So we’re thinking: what’s the simplest thing we can do to correct the LLM so it plays better? Instead of going to all the complexity of learning a model, the idea is just to learn a harness that can fix the problems that are present — in this case, correcting for illegal moves.
We still want to take advantage of the LLM as a base policy, we want to take advantage of the LLM where we can, but we want to fix its problems. We could write a harness, we could manually write code, or do some zero-shot code generation and try to create a harness that we want. But that might not get us where we need to be, because it’s not getting feedback from actual experience. What we really want to do is evolve a code harness, similar to how we evolved the world model. So that’s the approach here — to iteratively evolve the harness, just as we iteratively evolve the world model. And then once we have that harness, we can use it with the LLM to play the game.
Code-as-Verifier algorithm
This is the algorithm for what we call Code-as-Verifier — this is the AutoHarness. The policy here is to repeatedly prompt the LLM to propose an action for the game, given the current game state. Once we have a proposed action, we run that action through our harness, through our legal-action checker that we’re going to learn. If it’s valid, then we play that action. If not, we go back to the LLM. So if the harness says it’s invalid, we give that feedback back to the LLM and say, this wasn’t valid — and then try again. The hope is that, after a few tries, the LLM will get consistently correct answers if you learn a good harness solution.
We can think of this harness as a partial world model, insofar as the full world model is predicting the next state, but this partial world model is just predicting a validity bit. Given state and action, is it valid? Rather than given state and action, what’s the next state?
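A minimal sketch of that loop is below; llm is a placeholder client, and is_legal_action stands for the synthesized harness.

```python
def choose_action(llm, state, is_legal_action, max_tries=5):
    feedback = ""
    for _ in range(max_tries):
        prompt = f"State:\n{state}\n{feedback}Propose your next move."
        proposed = llm.complete(prompt)
        if is_legal_action(state, proposed):   # the evolved harness checks validity
            return proposed                    # legal: play it
        # Invalid: tell the LLM what went wrong and ask again.
        feedback = f"Your last move '{proposed}' was illegal. Try again.\n"
    return None  # still illegal after several tries; the caller decides what to do
```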
We’re doing code evolution that’s similar to — basically the same as — what we did before. The difference is that we’re learning a different API, different functions. We’re going to evolve two functions: propose_action and is_legal_action. So it’s learning both things; in some cases, when it’s evolving, it’s effectively learning a full policy.
Once we have these, we can actually create several different kinds of agents. We have Code-as-Action-Verifier, where is_legal_action is the harness and we let the LLM do the action proposing. That’s our main method here that we’re presenting. But this learning also allows for some other things. It allows for what we call Code-as-Action-Filter, where you have code that can propose some actions, and then you ask the LLM to choose which one of those is best. That’s a totally valid way to take the output of the learning here and create a different kind of agent — not really well explored yet here, but you can do that. And then there’s also Code-as-Policy, where if you take both of these together as code and ignore the LLM, you basically have a code policy. So this is a learning method that allows you several different ways of working with the outputs. But our focus is on the first.
TextArena results
For this work, we looked at TextArena rather than the OpenSpiel games we had before. These have some advantages — they’re really designed for LLMs, a little bit easier to work with in a larger collection for this purpose. These are some examples I won’t go through, but they were fairly diverse.
Let’s see what the observations look like. We took the subset of TextArena games where Gemini 2.5 Flash was producing illegal actions — if it had less than 99% legal-action accuracy, then we considered it. 99% may sound pretty good, and even 90% doesn’t sound too bad, but if only 90% of your moves are legal, then most games you’re going to lose because of illegal actions: at 90% per move, the chance of getting through even a 20-move game without an illegal move is about 0.9^20 ≈ 12%. The legality rate needs to be really high for this to work. So these are all games where the LLM is probably going to lose to another system just because it makes illegal actions.
The numbers on the right show what we get once we’ve trained up a harness for these games. Basically all of these games, we get to 100%. So this works quite well. And then these results show the comparison between AutoHarness — Gemini 2.5 Flash with the learned harness — versus Gemini 2.5 Pro, just showing that we are, in most cases, beating the larger model by including the harness.
My last slide here just shows a demo of AutoHarness in action on a chess game. You can see the generated code on the right, and whenever the LLM proposes an invalid move, you get an X on the screen, and then it goes back and refines — presents the feedback and refines the code.
And that’s all I have, so thank you.
Q&A
Q: I was wondering about that term you used at the beginning of the slides — non-stationarity. How have you modeled non-stationarity across different types of world models, and what’s your intuition on how non-stationarity relates to whether you should use a code world model or AutoHarness?
I don’t think we specifically modeled non-stationarity here. That was just an observation, but it wasn’t really a consideration in the design of the model.
Follow-up: To elaborate a bit — the question is that certain worlds change their rules more often than others. A chessboard is very deterministic and strict; what’s a better approach for non-stationary worlds?
We didn’t really focus on that. I think it’s a fascinating topic. I actually thought about building a game that had that property — a chess-like game where the rules actually changed, or you had to figure them out. We didn’t do that, though. I’d love to see what world-model learning does in that case.
That said, I think having a world model is still a better approach than learning a policy when you have non-stationarity. You have the possibility of exploring in the current world — you can have a model that models what you know about the world and gives you some rules for how to deal with uncertainty. Whereas with a policy, you’re kind of fixed into making some assumptions, and it’d be very hard to adapt. I think a world model — modeling what you do know — gives you much more adaptability to non-stationarity than the other approaches. But I have no actual results that would speak to that.
Q: What is your vision from a commercialization standpoint? What would this look like?
I’m not really involved in that side of things, but the commercialization goal is: it’d be nice for the models to be really good at playing games. We’d like to have a system that you can sit down with and say, “I have a new game in mind. This is how you play it. Can you play this with me?” Maybe we play a few turns, and then I say, “Well, that was wrong, I didn’t mean that.” You give it some natural language description, you give it some trajectory feedback as you play, and from that it can build the model, build the game for you. That’s the sort of commercial tool I’d imagine coming out of this kind of work. We’re not building products at all, but that was the idea there.
Q: I was wondering if it had implications, especially around next-generation developer tools, where a lot of this could be applied — not in a strict gaming context, but more in an enterprise context.
I think the approach of evolving code — of course, we know now we’re letting the LLMs write code all the time, and that works really well — I mean, it works really well now, but I think there are definitely situations where bringing in code evolution with some kind of feedback is still likely to work better. In some ways, my coding process right now is automating this kind of loop, where I’m letting the model generate code, and it’s pretty good but not quite right, so I give it some feedback and say, “No, I wanted this,” or “This wasn’t good,” and maybe I have a metric I can say “improved” — but I’m doing that part of the loop. Automating that, where I can just say, “I want this kind of code artifact, these are my metrics — now go run, evolve, iterate, use tree search, or whatever” — that seems like a good next step for a coding product.
Q: I’m curious about the AutoHarness paper. Which game did you see the most improvement on?
The ones that start at the lowest point — Stratego in this list, probably Checkers and Chess, and maybe Santorini. But that’s more about where the LLM started out than about AutoHarness, really. I think the headline here is that it worked on all the games; some games were just worse than others to start out with in terms of the LLM.
Q: Empirically speaking, do you see larger models producing better CWMs or AutoHarnesses? Or models that are RL’d on harnesses producing better — is this emergent behavior at a certain scale of the foundational LLM training?
For the code world models, model size definitely made a difference. We tried that out with smaller models, including Gemma, older Gemma, Flash, and Pro, and each step up in model produced better results. Noticeably better results for the core model quality and ultimate agent performance. For AutoHarness, that was run with Flash and Pro. It worked well in both cases. You maybe see a bigger difference with Flash, because the model is not as good unharnessed.
For RL’ing a smaller model on harnesses — we didn’t try that. So it’s hard to really compare what would work better, whether a fine-tuned smaller model or a larger model. I don’t know.
Q: In MCTS, you select a node based on some score function and visitation frequency. Can you elaborate on what the score function is? For a lot of games, you have intermediate metrics — production scores, military scores — not just whether you won. What are your thoughts on a multi-dimensional scoring system?
In the planner, in the MCTS — normally the scoring is just rolling out to the actual end of the game, then backing up. We can also plug in a value function, and then you estimate the score more immediately. Those are two options. But the scoring is just the scoring from the game.
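As a sketch, those two options amount to swapping out the leaf evaluation inside MCTS; the helper names follow the earlier sketches and are illustrative.

```python
import random

def evaluate_leaf(model, state, player, value_fn=None, rng=random):
    if value_fn is not None:
        # Option 2: estimate the score immediately with a learned value function.
        return value_fn(state, player)
    # Option 1: random rollout to the actual end of the game.
    while not model.is_terminal(state):
        mover = model.current_player(state)
        state = model.apply_action(state, rng.choice(model.legal_actions(state, mover)))
    return model.returns(state)[player]  # this value gets backed up the tree
```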
On extending MCTS to a multi-dimensional scoring system, I have not thought about that at all in this context. I would guess that’s probably been done in the literature, but we definitely didn’t look into that here, because we had just single-dimensional scores.
Q: You have two fundamentally different types of computation here — one is this model, the other is code execution. Have you thought, theoretically, about models of computation — when should you just let the language model go, versus when is code execution good for verification in a way the other type of compute isn’t? Is that an interesting direction to understand how a harness should be built to combine these types of compute?
I think we’re first just focused on the practical questions of when each works, and I think that looks more like the different kinds of games and the complexity — what is the state space, what are the rollouts like, is it partially observable. All those dimensions would affect that choice.
Comparing code world model versus AutoHarness: a code world model is definitely a more complex thing to learn, probably going to be harder to learn. But once you have it, it gives you more — so it also depends on your use case, what you want to end up with as an output. If your goal is just to be able to play games, then AutoHarness — learning quickly to do that — seems to solve the problem pretty well. But it doesn’t give you as much of a useful artifact as a model. Theoretically, that’s probably a harder question. I’d have to think about what the theoretical dimensions are that really inform that choice. I don’t think I have further insight right now.
Q: How is this different from the major applied-industry approaches like context engineering? In terms of making a harness or a code world model, what works? Is it because you go to learn this representation? Or is it because it’s a new paradigm that actually restricts the possible states for a specific problem?
Let me jump in there — I’m not quite sure I got the whole question. For the first part: how does this compare to context engineering? I think this is a kind of context engineering. But the main comparison with what I think you mean is that this is context engineering that involves some iterative process with feedback, where you define some metrics. I think that’s a really useful thing to include, and gives you some additional benefit that you don’t have without that. But I think it’s a kind of context engineering.
Follow-up: Yeah — I think the question was: this is a special kind of context engineering that’s not really just accumulating a piece of text in the context that helps the model reason or act, but more actively generating states and transitions that restrict the model’s prediction or generation to some degree.
Maybe I have a tangent there, but it might be relevant. A lot of context engineering involves giving it some additional text and maybe some natural language feedback. And here, what we have is, in some sense, code, or feedback about code. So maybe it’s a little bit different kind of context in this case, but I think it’s still in the same category of thing.
Q: If I’m understanding correctly, the primary goal is to be pretty accurate in terms of playing the game — playing it correctly and well. If you were to build on this and now refine it so that the model makes all the correct choices and doesn’t make any illegal moves — how would you build on it to make the best choice?
So you’re thinking about the harness in particular? The design of AutoHarness is really relying on the LLM to make the best choice, to use the LLM as the strategizer. The harness is only checking validity, legality. The harness has a very limited role here, and we’re trying to take advantage of the LLM’s ability to reason about the game, to the extent that it’s good at that. There is an alternative way of using the pieces — we can let the code propose actions and let the LLM choose, assuming maybe the code is coming up with ideas and the LLM is picking. That’s not as obviously useful, but that’s another way we could build a system.
Q: The examples you showed — tic-tac-toe and chess — are very deterministic. If you were to do the same thing for games that are more player-dependent — World of Warcraft, League of Legends — where you can’t predict every decision a player makes, will they go left, right, up or down?
Half of the games we showed here are partially observable and more along the lines of what you’re describing. The games we have here don’t get super complex — there are a lot of much more complex games out there that we haven’t looked at. We actually did start looking at Dominion, a few more complicated games. But those were harder to deal with — all the complexities — so I don’t have those in the paper. There’s definitely a world of more interesting games out there that we haven’t explored here. Building on the tools we developed to deal with partial observability and stochasticity would be very critical for a lot of those games.
Also, a lot more games involve more natural language than the games we have. Our games are all symbolic — the representation of game states are dictionaries. It’s text, but it’s all symbolic. Once we start looking at TextArena, it’s more textual descriptions, but then there are a lot of games out there that involve a lot more free-form text, and we’re not dealing with those right now. That’s also a really interesting case to tackle at some point, but it’s a more complicated, more difficult thing for these setups.
Q: How would we go about applying this to other problems? What do you think the constraints would be for applying this to real-world problems?
Games are definitely friendly for this approach, in that we have or can create ground-truth environments and generate data from them. That gives us a lot of capacity to do the actual experimentation, which is much harder when we don’t have that. That’s kind of a direction I’m heading now — thinking about applying these in other cases. Our initial motivation was that we wanted to think about code world models in general and where they can apply. Games are the obvious easy first case to develop the methodology, but not really the end goal. You might think about scientific datasets — that’s a direction I’m thinking about taking this, but I don’t really have anything there yet.
Q: Is there a situation where, even after feedback, the model can produce multiple times wrong moves? Is there a threshold for how many retries the model can get, and what happens if after a number of retries it’s still wrong?
A lot of these games are set up, when playing with an LLM, to allow one or two retries. In our version, we didn’t allow retries. We wanted fast failure. That applies to both code world models and the AutoHarness work — we turned off retries in all the games.
Q: Did you have a way of scoring the complexity of action spaces, to help identify when this could be used or not?
Not formally. We definitely looked at the action spaces and took that into account — picked things that started out simple and got more and more complex — but always aware of action space, and even more importantly the state space, because the state space for some things would get very complex. We didn’t have any formal scoring of that, or results that really compare based on that. But it definitely matters — this gets harder and harder as those things get more complex.
Q: How long, on average, did it take to develop these code world models?
I think I have that, but I didn’t really talk about it. For the perfect information games, we’re going up to about 20 iterations — 20 LLM calls. For the imperfect information case, the open deck is similar. For the closed deck, we actually had a lot more — went up to 200 to get the results we had. That was a much harder case. We ran those for much longer to get to somewhat comparable results, and it still wasn’t as good.
Q: You mentioned a lot of these games can be represented in symbolic space. But take something like AutoHarness — is there a way to simplify that specific harness or technique into natural language so it can become like a coach?
The code world models work used games that all had symbolic state representations. For AutoHarness, we’re using TextArena, and there that actually wasn’t the case — we didn’t really have symbolic state representations; it was just using the text observations.
Q: I’m interested in more real applications, in medical, for example. In terms of harnessing, do you have any parameters to tune to change the weight on the harness? For example, in medical applications, we can’t make any mistakes — is there any way to tune that kind of parameter for this approach?
It would be a very different harness. The specifics in terms of the choices we make here probably wouldn’t apply. But the general approach, I think, would. If you have good metrics on which you want to train, and you have an idea of what you need to harness — take an LLM and figure out how it’s failing — then in your medical application, the first question I would ask is: how is the LLM failing? What’s the number-one failure mode for the LLM? If you can figure that out, can you have a metric related to that? And then can you ask the LLM to generate code that fixes that? Make the harness be really simple — something that you think should be pretty easy to learn — and then have a metric from your application that enforces that. Follow that structure, and you have a good chance of getting something that’s going to help.
Follow-up: So this is useful for non-stationary environments, because the environment can change and the agent or model can flexibly learn to adjust to that environment?
That’s right. If you can define a way to correct the harness, then you can do this.
Q: For the feedback the model gets — is it only valid/invalid moves, win/loss? Or is there also some generative text feedback as it explores the tree?
For the code world model, it’s getting the accuracy in terms of the percentage of transitions, the unit tests that passed. It has a list of unit tests based on comparing the functions to the trajectory data. The percentage of unit tests that pass is the main score feedback, but each failed unit test — maybe up to five failed unit tests — would also be provided as feedback. Those two pieces, all coming from the unit tests that relate the generated code to the data.
For AutoHarness, the feedback is whether or not the proposed action is good, and whether the checker is correct in that particular instance. I’m trying to think if there’s any text feedback there — I don’t remember right now. I think it’s basically just the score: no real unit-test feedback, no meaningful text feedback. There might be something minor, but yeah.