At our latest Snorkel AI Reading Group, Mayee Chen (Stanford, Hazy Research) stopped by our San Francisco office to walk us through Olmix: A Framework for Data Mixing Throughout LM Development — work she contributed to during her internship at Ai2 on OLMo 3. Olmix tackles one of the messiest, least-documented levers in LLM pre-training: how to set the ratios across data domains, and how to keep those ratios good as the underlying datasets keep changing throughout development. The talk covers a practitioner’s handbook of design choices for data mixing methods, plus a new mixture reuse approach that recomputes mixes far more cheaply when data evolves.
Transcript
Lightly edited for readability.
So a brief overview of what data mixing is. In mixing, you have a lot of different pre-training domains — things like web, math, and code — and your goal is to figure out how to combine training data from all of these so that your model exhibits a broad variety of capabilities.
Generally, I want to emphasize that mixing is kind of everywhere if you’re trying to develop data to train a model. You could mix on different languages, you can mix at different stages of training — so not just pre-training. Whenever you have more than one dataset you’re working with, you’ll need to decide how to actually compose these things.
We just said mixing is everywhere; now I’ll say mixing is quite important. As a teaser of our results, here’s a figure that shows that training on a good mix can result in a 12% improvement in downstream task performance, and it allows us to arrive at the same performance three times faster than training on a more naive baseline mix. So this procedure — or composition more broadly — can really make a difference.
Currently the naive approach is brute-force search or tuning the ratios manually. This requires a bunch of training trial and error, which is quite expensive. There are also many recent mixing methods that aim to learn these mixes more efficiently.
In this talk, rather than talking about a new mixing method, I’m actually going to focus on issues that arise in mixing during large-scale model development. Let’s take this timeline. You’re here on the left, maybe a few months or weeks out from the day you want to start training your large model. You have your domains and you want to figure out the mixture on them.
The first question is: how do I get a good mix on these domains that I start out with? It’s not super clear what to do. Should you run some existing method out of the box? Should you build your own method? That was basically the first challenge we faced at the start of building our LLM, OLMo 3.
Now let’s suppose we figured out our initial mix. The second challenge is that data is constantly changing during model development — that’s just how reality is. People are always iterating on and building new datasets, or refining some other datasets. The question becomes: how do you maintain a good mixture throughout development?
As an overview of this talk, I’ll be going over our contributions in two parts. The first is sort of a practitioner’s handbook to data mixing — we did a comprehensive study of all the design choices you need to consider when constructing these sorts of methods. The second part is a method we introduce for using information from existing data mixes to efficiently recompute ratios as data changes throughout development.
Part 1: A practitioner’s handbook
Getting into the first part, I'll give some notation for the formal problem. We start with M training domains, D_1 through D_M. Think of these as your math and your code and your science and your web data. We want to train a model on these datasets. The data mix P is represented as a probability vector over the M domains. The final mixed training dataset S_{R,P} is created by sampling R · P_i tokens from each domain D_i, where R is the total token budget. We'll then denote LM(S_{R,P}) as a target model trained according to this data mix P, and we'll evaluate this trained model on N tasks.
Here we're going to measure what's called bits per byte, also known as BPB. This is essentially a loss normalized by the number of bytes, so lower is better. The goal of data mixing is to efficiently find the mix P* that minimizes the average BPB over all of the downstream tasks.
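As a rough sketch of this objective in code (a minimal illustration, not the paper's implementation; `train_model` and `eval_bpb` are hypothetical stand-ins for training and evaluation):

```python
import numpy as np

def mixed_dataset(domains, p, R, seed=0):
    """Build the mixed training set S_{R,P}: sample R * p[i] tokens from domain D_i.
    `domains` is a list of 1-D token arrays; repetition details are elided here."""
    rng = np.random.default_rng(seed)
    counts = np.round(R * np.asarray(p)).astype(int)
    return [rng.choice(d, size=c, replace=True) for d, c in zip(domains, counts)]

def average_bpb(p, domains, R, tasks):
    """Average bits-per-byte over the N downstream tasks (lower is better).
    Data mixing looks for the mix P* that minimizes this quantity."""
    model = train_model(mixed_dataset(domains, p, R))           # hypothetical trainer
    return np.mean([eval_bpb(model, task) for task in tasks])   # hypothetical BPB evaluator
```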
There are many mixing methods that aim to solve this problem, and a lot of them actually follow this general template, which we call the offline mixing schema. The first step is that we train a lot of smaller models — we call these proxy models — on different mixes, and then evaluate each of them on the tasks we care about. We refer to these as a swarm of proxy models.
Next, using this swarm, we fit a regression model that learns the relationship between the mixing ratios and the performance. This regression model — this f̂(P) — is going to allow us to predict the performance of any data mix without training.
Finally, we propose a mix by solving an optimization problem that uses the regression model as a surrogate for true downstream performance. We essentially find the mix that has the best predicted performance. So these are the three steps of the offline schema. The question now is: how do we instantiate a good method that follows this general procedure?
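Putting the three steps together, a minimal sketch of the offline schema might look like this (assuming a hypothetical proxy trainer `train_proxy`, the `eval_bpb` stand-in from above, and a `fit_regressor` that returns a callable predictor, such as the log-linear fit sketched later):

```python
import numpy as np
from scipy.optimize import minimize

def offline_mixing(candidate_mixes, domains, tasks, fit_regressor):
    # Step 1: train a swarm of small proxy models, one per candidate mix,
    # and record the average BPB each achieves on the downstream tasks.
    scores = []
    for p in candidate_mixes:
        proxy = train_proxy(domains, p)                          # hypothetical small-scale trainer
        scores.append(np.mean([eval_bpb(proxy, t) for t in tasks]))

    # Step 2: fit a regression model f_hat(p) that predicts performance from the mix.
    f_hat = fit_regressor(np.asarray(candidate_mixes), np.asarray(scores))

    # Step 3: propose the mix with the best predicted performance over the simplex.
    M = len(domains)
    result = minimize(
        f_hat,
        x0=np.full(M, 1.0 / M),
        bounds=[(0.0, 1.0)] * M,
        constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],
        method="SLSQP",
    )
    return result.x
```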
By examining this schema, we came up with a list of questions that a practitioner needs to go through in order to instantiate a mixing method. For instance, in constructing the swarm, you need to determine how big these proxy models are, and how many of them you need for the regression fitting. The obvious question is what regression model you should use. And finally, for the optimization stage, we need to figure out how to handle the fact that data is finite and we don’t want to excessively repeat samples. We have a bunch of other design choices in the paper that I don’t really have time to talk about today.
Before we dive into how we addressed all of those questions, I wanted to show you this pretty big table of how all of the existing literature answers them. Each row is one of these questions, and each column is an existing paper. I want to highlight two observations. First, a pink cell means that we could find justification for that configuration. Unfortunately, there are a lot of white cells, which shows that many design choices really lack guidance beyond the specific instantiation. The second thing to notice is that even the rows with a lot of pink cells lack consensus. For example, each existing method proposes a different regression model and reports a strong regression fit to support it. This conflicting guidance makes it really hard for practitioners to determine what they should actually use when adapting these papers to their setting.
To study each choice, our setup is that we train 1-billion-parameter models from scratch on 100 billion tokens, and we experiment on DCLM-Baseline, a web corpus that we partitioned into 24 topic-based domains — things like literature, politics, and science. We measure BPB on a suite of 52 tasks covering math, code, and question answering.
Proxy model size
The first question is proxy model size: how small can the proxy models you train be while still providing good signal at the 1-billion-parameter target scale? To study this, we took a bunch of mixture ratios, and for each ratio we trained both a proxy model and a target model on it. We then ranked the mixtures by performance at both scales and measured the rank correlation. The idea is to see whether data mixes that do best on small models also do best on large models. We repeated this experiment for many different proxy model sizes.
Here are our results. On the x-axis is the size of the proxy models — we tried 1 million, 15 million, 30 million, and 60 million parameters. On the y-axis is the rank correlation with our 1-billion-parameter target model. We can see that everything at 15 million parameters and above has pretty good rank correlation. Our recommendation from this experiment was that, to provide good signal at least at the 1-billion-parameter target scale, proxy models should be at least around 15 million parameters.
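The check itself is simple; here is a sketch, assuming Spearman rank correlation, where `bpb_proxy[i]` and `bpb_target[i]` are the average BPB of mix i at the proxy and target scales:

```python
from scipy.stats import spearmanr

def mix_rank_correlation(bpb_proxy, bpb_target):
    """High correlation means mixes that rank well on small proxy models
    also rank well on the 1B-parameter target model."""
    rho, _ = spearmanr(bpb_proxy, bpb_target)
    return rho
```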
Swarm size
The next question is how big the swarm should be. Concretely, if we have M domains, how many proxy models do we need as a function of M? Existing works give a bunch of different recommendations, and most don’t provide a rule of thumb. They just say, okay, for 21 domains we use 512 runs; another one says for seven domains we use 20 runs. There’s no clear formula. For our experiment, we constructed various swarms by sweeping over M (the number of domains) and K (the number of proxy runs), and we measured the performance of these resulting mixes.
On the x-axis is the swarm size, where we’ve scaled everything to be not the absolute size but just the size relative to the number of domains. On the y-axis is a notion of error — how close we are to the best mix we found. The thing to notice is that across all of these M’s from 6 to 24 domains, the curves all look generally the same. This indicates that the error only depends on the swarm size relative to the number of domains. You just need roughly O(M) proxy models to learn a good mix over M domains.
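The talk doesn't specify how the swarm mixes themselves are drawn; one plausible sketch is to sample K mixes, a small multiple of M, from a Dirichlet distribution over the simplex:

```python
import numpy as np

def sample_swarm(num_domains, runs_per_domain=3, alpha=1.0, seed=0):
    """Draw O(M) candidate mixes over the simplex. The Dirichlet sampling and
    the constant `runs_per_domain` are illustrative assumptions, not the exact
    procedure from the paper."""
    rng = np.random.default_rng(seed)
    K = runs_per_domain * num_domains
    return rng.dirichlet(np.full(num_domains, alpha), size=K)
```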
Regression model
The next question is what regression model we should use to capture the relationship between the data mix and the downstream performance. To study this, we take a swarm of proxy models and try five different regression models on it. We measure how good the regression fit is, and also the final performance. We consider a bunch of approaches that have already been used in other works, ranging from gradient boosting to Gaussian processes to some parametric functions.
First, we measured how good the regression fit is for each of these models, and how this depends on the swarm size. This actually reveals something pretty interesting: different models excel at different swarm sizes, which potentially explains why there's a lack of consensus in the literature on what model to use. For instance, BiMix in dark green is quite good for small swarms, while LightGBM in pink needs a lot of proxy runs. In deciding what regression model we wanted to use, we noticed that the log-linear models in purple outperform other models quite consistently once we have around 75 runs — basically at least 3× the number of domains. This also held up in downstream performance: the log-linear model achieves the best average BPB on downstream tasks. Our takeaway was to use the log-linear regression model moving forward; the exact form of the equation is in the paper.
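As an illustration only, not necessarily the exact parameterization Olmix uses, one simple "log-linear" form predicts the log of a task's BPB as a linear function of the mix weights and can be fit with ordinary least squares. One such regressor is fit per task, and the optimization step then averages the per-task predictions:

```python
import numpy as np

def fit_log_linear(mixes, bpb):
    """Fit log(BPB) ~ b0 + w . p by ordinary least squares and return the
    predictor f_hat(p). Illustrative form, not necessarily Olmix's exact one."""
    X = np.hstack([np.ones((len(mixes), 1)), np.asarray(mixes)])
    coef, *_ = np.linalg.lstsq(X, np.log(np.asarray(bpb)), rcond=None)
    b0, w = coef[0], coef[1:]
    return lambda p: float(np.exp(b0 + np.asarray(p) @ w))
```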
Data-constrained settings
The last design choice we'll look at in this section is how to handle data-constrained settings. In the chart here, we're considering an example with three domains: general web data, math, and code. The y-axis shows, in trillions of tokens, both how big each domain is and how many tokens we're trying to sample from it. Moving from left to right: for web data we request fewer tokens than are available, so everything is fine there. For math, in the middle, we request about two and a half times as much data as is available, and the literature generally says you can repeat that much without degradation. Finally, for code on the right, we request eight times as much data as is available, which prior work has shown to be problematic. Typically, repeating data more than about four to five times causes degradation, so it's important to produce mixes that avoid this excessive repetition.
The question is how we enforce repetition caps in the data mixing problem. Formally, the repetition constraint says we never sample more than K times from any domain: for every domain j, R · P_j ≤ K · N_j, where R is the total token budget and N_j is the size of domain j. We just add this constraint to the optimization problem, which is quite simple because it's linear in P. In our experiment, we look at how this constraint impacts the proposed mixes.
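In the earlier optimization sketch, this constraint just tightens the per-domain upper bounds, since R · p_j ≤ K · N_j is equivalent to p_j ≤ K · N_j / R:

```python
def repetition_bounds(total_tokens, domain_sizes, K=5):
    """Per-domain bounds encoding the repetition cap R * p_j <= K * N_j,
    usable as the `bounds` argument in the optimization sketch above.
    K is the maximum number of times any domain may be repeated."""
    return [(0.0, min(1.0, K * n_j / total_tokens)) for n_j in domain_sizes]
```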
Our results show that the proposed mix depends heavily on the repetition factor. We vary K from two up to infinity, which is basically unconstrained optimization. As we do that, the proposed mix can change a lot. For instance, software development goes from under 10% all the way up to around 35%. So this is a very important piece of the mixing method that has historically been overlooked.
Olmix-Base
After we studied all of these design choices, we used our findings to define a base mixing method, which we call Olmix-Base — our recommended default configuration for the offline mixing schema that a lot of these approaches follow. Recall the first step, constructing the swarm: we ended up using 30-million-parameter proxy models — anything at or above 15 million was fine, as we saw — and O(M) proxy runs. Next, in the regression step, we fit one log-linear model for each task. Finally, in the optimization step, we optimize the average predicted performance and add a constraint to ensure that the proposed mix does not excessively repeat samples.
Part 2: Mixing under evolving data
I’ll move on to the second part of the talk: mixing under evolving data. This is a pretty novel problem setting, motivated by what we saw in practice when we were actually trying to develop data for a model. Consider the following scenario: you have a pretty good mix on your datasets and you spent a lot of time figuring out these ratios — but then the data changes. Someone on your team finishes preparing a new dataset that is really good. Snorkel gives you a new dataset. You realize that you should more aggressively filter one of your existing domains. The reality is that these sorts of changes are always happening — data is always getting better. So what do you do with this mix that you’ve prepared on your old datasets?
To further add evidence that this is a real phenomenon in model development, we can look at the pre-training datasets for OLMo 2 versus OLMo 3. You can see that over the course of a year or so, there’s been quite some changes to the pre-training corpus. Some domains are the same, like arXiv and Wikipedia. But some domains are gone, some are added, and some are revised. For instance, the code source changes from StarCoder to Stack-Edu, and the math source changes from OpenWebMath to FineMath.
Now I’ll formally describe this problem of mixing as data evolves. We start with a domain set D of M domains, and this is updated into D′, which consists of M′ domains. Based on our experience, we found four major ways this update happens. We have addition and removal of datasets — these are pretty self-explanatory. We also have partition, where we take a dataset and we’re able to annotate it such that we now have finer-grained categories. Lastly, we have revisions, which can be filtering, rewriting, and so on.
Our goal in this setting is that we have an existing mix P̃ on the old set of domains D, and we want to figure out the optimal mix Q* over the new domains D′.
Before I jump into our method, a simple baseline I want to draw attention to is full recomputation: we apply a mixing method — it could be Olmix-Base — directly to D′ every time the domain set is updated. The problem is that it becomes expensive very quickly. If your data is constantly changing throughout development, this means you’re constructing a swarm of size O(M′) every time an update happens. Instead, we’re going to see how to use the existing mix P̃ so that we don’t have to relearn everything each time.
Mixture reuse
To explain our method, mixture reuse, let's look at this example of partitioning. You have four domains, and the fourth one gets partitioned into two subdomains, D4′ and D5′. Let's say we start out with science, politics, literature, and code. Then code gets split into Python versus non-Python. We'll have some existing P̃ ratios over the initial four domains. Once we normalize them over the first three domains, we'll notice that these normalized ratios are not really impacted by the partition at all.
The core bet we’re going to make is that these normalized ratios are still going to be good to use after the code is partitioned. The fact that we now have to also mix over Python versus non-Python code is not going to change the relative ratios of science and politics and so on. This assumption motivates us to reuse these normalized ratios throughout development.
To get into the method: we take this D′ and split it into two subsets. There’s Dfix, the domains that weren’t affected by the update, and Dcomp, the domains affected by the update (whose ratios we’ll recompute). The first thing is that we freeze the ratios within Dfix, basically treating Dfix as one giant virtual domain — one giant composite dataset.
Rather than recomputing the mix over everything, we’re only going to recompute over what we call the collapsed mix. This is the single Dfix virtual domain plus the affected domains. In this example, rather than learning a mix on all five domains, we’re just going to learn on three. That’s a much simpler and cheaper procedure. Finally, once we’ve learned the mix on three domains, we use the frozen ratios to expand the learned mix back to D′.
A quick example: recall these are the normalized ratios of P̃ within Dfix. Suppose this second step of recomputation gives us a mix of 0.6 on the entire Dfix virtual domain and 0.25 and 0.15 on the others. Then we construct the full proposed mix by multiplying 0.6 by the normalized ratios above. You can think of this procedure as applying a mixing method in a lower-dimensional slice of the mixture space. The main thing this allows us to achieve is significantly less recomputation.
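A sketch of the procedure is below. The index sets and `recompute_mix`, which stands in for running a mixing method such as Olmix-Base on the collapsed, lower-dimensional problem, are hypothetical:

```python
import numpy as np

def mixture_reuse(p_old, fix_idx, num_affected, recompute_mix):
    """Freeze the normalized ratios of the unaffected domains D_fix, recompute a
    mix over the collapsed problem (the D_fix virtual domain plus the affected
    domains), then expand back to the full new domain set D'."""
    p_old = np.asarray(p_old, dtype=float)
    within_fix = p_old[fix_idx] / p_old[fix_idx].sum()   # frozen, normalized ratios

    # Collapsed mix: one weight for the D_fix virtual domain plus one per
    # affected domain, e.g. [0.6, 0.25, 0.15] in the example above.
    collapsed = np.asarray(recompute_mix(num_domains=1 + num_affected))

    q_fix = collapsed[0] * within_fix     # 0.6 times the frozen normalized ratios
    q_affected = collapsed[1:]
    return np.concatenate([q_fix, q_affected])
```

In the partitioning example, `fix_idx` would cover science, politics, and literature, and the two affected domains would be the Python and non-Python splits of code.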
Now we can see that this full recomputation baseline and mixture reuse are two very different approaches. Full recomputation learns everything each time, while mixture reuse only recomputes over a subset of the domains, which results in lower cost. The big question now is when the performance of mixture reuse is close to full recomputation. You can imagine that if our existing P̃ was straight-up terrible, then mixture reuse won’t do well. The next slide or two will focus on understanding when mixture reuse can perform well.
One of our key findings was that mixture reuse depends on the relationship between Dfix and Dcomp — the domains whose ratios we keep frozen and reuse versus the domains whose ratios we recompute. Basically, if the domains in each set target different downstream tasks, then mixture reuse tends to perform well.
Let's take a real example of when mixture reuse did not work well due to interaction between the two sets. Here, the DCLM software development dataset — web software development data — is part of our frozen ratios, and Stack-Edu, which is also code, is being added to our domains. We saw that both DCLM software development and Stack-Edu target coding abilities. As a result, the optimal ratios after we added in Stack-Edu involved moving all of our allocation from DCLM software development to Stack-Edu. So this 0.6 gets moved to 0.3 and 0.3. The core problem is that mixture reuse, because it keeps all of the frozen ratios the same, has no way of getting close to this new optimum.
Partial mixture reuse
Fortunately, we have an extension to address this, which we call partial mixture reuse. We only freeze some of the unaffected domains, and we actually recompute the other unaffected domains along with the affected ones. In the diagram, all we do is move the software development domain from the frozen ratios to the recomputed ratios — we’ll learn its ratio along with the Stack-Edu data. This results in the reused ratios being a smaller set of domains, and we recompute the mix on more domains.
The core gain is that this slight increase in recomputation ensures that Dfix and Dcomp no longer contain domains that interact. The result is that the performance of partial mixture reuse is significantly closer to full recomputation than that of full mixture reuse. In the figure on the right, lower is better: it measures the gap to the performance of recomputing everything.
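In the sketch above, partial mixture reuse only changes which domains are frozen: an unaffected domain that interacts with the update is dropped from `fix_idx`, and its ratio is recomputed along with the new domain. Roughly, with hypothetical names and indices:

```python
# Move DCLM software development out of the frozen set so its ratio is learned
# again alongside the newly added Stack-Edu domain; everything else stays frozen.
fix_idx = [i for i in fix_idx if i != SOFTWARE_DEV_IDX]
q_new = mixture_reuse(p_old, fix_idx, num_affected=2, recompute_mix=recompute_mix)
```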
Lastly, taking a step back and looking at all the strategies — full recomputation, partial mixture reuse, and full mixture reuse — this presents a spectrum across the cost-performance trade-off. Partial mixture reuse requires a little bit more recomputation than full reuse, but it’s generally able to achieve better performance.
Empirical results
For evaluation, we simulated a realistic development scenario where the data evolves over five updates, ending with a final dataset of 64 domains. We start with the web dataset DCLM, which we split into 24 topics, and then over the course of the updates we add, revise, remove, and partition numerous domains. In this setting, we again train 1-billion-parameter models and evaluate them over 52 tasks.
Our main baseline is the natural distribution, which is to mix proportional to the final domain sizes. You can think of this as the mix that is induced if you just concatenate all your datasets together and sample naively from that. We also consider full recomputation, which applies Olmix-Base to all the domains after every update — you should think of that as roughly an upper bound on performance that is very expensive. Lastly, we have our two methods: full and partial mixture reuse.
Our first result is on the performance versus cost trade-off. On the x-axis is the total number of proxy models that the method uses throughout the five updates, and on the y-axis is the performance improvement over the natural distribution. Methods in the top-left corner are the best. Looking horizontally, we can see how mixture reuse compares to full recomputation: we achieve between 95 and 98% of full recomputation's performance while using around 67% fewer total proxy runs.
Next, we can look vertically to see how all of these methods do when we fix the budget. At a budget of around 200 to 300 of these small models, mixture reuse achieves a 12% improvement over the natural distribution. Full recomputation achieves around 10.8%. And beyond trade-offs, our best mix, as I mentioned earlier, is three times more data-efficient than the natural distribution.
Finally, this figure shows some of the proposed mixture ratios for full recomputation, our best mix from partial mixture reuse, and the natural distribution. Our proposed mix in purple is more similar to full recomputation in pink than to the natural distribution. This aligns with our downstream performance results from the previous slides.
Wrap-up and future directions
To wrap things up, our paper has two main contributions. The first is a study of the core design choices needed to configure a mixing method from the offline schema. The second is a mixture reuse method that allows us to efficiently recompute data mixes even when the data changes throughout model development.
I’ll touch on some future directions I’m pretty excited about. All of this is considering static mixes — you fix your percentages from the start of pre-training. But we could also think about: what are the design choices, and how do we handle evolving data, if we have a curriculum and have mixes changing throughout training?
Another direction I’m quite excited about is co-design along with other stages of data development. Typically we think about all these data curation stages one after another — you crawl your data, you do your deduplication, you do your quality filtering, you mix, and then you’re ready to hit run. But actually there’s a lot of interaction between these stages, and you can use feedback from some of them to inform better curation in others.
The last thing is automating partial mixture reuse. Currently, we need some domain expertise or a rule of thumb to determine what ratios to reuse versus recompute. Our rule of thumb, going back to the software development versus Stack-Edu example, was: if we know that we're adding data that is very similar to a domain we already have, we should recompute that existing domain's ratio as well rather than freezing it.
That’s it. Some links and all my collaborators are on this slide, and I can take any questions.
Q&A
Q: In terms of measuring interaction with downstream tasks, have you thought of using methods like semantic similarity or other methods? And the approach of looking at swarms is very systematic — what about something a bit more evolutionary, where maybe you take something that’s optimal at the regression level and use that instead?
Great questions. Let me answer the second one first. For swarms, there are some approaches that do this iteratively: you run the approach, get an optimum, and then build another swarm centered around that optimum. Those guide the learning and the search towards the optimum, so I think you can look at these design choices in light of that. I've also seen some works take a more Bayesian perspective, where you define an acquisition function and proceed sequentially: you start with a small initial swarm, and then you iteratively figure out which additional runs you need to cover the space.
On interaction with downstream tasks: we have some formal results where we define interaction as how it impacts the performance. But I do think some notion of taking your training domains and doing some embedding similarity to the evals could be a really good way to get a proxy for that interaction — even just general practitioner knowledge. If you know that a domain contains, say, healthcare data and you have a healthcare eval, you already know they'll interact.
Q: In the mixture reuse experiments, were you testing the mixture by retraining the model completely, or based off the checkpoint from training it on the previous mixture and then continuing training with the new mixture?
It was the former — every time we tested, we trained from scratch again. You can think about the problem setting as: you haven’t started training the big 70B or 100-billion-parameter model yet, you’re just doing some small experiment, so each time you’re running it again to check. But I would love to extend this work to the online mixing scenario, where we already start training the big model and then maybe halfway through we find out there’s some new corpus that gets released and we want to add it in. How can we do that? That adds another dimension of complexity that would be really exciting to see.
Q: Do you have an intuition on the unconstrained optimization? Is the optimal distribution of data the one that achieves roughly equal performances across different domains?
Let me try to answer. For this work, we were just looking at minimizing the average BPB — doing well across one metric that uniformly averages all the things. This can be a little bit problematic if you imagine that 80% of my evals are coding evals; this will really bias my mix towards that. We did some work on weighted objectives in this f̂(P), and we also did some work on adding Pareto constraints in the optimization. Basically, if you have some reference mixture, you say: I enforce that my proposed mix will never do worse than, say, 10% on any task compared to my reference mix. So that was how we were thinking about doing okay on each task.
Q: You said the mixture reuse doesn’t work very well in the interaction case. Is the reason that adding more data effectively amounts to changing the K factor in this optimization?
You should think about K as something you externally set — like, I don’t want to repeat my data more than five times. But as you add more data or change data, this can change this constraint. We actually do have some results buried in the paper where mixture reuse can perform a little bit unpredictably when the optimum is on the boundary of the repetition constraint — things will be a little bit weirder. Happy to discuss more of this offline.
Q: This methodology to systematically look at mixes makes sense. It’s great that you applied it to training models, but what about post-training or RL and data mixes there?
I think this procedure — train a bunch of models, fit a regression model, optimize it — you can think of as the mechanisms going on inside someone’s mind who is manually tuning things. You try a bunch of mixes, you see, oh, if I increase this thing it gets better; if I decrease this thing it gets better. We’re just trying to formalize it. So yes, it can be applied at post-training. One thing is that in pre-training, the regression model is very clean — this log-linear thing — because it’s just so early on that the log-linear model basically means increasing code is going to improve code performance. But at post-training you have more complex capabilities and complex evaluations, so you might not see as clean a model where if I increase this data it smoothly and consistently improves performance. But we can apply the same general principles: try a bunch of mixture models, try to learn some pattern on how the mixes inform downstream performance. Same with RL — we would replace the metrics and replace the regression model, but this general meta-procedure can still hold.
Q: On the comparison between your partial mix versus full recomputation — do you tend to find that they both converge to the same ratios in terms of the mix? You show similar performance, but are the final ratios also similar?
They are kind of different. For the case I showed, we saw that full mixture reuse produced ratios that were quite different — it basically double-counts the code from DCLM and from Stack-Edu, because it has no lever to downweight the DCLM data anymore. So we did find it was pretty different. The figures for this are in the paper.
Q: How does that compare to recomputing everything again, in terms of the ratios?
The pink one is recompute everything, and the purple one is partial reuse. Full mixture reuse is a little bit different from both — Python was down here and stuff — but it was definitely closer to those than to the natural distribution.
Q: This is more about data mixing in general — does it tend to be that performance is very sensitive to data mixing? If the optimal is 20%/80% and you end up at 30%/70%, in general is it super sensitive or not sensitive?
It depends on what your domain is, unfortunately. Consider mixing over two domains that are really similar — suppose they’re actually identical — then any mix I do, whether it’s 50/50 or 20/80, should give roughly the same performance. So that’s a case where performance is not sensitive at all to mixing. It remains a question of what makes a good domain. You want something that’s sensitive but not too sensitive, and ideally helps you maximize performance on the things you care about. Empirically, I see it’s a little bit of both. There are some mixes that are straight-up atrocious — for instance, we have a subdomain of DCLM called “adult content”; I don’t know why we have that. If you train on that, I haven’t looked at what happens, but presumably nothing good. But if you’re within some radius of the optimal mixes, it tends to be fairly stable.