Gideon Mann is the head of the ML Product and Research team in the Office of the CTO at Bloomberg LP. At Bloomberg, he leads corporate strategy for machine learning and alternative data, and his team was responsible for the BloombergGPT project. In June, he participated in a live fireside chat with Snorkel AI co-founder and CEO Alex Ratner during our Future of Data-Centric AI virtual conference. Below is a transcript of their conversation, edited for readability, brevity, and clarity.
Alex Ratner: Gideon, thank you so much for taking the time and for the awesome work that you’ve been doing.
I know your team does a lot beyond BloombergGPT, but as an example of what so many of us are excited about right now, it’s a great one. A lot of the models, algorithms, and even the public data used to train models like ChatGPT are becoming commoditized, and it’s become apparent that the most valuable asset now is the private, domain-specific data and knowledge being used. The BloombergGPT results are an incredible example of that.
How does BloombergGPT, which was purpose-built for finance, differ in its training and design from generic large language models? What are the key advantages that it offers for financial NLP tasks?
Gideon Mann: To your point about data-centric AI and the commoditization of LLMs, when I look at what’s come out of open-source and academia, and the people working on LLMs, there has been amazing progress in making these models easier to use and train. I think one of the contributions of our work is the way we have been very explicit about all of the decisions that you have to make, and there are a lot of them.
What’s amazing is the core architecture, the core design of the language model, is the same. It’s a transformer architecture. It’s this deeply layered, large, decoder-only language model. There are some things around the edges that change the layer norm here or the gradient clipping there, and we tried to detail all those, but it’ll still feel very familiar.
The part that was really different was the training data. At Bloomberg, we as a company provide analytics, we have a finance community, and we’ve collected a lot of data over the years in news and finance contained in research filings and the documents that are necessary for bringing data to our clients. So, when we were training this model we had a lot to work with.
The basis of the model was a lot of public data resources. We used the Pile and Wikipedia, for example. On top of that, we used a lot of our own internal data that we had collected for our finance purposes over many years.
AR: I’ll quickly note that when I say commoditization, it’s been extremely exciting and gratifying to see. It’s everything that we at Snorkel aim to do, especially on the academic and open-source sides. And on the company side, it lifts everyone’s boats, so to speak. But it does highlight the problem: where do you get an edge now? Where do you get significant delta? Is it tweaking the algorithm or the model? There is exciting work coming out all the time on that subject.
I love the example that you and your team set here. It has a lot more to do with data. I also love that you went into detail about how you made all of those decisions, because a lot of people think they can do better if they do domain-specific pre-training. I’ll just dump in a pile of data that’s sitting around my org and it’ll work better. It’s obviously not that simple.
You went into a lot of detail on all the steps of this process in your paper. If you look at recent announcements from companies about new large language models, the training-data mix and distribution is often one of the pieces they keep most secret. You were able to share what you did with the community.
For years you’ve been a big leader in applying AI—generally in the NLP and AI research communities, but also specifically for finance. Obviously, you were part of an org that was already very sophisticated on the research and operationalization side. What were some of the biggest challenges that you faced when taking on this project?
GM: Well before this training challenge, we had done a lot of work in organizing our data internally. We had spent a lot of time thinking about how to centralize the management and improve our data extraction and processing. We were already well positioned. And, at Bloomberg, we’ve been doing NLP for a long time. So when we approached this, the conversation with my senior management was characterized by enthusiastic support. They might not have totally understood the technology, but they said, if you and the NLP people feel like this is the direction we should head in, let’s do it, because it’s so central to Bloomberg’s business.
The hard parts were that there is a great deal of research literature out there, and the amount of work coming out accelerates all the time. From when we started the work to when we ended, several major pieces came out that changed the way we thought about our project.
To give one detailed example, the numerical precision of the parameter estimates and the gradient estimates during training is very important to the stability of the model. There was work (I think it was BLOOM) showing that using the BF16 encoding for the half-precision floats in the parameter and gradient estimates, as opposed to FP16 precision, improved stability. The authors believed that was very important to their stability. So, in the middle of our process we had to say: wait a minute, we need to make sure that Amazon (who we were working with) supports BF16 half-precision floats.
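As an editorial aside, the difference between the two 16-bit formats can be sketched in a few lines of standard-library Python. This is only an illustration, not the training setup discussed here. The helper names `to_fp16` and `to_bf16` are hypothetical, and the BF16 conversion truncates a 32-bit float to its top 16 bits, which is a simplification (real hardware typically rounds rather than truncates).

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the FP16 range (max finite value is 65504).
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Approximate bfloat16 (8 exponent bits, 7 mantissa bits) by truncating a float32.

    Simplification: the low 16 bits are dropped instead of rounded.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    # Hypothetical values: one large activation-like number, one tiny gradient-like number.
    for value in (70000.0, 1e-8):
        print(f"value={value!r}  fp16={to_fp16(value)!r}  bf16={to_bf16(value)!r}")
```

The large value overflows in FP16 and the tiny one underflows to zero, while BF16 keeps both at reduced precision. That wider exponent range is the usual explanation for why BF16 is considered friendlier to training stability.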
Both understanding the research literature as well as taking the time to go through and see what all of the choices were, and even what a conservative approach might be, and then integrating those choices—it was a lot of work. Then, at the end of it, you get these configuration choices. Once you’ve set the number right, then you’re good. But that part was challenging.
Then, we had a lot of machine-learning and deep-learning engineers. The work involved in training something like a BERT model and a large language model is very similar. That knowledge transfers, but within the skillset that you’re operating in, there are unique engineering challenges.
AR: It’s the combination of these model-centric and data-centric operations that matters. The data-centric operations can’t really be commoditized when you get down to domain-specific data. As you said, you had to put in all that work to curate it. The wonderful thing about the model-centric operations is that, although you still need to go through a lot of work there, they’re shared pretty rapidly in the community. It’s a challenge in itself to track them all, but at least they’re out there.
You trained on a 700-billion-token corpus, if I have the number right, and it was a mixture of finance and general-purpose datasets. Could we double-click on that? I imagine you can’t just look online. People do share some high-level things (here’s what splits we used from the web Pile and from Wikipedia, and we did this mixture), but especially once you bring domain-specific data into it, no one is going to be talking about the right way to mix Bloomberg data together. How did you approach that challenge?
GM: One of the surprises for me, from a model point-of-view, is there used to be a tremendous amount of work in thinking about different model architectures. That doesn’t really happen now. What’s been very interesting in the model world is the optimization layer, which I love and find academically interesting with all of the optimization-level details. It’s almost like there’s been a hollowing out of the model architecture. There’s still work there, but mostly it’s on the nitty-gritty optimization part and the data part.
One other big surprise has been that, while data always felt like it was important, these models make all the data that you already have much more useful and discoverable. The value of the data goes up because of these models.
To go into some details, we split it roughly 50-50 between public data sources that we had collected and our own internal resources. Since then, we’ve discovered a lot more that we’re trying to bring in. We didn’t have any great ways of deciding how many epochs we should spend on Wikipedia data or on financial data. There’s a beautiful paper, I think it’s the DoReMi paper?
AR: I mentioned it in my talk as well. It’s by Percy and Michael.
GM: It’s a beautiful paper. It is inspiring for us to think about whether we actually have the right mix. If you look at the amount of training data that we used, our corpus was 700 billion tokens. I think we ended up using around 600 billion tokens of it. But in the Llama-style training, they used 1.5 trillion tokens for a much smaller model. One of our takeaways from looking at Llama is, well, we should have trained a lot longer. Maybe we should have done a few epochs. I think they did multiple epochs of Wikipedia.
At the intersection of the data story and the model story (because you’re doing this stochastic gradient pass) you only see every sentence once in training these big models. It is a little weird. When I was a kid, you did multiple passes over your model until it converged. Then you said, okay, good, we know everything now! We have reached the convex, the top of the distribution, the top of the loss function. There’s no real sense of that anymore. Now, I’m tempted to think about it as more: what is the sample efficiency for training over this corpus? Have I wrung out all the information that is available in each sample? It feels like we’re truly just at the beginning of that question. DoReMi is an example of how we don’t know. And I don’t mean “we” at Bloomberg. I don’t think the community knows what the right mix is or how to gain all the value out of the training data and during the optimization step.
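As an editorial aside, the relationship between a data mix, a token budget, and the number of epochs each source receives is simple arithmetic, sketched below. The function name `effective_epochs` is hypothetical. The 350-billion-token source sizes are a rounding of the "roughly 50-50" split of a 700-billion-token corpus mentioned above, and the second set of mix weights is invented purely for illustration.

```python
def effective_epochs(source_tokens, mix_weights, token_budget):
    """Return how many passes each source gets under a sampling mix.

    source_tokens: tokens available per source
    mix_weights:   relative sampling weight per source (normalized here)
    token_budget:  total tokens seen during training
    """
    total_weight = sum(mix_weights.values())
    return {
        name: (mix_weights[name] / total_weight) * token_budget / source_tokens[name]
        for name in source_tokens
    }


if __name__ == "__main__":
    # Rounded from the conversation: ~700B-token corpus, split ~50-50, ~600B tokens used.
    sources = {"public": 350e9, "internal": 350e9}
    budget = 600e9

    proportional = effective_epochs(sources, {"public": 1, "internal": 1}, budget)
    print("proportional mix:", proportional)

    # Hypothetical alternative: sample internal data twice as often as public data.
    upweighted = effective_epochs(sources, {"public": 1, "internal": 2}, budget)
    print("internal upweighted:", upweighted)
```

With a proportional mix, each source is seen for less than one full epoch. Shifting the weights pushes one source past a full pass while the other is seen even less, which is the trade-off a method like DoReMi tries to optimize.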
“When I was a kid, you did multiple passes over your model until it converged… There’s no real sense of that anymore.” - Gideon Mann, Bloomberg
AR: I agree, a lot of the challenge is the optimization and all of the engineering aspects of running that process, and then it’s the data and the model architecture. My co-founder and former advisor at Stanford is doing cool work on new model architectures. There’s very fascinating work there. But for most of us, a new model architecture comes out and if it starts to dominate, you plug it in. Even as there’s a great model-centric or model architecture development process, whether it’s as sophisticated as what you’re building or what we at Snorkel are trying to build, or more applied, that does hollow out your process.
You mentioned the data mix having an impact. We ran this consortium project via the University of Washington, and it was Stability, LAION, Apple, Google, and a bunch of others. It’s at DataComp.ai, and it was basically a contest. If you fix the model architecture, the training script, the algorithm, everything, and you just work on your filtering and sampling of the data, you get a new state-of-the-art score. This was for a CLIP-style model.
So the mix of data matters a ton, as does how you optimize that mix, whether with DoReMi or an importance-sampling approach. That’s also where the domain knowledge comes in: applying some of what you already know is going to be the right mix of tasks.
I like to compare it to database tuning. When you’re doing database tuning, you’re not trying to find a “right” answer. You’re trying to tune it for an expected query workload. What this is going to turn into is trying to “tune” the mix of data from these big foundation models for the kind of expected distribution of things we’re trying to build on top. How do you ever answer that question without having deep knowledge of your organization’s objectives?
GM: I had a conversation with Kevin Knight yesterday, and he made the point that people don’t believe in evaluation datasets anymore. And that is anathema to me as an academic. To me, you have to create an evaluation test set that mimics what you want to do in practice, then you have to evaluate against it.
There’s a real conversation to be had around how you evaluate these large models. One of the surprising results of our paper was that the training-data mix really does make a difference in evaluations, but evaluations in-domain. Even at the levels of hundreds-of-billions of tokens you’re still getting some performance gains in-domain. We were all hoping for that outcome, but we didn’t know it definitively when we started. Even at these very large training-data domains, you can still detect differences in performance on in-domain evaluation.
That says two things. One, you need to consider what your domain knowledge is and what data you’re feeding into the models. And two, you have to think carefully and somewhat differently about how you approach evaluating these models. That part of the equation hasn’t gone away. What are we using for evaluation? How do we know that we’re getting the right performance? What’s the level of hallucination? There’s a lot there.
“What are we using for evaluation? How do we know that we’re getting the right performance? What’s the level of hallucination? There’s a lot there.” - Gideon Mann, Bloomberg
AR: There is such an interesting conversation around evaluation these days. And, again—surprise, surprise—it comes back to the data or data-centric operations. A better evaluation algorithm doesn’t solve it. What mix of data or skills or tasks do you put together? It’s domain knowledge mixed with data science.
Someone at the last talk asked a question: how should org structures be built in this new age? I think data science skills have to be mixed with data skills and with domain-expert or product management skills.
So much of what we teach and have been taught as data scientists has been very anchored on having a black-box test set and getting a single accuracy score. I still think that’s extremely valuable. We at Snorkel still focus on use cases that can be boiled down to a single number, because often that’s the easiest to evaluate and that’s what’s tied to business value. But as we look multiple years into the future, we are going to start unpacking that black box in a big way. We can’t just have a single number. Every organization, every team will need their own private benchmarks. The AI development game is going to be a lot more architecting evaluation benchmarks, versus the model itself. But then, of course, they go back and forth, because when you shift the benchmark you have to shift the data mix that you’re putting into the model.
Everyone in data science knows about cheating on the test set, or test-set leakage. We’ve been taught, in a sense, not to look at the test set. But we have to engineer the test set, even for those models where you’re just spitting out a single accuracy number. How you engineer the test set matters a ton for whether the model actually works in the real world.
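As an editorial aside, the point about a single accuracy number hiding what matters can be sketched with a small per-slice breakdown. Everything here is hypothetical: the function names, the slice names, and the counts are invented for illustration and do not come from any benchmark discussed in this conversation.

```python
from collections import defaultdict


def overall_accuracy(results):
    """results: list of (slice_name, is_correct) pairs."""
    return sum(int(ok) for _, ok in results) / len(results)


def accuracy_by_slice(results):
    """Break the same results down by slice (for example, by domain or task)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, ok in results:
        totals[slice_name] += 1
        correct[slice_name] += int(ok)
    return {name: correct[name] / totals[name] for name in totals}


# Hypothetical evaluation results: a large general slice and a small in-domain slice.
results = (
    [("general_qa", True)] * 90
    + [("general_qa", False)] * 10
    + [("financial_filings", True)] * 5
    + [("financial_filings", False)] * 5
)

if __name__ == "__main__":
    print("overall:", round(overall_accuracy(results), 3))
    for name, acc in accuracy_by_slice(results).items():
        print(f"{name}: {acc:.2f}")
```

In this made-up example the aggregate score looks healthy because the general slice dominates the test set, while the in-domain slice performs no better than a coin flip. How the test set is composed decides which of those two stories the headline number tells.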
“Every organization, every team will need their own private benchmarks.” - Alex Ratner, Snorkel
GM: It’s funny, people treat “prompt engineer” as a derogatory term. But that term undersells the real work that goes into constructing prompts and interacting with these models. When you’re constructing a prompt, you’re thinking very carefully about what kind of answer you want. What’s a good answer, what’s a bad answer, how do you specify what that looks like? As a manager, you interact with your teams in much the same way. How do you talk to your teams about what you want them to do? How do you evaluate all the people that you work with on a project?
I almost wonder if “large-language-model manager” would somehow be a more appealing term than prompt engineer, because it’s a little more in that spirit. And I think that holds for inference time as well as for evaluation.
AR: It’s worth noting that fine tuning generally outperforms prompting or prompt tuning. It’s a spectrum, in our view. It comes down to specifying what exactly you want the model to do and what’s the definition of success and doing this through a data interface.
If you think about how you do it as a human manager, you might say, “okay, do this,” and that’s the prompt. If it’s a challenging task, you might give some examples. If it’s a really challenging task, you might have to give a lot of examples until someone is trained to proficiency. Going from a prompt, to a prompt plus some examples, to a big set of examples: that’s all part of the same thing. This continuum, from “let me just engineer a prompt” to “let me label data to fine tune,” is all part of the same spectrum of cleanly specifying what you want the model to do, based on the difficulty of your task. That’s both at the input time and on the evaluation side.
The product-management notion—what do you need to be good at for building these broad organizational models—is super relevant.
Just one more broad question. What do you think is ahead in terms of building on not just BloombergGPT but also domain-specific large language models in general?
GM: It seems like there are two big questions being debated on LLMs. One is the critique about whether the models are creative versus just synthetic or syncretic. How does the human creative process play into these models? That’s not a question that can be answered in six months, but more likely on a five-to-ten-year timeline. The models don’t do everything that people do. They do a lot of things badly. So, true creativity is one example. There have to be new ways of creating those insights and developing them and somehow interacting with LLMs around them. It feels like the old way of doing things, writing a research paper or book and somehow feeding that as training or as context, is very slow. So, as you have a new idea or begin exploring it, maybe there’s a faster way to turn that into the grist that the model can operate on.
The second big thing is that I don’t think programming is going away. The need to be super pedantic about what exactly you want on a hardware level in order to achieve computational efficiency, that’s not going away. The process of software creation is going to be vastly changed and expanded in really wonderful ways that will be transformative. Those are the two big questions I’m excited about when considering what’s next.
AR: Thank you so much, Gideon, for the discussion today and for all the great work that you and the team are doing.