Alex Ratner spoke with Douwe Kiela, an author of the original paper about retrieval-augmented generation (RAG), at Snorkel AI’s Enterprise LLM Summit in October 2023. Their conversation touched on the applications and misconceptions of RAG, the future of AI in the enterprise, and the roles of data and evaluation in improving AI systems.
Alex Ratner: Douwe, thank you so much for joining us. Just to introduce you: Douwe has had a long career. He’s an adjunct professor at Stanford, he was previously head of research at Hugging Face and a research scientist at Facebook AI Research. He has done tons of seminal work on evaluation, via projects like Dynabench (a platform for dynamic benchmarking) and work on some topics that are now all the rage, like retrieval-augmented generation (RAG), for which he co-wrote the seminal paper.
Let’s maybe start there. You were one of the lead authors of the original RAG paper, where you coined the term. Now that retrieval-augmented generation is essentially a “household term” in the AI world, what do you think people get right about RAG, and what do you think they get wrong?
Douwe Kiela: It’s very interesting because what people call RAG now is almost a baseline in the original RAG paper. I don’t think we would have been able to write a paper about just “vector-database-plus-language-model.” I think it would have been rejected.
“What everybody’s doing already seems to work quite well. And we know for a fact that you can do even better.” (Douwe Kiela, CEO of Contextual AI)
The paper is really about how we can have these two separate sources of data or two separate models: an embedding model that you use for retrieval, and a generative model, or a language model, for generating answers in open-domain question-answering settings. The paper is about how we can learn to do the retrieval so that you find things that are relevant for the language model that consumes what you retrieve. So, for anyone interested in looking at the paper, we showed that if you fine tune this system (if you learn at least the query embedding), then you do much better than the frozen RAG that everybody uses right now.
I think that’s an encouraging thing. What everybody’s doing already seems to work quite well. And we know for a fact that you can do even better.
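To make the "frozen RAG" baseline concrete, here is a minimal sketch of the retrieve-then-generate pattern described above. It is an illustration under stated assumptions, not the method from the paper: `embed` is a toy bag-of-words stand-in for a learned embedding model, the three documents are made up, and the language model call is left out entirely (the sketch stops at building the prompt). Nothing here is trained, which is exactly what "frozen" means in this context.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a learned embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Concatenate retrieved passages and the question for a language model."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Made-up corpus for illustration only.
documents = [
    "Dynabench is a platform for dynamic benchmarking",
    "RAG pairs a retriever with a generative language model",
    "Weak supervision labels data programmatically",
]
query = "what is a platform for dynamic benchmarking"
top = retrieve(query, documents, k=1)
prompt = build_prompt(query, top)
```

In the fine-tuned setup Douwe describes, the query side of `embed` would be learned jointly with the generator so that retrieval is optimized for what the language model actually needs; this sketch keeps both parts fixed.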
AR: I agree with that take. I was just talking with Percy [Liang] about the general view that, along with the choice of foundation model, these are foundations.
In your paper, you did both fine tuning and RAG together. So, it’s a nice exemplar of that complementarity. How do you see people navigating this emerging toolset when you consider RAG, fine tuning, and prompting?
DK: I always tell people (and it probably doesn’t feel this way) that it’s still very early days. We’re still figuring a lot of this out. My bet would be on some sort of hybrid: retrieval augmentation on the one hand, when you need it to generalize to your own data, and then probably some fine tuning, though not necessarily on your own company data. I think you can cover most of that with retrieval augmentation. But you want to fine tune the end-to-end model (not just the language model, but the whole thing) on the use case for which you want to use this. Then you can learn what sort of retrieval you want to do for solving the use case and how you want the language model to deal with the things you retrieved.
Prompting obviously plays an important role too—although ideally it shouldn’t play any role because these models should be very robust to different prompts. But, as Percy Liang also points out with a lot of his evaluation work, these things are still very brittle. That’s something that we have to improve across the board.
AR: I’ve heard others say that the advent of prompt engineering and all the novel and sometimes hilarious strategies for prompting are really signs of the lack of robustness—or the failure of robustness and stability—of current models.
DK: Absolutely. This is one big issue I have with prompt engineering these days. We have best practices in academic machine learning around having held-out test sets and validation sets. You’re not supposed to tune your model on the test set, right? That’s one of the golden rules. Everybody knows that. But it seems like the new generation is less familiar with that separation, and a lot of folks are tuning prompts directly on the test set saying, “okay, this works,” then deploying it in production. They’re finding out that their test set was actually a dev set, and it doesn’t generalize to their real data. That’s a big problem.
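The separation Douwe describes can be sketched in a few lines: choose among prompts using a dev set only, then score the chosen prompt once on a test set that played no part in the choice. Everything below is a hypothetical illustration. The two "prompts" are plain Python functions standing in for a prompt plus a model, and the examples are invented.

```python
import random

def accuracy(predict, examples):
    """Fraction of (text, label) pairs that predict gets right."""
    return sum(predict(text) == label for text, label in examples) / len(examples)

def split_dev_test(examples, dev_fraction=0.5, seed=0):
    """Shuffle once, then carve out disjoint dev and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-ins for "prompt + model" pairs.
def keyword_prompt(text):
    return "pos" if any(w in text for w in ("great", "good", "love")) else "neg"

def always_positive_prompt(text):
    return "pos"

candidates = {
    "keyword_prompt": keyword_prompt,
    "always_positive_prompt": always_positive_prompt,
}

# Invented examples for illustration only.
examples = [
    ("great product", "pos"), ("good value", "pos"),
    ("love the design", "pos"), ("great support team", "pos"),
    ("terrible battery", "neg"), ("slow shipping", "neg"),
    ("broken on arrival", "neg"), ("poor documentation", "neg"),
]
dev_set, test_set = split_dev_test(examples)

# Tune (here: pick a prompt) on the dev set only.
best = max(candidates, key=lambda name: accuracy(candidates[name], dev_set))

# Touch the test set exactly once, after the choice is final.
test_accuracy = accuracy(candidates[best], test_set)
```

If the prompt were instead chosen by its score on `test_set`, that score would no longer estimate performance on unseen data, which is the failure mode described above.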
AR: Yeah, there’s a fine line. I think we swing between two extremes. On the one hand, we’ve accidentally taught a generation of data scientists to never look at the test set. People interpret that as, “just don’t look or think about the evaluation data set.” Consequently, we’re now behind in terms of the maturity of evaluation and testing, and in test engineering in general.
But you can also swing to the other extreme, where you do more than look at the test set. You literally develop and tune your prompts on it, which is cheating, right? That’s not going to lead to good, robust generalization.
I’m interested to know if you like or dislike this metaphor. I think of an out-of-the-box LLM as a med student. You’re trying to get the student to make a specialty diagnosis. You could try giving the med student very careful instructions. And for some simpler tasks, like taking basic vital signs or measurements, you can probably accomplish that. That’s a bit like prompting. But for some things you’re going to need lots of training. That’s why med students go to specialty school, and that’s like fine tuning. Then in some cases, even the expert oncologist needs the patient’s chart and/or relevant academic literature.
DK: Absolutely, I think that’s a perfect metaphor. As humans, we learn a lot of general stuff through self-supervised learning by just experiencing the world. Then you go to university, you specialize, you do your fine tuning. But when you take a lot of these very difficult exams, they’re open book exams. And, when you do your job, you can always go back to the book or to the internet and ask questions. It’s not like you have to cram all of this knowledge into your parameters. You have access to extraneous sources of information. That’s why I think it’s a very natural paradigm.
Another interesting point concerns long-context transformers, or long-context language models: the way we make those work is through sparse attention mechanisms. The sparsity basically says: we’re going to discard whole chunks in our context window and only focus on the things that really matter. This sparsity is very similar to the non-parametric version that you do with a retrieval model. So, long-context windows feel inefficient because you have a lot of information there that you don’t need. If you take long-context windows to the extreme, then you end up with a non-parametric retrieval model.
AR: I really like that connection. It highlights the transparency or lineage trade-off as well.
DK: That’s exactly it. If you do it non-parametrically, like with a RAG model or retrieval-augmented language model, then you get attribution guarantees. You can point back and say, “It comes from here. I have this much confidence that I used this thing.” That allows you to solve hallucination. We have papers from 2020 where we showed that these models hallucinate less than regular parametric models. It’s a better architecture, but obviously I’m very biased!
AR: It’s not an either/or. A lot of this stuff is going to come together.
One thing that we’re not yet thinking about enough is that it’s a more interesting trade-off space. When we try to get things to work with customers or support their platform, we’re thinking about accuracy, and we have a lot of work to do yet to even get that right. But, eventually as you go from the “first-mile” exploration to actual system deployment, you have a richer trade-off space. You have to think about latency, and cost, and size, and attribution. There are rich trade-offs across all of those. Sometimes memorizing things parametrically can be advantageous, and sometimes it can be a severe drawback, depending on how much you care about all of those different axes.
It’s going to be interesting to navigate. I don’t think it’s rocket science, but there’s a lot of growth that we need to go through.
DK: But that’s also a good sign, right? Where we are right now in the field is that there’s been this kind of “demo disease,” as we call it at Contextual AI. Everybody wants to build a cool demo. You can do that right now with one of these frozen RAG systems. At Contextual, we call those Frankenstein’s monsters. You’re coupling things together and it can walk, but it doesn’t really have a “soul.”
Now we’re starting to see real production deployments in real enterprises. But it’s moving a lot slower than a lot of people have predicted. A big factor in that is evaluation. So, understanding the risks of a deployment, going beyond just accuracy on some held-out test set on which somebody might have tuned a prompt, and going to a real robust system that you can expose to your customers or to your internal employees without worrying about things going disastrously wrong.
AR: Let’s jump into evaluation. You’ve done some exciting work there. Dynabench is an example. What are the most salient things that are missing when you’re trying to evaluate?
DK: I’m a big fan of the HELM (Holistic Evaluation of Language Models) project and Percy Liang’s work there and in general.
The first thing you need to do is think about it much more holistically. It’s not just about accuracy, and it’s definitely not just about one test set. It’s about many test sets and then there are lots of different things you can care about.
You mentioned latency, but there’s fairness, robustness, and all kinds of others. Is it actually using your company’s tone? Are there any legal risks that we’re exposing ourselves to? Are we allowed to talk about our own company stock when somebody asks us? At Google or Meta, for example, they have to be very careful with what their language model says about their own stock. Those are the kinds of things that, if you’re an academic, you don’t really think about, but these are really make-or-break things for big enterprises.
AR: Let me ask a little bit about how you make the transition from the academic side of things to leading a startup. It’s like you’re switching to different modes during different phases of a project. There are certain places that are best for academic and research-centric roles, and there are phases where you can’t accomplish your goals unless you build a team that has different skill sets to productize, to get out and deploy with real enterprises in real settings.
What have been some of the lessons for you so far in making that transition and starting Contextual AI?
DK: This is a fascinating question.
One way to think about that is that the two sides are actually very similar. If you’re good at being an academic, then you have to also be very good at branding and marketing and the product work around, for instance, having a very polished paper and things like that.
You need to have a solid core, a good research finding. But then, if you want to be a really successful academic, then you need to do a lot of the same things that you need to do if you’re in a startup.
AR: I agree with that. I’m a little biased here as well. Sometimes people think this answer is a little suspect, but a lot of the core skills are very similar. A lot of the metrics and objectives are obviously different. Having to be a “one-person shop”—understanding marketing, messaging, branding, and packaging; having to productize and get feedback from users—is a skill set that is easily shapeable into what you need for startups.
The metrics are indeed different. When you’re running a business and you’re serving customers, you don’t care as much about novelty, for instance.
DK: But in many ways it’s still the same thing. If you build something awesome or you have an awesome research finding, but you present it in the wrong way, then nobody’s going to care. It’s exactly the same with a startup.
Another thing is failing fast. That’s something you want to do both in startups and in academia.
You should be ready to discard most of your ideas as bad ideas. That’s where good ideas come from.
AR: I love that point as well.
Another question: What are some of the challenges in the enterprise that you think the academic or open-source community underestimates? Are there any that you’re running into again and again? What was surprising for you as you made that transition to the startup world?
DK: I would say evaluation. I still think the field is in a complete evaluation crisis.
HELM is trying to do something about that. I think Dynabench is even more extreme and experimental, where we have a dynamic leaderboard and there isn’t really a static ranking anymore. We have a dynamic approach to models, continuously updating them and then having people try to break them.
Tying into that point, what’s really missing in academia is systems thinking. That’s by design sometimes, but right now we’re getting to a point where you can evaluate either the language model itself in a very isolated setting, or you can evaluate the whole system that is built around that language model. And those are very different things and you cannot compare them.
For instance, when you are calling the GPT-4 API, that’s a system, that’s not a model. There’s a lot of stuff going on there. There are safety filters on top and other things where we don’t actually know what’s going on inside that system. It’s just a black box.
If you compare that to a bare-bones model, maybe one that doesn’t even have any reinforcement learning from human feedback—no fine tuning—that’s not even a comparison you want to make.
This is going to require a lot of standardization around what these systems look like, but if we can figure that out, then we can start thinking much more holistically about what we’re doing with these powerful things.
AR: I think it makes a lot of sense. That’s such a good point. And that gets to your work on Dynabench.
It’s two axes. One is thinking about a system versus one module of a system. The full systems take longer to build and they take longer to research as well. Academia has a quite long and fruitful history of systems research, but it always is a lot slower. It takes a little longer for the systems work to be appreciated, but we do need to focus there.
Then there’s evaluation that’s more dynamic and holistic.
When I was in middle school, I took a hundred basketball shots every day, by myself, and I thought that would make me good at playing basketball. And, surprise, it did not! It’s one target task in isolation versus a real dynamic system. They’re a little bit different.
DK: Yes. And something that really ties in nicely around evaluation and systems is data.
Everybody at Snorkel knows this, but I think a lot of people out there need to know that data is the real gold here. The architecture is generalized. Everybody uses the same kind of thing with a couple of tweaks. Compute is currently a scarce resource, but that’s going to become less scarce over time. So, it really is all about the data.
Maybe this is starting to change now, but for a long time, both in industry and academia, people didn’t have enough respect for data and how important it is and how much you can gain from thinking about the data. For example, by doing weakly supervised learning, or by trying to understand your weaknesses and patching them up, you can close the additional 10 percent that you need to get from your cute demo to a production deployment.
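As a rough illustration of weakly supervised learning, the sketch below labels text with a few hand-written heuristics ("labeling functions") and combines their votes by simple majority. This is an assumption-laden toy: the heuristics, label names, and messages are all invented, and plain majority voting is used here in place of the learned label models that production weak supervision tools typically rely on.

```python
from collections import Counter

ABSTAIN = -1
SPAM, HAM = 1, 0

# Hypothetical labeling functions: each votes for a label or abstains.
def lf_contains_link(text):
    return SPAM if "http" in text else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_greeting(text):
    return HAM if text.lower().startswith(("hi", "hello")) else ABSTAIN

def majority_vote(text, labeling_functions):
    """Combine noisy votes; abstain when nobody votes or the top two tie."""
    votes = [lf(text) for lf in labeling_functions]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

lfs = [lf_contains_link, lf_mentions_prize, lf_greeting]
labels = [
    majority_vote(message, lfs)
    for message in (
        "Win a prize at http://example.com",
        "Hello, are we still meeting today?",
        "See you tomorrow",
    )
]
```

Examples where the functions abstain or disagree are a cheap signal of where the heuristics are weak, which ties into the point about finding and patching weaknesses.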
AR: You’re preaching to the Snorkel choir here! It’s all about how you label and develop the data—and doing so more programmatically and more like software development—to tune our customized models.
And as an aside, you need evaluation to be mature to do that programmatic development. It’s one thing that people historically misunderstand but that they are now starting to realize.
The original paper that coined the term “large language model” was a 2007 Google paper where they used an algorithm called “Stupid Backoff.” It’s one of the funniest trolls of all time. In academic papers, they used all these fancy French names for algorithms, for the decoding stage of the language model—because language models have existed since the 1980s and 1990s—and they used an algorithm called Stupid Backoff, but they used a hundred times the amount of data, or something like that, and they blew them all away.
Everyone understands: more data and better data; garbage in, garbage out. You need good quality data. But what folks generally underestimate, or just misunderstand, is that it’s not just generically good data. You need data that’s labeled and curated for your use case.
That goes back to what you said: It’s not just about “cleaning data.” It’s about curating it and labeling it and developing it for a specific use case and a specific evaluation target. You can’t know where you need to do that development if you don’t have mature evaluation in place to tell you where the model or the system is falling down.
Evaluation maturity and how fine-grained your evaluation is go hand in hand with how efficient and successful you can be at data curation and development.
DK: You need to know what you’re hill-climbing on. We’re optimizing all of these systems, big end-to-end neural networks, but we don’t really have an idea of what’s going on inside the neural network.
We need to be able to measure the impact of our data on our downstream performance. It’s all a system and it’s one big loop that’s driven by data.
AR: I like to think about a really simple sandwich model.
You’ve got data coming in. You’ve got data and all the knowledge to curate that data. That data comes in through various ports: retrieval-augmented generation, fine tuning, pre-training, and in-context learning that’s more ad hoc, like prompt engineering. Then, on the output side, you’ve got how you deploy the model. What is the UI/UX? What’s the deployment mode? What’s the business process or user experience that it plugs into?
The inputs and outputs are naturally where all of the diversity and the action are. A lot of the model, algorithm, and compute layers are already a lot more standardized and are going to continue to become more so.
DK: For sure.
AR: We’ve seen an explosion recently where models are coming out at a high rate. It may even accelerate because of how enterprises are embracing open-source models that they can own and tune and serve efficiently.
What is one big area where we’re going to see some significant, step-change innovation?
DK: I can give a different answer to this one to make it less boring!
There are maybe three topics I’ve been most interested in during my career. We mentioned retrieval augmentation and evaluation. The other big topic in that list for me is multimodality.
My thesis was on multimodality and how that can lead to a much richer understanding of the human world. If systems understand the human world better, they will be able to talk to us much more effectively as well.
Natural language processing itself shouldn’t just focus on text. It should focus on basically all the data that humans generate, because then you can really understand humans.
I think this trend is starting right now. There’s a lot of amazing work from 10 years ago in which people were already doing very similar things. It’s fun to see this revival happening right now. I expect a lot of big breakthroughs in multimodality, if they haven’t already happened. GPT-4V is very impressive. We’re definitely getting there.
AR: I think that’s a great point.
Thank you so much for joining us, Douwe. I really appreciate it. It was a great chat. I know I really enjoyed it, and it was great to see you.
DK: Thank you.