Alex Ratner spoke with Alex Taylor, Global Head of Emerging Technology at QBE Ventures, at Snorkel AI's Enterprise LLM Summit in January 2024. The pair explored the challenges and impact of AI in the enterprise—particularly in the insurance world—and the importance of amplifying subject matter experts in the loop. A transcript of their talk follows, lightly edited for readability.

Alex Ratner: Alex, it's an absolute pleasure to have you on today, and I'm excited to have a pretty wide-ranging conversation. I'll cue it up: the theme today is—on the method side—the importance of data.

If you think about the simple ingredients in the "cooking show" of AI—models, algorithms, and data—the models and algorithms have gotten a lot more black box, a lot more commoditized, and frankly more available, in the most positive sense. Often the bottlenecks for getting things to work across the last mile and in real enterprise settings are with the data.

I’ll start by asking a little bit about what you see with your QBE hat on as the opportunity for AI. Then, what are some of the challenges that you think are unique in the enterprise versus the Twitter demos and academic benchmarks we see?

Alex Taylor: It’s a really good starting point because we’re in this transitionary period at the moment on the accessibility of models—particularly to people that don’t have an engineering background, that don’t have a data science background, that are suddenly realizing that their skills are just as applicable in this domain as they are in the domain that they’re an expert in.

We’re in this transitionary period at the moment on the accessibility of models—particularly to people that don’t have an engineering background, that don’t have a data science background, that are suddenly realizing that their skills are just as applicable in this domain as they are in the domain that they’re an expert in.

Alex Taylor, Global Head of Emerging Technology, QBE Ventures

My industry, insurance, is particularly fascinating in this way. There have been a lot of discussions today around modes of data, structured vs. unstructured. I think that insurance is the absolute center of the universe of different modes of unstructured information.

An insurance broker…bless them. They'll toss a packet of information, hundreds of pages long, over the fence to an insurer and say, "Insure this, whatever it is. Figure it out, give me a price, and I'll put it to my customer."

We’re at this stage now where suddenly underwriters who have built a lifetime of experience have realized that this generation of AI—particularly large language models (LLMs) and related embedding models—can be taught in the same way that a junior underwriter is. They can treat it almost like a human being to perform a task that was previously not only very time-consuming but only achievable by a human being.

Alex, you made the point in the last talk about this generalizing of models. The center of the universe seems to be these behemoth models from OpenAI, from Anthropic, from Mistral. Yes, absolutely—we’re seeing this capability generalization. But the ability to query information, to retrieve information, to perform basic reasoning on top of that information seems to be their superpower.

We're only really a year into people realizing this in general, and all of these use cases are popping up in industries like banking and insurance. All these industries that have vast amounts of unstructured or multimodal information are coming out with use cases that go beyond promising initial experiments to real examples that accelerate business processes and improve customer service.

There’s so much distance to go, but so much promise already.

There’s so much distance to go, but so much promise already.

Alex Taylor, Global Head of Emerging Technology, QBE Ventures

AR: That's an awesome intro—you did way better than my basic, very generic question deserved, but that's awesome! A couple of things to unpack as we go through: the heterogeneity of unstructured data, and the way that the structure of a specific industry impacts that, which I find fascinating. A lot of people don't realize that it's literally baked into the structure of a specific vertical or use case. You might have all these different, almost adversarial, formats and diversities of data—most of which are not out on the Internet for models to memorize or learn about.

Then there's the idea of retrieving and extracting from that data, as well as using it to tune these models to go beyond the "junior underwriter" or "undergrad" level. I think that's a fascinating area. We had a talk earlier about some of the work that we're doing at Snorkel on not just tuning an LLM, but tuning a retrieval model. A brief aside from our last LLM summit—Douwe Kiela (one of the lead authors on the original RAG paper) made the point that the way people use RAG now is actually just the baseline. Tuning your RAG system to know how to retrieve and how to embed your specific data is one of the key steps to getting it to work. There's tons of both value and depth in getting those things to work.
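To make Kiela's point concrete, here is a minimal sketch of what tuning an embedding model on domain-specific data might look like, using the sentence-transformers library's v2-style training API. The model name and the query/passage pairs are invented for illustration; this is not the pipeline discussed in the talk.

```python
# Minimal sketch: adapting an off-the-shelf embedding model to domain data
# with contrastive pairs. Illustrative only; the query/passage pairs and
# model choice are assumptions, not the actual pipeline from the talk.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose baseline

# Hypothetical (query, relevant passage) pairs drawn from SME feedback.
train_examples = [
    InputExample(texts=["What is the hull coverage limit?",
                        "Section 4.2: Hull and machinery cover is limited to ..."]),
    InputExample(texts=["Does the policy cover cargo transshipment?",
                        "Transshipment of insured cargo is permitted where ..."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: each query is pulled toward its own passage and
# pushed away from the other passages in the batch.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```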


Obviously there’s a lot of academic and methods stuff in this summit, but we’re also trying to bridge it to the real impact. How have you seen lines of business impact the implementation of common genAI tools like RAG in your area and at QBE?

AT: It's almost amusing at the moment that a lot of consumers of technology in this space see RAG as the end game. I'll tell them now: they're in for a big surprise in 2024 as multimodality and agent-based systems become more common.

It’s almost amusing at the moment that a lot of consumers of technology in this space see RAG as the end game.

Alex Taylor, Global Head of Emerging Technology, QBE Ventures

But look, RAG is a useful starting point because it's understandable. People can immediately say, "This is the process a human goes through. We look at a large corpus of information. We select a subset of that. We inject that into our minds or into the context of the model. And we interrogate that to get to a transitionary point of value." That's essentially what an underwriter in insurance does, what someone reviewing a loan application in banking does, and so on across all these different use cases.
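As a rough illustration of the loop described above (select a subset of a large corpus, inject it into the model's context, interrogate it), here is a minimal sketch in Python. The `embed` and `llm_complete` callables are placeholders for whatever embedding model and LLM you use; they are assumptions, not a specific vendor API.

```python
# Minimal sketch of the RAG loop described above: embed a corpus, select the
# chunks closest to the question, inject them as context, and interrogate.
# `embed` and `llm_complete` stand in for whatever embedding model and LLM
# you use; they are assumptions, not a specific vendor API.
import numpy as np

def retrieve(question: str, chunks: list[str], embed, k: int = 3) -> list[str]:
    q = embed(question)                              # vectorize the question
    scores = [np.dot(q, embed(c)) for c in chunks]   # cosine similarity if vectors are normalized
    top = np.argsort(scores)[-k:][::-1]              # indices of the k most similar chunks
    return [chunks[i] for i in top]

def answer(question: str, chunks: list[str], embed, llm_complete) -> str:
    context = "\n---\n".join(retrieve(question, chunks, embed))
    prompt = (
        "Answer using only the context below. If the answer is not in the "
        f"context, say so.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    return llm_complete(prompt)  # "inject into the context and interrogate"
```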

The interesting thing that we’re starting to see is the evolution necessary for these models to make them generally useful in domains of information that might not have been part of the training data set.

People forget that it’s not just about the generative language model. It’s also about the vectorization process and the embedding. One of the things that I was looking at quite recently was the use of acronyms common in a particular industry domain, or even within an organization. You can’t retrieve information that the model hasn’t been exposed to previously because it wasn’t embedded appropriately.
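One simple mitigation, sketched below under the assumption that you maintain a glossary of your organization's shorthand: expand known acronyms before text is embedded, on both the document side and the query side, so the two land near each other in vector space. The glossary entries here are invented examples.

```python
# Sketch of one mitigation for the acronym problem: expand domain shorthand
# before embedding, so queries and documents land near each other in vector
# space. The glossary entries are invented examples.
import re

GLOSSARY = {
    "H&M": "hull and machinery",
    "BI": "business interruption",
    "SOV": "statement of values",
}

def expand_acronyms(text: str) -> str:
    for short, long in GLOSSARY.items():
        # Keep the acronym and append its expansion so both forms get embedded.
        text = re.sub(rf"\b{re.escape(short)}\b", f"{short} ({long})", text)
    return text

# Apply on both sides: to documents at indexing time, to queries at retrieval time.
print(expand_acronyms("Please review the SOV and confirm the H&M limits."))
# -> "Please review the SOV (statement of values) and confirm the H&M (hull and machinery) limits."
```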

Making sure that we get this evolution of models, but also the ability to fine-tune models—this is where it gets really fascinating. Companies are starting to realize that we’ve got these great general-purpose models. They’re behemoths (up to trillions of parameters) but might not perform very well if they’re not aware of the information domain or the question domain that you have to put in front of them.

One of my favorite things to do with models at the moment is, in the structured response you ask a model to provide, to get it to give you a reason why it behaved in a particular way. Not only does that let you iterate on a prompt, it lets you understand what the model itself doesn't know. That can be a great way of moving into fine-tuning the models as well as the RAG and, in some cases, giving them exposure to corporate information (not just the information itself, but prior examples of what human beings have done in similar situations) to improve the performance of the model later on.
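Here is one way that "ask for a reason" pattern might look in practice, using the OpenAI Python client's JSON mode as a concrete example; the schema, model name, and underwriting prompt are illustrative assumptions rather than anything described in the talk.

```python
# Sketch of the "ask for a reason" pattern: request a structured response
# that includes the model's stated rationale alongside its answer. Uses the
# OpenAI Python client as one concrete example; the schema, model name, and
# prompt are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": (
            "You are assisting a marine cargo underwriter. Reply as JSON with "
            'keys "decision" ("accept", "decline", or "refer") and "reason" '
            "(one sentence explaining which facts drove the decision)."
        )},
        {"role": "user", "content": "Cargo: lithium batteries. Route: Shanghai to Rotterdam."},
    ],
)
result = json.loads(resp.choices[0].message.content)
print(result["decision"], "-", result["reason"])  # the reason guides prompt iteration
```

The "reason" field is what makes the iteration loop work: if the stated rationale cites the wrong facts, that points at the prompt or the retrieval step; if it reveals missing knowledge, that points toward fine-tuning data.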

The thing that I love about this space and what I’ve just described is that the process we take the models through is almost identical to the process that we take human beings through to train them. It’s this alignment between the human process and the machine process that I think has a lot of people paying attention now.

AR: It’s super interesting. Especially before this current wave, I have always been hesitant to rely too much on human metaphor and on anthropomorphizing the systems because it can lead to error modes, especially in past iterations. But a lot of these metaphors to the way that LLM-based systems are currently constructed are just so apt.

There's a lot of chatter about fine-tuning vs. RAG. I think it misunderstands how the tools fit together. Fine-tuning is teaching an underwriter how to make better decisions with data, whereas RAG is about accessing the data. Just like someone could memorize data that they might look up on Google, or they could Google something that they might have otherwise learned, there's an overlap between what you can accomplish with fine-tuning and what you can accomplish via RAG. They're complementary.

An underwriter or any professional has to retrieve data. It requires skill to understand, "Where do I look in a multi-hundred-page document packet I got from the broker?" That has to be tuned and learned. Then, once they have the information, they have to understand, and be tuned on, how to make the right decision for their domain, their company, their objectives.

This segues into another topic that I'm always a broken record about. The first DARPA funding that Snorkel ever had back in 2012, 2013 was a wonderful project called SIMPLEX. It was all about how data scientists and subject matter experts (SMEs) collaborate.

We view that as critical. How can you teach the model to go from “undergrad trained on the Internet” or “junior underwriter level” to “senior underwriter level” without looping in not just the data, but the people who know all that stuff—say, the senior underwriters?

What has your experience here been? How do you see those subject matter experts and the data science team working together at QBE to tune and advance AI?

AT: It’s a great question. The initial answer is that, until recently, it was a challenge to get the two teams to even talk to each other. They didn’t see a connection between their domains of expertise or a way of making them accessible to each other.

It’s been one of my favorite things, particularly with LLMs, to throw subject matter experts that are non-technical into the deep end.

Alex Taylor, Global Head of Emerging Technology, QBE Ventures

It's been one of my favorite things, particularly with LLMs, to throw subject matter experts that are non-technical into the deep end. We ran an internal hackathon in Hong Kong about 11 months ago—so, right at the beginning of the public visibility of this wave of LLMs. We said to a team of people in the room (marine cargo underwriters and customer service people who had never written a line of code in their lives, nor did they really want to), "Over the next three days, you're going to create a system that helps with the process that you go through every day."

We asked these underwriters, “What is the problem that you have?” They said, “Our team size is limited. We go through this process where it takes five hours to three days to review information, to either decide not to underwrite this customer or to come up with something that approaches being able to create a premium for them. It’s the rating process.”

We moved between "What questions do you ask? How do you know what to ask of this information as a human being?" and "We're going to sit here and watch you do this and record the questions you ask. Then we're going to take the same thing and put it into this fancy new black box that we'll call RAG for the time being." (They didn't need to understand what that was.)

As we put the results up on a big screen at the front of the room, they started to see that the machine responded similarly to the way that they did. They could sit there and direct it and say, "Well, it's got this page of information, but when I asked this question, I would actually look for it in this page over here." And you say, "Why would you do that? How would you ask that question in a different way to be more focused on the information that you would actually take into account?"

It’s this iterative process where a data scientist is essentially laser-focusing on the way that they’re vectorizing content, the embedding model approach that they’re using, and the way they’re injecting context.

Phrasing is something—people still can't believe that they can talk to a machine a little bit like they can with a human. But we're starting to see people realize that, even though they're speaking in words (yes, we call them tokens, but the non-engineers don't need to understand that), in the end, you're providing input into a model. You're letting it pay appropriate attention to the right tokens in the input, and that produces the right output.

People still can't believe that they can talk to a machine a little bit like they can with a human.

Alex Taylor, Global Head of Emerging Technology, QBE Ventures

Obviously, yes, there’s a highly technical reason for why these models perform as well as they do for that kind of use case. But because people happen to speak with words already, it’s immediately accessible to people that don’t have a machine learning or data science background or mindset.
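For readers curious what "speaking in words, but the model sees tokens" looks like mechanically, here is a tiny example using OpenAI's tiktoken library; the sentence is arbitrary.

```python
# What "you're speaking in words, but the model sees tokens" looks like in
# practice, using OpenAI's tiktoken library as one concrete example.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models
tokens = enc.encode("Insure this marine cargo shipment.")
print(tokens)                             # a list of integer token IDs
print([enc.decode([t]) for t in tokens])  # the word pieces those IDs map to
```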

We’re seeing this tremendous people-led collaboration as a result of this. People are clawing over each other to be the next use case in organizations like QBE where we explore this technology, because they can immediately see what the application is and how this can accelerate what they do today.

Until literally the last 12 months, that would never have been the case. Yes, people were aware of data science, they were aware of machine learning. But they couldn’t get over that initial hurdle to understand how something that they could have input into would be something that would drive the capability forward.

Now that’s all changed.

AR: That’s an awesome perspective and I hadn’t heard about that hackathon before. Such an awesome example of a way to approach that. One of the first Snorkel experiences back at the Stanford lab was sitting with (in our case) pediatric genomicists and saying, “Why did the model get this wrong? What are you looking for?” They were already very advanced on the computational side as well. It was a very constrained domain.

This opens up the accessibility of the first mile—the ability, even before you work on solving a problem or getting it to production level, to scope the problem. What are you actually trying to accomplish?

That's one of the trickiest questions, and I think the most underrated in all of machine learning. What do you actually want the model to do? Where does it fit into a human process? What are you trying to have as an output? What are the standards for that output?

Before this natural language interface, that was a very difficult thing to do, and it took a lot of upfront effort. Now, you can just start playing around. Sometimes, after that playing around and giving a little bit of feedback, it's good enough—great, you ship that to production. Often in domains like insurance, you need to go further. You need to fine-tune or distill or tweak.

Still, that first-mile part is about getting the subject matter experts engaged and helping: having them give some input, give feedback, and buy into the problem and their part of the process.

I sound like some kind of hack marriage counselor, but if the subject matter experts who know how to solve the problem and the data science teams are separated, that's never going to work. You have to get them involved at square one.

AT: We’re going to see the next evolution of this quite soon as people start to realize the power of and the necessity for fine-tuning. This interplay between questioning why a model performed in a particular way, and then understanding how we might need to expose it to different training material, different source datasets to make it perform better in a particular domain.

There was a really good example I saw the other day. One of our engineering teams looks at architectural plans a lot. Give even a multimodal vision model access to an architectural plan and then ask it, "How big is this room? Where's this room located? What's the projection like?" Some state-of-the-art models like GPT-4 Vision perform pretty well in that domain, but they could still be better. The work is in understanding why a model doesn't perform well for certain tasks, and what you might do, either with a smaller, more focused model or with the material it might need to see, to make it behave a little more the way you would expect a human to when looking at a similar diagram.
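A sketch of what probing a vision model this way might look like, using the OpenAI vision-capable chat endpoint as it existed in early 2024; the image path and the question are invented for illustration.

```python
# Sketch of probing a vision model with an architectural plan, as in the
# example above. Uses the OpenAI vision-capable chat endpoint (as available
# in early 2024); the image path and question are invented for illustration.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("floor_plan.png", "rb") as f:  # hypothetical plan image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "How big is the room labeled 'Office 2', and where is it "
                     "relative to the main entrance?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)  # compare against an SME's reading of the plan
```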

This next phase, I think, is going to be subject matter experts having input into the training data corpus to fine-tune a model.

This next phase, I think, is going to be subject matter experts having input into the training data corpus to fine-tune a model.

Alex Taylor, Global Head of Emerging Technology, QBE Ventures

AR: I couldn’t agree more. I like that highlight—it’s not just about a fancy fine-tuning algorithm. It’s about the data and it’s about the subject matter experts.

We chase after leaderboards, just like anyone in the AI space; we just had a talk about some cool results out in the public domain. But leaderboards are never going to tell you where the gaps are in your domain and where the focus points are; your subject matter experts are.

Once you get that, you can fine-tune models. You can distill them to be smaller and more efficient. I couldn't agree more that it's about that data development, the fine-tuning. But it's also about how the subject matter experts lean into that process and lead it with the data science teams.

Any closing thoughts?

AT: 2024 is going to be an exciting time. We'll see a move on from this general capability of language models that people have suddenly realized is here (even though it's arguably been around for between 7 and 20 years, depending on how you look at it).

2024 is going to be an exciting time.

Alex Taylor, Global Head of Emerging Technology, QBE Ventures

Combinations of capabilities: multimodality in particular, mixture-of-experts models, agent-based systems of cooperative problem solving. All of these things are going to keep blowing people's minds as to what's possible.

Of course, as you introduce more components into the mix, it creates space for more careful fine-tuning, for more careful combinations of capabilities. There's a lot of blank space in improving the way that these systems perform; that's what I predict will happen over the next 10 years or so.

On the regulatory side—of course, that’s a topic we could talk about all day—the one comment I’ll make here is that we’re seeing a lot of similarity in the statements being made by regulators globally and the questions they’re asking. The thing that gives them most comfort at the moment is oversight.

As long as you can say that there's a human being, appropriately licensed in their legislative environment in that jurisdiction, making sure that the model is performing appropriately—even on aggregate decision making—then you're generally okay.

The thing that I don't think I've seen anybody say at the moment is that we're letting models make decisions entirely on their own behalf or of their own accord, unsupervised. That would be a very dangerous space to be in. Over time, though, we're seeing the emergence of systems that make it easier to oversee model performance. There are a lot of companies starting to be very successful in that space.

Combinations of capabilities: multimodality in particular, mixture-of-experts models, agent-based systems of cooperative problem solving. All of these things are going to keep blowing people's minds as to what's possible.

Alex Taylor, Global Head of Emerging Technology, QBE Ventures

AR: That’s fascinating. I’ll just quickly say, Alex, thank you so much for joining us today. I could have asked you a million more questions.
