At the August 2022 FDCAI virtual conference hosted by Snorkel AI, Meta senior applied research manager Anoop Sinha and Snorkel AI co-founder Braden Hancock joined the founder of TWIML, Sam Charrington, for a panel discussion on the current state and future potential for mastering speech and search AI with data-centric natural language processing. A transcript of the panel session appears below, lightly edited for reading clarity.
Piyush Puri: Now, I’m pleased to introduce our next session: a panel discussion on the topic of mastering speech and search with data-centric ML. Joining us to moderate is founder of TWIML, Sam Charrington.
Sam, would you join me on stage?
Sam Charrington: Before we dive in, I want to remind you that I’ll be keeping an eye on the Q&A panel, and I’ve got some time reserved to pull in audience questions toward the end as well. To get things started, I’d like to give our panelists an opportunity to briefly introduce themselves. Let’s get started with you, Anoop.
Anoop Sinha: Sure. Thanks for having me. My name is Anoop Sinha. I’m a senior applied research scientist manager at Meta. I work in the FAIR research team. And I’ve been at Meta over the last six years working on a variety of different topics, including search and natural language.
SC: Awesome, thanks. Braden?
Braden Hancock: Yeah, and I’m at Snorkel today. I’ve been working on this product for a number of years—at Stanford, originally, when the product began, and then now at the company. Before that, and partly during that, I did brief stints as part of Facebook research and Google research as well. Glad to be here.
SC: Awesome. So let’s dig in. Anoop, you’ve got a really interesting set of experiences relating to search quality and harm avoidance, and how these impact upstream data collection and labeling. I’d love for you to dig in and tell us a little bit about the problem you faced.
AS: Sure. And thank you again.
So over the last 10 years, I’ve worked on a couple of major search engines. First at Apple, where I helped introduce Siri search and then also at Facebook, where I worked on Facebook search for a number of years.
One of the key problems for both search engines is to not show harmful content. Both companies have community standards about content and/or pages that they do not want to show. And with these search engines [you] really need efforts and work in order to avoid showing those. I wanted to focus on that story, around avoiding harmful content, by starting out saying how hard a problem it really is.
The upside, when you do a search engine correctly and you show something good, is there. But the downside of showing harmful content is even more extreme. There are legal issues as well as ethical issues for that other side of it. And the other challenge with that type of content is that it might be engaging to users. So in fact, it actually becomes even harder to identify because it sometimes looks like good content in the stream. The real key for us was to develop technical systems that have good, high-recall ability to identify that content, but we also need to really narrow it down to be able to avoid showing that harmful content.
The real key to this was humans in the loop, and that’s both for setting up the analyses for how to measure harmful content and also the labeling that’s used. And that labeling would also be used to develop the systems that are classifying that content overall going forward in order to make this a scalable solution.
I’d like to say this is very akin to active learning and the need to really zoom in on the issues that are there. At the end of it, our hope and our goals were to develop both good metrics and good classifiers for that type of harm.
SC: Awesome. You mentioned active learning and the need to zoom in, and we often talk about, in the context of building models, this idea of class imbalance and how class imbalance makes the kind of modeling we’re trying to do more difficult. But in your case, it sounds like not only do you have the class imbalance but there’s also this disproportionate cost or risk associated with surfacing harmful content. Can you talk a little bit more about the challenges that presented and how it impacted the approach that your team took?
AS: Yeah, absolutely.
If I think about metrics for search engines, there’s a variety of different approaches to it. But a standard metric, such as success-on-clicks or other things, would actually hide the issues of harmful content quite significantly, if not actually mix them in in an improper way.
The real key for us when we set metrics was to stratify our sense of query sets so that we’re actually looking disproportionately at harmful content, the risk of harmful content, in the core metrics that we set. Stratifying the metrics, to just give you an example, may involve looking at query sets that we feel are at risk of showing harmful content and overemphasizing them in the objective function that we’re looking for. It may even be the case that in those query sets that are of risk, we may even find ourselves filtering the content because there’s that much risk in it that we said we’re going to completely shut it off. We worked together with the policy teams to make those decisions—in both companies that I worked in. It was a very active discussion around the right decisions to make. But it’s certainly the case that we need to emphasize harm in the core metrics themselves.
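Editor's note: the stratification Anoop describes can be illustrated with a small sketch. It assumes a hypothetical setup in which each query set (stratum) has a human-judged success rate and a weight chosen by the team; the names `Stratum` and `stratified_score`, the weights, and the counts are illustrative only and are not taken from either company's systems.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    name: str
    successes: int  # sampled queries whose results were judged acceptable
    total: int      # sampled queries in this stratum
    weight: float   # relative emphasis in the overall metric (assumed value)


def stratified_score(strata):
    """Weighted average of per-stratum success rates."""
    total_weight = sum(s.weight for s in strata)
    return sum(s.weight * s.successes / s.total for s in strata) / total_weight


# Hypothetical numbers: the at-risk stratum is small but weighted 3x.
strata = [
    Stratum("general", successes=950, total=1000, weight=1.0),
    Stratum("at_risk", successes=60, total=100, weight=3.0),
]

# Pooling every query together hides the weak at-risk stratum.
pooled = sum(s.successes for s in strata) / sum(s.total for s in strata)
score = stratified_score(strata)
```

With these made-up numbers the pooled rate stays above 0.9 while the stratified score drops well below it, which is the effect of overemphasizing the at-risk query sets in the objective.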
One thing I wanted to maybe give a thought about was “the hunt.” The hunt for harmful content is actually pretty difficult because it’s fairly low prevalence in these query streams. In order to do that, we may start, just to give an example, with a set of example queries that we’re concerned about and then look at the results that those queries may show. On looking at those results, we may just do a round trip and look at other queries that may lead to those results. So we’re doing this closure set, a bit, around the possible risks in those harm sets. This is giving us a large set to take a look at and examine for harmful content. Hopefully it has the categories of risk that we’ve got. But we need humans in the loop, certainly, in order to take a look at that. And it’s really critical that we use the correct guidelines and the human judgment in order to identify risk or not.
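Editor's note: one way to picture the "round trip" from seed queries to results and back to other queries is as an expansion over a query-to-results mapping. The sketch below assumes a hypothetical in-memory log (`query_log`) and a helper named `expand_closure`; real systems would work from much larger logs, and everything the expansion returns is only a candidate set for human review.

```python
def expand_closure(seed_queries, query_to_results, rounds=1):
    """Expand seeds into a candidate set of queries and results for review.

    Each round collects the results of the current queries, then adds every
    other query in the log that also leads to one of those results.
    """
    # Invert the log so we can walk from a result back to its queries.
    result_to_queries = {}
    for query, results in query_to_results.items():
        for result in results:
            result_to_queries.setdefault(result, set()).add(query)

    queries = set(seed_queries)
    results = set()
    for _ in range(rounds):
        new_results = {r for q in queries for r in query_to_results.get(q, ())}
        new_queries = {q for r in new_results for q in result_to_queries.get(r, ())}
        if new_results <= results and new_queries <= queries:
            break  # nothing new was reached; the set is closed
        results |= new_results
        queries |= new_queries
    return queries, results


# Hypothetical log: query -> results shown for that query.
query_log = {
    "q_seed": ["doc1", "doc2"],
    "q_other": ["doc2", "doc3"],
    "q_far": ["doc3", "doc4"],
    "q_safe": ["doc9"],
}

one_hop = expand_closure(["q_seed"], query_log, rounds=1)
full = expand_closure(["q_seed"], query_log, rounds=10)
```

A single round reaches `q_other` through the shared `doc2`; further rounds keep widening the set until no new queries or results appear, and unrelated queries such as `q_safe` are never pulled in.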
Then, just to make it an even harder problem, it’s a really dynamic threat or a dynamic change, because: let’s say we’ve identified and built classification that helps us eliminate a set of harms that we think we’ve got. The harms may change. The bad actors in the system may be using different content or different queries, as well, in order to just move the problem forward. This is a challenge that we’ve definitely faced. It’s a constant effort.
SC: I want to pull in a question from the Q&A panel: Did you approach these problems exclusively in a supervised manner, or were there unsupervised elements to the way that you attacked them?
AS: Yep. It’s a very good question. In the era in which we were working on it, I’d say we highly emphasized supervised methods, because we needed to both analyze and understand the threats.
As techniques have changed and as things are moving forward, one could argue that this closure classification is kind of an unsupervised method. But we still would review it in a supervised manner in that case. So I’d like to say during the eras that I was involved, we were quite heavily working with humans in the loop.
SC: Got it. And you’ve emphasized humans in the loop several times here. To what degree have you explored machine labeling, or machine classification, as a way to help generate your dataset and reduce the burden of human labeling?
AS: Yeah. With those sets that we’re concerned about, there are some cases where—I think everyone has their lists of terms, toxic terms, that they’re trying to avoid. And so, those terms were some things that we would use, right, to say these possible documents have words that we don’t want to see. Those could be offensive words or toxic words. Those were used even then. I’d say the techniques have gotten a little bit more sophisticated over the last few years about those types of supervision, to relate it to the question that came up as well. I think that there are greater opportunities to use more sophisticated models for that.
During the eras I was involved, let’s say, we were hesitant to do automatic or even weakly supervised methods, because we really needed to take a look at what was happening.
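Editor's note: the term-list approach mentioned above can be sketched as a simple matcher that surfaces documents for human review. The list contents, the tokenization rule, and the function name `flag_document` are placeholders chosen for illustration; production term lists and matching rules are considerably more involved.

```python
import re

# Placeholder entries; a real list would be maintained with policy teams.
BLOCKLIST = {"badword", "slur_example"}


def flag_document(text, blocklist=BLOCKLIST):
    """Return the sorted blocklisted terms found in the text (empty if none)."""
    tokens = set(re.findall(r"[a-z0-9_']+", text.lower()))
    return sorted(tokens & blocklist)


hits = flag_document("This has a BadWord in it.")
clean = flag_document("Nothing to see here.")
```

Returning the matched terms, rather than a bare yes or no, keeps a reviewer able to see why a document was surfaced.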
SC: Yeah. So in our pre-call, Braden asked a really good question, which is around the need for, and the role of, iteration in your world. Can you talk about how the project evolved? Where did you start and how did you know what the right next-step was at any particular point in time?
AS: Yeah. And that’s a great question. In a certain set of cases, we were at the beginning of introducing either the search engine or a new feature. So at that beginning, we made very intensive efforts to make sure we were safe at launch. Then there were various cases along the way where there were some critical issues that we had to address, perhaps in war rooms, or just intensive issues to address. But over time, I think we got better in these techniques and realized that the real key was to centralize.
At the end of the day, the measurement was there, but we’re also building machine learning and we’re building machine learning classifiers as a deliverable. And those classifiers are getting better over time. Then in the case of both companies: how do we centralize those classifiers as assets of the company so that they’re usable across all sorts of applications? That was really the journey that we went on.
I think that centralization was extremely beneficial because it combined insights from various teams who were working on similar problems into what I’d say, right now, are at least attempts at really good classifiers for these types of techniques. There is, of course, an ongoing threat, so they constantly need maintenance and change. I’d like to say the core work that these companies have done in this area, it’s full of dedicated individuals and dedicated teams to try and really reduce harm. And that’s shown in all those different products.
SC: Now, Braden, you’ve come at similar sets of issues from the perspective of dialogue systems and chatbots, in particular. Can you tell us a little bit more about that use case and how it compares and contrasts to what Anoop saw in search?
BH: Yeah. There are some definite similarities from the standpoint of: both of those scenarios put you in what I think of as an open-world setting, where the model at least feels like it’s generating new content. With search or ranking, it’s technically not generating, but when the space of options is so large, it does feel like it’s making a statement when it returns something. When you get to doing generation, I feel like the error modes become a little more severe because it feels more personal or intentional when something goes wrong, so there’s extra sensitivity around what’s going to come out of our AI’s mouth and how do we control this. There are different ways of doing that, both in how you train it (what’s in the data that the model is going to learn from?) and in attempts at post hoc filtering. Are there ways that we can catch bad things before they get out or remove them entirely?
I’d say in general, historically there’s been more of a post hoc approach: let’s put on post-processors to catch these types of things. I think that’s still present. But there’s increasingly an awareness and a push for making the data upfront that things are trained on reflect the types of behavior, or topics, or general attitudes that you want to come from your model.
SC: Alright, let’s maybe back up and have you share a little bit about the scenario that you were working on and the type of chatbot that you were building.
BH: Yeah, so, at Facebook I was working on a project for chatbots to do, basically, “chit chat,” I called it. It was not necessarily goal oriented—like, “help me complete this ATM transaction”—but more, “let’s talk a little more broadly around get-to-know-you topics.” So again, a very open world, which is tricky. In that particular project we were focusing on, in a way, an aspect of control. If we could give some context or a background for the chatbot to condition everything that it said, so that it could be more consistent with a certain personality or background information.
In that project, there’s step one, which is: teach the chatbot to speak coherently at all. Two maybe is to speak on-topic, and three is: say useful and interesting things.
Natural language is hard. To learn how to speak like a human, obviously, it takes years for us to do, either the first time or with a second language. You need a lot of data for a model to learn to do that as well. And the more data that you use for training, the more chances there are for things you don’t want to necessarily teach your chatbot to slip in. In some ways that’s inevitable when you get to a certain scale that you can’t handpick every conversation for it to learn from. So that was something that we spent a fair bit of time thinking about and looking into: how can we get the benefits of training on massive amounts of conversations, perhaps scraped from the internet, but not necessarily teach our chatbot to be an internet troll? Help it have better behavior and vocabulary than that.
SC: Were you fundamentally approaching the problem as one of generating responses or were you selecting among pre-vetted responses, or a combination of the two?
BH: We experimented with the whole spectrum, and this was a few years ago. In a fast moving field like this I can’t speak to even what’s being done today. But at the time, on the spectrum of most conservative to most aggressive, I’d say on the conservative side you can essentially provide templates, right? Let a chatbot slot-fill. And you’ll see various digital assistants, especially on certain common topics like weather and sports scores and things, be very rote in the way that they respond, where there’s fewer ways it can go wrong because you know exactly the shape of what they’re going to say and they’re just filling in the details.
One step beyond that would be, not just slot-filling the template but selecting from pre-approved responses. In this case, we had 100,000 or so conversations that had been supervised, had been more meticulously crafted together. So, the chatbot, when it’s time to say something, can figure out what it would like to say, find the most similar example that’s been pre-approved, and then say that.
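Editor's note: selecting from pre-approved responses can be sketched as nearest-neighbor retrieval. The sketch below uses bag-of-words cosine similarity purely as a stand-in; the transcript does not say which similarity measure or encoder the team used, and the response list and the name `pick_approved` are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Very rough bag-of-words vector: lowercase, split on whitespace."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def pick_approved(draft, approved):
    """Return the pre-approved response most similar to the bot's draft."""
    draft_vec = bow(draft)
    return max(approved, key=lambda r: cosine(draft_vec, bow(r)))


# Hypothetical pre-approved pool and a draft the bot "would like to say".
approved = [
    "I love hiking on weekends.",
    "I have two cats at home.",
    "My favorite food is pizza.",
]
reply = pick_approved("i really love hiking", approved)
```

Because the bot can only ever emit a string from `approved`, no unseen combination of ideas reaches the user, although the flow of the conversation is still unconstrained.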
So, there was an element of: we’re not going to be surprised by some new combination of ideas being presented. You don’t know, still, what the conversation flow is going to be. So there is still risk there. But in that case, we were using more of a, we’ll say, restricted set. But the goal always was: how far to the right can we push this over time? And we began, even at the end of my time there, some experimentation around ways to control freeform generation a little bit more, letting the bot say new ideas that had not been seen before but still with, basically, its internal barometer for how edgy a response may be.
SC: One of the ideas that you explored and published on, to overcome the cost of collecting labeled datasets, was this idea of a self-feeding chatbot: allowing the bot to extract new training examples from the conversations that it was participating in. Can you talk a little bit about that effort and how it turned out for you?
BH: Yeah. That was a really fun project because my research interests have been, for such a long time, on weak supervision. Where is there more signal that we can pull in without a whole lot of additional effort? So we were looking at specific conversations which we can teach the bot to model after, which is maybe the most direct, traditional supervision. But then, we figured, you spend some time training a model, and then you deploy it. And probably, 90-percent-plus of a model’s lifetime is spent in deployment, not in training. There’s all the rest of this time that a model’s being used, where it’s having conversations, and as a human you can tell when a conversation has gone well or gone wrong when there’s maybe a misstep and you weren’t on the same page. Basically what we looked at is, can we teach this chatbot a simpler kind of auxiliary task of: identify when a conversation is going well—when the user is satisfied or delighted with your responses—or when they’re saying, “what are you talking about?” Or, “I don’t understand,” or, signs that it’s not going well.
That became, basically, another set of weakly applied labels, somewhat automatically applied labels—automatically, but based on this simpler task of just: are things going well or not? As opposed to, “what should I say?” which is a little more open-ended. By having this model learn from both of these types of data—both conversations as well as telling how well a conversation is going—it ended up actually refining its ability to speak, or at least to understand what conversations should look like. That ended up resulting in a higher quality chatbot that was rated as more engaging and more on topic and more satisfying to speak with.
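Editor's note: the auxiliary "is this going well?" signal can be sketched as a pass over deployed conversations that attaches a satisfied or dissatisfied label to each bot turn based on the user's next reply. In the project Braden describes the signal was learned by the model; the keyword cues below are a simplified stand-in, and the cue list and function names are assumptions made for illustration.

```python
# Stand-in cues; the real system learned this signal rather than matching phrases.
DISSATISFIED_CUES = (
    "what are you talking about",
    "i don't understand",
    "that makes no sense",
)


def looks_satisfied(user_reply):
    """Crude proxy: the reply contains none of the dissatisfaction cues."""
    reply = user_reply.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return not any(cue in reply for cue in DISSATISFIED_CUES)


def harvest_examples(turns):
    """Yield (context, bot_text, satisfied) for bot turns followed by a user reply.

    `turns` is a list of (speaker, text) pairs in conversation order.
    """
    examples = []
    for i in range(1, len(turns) - 1):
        speaker, text = turns[i]
        next_speaker, next_text = turns[i + 1]
        if speaker == "bot" and next_speaker == "user":
            context = turns[i - 1][1]
            examples.append((context, text, looks_satisfied(next_text)))
    return examples


conversation = [
    ("user", "Hi, how are you?"),
    ("bot", "I am great, do you like music?"),
    ("user", "Yes, I love jazz!"),
    ("bot", "The capital of France is Paris."),
    ("user", "What are you talking about?"),
]
examples = harvest_examples(conversation)
```

Each harvested tuple is a weak label obtained without extra annotation effort: the first bot turn is marked as having gone well, the second as a misstep.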
SC: Now, this was a little bit before the era of large language models, but I’ve got to imagine that the whole time your team is looking enviously at the corpus that is the internet, and trying to think about how you might advance your own systems using all of that information.
To what extent did you…what avenues did you pursue there? Or, how did you think about that opportunity?
BH: Yeah. It’s an old cliche at this point, but “data is the new oil,” right? It’s what runs these systems we’re working on. It was absolutely on our mind. In general, you see that curve. You see that you add more data and your model gets better, and so you’re always looking, where can we get more data? And appropriately so. We weren’t using internal Facebook data. We really were looking more, from the internet, where can we find conversations happening? There are different places, different forums, where chats happen and you can tell that it’s a continuous conversation. We were planning for chatbots that speak in a one-on-one setting, not a big group setting. So, you’d look for where similar user IDs had gone back and forth a bit, presumably following through a thread of conversation. That’s the type of skill set you want to teach your model. We definitely did look and see what we could scrape where there was open-source or readily available conversation-like data on the internet.
And it did, in fact, work as expected. Pre-training on that, you could improve the quality of the model, or at least its ability to understand language. But then, what we found worked best was to, as is often the case, pre-train on broadly applicable data to learn skills and features in a general sense, and then fine-tune on your most relevant data at the end, teaching it, more specifically, something that you’ve got finer control over, that last dataset that’s maybe more hand-curated.
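Editor's note: the thread filtering described above, looking for places where user IDs go back and forth in a one-on-one exchange, can be sketched as a check on the sequence of speaker IDs. This reads the description as "exactly two participants who strictly alternate"; that reading, the `min_turns` threshold, and the function name are assumptions, not details from the project.

```python
def is_two_party_exchange(thread, min_turns=4):
    """True if exactly two user IDs strictly alternate for at least min_turns posts.

    `thread` is a list of (user_id, text) pairs in posting order.
    """
    if len(thread) < min_turns:
        return False
    ids = [user_id for user_id, _ in thread]
    if len(set(ids)) != 2:
        return False
    # Strict alternation: no user posts twice in a row.
    return all(a != b for a, b in zip(ids, ids[1:]))


dialogue = [("a", "hi"), ("b", "hey"), ("a", "how are you?"), ("b", "good, you?")]
group_chat = [("a", "hi"), ("b", "hey"), ("c", "hello all"), ("a", "welcome")]
monologue = [("a", "hi"), ("a", "anyone?"), ("b", "yes"), ("a", "great")]
```

Threads that pass this check resemble the one-on-one setting the chatbot was meant for; group threads and repeated posts by one user are dropped.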
SC: I’d like to open up William’s question to both of our panelists. How do you think about balancing this need for scale, both in the systems you’re building and in the datasets that those systems require and consume, with the requirements for control in the output of those systems?
Why don’t we have you start, Anoop?
AS: Sure. I think it’s a great question with also a really key part of the answer—in my view—in the question, which is: balance. These systems have, potentially, behaviors that really do need to be studied in detail. If we go back to the case of harm classifiers, we need to understand how accurate they really are. Though I did mention the emphasis on making sure there was good coverage and recall in our datasets, the classifiers themselves have characteristics that we really need to understand in detail. The precision of those is critical.
I’d like to say, in terms of balance, it’s a lot of analysis and human judgment, in my experience, that’s involved, in partnership with policy teams and other teams, to assess, in the case of harm, risk ratios versus the actual harms that are expected, as well as the behaviors of the product. So I think the balance is based on the requirements of the product in full, and I’m sure that will vary based on different scenarios.
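Editor's note: the two classifier characteristics mentioned in this answer, recall (coverage of the harmful content that exists) and precision (how often a flagged item is truly harmful), can be computed from human-reviewed labels as sketched below. The function name and the sample labels are illustrative.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for parallel lists of booleans (True = harmful)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # flagged and truly harmful
    fp = sum(1 for p, a in pairs if p and not a)    # flagged but actually fine
    fn = sum(1 for p, a in pairs if not p and a)    # harmful but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical classifier output versus human judgments.
predicted = [True, True, False, False, True]
actual = [True, False, True, False, True]
precision, recall = precision_recall(predicted, actual)
```

Low precision means good content gets suppressed, while low recall means harmful content slips through, which is why both need to be understood before deciding how a classifier is used in a product.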
BH: Yeah, completely agree with Anoop’s answer. The only thing I’ll add is—and he mentioned this at the beginning—but iteration is key. There is just no way to anticipate all of the scenarios that you can find yourself in or all the possible ways things can go wrong. So the answer, in my opinion, is to have levers to pull, control over your data, and a cadence where you’re able to periodically review and assess how things are going. Then, in a scalable way, update the behavior of your model.
Obviously, at Snorkel, one of the things we’ve invested in is programmatic labeling, with the idea being that it’s human-driven but then when it’s time to change things, whether that’s to add a new signal or tweak an existing one, you can modify a line of code, push re-execute, and regenerate your training set, so that you can pretty quickly—and oftentimes significantly—modify the behavior of your model by modifying the training set in bulk. Iteration basically ends up being the key then, so you can be as aggressive as you’d like, or can be, on how much data you take advantage of, but then have the ability to address error modes when they arise.
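Editor's note: the iterate-and-regenerate loop Braden describes can be sketched in plain Python. This is not Snorkel's API; the labeling functions, the label constants, and the simple vote used to combine them are assumptions chosen to keep the example self-contained (Snorkel's own tooling combines signals with a learned label model rather than a plain vote).

```python
ABSTAIN, SAFE, HARMFUL = -1, 0, 1


def lf_blocked_term(text):
    return HARMFUL if "scam" in text.lower() else ABSTAIN


def lf_greeting(text):
    return SAFE if text.lower().startswith(("hi", "hello")) else ABSTAIN


def label_dataset(texts, labeling_functions):
    """Apply every labeling function and combine the non-abstaining votes."""
    labels = []
    for text in texts:
        votes = [v for v in (lf(text) for lf in labeling_functions) if v != ABSTAIN]
        if not votes:
            labels.append(ABSTAIN)
        else:
            # Majority vote; ties go to HARMFUL as the cautious choice.
            harmful, safe = votes.count(HARMFUL), votes.count(SAFE)
            labels.append(HARMFUL if harmful >= safe else SAFE)
    return labels


texts = ["Hello there", "This is a scam", "Free crypto giveaway"]
before = label_dataset(texts, [lf_blocked_term, lf_greeting])


# A new error mode is noticed in review: add one signal and regenerate.
def lf_giveaway(text):
    return HARMFUL if "giveaway" in text.lower() else ABSTAIN


after = label_dataset(texts, [lf_blocked_term, lf_greeting, lf_giveaway])
```

Adding a single function and re-running the labeling step changes the third example from unlabeled to harmful, which is the kind of bulk change to a training set that the answer above refers to.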
SC: Great. We also wanted to look forward a bit in this conversation.
Braden, how do you think the growing popularity of large language models will play [out] as organizations try to take advantage of them to build these kinds of systems but still ensure safety?
BH: Yeah, large language models are such a promising and exciting thread. As people have found, they make this fantastic base to build on. It can basically take care of the large part of basic natural language understanding or generally relevant features. Then your task as the domain expert now becomes basically just the transfer—helping to identify the relevant pieces of what it knows to your specific problem. We’re matching your particular goals for that model.
Super exciting. I think in many ways it’ll change where the focus needs to be for subject-matter-expert time on a problem—where a lot more of it can be now on that final-mile transfer of: how do I map what you know to my problem, as opposed to teach you skills from scratch.
I think in many cases it’ll make sense to begin your model on that base of a foundation model. Then you’ll just need to recognize that that comes with certain biases. I know there are attempts with these different large models to help reflect particularly fair or reasonable training data to begin with. I think the BigScience workshop, which we’ve participated in and been part of publishing some work from over the past year, has led to some, we’ll say, very intentionally built foundation models, where there was a lot of attention paid to what data is going to go into this, and what kind of skills do we want these models to have, including performing well on multiple languages, or different things like that.
There are no easy answers here. There are a lot of open research questions. It’s obviously hugely on the mind of the research community right now. But there are a lot of exciting potential directions that it can go.
SC: Again for you, Braden, pulling in a question from the queue, do you think that large language models might be part of a solution to handling the long tail of customer responses with chatbots that are traditionally really hard to replicate?
BH: Yeah, that’s an area where you definitely see the benefit of going into deep-learning land. Because once upon a time, chatbots, being much more templatized, depended more on discrete signals, specific words, specific paths. As soon as you get off track, you quickly don’t know what to do. Whereas, with these large language models that have seen so much data, they often have what I think of as a smoother understanding of the space. And so, a word they’ve never seen before, or have seen very rarely, there’s still often enough of a similar embedding, or context, or whatnot that they can do pretty reasonable things.
So it definitely helps a lot. That’s the benefit of moving to machine learning from a more rule-based world. You get the benefit of a lot of generalization and smoother decision space as a result.
SC: Anoop, what are you seeing with regard to the role of foundational models and LLMs in the context of the types of systems you’re working with?
AS: I think it’s a strong role in the future with a lot of open questions at the moment. I’d like to say I think the industry is rightfully moving cautiously in terms of putting large language models into production. There’s a whole set of issues that need to be studied. My guess is we end up with large language models, fine tuning, and a set of additional models as an aggregate that actually lead to safe systems that have higher capability.
SC: I’d like to thank our panelists for a wonderful conversation and thank our audience for contributing some great questions. At this time we will turn it back over to you, Piyush.
PP: Awesome. Great panel and great discussion. Thanks so much, Sam, Anoop, Braden.