In this session, Ravi Teja Multapudi, a research scientist at Snorkel AI, speaks with Madhu Lokanath, product owner at Ford, about the MetaPix Data-Centric AI platform. Below follows a transcript of their conversation, edited for readability.
Madhu Lokanath: I’m excited to be sharing this platform alongside the industry leaders talking about data-centric AI. We are also doing our part in the space. We have named it MetaPix, and I’m really excited to share the learnings and the challenges of building a platform like this.
Ravi Teja Multapudi: That’s great. So, maybe just to get the conversation started, before we even talk about MetaPix, what are the challenges at Ford that you’re seeing today, and what motivated you to move in this direction? What are the broad challenges that your team is trying to tackle?
ML: So, we are talking about data-centric AI, where we’re focusing on building quality data sets that can ultimately be used to build better models. In that context, how can you reduce cost? It costs a lot to get a training data set. How much time does it take? How good is the quality of the training data? These are the top challenges.
RTM: Do you have any thoughts on how this differs from a computer vision perspective as opposed to natural language?
ML: Prior to this, I helped build an ML Ops platform at Ford. If you’re looking from 10,000 feet, most of these platforms look the same. You’ll see a pattern there. I would not say NLP is easy, but relative to computer vision, it’s a level less complex.
RTM: Would you say the complexity comes from the fact that the volume of data is just higher than text, and that in general text is more understandable by humans, so annotation is easier?
ML: It’s a huge topic. Let’s say you’re working with text. There’s a clear definition: there’s this one word that you need to extract, even if the context may not be very clear. But in a picture, what are the pixels that you are interested in? Where do you draw the box? What is the right box? We’re talking about perspective in annotation. I’m teaching this to AI. It is a little complicated.
RTM: Perhaps you can give us an overview of what MetaPix does so we can have more context.
ML: What we saw for computer vision was a lot of data processing done in silos for individual use cases. Because there was a gap, teams couldn’t come directly to the platform to train a model; they had to prepare their data first. Annotation was a major piece of that, but there are many other categories in this data pre-processing, so we had to take care of those, too. Now we are creating a data-centric AI platform. When you are creating a platform, it is not about one use case. You are creating a platform that many use cases can come and use. How do you create a platform to address those different use cases, right? So that’s the inspiration. That’s where the AI for data comes in, and we call it extractors.
If I go one level deeper within extractors, there’s data annotation, augmentation, enrichment, and anonymization. The list goes on. So now we are talking horizontally, but you can also go vertically and get very deep with each topic. Take annotation, for example. There are many sorts of annotation services: it could be a crowdsourced or dedicated workforce, programmatic labeling, or auto labeling. And then our focus is on golden data. How good is this data? Some features around golden data are searchability, data versioning, lineage, governance, and access management, for example.
The last product under the MetaPix platform is AI pipelines. The request now is: how do we automate most of this? How do we build a pipeline? And now we have AI pipelines stitching all of this together, end to end. And when we build an AI, it’s not just going from left to right; it’s a cycle. You have to iterate through this loop many times to get the model, and also use the model to get the data right. It’s an iteration. That’s MetaPix.
RTM: So you mentioned you had a range of use cases. Can you give us an idea of the spectrum of use cases?
ML: It’s huge; it spans multiple dimensions. You’ll find many use cases, from manufacturing defects to general computer vision, across many other departments.
RTM: Since you are building this platform, trying to get all these people on board, there are bound to be challenges on things like “how deep do we go into the workflow?” or “do we go broad?” How do you decide where the generalization is in these techniques and where you go deep?
ML: That’s a very interesting thought. For example, let’s say you have a use case where a team is training for face recognition. We should help with algorithms to prepare the data, but not do the face recognition for the team. You should look for patterns to make templates that can be used for more than one use case.
RTM: Let’s talk about iteration. Gone are the days when you could say, okay, here is my training data, I built a model, and I’m done. You have to go back, take the new data, incorporate it, and so forth. You have to go from raw data to actionable data, then build a model and iterate on the process. Are teams being bottlenecked by this?
ML: Oh, this is happening a lot in the industry right now. Some people think they trained their model, put it in production, and they’re done. No, they’re not. A model in production needs to be updated, because data patterns change. It has to adapt. There has to be an active learning loop. That is complex. A second thing I see happening is that people face a lot of restrictions, say with GDPR, and someone could claim your data is biased. You need to be able to show the lineage all the way back to the raw data. We have to be able to explain not just what this black box of a model is doing, but how it was created and with what kind of data. Was the data biased? Can we go back and fix it? There are a lot of other problems as well, but these are two big ones.
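The active learning loop Madhu describes, in which a deployed model routes its low-confidence predictions back to annotators for relabeling and retraining, could be sketched roughly as below. All function names, scores, and thresholds here are hypothetical illustrations, not MetaPix APIs.

```python
# A minimal sketch of an active-learning selection step.
# Names and thresholds are illustrative assumptions, not real MetaPix code.

def uncertainty(prob: float) -> float:
    # Distance from a confident prediction; 0.5 is maximally uncertain.
    return 1.0 - abs(prob - 0.5) * 2

def select_for_labeling(predictions, threshold=0.6):
    """Route low-confidence predictions back to human annotators."""
    return [p for p in predictions if uncertainty(p["score"]) > threshold]

# Each item: the model's probability that the image contains a defect.
preds = [
    {"image": "img_001.png", "score": 0.97},  # confident -> keep as-is
    {"image": "img_002.png", "score": 0.52},  # uncertain -> relabel
    {"image": "img_003.png", "score": 0.48},  # uncertain -> relabel
]

to_label = select_for_labeling(preds)
print([p["image"] for p in to_label])  # ['img_002.png', 'img_003.png']
```

In a full loop, the relabeled samples would be folded back into the training set and the model retrained, closing the cycle described above.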
RTM: People don’t realize they’re signing up for a lot of pain because you have to keep maintaining these models.
ML: It’s like a subscription that you can’t stop.
RTM: We have an audience question: Given that you have so many use cases, how do you quantify your prioritization of these use cases, and do you have to assign dollar values to these use cases?
ML: This is also a problem. There are so many use cases. You just have to start somewhere and find a local optimum. Do an inception design workshop, where you bring in people from across the organization with different use cases. I would say the inception design workshop is very important. They explain their problems and you look for overlaps. Then you can see a pattern. We drew a value-complexity chart that plots business value against complexity and cost. You have to find the spot on that chart where you’re bringing the highest business value at the lowest complexity. That’s the place to start.
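The value-complexity prioritization could be sketched as a simple ranking; the use-case names and scores below are hypothetical, since the transcript gives no real numbers.

```python
# A minimal sketch of value-complexity prioritization.
# Use cases and 1-10 scores are invented for illustration only.

use_cases = [
    {"name": "manufacturing-defect detection", "value": 9, "complexity": 4},
    {"name": "face-recognition data prep",     "value": 6, "complexity": 7},
    {"name": "image anonymization",            "value": 7, "complexity": 3},
]

def priority(uc):
    # Highest business value at the lowest complexity ranks first.
    return uc["value"] / uc["complexity"]

ranked = sorted(use_cases, key=priority, reverse=True)
for uc in ranked:
    print(f'{uc["name"]}: value/complexity = {priority(uc):.2f}')
```

A real workshop would weigh cost and overlap across teams as well, but the ranking idea is the same: start where the ratio is highest.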
RTM: So what does the design workshop look like? People will come with different requirements. Do your customers have a strong understanding of what they want? Or is there a lot of back and forth?
ML: I’m not an expert in design workshops. We have dedicated people who are trained in this area, and we brought them onto our platform. Their work is to extract this information, identify the problem, and convert that into these different design sessions. The problems are converted into epics, sub-epics, features, and user stories. It’s amazing to watch those design sessions. To get there, they ask the customers all of these questions. So we put the customers in front and ask. Sometimes they don’t know what they want, but they understand the problem. We do the design sessions and then slowly help them get to their solution. But we don’t come up with solutions in these sessions; we just understand the problem. It’s very wrong to start solutioning in those sessions. Just understand the pain points.
RTM: I wanted to touch on the extractors, which basically go from raw data to training data. There are a lot of approaches to this, and maybe you use tools like Snorkel to get from raw data to training data, fitting into the broader platform.
ML: Let’s say both are data-centric AI platforms. It’s a huge space, and everyone can coexist. And there are still so many opportunities, so many problems that nobody has solved. There is definitely a bias toward the use cases that we each try to solve. You start with a use case, create a product, then see if it works, iterate, and it’s a loop. You’re creating products, focusing on a use case, coming up with a better fit. But you also carry this knowledge forward. It’s like transfer learning and better iteration, right? So the loop goes on. It’s not a circle, but a spiral. You keep rising by solving different use cases. It’s about accumulated experience and collaboration, about sharing and transfer learning, so that when someone starts working on a data-centric project in 2023 or 2024, they don’t have to go through these same problems and challenges.
RTM: One thing we see a lot in computer vision is that the techniques you require (like programmatic labeling) revolve around large models these days. From your customers, are you seeing that be the case? How integrated are models in the extraction process? Is it happening a lot? Is it model-driven or human-driven?
ML: There are different layers in extraction. At times we are extracting some static data, which probably makes sense for some data. There is also rule-based extraction. So what is rule-based? Let’s say you have a timestamp, and if a picture is taken between nine in the morning and six in the evening, we call it daytime. That’s rule-based. Then, how do you teach an AI to learn this and extract what we want? To get there, you have to solve the above problem, because to even identify the data sets, you first have to use rule-based extraction, and this is just one layer of complexity. Then you dive into all the use cases you can apply this to. You see a plethora of algorithms out there just for extraction. They might work and might not. You pick up anything that ends up being a service, convert it to an API, and put it on the platform. All of this is encapsulated for the use case teams. It’s like building Lego blocks: we show this to our use case teams, who can then take these composable blocks and be creative with them.
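The daytime rule Madhu mentions is concrete enough to sketch as a tiny rule-based extractor; the function name and cutoff times are just an illustration of his example, not actual platform code.

```python
from datetime import datetime, time

# Hypothetical rule-based extractor: label a photo "daytime" if its
# timestamp falls between 9:00 and 18:00, per the example above.

def time_of_day(timestamp: datetime) -> str:
    if time(9, 0) <= timestamp.time() <= time(18, 0):
        return "daytime"
    return "nighttime"

print(time_of_day(datetime(2023, 5, 1, 14, 30)))  # daytime
print(time_of_day(datetime(2023, 5, 1, 22, 10)))  # nighttime
```

Wrapped behind an API, a rule like this becomes one composable block that use case teams can combine with learned extractors.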
RTM: You’re saying there are multiple levels and sources of supervision, and we need methods to put them together and get some signal out of it. Another audience question: what is the target level for the platform? Is MetaPix for AI/ML engineers or subject matter experts?
ML: We create multiple personas and use those when we’re building products. We have to know how to wear our customers’ shoes. But who do you think has the highest value? It’s all based on what the organization needs.
RTM: I would assume it would be AI/ML engineers?
ML: I would start from there, because from them you will know the problem. All the teams existed before, but in silos. So now we are calling each other, helping each other, and trying to understand the problem and encapsulate it into services. The last thing I would say is there are two scenarios: one, you have a fridge at home, you open it, see your ingredients, and then decide what to cook. Or, you can order your ingredients, they’re available in a couple of minutes, and then you can cook what you want to cook. I think that’s where we are going with data.
RTM: That’s a nice way of putting things. Thank you so much for this conversation.