Abhishek Ratna, who works in AI/ML marketing, and TensorFlow developer engineer Robert Crowe, both from Google, spoke as part of a panel entitled “Practical Paths to Data-Centricity in Applied AI” at Snorkel AI’s Future of Data-Centric AI virtual conference in August 2022. The following is a transcript of their presentation, edited lightly for readability.
Abhishek Ratna: Hi everyone, great to be here, thanks for the wonderful introductions, Piyush. Once again, I’m Abhishek Ratna. I lead product marketing efforts for TensorFlow and multiple open-source and machine learning initiatives at Google. And it’s my absolute honor and pleasure to speak alongside Robert Crowe. Robert, I will leave the introduction part to you.
Robert Crowe: I’m Robert Crowe, I’m a developer engineer in the TensorFlow team. And my real focus is on production ML, ML ops, and TFX, so this topic is near and dear to my heart.
AR: Today we will cover some of the most common questions we get from our practitioner audiences as they think about moving to data-centric practices and applied AI. We thought we’d structure this more as a conversation where we walk you through some of our thinking around some of the most common themes in data centricity in applied AI.
The first of these questions that we often see coming from our community is: in an age of big data, is the sheer volume of available data the primary determinant of machine learning success? Is more data always better? Maybe I’ll start us off here, Robert. I’d say it’s widely understood in the industry that the quality of our data determines the quality of our ML outcomes. And Google famously demonstrated through multiple experiments that a data-centric approach often outperforms a model-centric approach when it comes to the overall quality of outputs. That said, I don’t think you’d go very far if you simply focused on the quantity of data. Organizations struggle in multiple aspects, especially in modern-day data engineering practices and getting ready for successful AI outcomes. There are three major challenges that I see organizations struggle with.
One of them is that it is really hard to maintain high data quality with rigorous validation. The second is that it can be really hard to classify and catalog data assets for discovery. Generally, data is produced by one team, and then for that to be discoverable and useful for another team, it can be a daunting task for most organizations. Even larger, more established organizations struggle with data discovery and usage. The third problem that we see in bringing data to bear fruit for AI outcomes is organizations increasingly are moving to multi-cloud deployments. At the same time, we are moving into an environment that is very privacy-preserving and privacy-focused. So it becomes very hard to anonymize sensitive data to do AI responsibly in a privacy-preserving way and to monitor data usage for compliance. So quality, validation, discovery, and compliance all need to be solved meaningfully to build a good foundation for successful AI. Robert, what do you think?
RC: Yeah, I agree with everything you said. Just looking at the question, and thinking about the idea that the sheer volume is a determinant of success, there are a lot of aspects to that. First of all, volume itself is not a good measure because you can have tons of data that is essentially duplicates or in a very narrow region of the feature space where your prediction requests cover a much broader region.
And in that case, the volume doesn’t do you any good at all. It’s certainly not a primary determinant. Also, machine learning success–that concept is more than just the metrics that your model returns. It’s, “Can you use your model effectively in a product or service or business process within the latency requirements you have, the cost requirements that you have, and so forth?” So there are a lot of factors. Sheer volume—I think where this came about is when we had the rise of deep learning, there was a much larger volume of data used, and of course, we had big data that was driving a lot of that because we found ourselves with these mountains of data. But it’s really much more subtle. And the important thing here is really the predictive signal in the data.
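Robert's point about duplicates and narrow coverage can be made concrete with a small check. The following is a minimal sketch, not anything the speakers showed: the function name `volume_report`, the feature names, and the toy rows are all hypothetical. It contrasts the raw row count with the number of distinct rows and with the share of serving-time feature values that the training data has actually seen.

```python
def volume_report(train_rows, serving_rows, feature):
    """Contrast raw row count with distinct rows and with coverage of serving traffic."""
    # Rows are dicts; sorting the items gives a hashable, order-independent key.
    distinct_rows = {tuple(sorted(row.items())) for row in train_rows}
    train_values = {row[feature] for row in train_rows}
    serving_values = {row[feature] for row in serving_rows}
    # Fraction of distinct serving values that also appear in training data.
    coverage = len(serving_values & train_values) / len(serving_values)
    return {
        "rows": len(train_rows),
        "distinct_rows": len(distinct_rows),
        "serving_coverage": coverage,
    }


# Hypothetical data: 1,000 training rows, but only two distinct examples,
# all from one country, while serving traffic comes from four countries.
train = [{"country": "US", "plan": "free"}] * 600 + [{"country": "US", "plan": "pro"}] * 400
serving = [{"country": c, "plan": "free"} for c in ["US", "DE", "IN", "BR"]]

report = volume_report(train, serving, "country")
print(report)
```

In this toy case the dataset looks large by row count, yet it contains two distinct examples and covers a quarter of the countries seen at serving time, which is the situation described above.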
AR: Yeah, absolutely agree, Robert. The signal in the data is what will determine success, more than just the quantity. And that leads us to our second question, which is “Models learn to ignore or devalue features that don’t improve the model’s ability to learn. So does that mean feature selection is no longer necessary? If not, when should we consider using feature selection?” Robert, maybe you can lead that one.
RC: I have had ML engineers tell me, “You don’t need to do feature selection anymore. You can just throw everything at the model and it will figure out what to keep and what to throw away.” This is true, except it’s also incredibly wasteful and expensive. The model will zero out the features and signals that it doesn’t need, but you’re gonna spend a lot of time and a lot of effort collecting that data in general. The other thing is it’s gonna make the model much more computationally complex than it needs to be. So yeah, I’m a big fan of feature selection to try to craft your data. It leads you to understand your data better too, which gives you a better basis for your feature engineering.
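As an illustration of what a first pass at feature selection might look like, here is a minimal sketch of a filter method: score each numeric feature by the absolute Pearson correlation with the label and keep those above a cutoff. This is one simple technique among many and is not something the speakers described; the feature names, the toy data, and the 0.3 threshold are assumptions chosen for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, label, threshold=0.3):
    """Keep features whose absolute correlation with the label meets the threshold."""
    scores = {name: abs(pearson(values, label)) for name, values in columns.items()}
    kept = sorted(name for name, score in scores.items() if score >= threshold)
    return kept, scores


# Hypothetical toy dataset: one informative feature, one constant, one noise bit.
label = [0, 0, 0, 1, 1, 1, 0, 1]
columns = {
    "usage_hours": [1, 2, 1, 8, 9, 7, 2, 8],
    "constant_flag": [1, 1, 1, 1, 1, 1, 1, 1],
    "noise_bit": [0, 1, 0, 1, 0, 1, 1, 0],
}

kept, scores = select_features(columns, label)
print(kept)
```

A correlation filter only catches linear, single-feature relationships, so in practice it would be a starting point for understanding the data rather than a final answer.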
AR: I would echo what you were saying, Robert. The complexity grows exponentially as we start adding more and more features. It’s expensive, it’s time-consuming. But one more thing to think about here is just how complex it is to even access that data. If you are at a large or even mid-sized organization, chances are you’re spending a lot of time stuck behind fairly time-consuming processes to get access to data. More features mean more data consumed upstream. And data quality and breakage and lineage issues—all these things happen at the backend, which means that you could be working with stale data or broken data or data that are out of compliance or data that is biased, and you would never know. So I feel it’s very important for organizations to trim the number of features they need to the absolutely necessary ones, and then ensure that they have the governance and the quality checks in place to make sure that the data they use for building features are fresh. So I think feature selection is an enabler of quality outcomes in machine learning.
RC: Yeah, that’s a good point, especially in areas like PII, healthcare, and so forth, where gathering those features is not just expensive but sensitive.
AR: Absolutely. That actually brings us to a good point. A large percentage of ML projects are based on supervised learning, which is very dependent on good feature selection. But then, Robert, what do you think are some of the challenges applied folks in the supervised learning space face when trying to productionize these use cases?
RC: Like a lot of machine learning, the big problem is the data: getting data that has the predictive signal that you need. And in supervised learning, it has to be labeled data. You’re gonna need to be able to label a decent-sized dataset. And it’s very domain dependent. So for some things where you’re getting data from a clickstream that’s pretty much automatically labeled, that’s great. That tends to be low-hanging fruit and you can use that sort of thing. In other areas, like looking at an X-ray and deciding whether it shows a broken bone or not, you’re gonna need a medical professional to label each of those images, which is expensive, slow, and time-consuming. There are also all the problems that you have with data and concept drift with supervised learning. There are some things that you can apply to try to improve the situation, one of which is weak supervision, which Snorkel is famous for. So I’m sure at the conference you’ve heard a lot about weak supervision and other techniques like synthetic data and so forth. But the big challenge, especially for supervised learning, is getting data with the right predictive signal and getting those labels.
AR: Yeah. I think you covered the most important talking points there, Robert, but if I was to summarize this, I see three challenges. The first is not just the effort that’s involved in labeling data and creating these expensive datasets. Outside of that, there’s a lot of work that needs to be done in terms of processes and organizational readiness: making sure that internal processes support access and sharing of high-quality datasets that can be used for machine learning. A lot of organizations are not there yet.
As you pointed out with the healthcare example, it can take months to get the right dataset for machine learning, by which time the industry has changed. And so the outcomes are meaningless at that point.
On building and sourcing data, you talked about bias as well, and I think that’s a big challenge too. A lot of times the datasets may have a lot of inherent bias in the way they were created, and biased labeling leads to biased results.
The third challenge really is around privacy. Running privacy-preserving AI practices can be hard, especially with the rise of regulations like GDPR, HIPAA, or CCPA. You may very well be faced with legal barriers to accessing some data points which are deemed sensitive or frankly illegal. All of those things add up and make supervised learning more challenging. I like what Robert mentioned too, and it’s a good segue to the next question.
What we are seeing is that access to quality datasets is always challenging, but are there best practices to achieve meaningful results with limited labeled data or low access to quality data? And I can get us started here. I think Robert talked about two principles. One of them is synthetic data, and we definitely see a lot of growth in that area, especially in computer vision and industrial robotics. In those areas, deep generative models and neural networks can study the distribution of a set of quality samples, which could be small in size, and then use that distribution to create larger datasets with artificially introduced samples. In some cases, synthetic data could actually be better than real data because it can eliminate a lot of glaring outliers that can skew the outcomes and potentially bias results. The other area that Robert mentioned was weak supervision. So I’d love to learn a little more about that, Robert.
RC: Weak supervision is the idea of using the domain expertise that you have to create labeling functions that apply probabilistic labels to your data. So by doing that, you can create a signal and usually pre-train your model and bring it up to the point where you can apply a small amount of data to do the final fine-tuning for your model. It is very similar in some ways to how you use a pre-trained model and then apply fine-tuning for a particular domain.
This is another good technique when you have a limited amount of data. If you can find a pre-trained model that is close enough to what you’re trying to do, especially in things like vision or question and answering or language models, those tend to be pretty good for applying limited amounts of data. So people are learning to adapt to where the opportunities are and what the realities of their data are to try to deal with issues like this.
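The labeling-function idea Robert describes can be sketched in a few lines. This is a deliberately simplified illustration and not Snorkel's API or method: the three keyword rules, the label names, and the `weak_label` helper are all hypothetical, and the votes are combined by a plain unweighted fraction, whereas real weak supervision systems typically estimate how reliable each function is and weight it accordingly.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion


# Hypothetical labeling functions encoding simple domain heuristics.
def lf_contains_refund(text):
    return "complaint" if "refund" in text.lower() else ABSTAIN


def lf_contains_thanks(text):
    return "praise" if "thank" in text.lower() else ABSTAIN


def lf_exclamation_heavy(text):
    return "complaint" if text.count("!") >= 2 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_refund, lf_contains_thanks, lf_exclamation_heavy]


def weak_label(text):
    """Return {label: share of non-abstaining votes}, or {} if every function abstains."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [vote for vote in votes if vote is not ABSTAIN]
    if not votes:
        return {}
    counts = Counter(votes)
    return {label: count / len(votes) for label, count in counts.items()}


agree = weak_label("I want a refund now!!")
conflict = weak_label("Thanks, but I still want a refund")
silent = weak_label("Where is my order?")
print(agree, conflict, silent)
```

The soft labels produced this way could be used for the pre-training step mentioned above, with a small hand-labeled set reserved for final fine-tuning and evaluation.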
AR: You brought up a really good point around large language models, and I believe there’s also a question here on the same topic. So this is interesting because there’ve been so many inspiring advances with large language models, or LLMs, such as LaMDA or PaLM. What are some of the challenges of applying large language models in production use cases? If you want to get us started, Robert, that’d be great.
RC: The big problem is just the size. These are large models that are expensive to run, and so that really limits the use cases that you can use them for. They need to be high-value use cases. There’s a lot of effort being made to take the large model versions of things and trim them down to much smaller versions that can be run in a broader selection of use cases like edge deployments, for example, or mobile deployments. But even in server-based deployments, people are just trying to maximize the usage of the model by decreasing the cost.
There are a number of other issues in terms of the models themselves: things like accuracy for retrieval-based models, bias, and whether the model is returning a result that is novel or a sort of canned response. All those things come into play, but usually at the level of the larger model development itself. So we’re getting into a situation where we’re having these developments of large models that then need to be adapted and fine-tuned for particular domains and for much smaller amounts of compute resources.
AR: I really like that. I think of two challenges in addition to everything that you shared, Robert, echoing some of what you said. One of the challenges really is how these large language models are created. While these models demonstrate state-of-the-art results, even in few-shot applications, there often is a lot of bias in the way that data was collected, and that bias propagates itself in transfer learning or through downstream applications. One challenge really is how we do transfer learning with carefully pruned datasets. And the second thing is just the logistics of deployment itself. A typical inference, for example, is a simple question like “How tall is the president of the United States?” This task can be decomposed into multiple tasks. For example, finding the named entities, figuring out the database IDs for the named entities, deciding the appropriate UI to render the answer, and then doing the inferencing or the estimation itself.
All of that requires a lot of pipeline tasks. And when you have a large language model sitting in that inferencing pipeline, debugging becomes difficult, and retraining the model becomes difficult. So there are lots of practical challenges with directly working with large language models in this case. And then one of the questions that we also see, and I think the last question before we move to Q&A, is about the topic of change. Robert, you talked about drift and bias and how models adapt to changes in the world around them. In a production use case, how should developers think about change?
RC: So change is, in my view, one of the two major drivers for ML ops. There’s a change in the world and then there’s the process. Change is true for models, just like it’s true for humans. Depending on what area you’re talking about, things can change very slowly or they can change at a much faster pace.
In medicine, there can be new techniques all the time, or new results and new treatments. There are constant changes in the law. In markets, there’s constant change. Your model, just like a human, needs to learn to adapt to that. And that means it needs new data to adapt to that, because every training dataset you gather is just a snapshot in time of whenever that data was collected. So that’s as much as the model knows; it doesn’t know about anything that’s newer than that. That said, it’s really domain dependent. For some things, like language models, the underlying language doesn’t change all that fast.
Piyush Puri: Yeah, Robert, Abhishek, that was a great presentation. Thanks so much for walking us through that. Abhishek, I think you guys got to one of Ishisha’s questions, but I see a question from Rahul. “How do you speed up gathering data, the labeling process, and just uplevel everything around data labeling?” I know Snorkel has an answer to that as well, but would love to hear from you all on where you see that.
RC: Yeah, again, I’d say it’s very domain dependent. I’ve been a part of projects where we’ve spent an incredible amount of money just trying to collect a small amount of data. But in other cases, the more you can automate, the better off you are. Taking advantage of weak supervision, synthetic data, and data augmentation can really help. If you can establish some sort of feedback for your process where you’re getting your data labeled as part of serving your model, that’s good too. It’s just very dependent on the problem that you’re solving and the domain that you’re working in.
PP: Absolutely. Yeah. I think we’d all agree that iteration is key, especially as things drift and change over time.
We have a question from Andrew here: one obstacle to sharing data, even within a single organization, is that so much information about the dataset is documented poorly, if at all. Could you speak to the use of maybe data cards or other techniques for capturing metadata such as the definitions of features, how the data was sourced, assumptions implicit in the distribution, etc.?
AR: Yeah. Robert, you can go first. I have a certain point of view on this.
RC: Yeah. That is definitely a problem. As an organization, you often have multiple teams or multiple developers who are working on different things, but to the extent that they can share their data and avoid duplication, it really helps. Feature stores can help with that by gathering all of your features in one place and making datasets rationalized, and they will capture some of the metadata associated with that. Things like metadata can really help to understand your data over the lifecycle of your product or service so that you capture the artifacts that are generated. Not just datasets, but the rest of the artifacts too—things like metrics and model results and evaluations and so forth. So understanding how your model evolves over time is part of that too. But he’s really asking about, I think, the initial dataset. So there’s a lot of governance that has to happen.
AR: That’s what I was thinking too, Robert. It’s the governance aspect. Sometimes unfortunately you have to deal with a bad dataset. But it really is about bringing that data culture into the organization. Even the best tools can only go so far if the teams are not committed to making sure that sanity is in place. Yeah, I don’t know if there’s a silver bullet for making up for missing data lineage, but all the techniques that you mentioned, Robert, definitely go a long way. And if we can have feature stores, if we can have some sort of automated labeling and some sort of automated metadata generation, all of those steps go a long way.
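To make Andrew's question concrete, a data card can be as simple as a structured record stored alongside the dataset. The sketch below is a hypothetical minimal layout, not a standard schema or any tool the speakers named: the `DataCard` and `FeatureDoc` classes, their fields, and the example values are all assumptions. It records the items the question lists (feature definitions, sourcing, assumptions) and reports which ones are still undocumented.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FeatureDoc:
    name: str
    dtype: str
    definition: str  # plain-language meaning of the feature


@dataclass
class DataCard:
    dataset_name: str
    source: str      # where the data came from
    collected: str   # when it was collected
    owner: str       # team or person responsible
    assumptions: list = field(default_factory=list)
    features: list = field(default_factory=list)

    def missing_fields(self):
        """List documentation gaps to resolve before the dataset is shared."""
        gaps = [key for key in ("source", "collected", "owner") if not getattr(self, key)]
        if not self.features:
            gaps.append("features")
        gaps += [f"features.{f.name}.definition" for f in self.features if not f.definition]
        return gaps


# Hypothetical example card with two gaps: no owner, one undefined feature.
card = DataCard(
    dataset_name="churn_training_v1",
    source="billing export",
    collected="2022-06",
    owner="",
    assumptions=["Only customers active for 30+ days are included."],
    features=[
        FeatureDoc("tenure_days", "int", ""),
        FeatureDoc("plan", "str", "Subscription tier at time of export."),
    ],
)

gaps = card.missing_fields()
card_json = json.dumps(asdict(card), indent=2)  # serializable for storage next to the data
print(gaps)
```

A check like `missing_fields` could run as part of a review step, which supports the point that tooling helps only when teams commit to keeping the documentation current.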
PP: Yeah, I think you guys are spot on. It’s something that everyone struggles with, but it’s a team effort and everyone needs to be disciplined in making that happen. We have another question that’s come in about what things you focus on in terms of monitoring and observability of the production ML pipelines. What would be the things to focus on first and what tools would you recommend?
RC: The production ML pipeline is broad. It covers the training pipeline and the serving or inference process as well. Taking those apart: in training, you really want to capture information about the whole pipeline, from the beginning dataset through the different transformations you do to your data and the information you capture about it through the training process itself and the evaluation process. What we do in TFX is we use ML metadata as a tool to capture all those steps and it preserves the lineage of all those artifacts.
So from the training perspective, that gives you the entire history of that model and that pipeline throughout the life cycle of that product or service, which might be living for years. On the serving side, really what you want to monitor is things like operational performance, how much is your latency, making sure that someone is alerted when the thing goes down, scaling up, scaling down, that kind of thing. But also capturing—and this is really important—capturing the prediction requests that are made of your model because those will form the basis of your next dataset. Then you need a labeling process behind that to take that data and get it labeled.
But that’s your signal from the outside world about how it’s using your model and what your model needs to do to serve that well. That’s where you start to see data drift. And when you get to the labeling part of that, that’s when you start to see concept drift.
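One way to see data drift in captured prediction requests is to compare the distribution of a feature at training time with its distribution at serving time. The sketch below uses the L-infinity distance (the largest absolute difference in the share of any single category), which is one common choice for categorical features. The feature values, the 0.1 threshold, and the function names are assumptions for illustration. Concept drift is not covered here, since detecting it requires labels on the new data.

```python
from collections import Counter


def distribution(values):
    """Map each category to its share of the observed values."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def linf_distance(train_values, serving_values):
    """Largest absolute difference in category share between the two samples."""
    p = distribution(train_values)
    q = distribution(serving_values)
    categories = set(p) | set(q)
    return max(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical "device" feature: training snapshot vs. logged serving requests.
train_device = ["mobile"] * 50 + ["desktop"] * 50
serving_device = ["mobile"] * 80 + ["desktop"] * 15 + ["tablet"] * 5

DRIFT_THRESHOLD = 0.1  # assumed cutoff; tune per feature in practice
drift = linf_distance(train_device, serving_device)
drifted = drift > DRIFT_THRESHOLD
print(round(drift, 2), drifted)
```

Here the desktop share falls from 0.50 to 0.15, so the distance is about 0.35 and the feature would be flagged for review.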
PP: Unfortunately, I think that’s all the time we’re gonna have for questions today. But really quickly, if anyone wants to follow along with this work or learn more about everything that you’ve spoken about, what’s a good way to either connect with you or follow the work at large?
RC: Well, tensorflow.org
PP: Perfect. We’ll leave it at that, tensorflow.org it is.