Dr. Ce Zhang is an associate professor in Computer Science at ETH Zürich. He presented “Building Machine Learning Systems for the Era of Data-Centric AI” at Snorkel AI’s The Future of Data-Centric AI event in 2022. The talk explored Zhang’s work on how debugging data can lead to more accurate and more fair ML applications. This talk was followed by an audience Q&A conducted by Snorkel AI’s Priyal Aggarwal. A transcript of the talk follows. It has been lightly edited for reading clarity.
In my research group, we are interested in the process of building machine-learning applications for everyone in our society.
One challenge that we are excited about (and also worried about) is the observation that building machine learning and AI models today can be ridiculously expensive.
If you look at where the cost comes in, it comes from so many different directions. Some of the models simply need a lot of computation and a lot of storage. If you look at those big language models, it’s not uncommon for them to require thousands of petaflop/s-days. If you look at those state-of-the-art recommendation models, it’s very easy for them to occupy 10 terabytes or even 100 terabytes of storage. They simply need a lot of computing and storage.
On the other hand, the cost of ownership for infrastructure is actually not that cheap. If you combine these two together, it’s actually not uncommon to spend a couple of million dollars to train those big language models—and to spend tens of thousands of dollars just to hold those big recommendation models in your memory.
But this is just about building a single model. Even developing a single machine learning model is getting more and more expensive. There's the cost of development: hiring somebody to build your application is very expensive. The cost of data is also non-trivial; in some of our applications, we could easily spend a couple of dollars just to annotate and clean a single example. People also start to care about things beyond accuracy and quality, such as fairness and robustness, which are very important for your application.
There’s also the cost of complying with regulations, and there’s the operational cost when you deploy your machine learning models in practice. How can we monitor them? How can we test them? How can we scale them?
The goal of my research group is to look at this landscape and try to understand how we can bring down all those costs by at least 10x. Our dream is to turn a training run that costs you an hour into one that takes a couple of minutes, and to turn a task that requires 10 labels into one that needs a single label.
The key hypothesis of our group is that, once we decrease this cost by one order of magnitude, it’s going to change how people are building, developing, and testing those models today. Hopefully, that can make many very hard socioeconomic discussions much easier to have.
This is a hard problem because, if you look at this landscape, you can only make real progress by tackling all these dimensions at the same time. Often, it requires you to co-design the algorithm and the system stack.
In my research group, we focus on two different questions. The first is to look at everything that makes your infrastructure so expensive: every single bit that has to be communicated through the network, every floating-point operation (FLOP) that you have to run, every synchronization barrier that you are enforcing, every data movement and shuffle you are conducting, every piece of data that you must hold in memory, and every fast connection you are building between those devices.
We look at all those components that make the machine learning training process so expensive and ask: are they really necessary? If they are, how can we create new algorithms to accommodate them? If they’re not, how can we get rid of them?
The second thing that we have been looking at is the process of building those applications. How can we improve the quality of our machine-learning model? How can we continuously test the quality of the machine-learning model? How can we adapt the model to different scenarios in as systematic and data-efficient a way as possible?
Over the years, we have realized that you can make progress on both fronts. On the system side, by co-designing the algorithm, the data ecosystem, and the hardware, you can actually gain a lot of performance; there is a whole collection of open-source projects that we contribute to, which you can use to try different ways of scaling up your machine learning training. On the development side, we realized that if you give the user a principled, systematic guideline through the end-to-end process of building machine learning applications, and direct the user's attention in the right direction, there are often a lot of savings you can enable.
Today, I’m going to talk a little bit about one very specific component: how to understand and analyze the relationship between data quality and machine learning quality, which is very related to the current trend of data-centric AI.
What motivates our study is that if you look at how people are building machine learning applications today, it has never been easier. Most of the cloud service providers have their own AutoML platforms. If you go to Google, you can just drag and drop your dataset, and the system will produce a model for you. It has never been easier to get a machine-learning model from a cloud service provider.
But what’s next? What if this model is not good enough? What’s the reason that this model is not good? To fix something in the model or in the data, what is the most important piece that we should focus on?
If you ask these questions, it becomes challenging, because building applications in the real world never looks like the textbook picture, where you have your training set and test set and pipe them into a training process. You get a model, you measure quality, and you have some utility for your model, maybe accuracy, maybe fairness.
It’s never as simple as that. In practice, it’s often the case that you have a whole bunch of problems in your data: you could have missing values, you could have wrong values, spread across a whole bunch of your data examples.
Traditionally, in data management, we have these four dimensions of data quality: accuracy, completeness, timeliness, and consistency. There is an interesting mapping between data quality and model quality. If your data is not accurate enough, it’s highly likely your model quality will suffer. If your data is not complete, in the sense you are missing a certain population, it’s entirely possible that your model will actually not be fair at all.
To see this mapping, in one experiment that we ran a couple of years ago, we took a whole bunch of dirty datasets with a lot of noise and ran a whole bunch of machine learning models on them. If you try those different machine learning models, they do have different accuracy. What was interesting, though, is that all those models stop at this magic number, 76%. This is the best that your data can support, not the best that your model can learn.
What if you just clean your data a little bit? If you have a clean dataset, you can easily add a couple of points of improvement. This is a statement that you have been hearing a lot during the last couple of days—often, the best way to improve your machine learning model is to try to improve your data.
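To make the effect concrete, here is a small self-contained sketch (my own toy construction, not from the talk): a 1-nearest-neighbor classifier trained on labels with a quarter of them flipped stalls well below the accuracy that the clean labels support.

```python
import random

random.seed(0)

# Toy data: two well-separated 1-D clusters, one per class.
def make_data(n):
    xs, ys = [], []
    for _ in range(n):
        y = random.randint(0, 1)
        xs.append(random.gauss(10.0 * y, 1.0))
        ys.append(y)
    return xs, ys

def knn1_predict(train_x, train_y, x):
    # 1-nearest-neighbor: copy the label of the closest training point.
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

def accuracy(train_x, train_y, test_x, test_y):
    hits = sum(knn1_predict(train_x, train_y, x) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)

train_x, train_y = make_data(200)
test_x, test_y = make_data(200)

# Flip 25% of the training labels to simulate dirty data.
noisy_y = list(train_y)
for i in random.sample(range(len(noisy_y)), len(noisy_y) // 4):
    noisy_y[i] = 1 - noisy_y[i]

acc_clean = accuracy(train_x, train_y, test_x, test_y)
acc_noisy = accuracy(train_x, noisy_y, test_x, test_y)
print(f"clean labels: {acc_clean:.2f}, noisy labels: {acc_noisy:.2f}")
```

No amount of model tuning recovers the gap here; repairing the flipped labels does.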
Unfortunately, this is actually something that’s very easy to say but a little bit harder to do. When people try to improve their data in their machine learning applications, there are a whole bunch of struggles they’re facing today.
The first struggle is there are so many things that you can do to your data. You can get more features. You can remove all outliers. You can get more data examples. You can clean your data. There’s an overwhelming amount of possible operations that users could do on the dataset.
The second struggle is not all those operations are equally beneficial to the final utility. Often, there’s only a very small collection of operations that are really crucial and many others are not useful at all.
The third struggle is that after you fix something in your dataset, the feedback loop is very slow. Often you have to wait for the machine learning training process to finish before you get a signal about whether your data operation actually helped your model utility.
The last struggle is that in the real world, a machine learning application is rarely just a machine learning training process. Often, we are talking about a complex program in which only a small component is machine learning training, and the majority of the application is data transformation.
If you put these four different struggles together, what we observe in practice is that often our users waste their time and money trying to focus on data problems that often do not matter at all. On the other hand, they’re missing those really important data problems. They need help.
Wouldn’t it be nice if every single Jupyter notebook had a magic button called “Debug”? Once you click it, it highlights every operation inside that piece of code that is causing problems for the quality, fairness, or robustness of the model. And if you do not want to write code, wouldn’t it be nice if every single AutoML platform had a magic button called “Improve Data”? Once you click that button, you select the objective, and we bring you an ordered list of the examples that you should debug, from the most important to the least important. These are the things that we really want to enable for our users.
Unfortunately, there are very hard technical problems to tackle to actually make this happen.
This is the technical problem that we have been looking at. On one hand, you have a dataset that could contain a whole bunch of data problems. You pipe that into a feature extraction pipeline and machine learning model, and then you get your model. You can measure a certain utility of this model.
The thing that we want to understand is, can we trace back to the original dataset to compute certain notional importance for each data example? How can we trace back to the feature extraction pipeline to compute importance for every single feature extraction operator?
There are multiple questions that we need to tackle. The first is: how can we define importance? It’s actually very challenging, because essentially we are compressing a very complex problem into a single number associated with each data example and each operator.
The second question is, how can we make that fast? How can we enable real-time interactions between the user and the system? The third question is, how can we use them to do something useful?
First, how can we define importance? This is actually not that easy. Imagine you have four different data examples. What is the importance of the red tuple? What can we do? One simple thing you can do is remove the red tuple and compare the difference in accuracy with and without it. That can tell you something about the importance of the red tuple.
On the other hand, it can also be suboptimal, and it’s very easy to construct an example of why. Imagine you have four different data examples, two bad and two good, and your utility is such that as long as there is one bad example in the dataset, your accuracy is going to suffer. Only when you have exclusively good examples is your accuracy good.
If you compute this leave-one-out notion of importance for the red tuple, you’ll get zero. If you compute the same notion for the yellow tuple, you’ll also get zero. As you can see, leave-one-out importance can be confusing, especially when there are strong correlations between those examples, which is often the case in practice.
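This failure mode is easy to reproduce in code. Below is a minimal sketch of the toy setup described above (the tuple names and the exact utility function are my own illustration): four tuples, two bad and two good, with a utility of 1 only when the training subset contains good tuples and no bad ones. Leave-one-out assigns zero importance to every tuple.

```python
# Toy utility over subsets of four tuples: accuracy is good (1.0) only
# when the subset has at least one good tuple and no bad tuples.
TUPLES = ["bad1", "bad2", "good1", "good2"]

def utility(subset):
    has_bad = any(t.startswith("bad") for t in subset)
    has_good = any(t.startswith("good") for t in subset)
    return 1.0 if has_good and not has_bad else 0.0

def leave_one_out(t):
    # Importance of t = utility with t minus utility without t.
    everyone = set(TUPLES)
    return utility(everyone) - utility(everyone - {t})

for t in TUPLES:
    print(t, leave_one_out(t))
```

Removing any single tuple still leaves a bad one behind, so every marginal difference is zero.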
What can we do instead? Consider all the different combinations of the other tuples, compute the improvement that the tuple brings to each, and aggregate those improvements into your importance. You can aggregate them in different ways. You can simply compute the average, which has a close relationship with the multilinear extension. This works well in many cases.
You can also weight them in a specific way called the Shapley value. It has a very strong game-theoretic foundation and many good properties, and people have shown that it works well in many cases.
There’s an interesting trade-off between these different ways of weighting those improvements. In practice, they often perform better than the leave-one-out notion. I’m not going to detail the trade-off, but essentially, in many cases, uniform weighting and the Shapley value can outperform leave-one-out pretty significantly. In many applications, there’s also a trade-off between the Shapley value and uniform weighting.
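On the same toy utility, the Shapley value separates the two groups where leave-one-out could not: the bad tuples come out negative and the good ones positive. Here is a brute-force sketch (my own illustration; fine for four tuples, exponential in general) using the standard subset-weighted formula:

```python
from itertools import combinations
from math import factorial

TUPLES = ["bad1", "bad2", "good1", "good2"]

def utility(subset):
    # 1.0 only if the subset has at least one good tuple and no bad ones.
    has_bad = any(t.startswith("bad") for t in subset)
    has_good = any(t.startswith("good") for t in subset)
    return 1.0 if has_good and not has_bad else 0.0

def shapley(t):
    # Weighted average of t's marginal contribution over all subsets S
    # of the remaining tuples: weight = |S|! * (n - |S| - 1)! / n!
    others = [u for u in TUPLES if u != t]
    n = len(TUPLES)
    value = 0.0
    for k in range(len(others) + 1):
        for s in combinations(others, k):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            value += weight * (utility(set(s) | {t}) - utility(set(s)))
    return value

for t in TUPLES:
    print(t, round(shapley(t), 3))
```

Each bad tuple gets -0.25 (it destroys the utility of every good subset it joins) and each good tuple gets +0.25, so ranking by Shapley value immediately points at the right examples to fix.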
Once you define importance, the second question is: how can we compute it? If you look at the Shapley value or uniform weighting, there are multiple challenges. The first is that you need to enumerate a whole bunch of possible worlds, which can be exponentially expensive. For other notions of importance, for example entropy or expected improvement, you have the same problem. We have to do something here.
The second challenge is that in practice, machine learning applications rarely look like this.
Often they look something like this.
A small component is our machine learning training, but there’s an even larger component about feature extraction.
If you look at the two challenges together, we are actually lacking a fundamental understanding of how to connect these two different areas. On one hand, there’s a data management community trying to understand data transformation and computing some functions over exponentially many databases for decades.
On the other hand, there’s a machine learning community trying to understand data importance for machine learning training over the last few decades. We’d like to bring them together.
This is what we know. If you look at this machine learning pipeline, it’s actually super hard to analyze directly. But if you approximate it in a certain way, you get a proxy pipeline for your original pipeline, and that makes your problem much easier.
Essentially, given this end-to-end program, you can approximate your feature extraction pipeline using data provenance; theoretically, you can represent it as polynomials in a provenance semiring. You’ll have different shapes of these pipelines. And you can approximate your machine learning training component with a simpler classifier, for example, a k-nearest-neighbors classifier.
Once you do this, you actually have a polynomial-time (PTIME) algorithm for a whole bunch of important metrics, such as the Shapley value, expected prediction, or entropy.
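To give a flavor of why the k-nearest-neighbors proxy makes this tractable: for a K-NN classifier there is a known closed-form recursion, from the nearest-neighbor data-valuation literature that this line of work builds on, which yields exact Shapley values in O(N log N) per test point instead of enumerating subsets. The sketch below is my own minimal 1-D implementation of that recursion (function name and setup are illustrative), with the utility of a subset defined as the fraction of matching labels among its K nearest points:

```python
def knn_shapley(train_x, train_y, x_test, y_test, k):
    """Exact Shapley values for a K-NN classifier's accuracy on one
    test point, via the closed-form recursion (no subset enumeration).
    Utility of a subset S: (1/K) * number of matching labels among the
    min(K, |S|) nearest points of S; utility of the empty set is 0."""
    n = len(train_x)
    # Sort training indices by distance to the test point, nearest first.
    order = sorted(range(n), key=lambda i: abs(train_x[i] - x_test))
    match = [1.0 if train_y[i] == y_test else 0.0 for i in order]

    s = [0.0] * n  # Shapley value of the point at each distance rank.
    s[n - 1] = match[n - 1] / n  # base case: the farthest point
    for r in range(n - 2, -1, -1):
        # Recursion over ranks, from farthest to nearest (rank r is the
        # (r+1)-th nearest point).
        s[r] = (s[r + 1]
                + (match[r] - match[r + 1]) / k * min(k, r + 1) / (r + 1))

    # Map the values back to the original index order.
    values = [0.0] * n
    for rank, idx in enumerate(order):
        values[idx] = s[rank]
    return values
```

Summing these values over a validation set gives a per-example importance score; examples with very negative totals are the first candidates to inspect.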
The improvement can be pretty dramatic. Here we have a system called DataScope, shown on just one dataset. As you can see, compared with computing those values using MCMC, the original approach, you get a speedup of four to five orders of magnitude.
It actually works. This is an example where you have some wrong labels in your training set. You feed it into your machine learning pipeline, and then, if you debug the data examples in the order of their computed Shapley values, you have a data-debugging mechanism that improves the accuracy of the machine-learning application much faster than a random strategy.
You can then plug in different types of objectives. Here’s one application where even the 100% cleaned dataset has some fairness issues, meaning that if you clean up the whole dataset, the model could still be unfair. In this case, you can also use fairness as an objective for data debugging. If you look at the improvement in accuracy, with this fairness objective for the Shapley value you can still improve your accuracy. On the other hand, if you measure the fairness of the trained machine learning model and you do the cleaning in a random way, the model can become more and more unfair. With fairness as an objective, you get a cleaning strategy that brings up your accuracy by at least 20 points and keeps the same level of fairness.
This is something that we can play with. We have two open-source repos that are very easy to use. You can just plug in your sklearn pipeline, and then you input your utility. With a single line of code, you can compute the Shapley Value for each of those data examples.
Another thing that could be interesting for everyone to look at is the DataPerf challenge. It computes figures like this to compare different types of data-debugging methods. Please play with it and give us feedback!
Priyal Aggarwal: Thank you so much, sir. It was such an incredible talk and you laid out your presentation very clearly.
Braden has one question for you. He says, the value of data cleaning seems pretty clear from this presentation, but techniques vary pretty widely in complexity. For example, drop exact duplicates vs. analyses that require trained model embeddings, etc. From your experience with CleanML, what data cleaning techniques would you recommend first, from a return on investment standpoint?
Ce Zhang: That’s a very interesting question. There are two different dimensions. There’s the mechanism of cleaning things up: duplicates, missing values, wrong values. Usually, you have one dataset that contains all those problems. We do not actually have a strong opinion on what the best mechanism is for cleaning your data. It is our belief that it’s going to be closely related to your application. For some applications, maybe duplicates matter more; for some applications, maybe missing values matter more. It also depends on your final utility: whether you care about accuracy or fairness will give you different importance for different mechanisms.
What we are interested in is navigating this ocean of potential operations that you could apply. How can we give the user guidance about which ones are potentially important? That is what we have been looking at. So I would say I actually don’t know which mechanism is going to perform better for any given application, but there is a systematic way to compute which is more important for the user.
PA: That makes sense. So your technique would enable the user to make their own decisions as to what is important and what to focus on next.
CZ: Yeah, exactly. The whole thing is highly related to the application; that’s what we found out.
PA: Braden has a follow-up question. He says it makes sense that what technique you use will be very much task-dependent. What are the major categories of techniques or types of data dirtiness in your mind?
CZ: There are outliers, there are duplicates, there are missing values, and there’s distributional drift, where you have different sampling probabilities for different subpopulations. Those are the things we have been looking at, at least in the CleanML benchmark. I’m sure there are more, so it would be interesting for the whole community to come together and try to build a complete list of data problems, but so far these are the things we have been focusing on.
PA: Next question is from Kia and they ask, is there a research paper that details your approach or methodology?
CZ: Yeah! I think the best reference, or the latest one, would be this paper about the computational complexity of computing Shapley Values over different types of end-to-end pipelines. This would be the paper I would read first if I was interested in what we do.
PA: Another question from Roshni and she says: have you thought about how to improve the timeliness problem? How can we iterate on feedback from the data debugging faster?
CZ: There are two different angles to that. The first one is that a consequence of focusing on timeliness is that you need this data-debugging loop to run multiple times. Our hope is that, by making this importance computation run very fast, the user can have a real-time debugging experience, in the sense that every time they fix some data problem, they get feedback within a couple of seconds. We want to make sure they are very comfortable in the loop. That is something that, with this k-nearest-neighbor proxy, we are able to achieve to a certain extent.
But there’s another dimension of timeliness. Whenever you use your data examples multiple times, you start to have the problem of overfitting, in the sense that your data doesn’t really reflect reality that much anymore, or starts to cause problems. We have another paper about how to quantify the staleness of your data and how to bound the difference between what you can get with an older version of the data and what you can get with the latest version. We don’t know how to put these two together yet (it is very interesting to think about how), but right now we have these two different thoughts about timeliness.
PA: Roshni has a follow-up question and she says, how can you validate the accuracy of the importance calculation?
CZ: Short answer is, I don’t know! How can we evaluate that? There are two different ways to think about it. One way is to measure the effectiveness of importance computation by its utility. A good importance will be useful. The way to do that is to try to understand, if you have this importance, can you use it to enable something interesting?
Here you have two different notions of importance: one is random importance, and the other is the Shapley value. Given some task, you can actually measure which one performs better. That’s one way to measure what a good importance is. But we are not happy with that, because it relies on the underlying task.
One thing we have been looking at is how to define the optimal importance that you could have. Our current thinking is to define this using the framework of interventions: define the optimal importance as the one that leads to the optimal intervention, and then compute the difference between your importance and this optimal importance.
We have some thoughts here, though we don’t have anything published yet, but I definitely agree with you. How to evaluate the many different ways of computing importance is definitely a very important problem. We hope we’ll have something for you in six months.
PA: That’s amazing. Thank you so much for sharing, sir. It was amazing to have you here. We learned so much about your work and thank you so much for answering the questions.
CZ: Thank you!