The Future of Data-Centric AI Talk Series
Background
Chelsea Finn is an assistant professor of computer science and electrical engineering at Stanford University, whose research has been widely recognized, including in the New York Times and MIT Technology Review. In this talk, Chelsea talks about algorithms that use data from tasks you are interested in and data from other tasks.
If you would like to watch her presentation in full, you can find it below or on YouTube.
Below is a lightly edited transcript of the presentation.
A large part of machine learning often focuses on how to change the algorithms rather than actually changing the data. This talk focuses on algorithms that don’t just leverage data for the specific tasks you care about but try to also leverage data from other tasks and other environments.
The usual “machine learning paradigm” collects data for a specific task, trains a model using that dataset, and evaluates the model. This process has been the traditional approach to machine learning since its inception.
“In reality, there are a lot of challenges that come up in a typical machine learning paradigm.”
Two challenges stand out. First, the dataset is sometimes so small or narrow that it isn’t sufficient for learning a model from scratch. Second, there may be some form of distribution shift at evaluation time, so the training dataset doesn’t cover the scenarios the model is evaluated on.
What if we have datasets from other tasks or domains that can help us improve the model?
If we have datasets from other tasks or other domains, the model may be able to leverage them to improve on the target task. There are three different approaches for efficiently using data from other tasks and domains to solve the target task, each framed as a question:
- First, is there a way to jointly train on prior datasets?
- Second, is it possible to selectively determine which prior datasets would be most helpful?
- Lastly, can we learn priors from previous datasets and use that to solve the target tasks we have at hand efficiently?
Let’s take a closer look at each of the approaches.
Jointly training on prior datasets
Consider a scenario in which we want a robot to determine whether or not it has successfully completed various tasks, which can be accomplished by training a binary classifier. Unfortunately, collecting data on real robots is costly and time-consuming, and the available data consists of demonstrations of the robot performing a few different tasks in one environment. Ultimately, we want a classifier that doesn’t just generalize to one environment with a few tasks but to many tasks in many environments.
The idea is that instead of training a model only on the limited and narrow robot dataset, we can leverage other datasets, such as video footage of people performing many different tasks, and merge them with the original dataset. By combining the two, the model might learn a reward function or a success classifier that generalizes broadly across many different tasks and environments. That is the training setup. At test time, the robot will be in a new environment with a new task that it hasn’t completed before. Given a video of a human performing the task, the robot will infer a success classifier and then use it to determine the behavior that solves the task.
The approach, in this case, is straightforward: jointly train a classifier on the narrow robot data already available and the diverse human data. The classifier takes two videos as input and outputs whether or not those two videos show the same task. For example, we can take a video of a person closing a drawer and a video of a robot closing a drawer and train the model to output a label of one. When we pass in two videos showing different tasks, we train the network to output a label of zero. Once we have this classifier, we can use it with methods like reinforcement learning and planning to allow the robot to solve new tasks, hopefully in new environments.
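As a rough illustration of how the training pairs for such a classifier could be assembled, here is a minimal sketch. The clip records, field layout, and task names are hypothetical and are not taken from the dataset used in the talk; the only point is that every pair of clips gets a label of one when the tasks match and zero otherwise, regardless of whether the clip shows a robot or a human.

```python
from itertools import combinations

# Hypothetical clip records: (clip_id, source, task).
# These names are illustrative assumptions, not the talk's actual data.
clips = [
    ("r1", "robot", "close_drawer"),
    ("r2", "robot", "open_drawer"),
    ("h1", "human", "close_drawer"),
    ("h2", "human", "open_drawer"),
    ("h3", "human", "pour_water"),
]


def make_pairs(clips):
    """Return (clip_a, clip_b, label) triples.

    label is 1 when both clips show the same task and 0 otherwise.
    Robot and human clips are mixed freely, which is what lets the
    diverse human data supplement the narrow robot data.
    """
    pairs = []
    for (id_a, _, task_a), (id_b, _, task_b) in combinations(clips, 2):
        pairs.append((id_a, id_b, 1 if task_a == task_b else 0))
    return pairs


pairs = make_pairs(clips)
for clip_a, clip_b, label in pairs:
    print(clip_a, clip_b, label)
```

In a real system each clip id would be replaced by the video itself, and the pairs would feed a video classifier; that part is left out here.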
“Joint training on diverse prior data can substantially improve generalization.”
In this scenario, the performance was disappointingly low when the initial classifier was trained just using the narrow robot data. However, by incorporating human data, the environment generalization increases by more than 20%. This improvement would not have been possible with only a change to the algorithm. So the takeaway here is that joint training on diverse prior data can substantially improve generalization.
Selectively train on prior datasets
What if a vast number of prior datasets exist, perhaps from various tasks? Training on all of the data at once may not be ideal. Some tasks and prior datasets may complement the target task, but in other cases, the dataset may be incompatible and cause the performance to suffer. It’s difficult to say if they will complement each other. For example, a previous paper looked at training on different computer vision tasks, like predicting semantic segmentation, depth, surface normals, key points, or edges from the image. The researchers found that training tasks together sometimes improves performance and occasionally hurts performance. The affinity of any two tasks will be determined by factors like the quantity of the data, existing model knowledge, and more subtle optimization features. The bad news, in this case, is that there is really “no closed-form solution” for measuring the task affinity from the dataset alone.
According to our research paper “Efficiently Identifying Task Groupings for Multi-Task Learning,” it turns out that it is possible to approximate the affinity of two tasks with just a single training run. The previous paper measured task affinity with what was essentially an exhaustive search over pairs of tasks to understand how well they train together, which becomes enormously expensive as the number of datasets and tasks grows. To make this drastically cheaper, we first train all of the tasks together once in a single multi-task network, then analyze the optimization process of that joint training run to compute inter-task affinities. Then, we use these affinities to select groupings of tasks that maximize affinity, and ultimately train one network for each group of tasks.
The most challenging step is analyzing a multi-task training run to determine the different affinities. The way to approach this is to measure how well the dataset for task i helps solve a different task j. This will depend on task i, task j, and the model’s current parameters. We capture it with a lookahead loss ratio: take a gradient step on the shared parameters using task i alone, then compare task j’s loss after that step with its loss before the step.
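Written out, the lookahead quantity takes roughly the following form. This is a reconstruction based on the description in this talk, not a verbatim copy of the paper's equation. The notation is assumed here: $\theta_s$ denotes the shared parameters, $\theta_i$ and $\theta_j$ the task-specific parameters, $\eta$ the learning rate, $X^t$ the batch at step $t$, and $L_i$, $L_j$ the task losses.

```latex
\[
\theta_{s|i}^{t+1} = \theta_s^{t} - \eta \, \nabla_{\theta_s^{t}} L_i\!\left(X^{t}, \theta_s^{t}, \theta_i^{t}\right)
\]
\[
Z_{i \to j}^{t} = 1 - \frac{L_j\!\left(X^{t}, \theta_{s|i}^{t+1}, \theta_j^{t}\right)}{L_j\!\left(X^{t}, \theta_s^{t}, \theta_j^{t}\right)}
\]
```

With this form, $Z_{i \to j}^{t}$ is positive when the step on task i lowers task j's loss and negative when it raises it.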
Fundamentally, this ratio tells us if a gradient step on one task improves the performance of the other. If the equation yields a positive value, that indicates that task i aids in the improvement of task j. Note that this quantity is asymmetric: it measures how much i helps j, and it may be the case that j isn’t helpful for i even when i is helpful for j. We average this quantity over multiple training iterations to obtain a training-level score for each pairwise task affinity, which is then used to select which tasks should be trained together and which should be trained separately. The results show that this method matches the performance of other known approaches with significantly fewer GPU hours. In other words, it is possible to substantially improve multi-task learning performance while using far less compute. The takeaway, in this case, is that we can “automatically select” task groupings from a single training run.
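To make the lookahead affinity concrete, here is a minimal sketch on a toy problem. It is not the paper's implementation: the model is a single shared scalar parameter, each task is a hypothetical quadratic loss with its own target, and the function names are made up for illustration. The affinity is computed as one minus the ratio of task j's loss after a lookahead step on task i to its loss before the step, then averaged over training iterations.

```python
def make_quadratic_task(target):
    """Toy task: loss over one shared parameter is (theta - target)^2."""
    def loss(theta):
        return (theta - target) ** 2

    def grad(theta):
        return 2.0 * (theta - target)

    return loss, grad


def lookahead_affinity(tasks, theta, lr):
    """Z[i][j] = 1 - L_j(theta after a step on task i) / L_j(theta)."""
    n = len(tasks)
    Z = [[0.0] * n for _ in range(n)]
    for i, (_, grad_i) in enumerate(tasks):
        # Lookahead: update the shared parameter using task i only.
        theta_lookahead = theta - lr * grad_i(theta)
        for j, (loss_j, _) in enumerate(tasks):
            Z[i][j] = 1.0 - loss_j(theta_lookahead) / loss_j(theta)
    return Z


def average_affinity(tasks, theta, lr, steps):
    """Average the lookahead affinity over a single joint training run."""
    n = len(tasks)
    total = [[0.0] * n for _ in range(n)]
    for _ in range(steps):
        Z = lookahead_affinity(tasks, theta, lr)
        for i in range(n):
            for j in range(n):
                total[i][j] += Z[i][j] / steps
        # Ordinary joint update using the mean of all task gradients.
        theta -= lr * sum(grad(theta) for _, grad in tasks) / n
    return total


# Tasks 0 and 1 have nearby targets; task 2 pulls the other way.
tasks = [make_quadratic_task(t) for t in (1.0, 1.2, -1.0)]
affinity = average_affinity(tasks, theta=0.0, lr=0.1, steps=20)
for row in affinity:
    print([round(z, 3) for z in row])
```

In this toy setup, tasks 0 and 1 have positive affinity for each other, both have negative affinity with task 2, and the score from task 0 to task 1 differs from the score in the other direction, which mirrors the asymmetry noted above.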
Learn priors from previous datasets
There may be situations when we want to extract prior information from previous data but not train jointly. These are the situations where joint training is too computationally expensive or doesn’t work well. In this scenario, a good solution is to derive a prior from the datasets that we can transfer to the target task.
The problem we use to explain this approach is providing feedback on a large number of student test submissions. The challenge here is that the diagnostic test submissions are open-ended Python code, and it is estimated that giving feedback to all of these students would take eight or more months of human effort. The feedback is given by defining a rubric and then telling students which misconceptions on that rubric their submission shows, which makes this a multi-label binary classification problem (one yes/no decision per rubric item). Now let’s see why this is a challenge. Grading student work is very time-consuming, and it takes a lot of expertise. For example, even annotating around 800 fairly simple block-based programs takes someone 25 hours. Students solve the problem in many different ways, which creates a very long-tailed distribution of solutions that the system must handle. Lastly, there isn’t an enormous dataset for any single exam question because instructors are constantly updating and changing their assignments and exams. As a result, the solutions and instructor feedback look very different from year to year.
Our solution is for these algorithms to leverage past data, which will consist of 10 years of feedback from different tests, and the target task will be to provide feedback on new problems using a minimal quantity of labeled data. We will build on a few-shot learning technique known as Prototypical Networks. We will split the task dataset into two parts for each classification task: a support set and a query set. We utilize the support set for each class label to construct a prototype embedding and then categorize new examples in the query set by measuring the distance between the example embedding and the prototype.
We can visually depict it as something like this: the prototypes are shown in black, which are the average of the examples that you have, and then to evaluate a new example, you’ll look at the distance to each of the prototypes.
Mathematically, p is the prototype embedding for a class, x is the example being classified, and the score for each class is the negative L2 distance between the example’s embedding and that class’s prototype. The embedding space is trained end-to-end with respect to prediction error on the query set of each task.
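The prototype step can be sketched in a few lines. This is a minimal illustration, not the system from the talk: the examples are assumed to be already embedded as small vectors (the learned embedding network is left out), the class names and numbers are hypothetical, and the score uses the squared Euclidean distance, which is the choice I recall from the original Prototypical Networks formulation.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_prototypes(support):
    """support maps each class label to its embedded support examples."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}


def class_probabilities(query, prototypes):
    """Softmax over negative squared distances to each prototype."""
    logits = {label: -squared_l2(query, p) for label, p in prototypes.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - top) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical two-dimensional embeddings for one rubric item.
support = {
    "has_misconception": [[1.0, 1.0], [1.2, 0.8]],
    "no_misconception": [[-1.0, -1.0], [-0.8, -1.2]],
}
prototypes = build_prototypes(support)
probs = class_probabilities([0.9, 1.1], prototypes)
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

During training, the cross-entropy of these probabilities on the query set would be backpropagated into the embedding network, which is what shapes the embedding space.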
A stacked transformer architecture is used to encode the student submissions into an embedding space, and then we train the embedding space using prototypical networks. This allows the model to generalize to new (query set) examples from a limited amount of (support set) data. Unfortunately, integrating transformers with prototypical networks doesn’t work very well by itself, and this is because the model still doesn’t have a lot of prior data.
“Integrating transformers with prototypical networks doesn’t work very well on its own.”
There are three tricks we use to tackle this problem. First, we had only around 259 classification tasks giving feedback on the rubrics, whereas meta-learning frequently operates on thousands of tasks. So, we introduce the notion of data augmentation but raise it to the level of tasks, creating auxiliary tasks like cloze-style tasks or tasks that predict the compilation error. We use these to augment the set of tasks that we have in addition to predicting rubric items. The second trick is using side information. Generally, learning how to give feedback from only 10 or 20 examples can leave ambiguity. So, we use side information that corresponds to the rubric option name and the question text and feed it into the model. We encode this information and prepend the encoded side information as the first token in the stacked transformer. The last trick is that instead of only using the prior education data, we also pre-train the model using unlabeled Python code. In particular, we initialize the embedding network from weights pre-trained in a self-supervised way on Python code. Looking at the overall architecture, we encode the student solutions and train the model such that the prototypes generalize well, allowing the network to give feedback on new student solutions effectively.
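The second trick, prepending side information as the first token, can be illustrated with a small sketch. Everything here is a stand-in: `toy_text_embedding` is a hypothetical placeholder for a learned text encoder, and the rubric name, question text, and token vectors are invented for the example. The sketch only shows the shape of the operation, where one encoded side-information vector goes in front of the sequence of token embeddings.

```python
def toy_text_embedding(text, dim=4):
    """Hypothetical stand-in for a learned text encoder.

    Builds a normalized histogram of character codes modulo dim.
    """
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]


def prepend_side_info(token_embeddings, rubric_name, question_text, dim=4):
    """Return a new sequence with the encoded side information first."""
    side = toy_text_embedding(rubric_name + " " + question_text, dim)
    return [side] + list(token_embeddings)


# Invented token embeddings for a two-token student submission.
tokens = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.5, 0.0, 0.0]]
sequence = prepend_side_info(
    tokens,
    rubric_name="off-by-one in loop bound",
    question_text="Sum the first n integers",
)
print(len(sequence), sequence[0])
```

The transformer then attends over the side-information position together with the code tokens, so the same submission can be embedded differently for different rubric items.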
Evaluating the system offline using the held-out data, we found that it worked pretty well. Perhaps surprisingly, humans achieve an average precision of only 82.5%, whereas our system achieves an average precision of 84.2% on held-out rubrics. In the case of a held-out exam, there’s still room for improvement, but the approach is substantially better than the supervised learning system. Ablations over the tricks mentioned before, such as task augmentation and side information, show that they account for about a 10 percent improvement in average precision. Finally, we deployed this system to give feedback to real students in a free online course. The students took the diagnostic and were given feedback through an interface that matched each predicted rubric item to some text written by the instructor and showed them the corresponding text. The interface also asks the student if they agree with the feedback. It is also worth mentioning that small syntax errors would prevent unit tests from being useful. This is why it’s important to use things like neural networks to look at open-ended code rather than relying on unit tests alone.
We ran a blind randomized trial, evaluated by the students taking the course. Human instructors gave feedback on around a thousand answers, while the model gave feedback on the remaining fifteen thousand solutions. Around two thousand of the solutions could be auto-graded and were not included in the analysis. Students agreed with the feedback around 97% to 98% of the time, which is incredibly high, and they agreed with the feedback given by the model slightly more frequently than they agreed with human feedback. Agreement isn’t the only thing needed; the feedback must also be useful. The students rated the model’s feedback as 4.6 out of 5 in terms of usefulness, which suggests that they not only agreed with it but also found it quite helpful in understanding their misconceptions. Lastly, we also checked for bias, and we didn’t see any signs of bias based on gender or country. A final takeaway is that instead of training jointly, we can optimize for transferable representation spaces, which allow us to solve new target tasks with a small amount of data.
If you’d like to watch Chelsea’s full presentation, you can find it on the Snorkel AI YouTube channel.