Large language models (LLMs) have taken center stage in the world of AI. Their remarkable ability to perform competently even on tasks they’re not directly trained for has excited data scientists and corporate leaders alike. While impressive, these models demand a lot of infrastructure and generate high costs. Distilling LLMs can create models that are just as powerful, but cheaper to run and easier to deploy.

As I described in my presentation at Snorkel AI’s 2023 Enterprise LLM Virtual Summit, advanced techniques for distilling LLMs allow data scientists and enterprises to build powerful, small-footprint models. These small models sometimes exceed the accuracy of the LLMs that spawned them, and their training process sometimes requires a fraction of the training data demanded by traditional techniques.

In two case studies, I show:

  • How one approach reduced data requirements by up to 78.5%.
  • How another approach built a model that matched the accuracy of a fine-tuned GPT-3 model at less than one percent of its size. 

Let’s dive in.

Existing techniques for distilling LLMs and their pitfalls

Machine learning (ML) practitioners today have two primary ways to build small models that can do tasks at high throughput at a low cost: distillation and manual annotation.

Manual annotation presents obvious challenges. Labeling data by hand demands a lot of time and focus. Asking internal subject matter experts to label data imposes significant opportunity costs. While firms can sometimes use crowd labels to skirt this problem, this approach falls short if the task requires specialized knowledge or if the data contains sensitive or private information that must be held in confidence.

Distilling LLMs offers a solution that can keep data in-house. Data scientists use an LLM that already performs well on the chosen task to “teach” a smaller model. To do so, data scientists prompt the large model to classify unlabeled data and use those labels as training targets. When successful, this approach delivers the predictive power of the larger model in a much smaller (and cheaper!) footprint. However, it requires an existing high-performing LLM and abundant unlabeled data, neither of which may be readily available.
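
To make the teacher-student loop concrete, here is a minimal sketch using the Hugging Face transformers and datasets libraries. The teacher (a zero-shot classification pipeline), the DistilBERT student, the example clauses, and the label names are placeholder choices for illustration, not the setup used in the case studies below.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
    pipeline,
)
from datasets import Dataset

# Step 1: the "teacher" labels the unlabeled corpus.
# Any sufficiently capable model can play the teacher; a zero-shot
# classification pipeline keeps this sketch self-contained.
teacher = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

label_names = ["confidentiality", "termination"]
unlabeled_texts = [
    "The parties agree to keep all terms of this agreement confidential.",
    "Either party may terminate this agreement with 30 days written notice.",
]
pseudo_labels = [
    label_names.index(teacher(text, candidate_labels=label_names)["labels"][0])
    for text in unlabeled_texts
]

# Step 2: fine-tune a small "student" model on the teacher's labels.
student_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForSequenceClassification.from_pretrained(
    student_name, num_labels=len(label_names)
)

ds = Dataset.from_dict({"text": unlabeled_texts, "label": pseudo_labels})
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

trainer = Trainer(
    model=student,
    args=TrainingArguments(output_dir="student-model", num_train_epochs=3),
    train_dataset=ds,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```

In practice the teacher’s labels are noisy, so teams often filter out low-confidence predictions before training the student.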

Distilling step-by-step to reduce data needs

A recent collaboration between Snorkel AI and Google Research discovered an alternate approach: “distilling step-by-step.” This approach extracts not just a label from the LLM, but also its rationale—essentially, the reasoning behind its decision.

The pipeline uses this rationale to train a smaller student model to perform both labeling and rationalization. While we found that prompting a smaller model with rationales at inference time boosts accuracy, that approach would still leave us depending on a large, expensive model every time we want to make a prediction, because the rationales themselves would have to come from that model. Instead, the distilling step-by-step method uses the rationale as part of the supervision during training.
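
The sketch below shows the multi-task idea in a simplified form: each LLM-labeled example becomes two training targets for a small seq2seq student, one for the label and one for the rationale, distinguished by a task prefix. The T5 student, the prefixes, and the toy data are assumptions for illustration, not the exact training recipe from the research.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

# Each teacher-labeled example carries both a label and a rationale.
teacher_outputs = [
    {
        "text": "Either party may terminate this agreement with 30 days notice.",
        "label": "termination",
        "rationale": "The clause describes the conditions for ending the agreement.",
    },
]

# Multi-task framing: the same input appears twice with different task
# prefixes, so the student learns to predict the label AND to reproduce
# the teacher's rationale during training.
inputs, targets = [], []
for ex in teacher_outputs:
    inputs.append("[label] " + ex["text"])
    targets.append(ex["label"])
    inputs.append("[rationale] " + ex["text"])
    targets.append(ex["rationale"])

student_name = "t5-small"  # the small student model
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForSeq2SeqLM.from_pretrained(student_name)

def tokenize(batch):
    enc = tokenizer(batch["input"], truncation=True)
    enc["labels"] = tokenizer(text_target=batch["target"], truncation=True)["input_ids"]
    return enc

ds = Dataset.from_dict({"input": inputs, "target": targets})
ds = ds.map(tokenize, batched=True, remove_columns=["input", "target"])

trainer = Trainer(
    model=student,
    args=TrainingArguments(output_dir="step-by-step-student", num_train_epochs=3),
    train_dataset=ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=student),
)
trainer.train()

# At inference time we only prompt the student with the "[label]" prefix,
# so no rationale (and no large LLM) is needed to make a prediction.
```

Keeping the rationale as a training-time signal only is what lets the student stay small and self-sufficient at inference time.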

This approach reduced the amount of training data we needed for specific tasks by 20 to 78.5 percent, while producing small models that outperformed equivalent small models trained directly on all of the available data.

Programmatic labeling to distill multiple LLMs

Snorkel researchers have also long used another way to train effective, efficient models with less human-labeled data: combining signals from multiple sources through programmatic labeling. This approach extends nicely to the LLM era, where it operates as a kind of multi-source distillation.

To demonstrate this, we conducted a case study in which we classified legal provisions in contracts. The manual curation of such a dataset would have required a hefty sum of money and a substantial time commitment from domain experts.

Instead, we used Snorkel Flow to create labeling functions using various large language models and prompting methods and allowed Snorkel’s proprietary algorithm to combine those signals into probabilistic labels. We then used those labels to train a RoBERTa-based model. This effectively distilled the power of multiple LLMs into a model a fraction of the size of any one of them. This small-footprint model matched the accuracy of a fine-tuned and far more costly GPT-3 model—at a fraction of the deployment cost.
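
Here is a rough sketch of the pattern using the open-source snorkel library as a stand-in for Snorkel Flow’s proprietary label model. The labeling functions below use keyword stubs where the case study would prompt different LLMs, and the example clauses and label set are invented for illustration.

```python
from snorkel.labeling import LFApplier, labeling_function
from snorkel.labeling.model import LabelModel

ABSTAIN, CONFIDENTIALITY, TERMINATION = -1, 0, 1

# In the case study, each labeling function would wrap a different LLM or
# prompting method; the keyword stubs below just keep this sketch runnable.
@labeling_function()
def lf_llm_prompt_a(x):
    # e.g. prompt LLM "A": "Does this clause concern confidentiality?"
    return CONFIDENTIALITY if "confidential" in x.lower() else ABSTAIN

@labeling_function()
def lf_llm_prompt_b(x):
    # e.g. prompt LLM "B": "Does this clause concern termination?"
    return TERMINATION if "terminate" in x.lower() else ABSTAIN

@labeling_function()
def lf_heuristic(x):
    # Non-LLM signals can be mixed into the same pipeline.
    return TERMINATION if "days notice" in x.lower() else ABSTAIN

unlabeled_texts = [
    "The parties agree to keep all terms of this agreement confidential.",
    "Either party may terminate this agreement with 30 days notice.",
]

# Apply every labeling function to every example to build the label matrix L.
applier = LFApplier(lfs=[lf_llm_prompt_a, lf_llm_prompt_b, lf_heuristic])
L = applier.apply(unlabeled_texts)

# Combine the noisy, possibly conflicting votes into probabilistic labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L, n_epochs=500, seed=0)
probs = label_model.predict_proba(L)  # shape: (n_examples, n_classes)

# These probabilistic labels can then supervise a small classifier such as
# RoBERTa, trained on the argmax labels or on the soft probabilities.
```

Because each labeling function can wrap a different LLM or prompting method, the label model effectively distills several teachers at once into a single training signal for the student.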

Distilling LLMs: building better, cheaper models faster

Research continues to underline the benefits of model distillation. Where traditional methods fell short, programmatic labeling and distilling step-by-step dramatically reduced costs with little or no sacrifice in performance.

By shifting attention to training smaller, specialized models, and using LLMs to help build them, data science teams can strike a balance between performance and cost. This makes it feasible to deploy models with the performance of LLMs without their bulk. The future of AI looks promising with such innovative and cost-effective approaches on the horizon.

Learn how to get more value from your PDF documents!

Transforming unstructured data such as text and documents into structured data is crucial for enterprise AI development. On December 17, we’ll hold a webinar that explains how to capture SME domain knowledge and use it to automate and scale PDF classification and information extraction tasks.

Sign up here!