We used weak supervision to programmatically curate instruction tuning data for open-source LLMs.

Instruction tuning (fine-tuning on high-quality responses to instructions) has emerged as an important step in developing performant large language models (LLMs) for generative AI tasks. While industry-backed LLMs such as ChatGPT, Bard, Claude, and even the open-source Llama 2 have relied on massive, expensive proprietary datasets unavailable to the public, the open source community has banded together to create similar datasets such as OpenAssistant and Dolly that are available to everyone.

Unfortunately, high variance in the quality and distribution of responses collected by volunteers has limited the quality of resulting open source models. As a company whose origins were an open source project, and with nearly a decade of experience in turning noisy sources of signal into high-quality supervision, we at Snorkel were curious to see what could be done to address these data-centric problems head-on.

Today, we share a promising proof-of-concept result: by programmatically scoring, sampling, and filtering a collection of open source prompt/response datasets with Snorkel’s Foundation Model Data Platform, we observed a nine-point (24%) increase in response win rate against ChatGPT responses (to 46.7%), based on double-blind evaluation by experienced annotators across a variety of standard generative AI tasks. Because this data curation was performed programmatically rather than manually, it was accomplished by two developers in a day instead of by hundreds of annotators over weeks or months of labeling. We describe our methodology and release the resulting curated dataset and data curation model for public use.

We validated the downstream effect of fine-tuning on this higher-quality data with the Together AI fine-tuning service (now available via API), which we used to create an improved version of the open-source RedPajama chat LLM with full data transparency. Experiments showed improvement across every major instruction category (up to 10 points), with boosts as high as 12 points for specific tasks (such as writing emails). We released the resulting fine-tuned RedPajama model as well.

This proof-of-concept only scratches the surface of what can be done to programmatically curate training data for generative AI. We’re excited to explore, develop, and share further findings in the coming months. If you’re interested in learning more or contributing, join the discussion on our Slack channel.

Background: what is RedPajama?

For these experiments, we use the RedPajama family of LLMs. The RedPajama project aims to create a set of leading, fully open-source large language models (LLMs) for natural language processing, including not just open model weights, but also open training data. It is a collaboration between Together, Ontocord.ai, ETH DS3Lab, the Stanford Center for Research on Foundation Models, Hazy Research, MILA Québec AI Institute, and Snorkel AI.

RedPajama models are pre-trained on the same data as the LLaMA models released by Meta, but under an open license that makes them available for commercial applications and provides a more transparent research pipeline. Academic LLaMA derivatives such as Alpaca, Vicuna, and Koala used ChatGPT outputs for instruction tuning, and OpenAI’s license for ChatGPT doesn’t allow models trained on those outputs to compete commercially with ChatGPT. With RedPajama, all instruction tuning data was provided by human volunteers, thus preserving its open license.

The challenge: imbalanced, messy data

Machine learning practitioners know the value of training large language models with well-curated data. OpenAI has reported using human labelers to collect the data for fine-tuning its GPT models and is said to have hired hundreds or thousands of additional contractors since ChatGPT was released. Meta reported collecting over a million inputs from paid annotators for Llama 2.

But this approach is expensive, time-consuming, and out of reach for all but the most well-funded companies, making the use of free, open-source alternatives for data curation appealing if sufficiently high data quality can be achieved.

The RedPajama project released the first version of its instruction-tuned models in May, using two of the largest open instruction datasets available today: OpenAssistant, a community-driven dataset with a public user interface, and Dolly 2.0, a dataset collected by employees at Databricks.

While naively combining these two datasets allowed for an initial instruction-tuned version of the RedPajama model, two challenges stood out:

  1. Distribution. Instruction datasets typically contain multiple classes of instructions—for example: brainstorming, question answering, summarization, etc. Knowing what types of tasks a downstream model will be expected to perform can inform which types of instructions are used during the instruction-tuning process (and in what ratios) for better results. While the Dolly dataset contained such labels, the OpenAssistant dataset did not—leaving a large portion of the training data unclassified and potentially imbalanced.
  2. Quality. While the OpenAssistant dataset contained quality scores assigned to responses by community members, we found the scoring inconsistent. The Dolly dataset contained no indicators of response quality at all. Both data sets contained many low-quality results that should not be used to instruction-tune a model.

As additional open source instruction data becomes available, we can expect to see more of the same—differing levels of annotation (including instruction class and quality scores), varying distributions of instructions, and varying qualities of responses. While not as expensive as collecting the data from scratch, manually inspecting, tagging, and filtering such datasets quickly becomes just as untenable.

A llama in red pajamas wearing a snorkel and mask: our playful take on GenAI data development.
Image generated using DALL-E.

The solution: scalable programmatic data curation

We decided to develop a pipeline to efficiently curate generative AI instruction datasets, using the same programmatic approach that Snorkel has used for years to make predictive AI dataset creation 10-100x faster and cheaper. Programmatic data annotation is central to this process: it gave us high-quality labeled datasets for training the curation models described below.

To accomplish this, we used a pair of models developed in just half a day with Snorkel: one to categorize instruction classes, and the other to estimate response quality (for filtering out low-quality responses).

Categorizing prompts to programmatically shape dataset distribution

To inspect the distribution of our dataset, we first had to categorize each instruction. We decided on a schema appropriate for the general purpose chatbot that RedPajama is intended to be, with the following six categories:

  1. Open-qa: question answering without context, e.g., “When was Google founded?”
  2. Closed-qa: question answering from a provided context, e.g., “Look at the following paragraph and tell me how many mentions of fruit there are.”
  3. Brainstorming: e.g., “Give me some ideas for planning a beach trip.”
  4. Generation: e.g., “Write me an essay comparing baroque with minimalist music.”
  5. Summarization: e.g., “Summarize the main points from this news article.”
  6. Other: anything that did not fit the previous five categories.

With the classes defined, we wrote simple labeling functions to propose a category for each instruction. For example, one function might label any instruction that starts with “who,” “what,” “when,” “where,” or “why” and is not followed by a long block of contextual text as “open-qa.”
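
For illustration, here is a minimal sketch of what such labeling functions can look like, written with the open-source Snorkel library. The specific heuristics and the 40-word cutoff are assumptions for this example, not the exact rules we used.

```python
# Hedged sketch: two toy labeling functions in the style described above,
# using the open-source Snorkel library. The heuristics are illustrative.
from snorkel.labeling import labeling_function

# Integer labels for the six instruction classes, plus ABSTAIN for "no vote".
ABSTAIN, OPEN_QA, CLOSED_QA, BRAINSTORMING, GENERATION, SUMMARIZATION, OTHER = (
    -1, 0, 1, 2, 3, 4, 5,
)
QUESTION_WORDS = ("who", "what", "when", "where", "why")


@labeling_function()
def lf_open_qa(x):
    """Question word up front and no long block of attached context -> open-qa."""
    text = x.instruction.strip().lower()
    if text.startswith(QUESTION_WORDS) and len(text.split()) < 40:
        return OPEN_QA
    return ABSTAIN


@labeling_function()
def lf_summarization(x):
    """Explicit requests to summarize suggest a summarization prompt."""
    if any(k in x.instruction.lower() for k in ("summarize", "summarise", "tl;dr")):
        return SUMMARIZATION
    return ABSTAIN
```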

These “weak labels” from labeling functions weren’t perfectly precise. Multiple labeling functions sometimes overlapped and proposed conflicting categories for the same data point. This is an expected part of any weak supervision pipeline, and we used established algorithmic techniques for aggregating weak labels based on labeling statistics to resolve conflicts.
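
Concretely, the aggregation step can be sketched with the open-source Snorkel LabelModel; `df_train` and the column names below are assumptions for illustration, not our exact pipeline.

```python
# Hedged sketch: apply the labeling functions to a DataFrame of instructions
# and let Snorkel's LabelModel resolve their conflicts into training labels.
from snorkel.labeling import PandasLFApplier
from snorkel.labeling.model import LabelModel

lfs = [lf_open_qa, lf_summarization]      # ...plus the rest of the labeling functions
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)      # (n_examples, n_lfs) matrix of votes

label_model = LabelModel(cardinality=6, verbose=True)  # six instruction classes
label_model.fit(L_train=L_train, n_epochs=500, seed=123)
df_train["instruction_class"] = label_model.predict(L=L_train)
```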

With the resulting labeled training set, we trained a classifier capable of categorizing all of the instructions for the original data set. We learned that RedPajama’s V1 instruction dataset under-emphasized brainstorming and generation prompts, which we were able to augment in our curated dataset.
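
As a rough illustration (not the model we shipped), training such a classifier on the programmatically labeled instructions can be as simple as a standard text-classification pipeline:

```python
# Hedged sketch: fit a simple text classifier on the weakly labeled instructions
# so that every prompt, including those no labeling function covered, gets a class.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)
labeled = df_train[df_train["instruction_class"] != ABSTAIN]  # drop abstained rows
clf.fit(labeled["instruction"], labeled["instruction_class"])
df_train["instruction_class"] = clf.predict(df_train["instruction"])
```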

Table 1: Instruction Class Distribution

| Instruction class | Uncurated count | Uncurated % | Curated count | Curated % |
|---|---|---|---|---|
| brainstorming | 1871 | 10.2 | 1208 | 13.4 |
| closed-qa | 5851 | 31.9 | 1903 | 21.0 |
| generation | 1071 | 5.8 | 1690 | 18.7 |
| open-qa | 8604 | 46.9 | 3852 | 42.6 |
| other | 16 | 0.1 | 20 | 0.2 |
| summarization | 939 | 5.1 | 373 | 4.1 |
| total | 18352 | 100 | 9046 | 100 |

Importantly, because all labeling for this classifier is performed programmatically, should we decide to adjust the schema in the future to track different instruction categories, we can simply add new labeling functions or adjust existing ones to reflect the new class boundaries and regenerate the training set and model in a matter of minutes, rather than needing to review all labels manually.

Estimating response quality to remove noisy responses

We then developed a second model to identify high-quality responses. This model used a more complex feature space that included measures of response length, response perplexity, instruction/response cosine similarity in the SimCSE embedding space, and other syntactic features.
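
As a sketch of the kinds of features involved (the exact feature set and models are not reproduced here), one could compute response length, perplexity under a small causal LM, and SimCSE instruction/response cosine similarity roughly as follows; the GPT-2 perplexity model is a stand-in assumption.

```python
# Hedged sketch of the feature space described above: response length,
# response perplexity under a small causal LM, and instruction/response
# cosine similarity in a SimCSE embedding space.
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

simcse_tok = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
simcse = AutoModel.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
ppl_tok = AutoTokenizer.from_pretrained("gpt2")
ppl_lm = AutoModelForCausalLM.from_pretrained("gpt2")


@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    """Sentence embedding from the SimCSE encoder's pooled [CLS] output."""
    inputs = simcse_tok(text, return_tensors="pt", truncation=True, max_length=512)
    return simcse(**inputs).pooler_output[0]


@torch.no_grad()
def perplexity(text: str) -> float:
    """exp(mean token cross-entropy) of the text under the causal LM."""
    ids = ppl_tok(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    return float(torch.exp(ppl_lm(ids, labels=ids).loss))


def quality_features(instruction: str, response: str) -> list[float]:
    sim = torch.cosine_similarity(embed(instruction), embed(response), dim=0)
    return [float(len(response.split())), perplexity(response), float(sim)]
```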

While training data can be generated programmatically, we always recommend that evaluation be performed on ground truth annotations from experts. For ground truth, we used user ratings of responses from the OpenAssistant dataset. We noticed some inconsistent and questionable community ratings, so we also supplemented the dataset with ground truth from our own internal judgments.

After we trained the model, we measured the win rate by response quality on a sample from the validation split and investigated the tradeoffs among projected win rate, instruction diversity, and dataset volume as a function of the response quality threshold. We observed that as the threshold increased, projected human win rates increased monotonically and instruction class entropy remained stable until a threshold of ~0.9. To maintain a target dataset volume of at least 8k-10k samples and hedge against a drop in the diversity of unmodeled dataset features, we chose 0.5 as the threshold below which responses would be filtered out of the curated dataset.
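
A simplified version of that sweep looks roughly like the snippet below; the column names (`quality_score`, `instruction_class`, `projected_win`) are assumptions for illustration.

```python
# Hedged sketch of the threshold analysis behind Table 2: for each candidate
# quality threshold, report surviving volume, instruction-class entropy (bits),
# and the mean projected win rate of the surviving responses.
import numpy as np
import pandas as pd


def class_entropy_bits(labels: pd.Series) -> float:
    p = labels.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())


def sweep_thresholds(df: pd.DataFrame, thresholds=(0.1, 0.3, 0.5, 0.7, 0.9)) -> pd.DataFrame:
    rows = []
    for t in thresholds:
        kept = df[df["quality_score"] >= t]
        rows.append({
            "threshold": t,
            "volume": len(kept),
            "class_entropy_bits": class_entropy_bits(kept["instruction_class"]),
            "projected_win_rate": kept["projected_win"].mean(),
        })
    return pd.DataFrame(rows)
```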

Table 2: Projected Features of Curated Dataset as a Function of Response Quality Threshold

| Response quality threshold | Curated dataset volume | Instruction class distribution entropy (bits) | Projected human win rate |
|---|---|---|---|
| 0.1 | 29616 | 1.86 | 0.33 |
| 0.3 | 18951 | 1.95 | 0.38 |
| 0.5 | 11050 | 2.07 | 0.44 |
| 0.7 | 3416 | 1.94 | 0.50 |
| 0.9 | 823 | 0.24 | 0.53 |
*helpful-instructions dataset included in analysis.

Programmatic filtering for selective dataset augmentation

While our exercise focused primarily on finding an optimized mix of high-quality training data from the original corpus, we also explored augmenting that mix with a similarly curated selection of data from the “Helpful Instructions” dataset, which includes data from Anthropic.

Our response-quality model indicated poorer-quality responses in this dataset overall. We were able to sift out much higher-quality responses from a small portion of the dataset, but we opted to leave it out of the final released dataset because those responses were still lower quality than those in the Dolly and OpenAssistant datasets.

Evaluation

The open-ended nature of free text makes any effort to evaluate it objectively a challenge. We therefore used a comparative approach: pitting the responses in our curated training corpus against responses from ChatGPT (GPT-3.5 Turbo).

We aimed to see if humans prefer our training examples roughly as often as those generated by ChatGPT (presumably trained on many thousands if not millions of manually curated responses), and we made meaningful progress toward that goal.

Evaluation methods

To evaluate our dataset quality, we developed double-blind experiments and enlisted the help of employees across the company as well as a third-party labeling vendor with prior experience in NLP and generative AI.

The experiment asked subjects to compare human responses already in the datasets with ChatGPT responses to the same prompts. Following the methods of Anthropic, the experiment asked subjects which of the two responses was more “helpful, honest, and harmless”. We then tracked the “human win rate”: the proportion of samples in which the human-generated response in the dataset was judged to be higher quality than the one provided by ChatGPT.
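
The metric itself is simple; a minimal sketch, assuming a judgment record format of our own invention for illustration:

```python
# Hedged sketch of the human win rate: the fraction of double-blind pairwise
# judgments in which the dataset's human-written response was preferred over
# the ChatGPT response to the same prompt.
def human_win_rate(judgments: list[dict]) -> float:
    wins = sum(1 for j in judgments if j["preferred"] == "human")
    return wins / len(judgments)
```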

Evaluation results

The Snorkel-curated dataset showed highly significant improvement across each of the source datasets as well as each task type. When comparing a random sample of the curated Dolly and OpenAssistant instruction/response pairs (n=1554) with a sample of the original data used for RedPajama v1 (n=1000), we found a nine-point (24%) increase in win rate against ChatGPT, from 37.8% to 46.7%.

We compared human win rates in random samples (out-of-sample, not used for model development) from the RedPajama V1 dataset with the curated version, finding improvements across all five major instruction classes. We found the greatest lift in the brainstorming, generation, and open-qa tasks. It is inconclusive whether the other trends are significant due to smaller sample sizes, and future work will include generating more judgments on those under-represented classes.

Table 3: Effect of Curating on Win Rates by Instruction Class

| Instruction class | Uncurated eval sample size | Uncurated human win rate | Curated* eval sample size | Curated* human win rate |
|---|---|---|---|---|
| brainstorming | 106 | 0.302 | 267 | 0.418 |
| closed-qa | 303 | 0.459 | 389 | 0.486 |
| generation | 63 | 0.397 | 334 | 0.485 |
| open-qa | 468 | 0.331 | 780 | 0.468 |
| summarization | 52 | 0.423 | 83 | 0.446 |
| all | 1000 | 0.378 | 1855 | 0.467 |
*The helpful-instructions dataset was omitted from the final version because of low quality scores from the curation model.
**The “other” class was not analyzed due to its rarity (see Table 1) and the small number of samples in the experiment.

We also compared human win rates in the curated set with a random sample balanced by dataset, finding a 7- to 12-point increase in win rate for each of the three source datasets:

Table 4: Effect of Curating on Dataset Win Rates

| Dataset | Balanced sample eval size | Balanced sample human win rate | Curated eval size | Curated human win rate |
|---|---|---|---|---|
| dolly | 400 | 0.351 | 520 | 0.438 |
| helpful-instructions | 400 | 0.244 | 44 | 0.355 |
| open_assistant | 400 | 0.424 | 1034 | 0.485 |

Effect on downstream models

Thanks to the open nature of the RedPajama family of models, we were able to compare the quality of instruction-tuned models trained on the curated and uncurated datasets.

The original instruction-tuned version (v1.0) was trained on all ~20,000 samples from the Dolly 2.0 and Open Assistant datasets. We used the curation model to select the ~10,000 highest-quality samples, ensuring that the distribution of instruction classes remained as varied as in the larger dataset. Using this curated dataset, we created an instruction-tuned version (v1.5) to compare against the original.
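
A rough sketch of that selection step is shown below; the column names, target proportions, and budget are illustrative assumptions rather than the exact recipe.

```python
# Hedged sketch: keep the highest-quality samples per instruction class, in
# proportions that mirror a target distribution, up to a total budget.
import pandas as pd


def curate(df: pd.DataFrame, target_share: dict[str, float], budget: int = 10_000) -> pd.DataFrame:
    parts = []
    for cls, share in target_share.items():
        k = int(budget * share)
        cls_rows = df[df["instruction_class"] == cls]
        parts.append(cls_rows.nlargest(min(k, len(cls_rows)), "quality_score"))
    return pd.concat(parts).sample(frac=1.0, random_state=0)  # shuffle before training
```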

In a head-to-head comparison between the two models on 1,426 samples from an instruction dataset not used to train either model (OpenChatKit), we found that the model fine-tuned on the curated data outperformed the original in all five major categories, with boosts ranging from 3.5 to 10 points.

Table 5: Effect of Curating on RPJ Quality

| Instruction class | Sample size | RPJ v1.5 win rate over v1.0 |
|---|---|---|
| brainstorming | 268 | 0.560 |
| closed-qa | 211 | 0.600 |
| generation | 387 | 0.556 |
| open-qa | 302 | 0.535 |
| summarization | 258 | 0.547 |

We also probed model performance on finer-grained tasks, such as email generation, which saw an even larger win-rate boost of 12 points. We inspected this task first because we had observed that one of the most common weaknesses of the uncurated data was an apparent lack of effort and completeness: responses to generation tasks were often too short.

For most enterprise applications of LLMs, certain tasks will be of higher value than others. In these situations, a data-centric iteration loop between identifying failure modes of the model and then making corresponding updates or corrections to the training data is the most efficient approach we’ve seen for getting to production quality.

Conclusion

Increasingly in the age of LLMs, model development (whether for generative or predictive AI tasks) centers on data development. With these initial results, we share further empirical evidence that programmatic, data-centric operations can be applied effectively to the curation of instruction-tuning datasets, shrinking the gap to closed-source models trained on massive, manually collected datasets.

  • To download or learn more about the curated training set, check out its dataset card on Hugging Face.
  • To download or learn more about the instruction data curation model, check out its model card on Hugging Face.
  • To download or learn more about the RedPajama instruction-tuned model, check out its model card on Hugging Face.

If you have further questions or have ideas for how programmatic data curation might be applied to your domain, join the conversation on our Slack channel.

Acknowledgments

Hoang Tran, Senior Machine Learning Engineer at Snorkel AI, made significant intellectual and technical contributions to this project. We’d like to thank the Together team for their excellent collaboration on this project, especially Xiaozhe Yao, Ce Zhang, and Jamie de Guerre.