The Biden administration this week issued an executive order that outlines a plan for the U.S. government to address the safety, transparency, and use of AI in the near future. This order includes many aspects—from encouraging the use of AI in critical industries to suggesting AI-focused changes to immigration policies—but we at Snorkel AI noticed that several of its points elevate the importance of AI data development and programmatic operations.
Biden’s order repeatedly addresses fairness, bias, and safety. As AI takes a larger role in all of our lives, these are important concerns. They also represent difficult challenges that demand an acute focus on the data used to train the AI and on how that data is prepared, managed, and applied.
As we all absorb what this wide-ranging order means, let’s take a moment to understand where focusing on data helps government agencies (and commercial companies in the future) fulfill the AI Executive Order (AIEO) obligations.
AI data development for trusted AI
A key aspect of the AI executive order directs agencies to develop AI safety tests. The order would require “companies developing any foundation model that poses a serious risk to national security, national economic security, or national public health and safety” to apply these tests to their models and report the results to the government.
Standardized evaluation and benchmark tests for large language models are still active areas of research. However, one of the most fundamental ways to improve the quality—and thereby the trustworthiness and safety—of models with billions of parameters is to improve the quality of their training data. Carefully curated, high-quality data is essential for fine-tuning these large multi-task models. Techniques such as data-centric iteration and fine-tuning let you focus and redirect these large models toward specific tasks (e.g., classifying content according to your taxonomy, or formatting output for your needs). Here is a great tutorial on how to fine-tune GPT-3.5 Turbo and improve its performance by 60% on specialized tasks using Snorkel Flow.
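To make the data-curation step concrete, here is a minimal sketch of how curated (input, label) pairs can be converted into the chat-format JSONL that fine-tuning APIs for chat models such as GPT-3.5 Turbo accept. The example task, category names, and helper function are hypothetical; this is not the Snorkel Flow workflow, just an illustration of the data format.

```python
import json

def to_finetune_jsonl(examples, system_prompt):
    """Convert curated (text, label) pairs into chat-format JSONL lines,
    the training-data shape used when fine-tuning chat models."""
    lines = []
    for text, label in examples:
        record = {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": text},
                {"role": "assistant", "content": label},
            ]
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)

# Hypothetical curated examples for a content-classification task
curated = [
    ("Quarterly earnings beat analyst expectations.", "finance"),
    ("New vaccine trial shows promising results.", "healthcare"),
]
jsonl = to_finetune_jsonl(curated, "Classify the article into a category.")
print(jsonl)
```

Data-centric iteration then amounts to inspecting where the fine-tuned model fails, fixing or adding curated examples, regenerating this file, and fine-tuning again.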
Programmatic data operations for safer AI
Biden’s executive order specifically singled out the risk of AI being used to engineer “dangerous biological materials.” While this might sound like the stuff of science fiction, something similar already happened. A researcher used automated drug discovery software to generate the most toxic molecules it was capable of. Experts disagree on how much danger this exercise actually created (knowing that a molecule is toxic doesn’t tell you how to make it), but it sets a worrying precedent.
To minimize the chance that a bad actor could perform a similar experiment with biological materials, developers will need large volumes of labeled data. Experts will have to apply their knowledge to classify materials as potentially dangerous or not, and labeling documents one by one could take a prohibitive amount of time.
However, the researchers behind these models could greatly accelerate this process through programmatic AI data operations (aidataops) such as labeling, filtering, sampling, and more. Programmatic aidataops allow experts to craft rules or other sources of signal that cover many data points at once. In the time they would spend labeling a single record, experts can apply their knowledge and understanding to dozens, hundreds, or even thousands of documents.
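As a simplified illustration of the idea, each expert rule can be written as a small labeling function that votes on a document or abstains, with votes combined across functions. The rules, label names, and majority-vote combiner below are hypothetical stand-ins; production systems like Snorkel Flow use a learned label model rather than a simple majority vote to weight noisy sources.

```python
# Each labeling function encodes one expert heuristic and returns
# HAZARD, SAFE, or ABSTAIN for a given document.
HAZARD, SAFE, ABSTAIN = 1, 0, -1

def lf_toxin_keywords(doc):
    # Hypothetical rule: flag documents mentioning known toxin terms.
    terms = ("neurotoxin", "nerve agent")
    return HAZARD if any(t in doc.lower() for t in terms) else ABSTAIN

def lf_benign_context(doc):
    # Hypothetical rule: routine lab-safety language suggests benign content.
    return SAFE if "routine safety inspection" in doc.lower() else ABSTAIN

def majority_vote(doc, lfs):
    """Combine labeling-function votes; abstain if no function fires."""
    votes = [v for v in (lf(doc) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

docs = [
    "Synthesis route for a potent neurotoxin precursor.",
    "Report from the routine safety inspection of Lab 4.",
]
labels = [majority_vote(d, [lf_toxin_keywords, lf_benign_context]) for d in docs]
print(labels)  # [1, 0]
```

The leverage comes from scale: each rule takes minutes to write but labels every matching document in the corpus at once, and adding or revising a rule relabels the entire dataset instantly.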
Our researchers recently demonstrated the potency of programmatic dataops when they saved hundreds of hours preparing data to instruction-tune the RedPajama LLM. Manually labeling the instruction-tuning data would have taken months; our researchers did it in two days.
We see other places in the AIEO where a similar philosophy applies—such as building AI systems to detect AI-generated content or enhance cybersecurity. Both require large volumes of labeled data. In the case of cybersecurity, Arista has already proven that programmatic labeling approaches can accelerate adapting to novel cybersecurity threats.
Model distillation for private, specialized AI
The AIEO emphasizes the need to better protect Americans’ privacy, including from the risks posed by AI. It specifically directs federal agencies to develop guidelines to evaluate the effectiveness of privacy-preserving techniques, including those used in AI systems to protect Americans’ data.
This challenge is not unique to government agencies. Commercial organizations with ambitions to leverage generative AI also seek to preserve data privacy and prevent IP leaks. Multi-billion-parameter models are largely black boxes, which complicates organizations’ ability to govern them effectively. However, new techniques are emerging to distill LLMs into smaller specialized models with comparable—and often better—accuracy. Smaller ML models enhance data privacy by using less data, reducing overfitting, simplifying audits, and lowering resource requirements. They are also more amenable to privacy-preserving techniques like differential privacy, whereas large language models, with their complexity and extensive training data, can pose higher privacy risks by potentially exposing sensitive information. Here is a tutorial on how to distill popular LLMs into 1400x smaller, specialized models using Snorkel Flow.
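For readers unfamiliar with distillation, here is a minimal sketch of the classic soft-label objective: the student is trained to match the teacher's temperature-softened output distribution rather than hard labels. The logit values are hypothetical, and this is a bare illustration of the loss term, not Snorkel Flow's distillation pipeline.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature > 1 softens the distribution, exposing the teacher's
    # "dark knowledge" about how similar the classes are to each other.
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the teacher's soft targets and the student's
    predictions -- the core term of soft-label knowledge distillation."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.5]  # hypothetical large-model logits
student = [3.5, 1.2, 0.4]  # hypothetical small-model logits
loss = distillation_loss(teacher, student)
print(round(loss, 4))
```

Minimizing this loss over a training corpus pulls the small model toward the large model's behavior on the target task, which is how a specialized student can match or beat a general-purpose teacher while remaining small enough to audit and govern.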
A new day for AI and the U.S. government
Snorkel has been privileged to work with several government agency partners to help them prepare to fulfill the requirements of this order. Engineers at Snorkel and at those agencies have spent a significant amount of time and effort improving the AI data development capabilities necessary to improve transparency, data governance, and responsible AI.
Overall, we’re excited to see how the order supports pragmatic safety and transparency measures aimed at real issues, not imagined doomsday scenarios. We’re impressed to see that the order includes measures focused on using AI to support cybersecurity, healthcare, and training data privacy—all areas that Snorkel is actively supporting—while also leaning in on government adoption and investment, which has always been a key catalyst for innovation.
We won’t know the ultimate impact of this policy for years to come, but closing the gap between a framework and the practical solutions that fulfill it will require substantial work from developers and bureaucrats alike. We strongly believe that AI data development, carried out with proven approaches such as programmatic data operations and model distillation, will bridge that gap.