Scaling human preferences in AI: Snorkel’s programmatic approach

January 31, 2024

My colleagues and I at Snorkel AI recently worked with several large enterprises to help improve their generative AI applications. There are many axes along which we can improve a language model, including retrieval-augmented generation (RAG), data curation, and fine-tuning. At our recent Enterprise LLM Summit, I gave a talk entitled “Programmatically Scale Human Preferences and Alignment in GenAI” that focused on how to align the output of large language models to enterprise users’ expectations. We do this through scalable techniques that magnify the impact of feedback from your internal experts.

Alignment follows supervised fine-tuning and further refines your large language model (LLM) responses so that they are more likable and adhere to specific workflows or internal policies. This approach appears in the last two phases of training ChatGPT, which use reinforcement learning from human feedback (RLHF).

You can watch my entire talk below on Snorkel’s YouTube page, but I have also summarized my main points here.

User alignment techniques: two approaches

Our work explored two techniques for model alignment: RLHF and direct preference optimization (DPO).

Reward modeling centers on building a model that mimics how an expert would rate a response according to one or more factors. Data scientists show users a prompt-response pair and ask them to rate the response. They then use these ratings to train the reward model, which predicts ratings for future responses. In effect, the reward model approximates how users would rate any given prompt-response pair.

RLHF then uses these predicted ratings as feedback to nudge the LLM’s output toward responses that the simulated users would rate more highly.
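
As a rough illustration of the reward-modeling step, the sketch below fits a tiny linear model to expert ratings and then scores unseen responses. The features, the example ratings, and the function names (`featurize`, `train_reward_model`, `predict_rating`) are all hypothetical stand-ins invented for this example. A production reward model would typically be a fine-tuned neural network, not a hand-built regressor.

```python
# Toy reward model: linear regression on hand-made features (illustrative only).

def featurize(prompt: str, response: str) -> list[float]:
    """Map a prompt-response pair to a small feature vector (hypothetical features)."""
    prompt_words = set(prompt.lower().split())
    response_words = response.lower().split()
    overlap = sum(1 for w in response_words if w in prompt_words)
    return [
        1.0,                                     # bias term
        min(len(response_words), 50) / 50.0,     # capped, normalized length
        overlap / max(len(response_words), 1),   # share of words echoing the prompt
    ]


def train_reward_model(examples, epochs: int = 2000, lr: float = 0.1) -> list[float]:
    """Fit weights by stochastic gradient descent on squared error against ratings."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for prompt, response, rating in examples:
            x = featurize(prompt, response)
            error = sum(w * xi for w, xi in zip(weights, x)) - rating
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return weights


def predict_rating(weights: list[float], prompt: str, response: str) -> float:
    """Predict the rating an expert might give to a new prompt-response pair."""
    x = featurize(prompt, response)
    return sum(w * xi for w, xi in zip(weights, x))


# Made-up expert ratings on a 0-1 scale.
rated_examples = [
    ("How do I reset my password?",
     "Open Settings, choose Security, then select Reset password and follow the emailed link.", 1.0),
    ("How do I reset my password?", "No idea.", 0.0),
    ("What is the refund policy?",
     "Refunds are available within 30 days of purchase with a valid receipt.", 0.9),
    ("What is the refund policy?", "Ask someone else.", 0.1),
]

weights = train_reward_model(rated_examples)
good = predict_rating(weights, "How do I change my email?",
                      "Open Settings, choose Account, then edit the email field and confirm the change.")
bad = predict_rating(weights, "How do I change my email?", "Not sure.")
print(good > bad)
```

In an RLHF loop, scores like `good` and `bad` would serve as the reward signal for the policy update. That step is omitted here.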

DPO operates differently. It optimizes directly on preference data, where each example pairs an accepted response with a rejected response to the same prompt. The training pipeline then encourages the LLM to make the accepted response more likely and the rejected response less likely.
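
To make the push and pull concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It assumes you already have the summed log-probabilities of each response under the model being trained (the policy) and under a frozen reference model. The function name, argument names, and the `beta` default of 0.1 are illustrative choices, not values from our pipeline.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities.

    beta (an assumed value here) controls how far the policy may drift
    from the reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the loss starts at log(2).
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy now favors the accepted response and disfavors the rejected one.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(improved < baseline)
```

The loss falls as the policy raises the accepted response's likelihood relative to the reference and lowers the rejected response's likelihood. That is the behavior described above.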

Reward modeling and DPO both require data that reflect human preferences. Gathering such data can be resource-intensive and time-consuming. This is where the challenge lies: how can we efficiently collect and utilize preference data to improve the alignment of our AI models? Our research sought to address this issue and find scalable solutions for enterprise settings.

Scaling human preferences programmatically

The team at Snorkel uses our Snorkel Flow AI data development platform to scale preferences programmatically. The platform leverages technologies like weak supervision to enable subject matter experts (SMEs) and data scientists to collaborate on labeling functions that express SME preferences at scale.
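
For readers new to labeling functions, the sketch below shows the general idea in plain Python. It does not use the Snorkel Flow API. The three rules, their names, and the majority-vote combiner are hypothetical, and weak supervision systems generally combine votes with a learned label model, so the simple vote here is a stand-in.

```python
ACCEPT, REJECT, ABSTAIN = 1, 0, -1


def lf_too_short(prompt: str, response: str) -> int:
    """Hypothetical SME rule: very short answers are rarely useful."""
    return REJECT if len(response.split()) < 5 else ABSTAIN


def lf_refusal(prompt: str, response: str) -> int:
    """Hypothetical SME rule: unhelpful refusals should be rejected."""
    return REJECT if "i cannot help" in response.lower() else ABSTAIN


def lf_cites_policy(prompt: str, response: str) -> int:
    """Hypothetical SME rule: answers that reference the policy are preferred."""
    return ACCEPT if "policy" in response.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_too_short, lf_refusal, lf_cites_policy]


def weak_label(prompt: str, response: str) -> int:
    """Combine labeling-function votes by simple majority; ties abstain."""
    votes = [lf(prompt, response) for lf in LABELING_FUNCTIONS]
    accepts = votes.count(ACCEPT)
    rejects = votes.count(REJECT)
    if accepts == rejects:
        return ABSTAIN
    return ACCEPT if accepts > rejects else REJECT


label = weak_label(
    "What is the refund window?",
    "Per our refund policy, purchases can be returned within 30 days.",
)
print(label == ACCEPT)
```

Each function encodes one expert heuristic once, and the combiner then applies it to every response in the corpus. This is how a handful of rules can stand in for a large manual annotation effort.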

Using Snorkel Flow we built high-quality scoring models that successfully classified and quality-scored responses at high accuracy—and we did so quickly. Snorkel Flow allowed us to complete these models in as little as a day, bypassing the need for weeks or months of manual annotation.

Hoang Tran talks about how his team developed new approaches to scale human preferences and align LLM output to enterprise users’ expectations by magnifying SME impact.

Snorkel originally developed these tools to help enterprises build classification models. Our experiments have shown that they’re just as impactful—and perhaps more so—in fine-tuning generative AI (GenAI) models.

We can use this reward model to curate the data for DPO alignment. Our AlpacaEval 2.0 result, where we hold second place (behind only gpt-4-turbo), shows the effect. You can read more about the approach on our model card or in the diagrams below.

We also demonstrated the effectiveness of this approach when we built labeling functions to curate instruction-tuning data for Together AI’s RedPajama model. In the space of a couple of days, we curated 20,000 prompt/response pairs down to the best 10,000. In testing, users preferred the version of RedPajama we trained from that refined corpus over the original model in every category. We continue to apply the lessons we learned through that project with our recent cohort of enterprise GenAI projects.
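
One simple way to picture this kind of curation is to score every prompt/response pair and keep the best-scoring half. The sketch below does that. The selection rule, the `curate_top_k` helper, and the placeholder `toy_quality_score` function are assumptions made for illustration, not a description of the RedPajama project's actual method.

```python
def curate_top_k(pairs, score_fn, keep: int):
    """Return the `keep` highest-scoring (prompt, response) pairs, best first."""
    ranked = sorted(pairs, key=lambda pair: score_fn(pair[0], pair[1]), reverse=True)
    return ranked[:keep]


def toy_quality_score(prompt: str, response: str) -> int:
    """Placeholder score: number of distinct words in the response."""
    return len(set(response.lower().split()))


corpus = [
    ("Name a prime number.", "7"),
    ("Name a prime number.", "Seven is a prime number because only 1 and 7 divide it."),
    ("Define recall.", "Recall."),
    ("Define recall.", "Recall is the share of true positives that the model finds."),
]

# Keep the best half of the corpus, mirroring a 20,000 -> 10,000 reduction.
curated = curate_top_k(corpus, toy_quality_score, keep=len(corpus) // 2)
for prompt, response in curated:
    print(prompt, "->", response)
```

In practice, the placeholder score would be replaced by the output of labeling functions or a trained quality model.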

Implementing a reward model to scale human preferences

Once we had built a reward model, we integrated it into a DPO self-training pipeline. DPO proved to be the better option because it is less computationally expensive, but we believe we would have achieved similar results with other reinforcement learning approaches.

We started with a base model and 10,000 prompts. For each prompt, we asked the base model to generate five responses. We then ranked these responses according to our reward model.

In the first version of our pipeline, we labeled the top-ranked response as “chosen” and a randomly selected one of the remaining responses as “rejected.” A later version of our pipeline further improved model performance by instead rejecting the response that our reward model ranked worst.
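
The two pairing strategies can be sketched as follows. The `build_preference_pair` helper, the record layout, and the `toy_reward` function are hypothetical. In a real pipeline, `reward_fn` would call the trained reward model and the candidates would come from the base LLM.

```python
import random


def build_preference_pair(prompt, responses, reward_fn, strategy="worst", rng=None):
    """Rank responses with reward_fn and return one DPO training record.

    strategy="random": reject a random response other than the top-ranked one.
    strategy="worst":  reject the lowest-ranked response.
    """
    if len(responses) < 2:
        raise ValueError("need at least two responses to form a pair")
    ranked = sorted(responses, key=lambda r: reward_fn(prompt, r), reverse=True)
    chosen = ranked[0]
    if strategy == "worst":
        rejected = ranked[-1]
    elif strategy == "random":
        rejected = (rng or random).choice(ranked[1:])
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def toy_reward(prompt: str, response: str) -> int:
    """Stand-in reward: longer responses score higher (placeholder only)."""
    return len(response.split())


# Five candidate responses for one prompt, as in the pipeline described above.
candidates = [
    "DPO.",
    "DPO is a method.",
    "DPO trains on preference pairs.",
    "It is an alignment technique for language models.",
    "DPO trains a model directly on chosen and rejected responses.",
]

pair = build_preference_pair("Explain DPO.", candidates, toy_reward, strategy="worst")
print(pair["chosen"])
print(pair["rejected"])
```

Rejecting the worst-ranked response widens the gap between the two sides of each pair, which may be one reason the later version performed better.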

Results

Using only 10,000 prompts, our team achieved a 7.4-point (9.5%) increase on the AlpacaEval benchmark in under a day. Notably, we achieved these results without using responses from other language models.

The reward model we developed in Snorkel Flow also outperformed a leading open-source alternative. For the sake of comparison, we trained a second LLM using the Open Assistant reward model. The resulting model scored 83.31 on the AlpacaEval benchmark. Our model scored 85.2.

These results demonstrate the potential of this approach in enterprise settings, where customization and efficiency are paramount.

High-performance LLM applications demand scalable alignment

Enterprise data science teams need alignment beyond fine-tuning to optimize GenAI responses. This requires human preferences and SME involvement. Historically, the labeling process made it prohibitively expensive to involve SMEs in aligning model performance. However, Snorkel’s technology and a data-centric workflow make this process more scalable, trackable, and transferable.

More Snorkel AI events coming!

Snorkel has more live online events coming. Look at our events page to sign up for research webinars, product overviews, and case studies.

If you're looking for more content immediately, check out our YouTube channel, where we keep recordings of our past webinars and online conferences.

Hoang Tran
Senior Machine Learning Engineer

Hoang Tran is a Senior Machine Learning Engineer at Snorkel AI, where he leverages his expertise to drive advancements in AI technologies. He also serves as a Lecturer at VietAI, sharing his knowledge and mentoring aspiring AI professionals. Previously, Hoang worked as an Artificial Intelligence Researcher at Fujitsu and co-founded Vizly, focusing on innovative AI solutions. He also contributed as a Machine Learning Engineer at Pictory.

Hoang holds a Bachelor’s degree in Computer Science from Minerva University, providing a solid foundation for his contributions to the field of artificial intelligence and machine learning.

Connect with Hoang to discuss AI research, machine learning projects, or opportunities in education and technology.
