What is large language model (LLM) alignment?
The neural network architecture of large language models makes them black boxes. Neither data scientists nor developers can tell you how any individual model weight impacts its output; they often can’t reliably predict how small changes in the input will change the output. So, how do enterprises ensure that their models’ responses comply with company policies and general decency? They use a process called LLM alignment.
Large language model alignment uses a data-centric approach to encourage generative AI outputs to abide by organizational values, principles, or best practices. Data scientists collect user reactions that indicate what model responses they did and didn’t like and then use that information to nudge the model more in line with user preferences.
This process results in bespoke behaviors that maximize the effectiveness of the model for the organization using it.
Below, we will explain how alignment builds better LLM experiences.
Let’s dive in.
How does large language model alignment work?
Think of large language model alignment as on-the-job training. When a company hires a new employee, they can execute their tasks, but that doesn’t mean they will do them perfectly within workplace guidelines on the first day. The manager observes the employee and tells them both what they’re doing well and what they can do better. Then, the employee adjusts.
Aligning an LLM works similarly. Upon deployment, the model’s architecture, pre-training, and fine-tuning likely give it the raw capability to mimic your organization’s style and preferred format, but it needs guidance to do so reliably. To provide this guidance, the data scientists in charge of the model solicit user feedback and use it to encourage some kinds of responses while discouraging others.
Results from Microsoft’s paper on its instruction-tuned LLM, Orca, clearly show the benefits of alignment. Models such as Orca and ChatGPT that have been aligned for safety and truthfulness significantly outperform models that have only been aligned for helpfulness, such as Vicuna, in those categories.
An important note: alignment nudges responses to mimic the preferences of the users who label the data. That means a model aligned to one organization’s preferences may respond quite differently from a model aligned to another’s.
Large language model alignment starts with data
Training a large language model involves three phases:
- Pre-training.
- Fine-tuning.
- Alignment.
Every step of this process demands data, and a lot of it. Alignment requires data in the form of prompts, the model’s responses to those prompts, and some form of human feedback on each prompt/response pairing (whether direct or by proxy).
Researchers and engineers have developed a variety of techniques to collect this data. In the simplest setup, a pipeline gives the model a prompt, the model generates multiple responses, and human evaluators compare the responses and select the one they prefer.
In this scenario, the humans serve as the oracle, i.e., the mechanism that provides the model with a reward signal. In some variants, annotators assign responses a numerical score using a standard such as the Likert scale, and sometimes responses are even evaluated with free-form textual descriptions. Model managers may also collect human feedback indirectly, through signals like clicked links, follow-up questions, or the amount of time spent reading a response.
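Concretely, one unit of alignment data often boils down to a prompt, a pair of candidate responses, and a signal about which response the annotator preferred. The sketch below shows one hypothetical way to represent such a record in Python; the field names are illustrative rather than a standard schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PreferenceRecord:
    """One unit of alignment data: a prompt, two candidate responses,
    and the human (or proxy) signal about which response was preferred."""
    prompt: str
    chosen: str                    # the response the annotator preferred
    rejected: str                  # the response the annotator passed over
    score: Optional[float] = None  # optional Likert-style rating of the chosen response

record = PreferenceRecord(
    prompt="Summarize our refund policy in two sentences.",
    chosen="Customers may request a refund within 30 days of purchase...",
    rejected="Refunds are handled by the finance team, I think.",
    score=4.0,
)
```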
The mechanics of LLM alignment
Researchers and engineers have developed several approaches to turn user response data into useful alignment signals, the earliest of which was reinforcement learning from human feedback (RLHF).
In RLHF, data scientists use the collected feedback to train a reward model. The reward model accepts a prompt/response pair and predicts how the average human in the data-labeling cohort would rate it. During training, this reward model scores the LLM’s outputs, and the LLM adjusts its neural network weights to produce responses that earn higher predicted scores.
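As a rough sketch of that first step, the reward model is commonly trained on preference pairs with a Bradley-Terry-style objective: it should assign the chosen response a higher score than the rejected one. The PyTorch snippet below is a minimal illustration; `reward_model` is a stand-in for any network that maps a prompt/response pair to a scalar, not a reference to a specific library.

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompts, chosen, rejected):
    """Bradley-Terry-style preference loss: push the score of the chosen
    response above the score of the rejected response."""
    r_chosen = reward_model(prompts, chosen)      # shape: (batch,)
    r_rejected = reward_model(prompts, rejected)  # shape: (batch,)
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

In the full RLHF loop, the scalar rewards from this model then drive a reinforcement learning update (typically PPO) on the LLM itself.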
RLHF still dominates alignment efforts at frontier model developers, but other approaches have gained ground in other corners of the AI world.
These include:
- Direct preference optimization (DPO)
- Odds ratio preference optimization (ORPO)
- Kahneman-Tversky optimization (KTO)
- Contrastive fine-tuning (CFT)
Direct preference optimization (DPO)
DPO fits the model to human preferences directly instead of using the intermediary of a reward model. DPO has achieved results comparable to—or better than—RLHF while vastly simplifying the model fine-tuning and preference optimization process.
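In rough terms, DPO asks how much more the model being tuned prefers the chosen response over the rejected one, relative to a frozen reference copy of the model, and pushes that margin up. The sketch below assumes you already have summed token log-probabilities for each response under both models; it mirrors the published DPO loss but is not tied to any particular library.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct preference optimization loss (minimal sketch).

    Each argument is a tensor of summed log-probabilities of a response
    under either the policy being tuned or the frozen reference model.
    """
    # Implicit rewards: how much more the policy prefers a response than the reference does
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the chosen and rejected implicit rewards
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```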
Odds ratio preference optimization (ORPO)
ORPO folds preference optimization into standard supervised fine-tuning. Rather than training a separate reward model or keeping a frozen reference model, it adds a penalty based on the odds ratio between the model’s likelihood of generating the preferred response and its likelihood of generating the rejected one, nudging the model toward preferred outputs in a single training stage.
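As a simplified sketch of ORPO’s preference term (notation loosened from the published formulation): the model’s “odds” of a response is p / (1 − p), where p is its length-averaged probability of generating that response, and the loss rewards a large log odds ratio between the preferred and rejected responses.

```python
import torch
import torch.nn.functional as F

def orpo_preference_term(chosen_logps, rejected_logps):
    """Odds-ratio term from ORPO (simplified sketch).

    chosen_logps / rejected_logps: length-averaged log-probabilities of the
    preferred and rejected responses under the model being fine-tuned.
    """
    def log_odds(logp):
        # log(p / (1 - p)) computed from log p
        return logp - torch.log1p(-torch.exp(logp))

    log_odds_ratio = log_odds(chosen_logps) - log_odds(rejected_logps)
    # Encourage much higher odds for the preferred response
    return -F.logsigmoid(log_odds_ratio).mean()
```

In the full method, this term is scaled by a small weight and added to an ordinary supervised fine-tuning loss on the preferred response.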
Kahneman-Tversky optimization (KTO)
KTO leverages insights from behavioral economics, specifically Kahneman and Tversky’s prospect theory, which observes that people weigh losses more heavily than equivalent gains. Rather than requiring paired comparisons, KTO learns from individual responses labeled simply as desirable or undesirable, applying a prospect-theory-inspired value function to those binary signals. This makes the data easier to collect and the resulting behavior more consistent with how people actually judge responses.
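The full KTO objective involves a KL-based reference point estimated during training; the sketch below is a heavily simplified stand-in that keeps only the headline idea: an asymmetric, sigmoid-shaped value function applied to each response’s implied reward, with separate weights for desirable and undesirable examples. Treat it as an illustration of the shape of the method, not a faithful reimplementation.

```python
import torch

def kto_loss(policy_logps, ref_logps, is_desirable,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Heavily simplified sketch of the KTO objective.

    policy_logps / ref_logps: summed log-probabilities of each response
    under the tuned model and a frozen reference model (1-D tensors).
    is_desirable: boolean tensor marking which responses were labeled good.
    Note: the published method estimates a KL-based reference point from
    mismatched pairs; here it is crudely approximated by the batch mean.
    """
    log_ratios = policy_logps - ref_logps
    z0 = log_ratios.mean().clamp(min=0.0).detach()               # crude reference point
    gains = lambda_d * torch.sigmoid(beta * (log_ratios - z0))   # diminishing returns on gains
    losses = lambda_u * torch.sigmoid(beta * (z0 - log_ratios))  # losses weighted separately
    value = torch.where(is_desirable, gains, losses)
    lam = torch.where(is_desirable,
                      torch.full_like(value, lambda_d),
                      torch.full_like(value, lambda_u))
    return (lam - value).mean()
```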
Contrastive fine-tuning (CFT)
CFT directs a model to generate both desirable and undesirable responses. Data scientists then fine-tune the model to increase the odds of creating responses similar to the desirable output while decreasing the chances of mimicking the undesirable response.
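Contrastive fine-tuning is not a single standardized recipe, but one common way to express the idea is a margin-style objective over paired generations: the model’s log-probability of the desirable response should beat that of the undesirable one by some margin. The snippet below is purely illustrative.

```python
import torch.nn.functional as F

def contrastive_ft_loss(desirable_logps, undesirable_logps, margin=1.0):
    """Margin-style contrastive objective (illustrative only): the model's
    log-probability of the desirable response should exceed that of the
    undesirable response by at least `margin` nats."""
    return F.relu(margin - (desirable_logps - undesirable_logps)).mean()
```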
Regardless of the specific technique used, the goal of alignment is to ground a machine learning model with the values, inclinations, and soft skills of a well-adjusted human.
Making LLM alignment scalable with Snorkel Flow
If the process of labeling data for alignment sounds laborious, it is.
In one six-month period in 2023, OpenAI hired more than 1,000 contractors to create and annotate alignment data. Very few organizations can afford to put that many human hours into model adjustments.
Researchers and engineers at Snorkel AI, however, have found ways to scale the impact of subject matter expert (SME) labeling time. By asking each labeler to explain why they thought a response was high or low quality, Snorkel Flow users can distill that logic into a labeling function that they can apply to thousands or millions of prompt/response pairs simultaneously.
Here are two quick examples (both sketched in code below):
- If a user indicates that a model response was poor because the prompt asked for a bullet list and the response didn’t include one, Snorkel Flow users can build a regex-based labeling function to give similar responses a poor rating.
- Using a sentiment analysis model, Snorkel Flow users can apply low ratings to negative-toned responses.
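Both examples might look something like the sketch below. It uses plain Python functions in the spirit of labeling functions; the decorators, registration, and exact API that Snorkel Flow provides are omitted, so treat the structure as illustrative rather than as the product’s actual interface.

```python
import re

# Label values used in this sketch (not Snorkel Flow's built-in constants)
POOR, ABSTAIN = 0, -1

# Matches lines that start with a bullet marker or a numbered-list prefix
BULLET_PATTERN = re.compile(r"^\s*([-*\u2022]|\d+\.)\s+", re.MULTILINE)

def lf_missing_bullet_list(prompt: str, response: str) -> int:
    """If the prompt asks for a bullet list but the response contains no
    list-like lines, rate the response as poor; otherwise abstain."""
    asks_for_list = "bullet" in prompt.lower()
    has_list = bool(BULLET_PATTERN.search(response))
    return POOR if asks_for_list and not has_list else ABSTAIN

def lf_negative_tone(response: str, sentiment_score: float) -> int:
    """Rate responses with a strongly negative sentiment score as poor.
    `sentiment_score` would come from any off-the-shelf sentiment model."""
    return POOR if sentiment_score < -0.5 else ABSTAIN
```

Once written, functions like these can be applied programmatically to every prompt/response pair in a dataset, which is how a single SME explanation ends up labeling thousands or millions of examples.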
Through these approaches, Snorkel researchers and engineers have been able to achieve high-quality results in a fraction of the SME time this process would otherwise require.
LLM alignment: the forever stage
The practical use of LLMs changes over time. Organizations release new products and deprecate old ones. They update their messaging and brand guidelines. New regulations or evolving best practices may dictate what organizations incorporate into their LLMs and how.
Because of this, alignment never really ends. Data scientists must nudge model responses to accommodate changing organizational needs again and again. For some organizations, alignment may be a constant process. Others may do an annual overhaul. Still others may address alignment needs only when users indicate they’re unhappy with recent LLM behavior.
Whichever cadence fits best, data science teams should keep in mind that alignment will rarely be a one-and-done process. They will want to plan for when and how to perform subsequent rounds.
LLM alignment: making LLMs more useful for your organization
LLMs promise to yield significant value, but they must work correctly for your purposes. Like a new employee, a fully trained LLM won’t work perfectly for the tasks you have in mind on day one. But, like a competent new hire, you can teach it to use its capabilities the way you want.
While this process has been laborious in the past, novel tools like Snorkel Flow have made it easier.
Matt Casey leads content production at Snorkel AI. In prior roles, Matt built machine learning models and data pipelines as a data scientist. As a journalist, he produced written and audio content for outlets including The Boston Globe and NPR affiliates.