Generative artificial intelligence models offer a wealth of capabilities. They can write poems, recite common knowledge, and extract information from submitted text. But developers can also use genAI models to quickly build predictive pipelines for tasks such as classification or extraction, where the correct answer comes from a closed set of options.

These pipelines are not ready for production deployment, but they can lay the groundwork for models that are more robust, more accurate, and more cost-effective.

This guide outlines the following:

  1. Prompt strategies for a quick, genAI predictive pipeline.
  2. How to shape the code around the predictive pipeline.
  3. Why these pipelines should not be used in production.
  4. What to do instead.

While Snorkel has worked with partners to build valuable applications using image and cross-modal genAI models, this post will focus exclusively on large language models.

Let’s get started.

GenAI + code for predictive pipelines

To make a generative model act like a predictive model, developers must follow two basic principles:

  1. Constrain the generative model’s output.
  2. Build code to handle the LLM’s inputs and outputs.

The first principle makes the generative model’s outputs (mostly) fall into an expected range. The second maps those outputs to final labels and significantly eases data preparation.

Users can easily constrain an LLM’s output with clever prompt engineering. Researchers have found many ways to do this, but we’re going to focus on two:

  1. Closed question-answering.
  2. In-context learning.

Which approach makes more sense will depend on the specific application.

Closed question-answering

With this technique, the user constrains the genAI’s output by asking it a closed-ended question.

For example:

{text to evaluate}
Which of the following sports is the above text talking about: baseball|basketball|football|hockey

This approach can handle more categories than in-context learning because it states the options so concisely; each additional category requires as few as two tokens: one for a one-word category name and one for the separating punctuation. That minimizes the chance that the prompt will overrun the context window and reduces the cost of high-volume runs.
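
For instance, a minimal sketch of building such a prompt from a category list might look like the following; the category names and template simply mirror the example above:

categories = ['baseball', 'basketball', 'football', 'hockey']

closed_qa_template = """{text_to_evaluate}

Which of the following sports is the above text talking about: {options}"""

def build_closed_qa_prompt(text_to_evaluate, categories):
    # Each added category costs only a few tokens: the category name plus a '|' separator.
    options = '|'.join(categories)
    return closed_qa_template.format(text_to_evaluate=text_to_evaluate, options=options)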

The downsides of this technique are the following:

  1. It is more likely to return an output outside of the defined set.
  2. Its category assignments can be brittle; a simple rephrasing in the source text can cause predictions to flip from one category to another.

In short, closed question-answering works best for high-cardinality or high-volume problems.

In-context learning

When using in-context learning, the user encodes examples of inputs and outputs directly into the prompt.

For example:

Prompt:

This is awesome! // Positive
This is bad! // Negative
Wow that movie was rad! // Positive
{text_to_evaluate} //

If text_to_evaluate were “What a horrible show!”, the LLM output should be:

Negative

This technique all but guarantees a properly constrained response that will translate well into a code pipeline.
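
For illustration, a minimal sketch of assembling this kind of few-shot prompt might look like the following; the example texts and labels simply mirror the template above:

# Labeled examples are encoded directly into the prompt, one per line.
examples = [
    ('This is awesome!', 'Positive'),
    ('This is bad!', 'Negative'),
    ('Wow that movie was rad!', 'Positive'),
]

def build_icl_prompt(text_to_evaluate, examples):
    # One "input // label" line per example, then the new text with the label left blank.
    lines = [f'{text} // {label}' for text, label in examples]
    lines.append(f'{text_to_evaluate} //')
    return '\n'.join(lines)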

However, this approach comes with one big downside: the prompt must include at least one example for each potential output.

Depending on the count of categories and the lengths of the texts to evaluate, the prompt could exceed the LLM’s token maximum. Even if the total text length fits within the context window, users may want to avoid in-context learning for tasks with high cardinality; when using LLM APIs, users pay by the token. Over thousands of executions, those extra tokens can add up.

In short, in-context learning works best for low-cardinality or low-volume problems.


The surrounding code

Using an LLM for predictive purposes requires a coding pipeline that does three things:

  1. Inserts the targeted text into a prompt template.
  2. Sends the prompt to the LLM.
  3. Interprets the response.

Developers working on the Snorkel Flow platform only need to consider the prompt template. Snorkel Flow’s infrastructure will handle everything else. Other developers may use an automation utility such as LangChain.

Developers who can’t or don’t want to use those tools can do this directly through LLM APIs, as outlined below. Our examples use Python, but the concepts apply equally well to other coding languages.

Building the prompt

Each predictive task sent to an LLM starts with a prompt template. This is a piece of text that includes the portions of the prompt to be repeated for every document, as well as a placeholder for the document to examine.

Predictive pipelines for genAI begin with a function that accepts the text document to examine and inserts it into the prompt template, resulting in a complete prompt.

For example:

prompt_template = """{text_to_evaluate}

Which of the following sports is the above text talking about: baseball|basketball|football|hockey"""

def build_prompt(text_to_evaluate, prompt_template):
    prompt = prompt_template.format(text_to_evaluate=text_to_evaluate)
    return prompt

Sending the prompt to the LLM

Once the code has built the final prompt, it must send the prompt to the LLM.

Other writers have composed thorough and robust tutorials on using the OpenAI Python library or LangChain. We will not go into great detail about that here, except to say that the pipeline needs a function (sketched after this list) that does the following:

  1. Accepts the full text prompt.
  2. Sends it to the LLM.
  3. Returns only the new content generated by the LLM.
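
For illustration, here is a minimal sketch of such a function written against the pre-1.0 openai Python library; the model name, API key placeholder, and exact client interface are assumptions that will vary with your library version and provider:

import openai

openai.api_key = 'YOUR_API_KEY'  # placeholder; load from a secure location in practice

def get_llm_response(prompt, model='gpt-4'):
    # Send the prompt as a single user message; temperature 0 keeps outputs more deterministic.
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{'role': 'user', 'content': prompt}],
        temperature=0,
    )
    # Return only the newly generated content.
    return response['choices'][0]['message']['content']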

Interpreting the response

Once the LLM returns a response, the code must translate it for downstream use.

This is not quite as simple as it sounds. Directly returning the genAI output could result in noisy, mismatched data labels; LLM responses sometimes deviate from the prescribed format, and the code must handle that.

These deviations could include:

  1. Capitalization, which requires normalizing the case of all text.
  2. Extending the answer, which requires limiting the examined text to only the first token.
  3. Deviating completely from the suggested format, which requires catch-all error handling, such as classifying everything outside of the schema as “other.”

Depending on their use case, developers may also wish to convert responses to boolean variables.

Regardless, the interpretation portion of the pipeline will rest primarily on some light formatting followed by some kind of if/else gate.

For example:

acceptable_responses = ['baseball', 'basketball', 'football', 'hockey']

def interpret_response(response, acceptable_responses):
    # Normalize the response: trim whitespace, lowercase, and keep only the first word.
    cleaned_response = response.strip().lower().split(' ')[0]
    if cleaned_response in acceptable_responses:
        return cleaned_response
    else:
        return 'other'

This should yield results consistent enough to use in downstream analysis.

The main challenges of deploying genAI for predictive tasks in production

Given the relative ease of building predictive pipelines using generative AI, it might be tempting to set one up for large-scale use.

That’s a bad idea, for two main reasons: accuracy and cost.

The problem of accuracy

Text-generating AIs are trained to understand language largely by filling in missing tokens. This is significantly different from training for classification. When prompted for a classification task, a genAI LLM may give a reasonable baseline, but prompt engineering and fine-tuning can only take you so far. For better results, you need a dedicated classification model.

Consider the two following case studies:

  • GenAI for politeness. In a case study, a researcher fine-tuned a generative LLM using a version of the Stanford Politeness Dataset. After this training, the model achieved an accuracy of 78%.
  • BERT for misinformation. Researchers using a BERT derivative—a non-generative LLM—achieved 91% accuracy in predicting COVID misinformation.

The study using a purpose-built model achieved better results with a smaller model on a problem that is at least as difficult. While clever prompt engineering can coax better predictive performance from generative models, purpose-built models typically win when it comes to deployable applications.

The problem of cost

At the time of this writing, accessing GPT-4 through OpenAI’s APIs costs about three cents per thousand tokens.

The in-context learning template shown above contains 54 tokens. If the completed prompts average 100 tokens total, every 10 requests costs about three cents. At a small scale, this cost is negligible; 300 classifications will cost less than a dollar.

At high throughput, this will add up. Fast.

If the task calls for ~1000 classifications per minute, that’s ~$180 per hour. If it calls for ~1000 classifications per second, that’s more than $10,000 per hour.
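
The back-of-the-envelope math is easy to check in a few lines, using the illustrative figures from this example (roughly 100 tokens per request at about three cents per thousand tokens):

PRICE_PER_1K_TOKENS = 0.03   # illustrative GPT-4 pricing from this example
TOKENS_PER_REQUEST = 100     # assumed average prompt-plus-response length

def hourly_cost(requests_per_second):
    tokens_per_hour = requests_per_second * 3600 * TOKENS_PER_REQUEST
    return tokens_per_hour / 1000 * PRICE_PER_1K_TOKENS

print(hourly_cost(1000 / 60))  # ~1,000 requests per minute -> about $180 per hour
print(hourly_cost(1000))       # ~1,000 requests per second -> about $10,800 per hour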

While organizations that host their own generative models can avoid per-token costs, they must instead pay for the resources to host and run the models. The smallest version of Meta’s Llama 2 uses 7 billion parameters. The largest version of BERT contains 340 million parameters. At roughly five percent of the size, a BERT-based model will cost only a small fraction as much to run at inference time.

How to make predictive GenAI approaches useful for your org

If generative AI for predictive tasks is both inaccurate and expensive, why would anyone use it? To help bootstrap better, cheaper models.

Iterating for better quality

Building an enterprise-ready model starts with a large volume of high-quality labeled data. This poses a serious challenge for companies; in a poll from our 2023 Future of Data-Centric AI virtual conference, respondents indicated that a lack of high-quality labeled data remains the biggest bottleneck to AI success.

Asking internal experts to label enough data for a model is often a non-starter; their time is expensive, and they have more urgent things to do. Outsourcing data labeling poses difficult questions, such as whether gig labelers have the expertise required to label the data properly and whether the contractor’s security protocols are strong enough to prevent your data from being leaked.

Generative AI can help build out a probabilistic data set quickly. Data scientists and subject matter experts can assemble labeling prompts that approach the question from different angles. Then, they can combine those predictive labels through a method like weak supervision to produce a probabilistic training set.
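
As a simplified illustration of that combination step, a plain majority vote over several prompt-based labelers captures the basic shape; a real pipeline would use a proper weak supervision label model (such as Snorkel’s), which also estimates and weights each labeler’s accuracy:

from collections import Counter

def combine_labels(votes, abstain='other'):
    # Ignore abstentions, then pick the label most of the prompt-based labelers agree on.
    counted = Counter(vote for vote in votes if vote != abstain)
    if not counted:
        return abstain
    return counted.most_common(1)[0][0]

# Example: three different prompts voting on the same document.
print(combine_labels(['hockey', 'hockey', 'other']))  # -> 'hockey'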

These initial probabilistic labels will not reach human-level accuracy, but they reach the scale needed to train a better model, faster. Subsequent rounds of investigating incorrect labels help the team craft additional labeling functions, pushing the accuracy of the data set toward, and in some cases beyond, human labeling performance.

At a high level, the process looks like this:

  1. Write several generative AI prompts to predict classes.
  2. Combine those labels into a probabilistic data set.
  3. Investigate a sample of labeled data points.
    • If accuracy is too low, return to step 1.
  4. Train an appropriately-sized model using the probabilistic labels.

Training a smaller, more accurate model

Depending on the task, an enormous LLM architecture may perform worse than a smaller one.

In a recent case study for a customer in online retail, Snorkel developers built a labeled training set using genAI and additional tools. They used this data set to train two models:

  • A DistilBERT model: 83.6% accuracy.
  • A GPT-3 model: 82.5% accuracy.

Not only did the DistilBERT model beat GPT-3 by more than a full percentage point, it will also cost a tiny fraction of what GPT-3 would cost to run at inference time.

Use genAI for predictive where appropriate, but do it right

GenAI can do a lot of things—write poems, extract information, and even make categorical predictions. While deploying a large-scale predictive pipeline built on genAI will likely never make sense due to low accuracy and high cost, these pipelines can offer indirect value.

GenAI can help bootstrap more appropriate models by generating probabilistic datasets quickly.

Organizations can combine genAI-produced labels through weak supervision, and iterate on a suite of prompts to get more accurate training labels. Then, data scientists can use this dataset to train a smaller, more accurate model that is both cost-effective and suitable for practical applications.

While generative AI has its place in the AI landscape, it should be utilized thoughtfully, understanding its strengths and limitations, and paired with purpose-built models for more accurate and cost-efficient predictive solutions in production.

Learn more

If you'd like to learn how the Snorkel AI team can help you develop high-quality LLMs or deliver value to your organization from generative AI, contact us to get started. See what Snorkel can do to accelerate your data science and machine learning teams. Book a demo today.