Data development

Vision language models: how LLMs boost image classification

June 12, 2024
5 min read

Vision language models demonstrate impressive off-the-shelf image classification capabilities. Still, generalist models like CLIP struggle to differentiate between subcategories with significant overlap, such as distinguishing cat breeds. I and my colleagues at Brown University set out to better understand how these models work and find ways to use large language models to improve vision language model performance—with notable results.

I recently presented two research papers, “Follow-Up Differential Descriptions: Language Models Resolve Ambiguities for Image Classification” and “If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions” to an audience of Snorkel AI researchers and engineers. You can watch the entire presentation (embedded below), but I have summarized the main points here.

How do contrastive vision language models work?

Contrastive vision language models (VLMs) pair a text encoder with an image encoder. The encoders process each input of their respective media to generate an embedding vector. Researchers then train these models on images paired with their descriptions. This enables the model to understand the relationship between the text and image, allowing it to interpret visual concepts.

VLMs like CLIP (Contrastive Language-Impage Pre-training) serve as a great starting point for building image classification models, but they can sometimes use help; research has shown that querying the model with a description of a target class rather than just its label improves accuracy—though not all descriptions work equally well.

Different VLMs prioritize different attributes. For instance, when analyzing a dataset of bird species, one VLM might prioritize the size attribute, while another might prioritize the habitat attribute.

Image1
A high-level overview of how researchers train contrastive vision-language models.

Automatically enhancing VLM prompts

Our Follow-Up Differential Descriptions (FuDD) study aimed to address a key challenge in VLM performance: distinguishing between classes of images easily confused for each other. We used a zero-shot, training-free method to improve accuracy by detecting ambiguous classes and using a large language model (LLM) to generate detailed descriptions crafted to help resolve ambiguities.

Image2
The first step in disambiguating image classes is to figure out which classes get confused for each other.

The process begins by presenting the VLM with images and text indicating the potential classes those images could belong to. By measuring the distance between the images and each of the classes, we can determine which classes are most similar to each other for this image (i.e., are ambiguous classes).

Then, we use the LLM’s generative AI capabilities to generate the information needed to resolve the ambiguity. For example, to help the VLM distinguish between images of a ‘sparrow’ and a ‘cuckoo’, we would prompt the LLM to generate descriptions highlighting differences between these bird species.

We use these contrastive descriptions to help guide CLIP to make the correct prediction for the image.  We ran extensive experiments on twelve fine-grained image classification datasets. The experiments tested two versions of CLIP and used GPT-3.5 to generate descriptions.  We also tested the impact of contrastive prompts for different levels of ambiguity, ranging from using contrastive prompts only for the five most ambiguous classes to all classes in the dataset.

Image5
Example contrastive descriptions.

The results were encouraging: we observed that providing differential information about ambiguous classes significantly improved classification performance—and that we need not include contrastive descriptions for all classes. While including contrastive descriptions for all classes yielded the best performance, we achieved most of the performance boost by using contrastive prompts just for the five most ambiguous classes.

Image4
The results of including differential descriptions for the k-most ambiguous classes for different values of k (i.e., for different levels of ambiguity).

Understanding VLM contours

In our Extract and Explore (EX2) study, we used reinforcement learning to train an LLM to generate class descriptions more aligned with the preferences of specific VLMs.

The goal of this study was twofold:

  1. To better understand how different VLMs might represent the same concepts.
  2. To demonstrate an approach for training LLMs to assist image labeling tasks better.

We asked an LLM to describe a particular class, for instance, “lemur” or “sparrow.” Then, we then took the VLM’s embeddings for these descriptions and calculated the cosine similarity between them and the embeddings produced by our VLM for images of the given class. We used this cosine similarity as a reward metric to train the LLM to create descriptions more aligned with the VLM’s perspective.

Image6
Our process for adapting LLMs to VLM preferences.

Using  EX2, we made several observations about how VLMs represent different concepts. We found that different VLMs prioritize different attributes to represent similar concepts. Some VLMs might focus on the size of an object, while others might focus on its habitat.

We also found that spurious descriptions and non-visual information significantly influence the representations in VLMs. Though non-intuitive, we found instances where VLMs were more responsive to descriptions that described where a bird lived rather than what the bird looked like.

Image3
The breakdown of the types of descriptions that CLIP prefers for representing flower species.

A better understanding for better VLMs

Understanding and improving vision language models is a critical endeavor in machine learning research. These two studies have shown promising results in enhancing the performance and understanding of these models.

Our Follow-Up Differential Definitions study demonstrated how we can resolve ambiguities in VLMs to improve image classification accuracy. Our EX2 study revealed how VLMs represent different concepts and the influence of spurious descriptions and non-visual information on these representations.

These studies contribute to our understanding of VLMs and open new avenues for further research and potential applications. We look forward to continuing to explore these fascinating models and to furthering our knowledge in this exciting field.

More Snorkel AI events coming!

Snorkel has more live online events coming. Look at our events page to sign up for research webinars, product overviews, and case studies.

If you're looking for more content immediately, check out our YouTube channel, where we keep recordings of our past webinars and online conferences.

Share this article

Reza Esfandiarpoor is a fifth-year Ph.D. candidate in the Department of Computer Science at Brown University, advised by Stephen Bach. His research interests concern machine learning systems with multiple large pre-trained models and the new challenges and opportunities that interactions between these models provide.

Recommended articles

View all articles
benchmarks-3-axis
The Art and Science of Building AI Benchmarks That Shape the Field
Vincent Sunn Chen spoke at AI Engineer London about what it actually takes to build AI benchmarks that move the field forward, not just measure it. The throughline is an asymmetry that keeps showing up across deployments and the 150+ proposals reviewed for the Open Benchmarks Grants: agent capabilities are climbing fast, but the ability to measure those agents with
June 16, 2026
Snorkel Team
Image
Cua-Bench: benchmarking computer-use agents on professional software
TL;DR We built a benchmark of 25 expert-authored KiCad schematic-editing tasks and ran a frontier computer-use agent against them. The headline numbers: 1. Why build a computer-use benchmark for electrical engineering? Most computer-use benchmarks today live in the same handful of apps: web browsers, file managers, generic productivity suites. Those evaluations are useful, but they share a structural weakness —
June 15, 2026
Armin Parchami
,
Zhengyang (Jason) Qi
agentic-in-action
The Standard for Agents You Can Trust: Lessons from the Federal Front Lines
In the first installment of Agentic in Action — a series about real AI deployments, not demos — Snorkel AI’s Kevin Olivieri sat down with three people who have spent their careers where trust isn’t optional: Chris Sniffen, Federal Applied AI Lead at Snorkel AI; John Hickey, President of August Schell; and Mike Baca, CIO of August Schell. The conversation focused on
June 5, 2026
Snorkel Team
Image

Join our newsletter

For expert advice, the latest research, and exclusive events.
By submitting this form, I acknowledge I will receive email updates from Snorkel AI, and I agree to the Terms of Use and acknowledge that my information will be used in accordance with the Privacy Policy.