How ROBOSHOT boosts zero-shot foundation model performance

April 30, 2024

Foundation models like CLIP are fantastic tools for classification applications, but they sometimes focus on the wrong features due to biases in their training data. To overcome this limitation, my colleagues and I developed ROBOSHOT.

The ROBOSHOT method improves the robustness of pre-trained model embeddings in a fully zero-shot fashion, without any additional fine-tuning required.

My PhD advisor, Fred Sala, included a note about ROBOSHOT in his presentation about Skill-It! at Snorkel AI’s Enterprise LLM Summit in January, but I recently had the privilege to present and discuss my work on ROBOSHOT in greater depth with Snorkel’s researchers.

You can watch a recording of the presentation (embedded below), but I have also summarized the main points here.

Understanding the concept of ROBOSHOT

Our work with ROBOSHOT began with understanding how embedding-based foundation models like CLIP make predictions. We start with a sample image and a list of possible labels. The model computes an embedding for the image and one for each label, then takes the dot product between the image embedding and every label embedding. When the embeddings are normalized to unit length, as is standard for CLIP, these dot products are cosine similarities, and the model predicts the label whose embedding is most similar to the image.
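To make the prediction step concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration (real CLIP embeddings have hundreds of dimensions and come from the model’s image and text encoders), and the helper names `normalize`, `dot`, and `zero_shot_predict` are hypothetical, not part of any library.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def zero_shot_predict(image_emb, label_embs):
    """Return the most similar label and the cosine score for every label."""
    image_unit = normalize(image_emb)
    scores = {
        label: dot(image_unit, normalize(emb))
        for label, emb in label_embs.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores

# Toy embeddings, invented for this example (not real CLIP outputs).
label_embs = {
    "waterbird": [0.9, 0.1, 0.3],
    "landbird": [0.1, 0.9, 0.3],
}
image_emb = [0.8, 0.2, 0.5]

prediction, scores = zero_shot_predict(image_emb, label_embs)
print(prediction, scores)
```

No training happens anywhere in this loop, which is why the approach is called zero-shot: the label set can be swapped at inference time simply by embedding different label text.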

These off-the-shelf models can achieve respectable accuracy, but they sometimes rely on spurious correlations in the training data. For example, in classifying water birds versus land birds, if the pre-training dataset often shows water birds in front of water and land birds on land, the model may mistakenly use the background as the basis for its prediction.

To address this, we can use a large language model (LLM) like GPT-4 to identify likely spurious correlations and useful features. We then use techniques from the literature on embedding debiasing to modify the model’s behavior in the embedding space.

Our experiments show that rejecting spurious features in the embedding space tends to reduce variance along one direction, while amplifying the features we predict to be helpful increases the variance in orthogonal directions.

We experimented with approaches that only reject spurious features and approaches that only project onto helpful ones. We’ve seen that reducing spurious correlations and amplifying useful features together yields the best results.
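The two operations can be sketched with standard vector rejection and projection. This is an illustrative sketch under stated assumptions, not our released implementation: the function names (`reject`, `amplify`, `predict`) are hypothetical, the toy axes are hand-picked (dimension 0 stands in for beak shape, dimension 1 for background), and in practice the spurious and helpful directions would come from embedding the LLM’s suggestions rather than being written by hand.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def reject(x, v):
    """Remove the component of x that lies along direction v."""
    scale = dot(x, v) / dot(v, v)
    return [xi - scale * vi for xi, vi in zip(x, v)]

def amplify(x, u):
    """Add x's projection onto u back to x, doubling its component along u."""
    scale = dot(x, u) / dot(u, u)
    return [xi + scale * ui for xi, ui in zip(x, u)]

def predict(x, label_embs):
    """Pick the label with the highest cosine similarity to x."""
    return max(label_embs, key=lambda label: cosine(x, label_embs[label]))

# Toy setup (assumed): a water bird photographed on land.
label_embs = {
    "waterbird": [0.5, 0.8, 0.2],
    "landbird": [-0.5, -0.8, 0.2],
}
image_emb = [0.4, -0.9, 0.3]   # water-bird beak, land background
spurious = [0.0, 1.0, 0.0]     # hypothetical "background" direction
helpful = [1.0, 0.0, 0.0]      # hypothetical "beak shape" direction

before = predict(image_emb, label_embs)
adjusted = amplify(reject(image_emb, spurious), helpful)
after = predict(adjusted, label_embs)
print(before, "->", after)
```

In this toy setup the background component dominates the original embedding, so the unmodified prediction is the land bird label. Once the background direction is rejected and the beak direction is amplified, the prediction flips to the water bird label.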

ROBOSHOT results

We applied ROBOSHOT to various tasks and datasets and have seen promising results, including on the water bird versus land bird dataset mentioned above. Standard pre-trained models struggle with this task because they rely on the bird’s immediate environment to make their predictions. ROBOSHOT redirected the model to focus more on features of the bird itself, such as the shape of its beak.

Our results showed not only an improvement in the average accuracy but also a notable increase in the worst-group accuracy, which improved by almost 30%.
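For readers unfamiliar with the metric, worst-group accuracy is the accuracy on whichever subgroup the model handles worst, for example water birds photographed on land. The sketch below shows how it differs from average accuracy. The records are invented for illustration and are not our experimental data, and the function names are hypothetical.

```python
from collections import defaultdict

def group_accuracies(records):
    """records: list of (group, true_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, pred in records:
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}

def average_and_worst_group(records):
    """Return overall accuracy, the lowest per-group accuracy, and all groups."""
    per_group = group_accuracies(records)
    average = sum(int(t == p) for _, t, p in records) / len(records)
    worst = min(per_group.values())
    return average, worst, per_group

# Made-up predictions: groups pair the bird type with the background.
records = (
    [("waterbird/water", "waterbird", "waterbird")] * 4
    + [("waterbird/land", "waterbird", "waterbird")]
    + [("waterbird/land", "waterbird", "landbird")]
    + [("landbird/land", "landbird", "landbird")] * 4
    + [("landbird/water", "landbird", "landbird")]
    + [("landbird/water", "landbird", "waterbird")]
)

average, worst, per_group = average_and_worst_group(records)
print(f"average={average:.3f} worst_group={worst:.3f}")
```

A model can score well on average while failing on the minority groups where the background contradicts the label, which is why the worst-group number is the more telling measure of robustness to spurious correlations.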

We also applied ROBOSHOT to textual tasks using models like BERT and ADA (OpenAI’s embedding model). In a sentiment classification task, we observed positive results, indicating that ROBOSHOT’s approach to identifying and reducing spurious correlations can be successfully extended to textual data.

These preliminary results are encouraging and underscore the potential of ROBOSHOT in improving the robustness of pre-trained models, even in a fully zero-shot setting.

Limitations of the current approach

While ROBOSHOT shows promise, it has limitations. First, it is currently unable to handle more complex models like transformers or LLMs, where each token has one embedding per layer. Second, it’s currently focused on classification tasks where the features can be easily described with language.

In cases where we can’t find a textual description to differentiate features, ROBOSHOT’s abilities are limited. For example, a human can look at an LLM output and declare whether they think it is harmful, but they may struggle to put into words why they think it’s harmful.

We’re actively working to overcome these limitations.

Future directions

As we look forward to the future of ROBOSHOT, we hope that it could help build more cost-effective and efficient alternatives to the current LLM alignment methods. The current process, known as reinforcement learning from human feedback (RLHF), involves a complex and often time-consuming procedure.

In RLHF, data scientists collect human preference data and use it to retrain the base language model using a reinforcement learning objective function. This process, while effective, requires substantial human and computational resources.

Our ongoing research explores the possibility of modifying language models in the embedding or activation space during inference. In essence, the future directions for ROBOSHOT revolve around making the process of aligning machine learning models with human preferences more efficient, cost-effective, and widely applicable.

A promising start. We’ll keep working on it

Our work on ROBOSHOT has shown promising results in improving the robustness of pre-trained model embeddings in a zero-shot fashion. We’re excited about the potential impact of this method in the field of machine learning, particularly in settings where access to labeled data is limited.

We look forward to continuing our research and finding ways to overcome the current limitations. Thank you for your time and interest in our work.

Dyah Adila
PhD Student

Dyah Adila hails from Indonesia and studies under Fred Sala. She has interned at Amazon AWS AI and JP Morgan Chase, Singapore. Her research interests center on building robust and reliable machine learning solutions, especially in settings where access to labeled data is limited.
