Snorkel Expert data-as-a-service

Use Cases

From coding and agentic reasoning to text generation and more, discover how Snorkel enables AI teams to build the next generation of models with unparalleled speed and accuracy.

Agentic

Agentic

The frontiers of multi-turn math reasoning 

Snorkel provided a frontier LLM team with a dataset to assess LLM math reasoning on challenges ranging from high school to graduate level. Our data development approach had experts correct responses and reasoning traces while giving the customer control over the distribution across topics, skills, and complexity.


0%

Pass rate for frontier LLMs

900

Mathematical skills
Agentic

AI Voice assistant training data for a tech industry giant

A tech industry giant aimed to build better, more usable voice assistants for its customers. We collaborated with them to build a deep, expert-crafted dataset of realistic multi-turn, multi-agent conversations, including simulated tool use.


3+

Tool calls per conversation, ~9+ turns

15+

Reasoning scenarios represented
Agentic

Robust agentic evaluation benchmarks

A Global 2000 telecom partnered with Snorkel to curate a gold-standard set of prompts, responses, and tool calls targeting reasoning and multi-step planning. This custom benchmark revealed critical model failures, enabling the team to target training and correction and progress to production faster than manual reviews.


10+

Tools

+35

Points in function calling (via MMAU)
Agentic
Text Generation

Multi-step, multi-turn, and multi-tool Deep Research data

A leading LLM provider hired Snorkel AI to create a dataset to enhance its models’ deep research capabilities. Snorkel researchers assembled a dataset where each data point included a complex user query, a high-quality research plan, and a fine-grained response quality evaluation rubric.


10+

Average interactions between model and user

30+

Evaluation criteria developed per task on average

Annotation

Annotation

Grading LLM information retrieval and synthesis

An open-source LLM developer sought to improve its models’ ability to extract questions and answers from technical documents like textbooks and research papers. Snorkel experts graded and corrected model attempts to cite sources and answer questions from these documents, creating a golden set of retrievals.


30+

Grading dimensions

10+

Domains
Annotation
Multi-Modal

Enabling FMs to understand charts

A leading LLM developer sought high-quality annotations of graphs, maps, and other visuals used to solve middle-school and high-school math problems. Snorkel experts reviewed documents and curated annotations (including chart elements, data points, and implied relationships) for training and evaluation purposes.


22+

Average data points labeled per graph

15+

Visual attributes labeled

Coding

Coding

Alignment for better code generation

A frontier model developer sought to improve code generation outputs using human feedback. Snorkel rapidly assembled a team of qualified engineers to assess, review, and grade multiple candidate code responses to user queries, resulting in a rich training set to better align the model.


8

Assessment criteria per code generation

21

Coding languages assessed
Coding

Training and evaluation data for code generation

A tech industry giant sought unique prompts and answers to train and evaluate its frontier LLMs’ code generation capabilities. Snorkel experts curated unique competition-style coding prompts with verifiable solutions and accompanying unit tests to validate samples automatically. 


20+

Problem classes

4

Factors in quality rubric

Multi-Modal

Multi-Modal

Image-based search for retail

An e-commerce giant aimed to let customers search products by image and feeling (such as "summer vibes"). Snorkel researchers generated pairs of user queries and associated results that boosted downstream search model performance.


10,000+

Products

+37

Point recall on image + text search

Text Generation

Text Generation

A PhD-level benchmark for frontier LLMs

A leading LLM developer sought a dataset of multiple-choice Q&A questions that stretched beyond the limits of frontier LLMs. Snorkel AI developed a dataset that probed for PhD-level understanding, covering thousands of topics across humanities, STEM, and professional domains.


<20%

Pass rate by two frontier LLMs

1,000+

PhD-level sub-domains
Text Generation

Q&A training data for a customer billing SLM

A Fortune 500 telecom wanted an SLM to automatically answer customer billing questions. Using expert input and programmatic acceleration, Snorkel curated data that covered all expected question types and improved the model’s performance, enabling the team to deploy 10+ supported use cases to production.


+41

Point improvement in SLM answer accuracy

93%

Alignment between SMEs and AI evaluators
See how Snorkel can help you get up to:
100x
Faster Data Curation
40x
Faster Model Delivery
99%
Model Accuracy