Data Development

Use cases

From coding and agentic reasoning to text generation and more, discover how Snorkel enables AI teams to build the next generation of models with unparalleled speed and accuracy.
For tool-using systems

Agentic

Structured data and evaluation workflows for agents that need to make decisions, use tools, and complete complex tasks end to end.

The frontiers of multi-turn math reasoning 

Snorkel provided a frontier LLM team with a dataset to assess LLM math reasoning skills on high-school to graduate-level challenges. Our data development approach had experts correct responses and reasoning traces, and gave the customer control over the distribution of topics, skills, and complexity.

0%

Pass rate for frontier LLMs

900

Mathematical skills

AI voice assistant training data for a tech industry giant

A tech industry giant aimed to build better, more usable voice assistants for its customers. We collaborated with them to build a deep, expert-crafted dataset of realistic multi-turn, multi-agent conversations, including simulated tool use (a sketch of one such record follows the stats below).

3+

Tool calls per conversation, ~9+ turns

15+

Reasoning scenarios represented
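For illustration only, here is a minimal sketch of what one record in a dataset like this might look like. The field names, scenario, and tool are assumptions made for this sketch, not the actual schema used in the engagement.

```python
# Illustrative only: a hypothetical record format for one multi-turn,
# multi-agent conversation with simulated tool use. All field names
# and content are assumptions for this sketch.
conversation = {
    "scenario": "reschedule_flight_and_notify_contacts",
    "turns": [
        {"role": "user", "content": "Move my Friday flight to Sunday morning."},
        {"role": "assistant", "tool_call": {
            "name": "search_flights",  # simulated tool, not a real API
            "arguments": {"date": "2025-06-08", "window": "morning"},
        }},
        {"role": "tool", "name": "search_flights",
         "content": '[{"flight": "UA 212", "departs": "08:15"}]'},
        {"role": "assistant",
         "content": "UA 212 departs Sunday at 08:15. Book it?"},
        {"role": "user", "content": "Yes, and text my sister the new time."},
        # ... real conversations averaged ~9+ turns and 3+ tool calls
    ],
}

# A trivial check a curation pipeline might run: count simulated tool calls.
tool_calls = sum(1 for t in conversation["turns"] if "tool_call" in t)
print(f"tool calls: {tool_calls}")
```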

Robust agentic evaluation benchmarks

A Global 2000 telecom partnered with Snorkel to curate a gold-standard set of prompts, responses, and tool calls targeting reasoning and multi-step planning. This custom benchmark revealed critical model failures, enabling the team to target training and corrections and reach production faster than manual review would have allowed.

10+

Tools

+35

Point improvement in function calling (via MMAU)

Multi-step, multi-turn, and multi-tool deep research data

A leading LLM provider hired Snorkel AI to create a dataset to enhance its models’ deep research capabilities. Snorkel researchers assembled data points that each included a complex user query, a high-quality research plan, and a fine-grained response-quality evaluation rubric (an illustrative sketch follows below).

10+

Average interactions between model and user

30+

Evaluation criteria developed per task on average
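As a sketch of the data-point structure described above: the keys, example content, and rubric wording here are all assumptions for illustration, not the provider's or Snorkel's actual schema.

```python
# Illustrative only: one hypothetical deep-research data point with a
# query, a research plan, and a fine-grained evaluation rubric.
data_point = {
    "query": (
        "Compare the long-term grid-storage economics of sodium-ion and "
        "lithium iron phosphate batteries, citing primary sources."
    ),
    "research_plan": [
        "Define cost metrics (LCOS, cycle life, capex per kWh).",
        "Gather recent peer-reviewed and industry cost data for each chemistry.",
        "Synthesize a comparison and flag uncertainty in projections.",
    ],
    "rubric": [  # fine-grained criteria; real tasks averaged 30+
        {"criterion": "All cost figures traceable to a cited source", "weight": 3},
        {"criterion": "Every plan step executed, none skipped", "weight": 2},
        {"criterion": "Assumptions and uncertainty stated explicitly", "weight": 2},
    ],
}

# A grader would score a model response against each criterion; the
# weights cap the maximum achievable score.
max_score = sum(c["weight"] for c in data_point["rubric"])
print(f"max rubric score: {max_score}")
```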
For expert-labeled data

Annotation

Annotation workflows designed for tasks where quality depends on domain expertise, clear review standards, and consistent judgment.

Grading LLM information retrieval and synthesis

An open-source LLM developer sought to improve its models’ ability to extract questions and answers from technical documents like textbooks and research papers. Snorkel experts graded and corrected model attempts to cite sources and answer questions from these documents, creating a golden set of retrievals.

30+

Grading dimensions

10+

Domains

Enabling FMs to understand charts

A leading LLM developer sought high-quality annotations of graphs, maps, and other visuals used to solve middle-school and high-school math problems. Snorkel experts reviewed documents and curated annotations (including chart elements, data points, and implied relationships) for training and evaluation purposes.

22+

Average data points labeled per graph

15+

Visual attributes labeled
For software intelligence

Coding

Data for models that need to understand codebases, generate reliable solutions, and perform across real developer workflows.

Alignment for better code generation

A frontier model developer sought to improve code generation outputs using human feedback. Snorkel rapidly assembled a team of qualified engineers to assess, review, and grade multiple candidate code responses to user queries, resulting in a rich training set to better align the model.

8

Assessment criteria per code generation

21

Coding languages assessed

Training and evaluation data for code generation

A tech industry giant sought unique prompts and answers to train and evaluate its frontier LLMs’ code generation capabilities. Snorkel experts curated competition-style coding prompts with verifiable solutions and accompanying unit tests to validate samples automatically (a simplified sketch of this check follows below).

20+

Problem classes

4

Factors in quality rubric
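To illustrate the kind of automatic validation described above, here is a minimal sketch: a candidate solution is executed together with its unit tests in a subprocess, and the sample passes only if every assertion holds. The problem, tests, and helper function are stand-ins invented for this sketch, not items from the actual dataset; a production pipeline would also sandbox execution and limit resources.

```python
import subprocess
import sys
import tempfile

# Stand-in candidate solution and unit tests; not items from the dataset.
CANDIDATE = '''
def two_sum(nums, target):
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
'''

TESTS = '''
assert two_sum([2, 7, 11, 15], 9) == [0, 1]
assert two_sum([3, 2, 4], 6) == [1, 2]
'''

def validate(candidate: str, tests: str, timeout: float = 5.0) -> bool:
    """Run candidate + tests in a subprocess; True only if all asserts pass."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n" + tests)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

print(validate(CANDIDATE, TESTS))  # True: the sample auto-validates
```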
For visual reasoning

Multi-modal

Datasets and evaluations for models that interpret charts, images, diagrams, documents, and mixed-format inputs.

Enabling FMs to understand charts

A leading LLM developer sought high-quality annotations of graphs, maps, and other visuals used to solve middle-school and high-school math problems. Snorkel experts reviewed documents and curated annotations (including chart elements, data points, and implied relationships) for training and evaluation purposes.

22+

Average data points labeled per graph

15+

Visual attributes labeled

Image-based search for retail

An e-commerce giant aimed to let customers search products by image and feeling (such as “summer vibes”). Snorkel researchers generated pairs of user queries and associated results that boosted downstream search model performance.

10,000+

Products

+37

Point improvement in recall on image + text search
For high-quality outputs

Text generation

Improve how models write, summarize, extract, and respond across customer, research, and enterprise use cases.

A PhD-level benchmark for frontier LLMs

A leading LLM developer sought a dataset of multiple-choice questions that stretched beyond the limits of frontier LLMs. Snorkel AI developed a dataset that probed for PhD-level understanding, covering thousands of topics across humanities, STEM, and professional domains.

<20%

Pass rate for two frontier LLMs

1,000+

PhD-level sub-domains

Q&A training data for a customer billing SLM

A Fortune 500 telecom wanted an SLM to automatically answer customer billing questions. Using expert input and programmatic acceleration, Snorkel curated data that covered all expected question types and improved the model’s performance, enabling the team to deploy 10+ supported use cases to production.

+41

Point improvement in SLM answer accuracy

93%

Alignment between SMEs and AI evaluators

Multi-step, multi-turn, and multi-tool deep research data

A leading LLM provider hired Snorkel AI to create a dataset to enhance its models’ deep research capabilities. Snorkel researchers assembled data points that each included a complex user query, a high-quality research plan, and a fine-grained response-quality evaluation rubric.

10+

Average interactions between model and user

30+

Evaluation criteria developed per task on average

For models that need to be right. Not just good enough.