Snorkel Expert data-as-a-service

Use Cases

From coding and agentic reasoning to text generation and more, discover how Snorkel enables AI teams to build the next generation of models with unparalleled speed and accuracy.

Agentic

Agentic

The frontiers of multi-turn math reasoning 

Snorkel provided a frontier LLM team with a dataset to assess LLM math reasoning on challenges ranging from high school to graduate level. Our data development approach had experts correct responses and reasoning traces while giving the customer control over the distribution across topics, skills, and complexity.


0%

Pass rate for frontier LLMs

900

Mathematical skills
Agentic

AI Voice assistant training data for a tech industry giant

A tech industry giant aimed to build better, more usable voice assistants for its customers. We collaborated with them to build a deep, expert-crafted dataset of realistic multi-turn, multi-agent conversations, including simulated tool use.


3+

Tool calls per conversation, ~9+ turns

15+

Reasoning scenarios represented
Agentic

Robust agentic evaluation benchmarks

A Global 2000 telecom partnered with Snorkel to curate a gold-standard set of prompts, responses, and tool calls targeting reasoning and multi-step planning. This custom benchmark revealed critical model failures, enabling the team to target training and correction and progress to production faster than manual reviews.


10+

Tools

+35

Points in function calling (via MMAU)
Agentic
Text Generation

Multi-step, multi-turn, and multi-tool Deep Research data

A leading LLM provider hired Snorkel AI to create a dataset to enhance its models’ deep research capabilities. Snorkel researchers assembled a dataset where each data point included a complex user query, a high-quality research plan, and a fine-grained response quality evaluation rubric.


10+

Average interactions between model and user

30+

Evaluation criteria developed per task on average

Annotation

Annotation

Grading LLM information retrieval and synthesis

An open-source LLM developer sought to improve its models’ ability to extract questions and answers from technical documents like textbooks and research papers. Snorkel experts graded and corrected model attempts to cite sources and answer questions from these documents, creating a golden set of retrievals.


30+

Grading dimensions

10+

Domains
Annotation
Multi-Modal

Enabling FMs to understand charts

A leading LLM developer sought high-quality annotations of graphs, maps, and other visuals used to solve middle-school and high-school math problems. Snorkel experts reviewed documents and curated annotations (including chart elements, data points, and implied relationships) for training and evaluation purposes.


22+

Average data points labeled per graph

15+

Visual attributes labeled

Coding

Coding

Alignment for better code generation

A frontier model developer sought to improve code generation outputs using human feedback. Snorkel rapidly assembled a team of qualified engineers to assess, review, and grade multiple candidate code responses to user queries, resulting in a rich training set to better align the model.


8

Assessment criteria per code generation

21

Coding languages assessed
Coding

Training and evaluation data for code generation

A tech industry giant sought unique prompts and answers to train and evaluate its frontier LLMs’ code generation capabilities. Snorkel experts curated unique competition-style coding prompts with verifiable solutions and accompanying unit tests to validate samples automatically. 


20+

Problem classes

4

Factors in quality rubric

Multi-Modal

Multi-Modal

Image-based search for retail

An e-commerce giant aimed to let customers search products by image and feeling (such as "summer vibes"). Snorkel researchers generated pairs of user queries and associated results that boosted downstream search model performance.


10,000+

Products

+37

Point recall on image + text search

Text Generation

Text Generation

A PhD-level benchmark for frontier LLMs

A leading LLM developer sought a dataset of multiple-choice Q&A questions that stretched beyond the limits of frontier LLMs. Snorkel AI developed a dataset that probed for PhD-level understanding, covering thousands of topics across humanities, STEM, and professional domains.


<20%

Pass rate by two frontier LLMs

1,000+

PhD-level sub-domains
Text Generation

Q&A training data for a customer billing SLM

A Fortune 500 telecom wanted an SLM to automatically answer customer billing questions. Using expert input and programmatic acceleration, Snorkel curated data that covered all expected question types and improved the model’s performance, enabling the team to deploy 10+ supported use cases to production.


+41

Point improvement in SLM answer accuracy

93%

Alignment between SMEs and AI evaluators
See how Snorkel can help you get up to:
100x
Faster Data Curation
40x
Faster Model Delivery
99%
Model Accuracy