Category: Data development

Our picks

LLM evaluation in enterprise applications: a new era in ML

Learn about the obstacles faced by data scientists in LLM evaluation and discover effective strategies for overcoming them.

November 25, 2024

AI data development: a guide for data science projects

What is AI data development? AI data development includes any action taken to convert raw information into a format useful to AI.

November 13, 2024

Building better enterprise AI: incorporating expert feedback in system development

Enterprises that aim to build valuable GenAI applications must view them from a systems-level perspective. LLMs are just one part of a larger ecosystem.

January 30, 2024

All articles on Data development

Multi agents in the context of enterprise tool use

In recent months, there has been increasing interest in multi-agent systems and how they can be used to solve more complex tasks than a single agent could accomplish on its own. The topic raises several questions and ideas worth considering: Anthropic’s blog post about how they architected a multi-agent deep research system is…

October 9, 2025

Evaluating Coding Agent Capabilities with Terminal-Bench: Snorkel’s Role in Building the Next Generation Benchmark

Terminal-Bench, developed through a collaboration between Stanford University and Laude Institute, has quickly become the gold standard benchmark for evaluating AI agent capabilities in a command line environment. This comprehensive evaluation framework measures how effectively AI agents can perform complex, real-world tasks within terminal environments. At Snorkel AI, we’re excited to share that we’re one of the top collaborators contributing…

September 30, 2025

Parsing Isn’t Neutral: Why Evaluation Choices Matter

Behind every AI benchmark is a hidden choice: how to read the model’s answers. That choice—parsing—can quietly tilt results more than the model itself. Parsing is where we take an AI system’s raw response and extract the “answer” we use for scoring. It sounds mechanical, but as our research shows, the choice of parser can dramatically change measured accuracy. In…

September 26, 2025
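
The excerpt above describes parsing as the step that extracts a scoreable answer from a model’s raw response, and notes that the choice of parser can shift measured accuracy. The toy Python sketch below (not the article’s actual parsers or data, purely an illustrative assumption) scores the same three hypothetical responses with a strict parser and a lenient one and arrives at very different accuracies.

```python
import re

# Hypothetical raw model outputs for a multiple-choice task (gold answers: B, C, A).
responses = [
    "The answer is (B).",
    "C",
    "After weighing the options, I would go with A because it fits best.",
]
gold = ["B", "C", "A"]

def strict_parse(text):
    # Accept only an answer formatted exactly as "The answer is (X)".
    m = re.search(r"The answer is \(([A-D])\)", text)
    return m.group(1) if m else None

def lenient_parse(text):
    # Accept the first standalone option letter anywhere in the response.
    m = re.search(r"\b([A-D])\b", text)
    return m.group(1) if m else None

def accuracy(parser):
    preds = [parser(r) for r in responses]
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

print("strict parser accuracy: ", accuracy(strict_parse))   # 1 of 3 responses scored correct
print("lenient parser accuracy:", accuracy(lenient_parse))  # 3 of 3 responses scored correct
```

Same model outputs, same gold labels, two different "measured" accuracies, which is the article’s point about parsing not being neutral.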

The science of rubric design

Part 3 of our rubric series explains the science of rubric design. We show why rubrics should be treated like models—structured, measured, and iterated—to maximize objective alignment and inter-rater agreement. Learn how to choose hierarchy and scale points, track inter-annotator agreement (IAA) and LLM-as-a-judge (LLMAJ) alignment, and refine with domain experts, with examples like PaperBench and HealthBench.

September 11, 2025
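
The rubric-design post above mentions tracking inter-annotator agreement (IAA) as a rubric is iterated. As one common way to quantify that agreement (a minimal sketch only; the article may rely on other statistics or tooling), Cohen’s kappa for two raters scoring the same items on a rubric scale can be computed as follows.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items (nominal labels)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical rubric scores (1-3 scale) from a domain expert and a second annotator.
expert = [3, 2, 3, 1, 2, 3, 2, 1]
second = [3, 2, 2, 1, 2, 3, 3, 1]
print(f"Cohen's kappa: {cohens_kappa(expert, second):.2f}")
```

Low kappa on a pilot batch is a signal to tighten rubric wording or scale points before scaling up annotation.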

The right tool for the job: An A-Z of rubrics

Rubrics turn fuzzy “good vs. bad” into measurable criteria for GenAI. In Part 2, we map what to measure (granularity and dataset-level vs instance-specific), where to measure (process vs outcome), and how to measure (humans, LLM-as-judge, code, reward models)—with examples like HHH, FLASK, HealthBench, and PaperBench.

September 2, 2025

Data quality and rubrics: how to build trust in your models

Rubrics aren’t just for evaluation—they’re a blueprint for better data annotation. In this post, we explore how structured rubrics enable scalable, high-quality labeling and evaluation of GenAI systems. Learn how Snorkel and leading labs use rubrics to align human and automated judgment and accelerate trusted AI development.

July 29, 2025

Building the Benchmark: Inside Our Agentic Insurance Underwriting Dataset

In this post, we unpack how Snorkel built a realistic benchmark dataset to evaluate AI agents in commercial insurance underwriting. From expert-driven data design to multi-tool reasoning tasks, see how our approach surfaces actionable failure modes that generic benchmarks miss—revealing what it really takes to deploy AI in enterprise workflows.

July 10, 2025

Evaluating AI Agents for Insurance Underwriting

In this post, we will show you a specialized benchmark dataset we developed with our expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark uncovers several model-specific and actionable error modes, including basic tool use errors and a surprising number of insidious hallucinations from one provider. This is part of an ongoing series of benchmarks we are releasing across verticals…

June 26, 2025

LLM Observability: Key Practices, Tools, and Challenges

LLM observability is crucial for monitoring, debugging, and improving large language models. Learn the key practices, tools, and strategies for effective LLM observability.

June 23, 2025

LLM-as-a-judge for enterprises: evaluate model alignment at scale

Discover how enterprises can leverage LLM-as-Judge systems to evaluate generative AI outputs at scale, improve model alignment, reduce costs, and tackle challenges like bias and interpretability.

March 26, 2025
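
For readers unfamiliar with the LLM-as-a-judge pattern discussed above, here is a minimal, generic sketch of the core loop: a judge model receives a rubric, a question, and a candidate response, and returns a structured verdict. This is not Snorkel’s implementation; `call_judge_model` is a hypothetical stand-in for whichever model API an enterprise actually uses.

```python
import json

JUDGE_PROMPT = """You are an evaluator. Score the response against the rubric.
Rubric: {rubric}
Question: {question}
Response: {response}
Return JSON: {{"score": <1-5>, "rationale": "<one sentence>"}}"""

def call_judge_model(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call (e.g., an internal model gateway).
    raise NotImplementedError("Wire this to your model provider.")

def judge(question: str, response: str, rubric: str) -> dict:
    prompt = JUDGE_PROMPT.format(rubric=rubric, question=question, response=response)
    raw = call_judge_model(prompt)
    verdict = json.loads(raw)            # parse the structured verdict
    assert 1 <= verdict["score"] <= 5    # basic sanity check on the judge output
    return verdict
```

In practice, the hard parts the article focuses on—alignment with human judgment, bias, and interpretability—live in how the rubric is written and how judge verdicts are validated, not in this plumbing.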

Why enterprises should embrace LLM distillation

Unlock possibilities for your enterprise with LLM distillation. Learn how distilled, task-specific models boost performance and shrink costs.

February 18, 2025
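
As background on the mechanics behind the distillation article (a generic sketch, not the article’s specific recipe), a student model is commonly trained to match a teacher’s temperature-softened output distribution with a KL-divergence loss, as in the PyTorch fragment below.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: KL divergence between temperature-scaled distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # kl_div expects log-probabilities for the input and probabilities for the target;
    # the t*t factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Toy example: a batch of 4 examples with a 10-class output head.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(float(loss))
```

A task-specific distillation run typically combines this soft-label term with a standard supervised loss on labeled task data.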

LLM evaluation in enterprise applications: a new era in ML

Learn about the obstacles faced by data scientists in LLM evaluation and discover effective strategies for overcoming them.

November 25, 2024

AI data development: a guide for data science projects

What is AI data development? AI data development includes any action taken to convert raw information into a format useful to AI.

November 13, 2024

How a global financial services company built a specialized AI copilot accurate enough for production

Learn how Snorkel, Databricks, and AWS enabled the team to build and deploy small, specialized, and highly accurate models which met their AI production requirements and strategic goals.

September 9, 2024

Task Me Anything: innovating multimodal model benchmarks

“Task Me Anything” empowers data scientists to generate bespoke benchmarks to assess and choose the right multimodal model for their needs.

September 4, 2024
See how Snorkel can help you get up to:
100x faster data curation
40x faster model delivery
99% model accuracy