Terminal-Bench, developed through a collaboration between Stanford University and Laude Institute, has quickly become the gold-standard benchmark for evaluating AI agent capabilities in a command-line environment. This comprehensive evaluation framework measures how effectively AI agents can perform complex, real-world tasks within terminal environments. At Snorkel AI, we’re excited to share that we’re one of the top collaborators contributing…
Behind every AI benchmark is a hidden choice: how to read the model’s answers. That choice—parsing—can quietly tilt results more than the model itself. Parsing is where we take an AI system’s raw response and extract the “answer” we use for scoring. It sounds mechanical, but as our research shows, the choice of parser can dramatically change measured accuracy. In…
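As a rough illustration of how much the parser can matter, here is a minimal sketch, not the parsers studied in the post: it assumes a hypothetical multiple-choice benchmark with gold answer "B", and the parse_strict and parse_lenient functions are made-up examples. The same raw response is scored wrong by one parser and right by the other.

```python
import re

# Hypothetical raw model output for a multiple-choice question whose gold answer is "B".
raw_response = "Option A looks tempting at first, but the correct answer is (B)."

def parse_strict(text: str):
    # Accept only a response that is nothing but a single choice letter.
    match = re.fullmatch(r"\s*\(?([ABCD])\)?\s*", text)
    return match.group(1) if match else None

def parse_lenient(text: str):
    # Take the last choice letter mentioned anywhere in the response.
    matches = re.findall(r"\b([ABCD])\b", text)
    return matches[-1] if matches else None

gold = "B"
for parser in (parse_strict, parse_lenient):
    predicted = parser(raw_response)
    print(f"{parser.__name__}: parsed={predicted!r}, correct={predicted == gold}")

# parse_strict extracts nothing and marks the answer wrong; parse_lenient extracts "B"
# and marks it right -- the parser, not the model, determined the measured accuracy.
```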
Part 3 of our rubric series explains the science of rubric design. We show why rubrics should be treated like models—structured, measured, and iterated—to maximize objective alignment and inter-rater agreement. Learn how to choose hierarchy and scale points, track agreement (IAA) and LLMAJ alignment, and refine with domain experts, with examples like PaperBench and HealthBench.
Rubrics turn fuzzy “good vs. bad” into measurable criteria for GenAI. In Part 2, we map what to measure (granularity and dataset-level vs instance-specific), where to measure (process vs outcome), and how to measure (humans, LLM-as-judge, code, reward models)—with examples like HHH, FLASK, HealthBench, and PaperBench.
Rubrics aren’t just for evaluation—they’re a blueprint for better data annotation. In this post, we explore how structured rubrics enable scalable, high-quality labeling and evaluation of GenAI systems. Learn how Snorkel and leading labs use rubrics to align human and automated judgment and accelerate trusted AI development.
In this post, we unpack how Snorkel built a realistic benchmark dataset to evaluate AI agents in commercial insurance underwriting. From expert-driven data design to multi-tool reasoning tasks, see how our approach surfaces actionable failure modes that generic benchmarks miss—revealing what it really takes to deploy AI in enterprise workflows.
In this post, we will show you a specialized benchmark dataset we developed with our expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark uncovers several model-specific and actionable error modes, including basic tool use errors and a surprising number of insidious hallucinations from one provider. This is part of an ongoing series of benchmarks we are releasing across verticals…
LLM observability is crucial for monitoring, debugging, and improving large language models. Learn key practices, tools, and strategies for LLM observability.
Explore how Anthropic Claude + AWS help pharmaceutical companies leverage AI for enhanced data insights and revenue growth.
See how we can use these two new products, Snorkel Evaluate and Expert Data-as-a-Service, to evaluate and develop a specialized agentic AI system for an enterprise use case.
Announcing two new products on our AI Data Development Platform that together create a complete solution for enterprises to specialize AI systems with expert data at scale.
Discover how enterprises can leverage LLM-as-Judge systems to evaluate generative AI outputs at scale, improve model alignment, reduce costs, and tackle challenges like bias and interpretability.
It’s critical that enterprises can trust and rely on GenAI evaluation results, and for that, SME-in-the-loop workflows are needed. In my first blog post on enterprise GenAI evaluation, I discussed the importance of specialized evaluators as a scalable proxy for SMEs. It simply isn’t practical to task SMEs with performing manual evaluations; it can take weeks, if not longer, unnecessarily…
We’re taking a look at the research paper, LLMs can easily learn to reason from demonstration (Li et al., 2025), in this week’s community research spotlight. It focuses on how the structure of reasoning traces impacts distillation from models such as DeepSeek R1. What’s the big idea regarding LLM reasoning distillation? The reasoning capabilities of powerful models such as DeepSeek…
GenAI needs fine-grained evaluation for AI teams to gain actionable insights.
Specialized GenAI evaluation ensures AI assistants meet business requirements, reflect SME expertise, and comply with industry regulations, which is critical for production-ready AI.
Ensure your LLMs align with your values and goals using LLM alignment techniques. Learn how to mitigate risks and optimize performance.
Learn how ARR improves QA accuracy in LLMs through intent analysis, retrieval, and reasoning. Is intent the key to smarter AI? Explore ARR results!
Unlock possibilities for your enterprise with LLM distillation. Learn how distilled, task-specific models boost performance and shrink costs.
Discover common RAG failure modes and how to fix them. Learn how to optimize retrieval-augmented generation systems for max business value.
Learn about large language model (LLM) alignment and how it maximizes the effectiveness of AI outputs for organizations.
Discover the power of integrating Databricks and Snorkel Flow for efficient data ingestion, labeling, model development, and AI deployment.
Discover how Snorkel AI’s methodical workflow can simplify the evaluation of LLM systems. Achieve better model performance in less time.
Accelerate LLM development with Snorkel Flow and SageMaker. Automate dataset curation, speed up training, and gain a competitive advantage.
Learn about the obstacles faced by data scientists in LLM evaluation and discover effective strategies for overcoming them.
Snorkel AI and AWS are partnering to help enterprises build, deploy, and evaluate custom, production-ready AI models. Learn how.
What is AI data development? AI data development includes any action taken to convert raw information into a format useful to AI.
Discover highlights of Snorkel AI’s first annual SnorkelCon user conference. Explore Snorkel’s programmatic AI data development achievements.
We aim to help our customers get GenAI into production. In our 2024.R3 release, we’ve delivered some exciting GenAI evaluation results.
Snorkel AI has made building production-ready, high-value enterprise AI applications faster and easier than ever. The 2024.R3 update to our Snorkel Flow AI data development platform streamlines data-centric workflows, from easier-than-ever generative AI evaluation to multi-schema annotation.