Your AI model is only as good as your data! But how do you measure “good”?

While the AI industry races toward more sophisticated AI applications like agentic systems, a critical question remains top of mind: How do we systematically evaluate and improve the quality of training and evaluation data that powers these systems?

It’s time for enterprises investing in AI to adopt the state-of-the-art approach used by major AI labs and Snorkel: structured evaluation rubrics.

Welcome to Snorkel AI’s five-part blog series on rubrics.

  1. Part 1 introduces rubric-based evaluation and how both automated and human evaluations benefit from using them.
  2. Part 2 deep dives into different types of rubrics and where it makes sense to apply them. We discuss evaluating both final outcomes and an agent’s response steps (traces).
  3. Part 3 explains the science of rubric design, backed by existing literature and Snorkel’s own projects.
  4. Part 4 is another deep dive, this time into Snorkel’s own use of rubrics in a complex, multi-review process that integrates automated and human-in-the-loop evaluations that produce meaningful data outcomes and improvements to the rubrics over time.
  5. Part 5 looks ahead to emerging trends, advanced techniques, and new AI use cases like agentic multi-turn, multi-step reasoning conversations and tool calls, and multi-modal and coding AI applications.

Evaluation matters, but existing methods fall short of real-world utility for generative applications

Evaluation provides clear metrics to track performance, uncover hidden issues early, and build confidence that your AI behaves as expected before it reaches users. Understanding a system’s limitations is also crucial for managing risk in production deployments. However, most organizations still rely on outdated evaluation methods.

The shift to generative models and agentic systems requires a corresponding shift in how we approach evaluation. The challenge is building evaluation frameworks that can assess open-ended, generative responses. Closed-ended benchmarks such as MMLU still matter for multiple-choice tasks, but we need a different lens for the open-ended, generative era.

One outdated method, ad hoc evaluation, relies on gut instincts and simple approaches that miss nuance and fail to capture critical edge cases at scale. Ad hoc checks fall short when assessing open‑ended generative outputs in specialized domains.

The “golden responses” approach, which compares outputs against predefined, ideal answers, has proven brittle. The predefined responses quickly become outdated, or simply don’t apply when there is no single correct response.

We need to shift our mental models of evaluation to include rubrics

The evaluation problem can’t be solved with better data alone.

To build evaluation frameworks that address the unique demands of the generative era, we need to fundamentally transform how we think about quality measurement itself.

Where gut instinct falls short and endless lists of “golden responses” fail to apply to the full variety of agentic interactions, rubrics are up to the challenge.

Ready to begin? Let’s dive in.

Part I: Introduction to rubrics for AI evaluation

The era of vibe-checking AI models (the “it looks good to me” approach) is over. As AI systems tackle increasingly complex, open-ended tasks such as generating code and conducting deep research, the industry is shifting from intuitive, ad hoc assessment methods to systematic, science-backed evaluation frameworks. What started as a necessity for major AI labs is fast becoming the industry standard: structured rubric-based evaluation.

Rubric-based assessments are:

  • Reliable
  • Nuanced
  • Consistent between automated and human annotators
  • Granular enough to supply the feedback essential for continuous model improvement

Before we explore the evidence for how effective rubrics are for AI system evaluation, we’ll anchor on a clear definition of a rubric.

What is a rubric?

A rubric is a structured guide that spells out what “good” looks like for each response from an AI system. 

A rubric consists of:

  • A list of criteria: Does the code compile? Does it have comments?
  • Performance levels for each criterion: “Compiles” could be yes/no. It could also be more nuanced: yes / yes with warnings / no.
  • Scoring rules that turn performance into numbers: compiles cleanly = 2, compiles with warnings = 1, does not compile = 0.

The completed rubric yields a score for each criterion. In effect, a rubric is a mechanism for embedding domain expertise in a checklist. For example, for a code-generating LLM, you’d want someone familiar with coding to decide which criteria to include and the relevant levels of performance for each. We’ll talk more about rubric design in Part 3 of this series.
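As a concrete (if simplified) illustration, here is one way the code-generation rubric described above could be represented in Python. The Criterion class, the field names, and the specific levels and scores are illustrative assumptions, not part of any particular tool or Snorkel API.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One row of a rubric: what to check, the allowed levels, and their scores."""
    name: str              # e.g. "Compiles"
    description: str       # what the annotator or judge should look for
    levels: dict[str, int] # performance level -> numeric score

# Illustrative rubric for a code-generating LLM (criteria and scores are examples only).
code_rubric = [
    Criterion(
        name="Compiles",
        description="Does the generated code compile?",
        levels={"yes": 2, "yes with warnings": 1, "no": 0},
    ),
    Criterion(
        name="Comments",
        description="Does the code include explanatory comments?",
        levels={"yes": 1, "no": 0},
    ),
]

def score(rubric: list[Criterion], ratings: dict[str, str]) -> int:
    """Turn per-criterion ratings into a total rubric score."""
    return sum(c.levels[ratings[c.name]] for c in rubric)

# Example: a response that compiles with warnings and includes comments.
print(score(code_rubric, {"Compiles": "yes with warnings", "Comments": "yes"}))  # -> 2
```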

Both humans and LLMs can use evaluation rubrics. Researchers and program managers can provide rubrics to human annotators, giving all annotators a shared understanding of the rating system for their dataset. The rubric helps reduce bias and increase alignment between annotators. In automated judging (LLM-as-a-judge evaluation), the rubric is included in the AI judge’s prompt.

For both human and automated evaluators, the rubric converts fuzzy expectations into repeatable scores that feed both data quality loops and live evaluation dashboards. 
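As a sketch of how the automated side works, the snippet below assembles a judge prompt from a rubric. The build_judge_prompt helper and the call_llm client it references are hypothetical placeholders for whatever model API your team uses, and the rubric rows are illustrative.

```python
import json

def build_judge_prompt(rubric: list[dict], prompt: str, response: str) -> str:
    """Embed the rubric in the judge's instructions so scores stay grounded in explicit criteria."""
    criteria_text = "\n".join(
        f"- {c['name']}: {c['description']} (answer with one of: {', '.join(c['levels'])})"
        for c in rubric
    )
    return (
        "You are grading an AI assistant's response against a rubric.\n"
        f"Original prompt:\n{prompt}\n\n"
        f"Response to grade:\n{response}\n\n"
        f"Rubric criteria:\n{criteria_text}\n\n"
        "Return a JSON object mapping each criterion name to your rating and a one-sentence rationale."
    )

# Placeholder rubric rows; a real rubric would come from the domain expert who designed it.
rubric = [
    {"name": "Budget Compliance", "description": "Is the total estimated cost within $500?", "levels": ["Yes", "No"]},
    {"name": "Surfing Focus", "description": "How well does the itinerary incorporate surfing?", "levels": ["1", "2", "3", "4", "5"]},
]

judge_prompt = build_judge_prompt(rubric, prompt="Plan a two-day surf weekend...", response="Day 1: ...")
# judge_output = call_llm(judge_prompt)   # call_llm is a stand-in for your model client
# scores = json.loads(judge_output)       # parse the judge's structured ratings
```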

Here’s an example of what a rubric for a weekend planning agent could look like:

Prompt
Plan a two-day weekend trip for a solo traveler who loves surfing, prefers three-star hotels, and has a budget of up to $500 total. Include suggested itinerary items, lodging details, and approximate costs.
| Criterion | Response Type | Description | Score |
| --- | --- | --- | --- |
| Budget Compliance | Yes or No | Is the total estimated cost within $500? | Yes / No |
| Hotel Preferences | Yes or No | Did the plan include three-star hotel options? | Yes / No |
| Surfing Focus | 1 to 5 | How well does the itinerary incorporate surfing activities? | 1 (low) to 5 (high) |
| Clarity of Itinerary | 1 to 5 | Is the schedule clear, with times and locations specified? | 1 (low) to 5 (high) |
| Variety of Activities | 1 to 5 | Does the plan include a balanced mix of leisure and adventure? | 1 (low) to 5 (high) |
| Cost Breakdown Detail | 1 to 5 | Are lodging, food, transportation, and activity costs itemized? | 1 (low) to 5 (high) |

Here’s an example of applying the weekend planning rubric to a sample generated response:

Prompt
Plan a two-day weekend trip for a solo traveler who loves surfing, prefers three-star hotels, and has a budget of up to $500 total. Include suggested itinerary items, lodging details, and approximate costs.
Model Response
Day 1:
- Surf with Pros Lesson - $60
- Lunch at Taco Shack - $20
- Check-in at Seaview Inn (three-star) - $120/night

Day 2:
- Sunrise surf at Lighthouse Cove - free
- Coastal hike tour - $40
- Picnic lunch - $25

Total Estimated Cost: $295
| Criterion | Description | Score | Rationale |
| --- | --- | --- | --- |
| Budget Compliance | Is the total estimated cost within $500? | Yes | Total of $295 leaves room under $500. |
| Hotel Preferences | Did the plan include three-star hotel options? | Yes | Includes Seaview Inn, a confirmed three-star property. |
| Surfing Focus | How well does the itinerary incorporate surfing activities? | 5 | Multiple surf sessions and a lesson cover the core preference. |
| Clarity of Itinerary | Is the schedule clear, with times and locations specified? | 4 | Locations are listed, though exact times could be added. |
| Variety of Activities | Does the plan include a balanced mix of leisure and adventure? | 3 | Surfing and a hike mix leisure and adventure, but more sightseeing could help. |
| Cost Breakdown Detail | Are lodging, food, transportation, and activity costs itemized? | 4 | Most costs are listed, though transportation is assumed to be free. |
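A practical follow-on question is how to roll these mixed Yes/No and 1-to-5 ratings into a single number for a dashboard. Here is a minimal sketch of one possible aggregation; the normalization and equal weighting are illustrative choices, not a prescribed standard.

```python
# The filled-in weekend-planner rubric from the table above.
filled_rubric = {
    "Budget Compliance":     {"type": "yes_no", "score": "Yes"},
    "Hotel Preferences":     {"type": "yes_no", "score": "Yes"},
    "Surfing Focus":         {"type": "scale_1_5", "score": 5},
    "Clarity of Itinerary":  {"type": "scale_1_5", "score": 4},
    "Variety of Activities": {"type": "scale_1_5", "score": 3},
    "Cost Breakdown Detail": {"type": "scale_1_5", "score": 4},
}

def normalize(entry: dict) -> float:
    """Map each criterion's score onto [0, 1] so different scales can be combined."""
    if entry["type"] == "yes_no":
        return 1.0 if entry["score"] == "Yes" else 0.0
    return (entry["score"] - 1) / 4  # 1..5 -> 0..1

# Unweighted mean across criteria; a real rubric might weight criteria differently.
overall = sum(normalize(e) for e in filled_rubric.values()) / len(filled_rubric)
print(f"Overall rubric score: {overall:.2f}")  # -> 0.83 for the response above
```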

The evolution from simpler evaluation methods

For decades, AI evaluation relied heavily on intuitive assessment methods. These methods included:

  • Subject matter experts reviewing outputs against their unique intuition of the ideal output
  • Accuracy metrics assessed against static benchmarks
  • Comparisons against “golden datasets” for a finite set of examples

These approaches worked reasonably well for classification tasks with clear right and wrong answers, but they fundamentally break down when applied to modern generative AI (GenAI) systems.

Consider the case of evaluating an AI-generated research summary, or a code solution with multiple valid approaches, or a conversational response where creativity and nuance matter as much as factual accuracy. Traditional metrics like BLEU scores and exact match comparisons fail to capture the multidimensional nature of quality in these open-ended scenarios, often missing critical aspects like coherence, helpfulness, or domain-specific requirements.

Leading AI organizations have moved beyond ad hoc evaluation toward systematic, multidimensional rubric-based frameworks. This shift was driven by necessity. As AI systems became more capable and deployed in more complex applications, the limitations of simple evaluation methods became glaring obstacles. These simple methods lacked either reliability or the ability to generate metrics quickly enough to create a useful iteration loop.

The ability to provide structured, detailed feedback across multiple criteria for each output of a GenAI system has become essential not just for model development, but for building the trust and reliability required for real-world deployment.

The evaluation rubric is doubly useful, because it secures labeling consistency among human annotators in the annotation phase, and later doubles as the blueprint for automated grading once the model is trained, closing the loop between data creation and evaluation.

Literature review: how the pros think about evaluation rubrics

The recent literature converges on a simple insight: define the evaluation plan up front and let the data and training pipeline enforce it. Evaluation, and evaluation rubrics, are essential at two foundational points in the model development loop:

  • Curating high-quality training datasets; that is, making sure that what goes into making AI systems is up to spec.
  • Automating scoring of model responses; that is, making sure that what comes out of AI systems is also up to (the same) spec.

Here are examples of rubrics in use during the creation of high-quality training data:

  • Google’s ML Test Score [1] turns twenty-eight discrete checks into an executable scorecard that teams run on every pull request, a practice that generated double-digit accuracy gains and exposed silent data drift long before it reached production.
  • Microsoft’s RUBICON [2] extends that philosophy to conversational agents, using a large language model to propose many candidate criteria, then pruning them until the final rubric separates strong and weak dialogues with 18% better precision than baseline heuristics.
  • Databricks [3] echoes the same theme for retrieval-augmented generation, showing that an explicit rubric prompt fed to GPT-4 can grade thousands of question-answer pairs per hour while keeping expert agreement above 80%.
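To make the data-curation use case concrete, here is a rough sketch of gating a question-answer training set on rubric scores from an LLM judge, in the spirit of the Databricks example above. The judge_with_rubric function, its criteria, and the 0-2 scoring scale are placeholders for illustration, not a real API; the trivial checks inside it stand in for an actual rubric-prompted judge call.

```python
def judge_with_rubric(question: str, answer: str) -> dict[str, int]:
    """Placeholder judge: returns per-criterion scores on a 0-2 scale.

    In practice this would prompt an LLM with the rubric; the simple
    checks below are stand-ins so the sketch runs end to end.
    """
    return {
        "answers_the_question": 2 if question and answer else 0,
        "sufficient_detail": 2 if len(answer.split()) >= 20 else 1,
        "formatting": 2,
    }

def curate(pairs: list[dict], min_total: int = 6) -> list[dict]:
    """Keep only QA pairs whose rubric total clears a quality bar."""
    kept = []
    for pair in pairs:
        scores = judge_with_rubric(pair["question"], pair["answer"])
        if sum(scores.values()) >= min_total:
            kept.append({**pair, "rubric_scores": scores})
    return kept

# Example: only the sufficiently detailed pair survives curation.
pairs = [
    {"question": "What is RAG?", "answer": "Retrieval-augmented generation " * 12},
    {"question": "What is RAG?", "answer": "A thing."},
]
print(len(curate(pairs)))  # -> 1
```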

Speed alone is not enough; automated evaluators must also earn trust.

  • The Alternative Annotator Test [4] offers a statistical framework for deciding when an LLM can replace human raters, requiring only a small audit set to quantify risk.
  • Prometheus 2 [5] pushes that frontier further, matching human and GPT-4 preferences across both direct scoring and pairwise ranking, and letting practitioners swap new criteria in or out without a full reward model retrain.

Together, these studies demonstrate that scalable model iteration pipelines are possible. Efficient and effective enterprise AI development depends on these three qualities:

  1. Scalability: Human effort is reserved for edge cases.
  2. Reliability: Rubric questions are concrete enough for prompt-based scoring.
  3. Alignment: Statistical safeguards confirm that automated judges remain aligned with expert opinion.

In practice, automated evaluation follows a layered approach:

  1. An initial LLM-as-a-judge evaluation screens for obvious failures such as toxicity or broken formatting.
  2. A rubric-based second pass scores surviving responses on factual accuracy, coherence, and domain alignment.
  3. Periodically, humans perform spot checks to recalibrate the first two automated layers.

This structure is scalable because most items flow through the first two layers without human touch, reliable because the rubric prompt grounds the judge’s decisions in transparent criteria, and flexible because teams can edit or add rubrics as product goals evolve, all without redefining an immutable ground truth that never truly existed for many open-ended tasks.
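Below is a minimal sketch of that layered flow. The screen, rubric_score, and spot-check logic are illustrative placeholders standing in for real judge calls and routing infrastructure, not a specific product’s API.

```python
import random

def screen(response: str) -> bool:
    """Layer 1: cheap screen for obvious failures (toxicity, broken formatting).
    A real screen would call an LLM judge or classifiers; this stand-in only rejects empty output."""
    return response.strip() != ""

def rubric_score(response: str) -> dict[str, int]:
    """Layer 2: rubric-based LLM judging of surviving responses (placeholder scores here)."""
    return {"factual_accuracy": 2, "coherence": 2, "domain_alignment": 1}

def evaluate(responses: list[str], spot_check_rate: float = 0.05) -> list[dict]:
    results = []
    for r in responses:
        if not screen(r):
            results.append({"response": r, "verdict": "rejected_at_screen"})
            continue
        record = {"response": r, "scores": rubric_score(r)}
        # Layer 3: route a small sample to humans to recalibrate the automated layers.
        record["human_review"] = random.random() < spot_check_rate
        results.append(record)
    return results

print(evaluate(["A well-formed answer.", ""]))
```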

To understand why rubrics work to align human annotators with LLM judges, we turn to an analysis of the cognition that takes place when humans use a rubric during manual annotation.

Psychology of human manual annotation

Rubric-guided evaluation reshapes how human annotators think. A clear rubric externalizes the criteria that would otherwise sit in working memory, letting raters focus on one dimension at a time and cutting the mental juggling that drives inconsistency. 

  • Education studies report [6] sharper agreement when analytic rubrics replace holistic judgments, with human graders themselves crediting lower cognitive load for the improvement.
  • Recent benchmarking on explanation quality echoes that finding in AI datasets: the CUBE paper [7] showed that once annotators followed a structured rubric, agreement with expert adjudicators increased even on subjective tasks such as reasoning clarity and stance.
  • A 2024 experiment [8] went further, pairing crowd workers with GPT-4 labels. It found that the complementary strengths of humans and the model surfaced only after the workers aligned on the same rubric, pushing accuracy past either source alone.
  • Consistency needs proof, which is where inter-rater reliability metrics enter. Measures such as Cohen’s Kappa, Fleiss’ Kappa, and the intraclass correlation coefficient quantify how often raters converge beyond chance expectations [9] (a small worked example follows this list).
  • Scale design matters. Most rubrics use five- or seven‑point Likert scales, but without clear anchors, these can introduce central tendency and interpretation bias. In fact, Best-Worst Scaling [10] has been shown to yield significantly higher inter‑rater reliability than traditional Likert ratings, and continuous (slider‑style) scales can further improve consistency over discrete options in dialogue evaluation tasks [11].
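Here is the promised worked example of the reliability check, using scikit-learn’s cohen_kappa_score on two hypothetical annotators’ 1-to-5 ratings; the ratings are invented for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings from two annotators on the same 10 items,
# each scored on a 1-5 criterion such as "Surfing Focus".
annotator_a = [5, 4, 3, 5, 2, 4, 4, 3, 5, 1]
annotator_b = [5, 4, 4, 5, 2, 4, 3, 3, 5, 1]

# Weighted kappa credits near-misses on ordinal scales (4 vs. 5) more than far misses (1 vs. 5).
kappa = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")
print(f"Quadratically weighted Cohen's kappa: {kappa:.2f}")
```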

Rubrics do more than raise inter-annotator agreement; they also act as guardrails against bias and annotator fatigue.

  • The authors of [12] designed assessment rubrics for human evaluation of generated long-form answers to medical questions; the rubrics incorporated multiple dimensions of bias with the potential to contribute to equity-related harm, exposing failure modes missed by generic toxicity screens in LLMs.
  • On the more practical side, a Pareto.ai analysis shows [13] that well-structured instructions combined with task rotation slow the onset of annotation fatigue and maintain quality over many hours of work.

The same rubric that guides automated judges must first prove itself with human annotators by helping them grade consistently. When it does, manual annotation becomes a controlled experiment rather than a leap of faith.

Conclusion

To conclude, the evidence argues for a layered, rubric-driven evaluation process:

  1. Define rubrics that surface failure modes in both training data and model responses.
  2. Train annotators until inter-rater scores stabilize.
  3. Rely on automated judges for faster grading of responses during model iteration.
  4. Run periodic expert spot checks to catch drift in both annotators and judges.

Following these steps delivers an evaluation methodology that provides the nuanced, multi-faceted metrics real-world scenarios demand, and that scales reliably to the volume and pace of fast AI system development.

Stay tuned for Part 2 of our rubric series, where we’ll unpack different types of rubrics and discuss when to apply them on various types of datasets.

At Snorkel AI, we don’t just write about high-quality evaluation; we deliver it. Through our Expert Data-as-a-Service, we partner with frontier model developers to curate world-class datasets for training and evaluating LLMs. Think: high-signal, expert-labeled data tailored to your most ambitious use cases, from complex tool-using agents to reasoning-heavy workflows.

Interested in bringing rubric-based rigor to your models? Let’s talk.

Snorkel Expert Data-as-a-Service

Accelerate the evaluation and development of frontier AI models with a scalable, white-glove service that provides model development teams with high quality, expert data.

References

  1. Breck, Eric, et al. “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction.” 2017 IEEE International Conference on Big Data (Big Data). IEEE, 2017.
  2. Biyani, Param, et al. “RUBICON: Rubric-Based Evaluation of Domain-Specific Human-AI Conversations.” Proceedings of the 1st ACM International Conference on AI-Powered Software. 2024.
  3. Leng, Quinn. “Best Practices for LLM Evaluation of RAG Applications.” Databricks Blog, 2023, www.databricks.com/blog/LLM-auto-eval-best-practices-RAG.
  4. Calderon, Nitay, Roi Reichart, and Rotem Dror. “The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs.” arXiv preprint arXiv:2501.10970 (2025).
  5. Kim, Seungone, et al. “Prometheus 2: An open source language model specialized in evaluating other language models.” arXiv preprint arXiv:2405.01535 (2024).
  6. “Empirical exploration into academic grading and feedback approaches.” Edexia, 4 Apr. 2025, www.edexia.ai/news/empirical-explration-into-academic-grading-and-feedback-approaches
  7. Galvan-Sosa, Diana, et al. “Rubrik’s Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset.” arXiv preprint arXiv:2503.23899 (2025).
  8. He, Zeyu, et al. “If in a Crowdsourced Data Annotation Pipeline, a GPT-4.” Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 2024.
  9. “Inter-rater Reliability: Definition, Examples, Calculation.” Encord, encord.com/blog/inter-rater-reliability/.
  10. Kiritchenko, Svetlana, and Saif M. Mohammad. “Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation.” arXiv preprint arXiv:1712.01765 (2017).
  11. Santhanam, Sashank, and Samira Shaikh. “Towards best experiment design for evaluating dialogue system output.” arXiv preprint arXiv:1909.10122 (2019).
  12. Pfohl, Stephen R., et al. “A toolbox for surfacing health equity harms and biases in large language models.” Nature Medicine 30.12 (2024): 3590-3600.
  13. Parti, Ayush. “Annotation fatigue: Why human data quality declines over time.” Pareto et al., 6 Feb. 2025, pareto.ai/blog/annotation-fatigue