This is Part 3 of Snorkel AI’s five-part blog series on rubrics. In Part 1 we introduced rubric-based evaluation and discussed the benefits to both automated and human evaluations. In Part 2 we dove into different types of rubrics and where it makes sense to apply them, including in agentic contexts where we may need step-by-step evaluations. 

Here we will explain the science of rubric design, drawing on our experience with customers, internal experiments, and prior research not only in AI but also in education, a field that shares the same fundamental challenges of defining and measuring what has been learned.

We will focus on:

  1. Structuring the rubric.
  2. Measuring rubric quality.
  3. Improving the rubric.

A key takeaway from this post: rubrics are not simply constitutions that exist in their own right. Their purpose is to predict value in a way that (1) aligns with the stakeholders’ objectives for the AI system, and (2) maximizes agreement rates among the stakeholders.

Neither axis alone is sufficient. We can write down criteria that appear to align with objectives but yield low agreement rates sample by sample because they are vaguely worded or too open to interpretation. On the other hand, we can develop criteria that produce very high agreement rates but turn out to be trivial, with little power to predict AI system outcomes.

In the science of rubric design, we treat rubrics as models and iterate on their development to optimize along both axes.

Structuring the rubric

In this section, we’ll break down the key components of effective rubric structures and discuss how they can be tailored for fine-grained evaluations and reward modeling.

For fine-grained evaluations

When structuring rubrics for fine-grained evaluations, it’s essential to start by establishing the form of assessments you would like to receive from evaluators. Dawson [3] proposes 14 rubric design elements to consider for evaluations. Four design elements that are relevant to our discussion on rubric structure are listed below, followed by a sketch of how they might be captured in code:

  • Evaluative criteria: The criteria of a rubric are the attributes of quality that are being evaluated.
  • Specificity: Rubric criteria exist on a spectrum from general to highly specific. General criteria can be reused to evaluate many task instances across a broad domain, while criteria tailored to a specific task instance support more detailed feedback and analysis.
  • Quality levels: Quality levels are ordered categories that an evaluator can use for their assessment. For example, a criterion can have binary pass/fail levels, or a range of quality levels, as in a Likert scale with varying degrees of agreement.
  • Accompanying feedback information: It is often useful to obtain rationales from evaluators that explain why they provided a certain grade. Rationales can serve as additional feedback for the model and support rubric quality assurance; for example, they give the evaluator an opportunity to highlight ambiguities in the rubric’s evaluation criteria.
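
To make these four elements concrete, here is a minimal sketch, in Python, of how they might be captured in a simple data structure. This is our own illustration rather than a schema from Dawson [3], and the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RubricCriterion:
    """One evaluative criterion, capturing the four design elements above."""
    name: str                        # the attribute of quality being evaluated
    description: str                 # what "good" looks like for this attribute
    specificity: str                 # "general" (reusable) or "instance" (task-specific)
    quality_levels: list             # ordered categories, e.g. ["fail", "pass"] or a Likert scale
    collect_rationale: bool = True   # ask evaluators to explain the grade they assign


@dataclass
class RubricAssessment:
    """A single evaluator's judgment against one criterion."""
    criterion: RubricCriterion
    level: str                       # must be one of criterion.quality_levels
    rationale: Optional[str] = None  # accompanying feedback information


# Example: a general, binary criterion with rationales enabled.
accuracy = RubricCriterion(
    name="factual_accuracy",
    description="The response contains no factual errors.",
    specificity="general",
    quality_levels=["fail", "pass"],
)
```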

Rubric designers often face challenging decisions around the specificity and quality-level elements described above. Below is a collection of tips and considerations for improving the effectiveness of your rubric for fine-grained evaluations:

  • Number of scale points: How many points (options) should a rating scale have? Research and practice often recommend between five and seven points, as this range balances granularity with cognitive ease. An exception is made for binary grading scales, which force more decisive judgments from humans or models. If a binary scale is used, ensure the criteria for “pass” and “fail” are well defined and provided to the evaluator.
  • Odd vs. even number of options: Offering a middle/neutral option (an odd-numbered scale) can affect responses. Some respondents appreciate a neutral option to indicate indifference or uncertainty, while others may overuse it to avoid making a decision. Removing the midpoint (using an even-numbered scale) forces a directional choice but can also frustrate those who genuinely feel neutral.
  • Balanced and labeled scales: A well-designed scale should be balanced (equal number of positive and negative options around a midpoint) and use symmetric wording. If one side of a scale is more strongly worded than the other, it can push respondents toward the softer side, biasing results.
  • Rubric complexity: Rubric complexity and design can dramatically affect how human evaluators respond. When faced with a long or tedious questionnaire, respondents may experience survey fatigue. Symptoms include selecting the same answer for many items in a row (“straight-lining”), choosing arbitrary answers just to finish quickly, or abandoning the survey altogether. A simple screen for straight-lining is sketched below.
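
As a small illustration of how survey fatigue can be monitored, the sketch below flags evaluators who gave identical ratings to every rubric item. The function is our own simplification, not a method from the cited research, and real analyses would look at additional response patterns.

```python
def straight_line_fraction(ratings_by_evaluator: list) -> float:
    """Fraction of evaluators who gave the exact same rating to every rubric item.

    A crude screen for survey fatigue: a high value suggests the rubric may be
    too long or tedious and the responses may not be trustworthy.
    """
    if not ratings_by_evaluator:
        return 0.0
    flat = sum(1 for ratings in ratings_by_evaluator if len(set(ratings)) == 1)
    return flat / len(ratings_by_evaluator)


# Each inner list is one evaluator's ratings across five rubric items (1-5 scale).
ratings = [
    [4, 4, 4, 4, 4],  # identical ratings everywhere: possible straight-lining
    [5, 3, 4, 2, 4],
    [3, 3, 2, 4, 3],
]
print(straight_line_fraction(ratings))  # 0.333...
```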

To reduce ambiguity, more sophisticated rubric structures are also being adopted, giving more control to the rubric creator and meeting the requirements of new methodologies. For instance, the rubric for the PaperBench benchmark, discussed in more detail in a later section, is organized in a hierarchical structure [8]. This supports the decomposition of the main replication task into increasingly fine-grained sub-tasks. At the top of the rubric is the broad goal, “Replicate the paper’s main contributions,” and each level of the tree breaks the task down into more specific outcomes, all the way down to the leaf nodes, which are objective and easily gradeable. Leaf nodes can be one of three types:

  1. Results match: A check of whether the results of submitted code match those reported in the original paper.
  2. Execution: A check of whether the code executes successfully and produces the required outputs.
  3. Development: A check of whether the agent has written code that implements part of the paper’s contributions, such as an algorithm.

Further, nodes are weighted according to their importance. The node weights are used to aggregate a final score. A concrete example of a portion of a rubric structure is provided below:

Figure 1. An excerpt of the rubric for one of the papers in PaperBench [8].

The rubric is structured in this way to enable objective and reproducible grading and to provide an evaluation that gives partial credit for incomplete attempts. The partial credit is especially important for highly complex tasks for AI agents such as scientific paper reproduction. An agent can be thoroughly evaluated and learn from meaningful measures of progress.
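
To illustrate the mechanics of weighted, hierarchical scoring, here is a minimal sketch in the spirit of PaperBench’s rubric trees. It is not the benchmark’s actual implementation, and the node names and weights below are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RubricNode:
    """A node in a hierarchical rubric: leaves carry a binary grade, and internal
    nodes aggregate the weighted scores of their children."""
    name: str
    weight: float = 1.0
    children: list = field(default_factory=list)
    passed: Optional[bool] = None  # set only on leaf nodes

    def score(self) -> float:
        if not self.children:  # leaf: full or no credit
            return 1.0 if self.passed else 0.0
        total_weight = sum(child.weight for child in self.children)
        return sum(child.weight * child.score() for child in self.children) / total_weight


rubric = RubricNode("Replicate the paper's main contributions", children=[
    RubricNode("Reproduce core algorithm", weight=3.0, children=[
        RubricNode("Code executes end to end", weight=1.0, passed=True),        # Execution
        RubricNode("Results match reported numbers", weight=2.0, passed=False), # Results match
    ]),
    RubricNode("Implement ablation study code", weight=1.0, passed=True),       # Development
])
print(rubric.score())  # 0.5 -- partial credit for an incomplete attempt
```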

For learning algorithms

All of the principles for fine-grained evaluation are relevant to designing rubrics for learning algorithms. However, there are two key differences: (1) in most learning applications, the rubric signal is ultimately collapsed into a one-dimensional readout (e.g., good vs. bad on some scale), and (2) it is simply impractical to have humans in any learning loop that involves thousands of evaluated steps (often more), so there is the additional challenge of automation to reduce latency. This means leveraging AI to apply rubrics via large language model-based annotator judges (LLMAJ), which, in some cases, can be the model-in-training itself via self-critiques. It can also mean training a more lightweight reward model using data labeled by humans, LLMAJ, or both.
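
As a hedged sketch of what this can look like in practice, the snippet below asks an LLMAJ to grade a response against a weighted rubric and collapses the per-criterion grades into a single scalar reward. The prompt wording, the `rubric` format, and the `call_judge` callable are all illustrative placeholders rather than any particular framework’s API.

```python
import json

JUDGE_PROMPT = """You are grading a model response against the rubric below.
For each criterion, decide "pass" or "fail" and give a one-sentence rationale.
Return a JSON object mapping each criterion name to an object with keys "grade" and "rationale".

Rubric:
{rubric}

Response to grade:
{response}
"""


def rubric_reward(response: str, rubric: dict, call_judge) -> float:
    """Collapse per-criterion LLMAJ grades into a single scalar reward in [0, 1].

    `rubric` maps criterion name -> weight; `call_judge` is a placeholder for
    whatever function sends a prompt to the judge model and returns its text.
    """
    prompt = JUDGE_PROMPT.format(
        rubric="\n".join(f"- {name} (weight {weight})" for name, weight in rubric.items()),
        response=response,
    )
    grades = json.loads(call_judge(prompt))
    passed_weight = sum(w for name, w in rubric.items() if grades[name]["grade"] == "pass")
    return passed_weight / sum(rubric.values())
```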

Arguably the simplest examples can be found in early work on reinforcement learning from human feedback (RLHF), in which teams of human annotators labeled data for reward model training using relatively flat rubrics. However, Bai et al. [2] pioneered more scalable rubric application through their constitutional AI (CAI) framework, aimed at improving the helpfulness, harmlessness, and honesty of LLMs. At the core of CAI is a reward model trained on preference pairs generated from self-critiques and revisions guided by a general rubric. The rubric is a succinct set of principles, referred to as a “constitution,” that is included in multiple prompts. For instance, the following are a critique-request and a revision-request prompt from the original paper:


“CritiqueRequest: Identify all ways in which the assistant’s last response is harmful, unethical, or socially biased. Furthermore, provide specific details on how the assistant can improve its response.

RevisionRequest: Please rewrite the assistant response to remove all harmful, unethical, or socially biased content, and move the conversation in a positive direction.”

The rubric in the CAI framework is embedded in the prompts via brief statements of the criteria, e.g., “harmful, unethical, or socially biased.” This form of rubric is simple to develop and the authors show it is effective for achieving their goals. However, the model is left to interpret the meaning of “harmful, unethical, or socially biased.”
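
To make the critique-and-revision loop concrete, here is a minimal sketch of one round, paraphrasing the CAI recipe rather than reproducing the paper’s code. The `generate` callable is a placeholder for a call to the model being improved, and the prompt wording is abridged from the examples above.

```python
def critique_and_revise(prompt: str, generate, principle: str):
    """One critique-revision round in the spirit of CAI (a sketch; `generate`
    stands in for a call to the model being improved)."""
    initial = generate(prompt)
    critique = generate(
        f"{prompt}\nAssistant: {initial}\n"
        f"CritiqueRequest: Identify all ways in which the assistant's last "
        f"response is {principle}. Provide specific details on how the "
        f"assistant can improve its response."
    )
    revised = generate(
        f"{prompt}\nAssistant: {initial}\nCritique: {critique}\n"
        f"RevisionRequest: Please rewrite the assistant response to remove "
        f"all {principle} content."
    )
    # The (initial, revised) pair can later be used as preference data for reward model training.
    return initial, revised


# Example usage, with the rubric embedded as a brief statement of the criteria:
# initial, revised = critique_and_revise(user_prompt, call_model, "harmful, unethical, or socially biased")
```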

Many variations of CAI have been developed since that work under the banner of reinforcement learning from AI feedback (RLAIF), often using rubrics to guide the feedback. A more recent approach proposed by Srivastava et al. [7] leverages a causally robust reward modeling framework (CROME). This approach hinges on “causal rubrics” that disentangle factors that truly drive quality from those that may be correlated but spurious (e.g., the length of a response). Importantly, the rubrics are developed by interacting with language models to identify candidate attributes in the data, which are then reviewed, and the resulting rubrics are used to augment a reward model training dataset, leading to a 5.4% overall accuracy gain on RewardBench. The authors also show fewer failures on adversarial or long-tail inputs with spurious attributes and provide a granular evaluation of reward model performance tied to the rubric structure.

Measuring rubric quality

Quantitative evaluations provide measurable indicators of rubric effectiveness. Numerical measures enable rapid iteration and systematic assessment of rubric improvements, reducing reliance on labor-intensive qualitative reviews alone. Importantly, quantitative evaluations of rubrics must align directly with the intended application and downstream outcomes.

One goal of building rubrics is to improve inter-annotator agreement, or inter-rater reliability: the degree of agreement among independent graders [4, 9]. For instance, the joint probability of agreement estimates the percentage of time graders agree. A drawback of this measure is that it does not account for the likelihood of graders agreeing by chance. See [4, 9] for more sophisticated statistics, such as Cohen’s kappa and Fleiss’ kappa, that correct for chance agreement.
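
As a minimal, from-scratch illustration, the snippet below computes both statistics for a pair of graders with categorical grades; in practice you might reach for a statistics library instead, and the example grades are made up.

```python
from collections import Counter


def percent_agreement(a: list, b: list) -> float:
    """Joint probability of agreement: the fraction of items graded identically."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa: observed agreement corrected for agreement expected by chance."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    p_chance = sum((counts_a[k] / n) * (counts_b[k] / n) for k in set(a) | set(b))
    return (p_observed - p_chance) / (1 - p_chance)


grader_1 = ["pass", "pass", "fail", "pass", "fail", "pass"]
grader_2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(percent_agreement(grader_1, grader_2))       # 0.833...
print(round(cohens_kappa(grader_1, grader_2), 3))  # 0.667 -- lower once chance agreement is removed
```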

Studies overwhelmingly show that rubrics improve inter-rater reliability compared to unguided scoring [5]. When teachers or judges use a common rubric, their scoring tends to converge. However, high inter-annotator agreement is not automatic. Measuring agreement statistics gives insight into the quality of your rubric and where it could be improved. For example, despite an extensive process of vetting physicians and ongoing qualitative evaluations, the creators of HealthBench, Arora et al. [1], still measured substantial variability, with agreement rates of 55% to 75%. The authors hypothesized:

“Reasons for variation in grading of consensus criteria could include ambiguity in criteria, ambiguity in conversations and responses to be graded, and differences in clinical specialization, risk tolerance, perceived severity, communication style, and interpretation of instructions.”

Training raters together, providing scoring examples, and revising unclear criteria are techniques that have been proven in practice to push inter-rater reliability higher.

Rubrics also help individual graders stay consistent with themselves, improving intra-rater consistency. A review of studies found that rubric use tends to yield high internal consistency as measured by Cronbach’s alpha [5]. Notably, by providing a fixed reference for each score level, rubrics anchor the grader’s expectations. 
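
For reference, here is a small sketch of Cronbach’s alpha computed over a matrix of rubric-item scores. The data are made up, and in practice you would run this on your own graded responses.

```python
import numpy as np


def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (responses x rubric items) score matrix: a common
    gauge of internal consistency across items."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)


# Rows are graded responses; columns are rubric items scored on a 1-5 scale (toy data).
scores = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 4, 5],
    [3, 3, 2, 3],
])
print(round(cronbach_alpha(scores), 2))  # ~0.93 on this toy matrix
```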

Rubric alignment between human annotators and LLMAJ is another meaningful quantitative target when AI is leveraged to apply rubrics (as discussed above in the context of learning). The same statistics used for inter-annotator agreement can quantify LLMAJ alignment. For instance, Sirdeshmukh et al. [6] recently introduced the MultiChallenge benchmark for evaluating LLMs on their ability to conduct multi-turn conversations. The authors follow an instance-level, rubric-based evaluation process to facilitate automatic evaluation, and LLMAJ alignment improved dramatically from 37.3% to 93.95% when the judges were given access to the rubrics.

Finally, we began this post by arguing for a second axis along which we want to optimize rubrics: agreement with overall objectives. From a measurement perspective, this amounts to some notion of agreement with the “ultimate stakeholders” who understand (and may have even formulated) the original problem in depth. In practice, this often means defining a small group that can speak to the objectives and applying the same agreement-rate measures discussed above. It is worth noting that this can be challenging from an organizational perspective, demanding tight communication within product teams when rubrics are developed at enterprises, for example. Some of these issues are addressed below when we discuss qualitative refinement, which is not just about rubric development but about working out more exact definitions of the overall objectives.

Improving the rubric

Rubrics enable highly specialized and multi-faceted evaluations. However, simply using a rubric does not guarantee a valid assessment; as the saying goes, “garbage in, garbage out.” An imprecise rubric can undermine even the most advanced and well-intentioned human experts and models. Conversely, rubrics that undergo structured qualitative and quantitative assessments can act as a foundation for fair, detailed, and scalable evaluation. 

Quantitative refinement

While there is established research on how to measure rubric quality, how do we leverage these measures to improve the rubric? The same techniques and processes used for developing models can be applied if the rubric is viewed as a type of model. Essentially, we turn the entire AI development process on its head: freeze the AI system and its outputs, refine the rubric itself, and iteratively measure (and increase) agreement rates and alignment with the objectives described above.
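
A minimal sketch of this loop, assuming a frozen set of AI outputs and treating the rubric like a model under development (the `propose_edit` and `agreement` callables are hypothetical placeholders for your editing step and agreement metric):

```python
import random


def refine_rubric(rubric, frozen_outputs, propose_edit, agreement, rounds=10, seed=0):
    """Hill-climb on a rubric against frozen AI outputs (a sketch)."""
    random.seed(seed)
    samples = list(frozen_outputs)
    random.shuffle(samples)
    split = int(0.8 * len(samples))
    dev, test = samples[:split], samples[split:]  # hold out a test set up front

    best, best_score = rubric, agreement(rubric, dev)
    for _ in range(rounds):
        candidate = propose_edit(best, dev)  # add, delete, or reword criteria
        score = agreement(candidate, dev)
        if score > best_score:
            best, best_score = candidate, score

    # Check the held-out set only once, at the end, to avoid overfitting the
    # rubric to a small development sample.
    return best, agreement(best, test)
```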

All of the “Data Science 101” principles apply here: hold out enough data for a test set as you refine your rubric so that you don’t overfit to small samples. Criteria can be added, deleted, or modified, either manually or with AI in the loop, while optimizing for the relevant measures. Naturally, for this process to work, we need a means of updating the rubric based on these evaluations and measures, which brings us to a key part of the loop: the qualitative treatment.

Qualitative refinement

Ideally, stakeholders have put together a strong team of experts that can go through multiple rounds of collaborative refinement. Recent benchmarks, such as HealthBench and PaperBench, exemplify the benefits of rigorous qualitative evaluations [1, 8]. HealthBench is a benchmark based on 5,000 multi-turn healthcare conversations, featuring over 48,000 unique rubric criteria for evaluating model response quality. PaperBench assesses AI agents’ ability to replicate 20 ICML 2024 papers using manually curated rubrics to evaluate their progress. Both projects engaged deeply with domain experts to manually build, evaluate, and refine the rubrics. This collaboration process is not easy. The authors of PaperBench reported this about their experience:

“Constructing the rubrics for each paper was notably the most time-intensive aspect of developing PaperBench. Each rubric was written in collaboration with one of the original authors of each paper, and took multiple weeks per paper to go from paper reading, initial creation, rubric review, iteration, and final sign-off.”

To manage the challenges, the creators of HealthBench and PaperBench first made sure the experts with whom they were collaborating were vetted; even the most carefully crafted rubric can only be as effective as the people applying it. For HealthBench, only 262 out of 1,021 physicians were selected based on assessments of their expertise and their ability to create clear, objective, and relevant criteria. The PaperBench authors considered only those articles granted Spotlight and Oral presentations at a highly selective peer-reviewed conference. After the initial vetting process, both benchmarks incorporated multiple rounds of review, feedback, and ongoing quality monitoring. PaperBench even specified criteria for rubric acceptance, ensuring evaluation feasibility within a defined time constraint (15 minutes per reproduction).

The structured review cycles and monitoring practices of HealthBench and PaperBench highlight that rubric quality is inseparable from the people and processes using the rubric. Even the most rigorously defined criteria require careful consideration of who conducts the evaluations and how they do so. Attention to evaluator selection and training is therefore essential, not only to uphold consistency, but also to address potential sources of bias that can undermine rubric reliability and fairness. Building robust rubrics to address biases of human evaluators and elicit the best possible feedback is a familiar research problem to educators, psychologists, and social scientists. Forms of human bias that you can expect to see are:

  • Central tendency bias: Raters often avoid extreme scores on rating scales, clustering evaluations around the middle categories. This has been observed widely in Likert-scale data, where extreme options (“strongly agree” or “strongly disagree”) are chosen less frequently than moderate ones.
  • Leniency/severity bias: On the other hand, individual raters may consistently score too high or too low across the board.
  • Halo effect: The halo effect occurs when a rater’s overall impression colors their ratings of specific criteria.
  • Anchoring effect: Anchoring is the tendency to rely too heavily on an initial piece of information.
  • Similarity and confirmation bias: Raters may also give higher ratings to responses that align with their own perspectives or prior expectations. 
  • Acquiescence bias: Acquiescence bias (yea-saying) is the tendency for respondents to agree with statements regardless of content, especially when unsure.
  • Social desirability bias: Social desirability bias leads people to give answers that cast themselves in a favorable light. 

A robust rubric design cycle proactively seeks to minimize these biases when removing or adding criteria. Rater training sessions that cover common biases have also been shown to improve scoring accuracy. Further, careful criterion wording and scale construction can mitigate respondent biases.

One of the benefits of this dynamic process is that clearer overall objectives can emerge among experts as their biases are addressed and criteria are fleshed out. This enables the type of optimization we referred to at the beginning of this post: higher agreement rates among experts, along with better alignment with an overall objective that has been more clearly defined.

At Snorkel AI, we understand the value of experts in building effective AI, and the importance of detailed qualitative assessments. As evidence, we are releasing expert-driven benchmarking that puts frontier LLMs to the test across a variety of domain-specific agentic AI tasks. To achieve this, we are leaning on our network of experts to ensure we are curating realistic data with meaningful evaluations. For instance, we developed a specialized benchmark for insurance underwriting, working with our expert network to design realistic underwriting scenarios and a detailed and relevant evaluation rubric. See our blog post on Evaluating AI Agents for Insurance Underwriting for more details.

Summary

Here in Part 3 of Snorkel’s blog series on rubric design, we argue that rubrics should be treated as models that optimize for both alignment with stakeholder objectives and high agreement rates among evaluators, rather than a priori guidelines that we take for granted as sufficient. We outline key structural elements for different use cases, including fine-grained evaluations (with considerations for scale points, specificity, and feedback mechanisms) and reward modeling.

Effective rubric design starts by structuring the rubric to combine evaluation criteria across a hierarchy of granularity. Following the initial choice of structure, the rubric is improved through an iterative process that optimizes both alignment with objectives and agreement among experts. Over those iterations, we use quantitative metrics such as inter-annotator agreement and intra-rater consistency, along with qualitative refinement through expert collaboration and review.

Combining both quantitative evaluation and rigorous qualitative assessment with domain experts is how Snorkel ensures reliable, reproducible AI evaluations that remain aligned with the core system objectives. Looking ahead to Part 4, we’ll take an in-depth look at Snorkel’s process, through which we apply these principles to deliver top-quality data through rubric-based evaluation at scale.

References

[1] – Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quinonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. HealthBench: Evaluating large language models towards improved human health. arXiv, 2025.

[2] – Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional AI: Harmlessness from AI feedback. arXiv, 2022.

[3] – Phillip Dawson. Assessment rubrics: towards clearer and more replicable design, research and practice. Assessment & Evaluation in Higher Education, 2015.

[4] – Encord. Inter-rater reliability: Definition, applications. URL https://encord.com/blog/inter-rater-reliability/. Accessed on July 14, 2025.

[5] – Anders Jonsson and Gunilla Svingby. The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2(2): 130–144, 2007.

[6] – Ved Sirdeshmukh, Kaustubh Deshpande, Lifeng Jin, Johannes Mols, Ed-Yeremai Cardona, Dean Lee, Jeremy Kritz, Willow Primack, Summer Yue, and Chen Xin. MultiChallenge: A realistic multi-turn conversation evaluation benchmark for frontier LLMs. arXiv, 2025.

[7] – Pragya Srivastava, Harman Singh, Rahul Madhavan, Gandharv Patil, Sravanti Addepalli, Arun Suggala, Rengarajan Aravamudhan, Soumya Sharma, Anirban Laha, Aravindan Raghuveer, Karthikeyan Shanmugam, and Doina Precup. Robust reward modeling via causal rubrics. arXiv, 2025.

[8] – Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Chan Jun Shern, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, and Tejal Patwardhan. PaperBench: Evaluating AI’s ability to replicate AI research. arXiv, 2025.

[9] – Wikipedia contributors. Inter-rater reliability. URL https://en.wikipedia.org/wiki/Inter-rater_reliability. Accessed on July 14, 2025.