The right tool for the job: An A-Z of rubrics
You know what you want to measure, but how are you going to do it?
This is Part 2 of Snorkel AI’s five-part blog series on rubrics (you can find Part 1 here). In this post, we dive deeper into the types of rubrics: dataset-level rubrics that apply to every prompt, and instance-specific rubrics designed alongside a particular prompt and applied only to it. We also distinguish process (trace-level) evaluation from outcome evaluation, and share some notes on LLM-based versus code-based evals.
We will answer three main questions:
- What to measure? Granularity vs. Specificity
- Where to measure? Process vs. Outcome
- How to measure? LLM vs. Code

The “black box” nature of GenAI systems makes it challenging, if not impossible, to predict how they will perform in real-world environments. Successfully taking GenAI systems from proof of concept (POC) all the way to production requires building confidence in their outputs: are they efficient, are they safe, and are they accurate?
For easily verifiable tasks with a clear definition of correctness, system performance and output quality can be measured against “gold standard” answers—these could be mathematical questions that have a computable numerical answer, or generated code that can be executed and tested against a set of hand-crafted unit tests.
However, for most real-world tasks there isn’t a clear-cut distinction between a good response and a bad one, and the challenge only grows for more subjective or open-ended queries.
Asking for “the best pizza place in New York” is unlikely to produce a recommendation that is universally agreed upon, but response quality can still be measured by evaluating against specific criteria: Does the recommended restaurant serve pizza, and is the restaurant located in New York?
In our last post, we introduced evaluation rubrics as a necessary evaluation technique for consistently and reliably measuring the quality of responses from GenAI systems. By breaking down ideas of “good” and “bad” into well-defined dimensions and measurable criteria, rubrics provide a shared understanding of the desirable and undesirable elements that could appear in a response, thereby reducing subjective or biased judgements. Building a high-quality rubric takes time, as domain experts must identify and codify these elements, but this effort helps to ensure high agreement between human annotators and improves the performance of auto-evaluators (i.e., LLM-as-a-judge) [1].
Building on our ideas of rubric-based evaluation, this post covers the different types of rubric and their applications: What should I be measuring with a rubric? Where am I going to measure it? And, how am I going to measure it? Together, these questions help us map out the different types of rubrics that you might consider building.
What to measure?
“Vibe-based” measurement of response quality has been popular in the past, leading to subjective and inconsistent judgement, with style in some cases being more influential than substance (or correctness) in the eyes of the reviewer. While it’s quick for an annotator to press thumbs up 👍 or thumbs down 👎 to express their opinion, the lack of clear criteria fails to capture the specific qualities of a response and makes it difficult to distinguish between “good” and “better” or “bad” and “unacceptable.”
When thinking about what to measure, the designer must make decisions along two axes: granularity of the evaluations, and specificity of the rubrics. Together, these determine the level of insight gleaned from analyzing model responses and also the effort required to set up the evaluation framework.
- Granularity – A coarse evaluation, like the one above, will focus on a single idea (or dimension) of “good” and “bad.” A fine-grained evaluation [2] will decompose quality into multiple dimensions (e.g., fluency, style, correctness, helpfulness, etc.), painting a richer picture of response quality and the individual characteristics of a particular model.
- Specificity – When each dimension is converted into a rubric, the designer must choose how specific to make it. A broad dataset-level rubric is applicable to any response, whereas an instance-specific rubric [3] is only relevant to responses to a particular prompt or query. As we’ll discuss, more specific rubrics can achieve higher agreement rates across annotators, since highly specific instructions or criteria remove uncertainty and ambiguity, but they come at a higher production cost.
Rubric Landscape: Granularity × Specificity
| | Dataset-Level (applies broadly) | Instance-Specific (tailored to one prompt) |
| --- | --- | --- |
| Coarse-Grained | Criterion: The response must not contain harmful or offensive content. Scoring: 0 = contains harmful/offensive content; 1 = does not contain harmful/offensive content. | Criterion: When asked about insulin dosage, the response must not provide a specific numerical dosage. Scoring: 0 = provides dosage; 1 = avoids dosage. |
| Fine-Grained | Criteria: Evaluate on the following dimensions (0–3 scale each): – Factual accuracy – Are the statements factually correct? – Reasoning clarity – Is the reasoning process understandable and coherent? – Harmlessness – Does the response avoid unsafe or harmful suggestions? | Criteria (examples): – Medical case: “Does the response advise immediate emergency care in the first few sentences when appropriate?” (0/1) – Research replication: “Does the reproduced experiment match the reported F1-score within ±2%?” (0/1) |
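To make the landscape concrete, here is a minimal Python sketch of how rubrics in these four quadrants could be represented as structured objects. The `Criterion` and `Rubric` classes, field names, and example criteria are our own illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """A single measurable criterion with an explicit scoring scale."""
    description: str
    scale: tuple[int, int] = (0, 1)  # (min, max) score

@dataclass
class Rubric:
    """A set of criteria plus metadata about where the rubric applies."""
    name: str
    criteria: list[Criterion]
    specificity: str              # "dataset-level" or "instance-specific"
    granularity: str              # "coarse" or "fine"
    prompt_id: str | None = None  # only set for instance-specific rubrics

# Coarse-grained x dataset-level: one broad safety check, applied to any prompt.
safety_filter = Rubric(
    name="harmlessness_filter",
    criteria=[Criterion("Response contains no harmful or offensive content.")],
    specificity="dataset-level",
    granularity="coarse",
)

# Fine-grained x instance-specific: several checks tied to a single medical prompt.
insulin_rubric = Rubric(
    name="insulin_question_rubric",
    criteria=[
        Criterion("Does not provide a specific numerical insulin dosage."),
        Criterion("Advises consulting a clinician before changing dosage."),
        Criterion("Factual statements about insulin are accurate.", scale=(0, 3)),
    ],
    specificity="instance-specific",
    granularity="fine",
    prompt_id="insulin-dosage-001",
)
```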
Evaluation granularity
Coarse-grained evaluation
Coarse-grained rubrics collapse evaluation into a single dimension of “goodness.” The most widely cited example is Anthropic’s HHH framework (Helpfulness, Honesty, and Harmlessness) [5], which acts as a high-level filter for AI system safety and utility. Coarse-grained rubrics are valuable for early-stage and system-level screening because they are simple to apply, cheap to produce, and align with intuitive human judgments.
However, coarse-grained rubrics lack explanatory power. If a system performs poorly, a coarse-grained evaluation will simply say that it failed, not why: was the issue factual accuracy, reasoning errors, poor fluency, or something else?
Fine-grained evaluation
Fine-grained rubrics decompose evaluation into multiple distinct dimensions, each with its own criterion and scoring scale. For example, FLASK (Fine-grained Language Model Evaluation based on Alignment Skill Sets) [2] proposes a framework that measures model outputs along a structured set of alignment skills such as factuality, coherence, reasoning, and style. By evaluating each skill separately, fine-grained rubrics provide a diagnostic lens: not only do they reveal whether an output is “good,” but also which dimension of quality needs improvement.
This granularity comes at a cost: designing and calibrating multiple dimensions requires significant expert input, and applying them consistently can be cognitively demanding for human annotators.
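As a small sketch of the diagnostic payoff, assuming per-dimension scores (0–3) have already been collected from annotators or an auto-evaluator, averaging them per dimension quickly surfaces the weakest skill:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical fine-grained scores for three responses, on a 0-3 scale per dimension.
scores = [
    {"factuality": 3, "reasoning_clarity": 1, "harmlessness": 3},
    {"factuality": 2, "reasoning_clarity": 1, "harmlessness": 3},
    {"factuality": 3, "reasoning_clarity": 2, "harmlessness": 3},
]

by_dimension = defaultdict(list)
for response_scores in scores:
    for dimension, score in response_scores.items():
        by_dimension[dimension].append(score)

averages = {dim: mean(vals) for dim, vals in by_dimension.items()}
weakest = min(averages, key=averages.get)
print(averages)                         # per-dimension averages
print(f"Weakest dimension: {weakest}")  # reasoning_clarity
```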
Rubric specificity
Dataset-level rubrics
At the broader end of the spectrum are dataset-level rubrics: generic criteria that can be applied consistently across an entire corpus. The HHH framework again serves as an example here: the same three criteria can be used to evaluate any model output regardless of prompt. Dataset-level rubrics are cost-effective to implement and ensure consistency across large-scale evaluations, making them well-suited for benchmarking.
Instance-specific rubrics
At the other end of the spectrum are instance-specific rubrics, designed for a particular prompt or scenario. For example, HealthBench [3] evaluates LLM performance in clinical contexts using highly specialized rubrics written by medical experts. These rubrics go beyond generic notions of correctness to capture medically critical criteria. For example: whether a response provides contraindications, cites evidence-based guidelines, or avoids harmful suggestions. Similarly, PaperBench [6] evaluates reasoning across academic tasks by decomposing each instance into a hierarchical rubric with tailored sub-criteria.
The advantage of instance-specific rubrics is precision: annotators achieve higher agreement rates because the rubric removes ambiguity about what counts as “correct.” They also enable more open-ended evaluation of agentic tasks that may not have a single correct answer, but do have important requirements that any acceptable answer should meet. The trade-off is production cost, since each new prompt may require its own bespoke rubric.
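As a rough sketch of how an instance-specific rubric might be scored, loosely modeled on HealthBench’s point-weighted criteria (the criteria, weights, and normalization below are our own simplification):

```python
# Each criterion carries points; a response's score is the sum of points for the
# criteria it meets, normalized by the total points available for that prompt.
rubric = [
    {"criterion": "Advises immediate emergency care in the first few sentences", "points": 5},
    {"criterion": "Avoids giving a specific medication dosage", "points": 3},
    {"criterion": "Mentions relevant contraindications", "points": 2},
]

def score_response(criteria_met: set[str], rubric: list[dict]) -> float:
    earned = sum(item["points"] for item in rubric if item["criterion"] in criteria_met)
    total = sum(item["points"] for item in rubric)
    return earned / total

# A grader (human or LLM) decides which criteria the response satisfied.
met = {
    "Advises immediate emergency care in the first few sentences",
    "Avoids giving a specific medication dosage",
}
print(score_response(met, rubric))  # 0.8
```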
Case studies
PaperBench (Fine-grained × Instance-level)
- Example rubric item: “The code to evaluate the prior-free forecasting method on all configurations has been executed and the F1-scores recorded.”
- Why it matters: Each paper is decomposed into a rubric tree of binary checks. This instance-specific structure ensures that replications are scored not only on results but also on whether the code actually runs.
HealthBench [3] (Fine-grained × Instance-level)
- Example rubric item: “In an emergency case, does the response clearly advise immediate emergency care in the first few sentences?”
- Why it matters: Physicians authored thousands of bespoke, conversation-specific criteria across the benchmark, but also agreed on a smaller set of reusable safety checks (like emergency escalation). This balances precision with scalability.
OpenAI Deep Research [7] (Fine-grained × Instance-level)
- Example rubric item: “Does each key claim in the research report have a clear inline citation to a credible source?”
- Why it matters: For complex, open-ended research, evaluation focuses on coverage, sourcing, and reasoning traceability, with criteria tied to the specific information request.
Anthropic HHH [5] (Coarse-grained × Dataset-level)
- Example rubric item: “Does the response avoid harmful or offensive content?” (0 = unsafe, 1 = safe)
- Why it matters: A simple binary rubric that can be applied to any prompt, making it cheap to scale and useful as a system-wide filter, though it gives little diagnostic detail.
FLASK [2] (Fine-grained × Dataset-level)
- Example rubric item: “Rate the response’s reasoning clarity on a 0–3 scale.”
- Why it matters: This is one of FLASK’s 12 reusable skill dimensions. These criteria can be applied broadly to any prompt, giving structured, multi-dimensional insight without rewriting rubrics per instance.
Where to measure?
So far, we’ve discussed what rubrics measure and their granularity and specificity. The next design choice is where to apply them: should evaluation focus on the process the model follows to reach an answer, or only on the outcome it produces?
This distinction is becoming increasingly important in GenAI evaluation. For simple question answering, an outcome-only rubric (e.g., “Is the final number correct?”) may suffice. But as models are tasked with multi-step reasoning, planning, or tool use, process-based rubrics that scrutinize reasoning traces, intermediate steps, or decision points become essential for diagnosing failure modes and improving reliability.
Process-based rubrics
Definition. Process-based rubrics evaluate the reasoning steps or intermediate outputs that a model generates before producing its final answer. They ask not only what the model concluded, but also how it got there.
- ProcessBench [4] is a benchmark designed to identify reasoning errors in multi-step math tasks. It explicitly scores whether intermediate steps are logically valid, even when the final answer happens to be correct. This ensures that “lucky guesses” don’t mask flawed reasoning.
- Process-based rubrics are also critical in agentic evaluations, where models reason over tools or multi-turn dialogues. Here, rubrics might check whether the model selected the right tool at the right step, or whether each sub-goal in a plan was executed logically.
Example rubric items:
- Math reasoning: “At each intermediate step, does the arithmetic operation follow correctly from the previous line?” (0/1)
- Agent workflow: “Did the model attempt to retrieve relevant documents before drafting an answer?” (0/1)
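A minimal sketch of the agent-workflow check above, assuming a simple trace format in which each step records the action the agent took (the action names and fields are hypothetical):

```python
# Each step in the trace records the action the agent took and its arguments.
trace = [
    {"step": 1, "action": "search_documents", "args": {"query": "insulin storage guidelines"}},
    {"step": 2, "action": "read_document", "args": {"doc_id": "guideline-42"}},
    {"step": 3, "action": "draft_answer", "args": {}},
]

def retrieved_before_drafting(trace: list[dict]) -> int:
    """Process rubric item: did the agent retrieve documents before drafting? (0/1)"""
    draft_steps = [s["step"] for s in trace if s["action"] == "draft_answer"]
    retrieval_steps = [s["step"] for s in trace if s["action"] == "search_documents"]
    if not draft_steps:
        return 0
    return int(any(r < min(draft_steps) for r in retrieval_steps))

print(retrieved_before_drafting(trace))  # 1
```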
Outcome-based rubrics
Definition. Outcome-based rubrics assess only the final answer or end product of the model’s reasoning, ignoring how it was derived. This approach is simpler and often sufficient when correctness or quality can be judged independently of the process.
- CriticGPT [8] is an outcome-oriented evaluator trained to provide detailed feedback on the final responses of large models. Rather than tracing reasoning steps, it critiques the end output along dimensions like factuality, coherence, and style.
- In coding tasks, outcome-based rubrics often reduce to unit tests: “Does the code execute without error and produce the correct outputs?”
- In research tasks like Deep Research [7], outcome-based rubrics focus on final deliverables. For example, whether each claim is properly cited and the report covers the requested entities.
Example rubric items:
- Final answer correctness: “Is the solution to the equation numerically correct?” (0/1)
- Citation coverage: “Does each key claim in the report include a verifiable citation?” (0/1)
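Outcome checks like these often reduce to small deterministic functions over the final artifact. Here is a sketch of the two items above; the answer format, numeric tolerance, and citation pattern are illustrative assumptions:

```python
import re

def answer_correct(final_answer: str, expected: float, tol: float = 1e-6) -> int:
    """Outcome rubric item: is the final numeric answer correct? (0/1)"""
    try:
        return int(abs(float(final_answer.strip()) - expected) <= tol)
    except ValueError:
        return 0

def citation_coverage(report_sentences: list[str]) -> float:
    """Fraction of sentences containing an inline citation like [3] or (Smith, 2024)."""
    pattern = re.compile(r"\[\d+\]|\([A-Z][A-Za-z]+,\s*\d{4}\)")
    cited = sum(1 for s in report_sentences if pattern.search(s))
    return cited / len(report_sentences) if report_sentences else 0.0

print(answer_correct("42.0", 42))                  # 1
print(citation_coverage(["LLMs hallucinate [1].",  # 0.5
                         "This is well known."]))
```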
Case studies
- Math Reasoning Verification (ProcessBench [4])
- Process-based rubric item: “Is each intermediate equation derived correctly from the previous one?”
- Why it matters: Ensures step-by-step reliability, not just final answer accuracy.
- Agentic Evaluation (Deep Research [7])
- Process-based rubric item: “Did the model gather evidence from multiple independent sources before synthesizing a conclusion?”
- Outcome-based rubric item: “Does the final report include inline citations for all key claims?”
- Why it matters: Both process and outcome rubrics are necessary for multi-step, open-ended synthesis tasks.
- CriticGPT (Outcome-focused)
- Outcome-based rubric item: “Does the response contain factual inaccuracies?”
- Why it matters: Serves as a scalable way to judge outputs directly, without examining intermediate reasoning.
How to measure?
We’ve discussed what rubrics measure (granularity/specificity) and where to measure them (process vs. outcome). The final piece is how those rubrics are actually applied. Who (or what) is the evaluator? In practice, we see four main approaches: human annotators, LLM-as-a-judge, code or rule-based evaluation, and reward models. Each offers different strengths, trade-offs, and costs.
Measurement via humans
Definition. Domain experts or trained annotators apply the rubric directly to model outputs.
- Strengths: The gold standard for nuanced judgments, especially in high-stakes domains (e.g., medicine, law); human annotators anchor the evaluation in real human expectations.
- Weaknesses: Expensive, slow, and prone to inter-annotator disagreement. Requires careful rubric design to reduce bias and ensure consistency.
- Best fit: Complex, subjective, or safety-critical tasks where alignment with human values matters most.
Example rubric item (HealthBench [3]): “In an emergency case, does the response clearly advise immediate emergency care in the first few sentences?”—only medical experts could meaningfully apply this criterion.
Measurement via LLM-as-a-judge
Definition. Large language models apply rubrics automatically, scoring outputs along specified dimensions.
- Strengths: Scalable, cheap, and fast. Can capture nuanced textual qualities (fluency, coherence, tone) at human-comparable accuracy when prompted well. Systems like G-Eval [1] and Prometheus [9] show high correlation with expert ratings.
- Weaknesses: Susceptible to model biases, prompt sensitivity, and failure when judging domains outside training. Must be calibrated and validated against human raters.
- Best fit: Large-scale benchmarking, iterative development cycles, or when human review can be reserved for calibration and spot checks.
Example rubric item (LLM-judge, G-Eval [1]): “Rate the factual accuracy of the response on a 0–5 scale.”—GPT-4 itself can apply this rubric across thousands of outputs in minutes.
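A minimal LLM-as-a-judge sketch, assuming the OpenAI Python SDK; the model name, judge prompt, and integer-only parsing are simplifications, and in practice the judge should be calibrated against human ratings:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an evaluator. Rate the factual accuracy of the response
on a 0-5 scale, where 0 means entirely inaccurate and 5 means fully accurate.
Return only the integer score.

Question: {question}
Response: {response}"""

def judge_factual_accuracy(question: str, response: str, model: str = "gpt-4o") -> int:
    """Apply the factual-accuracy rubric item with an LLM judge."""
    completion = client.chat.completions.create(
        model=model,  # illustrative model name
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, response=response)}],
        temperature=0,
    )
    return int(completion.choices[0].message.content.strip())

score = judge_factual_accuracy("Who wrote Hamlet?",
                               "Hamlet was written by William Shakespeare.")
print(score)  # expected: 5
```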
Measurement via code or rule-based evaluation
Definition. Evaluation is automated via deterministic checks: running code, verifying unit tests, or matching against known ground truth outputs.
- Strengths: Objective, repeatable, and high-precision when ground truth exists. Widely used in coding benchmarks and math problems.
- Weaknesses: Limited to domains with strict correctness criteria; cannot capture qualities like reasoning clarity or style.
- Best fit: Structured domains like programming, math, or data extraction.
Example rubric item: “Does the submitted function pass all provided unit tests?” (0–n).
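A sketch of this kind of check, with a hypothetical task, function name, and test cases; a real harness would sandbox the untrusted code rather than exec it in-process:

```python
# Hypothetical task: "Write a function `add(a, b)` that returns the sum of two numbers."
submission = """
def add(a, b):
    return a + b
"""

test_cases = [((1, 2), 3), ((-1, 1), 0), ((0.5, 0.25), 0.75)]

def run_unit_tests(code: str, tests) -> int:
    """Rubric item: how many of the provided unit tests does the submission pass? (0-n)"""
    namespace: dict = {}
    try:
        exec(code, namespace)  # NOTE: run untrusted code in a sandbox in practice
    except Exception:
        return 0
    fn = namespace.get("add")
    if fn is None:
        return 0
    passed = 0
    for args, expected in tests:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed

print(run_unit_tests(submission, test_cases))  # 3
```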
Measurement via reward models
Definition. Learned evaluators (reward models) are trained to approximate human preferences or rubric judgments, and then applied at scale. WEAVER [10] is an example of this approach, learning to align evaluation with multi-criteria human feedback.
- Strengths: Encodes complex preference structures into a model that can be reused broadly. Supports reinforcement learning and agent fine-tuning.
- Weaknesses: Training reward models requires large, high-quality labeled datasets; may encode annotator biases. Still an emerging technique.
- Best fit: Long-term scaling, automated feedback loops, or agent training pipelines.
Example rubric item (WEAVER-style): “Given multiple candidate responses, rank them according to helpfulness, factuality, and harmlessness.”
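A minimal sketch of scoring and ranking candidate responses with a learned reward model, assuming the Hugging Face transformers library; the checkpoint name is an assumed example, and any sequence-classification-style reward head would follow the same pattern:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed example checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

def reward(prompt: str, response: str) -> float:
    """Score a single (prompt, response) pair with the reward model."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits[0].item()  # single scalar preference score

def rank_candidates(prompt: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Rank candidate responses from most to least preferred."""
    scored = [(c, reward(prompt, c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```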
Conclusion
Rubric-based evaluation isn’t just about writing down “good” vs. “bad.” It’s about making deliberate design choices:
- What to measure → Coarse vs. fine-grained; dataset-level vs. instance-specific.
- Where to measure → Process (reasoning traces, tool calls) vs. outcome (final answers).
- How to measure → Humans, LLM-as-a-judge, code/rule checks, or reward models.
Key takeaways
- Granularity and specificity define the resolution of your rubric: from broad filters (HHH) to detailed, instance-specific checks (HealthBench, PaperBench).
- Process vs. outcome determines whether you’re diagnosing how the model reasons or simply what it produces. Both matter, especially in multi-step agentic tasks.
- The measurement method sets the trade-off between cost, scalability, and trust: humans for nuance, LLMs for scale, code for strict correctness, and reward models for emerging hybrid approaches.
In short: the right rubric is always contextual. High-stakes tasks demand fine-grained, instance-specific rubrics applied by experts; broad benchmarking can rely on dataset-level rubrics applied by scalable evaluators.
Looking ahead
In the next post of this series, we’ll move from taxonomy to practice with Part III: The Science of Rubric Design. If Part II mapped the space of what to measure, where to measure, and how to measure, Part III will focus on how to design rubrics that are actually robust, reproducible, and insightful in practice.
Key questions we’ll tackle include:
- Dimensionality: How many criteria should a rubric include, and at what level of granularity?
- Scales: When is a binary pass/fail enough, and when do ordinal or continuous scales provide more reliable signal?
- Reliability: How do overlapping or ambiguous criteria affect inter-annotator agreement, and how can they be refined?
- Trustworthiness: What qualitative and quantitative methods ensure that a rubric is not just consistent, but fair and aligned with evaluation goals?
As we’ll see, rubrics don’t emerge fully formed; they are the product of iterative design, expert input, and systematic refinement. In Part III, we’ll ground these design principles in case studies like HealthBench and PaperBench, and show how they apply both to fine-grained human evaluation and to training reward models.
References
[1] – Liu, Yang, et al. “G-Eval: NLG evaluation using GPT-4 with better human alignment.” arXiv preprint arXiv:2303.16634 (2023).
[2] – Ye, Seonghyeon, et al. “Flask: Fine-grained language model evaluation based on alignment skill sets.” arXiv preprint arXiv:2307.10928 (2023).
[3] – Arora, Rahul K., et al. “HealthBench: Evaluating large language models towards improved human health.” arXiv preprint arXiv:2505.08775 (2025).
[4] – Zheng, Chujie, et al. “ProcessBench: Identifying process errors in mathematical reasoning.” arXiv preprint arXiv:2412.06559 (2024).
[5] – Bai, Yuntao, et al. “Constitutional AI: Harmlessness from AI feedback.” arXiv preprint arXiv:2212.08073 (2022).
[6] – Starace, Giulio, et al. “PaperBench: Evaluating AI’s ability to replicate AI research.” arXiv preprint arXiv:2504.01848 (2025).
[7] – “Introducing Deep Research.” OpenAI, OpenAI, 2 Feb. 2025, https://openai.com/index/introducing-deep-research/. Accessed 23 Aug. 2025.
[8] – “Finding GPT-4’s mistakes with GPT-4.” OpenAI, 27 June 2024, https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/. Accessed 23 Aug. 2025.
[9] – Kim, Seungone, et al. “Prometheus: Inducing fine-grained evaluation capability in language models.” The Twelfth International Conference on Learning Representations. 2023.
[10] – Saad-Falcon, Jon, et al. “Shrinking the generation-verification gap with weak verifiers.” arXiv preprint arXiv:2506.18203 (2025).
Tom Walshe is a Senior Research Scientist at Snorkel AI. Before Snorkel, Tom worked in LegalTech and financial services, where he focused on building end-to-end AI systems and researching data-centric AI. Prior to industry, Tom completed a PhD in Computer Science at the University of Oxford.
Armin Parchami is the Director of Research Engineering at Snorkel AI, where he leads work on synthetic data, data quality, and model fine-tuning. He previously held technical leadership roles at Ford and Nokia Bell Labs, focusing on multimodal AI and autonomy. His work centers on moving research into production.