JudgmentBench: Comparing rubric and preference evaluation for quality assessment

May 27, 2026

A collaboration among Stanford Law School, Snorkel AI, and Harvey, JudgmentBench is the first benchmark to collect both rubric scores and pairwise preference judgments from the same domain experts on the same items. That pairing makes it a direct test of how expert judgment should be elicited to evaluate AI in domains without verifiable ground truth.


What is JudgmentBench?

JudgmentBench is a benchmark of 30 real-world legal tasks paired with 1,539 rubric scores and 1,530 pairwise preference judgments collected from practicing attorneys, including lawyers at major U.S. law firms who have substantial experience. It is the first publicly available dataset in a high-expertise domain where both supervision signals are elicited from the same experts on the same items, making it possible to compare evaluation methodologies head-to-head rather than across different annotator pools. The tasks are drawn from Harvey’s BigLaw Bench and were selected to represent economically valuable legal work rather than stylized exam-style problems. The annotations represent roughly $242,000 in equivalent billable attorney time.

Why JudgmentBench matters

Two methodologies dominate how teams benchmark and evaluate model outputs. Rubric-based scoring grades each item against predefined criteria, while comparative judgment elicits pairwise preferences between two outputs. Both are widely used to build LLM judges and autograders — yet the choice between them is rarely justified, and almost never tested against expert ground truth in a domain where correctness is genuinely hard to verify.

JudgmentBench closes that gap. By capturing both signals from the same attorneys on the same legal work product, it isolates the methodology itself as the variable, giving teams an evidence-based foundation for designing evaluation pipelines, reward models, and LLM-as-a-judge systems in expert domains.

Key findings

The authors generated LLM outputs at three constructed quality levels and measured how well each evaluation method recovered the intended quality ordering:

  • Comparative judgment substantially outperforms rubric scoring. Pairwise preferences recovered the intended ordering with a mean Spearman’s rank correlation of 0.908, versus just 0.150 for rubrics (estimated difference of 0.758, 95% interval [0.494, 1.021]).
  • It is also cheaper. Collecting preference judgments required less than half the annotation time of rubric scoring.
  • The pattern is robust. It holds for both human annotators and LLM autograders, suggesting the advantage is a property of the methodology rather than of who (or what) is doing the grading.
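
To make the metric concrete, here is a minimal sketch of how a Spearman's rank correlation against the intended quality ordering could be computed for a single task. It is an illustration, not the authors' pipeline: the toy scores and preferences are invented, the helper names (`average_ranks`, `spearman`, `win_counts`) are hypothetical, and turning pairwise preferences into a ranking by counting wins is an assumption made for simplicity.

```python
from statistics import mean


def average_ranks(values):
    """Rank values 1..n (ascending); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side has no variation
    return cov / (var_x * var_y) ** 0.5


def win_counts(n_items, preferences):
    """Count pairwise wins per item (an assumed, simple way to rank items)."""
    wins = [0] * n_items
    for winner, _loser in preferences:
        wins[winner] += 1
    return wins


# Invented toy data for ONE task: outputs 0, 1, 2 built as low < mid < high quality.
intended = [1, 2, 3]
rubric_scores = [0.70, 0.80, 0.75]       # hypothetical rubric scores per output
preferences = [(1, 0), (2, 0), (2, 1)]   # hypothetical (winner, loser) judgments

rho_rubric = spearman(intended, rubric_scores)
rho_pairwise = spearman(intended, win_counts(3, preferences))
print(f"rubric rho = {rho_rubric:.3f}, pairwise rho = {rho_pairwise:.3f}")
```

Averaging such per-task correlations is one way to obtain a single mean figure per method; the exact aggregation the authors used is not described here.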

For teams building evaluation in domains without verifiable ground truth, the takeaway is concrete: pairwise preference signals recover expert-intended quality far more reliably than rubric scores — and do so at lower annotation cost.

A broader research agenda

Beyond this initial comparison, JudgmentBench’s paired structure supports a wider agenda on how expert judgment should be elicited, aggregated, and used as supervision. Because every item carries both a rubric score and a preference judgment from the same expert, researchers can study aggregation strategies, annotator agreement, and the reliability of automated judges against a high-expertise reference — work that is directly relevant to evaluating frontier AI agents on subjective, high-stakes tasks.
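
As one illustration of the kind of analysis this paired structure allows, the sketch below measures chance-corrected agreement (Cohen's kappa) between an expert's pairwise preferences and an automated judge's preferences on the same pairs. The labels are invented, the function name `cohens_kappa` is our own, and nothing here is drawn from the JudgmentBench data or the authors' code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labelled independently at their own rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both raters use a single shared label
    return (observed - expected) / (1 - expected)


# Invented example: which output ("A" or "B") each rater preferred on six pairs.
expert_prefs = ["A", "A", "B", "B", "A", "B"]
judge_prefs = ["A", "A", "B", "A", "A", "B"]

kappa = cohens_kappa(expert_prefs, judge_prefs)
print(f"expert vs. judge kappa = {kappa:.3f}")
```

Kappa is only one of several possible agreement measures; it is used here because it discounts agreement that would occur by chance when one option is preferred more often.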

Authors

JudgmentBench was created by Russell Yang, Ruishi Chen, Pierce Kelaita, Riya Ranjan, Sibo Ma, Megan Ma, and Julian Nyarko of Stanford University, Charles Dickens of Snorkel AI, and Matthew Guillod of Harvey. Harvey provided the legal tasks (from its BigLaw Bench) and associated data, and Snorkel AI served as the project’s data partner. The paper (arXiv:2605.25240) was submitted on May 24, 2026.

Snorkel AI’s role

Snorkel AI is a contributor to JudgmentBench, with Snorkel researcher Charles Dickens among the co-authors. The work reflects Snorkel’s broader focus on rigorous, expert-grounded evaluation — building the data and methods that determine how AI is judged in domains without verifiable ground truth. It builds on Snorkel’s data-centric AI research and its support for open evaluation through the Open Benchmarks Grants program.

JudgmentBench is part of Snorkel’s broader work on AI agent evaluation. Explore related benchmarks: Agents’ Last Exam, Terminal-Bench Science, and Terminal-Bench 2.0.
