JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment

At our latest Snorkel AI Reading Group, Russell Yang (AI Engineering Fellow at Stanford Law) stopped by our San Francisco office to present JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment. As AI models improve at open-ended tasks, the field faces a harder problem: how to measure quality in domains where ground truth is contested. Two paradigms dominate: rubric-based scoring, where annotators apply predefined criteria to a single output, and comparative judgment, where annotators select the holistically better of two outputs side by side. Both are widely used, but the choice between them is rarely empirically justified, especially in domains like law, consulting, and finance where expert judgment is the only available signal.

JudgmentBench is the first publicly available dataset to elicit both supervision signals from the same experts on the same items. Built on top of Harvey’s BigLawBench, it covers 30 real-world legal tasks annotated by 50+ practicing attorneys from major U.S. law firms, yielding more than 3,000 paired rubric scores and pairwise preference judgments. The results are striking: comparative judgments recover the intended quality ordering nearly perfectly (mean Spearman’s ρ = 0.908) while rubrics barely beat chance (ρ = 0.150), and pairwise judgments require less than half the annotation time, a finding that holds for both human annotators and LLM autograders.

Transcript

Lightly edited for readability.

Charlie Dickens of Snorkel AI introduced Russell Yang to the room, highlighting a background that spans EECS and molecular biophysics at Yale, finance at Citadel and KKR, and now AI engineering at Stanford Law, a mix that makes him uniquely positioned to think about how expert judgment should be captured and used to train AI systems.

Science vs. art: where does legal practice fall?

Thank you, Charlie, for the warm introduction, and thanks to everyone for coming out. I know there’s a lot going on with Databricks Summit this week.

I want to start with a quick poll to set the stage.

Think about science on one end of a spectrum — chemistry, for example. The kinds of questions chemists ask each other are things like: does this reaction produce the right compound? Chemistry is a great example of a natural science domain where ground truth is objective and verifiable.

At the other end of the spectrum: painting or sculpture. When you walk through a museum and look at a piece of art, you probably don’t have a rubric in your mind that you’re walking through to determine whether that painting is good. Instead, you’re thinking about what makes it different from other paintings, what sets it apart in its time period.

So if I could ask all of you to raise a number of fingers from one to ten, based on where you think legal practice falls on that spectrum, I’d love to see it.

[Audience responds.]

I’m seeing some 7s, some 8s, some 3s and 4s. The variability actually plays well into my story. The motivation for our study is the fact that in a lot of use cases, people assume that legal practice, or other knowledge work practices, are like science or like art, without really empirically justifying that claim.

Two evaluation paradigms: rubrics vs. comparative judgment

As many of you probably know, there are two paradigms that people typically use for evaluation in these domains.

The first is comparative judgment: you give a pair of work products to an expert and ask them to indicate which of the two is holistically better.

The second is rubric-based scoring: you give a single work product to an annotator and ask them to apply a set of predefined criteria to give it a score.

This choice matters because it plays directly into how we think about purchasing human data, which in turn determines how we train and improve our models.

The prevailing view in judgment-rich knowledge-work domains like law, consulting, finance, and medicine is that evaluation is inherently holistic: you can only tell whether a work product is good by taking it in its entirety. But again, that’s folk wisdom, not an empirically established result. And that’s exactly what we set out to test.

Building JudgmentBench with expert attorneys

For this project, we were really grateful to have lawyer time contributed by more than 50 expert attorneys. Many came from Snorkel, and some came from a couple of big law firms.

That matters because in the legal field, as in other knowledge work domains, it can be extremely expensive to get time on people’s calendars for evaluation projects. A fifth of the participating lawyers were at partner or counsel level, and another 40% were senior associates, so we had a nice spread of experience levels to explore whether the effects we observed vary by seniority.

Most of the participants were big law attorneys or had substantive experience in firms, primarily in litigation and transactions, the two main areas of commercially relevant legal work.

We built JudgmentBench on top of a base dataset that Harvey generously allowed us to use: BigLawBench, which consists of 100 expert-developed tasks paired with expert-developed rubrics. Here’s a sense of what those tasks look like: one example asks a lawyer or a model to look at a deposition transcript and pull out the most important facts. The rubric for that task is highly specific; it awards points for mentioning particular phrases from the transcript.

With JudgmentBench, we release 30 of Harvey’s 100 BigLawBench tasks and build an annotation-level dataset of more than 3,000 rubric and preference annotations on top of them.

Constructing quality levels

Before I show results, I want to pause and ask: how would you rigorously construct outputs of different quality in the legal space?

This is especially hard because there’s no accepted ground truth. One approach is to look at briefs that plaintiffs and defendants both submitted, and check which side the judge cited more. If the judge cited the plaintiff’s brief more, maybe that brief is higher quality. But that’s confounded by the facts of the case — maybe the plaintiff just has stronger underlying facts — and by judicial preference.

[Audience member suggests using appellate court decisions, specifically whether courts uphold or reverse lower court rulings.]

Yes, that’s a great direction. It brings in similar confounding issues, but it’s an interesting future direction.

The point is: defining quality and ground truth in these high-judgment, knowledge-rich domains is genuinely difficult, and there’s no accepted approach.

For our project, we used prompting to induce quality variations, generating outputs at three constructed levels: excellent, good, and intermediate. The main limitation is that those quality differences might not be representative of actual variation in model-generated outputs, let alone human-generated outputs from practicing lawyers. We used LLM-as-a-judge to validate that measurable differences existed between the levels, and we went to a practicing big law partner to ask whether the differences he observed were at least plausible relative to what he sees in his own work.

Key findings: comparative judgment wins

The key finding of our study was that comparative judgment recovers the quality ordering more reliably than rubric-based scoring. This lines up with the folk wisdom in the space: that there are holistic and tacit notions of quality that practitioners find difficult to enumerate in a rubric, especially in advance.

To measure this, we looked at two statistics.

The first is a per-task rank correlation. For each task, we fit a Bradley-Terry model to the preference data and, separately, averaged the rubric scores; we then looked at the Spearman correlation between each of those implied rankings and the intended quality ordering. This metric is what will feel most familiar to people in the benchmarking space.
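As an illustration only (this is not the study's code), the per-task rank correlation can be sketched in plain Python. The win counts and rubric means below are made-up toy numbers, and `fit_bradley_terry`, `ranks`, and `spearman` are hypothetical helper names; the Bradley-Terry fit uses the standard iterative (MM) update, which is an assumption about the fitting procedure rather than something stated in the talk.

```python
def fit_bradley_terry(wins, n_items, iters=200):
    """Fit Bradley-Terry strengths with the standard MM update.

    wins[(i, j)] is the number of times item i was preferred over item j.
    Returns strengths normalized to sum to 1 (higher = preferred more often).
    """
    p = [1.0 / n_items] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            total_wins = sum(wins.get((i, j), 0) for j in range(n_items) if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in range(n_items)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(updated)
        p = [x / norm for x in updated]
    return p


def ranks(values):
    """Average ranks (1 = smallest value), with ties sharing a rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one ranking is constant
    return cov / (vx * vy) ** 0.5


# Toy data for ONE task. Index 0 = intermediate, 1 = good, 2 = excellent.
intended = [0, 1, 2]
pref_wins = {(2, 1): 7, (1, 2): 3, (2, 0): 8, (0, 2): 2, (1, 0): 6, (0, 1): 4}
rubric_means = [0.62, 0.60, 0.65]  # made-up mean rubric score per level

bt = fit_bradley_terry(pref_wins, n_items=3)
rho_pref = spearman(bt, intended)
rho_rubric = spearman(rubric_means, intended)
print(rho_pref, rho_rubric)
```

On this toy input the Bradley-Terry strengths reproduce the intended order, while the rubric means swap the two lower levels and so correlate less well. Averaging such per-task correlations across tasks would give a summary figure of the kind reported in the talk.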

One issue with this approach is that it’s sample-size dependent. Even if each annotator is only 51% accurate on a pairwise judgment, the rank correlation will converge to 1 as you add more independent annotators, because of the wisdom of crowds.
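The sample-size effect can be made concrete with a simplified stand-in: instead of a full rank correlation, consider a single pair of outputs and ask how often a majority vote of independent annotators picks the better one. This is a sketch under that simplifying assumption, and `majority_correct` is a hypothetical helper name.

```python
from math import exp, lgamma, log


def majority_correct(p, n):
    """Probability that a strict majority of n independent votes is correct.

    Each vote is right with probability p; n should be odd to avoid ties.
    Binomial terms are computed in log space so large n does not overflow.
    """
    total = 0.0
    for k in range(n // 2 + 1, n + 1):
        log_term = (
            lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p)
        )
        total += exp(log_term)
    return total


# Per-annotator accuracy stays at 51%, but the aggregate keeps improving with n.
for n in (1, 101, 1001, 10001):
    print(n, round(majority_correct(0.51, n), 3))
```

The aggregate climbs toward 1 as annotators are added even though no individual annotator gets any better, which is why an aggregate ranking metric alone can overstate per-judgment reliability.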

That’s why we also looked at a second statistic: the aggregate win rate across all judgments made. A value of 60% means that annotators were 60% accurate at picking the better output over the worse one across all individual judgments.
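A minimal sketch of the aggregate win rate for preference data, assuming each judgment is recorded as a (chosen, rejected) pair of intended quality levels. The record format and the `win_rate` name are assumptions for illustration, and the five judgments are toy values.

```python
def win_rate(judgments):
    """Fraction of pairwise judgments in which the intended-better output was chosen.

    judgments: list of (chosen_level, rejected_level), higher level = intended better.
    """
    correct = sum(1 for chosen, rejected in judgments if chosen > rejected)
    return correct / len(judgments)


# Toy judgments: 0 = intermediate, 1 = good, 2 = excellent.
toy_judgments = [(2, 1), (2, 0), (1, 2), (1, 0), (0, 1)]
rate = win_rate(toy_judgments)
print(rate)
```

Because every judgment counts once and nothing is aggregated before scoring, this number does not drift toward 1 as more annotators are added.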

The results, on both metrics, are stark:

Win rate: 67% for comparative judgment vs. 54% for rubrics.
Spearman’s ρ: 0.908 for comparative judgment vs. 0.150 for rubrics.

Comparative judgments nearly perfectly recover the intended quality ordering. Rubrics barely beat chance. And this effect persists across experience levels; regardless of how junior or senior the attorney, comparative judgment had the advantage. We also found that comparative judgments took about half the time of rubric scoring. At data labs like Snorkel, a core question is how to get the most signal for the least cost, especially when experts are expensive. That efficiency finding underscores the case for preference-based annotation.

Implications for benchmarkers and law firms

For benchmarkers, the main implication is that comparative judgments are a better approach for ordering models, at least in high-judgment domains.

For law firms, the finding has practical teeth. Say you’re a partner trying to decide which AI tool from competing vendors to adopt. Our research suggests the right approach is to collect preference feedback from the lawyers at your firm, not to score outputs against some predefined playbook.

This also matters for law firms building internal tooling. Kirkland & Ellis recently announced a $500 million investment in internal AI development. Our research speaks directly to how those teams should think about post-training and model evaluation.

What’s next

The main next step is using the paired structure of JudgmentBench to answer deeper questions about expert judgment.

One big open question: can we back out the actual markers of quality from comparative judgment data? There are existing methods for automatic rubric generation, and we’re thinking about how to use comparative judgment annotations as the ground truth, building rubrics that faithfully replicate the preference judgments of human annotators, rather than being specified in advance.

Q&A

Q: Do you think comparative judgment generalizes beyond legal, to domains like consumer research where most of the answers are subjective?

The science-to-art spectrum is a useful frame for thinking about this. Legal is probably toward the art end, and consumer research or biomedical science might sit somewhere between natural science and legal. Our motivation was that a lot of knowledge work domains use rubrics or preferences without backing that choice empirically. It would be really interesting to see someone run a similar study in consumer research and learn where it falls on the spectrum.

[Audience member describes a parallel setup at their company: they simulate consumer responses for brands using pairwise preferences collected from AI consumers, combined with rubric scoring, then build harnesses to reduce human validation load. They’re uncertain how much human benchmarking is sufficient.]

That sounds like a situation where the ground truth is purely what the consumer thinks, pure taste, which might call for even stronger preference signals.

Q: You used prompting to construct quality levels. How do you know this doesn’t introduce bias, or that quality differences are consistent across cases?

Two responses: first, we used LLM-as-a-judge to validate that measurable quality differences existed between the levels. Second, we asked a practicing big law partner whether the differences looked plausible compared to what he sees in his own work.

But you’re absolutely right that confounders are a concern. One approach would be to collect surface-level linguistic features across the quality classes and check whether a simple classifier can distinguish them by surface features alone, which would suggest the differences aren’t purely about quality. Controlling for those confounders is an important direction for future work.
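One way such a check might look, as a rough sketch rather than anything the study ran: extract a few surface features per output and see whether a deliberately simple classifier separates the quality classes. The feature set, the nearest-centroid classifier, the helper names, and the six placeholder texts are all assumptions for illustration.

```python
import re


def surface_features(text):
    """Return [word count, words per sentence, mean word length]."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    return [
        float(n_words),
        n_words / max(len(sentences), 1),
        sum(len(w) for w in words) / max(n_words, 1),
    ]


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def nearest_label(x, centroids):
    """Label whose centroid is closest (squared Euclidean, unscaled features)."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda label: dist(x, centroids[label]))


def leave_one_out_accuracy(samples):
    """samples: list of (text, label). Classify each item using all the others."""
    feats = [(surface_features(text), label) for text, label in samples]
    hits = 0
    for i, (x, y) in enumerate(feats):
        rest = feats[:i] + feats[i + 1:]
        cents = {
            label: centroid([f for f, lab in rest if lab == label])
            for label in {lab for _, lab in rest}
        }
        hits += nearest_label(x, cents) == y
    return hits / len(feats)


# Placeholder texts, invented for this sketch; real inputs would be the generated outputs.
samples = [
    ("The witness confirmed the delivery date, identified the signed invoice, "
     "and admitted that no written notice was ever sent.", "excellent"),
    ("The transcript shows three key facts: the contract was signed in March, "
     "payment was late, and the defendant knew of the defect.", "excellent"),
    ("Counsel established that the shipment arrived damaged, that the buyer "
     "objected promptly, and that the seller refused any refund.", "excellent"),
    ("The witness talked about an invoice.", "intermediate"),
    ("There was a contract and a late payment.", "intermediate"),
    ("The shipment had some problems.", "intermediate"),
]
accuracy = leave_one_out_accuracy(samples)
print(accuracy)
```

If accuracy on real outputs were well above chance, that would flag a surface confound such as length. The features are left unscaled here, so word count dominates the distance; a real check would standardize them and use a larger sample.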

Q: Did you think about using preference data to compare different parts of a rubric, to get a more granular, dimensional analysis of quality?

Yes, that’s actually close to something we’re working on. We’re thinking about how to use comparative judgment annotations as ground truth to auto-generate rubrics that faithfully replicate the preference judgments of human annotators, rather than being authored in advance.

Q: How do you know the improvement is from rubric vs. comparative judgment, rather than the rubrics just not being good enough?

Fair question. In our study, we didn’t create the rubrics ourselves; they were developed by experts at Harvey as part of BigLawBench. That makes them more rigorous than a typical rubric creation process. But the right way to fully answer this is to compare optimal rubrics against preference data, and that’s clearly a direction for future work.

Q: Were the rubrics created by the same lawyers who did the annotation?

No, the rubrics were created by experts at Harvey. The annotations were performed by a separate group of lawyers: big law attorneys who volunteered through firms, and others recruited through Snorkel.

Q: Looking at the experience plot, it seems like senior lawyers show less advantage for comparative judgment. Does seniority matter?

That pattern is likely an artifact of sample size dependence in the rank correlation metric. When you look at the win rate plot instead, the effect across experience levels is much more uniform. The rank correlation metric is sensitive to how many judgments feed into it, which can distort the picture for senior lawyers, who may have completed fewer judgments.

Q: Do you observe more noise or disagreement in preferences vs. rubrics?

Good question, and a hard one for our dataset. We have 50 versions of each quality level, which means the occasions where two different lawyers saw the exact same version are quite rare. That makes standard inter-rater reliability hard to compute. If you have other methodological suggestions for how to look at that, I’d love to hear them.

Q: LLMs acting as judges might be analogous to less experienced annotators. Do you think rubrics are more appropriate for automated evaluation?

That’s an interesting intuition and probably true in some sense. A lot of RL environments use rubrics for verification precisely because comparative judgment is harder to formalize. An important future direction for this work is studying how well LLM autograders correlate with human judgments in each paradigm, which would give more grounding to that question.

Q: Will these conclusions hold across different types of legal work, like criminal law, IP, or regulatory, versus commercial litigation?

BigLawBench covers commercially relevant big law work, so the tasks are primarily in litigation and transactions; it doesn’t include criminal law, constitutional law, or regulatory matters. The study wasn’t statistically powered to look for differences across task types, but there are almost certainly areas of practice, like criminal advocacy, that lean more toward the “art” end of the spectrum. Exploring that variation within law is an interesting direction.

Q: How does this apply to patents and trademarks specifically?

I’m not a lawyer, so I won’t give legal advice, but there are legal tech companies building specifically for patent and trademark applications. Our main suggestion would be for those teams to think carefully about where their tasks fall on the science-to-art spectrum, and then collect whichever type of human data, rubric or preference, gives them the most signal for the price.

Q: What about using the full briefs from both sides of a case plus the judge’s order as training signal?

That’s an interesting outcome-based approach to ground truth. The core issue is still confounding: the facts of the case may favor one party over the other independently of brief quality, and judicial preferences can be hard to disentangle from quality differences. But it’s an interesting direction, and we’d love to connect and think through how to design something that controls for those confounders.

Q: Have you calculated inter-rater reliability, and does it differ between rubrics and preferences?

Inter-rater reliability is more tractable for rubrics, because rubrics involve a single draw from the pool of quality-level variants, whereas preferences draw two items. The occasional overlap between what multiple lawyers saw is more common for rubrics than for preferences, where you’d need two lawyers to be assigned the same pair in the same order. I’ll check whether we’ve already computed it and will follow up.