This is Part 5 of Snorkel AI’s five‑part series on rubrics. See earlier posts: Part 1: Introduction to rubrics for AI evaluation, Part 2: The right tool for the job: an A–Z of rubrics, Part 3: The science of rubric design, and Part 4: Scaling Trust: Rubrics in Snorkel’s Quality Process

As we look ahead, the future of AI is agentic, multi-turn, tool-using, environment-driven, multimodal, and increasingly code-generating. We see rubrics continuing to provide the connective thread that aligns this complexity end to end—turning vague goals into measurable steps and reliable outcomes. To show how this plays out in practice, we walk through a rubric applied to an agentic workflow and highlight how rubric criteria can be composed flexibly as the workflow unfolds.

Along with this, we see that rubrics themselves can be refined with AI feedback, allowing them to keep pace with evolving action spaces while meeting the challenge of maintaining system reliability and safety. Through all of these changes, it remains critical to provide a channel through which expert human input keeps systems grounded, and we discuss an ensemble-based approach to evaluation that shows promise.

Agentic environments, tool-calling, multimodal artifacts, and code

As agentic systems mature, they increasingly rely on tools and operate across multi-turn environments where they plan, reason, and act autonomously. Rubrics play a role in keeping these systems aligned—ensuring every action, from tool use to multimodal generation, is auditable and grounded. To make this concrete, we walk through core dimensions of agentic behavior, show how rubric checks anchor each one, and sketch after the list below how such checks might be expressed in code.

Agentic environments

  • Goal alignment: Plans and reasoning steps remain consistent with user intent and constraints.
  • Turn structure: Each step outputs verifiable, machine-readable results before proceeding.
  • Tool governance: Each call is checked for relevance, validity, and attribution prior to execution.
  • Self-correction: Agents log rubric outcomes to refine reasoning across runs.
  • Safety & oversight: Policy violations or drift in goal adherence trigger intervention or review.
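
To make these dimensions concrete, here is a minimal sketch (in Python) of how such checks might be expressed against a machine-readable agent trace. The criterion names, trace fields, and check logic are illustrative assumptions, not Snorkel's production rubric.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str
    dimension: str                      # e.g., "goal_alignment", "tool_governance"
    check: Callable[[dict], bool]       # applied to one turn of the agent trace
    critical: bool = False              # critical failures short-circuit the evaluation

# Illustrative criteria only; real rubrics would be co-designed with domain experts.
AGENTIC_RUBRIC = [
    Criterion("constraints_captured", "goal_alignment",
              lambda turn: "constraints" in turn),
    Criterion("machine_readable_output", "turn_structure",
              lambda turn: isinstance(turn.get("observation", ""), str)),
    Criterion("tool_call_attributed", "tool_governance",
              lambda turn: "tool_call" not in turn or bool(turn.get("citations"))),
    Criterion("no_policy_violation", "safety_oversight",
              lambda turn: not turn.get("policy_violation", False), critical=True),
]

def score_turn(turn: dict) -> dict:
    """Apply every criterion to a single turn; returns pass/fail per criterion."""
    return {c.name: c.check(turn) for c in AGENTIC_RUBRIC}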

Example: one agent, one task, many checks

To bring these elements together, imagine a travel agent building a two‑day surf weekend in Santa Cruz for under $500 with a three‑star hotel. Each step—from scoping constraints to delivering an itinerary—requires judgment, traceability, and tool use. The user’s request becomes a journey, and rubrics provide confident mile markers along the way. 

The following details the multi-turn steps required to complete the task:

Turn 1: Align on the goal

  • The agent restates constraints (budget, hotel class, surfing focus).
  • Rubric checks: Goal alignment, safety/policy (e.g., no risky advice), evaluation readiness (did the agent capture constraints in machine‑readable form?).

Turn 2: Plan the route

  • The agent proposes: search hotels → check surf report → draft itinerary.
  • Rubric checks: Plan quality (feasible, ordered, justified), evidence plan (sources/tools lined up), cost awareness (budget referenced).

Turn 3: Call the right tools

  • The agent queries hotels and a surf API.
  • Rubric checks: Tool relevance, parameter validity, idempotence/side‑effects, attribution (tie claims to tool outputs), security (no secret leakage), cost hygiene (no loops or wasteful calls).
  • Guardrails: Budget sentinels and loop detection trigger alerts if calls spike without progress (a minimal sketch follows).
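
For illustration, here is a minimal sketch of two of these guardrails, assuming a simple parameter schema and a fixed repetition threshold (both hypothetical, not a real tool-calling API):

# Parameter validity against a declared schema, and loop detection over recent calls.
EXPECTED_PARAMS = {"search_hotels": {"city", "max_price"}, "surf_forecast": {"spot"}}

def params_valid(call: dict) -> bool:
    """Arguments must exactly match the tool's expected schema."""
    expected = EXPECTED_PARAMS.get(call["name"])
    return expected is not None and set(call["args"]) == expected

def loop_detected(history: list[dict], window: int = 5, max_repeats: int = 3) -> bool:
    """Flag runaway behavior: the same (tool, args) pair repeated too often recently."""
    recent = [(c["name"], tuple(sorted(c["args"].items()))) for c in history[-window:]]
    return any(recent.count(sig) >= max_repeats for sig in recent)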

Turn 4: Ground the evidence

  • The agent drafts an itinerary, citing hotel results and the surf forecast.
  • Rubric checks: Evidence sufficiency & independence (more than one source), grounded claims (citations linked to specific pages or document regions), layout fidelity for tables with times and costs.

Turn 5: Deliver the outcome

  • The final plan includes itemized costs, hotel details, and two surf sessions.
  • Rubric checks: Budget compliance, preference match, clarity & structure, overall quality.
  • Aggregation: Per‑turn scores are combined with time‑aware weighting so later, better‑informed steps count more; critical policy violations short‑circuit to a fail regardless of polish (sketched below).
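
As a sketch of this aggregation, assuming per-turn scores have already been computed and critical violations flagged (the linear weighting scheme is illustrative):

def aggregate(turn_scores: list[dict]) -> float:
    """turn_scores: one dict per turn, e.g. {"score": 0.8, "critical_violation": False}."""
    if any(t.get("critical_violation") for t in turn_scores):
        return 0.0  # short-circuit: no amount of polish outweighs a policy violation
    weights = [i + 1 for i in range(len(turn_scores))]  # time-aware: later turns count more
    total = sum(w * t["score"] for w, t in zip(weights, turn_scores))
    return total / sum(weights)

# Example: five turns that improve as the agent gathers evidence.
print(aggregate([{"score": 0.6}, {"score": 0.7}, {"score": 0.8}, {"score": 0.9}, {"score": 1.0}]))
# ≈ 0.87, versus a flat average of 0.80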

A machine-readable trace

Rubrics enable reproducible, auditable evaluation rather than guesswork:

[
  {"turn": 1, "goal": "Plan a 2-day surf weekend in Santa Cruz", "constraints": {"budget": 500, "hotel": "3-star", "focus": "surf"}},
  {"turn": 2, "plan": ["search_hotels", "check_surf_report", "draft_itinerary"]},
  {"turn": 3, "tool_call": {"name": "search_hotels", "args": {"city": "Santa Cruz", "max_price": 150}}, "observation": "Seaview Inn $140/night (3-star)", "citations": ["hotel:seaview"]},
  {"turn": 4, "tool_call": {"name": "surf_forecast", "args": {"spot": "Lighthouse Cove"}}, "observation": "Low swell Sunday AM", "self_update": "Shift long surf to Day 1", "citations": ["surfline:2025-09-13"]},
  {"turn": 5, "candidate": "Itinerary with two surf sessions, itemized costs", "rubric": ["budget_compliance", "preference_match", "evidence_grounding"]}
]

A modular approach

The example above highlights the modularity of the rubric itself and reflects the strong connection between the rubric and the functional goals of the agentic application. Criteria for each action can be grouped according to that action, and the overall rubric can be composed of those groups dynamically, based on the sequence of turns and steps taken by the agent; a sketch of this composition follows the criteria groups below. In each of the cases of tool calling, artifact generation, and code creation, criteria can be articulated in a way that modularizes well.

Tool calling

  • Relevance: The tool fits the sub-goal.
  • Parameter validity: Arguments match schema and context.
  • Idempotence & side-effects: Irreversible actions require confirmation.
  • Attribution: Claims tie to tool outputs or sources.
  • Cost & security hygiene: Avoid redundant calls; respect budgets; prevent secret leakage.

Multimodal artifacts

  • Perception fidelity: Extracted text or bounding boxes faithfully reflect the source document or image.
  • Grounded claims: Citations link to document regions or image coordinates.
  • Layout & accessibility: Preserve structure, units, and alt-text.
  • Image safety: Redactions and filters meet policy standards.

Code & technical content

  • Execution: Compiles and runs without error.
  • Tests: Meets unit and property coverage goals.
  • Performance budgets: Within time and memory limits.
  • Security & structure: No vulnerabilities; clear, documented design.
  • Traceability: Links requirements ↔ tests ↔ implementation.
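
Here is a minimal sketch of this dynamic composition, assuming a machine-readable trace like the one above and abbreviated criteria names that mirror the groups listed here (the "artifact" and "code" trace keys are illustrative assumptions):

# The overall rubric is assembled from per-action criteria groups, driven by
# which actions actually appear in the agent's trace.
CRITERIA_GROUPS = {
    "tool_call": ["relevance", "parameter_validity", "idempotence", "attribution", "cost_security"],
    "artifact":  ["perception_fidelity", "grounded_claims", "layout_accessibility", "image_safety"],
    "code":      ["execution", "tests", "performance_budgets", "security_structure", "traceability"],
}

def compose_rubric(trace: list[dict]) -> list[str]:
    """Collect the criteria groups triggered by the actions present in the trace."""
    criteria: list[str] = []
    for turn in trace:
        for action in ("tool_call", "artifact", "code"):
            if action in turn:
                criteria += CRITERIA_GROUPS[action]
    return sorted(set(criteria))  # deduplicate; ordering is not meaningful here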

Rubrics can also score progress over time, awarding partial credit as tests start to pass and tracking lift in pass-rates—turning evaluation into a measure of momentum, not just an end-state snapshot.
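
For example, a small sketch of how this partial credit and pass-rate lift might be tracked across attempts (numbers are illustrative):

def pass_rate(results: list[bool]) -> float:
    """Partial credit: the fraction of tests currently passing."""
    return sum(results) / len(results) if results else 0.0

before = pass_rate([True, False, False, False])   # 0.25 after the first attempt
after = pass_rate([True, True, True, False])      # 0.75 after a revision
lift = after - before                             # 0.50: momentum, not just an end state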

Dynamism and rubric evolution

Similar to how the AI systems themselves are constantly evolving, rubrics can also be refined with AI feedback, allowing them to keep pace with the dynamic nature of AI applications. However, to meet the challenge of maintaining system reliability and safety, this rubric evolution must have its own testing and verification process to preserve alignment with system objectives and inter-annotator agreement. This is best done by an ensemble of LLM and human judges, and we expect to see this practice become commonplace.

Rubrics that learn from AI

Rubrics themselves can evolve through AI feedback. Promising techniques include:

  • Mining annotator rationales to clarify ambiguous criteria.
  • Using LLMs to propose candidate sub-criteria, then A/B testing them on audit sets (sketched after this list).
  • Applying constitutional-style self-critique to scale preference data.¹
  • Training lightweight reward models on rubric-scored pairs to separate true quality from proxies.
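
As one illustration of the second technique, a candidate sub-criterion might be kept only if it improves judge-expert agreement on an audit set. The judge functions and audit-set format below are hypothetical placeholders, not a specific library API:

def agreement(judge, audit_set: list[dict]) -> float:
    """Fraction of audit examples where the judge matches the expert gold label."""
    matches = [judge(ex["response"]) == ex["gold_label"] for ex in audit_set]
    return sum(matches) / len(matches)

def keep_candidate(judge_current, judge_candidate, audit_set, min_gain: float = 0.02) -> bool:
    """Adopt the candidate sub-criterion only if it measurably improves agreement."""
    return agreement(judge_candidate, audit_set) >= agreement(judge_current, audit_set) + min_gain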

The evaluator ensemble

Because no single judge—human or model—captures the full picture, evaluation pipelines increasingly rely on ensembles. Combining deterministic checks, LLM-based scoring, and targeted human review provides both scalability and safety.

An effective ensemble includes (a minimal sketch follows the list):

  1. Deterministic checks for structure and safety.
  2. LLM-as-judge scoring for semantic quality.
  3. Risk-weighted human review for edge cases.
  4. Drift monitors to catch misalignment over time.
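
Here is a minimal sketch of these four layers, with the LLM judge, human-review queue, and thresholds as hypothetical placeholders rather than a production pipeline:

def evaluate(response: dict, llm_judge, human_queue, risk_threshold: float = 0.7) -> dict:
    # 1. Deterministic checks for structure and safety gate everything else.
    if not response.get("is_valid_json", True) or response.get("policy_violation", False):
        return {"score": 0.0, "route": "fail_fast"}

    # 2. LLM-as-judge scoring for semantic quality (assumed to return 0.0-1.0).
    score = llm_judge(response)

    # 3. Risk-weighted human review: low-confidence or high-risk cases go to experts.
    if score < risk_threshold or response.get("high_risk", False):
        human_queue.append(response)
        return {"score": score, "route": "human_review"}

    return {"score": score, "route": "auto_accept"}

# 4. Drift monitor: pause auto-scoring when human-model agreement falls below a floor.
def should_pause(human_scores: list[float], model_scores: list[float], floor: float = 0.8) -> bool:
    agree = [abs(h - m) <= 0.1 for h, m in zip(human_scores, model_scores)]
    return (sum(agree) / len(agree)) < floor if agree else False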

When human-model alignment drops below a calibrated threshold, automatic scoring pauses until retraining restores trust—closing the loop between evaluation and governance.

Series wrap‑up

Rubric-based evaluation has shifted from optional to foundational. As AI becomes more autonomous and multimodal, rubrics and evaluators together form the backbone of trustworthy, scalable AI development.

Across this five-part series, we explored:

  • Part 1 – Introduction to rubrics for AI evaluation: What rubrics are and why they outperform ad hoc checks for both human and automated evaluation.
  • Part 2 – The right tool for the job—an A–Z of rubrics: The landscape: dataset-level vs. instance-specific, coarse vs. fine-grained, process (traces) vs. outcome, and evaluator types (humans, LLM judges, code checks, reward models).
  • Part 3 – The science of rubric design: Treat rubrics like models—how to structure criteria and scales, calibrate graders, quantify agreement between raters and with AI judges, and iterate for reliability and fairness.
  • Part 4 – Scaling Trust: Rubrics in Snorkel’s Quality Process: Our Trusted Scale pipeline—co-design rubrics with experts, validate/calibrate, and run continuous QA loops that tie rubric scores to downstream ROI.
  • Part 5 – Future directions and emerging trends: Looking ahead to agentic multi-turn traces, tool-calling, multimodal and code artifacts; using AI feedback (e.g., constitutional-style self-critique) to refine rubrics; and deploying an ensemble of evaluators to make this practical in production.

At Snorkel AI, we are putting all of these practices together to help our customers achieve the levels of performance and accuracy that make AI-based applications truly useful. If you’re working on a project that could benefit from the dataset quality earned through rigorous application of rubrics, or you’d like to develop rubrics for your RL environments, come talk to us!

References

  1. Bai, Yuntao, et al. “Constitutional AI: Harmlessness from AI Feedback.” arXiv preprint arXiv:2212.08073 (2022).