The self-critique paradox: Why AI verification fails where it’s needed most

November 26, 2025

TL;DR: We stress-tested the “generate → criticize → improve” loop on 50 visual reasoning tasks. The results were counterintuitive: self-critique acts as a corrosive agent on high-performance tasks, turning 98% accuracy into 57%. Yet, for tasks where models fail completely, it works like magic. This difficulty-dependent behavior poses a critical, hidden risk for RLFT pipelines.


The promise vs. the reality

The “agentic loop,” in which an AI critiques and improves its own work, is a popular way to try to boost performance. Techniques like Self-Refine (Madaan et al., 2023) and Reflexion (Shinn et al., 2023) spread the idea that iterative feedback helps models solve complex tasks. The logic is simple: two heads (even if they’re the same head) are better than one.

But are they always?

We ran a rigorous experiment:

  • 50 hard visual reasoning tasks (verifiable ground truth)
  • 2 frontier models (Claude Sonnet 4.5, OpenAI o4-mini)
  • 5 critique-improve iterations per task
  • 100 total experiments (50 tasks × 2 models)
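
The protocol above can be sketched in a few lines. This is an illustration under stated assumptions, not the harness used for the experiments: each model call is assumed to be wrapped in a plain Python callable, and the names `critique_loop`, `generate`, `critique`, and `improve` are hypothetical. The stub critic always reports a problem, which mimics a critic that is primed to find errors.

```python
from typing import Callable, List


def critique_loop(
    task: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    improve: Callable[[str, str, str], str],
    iterations: int = 5,
) -> List[str]:
    """Return the initial answer followed by one answer per critique-improve round."""
    answers = [generate(task)]
    for _ in range(iterations):
        feedback = critique(task, answers[-1])
        answers.append(improve(task, answers[-1], feedback))
    return answers


# Hypothetical stubs standing in for real model calls.
def always_yes(task: str) -> str:
    return "yes"


def eager_critic(task: str, answer: str) -> str:
    # Always claims to have found a problem, whether or not one exists.
    return "possible error found"


def flip(task: str, answer: str, feedback: str) -> str:
    # "Fixes" the answer by reversing it whenever the critic complains.
    return "no" if answer == "yes" else "yes"


answers = critique_loop("Is the shape symmetric?", always_yes, eager_critic, flip)
print(answers)  # ['yes', 'no', 'yes', 'no', 'yes', 'no']
```

Keeping every intermediate answer, rather than only the last one, makes it possible to score accuracy at each iteration against the ground truth.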

Why these models? We chose two of the strongest reasoning models available. If SOTA models specifically optimized for reasoning cannot use self-critique reliably on tasks they already handle well, that suggests a fundamental limitation, one likely to be even more severe in weaker or smaller models.

What we found challenges a core assumption of agentic AI: that more self-critique means better answers.


The data: A tale of two extremes

When we aggregated our results, it looked like a generic failure: accuracy dropped 10% overall. But when we split tasks by difficulty, a startling pattern emerged.

1. The “corrosive critique” effect (Easy Tasks)

For tasks where models started strong (≥75% accuracy), the critique loop was devastating for both models.

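| Model             | Initial | Loop 5 | Drop    | Result                 |
|-------------------|---------|--------|---------|------------------------|
| Claude Sonnet 4.5 | 98.1%   | 56.9%  | ↓ 41.2% | 0 improved, 8 degraded |
| OpenAI o4-mini    | 94.2%   | 78.4%  | ↓ 15.8% | 0 improved, 5 degraded |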

What happened? Hallucination. The critic, primed to find errors, invented them. A correct answer of “yes” became “no” because the model “detected” a 2-pixel discrepancy that didn’t exist. Confidence became a liability.

2. The “Lazarus” effect (Hard Tasks)

For tasks where models failed completely (<35% accuracy), critique was a miracle worker.

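| Model             | Initial | Loop 5 | Gain    | Result                 |
|-------------------|---------|--------|---------|------------------------|
| Claude Sonnet 4.5 | 0.0%    | 60.0%  | ↑ 60.0% | 3 improved, 0 degraded |
| OpenAI o4-mini    | 0.0%    | 20.0%  | ↑ 20.0% | 1 improved, 0 degraded |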

Here, the critic had real errors to catch (calculation mistakes, logic inversions), and debugging actually worked. Both models showed the same two-sided pattern, which suggests a general property of LLM reasoning rather than a quirk of one architecture.
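
The difficulty split can be sketched as follows. The 75% and 35% thresholds come from the two sections above; everything else is an assumption for illustration: the function names, the "medium" label for tasks in between, and the sample records, which are made-up numbers and not our results.

```python
from typing import Dict, List, Tuple

EASY_MIN = 0.75  # tasks where models "started strong"
HARD_MAX = 0.35  # tasks where models "failed completely"


def bucket(initial_acc: float) -> str:
    """Assign a task to a difficulty bucket from its initial accuracy."""
    if initial_acc >= EASY_MIN:
        return "easy"
    if initial_acc < HARD_MAX:
        return "hard"
    return "medium"  # hypothetical label for everything in between


def change_by_bucket(records: List[Tuple[float, float]]) -> Dict[str, float]:
    """Mean (final - initial) accuracy per bucket, in percentage points."""
    deltas: Dict[str, List[float]] = {}
    for initial, final in records:
        deltas.setdefault(bucket(initial), []).append((final - initial) * 100)
    return {name: sum(vals) / len(vals) for name, vals in deltas.items()}


# Made-up (initial, final) accuracy pairs, one per task.
sample = [(1.0, 0.5), (0.8, 0.8), (0.0, 1.0), (0.5, 0.5)]
print(change_by_bucket(sample))
# {'easy': -25.0, 'hard': 100.0, 'medium': 0.0}
```

Averaging over all tasks at once would blend the easy-task losses with the hard-task gains, which is why the aggregate number alone looks like a generic failure.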


The hidden danger for model training

This finding has profound implications beyond just prompt engineering. It strikes at the heart of modern model training, particularly Reinforcement Learning Fine-Tuning (RLFT) and Reinforcement Learning from AI Feedback (RLAIF).

The reward modeling trap

In RLFT/RLAIF pipelines, we often use a strong model (the “Judge”) to score the outputs of a model being trained. If the Judge is the same model (or a similar one), our results suggest a dangerous feedback loop:

  1. Penalty for perfection: If the student model gets an easy task right, the Judge might hallucinate a flaw and penalize it.
  2. Reward for uncertainty: The Judge may prefer hedged, uncertain answers over confident, correct ones to avoid “missed” errors.
  3. Drift: Over time, this could train models to be less decisive on simple tasks while over-correcting on complex ones.
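
The “penalty for perfection” can be made concrete with a little arithmetic. The sketch below is a toy model, not a measurement: `policy_acc`, `false_flag`, and `miss` are hypothetical parameters, and it assumes the Judge's errors are independent of everything except whether the answer is correct.

```python
def wrong_verdict_rate(policy_acc: float, false_flag: float, miss: float) -> float:
    """Share of Judge verdicts that disagree with ground truth.

    policy_acc: probability the policy's answer is actually correct
    false_flag: probability the Judge rejects a correct answer
    miss:       probability the Judge accepts an incorrect answer
    """
    return policy_acc * false_flag + (1.0 - policy_acc) * miss


def penalties_on_correct(policy_acc: float, false_flag: float, miss: float) -> float:
    """Of all penalties the Judge hands out, the share that lands on correct answers."""
    on_correct = policy_acc * false_flag
    on_incorrect = (1.0 - policy_acc) * (1.0 - miss)
    total = on_correct + on_incorrect
    return on_correct / total if total > 0 else 0.0


# Illustrative numbers only: same Judge, easy task (policy 95% right) vs hard task (policy 20% right).
for acc in (0.95, 0.20):
    share = penalties_on_correct(acc, false_flag=0.2, miss=0.5)
    print(f"policy accuracy {acc:.0%}: {share:.0%} of penalties hit correct answers")
```

Under this toy model, the better the policy already is on a task, the larger the share of penalties that punish correct answers, which is the same difficulty dependence seen in the critique loops.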

If your reward model (Judge) has the same blind spots as your policy model, self-correction isn’t just useless—it’s an adversarial attack on your own training data.


When to use critique in agentic solutions

The data is clear: Self-critique is not a free lunch. It’s a high-stakes bet that only pays off when you’re already losing.

The core strategy: triage your tasks

Don’t apply a flat “3 loops” policy to everything. Categorize incoming requests by difficulty first; otherwise you risk the corrosive effect on the tasks your model already gets right.

1. The “Red Zone” (easy tasks) → zero loops

Identify them by: Simple classification, high initial confidence (>90%), or tasks where LLMs historically excel (e.g., sentiment analysis, basic extraction).
Action: Trust the first draft. Critique here is actively harmful (accuracy fell by roughly 16 to 41 points in our tests).
Why: The model is right, but the critic will hallucinate flaws to justify its existence.

2. The “Green Zone” (hard tasks) → 3-5 loops

Identify them by: Complex reasoning, multi-step logic, or low initial confidence (<50%).
Action: Force critique loops.
Why: The model is likely wrong initially. The critic acts as a debugger, catching calculation errors or logic gaps that the generator missed.
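
The two zones can be expressed as a small routing function. This is a sketch under assumptions: `confidence` is some calibrated estimate in [0, 1] that you already have, the function and parameter names are hypothetical, and the treatment of the middle band (between 50% and 90%) is not covered by our data, so it is left as a parameter that defaults to no critique.

```python
def critique_loops_for(
    confidence: float,
    easy_min: float = 0.90,  # Red Zone: confidence above this gets zero loops
    hard_max: float = 0.50,  # Green Zone: confidence below this gets critique
    hard_loops: int = 3,     # the article suggests 3-5 for hard tasks
    middle_loops: int = 0,   # assumption: untested band, default to no critique
) -> int:
    """Return how many critique-improve loops to run for a request."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence > easy_min:
        return 0
    if confidence < hard_max:
        return hard_loops
    return middle_loops


print(critique_loops_for(0.95))  # 0
print(critique_loops_for(0.30))  # 3
print(critique_loops_for(0.70))  # 0
```

Making the middle band explicit forces a deliberate choice for the cases the two zones do not cover, instead of silently falling back to a flat policy.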

The golden rule for agents

Critique is for debugging, not polishing.
If your agent is confident and the task is standard, shut the critic up. Only engage the loop when the model is struggling or the task complexity demands a “second pair of eyes” to catch structural errors.


References

  1. Madaan, A., et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651.
  2. Shinn, N., et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366.
Armin Parchami
Sr. Director, R&D

Armin Parchami is the Senior Director, R&D, at Snorkel AI, where he leads work on synthetic data, data quality, and model fine-tuning. He previously held technical leadership roles at Ford and Nokia Bell Labs, focusing on multimodal AI and autonomy. His work centers on moving research into production.
