The Self-Critique Paradox: When AI Verification Fails

TL;DR: We stress-tested the “generate → criticize → improve” loop on 50 visual reasoning tasks. The results were counterintuitive: on tasks the models already solve well, self-critique is corrosive, cutting one model's accuracy from 98% to 57%. Yet on tasks where the models initially fail completely, it recovers 20 to 60 points of accuracy. This difficulty-dependent behavior poses a critical, hidden risk for RLFT pipelines.

The promise vs. the reality

The “agentic loop”—having an AI critique and improve its own work—is a popular method for attempting to boost performance. Techniques like Self-Refine (Madaan et al., 2023) and Reflexion (Shinn et al., 2023) have popularized the idea that iterative feedback can help models solve complex tasks. The logic is simple: two heads (even if they’re the same head) are better than one.

But are they always?

We ran a rigorous experiment:

50 hard visual reasoning tasks (verifiable ground truth)
2 frontier models (Claude Sonnet 4.5, OpenAI o4-mini)
5 critique-improve iterations per task
100 total experiments
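For readers who want to picture the setup, here is a minimal sketch of a generate → criticize → improve loop with per-round scoring. It is not the harness used in the experiment: `generate`, `critique`, and `improve` are hypothetical callables standing in for model calls, and scoring by exact string match against ground truth is an assumption.

```python
from typing import Callable, List


def run_critique_loop(
    task: str,
    generate: Callable[[str], str],           # hypothetical: task -> first answer
    critique: Callable[[str, str], str],      # hypothetical: (task, answer) -> feedback
    improve: Callable[[str, str, str], str],  # hypothetical: (task, answer, feedback) -> new answer
    iterations: int = 5,
) -> List[str]:
    """Return the initial answer followed by one revised answer per iteration."""
    answers = [generate(task)]
    for _ in range(iterations):
        feedback = critique(task, answers[-1])
        answers.append(improve(task, answers[-1], feedback))
    return answers


def accuracy_per_round(runs: List[List[str]], truths: List[str]) -> List[float]:
    """Fraction of tasks answered correctly at each round (round 0 is the initial answer).

    Assumes every run has the same number of rounds and that an answer is
    correct when it equals the ground-truth string exactly.
    """
    rounds = len(runs[0])
    return [
        sum(run[r] == truth for run, truth in zip(runs, truths)) / len(runs)
        for r in range(rounds)
    ]
```

Keeping every intermediate answer, rather than only the last one, is what makes it possible to see at which round a correct answer gets overturned.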

Why these models? We chose two of the strongest reasoning models available. If SOTA models—specifically optimized for reasoning—cannot leverage self-critique to fix their own outputs on simple tasks, it suggests a fundamental limitation that is likely even more severe in weaker or smaller models.

What we found challenges a core assumption behind agentic loops: that another round of critique can only help.

The data: A tale of two extremes

When we aggregated our results, it looked like a generic failure: accuracy dropped 10% overall. But when we split tasks by difficulty, a startling pattern emerged.

1. The “corrosive critique” effect (Easy Tasks)

For tasks where models started strong (≥75% accuracy), the critique loop was devastating for both models.

Model	Initial	Loop 5	Drop (pts)	Result
Claude Sonnet 4.5	98.1%	56.9%	↓ 41.2	0 improved, 8 degraded
OpenAI o4-mini	94.2%	78.4%	↓ 15.8	0 improved, 5 degraded

What happened? Hallucination. The critic, primed to find errors, invented them. A correct answer of “yes” became “no” because the model “detected” a 2-pixel discrepancy that didn’t exist. Confidence became a liability.

2. The “Lazarus” effect (Hard Tasks)

For tasks where models failed completely (<35% accuracy), critique was a miracle worker.

Model	Initial	Loop 5	Gain (pts)	Result
Claude Sonnet 4.5	0.0%	60.0%	↑ 60.0	3 improved, 0 degraded
OpenAI o4-mini	0.0%	20.0%	↑ 20.0	1 improved, 0 degraded

Here, the critic had real errors to catch (calculation mistakes, logic inversions), and debugging actually worked. Both models showed the same direction of effect in both buckets, which suggests a general property of LLM reasoning rather than a quirk of one architecture.
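The difficulty split can be sketched as follows. The two thresholds come from the text (≥75% for easy, <35% for hard); everything else is an assumption for illustration, including the `"medium"` label for the band in between and the idea that each task record carries an `initial` and `final` accuracy between 0 and 1.

```python
from typing import Dict, List

EASY_MIN = 0.75  # "started strong" threshold from the text (>= 75%)
HARD_MAX = 0.35  # "failed completely" threshold from the text (< 35%)


def difficulty_bucket(initial_accuracy: float) -> str:
    """Assign a task to a bucket based on its accuracy before any critique."""
    if initial_accuracy >= EASY_MIN:
        return "easy"
    if initial_accuracy < HARD_MAX:
        return "hard"
    return "medium"  # assumption: the text does not name the middle band


def change_by_bucket(tasks: List[Dict[str, float]]) -> Dict[str, float]:
    """Mean (final - initial) accuracy per bucket, in percentage points."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for task in tasks:
        bucket = difficulty_bucket(task["initial"])
        totals[bucket] = totals.get(bucket, 0.0) + (task["final"] - task["initial"])
        counts[bucket] = counts.get(bucket, 0) + 1
    return {bucket: 100.0 * totals[bucket] / counts[bucket] for bucket in totals}
```

Reporting the change per bucket instead of one overall number is what keeps a large loss on easy tasks and a large gain on hard tasks from partly cancelling out in the aggregate.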

The hidden danger for model training

This finding has profound implications beyond just prompt engineering. It strikes at the heart of modern model training, particularly Reinforcement Learning Fine-Tuning (RLFT) and Reinforcement Learning from AI Feedback (RLAIF).

The reward modeling trap

In RLFT/RLAIF pipelines, we often use a strong model (the “Judge”) to score the outputs of a model being trained. If the Judge is the same model (or a similar one), our results suggest a dangerous feedback loop:

Penalty for perfection: If the student model gets an easy task right, the Judge might hallucinate a flaw and penalize it.
Reward for uncertainty: The Judge may prefer hedged, uncertain answers over confident, correct ones to avoid “missed” errors.
Drift: Over time, this could train models to be less decisive on simple tasks while over-correcting on complex ones.

If your reward model (Judge) has the same blind spots as your policy model, self-correction isn’t just useless—it’s an adversarial attack on your own training data.
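One way to check whether a Judge suffers from the “penalty for perfection” problem, assuming you have tasks with verifiable ground truth, is to measure how often it flags answers that are known to be correct. The sketch below is illustrative only: `judge_flags_error` is a hypothetical callable wrapping whatever Judge you use, and exact-match correctness is an assumption.

```python
from typing import Callable, List, Tuple


def false_flag_rate(
    samples: List[Tuple[str, str, str]],
    judge_flags_error: Callable[[str, str], bool],  # hypothetical: (task, answer) -> True if judge reports a flaw
) -> float:
    """Share of verified-correct answers that the judge still marks as flawed.

    samples holds (task, answer, ground_truth) triples. Answers that do not
    match the ground truth are ignored, since flagging them is legitimate.
    """
    correct = [(task, answer) for task, answer, truth in samples if answer == truth]
    if not correct:
        return 0.0  # nothing to measure; no correct answers in the sample
    flagged = sum(judge_flags_error(task, answer) for task, answer in correct)
    return flagged / len(correct)
```

A rate well above zero on easy tasks would be a sign that the Judge is inventing flaws, which is the same behavior the critic showed in the easy-task bucket.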

When to use critique in agentic solutions

The data is clear: Self-critique is not a free lunch. It’s a high-stakes bet that only pays off when you’re already losing.

The core strategy: triage your tasks

Don’t apply a flat “3 loops” policy to everything. Categorize incoming requests by difficulty first; otherwise you risk the corrosive effect on the tasks your model already gets right.

1. The “Red Zone” (easy tasks) -> ZERO loops

Identify them by: Simple classification, high initial confidence (>90%), or tasks where LLMs historically excel (e.g., sentiment analysis, basic extraction).
Action: Trust the first draft. Critique here is actively harmful (accuracy fell by roughly 16 to 41 points in our runs).
Why: The model is right, but the critic will hallucinate flaws to justify its existence.

2. The “Green Zone” (hard tasks) -> 3-5 loops

Identify them by: Complex reasoning, multi-step logic, or low initial confidence (<50%).
Action: Force critique loops.
Why: The model is likely wrong initially. The critic acts as a debugger, catching calculation errors or logic gaps that the generator missed.
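The two zones above can be turned into a small routing function. This is a sketch under stated assumptions, not a tested policy: the 90% and 50% confidence cutoffs come from the text, but how you obtain a usable confidence score is left open, the `is_complex` flag is a hypothetical input you would have to supply, letting complexity take precedence over high confidence is a design choice, and the text says nothing about the band between 50% and 90%, so `middle_loops` is an explicit parameter.

```python
RED_ZONE_MIN_CONFIDENCE = 0.90    # from the text: high initial confidence (> 90%)
GREEN_ZONE_MAX_CONFIDENCE = 0.50  # from the text: low initial confidence (< 50%)


def critique_loops(
    confidence: float,
    is_complex: bool = False,  # hypothetical flag: multi-step or complex reasoning task
    green_loops: int = 3,      # the text suggests 3-5 loops for hard tasks
    middle_loops: int = 0,     # assumption: the text does not cover the 50-90% band
) -> int:
    """Return how many critique-improve iterations to run for a task."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # Assumption: complexity wins over high confidence when the two disagree.
    if is_complex or confidence < GREEN_ZONE_MAX_CONFIDENCE:
        return green_loops
    if confidence > RED_ZONE_MIN_CONFIDENCE:
        return 0  # Red Zone: trust the first draft
    return middle_loops
```

For example, `critique_loops(0.95)` returns 0 and `critique_loops(0.30)` returns 3. Defaulting the middle band to zero loops follows the spirit of engaging the critic only when the model is struggling, but that default should be checked against your own data.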

The golden rule for agents

Critique is for debugging, not polishing.
If your agent is confident and the task is standard, shut the critic up. Only engage the loop when the model is struggling or the task complexity demands a “second pair of eyes” to catch structural errors.

References

Self-Refine: Madaan, A., et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651.
Reflexion: Shinn, N., et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366.