Abstract

Verifiers can enhance language model (LM) performance by scoring and ranking a set of generated responses, but high-quality verifiers today are either unscalable (like human judges) or limited in scope (such as formal proof tools like Lean). While LM-based judges and reward models serve as general-purpose verifiers, they still fall short of the performance of oracle verifiers, which are perfectly accurate. To bridge this gap, the Weaver framework is introduced as a method for constructing a strong verifier by combining multiple weaker, imperfect ones. Weaver shows that weighted ensembles of verifiers, which traditionally depend on labeled data, substantially outperform unweighted combinations because individual verifiers differ in accuracy. To minimize reliance on labeled data, Weaver uses weak supervision to estimate each verifier's accuracy and merges their outputs into a single score that better reflects the true quality of a response. Applying weak supervision directly, however, brings challenges such as inconsistent output formats and the presence of poor-quality verifiers, which Weaver overcomes by normalizing outputs using dataset statistics and filtering out unreliable verifiers. The framework is evaluated in test-time repeated sampling scenarios, where multiple responses are generated and one is selected, and it substantially outperforms Pass@1, i.e., selecting a single generated response, on various reasoning and math tasks. Pairing Llama 3.3 70B Instruct, a less costly non-reasoning generator, with an ensemble of verifiers of 70B parameters or smaller, Weaver achieves o3-mini-level accuracy, reaching an average of 87.7%. This gain mirrors the leap from GPT-4o to o3-mini (69.0% to 86.7%), an improvement that typically requires expensive fine-tuning and post-training. To further reduce the computational burden of verifier ensembles, Weaver distills its combined output into a compact 400M-parameter cross-encoder that retains 98.7% of the full ensemble's accuracy while cutting verification compute by up to 99.97%.
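To make the weighted-ensemble idea concrete, the sketch below shows one common way accuracy-weighted aggregation of binary verifier verdicts can work: each verifier's verdict is weighted by the log-odds of its estimated accuracy (a naive-Bayes-style assumption), and the highest-scoring sampled response is returned. This is not the paper's implementation; the function name, the specific weighting rule, and the fixed accuracy values are illustrative assumptions, whereas Weaver estimates verifier accuracies with weak supervision from unlabeled outputs.

```python
# Illustrative sketch only (not Weaver's code): select one response from a
# repeated-sampling pool by combining binary verifier verdicts with weights
# derived from each verifier's estimated accuracy.
import numpy as np

def weighted_verifier_score(verdicts: np.ndarray, accuracies: np.ndarray) -> np.ndarray:
    """verdicts: (n_responses, n_verifiers) matrix of accept (1) / reject (0) decisions.
    accuracies: (n_verifiers,) estimated accuracy of each verifier (assumed, not learned here).
    Returns one combined score per response; higher means more likely correct."""
    weights = np.log(accuracies / (1.0 - accuracies))  # log-odds weight per verifier
    signed = 2.0 * verdicts - 1.0                      # map {0, 1} verdicts to {-1, +1}
    return signed @ weights                            # accuracy-weighted vote per response

# Toy example: 4 sampled responses judged by 3 verifiers of unequal quality.
verdicts = np.array([
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
    [1, 1, 1],
])
estimated_acc = np.array([0.85, 0.70, 0.60])  # placeholder accuracy estimates
scores = weighted_verifier_score(verdicts, estimated_acc)
best = int(np.argmax(scores))                 # index of the response to return
print(scores, best)
```

Under this kind of scheme, an unweighted majority vote is the special case where all verifiers are assigned the same accuracy, which is why accuracy differences between verifiers make the weighted combination strictly more informative.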