How Tool Discipline Let a 4B Model Outsmart a 235B Giant on Financial Tasks

Chris Glaze

Published: February 18, 2026

Updated: February 18, 2026

The Snorkel research team collaborated with the rLLM team at UC Berkeley on the Agentica project, using their open-source rLLM framework to fine-tune Qwen3-4B-Instruct-2507, delivering a model that beats Qwen3-235B-A22B on Snorkel AI’s expert-curated financial benchmarks – at 1/60th the size. A full breakdown of the results are published in the rLLM blog here.

The key insight? Just focus on tool use.

Why tool discipline beats scale

Large generalist models have excellent reasoning but poor tool discipline. They hallucinate column names, ignore constraints, and generate SQL that returns nonsensical results. The problem isn’t intelligence—it’s reliability.

Rather than training on expensive multi-table examples, the team focused on teaching reliable tool use with simple, single-table queries – and those skills generalized. In internal ablations, single-table-only training achieved the best results (66.3% internal Pass@1), outperforming both single + multi-table (61.6%) and a single→multi curriculum (64.8%).

The fundamentals generalize: explore tables before querying, validate data before proceeding, retry on failure rather than giving up.

Trained on simple tasks, verified in complex environments

The Snorkel team’s contributions were (1) the agentic environment for eval and RL, and (2) our Snorkel Finance benchmark and Finance Reasoning benchmark, containing expert-curated financial analysis tasks for evaluating the agent’s performance, so we could be confident that the lift we saw was relevant to realistic, complex tasks. The rLLM team developed single-table queries that focused on using the relevant tools correctly, then completed the RL fine-tuning of the model under test in the environment.

Enterprise implications

The economics shift substantially. For a firm processing 50,000 analyst queries monthly, this approach could reduce costs by 90% while improving accuracy and keeping data on-premises. A 4B model runs on a single GPU; its 235B counterpart requires a multi-node cluster.

The methodology isn’t finance-specific either. Healthcare, legal, insurance – anywhere structured data and tool use intersect – the same pipeline applies: convert documents into queryable structures, teach tool-calling fundamentals on simple queries, verify aggressively, and fine-tune.

Build your own domain specialists

Qwen3-4B-Instruct-2507 was fine-tuned using the rLLM framework on a cluster of 8x H100 GPUs. By using small, specialized judges (GPT-5-nano) for simpler verifications and reserving larger models only for complex multi-table queries, the team kept the total training cost under $500 per run.

This fundamentally changes the accessibility of domain adaptation. You do not need a massive pre-training cluster to build a state-of-the-art specialist; instead, you need the right domain expertise, a well-engineered RL environment, and a smart verification pipeline.

The team is open-sourcing everything. Check out the rLLM blog here for links to their repository and the full details.

Snorkel is thrilled to have collaborated with Berkeley’s Sky Computing Lab and the rLLM team on this research project. For more details, check out their homepage here. For more information on how Snorkel can help you with RL environments and expert-curated datasets, come talk to us!

Chris Glaze

Chris Glaze is Applied Research Scientist at Snorkel AI. He is an experienced PhD with a demonstrated history of developing novel machine learning tools and mathematical models in academia and industry. Accomplishments span data mining, experimental research, and application to digital technologies.

How Tool Discipline Let a 4B Model Outsmart a 235B Giant on Financial Tasks

Why tool discipline beats scale

Trained on simple tasks, verified in complex environments

Enterprise implications

Build your own domain specialists

Recommended
articles

Coding agents don’t need to be perfect, they need to recover

Closing the Evaluation Gap in Agentic AI

SlopCodeBench: Measuring Code Erosion as Agents Iterate

Join our newsletter for expert advice, the latest research, and exclusive events.

How Tool Discipline Let a 4B Model Outsmart a 235B Giant on Financial Tasks

Why tool discipline beats scale

Trained on simple tasks, verified in complex environments

Enterprise implications

Build your own domain specialists

Recommended articles

Coding agents don’t need to be perfect, they need to recover

Closing the Evaluation Gap in Agentic AI

SlopCodeBench: Measuring Code Erosion as Agents Iterate

Join our newsletter for expert advice, the latest research, and exclusive events.

Recommended
articles