Search result for:
The Snorkel research team collaborated with the rLLM team at UC Berkeley on the Agentica project, using their open-source rLLM framework to fine-tune Qwen3-4B-Instruct-2507, delivering a model that beats Qwen3-235B-A22B on Snorkel AI’s expert-curated financial benchmarks – at 1/60th the size. A full breakdown of the results are published in the rLLM blog here. The key insight? Just focus on…
Error analysis of 8 models on Agentic Coding tasks Successful completion of complex tasks doesn’t come from models being always right. It comes from models being resilient when things go wrong. To get a deeper understanding of model behavior in agentic environments, our team analyzed all of the errors found in the full traces of tasks from our Agentic Coding…
Today, AI is marked by a growing asymmetry: the excitement around agentic AI is real—backed by quantitative progress on model cards and genuine leaps forward, especially in coding. But ask individuals or enterprises where they feel ready to deploy agentic automation in high-stakes, domain-specific settings outside of coding… and you will find hesitation. The reason: our ability to measure AI…