second responses vs hours
safety & governance score
decision usefulness with GPT-5.4-mini upgrade
The challenge
A global media SaaS company analyzes hundreds of millions of sources daily, ranging from public news, social, and broadcast feeds to proprietary analyst-curated databases. Their competitive advantage is the layer on top of publicly available data: in-house human editorial teams, proprietary scoring and analytics frameworks, and years of analyst judgment refined into decision-grade intelligence. When a crisis signal is building or a competitor’s narrative is gaining traction, speed and accuracy matter enormously. Historically, getting an answer meant waiting hours for a human analyst to manually aggregate information across multiple sources.
The company’s AI team set out to make that synthesis conversational and instant. The hard part was encoding the institutional expertise that makes their output decision-grade and informs decisions that can run into tens or hundreds of millions of dollars.
The solution
Snorkel designed and built a multi-agent conversational intelligence system that orchestrates specialized agents across the company’s data sources, returning grounded, decision-ready answers in seconds. The system includes an evaluation harness customized with the client team’s own institutional knowledge: what makes answers useful for decision-makers, what counts as properly grounded, and which safety and governance boundaries matter for individual use cases.
When GPT-5.4-mini was released, Snorkel used the harness to assess the impact of upgrading from GPT-4.1-mini. The results showed a 5-point lift in decision usefulness, a 100% pass rate on safety-critical refusal checks, and an improvement from 82.6% to 98.6% on broader governance checks for avoiding internal jargon and keeping unrelated details out of responses. This provided a clear, data-backed case to upgrade to GPT-5.4-mini.
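The harness itself is not published in this story. As a rough illustration of how a model-to-model comparison like this might be scored, the sketch below computes per-check pass rates for two models and reports the difference in percentage points. Everything in it is an assumption made for illustration: the `CheckResult` record, the `pass_rates` and `compare` helpers, the model labels, and the toy data are all hypothetical and do not reflect Snorkel's or the client's actual implementation or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """One graded check on one response (hypothetical record shape)."""
    model: str    # label of the model that produced the response
    check: str    # check family, e.g. "governance" or "safety_refusal"
    passed: bool  # whether the response passed this check


def pass_rates(results):
    """Return {(model, check): pass rate in percent}."""
    totals, passes = {}, {}
    for r in results:
        key = (r.model, r.check)
        totals[key] = totals.get(key, 0) + 1
        passes[key] = passes.get(key, 0) + int(r.passed)
    return {key: 100.0 * passes[key] / totals[key] for key in totals}


def compare(results, baseline, candidate):
    """Return {check: candidate rate minus baseline rate, in percentage points}.

    Only checks that were run against both models are included.
    """
    rates = pass_rates(results)
    checks = {check for (_, check) in rates}
    return {
        check: rates[(candidate, check)] - rates[(baseline, check)]
        for check in sorted(checks)
        if (baseline, check) in rates and (candidate, check) in rates
    }


# Toy, made-up data: two graded responses per model on one check family.
results = [
    CheckResult("baseline-model", "governance", True),
    CheckResult("baseline-model", "governance", False),
    CheckResult("candidate-model", "governance", True),
    CheckResult("candidate-model", "governance", True),
]

deltas = compare(results, "baseline-model", "candidate-model")
print(deltas)
```

Keeping the graded checks separate from the models being compared is what would let the same expert-defined criteria be rerun unchanged whenever a new model becomes available.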
The outcome
The agent replaces a process that used to take hours, delivering answers in an average of 15 seconds with safety scores that meet client requirements. As models continue to evolve, the eval-first foundation lets the client test, compare, and swap models without rebuilding the agent or losing the expert judgment that makes it trustworthy.