How a global telecom scaled agentic AI with synthetic data

Impact

20+

Task-specific data quality evaluators

+35 point

Function calling accuracy

2 months

To design and build custom data curation and evaluation frameworks

The challenge

An Asian telecom leader aimed to expand its offerings with a flagship AI personal assistant. However, the team faced critical roadblocks:

Poor personalization with unreliable outputs
Fragmented, manual development workflows
Lack of scalable, metrics-driven evaluation systems

These gaps made it hard to iterate quickly, inflating development costs and stalling deployment.

To overcome these issues, the company partnered with Snorkel AI to overhaul how it created and evaluated data for agentic systems. In under two months, we built modular, scalable data curation pipelines. These pipelines produced high-volume, high-quality training data for key planning and reasoning use cases, delivering a model performance boost and laying the groundwork for production-ready AI systems that are faster, cheaper, and more effective.

Turning AI ambition into reality

Our client, an Asian telco giant, serves 30M+ subscribers and operates across broadband, digital content, and enterprise services. Recently, it began expanding into AI infrastructure and applications, including a “do-it-all” personal assistant app.

Despite major investments, the telco giant’s early agentic AI prototypes struggled with:

Context retention: Models couldn’t maintain context across multi-turn conversations
Generic responses: Plans were vague and impersonal
Tool use: Agents either didn’t call tools or used them incorrectly
Vague evaluation: Feedback relied on manual “vibe checks” with no hard metrics to measure progress
Slow iteration: Manual reviews and a rigid system design slowed improvement

These challenges stemmed from a lack of scalable data development and evaluation infrastructure. Without high-quality training or benchmark datasets, progress was slow, models underperformed, and iteration cycles stalled.

The goal

The goal was to build a best-in-class AI personal assistant powered by open-source models. The company explored building upon proprietary APIs, but wanted a reliable internal model that they could control. The project initially focused on use cases such as meal and trip planning, which required agentic reasoning, tool calling, and constraint handling.

To create models that could reliably complete these tasks, the company needed:

Scalable data pipelines for generating and curating training and evaluation data
Custom evaluation rubrics for advanced behaviors (e.g., multi-turn planning, tool chaining)
Task-specific models fine-tuned to reliably perform on the app’s core applications

The solution

Our client worked with us to create a reusable, modular data pipeline that spanned the full AI development lifecycle—from data creation to evaluation.

Data generation for agentic use cases

Snorkel’s team helped the telco build infrastructure to programmatically generate high-quality, multi-turn conversations that included:

Persona and scenario creation: to simulate diverse user profiles and intents
Tool use modeling: including validation and formatting of tool calls
Constraint-driven planning: to enforce adherence to user goals and preferences
Scenario diversity: to balance representations across user types and intents
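
As a rough illustration of the persona, scenario, and diversity steps above, the sketch below enumerates every persona and intent pair and samples constraints for each, so that no user type or intent dominates the dataset. It is a minimal sketch under stated assumptions, not the pipeline built for this project: the catalogs PERSONAS, INTENTS, and CONSTRAINTS, the Scenario type, and the generate_scenarios function are all hypothetical names invented for this example.

```python
import itertools
import random
from dataclasses import dataclass

# Hypothetical catalogs for illustration only; the real ones are not described here.
PERSONAS = ["busy parent", "student on a budget", "frequent business traveler"]
INTENTS = ["meal planning", "trip planning"]
CONSTRAINTS = {
    "meal planning": ["vegetarian", "under 30 minutes", "nut-free"],
    "trip planning": ["fixed budget", "no overnight flights", "pet-friendly lodging"],
}


@dataclass(frozen=True)
class Scenario:
    persona: str
    intent: str
    constraints: tuple


def generate_scenarios(per_cell: int, seed: int = 0) -> list:
    """Create `per_cell` scenarios for every (persona, intent) pair.

    Enumerating the full grid keeps representation balanced across
    user types and intents; only the constraints are sampled.
    """
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    scenarios = []
    for persona, intent in itertools.product(PERSONAS, INTENTS):
        for _ in range(per_cell):
            picked = tuple(sorted(rng.sample(CONSTRAINTS[intent], k=2)))
            scenarios.append(Scenario(persona, intent, picked))
    return scenarios


if __name__ == "__main__":
    for scenario in generate_scenarios(per_cell=1):
        print(scenario)
```

Each scenario could then seed a multi-turn conversation prompt. Sampling only the constraints, while enumerating personas and intents exhaustively, is one simple way to keep coverage even.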

To vet this data, Snorkel’s experts built a suite of more than 20 task-specific data quality evaluators that automatically assess:

Tool call correctness and format
Constraint adherence
Plan quality and coherence
Action sequencing and reasoning behavior
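
To make the first two evaluator types more concrete, here is a minimal sketch of what automated checks for tool call format and constraint adherence might look like. Everything in it is an assumption made for illustration: the TOOL_SCHEMAS registry, the JSON shape of a tool call (a "name" plus "arguments"), and the function names are hypothetical, and the keyword match is a crude stand-in for whatever logic the actual evaluators use.

```python
import json

# Hypothetical tool registry: tool name -> required argument names and types.
TOOL_SCHEMAS = {
    "search_restaurants": {"cuisine": str, "max_price": int},
    "book_flight": {"origin": str, "destination": str, "date": str},
}


def evaluate_tool_call(raw_call: str) -> dict:
    """Score one serialized tool call on format and correctness."""
    result = {"is_json_object": False, "known_tool": False, "args_ok": False}
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return result
    if not isinstance(call, dict):
        return result
    result["is_json_object"] = True

    name = call.get("name")
    schema = TOOL_SCHEMAS.get(name) if isinstance(name, str) else None
    if schema is None:
        return result
    result["known_tool"] = True

    # Arguments must match the schema exactly: same keys, expected types.
    args = call.get("arguments", {})
    result["args_ok"] = (
        isinstance(args, dict)
        and set(args) == set(schema)
        and all(isinstance(args[key], typ) for key, typ in schema.items())
    )
    return result


def constraint_adherence(plan_text: str, required_terms: list) -> float:
    """Fraction of required terms mentioned in the plan (a keyword proxy)."""
    if not required_terms:
        return 1.0
    text = plan_text.lower()
    hits = sum(term.lower() in text for term in required_terms)
    return hits / len(required_terms)


if __name__ == "__main__":
    call = '{"name": "search_restaurants", "arguments": {"cuisine": "thai", "max_price": 30}}'
    print(evaluate_tool_call(call))
    print(constraint_adherence("A vegetarian dinner plan", ["vegetarian", "nut-free"]))
```

Returning a separate flag per check, instead of a single pass or fail, makes it easier to see which failure mode dominates when scores are aggregated over a dataset.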

This new infrastructure supplemented the team’s existing manual review process—reducing reliance on “vibe checks” and academic benchmarks, and enabling faster iteration, continuous evaluation, and integration of real-world feedback loops.

Rapid, scalable impact

In just two months, we built:

A custom evaluation framework for advanced agentic tasks
Scalable, reusable pipelines for generating 60K+ training and evaluation datapoints
Fine-tuned OSS models with 8% higher performance over Llama base models

The results

Better models, faster development, lower costs

Working with Snorkel, the telco accelerated iteration cycles and unblocked development by automating training and evaluation pipelines. Fine-tuning on curated synthetic data raised the performance of its chosen open-source LLM by 8% over baseline on internal evaluation metrics, and by more than that on some tasks.

This move away from proprietary models reduced API costs and gave the team greater control over deployment. With reusable pipelines now in place, the team can rapidly spin up new datasets and expand its agentic assistant capabilities to more use cases.
