How A Leading International Telecom Provider Scaled Agentic AI with High-quality Synthetic Data
An Asian telecom leader aimed to expand its offerings with a flagship AI personal assistant. However, the team faced critical roadblocks:
- Poor personalization with unreliable outputs
- Fragmented, manual development workflows
- Lack of scalable, metrics-driven evaluation systems
These gaps made it challenging to iterate quickly, inflating development costs and stalling deployment.
To overcome these issues, the company partnered with Snorkel AI to radically improve how it created and evaluated data for agentic systems. In under two months, we built modular, scalable data curation pipelines. These pipelines enabled high-volume, high-quality training data for key planning and reasoning use cases, delivering an 8% model performance boost and laying the groundwork for production-ready AI systems that are faster, cheaper, and more effective.
Turning AI ambition into reality
Our client, an Asian telco giant, serves 30M+ subscribers and operates across broadband, digital content, and enterprise services. Recently, it began expanding into AI infrastructure and applications, including a “do-it-all” personal assistant app.
Despite major investments, the telco giant’s early agentic AI prototypes struggled with:
- Context retention: Models couldn’t maintain context across multi-turn conversations
- Generic responses: Plans were vague and impersonal
- Tool use: Agents either didn’t call tools or used them incorrectly
- Vague evaluation: Feedback relied on manual “vibe checks” with no hard metrics to measure progress
- Slow iteration: Manual reviews and a rigid system design slowed improvement
These challenges stemmed from a lack of scalable data development and evaluation infrastructure. Without high-quality training or benchmark datasets, progress was slow, models underperformed, and iteration cycles stalled.
Goal
The goal was to build a best-in-class AI personal assistant powered by open-source models. The company had explored building on proprietary APIs but wanted a reliable internal model it could control. The project initially focused on use cases such as meal and trip planning, which required agentic reasoning, tool calling, and constraint handling.
To create models that could reliably complete these tasks, the company needed:
- Scalable data pipelines for generating and curating training and evaluation data
- Custom evaluation rubrics for advanced behaviors (e.g., multi-turn planning, tool chaining)
- Task-specific models fine-tuned to reliably perform on the app’s core applications
Solution
Our client worked with us to create a reusable, modular data pipeline that spanned the full AI development lifecycle—from data creation to evaluation.
Data generation for agentic use cases
Snorkel’s team helped the telco build infrastructure to programmatically generate high-quality, multi-turn conversations (a minimal sketch follows the list below) that included:
- Persona and scenario creation: to simulate diverse user profiles and intents
- Tool use modeling: including validation and formatting of tool calls
- Constraint-driven planning: to enforce adherence to user goals and preferences
- Scenario diversity: to balance representations across user types and intents
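The client’s actual pipeline is not public, but a minimal sketch of persona- and scenario-driven generation could look like the following. Every name here (the Persona and Scenario classes, the tool names, and the llm callable standing in for a model client) is an illustrative assumption, not the Snorkel implementation.

```python
# Minimal sketch of persona/scenario-driven conversation generation.
# Persona, Scenario, generate_conversation, and the tool names are
# illustrative assumptions; `llm` is any callable mapping a prompt
# string to generated text.
import itertools
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Persona:
    name: str
    preferences: List[str]        # e.g. dietary or budget constraints


@dataclass
class Scenario:
    intent: str                   # e.g. "plan a weekend trip"
    required_tools: List[str]     # tools the agent is expected to call


def generate_conversation(persona: Persona,
                          scenario: Scenario,
                          llm: Callable[[str], str],
                          turns: int = 4) -> List[Dict]:
    """Simulate one multi-turn conversation for a persona/scenario pair."""
    history: List[Dict] = []
    for turn in range(turns):
        prompt = (
            f"Persona: {persona.name}; preferences: {persona.preferences}\n"
            f"Intent: {scenario.intent}\n"
            f"Available tools: {scenario.required_tools}\n"
            f"Conversation so far: {json.dumps(history)}\n"
            "Write the next user message and the assistant reply, "
            "including any tool calls as JSON."
        )
        history.append({"turn": turn, "exchange": llm(prompt)})
    return history


# Sweep the persona x scenario grid so coverage stays balanced by
# construction rather than ad hoc sampling.
personas = [Persona("budget traveler", ["low cost"]),
            Persona("vegan foodie", ["vegan", "local cuisine"])]
scenarios = [Scenario("plan a weekend trip", ["search_flights", "book_hotel"]),
             Scenario("plan a dinner menu", ["search_recipes"])]

dataset = [generate_conversation(p, s, llm=lambda prompt: "<model output>")
           for p, s in itertools.product(personas, scenarios)]
```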
To vet this generated data, Snorkel’s experts built a suite of more than 20 task-specific data quality evaluators (sketched in code after the list below) that could automatically assess:
- Tool call correctness and format
- Constraint adherence
- Plan quality and coherence
- Action sequencing and reasoning behavior
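As a rough illustration of what such evaluators can check automatically, the sketch below scores a single datapoint for tool-call format and constraint adherence. The tool schemas and the JSON shape of a tool call are assumptions made for the example, not the evaluators built for this engagement.

```python
# Illustrative rule-based evaluators, assuming tool calls are emitted as
# JSON objects of the form {"tool": ..., "args": {...}}; the schemas and
# example values are hypothetical stand-ins.
import json
from typing import List

TOOL_SCHEMAS = {
    "search_flights": {"origin", "destination", "date"},
    "book_hotel": {"city", "check_in", "nights"},
}


def tool_call_is_valid(raw_call: str) -> bool:
    """Tool call correctness/format: the call must parse as JSON, name a
    known tool, and supply exactly the expected arguments."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return False
    expected = TOOL_SCHEMAS.get(call.get("tool"))
    return expected is not None and set(call.get("args", {})) == expected


def constraints_respected(plan_text: str, banned_terms: List[str]) -> bool:
    """Constraint adherence: reject plans mentioning anything the user
    explicitly excluded (e.g. non-vegan dishes for a vegan persona)."""
    lowered = plan_text.lower()
    return not any(term.lower() in lowered for term in banned_terms)


# Score a generated datapoint before admitting it to the training set.
raw = '{"tool": "book_hotel", "args": {"city": "Seoul", "check_in": "2025-05-01", "nights": 2}}'
print(tool_call_is_valid(raw))                                          # True
print(constraints_respected("Day 1: grilled steak dinner", ["steak"]))  # False
```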
This new infrastructure supplemented the team’s existing manual review process—reducing reliance on “vibe checks” and academic benchmarks, and enabling faster iteration, continuous evaluation, and integration of real-world feedback loops.
Rapid, scalable impact
In just two months, we built:
- A custom evaluation framework for advanced agentic tasks
- Scalable, reusable pipelines for generating 60K+ training and evaluation datapoints
- Fine-tuned OSS models with 8% higher performance than Llama base models
Results: Better models, faster development, lower costs
Working with Snorkel, the telco accelerated iteration cycles and unblocked development by automating training and evaluation pipelines. Through fine-tuning on curated synthetic data, the project increased the performance of its chosen open-source LLM by 8% above baseline on internal evaluation metrics, and by more on some tasks.
This move away from proprietary models reduced API costs and gave the team greater control over deployment. With reusable pipelines now in place, the team can rapidly spin up new datasets and expand its agentic assistant capabilities to more use cases.
Ready to get started?
Take the next step and see how you can accelerate AI development by 100x.