
Unlock proprietary data with Snorkel Flow and Amazon SageMaker

December 2, 2024

Large language models (LLMs) fine-tuned on proprietary data have become a competitive differentiator for enterprises. While off-the-shelf open-source LLMs like Meta’s Llama herd offer impressive capabilities, their real value emerges when enterprises customize them with their own data. We have made this process much easier through Snorkel Flow’s integration with Amazon SageMaker and other tools and services from Amazon Web Services (AWS).

The integration between the Snorkel Flow AI data development platform and AWS’s robust AI infrastructure empowers enterprises to streamline LLM evaluation and fine-tuning, transforming raw data into actionable insights and competitive advantages.

Here’s what that looks like in practice.

[Figure: Snorkel Flow and SageMaker integration diagram]

Snorkel Flow: the AI data development platform

Snorkel Flow accelerates AI development by focusing on data development. The platform enables organizations to curate, label, and refine datasets programmatically. This reduces the reliance on manual data labeling and significantly speeds up the model training process.

At its core, Snorkel Flow empowers data scientists and domain experts to encode their knowledge into labeling functions, which are then used to generate high-quality training datasets. This approach not only enhances the efficiency of data preparation but also improves the accuracy and relevance of AI models.
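To make the idea of labeling functions concrete, here is a minimal plain-Python sketch. It is not Snorkel Flow’s API: the function names, the ticket categories, and the example texts are all hypothetical, and a simple majority vote stands in for the more sophisticated label aggregation a real platform would use.

```python
from collections import Counter
from typing import Callable, List, Optional

ABSTAIN = None  # returned when a labeling function has no opinion

LabelingFunction = Callable[[str], Optional[str]]


def lf_refund(text: str) -> Optional[str]:
    """Hypothetical expert rule: refund requests are billing issues."""
    return "billing" if "refund" in text.lower() else ABSTAIN


def lf_invoice(text: str) -> Optional[str]:
    """Hypothetical expert rule: invoice questions are billing issues."""
    return "billing" if "invoice" in text.lower() else ABSTAIN


def lf_password(text: str) -> Optional[str]:
    """Hypothetical expert rule: password problems are account issues."""
    return "account" if "password" in text.lower() else ABSTAIN


def majority_vote(text: str, lfs: List[LabelingFunction]) -> Optional[str]:
    """Combine labeling-function votes; abstain on no votes or on a tie."""
    votes = [label for label in (lf(text) for lf in lfs) if label is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting rules with equal support
    return ranked[0][0]


LFS = [lf_refund, lf_invoice, lf_password]
documents = [
    "I was charged twice, please refund the second invoice.",
    "I forgot my password and cannot log in.",
    "Where is your office located?",
]
labels = [majority_vote(doc, LFS) for doc in documents]
print(labels)  # ['billing', 'account', None]
```

Each rule is cheap to write and review, and documents that no rule covers stay unlabeled rather than receiving a guess.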

Snorkel Flow + Amazon SageMaker

Snorkel Flow’s integration with Amazon SageMaker provides a seamless AI development workflow.

The SageMaker JumpStart machine learning hub offers a suite of tools for building, training, and deploying machine learning models at scale. When combined with Snorkel Flow, it becomes a powerful enabler for enterprises seeking to harness the full potential of their proprietary data.

What the Snorkel Flow + AWS integrations offer

  • Streamlined data ingestion and management: With Snorkel Flow, organizations can easily access and manage unstructured data stored in Amazon S3. This integration allows for quick data ingestion and setup, enabling teams to focus on refining and labeling data rather than managing infrastructure.
  • Efficient model evaluation and fine-tuning: After importing baseline data from Amazon S3 and accessing LLMs from SageMaker JumpStart or Amazon Bedrock, users can use Snorkel Flow’s LLM evaluation tools to build a customized, comprehensive report on their LLM’s current performance.
  • Enhanced data quality and model accuracy: Snorkel Flow’s data-centric approach allows for the identification and correction of data quality issues at scale. By encoding domain knowledge into labeling functions, organizations can improve the quality of training datasets, leading to more accurate and reliable models.
  • Scalable deployment and inference: After Snorkel Flow users iteratively fine-tune their model via SageMaker JumpStart, they can deploy it directly to SageMaker or Bedrock endpoints for scalable and efficient inference. This integration ensures that models are production-ready and capable of delivering real-time insights to drive business decisions.

Snorkel Flow + Amazon SageMaker: step by step

To illustrate how enterprises can use the Snorkel Flow and Amazon SageMaker JumpStart integrations, let’s walk through a high-level workflow for evaluating and fine-tuning large language models:

[Figure: Snorkel Flow and SageMaker step-by-step walkthrough]

Step 1: Baseline the system

Begin by uploading raw or generated AI pipeline data to Snorkel Flow via native S3 integration. Then, develop your evaluators and data slices to build your first LLM evaluation report. This establishes a baseline for the current system’s performance, providing a starting point for further refinement and evaluation.
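As a rough illustration of what a baseline report built from evaluators and data slices might contain, the sketch below computes a pass rate per evaluator for each slice. It is plain Python rather than Snorkel Flow’s SDK, and the `Record` type, the evaluator and slice names, and the sample records are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Record:
    prompt: str
    response: str


def non_empty(record: Record) -> bool:
    """Evaluator: the model produced some text."""
    return bool(record.response.strip())


def concise(record: Record) -> bool:
    """Evaluator: the response stays within 40 words."""
    return len(record.response.split()) <= 40


EVALUATORS: Dict[str, Callable[[Record], bool]] = {
    "non_empty": non_empty,
    "concise": concise,
}

# Slices are named subsets of the data worth tracking separately.
SLICES: Dict[str, Callable[[Record], bool]] = {
    "all": lambda r: True,
    "billing": lambda r: "invoice" in r.prompt.lower(),
}


def build_report(
    records: List[Record],
    evaluators: Dict[str, Callable[[Record], bool]],
    slices: Dict[str, Callable[[Record], bool]],
) -> Dict[str, Dict[str, Optional[float]]]:
    """Return the pass rate of every evaluator within every slice."""
    report: Dict[str, Dict[str, Optional[float]]] = {}
    for slice_name, in_slice in slices.items():
        members = [r for r in records if in_slice(r)]
        row: Dict[str, Optional[float]] = {"count": len(members)}
        for eval_name, passes in evaluators.items():
            row[eval_name] = (
                sum(passes(r) for r in members) / len(members) if members else None
            )
        report[slice_name] = row
    return report


records = [
    Record("Why is my invoice higher this month?", "Your plan renewed at the annual rate."),
    Record("Send me a copy of my invoice.", ""),
    Record("How do I reset my password?", "Open Settings, choose Security, then select Reset password."),
]
baseline = build_report(records, EVALUATORS, SLICES)
for slice_name, row in baseline.items():
    print(slice_name, row)
```

Breaking results out by slice is what lets the next step target the weakest areas instead of the overall average.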

Step 2: Curate a high-quality dataset

Using the baseline report from Step 1, precisely identify where your model needs the most help. Use Snorkel Flow and your experts’ knowledge and intuition to develop labeling functions to address these issues. This curated dataset forms the foundation for subsequent model training and evaluation.

Step 3: Configure and connect an OSS base model

Integrate an open-source LLM, such as one from Meta’s Llama herd, with SageMaker using the SageMaker Python SDK. This setup provides the infrastructure necessary for model training and fine-tuning, leveraging AWS’s robust machine learning capabilities.
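One way this configuration could look with the SageMaker Python SDK is sketched below. The model ID, the hyperparameter names and values, and the `FineTuneConfig` helper are assumptions for illustration, not settings taken from this article; check the JumpStart model card for the model you choose. The SDK import sits inside the function because constructing the estimator requires AWS credentials and an execution role.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FineTuneConfig:
    """Hypothetical container for the settings this sketch needs."""

    # Assumed JumpStart model ID; confirm it in the JumpStart catalog.
    model_id: str = "meta-textgeneration-llama-2-7b"
    # Assumed hyperparameter names; they vary by model family.
    hyperparameters: Dict[str, str] = field(
        default_factory=lambda: {"instruction_tuned": "True", "epoch": "1"}
    )


def build_estimator(config: FineTuneConfig):
    """Create a JumpStart estimator for the configured base model.

    Calling this needs the `sagemaker` package, AWS credentials, and a
    SageMaker execution role, so it is not invoked in this sketch.
    """
    from sagemaker.jumpstart.estimator import JumpStartEstimator

    estimator = JumpStartEstimator(
        model_id=config.model_id,
        # Llama models require accepting the license agreement.
        environment={"accept_eula": "true"},
    )
    estimator.set_hyperparameters(**config.hyperparameters)
    return estimator


config = FineTuneConfig()
print(config.model_id, config.hyperparameters)
```

Keeping the settings in a small config object makes it easier to record which base model and hyperparameters produced each evaluation report.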

Step 4: Fine-tune the model

Send the curated dataset from Snorkel Flow to SageMaker JumpStart for in-place LLM fine-tuning. This process refines the model, aligning it with the organization’s specific data and requirements.
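Before a training job can consume the curated dataset, it typically has to be serialized into files the training container understands. The sketch below writes JSON Lines records plus a prompt template to a local directory. The file names, the `instruction` and `response` fields, and the template layout are assumptions modeled on a common instruction-tuning format; confirm the exact format your chosen JumpStart model expects. The example records are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def write_training_dir(examples: List[Dict[str, str]], out_dir: str) -> Path:
    """Write examples as JSON Lines and add a prompt template beside them."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    train_path = out / "train.jsonl"
    with train_path.open("w", encoding="utf-8") as handle:
        for example in examples:
            row = {
                "instruction": example["instruction"],
                "response": example["response"],
            }
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")

    # Assumed template: tells the trainer how to assemble each prompt.
    template = {
        "prompt": "### Instruction:\n{instruction}\n\n### Response:\n",
        "completion": "{response}",
    }
    (out / "template.json").write_text(json.dumps(template, indent=2), encoding="utf-8")
    return train_path


curated = [
    {"instruction": "Summarize the refund policy.", "response": "Refunds are issued within 14 days."},
    {"instruction": "How do I reset my password?", "response": "Use the Reset link on the sign-in page."},
]
with tempfile.TemporaryDirectory() as tmp:
    train_file = write_training_dir(curated, tmp)
    written = train_file.read_text(encoding="utf-8").splitlines()
    template_exists = (Path(tmp) / "template.json").exists()
print(len(written), template_exists)  # 2 True
```

In a real run, the directory would be uploaded to Amazon S3 and its location passed to the fine-tuning job.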

Step 5: Iteratively evaluate and develop

Return prompt responses from the newly fine-tuned model to Snorkel Flow. Run another evaluation report to identify where the model improved and where it needs more work. Continue this iterative loop until the model meets production-quality standards.
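The iterative loop comes down to comparing each new report against the baseline and deciding whether to continue. Here is a minimal sketch under stated assumptions: each report is reduced to one pass rate per slice, and the slice names, scores, and target threshold are hypothetical.

```python
from typing import Dict


def compare_reports(
    baseline: Dict[str, float], candidate: Dict[str, float], target: float
) -> Dict[str, Dict[str, object]]:
    """Compare per-slice pass rates between two evaluation reports."""
    comparison: Dict[str, Dict[str, object]] = {}
    for slice_name, before in baseline.items():
        after = candidate[slice_name]
        comparison[slice_name] = {
            "before": before,
            "after": after,
            "delta": round(after - before, 4),
            "regressed": after < before,
            "meets_target": after >= target,
        }
    return comparison


def ready_for_production(comparison: Dict[str, Dict[str, object]]) -> bool:
    """Stop iterating only when every slice meets the target without regressing."""
    return all(
        row["meets_target"] and not row["regressed"] for row in comparison.values()
    )


baseline_scores = {"billing": 0.50, "account": 0.80}
fine_tuned_scores = {"billing": 0.75, "account": 0.78}

comparison = compare_reports(baseline_scores, fine_tuned_scores, target=0.75)
for slice_name, row in comparison.items():
    print(slice_name, row)
print("ready:", ready_for_production(comparison))  # ready: False
```

In this example the billing slice improves, but the small regression on the account slice keeps the loop going, which is the kind of trade-off an aggregate score would hide.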

Step 6: Deploy the production-ready fine-tuned model

Finally, deploy the fine-tuned model to production using JumpStart Inference Endpoints. This deployment ensures that the model is ready to deliver actionable insights and drive business value in real-world scenarios.
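Once an endpoint exists, applications usually call it through the SageMaker runtime API. The sketch below wraps `invoke_endpoint` in a small helper and exercises it with a stand-in client so that it runs without AWS access. The endpoint name, the request payload shape, and the response shape are assumptions that depend on the deployed model’s container; `FakeRuntimeClient` is purely illustrative.

```python
import io
import json
from typing import Any


def generate(runtime_client: Any, endpoint_name: str, prompt: str, max_new_tokens: int = 64) -> Any:
    """Send a prompt to a SageMaker endpoint and return the parsed JSON reply."""
    # Assumed payload shape for a text-generation container.
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream of bytes containing JSON.
    return json.loads(response["Body"].read())


class FakeRuntimeClient:
    """Stand-in for boto3.client("sagemaker-runtime"), for local testing only."""

    def invoke_endpoint(self, EndpointName: str, ContentType: str, Body: str) -> dict:
        request = json.loads(Body)
        reply = [{"generated_text": "echo: " + request["inputs"]}]
        return {"Body": io.BytesIO(json.dumps(reply).encode("utf-8"))}


# With real credentials, replace the fake with boto3.client("sagemaker-runtime").
result = generate(FakeRuntimeClient(), "my-fine-tuned-llama", "Hello")
print(result)  # [{'generated_text': 'echo: Hello'}]
```

Passing the client in as an argument keeps the helper testable and lets the same code run against a staging or a production endpoint.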

Snorkel Flow + Amazon SageMaker: a powerful pair

The integration of Snorkel Flow with Amazon SageMaker offers a powerful solution for enterprises seeking to unlock the full potential of their proprietary data through LLM evaluation and fine-tuning.

By streamlining data preparation, model training, and deployment, this integration enables organizations to develop AI systems that are accurate, efficient, and aligned with their specific business needs. As enterprises continue to navigate the complexities of AI development, the partnership between Snorkel and AWS provides the tools and infrastructure needed to turn raw data into a strategic asset.

Chris Borg
Senior Machine Learning Solutions Engineer
