Unlock proprietary data with Snorkel Flow and Amazon SageMaker

December 2, 2024

Large language models (LLMs) fine-tuned on proprietary data have become a competitive differentiator for enterprises. While off-the-shelf open-source LLMs like Meta’s Llama herd offer impressive capabilities, their real value emerges when enterprises customize them. We made this process much easier through Snorkel Flow’s integration with Amazon SageMaker and other tools and services from Amazon Web Services (AWS).

The integration between the Snorkel Flow AI data development platform and AWS’s robust AI infrastructure empowers enterprises to streamline LLM evaluation and fine-tuning, transforming raw data into actionable insights and competitive advantages.

Here’s what that looks like in practice.

Snorkel Flow and SageMaker integration diagram

Snorkel Flow: the AI data development platform

Snorkel Flow accelerates AI development by focusing on data development. The platform enables organizations to curate, label, and refine datasets programmatically. This reduces the reliance on manual data labeling and significantly speeds up the model training process.

At its core, Snorkel Flow empowers data scientists and domain experts to encode their knowledge into labeling functions, which are then used to generate high-quality training datasets. This approach not only enhances the efficiency of data preparation but also improves the accuracy and relevance of AI models.
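To make the idea of a labeling function concrete, here is a minimal plain-Python sketch. It does not use the Snorkel Flow SDK. The function names, the keyword rules, and the "billing"/"account" labels are hypothetical, and a simple majority vote stands in for whatever label aggregation the platform actually performs.

```python
from collections import Counter
from typing import Callable, List, Optional

# A labeling function returns a label, or ABSTAIN when its rule does not apply.
ABSTAIN = None
LabelingFunction = Callable[[str], Optional[str]]


def lf_mentions_refund(text: str) -> Optional[str]:
    # Hypothetical expert rule: refund requests are billing issues.
    return "billing" if "refund" in text.lower() else ABSTAIN


def lf_mentions_invoice(text: str) -> Optional[str]:
    # Hypothetical expert rule: invoice questions are billing issues.
    return "billing" if "invoice" in text.lower() else ABSTAIN


def lf_mentions_password(text: str) -> Optional[str]:
    # Hypothetical expert rule: password problems are account issues.
    return "account" if "password" in text.lower() else ABSTAIN


LABELING_FUNCTIONS: List[LabelingFunction] = [
    lf_mentions_refund,
    lf_mentions_invoice,
    lf_mentions_password,
]


def weak_label(text: str, lfs: List[LabelingFunction]) -> Optional[str]:
    """Combine labeling-function votes by majority; abstain on no votes or a tie."""
    votes = [label for label in (lf(text) for lf in lfs) if label is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting rules with equal support
    return ranked[0][0]


tickets = ["Please refund my last invoice", "I forgot my password", "Hello there"]
labels = [weak_label(t, LABELING_FUNCTIONS) for t in tickets]
```

Each rule is cheap to write and review, so domain experts can add or adjust rules and relabel the whole dataset in one pass instead of annotating examples one at a time.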

Snorkel Flow + Amazon SageMaker

Snorkel Flow’s integration with Amazon SageMaker lets teams move from data development to model training and deployment in a single workflow.

The SageMaker JumpStart machine learning hub offers a suite of tools for building, training, and deploying machine learning models at scale. When combined with Snorkel Flow, it becomes a powerful enabler for enterprises seeking to harness the full potential of their proprietary data.

What the Snorkel Flow + AWS integrations offer

  • Streamlined data ingestion and management: With Snorkel Flow, organizations can easily access and manage unstructured data stored in Amazon S3. This integration allows for quick data ingestion and setup, enabling teams to focus on refining and labeling data rather than managing infrastructure.
  • Efficient model evaluation and fine-tuning: After importing baseline data from Amazon S3 and accessing LLMs from SageMaker JumpStart or Amazon Bedrock, users can apply Snorkel Flow’s LLM evaluation tools to build a customized, comprehensive report on their LLM’s current performance.
  • Enhanced data quality and model accuracy: Snorkel Flow’s data-centric approach allows for the identification and correction of data quality issues at scale. By encoding domain knowledge into labeling functions, organizations can improve the quality of training datasets, leading to more accurate and reliable models.
  • Scalable deployment and inference: After Snorkel Flow users iteratively fine-tune their model via SageMaker JumpStart, they can deploy it directly to SageMaker or Amazon Bedrock endpoints for scalable and efficient inference. This integration ensures that models are production-ready and capable of delivering real-time insights to drive business decisions.

Snorkel Flow + Amazon SageMaker: step by step

To illustrate how enterprises can use the Snorkel Flow and Amazon SageMaker JumpStart integrations, let’s walk through a high-level workflow for evaluating and fine-tuning a large language model.

Snorkel Flow SageMaker step by step walkthrough

Step 1: Baseline the system

Begin by uploading raw or generated AI pipeline data to Snorkel Flow via native S3 integration. Then, develop your evaluators and data slices to build your first LLM evaluation report. This establishes a baseline for the current system’s performance, providing a starting point for further refinement and evaluation.
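As a rough illustration of what a baseline report computes, the plain-Python sketch below scores a handful of records with one evaluator and breaks the result down by data slice. It is not the Snorkel Flow evaluation API. The record fields, the exact-match evaluator, the slice definitions, and the sample data are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]  # assumed fields: "prompt", "response", "reference"


def exact_match(record: Record) -> float:
    """Evaluator: 1.0 when the response matches the reference, ignoring case and outer spaces."""
    same = record["response"].strip().lower() == record["reference"].strip().lower()
    return 1.0 if same else 0.0


# Slices are named predicates that select a subset of the data (hypothetical rules).
SLICES: Dict[str, Callable[[Record], bool]] = {
    "pricing": lambda r: "price" in r["prompt"].lower(),
    "refunds": lambda r: "refund" in r["prompt"].lower(),
}


def build_report(
    records: List[Record],
    evaluator: Callable[[Record], float],
    slices: Dict[str, Callable[[Record], bool]],
) -> Dict[str, dict]:
    """Return the mean evaluator score overall and for each slice."""
    groups = {"overall": list(records)}
    for name, belongs in slices.items():
        groups[name] = [r for r in records if belongs(r)]
    report = {}
    for name, members in groups.items():
        scores = [evaluator(r) for r in members]
        mean = sum(scores) / len(scores) if scores else None  # None marks an empty slice
        report[name] = {"count": len(scores), "score": mean}
    return report


records = [
    {"prompt": "What is the refund window?", "response": "30 days", "reference": "30 days"},
    {"prompt": "What is the price of the premium plan?", "response": "$20", "reference": "$25"},
    {"prompt": "Who do I contact for support?", "response": "support@example.com",
     "reference": "support@example.com"},
]
baseline = build_report(records, exact_match, SLICES)
```

In this toy data the overall score hides the fact that every pricing question fails, which is the kind of gap a per-slice report is meant to surface.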

Step 2: Curate a high-quality dataset

Using the baseline report from Step 1, precisely identify where your model needs the most help. Use Snorkel Flow and your experts’ knowledge and intuition to develop labeling functions to address these issues. This curated dataset forms the foundation for subsequent model training and evaluation.

Step 3: Configure and connect an OSS base model

Integrate an open-source LLM, such as one from Meta’s Llama herd, with SageMaker using the SageMaker SDK. This setup provides the infrastructure necessary for model training and fine-tuning, leveraging AWS’s robust machine-learning capabilities.
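A hedged sketch of this step follows, assuming version 2 of the SageMaker Python SDK and its `JumpStartEstimator` class. The model ID, instance type, and hyperparameter names are illustrative assumptions that vary by model and SDK version, and `FineTuneSettings`, `estimator_kwargs`, and `build_estimator` are hypothetical helpers rather than part of either product. Calling `build_estimator` requires the `sagemaker` package, AWS credentials, and an execution role.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FineTuneSettings:
    # Illustrative JumpStart model ID; substitute the Llama variant you intend to use.
    model_id: str = "meta-textgeneration-llama-2-7b"
    # Assumed training instance type; check quotas and per-model recommendations.
    instance_type: str = "ml.g5.12xlarge"
    # Llama models are gated behind a license that must be accepted explicitly.
    accept_eula: bool = False
    # Assumed hyperparameter names; the supported set depends on the model.
    hyperparameters: Dict[str, str] = field(
        default_factory=lambda: {"epoch": "1", "instruction_tuned": "True"}
    )


def estimator_kwargs(settings: FineTuneSettings) -> Dict[str, object]:
    """Translate settings into keyword arguments for JumpStartEstimator."""
    if not settings.accept_eula:
        raise ValueError("Set accept_eula=True after reviewing the model license.")
    return {
        "model_id": settings.model_id,
        "instance_type": settings.instance_type,
        "environment": {"accept_eula": "true"},
    }


def build_estimator(settings: FineTuneSettings):
    """Create a JumpStart estimator (needs the sagemaker SDK and AWS credentials)."""
    # Imported lazily so the rest of this sketch runs without the SDK installed.
    from sagemaker.jumpstart.estimator import JumpStartEstimator

    estimator = JumpStartEstimator(**estimator_kwargs(settings))
    estimator.set_hyperparameters(**settings.hyperparameters)
    return estimator


settings = FineTuneSettings(accept_eula=True)
kwargs = estimator_kwargs(settings)
```

Keeping the settings in a small dataclass makes the license acceptance an explicit, reviewable choice and lets the configuration be checked before any AWS resources are created.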

Step 4: Fine-tune the model

Send the curated dataset from Snorkel Flow to SageMaker JumpStart for in-place LLM fine-tuning. This process refines the model, aligning it with the organization’s specific data and requirements.
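Fine-tuning jobs generally read training data from a file in S3, so the curated examples need to be serialized first. The sketch below writes them as JSON Lines using only the standard library. The `prompt` and `completion` field names are assumptions; check the format your chosen model expects before training.

```python
import json
from typing import Dict, Iterable


def to_jsonl(examples: Iterable[Dict[str, str]]) -> str:
    """Serialize curated examples as JSON Lines, skipping incomplete rows."""
    lines = []
    for example in examples:
        prompt = example.get("prompt", "").strip()
        completion = example.get("completion", "").strip()
        if not prompt or not completion:
            continue  # an example without both fields cannot be trained on
        row = {"prompt": prompt, "completion": completion}
        lines.append(json.dumps(row, ensure_ascii=False))
    # One JSON object per line, with a trailing newline when there is any content.
    return "\n".join(lines) + ("\n" if lines else "")


curated = [
    {"prompt": "What is the refund window?", "completion": "30 days"},
    {"prompt": "Unlabeled question", "completion": ""},  # dropped: no completion
]
payload = to_jsonl(curated)
```

The resulting string can be written to a local file such as `train.jsonl`, uploaded to an S3 prefix, and that prefix passed to the training job as its input channel.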

Step 5: Iteratively evaluate and develop

Return prompt responses from the newly fine-tuned model to Snorkel Flow. Run another evaluation report to identify where the model improved and where it needs more work. Continue this iterative loop until the model meets production-quality standards.
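The iteration loop comes down to comparing two evaluation reports slice by slice. The sketch below shows one way to do that in plain Python. The slice names, the scores, and the 0.75 quality bar are made-up values for illustration, not results from a real run.

```python
from typing import Dict, List, Union

Comparison = Dict[str, Union[List[str], bool]]


def compare_reports(
    baseline: Dict[str, float], candidate: Dict[str, float], target: float
) -> Comparison:
    """Compare per-slice scores between two reports and check them against a quality bar."""
    improved, regressed, below_target = [], [], []
    for name in sorted(baseline):
        before, after = baseline[name], candidate[name]
        if after > before:
            improved.append(name)
        elif after < before:
            regressed.append(name)
        if after < target:
            below_target.append(name)
    return {
        "improved": improved,
        "regressed": regressed,
        "below_target": below_target,
        # Ready only when every slice clears the bar and nothing got worse.
        "ready": not below_target and not regressed,
    }


# Made-up scores for illustration only.
baseline_scores = {"overall": 0.62, "pricing": 0.40, "refunds": 0.85}
candidate_scores = {"overall": 0.78, "pricing": 0.70, "refunds": 0.80}
result = compare_reports(baseline_scores, candidate_scores, target=0.75)
```

Here the overall score clears the bar, yet the pricing slice still falls short and the refunds slice regressed, so another round of data curation and fine-tuning would follow.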

Step 6: Deploy the production-ready fine-tuned model

Finally, deploy the fine-tuned model to production using JumpStart Inference Endpoints. This deployment ensures that the model is ready to deliver actionable insights and drive business value in real-world scenarios.

Snorkel Flow + Amazon SageMaker: a powerful pair

The integration of Snorkel Flow with Amazon SageMaker offers a powerful solution for enterprises seeking to unlock the full potential of their proprietary data through LLM evaluation and fine-tuning.

By streamlining data preparation, model training, and deployment, this integration helps organizations develop AI systems that are accurate, efficient, and aligned with their specific business needs. The partnership between Snorkel and AWS provides the tools and infrastructure to turn raw proprietary data into a strategic asset.

Chris Borg
Senior Machine Learning Solutions Engineer
