In today’s fast-paced AI landscape, seamless integration between data platforms and AI development tools is critical. At Snorkel, we’ve partnered with Databricks to create a powerful synergy between their data lakehouse and our Snorkel Flow AI data development platform. This integration uniquely bridges the gap between scalable data management and cutting-edge AI development, unlocking new efficiencies in data ingestion, labeling, model development, and deployment for our customers.

In this post, we’ll explore four key integration points between Snorkel Flow and Databricks, using a chatbot intent classification use case as an example:

  1. Ingesting raw data from Databricks into Snorkel Flow.
  2. Leveraging Databricks’ model-serving endpoints to generate labeling functions in Snorkel Flow.
  3. Registering Snorkel Flow-trained models into the Databricks Unity Catalog.
  4. Exporting labeled training data back to Databricks for further analysis or model training.

Let’s dive into the details of how these integrations work and how they can supercharge your AI workflows.

If you’d like a video version of this walkthrough, you can watch it on our YouTube channel or via the embed below.

Ingesting raw data from Databricks into Snorkel Flow

Efficient data ingestion is the foundation of any machine learning project. In our chatbot intent classification use case, we started with a raw collection of chatbot utterance data stored in the Databricks Hive Metastore. Our experts labeled a subset of these utterances to establish ground truth but left the majority unlabeled.

One of our tasks was to provide predicted classifications for all existing utterances. This is how the process begins:

How to ingest data from Databricks into Snorkel Flow

  1. Connect Databricks and Snorkel Flow: Use Snorkel Flow’s Databricks SQL connector to set up a connection with your Databricks instance by entering your credentials and query parameters.
  2. Run checks and import data: The platform performs checks on the data and imports it, enabling you to create an application for your labeling process.

This seamless integration eliminates data logistics challenges, enabling rapid iteration and allowing you to focus on labeling and model development.

databricks plus snorkel

Using Databricks model-serving endpoints for labeling functions

Large language models (LLMs) are powerful tools for generating initial labels. Snorkel Flow natively integrates with leading LLM providers, allowing you to harness the power of frontier LLMs to create labeling functions for your data.

However, fine-tuned LLMs trained on your proprietary data often outperform generic models. Organizations hosting custom LLMs on Databricks can seamlessly leverage these models directly within Snorkel Flow.

How to use Databricks-hosted models for LLM-powered labeling functions:

  1. Set up LLM configurations: Select a Databricks-hosted LLM and connect it through our foundation model management tools.
  2. Craft your prompts: Snorkel Flow allows you to write and iterate on prompt templates to quickly generate labels.
  3. Preview and refine: Test and refine your prompts using the preview functionality to ensure label accuracy.
  4. Generate initial labels: Use the LLM to generate high-coverage labeling functions, providing a strong starting point for further refinement.

Even with a proprietary LLM and a well-engineered prompt, initial labels won’t be perfect. These labels provide coverage across your dataset, helping identify gaps where targeted labeling functions are needed.

Registering models in Snorkel Flow to Databricks Unity Catalog

When your model is ready for deployment, Snorkel Flow simplifies the process by enabling you to register your custom models directly into Databricks’ Unity Catalog for hosting and inference.

Snorkel Flow → Unity Catalog registration process:

  1. Deploy via Snorkel Flow: Name your deployment, specify an experiment, and select an MLflow registry.
  2. Register in Unity Catalog: Once deployed, your model is registered in the Unity Catalog, ready for inference and integration with other Databricks workflows.

This integration ensures a smooth transition from development to production, seamlessly connecting data-centric AI development with scalable deployment.

databricks and snorkel integration highlights

Exporting labeled training data back to Databricks

Labeling data isn’t just about training a model—it’s about enriching your dataset for future use. After completing the labeling process in Snorkel Flow, export your curated training data back to Databricks for further analysis or model training.

How to export and validate labeled data:

  1. Export with Snorkel Flow SDK: Use the Databricks extension in the Snorkel Flow SDK to send labeled data back to the Hive Metastore.
  2. Validate in Databricks: Ensure that the newly labeled dataset is loaded correctly. In our chatbot example, we successfully labeled previously unknown utterances, preparing the data for downstream tasks.

Key takeaways from the Snorkel Flow-Databricks integration

By integrating Snorkel Flow with Databricks, we streamlined several critical components of the machine learning lifecycle:

  • Ingesting raw data directly from Databricks’ Hive Metastore into Snorkel Flow.
  • Using labeling functions powered by a custom LLM hosted on Databricks to jumpstart the labeling process.
  • Registering trained models into the Databricks Unity Catalog for seamless deployment.
  • Exporting enriched labeled datasets back into Databricks for extended analysis and usage.

These integrations highlight the unique flexibility and scalability of combining Snorkel Flow’s data-centric AI capabilities with Databricks’ robust data platform.

Learn how to get more value from your PDF documents!

Transforming unstructured data such as text and documents into structured data is crucial for enterprise AI development. On December 17, we’ll hold a webinar that explains how to capture SME domain knowledge and use it to automate and scale PDF classification and information extraction tasks.

Sign up here!