How to import Databricks data into Snorkel Flow
Databricks customers can now access millions of rows of data seamlessly within the Snorkel Flow platform thanks to a new Databricks connector. With a few clicks, users have access to massive amounts of Databricks data and can use Snorkel Flow’s data-centric approach to develop, fine-tune, and adapt ML models of all sizes—including multi-billion parameter foundation models—using their own proprietary data and subject matter expertise.
The new Databricks connector adds to Snorkel Flow’s suite of third-party data connectors, making data stored in external repositories like Databricks quickly and easily accessible for AI application development.
We’re excited to announce this new connector in conjunction with our upcoming virtual event, The Future of Data-Centric AI. On June 7, the first day of the conference, Databricks Chief Technologist and Co-founder Matei Zaharia will discuss “Making LLM Applications Production Grade” at 1:30 PM PDT.
Weeks later, on June 29, Snorkel AI Founding Engineer and Product Director Vincent Chen will present “Building AI-Powered Products with Foundation Models” at the Databricks Data + AI Summit. Both events will benefit the AI and ML community and continue to advance the conversation around this exciting technology.
Data-centric AI development with Snorkel Flow
One of the most painstaking and time-consuming parts of developing AI applications is curating and labeling unstructured data. Snorkel AI addresses this bottleneck with Snorkel Flow, a novel data-centric AI platform.
Data science and machine learning teams use Snorkel Flow’s programmatic labeling to intelligently capture knowledge from various sources—such as previously labeled data (even when imperfect), heuristics from subject matter experts, business logic, and even the latest foundation models and large language models—and then scale this knowledge to label large quantities of data.
As users integrate more sources of knowledge, the platform enables them to rapidly improve training data quality and model performance using integrated error analysis tools.
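To make the idea of programmatic labeling concrete, here is a minimal sketch in plain Python. It is not Snorkel Flow's API: the labeling functions, the ticket categories, and the `legacy_label` field are hypothetical, and the votes are combined with a simple vote share, whereas a production system would typically use a more sophisticated model to weigh sources against each other.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

# A labeling function returns a label, or None to abstain.
ABSTAIN = None


def lf_keyword_refund(doc: dict) -> Optional[str]:
    """Hypothetical expert heuristic: refund requests are billing issues."""
    return "billing" if "refund" in doc["text"].lower() else ABSTAIN


def lf_keyword_crash(doc: dict) -> Optional[str]:
    """Hypothetical expert heuristic: crash reports are technical issues."""
    return "technical" if "crash" in doc["text"].lower() else ABSTAIN


def lf_legacy_label(doc: dict) -> Optional[str]:
    """Reuse a previously assigned (possibly imperfect) label when present."""
    return doc.get("legacy_label") or ABSTAIN


LABELING_FUNCTIONS: List[Callable[[dict], Optional[str]]] = [
    lf_keyword_refund,
    lf_keyword_crash,
    lf_legacy_label,
]


def label_distribution(
    doc: dict, lfs: List[Callable[[dict], Optional[str]]]
) -> Dict[str, float]:
    """Combine non-abstaining votes into a probabilistic label (vote share)."""
    votes = [vote for vote in (lf(doc) for lf in lfs) if vote is not ABSTAIN]
    if not votes:
        return {}  # every source abstained, so the document stays unlabeled
    counts = Counter(votes)
    return {label: n / len(votes) for label, n in counts.items()}


ticket = {"text": "App crash after refund request", "legacy_label": "billing"}
distribution = label_distribution(ticket, LABELING_FUNCTIONS)
```

Because each source is just a function, adding a new heuristic or swapping in a model-based labeler means appending one more entry to the list and relabeling the whole dataset.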
Snorkel Flow + Databricks
Snorkel is further streamlining the machine learning development process for organizations that rely on Databricks with the new Databricks SQL connector built directly into the platform interface. This connector makes clients’ Databricks data accessible to Snorkel Flow with just a few clicks.
Here’s how it works:
- Select “Databricks SQL” as a data source when creating a new dataset in Snorkel Flow.
- Enter Databricks SQL connection details and credentials. To make sure sensitive credentials are never exposed, all credentials are encrypted end-to-end.
- Use SQL queries to access relevant data, select splits, and identify inconsistencies that may cause issues.
- Select the unique identifier column or choose to have Snorkel Flow autogenerate one.
- Snorkel Flow will then ingest the dataset, making it immediately referenceable throughout the platform.
- Data can then be labeled programmatically using a data-centric AI workflow in Snorkel Flow to quickly generate high-quality training sets over complex, highly variable data. Snorkel Flow includes templates to classify and extract information from unstructured text, native PDFs, richly formatted documents, HTML data, conversational text, and more.
- Newly labeled datasets can then be used to either train custom ML models or fine-tune pre-built models.
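The query and unique-identifier steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the connector's implementation: an in-memory `sqlite3` database stands in for a Databricks SQL warehouse so the example runs locally, and the `documents` table, its columns, and the helper names are hypothetical.

```python
import sqlite3
from typing import List, Optional


def build_demo_connection() -> sqlite3.Connection:
    """Create a small in-memory table standing in for warehouse data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (doc_id INTEGER, text TEXT, split TEXT)")
    conn.executemany(
        "INSERT INTO documents VALUES (?, ?, ?)",
        [
            (1, "invoice overdue", "train"),
            (2, "app crashes on login", "train"),
            (3, "reset password", "valid"),
        ],
    )
    return conn


def load_split(conn: sqlite3.Connection, query: str, params: tuple = ()) -> List[dict]:
    """Run a SQL query and return each row as a column-name-to-value dict."""
    cursor = conn.execute(query, params)
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


def ensure_uid(rows: List[dict], uid_column: Optional[str] = None) -> List[dict]:
    """Validate a chosen identifier column, or autogenerate one if none is given."""
    if uid_column is None:
        # No column chosen: attach a sequential identifier to a copy of each row.
        return [{**row, "generated_uid": i} for i, row in enumerate(rows)]
    values = [row[uid_column] for row in rows]
    if any(value is None for value in values):
        raise ValueError(f"column {uid_column!r} contains missing values")
    if len(set(values)) != len(values):
        raise ValueError(f"column {uid_column!r} contains duplicate values")
    return rows


conn = build_demo_connection()
train_rows = load_split(
    conn, "SELECT doc_id, text FROM documents WHERE split = ?", ("train",)
)
validated = ensure_uid(train_rows, "doc_id")
autogenerated = ensure_uid(train_rows)
```

Checking the identifier column for missing or duplicate values before ingestion is one way to catch the kind of inconsistencies that would otherwise surface later in the labeling workflow.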
The new Databricks connector is currently in private preview and will be generally available soon.
Making it easier than ever to get value out of your data
The new Databricks connector joins our suite of third-party connectors in the Snorkel Flow platform. Each makes it easier for our customers to get their data onto the Snorkel Flow platform, where they can rapidly and iteratively build probabilistic training sets and construct valuable, deployable models faster.
To learn more about how Databricks and Snorkel can help your enterprise build and deploy powerful, valuable machine learning applications, join Snorkel AI at The Future of Data-Centric AI and Databricks at Databricks Data + AI Summit.
As Head of Partnerships for Snorkel, Friea Berg leverages over a decade of channel experience to help the world’s most innovative enterprises realize the promise of AI using proprietary data. Friea joined Snorkel to build the startup’s channel strategy from the ground up. Under her leadership, Snorkel has built successful partnerships with Google, Microsoft, AWS, Databricks, Snowflake, and Hugging Face, and has unlocked new routes to market via Marketplace and global resellers. Partners are now integral to every team at Snorkel, one of CRN’s 10 Hottest Data Science/ML Startups in 2022 and one of Forbes’s 50 most promising AI startups in the world in 2023.
Prior to diving into startups, Friea held leadership, alliance, and business development positions at Splunk, NetApp, and other technology leaders. At Splunk she built and scaled global strategic partnerships with Google, Cisco, and Palo Alto Networks. She also led a team that incubated first-of-a-kind ‘market maker’ partnerships with Deloitte, SAP, Cerner, Salesforce, and others.