According to Gartner, up to 80% of enterprise information assets lie in unstructured content formats such as text, PDFs, emails, web pages, and transcripts. Most enterprises recognize that they have a wealth of valuable insights embedded in contracts, buried in patient files, and captured in chat conversations. Unfortunately, accessing data across various locations and file types and then operationalizing it for AI has traditionally been a painfully manual, time-consuming, and costly process.

Ahmad Khan, Head of AI/ML Strategy at Snowflake, discusses the challenges of operationalizing ML in a recent talk.

Snorkel AI has teamed with Snowflake to help our shared customers transform raw, unstructured data into actionable, AI-powered insights. The combination of the Snowflake Data Cloud with Snorkel AI’s data-centric AI platform powered by programmatic labeling accelerates AI development by 10-100x and empowers enterprises to solve their most impactful challenges using all relevant knowledge and data.

The Snowflake Data Cloud for unstructured data

Snowflake breaks down silos by allowing enterprises to centralize and govern all of their data – structured, semi-structured, and unstructured – in a single, secure repository. Snowflake’s support for unstructured data management includes built-in capabilities to store, access, process, manage, govern, and share unstructured data, bringing the performance, concurrency, and scale benefits of the Snowflake Data Cloud to unstructured data. 

“Our partnership with Snorkel AI can help make scalable data science on Snowflake more accessible across an organization to help drive business outcomes.”

Ahmad Khan, Head of AI/ML Strategy at Snowflake

However, the rich textual data centralized within Snowflake cannot be used to train machine learning models, and so cannot deliver its full value, until it has been curated and labeled.

Snorkel AI is the perfect complement to Snowflake 

To empower Snowflake customers to dramatically accelerate time-to-value for AI initiatives, Snorkel AI has teamed with Snowflake as both a Technology Partner and Powered By partner. 

Snorkel AI’s data-centric approach addresses the biggest blocker to AI deployment: the massive, hand-labeled training datasets needed to train modern machine learning models. Snorkel AI solves this bottleneck with Snorkel Flow, a novel data-centric AI platform. Data science and machine learning teams use Snorkel Flow’s programmatic labeling to intelligently capture knowledge from various sources, such as previously labeled data (even when imperfect), heuristics from subject matter experts, business logic, and even the latest foundation models, and then scale this knowledge to label large quantities of data. Using integrated error analysis, they can rapidly improve training data quality and model performance to develop highly accurate and adaptable AI applications.
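To make the idea of programmatic labeling concrete, here is a minimal sketch in plain Python. It is not Snorkel Flow code: the labeling functions, class names, and example documents are hypothetical, and a simple majority vote stands in for whatever aggregation the platform performs. It only illustrates how several imperfect heuristics, each allowed to abstain, can be combined to label data without hand annotation.

```python
from collections import Counter
from typing import Callable, List, Optional

ABSTAIN = None  # a labeling function returns this when it has no opinion

LabelingFunction = Callable[[str], Optional[str]]


def lf_termination_keyword(text: str) -> Optional[str]:
    """Hypothetical expert heuristic: termination language."""
    return "TERMINATION" if "terminat" in text.lower() else ABSTAIN


def lf_payment_keyword(text: str) -> Optional[str]:
    """Hypothetical business rule: invoices and payments signal a payment clause."""
    lowered = text.lower()
    return "PAYMENT" if ("invoice" in lowered or "payment" in lowered) else ABSTAIN


def lf_notice_period(text: str) -> Optional[str]:
    """Hypothetical, imperfect heuristic: notice periods suggest termination clauses."""
    return "TERMINATION" if "days notice" in text.lower() else ABSTAIN


def majority_vote(text: str, lfs: List[LabelingFunction]) -> Optional[str]:
    """Combine labeling function votes; abstain when there are no votes or a tie."""
    votes = [lf(text) for lf in lfs]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    (top, top_count), *rest = Counter(votes).most_common()
    if rest and rest[0][1] == top_count:
        return ABSTAIN  # conflicting evidence: leave the document unlabeled
    return top


LABELING_FUNCTIONS = [lf_termination_keyword, lf_payment_keyword, lf_notice_period]

DOCS = [
    "Either party may terminate this agreement with 30 days notice.",
    "Invoices are due within 45 days of receipt.",
    "This agreement is governed by the laws of Delaware.",
    "Payment is due on termination of the contract.",
]

labels = [majority_vote(doc, LABELING_FUNCTIONS) for doc in DOCS]
```

Documents that no heuristic covers, or on which the heuristics disagree, stay unlabeled in this sketch. Those are the cases where error analysis would point to a new or refined labeling function.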

The Snorkel Flow platform integrates natively with the Snowflake Data Cloud to streamline and simplify AI development workflows:

  • With a few clicks, data scientists can immediately pull relevant data from Snowflake into Snorkel Flow using the natively-integrated Snowflake connector. 
  • Data can then be labeled programmatically using a data-centric AI workflow in Snorkel Flow to quickly generate high-quality training sets over complex, highly variable data. Snorkel Flow includes templates to classify and extract information from native PDFs, richly formatted documents, HTML data, conversational text, and more.
  • Newly labeled datasets can then be used to either train custom ML models or fine-tune prebuilt models.
  • Labeled data can be loaded back into Snowflake as structured data. 
Figure: Data ingestion sources in Snorkel Flow, now including the Snowflake Data Cloud.

Organizations also have the option of deploying complex ML models on Snowflake. Models built in Snorkel Flow can be registered on Snowflake as Snowpark UDFs. Snowpark is Snowflake’s developer framework that extends the benefits of the Data Cloud beyond SQL to Python, Scala, and Java, and it can be used to scale batch inference across your Snowflake data warehouse.
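As a rough illustration of what registering a model as a Snowpark UDF can look like, the sketch below uses the Snowpark for Python API. It makes several assumptions: the keyword-based `predict_category` function is a hypothetical stand-in for a model exported from Snorkel Flow (the export mechanism is not described here), and the UDF name, the stage `@model_stage`, the table `DOCUMENTS`, and the column `BODY` are placeholders.

```python
def predict_category(text: str) -> str:
    """Hypothetical stand-in for a trained model's prediction function."""
    lowered = text.lower()
    if "invoice" in lowered or "payment" in lowered:
        return "PAYMENT"
    if "terminat" in lowered:
        return "TERMINATION"
    return "OTHER"


def register_udf(session):
    """Register predict_category as a permanent Snowpark UDF.

    `session` is an existing snowflake.snowpark.Session. The UDF name and
    stage location are placeholders, not values from the article.
    """
    # Imported lazily so the prediction logic can be tested without Snowpark.
    from snowflake.snowpark.types import StringType

    return session.udf.register(
        func=predict_category,
        name="predict_category",
        return_type=StringType(),
        input_types=[StringType()],
        is_permanent=True,
        stage_location="@model_stage",
        replace=True,
    )


def score_table(session, udf):
    """Batch inference: add a CATEGORY column computed inside Snowflake."""
    from snowflake.snowpark.functions import col

    # DOCUMENTS and BODY are assumed table and column names.
    return session.table("DOCUMENTS").with_column("CATEGORY", udf(col("BODY")))
```

With a live session, `score_table(session, register_udf(session))` would return a Snowpark DataFrame whose scoring runs in the warehouse rather than on the client, which is the point of pushing inference into a UDF.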

Real-World Value for Pixability 

Pixability is a technology and data company that empowers the world’s largest brands and their agencies to maximize the value of video advertising. With over 500 hours of content uploaded to YouTube every minute, Pixability needs to constantly and accurately categorize billions of videos and fully understand their context, so that advertisers can be sure their ads run on brand-suitable content.

To manage its data, Pixability has implemented data pipelines from Amazon S3 to Snowflake using Snowpipe for both structured and unstructured data. The team ingests unstructured data from Snowflake into Snorkel Flow, labels it programmatically, and rapidly trains high-quality models in-house, keeping data private and secure. Pixability has used Snorkel Flow to distill knowledge from foundation models into smaller, deployable classification models with more than 90% accuracy in just days, improving ad performance and brand-suitable targeting.

Transform data into actionable, AI-powered insights with Snorkel AI + Snowflake

Together, Snowflake and Snorkel AI make AI application development fundamentally easier. “Our partnership with Snorkel AI can help make scalable data science on Snowflake more accessible across an organization to help drive business outcomes,” said Ahmad Khan, Head of AI/ML Strategy at Snowflake. “We are excited to strengthen the Data Cloud through partners such as Snorkel AI and leverage their capabilities and expertise to deliver powerful data-driven business outcomes to our joint customers.”

We’re excited to continue deepening our partnership with Snowflake to help accelerate AI development across industries. Schedule a custom demo tailored to your use case with our ML experts today.