Powered by cutting-edge technology

Snorkel Flow builds on and productionizes years of research at the Stanford AI Lab, represented in over thirty peer-reviewed research papers and co-developed with some of the world’s leading organizations.
Intro —

A new paradigm for machine learning development

Snorkel Flow is motivated by a tectonic shift in AI over the last decade towards powerful but data-hungry machine learning models that succeed or fail based on the massive labeled datasets they learn from.

Unfortunately, labeling these training datasets by hand is prohibitively expensive and slow for most organizations. In particular, the status quo of cheap, outsourced manual labeling that works for some use cases (e.g., labeling stop signs, pedestrians, cats, and dogs) is not an option for most organizations where data is highly private, requires subject matter experts to label, and changes rapidly.

Snorkel Flow is based on a novel approach to programmatically building and managing training datasets: rather than labeling thousands of data points by hand, users label, augment, and structure training datasets programmatically — e.g., using rules, heuristics, and other sources of signal. This type of input, often referred to as weak supervision, is fundamentally faster and more flexible, but also messier. Snorkel Flow relies on years of theoretical and algorithmic research from the Stanford AI Lab to manage this powerful new type of input.

Ultimately, all the algorithmic, theoretical, and empirical research behind Snorkel Flow builds toward a simple but powerful vision: ML development as a practical, iterative, error-analysis-driven process, rather than one reliant on weeks or months of labeling and relabeling data by hand.

Algorithmic core

Programmatic labeling

The time and cost of labeling training data are among the biggest bottlenecks in deploying machine learning today. Snorkel Flow instead enables users to label data programmatically, using push-button or programmatically defined labeling functions, so that data is labeled quickly, flexibly, and in an interpretable and auditable way. These labeling functions can be used to express subject matter expertise in the form of rules, heuristics, and pattern matchers, as well as to leverage organizational knowledge resources such as existing labels or models, legacy rules, knowledge bases and graphs, and more.

The resulting technical challenge is that these user-developed labeling strategies can be inaccurate, overlap and disagree, be correlated, and have minimal data coverage. Snorkel Flow uses novel, theoretically grounded techniques to model, integrate, and clean this programmatically labeled data to yield accuracy equal to or greater than that of hand-labeled data.
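To make the idea concrete, here is a minimal sketch of labeling functions in plain Python. It is not Snorkel Flow's API: the function names, the spam/ham task, and the example messages are all hypothetical, and the votes are combined with a simple majority vote rather than the theoretically grounded modeling described above. It also shows how coverage and disagreement appear in practice.

```python
from collections import Counter

# Label values for a hypothetical spam task; ABSTAIN means "no opinion".
ABSTAIN, HAM, SPAM = -1, 0, 1

# Each labeling function encodes one heuristic and may abstain.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) <= 4 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]

def apply_lfs(texts, lfs=LABELING_FUNCTIONS):
    """Build a label matrix: one row per text, one column per labeling function."""
    return [[lf(text) for lf in lfs] for text in texts]

def majority_vote(votes):
    """Combine one row of votes; abstain when nothing fires or the top vote is tied."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

def coverage(label_matrix):
    """Fraction of rows on which each labeling function did not abstain."""
    n_rows = len(label_matrix)
    n_lfs = len(label_matrix[0])
    return [sum(row[j] != ABSTAIN for row in label_matrix) / n_rows
            for j in range(n_lfs)]

texts = [
    "Claim your prize at http://example.com now",
    "see you at lunch",
    "You have won a prize, reply to this message to collect it",
    "The quarterly report is attached for your review today",
]
label_matrix = apply_lfs(texts)
labels = [majority_vote(row) for row in label_matrix]
print(labels)
print(coverage(label_matrix))
```

The last message receives no label because every function abstains on it, which is the coverage problem in miniature. Majority vote treats every function as equally reliable; estimating and weighting their accuracies is the harder problem the paragraph above refers to.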

Diverse operators

Data augmentation & slicing

In addition to labeling data, Snorkel Flow enables users to programmatically perform other key operations on training data, including augmenting training datasets by creating transformed data copies, slicing training datasets into critical subsets for monitoring and model prioritization, and more.

Whereas these key techniques are often implemented by hand in practice, Snorkel Flow enables users to programmatically express them and then uses novel algorithmic approaches to tune and optimize them. The result is faster, smarter, and more finely controllable ML.
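As an illustration of expressing these operations as code, the sketch below defines one transformation function and one slicing function in plain Python. The names, the synonym table, and the toy dataset are assumptions made for this example, not part of Snorkel Flow, and the sketch does not attempt the tuning and optimization mentioned above.

```python
import random

# Hypothetical synonym table used only for this illustration.
SYNONYMS = {"error": "failure", "quick": "fast"}

def tf_swap_synonym(text, rng):
    """Transformation function: replace one known word with its synonym."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    if not candidates:
        return text  # nothing to transform
    i = rng.choice(candidates)
    words[i] = SYNONYMS[words[i].lower()]
    return " ".join(words)

def augment(dataset, tf, copies=1, seed=0):
    """Return the dataset plus transformed copies that keep the original label."""
    rng = random.Random(seed)
    augmented = list(dataset)
    for text, label in dataset:
        for _ in range(copies):
            new_text = tf(text, rng)
            if new_text != text:  # skip copies the transformation left unchanged
                augmented.append((new_text, label))
    return augmented

def sf_short(text):
    """Slicing function: membership in the 'short message' slice."""
    return len(text.split()) <= 3

def slice_accuracy(dataset, predictions, sf):
    """Accuracy restricted to the examples selected by a slicing function."""
    members = [(label, pred)
               for (text, label), pred in zip(dataset, predictions)
               if sf(text)]
    if not members:
        return None  # empty slice, accuracy undefined
    return sum(label == pred for label, pred in members) / len(members)

dataset = [("the login error returned", 1), ("thanks team", 0), ("ok", 0)]
augmented = augment(dataset, tf_swap_synonym)

predictions = [1, 1, 0]  # hypothetical model outputs for the three originals
short_acc = slice_accuracy(dataset, predictions, sf_short)
print(augmented[-1], short_acc)
```

Reporting accuracy per slice makes a weak subset visible even when overall accuracy looks acceptable, which is the monitoring use the text describes.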

Privacy focused

“Eyes off” machine learning

One of the largest blockers to applying machine learning to many problem domains and sectors is the sensitivity of the data involved.

With Snorkel Flow’s programmatic approach to training data, not only can training data labeling and management be kept on-premises or self-hosted, it can be done without humans needing to view the majority of the data — setting a new high bar for practical, private machine learning.

Closing the loop

Monitoring, analysis, and auditing

The core slowdown in most iterative development loops involving machine learning is the need to label and re-label data by hand. With the shift to programmatic labeling and management of training data, monitoring and analysis lead to immediately actionable steps for improving the data, resulting in a fast and responsive iterative loop.

Similarly, programmatic training data enables a fundamentally new but pragmatic level of auditability and versioning — since training data is labeled and managed by code.

Cutting Edge

State-of-the-art modeling

Modern ML models and training techniques are powerful but data-hungry.

Snorkel Flow includes and enables these types of models (e.g., BERT, XLNet, etc.) and techniques (transfer learning, multi-task learning, ensembling, etc.) with training datasets that can be orders of magnitude larger than hand-labeled ones. The latest techniques are regularly integrated into the platform so that you have access to current technology as it evolves.

Integrated Crowdlabeling

Annotator management

The same algorithmic and theoretical techniques that power Snorkel Flow's ability to estimate the quality of diverse labeling functions and integrate their outputs apply equally to hand-labeled data. Though no hand-labeled training data is needed to use Snorkel Flow, if you have it, Snorkel Flow will manage it for you, automatically estimating the quality of your annotators and identifying unreliable or adversarial labelers.
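One simple way to picture annotator quality estimation is to score each annotator by agreement with the per-item consensus. The sketch below does exactly that in plain Python. It is a stand-in heuristic, not the technique Snorkel Flow uses, and the annotator names, labels, and the 0.6 threshold are hypothetical.

```python
from collections import Counter, defaultdict

def consensus(annotations):
    """Majority label per item; items with a tied top vote get no consensus."""
    labels_by_item = defaultdict(list)
    for item, _annotator, label in annotations:
        labels_by_item[item].append(label)
    result = {}
    for item, labels in labels_by_item.items():
        ranked = Counter(labels).most_common()
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            result[item] = ranked[0][0]
    return result

def annotator_agreement(annotations):
    """Fraction of each annotator's labels that match the consensus label."""
    gold = consensus(annotations)
    hits, totals = Counter(), Counter()
    for item, annotator, label in annotations:
        if item in gold:
            totals[annotator] += 1
            hits[annotator] += int(label == gold[item])
    return {a: hits[a] / totals[a] for a in totals}

def flag_unreliable(agreement, threshold=0.6):
    """Annotators whose agreement falls below an assumed threshold."""
    return sorted(a for a, score in agreement.items() if score < threshold)

# Hypothetical annotations: (item id, annotator, label).
votes = {"ann_a": [1, 0, 1, 0], "ann_b": [1, 0, 1, 0], "ann_c": [0, 1, 0, 1]}
annotations = [(item, name, label)
               for name, labels in votes.items()
               for item, label in enumerate(labels)]

agreement = annotator_agreement(annotations)
print(agreement)
print(flag_unreliable(agreement))
```

This heuristic counts each annotator's own vote toward the consensus and assumes the majority is usually right, so it can fail when most annotators share the same mistake.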

The Snorkel Flow platform provides an array of collaboration and workflow tools to integrate subject matter expert labelers into your workflow — whether they are labeling a small test or validation data split, or providing higher-level feedback on a difficult portion of the training dataset to assist with development.

Roadmap —

From research to production

Snorkel Flow builds on and productionizes years of research represented in over thirty peer-reviewed research papers and deployments with some of the world’s leading organizations, covering use cases in text, structured, semi-structured, telemetry, image, video, time series, and other data types.

Data Scientists

  • Form parsing
  • PDF table extraction
  • Contract & Financials analysis
  • And more …

Image Data

  • And more …

Video & Time Series Data

  • And more …

Based on years of groundbreaking research

Snorkel Flow is informed by novel research into machine learning systems and weak supervision out of the Stanford AI Lab and beyond, funded by DARPA, ONR, DoD, NIH, NSF, Google, Intel, Microsoft, and many others, taught in several introductory and advanced machine learning courses, and published in over thirty-six peer-reviewed papers.
Snorkel: Rapid Training Data Creation with Weak Supervision

A. Ratner, S. Bach, H. Ehrenberg, J. Fries, S. Wu, C. Ré.
VLDB 18 Best Of.

Snorkel and Software 2.0: Beyond Hand-labeled Data

C. Ré.
KDD 2018 (Invited).

Data Programming: Creating Large Training Sets, Quickly

A. Ratner, C. De Sa, S. Wu, D. Selsam, C. Ré.
NeurIPS 2016.

Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale
Google: Web content and event classification. S. Bach, D. Rodriguez, Y. Liu, C. Luo, H. Shao, et al. SIGMOD Industry 2019.
Overton: A Data System for Monitoring and Improving Machine Learned Products
Apple: Serving >1B queries in multiple languages. C. Ré, F. Niu, P. Gudipati, C. Srisuwananukorn. Apple 2019.
Osprey: Non-Programmer Weak Supervision of Imbalanced Extraction Problems
Intel: Business intelligence. E. Bringer, A. Israeli, A. Ratner, C. Ré. DEEM @ SIGMOD 2019.
Bootstrapping Conversational Agents With Weak Supervision
IBM: Conversational agents. N. Mallinar, A. Shah, R. Ugrani, A. Gupta, et al. AAAI 2019.
A Machine-compiled Database of Genome-wide Association Studies
Stanford Genomics: Knowledge base construction. V. Kuleshov, J. Ding, C. Vo, B. Hancock, et al. Nature Comms 2019.
Weakly supervised classification of aortic valve malformations using unlabeled cardiac MRI sequences
Stanford Medicine: Cardiac MRI classification. J. Fries, P. Varma, V. Chen, K. Xiao, H. Tejeda, et al. Nature Comms 2019.
Medical device surveillance with electronic health records
Stanford & VA: Medical device surveillance. A. Callahan, J. Fries, C. Ré, J. Huddleston, et al. NPJ Digital Medicine 2019.
Cross-Modal Data Programming Enables Rapid Medical Machine Learning
Stanford Medicine: Medical triaging. J. Dunnmon, A. Ratner, N. Khandwala, K. Saab, et al. Cell Patterns 2020.
Trove: Ontology-driven Weak Supervision for Medical Entity Classification
Stanford Medicine: Medical entity classification. J. Fries, E. Steinberg, S. Khattar, S. Fleming, et al. Preprint 2020.
Data Programming with DDLite: Putting Humans in a Different Part of the Loop
H. Ehrenberg, J. Shin, A. Ratner, J. Fries, C. Ré. HILDA @ SIGMOD 2016.
Socratic Learning: Correcting Misspecified Generative Models using Discriminative Models
P. Varma, B. He, D. Iter, P. Xu, R. Yu, C. De Sa, C. Ré. 2016.
Learning to Compose Domain-Specific Transformations for Data Augmentation
A. Ratner, H. Ehrenberg, Z. Hussain, J. Dunnmon, et al. NeurIPS 2017.
Learning the Structure of Generative Models without Labeled Data
S. Bach, B. He, A. Ratner, C. Ré. ICML 2017.
Snorkel: Fast Training Set Generation for Information Extraction
A. Ratner, S. Bach, H. Ehrenberg, C. Ré. SIGMOD 17 (Demo).
Inferring Generative Model Structure with Static Analysis
P. Varma, B. He, P. Bajaj, I. Banerjee, et al. NeurIPS 2017.
Babble Labble: Learning from Natural Language Explanations
SwellShark: A Generative Model for Biomedical Named Entity Recognition without Labeled Data
J. Fries, S. Wu, A. Ratner, C. Ré. 2017.
Snorkel MeTaL: Weak Supervision for Multi-Task Learning
A. Ratner, B. Hancock, J. Dunnmon, et al. DEEM @ SIGMOD 2018.
Fonduer: Knowledge Base Construction from Richly Formatted Data
S. Wu, L. Hsiao, X. Cheng, B. Hancock, et al. SIGMOD 2018.
Training Classifiers with Natural Language Explanations
B. Hancock, P. Varma, S. Wang, M. Bringmann, et al. ACL 2018.
Deep Text Mining of Instagram Data without Strong Supervision
Social media text mining. K. Hammar, S. Jaradat, N. Dokoohaki, M. Matskin. ICWI 2018.
Training Complex Models with Multi-Task Weak Supervision
A. Ratner, B. Hancock, J. Dunnmon, F. Sala, et al. AAAI 2019.
Slice-based Learning: A Programming Model for Residual Learning in Critical Data Slices
V. Chen, S. Wu, Z. Weng, A. Ratner, C. Ré. NeurIPS 2019.
Learning Dependency Structures for Weak Supervision Models
P. Varma, F. Sala, A. He, A. Ratner, C. Ré. ICML 2019.
Interactive Programmatic Labeling for Weak Supervision
B. Cohen-Wang, S. Mussman, A. Ratner, C. Ré. KDD DCCL 2019.
Scene Graph Prediction with Limited Labels
V. Chen, P. Varma, R. Krishna, M. Bernstein, C. Ré, F. Li. ICCV 2019.
Snuba: Automating Weak Supervision to Label Training Data
P. Varma and C. Ré. VLDB 2019.
Improving Sample Complexity with Observational Supervision
K. Saab, J. Dunnmon, A. Ratner, D. Rubin, C. Ré. ICLR LLD 2019.
A Kernel Theory of Modern Data Augmentation
T. Dao, A. Gu, A. Ratner, V. Smith, C. De Sa, C. Ré. ICML 2019.
A Clinical Text Classification Paradigm using Weak Supervision and Deep Representation
Y. Wang, S. Sohn, S. Liu, F. Shen, L. Wang, et al. BMC MIDM 2019.
Multi-Resolution Weak Supervision for Sequential Data
P. Varma, F. Sala, J. Fries, D. Fu, S. Sagawa, et al. NeurIPS 2019.
Utilizing Weak Supervision to Infer Complex Objects and Situations in Autonomous Driving Data
Z. Weng, P. Varma, A. Masalov, J. Ota, C. Ré. IEEE IVS 2019.
The Role of Massively Multi-Task and Weak Supervision in Software 2.0
A. Ratner, B. Hancock, C. Ré. CIDR 2019.
Train and You'll Miss It: Interactive Model Iteration with Weak Supervision and Pre-Trained Embeddings
M. Chen, D. Fu, F. Sala, S. Wu, et al. 2020.
Fast and Three-rious: Speeding up Weak Supervision with Triplet Methods
D. Fu, M. Chen, F. Sala, M. Hooper, et al. ICML 2020.

Accelerate your AI application development today