Introducing new capabilities for Data-centric Foundation Model Development in Snorkel Flow Powerful new large language or foundation models (FMs) like GPT-3, Stable Diffusion, BERT, and more have taken the AI space by storm, going viral—even beyond technical practitioners—thanks to incredible capabilities around text generation, image synthesis, and more. However, enterprises face fundamental barriers to using these foundation models on real,…
We created Data-centric Foundation Model Development to bridge the gaps between foundation models and enterprise AI. New Snorkel Flow capabilities (Foundation Model Fine-tuning, Warm Start, and Prompt Builder) give data science and machine learning teams the tools they need to effectively put foundation models (FMs) to use for performance-critical enterprise use cases. The need is clear: despite undeniable excitement about…
Databricks’ Chief Technologist: Data-Centric AI can learn from Data Engineering and ML Engineering in five ways: continuous updates, versioning, code-centric deployment, data privatization and actionable monitoring.
Create a data-centric AI application using Snorkel Flow to spare your analysts the time they spend manually labeling and extracting information about environmental, social, and governance (ESG) factors from earnings call transcripts. Rapidly and accurately extract all existing and new factors from the transcripts to make the right investment decision.
AI is generally accepted as necessary for organizations across private and public sectors to build (or maintain) a competitive advantage. However, a major challenge to adopting AI successfully is our ability to build reliable, predictable, and equitable solutions. A critical flaw with traditional approaches to developing AI is the reliance on hand-labeled training datasets and/or “pre-trained” black-box models that are effectively ungovernable and unauditable. In this article, we explore the motivations and challenges for Trustworthy AI that we’ve encountered and discuss how core tenets of Data-Centric AI, including programmatic labeling, help ameliorate them.
To meet the requirements of unexpected regulatory changes brought on by the pandemic, a top-10 US bank needed to urgently adapt its underperforming model-centric artificial intelligence and machine learning development approach to a data-centric one. The team used Snorkel Flow to automatically classify thousands of loan documents and extract critical clauses in just 24 hours, saving loan managers thousands of hours of manual document review.
Schlumberger is the world’s leading provider of technology and services for the energy industry, operating in over 120 countries. The company provides well maintenance and analytics services to the world’s biggest oil companies, and it believes that large-scale data analysis and artificial intelligence/machine learning will help them remain a leader in the market. One way they’ve been able to achieve this is by building their own AI application using Snorkel Flow to automatically extract geological entities and critical field data across a variety of document structures and report types they receive from their customers.
This blog post introduces variants of Precision, Recall, and F1 metrics called Precision Gain, Recall Gain, and F1 Gain. The gain variants have desirable properties such as meaningful linear interpolation of PR curves and a universal baseline across tasks. This post explains what these benefits mean for you, shows how the gain metrics are calculated, and outlines some examples for intuitive comparison.
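As a rough illustration of how the gain metrics can be calculated, here is a minimal sketch that assumes the definitions from Flach and Kull's work on Precision-Recall-Gain curves, where a metric value m is rescaled as (m - pi) / ((1 - pi) * m) and pi is the proportion of positive examples. The teaser above does not spell out the formulas, so treat this as one reading rather than the post's own code. The function names `gain` and `pr_gain_metrics` are hypothetical.

```python
def gain(value: float, pi: float) -> float:
    """Rescale a precision, recall, or F1 value onto the gain scale.

    Assumed definition: (value - pi) / ((1 - pi) * value), where pi is the
    proportion of positives. A value equal to pi maps to 0 and 1.0 maps to 1.
    """
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must be strictly between 0 and 1")
    if value <= 0.0:
        raise ValueError("value must be positive")
    return (value - pi) / ((1.0 - pi) * value)


def pr_gain_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, F1, and their gain variants from counts."""
    if tp == 0:
        raise ValueError("gain metrics are undefined when there are no true positives")
    pi = (tp + fn) / (tp + fp + fn + tn)  # proportion of positive examples
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {
        "pi": pi,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "precision_gain": gain(precision, pi),
        "recall_gain": gain(recall, pi),
        "f1_gain": gain(f1, pi),
    }


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts, chosen only for illustration.
    for name, value in pr_gain_metrics(tp=40, fp=10, fn=10, tn=40).items():
        print(f"{name}: {value:.3f}")
```

Under these definitions, a classifier whose precision equals the positive rate scores a gain of 0 regardless of the task, which is what makes the baseline universal. F1 Gain also works out to the arithmetic mean of Precision Gain and Recall Gain, because the rescaling is linear in 1/m.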
On the heels of the second annual Future of Data-Centric AI event, we’re energized by what we learned from data scientists, machine learning engineers, and AI leaders who are adopting data-centric approaches to accelerate AI success. The Snorkel Flow platform provides these teams with a seamless workflow across training data creation, model training, and analysis—the scaffolding to make data-centric AI…
Continuous Model Feedback, available in beta as part of the new Studio experience, is Snorkel Flow’s latest capability for making training data creation and model development more integrated, automated, and guided.
Snorkel AI just hosted the second day of The Future of Data-Centric AI conference 2022. Across 40+ sessions, 50+ data scientists, ML engineers, and AI leaders came together to share insights, best practices, and research on adopting data-centric approaches with thousands of attendees from all around the world. Aarti Bagul, a Snorkel AI ML Solutions Engineer and one of the…
Snorkel AI just hosted the first day of The Future of Data-Centric AI conference 2022. This conference brings together data scientists, ML engineers, and AI leaders to share insights, best practices, and research on how to evolve the ML lifecycle from model-centric to data-centric approaches. This conference takes place over two days with 40+ sessions, 50+ speakers, and thousands of…
Building NLP techniques to understand 10-Ks is time-consuming, costly, and challenging. In this post, Machine Learning Engineer Aarti Bagul discusses three information extraction case studies showing how banks around the world are building highly accurate NLP applications using Snorkel Flow’s AI platform. From retail banking to hedge fund investing, NLP is used across the financial industry. By processing and extracting…
Programmatic labeling moves a classic technique from interesting to high-impact So much of real-world AI development entails working with text data that’s messy — in fact, 80%+ of enterprise data is unstructured. And while state-of-the-art models get a lot of the glory, creating the training data that conveys what your model needs to learn is more often the biggest determiner of AI…
AI systems are well-suited to tasks involving recognizing and predicting data patterns. Supervised classification systems categorize unseen data into a finite set of discrete classes by learning from millions of hand-labeled sample points. These classifiers are powerful business tools – they automate document sorting, customer sentiment analysis, sales performance, and other distinct business problems. However, they also require an…
What is data annotation? Data annotation refers to the process of categorizing and labeling data for training datasets. In order for a training dataset to be usable, it must be categorized appropriately and annotated for a specific use case. With Snorkel Flow, organizations can annotate high-quality labeled training data via Labeling Functions and rapidly develop and adapt AI applications by…
Labeling functions are fundamental building blocks of programmatic labeling that encode diverse sources of weak labeling signals to produce high-quality labeled data at scale. Let’s start with the core motivation for labeling functions: over time, every major commercial organization and government agency builds various valuable, often bespoke knowledge resources. These resources include employee expertise, wikis and ontologies, business logic, and…
Research recap: Ontology-driven weak supervision for clinical entity classification in electronic health records (EHRs) In this post, I have summarized the research published in this academic paper, Ontology-driven weak supervision for clinical entity classification in electronic health records by Jason Fries et al. This paper was published in Nature Communications in 2021. Problem statement Electronic health records (EHR) contain a rich…
Highlighting the best practices for building and deploying AI models for financial document processing applications AI has massive potential in the financial industry. Building AI models to automate information extraction, fraud detection, and compliance monitoring can provide efficient and faster responses and support repurposing domain experts’ labor to more meaningful tasks. Developing AI models is not just about having models…
The following post is based on a talk discussing the benefits of programmatic labeling for trustworthy AI, which was presented as part of the Trustworthy AI: A Practical Roadmap for Government event that took place this past April, with Snorkel AI Co-founder and Head of Technology, Braden Hancock. If you would like to watch Braden’s presentation, we have included it…
Learning about the challenges and opportunities behind deep neural networks In this talk, Assistant Professor in Computer Science Sharon Li shares some exciting work about uncovering the unknowns of deep neural networks. She also shares some exciting challenges and opportunities in this domain. If you would like to watch Sharon’s presentation, we have included it below, or you can find…
If you were ever amazed at how Google accurately finds the answer to your question just by a few keywords, you’ve witnessed the power of named entity recognition (NER). By quickly and accurately identifying different entities in a sea of unstructured articles, like names of people, places, and organizations, the search engine can figure out each article’s main topics and…
The future of data-centric AI talk series In this talk, Assistant Professor of Biomedical Data Science at Stanford University, James Zou, discusses the work he and his team have been doing from a data-centric perspective to trustworthy and interpretable AI. If you would like to watch James’ presentation, we have included it below, or you can find the entire event…
Gregory Ihrie is the Chief Technology Officer for the FBI, responsible for technology, innovation, and strategy. He also leads the FBI’s efforts in advancing the bureau’s management, policy, and governance of AI systems. Ihrie chairs the FBI’s Scientific Working Group on Artificial Intelligence, as well as the Department of Justice’s AI Committee of Interest. He is one of three officers…
The future of data-centric AI talk series Don’t miss the opportunity to gain an in-depth understanding of data-centric AI and learn best practices from real-world implementations. Connect with fellow data scientists, machine learning engineers, and AI leaders from academia and industry with over 30 virtual sessions. Save your seat at The Future of Data-Centric AI. Happening on August 3-4, 2022….
30+ sessions by 40+ speakers in 2 action-packed days Last year we organized The Future of Data-Centric AI conference to explore the shift from model-centric to data-centric AI. Speakers included researchers and industry experts such as Andrew Ng (Landing AI), Anima Anandkumar (NVIDIA), Chris Re (Stanford AI Lab), Michael D’Andrea (Genentech), Skip McCormick (BNY Mellon), Imen Grida Ben Yahia (Orange)…
Constructing labeling functions (LFs) is at the heart of using weak supervision. We often think of these labeling functions as programmatic expressions of domain expertise or heuristics. Indeed, much of the advantage of weak supervision is that we can save time—writing labeling functions and applying them to data at scale is much more efficient compared to hand-labeling huge numbers of…
Powerful resources to leverage as labeling functions In this post, we’ll use the COVID-FACT dataset to demonstrate how to use existing resources as labeling functions (LFs), to build a fact-checking system. The COVID-FACT dataset contains 4086 claims about the COVID-19 pandemic; it contains claims, evidence for the claims, and contradictory claims refuted by the evidence. The evidence retrieval is formulated…
Browse through these FAQ to find answers to commonly raised questions about Snorkel AI, Snorkel Flow, and data-centric AI development. Have more questions? Contact us. Programmatic labeling Use cases 1. What is a labeling function? A Labeling Function (LF) is an arbitrary function that takes in a data point and outputs a proposed label or abstains. The logic used to…
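To make the FAQ's definition of a labeling function concrete, here is a minimal sketch in plain Python: each function takes one data point and returns either a proposed label or an abstain value. Everything here is an assumption for illustration, including the spam/ham task, the `ABSTAIN = -1` convention, the function names, and the `apply_lfs` helper. It is not the Snorkel Flow API.

```python
from typing import Callable, List

# Hypothetical label values for an illustrative spam-detection task.
ABSTAIN = -1
HAM = 0
SPAM = 1

LabelingFunction = Callable[[str], int]


def lf_contains_link(text: str) -> int:
    """Propose SPAM when the message contains a URL, otherwise abstain."""
    lowered = text.lower()
    if "http://" in lowered or "https://" in lowered:
        return SPAM
    return ABSTAIN


def lf_short_message(text: str) -> int:
    """Propose HAM for very short messages (under 5 words), otherwise abstain."""
    if len(text.split()) < 5:
        return HAM
    return ABSTAIN


def apply_lfs(lfs: List[LabelingFunction], data_points: List[str]) -> List[List[int]]:
    """Build a label matrix with one row per data point and one column per LF."""
    return [[lf(x) for lf in lfs] for x in data_points]


if __name__ == "__main__":
    messages = [
        "check https://example.com now please everyone today",
        "thanks a lot",
    ]
    for row in apply_lfs([lf_contains_link, lf_short_message], messages):
        print(row)
```

Each labeling function votes only where its heuristic applies and abstains elsewhere, so the resulting matrix is sparse and the functions can disagree. Combining those noisy votes into a single training label is a separate step that this sketch does not attempt.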
This post showcases a panel discussion on the academic and industry perspectives of ethical AI, which was moderated by Director of Federal Strategy and Growth, Alexis Zumwalt, Fouts Family Early Career Professor and Lead of Ethical AI (NSF AI Institute AI4OPT), Georgia Institute of Technology, Swati Gupta, Chief Data Officer, Department of the Navy, Thomas Sasala, Senior Manager of Responsible…