Snorkel AI wrapped the second day of The Future of Data-Centric AI, our virtual conference, by showcasing how Snorkel’s data-centric platform has enabled customers to succeed, taking a deep look at Snorkel Flow’s capabilities, and announcing two new solutions.
Snorkel Co-Founder and CEO Alex Ratner kicked off the day’s events by giving attendees a peek into Snorkel’s new Foundation Model Data Platform, which includes solutions to develop and adapt large language models and foundation models.
Across the 19 sessions that followed, speakers outlined the increasing importance of data-centric AI in domains ranging from taxes to network security to heavy equipment. They also gave a glimpse into the near future, where large language models will become a pervasive part of our lives and even live locally on our devices.
Below is a summary of the day’s highlights, along with a full rundown of all other sessions.
Opening Keynote: New introductions from Snorkel AI
The future of enterprise large language models, Alex said, is GPT-You, not GPT-X. While generalized LLMs will yield useful results for general questions, enterprises that want to get business value out of LLMs will have to customize their own. That means curating an optimized set of prompts and responses for instruction tuning as well as cultivating the right mix of pre-training data for self-supervision.
To that end, Snorkel will soon debut Snorkel GenFlow and Snorkel Foundry. Snorkel Foundry will allow customers to programmatically curate unstructured data to pre-train an LLM for a specific domain. Clients can then follow up with Snorkel GenFlow, which will help prune, balance, and optimize their instruction tuning training set to sharpen generative models’ performance on specific tasks.
In the past, Alex said, data curation operations have often been ad-hoc, manual, and under-appreciated. These new solutions continue Snorkel’s mission to make preparation tasks first-class and programmatic.
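The kind of programmatic curation described above can be illustrated with a minimal sketch in plain Python. The filters below are hypothetical quality heuristics for selecting domain-relevant pre-training documents; they are illustrative only and do not reflect Snorkel Foundry’s actual API:

```python
# Hypothetical sketch of programmatic pre-training data curation.
# The filters and thresholds below are illustrative heuristics,
# not Snorkel Foundry's actual API.

def long_enough(doc: str) -> bool:
    """Drop fragments too short to carry domain signal."""
    return len(doc.split()) >= 50

def mostly_text(doc: str) -> bool:
    """Drop documents dominated by markup or binary noise."""
    alnum = sum(ch.isalnum() or ch.isspace() for ch in doc)
    return alnum / max(len(doc), 1) > 0.8

def on_domain(doc: str, keywords=("claim", "policy", "premium")) -> bool:
    """Keep documents that mention at least one domain keyword
    (an insurance-flavored example keyword list)."""
    lowered = doc.lower()
    return any(kw in lowered for kw in keywords)

FILTERS = [long_enough, mostly_text, on_domain]

def curate(corpus: list[str]) -> list[str]:
    """Apply every filter; a document survives only if all pass."""
    return [doc for doc in corpus if all(f(doc) for f in FILTERS)]
```

Making each filter an ordinary, named function is what makes the pipeline “first-class”: filters can be versioned, reviewed, and re-run as the corpus grows, rather than living in ad-hoc notebooks.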
Fireside Chat: Journey of Data: Transforming the Enterprise with Data-Centric Workflows
In a lively back and forth, Alex talked with Nurtekin Savas, head of enterprise data science at Capital One, about broadening the scope of being “data-centric.”
Data-centric approaches typically focus on preparing data when it’s time to train a model. Savas said he widens that point of view to cover every step of the data lifecycle, from the point of data creation to data deletion. In his role at Capital One, Savas said, he has focused on developing tools and solutions for both data producers and data consumers. That focus continues through data registration and data storage, even before the data reaches the point of being used in a model.
“You need to find a place to park your data. It needs to be optimized for the type of data and the format of the data you have,” he said.
By optimizing every part of the data pipeline, he said, “You will, as a result, get your models to market faster.”
Fireside chat: Building RedPajama
Ce Zhang, CTO of Together, talked with Braden Hancock, Snorkel AI co-founder and head of technology, about the incredible progress made in open-source large language models in the past few months.
There was a time not long ago, Braden said, when the community worried that the large companies that published the first wave of large language models had “totally insurmountable leads.”
“What happened in the last couple of months, is people started to see the hope,” Zhang said.
Some of that hope came from the RedPajama model itself. Assembled by Together and a group of collaborators, RedPajama used a fully open-source data set. This approach, Zhang said, yields several advantages. Unlike many academic large language models that have been trained on ChatGPT responses, the RedPajama models can be used and adapted for commercial purposes. It also allows the open source community to offer ways to improve the data set—and for people who don’t want their data included to find their data, request that it be removed from the training set, and later verify that it has been.
The pair teased that the RedPajama collaboration has been “working really hard” on an upcoming release that should reach the public soon.
Full session recap
The Opportunity of Data-Centric AI in Insurance
Alejandro Zarate Santovena, lecturer at Columbia University and Managing Director at Marsh, asserted that AI and foundation models have a lot of potential to disrupt the insurance industry. These models will help automate manual processes and improve insurance companies’ abilities to find the right buyers for the right products.
Accelerate ML Adoption by Addressing Hidden Needs
Max Williams, AI platform product manager at Wells Fargo, discussed the challenges of achieving a return on investment in machine learning as well as the hidden needs an organization must address for ML to gain widespread adoption and deliver attractive returns.
Transforming the Customer Experience with AI: Wayfair’s Data-Centric Way
Wayfair’s Archana Sapkota (ML Manager) and Vinny DeGenova (Associate Director of Machine Learning) shared insights on transforming the customer experience with AI, emphasizing the use of ML in understanding customers and catalog products. They highlighted the collaboration between subject matter experts and data scientists as a key factor in rapidly developing and testing models.
Unleashing Human Potential with AI Augmentation
Bryan Wood, a Data Science Executive at Bank of America, discussed how AI augmentation enhances human potential and creativity. The presentation showcased real-world applications and the potential of AI to enhance human capabilities.
Tackling advanced classification using Snorkel Flow
Snorkel AI’s Angela Fox (Staff Product Designer) and Vincent Chen (Director of Product) discussed the challenges and approaches for building production-ready classification models in the age of foundation models. They also demonstrated how data-centric workflows in Snorkel Flow can unblock previously intractable problems and enable high-quality production models.
Combining domain knowledge with data to track and predict heavy-equipment service events
Davide Gerbaudo, a Sr. Data Scientist at Caterpillar, showcased how the century-old company combines domain knowledge and data to track and predict heavy-equipment service events, emphasizing the value of leveraging industry-specific expertise and understanding the points of view of different business units.
Accelerating information extraction with data-centric iteration
Snorkel AI’s Vincent Chen (Director of Product) and John Semerdjian (Tech Lead Manager, Applied Machine Learning) discussed practical workflows for building enterprise information extraction applications using Snorkel Flow—including annotation, error analysis, and model-guided iteration.
Data Driven AI for Threat Detection
Debabrata Dash, distinguished data scientist at Arista, explored the challenges of traditional machine learning in network security and proposed the use of weak supervision to enhance threat detection models. By leveraging heuristics and applying them to raw data, Dash presented promising results for more efficient and predictably accurate cybersecurity models.
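The weak supervision pattern Dash described, encoding analyst heuristics as labeling functions and combining their votes, can be sketched in plain Python. The heuristics below are hypothetical, and the combiner is a simple majority vote rather than the learned weighted model (Snorkel’s LabelModel) a production system would use:

```python
from collections import Counter

# Label values: -1 means a heuristic has no opinion on this record.
ABSTAIN, BENIGN, MALICIOUS = -1, 0, 1

# Hypothetical analyst heuristics over network-flow records.
def lf_rare_port(flow):
    """Flag traffic to ports outside common web/DNS services."""
    return MALICIOUS if flow["dst_port"] not in (80, 443, 53) else ABSTAIN

def lf_tiny_payload(flow):
    """Beaconing malware often sends many tiny, regular payloads."""
    return MALICIOUS if flow["bytes"] < 100 and flow["count"] > 1000 else ABSTAIN

def lf_known_cdn(flow):
    """Treat a known CDN prefix as benign (illustrative allowlist)."""
    return BENIGN if flow["dst_ip"].startswith("151.101.") else ABSTAIN

LFS = [lf_rare_port, lf_tiny_payload, lf_known_cdn]

def weak_label(flow):
    """Majority vote over non-abstaining heuristics; a learned
    label model would weight heuristics by estimated accuracy."""
    votes = [lf(flow) for lf in LFS if lf(flow) != ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN
```

The appeal in security settings is that analysts already think in heuristics like these; weak supervision turns that tacit knowledge into training labels without hand-annotating millions of flows.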
Comcast SpeechNet: Weakly Supervised, End-to-End Speech Recognition at Industrial Scale
Raphael Tang, lead research scientist at Comcast Applied AI, presented SpeechNet—an end-to-end automatic speech recognition system built under label-scarce, compute-limited conditions. The system, developed partially with Snorkel labeling functions, currently serves 12 million queries daily on voice-enabled smart televisions.
Applying weak supervision and foundation models for computer vision
Snorkel AI Machine Learning Research Scientist Ravi Teja Mullapudi discussed the latest advancements in computer vision, focusing on the use of weak supervision and foundation models. Among other topics, he highlighted how visual prompts and parameter-efficient models enable rapid iteration for improved data quality and model performance.
AI and the Future of Tax
Ken Pryadarshi, global prompt engineering leader at EY, noted that the role of tax within Fortune 500 companies has changed; those in charge of taxes at large companies, he said, are becoming “data custodians.” He also described a near future where large companies will augment the performance of their finance and tax professionals with large language models, co-pilots, and AI agents.
Leveraging Data-centric AI for Document Intelligence and PDF Extraction
Snorkel AI ML Engineer Ashwini Ramamoorthy highlighted the challenges of extracting entities from semi-structured documents. She explained how Snorkel’s data-centric approach simplifies and streamlines the process, and discussed the utilization of foundation models to accelerate the development of extraction models.
Leveraging foundation models and LLMs for enterprise-grade NLP
Kristina Liapchin, Lead Product Manager at Snorkel AI, discussed how Snorkel Flow can assist enterprises in overcoming the challenge of deploying large language models for natural language processing. She highlighted how the platform enables businesses to adapt LLMs to customer-specific data and incorporate domain knowledge.
Bias Busters: Strategies for Monitoring, Managing, and Mitigating AI Bias
Nataraj Prakash, vice president of digital analytics at Kaiser Permanente, discussed the pervasive issue of AI bias as well as practical strategies to counteract it. He emphasized the importance of post-detection action steps like model retraining and challenger model selection to enhance transparency and meet evolving regulatory requirements.
Lessons From a Year with Snorkel: Data-Centric Workflows with SMEs at Georgetown
James Dunham, an NLP engineer at the Georgetown University Center for Security and Emerging Technology, discussed the center’s experience using Snorkel to address bottlenecks and enhance collaboration between data scientists and subject-matter experts.
The future of AI is hybrid
Jilei Hou, VP of Engineering & Head of AI Research at Qualcomm, argued that future foundation model applications can (and should) run in a hybrid fashion. Tasks within the grasp of smaller foundation models should be handled by local deployments on our devices, and only fall back to enormous cloud-based models when the task is too large.
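Hou’s hybrid pattern amounts to a simple routing policy: answer on-device when the small model is confident, and escalate to the cloud only when it is not. A minimal sketch follows; both model functions are hypothetical stubs standing in for real inference calls, not Qualcomm’s actual stack:

```python
# Minimal sketch of hybrid on-device/cloud routing. The two model
# functions are hypothetical stubs, not real inference backends.

def local_model(prompt: str) -> tuple[str, float]:
    """Stub: small on-device model returns (answer, confidence).
    Here it pretends short prompts are easy and long ones are hard."""
    if len(prompt.split()) <= 10:
        return f"local answer to: {prompt}", 0.95
    return "", 0.2  # low confidence on complex prompts

def cloud_model(prompt: str) -> str:
    """Stub: large cloud-hosted model, assumed always capable."""
    return f"cloud answer to: {prompt}"

def route(prompt: str, threshold: float = 0.8) -> tuple[str, str]:
    """Serve on-device when confidence clears the threshold;
    fall back to the cloud model only when it does not."""
    answer, confidence = local_model(prompt)
    if confidence >= threshold:
        return "local", answer
    return "cloud", cloud_model(prompt)
```

The threshold is the policy knob: raising it trades more cloud traffic (and latency and cost) for fewer low-quality on-device answers.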
Catch the sessions you missed!
The Future of Data-Centric AI 2023, our two-day free virtual conference, brought together thousands of data scientists, AI/ML practitioners, researchers, and the AI community at large to hear about and discuss the latest trends and research in data-centric AI. If you registered for the event but didn't see all the sessions you wanted, you can now catch up. The recorded sessions are available for registrants at the same Zoom portal as the live sessions.