The Snorkel AI team will present 18 research papers and talks at the 2023 Neural Information Processing Systems (NeurIPS) conference from December 10-16. The Snorkel papers cover a broad range of topics including fairness, semi-supervised learning, large language models (LLMs), and domain-specific models.
Snorkel AI is proud of its roots in the research community and endeavors to remain at the forefront of new scholarship in data-centric AI, programmatic labeling, and foundation models. We designed the Snorkel Flow platform to integrate the latest technologies from our research and extend these capabilities to our customers to help solve valuable business problems. As part of our science-first culture, Snorkel AI’s founders and researchers are honored to present at distinguished AI conferences such as NeurIPS.
We are excited to present the following papers and presentations during this year’s event. We have grouped each according to its topic and included links where possible.
Benchmarks, domain-specific datasets, and models
Benchmarking drives progress in AI research. Benchmarks typically standardize datasets, specific machine-learning tasks, or certain parts of the machine learning pipeline so that researchers can evaluate which approach best solves a specific problem. Famous examples include ImageNet, SQuAD, MNIST, and GLUE. Such benchmarks typically include hundreds of thousands of examples developed by researchers at well-funded academic institutions.
These papers from Snorkel researchers include benchmarks, datasets, and models for specific data domains (including healthcare, biology, and law), and reflect an increasing trend in using advancements in computer science and machine learning research to solve tangible problems in specific research fields and industries.
- EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models
Wornow et al.
This paper presents the first longitudinal electronic health records (EHR) benchmark for evaluating pre-trained foundation models (FMs). It also open-sources the weights and data processing pipeline so that other researchers can reproduce the group’s work.
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution
Nguyen et al.
HyenaDNA is a foundation model pre-trained on the human reference genome. It can examine DNA sequences at up to 500x the context length of existing genomic FMs that use dense attention.
- LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models
Guha et al.
Computer scientists and legal experts came together to assemble 162 evaluation tasks. While some of these came from existing legal datasets, legal experts hand-crafted others. The team used these tasks to evaluate 20 LLMs, from 11 different families, representing a range of size categories.
- DataComp: In search of the next generation of multimodal datasets
Gadre et al.
DataComp is a multimodal benchmark centered on a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in this benchmark designed new filtering techniques and/or developed new data sources, ran the standardized CLIP training code, and evaluated their approaches against 38 downstream test sets.
- INSPECT: A Multimodal Dataset for Pulmonary Embolism Diagnosis and Prognosis
Huang et al.
INSPECT contains de-identified longitudinal records from more than 19,000 patients at risk for pulmonary embolism (PE), along with ground truth labels for multiple outcomes. The dataset includes CT images, radiology report excerpts, and electronic health record data. Using INSPECT, the researchers developed a benchmark for evaluating modeling approaches on PE-related tasks.
Leveraging weak supervision
Weak supervision (WS) combines high-level (and often noisy) sources of supervision to quickly create large training sets. By observing when different sources of signal agree and disagree, weak supervision applies the most likely label to each record. Then, data scientists use these probabilistic labels to train discriminative end models.
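As a rough illustration of this idea, the sketch below combines three noisy labeling functions into a probabilistic label with an unweighted vote. Everything in it is an assumption made for illustration: the labeling functions, the spam/not-spam task, and the `probabilistic_label` helper are hypothetical. Production label models typically go further and estimate each source's accuracy from the agreement patterns, rather than counting every vote equally.

```python
from collections import Counter

ABSTAIN = -1  # a labeling function returns this when it has no opinion

# Hypothetical labeling functions for a toy spam (1) vs. not-spam (0) task.
def lf_contains_free(text):
    return 1 if "free" in text.lower() else ABSTAIN

def lf_contains_meeting(text):
    return 0 if "meeting" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return 1 if text.count("!") >= 2 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_contains_meeting, lf_many_exclamations]

def probabilistic_label(text, num_classes=2):
    """Turn the non-abstaining votes into a distribution over classes."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        # No source fired: fall back to a uniform distribution.
        return [1.0 / num_classes] * num_classes
    counts = Counter(votes)
    return [counts.get(c, 0) / len(votes) for c in range(num_classes)]

print(probabilistic_label("FREE prize!!"))        # [0.0, 1.0]
print(probabilistic_label("Free meeting today"))  # [0.5, 0.5]
```

When the sources agree, the label is confident. When they disagree, the label stays soft, and that uncertainty is what the end model trains on.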
The papers in this group advance weak supervision across many dimensions, from theoretically analyzing fairness in WS systems to using weakly supervised approaches to improve LM performance.
- Embroid: Unsupervised Prediction Smoothing Can Improve Few-Shot Classification
Guha et al.
In contrast to much of the literature and experimentation around prompt-tuning, this paper explores prompt-patching: the researchers correct LM predictions by computing multiple representations of the dataset and using the consistency among the LM predictions for neighboring examples to identify mispredictions.
- Mitigating Source Bias for Fairer Weak Supervision
Shin et al.
This paper explores fairness in weak supervision and presents an empirically validated model of fairness that captures labeling function bias. Additionally, the researchers share a simple counterfactual fairness correction algorithm.
- Characterizing the Impacts of Semi-supervised Learning for Weak Supervision
Li et al.
This study develops a simple design space to systematically evaluate semi-supervised learning (SSL) techniques in weak supervision and describes scenarios where SSL techniques will be the most effective.
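To make the neighbor-consistency idea from the Embroid summary above more concrete, here is a heavily simplified sketch. It is not the paper's algorithm. It assumes binary labels, takes a plain majority vote among each example's nearest neighbors in each embedding space, and then takes a majority over the original prediction and those per-space votes. The function names and toy data are hypothetical.

```python
import math

def nearest_neighbors(embeddings, index, k):
    """Indices of the k points closest to embeddings[index] (Euclidean distance)."""
    distances = [
        (math.dist(embeddings[index], other), j)
        for j, other in enumerate(embeddings)
        if j != index
    ]
    return [j for _, j in sorted(distances)[:k]]

def smooth_predictions(predictions, embedding_spaces, k=3):
    """Revise binary predictions using neighbor agreement in each embedding space."""
    smoothed = []
    for i, original in enumerate(predictions):
        votes = [original]
        for embeddings in embedding_spaces:
            neighbors = nearest_neighbors(embeddings, i, k)
            neighbor_mean = sum(predictions[j] for j in neighbors) / len(neighbors)
            votes.append(1 if neighbor_mean > 0.5 else 0)
        # Majority over the original prediction and the per-space neighbor votes.
        smoothed.append(1 if sum(votes) / len(votes) > 0.5 else 0)
    return smoothed

# Two toy representations of the same eight examples (two well-separated clusters).
space_a = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
space_b = [(0,), (1,), (2,), (3,), (20,), (21,), (22,), (23,)]
lm_predictions = [0, 0, 1, 0, 1, 1, 1, 1]  # index 2 disagrees with its neighbors

print(smooth_predictions(lm_predictions, [space_a, space_b]))
# [0, 0, 0, 0, 1, 1, 1, 1]
```

The prediction at index 2 is flipped because its neighbors in both representations agree on the other label.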
New model architectures to improve performance
Language model (LM) performance has drastically improved through increasing the size of the respective training datasets, increasing model size, and other architectural improvements to handle larger context and sequence lengths. However, these improvements have come at a significant computational cost.
The following works explore new model architectures and modifications to existing architectures that increase model performance, reduce memory footprint, or both, all while maintaining or improving accuracy.
- Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions
Massaroli et al.
The authors of this study aim to enable constant memory use for any pre-trained long-convolution architecture, increasing throughput and reducing the memory footprint. Additionally, the study makes architectural improvements to create more performant models that are easier to distill.
- Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
Fu et al.
This paper explores a new model architecture that scales sub-quadratically along both sequence length and model dimension. This is a stark contrast to most modern architectures (e.g. transformers), which achieve great performance but rely on attention layers that, when computed naively, have quadratic complexity in sequence length.
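For readers who want to see where the quadratic cost of naive attention comes from, the sketch below computes softmax attention in plain Python. It illustrates only the baseline that Monarch Mixer is contrasted with, not the Monarch Mixer architecture itself. The `naive_attention` function is a hypothetical helper written for this post.

```python
import math

def naive_attention(queries, keys, values):
    """Softmax attention computed the naive way, via a full n x n score matrix."""
    d = len(queries[0])
    # One scaled dot product for every (query, key) pair: n * n entries in total.
    scores = [
        [sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d) for k in keys]
        for q in queries
    ]
    outputs = []
    for row in scores:
        row_max = max(row)  # subtract the row max for numerical stability
        weights = [math.exp(s - row_max) for s in row]
        total = sum(weights)
        # Each output row is a weighted average of the value vectors.
        outputs.append([
            sum(w / total * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs, scores

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
_, scores = naive_attention(tokens, tokens, tokens)
print(sum(len(row) for row in scores))  # 9 score entries for 3 tokens

doubled = tokens * 2
_, scores = naive_attention(doubled, doubled, doubled)
print(sum(len(row) for row in scores))  # 36 score entries for 6 tokens
```

Doubling the sequence length quadruples the number of score entries, which is the scaling behavior that sub-quadratic architectures aim to avoid.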
More effective learning strategies
Despite the continued development of pre-trained models that demonstrate outstanding performance across a broad range of tasks, a quality gap persists between pre-trained models and counterparts trained in a task-specific manner (e.g. fine-tuning, classic supervised learning). However, the financial and time cost of task-specific training can often negate the value of pre-trained models.
The following papers investigate different approaches for increasing model performance on a specific task, without additional training or supervision.
- On the Trade-off of Intra-/Inter-class Diversity for Supervised Pre-training
Zhang et al.
Which is more impactful for downstream model performance: the number of classes, or the number of examples per class? This work explores the tradeoffs between intra-class diversity (number of samples per class) and inter-class diversity (number of classes) in a supervised pre-training dataset.
- A case for reframing automated medical image classification as segmentation
Hooper et al.
This paper explores the potential to reframe medical image classification as a segmentation problem, using both theoretical and empirical analysis to evaluate the implications of using a different modeling approach for the same overarching task.
- Geometry-Aware Adaptation for Pretrained Models
Roberts et al.
This work investigates whether or not it’s possible to adapt pre-trained models to predict new classes without fine-tuning or retraining. The study explores a simple adaptor for pre-trained models, whereby a fixed linear transformation is applied over the probabilistic output of a traditional supervised model to create a richer set of possible class predictions.
- Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models
Chen et al.
This research explores the relationship between a machine learning task, the manner in which a model is trained, and the data itself. It defines an operational model for “skills,” behaviors that a model can learn from an associated slice of data. The authors then propose a framework and methods for selecting data so that the LM learns skills more quickly and effectively.
- Tart: A plug-and-play Transformer module for task-agnostic reasoning
Bhatia et al.
The authors of this study develop a task-agnostic approach for improving LLM reasoning abilities without task-specific training (e.g. fine-tuning). In short, the authors train a task-agnostic reasoning module to perform probabilistic inference on synthetically generated data (logistic regression problems), then compose the module with the base LLM by aggregating the LLM’s output embeddings and using them as input along with the class labels. Together, these components boost quality by improving reasoning, while aggregating the embeddings into a single vector keeps the approach scalable.
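The adaptor idea from the Geometry-Aware Adaptation summary above can be sketched in a few lines. The sketch assumes the fixed linear transformation is a matrix of squared distances between classes, so that the adapted prediction is the class with the smallest expected squared distance under the model's output distribution. That reading, the ordinal class layout, and the function names are all assumptions made for illustration rather than the paper's exact construction.

```python
def squared_distance_matrix(all_classes, observed_classes, distance):
    """Fixed transformation: one row per candidate class, one column per observed class."""
    return [[distance(y, k) ** 2 for k in observed_classes] for y in all_classes]

def adapt_prediction(probabilities, transform, all_classes):
    """Apply the linear map to the model's probabilities and pick the lowest-cost class."""
    costs = [sum(t * p for t, p in zip(row, probabilities)) for row in transform]
    best = min(range(len(costs)), key=costs.__getitem__)
    return all_classes[best]

# Hypothetical setup: five ordinal classes, but the model was trained on only three.
all_classes = [0, 1, 2, 3, 4]
observed_classes = [0, 2, 4]
transform = squared_distance_matrix(
    all_classes, observed_classes, distance=lambda a, b: abs(a - b)
)

# The model splits its belief between classes 0 and 2.
print(adapt_prediction([0.5, 0.5, 0.0], transform, all_classes))  # 1
```

Here the adapted prediction is class 1, which the underlying model never saw during training. The model itself is untouched; only the decision rule applied to its output changes.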
Semi-supervised learning
Semi-supervised learning focuses on using a small amount of labeled data to bootstrap a much larger unlabeled dataset. Semi-supervised learning is unique in its propagation of knowledge from the labeled dataset to the broader unlabeled dataset.
The following three papers use pseudo-labeling and auto-labeling to leverage small amounts of labeled data to train a model on a much larger collection of unlabeled examples, and examine how their approaches perform against supervised counterparts.
- Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning
Menghini et al.
This paper investigates whether the pseudo-labels generated by CLIP can be used to bolster CLIP’s own performance. The authors conduct an extensive exploration of learning scenarios that involve modulating learning paradigms, prompt modalities, and training strategies, and showcase the effectiveness of iterative prompt training regardless of learning paradigm or prompt modality.
- Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias
Yu et al.
This paper investigates the use of LLMs as training data generators, focusing specifically on diversely attributed prompts (e.g. specifying attributes like length and style) and datasets with high cardinality and diverse domains.
- Good Data from Bad Models: Foundations of Threshold-based Auto-labeling
Vishwakarma et al.
While auto-labeling systems are a promising way to reduce reliance on manual labeling for training dataset creation, they require manually created validation data to guarantee quality. This paper studies threshold-based auto-labeling systems and establishes bounds on the quality and quantity of auto-labeled data as a function of the validation and training sample complexity the auto-labeling system requires.
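A threshold-based auto-labeling loop of the kind summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure analyzed in the paper: it picks the lowest confidence threshold whose empirical error on the validation set stays within a target, and it ignores the finite-sample uncertainty that the paper's bounds address. The function names and data are hypothetical.

```python
def choose_threshold(validation, max_error):
    """validation: list of (confidence, is_correct) pairs from a manually labeled set.

    Returns the lowest confidence threshold whose empirical error on the
    validation points at or above it is within max_error, or None if none qualifies.
    """
    for threshold in sorted({conf for conf, _ in validation}):  # lowest first
        selected = [ok for conf, ok in validation if conf >= threshold]
        error = 1 - sum(selected) / len(selected)
        if error <= max_error:
            return threshold
    return None

def auto_label(unlabeled, threshold):
    """unlabeled: list of (item, predicted_label, confidence) triples.

    Items at or above the threshold keep the model's label; the rest go to humans.
    """
    auto, manual = [], []
    for item, label, conf in unlabeled:
        if threshold is not None and conf >= threshold:
            auto.append((item, label))
        else:
            manual.append(item)
    return auto, manual

validation = [(0.95, True), (0.9, True), (0.8, True),
              (0.7, False), (0.6, True), (0.55, False)]
threshold = choose_threshold(validation, max_error=0.1)
print(threshold)  # 0.8

print(auto_label([("a", 1, 0.85), ("b", 0, 0.65)], threshold))
# ([('a', 1)], ['b'])
```

A lower threshold auto-labels more data but risks more errors. The trade-off between those two quantities, given a limited validation set, is what the paper characterizes.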