Understanding Programmatic Weak Supervision via Source-aware Influence Function
This paper proposes a source-aware variant of the influence function that measures the influence of individual components in the Programmatic Weak Supervision pipeline and can be used for multiple purposes, such as understanding incorrect predictions, identifying mislabeling of sources, and improving the end model’s generalization performance.
BIGBIO: A Framework for Data-Centric Biomedical Natural Language Processing
BigBIO is a community library of biomedical NLP datasets that facilitates meta-dataset curation and enables zero-shot evaluation of biomedical prompts and multi-task learning.
Generative Modeling Helps Weak Supervision (and Vice Versa)
This work proposes and theoretically justifies a model that fuses weak supervision and generative adversarial networks to improve both the estimation of unobserved labels and data augmentation, outperforming baseline weak supervision models on multiclass image classification datasets.
Learning to Compose Soft Prompts for Compositional Zero-Shot Learning
Compositional soft prompting is a parameter-efficient technique that improves the zero-shot compositionality of large-scale pretrained vision-language models (VLMs) through learnable vocabulary tokens, and it outperforms existing methods on benchmark datasets.
Shoring Up the Foundations: Fusing Model Embeddings and Weak Supervision
Liger, a combination of foundation models and weak supervision frameworks, improves existing weak supervision techniques by partitioning the embedding space and extending source votes within it, resulting in improved performance on six benchmark NLP and video tasks.
Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming
This paper presents Nemo, an interactive system that improves the overall productivity of weak supervision (WS) learning pipelines by an average of 20% compared to the prevailing WS approach.
A Survey on Programmatic Weak Supervision
This paper presents a comprehensive survey of recent advances in Programmatic Weak Supervision (PWS) and discusses related approaches for tackling scenarios with limited labeled data.
Dataset Debt in Biomedical Language Modeling
This paper finds that only 13% of biomedical datasets are available via programmatic access and 30% lack documentation on licensing and permitted reuse, highlighting the dataset debt in biomedical NLP.