Jason Fries

Scalable Approach to Medical Wearable Post-Market Surveillance

Objective: We sought to develop a weak supervision-based approach to demonstrate feasibility of post-market surveillance of wearable devices that render AF pre-diagnosis. Materials and Methods: Two approaches were evaluated to reduce clinical note labeling overhead for creating a training set for a classifier: one using programmatic codes, and the other using prompts to large language models (LLMs). Probabilistically labeled notes were then used to fine-tune a classifier, which identified patients with AF pre-diagnosis mentions in a note. A retrospective cohort study was conducted, where the baseline characteristics and subsequent care patterns of patients identified by the classifier were compared against...

Research Paper

Sep 23, 2024 • RM. Yoo, et al.

Red Teaming Large Language Models in Medicine: Real-World Insights on Model Behavior

As a proof-of-concept, we convened an interactive “red teaming” workshop in which medical and technical professionals stress-tested popular large language models (LLMs) through publicly available user interfaces on clinically relevant scenarios. Results demonstrate a significant proportion of inappropriate responses across GPT-3.5, GPT-4.0, and GPT-4.0 with Internet (25.7%, 16.2%, and 17.5%, respectively) and illustrate the valuable role that non-technical clinicians can play in evaluating models.

Research Paper

Sep 18, 2024 • C. Chang, et al.

Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2023 Symposium

The third Machine Learning for Health (ML4H) symposium was held in person on December 10, 2023, in New Orleans, Louisiana, USA (Parziale et al., 2022). The symposium included research roundtable sessions to foster discussions between participants and senior researchers on timely and relevant topics for the ML4H community.

Research Paper

Sep 18, 2024 • H. Jeong, et al.

Merlin: A Vision Language Foundation Model for 3D Computed Tomography

Over 85 million computed tomography (CT) scans are performed annually in the US, of which approximately one quarter focus on the abdomen. Given the current shortage of both general and specialized radiologists, there is a large impetus to use artificial intelligence to alleviate the burden of interpreting these complex imaging studies while simultaneously using the images to extract novel physiological insights. Prior state-of-the-art approaches for automated medical image interpretation leverage vision language models (VLMs). However, current medical VLMs are generally limited to 2D images and short reports, and do not leverage electronic health record (EHR) data for supervision. To overcome...

Research Paper

Sep 18, 2024 • L. Blankemeier, et al.

Exploring the Potential of Large Language Models in Neurology, Using Neurologic Localization as an Example

Research Paper

Sep 18, 2024 • CC. Chiang, et al.

Characterizing the limitations of using diagnosis codes in the context of machine learning for healthcare

Importance: Diagnostic codes are commonly used as inputs for clinical prediction models, to create labels for prediction tasks, and to identify cohorts for multicenter network studies. However, the coverage rates of diagnostic codes and their variability across institutions are underexplored. Objective: Primary objective was to describe lab- and diagnosis-based labels for 7 selected outcomes at three institutions. Secondary objectives were to describe agreement, sensitivity, and specificity of diagnosis-based labels against lab-based labels. Methods: This study included three cohorts: SickKidsPeds from The Hospital for Sick Children, and StanfordPeds and StanfordAdults from Stanford Medicine. We included seven clinical outcomes with lab-based definitions:...

Research Paper

Sep 18, 2024 • LL Guo, et al.

A Systematic Review of Testing and Evaluation of Healthcare Applications of Large Language Models (LLMs)

Studies rarely use real patient care data for LLM evaluation. Administrative tasks such as generating provider billing codes and writing prescriptions are understudied. Natural Language Processing (NLP)/Natural Language Understanding (NLU) tasks like summarization, conversational dialogue, and translation are infrequently explored. Accuracy is the predominant dimension of evaluation, while fairness, bias and toxicity assessments are neglected. Evaluations in specialized fields, such as nuclear medicine and medical genetics are rare. Current LLM assessments in healthcare remain shallow and fragmented. To draw concrete insights on their performance, evaluations need to use real patient care data across a broad range of healthcare and NLP/NLU...

Research Paper

Sep 18, 2024 • S. Bedi, et al.

A Multi-Center Study on the Adaptability of a Shared Foundation Model for Electronic Health Records

Background: Foundation models hold promise for transforming artificial intelligence (AI) in healthcare by providing modular components that are easily adaptable to downstream healthcare tasks, making AI development more scalable and cost-effective. Foundation models for structured electronic health records (EHR), trained on coded medical records from millions of patients, demonstrated benefits including increased performance with fewer training labels, and improved robustness to distribution shifts. However, questions remain on the feasibility of sharing these models across different hospitals and their performance for local task adaptation. Objective: This multi-center study examined the adaptability of a recently released structured EHR foundation model (FMSM), trained...

Research Paper

Sep 18, 2024 • LL Guo, et al.

Language Models in the Loop: Incorporating Prompting into Weak Supervision

We propose a new strategy for applying large pre-trained language models to novel tasks when labeled training data is limited. Rather than apply the model in a typical zero-shot or few-shot fashion, we treat the model as the basis for labeling functions in a weak supervision framework. To create a classifier, we first prompt the model to answer multiple distinct queries about an example and define how the possible responses should be mapped to votes for labels and abstentions. We then denoise these noisy label sources using the Snorkel system and train an end classifier with the resulting training data....

Research Paper

Aug 22, 2024 • R. Smith, et al.
