Jason Fries

Stanford University
Assistant Professor of Biomedical Data Science and of Medicine

I’m an Assistant Professor of Biomedical Data Science and of Medicine at Stanford University. My research focuses on training and evaluating foundation models for healthcare and is positioned at the intersection of computer science, medical informatics, and hospital systems. Much of my work explores using electronic health record (EHR) data to contextualize human health, leveraging longitudinal patient information to inform model development and evaluation. My work has appeared in NeurIPS, ICLR, AAAI, Nature Communications, and npj Digital Medicine.

The latest from Jason

Scalable Approach to Medical Wearable Post-Market Surveillance
Objective: We sought to develop a weak supervision-based approach to demonstrate feasibility of post-market surveillance of wearable devices that render atrial fibrillation (AF) pre-diagnosis. Materials and Methods: Two approaches were evaluated to reduce clinical note labeling overhead for creating a training set for a classifier: one using programmatic codes, and the other using prompts to large language models (LLMs). Probabilistically labeled notes were then used to fine-tune a classifier, which identified patients with AF pre-diagnosis mentions in a note. A retrospective cohort study was conducted, where the baseline characteristics and subsequent care patterns of patients identified by the classifier were compared against...
Research Paper

Sep 23, 2024
RM. Yoo, et al.
Red Teaming Large Language Models in Medicine: Real-World Insights on Model Behavior
As a proof-of-concept, we convened an interactive “red teaming” workshop in which medical and technical professionals stress-tested popular large language models (LLMs) through publicly available user interfaces on clinically relevant scenarios. Results demonstrate a significant proportion of inappropriate responses across GPT-3.5, GPT-4.0, and GPT-4.0 with Internet (25.7%, 16.2%, and 17.5%, respectively) and illustrate the valuable role that non-technical clinicians can play in evaluating models.
Research Paper

Sep 18, 2024
C. Chang, et al.
Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2023 Symposium
The third Machine Learning for Health (ML4H) symposium was held in person on December 10, 2023, in New Orleans, Louisiana, USA (Parziale et al., 2022). The symposium included research roundtable sessions to foster discussions between participants and senior researchers on timely and relevant topics for the ML4H community.
Research Paper

Sep 18, 2024
H. Jeong, et al.
Merlin: A Vision Language Foundation Model for 3D Computed Tomography
Over 85 million computed tomography (CT) scans are performed annually in the US, of which approximately one quarter focus on the abdomen. Given the current shortage of both general and specialized radiologists, there is a large impetus to use artificial intelligence to alleviate the burden of interpreting these complex imaging studies while simultaneously using the images to extract novel physiological insights. Prior state-of-the-art approaches for automated medical image interpretation leverage vision language models (VLMs). However, current medical VLMs are generally limited to 2D images and short reports, and do not leverage electronic health record (EHR) data for supervision. To overcome...
Research Paper

Sep 18, 2024
L. Blankemeier, et al.
Exploring the Potential of Large Language Models in Neurology, Using Neurologic Localization as an Example
Research Paper
Sep 18, 2024
CC. Chiang, et al.
Characterizing the limitations of using diagnosis codes in the context of machine learning for healthcare
Importance: Diagnostic codes are commonly used as inputs for clinical prediction models, to create labels for prediction tasks, and to identify cohorts for multicenter network studies. However, the coverage rates of diagnostic codes and their variability across institutions are underexplored. Objective: Primary objective was to describe lab- and diagnosis-based labels for 7 selected outcomes at three institutions. Secondary objectives were to describe agreement, sensitivity, and specificity of diagnosis-based labels against lab-based labels. Methods: This study included three cohorts: SickKidsPeds from The Hospital for Sick Children, and StanfordPeds and StanfordAdults from Stanford Medicine. We included seven clinical outcomes with lab-based definitions:...
Research Paper

Sep 18, 2024
LL Guo, et al.
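
The comparison this abstract describes (agreement, sensitivity, and specificity of diagnosis-based labels against lab-based labels) can be illustrated with a minimal sketch. It assumes binary labels per patient and treats the lab-based label as the reference. The function name `compare_labels`, the toy data, and the use of raw percent agreement are assumptions for illustration, not the study's actual code or agreement statistic.

```python
from typing import Dict, Sequence


def compare_labels(lab: Sequence[int], dx: Sequence[int]) -> Dict[str, float]:
    """Score binary diagnosis-based labels (dx) against lab-based reference labels (lab)."""
    if len(lab) != len(dx):
        raise ValueError("lab and dx must have the same length")
    if not lab:
        raise ValueError("at least one patient is required")

    # Confusion-matrix counts, with the lab-based label as the reference.
    tp = sum(1 for l, d in zip(lab, dx) if l == 1 and d == 1)
    tn = sum(1 for l, d in zip(lab, dx) if l == 0 and d == 0)
    fp = sum(1 for l, d in zip(lab, dx) if l == 0 and d == 1)
    fn = sum(1 for l, d in zip(lab, dx) if l == 1 and d == 0)

    nan = float("nan")
    return {
        # Share of lab-positive patients who also carry the diagnosis code.
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        # Share of lab-negative patients who have no diagnosis code.
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        # Raw percent agreement (assumed here; a chance-corrected statistic is another option).
        "agreement": (tp + tn) / len(lab),
    }


# Toy example: 8 hypothetical patients, 3 lab-positive, only 1 of them coded.
lab_labels = [1, 1, 1, 0, 0, 0, 0, 0]
dx_labels = [1, 0, 0, 0, 0, 0, 0, 1]
metrics = compare_labels(lab_labels, dx_labels)
```

In this toy data the diagnosis codes miss two of the three lab-positive patients, which is the kind of coverage gap the abstract says is underexplored.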
A Systematic Review of Testing and Evaluation of Healthcare Applications of Large Language Models (LLMs)
Studies rarely use real patient care data for LLM evaluation. Administrative tasks such as generating provider billing codes and writing prescriptions are understudied. Natural Language Processing (NLP)/Natural Language Understanding (NLU) tasks like summarization, conversational dialogue, and translation are infrequently explored. Accuracy is the predominant dimension of evaluation, while fairness, bias and toxicity assessments are neglected. Evaluations in specialized fields, such as nuclear medicine and medical genetics are rare. Current LLM assessments in healthcare remain shallow and fragmented. To draw concrete insights on their performance, evaluations need to use real patient care data across a broad range of healthcare and NLP/NLU...
Research Paper

Sep 18, 2024
S. Bedi, et al.
A Multi-Center Study on the Adaptability of a Shared Foundation Model for Electronic Health Records
Background: Foundation models hold promise for transforming artificial intelligence (AI) in healthcare by providing modular components that are easily adaptable to downstream healthcare tasks, making AI development more scalable and cost-effective. Foundation models for structured electronic health records (EHR), trained on coded medical records from millions of patients, demonstrated benefits including increased performance with fewer training labels, and improved robustness to distribution shifts. However, questions remain on the feasibility of sharing these models across different hospitals and their performance for local task adaptation. Objective: This multi-center study examined the adaptability of a recently released structured EHR foundation model (FMSM), trained...
Research Paper

Sep 18, 2024
LL Guo, et al.
Language Models in the Loop: Incorporating Prompting into Weak Supervision
We propose a new strategy for applying large pre-trained language models to novel tasks when labeled training data is limited. Rather than apply the model in a typical zero-shot or few-shot fashion, we treat the model as the basis for labeling functions in a weak supervision framework. To create a classifier, we first prompt the model to answer multiple distinct queries about an example and define how the possible responses should be mapped to votes for labels and abstentions. We then denoise these noisy label sources using the Snorkel system and train an end classifier with the resulting training data....
Research Paper

Aug 22, 2024
R. Smith et al.
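
The strategy in this abstract (prompt the model with several distinct queries, map each response to a label vote or an abstention, then combine the noisy votes) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: `stub_model` is a hypothetical keyword matcher standing in for a prompted language model, the queries and the sentiment task are invented for the example, and a simple majority vote is swapped in for the Snorkel label model that the paper uses for denoising.

```python
from collections import Counter
from typing import Dict, List, Tuple

ABSTAIN = -1
NEGATIVE, POSITIVE = 0, 1


def stub_model(keyword: str, text: str) -> str:
    """Hypothetical stand-in for a prompted model answering
    'Does the text mention <keyword>?' with 'yes' or 'no'."""
    return "yes" if keyword in text.lower() else "no"


# Each labeling function pairs a query with a map from responses to votes.
# Responses missing from the map count as abstentions.
LABELING_FUNCTIONS: List[Tuple[str, Dict[str, int]]] = [
    ("great", {"yes": POSITIVE}),
    ("terrible", {"yes": NEGATIVE}),
    ("recommend", {"yes": POSITIVE}),
]


def collect_votes(text: str) -> List[int]:
    """Query the model once per labeling function and map responses to votes."""
    return [
        response_to_vote.get(stub_model(keyword, text), ABSTAIN)
        for keyword, response_to_vote in LABELING_FUNCTIONS
    ]


def majority_vote(votes: List[int]) -> int:
    """Combine votes, ignoring abstentions; ties and all-abstain cases abstain."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def weak_label(text: str) -> int:
    """Produce one weak label for a single example."""
    return majority_vote(collect_votes(text))
```

Examples that receive a non-abstain weak label would then serve as training data for the end classifier. A learned label model differs from majority vote in that it estimates how accurate each labeling function is rather than weighting every vote equally.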