medRxiv Preprint | 2024

A Systematic Review of Testing and Evaluation of Healthcare Applications of Large Language Models (LLMs)

S. Bedi, et al.

Abstract

Studies rarely use real patient care data for LLM evaluation. Administrative tasks such as generating provider billing codes and writing prescriptions are understudied. Natural Language Processing (NLP)/Natural Language Understanding (NLU) tasks such as summarization, conversational dialogue, and translation are infrequently explored. Accuracy is the predominant dimension of evaluation, while fairness, bias, and toxicity assessments are neglected. Evaluations in specialized fields, such as nuclear medicine and medical genetics, are rare. Current LLM assessments in healthcare remain shallow and fragmented. To draw concrete insights on LLM performance, evaluations need to use real patient care data across a broad range of healthcare and NLP/NLU tasks and medical specialties, with standardized dimensions of evaluation.
