A Systematic Review of Testing and Evaluation of Healthcare Applications of Large Language Models (LLMs)
Abstract
Studies rarely use real patient care data for LLM evaluation. Administrative tasks, such as generating provider billing codes and writing prescriptions, are understudied. Natural Language Processing (NLP)/Natural Language Understanding (NLU) tasks such as summarization, conversational dialogue, and translation are infrequently explored. Accuracy is the predominant dimension of evaluation, while assessments of fairness, bias, and toxicity are neglected. Evaluations in specialized fields, such as nuclear medicine and medical genetics, are rare. Current LLM assessments in healthcare remain shallow and fragmented. To draw concrete insights into their performance, evaluations need to use real patient care data across a broad range of healthcare tasks, NLP/NLU tasks, and medical specialties, with standardized dimensions of evaluation.