medRxiv Preprint | 2024
  
Red Teaming Large Language Models in Medicine: Real-World Insights on Model Behavior
Abstract
As a proof-of-concept, we convened an interactive “red teaming” workshop in which medical and technical professionals stress-tested popular large language models (LLMs) through publicly available user interfaces on clinically relevant scenarios. Results demonstrate a significant proportion of inappropriate responses across GPT-3.5, GPT-4.0, and GPT-4.0 with Internet (25.7%, 16.2%, and 17.5%, respectively) and illustrate the valuable role that non-technical clinicians can play in evaluating models.