Clinical entity classification in electronic health records

June 17, 2022

Research recap: Ontology-driven weak supervision for clinical entity classification in electronic health records (EHRs) 

In this post, I summarize the research published in the academic paper Ontology-driven weak supervision for clinical entity classification in electronic health records by Jason Fries et al., published in Nature Communications in 2021.

Problem statement

Electronic health records (EHRs) contain rich information, such as clinical notes, laboratory results, and diagnoses, that can be used to tailor treatment to each patient. Extracting entities such as drugs and disorders is an important step in making clinical decisions. Recently, natural language processing (NLP) techniques such as named entity recognition (NER) have been used to automate tasks such as identifying disease names or other entities in text.

Example clinical text: "Family history: The patient's sister has ovarian cancer disease and his father has liver cancer disease."

Challenges with NER

Traditionally, training classifiers for named entity recognition (NER) and cue-based entity classification has relied on hand-labeled training data. However, annotating EHRs requires considerable domain expertise and money, creating barriers to using machine learning. Moreover, hand-labeled datasets are static artifacts that are expensive to change. Due to privacy concerns regarding patient data, outsourcing and sharing labeled data is often a non-starter. Later in the paper, the authors discuss the need for agile and robust techniques for acquiring training data for machine learning models in light of the fast-changing events of the COVID-19 pandemic.

Proposed method

The authors propose Trove, a framework for weakly supervised entity classification using medical ontologies and expert-generated rules. With this approach, instead of manually labeling training data, the authors use Snorkel’s weak supervision framework to programmatically label EHR text. This makes it easy to share and modify training data, while offering performance comparable to learning from manually labeled training data.

Weak supervision is the technique of generating low-cost but less accurate labels by combining multiple imperfect sources of supervision, such as subject matter expert heuristics, rules-based systems, large language models, dictionaries, and ontologies. Each source is encoded as a labeling function. Weak supervision has demonstrated success across a range of NLP and other settings at Google, Genentech, Intel, Apple, Stanford Medicine, and more.

In this paper, Trove applies weak supervision by creating labeling functions using:

  • Task-specific rules: extraction of rules and heuristics based on specific text such as physician notes
  • Ontologies: extraction of terminologies from external resources such as the Unified Medical Language System (UMLS) [2], or other dictionaries
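
As a minimal sketch of the two kinds of labeling functions listed above, the Python below defines one ontology-based function and one task-specific rule. The tiny dictionary, the mapping from semantic types to labels, the regular expression, and all function names are hypothetical stand-ins, not Trove's or Snorkel's actual API. Each function votes 1 (disorder), 0 (not a disorder), or -1 (abstain) for a single token.

```python
import re

ENTITY, NOT_ENTITY, ABSTAIN = 1, 0, -1

# Hypothetical mini-ontology: term -> semantic type.
# A real dictionary would be extracted from a resource such as UMLS.
TOY_ONTOLOGY = {
    "diabetes": "Disease or Syndrome",
    "cancer": "Neoplastic Process",
    "aspirin": "Pharmacologic Substance",
}

# Assumed mapping from semantic types to the target class (disorder or not).
TYPE_TO_LABEL = {
    "Disease or Syndrome": ENTITY,
    "Neoplastic Process": ENTITY,
    "Pharmacologic Substance": NOT_ENTITY,
}


def lf_ontology(token: str) -> int:
    """Ontology-based labeling function: dictionary lookup, abstain if unknown."""
    semantic_type = TOY_ONTOLOGY.get(token.lower())
    return TYPE_TO_LABEL.get(semantic_type, ABSTAIN)


# Illustrative task-specific rule: a few common disorder suffixes.
DISORDER_SUFFIX = re.compile(r"(itis|osis|emia|oma)$", re.IGNORECASE)


def lf_suffix_rule(token: str) -> int:
    """Rule-based labeling function: vote ENTITY on a suffix match, else abstain."""
    return ENTITY if DISORDER_SUFFIX.search(token) else ABSTAIN


tokens = ["took", "aspirin", "for", "hepatitis", "and", "diabetes"]
votes = [(t, lf_ontology(t), lf_suffix_rule(t)) for t in tokens]
```

In this toy example "hepatitis" is missing from the dictionary but is caught by the suffix rule, which hints at why the two sources complement each other.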

Trove pipeline

Trove: Ontology-driven weak supervision for EHR

The following is a quick description of the Trove pipeline:

  1. Users start by:
    • Specifying ontology dictionaries (UMLS, etc.) and a mapping from the ontologies’ categories to entity classes, which together with task-specific rules define the labeling functions
    • Collecting a large set of unlabelled training documents
  2. The label matrix is built by aggregating the outputs of all labeling functions
  3. The label model is trained to learn the accuracies of the labeling functions and correct for their noise
  4. The label model predicts a consensus probability per word to generate a labeled dataset
  5. The labeled dataset is then used to train an end model such as BioBERT [1]
    • Or the label model predictions could be used as the final classifier
  6. Finally, apply the end model to process documents and obtain predictions per word
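
Steps 1 and 2 above can be sketched as follows, assuming word-level labeling functions that return 1, 0, or -1 (abstain). The three functions, their word lists, and the helper names are hypothetical. The label matrix has one row per word and one column per labeling function.

```python
ABSTAIN = -1


# Hypothetical word-level labeling functions (1 = disorder, 0 = not, -1 = abstain).
def lf_dict_a(word: str) -> int:
    return 1 if word.lower() in {"diabetes", "cancer"} else ABSTAIN


def lf_dict_b(word: str) -> int:
    return 1 if word.lower() in {"diabetes", "type"} else ABSTAIN


def lf_stopwords(word: str) -> int:
    return 0 if word.lower() in {"the", "has", "of", "and"} else ABSTAIN


def build_label_matrix(words, lfs):
    """Apply every labeling function to every word: rows = words, columns = functions."""
    return [[lf(w) for lf in lfs] for w in words]


def coverage(matrix):
    """Fraction of words that receive at least one non-abstain vote."""
    covered = sum(any(v != ABSTAIN for v in row) for row in matrix)
    return covered / len(matrix)


words = ["patient", "has", "diabetes", "type", "2"]
label_matrix = build_label_matrix(words, [lf_dict_a, lf_dict_b, lf_stopwords])
```

Words on which every function abstains (here "patient" and "2") carry no supervision signal, so coverage is a useful quantity to check before training a label model.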

Datasets and tasks

Several NER datasets are used in this study for chemical/disease and drug tagging.  

Datasets used and type of information extracted

In the figure below, an example sequence X_i is shown with the application of four different ontology labeling functions (MTH, CHV, LNC, SNOMEDCT). The entity of interest to be tagged is ‘diabetes type 2’.

  • Majority vote estimates Y_i as a word-level sum of positive class labels, weighting each equally
  • The label model estimates Y_i by reweighting labels to generate a more accurate prediction
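
The difference between the two estimators can be illustrated with a small sketch. This is not the label model used in the paper: a real label model estimates labeling function accuracies from unlabeled data, whereas here the accuracies are supplied by hand as assumptions and combined in a naive Bayes style weighted vote (uniform class prior, independent functions). All names and numbers are made up for illustration.

```python
import math

ABSTAIN = -1


def majority_vote(row):
    """Return 1 if positive votes outnumber negative votes, 0 if fewer, -1 on a tie."""
    pos = sum(v == 1 for v in row)
    neg = sum(v == 0 for v in row)
    if pos == neg:
        return ABSTAIN
    return 1 if pos > neg else 0


def weighted_probability(row, accuracies):
    """Estimate P(y = 1) from an accuracy-weighted vote.

    Each non-abstaining function adds log(acc / (1 - acc)) toward the class
    it voted for; the summed log-odds are passed through a sigmoid.
    """
    log_odds = 0.0
    for vote, acc in zip(row, accuracies):
        if vote == ABSTAIN:
            continue
        weight = math.log(acc / (1.0 - acc))
        log_odds += weight if vote == 1 else -weight
    return 1.0 / (1.0 + math.exp(-log_odds))


# Votes for one word from four hypothetical ontology labeling functions.
row = [1, 0, 0, 1]
# Assumed accuracies; a real label model would learn these without labels.
accuracies = [0.9, 0.6, 0.6, 0.9]

mv = majority_vote(row)                    # two votes each way, so a tie
p = weighted_probability(row, accuracies)  # the more accurate sources dominate
```

With equal weights the four votes cancel out, while the weighted estimate leans toward the positive class because the two positive votes come from the sources assumed to be more accurate.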

Combining labeling functions and training a classifier; what’s the effect on performance?

To test Trove’s capabilities, several experiments were performed using the datasets mentioned above. In the first experiment, the labeling functions are aggregated first by majority vote and then by the label model, and the additional effect of training an end classifier on the weakly supervised labels is explored. These results are compared against fully supervised models trained on hand-labeled data and against published state-of-the-art performance. In summary, the following four methods, along with published state-of-the-art results, are compared on all four sequence tagging tasks:

  1. Majority vote (MV): the majority class vote for each word
  2. Label model (LM): the default output of the data programming model
  3. Weakly supervised (WS): BioBERT trained on the probabilistic dataset generated by the label model
  4. Fully supervised (FS): BioBERT trained on the original expert-labeled training set
  5. State-of-the-art (SOTA): published performance metrics in the literature
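
Training on a probabilistic dataset, as in the WS setting above, means the end model sees soft targets rather than hard 0/1 labels. One common way to do this is cross-entropy against the soft target, sketched below in plain Python. This is an illustration of the general idea and is not confirmed to be the exact loss used in the paper; the function name and example values are assumptions.

```python
import math


def soft_cross_entropy(pred_prob, target_prob, eps=1e-12):
    """Binary cross-entropy against a probabilistic (soft) target in [0, 1]."""
    # Clamp the prediction so log() never receives 0.
    pred_prob = min(max(pred_prob, eps), 1.0 - eps)
    return -(
        target_prob * math.log(pred_prob)
        + (1.0 - target_prob) * math.log(1.0 - pred_prob)
    )


# The end model predicts 0.1 for a word in both cases.
# A confident weak label (0.95) penalizes that prediction more
# than an uncertain weak label (0.55) does.
loss_confident = soft_cross_entropy(pred_prob=0.1, target_prob=0.95)
loss_uncertain = soft_cross_entropy(pred_prob=0.1, target_prob=0.55)
```

The effect is that words on which the label model is unsure contribute a weaker training signal than words it labels with high confidence.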

Ontology-based vs. ontology-based AND task-specific labeling functions; what’s the effect on performance?

In the next experiment, the effects of different labeling functions are analyzed. Medical ontologies are a great source of information for weak supervision; however, additional task-specific rules may be required to boost performance. To test this, the following two approaches are explored for all four tasks:

  1. Using ontologies for labeling functions
    1. Dictionary of numbers, stopwords, punctuation
    2. + UMLS
    3. + Existing ontologies not in UMLS
  2. Using ontologies AND task-specific rules
    1. Regular expressions
    2. Small dictionaries (e.g., illegal drugs)
    3. Other heuristics
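
The incremental structure of this comparison can be sketched as a list of tiers, where each configuration enables every labeling function up to and including the current tier. The tier names and labeling function names below are hypothetical placeholders, not the ones used in the paper.

```python
# Hypothetical labeling-function names grouped by the tiers listed above.
TIERS = [
    ("base dictionaries", ["lf_numbers", "lf_stopwords", "lf_punctuation"]),
    ("+ UMLS", ["lf_umls_snomedct", "lf_umls_chv"]),
    ("+ other ontologies", ["lf_other_dictionary"]),
    ("+ task-specific rules", ["lf_regex", "lf_small_dictionary", "lf_heuristic"]),
]


def cumulative_tiers(tiers):
    """Yield (tier name, all labeling functions enabled up to and including it)."""
    enabled = []
    for name, lfs in tiers:
        enabled = enabled + lfs
        yield name, list(enabled)


for name, lfs in cumulative_tiers(TIERS):
    print(f"{name}: {len(lfs)} labeling functions")
```

Evaluating the label model once per cumulative configuration shows how much each added source of supervision contributes.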

The table below summarizes the results of the application of different approaches:

  • LM performance was higher than MV in all tasks by 4.1 F1 points on average
  • The weakly supervised BioBERT end model (WS) provided an additional average increase of 0.3 F1 points over LM
  • Weak supervision using additional task-specific rules performed within 1.3–4.9 F1 points (4.1%) of models trained on hand-labeled data (FS)
*SOTA: state-of-the-art performance in literature

Case study: COVID-19 risk factor monitoring

The COVID-19 pandemic created a critical need to rapidly analyze literature and unstructured EHR data to understand symptoms, outcomes, and risk factors at short notice. Building classification models quickly raises several challenges, such as manual labeling costs and data privacy concerns. The final case study applies Trove to COVID-19 symptom tagging and risk factor monitoring using a daily data feed of Stanford Health Care (SHC) emergency department notes.

Results

Disorder tagging: Ontology-based weak supervision performed almost as well as an FS model trained on hand-labeled data, while adding task-specific rules outperformed the FS model by 2.3 F1 points.

Exposure classification: The weakly supervised end model provided a 5.2 F1-point improvement over the rules alone.

*P: Precision, R: Recall, F1: F1-score
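
For readers unfamiliar with these metrics, the sketch below computes precision, recall, and F1 from gold and predicted binary tags. It is a token-level simplification: the helper name and the toy tags are made up, and published NER results are often scored at the entity (span) level rather than per token.

```python
def precision_recall_f1(gold, pred):
    """Token-level precision, recall, and F1 for binary tags (1 = entity)."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [0, 1, 1, 0, 1, 0]
pred = [0, 1, 0, 0, 1, 1]
p, r, f1 = precision_recall_f1(gold, pred)
```

An "F1 point" in the results above is one hundredth of this score, so a gain of 2.3 F1 points corresponds to a change such as 0.800 to 0.823.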


Conclusion

The Trove framework demonstrates how classifiers for a wide range of medical NLP tasks can be constructed quickly. Its main advantages are:

  • Fast: No need for time-consuming hand-labeled training data
  • Explainable: Rule-based and ontology-based labeling functions provide an interpretable view of generated training labels
  • Privacy preserved: Labeling functions are easy to share, edit, and modify without sharing labeled patient data
  • High performance: Combining state-of-the-art machine learning (such as the BioBERT language model) with the flexibility of rule-based approaches offers performance comparable to learning from manually labeled training data
 

References

  1. Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2019).
  2. Bodenreider, O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res. 32, D267–70 (2004).

Nazanin Makkinejad
Applied Machine Learning Engineer

Nazanin Makkinejad is an applied machine learning engineer at Snorkel AI, where she works with enterprise data science teams to realize the benefits of data-centric AI and Snorkel Flow. Prior to her role at Snorkel AI, Nazanin was a Postdoctoral Research Fellow at Harvard Medical School (HMS) and Massachusetts General Hospital (MGH), working on the intersection of deep learning and brain image analysis. She has a Ph.D. from the Illinois Institute of Technology in Biomedical & Medical Engineering and a Master’s Degree in Electrical and Computer Engineering from The University of Illinois Chicago.
