CUSTOMER STORY

Snorkel AI helps MSKCC streamline HER-2 patient identification

Industry:

Healthcare

Solution:

Content Classifier

93%

Accuracy with just a few labeling functions

1000s

Of patient records auto-labeled

Weeks

Instead of months to build a document classifier

Memorial Sloan Kettering Cancer Center (MSKCC), the world’s oldest and largest private cancer center, sought to identify candidates for clinical trials by classifying patient records according to the presence of a relevant protein, HER-2. Reviewing patient records for HER-2 is onerous: clinicians and researchers must parse complex, variable patient data. By labeling training data programmatically and iterating on that data, MSKCC built a model with an overall accuracy of 93%. The model powers an AI-driven screening system that classifies patient records at scale, speeding recruitment for clinical trials and, as a result, treatment research and development.

Scaling clinical trial screening with document classification

Challenge

To unlock improvements to clinical trial screening, the data science team at MSKCC wanted to use AI/ML to classify patient records based on the presence of HER-2, a protein common to many cancers. However, lack of labeled training data bottlenecked their progress. Labeling data, in this case complex patient records, requires clinician and researcher expertise and is prohibitively slow and expensive. Further, even when experts were able to manually annotate training data, their labels were at times inconsistent, limiting model performance potential.

Labeling training data was prohibitively slow given the high degree of domain expertise required and the inability to outsource the work.
Model quality was limited because information was referenced inconsistently within the complex, unstructured text of patient records.
Disagreement among expert labels went undiscovered because there was no practical way to govern labels.

“The biggest challenge we have—which is true of any AI/ML project, but is especially so in clinical contexts—is how do we label [training] data? Our labelers are physicians and researchers, their time is very expensive.” 
Subratta Chatterjee
Principal Data Scientist, MSKCC

Goal

Reduce data labeling and development time by making more efficient use of domain experts’ effort—without reducing data or AI application quality.

Solution

MSKCC used Snorkel Flow to build an AI application that classifies patient records into five classes describing the presence of HER-2. This application feeds a downstream clinical trial screening system that identifies potential trial participants. While Snorkel Flow was originally deployed on customer-managed infrastructure, the team later migrated it to AWS using the Snorkel Managed VPC service to simplify management.

To objectively measure Snorkel Flow as a solution, the team used 3,200 data points they’d labeled previously outside of the platform. They ingested the data and split it across training, validation, and test sets. The training set of 2,300 was uploaded to Snorkel Flow without labels (aside from 500 ground truth labels used for analysis). 

The lead Bioinformatics Engineer developing the project wrote just eight noisy, imperfect labeling functions, which Snorkel Flow combined to auto-label a training dataset. The team used this dataset to train an XGBoost model within the platform (which generalized beyond the data labeled using weak supervision). Next, using error analysis tools within the platform, the team used feedback from this model to learn where it was confused and how to correct it.
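To make the idea of combining noisy labeling functions concrete, here is a minimal sketch in plain Python. It is not Snorkel Flow's method: the platform's own way of combining votes is not described here, so this sketch substitutes a simple majority vote. The three labeling functions, the keywords they look for, and the class ids are all hypothetical and chosen only for illustration.

```python
from collections import Counter

ABSTAIN = -1  # returned when a labeling function has no opinion

# Hypothetical class ids; the real project used five classes.
NEGATIVE, POSITIVE = 0, 1


# Hypothetical labeling functions: each is a cheap, imperfect heuristic.
def lf_her2_positive(text):
    return POSITIVE if "her-2 positive" in text.lower() else ABSTAIN


def lf_her2_negative(text):
    return NEGATIVE if "her-2 negative" in text.lower() else ABSTAIN


def lf_ihc_3plus(text):
    return POSITIVE if "3+" in text else ABSTAIN


LABELING_FUNCTIONS = [lf_her2_positive, lf_her2_negative, lf_ihc_3plus]


def majority_vote(text, lfs=LABELING_FUNCTIONS):
    """Combine labeling function votes; abstain when no vote or a tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # the top two labels are tied
    return ranked[0][0]
```

Records where every function abstains, or where the votes tie, stay unlabeled. A downstream model trained on the labeled records can still make predictions for them, which is the sense in which such a model generalizes beyond the auto-labeled data.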

As one example, the model misclassified data that were positive for HER-2 (2+ and 3+) as negative when the records frequently contained the word “negative.” To solve this, the engineer wrote a labeling function to only consider “negative” when it was within close proximity to the keyword “HER-2”. After retraining, the XGBoost model was able to classify these data points correctly.
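A proximity rule of this kind can be sketched with regular expressions. This is an illustrative sketch only, not the engineer's actual labeling function: the function name, the class id, the spelling variants matched for HER-2, and the 40-character window are all assumptions.

```python
import re

ABSTAIN = -1
NEGATIVE = 0  # hypothetical class id

# Assumed spelling variants: "HER-2" and "HER2", case-insensitive.
HER2_PATTERN = re.compile(r"\bHER-?2\b", re.IGNORECASE)
NEGATIVE_PATTERN = re.compile(r"\bnegative\b", re.IGNORECASE)


def lf_negative_near_her2(text, window=40):
    """Vote NEGATIVE only if "negative" appears within `window`
    characters of a HER-2 mention; otherwise abstain."""
    her2_spans = [m.span() for m in HER2_PATTERN.finditer(text)]
    for neg in NEGATIVE_PATTERN.finditer(text):
        for start, end in her2_spans:
            # Character gap between the two matches (0 if they touch).
            gap = max(neg.start() - end, start - neg.end(), 0)
            if gap <= window:
                return NEGATIVE
    return ABSTAIN
```

With this rule, a report that says "Margins are negative" far from any HER-2 mention no longer produces a negative vote, while "HER-2: negative by IHC" still does.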

Of particular note to the team was that many of the ground truth labels—hand labeled by their experts—were incorrect on analysis. Fixing these ground truth labels removed a hidden constraint to model performance.

With just a few rapid iterations, the team achieved an overall accuracy of 93% and an average F1 of 87% across all classes. The test set results were nearly as strong, with 92% accuracy and 87% F1. While they’ve met their initial performance targets for production, the bioinformatics engineer intends to improve further with a few more labeling functions and ML model experimentation.
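Accuracy and average F1 can differ noticeably on a multi-class problem, because a per-class average weights rare classes as heavily as common ones. The sketch below computes both from scratch on made-up labels. The story does not say how F1 was averaged, so the unweighted (macro) average used here is an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for three classes, purely for illustration.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(f"accuracy={accuracy(y_true, y_pred):.3f}, macro F1={macro_f1(y_true, y_pred):.3f}")
```

In this toy example a single mistake on class 1 lowers the macro F1 more than it lowers accuracy, which is the same direction as the gap between the accuracy and F1 figures reported above.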

Auto-labeled training data programmatically, significantly reducing the time needed to label complex, domain-specific text documents.
Used model-guided error analysis to identify data quality issues—including incorrect ground truth labels—and iterate rapidly to improve.
Increased explainability by encoding the labeling rationale for each training data point as labeling functions that can be inspected like code.

The document classification AI application the team built is now used downstream to power a clinical trial screening system. This system allows MSKCC to identify HER-2 status in patient records without relying on human experts to review each record.

“It took me a long time to build the Regex model [outside of Snorkel Flow]; I read hundreds, thousands of pathology reports to make it work. Whereas [with Snorkel Flow] it was a few hours to get to the results we achieved—a pretty dramatic difference.”  
John Cadley
Bioinformatics Engineer, MSKCC
