MSKCC, the world’s oldest and largest private cancer center, sought to identify candidates for clinical trials by classifying patient records for the presence of a relevant protein, HER-2. Reviewing patient records for HER-2 is onerous; clinicians and researchers must parse complex, variable patient data. By unblocking training data labeling and data-centric iteration, MSKCC built a model with 93% overall accuracy. This model powers an AI-driven screening system that classifies patient records at scale, speeding recruitment for clinical trials and, as a result, treatment research and development.

Scaling clinical trial screening with document classification
Memorial Sloan Kettering Cancer Center, the world’s oldest and largest private cancer center, provides care to increase the quality of life of more than 150,000 cancer patients annually. In service of this, they use AI to speed the discovery of more effective strategies to prevent, control and ultimately cure cancer in the future.


To unlock improvements to clinical trial screening, the data science team at MSKCC wanted to use AI/ML to classify patient records based on the presence of HER-2, a protein common to many cancers. However, a lack of labeled training data bottlenecked their progress. Labeling this data—complex patient records—requires clinician and researcher expertise and is prohibitively slow and expensive. Further, even when experts were able to manually annotate training data, their labels were at times inconsistent, limiting the model's performance potential.

  • Labeling training data was prohibitively slow given the high degree of domain expertise required and the inability to outsource.
  • Model quality was limited because information was referenced inconsistently within the complex, unstructured text of patient records.
  • Disagreements between expert labelers went undiscovered because there was no practical way to govern labels.
Example of patient data with HER-2 labels

“The biggest challenge we have—which is true of any AI/ML project, but is especially so in clinical contexts—is how do we label [training] data? Our labelers are physicians and researchers, their time is very expensive.”

Subratta Chatterjee
Principal Data Scientist MSKCC


Reduce data labeling and development time by making more efficient use of domain experts’ effort—without reducing data or AI application quality.


MSKCC used Snorkel Flow to build an AI application to classify patient records across five classes categorizing the presence of HER-2. This application was used for a downstream clinical trial screening system to identify potential clinical trial participants.

To objectively measure Snorkel Flow as a solution, the team used 3,200 data points they’d labeled previously outside of the platform. They ingested the data and split it across training, validation, and test sets. The training set of 2,300 was uploaded to Snorkel Flow without labels (aside from 500 ground truth labels used for analysis).
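The case study specifies the total (3,200 points) and the training set (2,300), but not how the remaining 900 were divided between validation and test. A minimal sketch of such a split in plain Python—the even 450/450 division and the seed are assumptions:

```python
import random

def split_records(records, n_train=2300, n_val=450, seed=42):
    """Shuffle and partition records into train/validation/test sets.

    n_train follows the case study's 2,300-record training set; n_val
    and the seed are assumptions (3,200 total leaves 900 to divide
    between validation and test).
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

In practice the training split's labels would be withheld on upload, with a small labeled subset (the 500 ground-truth points mentioned above) retained for analysis.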

The lead Bioinformatics Engineer developing the project wrote just eight noisy, imperfect labeling functions, which Snorkel Flow combined to auto-label a training dataset. They used this dataset to train an XGBoost model within the platform, which generalized beyond the data labeled via weak supervision. Next, using the platform's error analysis tools, the team learned where the model was confused and how to correct it.
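MSKCC's actual labeling functions aren't published, and Snorkel Flow combines votes with a learned label model rather than the simple majority vote used here. Still, the pattern can be sketched in plain Python—the class names, regexes, and vote aggregation below are all illustrative assumptions:

```python
import re
from collections import Counter

ABSTAIN = -1
# Hypothetical label set; the case study's five classes aren't enumerated.
NEG, ONE_PLUS, TWO_PLUS, THREE_PLUS, NOT_MENTIONED = 0, 1, 2, 3, 4

def lf_explicit_score(text):
    """Vote based on an explicit IHC-style score such as 'HER-2: 2+'."""
    m = re.search(r"HER-?2\D{0,10}([0-3])\s*\+?", text, re.IGNORECASE)
    return ABSTAIN if not m else int(m.group(1))

def lf_no_mention(text):
    """Vote 'not mentioned' when HER-2 never appears in the record."""
    return NOT_MENTIONED if not re.search(r"HER-?2", text, re.IGNORECASE) else ABSTAIN

LFS = [lf_explicit_score, lf_no_mention]

def auto_label(text):
    """Majority vote over non-abstaining labeling functions
    (stand-in for Snorkel Flow's learned label model)."""
    votes = [lf(text) for lf in LFS]
    votes = [v for v in votes if v != ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN
```

Each function votes on only the records it can confidently handle and abstains otherwise; the aggregated votes become probabilistic training labels for the downstream model.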

As one example, the model misclassified data points that were positive for HER-2 (2+ and 3+) as negative because the records frequently contained the word “negative.” To solve this, the engineer wrote a labeling function that counts “negative” only when it appears in close proximity to the keyword “HER-2”. After retraining, the XGBoost model classified these data points correctly.
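A proximity check like the one described could look like the following sketch; the character window and label values are assumptions, not the engineer's actual implementation:

```python
import re

ABSTAIN, NEGATIVE = -1, 0  # hypothetical label values

def lf_negative_near_her2(text, window=40):
    """Vote NEGATIVE only when 'negative' appears within `window`
    characters of a HER-2 mention; the window size is an assumption."""
    for m in re.finditer(r"HER-?2", text, re.IGNORECASE):
        lo, hi = max(0, m.start() - window), m.end() + window
        if re.search(r"\bnegative\b", text[lo:hi], re.IGNORECASE):
            return NEGATIVE
    return ABSTAIN
```

The key property is that a record mentioning “negative” about some other biomarker, far from any HER-2 mention, no longer triggers a spurious negative vote.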

Intermediate error analysis within Snorkel Flow

Of particular note to the team, many of the ground truth labels—hand-labeled by their experts—proved incorrect upon analysis. Fixing these ground truth labels removed a hidden constraint on model performance.

With just a few rapid iterations, the team achieved an overall accuracy of 93% and an average F1 of 87% across all classes. The test set results were nearly as strong, with 92% accuracy and 87% F1. While they’ve met their initial performance targets for production, the bioinformatics engineer intends to improve further with a few more labeling functions and ML model experimentation.
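The reported metrics can be computed as below; treating “average F1” as an unweighted (macro) average over the five classes is an assumption, since the case study doesn't specify the averaging scheme:

```python
from collections import defaultdict

def accuracy_and_macro_f1(y_true, y_pred):
    """Overall accuracy and unweighted (macro) average F1 across classes."""
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, was actually t
            fn[t] += 1
    f1s = []
    for c in set(y_true) | set(y_pred):
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return acc, sum(f1s) / len(f1s)
```

Macro averaging weights every class equally, so a rare class (e.g., an uncommon HER-2 score) that the model handles poorly drags the average down even if overall accuracy stays high.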

  • Labeled training data programmatically, significantly reducing the time to label complex, domain-specific text documents.
  • Used model-guided error analysis to identify data quality issues—including incorrect ground truth labels—and iterate rapidly to improve.
  • Increased explainability by encoding the labeling rationale for each training data point as labeling functions that can be inspected like code.

The document classification AI application the team built is now used downstream to power a clinical trial screening system. This system allows MSKCC to identify HER-2 in patient records without relying on human experts to review each record.

“It took me a long time to build the Regex model [outside of Snorkel Flow]; I read hundreds, thousands of pathology reports to make it work. Whereas [with Snorkel Flow] it was a few hours to get to the results we achieved—a pretty dramatic difference.”

John Cadley
Bioinformatics Engineer MSKCC


93% accuracy with just a handful of labeling functions


of patient records auto-labeled


Hours instead of months to build a document classification application

This work was presented at the Future of Data-centric AI event hosted by Snorkel AI. Watch this and many other sessions on-demand at