How data-centric AI can speed your end-to-end healthcare AI development and deployment

Healthcare is a field awash in data, and managing it all is complicated and expensive. As an industry, it stands to benefit tremendously from the ongoing development of machine learning and data-centric AI. The potential benefits of AI integration in healthcare fall into two categories: reducing costs and increasing efficiency for current practices, and unlocking new pathways for future innovation.

Cost and efficiency for AI in healthcare

Healthcare in the United States is highly complex. It is also very administratively “mature”—the industry we see today was created over time by multiple parties employing generations of processes, built one on top of another. This development path has created a lot of byzantine practices and procedures and a lot of overhead. For example, in pharmaceuticals, the average cost to develop one new drug is about $2.6 billion. The American healthcare system, in general, represents nearly 20 percent of U.S. GDP. The United States unquestionably has some of the best providers in the world. Still, because American healthcare requires so many administrative procedures and overhead to function, doctors and nurses are inundated and overworked. As a result, misdiagnosis is the fourth-leading cause of death in America. Each of these metrics stands to improve significantly with the successful implementation of machine learning across the healthcare field.


Data is what makes AI effective. Healthcare as an industry keeps getting better at collecting data about the human body, disease processes, and more, yet it has struggled to make effective use and reuse of all that data. With properly deployed AI applications, we can more effectively use the data we already collect to improve patient outcomes, both now and in the future.

Systematically harnessing all that data will also unlock new advances in medicine. The emergence and growth of personalized medicine and identifying opportunities for rare disease treatments, for example, are just two places where first-class data-centric AI can push the frontiers of science and medicine and improve the quality of care.

If we all agree that AI has a vital role in healthcare, why has it remained so difficult to find a place for it?

Challenges for healthcare AI

In our experience working with our partners, we’ve seen five primary challenges to implementing AI in healthcare. 

First, all that data is either unstructured or semi-structured. There is no shortage of available data (provider notes, clinical trial data, PDFs, and scanned documents), but almost none of it is structured. It needs to be processed to create the structure and uniformity that AI models need to work effectively, which costs time and resources.

Second, the data is not harmonized. Not only does it come in all kinds of distinct unstructured formats, but there are marked differences in collection methods, equipment, units of measure, medical terminology, and more. The healthcare system is not a monolith, and its fragmented nature creates a wide diversity of data methods, types, and metrics that are difficult to combine into one usable dataset.

Third, the prohibitive cost of labeling all that discordant and unstructured data leads to suboptimal development data. Training sets and AI models are expensive to create, and if the data is insufficiently representative or incomplete, that adversely affects the quality of any AI application. While this is often a problem for many industries working to implement AI, the healthcare field cannot afford to start with an imperfect model or suspect data and then work to improve it over time after deployment. Perhaps social media companies, for instance, can “move fast and break things,” but in the healthcare industry, people’s lives are at stake. The bar for quality is much higher.

That leads to the fourth primary challenge, which is that medical data is private and sensitive, and it needs to be treated as such. Medical data, for example, cannot be easily outsourced for labeling tasks, which is how a lot of data in other industries gets labeled. 

Finally, the fifth challenge is the need for specialized subject matter experts (SMEs) for data labeling. SMEs in healthcare come at a premium cost because of how expensive experts’ time tends to be. Biomedical data requires nuanced, specialized knowledge to label correctly. But as we mentioned above, healthcare SMEs are already overworked. Regularly adding a time-consuming labeling task to their duties is untenable over the long term.

These five challenges create a lot of inertia for integrating good quality AI applications in the healthcare industry. But they all have one thing in common: in each case, data is the bottleneck blocking progress. And this is where we at Snorkel AI have solutions.

How Snorkel Flow meets healthcare’s AI challenges:

Programmatic labeling

Snorkel Flow is a data-centric AI platform pioneered by Snorkel AI. It uses programmatic labeling to intelligently capture unique and specific SME knowledge and then scale it across large quantities of data. Programmatic labeling thus empowers humans to do what humans remain best at, then lets computers do the rest. 
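To make the idea concrete, here is a minimal sketch of programmatic labeling in plain Python. It is not Snorkel Flow's API: the label values, the function names, and the keyword heuristics are all hypothetical, chosen only to show how an expert's rule of thumb can be written once and then applied to every note.

```python
from typing import Callable, List

# Hypothetical label values; ABSTAIN means "this rule has no opinion".
ABSTAIN, NOT_DIABETES, DIABETES = -1, 0, 1


def lf_mentions_metformin(note: str) -> int:
    """Vote DIABETES when a common diabetes drug is mentioned."""
    return DIABETES if "metformin" in note.lower() else ABSTAIN


def lf_elevated_a1c(note: str) -> int:
    """Vote DIABETES when the note reports an elevated A1c."""
    return DIABETES if "a1c elevated" in note.lower() else ABSTAIN


def lf_explicit_negation(note: str) -> int:
    """Vote NOT_DIABETES when the note rules diabetes out."""
    return NOT_DIABETES if "no history of diabetes" in note.lower() else ABSTAIN


LABELING_FUNCTIONS: List[Callable[[str], int]] = [
    lf_mentions_metformin,
    lf_elevated_a1c,
    lf_explicit_negation,
]


def apply_labeling_functions(notes: List[str]) -> List[List[int]]:
    """Return one row of votes per note, one column per labeling function."""
    return [[lf(note) for lf in LABELING_FUNCTIONS] for note in notes]


# Toy notes, invented for illustration only.
notes = [
    "Patient on metformin, A1c elevated at last visit.",
    "No history of diabetes. Presents with ankle sprain.",
    "Routine follow-up, no complaints.",
]
votes = apply_labeling_functions(notes)
for note, row in zip(notes, votes):
    print(row, note)
```

Each function encodes one piece of expert knowledge and abstains when it does not apply, so the SME writes a handful of rules rather than labeling every note by hand.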

Iterative model development

In AI applications, generally, data requires multiple iterations or “passes” to get it ready for use in a model. Practically, this results in the need for multiple iterations of the data labeling task and model training to produce an application that effectively solves a problem. This iterative “loop” is essential for data-centric AI and is at the core of Snorkel Flow. With Snorkel Flow, data scientists can iterate on training data continuously and at scale, without the need for costly and time-intensive manual labeling from SMEs.

Monitoring and adaptability

Keeping models relevant over time requires ongoing updates and flexibility. Once you create a model and move it into production, in other words, you need the ability to adapt it to changing data and objectives. There are almost weekly advancements in healthcare research, ranging from what we know about the human body to how particular drugs and molecules interact inside the body. These research developments often compel you to re-label your data, update label schema, and retrain models on a recurrent basis to ensure you capture the best possible understanding of a problem. Snorkel Flow gives you the flexibility to do so efficiently and at scale.

Enterprise readiness

Any AI application needs to be able to work out-of-the-box in ways that interoperate with your existing technology investments. It also needs to be secure, so that patient information remains private and uncompromised. Snorkel Flow can integrate with your existing infrastructure and allow you to do all your data labeling and iterations in-house.

The Snorkel Flow approach to AI in healthcare

The team at Snorkel AI has been working closely with healthcare industry leaders such as Stanford Medicine, Genentech, Memorial Sloan Kettering Cancer Center, and more. We have published research on 

Snorkel Flow’s out-of-the-box application pipelines for common ML tasks such as classification and information extraction are ideal for harmonizing healthcare data. They allow you to collect, label, and harmonize semi-structured and unstructured data, extract the most relevant data, and make it AI-ready very quickly. 

Programmatic labeling allows you to capture the intuition of healthcare experts, such as clinicians, physicians, and administrative staff, and spread it rapidly across your data rather than applying it one data point at a time. This means you can work with high-quality labeled data for every part of your set. Labeling functions make this scalable and efficient, so you can perform this process in-house and keep sensitive data secure and private. In a practical sense, your SMEs apply their expertise to the data without labeling or re-labeling each data point themselves. Snorkel Flow uses the SME’s time most productively and lets the computer scale that expertise across your dataset.

Snorkel’s approach creates an iterative loop, which allows you to develop AI applications with the best available data in-house in a fast, responsive, and scalable way. In other words, you can get applications into production and keep them relevant as your data and goals change over time.

Snorkel Flow for healthcare providers and payers

Below are some case studies featuring problems Snorkel AI has solved while working with our customers.  

Claims processing

Insurance claims generate a substantial amount of unstructured data: medical coding, prior-authorization documents, and other miscellaneous documents, often rendered as scanned PDFs. This data is challenging because it is private and must stay in-house. Labeling it usefully requires time from subject matter experts (nurses, clinicians, and others), whose time is costly and limited. Snorkel Flow’s programmatic labeling, however, can greatly speed up coding using non-SME coders, saving SME time for the more complex cases, such as prior authorizations for expensive drugs. 

Tools for doctors

Physicians and clinicians generally record a lot of unstructured data about population health information, outputs from medical devices, lab reports, and other sources. Using Snorkel Flow, data science teams create labeling functions that take unstructured data and extract patient population insights, assisting with reports that must be submitted to government regulatory agencies. This saves a lot of clinician time, boosting doctor productivity and aiding with diagnosing and triaging patients.  

Let’s take a Fortune 500 healthcare provider’s use case as an example here. The provider wanted to use their large amount of data—doctor notes and pathology reports—to look for trends and correlations and to extract insights into a particular disease category. The challenge was to see whether they could use AI to analyze this unstructured data and detect correlational or predictive findings.  

Simply put, it was infeasible to use valuable physician time to label even a portion of the data. Without programmatic labeling, each note would have to be examined and categorized one at a time in order to turn it into structured information.

Using Snorkel Flow, their data science team was able to use programmatic labeling to run through unstructured data much more quickly, and then demonstrably iterate on that data, creating a loop that remained adaptable and so useful for the long term.

Another partner, a payer, wanted help in determining patient eligibility for medical assistance. But determining this is traditionally a fragmented and disconnected process that carries high operational overhead for the medical team doing the evaluations and creates a lot of frustration for both patients and providers. It took 12 nurse-SMEs an entire month to manually label data for just one procedure, and they estimated it would take a year to develop the application with that dataset. 

With Snorkel Flow, their data science team applied a programmatic approach to label data and build the end-to-end AI application. An example of a labeling function they built is: “extract all the medical entities from the claim using an existing machine-learning model or some other ontology and check to see if that is matched in the patient’s records.” Using Snorkel Flow, you produce “noisy” but large amounts of labeled data at once, and then Snorkel Flow “de-noises” it as part of the model training and training data iteration.
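A labeling function of that kind, together with a very simple de-noising step, could be sketched roughly as follows. Everything here is an assumption made for illustration: the toy `ONTOLOGY` stands in for an entity-extraction model, the `SUPPORTED`/`UNSUPPORTED` labels and the two extra labeling functions are hypothetical, and a plain majority vote stands in for the de-noising that Snorkel Flow performs during training, which is more sophisticated than this.

```python
from collections import Counter
from typing import Callable, List, Set, Tuple

# Hypothetical label values; ABSTAIN means the rule has no opinion.
ABSTAIN, UNSUPPORTED, SUPPORTED = -1, 0, 1

# Toy ontology standing in for an existing entity-extraction model.
ONTOLOGY: Set[str] = {
    "knee replacement", "osteoarthritis", "insulin pump", "type 1 diabetes",
}


def extract_entities(text: str) -> Set[str]:
    """Return every ontology term that appears in the text."""
    lowered = text.lower()
    return {term for term in ONTOLOGY if term in lowered}


def lf_claim_matches_record(claim: str, record: str) -> int:
    """SUPPORTED if every entity in the claim also appears in the record."""
    claim_entities = extract_entities(claim)
    if not claim_entities:
        return ABSTAIN
    return SUPPORTED if claim_entities <= extract_entities(record) else UNSUPPORTED


def lf_prior_auth_on_file(claim: str, record: str) -> int:
    """SUPPORTED if the record notes an approved prior authorization."""
    return SUPPORTED if "prior authorization approved" in record.lower() else ABSTAIN


def lf_empty_record(claim: str, record: str) -> int:
    """UNSUPPORTED if there is no patient record to check against."""
    return UNSUPPORTED if not record.strip() else ABSTAIN


LFS: List[Callable[[str, str], int]] = [
    lf_claim_matches_record, lf_prior_auth_on_file, lf_empty_record,
]


def majority_vote(votes: List[int]) -> int:
    """Combine noisy votes: ignore abstentions, abstain on ties."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    ranked = counts.most_common(2)
    if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
        return ABSTAIN
    return ranked[0][0]


# Toy (claim, record) pairs, invented for illustration only.
cases: List[Tuple[str, str]] = [
    ("Claim for knee replacement due to osteoarthritis.",
     "History of osteoarthritis; knee replacement recommended. "
     "Prior authorization approved."),
    ("Claim for insulin pump.", "Patient treated for osteoarthritis."),
    ("Claim for physical therapy.", ""),
]
labels = [majority_vote([lf(c, r) for lf in LFS]) for c, r in cases]
print(labels)
```

No single rule is right every time, which is why the votes are treated as noisy and combined rather than trusted individually.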

Snorkel Flow for Pharma

Now let’s look at some use cases from the pharmaceutical side of healthcare.

Clinical trial analytics

Clinical trial protocol design requires a large amount of data, most of which is unstructured. Usually, these are enormous documents that come in a variety of different formats. Manually labeling all this data using a clinical scientist SME would take a very long time. 

Snorkel Flow can use all these data sources in their unstructured format to help you design better clinical trials. You can: identify patients from past trials to develop new protocols; speed up adverse events reporting to regulators; search other relevant trials, experiments, and published research; and do all of that much faster in a programmatic way.

A trial design team may have a lot of unstructured data from claims, EHR aggregators, doctors’ notes, scanned patient documents, etc. Using Snorkel Flow, data science teams automate the labeling of this data in order to improve drug safety both pre- and post-market; improve treatment efficacy and healthcare outcomes for reimbursements and commercial sales; and advance our understanding of diseases and the clinical guidelines surrounding them.

Genentech wanted to improve the design of its clinical trial inclusion/exclusion (I/E) criteria—the factors that determine whether a test patient should be included in a particular trial or not. I/E is a critical part of study design, and it is often a primary arbiter of whether a trial is successful or not. Traditionally, I/E design is done manually by clinical scientists who look up trials and use large Excel spreadsheets to analyze past trials. An AI application would make this far more efficient, but it also has to keep the data private and in-house. Doing this manually, Genentech estimated it would generally take 140 SMEs a month to label their training dataset. 

Given a clinical trial document, they wanted to extract the I/E criteria. The criteria might be “a patient could have x condition but not y” or something similar. Snorkel Flow leveraged its programmatic labeling approach to label 340,000 clinical trial protocol documents very quickly. Broadly speaking, our platform created a complex machine-learning pipeline that, given a trial protocol document, could quickly extract what diseases are mentioned within it, and then map those to a certain set of chronic conditions. Then, with some post-processing, it could determine whether that condition would be useful I/E criteria for a trial. 
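The shape of that pipeline (find disease mentions, map them to chronic conditions, then post-process into I/E criteria) can be sketched in a few lines. This is only a toy stand-in under stated assumptions: keyword matching replaces the machine-learning extraction and entity linking described above, and the `CONDITION_MAP` entries, function names, and sample protocol text are all hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical mapping from disease mentions to chronic-condition categories.
CONDITION_MAP: Dict[str, str] = {
    "type 2 diabetes": "diabetes",
    "t2dm": "diabetes",
    "hypertension": "cardiovascular disease",
    "heart failure": "cardiovascular disease",
    "copd": "chronic respiratory disease",
}


def find_mentions(line: str) -> List[str]:
    """Step 1: find known disease mentions in a line (whole words only)."""
    lowered = line.lower()
    return [m for m in CONDITION_MAP
            if re.search(rf"\b{re.escape(m)}\b", lowered)]


def extract_criteria(protocol_text: str) -> Dict[str, List[str]]:
    """Steps 2 and 3: map mentions to conditions and group them by section."""
    criteria: Dict[str, List[str]] = {"inclusion": [], "exclusion": []}
    section = None
    for raw_line in protocol_text.splitlines():
        line = raw_line.strip()
        header = line.lower()
        if header.startswith("inclusion criteria"):
            section = "inclusion"
            continue
        if header.startswith("exclusion criteria"):
            section = "exclusion"
            continue
        if section is None:
            continue  # ignore text outside the I/E sections
        for mention in find_mentions(line):
            condition = CONDITION_MAP[mention]
            if condition not in criteria[section]:  # de-duplicate, keep order
                criteria[section].append(condition)
    return criteria


# Toy protocol excerpt, invented for illustration only.
protocol = """Study Title: Example Trial
Inclusion Criteria:
- Adults diagnosed with type 2 diabetes (T2DM)
Exclusion Criteria:
- History of heart failure
- Uncontrolled hypertension or COPD
"""
result = extract_criteria(protocol)
print(result)
```

The output is a small structured record per document, which is the kind of result that lets past trials be compared programmatically instead of by hand in spreadsheets.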

In other words, Snorkel Flow was able to take all this unstructured data and convert it into a structured format programmatically. It also reduced the “egregious error” rate to 0.7 percent with guided error analysis. Using the resultant data in the best machine-learning models available today, we achieved 99 percent accuracy for named entity recognition (NER) and entity linking use cases.

Crucially, once Genentech’s label schema changed, it only required a single day to iterate and relabel all of their training data. The process was fast, maintained the privacy of the data, made the best use of SME time and resources, and was readily adaptable to changes in clinical trends or goals.

Final words on AI in healthcare

Using the Snorkel Flow platform can save person-months or even years compared to hand labeling data for machine-learning applications, and do so while also improving accuracy and providing iterative flexibility. It is an “end-to-end” AI application development platform. 

Snorkel Flow starts by taking the unstructured data you already have at your enterprise and making it AI-ready using our templates. It then allows you to label that data programmatically and makes it easy to iterate and relabel as your needs change. You end up with a model that can directly interface with your existing production infrastructure, and the platform encapsulates the entire life cycle of your data: ingestion, making data AI-ready, labeling, creating a model in an iterative way, and then making it production-ready. 

In essence, Snorkel Flow represents an exhaustive set of “connectors” that not only bring your data in but also make your resultant model adaptable, and thus useful, over the long term. Snorkel Flow’s data-centric AI and programmatic labeling approach has a unique potential for use in the healthcare field. It specializes in taking vast amounts of unstructured data and harmonizing and labeling it quickly, in-house, and in ways that are easily scalable and that maximize your SME resources. 

Healthcare facilities not only use Snorkel Flow to build and deploy AI applications but also to keep their applications relevant and value-creative over time. You can read more about some of the outcomes we’ve enabled in our case studies.

To learn more about Snorkel Flow and how it can unleash your healthcare AI deployment, request a demo or visit our platform. We encourage you to subscribe to receive updates or follow us on Twitter, LinkedIn, or YouTube.