This article was originally published in EAIGG’s 2022 Annual Report on Realizing the Promise of AI.
In 2020, it was discovered that a significant part of the Scots version of Wikipedia wasn’t written by a Scots linguist or even a Scottish person but by an American teenager who didn’t speak Scots [1].
Beyond its obvious impact on Wikipedia readers, the damage reached much further: many large language models (LLMs) are trained on these articles. The LLMs (in addition to being used in chatbots and voice agents) also write new Wikipedia entries, creating a vicious circle and overwriting our understanding of the native language of 1.5 million people [2].
And this is not a one-off. In home lending alone, one study found that lenders using algorithms to generate decisions overcharged people of color by $765 million each year for home and refinance loans [2]. Another example in this space came when the DOJ recently ruled that Meta’s housing advertising system discriminated based on protected attributes under the Fair Housing Act [3][4].
These incidents highlight that AI applications are only as good as their training data, and compromising the integrity of the training data and the data-driven explainability of AI applications can have deleterious effects.
Trustworthy training data
Most organizations create their training data in one of two ways: internally or using a labeling vendor. For AI/ML teams who have privacy and expertise constraints, using a labeling vendor isn’t an option. Labeling internally is nearly always bottlenecked by the limited availability of (often high-value, busy) subject matter experts (SMEs). Fundamentally, both these options are bottlenecked by the need for a human to hand-annotate every datapoint, which is slow, hard to audit and adapt, and – as you’ll see in the case study below – is prone to introducing human bias in ways that are difficult to correct.
Programmatic labeling solves the impracticality and risk of hand-labeling training data. Instead of annotating millions of data points, SMEs convey their expertise in a handful of labeling functions that capture their rationale as auditable and inspectable logic [5]. This knowledge can include the SME’s heuristics, organizational resources such as existing rule-based systems or ontologies, or even relevant insight from advanced sources of knowledge like foundation models.
These knowledge inputs, each treated as sources of “weak” signal, are combined and might even conflict or have errors. Programmatic labeling relies on theoretically grounded weak supervision techniques to denoise and reconcile each signal source to output the best training label for each datapoint [6]. And because each of these training labels is encoded using the labeling functions, the overall system is much more adaptable, interpretable, and governable.
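To make the idea concrete, here is a minimal sketch of labeling functions and a way to combine their votes. Everything in it is an illustrative assumption: the field names (`debt_to_income`, `credit_score`, `defaults_last_2y`), the thresholds, and the function names are hypothetical, and a simple majority vote stands in for the weak supervision techniques described above, which estimate how accurate each source is rather than counting votes equally.

```python
from collections import Counter

# Label values; ABSTAIN means a labeling function has no opinion on a record.
ABSTAIN, DENY, APPROVE = -1, 0, 1

# Hypothetical labeling functions: each encodes one SME heuristic as code.
def lf_high_dti(app):
    return DENY if app["debt_to_income"] > 0.45 else ABSTAIN

def lf_strong_credit(app):
    return APPROVE if app["credit_score"] >= 720 else ABSTAIN

def lf_recent_default(app):
    return DENY if app["defaults_last_2y"] > 0 else ABSTAIN

LABELING_FUNCTIONS = [lf_high_dti, lf_strong_credit, lf_recent_default]

def majority_label(app, lfs=LABELING_FUNCTIONS):
    """Combine non-abstaining votes by simple majority; abstain on ties."""
    votes = [v for v in (lf(app) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting signals with no clear winner
    return ranked[0][0]

applications = [
    {"debt_to_income": 0.30, "credit_score": 760, "defaults_last_2y": 0},
    {"debt_to_income": 0.50, "credit_score": 650, "defaults_last_2y": 1},
    {"debt_to_income": 0.50, "credit_score": 740, "defaults_last_2y": 0},
]
labels = [majority_label(app) for app in applications]
print(labels)  # [1, 0, -1]
```

The third record shows two sources in conflict. The majority vote gives up and abstains, which is exactly the case where a learned label model that weighs sources by estimated accuracy would be expected to do better.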
Trustworthy AI
At Snorkel AI, we’ve had the opportunity to partner with leading organizations, including financial institutions, healthcare providers, and the federal government. Such organizations continue to see important new AI ethics requirements around the explainability of the AI models they develop and their inferences. These critical requirements are meant to eliminate bias, prevent misuse, and promote fairness to increase AI trust.
There are several recurring requirements and tenets of the data foundation needed to practically ensure trustworthy AI applications, including:
- Integrity and correctness of the training data that AI learns from to drive reliability.
- Explicability of AI decisions and errors to create traceability and accountability.
- Human engagement between domain experts/annotators and data scientists to promote explainability and learning.
- Systemic bias correction to foster fairness.
- Protection of the data to respect privacy and prevent misuse.
What you’ll notice is common to each of the above is an emphasis on the foundation of training data that underpins the AI application. This approach to AI application development, which places the data foundation at the center of the application’s governance and operationalization, is called Data-Centric AI [7].
This is not to discount the many other elements of AI development that impact trustworthiness, from problem formulation to designing the end application that delivers results to stakeholders. Our strong perspective is that trustworthy data is the non-negotiable foundation and will help measure and quantify the impact of these downstream elements.
Furthermore, while black-box third-party APIs and pre-trained models can be an attractive way to skip the need for a training data foundation, a lack of transparency into their supply chain critically jeopardizes the governance of the aforementioned tenets.
Case study: Lending discrimination
Let’s consider an AI application tasked with lending decisioning for mortgages. The application is trained on a dataset with millions of data points.
“The Fair Housing Act makes it illegal to discriminate against someone because of race, color, religion, sex (including gender, gender identity, sexual orientation, and sexual harassment), familial status, national origin, or disability at any stage of the mortgage process.”
To ensure home ownership is equitable and adheres to protections codified in the Fair Housing Act, it is essential to evaluate if the application is learning from or using any protected attributes to make lending decisions. This is unlikely to be a one-time exercise, of course. As data distributions shift in the real world, unintended biased learning can occur, and as regulations evolve to include new protected classes, AI applications must be retrained. For example, the Fair Housing Act was amended in 1988 to protect people with disabilities and families with children [8], and one could foresee similar amendments (e.g., genetic information).
For this example, let’s consider that a new class has been protected. We want to check and correct any references to the protected attribute in our AI application.
The first step is to check the integrity of the training data to see if any of the labels reference the new attribute. If a dataset is manually labeled, the only infallible way to check the assigned labels for policy compliance or bias is to review each of the millions of labels by hand. This is impractical at scale. When we’ve labeled the training data programmatically, however, we can efficiently and reliably check if the labeling functions reference a protected attribute since they directly encapsulate the reasoning behind the labeling as logic and code.
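As a rough sketch of what such a check could look like when labeling functions are plain Python functions, the snippet below scans each function's compiled code for literal references to a protected field. The field name `genetic_marker` and both labeling functions are hypothetical. The scan only catches literal names and string constants, so a key built at runtime would slip through, and a docstring that merely mentions the field would be flagged.

```python
import types

# Hypothetical newly protected field.
PROTECTED_ATTRIBUTES = {"genetic_marker"}

def lf_high_dti(app):
    return 0 if app["debt_to_income"] > 0.45 else -1

def lf_genetic_risk(app):
    return 0 if app["genetic_marker"] == "positive" else -1

def referenced_strings(code):
    """Collect names and string constants in a code object, including nested ones."""
    found = set(code.co_names)
    for const in code.co_consts:
        if isinstance(const, str):
            found.add(const)
        elif isinstance(const, types.CodeType):
            found |= referenced_strings(const)  # lambdas, comprehensions, inner defs
    return found

def audit_labeling_functions(lfs, protected=PROTECTED_ATTRIBUTES):
    """Map each offending labeling function name to the protected fields it mentions."""
    flagged = {}
    for lf in lfs:
        hits = referenced_strings(lf.__code__) & protected
        if hits:
            flagged[lf.__name__] = sorted(hits)
    return flagged

flagged = audit_labeling_functions([lf_high_dti, lf_genetic_risk])
print(flagged)  # {'lf_genetic_risk': ['genetic_marker']}
```

The audit covers a handful of functions rather than millions of labels, which is what makes it practical to rerun whenever the list of protected attributes changes.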
But what if there are hidden relationships between an unprotected attribute, such as name or zip code, and the new protected attribute? Manual labeling is prone to the annotators’ hidden bias (conscious or not), making it intractable to ascertain after the fact whether an applicant’s name influenced a set of decisions. By reducing the labeling process to a manageable set of labeling functions, programmatic labeling makes it far easier to iterate with SMEs to learn and analyze each labeling function’s relationship to the protected attribute.
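One simple way to start that analysis, sketched below under stated assumptions, is to compare how often a labeling function fires for each group on an audit sample where the protected attribute is known. The labeling function, the placeholder zip codes, the `group` field, and the sample records are all made up for illustration. A large gap in firing rates is only a prompt for SME review, not proof of bias.

```python
from collections import defaultdict

ABSTAIN, DENY = -1, 0

# Hypothetical labeling function keyed on an unprotected attribute (placeholder zip codes).
def lf_zip_risk(app):
    return DENY if app["zip_code"] in {"11111", "22222"} else ABSTAIN

def firing_rate_by_group(lf, records, group_key):
    """Fraction of records in each group for which the labeling function does not abstain."""
    fired, total = defaultdict(int), defaultdict(int)
    for record in records:
        group = record[group_key]
        total[group] += 1
        if lf(record) != ABSTAIN:
            fired[group] += 1
    return {group: fired[group] / total[group] for group in total}

# Made-up audit sample with a known protected-group field.
audit_sample = [
    {"zip_code": "11111", "group": "A"},
    {"zip_code": "22222", "group": "A"},
    {"zip_code": "33333", "group": "A"},
    {"zip_code": "33333", "group": "A"},
    {"zip_code": "33333", "group": "B"},
    {"zip_code": "33333", "group": "B"},
    {"zip_code": "11111", "group": "B"},
    {"zip_code": "44444", "group": "B"},
]

rates = firing_rate_by_group(lf_zip_risk, audit_sample, "group")
gap = max(rates.values()) - min(rates.values())
print(rates, gap)  # {'A': 0.5, 'B': 0.25} 0.25
```

Running the same comparison for every labeling function gives a short ranked list of candidates to walk through with SMEs.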
If we identify a cause of bias, we need a way to systematically fix the issue. With manual labeling, this would likely entail relabeling all the data because it is too hard to ascertain exactly which slices of the training data are compromised. This is impractical because we cannot ship the sensitive data externally, and relabeling internally is time-consuming, especially when the legal and societal repercussions are so severe. When we’ve labeled our training data programmatically, however, we can isolate the bias to a particular set of labeling functions and take swift action: editing or removing them, deciding whether to encode newly relevant knowledge into additional labeling functions, and creating a new training dataset with the push of a button.
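The remediation step might look like the following sketch: drop the flagged labeling functions, regenerate the labels, and list which records changed so the affected slice is visible. The labeling functions, field names, and the plain majority vote are hypothetical simplifications, not a description of any particular product's pipeline.

```python
ABSTAIN, DENY, APPROVE = -1, 0, 1

def lf_strong_credit(app):
    return APPROVE if app["credit_score"] >= 720 else ABSTAIN

def lf_zip_risk(app):  # hypothetical function flagged as a proxy during review
    return DENY if app["zip_code"] == "11111" else ABSTAIN

def label(app, lfs):
    """Majority vote over non-abstaining labeling functions; abstain on ties."""
    votes = [v for v in (lf(app) for lf in lfs) if v != ABSTAIN]
    approve, deny = votes.count(APPROVE), votes.count(DENY)
    if approve == deny:
        return ABSTAIN  # also covers the case with no votes at all
    return APPROVE if approve > deny else DENY

def relabel_without(apps, lfs, flagged_names):
    """Regenerate labels without the flagged functions and report which rows changed."""
    kept = [lf for lf in lfs if lf.__name__ not in flagged_names]
    before = [label(app, lfs) for app in apps]
    after = [label(app, kept) for app in apps]
    changed = [i for i, (old, new) in enumerate(zip(before, after)) if old != new]
    return after, changed

apps = [
    {"credit_score": 750, "zip_code": "11111"},
    {"credit_score": 680, "zip_code": "11111"},
    {"credit_score": 730, "zip_code": "22222"},
]
new_labels, changed = relabel_without(
    apps, [lf_strong_credit, lf_zip_risk], flagged_names={"lf_zip_risk"}
)
print(new_labels, changed)  # [1, -1, 1] [0, 1]
```

The `changed` indices identify exactly which training examples were influenced by the removed function, which is the information that is so hard to recover from hand-assigned labels.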
Case Study: Healthcare
In a well-known incident, IBM’s Watson for Oncology product reportedly produced multiple incorrect and unsafe cancer-treatment recommendations, partly because it was trained on just a handful of synthetic cancer cases [9]. Unfortunately, this practice of using synthetic and incomplete data is all too common in healthcare. Despite the rich trove of real-world medical data available, creating training data is often prohibitively difficult. First, the trained experts (e.g., doctors) don’t have the time to label the millions of data points needed. Second, privacy requirements prevent the data from being shared with external vendors (even if they had the necessary expertise). However, programmatic labeling can make it significantly more practical and manageable to use real data by engaging experts in the creation of and iteration on a handful of labeling functions without needing to ship the data externally.
Programmatic labeling also supports the explainability of healthcare AI decisions, which is paramount in ensuring safety and mitigating algorithmic bias concerns. Deep learning models are especially good at picking up on unintended signals. For example, a 2017 study found that biometric AI models built to classify gender picked up on mascara as a confounding factor, clearly biasing them [10]. Another study found that a commercially approved product for algorithm-aided diagnosis of melanoma from lesions was influenced by surgical skin markings and not the lesions themselves [11]. In this context, black-box / “pre-trained” models present a challenge in achieving explainability and transparency. A data-centric approach allows healthcare organizations to promote explainability by explicitly encoding the rationale behind each training data label used to teach a given model as labeling functions. These labeling functions are fully traceable and can be adapted if they are identified as a potential source of bias or poor outcomes.
Conclusion
AI applications are only as good as their training data foundation. Data-centric AI is a fundamental shift in AI development: practitioners move away from spending the lion’s share of their time on the model and toward spending more time on the integrity and governance of the AI application’s data foundation.
As we’ve shown, labeling data manually is not a practical solution to creating such a data foundation; it is unscalable, uninterpretable, and bias-prone. In contrast, programmatic labeling allows teams to encode their knowledge and label using inspectable labeling functions.
Beyond labeling, this approach keeps humans engaged in the components that require expertise and iteration (labeling functions, model iterations, inference explainability) practically and efficiently. This allows teams to ship AI applications faster and to ensure these applications are inspectable, explainable, adaptable, and more resilient.
By adopting a data-centric approach to AI from the very beginning, organizations set up the right systems for a governable and scalable data foundation.