What is data annotation?
Data annotation refers to the process of categorizing and labeling data for training datasets. In order for a training dataset to be usable, it must be categorized appropriately and annotated for a specific use case. With Snorkel Flow, organizations can annotate high-quality labeled training data via Labeling Functions and rapidly develop and adapt AI applications by iterating on labeled data programmatically.
Teams often overlook the importance of data annotation guidelines and best practices until they’ve run into problems caused by their absence. Supervised machine learning problems require labeled data, whether you are trying to analyze financial documents, build a fact-checking system, or automate other use cases. Snorkel Flow accelerates the process of generating labeled data via programmatic labeling, but teams still need a clear definition of the labels (i.e., ground truth).
Annotation guidelines are the guideposts that annotators, domain experts, and data scientists follow when labeling data. The critical steps for creating these guidelines are:
👩💼 Consider your audience (both the annotators and the downstream users of the data)
🔄 Iterate early to refine definitions
📍Consistently keep track of confusing and difficult data examples.
The Snorkel Flow platform supports this process by providing a custom annotation workspace and tagging capabilities for flagging ambiguous data points as a part of the end-to-end, data-centric AI application development platform. Moreover, the interplay of programmatic labeling and hand annotations can surface systematic problems in the annotation guidelines.
Why do we need guidelines for data annotation?
Supervised learning tasks like sentiment analysis or topic classification may seem straightforward at first glance, but they often involve a lot of gray areas. Does this sentence have a positive or negative sentiment:
I loved the acting, but the special effects were awful.
The best answer is mixed sentiment, but has that class been added to the label space? In Named Entity Recognition (NER), a common problem is distinguishing between GPEs (Geo-Political Entities or governments) and Locations. Consider this sentence:
The aliens attacked Britain.
Should Britain
be labeled as a government or a place? Clear annotation guidelines can help narrow down the set of unclear situations. This early process directly impacts your downstream applications, as models can only fit their inputs. For example, if a social media platform wants to build a model to detect harmful language — an underspecified process for labeling the training data may lead to legitimate content getting blocked, or vice-versa for inadvertently publishing violative content.
Next, let’s take a deeper look at real-world annotation guidelines. They may include class definitions, examples, and rules on when to skip an example. Consider the Part-of-Speech Tagging Guidelines for the Penn Treebank Project, which was developed in 1991. The resulting corpus is used as a basis of many modern NLP packages that include a Part-of-Speech (POS) component. For some classes, the guidelines assume that the annotator knows English grammar rules and simply provides the acronym to use:
Other guidelines include explicit examples:
A large portion of the guide is devoted to “Problematic Cases,” where the part of speech may not be apparent even to an expert in grammar. The difference between prepositions and particles can be exceptionally subtle.
The guidelines can get increasingly complicated depending on the task — for example, in the paper Annotating Argument Schemes that looked at argument structure in Presidential debates, an extensive flow-chart was used to define the intent of different argument types:
Both of those examples represent tasks with complex structure and high cardinality. However, even binary classification tasks may need to be worded diligently:
When classifying social media posts for toxic traits, detailed examples should be given to explain what language falls into each class. This scenario is challenging as multiple studies have found annotators’ predisposition to affect their perception of toxicity 1, 2. Subjectivity can come up in less nefarious contexts, too. Consider a financial group that wants to build a relevance model to identify news that may affect companies’ stock portfolios. Different annotators may perceive what the model should classify as relevant.
Building the data annotation guidelines
Now that we’ve established the importance of well-designed annotation guidelines, we will discuss some criteria for how to produce them. For additional recommendations, consider these sources: Rosette text analytics, Shared Tasks in Digital Humanities, and Best Practices for Managing Data Annotation by Bloomberg. The most important step is to test the guidelines early and iterate because problems can be quickly discovered once you look at the data.
Let’s consider the overall process in terms of two examples: designing guidelines for the POS Treebank and a news relevance task for a financial organization. First, we want to consider the audience. In this case, it is both the annotators and the downstream users of the application. For POS tagging, the annotators may be expert linguists or crowd workers with standard English education. Each group will require different levels of detail in the guidelines. We would also consider whether the nuanced uses of “there” (as shown earlier) will matter to the model trained on this dataset. In the news relevance task, the annotators and downstream application users will be the same analysts, so the definition of relevance will be up to them, but we still want alignment across the annotators.
With this consideration, we will create a first pass of the guidelines. These should include a definition of each class and a couple of examples. Teams should always have a couple of expert annotators label a few dozen examples so everyone involved can review disagreements, points of confusion, and missing definitions. Deciding how to handle these cases early helps avoid relabeling more significant swaths of data later. Even a small set of data can reveal problems: we may realize that “one” can be used both as a cardinal number or a noun; or that two analysts disagree on whether a news article about vaccine availability will affect a specific company; or that we forgot to include a class for the mixed sentiment. We can empirically consider how subjective the task is using one of several empirical annotator agreement metrics.
“Newsflash: Ground truth isn’t true. It’s an ideal expected result according to the people in charge.”
Cassie Kozyrkov, Chief Decision Scientist, Google
With subjectivity, we should consider potential annotator biases when designing the guidelines. We discussed how annotator demographics could influence their perception of toxicity in social media. In such cases, a very detailed rubric can be used to mitigate some of the effects. In the next section, we will discuss how programmatic labeling can help us recognize some of these problems and fix the guidelines and the labels.
Snorkel Flow’s advantage in data annotation
The Snorkel Flow platform can help with all aspects of designing annotation guidelines.
First, Snorkel Flow provides an in-platform annotation workspace that is integrated with the main model development loop. In the annotation workspace, it provides inter-annotator agreement metrics and the ability to comment on individual data points. Next, the platform has a built-in capability for tagging data points. This allows annotators to note confusing examples early in the process and bring them up for discussion within the platform. By keeping the annotation and development processes in the platform, we tighten the loop for iterating annotation guidelines.
Snorkel’s Labeling Functions provide a powerful mechanism for resolving problems with ground truth. Using heuristics, users can encode the guidelines. By evaluating them on hand-labeled data, we can discover underspecified cases. Let’s say that for the parts-of-speech system, we encode a rule:
If the word is “one”, then it is a cardinal number.
If this rule gets low precision based on ground truth, we will quickly see that some annotators consider one a noun in specific cases. This points us to a place where we can iterate on the guidelines. For news relevance, a rule that says vaccines → non-relevant and getting low precision will point to another point for discussion amongst the subject matter experts. Once an agreement is reached, the “controversial” examples can be easily pulled up in the platform and fixed by filtering on the rule.
Finally, Snorkel Flow makes it simple to edit the label space.
Going back to the initial example: I loved the acting, but the special effects were awful
. If you decide that your application needs a new mixed sentiment class, Snorkel Flow’s Label Schema Editor allows this change and preserves existing ground truth and rules.
Final thoughts on data annotation
For many reasons, the whole definition and nuances of a task are often not going to be evident when we approach a new task. The data may contain unexpected edge cases, the downstream users may disagree on specific labels, or we may have defined the guidelines in a manner subjective to the annotators. These situations are natural parts of the process, and most of them can be handled through an iterative and collaborative-driven process. The Snorkel Flow platform accelerates this process both through a built-in annotation workspace and the ability to use programmatic labeling to capture assumptions and disagreements in the labeling process.
If you are interested in learning more about data annotation guidelines and how they work within Snorkel Flow, request a demo with one of our machine learning experts. We would be happy to go over the specifics of your use case and how programmatic data labeling can be applied to accelerate your AI efforts.
Stay in touch with Snorkel AI, follow us on Twitter, LinkedIn, and Youtube, and if you’re interested in joining the Snorkel team, we’re hiring! Please apply on our careers page.
Footnotes:
1 Jiang, Jialun Aaron, Morgan Klaus Scheuerman, Casey Fiesler, and Jed R. Brubaker. 2021. “Understanding International Perceptions Of The Severity Of Harmful Content Online”. PLOS ONE 16 (8): e0256762. doi:10.1371/journal.pone.0256762.
2 2022. “Annotation with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection”. Homes.Cs.Washington.Edu.
Featured image by Clayton Robbins on Unsplash.