
Annotated data is an essential ingredient for training, evaluating, comparing, and productionizing machine learning models. It is therefore imperative that annotations are of high quality. Creating them requires good quality management and, with it, reliable quality estimates. If quality turns out to be insufficient during the annotation process, rectifying measures can then be taken: for instance, project managers can use quality estimates to improve annotation guidelines, retrain annotators, or catch as many errors as possible before release.

Quality estimation is often performed by having experts manually label instances as correct or incorrect, but checking all annotated instances tends to be expensive. In practice, therefore, usually only subsets are inspected; their sizes are mostly chosen without justification or regard to statistical power and, more often than not, are relatively small. Basing estimates on small samples, however, can lead to imprecise values for the error rate, while unnecessarily large samples waste money that could be better spent, for instance on more annotations.
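To make this trade-off concrete, the sketch below computes how many instances would have to be inspected so that a confidence interval for the error rate reaches a desired half-width. It is a minimal illustration using the textbook normal-approximation (Wald) interval and SciPy; the `expected_error_rate` and `half_width` parameters are illustrative, and the interval used in the paper may differ.

```python
# Sketch: minimal sample size so that a normal-approximation (Wald) confidence
# interval for the annotation error rate has at most the desired half-width.
# Illustrative only; the paper's exact interval construction may be different.
from math import ceil
from scipy.stats import norm

def min_sample_size(expected_error_rate: float, half_width: float,
                    confidence: float = 0.95) -> int:
    """Smallest n such that the Wald CI for a proportion has the given half-width."""
    z = norm.ppf(1 - (1 - confidence) / 2)   # e.g. 1.96 for 95% confidence
    p = expected_error_rate                  # use p = 0.5 for the worst case
    return ceil(z**2 * p * (1 - p) / half_width**2)

# e.g. estimate an error rate of around 10% to within +/- 3 percentage points:
print(min_sample_size(0.10, 0.03))  # -> 385 instances to inspect
```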

Therefore, we first describe in detail how to use confidence intervals to find the minimal sample size needed to estimate the annotation error rate. Then, we propose applying acceptance sampling as an alternative to error rate estimation and show that it can reduce the required sample sizes by up to 50% while providing the same statistical guarantees.
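As a rough sketch of the acceptance-sampling idea under a binomial model, the code below searches for a single sampling plan: inspect n instances and accept the batch if at most c errors are found, while bounding the producer's risk (rejecting a good batch) and the consumer's risk (accepting a bad one). The parameters `aql`, `rql`, `alpha`, `beta` and the brute-force search are illustrative assumptions, not the exact construction used in the paper.

```python
# Sketch: a single acceptance-sampling plan for annotation quality, assuming a
# binomial error model. Find the smallest sample size n and acceptance number c
# such that a batch at the acceptable error rate (aql) is accepted with
# probability >= 1 - alpha, while a batch at the rejectable error rate (rql)
# is accepted with probability <= beta.
from scipy.stats import binom

def single_sampling_plan(aql: float, rql: float,
                         alpha: float = 0.05, beta: float = 0.10):
    """Return (n, c): inspect n instances, accept the batch if <= c errors are found."""
    n = 1
    while True:
        for c in range(n + 1):
            accept_good = binom.cdf(c, n, aql)  # P(accept | error rate = aql)
            accept_bad = binom.cdf(c, n, rql)   # P(accept | error rate = rql)
            if accept_good >= 1 - alpha and accept_bad <= beta:
                return n, c
        n += 1

# e.g. accept batches with at most 2% errors, reject those with 8% or more:
n, c = single_sampling_plan(aql=0.02, rql=0.08)
print(n, c)  # inspect n instances; accept the batch if at most c errors are found
```

Comparing the sample size returned by such a plan with the confidence-interval sample size above is one way to see where the savings reported in the paper can come from.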

Related readings and updates.

All About Sample-Size Calculations for A/B Testing: Novel Extensions and Practical Guide

While there exists a large amount of literature on the general challenges and best practices for trustworthy online A/B testing, there are limited studies on sample size estimation, which plays a crucial role in trustworthy and efficient A/B testing that ensures the resulting inference has sufficient power and type I error control. For example, when the sample size is under-estimated, the statistical inference, even with the correct analysis…
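As background only (not the extensions proposed in that paper), a standard per-arm sample-size calculation for a two-proportion A/B test looks roughly like the sketch below; the conversion rates, significance level and power values are illustrative.

```python
# Generic sketch (not the paper's method): classical per-arm sample size for a
# two-sided two-proportion z-test with significance level alpha and desired power.
from math import ceil
from scipy.stats import norm

def ab_test_sample_size(p_control: float, p_treatment: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per arm to detect the difference between the two rates."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2)

# e.g. detect a lift from a 10% to a 12% conversion rate:
print(ab_test_sample_size(0.10, 0.12))  # users required in each arm
```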

Self-supervised Semi-supervised Learning for Data Labeling and Quality Evaluation

This paper was accepted at the Data-Centric AI Workshop at the NeurIPS 2021 conference. As the adoption of deep learning techniques in industrial applications grows with increasing speed and scale, successful deployment of deep learning models often hinges on the availability, volume, and quality of annotated data. In this paper, we tackle the problems of efficient data labeling and annotation verification under the human-in-the-loop setting. We…