Validation workflow

A validation workflow is the sequence of tasks executed to verify the correctness of a labeled dataset.

The Cambridge Dictionary defines “workflow” as “the way that a particular type of work is organized, or the order of the stages in a particular work process”. In the context of data labeling, a validation workflow is therefore a workflow whose goal is to verify the output of the data annotation process.

We have four workflows to evaluate the correctness of the labeled data:

  1. Without validation: A labeler annotates and there is no extra verification.
  2. With reviewing: An expert “reviewer” verifies the annotation, giving feedback to the labelers.
  3. Consensus voting: Multiple labelers annotate the same piece of data, and the label is used only if they agree.
  4. Honeypot (or ground-truth): An expert (often the client) annotates a subset of the data, which is then used as a benchmark to assess the quality of the labels that annotators subsequently provide.
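
As a rough illustration of workflows 3 and 4, here is a minimal Python sketch. The function names, the `min_agreement` parameter, and the dictionary layout (item id mapped to label) are assumptions made for this example, not part of any specific labeling tool. With the default `min_agreement=1.0`, a label is kept only when every labeler agrees, which matches the description above; lowering it to accept a majority is a possible variation, not something the workflow prescribes.

```python
from collections import Counter
from typing import Dict, Hashable, Optional, Sequence


def consensus_label(votes: Sequence[Hashable],
                    min_agreement: float = 1.0) -> Optional[Hashable]:
    """Return the agreed label, or None when agreement is too low.

    min_agreement is a hypothetical knob: 1.0 means unanimous agreement.
    """
    if not votes:
        return None
    # Most frequent label and how many labelers chose it.
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None


def honeypot_accuracy(labels: Dict[str, Hashable],
                      ground_truth: Dict[str, Hashable]) -> float:
    """Share of honeypot items on which a labeler matches the expert label."""
    # Only score the honeypot items this labeler has actually annotated.
    scored = [item for item in ground_truth if item in labels]
    if not scored:
        raise ValueError("labeler has not annotated any honeypot item")
    hits = sum(labels[item] == ground_truth[item] for item in scored)
    return hits / len(scored)
```

For example, `consensus_label(["cat", "dog", "cat"])` discards the item because the labelers disagree, while `honeypot_accuracy` gives a per-labeler score that can be tracked over time.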

The workflows vary in terms of the stakeholders involved, their seniority, and the complexity of each task. The workflow is established at the beginning of the project and tailored to its complexity.

Choosing the validation workflow

The choice of validation workflow is up to the client and depends on its needs. This decision is very flexible: the client can choose which workflow to use and whether to apply the validation process to samples or to the entire production dataset.

It’s important to highlight that the workflow isn’t fixed for the entire duration of the project. For projects with reviewing, reviewing 100% of the labels throughout would be excessively costly. An adaptable approach is more effective: start with a strong review process and, as labelers improve their performance through feedback from reviewers, gradually reduce the proportion of labels that are reviewed. Consensus voting, which is typically applied to a fraction (e.g. 10-20%) of labels, follows the same principle. Agility is key, enabling responsive adjustments that positively impact cost, speed, and labeling quality.
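
One way to picture this gradual reduction is the following Python sketch. It is only an illustration under stated assumptions: the linear rule, the `target`, `floor`, and `ceiling` values, and both function names are hypothetical and would need to be tuned for each project.

```python
import random
from typing import List, Optional, Sequence


def review_rate(recent_accuracy: float, target: float = 0.95,
                floor: float = 0.1, ceiling: float = 1.0) -> float:
    """Map a labeler's recent reviewed accuracy to the share of labels to review.

    Assumed rule: review everything at accuracy 0, drop linearly to `floor`
    once the labeler reaches the `target` accuracy.
    """
    if not 0.0 <= recent_accuracy <= 1.0:
        raise ValueError("recent_accuracy must be between 0 and 1")
    if recent_accuracy >= target:
        return floor
    return ceiling - (ceiling - floor) * (recent_accuracy / target)


def pick_for_review(label_ids: Sequence[int], rate: float,
                    seed: Optional[int] = None) -> List[int]:
    """Randomly select the given share of labels to send to a reviewer."""
    rng = random.Random(seed)  # seeded for reproducible sampling
    k = round(len(label_ids) * rate)
    return rng.sample(list(label_ids), k)
```

A new labeler with no track record would start at `review_rate(0.0)`, i.e. full review, and the share would shrink toward the floor as reviewed accuracy improves. The same sampling helper could be used to choose the fraction of items sent to consensus voting.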

Finally, a validation workflow is one of the three approaches we use to assess quality in data labeling. If you want to know more about them, don’t hesitate to take a look at our article on high-quality.