Enhancing worker management and supporting external tasks in crowdsourced data labeling

Date

2023-12

Abstract

Human data labeling is key to training supervised machine learning (ML) models. We propose a new software infrastructure layer that augments the capabilities of Amazon’s SageMaker Ground Truth (GT) data labeling platform. Whereas crowdsourced annotation via Amazon Mechanical Turk (MTurk) is well-established, Amazon’s more recent GT platform is less well known but specifically designed to support ML annotation. Differentiating features include a curated “public crowd” sourced from MTurk, and integration of human labeling into Amazon’s broader SageMaker ML tool suite, which provides an end-to-end pipeline for training and deploying ML services. Key features of our software layer include: 1) continuous worker performance monitoring with respect to Requester gold labels; 2) automatic restriction of task access when performance standards are not met; 3) geographic restriction of task access to US-based workers; and 4) the ability to conduct external tasks off-platform while sourcing workers from GT and continuing to use GT’s payment system. Our design seeks to streamline the Requester experience with minimal changes, and to use a sustainable software design that eases long-term management, extension, and maintenance. More generally, our design goals center on promoting efficient, user-friendly, and quality-focused data labeling with crowdsourced annotators.
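As a rough illustration of features 1) and 2), the Python sketch below shows how per-worker agreement with Requester gold labels could be accumulated and used to gate task access once accuracy falls below a threshold. The class name `WorkerMonitor`, the `min_accuracy` and `min_gold_answers` thresholds, and the simulated answer stream are illustrative assumptions, not the thesis's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerMonitor:
    """Tracks each worker's agreement with Requester gold labels and flags
    workers who fall below a minimum accuracy once enough gold items are seen."""
    min_accuracy: float = 0.8        # hypothetical performance standard
    min_gold_answers: int = 10       # avoid judging workers on too few gold items
    stats: dict = field(default_factory=dict)  # worker_id -> (correct, total)

    def record(self, worker_id: str, answer: str, gold: str) -> None:
        """Update a worker's running tally after a gold-labeled task is completed."""
        correct, total = self.stats.get(worker_id, (0, 0))
        self.stats[worker_id] = (correct + int(answer == gold), total + 1)

    def should_restrict(self, worker_id: str) -> bool:
        """Return True when a worker has seen enough gold items and is under threshold."""
        correct, total = self.stats.get(worker_id, (0, 0))
        if total < self.min_gold_answers:
            return False  # not enough evidence yet
        return correct / total < self.min_accuracy


if __name__ == "__main__":
    monitor = WorkerMonitor()
    # Simulated stream of (worker, answer, gold label) triples from completed tasks.
    for worker, answer, gold in [("W1", "cat", "cat"), ("W1", "dog", "cat")] * 6:
        monitor.record(worker, answer, gold)
        if monitor.should_restrict(worker):
            # In the actual layer this would revoke the worker's task access,
            # e.g., by updating a gating check consulted before serving GT tasks.
            print(f"Restricting task access for {worker}")
            break
```

In this sketch the restriction decision is purely threshold-based; the real system could plug any scoring policy into `should_restrict` without changing how tasks are gated.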
