Enhancing worker management and supporting external tasks in crowdsourced data labeling






Human data labeling is key to training supervised machine learning (ML) models. We propose a new software infrastructure layer that augments the capabilities of Amazon’s SageMaker Ground Truth (GT) data labeling platform. Whereas crowdsourced annotation via Amazon Mechanical Turk (MTurk) is well established, Amazon’s more recent GT platform is less well known but is specifically designed to support ML annotation. Its differentiating features include a curated “public crowd” sourced from MTurk and the integration of human labeling into Amazon’s broader SageMaker ML tool suite, which provides an end-to-end pipeline for training and deploying ML services. Key features of our software layer include: 1) continuous monitoring of worker performance with respect to Requester gold labels; 2) automatic restriction of task access when performance standards are not met; 3) geography-based restriction of task access to US-based workers; and 4) the ability to conduct external tasks off-platform while sourcing workers from GT and continuing to use GT’s payment system. Our design seeks to streamline the Requester experience with minimal changes and to use a sustainable software design that eases long-term management, extension, and maintenance. More generally, our design goals center on promoting efficient, user-friendly, and quality-focused data labeling with crowdsourced annotators.
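
Features 1) and 2) above can be illustrated with a minimal sketch. It is not the authors' implementation and does not call any SageMaker Ground Truth or MTurk API. The names `GoldMonitor`, `WorkerRecord`, `min_accuracy`, and `min_gold`, along with the threshold values, are hypothetical and chosen only for illustration. The sketch assumes that each worker's answers on Requester gold items are tallied, and that task access is withheld once enough gold answers exist and accuracy falls below a set standard.

```python
from dataclasses import dataclass


@dataclass
class WorkerRecord:
    """Running tally of one worker's answers on Requester gold items."""
    gold_seen: int = 0
    gold_correct: int = 0

    @property
    def accuracy(self) -> float:
        # A worker with no gold answers yet is treated as unrated (1.0).
        return self.gold_correct / self.gold_seen if self.gold_seen else 1.0


class GoldMonitor:
    """Hypothetical monitor: tracks gold accuracy and gates task access."""

    def __init__(self, min_accuracy: float = 0.8, min_gold: int = 5):
        self.min_accuracy = min_accuracy  # assumed performance standard
        self.min_gold = min_gold          # gold answers needed before judging
        self.records = {}                 # worker_id -> WorkerRecord

    def record(self, worker_id: str, label: str, gold_label: str) -> None:
        """Update a worker's tally after they answer a gold item."""
        rec = self.records.setdefault(worker_id, WorkerRecord())
        rec.gold_seen += 1
        rec.gold_correct += int(label == gold_label)

    def is_allowed(self, worker_id: str) -> bool:
        """Return False only when there is enough evidence of low accuracy."""
        rec = self.records.get(worker_id)
        if rec is None or rec.gold_seen < self.min_gold:
            return True
        return rec.accuracy >= self.min_accuracy


monitor = GoldMonitor(min_accuracy=0.8, min_gold=5)
for answer in ["cat", "cat", "dog", "dog", "cat"]:
    monitor.record("worker-A", answer, gold_label="cat")
allowed = monitor.is_allowed("worker-A")  # 3 of 5 correct, below the standard
```

Requiring a minimum number of gold answers before restricting access is one possible design choice: it avoids excluding a worker on the basis of a single early mistake.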


