Handwriting transcription using word spotting with humans in the loop
Handwritten materials are increasingly being digitized and made available for scholarly analysis and research. To this end, numerous specialized software tools have been developed to support the crowdsourced transcription of such texts. However, because many of these tools operate at the page level, they are unsuitable for documents containing privacy-sensitive data such as medical records: displaying an entire page at a time risks disclosing that information to unintended parties. Additionally, manual transcription can be slow and expensive. Automated optical character recognition (OCR) methods perform poorly on handwritten text due to factors such as the large variability in human handwriting, degradation of paper documents, and scanning artifacts. Thus, handwritten text recognition and analysis remain active areas of research. With the renewed interest in neural networks, recent deep learning methods have achieved state-of-the-art results on benchmark datasets in areas including word recognition, word spotting, and character recognition. Despite this, current methods are not yet robust enough to fully automate handwriting transcription on their own. In this work, we report a novel approach that combines the efficiency of machine learning with the accuracy of human intelligence to semi-automatically transcribe a challenging real-world dataset of word images segmented from historical handwritten medical records as part of the Central State Hospital (CSH) Digital Library and Archives project. Specifically, we use a deep convolutional network to generate feature sets, identify groups of similar images using unsupervised hierarchical density-based clustering, and develop a system to obtain cluster transcriptions from human workers on an online crowdsourcing platform.
In doing so, we aim to reduce the number of images to be sent to the crowd, thereby optimizing monetary and time costs while maintaining an acceptable level of accuracy as well as preserving the privacy of the data.
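
To make the cost argument concrete, the following is a minimal sketch of how cluster-level transcriptions might be propagated back to individual word images. It is not the system described in this work: the function names, the majority-vote aggregation, the convention that unclustered images carry the label -1, and the sample labels and answers are all assumptions made for illustration. Feature extraction and clustering are taken as already done.

```python
from collections import Counter, defaultdict

NOISE = -1  # assumed convention: unclustered (outlier) images carry the label -1


def build_tasks(cluster_labels):
    """Group image indices into crowd tasks: one per cluster, one per outlier."""
    clusters = defaultdict(list)
    tasks = []
    for image_idx, label in enumerate(cluster_labels):
        if label == NOISE:
            tasks.append([image_idx])  # outliers are transcribed individually
        else:
            clusters[label].append(image_idx)
    # Append one task per cluster, in ascending order of cluster id.
    tasks.extend(clusters[label] for label in sorted(clusters))
    return tasks


def majority_vote(answers):
    """Return the most common transcription among the answers for one task."""
    return Counter(answers).most_common(1)[0][0]


def propagate(tasks, worker_answers, n_images):
    """Copy each task's agreed transcription to every image in that task."""
    transcriptions = [None] * n_images
    for task, answers in zip(tasks, worker_answers):
        word = majority_vote(answers)
        for image_idx in task:
            transcriptions[image_idx] = word
    return transcriptions


# Hypothetical cluster labels for eight word images (not data from the study).
labels = [0, 0, 1, -1, 1, 0, 2, 2]
tasks = build_tasks(labels)

# Hypothetical worker answers, one list per task, in the same order as `tasks`.
answers = [["fever"], ["patient", "patient", "patent"], ["the", "the"], ["ward"]]
result = propagate(tasks, answers, len(labels))

print(f"{len(tasks)} crowd tasks cover {len(labels)} images")
print(result)
```

In this toy case, four tasks cover eight images, and workers see only isolated word images or groups of them, never a full page. The savings in practice depend on how pure the clusters are, since a mixed cluster would propagate a wrong transcription to some of its members.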