In Codice Ratio: A crowd-enabled solution for low resource machine transcription of the Vatican Registers

Authors:

Highlights:

Abstract

In Codice Ratio is a research project that studies techniques for analyzing the contents of historical documents conserved in the Vatican Apostolic Archives. In this paper, we present our efforts to develop a system that supports the automatic transcription of medieval manuscripts while keeping the training data collection effort minimal. We focus on crowdsourcing as a means of scalable training data collection that does not require experts: using character symbols labeled by the crowd, we train a custom convolutional neural network that jointly learns to identify correct character shapes and to recognize characters. Our approach generates candidate transcriptions by submitting over-segmented character strokes and their combinations to this classifier, then ranks and selects the most promising outputs by combining the recognition confidence with character-level and word-level statistical language models.

We conducted experiments on an unreleased corpus, the Vatican Registers. Training our system on 20 pages annotated by the crowd, we obtained good results (19% character error rate, CER). Comparisons with an off-the-shelf system trained on 20 pages annotated with the same process, and with a professional system trained on more than 300 pages transcribed by skilled paleographers, demonstrate the opportunities of the proposed approach.
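
The abstract reports its result as a character error rate (CER). As a minimal sketch, assuming the common definition (Levenshtein edit distance between the reference and the system transcription, divided by the reference length), the metric can be computed as follows. The function names `levenshtein` and `cer` are illustrative, and the paper's exact evaluation protocol (for example, how whitespace or abbreviations are normalized) is not stated in this abstract.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `ref` into `hyp`."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length.
    Assumed definition; not taken from the paper itself."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(ref, hyp) / len(ref)
```

Under this definition, a CER of 19% means that roughly 19 character edits per 100 reference characters are needed to correct the system output. Note that the value can exceed 1.0 when the hypothesis is much longer than the reference.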

Keywords:

Article history: Received 25 January 2021, Revised 22 March 2021, Accepted 12 April 2021, Available online 6 May 2021, Version of Record 6 May 2021.

Official URL: https://doi.org/10.1016/j.ipm.2021.102606