Word spotting in historical documents using primitive codebook and dynamic programming
Abstract
Word searching and indexing in historical document collections is a challenging problem because text characters are often touching or broken due to degradation and aging. In this paper, we present a novel approach to word spotting based on decomposing text lines into character primitives and matching the resulting strings. The text lines are first separated by a segmentation process. Each text line is then described as a sequence of primitive labels, where each primitive corresponds to a single character or part of a character. The representative primitives are drawn from a codebook of shapes generated from training pages taken from the collection. During indexing, the text lines are transcribed into strings of primitives in an offline stage and stored in files. For this purpose, we use an efficient multi-label indexing strategy that combines two levels of primitive analysis: coarse and fine. During retrieval, the query word image is encoded into strings of coarse and fine primitives chosen according to the codebook. Finally, a dynamic programming method based on approximate string matching finds similar primitive sequences in the text lines of the collection at run time. We present an experimental evaluation on datasets of real-life document images gathered from historical books in different scripts. The experimental results show that the method is robust when searching text in noisy documents.
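The retrieval step described in the abstract can be illustrated with a minimal sketch of approximate substring matching by dynamic programming. This is not the paper's exact method: it assumes a plain unit-cost edit distance (the classic Sellers-style recurrence), a single label level instead of the coarse and fine levels, and hypothetical primitive labels such as "p3". The function name `spot_query` and its parameters are likewise illustrative.

```python
from typing import List, Sequence, Tuple


def spot_query(
    query: Sequence[str], line: Sequence[str], max_cost: int
) -> List[Tuple[int, int]]:
    """Return (end_index, cost) for each position in `line` where some
    substring ending there matches `query` within `max_cost` edits.

    Assumption: unit costs for substitution, insertion, and deletion.
    """
    m = len(query)
    # prev[i] = cheapest cost of aligning query[:i] with a substring of
    # `line` ending just before the current position.
    prev = list(range(m + 1))
    hits: List[Tuple[int, int]] = []
    for j, label in enumerate(line):
        # curr[0] stays 0 so that a match may start at any position.
        curr = [0] * (m + 1)
        for i in range(1, m + 1):
            substitute = prev[i - 1] + (0 if query[i - 1] == label else 1)
            skip_line_label = prev[i] + 1
            skip_query_label = curr[i - 1] + 1
            curr[i] = min(substitute, skip_line_label, skip_query_label)
        if curr[m] <= max_cost:
            hits.append((j, curr[m]))
        prev = curr
    return hits


# Hypothetical primitive strings for one indexed text line and one query.
line = ["p9", "p3", "p7", "p1", "p4", "p3", "p2", "p1"]
query = ["p3", "p7", "p1"]
exact_hits = spot_query(query, line, max_cost=0)
fuzzy_hits = spot_query(query, line, max_cost=1)
```

Raising `max_cost` trades precision for tolerance to broken or touching characters, since each mislabeled, missing, or extra primitive adds one unit of cost under the assumed scheme.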
Keywords: Word spotting, Document indexing, Approximate string matching, Coarse-to-fine
Review history: Received 13 November 2013, Revised 4 September 2015, Accepted 21 September 2015, Available online 22 October 2015, Version of Record 30 October 2015.
Paper URL: https://doi.org/10.1016/j.imavis.2015.09.006