Language identification for handwritten document images using a shape codebook

Authors:

Highlights:

Abstract

Language identification for handwritten document images is an open document analysis problem. In this paper, we propose a novel approach to language identification for documents containing a mixture of handwritten and machine-printed text, using image descriptors constructed from a codebook of shape features. We encode local text structures using scale- and rotation-invariant codewords, each representing a segmentation-free shape feature that is generic enough to be detected repeatably. We learn a concise, structurally indexed shape codebook from training data by clustering similar feature types and partitioning them through graph cuts. Our approach is easily extensible and does not require skew correction, scale normalization, or segmentation. We quantitatively evaluate our approach on a large real-world document image collection of 1512 documents in eight languages (Arabic, Chinese, English, Hindi, Japanese, Korean, Russian, and Thai), which contains a complex mixture of handwritten and machine-printed content. Experiments demonstrate the robustness and flexibility of our approach and show exceptional language identification performance that exceeds the state of the art.
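
The general idea of a codebook-based image descriptor can be illustrated with a minimal sketch. It is not the authors' method: plain k-means stands in for the graph-cut partitioning described above, the local shape features are assumed to be already extracted as small numeric vectors, and all function names (`learn_codebook`, `nearest_codeword`, `codebook_histogram`) are hypothetical.

```python
from math import fsum


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return fsum((x - y) ** 2 for x, y in zip(a, b))


def nearest_codeword(feature, codebook):
    """Index of the codeword closest to the feature (first one wins ties)."""
    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))


def learn_codebook(features, k, iterations=10):
    """Learn k codewords with plain k-means (Lloyd's algorithm).

    Assumption: k-means replaces the paper's graph-cut partitioning, and the
    first k features are used as the initial codewords.
    """
    codebook = [list(f) for f in features[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for f in features:
            clusters[nearest_codeword(f, codebook)].append(f)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous codeword
                codebook[i] = [fsum(col) / len(members) for col in zip(*members)]
    return codebook


def codebook_histogram(features, codebook):
    """Describe one document as normalized codeword occurrence frequencies."""
    counts = [0] * len(codebook)
    for f in features:
        counts[nearest_codeword(f, codebook)] += 1
    total = len(features)
    return [c / total for c in counts] if total else [0.0] * len(codebook)


# Toy 2-D vectors standing in for local shape features (hypothetical data).
training_features = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
codebook = learn_codebook(training_features, k=2)

document_features = [(0.2, 0.4), (9.8, 10.3), (0.1, 0.9)]
descriptor = codebook_histogram(document_features, codebook)
print(descriptor)
```

Each document is reduced to a fixed-length vector regardless of how many local features it contains, so descriptors of this kind can be passed to any standard classifier to predict the language.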

Keywords: Language identification, Shape descriptor, Shape codebook, Handwriting recognition, Document image analysis

Article history: Received 8 August 2008, Revised 24 November 2008, Accepted 21 December 2008, Available online 7 January 2009.

Paper URL: https://doi.org/10.1016/j.patcog.2008.12.022