Identification of different script lines from multi-script documents

Authors:

Highlights:

Abstract

To reach a wider readership, some documents are printed in several scripts and languages. Optical character recognition (OCR) of such a document page requires a software module that identifies the script of each text line before passing it to the OCR system for that script. This paper presents an automatic technique for identifying printed Roman, Chinese, Arabic, Devnagari and Bangla text lines in a single document. The technique uses script characteristics, shape-based features, statistical features, and features derived from the concept of water overflow from a reservoir. The scheme achieves an accuracy of about 97.33%.
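As an illustration of what a script characteristic can look like, the sketch below tests a binarized text line for a head-line, the long horizontal stroke that runs along the top of words in Devnagari and Bangla. It is a minimal sketch under stated assumptions, not the paper's method: the function names, the restriction to the upper half of the line, and the 0.7 threshold are all hypothetical choices made for the example.

```python
from typing import List

Image = List[List[int]]  # binary text-line image: 1 = ink, 0 = background


def row_ink_counts(line_img: Image) -> List[int]:
    """Horizontal projection profile: number of ink pixels in each row."""
    return [sum(row) for row in line_img]


def has_head_line(line_img: Image, min_fraction: float = 0.7) -> bool:
    """Return True if some row in the upper half is almost entirely ink.

    Assumption (hypothetical, not from the paper): a head-line shows up as
    a row in the upper half of the text line whose ink count is at least
    `min_fraction` of the line width.
    """
    if not line_img or not line_img[0]:
        return False
    width = len(line_img[0])
    upper_rows = line_img[: max(1, len(line_img) // 2)]
    peak = max(row_ink_counts(upper_rows))
    return peak >= min_fraction * width


# Toy line with a long stroke near the top (head-line-like).
with_head_line = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0],
]

# Toy line with only isolated strokes (no head-line).
without_head_line = [
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
]

print(has_head_line(with_head_line))
print(has_head_line(without_head_line))
```

A test like this could at most separate head-line scripts from the others. Telling Devnagari from Bangla, or Roman from Chinese and Arabic, would need the further shape-based, statistical and reservoir-based features that the abstract mentions.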

Keywords: Optical character recognition, Script lines, Head-line

Review history: Accepted 11 July 2002; available online 4 December 2002.

Official paper URL: https://doi.org/10.1016/S0262-8856(02)00101-4