Chinese text distinction and font identification by recognizing most frequently used characters

Authors:

Abstract

This study proposes a method that implements three functions to assist a traditional OCCR (Optical Chinese Character Recognition) system: (1) identifying the font used in a document; (2) detecting and recognizing the most frequently used (MFU) characters; and (3) distinguishing machine-printed from handwritten characters. According to Chang and Chen (Proceedings of the ICCC, 1994, pp. 310–316), the top-40 MFU characters account for about 20% of the Chinese characters in a text document. If these MFU characters can be detected before the traditional OCCR method is applied, considerable computation time can be saved.

The proposed character detection method consists of three stages: segmentation, feature extraction, and classification. In the first stage, the projection-profile-based method of Wang et al. (Pattern Recognition 30 (1997) 1213) is used to segment individual characters from the input text document. In the second stage, three types of features are introduced: the density of black pixels, the projection profile code, and the modified skeleton template. These features are used to check whether a segmented character is semi-matched or fully matched with an MFU template. In the last stage, three algorithms that implement the aforementioned functions are provided, based on the matching result. Experimental results demonstrate the practicality and superiority of the proposed method.
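The abstract names projection profiles and black-pixel density but does not define them, so the following is only a minimal sketch of how such quantities are commonly computed on a binarized image. It is not the segmentation method of Wang et al. or the paper's actual feature definitions. All function names, the `tolerance` parameter, and the pre-filter rule are hypothetical, chosen for illustration.

```python
from typing import List, Tuple

# Assumption: a binarized image is a list of rows, 1 = black pixel, 0 = white.
Glyph = List[List[int]]


def black_density(glyph: Glyph) -> float:
    """Fraction of pixels that are black."""
    total = sum(len(row) for row in glyph)
    black = sum(sum(row) for row in glyph)
    return black / total if total else 0.0


def projection_profiles(glyph: Glyph) -> Tuple[List[int], List[int]]:
    """Black-pixel counts per row (horizontal) and per column (vertical)."""
    horizontal = [sum(row) for row in glyph]
    vertical = [sum(col) for col in zip(*glyph)]
    return horizontal, vertical


def segment_columns(vertical: List[int]) -> List[Tuple[int, int]]:
    """Split a text line at empty columns.

    Returns half-open (start, end) column spans, one per run of
    non-empty columns. Generic projection-profile segmentation only.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for x, count in enumerate(vertical):
        if count > 0 and start is None:
            start = x
        elif count == 0 and start is not None:
            spans.append((start, x))
            start = None
    if start is not None:
        spans.append((start, len(vertical)))
    return spans


def density_prefilter(glyph: Glyph, template: Glyph, tolerance: float = 0.1) -> bool:
    """Hypothetical cheap check: keep a candidate only if its density is
    within `tolerance` of the template's density."""
    return abs(black_density(glyph) - black_density(template)) <= tolerance


# A toy 3x5 "text line" holding two glyphs separated by one blank column.
line = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1],
]
_, vertical = projection_profiles(line)
spans = segment_columns(vertical)
glyphs = [[row[a:b] for row in line] for a, b in spans]
```

A cheap feature such as density can discard most non-matching candidates before a costlier template comparison runs, which is in the spirit of the time savings the abstract describes. The threshold used here is arbitrary.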

Keywords: Feature extraction, Template matching, Character recognition, Font identification, Text distinction

Article history: Received 17 February 1999, Revised 9 August 2000, Accepted 26 September 2000, Available online 27 April 2001.

Official paper URL: https://doi.org/10.1016/S0262-8856(00)00082-2