Text identification for document image analysis using a neural network

Authors:


Abstract

A new bottom-up method is described that classifies the content of a mixed-type document into text and non-text areas. The proposed approach combines a new set of features with a self-organized neural network classifier. The features describe the contents and relationships of 3×3 masks, are selected by a statistical reduction procedure, and capture texture information. Next, a Principal Components Analyzer (PCA) is applied, which yields a reduced number of "effective" features. The final set of features is then used as the input vector to a neural network classifier based on a Kohonen Self-Organized Feature Map (SOFM). Document blocks are classified as text, graphics, or halftones, or assigned to secondary subclasses corresponding to special cases of the primary classes. The proposed method can identify text regions embedded in graphics and even overlapping regions, that is, regions that cannot be separated by horizontal and vertical cuts. The method was tested extensively on a variety of documents, with very promising results.
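
The two later stages named in the abstract, PCA reduction followed by a Kohonen map, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the 3×3 mask feature extraction is not shown, the input is assumed to be a ready-made feature matrix with one row per document block, and the function names (`pca_reduce`, `train_som`, `best_matching_units`), the grid size, and the learning-rate and neighbourhood schedules are all hypothetical choices made for illustration.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto its top principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    return Xc @ components.T, mean, components


def train_som(X, grid_shape=(4, 4), n_iter=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small Kohonen map; returns weights of shape (units, features).

    The linear decay schedules below are illustrative assumptions,
    not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # Grid position of every unit, used by the neighbourhood function.
    coords = np.array(
        [(r, c) for r in range(rows) for c in range(cols)], dtype=float
    )
    # Initialise each unit's weight vector from a randomly chosen sample.
    weights = X[rng.integers(0, len(X), size=rows * cols)].astype(float)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                      # learning rate decays to 0
        sigma = sigma0 * (1.0 - frac) + 0.5 * frac   # radius shrinks to 0.5
        x = X[rng.integers(0, len(X))]
        # Best-matching unit: the weight vector closest to the sample.
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
        # Gaussian neighbourhood on the grid, centred on the BMU.
        grid_d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)
    return weights


def best_matching_units(X, weights):
    """Index of the closest map unit for every row of X."""
    d2 = ((X[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```

In use, one would call `pca_reduce` on the block feature matrix, train the map on the reduced features, and then label each map unit (for example as text, graphics, or halftone) from blocks of known type. That labelling step is an assumption about how such a map is typically used and is not described in the abstract.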

Keywords: Block classification, Document segmentation, Page layout analysis, Neural network classifiers

Review history: Received 21 March 1997, Accepted 12 November 1997, Available online 5 January 1999.

Paper URL: https://doi.org/10.1016/S0262-8856(98)00055-9