Categorizing Paper Documents: A Generic System for Domain and Language Independent Text Categorization

Abstract

Text categorization assigns predefined categories to either electronically available texts or those resulting from document image analysis. A generic system for text categorization is presented which is based on statistical analysis of representative text corpora. Significant features are automatically derived from training texts by selecting substrings from actual word forms and applying statistical information and general linguistic knowledge. The dimension of the feature vectors is then reduced by linear transformation, keeping the essential information. The classification is a minimum least-squares approach based on polynomials. The described system can be efficiently adapted to new domains or different languages. In application, the adapted text categorizers are reliable, fast, and completely automatic. Two example categorization tasks achieve recognition scores of approximately 80% and are very robust against recognition or typing errors.
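The pipeline described in the abstract (substring features from word forms, a linear reduction of the feature vectors, and a polynomial least-squares classifier) can be illustrated with a minimal sketch. Everything specific here is an assumption, not the authors' implementation: character trigrams stand in for the selected substrings, a PCA projection computed by SVD stands in for the linear transformation, the polynomial degree is fixed at 2, and the names `ngram_counts`, `poly_expand`, and `SketchCategorizer` are hypothetical.

```python
import numpy as np
from itertools import combinations_with_replacement


def ngram_counts(text, n=3):
    """Count character n-grams inside each word, padded with '_' (assumed feature type)."""
    counts = {}
    for word in text.lower().split():
        padded = f"_{word}_"
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            counts[gram] = counts.get(gram, 0) + 1
    return counts


def vectorize(texts, vocab, n=3):
    """Turn texts into count vectors over a fixed n-gram vocabulary."""
    X = np.zeros((len(texts), len(vocab)))
    for row, text in enumerate(texts):
        for gram, count in ngram_counts(text, n).items():
            col = vocab.get(gram)
            if col is not None:  # n-grams unseen in training are ignored
                X[row, col] = count
    return X


def poly_expand(Z):
    """Degree-2 polynomial terms of each row: 1, z_i, and z_i * z_j."""
    d = Z.shape[1]
    cols = [np.ones(len(Z))] + [Z[:, i] for i in range(d)]
    cols += [Z[:, i] * Z[:, j]
             for i, j in combinations_with_replacement(range(d), 2)]
    return np.column_stack(cols)


class SketchCategorizer:
    """Hypothetical illustration only; not the system described in the paper."""

    def __init__(self, n=3, dim=2):
        self.n, self.dim = n, dim

    def _reduce(self, texts):
        X = vectorize(texts, self.vocab, self.n)
        return (X - self.mean) @ self.proj

    def fit(self, texts, labels):
        grams = sorted({g for t in texts for g in ngram_counts(t, self.n)})
        self.vocab = {g: i for i, g in enumerate(grams)}
        X = vectorize(texts, self.vocab, self.n)
        self.mean = X.mean(axis=0)
        # Linear reduction (assumed to be PCA): keep the top right-singular vectors.
        _, _, vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.proj = vt[:self.dim].T
        # One-hot targets, one column per category.
        self.classes = sorted(set(labels))
        Y = np.array([[1.0 if lab == c else 0.0 for c in self.classes]
                      for lab in labels])
        # Least-squares fit of polynomial terms to the category indicators.
        self.coef, *_ = np.linalg.lstsq(poly_expand(self._reduce(texts)), Y,
                                        rcond=None)
        return self

    def predict(self, texts):
        scores = poly_expand(self._reduce(texts)) @ self.coef
        return [self.classes[i] for i in np.argmax(scores, axis=1)]
```

A categorizer of this shape is trained with `SketchCategorizer().fit(texts, labels)` and applied with `predict(new_texts)`. Working on substrings rather than whole words is one plausible reason for the robustness to recognition and typing errors that the abstract reports, since a single wrong character changes only a few n-grams of a word.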

Review history: Received 14 February 1997, Accepted 21 December 1997, Available online 10 April 2002.

Official paper URL: https://doi.org/10.1006/cviu.1998.0687