A multi-class SVM classification system based on learning methods from indistinguishable chinese official documents

作者:

Highlights:

摘要

Support Vector Machines (SVM) has been developed for Chinese official document classification in One-against-All (OAA) multi-class scheme. Several data retrieving techniques including sentence segmentation, term weighting, and feature extraction are used in preprocess. We observe that most documents of which contents are indistinguishable make poor classification results. The traditional solution is to add misclassified documents to the training set in order to adjust classification rules. In this paper, indistinguishable documents are observed to be informative for strengthening prediction performance since their labels are predicted by the current model in low confidence. A general approach is proposed to utilize decision values in SVM to identify indistinguishable documents. Based on verified classification results and distinguishability of documents, four learning strategies that select certain documents to training sets are proposed to improve classification performance. Experiments report that indistinguishable documents are able to be identified in a high probability and are informative for learning strategies. Furthermore, LMID that adds both of misclassified documents and indistinguishable documents to training sets is the most effective learning strategy in SVM classification for large set of Chinese official documents in terms of computing efficiency and classification accuracy.

论文关键词:Support Vector Machines (SVM),Multi-class classification,Chinese official document classification,Indistinguishability identification,Incremental learning

论文评审过程:Available online 14 September 2011.

论文官网地址:https://doi.org/10.1016/j.eswa.2011.08.176