Virtual relevant documents in text categorization with support vector machines

Abstract

This paper explores the incorporation of prior knowledge into support vector machines as a means of compensating for a shortage of training data in text categorization. The prior knowledge about transformation invariance is generated by a virtual document method. The method applies a simple transformation to documents: it creates virtual documents by combining pairs of documents in the training set that are relevant to the same topic. A virtual document created in this way is expected not only to preserve the topic but also to improve the topical representation by exploiting relevant terms that are not given high importance in the individual real documents. The artificially generated documents change the distribution of the training data without randomization. Experiments with support vector machines based on linear, polynomial, and radial-basis function kernels showed the method's effectiveness on the Reuters-21578 collection for topics with a small number of relevant documents. The proposed method achieved improvements of 131%, 34%, and 12% in micro-averaged F1 for the 25, 46, and 58 topics with fewer than 10, 30, and 50 relevant training documents, respectively. Analysis of the results indicates that incorporating virtual documents contributes to a steady improvement in performance.
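The pairing step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how two documents are combined, so summing their term-frequency counts is an assumption made here for illustration; the function name `make_virtual_documents` and the toy documents are likewise hypothetical and not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def make_virtual_documents(relevant_docs):
    """Return one virtual document per unordered pair of relevant documents.

    Each document is a Counter of term frequencies. Combining a pair by
    summing the counts is an assumption of this sketch, not necessarily
    the combination rule used in the paper.
    """
    return [a + b for a, b in combinations(relevant_docs, 2)]


# Toy relevant documents for one hypothetical topic.
relevant = [
    Counter("wheat export price".split()),
    Counter("wheat harvest forecast".split()),
    Counter("grain export tariff".split()),
]

virtual = make_virtual_documents(relevant)

# The positive training set is extended with the virtual documents.
augmented_positives = relevant + virtual
```

Under this pairing scheme, n relevant documents yield n(n-1)/2 virtual documents, so the relative increase in positive examples is largest for topics that start with only a handful of relevant documents. The augmented set would then be vectorized and passed to an SVM trainer in the usual way.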

Keywords: Virtual document, Prior knowledge, Topical representation, Text categorization, Support vectors

Article history: Received 22 June 2006, Revised 1 August 2006, Accepted 16 August 2006, Available online 1 November 2006.

Paper URL: https://doi.org/10.1016/j.ipm.2006.08.010