Using the revised EM algorithm to remove noisy data for improving the one-against-the-rest method in binary text classification

Authors:

Abstract

Automatic text classification is the task of automatically assigning predefined categories to free-text documents, which reduces the manual labor that traditional classification methods require. When binary classifiers are applied to multi-class text classification, the one-against-the-rest method is usually used. In this method, a document that belongs to a particular category is regarded as a positive example of that category; otherwise, it is regarded as a negative example. As a result, each category has a positive data set and a negative data set. However, the one-against-the-rest method has a problem: the documents in a positive data set are labeled manually, whereas those in a negative data set are not. The negative data set therefore probably includes a lot of noisy data. In this paper, we propose applying the sliding window technique and the revised EM (Expectation Maximization) algorithm to binary text classification to solve this problem. We improve binary text classification by extracting potentially noisy documents from the negative data set with the sliding window technique and then removing the actually noisy documents with the revised EM algorithm. In our experiments, the proposed method achieved better performance than the original one-against-the-rest method on all the data sets and with all the classifiers used.
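The one-against-the-rest construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `one_against_the_rest`, the toy corpus, and the category names are hypothetical, and the sliding window and revised EM steps are not shown because the abstract does not specify them.

```python
def one_against_the_rest(labeled_docs, categories):
    """Build a (positive, negative) split of document ids for each category.

    labeled_docs: list of (doc_id, set_of_labels) pairs (hypothetical format).
    A document is positive for a category if it carries that label;
    every other document is placed in the negative set by default.
    """
    splits = {}
    for category in categories:
        positive = [doc_id for doc_id, labels in labeled_docs if category in labels]
        negative = [doc_id for doc_id, labels in labeled_docs if category not in labels]
        splits[category] = (positive, negative)
    return splits


# Toy corpus; the ids and labels are made up for illustration only.
docs = [
    ("d1", {"sports"}),
    ("d2", {"politics"}),
    ("d3", {"sports", "politics"}),
    ("d4", {"economy"}),
]
splits = one_against_the_rest(docs, ["sports", "politics", "economy"])
```

In this sketch, `d4` lands in the negative set for "sports" only because it lacks that label, not because anyone judged it to be unrelated to sports. Documents of this kind are the potential noise in the negative data set that the abstract refers to.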

Keywords: Binary text classification, The one-against-the-rest method, The EM algorithm, The sliding window technique

Review history: Received 7 March 2006, Revised 8 November 2006, Accepted 9 November 2006, Available online 18 January 2007.

Official paper URL: https://doi.org/10.1016/j.ipm.2006.11.003