Comparison of metrics for feature selection in imbalanced text classification

摘要

AbstractClass imbalance problems are often encountered in real applications of automatic text classifications especially at the so-called “one-against-all” settings and thus handling the problem with satisfactory performance is substantially important. In this paper, we focus our attention on a feature selection scheme for solving this problem and explore the abilities and characteristics of various metrics for feature selection. We examine three different types of metrics; Type-I: χP2 and Gini index, Type-II: χ2 and information gain and Type-III: signed χ2 and signed information gain. Type-I and Type-II metrics implicitly combine positive and negative features which indicate the membership and nonmembership of positive class, respectively. Type-III metrics were utilized in the combination framework in which the positive and negative features are explicitly combined and the degree of combination is optimized to improve the performance at imbalanced situations. Our experimental results show that feature selections using Type-I metrics on imbalanced data set achieve the comparable classification performances with those of the combination framework using Type-III metrics and proved to be much more superior to those of Type-II metrics. This result indicates that Type-I metrics serve as more simplified alternative methods for the combination framework. The characteristic behaviors and the performance of each of the used metrics are also investigated closely in terms of the distribution and quality of selected features.