Soft voting technique to improve the performance of global filter based feature selection in text corpus

Authors: Deepak Agnihotri, Kesari Verma, Priyanka Tripathi, Bikesh Kumar Singh

Abstract

In text classification, the Global Filter-based Feature Selection Scheme (GFSS) selects the top-N ranked words as features. It partially or completely discards low-ranked features from some classes; the low rank is usually due to the varying occurrence of words (terms) across classes. Latent Semantic Analysis (LSA) can be used to address this issue, as it eliminates redundant terms. It assigns an equal rank to terms that represent a similar concept or meaning, e.g. the four terms “carcinoma”, “sarcoma”, “melanoma”, and “cancer” all represent the concept “cancer”. Thus, whichever of these four terms an algorithm selects does not affect classifier performance. However, LSA does not guarantee that the top-N LSA-ranked terms selected by GFSS are representative terms of each class. An Improved Global Feature Selection Scheme (IGFSS) solves this issue by selecting an equal number of representative terms from all the classes. However, it has two issues. First, it assigns the class label and membership of each term on the basis of an individual vote of the Odds Ratio (OR) method, thereby limiting its decision-making capability. Second, the ratio of selected terms is determined empirically by the IGFSS, and a common ratio is applied to all classes to assign the positive and negative membership of the terms. However, the ratio of positive- to negative-nature terms varies from one class to another: it may be very low for one class and high for others. Thus, the single common negative-features ratio used by the IGFSS hurts those classes of a dataset in which positive- and negative-nature words are imbalanced. To address these issues of IGFSS, a new Soft Voting Technique (SVT) is proposed to improve the performance of GFSS. This paper makes two main contributions: (i) the weighted average score (soft vote) of three methods, viz. OR, Correlation Coefficient (CC), and GSS Coefficient (GSS), improves the numerical discrimination of words to identify their positive and negative membership to a class; (ii) a mathematical expression is incorporated into the IGFSS that computes a varying ratio of positive and negative memberships of the terms for each class, based on the occurrence of the terms in the classes. The proposed SVT is evaluated using four standard classifiers applied to five benchmark datasets. Experimental results based on the Macro_F1 and Micro_F1 measures show that SVT achieves a significant improvement in classifier performance in comparison with standard methods.
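The soft-vote idea from contribution (i) can be illustrated with a minimal sketch: normalize each method's scores, take a weighted average per term, and label terms as positive or negative members by comparing against a threshold. The scores, weights (equal here), and the mean-based threshold are illustrative assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def soft_vote(or_scores, cc_scores, gss_scores, weights=(1/3, 1/3, 1/3)):
    """Weighted average (soft vote) of three feature-ranking scores.

    Each method's scores are min-max normalized so that no single
    method dominates the combined vote. Equal weights are used here
    for illustration only.
    """
    def normalize(s):
        s = np.asarray(s, dtype=float)
        rng = s.max() - s.min()
        return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)

    stacked = np.vstack([normalize(or_scores),
                         normalize(cc_scores),
                         normalize(gss_scores)])
    w = np.asarray(weights, dtype=float).reshape(-1, 1)
    return (w * stacked).sum(axis=0)

# Hypothetical OR, CC, and GSS scores for four terms
combined = soft_vote([2.0, 0.5, 1.2, 3.0],
                     [0.9, 0.1, 0.4, 0.8],
                     [0.3, 0.05, 0.2, 0.35])

# Simple illustrative rule: terms whose soft vote exceeds the mean
# are treated as positive members of the class, the rest as negative.
positive = combined > combined.mean()
```

Combining three rankers this way is what distinguishes SVT from IGFSS, which relies on the single vote of the OR method alone.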

Keywords: Feature selection, Text classification, Term frequency, Text analysis, Text mining

Paper URL: https://doi.org/10.1007/s10489-018-1349-1