An automatic classification of text documents based on correlative association of words

Authors: Deepak Agnihotri, Kesari Verma, Priyanka Tripathi

Abstract

The training speed of a classifier, without degrading its predictive capability, is an important concern in text classification. Feature selection plays a key role in this context: it selects a subset of the most informative words (terms) from the set of all words. The correlative association of words with the classes increases the uncertainty about whether a word represents a class. The representative words of a class are of either positive or negative nature. The standard feature selection methods, viz. Mutual Information (MI), Information Gain (IG), Discriminating Feature Selection (DFS), and Chi Square (CHI), do not consider the positive and negative nature of words, which affects the performance of classifiers. To address this issue, this paper presents a novel feature selection method named Correlative Association Score (CAS). It combines the strength, mutual information, and strong association of words to determine their positive or negative nature for a class. CAS selects a few (k) informative words from the set of all words (m). These informative words generate a set of N-grams of length 1-3. Finally, the standard Apriori algorithm ensembles the power of CAS and CHI to select the top b most informative N-grams, where b is set by empirical evaluation. Multinomial Naive Bayes (MNB) and Linear Support Vector Machine (LSVM) classifiers evaluate the performance of the selected N-grams. Four standard text data sets, viz. Webkb, 20Newsgroup, Ohsumed10, and Ohsumed23, are used for experimental analysis. Two standard performance measures, Macro_F1 and Micro_F1, show a significant improvement in results with the proposed CAS method.
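As a rough illustration of the pipeline the abstract describes (score terms, select the top N-grams, then evaluate with MNB and LSVM), the sketch below uses scikit-learn. The CAS scoring formula is not given in this abstract, so SelectKBest with chi2 stands in as a placeholder scorer, and the feature counts and train/test split are illustrative assumptions, not values from the paper.

```python
# Minimal sketch, assuming scikit-learn; chi2 is a stand-in for the CAS/CHI
# ensemble described in the abstract, and k/b values below are arbitrary.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# 20Newsgroup is one of the four data sets used in the paper.
data = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42)

# Step 1: build N-grams of length 1-3 over the vocabulary; a real CAS
# implementation would first score each word's positive/negative association
# with every class before forming these N-grams.
vectorizer = CountVectorizer(ngram_range=(1, 3), max_features=50000)
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

# Step 2: keep the top-b N-grams (b is chosen empirically in the paper;
# 2000 here is only an illustration, scored with chi2 rather than CAS).
selector = SelectKBest(chi2, k=2000)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

# Step 3: evaluate the selected features with MNB and LSVM, reporting
# Micro_F1 and Macro_F1 as in the paper.
for name, clf in [("MNB", MultinomialNB()), ("LSVM", LinearSVC())]:
    clf.fit(X_train_sel, y_train)
    pred = clf.predict(X_test_sel)
    print(name,
          "Micro_F1=%.3f" % f1_score(y_test, pred, average="micro"),
          "Macro_F1=%.3f" % f1_score(y_test, pred, average="macro"))
```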

Keywords: Feature selection, Text classification, Term frequency, Text analysis, Text mining


Paper link: https://doi.org/10.1007/s10844-017-0482-3