Model-induced term-weighting schemes for text classification

Authors: Hyun Kyung Kim, Minyoung Kim

Abstract

The bag-of-words representation of text data is very popular for document classification. Recent literature has shown that properly weighting the term feature vector can improve classification performance significantly beyond the original term-frequency based features. In this paper we explain why recent term-weighting strategies succeed and suggest modifications that may be more reasonable. We then propose novel term-weighting schemes that can be induced from well-known probabilistic document models such as Naive Bayes and the multinomial term model. Interestingly, some of the intuition-based term-weighting schemes coincide exactly with the proposed derivations. Our term-weighting schemes are tested on large-scale text classification datasets, where we demonstrate improved prediction performance over existing approaches.
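
To make the idea of a model-induced term weight concrete, the following is a minimal sketch of one widely known supervised weighting, the smoothed Naive Bayes log-count ratio for a binary task. It is an illustration only: the abstract does not give the paper's formulas, so this should not be read as the authors' scheme. The function names, the smoothing parameter `alpha`, and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter

def log_count_ratio_weights(docs, labels, alpha=1.0):
    """Per-term weight: log of the smoothed class-conditional term
    probability under the positive class divided by that under the
    negative class (labels: 1 = positive, 0 = negative)."""
    vocab = sorted({t for tokens in docs for t in tokens})
    pos, neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (pos if y == 1 else neg).update(tokens)
    # Normalizers of the smoothed multinomial estimates for each class.
    pos_total = sum(pos[t] + alpha for t in vocab)
    neg_total = sum(neg[t] + alpha for t in vocab)
    return {
        t: math.log(((pos[t] + alpha) / pos_total) / ((neg[t] + alpha) / neg_total))
        for t in vocab
    }

def weight_document(tokens, weights):
    """Scale raw term frequencies by the learned per-term weights.
    Terms unseen during training get weight 0."""
    tf = Counter(tokens)
    return {t: tf[t] * weights.get(t, 0.0) for t in tf}

# Hypothetical toy corpus, already tokenized.
docs = [["good", "film"], ["bad", "film"], ["good", "good", "plot"], ["bad", "plot"]]
labels = [1, 0, 1, 0]
weights = log_count_ratio_weights(docs, labels)
features = weight_document(["good", "good", "film"], weights)
```

In this toy corpus, terms that appear mostly in positive documents receive positive weights, terms that appear mostly in negative documents receive negative weights, and terms spread evenly across both classes end up close to zero. The weighted vector can then be passed to any linear classifier in place of raw term frequencies.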

Keywords: Document/text classification, Feature/term weighting, Feature selection, Supervised learning

Paper URL: https://doi.org/10.1007/s10489-015-0745-z