Multinomial mixture model with feature selection for text clustering

Abstract

Selecting relevant features is a hard problem in unsupervised text clustering because there are no class labels to guide the search. This paper proposes a new mixture-model method for unsupervised text clustering, named the multinomial mixture model with feature selection (M3FS). M3FS introduces a component-dependent “feature saliency” into the mixture model: a feature is considered relevant to a given mixture component if its saliency for that component exceeds a predefined threshold. Feature selection is thereby treated as a parameter estimation problem, and the Expectation–Maximization (EM) algorithm is used to estimate the model. Experimental results on commonly used text datasets show that M3FS achieves good clustering performance and feature selection capability.
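The abstract does not give the M3FS equations, so the following is only a minimal sketch of the general idea, not the authors' estimator. It fits a plain multinomial mixture with EM and then thresholds a per-component saliency score. Everything named here is an assumption made for illustration: the function names, the toy corpus, the round-robin initialization, the smoothing constant `alpha`, the threshold of 0.6, and above all the saliency proxy itself (the ratio of a component's word probability to that probability plus the corpus-wide word probability), which is computed after EM rather than estimated inside it as M3FS does.

```python
import math

def fit_multinomial_mixture(docs, n_components, n_iter=30, alpha=0.01):
    """EM for a plain multinomial mixture over word-count vectors.

    docs: list of equal-length lists of non-negative word counts.
    Returns (weights, theta, resp). This is the base model only,
    not the M3FS model, whose equations the abstract does not give.
    """
    n_docs, n_words = len(docs), len(docs[0])
    # Assumed deterministic start: hard round-robin assignment of documents.
    resp = [[1.0 if k == i % n_components else 0.0
             for k in range(n_components)] for i in range(n_docs)]
    for _ in range(n_iter):
        # M-step: smoothed mixing weights and per-component word probabilities.
        weights = [(alpha + sum(r[k] for r in resp)) /
                   (n_docs + alpha * n_components)
                   for k in range(n_components)]
        theta = []
        for k in range(n_components):
            counts = [alpha + sum(resp[i][k] * docs[i][j] for i in range(n_docs))
                      for j in range(n_words)]
            total = sum(counts)
            theta.append([c / total for c in counts])
        # E-step: posterior responsibilities, computed in log space.
        for i in range(n_docs):
            logp = [math.log(weights[k]) +
                    sum(docs[i][j] * math.log(theta[k][j])
                        for j in range(n_words))
                    for k in range(n_components)]
            top = max(logp)
            unnorm = [math.exp(lp - top) for lp in logp]
            norm = sum(unnorm)
            resp[i] = [u / norm for u in unnorm]
    return weights, theta, resp

def saliency_proxy(docs, theta):
    """Hypothetical stand-in for feature saliency (not the M3FS definition):
    theta[k][j] / (theta[k][j] + background[j]), a value in (0, 1) that
    exceeds 0.5 when word j is more frequent in component k than overall."""
    n_words = len(docs[0])
    totals = [sum(d[j] for d in docs) for j in range(n_words)]
    grand = sum(totals)
    background = [t / grand for t in totals]
    return [[row[j] / (row[j] + background[j]) for j in range(n_words)]
            for row in theta]

def select_features(saliency, threshold):
    """Keep, per component, the word indices whose score exceeds the threshold."""
    return [[j for j, s in enumerate(row) if s > threshold] for row in saliency]

# Toy corpus (made up): words 0-1 mark one topic, words 2-3 another,
# and word 4 is equally common in every document.
docs = [
    [5, 4, 0, 0, 2],
    [6, 3, 0, 1, 2],
    [4, 5, 1, 0, 2],
    [0, 0, 5, 4, 2],
    [1, 0, 6, 3, 2],
    [0, 1, 4, 5, 2],
]
weights, theta, resp = fit_multinomial_mixture(docs, n_components=2)
saliency = saliency_proxy(docs, theta)
selected = select_features(saliency, threshold=0.6)
print(selected)
```

The point of the sketch is the shape of the pipeline: responsibilities and word distributions come from EM, and a feature counts as relevant to a component only when its score for that component clears the threshold, so different components can keep different words. On real data the deterministic initialization would normally be replaced by several random restarts.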

Keywords: Text clustering, Multinomial mixture model, Feature selection, Text mining

Review history: Received 27 June 2007, Accepted 24 March 2008, Available online 31 March 2008.

Official paper URL: https://doi.org/10.1016/j.knosys.2008.03.025