Model selection and application to high-dimensional count data clustering

Authors: Nuha Zamzami, Nizar Bouguila

Abstract

EDCM, the Exponential-family approximation to the Dirichlet Compound Multinomial (DCM) proposed by Elkan (2006), is an efficient statistical model for high-dimensional, sparse count data. EDCM models capture the burstiness phenomenon correctly while being many times faster than DCM. This work proposes using the Minimum Message Length (MML) criterion to determine the number of components that best describes the data in a finite EDCM mixture model. Parameter estimation is based on the previously proposed Deterministic Annealing Expectation-Maximization (DAEM) approach. The proposed unsupervised algorithm is validated on several real applications: text document modeling, topic novelty detection, and hierarchical image clustering. A comparison with results obtained using other information-theory-based selection criteria is provided.

Keywords: Finite mixture models, EDCM mixture, DAEM, Model selection, MML, Count data, Text clustering, Novelty detection, Hierarchical clustering

Paper URL: https://doi.org/10.1007/s10489-018-1333-9