Cluster validity functions for categorical data: a solution-space perspective

Authors: Liang Bai, Jiye Liang

Abstract

For categorical data, there are three widely used internal validity functions: the \(k\)-modes objective function, the category utility function and the information entropy function, all of which are defined using within-cluster information only. Many clustering algorithms have been developed that use them as objective functions and search for their optimal solutions. In this paper, we study the generalization, effectiveness and normalization of the three validity functions from a solution-space perspective. First, we present a generalized validity function for categorical data. Based on it, we analyze what the three validity functions have in common and how they differ in the solution space. Furthermore, we address the question of whether between-cluster information is ignored when these validity functions are used to evaluate clustering results. Finally, we analyze the upper and lower bounds of the three validity functions for a given data set, which can help estimate the clustering difficulty of a data set and compare the performance of a clustering algorithm across different data sets.
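
The three validity functions named in the abstract can be illustrated with a minimal Python sketch. It assumes the common textbook forms of each function (total Hamming distance to cluster modes, category utility averaged over clusters, and size-weighted within-cluster entropy); the paper's own formulations may differ, for example in normalization. The function names and the toy data layout (a list of equal-length tuples plus a list of cluster labels) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def _column_counts(rows):
    """Per-attribute value counts for a list of equal-length tuples."""
    return [Counter(col) for col in zip(*rows)]


def _split(data, labels):
    """Group objects by cluster label and return the groups as lists."""
    clusters = {}
    for row, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(row)
    return list(clusters.values())


def kmodes_objective(data, labels):
    """Total Hamming distance from each object to its cluster mode (lower is better)."""
    total = 0
    for rows in _split(data, labels):
        for counts in _column_counts(rows):
            # Objects that disagree with the most frequent value of this attribute.
            total += len(rows) - max(counts.values())
    return total


def category_utility(data, labels):
    """Assumed textbook form, averaged over clusters (higher is better)."""
    n = len(data)
    clusters = _split(data, labels)
    # Sum of squared value probabilities over the whole data set.
    base = sum((c / n) ** 2 for counts in _column_counts(data) for c in counts.values())
    cu = 0.0
    for rows in clusters:
        within = sum(
            (c / len(rows)) ** 2
            for counts in _column_counts(rows)
            for c in counts.values()
        )
        cu += len(rows) / n * (within - base)
    return cu / len(clusters)


def expected_entropy(data, labels):
    """Cluster-size-weighted sum of per-attribute entropies in bits (lower is better)."""
    n = len(data)
    total = 0.0
    for rows in _split(data, labels):
        h = -sum(
            (c / len(rows)) * log2(c / len(rows))
            for counts in _column_counts(rows)
            for c in counts.values()
        )
        total += len(rows) / n * h
    return total
```

All three functions read only the value frequencies inside each cluster, which matches the abstract's remark that they are defined on within-cluster information. Evaluating several candidate labelings of the same data set with these functions is one simple way to compare points in the solution space.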

Keywords: Cluster analysis, Cluster validity function, Generalization, Effectiveness, Normalization

Paper URL: https://doi.org/10.1007/s10618-014-0387-5