Clustering categorical data in projected spaces
作者:Mohamed Bouguessa
摘要
The problem of clustering categorical data has been widely investigated and appropriate approaches have been proposed. However, the majority of the existing methods suffer from one or more of the following limitations: (1) difficulty detecting clusters of very low dimensionality embedded in high-dimensional spaces, (2) lack of an automatic mechanism for identifying relevant dimensions for each cluster, (3) lack of an outlier detection mechanism and (4) dependence on a set of parameters that need to be properly tuned. Most of the existing approaches are inadequate for dealing with these four issues in a unified framework. This motivates our effort to propose a fully automatic projected clustering algorithm for high-dimensional categorical data which is capable of facing the four aforementioned issues in a single framework. Our algorithm comprises two phases: (1) outlier handling and (2) clustering in projected spaces. The first phase of the algorithm is based on a probabilistic approach that exploits the beta mixture model to identify and eliminate outlier objects from a data set in a systematic way. In the second phase, the clustering process is based on a novel quality function that allows the identification of projected clusters of low dimensionality embedded in a high-dimensional space without any parameter setting by the user. The suitability of our proposal is demonstrated through empirical studies using synthetic and real data sets.
论文关键词:Projected clustering, Categorical data, High dimensions, Mixture model
论文评审过程:
论文官网地址:https://doi.org/10.1007/s10618-013-0336-8