Incremental entropy-based clustering on categorical data streams with concept drift

Authors:

Abstract

Clustering on categorical data streams is a relatively new field that has received less attention than clustering on static data or on numerical data streams. One of the main difficulties in categorical data analysis is the lack of an appropriate way to define a similarity or dissimilarity measure on the data. In this paper, we propose three dissimilarity measures: a point-cluster dissimilarity measure (based on incremental entropy), a cluster–cluster dissimilarity measure (based on incremental entropy), and a dissimilarity measure between two cluster distributions (based on sample standard deviation). We then propose an integrated framework for clustering categorical data streams that consists of three algorithms: Minimal Dissimilarity Data Labeling (MDDL), Concept Drift Detection (CDD), and Cluster Evolving Analysis (CEA). We also compare the framework with other algorithms on several data streams synthesized from real data sets. The experiments show that the proposed algorithms are more effective at generating clustering results and at detecting concept drift.
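The abstract does not give the formulas behind the point-cluster measure, so the following is only an illustrative sketch of how an incremental-entropy dissimilarity could drive minimal-dissimilarity labeling. It assumes that a cluster's entropy is the sum of per-attribute Shannon entropies and that the dissimilarity is the change in size-weighted entropy when the point joins the cluster. Both choices, and all function names (`cluster_entropy`, `entropy_increase`, `build_cluster`, `label_point`), are hypothetical and may differ from the paper's definitions.

```python
from collections import Counter
from math import log2


def cluster_entropy(counts, n):
    """Sum of per-attribute Shannon entropies (in bits) for a cluster of n points."""
    if n == 0:
        return 0.0
    return sum(
        -sum((c / n) * log2(c / n) for c in attr.values())
        for attr in counts
    )


def entropy_increase(counts, n, point):
    """Assumed dissimilarity: size-weighted entropy after adding `point` minus before."""
    # Build updated value counts without mutating the cluster summary.
    after = [attr + Counter([v]) for attr, v in zip(counts, point)]
    return (n + 1) * cluster_entropy(after, n + 1) - n * cluster_entropy(counts, n)


def build_cluster(points):
    """Summarize equal-length categorical tuples as (per-attribute counts, size)."""
    return [Counter(col) for col in zip(*points)], len(points)


def label_point(clusters, point):
    """Return the index of the cluster whose entropy grows least when `point` joins."""
    return min(
        range(len(clusters)),
        key=lambda i: entropy_increase(*clusters[i], point),
    )


clusters = [
    build_cluster([("a", "x"), ("a", "x")]),
    build_cluster([("b", "y"), ("b", "z")]),
]
label_a = label_point(clusters, ("a", "x"))  # joins the pure ("a", "x") cluster
label_b = label_point(clusters, ("b", "y"))  # joins the cluster that already holds "b"
```

Keeping only per-attribute value counts and the cluster size means each incoming point can be scored without revisiting earlier points, which is the property a streaming setting needs. This sketch covers labeling only and does not attempt drift detection or cluster evolution analysis.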

Keywords: Categorical data stream, Clustering, Data labeling, Concept drift detection, Cluster evolving analysis

Article history: Received 7 January 2013, Revised 16 January 2014, Accepted 1 February 2014, Available online 7 February 2014.

Official paper URL: https://doi.org/10.1016/j.knosys.2014.02.004