Clustering data streams using grid-based synopsis

作者:Vasudha Bhatnagar, Sharanjit Kaur, Sharma Chakravarthy

摘要

Continually advancing technology has made it feasible to capture data online for onward transmission as a steady flow of newly generated data points, termed as data stream. Continuity and unboundedness of data streams make storage of data and multiple scans of data an impractical proposition for the purpose of knowledge discovery. Need to learn structures from data in streaming environment has been a driving force for making clustering a popular technique for knowledge discovery from data streams. Continuous nature of streaming data makes it infeasible to look for point membership among the clusters discovered so far, necessitating employment of a synopsis structure to consolidate incoming data points. This synopsis is exploited for building clustering scheme to meet subsequent user demands. The proposed Exclusive and Complete Clustering (ExCC) algorithm captures non-overlapping clusters in data streams with mixed attributes, such that each point either belongs to some cluster or is an outlier/noise. The algorithm is robust, adaptive to changes in data distribution and detects succinct outliers on-the-fly. It deploys a fixed granularity grid structure as synopsis and performs clustering by coalescing dense regions in grid. Speed-based pruning is applied to synopsis prior to clustering to ensure currency of discovered clusters. Extensive experimentation demonstrates that the algorithm is robust, identifies succinct outliers on-the-fly and is adaptive to change in the data distribution. ExCC algorithm is further evaluated for performance and compared with other contemporary algorithms.

论文关键词:Stream clustering, Synopsis, Micro-cluster, Grid structure, Exclusive clustering, Complete clustering

论文评审过程:

论文官网地址:https://doi.org/10.1007/s10115-013-0659-1