BLOCK-DBSCAN: Fast clustering for large scale data

作者:

Highlights:

摘要

We analyze the drawbacks of DBSCAN and its variants, and find the grid technique, which is used in Fast-DBSCAN and ρ-approximate DBSCAN, is almost useless in high dimensional data space. Because it usually yields considerable redundant distance computations. In order to tame these problems, two techniques are proposed: one is to use ϵ2-norm ball to identify Inner Core Blocks within which all points are core points, it has higher efficiency than grid technique for finding more core points at one time; the other is a fast approximate algorithm for judging whether two Inner Core Blocks are density-reachable from each other. Besides, cover tree is also used to accelerate the process of density computations. Based on the three techniques, an approximate approach, namely BLOCK-DBSCAN, is proposed for large scale data, which runs in about O(nlog (n)) expected time and obtains almost the same result as DBSCAN. BLOCK-DBSCAN has two versions, i.e., L2 version can work well for relatively high dimensional data, and L∞ version is suitable for high dimensional data. Experimental results show that BLOCK-DBSCAN is promising and outperforms NQDBSCAN, ρ-approximate DBSCAN and AnyDBC.

论文关键词:DBSCAN,ρ-approximate DBSCAN,BLOCK-DBSCAN,Core block

论文评审过程:Received 28 December 2018, Revised 18 April 2020, Accepted 29 August 2020, Available online 30 August 2020, Version of Record 1 September 2020.

论文官网地址:https://doi.org/10.1016/j.patcog.2020.107624