Clustering binary cube dimensions to compute relaxed GROUP BY aggregations

Authors:

Highlights:

Abstract

Computing cube aggregations on large data sets with high dimensionality is a crucial and costly task that normally requires multiple passes over the input table. The task becomes harder as the number of result groups grows with the number of combinations of dimension values. In this research, we focus on reducing the number of aggregations and providing a more succinct result by computing aggregations over groups of similar records, obtained through an efficient binary clustering of the fact table; this can be viewed as a relaxation of traditional OLAP cubes. We present an efficient window-based Incremental K-Means algorithm implemented in a DBMS as a user-defined function. A significant speedup is achieved through sufficient statistics, multithreading, efficient distance computation and sparse matrix operations. The algorithm's performance is experimentally compared against multiple variants of the K-Means algorithm. We show that our incremental K-Means algorithm achieves similar or better results much faster than the traditional K-Means algorithm. Moreover, we show that interesting aggregations can be efficiently obtained using the cluster identifier as a new cube dimension.
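
To make the idea concrete, the following is a minimal sketch of a one-pass K-Means over binary records that keeps only per-cluster sufficient statistics (a record count and a per-dimension sum) and then aggregates a measure by cluster identifier. It is an illustration under stated assumptions, not the paper's implementation: the function names, the seeding from the first `k` rows, the `window` parameter semantics, and the sample data are all hypothetical, and the DBMS user-defined function, multithreading, and sparse matrix operations mentioned in the abstract are omitted.

```python
from collections import defaultdict


def sq_dist(point, centroid):
    # Squared Euclidean distance between a binary record and a real-valued centroid.
    return sum((p - c) ** 2 for p, c in zip(point, centroid))


def incremental_kmeans(rows, k, window=2):
    """Single pass over binary rows; centroids are refreshed every `window` rows.

    Only sufficient statistics are stored per cluster: a count and a
    per-dimension sum, from which each centroid is sum / count.
    """
    d = len(rows[0])
    # Assumption of this sketch: seed each cluster with one of the first k rows.
    counts = [1] * k
    sums = [[float(v) for v in rows[j]] for j in range(k)]
    centroids = [list(s) for s in sums]
    labels = list(range(k))

    for i, row in enumerate(rows[k:], start=k):
        # Assign the row to the nearest current centroid.
        j = min(range(k), key=lambda c: sq_dist(row, centroids[c]))
        labels.append(j)
        # Update the sufficient statistics of the chosen cluster.
        counts[j] += 1
        for a in range(d):
            sums[j][a] += row[a]
        # Recompute centroids only at window boundaries.
        if (i - k + 1) % window == 0:
            centroids = [[s / counts[c] for s in sums[c]] for c in range(k)]

    # Final refresh so the returned centroids reflect every row.
    centroids = [[s / counts[c] for s in sums[c]] for c in range(k)]
    return labels, centroids


def aggregate_by_cluster(labels, measures):
    # Equivalent in spirit to: SELECT cluster_id, SUM(measure) ... GROUP BY cluster_id
    totals = defaultdict(float)
    for label, m in zip(labels, measures):
        totals[label] += m
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical fact table: four binary dimensions and one numeric measure.
    rows = [(1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 1, 0),
            (0, 0, 0, 1), (1, 0, 0, 0), (0, 1, 1, 1)]
    measures = [10.0, 20.0, 5.0, 7.0, 3.0, 4.0]
    labels, centroids = incremental_kmeans(rows, k=2, window=2)
    print(labels)                                  # [0, 1, 0, 1, 0, 1]
    print(aggregate_by_cluster(labels, measures))  # {0: 18.0, 1: 31.0}
```

Here six distinct dimension combinations, which an exact GROUP BY would report as six groups, collapse into two cluster-level aggregates. Keeping only counts and sums is what allows the centroids to be updated without revisiting earlier rows.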

Keywords: OLAP, Clustering, Binary streams

Review history: Received 18 March 2014, Revised 7 November 2014, Accepted 22 December 2014, Available online 31 December 2014, Version of Record 26 June 2015.

Official paper URL: https://doi.org/10.1016/j.is.2014.12.008