SubspaceDB : In-database subspace clustering for analytical query processing

作者:

Highlights:

摘要

High dimensional data analysis within relational database management systems (RDBMS) is challenging because of inadequate support from SQL. Currently, subspace clustering of high dimensional data is implemented either outside DBMS using wrapper code or inside DBMS using SQL User Defined Functions/Aggregates(UDFs/UDAs). However, both these approaches have potential disadvantages from performance, resource usage, and security perspective for voluminous and frequently updated data. Hence, we propose an efficient querying system, named SubspaceDB, that implements subspace clustering directly within an RDBMS. SubspaceDB provides a novel set of query operators, each with an optimization objective, to facilitate interactive analysis for subspace clustering. The query operators focus on retrieving optimal answers to four key query types : (a) Medoid queries, (b) Neighbourhood queries, (c) Partial similarity queries, and (d) Prominence queries, that aid the formation of subspace clusters. Experimental studies on real and synthetic databases of size 15M tuples and 104 attributes show that our proposed approach SubspaceDB can be over 10 times faster as compared to a conventional wrapper-based or SQL UDF approach. The proposed approach is also efficient in retrieving at least 50% data with performance improvement of at least 25%.

论文关键词:Machine learning,Subspace clustering,RDBMS,SQL operators,PostgreSQL

论文评审过程:Received 23 November 2017, Revised 9 January 2019, Accepted 16 May 2019, Available online 22 May 2019, Version of Record 7 June 2019.

论文官网地址:https://doi.org/10.1016/j.datak.2019.05.003