An embedding driven approach to automatically detect identifiers and references in document stores

作者：

Highlights：

•

摘要

NoSQL stores have become ubiquitous since they offer a new cost-effective and schema-free system. Although NoSQL systems are widely accepted today, Business Intelligence & Analytics (BI&A) wields relational data sources. Exploiting schema-free data for analytical purposes is a challenge since it requires reviewing all the BI&A phases, particularly the Extract-Transform-Load (ETL) process, to fit big data sources as document stores. In the ETL process, the join of several collections, with a lack of explicitly known join fields is a significant dare. Detecting these fields manually is time and effort-consuming and infeasible in large-scale datasets. In this paper, we study the problem of discovering join fields automatically. We introduce an algorithm that aims to automatically detect both identifiers and references on several document stores. The modus operandi of our approach underscores three core stages: (i) global schema extraction; (ii) discovery of candidate identifiers; and (iii) identifying candidate pairs of identifier and reference fields. We use scoring features and pruning rules to discover true candidate identifiers from many initial ones efficiently. To find candidate pairs between several document stores, we put into practice node2vec as a graph embedding technique, which yields significant advantages while using syntactic and semantic similarity measures for pruning pointless candidates. Finally, we report our experimental findings that show encouraging results.

论文关键词：Business intelligence and analytics,ETL,Document stores,Join,Identifier discovery,Reference discovery

论文评审过程：Received 31 July 2021, Revised 19 January 2022, Accepted 28 February 2022, Available online 12 March 2022, Version of Record 22 March 2022.

论文官网地址：https://doi.org/10.1016/j.datak.2022.102003