Applying machine learning techniques for scaling out data quality algorithms in cloud computing environments

Authors: Dimas Cassimiro Nascimento, Carlos Eduardo Pires, Demetrio Gomes Mestre

Abstract

Deduplication is the task of identifying the entities in a data set that refer to the same real-world object. Over the last decades, this problem has been extensively investigated, and many techniques have been proposed to improve the efficiency and effectiveness of deduplication algorithms. As data sets become larger, such algorithms may create critical bottlenecks in memory usage and execution time. In this context, cloud computing environments have been used for scaling out data quality algorithms. In this paper, we investigate the efficacy of different machine learning techniques for scaling out virtual clusters that execute deduplication algorithms under predefined time restrictions. We also propose specific heuristics (Best Performing Allocation, Probabilistic Best Performing Allocation, Tunable Allocation, Adaptive Allocation and Sliced Training Data) which, together with the machine learning techniques, are able to tune the virtual cluster estimations as demands fluctuate over time. The experiments we have carried out using data sets of multiple scales have provided many insights regarding the adequacy of the considered machine learning algorithms and proposed heuristics for tackling cloud computing provisioning.

Keywords: Data quality, Deduplication, Machine learning, Heuristics, Cloud computing, Elasticity

Paper URL: https://doi.org/10.1007/s10489-016-0774-2