Sampling strategies for extracting information from large data sets

Authors:

Highlights:

Abstract

Extracting information from large volumes of data is expensive in terms of resources such as CPU and memory, as well as computation time. Analyzing a small data set extracted from the original one is therefore preferable. From this smaller set, called a sample, approximate results can be obtained; the errors are acceptable given the reduced cost of processing the data. Using sampling algorithms that introduce small errors saves execution time and resources. This paper compares sampling algorithms to determine which performs best for set operations such as intersection, union, and difference. The comparison focuses on the errors introduced by each algorithm for different sample sizes and on execution times.
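To illustrate the idea the abstract describes (approximating a set operation from a sample rather than the full data), here is a minimal sketch in Python. It estimates the cardinality of an intersection by drawing a uniform random sample without replacement and scaling up the hit fraction; the function name and parameters are hypothetical, not from the paper, and the paper's own algorithms and error analysis may differ.

```python
import random

def estimate_intersection_size(a, b, sample_size):
    """Estimate |A ∩ B| from a uniform random sample of A.

    Draws sample_size elements from A without replacement and scales
    the fraction that also appears in B by |A|. The estimator is
    unbiased, and its error shrinks as the sample size grows.
    """
    sample = random.sample(list(a), min(sample_size, len(a)))
    hits = sum(1 for x in sample if x in b)
    return len(a) * hits / len(sample)

if __name__ == "__main__":
    # Two overlapping sets with a known intersection of 500,000 elements.
    A = set(range(0, 1_000_000))
    B = set(range(500_000, 1_500_000))
    exact = len(A & B)
    approx = estimate_intersection_size(A, B, sample_size=10_000)
    print(f"exact={exact}, estimated={approx:.0f}, "
          f"relative error={abs(approx - exact) / exact:.2%}")
```

A sample of 10,000 elements typically yields a relative error of a few percent here, at a fraction of the cost of scanning the full sets, which is the trade-off the paper quantifies across algorithms and sample sizes.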

Keywords: Sampling algorithms, Space complexity, Time complexity, Set operations, Data set cardinality, Time optimization

Article history: Received 11 October 2016, Revised 19 November 2017, Accepted 15 January 2018, Available online 2 February 2018, Version of Record 4 June 2018.

DOI: https://doi.org/10.1016/j.datak.2018.01.002