Probabilistic data structures for big data analytics: A comprehensive review

作者:

Highlights:

摘要

An exponential increase in the data generation resources is widely observed in last decade, because of evolution in technologies such as-cloud computing, IoT, social networking, etc. This enormous and unlimited growth of data has led to a paradigm shift in storage and retrieval patterns from traditional data structures to Probabilistic Data Structures (PDS). PDS are a group of data structures that are extremely useful for Big data and streaming applications in order to avoid high-latency analytical processes. These data structures use hash functions to compactly represent a set of items in stream-based computing while providing approximations with error bounds so that well-formed approximations get built into data collections directly. Compared to traditional data structures, PDS use much less memory and constant time in processing complex queries. This paper provides a detailed discussion of various issues which are normally encountered in massive data sets such as-storage, retrieval, query,etc. Further, role of PDS in solving these issues is also discussed where these data structures are used as temporary accumulators in query processing. Several variants of existing PDS along with their application areas have also been explored which give a holistic view of domains where these data structures can be applied for efficient storage and retrieval of massive data sets. Mathematical proofs of various parameters considered in the PDS have also been discussed in the paper. Moreover, the relative comparison of various PDS with respect to various parameters is also explored.

论文关键词:Big data,Internet of things (IoT),Probabilistic data structures,Bloom filter,Quotient filter,Count min sketch,HyperLogLog counter,Min-hash,Locality sensitive hashing

论文评审过程:Received 12 April 2019, Revised 21 August 2019, Accepted 21 August 2019, Available online 26 August 2019, Version of Record 20 January 2020.

论文官网地址:https://doi.org/10.1016/j.knosys.2019.104987