Utilising a statistical inequality for efficiently finding term sets

作者:

Highlights:

• We have addressed the problem of Frequent Itemset (FI) mining using a statistical analysis called Bell-Wigner’s Inequality (BWI). This inequality provides additional bounds to the principle of monotonicity and significantly reduces the execution time of Apriori without requiring the storage space of the FPGrowth algorithm.

• An extensive experimental validation is described together with a comparison of the use of the BWI in Apriori with the original Apriori algorithm and the FPGrowth algorithm. The comparison has been measured using time and space, thus highlighting the differences between the computational costs of these algorithms.

• We have validated the hypothesis that the BWI can successfully be utilised to reduce the computational time of Apriori, thus making the computation of term sets for Information Retrieval (IR) purposes less expensive and more feasible.

• In particular, the performance of Apriori extended by BWI can be still competitive with FPGrowth in time and space when the datasets are derived from the runs of an IR system. In particular, FPGrowth demands much more space than BWI-Apriori and BWI-Apriori is only slightly slower when the minimum support is not very low.

• This paper has been compared with a number of publications and concludes with some issues, perspectives and future works.

摘要

•We have addressed the problem of Frequent Itemset (FI) mining using a statistical analysis called Bell-Wigner’s Inequality (BWI). This inequality provides additional bounds to the principle of monotonicity and significantly reduces the execution time of Apriori without requiring the storage space of the FPGrowth algorithm.•An extensive experimental validation is described together with a comparison of the use of the BWI in Apriori with the original Apriori algorithm and the FPGrowth algorithm. The comparison has been measured using time and space, thus highlighting the differences between the computational costs of these algorithms.•We have validated the hypothesis that the BWI can successfully be utilised to reduce the computational time of Apriori, thus making the computation of term sets for Information Retrieval (IR) purposes less expensive and more feasible.•In particular, the performance of Apriori extended by BWI can be still competitive with FPGrowth in time and space when the datasets are derived from the runs of an IR system. In particular, FPGrowth demands much more space than BWI-Apriori and BWI-Apriori is only slightly slower when the minimum support is not very low.•This paper has been compared with a number of publications and concludes with some issues, perspectives and future works.

论文关键词:Query expansion,Clustering,Term selection,Efficiency

论文评审过程:Received 30 September 2015, Revised 26 January 2016, Accepted 26 April 2016, Available online 6 May 2016, Version of Record 28 September 2016.

论文官网地址:https://doi.org/10.1016/j.ipm.2016.04.011