Utilising a statistical inequality for efficiently finding term sets

Highlights：

• An extensive experimental validation is described together with a comparison of the use of the BWI in Apriori with the original Apriori algorithm and the FPGrowth algorithm. The comparison has been measured using time and space, thus highlighting the differences between the computational costs of these algorithms.

• We have validated the hypothesis that the BWI can successfully be utilised to reduce the computational time of Apriori, thus making the computation of term sets for Information Retrieval (IR) purposes less expensive and more feasible.

• In particular, the performance of Apriori extended by BWI can be still competitive with FPGrowth in time and space when the datasets are derived from the runs of an IR system. In particular, FPGrowth demands much more space than BWI-Apriori and BWI-Apriori is only slightly slower when the minimum support is not very low.

• This paper has been compared with a number of publications and concludes with some issues, perspectives and future works.

摘要

•We have addressed the problem of Frequent Itemset (FI) mining using a statistical analysis called Bell-Wigner’s Inequality (BWI). This inequality provides additional bounds to the principle of monotonicity and significantly reduces the execution time of Apriori without requiring the storage space of the FPGrowth algorithm.•An extensive experimental validation is described together with a comparison of the use of the BWI in Apriori with the original Apriori algorithm and the FPGrowth algorithm. The comparison has been measured using time and space, thus highlighting the differences between the computational costs of these algorithms.•We have validated the hypothesis that the BWI can successfully be utilised to reduce the computational time of Apriori, thus making the computation of term sets for Information Retrieval (IR) purposes less expensive and more feasible.•In particular, the performance of Apriori extended by BWI can be still competitive with FPGrowth in time and space when the datasets are derived from the runs of an IR system. In particular, FPGrowth demands much more space than BWI-Apriori and BWI-Apriori is only slightly slower when the minimum support is not very low.•This paper has been compared with a number of publications and concludes with some issues, perspectives and future works.