HANP-Miner: High average utility nonoverlapping sequential pattern mining

作者:

Highlights:

摘要

Nonoverlapping sequential pattern mining (SPM) is a data analysis task, which aims at identifying repetitive sequential patterns with gap constraint in a set of discrete sequences. Nonoverlapping means that any character in the sequence can be rematched by characters at different positions in the pattern, while overlapping means that any character can be reused at the same position, which can be regarded as no condition. Compared with overlapping SPM, nonoverlapping SPM has more strict constraint on the occurrences of pattern, and meets the Apriori property. However, current algorithms mine frequent or closed patterns, resulting in some low-frequency but extremely important patterns being ignored. To tackle this issue, this paper proposes to mine high average utility nonoverlapping sequential patterns (HANP). An efficient algorithm called HANP-Miner is proposed, which involves two key steps: support calculation and candidate pattern reduction. To calculate the support (the occurrence frequency of a pattern), depth-first search and backtracking strategies based on the simplified Nettree structure are adopted, an approach that effectively reduces the time and space complexities of the algorithm. To effectively reduce the number of candidate patterns, a pattern join strategy based on an upper bound on the average utility is proposed. Experiments on biological sequences and real sales sequences are carried out to verify the efficiency of HANP-Miner and the superiority of HANPs. The results demonstrate that HANP-Miner is not only more efficient, but that the HANPs mined in this way are also more valuable than existing frequent patterns. The algorithms and datasets can be downloaded from https://github.com/wuc567/Pattern-Mining/tree/master/HANP-Miner.

论文关键词:Sequential pattern mining,High average utility,Depth-first search,Pattern join

论文评审过程:Received 31 December 2020, Revised 17 July 2021, Accepted 2 August 2021, Available online 9 August 2021, Version of Record 19 August 2021.

论文官网地址:https://doi.org/10.1016/j.knosys.2021.107361