Frequent high minimum average utility sequence mining with constraints in dynamic databases using efficient pruning strategies

摘要

High utility sequence mining is a popular data mining task, which aims at finding sequences having a high utility (importance) in a quantitative sequence database. Though it has several applications, state-of-the-art algorithms have one or more of the following limitations: (1) they rely on a utility function that tends to be biased toward finding long patterns, (2) some algorithms do take pattern length into account using an average-utility function but they adopt an optimistic perspective that can be risky or misleading for some applications, (3) they do not let the user specify additional constraints on patterns to be found. To address these three limitations, this paper defines a novel task of mining frequent high minimum average-utility sequences (FHAUS) with constraints in a quantitative sequence database. This task has the following benefits. First, it uses the average-utility au function based on the minimum utility, which takes the length of a pattern into account to calculate its utility. This helps finding short patterns missed by traditional algorithms and it is based on more safe pessimistic utility calculations. Second, the user can specify a set of monotonic and anti-monotonic constraints C on patterns to filter irrelevant patterns and improve the performance of the mining process. To efficiently find all FHAUSs with constraints, this paper first proposes some novel upper bounds (UBs) and weak upper bounds (WUBs) on the average-utility, which satisfy downward-closure (DC) properties or DC-like properties. Then, to effectively reduce the search space, the paper designs novel width pruning, depth pruning, reducing, and tightening strategies based on the proposed bounds. These proposed novel theoretical results are integrated into an algorithm named C-FHAUSPM (Constrained Frequent High minimum Average-Utility Sequential Pattern Mining) for efficiently discovering all FHAUSs with constraints. Results from extensive experiments on both real-life and synthetic quantitative sequence databases show that C-FHAUSPM is highly efficient in terms of runtime and memory usage.