Mining interesting sequences with low average cost and high average utility

摘要

Discovering high utility sequences in a quantitative database is a popular data mining task. The goal is to enumerate all sequences of items (symbols) that have a high value for the user, as measured by a utility function. A representative application of high utility sequence mining is the identification of profitable sequences of purchases in transactions from online stores. Though useful, a drawback of that task is that the cost of items is not considered. However, cost is a key factor for decision-making in that domain and many others. To consider both the cost and utility of items for sequence mining, this paper defines a novel problem \( \mathcal{FLCHUSM} \) of mining frequent sequences having a high average utility and a low average cost. Though the proposed problem is a generalization of the traditional problem of frequent sequence mining, it is more challenging because the average utility and average cost functions do not satisfy the downward-closure property traditionally used to reduce the search space. To offer a solution to this issue, this paper presents a lower bound on the cost and two novel upper bounds on the utility. Besides, four width, depth pruning, reducing and tightening strategies are devised to eliminate unpromising patterns from the search space. Taking these theoretical results as a foundation, a new CUL (Cost-Utility List) data structure is conceived for storing and quickly updating the utility and cost information of patterns, and a novel algorithm named FLCHUSPM is proposed for \( \mathcal{FLCHUSM} \). Results from several experiments show that FLCHUSPM is efficient in terms of memory usage and runtime, and that interesting patterns can be discovered in real data.