e-NSP: Efficient negative sequential pattern mining

作者:

摘要

As an important tool for behavior informatics, negative sequential patterns (NSP) (such as missing medical treatments) are critical and sometimes much more informative than positive sequential patterns (PSP) (e.g. using a medical service) in many intelligent systems and applications such as intelligent transport systems, healthcare and risk management, as they often involve non-occurring but interesting behaviors. However, discovering NSP is much more difficult than identifying PSP due to the significant problem complexity caused by non-occurring elements, high computational cost and huge search space in calculating negative sequential candidates (NSC). So far, the problem has not been formalized well, and very few approaches have been proposed to mine for specific types of NSP, which rely on database re-scans after identifying PSP in order to calculate the NSC supports. This has been shown to be very inefficient or even impractical, since the NSC search space is usually huge. This paper proposes a very innovative and efficient theoretical framework: set theory-based NSP mining (ST-NSP), and a corresponding algorithm, e-NSP, to efficiently identify NSP by involving only the identified PSP, without re-scanning the database. Accordingly, negative containment is first defined to determine whether a data sequence contains a negative sequence based on set theory. Second, an efficient approach is proposed to convert the negative containment problem to a positive containment problem. The NSC supports are then calculated based only on the corresponding PSP. This not only avoids the need for additional database scans, but also enables the use of existing PSP mining algorithms to mine for NSP. Finally, a simple but efficient strategy is proposed to generate NSC. Theoretical analyses show that e-NSP performs particularly well on datasets with a small number of elements in a sequence, a large number of itemsets and low minimum support. e-NSP is compared with two currently available NSP mining algorithms via intensive experiments on three synthetic and six real-life datasets from aspects including data characteristics, computational costs and scalability. e-NSP is tens to thousands of times faster than baseline approaches, and offers a sound and effective approach for efficient mining of NSP in large scale datasets by directly using existing PSP mining algorithms.

论文关键词:Negative sequence analysis,Sequence analysis,Behavior analytics,Non-occurring behavior,Behavior informatics,Behavior computing,Pattern mining

论文评审过程:Received 13 January 2015, Revised 3 March 2016, Accepted 7 March 2016, Available online 10 March 2016, Version of Record 24 March 2016.

论文官网地址:https://doi.org/10.1016/j.artint.2016.03.001