Analyzing very large time series using suffix arrays

作者:Konstantinos F. Xylogiannopoulos, Panagiotis Karampelas, Reda Alhajj

摘要

Suffix arrays form a powerful data structure for pattern detection and matching. In a previous work, we presented a novel algorithm (COV) which is the only algorithm that allows the detection of all repeated patterns in a time series by using the actual suffix array. However, the requirements for storing the actual suffix strings even on external media makes the use of suffix arrays impossible for very large time series. We have already proved that using the concept of Longest Expected Repeated Pattern (LERP) allows the actual suffices to be stored in linear capacity O(n) on external media. The repeated pattern detection using LERP has analogous time complexity, and thus makes the analysis of large time series feasible and limited only to the size of the external media and not memory. Yet, there are cases when hardware limitations might be an obstacle for the analysis of very larger time series of size comparable to hard disk capacity. With the Moving LERP (MLERP) method introduced in this paper, it is possible to analyze very large time series (of size tens or hundreds thousands times larger than what the LERP can analyze) by maximal utilization of the available hardware. Further, when empirical knowledge related to the distribution of repeated pattern’s length is available, the proposed method (MLERP) can achieve better time performance compared to the standard LERP method and definitely much better than using any other pattern matching algorithm and applying brute force techniques which are unfeasible in logical (human) time frame. Thus, we may argue that MLERP is a very useful tool for detecting all repeated patterns in a time series regardless of its size and hardware limitations.

论文关键词:Suffix arrays, Repeated patterns detection, Data mining, Time series, DNA analysis

论文评审过程:

论文官网地址:https://doi.org/10.1007/s10489-014-0553-x