Repeated patterns detection in big data using classification and parallelism on LERP Reduced Suffix Arrays

摘要

Suffix array is a powerful data structure, used mainly for pattern detection in strings. The main disadvantage of a full suffix array is its quadratic O(n 2) space capacity when the actual suffixes are needed. In our previous work [39], we introduced the innovative All Repeated Patterns Detection (ARPaD) algorithm and the Moving Longest Expected Repeated Pattern (MLERP) process. The former detects all repeated patterns in a string using a partition of the full Suffix Array and the latter is capable of analyzing large strings regardless of their size. Furthermore, the notion of Longest Expected Repeated Pattern (LERP), also introduced by the authors in a previous work, significantly reduces to linear O ( n ) the space capacity needed for the full suffix array. However, so far the LERP value has to be specified in ad hoc manner based on experimental or empirical values. In order to overcome this problem, the Probabilistic Existence of LERP theorem has been proven in this paper and, furthermore, a formula for an accurate upper bound estimation of the LERP value has been introduced using only the length of the string and the size of the alphabet used in constructing the string. The importance of this method is the optimum upper bounding of the LERP value without any previous preprocess or knowledge of string characteristics. Moreover, the new data structure LERP Reduced Suffix Array is defined; it is a variation of the suffix array, and has the advantage of permitting the classification and parallelism to be implemented directly on the data structure. All other alternative methodologies deal with the very common problem of fitting any kind of data structure in a computer memory or disk in order to apply different time efficient methods for pattern detection. The current advanced and elegant proposed methodology allows us to alter the above-mentioned problem such that smaller classes of the problem can be distributed on different systems and then apply current, state-of-the-art, techniques such as parallelism and cloud computing using advanced DBMSs which are capable of handling the storage and analysis of big data. The implementation of the above-described methodology can be achieved by invoking our innovative ARPaD algorithm. Extensive experiments have been conducted on small, comparable strings of Champernowne Constant and DNA as well as on extremely large strings of π with length up to 68 billion digits. Furthermore, the novelty and superiority of our methodology have been also tested on real life application such as a Distributed Denial of Service (DDoS) attack early warning system.