A space efficient solution to the frequent string mining problem for many databases

作者：Adrian Kügel, Enno Ohlebusch

摘要

The frequent string mining problem is to find all substrings of a collection of string databases which satisfy database specific minimum and maximum frequency constraints. Our contribution improves the existing linear-time algorithm for this problem in such a way that the peak memory consumption is a constant factor of the size of the largest database of strings. We show how the results for each database can be stored implicitly in space proportional to the size of the database, making it possible to traverse the results in lexicographical order. Furthermore, we present a linear-time algorithm which calculates the intersection of the results of different databases. This algorithm is based on an algorithm to merge two suffix arrays, and our modification allows us to also calculate the LCP table of the resulting suffix array during the merging.

论文关键词：String mining, Enhanced suffix array

论文评审过程：

论文官网地址：https://doi.org/10.1007/s10618-008-0110-5