Suffix trees for inputs larger than main memory

Authors:

Highlights:

Abstract

A suffix tree is a fundamental data structure for string searching algorithms. Unfortunately, when it comes to the use of suffix trees in real-life applications, the current methods for constructing suffix trees do not scale to large inputs. Because suffix trees are larger than their input sequences and quickly outgrow main memory, the first attempts at building large suffix trees focused on algorithms that avoid massive random access to the trees being built. However, all existing practical algorithms perform random access to the input string, which in essence requires the input to be small enough to fit in main memory. The constantly growing pool of string data, especially biological sequences, requires us to build suffix trees for much larger strings. We are the first to present an algorithm that can construct suffix trees for input sequences significantly larger than the available main memory. Both the input string and the suffix tree are kept on disk, and the algorithm is designed to avoid multiple random I/Os to either of them. As a proof of concept, we show that our method can build the suffix tree for 12 GB of real DNA sequences in 26 h on a single machine with 2 GB of RAM. This input is four times the size of the Human Genome, and the construction of suffix trees for inputs of this magnitude has not been reported before.

Keywords: String databases, Suffix trees, Full-text indexes

Review history: Received 5 October 2009, Revised 30 October 2010, Accepted 1 November 2010, Available online 18 November 2010.

Official paper URL: https://doi.org/10.1016/j.is.2010.11.001