The suffix-signature method for searching for phrases in text

作者:

Highlights:

摘要

We present a new algorithm to find all occurrences of a given phrase based on the data structure known as a suffix array and using a corresponding array of signatures. With this algorithm, matches to phrases of moderate length can be found with expected search time of one disk access to the text and one disk access to its index. To achieve this performance for phrases of up to five words in length requires an index having total size of approximately 120% of the size of the text. The algorithm guarantees a worst case search performance of two disk accesses to the text per phrase search. Experiments with actual data ranging in size from 6Mb to 550Mb and with actual query patterns derived from logs of searches on the World Wide Web show that the approach is applicable in practice to a variety of texts and realistic phrase searches.

论文关键词:Text Indexing,Phrase Search,Suffix Arrays,pat Arrays,Signature Arrays

论文评审过程:Received 23 February 1998, Revised 14 October 1998, Available online 12 February 1999.

论文官网地址:https://doi.org/10.1016/S0306-4379(98)00029-5