A hash trie filter method for approximate string matching in genomic databases

作者:Ye-In Chang, Jiun-Rung Chen, Min-Tze Hsu

摘要

In genomic databases, approximate string matching with k errors is often applied when searching genomic sequences, where k errors can be caused by substitution, insertion, or deletion operations. In this paper, we propose a new method, the hash trie filter, to efficiently support approximate string matching in genomic databases. First, we build a hash trie for indexing the genomic sequence stored in a database in advance. Then, we utilize an efficient technique to find the ordered subpatterns in the sequence, which could reduce the number of candidates by pruning some unreasonable matching positions. Moreover, our method will dynamically decide the number of ordered matching grams, resulting in the increase of precision. The simulation results show that the hash trie filter outperforms the well-known (k+s) q-samples filter in terms of the response time, the number of verified candidates, and the precision, under different lengths of the query patterns and different error levels.

论文关键词:Approximate string matching, Filter, Genomic database, Global order, Local order

论文评审过程:

论文官网地址:https://doi.org/10.1007/s10489-010-0233-4