A new dot plot-based algorithm for genomes sequences comparison: A preliminary study

Abstract

Efficient data mining systems depend on powerful algorithms for extracting and mining data. In a genome data mining system, these algorithms search for genomes or proteins that share similar properties. Proteins with a significant biological relationship often share only isolated regions of sequence similarity. To identify such relationships, finding local regions of optimal similarity is more useful than computing a global alignment that optimizes the overall alignment of two entire sequences. This paper describes a new method for genome sequence comparison that can be used in a genome data mining system. In theory, it improves accuracy at a modest cost in speed compared with the most commonly used alternatives. The method is based on the popular progressive approach, the dot plot method, but avoids the most serious pitfalls caused by the greedy nature of this technique. The new approach first pre-processes a data set of all pairwise alignments between the sequences, which provides a library of alignment information that can be used to guide the comparison. The algorithm relies on the similar segment method, i.e., two segments are considered similar when they have n identities within a window of size L. The paper presents results on the termination and correctness of the algorithm and shows how to incorporate it into other comparison algorithms. It also introduces a mechanism for generating random sequences, which serve as the main benchmarks for comparing the algorithms.
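The similar segment criterion and the random benchmark sequences mentioned above can be illustrated with a minimal Python sketch. This is not the paper's algorithm: it is a plain brute-force window scan, and the function names `random_sequence` and `similar_segments`, the DNA alphabet, and the parameter names `window` (for L) and `min_identities` (for n) are all assumptions made for illustration.

```python
import random


def random_sequence(length, alphabet="ACGT", seed=None):
    """Generate a random sequence over `alphabet` (hypothetical benchmark helper)."""
    rng = random.Random(seed)  # a seeded generator gives reproducible benchmarks
    return "".join(rng.choice(alphabet) for _ in range(length))


def similar_segments(a, b, window, min_identities):
    """Return the dots of a windowed dot plot as (i, j) start positions.

    A dot is reported when a[i:i+window] and b[j:j+window] agree in at
    least `min_identities` positions (n identities in a window of size L).
    This is an O(len(a) * len(b) * window) brute-force scan, assumed here
    only to make the criterion concrete.
    """
    hits = []
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            # Count position-wise identities between the two windows.
            matches = sum(1 for k in range(window) if a[i + k] == b[j + k])
            if matches >= min_identities:
                hits.append((i, j))
    return hits
```

For example, `similar_segments("ACGTAC", "TTACGT", 4, 4)` reports the single dot `(0, 2)`, since only the window `ACGT` occurs in both sequences. Lowering `min_identities` below `window` tolerates mismatches inside a window, at the price of more spurious dots on random sequences.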