A linguistic approach to classification of bacterial genomes

Authors:

Highlights:

Abstract

In this paper, 188 prokaryote genomes are classified by calculating compositional spectra separately for the coding and the non-coding parts of each genome. For each subsequence, the compositional spectrum is transformed into a corresponding point in a vector space, which allows genomes to be grouped into meaningful clusters by a formal method. Repeating the clustering for the coding and the non-coding genome parts makes it possible to estimate the true number of genome clusters. The proposed method is based on a new application of external cluster validation indexes and on the counts of misclassified items obtained during repeated clustering. In addition, we construct an embedding of the data into an appropriate Euclidean space using only the distances between compositional spectra. Biological evaluation of the results obtained for the 4-letter and the 2-letter alphabets supports the appropriateness of the resulting cluster-based classification.
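
To make the pipeline in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simplified compositional spectrum (normalized counts of exact k-mer matches), assumes the 2-letter alphabet is the purine/pyrimidine reduction (the abstract does not say which reduction is used), and uses the Rand index as one common example of an external cluster validation index (the abstract does not name the indexes). The function names `spectrum`, `to_two_letter`, `euclidean`, and `rand_index` are hypothetical.

```python
from itertools import combinations, product
from math import sqrt


def spectrum(seq, k=3, alphabet="ACGT"):
    """Map a sequence to a point in a vector space: one coordinate per
    k-letter word, holding that word's relative frequency.
    Assumption: exact matches only, a simplification for illustration."""
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(words, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows with symbols outside the alphabet
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in words]


def to_two_letter(seq):
    """Assumed 2-letter reduction: purines (A, G) -> R, pyrimidines (C, T) -> Y."""
    return seq.translate(str.maketrans("AGCT", "RRYY"))


def euclidean(u, v):
    """Euclidean distance between two spectra of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rand_index(labels_a, labels_b):
    """External validation index: fraction of item pairs on which two
    partitions agree (both together or both apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

In this sketch, one would compute a spectrum for the coding part and another for the non-coding part of each genome, cluster each set of points separately, and then compare the two partitions with `rand_index`; a value near 1 indicates that the two clusterings largely agree.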

Keywords: Genome clustering, Cluster validation, Compositional spectra method

Article history: Received 7 October 2008, Revised 29 July 2009, Accepted 7 August 2009, Available online 4 September 2009.

Official paper URL: https://doi.org/10.1016/j.patcog.2009.08.019