Deep metric learning for accurate protein secondary structure prediction

Authors:

Highlights:

Abstract

Predicting the secondary structure of a protein from its amino acid sequence alone is a challenging per-residue prediction task in bioinformatics. Recent work has mainly used deep models that take profile features derived from multiple sequence alignments as input. However, existing state-of-the-art predictors usually incur high computational costs because of their large model sizes and complex network architectures. Here, we propose a simple yet effective deep centroid model for sequence-to-sequence secondary structure prediction based on deep metric learning. The proposed model adopts a lightweight embedding network with a multibranch topology to map each residue in a protein chain into an embedding space. The goal of embedding learning is to maximize the similarity of each residue to its target centroid while minimizing its similarity to nontarget centroids. By assigning secondary structure types based on the learned centroids, we bypass the need for a time-consuming k-nearest neighbor search. Experimental results on six test sets demonstrate that our method achieves state-of-the-art performance with a simpler architecture and a smaller model size than existing models. We also show experimentally that the embedding feature from the pretrained protein language model ProtT5-XL-U50 is superior to the profile feature in terms of both prediction accuracy and feature generation speed. Code and datasets are available at https://github.com/fengtuan/DML_SS.
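
To make the centroid idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes cosine similarity as the similarity measure, a scaled softmax cross-entropy as the training objective, and three hypothetical centroids in a toy 2-D embedding space; the function names, the `scale` parameter, and all numeric values are illustrative assumptions that do not come from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def assign_states(embeddings, centroids):
    """Label each residue with the index of its most similar centroid."""
    labels = []
    for e in embeddings:
        sims = [cosine_similarity(e, c) for c in centroids]
        labels.append(max(range(len(centroids)), key=lambda k: sims[k]))
    return labels


def centroid_loss(embedding, centroids, target, scale=10.0):
    """Softmax cross-entropy over scaled similarities (assumed loss form).

    Minimizing it raises similarity to the target centroid and lowers
    similarity to the nontarget centroids.
    """
    logits = [scale * cosine_similarity(embedding, c) for c in centroids]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


# Toy example: three hypothetical centroids and three residue embeddings.
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
embeddings = [[0.9, 0.1], [0.2, 0.8], [-0.7, 0.1]]
labels = assign_states(embeddings, centroids)
```

Under these assumptions, prediction costs one similarity computation per centroid for each residue, independent of the training set size, which is why no k-nearest neighbor search over stored training embeddings is needed.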

Keywords: Protein secondary structure prediction, Deep metric learning, Protein language model, Profile feature, Embedding feature

Review history: Received 14 November 2021, Revised 7 January 2022, Accepted 31 January 2022, Available online 7 February 2022, Version of Record 17 February 2022.

Official paper URL: https://doi.org/10.1016/j.knosys.2022.108356