Improving protein domain classification for third-generation sequencing reads using deep learning

Du et al. BMC Genomics (2021) 22:251
https://doi.org/10.1186/s12864-021-07468-7

RESEARCH ARTICLE, Open Access

Nan Du1†, Jiayu Shang2† and Yanni Sun2*

*Correspondence: yannisun@cityu.edu.hk
†Nan Du and Jiayu Shang contributed equally to this work.
Electrical Engineering, City University of Hong Kong, Hong Kong, People's Republic of China. Full list of author information is available at the end of the article.

© The Author(s) 2021. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Abstract

Background: With the development of third-generation sequencing (TGS) technologies, it is now possible to obtain DNA sequences with lengths from tens to hundreds of kb. These long reads allow protein domain annotation without assembly and can thus provide important insights into the biological functions of the underlying data. However, the high error rate in TGS data poses a new challenge to established domain analysis pipelines. The state-of-the-art methods are not optimized for noisy reads and have shown unsatisfactory accuracy of domain classification in TGS data. New computational methods are still needed to improve the performance of domain prediction in long noisy reads.

Results: In this work, we introduce ProDOMA, a deep learning model that conducts domain classification for TGS reads. It uses deep neural networks with 3-frame translation encoding to learn conserved features from partially correct translations. In addition, we formulate our problem as an open-set problem, so our model can reject reads that do not contain the targeted domains. In the experiments on simulated long reads of protein coding sequences and real TGS reads from the human genome, our model outperforms HMMER and DeepFam on protein domain classification.

Conclusions: In summary, ProDOMA is a useful end-to-end protein domain analysis tool for long noisy reads that does not rely on error correction.

Introduction

Third-generation sequencing (TGS) technologies, such as Pacific Biosciences single-molecule real-time sequencing (PacBio) and Oxford Nanopore sequencing (Nanopore), produce longer reads than next-generation sequencing (NGS) technologies. With increased read length, long reads can contain complete genes or protein domains, making gene-centric functional analysis for high-throughput sequencing data more applicable [1–3]. In gene-centric analysis, there are often specific sets of genes in pathways that are of special interest, for example G protein-coupled receptor (GPCR) genes in intracellular signaling pathways for environmental sensing, while other genes in the assemblies provide little insight into the specific questions. One basic step in gene-centric analysis is to assign sequences to functional categories, such as families of protein domains (or domains for short), which are independent folding and functional units in a majority of annotated protein sequences.

There are a number of tools available for protein domain annotation. They can be roughly divided into two groups depending on how they utilize the available protein domain sequences. One group of methods relies on alignments against the references. HMMER is the state-of-the-art profile search tool based on profile hidden Markov models (pHMM) [4, 5]. However, pHMM homology search slows down as the number of families increases. Extensive research has been conducted to improve the efficiency of the profile homology search [6].

The other group of tools is alignment-free [7]. Recent developments in deep learning have led to alignment-free approaches with automatic feature extraction [8–11]. A review of some available methods and their applications can be found in [12]. Of the learning-based tools, the most relevant one to protein domain annotation is DeepFam [9], which uses convolutional neural networks (CNN) to classify protein sequences into protein/domain families. The authors showed that it outperformed HMMER and previous alignment-free methods on protein domain classification. DeepFam is also fast, and its speed is not affected much by the number of families. For example, DeepFam is at least ten times faster than HMMER when 1,000 query sequences are searched against thousands of protein families [9]. Thus, deep learning-based methods have advantages for applications that do not need detailed alignments.

Despite the success of existing protein domain annotation tools, they are not ideal choices for domain identification in error-prone reads. Although the sequencing accuracy of TGS platforms has improved dramatically, TGS data have lower per-read accuracy than short-read sequencing [13]. The newest circular consensus sequencing (CCS) reads produced by PacBio Sequel II can reach high accuracy [14]. However, these reads exhibit a bias for indels in homopolymers [14]. In particular, there is still much room for improvement in reads produced via direct RNA sequencing [13]. Insertion or deletion errors, which are not rare in TGS data, can cause frameshifts during translation [15]. When the errors and their positions are unknown, the frameshifts can lead to only short or non-significant alignments [16]. Because the translation of each reading frame is only partially correct, frameshifts also lead to poor classification performance for existing learning-based models. Our experimental results in the "Experiments and results" section clearly show this.

Domain classification with error correction

Because sequencing errors remain an issue for TGS data, error correction tools for long reads are under active development [15, 17]. An alternative pipeline is therefore to apply tools such as HMMER and DeepFam to error-corrected sequences. Error correction tools can generally be divided into hybrid and standalone tools, depending on whether they need short reads for error correction. Recently, several groups conducted comprehensive reviews and comparisons of existing error correction tools [15, 17]. None of these tools can achieve optimal performance across all tested data sets. Based on the recent reviews and our own experimental results, there are two major limitations of applying error correction before protein domain classification. First, the performance of standalone tools is profoundly affected by the coverage of the aligned sequences against the chosen backbone sequences. When the coverage is low (e.g. the depth of sequencing
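To make the frameshift problem from the Introduction concrete, the following minimal Python sketch translates a short DNA sequence in the three forward reading frames, before and after a single-base insertion. It is an illustration only, not ProDOMA's encoding or code: the function names (`translate`, `three_frame_translation`), the toy sequence, and the choice of the standard genetic code are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame):
    """Translate dna starting at offset `frame` (0, 1 or 2); unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(dna):
    """Return the translations of the three forward reading frames."""
    return [translate(dna, frame) for frame in range(3)]


if __name__ == "__main__":
    clean = "ATGGCTAAAGGTGAACTGCGTTCT"      # toy coding sequence, frame 0 encodes MAKGELRS
    noisy = clean[:9] + "C" + clean[9:]     # hypothetical read with one inserted base

    print(three_frame_translation(clean)[0])  # MAKGELRS
    for frame, peptide in enumerate(three_frame_translation(noisy)):
        print(frame, peptide)
    # 0 MAKR*TAF   (correct up to the insertion, then shifted)
    # 1 WLNGELRS   (wrong at the start, correct after the insertion)
    # 2 G*TVNCV
```

After the insertion, no single frame carries the whole peptide: frame 0 keeps the prefix `MAK` and frame 1 recovers the suffix `GELRS`. This is the sense in which each translation is "partially correct", and it suggests why a model that sees all three frames together has more signal to work with than one that sees a single translated frame.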

Date posted: 23/02/2023, 18:20