2011 Third International Conference on Knowledge and Systems Engineering
978-0-7695-4567-7/11 $26.00 © 2011 IEEE. DOI 10.1109/KSE.2011.23

Protein type specific amino acid substitution models for influenza viruses

Nguyen Van Sau (1), Dang Cao Cuong (1), Le Si Quang (3), Le Sy Vinh (1)
(1) University of Engineering and Technology, VNU
(2) Institute of Information Technology, VNU
Vietnam National University Hanoi, 144 Xuan Thuy, Ha Noi, Viet Nam
(3) Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, UK
saunv.mcs09@coltech.vnu.vn, cuongdc@vnu.edu.vn, quang@well.ox.ac.uk, vinhls@vnu.edu.vn

Abstract: The amino acid substitution model (matrix) is a crucial part of protein sequence analysis systems. General amino acid substitution models have been estimated from large protein databases; however, they are not specific to influenza viruses. In a previous study, we estimated an amino acid substitution model, FLU, for all influenza viruses. Experiments showed that FLU outperformed other models when analyzing influenza protein sequences. Influenza virus genomes consist of different protein types, which differ in both structure and evolutionary process. Although the FLU matrix is specific to influenza viruses, it is not specific to individual influenza protein types. Since influenza viruses cause serious problems for both human health and the economy, it is worth studying them as specifically as possible. In this paper, we used more than 27 million amino acids to estimate 11 protein type specific models for influenza viruses. Experiments showed that the protein type specific models outperformed the FLU model, the best existing model for influenza viruses. These protein type specific models help researchers to conduct studies on influenza viruses more precisely.

Keywords: influenza virus, amino acid substitution model, phylogeny tree

BACKGROUND

Protein sequence analysis systems usually require an amino acid substitution model for analyzing the relationships between protein sequences. Estimating amino acid substitution models has therefore been a crucial task in bioinformatics for decades. There are two main approaches to estimating amino acid substitution models from protein alignments.

The first approach estimates substitution rates between amino acids under the assumption that the probability of one amino acid being exchanged for another over a period of time is linear in the substitution rate between the two amino acids. Substitution rates can thus be estimated directly from the number of exchanges observed between pairs of amino acid sequences. This approach is simple and applicable to large databases. However, the assumption is acceptable only if the time period is short, so the amino acid sequences must be very closely related (typically with >85% identity). PAM [1,2] and JTT [3] are the two most popular models estimated by this approach.

The second approach takes advantage of multiple alignments by using the maximum likelihood method. The main idea is to estimate both the phylogenies and the substitution model so as to maximize the likelihood of the alignments. Adachi and Hasegawa [4], Yang et al. [5], and Adachi et al. [4] were the first to apply this approach to alignments from a few species, under the assumption that all proteins share the same phylogeny. Whelan and Goldman relaxed this assumption by using approximate phylogenies for different alignments. Le and Gascuel [6] extended Whelan and Goldman's method by optimizing phylogenies and evolutionary rates across sites during the estimation process.

General models have been estimated from large databases; however, recent studies have shown that they might not be appropriate for a particular set of species because of differences in the evolutionary processes of these species [7,8,9]. A number of specific amino acid substitution models for important species have been introduced. For example, Dimmic and colleagues estimated the rtRev model for inferring retrovirus and reverse transcriptase phylogeny [9]. Nickle and coworkers introduced HIV-specific models that showed a consistently superior fit compared with the best general models when analyzing HIV proteins [7].

Influenza viruses are among the most dangerous viruses for birds and humans. They are RNA viruses and belong to the Orthomyxoviridae family. They are divided into three types: influenza A, influenza B, and influenza C, of which influenza A is the most prevalent and dangerous. In recent years, influenza A viruses have caused serious problems for human health and the economy. Currently emerging influenza epidemics include H5N1 ('avian flu') and H1N1. More details about historical and recently emerging influenza pandemics and epidemics can be found at the World Health Organization website (http://www.who.int/csr/disease/influenza/en/). Theoretical and experimental studies have been conducted extensively for decades to understand the evolution, transmission, and infection processes of influenza viruses [10,11,12,8] (and references therein).

Recently, we published the FLU model, which was estimated specifically for influenza viruses. Our extensive experiments showed that FLU performed much better than other models when analyzing influenza protein sequences. Although the FLU model is specific to influenza viruses, it is not specific to protein types. The influenza A virus genome encodes 11 different protein types: HA, NA, M1, M2, NS1, NS2, NP, PA, PB1, PB1-F2, and PB2 (see Table 1 for more details). These protein types have different structures and evolve at different rates, which raises the need for different amino acid substitution models for different protein types.

In this study, we continue our work on amino acid substitution models for influenza viruses. Since influenza A viruses are the most prevalent and dangerous, we estimated 11 amino acid substitution models for the 11 protein types of influenza A viruses. These models will allow researchers to analyze the evolutionary processes of influenza proteins more precisely.

The paper is organized as follows. The Method section presents the theoretical background of amino acid substitution models and our approach to estimating protein type specific models. The Data preparation section describes how we prepared the protein sequences used to estimate the models. Comparisons among models are reported in the Results section. Conclusions are given in the last section.

METHOD

The substitution process at each amino acid site is assumed to be independent of other sites, stationary, and constant over time [13,14]. We can use a time-homogeneous, time-continuous, and time-reversible Markov process [13,14,15] to model the substitution process between amino acids. The model consists of two components: 1) a 20x20 instantaneous substitution rate matrix $Q = \{q_{xy}\}$, where $q_{xy}$ is the number of substitutions from amino acid $x$ to amino acid $y$ per time unit; 2) a 20-vector of amino acid frequencies $\Pi = \{\pi_x\}$, where $\pi_x$ is the frequency of amino acid $x$. While $\Pi$ can be easily estimated from data using a counting method, $Q$ is the subject of the estimation methods.

We apply a four-step maximum likelihood approach to estimate protein type specific models, as pictured in Figure 1:
- Data preparation step: Download, clean, classify, and align sequences to create multiple protein alignments (more details are presented in Section 3).
- Constructing tree step: For each protein alignment, use the maximum likelihood method (for example, as implemented in PhyML [16]) to construct a phylogenetic tree using an initial matrix Q (initialized with the FLU matrix).
- Estimating model step: Use an expectation-maximization algorithm (for example, as implemented in XRATE [17]) to train a new model Q' using the protein alignments and the reconstructed trees.
- Comparing step: Compare Q and Q'. If Q' is nearly identical to Q, Q' is considered the final model. Otherwise, replace Q by Q' and go back to the Constructing tree step.

Figure 1. The four-step approach to estimate protein type specific amino acid substitution models.

DATA PREPARATION

On January 7, 2011, there were more than 9,300 complete genomes including 200,000 protein sequences in the Influenza database at NCBI (www.ncbi.nlm.nih.gov/genomes/FLU/) [18]. In the database, 95% of sequences are influenza A proteins, including ~9,000 complete genomes and ~190,000 protein sequences. The other sequences belong to influenza B and C viruses. The number of available sequences for influenza B and C is not enough to estimate protein type specific models for these virus types. We therefore concentrate on estimating models for the 11 protein types of influenza A viruses. The data preparation process is as follows:
- Downloading step: We downloaded 200,000 influenza A protein sequences consisting of more than 27 million amino acids.
- Cleaning step: The data contain a large number of duplicated sequences. We removed duplicates and obtained ~100,000 unique protein sequences.
- Categorizing step: Sequences were classified into 11 classes corresponding to the 11 protein types: HA, M1, M2, NA, NS1, NS2, NP, PA, PB1, PB1-F2, and PB2.
- Splitting step: Sequences of the same class were split into subgroups such that each subgroup consists of at most 50 sequences.
- Aligning step: Sequences of each subgroup were aligned using the MUSCLE program (default parameters) [19] and subsequently cleaned by the Gblocks program (parameter -b5=h) [20] to eliminate sites containing too many gaps.

We selected 2,500 alignments (66,139 sequences, 1,058,987 sites, and 27,588,017 amino acids), each consisting of at least 50 amino acid sites.

Table 1. Data of 11 protein types of influenza A viruses. Proportions are computed over the 66,139 selected sequences.

Protein type   #Sequences   #Alignments   Proportion (%)
HA                 17,261           646            26.10
NA                  9,718           377            14.69
PB2                 6,873           274            10.39
PA                  6,443           245             9.74
PB1                 6,195           238             9.37
NS1                 4,852           176             7.34
NP                  4,568           162             6.91
M2                  3,263           124             4.93
NS2                 2,465            91             3.73
M1                  2,399            88             3.63
PB1-F2              2,102            79             3.18

RESULTS

We estimated 11 amino acid substitution models, HA, NA, M1, M2, NS1, NS2, NP, PA, PB1, PB1-F2, and PB2, for the 11 corresponding protein types of influenza A viruses. To compare different models, we conducted two-fold cross-validation. To this end, we randomly divided the dataset of each protein type into two equal subsets, one for training and the other for testing. Extensive experiments show that Q is almost unchanged (Q' ~ Q) after three iterations.

Model performance analysis

We compared the 11 protein type specific models with the FLU model (the best amino acid substitution model for influenza viruses) by comparing the maximum likelihood trees constructed using the different models. This is a standard way to compare models. As expected, the experiments showed that each protein type specific model outperformed all other models when analyzing its corresponding protein sequences (see Table 2). For example, the PA model is the best model in 98.37% of cases when analyzing PA alignments. As Table 2 also shows, the NS2 model does not completely outperform the other models: it is the best model in only 65.93% of cases when analyzing NS2 sequences. This is because only a small number of NS2 protein sequences are available for estimating the NS2 model.

Table 2. Summary of FLU and HA, M1, M2, NA, NS1, NS2, NP, PA, PB1, PB1-F2, PB2 models when analyzing their corresponding protein type sequences. For example, PA is the best model in 98.37% of PA alignments.

Model     % cases where the model is the best
PA        98.37
NS1       98.30
NP        97.53
NA        95.76
HA        90.87
M2        89.52
M1        87.50
PB1       87.39
PB2       86.50
PB1-F2    68.33
NS2       65.93

Table 3 summarizes the comparisons between FLU and the protein type specific models in terms of log likelihoods when analyzing their corresponding proteins. The greater the log likelihood per site, the better the model. The protein type specific models are clearly better than the FLU model when analyzing their corresponding proteins. For example, the log likelihood of the HA model (-16.5699) is higher than the log likelihoods of the other models.

Table 3. Pairwise comparisons between FLU and HA, M1, M2, NA, NS1, NS2, NP, PA, PB1, PB1-F2, PB2 models in terms of log likelihoods. (Table body not recoverable from the source; row labels: HA, M1, M2, NA, NS1, NS2, NP, PA, PB1, PB1-F2, PB2.)

Table 4. The results of the five best models when analyzing NA sequences. NA is the best model in 361 of 377 cases. (Table body not recoverable from the source; surviving header fragments: 1st, LogLK/site, LogLK/site, M1>M2, M1.)
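The two model components described in the Method section, the rate matrix Q and the frequency vector, can be illustrated with a small sketch. It estimates frequencies by counting and builds a time-reversible Q from symmetric exchangeabilities using the common parameterization q_xy = r_xy * pi_y. That parameterization, the amino acid ordering in `AMINO_ACIDS`, the rate normalization, and both function names are assumptions made for illustration; the paper does not state them.

```python
from collections import Counter

# Assumed ordering of the 20 amino acids (one-letter codes); not from the paper.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def estimate_frequencies(sequences):
    """Counting estimate of the frequency vector: share of each amino acid."""
    counts = Counter(ch for seq in sequences for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    return [counts[a] / total for a in AMINO_ACIDS]


def build_rate_matrix(exchangeabilities, pi):
    """Build a time-reversible Q with q_xy = r_xy * pi_y for x != y.

    `exchangeabilities` is a symmetric matrix r (assumed parameterization).
    Each diagonal entry makes its row sum to zero, and Q is rescaled so the
    expected number of substitutions per time unit is 1.
    """
    n = len(pi)
    q = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x != y:
                q[x][y] = exchangeabilities[x][y] * pi[y]
        q[x][x] = -sum(q[x])  # diagonal is still 0.0 here, so this sums off-diagonals
    rate = -sum(pi[x] * q[x][x] for x in range(n))
    return [[value / rate for value in row] for row in q]


if __name__ == "__main__":
    pi = estimate_frequencies(["ARND", "AARN"])
    ones = [[1.0] * 20 for _ in range(20)]
    q = build_rate_matrix(ones, pi)
    print(pi[0], q[0][1])
```

Time reversibility shows up as detailed balance: pi_x * q_xy equals pi_y * q_yx for every pair, which holds here because r is symmetric.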
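The iterative part of the four-step approach (construct trees, estimate a model, compare, repeat) can be sketched as a generic loop. This is only a control-flow sketch: `build_tree` and `train_model` are hypothetical callables standing in for the tree construction and expectation-maximization steps, and the convergence test (largest absolute entry difference below `tol`) is an assumption, since the paper says only that Q' should be "nearly identical" to Q.

```python
def max_abs_difference(q_old, q_new):
    """Largest absolute entry-wise difference between two matrices."""
    return max(
        abs(a - b)
        for row_old, row_new in zip(q_old, q_new)
        for a, b in zip(row_old, row_new)
    )


def estimate_model(alignments, initial_q, build_tree, train_model,
                   tol=1e-3, max_iter=10):
    """Iterate tree construction and model training until Q stabilizes.

    build_tree(alignment, q) and train_model(alignments, trees) are
    hypothetical placeholders for the external tools used in the paper.
    Returns the final matrix and the number of iterations performed.
    """
    q = initial_q
    for iteration in range(1, max_iter + 1):
        trees = [build_tree(alignment, q) for alignment in alignments]  # constructing tree step
        q_new = train_model(alignments, trees)                           # estimating model step
        if max_abs_difference(q, q_new) < tol:                           # comparing step
            return q_new, iteration
        q = q_new
    return q, max_iter
```

In the paper's setting, `initial_q` would be the FLU matrix, and the authors report that Q is almost unchanged after three iterations.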
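The cleaning and splitting steps of the data preparation process can be sketched as below. The sketch assumes that duplicates are exact string matches and that subgroups are formed by simple consecutive chunking; the paper does not say how duplicates were detected or how subgroups were chosen, and both function names are illustrative.

```python
def remove_duplicates(sequences):
    """Keep the first occurrence of each distinct sequence, preserving order."""
    return list(dict.fromkeys(sequences))


def split_into_subgroups(sequences, max_size=50):
    """Split a list of sequences into consecutive subgroups of at most max_size."""
    return [sequences[i:i + max_size] for i in range(0, len(sequences), max_size)]


if __name__ == "__main__":
    raw = ["MKAIL", "MKTIL", "MKAIL"]
    unique = remove_duplicates(raw)
    print(unique)                                   # duplicates removed
    print([len(g) for g in split_into_subgroups(list(range(120)))])
```

Consecutive chunking can leave a small final subgroup, so a real pipeline would need an extra rule if a minimum subgroup size is required.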
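The evaluation in the Results section rests on three simple computations: a random two-way split of the alignments, the log likelihood per site, and the percentage of alignments on which a model scores best. A minimal sketch follows; the function names, the data layout (one dictionary of model name to log likelihood per alignment), and the fixed random seed are assumptions for illustration, and the log likelihood values in the usage example are made up.

```python
import random


def two_fold_split(items, seed=0):
    """Randomly divide items into two halves (training, testing)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def loglik_per_site(total_loglik, n_sites):
    """Normalize a log likelihood by alignment length; greater is better."""
    return total_loglik / n_sites


def percent_best(loglik_by_alignment, model):
    """Percentage of alignments on which `model` has the highest log likelihood."""
    wins = sum(
        1 for scores in loglik_by_alignment
        if max(scores, key=scores.get) == model
    )
    return 100.0 * wins / len(loglik_by_alignment)


if __name__ == "__main__":
    # Made-up scores arranged so that NA wins 361 of 377 alignments.
    scores = [{"NA": -10.0, "FLU": -11.0}] * 361 + [{"NA": -11.0, "FLU": -10.0}] * 16
    print(round(percent_best(scores, "NA"), 2))
```

With 361 wins out of 377 alignments the percentage rounds to 95.76, which agrees with the NA entry reported in Table 2.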