PROTEIN TYPE SPECIFIC AMINO ACID SUBSTITUTION MODELS FOR INFLUENZA VIRUSES

A Thesis Submitted for the degree of MASTER OF COMPUTER SCIENCE IN THE FACULTY OF INFORMATION TECHNOLOGY

BY Nguyen Van Sau

Information Technology, University of Engineering and Technology, Vietnam National University Hanoi, 144 Xuan Thuy, Ha Noi, Viet Nam

MAY 2012

© (Sau Nguyen Van), 2012. All rights reserved.

ACKNOWLEDGMENTS

I would like to thank my thesis supervisor, Le Sy Vinh, for fruitful discussions, excellent advice, and his friendly manner. I also thank Water Resources University in Hanoi, where I work, for its continuous support. I thank Prof. Hoang Xuan Huan for introducing me to bioinformatics and Dang Cao Cuong for supporting me throughout my research. Finally, I express my gratitude to my family and friends for helping and encouraging me during this work.

Parts of this thesis have been published in the following article:

Nguyen Van Sau, Dang Cao Cuong, Le Si Quang, Le Sy Vinh, "Protein Type Specific Amino Acid Substitution Models for Influenza Viruses," KSE, pp. 98-103, 2011 Third International Conference on Knowledge and Systems Engineering, 2011.

Contents

ACKNOWLEDGMENTS
LIST OF FIGURES
LIST OF TABLES
NOTATIONS/ABBREVIATIONS
ORIGINALITY STATEMENT
ABSTRACT
CHAPTER OVERVIEW
1.1 Motivation
1.2 Organization of this thesis
CHAPTER AMINO ACID SUBSTITUTION MODELS
2.1 Amino acid sequences
2.2 Amino-acid substitution models
CHAPTER METHODS TO ESTIMATE MODELS
4.1 Methods
4.1.1 Counting methods
4.1.2 Maximum likelihood methods
4.2 Protein type specific amino acid substitution models estimation
CHAPTER DATA PREPARATION
3.1 Collecting data
3.2 Categorizing data
3.3 Splitting data
3.4 Aligning data
CHAPTER RESULTS
CHAPTER SUMMARY AND CONCLUSION
APPENDIX
BIBLIOGRAPHY

LIST OF FIGURES

Figure 1 Growth of the number of base pairs in NCBI from April 2002 to June 2011
Figure 2 Different shapes of the Γ-distribution with respect to the shape parameter
Figure 3 The four-step approach to estimate protein type specific amino acid substitution models
Figure 4 Link to download influenza virus data
Figure 5 The Robinson-Foulds distances between trees inferred using FLU and the 11 protein type specific models for proteins of influenza A viruses

LIST OF TABLES

Table 1 Statistical number of deaths present
Table 2 Twenty different amino acids
Table 3 Data of 11 protein types of influenza A viruses
Table 4 Classification data into 11 subgroups
Table 5 Summary of FLU and HA, M1, M2, NA, NS1, NS2, NP, PA, PB1, PB1-F2, PB2 models when analyzing their corresponding protein type sequences
Table 6 Pairwise comparisons between FLU and HA, M1, M2, NA, NS1, NS2, NP, PA, PB1, PB1-F2, PB2 models in terms of log likelihoods
Table 7 The results of the five best models when analyzing HA sequences
Table 8 The results of the five best models when analyzing NA sequences
Table 9 Log likelihood comparison among the HA (NA) model and other models when analyzing HA (NA) protein sequences
Table 10 Correlations among 12 models

NOTATIONS/ABBREVIATIONS

WHO: World Health Organization
RF: Robinson and Foulds
MLE: Maximum Likelihood Estimate
EMBL: European Molecular Biology Laboratory
NCBI: National Center for Biotechnology Information
DDBJ: DNA Data Bank of Japan
BLAST: Basic Local Alignment Search Tool
MAS: Multiple Alignment Sequences
ML: Maximum Likelihood
MP: Maximum Parsimony

ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at University of Engineering and Technology (UET/Coltech) or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UET/Coltech or elsewhere, is explicitly
acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.’

Hanoi, May 30th, 2012

Signed …………………………………………

ABSTRACT

The amino acid substitution model (matrix) is a crucial part of protein sequence analysis systems. General amino acid substitution models have been estimated from large protein databases; however, they are not specific to influenza viruses. In a previous study, we estimated an amino acid substitution model, FLU, for all influenza viruses. Experiments showed that FLU outperformed other models when analyzing influenza protein sequences.

Influenza virus genomes encode different protein types, which differ in both structure and evolutionary process. Although the FLU matrix is specific to influenza viruses, it is not specific to individual influenza protein types. Since influenza viruses cause serious problems for both human health and the economy, it is worth studying them as specifically as possible. In this thesis, we used more than 27 million amino acids to estimate 11 protein type specific models for influenza viruses. Experiments showed that the protein type specific models outperformed the FLU model, previously the best model for influenza viruses. These protein type specific models help researchers study influenza viruses more precisely.

CHAPTER OVERVIEW

1.1 Motivation

Influenza viruses cause many deaths and pose serious economic risks. According to the World Health Organization (WHO, http://www.who.int/en/), the first recorded influenza pandemic began in Europe and spread to Asia and Africa in 1580. The largest epidemic, the Spanish influenza, is believed to have killed between 20 million and 40 million people worldwide. The “Asian Flu” began in China and killed million people globally in the 20th century. Further pandemics have occurred since then
(see Table 1 for more information).

Table 1 Statistical number of deaths present
Columns: country; cases (c) and deaths (d) for each year from 2003 to 2011; total cases and deaths.
Countries: Azerbaijan, Bangladesh, Cambodia, China, Djibouti, Egypt, Indonesia, Iraq, Laos, Myanmar, Nigeria, Pakistan, Thailand, Turkey, Viet Nam.
Total over all countries and years: 564 cases, 330 deaths.
Source: Communicable Disease Surveillance & Response (CSR), World Health Organization: http://www.who.int/csr/disease/avian_influenza/country/cases_table_2011_08_09/en/index.html
Note: c is cases and d is deaths.

Algorithm to remove duplicate sequences

Algorithm 1: Remove duplicate sequences
Input: a data file D containing N sequences
Output: the file after removing duplicate sequences
begin
    num = 0;                    // number of unique sequences kept so far
    new newSeq[N];
    for i from 0 to N-1
    begin
        int unique = 0;         // 0 if D[i] is unique so far, 1 if it duplicates a kept sequence
        fixSeq = D[i];
        for j from 0 to num-1   // compare only against the sequences already kept
            if (fixSeq == newSeq[j])
            begin
                unique = 1;
                break;
            end
        if (unique == 0)
        begin
            newSeq[num] = fixSeq;
            num++;
        end
    end
end

After removing duplicate sequences from the downloaded data set, we must classify the sequences by influenza protein type in order to estimate protein type specific amino acid substitution models. There are 11 influenza protein types into which the data set is classified: HA, NA, M1, M2, NS1, NS2, NP, PA, PB1, PB1-F2, and PB2.

First, we classify the data into three groups (influenza A viruses, influenza B viruses, and influenza C viruses) according to whether the description in each sequence name contains the string “Influenza A viruses”, “Influenza B viruses”, or “Influenza C viruses”, respectively. Here,
we only consider the set of influenza A viruses because, after analyzing the sets of influenza B viruses and influenza C viruses, we found that they do not contain enough information to estimate specific models for them.

Second, we classify those sequences into 11 subgroups according to the protein names given in their descriptions. Table 4 lists the names used to identify each subgroup.

Table 4 Classification data into 11 subgroups

Subgroup: Names for identification

HA: “HA protein”, “HA”, “Hemagglutinin”, “heemagglutinin HA-1 region”, “haemagglutinin HA1 chain”, “haemagglutinin precursor”, “haemagglutinin; HA protein”, “hamagglutinin”, “heamagglutinin precursor”, “hemagglutinin HA precursor”, “hemagglutinin glycoprotein precursor”, “hemagglutinin precursor”, “hemagglutinin precursor (putative); putative”, “hemagglutinin protein subunit 1”, “hemegglutinin”, “H3 hemagglutinin”, “H3HA1 surface glycoprotein”, “H5 hemagglutinin”, “Haemagglutinin”, “L protein”, “haemagglutinin subunit HA1”, “haemaglutinin precursor”, “heamagglutinin”, “hemagglutinin (HA1 domain)”, “hemagglutinin subunit”, “hemagglutinin 1”, “hemagglutinin 5”, “hemagglutinin A”, “hemagglutinin H1 subtype”, “hemagglutinin H1”, “hemagglutinin H3”, “hemagglutinin H5”, “hemagglutinin HA1 chain”, “Haemagglutinin HA1 precursor”, “hemagglutinin HA2”, “hemagglutinin HA”, “hemagglutinin esterase precursor”, “hemagglutinin gene”, “hemagglutinin glycoprotein”, “hemagglutinin precursor (high yield phenotype)”, “hemagglutinin precursor (partial)”, “hemagglutinin prepropeptide”, “hemagglutinin protein subunit precursor”, “hemagglutinin protein”, “hemagglutinin subtype 1”, “hemagglutinin subtype H5”, “hemagglutinin subunit 1”, “hemagglutinin/neuraminidase precursor”, “hemaglutinin HA1”, “hemaglutinin”, “hemmagglutinin”, “hemmaglutinin”, “polyprotein”, “prehemagglutinin HA1”, “prehemagglutinin”, “HA1 hemagglutinin”, “HA1”, “HAY subunit of haemagglutinin”, “Haemagglutinin precursor HA1 region”, “haemagluttinin”, “haemmagglutinin”,
“hemaggluitnin”, “hemagglutinin chain”, “hemagglutinin H6”, “hemagglutinin HA1 domain”, “hemagglutinin HA1 subunit”, “hemagglutinin HA1”, “hemagglutinins HA1 and HA2 precursor”, “hemmaglutinin”, “polyprotein precursor”, “truncated hemagglutinin”, “prehemagglutinin”

NA: “NA glycoprotein”, “NA”, “Neuramindase”, “neuramidase”, “neuramindase NAN1”, “neuraminidase glycoprotein”, “neuraminidase protein 2”, “neuraminidase subtype 1”, “non-structural protein 2”, “neuraminidase subtype 1”, “unnamed protein product; neuraminidase”, “neuraminidase”, “neuraminidasa”, “N2 neuraminidase”, “hypothetical NB-NA hybrid protein”, “neruaminidase”, “neuraminidase N2”, “neuraminidase cell surface glycoprotein”, “neuraminidase protein N1”, “neuraminidase subtype 2”, “neuraminidase surface protein”, “neuraminidase; sialidase”

M1: “M protein”, “M1”, “matix protein”, “matrix protein 1; M1”, “matrix protein M1”, “membrane protein M1”, “M1 protein”, “MP”, “Matrix Protein 1”, “M”, “P42 protein”, “martix protein 1”, “martix protein”, “matrix M1”, “matrix protein (M1)”, “matrix protein M1”, “matrix”, “membrane matrix protein M1”, “membrane matrix protein”, “membrane protein 1”

M2: “M2 protein”, “Matrix Protein 2”, “M2”, “matrix M2”, “matrix protein 2”, “matrix protein 2; M2”, “membrane ion channel; M2”, “membrane ion channel”, “membrane protein 2”, “transmembrane protein”, “M2 matrix protein”, “M2 membrane protein”, “matrix protein M2”, “matrix protein M2”, “membrane ion channel M2”, “membrane ion channel protein”, “membrane protein M2”

NS1: “NS protein”, “NS1”, “Nonstructural Protein 1”, “non-structural protein NS1”, “truncated NS1”, “unnamed protein product”, “NS1 nonstructural protein”, “NS1 protein”, “NS”, “Non-structural protein 1”, “non structural protein”

NS2: “NS2”, “NS2 nonstructural protein”, “NS2 protein”, “nonstructural protein NS2”, “nuclear export protein NEP”, “nuclear export protein NS2”, “NS-2”, “NEP/NS2”, “NEP”, “NS2/NEP”, “Non structural protein 2”, “Nonstructrual
Protein 2”, “Nonstructural Protein 2”

NP: “nuleoprotein NP”, “NP protein”, “Nucleoprotein (partial sequence, AA 9-188)”, “NP”, “Nucleoprotein”

PA: “DI-3 protein”, “PA”, “PA polymerase precursor”, “PA polymerase subunit”, “PA polymerase”, “PA protein”, “Polymerase PA”, “RNA polymerase subunit”, “RNA-directed RNA polymerase subunit P2”, “acidic protein 2”, “polymerase 3”, “polymerase complex subunit PA”, “polymerase subunit PA”, “DI-2 protein”, “PA polymerase protein”, “Polymerase A Protein”, “Polymerase acidic protein”, “RNA polymerase A”

PB1: “basic polymerase 1”, “basic polymerase subunit 1”, “PB1”, “polymerase PB1”, “polymerase basic 1”, “polymerase basic protein”, “polymerase basic subunit 1”, “polymerase complex subunit PB1”, “polymerase subunit PB1”, “basic polymerase protein 1”, “basic protein 2”

PB1-F2: “PB1-F2”

PB2: “PB2”, “RNA polymerase”, “polymerase protein”, “polymerase protein 2”

3.3 Splitting data

Next, we split the sequences of each protein type into subgroups such that each subgroup contains at most 50 sequences and at least a minimum number n of sequences. Sequences are assigned to subgroups at random so that closely related sequences are not concentrated in the same subgroup. Consequently, the data are more evenly distributed and better suited to estimating models.

Algorithm to split sequences at random

Algorithm 2: Random split sequences
Input: a data file containing N sequences, and the minimum number n of sequences per output file
Output: files, each containing at least n sequences
begin
    arraySequences[N];
    new arrayPositions[n];
    int remainder = N;          // number of sequences not yet written to a file
    do
    begin
        arrayPositions = getNRandomPosition(arraySequences);   // n random positions among the unwritten sequences
        writeInANewFile(arrayPositions);
        remainder = remainder - n;
    end while (remainder >= 2*n);
    // fewer than 2*n sequences remain; they form the last file
    writeInANewFile(positions of the remaining sequences);
end

3.4 Aligning data

Sequences of each subgroup are aligned using the MUSCLE program with default parameters (Edgar, 2004) and subsequently cleaned by the Gblocks program (parameter -b5=h) (Castresana, 2000) to eliminate sites containing too many gaps. We selected 2,500 alignments (66,139 sequences, 1,058,987 sites, and
27,588,017 amino acids), each consisting of at least 50 amino acid sites.

CHAPTER RESULTS

We estimated 11 amino acid substitution models (HA, NA, M1, M2, NS1, NS2, NP, PA, PB1, PB1-F2, and PB2) for the 11 corresponding protein types of influenza A viruses. To compare different models, we conducted two-fold cross-validation. To this end, we randomly divided the data set of each protein type into two equal subsets, one for training and the other for testing.

Model performance analysis

We compared the 11 protein type specific models with the FLU model (the best amino acid substitution model for influenza viruses) by comparing maximum likelihood trees constructed using the different models. Note that this is a standard way to compare different models.

Table 5 Summary of FLU and HA, M1, M2, NA, NS1, NS2, NP, PA, PB1, PB1-F2, PB2 models when analyzing their corresponding protein type sequences. For example, PA is the best model in 98.37% of PA alignments.

Model      % cases where the model is the best
PA         98.37
NS1        98.30
NP         97.53
NA         95.76
HA         90.87
M2         89.52
M1         87.50
PB1        87.39
PB2        86.50
PB1-F2     68.33
NS2        65.93

As expected, experiments showed that the protein type specific models outperformed all other models when analyzing their corresponding protein sequences (see Table 5). For example, the PA model is the best model in 98.37% of cases when analyzing PA alignments. As Table 5 also shows, the NS2 model does not completely outperform the other models: it is the best model in only 65.93% of cases when analyzing NS2 sequences. This is because only a small number of NS2 protein sequences are available for estimating the NS2 model.

Table 6 summarizes the comparisons between FLU and the protein type specific models in terms of log likelihoods when analyzing their corresponding proteins. Note that the greater the log likelihood per site, the better the model. The protein type specific models are clearly better than the FLU model when analyzing their corresponding proteins. For example, the log likelihood of the HA model
(-16.5699) is higher than the log likelihoods of the other models.

Table 6 Pairwise comparisons between FLU and HA, M1, M2, NA, NS1, NS2, NP, PA, PB1, PB1-F2, PB2 models in terms of log likelihoods
LogLK/site LogLK/site FLU (M2) M1>M2 M1