Understanding the functional roles of intrinsic protein disorder in NFkB transcription factors

UNDERSTANDING THE FUNCTIONAL ROLES OF INTRINSIC PROTEIN DISORDER IN NFΚB TRANSCRIPTION FACTORS

LIM SHEN JEAN
B.Sc.(Hons.), NUS

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF BIOCHEMISTRY
NATIONAL UNIVERSITY OF SINGAPORE
2011

Acknowledgements

I am grateful to my supervisor, Associate Professor Tan Tin Wee, for his guidance on my research project. Next, I would like to thank Assistant Adjunct Professor Victor Tong and Dr. Asif Khan (Johns Hopkins University) for their valuable ideas and advice for my project. I am also very grateful for the IT assistance provided by Mark de Silva and Lim Kuan Siong from the Life Sciences Institute. Finally, I would like to express my appreciation to all my colleagues, as well as the administrative staff in the Department of Biochemistry, National University of Singapore, for their strong support during the course of my project.

Summary

Protein dynamics, particularly intrinsic protein disorder, has been implicated in cellular functions. Intrinsic protein disorder contributes to transcription and cell signaling through the accommodation of multiple interaction partners and modification sites, and the provision of regulatory flexibility. Here, in line with previous studies, I hypothesize that, analogous to the sequence conservation of functionally important sites, intrinsic protein disorder properties are evolutionarily conserved. To test this hypothesis in the more specific context of transcriptional regulation in cell signaling, I developed an in silico analysis pipeline for the identification of intrinsically disordered protein residues, data mining, and in-depth analysis of the conservation, localization and function of predicted disordered regions.
The Nuclear Factor Kappa-light-chain-enhancer of Activated B cells (NFκB/Rel), important for a variety of processes including cell survival, inflammation and immunity, was chosen as the exemplar protein for this study. The findings highlight distinctive key roles of conserved disordered and non-disordered residues in different aspects of NFκB function. Differences in the distribution and conservation patterns of protein disorder in each NFκB protein type raise the possibility of conserved disorder signatures in different protein families, which, if true, will prove valuable for functional characterization. On a larger scale, this project offers a meaningful perspective for understanding protein function through intrinsic protein disorder. The analysis pipeline developed in this study will be instrumental for large-scale functional studies of protein families. Findings from this project will also contribute to scientific knowledge in transcriptional regulation and cell signaling.

List of Tables

Table 1. Ranges of timescales and amplitudes where protein dynamics have been reported to occur.

Table 2. Performance comparison of primary and meta-predictors for disorder prediction at their respective optimum thresholds. The predictive performance of MetaDisorder MD2 and P+F (DisBatch) is highlighted in bold.

List of Figures

Figure 1. The two types of protein dynamics (or protein motions) and their distribution, relative to protein structure.

Figure 2. A) Bar plot of mean accuracy values of primary and meta disorder predictors at their respective optimum thresholds, with standard error estimates. B) Boxplot of accuracy values of primary and meta disorder predictors at their respective optimum thresholds. Each boxplot depicts the minimum accuracy value, lower quartile, median, upper quartile, maximum accuracy value and any outlier observation(s) for each predictor. The boxplot for MetaDisorder MD2 and P+F (DisBatch) is highlighted in grey.

Figure 3. Sequence submission page of DisBatch. DisBatch is available at http://bioslax01.bic.nus.edu.sg/meta/.

Figure 4. Output page of DisBatch. The page provides download links for each output file, and a link to the help page at the bottom of the page.

Figure 5. Detailed sequence inclusion and exclusion criteria for records in NFκB Base.

Figure 6. Number of records present in NFκB Base (Release: Beta 2.0) for each NFκB protein type. NFκB Base is available at http://proline.bic.nus.edu.sg/~shenjean/nfkb/.

Figure 7. A typical entry page of NFκB Base. Each entry contains information, where available, on source accession, NFκB protein type, description, organism, gene name, chromosome name, sequence length, accession number(s) of duplicate record(s) and cross-links to major online databases, including NCBI Protein (sequence database), UniProt (sequence database), GO (Gene Ontology database), HGNC (gene nomenclature database), InterPro (protein domain and family database), PDB (protein structure database), PubMed (literature database) and NCBI Taxonomy (taxonomy database).

Figure 8. Sample keyword search output of NFκB Base, displaying the accession number, source accession number, organism and description fields. NFκB Base supports keyword searches in all or specific fields, where users can submit a query at the top of every page, shown in the upper frame of this figure.

Figure 9. The Browse page of NFκB Base with jQuery-supported dynamic data search and display.

Figure 10. BLAST interface for NFκB Base.

Figure 11. Distribution of the average disorder score at each alignment position for Class I NFκB proteins at the RHD domain of A) NFκB1, B) NFκB2 and C) Relish, as predicted by DisBatch. The average disorder score cutoffs of 0.5 and 1.5 were used to distinguish between moderately (predicted only by PrDOS to be disordered) and highly disordered (predicted by both PrDOS and FoldIndex) residues, respectively. Shannon's entropy values were also plotted in the graph for comparison.

Figure 12. Distribution of the average disorder score at each alignment position for Class II NFκB proteins at the RHD domain of A) RelA, B) RelB, C) C-Rel, D) Dorsal and E) Dif, as predicted by DisBatch.

Figure 13. Distribution of the average disorder score at each alignment position for Class I NFκB proteins at the IPT domain of A) NFκB1, B) NFκB2 and C) Relish, as predicted by DisBatch.

Figure 14. Distribution of the average disorder score at each alignment position for Class II NFκB proteins at the IPT domain of A) RelA, B) RelB, C) C-Rel, D) Dorsal and E) Dif, as predicted by DisBatch.

Figure 15. Distribution of the average disorder score at each alignment position for Class I NFκB proteins at sites with no functional annotation in A) NFκB1, B) NFκB2 and C) Relish, as predicted by DisBatch.

Figure 16. Distribution of the average disorder score at each alignment position for Class II NFκB proteins at sites with no functional annotation in A) RelA, B) RelB, C) C-Rel, D) Dorsal and E) Dif, as predicted by DisBatch.

Figure 17. Distribution of the average disorder score at each alignment position for Class I NFκB proteins at the ANK domain (in red) and Death domain (in black) of A) NFκB1, B) NFκB2 and C) Relish, as predicted by DisBatch.

Figure 18. Scatter plot of average disorder score against the standard deviation of disorder scores for Class I NFκB proteins, A) NFκB1, B) NFκB2 and C) Relish, as predicted by DisBatch. The scatter plots show 2 distinct quadrants of: conserved non-disordered residues (bottom left) and conserved disordered residues (bottom right). Functional domains and sites were annotated in the graph and coloured accordingly.

Figure 19. Scatter plot of average disorder score against the standard deviation of disorder scores for Class II NFκB proteins, A) RelA, B) RelB and C) C-Rel, as predicted by DisBatch.

Figure 20. (Cont'd from Figure 19) Scatter plot of average disorder score against the standard deviation of average disorder score for Class II NFκB proteins, A) Dorsal, B) Dif, as predicted by DisBatch.

Figure 21. Scatter plot of average disorder score against the CV of average disorder score for Class I NFκB proteins, A) NFκB1, B) NFκB2 and C) Relish, as predicted by DisBatch. The scatter plot shows 4 distinct quadrants of: non-conserved non-disordered residues (top left of scatter plot), non-conserved disordered residues (top right), conserved non-disordered residues (bottom left) and conserved disordered residues (bottom right). Functional domains and sites were annotated in the graph and coloured accordingly.

Figure 22. Scatter plot of average disorder score against the CV of average disorder score for Class II NFκB proteins, A) RelA, B) RelB and C) C-Rel, as predicted by DisBatch.

Figure 23. (Cont'd from Figure 22) Scatter plot of average disorder score against the CV of average disorder score for Class II NFκB proteins, A) Dorsal, B) Dif, as predicted by DisBatch.

Figure 24. Structures of representative Class I NFκB homodimers, NFκB1 (top) and NFκB2 (bottom), coloured according to protein disorder annotations (left) and β-factors (right). The C-terminal IPT domain contains ankyrin protein binding sites enveloping the dimerization interface. Ankyrin repeats and the Death domain were not present in the 3D structures. The α-helical insert regions are conserved disordered residues, highlighted in red, at the left of the protein structure in the N-terminal RHD domain.

Figure 25. Structures of representative Class II NFκB homodimers, RelA (top) and C-Rel (bottom), coloured according to protein disorder annotations (left) and β-factors (right).

Figure 26. Structures of representative NFκB heterodimers formed between Class I and Class II NFκB proteins, coloured according to protein disorder annotations (left) and β-factors (right). Examples shown here are the RelA-NFκB1 (top) and RelB-NFκB2 (bottom) heterodimers.

Figure 27. Structures of representative RelA homodimer (top) and RelA-NFκB1 heterodimer (bottom) in the IκB inhibited state, coloured according to protein disorder annotations (left) and β-factors (right).

List of Abbreviations

ADP - Adenosine Diphosphate
ATP - Adenosine Triphosphate
CASP - Critical Assessment of Techniques for Protein Structure Prediction
CD - Circular Dichroism
CD4 - Cluster of Differentiation 4
CGI - Common Gateway Interface
CSV - Comma Separated Values
DisProt - Database of Protein Disorder
DSSP - Dictionary of Secondary Structure of Proteins
HIV - Human Immunodeficiency Virus
HTML - HyperText Markup Language
JAK - Janus Kinase
LAMP - Linux Apache MySQL PERL/PHP/Python
MAPK - Mitogen-Activated Protein Kinase
NCBI - National Center for Biotechnology Information
NFκB - Nuclear Factor Kappa-light-chain-enhancer of Activated B Cells
NMR - Nuclear Magnetic Resonance
PI3K - Phosphatidylinositol 3-Kinase
PDB - Protein Data Bank
PONDR - Predictor Of Natural Disordered Regions
PSSM - Position-Specific Scoring Matrix
RH Domain - Rel Homology domain
SD - Standard Deviation
STAT - Signal Transducer and Activator of Transcription
SVM - Support Vector Machine
TAD - Transactivation Domain
RMSD - Root Mean Square Deviation

Table of Contents

1 Introduction
1.1 Protein Dynamics
1.2 Functional Significance of Protein Dynamics
1.2.1 Role of Protein Dynamics in Cell Signaling
1.3 Intrinsic Protein Disorder
1.3.1 Role of Intrinsic Protein Disorder in Cell Signaling
1.3.2 Identification of intrinsic protein disorder
1.3.2.1 Computational Tools for Intrinsic Protein Disorder Prediction
1.3.2.1.1 Ab-Initio Approaches
1.3.2.1.2 Template-based Approaches
1.3.2.1.3 Meta Approaches
1.3.2.2 Benchmark Datasets for Intrinsic Protein Disorder Prediction
1.3.3 Functional Conservation of Intrinsic Protein Disorder
1.4 Hypothesis
2 Literature Review
2.1 Transcription Factors
2.2 The NFκB Transcription Factor Family
2.2.1 Mechanisms of Action of NFκB
2.2.2 NFκB in Human Diseases
2.3 Computational analysis of NFκB proteins
2.3.1 Systems analysis of NFκB signaling machinery
2.3.2 Sequence Analysis of NFκB
2.3.2.1 Structural Analysis of NFκB
2.4 Protein Dynamics Analysis of NFκB
2.4.1 Intrinsic Protein Disorder Analysis of NFκB
2.5 Limitations of reported studies
2.6 Research Aims and Objectives
3 DisBatch: A Faster Meta-Prediction System for Large-Scale Identification of Intrinsically Disordered Protein Regions
3.1 Background
3.2 Materials and Methods
3.2.1 Server Infrastructure
3.2.2 Primary Disorder Predictor Selection
3.2.3 Meta-predictor Development
3.2.4 Performance Evaluation
3.2.5 Performance Measures
3.2.6 Web Interface
3.3 Results
3.3.1 Predictive Performance
3.3.2 Features
3.4 Discussion
3.4.1 Predictive Performance
3.4.2 Scoring Algorithm
3.4.3 Benchmark Model
3.4.4 Testing Dataset
3.4.5 Software Limitation
3.5 Future Work
3.6 Chapter Conclusion
4 NFκB Base: A Specialized Database of NFκB Proteins
4.1 Background
4.2 Materials and Methods
4.2.1 Server Infrastructure
4.2.2 Sequence Data Collection
4.2.2.1 Inclusion and Exclusion Criteria
4.2.3 Database Design
4.2.4 Web Interface
4.2.5 Results
4.2.5.1 NFκB Base Content
4.2.5.2 Features
4.2.5.2.1 Keyword Search
4.2.5.2.2 Sequence Similarity Search
4.2.5.2.3 Batch Download
4.2.6 Discussion
4.2.7 Future Work
4.2.7.1 Community Annotation Policy
4.2.8 Chapter Conclusion
5 The Role of Conserved Disordered Residues in NFκB Function
5.1 Background
5.2 Materials and Methods
5.2.1 Sequence Data Collection
5.2.2 Multiple Sequence Alignment
5.2.3 Entropy Analysis
5.2.4 Intrinsic Protein Disorder Analysis
5.2.5 Conservation of Intrinsic Protein Disorder
5.2.6 Structural Analysis
5.3 Results
5.3.1 Conserved intrinsic protein disorder signatures in NFκB
5.3.2 Structural Analysis
5.4 Discussion
5.5 Future Work
5.6 Chapter Conclusion
6 Conclusion
7 References

1 Introduction

1.1 Protein Dynamics

Protein structures are dynamic in nature and undergo motion, a property that is an integral part of their function[1-3]. Protein dynamics (or protein motion) occurs over a wide range of amplitudes and timescales. For example, simple local internal motions, such as bond and angle rotations, occur on a femto- to picosecond timescale[4]. Side-chain and loop motions occur on a pico- to nanosecond timescale, while global external motions involving large-scale conformational rearrangements occur on a micro- to millisecond timescale[5,6]. Molecular interactions and binding occur on the second timescale (Table 1)[2]. Additionally, complex, orchestrated protein motion, such as that involving molecular motors, has also been observed[3].

Table 1. Ranges of timescales and amplitudes where protein dynamics have been reported to occur.

Timescale     Examples                               Amplitude
Femtosecond   Bond and angle vibrations              < 0.001 - 0.1 Å
Picosecond    Side chain rotations                   0.1 - 1 Å
Nanosecond    Hinge bending at domain interfaces     1 - 10 Å
Microsecond   Helix-coil transitions                 10 - 100 Å
Millisecond   Protein folding, actin-myosin motion   10 - 100 Å
>1 second     Molecular interaction, binding         10 - >100 Å

Figure 1. The two types of protein dynamics (or protein motions) and their distribution, relative to protein structure.

Across timescales and amplitudes, protein dynamics can be broadly categorized into internal and external motion[7]. Internal motion involves the deformation of protein segment(s), such as bond, angle or side-chain rotations[7]. External motion, on the other hand, encompasses the translational and rotational motions of protein segment(s), such as hinge and shear motion, involving the protein backbone (Figure 1)[7,8].
Besides well-structured, ordered regions of proteins, protein dynamics have also been studied in non-globular, unstructured and/or flexible regions (to be referred to as intrinsically disordered regions)[9], where they contribute to a number of important functions. Intrinsically disordered regions will be described in detail in Section 1.3.

1.2 Functional Significance of Protein Dynamics

Protein dynamics are fundamentally involved in important biological events, such as protein folding, conformational changes and protein-protein interactions[2]. These events are in turn vital to a large array of essential biological processes and functions[1,3,6,10-12].

An example is the crucial role of protein dynamics in muscle contraction[6]. Muscle contraction involves the cross-bridge cycle, with the first step involving adenosine triphosphate (ATP) binding to the myosin head. Binding of the myosin head to actin myofilaments, and of calcium to the complex, leads to changes in electrostatic charges and cross-bridge formation. Subsequent hydrolysis of ATP to adenosine diphosphate (ADP) alters the conformation of the head of the cross-bridge and produces energy for the pulling movement of the actin filament towards the centre of the cell. Finally, the release of ADP disrupts binding with the actin filament and restarts the cycle with the next ATP binding event, in the presence of calcium ions.

At a smaller scale, protein dynamics is also involved in human immunodeficiency virus (HIV) infection[12]. This is mediated through the binding of the envelope glycoprotein, gp120, to a cluster of differentiation 4 (CD4) receptor. Briefly, the binding event causes conformational changes in gp120, in turn promoting the binding of HIV-1 to chemokine receptors on the host cell, such as CCR5 or CXCR4. This activates the gp41 protein and promotes the fusion of the HIV outer membrane with the host cell, thereby permitting viral entry and infection.
1.2.1 Role of Protein Dynamics in Cell Signaling

An important process in which protein dynamics plays an especially significant role is cell signaling[10,11]. Cell signaling involves specific recognition sites and strict regulation of participating proteins to coordinate molecular interactions at intra- and/or inter-pathway levels, ultimately resulting in combinatorial functional diversity. The dynamics of vital signaling proteins, such as calmodulin, p53, BRCA1 and MAP2, and their functional significance have been investigated[10,11,13-15]. Many of these proteins partake in local internal motion via intrinsically disordered residues that facilitate multiple molecular recognition mechanisms, interactions and regulation[13-15].

1.3 Intrinsic Protein Disorder

The examples in Section 1.2 illustrate the functional role of protein dynamics in protein segments or regions with stable, localized structures. Conventional ideas, based on the "lock-and-key" model, highlighted the functional importance of stable, localized structures. However, there has been increasing evidence that non-globular domains with unstable and flexible structures, termed intrinsically (or natively) disordered proteins or protein regions, are also important for function[9,16,17]. Intrinsically disordered proteins lead to poor protein expression and therefore pose difficulties in protein purification and crystallization, hindering high-throughput structural determination[18]. Functional sites, mainly short linear motifs such as sorting signals, targeting signals, protein ligands and post-translational modification sites, have been observed in intrinsically disordered proteins and regions[18].

To date, many intrinsically disordered proteins and protein regions have been reported[19,20]. These proteins and regions have been discovered to be either completely or largely disordered, becoming structured only in their bound states (e.g. the CREB-CBP complex[21]) or in the presence of changes in the biochemical environment[19,20]. Intrinsically disordered proteins and protein regions have been reported to engage multiple binding partners and are involved in many biological events and pathways, especially during cell signaling[14,15,22-24].

1.3.1 Role of Intrinsic Protein Disorder in Cell Signaling

In the context of cell signaling, intrinsically disordered proteins and regions have been associated with many regulatory events. Intrinsic protein disorder confers various functional advantages, which include the capability to i) accommodate more interaction partners and modification sites, ii) provide flexibility in regulation with multiple, relatively low affinity linear interaction sites, iii) provide regulation specificity with fewer linear motif types and iv) provide large intermolecular interfaces with smaller protein, genome and cell sizes[25]. For example, the recognition of DNA by disordered peptides has been shown to be involved in the regulation of gene expression by transcription, epigenetic modifications and gene silencing[26].

1.3.2 Identification of intrinsic protein disorder

Intrinsically disordered proteins and protein regions can be indirectly observed experimentally, using X-ray crystallography, Nuclear Magnetic Resonance (NMR), Raman and Circular Dichroism (CD) spectroscopy, and hydrodynamic measurements[18]. These laboratory methods recognize different types of protein disorder, giving rise to various definitions of intrinsic protein disorder, such as highly flexible regions, regions lacking a secondary structure or regions lacking a well-defined tertiary structure[18,27].

Experimental methods for detecting intrinsic protein disorder are often hampered by the lack of stable protein structures[27]. To overcome this limitation, various computational tools have been developed for the prediction of intrinsically disordered proteins and protein regions from primary protein sequences[27].
1.3.2.1 Computational Tools for Intrinsic Protein Disorder Prediction

Various definitions have been used to describe intrinsically disordered protein regions[18]. Consequently, computational tools designed for the prediction of intrinsic protein disorder utilize different approaches, based on different operational definitions of intrinsic protein disorder[18]. They can be broadly classified into ab-initio approaches, template-based approaches and meta approaches[28].

1.3.2.1.1 Ab-Initio Approaches

Ab-initio approaches utilize only sequence-derived information for disorder prediction. They originated from early methods that detect low-complexity regions in protein sequences, such as SEG[9,29]. Wootton's study on compositionally biased regions in sequence databases illustrated the association between these regions and non-globular domains[9]. However, these methods have been shown to produce many false hits, since the correlation between disordered regions and low sequence complexity does not always hold true. More refined methods have since been designed[30].

The earliest prediction system developed specifically for intrinsic protein disorder prediction was the suite of PONDR® (Predictor Of Natural Disordered Regions) neural network predictors, which identify intrinsically disordered regions based on properties such as local amino acid composition, flexibility, hydropathy and coordination number[31]. Subsequent examples include the FoldIndex software, in which prediction is based on the average residue hydrophobicity and net charge[32]. IUPred is another tool, in which intrinsic protein disorder is predicted through estimates of the capability of amino acid residues to form stable, favourable contacts based on pair-wise energy content[33]. IUPred adopts the underlying assumption that, in contrast to globular proteins, intrinsically disordered proteins are not capable of forming a large number of stable, favourable interactions[33].
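As a rough illustration of the charge-hydropathy idea behind tools such as FoldIndex, the sketch below scores a sequence from its mean hydrophobicity and mean net charge. The function names, the Kyte-Doolittle scale rescaled to the range 0 to 1, the treatment of histidine as neutral and the coefficients (2.785 and 1.151, as commonly quoted for this kind of score) are assumptions made for this example and are not taken from this thesis. Negative scores are read here as likely unfolded.

```python
# Hypothetical sketch of a charge-hydropathy score; not the FoldIndex source code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def charge_hydropathy_score(seq):
    """Return 2.785*<H> - |<R>| - 1.151 for an amino acid sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # Mean hydrophobicity, with the Kyte-Doolittle scale rescaled to [0, 1].
    mean_h = sum((KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in seq) / len(seq)
    # Mean net charge per residue (assumption: His is treated as neutral).
    mean_r = sum(CHARGE.get(aa, 0.0) for aa in seq) / len(seq)
    return 2.785 * mean_h - abs(mean_r) - 1.151


def is_predicted_disordered(seq):
    """Negative scores are interpreted as likely unfolded."""
    return charge_hydropathy_score(seq) < 0.0
```

In practice such a score would be computed over a sliding window rather than a whole sequence; non-standard residue codes raise a KeyError in this sketch.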
Some ab-initio methods derive secondary and/or tertiary structure information from input protein sequences to check for the presence of loops or coils, which are considered to be non-regular secondary structures. For example, GlobPlot[34] calculates Russell/Linding propensities for input amino acid residues to be in regular secondary structures (α-helices or β-strands) or non-regular secondary structures, as defined by the Dictionary of Secondary Structure of Proteins (DSSP)[35]. On the other hand, DISOPRED2[36] and the DisEMBL REMARK465 predictors were trained on Protein Data Bank (PDB)[37] structural data[18] to identify amino acid residues present in the sequence but missing in X-ray structures. DisEMBL also predicts protein disorder by detecting "hot loops", utilizing both secondary and tertiary structure information derived from input sequences[18]. The algorithm detects highly dynamic DSSP-defined loops/coils with high β-factors (C-α temperature factors), according to the training set of PDB[37] structure data[18].

1.3.2.1.2 Template-based Approaches

Template-based approaches compare input data with similar sequence or structure data to determine intrinsic protein disorder. For example, PrDOS[38] performs PSI-BLAST searches of query protein sequences against structural datasets of homologous proteins to predict intrinsically disordered residues, in addition to its support vector machine (SVM) algorithm trained on position-specific scoring matrices (PSSM). DISOclust[39] performs template-based prediction by first determining the per-residue error of the input protein sequence in multiple protein fold recognition models, built from homologous templates, followed by analysis of the conservation of per-residue error across these models.

1.3.2.1.3 Meta Approaches

Meta approaches are tools, termed meta-predictors, which combine the prediction results of multiple prediction methods.
The availability of primary intrinsic protein disorder prediction tools has sparked increased research interest in meta-predictors, which have demonstrated higher prediction accuracies than primary predictors. An example of a meta-prediction system is the Meta-Disorder (MD) predictor, which integrates prediction results from orthogonal sources of information and explicit predictions of secondary structure, solvent accessibility and other sequence properties, as inputs to neural networks for model training[40]. Subsequently, MD selects the optimum algorithm for disorder prediction[40]. GeneSilico MetaDisorder MD2 is another example of a high-performance meta-predictor[41]. The genetic algorithm-based system first combines and weighs the results of 15 primary predictors, based on accuracy. Subsequently, it collects the best alignments from 8 fold-recognition methods and infers protein disorder from alignment gaps. Other meta-predictors reported in the literature include metaPrDOS[42] and PONDR-FIT[43]. In support of meta-prediction efforts, a metaserver, MeDor[44], has also been developed to facilitate easy retrieval and visualization of results from primary disorder prediction systems.

1.3.2.2 Benchmark Datasets for Intrinsic Protein Disorder Prediction

To provide further impetus for intrinsic protein disorder prediction, the worldwide Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiments have, since 2002, included a category for protein disorder prediction using blind benchmark datasets[45]. Intrinsic protein disorder prediction has also been facilitated by the availability of the Database of Protein Disorder (DisProt) since 2005[46]. DisProt is a specialized database containing sequences across multiple species annotated with experimentally verified intrinsically disordered regions[46].
1.3.3 Functional Conservation of Intrinsic Protein Disorder

The functional importance of intrinsically disordered proteins and protein regions raises the likelihood that intrinsically disordered protein residues are evolutionarily conserved. This proposal is in line with studies demonstrating that protein dynamics properties, such as protein backbone flexibility, protein side-chain dynamics and protein vibrational dynamics, are conserved[47-50]. Conservation of protein disorder has been studied by Chen et al., who demonstrated that intrinsically disordered regions are conserved in protein domains and families[51]. Reports have also shown that evolutionary conservation and maintenance of protein disorder is costly and therefore non-trivial and non-random, further supporting its indispensable functional significance[26,52-54].

1.4 Hypothesis

In the context of cell signaling, the evidence outlined in previous sections implies that cell signaling proteins generally possess varying degrees of protein dynamics[10,11,22]. These dynamics modulate changes in binding affinity and specificity, which are in turn responsible for generating downstream functional diversity in signaling pathways. In addition, dynamic properties of proteins have been found to be encoded in their primary sequences and conserved in protein domains and families[10,29]. Nevertheless, to date, in-depth analysis of the correlation between conservation of dynamic properties and sequence and functional conservation is lacking in the literature. In view of the importance of intrinsically disordered protein regions in cell signaling, it is hypothesized that a case study on an exemplar family of homologous cell signaling protein sequences will bring useful insights into the relationship between conservation of dynamic properties and sequence conservation.
For this project, I have selected the Nuclear Factor Kappa-light-chain-enhancer of Activated B cells (NFκB/Rel), a transcription factor protein family important for a variety of processes including cell survival, inflammation and immunity[55-57]. This project is part of a larger study exploring the function and role of NFκB in cell signaling and immunity.

2 Literature Review

2.1 Transcription Factors

Transcription factors are a group of cell signaling proteins primarily involved in transcriptional regulation, one of the key events of cell signaling responsible for gene regulation and downstream protein expression[57]. These proteins play a pivotal role as 'central signaling hubs' that carry and control the flow of information in biological pathways from receptors to DNA[13]. Transcription factors regulate a variety of diverse cellular and organismal processes[57]. Their high binding specificities, coupled with tight regulation, enable transcription factors to process a huge diversity of signal information with remarkable precision[57]. To date, the intricate mechanisms of the transcriptional regulation machinery have not been fully elucidated.

2.2 The NFκB Transcription Factor Family

The NFκB (Nuclear Factor Kappa-light-chain-enhancer of Activated B cells) or Rel protein family consists of a group of ubiquitously expressed, highly inducible and structurally related eukaryotic transcription factors[58]. They are involved in a large variety of cellular and organismal processes, including the cellular stress response, cell proliferation and survival, apoptosis, inflammation and innate and adaptive immunity[55-57,59-61]. All NFκB transcription factors are related by a highly conserved NH2-terminal Rel homology (RH) domain, responsible for DNA binding and dimerization[58]. Based on their C-terminal sequences, these proteins can be divided into two functionally distinct classes that are capable of heterodimerizing freely[58].
There are five mammalian NFκB proteins: NFκB1 (p50/p105), NFκB2 (p52/p100), RelA (p65), RelB and c-Rel[59]. The Class I proteins, including NFκB1 (p50/p105), NFκB2 (p52/p100) and Drosophila Relish, contain a number of ankyrin repeats with trans-repression activity at their C-terminus[59]. Class I proteins possess strong DNA binding activity but weak transcriptional activation potential and are generally not activators of transcription, except when they form heterodimers with Class II proteins[59]. The Class II (Rel) proteins, including RelA (p65), RelB, c-Rel, v-Rel and the Drosophila Dorsal and Dif proteins, in contrast, exhibit weak DNA binding activity and are observed to contain a potent trans-activation domain at their C-terminus[59].

2.2.1 Mechanisms of Action of NFκB

NFκB proteins associate into homo- and hetero-dimers that bind to target κB DNA sites of 9-10 base pairs[59]. The p50-RelA heterodimer represents the prototypical NFκB complex and is the major NFκB complex found in most cells. The subunit composition of the NFκB complex affects its DNA binding site specificity, subcellular localization, trans-activation potential and mode of regulation, therefore leading to combinatorial diversity of the downstream responses[58,62,63]. NFκB complexes are regulated via several pathways that control their translocation from the cytoplasm to the nucleus, in response to extracellular stimuli[61,64]. To date, at least three major signaling pathways have been identified: the IκB kinase (IKK)-dependent canonical pathway, the IKK-dependent non-canonical pathway, and the IKK-independent p38-CK2 pathway[61,64]. The IKK-dependent canonical pathway involves the regulation of NFκB dimers containing RelA or c-Rel, through association with a family of inhibitors known as IκBs (inhibitors of κB), which includes p100, p105, IκBα, IκBβ, IκBγ, IκBε, IκBζ, Bcl-3 and the Drosophila Cactus protein[65].
IκBs typically inhibit the interaction of NFκB with DNA by blocking the DNA binding sites of NFκB transcription factors[65]. IκB-NFκB interactions are, in turn, mediated by the IκB kinase (IKK), a complex composed of the catalytic IKKα and IKKβ subunits, and a regulatory subunit known as IKKγ or NEMO[61,64]. The IKK complex, upon activation, phosphorylates two specific serine residues located in the NH2-terminal regulatory domain of IκB, leading to IκB ubiquitination and proteasome-mediated degradation[61,64]. NFκB dimers containing RelB and NFκB2 (p52/p100) are activated through the IKK-dependent non-canonical pathway, where homodimeric IKKα lacking the IKKγ (NEMO) subunit phosphorylates the C-terminal region of p100[61,64]. This leads to the ubiquitination and degradation of the p100 IκB-like C-terminal sequences, which in turn releases and activates p52-RelB[61,64]. The IKK-independent p38-CK2 pathway is activated by UV and the hepatitis B virus trans-acting factor PX. Upon UV stimulation, IκBα proteins have been found to be phosphorylated by CK2, leading to ubiquitination and degradation[61,64]. Recent evidence has also suggested that regulation of the NFκB pathway may involve other processes such as ubiquitination, acetylation, prolyl isomerization (in the case of RelA and p50), as well as phosphorylation (in the case of c-Rel and RelA)[58,61,66].

Activation of the NFκB complex results in its translocation from the cytoplasm to the nucleus. This is mediated by specific nuclear import signals present in the Rel homology domain, which also binds to κB sites in the regulatory regions of inducible promoters to activate target gene expression[58,61,66]. Similar to other rapid-acting primary transcription factors, such as STATs (signal transducers and activators of transcription), nuclear hormone receptors and c-Jun, NFκB transcription factors can induce rapid changes in gene expression without the need for new protein synthesis[58,61,66].
Promoter-bound NFκB activates target gene expression via the assembly of enhanceosomes, large nucleoprotein complexes resulting from the cooperative binding of regulatory elements, such as chromatin-remodeling proteins, nuclear coactivators, kinases and histone acetylases[58,61,66].

2.2.2 NFκB in Human Diseases

NFκB transcription factors are involved in the upregulation of a variety of genes, some of which are responsible for cell proliferation and cell survival[58,60]. Aberrant inactivation of NFκB leads to increased susceptibility to apoptosis[60]. On the other hand, aberrant activation of NFκB has frequently been observed in cancers, where it stimulates the expression of gene clusters, including oncogenes, that promote cell survival, inflammation, angiogenesis, tumor development, progression and metastasis[67,68]. Activation of NFκB in cancer cells has been attributed to chronic stimulation of the IKK pathway, as well as mutations in NFκB genes or their regulatory genes such as IκB[67,68]. Potential cross-talk between IKK/NFκB and other major signaling pathways implicated in cancer, including the mitogen-activated protein kinase (MAPK), JAK/STAT (Janus kinase/signal transducer and activator of transcription), p53 and phosphatidylinositol 3-kinase (PI3K) pathways, has also been observed[67,68]. The involvement of NFκB-related pathways in cancers has led to investigation of their use as potential biomarkers, as well as therapeutic targets[69,70].

In addition, NFκB proteins play an important role in both the innate and adaptive immune response, by serving as regulators of a variety of processes. These include T-cell development, maturation and proliferation upon activation of T-cell receptors, B-cell development, survival, division and immunoglobulin expression, control of the immune response and malignant transformation[56,60,71-75].
NFκB transcription factors perform various immune-related regulatory activities and function via the differential activation of NFκB complexes in response to a diverse spectrum of signals[56,60,71-75]. These signals are propagated from receptors including the antigen receptors, pattern-recognition receptors and receptors for members of the TNF and IL-1 cytokine families[56,60,71-75]. Consequently, misregulation of the NFκB signaling machinery in the immune system has been associated with immunodeficiency and inflammatory diseases[56,57,74]. Constitutive activation of NFκB has been frequently observed in asthma, arthritis, renal inflammatory disease, sepsis and many other diseases[56,57,74,76].

2.3 Computational analysis of NFκB proteins

Findings discussed in the previous sections were primarily gathered from experiments using conventional laboratory techniques. To complement laboratory approaches, computational approaches have also been utilized for experiments on NFκB proteins. In silico methods, driven by technological advances leading to sophisticated algorithms and the availability of experimental datasets, have sped up the acquisition of meaningful information on NFκB proteins.

2.3.1 Systems analysis of NFκB signaling machinery

Systems biology, as an emerging field emphasizing "integrative" rather than "reductionist" approaches, involves the inter-disciplinary study of interactions, functions and behaviours of multi-component biological systems[77,78]. In this field, complex data is integrated from various experimental platforms[77,78]. The field of systems biology arises from the availability of large datasets from high-throughput microarray and genomic platforms, as well as advances in computational techniques, which facilitate large-scale analysis of biological mechanisms, pathways and networks[77,78].
To this end, computational biology has been identified as one of the fundamental cornerstones of systems biology for the processing, interpretation and manipulation of complex, large-scale multi-experimental datasets[77,78].

In the specific context of NFκB proteins, integrative systems biology approaches have been used to identify and study their roles, as well as their downstream target genes, in cellular pathways and networks[72,79-81]. These approaches yield useful insights on the functions of NFκB proteins by utilizing tools, including computational predictions, gene expression profiling, functional annotation from biological databases and transcription factor binding site analysis, combined with experimental validation via RNAi knockdown or other experiments[72,79-81].

Systems biology approaches complement conventional laboratory approaches for the investigation of interactions between critical modules or components in cellular pathways and networks. It has been established that genes and proteins do not function in isolation, instead engaging in complex dynamic interactions to perform their biological roles and functions[78,]. These interactions are in turn regulated by mechanisms involving transcription factors, signaling pathways and networks. Whilst conventional laboratory research has been instrumental for the identification of genes and proteins critical for cellular processes such as NFκB transcriptional regulation, systems biology approaches attempt to integrate data from various experimental sources to obtain an all-encompassing view of how biological systems function as a whole[72,79-81]. As the field of systems biology continues to grow and mature, more applications of large-scale, integrative approaches will contribute to and reshape the landscape of knowledge discovery in NFκB research.
2.3.2 Sequence Analysis of NFκB

Besides research at the systems level, large-scale promoter sequence studies of NFκB binding sites have also been conducted. Such experiments aim to identify and characterize conserved NFκB binding sites within sets of gene promoters[83,84]. These computational analysis efforts have in turn led to the development of transcription factor databases and sophisticated algorithms for the prediction of transcription factor binding sites (including κB sites)[85-88]. These have proved useful in predicting the involvement of NFκB and its downstream target genes in various biological pathways.

On-going bioinformatics sequence analyses, employing comparative genomics and laboratory functional studies, have led to the identification of NFκB/Rel homologues in various organisms since the discovery of NFκB by Sen and Baltimore in 1986. To date, functionally conserved homologues of mammalian NFκB have been identified in a variety of simpler organisms, including Drosophila melanogaster (fruit fly)[71,89], Aedes aegypti (yellow fever mosquito)[90], Anopheles gambiae (malaria vector)[90], Pinctada fucata (pearl oyster)[91], Litopenaeus vannamei (Pacific white shrimp)[92,93], Cnidarians (sea anemones and corals)[94] and Porifera (sponges)[59].

2.3.2.1 Structural Analysis of NFκB

Complementary to sequence analysis, structural analyses of NFκB proteins have also been conducted via computational means. Following 3D structural determination of NFκB complexes bound to DNA, experimental efforts have been channelled towards elucidating the detailed binding mechanisms of NFκB complexes in relation to their corresponding 3D structures[95-97]. Additionally, computational approaches employing molecular modeling and simulations for the study of NFκB inhibitors[98], κB DNA sites[99] and the evolution of DNA-binding and protein dimerization domains[100] have been reported in the literature.
2.4 Protein Dynamics Analysis of NFκB

To date, only one protein dynamics study mentioning NFκB proteins is present in the literature. The authors simulated the interaction between c-Rel and a 20-bp DNA sequence and observed a unique and dynamic NFκB recognition site. The study was focused on the dynamics of the DNA, rather than the dynamics of the c-Rel protein during binding[99]. However, the effects of protein dynamics in cell signaling and allosteric control have been studied and reviewed in general[10,11,15,48-50].

2.4.1 Intrinsic Protein Disorder Analysis of NFκB

No intrinsic protein disorder analysis focusing solely on NFκB has been recorded in the literature. Nevertheless, general research efforts using intrinsic protein disorder to identify protein binding sites[101,102] and analyse the functions of chromatin remodeling proteins have been recorded[22]. In the context of cell signaling, the functional roles of intrinsic protein disorder in cytoplasmic signaling domains[22] and in scaffold proteins, which integrate cell signaling pathways[15], have been reported. The most relevant study of intrinsic protein disorder in transcription factors was conducted by Wells et al., who analyzed p53's intrinsically disordered N-terminal trans-activation domain (TAD) using NMR spectroscopy and X-ray studies[14].

2.5 Limitations of reported studies

Based on the literature review, there appears to be limited research on the effects of dynamic regions, or more specifically, intrinsically disordered protein regions, on the function of NFκB transcription factors. Furthermore, general research efforts on NFκB are mostly focused on specific classes, types or states of NFκB proteins. Thus, they seem to provide only isolated, contextual views of the NFκB signaling machinery. A general macroscopic overview of the functional role of protein dynamics in NFκB proteins, across all known subclasses and organisms, is lacking.
2.6 Research Aims and Objectives

In Section 1.4, I proposed the hypothesis that dynamic properties of proteins, particularly cell signaling proteins, may contribute to their function and thus may be evolutionarily conserved. For this thesis, using NFκB transcription factors as an exemplar, my research aim was to computationally analyse the conservation of protein dynamics in this protein family and the functional effects that result. In Section 1.1, it was highlighted that protein dynamics typically occur at two levels: movements of intrinsically disordered protein regions, as well as local internal and global external motion occurring at larger amplitudes[7,9]. The primary focus of my research was on protein dynamics occurring in intrinsically disordered protein regions.

To systematically achieve my research aim, there was first a need to develop an in silico tool for large-scale identification of intrinsically disordered residues. Next, NFκB sequence and structure data had to be collected and stored in an online database. Subsequently, residues predicted to be disordered in NFκB protein sequences would be subjected to analyses of their conservation, localization on 3D protein structures and potential biological functions. Specific objectives have been laid out for each phase of the research project, as follows:

- To develop an efficient system for large-scale identification of intrinsically disordered regions in proteins.
- To collect high quality NFκB sequence and structure data.
- To develop a specialized database of NFκB protein sequences and structures for the benefit of the research community.
- To implement the developed prediction system and relevant analysis tools to analyse the conservation and functional roles of intrinsically disordered protein residues in NFκB signaling machinery.

For my research project, an in silico approach was adopted since large-scale data mining and analysis was an integral part of the project.
In silico approaches speed up these procedures to promote knowledge discovery and provide useful leads for experimental validation. The methodology and findings, discussed in the next chapters, will lay the foundation for further research in the field of protein dynamics, as well as transcriptional regulation and cell signaling.

3 DisBatch: A Faster Meta-Prediction System for Large-Scale Identification of Intrinsically Disordered Protein Regions

3.1 Background

The identification of intrinsically disordered protein regions facilitates high-throughput structural determination, since these relatively unstructured and flexible regions are reported to hamper protein purification and crystallization[34]. Additionally, intrinsically disordered regions are known to be important for protein function, through roles such as the presentation of protein modification sites and the modulation of flexibility and specificity in protein-protein interactions[26]. Evidence has shown the evolutionary conservation and maintenance of protein disorder to be non-trivial and non-random, suggesting functional significance[26,52-54].

Recently, computational methods, based on various sequence and structural features of intrinsically disordered regions, have played an increasing role in the identification of intrinsic protein disorder. In particular, meta-predictors that combine the results of multiple primary prediction methods have been extensively applied due to their higher prediction accuracies[38]. Nevertheless, most meta-predictors reported are limited in terms of availability and scalability. Many are slow, unavailable locally and impose practical restrictions on the number of submissions by users, posing difficulties for large-scale batch sequence predictions.
For example, GeneSilico MetaDisorder MD2[41], the best disorder prediction method in CASP8 & CASP9[45], utilizes 15 primary disorder predictors and takes an average of 3 days for the prediction of 1-5 protein sequences, with a limitation of 10 jobs per day. Furthermore, the software is not available for local use. These constraints greatly limit the ability of the scientific community to perform large-scale protein disorder analysis.

In view of these limitations, I have developed a lightweight disorder meta-predictor designed for rapid, fully automated, large-scale disorder analysis from protein sequences. The prediction system, named DisBatch (available at http://bioslax01.bic.nus.edu.sg/meta/), demonstrates comparable performance with GeneSilico MetaDisorder MD2, but with more than 10x speedup. The DisBatch meta-predictor is now available both as a web service and as a local software package.

3.2 Materials and Methods

3.2.1 Server Infrastructure

DisBatch was written using a combination of Bash, Perl and R scripts. The meta-prediction software was developed and hosted in the BioSlax 7.5 live operating system (http://www.bioslax.com), developed by the Bioinformatics Centre in the National University of Singapore (NUS), based on the Slax (http://www.slax.org) Slackware Linux base distribution. BioSlax contains a suite of bioinformatics tools (known as modules), which can be booted from any PC using the computer's memory. The operating system also allows for easy addition of new modules containing additional software, services and settings, which can similarly be loaded and activated upon boot-up. The BioSlax server running DisBatch consists of a frontend web portal and a Cloud-based backend. The Cloud backend server runs the BioSLAX virtual machine using a Citrix Xen® hypervisor.

3.2.2 Primary Disorder Predictor Selection

Primary disorder predictors were first selected based on their availability and scalability.
Chosen predictors were required to allow for either i) software download for local use, or ii) if used remotely as a web service, an unrestricted number of submissions by each user per day. Selected predictors include i) DisEMBL REMARK465[18], ii) FoldIndex[32] and iii) PrDOS[38]. Information on these disorder predictors was discussed previously in Section 1.3.2.1.

3.2.3 Meta-predictor Development

The performance of each primary predictor was evaluated against Release 5.7 of the DisProt dataset[46], which contains sequences annotated with experimentally verified intrinsically disordered regions, to determine the optimum threshold with the highest accuracy. The DisProt testing set was checked for the presence of NFκB records and none were observed. Five candidate meta-predictors were built from the possible combinations of primary predictors, each at its optimum threshold where the accuracy is highest. Both the DisEMBL REMARK465[18] and PrDOS[38] predictors convert their results to probability scores; their outputs were therefore combined by averaging or weighted averaging. Weights for the meta-predictor integrating DisEMBL REMARK465[18] and PrDOS[38] were assigned according to the Matthews correlation coefficient (MCC) values[103]. Accuracy values were not used for weighting since both tools yield almost equal accuracy at their optimum thresholds.

FoldIndex[32] rearranges Uversky et al.'s fold boundary equation to calculate the prediction score. In that study, the default window size of 51 was used for disorder prediction[32]. According to the modified equation, positive FoldIndex[32] scores indicate probable folded proteins or regions and negative FoldIndex scores indicate likely disordered proteins or regions. Since FoldIndex[32] does not yield probability scores, the original scores were converted to binary values at each position.
Positive FoldIndex[32] scores representing predicted folded residues were assigned a value of 0, while negative scores representing predicted disordered residues were assigned a value of 1. Due to the difference in scoring system, the probability scores returned from DisEMBL REMARK465[18] and/or PrDOS[38] were combined with the FoldIndex[32] output by simple addition for all relevant meta-predictors. The optimum threshold of each meta-predictor yielding the highest accuracy was determined. The best performing meta-predictor is the combination of FoldIndex[32] and PrDOS[38], at the threshold of 1.5, which requires a positive prediction by both tools (FoldIndex[32] binary score of 1 and PrDOS[38] probability score of ≥ 0.5 for predicted intrinsically disordered residues).

3.2.4 Performance Evaluation

Due to low prediction speed and submission restrictions on the MD2 server, only 286 out of 638 sequences from the DisProt[46] dataset were predicted successfully over a period of 2 months. For fair comparison, the performance of each predictor was compared against GeneSilico MetaDisorder MD2[41], the best disorder prediction method in CASP9[105], using this subset.

3.2.5 Performance Measures

Performance measures used were sensitivity (SE), specificity (SP), accuracy (ACC), positive predictive value (PPV) and negative predictive value (NPV). These were calculated based on the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). TP and TN denote the number of known disordered amino acid residues and ordered residues predicted correctly, respectively. FP represents ordered residues predicted to be disordered, while FN represents known disordered residues predicted to be ordered. SE = TP/(TP+FN) and SP = TN/(TN+FP) represent the proportion of correctly predicted disordered amino acid residues and ordered residues in each protein sequence, respectively.
ACC = (TP+TN)/N, where N represents the total number of residues in each protein sequence, is a measure of the proportion of all correctly predicted residues (disordered and ordered) in each protein sequence. PPV = TP/(TP+FP) indicates the proportion of positively predicted residues (TP + FP) that are correctly predicted as disordered (TP), while NPV = TN/(TN+FN) indicates the proportion of negatively predicted residues (TN + FN) that are correctly predicted as ordered (TN). MCC measures the correlation between predicted and observed residue states and is calculated as:

MCC = (TP×TN − FP×FN) / √[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]

The MCC value ranges between -1 and 1: MCC = 1 for 100% agreement of the prediction, MCC = 0 for completely random prediction and MCC = -1 for 100% disagreement of the prediction. SE, SP, ACC, PPV, NPV and MCC were calculated for each sequence in the testing set, summed and averaged over the total number of sequences.

3.2.6 Web Interface

A Web interface was set up to facilitate online access to DisBatch (FoldIndex[32] + PrDOS[38]) at http://bioslax01.bic.nus.edu.sg/meta. DisBatch accepts sequences in FASTA format as input. Unix, Perl and R commands used in DisBatch are called remotely from CGI scripts written in Bash, which in turn submit and retrieve predictions from the FoldIndex and PrDOS servers. Due to limitations in computational resources, a maximum of 50 sequences is allowed per submission. For large-scale disorder predictions, users can download the DisBatch software for free.

3.3 Results

3.3.1 Predictive Performance

I have successfully developed DisBatch, a lightweight meta-predictor optimized using two primary predictors, FoldIndex[32] and PrDOS[38], to automate large-scale batch disorder predictions. DisBatch combines the prediction output of FoldIndex[32] and PrDOS[38] by simple addition.
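The combination rule can be illustrated with a minimal Python sketch. It assumes that per-residue FoldIndex scores and PrDOS probabilities are already available as equal-length lists. The function names, and the choice to treat a FoldIndex score of exactly zero as folded, are illustrative assumptions and are not taken from the DisBatch scripts, which are written in Bash, Perl and R.

```python
from typing import List, Sequence, Tuple


def foldindex_to_binary(score: float) -> int:
    """Convert a raw FoldIndex score to a binary disorder value.

    Negative scores (predicted disordered) map to 1, positive scores
    (predicted folded) map to 0. Treating a score of exactly 0 as folded
    is an assumption of this sketch.
    """
    return 1 if score < 0 else 0


def combine_scores(
    foldindex_scores: Sequence[float],
    prdos_probs: Sequence[float],
    threshold: float = 1.5,
) -> Tuple[List[float], List[bool]]:
    """Add the FoldIndex binary value to the PrDOS probability per residue.

    Returns the summed meta-score and, for each residue, whether the
    meta-score reaches the threshold (1.5 means both tools agree).
    """
    if len(foldindex_scores) != len(prdos_probs):
        raise ValueError("both predictors must cover the same residues")
    meta = [foldindex_to_binary(f) + p
            for f, p in zip(foldindex_scores, prdos_probs)]
    calls = [m >= threshold for m in meta]
    return meta, calls


if __name__ == "__main__":
    # Hypothetical scores for a four-residue stretch.
    meta, calls = combine_scores([0.2, -0.1, -0.3, 0.05], [0.9, 0.7, 0.4, 0.1])
    for m, c in zip(meta, calls):
        print(f"{m:.2f}\t{'disordered' if c else 'ordered'}")
```

Because the FoldIndex contribution is either 0 or 1 and the PrDOS probability never exceeds 1, a summed score of 1.5 or more can only be reached when FoldIndex predicts disorder and the PrDOS probability is at least 0.5.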
DisBatch gives the best accuracy value of 67.79% when the threshold is set to 1.5, where there is an agreement of positive prediction from FoldIndex[32] (binary score: 1) and positive prediction from PrDOS[38] (probability score: ≥ 0.5). DisBatch (67.79% accuracy) slightly outperforms all other primary and meta-predictors built and tested in this study and is comparable to GeneSilico MetaDisorder MD2[41], which has an accuracy of 69.21% (Table 2 and Figure 2). Standard error estimates in Figure 2 indicate that the performance improvement of DisBatch may not be significant. Nevertheless, DisBatch performs predictions faster (with more than 10x speedup) when compared to MD2[41]. The average prediction rate of DisBatch is 10 minutes per sequence (dependent on PrDOS'[38] server load and prediction speed), while the average prediction rate of MD2[41] is 3 days per 1-5 sequences.

Table 2. Performance comparison of primary and meta-predictors for disorder prediction at their respective optimum thresholds. The predictive performance of MetaDisorder MD2 and P+F (DisBatch) is highlighted in bold.

Disorder Predictor         Threshold    Accuracy
Primary Predictor
  DisEMBL                  0.5          66.04%
  PrDOS                    0.6          67.08%
  FoldIndex                NA           64.75%
Meta-Predictor
  GeneSilico MD2           NA           69.21%
  P+D                      0.5          67.63%
  P+D (MCC-weighted)       0.5          67.69%
  D+F                      1.4          66.83%
  P+F (DisBatch)           1.5          67.79%
  P+D+F                    1.9          67.75%

Figure 2. A) Bar plot of mean accuracy values of primary and meta disorder predictors at their respective optimum thresholds, with standard error estimates. B) Boxplot of accuracy values of primary and meta disorder predictors at their respective optimum thresholds. Each boxplot depicts the minimum accuracy value, lower quartile, median, upper quartile, maximum accuracy value and any outlier observation(s) for each predictor. The boxplot for MetaDisorder MD2 and P+F (DisBatch) is highlighted in grey.
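The per-sequence measures of Section 3.2.5, which underlie the accuracy values reported here, can be sketched in Python as follows. This is a minimal illustration and not the evaluation script used in this study: the function names are hypothetical, residue states are assumed to be encoded as 1 (disordered) and 0 (ordered), and returning NaN when a denominator is zero is an assumption of the sketch.

```python
import math
from typing import Dict, Sequence


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero
    (an assumption of this sketch for undefined measures)."""
    return numerator / denominator if denominator else float("nan")


def performance_measures(observed: Sequence[int],
                         predicted: Sequence[int]) -> Dict[str, float]:
    """Compute SE, SP, ACC, PPV, NPV and MCC for one protein sequence.

    observed / predicted hold one value per residue:
    1 = disordered, 0 = ordered.
    """
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)
    n = len(pairs)

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SE": _ratio(tp, tp + fn),
        "SP": _ratio(tn, tn + fp),
        "ACC": (tp + tn) / n,
        "PPV": _ratio(tp, tp + fp),
        "NPV": _ratio(tn, tn + fn),
        "MCC": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


if __name__ == "__main__":
    # Hypothetical 8-residue sequence: 3 disordered, 5 ordered residues.
    observed = [1, 1, 1, 0, 0, 0, 0, 0]
    predicted = [1, 1, 0, 1, 0, 0, 0, 0]
    for name, value in performance_measures(observed, predicted).items():
        print(f"{name}: {value:.4f}")
```

To reproduce the averaging described in Section 3.2.5, such a function would be applied to every sequence in the testing set and each measure averaged over the number of sequences. How undefined per-sequence values are handled in that average is not specified in the text.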
3.3.2 Features

The DisBatch web interface is intuitive and consists of a simple search page where users can input a maximum number of 50 sequences in FASTA format (Figure 3). Since the prediction speed of DisBatch is dependent on the server load of PrDOS[38], users are able to set a timeout value, in terms of the number of minutes per sequence, for time-efficient prediction. If PrDOS[38] results are not fetched before the timeout expires, DisBatch will only display prediction results from FoldIndex[32] in the output page. To further support large-scale analysis, DisBatch recommends that users contact PrDOS[38] directly if the server response is slow or the input dataset is large. Users can check their prediction status by clicking on the “Check Prediction Status” button after submitting their sequences.

Upon successful prediction, DisBatch generates a number of output files (Figure 4). Firstly, it provides raw prediction outputs from FoldIndex[32] and PrDOS[38] in their original format. DisBatch also returns its meta-prediction output score (the sum of the FoldIndex[32] binary score and the PrDOS[38] probability score) at each residue position in the sequence, in comma separated values (CSV) format. Lastly, the summed meta-prediction score of DisBatch is converted to the number of votes at each position and formatted as CSV. The minimum number of votes is 0, when FoldIndex[32] returns a positive value (i.e. the residue is predicted to be folded) and PrDOS[38] returns a probability score of less than 0.5, while the maximum number of votes is 2, when both primary predictors agree on a positive prediction of a residue being potentially intrinsically disordered. A help page is provided, with prominent hyperlinks at the home page and output page, for further explanation of the DisBatch output files.

Figure 3. Sequence submission page of DisBatch. DisBatch is available at http://bioslax01.bic.nus.edu.sg/meta/.

Figure 4. Output page of DisBatch.
The page provides download links for each output file, and a link to the help page at the bottom of the page.

The DisBatch web server only accepts a maximum number of 50 sequences per submission, to avoid server overload. For larger-scale predictions, DisBatch is available as a local Unix package, downloadable from the web interface. Besides the installation files, detailed documentation with full installation and usage instructions can be found in the download page (http://bioslax01.bic.nus.edu.sg/meta/download.php).

3.4 Discussion

Availability and scalability pose severe limitations for most disorder meta-predictors reported in literature, hindering their use in large-scale predictions from protein sequences. In this study, it has been demonstrated that a relatively lightweight predictor can be utilized for fast, automated, large-scale disorder predictions, with comparable performance to highly accurate meta-disorder predictors. This is important because quick, accurate predictors promote large-scale protein disorder analysis of proteins and their functions. To date, such large-scale studies have not been extensively reported in literature, in part due to restrictions set by current high-performance meta-predictors.

3.4.1 Predictive Performance

The results indicate that DisBatch produces slightly higher accuracy at its optimum threshold, when compared to the primary predictors and the other predictor combinations examined in this study. The advantage of this slightly better predictive performance matters most with large input datasets, which DisBatch is designed specifically to cater for. Nevertheless, GeneSilico MetaDisorder MD2[41] is still recommended for small-scale disorder predictions, since it yields higher accuracy. The results, however, suggest that the accuracy values of the various predictors used in this study are spread over a very wide range, and it is unclear whether low-scoring hits correlate across the predictors.
3.4.2 Scoring Algorithm

Discretisation of FoldIndex scores and the addition of these scores to probability scores returned by other primary predictors in DisBatch may have introduced problems and artefacts that may affect the accuracy of the meta-predictor. As such, the performance of DisBatch may be improved by seeking alternatives to discretisation, such as converting the empirical distribution from FoldIndex of a large number of sequences into a probability estimate.

3.4.3 Benchmark Model

The performance of DisBatch was benchmarked against the performance of GeneSilico MetaDisorder MD2[41], claimed to be the best disorder prediction method in CASP9. Also, GeneSilico MetaDisorder MD2 is available without restrictions for large-scale disorder prediction, albeit at a slow speed. Other meta-predictors considered as benchmark models are not scalable for large-scale prediction and thus their predictive performance is not evaluated. For example, PONDR-FIT[43] only allows for manual submission through the web page, while metaPrDOS[42] restricts submission to less than 10 sequences per hour and recommends PrDOS for large-scale predictions instead.

3.4.4 Testing Dataset

With regard to the accidental DisProt subset, comprising 286 sequences, used for testing the predictors, it is difficult to guarantee that the sequences in the dataset were not used by any of the predictors as part of their training dataset. DisEMBL[18] and PrDOS[38] use PDB[37] structural files as the positive and negative training set, while FoldIndex[32] mines literature information for the positive training set consisting of intrinsically unfolded proteins and data from PDB[37] as the negative training set. Since the DisProt[46] dataset was compiled from published experimental data, it may overlap with the training set used by the primary predictors selected in this study, especially FoldIndex[32].
In addition, the DisProt testing set may not represent a genomic sample of disordered protein sequences. Other possible sources of dataset bias have also been identified in this study. Firstly, the representativeness of the accidental DisProt[46] subset has not been evaluated to ensure that it has enough quality for performance evaluation purposes. Therefore, the subset may be biased in terms of protein length and/or protein family and species representation. Similarly, the ratio of ordered and disordered residues in the dataset was not determined and therefore the proportion may also be biased, hindering objective performance evaluation. These problems can possibly be overcome by adopting an iterative approach to curate more intrinsically disordered regions in proteins. In addition, performance evaluation of disorder predictors can also be conducted on a larger set of data representative of a complete complex genome, such as the metazoan genome.

3.4.5 Software Limitation

One major limitation of meta-predictors like DisBatch is their reliance on remote primary prediction server(s). Since both local and online versions of DisBatch use PrDOS[38], the speedup of DisBatch is largely dependent on PrDOS'[38] prediction speed and this makes DisBatch vulnerable to the idiosyncrasies of PrDOS[38]. Similarly, FoldIndex[32] results are retrieved through connection with the FoldIndex server, albeit at a significantly faster speed (≈ 3 seconds per sequence on average) compared to PrDOS'[38] average of 10 minutes per sequence. DisBatch returns results from FoldIndex[32] as an alternative if the PrDOS[38] server is facing technical difficulties. Nevertheless, other meta-predictors were demonstrated to have comparable results with DisBatch in this study. In particular, the DisEMBL[18] and FoldIndex[32] (D+F) meta-predictor yields 66.83% accuracy as compared to DisBatch's 67.79%. This meta-predictor can be considered as another alternative.
One advantage of this alternative is that DisEMBL[18], unlike PrDOS[38], can be executed offline without any server connection.

3.5 Future Work

To address pitfalls pertaining to predictive performance, dataset and software, highlighted in Section 3.4, rigorous investigation into the benchmark testing dataset is necessary to ensure objective performance evaluation. More in-depth examination of the testing dataset, as well as the training datasets of the primary predictors, should be carried out in the future to lend support to the performance advantage of DisBatch. Future work can include analyses on protein length, protein families and species covered in the testing set, as well as the correlation of high and low-scoring hits, and the proportion of ordered and disordered residues. Furthermore, blind CASP[105] datasets not present in the training set of all primary predictors can also be used as the testing set, to eliminate the problem of dataset bias.

More importantly, the prediction and scoring algorithm in DisBatch can be improved to yield higher accuracy. As discussed, the scoring algorithm can be revised to include only probability scores. On the other hand, innovative prediction algorithms can be explored and incorporated into DisBatch.

Future improvements to the DisBatch web service are also necessary to improve usability. The speedup of DisBatch compared to other meta-prediction tools available can be quantified in terms of initial, transitional and terminal stages. The prediction service can be configured to return results by e-mail for greater user convenience. Also, interactive visualization and analysis tools such as plots and annotated sequence views can be provided by the web server to further facilitate meaningful large-scale analysis.

3.6 Chapter Conclusion

This study addresses the problem of using meta-predictors to predict intrinsically disordered protein regions.
High-performing meta-predictors like GeneSilico's MetaDisorder MD2[41] are slow and pose access limitations. Hence, I propose an alternative meta-predictor, DisBatch, which is much faster and has comparable performance. DisBatch is available both as a web service and local software. Despite the limitations raised in this chapter, the study represents a call for the development of large-scale disorder predictors with a more balanced performance-to-time ratio. Such powerful predictors will further drive research in intrinsic protein disorder and lend crucial applications to the elucidation of the biological functions of intrinsically disordered residues, which is the focus of my project.

4 NFκB Base: A Specialized Database of NFκB Proteins

4.1 Background

NFκB transcription factors play a critical role in transcriptional activation and are associated with a wide range of important cellular processes involving cell proliferation and survival and the immune response[65,106-109]. In addition to experimental and structural data, protein sequences of NFκB have important research value for functional characterization of the protein family[83,90]. Numerous studies have been performed with NFκB sequences, as outlined in Section 2.3.2. To date, while a large number of NFκB protein sequences can be found in major sequence databases, there is no specialized, publicly accessible database specifically for NFκB protein sequences, though datasets containing NFκB target genes are available online (http://people.bu.edu/gilmore/nf-kb/target/index.html). This presents a need for a centralized repository containing an annotated dataset of NFκB protein sequences to fill the gap in current resources on NFκB. In this chapter, I present NFκB Base, a specialized database of experimentally verified, manually curated NFκB protein sequences.
The database is integrated with analysis tools including (i) dynamic data display, (ii) keyword search and (iii) Basic Local Alignment Search Tool (BLAST) sequence search. The main aim of NFκB Base is to support and facilitate large-scale sequence and functional studies of NFκB proteins. NFκB Base is available at http://bioslax01.bic.nus.edu.sg/nfkb/.

4.2 Materials and Methods

4.2.1 Server Infrastructure

Similar to DisBatch, NFκB Base was developed and hosted in the BioSlax 7.5 live operating system (http://www.bioslax.com), using the Linux-Apache-MySQL-PHP (LAMP) software stack.

4.2.2 Sequence Data Collection

Keyword and sequence similarity searches were performed against NCBI Protein (GenBank Flat File Release 177.0, Release Date: April 15, 2010)[110], UniProt[52] (Release 2010_06, published May 18, 2010)[111] and PDB (Release date: June 1, 2010)[37]. Sequence similarity searches were performed using the Basic Local Alignment Search Tool (BLAST+) software, version 2.2.23[112]. Literature information was also mined from sequence records.

4.2.2.1 Inclusion and Exclusion Criteria

All information was manually filtered and verified, according to keywords and literature, to remove irrelevant and hypothetical records. Duplicates for each record were also identified and recorded. The inclusion and exclusion criteria used during the filtering and curation process for records in NFκB Base are documented in Figure 5.

Figure 5. Detailed sequence inclusion and exclusion criteria for records in NFκB Base.

4.2.3 Database Design

Collected and manually reviewed NFκB sequence data was stored in the MySQL Relational Database Management System. Each entry is assigned a unique accession number, beginning with the NFκB protein type and followed by a unique 5-digit serial number.
A typical entry contains annotated information, where available, on (i) source accession number, (ii) NFκB protein type, (iii) protein name and description, (iv) scientific and common name of source organism, (v) gene name, (vi) chromosome name, (vii) sequence length, (viii) sequence, (ix) accession number(s) of duplicate record(s) and (x) relevant cross-links to major databases. Cross-references link to the NCBI Protein[110], UniProt[111], Gene Ontology (GO)[113], Hugo Gene Nomenclature Committee (HGNC)[114], InterPro[115], PDB[37], PubMed[116] and the NCBI Taxonomy databases[116], where additional sequence, function, gene nomenclature, protein domain and family, structural, literature and taxonomy information can be found, respectively. These links are included to further increase NFκB information coverage.

4.2.4 Web Interface

The web interface for NFκB Base was constructed with HyperText Markup Language (HTML), PHP, web BLAST, Perl CGI scripts and the jQuery library. HTML was used to present web page content, while PHP was used for database server connection for search queries and entry display. The web BLAST Perl CGI was used to perform online BLAST searches by calling the local BLAST package and BLAST databases. The jQuery library, based on Asynchronous Javascript and XML (AJAX), allows for dynamic table browsing and display. This view is designed to present a quick and concise overview of the database, dynamically displaying only important fields, including NFκB and source accession numbers with relevant hyperlinks to full entry records, organism name and protein descriptions, for simple navigation of the database.

4.2.5 Results

4.2.5.1 NFκB Base Content

The latest release of NFκB Base (Beta 2.0) contains 413 records of experimentally verified protein sequences within the eukaryotic NFκB family. There are 22 C-Rel records, 41 Dorsal records, 29 Dif records, 70 NFκB1 records, 59 NFκB2 records, 95 RelA records, 19 RelB records and 71 Relish records (Figure 6).
These records were collected from major sequence and structure databases, including NCBI Protein[110], UniProt[111] and PDB[37], and were subsequently filtered and reviewed manually. Each record was assigned a unique accession number containing information on the protein type and serial number. In addition, each record is linked to a detailed entry page, where all annotated data is organized in fields (Figure 7).

4.2.5.2 Features

4.2.5.2.1 Keyword Search

NFκB Base supports keyword search, through the integration of a search query box on the top of each page. Users can query the database based on specific fields, including NFκB Base accession number, source database accession number, Gene Identifier (GI) number[116], protein description, gene name and organism name. For more general searches, users can also look up a keyword in all fields of the database. Search results are displayed in tabular format, with basic information including NFκB Base accession number, source database accession number, organism name and protein description (Figure 8). The accession number fields are hyperlinked to the NFκB Base entry page and source database entry page. Alternatively, users can browse the database dynamically, in the same tabular view as the search output page (Figure 9). Users can customize the number of entries to display and perform a keyword search on these fields dynamically. Each displayed field can also be sorted dynamically, in ascending or descending order.

Figure 6. Number of records present in NFκB Base (Release: Beta 2.0) for each NFκB protein type. NFκB Base is available at http://proline.bic.nus.edu.sg/~shenjean/nfkb/.

Figure 7. A typical entry page of NFκB Base.
Each entry contains information, where available, on source accession, NFκB protein type, description, organism, gene name, chromosome name, sequence length, accession number(s) of duplicate record(s) and cross-links to major online databases, including NCBI Protein (sequence database), UniProt (sequence database), GO (Gene Ontology database), HGNC (gene nomenclature database), InterPro (protein domain and family database), PDB (protein structure database), PubMed (literature database) and NCBI Taxonomy (taxonomy database).

Figure 8. Sample keyword search output of NFκB Base, displaying the accession number, source accession number, organism and description fields. NFκB Base supports keyword searches in all or specific fields, where users can submit a query at the top of every page, shown in the upper frame of this figure.

Figure 9. The Browse page of NFκB Base with jQuery supported dynamic data search and display.

4.2.5.2.2 Sequence Similarity Search

Besides keyword searches, NFκB Base also integrates the BLAST[112] tool for sequence similarity searches against all or specific types of NFκB proteins (C-Rel, Dorsal, Dif, NFκB1, NFκB2, RelA, RelB and Relish) recorded in the database. BLAST[112] is a local sequence comparison tool that returns nucleotide or protein sequences containing identical or similar regions with the input query sequence. All BLAST types, including blastp, blastn, blastx, tblastn and tblastx are supported by NFκB Base, allowing easy identification of matching or similar NFκB sequences to the query sequence. The BLAST interface provides all standard configurable parameters, similar to the original NCBI[116] BLAST interface.

4.2.5.2.3 Batch Download

NFκB Base provides batch download of all records or records specific to a particular NFκB protein type stored in NFκB Base. Users can batch retrieve all annotations in CSV format or sequence files in FASTA format.

Figure 10. BLAST interface for NFκB Base.
4.2.6 Discussion

NFκB Base represents the first specialised database containing annotated, experimentally verified information pertaining to NFκB proteins. High quality data is important for research, especially research utilizing computational approaches. With increasing data publicly available in general and specialised databases, the laborious procedures of data collection and annotation often form the rate-limiting step of data-centric research. NFκB Base aims to speed up and facilitate research on NFκB transcription factors by providing readily accessible, high quality information on these proteins. To further benefit the research community, NFκB Base is also part of the BioDB100 initiative (http://biodb100.apbionet.org) aiming to support the reproducibility of scientific data through archival and re-instantiation.

4.2.7 Future Work

As the quality of data lies at the core of databases like NFκB Base, future efforts entail the compilation of additional information to be integrated in NFκB Base. The database can be expanded to include relevant information such as conserved domain and functional site data, as well as protein dynamics data such as annotations on either experimentally verified or computationally predicted intrinsically disordered protein residues. More interactive structure, sequence and phylogeny visualizations and analysis tools can also be built into the NFκB Base web interface.

4.2.7.1 Community Annotation Policy

In addition, to speed up knowledge discovery and sharing, NFκB Base can adopt a community annotation policy, similar to Allergen Atlas[117] and T3SEdb[118], where users can submit new curated data and/or update existing NFκB information in the database. The inclusion and exclusion criteria outlined in Figure 5 can be used as a guide for the community annotation policy.
4.2.8 Chapter Conclusion

This study describes the development of a specialized database, NFκB Base, containing specific annotated information on NFκB proteins, as well as integrated analysis tools. The database contributes to research on the NFκB transcription regulation machinery, through the sharing of curated information that leads to a better understanding of the structure and function of these important proteins.

5 The Role of Conserved Disordered Residues in NFκB Function

5.1 Background

Contrary to the conventional view that folded, structured proteins are important for function, it has been discovered that intrinsically disordered proteins or protein regions, which are more flexible than their counterparts, contribute to a variety of cellular functions[17]. The functional roles of these proteins or protein regions have been studied, particularly in the field of cell signaling[14,15]. They allow for the accommodation of multiple interaction partners and modification sites, as well as regulation flexibility[18]. Analogous to sequence conservation of functionally important sites, the evolutionary conservation of intrinsic protein disorder has been discovered to be non-trivial and non-random, further signifying its functional importance[26,53,54].

Literature review of studies on intrinsic protein disorder, as described in Section 1.3, gave rise to my hypothesis that intrinsic protein disorder properties of cell signaling proteins are evolutionarily conserved in protein families. Details on the formulated hypothesis can be found in Section 1.4. The hypothesis called for systematic protein sequence and disorder conservation analyses on an exemplar cell signaling protein family for validation. To this end, the Nuclear Factor Kappa-light-chain-enhancer of Activated B cells (NFκB/Rel) proteins, crucial for processes such as cell proliferation, survival, inflammation and immunity[55,58], were selected as the exemplar protein family.
Currently, no investigation on the functional role of intrinsic protein disorder in NFκB transcription factors has been recorded. This study therefore addressed a specific gap in NFκB literature. In the final phase of my research project, I developed an in silico analysis pipeline for the identification of intrinsically disordered protein residues and the analysis of the conservation, localization and function of predicted disordered regions. The results of the analysis revealed distinctive protein disorder distribution patterns across each NFκB protein type, which are conserved across different species. This implies the functional contribution of intrinsically disordered protein residues in promoting DNA and ankyrin protein binding events.

5.2 Materials and Methods

5.2.1 Sequence Data Collection

NFκB protein sequences used for this study were collected from NFκB Base (Chapter 4), which contains 413 experimentally verified, manually annotated NFκB sequences. To minimize redundancy, the collected sequences were further processed to select the longest, unique representative sequence for each organism in each NFκB protein type. The representative sequences analysed across multiple organisms comprised 18 NFκB1 sequences, 11 NFκB2 sequences, 14 C-Rel sequences, 6 Dif sequences, 16 Dorsal sequences, 19 RelA sequences, 5 RelB sequences and 24 Relish sequences.

5.2.2 Multiple Sequence Alignment

For each NFκB protein type, multiple sequence alignment was performed using Multiple Sequence Comparison by Log-Expectation (MUSCLE)[119] hosted on EUAsiaGrid[120]. The alignments were checked and edited manually for misalignments in BioEdit[121]. Positions in the alignments where more than 50% of the sequences contained a gap were removed to yield entropy and conservation measurements with high statistical support.
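The two filtering steps described above (selection of the longest representative per organism and protein type, and removal of alignment positions with more than 50% gaps) can be sketched as follows. This is an illustrative sketch only: in this study the alignments were edited manually in BioEdit, and the function names, the record layout and the toy sequences here are assumptions.

```python
from typing import Dict, List, Tuple


def longest_per_organism(records: List[Tuple[str, str, str]]) -> Dict[Tuple[str, str], str]:
    """Keep the longest sequence for each (protein type, organism) pair.

    Each record is (protein_type, organism, sequence); this layout is hypothetical.
    """
    best: Dict[Tuple[str, str], str] = {}
    for protein_type, organism, sequence in records:
        key = (protein_type, organism)
        if key not in best or len(sequence) > len(best[key]):
            best[key] = sequence
    return best


def drop_gappy_columns(alignment: List[str], max_gap_fraction: float = 0.5, gap: str = "-") -> List[str]:
    """Remove alignment columns in which more than max_gap_fraction of sequences have a gap."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")
    keep = [
        i for i in range(length)
        if sum(seq[i] == gap for seq in alignment) / len(alignment) <= max_gap_fraction
    ]
    return ["".join(seq[i] for i in keep) for seq in alignment]


# Toy data, invented for illustration only.
representatives = longest_per_organism([
    ("RelA", "Homo sapiens", "MDELF"),
    ("RelA", "Homo sapiens", "MDELFPLIF"),
    ("RelA", "Mus musculus", "MDDLF"),
])
filtered = drop_gappy_columns(["MK-L", "M--L", "MR-L", "M-AL"])
```

In the toy alignment, the third column has gaps in three of four sequences and is removed, while the second column (exactly 50% gaps) is retained because the criterion is "more than 50%".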
5.2.3 Entropy Analysis

Based on the multiple sequence alignment, the level of amino acid residue conservation at each position was inferred using Shannon's entropy values calculated using BioEdit[121]. The entropy value at each position provides a measure of uncertainty at each position relative to other positions and is defined as

\[ H(x) = -\sum_{b} f(b, x) \ln f(b, x) \]

where f(b, x) is the frequency at which residue b is found at position x[122]. The maximum entropy is calculated as

\[ H_{max} = \ln n = \ln 20 \]

where n represents the maximum number of variations at a particular position[122]. High entropy values correspond to positions in the alignment with high variability and thus low residue conservation.

5.2.4 Intrinsic Protein Disorder Analysis

DisBatch (Version 0.02, Chapter 3) was used to predict potentially disordered residues, with the threshold set at 0.5 (for positive prediction by PrDOS[38] only) and 1.5 (for positive prediction by both PrDOS[38] and FoldIndex[32]).

5.2.5 Conservation of Intrinsic Protein Disorder

Conservation of intrinsic protein disorder at each residue position in the multiple sequence alignment, for each NFκB protein type across multiple species, was measured first using the standard deviation (SD), followed by the coefficient of variation (CV). While SD is unit-dependent, CV represents a normalized, scale-invariant measure of the dispersion of the disorder scores around the mean. CV is expressed as the ratio of the standard deviation to the mean:

\[ CV_x = \frac{\sigma(\mu_x)}{\mu_x} \]

where σ(µx) represents the standard deviation associated with the mean DisBatch protein disorder score, µx, at position x, across all sequences (from multiple organisms) of a specific NFκB type in the multiple sequence alignment. Low CV values correspond to residue positions in the alignment where DisBatch protein disorder scores across all sequences share low standard deviations in relation to the mean, thereby implying conservation of the average disorder scores.
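The per-position entropy and CV measures defined in Sections 5.2.3 and 5.2.5 can be sketched as below. This is a minimal illustration under stated assumptions, not the BioEdit implementation: gap characters are ignored when computing residue frequencies, the sample standard deviation is used for the CV, and a CV is left undefined (NaN) when the mean score is zero. All function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, stdev
from typing import List

H_MAX = math.log(20)  # maximum entropy for 20 possible amino acid residues


def shannon_entropy(column: str, gap: str = "-") -> float:
    """H(x) = -sum_b f(b, x) ln f(b, x) for one alignment column (gaps ignored: assumption)."""
    residues = [r for r in column if r != gap]
    if not residues:
        return 0.0
    total = len(residues)
    return sum(-(c / total) * math.log(c / total) for c in Counter(residues).values())


def coefficient_of_variation(scores: List[float]) -> float:
    """CV = SD / mean of the disorder scores at one position (needs at least two scores)."""
    mu = mean(scores)
    if mu == 0:
        return float("nan")  # assumption: CV is undefined when the mean score is zero
    return stdev(scores) / mu


# Toy values, invented for illustration only.
h_conserved = shannon_entropy("AAAA")             # fully conserved column
h_variable = shannon_entropy("AALL")              # two residues at equal frequency
cv_conserved = coefficient_of_variation([1.5, 1.5, 1.5])
cv_variable = coefficient_of_variation([1.0, 2.0])
```

A fully conserved column gives an entropy of 0, while two equally frequent residues give ln 2; identical disorder scores across sequences give a CV of 0, the most conserved case.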
5.2.6 Structural Analysis

Structural data of NFκB proteins were mined from annotated sequence records and extracted from the Protein Data Bank (release date: June 1, 2010)[37]. A total number of 35 wild type and mutated NFκB protein structures in bound states from all available species were collected and carefully reviewed. Protein sequences from structural records were a subset of the NFκB sequence dataset in NFκB Base that was used in the analysis of intrinsic protein disorder properties. 16 representative, unique structures for each NFκB dimer combination in either active or inhibited states were selected for further analysis. The NFκB protein structures were superimposed and annotated using PyMOL (version 0.99rc6)[123]. Root mean square deviation (RMSD) values representing the average distance between superimposed atoms were also calculated by PyMOL. β-factors for each atom, indicative of the degree of thermal displacement, were extracted from PDB files[37] and visualized graphically in PyMOL.

5.3 Results

5.3.1 Conserved intrinsic protein disorder signatures in NFκB

To predict intrinsically disordered residues in both Class I and Class II NFκB proteins, DisBatch was used. The average DisBatch score for each position in the multiple sequence alignment was calculated and compared with Shannon's entropy values[122] to check for any correlation between intrinsic protein disorder and residue conservation. High entropy values represent positions in the alignment with high variability, implying low residue conservation.

NFκB proteins share a conserved N-terminal Rel homology (RH) domain, which is further subdivided into the N-terminal specificity sub-domain and the C-terminal IPT sub-domain[58]. The N-terminal specificity (RHD) sub-domain resembles the core domain of the transcription factor p53, whereas the C-terminal IPT sub-domain is an immunoglobulin like fold containing the interaction site of NFκB with its inhibitor, IκB[58].
The RH domain also contains DNA binding sites, ankyrin protein binding sites and the dimerization interface[58]. As expected, Shannon's entropy analysis[122] in each NFκB protein type showed high residue conservation within the NH2-terminal Rel homology (RH) domain, particularly for the DNA and ankyrin protein binding sites, as observed by troughs representing lower entropy values in Figure 11 and Figure 12. Unlike Class II NFκB proteins, Class I NFκB proteins contain protein-protein interaction domains including ankyrin repeats[124-127] in the ANK domain and the Death domain[128] at the C-terminal. The ANK domain[124-127] is responsible for mediating protein-protein interactions, while the Death domain[128] acts as an adaptor and recruiter in signaling pathways[64]. In comparison with the N-terminal RH domain, the ANK and Death domains in Class I NFκB proteins appeared to be less conserved according to the Shannon's entropy values[122]. Nevertheless, there seemed to be no apparent correlation between the average disorder score and Shannon's entropy values, suggesting that intrinsic protein disorder is not associated with residue conservation.

The distribution of the average disorder score in each NFκB protein type (Figure 11 & Figure 12) showed predicted moderately and highly disordered residues (residues with an average disorder score of ≥0.5 and ≥1.5) to be associated with DNA binding sites in the RHD N-terminal specificity structural sub-domain, as well as with the ankyrin protein binding sites in the C-terminal IPT structural sub-domain for both Class I and Class II NFκB proteins. The ANK and Death domains in Class I NFκB proteins (Figure 11) were observed to be generally non-disordered. Notably, a prominent spike of average disorder score at the N-terminal end of the ANK domain was seen in all Class I proteins, except Relish.
Interestingly, DNA binding sites in Class I proteins were observed to be more disordered, with generally higher average disorder scores than Class II proteins. Some DNA binding residues at the C-terminal end of the RHD domain, at approximate alignment position 150-200, were predicted to be ordered in most Class I and Class II proteins but were absent in RelA, RelB and C-Rel. Furthermore, residues that are part of the dimerization interface were found to be generally more disordered in Class I proteins (except in NFκB1) and less disordered in Class II proteins. These differences in protein disorder patterns between Class I and Class II proteins may shed light on their differences in mechanism and function.

Figure 11. Distribution of the average disorder score at each alignment position for Class I NFκB proteins at the RHD domain of A) NFκB1, B) NFκB2 and C) Relish, as predicted by DisBatch. The average disorder score cutoffs of 0.5 and 1.5 were used to distinguish between moderately (predicted only by PrDOS to be disordered) and highly disordered (predicted by both PrDOS and FoldIndex) residues, respectively. Shannon's entropy values were also plotted in the graph for comparison.

Figure 12. Distribution of the average disorder score at each alignment position for Class II NFκB proteins at the RHD domain of A) RelA, B) RelB, C) C-Rel, D) Dorsal and E) Dif, as predicted by DisBatch.

Figure 13. Distribution of the average disorder score at each alignment position for Class I NFκB proteins at the IPT domain of A) NFκB1, B) NFκB2 and C) Relish, as predicted by DisBatch.

Figure 14. Distribution of the average disorder score at each alignment position for Class II NFκB proteins at the IPT domain of A) RelA, B) RelB, C) C-Rel, D) Dorsal and E) Dif, as predicted by DisBatch.

Figure 15.
Distribution of the average disorder score at each alignment position for Class I NFκB proteins at sites with no functional annotation in A) NFκB1, B) NFκB2 and C) Relish, as predicted by DisBatch. 57 Figure 16. Distribution of the average disorder score at each alignment position for Class II NFκB proteins at sites with no functional annotation in A) RelA, B) RelB, C) C-Rel, D) Dorsal and E) Dif, as predicted by DisBatch. 58 Figure 17. Distribution of the average disorder score at each alignment position for Class I NFκB proteins at the ANK domain (in red) and Death domain (in black) of A) NFκB1, B) NFκB2 and C) Relish, as predicted by DisBatch. 59 To further assess the conservation of protein disorder properties in NFκB proteins, I measured the standard deviation of the average disorder score, as well as the scaleinvariant coefficient of variation (CV) values at each position. Scatter plots for each NFκB protein type were generated to further examine the relationship between the average disorder score and the standard deviation and CV of the average disorder score (Figure 18-Figure 23). Each point in the scatter plot represents a specific alignment position in a particular NFκB protein type. The scatter plots of both the standard deviation and CV against the average disorder score generally agree with each other (Figure 18-Figure 23). These plots show distinct quadrants, mainly i) conserved, non-disordered (residues not predicted to be disordered by DisBatch (bottom left of scatter plot) and ii) conserved, disordered (bottom right; Figure 18Figure 23). In all NFκB protein types, the conserved, non-disordered quadrant comprised some ankyrin protein binding sites, IPT domain residues, RHD domain residues and nonfunctionally annotated residues. 
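The dispersion measures and quadrant assignment described above can be illustrated with a short sketch. Several details are assumptions rather than statements from the thesis: per-sequence disorder scores are taken to be 0, 1 or 2 (not disordered, PrDOS only, PrDOS and FoldIndex), the population standard deviation is used, and the dispersion cut-off of 0.5 that separates "conserved" from "non-conserved" is purely hypothetical. Only the disorder cut-off of 0.5 comes from the text.

```python
import math
import statistics


def position_stats(scores):
    """Return (mean, SD, CV) of the disorder scores at one alignment position.

    CV = SD / mean, so it is undefined (NaN) when the mean is zero.
    """
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)  # assumption: population SD
    cv = sd / mean if mean > 0 else math.nan
    return mean, sd, cv


def quadrant(mean, dispersion, disorder_cutoff=0.5, dispersion_cutoff=0.5):
    """Label a position by disorder (mean) and conservation (SD or CV).

    dispersion_cutoff is a hypothetical value chosen for illustration.
    A NaN dispersion can only arise when every score is 0, so it is
    treated as conserved.
    """
    conserved = math.isnan(dispersion) or dispersion < dispersion_cutoff
    disordered = mean >= disorder_cutoff
    return "{}, {}".format(
        "conserved" if conserved else "non-conserved",
        "disordered" if disordered else "non-disordered",
    )


# Hypothetical scores for one alignment position across four orthologues.
mean, sd, cv = position_stats([2, 2, 2, 1])
label = quadrant(mean, sd)
```

Because the CV divides the standard deviation by the mean, positions with similar standard deviations will show a lower CV simply by having a higher average disorder score, which is worth keeping in mind when reading the CV plots.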
Most ANK and Death domain residues in Class I proteins were represented as outliers in the conserved, non-disordered quadrant, while DNA binding sites in Class I NFκB proteins tended to lie near or within the conserved disordered quadrant (Figure 18 & Figure 21). In addition, only the dimerization interface residues of NFκB2 were clearly observed in both plots to lie within the conserved disordered quadrant (Figure 18 & Figure 21), while dimerization interface residues in Class II proteins were generally found in the non-disordered quadrants (Figure 19, Figure 20, Figure 22 & Figure 23). Class II NFκB proteins, on the other hand, had slightly fewer DNA binding sites within the conserved disordered quadrant and slightly more within the conserved, non-disordered quadrant than the Class I proteins (Figure 19, Figure 20, Figure 22 & Figure 23). Interestingly, many DNA binding sites in Dif were found in the non-conserved, disordered quadrant (Figure 20 & Figure 23). The scatter plots of the average disorder score against the CV of the average disorder score indicated a general inverse correlation (Figure 21, Figure 22 & Figure 23). The correlation between the two variables was not perfect, and quantifying it with the coefficient of determination (R²) yielded low values of ≤ 0.5. The inverse trend may itself be an artefact: because the CV is the standard deviation divided by the mean, the CV is expected to fall as the average disorder score rises whenever the standard deviation does not depend on the mean.

Figure 18. Scatter plot of average disorder score against the standard deviation of disorder scores for Class I NFκB proteins, A) NFκB1, B) NFκB2 and C) Relish, as predicted by DisBatch. The scatter plots show 2 distinct quadrants: conserved non-disordered residues (bottom left) and conserved disordered residues (bottom right). Functional domains and sites were annotated in the graph and coloured accordingly.

Figure 19. Scatter plot of average disorder score against the standard deviation of disorder scores for Class II NFκB proteins, A) RelA, B) RelB and C) C-Rel, as predicted by DisBatch.

Figure 20. (Cont’d from Figure 19) Scatter plot of average disorder score against the standard deviation of average disorder score for Class II NFκB proteins, A) Dorsal, B) Dif, as predicted by DisBatch.

Figure 21. Scatter plot of average disorder score against the CV of average disorder score for Class I NFκB proteins, A) NFκB1, B) NFκB2 and C) Relish, as predicted by DisBatch. The scatter plot shows 4 distinct quadrants: non-conserved, non-disordered residues (top left of scatter plot), non-conserved disordered residues (top right), conserved non-disordered residues (bottom left) and conserved disordered residues (bottom right). Functional domains and sites were annotated in the graph and coloured accordingly.

Figure 22. Scatter plot of average disorder score against the CV of average disorder score for Class II NFκB proteins, A) RelA, B) RelB and C) C-Rel, as predicted by DisBatch.

Figure 23. (Cont’d from Figure 22) Scatter plot of average disorder score against the CV of average disorder score for Class II NFκB proteins, A) Dorsal, B) Dif, as predicted by DisBatch.

5.3.2 Structural Analysis

Following the analysis of intrinsic protein disorder at the sequence level, an attempt was made to map the predicted disordered residues in NFκB proteins to representative 3D structural data. The structural analysis was conducted to provide a more in-depth case study of NFκB sequence properties, and the structures analyzed represented a subset of the NFκB protein sequence dataset. Prior to this, all available 3D structures for each NFκB protein type were superimposed on each other to confirm that the selected structures were representative. All superimposed structures exhibited high structural similarity, with low root mean square deviation (RMSD) values (data not shown).
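For readers unfamiliar with the RMSD measure mentioned above, the following sketch shows how it is computed from two sets of equivalent atom coordinates. It assumes the two structures have already been superimposed by an external tool and that the atoms are paired in the same order; the superposition step itself is not shown, and the coordinates are invented for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two paired lists of (x, y, z).

    Assumption: the structures are already superimposed and atom i in
    coords_a corresponds to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical C-alpha coordinates (in angstroms) for two short fragments.
fragment_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
fragment_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
example_rmsd = rmsd(fragment_a, fragment_b)
```

A value near zero indicates that the paired atoms almost coincide after superposition, which is the sense in which low RMSD values support the representativeness of the selected structures.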
Conserved non-disordered and conserved disordered residues in each NFκB protein type were mapped to available PDB[37] structures, according to their respective quadrants demarcated in Figure 18 to Figure 23. The annotated structures were compared with the same structures visualized as heat maps of their β-factor values, taken from the original PDB annotation[37]. Structural analysis of Class I NFκB homodimers showed most predicted conserved, non-disordered residues to surround the DNA at the N-terminal RHD domain, and most conserved disordered residues to be present at the C-terminal IPT domain, which contains the ankyrin protein binding sites and dimerization interface (Figure 24). Only residues in the N-terminal Rel Homology (RH) domain were visible in the 3D structures, while coordinates of residues in the C-terminal ANK and Death domains were not present in the PDB[37] records.

Class II homodimers showed more conserved disordered residues surrounding the DNA and more conserved, non-disordered residues at the C-terminal IPT domain, with its ankyrin protein binding sites and dimerization interface (Figure 25). In addition, the insert α-helical regions of both Class I and Class II NFκB dimers were found to be disordered (Figure 24 & Figure 25). The insert region in Class I proteins differs from that in Class II proteins in that it contains an additional α-helix. These disordered insert α-helical regions formed grooves in the NFκB proteins that resembled potential protein binding sites. In Class I NFκB proteins the clefts formed by the insert regions were narrow and deep (Figure 24), whereas in Class II NFκB proteins they were much wider and shallower (Figure 25). According to the literature, the insert regions represent potential interaction and/or binding sites[96]. NFκB heterodimers are formed between Class I and Class II proteins.
As expected, heterodimers contain a hybrid of conserved non-disordered and conserved disordered residues in all functional sites and domains, suggesting mechanisms and functions that differ from those of the Class I and Class II homodimers (Figure 26). For the heterodimers, the configuration of conserved disordered and non-disordered residues in each component monomer matched that observed in the corresponding homodimer. Interestingly, inhibited NFκB dimers consisted largely of conserved disordered residues. In contrast, their inhibitors (IκB) were found to be largely conserved and non-disordered (Figure 27). The protein structures annotated with intrinsic protein disorder information were compared with their respective β-factor annotations. β-factors, also known as temperature factors, represent the degree of thermal displacement, and thus the flexibility, of an atom[37]. Previous research has associated intrinsically disordered protein regions with high β-factors[129]. Here, the results showed a general, loose agreement between conserved disordered residues and residues with high β-factors, and between conserved non-disordered residues and residues with low β-factors. The exception was the Class I monomers (NFκB1 and NFκB2) that form part of the heterodimers in Figure 26, where residues with high β-factors were found to be conserved and non-disordered. From these observations, the DisBatch predictor appears to be more sensitive to intrinsically disordered regions than the annotated β-factors. It is possible that the predictor “over-predicts” disordered regions, thus operating at the lower range of accuracy with false positives. Alternatively, the predictor may be detecting dynamic regions that have evolved not to be intrinsically disordered as such, but to remain disordered until a binding event occurs.
In that case, some sequence conservation of these dynamic properties is expected, since intrinsically disordered proteins exhibit a structure-function relationship in the bound form. Nevertheless, β-factors are not guaranteed to be reliable indicators of disorder, since they vary with the effects of local packing and the structural environment[130].

Figure 24. Structures of representative Class I NFκB homodimers, NFκB1 (top) and NFκB2 (bottom), coloured according to protein disorder annotations (left) and β-factors (right). The C-terminal IPT domain contains ankyrin protein binding sites enveloping the dimerization interface. Ankyrin repeats and the Death domain were not present in the 3D structures. The α-helical insert regions are conserved disordered residues, highlighted in red, at the left of the protein structure in the N-terminal RHD domain.

Figure 25. Structures of representative Class II NFκB homodimers, RelA (top) and C-Rel (bottom), coloured according to protein disorder annotations (left) and β-factors (right).

Figure 26. Structures of representative NFκB heterodimers formed between Class I and Class II NFκB proteins, coloured according to protein disorder annotations (left) and β-factors (right). Examples shown here are the RelA-NFκB1 (top) and RelB-NFκB2 (bottom) heterodimers.

Figure 27. Structures of representative RelA homodimer (top) and RelA-NFκB1 heterodimer (bottom) in the IκB inhibited state, coloured according to protein disorder annotations (left) and β-factors (right).

5.4 Discussion

This study aimed to examine the functional roles of intrinsic protein disorder in the exemplar NFκB transcription factors. To this end, I utilized a suite of computational tools to identify and analyze intrinsically disordered protein regions in different types of NFκB proteins at both the sequence and structural levels. The findings were compared with well-known measures, including Shannon’s entropy values[122] and β-factors[37].
β-factors are well known to correspond to crystal quality and R-value in any given crystal structure[37]. From both the sequence and structural analyses, key differences in protein disorder patterns between the Class I and Class II NFκB proteins were observed. Firstly, Class I NFκB proteins were more disordered than Class II proteins in the vicinity of the DNA contacting sites at the N-terminal RHD domain. This may explain their reported stronger DNA binding activity in comparison with Class II proteins[58,106]. Protein recognition of DNA target sites is a crucial event in key gene functions, including transcription. It has been reported that protein-DNA interactions proceed via an initial, non-specific binding state which accelerates the search for target sites through multiple intramolecular processes, including diffusion along the DNA[131,132]. This is facilitated by flexible, dual-role residues which act as switches between non-specific and specific binding states through conformational changes and rearrangements[131,132]. More specifically, Kalodimos et al. observed that the hinge region in the DNA binding domain of the lactose repressor is disordered in the free and non-specific binding states but forms an α-helix in the specific binding state[132]. Hence, it can be proposed that disordered residues in NFκB proteins function similarly in promoting protein-DNA interactions.

Secondly, Class I NFκB proteins were more disordered than Class II proteins at their dimerization interfaces, suggesting different modes of dimerization. However, there appears to be no substantial literature on differences in dimerization mode between the two NFκB protein classes. Class I NFκB proteins contain ankyrin repeats in the ANK domain[124-127,133-135] and the Death domain[128], neither of which is found in Class II proteins. Both domains were predicted to be generally non-disordered.
For Class I proteins, only predicted disordered residues occurring in the Rel Homology (RH) domain could be mapped to the 3D crystal structures, whereas the ankyrin repeats and Death domain at the C-terminus were not found in the structures. On the other hand, Class II NFκB proteins have been reported to possess a potent trans-activation domain at the C-terminus[58]. However, sequence-specific positions of the trans-activation domain were not found in NFκB records in major databases during the data mining step. Therefore, structural localization of the ANK and Death domains, and analysis of the presence and localization of the trans-activation domain, were not applicable to this study.

From this study, it can be observed that not all functional sites were intrinsically disordered. Most ANK[125] and Death domain[128] residues in Class I proteins were predicted to be conserved and non-disordered. Some ankyrin protein binding site residues in both classes of proteins were conserved and non-disordered, while the rest were either non-disordered or disordered, in a conserved or non-conserved manner. The same applies to RHD and IPT domain residues. Nonetheless, many functional sites in NFκB proteins were observed to be conserved, both in terms of sequence (from Shannon’s entropy values)[122] and in terms of intrinsic protein disorder properties (from SD and CV values). The conservation of intrinsic protein disorder was not reflected by Shannon’s entropy analysis[122], since there was no general observable trend between the dispersion of the average disorder score and Shannon’s entropy values (data not shown). In fact, Chen et al. have shown by Shannon’s entropy analysis that disordered regions have relatively lower levels of sequence conservation than non-disordered regions[51]. This suggests that the conservation of protein disorder may capture some functional aspects of NFκB proteins that are not reflected in residue conservation.
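A relationship such as the one between the dispersion of the average disorder score and Shannon's entropy values can be quantified with the coefficient of determination. The sketch below assumes a simple linear relationship, for which R² equals the squared Pearson correlation coefficient; how R² was actually computed in this study is not stated, and the input values here are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def r_squared(xs, ys):
    """Coefficient of determination for a simple linear fit of ys on xs."""
    return pearson_r(xs, ys) ** 2


# Hypothetical per-position values: disorder-score SD and Shannon entropy.
disorder_sd = [1.0, 2.0, 3.0, 4.0]
entropy = [1.0, -1.0, 1.0, -1.0]
weak_fit = r_squared(disorder_sd, entropy)
```

A value close to 0 indicates that one series explains little of the variation in the other, which is the situation described above for disorder dispersion and sequence entropy.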
I therefore propose the use of the standard deviation (SD) and/or coefficient of variation (CV) as a more appropriate measure of the conservation of protein disorder, since they treat the conservation of intrinsic protein disorder as a general characteristic, possibly encompassing physicochemical and structural properties, rather than as a residue-specific characteristic.

Taken together, the results suggest that evolutionarily conserved disordered residues offer an alternative perspective on the functional roles of NFκB proteins, especially in facilitating binding events including DNA binding, ankyrin protein binding and possibly the binding of other proteins (from the discovery of disordered α-helix insert regions). However, conserved non-disordered residues also contribute to specific NFκB functions. This highlights that protein functions in NFκB transcription factors are not all necessarily mediated by protein disorder. Rather, the specific localization of disordered and non-disordered residues in each functional site contributes to various aspects of protein function, and each residue type plays unique, specific functional roles. Taking this view, intrinsic protein disorder signatures may be critical in determining protein function. In this study, it has been shown that differences in intrinsic protein disorder signatures between the two classes of NFκB proteins may account for their differences in mechanism. Such differences may even exist between different protein types, as seen from certain exceptions in disorder signatures mentioned in Section 5.3. The monomeric composition of an NFκB dimer has been found to affect its DNA-binding site specificity, subcellular localization, trans-activation potential and mode of regulation, thus leading to combinatorial diversity of the downstream responses[63,136].
Here, intrinsic protein disorder signatures for each NFκB protein type may account for the variability in function of each type of NFκB dimer.

5.5 Future Work

This research demonstrated how the analysis of protein disorder can be applied to study the function of transcription factors such as the NFκB protein family, which are involved in many important cellular and organismal processes. Further validation and characterization of the predicted disordered residues, through both computational and laboratory means, are necessary to support the functional roles of these conserved disordered (and/or non-disordered) residues. Experimentally, imaging approaches can be used to observe in more detail the dynamics involved in NFκB interactions. Mutagenesis studies can be performed on conserved disordered residues to observe the change in function. Systems biology approaches integrating both experimental and computational methods can also be used in the future for modeling and understanding NFκB interactions and pathways.

Computationally, datasets on additional protein families can be used as positive and negative controls for a more robust analysis. These control datasets can also be used to rigorously test the use of SD and/or CV as a quantitative measure of the conservation of intrinsic protein disorder. In addition, the value of using unique disorder motifs for protein families (in the form of matrices) for functional prediction, using approaches such as machine learning methods, can be investigated. If it can be demonstrated that protein families indeed have distinct disorder patterns, a database of these signatures can be developed for the benefit of the scientific community.

Lastly, more in-depth structural analysis can be conducted to lend support to the findings of this study. More work can be done to quantify the correlation between intrinsic protein disorder and β-factors. Smith et al. advised normalizing β-factors for more accurate comparisons[130]. The same procedure can be applied to this study in the future. Further structural analysis can also include molecular dynamics simulations and probabilistic conformational space sampling of known NFκB interacting structures, to investigate the effects and functional implications of interactions involving intrinsically disordered residues on permitted NFκB protein conformations.

5.6 Chapter Conclusion

Protein disorder has been implicated in various regulatory functions in the cell, particularly in transcription and cell signaling, allowing for the accommodation of multiple interaction partners and modification sites and the provision of flexibility in regulation[15,16]. In this chapter, I have described a study aimed at investigating the functional role of intrinsically disordered regions in NFκB proteins, a set of eukaryotic transcription factors involved in diverse cellular and organismal processes. I have examined the conservation of predicted disordered regions across known representative NFκB protein sequences and analyzed the sequence and structural configuration of these disordered residues. From this study, distinctive protein disorder patterns in each NFκB protein type, conserved across different species, were observed, potentially highlighting key differences in mechanism and function between NFκB protein classes and types. The study of intrinsic protein disorder therefore provides a different perspective for gaining a more in-depth understanding of the mechanisms underlying NFκB transcriptional regulation.

6 Conclusion

For my thesis, I have developed an analysis pipeline, comprising computational prediction, data mining, and sequence and structure analysis, to investigate the functional role of intrinsic protein disorder in the exemplar NFκB protein family.
The pipeline represents the first critical step in analyzing and understanding intrinsic protein disorder and its role in protein function. Quantitative and qualitative findings of this project supported the emerging paradigm that the dynamics of signaling proteins in general play important roles in modulating their functions. Protein disorder thus offers a new way of analyzing and understanding protein-protein binding and interactions, contributing to the further understanding of functional conservation in relatively diverse proteins. One exciting and meaningful implication of this project is that protein families may possess distinctive disorder signatures or motifs, which can prove valuable for functional characterization. To this end, the protein disorder analysis pipeline outlined here, once established, can be applied to study the conservation and configuration of dynamic residues in other transcription factors and protein families. Key questions that can be addressed in future include those pertaining to the complexity of the transcription regulation machinery and of signaling pathways, encompassing signal integration and cross-talk. The methodology and findings of this study will lay the foundation for similar research, thereby contributing significantly to future research in the field of transcriptional regulation and cellular signaling.

7 References

1. Barnes C, Monteith W, Pielak G. Internal and Global Protein Motion Assessed with a Fusion Construct and In-Cell NMR Spectroscopy. Chembiochem. 2010; 2. Miller RJD. Energetics and Dynamics of Deterministic Protein Motion. Acc. Chem. Res. 1994;27(5):145-150. 3. Hammes-Schiffer S, Benkovic S. Relating protein motion to catalysis. Annu Rev Biochem. 2006;75:519-541. 4. Genberg L, Richard L, McLendon G, Miller R. Direct observation of global protein motion in hemoglobin and myoglobin on picosecond time scales. Science. 1991;251(4997):1051-1054. 5. Jasnin M, Moulin M, Haertlein M, Zaccai G, Tehei M.
In vivo measurement of internal and global macromolecular motions in Escherichia coli. Biophys J. 2008;95(2):857-864. 6. Terada TP, Sasai M, Yomo T. Conformational change of the actomyosin complex drives the multiple stepping movement. Proc Natl Acad Sci U S A 2002;99(14):9202-9206. 7. Hirose S, Yokota K, Kuroda Y, Wako H, Endo S, Kanai S, et al. Prediction of protein motions from amino acid sequence and its application to protein-protein interaction. BMC Struct Biol 2010;10:20. 8. Echols N, Milburn D, Gerstein M. MolMovDB: analysis and visualization of conformational change and structural flexibility. Nucleic Acids Res 2003;31(1):478-482. 9. Wootton JC. Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput Chem 1994;18(3):269-285. 10. Smock R, Gierasch L. Sending signals dynamically. Science. 2009;324(5924):198-203. 11. Bu Z, Callaway D. Proteins MOVE! Protein dynamics and long-range allostery in cell signaling. Adv Protein Chem Struct Biol. 2011;83:163-221. 12. Ashish, Juncadella IJ, Garg R, Boone CD, Anguita J, Krueger JK, et al. Conformational rearrangement within the soluble domains of the CD4 receptor is ligand-specific. J Biol Chem 2008;283(5):2761-2772. 13. Good M, Zalatan J, Lim W. Scaffold proteins: hubs for controlling the flow of cellular information. Science. 2011;332(6030):680-686. 14. Structure of tumor suppressor p53 and its intrinsically disordered N-terminal transactivation domain. Proceedings of the National Academy of Sciences; 2008. 5762 p. 15. Cortese M, Uversky V, Dunker A. Intrinsic disorder in scaffold proteins: getting more from less. Prog Biophys Mol Biol. 2008;98(1):85-106. 16. Dunker A, Silman I, Uversky V, Sussman J. Function and structure of inherently disordered proteins. Curr Opin Struct Biol. 2008;18(6):756-764. 17. Uversky V, Dunker A. Understanding protein non-folding. Biochim Biophys Acta. 2010;1804(6):1231-1264. 18.
Linding R, Jensen L, Diella F, Bork P, Gibson T, Russell R, et al. Protein disorder prediction: implications for structural proteomics. Structure. 2003;11(11):1453-1459. 19. Dunker A, Oldfield C, Meng J, Romero P, Yang J, Chen J, et al. The unfoldomics decade: an update on intrinsically disordered proteins. BMC Genomics. 2008; 20. Uversky V. Natively unfolded proteins: a point where biology waits for physics. Protein Sci. 2002;11(4):739-756. 21. Radhakrishnan I, Pérez-Alvarado G, Parker D, Dyson H, Montminy M, Wright P, et al. Solution structure of the KIX domain of CBP bound to the transactivation domain of CREB: a model for activator:coactivator interactions. Cell. 1997;91(6):741-752. 22. Sigalov AB, Uversky VN. Differential occurrence of protein intrinsic disorder in the cytoplasmic signaling domains of cell receptors. Self/nonself 2011;2(1):55-72. 23. Sandhu KS. Intrinsic disorder explains diverse nuclear roles of chromatin remodeling proteins. J Mol Recognit 2009;22(1):1-8. 24. Mészáros B, Simon I, Dosztányi Z. The expanding view of protein-protein interactions: complexes involving intrinsically disordered proteins. Phys Biol 2011;8(3):035003. 25. Mészáros B, Simon I, Dosztányi Z. Prediction of protein binding regions in disordered proteins. PLoS Comput Biol 2009;5(5):e1000376. 26. Brown CJ, Johnson AK, Dunker AK, Daughdrill GW. Evolution and disorder. Curr Opin Struct Biol 2011; 27. Dosztányi Z, Tompa P. Prediction of protein disorder. Methods Mol Biol. 2008;:103-115. 28. 17 Protein disorder prediction servers [Internet]. [updated 2009; cited 2011]. Available from: http://rosettadesigngroup.com/blog/521/17-protein-disorder-prediction-servers/ 29. Romero P, Obradovic Z, Li X, Garner EC, Brown CJ, Dunker AK, et al. Sequence complexity of disordered protein. Proteins 2001;42(1):38-48. 30. Dunker A, Brown C, Lawson J, Iakoucheva L, Obradović Z. Intrinsic disorder and protein function. Biochemistry. 2002;41(21):6573-6582. 31.
Li, Romero, Rani, Dunker, Obradovic. Predicting Protein Disorder for N-, C-, and Internal Regions. Genome Inform Ser Workshop Genome Inform 1999;10:30-40. 32. Prilusky J, Felder CE, Zeev-Ben-Mordehai T, Rydberg EH, Man O, Beckmann JS, et al. FoldIndex: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics 2005;21(16):3435-3438. 33. Dosztányi Z, Csizmok V, Tompa P, Simon I. IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 2005;21(16):3433-3434. 34. Linding R, Russell RB, Neduva V, Gibson TJ. GlobPlot: exploring protein sequences for globularity and disorder. Nucleic Acids Res 2003;:3701-3708. 35. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983;22(12):2577-2637. 36. Ward J, Sodhi J, McGuffin L, Buxton B, Jones D. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol. 2004;337(3):635-645. 37. Berman H, Henrick K, Nakamura H. Announcing the worldwide Protein Data Bank. Nat Struct Biol 2003;10(12):980. 38. Ishida T, Kinoshita K. PrDOS: prediction of disordered protein regions from amino acid sequence. Nucleic Acids Res 2007;35(Web Server issue):W460-W464. 39. McGuffin LJ. Intrinsic disorder prediction from the analysis of multiple protein fold recognition models. Bioinformatics 2008;24(16):1798-1804. 40. Schlessinger A, Punta M, Yachdav G, Kajan L, Rost B. Improved Disorder Prediction by Combination of Orthogonal Approaches. PLoS ONE. 2009;4(2):e4433. 41. Genesilico metadisorder service [Internet]. [updated 2011; cited 2011]. Available from: http://iimcb.genesilico.pl/metadisorder/overview.html 42. Ishida T, Kinoshita K. Prediction of disordered regions in proteins based on the meta approach. Bioinformatics 2008;24(11):1344-1348. 43. 
Xue B, Dunbrack RL, Williams RW, Dunker AK, Uversky VN. PONDR-FIT: a meta-predictor of intrinsically disordered amino acids. Biochim Biophys Acta 2010;1804(4):996-1010. 44. Lieutaud P, Canard B, Longhi S. MeDor a metaserver for predicting protein disorder. BMC Genomics. 2008;9(Suppl 2):S25. 45. Moult J, Fidelis K, Zemla A, Hubbard T. Critical assessment of methods of protein structure prediction (CASP)-round V. Proteins 2003;53 Suppl 6:334-339. 46. Sickmeier M, Hamilton JA, LeGall T, Vacic V, Cortese MS, Tantos A, et al. DisProt: the Database of Disordered Proteins. Nucleic Acids Res 2007;35(Database issue):D786-D793. 47. Ahmed A, Villinger S, Gohlke H. Large-scale comparison of protein essential dynamics from molecular dynamics simulations and coarse-grained normal mode analyses. Proteins. 2010;78(16):3341-3352. 48. Maguid S, Fernández-Alberti S, Parisi G, Echave J. Evolutionary conservation of protein backbone flexibility. J Mol Evol. 2006;63(4):448-457. 49. Maguid S, Fernandez-Alberti S, Echave J. Evolutionary conservation of protein vibrational dynamics. Gene. 2008;422(1-2):7-13. 50. Law AB, Fuentes EJ, Lee AL. Conservation of Side-Chain Dynamics Within a Protein Family. J. Am. Chem. Soc. 2009;131(18):6322-6323. 51. Chen J, Romero P, Uversky V, Dunker A. Conservation of intrinsic disorder in protein domains and families: I. A database of conserved predicted disordered regions. J Proteome Res. 2006;5(4):879-887. 52. Dunker A, Garner E, Guilliot S, Romero P, Albrecht K, Hart J, et al. Protein disorder and the evolution of molecular recognition: theory, predictions and observations. Pac Symp Biocomput. 1998;1998:473-484. 53. Schlessinger A, Schaefer C, Vicedo E, Schmidberger M, Punta M, Rost B, et al. Protein disorder-a breakthrough invention of evolution? Curr Opin Struct Biol 2011; 54. Liu J, Tan H, Rost B. Loopy proteins appear conserved in evolution. J Mol Biol. 2002;322(1):53-64. 55. Sonnhammer E, Eddy S, Durbin R.
Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 1997;28(3):405-420. 56. Tatusov R, Koonin E, Lipman D. A genomic perspective on protein families. Science 1997;278(5338):631-637. 57. Ahn K, Aggarwal B. Transcription factor NF-kappaB: a sensor for smoke and stress signals. Ann N Y Acad Sci. 2005;1056:218-233. 58. Rangan G, Wang Y, Harris D. NF-kappaB signalling in chronic kidney disease. Front Biosci 2009;14:3496-3522. 59. Courtois G, Gilmore TD. Mutations in the NF-kappaB signaling pathway: implications for human disease. Oncogene 2006;25(51):6831-6843. 60. Gilmore TD. Introduction to NF-kappaB: players, pathways, perspectives. Oncogene 2006;25(51):6680-6684. 61. Gauthier M, Degnan B. The transcription factor NF-kappaB in the demosponge Amphimedon queenslandica: insights on the evolutionary origin of the Rel homology domain. Dev Genes Evol 2008;218(1):23-32. 62. Gómez J, García-Domingo D, Martínez-A C, Rebollo A. Role of NF-kappaB in the control of apoptotic and proliferative responses in IL-2-responsive T cells. Front Biosci 1997;2:d49-d60. 63. Vallabhapurapu S, Karin M. Regulation and function of NF-kappaB transcription factors in the immune system. Annu Rev Immunol 2009;:693-733. 64. Hayden MS, Ghosh S. Shared principles in NF-kappaB signaling. Cell 2008;132(3):344-362. 65. Mercurio F, Manning AM. Multiple signals converging on NF-kappaB. Curr Opin Cell Biol 1999;11(2):226-232. 66. Siebenlist U, Franzoso G, Brown K. Structure, regulation and function of NF-kappa B. Annu Rev Cell Biol 1994;10:405-455. 67. Baldwin AS. The NF-kappa B and I kappa B proteins: new discoveries and insights. Annu Rev Immunol 1996;14:649-683. 68. Hayden MS, Ghosh S. Signaling to NF-kappaB. Genes Dev 2004;18(18):2195-2224. 69. Karin M. NF-kappaB and cancer: mechanisms and targets. Mol Carcinog 2006;45(6):355-361. 70. Karin M. Nuclear factor-kappaB in cancer development and progression. Nature 2006;441(7092):431-436. 71.
Kim HJ, Hawke N, Baldwin AS. NF-kappaB and IKK as therapeutic targets in cancer. Cell Death Differ 2006;13(5):738-747. 72. Van Waes C. Nuclear factor-kappaB in development, prevention, and therapy of cancer. Clin Cancer Res 2007;13(4):1076-1082. 73. Wu L, Choe K, Lu Y, Anderson K. Drosophila immunity: genes on the third chromosome required for the response to bacterial infection. Genetics 2001;159(1):189-199. 74. Gugasyan R, Grumont R, Grossmann M, Nakamura Y, Pohl T, Nesic D, et al. Rel/NF-kappaB transcription factors: key mediators of B-cell activation. Immunol Rev 2000;176:134-140. 75. Bunting K, Rao S, Hardy K, Woltring D, Denyer G, Wang J, et al. Genome-wide analysis of gene expression in T cells to identify targets of the NF-kappa B transcription factor c-Rel. J Immunol 2007;178(11):7097-7109. 76. Panzer U, Steinmetz O, Turner J, Meyer-Schwesinger C, von R, Meyer T, et al. Resolution of renal inflammation: a new role for NF-kappaB1 (p50) in inflammatory kidney diseases. Am J Physiol Renal Physiol 2009;297(2):F429-F439. 77. Liou H, Hsia C. Distinctions between c-Rel and other NF-kappaB proteins in immunity and disease. Bioessays 2003;25(8):767-780. 78. Jones W, Brown M, Wilhide M, He S, Ren X. NF-κB in cardiovascular disease: Diverse and specific effects of a "general" transcription factor? Cardiovasc Toxicol 2005;5(2):183-201. 79. Papin J, Subramaniam S. Bioinformatics and cellular signaling. Curr Opin Biotechnol 2004;15(1):78-81. 80. Huang S, Wikswo J. Dimensions of systems biology. Rev Physiol Biochem Pharmacol 2006;:81-104. 81. Strange K. The end of "naive reductionism": rise of systems biology or renaissance of physiology? Am J Physiol Cell Physiol 2005;288(5):C968-C974. 82. Rizzetto L, Cavalieri D. A systems biology approach to the mutual interaction between yeast and the immune system. Immunobiology 2010; 83. Cadeiras M, von B, Sinha A, Shahzad K, Lim W, Grenett H, et al.
Drawing networks of rejection - a systems biological approach to the identification of candidate genes in heart transplantation. J Cell Mol Med 2010; 84. Wei L, Fan M, Xu L, Heinrich K, Berry M, Homayouni R, et al. Bioinformatic analysis reveals crel as a regulator of a subset of interferon-stimulated genes. J Interferon Cytokine Res 2008;28(9):541-551. 85. Gyorffy A, Baranyai Z, Cseh A, Munkácsy G, Jakab F, Tulassay Z, et al. Promoter analysis suggests the implication of NFkappaB/C-Rel transcription factors in biliary atresia. Hepatogastroenterology 2008;55(85):1189-1192. 86. Elkon R, Rashi-Elkeles S, Lerenthal Y, Linhart C, Tenne T, Amariglio N, et al. Dissection of a DNA-damage-induced transcriptional network using a combination of microarrays, RNA interference and computational promoter analysis. Genome Biol 2005;6(5):R43. 87. Papatsenko D, Levine M. Quantitative analysis of binding motifs mediating diverse spatial readouts of the Dorsal gradient in the Drosophila embryo. Proc Natl Acad Sci U S A 2005;102(14):4966-4971. 88. Udalova I, Mott R, Field D, Kwiatkowski D. Quantitative prediction of NF-kappa B DNA-protein interactions. Proc Natl Acad Sci U S A 2002;99(12):8167-8172. 89. Linnell J, Mott R, Field S, Kwiatkowski D, Ragoussis J, Udalova I, et al. Quantitative high-throughput analysis of transcription factor binding specificities. Nucleic Acids Res 2004;32(4) 90. Matys V, Fricke E, Geffers R, Gössling E, Haubrock M, Hehl R, et al. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res 2003;31(1):374-378. 91. Ip Y, Levine M. Molecular genetics of Drosophila immunity. Curr Opin Genet Dev 1994;4(5):672-677. 92. Waterhouse RM, Kriventseva EV, Meister S, Xi Z, Alvarez KS, Bartholomay LC, et al. Evolutionary dynamics of immune-related genes and pathways in disease-vector mosquitoes. Science 2007;316(5832):1738-1743. 93. Wu X, Xiong X, Xie L, Zhang R.
Pf-Rel, a Rel/nuclear factor-kappaB homolog identified from the pearl oyster, Pinctada fucata. Acta Biochim Biophys Sin (Shanghai) 2007;39(7):533-539. 94. Huang X, Yin Z, Liao J, Wang P, Yang L, Ai H, et al. Identification and functional study of a shrimp Relish homologue. Fish Shellfish Immunol 2009;27(2):230-238. 95. Huang X, Yin Z, Jia X, Liang J, Ai H, Yang L, et al. Identification and functional study of a shrimp Dorsal homologue. Dev Comp Immunol 2009; 96. Sullivan J, Kalaitzidis D, Gilmore T, Finnerty J. Rel homology domain-containing transcription factors in the cnidarian Nematostella vectensis. Dev Genes Evol 2007;217(1):63-72. 97. Müller CW, Harrison SC. The structure of the NF-kappa B p50:DNA-complex: a starting point for analyzing the Rel family. FEBS Lett 1995;369(1):113-117. 98. Müller C, Rey F, Sodeoka M, Verdine G, Harrison S. Structure of the NF-κB p50 homodimer bound to DNA. Nature 1995;373(6512):311-317. 99. Ghosh G, Van Duyne G, Ghosh S, Sigler P. Structure of NF-κB p50 homodimer bound to a κB site. Nature 1995;373(6512):303-310. 100. Pande V, Sharma R, Inoue J, Otsuka M, Ramos M. A molecular modeling study of inhibitors of nuclear factor kappa-B (p50)--DNA binding. J Comput Aided Mol Des 2003;17(12):825-836. 101. Mura C, McCammon J. Molecular dynamics of a kappaB DNA element: base flipping via cross-strand intercalative stacking in a microsecond-scale simulation. Nucleic Acids Res 2008;36(15):4941-4955. 102. Copley R, Totrov M, Linnell J, Field S, Ragoussis J, Udalova I, et al. Functional conservation of Rel binding sites in drosophilid genomes. Genome Res 2007;17(9):1327-1335. 103. Xue B, Dunker AK, Uversky VN. Retro-MoRFs: identifying protein binding sites by normal and reverse alignment and intrinsic disorder prediction. Int J Mol Sci 2010;11(10):3725-3747. 104. Liu T, Altman RB. Prediction of calcium-binding sites by combining loop-modeling with machine learning. BMC Struct Biol 2009;9:72. 105.
Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 2000;16(5):412-424. 106. Uversky V, Gillespie J, Fink A. Why are "natively unfolded" proteins unstructured under physiologic conditions? Proteins 2000;41(3):415-427. 107. Moult J, Fidelis K, Kryshtafovych A, Rost B, Tramontano A. Critical assessment of methods of protein structure prediction - Round VIII. Proteins 2009;77 Suppl 9:1-4. 108. Huxford T, Malek S, Ghosh G. Structure and mechanism in NF-kappa B/I kappa B signaling. Cold Spring Harb Symp Quant Biol 1999;:533-540. 109. Bottex-Gauthier C, Pollet S, Favier A, Vidal D. [The Rel/NF-kappa-B transcription factors: complex role in cell regulation]. Pathol Biol (Paris) 2002;50(3):204-211. 110. Baeuerle PA, Baltimore D. NF-kappa B: ten years after. Cell 1996;87(1):13-20. 111. Thanos D, Maniatis T. NF-kappa B: a lesson in family values. Cell 1995;80(4):529-532. 112. Benson D, Karsch-Mizrachi I, Lipman D, Ostell J, Sayers E. GenBank. Nucleic Acids Res 2009;37(Database issue):D26-D31. 113. Schneider M, Lane L, Boutet E, Lieberherr D, Tognolli M, Bougueleret L, et al. The UniProtKB/Swiss-Prot knowledgebase and its Plant Proteome Annotation Program. J Proteomics 2009;72(3):567-573. 114. Altschul S, Gish W, Miller W, Myers E, Lipman D. Basic local alignment search tool. J Mol Biol 1990;215(3):403-410. 115. Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000;25(1):25-29. 116. Povey S, Lovering R, Bruford E, Wright M, Lush M, Wain H, et al. The HUGO Gene Nomenclature Committee (HGNC). Hum Genet 2001;109(6):678-680. 117. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, et al. New developments in the InterPro database. Nucleic Acids Res 2007;35(Database issue):D224-D228. 118. Sayers E, Barrett T, Benson D, Bolton E, Bryant S, Canese K, et al.
Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2010;38(Database issue):D5-16. 119. Tong J, Lim S, Muh H, Chew F, Tammi M. Allergen Atlas: a comprehensive knowledge center and analysis resource for allergen information. Bioinformatics 2009; 120. Tay D, Govindarajan K, Khan A, Ong T, Samad H, Soh W, et al. T3SEdb: data warehousing of virulence effectors secreted by the bacterial Type III Secretion System. BMC Bioinformatics 2010;7 121. Edgar RC. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004; 122. EUAsiaGrid – Widening the Uptake of e-Research in the Asia-Pacific Region. eScience, 2008. eScience '08. IEEE Fourth International Conference on; Indianapolis, IN: 2008. 360 p. 123. Hall T. BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symp Ser 1999;41:95-98. 124. Pierce JR. An introduction to information theory. Dover Pubns; 1980. 305 p. 125. The PyMOL Molecular Graphics System, Version 0.99rc6 [Internet]. Available from: http://www.pymol.org 126. Al-Khodor S, Price C, Kalia A, Abu K. Functional diversity of ankyrin repeats in microbial proteins. Trends Microbiol 2009; 127. Voronin D, Kiseleva E. [Functional role of proteins containing ankyrin repeats]. Tsitologiia 2007;49(12):989-999. 128. Li J, Mahajan A, Tsai M. Ankyrin repeat: a unique motif mediating protein-protein interactions. Biochemistry 2006;45(51):15168-15178. 129. Mosavi L, Cammett T, Desrosiers D, Peng Z. The ankyrin repeat as molecular architecture for protein recognition. Protein Sci 2004;13(6):1435-1448. 130. Park HH, Lo YC, Lin SC, Wang L, Yang JK, Wu H, et al. The death domain superfamily in intracellular signaling of apoptosis and inflammation. Annu Rev Immunol 2007;25:561-586. 131. Goh GKM, Dunker AK, Uversky VN. A comparative analysis of viral matrix proteins using disorder predictors. Virol J 2008;5:126.
132. Smith DK, Radivojac P, Obradovic Z, Dunker AK, Zhu G. Improved amino acid flexibility parameters. Protein Sci 2003;12(5):1060-1072. 133. Chirgadze DY, Demydchuk M, Becker M, Moran S, Paoli M. Snapshot of protein structure evolution reveals conservation of functional dimerization through intertwined folding. Structure 2004;12(8):1489-1494. 134. Kalodimos CG, Biris N, Bonvin AMJJ, Levandoski MM, Guennuegues M, Boelens R, et al. Structure and flexibility adaptation in nonspecific and specific protein-DNA complexes. Science 2004;305(5682):386-389. 135. Huang J, Zhao X, Yu H, Ouyang Y, Wang L, Zhang Q, et al. The ankyrin repeat gene family in rice: genome-wide identification, classification and expression profiling. Plant Mol Biol 2009;71(3):207-226. 136. Wu T, Tian Z, Liu J, Yao C, Xie C. A novel ankyrin repeat-rich gene in potato, Star, involved in response to late blight. Biochem Genet 2009;47(5-6):439-450. 137. Zhu Y, Kakinuma N, Wang Y, Kiyama R. Kank proteins: a new family of ankyrin-repeat domain-containing proteins. Biochim Biophys Acta 2008;1780(2):128-133. 138. Hoffmann A, Baltimore D. Circuitry of nuclear factor κB signaling. Immunol Rev 2006;210:171-186.

... reviewed in general[10,11,15,48-50].

2.4.1 Intrinsic Protein Disorder Analysis of NFκB

No intrinsic protein disorder analysis focusing solely on NFκB has been recorded in the literature. Nevertheless, general research efforts using intrinsic protein disorder to identify protein binding sites[101,102] and analyse the functions of chromatin remodeling proteins have been recorded[22]. In the context of cell signaling, ...
... or in the presence of changes in the biochemical environment[19,20]. Intrinsically disordered proteins and protein regions have been reported to engage multiple binding partners and are involved in many biological events and pathways, especially during cell signaling[14,15,22-24].

1.3.1 Role of Intrinsic Protein Disorder in Cell Signaling

In the context of cell signaling, intrinsically disordered proteins ...

... cell signaling, the functional roles of intrinsic protein disorder in cytoplasmic signaling domains[22] and in scaffold proteins, which integrate cell signaling pathways[15], have been reported. The most relevant study of intrinsic protein disorder in transcription factors was conducted by Wells et al., who analyzed p53's intrinsically disordered N-terminal trans-activation domain (TAD) using NMR spectroscopy ...

Table of Contents
1 Introduction
1.1 Protein Dynamics
1.2 Functional Significance of Protein Dynamics
1.2.1 Role of Protein Dynamics in Cell Signaling
1.3 Intrinsic Protein Disorder
1.3.1 Role of Intrinsic Protein Disorder in Cell Signaling
1.3.2 Identification of Intrinsic Protein Disorder
1.3.2.1 ...

... database containing sequences across multiple species annotated with experimentally verified intrinsically disordered regions[46].

1.3.3 Functional Conservation of Intrinsic Protein Disorder

The functional importance of intrinsically disordered proteins and protein regions raises the likelihood that intrinsically disordered protein residues are evolutionarily conserved. This proposal is in line with studies ...

... provide further impetus for intrinsic protein disorder prediction, since 2002, the worldwide Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiments have included a category for protein disorder prediction, using blind benchmark datasets[45]. Intrinsic protein disorder prediction has also been facilitated by the availability of the Database of Protein Disorder (DisProt) since ...
... methods for detecting intrinsic protein disorder are often hampered by the lack of stable protein structures[27]. To overcome this limitation, various computational tools have been developed for the prediction of intrinsically disordered proteins and protein regions from primary protein sequences[27].

1.3.2.1 Computational Tools for Intrinsic Protein Disorder Prediction

Various definitions have been ...

... and protein-protein interactions[2]. These events are in turn vital to a large array of essential biological processes and functions[1,3,6,10-12]. An example is the crucial role of protein dynamics in muscle contraction[6]. Muscle contraction involves the cross-bridge cycle, with the first step involving adenosine triphosphate (ATP) binding to the myosin head. Binding of the myosin head to actin myofilaments, ...

... date, only one protein dynamics study mentioning NFκB proteins is present in the literature. The authors simulated the interaction between C-Rel and a 20-bp DNA sequence and observed a unique and dynamic NFκB recognition site. The study was focused on the dynamics of the DNA, rather than the dynamics of the C-Rel protein during binding[99]. However, the effects of protein dynamics in cell signaling and allosteric ...

... provide large intermolecular interfaces with smaller protein, genome and cell sizes[25]. For example, the recognition of DNA by disordered peptides has been shown to be involved in the regulation of gene expression by transcription, epigenetic modifications and gene silencing[26].

1.3.2 Identification of Intrinsic Protein Disorder

Intrinsically disordered proteins and protein regions can be indirectly observed ...
... Role of Intrinsic Protein Disorder in Cell Signaling

In the context of cell signaling, intrinsically disordered proteins and regions have been associated with many regulatory events. Intrinsic protein ...
