Systematic Assessment of Protein Interaction Data using Graph Topology Approaches

Jin Chen, B.C.Sc. (Hons)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
SCHOOL OF COMPUTING
NATIONAL UNIVERSITY OF SINGAPORE
2006

Copyright by Jin Chen 2006

by Jin Chen, B.Eng.
Dissertation Presented to the Faculty of the School of Computing of the National University of Singapore in Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY
National University of Singapore
October 2006

Approved by Dissertation Committee:

ACKNOWLEDGMENTS

I would like to express my gratitude to all those who made it possible for me to complete this thesis.

I would like to express my deep and sincere gratitude to my supervisor, Associate Professor Wynne Hsu, Ph.D., vice dean of the School of Computing, National University of Singapore. Her wide knowledge and her logical way of thinking have been of great value to me. Her understanding, encouragement and guidance have provided a good basis for the present thesis.

I am deeply grateful to my co-supervisor, Associate Professor Mong Li Lee, Ph.D., assistant dean of the School of Computing, National University of Singapore, for her systematic and constructive instructions, and for her important support throughout this work.

I furthermore thank my co-supervisor, Dr. See-Kiong Ng, Ph.D., department manager, Knowledge Discovery Department, Institute for Infocomm Research, whose help, stimulating suggestions and encouragement supported me throughout the research for and writing of this thesis.

I wish to express my warm and sincere thanks to Professor Limsoon Wong, Ph.D., National University of Singapore, for his constant encouragement and effective comments, which have had a remarkable influence on my entire research in the field of computational biology.
I warmly thank my colleagues, Tiefei Liu, Xin Xu, Zeyar Aung, Hugo Willy and Hon Nian Chua, for their valuable advice, friendly help, and useful hints. Their extensive discussions and interesting explorations related to my work have been very helpful for this study.

I wish to extend my warmest thanks to all those who have helped me with my work. In particular, I would like to give my special thanks to my wife, Juan Lang. It is her patient love that enabled me to complete this work. She was of great help in difficult times. Without her encouragement and understanding, it would have been impossible for me to finish my Ph.D. study.

Jin Chen
National University of Singapore
October 2006

Publication No.
Jin Chen, PhD
National University of Singapore, 2006
Supervisor: Wynne Hsu
Cosupervisors: Mong Li Lee, See-Kiong Ng

Advances in high-throughput protein interaction detection methods enable biologists to experimentally detect protein interactions at the whole genome level for many organisms. However, protein interactions detected by current high-throughput experimental methods such as yeast-two-hybrid are reported to be highly erroneous. At the same time, the false negative rate of the interaction networks has also been estimated to be high.

The purpose of this study was to investigate protein interaction networks from the topological aspect, and to develop a series of effective computational methods to automatically purify these networks, i.e., to identify true protein interactions in the existing protein interaction networks and to discover unknown protein interactions, based on their topological nature.

This thesis introduced three different approaches. First, it presented a novel measure called IRAP, and further IRAP*, to assess the reliability of a protein interaction based on the alternative paths in the protein interaction network.
A candidate protein interaction is likely to be reliable if it is involved in a closed loop, in which the alternative path of interactions between the two interacting proteins is strong. The algorithm AlternativePathFinder was designed to compute the IRAP value for each interaction in a protein interaction network.

Second, the thesis presented a new model to identify true protein interactions with meso-scale (middle size) network motifs in the protein interaction networks. The algorithm NeMoFinder was designed to discover such network motifs efficiently. In the algorithm, frequent trees are discovered first. A tree is a simpler structure than a graph, and the number of distinct trees is much smaller than the number of graphs of the same size. By finding frequent trees, graph G is naturally divided into a set of graphs GD, in which each graph is an embedding of a frequent tree. Then, the notion of graph cousin was introduced to reduce the computational time of motif candidate generation and frequency counting in GD.

Third, the thesis exploited the currently available biological information that is associated with network motif vertices to capture not only the topological shapes, but also the biological contexts in which they occurred in the PPI networks for network motif applications. We present a method called LaMoFinder to label network motifs with Gene Ontology terms in a PPI network. We also show how the resulting labeled network motifs can be used to predict unknown protein functions.

IRAP and network motifs were validated as measures for assessing the reliability of protein interactions from conventional high-throughput experiments. For Saccharomyces cerevisiae, the IRAP/motif models identified 81.5% of the reliable protein interactions when the cutoff threshold was set to 0.5. When the threshold was increased to 0.85, all the reliable protein interactions could be captured either by the IRAP model or by the network motif model.
Experimental results demonstrated that both of the measures are good for assessing the reliability of protein interactions from conventional high-throughput experiments. Furthermore, the performance of IRAP/motif is clearly better than other topology based evaluation methods, such as IG1 and IG2, for identifying true positive and false negative protein interactions. Protein function prediction experiments showed that the labeled network motifs extracted are biologically meaningful and can achieve better performance (both precision and recall) than existing PPI topology based methods for predicting unknown protein functions.

The results suggest that a significant proportion of true protein-protein interactions could be identified by our IRAP/motif models. These two models could facilitate the rapid construction of protein interaction networks that will help scientists in understanding the biology of living systems. The results also suggest that exploring remote but topologically similar proteins with labeled network motifs could enable a more precise functional prediction of unknown proteins.

CONTENTS

Acknowledgments
Abstract
List of Tables
List of Figures
Summary
Chapter 1 Introduction
    1.1 Background
    1.2 Aims
    1.3 Scope
    1.4 Organization
Chapter 2 Literature Review
    2.1 Terminology
        2.1.1 Graph Theoretic Terminology
        2.1.2 Biological Terminology
    2.2 Protein-protein interaction network
        2.2.1 Yeast PPI Network
[...] broken down into simple units that can help researchers discover unknown principles of complex networks. Overcoming the drawbacks of existing algorithms for detecting unique network motifs, the NeMoFinder algorithm could rapidly scale to meso-size network motifs. In NeMoFinder, a new framework was designed with the ability to scale directly to motifs of a given size. In the framework, frequent trees were discovered first, because a tree is a simpler topological structure than a graph and the number of distinct trees is much smaller than the number of graphs of the same size. By finding frequent trees, graph G was naturally divided into a set of graphs GD, in which each graph was an embedding of a frequent tree. Then, three kinds of join operations were introduced to reduce the computational time of motif candidate generation and frequency counting in GD. Experimental results showed that NeMoFinder was able to discover meaningful network motifs from the yeast protein interaction network. Running NeMoFinder on yeast data, we discovered about 100 times more network motifs than existing approaches. Protein interaction evaluation based on meso-scale network motifs is more reliable than evaluation based on small local motifs (cf. IG2). Meso-scale network motifs are about as accurate as IRAP, but have an advantage when the network is sparse (i.e., where few alternative paths are present).

The results suggest that the two approaches, alternative path and network motif, can facilitate the rapid construction of protein interaction networks that help scientists in understanding the biology of living systems and the unknown behaviors of real networks.

Current network motifs are unlabeled (and uninformative). As a result, the currently available biological information that is associated with the vertices (the proteins) cannot be exploited for further knowledge discovery applications.
In this thesis, we have proposed a method called LaMoFinder to annotate network motifs with the biological information associated with the proteins in the PPI network. Our method was specifically devised to handle the large labeling space as well as the sophisticated scheme (GO) in which the proteins were annotated. As a result, we have captured not only the topological shapes of the motifs, but also the biological context in which they occurred in the labeled network motifs. We also demonstrated how the network motifs labeled by LaMoFinder can be used to predict the functions of unknown proteins in the PPI network. Our superior performance against other current prediction methods confirmed that the network motifs have been adequately enriched by LaMoFinder for more sophisticated biological applications such as protein function prediction.

7.2 Recommendations

The IRAP and IRAP* measures are currently based on the “strongest alternative path” model. A candidate interaction that is not accompanied by a strong alternative path of interactions in the overall protein interaction network is considered to be unreliable. While this may not be true for all biologically relevant protein interactions, we have performed an analysis on our yeast-two-hybrid protein interaction datasets and found that more than 80% of the interactions in our experiments have at least one alternative path. With a significant proportion of interactions captured by the current IRAP and IRAP* measures, it is acceptable that the measures cannot evaluate the other 20% of protein interactions.

The other measure, network motif, is based on the frequent and unique subgraphs that are found solely in the current protein interaction network. Protein interactions that are captured by at least one significant network motif are considered to be reliable.
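The alternative-path coverage quoted earlier in this section (more than 80% of interactions having at least one alternative path) can be illustrated with a small sketch. The Python code below is not the thesis's AlternativePathFinder algorithm and does not compute IRAP values; it is a minimal illustration, assuming an undirected and unweighted edge list, of how one might count the interactions that have any alternative path at all. The function names and the toy network are hypothetical.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (protein, protein) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def has_alternative_path(adj, u, v):
    """Return True if v can be reached from u without using the direct edge (u, v)."""
    seen = {u}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if x == u and y == v:
                continue  # skip the candidate interaction itself
            if y == v:
                return True
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False


def alternative_path_coverage(edges):
    """Fraction of interactions that have at least one alternative path."""
    if not edges:
        return 0.0
    adj = build_adjacency(edges)
    covered = sum(has_alternative_path(adj, a, b) for a, b in edges)
    return covered / len(edges)


# Hypothetical toy network: a triangle A-B-C with a pendant protein D.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
coverage = alternative_path_coverage(toy_edges)
```

In the toy network the three triangle edges each have an alternative path through the third protein, while the edge C-D does not, so such an interaction could not be scored by a purely alternative-path-based measure.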
As this work focuses on the topologically significant interactions, which are thought to be the most biologically important, the protein interactions that are not involved in any network motif are lost. The labeled network motifs cover even fewer protein interactions. The number of lost interactions varies with the frequency and uniqueness thresholds given by users. Generally, for S. cerevisiae, about 96% of protein interactions are involved in at least one network motif; for E. coli, about 80% of protein interactions are involved in at least one network motif.

Therefore, while the two approaches each capture a large proportion of protein interactions in distinct ways, there is still a certain proportion of protein interactions that cannot be evaluated by the current IRAP/motif model. The next step is to develop further network models to capture protein interactions associated with more sophisticated topological characteristics than alternative paths and network motifs. New models could be developed in the following ways.

7.2.1 Combine IRAP/motif model with other existing models

Besides the IRAP/motif model, there are some existing protein interaction evaluation methods based on the protein interaction network topology. For example, Bader et al. [BCC04] developed a quantitative method which treated pairs of proteins close together in multi-networks as positive examples, and proteins connected in one network and far apart in the second network as negative examples. By combining the existing protein interaction evaluation methods with the IRAP/motif model, detection of more protein interactions in the protein interaction data may be possible.

7.2.2 Disconnected Network Motifs

In our network motif model, we focused on finding the simplest topological units that are connected. A network motif is connected if there is a path between every pair of vertices in the motif.
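The connectedness definition above can be checked mechanically. The following Python sketch is only an illustration of that definition, not part of NeMoFinder; the function name and the example motif are hypothetical.

```python
def is_connected(vertices, edges):
    """Return True if every pair of vertices is joined by some path."""
    vertices = set(vertices)
    if not vertices:
        return True
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    # Depth-first traversal from an arbitrary start vertex.
    start = next(iter(vertices))
    seen = {start}
    stack = [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    # Connected exactly when the traversal reached every vertex.
    return seen == vertices


# Hypothetical 4-vertex chain motif A-B-C-D.
motif_vertices = ["A", "B", "C", "D"]
motif_edges = [("A", "B"), ("B", "C"), ("C", "D")]
chain_is_connected = is_connected(motif_vertices, motif_edges)

# The same motif with the B-C interaction absent from the data.
broken_is_connected = is_connected(motif_vertices, [("A", "B"), ("C", "D")])
```

Here the chain is connected, while dropping the single edge B-C splits it into two pieces, so a model restricted to connected motifs would no longer see it as one unit.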
However, the current protein interaction network not only contains many false positives but also has a high ratio of false negatives. The severity of the false negative problem is shown by the fact that combining independent datasets results in a low overlap rate [HF01, MKS+02]. With interactions missing, an interesting network motif could be split apart. Consequently, a disconnected network motif will be overlooked by our network motif model, since it focuses only on connected motifs. Therefore, it would be interesting to develop an algorithm to discover disconnected network motifs with gaps (missing nodes or missing edges). The disconnected motifs could be generated by gluing together smaller connected motifs that often occur close to each other in the protein interaction network, or could be discovered directly in a similar way to finding discrete subgraphs or subtrees in a complex network.

7.2.3 Incorporate with protein functional interaction networks

A link in the protein functional interaction network indicates that the two connected proteins have the same function. Naturally, the functional network is much larger than the physical protein interaction network that we focused on. An alternative path that does not appear in the physical interaction network but appears in the functional network may indicate two possibilities. First, the two target proteins are strongly correlated in functional annotations but are not physically connected with each other. Second, there exists a physical alternative path, but the path does not appear in the current physical protein interaction network due to the high error rate. Therefore, an interacting pair with only a functional alternative path could be assigned a weight based on the number of missing edges in the path. With this approach, we hope to detect more protein interactions in the physical interaction data.
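The idea of weighting a functional alternative path by its number of missing edges can be sketched as follows. The thesis does not specify a weighting formula, so the reciprocal form 1 / (1 + missing) used here is purely an assumption for illustration, as are the function names and the toy data.

```python
def count_missing_edges(path, physical_edges):
    """Count the steps of a path that are absent from the physical network."""
    physical = {frozenset(edge) for edge in physical_edges}
    steps = zip(path, path[1:])
    return sum(frozenset(step) not in physical for step in steps)


def functional_path_weight(path, physical_edges):
    """Hypothetical weight: 1.0 for a fully physical path, lower per missing edge."""
    missing = count_missing_edges(path, physical_edges)
    return 1.0 / (1 + missing)


# Toy data: the path A-B-C-D exists in the functional network,
# but the step B-C is not observed in the physical network.
physical_edges = [("A", "B"), ("C", "D")]
functional_path = ["A", "B", "C", "D"]

missing = count_missing_edges(functional_path, physical_edges)
weight = functional_path_weight(functional_path, physical_edges)
```

Any monotonically decreasing function of the missing-edge count would serve the same purpose; the reciprocal is chosen here only because it keeps the weight in the range (0, 1].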
It is also reasonable to assume that there are no false negatives in the functional network, which means the physical protein interaction data is a subset of the functional protein interaction data. Hence, in the disconnected network motif discovery approach introduced in section 7.2.2, a gap in the physical network should have a corresponding edge in the functional network. The disconnected motif discovery approach could therefore be more effective, since the search space is dramatically reduced.

7.3 End note

There is no better way to end the thesis than by relating some “history” [CCH+06]. Professor Limsoon Wong first learned, at GIW 2002, of the possibility of ranking the reliability of protein-protein interactions reported in high-throughput Y2H assays from Dr. Rintaro Saito, who was showing a poster of his work on IG1 [SSH02b] and IG2 [SSH02a]. Professor Limsoon Wong was so impressed with the poster that, upon returning to Singapore, he told his colleagues Dr. See-Kiong Ng and Mr. Soon-Heng Tan about it. Dr. See-Kiong Ng subsequently followed up on the idea with his collaborators A/P Wynne Hsu, Dr. Mong Li Lee and me, and developed improvements including IRAP [CHLN04, CHLN05b, CHLN05a], IRAP* [CHLN06c], and NeMoFinder [CHLN06b, CHLN07]. Mr. Soon-Heng Tan did not follow up on the idea, though he was inspired to work on the identification of protein-protein binding motifs [LLTN04]. Professor Limsoon Wong followed up on the paper, and co-authored with Haiquan Li and Jinyan Li a paper on binding motifs [LLW06]. He also co-authored with Dr. Wing-Kin Sung and Mr. Hon Nian Chua a paper on using indirect neighbours to infer protein function [CSW06]. As we can see, the discussion between Professor Limsoon Wong and Dr. Rintaro Saito at GIW 2002 has led to a fruitful chain of results.

BIBLIOGRAPHY

[AA04] I. Albert and R. Albert. Conserved network motifs allow protein-protein interaction prediction. Bioinformatics, 20(18):3346–3352, 2004.
[ABC+04] Patrick Aloy, Bettina Bottcher, Hugo Ceulemans, et al. Structure-based assembly of protein complexes in yeast. Science, 303:2026–2029, 2004.
[Alo03] U. Alon. Biological networks: the tinkerer as an engineer. Science, pages 1866–1867, 2003.
[BCC04] J.S. Bader, A. Chaudhuri, and J. Chant. Gaining confidence in protein interaction networks. Nature Biotechnology, 22(1):78–85, 2004.
[BCM+03] C. Brun, F. Chevenet, D. Martin, J. Wojcik, A. Guenoche, and B. Jacq. Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biol., 5(1):R6, 2003.
[BDH03] G.D. Bader, D. Betel, and C.W. Hogue. BIND: the biomolecular interaction network database. Nucleic Acids Research, 31(1):248–250, 2003.
[BJR+02] A. L. Barabasi, H. Jeong, R. Ravasz, Z. Neda, T. Vicsek, and A. Schubert. Evolution of the social network of scientific collaborations. Physica A, 311:590, 2002.
[BR99] Albert-Laszlo Barabasi and Reka Albert. Emergence of scaling in random networks. Science, 286, 1999.
[CCH+06] Jin Chen, Hon Nian Chua, Wynne Hsu, Mong Li Lee, See-Kiong Ng, Rintaro Saito, Wing-Kin Sung, and Limsoon Wong. Increasing confidence of protein-protein interactomes. GIW, 2006.
[CHLN04] Jin Chen, Wynne Hsu, Mong Li Lee, and See-Kiong Ng. Systematic assessment of high-throughput experimental data for reliable protein interactions using network topology. ICTAI, pages 368–372, 2004.
[CHLN05a] Jin Chen, Wynne Hsu, Mong Li Lee, and See-Kiong Ng. Assessing reliability of protein interaction data from high-throughput experiments with belief inference (poster). APBC, 2005.
[CHLN05b] Jin Chen, Wynne Hsu, Mong Li Lee, and See-Kiong Ng. Towards discovering reliable protein interactions from high-throughput experimental data using network topology. Artificial Intelligence in Medicine, 35(1-2):37–47, 2005.
[CHLN06a] Jin Chen, Wynne Hsu, Mong Li Lee, and See-Kiong Ng. Discovering and exploiting meso-scale network motifs in protein interactomes.
Technical Report TRC6/06, National University of Singapore, 2006.
[CHLN06b] Jin Chen, Wynne Hsu, Mong Li Lee, and See-Kiong Ng. Dissecting genome-wide protein-protein interactions with meso-scale network motifs. SIGKDD, 2006.
[CHLN06c] Jin Chen, Wynne Hsu, Mong Li Lee, and See-Kiong Ng. Increasing confidence of protein interactomes using network topological metrics. Bioinformatics, 22(16):1998–2004, 2006.
[CHLN07] Jin Chen, Wynne Hsu, Mong Li Lee, and See-Kiong Ng. Labeling network motifs in protein interactomes for protein function prediction. ICDE, 2007.
[CSW06] Hon Nian Chua, Wing-Kin Sung, and Limsoon Wong. Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics, 22:1623–1630, 2006.
[DBTM+01] A. Davy, P. Bello, N. Thierry-Mieg, et al. A protein-protein interaction map of the Caenorhabditis elegans 26S proteasome. EMBO Rep, 2(9):821–828, 2001.
[Dij59] E.W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269–271, 1959.
[DMSC02] Minghua Deng, Shipra Mehta, Fengzhu Sun, and Ting Chen. Inferring domain-domain interactions from protein-protein interactions. Genome Research, 12:1540–1548, 2002.
[DSC03] M. Deng, F. Sun, and T. Chen. Assessment of the reliability of protein-protein interactions and protein function prediction. PSB, 2003.
[DSXE02] C. M. Deane, L. Salwinski, I. Xenarios, and D. Eisenberg. Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol. Cell Proteomics, 1:349–356, 2002.
[DZM+03] M. Deng, K. Zhang, S. Mehta, T. Chen, and F. Z. Sun. Prediction of protein function using protein-protein interaction data. J. Comp. Biol., 10(6):947–960, 2003.
[EIKO99] A. J. Enright, I. Iliopoulos, N. C. Kyrpides, and C. A. Ouzounis. Protein interaction maps for complete genomes based on gene fusion events. Nature, 402:86–90, 1999.
[ESBB98] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein.
Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95:14863–14868, 1998.
[FFF99] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. SIGCOMM, pages 251–262, 1999.
[For96] S. Fortin. The graph isomorphism problem. Technical Report TR9620, Department of Computing Science, University of Alberta, 1996.
[FS89] S. Fields and O. Song. A novel genetic system to detect protein-protein interactions. Nature, 340:245–246, 1989.
[G+02] A.C. Gavin et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415:141–147, 2002.
[GBBK02] N. Guelzim, S. Bottani, P. Bourgine, and F. Kepes. Topological and causal structure of the yeast transcriptional regulatory network. Nature Genetics, 31:60–63, 2002.
[GDC03] G.D. Bader, D. Betel, and C.W. Hogue. BIND: the biomolecular interaction network database. Nucleic Acids Res, 31(1):248–250, 2003.
[GO206] The Gene Ontology (GO) project in 2006. Nucleic Acids Res, 34(Database issue):322–326, 2006.
[GR03] Debra S. Goldberg and Frederick P. Roth. Assessing experimentally derived interactions in a small world. PNAS, 100(8):4372–4376, 2003.
[Gri01] A. Grigoriev. A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae. Nucleic Acids Res, 29(17):3513–3519, 2001.
[HF01] T. R. Hazbun and S. Fields. Networking proteins in yeast. Proc Natl Acad Sci U S A, 98(8):4277–4278, 2001.
[HNO+01] H. Hishigaki, K. Nakai, T. Ono, A. Tanigami, and T. Takagi. Assessment of prediction accuracy of protein function from protein-protein interaction data. Yeast, 18(6):523–531, 2001.
[HS00] E. Hartuv and R. Shamir. A clustering algorithm based on graph connectivity. Information Processing Letters, 76(4-6):175–181, 2000.
[HWP03] J. Huan, W. Wang, and J. Prins. Efficient mining of frequent subgraph in the presence of isomorphism.
ICDM, pages 549–552, 2003.
[HWPY04] J. Huan, W. Wang, J. Prins, and J. Yang. SPIN: Mining maximal frequent subgraphs from graph databases. SIGKDD, 2004.
[ICO+01] T. Ito, T. Chiba, R. Ozawa, et al. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A, 98(8):4569–4574, 2001.
[IMK05] S. Itzkovitz, R. Milo, and N. Kashtan. Coarse-graining and self-dissimilarity of complex networks. Phys. Rev. E, 71(016127), 2005.
[ITM+00] T. Ito, K. Tashiro, S. Muta, et al. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc Natl Acad Sci U S A, 97(3):1143–1147, 2000.
[IWM00] A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph. PKDD, pages 13–23, 2000.
[JMBO01] H. Jeong, S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. Lethality and centrality in protein networks. Nature, 411:41–42, 2001.
[JYG+03] R. Jansen, H. Yu, D. Greenbaum, et al. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, pages 449–453, 2003.
[KGKN02] M. Kanehisa, S. Goto, S. Kawashima, and A. Nakaya. The KEGG databases at GenomeNet. Nucleic Acids Res, 30(1):42–46, 2002.
[KI05] Ryan Kelley and Trey Ideker. Systematic interpretation of genetic interactions using protein networks. Nature Biotechnology, 23:561–566, 2005.
[KIMA04] N. Kashtan, S. Itzkovitz, R. Milo, and U. Alon. Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics, 20(11):1746–1758, 2004.
[KK04a] M. Kuramochi and G. Karypis. An efficient algorithm for discovering frequent subgraphs. TKDE, 2004.
[KK04b] M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph. In SIAM International Conference on Data Mining, 2004.
[LLTN04] Haiquan Li, Jinyan Li, Soon-Heng Tan, and See-Kiong Ng.
Discovery of binding motif pairs from protein complex structural data and protein interaction sequence data. PSB, pages 312–332, 2004.
[LLW06] Haiquan Li, Jinyan Li, and Limsoon Wong. Discovering motif pairs at interaction sites from sequences on a proteome-wide scale. Bioinformatics, 22(8):989–996, 2006.
[LRR+02] T.I. Lee, N.J. Rinaldi, F. Robert, D.T. Odom, Z. Bar-Joseph, G.K. Gerber, N.M. Hannett, C.R. Harbison, C.M. Thompson, I. Simon, J. Zeitlinger, E.G. Jennings, H.L. Murray, D.B. Gordon, B. Ren, J.J. Wyrick, J. Tagne, T.L. Volkert, E. Fraenkel, D.K. Gifford, and R.A. Young. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298:799–804, 2002.
[LSBG02] P. W. Lord, R. D. Stevens, A. Brass, and C. A. Goble. Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics, 2002.
[LSBG03] P. W. Lord, R. D. Stevens, A. Brass, and C. A. Goble. Semantic similarity measures as tools for exploring the gene ontology. In Proceedings of the Pacific Symposium on Biocomputing, pages 601–612, 2003.
[LWG01] P. Legrain, J. Wojcik, and J. M. Gauthier. Protein-protein interaction maps: a lead towards cellular functions. Trends Genet, 17:346–352, 2001.
[Man90] J. Manning. Geometric symmetry in graphs. Ph.D. thesis, Purdue University, 1990.
[MFCG03] L. Mirabeau, I. Feldman, M. Cokol, and E. Goodnoe. Modeling innovation with fitness landscapes: The star network motif. NECSI, 2003.
[MFG+02] H. W. Mewes, D. Frishman, U. Guldener, et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res, 30(1):31–34, 2002.
[MHMF00] S. McCraith, T. Holtzman, B. Moss, and S. Fields. Genome-wide analysis of vaccinia virus protein-protein interactions. Proc Natl Acad Sci U S A, 97(9):4879–4884, 2000.
[MIK04] R. Milo, S. Itzkovitz, and N. Kashtan. Superfamilies of designed and evolved networks. Science, 303(5663):1538–1542, 2004.
[MKS+02] C.V. Mering, R. Krause, B. Snel, M. Cornell, S.G. Oliver, S.
Fields, and P. Bork. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417:399–403, 2002.
[MP+99] E. Marcotte, M. Pellegrini, et al. Detecting protein function and protein-protein interactions from genome sequences. Science, 285:751–753, 1999.
[MS02] S. Maslov and K. Sneppen. Specificity and stability in topology of protein networks. Science, 296(5569):910–913, 2002.
[MSOI+02] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: Simple building blocks of complex networks. Science, 298:824–827, 2002.
[MVR+01] Lisa R. Matthews, Philippe Vaglio, Jerome Reboul, et al. Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or “interologs”. Genome Research, 11(12):2120–2126, 2001.
[Oli00] S. Oliver. Guilt-by-association goes global. Nature, 403:601–603, 2000.
[P+03] S. Peri et al. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Research, 13:2363–2371, 2003.
[PB01] J. Park and D. Bolser. Conservation of protein interaction network in evolution. Genome Inform Ser Workshop Genome Inform, 12:135–140, 2001.
[PMT+99] M. Pellegrini, E.M. Marcotte, M.J. Thompson, et al. Assigning protein functions by comparative analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA, 96:4285–4288, 1999.
[PWJ04] N. Przulj, D.A. Wigle, and I. Jurisica. Functional topology in a network of protein interactions. Bioinformatics, 20(3):340–348, 2004.
[PZ05] Pengjun Pei and Aidong Zhang. A topological measurement for weighted protein interaction network. In CSB, pages 268–278, 2005.
[RSDR+01] J. C. Rain, L. Selig, H. De Reuse, et al. The protein-protein interaction map of Helicobacter pylori. Nature, 409(6817):211–215, 2001.
[SCN+04] Li S, Armstrong CM, Bertin N, Ge H, et al. A map of the interactome network of the metazoan C. elegans. Science, 303(5657):540–543, 2004.
[SG01] I.G. Serebriiskii and E.A. Golemis. Two-hybrid system and false positives: approaches to detection and elimination. Methods Mol. Biol., 177:123–134, 2001.
[SM03] V. Spirin and L.A. Mirny. Protein complexes and functional modules in molecular networks. PNAS, 100(21):12123–12128, 2003.
[SOMMA02] S.S. Shen-Orr, R. Milo, S. Mangan, and U. Alon. Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genetics, 31:64–68, 2002.
[SS04] F. Schreiber and H. Schwobbermeyer. Towards motif detection in networks: Frequency concepts and flexible search. NETTAB’04, pages 91–102, 2004.
[SSH02a] R. Saito, H. Suzuki, and Y. Hayashizaki. Construction of reliable protein-protein interaction networks with a new interaction generality measure. Bioinformatics, 19:756–763, 2002.
[SSH02b] R. Saito, H. Suzuki, and Y. Hayashizaki. Interaction generality, a measurement to assess the reliability of a protein-protein interaction. Nucleic Acids Res, 30:1163–1168, 2002.
[SSM03] E. Sprinzak, S. Sattath, and H. Margalit. How reliable are experimental protein-protein interaction data? J Mol Biol, 327(5):919–923, 2003.
[SUF03] B. Schwikowski, P. Uetz, and S. Fields. A network of protein-protein interactions in yeast. Nature Biotechnol, 18:623–627, 2003.
[TO00] Sophia Tsoka and Christos A. Ouzounis. Prediction of protein interactions: metabolic enzymes are frequently involved in gene fusion. Nat Genet., 26:141–142, 2000.
[UGC+00] P. Uetz, L. Giot, G. Cagney, et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403(6770):623–627, 2000.
[Wag01] A. Wagner. The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Molecular Biology and Evolution, 18:1283–1292, 2001.
[Wag03] Andreas Wagner. Does selection mold molecular networks? Sci. STKE, page 41, 2003.
[Wat03] Duncan J. Watts. Small Worlds: The Dynamics of Networks between Order and Randomness.
Princeton, NJ: Princeton University Press, 2003.
[WBV00] A. Walhout, S. Boulton, and M. Vidal. Yeast two-hybrid systems and protein interaction mapping projects for yeast and worm. Yeast, 17:88–94, 2000.
[WOB03] S. Wuchty, Z.N. Oltvai, and A.L. Barabasi. Nature Genetics, 25:176–179, 2003.
[WR06] S. Wernicke and F. Rasche. FANMOD: a tool for fast network motif detection. Bioinformatics, 22(9):1152–1153, 2006.
[WS98] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393:440–442, 1998.
[WS01] Jerome Wojcik and Vincent Schachter. Protein-protein interaction map inference using interacting domain profile pairs. Bioinformatics, 17 Suppl 1:S296–305, 2001.
[WSL+00] A. Walhout, R. Sordella, X. Lu, et al. Protein interaction mapping in C. elegans using proteins involved in vulval development. Science, 287:116–122, 2000.
[Wuc01] S. Wuchty. Scale-free behavior in protein domain networks. Molecular Biology and Evolution, 18(9):1694–1702, 2001.
[XRS+00] I. Xenarios, D. Rice, L. Salwinski, M. Baron, E. Marcotte, and D. Eisenberg. DIP: The database of interacting proteins. Nucleic Acids Research, 28:289–291, 2000.
[XSD+02] I. Xenarios, L. Salwinski, X.J. Duan, et al. DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Research, 30(1):303–305, 2002.
[YH02] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. ICDM, 2002.
[YLSea04] E. Yeger-Lotem, S. Sattath, N. Kashtan, et al. Network motifs in integrated cellular networks of transcription-regulation and protein-protein interaction. PNAS, 101(16):5934–5939, 2004.
[Z+01] H. Zhu et al. Global analysis of protein activities using proteome chips. Science, 293:2101–2105, 2001.
[ZKW02] X. Zhou, M. C. Kao, and W. H. Wong. Transitive functional annotation by shortest-path analysis of gene expression data. Proc. Natl. Acad. Sci. U S A, 99(20):12783–88, 2002.

[...]...
Occurrences of t4 1 in G
5.4 Occurrences of t4 2 in G
5.5 Set of graphs GD4; each graph in GD4 embeds t4 1 and/or t4 2
5.6 Generate 3-edge subgraphs from size-4 trees
5.7 Examples of graph join operations for 3-edge subgraphs
5.8 Generate 4-edge subgraphs from repeated 4-edge subgraphs of G
5.9 Examples of graph...

represent the set of vertices of a graph G, and E(G) to represent the set of edges of a graph G. A graph is undirected if its edges are undirected; otherwise it is directed. Vertices joined by an edge are said to be adjacent. A neighbor of a vertex v is a vertex adjacent to v. We denote by N(v) the set of neighbors of vertex v (called the neighborhood of v). The degree of a vertex is the number of edges incident...

documents these experimentally determined protein-protein interactions. They also present protein interactions from the molecular level to the pathway level for various organisms. The abundance of protein interaction data allows us to analyze organisms at the genome level. Recent studies on the reliability of high-throughput detection of protein interactions using Y2H have revealed high error rates [EIKO99,...

but one of the versions is lost

2.2 Protein-protein interaction network

Proteins are the molecules that actually participate in life's many biological processes. They are often described as the "workers" in living cells. Like social animals, proteins interact with each other frequently. A protein's functions are usually carried out through its interactions with other proteins and genes. The interactions...

However, understanding the structure of these intracellular networks is a complex task, which is complicated by the presence of and interactions between networks of different kinds of elements. To simplify the problem, this thesis focuses only on protein-protein interaction (PPI) networks, interpreting the activity of proteins and how they interact from a graph-topological perspective. It...
introduced three distinct approaches. First, it presented a novel measure called IRAP, and further IRAP*, for assessing the reliability of protein interactions based on the alternative paths in the protein interaction network. Second, the thesis presented a new model to identify true protein interactions with large network motifs in the protein interaction networks. A scalable algorithm, NeMoFinder, was designed...

is the number of edges in the path. The shortest path length between vertices u and v is commonly denoted by d(u, v). The diameter of a graph is the maximum of d(u, v) over all vertices u and v; if a graph is disconnected, we assume that its diameter is equal to the maximum of the diameters of its connected components. A subgraph of G is a graph whose vertices and edges all belong to G. A subgraph with k...

the existing protein-protein interaction networks of many organisms. It could help biologists identify true protein interactions and predict unknown protein functions. It may also guide researchers to discover unknown protein links or narrow down the list of candidates before biological experiments. The tools presented in the study could be used to generate highly reliable protein interaction networks,...

complex, interaction network. PPI networks are commonly represented in a graph format, with vertices corresponding to proteins and edges corresponding to protein-protein interactions. An example of a PPI network constructed in this way is presented in Figure 2.1 [PWJ04]. The network consists of many small subnets (groups of proteins that interact with each other but do not interact with any other protein)...
subnet comprising more than half of all interacting proteins. The volume of experimental data on protein-protein interactions is rapidly increasing thanks to high-throughput techniques, which can produce large batches of PPIs. For example, yeast contains over 5000 proteins, and currently about 18000 PPIs have been identified between the yeast proteins, with hundreds of labs around the world adding ...
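The graph notions quoted in the excerpts above (the neighborhood N(v), vertex degree, the shortest path length d(u, v), connected subnets, and the diameter of a possibly disconnected graph) can be illustrated with a small sketch. This is not code from the thesis. The function names, the adjacency-set representation, and the toy protein labels "A" to "D", "X", "Y" are all assumptions chosen for illustration, and the graph is assumed to be simple and undirected.

```python
from collections import deque


def build_graph(edges):
    """Return an undirected graph as {vertex: set of adjacent vertices}."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def neighbors(graph, v):
    """N(v): the set of vertices adjacent to v."""
    return graph[v]


def degree(graph, v):
    """Number of edges incident to v (simple graph, no self-loops assumed)."""
    return len(graph[v])


def distances_from(graph, source):
    """Breadth-first search: d(source, w) for every w reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def connected_components(graph):
    """Split the vertex set into connected components ("subnets")."""
    seen = set()
    components = []
    for v in graph:
        if v not in seen:
            component = set(distances_from(graph, v))
            seen |= component
            components.append(component)
    return components


def diameter(graph):
    """Largest d(u, v) over connected pairs.

    A BFS never leaves its own component, so for a disconnected graph this
    equals the maximum of the component diameters, matching the convention
    stated in the text.
    """
    return max((max(distances_from(graph, v).values()) for v in graph), default=0)


# Hypothetical toy PPI network: one chain of four proteins and one isolated pair.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("X", "Y")]
ppi = build_graph(edges)
```

Running one BFS per vertex costs O(|V|(|V| + |E|)) time, which is fine for a toy network; it is only meant to make the definitions concrete, not to suggest how the thesis computes these quantities.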