Part 2 of the book “Systems and Computational Biology” covers gene regulation, networking and signaling in and between genomes, and omics-based molecular and cellular experimental systems, with examples and applications.
Part: Gene Regulation, Networking and Signaling in and Between Genomes

Prediction and Analysis of Gene Regulatory Networks in Prokaryotic Genomes

Richard Münch, Johannes Klein and Dieter Jahn
Institute of Microbiology, Technische Universität Braunschweig, Braunschweig, Germany

1 Introduction

The availability of over 1500 completely sequenced and annotated prokaryotic genomes offers a variety of comparative and predictive approaches on a genome scale. The results of such analyses strongly rely on the quality of the employed data and on the computational strategy used for their interpretation. Today, comparative genomics allows for the quick and accurate assignment of genes and often of their corresponding functions. The resulting list of classified genes provides information about the overall genomic arrangement, metabolic capabilities, and general and unique cellular functions, but almost nothing about the underlying complex regulatory networks. Transcriptional regulation of gene expression is a central part of these networks in all organisms. It determines the actual RNA, protein and, as a consequence, metabolite composition of a cell. Moreover, it allows cells to adapt these parameters in response to changing environmental conditions. An integral part of transcriptional regulation is the specific interaction of transcription factors (TFs) with their corresponding DNA targets, the transcription factor binding sites (TFBSs) or motifs. Recent advances in extensive data mining using various high-throughput techniques have provided first insights into the complex regulatory networks and their interconnections. However, the computational prediction of regulatory interactions in the promoter regions of identified genes remains difficult. Consequently, there is a high demand for the in silico identification and analysis of the regulatory DNA sequences involved and for the development of software tools for the accurate prediction of TFBSs. In this chapter we focus on methods for the prediction of TFBSs in
whole prokaryotic genomes (regulons). Although many studies were successfully performed in eukaryotes, their methods are often not transferable to the special features of bacterial gene regulation. In particular, the prokaryotic genome organization, with clusters of co-transcribed polycistronic genes, the lack of introns and the shortness of promoter sequences, necessitates adapted computational approaches. Besides the genomic structure there are also differences in the regulatory control logic. Prokaryotic promoters often possess one or a few regulatory interactions, while the repertoire of regulators consists of only a couple of global TFs but many local TFs (Price et al., 2008). Eukaryotic promoters and enhancers, on the other hand, involve the concerted binding of multiple regulators in so-called cis-regulatory modules (CRMs) or composite elements (Loo & Marynen, 2009). Many excellent reviews in the field of prokaryotic gene regulation were recently published with a focus on the broad spectrum of approaches for the experimental and theoretical reconstruction of gene regulatory networks and their interspecies transfer (Baumbach, 2010; Rodionov, 2007; van Hijum et al., 2009; Zhou & Yang, 2006). Here, we focus on practical aspects of how to detect new members of a regulon for genes or genomes of interest. We will summarize useful bioinformatics databases, methods and algorithms available for unraveling bacterial gene regulatory networks from whole genome sequences. Finally, we want to indicate the limitations and technical problems of such approaches and give a survey of recent improvements in this field.

Systems and Computational Biology – Molecular and Cellular Experimental Systems

2 Strategies for the prediction of transcription factor binding sites

Basically, there are at least two general approaches to recognizing regulatory sequence patterns. One challenging approach, called pattern discovery, relies on a statistical overrepresentation of DNA sequence motifs present
in promoters of structurally and functionally related or co-regulated genes. In that case it is a de novo prediction where the binding site and the corresponding regulator are unknown. The list of investigated genes can be derived from clusters of co-expressed genes available from microarray experiments, from ChIP-on-chip experiments or from orthologous genes of related organisms. In the latter case the method is called phylogenetic footprinting (McCue et al., 2001). Pattern discovery algorithms are top-down approaches that use various learning principles with different degrees of performance (Sandve et al., 2007; Su et al., 2010; Tompa et al., 2005). The advantage of this method is the detection of potential regulatory DNA sequences even if little is known about the corresponding regulation. A recent study in prokaryotes applying a pattern discovery approach revealed that the predicted patterns matched up to 81% of known individual TFBSs (Zhang et al., 2009). However, this approach has limited value for identifying which specific regulator is involved in a predicted TFBS.

An alternative approach, on which we focus in this chapter, is called pattern matching. It makes use of prior knowledge in the form of a predetermined pattern that can be assigned to a specific regulator. The pattern is usually built from a profile of known TFBSs for which experimental evidence is available (Fig. 1A). Using this set of DNA sequences, a probabilistic model describing the pattern degeneracy is constructed. Application of the model to a given sequence results in a score for the likelihood that the investigated sequence belongs to the same sequence family. The application of pattern matching requires the availability of a reliable training set of TFBSs. For that purpose, several specialized databases provide collections and patterns of prokaryotic TFBSs supplemented with various related information such as promoter and operon structures. A limited list of important data sources is shown in Table 1.
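The pattern matching workflow outlined above (training set, probabilistic model, score, cut-off) can be illustrated with a minimal Python sketch. It assumes the information-based weighting that this chapter uses for position weight matrices: each base frequency is multiplied by the information content of its column. The five-base training sites, the function names and the cut-off of 6.0 are hypothetical and chosen only for illustration; they are not taken from the Anr data set.

```python
import math

ALPHABET = "ACGT"

def frequency_matrix(sites):
    """Relative base frequencies per position for equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for pos in range(length):
        column = [s[pos] for s in sites]
        matrix.append({b: column.count(b) / len(sites) for b in ALPHABET})
    return matrix

def information_pwm(freqs):
    """Weight each frequency by the column information R = 2 + sum(f * log2 f)."""
    pwm = []
    for col in freqs:
        # Terms with f == 0 are skipped (convention: 0 * log2(0) = 0).
        r = 2.0 + sum(f * math.log2(f) for f in col.values() if f > 0)
        pwm.append({b: f * r for b, f in col.items()})
    return pwm

def score(pwm, candidate):
    """Sum the weights of the bases observed in the candidate sequence."""
    return sum(col.get(base, 0.0) for col, base in zip(pwm, candidate))

def scan(pwm, sequence, cutoff):
    """Slide the PWM along one strand and keep windows scoring >= cutoff."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        s = score(pwm, window)
        if s >= cutoff:
            hits.append((start, window, s))
    return hits

# Hypothetical toy training set (not real binding sites).
SITES = ["TTGAT", "TTGAC", "TTGAT", "ATGAT"]
PWM = information_pwm(frequency_matrix(SITES))
HITS = scan(PWM, "CCTTGATCC", cutoff=6.0)

if __name__ == "__main__":
    for start, window, s in HITS:
        print(start, window, round(s, 3))
```

The sketch scans only the given strand; a genome-wide search would normally also scan the reverse complement and would use a cut-off chosen from the score distributions rather than a fixed guess.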
In the following examples a data set of 40 experimentally proven TFBSs of the anaerobic regulator Anr of Pseudomonas aeruginosa is used (Trunk et al., 2010). There are different ways of representing a pattern. Traditionally, the IUPAC code for base ambiguities is a straightforward way to describe a binding motif (NC-IUB, 1985). In this approach, combinations of certain bases are assigned to an extended alphabet of specific letters (Fig. 1B). IUPAC code can easily be converted into a regular expression (Fig. 1C). A regular expression is a formal language for pattern matching that can be used to scan for ambiguous IUPAC strings in order to predict new TFBSs (Betel & Hogue, 2002) (Fig. 1B). Although the IUPAC letter code is very concise and still widely used among biologists, it does not provide a proper weighting of bases. Additionally, the majority rules for generating a consensus sequence are to some extent arbitrary (Day & McMorris, 1992). However, if the training set consists of only a few sequences, the use of IUPAC code can still make sense.

Name | Year | Data content | URL | Reference
-----|------|--------------|-----|----------
CoryneRegNet | 2006 | Corynebacterium TFBSs, regulatory networks, predictions | http://www.coryneregnet.de | Baumbach et al. (2009)
DBTBS | 2001 | B. subtilis TFBSs, operons, predictions | http://dbtbs.hgc.jp | Sierro et al. (2008)
DPInteract | 1998 | E. coli TFBSs, PWMs | http://arep.med.harvard.edu/dpinteract | Robison et al. (1998)
PRODORIC | 2003 | prokaryotic TFBSs, PWMs, promoters, expression data | http://www.prodoric.de | Grote et al. (2009)
PromEC | 2001 | E. coli promoters | http://margalit.huji.ac.il/promec | Hershberg et al. (2001)
RegPrecise | 2010 | predicted TFBSs | http://regprecise.lbl.go | Novichkov et al. (2010)
RegTransBase | 2007 | prokaryotic TFBSs, PWMs | http://regtransbase.lbl.gov | Kazakov et al. (2007)
RegulonDB | 1998 | E. coli TFBSs, PWMs, operons | http://regulondb.ccg.unam.mx | Gama-Castro et al. (2011)
Tractor_DB | 2004 | predicted TFBSs of γ-proteobacteria | http://www.tractor.lncc.br | Pérez et al. (2007)

Table 1. List of important public databases about bacterial gene regulation. The table shows the name, year of establishment, data content, internet address and latest reference of the respective database.

A more accurate description of a binding pattern is achieved by probabilistic models such as a frequency matrix (or alignment matrix) (Staden, 1984). Instead of considering only the most common bases at each position, a matrix comprises the frequencies of each nucleotide at each position (Fig. 1D). Based on frequency matrices, many models for the calculation of weights were proposed. Such a model is broadly called a position weight matrix (PWM) or position-specific scoring matrix (PSSM). PWMs can be considered simplified profile hidden Markov models (HMMs) that do not allow insertion and deletion states (Durbin et al., 1998). Formally, a PWM is an array M of weights w in which each column corresponds to a position of the TFBS motif of length l and each row represents a letter of the sequence alphabet. In the case of DNA the alphabet is {A, C, G, T} (equation 1).

\[
M = \begin{pmatrix}
w_{A,1} & w_{A,2} & \cdots & w_{A,l} \\
w_{C,1} & w_{C,2} & \cdots & w_{C,l} \\
w_{G,1} & w_{G,2} & \cdots & w_{G,l} \\
w_{T,1} & w_{T,2} & \cdots & w_{T,l}
\end{pmatrix} \qquad (1)
\]

Many closely related methods for the calculation of individual weights were proposed in the literature (Berg & von Hippel, 1987; Fickett, 1996; Schneider et al., 1986; Staden, 1984; Stormo, 2000). The information-theoretical approach and modifications of it (Schneider et al., 1986) are widely used and are among the most successful methods for both the modeling and the prediction of potential TFBSs. Information is a measure of the reduction in uncertainty, which means that a highly conserved position with the exclusive occurrence of one specific nucleotide gets the highest information value of 2 bits. In other
words, there is a maximum certainty of finding this nucleotide at this position. In contrast, an information value of 0 bits represents a highly degenerate position and the highest uncertainty of finding a specific nucleotide. The information vector R(l) represents the total information content of a profile of aligned sequences at position l, with f(b,l) indicating the frequency of base b at position l (terms with f(b,l) = 0 contribute zero):

\[
R(l) = 2 + \sum_{b=A}^{T} f(b,l) \log_2 f(b,l) \qquad (2)
\]

An information PWM m(b,l) is generated by multiplying the base frequencies f(b,l) by the total information content R(l) (Fig. 1E):

\[
m(b,l) = f(b,l) \cdot R(l) \qquad (3)
\]

For pattern matching applications a PWM is used by summing up the weights that correspond to the bases of a candidate sequence to give a score. Afterwards, these scores are compared to a predefined cut-off (or threshold) to select potential predictions. The derived score is often correlated with the binding affinity of a TF, so the information score can be interpreted as a rough estimate of the specific binding energy. However, this is only possible under the simplifying assumption that each position of a pattern contributes independently to the TF-TFBS interaction. This additivity assumption is controversial, but it was shown to be in fact a reasonable approximation (Benos et al., 2002). The graphical representation of an information PWM is called a sequence logo (Schneider & Stephens, 1990). In a sequence logo the height of each individual letter corresponds to its PWM weight, so the total height of the stack of letters represents the information content R(l) at this position. Sequence logos allow an illustrative visualization of the sequence conservation and binding preference of a regulator (Fig. 1F).

3 Statistical significance of pattern matching

Regulatory sequences are commonly short (usually 6-18 bp), the sample size of experimentally proven sites is often limited and in many cases the observed level of sequence conservation is low. Consequently, the
genome-wide frequency of occurrence of derived patterns is often unrealistically high. In such cases, searches generate increasing numbers of false predictions the lower the threshold score is set. This is demonstrated in Fig. 2, which shows the score distributions of true and false predictions of a genome-wide search in P. aeruginosa using the PWM of the Anr regulator (Fig. 1E). In the example shown, matches in coding regions were considered false predictions (false-positives) and matches that are part of the training set were naturally ranked as true predictions (true-positives). Score distributions are also important indicators for evaluating the predictive capacity of a PWM (Medina-Rivera et al., 2011). In order to improve the predictive power of pattern matching, a cut-off score is commonly set in a way that improves the ratio of true to false predictions. However, the total number of hits will still contain some false-positives, while some true matches are lost (false-negatives). It follows that TFBS predictions cannot be classified in a binary manner like a diagnostic test, since true-positives and false-positives always coexist. Alternatively, they can be grouped into a classification scheme consisting of four different classes (Fig. 3), which is called a two-by-two confusion matrix or contingency table (Fawcett, 2004).

[Figure 1, panels: A) Excerpt of the 40 sample sequences (training set); B) IUPAC consensus: YTGHYNBNBVKCAR; C) Regular expression: [CT]TG[ACT][CT][ACGT][CGT][ACGT][CGT][ACG][TG]CA[AG]; D) Frequency matrix; E) Position weight matrix; F) Sequence logo (bits, 5′ to 3′).]

Fig. 1. Various pattern representations for a training set of 40 Anr binding sites from Pseudomonas aeruginosa (Trunk et al., 2010). The deduced IUPAC consensus (B), regular expression (C), frequency matrix (D), position weight matrix (E) and sequence logo (F) are shown.

[Figure 2, panels A and B: histograms of the number of matches against score (11.5 to 15.0).]

Fig. 2. Score distributions of false-positive matches (A) and true-positive matches (B) from a genome-wide search in P. aeruginosa using the Anr PWM.

         | Positive dataset | Negative dataset
---------|------------------|-----------------
Match    | True-Positive    | False-Positive
No match | False-Negative   | True-Negative

Fig. 3. A two-by-two confusion matrix illustrates all four possible outcomes of matches in the positive and in the negative dataset.

Thus, setting a cut-off score can be considered an important decision-making process. Instead of setting an arbitrary cut-off value, it is possible to determine an optimized threshold. For that purpose, a number of statistical performance measures for binary classification are available. Sensitivity Sn (or true-positive rate) measures the proportion of positive matches that are correctly identified at a given cut-off score c. Here, the positive matches include both the true-positives TP and the false-negatives FN:

\[
Sn(c) = \frac{TP}{TP + FN} \qquad (4)
\]

Similarly, specificity Sp (or true-negative rate) measures the proportion of correctly identified negative matches at a given cut-off score c, where the number of negative matches is the sum of true-negatives TN and false-positives FP:

\[
Sp(c) = \frac{TN}{TN + FP} \qquad (5)
\]

By this definition, sensitivity and specificity behave in opposite ways as functions of the cut-off: specificity increases (fewer false-positives) at the cost of sensitivity (fewer true-positives found) and vice versa (Fig. 4A). A receiver operating characteristic (ROC) curve summarizes the classification performance in a plot of sensitivity versus (1 - specificity). ROC curves are fundamental tools for the evaluation of classification models. An optimal ROC curve would pass through the upper left corner at coordinate (0,1), representing 100% sensitivity and specificity, whereas random guessing would produce a point along the diagonal line (Fig. 4B). Thus, the diagonal line divides the ROC space: points above the diagonal represent good classification results, points below the line indicate poor results (Fawcett, 2004).

[Figure 4, panels: A) performance (sensitivity and specificity, 0.0 to 1.0) against score (12.0 to 14.5); B) TP rate against FP rate (0.0 to 1.0).]

Fig. 4. Performance measurements for the prediction of the Anr regulon in Pseudomonas aeruginosa. (A) Sensitivity (green) and specificity (red) plot. (B) ROC graph.

An alternative way to optimize the performance of pattern matching and to produce statistically significant results is the calculation of a p-value. A p-value gives the probability of finding a score that is at least as good by chance. P-values can be either determined by simulation or estimated via a compound importance sampling approach (Oberto, 2010). Finally, appropriate thresholds for pattern searches are determined as a trade-off between sensitivity and specificity to maximize both values. Despite optimized cut-off values
this approach can result in poor sensitivity and a loss of 40-60% of known functional sites (Benítez-Bellón et al., 2002). In addition, the observation that false predictions commonly exceed true predictions by several orders of magnitude (Fig. 2B) was called the ’futility theorem’ (Wasserman & Sandelin, 2004). Fortunately, there are many sophisticated approaches to overcome this problem in a reasonable way (see section 4).

4 Improvements to increase the accuracy of TFBS predictions

4.1 Modifications of the score

In several studies the information score was modified in different ways. One of the most critical points of equation 2 is that it postulates an equal nucleotide distribution in the target genome, which is the case e.g. for Escherichia coli with a GC content of 51.8%. For this reason, the information content of motifs in genomes with highly biased nucleotide composition is likely to be over- or underestimated. A more generalized form that considers the background frequencies P_b is given in equation 6:

\[
R(l) = \sum_{b=A}^{T} f(b,l) \log_2 \frac{f(b,l)}{P_b} \qquad (6)
\]

For a uniform background (P_b = 0.25 for every base) equation 6 reduces to equation 2. This term is the relative entropy or Kullback-Leibler distance (Stormo, 2000). Another promising approach treats a biased genome as a discrete noisy channel in order to discriminate a motif from its background (Schreiber & Brown, 2002). However, it was recently demonstrated that the unmodified information score performs on average better than other alternatives (Erill & O’Neill, 2009). One reason might be that binding sites shift towards the genome skew in a co-evolutionary process between TFs and their corresponding TFBSs.

Other modifications concern the way the score is computed. Since the information vector usually peaks at certain well-conserved positions, forming the overall sum can produce overestimated matches. Therefore, it is useful to define a core region
consisting of the highly conserved positions. Using this approach, the score can be computed in two steps. Potential matches first have to pass the core cut-off before they are evaluated against the overall cut-off score (Münch et al., 2005; Quandt et al., 1995).

Finally, it is possible to enhance the accuracy by combining multiple (independent) criteria. Apart from the pure sequence information, DNA exhibits distinct structural properties caused by interactions between neighboring nucleotides. These include, for example, DNA curvature, flexibility and stability. Structural DNA features are available as di- and trinucleotide scales that assign a particular value to each possible nucleotide combination (Baldi & Baisnée, 2000). These values are derived from empirical measurements or theoretical approaches. Structural features within a DNA sequence stretch are usually calculated by summing up and averaging the corresponding di- or trinucleotide scale values. Prokaryotic promoters usually exhibit distinct structural features: these DNA sequences are more curved and less flexible in comparison to coding regions. This feature is necessary in order to enable the melting of the DNA strands for the onset of transcription. In most bacterial promoters structural peaks are present around position -40 upstream of the transcriptional start point (Pedersen et al., 2000). Structural features can provide distinct scores independent of PWM-based sequence similarity scores. Recently, pattern matching was combined with a binding site model that was trained using 12 different structural properties (Meysman et al., 2011). In this approach, based on conditional random fields, the classification of matches was significantly improved. In a similar way, structural and chemical features of DNA decreased the number of false-positives in a supervised learning approach (Bauer et al., 2010).

4.2 Positional preference of TFBSs

Prokaryotic genomes usually consist of 6-14% non-coding DNA (Rogozin et al., 2002). In contrast to eukaryotes, the evolution of non-coding regions appears to be determined primarily by the selective pressure to minimize the amount of non-functional DNA while maintaining the essential TFBSs. Additionally, it was demonstrated in Escherichia coli that many PWMs show a strong preference for matches in non-coding regions (Robison et al., 1998). Figure 5A shows the distance of 1741 genomic TFBSs relative to the translational start site of the target gene. Only 3.6% of all TFBSs are located after the start codon within the coding region. The largest number of TFBSs, however, accumulates directly upstream. This is also demonstrated by the cumulative percentage of TFBSs plotted against the distance to the translational start (Fig. 5B). According to this result, a total of 75.3% and 87.9% of all TFBSs are located within 200 bp and 300 bp upstream, respectively. Thus, prokaryotic promoters are usually short and it is reasonable to constrain searches to non-coding regions with a limit of a few hundred bp upstream of the translational start.

[Figure 5, panels: A) frequency against distance; B) cumulative percentage against distance.]

Fig. 5. Histogram of TFBS distances to the translational start site. The dataset consisted of 1741 genomic TFBSs from various bacterial species taken from the PRODORIC database.

4.3 Phylogenetic conservation of regulatory interactions

The large number of sequenced bacterial genomes enables comparative genomics approaches to predict and analyze regulatory interactions. Similar to phylogenetic footprinting, highly conserved matches in promoter regions of orthologous genes are more likely to be functional targets than non-conserved matches (McCue et al., 2001). This is
particularly important for the interspecies transfer of gene regulatory networks (Babu et al., 2006; Baumbach, 2010) but also for the search for new regulon members (Pérez et al., 2007). The use of pattern matching methods in combination with phylogenetic conservation is also called regulog analysis (Alkema et al., 2004). In a regulog analysis the relative conservation score RCS is defined as the fraction of orthologs that share the same potential TFBS:

\[
RCS = \frac{\text{orthologs}_{\text{observed}}}{\text{orthologs}_{\text{expected}}} \qquad (7)
\]

In the first step of this and related approaches, the orthologous regulators and the corresponding target gene set are determined. This is often realized by bi-directional best BLAST hits (BBH) (Mushegian & Koonin, 1996). In the second step, conserved TFBSs are extracted via pattern matching or pattern discovery approaches. Predicted TFBSs with phylogenetic conservation can also be used to extend existing PWMs or to build new ones. Huge datasets based on phylogenetic reconstruction were generated for various groups of bacteria (Baumbach et al., 2009; Novichkov et al., 2010; Pérez et al., 2007). Further investigation of regulon evolution revealed the existence of a core set of genes that is widely conserved across related species and a variable set of target genes reflecting the degree of specialization (Browne et al., 2010; Dufour et al., 2010). However, it was shown that the outlined approach is commonly only feasible between closely related clades, because TFs evolve rapidly and independently of their target genes (Babu et al., 2006). Moreover, orthologous TFs in bacteria often have different functions and regulate different sets of genes (Price et al., 2007). In summary, a high RCS value for a TFBS match represents an independent score supporting a real functional target, while a low RCS does not necessarily indicate false-positive
matches. The phylogenetic conservation approach is a powerful way to predict gene regulatory networks in closely related organisms and to gain insights into the evolution of regulons.

5 Conclusion and outlook

In summary, the genome-wide recognition of DNA patterns by computational methods is still a challenging task. However, major improvements in this field allow for reliable predictions in many cases. In particular, the rising number of sequenced bacterial genomes in combination with data from high-throughput technologies offers many possibilities for the development of more sophisticated methods in comparative genomics. Nevertheless, computational methods for TFBS prediction cannot replace wet-lab experiments, but they can help to generate new hypotheses that can be verified in an iterative process.

References

Alkema, W. B. L., Lenhard, B. & Wasserman, W. W. (2004). Regulog analysis: detection of conserved regulatory networks across bacteria: application to Staphylococcus aureus, Genome Res 14(7): 1362–1373. URL: http://dx.doi.org/10.1101/gr.2242604
Babu, M. M., Teichmann, S. A. & Aravind, L. (2006). Evolutionary dynamics of prokaryotic transcriptional regulatory networks, J Mol Biol 358(2): 614–633. URL: http://dx.doi.org/10.1016/j.jmb.2006.02.019
Baldi, P. & Baisnée, P. F. (2000). Sequence analysis by additive scales: DNA structure for sequences and repeats of all lengths, Bioinformatics 16(10): 865–889.
Bauer, A. L., Hlavacek, W. S., Unkefer, P. J. & Mu, F. (2010). Using sequence-specific chemical and structural properties of DNA to predict transcription factor binding sites, PLoS Comput Biol 6(11): e1001007. URL: http://dx.doi.org/10.1371/journal.pcbi.1001007
Baumbach, J. (2010). On the power and limits of evolutionary conservation–unraveling bacterial gene regulatory networks, Nucleic Acids Res. URL: http://dx.doi.org/10.1093/nar/gkq699
Baumbach, J., Wittkop, T., Kleindt, C. K. & Tauch, A. (2009). Integrated analysis and reconstruction of microbial transcriptional gene
regulatory networks using CoryneRegNet, Nat Protoc 4(6): 992–1005. URL: http://dx.doi.org/10.1038/nprot.2009.81
Benos, P. V., Bulyk, M. L. & Stormo, G. D. (2002). Additivity in protein-DNA interactions: how good an approximation is it?, Nucleic Acids Res 30(20): 4442–4451.
Benítez-Bellón, E., Moreno-Hagelsieb, G. & Collado-Vides, J. (2002). Evaluation of thresholds for the detection of binding sites for regulatory proteins in Escherichia coli K12 DNA, Genome Biol 3(3): 13.
Berg, O. G. & von Hippel, P. H. (1987). Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters, J Mol Biol 193(4): 723–750.
Betel, D. & Hogue, C. W. V. (2002). Kangaroo–a pattern-matching program for biological sequences, BMC Bioinformatics 3(1): 20.
Browne, P., Barret, M., O’Gara, F. & Morrissey, J. P. (2010). Computational prediction of the Crc regulon identifies genus-wide and species-specific targets of catabolite repression control in Pseudomonas bacteria, BMC Microbiol 10: 300. URL: http://dx.doi.org/10.1186/1471-2180-10-300
Day, W. H. & McMorris, F. R. (1992). Critical comparison of consensus methods for molecular sequences, Nucleic Acids Res 20(5): 1093–1099.
Dufour, Y. S., Kiley, P. J. & Donohue, T. J. (2010). Reconstruction of the core and extended regulons of global transcription factors, PLoS Genet 6(7): e1001027. URL: http://dx.doi.org/10.1371/journal.pgen.1001027
Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. (1998). Biological sequence analysis, Cambridge University Press.
Erill, I. & O’Neill, M. C. (2009). A reexamination of information theory-based methods for DNA-binding site identification, BMC Bioinformatics 10: 57. URL: http://dx.doi.org/10.1186/1471-2105-10-57
Fawcett, T. (2004). ROC graphs: Notes and practical considerations for researchers, Technical report, HP Laboratories. URL:
http://www.hpl.hp.com/techreports/2003/HPL-2003-4.pdf
Fickett, J. W. (1996). Quantitative discrimination of MEF2 sites, Mol Cell Biol 16(1): 437–441.
Gama-Castro, S., Salgado, H., Peralta-Gil, M., Santos-Zavaleta, A., Muñiz-Rascado, L., Solano-Lira, H., Jimenez-Jacinto, V., Weiss, V., García-Sotelo, J. S., López-Fuentes, A., Porrón-Sotelo, L., Alquicira-Hernández, S., Medina-Rivera, A., Martínez-Flores, I., Alquicira-Hernández, K., Martínez-Adame, R., Bonavides-Martínez, C., Miranda-Ríos, J., Huerta, A. M., Mendoza-Vargas, A., Collado-Torres, L., Taboada, B., Vega-Alvarado, L., Olvera, M., Olvera, L., Grande, R., Morett, E. & Collado-Vides, J. (2011). RegulonDB version 7.0: transcriptional regulation of Escherichia coli K-12 integrated within genetic sensory response units (Gensor Units), Nucleic Acids Res 39(Database issue): D98–105. URL: http://dx.doi.org/10.1093/nar/gkq1110
Grote, A., Klein, J., Retter, I., Haddad, I., Behling, S., Bunk, B., Biegler, I., Yarmolinetz, S., Jahn, D. & Münch, R. (2009). PRODORIC (release 2009): a database and tool platform for the analysis of gene regulation in prokaryotes, Nucleic Acids Res 37(Database issue): D61–D65. URL: http://dx.doi.org/10.1093/nar/gkn837
Hershberg, R., Bejerano, G., Santos-Zavaleta, A. & Margalit, H. (2001). PromEC: An updated database of Escherichia coli mRNA promoters with experimentally identified transcriptional start sites, Nucleic Acids Res 29(1): 277.
Kazakov, A. E., Cipriano, M. J., Novichkov, P. S., Minovitsky, S., Vinogradov, D. V., Arkin, A., Mironov, A. A., Gelfand, M. S. & Dubchak, I. (2007). RegTransBase–a database of regulatory sequences and interactions in a wide range of prokaryotic genomes, Nucleic Acids Res 35(Database issue): D407–D412. URL: http://dx.doi.org/10.1093/nar/gkl865
Loo, P. V. & Marynen, P. (2009). Computational methods for the detection of cis-regulatory modules, Brief Bioinform 10(5): 509–524.
URL: http://dx.doi.org/10.1093/bib/bbp025 McCue, L., Thompson, W., Carmack, C., Ryan, M P., Liu, J S., Derbyshire, V & Lawrence, C E (2001) Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes., Nucleic Acids Res 29(3): 774–782 Medina-Rivera, A., Abreu-Goodger, C., Thomas-Chollier, M., Salgado, H., Collado-Vides, J & van Helden, J (2011) Theoretical and empirical quality assessment of transcription factor-binding motifs., Nucleic Acids Res 39(3): 808–824 URL: http://dx.doi.org/10.1093/nar/gkq710 Meysman, P., Dang, T H., Laukens, K., Smet, R D., Wu, Y., Marchal, K & Engelen, K (2011) Use of structural dna properties for the prediction of transcription-factor binding sites in Escherichia coli., Nucleic Acids Res 39(2): e6 URL: http://dx.doi.org/10.1093/nar/gkq1071 Münch, R., Hiller, K., Grote, A., Scheer, M., Klein, J., Schobert, M & Jahn, D (2005) Virtual Footprint and PRODORIC: an integrative framework for regulon prediction in prokaryotes., Bioinformatics 21(22): 4187–4189 URL: http://dx.doi.org/10.1093/bioinformatics/bti635 Mushegian, A R & Koonin, E V (1996) A minimal gene set for cellular life derived by comparison of complete bacterial genomes., Proc Natl Acad Sci U S A 93(19): 10268–10273 NC-IUB (1985) Nomenclature Committee of the International Union of Biochemistry (NC-IUB) Nomenclature for incompletely specified bases in nucleic acid sequences Recommendations 1984., Eur J Biochem 150(1): 1–5 Novichkov, P S., Laikova, O N., Novichkova, E S., Gelfand, M S., Arkin, A P., Dubchak, I & Rodionov, D A (2010) RegPrecise: a database of curated genomic inferences of transcriptional regulatory interactions in prokaryotes., Nucleic Acids Res 38(Database issue): D111–D118 URL: http://dx.doi.org/10.1093/nar/gkp894 Oberto, J (2010) Fitbar: a web tool for the robust prediction of prokaryotic regulons., BMC Bioinformatics 11: 554 URL: http://dx.doi.org/10.1186/1471-2105-11-554 Pedersen, A G., Jensen, L J., Brunak, S., Staerfeldt, H H & 
Ussery, D W (2000) A DNA structural atlas for Escherichia coli., J Mol Biol 299(4): 907–930 URL: http://dx.doi.org/10.1006/jmbi.2000.3787 Pérez, A G., Angarica, V E., Vasconcelos, A T R & Collado-Vides, J (2007) Tractor_DB (version 2.0): a database of regulatory interactions in gamma-proteobacterial genomes., Nucleic Acids Res 35(Database issue): D132–D136 URL: http://dx.doi.org/10.1093/nar/gkl800 Price, M., Dehal, P & Arkin, A (2008) Horizontal gene transfer and the evolution of transcriptional regulation in Escherichia coli., Genome Biol 9(1): R4 URL: http://dx.doi.org/10.1186/gb-2008-9-1-r4 Price, M N., Dehal, P S & Arkin, A P (2007) Orthologous transcription factors in bacteria have different functions and regulate different genes., PLoS Comput Biol 3(9): 1739–1750 URL: http://dx.doi.org/10.1371/journal.pcbi.0030175 Prediction andofAnalysis of Networks Gene inRegulatory Networks in Prokaryotic Genomes Prediction and Analysis Gene Regulatory Prokaryotic Genomes 161 13 Quandt, K., Frech, K., Karas, H., Wingender, E & Werner, T (1995) MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data., Nucleic Acids Res 23(23): 4878–4884 Robison, K., McGuire, A M & Church, G M (1998) A comprehensive library of DNA-binding site matrices for 55 proteins applied to the complete Escherichia coli K-12 genome., J Mol Biol 284(2): 241–254 Rodionov, D A (2007) Comparative genomic reconstruction of transcriptional regulatory networks in bacteria., Chem Rev 107(8): 3467–3497 URL: http://dx.doi.org/10.1021/cr068309+ Rogozin, I B., Makarova, K S., Natale, D A., Spiridonov, A N., Tatusov, R L., Wolf, Y I., Yin, J & Koonin, E V (2002) Congruent evolution of different classes of non-coding DNA in prokaryotic genomes., Nucleic Acids Res 30(19): 4264–4271 Sandve, G K., Abul, O., Walseng, V & Drabløs, F (2007) Improved benchmarks for computational motif discovery., BMC Bioinformatics 8: 193 URL: 
http://dx.doi.org/10.1186/1471-2105-8-193 Schneider, T D & Stephens, R M (1990) Sequence logos: a new way to display consensus sequences., Nucleic Acids Res 18(20): 6097–6100 Schneider, T D., Stormo, G D., Gold, L & Ehrenfeucht, A (1986) Information content of binding sites on nucleotide sequences., J Mol Biol 188(3): 415–431 Schreiber, M & Brown, C (2002) Compensation for nucleotide bias in a genome by representation as a discrete channel with noise., Bioinformatics 18(4): 507–512 Sierro, N., Makita, Y., de Hoon, M & Nakai, K (2008) Dbtbs: a database of transcriptional regulation in bacillus subtilis containing upstream intergenic conservation information., Nucleic Acids Res 36(Database issue): D93–D96 URL: http://dx.doi.org/10.1093/nar/gkm910 Staden, R (1984) Computer methods to locate signals in nucleic acid sequences., Nucleic Acids Res 12(1 Pt 2): 505–519 Stormo, G D (2000) DNA binding sites: representation and discovery., Bioinformatics 16(1): 16–23 Su, J., Teichmann, S A & Down, T A (2010) Assessing computational methods of cis-regulatory module prediction., PLoS Comput Biol 6(12): e1001020 URL: http://dx.doi.org/10.1371/journal.pcbi.1001020 Tompa, M., Li, N., Bailey, T L., Church, G M., Moor, B D., Eskin, E., Favorov, A V., Frith, M C., Fu, Y., Kent, W J., Makeev, V J., Mironov, A A., Noble, W S., Pavesi, G., Pesole, G., Régnier, M., Simonis, N., Sinha, S., Thijs, G., van Helden, J., Vandenbogaert, M., Weng, Z., Workman, C., Ye, C & Zhu, Z (2005) Assessing computational tools for the discovery of transcription factor binding sites., Nat Biotechnol 23(1): 137–144 URL: http://dx.doi.org/10.1038/nbt1053 Trunk, K., Benkert, B., Quäck, N., Münch, R., Scheer, M., Garbe, J., Jänsch, L., Trost, M., Wehland, J., Buer, J., Jahn, M., Schobert, M & Jahn, D (2010) Anaerobic adaptation in Pseudomonas aeruginosa: definition of the Anr and Dnr regulons., Environ Microbiol 12(6): 1719–1733 URL: http://dx.doi.org/10.1111/j.1462-2920.2010.02252.x van Hijum, S A F T., Medema, 
Mining Host-Pathogen Interactions

Dmitry Korkin, Thanh Thieu, Sneha Joshi and Samantha Warren
University of Missouri, Columbia, USA

Introduction

Infections are caused by a vast variety of pathogenic agents, including viruses, bacteria, fungi, protozoa, multicellular parasites, and even proteins (Anderson and May 1979; Morse 1995; Bartlett 1997; Mandell and Townsend 1998), that target host organisms from virtually all kingdoms of life (Daszak, Cunningham et al. 2000; Williams, Yuill et al. 2002). Infectious diseases in humans account for 170 thousand deaths in the United States and 14.7 million deaths worldwide (2004; Rossi and Walker 2005). “Neglected diseases”, a group of tropical diseases that are spread among the poorest segment of the world’s population, account for a large portion of human infections (Ayoola 1987; Trouiller, Olliaro et al. 2002). With the pharmaceutical industry reluctant to invest in the development of drugs for neglected diseases, there is increasing pressure on the scientific community in academia and non-profit organizations to obtain a fast and inexpensive cure (Trouiller, Torreele
et al. 2001; Maurer, Rai et al. 2004; Fehr, Thurmann et al. 2006). In addition to human infections, infections in plants and animals have a multibillion-dollar economic impact each year (Bowers, Bailey et al. 2001; Whitby 2001). Expanding the studies to the whole animal kingdom allows scientists to study the host-pathogen evolution of virulence mechanisms that are common among plants and animals, such as the type III secretion system (T3SS), an elaborate protein-delivery system (Espinosa and Alfano 2004; Abramovitch, Anderson et al. 2006). Moreover, studying interactions between pathogens and simpler model organisms, such as Drosophila, has led to important findings in mammalian systems and is critical for understanding human infections (Cherry and Silverman 2006).

Recently another threat has come to scientists’ attention: the potential use of some pathogens as bioweapons (Whitby 2001; Moran, Talan et al. 2008). The attacks can target the population directly, or they can target strategic resources such as the world’s most consumed crops. Studying host-pathogen interactions (HPIs) may provide critical knowledge for the development of infection diagnosis and treatment for disaster planning in case of a bioterrorism event.

A pathogen causing an infectious disease generally exhibits extensive interactions with the host (Munter, Way et al. 2006). This complex crosstalk between a host and a pathogen may assist the pathogen in successfully invading the host organism, breaching its immune defence, and replicating and persisting within the organism. Systematic determination and analysis of HPIs is a challenging task for both experimental and computational approaches, and is critically dependent on the previously obtained knowledge about these interactions. The molecular mechanisms of HPIs include interactions between proteins, nucleotide sequences, and small ligands (Lengeling, Pfeffer et al. 2001; Kahn, Fu et
al. 2002; Stebbins 2005; Forst 2006). The interactions between pathogen and host proteins are one of the most important, and therefore most widely studied, groups of HPIs (Stebbins 2005). During the last decade, an increasing amount of experimental data on virulence factors, their structures, and their functions has become available (Sansonetti 2002; Stebbins 2005). The first steps towards large-scale systematic determination and analysis of molecular HPIs have recently emerged for important pathogens (Shapira, Gat-Viks et al. 2009; Dyer, Neff et al. 2010).

Recent progress in data mining and bioinformatics allows scientists to accurately predict novel protein-protein interactions, structurally characterize individual proteins and protein complexes, and predict protein functions on the scale of an entire proteome (Thornton 2001; Russell, Alber et al. 2004; Shoemaker and Panchenko 2007). Unfortunately, only a handful of methods have been designed to address the protein interactions between pathogenic agents and their hosts (Cherkasov and Jones 2004; Davis, Barkan et al. 2007; Dyer, Murali et al. 2007; Lee, Chan et al. 2008; Evans, Dampier et al. 2009; Tyagi, Krishnadev et al. 2009; Doolittle and Gomez 2011).

As is the case in many areas of bioinformatics, collecting HPI data into a centralized repository is instrumental in developing accurate predictive methods. Recently, several such HPI repositories have been introduced; some are manually curated, while others rely on existing databases (Winnenburg, Urban et al. 2008; Driscoll, Dyer et al. 2009; Kumar and Nanduri 2010). While this is a promising first step towards large-scale HPI data collection, one of the largest and most comprehensive sources of experimentally verified HPI data remains largely underexplored: PubMed, a database of peer-reviewed biomedical literature, which includes abstracts of more than 20 million research papers and books (http://www.ncbi.nlm.nih.gov/pubmed/). Unfortunately, the comprehensive manual
identification and data extraction of the abstracts containing HPI information from PubMed is not feasible due to the size of PubMed. Furthermore, no informatics approach is currently available to do this automatically.

In this chapter, we discuss several possible solutions to the problem of automated HPI data collection from the publicly available literature. The chapter is organized as follows. First, we describe some of the popular HPI databases that are currently publicly available. Second, we discuss the state-of-the-art approaches to the related problem of mining general protein-protein interactions from the literature. Third, we propose three approaches to mining HPIs and discuss their advantages and disadvantages. In conclusion, we discuss the future steps in the area of HPI text mining by highlighting factors that are critical for its successful development.

Host-pathogen interaction databases

During the last several years, a number of resources collecting HPI data have emerged (Snyder, Kampanya et al. 2007; Winnenburg, Urban et al. 2008; Driscoll, Dyer et al. 2009; Kumar and Nanduri 2010). Many resources rely on automated post-processing of the large-scale databases for general protein-protein interactions, while others obtain the HPI data by manually curating the biomedical literature. Often the resources focus on human-pathogen interactions. Next, we briefly describe some of the popular databases that include HPI data.

HPIDB – Host-Pathogen Interaction DataBase

One of the most recent HPI databases, HPIDB (Kumar and Nanduri 2010), integrates the information from another HPI database, PIG (Driscoll, Dyer et al. 2009), and from more general protein-protein interaction databases: BIND (Gilbert 2005), GeneRIF (Mitchell, Aronson et al. 2003; Pruitt, Tatusova et al. 2003), IntAct (Aranda, Achuthan et al. 2010), MINT (Zanzoni, Montecchi-Palazzi et al. 2002), and Reactome (Matthews, Gopinath et al. 2009). Currently, the database has
22,841 protein-protein interactions between 49 host and 319 pathogen species (Kumar and Nanduri 2010). HPIDB is searchable via a keyword search, a BLAST search, or a homologous HPI search. For each query, the following output information is obtained: UniProt accession numbers of both host and pathogen proteins, host and pathogen names, detection method, author name, PubMed publication ID (PMID), interaction type, source database, and comments. The homologous HPI search option allows the user to do one or both of the following: search for a set of homologous host proteins, and search for a set of homologous pathogen proteins.

PATRIC – PAThosystems Resource Integration Center

PATRIC is a resource that integrates genomics, proteomics, and interactomics data on a comprehensive set of bacterial species, together with a set of data mining and comparative genomics tools (Snyder, Kampanya et al. 2007; Sullivan, Gabbard et al. 2010). The human-pathogen interaction data for 30 bacterial pathogens are also a part of the resource. As in HPIDB, the data are extracted and post-processed from a number of general protein-protein interaction databases, including BIND (Gilbert 2005), DIP (Xenarios, Fernandez et al. 2001), IntAct (Aranda, Achuthan et al. 2010), and MINT (Zanzoni, Montecchi-Palazzi et al. 2002). With PATRIC, a user selects a pathogen from the home page. The search can be refined by selecting specific interaction types (e.g., “direct interaction”, “colocalization”), detection methods (e.g., “coimmunoprecipitation”, “two hybrid”), or source databases. The results can be visualized as a network of interacting proteins, with node colours representing different species and edge weights representing the number of independent experimental sources supporting the interaction.

The Pathogen Interaction Gateway (PIG) is a part of PATRIC that is focused exclusively on collecting and analysing protein-protein human-pathogen interactions and the corresponding interaction networks (Driscoll,
Dyer et al. 2009). The PIG web interface allows mining the data using two query types: the BLAST search and the text keyword search. PIG also has a utility that allows the user to visualize the network of protein-protein HPIs and then compare the HPI networks extracted for two different pathogen genes.

PHI-base – the Pathogen-Host Interaction dataBASE

PHI-base collects information on experimentally verified pathogenicity, virulence and effector genes from bacterial, fungal, and Oomycete pathogens and includes a variety of infected hosts from plants, mammals, fungi, and insects (Winnenburg, Urban et al. 2008). All database entries are manually curated and are supported by experimental evidence and literature citations. The current version has a total of 1,065 gene entries participating in 1,335 interactions between 97 pathogens and 76 hosts, supported by 720 literature references. The interaction between a host and a pathogen organism is considered in this database in a more general sense and is often not associated with any physical interaction between the host and pathogen proteins. Using the PHI-base web interface, a user can run either a simple quick search or an advanced search, where the user selects one or more of the following search terms: gene, disease (caused by pathogen), host, pathogen, anti-infective, phenotype, and experimental evidence. The search output is a list of interactions and their details, including PHI-base accession number, gene name, EMBL accession number, phenotype of the mutant, pathogen species, disease name, and experimental host. The user can also obtain additional information on nucleotide and amino acid sequences of the pathogen gene, experimental evidence of the interaction, gene ontology (pathogenesis, molecular function, and biological process), and a publication reference.

Current approaches for mining protein-protein interactions

Rapid
growth of published biomedical research has resulted in the development of a number of methods for biomedical literature mining over the last decade (Krallinger and Valencia 2005; Rodriguez-Esteban 2009). The methods dealing with biomolecular information can generally be divided into three categories based on the domain of biomedical knowledge they target: (i) automated protein or gene name identification in a text (Mika and Rost 2004; Seki and Mostafa 2005; Tanabe, Xie et al. 2005), (ii) literature-based functional annotation of genes and proteins (Chiang and Yu 2003; Jaeger, Gaudan et al. 2008), and (iii) extraction of information on the relationships between biological molecules, such as proteins and RNAs, or genes (Hu, Narayanaswamy et al. 2005; Shatkay, Höglund et al. 2007; Lee, Yi et al. 2008). The relationships detected by the third group of methods range from the co-occurrence of genes and proteins in a text (Hoffmann and Valencia 2005), to the detection of protein-protein interactions (PPIs) (Blaschke and Valencia 2001; Marcotte, Xenarios et al. 2001; Donaldson, Martin et al. 2003), to the identification of signal transduction networks and metabolic pathways (Friedman, Kra et al. 2001; Hoffmann, Krallinger et al. 2005; Santos and Eggle 2005). Because HPIs are a special case of protein-protein interactions, HPI mining could directly benefit from the advancements of the existing text mining methods.

Extraction of protein-protein interactions from text has been one of the three main tasks of the recent BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenges, a community-wide effort for evaluating biological text mining and information retrieval systems (Hirschman, Yeh et al. 2005; Krallinger, Leitner et al. 2008). Three subtasks have been specified: (i) detection of documents relevant to protein-protein interactions (interaction article subtask, IAS), (ii) identification of sentences with protein-protein interactions (interaction sentences subtask,
ISS), and (iii) identification of interacting protein pairs (interaction pair subtask, IPS). A related problem, the protein interaction method subtask (IMS), is concerned with identifying the type of experimental data used to determine an interaction. Approaches to these subtasks range from supervised machine learning classifiers for the first subtask to statistical language processing and grammar-based methods for the second and third subtasks.

A simple approach to extracting protein-protein interactions is to determine the co-occurrence of proteins in the same sentence (Stephens, Palakal et al. 2001; Hoffmann and Valencia 2005). However, this approach is insufficient to handle the structured information in biomedical sentences. Therefore, pattern matching methods have been proposed that rely either on manually defined patterns (Leroy and Chen 2002; Corney, Buxton et al. 2004) or on patterns that are automatically generated using dynamic programming (Huang, Zhu et al. 2004; Hao, Zhu et al. 2005).

Another popular group of methods employs natural language processing parsers. A basic approach, called shallow parsing, decomposes sentences into non-overlapping fragments, or chunks, and defines the dependencies between the chunks without extracting their internal structure (Thomas, Milward et al. 2000; Leroy, Chen et al. 2003). Many shallow parsing approaches employ finite-state automata to recognize the interaction relationships between proteins or genes (Thomas, Milward et al. 2000; Leroy, Chen et al. 2003). One of the most prominent approaches relies on deep parsing techniques, where the entire structure of a sentence is extracted (Park, Kim et al. 2001; Ding, Berleant et al. 2003; Daraselia, Yuryev et al. 2004; Pyysalo, Ginter et al. 2004; Kim, Shin et al. 2008; Miyao, Sagae et al. 2009). Many deep parsing approaches have successfully employed link grammars (Sleator and Temperley 1995), context-free grammars that rely on a
dictionary of rules (linking requirements) to connect, or “link”, pairs of related words (Ahmed, Chidambaram et al. 2005; Seoud, Youssef et al. 2008; Yang, Lin et al. 2009).

Each of the above methods, while directly addressing the second and third subtasks, can also solve the abstract classification problem of the first subtask, based on whether or not the method is able to extract any protein-protein interactions. The accuracy of such classification, however, depends on the accuracy of the more difficult subtask of protein-protein interaction extraction. Thus, several methods have been developed to directly address the binary classification of publications as relevant or not relevant to protein-protein interactions (Marcotte, Xenarios et al. 2001; Calli 2009; Kolchinsky, Abi-Haidar et al. 2010). These methods rely primarily on supervised and unsupervised feature-based classification techniques. Recently, the first method for classification of HPI-relevant documents has been introduced, which employs a Support Vector Machine (SVM) supervised classifier (Yin, Xu et al. 2010).

New approaches to detection and mining host-pathogen interactions from biomedical abstracts

HPI literature mining is related to the general problem of protein-protein interaction literature mining. However, the additional requirement that the interaction occurs exclusively between host and pathogen proteins makes the task more challenging. The accuracy of an HPI mining method will depend on additional factors, such as its ability to correctly assign a host or pathogen organism to each interacting protein. Similar to the way the BioCreAtIvE initiative defines three types of protein-protein interaction mining problems (Hirschman, Yeh et al. 2005), the problem of HPI mining can be split into three specific tasks:

HPI Mining Task 1: Given a biomedical publication (a paper or an abstract), determine whether or not it contains information on HPIs.
HPI Mining Task 2: Given a biomedical publication containing HPI information,
determine specific sentences that contain this information.
HPI Mining Task 3: Given a biomedical publication that contains HPI information, determine specific pairs of host and pathogen proteins participating in the interactions and the corresponding organisms.

The first task can be formulated as a standard classification problem, which is often addressed by machine learning methods and for which a number of assessment protocols have been developed. Here we rely on the following five basic measures. The first measure, accuracy, is calculated as $f_{AC} = (N_{TP} + N_{TN}) / N$, where $N_{TP}$ and $N_{TN}$ are the numbers of true positives and true negatives, respectively, and $N$ is the number of classified abstracts. The other two related measures, precision and recall, are calculated as $f_{PR} = N_{TP} / (N_{TP} + N_{FP})$ and $f_{RE} = N_{TP} / (N_{TP} + N_{FN})$, respectively, where $N_{FP}$ and $N_{FN}$ are the numbers of false positives and false negatives. The F-score is calculated as $F = 2 f_{PR} f_{RE} / (f_{PR} + f_{RE})$. The last measure, the Matthews correlation coefficient (MCC), is calculated as $MCC = (N_{TP} N_{TN} - N_{FP} N_{FN}) / \sqrt{(N_{TP} + N_{FP})(N_{TP} + N_{FN})(N_{TN} + N_{FP})(N_{TN} + N_{FN})}$.

Similarly, performance on the last task can be easily assessed based on the available information about the host and pathogen proteins and their respective organisms. Specifically, we use four different measures. The first two measures, $f_{ORG}$ and $f_{PRT}$, address the accuracy of detecting the pairs of interacting host and pathogen organisms and their proteins, respectively. Each measure is calculated as the percentage of correctly detected pairs of organisms/proteins relative to the total number of pairs. The other two measures, $g_{ORG}$ and $g_{PRT}$, account for the partial detection of HPI information, when at least one of the two organisms or proteins is detected. Both measures are defined as the percentage of detected organisms/proteins relative to the total number of organisms/proteins in all HPIs.

Unfortunately, evaluating a
method’s performance for the second task is more challenging, since the HPI data are often (i) scattered across multiple sentences and (ii) redundant (for instance, the same interaction between two proteins can be mentioned in several sentences). The method assessment for the second task becomes even more challenging when multiple HPIs are present in the same abstract.

We next introduce several strategies that address the above tasks for PubMed biomedical abstracts (here and below, we always consider an abstract of a biomedical publication together with the publication’s title; the latter often provides important information on HPIs). One of the main reasons for extracting HPI information from abstracts rather than entire papers is that, for many papers, the abstract is the only information that is freely available in PubMed. The first strategy is to rely on the existing methods for mining protein-protein interactions, followed by additional post-processing to filter out the intra-species interactions. Another approach employs the language-based methods traditionally used in protein-protein interaction literature mining. The last approach introduces a supervised-learning feature-based methodology, which has recently emerged in the area of biomedical literature mining. While each of the approaches is applicable to each of the three tasks, here we focus on assessing their performance for the first and third tasks.

4.1 Data collection

Collecting accurate, unbiased, non-redundant data on HPIs is a critical step for efficient training of a supervised method as well as for an accurate assessment of any literature mining approach. Both the positive set (abstracts containing HPI information) and the negative set (abstracts that do not contain HPI information) were manually selected and annotated. To obtain the set of potential candidates for the positive and negative sets, we combined searches of the existing HPI databases and of the PubMed database.
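Since the curated sets serve to score each approach, it may help to see the five Task 1 measures defined above written out in code. The following is a minimal illustrative sketch, not code from the chapter: the function name `classification_measures`, its argument names, and the convention of returning 0.0 when a denominator is zero are all assumptions made for this example.

```python
import math


def classification_measures(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, F-score and MCC from confusion counts.

    Assumption for this sketch: any measure whose denominator is zero
    is reported as 0.0 rather than raising an error.
    """
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n if n else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # F-score: harmonic mean of precision and recall.
    pr_sum = precision + recall
    f_score = 2 * precision * recall / pr_sum if pr_sum else 0.0

    # Matthews correlation coefficient.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "mcc": mcc,
    }


# Example: 3 true positives, 44 true negatives, 0 false positives, 41 false negatives.
measures = classification_measures(tp=3, tn=44, fp=0, fn=41)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

The example counts are the ones implied by the naïve-approach results in Section 4.2 (44 positive and 44 negative test abstracts, 41 false negatives, no false positives). With them, the sketch gives an accuracy of 0.53, a precision of 1.00, a recall of 0.07, an F-score of 0.13, and an MCC of 0.19, in line with the values reported there.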
Our positive set consisted of 175 HPI-containing abstracts that include human and non-human hosts. The abstracts containing human-pathogen interactions were collected by searching and manually curating abstracts from PIG, a database of host-pathogen interactions manually extracted from the literature (Driscoll, Dyer et al. 2009). For each abstract, we required the presence of organism and protein names for both the host and the pathogen, resulting in 89 abstracts. Unfortunately, in its current form, PIG only has abstracts with annotated human-pathogen interactions. Therefore, to obtain the list of interactions between non-human hosts and their pathogens, we searched PubMed using an extensive query. We required the presence in the same abstract of (i) at least one (non-human) host name, (ii) at least one pathogen name, and (iii) at least one interaction keyword. We then manually selected from the list another 86 abstracts that contained HPI information, adding them to the positive set.

To obtain candidates for the negative set, we followed an almost identical search strategy, using the same PubMed query but adding ‘human’ to the list of host names. We again manually selected the abstracts to ensure that they did not have any HPI information, even though they contained the important keywords. Note that it is significantly harder for a computational approach to distinguish between the abstracts from this negative training set and those from the positive set than it would be with a negative training set consisting of abstracts randomly chosen from PubMed. As a result, we selected 175 abstracts where no HPI information was found, although some of the abstracts included information on intra-species protein-protein interactions. The list of manually curated positive and negative sets of PubMed abstracts can be found at: http://korkinlab.org/datasets/philm/philm_data.html

4.2 A naïve approach based on literature mining of
protein-protein interactions

In a simple naïve approach, we first establish whether an abstract contains any information on a protein-protein interaction using the existing state-of-the-art literature mining methods, and then extract the pair of interacting proteins (Fig. 1A). We rely on the PIE system, which integrates natural language processing and machine learning methods to determine the sentences that contain protein-protein interactions in a PubMed abstract and to extract the corresponding protein names and interaction keywords (Kim, Shin et al. 2008). Next, for each interacting protein we identify its corresponding organism by applying the NLProt protein/gene tagging software (Mika and Rost 2004). NLProt uses a number of techniques, such as dictionary search, rule-based detection, and feature-based supervised learning, to extract the names of proteins and genes and tag them using SWISSPROT or TrEMBL identifiers (Boeckmann, Bairoch et al. 2003). The method also predicts the most likely organisms associated with these proteins/genes. It was reported to have a precision of 75% and a recall of 76% on detecting protein/gene names (Mika and Rost 2004). Finally, for each sentence identified by the PIE system as containing a protein-protein interaction, we determine whether this interaction is an HPI. Specifically, if the two proteins forming a protein-protein interaction belong to different organisms, and these organisms can be assigned the host and pathogen roles, then the interaction is classified as an HPI. To assign the host-pathogen roles, we use our manually curated dictionaries of host and pathogen organism names (Table 1).

We assessed the naïve approach by applying it to our testing set of 88 abstracts, 44 positive and 44 negative examples. For the classification of HPI-containing abstracts (Task 1), the obtained accuracy was 0.53, precision was 1.0, and recall was 0.07; F-score and Matthews Correlation Coefficient were
0.13 and 0.19, respectively. We found that the method almost completely failed to detect the abstracts containing HPI information; the contribution to the accuracy came primarily from the true negative hits, which comprised all 44 abstracts from the negative testing set. Interestingly, both the high precision and the low recall could be attributed to the same property of the naïve approach: it failed to accurately detect the protein-protein interactions. Indeed, none of the 41 false negatives was due to the approach’s failure to assign the host and pathogen roles to the identified organisms; all were due to its failure to identify a protein-protein interaction in the abstract.

It is also not surprising that the naïve approach performed poorly when addressing Task 3: the method was able to detect only two proteins out of 44 protein pairs and none of the 44 pairs of organisms, resulting in the only non-zero score of $g_{PRT}$ = 0.02; the other three scores, $f_{ORG}$, $f_{PRT}$, and $g_{ORG}$, were equal to zero.

Fig. 1. Three HPI literature mining approaches. (A) Naïve approach. (B) Language-based approach. (C) Feature-based supervised machine learning approach.

Dictionary name          N     Examples
Interaction keywords     54    Interact, associate, bind
Experimental keywords    28    Yeast two-hybrid, chemical crosslinking
Negation keywords        11    Not, neither, inability
HPI specific keywords    17    Virulence, effectors, infection
Host names               309   Host, plant, human
Pathogen names           349   Listeria monocytogenes, Hepatitis virus

Table 1. Dictionaries of keywords used by all three approaches. N is the number of unique entries for each dictionary.

4.3 A language-based approach

Our second approach is inspired by the language-based methods in biomedical text mining, which are also widely used in mining protein-protein interactions. In HPI text mining, we are faced with additional challenges such as correctly associating the
organism name with each protein, ensuring that the extracted interaction is an inter-species and not an intra-species interaction, and combining the information about an HPI from multiple sentences. These additional challenges necessitate adding new modules to the computational pipeline of our approach, compared with a pipeline for extracting general protein-protein interactions. The HPI mining pipeline consists of the following steps (Fig. 1B): (1) text preprocessing; (2) entity tagging, where we identify protein/gene and organism names; (3) grammar parsing, where we parse the input text into dependency structures; (4) anaphora resolution, where we resolve the references made by pronouns; (5) syntactic extraction, where we split a complex sentence into simple ones; (6) role matching, where we identify semantic roles in each simple sentence; (7) interaction keyword tagging; and (8) extraction of the actual HPI information. We note that this approach directly addresses Tasks 2 and 3 by finding the sentences containing HPI information and extracting the corresponding pairs of host and pathogen organisms and the interacting proteins/genes. Task 1 is addressed by classifying each abstract based on whether at least one HPI with complete information was extracted from the abstract’s text.

Entity tagging
The entity tagging module identifies named entities in an abstract, such as protein/gene names and the corresponding organism names. For a language-based text mining approach, it is critical that all named entities are accurately identified. Thus, our language-based approach for HPI literature mining has the most elaborate entity tagging module of all three approaches introduced here. Specifically, the module includes three stages: (i) protein/gene name tagging using NLProt, (ii) host/pathogen organism dictionary match, and (iii) post-processing. First, we apply the NLProt tagger to identify the names of all proteins/genes occurring in the text and the corresponding organism names (Mika
and Rost 2004). We note that when a protein with the same name exists in multiple species, NLProt assigns the most likely organism to each occurrence of this protein. Second, we find a UniProt accession number (Bairoch, Apweiler et al. 2005) for each identified protein and then group the proteins/genes with the same accession number into a protein/gene entity. Third, we search for the organisms missed by NLProt using expanded versions of our host and pathogen organism dictionaries, which include synonyms for each organism name and group the organisms under NCBI Taxonomy IDs (Wheeler, Barrett et al. 2006). Since NLProt may not identify all organisms in the abstract, our module rescans the abstract text to find the remaining host and pathogen organisms. Finally, the system revisits the entity tagging module after the next module, link grammar parsing, provides the internal structure of the sentences in terms of their basic units, phrases. The idea is that we can use the internal sentence structure to (i) find additional host/pathogen information that is not present in the dictionary, and (ii) reassign a protein/gene name to its correct organism, if needed. This stage plays an important role in the entity tagging module, since our host and pathogen dictionaries are potentially incomplete (not all organisms provided by NLProt may be covered); in addition, the dictionaries overlap with each other (the same organism can be both a host and a pathogen). If an organism name suggested by NLProt for a protein is not found in our dictionary, the entity tagging module nevertheless tries to assign the organism’s role as a host or pathogen. It does so by searching for generic keywords (such as “host”, “pathogen”, “pathogenic”, “pathogenesis”, etc.) in each phrase containing the organism name. Similarly, the module checks the organism name suggested by NLProt for a protein/gene by
identifying the organism’s name in the phrase that contains the protein/gene name. To do so, the module relies on two search patterns: (P1) organism name + protein name (e.g., “Arabidopsis RIN4 protein”); (P2) protein name + preposition + organism name (e.g., “RXLX of human”). The newly obtained information about the organism assignment then replaces the current suggestions provided by NLProt. For instance, in the phrase “the Arabidopsis RIN4 protein”, NLProt associates RIN4 with a pathogenic organism, while the dictionary search matches Arabidopsis as a host organism and identifies this phrase as pattern P1. Therefore, Arabidopsis is assigned as the organism for the RIN4 protein, and RIN4 is then correctly assigned as a host protein.

Link grammar parsing
In our next module, we use natural language processing methods to determine the intrinsic structure of each sentence in the abstract. In our approach, all grammatical constructions are based on the link grammar, a context-free grammar that relies on the dependency structure of natural language (Sleator and Temperley 1995). In link grammar, every word has a linking requirement, which specifies which types of other words or phrases can link to it. Two words can only be linked if their linking requirements match. A link is represented as an arc above the two words (Fig. 2). The linking requirements are organized into a dictionary that the grammar parser refers to when analyzing a sentence. The principal structure in link grammar is the linkage, a set of links that completely connects all words in a sequence. Such a sequence of words is called a link grammar sentence if it satisfies three conditions: (i) the links do not cross (planarity), (ii) each word is connected to at least one other word by a link (connectivity), and (iii) the linking requirements for each word in the sentence are not violated (satisfaction). For example, the linkage for the sentence “Avirulence protein B targets the Arabidopsis RIN4 protein” is shown in Fig. 2. In total, the link
grammar has 107 main links, each of which can derive many sub-links. We implemented the module using the open source link grammar parser from the AbiWord project (http://www.abisource.com/projects/link-grammar/). This project implements the original link grammar (Sleator and Temperley 1995), combining it with additional features such as an adaptation of the parser to the biomedical sublanguage, BioLG (Pyysalo, Salakoski et al. 2006), and an English-language semantic dependency relationship extractor, RelEx (Fundel, Kuffner et al. 2007).

Fig. 2. Internal sentence structure annotated by a link grammar parser for an HPI-relevant sentence. Words are labelled with part-of-speech tags: n (noun) and v (verb). A link between two words can be formed to specify a dependency relation. Each dependency type has its own unique label: AN, GN, Ss, Os, D*u, G.

Anaphora resolution
In the anaphora resolution module, we determine the semantic meaning of pronouns (it, they, he, she) and other language structures in the sentences. Unlike the case of intra-species protein-protein interactions, the information on HPIs often spans multiple sentences, with pronouns often replacing the names of organisms or proteins/genes. Therefore, to extract the complete information on an HPI, it is critical to have an accurate anaphora resolution module. The module relies on the RelEx anaphora resolution method, which employs Hobbs’ pronoun resolution algorithm (Hobbs 1978). For example, in the sentence “The Pseudomonas syringae type III effector protein avirulence protein B (AvrB) is delivered into plant cells, where it targets the Arabidopsis RIN4 protein”, the anaphora resolution module resolves ‘it’ as ‘The Pseudomonas syringae type III effector protein avirulence protein B (AvrB)’.

Syntactic extraction
Our syntactic extraction module splits each sentence into one or more simple sentences, where a simple sentence consists of four components organized into the following structure:
Subject (S) + Verb (V) + Object (O) + Modifying phrase of verb (M)

The module is built on the automated extractor IntEx (Ahmed, Chidambaram et al. 2005); it scans a sentence to find all links of the following four types. The first type, S-link, connects a subject to a verb, where the subject is located before the verb in the sentence. The second type, RS-link, connects a verb to a subject, i.e., the subject is located after the verb in the sentence. The third type, O-link, connects a verb to an object. Finally, the fourth type, MV-link, connects a verb to a modifying phrase. The module first determines the beginning of each simple sentence, which can be either an S-link or an RS-link. Following each verb from an S- or RS-link, the module determines the verb range by including all possible verb phrases, adverb phrases, or adjective phrases before and after the verb. Finally, for each simple sentence the module determines the objects and modifying phrases for the verb in the corresponding verb range by identifying possible O-links and MV-links. For example, the module splits the sentence “The Pseudomonas syringae type III effector protein avirulence protein B (AvrB) is delivered into plant cells, where it targets the Arabidopsis RIN4 protein” into two simple sentences: “The Pseudomonas syringae type III effector protein avirulence protein B (AvrB) is delivered into plant cells” and “The Pseudomonas syringae type III effector protein avirulence protein B (AvrB) targets the Arabidopsis RIN4 protein”.

Interaction keyword tagging
In this module, the interaction keywords are tagged by searching (i) our manually curated dictionary of interaction keyword stems, to reduce the search time, and (ii) the lexical database WordNet, which contains nouns, verbs, adjectives, and adverbs grouped by semantic concepts, and which uses a morphological function to infer the stem of a word (Fellbaum 1998). In the
previous example, the module identifies the interaction keywords found in our dictionary: “delivered” (the stem is “deliver”) and “targets” (the stem is “target”).

Role type matching
In this module, we specify the role of each syntactic component depending on whether the component contains complete information about an HPI. Here, we consider three types of roles: elementary, partial, and complete. A component of the elementary type is defined to be a host entity, a pathogen entity, or an interaction keyword. A component of the partial type includes any two distinct components of the elementary type. Finally, a syntactic component of the complete type includes components of all three elementary types.

Interaction extraction
Once the role of each syntactic component is identified, the components are searched against a set of interaction patterns. We first select components of the complete type, since they contain complete information about an HPI occurring between two proteins/genes. Next, we combine the elementary and partial components such that they provide the complete HPI information. An interaction pattern is defined as LS = RS. The left side (LS) is used to match the complete type from syntactic component(s), and the right side (RS) is used to extract the interaction information from each component. For example, the pattern SVO = PIH indicates that if a simple sentence includes three components, each of elementary type (subject, verb, and object), then the sentence contains (i) a pathogen entity in the subject, (ii) an interaction keyword in the verb, and (iii) a host entity in the object. Note that both sides include a matching part, S-V-O. In this work, we considered the following seven matching parts for our patterns: S-V-O, S-O, S-V-M, S-M, S, O, and M (for abbreviations, see the Syntactic extraction subsection). In addition to the above patterns, we use a set of three template-based filters that allow us to remove those simple sentences that, although they satisfy an interaction
pattern, do not have a semantic connection between the host entity, pathogen entity, and interaction keyword. The introduced templates are similar to those employed by RelEx:
Pattern 1: A + interaction verb + B
Pattern 2: Interaction noun + ’between’ + A + ’and’ + B
Pattern 3: Interaction noun + ’of’ + A + ’by’ + B
where an interaction keyword can be either the interaction verb or the interaction noun.

Interaction normalization
When mining HPI information from the literature, there are several sources of ambiguous information. First, there may be multiple HPIs in the same abstract. Second, the information about a single HPI may be spread over multiple sentences. Finally, the sentences may contain duplicate information about the same HPI. Our last module ensures that all sentences containing duplicate HPIs are accounted for and each HPI is reported only once. To do so, we first extract all HPIs and then determine the duplicate pairs. We define two HPIs as duplicates if they have the same host entity and the same pathogen entity. We note that two duplicate HPIs may still have different interaction keywords. To detect duplication in HPIs, the module refers to the normalized protein/gene names (in terms of UniProt accession numbers) and organism names (in terms of taxonomy IDs) obtained in the entity tagging module.

Performance of the language-based approach
To allow comparison with the feature-based approach, the language-based approach was evaluated using the same testing set of 44 positive and 44 negative examples. We first assessed the method’s performance in addressing Task 1. The method was able to classify the abstracts with 0.65 accuracy, 0.84 precision, and 0.36 recall. The F-score and Matthews correlation coefficient were 0.51 and 0.36, respectively. The performance of the approach on the more difficult Task 3 was significantly better than that of the naïve approach, especially in partial predictions: fORG = 0.18, fPRT = 0.14, gORG = 0.25, and gPRT =
0.25. With the pre-calculated NLProt annotation, the average running time of the system on a single abstract was 36.3 sec on a 2.4 GHz Intel workstation. The computationally most expensive module, link grammar parsing, used 99.95% of the total running time.

4.4 A feature-based machine learning approach
The basic idea behind the feature-based approach introduced here is to extract a set of characteristic features that provide sufficient information for discriminating between an abstract that contains HPI information and one that does not. Using a training set of pre-annotated abstracts, the system can then learn how to efficiently discriminate between these two abstract types. Moreover, the same characteristic features can be calculated for the individual sentences in the abstract. Thus, we can use the same supervised-learning approach to solve Tasks 1 and 2. Finally, to solve Task 3 one can use a simple dictionary-based search for each sentence classified as containing HPI information. Our feature-based approach consists of four basic stages (Fig. 1C). First, each abstract is preprocessed to find each protein/gene in the abstract and identify its organism name. Second, for each abstract a feature vector is generated. Third, our supervised learning system is trained by providing the feature vectors generated from the positive and negative sets. Finally, the trained system is used on an independent testing set of HPI and non-HPI abstracts to assess the approach.

Text preprocessing
We first add the publication title to the abstract as its first sentence. The abstract is then split into individual sentences by detecting the sentence termination patterns. A basic pattern of a period (.) followed by a space and a capitalized letter can be directly used to distinguish sentences in a standard text. However, there are known challenges when preprocessing a biomedical (or any scientific) publication. For instance, the above simple approach is not always applicable, since the periods
are often used in the names of proteins and in abbreviations such as “i.e.”, “e.g.”, “vs.”, and others. We first identify such cases using a predefined dictionary, replace the periods in these words by spaces, and then apply the above basic pattern. The next step of the preprocessing stage is detecting the organism and protein/gene names using the entity tagging software NLProt (Mika and Rost 2004).

Support vector machines in text categorization
The problem of detecting whether an abstract contains HPI information can be formulated as a problem of supervised text categorization, with the goal of classifying abstracts into one of the selected categories. In our case, two categories can be naturally defined: (i) abstracts containing HPI information and (ii) abstracts without HPI information. Formally, given a training set of $n$ objects, each represented as a vector of $N$ numerical features, $\mathbf{x}_i = (x_1, x_2, \ldots, x_N)$, and their classification into one of the two classes, $y \in \{-1, 1\}$, the goal is to train a feature-based classifier on the training set. After the training stage is completed, the classifier can assign a class label $y$ to any new abstract $\mathbf{x}$. In our approach, we use support vector machines (SVM) (Vapnik 1998), a supervised learning method that is well established in bioinformatics and has recently been applied to identify abstracts containing host-bacteria interaction information (Yin, Xu et al. 2010). The basic type of SVM that addresses this problem is a linear classifier defined by its discriminant function:

$$f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b, \qquad \langle \mathbf{w}, \mathbf{x} \rangle = \sum_{i=1}^{N} w_i x_i,$$

where $\mathbf{w}$ is a weight vector and $b$ is a bias term (Vapnik 1998). Geometrically, the problem can be described as finding the decision boundary, a hyperplane that separates two sets of points corresponding to the sets of positive and negative examples. To do that, we maximize the margin defined by the closest to the hyperplane positive and
negative examples. An optimal solution can be found by solving a related quadratic optimization problem. The problem is further generalized by introducing soft margins, allowing the classifier to misclassify some points. The general optimization problem is often formulated in its dual form:

$$\text{minimize} \quad \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle - \sum_{i=1}^{n} \alpha_i$$

$$\text{subject to:} \quad \sum_{i=1}^{n} y_i \alpha_i = 0, \qquad 0 \le \alpha_i \le C \quad \text{for } i = 1, 2, \ldots, n,$$

and the discriminant function is defined as:

$$f(\mathbf{x}) = \sum_{i=1}^{n} y_i \alpha_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b.$$

Examples from the training set for which $\alpha_i > 0$ are called support vectors. The formalism can be further extended by introducing non-linear classifiers defined using kernel functions, $K(\mathbf{x}, \mathbf{x}')$, similarity measures that replace the standard inner product $\langle \mathbf{x}, \mathbf{x}' \rangle$. In our approach, we applied and compared two widely used non-linear kernel functions: the polynomial kernel, $K_P(\mathbf{x}, \mathbf{x}') = (\langle \mathbf{x}, \mathbf{x}' \rangle + 1)^d$, where $d$ is the degree of the polynomial, and the Gaussian radial basis function (RBF), $K_G(\mathbf{x}, \mathbf{x}') = \exp(-\lVert \mathbf{x} - \mathbf{x}' \rVert^2 / c)$. Both kernels are implemented using libsvm, a freely available SVM software package (Chang and Lin 2001).

Feature vectors
One approach to generating a descriptive set of features for an abstract is to calculate the frequencies of occurrence of individual words (unigrams) as well as word pairs (bigrams) from a biomedical text corpus (Yin, Xu et al. 2010). While these features can provide important information on word usage, the number of features depends strongly on the size of the corpus and can easily reach thousands. In our approach, we propose to use a simpler 12-dimensional feature vector representation, $\mathbf{x} = (x_1, x_2, \ldots, x_{12})$, focusing on quantifying the information directly related to host-pathogen interactions. Features $x_1$ and $x_2$ quantify the presence of host and pathogen protein or gene names in the abstract and are calculated based on the protein/gene entity tagging obtained by NLProt (Mika and Rost 2004). Each protein is classified as a host or pathogen protein based on the source organisms extracted either from the NLProt
tagging results or directly from the abstract by searching against our dictionary of host and pathogen organisms (Table 1). The dictionary was built using the set of organisms extracted from several databases (Winnenburg, Urban et al. 2008; Driscoll, Dyer et al. 2009; Kumar and Nanduri 2010) and by adding generic keywords, such as “pathogen”, “host”, “plant”, etc. Similarly, features x3 and x4 specify the number of occurrences of the host and pathogen organism names. These features are defined using the NLProt-based organism annotation and the dictionary of host and pathogen organisms. Binary feature x5 specifies the presence or absence of general protein-protein interaction keywords in the abstract. It is obtained by scanning the extended abstract against our interaction keyword dictionary (Table 1). Features x6 and x7 describe additional statistics on protein-protein interaction keyword occurrences. The former is defined as the percentage of interaction keywords in the total number of words in the abstract. The latter is defined as the percentage of sentences containing interaction keywords in the total number of abstract sentences. Feature x8 is calculated based on the cumulative keyword typicality for each abstract. We define the typicality of a keyword as the percentage of abstracts in the training set containing this keyword. Feature x8 is calculated as the sum of typicalities for all protein-protein interaction keywords in a given abstract. Our next feature, x9, quantifies the amount of experimental evidence used to support the HPI and is defined as the total number of experimental keywords in the abstract, where each keyword is detected by scanning the abstract against our dictionary of experimental keywords (Table 1). Some abstracts report the absence of an interaction between host and pathogen proteins. Determining the absence of interaction in an abstract by a feature-based approach is difficult, since such an
abstract is likely to contain information similar to that of an abstract describing a true HPI. One of the key differences between these abstracts is the presence of negation keywords in the former. Feature x10 accounts for such keywords and is defined as the percentage of negation keywords in the total number of words in the abstract. Similar to other keywords, these keywords are identified using our dictionary of collected negation keywords (Table 1). A related feature, x11, estimates whether a negation keyword is related specifically to the information on protein-protein interaction in the abstract. The feature is defined as the number of words between the negation keyword and the closest interaction keyword in a sentence. The last feature, x12, accounts for the HPI-specific keywords, such as virulence, effectors, factors, etc., determined using the corresponding dictionary (Table 1). It is calculated as the percentage of such keywords in the total number of words in the abstract.

Supervised training and HPI detection using SVM
The trained SVM classifier is applied in our method twice. First, it is applied to the abstracts to identify those containing HPI information (Task 1). Second, it is applied to the individual sentences to determine those that contain this HPI information (Task 2). When the classifier is applied to a sentence, we generate a 12-dimensional feature vector based solely on the information in this sentence and use it as an input to the SVM classifier. Once the sentences containing HPIs are identified, we use the dictionaries of host and pathogen organisms combined with the protein/gene names to find the pairs of host and pathogen organisms and the corresponding proteins/genes (Task 3). The accuracy of an SVM-based classifier can generally be improved by optimizing a number of parameters during the training stage. The error cost parameter, C, controls the tradeoff between allowing training errors and forcing rigid margins. In our approach we select the cost parameter
and another parameter, Gamma, by evaluating the accuracies of trained models for Task 1 using leave-one-out cross-validation. The values of C range up to 20 and the values of Gamma range from $2^{-10}$ to $2^{1}$. The set of parameters at which the SVM classifier reaches its maximum accuracy is selected as the final model. In addition, we optimize the degree of the polynomial when considering the polynomial kernel.

Assessment protocols
To assess the performance of the feature-based approach in abstract classification, we employ two benchmarking protocols. In the first protocol, the SVM model training is done on the training set and the assessment is performed exclusively on the testing set (Table 2). For the second protocol we use leave-one-out and 10-fold cross-validation on the training set.

Type                  Training    Testing
Negative              131         44
Positive-Human        67          22
Positive-Non-Human    64          22
Total                 262         88

Table 2. Testing and training sets of positive (HPI-relevant) and negative (HPI-irrelevant) abstracts. Testing data are used to evaluate all three approaches, and training data are used for SVM learning in the feature-based approach. The abstracts are extracted from the PubMed database and manually curated.

Performance of the feature-based approach
During the leave-one-out cross-validation, an SVM model with the polynomial kernel and parameter values C = 2 and Gamma = 0.0175 was found to be the most accurate in the abstract classification problem (Table 3). The polynomial kernel was also the most accurate SVM model across both assessment protocols. In addition, this SVM model had the highest recall value, with the precision approaching its highest value. Overall, the performance of all three SVM kernels, across all evaluation protocols, was similar. The performance of the feature-based approach on Task 3 was slightly better than that of the language-based approach in partial predictions: gORG = 0.39 and gPRT = 0.35.
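The classification measures reported throughout this section (accuracy, precision, recall, F-score, and Matthews correlation coefficient) all follow from the four confusion-matrix counts. The following is a minimal Python sketch of these standard definitions, not the evaluation code used in this work; the function name `classification_metrics` is hypothetical. As a check, it uses counts for the naïve approach inferred from the figures reported above (44 true negatives and 41 false negatives, hence 3 true positives out of 44 positive abstracts, and no false positives given a precision of 1.0); these inferred counts are an assumption.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification measures from confusion counts.

    Hypothetical helper for illustration; the definitions are the textbook ones.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (e.g. no positive predictions at all).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "mcc": mcc,
    }


# Counts inferred (assumption) from the naive approach on the 88 test abstracts.
naive = classification_metrics(tp=3, fp=0, tn=44, fn=41)
for name, value in naive.items():
    print(f"{name}: {value:.2f}")
```

Under these assumed counts the rounded values agree with the figures reported for the naïve approach on Task 1 (0.53, 1.0, 0.07, 0.13, and 0.19), which illustrates how a classifier with only three positive predictions can reach perfect precision while its recall, F-score, and correlation coefficient remain low.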
In complete pair predictions, however, the performance was worse: fORG = 0.07 and fPRT = 0.07. The SVM classifier was efficient, taking only 0.003 sec to classify 92 abstracts on a 2.66 GHz Intel Xeon (Quad) workstation. However, the high efficiency of this approach was offset by a significantly slower protein tagging component that was done using NLProt and took ~18 on the same workstation to tag proteins in 262 abstracts from the dataset.

Protocol    fAC    PR     RE     AUC     F-score
10-fold     72%    73%    71%    0.78    0.72
Test        66%    69%    60%    0.72    0.64
LOO         71%    72%    72%    0.78    0.71

Table 3. Evaluations of the feature-based classifier. LOO and 10-fold denote the leave-one-out and 10-fold cross-validation protocols applied to the models that are trained on the set of 262 abstracts. The Test protocol corresponds to the evaluation performed only on the testing set of 88 abstracts.

Conclusion
In this chapter, we discussed a new problem for biomedical literature mining, concerned with mining molecular interactions between host and pathogen organisms. Collecting HPI data is one of the very first steps towards studying and fighting infectious diseases. Creating an automated framework for extracting HPI information from the biomedical literature, including millions of abstracts publicly available in the PubMed database, is instrumental in completing this step. We formulated three key tasks of HPI literature mining and proposed three computational approaches that addressed these tasks: (i) a naïve approach, which was based on the existing protein-protein interaction mining methods, (ii) a language-based approach, which employed the link grammar, and (iii) a feature-based supervised learning approach, which relied on SVM methodology. Both the feature-based and the language-based approaches have been implemented in the PHILM (Pathogen-Host Interaction Literature Mining) web-server, accessible at http://korkinlab.org/philm.html. Several important conclusions can
be drawn from the comparative assessment of all three approaches. First, it became clear that, being a new problem in biomedical literature mining (and a more difficult one than mining general protein-protein interactions), HPI text mining required the development of new methods tailored to address the specifics of this problem. Indeed, for the first task the naïve approach performed with a disappointingly low accuracy of 53% and an F-score of just 13%, while the accuracy and F-score of the language-based approach were significantly higher, 65% and 51%, respectively; the feature-based method had even higher (10-fold) accuracy and F-score, 72% and 72%, respectively. We note that the accuracy of both the language-based and the feature-based approaches, even at this early stage, was comparable to that of the state-of-the-art protein-protein interaction mining methods (Krallinger, Leitner et al. 2008). In addition to its poor performance in the abstract classification task, the naïve approach completely failed to detect protein interaction pairs and organism pairs in the third task. The feature-based approach performed significantly better when detecting one of the interacting proteins or organisms, while still failing to accurately detect the complete pairs. It was not surprising that the highest accuracy in detecting both host-pathogen organism pairs and protein pairs was achieved by the most sophisticated, language-based approach. Second, the analysis of incorrectly classified abstracts and identified pairs of proteins and organisms supported our conclusion that increasing the accuracy of the name tagging system is pivotal to increasing the classification accuracy in both approaches. Finally, both the language-based and the feature-based approaches demonstrated good performance but in different tasks, which suggests that by integrating these two approaches one can obtain a system with a more accurate overall performance than either of the individual approaches.

Acknowledgment
We
acknowledge funding from the University of Missouri (Mizzou Advantage to DK), the National Science Foundation (DBI-0845196 to DK), and the Department of Education (GAANN Fellowship to SW).

References
WHO (2004). The world health report 2004: changing history. Geneva, World Health Organization.
Abramovitch, R. B., J. C. Anderson, et al. (2006). "Bacterial elicitation and evasion of plant innate immunity." Nat Rev Mol Cell Biol 7(8): 601-611.
Ahmed, S. T., D. Chidambaram, et al. (2005). IntEx: A Syntactic Role Driven Protein-Protein Interaction Extractor for Bio-Medical Text. Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, Detroit, Association for Computational Linguistics.
Anderson, R. M. and R. M. May (1979). "Population biology of infectious diseases: Part I." Nature 280(5721): 361-367.
Aranda, B., P. Achuthan, et al. (2010). "The IntAct molecular interaction database in 2010." Nucleic Acids Research 38(Database issue): D525-531.
Ayoola, E. A. (1987). "Infectious diseases in Africa." Infection 15(3): 153-159.
Bairoch, A., R. Apweiler, et al. (2005). "The Universal Protein Resource (UniProt)." Nucleic Acids Res 33(Database issue): D154-159.
Bartlett, J. G. (1997). "Update in infectious diseases." Annals of internal medicine 126(1): 48-56.
Blaschke, C. and A. Valencia (2001). "The potential use of SUISEKI as a protein interaction discovery tool." Genome informatics. International Conference on Genome Informatics 12: 123-134.
Boeckmann, B., A. Bairoch, et al. (2003). "The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003." Nucleic Acids Research 31(1): 365.
Bowers, J. H., B. A. Bailey, et al. (2001). "The impact of plant diseases on world chocolate production."
Plant Health Progress.
Calli, C. (2009). Prediction of protein-protein interaction relevance of articles using references. 24th International Symposium on Computer and Information Sciences (ISCIS 2009), Guzelyurt, IEEE.
Chang, C. and C. Lin (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Cherkasov, A. and S. J. Jones (2004). "An approach to large scale identification of nonobvious structural similarities between proteins." BMC Bioinformatics 5: 61.
Cherry, S. and N. Silverman (2006). "Host-pathogen interactions in drosophila: new tricks from an old friend." Nat Immunol 7(9): 911-917.
Chiang, J. H. and H. C. Yu (2003). "MeKE: discovering the functions of gene products from biomedical literature via sentence alignment." Bioinformatics 19(11): 1417-1422.
Corney, D. P., B. F. Buxton, et al. (2004). "BioRAT: extracting biological information from full-length papers." Bioinformatics 20(17): 3206-3213.
Daraselia, N., A. Yuryev, et al. (2004). "Extracting human protein interactions from MEDLINE using a full-sentence parser." Bioinformatics 20(5): 604-611.
Daszak, P., A. A. Cunningham, et al. (2000). "Emerging infectious diseases of wildlife-threats to biodiversity and human health." Science 287(5452): 443-449.
Davis, F. P., D. T. Barkan, et al. (2007). "Host pathogen protein interactions predicted by comparative modeling." Protein Sci 16(12): 2585-2596.
Ding, J., D. Berleant, et al. (2003). "Extracting biochemical interactions from MEDLINE using a link grammar parser."
Donaldson, I., J. Martin, et al. (2003). "PreBIND and Textomy mining the biomedical literature for protein-protein interactions using a support vector machine." BMC Bioinformatics 4: 11.
Doolittle, J. M. and S. M. Gomez (2011). "Mapping protein interactions between Dengue virus and its human and insect hosts." PLoS neglected tropical diseases 5(2): e954.
Driscoll, T., M. D. Dyer, et al. (2009). "PIG the pathogen interaction gateway."
Nucleic Acids Research 37(Database issue): D647-650 Mining Host-Pathogen Interactions 181 Driscoll, T., M D Dyer, et al (2009) "PIG the pathogen interaction gateway." Nucleic Acids Res 37(Database issue): D647-650 Dyer, M D., T M Murali, et al (2007) "Computational prediction of host-pathogen proteinprotein interactions." Bioinformatics 23(13): i159-166 Dyer, M D., C Neff, et al (2010) "The human-bacterial pathogen protein interaction networks of Bacillus anthracis, Francisella tularensis, and Yersinia pestis." PLoS One 5(8): e12089 Espinosa, A and J R Alfano (2004) "Disabling surveillance: bacterial type III secretion system effectors that suppress innate immunity." Cell Microbiol 6(11): 1027-1040 Evans, P., W Dampier, et al (2009) "Prediction of HIV-1 virus-host protein interactions using virus and host sequence motifs." BMC medical genomics 2: 27 Fehr, A., P Thurmann, et al (2006) "Editorial: drug development for neglected diseases: a public health challenge." Trop Med Int Health 11(9): 1335-1338 Fellbaum, C (1998) WordNet : an electronic lexical database Cambridge, USA, MIT Press Forst, C V (2006) "Host-pathogen systems biology." Drug Discov Today 11(5-6): 220-227 Friedman, C., P Kra, et al (2001) "GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles." Bioinformatics 17(Suppl 1): S74 Fundel, K., R Kuffner, et al (2007) "RelEx relation extraction using dependency parse trees." Bioinformatics 23(3): 365 Gilbert, D (2005) "Biomolecular interaction network database." Brief Bioinform 6(2): 194-198 Hao, Y., X Zhu, et al (2005) "Discovering patterns to extract protein-protein interactions from the literature: Part II." Bioinformatics 21(15): 3294-3300 Hirschman, L., A Yeh, et al (2005) "Overview of BioCreAtIvE: critical assessment of information extraction for biology." BMC Bioinformatics Suppl 1: S1 Hobbs, J (1978) "Resolving pronoun references." 
Lingua 44(4): 311-338 Hoffmann, R., M Krallinger, et al (2005) "Text mining for metabolic pathways, signaling cascades, and protein networks." Sci STKE 2005(283): pe21 Hoffmann, R and A Valencia (2005) "Implementing the iHOP concept for navigation of biomedical literature." Bioinformatics 21 Suppl 2: ii252-258 Hu, Z Z., M Narayanaswamy, et al (2005) "Literature mining and database annotation of protein phosphorylation using a rule-based system." Bioinformatics 21(11): 27592765 Huang, M., X Zhu, et al (2004) "Discovering patterns to extract protein-protein interactions from full texts." Bioinformatics 20(18): 3604-3612 Jaeger, S., S Gaudan, et al (2008) "Integrating protein-protein interactions and text mining for protein function prediction." BMC Bioinformatics Suppl 8: S2 Kahn, R A., H Fu, et al (2002) "Cellular hijacking: a common strategy for microbial infection." Trends Biochem Sci 27(6): 308-314 Kim, S., S Y Shin, et al (2008) "PIE: an online prediction system for protein-protein interactions from text." Nucleic Acids Research 36(Web Server issue): W411-415 Kolchinsky, A., A Abi-Haidar, et al (2010) "Classification of protein-protein interaction full-text documents using text and citation network features." IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM 7(3): 400-411 Krallinger, M., F Leitner, et al (2008) "Overview of the protein-protein interaction annotation extraction task of BioCreative II." Genome Biology Suppl 2: S4 182 Systems and Computational Biology – Molecular and Cellular Experimental Systems Krallinger, M and A Valencia (2005) "Text-mining and information-retrieval services for molecular biology." Genome Biol 6(7): 224 Kumar, R and B Nanduri (2010) "HPIDB-a unified resource for host-pathogen interactions." BMC Bioinformatics 11(Suppl 6): S16 Lee, H., G Yi, et al (2008) "E3Miner: a text mining tool for ubiquitin-protein ligases." 
Nucleic Acids Research 36(Web Server issue): W416 Lee, S A., C H Chan, et al (2008) "Ortholog-based protein-protein interaction prediction and its application to inter-species interactions." BMC Bioinformatics Suppl 12: S11 Lengeling, A., K Pfeffer, et al (2001) "The battle of two genomes: genetics of bacterial host/pathogen interactions in mice." Mamm Genome 12(4): 261-271 Leroy, G and H Chen (2002) "Filling preposition-based templates to capture information from medical abstracts." Pac Symp Biocomput: 350-361 Leroy, G., H Chen, et al (2003) "A shallow parser based on closed-class words to capture relations in biomedical text." Journal of biomedical informatics 36(3): 145-158 Mandell, G L and G C Townsend (1998) "New and emerging infectious diseases." Transactions of the American Clinical and Climatological Association 109: 205-216; discussion 216-207 Marcotte, E M., I Xenarios, et al (2001) "Mining literature for protein-protein interactions." Bioinformatics 17(4): 359-363 Matthews, L., G Gopinath, et al (2009) "Reactome knowledgebase of human biological pathways and processes." Nucleic Acids Research 37(Database issue): D619-622 Maurer, S M., A Rai, et al (2004) "Finding cures for tropical diseases: is open source an answer?" PLoS Med 1(3): e56 Mika, S and B Rost (2004) "NLProt: extracting protein names and sequences from papers." Nucleic Acids Res 32(Web Server issue): W634-637 Mika, S and B Rost (2004) "Protein names precisely peeled off free text." Bioinformatics 20(suppl 1): i241 Mitchell, J A., A R Aronson, et al (2003) "Gene indexing: characterization and analysis of NLM's GeneRIFs." AMIA Annual Symposium proceedings / AMIA Symposium AMIA Symposium: 460-464 Miyao, Y., K Sagae, et al (2009) "Evaluating contributions of natural language parsers to protein-protein interaction extraction." Bioinformatics 25(3): 394-400 Moran, G J., D A Talan, et al (2008) "Biological terrorism." 
Infect Dis Clin North Am 22(1): 145-187, vii Morse, S S (1995) "Factors in the emergence of infectious diseases." Emerg Infect Dis 1(1): 715 Munter, S., M Way, et al (2006) "Signaling during pathogen infection." Sci STKE 2006(335): re5 Park, J C., H S Kim, et al (2001) "Bidirectional incremental parsing for automatic pathway identification with combinatory categorial grammar." Pacific Symposium on Biocomputing Pacific Symposium on Biocomputing: 396-407 Pruitt, K D., T Tatusova, et al (2003) "NCBI Reference Sequence project: update and current status." Nucleic Acids Research 31(1): 34-37 Pyysalo, S., F Ginter, et al (2004) Analysis of link grammar on biomedical dependency corpus targeted at protein-protein interactions nternational Workshop on Natural Language Mining Host-Pathogen Interactions 183 Processing in Biomedicine and its Applications (JNLPBA), Association for Computational Linguistics Pyysalo, S., T Salakoski, et al (2006) "Lexical adaptation of link grammar to the biomedical sublanguage: a comparative evaluation of three approaches." BMC Bioinformatics 7(Suppl 3): S2 Rodriguez-Esteban, R (2009) "Biomedical text mining and its applications." PLoS Comput Biol 5(12): e1000597 Rossi, V and J Walker (2005) Assessing the Economic Impact and Costs of Flu Pandemics Originating in Asia Oxford: Abbey House, Oxford Economic Forecasting Group Russell, R B., F Alber, et al (2004) "A structural perspective on protein-protein interactions." Curr Opin Struct Biol 14(3): 313-324 Sansonetti, P (2002) "Host-pathogen interactions: the seduction of molecular cross talk." Gut 50 Suppl 3: III2-8 Santos, C and D Eggle (2005) "Wnt pathway curation using automated natural language processing: combining statistical methods with partial and full parse for knowledge extraction." Bioinformatics 21(8): 1653 Seki, K and J Mostafa (2005) "A hybrid approach to protein name identification in biomedical texts." 
Information Processing & Management 41(4): 723-743 Seoud, A., A Youssef, et al (2008) Extraction of protein interaction information from unstructured text using a link grammar parser, IEEE Shapira, S D., I Gat-Viks, et al (2009) "A physical and regulatory map of host-influenza interactions reveals pathways in H1N1 infection." Cell 139(7): 1255-1267 Shatkay, H., A Hˆglund, et al (2007) "SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data." Bioinformatics 23(11): 1410 Shoemaker, B A and A R Panchenko (2007) "Deciphering protein-protein interactions Part II Computational methods to predict protein and domain interaction partners." PLoS Comput Biol 3(4): e43 Sleator, D and D Temperley (1995) Parsing English with a Link Grammar Third International Workshop on Parsing Technologies, ACL/SIGPARSE Snyder, E E., N Kampanya, et al (2007) "PATRIC: the VBI PathoSystems Resource Integration Center." Nucleic Acids Research 35(Database issue): D401-406 Stebbins, C E (2005) "Structural microbiology at the pathogen-host interface." Cell Microbiol 7(9): 1227-1236 Stephens, M., M Palakal, et al (2001) "Detecting gene relations from Medline abstracts." Pac Symp Biocomput: 483-495 Sullivan, D E., J L Gabbard, Jr., et al (2010) "Data integration for dynamic and sustainable systems biology resources: challenges and lessons learned." Chemistry & biodiversity 7(5): 1124-1141 Tanabe, L., N Xie, et al (2005) "GENETAG: a tagged corpus for gene/protein named entity recognition." BMC Bioinformatics 6(Suppl 1): S3 Thomas, J., D Milward, et al (2000) "Automatic extraction of protein interactions from scientific abstracts." Pacific Symposium on Biocomputing Pacific Symposium on Biocomputing: 541-552 Thornton, J M (2001) "From genome to function." 
Science 292(5524): 2095-2097 184 Systems and Computational Biology – Molecular and Cellular Experimental Systems Trouiller, P., P Olliaro, et al (2002) "Drug development for neglected diseases: a deficient market and a public-health policy failure." Lancet 359(9324): 2188-2194 Trouiller, P., E Torreele, et al (2001) "Drugs for neglected diseases: a failure of the market and a public health failure?" Trop Med Int Health 6(11): 945-951 Tyagi, N., O Krishnadev, et al (2009) "Prediction of protein-protein interactions between Helicobacter pylori and a human host." Molecular bioSystems 5(12): 1630-1635 Vapnik, V N (1998) Statistical learning theory New York, Wiley Wheeler, D., T Barrett, et al (2006) "Database resources of the national center for biotechnology information." Nucleic Acids Research Whitby, S M (2001) "The potential use of plant pathogens against crops." Microbes Infect 3(1): 73-80 Williams, E S., T Yuill, et al (2002) "Emerging infectious diseases in wildlife." Revue scientifique et technique 21(1): 139-157 Winnenburg, R., M Urban, et al (2008) "PHI-base update: additions to the pathogen host interaction database." Nucleic Acids Research 36(Database issue): D572 Winnenburg, R., M Urban, et al (2008) "PHI-base update: additions to the pathogen host interaction database." Nucleic Acids Research 36(Database issue): D572-576 Xenarios, I., E Fernandez, et al (2001) "DIP: The Database of Interacting Proteins: 2001 update." Nucleic Acids Research 29(1): 239-241 Yang, Z., H Lin, et al (2009) "BioPPIExtractor: A protein-protein interaction extraction system for biomedical literature." Expert Systems with Applications 36(2): 2228-2233 Yin, L., G Xu, et al (2010) "Document classification for mining host pathogen proteinprotein interactions." Artificial Intelligence in Medicine 49(3): 155-160 Zanzoni, A., L Montecchi-Palazzi, et al (2002) "MINT: a Molecular INTeraction database." 
10

Prediction of Novel Pathway Elements and Interactions Using Bayesian Networks

Andrew P Hodges, Peter Woolf and Yongqun He
University of Michigan, Ann Arbor, MI, USA

1 Introduction

Signalling and regulatory pathways that guide gene expression have only been partially defined for most organisms. Given the increasing number of microarray measurements, it may be possible to reconstruct such pathways and uncover missing connections directly from experimental data. One major question in the area of microarray-based pathway analysis is the prediction of new elements of a particular pathway. Such prediction is possible by independently testing the effects of added genes or variables on the overall scores of the corresponding expanded networks. A general network expansion framework to predict new components of a pathway was suggested in 2001 (Tanay and Shamir, 2001). Many machine learning approaches for identifying hidden or unknown factors have appeared in the literature recently (Gat-Viks and Shamir, 2007; Hashimoto, et al., 2004; Herrgard, et al., 2003; Ihmels, et al., 2002; Needham, et al., 2009; Parikh, et al., 2010; Pena, et al., 2005; Tanay and Shamir, 2001; Yu and Li, 2005). Compared to existing pathway expansion methods based on correlation, Boolean, or other strategies (Hashimoto, et al., 2004; Herrgard, et al., 2003; Ihmels, et al., 2002; Tanay and Shamir, 2001), Bayesian network-based expansion methods provide distinct advantages. A Bayesian network (BN) is a representation of a joint probability distribution over a set of random variables (Friedman, et al., 2000). Bayesian networks are able to identify causal or apparently causal relationships (Friedman, et al., 2000), and can capture both linear and nonlinear relationships. Furthermore, BN analysis is robust to error and noise and easily interpretable by humans. Bayesian network-based expansion has been used for gene expression data analysis (Gat-Viks and Shamir, 2007; Pena, et al.).
We have recently developed an algorithm termed “BN+1”, which implements Bayesian network expansion to predict new factors and interactions that participate in a specific pathway (Hodges, et al., 2010; Hodges, et al., 2010). This algorithm has been tested using E. coli microarray data (Hodges, et al., 2010) and verified with a synthetic network (Hodges, et al., 2010). This chapter first reviews different computational methods for pathway element prediction, then introduces how a BN analysis is typically performed, and finally describes how the BN+1 algorithm works. We also introduce our MARIMBA software program (http://marimba.hegroup.org), which implements the BN+1 algorithm along with many other useful features. So far, the success of BN+1 in new pathway element prediction has been demonstrated in the prokaryotic E. coli system. This chapter introduces our new study applying BN+1 to predict new pathway elements for the eukaryotic B-cell receptor (BCR) pathway using high-throughput microarray data from perturbed B-cells obtained from the Alliance for Cellular Signalling (AfCS) (Zhu, et al., 2004). Finally, we present current challenges and possible future directions in this field.

2 Overview of different computational methods for prediction of new pathway elements

In this section, we describe several existing methods for pathway expansion. By pathway expansion, we mean the expansion of a known set of variables with some biological role or function to include novel interacting or downstream variables. This definition is highly flexible and can be used for a variety of biological and biomedical situations.

2.1 Correlation methods and pathway expansion

Some of the most prevalent approaches for analyzing high-throughput datasets are correlation-based methods. Correlation methods attempt to identify the degree of similarity or dissimilarity between two or more
variables (e.g., the expression profiles of two genes) using simple computational distance metrics, such as Manhattan and Pearson metrics (Herrero, et al., 2001). An underlying assumption is that cellular processes often require the participation of multiple gene products, which are expected to show correlated expression patterns as well as physical interactions (Meier and Gehring, 2008). To predict new pathway elements using correlation methods, one or more genes (or other biological entities) are usually selected initially as a target of interest for comparison. A correlation is then determined between the expression pattern of every other gene (or entity) and that of the gene of interest. Correlations above some established threshold or ranking are then represented either as edges in a network or as a dendrogram in an expression-based heatmap diagram. For example, Herrgard et al defined subsets of variables with specific modular behaviors and network structure using correlations and linear multiple regression (Herrgard, et al., 2003). These modules are then expanded to identify other neighboring variables with likely interactions with, or influences on, the module-based sub-networks. Tanay et al (2001) introduced a fitness function-based approach for expanding sets of variables in literature models (Tanay and Shamir, 2001). One advantage of these correlation-based methods is the ability to compute all pair-wise correlations for genes or features on a gene expression microarray or other high-throughput datasets. However, the correlation networks themselves do not imply any directionality for the interactions, such as which gene activates or represses a correlated gene, or whether those genes are instead co-regulated by another biological entity. The types and sometimes directionality of interactions must be determined using one or more analysis procedures, such as gene enrichment, promoter analysis, and context-dependent (or condition-dependent) analysis (Meier and
Gehring, 2008). The correlation-based methods are often sensitive to the underlying distance metrics and assumptions, and are easily misinterpreted when the wrong metrics are employed. In addition, nonlinear (e.g., biphasic) interactions cannot usually be detected using correlation-based methods.

2.2 Clustering-based identification of new pathway elements

Various clustering methods can be used to group genes based on expression values and identify potential new genes for specific pathways. Unsupervised and supervised clustering methods have been developed (Raychaudhuri, et al., 2001). Unsupervised clustering methods, such as hierarchical clustering (Eisen, et al., 1998), self-organizing maps (Tamayo, et al., 1999), and model-based clustering (e.g., CRCView (Xiang, et al., 2007)), arrange genes and samples in groups/clusters based solely on the similarities in gene expression. Supervised methods, including EASE (Hosack, et al., 2003) and gene set enrichment analysis (GSEA) (Subramanian, et al., 2005), use sample classifiers and gene expression to identify hypothesis-driven correlations. The Gene Ontology (GO) is frequently used for gene enrichment analysis by many software programs, for example, DAVID (Huang da, et al., 2009) and GOStat (Beissbarth and Speed, 2004). One major disadvantage of such clustering-based methods for identifying new pathway elements is that detailed gene-gene interactions and directionalities cannot be predicted.

2.3 Boolean network-based pathway expansion

In Boolean network modelling, originally introduced by Kauffman (Kauffman, 1969), gene expression is quantized to only two levels: ON and OFF. The expression level (state) of each gene is functionally related to the expression states of some other genes using logical rules. Probabilistic Boolean Networks (PBN) share the appealing rule-based properties of Boolean networks, but
are robust in the face of uncertainty (Shmulevich, et al., 2002). Hashimoto et al proposed a method to grow genetic regulatory networks from seed genes based on PBN analysis (Hashimoto, et al., 2004). In their study, Boolean functions were used to globally expand a set of seed genes from known literature-extracted interactions for vascular endothelial growth factor pathway genes using melanoma and glioma data (Hashimoto, et al., 2004). The output of this algorithm depends on the PBN-based objective function. The disadvantage of this approach is that the two-level representation in a Boolean network often oversimplifies complex biological systems.

2.4 Mutual information-based method

Mutual information-based methods have been used for modelling, refining, and expanding biological pathways. In probability theory and information theory, the mutual information of two random variables is a quantity that measures the mutual dependence of the two variables. Recent reports by Luo et al (Luo, et al., 2008; Luo and Woolf, 2010; Watkinson, et al., 2009) and others have shown the utility and improved modelling of using three-way and higher mutual information influences for a given variable. However, the assembly of these multi-parent interactions into larger global networks remains a challenging issue.

2.5 Bayesian network pathway refinement and expansion

Bayesian networks have recently been widely used for biological pathway reconstruction and expansion. Since this is the major topic of this book chapter, we will introduce it in more detail in the next sections.

3 Bayesian network (BN) analysis

In this section, we introduce Bayesian networks and their uses in biomedical research. Specifically, models generated for understanding biological pathways and relevant gene regulatory networks are discussed.

3.1 Introduction to Bayesian networks

One exciting development in bioinformatics research
was the advent and application of Bayesian networks (BN) in biological research. Basically, BNs are graphical representations of statistical interdependencies amongst sets of nodes. BNs model interactions amongst sets of variables (e.g., genes, proteins) as probabilistic dependencies or influences. Judea Pearl introduced the notion of Bayesian networks in 1985 (Pearl, 1985; Pearl, 1988) to emphasize three aspects: (i) the often subjective nature of the input information; (ii) the reliance on Bayes’s conditioning as the basis for updating information; and (iii) the distinction between causal and evidential modes of reasoning. Bayesian networks were later applied by Heckerman et al, Friedman et al, and various other research labs to biological research (Cooper and Herskovits, 1992; Friedman, et al., 2000; Heckerman, 1995). Specifically, a BN for a set of variables X = {X_1, X_2, ..., X_n} consists of (1) a network structure S that encodes a set of conditional independence assertions about variables in X, and (2) a set P of conditional probability distributions associated with each variable (Heckerman, 2008). Together, these components denote the joint probability distribution for X. The BN structure S is a directed acyclic graph, meaning that the network is hierarchical, has both top-level and terminal nodes, and contains no directed path that returns to the node it started from. We use Pa_i to denote the parents of node X_i in S as well as the variables corresponding to those parents. Given structure S, the joint probability distribution for X is given by

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_i) \qquad (1)$$

Different methods have been developed to learn BN structures and will be introduced in detail next.

3.2 Learning Bayesian networks (BNs)

The problem of learning a Bayesian network can be stated as follows: given a training dataset of independent instances, find a network that best matches the dataset. The common approach to this problem is to introduce a statistically sound scoring function that evaluates each network with respect to the
training dataset and to search for the optimal network based on this score. To dissect the process of learning BNs, we summarize five major steps as follows:

1. Data selection and pre-processing
2. Prior definition (including variables and edges)
3. Selection of network searching strategy (e.g., simulated annealing, greedy)
4. BN execution with a specific scoring method
5. Results output and analysis

These steps will be introduced in detail here for gene expression data analysis.

3.2.1 Data selection and preprocessing

BN is a powerful tool for analyzing high-throughput data, e.g., DNA microarray data. Preprocessing is usually required to normalize raw data and possibly filter out those genes that do not show significant changes over all conditions.

3.2.2 Prior definition (including variables and edges)

After selecting appropriate data and variable sets for investigation, settings for the BN simulation must be chosen. Initially, a decision must be made as to whether structural priors (e.g., the requirement of certain interactions to appear in a model) should be included in the BN analysis. It is not necessary to assume any structural priors for the initial set of variables. However, structural priors can be implemented, especially in cases where the biological interactions to be represented are well-established and also fully represented in the underlying biological data used for modelling.

3.2.3 Set up network searching strategy

Once the prior is specified, BN learning becomes the problem of finding a structure that maximizes the BN score according to a BN scoring function. This problem is proven to be NP-complete (Chickering, 1996), so a heuristic search is needed. The decomposition of the score is crucial for the optimization problem. For example, a local search procedure that changes one edge at a time can efficiently evaluate the gains of a specified score made by adding, removing, or reversing an edge. An
example of such a procedure is a greedy search algorithm with random restarts. Although this procedure does not necessarily achieve a global maximum, it reaches a local maximum and performs well in practice (Friedman, et al., 2000). Another commonly used method is a simulated annealing search algorithm with a temperature schedule that allows for “reannealing” as the temperature is lowered (Heckerman, 1995). Other BN searching strategies include stochastic hill-climbing and genetic algorithms (Friedman, et al., 2000).

3.2.4 Bayesian network scoring approaches

The key part of BN learning is to determine a scoring metric that compares networks and identifies the most likely or ‘best supported’ networks. Bayesian network scoring is based upon conditional probabilities. One commonly used scoring method is the BDe score (Cooper and Herskovits, 1992; Heckerman, 1995), which is a posterior probability defined as:

$$P(M \mid D) \propto \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \qquad (2)$$

where n is the number of variables, q_i is the number of parent configurations for variable i, r_i is the arity of variable i, N_ij is the number of observations with parent configuration j, and N_ijk is the number of observations of the child in state k with parent configuration j (Cooper and Herskovits, 1992). The calculation of this score is implemented in many software programs such as BANJO (Smith, et al., 2006). Another BN scoring method is the Bayesian Information Criterion (BIC), which was specifically designed to compensate for overfitting (Schwarz, 1978).

3.2.5 Bayesian network analysis software

Many BN analysis software programs are available. Dr Kevin Murphy provides an excellent summary of existing software packages for Bayesian network modelling (http://www.cs.ubc.ca/~murphyk/Bayes/bnsoft.html). Table 1 lists selected BN software programs from Dr Murphy’s website and other resources.

| Name | Source | API | GUI | Undir | Exec | Free | Inference | Exp | Reference |
|---|---|---|---|---|---|---|---|---|---|
| Banjo | Java | Y | N | D | W,U,M | Y | N | N | (Bose, et al., 2006) |
| BayesiaLab | N | N | Y | C,G | W,U,M | N | Jtree, Gibbs | N | (Conrady and Jouffe, 2011) |
| BNT | Matlab, C | Y | N | D,U | W,U,M | Y | Several options | N | (Murphy, 2001) |
| BNJ | Java | N | Y | D | C | Y | Jtree, IS | N | http://bnj.sourceforge.net |
| Causal Explorer | Matlab, C/C++ | Y | N | D | W,U,M | Y | N | N | (Aliferis, et al., 2003) |
| Deal | R | Y | Y | D | I | Y | N | N | (Bøttcher and Dethlefsen, 2003) |
| Genie | C++ | Y | Y | D | C | Y | Jtree | N | (Druzdzel, 1999) |
| Java Bayes | Java | Y | Y | D | C | Y | Jtree, Varelim | N | http://www.cs.cmu.edu/~javabayes/Home/ |
| LibB | N | Y | N | D | W,L | Y | N | N | http://www.cs.huji.ac.il/labs/compbio/LibB/ |
| MARIMBA | N | N | Y | D | I | Y | N | Y | (Hodges, et al., 2010; Hodges, et al., 2010) |
| miniTUBA | N | N | Y | D | I | Y | N | N | (Xiang, et al., 2007) |
| openBUGS | Y | Y | Y | D | W,U,M | Y | Gibbs | N | (Lunn, et al., 2000; McCarthy, 2007) |
| OpenPNL | C++ | Y | Y | D | W,L | Y | Jtree, Gibbs | N | http://sourceforge.net/projects/openpnl/ |
| PEBL | Python | Y | Y | D | W,U,M | Y | N | N | (Shah and Woolf, 2009) |
| WinMine | N | N | Y | D,U | W | Y | N | N | (Chickering, 2002) |

Notes: The categories listed include: Source, source code; API, application program interface for programmatic access; GUI, graphical user interface; Undir, ability to handle undirected graphs; Exec, the type of execution, including W: Windows, U: Unix, L: Linux, M: Mac, I: OS-independent, or C: any with compiler; Free, the availability of the software as either free (e.g., academic) or commercial; Inference, inferencing ability; Exp, ability for network expansion; and Reference, references.

Table 1. Selected software programs for BN analysis

3.2.6 BN result output and analysis

Different methods can be used to visualize BN results. For example, BANJO outputs BN results in DOT format (Reference: http://www.graphviz.org/Documentation/dotguide.pdf). MARIMBA uses DOT and can also export networks in sif format for use in Cytoscape (http://www.cytoscape.org). Since different BNs are obtained, it is crucial for a user to select ‘best-scoring’ networks and/or generate consensus networks. Often methods are also needed to build
weighted networks based on computational analysis or from literature and other database queries.

4 Bayesian network expansion methods

Bayesian network (BN) expansion is an approach that is built upon the BN method and aims to identify new pathway elements that participate in a specified network. In this section, we will introduce basic BN expansion methods and then focus on describing our internally developed BN+1 algorithm and its implementation.

4.1 General BN expansion

Compared to the other network expansion methods described above, Bayesian network-based expansion methods provide distinct advantages, such as prediction of both linear and nonlinear functions, robustness to noisy data, and identification of causal or apparently causal influences representing interactions among genes. In general, Bayesian network expansion can be defined as the addition of new variables to an existing network, followed by rescoring and ranking of those variables. BN-based expansion has been used for gene expression data analysis (Gat-Viks and Shamir, 2007; Pena, et al.). For example, Pena et al reported an algorithm, AlgorithmGPC, that also grows BN models from seed genes (Pena, et al.).
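As a minimal sketch of the add, rescore and rank idea (not the implementation of any published method), the Python below scores each candidate gene by its best local gain in the log Cooper-Herskovits score when it is attached to the seed genes, either as a parent of one seed or as a child of up to two seeds. The function names, the toy dataset, and the restriction to these local moves are assumptions made for illustration; a real expansion would search over full network structures.

```python
import math
from collections import Counter
from itertools import combinations


def log_ch_score(data, child, parents):
    """Log Cooper-Herskovits score of `child` given `parents`.

    `data` is a list of dicts mapping variable name -> discrete state.
    """
    r = len({row[child] for row in data})   # arity of the child variable
    n_ij = Counter()                        # counts per parent configuration
    n_ijk = Counter()                       # counts per (configuration, child state)
    for row in data:
        cfg = tuple(row[p] for p in parents)
        n_ij[cfg] += 1
        n_ijk[(cfg, row[child])] += 1
    # log[(r - 1)! / (N_ij + r - 1)!] for each observed parent configuration
    score = sum(math.lgamma(r) - math.lgamma(n + r) for n in n_ij.values())
    # log(N_ijk!) for each observed (configuration, state) pair
    score += sum(math.lgamma(n + 1) for n in n_ijk.values())
    return score


def expansion_gain(data, seeds, candidate, max_parents=2):
    """Best local score gain from attaching `candidate` to the seed genes."""
    gains = []
    # Option 1: the candidate becomes a new parent of a single seed gene.
    for seed in seeds:
        gains.append(log_ch_score(data, seed, [candidate])
                     - log_ch_score(data, seed, []))
    # Option 2: the candidate becomes a child of one or more seed genes.
    base = log_ch_score(data, candidate, [])
    for k in range(1, max_parents + 1):
        for parent_set in combinations(seeds, k):
            gains.append(log_ch_score(data, candidate, list(parent_set)) - base)
    return max(gains)


def rank_candidates(data, seeds, candidates):
    """Return (candidate, gain) pairs sorted from best to worst gain."""
    scored = [(c, expansion_gain(data, seeds, c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): A and B are seed genes, C = A XOR B depends
# nonlinearly on both seeds, and D is independent of them.
data = [{"A": i & 1, "B": (i >> 1) & 1,
         "C": (i & 1) ^ ((i >> 1) & 1), "D": (i >> 2) & 1}
        for i in range(64)]
ranked = rank_candidates(data, ["A", "B"], ["C", "D"])
print(ranked)
```

In this toy dataset C ranks first with a positive gain, while D, which carries no information about the seeds, receives a negative gain. The XOR relationship has zero pairwise Pearson correlation with either seed, so the example also illustrates why a score over discrete parent configurations can detect nonlinear dependencies that correlation-based expansion would miss.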
AlgorithmGPC starts with a single gene and builds networks around this gene through expansion and pruning with a set number of genes. Gat-Viks and Shamir also developed a Bayesian network-based refinement and expansion method (Gat-Viks and Shamir, 2007). A main limitation of this approach is that it requires high-quality prior knowledge of the signaling pathways. The topology of the biological pathways may not be consistent with networks learned from transcriptional gene expression data obtained via DNA microarray studies. Therefore, a fixed topology as the initial seed network may not be appropriate for robust network expansion simulations. Other BN expansion methods have also been published (Needham, et al., 2009; Parikh, et al., 2010). These approaches differ from each other, but all showed some level of success in identifying new pathway elements. In the following two sections, we will introduce our BN+1 algorithm (Hodges, et al., 2010; Hodges, et al., 2010) and how it is implemented in the MARIMBA software.

4.2 The BN+1 algorithm

In our recent study, we developed an algorithm termed “BN+1”, which implements Bayesian network expansion to predict new factors and interactions that participate in a specific pathway (Hodges, et al., 2010; Hodges, et al., 2010). Broadly, the BN+1 algorithm iteratively tests whether any single variable added to a given pathway significantly improves the likelihood of the overall network. This approach is based on the observation that hidden variables which regulate, or are regulated by, a network are more likely to be ranked with high posterior probability scores. Using a compendium of microarray gene expression data obtained from Escherichia coli, the BN+1 algorithm predicted many novel factors that influence the E. coli reactive oxygen species (ROS) pathway. Some of the predicted new ROS and biofilm regulators (e.g., uspE and its interaction with gadX) were further experimentally verified (Hodges, et al., 2010). In another study, a
synthetic network was also designed to further evaluate this algorithm. Based on the synthetic data analysis, the BN+1 method is able to identify both linear and nonlinear relationships and correctly identify variables near the starting network (Hodges, et al., 2010). The BN+1 algorithm is specified in Figure 1. A few notes on our BN+1 implementation are provided here:

- The selection of seed (or core) genes is an important step. The seed genes can be selected from an existing pathway database, from a literature survey, or from internal experimental results.
- Since it is computationally expensive to calculate BNs using a large number of variables, it is often necessary to filter out some genes from an initial list using different criteria, for example, filtering out those genes that do not change significantly across all microarray chips.
- While we use a top network structure generated from the initial core gene simulation as a prior, we prefer not to fix the core network structure for subsequent network expansion. This preference distinguishes our approach from the commonly used method of fixing the prior structure. Our argument is that the prior structure is often determined by many layers of studies, including DNA, RNA and protein data analyses. When only RNA transcriptomic data are used, such a prior structure may not hold. Fixing the prior structure would then yield suboptimal networks that do not match the datasets used for BN simulation.

BN+1 Algorithm Input: N variables (e.g., genes) from a dataset (e.g., microarray dataset) with L observations each Data Preprocessing (Optional) Filter out m variables (e.g., via coefficient of variation (c.v.) 0.8) and anti-correlation (light squares: Pearson