1140 Gautam B. Singh

Fig. 59.2. The processing stages required for generating the Markovian models for DNA patterns

these elements and defines consensus and matrices for elements of certain function, and thus to provide means of identifying regulatory signals in anonymous genomic sequences.

TRANSFAC: Transcription Factors and Regulation

The development of the TRANSFAC database has been guided by the objective of providing a biological context for understanding the function of regulatory signals found in genomic sequences. The compilation was meant to provide all relevant data about the regulating proteins and to allow researchers to trace transcriptional control cascades back to their origin (Wingender et al., 2001, Matys et al., 2003). The TRANSFAC database contains information about regulatory DNA sequences and the transcription factors binding to and acting through them. At the core of this database are the components describing the transcription factor (FACTOR), its corresponding binding site (SITE), and the regulation of the corresponding gene (GENE). The GENE table is one of the central tables in this database. It is linked to several other databases including S/MARtDB (Liebich et al., 2002), TransCOMPEL (Margoulis et al., 2002), LocusLink, OMIM, and RefSeq (Wheeler et al., 2004).

Sites are included in the database only when they are experimentally proven. The experimental evidence for the transcription factor and the DNA-binding site is described, and the cell type from which the factor is derived is linked to the respective entry in the CELL table. A set of weight matrices is derived from the collection of binding sites. These matrices are recorded in the MATRIX table. Moreover, the transcription factors are assigned to a class according to their DNA-binding domain, and hence a link to the CLASS table is established.
The starting point for accessing these databases is the web site www.gene-regulation.com. As an example, consider the somewhat edited entry from the SITE table of the TRANSFAC database shown in Figure 59.3. The entry provides a wide variety of information about the transcription factor, such as the binding sequence motif (SQ) and the first (SF) and last (ST) positions of the factor binding site. The accession number of the binding factor itself is provided (BF); this is in fact a key for the FACTOR table in TRANSFAC. The source of the factor is identified (SO). The specific types of cells where the factor was found to be active are identified: 3T3, C2 myoblasts, and F9 in this case. Additional information about these cells is accessible in the CELL table under the accession numbers 0003, 0042 and 0069 respectively. External database references and their corresponding accession numbers are provided in the DR field, as well as publication titles (RT) and citation information (RL).

59 Learning Information Patterns in Biological Databases 1141

Fig. 59.3. A sample record from TRANSFAC database

Another set of patterns is significant for bringing about structural modifications to the DNA. The DNA must be in a structurally open conformation^1 for gene expression to occur successfully. The MARs, or Matrix Attachment Regions, are relatively short (100-1000 bp long) sequences that anchor the chromatin loops to the nuclear matrix and enable the chromatin to adopt the open conformation needed for gene expression (Bode, 1996). Approximately 100,000 matrix attachment sites are believed to exist in the mammalian nucleus, a number that roughly equals the number of genes. MARs have been observed to flank the ends of genic domains encompassing various transcriptional units (Bode, 1996, Nikolaev et al., 1996). A list of structural motifs that are responsible for attaching DNA to the nuclear matrix is shown in Table 59.1.
^1 Within a cell, the DNA can be in a loosely packed, open conformation by adopting an 11 nm fiber structure, or in a tightly packed, closed conformation by adopting a 30 nm fiber structure.

59.2.2 Clustering Biological Patterns

Clustering is an important step as it directly impacts the success of the downstream model generation process. Given a set of sequence patterns, $S$, the objective of the clustering process is to partition these into groups such that each group represents patterns that are related through sequence-level, functional, or structural similarity. Pattern similarity at the sequence level can be measured by the string-edit or Levenshtein distance. In most cases, sequence-level similarity implies functional and/or structural similarity. However, sometimes known similarity in function (for example, the categorization of the MAR-specific patterns in Table 59.1) may be used to form clusters regardless of the sequence-level similarity.

Table 59.1. Polymorphism is commonly observed in biological patterns. A stochastic basis for pattern representation is thus justifiable. The list of motifs that are functionally related to MARs was generated by studying related literature.

Index   Motif Name            DNA Signature
m_1     ORI Signal            ATTA
m_2     ORI Signal            ATTTA
m_3     ORI Signal            ATTTTA
m_4     TG-Rich Signal        TGTTTTG
m_5     TG-Rich Signal        TGTTTTTTG
m_6     TG-Rich Signal        TTTTGGGG
m_7     Curved DNA Signal     AAAANNNNNNNAAAANNNNNNNAAAA
m_8     Curved DNA Signal     TTTTNNNNNNNTTTTNNNNNNNTTTT
m_9     Curved DNA Signal     TTTAAA
m_10    Kinked DNA Signal     TANNNTGNNNCA
m_11    Kinked DNA Signal     TANNNCANNNTG
m_12    Kinked DNA Signal     TGNNNTANNNCA
m_13    Kinked DNA Signal     TGNNNCANNNTA
m_14    Kinked DNA Signal     CANNNTANNNTG
m_15    Kinked DNA Signal     CANNNTGNNNTA
m_16    mtopo-II Signal       RNYNNCNNGYNGKTNYNY
m_17    dtopo-II Signal       GTNWAYATTNATNNR
m_18    AT-Rich Signal        WWWWWW

Consider a given sequence pair, $\vec{a} = a_1 a_2 \ldots a_n$ and $\vec{b} = b_1 b_2 \ldots b_m$, where both sequences are defined over the alphabet $\mathcal{A} = \{A, C, T, G\}$. Let $d(a_i, b_j)$ denote the distance between the $i$-th symbol of sequence $\vec{a}$ and the $j$-th symbol of sequence $\vec{b}$; $d(a_i, b_j)$ is defined as $|g_i - g_j|$. Also, let $g(k)$ be the cost of inserting (or deleting) an additional gap of size $k$. If the distance between these two pattern sequences of lengths $n$ and $m$ is denoted as $D_{n,m}$, the recursive formulation of the Levenshtein distance is defined by

\[
D_{i,j} = \min
\begin{cases}
D_{i-1,j-1} + d(a_i, b_j), \\
\min_{1 \le k \le j} \{ D_{i,j-k} + g(k) \}, \\
\min_{1 \le l \le i} \{ D_{i-l,j} + g(l) \}
\end{cases}
\tag{59.1}
\]

Having computed the similarity between all pattern pairs, the clustering algorithm described below is applied to group them into pattern clusters. This clustering approach is based upon the work described in (Zahn, 1971, Page, 1974). In this graph-theoretic approach, each vertex $v_x$ represents a pattern $x \in S$, belonging to the set of patterns being clustered. The normalized Levenshtein distance between two patterns $x$ and $y$, denoted $\delta_{xy}$, is the weight of the edge $e_{xy}$ connecting vertices $v_x$ and $v_y$. The clustering process proceeds as follows:

• Construct a Minimum Spanning Tree (MST). The MST covers the entire set $S$ of patterns. The MST is built using Prim's algorithm. Since the MST covers the entire set of training sequences, it is considered to be the root cluster, which is iteratively subdivided into smaller child clusters by repeated application of the two steps below.
• Identify an Inconsistent Edge in the MST. This process is based on the mean $\mu_i$ and standard deviation $\sigma_i$ of the distance values for the edges in a cluster $C_i$. The cluster and edge with the largest z-score, $z_{\max}$, are identified. The variable $e^i_{jk}$ denotes the weight of an edge in cluster $C_i$.

\[
z_{\max} = \max_{i} \; \max_{j,k \in C_i} \frac{e^i_{jk} - \mu_i}{\sigma_i}
\tag{59.2}
\]

• Remove the Inconsistent Edge. The edge identified in the previous step is further subject to the condition that its z-score be larger than a pre-specified threshold. If this condition is satisfied, the inconsistent edge is removed, causing the cluster containing it to be split into two child clusters^2. However, if the edge's z-score falls below the threshold, the iterative subdivision process halts. It may be noted that the threshold for inconsistent edge removal is often specified in terms of $\sigma$.

The categorization of the DNA pattern sequences into appropriate clusters is essential for training the pattern models as described in the following section. The quality of each cluster will be assessed to ensure that sufficient examples exist in the cluster to train a stochastic model. In the absence of sufficient examples, the patterns will be represented as boolean decision trees.

^2 Removing any edge of a tree (the MST in this case) causes the tree to be split into two trees (into two MSTs in this case).

59.2.3 Learning Cluster Models

A DNA sequence matrix is a set of fixed-length DNA sequence segments aligned with respect to an experimentally determined, biologically significant site. The columns of a DNA sequence matrix are numbered with respect to the biological site, usually starting with a negative number. A DNA sequence motif can be defined as a matrix of depth 4 utilizing a cut-off value.
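Looking back at the clustering procedure of Section 59.2.2, its three steps (build the MST, find the edge with the largest z-score, remove it while it exceeds the threshold) can be sketched as follows. This is only an illustrative sketch under stated assumptions: it uses a unit-cost edit distance instead of the gap function $g(k)$ of Eq. (59.1), normalizes by the longer pattern length, and uses the population standard deviation. The function names and the `z_threshold` parameter are hypothetical.

```python
from statistics import mean, pstdev

def edit_distance(a, b):
    """Unit-cost Levenshtein distance (a simplification of Eq. 59.1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]

def normalized_distance(a, b):
    # Assumed normalization: divide by the length of the longer pattern.
    return edit_distance(a, b) / max(len(a), len(b))

def prim_mst(patterns):
    """Return the MST as a list of (weight, u, v) edges, grown by Prim's algorithm."""
    in_tree, edges = {0}, []
    while len(in_tree) < len(patterns):
        w, u, v = min(
            (normalized_distance(patterns[u], patterns[v]), u, v)
            for u in in_tree
            for v in range(len(patterns))
            if v not in in_tree
        )
        edges.append((w, u, v))
        in_tree.add(v)
    return edges

def components(n, edges):
    """Connected components (sets of vertex indices) of a forest on n vertices."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _, u, v in edges:
        parent[find(u)] = find(v)
    groups = {}
    for x in range(n):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

def cluster(patterns, z_threshold=1.5):
    """Split the MST at inconsistent edges until no z-score exceeds the threshold."""
    edges = prim_mst(patterns)
    while True:
        best = None  # (z-score, edge) with the largest z over all clusters
        for comp in components(len(patterns), edges):
            comp_edges = [e for e in edges if e[1] in comp]
            if len(comp_edges) < 2:
                continue
            weights = [e[0] for e in comp_edges]
            mu, sigma = mean(weights), pstdev(weights)
            if sigma == 0:
                continue
            for e in comp_edges:
                z = (e[0] - mu) / sigma
                if best is None or z > best[0]:
                    best = (z, e)
        if best is None or best[0] <= z_threshold:
            break  # no inconsistent edge left: subdivision halts
        edges.remove(best[1])  # removing a tree edge splits one cluster in two
    return [sorted(patterns[i] for i in comp)
            for comp in components(len(patterns), edges)]
```

Applied to three ORI signals and three kinked-DNA signals from Table 59.1, the single long edge joining the two motif families is the only one removed at this threshold, so the two families end up in separate clusters.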
The 4-column/mononucleotide matrix description of a genetic signal is based on the assumptions that the motif is of fixed length and that each nucleotide is independently recognized by a trans-acting mechanism. For example, the following frequency matrix has been reported for the TATA box.

Table 59.2. Weight Matrix for TATA Box

A     8    4   58    4   51   38   53   30
C    14    6    0    0    3    0    1    2
G    32    1    1    0    0    0    0    8
T     6   49    1   56    6   22    6   20

Given a set of aligned signal sequences of length $L$ corresponding to the functional signal under consideration, $F = [f_{bi}]$, $(b \in \Sigma)$, $(i = 1, \ldots, L)$ is the nucleotide frequency matrix, where $f_{bi}$ is the absolute frequency of occurrence of the $b$-th type of nucleotide from the set $\Sigma = \{A, C, G, T\}$ at the $i$-th position along the functional site.

The frequency matrix may be utilized for developing an ungapped score model when searching for the sites in a sequence. Typically a log-odds scoring scheme is utilized when searching for a pattern $x$ of length $L$, as shown in Eq. (59.3). The quantity $e_i(b)$, which specifies the probability of observing the base $b$ at position $i$, is defined using a frequency matrix such as the one shown above. The quantity $q(b)$ represents the background probability of the base $b$.

\[
S = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{q(x_i)}
\tag{59.3}
\]

The elements $\log \frac{e_i(x_i)}{q(x_i)}$ behave like a scoring matrix similar to the PAM and BLOSUM matrices. The term Position Specific Scoring Matrix (PSSM) is often used for this kind of matrix-based pattern search. A PSSM can be used to search for a match in a longer sequence of length $N$ by evaluating a score $S_j$ for each starting point $j$ in the sequence from position 1 to $(N - L + 1)$, where $L$ is the length of the PSSM. These optimized weight matrices can be used to search for functional signals in nucleotide sequences. Any nucleotide fragment of length $L$ is analyzed and tested for assignment to the proper functional signal.
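As an illustration of the log-odds scan of Eq. (59.3), the sketch below converts the TATA box counts of Table 59.2 into a PSSM and scores every window of a sequence. Several choices are assumptions rather than part of the method described here: a pseudocount of 1 is added to each cell so that zero counts do not produce $\log 0$, the background $q(b)$ defaults to a uniform 0.25, and natural logarithms are used. The function names are hypothetical.

```python
import math

# Frequency matrix from Table 59.2 (one list of 8 position counts per base).
COUNTS = {
    "A": [8, 4, 58, 4, 51, 38, 53, 30],
    "C": [14, 6, 0, 0, 3, 0, 1, 2],
    "G": [32, 1, 1, 0, 0, 0, 0, 8],
    "T": [6, 49, 1, 56, 6, 22, 6, 20],
}

def build_pssm(counts, background=None, pseudocount=1.0):
    """Return a list (one dict per motif position) of log(e_i(b) / q(b))."""
    bases = sorted(counts)
    length = len(next(iter(counts.values())))
    # Assumption: uniform background unless the caller supplies q(b).
    background = background or {b: 1.0 / len(bases) for b in bases}
    pssm = []
    for i in range(length):
        total = sum(counts[b][i] for b in bases) + pseudocount * len(bases)
        pssm.append({
            b: math.log((counts[b][i] + pseudocount) / total / background[b])
            for b in bases
        })
    return pssm

def scan(sequence, pssm):
    """Score S_j for every start j = 0 .. N - L (0-based) of an A/C/G/T sequence."""
    L = len(pssm)
    return [
        sum(pssm[i][sequence[j + i]] for i in range(L))
        for j in range(len(sequence) - L + 1)
    ]

def best_hit(sequence, pssm):
    """Return (start index, score) of the highest-scoring window."""
    scores = scan(sequence, pssm)
    j = max(range(len(scores)), key=scores.__getitem__)
    return j, scores[j]
```

For instance, `best_hit("CCCCGTATAAAACCCC", build_pssm(COUNTS))` locates the window holding the column-wise consensus of the matrix. In practice a cut-off on $S_j$, rather than the single best window, would decide which positions are reported as matches.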
A matching score of $\sum_{i=1}^{L} W(b_i, i)$ is assigned to the nucleotide position being examined along the sequence. In the search formulation, $b_i$ is the base at position $i$ along the biological sequence, and $W(b_i, i)$ represents the corresponding weight matrix entry for symbol $b_i$ occurring at position $i$ along the motif. A more detailed example of learning the PSSM for a pattern cluster is shown in Figure 59.4(a).

A stochastic extension of the PSSM is based on a Markovian representation of biological sequence patterns. As a first step toward learning the pattern-HMM, one of two common HMM architectures must be selected to define the topology: the fully connected ergodic architecture or the Left-Right (LR) architecture. The fully connected architecture offers a higher-level modeling capability, but generally requires a larger set of training data. The left-right configuration, on the other hand, is powerful enough for modeling sequences, does not require a large training set, and facilitates model comprehension. The moderate amount of available training data often dictates that the LR-HMM be utilized for representing the pattern clusters.

The initial parameters of the pattern-HMM are assigned heuristically. The number of states, $N$, with the states denoted $S = \{S_1, S_2, \ldots, S_N\}$, in a pattern-HMM may be set to as large a value as the total number of DNA symbols in the longest pattern in that cluster. A smaller number of states is heuristically chosen in practice. Each state $S_i$ is associated with an emission probability vector covering the emission of each of the symbols $\{A, C, T, G\}$. The process of generating each pattern is sequential, such that the $x$-th symbol generated, $D_x$, is a result of the HMM being in a hidden state $q_x = S_i$. The parameters of the HMM are denoted $\lambda = \{A, B, \pi\}$ and are defined as follows (Rabiner, 1989).

A: The $N \times N$ matrix $A = \{a_{ij}\}$ representing the state transition probabilities.
\[
a_{ij} = \Pr[q_{x+1} = S_j \mid q_x = S_i], \quad 1 \le i, j \le N
\tag{59.4}
\]

B: The $N \times k$ state-dependent observation symbol probability matrix over the bases $n \in \{A, C, T, G\}$ (so that $k = 4$ here, with $d$ indexing the symbols). The elements of this matrix, $B = \{b_j(n)\}$, are defined as follows:

\[
b_j(n) = \Pr[D_x = n \mid q_x = S_j], \quad 1 \le j \le N, \; 1 \le d \le k
\tag{59.5}
\]

$\pi$: The initial state distribution, $\pi = \{\pi_i\}$.

\[
\pi_i = \Pr[q_1 = S_i], \quad 1 \le i \le N
\tag{59.6}
\]

The maximum likelihood estimation procedure of Baum and Welch is next utilized for training each pattern-HMM, such that the pattern sequences in a cluster are the most likely set of samples generated by the underlying HMM. Figure 59.4(b) represents the training methodology applied for learning the HMM parameters based on the local alignment block used for training the PSSM in Figure 59.4(a).

Thus, pattern HMMs may be associated with clusters where the number of instances is large enough to allow their parameters to be learned adequately. In the case of smaller clusters, the pattern clusters will be represented as PSSMs, profiles, or regular expressions. Profiles are similar to PSSMs (Gribskov et al., 1990) and are generated from the sequences in a cluster when the alignment between the members of the cluster is strong. Regular expressions constitute the method of choice for smaller groups of shorter patterns where compositional statistics are hard to evaluate.

Fig. 59.4. (a) A PSSM based model induced from a Multiple Sequence Alignment (b) A HMM induced from the same alignment

59.3 Searching for Meta-Patterns

The process of discovering hierarchical pattern associations is posed in terms of the relationships between models of a family of patterns, rather than between individual patterns. This will enable us to validate the meta-pattern hypotheses in a computationally tractable manner. The patterns are associated when they occur within a specific distance of each other, called their association interval.
The association interval will be established using a split-and-merge procedure. Starting from a default association interval of 1000 bp, the overall significance of the patterns found by splitting this interval is assessed. Additionally, neighboring windows are merged to assess the statistical significance of larger regions. In this manner, the region with the highest level of significance is taken as the association interval for a group of patterns.

A statistical significance is associated with each pattern-model pair detected within an association interval. This is achieved through two levels of searching GenBank^3. The Level I search yields the regions that exhibit a high concentration of patterns. This is the first step toward generating pattern association hypotheses that are biologically significant, as patterns working in coordination are generally expected to be localized close to each other. The Level II search aims at building support and confidence, so that Level I hypotheses may be accepted or rejected based on pre-specified criteria. Consider, for example, two patterns A and B for which the Level I search shows a strong correlation between the two pattern HMMs. The Level II search may nevertheless reveal a substantially large number of instances outside the high pattern density regions where the two patterns occur independently of each other. This will lead to the rejection of the $A \equiv B$ hypothesis.

59.3.1 Level I Search: Locating High Pattern Density Regions

The search for High Pattern Density Regions (HPDRs) aims at isolating the regions of the DNA sequence where the patterns modeled by the HMMs occur at a density that is higher than expected. The Level I search is aimed at identifying HPDRs as shown in Figure 59.5. These regions may be located by measuring the significance of patterns detected in a window of size $W$ located at a given position on the sequence.
A numerical value for the pattern density at location $x$ on the DNA sequence is obtained by treating the pattern occurrences within a window centered at location $x$ as trials from independent Poisson processes. The null hypothesis, $H_0$, tested in each window is essentially that the pattern frequencies observed in the window are no different from those expected in a random sequence. A large deviation from the expected frequency of patterns in a window forces the rejection of $H_0$. The level of confidence with which $H_0$ is rejected is used to assign a statistical pattern-density metric to the window. Specifically, the pattern density in a window is defined to be $\rho = -\log(p)$, where $p$ is the probability of erroneously rejecting $H_0$. As a matter of detail, the value of $\rho$ is computed for both the forward and the reverse DNA strands, and the average of the two is taken to be the density estimate for that location.

Fig. 59.5. High Pattern Density Regions or HPDRs are detected by statistical means for all sequences in the database.

^3 GenBank is the database of DNA sequences that is publicly accessible from the National Institutes of Health, Bethesda, MD, USA.

In order to compute $\rho$, assume that we are searching for $k$ distinct types of patterns within a given window of the sequence. In general, these patterns are defined as rules $R_1, R_2, \ldots, R_k$. The probability of random occurrence of each of the $k$ patterns is calculated using the AND-OR relationships between the individual motifs. Assume that these probabilities for the $k$ patterns are $p_1, p_2, \ldots, p_k$. Next, a random vector of pattern frequencies, $F$, is constructed. $F$ is a $k$-dimensional vector with components $F = \{x_1, x_2, \ldots, x_k\}$, where each component $x_i$ is a random variable representing the frequency of the pattern $R_i$ in the $W$ base-pair window.
The component random variables $x_i$ are assumed to be independently distributed Poisson processes, each with the parameter $\lambda_i = p_i \cdot W$. Thus, the joint probability of observing a frequency vector $F_{obs} = \{f_1, f_2, \ldots, f_k\}$ purely by chance is given by:

\[
P(F_{obs}) = \prod_{i=1}^{k} \frac{e^{-\lambda_i} \lambda_i^{f_i}}{f_i!}, \quad \text{where } \lambda_i = p_i \cdot W
\tag{59.7}
\]

The cumulative probability $\alpha$ that pattern frequencies equal to or greater than the vector $F_{obs}$ occur purely by chance is given by Eq. (59.8) below. This corresponds to the one-sided integral of the multivariate Poisson distribution and represents the probability that $H_0$ is erroneously rejected.

\[
\begin{aligned}
\alpha &= \Pr(x_1 \ge f_1, x_2 \ge f_2, \ldots, x_k \ge f_k) \\
&= \Pr(x_1 \ge f_1) \cdot \Pr(x_2 \ge f_2) \cdots \Pr(x_k \ge f_k) \\
&= \sum_{x_1 = f_1}^{\infty} \frac{e^{-\lambda_1} \lambda_1^{x_1}}{x_1!} \cdot \sum_{x_2 = f_2}^{\infty} \frac{e^{-\lambda_2} \lambda_2^{x_2}}{x_2!} \cdots \sum_{x_k = f_k}^{\infty} \frac{e^{-\lambda_k} \lambda_k^{x_k}}{x_k!}
\end{aligned}
\tag{59.8}
\]

The p-value, $\alpha$, in Eq. (59.8) is used to compute the value of $\rho$, the cluster density, as specified in Eq. (59.9) below:

\[
\begin{aligned}
\rho &= \ln \frac{1.0}{\alpha} = -\ln(\alpha) \\
&= \sum_{i=1}^{k} \lambda_i + \sum_{i=1}^{k} \ln f_i! - \sum_{i=1}^{k} f_i \ln \lambda_i - \sum_{i=1}^{k} \ln \left( 1 + \frac{\lambda_i}{f_i + 1} + \cdots + \frac{\lambda_i^{t}}{(f_i + 1)(f_i + 2) \cdots (f_i + t)} \right)
\end{aligned}
\tag{59.9}
\]

The infinite summation term in Eq. (59.9) converges quickly and can thus be calculated adaptively to the desired precision. For small values of $\lambda_i$, the series may be truncated once the last term is smaller than an arbitrarily small constant, $\varepsilon$.

Fig. 59.6. The analysis of human protamine gene cluster using the MAR-Finder tool. Default analysis parameters were used.

Figure 59.6 presents the output from the analysis of the human protamine gene sequence. This statistical inference algorithm, based on the association of patterns found in close proximity within a DNA sequence region, has been incorporated in the MAR-Finder tool. A Java-enabled version of the tool described in (Singh et al., 1997) is also available for public access from http://www.MarFinder.com.
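The density computation of Eqs. (59.7) to (59.9) can be sketched in a few lines. The sketch below is not the MAR-Finder implementation; it only evaluates the formula for one window under the stated independence assumption, truncating the series once a term drops below a tolerance `eps`. The function names, the default tolerance, and the `max_terms` safeguard are hypothetical, and every $\lambda_i$ is assumed to be strictly positive.

```python
import math

def log_poisson_tail(f, lam, eps=1e-12, max_terms=10_000):
    """ln Pr(X >= f) for X ~ Poisson(lam), using the series inside Eq. (59.9)."""
    # Logarithm of the leading term e^{-lam} * lam^f / f!
    log_lead = -lam + f * math.log(lam) - math.lgamma(f + 1)
    # Series 1 + lam/(f+1) + lam^2/((f+1)(f+2)) + ...
    series, term = 1.0, 1.0
    for t in range(1, max_terms + 1):
        term *= lam / (f + t)
        series += term
        if term < eps:  # adaptive truncation at the requested precision
            break
    return log_lead + math.log(series)

def pattern_density(freqs, probs, window):
    """rho = -ln(alpha) for observed counts f_i, with lam_i = p_i * W."""
    return -sum(
        log_poisson_tail(f, p * window)
        for f, p in zip(freqs, probs)
    )
```

A window in which no pattern is observed gives $\rho \approx 0$, since $\Pr(x_i \ge 0) = 1$, while counts well above $\lambda_i$ give large values of $\rho$. Averaging over the forward and reverse strands, as described above, would be a separate step applied to two such values.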
We also need to take into consideration the interdependence of pattern occurrences. Let $f_{ij}$ correspond to the observed frequency of the pattern defined by pattern-HMM $H_j$ in the $i$-th window sample. Using the frequency data from $n$ window samples, and the mean frequency vector $\bar{\vec{f}} = (\bar{f}_i)$, the correlation matrix $R = (r_{ij})$ can be evaluated as follows:

\[
r_{ij} = \frac{s_{ij}}{s_i s_j} = \frac{\sum_{r=1}^{n} (f_{ri} - \bar{f}_i)(f_{rj} - \bar{f}_j)}{\sqrt{\sum_{r=1}^{n} (f_{ri} - \bar{f}_i)^2 \; \sum_{r=1}^{n} (f_{rj} - \bar{f}_j)^2}}
\tag{59.10}
\]

If the sample correlation matrix, $R$, is equal to the identity matrix, the variables can be considered to be uncorrelated or independent. The hypothesis $r_{ij} = 0$ can be tested using the statistic $t_{ij}$ defined in Eq. (59.11), which follows a Student's t distribution with $(n - 2)$ degrees of freedom (Kachigan, 1986).

\[
t_{ij} = r_{ij} \sqrt{\frac{n - 2}{1 - r_{ij}^2}}
\tag{59.11}
\]

If pattern interdependence is detected, the pairwise correlation terms in $R$ can be used to remove surrogate variables, i.e. one of two patterns that exhibit a high degree of correlation. Removal of surrogate variables retains a core subset of the original patterns that accounts for the variability of the observed data (Hair et al., 1987). Let there be $k$ such core patterns retained for the subsequent analysis stage. If the pairwise correlation terms of $R_k$ are non-zero, the Mahalanobis transformation can be applied to the vector $\vec{f}_k$ to transform it to a vector $\vec{z}_k$. The property of such a transformation is that the correlation matrix of the transformed variables is guaranteed to be the identity matrix $I$ (Mardia et al., 1979). The Mahalanobis transformation for obtaining the uncorrelated vector $\vec{z}_k$ from the observed frequency of core vectors $\vec{f}_k$ is specified in Eq. (59.12), with the $l_i$ denoting the eigenvalues.
\[
\vec{z}_k = S_k^{-\frac{1}{2}} \left( \vec{f}_k - \bar{\vec{f}}_k \right), \qquad S_k^{-\frac{1}{2}} = \Gamma \Lambda^{-\frac{1}{2}} \Gamma^{\top}
\tag{59.12}
\]

where $\vec{f}_k$ is the observed frequency vector and $\Lambda^{-\frac{1}{2}} = \mathrm{diag}(l_i^{-\frac{1}{2}})$.

The value of $\alpha$ can next be computed from the transformed vector $\vec{z}_k$ as shown in Eq. (59.13). The components of the transformed vector are independent, and thus the multiplication of individual probability terms is justifiable. Each component, $z_i$, represents a linear combination of the observed frequency values.

\[
\begin{aligned}
\alpha &= \Pr(z_1 \ge z_{f_1}, z_2 \ge z_{f_2}, \ldots, z_c \ge z_{f_c}) \\
&= \Pr(z_1 \ge z_{f_1}) \cdot \Pr(z_2 \ge z_{f_2}) \cdots \Pr(z_c \ge z_{f_c}) \\
&= \sum_{z_1 = z_{f_1}}^{\infty} \frac{e^{-1}}{z_1!} \cdot \sum_{z_2 = z_{f_2}}^{\infty} \frac{e^{-1}}{z_2!} \cdots \sum_{z_c = z_{f_c}}^{\infty} \frac{e^{-1}}{z_c!}
\end{aligned}
\tag{59.13}
\]

59.3.2 Level II Search: Meta-Pattern Hypotheses

The meta- or higher-level pattern hypotheses are generated and tested within the HPDRs. Specifically, the Pattern Association (PA) hypotheses are generated and verified within these HPDRs. These PA hypotheses are built in a bottom-up manner from the validation of pair-wise associations. For example, for two patterns A and B, a PA-hypothesis that we might validate is $A \rightarrow B$, with the usual semantics that the occurrence of pattern A implies the occurrence of pattern B within a pre-specified association distance. Furthermore, if the PA-hypothesis $B \rightarrow A$ is also validated, the relationship between patterns A and B is promoted to that of Pattern Equivalence (PE), denoted $A \leftrightarrow B$ or $A \equiv B$. Transitivity can be used to build larger groups of associations, such that if $A \rightarrow B$ and $B \rightarrow C$, then the implication $A \rightarrow BC$ may be concluded. A similar statement can be made about the PE-hypotheses^4. Meta-pattern formation using transitivity rules will lead to the discovery of mosaic-type meta-patterns.

For the purpose of developing a methodology for systematically generating PA-hypotheses, the DNA sequence is represented as a sequence of 2-element entries.
The first element of each entry in this sequence is the pattern match location, and the second element identifies the specific HMM(s) that matched. (It is possible for more than one pattern model to match the DNA sequence at a specific location.) Such a representation, shown in Eq. (59.14) and denoted $F_S$, is the pattern-sequence corresponding to the biological sequence $S$.

\[
F_S = (x_1, P_a), (x_2, P_b), \ldots, (x_i, P_r), \ldots, (x_n, P_v)
\tag{59.14}
\]

Eq. (59.15) specifies the set of pattern hypotheses generated within each HPDR. The operator cadr(L) is used to denote car(cdr(L)). The equation specifies that unique hypotheses are formed by considering the pattern instance $P_y$ closest to a given pattern instance $P_x$.

\[
\begin{aligned}
H_{AB} = \{ (A, B) \mid\; & A = \mathrm{cadr}(P_x) \wedge B = \mathrm{cadr}(P_y) \;\wedge \\
& \Delta_{A,B} = \| \mathrm{car}(P_x) - \mathrm{car}(P_y) \| \wedge (\Delta_{A,B} < \theta) \;\wedge \\
& (\neg \exists P_z)\, ( \| \mathrm{car}(P_x) - \mathrm{car}(P_z) \| < \Delta_{A,B} ) \}
\end{aligned}
\tag{59.15}
\]

An $N \times N$ matrix $C$, similar to a contingency table (Gokhale, 1978, Brien, 1989), is used for recording the significance of each PA-hypothesis generated from the analysis of all HPDRs in the entire set of sequences. Recall that these regions were identified during the Level I search. The score for cell $C_{A,B}$ is updated according to Eq. (59.16) for every pattern pair $(A, B)$ hypothesis $H_{AB}$ generated in these regions. The probabilities of random occurrence of patterns A and B are $p_A$ and $p_B$ respectively, and $\Delta_{A,B}$ is the distance between them.

\[
\begin{aligned}
C_{AB} &= C_{AB} + \rho_{AB} \\
&= C_{AB} + (\lambda_A + \lambda_B - \ln \lambda_A - \ln \lambda_B), \quad \text{where } \lambda_A = p_A \Delta_{AB} \text{ and } \lambda_B = p_B \Delta_{AB}
\end{aligned}
\tag{59.16}
\]

An information-theoretic approach based on mutual information content is next utilized for characterizing the strengths between pattern pairs. The contents of the contingency table (after all the sequences in the database have been processed) need to be converted to correspond to

^4 The functional significance of $A \rightarrow BC$ is that protein binding to site A will lead to the binding of proteins at sites B and C. For a meta-pattern of the form $A \leftrightarrow B$, both the proteins must simultaneously bind to bring forth the necessary function.
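The hypothesis generation rule of Eq. (59.15) and the score update of Eq. (59.16) can be sketched as follows. This is a hedged illustration rather than the chapter's implementation: a pattern-sequence is taken to be a plain list of `(position, pattern)` pairs for one HPDR, pairs at identical positions are skipped because $\Delta = 0$ would make $\ln \lambda$ undefined (an assumption not stated in the text), and all function and parameter names are hypothetical.

```python
import math
from collections import defaultdict

def generate_hypotheses(hits, theta):
    """Yield (A, B, delta): each hit paired with its nearest other hit, if closer than theta.

    hits is a list of (position, pattern) pairs found inside one HPDR.
    """
    for ix, (pos_x, a) in enumerate(hits):
        nearest = None  # (pattern, distance) of the closest other instance
        for iy, (pos_y, b) in enumerate(hits):
            delta = abs(pos_x - pos_y)
            if iy == ix or delta == 0:  # assumption: ignore co-located matches
                continue
            if nearest is None or delta < nearest[1]:
                nearest = (b, delta)
        if nearest is not None and nearest[1] < theta:
            yield a, nearest[0], nearest[1]

def update_table(table, hits, theta, prob):
    """Add rho_AB of Eq. (59.16) to table[(A, B)] for every hypothesis in one HPDR."""
    for a, b, delta in generate_hypotheses(hits, theta):
        lam_a, lam_b = prob[a] * delta, prob[b] * delta
        table[(a, b)] += lam_a + lam_b - math.log(lam_a) - math.log(lam_b)
    return table
```

With hits `[(100, "A"), (130, "B"), (900, "C")]` and `theta=200`, the sketch produces both $A \rightarrow B$ and $B \rightarrow A$ at a distance of 30 bp, the situation in which the text promotes the pair to a pattern equivalence, while the isolated pattern C produces no hypothesis. A `defaultdict(float)` is a convenient choice for `table`, since cells are created on first update.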