Analysis of Runs and Patterns Lim Soon Kit An academic exercise presented in partial fulfillment for the degree of Master of Science in Mathematics. Supervisor : Dr. Zhang Louxin Department of Mathematics National University of Singapore 2003/2004 Contents Acknowledgements iii Summary iv 1 Introduction and background 1 1.1 Bioinformatics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Non-parametric Statistics . . . . . . . . . . . . . . . . . . . . . 5 1.3 Psychology, Ecology and Meteorology . . . . . . . . . . . . . . . 7 1.4 Statistical quality control and charts . . . . . . . . . . . . . . . 8 1.5 Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.6 Radar astronomy . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.7 Sociology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.8 Formulation of the run statistics problems . . . . . . . . . . . . 12 2 Success runs 16 2.1 Probability of occurrence . . . . . . . . . . . . . . . . . . . . . . 17 2.2 Mean waiting time for first occurrence . . . . . . . . . . . . . . 24 i 2.3 Mean and variance of number of occurrences (overlapping allowed) 27 3 Success-failure runs 29 3.1 Number of occurrences of R (overlapping allowed) . . . . . . . . 30 3.2 Distance between occurrences . . . . . . . . . . . . . . . . . . . 36 4 Spaced Patterns 39 4.1 Probability of occurrence . . . . . . . . . . . . . . . . . . . . . . 40 4.2 ¯n . . . . . . . . . . . . . . . . . . . . . . . . . Approximating Q 47 4.3 Bounds on β and λ . . . . . . . . . . . . . . . . . . . . . . . . . 55 Bibliography 58 ii Acknowledgements I would like to take this opportunity to express my gratitude to my supervisor, Dr. Zhang Louxin for his help and guidance throughout this academic exercise. His advice on different aspects of life, not just academically, was much appreciated. iii Summary A run or pattern is a specified sequence of outcomes that may occur at some point in the series of trials. Runs containing a single symbol are called success runs. Runs containing 2 symbols are called success-failure runs. Runs containing more than 2 symbols are called multiple runs. In Chapter 1, we define various statistics based on runs and patterns and look at how the analysis of runs and patterns is being used in practical applications. In Chapter 2, we will derive the probability of occurrence of a sucess run of length l before or at position n, denoted by Qn , through a recurrence relation in its complement Qn . The probability of the first occurrence of a sucess run of length l at position n, denoted by fn , is also obtained. We will also obtain the mean waiting time for the first occurence of a success run and the mean and variance of the number of occurrences. In Chapter 3, we look at success-failure runs and discuss how to obtain the mean and variance of the number of occurrences of such runs. We also define and obtain expressions for the ‘distance’ between occurrences of success-failure iv runs. In Chapter 4, we study spaced patterns where a spaced pattern is denoted by a specified sequence on the set {1, * }, and the 1s correspond to the matching positions and 0s correspond to the ‘don’t care’ positions. We will derive the probability of occurrence of a spaced pattern through a set of recurrence relations. We will also look at some asymptotic results for approximating the probability of occurrence of spaced patterns. v Chapter 1 Introduction and background Consider a series of n trials, x1 , x2 , . . . 
, xn, where each trial has m + 1 ≥ 2 possible outcomes and xi ∈ A = {0, 1, . . . , m} for all i. A run or pattern is defined to be a specified sequence of outcomes that may occur at some point in the series of trials. For example, let A = {1, 2, 3} and consider the runs R1 = 1111, R2 = 1122 and R3 = 1231. Runs containing a single symbol, for example R1, are called success runs. Runs containing 2 symbols, for example R2, are called success-failure runs. Runs containing more than 2 symbols, for example R3, are called multiple runs. We can also consider runs specified by the set A ∪ {∗}, where the *s correspond to 'don't care' positions. We call such runs 'spaced patterns'. For example, in the spaced pattern R4 = 12∗1, the '*' can be any symbol from the alphabet set A. The probabilistic analysis of runs and patterns plays an important role in many statistical areas such as reliability of engineering systems, DNA sequencing, nonparametric statistics, psychology, ecology and radar astronomy. In the next few sections, we will look at how the analysis of runs and patterns is being used in practical applications.

1.1 Bioinformatics

The sequence similarity of a newly discovered gene to known genes is often an important clue to its function and structure. By comparing DNA/protein sequences, one can learn about the functionality or the structure of proteins without performing any experiments. Therefore, whenever a biologist sequences a gene, the next thing to do is to search the sequence or protein databases for similar sequences. To give a measure of the similarity between two DNA or protein sequences, we need a definition: An alignment of two sequences x and y is a new pair of sequences x′ and y′ of equal length such that

1. x′ and y′ are obtained from x and y respectively by inserting occurrences of the space symbol '-';

2. no two space symbols lie in the same position in x′ and y′.

For example, if x = abcbdc and y = accdbdb, then one alignment of x and y is as follows:

x′ = a - b c b d c -
y′ = a c c d b d - b

Furthermore, the above alignment contains 3 matches, 2 mismatches and 3 insertions of '-'. The quality of an alignment, that is, the degree to which it displays the similarity between the two sequences, is measured by a score. Such a score is given by the sum of scores associated with the individual columns of the alignment. The score of a column is given by a symmetric scoring function f that maps pairs of symbols from the alphabet A ∪ {−} to real numbers, where A is {A, C, T, G} (for DNA sequences) or the set of amino acids. Generally, we will have f(a, a) > 0 for all a ∈ A and f(a, b) < 0 for a ≠ b, so that matched columns increase the score of the alignment whereas mismatches and insertions of '-' are penalized. Such a scoring scheme can be given in the form of a table, such as

        A    C    T    G    −
  A     3   -2   -1   -1   -1
  C    -2    3   -1   -1   -1
  T    -1   -1    3   -2   -1
  G    -1   -1   -2    3   -1
  −    -1   -1   -1   -1

The similarity between two sequences x and y is measured by the maximum score of an alignment of x and y. To search a sequence database for sequences similar to a given sequence x, it would be desirable to search through the entire database for high-scoring local alignments. However, this would require a huge amount of computational time. Instead, a heuristic approach is used for this problem.
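To make the scoring scheme concrete, the following is a minimal Python sketch (not part of the original text; the function and variable names are illustrative) that scores a given alignment column by column with the table above.

# Minimal sketch of the column-by-column alignment score described above.
# Identical bases score 3, the A/C and T/G mismatches score -2, and every
# other mismatch or column containing '-' scores -1, as in the table.

MISMATCH_MINUS_TWO = {frozenset("AC"), frozenset("TG")}

def f(a, b):
    """Symmetric column score for a pair of symbols from {A, C, T, G, -}."""
    if a == '-' or b == '-':
        return -1
    if a == b:
        return 3
    return -2 if frozenset((a, b)) in MISMATCH_MINUS_TWO else -1

def alignment_score(x_aln, y_aln):
    """Sum of the column scores of two equal-length aligned strings."""
    assert len(x_aln) == len(y_aln)
    return sum(f(a, b) for a, b in zip(x_aln, y_aln))

print(alignment_score("AC-GT", "ACCGT"))   # 3 + 3 - 1 + 3 + 3 = 11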
The BLAST (Basic Local Alignment Search Tool) programme first finds reasonably long exact matches (k consecutive bases) between the given sequence and a sequence in the database, and then extends these exact matches into local alignments. The principle underlying this is that, based on statistical study, two sequences are likely to have a high-scoring local alignment only if there are reasonably long exact matches between them. The larger the value of k, the faster the programme but the poorer the sensitivity. The value of k is usually chosen to be 11 by considering the tradeoff between search speed and sensitivity.

Another programme used for database search is PatternHunter [13]. Unlike BLAST, which looks for k consecutive matches, PatternHunter looks for k nonconsecutive matches given by a matching pattern. More precisely, the specific set of matching positions is given by a spaced pattern of *s and 1s, where the 1s give the matching positions. For example, if we use the spaced pattern 1*1*1, then we are only looking for matches in the first, third and fifth positions. The two sequences ATCGACC and AGCTACC contain 2 matches of the spaced pattern 1*1*1, ending at the fifth and seventh positions, as shown below (the last row marks the positions where the two sequences agree):

A T C G A C C
A G C T A C C
1 0 1 0 1 1 1

Note that we are only looking for matches at the first, third and fifth positions of the pattern, hence we do not care about the other positions.

The sensitivity of a spaced pattern S is defined as the probability of the occurrence of S in a random 0-1 sequence of fixed length N = 64. Both the pattern and the number of 1s in S determine its sensitivity. For example, the spaced pattern 111*1**11**1*1*111 has a probability of occurrence of 0.712 for a 64-bit random string, while the pattern 11111111111 used by BLAST has only a probability of 0.3. Theoretical analysis of the sensitivity of spaced patterns is important both in theory and in practical applications.

1.2 Non-parametric Statistics

A nonparametric test is a test of a hypothesis which is not a statement about parameter values. The type of statement permissible then depends on the definition accepted for the term parameter. The hypothesis for a nonparametric test can only be related to the form of the population, such as in goodness-of-fit tests, or to some characteristic of the probability distribution of the sample data, such as in tests of trend and randomness, and for identical sampled populations.

Suppose that on some day during lunch time, a queue of fifteen persons waiting in line to get into a certain restaurant is observed. There are eight males (M) and seven females (F) forming a line in the arrangement:

M, F, M, F, M, F, M, F, M, F, M, F, M, F, M

Would this be considered a random arrangement by sex? Intuitively, the answer is no, since the males and females seem to alternate, suggesting intentional mixing by pairs. This arrangement is an extreme case, just like

M, M, M, M, M, M, M, M, F, F, F, F, F, F, F

with intentional clustering. In less extreme situations, the randomness of an arrangement can be tested statistically using the theory of runs. Given an ordered sequence of two or more types of symbols, a run is defined to be a succession of one or more identical symbols which are followed and preceded by a different symbol or no symbol at all. Hence in the analysis of strings and patterns, a run is simply a consecutive pattern of length at least one.
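As an illustration of the statistics used in such a runs test, the short Python sketch below (an addition for illustration, with assumed names) counts the runs and the longest run in the two queue arrangements above.

from itertools import groupby

def run_lengths(seq):
    """Lengths of the maximal runs of identical adjacent symbols, in order."""
    return [len(list(g)) for _, g in groupby(seq)]

queue = "MFMFMFMFMFMFMFM"        # the alternating lunch queue above
clustered = "MMMMMMMMFFFFFFF"    # the fully clustered arrangement above

for s in (queue, clustered):
    lengths = run_lengths(s)
    print(s, "->", len(lengths), "runs, longest run of length", max(lengths))
# The alternating queue has 15 runs of length 1; the clustered one has
# only 2 runs, of lengths 8 and 7.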
Clues to lack of randomness are provided by the tendency of symbols to exhibit a definite pattern in the sequence. If there is sequential dependency among symbols of the same type, then they may tend to cluster and this may be indicated by an unusually small number of runs, or clusters of same symbols, or by runs of unexpected length. Thus the total number of runs and the lengths of the runs should reflect the existence of some sort of pattern. Hence the two criterion can be used to test for randomness. Too few runs, too many runs, a run of excessive length, or too many runs of excessive length etc, can be used as statistical criteria for the rejection of the null hypothesis of randomness as these situations should occur rarely if the sequence is truly random. Chapter 1: Introduction and background 1.3 7 Psychology, Ecology and Meteorology The theory that success breeds success, that is, that attaining a positive outcome makes it more probable that a positive outcome will be attained on the next trial, is often considered in psychological achievement testing, animal learning studies, athletic competition and similar matters. It is also possible that failure breeds failure, or that both phenomena may be present. In animal learning experiments, a test animal performs a sequence of trials, in each of which it either succeeds or fails at selecting a box containing food, at running a maze or at some other tasks. By observing the length of the longest run, a psychologist can test for improvement in the animal’s performance. In athletic competition, a psychologist may make the hypothesis that winning a trophy one year increases an athletic team’s motivation to win it the next year. The psychologist obtains a sample of the list of wins(W) and losses(L) of an annual trophy for a certain team, and writing it in the form of a sequence, for example: WWWLLWWWWLLLLWWLLLWWWWWW Then a nonparametric test based on the length of the longest run can be used accept or reject the hypothesis. In ecological studies on the distribution of some characteristics such as species type or the prevalence of a specific disease, it is quite common to take a line or belt transect and then observe the characteristics of the sampled trees. Chapter 1: Introduction and background 8 The length of the longest run of trees having that specific characteristic is then used for drawing some useful conclusions on the segregation of the species or the spread of the disease. Similar situations arise in meteorology when one is interested in checking whether there is a tendency towards the persistence of the same type of weather. 1.4 Statistical quality control and charts Sampling inspection plans commonly used are those based on the familiar curtailed inverse sampling scheme in which inspection stops when either a certain number of non-conforming items is completed or a prespecified number of items is sampled, whichever comes early. The lot will be rejected in the cae of the former, while the lot will be accepted in the case of the latter. Such sampling plans are desirable as they are economical based on time and cost considerations. The idea of using run as a stopping criterion in acceptance sampling was introduced. From that time on, many run-based acceptance sampling plans have been proposed and investigated. 
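A minimal simulation of the curtailed sampling plan just described might look as follows; the parameter values (acceptance threshold c = 3 non-conforming items, sample size N = 50, defect probability 0.05) are assumptions chosen purely for illustration and do not come from the text.

import random

def curtailed_inspection(defect_prob, c, N, rng=random):
    """Inspect items one by one; reject the lot as soon as c non-conforming
    items are found, accept it if N items are inspected first."""
    defects = 0
    for inspected in range(1, N + 1):
        if rng.random() < defect_prob:
            defects += 1
            if defects >= c:
                return "reject", inspected   # curtailed: stop early
    return "accept", N

random.seed(0)
trials = [curtailed_inspection(0.05, c=3, N=50) for _ in range(10_000)]
reject_rate = sum(1 for outcome, _ in trials if outcome == "reject") / len(trials)
avg_inspected = sum(n for _, n in trials) / len(trials)
print(reject_rate, avg_inspected)   # rejection probability and average sample size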
In the application of statistical methods to quality control charts, a usual procedure is to construct a control chart with control limits spaced about the mean such that under conditions of statistical control, or random sampling, the probability of an observation falling outside these limits is a given α, for example, α = 0.05. The occurrence of a point outside these limits is taken as an indication of the Chapter 1: Introduction and background 9 presence of assignable causes of variation in the production line. Such a form of chart has been found to be of particular value in the detection of the presence of assignable causes of variability in the quality of manufactured product. As recently pointed out, however, the statistician may not only help to detect the presence of assignable causes, but also help to discover the causes themselves in the course of further research and development. For this purpose, runs of different kinds and of different lengths have been found useful by industrial statisticians. Quality control engineers have found that a convenient indication of lack of control is the occurrence if long runs of observations whose values lie above or below that of the median of the sample. 1.5 Reliability A consecutive k-out-of-n failure system is a system of components numbered 1 through n which fails if and only if at least k consecutively numbered components fail. A more formal and precise definition of this reliability system is as follows. Consider a system consisting of n components placed in a line and labeled as first, second, and so on, up to n-th. Each component can only be in one of two states, either operating (up/good/working) or failed (down/bad/not working), and so aslo the entire system. The n components are assumed to work independently of each other. Then, the whole system fails whenever at least k consecutive components are in their failed state - that is, the system functions if and only if there exists no succession of k failed components. Such Chapter 1: Introduction and background 10 systems arise in many different settings including telecommunications and computer networks. In order to make the idea of this system clear, we now provide two examples: 1. A system of n radar stations is used for transmitting information from Site A to Site B. Suppose that the stations are equally spaced between the two sites and that each station is able to transmit signals up to a distance of k microwave stations. Then, the system clearly becomes nonfunctional (unable to transmit information from Site A to Site B) if and only if at least k radar stations are out of order; 2. A fluid transportation network uses pumps and pipes to carry the fluid fron Point A to Point B. Suppose that the pump stations are equally spaced between the two points and that each pump station is able to transport fluid up to a distance of k pump stations. If one pump is down, the transportation of fluid is not hindered as the previous station can overcome this difficulty. However, when k or more consecutive pumps are down, the transportation of fluid stops. Many different aspects and characteristics of such systems have been studied under various assumptions regarding the random variables describing the performance of the components. Early work dealt with independent and identically distributed components and considered procedures for finding the reliability of such systems. 
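Under the usual assumption of independent components, each failing with probability q, the reliability of a consecutive k-out-of-n:F system can be computed by a small dynamic program over the length of the current run of failed components. The sketch below is an illustrative addition (names and values are not from the cited literature).

def reliability(n, k, q):
    """P(no run of k or more consecutive failures among n independent
    components, each failing with probability q)."""
    p = 1.0 - q
    # state[j] = P(system still working and exactly the last j components failed)
    state = [0.0] * k
    state[0] = 1.0
    for _ in range(n):
        new = [0.0] * k
        new[0] = p * sum(state)          # a working component resets the failure run
        for j in range(1, k):
            new[j] = q * state[j - 1]    # one more failure, run still shorter than k
        state = new                       # runs reaching length k are dropped (system failed)
    return sum(state)

print(reliability(n=10, k=3, q=0.1))     # roughly 0.993
print(reliability(n=10, k=1, q=0.1))     # 0.9**10: with k = 1 the system is a series system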
When analyzing the reliability of a consecutive k-out-of-n failure Chapter 1: Introduction and background 11 system, the reliability can be viewed as the probability that the random number of runs of at least k consecutive failures in n independent Bernoulli trials is zero. Alternatively, the reliability can be viewed as the probability that the waiting time until the k consecutive failure is greater that n. 1.6 Radar astronomy In radar astronomical observations of minor planets or asteroids, data consist of echo-power spectral-density estimates at a sequence of Doppler frequencies. Empirically determined background filter shape can be removed from the raw spectrum, and the resultant backgound-free spectrum normalized to the rootmean-square fluctuation in the receiver noise. If no echo is present, the model that the spectral estimates behave as a sequence of independent and identically distributed Normal(0,1) random variables agrees with both a priori theoretical considerations and a posteriori experimental evidence. However, if the target has a sufficiently large radar cross-section, a radar echo would be expected to produce a sequence of above-average readings in some portion of the frequency band. A test for the presence of an echo can be based on the length of the longest run of positive readings. Chapter 1: Introduction and background 1.7 12 Sociology The behaviour of groups of people forming lines and other structure can be modelled as a Markov-dependent sequence of trials in which some characteristc such as the sex of the individual is taken as the trial outcome. It is of interest to find out whether the occurrence of certain runs is plausible under various types and orders of Markov dependence. For example, for a group of primary school children queueing up before a canteen stall, a sociologist may wish to test the hypothesis that groupings of small children are random with regard to sex against the alternative hypothesis that small children of the same sex tend to congregate. The total number of runs in the sequence can be used as a nonparametric test for randomness. 1.8 Formulation of the run statistics problems Recall that a run or pattern is defined to be a specified sequence of outcomes that may occur at some point in the series of trials. Let x1 , x2 , . . . , xn be a sequence of n trials, where xi ∈ A = {0, 1, . . . , m} for all i and m ≥ 1. Since we are mainly interested in success runs, success-failure runs and spaced patterns, we will let m to be 1 throughout this thesis, that is , we will take A to be the set of binary bits {0, 1}. Generally, the probabilities of the m outcomes can vary arbitrarily from trial to trial, and can be L-order Markov dependent on the L preceding outcomes. In this thesis, we will assume Chapter 1: Introduction and background 13 that the outcome probabilities at any trial are independent of the outcomes of all previous trials. It is necessary to address the question of how to count the number of times a given run or pattern occurs in a sequence of trials. For example, we want to count the number of occurrences of the patterns 1010 and 1111, and the spaced pattern 1*1*1 in the sequence 1111010101111010. If overlapping occurrences of 1010 are all relevant and counted, then the pattern 1010 is counted as occuring in positions 4-7, 6-9 and 13-16, thus occurring a total of 3 times.Even though two of these patterns overlap in positions 6 and 7, both are counted. 
If second and higher overlapping patterns are not counted, the pattern 1010 is counted as occurring only twice, namely in positions 4-7 and 13-16. The pattern 1111 occurs twice in positions 1-4 and 10-13 without ambiguity. For the spaced pattern 1*1*1, we are only interested in the bit 1 to occurring at the first, third and fifth positions. That is, any occurrences of 10101, 11101, 10111 or 11111 in the sequence contributes to one count of the spaced pattern 1*1*1 if overlaps are allowed. Therefore the spaced patten 1*1*1 occurs a total of five times in the sequence at positions 2-6, 4-8, 6-10. 812 and 11-15, ignoring overlaps. Hence it is clear that the number of times that a given patttern occurs depends on whether overlaps are allowed. For different purposes different definitions have been adopted. For example, in [8], Feller considers only non-overlapping counting for a recurrent event. In [9], both Chapter 1: Introduction and background 14 overlapping and non-overlapping countings are considered. It is also largely a matter of convention and convenience whether we consider the starting or ending position as the position of occurrence. In the previous example, the pattern 1111 occurs in the sequence from positions 1 to 4. We say that the pattern 1111 occurs at position 1 if we consider the starting position as the position of occurrence. On the other hand, we can say that the pattern 1111 occurs at position 4 if we consider the ending position instead as the position of occurrence. Throughout, we will take the ending position as the position of occurrence. Due to the many areas of applications of runs and patterns, different statistics of runs and patterns have been defined for various purposes. Occurrence of runs, distance between occurrences, length of longest success runs and other problems have been studied in various papers. In [8], the author obtains the generating function of the waiting time for a run of l successes and derives the related mean waiting time. The probability that a success run occurs before a failure run is also discussed. In [9], probability distributions of the number of success runs of size exactly k and the number of success runs of size greater than or equal to k, among others, are derived using the technique of Markov chain imbedding. In [7], the ‘distance’ between occurrences and the mean and variance of the number of occurrences is obtained for the pattern 1010. The distance between two occurrences of a pattern is the difference between the two positions of occurrences. For example, in the sequence 1101001101, the pattern Chapter 1: Introduction and background 15 11 occurs twice and the distance between the two occurrences is 4. In [10], the length of the longest success run and the total number of runs are mainly of interest in randomness testing. In this thesis we will derive the probability of occurrence of a sucess run of length l before or at position n, denoted by Qn , through a recurrence relation in its complement Qn . The probability of the first occurrence of a sucess run of length l at position n, denoted by fn , is also obtained. We will also obtain the mean waiting time for the first occurence of a success run and the mean and variance of the number of occurrences. Next, we will look at success-failure runs and discuss how to obtain the mean and variance of the number of occurrences of such runs. We also define and obtain expressions for the ‘distance’ between occurrences of success-failure runs. 
Finally, we study spaced patterns where a spaced pattern is denoted by a specified sequence on the set {1, * }, and the 1s correspond to the matching positions and 0s correspond to the ‘don’t care’ positions. We will derive the probability of occurrence of a spaced pattern through a set of recurrence relations. We will also look at some asymptotic results for approximating the probability of occurrence of spaced patterns. Chapter 2 Success runs In this chapter, we will derive the probability of occurrence of a sucess run of length l before or at position n, denoted by Qn , through a recurrence relation in its complement Qn . The probability of the first occurrence of a sucess run of length l at position n, denoted by fn , is also obtained. Let A = {0, 1}. Recall that a success run is a run R consisting of a single symbol on A. We will consider a success run R to be a pattern of consecutive 1s, that is, R = 11 · · · 1 = 1l , where l is the length of R. At any trial, the probability of the bit 1 occuring is denoted by p, and the probability of the bit 0 occuring is denoted by q = 1 − p. We assume that the outcome probabilities at any trial are independent of the outcomes of all previous trials. Recall that, if R = 1l occurs at positions k to k + l − 1, we say that R occurs at position k + l − 1, the ending position. 16 Chapter 2: Success runs 2.1 17 Probability of occurrence chapter Let Qn , n ≥ 1, denote the probability that R occurs before or at position n and let Qn = 1 − Qn . Hence Qn = 0 for n < l and Ql = pl . Note that if R does not occur before or at position n, then the first bit of 0 must have occurred in the first l positions of the trials. That is, we have Qn = qQn−1 + qpQn−2 + qp2 Qn−3 + · · · + qpl−1 Qn−l l−1 qpi Qn−1−i . = i=0 Theorem 2.1. Qn = Qn−1 − qpl Qn−1−l . Chapter 2: Success runs 18 Proof. l−1 l−1 i Qn − Qn−1 = qpi Qn−2−i qp Qn−1−i − i=0 i=0 l−2 (qpi+1 − qpi )Qn−2−i − qpl−1 Qn−1−l = qQn−1 + i=0 l−1 l−2 qpi Qn−2−i + = q i=0 l−2 (qpi+1 − qpi )Qn−2−i − qpl−1 Qn−1−l i=0 l−2 q 2 pi Qn−2−i + q 2 pl−1 Qn−1−l + = i=0 l−1 qp (qpi+1 − qpi )Qn−2−i − i=0 Qn−1−l l−2 (q 2 pi + qpi+1 − qpi )Qn−2−i + (q 2 pl−1 − qpl−1 )Qn−1−l = i=0 l−2 qpi (q + p − 1)Qn−2−i + qpl−1 (q − 1)Qn−1−l = i=0 = −qpl Qn−1−l . Remark. The above result has been proven by many researchers independently, including [5]. Corollary 2.2. Let fn be the probability that R first occurs at position n. Then fn = qpl Qn−1−l . Chapter 2: Success runs 19 Proof. fn = P (R first occurs at position n) = P (R occurs before or at position n) − P (R occurs before or at position n − 1) = Qn − Qn−1 = Qn−1 − Qn = qpl Qn−1−l . Remark. fn satisfies the same recurrence relation as Qn since qfn−1 + qpfn−2 + · · · + qpl−1 fn−l = q(Qn−2 − Qn−1 ) + qp(Qn−3 − Qn−2 ) + · · · +qpl−1 (Qn−2 − Qn−1 ) = Qn−1 − Qn = fn . Proposition 2.1.1. For any l ≤ n ≤ 2l, Qn = pl + (n − l)pl q. Chapter 2: Success runs 20 Proof. From Theorem 2.1 we get that Qn = Qn−1 + pl qQn−l−1 = Qn−1 + pl q since Qn−l−1 = 1 for l ≤ n ≤ 2l = Qn−2 + 2pl q . = .. = Qn−(n−l) + (n − l)pl q = pl + (n − l)pl q. Proposition 2.1.2. Q2l+1 = pl + (l + 1)pl q − p2l q. Proof. Q2l+1 = Q2l + pl qQl = pl + lpl q + pl q(1 − pl ) = pl + (l + 1)pl q − p2l q. Proposition 2.1.3. For any n ≥ l, Qn ≥ Ql Qn−l . Proof. The proof is by induction on n. Chapter 2: Success runs 21 When n = l, equality holds and hence the statement is true. 
Now, Qn+1 − Ql Qn+1−l = qQn + qpQn−1 + · · · + qpl−1 Qn−l+1 − Ql (qQn−l + qpQn−l−1 + · · · +qpl−1 Qn+1−2l ) = q(Qn − Ql Qn−l ) + qp(Qn−1 − Ql Qn−l−1 ) + · · · +qpl−1 (Qn−l+1 − Ql Qn+1−2l ) ≥ 0. by strong induction. Theorem 2.3. For any n ≥ 2l, k ≥ l, we have 1. Qn ≥ Qk Qn−k+l−1 , 2. Qn ≤ Qk Qn−k . Proof. We will prove (1) by induction on k. For the base case k = l, the inequality holds by Proposition 2.1.3. Hence Qn − Qk+1 Qn−(k+1)+l−1 = Qn − Qk+1 Qn−k+l−2 = qQn−1 + qpQn−2 + · · · + qpl−1 Qn−l −Qn−k+l−2 (qQk + qpQk−1 + · · · + qpl−1 Qk+1−l ) = q(Qn−1 − Qk Qn−k+l−2 ) + qp(Qn−2 − Qk−l Qn−k+l−2 ) + · · · +qpl−1 (Qn−l − Qk+1−l Qn−k+l−2 ) ≥ 0 Chapter 2: Success runs 22 by strong induction. To prove (2), we let Ai denote the event that R occurs at position i, and Ai the complement of Ai . Note that Ai = ∅ for i = 1, 2, . . . , l − 1. Hence we have Qn = P (Al Al+1 · · · Ak · · · Ak+l · · · An ) ≤ P (Al Al+1 · · · Ak Ak+l · · · An ) = P (Al Al+1 · · · Ak ) × P (Ak+l Ak+l+1 · · · An ) = Qk Qn−k . Theorem 2.4. For any n ≥ 2l, k ≥ l, we have 1. fn ≥ fk Qn−k+l−1 , 2. fn ≤ fk Qn−k . Proof. Recall that fn = P (R first occurs at position n) = qpl Qn−l−1 . Using the first inequality of Theorem 2.3 with n and k replaced by n − l − 1 and k − l − 1 respectively, we have Qn−l−1 ≤ Qk−l−1 Q(n−l−1)−(k−l−1)+l−1 = Qk−l−1 Qn−k+l−1 . Chapter 2: Success runs 23 Multiplying both sides of the inequality by qpl , we get that fn = qpl Qn−l−1 ≤ qpl Qk−l−1 Qn−k+l−1 = fk Qn−k+l−1 . To prove the second inequality, observe that fn = Qn−1 − Qn = P (Al Al+1 · · · An−1 ) − P (Al Al+1 · · · An ) = P (Al+1 Al+2 · · · An ) − P (Al Al+1 · · · An ) = P (Al Al+1 · · · Ak · · · Ak+l Ak+l · · · An ) ≤ P (Al Al+1 · · · Ak Ak+l · · · An ) = P (Al Al+1 · · · Ak ) × P (Ak+l Ak+l+1 · · · An ) = fk Qn−k . Chapter 2: Success runs 2.2 24 Mean waiting time for first occurrence Let R = 1l be a success run of length l. The mean waiting time for first occurrence of R, or ‘recurrence times’ as mentioned in [8], is derived using generating functions by Feller. In this section, we will use a different method to derive the mean waiting time. Let v denote the position where R first occurs. We will obtain E(v), the mean of v in terms of l and p, the probability of the bit 1 occurring. 1 − pl Theorem 2.5. [8] E(v) = . (1 − p)pl Proof. By definition of the mean, we have ∞ iP (v = i) E(v) = i=l ∞ ifi = i=l = lfl + (l + 1)fl+1 + · · · i=l fi = i=l+k i=l+2 i=l+1 ∞ ∞ Observe that fi + · · · . fi + fi + = l ∞ ∞ ∞ (Qi−1 − Qi ) = Ql+k−1 and hence i=l+k ∞ E(v) = l + Qi . i=l Chapter 2: Success runs 25 ∞ Next we find an expression for Qi . i=l+1 ∞ ∞ (Qi−1 − qpl Qi−l−1 ) Qi = i=l+1 i=l+1 ∞ ∞ Qi−1 − qp = l Qi−l−1 i=l+1 i=l+1 ∞ l Qi − qpl ( = ∞ Qi + Qi ) i=0 i=l i=l+1 ∞ l = Ql + Qi − qp ∞ l Qi − qp l i=0 i=l+1 ∞ Collecting terms of Qi on the left, we have i=l+1 ∞ Qi i=l+1 1 = (Ql − qpl l qp l Qi ) i=0 1 = (1 − pl ) − qpl l = Qi . i=l+1 l Qi i=0 l−1 1−p −( Qi + Ql ) qpl i=0 1 − pl = − l − Ql . qpl Chapter 2: Success runs 26 Finally we get that ∞ E(v) = l + Qi i=l ∞ = l + Ql + Qi i=l+1 1 − pl qpl 1 − pl = . (1 − p)pl = Chapter 2: Success runs 2.3 27 Mean and variance of number of occurrences (overlapping allowed) In this section we will find the mean and variance of the total number of occurrences of R in a string of length n. Define the indicator variable Ij by 1 if R = 1l occurs at position j, Ij = 0 otherwise. Hence in a random string of length n, the total number of times R occurs is n given by T = Ij . Therefore j=l n E(T ) = E(Ij ) j=l = (n − l + 1)pl . 
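Before turning to the variance, the results obtained so far can be checked numerically. The following is an illustrative script (not part of the thesis): it computes Q̄n from the recurrence of Theorem 2.1 and compares the truncated series E(v) ≈ l + Σ_{i≥l} Q̄i, used in the proof of Theorem 2.5, with the closed form (1 − p^l)/((1 − p)p^l).

def qbar_sequence(l, p, n_max):
    """Qbar_n for n = 0..n_max: probability that the success run 1^l has
    NOT occurred before or at position n, via Theorem 2.1."""
    q = 1.0 - p
    qbar = [1.0] * (n_max + 1)           # Qbar_n = 1 for n < l
    qbar[l] = 1.0 - p ** l
    for n in range(l + 1, n_max + 1):
        qbar[n] = qbar[n - 1] - q * (p ** l) * qbar[n - 1 - l]
    return qbar

l, p, n_max = 3, 0.5, 2000
qbar = qbar_sequence(l, p, n_max)
mean_waiting = l + sum(qbar[l:])                      # truncated series from the proof
closed_form = (1 - p ** l) / ((1 - p) * p ** l)
print(mean_waiting, closed_form)                      # both about 14.0 for l = 3, p = 0.5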
n Ij2 ) 2 E(T ) = E( j=l + 2(n − l) terms of the form E(Ij Ij+1 ) + 2(n − (l + 1)) terms of the form E(Ij Ij+2 ) . + .. + 2(n − (l + l − 2)) terms of the form E(Ij Ij+l−1 ) + (n − (2l − 1))(n − (2l − 2)) terms of the form E(Ij Ij+k ), k ≥ l. n n Ij ). For 1 ≤ k ≤ l − 1, E(Ij Ij+k ) = pl+k . For Ij2 ) = E( Note that E( j=l j=l 2l k ≥ l, E(Ij Ij+k ) = p . Hence, l−2 E(T 2 ) = (n − l + 1)pl + 2pl (n − l − i)pi+1 + (n − 2l + 1)(n − 2l + 2)p2l . i=0 Chapter 2: Success runs 28 Finally, we have V ar(T ) = E(T 2 ) − (E(T ))2 l−2 l (n − l − i)pi+1 + (n − 2l + 1)(n − 2l + 2)p2l l = (n − l + 1)p + 2p i=0 −(n − l + 1)2 p2l . Chapter 3 Success-failure runs A success-failure run is a run containing 2 symbols. In this chapter, a successfailure run is a pattern consisting of 1s and 0s. In [7], the analysis of nucleotide sequences leads to the analysis of runs and patterns on the alphabet set {a, t, c, g}. The authors find the mean and variance of the number of occurrences and the distance between occurrences of runs, focusing on a specific success-failure run, namely gaga. In this chapter, we will generalize some of the results in [7] on the mean, variance of number of occurrences and distance between occurrences to any success-failure run. Let A = {0, 1}. We consider the pattern R, where R is any arbitrary pattern consisting of 1s and 0s. We say that R is of weight w if there are w 1s (and hence l − w 0s) in R. 29 Chapter 3: Success-failure runs 3.1 30 Number of occurrences of R (overlapping allowed) Let R = r1 r2 . . . rl be a pattern of length l and weight w where ri ∈ {0, 1} for i = 1, 2, . . . , l. We want to find expressions for the mean and variance of the total number of occurrences of R in a string of length n. First, we illustrate how to find the mean and variance of the total number of occurrences of R = 1001 with length 4 and weight 2 in a string of length n. Then we will generalize the results for any arbitrary R. As in the previous chapter, we define the indicator variable Ij by Ij = 1 if R occurs at position j, 0 otherwise. n Ij Then the total number of times R occurs in a string of length n is T = j=l and E(T ) = (n − l + 1)pw q l−w where p and q are as defined previously. Hence, when R = 1001, l = 4, w = 2 and we have E(T ) = (n − 3)p2 q 2 . To find Chapter 3: Success-failure runs 31 the V ar(T ), we first calculate E(T 2 ). n 2 Ij2 ) E(T ) = E( j=4 + 2(n − 4) terms of the form E(Ij Ij+1 ) + 2(n − 5) terms of the form E(Ij Ij+2 ) + 2(n − 6) terms of the form E(Ij Ij+3 ) + (n − 6)(n − 7) terms of the form E(Ij Ik ), |j − k| ≥ 4. n Since Ij2 = Ij , we have E( n Ij ) = (n − 3)p2 q 2 . It is impossible Ij2 ) = E( j=4 j=4 for the pattern 1001 to occur at both positions j and j + 1. Therefore one or both of Ij and Ij+1 must be zero, that is, Ij Ij+1 = 0. Hence E(Ij Ij+1 ) = 0. Similarly, the pattern 1001 cannot occur at both positions j and j + 2, so E(Ij Ij+2 ) = 0. Next, Ij Ij+3 is zero unless the pattern 1001 occurs both at positions j and j + 3. This happens if and only if the pattern 1001001 occupies positions j − 3, j − 2, j − 1, j, j + 1, j + 2 and j + 3. This event occurs with probability p3 q 4 . Finally, when |j − k| ≥ 4, Ij Ik is zero unless the pattern 1001 occurs both at positions j and k. Since there are no overlapping positions as |j − k| ≥ 4, this event occurs with probability p4 q 4 . 
Therefore we have E(T 2 ) = (n − 3)p2 q 2 + 2(n − 6)p3 q 4 + (n − 6)(n − 7)p4 q 4 , and V ar(T ) = E(T 2 ) − (E(T ))2 = (n − 3)p2 q 2 + 2(n − 6)p3 q 4 + (n − 6)(n − 7)p4 q 4 − ((n − 3)p2 q 2 )2 = (n − 3)p2 q 2 + 2(n − 6)p3 q 4 + (33 − 7n)p4 q 4 . Chapter 3: Success-failure runs 32 Now, we will generalize the results for any arbitrary R. Theorem 3.1. Let R be any pattern of length l and weight w and let T be the total number of times R occurs in a string of length n. Then l−2 2 w l−w E(T ) = (n − l + 1)p q w l−w + 2p q i ((n − l − i)µl−1−i i=0 λ(rl−j )) + j=0 (n − 2l + 1)(n − 2l + 2)p2w q 2(l−w) where λ(0) = q, λ(1) = p and µi is to be defined. Proof. n 2 Ij2 ) E(T ) = E( j=l + 2(n − l) terms of the form E(Ij Ij+1 ) + 2(n − (l + 1)) terms of the form E(Ij Ij+2 ) . + .. + 2(n − (l + l − 2)) terms of the form E(Ij Ij+l−1 ) + (n − (2l − 1))(n − (2l − 2)) terms of the form E(Ij Ik ), |j − k| ≥ l. n Since Ij can only take on values of 0 and 1, we have Ij2 Ij2 ) = = Ij , hence E( j=l n Ij ) = E(T ). We now find the terms E(Ij Ij+k ) for k ≥ 1. E( j=l Observe that Ij Ij+k takes on the value 1 if and only if Ij = Ij+k = 1, that is, the pattern R occurs at position j and also at position j + k. The probability of such an event happening depends on the pattern R. Recall that Ai denotes the event that R occurs at position i. In addition, we let Ai,j denote the event Chapter 3: Success-failure runs 33 that R occurs at position i and j. For j < l, E(Ij ) = 0 and hence E(Ij Ij+k ) = 0 for any k ≥ 1. For j ≥ l and k ≥ l, Aj and Aj+k are independent events and hence E(Ij Ij+k ) = P (Aj,j+k ) = P (Aj ) × P (Aj+k ) = (pw q l−w )2 = p2w q 2(l−w) . (3.1.1) For j ≥ l and k < l, the events Aj and Aj+k are no longer independent, therefore, the expression of E(Ij Ij+k ) depends on the structure of R. Hence we define the constant µi to be 1 if r1 r2 . . . ri = rl−i+1 rl−i+2 . . . rl , µi = 0 otherwise. That is, µi is 1 if the first i bits of R are the same as the last i bits, and 0 otherwise. Then for j ≥ l and k < l, R can occur at position j and j + k only if µl−k = 1. The table below gives the values of µi for certain patterns of length 4. pattern µ1 µ2 µ3 µ4 1010 0 1 0 1 1111 1 1 1 1 1001 1 0 0 1 1101 1 0 0 1 1011 1 0 0 1 Chapter 3: Success-failure runs 34 For example, if R1 = 1010, then µ1 = 0, µ2 = 1, µ3 = 0 and µ4 = 1. Furthermore, if R1 occurs at position j, then R1 cannot occur at position j + 1 or j + 3 since µ1 = µ3 = 0. Thus, for j ≥ l and k < l, E(Ij Ij+k ) = µl−k P (Aj )λ(rl )λ(rl−1 ) · · · λ(rl−(k−1) ) k−1 w l−w = µl−k p q λ(rl−i ). (3.1.2) i=0 Combining equations (3.1.1) and (3.1.2), we get E(T 2 ) = E(T ) +2(n − l)pw q l−w µl−1 λ(rl ) 1 +2(n − (l + 1))pw q l−w µl−2 λ(rl−i ) i=0 .. . l−2 w l−w +2(n − (l + l − 2))p q µl−(l−1) λ(rl−i ) i=0 2w 2(l−w) +(n − (2l − 1))(n − (2l − 2))p q l−2 = (n − l + 1)pw q l−w + 2pw q l−w i ((n − l − i)µl−1−i i=0 2w 2(l−w) +(n − 2l + 1)(n − 2l + 2)p q λ(rl−j )) j=0 . Theorem 3.2. Let R be any pattern of length l and weight w and let T be the total number of times R occurs in a string of length n. Then l−2 w l−w V ar(T ) = (n − l + 1)p q w l−w + 2p q i ((n − l − i)µl−1−i i=0 ((3l − 1)(l − 1) − n(2l − 1))p2w q 2(l−w) . λ(rl−j )) + j=0 Chapter 3: Success-failure runs 35 Proof. V ar(T ) = E(T 2 ) − (E(T ))2 l−2 w l−w = (n − l + 1)p q w l−w i ((n − l − i)µl−1−i + 2p q i=0 λ(rl−j )) j=0 +(n − 2l + 1)(n − 2l + 2)p2w q 2(l−w) − (n − l + 1)2 p2w q 2(l−w) l−2 w l−w = (n − l + 1)p q w l−w + 2p q i ((n − l − i)µl−1−i i=0 λ(rl−j )) j=0 +((3l − 1)(l − 1) − n(2l − 1))p2w q 2(l−w) . 
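The formulas of Theorems 3.1 and 3.2 are straightforward to implement. The sketch below (illustrative names; the Monte Carlo comparison is an addition, not part of the thesis) computes E(T) and Var(T) for an arbitrary 0-1 pattern under the independence assumption used throughout, and checks them against simulated values for R = 1001.

import random
from math import prod        # Python 3.8+

def mu(R, i):
    """1 if the first i bits of R equal the last i bits of R, else 0."""
    return 1 if R[:i] == R[len(R) - i:] else 0

def lam(bit, p):
    """lambda(1) = p, lambda(0) = q."""
    return p if bit == '1' else 1.0 - p

def mean_var_T(R, n, p):
    """E(T) and Var(T) from Theorem 3.2, overlapping occurrences counted."""
    l, w, q = len(R), R.count('1'), 1.0 - p
    base = p ** w * q ** (l - w)                     # P(R occurs at a fixed position)
    mean = (n - l + 1) * base
    overlap = sum((n - l - i) * mu(R, l - 1 - i)
                  * prod(lam(R[l - 1 - j], p) for j in range(i + 1))
                  for i in range(l - 1))
    var = mean + 2 * base * overlap \
          + ((3 * l - 1) * (l - 1) - n * (2 * l - 1)) * base ** 2
    return mean, var

def simulate(R, n, p, reps=100_000, rng=random):
    """Monte Carlo estimate of E(T) and Var(T)."""
    l = len(R)
    counts = []
    for _ in range(reps):
        s = ''.join('1' if rng.random() < p else '0' for _ in range(n))
        counts.append(sum(s[j:j + l] == R for j in range(n - l + 1)))
    m = sum(counts) / reps
    return m, sum((c - m) ** 2 for c in counts) / reps

random.seed(1)
print(mean_var_T("1001", n=20, p=0.5))   # exact mean and variance from Theorem 3.2
print(simulate("1001", n=20, p=0.5))     # should agree to about two decimal places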
We will now use Theorem 3.2 to find V ar(T ) for R = 1001 once again. For R = 1001, we have λ(r1 ) = λ(r4 ) = p, λ(r2 ) = λ(r3 ) = q, µ1 = µ4 = 1 and µ2 = µ3 = 0. Hence, by Theorem 3.2, 4−2 2 4−2 V ar(T ) = (n − 4 + 1)p q 2 4−2 + 2p q i ((n − 4 − i)µ4−1−i i=0 +((3(4) − 1)(4 − 1) − n(2(4) − 1))p λ(r4−j )) j=0 2(2) 2(4−2) q = (n − 3)p2 q 2 + 2p2 q 2 ((n − 6)pq 2 ) + (33 − 7n)p4 q 4 = (n − 3)p2 q 2 + 2(n − 6)p3 q 4 + (33 − 7n)p4 q 4 which is the same as calculated previously. Remark. Note that all the formulae given are for the case when overlapping of pattern is allowed. If we do not allow overlapping, then the sequence of occurrences of a pattern R form a special case of recurrent events, the theory of which is a major area of probability theory. A detailed study of recurrent events can be found in [8]. Chapter 3: Success-failure runs 3.2 36 Distance between occurrences In this section, we are interested in the ‘distance’ between two occurrences of the same pattern R. For example, given that R = 1010 occurred at position j, what is the probability that R occurs again at position j + 1, or j + 2 and so on? We first define the ‘distance’ between two occurrences. If given that a pattern R occurs at position j and the next earliest reoccurrence is at position j + k, we say that the distance between the two occurrences is k. With µi defined in Section 3.1, a pattern R of length l having just occurred at some position j, can next occur at position j + 1, j + 2, . . . , j + l − 1, that is, at distance k = 1, 2, . . . , l − 1 respectively, only if µl−k = 1. Denote by dk , the probability that the distance until the next occurrence of R after occurrence at position j is k. Recall that Aj is the event that the pattern R occurs at position j and Aj,k is the event that the pattern R occurs both at position j and k. We will denote the intersection of 2 sets Ai and Aj by Ai Aj . Note that we can partion Aj+k into union of disjoint events: Aj+k = Aj+1 Aj+2 . . . Aj+k−1 Aj+k ∪Aj+1,j+k ∪ Aj+1 Aj+2,j+k ∪ Aj+1 Aj+2 Aj+3,j+k ∪ . . . ∪Aj+1 Aj+2 . . . Aj+k−2 Aj+k−1,j+k = Aj+1 Aj+2 . . . Aj+k−1 Aj+k ∪ B1 ∪ B2 ∪ . . . ∪ Bk−1 Chapter 3: Success-failure runs 37 where Bi = Aj+1 . . . Aj+i−1 Aj+i,j+k for i = 1, 2, . . . , k − 1. Hence, dk = P (Aj+1 Aj+2 . . . Aj+k−1 Aj+k |Aj ) = P (Aj+k |(Aj − (B1 ∪ B2 . . . ∪ Bk−1 |Aj ))). k−l w l−w Theorem 3.3. For k ≥ l, dk = p q w l−w −p q k−1 di − i=1 k−i−1 di µl−k+i λ(rl−j ). j=0 i=k−l+1 Proof. P (Aj+k |Aj ) = pw q l−w . For i ≤ k − l, P (Bi |Aj ) = di pw q l−w since Aj+1 . . . Aj+i−1 Aj+i and Aj+k are independent events. For i = k − l + 1, k − l + 2, . . . , k − 1, P (Bi |Aj ) = di µl−(k−i) λ(rl )λ(rl−1 ) . . . λ(rl−(k−i−1) ) k−i−1 = di µl−k+i λ(rl−j ). j=0 Hence, dk = P (Aj+k |Aj ) − P (B1 ∪ B2 . . . ∪ Bk−1 |Aj ) k−l w l−w = p q − k−1 P (Bi |Aj ) − i=1 P (Bi |Aj ) i=k−l+1 k−l = pw q l−w − pw q l−w k−i−1 k−1 di µl−k+i di − i=1 j=0 i=k−l+1 k−1 Theorem 3.4. For k < l, dk = µl−k k−1 λ(rl−i ) − i=0 λ(rl−j ). k−i−1 di µl−k+i i=1 λ(rl−j ). j=0 Chapter 3: Success-failure runs 38 k−1 Proof. P (Aj+k |Aj ) = µl−k λ(rl−i ). i=0 Hence, dk = P (Aj+k |Aj ) − P (B1 ∪ B2 . . . ∪ Bk−1 |Aj ) k−1 = µl−k k−1 λ(rl−i ) − i=0 k−i−1 di µl−k+i i=1 λ(rl−j ). j=0 Chapter 4 Spaced Patterns Recall that a spaced pattern is denoted by a specified sequence on the set {1, * }, where the 1s correspond to the matching positions and 0s correspond to the ‘don’t care’ positions. The length l of a spaced pattern R is the sum of the number of 1s and *s in R. 
The number of 1s in R is called the weight of R, denoted by w. For example, if the pattern of interest is R = 1 ∗ ∗1, then l = 4, w = 2 and the sequence 1100011010001 contains exactly 1 occurrence of R at position 9. We may assume that a spaced pattern always starts and ends with 1. 39 Chapter 4: Spaced patterns 4.1 40 Probability of occurrence In this section, we introduce a method from [5] to derive the probability of occurrence, Qn , of a spaced pattern R through a set of recurrence relations. A simple recurrence relation between Qn and fn is first observed. Then fn is computed recursively through another set of two recurrence relations. Let R be a spaced pattern of length l and weight w. Recall that Qn is the probability that R occurs before or at position n and fn is the probability that R first occurs at position n in an infinite Bernoulli random sequence. Let An denote the event that R occurs at position n. The first l initial values of Qn and fn are given by Qn = fn = 0 for 1 ≤ n ≤ l − 1 and Ql = fl = pw . By the definition of Qn and fn , it is easy to see that, for n ≥ 2, we have Qn = Qn−1 + fn . (4.1.1) We will now compute fn as follows. Let C(R) = {x1 , x2 , · · · , xm } be the set of all distinct patterns obtained from R with the same matching positions. That is, the ‘don’t care’ positions ‘*’ are replaced by 0 or 1 and Chapter 4: Spaced patterns 41 m = 2l−w . For example, if R = 1 ∗ 1 ∗ 1, then C(R) = {10101, 11111, 11101, 10111}. Hence R occurs at position n if and only if there is an xi ∈ C(R) that occurs at position n. For 1 ≤ i ≤ m, let A(i) n denote the event that the pattern xi ∈ C(R) (i) occurs at position n. Note that the A(i) n ’s are disjoint. We also let fn be the probability that the pattern xi ∈ C(R) occurs at position n. Therefore we have fn(i) . fn = (4.1.2) 1≤i≤m Now, for 1 ≤ i ≤ m, fn(i) = P (A1 A2 · · · An−1 A(i) n ) (i) = P (A1 A2 · · · An−l An(i) \ ∪l−1 j=1 (A1 A2 · · · An−j−1 An−j An )) (k) m (i) = P (A1 A2 · · · An−l An(i) \ ∪l−1 j=1 (∪k=1 A1 A2 · · · An−j−1 An−j An )) (k) and in order that the joint event An−j A(i) n is non-empty, the first l − j positions of xi ∈ C(R) must be the same as the last l − j positions of xk ∈ C(R). Hence if we let xi [a, b] denote the pattern of xi ∈ C(R) from position a to position b, (k) then An−j A(i) n = ∅ if xk [j + 1, l] = xi [1, l − j]. Therefore we have that l−1 fn(i) = (1 − Qn−l )P (xi ) − ( (k) fn−j )P (xi [l − j + 1, l]) (4.1.3) j=1 k∈Φi,j where P (xi ) is the probability that the pattern xi occurs at position n and Φi,j = {k|xk [j + 1, l] = xi [1, l − j]}. Chapter 4: Spaced patterns 42 Equations (4.1.1), (4.1.2) and (4.1.3) enable us to compute Qn and fn recursively. For example, if R = 1 ∗ 1, then Q3 = f3 = p2 . We will compute Q4 using the above algorithm. Let C(R) = {x1 = 111, x2 = 101}. We have Φ1,1 = {k|xk [2, 3] = x1 [1, 2]} = {1}, Φ1,2 = {k|xk [3, 3] = x1 [1, 1]} = {1, 2}, Φ2,1 = {k|xk [2, 3] = x2 [1, 2]} = ∅, Φ2,2 = {k|xk [3, 3] = x2 [1, 1]} = {1, 2}. (1) (2) Then we can calculate f4 and f4 to be p3 −p4 and p2 −p3 respectively. Hence (1) (2) f4 = f4 + f4 = p2 − p4 and thus Q4 = Q3 + f4 = 2p2 − p4 . In [5], it was shown that, for R = 1 ∗ 1, we have Qn = qQn−1 + pq 2 Qn−3 + p2 q 2 Qn−4 . A very important question related to homology search is whether there exists a unique optimal spaced pattern that has the largest probablity of occurrence, Qn , for each fixed n, l and w, and any values of p. 
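Equations (4.1.1)-(4.1.3) translate directly into a short program. The sketch below is an illustrative implementation (it is not code from [5]); it computes Qn for a given spaced pattern and reproduces the worked example Q4 = 2p^2 − p^4 for R = 1*1.

import itertools

def expansions(R, p):
    """C(R): all binary patterns obtained by filling the *'s of R,
    together with their occurrence probabilities P(x_i)."""
    q = 1.0 - p
    stars = [i for i, c in enumerate(R) if c == '*']
    pats = []
    for bits in itertools.product('01', repeat=len(stars)):
        x = list(R)
        for pos, b in zip(stars, bits):
            x[pos] = b
        x = ''.join(x)
        prob = 1.0
        for c in x:
            prob *= p if c == '1' else q
        pats.append((x, prob))
    return pats

def prob_occurrence(R, p, n_max):
    """Q_n, n = 0..n_max, for the spaced pattern R via (4.1.1)-(4.1.3)."""
    l, q = len(R), 1.0 - p
    pats = expansions(R, p)
    m = len(pats)
    # Phi[i][j-1] = indices k with x_k[j+1..l] equal to x_i[1..l-j]
    Phi = [[[k for k in range(m) if pats[k][0][j:] == pats[i][0][:l - j]]
            for j in range(1, l)]
           for i in range(m)]
    def suffix_prob(x, j):
        """P(x[l-j+1..l]), the probability of the last j symbols of x."""
        prob = 1.0
        for c in x[l - j:]:
            prob *= p if c == '1' else q
        return prob
    Q = [0.0] * (n_max + 1)                       # Q_n = 0 for n < l
    f = [[0.0] * (n_max + 1) for _ in range(m)]   # f[i][n] = f_n^(i)
    for i in range(m):
        f[i][l] = pats[i][1]
    Q[l] = p ** R.count('1')
    for n in range(l + 1, n_max + 1):
        fn = 0.0
        for i in range(m):
            val = (1.0 - Q[n - l]) * pats[i][1]
            for j in range(1, l):
                val -= sum(f[k][n - j] for k in Phi[i][j - 1]) * suffix_prob(pats[i][0], j)
            f[i][n] = val
            fn += val
        Q[n] = Q[n - 1] + fn                      # equation (4.1.1)
    return Q

p = 0.5
Q = prob_occurrence("1*1", p, 10)
print(Q[4], 2 * p**2 - p**4)    # both 0.4375, matching the worked example above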
For fixed small values of w from 7 to 14, the optimal spaced pattern with the largest probablity of occurrence in a 64-bit random sequence are found in [5] and are listed in the table below. Chapter 4: Spaced patterns 43 w Optimal spaced pattern 7 11**1*1*111 8 11**1**1*1*111 9 11*11*1*1***111 10 11*11***11*1*111 11 111*1**1*1**11*111 12 111*1*11*1**11*111 13 111*11*11**1*1*1111 14 111*111**1*11**1*1111 Remark. The probabilty of occurrence of a spaced pattern R at any position is the sum of all the probability of occurrence of xi ∈ C(R) in that position. Each xi is a binary pattern, that is, a success-failure run. Hence the above method can be used to compute the probability of occurrence of a success-failure run before or at position n. Let x be a success-failure run. Let R be a spaced pattern obtained from x by replacing the 0s in x by *s and C(R) = {x = x1 , x2 , · · · , xm } as defined previously. We can compute Qn , the probability of occurrence of the successfailure run x before or at position n, with the following two equations Qn = Qn−1 + fn(1) (4.1.4) l−1 fn(i) = (1 − Qn−l )P (xi ) − ( j=1 k∈Φi,j with P (xi ) and Φi,j as defined previously. (k) fn−j )P (xi [l − j + 1, l]) (4.1.5) Chapter 4: Spaced patterns 44 Theorem 4.1. [5] Let R be a spaced pattern of length l. For any 2l−1 ≤ k ≤ n, we have 1. fn ≥ fk Qn−k+l−1 , 2. fn ≤ fk Qn−k . Proof. Let the infinite random sequence be S = s1 s2 · · · . Recall that An is the event that R occurs at position n of S. For 1 ≤ i ≤ l − 1, let Bi be the set of all binary patterns of length i. For any x ∈ Bi , let Ex denote the event that sk−l+2 sk−l+3 · · · sk−l+i+1 = x. Note that the event E0 ∪ E1 is the whole sample space and occurs with probability 1. For a pattern x with length less than l − i, we have Ex = Ex0 ∪ Ex1 with the two events Ex0 and Ex1 being disjoint. Let Ai,j = Ai Ai+1 · · · Aj . It was shown earlier in the proof of Theorem 1.4(2) that fn = P (Al Al+1,n ). By applying conditional probability, we have fn = P (Al Al+1,k Ak+1,n ) = P (Ex )P (Al Al+1,k Ak+1,n |Ex ) x∈Bl−1 = P (Ex )P (Al Al+1,k |Ex )P (Ak+1,n |Ex ). x∈Bl−1 On the other hand, fk Qn−k+l−1 = P (Al Al+1,k )P (Ak+1,n ) = P (Al Al+1,k |E0 ∪ E1 )P (Ak+1,n |E0 ∪ E1 ). Chapter 4: Spaced patterns 45 Hence we only need to prove that, for any 1 ≤ i ≤ l − 1, we have P (Ex )P (Al Al+1,k |Ex )P (Ak+1,n |Ex ) ≥ x∈Bi P (Ex )P (Al Al+1,k |Ex )P (Ak+1,n |Ex ). x∈Bi−1 Now, since Ex is the disjoint union of Ex0 and Ex1 , for any x ∈ Bi−1 , we have P (Al Al+1,k |Ex )P (Ak+1,n |Ex ) = (pP (Al Al+1,k |Ex1 ) + qP (Al Al+1,k |Ex0 ))(pP (Ak+1,n |Ex1 ) + qP (Ak+1,n |Ex0 )). By definition, P (Al Al+1,k |Ex1 ) ≤ P (Al Al+1,k |Ex0 ) and P (Ak+1,n |Ex1 ) ≤ P (Ak+1,n |Ex0 ). Applying Chebyshev’s inequality to the above, we get that P (Ex )P (Al Al+1,k |Ex )P (Ak+1,n |Ex ) ≤ P (Ex )(pP (Al Al+1,k |Ex1 )P (Ak+1,n |Ex1 ) + qP (Al Al+1,k |Ex0 )P (Ak+1,n |Ex0 )) = P (Ex1 )P (Al Al+1,k |Ex1 )P (Ak+1,n |Ex1 ) + P (Ex0 )P (Al Al+1,k |Ex0 )P (Ak+1,n |Ex0 ). For x ∈ Bi−1 , Bi = {x0, x1}, hence by taking summation on both sides, we get that P (Ex )P (Al Al+1,k |Ex )P (Ak+1,n |Ex ) ≥ x∈Bi P (Ex )P (Al Al+1,k |Ex )P (Ak+1,n |Ex ) x∈Bi−1 which proves the result. The proof of the second inequality is the same as the proof for Theorem 2.4(2). Theorem 4.2. [5] Let R be a spaced pattern of length l. For any 2l−1 ≤ k ≤ n, we have Chapter 4: Spaced patterns 46 1. Qn ≥ Qk Qn−k+l−1 , 2. Qn ≤ Qk Qn−k . ∞ Proof. 
Since fi = Qn , we have that i=n+1 ∞ Qk Qn−k+l−1 = fi Qn−k+l−1 i=k+1 ∞ ≤ fn−k+i i=k+1 = Qn The proof of the second inequality is the same as the proof for Theorem 2.3(2). Chapter 4: Spaced patterns 4.2 47 ¯n Approximating Q ¯ n when n becomes very large. In this section, we will give an approximation of Q In particular, we will state an asymptotic result from [3] and give a proof for a special case of the stated theorem. Theorem 4.3. [3] There exist β > 0 and λ > 0 such that Qn → βλn as n → ∞. The proof of Theorem 4.3 can be found in [3]. Note that both β and λ depends on a given spaced pattern R. It is clear that a pattern of consecutive ones is a special case of a spaced pattern. We now give a proof of a special case of Theorem 4.3. More specifically, we will prove Theorem 4.3 when R is restricted to a pattern of consecutive ones. From Chapter 2, we know that fn satisfies the linear homogeneous recurrence relation fn = qfn−1 + qpfn−2 + qp2 fn−3 + · · · + qpl−1 fn−l . (4.2.1) Therefore, we can obtain a solution for fn in terms of the roots of the characteristic equation of the recurrence relation (4.2.1) given by the polynomial xl − qxl−1 − qpxl−2 − · · · − qpl−1 = 0. (4.2.2) This solution of fn is given in the following theorem. l Theorem 4.4. Let r1 , r2 , · · · , rl be the roots of (4.2.2) and let g(x) = (x−ri ). i=1 l Then fn = i=1 qpl g(1) rn−l−1 . (1 − ri )g (ri ) i Chapter 4: Spaced patterns 48 Proof. Suppose that the roots of (4.2.2) are distinct and given by r1 , r2 , · · · , rl . The solution to the recurrence relation (4.2.1) is given by fn = A1 r1n + A2 r2n + · · · + Al rln where the constants A1 , A2 , · · · , Al are to be determined. Note that the l initial values are given by fl+i = qpl for 1 ≤ i ≤ l. Hence, to solve for the Ai ’s, we have to solve the following system of l equations with l unknowns: qpl = A1 r1l+i + A2 r2l+i + · · · + Al rll+i for 1 ≤ i ≤ l. The above system can be written in the form BX = C where B is the coefficient matrix and C is the column vector (qpl , qpl , · · · , qpl )T . Let Bi be the matrix obtained from B by replacing the ith column of B by the column vector C. Then by Cramer’s rule, Ai = det(Bi )/det(B). The determinant of B is given by r1l+1 r2l+1 · · · rll+1 r1l+2 r2l+2 · · · rll+2 .. .. .. .. . . . . r1l+l r2l+l · · · = r1l+1 r2l+1 · · · rll+1 rll+l 1 1 ··· 1 r1 .. . r2 .. . ··· .. . rl .. . r1l−1 r2l−1 · · · = r1l+1 r2l+1 · · · rll+1 (rj − rk ). j>k rll−1 Chapter 4: Spaced patterns 49 The determinant of Bi is given by r1l+1 · · · qpl · · · rll+1 r1l+2 · · · qpl · · · rll+2 .. .. . . . .. . . . . . . r1l+l · · · qpl · · · l+1 l+1 ri+1 · · · rll+1 = qpl r1l+1 · · · ri−1 rll+l 1 ··· 1 ··· 1 r1 .. . ··· 1 ··· . . . .. . . . . rl .. . r1l−1 · · · l+1 l+1 = qpl r1l+1 · · · ri−1 ri+1 · · · rll+1 (1 − rk ) k≤i−1 rll−1 1 ··· (rk − 1) k≥i+1 (rj − rk ). i=kk rk ) respectively. 1 1 ··· rll−1 (1−rk ) k≤i−1 are both Van- (rk −1) k≥i+1 (rj − i=kk (rj − rk ) qpl k≤i−1 (1 ril+1 k≤i−1 (ri k≥i+1 (rk − rk ) − rk ) k≥i+1 (rk qpl = l+1 ri (1 − rk ) (1 − rk ) (ri − rk ) k≥i+1 (ri − rk ) k≤i−1 = = = qp l k≤i−1 (1 ril+1 − rk ) k≥i+1 (1 k=i (ri i=k . In this l+1 l+1 case f (1/p) = q(l + 1) > 1 and hence 1/p is the larger root, that is, x < 1/p. To find x iteratively, note that if x < 1/p and f (x) > x, then x > x and hence such a value of x is a lower bound for x . However, since f (x) is increasing, x < x implies that f (x) < f (x ). On the other hand, f (x) > x and thus the value f (x) is a better approximation to x than x. 
In this way, we can improve Chapter 4: Spaced patterns 53 any given approximation. For example, starting with the approximation x0 = 1, we get x1 = f (x0 ), x2 = f (x1 ) and so on. In this way, we obtain a sequence xk defined by xk+1 = f (xk ). This sequence increases monotonically and the limit is a root of the equation f (x) = x. For example, let l = 2 and p = q = 0.5. Using the iterative process above, we obtain x to be 1.236068(correct to 6 decimal places). For n up to 15, the exact values of Qn and the corresponding approximations are summarised in the next table. Chapter 4: Spaced patterns 54 n Qn exact Approximation 2 0.75 0.76631189 3 0.625 0.61995933 4 0.5 0.50155762 5 0.40625 0.40576863 6 0.328125 0.32827371 7 0.265625 0.26557901 8 0.21484375 0.21485793 9 0.173828125 0.17382371 10 0.140625 0.14062633 11 0.113769531 0.11376909 12 0.092041015 0.09204112 0.07446289 0.07446283 14 0.060241699 0.06024169 15 0.048736572 0.04873655 13 Chapter 4: Spaced patterns 4.3 55 Bounds on β and λ In the previous section, we discussed an asmptotic result from [3] which states that there exist β > 0 and λ > 0 such that Qn = (1 + n )βλ n , where n → 0 as n → ∞. By using the inequalities in Theorem 4.2, we can obtain bounds on λ and β as shown in the next two propositions. Proposition 4.3.1. [6] 1 ≤ β ≤ λ−(l−1) . Proof. Using inequality (2) of Theorem 4.2 with n replaced by 2n and k replaced by n, we have that 2 Q2n ≤ Qn . Hence (1 + 2n )βλ 2n ≤ (1 + n) 2 2 2n β λ which implies that β ≥ 1. Also, since Qn ≥ Qn Ql−1 as Ql−1 ≤ 1, we get that (1 + n )βλ n ≥ (1 + n )βλ n (1 + l−1 )βλ l−1 which implies that β ≤ λ−(l−1) . Proposition 4.3.2. [6] (Q2l−1 )1/l ≤ λ ≤ (Q2l−1 )1/(2l−1) . Proof. Using inequality (2) of Theorem 4.2 repeatedly starting with n replaced by (2l − 1)n and k replaced by 2l − 1, we have that Q(2l−1)n ≤ (Q2l−1 )n . Chapter 4: Spaced patterns 56 Hence Q2l−1 ≥ (Q(2l−1)n )1/n = (β(1 + Since β(1 + (2l−1)n ) (2l−1)n )) 1/n 2l−1 λ . is bounded away from 0 and ∞, (β(1 + 1/n (2l−1)n )) → 1 as n → ∞. Therefore, by letting n → ∞, we have the upper bound for λ, that is, λ ≤ (Q2l−1 )1/(2l−1) . To prove the lower bound, we use inequality (1) of Theorem 4.2 repeatedly starting with n replaced by nl and k replaced by 2l − 1, hence we get Qnl ≥ (Q2l−1 )n−2 Q2l . Therefore (1 + nl )βλ nl ≥ (Q2l−1 )n−2 (1 + 2l )βλ 2l , and we obtain λ ≥ (Q2l−1 )1/l . Conclusion We have derived a set of recurrence relations for computing the probability of occurrence of a spaced pattern and presented some asymptotic results for approximating Qn . However, many questions on spaced patterns remain open. The following are two such problems that remain unsolved: 1. Is the probability of occurrence Qn of a spaced pattern R polynomial-time computable in terms of n, p, and the length l and weight w of R? Chapter 4: Spaced patterns 57 2. For fixed l and w, if R is an optimal spaced pattern among all spaced patterns of length l and weight w for a particular 0 < p < 1 and for a large n , will R remain optimal for all p and for all n ≥ n ? Bibliography [1] N. Balakrishnan and M. V. Koutras. Runs and Scans with Applications, John Wiley & Sons, 2002. [2] James Bradley. Distribution-Free Statistical Tests, Englewood Cliffs, N.J, Prentice-Hall, 1968. [3] J. Buhler, U. Keich and Y. Sun. Designing Seeds for Similarity Search in Genomic DNA, RECOMB, 2003. [4] M.T.Chao, J.C.Fu and M.V.Koutras. A survey of the reliability studies of consecutive-k-out-of-n: F systems and its related systems, IEEE Transactions on Reliability, 44, (1995), 120-127. 
[5] K. P. Choi and L. X. Zhang. Sensitivity Analysis and Efficient Method for Identifying Optimal Spaced Seeds, to be published in Journal of Computer and System Sciences.
[6] K. P. Choi and L. X. Zhang, unpublished paper.
[7] W. J. Ewens and G. R. Grant. Statistical Methods in Bioinformatics: An Introduction, Springer Verlag, 2001.
[8] William Feller. An Introduction to Probability Theory and its Applications, Vol. I, 3rd ed., N.Y., John Wiley, 1968.
[9] J. C. Fu and M. V. Koutras. Distribution Theory of Runs: A Markov Chain Approach, Journal of the American Statistical Association, 89, 1994.
[10] Jean D. Gibbons. Nonparametric Statistical Inference, N.Y., McGraw-Hill, 1971.
[11] A. P. Godbole and S. G. Papastavridis. Runs and Patterns in Probability: Selected Papers, Kluwer Academic Publishers, 1994.
[12] W. Knight. A run-like statistic for ecological transects, Biometrics, 30, (1974), 553-555.
[13] B. Ma, J. Tromp and M. Li. PatternHunter: faster and more sensitive homology search, Bioinformatics, 18, 2002.
[14] F. Mosteller. Note on an Application of Runs to Quality Control Charts, Annals of Mathematical Statistics, 12, 1948.
[15] S. J. Schwager. Run Probabilities in Sequences of Markov-Dependent Trials, Journal of the American Statistical Association, 78, 1983.
[16] A. D. Solov'ev. A combinatorial identity and its application to the problem concerning the first occurrences of a rare event, Theory of Probability and Applications, 11, 1966.