Scan Statistics of Rate Function of Scores on Poisson Point Processes

Yu Xiaojiang (B.Sc. USTC)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF STATISTICS AND APPLIED PROBABILITY
NATIONAL UNIVERSITY OF SINGAPORE
2011

Acknowledgements

I would like to take this opportunity to thank my supervisor, Assoc. Professor Chan Hock Peng. His useful comments, suggestions and revisions have led to significant improvements in the presentation of this thesis. I have benefited a lot from his beautiful way of thinking. I would also like to thank my friends Jiang Binyan, Liang Xuehua, Jiang Xiaojun, Zhu Yongting and Long Yun for their help in completing this thesis. Finally, I would like to thank Ms Tay Ket Ling, Ms Su Kyi Win and all the other department staff for their help during my graduate academic career.

Contents

Acknowledgements
Summary
List of Tables
List of Figures
1 Introduction
  1.1 Literature Review: Applications
  1.2 Literature Review: Probabilistic Technique
  1.3 Organization of this Thesis
2 Technical Backgrounds
  2.1 Compound Poisson Processes
  2.2 The Overshoot Constant of Compound Poisson Processes
3 Theoretical Results
  3.1 Theoretical Results and Applications in Chan and Zhang (2007)
  3.2 The Theoretical Result in Chan (2009)
4 Examples and Numerical Studies
5 Proof of Theorem 3.1
6 Conclusions
Bibliography
Appendix

Summary

Let $X_1, X_2, \ldots$ be independent and identically distributed (i.i.d.) random variables with positive mean. Each random variable $X_i$ is associated with a random time $t_i$, and $t_1, t_2, \ldots$ are distributed according to a Poisson point process on $[0, 1]$. We would like to detect unusual behavior in a segment of the interval $[0, 1]$.
For each $0 < x < y < 1$, we compute a score $S(x, y)$ which is large when there is unusual behavior in the interval $[x, y]$. Since $x$ and $y$ are unknown, we consider the maximum of these scores over $0 < x < y < 1$, known in the statistical literature as a scan statistic. We derive formulas for approximating the tail probabilities of these scan statistics and check them through numerical computations and simulation exercises.

Keywords: Change of Measure, Large Deviation, Marked Poisson Process, Scan Statistics.

List of Tables

3.1 Estimation of p ± s.e. with F degenerate at 1.
3.2 Summary of information for the scan statistics of three viral genomes.
4.1 Monte Carlo simulations and analytical p-values.
4.2 Monte Carlo simulations and upper bounds by (4.2).

List of Figures

3.1 Weighted and unweighted scan statistics for 3 viral genomes.
4.1 Plots of $\nu_a$ against $a$ when $c = 0.01$ and $c = 0.1$.
4.2 Plots of $\nu_a$ against $a$ when $c = 1$ and $c = 10$.
4.3 Plots of $\nu_a$ against $\theta_a$ when $c = 0.01$ and $c = 0.1$.
4.4 Plots of $\nu_a$ against $\theta_a$ when $c = 1$ and $c = 10$.
4.5 The tail probability approximation.

Chapter 1 Introduction

Joseph Naus published his now classical paper on scan statistics in 1965, which originated the modern work in this field. The study of scan statistics is currently a very active area and is applied in many different fields such as infectious disease epidemiology, brain imaging, astronomy, geology, neurological diseases, rheumatology, parasitology, demography, forestry, toxicology, psychology and telecommunication.

Suppose we have a number of points randomly located within a square region. Naus (1965) originally used a rectangular scanning window with a fixed size and shape.
The window is moved over the predetermined square region to cover all possible locations. The scan statistic is the maximum number of points captured by the scanning window at any given time. The next step is to find the probability of observing at least that many points within some window, under the null hypothesis that those points are generated by a homogeneous Poisson process. The complexity of this problem lies in the multiple comparisons effect from maximizing over all possible windows and in the overlapping nature of those windows. Naus (1965) developed analytical formulas to obtain upper and lower bounds for those probabilities.

Following Naus's pioneering work, there have been many further methodological developments on scan statistics. The study region to be scanned may be of different shapes, and the scanning window may be of different sizes and shapes. As mentioned above, Naus (1965) used rectangular windows of any fixed shape and size, while Loader (1991) used rectangular windows of variable sizes. Alm (1997) used circles, ellipses and triangles. Kulldorff (1997) considered circles of variable sizes. More recently, nonparametric methods have been used to take into account windows of irregular shapes [see Duczmal and Assunção (2004)].

Instead of defining the null hypothesis based on a homogeneous Poisson process, we can also consider the null hypothesis in the context of inhomogeneous Poisson processes [see Turnbull et al. (1990)]. For example, urban areas, with their higher population density, are expected to have more infectious disease cases per geographical unit than rural areas. Hence inhomogeneous Poisson processes are commonly encountered in applications. The observations may also be generated by compound Poisson processes with normal or exponential observations. Kulldorff, Huang and Konty (2008) developed scan statistics for normal and survival type data.

1.1 Literature Review: Applications
Scan statistics have been applied to infectious disease epidemiology. In infectious disease surveillance, scan statistics are used to detect geographical areas with clusters of the disease. The clusters can be either temporary, due to an outbreak, or long-lasting, if the population in the area is especially prone to infection. Different aspects of the infectious disease also influence the appropriate choice of scan statistic parameters. The incubation time of the disease, for example, is a very important factor in the selection of the scanning time window length. Cousens et al. (2001) investigated 84 cases of variant Creutzfeldt-Jakob disease (vCJD), a rare and fatal disease caused by the same transmissible agent as in mad cow disease. The consumption of beef products was therefore considered an important factor. The scan statistic detected a significant cluster with five cases. A subsequent investigation revealed a local butcher shop to be the possible source of infection.

Scan statistics have also been applied to important problems in brain imaging. Naiman and Priebe (2001) applied them to study positron emission tomography (PET) scan brain imagery data. Yoshida, Naya and Miyashita (2003) used them to analyze neural response data in monkeys. By injecting retrograde tracers in specific regions (cases) and adjacent regions (controls) in the brain, maps with pixels associated with selective and non-selective neurons were generated. Significant clusters of selective neurons were found.

Marcos and Marcos (2008) used the two-dimensional scan statistic to study spatial clustering of 'open star clusters', which are physically associated groups of stars held together by mutual gravitational attraction. The study regions were defined by galactic longitude as the first dimension and either radial velocity, proper motion or inclination as the second dimension, resulting in three different analyses. A number of statistically significant clusters were found.
For specific applications, the scan statistic parameters and probabilistic models should be appropriately selected and adapted to fit the data and the scientific questions asked.

1.2 Literature Review: Probabilistic Technique

Methods for approximating the tail distribution of the maximum of Gaussian random fields have been developed by Pickands (1969) and Qualls and Watanabe (1973). There are two key steps. First, the tail probability of crossing a high level concentrates on a small neighborhood of the subset of the indexing set where the marginal probability of crossing the level is maximal. The second step is to break the subset into small pieces which contribute disjointly to the total probability. The probability approximation is then derived by adding the contributions of each small piece. However, the approximation involves constants which cannot be computed directly. This methodology inspired the following three papers.

Hogan and Siegmund (1986) considered large deviations for the maxima of random fields which are closely related to simple one-dimensional processes such as the random walk, Brownian motion and Brownian bridge. They applied the techniques of Pickands, Qualls and Watanabe and derived explicit asymptotic approximations for the tail probability of the distribution of the maximum by introducing the overshoot constants into the formulas.

Chan and Zhang (2007) examined scan statistics for one-dimensional marked Poisson processes. The scan statistics were defined as the maximum weighted count of event occurrences within a window of fixed width which is moved within an observed interval. They derived analytical formulas and an importance sampling method for approximating the tail probabilities of scan statistics. They also illustrated the application of their p-value approximations in computational biology.

Chan (2009) further examined the tail probabilities of moving sums in a marked Poisson random field.
These sums were derived by adding up the weighted occurrences of events within a scanning set of fixed size and shape. He also provided an alternative representation of the constants of the asymptotic formulae in terms of the occupation measure of the conditional local random field at zero, which was further extended to the constants in the asymptotic tail probabilities of Gaussian random fields. These new formulas are useful for deriving bounds of the tail probabilities of scan statistics.

1.3 Organization of this Thesis

In Chapter 2 we introduce the limiting distribution of the overshoot of a random walk over a constant boundary. Then we define the overshoot constant related to compound Poisson processes. Several examples are provided to illustrate the computation. The main theoretical result of this thesis is given by Theorem 3.1 in Chapter 3, which provides an asymptotic tail probability of scan statistics of the rate function of scores on a marked Poisson point process. A corollary is also presented to provide a simple upper bound for the tail probability of scan statistics. As a comparison, we introduce the theoretical results in Chan and Zhang (2007) and Chan (2009). Then in Chapter 4 we illustrate applications of the p-value approximation and upper bound with specific examples and numerical studies. The proof of Theorem 3.1 is given in Chapter 5. Finally, we provide conclusions and discussions in Chapter 6 and the related R code in the Appendix.

Chapter 2 Technical Backgrounds

The computation of the overshoot constant is important in queuing theory, risk insurance, engineering systems, sequential testing and change-point detection. In this chapter, we first introduce some theory on the limiting distribution of the overshoot of a random walk over a constant boundary. Then we define the overshoot constant related to compound Poisson processes. The results are illustrated with specific examples.

2.1 Compound Poisson Processes

Let $Y_1, Y_2, \ldots$ be i.i.d.
random variables with mean $\mu > 0$. We say that $Y_1$ is arithmetic if there exists $d > 0$ such that $P(Y_1 \in d\mathbb{Z}) = 1$. The largest $d$ with this property is the span of $Y_1$. Otherwise, we say that $Y_1$ is nonarithmetic.

Let $S_n = Y_1 + \cdots + Y_n$ and $S_n^+ = \max(S_n, 0)$. For $b > 0$, define
$$\tau = \tau(b) = \inf\{n : n \ge 1, S_n \ge b\}, \qquad \tau_+ = \inf\{n : n \ge 1, S_n > 0\}.$$
Then by renewal theory, $S_\tau - b$ has a limiting distribution. More specifically, we have the following results, given in Siegmund (1985), page 171.

Theorem 2.1. Assume $\mu < \infty$. If $Y_1$ is nonarithmetic, then
$$\lim_{b \to \infty} P(S_\tau - b > x) = (E S_{\tau_+})^{-1} \int_x^\infty P(S_{\tau_+} > y)\,dy.$$
If $Y_1$ is arithmetic with span $d$, then as $b \to \infty$ through multiples of $d$,
$$\lim_{b \to \infty} P(S_\tau - b = jd) = d\,(E S_{\tau_+})^{-1} P(S_{\tau_+} > jd).$$

In addition, by the theory of ladder variables, we have the following theorem, given in Siegmund (1985), page 175.

Theorem 2.2. If $Y_1$ is nonarithmetic, then
$$\lim_{b \to \infty} E \exp[-(S_\tau - b)] = \mu^{-1} \exp\Big\{-\sum_{n=1}^\infty n^{-1} E e^{-S_n^+}\Big\}.$$

We illustrate this theorem with the following example.

Example 2.1. Let $Y_1 \sim N(\mu, 2\mu)$. Then $S_n \sim N(n\mu, 2n\mu)$, and
$$E e^{-S_n^+} = P(S_n < 0) + \int_0^\infty e^{-x} P(S_n \in dx) = \Phi\Big(-\sqrt{\tfrac{n\mu}{2}}\Big) + \int_0^\infty e^{-x} \frac{1}{2\sqrt{\pi n \mu}} \exp\Big[-\frac{(x - n\mu)^2}{4n\mu}\Big] dx = 2\Phi\Big(-\sqrt{\tfrac{n\mu}{2}}\Big). \tag{2.1}$$
Hence by Theorem 2.2 and (2.1),
$$\lim_{b \to \infty} E \exp[-(S_\tau - b)] = \mu^{-1} \exp\Big\{-2 \sum_{n=1}^\infty n^{-1} \Phi\Big(-\sqrt{\tfrac{n\mu}{2}}\Big)\Big\}.$$

2.2 The Overshoot Constant of Compound Poisson Processes

We next consider the overshoot constants that are at the heart of the approximations in this thesis [see Chapter 3 for the approximations]. The reader can return to this subsection after looking at the approximations in Chapter 3 and the proof of Theorem 3.1 in Chapter 5.

Let $\{N(t), t \ge 0\}$ be a Poisson process with rate $\lambda > 0$. Let $\{X_i, i = 1, 2, \ldots\}$ be i.i.d. random variables with cumulative distribution function $F$ such that $\rho := E X_1 > 0$, and let $\{N(t), t \ge 0\}$ be independent of $\{X_i, i = 1, 2, \ldots\}$. We say that
$$R(t) = \sum_{i=1}^{N(t)} X_i \qquad (R(t) = 0 \text{ if } N(t) = 0)$$
is a compound Poisson process with rate $\lambda$ and mark distribution $F$.
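The definition of $R(t)$ can be illustrated with a short simulation. The following is only a minimal sketch in Python (the code used for this thesis is the R code in the Appendix); the function names `simulate_marked_poisson` and `compound_value` and the `mark_sampler` argument are hypothetical and introduced purely for illustration.

```python
import random


def simulate_marked_poisson(rate, mark_sampler, horizon=1.0, rng=None):
    """Simulate arrival times and marks of a marked Poisson process on [0, horizon].

    Interarrival times are i.i.d. exponential with the given rate; each arrival
    receives an independent mark drawn by mark_sampler(rng). All names here are
    illustrative assumptions, not part of the thesis.
    """
    rng = rng or random.Random()
    times, marks = [], []
    t = rng.expovariate(rate)          # first arrival time u_1
    while t <= horizon:
        times.append(t)
        marks.append(mark_sampler(rng))
        t += rng.expovariate(rate)     # add the next interarrival time
    return times, marks


def compound_value(times, marks, t):
    """R(t): sum of the marks of all arrivals up to time t (0 if there are none)."""
    return sum(x for s, x in zip(times, marks) if s <= t)


if __name__ == "__main__":
    rng = random.Random(2011)
    # Exponential marks with parameter 1, as in Example 2.2.
    times, marks = simulate_marked_poisson(100.0, lambda r: r.expovariate(1.0), rng=rng)
    print(len(times), compound_value(times, marks, 1.0))
```

With unit marks (`mark_sampler=lambda r: 1.0`), `compound_value(times, marks, t)` reduces to the count $N(t)$, which is the degenerate case treated later in this section.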
Without loss of generality, we restrict $t \in [0, 1]$ and randomly choose the $n$ interarrival times $u_1, \ldots, u_n$ according to $\{N(t), t \in [0, 1]\}$. Hence $X_i$ is associated with the arrival time $t_i = u_1 + \cdots + u_i$, and $u_1, \ldots, u_n$ are independent of $X_1, \ldots, X_n$. Let
$$K(\theta) = E e^{\theta X_1}, \qquad \psi(\theta) = K(\theta) - 1.$$
We define the rate function to be
$$\phi(z) = \begin{cases} \sup_\theta\,[\theta z - \psi(\theta)] & \text{if } z \ge \rho, \\ 0 & \text{if } z < \rho, \end{cases}$$
and the segmental score in $[x, y]$ to be
$$S(x, y) = \lambda (y - x)\, \phi\Big(\frac{N(x, y)}{\lambda (y - x)}\Big), \qquad \text{where } N(x, y) = \sum_{i : x \le t_i \le y} X_i.$$

Assume that $K(\theta)$ is finite for some $\theta > 0$. For any $z > \rho$, choose $\theta_z > 0$ to be the root of the equation
$$\frac{d}{d\theta}[\theta z - \psi(\theta)] = z - \psi'(\theta) = 0.$$
Then $\phi(z) = \theta_z z - \psi(\theta_z)$, and $\phi'(z) = \theta_z$ if $z > \rho$. For given $c > 0$ and varying $a \in [0, 1]$, choose $z_a$ to be the root of the equation $\phi(z_a) = c/a$, and let $\theta_a = \phi'(z_a)$. Then
$$\phi(z_a) = \theta_a z_a - \psi(\theta_a) = c/a, \qquad \psi'(\theta_a) = K'(\theta_a) = z_a. \tag{2.2}$$
Applying a Taylor expansion around $z_a$,
$$S(x, y) = \lambda (y - x)\, \phi\Big(\frac{N(x, y)}{\lambda (y - x)}\Big) = \lambda (y - x) \Big\{\phi(z_a) + \theta_a \Big[\frac{N(x, y)}{\lambda (y - x)} - z_a\Big] + O\Big(\Big[\frac{N(x, y)}{\lambda (y - x)} - z_a\Big]^2\Big)\Big\}$$
$$= \theta_a N(x, y) - \lambda (y - x) \psi(\theta_a) + \lambda (y - x)\, O\Big(\Big[\frac{N(x, y)}{\lambda (y - x)} - z_a\Big]^2\Big).$$
We are interested in the linear part of this approximation, i.e. $\theta_a N(x, y) - \lambda (y - x) \psi(\theta_a)$, and will therefore transform the compound Poisson process. Embed $F$ in an exponential family of distributions $\{F_{\theta_a}\}$ with
$$\frac{F_{\theta_a}(dy)}{F(dy)} = \frac{e^{\theta_a y}}{K(\theta_a)}.$$
Define a new probability measure $Q_{\lambda_0}$ under which $R(t)$ is a compound Poisson process with rate $\lambda_0 = K(\theta_a)$ and mark distribution $F_{\theta_a}$. Let $Y_i = \theta_a X_i - u_i \psi(\theta_a)$ and $S_n = Y_1 + \cdots + Y_n$. Then by (2.2),
$$\mu = E_{Q_{\lambda_0}} Y_1 = \frac{\theta_a K'(\theta_a)}{K(\theta_a)} - \frac{\psi(\theta_a)}{K(\theta_a)} = \frac{c}{a \lambda_0} > 0.$$
Hence, by Theorem 2.2, the existence of the overshoot constant
$$\nu_a = \lim_{b \to \infty} E_{Q_{\lambda_0}} \exp[-(S_\tau - b)] = \mu^{-1} \exp\Big\{-\sum_{n=1}^\infty n^{-1} E_{Q_{\lambda_0}} e^{-S_n^+}\Big\} \tag{2.3}$$
is assured.

We first give the following example to illustrate the computation of the overshoot constant for a degenerate compound Poisson process.

Example 2.1. Let $X_1 \equiv 1$ with probability 1. Then $K(\theta) = e^\theta$, $\psi(\theta) = e^\theta - 1$ and $\phi(z) = z \log z - z + 1$.
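Choose $z_a > 1$ satisfying $\phi(z_a) = c/a$. Then
$$\theta_a = \phi'(z_a) = \log z_a, \qquad \lambda_0 = K(\theta_a) = e^{\theta_a} = z_a.$$
Under $Q_{\lambda_0}$ we also have $X_1 \equiv 1$ with probability 1, and $\mu = E_{Q_{\lambda_0}} Y_1 = \frac{c}{a z_a}$. Then the transformed compound Poisson process is
$$S_n = \theta_a n - t_n (\lambda_0 - 1), \qquad t_n = \sum_{i=1}^n u_i \sim \Gamma(\lambda_0, n),$$
where $\Gamma(\lambda_0, n)$ is the gamma distribution with rate $\lambda_0$ and shape $n$. Hence,
$$E_{Q_{\lambda_0}} e^{-S_n^+} = Q_{\lambda_0}\{S_n \le 0\} + \int_0^\infty e^{-x} Q_{\lambda_0}\{S_n \in dx\} = 1 - \chi^2_{2n}\Big(\frac{2 \lambda_0 \theta_a n}{\lambda_0 - 1}\Big) + \int_0^\infty e^{-x} Q_{\lambda_0}\{S_n \in dx\},$$
where
$$\chi^2_{2n}(x) = \int_0^x \frac{1}{2^n (n-1)!} z^{n-1} e^{-z/2}\,dz.$$
Let $y = \frac{\theta_a n - x}{\lambda_0 - 1}$ and $t = 2y$. Then
$$E_{Q_{\lambda_0}} e^{-S_n^+} = 1 - \chi^2_{2n}\Big(\frac{2 \lambda_0 \theta_a n}{\lambda_0 - 1}\Big) + \int_0^{\frac{\theta_a n}{\lambda_0 - 1}} e^{-\theta_a n} \frac{\lambda_0^n y^{n-1} e^{-y}}{(n-1)!}\,dy$$
$$= 1 - \chi^2_{2n}\Big(\frac{2 \lambda_0 \theta_a n}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a n} \int_0^{\frac{2 \theta_a n}{\lambda_0 - 1}} \frac{1}{2^n (n-1)!} t^{n-1} e^{-t/2}\,dt$$
$$= 1 - \chi^2_{2n}\Big(\frac{2 \lambda_0 \theta_a n}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a n} \chi^2_{2n}\Big(\frac{2 \theta_a n}{\lambda_0 - 1}\Big).$$
Note that $\lambda_0 = e^{\theta_a} = z_a$. Hence,
$$E_{Q_{\lambda_0}} e^{-S_n^+} = 1 - \chi^2_{2n}\Big(\frac{2 z_a \theta_a n}{z_a - 1}\Big) + \chi^2_{2n}\Big(\frac{2 \theta_a n}{z_a - 1}\Big).$$
Then by (2.3),
$$\nu_a = \frac{a z_a}{c} \exp\Big\{-\sum_{n=1}^\infty n^{-1} \Big[1 - \chi^2_{2n}\Big(\frac{2 z_a \theta_a n}{z_a - 1}\Big) + \chi^2_{2n}\Big(\frac{2 \theta_a n}{z_a - 1}\Big)\Big]\Big\}. \tag{2.4}$$
The analytical formula (2.4) for $\nu_a$ is used in the numerical studies of Chapter 4.

We next show the computation of the overshoot constant for a compound Poisson process with an exponential mark distribution.

Example 2.2. Let $X_1$ be an exponential random variable with parameter 1. Then
$$K(\theta) = \frac{1}{1 - \theta}, \qquad \psi(\theta) = \frac{\theta}{1 - \theta} \qquad \text{and} \qquad \phi(z) = z - 2\sqrt{z} + 1.$$
Choose $z_a$ satisfying $\phi(z_a) = c/a$. Then
$$\theta_a = \phi'(z_a) = 1 - \frac{1}{\sqrt{z_a}}, \qquad \lambda_0 = K(\theta_a) = \sqrt{z_a}.$$
Under $Q_{\lambda_0}$, $X_1$ is still an exponential random variable, with parameter $1 - \theta_a$, and $\mu = E_{Q_{\lambda_0}} Y_1 = \frac{c}{a \lambda_0}$. Then the transformed compound Poisson process is
$$S_n = \theta_a \sum_{i=1}^n X_i - t_n (\lambda_0 - 1), \qquad t_n = \sum_{i=1}^n u_i \sim \Gamma(\lambda_0, n).$$
Given that $\sum_{i=1}^n X_i = k$, we have $S_n = \theta_a k - t_n (\lambda_0 - 1)$. This reduces the computation to that of Example 2.1:
$$E_{Q_{\lambda_0}}\Big[e^{-S_n^+} \,\Big|\, \sum_{i=1}^n X_i = k\Big] = Q_{\lambda_0}\Big\{S_n \le 0 \,\Big|\, \sum_{i=1}^n X_i = k\Big\} + \int_0^\infty e^{-x} Q_{\lambda_0}\Big\{S_n \in dx \,\Big|\, \sum_{i=1}^n X_i = k\Big\}$$
$$= 1 - \chi^2_{2n}\Big(\frac{2 \lambda_0 \theta_a k}{\lambda_0 - 1}\Big) + \int_0^\infty e^{-x} Q_{\lambda_0}\Big\{S_n \in dx \,\Big|\, \sum_{i=1}^n X_i = k\Big\}.$$
Let $y = \frac{\theta_a k - x}{\lambda_0 - 1}$ and $t = 2y$. Then
$$E_{Q_{\lambda_0}}\Big[e^{-S_n^+} \,\Big|\, \sum_{i=1}^n X_i = k\Big] = 1 - \chi^2_{2n}\Big(\frac{2 \lambda_0 \theta_a k}{\lambda_0 - 1}\Big) + \int_0^{\frac{\theta_a k}{\lambda_0 - 1}} e^{-\theta_a k} \frac{\lambda_0^n y^{n-1} e^{-y}}{(n-1)!}\,dy$$
$$= 1 - \chi^2_{2n}\Big(\frac{2 \lambda_0 \theta_a k}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a k} \int_0^{\frac{2 \theta_a k}{\lambda_0 - 1}} \frac{1}{2^n (n-1)!} t^{n-1} e^{-t/2}\,dt$$
$$= 1 - \chi^2_{2n}\Big(\frac{2 \lambda_0 \theta_a k}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a k} \chi^2_{2n}\Big(\frac{2 \theta_a k}{\lambda_0 - 1}\Big). \tag{2.5}$$
Note that $\sum_{i=1}^n X_i \sim \Gamma(1, n)$. Hence,
$$E_{Q_{\lambda_0}} e^{-S_n^+} = E\Big\{E_{Q_{\lambda_0}}\Big[e^{-S_n^+} \,\Big|\, \sum_{i=1}^n X_i\Big]\Big\} = \int_0^\infty \Big[1 - \chi^2_{2n}\Big(\frac{2 \lambda_0 \theta_a k}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a k} \chi^2_{2n}\Big(\frac{2 \theta_a k}{\lambda_0 - 1}\Big)\Big] \frac{k^{n-1} e^{-k}}{(n-1)!}\,dk.$$
Then by (2.3),
$$\nu_a = \frac{a \lambda_0}{c} \exp\Big\{-\sum_{n=1}^\infty \int_0^\infty \Big[1 - \chi^2_{2n}\Big(\frac{2 \lambda_0 \theta_a k}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a k} \chi^2_{2n}\Big(\frac{2 \theta_a k}{\lambda_0 - 1}\Big)\Big] \frac{k^{n-1} e^{-k}}{n!}\,dk\Big\}.$$

We proceed to provide a general formula for the overshoot constant of a compound Poisson process with an arbitrary mark distribution.

Example 2.3. Let $X_1$ have an arbitrary distribution function $F$ with a positive mean, and let $F_n$ be the cumulative distribution function of $\sum_{i=1}^n X_i$. Choose $z_a$ and $\theta_a$ satisfying $\phi(z_a) = c/a$ and $\theta_a = \phi'(z_a)$. Then $\lambda_0 = K(\theta_a)$. By (2.5),
$$E_{Q_{\lambda_0}}\Big[e^{-S_n^+} \,\Big|\, \sum_{i=1}^n X_i = k\Big] = 1 - \chi^2_{2n}\Big(\frac{2 \lambda_0 \theta_a k}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a k} \chi^2_{2n}\Big(\frac{2 \theta_a k}{\lambda_0 - 1}\Big).$$
We first assume that $X_1$ is continuous. Hence,
$$E_{Q_{\lambda_0}} e^{-S_n^+} = E\Big\{E_{Q_{\lambda_0}}\Big[e^{-S_n^+} \,\Big|\, \sum_{i=1}^n X_i\Big]\Big\} = \int_{-\infty}^\infty \Big[1 - \chi^2_{2n}\Big(\frac{2 \lambda_0 \theta_a k}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a k} \chi^2_{2n}\Big(\frac{2 \theta_a k}{\lambda_0 - 1}\Big)\Big]\,dF_n(k).$$
Then by (2.3),
$$\nu_a = \frac{a \lambda_0}{c} \exp\Big\{-\sum_{n=1}^\infty n^{-1} \int_{-\infty}^\infty \Big[1 - \chi^2_{2n}\Big(\frac{2 \lambda_0 \theta_a k}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a k} \chi^2_{2n}\Big(\frac{2 \theta_a k}{\lambda_0 - 1}\Big)\Big]\,dF_n(k)\Big\}.$$
If $X_1$ is discrete, we assume without loss of generality that $X_1$ only takes integer values. Let $f_n$ be the probability mass function of $F_n$. Then,
$$\nu_a = \frac{a \lambda_0}{c} \exp\Big\{-\sum_{n=1}^\infty n^{-1} \sum_{k=-\infty}^\infty \Big[1 - \chi^2_{2n}\Big(\frac{2 \lambda_0 \theta_a k}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a k} \chi^2_{2n}\Big(\frac{2 \theta_a k}{\lambda_0 - 1}\Big)\Big] f_n(k)\Big\}.$$

Chapter 3 Theoretical Results

Let $\lambda$, $c$, $N(t)$, $X_i$, $\rho$, $u_i$, $t_i$, $\phi(z)$, $N(x, y)$ and $S(x, y)$ be defined as in Section 2.2. We are interested in the maximum score captured by a scanning window of variable length. Specifically, for given $0 < a_0 < a_1 < 1$ and the sliding window $[x, y]$, define the scan statistic to be
$$M(a_0, a_1) = \sup_{\substack{0 \le x \le y \le 1 \\ a_0 \le y - x \le a_1}} S(x, y).$$
We write $a_n \sim b_n$ if $\lim_{n \to \infty} (a_n / b_n) = 1$. The main theoretical result of this thesis is the tail probability approximation below.

Theorem 3.1.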
As $\lambda \to \infty$,
$$P_\lambda\{M(a_0, a_1) \ge \lambda c\} \sim (2\pi)^{-1/2} \lambda^{3/2} e^{-\lambda c} c^2 \int_{a_0}^{a_1} \theta_a^{-1} K''(\theta_a)^{-1/2} \nu_a^2\, a^{-5/2} (1 - a)\,da,$$
where $\theta_a$ and $\nu_a$ are defined in Section 2.2.

Through an appropriate scaling transformation, we can view the above limiting probability as one involving a fixed Poisson rate $\lambda_0 > 0$ and increasing scanning sets. Let $\lambda / \lambda_0$ be the scaling constant. Then
$$P_\lambda\{M(a_0, a_1) \ge \lambda c\} = P_{\lambda_0}\{M_0(a_0, a_1) \ge \lambda c\},$$
where
$$M_0(a_0, a_1) = \sup_{\substack{0 \le x \le y \le \lambda / \lambda_0 \\ a_0 \lambda / \lambda_0 \le y - x \le a_1 \lambda / \lambda_0}} S(x, y).$$
For notational simplicity, we will look at the tail probability in the form of Theorem 3.1. In practical use, however, the tail probability in terms of a fixed rate and increasing scanning sets is sometimes more appropriate.

One limitation of the above theorem is the complicated computation of the overshoot constant. Even when $F$ is degenerate at 1, the overshoot constant $\nu_a$ has the complicated expression (2.4). For an arbitrary $F$ that is not degenerate at 1, the computation is even more complicated. In such cases, however, we can still provide a simple upper bound based on Theorem 3.1, since $\nu_a \le 1$.

Corollary 3.1. As $\lambda \to \infty$,
$$P_\lambda\{M(a_0, a_1) \ge \lambda c\} \le [1 + o(1)]\,(2\pi)^{-1/2} \lambda^{3/2} e^{-\lambda c} c^2 \int_{a_0}^{a_1} \theta_a^{-1} K''(\theta_a)^{-1/2} a^{-5/2} (1 - a)\,da.$$

It is interesting to compare our theoretical results with those of Chan and Zhang (2007) and Chan (2009), which both examine scan statistics within windows of fixed sizes.

3.1 Theoretical Results and Applications in Chan and Zhang (2007)

Scan statistics have recently been applied in the computational analysis of DNA and protein sequences. To locate genes related to specific biological processes, Lifanov et al. (2003) scanned DNA sequences for clusters of transcription factor binding sites. They applied position weight matrices to score words for similarity to a given transcription factor pattern, and determined locations of occurrence of the pattern by a cut-off value for the word score. Rajewsky et al.
(2002) studied a similar problem except that they used the total score of all words, rather than the number of words in a window exceeding the cut-off, to compute the scan statistics. Chan and Zhang (2007) provided p-value approximations for scan statistics of marked Poisson processes. These approximations can be applied to general scoring schemes used in computational biology. An important feature of the formula is an overshoot correction term that is equal to 1 in the special case of 0-1 processes.

Let $N(t)$ be the Poisson process as defined in Section 2.2. Let $r > 0$ be the length of the interval and restrict $t \in (0, r]$. Let $X_i$, $\rho$, $t_i$, $N(x, y)$, $K(\theta)$ and $\psi(\theta)$ be defined as in Section 2.2. In particular, the score in the window $(t, t + \delta]$ is $N(t, t + \delta)$, where $\delta \in (0, r)$ is a pre-determined width of the window. Further define the fixed window-size scan statistic to be
$$M_{r,\delta} = \sup_{0 \le t \le r - \delta} N(t, t + \delta).$$
Assume that $K(\theta)$ is finite for some $\theta > 0$. Given $c > \lambda \rho$, choose $\tilde\theta_c > 0$ and distribution $F_{\tilde\theta_c}$ to satisfy
$$K'(\tilde\theta_c) = c / \lambda, \qquad \frac{F_{\tilde\theta_c}(dx)}{F(dx)} = \frac{e^{\tilde\theta_c x}}{K(\tilde\theta_c)}. \tag{3.1}$$
Define the large deviation rate function to be $I_c = \tilde\theta_c c - \lambda \psi(\tilde\theta_c)$.

To derive the overshoot constant, consider $\tilde Y_1, \tilde Y_2, \ldots$ to be i.i.d. random variables satisfying
$$P(\tilde Y_1 \in dy) = \frac{K(\tilde\theta_c)}{1 + K(\tilde\theta_c)} F_{\tilde\theta_c}(dy) + \frac{1}{1 + K(\tilde\theta_c)} \bar F(dy), \tag{3.2}$$
where $\bar F$ denotes the cumulative distribution function of $-X_1$. Let $\tilde S_n = \tilde Y_1 + \cdots + \tilde Y_n$ and $\tilde\tau_b = \inf\{n \ge 1 : \tilde S_n \ge b\}$. Define the overshoot constant to be
$$\tilde\nu_c = \lim_{b \to \infty} E\big[e^{-\tilde\theta_c (\tilde S_{\tilde\tau_b} - b)}\big]. \tag{3.3}$$
Chan and Zhang (2007) provided the following tail probability approximation of $M_{r,\delta}$.

Theorem 3.2. Let $\lambda$ and $c > \lambda \rho$ be fixed. Let $\delta \to \infty$ as $r \to \infty$ such that $r - \delta \to \infty$. Then,
$$P\{M_{r,\delta} \ge \delta c\} \sim 1 - \exp\Big\{-\frac{(r - \delta)\, \tilde\nu_c\, e^{-\delta I_c} (c - \lambda \rho)}{\sqrt{2 \pi \lambda \delta K''(\tilde\theta_c)}}\Big\}.$$

When $F$ is degenerate at 1, $K(\theta) = e^\theta$ and $\tilde\theta_c = \log(c / \lambda)$. Further, $I_c = c \log(c / \lambda) - c + \lambda$. Note that the overshoot constant $\tilde\nu_c = 1$ in this degenerate case. Hence Theorem 3.2 reduces to the following corollary.

Corollary 3.2.
Let $\lambda$ and $c > \lambda \rho$ be fixed. Let $\delta \to \infty$ as $r \to \infty$ such that $r - \delta \to \infty$. Then,
$$P\Big\{\sup_{0 \le t \le r - \delta} [N(t + \delta) - N(t)] \ge \delta c\Big\} \sim 1 - \exp\Big\{-\frac{(r - \delta)\, e^{\delta (c - \lambda)} (\lambda / c)^{\delta c} (c - \lambda)}{\sqrt{2 \pi \delta c}}\Big\}.$$

Chan and Zhang (2007) applied the above formulas to study palindromes in DNA sequences.

Example 3.1. A high concentration of palindromic patterns (PLP) is associated with origins of replication of viruses. The four letters A, T, C, G are used to denote the DNA alphabet, with A-T and C-G being complementary base pairs (bp) on opposite strands of the DNA helix. Thus the complementary DNA sequence of AGATCT is TCTAGA. A DNA sequence is a PLP if its complement reads the same as itself backwards (e.g. AGATCT). Let the length of a PLP be the number of complementary pairs that it contains (i.e. the length of AGATCT is 3). Let PLP* be a PLP with a length of at least 5 bp that is not nested inside another PLP. Model the occurrence of PLP* in the Human cytomegalovirus (HCMV) genome as a Poisson process [see Leung, Schachtel and Yu (1994)]. A total of $N(r) = 296$ PLP* are observed in the genome, which has length $r = 229{,}354$ bp. Thus the rate of the Poisson process is estimated to be $\hat\lambda = N(r)/r = 0.00129$. Note that $F$ is degenerate at 1 for this example. Chan and Zhang (2007) applied Corollary 3.2 to compute the p-value approximations for the scan statistic with fixed window size $\delta = 1000$ bp.

Table 3.1: Estimation of p ± s.e. with F degenerate at 1.

δc    Direct Monte Carlo        Chan and Zhang (2007)    Naus (1982)
9     (1.5 ± 0.3) × 10^{-2}     1.32 × 10^{-2}           1.32 × 10^{-2}
10    (1 ± 1) × 10^{-3}         1.95 × 10^{-3}           1.93 × 10^{-3}
11    0                         2.53 × 10^{-4}           2.53 × 10^{-4}

Naus (1982) provided a more complicated p-value approximation which works only for the degenerate case. It appears from Table 3.1, which is reproduced from Chan and Zhang (2007), that the p-value approximations of Chan and Zhang (2007) agree well with both the direct Monte Carlo estimates and the corresponding results in Naus (1982).

Example 3.2. Instead of giving equal score to each PLP* as in Example 3.1 (i.e.
$X_i = 1$), now assign a score of $X_i = p_i - 4$ to the $i$th PLP*, where $p_i$ is its length. In this sense, we say that the scan statistics are unweighted in Example 3.1 and weighted in this example. Define the location of the $i$th PLP* (i.e. $t_i$) to be the location of its left center. The rate of the Poisson process is estimated by $\hat\lambda = N(r)/r$. Consider here $F$ to be geometric, and estimate its mean by $\rho = (1 - 2\hat\gamma_A \hat\gamma_T - 2\hat\gamma_G \hat\gamma_C)^{-1}$, where $(\hat\gamma_A, \hat\gamma_T, \hat\gamma_G, \hat\gamma_C)$ are the empirical probabilities of the four bases in the genome.

We shall now compute the overshoot constant $\tilde\nu_c$ for the geometric distribution. Let $\tilde\tau_+ = \inf\{n \ge 1 : \tilde S_n > 0\}$. Then by Theorem 2.1, as $b \to \infty$ through $\mathbb{Z}$,
$$\lim_{b \to \infty} P\{\tilde S_{\tilde\tau_b} - b = j\} = (E \tilde S_{\tilde\tau_+})^{-1} P\{\tilde S_{\tilde\tau_+} > j\}, \qquad j = 0, 1, \ldots \tag{3.4}$$
When $F$ is geometric, $F_{\tilde\theta_c}$ is also geometric by (3.1). Further, by (3.2) and the memoryless property of the geometric distribution, $\tilde S_{\tilde\tau_+}$ is geometric with distribution $F_{\tilde\theta_c}$. Hence by (3.3) and (3.4), Chan and Zhang (2007) showed that
$$\tilde\nu_c = \rho\,[1 - (1 - \rho^{-1}) e^{\tilde\theta_c}].$$
Chew, Choi and Leung (2005) studied clustering of PLP* but used a score $X_i = p_i$ (or equivalently $X_i = p_i / 5$) together with a shifted geometric distribution for $X_i$. Chan and Zhang (2007) studied both the unweighted and weighted scan statistics and provided p-value approximations.

Table 3.2: Summary of information for the scan statistics of three viral genomes.
Virus    $(\hat\gamma_A, \hat\gamma_T, \hat\gamma_G, \hat\gamma_C)$    $r$        $N(r)$    $\delta$
CeHV1    (0.13, 0.37, 0.38, 0.13)                                       156 789    580       800
BoHV1    (0.14, 0.36, 0.37, 0.14)                                       135 301    615       700
BoHV5    (0.12, 0.37, 0.38, 0.13)                                       138 390    714       700

         Unweighted                          F geometric
Virus    $M_{r,\delta}$    p-value           $M_{r,\delta}$    p-value
CeHV1    18                7.23 × 10^{-6}    116               0
BoHV1    17                1.09 × 10^{-4}    32                6.08 × 10^{-5}
BoHV5    15                1.07 × 10^{-2}    33                1.74 × 10^{-4}

For Table 3.2, Chew, Choi and Leung (2005) provided the empirical probabilities of the four bases $(\hat\gamma_A, \hat\gamma_T, \hat\gamma_G, \hat\gamma_C)$, the length of the genome $r$ and the number of observed PLP* $N(r)$ for three viruses: Cercopithecine herpesvirus 1 (CeHV1), Bovine herpesvirus 1 (BoHV1) and Bovine herpesvirus 5 (BoHV5). The window size $\delta$ is equal to 0.5% of the genome length, rounded off to the nearest 100 bases. Chan and Zhang (2007) provided the unweighted and weighted scan statistics and p-value approximations in Table 3.2.

Figure 3.1, taken from Chan and Zhang (2007), plots the computed scan statistics against genome location for the three viruses. Experimentally validated origins of replication for these viruses are also shown in the figure. To avoid an excessive number of false positives when handling a large number of genomes, they applied a conservative p-value cutoff of 0.001 and used Theorem 3.2 to determine the threshold levels corresponding to this p-value. Figure 3.1 shows that a length-based weighting scheme improves the power for both CeHV1 and BoHV1. For BoHV5, significant clusters of palindromes are detected in the neighborhood of the replication origins. However, there are also many false positives for this genome.

Figure 3.1: Comparison of weighted and unweighted scan statistics for 3 viral genomes. For all plots, the horizontal axis denotes location in the genome. The top plots show the locations and lengths of palindromes longer than 4. The middle plots show the unweighted scan statistic $\delta^{-1}[N(t + \delta/2) - N(t - \delta/2)]$ against $t$. The bottom plots show the weighted scan statistic $\delta^{-1}[S_{N(t + \delta/2)} - S_{N(t - \delta/2)}]$ against $t$.
Triangles at the top of the plots denote experimentally validated replication origins. Thresholds for a p-value of 0.001 are indicated by dashed horizontal lines.

In practice, however, we may not have much prior information on the length of the signal, so it is difficult to determine a fixed window size in advance. Moreover, if the length of the signal fluctuates, a fixed-size window is not appropriate for detecting it. A useful extension is therefore to allow the window size to vary, which is the case considered in this thesis: our window $[x, y]$ has a variable length $a_0 \le y - x \le a_1$.

3.2 The Theoretical Result in Chan (2009)

Chan (2009) studied the tail probabilities of the maxima of moving sums with a wide choice of scanning sets. He first developed a theory parallel to the study of tail probabilities in Gaussian or Gaussian-like random fields in the classical framework of Pickands (1969) and Qualls and Watanabe (1973) [see also Piterbarg (1996) and Chan and Lai (2006)]. Motivated by recent developments in molecular biology [see Examples 3.1 and 3.2], he considered a more general marked Poisson random field.

Consider here $\{\check t_i, i \ge 1\}$ to be a homogeneous Poisson point process on $\mathbb{R}^m$ with intensity $\lambda > 0$, and let $X_1, X_2, \ldots$ be i.i.d. random variables with cumulative distribution function $F$, independent of the Poisson point process. Let $\rho$ and $K(\theta)$ be defined as in Section 2.2.

To consider the scan statistic, we first define some notation. Let $\sigma_m(\cdot)$ be the volume of a set in $\mathbb{R}^m$. For any $A \subset \mathbb{R}^m$, vector $q \in \mathbb{R}^m$ and real number $\eta$, let $q + \eta A = \{q + \eta \alpha : \alpha \in A\}$. For any $B \subset \mathbb{R}^m$, define the score $\check S(B) = \sum_{\check t_i \in B} X_i$. Let $D$ be a bounded subset of $\mathbb{R}^m$. Define the scan statistic to be
$$M_{D,B} = \sup_{v \in D} \check S(v + B).$$
Assume that $\Theta = \{\theta : K(\theta) < \infty\}$ is an open neighborhood of 0. For $c > \rho\, \sigma_m(B)$, choose $\check\theta_c > 0$ and distribution $F_{\check\theta_c}$ to satisfy
$$K'(\check\theta_c) = c / \sigma_m(B) \qquad \text{and} \qquad \frac{F_{\check\theta_c}(dx)}{F(dx)} = \frac{e^{\check\theta_c x}}{K(\check\theta_c)}.$$
Define the large deviation rate function to be $I_{\check\theta_c} = \check\theta_c c - \sigma_m(B)[K(\check\theta_c) - 1]$. Chan (2009) provided the following tail probability approximation of $M_{D,B}$.

Theorem 3.3. Let $B$ be convex and bounded. Define $x_\lambda = \check\theta_c (\lambda c - d \lceil \lambda c / d \rceil)$ if $F$ is arithmetic with span $d$, and $x_\lambda = 0$ if $F$ is nonarithmetic. Then as $\lambda \to \infty$,
$$P_\lambda\{M_{D,B} \ge \lambda c\} \sim [2 \pi \sigma_m(B) K''(\check\theta_c)]^{-1/2} e^{-\lambda I_{\check\theta_c} + x_\lambda} \lambda^{m - 1/2} \sigma_m(D)\, \omega_B$$
for some positive and finite constant $\omega_B$.

The constant $\omega_B$ above is specified in Chan (2009). When $B$ is rectangular, $\omega_B$ has an explicit expression in terms of the overshoot constant.

Example 3.3. Let $B = \prod_{j=1}^m [0, \beta_j]$ with $\beta_j > 0$ for all $j$. Consider $\check Y_1, \check Y_2, \ldots$ to be i.i.d. random variables satisfying
$$P(\check Y_1 \in dx) = [\bar F(dx) + K(\check\theta_c) F_{\check\theta_c}(dx)] / [1 + K(\check\theta_c)],$$
where $\bar F$ denotes the cumulative distribution function of $-X_1$. Let $\check S_n = \check Y_1 + \cdots + \check Y_n$ and $\check\tau_b = \inf\{n \ge 1 : \check S_n \ge b\}$. Define the overshoot constant
$$\check\nu_c = \lim_{b \to \infty} E\, e^{-\check\theta_c (\check S_{\check\tau_b} - b)}.$$
Chan (2009) showed that $\omega_B$ has the following expression:
$$\omega_B = \check\nu_c^{\,m}\, [c\, \sigma_m(B)^{-1} - \rho]^m\, \chi_c^{m-1} \prod_{j=1}^m \beta_j^{m-1},$$
where $\chi_c = \check\theta_c$ for $F$ nonarithmetic and $\chi_c = d^{-1}(1 - e^{-d \check\theta_c})$ for $F$ arithmetic with span $d$.

Although the window $B$ in Chan (2009) can take an arbitrary shape, it still has a fixed size. For the reasons given at the end of Section 3.1, a useful extension is to consider variable window sizes.

Chapter 4 Examples and Numerical Studies

Example 4.1. Consider again the degenerate $X_1 \equiv 1$ of Example 2.1. Then $K(\theta) = e^\theta$ and $\phi(z) = z \log z - z + 1$. Choose $z_a$ satisfying $\phi(z_a) = c/a$. Then $\theta_a = \phi'(z_a) = \log z_a$. By Theorem 3.1, as $\lambda \to \infty$,
$$P_\lambda\{M(a_0, a_1) \ge \lambda c\} \sim (2\pi)^{-1/2} \lambda^{3/2} e^{-\lambda c} c^2 \int_{a_0}^{a_1} \theta_a^{-1} e^{-\theta_a / 2} \nu_a^2\, a^{-5/2} (1 - a)\,da, \tag{4.1}$$
where $\nu_a$ is given by (2.4),
$$\nu_a = \frac{a z_a}{c} \exp\Big\{-\sum_{n=1}^\infty n^{-1} \Big[1 - \chi^2_{2n}\Big(\frac{2 z_a \theta_a n}{z_a - 1}\Big) + \chi^2_{2n}\Big(\frac{2 \theta_a n}{z_a - 1}\Big)\Big]\Big\}.$$
To evaluate the above complicated formula, we first provide the following plots of $\nu_a$.

Figure 4.1: Plots of $\nu_a$ against $a$ when $c = 0.01$ and $c = 0.1$.
Figure 4.2: Plots of $\nu_a$ against $a$ when $c = 1$ and $c = 10$.

All the above figures are based on $a_0 = 0.1$, $a_1 = 0.9$. For fixed $c$, $a_0$ and $a_1$, $\nu_a$ appears to be an increasing, concave function of $a$. On the other hand, the scale of $\nu_a$ decreases significantly and the plots flatten slightly as $c$ increases. Further experiments also show that the choices of $a_0$ and $a_1$ mainly influence the scale rather than the shape of the plots.

Figure 4.3: Plots of $\nu_a$ against $\theta_a$ when $c = 0.01$ and $c = 0.1$.

Figure 4.4: Plots of $\nu_a$ against $\theta_a$ when $c = 1$ and $c = 10$.

It appears that there is a linear relationship between $\nu_a$ and $\theta_a$. We therefore provide the corresponding linear regression equation in each plot. The $R^2$ of the regression decreases only slightly (from 1 to 0.9952) as $c$ increases (from 0.01 to 10), so it stays close to 1, which indicates the strong explanatory power of the linear regression equations. Since the computation of $\nu_a$ is complicated, we may simply estimate $\nu_a$ using $\theta_a$. However, the regression coefficients also change significantly as $c$ changes. Therefore the linearity may be spurious and may fail in other cases.

We then provide the following plot to illustrate the tail probability approximation given by (4.1).

Figure 4.5: The tail probability approximation ($p$ against $100 \times c$).

The plot is based on $\lambda = 100$. The p-values decay rapidly at the beginning (for $100c$ from 4 to 6) and the curve is not smooth there, which is unfavorable for numerical studies. Beyond $100c = 8$ the p-values approach 0 slowly and smoothly.

To check these approximations, we provide the following comparison between Monte Carlo simulations and the analytical results in (4.1).

Table 4.1: Monte Carlo simulations and analytical p-values.
λc    Direct Monte Carlo       Analytical estimate in (4.1)
8     (8.4 ± 0.8) × 10^-3      8.00 × 10^-3
9     (3.5 ± 0.3) × 10^-3      3.36 × 10^-3
10    (1.4 ± 0.1) × 10^-3      1.44 × 10^-3
11    0                        5.86 × 10^-4

The results in the second column of Table 4.1 are based on $10^4$ paths of Monte Carlo simulations with $\lambda = 100$, $a_0 = 0.1$ and $a_1 = 0.9$. The related R code is provided in the Appendix. Each simulation result takes several days of running the R code, which is extremely time consuming compared with the several minutes needed to compute the analytical p-values. For even smaller probabilities, even more time is needed to reach the desired accuracy (i.e. a standard error equal to 10% of the simulated probability). The comparison shows that the analytical p-values from (4.1) agree well with the simulations. We next examine the accuracy of the upper bound given by Corollary 3.1.

Example 4.2. Let $X_1$ be an exponential random variable with parameter 1. Then $K(\theta) = \frac{1}{1-\theta}$ and $\phi(z) = z - 2\sqrt{z} + 1$. Choose $z_a$ satisfying $\phi(z_a) = c/a$. Then $\theta_a = \phi'(z_a) = 1 - \frac{1}{\sqrt{z_a}}$. By Corollary 3.1, as $\lambda \to \infty$,
\[ P_\lambda\{M(a_0, a_1) \ge \lambda c\} \le [1 + o(1)]\, 2^{-1} \pi^{-1/2} \lambda^{3/2} e^{-\lambda c} c^2 \int_{a_0}^{a_1} \theta_a^{-1} (1 - \theta_a)^{3/2} a^{-5/2} (1 - a)\, da. \tag{4.2} \]
We provide the following comparison between Monte Carlo simulation results and the upper bounds given by (4.2).

Table 4.2: Monte Carlo simulations and upper bounds by (4.2).

λc    Direct Monte Carlo       Upper bound by (4.2)
7     (1.2 ± 0.1) × 10^-2      2.48 × 10^-2
8     (4.9 ± 0.4) × 10^-3      1.10 × 10^-2
9     (1.9 ± 0.2) × 10^-3      4.75 × 10^-3
10    0                        2.00 × 10^-3

The simulation results in the second column of Table 4.2 are based on $\lambda = 100$, $a_0 = 0.1$ and $a_1 = 0.9$. The upper bounds from (4.2) are reasonably close to the simulated values. Taking the first row of Table 4.2 as an example, the simulated p-value and the upper bound are of the same order of magnitude. On the other hand, the upper bounds do not appear to converge to the tail probabilities as $\lambda \to \infty$. This again indicates the importance of the overshoot constant.
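The quantities in Examples 4.1 and 4.2 can be computed with a few lines of code. The following is a minimal Python sketch, not the R code of the thesis; all function names are hypothetical. It solves $\phi(z_a) = c/a$ by bisection for the rate function of Example 4.1, uses the closed form $\sqrt{z_a} = 1 + \sqrt{c/a}$ that follows from $\phi(z) = (\sqrt{z} - 1)^2$ in Example 4.2, and evaluates the leading term of the upper bound (4.2) by Simpson's rule, ignoring the $1 + o(1)$ factor.

```python
import math


def phi_poisson(z):
    """Rate function of Example 4.1: phi(z) = z log z - z + 1."""
    return z * math.log(z) - z + 1.0


def solve_z_poisson(target, tol=1e-12):
    """Solve phi(z) = target for z > 1 by bisection (phi is increasing there)."""
    lo, hi = 1.0, 2.0
    while phi_poisson(hi) < target:   # expand the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if phi_poisson(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def theta_exponential(c, a):
    """Example 4.2: phi(z) = (sqrt(z) - 1)^2 = c/a, theta_a = 1 - 1/sqrt(z_a)."""
    root_z = 1.0 + math.sqrt(c / a)
    return 1.0 - 1.0 / root_z


def upper_bound_exponential(c, lam, a0=0.1, a1=0.9, n=2000):
    """Leading term of (4.2), integrated with composite Simpson's rule (n even)."""
    def integrand(a):
        th = theta_exponential(c, a)
        return (1.0 - th) ** 1.5 / th * a ** -2.5 * (1.0 - a)

    h = (a1 - a0) / n
    total = integrand(a0) + integrand(a1)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * integrand(a0 + i * h)
    integral = total * h / 3.0
    prefactor = 0.5 / math.sqrt(math.pi) * lam ** 1.5 * math.exp(-lam * c) * c ** 2
    return prefactor * integral


if __name__ == "__main__":
    z_a = solve_z_poisson(0.1)             # Example 4.1 with c/a = 0.1
    print("z_a =", z_a, "theta_a =", math.log(z_a))
    print("bound (4.2), lambda*c = 7:", upper_bound_exponential(0.07, 100.0))
```

With $\lambda = 100$ and $c = 0.07$ the last line should land close to the first upper bound reported in Table 4.2. The analytical formula (4.1) would additionally need $\nu_a$ from (2.4), which is not sketched here.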
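Chapter 5 Proof of Theorem 3.1

We first examine the local behavior of segmental scores in a window by a change of measure approach and then combine all these windows to obtain the tail probability of $M(a_0, a_1)$. Specifically, we define the basic window
\[ W = \{(x, y) : x = x_0 - v_1\lambda^{-1},\ y = y_0 + v_2\lambda^{-1},\ 0 \le v_1, v_2 \le m\}, \]
where $m \to \infty$ as $\lambda \to \infty$ and $m = o(\lambda)$. Choose $x_0 \le y_0$ to be multiples of $m\lambda^{-1}$ in $[0, 1]$ satisfying $a_0 \le a = y_0 - x_0 \le a_1$. Consider the tail probability in $W$ conditional on the starting point $(x_0, y_0)$:
\[ P_\lambda\Big\{\sup_{(x,y)\in W} S(x, y) \ge \lambda c\Big\} = P_\lambda\{S(x_0, y_0) \ge \lambda c\} + P_\lambda\Big\{S(x_0, y_0) < \lambda c,\ \sup_{(x,y)\in W} S(x, y) \ge \lambda c\Big\}. \tag{5.1} \]
To evaluate $S(x, y)$, apply a Taylor expansion. By (2.2),
\[ \begin{aligned} S(x, y) &= \lambda(y - x)\,\phi\Big(\frac{N(x, y)}{\lambda(y - x)}\Big) \\ &= \lambda(y - x)\Big\{\phi(z_a) + \theta_a\Big[\frac{N(x, y)}{\lambda(y - x)} - z_a\Big] + O\Big(\Big[\frac{N(x, y)}{\lambda(y - x)} - z_a\Big]^2\Big)\Big\} \\ &= \theta_a N(x, y) - \lambda(y - x)\psi(\theta_a) + \lambda(y - x)\, O\Big(\Big[\frac{N(x, y)}{\lambda(y - x)} - z_a\Big]^2\Big). \end{aligned} \tag{5.2} \]
Define a probability measure $Q_\lambda$ under which $\mathcal{X} = \{(t_i, X_i)\}_{i=1}^n$ is a non-uniform compound Poisson process with rate $\lambda K(\theta_a)$ and mark distribution $F_{\theta_a}$ inside $[x_0, y_0]$, and rate $\lambda$ and mark distribution $F$ outside $[x_0, y_0]$. Then
\[ \frac{dQ_\lambda}{dP_\lambda}(\mathcal{X}) = \frac{e^{-\lambda a K(\theta_a)} (\lambda a K(\theta_a))^{N(x_0, y_0)}\, e^{\theta_a N(x_0, y_0)} / K(\theta_a)^{N(x_0, y_0)}}{e^{-\lambda a} (\lambda a)^{N(x_0, y_0)}} = \exp\{\theta_a N(x_0, y_0) - \lambda a \psi(\theta_a)\}. \tag{5.3} \]
By this change of measure, we examine the local behavior of $S(x_0, y_0)$ under $Q_\lambda$. Specifically, $\theta_a N(x_0, y_0) - \lambda a\psi(\theta_a)$ is asymptotically normal under $Q_\lambda$ with mean $\theta_a \lambda a K'(\theta_a) - \lambda a\psi(\theta_a) = \lambda c$ and variance $\theta_a^2 \lambda a K''(\theta_a)$,
\[ Q_\lambda\{\theta_a N(x_0, y_0) - \lambda a\psi(\theta_a) \in du\} = [1 + o(1)] \frac{1}{\theta_a\sqrt{2\pi\lambda a K''(\theta_a)}} \exp\Big\{-\frac{(u - \lambda c)^2}{2\theta_a^2 \lambda a K''(\theta_a)}\Big\}\, du, \tag{5.4} \]
where $o(1)$ is uniform over bounded values of $u$. Combining (5.2), (5.3) and (5.4),
\[ \begin{aligned} P_\lambda\{S(x_0, y_0) \ge \lambda c\} &= P_\lambda\{\theta_a N(x_0, y_0) - \lambda a\psi(\theta_a) \ge \lambda c\} \\ &= \int_0^\infty P_\lambda\{\theta_a N(x_0, y_0) - \lambda a\psi(\theta_a) \in \lambda c + du\} \\ &= \int_0^\infty \exp\{-\lambda c - u\}\, Q_\lambda\{\theta_a N(x_0, y_0) - \lambda a\psi(\theta_a) \in \lambda c + du\} \\ &= \int_0^\infty \exp\{-\lambda c - u\}[1 + o(1)] \frac{1}{\theta_a\sqrt{2\pi\lambda a K''(\theta_a)}} \exp\Big\{-\frac{u^2}{2\theta_a^2 \lambda a K''(\theta_a)}\Big\}\, du. \end{aligned} \]
Since $\lim_{\lambda\to\infty} \exp\big\{-\frac{u^2}{2\theta_a^2 \lambda a K''(\theta_a)}\big\} = 1$,
\[ P_\lambda\{S(x_0, y_0) \ge \lambda c\} = [1 + o(1)] \frac{e^{-\lambda c}}{\theta_a\sqrt{2\pi\lambda a K''(\theta_a)}}. \tag{5.5} \]
Similarly,
\[ \begin{aligned} &P_\lambda\Big\{S(x_0, y_0) < \lambda c,\ \sup_{(x,y)\in W} S(x, y) \ge \lambda c\Big\} \\ &\quad = \int_0^\infty P_\lambda\Big\{\sup_{(x,y)\in W} S(x, y) \ge \lambda c \,\Big|\, S(x_0, y_0) = \lambda c - u\Big\}\, P_\lambda\{S(x_0, y_0) \in \lambda c - du\} \\ &\quad = \int_0^\infty \frac{[1 + o(1)]\, e^{-\lambda c + u}}{\theta_a\sqrt{2\pi\lambda a K''(\theta_a)}}\, P_\lambda\Big\{\sup_{(x,y)\in W} S(x, y) \ge \lambda c \,\Big|\, S(x_0, y_0) = \lambda c - u\Big\}\, du. \end{aligned} \tag{5.6} \]
By (5.2), the linear approximation of $S(x, y) - S(x_0, y_0)$ is $\theta_a[N(x, x_0) + N(y_0, y)] - \lambda(y - x - a)\psi(\theta_a)$, which is independent of $S(x_0, y_0)$. Hence,
\[ \begin{aligned} &P_\lambda\Big\{\sup_{(x,y)\in W} S(x, y) \ge \lambda c \,\Big|\, S(x_0, y_0) = \lambda c - u\Big\} \\ &\quad = P_\lambda\Big\{\sup_{(x,y)\in W} \theta_a[N(x, x_0) + N(y_0, y)] - \lambda(y - x - a)\psi(\theta_a) \ge u - o(1)\Big\} \\ &\quad = P_\lambda\Big\{\sup_{0\le v_1, v_2\le m} \theta_a[N(x_0 - v_1\lambda^{-1}, x_0) + N(y_0, y_0 + v_2\lambda^{-1})] - (v_1 + v_2)\psi(\theta_a) \ge u - o(1)\Big\}, \end{aligned} \]
where $o(1)$ is uniform over bounded values of $u$. Let
\[ C_u = P_\lambda\Big\{\sup_{0\le v_1, v_2\le m} \theta_a[N(x_0 - v_1\lambda^{-1}, x_0) + N(y_0, y_0 + v_2\lambda^{-1})] - (v_1 + v_2)\psi(\theta_a) \ge u - o(1)\Big\}. \]
Through a scaling transformation, we can view this limiting probability as one involving a fixed Poisson rate $\lambda_0 = 1$ and increasingly large scanning sets,
\[ C_u = P_{\lambda_0}\Big\{\sup_{0\le v_1, v_2\le m} \theta_a N(\lambda x_0 - v_1, \lambda x_0) - v_1\psi(\theta_a) + \theta_a N(\lambda y_0, \lambda y_0 + v_2) - v_2\psi(\theta_a) \ge u - o(1)\Big\}. \]
Let $N_1(v_1) = N(\lambda x_0 - v_1, \lambda x_0)$ and $N_2(v_2) = N(\lambda y_0, \lambda y_0 + v_2)$. Then $N_1$ and $N_2$ are independent and identically distributed compound Poisson processes with rate 1 and mark distribution $F$. Further let $Y_1(v_1) = \theta_a N_1(v_1) - v_1\psi(\theta_a)$ and $Y_2(v_2) = \theta_a N_2(v_2) - v_2\psi(\theta_a)$. Note that both $Y_1$ and $Y_2$ are independent of $S(x_0, y_0)$. Hence,
\[ C_u = P_{\lambda_0}\Big\{\sup_{0\le v_1, v_2\le m} [Y_1(v_1) + Y_2(v_2)] \ge u - o(1)\Big\}. \tag{5.7} \]
Combining (5.6) and (5.7),
\[ P_\lambda\Big\{S(x_0, y_0) < \lambda c,\ \sup_{(x,y)\in W} S(x, y) \ge \lambda c\Big\} = \int_0^\infty \frac{[1 + o(1)]\, e^{-\lambda c + u}}{\theta_a\sqrt{2\pi\lambda a K''(\theta_a)}}\, P_{\lambda_0}\Big\{\sup_{0\le v_1, v_2\le m} [Y_1(v_1) + Y_2(v_2)] \ge u - o(1)\Big\}\, du. \tag{5.8} \]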
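By (5.1), (5.5) and (5.8), as $\lambda \to \infty$,
\[ \begin{aligned} P_\lambda\Big\{\sup_{(x,y)\in W} S(x, y) \ge \lambda c\Big\} &\sim \frac{e^{-\lambda c}}{\theta_a\sqrt{2\pi\lambda a K''(\theta_a)}} \Big[1 + \int_0^\infty e^u P_{\lambda_0}\Big\{\sup_{0\le v_1, v_2\le m} [Y_1(v_1) + Y_2(v_2)] \ge u\Big\}\, du\Big] \\ &= \frac{e^{-\lambda c}}{\theta_a\sqrt{2\pi\lambda a K''(\theta_a)}} \int_{-\infty}^\infty e^u P_{\lambda_0}\Big\{\sup_{0\le v_1, v_2\le m} [Y_1(v_1) + Y_2(v_2)] \ge u\Big\}\, du \\ &= \frac{e^{-\lambda c}}{\theta_a\sqrt{2\pi\lambda a K''(\theta_a)}}\, E_{P_{\lambda_0}} \exp\Big\{\sup_{0\le v_1, v_2\le m} [Y_1(v_1) + Y_2(v_2)]\Big\} \\ &= \frac{e^{-\lambda c}}{\theta_a\sqrt{2\pi\lambda a K''(\theta_a)}} \Big(E_{P_{\lambda_0}} \exp\Big\{\sup_{0\le v_1\le m} Y_1(v_1)\Big\}\Big)^2 \\ &= \frac{e^{-\lambda c}}{\theta_a\sqrt{2\pi\lambda a K''(\theta_a)}} \Big(\int_{-\infty}^\infty e^u P_{\lambda_0}\Big\{\sup_{0\le v_1\le m} Y_1(v_1) \ge u\Big\}\, du\Big)^2 . \end{aligned} \tag{5.9} \]
The second equality holds because the supremum is nonnegative (take $v_1 = v_2 = 0$), so the probability equals 1 for $u \le 0$ and $\int_{-\infty}^0 e^u\, du = 1$. The fourth equality uses the independence of $Y_1$ and $Y_2$.

Let $T_u = \inf\{v_1 : Y_1(v_1) \ge u\}$. Define a probability measure $Q_{\lambda_0}$ under which $N_1$ is a compound Poisson process with rate $K(\theta_a)$ and mark distribution $F_{\theta_a}$. Then
\[ \frac{dQ_{\lambda_0}}{dP_{\lambda_0}}(Y_1(T_u)) = \frac{e^{-K(\theta_a)T_u} (K(\theta_a)T_u)^{N_1(T_u)}\, e^{\theta_a N_1(T_u)} / K(\theta_a)^{N_1(T_u)}}{e^{-T_u} (T_u)^{N_1(T_u)}} = \exp\{\theta_a N_1(T_u) - T_u\psi(\theta_a)\} = e^{Y_1(T_u)}. \]
Hence for (5.9),
\[ \begin{aligned} \int_{-\infty}^\infty e^u P_{\lambda_0}\Big\{\sup_{0\le v_1\le m} Y_1(v_1) \ge u\Big\}\, du &= \int_{-\infty}^\infty e^u \int_{-\infty}^\infty P_{\lambda_0}\Big\{Y_1(T_u) \in db,\ \sup_{0\le v_1\le m} Y_1(v_1) \ge u\Big\}\, du \\ &= \int_{-\infty}^\infty e^u \int_{-\infty}^\infty e^{-b}\, Q_{\lambda_0}\Big\{Y_1(T_u) \in db,\ \sup_{0\le v_1\le m} Y_1(v_1) \ge u\Big\}\, du \\ &= \int_{-\infty}^\infty e^u E_{Q_{\lambda_0}}\Big[e^{-Y_1(T_u)}\, I\Big\{\sup_{0\le v_1\le m} Y_1(v_1) \ge u\Big\}\Big]\, du \\ &= \int_{-\infty}^\infty E_{Q_{\lambda_0}}\Big[e^{-(Y_1(T_u) - u)}\, I\Big\{\sup_{0\le v_1\le m} Y_1(v_1) \ge u\Big\}\Big]\, du. \end{aligned} \tag{5.10} \]
By (2.3), $\nu_a = \lim_{u\to\infty} E_{Q_{\lambda_0}}[e^{-(Y_1(T_u) - u)}]$ is the overshoot constant. The part of the integral over $u \le 0$ contributes only 1, which is negligible as $m \to \infty$. Hence it follows from (5.10) that
\[ \begin{aligned} \int_{-\infty}^\infty e^u P_{\lambda_0}\Big\{\sup_{0\le v_1\le m} Y_1(v_1) \ge u\Big\}\, du &\sim \nu_a \int_0^\infty Q_{\lambda_0}\Big\{\sup_{0\le v_1\le m} Y_1(v_1) \ge u\Big\}\, du \\ &= \nu_a E_{Q_{\lambda_0}} \sup_{0\le v_1\le m} Y_1(v_1) \\ &= \nu_a E_{Q_{\lambda_0}} Y_1(m) + o(1) \\ &= \nu_a[\theta_a K'(\theta_a) m - m\psi(\theta_a)] + o(1) \\ &= \nu_a m c/a + o(1). \end{aligned} \tag{5.11} \]
By (5.9) and (5.11), as $\lambda \to \infty$,
\[ P_\lambda\Big\{\sup_{(x,y)\in W} S(x, y) \ge \lambda c\Big\} \sim \frac{e^{-\lambda c}}{\theta_a\sqrt{2\pi\lambda a K''(\theta_a)}} (\nu_a m c/a)^2 . \tag{5.12} \]
By Chan and Zhang (2007), the probability of joint exceedance in two disjoint windows is asymptotically negligible. Hence Theorem 3.1 follows by combining all the windows. Let $x_0 < y_0$ be multiples of $m\lambda^{-1}$ in $[0, 1]$ and $a = y_0 - x_0$. For each $a$ there are about $(1 - a)\lambda/m$ choices of $x_0$, and $a$ itself ranges over a grid of spacing $m\lambda^{-1}$. Hence by (5.12), as $\lambda \to \infty$,
\[ \begin{aligned} P_\lambda\{M(a_0, a_1) \ge \lambda c\} &\sim \sum_{a_0 \le a \le a_1} P_\lambda\Big\{\sup_{(x,y)\in W} S(x, y) \ge \lambda c\Big\} \\ &\sim \int_{a_0}^{a_1} \frac{e^{-\lambda c}}{\theta_a\sqrt{2\pi\lambda a K''(\theta_a)}}\, \nu_a^2 \lambda^2 c^2 a^{-2} (1 - a)\, da \\ &= (2\pi)^{-1/2} \lambda^{3/2} e^{-\lambda c} c^2 \int_{a_0}^{a_1} \theta_a^{-1} K''(\theta_a)^{-1/2} \nu_a^2\, a^{-5/2} (1 - a)\, da. \end{aligned} \]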
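As a numerical companion to the proof, the following is a minimal Python sketch (again not the thesis's R code; all function names are hypothetical). It assumes $\psi(\theta) = K(\theta) - 1$, which is what the likelihood ratio (5.3) implies, solves the mean condition $\theta_a K'(\theta_a) - \psi(\theta_a) = c/a$ used before (5.4) by bisection, and evaluates the single-window approximation (5.5) and the per-window approximation (5.12). The overshoot constant $\nu_a$ and the window parameter $m$ are taken as inputs, since computing $\nu_a$ requires (2.4).

```python
import math


def solve_theta(K, dK, target, theta_max, tol=1e-12):
    """Solve theta*K'(theta) - (K(theta) - 1) = target on (0, theta_max).

    Assumes psi(theta) = K(theta) - 1 and that the left-hand side is
    increasing in theta, so plain bisection applies.
    """
    def mean_gap(t):
        return t * dK(t) - (K(t) - 1.0)

    lo, hi = 0.0, theta_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_gap(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def single_window_tail(c, a, lam, theta, d2K):
    """Approximation (5.5): exp(-lam*c) / (theta * sqrt(2*pi*lam*a*K''(theta)))."""
    return math.exp(-lam * c) / (theta * math.sqrt(2.0 * math.pi * lam * a * d2K(theta)))


def basic_window_tail(c, a, lam, theta, d2K, nu, m):
    """Approximation (5.12): (5.5) multiplied by (nu * m * c / a)^2."""
    return single_window_tail(c, a, lam, theta, d2K) * (nu * m * c / a) ** 2


if __name__ == "__main__":
    # Unit marks (Example 4.1): K = K' = K'' = exp, so theta_a = log z_a.
    c, a, lam = 0.08, 0.5, 100.0
    theta = solve_theta(math.exp, math.exp, c / a, theta_max=50.0)
    print("theta_a =", theta)
    print("(5.5)  ~", single_window_tail(c, a, lam, theta, math.exp))
```

For exponential marks (Example 4.2) one would pass `K = lambda t: 1 / (1 - t)` and its derivatives with `theta_max` slightly below 1, where the solution agrees with the closed form $\theta_a = 1 - 1/\sqrt{z_a}$.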
Chapter 6 Conclusions

The definition of the score $S(x, y)$, and therefore Theorem 3.1, is based on the rate function $\phi$, which in turn depends on the mark distribution $F$. In practice, however, it is more reasonable to replace $\phi$ with a general function $g$ that does not depend on $F$. This can be done in two steps. First consider $g(x) = x$, so that $S(x, y) = N(x, y)$, which is also the score used in Chan and Zhang (2007). Under the same definition of $M(a_0, a_1)$, the methodology applied to $\phi$ has to be adapted to this case. For example, we have to choose a new $\theta_a$ without $\phi$. A simple guess is to choose $\theta_a$ satisfying $K'(\theta_a) = c/a$. The second step is to extend the approximations derived for $g(x) = x$ to general $g$. We may need to place some regularity assumptions on $g$ to derive the corresponding results, for example that $g$ is continuously differentiable in a neighborhood of the origin.

Another future direction is to consider multi-dimensional time indexes. Chan (2009) provides a good example of studying scan statistics in a Poisson random field. For simplicity, we can first examine the scan statistics in a cube instead of arbitrary indexing sets. The next question is then to establish the existence of overshoot constants, which are still at the heart of the approximations in multi-dimensional cases. The technique applied to marked Poisson processes would have to be adapted to marked Poisson random fields.

The complicated computation of overshoot constants may be an issue of concern in applications. Therefore it is necessary to study the linearity exhibited in Figure 4.3. If the linearity between $\nu_a$ and $\theta_a$ holds, we can simply use $\theta_a$ to derive approximations of $\nu_a$. We can also apply some transformation (e.g. $\tilde\nu_a = e^{\nu_a}$) and reexamine the relation between $\tilde\nu_a$ and $a$. If it shows valid linearity after the transformation, we can then derive another simple approximation of $\nu_a$.

Bibliography

[1] Alm, S.E. (1997). On the distributions of scan statistics of a two dimensional Poisson process. Adv. in Appl. Prob.
29, 1–18.
[2] Chan, H.P. and Lai, T.L. (2006). Maxima of asymptotically Gaussian random fields and moderate deviation approximations to boundary-crossing probabilities of sums of random variables with multidimensional indices. Ann. Probab. 34, 80–121.
[3] Chan, H.P. and Zhang, N.R. (2007). Scan statistics with weighted observations. J. Amer. Statist. Assoc. 102, 595–602.
[4] Chan, H.P. (2009). Maxima of moving sums in a Poisson random field. Adv. Appl. Prob. 41, 647–663.
[5] Chew, D., Choi, K. and Leung, M. (2005). Scoring schemes of palindrome clusters for more sensitive prediction of replication origins in herpesviruses. Nucleic Acids Research 33, e134.
[6] Cousens, S., Smith, P.G., Ward, H., Everington, D., Knight, R.S.G., Zeidler, M., Stewart, G., Smith-Bathgate, E.A.B., Macleod, M.A., Mackenzie, J. and Will, R.G. (2001). Geographical distribution of variant Creutzfeldt-Jakob disease in Great Britain. The Lancet 357, 1002–1007.
[7] Duczmal, L. and Assunção, R. (2004). A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters. Computational Statistics & Data Analysis 45, 269–286.
[8] Glaz, J., Naus, J. and Wallenstein, S. (2001). Scan Statistics. Springer, New York.
[9] Glaz, J., Pozdnyakov, V. and Wallenstein, S. (2009). Scan Statistics: Methods and Applications. Birkhäuser Boston.
[10] Hogan, M. and Siegmund, D.O. (1986). Large deviations for the maxima of some random fields. Adv. Appl. Math. 7, 2–22.
[11] Kabluchko, Z. and Spodarev, E. (2009). Scan statistics of Lévy noises and marked empirical processes. Adv. Appl. Prob. 41, 13–37.
[12] Kulldorff, M. (1997). A spatial scan statistic. Commun. Statist. Theory Meth. 26, 1481–1496.
[13] Kulldorff, M., Huang, L. and Konty, K. (2008). A spatial scan statistic for normally distributed data. Manuscript.
[14] Lifanov, A., Makeev, V., Nazina, A. and Papatsenko, D. (2003). Homotypic regulatory clusters in Drosophila. Genome Research 13, 579–588.
[15] Loader, C. (1991).
Large-deviation approximations to the distributions of scan statistics. Adv. Appl. Prob. 23, 751–771.
[16] Marcos, R.D.L.F. and Marcos, C.D.L.F. (2008). From star complexes to the field: open cluster families. Astrophysical J. 672, 342–351.
[17] Naiman, D.Q. and Priebe, C.E. (2001). Computing scan statistic p-values using importance sampling, with applications to genetics and medical image analysis. J. of Computational & Graphical Statistics 10, 296–328.
[18] Naus, J.I. (1965). Clustering of random points in two dimensions. Biometrika 52, 263–267.
[19] Naus, J.I. (1982). Approximations for distributions of scan statistics. J. Amer. Statist. Assoc. 77, 177–183.
[20] Pickands, J. (1969). Upcrossing probabilities for stationary Gaussian processes. Trans. Amer. Math. Soc. 145, 51–73.
[21] Qualls, C. and Watanabe, H. (1973). Asymptotic properties of Gaussian random fields. Trans. Amer. Math. Soc. 177, 155–171.
[22] Rajewsky, N., Vergassola, M., Gaul, U. and Siggia, E. (2002). Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics 3, e30.
[23] Siegmund, D.O. (1985). Sequential Analysis. Springer, New York.
[24] Turnbull, B., Iwano, E.J., Burnett, W.S., Howe, H.L. and Clark, L.C. (1990). Monitoring for clusters of disease: application to leukemia incidence in upstate New York. Amer. J. Epidemiology 132, 136–143.
[25] Yoshida, M., Naya, Y. and Miyashita, Y. (2003). Anatomical organization of forward fiber projections from area TE to perirhinal neurons representing visual long-term memory in monkeys. Proceedings of the National Academy of Sciences of the United States of America 100, 4257–4262.

Appendix

Related R code

##main Monte Carlo simulation
sim1[...]...