Scan Statistics of Rate Function of Scores
on Poisson Point Processes
Yu Xiaojiang
(B.Sc. USTC)
A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF STATISTICS AND APPLIED PROBABILITY
NATIONAL UNIVERSITY OF SINGAPORE
2011
Acknowledgements
I would like to take this opportunity to thank my supervisor, Assoc. Professor Chan
Hock Peng. His useful comments, suggestions and revisions have led to significant
improvements in the presentation of this thesis. I have benefited a lot from his beautiful
way of thinking.
I would also like to thank my friends Jiang Binyan, Liang Xuehua, Jiang Xiaojun, Zhu
Yongting and Long Yun for their help in completing this thesis.
Finally, I would like to thank Ms Tay Ket Ling, Ms Su Kyi Win and all the other
department staff for their help in my graduate academic career.
Contents
Acknowledgements
Summary
List of Tables
List of Figures
1 Introduction
1.1 Literature Review: Applications
1.2 Literature Review: Probabilistic Technique
1.3 Organization of this Thesis
2 Technical Backgrounds
2.1 Compound Poisson Processes
2.2 The Overshoot Constant of Compound Poisson Processes
3 Theoretical Results
3.1 Theoretical Results and Applications in Chan and Zhang (2007)
3.2 The Theoretical Result in Chan (2009)
4 Examples and Numerical Studies
5 Proof of Theorem 3.1
6 Conclusions
Bibliography
Appendix
Summary
Let X1 , X2 , · · · be independent and identically distributed (i.i.d.) random variables with positive mean. Each random variable Xi is associated with a random
time ti and t1 , t2 , · · · are distributed according to a Poisson point process on [0, 1].
We would like to detect unusual behavior in a segment of the interval [0, 1]. For
each 0 < x < y < 1, we compute a score S(x, y) which is large when there is
unusual behavior in the interval [x, y]. Since x, y are unknown, we consider the
maximum of these scores over 0 < x < y < 1, known in the statistical literature as
scan statistics. We derive formulas for approximating the tail probabilities of these
scan statistics and check them through numerical computations and simulation
exercises.
Keywords: Change of Measure, Large Deviation, Marked Poisson Process, Scan
Statistics.
List of Tables
3.1 Estimation of p ± s.e. with F degenerate at 1.
3.2 Summary of information for the scan statistics of three viral genomes.
4.1 Monte Carlo simulations and analytical p-values.
4.2 Monte Carlo simulations and upper bounds by (4.2).
List of Figures
3.1 Weighted and unweighted scan statistics for 3 viral genomes
4.1 Plots of νa against a when c = 0.01 and c = 0.1.
4.2 Plots of νa against a when c = 1 and c = 10.
4.3 Plots of νa against θa when c = 0.01 and c = 0.1.
4.4 Plots of νa against θa when c = 1 and c = 10.
4.5 The tail probability approximation.
Chapter 1
Introduction
Joseph Naus published his now classical paper on scan statistics in 1965, which
originated the modern work in this field. The study of scan statistics is currently a
very active area and is applied in many different fields such as infectious disease epidemiology, brain imaging, astronomy, geology, neurological diseases, rheumatology,
parasitology, demography, forestry, toxicology, psychology and telecommunication.
Suppose we have a number of points randomly located within a square region.
Naus (1965) originally used a rectangular scanning window with a fixed size and
shape. The window is moved over the predetermined square region to cover all
possible locations. The scan statistic is the maximum number of points captured
by the scanning window at any given time. The next step is to find the probability
of observing at least that many points within some window, under the null hypothesis that those points are generated by a homogeneous Poisson process. The
complexity of this problem lies in the multiple comparisons effect from maximizing
over all possible windows and the overlapping nature of those windows. Using some
powerful mathematics, Naus (1965) developed analytical formulas to obtain upper
and lower bounds for those probabilities.
Following Naus’s pioneering work, there have been a lot of further methodological developments on scan statistics. The study region to be scanned may be of
different shapes. The scanning window may be of different sizes and shapes. As
mentioned above, Naus (1965) used rectangular windows of any fixed shape and
size, while Loader (1991) used rectangular windows of variable sizes. Alm (1997)
used circles, ellipses and triangles. Kulldorff (1997) considered circles of variable
sizes. More recently, nonparametric methods have been used to take into account
windows of irregular shapes [see Duczmal and Assunção (2004)].
Instead of defining the null hypothesis based on a homogeneous Poisson process,
we can also consider the null hypothesis in the context of inhomogeneous Poisson
processes [see Turnbull et al. (1990)]. For example, urban areas, which have a higher
population density, are expected to have more infectious disease cases per geographical
unit than rural areas. Hence inhomogeneous Poisson processes are
commonly encountered in applications. The observations may also be generated by
compound Poisson processes with normal or exponential observations. Kulldorff,
Huang and Konty (2008) developed scan statistics for normal and survival type
data.
1.1 Literature Review: Applications
Scan statistics have been applied to infectious disease epidemiology. In infectious disease surveillance, scan statistics are used to detect geographical areas with
clusters of the disease. The clusters can be either temporary, due to an outbreak,
or long-lasting, if the population in the area is especially prone to infection. Different aspects of the infectious disease also influence the appropriate choice of scan
statistic parameters. The incubation time of the disease, for example, is a very
important factor in the selection of the scanning time window length. Cousens et
al. (2001) investigated 84 cases of variant Creutzfeldt-Jakob disease (vCJD), a rare
and fatal disease caused by the same transmissible agent as in mad cow disease.
The consumption of beef products was therefore considered an important factor. The scan statistic detected a significant cluster with five cases. A subsequent
investigation revealed a local butcher shop to be the possible source of infection.
Scan statistics have also been applied to important problems in brain imaging.
Naiman and Priebe (2001) applied them to study positron emission tomography
(PET) scan brain imagery data. Yoshida, Naya and Miyashita (2003) used them
to analyze neural response data in monkeys. By injecting retrograde tracers in
specific regions (cases) and adjacent regions (controls) in the brain, maps with pixels associated with selective and non-selective neurons were generated. Significant
clusters of selective neurons were found.
Marcos and Marcos (2008) used the two-dimensional scan statistic to study
spatial clustering of ‘open star clusters’, which are physically associated groups
of stars bound together by mutual gravitational attraction. The study regions
were defined by galactic longitude as the first dimension and either radial velocity,
proper motion or inclination as the second dimension, resulting in three different
analyses. A number of statistically significant clusters were found.
For specific applications, the scan statistic parameters and probabilistic models
should be appropriately selected and adapted to fit the data and the scientific
questions asked.
1.2 Literature Review: Probabilistic Technique
Methods for approximating the tail distribution of the maximum of Gaussian
random fields have been developed by Pickands (1969) and Qualls and Watanabe
(1973). There are two key steps. First, the tail probability of crossing a high level
concentrates on a small neighborhood of the subset of the indexing set where the
marginal probability of crossing the level is maximal. The second step is to break
the subset into small pieces which contribute disjointly to the total probability.
Then the probability approximation is derived by adding the contributions of each
small piece. However, the approximation involves constants which cannot be
computed directly. This methodology inspired the following three papers.
Hogan and Siegmund (1986) considered the large deviations for the maxima of
random fields which are closely related to simple one dimensional processes such
as the random walk, Brownian motion and Brownian bridge. They applied the
techniques by Pickands, Qualls and Watanabe and derived explicit asymptotic
approximations for the tail probability of the distribution of the maximum by
introducing the overshoot constants into the formulas.
Chan and Zhang (2007) examined scan statistics for one dimensional marked
Poisson processes. The scan statistics were defined as the maximum weighted
count of event occurrences within a window of fixed width which is moved within
an observed interval. They derived analytical formulas and an importance sampling method for approximating the tail probabilities of scan statistics. They also
illustrated the application of their p-value approximations in computational biology.
Chan (2009) further examined the tail probabilities of moving sums in a marked
Poisson random field. These sums were derived by adding up the weighted occurrences of events within a scanning set of fixed size and shape. He also provided an
alternative presentation of the constants of the asymptotic formulae in terms of the
occupation measure of the conditional local random field at zero, which were further extended to the constants of asymptotic tail probabilities of Gaussian random
fields. These new formulas are useful for deriving bounds of the tail probabilities of
scan statistics.
1.3 Organization of this Thesis
In Chapter 2 we introduce the limiting distribution of the overshoot of a random
walk over a constant boundary. Then we define the overshoot constant related to
compound Poisson processes. Several examples are provided to illustrate the computation. The main theoretical result of this thesis is given by Theorem 3.1 in
Chapter 3, which provides an asymptotic approximation to the tail probability of scan statistics of the rate
function of scores on marked Poisson point processes. A corollary is also presented to
provide a simple upper bound for the tail probability of scan statistics. For comparison, we introduce the theoretical results in Chan and Zhang (2007) and Chan
(2009). Then in Chapter 4 we illustrate applications of the p-value approximation
and upper bound with specific examples and numerical studies. The proof of Theorem 3.1 is given in Chapter 5. Finally, we provide conclusions and discussion in
Chapter 6 and the related R code in the Appendix.
Chapter 2
Technical Backgrounds
The computation of the overshoot constant is important in queuing theory,
risk insurance, engineering systems, sequential testing and change-point detection.
In this section, we first introduce some theory on the limiting distribution of the
overshoot of a random walk over a constant boundary. Then we define the overshoot
constant related to compound Poisson processes. The results are illustrated with
specific examples.
2.1 Compound Poisson Processes
Let Y1 , Y2 , · · · be i.i.d. random variables with mean µ > 0. We say that Y1 is
arithmetic if there exists d > 0 such that P (Y1 ∈ dZ) = 1. The largest d with this
property is the span of Y1 . Otherwise, we say that Y1 is nonarithmetic.
Let $S_n = Y_1 + \cdots + Y_n$ and $S_n^+ = \max(S_n, 0)$. For $b > 0$, define
\[ \tau = \tau(b) = \inf\{n : n \ge 1,\ S_n \ge b\}, \qquad \tau_+ = \inf\{n : n \ge 1,\ S_n > 0\}. \]
Then by renewal theory, Sτ − b has a limiting distribution. More specifically we
have the following results, given in Siegmund (1985) page 171.
Theorem 2.1. Assume $\mu < \infty$. If $Y_1$ is nonarithmetic, then
\[ \lim_{b\to\infty} P(S_\tau - b > x) = (E S_{\tau_+})^{-1} \int_x^\infty P(S_{\tau_+} > y)\,dy. \]
If $Y_1$ is arithmetic with span $d$, then as $b \to \infty$ through multiples of $d$,
\[ \lim_{b\to\infty} P(S_\tau - b = jd) = d\,(E S_{\tau_+})^{-1} P(S_{\tau_+} > jd). \]
In addition, by the theory of ladder variables, we have the following theorem,
given in Siegmund (1985) page 175.
Theorem 2.2. If $Y_1$ is nonarithmetic, then
\[ \lim_{b\to\infty} E\exp[-(S_\tau - b)] = \mu^{-1} \exp\Big( -\sum_{n=1}^{\infty} n^{-1} E e^{-S_n^+} \Big). \]
We illustrate this theorem with the following example.
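Example 2.1. Let $Y_1 \sim N(\mu, 2\mu)$. Then $S_n \sim N(n\mu, 2n\mu)$, and
\begin{align*}
E e^{-S_n^+} &= P(S_n < 0) + \int_0^\infty e^{-x} P(S_n \in dx) \\
&= \Phi\Big(-\sqrt{\frac{n\mu}{2}}\Big) + \int_0^\infty e^{-x} \frac{1}{2\sqrt{\pi n\mu}} \exp\Big(-\frac{(x - n\mu)^2}{4n\mu}\Big)\,dx \\
&= 2\Phi\Big(-\sqrt{\frac{n\mu}{2}}\Big). \tag{2.1}
\end{align*}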
Hence by (2.1),
\[ \lim_{b\to\infty} E\exp[-(S_\tau - b)] = \mu^{-1} \exp\Big( -2\sum_{n=1}^{\infty} n^{-1} \Phi\Big(-\sqrt{\frac{n\mu}{2}}\Big) \Big). \]
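The series in the normal example converges quickly and is easy to evaluate numerically. The following is a minimal sketch, not code from the thesis: the function names, the tolerance and the truncation rule are illustrative assumptions, and the normal distribution function is computed from the complementary error function.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal distribution function, via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def overshoot_constant_normal(mu: float, tol: float = 1e-12,
                              max_terms: int = 10_000_000) -> float:
    """Evaluate mu^{-1} * exp(-2 * sum_{n>=1} Phi(-sqrt(n*mu/2)) / n).

    The sum is truncated once a term drops below `tol` (an assumed stopping
    rule; the terms decrease monotonically in n).
    """
    if mu <= 0.0:
        raise ValueError("mu must be positive")
    total = 0.0
    for n in range(1, max_terms + 1):
        term = normal_cdf(-math.sqrt(n * mu / 2.0)) / n
        total += term
        if term < tol:
            break
    return math.exp(-2.0 * total) / mu


if __name__ == "__main__":
    for mu in (0.05, 0.5, 2.0):
        print(mu, overshoot_constant_normal(mu))
```

The limit lies between 0 and 1 and moves toward 1 as the drift $\mu$ shrinks, since the overshoot over the boundary then becomes small.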
2.2 The Overshoot Constant of Compound Poisson Processes
We will next consider overshoot constants that are at the heart of the approximations in this thesis [see Chapter 3 for the approximations]. The reader can
return to this subsection after looking at the approximations in Chapter 3 and the
proof of Theorem 3.1 in Chapter 5.
Let {N (t), t ≥ 0} be a Poisson process with rate λ > 0. Let {Xi , i = 1, 2, · · · }
be i.i.d. random variables with cumulative distribution function F such that ρ :=
EX1 > 0, and {N (t), t ≥ 0} is independent of {Xi , i = 1, 2, · · · }. We say that
\[ R(t) = \sum_{i=1}^{N(t)} X_i \qquad (R(t) = 0 \text{ if } N(t) = 0) \]
is a compound Poisson process with rate λ and mark distribution F .
Without loss of generality, we restrict t ∈ [0, 1] and randomly choose the n interarrival times u1 , · · · , un according to {N (t), t ∈ [0, 1]}. Hence Xi is associated with
the arrival time ti = u1 + · · · + ui . Then u1 , · · · , un are independent of X1 , · · · , Xn .
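Let $K(\theta) = E e^{\theta X_1}$ and $\psi(\theta) = K(\theta) - 1$. We define the rate function to be
\[ \phi(z) = \begin{cases} \sup_\theta [\theta z - \psi(\theta)] & \text{if } z \ge \rho, \\ 0 & \text{if } z < \rho, \end{cases} \]
and the segmental scores in $[x, y]$ to be
\[ S(x, y) = \lambda(y - x)\,\phi\Big(\frac{N(x, y)}{\lambda(y - x)}\Big), \quad \text{where } N(x, y) = \sum_{i:\, x \le t_i \le y} X_i. \]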
Assume that $K(\theta)$ is finite for some $\theta > 0$. For any $z > \rho$, choose $\theta_z > 0$ to be
the root of the equation
\[ \frac{d}{d\theta}[\theta z - \psi(\theta)] = z - \psi'(\theta) = 0. \]
Then $\phi(z) = \theta_z z - \psi(\theta_z)$, and $\phi'(z) = \theta_z$ if $z > \rho$.
For given $c > 0$ and varying $a \in [0, 1]$, choose $z_a$ to be the root of the equation
$\phi(z_a) = c/a$, and let $\theta_a = \phi'(z_a)$. Then
\[ \phi(z_a) = \theta_a z_a - \psi(\theta_a) = c/a, \qquad \psi'(\theta_a) = K'(\theta_a) = z_a. \tag{2.2} \]
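The rate function and the tilting parameter can be computed numerically for any mark distribution whose moment generating function is available. The sketch below is an illustration only: the function names and the bisection bracket `theta_max` are assumptions, and the caller must supply $K$, its derivative $K'$ and a bracket on which $K'$ is finite and reaches $z$. It solves $K'(\theta_z) = z$ by bisection and returns $\phi(z) = \theta_z z - \psi(\theta_z)$ together with $\theta_z$.

```python
import math
from typing import Callable, Tuple


def solve_increasing(f: Callable[[float], float], target: float,
                     lo: float, hi: float, iters: int = 200) -> float:
    """Bisection for f(x) = target, assuming f is increasing on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def rate_function(z: float,
                  K: Callable[[float], float],
                  K_prime: Callable[[float], float],
                  rho: float,
                  theta_max: float) -> Tuple[float, float]:
    """Return (phi(z), theta_z); phi(z) = 0 and theta_z = 0 when z <= rho.

    theta_max is an assumed upper bracket with K_prime(theta_max) >= z.
    """
    if z <= rho:
        return 0.0, 0.0
    theta = solve_increasing(K_prime, z, 0.0, theta_max)
    return theta * z - (K(theta) - 1.0), theta


if __name__ == "__main__":
    # Marks degenerate at 1: K(theta) = exp(theta), phi(z) = z log z - z + 1.
    print(rate_function(3.0, math.exp, math.exp, 1.0, 50.0))
    # Exponential marks with parameter 1: K(theta) = 1/(1 - theta),
    # phi(z) = z - 2 sqrt(z) + 1.
    print(rate_function(4.0,
                        lambda t: 1.0 / (1.0 - t),
                        lambda t: 1.0 / (1.0 - t) ** 2,
                        1.0, 1.0 - 1e-9))
```

Both test cases have closed forms for $\phi$, which makes them convenient checks of the numerical root.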
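Apply the Taylor expansion,
\begin{align*}
S(x, y) &= \lambda(y - x)\,\phi\Big(\frac{N(x, y)}{\lambda(y - x)}\Big) \\
&= \lambda(y - x)\Big\{ \phi(z_a) + \theta_a\Big[\frac{N(x, y)}{\lambda(y - x)} - z_a\Big] + O\Big(\Big[\frac{N(x, y)}{\lambda(y - x)} - z_a\Big]^2\Big) \Big\} \\
&= \theta_a N(x, y) - \lambda(y - x)\psi(\theta_a) + \lambda(y - x)\, O\Big(\Big[\frac{N(x, y)}{\lambda(y - x)} - z_a\Big]^2\Big).
\end{align*}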
We are interested in the linear approximation part [i.e. θa N (x, y) − λ(y − x)ψ(θa )],
and will therefore transform the compound Poisson process.
Embed $F$ in an exponential family of distributions $\{F_{\theta_a}\}$ with
\[ \frac{F_{\theta_a}(dy)}{F(dy)} = \frac{e^{\theta_a y}}{K(\theta_a)}. \]
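Define a new probability measure $Q_{\lambda_0}$ under which $R(t)$ is a compound Poisson
process with rate $\lambda_0 = K(\theta_a)$ and mark distribution $F_{\theta_a}$. Let $Y_i = \theta_a X_i - u_i\psi(\theta_a)$
and $S_n = Y_1 + \cdots + Y_n$. Then by (2.2),
\[ \mu = E_{Q_{\lambda_0}} Y_1 = \frac{\theta_a K'(\theta_a)}{K(\theta_a)} - \frac{\psi(\theta_a)}{K(\theta_a)} = \frac{c}{a\lambda_0} > 0. \]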
Hence by (2.1), the existence of the overshoot constant
\[ \nu_a = \lim_{b\to\infty} E_{Q_{\lambda_0}} \exp[-(S_\tau - b)] = \mu^{-1} \exp\Big( -\sum_{n=1}^{\infty} n^{-1} E_{Q_{\lambda_0}} e^{-S_n^+} \Big) \tag{2.3} \]
is assured.
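We first give the following example to illustrate the computation of the overshoot
constant for a degenerate compound Poisson process.
Example 2.1. Let $X_1 \equiv 1$ with probability 1. Then $K(\theta) = e^\theta$, $\psi(\theta) = e^\theta - 1$
and $\phi(z) = z\log z - z + 1$. Choose $z_a > 1$ satisfying $\phi(z_a) = c/a$. Then
\[ \theta_a = \phi'(z_a) = \log z_a, \qquad \lambda_0 = K(\theta_a) = e^{\theta_a} = z_a. \]
Under $Q_{\lambda_0}$ we also have $X_1 \equiv 1$ with probability 1, and $\mu = E_{Q_{\lambda_0}} Y_1 = \frac{c}{a z_a}$. Then
the transformed compound Poisson process is
\[ S_n = \theta_a n - t_n(\lambda_0 - 1), \qquad t_n = \sum_{i=1}^{n} u_i \sim \Gamma(\lambda_0, n), \]
where $\Gamma(\lambda_0, n)$ is the gamma distribution. Hence,
\begin{align*}
E_{Q_{\lambda_0}} e^{-S_n^+} &= Q_{\lambda_0}\{S_n \le 0\} + \int_0^\infty e^{-x} Q_{\lambda_0}\{S_n \in dx\} \\
&= 1 - \chi^2_{2n}\Big(\frac{2\lambda_0\theta_a n}{\lambda_0 - 1}\Big) + \int_0^\infty e^{-x} Q_{\lambda_0}\{S_n \in dx\},
\end{align*}
where $\chi^2_{2n}(x) = \int_0^x \frac{1}{2^n (n-1)!} z^{n-1} e^{-z/2}\,dz$. Let $y = \frac{\theta_a n - x}{\lambda_0 - 1}$, $t = 2y$. Then
\begin{align*}
E_{Q_{\lambda_0}} e^{-S_n^+} &= 1 - \chi^2_{2n}\Big(\frac{2\lambda_0\theta_a n}{\lambda_0 - 1}\Big) + \int_0^{\frac{\theta_a n}{\lambda_0 - 1}} e^{-\theta_a n} \frac{\lambda_0^n}{(n-1)!} y^{n-1} e^{-y}\,dy \\
&= 1 - \chi^2_{2n}\Big(\frac{2\lambda_0\theta_a n}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a n} \int_0^{\frac{2\theta_a n}{\lambda_0 - 1}} \frac{1}{2^n (n-1)!} t^{n-1} e^{-t/2}\,dt \\
&= 1 - \chi^2_{2n}\Big(\frac{2\lambda_0\theta_a n}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a n} \chi^2_{2n}\Big(\frac{2\theta_a n}{\lambda_0 - 1}\Big).
\end{align*}
Note that $\lambda_0 = e^{\theta_a} = z_a$. Hence,
\[ E_{Q_{\lambda_0}} e^{-S_n^+} = 1 - \chi^2_{2n}\Big(\frac{2 z_a\theta_a n}{z_a - 1}\Big) + \chi^2_{2n}\Big(\frac{2\theta_a n}{z_a - 1}\Big). \]
Then by (2.3),
\[ \nu_a = \frac{a z_a}{c} \exp\Big\{ -\sum_{n=1}^{\infty} n^{-1} \Big[ 1 - \chi^2_{2n}\Big(\frac{2 z_a\theta_a n}{z_a - 1}\Big) + \chi^2_{2n}\Big(\frac{2\theta_a n}{z_a - 1}\Big) \Big] \Big\}. \tag{2.4} \]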
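Formula (2.4) can be evaluated with elementary tools, because the chi-square distribution function with an even number $2n$ of degrees of freedom equals $1 - e^{-x/2}\sum_{k<n}(x/2)^k/k!$. The sketch below is not the R code of the Appendix; the function names, the bisection solver for $z_a$ and the truncation rule of the series are assumptions made for illustration.

```python
import math


def chi2_cdf_even(x: float, n: int) -> float:
    """Distribution function of chi-square with 2n degrees of freedom at x.

    Uses the Poisson-sum identity 1 - exp(-x/2) * sum_{k<n} (x/2)^k / k!,
    with each term computed in log space to avoid overflow.
    """
    if x <= 0.0:
        return 0.0
    m = 0.5 * x
    log_m = math.log(m)
    upper_tail = sum(math.exp(-m + k * log_m - math.lgamma(k + 1)) for k in range(n))
    return min(1.0, max(0.0, 1.0 - upper_tail))


def solve_z(target: float) -> float:
    """Root z > 1 of z*log(z) - z + 1 = target, for target > 0, by bisection."""
    def phi(z: float) -> float:
        return z * math.log(z) - z + 1.0

    lo, hi = 1.0, 2.0
    while phi(hi) < target:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if phi(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def overshoot_constant_degenerate(c: float, a: float, tol: float = 1e-10,
                                  max_terms: int = 5000) -> float:
    """Evaluate (2.4) for marks degenerate at 1, truncating the series at `tol`."""
    z = solve_z(c / a)
    theta = math.log(z)
    total = 0.0
    for n in range(1, max_terms + 1):
        expectation = (1.0
                       - chi2_cdf_even(2.0 * z * theta * n / (z - 1.0), n)
                       + chi2_cdf_even(2.0 * theta * n / (z - 1.0), n))
        total += expectation / n
        if expectation / n < tol:
            break
    return (a * z / c) * math.exp(-total)


if __name__ == "__main__":
    for a in (0.2, 0.5, 0.8):
        print(a, overshoot_constant_degenerate(1.0, a))
```

For small $c$ the series converges slowly, so `max_terms` may need to be raised; each chi-square evaluation costs $O(n)$ operations in this simple form.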
Formula (2.4) for νa is used in the numerical studies of Chapter 4. We next show
the computation of the overshoot constant for a compound
Poisson process with an exponential mark distribution.
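Example 2.2. Let $X_1$ be an exponential random variable with parameter 1. Then
\[ K(\theta) = \frac{1}{1 - \theta}, \qquad \psi(\theta) = \frac{\theta}{1 - \theta} \]
and $\phi(z) = z - 2\sqrt{z} + 1$. Choose $z_a$ satisfying $\phi(z_a) = c/a$.
Then
\[ \theta_a = \phi'(z_a) = 1 - \frac{1}{\sqrt{z_a}}, \qquad \lambda_0 = K(\theta_a) = \sqrt{z_a}. \]
Under $Q_{\lambda_0}$, $X_1$ is still an exponential random variable with parameter $1 - \theta_a$, and
\[ \mu = E_{Q_{\lambda_0}} Y_1 = \frac{c}{a\lambda_0}. \]
Then the transformed compound Poisson process is
\[ S_n = \theta_a \sum_{i=1}^{n} X_i - t_n(\lambda_0 - 1), \qquad t_n = \sum_{i=1}^{n} u_i \sim \Gamma(\lambda_0, n). \]
Given that $\sum_{i=1}^{n} X_i = k$, $S_n = \theta_a k - t_n(\lambda_0 - 1)$. This reduces the computation to
Example 2.1.
\begin{align*}
E_{Q_{\lambda_0}}\Big[ e^{-S_n^+} \,\Big|\, \sum_{i=1}^{n} X_i = k \Big]
&= Q_{\lambda_0}\Big\{ S_n \le 0 \,\Big|\, \sum_{i=1}^{n} X_i = k \Big\} + \int_0^\infty e^{-x} Q_{\lambda_0}\Big\{ S_n \in dx \,\Big|\, \sum_{i=1}^{n} X_i = k \Big\} \\
&= 1 - \chi^2_{2n}\Big(\frac{2\lambda_0\theta_a k}{\lambda_0 - 1}\Big) + \int_0^\infty e^{-x} Q_{\lambda_0}\Big\{ S_n \in dx \,\Big|\, \sum_{i=1}^{n} X_i = k \Big\}.
\end{align*}
Let $y = \frac{\theta_a k - x}{\lambda_0 - 1}$, $t = 2y$. Then
\begin{align*}
E_{Q_{\lambda_0}}\Big[ e^{-S_n^+} \,\Big|\, \sum_{i=1}^{n} X_i = k \Big]
&= 1 - \chi^2_{2n}\Big(\frac{2\lambda_0\theta_a k}{\lambda_0 - 1}\Big) + \int_0^{\frac{\theta_a k}{\lambda_0 - 1}} e^{-\theta_a k} \frac{\lambda_0^n}{(n-1)!} y^{n-1} e^{-y}\,dy \\
&= 1 - \chi^2_{2n}\Big(\frac{2\lambda_0\theta_a k}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a k} \int_0^{\frac{2\theta_a k}{\lambda_0 - 1}} \frac{1}{2^n (n-1)!} t^{n-1} e^{-t/2}\,dt \\
&= 1 - \chi^2_{2n}\Big(\frac{2\lambda_0\theta_a k}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a k} \chi^2_{2n}\Big(\frac{2\theta_a k}{\lambda_0 - 1}\Big). \tag{2.5}
\end{align*}
Note that $\sum_{i=1}^{n} X_i \sim \Gamma(1, n)$. Hence,
\begin{align*}
E_{Q_{\lambda_0}} e^{-S_n^+} &= E\Big\{ E_{Q_{\lambda_0}}\Big[ e^{-S_n^+} \,\Big|\, \sum_{i=1}^{n} X_i \Big] \Big\} \\
&= \int_0^\infty \Big[ 1 - \chi^2_{2n}\Big(\frac{2\lambda_0\theta_a k}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a k} \chi^2_{2n}\Big(\frac{2\theta_a k}{\lambda_0 - 1}\Big) \Big] \frac{k^{n-1} e^{-k}}{(n-1)!}\,dk.
\end{align*}
Then by (2.3),
\[ \nu_a = \frac{a\lambda_0}{c} \exp\Big\{ -\sum_{n=1}^{\infty} \int_0^\infty \Big[ 1 - \chi^2_{2n}\Big(\frac{2\lambda_0\theta_a k}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a k} \chi^2_{2n}\Big(\frac{2\theta_a k}{\lambda_0 - 1}\Big) \Big] \frac{k^{n-1} e^{-k}}{n!}\,dk \Big\}. \]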
We proceed to provide a general formula of the overshoot constant for a compound Poisson process with arbitrary mark distribution.
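Example 2.3. $X_1$ has arbitrary distribution function $F$ with a positive mean. Let
$F_n$ be the cumulative distribution function of $\sum_{i=1}^{n} X_i$. Choose $z_a$ and $\theta_a$ specifically
satisfying $\phi(z_a) = c/a$ and $\theta_a = \phi'(z_a)$. Then $\lambda_0 = K(\theta_a)$. By (2.5),
\[ E_{Q_{\lambda_0}}\Big[ e^{-S_n^+} \,\Big|\, \sum_{i=1}^{n} X_i = k \Big] = 1 - \chi^2_{2n}\Big(\frac{2\lambda_0\theta_a k}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a k} \chi^2_{2n}\Big(\frac{2\theta_a k}{\lambda_0 - 1}\Big). \]
We first assume that $X_1$ is continuous. Hence,
\begin{align*}
E_{Q_{\lambda_0}} e^{-S_n^+} &= E\Big\{ E_{Q_{\lambda_0}}\Big[ e^{-S_n^+} \,\Big|\, \sum_{i=1}^{n} X_i \Big] \Big\} \\
&= \int_{-\infty}^{\infty} \Big[ 1 - \chi^2_{2n}\Big(\frac{2\lambda_0\theta_a k}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a k} \chi^2_{2n}\Big(\frac{2\theta_a k}{\lambda_0 - 1}\Big) \Big]\,dF_n(k).
\end{align*}
Then by (2.3),
\[ \nu_a = \frac{a\lambda_0}{c} \exp\Big\{ -\sum_{n=1}^{\infty} n^{-1} \int_{-\infty}^{\infty} \Big[ 1 - \chi^2_{2n}\Big(\frac{2\lambda_0\theta_a k}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a k} \chi^2_{2n}\Big(\frac{2\theta_a k}{\lambda_0 - 1}\Big) \Big]\,dF_n(k) \Big\}. \]
If $X_1$ is discrete, we assume without loss of generality that $X_1$ only takes integer values.
Let $f_n$ be the probability mass function of $F_n$. Then,
\[ \nu_a = \frac{a\lambda_0}{c} \exp\Big\{ -\sum_{n=1}^{\infty} n^{-1} \sum_{k=-\infty}^{\infty} \Big[ 1 - \chi^2_{2n}\Big(\frac{2\lambda_0\theta_a k}{\lambda_0 - 1}\Big) + \lambda_0^n e^{-\theta_a k} \chi^2_{2n}\Big(\frac{2\theta_a k}{\lambda_0 - 1}\Big) \Big] f_n(k) \Big\}. \]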
Chapter 3
Theoretical Results
Let λ, c, N (t), Xi , ρ, ui , ti , φ(z), N (x, y) and S(x, y) be defined as in Section
2.2. We are interested in the maximum score captured by a scanning window of
variable length. Specifically, for given $0 < a_0 < a_1 < 1$ and the sliding window
$[x, y]$, define the scan statistic to be
\[ M(a_0, a_1) = \sup_{\substack{0 \le x \le y \le 1 \\ a_0 \le y - x \le a_1}} S(x, y). \]
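For a given realization, the scan statistic can be computed by brute force. The sketch below is an illustration under stated assumptions and is not taken from the thesis: it treats marks degenerate at 1, and it relies on the observation that, for nonnegative marks, the score of a fixed set of captured points decreases as the window widens, so it is enough to consider windows that start and end at observed points, widened to $a_0$ when they are shorter. The function names and the simulation helper are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def phi_unit_marks(z: float) -> float:
    """Rate function for marks degenerate at 1: z log z - z + 1 if z > 1, else 0."""
    return z * math.log(z) - z + 1.0 if z > 1.0 else 0.0


def scan_statistic(times: Sequence[float], marks: Sequence[float],
                   lam: float, a0: float, a1: float) -> float:
    """Maximum of lam*w*phi(N/(lam*w)) over windows of width w in [a0, a1].

    Candidate windows run from one observed point to another; a window shorter
    than a0 is widened to a0 (assumed valid for nonnegative marks).
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    t = [times[i] for i in order]
    x = [marks[i] for i in order]
    best = 0.0
    for i in range(len(t)):
        total = 0.0
        for j in range(i, len(t)):
            total += x[j]
            width = max(t[j] - t[i], a0)
            if width > a1:
                break  # later points only make the window wider
            best = max(best, lam * width * phi_unit_marks(total / (lam * width)))
    return best


def simulate_unit_marked_process(lam: float, rng: random.Random) -> Tuple[List[float], List[float]]:
    """Poisson process on [0, 1] with rate lam, built from exponential interarrival times."""
    times: List[float] = []
    t = rng.expovariate(lam)
    while t <= 1.0:
        times.append(t)
        t += rng.expovariate(lam)
    return times, [1.0] * len(times)


if __name__ == "__main__":
    rng = random.Random(2011)
    times, marks = simulate_unit_marked_process(200.0, rng)
    print(len(times), scan_statistic(times, marks, 200.0, 0.1, 0.9))
```

The double loop costs $O(n^2)$ in the number of points, which is acceptable for Monte Carlo checks of the tail approximation at moderate $\lambda$.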
We write $a_n \sim b_n$ if $\lim_{n\to\infty}(a_n/b_n) = 1$. The main theoretical result of this thesis is the
tail probability approximation below.
Theorem 3.1. As $\lambda \to \infty$,
\[ P_\lambda\{M(a_0, a_1) \ge \lambda c\} \sim (2\pi)^{-1/2} \lambda^{3/2} e^{-\lambda c} c^2 \int_{a_0}^{a_1} \theta_a^{-1} K''(\theta_a)^{-1/2} \nu_a^2\, a^{-5/2} (1 - a)\,da, \]
where $\theta_a$ and $\nu_a$ are defined in Section 2.2.
Through an appropriate scaling transformation, we can view the above limiting
probability as one involving a fixed Poisson rate $\lambda_0 > 0$ and increasing scanning sets.
Let $\lambda/\lambda_0$ be the scaling constant. Then $P_\lambda\{M(a_0, a_1) \ge \lambda c\} = P_{\lambda_0}\{M_0(a_0, a_1) \ge \lambda c\}$, where
\[ M_0(a_0, a_1) = \sup_{\substack{0 \le x \le y \le \lambda/\lambda_0 \\ a_0\lambda/\lambda_0 \le y - x \le a_1\lambda/\lambda_0}} S(x, y). \]
For notational simplicity, we will look at the tail probability in terms of Theorem
3.1. But in practical use, the tail probability in terms of a fixed rate and increasing
scanning sets is sometimes more appropriate.
One limitation of the above theorem is the complicated computation of the overshoot constant. Even when $F$ is degenerate at 1, the overshoot constant $\nu_a$ has a
complicated expression (2.4). For arbitrary $F$ other than the degenerate distribution at 1, the
computation is even more complicated. In such cases, however, we can still provide
a simple upper bound based on Theorem 3.1, since $\nu_a \le 1$.
Corollary 3.1. As $\lambda \to \infty$,
\[ P_\lambda\{M(a_0, a_1) \ge \lambda c\} \le [1 + o(1)]\,(2\pi)^{-1/2} \lambda^{3/2} e^{-\lambda c} c^2 \int_{a_0}^{a_1} \theta_a^{-1} K''(\theta_a)^{-1/2} a^{-5/2} (1 - a)\,da. \]
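The bound of Corollary 3.1 only requires a one-dimensional integral. The sketch below evaluates it for marks degenerate at 1, where $K$ and all its derivatives equal $e^\theta$, so the factor involving $K$ reduces to $z_a^{-1/2}$. It is an illustration, not the thesis's own code: the function names, the bisection solver and the use of composite Simpson quadrature are assumptions.

```python
import math


def solve_z(target: float) -> float:
    """Root z > 1 of z*log(z) - z + 1 = target, for target > 0, by bisection."""
    def phi(z: float) -> float:
        return z * math.log(z) - z + 1.0

    lo, hi = 1.0, 2.0
    while phi(hi) < target:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if phi(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def integrand(a: float, c: float) -> float:
    """theta_a^{-1} * z_a^{-1/2} * a^{-5/2} * (1 - a) for marks degenerate at 1."""
    z = solve_z(c / a)
    theta = math.log(z)
    return (1.0 - a) / (theta * math.sqrt(z) * a ** 2.5)


def upper_bound(lam: float, c: float, a0: float, a1: float, intervals: int = 400) -> float:
    """Leading term of the Corollary 3.1 bound, with composite Simpson quadrature."""
    if intervals % 2:
        intervals += 1  # Simpson's rule needs an even number of intervals
    h = (a1 - a0) / intervals
    total = integrand(a0, c) + integrand(a1, c)
    for i in range(1, intervals):
        total += (4.0 if i % 2 else 2.0) * integrand(a0 + i * h, c)
    integral = total * h / 3.0
    return (2.0 * math.pi) ** -0.5 * lam ** 1.5 * math.exp(-lam * c) * c ** 2 * integral


if __name__ == "__main__":
    print(upper_bound(lam=200.0, c=0.1, a0=0.1, a1=0.9))
```

Because the result is only the leading term of an asymptotic bound, it can exceed 1 when $\lambda c$ is small and is informative only in the tail.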
It is interesting to compare our theoretical results with Chan and Zhang (2007)
and Chan (2009), which both examine scan statistics within windows of fixed sizes.
3.1 Theoretical Results and Applications in Chan and Zhang (2007)
Scan statistics have been recently applied in the computational analysis of DNA
and protein sequences. To locate genes related to specific biological processes,
Lifanov et al. (2003) scanned DNA sequences for clusters of transcription factor
binding sites. They applied position weight matrices to score words for similarity
to a given transcription factor pattern, and determined locations of occurrence of
the pattern by a cut-off value for the word score. Rajewsky et al. (2002) studied a
similar problem except that they used the total score of all words rather than the
number of words in a window exceeding the cut-off to compute the scan statistics.
Chan and Zhang (2007) provided p-value approximations for scan statistics of
marked Poisson processes. These approximations can be applied to general scoring
schemes used in computational biology. An important feature of the formula is an
overshoot correction term that is equal to 1 in the special case of 0-1 processes.
Let N (t) be the Poisson process as defined in Section 2.2. Let r > 0 be the
length of the interval. We restrict t ∈ (0, r]. Let Xi , ρ, ti , N (x, y), K(θ) and
ψ(θ) be defined as in Section 2.2. In particular, the score in the window (t, t + δ]
is N (t, t + δ), where δ ∈ (0, r) is a pre-determined width of the window. Further
define the fixed window-size scan statistic to be
\[ M_{r,\delta} = \sup_{0 \le t \le r - \delta} N(t, t + \delta). \]
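Assume that $K(\theta)$ is finite for some $\theta > 0$. Given $c > \lambda\rho$, choose $\tilde\theta_c > 0$ and
distribution $F_{\tilde\theta_c}$ to satisfy
\[ K'(\tilde\theta_c) = c/\lambda, \qquad \frac{F_{\tilde\theta_c}(dx)}{F(dx)} = \frac{e^{\tilde\theta_c x}}{K(\tilde\theta_c)}. \tag{3.1} \]
Define the large deviation rate function to be
\[ I_c = \tilde\theta_c c - \lambda\psi(\tilde\theta_c). \]
To derive the overshoot constant, consider $\tilde Y_1, \tilde Y_2, \cdots$ to be i.i.d. random variables satisfying
\[ P(\tilde Y_1 \in dy) = \frac{K(\tilde\theta_c)}{1 + K(\tilde\theta_c)} F_{\tilde\theta_c}(dy) + \frac{1}{1 + K(\tilde\theta_c)} \bar F(dy), \tag{3.2} \]
where $\bar F$ denotes the cumulative distribution function of $-X_1$. Let $\tilde S_n = \tilde Y_1 + \cdots + \tilde Y_n$
and $\tilde\tau_b = \inf\{n \ge 1 : \tilde S_n \ge b\}$. Define the overshoot constant to be
\[ \tilde\nu_c = \lim_{b\to\infty} E[e^{-\tilde\theta_c(\tilde S_{\tilde\tau_b} - b)}]. \tag{3.3} \]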
Chan and Zhang (2007) provided the following tail probability approximation
of Mr,δ .
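Theorem 3.2. Let $\lambda$ and $c > \lambda\rho$ be fixed. Let $\delta \to \infty$ as $r \to \infty$ such that
$r - \delta \to \infty$. Then,
\[ P\{M_{r,\delta} \ge \delta c\} \sim 1 - \exp\Big( -\frac{(r - \delta)\,\tilde\nu_c\, e^{-\delta I_c} (c - \lambda\rho)}{\sqrt{2\pi\lambda\delta K''(\tilde\theta_c)}} \Big). \]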
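When $F$ is degenerate at 1, $K(\theta) = e^\theta$ and $\tilde\theta_c = \log(c/\lambda)$. Further $I_c =
c\log(c/\lambda) - c + \lambda$. Note that the overshoot constant $\tilde\nu_c = 1$ in this degenerate
case. Hence Theorem 3.2 reduces to the following corollary.
Corollary 3.2. Let $\lambda$ and $c > \lambda\rho$ be fixed. Let $\delta \to \infty$ as $r \to \infty$ such that
$r - \delta \to \infty$. Then,
\[ P\Big\{ \sup_{0 \le t \le r - \delta} [N(t + \delta) - N(t)] \ge \delta c \Big\} \sim 1 - \exp\Big( -\frac{(r - \delta)\, e^{\delta(c - \lambda)} (\lambda/c)^{\delta c} (c - \lambda)}{\sqrt{2\pi\delta c}} \Big). \]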
Chan and Zhang (2007) applied the above formulas to study the palindromes in
DNA sequences.
Example 3.1. High concentration of palindromic patterns (PLP) is associated
with origins of replication of viruses. Four letters A, T, C, G are used to denote
the DNA alphabet with A-T and C-G being complementary base pairs (bp) on
opposite strands of the DNA helix. Thus the complementary DNA sequence of
AGATCT is TCTAGA. A DNA sequence is a PLP if its complement reads the
same as itself backwards (e.g. AGATCT). Let the length of a PLP be the number
of complementary pairs that it contains (i.e. the length of AGATCT is 3).
Let PLP* be a PLP with length of at least 5 bp that is not nested inside another PLP. Model the occurrence of PLP* in the Human cytomegalovirus (HCMV)
genome as a Poisson process [see Leung, Schachtel and Yu (1994)]. A total of
N (r) = 296 PLP* are observed in the genome with length r = 229,354 bp. Thus
the rate of the Poisson process is estimated to be $\hat\lambda = N(r)/r = 0.00129$. Note that F
is degenerate at 1 for this example. Chan and Zhang (2007) applied Corollary 3.2
to compute the p-value approximations for the scan statistic for fixed window size
δ = 1000 bp.
Table 3.1: Estimation of p ± s.e. with F degenerate at 1.
δc    Direct Monte Carlo      Chan and Zhang (2007)    Naus (1982)
9     (1.5 ± 0.3) × 10^−2     1.32 × 10^−2             1.32 × 10^−2
10    (1 ± 1) × 10^−3         1.95 × 10^−3             1.93 × 10^−3
11    0                       2.53 × 10^−4             2.53 × 10^−4
Naus (1982) provided a more complicated p-value approximation which works
only for the degenerate case. It appears from Table 3.1, which is reproduced from
Chan and Zhang (2007), that the p-value approximations by Chan and Zhang
(2007) agree well with both direct Monte Carlo estimates and the corresponding
results in Naus (1982).
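The Chan and Zhang (2007) column of Table 3.1 can be checked directly from Corollary 3.2 with the HCMV figures of Example 3.1. The sketch below is a hedged illustration: the function and argument names are assumptions, and it uses the unrounded estimate $\hat\lambda = N(r)/r$, so it agrees with the tabulated values only to within about one percent.

```python
import math


def fixed_window_pvalue(count_threshold: int, n_events: int,
                        genome_length: float, window: float) -> float:
    """Corollary 3.2 approximation to P{max count in a window >= count_threshold}.

    count_threshold plays the role of delta*c, n_events/genome_length estimates
    lambda, and window is delta.
    """
    lam = n_events / genome_length
    c = count_threshold / window
    if c <= lam:
        raise ValueError("the approximation requires c > lambda")
    # log of exp(delta*(c - lambda)) * (lambda/c)^(delta*c)
    log_term = window * (c - lam) + count_threshold * math.log(lam / c)
    rate = ((genome_length - window) * math.exp(log_term) * (c - lam)
            / math.sqrt(2.0 * math.pi * count_threshold))
    return -math.expm1(-rate)  # 1 - exp(-rate), computed stably


if __name__ == "__main__":
    for k in (9, 10, 11):
        print(k, fixed_window_pvalue(k, 296, 229354, 1000))
```

Working on the log scale for the factor $e^{\delta(c-\lambda)}(\lambda/c)^{\delta c}$ avoids underflow for larger thresholds.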
Example 3.2. Instead of giving equal score to each PLP* as in Example 3.1
(i.e. Xi = 1), assign now a score of Xi = pi − 4 to the ith PLP* with a length
of pi . In this sense, we say that the scan statistics are unweighted in Example 3.1
and weighted in this example. Then define the location of the ith PLP* (i.e. ti )
to be the location of its left center. The rate of the Poisson process is estimated
by $\hat\lambda = N(r)/r$. Consider here $F$ to be geometric. Estimate its mean by
\[ \rho = (1 - 2\hat\gamma_A\hat\gamma_T - 2\hat\gamma_G\hat\gamma_C)^{-1}, \]
where $(\hat\gamma_A, \hat\gamma_T, \hat\gamma_G, \hat\gamma_C)$ are the empirical probabilities of the
four bases in the genome.
We shall now compute the overshoot constant $\tilde\nu_c$ for the geometric distribution.
Let $\tilde\tau_+ = \inf\{n \ge 1 : \tilde S_n > 0\}$. Then by Theorem 2.1, as $b \to \infty$ through $\mathbb{Z}$,
\[ \lim_{b\to\infty} P\{\tilde S_{\tilde\tau_b} - b = j\} = (E\tilde S_{\tilde\tau_+})^{-1} P\{\tilde S_{\tilde\tau_+} > j\}, \quad j = 0, 1, \cdots \tag{3.4} \]
When $F$ is geometric, $F_{\tilde\theta_c}$ is also geometric by (3.1). Further by (3.2) and the memoryless property of the geometric distribution, $\tilde S_{\tilde\tau_+}$ is geometric with distribution
$F_{\tilde\theta_c}$. Hence by (3.3) and (3.4), Chan and Zhang (2007) showed that
\[ \tilde\nu_c = \rho\,[1 - (1 - \rho^{-1}) e^{\tilde\theta_c}]. \]
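The geometric overshoot constant is simple to compute once $\tilde\theta_c$ is known. The following sketch is an illustration under an explicit assumption that is not spelled out in the text: the geometric distribution is taken to live on $\{1, 2, \cdots\}$ with success probability $1/\rho$, so that $K(\theta) = p e^\theta / (1 - (1 - p)e^\theta)$ with $p = 1/\rho$. The function names are hypothetical, and $\tilde\theta_c$ is found by bisection from $K'(\tilde\theta_c) = c/\lambda$ as in (3.1).

```python
import math


def geometric_mean_from_bases(g_a: float, g_t: float, g_g: float, g_c: float) -> float:
    """Estimated mean rho = (1 - 2*gA*gT - 2*gG*gC)^{-1} of the palindrome score."""
    return 1.0 / (1.0 - 2.0 * g_a * g_t - 2.0 * g_g * g_c)


def tilt_parameter(rho: float, c_over_lambda: float) -> float:
    """Solve K'(theta) = c/lambda for a geometric law on {1, 2, ...} with mean rho > 1."""
    if rho <= 1.0 or c_over_lambda <= rho:
        raise ValueError("need rho > 1 and c/lambda > rho")
    p = 1.0 / rho
    q = 1.0 - p

    def k_prime(theta: float) -> float:
        e = math.exp(theta)
        return p * e / (1.0 - q * e) ** 2

    lo, hi = 0.0, -math.log(q) * (1.0 - 1e-12)  # K is finite only for q*exp(theta) < 1
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if k_prime(mid) < c_over_lambda:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def overshoot_constant_geometric(rho: float, c_over_lambda: float) -> float:
    """nu = rho * [1 - (1 - 1/rho) * exp(theta)] with theta from tilt_parameter."""
    theta = tilt_parameter(rho, c_over_lambda)
    return rho * (1.0 - (1.0 - 1.0 / rho) * math.exp(theta))


if __name__ == "__main__":
    rho = geometric_mean_from_bases(0.13, 0.37, 0.38, 0.13)  # CeHV1 base frequencies
    print(rho, overshoot_constant_geometric(rho, 2.0 * rho))
```

As $c/\lambda$ decreases to $\rho$ the tilt vanishes and the constant tends to 1, which is a quick sanity check on the formula.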
Chew, Choi and Leung (2005) studied clustering of PLP* but used a score
Xi = pi (or equivalently Xi = pi /5) together with a shifted geometric distribution
for Xi . Chan and Zhang (2007) studied both the unweighted and weighted scan
statistics and provided p-value approximations.
Table 3.2: Summary of information for the scan statistics of three viral genomes.
Virus    (γ̂A, γ̂T, γ̂G, γ̂C)          r          N(r)    δ
CeHV1    (0.13, 0.37, 0.38, 0.13)    156 789    580     800
BoHV1    (0.14, 0.36, 0.37, 0.14)    135 301    615     700
BoHV5    (0.12, 0.37, 0.38, 0.13)    138 390    714     700

         Unweighted                  F geometric
Virus    Mr,δ    p-value             Mr,δ    p-value
CeHV1    18      7.23 × 10^−6        116     0
BoHV1    17      1.09 × 10^−4        32      6.08 × 10^−5
BoHV5    15      1.07 × 10^−2        33      1.74 × 10^−4
For Table 3.2, Chew, Choi and Leung (2005) provided the empirical probabilities
of the four bases (ˆ
γA , γˆT , γˆG , γˆC ), the length of the genome r and the number of
observed PLP* N (r) for three viruses: Cercopithecine herpesvirus 1 (CeHV1),
Bovine herpesvirus 1 (BoHV1) and Bovine herpesvirus 5 (BoHV5). The window
size δ is equal to 0.5% of the genome length, rounded off to the nearest 100 bases.
Chan and Zhang (2007) provided the unweighted and weighted scan statistics and
p-value approximations in Table 3.2.
Figure 3.1, taken from Chan and Zhang (2007), plots the computed scan
statistics against genome location for the three viruses. Experimentally validated
origins of replication for these viruses are also shown in the figure. To avoid an excessive number of false positives when handling a large number of genomes, they
applied a conservative p-value cutoff of 0.001 and used Theorem 3.2 to determine
the threshold levels corresponding to this p-value. Figure 3.1 shows that a length-based
weighting scheme improves the power for both CeHV1 and BoHV1. For
BoHV5, significant clusters of palindromes are detected in the neighborhood of the
replication origins. However, there are also many false positives for this genome.
Figure 3.1: Comparison of weighted and unweighted scan statistics for 3 viral
genomes. For all plots, horizontal axis denotes location in genome. The top plots
show the locations and length of palindromes longer than 4. The middle plots
show the unweighted scan statistic δ −1 [N (t + δ/2) − N (t − δ/2)] against t. The
bottom plots show the weighted scan statistic δ −1 [SN (t+δ/2) − SN (t−δ/2) ] against t.
Triangles at the top of the plots denote experimentally validated replication origins.
Thresholds for p-value of 0.001 are indicated by dashed horizontal lines.
In practice, however, we may not have much prior information on the length
of the signal. Thus it is difficult to determine a fixed window size in advance.
Moreover in application, if the length of the signal fluctuates, it is not appropriate
to use a fixed-size window to detect the signal. Thus a useful extension is to allow
the window size to be variable. This is the case in this thesis. Our window (i.e.
[x, y]) has a variable length (i.e. a0 ≤ y − x ≤ a1 ).
3.2 The Theoretical Result in Chan (2009)
Chan (2009) studied the tail probabilities of the maxima of moving sums with a
wide choice of scanning sets. He first developed a theory parallel to the study of tail
probabilities in Gaussian or Gaussian-like random fields in the classical framework
of Pickands (1969) and Qualls and Watanabe (1973) [see also Piterbarg (1996) and
Chan and Lai (2006)]. Motivated by recent developments in molecular biology [see
Examples 3.1 and 3.2], he considered a more general marked Poisson random field.
Consider here $\{\check t_i, i \ge 1\}$ to be a homogeneous Poisson point process on $\mathbb{R}^m$
with intensity $\lambda > 0$ and let $X_1, X_2, \cdots$ be i.i.d. random variables with cumulative
distribution function $F$ and independent of the Poisson point process. Let $\rho$ and
$K(\theta)$ be defined as in Section 2.2. To consider the scan statistic, we first define some
notation.
Let $\sigma_m(\cdot)$ be the volume of a set in $\mathbb{R}^m$. For any $A \subset \mathbb{R}^m$, vector $q \in \mathbb{R}^m$ and real
number $\eta$, let $q + \eta A = \{q + \eta\alpha : \alpha \in A\}$.
For any $B \subset \mathbb{R}^m$, define the score $\check S(B) = \sum_{\check t_i \in B} X_i$. Let $D$ be a bounded
subset of $\mathbb{R}^m$. Define the scan statistic to be
\[ M_{D,B} = \sup_{v \in D} \check S(v + B). \]
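Assume that $\Theta = \{\theta : K(\theta) < \infty\}$ is an open neighborhood of 0. For $c >
\rho\sigma_m(B)$, choose $\check\theta_c > 0$ and distribution $F_{\check\theta_c}$ to satisfy
\[ K'(\check\theta_c) = c/\sigma_m(B) \quad \text{and} \quad \frac{F_{\check\theta_c}(dx)}{F(dx)} = \frac{e^{\check\theta_c x}}{K(\check\theta_c)}. \]
Define the large deviation rate function to be
\[ I_{\check\theta_c} = \check\theta_c c - \sigma_m(B)[K(\check\theta_c) - 1]. \]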
Chan (2009) provided the following tail probability approximation of MD,B .
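Theorem 3.3. Let $B$ be convex and bounded. Define $x_\lambda = \check\theta_c(\lambda c - d\lceil \lambda c/d \rceil)$ if $F$
is arithmetic with span $d$ and $x_\lambda = 0$ if $F$ is nonarithmetic. Then as $\lambda \to \infty$,
\[ P_\lambda\{M_{D,B} \ge \lambda c\} \sim [2\pi\sigma_m(B) K''(\check\theta_c)]^{-1/2} e^{-\lambda I_{\check\theta_c} + x_\lambda} \lambda^{m - 1/2} \sigma_m(D)\,\omega_B \]
for some positive and finite constant $\omega_B$.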
The constant ωB above is specified in Chan (2009). When B is rectangular, ωB
has an explicit expression in terms of the overshoot constant.
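Example 3.3. Let $B = \prod_{j=1}^{m} [0, \beta_j]$ with $\beta_j > 0$ for all $j$. Consider $\check Y_1, \check Y_2, \cdots$ to
be i.i.d. random variables satisfying
\[ P(\check Y_1 \in dx) = [\bar F(dx) + K(\check\theta_c) F_{\check\theta_c}(dx)] / [1 + K(\check\theta_c)], \]
where $\bar F$ denotes the cumulative distribution function of $-X_1$. Let $\check S_n = \check Y_1 + \cdots + \check Y_n$
and $\check\tau_b = \inf\{n \ge 1 : \check S_n \ge b\}$. Define the overshoot constant
\[ \check\nu_c = \lim_{b\to\infty} E\, e^{-\check\theta_c(\check S_{\check\tau_b} - b)}. \]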
Chan (2009) showed that ωB has the following expression:
m
−1
ωB = νˇc [cσm
(B) − ρ]
m
χc
m−1
βj
,
j=1
where $\chi_c = \check\theta_c$ for $F$ nonarithmetic and $\chi_c = d^{-1}(1 - e^{-d\check\theta_c})$ for $F$ arithmetic with
span $d$.
Although the window B in Chan (2009) can take arbitrary shape, it still has
fixed size. For similar reasons, a useful extension is to consider variable window
sizes.
Chapter 4
Examples and Numerical Studies
Example 4.1. Consider again $X_1$ in Example 2.1. Then $K(\theta) = e^{\theta}$ and $\phi(z) =
z\log z - z + 1$. Choose $z_a$ satisfying $\phi(z_a) = c/a$. Then $\theta_a = \phi'(z_a) = \log z_a$. By
Theorem 3.1, as $\lambda \to \infty$,
\[ P_\lambda\{M(a_0, a_1) \ge \lambda c\} \sim (2\pi)^{-1/2}\lambda^{3/2}e^{-\lambda c}c^2 \int_{a_0}^{a_1} \theta_a^{-1}e^{-\theta_a/2}\nu_a^2 a^{-5/2}(1 - a)\,da, \tag{4.1} \]
where $\nu_a$ is given by (2.4),
\[ \nu_a = \frac{a z_a}{c}\exp\Big\{-\sum_{n=1}^{\infty} n^{-1}\Big[1 - \chi^2_{2n}\Big(\frac{2z_a\theta_a n}{z_a - 1}\Big) + \chi^2_{2n}\Big(\frac{2\theta_a n}{z_a - 1}\Big)\Big]\Big\}. \]
To get a feel for this formula, we first plot $\nu_a$ against $a$.
Figure 4.1: Plots of $\nu_a$ against $a$ when $c = 0.01$ and $c = 0.1$.
[Two panels; horizontal axis $a$, vertical axis $\nu_a$.]
Figure 4.2: Plots of $\nu_a$ against $a$ when $c = 1$ and $c = 10$.
[Two panels; horizontal axis $a$, vertical axis $\nu_a$.]
All the above figures are based on $a_0 = 0.1$, $a_1 = 0.9$. For fixed $c$, $a_0$ and $a_1$,
$\nu_a$ appears to be an increasing and concave function of $a$. On the
other hand, as $c$ increases the values of $\nu_a$ decrease markedly and the curves flatten
slightly. Further experiments also show that the choices of $a_0$ and $a_1$
mainly influence the scale rather than the shape of the plots.
Figure 4.3: Plots of νa against θa when c = 0.01 and c = 0.1.
Figure 4.4: Plots of νa against θa when c = 1 and c = 10.
There appears to be a linear relationship between $\nu_a$ and $\theta_a$. We therefore provide the corresponding linear regression equation in each plot. The $R^2$ of the
regression decreases only slightly (from 1 to 0.9952) as $c$ increases (from 0.01 to 10),
which indicates the strong explanatory power of the linear regression equations. Since
the computation of $\nu_a$ is complicated, we may simply estimate $\nu_a$ using $\theta_a$. However, the regression coefficients also change significantly as $c$ changes. Therefore
the linearity may be spurious and may fail in other cases.
We then provide the following plot to illustrate the tail probability approximation given by (4.1).
Figure 4.5: The tail probability approximation.
[Horizontal axis: $100 \times c$; vertical axis: $p$.]
The plot is based on $\lambda = 100$. The p-values decay sharply at the beginning of the
range (for $100c$ from 4 to 6) and the curve is not smooth there, which is unfavorable for numerical studies. Beyond $100c = 8$ the p-values approach 0 slowly and smoothly. To check
these approximations, we provide the following comparison between Monte Carlo
simulations and the analytical results in (4.1).
Table 4.1: Monte Carlo simulations and analytical p-values.

λc    Direct Monte Carlo       Analytical estimate in (4.1)
 8    (8.4 ± 0.8) × 10^-3      8.00 × 10^-3
 9    (3.5 ± 0.3) × 10^-3      3.36 × 10^-3
10    (1.4 ± 0.1) × 10^-3      1.44 × 10^-3
11    0                        5.86 × 10^-4
The results in the second column of Table 4.1 are based on $10^4$ paths of Monte
Carlo simulations with $\lambda = 100$, $a_0 = 0.1$ and $a_1 = 0.9$. The related R code is
provided in the Appendix. Each simulation result takes several days of running the R code,
which is extremely time consuming compared with the several minutes needed
to derive the analytical p-values. For even smaller probabilities, even more time is needed to reach the desired accuracy (i.e. a standard error
of 10% of the simulated probability). In this comparison, the analytical p-values given by
(4.1) appear to agree well with the simulations.
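The analytical side of this comparison can be sketched in a few lines of Python (the thesis's own code is in R and is not reproduced here). The sketch assumes that $\chi^2_{2n}(\cdot)$ in the formula for $\nu_a$ is the chi-square distribution function with $2n$ degrees of freedom and evaluates it through the identity $\chi^2_{2n}(2x) = P(\text{Poisson}(x) \ge n)$; the function names, tolerances and the Simpson grid are our own choices. For $\lambda = 100$, $c = 0.08$, $a_0 = 0.1$, $a_1 = 0.9$ the output should land near the value $8.00 \times 10^{-3}$ reported in Table 4.1 if this reading of (2.4) is right.

```python
import math

def solve_z(target):
    """Solve phi(z) = z*log(z) - z + 1 = target for z > 1 by bisection."""
    phi = lambda z: z * math.log(z) - z + 1.0
    lo, hi = 1.0, 2.0
    while phi(hi) < target:              # widen the bracket until it holds the root
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if phi(mid) < target else (lo, mid)
    return 0.5 * (lo + hi)

def poisson_ge(n, x):
    """P(Poisson(x) >= n) for 0 < x < n, summed upward from k = n."""
    term = math.exp(-x + n * math.log(x) - math.lgamma(n + 1))
    total, k = 0.0, n
    while term > 1e-17 * total:
        total += term
        k += 1
        term *= x / k
    return total

def poisson_lt(n, x):
    """P(Poisson(x) <= n - 1) for x > n - 1, summed downward from k = n - 1."""
    k = n - 1
    term = math.exp(-x + k * math.log(x) - math.lgamma(k + 1))
    total = 0.0
    while k >= 0 and term > 1e-17 * total:
        total += term
        term *= k / x
        k -= 1
    return total

def nu(a, c, tol=1e-10, max_terms=200000):
    """Overshoot constant nu_a of Example 4.1 (assumed reading of (2.4))."""
    z = solve_z(c / a)
    theta = math.log(z)
    r = theta / (z - 1.0)                # chi2_{2n}(2*r*n) = P(Poisson(r*n) >= n)
    s = 0.0
    for n in range(1, max_terms + 1):
        term = (poisson_lt(n, z * r * n) + poisson_ge(n, r * n)) / n
        s += term
        if term < tol:                   # terms decay geometrically, so stop here
            break
    return (a * z / c) * math.exp(-s)

def tail_approx(lam, c, a0, a1, intervals=40):
    """Right-hand side of (4.1), integrated with the composite Simpson rule."""
    def integrand(a):
        theta = math.log(solve_z(c / a))
        return math.exp(-theta / 2) * nu(a, c) ** 2 * a ** -2.5 * (1 - a) / theta
    h = (a1 - a0) / intervals
    total = integrand(a0) + integrand(a1)
    for i in range(1, intervals):
        total += (4 if i % 2 else 2) * integrand(a0 + i * h)
    return (2 * math.pi) ** -0.5 * lam ** 1.5 * math.exp(-lam * c) * c ** 2 * total * h / 3

print(tail_approx(100, 0.08, 0.1, 0.9))
```

The whole evaluation takes seconds rather than days, which is the practical point made above; the remaining rows of Table 4.1 correspond to `c = 0.09, 0.10, 0.11`.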
We next examine the accuracy of the upper bound given by Corollary 3.1.
Example 4.2. Let $X_1$ be an exponential random variable with parameter 1. Then
$K(\theta) = \frac{1}{1-\theta}$ and $\phi(z) = z - 2\sqrt{z} + 1$. Choose $z_a$ satisfying $\phi(z_a) = c/a$. Then
$\theta_a = \phi'(z_a) = 1 - \frac{1}{\sqrt{z_a}}$.
By Corollary 3.1, as $\lambda \to \infty$,
\[ P_\lambda\{M(a_0, a_1) \ge \lambda c\} \le [1 + o(1)]\,2^{-1}\pi^{-1/2}\lambda^{3/2}e^{-\lambda c}c^2 \times \int_{a_0}^{a_1} \theta_a^{-1}(1 - \theta_a)^{3/2}a^{-5/2}(1 - a)\,da. \tag{4.2} \]
We provide the following comparison between Monte Carlo simulation results and
the upper bounds given by (4.2).
Table 4.2: Monte Carlo simulations and upper bounds by (4.2).

λc    Direct Monte Carlo       Upper bound by (4.2)
 7    (1.2 ± 0.1) × 10^-2      2.48 × 10^-2
 8    (4.9 ± 0.4) × 10^-3      1.10 × 10^-2
 9    (1.9 ± 0.2) × 10^-3      4.75 × 10^-3
10    0                        2.00 × 10^-3
The simulation results in the second column of Table 4.2 are based on $\lambda = 100$,
$a_0 = 0.1$ and $a_1 = 0.9$. The upper bounds given by (4.2) appear to be reasonable: taking
the first row of Table 4.2 as an example, the simulated p-value and the upper bound
are of the same order of magnitude. On the other hand, in every row with a nonzero simulated
value the upper bound exceeds it by a factor of roughly 2 to 2.5, so the upper bounds do not
appear to converge to the tail probabilities as $\lambda \to \infty$. This indicates again the importance
of the overshoot constant.
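Because (4.2) involves no overshoot constant, it reduces to a one-dimensional integral that is easy to evaluate. The following Python sketch (our own illustration, not the thesis's R code) uses $\phi(z_a) = (\sqrt{z_a} - 1)^2 = c/a$ to get $\theta_a$ in closed form and integrates with a composite Simpson rule; the function name and grid size are assumptions. The printed values should be close to the last column of Table 4.2.

```python
import math

def upper_bound(lam, c, a0=0.1, a1=0.9, intervals=2000):
    """Right-hand side of (4.2) without the [1 + o(1)] factor."""
    def integrand(a):
        s = math.sqrt(c / a)             # s = sqrt(z_a) - 1, since phi(z_a) = c/a
        theta = s / (1.0 + s)            # theta_a = 1 - 1/sqrt(z_a)
        return (1.0 - theta) ** 1.5 * a ** -2.5 * (1.0 - a) / theta
    h = (a1 - a0) / intervals
    total = integrand(a0) + integrand(a1)
    for i in range(1, intervals):
        total += (4 if i % 2 else 2) * integrand(a0 + i * h)
    integral = total * h / 3             # composite Simpson rule
    return 0.5 * math.pi ** -0.5 * lam ** 1.5 * math.exp(-lam * c) * c ** 2 * integral

for lam_c in (7, 8, 9, 10):
    print(lam_c, upper_bound(100, lam_c / 100))
```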
Chapter 5
Proof of Theorem 3.1
We first examine the local behavior of segmental scores in a window by using
a change of measure approach and then combine all these windows together to
provide the tail probability of $M(a_0, a_1)$. Specifically we define the basic window
\[ W = \{(x, y) : x = x_0 - v_1\lambda^{-1},\ y = y_0 + v_2\lambda^{-1},\ 0 \le v_1, v_2 \le m\}, \]
where $m \to \infty$ as $\lambda \to \infty$ and $m = o(\lambda)$. Choose $x_0 \le y_0$ to be multiples of $m\lambda^{-1}$
in $[0, 1]$ satisfying $a_0 \le a = y_0 - x_0 \le a_1$.
Consider the tail probability in $W$, split according to the score at the starting point $(x_0, y_0)$:
\[ P_\lambda\Big\{\sup_{(x,y)\in W} S(x, y) \ge \lambda c\Big\} = P_\lambda\{S(x_0, y_0) \ge \lambda c\} + P_\lambda\Big\{S(x_0, y_0) < \lambda c,\ \sup_{(x,y)\in W} S(x, y) \ge \lambda c\Big\}. \tag{5.1} \]
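To evaluate $S(x, y)$, apply the Taylor expansion. By (2.2),
\[ \begin{aligned}
S(x, y) &= \lambda(y - x)\,\phi\Big(\frac{N(x, y)}{\lambda(y - x)}\Big) \\
&= \lambda(y - x)\Big[\phi(z_a) + \theta_a\Big(\frac{N(x, y)}{\lambda(y - x)} - z_a\Big) + O\Big(\Big(\frac{N(x, y)}{\lambda(y - x)} - z_a\Big)^2\Big)\Big] \\
&= \theta_a N(x, y) - \lambda(y - x)\psi(\theta_a) + \lambda(y - x)\,O\Big(\Big(\frac{N(x, y)}{\lambda(y - x)} - z_a\Big)^2\Big).
\end{aligned} \tag{5.2} \]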
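The last step of (5.2) rests on a convex-duality identity that is worth spelling out. The sketch below assumes, as the closed forms in Examples 4.1 and 4.2 suggest, that $\psi(\theta) = K(\theta) - 1$ and that $\phi$ is the Legendre transform of $\psi$; the thesis's own definitions sit in Chapter 2, so this is a consistency check rather than a quotation.

```latex
\begin{align*}
\phi(z) &= \sup_{\theta}\,[\theta z - \psi(\theta)], \qquad \psi(\theta) = K(\theta) - 1, \\
\psi'(\theta_a) = K'(\theta_a) = z_a
  &\;\Longrightarrow\; \phi(z_a) = \theta_a z_a - \psi(\theta_a), \qquad \phi'(z_a) = \theta_a, \\
\lambda(y-x)\Big[\phi(z_a) + \theta_a\Big(\frac{N(x,y)}{\lambda(y-x)} - z_a\Big)\Big]
  &= \theta_a N(x,y) - \lambda(y-x)\,\psi(\theta_a).
\end{align*}
```

For instance, with $K(\theta) = e^{\theta}$ the supremum is attained at $\theta = \log z$, which gives $\phi(z) = z\log z - z + 1$ as in Example 4.1.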
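Define a probability measure $Q_\lambda$ under which $\mathcal{X} = \{(t_i, X_i)\}_{i=1}^{n}$ is a non-uniform
compound Poisson process with rate $\lambda K(\theta_a)$ and mark distribution $F_{\theta_a}$ inside
$[x_0, y_0]$ and rate $\lambda$ and mark distribution $F$ outside $[x_0, y_0]$. Then
\[ \begin{aligned}
\frac{dQ_\lambda}{dP_\lambda}(\mathcal{X}) &= \frac{e^{-\lambda aK(\theta_a)}(\lambda aK(\theta_a))^{N(x_0,y_0)}\,e^{\theta_a N(x_0,y_0)}/K(\theta_a)^{N(x_0,y_0)}}{e^{-\lambda a}(\lambda a)^{N(x_0,y_0)}} \\
&= \exp\{\theta_a N(x_0, y_0) - \lambda a\psi(\theta_a)\}.
\end{aligned} \tag{5.3} \]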
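By this change of measure approach, we examine the local behavior of $S(x_0, y_0)$
under $Q_\lambda$. Specifically, $\theta_a N(x_0, y_0) - \lambda a\psi(\theta_a)$ is asymptotically normal with mean
$\theta_a\lambda aK'(\theta_a) - \lambda a\psi(\theta_a) = \lambda c$ and variance $\theta_a^2\lambda aK''(\theta_a)$ under $Q_\lambda$:
\[ Q_\lambda\{\theta_a N(x_0, y_0) - \lambda a\psi(\theta_a) \in du\} = [1 + o(1)]\,\frac{1}{\theta_a\sqrt{2\pi\lambda aK''(\theta_a)}}\exp\Big\{-\frac{(u - \lambda c)^2}{2\theta_a^2\lambda aK''(\theta_a)}\Big\}\,du, \tag{5.4} \]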
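where $o(1)$ is uniform over bounded values of $u$. Combining (5.2), (5.3) and (5.4),
\[ \begin{aligned}
P_\lambda\{S(x_0, y_0) \ge \lambda c\} &= P_\lambda\{\theta_a N(x_0, y_0) - \lambda a\psi(\theta_a) \ge \lambda c\} \\
&= \int_0^\infty P_\lambda\{\theta_a N(x_0, y_0) - \lambda a\psi(\theta_a) \in \lambda c + du\} \\
&= \int_0^\infty \exp\{-\lambda c - u\}\,Q_\lambda\{\theta_a N(x_0, y_0) - \lambda a\psi(\theta_a) \in \lambda c + du\} \\
&= \int_0^\infty \exp\{-\lambda c - u\}\,[1 + o(1)]\,\frac{1}{\theta_a\sqrt{2\pi\lambda aK''(\theta_a)}}\exp\Big\{-\frac{u^2}{2\theta_a^2\lambda aK''(\theta_a)}\Big\}\,du.
\end{aligned} \]
Since $\lim_{\lambda\to\infty}\exp\big\{-\frac{u^2}{2\theta_a^2\lambda aK''(\theta_a)}\big\} = 1$,
\[ P_\lambda\{S(x_0, y_0) \ge \lambda c\} = \frac{[1 + o(1)]\,e^{-\lambda c}}{\theta_a\sqrt{2\pi\lambda aK''(\theta_a)}}. \tag{5.5} \]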
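The approximation (5.5) can be checked numerically in a case where the exact tail is computable. The Python sketch below is our own illustration: it takes the exponential marks of Example 4.2, for which the compound Poisson sum over $[x_0, y_0]$ has the exact tail $P(N \ge t) = \sum_{k \ge 1} P(\text{count} = k)\,P(\text{Gamma}(k, 1) \ge t)$, and compares it with the right-hand side of (5.5). The parameter values and function names are assumptions chosen for the example.

```python
import math

def exact_tail(mu, t):
    """P(N >= t) for N = sum of Poisson(mu)-many Exp(1) marks, t > 0.

    Uses P(Gamma(k, 1) >= t) = P(Poisson(t) <= k - 1).
    """
    pois_mu = math.exp(-mu)              # P(count = 0)
    pois_t = math.exp(-t)                # P(Poisson(t) = 0)
    cdf_t = 0.0                          # running P(Poisson(t) <= k - 1)
    total = 0.0
    for k in range(1, 2000):
        pois_mu *= mu / k                # now P(count = k)
        cdf_t += pois_t                  # now P(Poisson(t) <= k - 1)
        pois_t *= t / k                  # now P(Poisson(t) = k)
        total += pois_mu * cdf_t
    return total

def approx_55(lam, a, c):
    """Right-hand side of (5.5) for Exp(1) marks, without the [1 + o(1)] factor."""
    s = math.sqrt(c / a)
    theta = s / (1.0 + s)                # theta_a from Example 4.2
    k2 = 2.0 / (1.0 - theta) ** 3        # K''(theta_a) for K(theta) = 1/(1 - theta)
    return math.exp(-lam * c) / (theta * math.sqrt(2 * math.pi * lam * a * k2))

lam, a, c = 100.0, 0.5, 0.08
z_a = (1.0 + math.sqrt(c / a)) ** 2      # solves phi(z_a) = c/a
# {theta_a N - lam a psi(theta_a) >= lam c} is the event {N >= lam a z_a}
exact = exact_tail(lam * a, lam * a * z_a)
print(exact, approx_55(lam, a, c), exact / approx_55(lam, a, c))
```

The ratio should be somewhat below 1 at $\lambda = 100$ and move toward 1 as $\lambda$ grows, which is what the $[1 + o(1)]$ factor in (5.5) allows for.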
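Similarly,
\[ \begin{aligned}
&P_\lambda\Big\{S(x_0, y_0) < \lambda c,\ \sup_{(x,y)\in W} S(x, y) \ge \lambda c\Big\} \\
&\quad= \int_0^\infty P_\lambda\Big\{\sup_{(x,y)\in W} S(x, y) \ge \lambda c \,\Big|\, S(x_0, y_0) = \lambda c - u\Big\}\,P_\lambda\{S(x_0, y_0) \in \lambda c - du\} \\
&\quad= \int_0^\infty \frac{[1 + o(1)]\,e^{-\lambda c + u}}{\theta_a\sqrt{2\pi\lambda aK''(\theta_a)}}\,P_\lambda\Big\{\sup_{(x,y)\in W} S(x, y) \ge \lambda c \,\Big|\, S(x_0, y_0) = \lambda c - u\Big\}\,du.
\end{aligned} \tag{5.6} \]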
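By (5.2), the linear approximation of $S(x, y) - S(x_0, y_0)$ is $\theta_a[N(x, x_0) + N(y_0, y)] -
\lambda(y - x - a)\psi(\theta_a)$, which is independent of $S(x_0, y_0)$. Hence,
\[ \begin{aligned}
&P_\lambda\Big\{\sup_{(x,y)\in W} S(x, y) \ge \lambda c \,\Big|\, S(x_0, y_0) = \lambda c - u\Big\} \\
&\quad= P_\lambda\Big\{\sup_{(x,y)\in W} \theta_a[N(x, x_0) + N(y_0, y)] - \lambda(y - x - a)\psi(\theta_a) \ge u - o(1)\Big\} \\
&\quad= P_\lambda\Big\{\sup_{0 \le v_1, v_2 \le m} \theta_a[N(x_0 - v_1\lambda^{-1}, x_0) + N(y_0, y_0 + v_2\lambda^{-1})] - (v_1 + v_2)\psi(\theta_a) \ge u - o(1)\Big\},
\end{aligned} \]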
where $o(1)$ is uniform over bounded values of $u$. Let
\[ C_u = P_\lambda\Big\{\sup_{0 \le v_1, v_2 \le m} \theta_a[N(x_0 - v_1\lambda^{-1}, x_0) + N(y_0, y_0 + v_2\lambda^{-1})] - (v_1 + v_2)\psi(\theta_a) \ge u - o(1)\Big\}. \]
Through a scaling transformation, we can view the limiting probability under $P_\lambda$ as
one involving a fixed Poisson rate $\lambda_0 = 1$ and increasingly large scanning sets,
\[ C_u = P_{\lambda_0}\Big\{\sup_{0 \le v_1, v_2 \le m} \big[\theta_a N(\lambda x_0 - v_1, \lambda x_0) - v_1\psi(\theta_a) + \theta_a N(\lambda y_0, \lambda y_0 + v_2) - v_2\psi(\theta_a)\big] \ge u - o(1)\Big\}. \]
Let $N_1(v_1) = N(\lambda x_0 - v_1, \lambda x_0)$, $N_2(v_2) = N(\lambda y_0, \lambda y_0 + v_2)$. Then $N_1$ and $N_2$ are
independent and identically distributed compound Poisson processes with rate 1
and mark distribution $F$. Further let $Y_1(v_1) = \theta_a N_1(v_1) - v_1\psi(\theta_a)$ and $Y_2(v_2) =
\theta_a N_2(v_2) - v_2\psi(\theta_a)$. Note that both $Y_1$ and $Y_2$ are independent of $S(x_0, y_0)$. Hence,
\[ C_u = P_{\lambda_0}\Big\{\sup_{0 \le v_1, v_2 \le m} [Y_1(v_1) + Y_2(v_2)] \ge u - o(1)\Big\}. \tag{5.7} \]
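Combining (5.6) and (5.7),
\[ \begin{aligned}
&P_\lambda\Big\{S(x_0, y_0) < \lambda c,\ \sup_{(x,y)\in W} S(x, y) \ge \lambda c\Big\} \\
&\quad= \int_0^\infty \frac{[1 + o(1)]\,e^{-\lambda c + u}}{\theta_a\sqrt{2\pi\lambda aK''(\theta_a)}}\,P_{\lambda_0}\Big\{\sup_{0 \le v_1, v_2 \le m} [Y_1(v_1) + Y_2(v_2)] \ge u - o(1)\Big\}\,du.
\end{aligned} \tag{5.8} \]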
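By (5.1), (5.5) and (5.8), as $\lambda \to \infty$,
\[ \begin{aligned}
&P_\lambda\Big\{\sup_{(x,y)\in W} S(x, y) \ge \lambda c\Big\} \\
&\quad= \frac{e^{-\lambda c}}{\theta_a\sqrt{2\pi\lambda aK''(\theta_a)}}\Big[1 + \int_0^\infty e^u P_{\lambda_0}\Big\{\sup_{0 \le v_1, v_2 \le m} [Y_1(v_1) + Y_2(v_2)] \ge u\Big\}\,du\Big] \\
&\quad= \frac{e^{-\lambda c}}{\theta_a\sqrt{2\pi\lambda aK''(\theta_a)}}\int_{-\infty}^\infty e^u P_{\lambda_0}\Big\{\sup_{0 \le v_1, v_2 \le m} [Y_1(v_1) + Y_2(v_2)] \ge u\Big\}\,du \\
&\quad= \frac{e^{-\lambda c}}{\theta_a\sqrt{2\pi\lambda aK''(\theta_a)}}\,E_{P_{\lambda_0}}\exp\Big\{\sup_{0 \le v_1, v_2 \le m} [Y_1(v_1) + Y_2(v_2)]\Big\} \\
&\quad= \frac{e^{-\lambda c}}{\theta_a\sqrt{2\pi\lambda aK''(\theta_a)}}\Big(E_{P_{\lambda_0}}\exp\Big\{\sup_{0 \le v_1 \le m} Y_1(v_1)\Big\}\Big)^2 \\
&\quad= \frac{e^{-\lambda c}}{\theta_a\sqrt{2\pi\lambda aK''(\theta_a)}}\Big(\int_{-\infty}^\infty e^u P_{\lambda_0}\Big\{\sup_{0 \le v_1 \le m} Y_1(v_1) \ge u\Big\}\,du\Big)^2.
\end{aligned} \tag{5.9} \]
The second equality holds because the supremum is nonnegative (take $v_1 = v_2 = 0$), so the
probability equals 1 for $u < 0$ and $\int_{-\infty}^0 e^u\,du = 1$. The fourth equality uses the
independence of $Y_1$ and $Y_2$, which have the same distribution.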
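Let $T_u = \inf\{v_1 : Y_1(v_1) \ge u\}$. Define a probability measure $Q_{\lambda_0}$ under which $N_1$ is
a compound Poisson process with rate $K(\theta_a)$ and mark distribution $F_{\theta_a}$. Then
\[ \begin{aligned}
\frac{dQ_{\lambda_0}}{dP_{\lambda_0}}(Y_1(T_u)) &= \frac{e^{-K(\theta_a)T_u}(K(\theta_a)T_u)^{N_1(T_u)}\,e^{\theta_a N_1(T_u)}/K(\theta_a)^{N_1(T_u)}}{e^{-T_u}(T_u)^{N_1(T_u)}} \\
&= \exp\{\theta_a N_1(T_u) - T_u\psi(\theta_a)\} \\
&= e^{Y_1(T_u)}.
\end{aligned} \]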
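Hence for (5.9),
\[ \begin{aligned}
&\int_{-\infty}^\infty e^u P_{\lambda_0}\Big\{\sup_{0 \le v_1 \le m} Y_1(v_1) \ge u\Big\}\,du \\
&\quad= \int_{-\infty}^\infty e^u \int_{-\infty}^\infty P_{\lambda_0}\Big\{Y_1(T_u) \in db,\ \sup_{0 \le v_1 \le m} Y_1(v_1) \ge u\Big\}\,du \\
&\quad= \int_{-\infty}^\infty e^u \int_{-\infty}^\infty e^{-b}\,Q_{\lambda_0}\Big\{Y_1(T_u) \in db,\ \sup_{0 \le v_1 \le m} Y_1(v_1) \ge u\Big\}\,du \\
&\quad= \int_{-\infty}^\infty e^u E_{Q_{\lambda_0}}\Big[e^{-Y_1(T_u)}\,I\Big\{\sup_{0 \le v_1 \le m} Y_1(v_1) \ge u\Big\}\Big]\,du \\
&\quad= \int_{-\infty}^\infty E_{Q_{\lambda_0}}\Big[e^{-(Y_1(T_u) - u)}\,I\Big\{\sup_{0 \le v_1 \le m} Y_1(v_1) \ge u\Big\}\Big]\,du.
\end{aligned} \tag{5.10} \]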
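By (2.3), $\nu_a = \lim_{u\to\infty} E_{Q_{\lambda_0}}[e^{-(Y_1(T_u) - u)}]$ is the overshoot constant. Hence it follows
from (5.10) that
\[ \begin{aligned}
\int_{-\infty}^\infty e^u P_{\lambda_0}\Big\{\sup_{0 \le v_1 \le m} Y_1(v_1) \ge u\Big\}\,du
&\sim \int_0^\infty \nu_a\,Q_{\lambda_0}\Big\{\sup_{0 \le v_1 \le m} Y_1(v_1) \ge u\Big\}\,du \\
&= \nu_a E_{Q_{\lambda_0}}\Big[\sup_{0 \le v_1 \le m} Y_1(v_1)\Big] \\
&= \nu_a E_{Q_{\lambda_0}} Y_1(m) + o(1) \\
&= \nu_a[\theta_a K'(\theta_a)m - m\psi(\theta_a)] + o(1) \\
&= \nu_a mc/a + o(1).
\end{aligned} \tag{5.11} \]
Here the range $u < 0$ contributes exactly $\int_{-\infty}^0 e^u\,du = 1$ to the left-hand side, which is
negligible compared with the right-hand side as $m \to \infty$.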
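The definition of the overshoot constant can be illustrated by direct simulation under $Q_{\lambda_0}$. The Python sketch below is our own example and uses the exponential marks of Example 4.2, so that $K(\theta) = 1/(1-\theta)$, $\psi(\theta) = \theta/(1-\theta)$ (assuming $\psi = K - 1$) and the tilted mark distribution $F_\theta$ is exponential with rate $1 - \theta$. For this case the memoryless property suggests the closed form $\nu_a = 1 - \theta_a$; that value is our own check and is not stated in the thesis. The level `u`, the number of paths and the seed are arbitrary choices.

```python
import math
import random

def overshoot_mc(theta, u=5.0, n_paths=20000, seed=0):
    """Monte Carlo estimate of E_Q[exp(-(Y1(T_u) - u))] for Exp(1) marks.

    Under Q, jumps arrive at rate K(theta) = 1/(1 - theta) with Exp(1 - theta)
    marks, and Y1(v) = theta * N1(v) - v * psi(theta) drifts down between jumps.
    """
    rng = random.Random(seed)
    rate = 1.0 / (1.0 - theta)           # K(theta)
    psi = theta / (1.0 - theta)          # psi(theta) = K(theta) - 1 (assumed)
    total = 0.0
    for _ in range(n_paths):
        y = 0.0
        while y < u:                     # the level can only be crossed at a jump
            y -= psi * rng.expovariate(rate)            # linear decrease until next jump
            y += theta * rng.expovariate(1.0 - theta)   # jump of size theta * mark
        total += math.exp(-(y - u))
    return total / n_paths

theta = 0.3
print(overshoot_mc(theta), 1.0 - theta)
```

With 20000 paths the standard error is below 0.002, so the estimate and the conjectured value $1 - \theta$ should agree to about two decimal places.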
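By (5.9) and (5.11), as $\lambda \to \infty$,
\[ P_\lambda\Big\{\sup_{(x,y)\in W} S(x, y) \ge \lambda c\Big\} \sim \frac{e^{-\lambda c}}{\theta_a\sqrt{2\pi\lambda aK''(\theta_a)}}\,\big(\nu_a mc/a\big)^2. \tag{5.12} \]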
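By Chan and Zhang (2007), the probability that the threshold is exceeded in two disjoint
windows simultaneously is asymptotically negligible. Hence Theorem 3.1 follows by combining all
the windows. Let $x_0 < y_0$ be multiples of $m\lambda^{-1}$ in $[0, 1]$ and $a = y_0 - x_0$. Then by
(5.12), as $\lambda \to \infty$,
\[ \begin{aligned}
P_\lambda\{M(a_0, a_1) \ge \lambda c\} &\sim \sum_{a_0 \le a \le a_1} P_\lambda\Big\{\sup_{(x,y)\in W} S(x, y) \ge \lambda c\Big\} \\
&\sim \int_{a_0}^{a_1} \frac{e^{-\lambda c}}{\theta_a\sqrt{2\pi\lambda aK''(\theta_a)}}\,\nu_a^2\lambda^2c^2a^{-2}(1 - a)\,da \\
&= (2\pi)^{-1/2}\lambda^{3/2}e^{-\lambda c}c^2\int_{a_0}^{a_1} \theta_a^{-1}K''(\theta_a)^{-1/2}\nu_a^2a^{-5/2}(1 - a)\,da.
\end{aligned} \]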
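The passage from the sum over windows to the integral is a Riemann-sum argument, sketched here for convenience. Both $x_0$ and $a = y_0 - x_0$ range over a grid of mesh $m\lambda^{-1}$, and for a given $a$ the starting point must satisfy $0 \le x_0 \le 1 - a$. Writing $h(a)$ for a generic summand (our notation):

```latex
\begin{align*}
\#\{x_0 : 0 \le x_0 \le 1 - a\} &\approx \frac{(1-a)\lambda}{m}, \qquad
\sum_{a_0 \le a \le a_1} h(a) \approx \frac{\lambda}{m}\int_{a_0}^{a_1} h(a)\,da, \\
\frac{\lambda}{m}\cdot\frac{(1-a)\lambda}{m}\cdot\Big(\frac{\nu_a m c}{a}\Big)^2
  &= \nu_a^2\lambda^2c^2a^{-2}(1-a), \\
\frac{e^{-\lambda c}}{\theta_a\sqrt{2\pi\lambda aK''(\theta_a)}}\,\nu_a^2\lambda^2c^2a^{-2}(1-a)
  &= (2\pi)^{-1/2}\lambda^{3/2}e^{-\lambda c}c^2\,\theta_a^{-1}K''(\theta_a)^{-1/2}\nu_a^2a^{-5/2}(1-a).
\end{align*}
```

In particular the block size $m$ cancels, so the limit does not depend on how the basic windows are chosen.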
Chapter 6
Conclusions
The definition of the score $S(x, y)$, and therefore Theorem 3.1, is based on the
rate function $\phi$, which in turn depends on the mark distribution $F$. In practice,
however, it is more reasonable to replace $\phi$ with a general function $g$ that is
independent of $F$. This can be done in two steps. First consider $g(x) = x$; then
$S(x, y) = N(x, y)$, which is also used in Chan and Zhang (2007). Under the same
definition of $M(a_0, a_1)$, the methodology applied to $\phi$ has to be adapted to this
case. For example, we have to choose a new $\theta_a$ without $\phi$. A simple guess is to
choose $\theta_a$ satisfying $K'(\theta_a) = c/a$. The second step is to extend the approximations
derived for $g(x) = x$ to general cases. We may need to place some regularity
assumptions on $g$ to derive the corresponding results, for example that $g$ is continuously
differentiable in a neighborhood of the origin.
Another future direction is to consider multi-dimensional time indexes. Chan
(2009) provides a good example of studying scan statistics in a Poisson random field.
For simplicity, we can first examine the scan statistics in a cube instead of arbitrary
indexing sets. The next question is then to establish the existence of overshoot
constants, which are still at the heart of the approximations in multi-dimensional cases.
The technique applied to marked Poisson processes should be adapted to marked
Poisson random fields.
The complicated computation of overshoot constants may be an issue of concern
in applications. Therefore it is necessary to study the linearity exhibited in Figure
4.3. If the linearity between $\nu_a$ and $\theta_a$ holds, we can simply use $\theta_a$ to derive
approximations of $\nu_a$. We can also apply some transformation (e.g. $\tilde\nu_a = e^{\nu_a}$)
and reexamine the relation between the transformed $\nu_a$ and $a$. If it shows valid linearity after the
transformation, we then obtain another simple approximation of $\nu_a$.
Bibliography
[1] Alm, S.E. (1997). On the distributions of scan statistics of a two dimensional
Poisson process. Adv. in Appl. Prob. 29, 1–18.
[2] Chan, H.P. and Lai, T.L. (2006). Maxima of asymptotically Gaussian random
fields and moderate deviation approximations to boundary-crossing probabilities of sums of random variables with multidimensional indices. Ann. Probab.
34, 80–121.
[3] Chan, H.P. and Zhang, N.R. (2007). Scan statistics with weighted observations. J. Amer. Statist. Assoc. 102, 595–602.
[4] Chan, H.P. (2009). Maxima of moving sums in a Poisson random field. Adv.
Appl. Prob. 41, 647–663.
[5] Chew, D., Choi, K. and Leung, M. (2005). Scoring schemes of palindrome
clusters for more sensitive prediction of replication origins in herpesviruses.
Nucleic Acids Research 33, e134.
[6] Cousens, S., Smith, P.G., Ward, H., Everington, D., Knight, R.S.G., Zeidler,
M., Stewart, G., Smith-Bathgate, E.A.B., Macleod, M.A., Mackenzie, J. and
Will, R.G. (2001). Geographical distribution of variant Creutzfeldt-Jakob disease in Great Britain. The Lancet 357, 1002–1007.
[7] Duczmal, L. and Assunção, R. (2004). A simulated annealing strategy for the
detection of arbitrarily shaped spatial clusters. Computational Statistics &
Data Analysis 45, 269–286.
[8] Glaz, J., Naus, J. and Wallenstein, S. (2001). Scan Statistics. Springer, New
York.
[9] Glaz, J., Pozdnyakov, V. and Wallenstein, S. (2009). Scan Statistics: Methods
and Applications. Birkhäuser, Boston.
[10] Hogan, M. and Siegmund, D.O. (1986). Large deviations for the maxima of
some random fields. Adv. Appl. Math. 7, 2–22.
[11] Kabluchko, Z. and Spodarev, E. (2009). Scan statistics of Lévy noises and
marked empirical processes. Adv. Appl. Prob. 41, 13–37.
[12] Kulldorff, M. (1997). A spatial scan statistic. Commun. Statist.—Theory Meth.
26, 1481–1496.
[13] Kulldorff, M., Huang, L. and Konty, K. (2008). A spatial scan statistic for
normally distributed data. Manuscript.
[14] Lifanov, A., Makeev, V., Nazina, A. and Papatsenko, D. (2003). Homotypic
regulatory clusters in Drosophila. Genome Research. 13, 579–588.
[15] Loader, C. (1991). Large-deviation approximations to the distributions of scan
statistics. Adv. Appl. Prob. 23, 751–771.
[16] Marcos, R.D.L.F. and Marcos, C.D.L.F. (2008). From star complexes to the
field: open cluster families. Astrophysical J. 672, 342–351.
[17] Naiman, D.Q. and Priebe, C.E. (2001). Computing scan statistic p-values
using importance sampling, with applications to genetics and medical image
analysis. J. of Computational & Graphical Statistics 10, 296–328.
[18] Naus, J. I. (1965). Clustering of random points in two dimensions. Biometrika.
52, 263–267.
[19] Naus, J. I. (1982). Approximations for Distributions of Scan Statistics. Journal
of the American Statistical Association. 77, 177–183.
[20] Pickands, J. (1969). Upcrossing probabilities for stationary Gaussian processes.
Trans. Amer. Math. Soc. 145, 51–73.
[21] Qualls, C. and Watanabe, H. (1973). Asymptotic properties of Gaussian random fields. Trans. Amer. Math. Soc. 177, 155–171.
[22] Rajewsky, N., Vergassola, M., Gaul, U. and Siggia, E. (2002). Computational
detection of genomic cis-regulatory modules applied to body patterning in the
early Drosophila embryo. BMC Bioinformatics. 3, e30.
[23] Siegmund, D.O. (1985). Sequential Analysis. Springer, New York.
[24] Turnbull, B., Iwano, E.J., Burnett, W.S., Howe, H.L. and Clark, L.C. (1990).
Monitoring for clusters of disease: application to leukemia incidence in upstate
New York. Amer. J. Epidemiology 132, 136–143.
[25] Yoshida, M., Naya, Y. and Miyashita, Y. (2003). Anatomical organization
of forward fiber projections from area TE to perirhinal neurons representing
visual long-term memory in monkeys. Proceedings of the National Academy of
Sciences of the United States of America 100, 4257–4262.
Appendix
Related R code
##main Monte Carlo simulation
sim1[...]