SIMULATION AND THE MONTE CARLO METHOD (Episode 6)
Suppose that $X$ can be generated via the composition method. Thus, we assume that there exists a random variable $Y$ taking values in $\{1,\ldots,m\}$, say, with known probabilities $\{p_i,\ i=1,\ldots,m\}$, and we assume that it is easy to sample from the conditional distribution of $X$ given $Y$. The events $\{Y=i\}$, $i=1,\ldots,m$, form disjoint subregions, or strata (singular: stratum), of the sample space $\Omega$, hence the name stratification. Using the conditioning formula (1.11), we can write

$$\ell = \mathbb{E}\big[\mathbb{E}[H(X)\mid Y]\big] = \sum_{i=1}^m p_i\,\mathbb{E}[H(X)\mid Y=i]\,. \qquad (5.33)$$

This representation suggests that we can estimate $\ell$ via the following stratified sampling estimator:

$$\hat\ell^{\,s} = \sum_{i=1}^m p_i\,\frac{1}{N_i}\sum_{j=1}^{N_i} H(X_{ij})\,, \qquad (5.34)$$

where $X_{ij}$ is the $j$-th observation from the conditional distribution of $X$ given $Y=i$. Here $N_i$ is the sample size assigned to the $i$-th stratum. The variance of the stratified sampling estimator is given by

$$\operatorname{Var}(\hat\ell^{\,s}) = \sum_{i=1}^m \frac{p_i^2\,\sigma_i^2}{N_i}\,, \qquad (5.35)$$

where $\sigma_i^2 = \operatorname{Var}(H(X)\mid Y=i)$. How the strata should be chosen depends very much on the problem at hand. However, for a given choice of the strata, the sample sizes $\{N_i\}$ can be obtained in an optimal manner, as given in the next theorem.

Theorem 5.5.1 (Stratified Sampling) Assuming that a maximum number of $N$ samples can be collected, that is, $\sum_{i=1}^m N_i = N$, the optimal value of $N_i$ is given by

$$N_i^* = N\,\frac{p_i\,\sigma_i}{\sum_{j=1}^m p_j\,\sigma_j}\,, \qquad (5.36)$$

which gives a minimal variance of

$$\operatorname{Var}(\hat\ell^{\,s}) = \frac{1}{N}\Big(\sum_{i=1}^m p_i\,\sigma_i\Big)^{\!2}\,. \qquad (5.37)$$

Proof: The theorem is straightforwardly proved using Lagrange multipliers and is left as an exercise to the reader; see Problem 5.9. □

Theorem 5.5.1 asserts that the minimal variance of $\hat\ell^{\,s}$ is attained for sample sizes $N_i$ that are proportional to $p_i\,\sigma_i$. A difficulty is that although the probabilities $p_i$ are assumed to be known, the standard deviations $\{\sigma_i\}$ are usually unknown. In practice, one would estimate the $\{\sigma_i\}$ from "pilot" runs and then proceed to estimate the optimal sample sizes $N_i^*$ from (5.36). A simple stratification procedure, which can achieve variance reduction without requiring prior knowledge of $\sigma_i^2$ and $H(X)$, is presented next.

Proposition 5.5.1 Let the sample sizes $N_i$ be proportional to $p_i$, that is, $N_i = p_i\,N$, $i = 1,\ldots,m$. Then

$$\operatorname{Var}(\hat\ell^{\,s}) \leq \operatorname{Var}(\hat\ell)\,.$$

Proof: Substituting $N_i = p_i N$ in (5.35) yields $\operatorname{Var}(\hat\ell^{\,s}) = \frac{1}{N}\sum_{i=1}^m p_i\,\sigma_i^2$. The result now follows from

$$N\operatorname{Var}(\hat\ell) = \operatorname{Var}(H(X)) \geq \mathbb{E}\big[\operatorname{Var}(H(X)\mid Y)\big] = \sum_{i=1}^m p_i\,\sigma_i^2 = N\operatorname{Var}(\hat\ell^{\,s})\,,$$

where we have used (5.21) in the inequality. □

Proposition 5.5.1 states that the estimator $\hat\ell^{\,s}$ is more accurate than the CMC estimator $\hat\ell$. It effects stratification by favoring those events $\{Y=i\}$ whose probabilities $p_i$ are largest. Intuitively, this cannot, in general, be an optimal assignment, since information on $\sigma_i^2$ and $H(X)$ is ignored. In the special case of equal weights ($p_i = 1/m$ and $N_i = N/m$), the estimator (5.34) reduces to

$$\hat\ell^{\,s} = \frac{1}{N}\sum_{i=1}^m \sum_{j=1}^{N/m} H(X_{ij})\,, \qquad (5.38)$$

and the method is known as the systematic sampling method (see, for example, Cochran [6]).
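The following sketch (not from the book) illustrates the stratified sampling estimator (5.34) with proportional allocation $N_i = p_i N$ on a small, hypothetical composition model; the stratum probabilities, conditional distributions, and performance function are illustrative assumptions. Replacing $p_i N$ by (5.36), with the $\sigma_i$ estimated from pilot runs, would give the optimal allocation of Theorem 5.5.1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model (illustrative, not from the book): Y selects one of
# m strata, and given Y = i, X is normal with stratum-dependent mean/spread.
p     = np.array([0.5, 0.3, 0.2])      # known stratum probabilities p_i
mu    = np.array([0.0, 2.0, 5.0])      # conditional means
sigma = np.array([1.0, 1.0, 3.0])      # conditional standard deviations
H     = lambda x: x**2                 # sample performance
N     = 10_000                         # total simulation budget

# Crude Monte Carlo: sample Y first, then X given Y.
y   = rng.choice(len(p), size=N, p=p)
x   = rng.normal(mu[y], sigma[y])
cmc = H(x).mean()

# Stratified sampling with proportional allocation N_i = p_i * N, i.e. the
# estimator (5.34); with equal p_i it reduces to systematic sampling (5.38).
strat = 0.0
for i, p_i in enumerate(p):
    N_i = int(round(p_i * N))
    x_i = rng.normal(mu[i], sigma[i], size=N_i)
    strat += p_i * H(x_i).mean()

print(f"CMC estimate        : {cmc:.4f}")
print(f"Stratified estimate : {strat:.4f}")
```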
5.6 IMPORTANCE SAMPLING

The most fundamental variance reduction technique is importance sampling. As we shall see below, importance sampling quite often leads to a dramatic variance reduction (sometimes on the order of millions, in particular when estimating rare-event probabilities), while with all of the above variance reduction techniques only a moderate reduction, typically up to 10-fold, can be achieved. Importance sampling involves choosing a sampling distribution that favors important samples. Let, as before,

$$\ell = \mathbb{E}_f[H(X)] = \int H(x)\, f(x)\, dx\,, \qquad (5.39)$$

where $H$ is the sample performance and $f$ is the probability density of $X$. For reasons that will become clear shortly, we add a subscript $f$ to the expectation to indicate that it is taken with respect to the density $f$. Let $g$ be another probability density such that $H f$ is dominated by $g$; that is, $g(x) = 0 \Rightarrow H(x)\,f(x) = 0$. Using the density $g$, we can represent $\ell$ as

$$\ell = \int H(x)\,\frac{f(x)}{g(x)}\, g(x)\, dx = \mathbb{E}_g\Big[H(X)\,\frac{f(X)}{g(X)}\Big]\,, \qquad (5.40)$$

where the subscript $g$ means that the expectation is taken with respect to $g$. Such a density is called the importance sampling density, proposal density, or instrumental density (we use $g$ as an instrument to obtain information about $\ell$). Consequently, if $X_1,\ldots,X_N$ is a random sample from $g$, that is, $X_1,\ldots,X_N$ are iid random vectors with density $g$, then

$$\hat\ell = \frac{1}{N}\sum_{k=1}^N H(X_k)\,\frac{f(X_k)}{g(X_k)} \qquad (5.41)$$

is an unbiased estimator of $\ell$. This estimator is called the importance sampling estimator. The ratio of densities,

$$W(x) = \frac{f(x)}{g(x)}\,, \qquad (5.42)$$

is called the likelihood ratio. For this reason the importance sampling estimator is also called the likelihood ratio estimator. In the particular case where there is no change of measure, that is, $g = f$, we have $W = 1$, and the likelihood ratio estimator in (5.41) reduces to the usual CMC estimator.

5.6.1 Weighted Samples

The likelihood ratios need only be known up to a constant, that is, $W(x) = c\,w(x)$ for some known function $w(\cdot)$. Since $\mathbb{E}_g[W(X)] = 1$, we can write $\ell = \mathbb{E}_g[H(X)\,W(X)]$ as

$$\ell = \frac{\mathbb{E}_g[H(X)\,W(X)]}{\mathbb{E}_g[W(X)]} = \frac{\mathbb{E}_g[H(X)\,w(X)]}{\mathbb{E}_g[w(X)]}\,.$$

This suggests, as an alternative to the standard likelihood ratio estimator (5.41), the following weighted sample estimator:

$$\hat\ell_w = \frac{\sum_{k=1}^N H(X_k)\, w_k}{\sum_{k=1}^N w_k}\,. \qquad (5.43)$$

Here the $\{w_k\}$, with $w_k = w(X_k)$, are interpreted as weights of the random sample $\{X_k\}$, and the sequence $\{(X_k, w_k)\}$ is called a weighted (random) sample from $g(x)$. Similar to the regenerative ratio estimator in Chapter 4, the weighted sample estimator (5.43) introduces some bias, which tends to 0 as $N$ increases. Loosely speaking, we may view the weighted sample $\{(X_k, w_k)\}$ as a representation of $f(x)$ in the sense that $\ell = \mathbb{E}_f[H(X)] \approx \hat\ell_w$ for any function $H(\cdot)$.

5.6.2 The Variance Minimization Method

Since the choice of the importance sampling density $g$ is crucially linked to the variance of the estimator $\hat\ell$ in (5.41), we consider next the problem of minimizing the variance of $\hat\ell$ with respect to $g$, that is,

$$\min_g\ \operatorname{Var}_g\Big(H(X)\,\frac{f(X)}{g(X)}\Big)\,. \qquad (5.44)$$

It is not difficult to prove (see, for example, Rubinstein and Melamed [31] and Problem 5.13) that the solution of the problem (5.44) is

$$g^*(x) = \frac{|H(x)|\, f(x)}{\int |H(x)|\, f(x)\, dx}\,. \qquad (5.45)$$

In particular, if $H(x) \geq 0$ (which we will assume from now on), then

$$g^*(x) = \frac{H(x)\, f(x)}{\ell} \qquad (5.46)$$

and $\operatorname{Var}_{g^*}(\hat\ell) = \operatorname{Var}_{g^*}\big(H(X)\,W(X)\big) = \operatorname{Var}_{g^*}(\ell) = 0$. The density $g^*$ as per (5.45) and (5.46) is called the optimal importance sampling density.

EXAMPLE 5.8

Let $X \sim \mathsf{Exp}(u^{-1})$ and $H(X) = I_{\{X \geq \gamma\}}$ for some $\gamma > 0$. Let $f$ denote the pdf of $X$. Consider the estimation of

$$\ell = \mathbb{P}(X \geq \gamma) = \int I_{\{x \geq \gamma\}}\, f(x)\, dx\,.$$

We have

$$g^*(x) = \frac{I_{\{x \geq \gamma\}}\, f(x)}{\ell} = u^{-1}\, e^{-(x-\gamma)/u}\, I_{\{x \geq \gamma\}}\,.$$

Thus, the optimal importance sampling distribution of $X$ is the shifted exponential distribution. Note that $H f$ is dominated by $g^*$, but $f$ itself is not dominated by $g^*$. Since $g^*$ is optimal, the likelihood ratio estimator $\hat\ell$ is constant. Namely, with $N = 1$,

$$\hat\ell = H(X)\,\frac{f(X)}{g^*(X)} = \ell\,.$$

It is important to realize that, although (5.41) is an unbiased estimator for any pdf $g$ dominating $H f$, not all such pdfs are appropriate. One of the main rules for choosing a good importance sampling pdf is that the estimator (5.41) should have finite variance. This is equivalent to the requirement that

$$\mathbb{E}_g\Big[H^2(X)\,\frac{f^2(X)}{g^2(X)}\Big] = \mathbb{E}_f\Big[H^2(X)\,\frac{f(X)}{g(X)}\Big] < \infty\,. \qquad (5.47)$$

This suggests that $g$ should not have a "lighter tail" than $f$ and that, preferably, the likelihood ratio $f/g$ should be bounded.
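The sketch below (parameter choices are illustrative, not from the book) contrasts CMC with the likelihood ratio estimator (5.41) for the setting of Example 5.8. Instead of the optimal shifted exponential $g^*$, it uses an exponential proposal with a larger mean $v > u$, so that the likelihood ratio $f/g$ is bounded on $\{x \geq \gamma\}$ and the finite-variance condition (5.47) holds; the particular value $v = \gamma + u$ anticipates Example 5.12 below.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setting in the spirit of Example 5.8:
# estimate ell = P(X >= gamma) for X ~ Exp(1/u).
u, gamma, N = 1.0, 4.0, 100_000
exact = np.exp(-gamma / u)

# Crude Monte Carlo under f(x) = u^{-1} exp(-x/u).
x_f = rng.exponential(u, size=N)
cmc = (x_f >= gamma).mean()

# Importance sampling with g(x) = v^{-1} exp(-x/v), v > u, so that the
# likelihood ratio W = f/g is bounded on {x >= gamma}; see (5.41)-(5.42).
v   = gamma + u                       # heavier-tailed mean (cf. Example 5.12)
x_g = rng.exponential(v, size=N)
W   = (v / u) * np.exp(-x_g * (1.0 / u - 1.0 / v))
imp = ((x_g >= gamma) * W).mean()

print(f"exact               : {exact:.5f}")
print(f"crude Monte Carlo   : {cmc:.5f}")
print(f"importance sampling : {imp:.5f}")
```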
In general, implementation of the optimal importance sampling density $g^*$ as per (5.45) and (5.46) is problematic. The main difficulty lies in the fact that to derive $g^*(x)$ one needs to know $\ell$. But $\ell$ is precisely the quantity we want to estimate from the simulation! In most simulation studies the situation is even worse, since the analytical expression for the sample performance $H$ is unknown in advance. To overcome this difficulty, one can perform a pilot run with the underlying model, obtain a sample $H(X_1),\ldots,H(X_N)$, and then use it to estimate $g^*$. It is important to note that sampling from such an artificially constructed density may be a very complicated and time-consuming task, especially when $g$ is a high-dimensional density.

Remark 5.6.1 (Degeneracy of the Likelihood Ratio Estimator) The likelihood ratio estimator $\hat\ell$ in (5.41) suffers from a form of degeneracy in the sense that the distribution of $W(X)$ under the importance sampling density $g$ may become increasingly skewed as the dimensionality $n$ of $X$ increases. That is, $W(X)$ may take values close to 0 with high probability, but may also take very large values with a small but significant probability. As a consequence, the variance of $W(X)$ under $g$ may become very large for large $n$. As an example of this degeneracy, assume for simplicity that the components of $X$ are iid, under both $f$ and $g$. Hence, both $f(x)$ and $g(x)$ are products of their marginal pdfs. Suppose the marginal pdf of each component $X_i$ is $f_1$ under $f$ and $g_1$ under $g$. We can then write $W(X)$ as

$$W(X) = \prod_{i=1}^n \frac{f_1(X_i)}{g_1(X_i)} = \exp\Big(\sum_{i=1}^n \ln \frac{f_1(X_i)}{g_1(X_i)}\Big)\,. \qquad (5.48)$$

By the law of large numbers, the random variable $\sum_{i=1}^n \ln\big(f_1(X_i)/g_1(X_i)\big)$ is approximately equal to $n\,\mathbb{E}_{g_1}\!\big[\ln\big(f_1(X)/g_1(X)\big)\big]$ for large $n$. Hence,

$$W(X) \approx e^{\,n\,\mathbb{E}_{g_1}[\ln(f_1(X)/g_1(X))]} = e^{-\,n\,\mathbb{E}_{g_1}[\ln(g_1(X)/f_1(X))]}\,. \qquad (5.49)$$

Since $\mathbb{E}_{g_1}[\ln(g_1(X)/f_1(X))]$ is nonnegative (see page 31), the likelihood ratio $W(X)$ tends to 0 as $n \to \infty$. However, by definition, the expectation of $W(X)$ under $g$ is always 1. This indicates that the distribution of $W(X)$ becomes increasingly skewed when $n$ gets large.

Several methods have been introduced to prevent this degeneracy. Examples are the heuristics of Doucet et al. [8], Liu [23], and Robert and Casella [26], and the so-called screening method. The last will be presented in Sections 5.9 and 8.2.2 and can be considered as a dimension-reduction technique.

When the pdf $f$ belongs to some parametric family of distributions, it is often convenient to choose the importance sampling distribution from the same family. In particular, suppose that $f(\cdot) = f(\cdot; u)$ belongs to the family

$$\mathcal{F} = \{ f(\cdot; v),\ v \in \mathcal{V} \}\,.$$

Then the problem of finding an optimal importance sampling density in this class reduces to the following parametric minimization problem:

$$\min_{v \in \mathcal{V}}\ \operatorname{Var}_v\big(H(X)\,W(X; u, v)\big)\,, \qquad (5.50)$$

where $W(X; u, v) = f(X; u)/f(X; v)$. We will call the vector $v$ the reference parameter vector or tilting vector. Since under $f(\cdot; v)$ the expectation $\ell = \mathbb{E}_v[H(X)\,W(X; u, v)]$ is constant, the optimal solution of (5.50) coincides with that of

$$\min_{v \in \mathcal{V}} V(v)\,, \qquad (5.51)$$

where

$$V(v) = \mathbb{E}_v\big[H^2(X)\,W^2(X; u, v)\big] = \mathbb{E}_u\big[H^2(X)\,W(X; u, v)\big]\,. \qquad (5.52)$$

We shall call either of the equivalent problems (5.50) and (5.51) the variance minimization (VM) problem, and we shall call the parameter vector ${}^*v$ that minimizes programs (5.50)-(5.51) the optimal VM reference parameter vector. We refer to $u$ as the nominal parameter. The sample average version of (5.51)-(5.52) is

$$\min_{v \in \mathcal{V}} \widehat V(v)\,, \qquad (5.53)$$

where

$$\widehat V(v) = \frac{1}{N}\sum_{k=1}^N H^2(X_k)\,W(X_k; u, v)\,, \qquad (5.54)$$

and the sample $X_1,\ldots,X_N$ is from $f(x; u)$. Note that as soon as the sample $X_1,\ldots,X_N$ is available, the function $\widehat V(v)$ becomes a deterministic one.
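As an illustration (assumptions: $X \sim \mathsf{Exp}(1/u)$ with $u = 1$, $H(x) = x$, and the exponential family parameterized by its mean), the following sketch minimizes the sample function $\widehat V(v)$ of (5.54) numerically; by Example 5.9 below, the minimizer should be close to ${}^*v = 2u$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)

# Sample-average VM program (5.53)-(5.54) for an illustrative case:
# X ~ Exp(1/u), H(x) = x, exponential family f(.;v) parameterized by the mean.
u, N = 1.0, 100_000
x = rng.exponential(u, size=N)          # sample from the nominal density f(.;u)

def W(x, u, v):
    """Likelihood ratio f(x;u)/f(x;v) for the exponential family."""
    return (v / u) * np.exp(-x * (1.0 / u - 1.0 / v))

def V_hat(v):
    """Sample counterpart (5.54); deterministic once the sample x is fixed."""
    return np.mean(x**2 * W(x, u, v))

res = minimize_scalar(V_hat, bounds=(0.6 * u, 10.0 * u), method="bounded")
print(f"estimated VM reference parameter: {res.x:.3f}  (Example 5.9: *v = 2u = {2*u})")
```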
Since in typical applications both functions $V(v)$ and $\widehat V(v)$ are convex and differentiable with respect to $v$, and since one can typically interchange the expectation and differentiation operators (see Rubinstein and Shapiro [32]), the solutions of programs (5.51)-(5.52) and (5.53)-(5.54) can be obtained by solving (with respect to $v$) the following systems of equations:

$$\mathbb{E}_u\big[H^2(X)\,\nabla W(X; u, v)\big] = 0 \qquad (5.55)$$

and

$$\frac{1}{N}\sum_{k=1}^N H^2(X_k)\,\nabla W(X_k; u, v) = 0\,, \qquad (5.56)$$

respectively, where

$$\nabla W(X; u, v) = \nabla\,\frac{f(X; u)}{f(X; v)} = -\big[\nabla \ln f(X; v)\big]\, W(X; u, v)\,,$$

the gradient is with respect to $v$, and the function $\nabla \ln f(x; v)$ is the score function; see (1.64). Note that the system of nonlinear equations (5.56) is typically solved using numerical methods.

EXAMPLE 5.9

Consider estimating $\ell = \mathbb{E}[X]$, where $X \sim \mathsf{Exp}(u^{-1})$. Choosing $f(x; v) = v^{-1} e^{-x/v}$, $x \geq 0$, as the importance sampling pdf, the program (5.51) reduces to

$$\min_v V(v) = \min_v\ \mathbb{E}_u\big[X^2\,W(X; u, v)\big] = \min_v\ \frac{2\,v}{u^2\,(2/u - 1/v)^3}\,, \qquad v > u/2\,.$$

The optimal reference parameter ${}^*v$ is given by ${}^*v = 2u$. We see that ${}^*v$ is exactly two times larger than $u$. Solving the sample average version (5.56) numerically, one should find that, for large $N$, its optimal solution will be close to the true parameter ${}^*v = 2u$.

EXAMPLE 5.10 (Example 5.8 Continued)

Consider again estimating $\ell = \mathbb{P}_u(X \geq \gamma) = e^{-\gamma/u}$. In this case, using the family $\{f(x; v),\ v > 0\}$ defined by $f(x; v) = v^{-1} e^{-x/v}$, $x \geq 0$, the program (5.51) reduces to

$$\min_v V(v) = \min_v\ \mathbb{E}_u\big[I_{\{X \geq \gamma\}}\,W(X; u, v)\big] = \min_v\ \frac{v\, e^{-\gamma\,(2/u - 1/v)}}{u^2\,(2/u - 1/v)}\,, \qquad v > u/2\,.$$

The optimal reference parameter ${}^*v$ is given by

$${}^*v = \tfrac{1}{2}\Big(\gamma + u + \sqrt{\gamma^2 + u^2}\Big) = \gamma + \frac{u}{2} + O\big((u/\gamma)^2\big)\,,$$

where $O(z^2)$ denotes a function of $z$ such that $\lim_{z \to 0} O(z^2)/z^2 = \text{constant}$. We see that for $\gamma \gg u$, ${}^*v$ is approximately equal to $\gamma$.

It is important to note that in this case the sample version (5.56) (or (5.53)-(5.54)) is meaningful only for moderate $\gamma$, in particular for those $\gamma$ for which $\ell$ is not a rare-event probability. For very small $\ell$, a tremendously large sample size $N$ is needed (because of the indicator function $I_{\{X \geq \gamma\}}$), and thus the importance sampling estimator $\hat\ell$ is useless. We shall discuss the estimation of rare-event probabilities in more detail in Chapter 8.

Observe that the VM problem (5.51) can also be written as

$$\min_{v \in \mathcal{V}} V(v) = \min_{v \in \mathcal{V}}\ \mathbb{E}_w\big[H^2(X)\,W(X; u, v)\,W(X; u, w)\big]\,, \qquad (5.57)$$

where $w$ is an arbitrary reference parameter. Note that (5.57) is obtained from (5.52) by multiplying and dividing the integrand by $f(x; w)$. We now replace the expected value in (5.57) by its sample (stochastic) counterpart and then take the optimal solution of the associated Monte Carlo program as an estimator of ${}^*v$. Specifically, the stochastic counterpart of (5.57) is

$$\min_{v \in \mathcal{V}} \widehat V(v) = \min_{v \in \mathcal{V}}\ \frac{1}{N}\sum_{k=1}^N H^2(X_k)\,W(X_k; u, v)\,W(X_k; u, w)\,, \qquad (5.58)$$

where $X_1,\ldots,X_N$ is an iid sample from $f(\cdot; w)$ and $w$ is an appropriately chosen trial parameter. Solving the stochastic program (5.58) thus yields an estimate, say $\widehat{{}^*v}$, of ${}^*v$. In some cases it may be useful to iterate this procedure, that is, use $\widehat{{}^*v}$ as a trial vector in (5.58), to obtain a better estimate. Once the reference parameter $v = \widehat{{}^*v}$ is determined, $\ell$ is estimated via the likelihood ratio estimator

$$\hat\ell = \frac{1}{N}\sum_{k=1}^N H(X_k)\,W(X_k; u, v)\,, \qquad (5.59)$$

where $X_1,\ldots,X_N$ is a random sample from $f(\cdot; v)$. Typically, the sample size $N$ in (5.59) is larger than that used for estimating the reference parameter. We call (5.59) the standard likelihood ratio (SLR) estimator.
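The following sketch (illustrative parameters, with a moderate $\gamma$ so that $\ell$ is not a rare-event probability) runs this two-stage procedure for Example 5.10: a pilot sample from $f(\cdot; w)$ with a trial parameter $w$ is used to minimize $\widehat V(v)$ in (5.58), and the resulting $\widehat{{}^*v}$ is then used in the SLR estimator (5.59).

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)

# Two-stage VM / SLR sketch for Example 5.10 with illustrative parameters:
# estimate ell = P_u(X >= gamma) with X ~ Exp(1/u) and a moderate gamma.
u, gamma = 1.0, 3.0
N_pilot, N_main = 10_000, 100_000

def W(x, u, v):                        # likelihood ratio f(x;u)/f(x;v)
    return (v / u) * np.exp(-x * (1.0 / u - 1.0 / v))

H = lambda x: (x >= gamma).astype(float)

# Stage 1: stochastic counterpart (5.58) with an arbitrary trial parameter w.
w  = 2.0
xp = rng.exponential(w, size=N_pilot)  # pilot sample from f(.;w)
V_hat = lambda v: np.mean(H(xp)**2 * W(xp, u, v) * W(xp, u, w))
v_hat = minimize_scalar(V_hat, bounds=(0.6 * u, 20.0), method="bounded").x

# Stage 2: SLR estimator (5.59) with the estimated reference parameter.
xm  = rng.exponential(v_hat, size=N_main)
ell = np.mean(H(xm) * W(xm, u, v_hat))

vm_opt = 0.5 * (gamma + u + np.hypot(gamma, u))   # *v from Example 5.10
print(f"v_hat = {v_hat:.3f}   (VM optimum of Example 5.10 is about {vm_opt:.3f})")
print(f"SLR estimate = {ell:.5f}   exact = {np.exp(-gamma/u):.5f}")
```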
5.6.3 The Cross-Entropy Method

An alternative approach for choosing an "optimal" reference parameter vector in (5.59) is based on the Kullback-Leibler cross-entropy, or simply cross-entropy (CE), mentioned in (1.59). For clarity we repeat that the CE distance between two pdfs $g$ and $h$ is given (in the continuous case) by

$$\mathcal{D}(g, h) = \mathbb{E}_g\Big[\ln \frac{g(X)}{h(X)}\Big] = \int g(x)\,\ln g(x)\, dx - \int g(x)\,\ln h(x)\, dx\,.$$

Recall that $\mathcal{D}(g, h) \geq 0$, with equality if and only if $g = h$. The general idea is to choose the importance sampling density, say $h$, such that the CE distance between the optimal importance sampling density $g^*$ in (5.45) and $h$ is minimal. We call this the CE optimal pdf. Thus, this pdf solves the following functional optimization program:

$$\min_h\ \mathcal{D}(g^*, h)\,.$$

If we optimize over all densities $h$, then it is immediate from $\mathcal{D}(g^*, h) \geq 0$ that the CE optimal pdf coincides with the VM optimal pdf $g^*$. As with the VM approach in (5.50) and (5.51), we shall restrict ourselves to the parametric family of densities $\{f(\cdot; v),\ v \in \mathcal{V}\}$ that contains the "nominal" density $f(\cdot; u)$. The CE method now aims to solve the parametric optimization problem

$$\min_v\ \mathcal{D}\big(g^*, f(\cdot; v)\big) = \min_v\Big\{\int g^*(x)\,\ln g^*(x)\, dx - \int g^*(x)\,\ln f(x; v)\, dx\Big\}\,. \qquad (5.60)$$

Since the first term on the right-hand side of (5.60) does not depend on $v$, minimizing the Kullback-Leibler distance between $g^*$ and $f(\cdot; v)$ is equivalent to maximizing, with respect to $v$,

$$\int H(x)\, f(x; u)\,\ln f(x; v)\, dx = \mathbb{E}_u\big[H(X)\,\ln f(X; v)\big]\,,$$

where we have assumed that $H(x)$ is nonnegative. Arguing as in (5.51), we find that the CE optimal reference parameter vector $v^*$ can be obtained from the solution of the following simple program:

$$\max_v D(v) = \max_v\ \mathbb{E}_u\big[H(X)\,\ln f(X; v)\big]\,. \qquad (5.61)$$

Since typically $D(v)$ is concave and differentiable with respect to $v$ (see Rubinstein and Shapiro [32]), the solution to (5.61) may be obtained by solving

$$\mathbb{E}_u\big[H(X)\,\nabla \ln f(X; v)\big] = 0\,, \qquad (5.62)$$

provided that the expectation and differentiation operators can be interchanged. The sample counterpart of (5.62) is

$$\frac{1}{N}\sum_{k=1}^N H(X_k)\,\nabla \ln f(X_k; v) = 0\,. \qquad (5.63)$$

By analogy to the VM program (5.51), we call (5.61) the CE program, and we call the parameter vector $v^*$ that maximizes (5.61) the optimal CE reference parameter vector. Arguing as in (5.57), it is readily seen that (5.61) is equivalent to the following program:

$$\max_v D(v) = \max_v\ \mathbb{E}_w\big[H(X)\,W(X; u, w)\,\ln f(X; v)\big]\,, \qquad (5.64)$$

where $W(X; u, w)$ is again the likelihood ratio and $w$ is an arbitrary tilting parameter. Similar to (5.58), we can estimate $v^*$ as the solution of the stochastic program

$$\max_v \widehat D(v) = \max_v\ \frac{1}{N}\sum_{k=1}^N H(X_k)\,W(X_k; u, w)\,\ln f(X_k; v)\,, \qquad (5.65)$$

where $X_1,\ldots,X_N$ is a random sample from $f(\cdot; w)$. As in the VM case, we mention the possibility of iterating this procedure, that is, using the solution of (5.65) as a trial parameter for the next iteration. Since in typical applications the function $\widehat D$ in (5.65) is concave and differentiable with respect to $v$ (see [32]), the solution of (5.65) may be obtained by solving (with respect to $v$) the following system of equations:

$$\frac{1}{N}\sum_{k=1}^N H(X_k)\,W(X_k; u, w)\,\nabla \ln f(X_k; v) = 0\,, \qquad (5.66)$$

where the gradient is with respect to $v$. Our extensive numerical studies show that for moderate dimensions $n$, say $n \leq 50$, the optimal solutions of the CE programs (5.64) and (5.65) (or (5.66)) and their VM counterparts (5.57) and (5.58) are typically nearly the same. However, for high-dimensional problems ($n > 50$), we found numerically that the importance sampling estimator $\hat\ell$ in (5.59) based on VM updating of $v$ outperforms its CE counterpart in both variance and bias. The latter is caused by the degeneracy of $W$, to which, we found, CE is more sensitive.
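As a minimal illustration (assuming $X \sim \mathsf{Exp}(1/u)$, $H(x) = x$, and trial parameter $w = u$, so that $W \equiv 1$), the sketch below maximizes the sample function $\widehat D(v)$ of (5.65) numerically; the solution should be close to the value $v^* = 2u$ derived analytically in Example 5.11 below.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)

# Stochastic CE program (5.65) for an illustrative case:
# X ~ Exp(1/u), H(x) = x, and w = u (so the likelihood ratio W equals 1).
u, N = 1.0, 100_000
x = rng.exponential(u, size=N)

def log_f(x, v):
    """Log-density of the exponential distribution with mean v."""
    return -np.log(v) - x / v

D_hat = lambda v: np.mean(x * log_f(x, v))     # (5.65) with H(x) = x, W = 1

v_ce = minimize_scalar(lambda v: -D_hat(v), bounds=(0.1, 10.0), method="bounded").x
print(f"numerical CE solution: {v_ce:.3f}   (Example 5.11: v* = 2u = {2*u})")
```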
The advantage of the CE program is that it can often be solved analytically. In particular, this happens when the distribution of $X$ belongs to an exponential family of distributions; see Section A.3 of the Appendix. Specifically (see (A.16)), for a one-dimensional exponential family parameterized by the mean, the CE optimal parameter is always

$$v^* = \frac{\mathbb{E}_u[H(X)\,X]}{\mathbb{E}_u[H(X)]}\,, \qquad (5.67)$$

and the corresponding sample-based updating formula is

$$\widehat v = \frac{\sum_{k=1}^N H(X_k)\,W(X_k; u, w)\,X_k}{\sum_{k=1}^N H(X_k)\,W(X_k; u, w)}\,, \qquad (5.68)$$

where $X_1,\ldots,X_N$ is a random sample from the density $f(\cdot; w)$ and $w$ is an arbitrary parameter. The multidimensional version of (5.68) is

$$\widehat v_i = \frac{\sum_{k=1}^N H(X_k)\,W(X_k; u, w)\,X_{ki}}{\sum_{k=1}^N H(X_k)\,W(X_k; u, w)} \qquad (5.69)$$

for $i = 1,\ldots,n$, where $X_{ki}$ is the $i$-th component of the vector $X_k$ and $u$ and $w$ are parameter vectors. Observe that for $u = w$ (no likelihood ratio term $W$), (5.69) reduces to

$$\widehat v_i = \frac{\sum_{k=1}^N H(X_k)\,X_{ki}}{\sum_{k=1}^N H(X_k)}\,, \qquad (5.70)$$

where $X_k \sim f(x; u)$. Observe also that because of the degeneracy of $W$, one would always prefer the estimator (5.70) to (5.69), especially for high-dimensional problems. But as we shall see below, this is not always feasible, particularly when estimating rare-event probabilities in Chapter 8.

EXAMPLE 5.11 (Example 5.9 Continued)

Consider again the estimation of $\ell = \mathbb{E}[X]$, where $X \sim \mathsf{Exp}(u^{-1})$ and $f(x; v) = v^{-1} e^{-x/v}$, $x \geq 0$. Solving (5.62), we find that the optimal reference parameter $v^*$ is equal to

$$v^* = \frac{\mathbb{E}_u[X^2]}{\mathbb{E}_u[X]} = 2u\,.$$

Thus, $v^*$ is exactly the same as ${}^*v$. For the sample average version of (5.62), we should find that for large $N$ its optimal solution is close to the optimal parameter $v^* = 2u$.

EXAMPLE 5.12 (Example 5.10 Continued)

Consider again the estimation of $\ell = \mathbb{P}_u(X \geq \gamma) = e^{-\gamma/u}$. In this case, we readily find from (5.67) that the optimal reference parameter is $v^* = \gamma + u$. Note that, similar to the VM case, for $\gamma \gg u$ the optimal reference parameter is approximately $\gamma$.

Note that in the above example, similar to the VM problem, the CE sample version (5.66) is meaningful only when $\gamma$ is chosen such that $\ell$ is not a rare-event probability. In Chapter 8 we present a general procedure for estimating rare-event probabilities of the form $\ell = \mathbb{P}_u(S(X) \geq \gamma)$ for an arbitrary function $S(x)$ and level $\gamma$.
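A minimal sketch of the analytic updating formula, for the setting of Example 5.12 with illustrative parameters ($u = 1$ and a moderate $\gamma = 3$) and with $w = u$, so that (5.68) reduces to (5.70):

```python
import numpy as np

rng = np.random.default_rng(5)

# Analytic CE update (5.68)/(5.70) for Example 5.12 with illustrative values:
# H(x) = I{x >= gamma}, X ~ Exp(1/u), trial parameter w = u.
u, gamma, N = 1.0, 3.0, 100_000
x = rng.exponential(u, size=N)
h = (x >= gamma).astype(float)

# With w = u the likelihood ratio term W drops out and (5.68) becomes (5.70):
v_hat = np.sum(h * x) / np.sum(h)
print(f"CE update: v_hat = {v_hat:.3f}   (optimal v* = gamma + u = {gamma + u})")
```

By the memoryless property of the exponential distribution, the conditional mean of $X$ given $X \geq \gamma$ is $\gamma + u$, so this single weighted average reproduces $v^* = \gamma + u$ without any numerical optimization.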
EXAMPLE 5.13 (Finite Support Discrete Distributions)

Let $X$ be a discrete random variable with finite support, that is, $X$ can only take a finite number of values, say $a_1,\ldots,a_m$. Let $u_i = \mathbb{P}(X = a_i)$, $i = 1,\ldots,m$, and define $u = (u_1,\ldots,u_m)$. The distribution of $X$ is thus trivially parameterized by the vector $u$. We can write the density of $X$ as

$$f(x; u) = \sum_{i=1}^m u_i\, I_{\{x = a_i\}}\,.$$

From the discussion at the beginning of this section we know that the optimal CE and VM parameters coincide, since we optimize over all densities on $\{a_1,\ldots,a_m\}$. By (5.45) the VM (and CE) optimal density is given by

$$g^*(x) = \frac{H(x)\, f(x; u)}{\sum_j H(a_j)\, u_j}\,,$$

so that for any reference parameter $w$,

$$v_i^* = \frac{\mathbb{E}_w\big[H(X)\,W(X; u, w)\, I_{\{X = a_i\}}\big]}{\mathbb{E}_w\big[H(X)\,W(X; u, w)\big]}\,, \qquad i = 1,\ldots,m\,, \qquad (5.71)$$

provided that $\mathbb{E}_w[H(X)\,W(X; u, w)] > 0$. The vector $v^*$ can be estimated from the stochastic counterpart of (5.71), that is, as

$$\widehat v_i = \frac{\sum_{k=1}^N H(X_k)\,W(X_k; u, w)\, I_{\{X_k = a_i\}}}{\sum_{k=1}^N H(X_k)\,W(X_k; u, w)}\,, \qquad i = 1,\ldots,m\,,$$

where $X_1,\ldots,X_N$ is an iid sample from the density $f(\cdot; w)$. A similar result holds for a random vector $X = (X_1,\ldots,X_n)$, where $X_1,\ldots,X_n$ are independent discrete random variables with finite support, characterized by [...]

[...] be the vector of positions and (discrete) velocities of the target object at time $t = 0, 1, 2, \ldots$, and let $Y_t$ be the measured angle. The problem is to track the unknown state of the object $X_t$ based on the measurements $\{Y_t\}$ and the initial conditions.

Figure 5.3 Track the object via noisy measurements of the angle.

The process $(X_t, Y_t)$, $t = 0, 1, 2, \ldots$, is described by the following [...]

[...] presents a typical evolution of the sequence $\{\widehat v_t\}$ in the single-bridge model for the VM and VM-SCR methods at the second stage of Algorithm 5.9.1.

Table 5.6 Typical evolution of the sequence $\{\widehat v_t\}$ for the VM and VM-SCR methods. [...]

[...] iteration and a sample $N = 1000$ for the resulting importance sampling estimator. Define the efficiency of the importance sampling estimator $\hat\ell(u; v)$ relative to the CMC one $\hat\ell(u)$ as

$$\varepsilon = \frac{\operatorname{Var}\big(\hat\ell(u)\big)}{\operatorname{Var}\big(\hat\ell(u; v)\big)}\,.$$

Table 5.8 represents the original values $u_i$ and the reference values $v_i$ obtained by using (5.65). The numbers 1-32 correspond to the generators and the numbers 33-70 correspond to the lines. Note that the [...]

[...] = 1, while the remaining (nonbottleneck) values are set equal to 2. Note again that in this case both CE and VM found the true six bottlenecks.

Table 5.7 Performance of Algorithm 5.9.1 for the 3 x 10 model with six bottleneck elements and sample size $N = N_1 = 1000$. [...]

[...] while directly delivering the analytical solution of the stochastic counterpart of the program (5.104).

5.9 PREVENTING THE DEGENERACY OF IMPORTANCE SAMPLING

In this section, we show how to prevent the degeneracy of importance sampling estimators. The degeneracy of likelihood ratios in high-dimensional Monte Carlo simulation problems is one of the central topics in Monte Carlo simulation. To prevent degeneracy, [...]

[...] determine the nonbottleneck parameters, since it is likely that they will fluctuate around their nominal value $u_i$ and therefore $\widehat v_i$ will become negative or very small in one of the replications.

3. The advantage of Algorithm 5.9.1 compared to its gradient counterpart is that identification of the bottleneck elements in the former is based on the relative perturbations $\delta_i$ (see (5.108)) with respect to the known [...]

[...] sample sizes for updating $\widehat v$ and calculating the estimator $\hat\ell$ were $N = 10^3$ and $N = 10^5$, respectively. In the table RE denotes the estimated relative error.

Table 5.1 Iterating the five-dimensional vector $\widehat v$.

iteration   v_1      v_2      v_3      v_4      v_5      est. ell   RE
0           1        1        0.3      0.2      0.1      0.0643     0.0121
1           2.4450   2.3274   0.2462   0.2113   0.1030   0.0631     0.0082
2           2.3850   2.3894   0.3136   0.2349   0.1034   0.0644     0.0079
3           2.3559   2.3902   0.3472   0.2322   0.1047   0.0646     0.0080

[...] step, and the second involves an application of the SLR technique to the transformed pdf. To apply the first step, we simply write $X$ as a function of another random vector, say as

$$X = G(Z)\,. \qquad (5.100)$$

If we define $\widetilde H(Z) = H(G(Z))$, then estimating (5.92) is equivalent to estimating

$$\ell = \mathbb{E}\big[\widetilde H(Z)\big]\,. \qquad (5.101)$$

Note that the expectations in (5.92) and (5.101) are taken with respect to the original density of $X$ and the [...]

[...] estimator (5.102). The TLR Algorithm 5.8.1 ensures that as soon as the transformation $X = G(Z)$ is chosen, one can estimate $\ell$ using the TLR estimator (5.102) instead of the SLR estimator (5.59). Although the accuracy of both estimators (5.102) and (5.59) is the same (Rubinstein and Kroese [29]), the advantage of the former is its universality and its ability to avoid the computational burden [...]
[...] those lines and generators for which $\widehat v_i$ differs from $u_i$, $i = 1,\ldots,n$, by at least 0.001.

Table 5.8 The original parameters $u_i$ and the reference parameters $\widehat v_i$ obtained from (5.65). [...]
