Some statistical issues in population genetics

SOME STATISTICAL ISSUES IN POPULATION GENETICS

KHANG TSUNG FEI
(B.Sc.(Hons), M.Sc.), University of Malaya

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF STATISTICS & APPLIED PROBABILITY
NATIONAL UNIVERSITY OF SINGAPORE
2009

ACKNOWLEDGMENT

I would like to thank Von Bing for guidance, encouragement and support. Throughout this thesis, I have deliberately used the first person plural pronoun to remind readers that the results here are really fruits of our joint labour. I must thank the department for financially assisting me through a part-time teaching job during the last semester, which enabled me to concentrate on the thesis work. I am grateful to Dr Siegfried Krauss (Australian Botanic Gardens and Parks Authority), Miss Sharon Sim (National University of Singapore), Dr Yu Dahui (Chinese Academy of Fishery Sciences), and Dr Lene Rostgaard Nielsen (University of Copenhagen) for kindly providing the data sets which have been crucial in demonstrating some results in the present work. Finally, a special note of thanks to my parents and my wife Wai Jin, for having tolerated my lengthy absence from home for so long. It is with much pleasure that I dedicate this work to them.

TABLE OF CONTENTS

Summary

Chapter 1 Statistics in Population Genetics
  1.1.0 Introduction
  1.2.0 Biological Preliminaries
  1.3.0 Background on Problems and Thesis Organisation

Chapter 2 Molecular Data Analysis in Diploids using Multilocus Dominant DNA Markers
  2.1.0 Introduction
  2.2.0 Estimators of null allele frequency: background
    2.2.1 Estimators of null allele frequency: new results
  2.3.0 Estimators of locus-specific heterozygosity: theory and new results
  2.4.0 Correcting for ascertainment bias in the estimation of average heterozygosity
    2.4.1 Maximum likelihood estimation of average heterozygosity
    2.4.2 Simulation studies
  2.5.0 Data analysis
  2.6.0 Discussion
  2.7.0 Summary

Chapter 3 Estimation of Wright’s Fixation Indices: a Reevaluation
  3.1.0 Introduction
    3.1.1 Wright’s fixation indices: theory
    3.1.2 Wright’s fixation indices: estimation
  3.2.0 Estimating Wright’s fixation indices under equal weight assumption when true weights are known: simulation results
    3.2.1 Data analysis
  3.3.0 Estimating Wright’s fixation indices using dominant marker data
  3.4.0 Summary

Chapter 4 Categorical Analysis of Variance in Studies of Genetic Variation
  4.1.0 Introduction to the analysis of variance
    4.1.1 Fixed, random and mixed effects models in ANOVA
  4.2.0 The analysis of variance for categorical data
    4.2.1 Hypothesis testing in CATANOVA
    4.2.2 Multivariate CATANOVA
  4.3.0 The analysis of molecular variance
  4.4.0 A truncation algorithm for removing correlated binary variables
  4.5.0 Comparison between CATANOVA and AMOVA: theoretical results
    4.5.1 Data analysis
    4.5.2 Sensitivity analysis
  4.6.0 Discussion
  4.7.0 Summary

Chapter 5 Concluding Remarks

Bibliography

Appendix A
Appendix B
Appendix C

SUMMARY

We report three improvements to existing statistical methodology that uses dominant marker data to infer population genetic structure. Through a comparative study of the bias, standard error and root mean square error (RMSE) of candidate estimators of locus-specific null allele frequency and heterozygosity, we demonstrate the efficacy of our proposed zero-correction procedure in reducing RMSE. Next, we show that ascertainment bias induced by dominant marker-based methodologies can be corrected using a suitable linear transform of sample average heterozygosity, leading to a nearly unbiased empirical Bayes estimator. Subsequently, we propose two ways of evaluating the maximum likelihood estimator of average heterozygosity: one using the truncated beta-binomial likelihood, and another using the expectation-maximisation algorithm. Simulation studies show that both have negligible bias, and that their RMSE may be lower than that of the empirical Bayes estimator. Finally, we argue that the categorical analysis of variance (CATANOVA) framework, instead of the commonly used analysis of molecular variance (AMOVA), is the appropriate one for analysing genetic structure in a collection of populations where interest is intrinsically centred on the latter. In the simplest nonhierarchical case, we show that the proportion of total variation attributed to population labels implicitly estimates a measure of genetic differentiation, which we call $\gamma$.
When alleles in a locus correspond to categories in CATANOVA, we show that $\gamma$ is Wright’s $F_{ST}$ if the number of alleles is two, and Nei’s $G_{ST}$ if more. Using simulated data based on actual data sets, we reveal that the choice of parameter, either the average of locus-specific $\gamma$ ($\bar{\gamma}$) or the compound parameter $\gamma^{M}$, which weighs each locus equally, can potentially lead to conflicts in interpreting population genetic structure. Further simulations show that $\bar{\gamma}$ is more or less insensitive to differences in relative sample sizes of the populations, compared to $\gamma^{M}$. This finding suggests that conclusions regarding the relative contribution of population labels to total genetic variation based on estimates of $\gamma^{M}$ are premature.

LIST OF TABLES

2.4.0.1 Correction factor used in zero-corrected estimators when estimating average heterozygosity under four common beta profiles.
2.4.2.1 Comparison of RMSE (magnified 100 times) of estimators of $\bar{h}$ with and without (indicated by †) correction for ascertainment bias based on simulated data.
2.5.0.1 Estimates of $\bar{h}$ obtained using the candidate estimators, with and without correction for ascertainment bias. We estimated the SE by bootstrapping over individuals (500 iterations). We assumed that $\bar{h} = 0.233$, which is obtained by plugging in $\hat{a} = 0.63$ and $\hat{b} = 0.38$ into (2.4.0.1). The (approximate) theoretical bias is the difference between the expectation of a candidate estimator and 0.233; the apparent bias is the difference between the estimate of $\bar{h}$ returned by a candidate estimator and 0.233. The zero bias of the ML estimator (indicated by ∗) refers to asymptotic bias.
2.5.0.2 Estimates of $\bar{h}$ for SVA and KOR populations obtained using the candidate estimators. Because the complete data were inaccessible to us, we could not perform bootstrapping across individuals to obtain the SE. Therefore, we calculated the SE by dividing the standard deviation of $\hat{h}$ by the square root of the number of loci (not possible for the ML estimator).
3.2.0.1 Specific sets of p corresponding to three levels of Factor used in the simulation.
3.2.0.2 Specific sets of w corresponding to three levels of Factor used in the simulation.
3.2.0.3 Simulation scenarios used in the present study. The dagger symbol indicates scenarios where estimates of Wright’s fixation indices have bias 0.1 or less under equal weight assumption.
3.2.1.1 Haptoglobin genotype counts in Chinese, Malay and Indian samples from Singapore.
3.2.1.2 Estimates of Wright’s fixation indices using true and equal weights.
3.3.0.1 Estimates of average $F_{ST}$ for the SVA and KOR populations, using codominant and dominant marker data under equal weight assumption.
4.1.0.1 Standard tabulation of one-way ANOVA results.
4.2.2.1 A 2 × 2 table for displaying the joint probabilities for two binary variables $B_k$ and $B_l$. The two binary categories are indicated as 0 and 1. Abbreviations: r.m. (row marginal); c.m. (column marginal).
4.3.0.1 General structure of a hierarchical random effects AMOVA table.
4.3.0.2 The one-way, nonhierarchical random effects AMOVA table.
4.4.0.1 Testing $H_0$ for two sets of loci, using Pearson chi-squared and CATANOVA C-statistics. The p-values are indicated in parentheses.
4.5.1.1 Comparison of $\hat{\gamma}_C^{M}$ and $\hat{\gamma}_A^{M}$ before and after applying the truncation procedure. Values for the latter are indicated in parentheses.
4.5.1.2 Estimating $\bar{\gamma}$ and $\sigma_\gamma$ using CATANOVA and AMOVA, before and after truncation of loci (in parentheses).
4.5.1.3 Estimates of $\bar{\gamma}$, $\sigma_\gamma$ and $\gamma^{M}$ using CATANOVA and AMOVA for the three AFLP data sets.
4.5.2.1 Effects of different combinations of estimated $w_j$ on $\hat{\gamma}_C$. The entries in w are in the order: African, Caucasian and Oriental populations.
4.5.2.2 Effects of different combinations of estimated $w_j$ on $\hat{\gamma}^{M}$, $\hat{\bar{\gamma}}_C$ and $\sigma_{\gamma C}$. The entries in w are in the order: African, Caucasian and Oriental populations.
4.5.2.3 Values of $\gamma^{M}$, $\bar{\gamma}$ and $\sigma_\gamma$ assuming that the population weights are equal to the relative sample sizes. The entries in the vector of sample sizes are in the order: African, Caucasian and Oriental populations.
4.5.2.4 Effects of balanced and unbalanced sample sizes on estimators of $\gamma^{M}$, $\bar{\gamma}$ and $\sigma_\gamma$ (SE attached). The entries in the vector of sample sizes are in the order: African, Caucasian and Oriental populations.

LIST OF FIGURES

2.1.0.1 Some distribution profiles of the beta distribution with parameters a, b (abbreviation: Be(a, b)).
2.2.1.1 Bias profiles of estimators of q, with n = 20.
2.2.1.2 Standard error profiles of estimators of q, with n = 20.
2.2.1.3 Root mean square error profiles of estimators of q, with n = 20.
2.2.1.4 Ratio of variances profiles of estimators of q, with n = 20.
2.3.0.1 Bias profiles of estimators of h, with n = 20.
2.3.0.2 Standard error profiles of estimators of h, with n = 20.
2.3.0.3 Root mean square error profiles of estimators of h, with n = 20.
2.4.0.1 Bias of estimators of $\bar{h}$ according to beta profiles, with n = 20, 100, 200 and m = 100. Colour legend for estimators: green (square root), red (LM), blue (jackknife), black (Bayes). Dashed lines indicate zero-corrected forms.
2.4.0.2 Standard error of estimators of $\bar{h}$ according to beta profiles, with n = 20, 100, 200 and m = 100. Colour legend for estimators: green (square root), red (LM), blue (jackknife), black (Bayes). Dashed lines indicate zero-corrected forms.
2.4.0.3 Root mean square error of estimators of $\bar{h}$ according to beta profiles, with n = 20, 100, 200 and m = 100. Colour legend for estimators: green (square root), red (LM), blue (jackknife), black (Bayes). Dashed lines indicate zero-corrected forms.
2.4.2.1 Distribution of errors under four common beta distribution profiles when estimating $\bar{h}$ using four methods: ML with no correction for ascertainment bias (A), ML with correction for ascertainment bias (B) using the likelihood (2.4.1.1), ML with correction for ascertainment bias (C) using the EM algorithm, and empirical Bayes with correction for ascertainment bias (D).
2.4.2.2 Convergence behaviour of iterations of the EM algorithm when estimating a and b. This data set was simulated with a = b = 0.8, n = 20 and l = 100.
2.5.0.1 Empirical distribution of the null homozygote proportion in the SUB population, with fitted beta density ($\hat{a} = 0.63$; $\hat{b} = 0.38$). Chi-squared goodness-of-fit test: p-value = 0.12; 18 d.f.

CHAPTER 5
CONCLUDING REMARKS

Much of the statistical analysis of dominant marker data still relies crucially on the HW model. As the results in Table 2.5.0.2 show, violations of the HW model in a majority of the loci considered may seriously skew the resulting estimates of average heterozygosity. Realising the obstacle to standard statistical inference induced by dominance, researchers such as Holsinger et al. (2002) turned to a Bayesian framework for estimating $F_{IS}$ and $F_{ST}$ (Chapter 3). For a general discussion of the efficacy of Bayesian methods in tackling nonidentifiable models, see Neath and Samaniego (1997). The Bayesian approach of Holsinger et al. makes two important assumptions. First, the null allele frequency (instead of the null homozygote proportion) in the ith locus is assumed to follow $\mathrm{Be}(a_i, b_i)$ over J populations (J > 1). Second, in each population, all loci are assumed to have the same inbreeding coefficient $F_{IS}$. It has been reported that estimates of $F_{IS}$ obtained using the Bayesian approach described in Holsinger et al. (2002) are unreliable (see Holsinger 2003 and Levsen et al. 2008). Foll et al. (2008) suggested that the problem was caused by failure to correct for ascertainment bias, and subsequently proposed an Approximate Bayesian Computation (ABC) algorithm that appears to improve the method of Holsinger et al. We believe, however, that further evidence is necessary for judging the efficacy of their Bayesian method in overcoming model nonidentifiability. For example, using data simulated under $F_{IS} = 0$, can it be consistently demonstrated that the ABC algorithm returns estimates of $F_{IS}$ that are close to zero for different values of a and b (assuming all $a_i = a$ and $b_i = b$) of the beta distribution?

We think there is much scope for further research into categorical ANOVA. It appears that the theory of categorical ANOVA has been developed mainly along the lines of testing the null hypothesis of homogeneity of proportions among G groups (Light and Margolin 1971; Margolin and Light 1974; Singh 1993; Singh 1996). Chakraborty (1988) pointed out that the latter is equivalent to testing $F_{ST} = 0$. For the multivariate case, Pinheiro et al. (2000) derived test statistics for doing so when the K loci considered are stochastically independent. Onukogu (1985, 1986) extended Light and Margolin’s work to two-way CATANOVA. Although testing the null hypothesis of homogeneity of proportions among G groups is a worthwhile pursuit from a mathematical perspective, it is unlikely to be of much value to biologists, since different groups are expected to differ in at least some of the K loci considered (especially if K is large); failure to reject is thus more likely a consequence of low power of the test than of the absence of differences in proportions among groups. It is perhaps more useful to study more sophisticated ways of apportioning total genetic variation to different sources of variation under the CATANOVA framework.
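As a rough illustration of the apportioning of variation discussed above, the following minimal Python sketch partitions the variation at a single locus using the sums of squares for categorical data as they are commonly stated for Light and Margolin's (1971) one-way CATANOVA. The function name `catanova_partition`, the result type and the example counts are hypothetical and are not taken from this thesis; the ratio BSS/TSS computed here is the plain sample proportion of variation between groups, not necessarily the estimator of $\gamma$ studied in Chapter 4.

```python
from typing import NamedTuple, Sequence


class CatanovaResult(NamedTuple):
    tss: float     # total sum of squares
    wss: float     # within-group sum of squares
    bss: float     # between-group sum of squares
    r2: float      # proportion of variation between groups (BSS / TSS)
    c_stat: float  # C statistic, (n - 1)(I - 1) * BSS / TSS


def catanova_partition(counts: Sequence[Sequence[int]]) -> CatanovaResult:
    """Partition variation in an I x G table of counts.

    counts[i][j] is the number of observations in category i (e.g. an
    allele) for group j (e.g. a population). Illustrative helper only.
    """
    n_cat = len(counts)
    n_grp = len(counts[0])
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(row[j] for row in counts) for j in range(n_grp)]
    n = sum(row_totals)
    if any(total == 0 for total in col_totals):
        raise ValueError("every group needs at least one observation")

    # TSS = n/2 - (1 / 2n) * sum_i n_i.^2
    tss = n / 2 - sum(r * r for r in row_totals) / (2 * n)
    if tss == 0:
        raise ValueError("no variation: only one category was observed")

    # WSS = n/2 - (1/2) * sum_j (1 / n_.j) * sum_i n_ij^2
    wss = n / 2 - 0.5 * sum(
        sum(row[j] ** 2 for row in counts) / col_totals[j]
        for j in range(n_grp)
    )
    bss = tss - wss
    r2 = bss / tss
    # Under homogeneity, C is approximately chi-squared with
    # (I - 1)(G - 1) degrees of freedom.
    c_stat = (n - 1) * (n_cat - 1) * r2
    return CatanovaResult(tss, wss, bss, r2, c_stat)


if __name__ == "__main__":
    # Hypothetical counts: two alleles (rows) in three populations (columns).
    example = [[30, 10, 20], [10, 30, 20]]
    print(catanova_partition(example))
```

With identical category proportions in every group, the between-group share is zero; with completely different categories in each group, it is one. Reporting this share per locus, rather than only a pooled test statistic, is one simple way of looking at how variation is distributed across loci.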
As the present work has demonstrated the deficiencies of AMOVA, and the advantages of CATANOVA in analysing the genetic structure of a fixed set of populations, it would be useful to develop software to help biologists make a gradual shift from AMOVA to CATANOVA.

It will not be long before almost entire genomes of individuals are compared on a routine basis, as recent so-called next-generation sequencing techniques drive the cost of DNA sequencing further down (New Scientist 2009). Many more polymorphic sites, probably several orders of magnitude more than those generated by AFLP, will be found. With sound grounding in theory, the distributional approach using locus-specific $\gamma$ should provide a powerful means of understanding single nucleotide polymorphism variation in populations of a species.

Finally, we note that the sampling aspects of problems involving two-stage sampling have not been sufficiently emphasised in practical problems. Perhaps nothing much can be done about the sampling of loci using markers; we can only hope that the latter has been sufficiently representative, so that results that depend on the assumption of random sampling of loci are valid. Conversely, there could be some room for improvement at the stage of sampling individuals from populations. For studies of mobile species, the difficulties involved in trying to obtain a random sample of individuals are pronounced. No general theory seems sufficient, as the actual sampling will need to consider the behaviour of the subject species. Things are simpler in studies of plant species. Although the assumption of simple random sampling or stratified random sampling simplifies subsequent analyses, in practice random sampling of individuals is often logistically infeasible. Take, for example, a “random sampling” of individuals from a plant population in a forest.
Because sampling is only possible in accessible parts of the forest, not all individuals are equally likely to be sampled (Lowe et al. 2004). On the other hand, cluster sampling (see Thompson 2002) may be more practical. The area of study is divided into a number of equally sized blocks, a random sample of the blocks is first obtained, and then all units of interest in each sampled block are sampled. This requires much hard work, but the outcome of the analysis would command stronger confidence. Under cluster sampling, estimators of parameters of interest no longer have the usual standard error associated with simple random sampling or stratified random sampling. These standard errors need to be carefully reconsidered.

BIBLIOGRAPHY

Adeyemo, A.A., Chen, G., Chen, Y. and Rotimi, C. (2005) Genetic structure in four West African population groups. BMC Genetics, 6, Article 38, pp. [electronic journal].
Bickel, P.J. and Doksum, K.A. (2001) Mathematical statistics: basic ideas and selected topics, vol.1. 2nd ed. Upper Saddle River, NJ: Prentice Hall.
Bishop, Y.M., Fienberg, S.E. and Holland, P.W. (1975) Discrete multivariate analysis: theory and practice. Cambridge, MA: MIT Press.
Bonin, A., Ehrich, D. and Manel, S. (2007) Statistical analysis of amplified fragment length polymorphism data: a toolbox for molecular ecologists and evolutionists. Mol. Ecol., 16, 3737-3758.
Brown, A.H.D. (1970) The estimation of Wright’s fixation index from genotypic frequencies. Genetica, 41, 399-406.
Carlin, B.P. and Louis, T.A. (2000) Bayes and empirical Bayes methods for data analysis. 2nd ed. Boca Raton: Chapman and Hall.
Carter, K. and Worwood, M. (2007) Haptoglobin: a review of the major allele frequencies worldwide and their association with diseases. Int. J. Lab. Hem., 29, 92-110.
Chakraborty, R. (1988) Analysis of genetic structure of a population and its associated statistical problems. Sankhya, Ser. B, 50(3), 327-349.
Chakraborty, R. and Danker-Hopfe, H. (1991) Analysis of population structure: a comparative study of different estimators of Wright’s fixation indices. In: Rao, C.R. and Chakraborty, R., eds. Handbook of statistics, vol.8. Holland: Elsevier Science, pp. 203-254.
Chakraborty, R., Chakravarti, A. and Malhotra, K.C. (1977) Variation in allele frequencies among caste groups of the Dhangars of Maharashtra, India: an analysis with Wright’s $F_{ST}$ statistic. Ann. Hum. Biol., 4(3), 275-280.
Cochran, W.G. (1980) Fisher and the analysis of variance. In: Fienberg, S.E. and Hinkley, D.V., eds. R.A. Fisher: an appreciation. New York: Springer-Verlag, pp. 17-34.
Cockerham, C.C. (1969) Variance of gene frequencies. Evolution, 23, 72-84.
Cockerham, C.C. (1973) Analyses of gene frequencies. Genetics, 74, 679-700.
Cox, D.R. (1972) The analysis of multivariate binary data. Appl. Statist., 21, 113-120.
Crow, J.F. (1999) Hardy, Weinberg and language impediments. Genetics, 152, 821-825.
Crowley, P.H. (1992) Resampling methods for computation-intensive data analysis in ecology and evolution. Ann. Rev. Ecol. Syst., 23, 405-447.
Curie-Cohen, M. (1982) Estimates of inbreeding in a natural population: a comparison of sampling properties. Genetics, 100, 339-358.
Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B, 39, 1-38.
Doss, H. and Sethuraman, J. (1989) The price of bias reduction when there is no unbiased estimate. Ann. Statist., 17, 440-442.
Efron, B. and Tibshirani, R. (1993) An introduction to the bootstrap. New York: Chapman and Hall.
Eisenhart, C. (1947) The assumptions underlying the analysis of variance. Biometrics, 3(1), 1-21.
Excoffier, L. (2009) Arlequin’s home on the web. Available from: http://lgb.unige.ch/arlequin. [Accessed September 2009].
Excoffier, L., Smouse, P.E. and Quattro, J.M. (1992) Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics, 131, 479-491.
Excoffier, L., Laval, G. and Schneider, S. (2005) Arlequin ver 3.0: an integrated software package for population genetics data analysis. Evol. Bioinfor. Online, 1, 47-50.
Everitt, B.S. (1992) Analysis of contingency tables. 2nd ed. Boca Raton: Chapman and Hall.
Fisher, R.A. (1991) Statistical methods, experimental design, and scientific inference. New York: Oxford University Press.
Falush, D., Stephens, M. and Pritchard, J.K. (2007) Inference of population structure using multilocus genotype data: dominant markers and null alleles. Mol. Ecol. Notes, 7(4), 574-578.
Foll, M., Beaumont, M.A. and Gaggiotti, O. (2008) An approximate Bayesian computation approach to overcome biases that arise when using amplified fragment length polymorphism markers to study population structure. Genetics, 179, 927-939.
Good, P. (1994) Permutation tests: a practical guide to resampling methods for testing hypotheses. New York: Springer-Verlag.
Griffiths, D.A. (1973) Maximum likelihood estimation for the beta-binomial distribution and an application to the household distribution of the total number cases of a disease. Biometrics, 29, 637-648.
Haber, M. (1980) Detection of inbreeding effects by the $\chi^2$ test on genotypic and phenotypic frequencies. Am. J. Hum. Genet., 32, 754-760.
Harihara, S., Saitou, N., Hirai, M., Gojobori, T., Park, K.S., Misawa, S., Ellepola, S.B., Ishida, T. and Omoto, K. (1988) Mitochondrial DNA polymorphism among five Asian populations. Am. J. Hum. Genet., 3, 134-143.
Hartl, D.L. and Clark, A.G. (1989) Principles of population genetics. 2nd ed. Sunderland, MA: Sinauer Associates.
Hoaglin, D.C., Mosteller, F. and Tukey, J.W. (1991) Fundamentals of exploratory analysis of variance. New York: John Wiley.
Holsinger, K.E., Lewis, P.O. and Dey, D.K. (2002) A Bayesian approach to inferring population structure from dominant markers. Mol. Ecol., 11, 1157-1164.
Holsinger, K.E. (2003) Hickory – software for genetic analysis. Available from: http://darwin.eeb.uconn.edu/hickory/hickory.html. [Accessed 17 August 2009].
Honnay, O., Coart, E., Butaye, J., Adriaens, D., Van Glabeke, S. and Roldán-Ruiz, I. (2006) Low impact of present and historical landscapes configuration on the genetics of fragmented Anthyllis vulneraria populations. Biol. Conservation, 127, 411-419.
Isabel, N., Beaulieu, J., Thériault, P. and Bousquet, J. (1999) Direct evidence for biased gene diversity estimates from dominant random amplified polymorphic DNA (RAPD) fingerprints. Mol. Ecol., 8, 477-483.
Johnson, R.A. and Wichern, D.W. (2002) Applied multivariate statistical analysis. 5th ed. Upper Saddle River, NJ: Pearson Education International.
Johnson, N.L., Kotz, S. and Kemp, A.W. (1992) Univariate discrete distributions. 2nd ed. New York: John Wiley.
Johnson, M.J., Wallace, D.C., Ferris, S.D., Rattazzi, M.C. and Cavalli-Sforza, L.L. (1983) Radiation of human mitochondria DNA types analyzed by restriction endonuclease cleavage pattern. J. Mol. Evol., 19, 255-271.
Kendall, M. and Stuart, A. (1979) The advanced theory of statistics, vol.2. 4th ed. London: Charles Griffin.
Kendall, M., Stuart, A. and Ord, K. (1987) The advanced theory of statistics, vol.1. 5th ed. London: Charles Griffin.
Kimura, M. (1983) The neutral theory of molecular evolution. Cambridge: Cambridge University Press.
Krauss, S.L. (1997) Low genetic diversity in Persoonia mollis (Proteaceae), a fire-sensitive shrub occurring in a fire-prone habitat. Heredity, 78, 41-49.
Krauss, S.L. (2000a) Accurate gene diversity estimates from amplified fragment length polymorphism (AFLP) markers. Mol. Ecol., 9, 1241-1245.
Krauss, S.L. (2000b) Patterns of mating in Persoonia mollis (Proteaceae) revealed by an analysis of paternity using AFLP: implications for conservation. Aust. J. Bot., 48, 349-356.
Kremer, A., Caron, H., Cavers, S., Colpaert, N., Gheysen, G., Gribel, R., Lemes, M., Lowe, A.J., Margis, R., Navarro, C. and Salgueiro, F. (2005) Monitoring genetic diversity in tropical trees with multilocus dominant markers. Heredity, 95, 274-280.
Lee, S.M.C. (1988) Intermarriage and ethnic relations in Singapore. J. Marr. Fam., 50, 255-265.
Levsen, N.D., Crawford, D.J., Archibald, J.K., Santos-Guerra, A. and Mort, M.E. (2008) Nei’s to Bayes’: comparing computational methods and genetic markers to estimate patterns of genetic variation in Tolpis (Asteraceae). Am. J. Bot., 95, 1466-1474.
Lewontin, R.C. and Cockerham, C.C. (1959) The goodness-of-fit test for detecting natural selection in random mating populations. Evolution, 13, 561-564.
Li, C.C. (1955) Population genetics. Chicago: University of Chicago Press.
Li, C.C. (1976) First course in population genetics. Pacific Grove, CA: Boxwood Press.
Li, C.C. and Horvitz, D.G. (1953) Some methods of estimating the inbreeding coefficient. Am. J. Hum. Genet., 5(2), 107-117.
Light, R.J. and Margolin, B.H. (1971) An analysis of variance for categorical data. J. Am. Statist. Assoc., 66, 534-544.
Liu, Z. and Furnier, G.R. (1993) Comparison of allozyme, RFLP, and RAPD markers for revealing genetic variation within and between aspen and bigtooth aspen. Theor. Appl. Genet., 87, 97-105.
Lowe, A., Harris, S. and Ashton, P. (2004) Ecological genetics: design, analysis and application. UK: Blackwell Publishing.
Lynch, M. and Milligan, B.G. (1994) Analysis of population genetic structure with RAPD markers. Mol. Ecol., 3, 91-99.
Maeda, N., Yang, F., Barnett, D.R., Bowman, B.H. and Smithies, O. (1984) Duplication within the haptoglobin Hp2 gene. Nature, 309, 131-135.
Magurran, A.E. (1988) Ecological diversity and its measurement. New Jersey: Princeton University Press.
Margolin, B.H. and Light, R.J. (1974) An analysis of variance for categorical data, II: small sample comparisons with chi square and other competitors. J. Am. Statist. Assoc., 69, 755-764.
Maritz, J.S. and Lwin, T. (1989) Empirical Bayes methods. 2nd ed. London: Chapman and Hall.
Mechanda, S.M., Baum, B.R., Johnson, D.A. and Arnason, J.T. (2004) Sequence assessment of comigrating AFLP bands in Echinacea - implications for comparative biological studies. Genome, 47, 15-25.
Meudt, H.M. and Clarke, A.C. (2007) Almost forgotten or latest practice? AFLP applications, analyses and advances. Trends Plant Sci., 12, 106-117.
Michalakis, Y. and Excoffier, L. (1996) A generic estimation of population subdivision using distances between alleles with special reference for microsatellite loci. Genetics, 142, 1061-1064.
Neath, A.A. and Samaniego, F.J. (1997) On the efficacy of Bayesian inference for nonidentifiable models. Am. Statist., 51, 225-232.
Nei, M. (1973) Analysis of gene diversity in subdivided populations. Proc. Natl. Acad. Sci. USA, 70, 3321-3323.
Nei, M. (1977) F-statistics and analysis of gene diversity in subdivided populations. Ann. Hum. Genet., 41, 225-233.
Nei, M. (1986) Definition and estimation of fixation indices. Evolution, 40(3), 643-645.
Nei, M. (1987) Molecular evolutionary genetics. New York: Columbia University Press.
Nei, M. and Chakravarti, A. (1977) Drift variances of $F_{ST}$ and $G_{ST}$ statistics obtained from a finite number of isolated populations. Theor. Popul. Biol., 11, 291-306.
Nei, M. and Chesser, R.K. (1983) Estimation of fixation indices and gene diversities. Ann. Hum. Genet., 47, 253-259.
Nelder, J.A. and Mead, R. (1965) A simplex method for function minimization. Computer J., 7, 308-313.
New Scientist. (2009) $5000 to read DNA. New Scientist, 201(2695), 7.
Nielsen, L.R. (2004) Molecular differentiation within and among island populations of the endemic plant Scalesia affinis (Asteraceae) from the Galápagos Islands. Heredity, 93, 434-442.
Onukogu, I.B. (1985) An analysis of variance of nominal data. Biometrical J., 4, 375-383.
Onukogu, I.B. (1986) A multivariate analysis of variance of categorical data. Biometrical J., 5, 617-627.
Peakall, R., Smouse, P.E. and Huff, D.R. (1995) Evolutionary implications of allozyme and RAPD variation in diploid populations of dioecious buffalograss Buchloë dactyloides. Mol. Ecol., 4, 135-147.
Pearson, H. (2006) What is a gene? Nature, 441, 399-401.
Pérez, T., Albornoz, J. and Domínguez, A. (1998) An evaluation of RAPD fragment reproducibility and nature. Mol. Ecol., 7, 1347-1357.
Pinheiro, H.P., Seillier-Moiseiwitsch, F., Sen, P.K. and Eron, J. (2000) Genomic sequences and quasi-multivariate CATANOVA. In: Sen, P.K. and Rao, C.R., eds. Handbook of statistics, vol.18. Amsterdam: Elsevier, pp. 713-746.
Pritchard, J.K., Stephens, M. and Donnelly, P. (2000) Inference of population structure using multilocus genotype data. Genetics, 155, 945-959.
Quenouille, M.H. (1956) Notes on bias in estimation. Biometrika, 43, 353-360.
R Development Core Team. (2007) R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. Available from: http://www.r-project.org.
Rieseberg, L.H. (1996) Homology among RAPD fragments in interspecific comparisons. Mol. Ecol., 5, 99-105.
Rogerson, S. (2006) What is the relationship between haptoglobin, malaria, and anaemia? PLoS Med., 3(5), 593-594.
Saha, N. and Ong, Y.W. (1984) Distribution of haptoglobins in different dialect groups of Chinese, Malays and Indians. Ann. Acad. Med. Singapore, 13, 498-501.
Sahai, H. and Ojeda, M.M. (2003) Analysis of variance for random models: theory, methods, applications and data analysis. Boston: Birkhauser.
Saiki, R., Scharf, S., Faloona, F., Mullis, K., Horn, G. and Erlich, H. (1985) Enzymatic amplification of beta-globin genomic sequences and restriction site analysis for diagnosis of sickle cell anaemia. Science, 230, 1350-1354.
Scheffé, H. (1967) The analysis of variance. New York: John Wiley.
Scheinfeldt, L.B., Friedlaender, F.R., Friedlaender, J.S., Latham, K., Koki, G., Karafet, T., Hammer, M. and Lorenz, J. (2007) Y chromosome variation in Northern Island Melanesia. In: Friedlaender, J.S., ed. Genes, language, and culture history in the Southwest Pacific. New York: Oxford University Press, pp. 81-95.
Searle, S.R., Casella, G. and McCulloch, C.E. (1992) Variance components. New York: John Wiley.
Singapore Department of Statistics. (2001) Products and services - publication catalogue census of population 2000. Available from: http://www.singstat.gov.sg/pubn/popn/c2000sr1/cop2000sr1.pdf. [Accessed 17 December 2008].
Sim, S. (2007) Population genetic studies of two tropical lowland rain forest tree species in the forest fragments of Singapore using AFLP. Unpublished M.Sc. thesis. National University of Singapore.
Simpson, E.H. (1949) The measurement of diversity. Nature, 163, 688.
Singh, B. (1993) On the analysis of variance method for nominal data. Sankhya Ser. B, 55(1), 40-47.
Singh, B. (1996) On CATANOVA method for analysis of two-way classified nominal data. Sankhya Ser. B, 58(3), 379-388.
Snustad, D.P. and Simmons, M.J. (2006) Principles of genetics. 4th ed. USA: John Wiley.
Sturtevant, A.H. (1965) A history of genetics. New York: Harper and Row.
Szmidt, A.E., Wang, X. and Lu, M. (1996) Empirical assessment of allozyme and RAPD variation in Pinus sylvestris (L.) using haploid tissue analysis. Heredity, 76, 412-420.
Tero, N., Aspi, J., Siikamäki, P., Jäkäläniemi, A. and Tuomi, J. (2003) Genetic structure and gene flow in a metapopulation of an endangered plant species, Silene tatarica. Mol. Ecol., 12, 2073-2085.
Thompson, S.K. (2002) Sampling. 2nd ed. New York: John Wiley.
Van Dongen, S. (1995) How should we bootstrap allozyme data? Heredity, 74, 445-447.
Van Dongen, S. and Backeljau, T.
(1995) One-and two-sample tests for single-locus inbreeding coefficients using the bootstrap. Heredity, 74, 129-135. Vekemans, X., Beauwens, T., Lemaire, M. and Rold´ an-Ruiz, I. (2002) Data from amplified fragment length polymorphism (AFLP) markers show indication of size homoplasy and of a relationship between degree of homoplasy and fragment size. Mol. Ecol., 11, 139-151. Vos, P., Hogers, R., Bleeker, M., Reijans, M., van der Lee, T., Hornes, M., Friters, A., Pot., J, Paleman, J., Kuiper, M. and Zabeau, M. (1995) AFLP: a new technique for DNA fingerprinting. Nucleic Acids Res., 23, 4407-4414. Ward, R.H. and Sing, C.F. (1970) A consideration of the power of the χ2 test to detect inbreeding effects in natural populations. Am. Nat., 104, 355-365. Weir, B.S. (1996) Genetic data analysis II. Sunderland, MA: Sinauer Associates. Weir, B.S. and Cockerham, C.C. (1984) Estimating F -statistics for the analysis of population structure. Evolution, 38, 1358-70. Williams, J.G.K., Kubelik, A.R., Livak, K.J., Rafalski, J.A. and Tingey, S.V. (1990) DNA polymorphisms amplified by arbitrary primers are useful as genetic markers. Nucleic Acids Res., 18, 6531-6535. Wright, S. (1943) Isolation by distance. Genetics, 28, 114-138. Wright, S. (1951) The genetical structure of populations. Ann. Eugen., 15, 323-254. Wright, S. (1969) Evolution and the genetics of populations: the theory of gene frequencies, vol.2. Chicago: University of Chicago Press. Wright, S. (1978) Evolution and the genetics of populations: variability within and among natural populations, vol.4. Chicago: University of Chicago Press. Bibliography 144 Yu, D.H. and Chu, K.H. (2006) Low genetic differentiation among widely separated populations of the pearl oyster Pinctada fucata as revealed by AFLP. J. Exp. Mar. Biol. Ecol., 333, 140-146. Zhang, B. and Horvath, S. (2005) A general framework for weighted gene co-expression network analysis. Statist. Appl. Genet. Mol. Biol., 4(1), Article 17, 43 pp [electronic journal]. 
Zhivotovsky, L.A. (1999) Estimating population structure in diploids with multilocus dominant DNA markers. Mol. Ecol., 8, 907-913.

APPENDIX A

Locus             N       H       D       n                F̂
OPA9-230        (0)     (1) 18 (19) 18 (20)            - (-)
OPA9-320     11 (7)    (11)     (2) 18 (20)  -0.241 (-0.173)
OPA9-340        (7) 12 (13)     (0) 18 (20)  -0.403 (-0.481)
OPA9-400        (1)    (15)     (4) 18 (20)   0.084 (-0.535)
OPA9-450        (0) 13 (10)    (10) 20 (20)  -0.481 (-0.333)
OPA9-650        (0)     (7) 18 (13) 20 (20)  -0.053 (-0.212)
OPA9-700        (2) 12 (10)     (8) 20 (20)  -0.429 (-0.099)
OPA9-750        (0)     (4) 17 (16) 20 (20)  -0.081 (-0.111)
OPA9-800        (1)     (9) 17 (10) 20 (20)  -0.081 (-0.129)
OPA9-1300       (0)     (8) 16 (12) 20 (20)  -0.111 (-0.250)
OPA9-1380       (3)    (13)     (4) 20 (20)   0.098 (-0.303)
OPA10-410       (1) 13 (12)     (5) 19 (18)  -0.384 (-0.403)
OPA10-480       (7)    (11)     (0) 19 (18)   0.263 (-0.440)
OPA10-520       (0)     (8) 12 (12) 19 (20)  -0.226 (-0.250)
OPA10-600       (0)     (3) 15 (17) 19 (20)  -0.118 (-0.081)
OPA10-750       (0)     (7) 15 (13) 20 (20)  -0.143 (-0.212)
OPA10-800       (0)     (6) 19 (14) 20 (20)       - (-0.176)
OPA10-850       (1) 11 (14)     (5) 20 (20)  -0.125 (-0.250)
OPA10-1200      (1)     (7) 18 (12) 19 (20)       - (-0.004)
OPA10-1250      (7) 15 (12)     (1) 20 (20)  -0.535 (-0.319)
OPA10-1300      (0)    (13)  10 (7) 19 (20)  -0.086 (-0.481)
OPA10-1400      (0)     (1) 18 (19) 20 (20)       -0.053 (-)

Inferred counts of null homozygotes (N), heterozygotes (H) and dominant homozygotes (D) in the KOR and SVA (in parentheses) samples, based on Table in Szmidt et al. (1996). The locus-specific estimates of the inbreeding coefficient F̂ come from Table in the same reference. Sample size is denoted by n.
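As a rough check on how the counts and the F̂ column in the table above relate, the sketch below applies one common moment estimator, F̂ = 1 − H_obs / H_exp, where H_exp = 2pq is the Hardy-Weinberg expected heterozygosity computed from the sample allele frequencies. This is an assumption made for illustration: the source does not state which estimator Szmidt et al. (1996) used, and the function name `inbreeding_coefficient` is hypothetical. For the two SVA rows used here, the estimator happens to agree with the tabulated values.

```python
def inbreeding_coefficient(n_null, n_het, n_dom):
    """Moment estimate F = 1 - H_obs / H_exp at one biallelic locus.

    n_null, n_het, n_dom are counts of null homozygotes, heterozygotes
    and dominant homozygotes. Returns None when the estimate is
    undefined (empty sample or only one allele observed).
    """
    n = n_null + n_het + n_dom
    if n == 0:
        return None
    q = (2 * n_null + n_het) / (2 * n)  # sample frequency of the null allele
    p = 1.0 - q                         # sample frequency of the dominant allele
    h_exp = 2.0 * p * q                 # expected heterozygosity under HW
    if h_exp == 0.0:
        return None
    h_obs = n_het / n                   # observed heterozygote proportion
    return 1.0 - h_obs / h_exp


# SVA counts (the parenthesised values) for two loci in the table above.
print(round(inbreeding_coefficient(7, 11, 2), 3))  # OPA9-320 -> -0.173
print(round(inbreeding_coefficient(7, 13, 0), 3))  # OPA9-340 -> -0.481
```

Negative values indicate an excess of heterozygotes relative to Hardy-Weinberg expectation; the `None` return is a design choice for samples in which no estimate can be formed.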
APPENDIX B

Morph 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Haplotype 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 vector 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0

The 35 morphs observed in Johnson et al.’s (1983) study. The haplotype vector gives information on the presence (1) and absence (0) of restriction enzyme cuts at 23 polymorphic loci in the mtDNA molecule.

1 1 1 1 1 1 1 1 0 1 1 1 1 0

Morph 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 Total
African 11 14 12 0 0 0 0 0 0 0 0 1 1 1 74
Caucasian 29 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 50
Oriental 32 0 0 0 2 0 0 0 0 0 0 1 0 0 0 46

The distribution of morph counts in the African, Caucasian and Oriental samples.

APPENDIX C

Morph Japanese Korean Ainu Aeta Vedda
57 52 42 33 13
10 11 12 13 14 15 16 17 18 19 20
Total 74 64 48 37 20

The distribution of counts of different haplotypes (morphs) in the Japanese, Korean, Ainu, Aeta and Vedda samples. Data from Harihara et al. (1988). The morphs here do not correspond to those in Johnson et al. (1983).

[...]
Chapter 1: Statistics in Population Genetics

1.1.0 INTRODUCTION

The present thesis is an attempt to address several issues of interest, from a statistical point of view, in the discipline of population genetics. Broadly speaking, the latter is concerned with the study of processes that determine the genetic characteristics of a biological population. It is, however, not an overstatement...

compartments in their cells, thus sequestering genetic material in the nucleus away from the rest of the cellular contents. The gene (A) is a subset of the string of DNA molecules organised into a particular chromosome (B); thus we say “gene A is in chromosome B”. This subset contains important information that directs the synthesis of molecules that maintain life,...

RAPD loci in a sample of Pinus sylvestris from the SVA and KOR populations. The loci are arranged in such a way that their null allele frequencies are ascending in the SVA population.

3.3.0.4 Locus-specific estimates of FST for SVA and KOR populations using codominant and dominant marker data. The loci are arranged in such a way that FST estimates are ascending when estimated using (3.1.2.3)...

model and the partitioning of a superpopulation into several subpopulations. Assuming that interest is focused on the study populations alone, Nei and Chesser (1983) discussed the estimation of Wright’s fixation indices in detail. We, however, challenge an important assumption that they make: that the relative population size of each subpopulation is equal...
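To make the role of the equal-size assumption concrete, the sketch below computes a gene-diversity-based differentiation measure, G_ST = (H_T − H_S) / H_T, from known subpopulation allele frequencies under two different sets of relative sizes. It works at the level of population parameters in the spirit of Nei (1973); it is not the sample estimator of Nei and Chesser (1983), nor the estimator developed in this thesis. The function names, the example frequencies and the weights are all assumptions made for illustration.

```python
def gene_diversity(freqs):
    """Gene diversity 1 - sum of squared allele frequencies."""
    return 1.0 - sum(f * f for f in freqs)


def gst(allele_freqs, weights):
    """G_ST = (H_T - H_S) / H_T for given relative subpopulation sizes.

    allele_freqs[i][k] is the frequency of allele k in subpopulation i.
    weights[i] is the relative size of subpopulation i (normalised here).
    Returns None when total gene diversity H_T is zero.
    """
    total = sum(weights)
    w = [x / total for x in weights]
    # Within-subpopulation diversity: weighted mean of subpopulation diversities.
    h_s = sum(wi * gene_diversity(p) for wi, p in zip(w, allele_freqs))
    # Total diversity: diversity of the weighted mean allele frequencies.
    n_alleles = len(allele_freqs[0])
    pooled = [sum(wi * p[k] for wi, p in zip(w, allele_freqs))
              for k in range(n_alleles)]
    h_t = gene_diversity(pooled)
    if h_t == 0.0:
        return None
    return (h_t - h_s) / h_t


# Two hypothetical subpopulations with mirrored allele frequencies.
freqs = [[0.2, 0.8], [0.8, 0.2]]
print(round(gst(freqs, [0.5, 0.5]), 4))  # equal relative sizes   -> 0.36
print(round(gst(freqs, [0.9, 0.1]), 4))  # unequal relative sizes -> 0.1684
```

With the same allele frequencies, the value drops by more than half when one subpopulation dominates, which is why the choice of weights matters for interpretation.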
locus-specific heterozygosity in Section 2.3.0. In Section 2.4.0, we propose a method for correcting ascertainment bias when estimating average heterozygosity. Section 2.5.0 contains worked examples illustrating results in the preceding sections. Section 2.6.0 contains a discussion of key assumptions used in developing the present estimation theory. Finally, a summary of this chapter is given in Section 2.7.0. Chapter...

decision-making. To satisfy statistical rigour, proper sampling procedures must be in place, and estimators of relevant population genetic parameters should have desirable statistical properties as well. In both the data collection and analysis phases, departures from assumptions that justify a particular procedure may result in misleading conclusions. Some kind of compromise between achieving statistical...

invalidate the equal weight assumption. As an extension of the current study to binary data, we further investigate conditions that do not invalidate inferences based on FST when dominant, instead of codominant, data are used for estimating FST. In Chapter 4, we reassess the “analysis of molecular variance” (AMOVA) methodology, which is a statistical method widely used to analyse genetic variation. Initially...

allele in a population purely as a matter of sampling error from a finite population. Kimura’s neutral selection theory postulates that mutation followed by random drift is the dominant factor in driving the frequency of an allele up towards fixation or down towards extinction. In light of the current finding that most of the genome of eukaryotes is made up of sequences that do not have any apparent protein-coding...
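The description of drift above, as sampling error in a finite population that moves an allele towards fixation or extinction, can be illustrated with a small simulation. The sketch below assumes a Wright-Fisher model (non-overlapping generations, binomial sampling of 2N gene copies, no mutation or selection); that model choice, the function name `wright_fisher` and the parameter values are assumptions for illustration and are not taken from the source.

```python
import random


def wright_fisher(p0, pop_size, generations, seed=0):
    """Simulate allele frequency under pure drift in a diploid population.

    Each generation, 2 * pop_size gene copies are drawn independently,
    each being the focal allele with probability equal to the current
    frequency. Returns the list of frequencies, starting with p0.
    """
    rng = random.Random(seed)
    n_copies = 2 * pop_size
    p = p0
    path = [p]
    for _ in range(generations):
        # Binomial draw implemented as a sum of Bernoulli trials.
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = count / n_copies
        path.append(p)
    return path


path = wright_fisher(p0=0.5, pop_size=10, generations=50, seed=1)
print(path[:5])
```

Once the frequency reaches 0 or 1 it stays there, since no mutation reintroduces the lost allele; smaller values of `pop_size` make this happen sooner on average.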
protein-coding genes, the effects of certain variants are nevertheless well-documented. In diploids, depending on whether a particular allele of a gene is present in one or two copies, the synthesized protein molecule may have altered activity, leading to observable traits (often medical conditions in humans). The existence of variants in a locus depends on the complex interplay of evolutionary...

by statisticians Light and Margolin (1971), which deals with ANOVA in the context of categorical data. We believe Light and Margolin’s categorical analysis of variance approach (CATANOVA) is a more reasonable method for studying natural biological populations. Pursuing this line of thought, we state clearly the population parameters implicit in CATANOVA, subsequently linking them to a generalisation of...

of genotype proportions in a large, sexually reproducing diploid population under random mating. The eponymous model known as the Hardy-Weinberg (HW) model was independently...

framework, instead of the commonly used analysis of molecular variance (AMOVA), is the appropriate one for analysing genetic structure in a collection of populations where interest is intrinsically centered...
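The Hardy-Weinberg (HW) model mentioned in the excerpts above gives the genotype proportions expected under random mating at a biallelic locus: p², 2pq and q², where p and q = 1 − p are the allele frequencies. A minimal sketch follows; the function name `hardy_weinberg` and the genotype labels are hypothetical choices made for illustration.

```python
def hardy_weinberg(p):
    """Expected genotype proportions at a biallelic locus under HW.

    p is the frequency of allele A; allele a has frequency 1 - p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


print(hardy_weinberg(0.5))  # {'AA': 0.25, 'Aa': 0.5, 'aa': 0.25}
```

The three proportions always sum to one because (p + q)² = 1, and the heterozygote proportion 2pq peaks at p = 0.5.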
