Detecting bad SNPs from Illumina BeadChips using Jeffreys distance

VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

NGUYEN HOANG SON

DETECTING BAD SNPS FROM ILLUMINA BEADCHIPS USING JEFFREYS DISTANCE

MASTER THESIS

Sector: Information Technology
Major: Computer Science
Code: 60 48 01

Supervised by: Dr. Le Sy Vinh

Hanoi - 2012

ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UET or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.’

Signed

Abstract

Current microarray technologies are able to assay thousands of samples over millions of SNPs simultaneously. Computational approaches have been developed to analyse the huge amount of data from microarray chips in order to understand complex human genomes. The data from microarray chips might contain errors due to bad samples or bad SNPs. In this thesis, a novel method is proposed to detect bad SNPs from the probe intensity data of Illumina BeadChips. This approach measures the difference among the results determined by three software packages, Illuminus, GenoSNP and GenCall, to detect unstable SNPs. An experiment with SNP data on chromosome 20 of Kenyan people demonstrates the usefulness of our method. This approach reduces the number of SNPs that need to be checked manually. Furthermore, it is able to detect bad SNPs that have not been recognized by other criteria.

Acknowledgements

Apart from my own efforts, the success of any project depends largely on the encouragement and guidance of many others. First and foremost, I would like to thank my supervisor, Dr. Sy Vinh Le, for his valuable guidance and advice. This research project would not have been finished successfully without his continuous support and assistance. His enthusiastic supervision helped me throughout the research and the writing of this thesis.

I would also like to gratefully acknowledge the tremendous encouragement and insightful comments of Dr. Si Quang Le during the research for this thesis. His brilliant ideas helped me overcome numerous problems and difficulties. No words are adequate to thank him for his help.

I would also like to convey thanks to the Department of Computer Science for providing useful references and laboratory facilities.

I also wish to express my love and gratitude to my beloved family for their understanding and endless love throughout my studies.

My thanks and appreciation also go to my colleagues in developing the project and to the people who have willingly helped me with their abilities.

Table of Contents

Overview
Introduction
    1.1 Biological Background
    1.2 SNP Genotyping
    1.3 Quality Control and Quality Assurance
Related Work
    2.1 Naive method for SNP genotyping
    2.2 GenCall
    2.3 Illuminus
    2.4 GenoSNP
    2.5 Discussion
Method
    3.1 Kullback-Leibler divergence
    3.2 Approximate relative entropy between ...
        3.2.1 Approximate Student distribution ...
        3.2.2 The matched bound approximation ...
    3.3 Estimate conflict degree between three ...
Experimental Results
    4.1 Data description
    4.2 Parameter estimation
    4.3 Evaluation
Conclusion

List of Figures

1.1 An example of a SNP detected by an alignment of two DNA s...
1.2 BeadChip workflow
3.1 Approximate t-distribution by a mixture of Gaussians
3.2 An example of Matched Bound Approximate method with ... mixtures of three components
4.1 Input file format
4.2 Histograms of three metrics in terms of conflict score
4.3 Erroneous loci filtered by applying two functions but minimu...
4.4 HWE, MAF and missing rate histograms of three callers, w... and filtered data

List of Tables

4.1 Result of missing rate filter
4.2 Result of MAF filter
4.3 Result of HWE exact test filter
4.4 Result of synthesis criteria

Overview

Recently, the Genome-Wide Association Study (GWAS), also known as the Whole Genome Association Study (WGAS), has proved to be a successful strategy for identifying genetic variants associated with common diseases or complicated phenotypes. This method focuses on associations between Single Nucleotide Polymorphisms (SNPs) and traits. Because GWAS investigates the entire genome rather than testing just one or a few genetic regions, an extremely large amount of SNP genotype data needs to be called. As a consequence, different methods have been developed to deal with the problem of genotype calling automatically and efficiently. In general, these programs try to translate the probe hybridization intensities output by microarray chips into SNP genotypes. In the ideal case, a call at a SNP locus generates three clusters of signals, so that the SNP genotype of a given sample can be determined according to the cluster it belongs to. In practice, however, ambiguous data and errors occur frequently for many reasons.

Quality Control (QC), as an indispensable step, is included in every study to remove as much of this erroneous data from the dataset as possible. Unfortunately, this kind of work requires significant time and effort because it can hardly be automated completely. It is also difficult to find the bad data among the good when the applied criteria are not clear for these cases. Although many statistical methods have been proposed, expert-guided evaluations are usually needed to determine the final results. On top of that, the statistical variables used to assess SNP genotype quality have thresholds that depend on the conditions, which is an obstacle to finding an automatic solution based on them. Therefore, a novel approach alongside the traditional QC process is necessary to make this important step as automatic as possible.

In our work, we developed a new method to detect bad SNPs. Our method takes advantage of three available genotype callers for the Illumina BeadChip, namely GenCall, GenoSNP and Illuminus. The general idea comes from the fact that the data of bad SNPs tend to confuse the callers, so the calling results of different methods are usually not consistent. That is, by visually comparing the cluster clouds of the three callers SNP by SNP, we could recognize these bad loci. A distance from information theory (relative entropy) is applied to examine these dissimilarities so that bad SNPs can be detected automatically. We applied our method to real-life data and it proved to be a good protocol with satisfactory results. Although there is still a lot of work to do, this method can help experts confirm the filtered data, and it also suggests potential bad SNPs that are hard to find with the traditional QC process.

The thesis is organized in five chapters. An introduction to the definitions and background knowledge is given in Chapter 1. First, this chapter provides a first view of bioinformatics by briefly explaining some terminology in molecular biology, such as human genome, DNA, SNP and SNP genotyping. It also offers further information relating to our work, about solutions for the genotype calling problem and the quality control process. Reading these sections is necessary for a better understanding of the later content of the thesis.

Chapter 2 describes the available genotype callers in more detail. The general idea of a simple caller is given at the beginning to present the most basic solution to the genotype calling problem. After that, the three callers used in our algorithm are explained thoroughly. The information provided in this chapter helps the reader take the first step toward our solution.

The proposed method is presented in the third chapter. We review the definition of relative entropy (also known as Kullback-Leibler divergence) and its application to the Normal distribution. The next section states an available method to calculate the relative entropy between Student distributions, which is required to deal with our problem. Finally, an estimation technique is proposed to measure the dissimilarities between the outputs of the three callers.

The experiment of applying our protocol to real data is presented in Chapter 4. First, a description of the input data is given. After that, a trial-and-error experiment is carried out to find the most appropriate parameters to use in our program. Finally, the bad SNPs detected by our method are evaluated by comparison with other traditional QC criteria.

The final chapter gives the overall conclusion and further discussion of the thesis.

4.3 Evaluation

Table 4.3: Result of HWE exact test filter (columns: GenCall, Illuminus, GenoSNP)

Table 4.4: Result of synthesis criteria (columns: GenCall, Illuminus, GenoSNP; rows: detected SNPs by synthesis criteria, intersected with our method)

... numbers in these tables increase along with the threshold because lower bounds are used. In all three tables, Illuminus is the caller with the highest call rate in this case, since it has the fewest SNPs filtered out among the three. The SNPs that violate the criteria are the ones that need to be checked carefully. A significant number of them are also flagged as bad SNPs by our protocol. This helps experts narrow the range of SNPs that they need to focus on.

In fact, SNP data is filtered hierarchically, combining all those metrics in one complete QC process. For this reason, the most powerful criteria from the three tables are used simultaneously to identify bad SNPs: 2% missing rate, 0.05 MAF threshold and HWE value of 10. Table 4.4 shows the number of bad SNPs for each caller using these synthesis criteria, together with the number of those bad SNPs that are also detected by our method. For instance, with regard to GenCall, there are 4497 SNPs that cannot pass the synthesis condition, 1070 of which are also filtered by our measurement. Looking back at Tables 4.1, 4.2 and 4.3, the numbers of bad SNPs recognized by applying the three separate conditions are 27957, 5585 and 9221 respectively. The intersection of these three sets is the 4497 bad SNPs in Table 4.4 stated above. Similarly, the overlap of the three counterparts, which have 2345, 1127 and 1815 SNPs in the first three tables, forms the set of 1070 overall bad SNPs in Table 4.4. These are the bad SNPs common to the corresponding QC criteria and our criteria.

The second criterion (minor allele frequency) seems to be the most reasonable filter, since its result (5585 SNPs) is the smallest and the closest to the synthesis outcome (4497 SNPs in this case). We can see that the results agreed between our method and the MAF filter (1127 bad SNPs) are also more stable than those of the other two criteria, since this number is the closest to the combined set of 1070 intersected bad SNPs. This reflects the high reliability of the results of our method. Besides these agreed SNPs, we also detect 2360 - 1070 = 1290 potential bad ones that pass the synthesis condition. A manual check shows that these latent bad SNPs are confirmed with high probability. To sum up, our novel approach not only helps other QC processes reduce the number of SNPs that need to be checked, but is also able to recognize latent erroneous SNPs.

To better understand the performance of our protocol, plots of the worst SNPs suggested by our algorithm are shown in Figure 4.5. In fact, this is the final safeguard step, which is carried out manually by experts to confirm the quality of the candidate bad SNPs.

Our filtering procedure requires about 20 minutes to process the Kenyan dataset automatically, using a platform with one Intel Xeon X5355 2.66 GHz processor and the Linux Ubuntu operating system.

Figure 4.5: Bad SNPs suggested by our criteria, for which the three clustering results are very different. (Scatter plots of X intensity against Y intensity, panels titled GTG; legend: 0/0, 0/1, 1/1, -/-.)
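The overlap bookkeeping used in this section can be sketched with plain set operations. This is only an illustrative sketch, not the thesis program: the function name `overlap_summary`, its argument names and the toy SNP identifiers are assumptions introduced here, and the sets of SNPs failing each filter are taken as given rather than computed from genotype calls.

```python
from typing import Dict, Set


def overlap_summary(
    flagged: Set[str],
    failed_missing: Set[str],
    failed_maf: Set[str],
    failed_hwe: Set[str],
) -> Dict[str, Set[str]]:
    """Compare SNPs flagged by the conflict score with three QC filters."""
    # Synthesis set: SNPs that fail all three QC filters at once,
    # i.e. the intersection of the three per-filter sets.
    synthesis = failed_missing & failed_maf & failed_hwe
    return {
        "synthesis": synthesis,
        # Flagged by both the conflict score and the synthesis criteria.
        "agreed": flagged & synthesis,
        # Flagged by the conflict score only: the "latent" bad SNPs.
        "latent": flagged - synthesis,
    }


# Toy SNP identifiers, invented for illustration only.
result = overlap_summary(
    flagged={"rs1", "rs2", "rs5"},
    failed_missing={"rs1", "rs2", "rs3"},
    failed_maf={"rs1", "rs2", "rs4"},
    failed_hwe={"rs1", "rs2", "rs3", "rs4"},
)
print(sorted(result["agreed"]))  # ['rs1', 'rs2']
print(sorted(result["latent"]))  # ['rs5']
```

With the GenCall figures reported in this section, the same bookkeeping gives 2360 flagged SNPs, 1070 agreed SNPs and therefore 2360 - 1070 = 1290 latent ones.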
Conclusion

Ascertaining the quality of data is a primary task in the whole genotyping process. In this work, we introduced a novel method to filter out bad SNPs. It does not require expert knowledge of QC variables such as MAF, missing rate or HWE (Hardy-Weinberg equilibrium). Instead, it measures the calling results available from three callers: Illuminus, GenCall and GenoSNP. The underlying rationale of our algorithm is to estimate the differences between the classifications of these three methods and then find the useless loci based on this information. By applying this approach as a data postprocessing step, a number of bad SNPs are filtered out of the dataset. This decreases the time and effort needed for later procedures. In addition, our procedure is able to find potentially failed SNPs that could be overlooked by other constraints. Furthermore, thanks to its simplicity and computational efficiency, this procedure might be reused independently or combined with other quality assurance methods to improve their performance.

We would like to emphasize that our procedure should be carried out as the very first step of the whole quality control process, as the primary filter that could help all three callers. In many other quality control or quality assurance methods, the parameters depend heavily on the characteristics of the dataset (the platforms, chips and callers used in genotyping). In our program, the only unstable factor that should be considered is the conflict threshold, which is set to 0.1 by default. This value was chosen not only through various experiments, but also based on the fact that the theoretical distances between close distributions usually fall in a concrete and small set of values. Our recommendation for this parameter is 0.1 for raw datasets and 0.05 for higher-quality ones.

Considering all together, this method of using the Jeffreys distance between three available genotype callers is another reliable and helpful detector for finding bad-quality SNPs, a task which usually takes much time and effort.

Publications

Hoang Son Nguyen, Sy Vinh Le, Si Quang Le. Detecting bad SNPs from the Illumina BeadChips using Jeffreys distance. International Conference on Knowledge and Systems Engineering (KSE 2012), in Proceedings.
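The conflict threshold discussed in the Conclusion is easier to interpret with the Jeffreys distance written out. The sketch below is a simplification and not the thesis implementation: it assumes univariate Gaussian clusters, for which the Kullback-Leibler divergence has a closed form, whereas the method itself has to handle Student distributions through an approximation. The function names and the example parameters are hypothetical, and how the distances are combined across the three callers is not shown in this excerpt, so the comparison with 0.1 is only illustrative.

```python
import math


def kl_gaussian(mu_p: float, sigma_p: float, mu_q: float, sigma_q: float) -> float:
    """KL(p || q) for p = N(mu_p, sigma_p^2) and q = N(mu_q, sigma_q^2)."""
    if sigma_p <= 0 or sigma_q <= 0:
        raise ValueError("standard deviations must be positive")
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
        - 0.5
    )


def jeffreys_distance(mu_p: float, sigma_p: float, mu_q: float, sigma_q: float) -> float:
    """Jeffreys distance: the symmetrised sum KL(p || q) + KL(q || p)."""
    return kl_gaussian(mu_p, sigma_p, mu_q, sigma_q) + kl_gaussian(
        mu_q, sigma_q, mu_p, sigma_p
    )


THRESHOLD = 0.1  # default conflict threshold quoted in the Conclusion

# Two hypothetical cluster fits for the same genotype from two callers.
close = jeffreys_distance(0.0, 1.0, 0.2, 1.0)
far = jeffreys_distance(0.0, 1.0, 0.5, 1.0)
print(round(close, 6), close > THRESHOLD)  # 0.04 False
print(round(far, 6), far > THRESHOLD)      # 0.25 True
```

For equal variances the distance reduces to the squared mean shift divided by the variance, which is why a shift of 0.2 standard deviations stays below the threshold while a shift of 0.5 exceeds it.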
