Detecting bad SNPs from Illumina BeadChips using Jeffreys distance

VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

NGUYEN HOANG SON

DETECTING BAD SNPS FROM ILLUMINA BEADCHIPS USING JEFFREYS DISTANCE

MASTER THESIS

Sector: Information Technology
Major: Computer Science
Code: 60 48 01
Supervised by: Dr Le Sy Vinh

Hanoi - 2012

Table of Contents

Overview
1 Introduction
  1.1 Biological Background
  1.2 SNP Genotyping
  1.3 Quality Control and Quality Assurance
2 Related Work
  2.1 Naive method for SNP genotyping
  2.2 GenCall
  2.3 Illuminus
  2.4 GenoSNP
  2.5 Discussion
3 Method
  3.1 Kullback-Leibler divergence
  3.2 Approximate relative entropy between two Student distributions
    3.2.1 Approximate Student distribution
    3.2.2 The matched bound approximation
  3.3 Estimate conflict degree between three callers
4 Experimental Results
  4.1 Data description
  4.2 Parameter estimation
  4.3 Evaluation
Conclusion

Chapter 1 Introduction

1.1 Biological Background

Human genome: The human genome is encoded and represented as a database of very long strings over an alphabet of four letters (or bases) {A, C, G, T}, which stand for the four types of nucleotides. A complete genome contains nearly three billion characters in total.

Single nucleotide polymorphisms: A SNP is defined as a single base change in a DNA sequence. A position where a SNP occurs is called a SNP site. Usually only two nucleotide variants are possible at any SNP site. The nucleotide variants at a SNP site are called alleles; we use 0 and 1 to denote the two possible alleles. A pair of alleles, one from each chromosome copy of a diploid organism, at a SNP site is called a SNP genotype. Since there are two different alleles at a SNP site, there are three different types of SNP genotype, denoted as 00, 01, and
11 (equivalent to aa, aA, and AA). 00 and 11 are called homozygous genotypes, while 01 is called the heterozygous genotype.

Illumina BeadChips: To date, there are various types of microarrays, such as Affymetrix GeneChip, Illumina Infinium BeadChips, Perlegen, Invader, etc. The Illumina whole-genome BeadChip is one of the most popular microarray technologies used to study human SNPs. It takes m individuals (samples) as input and captures genotypes across n SNP sites. After several chemical and preprocessing steps, it outputs a matrix $G = \{g_{ij}\}$, $i = 1, \dots, m$, $j = 1, \dots, n$, where $g_{ij} = (x_{ij}, y_{ij})$ are the raw intensities that indicate the SNP genotype of sample i at SNP site j. The intensities $x_{ij}$ and $y_{ij}$ correspond to alleles 0 and 1 of the SNP genotype $g_{ij}$, respectively.

Methods: We only consider the three most popular methods for Illumina BeadChips, namely Gencall (in the GenomeStudio software), Illuminus and GenoSNP.

1.2 Quality Control and Quality Assurance

Large-scale statistical studies are susceptible to errors. The Quality Control and Quality Assurance (QC/QA) process is considered a critical phase that significantly affects the accuracy of a computational model. QC is defined as the process of monitoring and controlling the quality of data as it is being generated, whereas QA reviews the product quality afterwards. This thesis is concerned with the QC procedure for the genotype calling problem. Cleaning genotype data usually consists of two separate phases: a sample filter and a SNP filter. We focus only on the latter here: low-quality SNPs are called bad SNPs and should be filtered out. A typical QC process for genotyping considers three variables: the SNP's missing rate, HWE (Hardy-Weinberg equilibrium) and MAF (minor allele frequency).

• Missing rate, or missing proportion (MSP), at a SNP indicates how many samples failed to be called: $\mathrm{MSP} = \frac{\text{number of no calls}}{\text{total number of samples}}$. A higher missing rate
indicates poorer genotype calling performance.
• For a biallelic locus that is in Hardy-Weinberg equilibrium with minor allele frequency q (taking allele 1 as the minor allele), the probabilities of the three possible genotypes 0/0, 0/1 and 1/1 are $(1 - q)^2$, $2q(1 - q)$ and $q^2$. A significant deviation in an HWE test typically implies gross genotyping error. A probability test value, or p-value, is used to quantify this deviation.
• MAF is the ratio of minor (less frequent) alleles counted in the whole set of alleles: $\mathrm{MAF} = \frac{\text{number of minor alleles}}{\text{total number of alleles}}$. A low MAF means that one of the three clusters contains few samples. As a consequence, SNPs with low MAF are more prone to error, since most clustering-based calling methods do not work well in these cases.

Chapter 2 Related Work

We are given a matrix $G = \{g_{ij}\}$, $i = 1, \dots, m$, $j = 1, \dots, n$, with $g_{ij} = (x_{ij}, y_{ij})$, representing the genotype intensities of m samples at n SNP loci. The task is to assign every pair to its most suitable genotype label, 00, 01 or 11 (or to the no-call cluster if this is impossible). This problem is called SNP genotype calling or, in short, SNP genotyping.

2.1 Naive method for SNP genotyping

The simplest method for the SNP genotyping problem uses the relation between the two allelic intensities. For example, for a data point $(x_{ij}, y_{ij})$:
• If $x_{ij} \gg y_{ij}$, then this SNP genotype could be labelled as 00 (homozygous for allele 0).
• If $x_{ij} \ll y_{ij}$, then it should belong to genotype class 11 (homozygous for allele 1).
• Or if $x_{ij} \approx y_{ij}$, then this SNP call should be 01 (heterozygous).
However, this naive method does not work well in ambiguous cases, and it performs poorly on very large amounts of data.

2.2 GenCall

Gencall is the canonical caller, developed for Illumina microarrays from the very beginning. This method uses neural networks and heuristic methods to divide the samples into three clusters. A GenCall score is generated as a confidence measure for each call. In general, this score represents the quality of
genotyping and may vary between different loci or chromosomes; however, any call with a score below 0.2 is commonly considered a failure due to lack of confidence (and is assigned as no call).

2.3 Illuminus

Illuminus is one of the best callers for the Illumina 2.5M BeadChip. It uses the EM (Expectation Maximization) framework to fit a bivariate mixture model with three t-distributed components (the three genotype clusters) and a Gaussian component for outliers (corresponding to the no-call cluster). Posterior probabilities are calculated to classify the samples into the appropriate genotype states. Perturbation analysis is then applied to ensure the stability of the calling results: each SNP is called twice, once on the original data $(x_{ij}, y_{ij})$ and once on perturbed data in which a perturbation term is added to each intensity, and the concordance between the two results is estimated to determine whether the SNP is valid.

2.4 GenoSNP

GenoSNP also uses the EM framework for a mixture model of Student-distributed components, as in Illuminus. However, it fits the intensity data to the model in a different way. Instead of clustering the probe intensities across individuals at each SNP, as the two methods above do, GenoSNP builds the model within a single individual, based on the log scale of the normalized intensities $(\log(x + 1), \log(y + 1))$. This approach overcomes a problem of the other methods, whose accuracy depends heavily on the number of control samples, which in practice cannot always be afforded. Moreover, it is well suited to studies in which the data are typed on different chips. After the calling algorithm has run, a confidence measure is also available for each call: the posterior probability that the call comes from the assigned class. This confidence measure is also used to filter out poorer SNP or sample data.

2.5 Comment

All of these callers
in general work quite well on arbitrary datasets, with call rate and accuracy usually exceeding 95%. A combination of all three callers, also known as consensus calling, should further improve genotype calling performance. This idea motivates our work: by considering the results of the three callers and cross-referencing them, we can find the problematic SNPs that adversely affect the calling process.

Chapter 3 Method

For bad SNPs, the three genotype classes clustered by Illuminus, GenoSNP and Gencall are not stable: they differ from one caller to another. By estimating these differences over the whole dataset using an information-theoretic distance, we can determine the SNPs with the largest conflicts and mark them as bad SNPs.

3.1 Kullback-Leibler divergence

The Kullback-Leibler divergence, or relative entropy, $D(P\|Q)$ between two distributions P and Q with probability mass functions p(x) and q(x), respectively, is computed as follows:

$$D(P\|Q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} \qquad (3.1)$$

For two γ-dimensional normal distributions $f(\mu_f, \Sigma_f)$ and $g(\mu_g, \Sigma_g)$, we have:

$$D(f\|g) = \frac{1}{2}\left(\log\frac{|\Sigma_g|}{|\Sigma_f|} + \mathrm{Tr}[\Sigma_g^{-1}\Sigma_f] + (\mu_f - \mu_g)^T \Sigma_g^{-1} (\mu_f - \mu_g) - \gamma\right) \qquad (3.2)$$

in which γ = 2 for our genotyping problem. No such closed-form expression exists for the KL divergence between the Student distributions of interest, hence an approximation is needed.

3.2 Approximate relative entropy between two Student distributions

3.2.1 Approximate Student distribution

We have:

$$S(x; \mu, \Sigma, \upsilon) = \int_0^{+\infty} N(x; \mu, \Sigma/u)\, G\!\left(u; \frac{\upsilon}{2}, \frac{\upsilon}{2}\right) du \qquad (3.3)$$

where $N(x; \mu, \Sigma/u)$ is a Normal distribution of x with mean µ and covariance matrix Σ/u, while $G(u; \frac{\upsilon}{2}, \frac{\upsilon}{2})$ is a Gamma distribution of u with the given shape and rate parameters, respectively. By discretizing the above equation, a Student distribution can be approximated by a finite mixture of P Normal distributions:

$$S(x) \approx \frac{1}{P} \sum_{i=1}^{P} N\!\left(x; \mu, \frac{\Sigma}{u_i}\right) \qquad (3.4)$$

where $\{u_i\}_{i=1}^{P}$ are
randomly drawn from the Gamma distribution $G(u; \frac{\upsilon}{2}, \frac{\upsilon}{2})$. The value of P significantly affects the accuracy of the approximation and should be chosen according to the degree of freedom υ.

We use the method of Goldberger to approximately estimate $D(f\|g)$, where $f_i$ and $g_k$ denote the i-th and k-th Normal components of the mixtures that approximate f and g:
• A matching function $m: \{1, \dots, P\} \to \{1, \dots, P\}$ is defined that maps the i-th component of f to the j-th component of g such that $j = m(i) = \arg\min_k D(f_i\|g_k)$.
• Goldberger's approximation formula:

$$D(f\|g) \approx D_{goldberger}(f\|g) = \frac{1}{P} \sum_{i=1}^{P} D(f_i\|g_{m(i)}) \qquad (3.5)$$

Thanks to Equation 3.2, every term on the right-hand side of Equation 3.5 is easy to compute, because for all $i, j \in \{1, \dots, P\}$ both $f_i$ and $g_j$ are normally distributed. We thus have a way to approximate the relative entropy between two Student distributions.

3.3 Estimate conflict degree between three callers

Each caller clusters the samples into three genotype states encoded as 00, 01 and 11 (excluding the no-call cluster) using a different statistical model. We therefore have nine clusters: $i_{00}, i_{01}, i_{11}$ for Illuminus, $g_{00}, g_{01}, g_{11}$ for GenoSNP and $a_{00}, a_{01}, a_{11}$ for Gencall. The difference between the clustering results of each pair of callers is then calculated as described above, which yields three values (one per pair of callers). In practice, we use the Jeffreys distance because it is symmetric:

$$jd(P, Q) = D(P\|Q) + D(Q\|P)$$

Pseudo-code for estimating the Jeffreys distance between two distributions is given in Algorithm 1. Let ig, ia and ag denote the differences between Illuminus and GenoSNP, Illuminus and Gencall, and Gencall and GenoSNP, respectively, given as:

$$ig \leftarrow \min\{jd(i_{00}, g_{00}),\ jd(i_{01}, g_{01}),\ jd(i_{11}, g_{11})\}$$
$$ia \leftarrow \min\{jd(i_{00}, a_{00}),\ jd(i_{01}, a_{01}),\ jd(i_{11}, a_{11})\}$$
$$ag \leftarrow \min\{jd(a_{00}, g_{00}),\ jd(a_{01}, g_{01}),\ jd(a_{11}, g_{11})\} \qquad (3.6)$$

By using an appropriate
metric on these three values (the minimum, maximum or average of {ia, ig, ag}), we ultimately obtain the degree of difference between the three callers' results at a SNP locus. The choice of metric function is discussed later. Algorithm 2 assigns bad SNPs.

Algorithm 1: Estimate Jeffreys distance between two Student distributions f and g
Require:
  • f(m_f, Σ_f) and g(m_g, Σ_g)
  • number of Normal components P, number of Monte Carlo iterations MC, common degree of freedom df
  • ran_gamma(shape, scale): random generator function for the Gamma distribution
  • D(f, g): function returning the KL divergence between two Gaussian distributions, in that order, based on Equation 3.2
Function: jd(f, g)
  jd ← 0
  for count = 1 → MC
    for i = 1 → P
      u[i] ← ran_gamma(df/2, 2/df)
    end for
    jd_goldberger ← 0
    for j = 1 → P
      map_fg ← argmin over m ∈ (1, ..., P) of D((m_f, Σ_f/u[j]), (m_g, Σ_g/u[m]))
      map_gf ← argmin over m ∈ (1, ..., P) of D((m_g, Σ_g/u[j]), (m_f, Σ_f/u[m]))
      jd_goldberger ← jd_goldberger + D((m_f, Σ_f/u[j]), (m_g, Σ_g/u[map_fg])) + D((m_g, Σ_g/u[j]), (m_f, Σ_f/u[map_gf]))
    end for
    jd ← jd + (1/P) ∗ jd_goldberger
  end for
  return (1/MC) ∗ jd
EndFunction

Algorithm 2: Finding bad SNPs
for all SNPs
  Assign samples to the appropriate labeled cluster: (i00, i01, i11); (g00, g01, g11); (a00, a01, a11)
  for all Clusters
    Calculate the sample mean and sample covariance matrix for each t-distributed cluster
  end for
  Calculate (ig, ia, ag) by Equation 3.6
  conflict ← metric(ig, ia, ag)
  if conflict > thres then
    Assign this SNP as a bad SNP
  end if
end for

3.4 Data description

The SNP calls of the three callers for 4473 Kenyan individuals at 28410 SNP loci are combined and used as input data in VCF format. Each call includes the confidence scores of each caller: the Gencall score and, for each of the other two callers, a set of three probabilities representing how likely the call is to belong to each of the three genotype clusters. The recommended cut-offs for GenoSNP and
Illuminus are both 0.95, while the cut-off for Gencall is 0.2. Every call with confidence below the cut-off is considered a no call.

3.5 Parameter estimation

This section determines the most suitable parameters and metric for our program. The degree-of-freedom parameter υ is fixed; 10 normal components per t-distribution are sufficient for accuracy, and a Monte Carlo loop of 20 iterations is executed. For the metric, we test three candidates: the minimum, maximum or average of the three values. Across various experiments, the minimum proved to be the best choice of metric function. Given the reasons discussed above and the histogram, we conclude that a SNP whose minimum conflict score among the three is greater than 0.1 will be filtered out. This protocol removes 2360 out of 28410 SNPs (∼ 8.3%).

3.6 Evaluation

Our results are compared against a baseline of several traditional QC criteria, namely missing rate, minor allele frequency and Hardy-Weinberg equilibrium.

Table 3.1: Result of missing rate filter

Caller      2%           5%           10%        15%        20%
Gencall     27957/2345   13721/1547   2589/753   1276/565   891/474
Illuminus   14761/1943   2245/590     248/137    79/56      34/25
GenoSNP     28410/2360   20894/1545   4272/691   2165/554   1454/476

Table 3.2: Result of MAF filter

Caller      0.01       0.02       0.03        0.04        0.05
Gencall     1544/271   2668/556   3887/813    4798/966    5585/1127
Illuminus   481/189    1211/247   1926/346    2932/547    4188/811
GenoSNP     3070/560   4217/875   4886/1051   5519/1190   6142/1308

Table 3.3: Result of HWE exact test filter

Caller      10^-8       10^-7       10^-6       10^-5       10^-4
Gencall     6868/1570   7290/1603   7792/1666   8424/1741   9221/1815
Illuminus   1658/280    2036/323    2540/372    3230/442    4255/551
GenoSNP     4540/975    4935/1039   5386/1130   5906/1221   6567/1332

Table 3.4: Result of synthesis criteria

Number of SNPs                        Gencall   Illuminus   GenoSNP
Detected SNPs by synthesis criteria   4497      478         4777
Intersected with our method           1070      89          1037

Tables 3.1, 3.2
and 3.3 illustrate different criteria used to help experts evaluate SNP quality. Each cell of these tables contains two numbers: the first shows how many SNPs break the corresponding threshold and need to be reconsidered manually, and the second is the number of those SNPs that are also filtered by our protocol. For instance, in Table 3.1, with the missing call rate threshold set to 2%, 14761 SNP calling results are removed for Illuminus. Among them, 1943 SNPs are also filtered out by our program. This means that our protocol reduces the number of SNPs that have to be checked in this case by 1943. Besides, we also detect another 2360 − 1943 = 417 potential bad SNPs that are missed by the missing rate criterion alone. A visual check shows that these SNPs are indeed problematic.

Similar observations hold for the other two tables. In those tables, however, the numbers increase with the threshold, because the thresholds act as lower bounds. In all three tables, Illuminus is the caller with the highest call rate in this case, since it has the fewest SNPs filtered out among the three. The SNPs that violate the criteria are the ones that need to be checked carefully. A significant number of them are also marked as bad SNPs by our protocol, which helps experts narrow down the range of SNPs they need to focus on.

In practice, SNP data is filtered hierarchically, combining all of these metrics in one complete QC process. For this reason, the strictest criteria from the three tables are used simultaneously to identify bad SNPs: a 2% missing rate, a MAF threshold of 0.05 and an HWE value of 10^-4. Table 3.4 shows the number of bad SNPs for each caller under these synthesis criteria, together with the number of those bad SNPs that are also detected by our method. For instance, for Gencall, 4497 SNPs fail the synthesis condition, and 1070 of them are also filtered by our measure. Looking back at Tables 3.1, 3.2 and 3.3, the numbers of bad SNPs recognized by applying the three
separate conditions are 27957, 5585 and 9221, respectively. The intersection of these three sets is the 4497 bad SNPs in Table 3.4 stated above. Similarly, the overlap of the three counterparts, which contain 2345, 1127 and 1815 SNPs in the first three tables, forms the set of 1070 overall bad SNPs in Table 3.4. These are the bad SNPs common to the corresponding QC criteria and our criterion. The second criterion (minor allele frequency) seems to be the most reasonable filter, since its result (5585 SNPs) is the smallest and the closest to the synthesis outcome (4497 SNPs in this case). The agreement between our method and the MAF filter (1127 bad SNPs) is also more stable than for the other two criteria, since it is the closest to the combined set of 1070 intersected bad SNPs. This reflects the high reliability of the results of our method. Besides these agreed SNPs, we also detect 2360 − 1070 = 1290 potential bad SNPs that pass the synthesis condition. A manual check confirms with high probability that these latent SNPs are bad.

Our filtering procedure requires about 20 minutes to process the Kenyan dataset automatically, on a platform with one Intel Xeon X5355 2.66 GHz processor running the Linux Ubuntu operating system.

Conclusion

Ascertaining the quality of the data is the primary task in the whole genotyping process. In this work, we have introduced a novel method to filter out bad SNPs. It does not require expert knowledge of QC variables such as MAF, missing rate or HWE (Hardy-Weinberg equilibrium). Instead, it measures the calling results available from the three callers Illuminus, Gencall and GenoSNP. The underlying rationale of our algorithm is to estimate the differences between the classifications of these three methods and then to find the unusable loci based on this information. When this approach is applied as a data post-processing step, a number of poor SNPs are filtered out of the dataset, which reduces the time and effort needed for later procedures. In addition, our
procedure is able to find potentially failed SNPs that would be missed by other constraints. Furthermore, thanks to its simplicity and computational efficiency, this procedure can be reused on its own or combined with other quality assurance methods to improve their performance. We would like to emphasize that our procedure should be carried out as the very first step of the whole quality control process, as a primary filter that can help all three callers.

In many other quality control or quality assurance methods, the parameters depend heavily on the characteristics of the dataset (the platforms, chips and callers used in genotyping). In our program, the only variable factor that should be considered is the conflict threshold, which is set to 0.1 by default. This setting is based not only on various experiments but also on the fact that the theoretical distances between close distributions usually fall within a small, well-defined range of values. Our recommendation for this parameter is 0.1 for raw datasets and 0.05 for higher-quality ones.

All things considered, this method of using the Jeffreys distance between three available genotype callers is a reliable and helpful additional detector for finding bad-quality SNPs, which usually take a great deal of time and effort to deal with.

Publication

Hoang Son Nguyen, Sy Vinh Le, Si Quang Le. "Detecting bad SNPs from Illumina BeadChips using Jeffreys distance". In Proceedings of the 4th International Conference on Knowledge and Systems Engineering (KSE 2012).
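As a supplementary illustration of Algorithm 1 and Equation 3.6, the following is a minimal Python sketch, not the thesis implementation. It assumes each genotype cluster is summarized by a bivariate mean and a 2×2 covariance matrix, and it uses only the standard library. The function names (`kl_gauss2`, `scaled`, `jeffreys_distance`, `conflict`), the dictionary layout of the clusters, the fixed random seed and the toy values (including `df = 5.0`) are all hypothetical choices made for illustration.

```python
import math
import random


def kl_gauss2(mf, Sf, mg, Sg):
    """KL divergence D(f||g) between two bivariate normals (Equation 3.2, gamma = 2)."""
    det_f = Sf[0][0] * Sf[1][1] - Sf[0][1] * Sf[1][0]
    det_g = Sg[0][0] * Sg[1][1] - Sg[0][1] * Sg[1][0]
    # Inverse of the 2x2 covariance matrix Sg.
    inv_g = [[Sg[1][1] / det_g, -Sg[0][1] / det_g],
             [-Sg[1][0] / det_g, Sg[0][0] / det_g]]
    # Tr[Sg^-1 Sf]
    trace = sum(inv_g[i][k] * Sf[k][i] for i in range(2) for k in range(2))
    # (mf - mg)^T Sg^-1 (mf - mg)
    d = (mf[0] - mg[0], mf[1] - mg[1])
    maha = sum(d[i] * inv_g[i][k] * d[k] for i in range(2) for k in range(2))
    return 0.5 * (math.log(det_g / det_f) + trace + maha - 2.0)


def scaled(S, u):
    """Return the covariance matrix S / u."""
    return [[v / u for v in row] for row in S]


def jeffreys_distance(mf, Sf, mg, Sg, df, P=10, MC=20, seed=0):
    """Monte Carlo estimate of jd(f, g) for two Student clusters (Algorithm 1)."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    total = 0.0
    for _ in range(MC):
        # random.gammavariate takes (shape, scale); a rate of df/2 means scale 2/df.
        u = [rng.gammavariate(df / 2.0, 2.0 / df) for _ in range(P)]
        f_parts = [scaled(Sf, ui) for ui in u]
        g_parts = [scaled(Sg, ui) for ui in u]
        acc = 0.0
        for j in range(P):
            # The matched component is the one with the smallest KL divergence.
            acc += min(kl_gauss2(mf, f_parts[j], mg, g) for g in g_parts)
            acc += min(kl_gauss2(mg, g_parts[j], mf, f) for f in f_parts)
        total += acc / P
    return total / MC


def conflict(illuminus, genosnp, gencall, df, metric=min):
    """Conflict score at one SNP; each caller maps '00', '01', '11' to (mean, cov)."""
    def pair(a, b):  # Equation 3.6: minimum over the three genotype clusters
        return min(jeffreys_distance(*a[k], *b[k], df=df) for k in ("00", "01", "11"))
    ig, ia, ag = pair(illuminus, genosnp), pair(illuminus, gencall), pair(gencall, genosnp)
    return metric((ig, ia, ag))


# Toy usage with made-up clusters; df = 5.0 is an arbitrary illustrative value.
cov = [[0.01, 0.0], [0.0, 0.01]]
caller = {"00": ((1.0, 0.1), cov), "01": ((0.6, 0.6), cov), "11": ((0.1, 1.0), cov)}
score = conflict(caller, caller, caller, df=5.0)
print("bad SNP" if score > 0.1 else "ok")
```

Taking the minimum KL divergence over the components is equivalent to evaluating the divergence at the matched index returned by the argmin in Algorithm 1. The `metric` argument accepts `min` or `max` directly; an average would need a small helper such as `lambda v: sum(v) / len(v)`.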
