A maximum likelihood method for detecting bad samples from Illumina BeadChips data


Table of Contents

Overview
1 Introduction
  1.1 Biological background
  1.2 Some common types of mutation
  1.3 SNP and SNP genotype
  1.4 Microarray technology and Illumina BeadChips
  1.5 Genotype callers
  1.6 Quality control and quality assurance
    1.6.1 Identify samples with discordant sex information
    1.6.2 Identify samples that have high missing and heterozygosity rate
    1.6.3 Identify duplicated or related samples
    1.6.4 Identify samples that have different ancestries
2 Genotype callers
  2.1 Illuminus
  2.2 GenoSNP
  2.3 GenCall
  2.4 Comparing three callers
3 Maximum likelihood method for detecting bad samples
  3.1 Create potential bad sample list
  3.2 Estimate the fitness of data
  3.3 Remove bad samples
4 Experimental result
  4.1 Input file format
  4.2 Experiment 1
  4.3 Experiment 2
Conclusion
Publications

List of Figures

1.1 DNA structure
1.2 Human genome, chromosome and genes
1.3 The process of creating and genotyping of Illumina Infinium II [Inc06]
2.1 Mixture of two Gaussian distributions
2.2 x,y intensities vs strength and contrast [TIS+07]
3.1 The workflow of the method
4.1 VCF file format example
4.2 SNP rs2465126 before and after removing bad samples
4.3 SNP rs2488991 before and after removing bad samples
4.4 SNP rs6055460 before and after removing bad samples

List of Tables

1.1 An example of DNA substitution
1.2 An example of DNA insertions and deletions
1.3 An example of SNP
2.1 Comparison between callers [GYC+08a]
4.1 Samples with the highest missing rates and their statistics in experiment 1
4.2 SNPs with large positive changes after removing bad samples (experiment 1)
4.3 Number of bad samples with different thresholds in experiment 1
4.4 Samples with the highest missing rates in experiment 2
4.5 SNPs with large positive changes after removing bad samples (experiment 2)
4.6 Number of bad samples with different thresholds in experiment 2

Overview

A genome-wide association study (GWAS) is a
project that uses the human genome to detect associations between single nucleotide polymorphisms (SNPs) and disease traits. With the advancement of technology in recent years, some DNA microarrays can capture millions of SNPs from thousands of individuals (or samples). Generating a microarray involves many chemical and biological processes, most of which are carried out automatically by machines produced by large DNA microarray companies. Creating the microarray is only the first part; the second part is analyzing the data it contains to obtain the genotype of each SNP for each individual. This part is called SNP genotyping, and statistical approaches are now the most common methods for it thanks to their low cost and short running time.

However, these methods are not perfect: they may generate faulty genotype data for some individuals or SNPs. Faulty genotype data can result from errors in creating the microarray, from inaccuracy in transferring data from the microarray to the genotyping methods, or from the methods themselves. Faulty genotype data is useless for genotype analysis, so several criteria have been proposed to remove bad samples and bad SNPs. For instance, under one such criterion, all samples whose proportion of undefined genotypes (missing rate) is higher than 3% are marked as bad samples and removed. However, these criteria can lead to a massive reduction in the number of samples or SNPs. Moreover, there is no mathematical verification that the removals are right. Hence, after the removals, visualizations such as a scatter plot of each SNP, together with some statistics, are produced to verify them. This job is mostly done manually by experts and is time consuming. To conclude, the remaining problem in this step is finding a statistical approach to remove bad samples and bad SNPs. A good solution for
this problem is one that gives reliable results and requires as little intervention from experts as possible.

In this thesis, we propose a maximum likelihood method to detect bad samples. Our observation is that mixture model-based methods such as Illuminus have very high call rates, but they are not always consistent because of noisy samples. Each noisy sample can affect the correlation matrix and the location parameter of a distribution, shifting the cluster away from its ideal position. This may result in faulty SNP genotype calls from Illuminus. Based on this observation, we introduce a new fitness function. It follows the idea of maximum likelihood (ML) based methods and maximizes the fitness of a mixture of Student distributions. If the presence of a sample in the data reduces the fitness, the sample is marked as a bad sample and removed. Moreover, to take advantage of existing quality control criteria, we use the missing rate to create a list of samples that are likely to be bad. Checking only the samples in this list greatly reduces the processing time for detecting bad samples.

The rest of the thesis is organized as follows. Chapter 1 introduces some biological background on DNA, the human genome, SNPs, genotypes, and SNP genotyping. Chapter 2 briefly introduces the three most popular algorithms that work with Illumina BeadChips (Illuminus, GenCall, and GenoSNP) and compares their performance. Chapter 3 presents our proposed method for detecting bad samples from the genotype results of Illuminus. Chapter 4 shows how our method works on real data, applied to two different datasets. Finally, we draw some conclusions in the last part of the thesis.

Chapter 1 Introduction

1.1 Biological background

In biology, the cell is the smallest unit of life. Organisms are called unicellular if they have only one cell (bacteria, for example). Most organisms are multicellular: their bodies contain more than one cell. A single person contains approximately 10 trillion ($10^{13}$) cells. Each cell has its own role in our body, and it knows its role and function from special instructions that reside in the cell's nucleus. These instructions come from DNA (deoxyribonucleic acid). DNA is like a blueprint for our cells: it contains a set of plans for building them.

Figure 1.1 shows the structure of DNA. DNA has the form of a double helix built from two sugar-phosphate backbones, nucleotides (bases), and hydrogen bonds between paired nucleotides. There are four types of nucleotide: A (adenine), C (cytosine), G (guanine), and T (thymine). A forms hydrogen bonds only with T, and C only with G. For this reason, when studying DNA, scientists only have to examine one of the two strands. AT and CG are called base pairs.

Figure 1.1: DNA structure

A cell contains not one DNA molecule but many. In the cell, DNA is packaged into units called chromosomes. Each organism has its own number of chromosomes: for instance, a dog has 78 chromosomes while a mosquito has only 6. Chromosomes come in pairs, one from the father and one from the mother, which is why children resemble both parents. The human genome consists of 23 pairs of chromosomes: one pair determines sex and the others are autosomal pairs.

Genes are parts of DNA that encode the information needed to build all the proteins in our body. Those proteins are very important because they keep our body functioning. The human body is estimated to contain approximately 25,000 genes. Genes normally contain thousands of
nucleotides. The relationship between genes and DNA can be understood as follows: DNA contains millions of characters (A, C, G, T); each group of three characters makes a word (three nucleotides, also called a DNA triplet, encode one amino acid when a protein is built); many words make a sentence; and each sentence is a gene. See Figure 1.2 for the relationship between the human genome, chromosomes, DNA, and genes.

Figure 1.2: Human genome, chromosome and genes

Human genome studies show that 99.9% of our genome is identical between individuals [CM01]; however, the appearance of each person is unique. For example, a person's eyes may be blue, black, or brown. This uniqueness is due to differences between our genome sequences, called genetic polymorphisms. Polymorphisms may result from mutations such as insertions, deletions, and substitutions.

1.2 Some common types of mutation

Substitution: DNA substitution occurs when one or more nucleotides are replaced by other nucleotides. Table 1.1 gives an example in which three bases in sequence 1 are replaced by three other bases in sequence 2 when the two sequences are aligned (the differing bases, at positions 5 to 7, are highlighted in red in the original table).

Table 1.1: An example of DNA substitution
sequence 1   A C G A T G C A A
sequence 2   A C G A A C G A A

Insertion and deletion: DNA deletion occurs when one or more nucleotides are removed from the DNA sequence, while DNA insertion occurs when nucleotides are inserted into it. When studying human DNA, these two types are often indistinguishable, so they are grouped together and called indel mutations.

Table 1.2: An example of DNA insertions and deletions
sequence 1   A C G A T G C A A
sequence 2   A C G A - - - A A

For instance, Table 1.2 can be described in two different ways. The first one is
that three bases have been inserted into sequence 1, and the other is that three bases have been deleted from sequence 2.

Others: Besides the three types above, there are many other types of polymorphism, for example gene duplication (creating multiple copies of a whole chromosome region and increasing the number of genes located in it), chromosomal inversion (reversing the order of a whole chromosome region), and many other mutations between chromosomes.

1.3 SNP and SNP genotype

A single nucleotide polymorphism (SNP) is one of the most common genetic polymorphisms between the genomes of members of a species; it occurs at a single nucleotide in the genome sequence. There is approximately one SNP per 1000 base pairs, so a single person has millions of SNPs in his or her genome. Table 1.3 is an example of a SNP between short DNA sequences: all bases in these sequences are the same except at one position (colored in red). For each person, this base may be the same as or different from another person's. The combination of all these SNPs is what makes each of us unique.

Alleles are the alternative forms of a gene present in an individual. Normally there are two alleles, one from the father and one from the mother. The set of alleles of an individual is called the genotype. A genotype at a SNP site, called a SNP genotype, is the pair of alleles, one from each chromosome copy, in a diploid organism. A SNP genotype is classified into three types: AA, AB and BB, where A and B [...]

3.2 Estimate the fitness of data

When comparing the fitness of two SNPs that have different numbers of samples, the raw likelihood is not a suitable measure. Therefore, instead of the original likelihood, we introduce a fitness function defined as:

\begin{align}
\mathrm{fitness}(SNP) &= \frac{1}{n}\log LK(\theta; D) \tag{3.3} \\
&= \frac{1}{n}\sum_{x \in D}\log \sum_{k=1}^{3} \alpha_k f(x; \mu_k, \Sigma_k, \upsilon_k) \tag{3.4} \\
&= \frac{1}{n}\sum_{x \in D} SampleVal_x \tag{3.5}
\end{align}

Here n is the number of samples in D, and SampleVal_x is the log of the mixture density at sample x. To illustrate why the fitness function can separate good samples from bad ones, first note
that the fitness function, after this transformation, is the average of SampleVal_x over the samples, and each x has its own SampleVal_x that depends on its location. On the one hand, suppose a particular sample x is a bad sample. Because x lies far from the centers of all three distributions, f(x; µ_k, Σ_k, υ_k) for k = 1, 2, 3 is much smaller than the density values of samples near one of the three centroids. As a consequence, SampleVal_x is smaller than the average and reduces fitness(SNP); hence, removing x increases fitness(SNP). On the other hand, good samples usually have a SampleVal above the average. The presence of good samples therefore increases fitness(SNP), and those samples should not be removed.

We extend the fitness function from a single SNP to the whole data set. Let fitness(SNP_i) be the fitness of the estimated parameters for SNP_i; then the fitness of all m SNPs is:

\[
\mathrm{fitness}(SNPs) = \frac{1}{m}\sum_{i=1}^{m} \mathrm{fitness}(SNP_i) \tag{3.6}
\]

The reasoning is the same as for a single SNP. A bad sample lowers the fitness of many SNPs and therefore lowers their average, which is the fitness of the whole data. In contrast, a good sample raises the average. Using this formula, we can therefore decide which samples in the data are bad and which are good.

3.3 Remove bad samples

Given a candidate sample s, we estimate two fitness values: F1 for all samples, and F2 for all samples without s. If F1 < F2, removing s from the data increases the total fitness. Otherwise (F2 ≤ F1), removing s brings no benefit.

To sum up, our method first creates a list of potential bad samples and then computes the fitness of the current Illuminus model on the given data. After that, each sample in this list is
selected according to its priority. When we try removing the selected sample, a new fitness value is calculated because the genotype data has changed. If the new fitness value is larger than the old one, the selected sample is a bad sample: it is removed permanently and the old fitness value is updated. Otherwise, the selected sample is not really a bad sample and it remains in the genotype data.

Chapter 4 Experimental result

According to Anderson et al. [APC+10], all samples with a missing rate higher than 3-7% must be marked as bad samples. However, some samples with a missing rate higher than 3% still produce many data points near the centroids of the clusters, which means they should not be removed during quality control. Simply thresholding the missing rate is therefore not good enough. Moreover, this threshold can lead to a massive reduction in the number of samples in the genotype data and a high workload for experts. In this chapter, we illustrate how our method works on real data and compare the number of samples removed by the missing rate criterion with the result of our method at the same threshold.

4.1 Input file format

Most of the output of the calling methods is converted to Variant Call Format (VCF). This format is widely used to describe SNP data such as intensity values and genotype clusters. Each VCF file contains three parts: meta-information lines, a header line, and data lines. Figure 4.1 is a simple example of a VCF file that stores SNPs of a sample.

Figure 4.1: VCF file format example

Meta-information lines: Meta-information lines describe what will appear in the VCF file. All meta-information lines start with "##". The first line gives the version of the VCF format in the 'fileformat' field; this field is mandatory. Along with the fileformat field, FORMAT fields specify the genotype data fields recorded for each SNP of each sample.
The FORMAT fields are declared in meta-information lines that begin with "##FORMAT=". For example, the third line of Figure 4.1 declares the Illuminus genotype call, abbreviated GI, whose value is given as a single string. Other value types are integer, float, and character.

Header line: The header line starts with a single "#" character. The first eight columns of this line are required, while FORMAT and the following columns are present only when the file contains genotype data. This line is tab-delimited.

Data lines: Data lines are also tab-delimited; each data line describes the record of one SNP. A data line consists of two kinds of fields, fixed fields and genotype fields.

• Fixed fields: each SNP record has eight fixed fields that correspond to the mandatory columns of the header line.
  CHROM is the identifier of the chromosome in the reference genome.
  POS is the reference position; the first base has position 1 and the nth base has position n.
  ID is the identifier of the SNP record.
  REF and ALT are the reference and alternate (non-reference) bases of the SNP record. Each base must be one of the uppercase characters A, C, G, T, N. ALT may also be ".", which means the alternate field is missing.
  QUAL is a floating-point number giving the phred-scaled quality score for the assertion made in ALT.
  FILTER is set to PASS if the call at this position passed all filters.
  INFO holds additional information for each record.
• Genotype fields: the genotype fields start with the FORMAT field, followed by one genotype data field for each sample ID in the header line. Each genotype data field lists its values in the same order as the FORMAT field, and the FORMAT declarations in the meta-information lines apply to these values. In a genotype data field, 0/0, 0/1, 1/1 and ./. represent the AA, AB, BB, and nocall genotypes.

For example, the 8th line of Figure 4.1 shows the SNP record on chromosome 20 at base 48174450.
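The split of a data line into fixed fields and genotype fields can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's own parser: the function name `parse_data_line`, the record ID `snp_example`, and the one-key FORMAT column are assumptions made for the example, and the colon separator inside the FORMAT and genotype columns is the standard VCF convention rather than something stated in the text. The chromosome, position, bases, and sample ID are taken from the thesis's example record.

```python
# Genotype codes as described in the text: 0/0, 0/1, 1/1 and ./. stand for
# AA, AB, BB and nocall.
GENOTYPE_CODES = {"0/0": "AA", "0/1": "AB", "1/1": "BB", "./.": "nocall"}

# The eight mandatory fixed columns of a VCF data line.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_data_line(line, sample_ids):
    """Split one tab-delimited VCF data line into fixed and genotype fields.

    Hypothetical helper for illustration; sample_ids are the sample columns
    of the header line, in order.
    """
    columns = line.rstrip("\n").split("\t")
    fixed = dict(zip(FIXED_COLUMNS, columns[:8]))
    fixed["POS"] = int(fixed["POS"])
    # Column 9 (FORMAT) names the keys; each later column holds one sample's values.
    format_keys = columns[8].split(":") if len(columns) > 8 else []
    genotypes = {}
    for sample_id, field in zip(sample_ids, columns[9:]):
        genotypes[sample_id] = dict(zip(format_keys, field.split(":")))
    return fixed, genotypes


# Hypothetical record: the ID and the single GI key are made up for illustration.
example = "20\t48174450\tsnp_example\tG\tC\t.\t.\t.\tGI\t0/1"
fixed, genotypes = parse_data_line(example, ["WG0087518-DNAA02"])
call = GENOTYPE_CODES[genotypes["WG0087518-DNAA02"]["GI"]]
print(fixed["CHROM"], fixed["POS"], fixed["REF"], fixed["ALT"], call)
```

Keeping the genotype mapping in a plain dictionary makes the nocall case explicit, which matters here because the missing rate of a sample is the proportion of its calls that are nocall.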
In that record, the reference base is G and the alternate base is C. No phred-scaled quality score or filter is specified. The genotype calls by Illuminus, GenoSNP, and GenCall for sample ID WG0087518-DNAA02 are AB, nocall, and AB respectively. The consensus call, obtained by majority vote among the three callers, is AB. The x and y intensities for this SNP record are 0.27 and 0.17.

4.2 Experiment 1

The demo data in the first experiment consists of 498 SNPs and 3656 samples in total. With a missing rate threshold of 10% for the candidate list, the number of potential bad samples is 91.

Table 4.1: Samples with the highest missing rates and their statistics in experiment 1
Sample name                     m rate   h rate
180546 E12 MLCP1 1M1424575      0.376    0.299
163335 E10 WT BS659131          0.313    0.295
163338 C10 WT BS659709          0.309    0.253
180526 A03 MLCP1 1M1300140      0.297    0.226
180546 A05 MLCP1 1M1301418      0.277    0.342

Table 4.1 shows the samples with the highest missing rates among the 91 candidates, along with some statistics about them. While the general method marks all 91 as bad samples, our method marks only 41 samples in this list. Visual inspection of the 50 candidates that our method does not mark as bad shows that most of the intensity values they produce are close to the centers of the genotype clusters for many SNPs. We therefore conclude that these 50 samples are not bad samples.

Figure 4.2: SNP rs2465126 before and after removing bad samples

Figure 4.2 shows SNP rs2465126 as an example. The red, green, and blue points show the AA, AB, and BB clusters respectively. In this example, the fitness values before and after removal are -2.583 and -2.568, so the fitness of SNP rs2465126 increased by 0.015 after removing 41 samples. The first and second graphs of this figure are the original output of Illuminus and the SNP after removing samples by our method. As the last graph in this figure shows, many data points among the
91 candidates lie in the centers of the AA cluster, the AB cluster, and particularly the BB cluster. These samples could be good samples. Compared with the third graph, which contains the 41 bad samples that we detected, most of these points are kept in the final result.

Figure 4.3: SNP rs2488991 before and after removing bad samples

Figure 4.3 is another example from the first experiment. The pattern is similar to that of rs2465126: if we removed all samples with a missing rate higher than 10%, many data points in the center of the BB cluster would be removed. Our method excludes only a small number of these points and leaves most of them untouched. Moreover, most of the outliers and some data points that lie too far from the centroids of the three clusters are removed by our method. These removals shift the centroid of each cluster toward the densest position, and consequently the fitness of this SNP increases.

Table 4.2: SNPs with large positive changes after removing bad samples
SNP name        m rate_b   m rate_a   h rate_b   h rate_a   fit diff
SNP1-939752     0.00191    0.00082    0.02329    0.02326    0.01619
rs2465126       0.00903    0.00629    0.49627    0.49610    0.01501
SNP1-746515     0.00766    0.00739    0.01406    0.01394    0.01372
SNP1-1105242    0.00410    0.00082    0.00439    0.00415    0.01028
SNP1-908480     0.00547    0.00356    0.35176    0.35064    0.01004

Table 4.2 shows the heterozygosity rate and missing rate of the five SNPs with the largest increases in fitness. In this table, m rate_b and m rate_a are the missing rates before and after the removal process; h rate_b and h rate_a are the corresponding heterozygosity rates; and fit diff is the difference in fitness between the two states:

\[
\text{fit diff} = \mathrm{fitness}(SNP)_{\text{after}} - \mathrm{fitness}(SNP)_{\text{before}} \tag{4.1}
\]

As the table shows, both rates decrease for all five SNPs.

Table 4.3: Number of bad samples with different thresholds in experiment 1
Threshold            10%   9%    8%    7%    6%    5%    4%    3%    2%
General QC method    91    111   144   187   263   375   516   722   1097
Our method           41    46    55    73    103   145   186   264   430

Table 4.3 shows the number of bad samples detected by our method and by the general QC method, which uses the missing rate alone, at different thresholds. It is clear from this table that missingness should not be the only measure for removing bad samples. For example, about 1100 samples exceed the 2% missing level, but fewer than half of them (430) are considered bad samples by our method. Furthermore, as we lower the threshold, the proportion of good samples in the candidate list increases. This means that a sample with a low missing rate is more likely to be useful for the data than a sample with a higher missing rate.

4.3 Experiment 2

In the second experiment, we use 1000 SNPs with 4473 samples. The data is part of the chromosome 20 SNP data of a Kenyan population. For each SNP in this data, we excluded from the three final clusters every sample data point whose Illuminus clustering confidence was below 95% and marked it as an outlier.

Table 4.4: Samples with the highest missing rates in experiment 2
Sample name                       m rate   h rate
WG0093168-DNA E03 ML650K652250    0.532    0.269
WG0093166-DNA C01 ML650K651827    0.472    0.273
WG0093167-DNA E08 ML650K651583    0.465    0.372
WG0093164-DNA G05 ML650K652118    0.448    0.130
WG0087550-DNAC07                  0.424    0.170

Table 4.4 shows the samples with the highest priority in the dataset of experiment 2. All of them have missing rates above 40%, and their heterozygosity rates are also quite high. All of them are marked as bad samples and removed by our method. With a threshold of 10%, our method detects 432 bad samples among 434 candidates. Figure 4.4 is an example from the second experiment, in which the fitness of the SNP increased by 0.0378. As the figure shows, most of the removed points lie in the heterozygous cluster (the green
cluster), and many of them are also outliers, which means that removing them can reduce the h rate and m rate.

Figure 4.4: SNP rs6055460 before and after removing bad samples

Table 4.5 lists five SNPs whose fitness improves dramatically in this experiment. As many outliers have been removed, the missing rates of these SNPs decrease, in most cases to very low values. The heterozygosity rate of each SNP also drops slightly. Table 4.6 shows the number of bad samples at different thresholds in experiment 2. The pattern of finding useful samples in the candidate list as the threshold is lowered is similar to that of the first experiment. Again, our method manages to pick out the bad samples from the list that a plain threshold would eliminate entirely.

Table 4.5: SNPs with large positive changes after removing bad samples
SNP name       m rate_b   m rate_a   h rate_b   h rate_a   fit diff
rs498363       0.20344    0.18712    0.33764    0.31929    0.18751
rs2298109      0.02124    0.00313    0.49627    0.49610    0.15752
rs6048226      0.01364    0.00335    0.34927    0.34724    0.14952
cnvi0018901    0.00469    0.00156    0.00741    0.00122    0.14251
rs3827153      0.01833    0.00492    0.21453    0.19582    0.13742

Table 4.6: Number of bad samples with different thresholds in experiment 2
Threshold            10%   9%    8%    7%    6%    5%    4%    3%    2%
General QC method    434   490   542   603   671   726   804   919   1118
Our method           432   485   535   589   652   700   767   858   1031

Conclusion

Studying SNPs is now one of the most active research areas. Results on SNPs and SNP genotypes could lead to new methods of disease prediction, treatment, and so on. Many algorithms have been developed to solve the SNP genotyping problem. Among them, Illuminus, GenCall, and GenoSNP are the three most popular. These methods give very good results on the HapMap database. However, with real data that contains many more noisy samples and SNPs than the HapMap database, the
number of outliers is proportional to the density of noise. Therefore, several criteria have been proposed to remove noisy samples and SNPs. The missing rate is a very useful criterion for controlling sample quality in SNP calling results. However, using only the missing rate is not a good choice. As both experiments show, although the missing rate helps remove many outliers, some of the removed samples actually lie within the genotype clusters. Thus, if we use only the missing rate, the number of samples that experts must re-check is too large and the probability of removing useful samples is too high. The results of these experiments also indicate that our method works well on real genotype data as a quality control method for detecting and removing bad samples from raw data. Furthermore, the statistics at multiple thresholds for each dataset show that our method retains samples that contribute well to the genotype clusters.

In short, current quality control methods for samples rely on simple thresholds and manual processing to cut off bad samples, so they lack mathematical justification. Our method uses maximum likelihood as the basis for automatically checking whether a sample from raw data is bad. Hence, it could be a useful post-processing method for detecting noisy samples without requiring experts.

Publications

Ha Anh Tuan Nguyen, Sy Vinh Le, Si Quang Le. A maximum likelihood method for detecting bad samples from Illumina BeadChips data. Knowledge and Systems Engineering 2012 (accepted).

Bibliography

[APC+10] C.A. Anderson, F.H. Pettersson, G.M. Clarke, L.R. Cardon, A.P. Morris, and K.T. Zondervan. Data quality control in genetic case-control association studies. Nat Protoc, 5(9):1564–73, 2010.
[CBSI07] Benilton Carvalho, Henrik Bengtsson, Terence P. Speed, and Rafael A. Irizarry. Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics, 8(2):485–499, 2007.
[CM01] Francis S. Collins and Victor A. McKusick. Implications of the Human Genome Project for medical science. JAMA: The Journal of the American Medical Association, 285(5):540–544, 2001.
[GYC+08a] Eleni Giannoulatou, Christopher Yau, Stefano Colella, Jiannis Ragoussis, and Christopher C. Holmes. GenoSNP: a variational Bayes within-sample SNP genotyping algorithm that does not require a reference population. Bioinformatics, 24(19):2209–2214, 2008.
[GYC+08b] Eleni Giannoulatou, Christopher Yau, Stefano Colella, Jiannis Ragoussis, and Christopher C. Holmes. A genotype calling algorithm for the Illumina BeadArray platform. Bioinformatics, 24(19):2209–2214, 2008.
[Inc05] Illumina Inc. Illumina GenCall data analysis software. http://www.illumina.com/documents/products/technotes/technote_gencall_data_analysis_software.pdf, 2005.
[Inc06] Illumina Inc. Infinium II assay workflow. http://www.illumina.com/documents/products/workflows/workflow_infinium_ii.pdf, 2006.
[KF01] Larry J. Kricka and Paolo Fortina. Microarray technology and applications: An all-language literature survey including books and patents. Clinical Chemistry, 47(8):1479–1482, 2001.
[MB88] G.J. McLachlan and K.E. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.
[MK97] G. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley, New York, 1997.
[PPP+06] A.L. Price, N.J. Patterson, R.M. Plenge, M.E. Weinblatt, N.A. Shadick, and D. Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet, 38(8):904–909, August 2006.
[RCH+09] Matthew E. Ritchie, Benilton S. Carvalho, Kurt N. Hetrick, Simon Tavaré, and Rafael A. Irizarry. R/Bioconductor software for Illumina's Infinium whole-genome genotyping BeadChips. Bioinformatics, 25(19):2621–2623, 2009.
[Syv05] Ann-Christine C. Syvänen. Toward genome-wide SNP genotyping. Nature Genetics, 37 Suppl, June 2005.
[TIS+07] Yik Y. Teo, Michael Inouye, Kerrin S. Small, Rhian Gwilliam, Panagiotis Deloukas, Dominic P. Kwiatkowski, and Taane G. Clark. A genotype calling algorithm for the Illumina BeadArray platform. Bioinformatics, 23(20):2741–2746, 2007.

Date posted: 23/09/2020, 22:38
