Association study of ABCA1 polymorphisms in Singapore populations

Chapter 5 ABCA1 SNP Survey

5.1 Introduction

Evidence that ABCA1 gene mutations are responsible for the familial high density lipoprotein (HDL) deficiency disorders of Tangier disease and familial hypoalphalipoproteinemia (Bodzioch et al., 1999; Brooks-Wilson et al., 1999; Marcil et al., 1999; Remaley et al., 1999; Rust et al., 1999), together with the long-established epidemiological finding that HDL levels are inversely related to coronary artery disease (CAD; Wang and Briggs, 2004), suggests that common ABCA1 genetic variations may explain phenotypic variation in HDL levels and CAD susceptibility in the general population. Association studies are one approach to investigating this notion, but to facilitate such studies, single-nucleotide polymorphisms (SNPs) in ABCA1 must first be identified.

When this research was initiated in early 2000, few ABCA1 SNPs had been reported in the literature. Pullinger et al. (2000) discovered -278G>C and -14C>T in the proximal promoter as well as 237indelG and 296C>G in the 5’ untranslated region (UTR) while mapping the transcriptional initiation site of the ABCA1 gene. In another report, Wang et al. (2000) found four missense (R216K, V825I, M883I and R1587K) and five silent coding SNPs (cSNPs; P312, G316, I680, L960 and T1427) during their cDNA-based resequencing efforts.

Here, we surveyed the sequence variation in the ABCA1 gene among Singapore Chinese, Malays and Indians using experimental and in silico strategies. A segment of the ABCA1 proximal promoter was amplified and resequenced. Two strategies were used to examine variation in the exonic regions. Individual exons were amplified and subjected to heteroduplex detection using denaturing high performance liquid chromatography (DHPLC) under partially denaturing conditions.
In addition, expressed sequence tags (ESTs) and full-length mRNAs from the ABCA1 UniGene cluster were multiply aligned, and candidate SNPs were identified from regions of sequence overlap. The results of the SNP discovery efforts are presented in chronological order.

5.2 Results

5.2.1 First SNPFINDER Analysis of ABCA1 UniGene ESTs (Early 2000)

In silico SNP discovery essentially mines SNPs from pre-existing DNA sequences. Public domain sequences such as ESTs provide a rich resource for SNP discovery (Buetow et al., 1999; Garg et al., 1999; Marth et al., 1999; Picoult-Newberg et al., 1999; Irizarry et al., 2000; Cox et al., 2001). SNPFINDER (Buetow et al., 1999) was chosen to perform in silico SNP mining in the ABCA1 gene because of its integration with the UniGene database, which consists of gene-specific ESTs and full-length mRNAs, and with public DNA electrophoretogram archives. The key steps in the SNPFINDER pipeline are basecalling of DNA traces using PHRED, which assigns a quality (Q) value to each base; assembly and alignment of multiple sequences using PHRAP; and finally, identification of candidate variants from regions of sequence overlap using a statistical analysis that takes sequence quality into account. By examining sequencing traces, bona fide allelic variants can be discerned from sequencing errors with greater confidence than by mere comparison of text-based sequences, for instance using BLAST (Cox et al., 2001). Candidate variants flagged by SNPFINDER are assigned a score which directly reflects the probability that a position within a given assembly has heterogeneity in nucleotide composition (Buetow et al., 1999).

When this survey was first conducted in early 2000, the ABCA1 UniGene cluster comprised two mRNAs and 60 non-redundant ESTs (Table 5.1). Most of the ESTs originated from normal tissues.
DNA traces were available for approximately half (31/60) of the ESTs, and these, together with the mRNAs, were analyzed by SNPFINDER. Six candidate sequence variants with scores ranging from 0.05 to 0.96 were predicted (Figures 5.1 and 5.2). All candidate variants are located in the 3’ portion of the ABCA1 gene, which probably reflects the fact that cDNA synthesis is generally primed by oligo dT primers, as well as the fairly long 3’ portion of the ABCA1 mRNA (Santamarina-Fojo et al., 2000). Since SNPFINDER was primarily developed for large-scale identification of SNPs, the original authors used a high arbitrary cutoff score of 0.99 in order to achieve a higher confirmation rate at the validation stage. Applying such a stringent cutoff in our case would have yielded no hits. Nevertheless, closer inspection of the multiple alignment revealed a highly probable A>G polymorphism (SNPFINDER score 0.96) at the position corresponding to nucleotide 8995 in the 3’UTR of the ABCA1 mRNA (numbering with respect to reference sequence NM_005502). As illustrated in Figure 5.1, two out of nine ESTs, originating from distinct tissue sources, harbour the minor G variant. Moreover, the candidate variant is flanked by high quality bases (denoted by uppercase letters in the multiple alignment in Figure 5.1), further supporting the existence of the SNP. In contrast, the candidate variants at positions 9091, 9410, 10027, 10029 and 10032 were less likely to be true SNPs because they occurred as singletons, possessed low scores or were flanked by low quality bases (Figure 5.2). Some of them might nonetheless represent true but rare variants whose detection would require more members in the contig. Indeed, the variant at nucleotide 9410 has since been confirmed in a Japanese population survey (Iida et al., 2001).
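
The role of base quality in this judgement can be illustrated with the PHRED quality scale, Q = -10 log10(p), where p is the estimated basecall error probability (Ewing and Green, 1998). The Python sketch below is illustrative only and is not the SNPFINDER scoring method: the function names are hypothetical, and the joint probability assumes that basecall errors in different reads are independent.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a PHRED quality value Q to the basecall error probability p."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability p (0 < p <= 1) to a PHRED quality value Q."""
    return -10 * math.log10(p)


def prob_all_calls_wrong(qualities) -> float:
    """Probability that every supporting basecall is erroneous.

    Assumption (for illustration): errors in different reads are independent.
    The calculation ignores whether the errors would yield the same base.
    """
    prob = 1.0
    for q in qualities:
        prob *= phred_to_error_prob(q)
    return prob
```

Under this assumption, two reads supporting a minor allele at Q30 each give a joint error probability of about 10^-6, whereas a single Q10 read leaves a 10% chance of error. This is one reason why singletons flanked by low quality bases were treated with caution.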
To experimentally verify the existence of the 8995A>G candidate SNP, a short 109 bp segment containing the candidate SNP was amplified and subjected to single-strand conformation polymorphism (SSCP) analysis and sequencing. Figure 5.3 shows the three distinct band migration patterns on SSCP gels, which match the three expected genotypes identified by sequencing analysis. The 8995A>G SNP is located ~1900 bp downstream of the stop codon and ~1400 bp upstream of the first polyadenylation motif, AAUAAA (Santamarina-Fojo et al., 2000).

To assess the potential biological significance of 8995A>G, we searched for conserved sequences in the ABCA1 3’UTR. No sequence conservation was found among the 3’UTRs of the ABCA1 mRNAs from human, mouse, rat and chicken. This is not unexpected since the 3’ ends of genes are generally more heterogeneous among species than protein coding sequences (Makalowski et al., 1996). The less conserved nature of 3’UTRs possibly confers flexibility in spatial and temporal aspects of gene regulation in a manner specific to the organism (Conne et al., 2000). Furthermore, a search against the 3’UTR database (Pesole et al., 2002; http://bighost.area.ba.cnr.it/BIG/Blast/BlastUTR.html) revealed no other genes carrying sequence motifs similar to the ABCA1 3’UTR.

Allelic variants can create different structural folds in mRNA, leading to different phenotypic consequences (Shen et al., 1999). The MFOLD program (Zuker, 2003; http://www.bioinfo.rpi.edu/applications/mfold/old/rna/) was used to investigate whether the A and G allelic variants at 8995 would affect the folding of the ABCA1 3’UTR. The RNA secondary structures predicted for the two variants appear similar to one another (Figure 5.4).

Table 5.1 List of mRNAs and ESTs in the human ABCA1 UniGene cluster Hs.211562 from a release in early 2000.
“*” indicates ESTs with DNA traces (31 in total) that are available from public FTP archives (e.g. Washington University Genome Sequencing Centre) and which could be automatically retrieved by SNPFINDER for SNP detection.

mRNA: NM_005502, AJ012376

EST          cDNA source
AI802228*    Pancreas
AA627178*    Thyroid
AI807534     Unknown
AI356194*    Brain
AI628099*    Kidney
R01050*      Unknown
R01051*      Unknown
AI359714*    Brain
AA902925*    Smooth muscle
AA493786*    Thyroid
AI399824*    Unknown
AA731742*    Tonsil
AI819656     Lung
R31961*      Placenta
AW051752     Kidney
AA814091*    Tonsil
AI695068*    Lung
AA625082*    Unknown
AI344681*    Kidney
AA292158*    Ovary
AA669024*    Lung
AW130712     Stomach
N63586*      Central nervous system
AW190098     Pancreas
AW019981     Ear
AA704305*    Unknown
D79969       Aorta
AA618309     Thyroid
AI241822*    Brain
AW019972     Ear
AL048638     Unknown
AA434152*    Ovary
AA302670     Adipose
AL038231     Unknown
AL048434     Unknown
AA618276*    Thyroid
AA883989*    Unknown
AA521292*    Tonsil
AA328447     Whole embryo
N94914       CNS
AW364342     Denys-Drash
AA748860*    Tonsil
AW044702     Pooled
AW364344     Denys-Drash
AW364424     Denys-Drash
N36906*      Foreskin
AI707785*    Aorta
AW364428     Denys-Drash
C01846       Unknown
AA367573     Placenta
AW006879     Kidney
N46182*      Foreskin
AW362709     Colon
AW380897     Head neck
AA442439*    Whole embryo
AA357618     Prostate
AL048433     Uterus
AW364331     Denys-Drash
AA826281*    Tonsil
AA302777     Adipose

Figure 5.1 A high confidence candidate 8995A>G SNP identified from the ABCA1 UniGene cluster Hs.211562 (early 2000 release). Only ESTs with DNA traces (indicated by an asterisk in Table 5.1) were analyzed with the SNPFINDER program. These sequences were basecalled by PHRED and assembled by PHRAP, and candidate variants were identified from regions of overlap by DEMIGLACE (Buetow et al., 1999). (A) Position of the candidate SNP in the context of the multiple alignment. Two out of nine ESTs harbour the G variant. Upper and lower case letters in the alignment denote bases of high and low sequence quality respectively.
(B) Representative DNA traces with the corresponding PHRED Q values at the variant position. The Q value is a measure of the quality of each basecall and is related to the error probability p by $Q = -10 \log_{10} p$ (Ewing and Green, 1998).

Figure 5.2 SNPFINDER multiple alignment showing the low confidence candidate variants identified from members of the ABCA1 UniGene cluster Hs.211562. Candidate variants are highlighted in blue or green columns. (A) 9091G>A. (B) From left to right, 10027T>G, 10029G>A and 10032A>T. (C) 9410A>G. Upper and lower case letters in the alignment denote high and low base quality calls respectively. SNPFINDER scores ranged between 0.05 and 0.66. This analysis was conducted in early 2000.

Figure 5.3 Experimental confirmation of 8995A>G, a novel SNP in the 3’UTR of the ABCA1 gene. (A) SSCP analysis of a short 109 bp amplicon flanking the SNP reveals three reproducible and distinctive band migration patterns. (B) DNA traces showing the three representative genotypes.

Figure 5.4A Lowest energy RNA structure predicted by MFOLD (Zuker, 2003) for the 8995A allele. Due to size limits imposed by the program, only the 3’UTR (nucleotides 144716-148034 in GenBank entry AF275948) was folded. The position of the 8995A allele is indicated by an arrow.

Figure 5.4B Lowest energy RNA structure predicted by MFOLD (Zuker, 2003) for the 8995G variant.
Figure 5.13 DHPLC elution profiles of PCR fragments harbouring intronic ABCA1 SNPs. The top and bottom traces in each plot represent heterozygous and homozygous wild-type samples respectively. DHPLC analysis temperatures are indicated. Panels (absorbance in mV versus time in minutes): IVS1+92T>C, 60 °C; IVS6-161G>T, 59 °C; IVS8-14indelA, 60 °C; IVS14-59C>T, 61 °C; IVS21-84C>T, 61 °C; IVS23-67T>C and IVS23-70T>C, 60 °C; IVS23+102A>G, 62 °C; IVS25+23G>A, 59 °C; IVS43+148G>A, 56 °C; IVS44+18C>T, 55 °C; IVS45-24T>C, 59 °C; IVS46-48G>T, 55 °C; IVS48+13A>G and IVS48+117A>G, 56 °C; IVS49+55G>C, 59 °C.

Figure 5.14 Distribution of intronic SNPs with respect to splice junctions. (A) SNPs located downstream of 5’ splice donor junctions. (B) SNPs located upstream of 3’ splice acceptor junctions. (C) Overall distribution of intronic SNPs. Each panel plots frequency against distance from the splice junction.

Table 5.5 List of ABCA1 SNPs from dbSNP that may potentially lie in the fragments analyzed by DHPLC.
DHPLC fragment  dbSNP rs#  SNP (flanking sequence)                      Type      Variant
exon            13306079   TAGTCnCGGCAAAAACCCCG[C/T]AATTGCGAGCGAGAGTGAGT  5'UTR     2C>T
exon            2243312    CTCGGTTTCGGGGACTTTGA[C/T]CCGGAGCCCCACATCCCCAC  Intronic  IVS1+92T>C
exon            13306082   GGGGGGTTCTCTCATTTTTT[C/G]TTTGTGGTTTTGAGTTGGGG  Intronic  IVS3+44C>G
exon            2482424    CCATATCCnGACCACACAGT[C/T]CTGTGTCCATAAGAGCCAGA  Intronic  IVS2+123C>T
exon            1800978    CTCCCGAGCCACACGCTGGG[C/G]GTGCTGGCTGAGGGAACATG  5'UTR     296C>G
exon            1799777    CAGTTAATGACCAGCCACGG[/G]GTCCCTGCTGTGAGCTCTGG   5'UTR     237indelG
exon            7341705    TGCCTGGATTTTCCCTGCCA[C/T]TGGTGAGTCTCAGGCAACAT  Intronic  IVS3+35C>T
exon            2515626    GAnATCCnTATnATTTTTAA[A/T]TTTACAGTGTTCTATCTTAT  Intronic  IVS2-33A>T
exon            2515625    GTGGCATTGGAnATCCnTAT[C/T]ATTTTTAAnTTTACAGTGTT  Intronic  IVS2-42C>T
exon            2472376    CCCAGTGGCATTGGAnATCC[A/G]TATnATTTTTAAnTTTACAG  Intronic  IVS2-51A>G
exon            2275545    AAnTTAAAAATnATAnGGAT[A/G]TCCAATGCCACTGGGATCTG  Intronic  IVS2-55A>G
exon            2515628    GTTTTTCTACTCCTCGATTT[C/G]AACAGGCCATTTTCCAAATA  Intronic  IVS3-7C>G
exon            12003906   AAGGCGAAAGGCTTTAGCTA[G/T]GCCAACTGCCTCTGCTGCTT  Intronic  IVS4-39G>T
exon            2243313    GTGCTTTGATACATTCTGAG[G/T]TTCAGTAAAGAGACCTGATG  Intronic  IVS6-161G>T
exon            2230805    GAAACCTTCTCTGGGTTCCT[A/G]TATCACAACCTCTCTCTCCC  Coding    L158
exon            2230806    TGAGCTTTGTGGCCTACCAA[A/G]GGAGAAACTGGCTGCAGCAG  Coding    R219K
exon            2274873    ATCTTCAGnCCCCCTCCCTC[A/G]GGATGCCCGCAGACAATACG  Coding    P312
exon            2246841    TTGAGAGACTTGATCTTCAG[C/T]CCCCCTCCCTCnGGATGCCC  Coding    G316
exon            2067484    taccctcctttctgtccccn[/A]ccctGACGCCAGCTGTTCAG   Intronic  IVS8-14indelG
exon 11         4392957    GGGGCTGAGTTCCTCCCACA[G/T]GCCTTCCAGATCATGGAACA  Coding    L415M
exon 13         13306068   CTGGAATTACTCCAGGCAGC[A/G]TTGAGCTGCCCCATCATGTC  Coding    I546V
exon 14         4743763    CACAGTATAAACTGGTTAAA[A/T]ACAGTGGCTTGCAGGTAACT  Intronic  IVS14+24A>T
exon 15         7031748    ATGAACCAGCTAAACCAGAG[G/T]ATGCTGTTGTCCAGGCCCAT  Coding    I680
exon 15         2066717    CAGGCTCAGAGGCCTTGGCC[C/T]ATCACCCTGGCTCACGTGTG  Intronic  IVS14-59C>T
exon 16         2066718    GTGTGGCATGGCAGGACTAC[A/G]TGGGCTTCACACTCAAGATC  Coding    V771M
exon 17         13306070   CTTATGATTGATCCATTTTG[C/T]AAATTCAAATTTCTCCAGGT  Intronic  IVS16-96C>T
exon 17         4149312    GCTTCAATCTCACCACTTCG[A/G]TCTCCATGATGCTGTTTnAC  Coding    V825I
exon 17         2472458    CGnTCTCCATGATGCTGTTT[A/G]ACACCTTCCTCTATGGGGTG  Coding    N831D
exon 18         4149313    GGTTCCAACCAGAAGAGAAT[A/G]TCAGAAAGTAAGTGCTGTTG  Coding    M883I
exon 19         12344156   TGTAAGGACACAGnGCACTG[C/T]GCTTGAACTTTCAGAAGTGC  Intronic  IVS18-74C>T
exon 19         3818689    AGTGGAGTGTAAGGACACAG[C/G]GCACTGnGCTTGAACTTTCA  Intronic  IVS18-68C>G
exon 21         2482437    CTCCGCCTTCACGTGCTTCT[C/T]AGAGAGCCCTTTCAAGCGGG  Coding    L1005G
exon 22         13306072   TTGTCGGGGGATCTAAGGTT[A/G]TCATTCTGGATGAACCCACA  Coding    I1054V
exon 22         2066719    CGGTTnGTAACAGAAACTTG[C/T]CCCTGGCTGTGCCCCTAGGT  Intronic  IVS21-84C>T
exon 23         13306073   ACCACATGGATGAAGCGGAC[A/G]TCCTGGGGGACAGGATTGCC  Coding    I1096V
exon 23         2020926    ACCCCTTTTGCCATGTTGAA[A/G]CCACCATCTCCCTGCTCTGT  Intronic  IVS23+102A>G
exon 24         13292447   GGCAGAGCCACCAGCACCTT[C/T]nCCnGGGAGGGCTTCCAAGA  Intronic  IVS23-66T>C
exon 24         2076731    TTCTTGGAAGCCCTCCCnGG[C/T]nAAGGTGCTGGTGGCTCTGC  Intronic  IVS23-67T>C
exon 24         2076730    AGCTTCTTGGAAGCCCTCCC[C/T]GGnnAAGGTGCTGGTGGCTC  Intronic  IVS23-70T>C
exon 24         2297401    CGCCACCAAGCTGCTCCCAC[A/C]CCCGCAGCCACCCAGCCCCA  Intronic  IVS24+141T>G
exon 25         2234885    AAGTTAAGTGGCTGACTGTC[A/G]GAATATATAGCAAGGCCAAA  Intronic  IVS25+23G>A
exon 25         2230807    TTTCATGAGATTGATGACCG[A/G]CTCTCAGACCTGGGCATTTC  Coding    R1228
exon 31         2066716    TCTTCCCTTTGCAGAGACAC[A/G]CCCTGCCAGGCAGGGGAGGA  Coding    T1427
exon 34         2297404    CTGTGCACCCCACTGTCTGG[C/G]TTTTAATGTCAGGCTGTTCT  Intronic  IVS33-26C>G
exon 34         1997618    TCAAGAAGTTAATGATGCCA[C/T]CAAACAAATGAAGAAACACC  Coding    T1555I
exon 35         2230808    TATGACAGGACTGGACACCA[A/G]AAATAATGTCAAGGTAAACC  Coding    R1587K
exon 36         1883024    GCAGCTCTCAGAGGTGGCTC[C/T]GTAAGTGTGGCTGTGTCTGT  Coding    L1648P
exon 37         13306075   TTTGTCGTATTCCTGATCCA[A/G]GAGCGGGTCAGCAAAGCAAA  Coding    Q1678
exon 43         2066720    GCAATTAGTCATCGAGAAGA[A/G]AGGGACCCTGTATGTCAAGA  Intronic  IVS43+148G>A
exon 44         2020927    TAGGTGAGAAAAGAAGTGGC[C/T]TGTATTTTGCTGCAAAGACT  Intronic  IVS44+18C>T
exon 45         4149336    AATATATACCTTATGGCTTT[C/T]CCACACGCATTGACTTCAGG  Intronic  IVS45+68C>T
exon 46         12346609   AGAAACACCACAGGAGGCCC[A/G]CCGATCAAAGCCATGGCTGT  Coding    G2061
exon 46         2020928    ATTCCCAAGCCAGACCAAAG[C/T]CAAGGTGCTTTTTATCACTG  Intronic  IVS45-24T>C
exon 47         2020929    TTAAAAACAAATTTATCTTT[G/T]ATTTTTTTTCCCCCCAGCAA  Intronic  IVS46-48G>T
exon 48         2740485    TATATAAATGGTGCCTCTAA[A/C]ATAAAGGGAAATAAAACTGA  Intronic  IVS48+85A>C
exon 48         2066882    ATAAAACTGAGCAAAacagt[A/G]tagtggaaagaatgagggct  Intronic  IVS48+13A>G
exon 48         2066881    AAAAATAGGTAATAAAGATA[A/G]TTTCTTTGGGATAGTGCCTA  Intronic  IVS48+117A>G
exon 49         1331924    GGACCTATGGGCGGGAGTGT[C/G]GGGGCAGGAATCCACACCCT  Intronic  IVS49+55G>C

Rows are aligned by the order of entries within each column. Fragments listed as "exon" carry no exon number in the source. Entries for the remaining columns (Chinese a, Malays a, Indians a; No. of independent dbSNP submissions (ss#); Allele frequency) could not be assigned to individual rows and are given in source order:
3 2 3 1 1 1; Caucasian 0.88; Japanese 0.49; Multi-national 0.78; 10; African American 0.31; 1; Caucasian 0.19; Pacific Rim 0.48; Hispanic 0.74; Multi-national 0.75
1; Caucasian 0.92
4; Japanese 0.11; Caucasian
1 3
12
2 4
1 2 1 1; African American 0.71; Caucasian 0.73; Japanese 0.63; Pacific Rim 0.58; Hispanic 0.74; Multi-national 0.80; Caucasian 0.92; Caucasian 0.92; Caucasian 0; Caucasian; Caucasian 0.07; African American 0.36; Asian 0.37; Caucasian 0.18; 1 2 1; African American 0.44; Caucasian 0.10; Pacific Rim 0.58; Multi-national 0.17; African American 0.44; Caucasian 0.10; Pacific Rim 0.60; Hispanic 0.20; Caucasian 0.11; Japanese 0.34; Japanese 0.33; Japanese 0.13; Caucasian; Multi-national 0.60; Caucasian 0.87; Japanese 0.73; African American 0.98; Caucasian 0.79; Pacific Rim 0.81

a Number of samples showing heteroduplex profiles. Sample size screened per ethnic group is 16 individuals.

5.3 Discussion

The primary purpose of the work described in this chapter was to assess the genetic variation of the ABCA1 gene in the three diverse ethnic populations of Singapore as a prelude to an association study in the same population base. Using public domain ESTs, a high confidence candidate 8995A>G transition in the 3’UTR was identified and verified experimentally.
Initial screening of individually amplified ABCA1 exons using DHPLC uncovered six coding variants (five missense cSNPs, R219K, V771M, V825I, M883I and R1587K; one silent cSNP, T1427), two 5’UTR variants (237indelG and 296C>G), as well as 16 intronic variants. Finally, resequencing of a 600 bp segment of the proximal ABCA1 promoter unveiled a total of seven SNPs (-14C>T, -99G>C, -278C>G, -302C>T, -407C>G, -463C>T and -564T>C).

5.3.1 SNPFINDER Analysis

As a first approach to discovering SNPs in the ABCA1 gene, we mined publicly available ESTs. ESTs are a promising resource for SNP discovery because the majority of cDNA libraries have been generated from several individuals, and even within the same cDNA library, SNPs can be identified as a result of the differences between maternal and paternal sequences (Picoult-Newberg et al., 1999). In addition, the EST-based SNP discovery strategy can potentially enrich for coding variants, and it taps into the large repository of public domain sequences, which potentially saves time and effort in SNP development. Despite these advantages, factors such as the source of the material for cDNA library construction, the cDNA construction methodology and the number of sequences impose practical limits on the efficiency of the EST-based SNP discovery strategy. For instance, the validated 9410A>G variant (rs4149341; Iida et al., 2001) was initially identified as a low confidence SNP because the G variant was encountered only once, as a low quality basecall, out of seven sequences (Figure 5.2C), but with an increased number of sequence reads in the second analysis, it became detectable without ambiguity (Figure 5.5F). Similarly, the validated 8539C>G (rs4149339; Iida et al., 2001) was absent from the first SNPFINDER analysis but appeared as a strong candidate SNP in the second analysis (Figure 5.5C).
In silico SNP mining is biased towards SNPs of high heterozygosity (Buetow et al., 1999; Cox et al., 2002), so for contigs of the same depth, a more frequent SNP will be detected with greater sensitivity than a rarer SNP. Failing to filter out ESTs derived from tumourigenic sources can also inflate the false discovery rate. In the second SNPFINDER analysis, in which ESTs from tumourigenic sources were automatically included, there were as many as four potential false positives among the seven high confidence SNPs. The false positive rate could therefore be lowered by excluding such ESTs from the analysis. On the other hand, it cannot be excluded that these high confidence variants are rare, tumour-causing mutations, although the concurrent presence of several neighbouring variants suggests they are more likely to be experimental artifacts. The high error rate (2%) of ESTs may also contribute to an inflated false discovery rate; it is due in part to the error-prone nature of the reverse transcriptase used in cDNA synthesis, as well as the low quality, single-pass nature of the sequencing (Pontius et al., 2003). The lack of sensitivity and specificity in identifying ABCA1 SNPs from EST data is consistent with previous reports (Buetow et al., 1999; Picoult-Newberg et al., 1999; Cox et al., 2001).

The finding that the three validated SNPs 8539C>G, 8995A>G and 9410A>G from the EST-based analyses mapped within the 3’UTR of the ABCA1 gene is not unexpected because cDNA libraries are often constructed by oligo dT-priming in order to enrich for mRNA targets rather than the more abundant rRNA. Furthermore, cDNA synthesis from long transcripts poses a technical challenge (Strausberg et al., 1999). Thus, for genes with long 3’UTRs like ABCA1, the EST-based strategy may not be the best approach to enrich for putative coding variants. The 3’ bias of candidate SNPs and the lack of nonsynonymous SNPs identified from EST alignments have been noted by Cox et al. (2001).
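
To illustrate how a per-base error rate of this magnitude can generate spurious candidates, the sketch below computes the binomial probability that at least k of n aligned reads carry a miscalled base at a given position. It assumes independent errors at the quoted 2% rate and counts any miscall, not only miscalls to the same base, so it is an upper bound and not an estimate of the SNPFINDER false discovery rate. The function name is hypothetical.

```python
from math import comb


def prob_at_least_k_errors(n_reads: int, k: int, error_rate: float = 0.02) -> float:
    """P(at least k of n_reads have a miscalled base at one position).

    Assumption (for illustration): errors are independent across reads
    and occur at a uniform per-base rate.
    """
    if k <= 0:
        return 1.0
    # Sum the binomial probabilities of observing fewer than k errors.
    below_k = sum(
        comb(n_reads, i) * error_rate**i * (1 - error_rate) ** (n_reads - i)
        for i in range(min(k, n_reads + 1))
    )
    # Guard against tiny negative values from floating-point rounding.
    return max(0.0, 1.0 - below_k)
```

With nine reads, as in the 8995A>G alignment, the chance of at least one miscall at a position is about 17% under these assumptions, whereas the chance of two or more is about 1.3%. This is in keeping with singletons being weaker evidence than a variant seen in two independent ESTs.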
The principles of identifying SNPs from sequence overlaps were initially applied to EST data (Buetow et al., 1999; Garg et al., 1999; Marth et al., 1999; Picoult-Newberg et al., 1999; Clifford et al., 2000; Irizarry et al., 2000). In recent years, however, with the abundance of data from the human genome sequencing project, there has been a shift in emphasis towards the use of genomic sequence data for SNP discovery (Altshuler et al., 2000; Sachidanandam et al., 2001). Genomic sequence data provide a better way to mine SNPs because of the stringent sequencing quality requirements, more even coverage, redundancy and the large overlaps of neighbouring genomic clones (Gu et al., 1998).

5.3.2 Promoter Resequencing

A total of seven promoter variants (-14C>T, -99G>C, -278C>G, -302C>T, -407C>G, -463C>T and -564T>C) were detected by resequencing. As was the case with the DHPLC screen, a limitation of the promoter survey was that the small sample size of 16 individuals per ethnic group prevented an unbiased assessment of the allele and genotype distributions of the promoter variants. Nevertheless, a review of recent extensive SNP surveys of the ABCA1 promoter indicated that all but one of the promoter variants identified in the present study were also found in Caucasians and Japanese (Zwarts et al., 2002; Probst et al., 2004; Tregouet et al., 2004; Shioji et al., 2004; Yamakawa-Kobayashi et al., 2004). The exception was -463C>T, which is likely to be a rare population-specific SNP because it was detected in the heterozygous state in a single Indian individual. Therefore, we conclude that the promoter resequencing was quite effective in detecting common, high frequency sequence variants in the ABCA1 gene promoter.

5.3.3 DHPLC Analysis

For the reasons stated previously, in silico mining from ESTs is clearly inadequate for gathering SNPs for an association study, especially for a large gene like ABCA1. Therefore, an alternative approach for SNP identification was sought in this study.
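
The sample-size limitation noted above for the promoter survey and the DHPLC screen can be made concrete with a short calculation. The sketch below assumes random sampling and Hardy-Weinberg proportions, neither of which was tested in this chapter, and the function names are hypothetical. It gives the probability that a variant of a given allele frequency is present at least once among the 32 chromosomes of 16 individuals, and the probability that at least one of those individuals is heterozygous, which is what a heteroduplex-based screen requires.

```python
def hardy_weinberg(freq: float):
    """Return (common homozygote, heterozygote, rare homozygote) frequencies
    for a variant allele frequency `freq`, assuming Hardy-Weinberg proportions."""
    return (1.0 - freq) ** 2, 2.0 * freq * (1.0 - freq), freq**2


def prob_allele_sampled(freq: float, n_individuals: int = 16) -> float:
    """P(variant allele occurs at least once among 2 * n_individuals chromosomes)."""
    return 1.0 - (1.0 - freq) ** (2 * n_individuals)


def prob_heterozygote_sampled(freq: float, n_individuals: int = 16) -> float:
    """P(at least one of n_individuals is heterozygous for the variant)."""
    _, het, _ = hardy_weinberg(freq)
    return 1.0 - (1.0 - het) ** n_individuals
```

Under these assumptions, a variant with a 5% allele frequency would be sampled in roughly 80% of such panels, whereas a 1% variant would be sampled in fewer than 30%. The surveys are therefore expected to capture mainly common variants.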
To enrich for coding variants, we screened the ABCA1 exons using DHPLC. Given the sheer size of the ABCA1 gene, which comprises 50 exons, DHPLC is particularly attractive as a rapid screening tool for cSNP discovery compared with a resequencing approach. We identified five missense ABCA1 cSNPs (R219K, V771M, V825I, M883I and R1587K), one silent cSNP (T1427), two 5’UTR SNPs (237indelG and 296C>G), as well as 16 intronic variants from the DHPLC survey. All the cSNPs and 5’UTR SNPs have been independently confirmed by multiple groups (Pullinger et al., 2000; Wang et al., 2000; Clee et al., 2001; Iida et al., 2001; Zwarts et al., 2002). Database checks showed that 15 out of 16 intronic variants were also independently reported by multiple groups.

The sensitivity, specificity and accuracy of DHPLC have been reported to exceed 95% (Xiao and Oefner, 2001). In general, the efficiency of a DHPLC screen can be assessed empirically by a parallel analysis of the fragments using sequencing, or by analysis of fragments harbouring known variants in a blinded manner. Here, in order to address whether the DHPLC SNP survey had adequately captured the common genetic variation in the ABCA1 gene, we attempted to approximate the SNP detection rate by taking advantage of the plethora of information in dbSNP. Of the 24 variants identified from the DHPLC screen, 23 have multiple dbSNP submissions (excluding submissions from the current study), suggesting a confirmation rate of at least ~95% (23/24). The remaining variant identified from our experimental survey, IVS46-48G>T, is likely to be a rare variant since it was found in the heterozygous state in one Chinese individual out of 48 DNA samples screened. On the other hand, the sensitivity of the DHPLC screen is harder to evaluate.
We attempted to estimate the false negative rate of our DHPLC screen from dbSNP information, and predicted that the number of potential false negatives might range from as few as five, if only those SNPs with multinational or Asian frequencies and with multiple dbSNP submissions were considered, to as many as 34 if all dbSNP entries, including those with single submissions, were considered. These estimates are broad because of possible differences in the population samples used for SNP discovery and inadequate allele frequency information in dbSNP. Nevertheless, we expect the percentage of variants missed by the DHPLC survey to be low because even SNPs reported by multiple groups do not necessarily have a 100% confirmation rate (Carlson et al., 2003). We also took steps to limit the number of false negatives in the DHPLC screen by verifying the suggested oven temperatures semi-empirically, although there remains a slim possibility of missed variants given that control fragments with known variants were not available. In practice, most variants can be detected over a fairly broad temperature window (Jones et al., 1999; Spiegelman et al., 2000), so false negatives due to the use of suboptimal oven temperatures should be limited.

One potential limitation of our DHPLC protocol is that it did not account for the possibility that some variants occur only in the homozygous state. The DHPLC profiles of such homozygous samples are difficult to distinguish from those of the common homozygotes. Such rare homozygous variants can be readily identified by mixing each sample with a wild-type homozygous reference DNA. In practice, however, most rare variants are more likely to exist in the heterozygous state. The true false negative rate of the DHPLC survey can be obtained by genotyping all putative 43 false negatives identified from the dbSNP analysis.
Alternatively, both the true false positive and false negative rates can be empirically assessed by parallel analysis of all fragments using both DHPLC and sequencing.

Two false positives were identified from our DHPLC screen. Both cases could be explained either by the presence of tracts of identical bases resulting in slippage (Spiegelman et al., 2000), or by PCR-induced error. The latter can be readily recognized by simply repeating the DHPLC analysis on a freshly amplified fragment, because PCR errors would be random and would not produce identical DHPLC elution profiles. Another way to reduce PCR artifacts is to adopt a more stringent PCR protocol using a proof-reading DNA polymerase. For examining sequence variation in fragments harbouring polymer tracts, bi-directional sequencing would be a more viable option. Alternatively, another pair of primers that excludes the polymer tract could be used.

At the time of writing, we became aware of a publication by Fasano et al. (2005), who evaluated the reliability of DHPLC for rapid detection of ABCA1 mutations. The authors reported a sensitivity of 97%, detecting 29 of the 30 variants previously identified by sequencing familial hypoalphalipoproteinemia patients and controls. The single missed variant was located in the exon 40 fragment, which was similarly missed in our screen because it contained a string of T bases that interfered with the interpretation of the elution profile. By designing a pair of primers that excluded the polymer tract, Fasano and co-workers (2005) successfully identified all 30 variants, thus reaffirming the reported high sensitivity of the DHPLC method. Contrary to our experience, Fasano et al. (2005) reported that the majority of the fragments required two or more temperatures for DHPLC analysis. Analysis at multiple DHPLC temperatures is not only time-consuming but also adds to the cost of reagents.
Their somewhat unexpected observation also contradicts findings that sequence variants remain detectable as heteroduplexes over a fairly broad temperature range (Jones et al., 1999; Spiegelman et al., 2000), as well as the high correlation between the temperatures predicted by the WAVEMAKER software and those determined empirically (Rudolph et al., 2000). The discrepancy is probably accounted for by the fact that the Fasano study performed DHPLC analysis with primers originally designed for sequencing. When we designed the DHPLC primers, we observed that the complexity of a fragment (i.e. the presence of multiple melting domains), as predicted by the melting algorithms, could be reduced by selecting an alternative pair of primers. Therefore, a simple in silico examination of the melting profiles of the PCR fragments before primer synthesis can often remove the need to subject the same PCR fragments to multiple DHPLC injections.

5.3.4 Predicting Functional Effects of ABCA1 SNPs

A goal of large-scale SNP screening efforts is to identify genetic variants that affect disease susceptibility and drug response, either directly or indirectly through linkage disequilibrium (LD) with the functional variant. A special challenge is to determine which variants affect function and thereby contribute to altered phenotype. The biological significance of the SNPs discovered in this survey was largely inferred by determining whether they reside in known or putative cis regulatory elements or in evolutionarily conserved regions. The assumption is that conserved sequences are likely to play a critical role in gene regulation or function, and thus substitutions there may be less well tolerated.

None of the seven promoter SNPs affects known transcription factor binding sites in the ABCA1 proximal promoter such as the GnT/Znf202 (Porsch-Ozcurumez et al., 2001), E-box (Yang et al., 2002) and DR4 elements (Costet et al., 2000; Schwartz et al., 2000).
One promoter SNP -407C>G lies in an evolutionarily conserved segment containing a putative HNF4α or c-REL recognition site (Figures 5.7 and 5.8). HNF4α is an abundant, constitutively acting positive transcription factor that is highly expressed in liver, kidney, small intestine and colon, and is known to regulate genes involved in glucose and lipoprotein metabolism, including the apolipoprotein genes ApoAI, ApoAIII, ApoAIV, ApoCII and ApoCIII (Sladek and Seidel, 2001). Coding mutations in HNF4α cause maturity-onset diabetes of the young (MODY1; Yamagata et al., 1996) and, more rarely, familial monogenic late-onset Type 2 diabetes (Hani et al., 1998). HNF4α itself is also a downstream target of the TGF-β/SMAD pathway. Interestingly, TGFβ1 increases ABCA1 mRNA levels and cholesterol efflux (Argmann et al., 2001), but it remains to be seen whether this stimulatory effect is mediated via HNF4α. c-REL is a member of the NF-κB family of nuclear transcription factors with a central role in mediating pro-inflammatory signals including oxLDL (Robbesyn et al., 2003). Functional studies such as those involving reporter gene, gel shift and real-time PCR assays will address whether the -407C>T SNP exerts an allele-specific effect on ABCA1 gene transcription, as well as the involvement of HNF4α and c-REL in the transcriptional regulation of the ABCA1 gene. Another promoter SNP that may potentially have an effect is -99C>G, which appears to be in an extremely conserved position (Figure 5.8), although this might be explained by the constraint that it resides in a minimal fragment critical for oxysterol and basal activation of gene transcription (Costet et al., 2000; Santamarina-Fojo et al., 2000). -99C>G does not directly impact any known or putative transcription factor binding site. Evidence suggests that the absence of a cis regulatory element or conserved segment does not immediately eliminate the possibility of a SNP effect.
For instance, the -564T>C promoter variant lacks a putative or known consensus transcription factor binding site (Figure 5.7) and is also not evolutionarily conserved (Figure 5.8). Yet the -564T variant has been shown to be associated with severe atherosclerosis (Lutucuta et al., 2002) and low ApoAI levels (Tregouet et al., 2004). Although the association might be attributed to a closely linked causal allele, recent in vitro and in vivo assays in macrophages have demonstrated differences in the transcriptional activity, nuclear protein binding and mRNA levels between the two alleles of -564T>C (Kyriakou et al., 2004). -278G>C is another promoter SNP that lies in a region devoid of putative or known transcription factor binding sites (Figure 5.7) and of sequence conservation (Figure 5.8), but it has been associated with HDL-C levels in the Japanese (Shioji et al., 2004). In this case, the effect of the -278G>C SNP could not be untangled from that of -564T>C, with which it is in perfect and complete LD (Shioji et al., 2004).

Similarly, comparative sequence analysis as well as the traditional classification schemes using the BLOSUM62 (Henikoff and Henikoff, 1992) and Grantham’s chemical difference (Grantham, 1974) matrices were used to rank the severity of the common ABCA1 missense cSNPs. Miller and Kumar (2001) previously demonstrated that the missense mutations causing human diseases tend to be over-represented at amino acid positions that are most conserved over long evolutionary periods, and that more radical amino acid replacements are observed in patients than among the variations found between species or in non-diseased humans. Among the common human missense SNPs in the ABCA1 gene, the amino acid residues 771 and 1587 are absolutely conserved even among species as diverse as the chicken (Figure 5.11). V771M and R1587K have been significantly associated with cardiovascular traits (Clee et al., 2001; Frikke-Schmidt et al., 2004; Tregouet et al., 2004; Yamakawa-Kobayashi et al., 2004).
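The matrix-based ranking described above can be sketched as a simple lookup. The snippet below is a minimal illustration, not the procedure actually used in this study: the pairwise values are hard-coded from the published Grantham (1974) and BLOSUM62 matrices for only the four residue pairs needed, and the tie-breaking rule (higher Grantham distance first, then lower BLOSUM62 score) is our assumption.

```python
# Grantham chemical distances (larger = more radical replacement).
GRANTHAM = {
    frozenset("RK"): 26,
    frozenset("VM"): 21,
    frozenset("VI"): 29,
    frozenset("MI"): 10,
}

# BLOSUM62 substitution scores (lower = less frequently tolerated).
BLOSUM62 = {
    frozenset("RK"): 2,
    frozenset("VM"): 1,
    frozenset("VI"): 3,
    frozenset("MI"): 1,
}

# Common ABCA1 missense cSNPs discussed in the text.
CSNPS = ["R219K", "V771M", "V825I", "M883I", "R1587K"]


def score(csnp: str) -> tuple[int, int]:
    """Return (Grantham distance, BLOSUM62 score) for a cSNP such as 'R219K'."""
    pair = frozenset((csnp[0], csnp[-1]))
    return GRANTHAM[pair], BLOSUM62[pair]


def rank_by_severity(csnps: list[str]) -> list[str]:
    """Sort from most to least radical: Grantham descending, BLOSUM62 ascending."""
    return sorted(csnps, key=lambda s: (-score(s)[0], score(s)[1]))


if __name__ == "__main__":
    for name in rank_by_severity(CSNPS):
        grantham, blosum = score(name)
        print(f"{name}: Grantham={grantham}, BLOSUM62={blosum}")
```

All five replacements have small Grantham distances and non-negative BLOSUM62 scores, so both schemes rate them as conservative changes. The matrices alone therefore give little power to separate these variants.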
But as is the case with the promoter SNPs, associations with cardiovascular traits have also been reported for apparently benign missense cSNPs. A well-known example is R219K, for which numerous independent studies have shown that the K219 allele has a protective association (Clee et al., 2001; Cenarro et al., 2003; Evans and Beil, 2003; Tregouet et al., 2004; Yamakawa-Kobayashi et al., 2004; Zhao et al., 2004). Here, it is unclear whether the effect is direct or indirect: although Clee et al. (2001) showed a corresponding enhancement of cholesterol efflux activity among younger K219 carriers, they also found that the SNP is in LD with other cSNPs. The latter disagrees with the finding of Tregouet et al. (2004), who reported a lack of LD among ABCA1 cSNPs. It may be speculated that the contrasting patterns of LD among ABCA1 cSNPs described in these two studies (Clee et al., 2001; Tregouet et al., 2004), despite a similar Caucasian population base, are due to the use of different metrics for describing pairwise LD, which have often been a source of confusion (Ardlie et al., 2002a). In the context of a genetic association study, r² rather than D’ is the most suitable measure of the strength of LD because it is directly related to the power to detect an association when a surrogate marker is used in place of the causal locus. Also, P values should not be used to describe LD. Clee et al. (2001) did not mention their method of LD assessment.

The difficulty of using sequence conservation to predict the biological effect of missense ABCA1 cSNPs is not surprising in light of a recent bioinformatic analysis by Thomas and Kejariwal (2004) showing that amino acid changes associated with complex diseases tend to be distributed outside of conserved regions and that their distribution is indistinguishable from that of the variation sampled from presumably healthy individuals.
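Returning to the choice of LD metric discussed above, the following sketch shows how D’ and r² can diverge for the same pair of SNPs. It uses the standard definitions of D, D’ and r² from haplotype and allele frequencies. The function name and the example frequencies are hypothetical and are not taken from any of the cited studies.

```python
def ld_metrics(p_ab: float, p_a: float, p_b: float) -> tuple[float, float, float]:
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and allele B (locus 2)
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    # D' normalises |D| by the largest value it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r2 = d**2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


if __name__ == "__main__":
    # Hypothetical case: a rare allele A (10%) found only on haplotypes carrying
    # the common allele B (50%). Only three of the four haplotypes exist.
    d, d_prime, r2 = ld_metrics(p_ab=0.10, p_a=0.10, p_b=0.50)
    print(f"D = {d:.3f}, D' = {d_prime:.2f}, r^2 = {r2:.3f}")
```

In this hypothetical case D’ indicates complete LD while r² is only about 0.11, so one study quoting D’ and another quoting r² could describe the same pair of SNPs as being in strong and weak LD respectively.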
The finding of Thomas and Kejariwal (2004) highlights the difficulty of deciphering the functional significance of a SNP through the identification of sequence motif conservation, as the SNP may exert subtler and milder effects through mechanisms such as transcription, alternative splicing and translation, rather than by affecting an amino acid residue critical for protein function. With the exception of V825I, all known common missense cSNPs reside outside the transmembrane domains (Figure 5.12). This observation concurs with an earlier survey showing that amino acid diversity in ABC transporters is significantly lower in the transmembrane domains, perhaps due to some functional constraint imposed by substrate specificity and translocation (Leabman et al., 2003). Additionally, no cSNPs are located in the conserved and functional NBDs of the ABCA1 gene, probably due to a selective constraint.

We also applied a similar approach to infer the potential biological significance of the ABCA1 UTR variants by searching for regulatory elements in the primary sequence, either through sequence conservation or through similarity searches in UTR databases. We noticed a lack of sequence conservation in the ABCA1 UTRs, especially the 3’UTR. The human ABCA1 3’UTR is unique, whereas cross-species sequence conservation was higher in the 5’UTR (Figure 5.10). This is consistent with the observation that the 5’UTRs of genes contain sequences responsible for translational initiation and tend to be subjected to greater evolutionary and functional constraints than 3’UTRs. 3’UTRs, on the other hand, govern diverse post-transcriptional processes such as mRNA stability, mRNA export and subcellular localization, and translational efficiency, and could therefore display greater heterogeneity in their sequences and lengths. In addition, regulatory control at the RNA level relies on a complex combination of primary and secondary structure elements assembled in a consensus structure recognized by RNA binding proteins.
Thus, the heterogeneous nature of UTRs in general suggests that the strategy of searching for conserved sequence elements may be less successful in predicting the effects of UTR variants, and mechanistic studies are required to assess their biological impact.

5.3.5 Caveats of SNP Survey

The allele frequencies and LD patterns of all 32 ABCA1 variants identified in this study were not ascertained, due in part to the constraints of the genotyping methods available in the laboratory and in part to the nature of the experimental design. The bulk of the variants (72.7% or 24 out of 33) were identified by DHPLC. DHPLC is a screening technique that relies on the detection of heterozygous variants, and it does not permit genotypes to be determined for every analyzed sample, since the elution patterns of the two homozygous variants are often indistinguishable from one another (unless each sample has been mixed with a known reference homozygous sample). Although genotypes for the promoter variants could in principle be obtained during resequencing, the small discovery sample of 16 individuals per ethnic group is unlikely to yield reliable estimates of allele and genotype frequencies. We were also unable to ascertain the frequencies of many variants (with the exception of those used in the association study) by genotyping larger samples because they are not amenable to restriction fragment length analysis. In addition, variation discovery restricted to the coding region can reduce the efficacy of an association study (Crawford et al., 2004). For an unbiased and comprehensive description of the extent of genetic diversity in the ABCA1 gene, a whole gene resequencing approach would be highly recommended. The complete allele frequency spectrum and LD pattern will also be extremely informative for guiding the selection of an optimal set of SNPs for application in subsequent association studies.
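The effect of a small discovery sample can be illustrated with a short calculation. The sketch below assumes Hardy-Weinberg proportions, independent sampling of individuals and a perfectly sensitive assay; these are simplifying assumptions of ours rather than properties of the present data, and the function names are hypothetical. It contrasts resequencing, where observing either allele copy suffices, with a heteroduplex screen such as DHPLC run on unmixed samples, where at least one heterozygote must be present.

```python
def p_detect_by_resequencing(maf: float, n_individuals: int) -> float:
    """Probability that the minor allele appears at least once among 2n chromosomes."""
    return 1 - (1 - maf) ** (2 * n_individuals)


def p_detect_heterozygote(maf: float, n_individuals: int) -> float:
    """Probability that at least one of n individuals is heterozygous (HWE assumed)."""
    het_freq = 2 * maf * (1 - maf)
    return 1 - (1 - het_freq) ** n_individuals


if __name__ == "__main__":
    n = 16  # discovery sample size per ethnic group quoted in the text
    for maf in (0.01, 0.05, 0.10, 0.20):
        print(
            f"MAF {maf:.2f}: resequencing {p_detect_by_resequencing(maf, n):.2f}, "
            f"heterozygote-based {p_detect_heterozygote(maf, n):.2f}"
        )
```

Under these assumptions a variant with a minor allele frequency of 5% would be seen in roughly four out of five such discovery panels, whereas a 1% variant would usually be missed. This is consistent with treating the survey as a discovery exercise rather than a source of frequency estimates.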
Although reliable allele and genotype frequencies can be readily achieved with relatively small numbers of samples (Kruglyak and Nickerson, 2001), the large size of the ABCA1 gene, at approximately 149 kb, still imposes a considerable sequencing effort on most ordinarily equipped laboratories.

5.3.6 Conclusions

In summary, a total of 32 ABCA1 genetic variants were identified using samples from Singapore Chinese, Malays and Indians. Although allele frequencies were not determined during the course of the SNP survey, many variants have had multiple independent dbSNP submissions and are thus expected to be common in the local populations. We also attempted to infer the biological significance of the ABCA1 variants, but in general, such computational analysis may be limited in facilitating the selection of putative functional genetic markers for a subsequent association study. The development of a powerful and informative set of ABCA1 SNPs for an association study will entail a more comprehensive assessment of the sequence diversity and LD patterns in the gene, preferably by a whole gene resequencing approach.

[...]

[Figure 5.9 DHPLC elution profiles of exonic ABCA1 SNPs. Panels recoverable from the extracted plot: I825V (60°C), M883I (58°C), T1427 (61°C) and R1587K (57°C); axes: Time (Minutes) against Absorbance (mV). The top and bottom traces in each plot ...]

[Plot residue from a further DHPLC figure, caption not recovered. Panels: IVS23+102A>G (62°C), IVS25+23G>A (59°C), IVS43+148G>A (56°C), IVS44+18C>T (55°C), IVS45-24T>C (59°C) and IVS46-48G>T (55°C); axes: Time (Minutes) against Absorbance (mV) ...]
... Hs.211562 from a later release (release 156, date accessed 28 Sep 2002)
mRNA: AJ012376, NM_005502, AF165281, AF285167
EST: AI802228, AA627178, AI807534, AI356194, AI628099, R01050, R01051, AA527406, AI359714, AA902925, AA493786, AI399824, AA731742, AI819656, R31961, AW051752, AA814091, AI695068, AA883989, AA625082, AI344681, AA292158, BM768930, BM769397, BM823180, BM830709, BM978608, AL698654, AL701341, BQ025022, BQ026286, BF928185, BM153383 ...

... cDNA source: Melanocyte
EST: AA434152, AA367573, AA357618, AA328447, AA302670, AA302777, C01846, BU198400, BQ940486, D79969, BE857175, BE880894, BE879545, BE878485, BE816862, BE715104, AV661400, AV656040, AV647223, BF094524, BE971402
cDNA source: Ovarian tumour, Placenta, Prostate
EST: BG149600, BF879888, BF892148, BF886004, BF951740, BF988872, AW019972, BF439764, BF433708, BF431704, AU156154, AW601575, BF855659, BF671104, AA737119, BF379205, BF348792, BF216316, AU135588, BF116114 ...

... time-consuming and costly. Therefore, for efficient SNP discovery in the ABCA1 exons, we used a screening technique, DHPLC, which is based on heteroduplex detection of sequence variants under partially denaturing conditions. Primers were designed to amplify most of the 50 exons individually, including the entire 5’UTR and the protein coding portion of exon 50. In addition, a fragment containing the newly ...

... ABCA1 mRNA reference sequence NM_005502.

Figure 5.5 Continued from previous page. (D) From left to right: 8673T>G, 8705T>G and 8720T>G with scores of 0.16, 0.99 and 0.99 respectively. (E) 9097G>A, 0.79. (F) 9410A>G, 0.99.

Figure 5.5 Continued from previous page. (G) 9696C>T, 0.02. (H) 10029G>A, 0.38.

Table 5.3 ...

... cDNA source: Ocular tissues, Brain, Cochlea, Cochlea, Uterus_tumour, Multiple sclerosis lesions, Squamous cell carcinoma
EST: BG573350, BG567595, BG567118, N46182, BG482804, BU198380, BF574391, AA826281, AI241822, AA748860, AA704305, AI707785
cDNA source: Placenta, Liver, Liver, Melanocyte, Lung, Muscle (skeletal), Germinal center B, Tumour (5 pooled), Germinal center B, Liver and Spleen, Aorta
EST: BE222116, BE177793, AW889550, AW845151, AW380897, AW372918 ...

... discovered in this survey, R219K, V771M and R1587K reside in extracellular domains, while V825I is located in the transmembrane region and M883I upstream of the first nucleotide binding domain (NBD1). A total of 16 intronic sequence variants over 14 different amplicons were also detected (Figure 5.13). The majority of the intronic variants are transitions (75%, or 12 out of 16) while 19% (3 out of 16) ...

... 601492503FL in Figure 5.5 correspond to ESTs BE880894, BE879545 and BE878485 respectively in Table 5.2), which originated from an undifferentiated large cell carcinoma (library information obtained from the IMAGE consortium, http://image.llnl.gov). None of these putative variants have been confirmed in dbSNP (build 123). It is plausible that they were acquired during propagation of the tissue in culture ...

Figure 5.5 Twelve putative SNPs predicted by SNPFINDER conducted on an expanded and later release of the ABCA1 UniGene cluster (build 156, release Sep 2002). Sequence variants are located in blue or green columns. (A) 8995A>G, score 0.99. (B) 8375C>T, score 0.96. (C) From left to right: 8517G>C, 8539C>T and 8570T>A with scores of 0.96, 0.96 and 0.54, respectively. Numbering ...

... scores of at least 0.96 were identified (Figure 5.5). These include 8995A>G which had been confirmed earlier. Singletons were recorded at positions 8375 (Figure 5.5B) and 8517 (Figure 5.5C).

... (SNPFINDER score 0.96) in the position corresponding to nucleotide 8995 in the 3’UTR of the ABCA1 mRNA (numbering in mRNA with respect to reference sequence NM_005502). As illustrated in Figure ...
