Understanding and Assembling 454 Genome & Transcriptome data


Assembly Training, May 2011
Stephen Bridgett

Aims
•  Why sequence transcriptomes?
•  How does 454 sequencing work?
•  What are ‘sff’ files?
•  Using sff tools
•  What is assembly?
•  Challenges to assembly
•  Newbler assembler and output files
•  Exercises with sample data

Why sequence transcriptomes?
•  Gives a more dynamic view of the activity in a cell than genome sequencing would, as it:
   •  Gives relative expression levels for different cells under different conditions.
   •  Could identify alternative splicing and fusion genes (important in several cancers).
•  Focuses on gene sequences, which are often the main research focus.

How does 454 sequencing work?
454 sequencer: DNA capture bead, emPCR, pyrosequencing reaction, signal image, base calling.

An animation of 454 sequencing
•  Animation of 454 sequencing from the Wellcome Trust website, to explain flowgrams:
   •  http://genome.wellcome.ac.uk/doc_WTX056439.html
   •  http://genome.wellcome.ac.uk/assets/WTX056030.swf
   •  454Animation_from_Wellcome_WTX056030.html
•  To help understand the output from the assembler alignment.

Data obtained from 454 sequencing
•  Roche 454 ‘Titanium’ genome reads are approx. 400 bases long.
•  Transcriptome reads tend to be a bit shorter, e.g. 350 bases.
•  Typically 700,000 reads from one sequencing plate.
•  Plates can be divided into 2, 4, 8 or 16 lanes.
•  Samples can have an MID (Multiplex Identifier) ‘barcode’ added, so several samples can be run together in the same lane.

What are ‘sff’ files?
•  ‘Sff’ files are Roche’s “Standard Flowgram Format” files, containing the sequence data produced from a 454 run.
•  The sff files contain:
   •  a manifest header at the start describing the contents,
   •  flow intensity signal values for each base in each read.
•  They are in binary format, so need to be converted to a text format, such as a fasta file (using the ‘sffinfo’ program).
•  The Sequence Read Archive (SRA at EBI or NCBI) requests that these .sff files be uploaded, to obtain an accession number for publications.

What is Assembly?
•  Merge the short reads into long contigs (ideally a full transcript), by finding the best sequence overlaps between reads.
•  E.g.: Roche’s Newbler assembler, MIRA assembler, TGICL assembler, Phrap, CAP3, MOSAIK reference-guided assembler, etc.
•  Figure: reads overlapped to form a contig, viewed in the gsAssembler graphical interface.
•  Newbler is an ‘overlap’ assembler. There are also de Bruijn graph assemblers, designed to cope with the very large numbers of short reads from Illumina or SOLiD, such as Velvet, CLC Assembly Cell, Cortex, SOAPdenovo and ABySS.

Challenges for assembly (1)
•  Contaminants in samples (e.g. from bacteria or human).
•  Ribosomal RNA (small and large sub-units).
•  PCR artifacts (e.g. chimeras and mutations).
•  Sequencing errors, such as “homopolymer” errors, which occur in runs of the same base (e.g. 3 or more).
•  MIDs (Multiplex Identifiers) and primers/adapters (e.g. SMART adapters used to synthesise cDNA) still in the raw reads.
•  Repeats and large or polyploid genomes: repeated sequences in the transcriptome make assembly more difficult.

Challenges for assembly (2)
•  Extra sample preparation steps in cDNA synthesis: more risk of cloning errors or contamination, and a wider range of read lengths.
•  Large expression level range (e.g. 10^5): some transcripts have low read coverage and some very high coverage.
•  Alternative splicing: differing reads from the same part of the genome.
•  Roche’s Newbler 2.3 assembler sometimes didn’t finish a transcriptome assembly (it seemed to get lost when “Detangling Alignments”), but the latest Newbler 2.5 beta is able to.

[...]...
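The idea of merging reads by their best overlaps can be illustrated with a toy example. The sketch below is not Newbler's algorithm (Newbler works with flow signals and tolerates sequencing errors); it is a minimal greedy merger that assumes error-free reads, all on the same strand, and exact suffix/prefix matches. The function names and the `min_overlap` parameter are made up for illustration.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`.

    Returns 0 if the longest exact overlap is shorter than `min_overlap`.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap.

    Stops when no remaining pair overlaps by at least `min_overlap` bases
    and returns the resulting list of contigs.
    """
    contigs = list(reads)
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        # Join the two sequences, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    toy_reads = ["ATGGCGT", "GCGTACC", "TACCTTA"]
    print(greedy_assemble(toy_reads))
```

With the three toy reads, each neighbouring pair shares four bases, so they collapse into a single contig. Real data breaks the assumptions quickly: homopolymer errors, repeats and alternative splicing (see the challenges listed above) all produce overlaps that are inexact or ambiguous.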
Genome Sequencer”)
•  Designed to assemble reads from the Roche 454 sequencer
•  Accepts:
   •  454 FLX Standard reads, and 454 Titanium reads
   •  single and paired-end reads
   •  Optionally can include Sanger reads
•  Initial versions focused on assembling genomic reads
•  Latest versions (2.3 and now 2.5.3) improve transcriptome assembly
•  Runs on Linux, and has 32-bit and 64-bit versions
•  Has Command-line...

(Rice)

Homopolymer error
A ?c TT - AAAAA ?a
•  Difference between a signal of 1 and a signal of 2 = 100%
•  Difference between a signal of 5 and a signal of 6 is 20%, so errors are more likely after e.g. AAAAA

Roche software
•  Roche have developed data-analysis software for processing, assembling and mapping the 454 reads:
   •  sffinfo - extract fasta, quality and flowgrams as text from sff files
   •  sfffile - join, split or trim sff...

... for Transcriptome projects (1)
In the Assembly subdirectory:
•  454Isotigs.fna: fasta file of all isotigs, and contigs which are not in an isotig
•  454Isotigs.qual: quality scores (Phred-based) for each base in the ‘454Isotigs.fna’ file (e.g. 20 = 1 in 100 probability of an incorrect base call; 50 = 1 in 100,000)
•  454Contigs.fna: fasta file of all contigs, which are used to create the isotigs
•  454Contigs.qual... quality scores for each base
•  454NewblerMetrics.txt: statistics of the assembly, e.g. number of reads and bases aligned, overlaps found, mean contig sizes
•  454ReadStatus.txt: status of each read in the assembly (Assembled, PartiallyAssembled, Singleton, TooShort, Outlier), and alignment 3' and 5' positions within the contig
•  454TrimStatus.txt: each read's original and revised trim-points used in the... values.)
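Two pieces of arithmetic from the slides above can be checked in a few lines. The sketch assumes the standard Phred definition, Q = -10 * log10(P), and an idealised flowgram in which the signal is exactly proportional to the homopolymer length; real 454 signals are noisy, which is the point of the slide. The function names are made up for illustration.

```python
def phred_to_error_probability(quality):
    """Probability that a base call is wrong, given its Phred quality score."""
    return 10 ** (-quality / 10)


def relative_signal_step(homopolymer_length):
    """Relative signal increase when a homopolymer grows by one base.

    Assumes an idealised signal proportional to the number of identical
    bases, so going from n to n + 1 bases raises the signal by 1 / n.
    """
    if homopolymer_length < 1:
        raise ValueError("homopolymer_length must be at least 1")
    return 1 / homopolymer_length


if __name__ == "__main__":
    for quality in (20, 50):
        print(quality, phred_to_error_probability(quality))
    for length in (1, 5):
        print(length, f"{relative_signal_step(length):.0%}")
```

The shrinking step size is why the base caller can more easily confuse 5 and 6 identical bases than 1 and 2: the two signals are only 20% apart instead of 100%.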
Command-line interface
•  The simplest command to run Newbler is:
   runAssembly [options] reads.sff
•  This creates the assembly in an output directory called:
   P_yyyy_mm_dd_hh_min_sec_runAssembly
   where P_ = Project, followed by the date and time
•  There are a large number of optional parameters available for controlling and refining the assembly

Common command-line options
•  -cdna  for transcriptome...

Type: gsAssembler &
(The & just means you can still use the command console, as it runs the assembler in the ‘background’)
•  Set the project name, directory, and Genomic or cDNA option
•  On the Project tab, select the directory containing the sff files, then uncheck any unwanted sff files
•  Set parameters for the project, such as MINT adapters to trim, a ribosomal RNA (rRNA) fasta file to screen out, and other assembler and output options...

... (Newbler) - to assemble reads into contigs/isotigs
•  gsMapper - to map reads to a transcriptome or genome reference
•  gsAmplicon - to analyse variants in amplicons
•  (These run on 32-bit and 64-bit Linux. There is information on the wiki about obtaining and installing these.)

Exercise 1A - sff files
Aims:
•  Using ‘sffinfo’ and ‘sfffile’
•  Summarise the read statistics
•  Blast the reads for contaminants...

... other assembler and output options
•  Click the “Start” button at the right, and watch the output at the bottom
•  When the assembly has finished, view it using the Results, Alignment and Flowgrams tabs

Experiment 208: Using the GUI
•  Run: gsAssembler
•  The graphical interface should appear
•  Can use: /dataset2 (or other dataset)
•  Choose options and run the assembly
•  Look at the resulting assembly in the viewing tab...
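When running many assemblies from a script, it can help to build the runAssembly command and predict the default project directory name in code. The sketch below uses only what the command-line slide states: the `runAssembly [options] reads.sff` form, the `-cdna` option, and the `P_yyyy_mm_dd_hh_min_sec_runAssembly` naming pattern. The helper names are hypothetical, and the zero-padding of the timestamp fields is an assumption. Nothing here actually launches the assembler.

```python
from datetime import datetime


def build_run_assembly_command(sff_files, cdna=False):
    """Return the argument list for a basic runAssembly call.

    Only the -cdna option named on the slide is supported here; any other
    option should be checked against the Newbler documentation first.
    """
    command = ["runAssembly"]
    if cdna:
        command.append("-cdna")
    command.extend(sff_files)
    return command


def default_project_dir(started):
    """Directory name following the P_yyyy_mm_dd_hh_min_sec_runAssembly pattern.

    Assumption: every date and time field is zero-padded.
    """
    return started.strftime("P_%Y_%m_%d_%H_%M_%S_runAssembly")


if __name__ == "__main__":
    print(" ".join(build_run_assembly_command(["reads.sff"], cdna=True)))
    print(default_project_dir(datetime(2011, 5, 17, 9, 30, 5)))
```

The returned list could be passed to `subprocess.run` on a machine where the Roche software is installed.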
using quality and flow signal information at each base in the multiple alignments
•  Output the contig consensus sequences, quality scores, alignment and metric files
•  You will see messages about these steps as the assembly progresses

If paired-end data is available, the assembler performs these extra steps:
•  Organize contigs into scaffolds, using paired-end information to order the contigs and to approximate...

454TrimStatus.txt: each read's original and revised trim-points used in the assembly

Output files (2)
•  454AlignmentInfo.tsv: base consensus and quality, read-depth and flow-signal, at each position in each contig
•  Can easily be parsed by a Perl script to obtain e.g. the average coverage depth for each contig and isotig
•  E.g.:
   Position  Consensus  Quality Score
   >contig00008
   1         G          64
   2         A          64
   3         T          64
   4         T          64
   5         G
   etc
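The slide suggests a Perl script for pulling the average coverage depth out of 454AlignmentInfo.tsv; the same idea is sketched here in Python instead. The layout is an assumption based on the excerpt above: tab-separated columns, a header row, a line starting with `>` to open each contig, and one row per position. The position of the depth column (`depth_column`, zero-based) is a guess and should be checked against the header of a real file. The sample data is invented.

```python
import io
from collections import defaultdict


def mean_depth_per_contig(lines, depth_column=3):
    """Average read depth for each contig in an AlignmentInfo-style table.

    `lines` is any iterable of text lines, such as an open file.
    `depth_column` is the zero-based index of the depth column (assumed).
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    current = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # blank line
        if fields[0].startswith(">"):
            current = fields[0][1:]  # new contig, drop the leading '>'
            continue
        if current is None or not fields[0].isdigit():
            continue  # header row or anything before the first contig
        if len(fields) <= depth_column:
            continue  # row too short to contain a depth value
        totals[current] += int(fields[depth_column])
        counts[current] += 1
    return {name: totals[name] / counts[name] for name in counts}


# Invented sample in the assumed layout.
SAMPLE = (
    "Position\tConsensus\tQuality Score\tDepth\n"
    ">contig00008\n"
    "1\tG\t64\t4\n"
    "2\tA\t64\t6\n"
    "3\tT\t64\t5\n"
    ">contig00009\n"
    "1\tC\t64\t2\n"
    "2\tC\t64\t4\n"
)

if __name__ == "__main__":
    print(mean_depth_per_contig(io.StringIO(SAMPLE)))
```

For a real assembly, replace the `io.StringIO(SAMPLE)` argument with `open("454AlignmentInfo.tsv")` from the project's Assembly subdirectory.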

Posted: 13/03/2014, 18:45
