Gene Pool Transcriptome Workshop, Nov 2010

Understanding and Assembling 454 Transcriptome Sequences
Transcriptome Workshop Nov 2010
Stephen Bridgett

Aims
• Why sequence transcriptomes?
• How does 454 sequencing work?
• What are ‘sff’ files?
• Using sff tools
• What is assembly?
• Challenges to assembly
• Newbler assembler and output files
• Exercises with sample data

Why sequence transcriptomes?
• Gives a more dynamic view of the activity in a cell than genome sequencing would, as it:
• Gives relative expression levels for different cells under different conditions
• Can identify alternative splicing and fusion genes (important in several cancers)
• Focuses on gene sequences, which are often the main research focus

How does 454 sequencing work?
454 sequencer: DNA capture bead, emPCR, pyrosequencing reaction, signal image, base calling

An animation of 454 sequencing
• Animation of 454 sequencing from the Wellcome Trust website, to explain flowgrams:
• http://genome.wellcome.ac.uk/doc_WTX056439.html
• http://genome.wellcome.ac.uk/assets/WTX056030.swf
• File: 454Animation_from_Wellcome_WTX056030.swf
• To help understand the alignment output from the assembler

Data obtained from 454 sequencing
• Roche 454 ‘Titanium’ genome reads are approx. 400 bases long
• Transcriptome reads tend to be a bit shorter, e.g. 350 bases
• Typically 700,000 reads from one sequencing plate
• Plates can be divided into 2, 4, or 16 lanes
• Samples can have an MID (multiplex identifier) ‘barcode’ added, so several samples can be run together in the same lane

What are ‘sff’ files?
• ‘Sff’ files are Roche’s “Standard Flowgram Format” files, containing the sequence data produced from a 454 run
• The sff files contain:
  • a manifest header at the start describing the contents,
  • flow intensity signal values for each base in each read
• They are in binary format, so they need to be converted to a text format, such as a fasta file (using the ‘sffinfo’ program)
• The Sequence Read Archive (SRA, at EBI or NCBI) requests that these sff files be uploaded, to obtain accession numbers for publications

What is Assembly?
• Merge the short reads into long contigs (ideally a full transcript), by finding the best sequence overlaps between reads
• E.g. Roche’s Newbler assembler, MIRA assembler, TGICL assembler, Phrap, CAP3, MOSAIK reference-guided assembler, etc.
• Newbler is an ‘overlap’ assembler
• Figure: reads overlapped to form a contig, viewed in the gsAssembler graphical interface
• There are also de Bruijn graph assemblers, designed to cope with the very large numbers of short reads from Illumina or SOLiD, such as Velvet, CLC Assembly Cell, Cortex, SOAPdenovo, ABySS

Challenges for assembly (1)
• Contaminants in samples (e.g. from bacteria or human)
• Ribosomal RNA (small and large subunits)
• PCR artifacts (e.g. chimeras and mutations)
• Sequencing errors, such as “homopolymer” errors, e.g. in a run of 3 or more of the same base
• MIDs (multiplex identifiers) and primers/adapters (e.g. SMART adapters used to synthesise cDNA) still in the raw reads
• Repeats and large or polyploid genomes: repeated sequences in the transcriptome make assembly more difficult

Challenges for assembly (2)
• Extra sample preparation steps in cDNA synthesis: more risk of cloning errors or contamination, and a wider range of read lengths
• Large expression level range (e.g. 10^5): some transcripts have low read coverage and some very high coverage
• Alternative splicing: differing reads from the same part of the genome
• Roche’s Newbler 2.3 assembler sometimes didn’t finish transcriptome assembly (it seemed to get lost when “Detangling Alignments”), but the latest Newbler 2.5 beta is able to finish

Viewing Assemblies
• In addition to the alignment viewer in gsAssembler, there are several other viewers for the ace alignment files:
• Hawkeye = useful, but a bit tricky to compile initially: http://sourceforge.net/apps/mediawiki/amos/index.php?title=Hawkeye
• Tablet = fast, nice display and easy to use. From SCRI: http://bioinf.scri.ac.uk/tablet/
• EagleView = a limited basic viewer, from MarthLab
• Gambit = newer viewer from MarthLab: http://bioinformatics.bc.edu/marthlab/
• Magic Viewer = views sam and bam files, not ace files. From http://bioinformatics.zj.cn/magicviewer/

Videos about 454 sequencing
• Pyrosequencing: http://www.youtube.com/watch?v=kYAGFrbGl6E
• Genome Sequencer FLX System Workflow: http://www.youtube.com/watch?v=bFNjxKHP8Jc

Exercise 1: Look into an sff file
• ‘sffinfo’ is a command-line program that is part of the Roche Data_Analysis package
• To view the binary sff file as text, run:
    cd ~/data/Axolotl
    sffinfo Axolotl_reads.sff | less
  (Piping to less allows you to scroll easily.) Type ‘q’ to quit less.

Exercise 2: Extract reads from an sff file
• Use the file: Axolotl_reads.sff
    cd ~/data/Axolotl
• Extract reads from the sff file into a fasta file:
    sffinfo -seq Axolotl_reads.sff > Axolotl.fna
    head Axolotl.fna
• Extract the quality information from the sff file:
    sffinfo -qual Axolotl_reads.sff > Axolotl.qual
    head Axolotl.qual
• Count the number of reads (the quotes are important, as an unquoted > would be treated by the shell as output redirection):
    grep -c ">" Axolotl.fna

Exercise 3: Assembly
• dataset 1: 454 Titanium reads for Mb genome
    ~/training/data/454/dataset_1/set1_reads.sff
• Get metrics for the raw reads:
    sffinfo -seq reads.sff > reads.fasta
    my_process_contigs.pl -i reads.fasta -o process_reads -t read
  (Although these are reads rather than contigs, the same script can still be used.)
    more process_reads/contig_stats.txt
• Estimate the average read depth (genome 6 Mb)

Exercise 4: Assembly
• Assembly command:
    runAssembly -o assembly1 reads.sff
• Where reads are: ~/assembly_workshop/data/454/dataset_1/set1_reads.sff
• Look in the assembly1 subdirectory, and see what you think the files contain

More options
• -a num = minimum contig length for 454AllContigs (default 100)
• -l num = minimum contig length for 454LargeContigs (default 500)
• -large = for large or complex genomes; speeds up assembly, but reduces accuracy. Not with the -cdna option
• -m = keep sequence data in memory to speed up assembly, but needs sufficient RAM
• -cpu num = number of CPUs to use (the manual says default=all, but this is wrong), to speed up the computing alignments and generating output steps
• -minlen num = minimum length of reads to use in assembly (default=20, can be 15 to 45)
• -rip = output each read in only one contig

Even more options
• -notrim = disable default quality & primer trimming of input reads
• -p filename = specify that the input file contains paired-end reads
• -ud = treat each read separately, not grouping duplicates
• -ss = set seed step parameter (default=12, can be or more)
• -sl = set seed length parameter (default=16, can be to 16)
• -sc = set seed count parameter (default=1, can be or more)
• -ais = set alignment identity score (default=2, can be or more)
• -ads = set alignment difference score (default=-3, can be or less)
• -ml = set minimum overlap length (default=40, allowed or more)
• -mi = set minimum overlap identity (default=90, allowed to 100)
• -nobig = skip output of large files (.ace, 454AlignmentInfo.tsv)
• -consed = creates subdirectory, ace, phd files, sff_dir for consed
(See page 21 or 77 of manual Part C for more details.)
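For the read-depth estimate asked for in Exercise 3, one rough approach is: average depth ≈ total bases sequenced / genome size. For instance, a full plate of about 700,000 reads of about 400 bases over a 6 Mb genome would give roughly 280,000,000 / 6,000,000 ≈ 47-fold depth. The sketch below does the same sum directly from a fasta file using only awk. The toy input file and its three reads are made up so that the example runs anywhere; with the workshop data you would point it at the reads.fasta produced by ‘sffinfo -seq’, and the 6 Mb genome size is taken from the exercise text.

```shell
# Hypothetical toy input so the example runs anywhere; in the workshop,
# use the reads.fasta produced by 'sffinfo -seq' instead.
fasta=$(mktemp)
printf '>read1\nACGTACGTAC\n>read2\nGGGTTTAAAC\nCCGGA\n>read3\nTTTTT\n' > "$fasta"

genome_size=6000000   # assumed genome size in bases (6 Mb, as in Exercise 3)

# Count header lines (reads), sum the lengths of all sequence lines,
# then divide the total bases by the genome size to get average depth.
read_stats=$(awk -v g="$genome_size" '
    /^>/ { n++; next }            # header line: one more read
         { bases += length($0) }  # sequence line: add its length
    END  { printf "%d %d %.6f", n, bases, bases / g }
' "$fasta")

echo "reads total_bases average_depth: $read_stats"
rm -f "$fasta"
```

Sequences wrapped over several lines are handled because every non-header line is added to the running total. The result is only an estimate, since it ignores trimming and any reads the assembler later discards.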
Extra challenges of transcriptome assembly
• Poly-A and poly-T tails (added after gene transcription)
    …….ATGCTAAAAAAAAAAAAAAA-3’

Exercise 5: Using options
• Use some of these options that you think may improve assembly:
    runAssembly [your options] -o assembly2 reads.sff
• Change into subdir assembly2
• Look through some of the output files, e.g.:
    less filename (or use a text editor)
    my_process_contigs.pl -i 454AllContigs.fna -o stats -t contig
• The assembler manual is available on the web links page so you can try different options for ‘runAssembly’

Common options (again)
• -o output_directory = to set the name of the output directory (overwrites an existing directory without warning!)
• -vt trimmingFile.fasta = to trim primers, adapters or polyA tails from the start or end of reads
• -vs screeningFile.fasta = to remove reads that closely match a cloning vector such as E. coli
• (-vs and -vt will also match the reverse-complements of the given sequences.)

Exercise 6: Transcriptome assembly
• Using: ~/assembly_workshop/data/454/dataset_2/
• Enter the following on one line:
    runAssembly -o assembly3 -vt MINTadapters.fna -vs rRNA.fna -cdna reads.sff
  (groups at front half only)
• Optional settings that can be given after -cdna and before reads.sff:
  • -ig NUM (max contigs in an isogroup, default 500 contigs)
  • -it NUM (max number of isotigs in an isogroup, default 100)
  • -icc NUM (max contigs in one isotig, default 100 contigs)
  • -icl NUM (isotig contig length threshold, default bp)

Incremental assembly
There are also alternative command-line commands (instead of ‘runAssembly’) that can perform incremental assembly, adding or removing runs to an existing project over time:
• newAssembly
• addRun
• removeRun
• runProject

Transcriptome options
Newbler collects contigs into “Isogroups”, then creates “Isotigs”
New options for transcriptomes:
• -cdna = for transcriptome (cDNA assembly)
• -ig = max contigs in an isogroup (default 500 contigs)
• -it = max number of isotigs in an isogroup (default 100)
• -icc = maximum number of contigs in one isotig (default 100 contigs)
• -icl = isotig contig length threshold, below which traversal stops (default base pairs)
• Pages 142 to 146 of Part C of the Roche Assembly manual give a good table of all the options

Output files (cont 3)
• 454ContigGraph.txt = describes the branching structure between contigs
• Has sections:
• (1) Graph node information (the contigs):
    ContigNum ContigName Length Average_depth
    contig00001 452 42.6
    contig00002 603 253.9
    etc
• (2) Graph edges (C=contig edge; or S=scaffold edge for paired end reads, also S in -cdna graphs):
    Edge FromContigNum FromEnd ToContigNum ToEnd AlignmentReadDepth
    C 5’ 2639 5’ 24
    C 5’ 5’ 36
    etc
    S 1558 2560:+;2802:-;2872:-;2575:-;2783:-;2614:
    S 671 2560:+;1327:
    etc
• (3) More graph information
    I t 24:2639-5' 1284-3'
    I atcgattgaaatcaatggagaaagatacTATAGAAAGTTAATAAAaGTATCTGTAGAGCCGACAGTTG
    etc
    F 2751/188/0.0;2931/8/0.0;2957/36/0.0;1242/226/6.0
    F 2639/24/0.0 1284/24/0.0
    ...

• Blast the reads for contaminants
• The exercises are on the wiki: http://taw2010wiki

What is “Newbler”?
• Roche’s “GS De Novo Assembler” (where “GS” = “Genome Sequencer”)
• Designed to assemble... that represent the assembled reads
• Resolve branching structures between contigs, to generate isotigs
• Generate consensus basecalls for the contigs using quality and flow signal information
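As a sketch of how the node section of 454ContigGraph.txt could be explored from the command line, the snippet below lists contigs sorted by average depth. It assumes that node lines are tab-separated with the four columns named in the slides (ContigNum, ContigName, Length, Average_depth) and that contig names start with ‘contig’. The miniature file, its contig numbers, the third contig and the edge line are all invented for illustration, so check the layout against your own 454ContigGraph.txt before relying on it.

```shell
# Hypothetical miniature 454ContigGraph.txt (tab-separated); the contig
# numbers, contig00003 and the edge line are invented for illustration.
graph=$(mktemp)
printf "1\tcontig00001\t452\t42.6\n2\tcontig00002\t603\t253.9\n3\tcontig00003\t120\t7.5\nC\t1\t5'\t2\t5'\t24\n" > "$graph"

# Keep only node lines (second field is a contig name), print
# name, length and average depth, then sort by depth, highest first.
top_contigs=$(awk -F'\t' '$2 ~ /^contig/ { print $2, $3, $4 }' "$graph" | sort -k3,3nr)

echo "$top_contigs"
rm -f "$graph"
```

Edge lines (starting with C or S) are skipped because their second field is not a contig name. Contigs with unusually high depth can be worth a closer look, since the challenges listed earlier include rRNA and very highly expressed transcripts.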
