Lecture on Bioinformatics. Fundamentals of Bioinformatics (in English), Chapter 2: Biological Databases. In this lecture, we look at bioinformatics databases that are widely used around the world, such as NCBI.
Page 2
NCBI (National Center for Biotechnology Information): advances science and health by providing access to biomedical and genomic information.
2008: GenBank 25th anniversary, 2009: 20th anniversary
EMBL Nucleotide Database
DDBJ (DNA Data Bank of Japan)
Page 4
GenBank is the NIH (National Institutes of Health) genetic sequence database, an annotated collection of all publicly available DNA sequences.
As of April 2011, there were approximately 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions and 191,401,393,188 bases in 62,715,288 sequence records in the WGS division.
GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI.
Page 5
Several ways to search and retrieve data from GenBank:
◦ Search GenBank for sequence identifiers and annotations with Entrez Nucleotide.
◦ Search and align GenBank sequences to a query sequence using BLAST (Basic Local Alignment Search Tool).
◦ Search, link, and download sequences programmatically using NCBI E-utilities.
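As a minimal sketch of programmatic retrieval, the snippet below only builds an EFetch request URL for the E-utilities service; it does not send the request. The helper name `efetch_url` is hypothetical, and the accession `U49845` is used purely as an example identifier.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nucleotide", rettype="fasta"):
    """Build an EFetch URL that asks for one record as plain text.

    efetch_url is a hypothetical helper name, not part of any NCBI library.
    """
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # "U49845" is only an example accession number.
    print(efetch_url("U49845"))
    # rettype="gb" requests the GenBank flat-file view instead of FASTA.
    print(efetch_url("U49845", rettype="gb"))
```

The resulting URL can be opened in a browser or passed to any HTTP client; keeping URL construction separate from the download makes the request easy to inspect before anything is sent.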
Page 6
◦ Provide and encourage access within the scientific community to the most up-to-date and comprehensive DNA sequence information.
◦ No restrictions on the use or distribution of the GenBank data.
◦ However, some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted.
Page 7
GenBank Submissions Handbook
User services group: info@ncbi.nlm.nih.gov
The following data is not accepted by GenBank:
Page 8
◦ BankIt, a web-based submission tool with wizards to guide the submission process.
◦ Sequin, NCBI's stand-alone submission tool with wizards to guide the submission process, is available by FTP for use on Mac, PC, and UNIX platforms.
◦ tbl2asn, a command-line program, automates the creation of sequence records for submission to GenBank using many of the same functions as Sequin. It is used primarily for submission of complete genomes and large batches of sequences and is available by FTP for use on Mac, PC, and UNIX platforms.
◦ Barcode Submission Tool, a web-based tool for the submission of sequences and trace read data for Barcode of Life projects based on the COI gene.
Page 9
My NCBI: a tool that retains user information and database preferences to provide customized services for many NCBI databases.
It allows you to:
◦ save searches,
◦ select display formats and filtering options,
◦ set up automatic searches whose results are sent by e-mail,
◦ save citations (journal articles, books, meetings, patents, and presentations) in My Bibliography,
◦ manage peer-reviewed article compliance with the NIH Public Access Policy,
◦ set up preferences for displaying and filtering search results, highlighting search terms, and setting LinkOut, Document Delivery Service, and Outside Tool preferences.
Page 10
UniProt (Universal Protein Resource): http://www.expasy.uniprot.org (includes SWISS-PROT, TrEMBL, PIR)
Protein database (NCBI): http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
Page 11
Protein Data Bank (PDB): http://www.rcsb.org/pdb/
Molecular Modeling DataBase (NCBI): http://www.ncbi.nlm.nih.gov/Structure/MMDB
Page 12
Whole genomes (NCBI)
Page 13
CNSH K35TT
Page 15
2. Start from an unknown sequence and try to find out what it might be: to what in the database is it similar? Use BLAST.
What does BLAST do?
◦ Search a large target set of sequences for hits to a query sequence
◦ and return the alignments and scores from those hits.
◦ Do it fast.
Aim: search databases for sequences that resemble your sequence, and show those sequences that deserve a second look. BLAST programs were designed for fast database searching, with minimal sacrifice of sensitivity to distantly related sequences.
Page 16
◦ nucleotide blast: Search a nucleotide database using a nucleotide query.
◦ protein blast: Search a protein database using a protein query. Algorithms: blastp, psi-blast, phi-blast.
◦ blastx: Search a protein database using a translated nucleotide query.
◦ tblastn: Search a translated nucleotide database using a protein query.
◦ tblastx: Search a translated nucleotide database using a translated nucleotide query.
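The program choices on this page amount to a lookup on two properties: the type of the query and the type of the database. The sketch below writes that out as a small table; the function name and the type labels are illustrative choices, and `blastn` is the standard program name for nucleotide-versus-nucleotide searches.

```python
# Lookup table: (query type, database type) -> BLAST program.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("translated nucleotide", "protein"): "blastx",
    ("protein", "translated nucleotide"): "tblastn",
    ("translated nucleotide", "translated nucleotide"): "tblastx",
}


def choose_blast_program(query_type, database_type):
    """Return the BLAST program for a query/database combination.

    choose_blast_program is a hypothetical helper used only for illustration.
    """
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"no BLAST program for query={query_type!r}, database={database_type!r}"
        ) from None


if __name__ == "__main__":
    # A DNA query translated in all reading frames, searched against proteins.
    print(choose_blast_program("translated nucleotide", "protein"))
```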
Trang 17agatggattc tgtgaaaaag gctgaaaggg gagcgtcgcc gaagcaaata aaacccca ggtattattt gctggccgtg cattgaataa atgtaaggct gtcaagaaat cattttcttg
gagggctatc tcgttgttca taatcattta tgatgattaa ttgataagca atgagagtat
tcctctcatt gcttttttta ttgtggacaa agcgctcttt ctcctcaccc gcacgaacca
FASTA
Page 18
BLAST
Page 19
BLAST
Page 22
Raw Score
The score of an alignment, S, is calculated as the sum of substitution and gap scores. Substitution scores (for non-identical amino acids at a given position in an alignment) are given by look-up tables (see PAM, BLOSUM). Gap scores are typically calculated from G, the gap opening penalty, and L, the gap extension penalty. For a gap of length n, the gap cost would be G + Ln. The choice of gap costs G and L is empirical, but it is customary to choose a high value for G (10-15) and a low value for L (1-2).
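The arithmetic can be sketched as follows, assuming gap costs are subtracted from the summed substitution scores. The defaults G = 11 and L = 1 are simply values inside the customary ranges quoted above, and the per-position substitution scores in the example are made-up numbers, not values read from a real matrix.

```python
def gap_cost(n, G=11, L=1):
    """Cost of a single gap of length n under the affine model G + L*n."""
    return G + L * n


def raw_score(substitution_scores, gap_lengths, G=11, L=1):
    """Raw alignment score S: summed substitution scores minus all gap costs."""
    total_gap_cost = sum(gap_cost(n, G, L) for n in gap_lengths)
    return sum(substitution_scores) - total_gap_cost


if __name__ == "__main__":
    # Four aligned positions (illustrative scores) and one gap of length 2.
    scores = [4, 5, -1, 6]
    print(raw_score(scores, gap_lengths=[2]))  # 14 - (11 + 1*2) = 1
```

Because G is much larger than L, one long gap is cheaper than several short ones of the same total length, which is the intended effect of choosing a high opening penalty and a low extension penalty.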
Page 23
Bit Score
The bit score S' is derived from the raw alignment score S in which the statistical properties of the scoring system used have been taken into account. Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.
The normalization depends upon the scoring system (substitution matrix and gap costs) employed [4-6].
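This page does not spell out the normalization itself. A minimal sketch, assuming the standard Karlin-Altschul form S' = (λS − ln K) / ln 2, where λ and K are the statistical parameters of the scoring system, is shown below. The numeric values of λ and K in the example are illustrative assumptions, not values taken from this lecture.

```python
import math


def bit_score(raw_score, lam, K):
    """Convert a raw score S to a bit score S' = (lam*S - ln K) / ln 2.

    lam and K are the statistical parameters of the scoring system
    (substitution matrix plus gap costs).
    """
    return (lam * raw_score - math.log(K)) / math.log(2)


if __name__ == "__main__":
    # lam and K below are illustrative assumptions for one scoring system.
    print(round(bit_score(100, lam=0.267, K=0.041), 1))
```

Two searches run with different matrices produce raw scores on different scales, but after this conversion the scores are in the same units (bits) and can be compared directly.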
Page 24
E value: Expectation value
With the E value, the significance of scores can be assessed. It helps decide whether an alignment is biologically meaningful and gives evidence for homology, or is just the best alignment between two entirely unrelated sequences.
The E value is the number of different alignments with a score equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.
$E = mn \cdot 2^{-S'}$
Here S' is the bit score, and the parameters m and n are the lengths of the query sequence and the database, respectively.
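As a quick numerical illustration of the formula, the sketch below evaluates E for a given query length m, database length n, and bit score S'. The function name and the example lengths are arbitrary choices made for the illustration.

```python
def e_value(m, n, bit_score):
    """Expected number of chance alignments: E = m * n * 2**(-S')."""
    return m * n * 2.0 ** (-bit_score)


if __name__ == "__main__":
    m = 100    # query length (example value)
    n = 1000   # database length (example value)
    for s in (10, 20, 30):
        print(s, e_value(m, n, s))
```

Each additional bit of score halves E, and doubling the database length doubles it, so the same alignment becomes less significant as the database grows.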
Page 25
LOCUS: A short mnemonic name for the entry, chosen to suggest the sequence's definition. Mandatory keyword, exactly one record.
DEFINITION: A concise description of the sequence. Mandatory keyword, one or more records.
data that are associated with the GenBank entry identified by a given primary accession number
Page 26
KEYWORDS: Short phrases describing gene products and other information about an entry. Mandatory keyword in all annotated entries, one or more records.
SEGMENT: Information on the order in which this entry appears in a series of discontinuous sequences from the same molecule. Optional keyword (only in segmented entries), exactly one record.
SOURCE: Common name of the organism or the name most frequently used in the literature. Mandatory keyword in all annotated entries, one or more records, includes one sub-keyword.
Page 27
ORGANISM: Formal scientific name of the organism (first line) and taxonomic classification levels (second and subsequent lines). Mandatory sub-keyword in all annotated entries, two or more records.
REFERENCE: Citations for all articles containing data reported in this entry. Includes four sub-keywords and may repeat. Mandatory keyword, one or more records.
AUTHORS: Lists the authors of the citation. Mandatory sub-keyword, one or more records.
TITLE: Full title of citation. Optional sub-keyword (present in all but unpublished citations), one or more records.
JOURNAL: Lists the journal name, volume, year, and page numbers of the citation. Mandatory sub-keyword, one or more records.
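To make the keyword layout concrete, here is a minimal sketch of reading such header lines. It assumes the flat-file convention that the keyword or sub-keyword occupies the first 12 columns and the data follows, with continuation lines left blank in the keyword columns. The record in the example is a toy one built in the code and is not a real GenBank entry; `parse_header` is a hypothetical helper, not a library function.

```python
KEY_WIDTH = 12  # keyword columns; data starts after them (assumed layout)


def parse_header(text):
    """Return (keyword, value) pairs, joining continuation lines."""
    fields = []
    for line in text.splitlines():
        if not line.strip():
            continue
        keyword = line[:KEY_WIDTH].strip()
        value = line[KEY_WIDTH:].strip()
        if keyword:
            fields.append((keyword, value))
        elif fields:
            # Blank keyword columns: this line continues the previous field.
            prev_key, prev_val = fields[-1]
            fields[-1] = (prev_key, f"{prev_val} {value}".strip())
    return fields


# Toy record for illustration only; every value is a placeholder.
ROWS = [
    ("LOCUS", "TOYSEQ01"),
    ("DEFINITION", "Hypothetical example sequence."),
    ("SOURCE", "Escherichia coli"),
    ("  ORGANISM", "Escherichia coli"),
    ("", "Bacteria."),
    ("REFERENCE", "1"),
    ("  AUTHORS", "Doe,J."),
    ("  TITLE", "Placeholder title"),
    ("  JOURNAL", "Unpublished"),
]
SAMPLE = "\n".join(f"{key:<{KEY_WIDTH}}{value}" for key, value in ROWS)

if __name__ == "__main__":
    for keyword, value in parse_header(SAMPLE):
        print(f"{keyword}: {value}")
```

The sketch flattens sub-keywords such as ORGANISM and AUTHORS into the same list as top-level keywords. A fuller parser would nest them under SOURCE and REFERENCE and handle repeated REFERENCE blocks.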