
Fundamentals of Bioinformatics Chapter 2 Biological Database


DOCUMENT INFORMATION

Basic information

Title: Biological Database
Author: Do Tan Khang
Institution: National Center for Biotechnology Information
Field: Bioinformatics
Type: essay
Year of publication: 2008
City: Bethesda
Pages: 27
File size: 3.78 MB

Content

Lecture on bioinformatics. Fundamentals of Bioinformatics, in English, Chapter 2: Biological Database. This lecture covers the bioinformatics databases that are widely used around the world, such as NCBI.

Page 2

 NCBI (National Center for Biotechnology Information): advances science and health by providing access to biomedical and genomic information. 2008: GenBank 25th anniversary; 2009: 20th anniversary.

 EMBL Nucleotide Database

 DDBJ (DNA Data Bank of Japan)

Page 4

 GenBank is the NIH (National Institutes of Health) genetic sequence database, an annotated collection of all publicly available DNA sequences.

 As of April 2011, there are approximately 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions and 191,401,393,188 bases in 62,715,288 sequence records in the WGS (whole genome shotgun) division.

 GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI.

Page 5

There are several ways to search and retrieve data from GenBank:

 Search GenBank for sequence identifiers and annotations with Entrez Nucleotide.

 Search and align GenBank sequences to a query sequence using BLAST (Basic Local Alignment Search Tool).

 Search, link, and download sequences programmatically using NCBI E-utilities.
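As a rough illustration of programmatic access, the sketch below only builds an E-utilities EFetch request URL; it does not contact NCBI. The helper name `efetch_url` and its defaults are assumptions made for this example, and U49845 is used simply as a sample accession.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nuccore", rettype="fasta", retmode="text"):
    """Build an EFetch URL for one record (hypothetical helper, no network call)."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


if __name__ == "__main__":
    # Print the URL that would return the record in FASTA format.
    print(efetch_url("U49845"))
```

Opening the printed URL (for example with `urllib.request.urlopen`) would download the record; that step needs network access and is left out here.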

Page 6

 Provide and encourage access within the scientific community to the most up-to-date and comprehensive DNA sequence information.

 No restrictions on the use or distribution of the GenBank data.

 BUT some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted.

Page 7

 GenBank Submissions Handbook

 User services group: info@ncbi.nlm.nih.gov

 The following data is not accepted by GenBank:

Page 8

 BankIt, a WWW-based submission tool with wizards to guide the submission process.

 Sequin, NCBI's stand-alone submission tool with wizards to guide the submission process, is available by FTP for use on Mac, PC, and UNIX platforms.

 tbl2asn, a command-line program, automates the creation of sequence records for submission to GenBank using many of the same functions as Sequin. It is used primarily for submission of complete genomes and large batches of sequences and is available by FTP for use on Mac, PC, and UNIX platforms.

 Barcode Submission Tool, a WWW-based tool for the submission of sequences and trace read data for Barcode of Life projects based on the COI gene.

Page 9

 A tool that retains user information and database preferences to provide customized services for many NCBI databases.

 Allows you:

◦ to save searches,

◦ to select display formats and filtering options,

◦ to set up automatic searches that are sent by e-mail,

◦ to save citations (journal articles, books, meetings, patents and presentations) in My Bibliography,

◦ to manage peer review article compliance with the NIH Public Access Policy,

◦ to set up preferences for displaying and filtering search results, highlighting search terms, and setting LinkOut, Document Delivery Service, and Outside Tool preferences.

Page 10

 UniProt (Universal Protein Resource)

http://www.expasy.uniprot.org (includes SWISS-PROT, TrEMBL, PIR)

 Protein database (NCBI)

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

Page 11

 Protein Data Bank (PDB)

http://www.rcsb.org/pdb/

 Molecular Modeling DataBase (NCBI)

http://www.ncbi.nlm.nih.gov/Structure/MMDB

Page 12

 Whole genomes (NCBI)


Page 15

2. Start from an unknown sequence and try to find out what it might be: to what in the database is it similar? → BLAST

What does BLAST do?

 Search a large target set of sequences for hits to a query sequence, and return the alignments and scores from those hits.

 Do it fast.

Aim: Search databases for sequences that resemble your sequence, and show those sequences that deserve a second look. BLAST programs were designed for fast database searching, with minimal sacrifice of sensitivity to distantly related sequences.

Page 16

 nucleotide blast: Search a nucleotide database using a nucleotide query.

 protein blast: Search a protein database using a protein query. Algorithms: blastp, psi-blast, phi-blast.

 blastx: Search a protein database using a translated nucleotide query.

 tblastn: Search a translated nucleotide database using a protein query.

 tblastx: Search a translated nucleotide database using a translated nucleotide query.
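The choice among these programs depends only on the molecule type of the query and of the database, so it can be sketched as a small lookup. The function name `choose_blast_program` and the `translate_both` flag are hypothetical, introduced here for illustration; `blastn` and `blastp` are the standard program names for plain nucleotide and protein searches.

```python
def choose_blast_program(query, database, translate_both=False):
    """Pick a BLAST program from the molecule types of query and database.

    query, database: "nucleotide" or "protein".
    translate_both: set to True to compare a nucleotide query and a
    nucleotide database at the protein level (tblastx).
    """
    kinds = {"nucleotide", "protein"}
    if query not in kinds or database not in kinds:
        raise ValueError("query and database must be 'nucleotide' or 'protein'")
    if translate_both:
        if (query, database) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    table = {
        ("nucleotide", "nucleotide"): "blastn",
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",   # query is translated
        ("protein", "nucleotide"): "tblastn",  # database is translated
    }
    return table[(query, database)]


if __name__ == "__main__":
    print(choose_blast_program("nucleotide", "protein"))
```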

Page 17

agatggattc tgtgaaaaag gctgaaaggg gagcgtcgcc gaagcaaata aaacccca

ggtattattt gctggccgtg cattgaataa atgtaaggct gtcaagaaat cattttcttg

gagggctatc tcgttgttca taatcattta tgatgattaa ttgataagca atgagagtat

tcctctcatt gcttttttta ttgtggacaa agcgctcttt ctcctcaccc gcacgaacca

FASTA

Page 18

BLAST

Page 19

BLAST

Page 22

 Raw Score

 The score of an alignment, S, is calculated as the sum of substitution and gap scores. Substitution scores (for non-identical amino acids at a given position in an alignment) are given by look-up tables (see PAM, BLOSUM). Gap scores are typically calculated as the sum of G, the gap opening penalty, and L, the gap extension penalty. For a gap of length n, the gap cost would be G + Ln. The choice of gap costs G and L is empirical, but it is customary to choose a high value for G (10-15) and a low value for L (1-2).
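A minimal sketch of the raw-score arithmetic, assuming an already aligned pair of strings with "-" marking gaps. The scoring function `toy_score` (+5 for a match, -4 for a mismatch) is a made-up stand-in for a real PAM or BLOSUM look-up table, and the defaults G = 11, L = 1 are just one choice within the customary ranges.

```python
def gap_cost(n, gap_open=11, gap_extend=1):
    """Cost of one gap of length n: G + L*n (zero for n <= 0)."""
    if n <= 0:
        return 0
    return gap_open + gap_extend * n


def raw_score(query, subject, score_pair, gap_open=11, gap_extend=1):
    """Sum substitution scores and subtract the cost of each gap run."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    total, run_len, run_side = 0, 0, None
    for q, s in zip(query, subject):
        side = "query" if q == "-" else "subject" if s == "-" else None
        # A gap run ends when the gapped side changes or the gap closes.
        if side != run_side and run_len:
            total -= gap_cost(run_len, gap_open, gap_extend)
            run_len = 0
        run_side = side
        if side is None:
            total += score_pair(q, s)
        else:
            run_len += 1
    if run_len:  # alignment ended inside a gap
        total -= gap_cost(run_len, gap_open, gap_extend)
    return total


def toy_score(a, b):
    """Hypothetical substitution scores, not PAM or BLOSUM values."""
    return 5 if a == b else -4


if __name__ == "__main__":
    # Five matches/mismatches plus one gap of length 2 (cost 11 + 1*2 = 13).
    print(raw_score("ACGT--A", "ACCTGGA", toy_score))
```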

Page 23

 Bit Score

 The bit score S' is derived from the raw alignment score S in which the statistical properties of the scoring system used have been taken into account. Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.

 $S' = (\lambda S - \ln K) / \ln 2$, where the parameters $\lambda$ and $K$ depend upon the scoring system (substitution matrix and gap costs) employed [4-6].

Page 24

 E value: Expectation value

 With the E value, the significance of scores can be assessed. It is a way to decide whether an alignment is biologically meaningful and gives evidence for homology, or is just the best alignment between two entirely unrelated sequences.

 The E value is the number of different alignments with a score equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.

 $E = mn \cdot 2^{-S'}$, where S' is the bit score.

 The parameters m and n are the lengths of the query sequence and the database.
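The E value formula can be tried out numerically. The sketch below assumes the standard Karlin-Altschul normalization S' = (λS − ln K) / ln 2 for the bit score; the function names are hypothetical, and the λ, K, and length values in the usage lines are illustrative assumptions, not figures taken from these slides.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score S to a bit score S' = (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(query_len, db_len, bits):
    """Expected number of chance hits: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


if __name__ == "__main__":
    # Illustrative parameters only (assumed, not from the slides).
    lam, k = 0.267, 0.041
    m, n = 250, 5_000_000
    for s in (40, 80, 120):
        bits = bit_score(s, lam, k)
        print(s, round(bits, 1), e_value(m, n, bits))
```

Each extra bit halves E, which is why a modest gain in score makes an alignment far less likely to be a chance hit.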

Page 25

 LOCUS: A short mnemonic name for the entry, chosen to suggest the sequence's definition. Mandatory keyword, exactly one record.

 DEFINITION: A concise description of the sequence. Mandatory keyword, one or more records.

 [...] data that are associated with the GenBank entry identified by a given primary accession number.

Page 26

 KEYWORDS: Short phrases describing gene products and other information about an entry. Mandatory keyword in all annotated entries, one or more records.

 SEGMENT: Information on the order in which this entry appears in a series of discontinuous sequences from the same molecule. Optional keyword (only in segmented entries), exactly one record.

 SOURCE: Common name of the organism or the name most frequently used in the literature. Mandatory keyword in all annotated entries, one or more records, includes one sub-keyword.

Page 27

 ORGANISM: Formal scientific name of the organism (first line) and taxonomic classification levels (second and subsequent lines). Mandatory sub-keyword in all annotated entries, two or more records.

 REFERENCE: Citations for all articles containing data reported in this entry. Includes four sub-keywords and may repeat. Mandatory keyword, one or more records.

 AUTHORS: Lists the authors of the citation. Mandatory sub-keyword, one or more records.

 TITLE: Full title of citation. Optional sub-keyword (present in all but unpublished citations), one or more records.

 JOURNAL: Lists the journal name, volume, year, and page numbers of the citation. Mandatory sub-keyword, one or more records.
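To see how these keywords and sub-keywords sit in a flat file, here is a minimal sketch of a header reader. It assumes the usual flat-file layout in which the keyword occupies the first 12 columns (sub-keywords indented) and continuation lines leave those columns blank. The record `TOY_RECORD` is entirely made up for illustration and is not a real GenBank entry; `parse_header` is a hypothetical helper, not part of any NCBI tool.

```python
TOY_RECORD = """\
LOCUS       TOY00001                 120 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example sequence, made up for illustration
            only.
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences.
REFERENCE   1  (bases 1 to 120)
  AUTHORS   Doe,J.
  TITLE     Direct Submission
  JOURNAL   Unpublished
//
"""


def parse_header(flatfile_text):
    """Collect (keyword, value) pairs from the header of a flat-file record."""
    fields = []
    for line in flatfile_text.splitlines():
        if not line.strip():
            continue
        tag, data = line[:12].strip(), line[12:].strip()
        if tag in ("FEATURES", "ORIGIN", "//"):
            break  # header is over
        if tag:
            fields.append([tag, data])      # new keyword or sub-keyword
        elif fields:
            fields[-1][1] += " " + data     # continuation of previous field
    return [(key, value) for key, value in fields]


if __name__ == "__main__":
    for key, value in parse_header(TOY_RECORD):
        print(f"{key:<12}{value}")
```

Repeated keywords such as REFERENCE simply appear several times in the returned list, which matches the "may repeat" wording above.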

Posted: 26/01/2024, 12:49