Bioinformatics Lecture Notes

DOCUMENT INFORMATION

Basic information

Title	Bioinformatics Lecture Notes
Instructor	Dr. Nguyễn Sỹ Lê Thanh
Institution	Institute of Biotechnology, Vietnam Academy of Science and Technology
Major	Bioinformatics
Type	Lecture notes

Format
Pages	314
File size	35.05 MB

Content


BIO417 - BIOINFORMATICS (Tin sinh học)

Course information
- Course code: BIO417
- Credits: 02 (2 LT + 1 TH; LT = theory, TH = practical)
- Prerequisites: Introduction to Biotechnology; Molecular Biology

Lecturer information
- Dr. Nguyễn Sỹ Lê Thanh, Institute of Biotechnology, Vietnam Academy of Science and Technology
- Email: nslthanh@ibt.ac.vn

Course objectives
- Give an overview of how to find information sources on the Internet for study, research, and thesis writing.
- Provide knowledge of a number of commonly used bioinformatics tools in order to (1) retrieve and process biological information and (2) apply them in research, in the laboratory, and in practice.

Course content
- Content: theory; an introduction to a number of analysis tools and biological databases; computer-based practicals.
- Student duties: attend class and take part in discussions; complete the practicals and exercises.
- Examination format: written exam and computer-based practical.
- Grading: 10-point scale, weighted as follows.

Component	Weight
Attendance (40% class attendance, 30% test 1, 30% test 2)	10%
In-class test	10%
Practical	20%
Final exam	60%

Course outline

Part A. Theory

Part 1: GENERAL INTRODUCTION TO BIOINFORMATICS
Chapter 1: Introduction to bioinformatics
Chapter 2: The biological foundations of bioinformatics
Chapter 3: Searching for and managing research literature

Part 2: BIOLOGICAL DATABASES
Chapter 4: Biological databases
Chapter 5: Sequencing and submitting sequences to the gene bank

Part 3: TOOLS FOR ANALYZING, MINING, AND PROCESSING BIOLOGICAL SEQUENCE DATA
Chapter 6: Genome Browser
Chapter 7: Tools for analyzing biological data
Chapter 8: Getting started with biological sequence data analysis
Chapter 9: Sequence alignment and the principles of sequence alignment
Chapter 10: Analyzing evolutionary relationships

Part B. Practicals
Practical 01: Getting to know DNA sequences and sequence analysis tools
Practical 02: Getting to know the databases
Practical 03: Submitting sequences and getting started with sequence analysis
Practical 04: Sequence alignment
Practical 05: The BLAST tool
Practical 06: Building phylogenetic trees

Notes
- Students should read the materials before and after each class.
- Practicals are 100% computer-based. Students should prepare a desktop PC or laptop with an Internet connection fast enough for the sessions to run smoothly.
- The software used runs on Windows; some programs (tools) cannot be used free of charge on iOS.
- Files submitted for tests, practicals, and exams must be named in the format: number in class list_student ID_full name_content submitted. Example: 14_630306_Nguyen Van A_Bai thuc hanh. Submit according to the announced schedule.
- Students who miss a class must tell the class monitor the reason. The class monitor compiles the absences and sends them to the lecturer in charge of the class.

Course materials
- Xiong, J. (2006). Essential Bioinformatics. Cambridge University Press.
- Kumar, Santosh (ed.) (2014). The Role of Bioinformatics in Agriculture. Toronto: Apple Academic Press.
- Lesk, Arthur M. (2014). Introduction to Bioinformatics. Boca Raton, FL: Chapman & Hall/CRC.
- Pevsner, Jonathan (2015). Bioinformatics and Functional Genomics. UK: John Wiley & Sons Inc.
- Bleidorn, Christoph (2017). Phylogenomics: An Introduction. Cham: Springer International Publishing.
- Haubold, Bernhard; Borsch-Haubold, Angelika (2017). Bioinformatics for Evolutionary Biologists. Cham: Springer International Publishing.

Essay topics
1. An overview of bioinformatics, its role, and its applications
2. Classification of biological databases and the applications of these databases
3. Genome banks for crop species (rice, maize, soybean, potato…)
4. Groups of sequence analysis tools and their applications
5. Genome browsers and their use in mining genomes
6. The BLAST tool, its variants, and its applications
7. Methods for sequencing and for submitting sequences
8. SNP databases and their practical applications
9. Databases of the relationships between genes, gene mutations, and disease
10. Combining bioinformatics tools to build methods that support breeding and selection with molecular markers (MAS, marker-assisted selection) for agronomic traits of crops: yield, quality, and disease resistance
11. Open topic: students choose their own topic

Requirements:
- Five students per group; each group chooses one topic.
- The essay must be typed on A4 paper with 1.5 line spacing in Times New Roman, size 12.
- Length: … to 30 A4 pages.
- Cite the scientific literature in full.
- Submission: at the final theory class. The class monitor collects the groups' essays into one folder and submits it to the lecturer.

Base Pair Maximization - Dynamic Programming Algorithm
(Images: Sean Eddy)

Alignment method:
- Align the RNA strand to itself.
- The score increases for each feasible base pair.
- Each score is independent of the overall structure.
- Bifurcation adds an extra dimension.

Filling the matrix:
- Initialize the first two diagonals to 0.
- Fill in the squares by sweeping diagonally.
- Each cell S(i, j) takes the best of the possible dynamic programming paths.

S(i, j) is the maximum of:
- S(i + 1, j - 1) + 1, if bases i and j can pair (similar to a matched position in an alignment)
- S(i + 1, j) or S(i, j - 1), if the bases cannot pair (similar to an unmatched position in an alignment)
- S(i, k) + S(k + 1, j), maximized over all k between i and j (bifurcation)

Base Pair Maximization - Drawbacks
- Base pair maximization will not necessarily lead to the most stable structure.
- It may create a structure with many interior loops or hairpins, which are energetically unfavorable.
- This is comparable to aligning sequences with scattered matches: not biologically reasonable.

Energy Minimization
- Thermodynamic stability is estimated using experimental techniques.
- Theory: the most stable structure is the most likely one.
- No pseudoknots, due to algorithm limitations.
- Uses the dynamic programming alignment technique.
- Attempts to optimize the score while taking thermodynamics into account.
- Implemented in MFOLD and ViennaRNA.

Energy Minimization Results
(Images: David Mount)
- A linear RNA strand is folded back on itself to create secondary structure.
- All loops must contain a minimum number of bases.
- The circularized representation uses this requirement: it is equivalent to requiring that many bases between the two ends of every arc.
- Arcs represent base pairing.
- Exception: the location where the beginning and end of the RNA come together in the circularized representation.

Trouble with Pseudoknots
(Images: David Mount)
- Pseudoknots cause a breakdown in the dynamic programming algorithm.
- To form a pseudoknot, checks must be made to ensure that a base is not already paired, and this breaks down the recurrence relations.

Energy Minimization Drawbacks
- Computes only one optimal structure.
- Has the usual drawbacks of purely mathematical approaches.
- Similar difficulties arise in other algorithms, such as protein structure prediction and exon finding.

Alternative Algorithms - Covariation
- Incorporates a similarity-based method.
- Evolution maintains sequences that are important.
- Changes in sequence coincide so that structure is maintained through base pairs (covariance).
- Cross-species structure conservation example, tRNA:
  - Base pairing creates the same stable tRNA structure across organisms.
  - A mutation in one base makes pairing impossible and breaks down the structure.
  - Covariation ensures that the ability to base pair is maintained and the RNA structure is conserved.
  - Regions of base pairing in tRNA are therefore expected to covary between species.
- Manual and automated approaches have been used to identify covarying base pairs.
- Models for structure based on the results: the ordered tree model and stochastic context-free grammars.

Binary Tree Representation of RNA Secondary Structure
(Images: Eddy et al.)
- RNA structure is represented as a binary tree.
- Nodes represent a base pair if two bases are shown, or a loop if a base and a "gap" (dash) are shown.
- Pseudoknots are still not represented.
- The tree does not permit varying sequences: mismatches, insertions, and deletions.

Covariance Model
- An HMM-like model that permits flexible alignment to an RNA structure, using emission and transition probabilities.
- Model trees are based on a finite number of states.
- Match states (the sequence conforms to the model):
  - MATP: the bases are paired in both the model and the sequence.
  - MATL and MATR: there is a left or right bulge in both the sequence and the model.
- Deletion: there is a deletion in the sequence relative to the model.
- Insertion: there is an insertion in the sequence relative to the model.
- Transitions have probabilities. These vary: entering an insertion, remaining in the current state, and so on.
- Bifurcation: has no probability; it only describes the path.

Covariance Model (CM) Training Algorithm
- S(i, j) is the score at indices i and j in the RNA when aligned to the covariance model.
- It combines the frequency of seeing the symbols (A, C, G, T) together at locations i and j with the independent frequencies of seeing each symbol at location i or at location j.
- The frequencies are obtained by aligning the model to "training data" made up of sample sequences.
- They reflect the values that optimize the alignment of the sequences to the model.

Alignment to CM Algorithm
(Images: Eddy et al.)
- Calculate the probability score of aligning the RNA to the CM.
- Uses a three-dimensional matrix: O(n³).
- Align the sequence to given subtrees in the CM.
- For each subsequence, calculate all possible states.
- Subtrees evolve from bifurcations.
- For simplicity, the left singlet is the default.
- Each calculation takes into account:
  - the transition probability (T) to the next state;
  - the emission probability (P) in the state, as determined by the training data.
- Deletion: has no emission probability (P) associated with it.
- Bifurcation: has no probability associated with the state.

Covariance Model Drawbacks
- Needs to be well trained.
- Not suitable for searches of large RNA:
  - the structural complexity of large RNA cannot be modeled;
  - runtime;
  - memory requirements.

References
Eddy, S.R. (2004). How do RNA folding algorithms work? Nature Biotechnology, 22:1457-1458.
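
Worked example: the base pair maximization recurrence from the slides can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code. The function names `can_pair` and `max_base_pairs`, the inclusion of the G-U wobble pair, and the optional `min_loop` parameter are assumptions made for this sketch. It returns only the maximum score S(0, n - 1) and does no traceback.

```python
def can_pair(a, b):
    """Watson-Crick pairs plus the G-U wobble pair (an assumption of this sketch)."""
    return (a, b) in {("A", "U"), ("U", "A"), ("G", "C"),
                      ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=0):
    """Return the maximum number of nested base pairs in seq.

    min_loop is a hypothetical parameter: bases i and j may pair only
    if j - i > min_loop. The default 0 gives the plain recurrence.
    """
    n = len(seq)
    if n == 0:
        return 0
    # S[i][j] = best score for the subsequence seq[i..j]; the short
    # diagonals start at 0, as in the initialization step.
    S = [[0] * n for _ in range(n)]
    # Sweep diagonally: short subsequences first, then longer ones.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # Base i or base j left unpaired.
            best = max(S[i + 1][j], S[i][j - 1])
            # Bases i and j pair with each other.
            if span > min_loop and can_pair(seq[i], seq[j]):
                inner = S[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)
            # Bifurcation: split into two independent substructures at k.
            for k in range(i + 1, j):
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("GGGAAACCC", min_loop=3))
```

The three nested loops (i, j, and the bifurcation index k) are what give the algorithm its cubic running time in the sequence length.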

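Worked example: one common way to turn the joint and independent symbol frequencies described for covariation into a single score is the mutual information between two alignment columns. The sketch below assumes that measure; the slides as extracted do not show the formula itself, so treat this as an illustration rather than the lecture's exact definition. The function name `mutual_information` and the toy alignment are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j.

    alignment is a list of equal-length aligned sequences. High values
    mean the two columns vary together, as expected for base-paired
    positions; 0 means they vary independently or not at all.
    """
    n = len(alignment)
    col_i = Counter(seq[i] for seq in alignment)            # counts at column i
    col_j = Counter(seq[j] for seq in alignment)            # counts at column j
    pairs = Counter((seq[i], seq[j]) for seq in alignment)  # joint counts
    mi = 0.0
    for (a, b), count in pairs.items():
        f_ab = count / n  # joint frequency of seeing a at i together with b at j
        f_a = col_i[a] / n  # independent frequency of a at i
        f_b = col_j[b] / n  # independent frequency of b at j
        mi += f_ab * log2(f_ab / (f_a * f_b))
    return mi


if __name__ == "__main__":
    # Toy alignment (made up): columns 0 and 3 always form a complementary
    # pair, while columns 1 and 2 are fully conserved.
    toy = ["GAAC", "CAAG", "AAAU", "UAAA"]
    print(mutual_information(toy, 0, 3))
    print(mutual_information(toy, 1, 2))
```

In the toy alignment, the first and last columns change from sequence to sequence but always stay complementary, so they score high. The conserved middle columns score zero because a column that never changes carries no covariation signal.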
Date posted: 28/12/2023, 08:16