A SEQUENTIAL ITERATIVE REFINEMENT OPTIMIZATION METHOD TO MULTIPLE SEQUENCE ALIGNMENT

LI YIHUI
(B.Sc. Nankai University)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF STATISTICS AND APPLIED PROBABILITY
NATIONAL UNIVERSITY OF SINGAPORE
2004

Acknowledgements

On completing this thesis, I would like to express my heartfelt gratitude to my supervisor, Assoc. Prof. Chen ZeHua, for his invaluable advice and guidance and for his endless patience and encouragement throughout the period of his mentorship. I truly appreciate all the time and effort he spent helping me solve the problems I encountered, even when he was in the midst of his own work.

I would like to dedicate this thesis to my dearest family, who have always supported me with their encouragement and understanding.

Special thanks go to all my friends, who helped me in one way or another, for their friendship and encouragement throughout these two years.

Contents

1 Introduction
    1.1 Basic Concept of Sequence Alignment
    1.2 Pairwise Sequence Alignment Method
        1.2.1 Dynamic Programming Method
        1.2.2 Global Alignment and N-W Algorithm
        1.2.3 Local Alignment and S-W Algorithm
    1.3 Multiple Sequence Alignment
        1.3.1 Carrillo-Lipman Algorithm and MSA Program
        1.3.2 Other Heuristic Methods
    1.4 Importance and Application of Sequence Alignment in Biology

2 Review of Current Methods
    2.1 CLUSTALW Program
        2.1.1 The Basic Algorithm of the CLUSTALW
        2.1.2 Additional Heuristics of CLUSTALW
        2.1.3 Advantages and Disadvantages
    2.2 PRRP Program
        2.2.1 DNR Algorithm
        2.2.2 Advantages and Disadvantages
    2.3 SAGA
        2.3.1 Objective Function (OF)
        2.3.2 Genetic Algorithm Used by SAGA
        2.3.3 Advantages and Disadvantages
    2.4 Multiple Alignment By Profile HMM Training
        2.4.1 Basic Algorithm Of HMMer
        2.4.2 Advantages and Disadvantages

3 SIROA Method
    3.1 Basic Idea
    3.2 Details of SIROA
        3.2.1 Step 1: Initial Alignment
        3.2.2 Step 2: Overlapped Iterative Alignment
    3.3 Some Special Features Of SIROA
        3.3.1 Block Size And Overlap Size
        3.3.2 Iterative Method
        3.3.3 Advantages And Disadvantages
    3.4 An Example Of Multiple Alignment Using SIROA Method

4 Numerical Results With Reference To BAliBASE
    4.1 BAliBASE
    4.2 Alignment Scoring Schemes
        4.2.1 SP (Sum-Of-Pairs) Score
        4.2.2 Baliscore (BS)
    4.3 Performance Of SIROA Method In Terms Of SP And BS Scores
        4.3.1 SP Scores
        4.3.2 BS Scores

Conclusion and Discussion

Appendix

Bibliography

List of Figures

1.1 A sample of multiple sequence alignment.

1.2 A sample of global sequence alignment.

1.3 A sample of a local alignment of the same sequences as above.

1.4 The optimal path of an alignment of 3 sequences and the actual optimal alignment (right).

1.5 Alignment of three sequences by dynamic programming.

1.6 Schematic showing the relation between the different alignment programs and algorithms.

2.1 The scoring scheme for comparing two positions from two alignments. Two sections of alignment with 4 and 2 sequences respectively are shown. The score of the position with residues T, L, K, K versus the position with residues V and I is given with and without sequence weights. M(X,Y) is the substitution matrix entry for residue X versus residue Y. $W_n$ is the weight for sequence n.

2.2 The basic progressive alignment procedure, illustrated using a set of globins of known tertiary structure. In the distance matrix, the mean number of differences per residue is given. The un-rooted tree shows all branch lengths drawn to scale. In the rooted tree, all branch lengths are given as well as weights for each sequence. In the multiple alignment, the approximate positions of the α-helices common to all proteins are shown (bold residues). This alignment was derived using CLUSTALW with default parameters and the PAM3 series of weight matrices.

2.3 Schematic diagram of the procedure of the doubly nested randomized iterative (DNR) method for multiple sequence alignment.

2.4 The layout of the SAGA algorithm. $G_0$ is the initial population. $G_n$ is one generation cycle. The method continues until the terminal conditions are met. Boxes $P_1^n$ to $P_m^n$ indicate parents in generation n; boxes $C_1^{n+1}$ to $C_m^{n+1}$ indicate the children of these parents. Parents and children are alignments. Bold boxes indicate alignments selected to survive unchanged from one generation to the next. OP is a randomly chosen operator.

2.5 As an example of model construction from an alignment, a small DNA multiple alignment is given (a), with three columns marked above with x's. These three columns are assigned to positions 1-3 in the model architecture (b). The assignment of columns to model positions determines the symbol emission and state transition counts (c) from which probability parameters would be estimated.

3.1 The sequence blocks after partition. N is the number of sequences, M is the number of blocks, $S_i$ is the ith sequence, $B_j$ is the jth block and $S_{ij}$ represents the ith sequence in the jth block.

3.2 Each individual block $B_i$ is aligned using the SIROA method, giving a multiple alignment for each block.

3.3 After combining the alignments of all the sequence blocks, we obtain the initial alignment $A_0$.
Appendix

typedef struct coordinate COORDINATE;
typedef struct coordinate_values COORDINATE_VALUES;
typedef struct vertex VERTEX;
typedef struct edge EDGE;
typedef struct heap HEAP;

/* array for accessing vertices by lattice coordinate */
struct coordinate {
    int lo, hi;                          /* lower and upper limits on array indices */
    COORDINATE_VALUES *coord_vals;       /* next coordinate array indices */
    COORDINATE_VALUES *prev_coord_val;   /* previous coordinate array index */
    COORDINATE *next_on_free_list;       /* maintains available records */
    int refer;                           /* how many valid coordinate values I have */
};

/* index in array */
struct coordinate_values {
    COORDINATE *next_coord;              /* next coordinate array */
    COORDINATE *curr_coord;              /* current coordinate array */
    /* short value; value of this coordinate in absolute terms */
};

/* lattice vertex */
struct vertex {
    EDGE *out;                           /* outgoing edge adjacency list */
    COORDINATE_VALUES *prev_coord_val;   /* father in array */
    EDGE *nonextracted_inedges;          /* incoming edges still not extracted from heap */
};

/* lattice edge */
struct edge {
    VERTEX *tail, *head;                 /* edge tail and head vertices */
    int dist;                            /* distance to head from source along edge */
    int refer;                           /* how many backtrack edges point to me */
    EDGE *next, *prev;                   /* edge adjacency list links */
    EDGE *heap_succ, *heap_pred;         /* heap bucket links */
    EDGE *nonextracted_next, *nonextracted_prev;   /* nonextracted_inedges links */
    EDGE *backtrack;                     /* edge to previous edge in path */
};

/* discrete heap of edges ranked by distance */
struct heap {
    EDGE **min, **max;                   /* minimum and maximum buckets of heap */
    EDGE *bucket[1];                     /* buckets of edges */
};

char *S [NUMBER+1];                      /* sequences */
int K;                                   /* number of sequences */
int N [NUMBER+1];                        /* lengths of sequences */
int delta;                               /* upper and lower bound difference */
int epsi[NUMBER+1][NUMBER+1];            /* projected & pairwise cost diff. */
int scale[NUMBER+1][NUMBER+1];           /* pairwise cost weight scale */
int Con[NUMBER+1][NUMBER+1][LENGTH+1];   /* consistency check */
int proj [NUMBER+1] [NUMBER+1];          /* projected costs */
char cname[20]="pam250";                 /* name of cost file */
int Upper, Lower;                        /* upper and lower bounds on alignment distance */
int bflag,gflag,fflag,oflag;
int *dd [LENGTH+1],                      /* forward diagonal distance */
    *hh [LENGTH+1],                      /* forward horizontal distance */
    *vv [LENGTH+1];                      /* forward vertical distance */
VERTEX *presource;                       /* Vertex before source; tail of first edge */

/* void main()
 *   -g   penalizes terminal gaps
 *   -b   sets all pairwise weights to ... (does not compute tree)
 *   ...  sequences to be aligned
 */
#define USAGE

struct timeval starttime, endtime;
double deltatime;

void main(argc, argv)
int argc;
char *argv[];
{
    int i,j,len,size;
    int eflag=1;
    int mflag=1;
    char ename[FILE_NAME_SIZE];
    char fname[FILE_NAME_SIZE],Fname[FILE_NAME_SIZE];
    FILE *stream, *efile, *fopen();

    gettimeofday(&starttime, NULL);      /* Get start time */

    /* process arguments */
    oflag=bflag=gflag=1; fflag=0;
    delta = -1;
    stream = NULL;
    if ( argc > MAX_ARGS ) fatal(USAGE);
    for ( ++argv; --argc; argv++ )
        ...
        else if ( (*argv)[1] == 'g' ) gflag = 0;
        else if ( (*argv)[1] == 'b' ) bflag = 0;
        else fatal(USAGE);
        else sscanf(*argv,"%s",Fname);
    data(stream,must_open(Fname,"r"));   /* get input sequences */

    /* allocate memory */
    for (len=i=1;i<...>len) len=N[i];
    ++len;
    size=sizeof(int)*len*len*(1+eflag);
    dd[0] = (int *) alloc(size);
    hh[0] = (int *) alloc(size);
    vv[0] = (int *) alloc(size);
    for (i=1;i<...>NUMBER) fatal("Cannot exceed %d sequences.",NUMBER);
        N[K] = n;
        S[K] = alloc(n + 2);             /* initial DASH and terminal '\0' */
        S[K][0] = DASH;
        strcpy(&S[K][1],buffer);
    }

    /* initialize gap cost, gap count table, and positive symmetric integer distance table */
    if ( stream ) {   /* distance file contains gap cost followed by triples of the form */
        fscanf(stream," %d\n",&G);
        while (fscanf(stream," %c %c %d\n",&a,&b,&c)==3)
            D[a][b] = D[b][a] = c;
        fclose(stream);
    } else {          /* default is dayhoff matrix and gap cost */
        G = 8;
        DAG(-) = 0;
        DAG(W)=0;  DAG(Y)=7;  DAG(F)=8;  DAG(V)=13; DAG(L)=11; DAG(I)=12; DAG(M)=11; DAG(K)=12;
        DAG(R)=11; DAG(H)=11; DAG(Q)=13; DAG(E)=13; DAG(D)=13; DAG(N)=15; DAG(G)=12; DAG(A)=15;
        DAG(P)=11; DAG(T)=14; DAG(S)=15; DAG(C)=5;

        SUB(-,C) = 12; SUB(-,S) = 12; SUB(-,T) = 12; SUB(-,P) = 12; SUB(-,A) = 12;
        SUB(-,G) = 12; SUB(-,N) = 12; SUB(-,D) = 12; SUB(-,E) = 12; SUB(-,Q) = 12;
        SUB(-,H) = 12; SUB(-,R) = 12; SUB(-,K) = 12; SUB(-,M) = 12; SUB(-,I) = 12;
        SUB(-,L) = 12; SUB(-,V) = 12; SUB(-,F) = 12; SUB(-,Y) = 12; SUB(-,W) = 12;

        SUB(W,C) = 25; SUB(W,S) = 19; SUB(W,T) = 22; SUB(W,P) = 23; SUB(W,A) = 23;
        SUB(W,G) = 24; SUB(W,N) = 21; SUB(W,D) = 24; SUB(W,E) = 24; SUB(W,Q) = 22;
        SUB(W,H) = 20; SUB(W,R) = 15; SUB(W,K) = 20; SUB(W,M) = 21; SUB(W,I) = 22;
        SUB(W,L) = 19; SUB(W,V) = 23; SUB(W,F) = 17; SUB(W,Y) = 17;

        SUB(Y,C) = 17; SUB(Y,S) = 20; SUB(Y,T) = 20; SUB(Y,P) = 22; SUB(Y,A) = 20;
        SUB(Y,G) = 22; SUB(Y,N) = 19; SUB(Y,D) = 21; SUB(Y,E) = 21; SUB(Y,Q) = 21;
        SUB(Y,H) = 17; SUB(Y,R) = 21; SUB(Y,K) = 21; SUB(Y,M) = 19; SUB(Y,I) = 18;
        SUB(Y,L) = 18; SUB(Y,V) = 19; SUB(Y,F) = 10;

        SUB(F,C) = 21; SUB(F,S) = 20; SUB(F,T) = 20; SUB(F,P) = 22; SUB(F,A) = 21;
        SUB(F,G) = 22; SUB(F,N) = 21; SUB(F,D) = 23; SUB(F,E) = 22; SUB(F,Q) = 22;
        SUB(F,H) = 19; SUB(F,R) = 21; SUB(F,K) = 22; SUB(F,M) = 17; SUB(F,I) = 16;
        SUB(F,L) = 15; SUB(F,V) = 18;

        SUB(V,C) = 19; SUB(V,S) = 18; SUB(V,T) = 17; SUB(V,P) = 18; SUB(V,A) = 17;
        SUB(V,G) = 18; SUB(V,N) = 19; SUB(V,D) = 19; SUB(V,E) = 19; SUB(V,Q) = 19;
        SUB(V,H) = 19; SUB(V,R) = 19; SUB(V,K) = 19; SUB(V,M) = 15; SUB(V,I) = 13;
        SUB(V,L) = 15;

        SUB(L,C) = 23; SUB(L,S) = 20; SUB(L,T) = 19; SUB(L,P) = 20; SUB(L,A) = 19;
        SUB(L,G) = 21; SUB(L,N) = 20; SUB(L,D) = 21; SUB(L,E) = 20; SUB(L,Q) = 19;
        SUB(L,H) = 19; SUB(L,R) = 20; SUB(L,K) = 20; SUB(L,M) = 13; SUB(L,I) = 15;

        SUB(I,C) = 19; SUB(I,S) = 18; SUB(I,T) = 17; SUB(I,P) = 19; SUB(I,A) = 18;
        SUB(I,G) = 20; SUB(I,N) = 19; SUB(I,D) = 19; SUB(I,E) = 19; SUB(I,Q) = 19;
        SUB(I,H) = 19; SUB(I,R) = 19; SUB(I,K) = 19; SUB(I,M) = 15;

        SUB(M,C) = 22; SUB(M,S) = 19; SUB(M,T) = 18; SUB(M,P) = 19; SUB(M,A) = 18;
        SUB(M,G) = 20; SUB(M,N) = 19; SUB(M,D) = 20; SUB(M,E) = 19; SUB(M,Q) = 18;
        SUB(M,H) = 19; SUB(M,R) = 17; SUB(M,K) = 17;

        SUB(K,C) = 22; SUB(K,S) = 17; SUB(K,T) = 17; SUB(K,P) = 18; SUB(K,A) = 18;
        SUB(K,G) = 19; SUB(K,N) = 16; SUB(K,D) = 17; SUB(K,E) = 17; SUB(K,Q) = 16;
        SUB(K,H) = 17; SUB(K,R) = 14;

        SUB(R,C) = 21; SUB(R,S) = 17; SUB(R,T) = 18; SUB(R,P) = 17; SUB(R,A) = 19;
        SUB(R,G) = 20; SUB(R,N) = 17; SUB(R,D) = 18; SUB(R,E) = 18; SUB(R,Q) = 16;
        SUB(R,H) = 15;

        SUB(H,C) = 20; SUB(H,S) = 18; SUB(H,T) = 18; SUB(H,P) = 17; SUB(H,A) = 18;
        SUB(H,G) = 19; SUB(H,N) = 15; SUB(H,D) = 16; SUB(H,E) = 16; SUB(H,Q) = 14;

        SUB(Q,C) = 22; SUB(Q,S) = 18; SUB(Q,T) = 18; SUB(Q,P) = 17; SUB(Q,A) = 17;
        SUB(Q,G) = 18; SUB(Q,N) = 16; SUB(Q,D) = 15; SUB(Q,E) = 15;

        SUB(E,C) = 22; SUB(E,S) = 17; SUB(E,T) = 17; SUB(E,P) = 18; SUB(E,A) = 17;
        SUB(E,G) = 17; SUB(E,N) = 16; SUB(E,D) = 14;

        SUB(D,C) = 22; SUB(D,S) = 17; SUB(D,T) = 17; SUB(D,P) = 18; SUB(D,A) = 17;
        SUB(D,G) = 16; SUB(D,N) = 15;

        SUB(N,C) = 21; SUB(N,S) = 16; SUB(N,T) = 17; SUB(N,P) = 18; SUB(N,A) = 17;
        SUB(N,G) = 17;

        SUB(G,C) = 20; SUB(G,S) = 16; SUB(G,T) = 17; SUB(G,P) = 18; SUB(G,A) = 16;

        SUB(A,C) = 19; SUB(A,S) = 16; SUB(A,T) = 16; SUB(A,P) = 16;

        SUB(P,C) = 20; SUB(P,S) = 16; SUB(P,T) = 17;

        SUB(T,C) = 19; SUB(T,S) = 16;

        SUB(S,C) = 17;
    }

    GG = gflag ? ... : G;

    T[0][0][0][0] = 0; T[0][0][0][1] = G; T[0][0][1][0] = G; T[0][0][1][1] = 0;
    T[0][1][0][0] = 0; T[0][1][0][1] = 0; T[0][1][1][0] = G; T[0][1][1][1] = 0;
    T[1][0][0][0] = 0; T[1][0][0][1] = G; T[1][0][1][0] = 0; T[1][0][1][1] = 0;
    T[1][1][0][0] = 0; T[1][1][0][1] = G; T[1][1][1][0] = G; T[1][1][1][1] = 0;
    T[2][0][2][0] = 0; T[2][0][2][1] = 0; T[2][1][2][0] = 0; T[2][1][2][1] = 0;
    T[0][2][0][2] = 0; T[0][2][1][2] = 0; T[1][2][0][2] = 0; T[1][2][1][2] = 0;
    T[2][2][2][2] = 0;
}

getseq(seq,maxl,fp)
char *seq;
int maxl;
FILE *fp;
{
    char buf[256];
    int n,j,l;

    n=0;
    fgets(buf,256,fp);
    while (fgets(buf,256,fp) && buf[0]!='>') {
        l=strlen(buf);
        for (j=0;j<...>= 'A' && buf[j] <...> maxl)
                    fatal("Cannot exceed %d characters per sequence.",maxl-1);
            }
    }
    seq[n]='\0';
    if (buf[0]=='>') fseek(fp,-strlen(buf),1);
    return(n);
}

/* heap insertion and deletion */
#define INSERT(e,h)\
{ register EDGE **b = (h)->bucket + (e)->dist;\
  if (*b != NULL)\
      (*b)->heap_pred = (e);\
  (e)->heap_succ = *b; (e)->heap_pred = NULL;\
  *b = (e); }

#define DELETE(e,h)\
{ register EDGE **b = (h)->bucket + (e)->dist;\
  if ((e)->heap_pred != NULL)\
      (e)->heap_pred->heap_succ = (e)->heap_succ;\
  else\
      *b = (e)->heap_succ;\
  if ((e)->heap_succ != NULL)\
      (e)->heap_succ->heap_pred = (e)->heap_pred; }

HEAP *h;

/* msa : compute multiple sequence alignment */
EDGE *msa ()
{
    static int p [NUMBER+1], q [NUMBER+1], r [NUMBER+1];
    register int d,inc;
    register int ccost=0;
    register VERTEX *v, *w, *t;
    auto VERTEX *s;
    register EDGE *e, *f;
    auto char C [NUMBER+1];
    register int I, J;
    auto int delta0[NUMBER+1],delta1[NUMBER+1];
    register int difference;
    auto int ends[NUMBER];
    auto int endcount, endindex;

    s = source();
    t =
        sink();
    h = heap(Upper);
    presource = create_vertex(NULL);
    e = create_edge(presource,s);
    e->dist = 0;
    e->refer++;                          /* Make sure edge does not get freed */
    e->backtrack = NULL;
    INSERT(e,h);
    if (oflag) {
        fprintf(stderr," 0\n");
        inc=1+Upper/50;
    }
    while ((e=extract()) != NULL && (v=e->head) != t) {
        if (oflag && e->dist>ccost) {
            fprintf(stderr,"*");
            ccost+=inc;
        }
        if (e->dist <...>tail,p);
            safe_coord(e->head,q);
            /*next loop is from cost function difference between p and q*/
            for (I=1;I<...>dist - e->dist;
                if (difference > 0) {
                    /*get coordinates of next vertex into r*/
                    safe_coord(f->head,r);
                    /* for (I=1;I<...>q[I] ? S[I][r[I]] : DASH);
                        delta1[I] = r[I] - q[I];
                    }
                    if(gflag) for (I=1;I<...>=3;I--){
                        if (d >= difference) goto nextedge;
                        for (J=1;J<...>refer++;
                        if(f->backtrack != NULL)
                            if( -- f->backtrack->refer == 0)
                                free_edge(f->backtrack);
                        f->dist = d + e->dist;
                        f->backtrack = e;
                        INSERT(f,h);
                    }
                }
                nextedge: continue;
            }
        }
        if (e->refer==0) free_edge(e);
    }
    if (oflag) fprintf(stderr,"\n");
    return e;
}

/* display : ... */
void display (e)
register EDGE *e;
{
    static int p [NUMBER+1], q [NUMBER+1], r [NUMBER+1];
    auto int d, I, J;
    register EDGE *f;

    /* shortest sink to source path in lattice within bound? */
    if (e==NULL || (d=e->dist)>Upper)
        fatal("Multiple alignment within bound does not exist.");

    /* recover shortest path to source tracing backward from sink */
    for (e->next=NULL;e->tail!=presource;e=f) {
        f=e->backtrack;
        coord(f->tail,p);
        safe_coord(f->head,q);
        safe_coord(e->head,r);
        f->next = e;
        project(p,q,r);
    }

    /* display alignment corresponding to path */
    printf("\n Optimal Multiple Alignment\n\n");
    for (e=e->next; e!=NULL; e=e->next) {
        coord(e->tail,p);
        safe_coord(e->head,q);
        column(p,q);
    }
    output();

Bibliography

[1] Altschul SF, Carroll RJ, Lipman DJ, Weights for Data Related by a Tree, J. Molec. Biol. 207, 1989.

[2] Barton, G. J., Protein Multiple Sequence Alignment and Flexible Pattern Matching, Methods Enzymol 183, 403-428, 1990.

[3] Cedric Notredame and Desmond G. Higgins, SAGA: Sequence Alignment by Genetic Algorithm, Nucleic Acids Research, 1996, Vol. 24, No. 8, 1515-1524.

[4] Durbin Richard, Eddy Sean R, Anders Krogh, Graeme Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998.

[5] Feng, D. F. and Doolittle, R. F., Progressive Sequence Alignment As A Prerequisite To Correct Phylogenetic Trees, Journal of Molecular Evolution, 1987, Vol. 25, 351-360.

[6] Gupta, S. K., Kececioglu J. D. and Schaffer, Improving the Practical Space and Time Efficiency of the Shortest-Paths Approach to Sum-of-Pairs Multiple Sequence Alignment, Journal of Computational Biology, 1995, Vol. 2, 459-472.

[7] Hirosawa M, Totoki Y, Hoshida M, Ishikawa M., Comprehensive Study on Iterative Algorithms of Multiple Sequence Alignment, Comput Appl Biosci. 1995 Feb;11(1):13-8.

[8] Humberto Carrillo, David Lipman, The Multiple Sequence Alignment Problem in Biology, SIAM Journal on Applied Mathematics, 1988, Vol. 48, 1073-1082.

[9] Lipman DJ, Altschul SF, John D. Kececioglu, A Tool for Multiple Sequence Alignment, Proceedings of the National Academy of Sciences of the United States of America, 1989, Vol. 86, 4412-4415.

[10] Osamu Gotoh, Significant Improvement in Accuracy of Multiple Protein Sequence Alignments by Iterative Refinement as Assessed by Reference to Structural Alignments, Journal of Molecular Biology, 1996, Vol. 264, 823-838.

[11] Saitou, N. and Nei, M., The Neighbor-Joining Method: A New Method for Reconstructing Phylogenetic Trees, Molecular Biology and Evolution, 1987, Vol. 4.

[12] S. B. Needleman and C. D. Wunsch, A General Method Applicable to the Search For Similarities in the Amino Acid Sequences of Two Proteins, Journal of Molecular Biology, 1970, Vol. 48, 443-453.

[13] Thompson Julie D, Frederic Plewniak and Olivier Poch, A Comprehensive Comparison of Multiple Sequence Alignment Programs, Nucleic Acids Research, 1999, Vol. 27, 2682-2690.

[14] Thompson Julie D, Frederic Plewniak and Olivier Poch, BAliBASE: A Benchmark Alignment Database for the Evaluation of Multiple Sequence Alignment Programs, Bioinformatics, 1999, Vol. 15, 87-88.

[15] Thompson Julie D, Desmond G. Higgins and Toby J. Gibson, CLUSTAL W: Improving the Sensitivity of the Progressive Multiple Sequence Alignment Through Sequence Weighting, Position-specific Gap Penalties and Weight Matrix Choice, Nucleic Acids Research, 1994, Vol. 22, No. 22, 4673-4680.

[16] Wee Kim Chuan, A Sequential Optimization Method To Multiple Sequence Alignments, Thesis for the Degree of Master of Science, National University of Singapore, 2001.

[...]

HEAGAWGHE-E
--P-AW-HEAE
Figure 1.2: A sample of global sequence alignment.

AWGHE
AW-HE
Figure 1.3: A sample of a local alignment of the same sequences as above.

Suppose there are two sequences X and Y to be aligned, where |X| = m and |Y| = n. If gaps are allowed to be placed in any position of the alignment, then the maximum potential length of the alignment is m + n. It means that there are $2^{m+n}$ subsequences [...]

[...] order to perform fast and efficient multiple sequence alignment analysis, many methods and algorithms, such as dynamic programming, progressive alignment and iterative alignment, have been developed. This thesis introduces a "Sequential Iterative Refinement Optimization Algorithm" (SIROA) approach. The basic procedure of SIROA is a heuristic progressive approach; however, we suggest using an iterative [...]

[...] a table only once in order to avoid recomputing it in later steps. When the DP algorithm is used in sequence alignment, it assumes that each alignment up to a certain "prefixed" point in a globally optimal alignment must itself be an optimal alignment. Therefore, a dynamic programming matrix is computed in the DP algorithm. The optimal alignment score for any particular point in the matrix corresponds to [...]
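The dynamic programming matrix described in the excerpt above can be illustrated with a minimal sketch of a global (Needleman-Wunsch style) score computation. The scoring values (`MATCH`, `MISMATCH`, `GAP`), the length cap `MAXLEN` and the function name `nw_score` are assumptions made for this illustration; they are not taken from the thesis, which uses substitution matrices such as the Dayhoff table in the appendix.

```c
#include <assert.h>
#include <string.h>

#define MAXLEN 256   /* hypothetical cap on sequence length for this sketch */

/* Hypothetical scoring scheme, not the thesis's: linear gap cost. */
enum { MATCH = 1, MISMATCH = -1, GAP = -2 };

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Fill the DP matrix F, where F[i][j] is the best score of aligning the
   prefixes x[0..i) and y[0..j), and return the global score F[m][n]. */
int nw_score(const char *x, const char *y)
{
    static int F[MAXLEN + 1][MAXLEN + 1];
    size_t m = strlen(x), n = strlen(y);
    size_t i, j;

    assert(m <= MAXLEN && n <= MAXLEN);

    /* Aligning a prefix against nothing costs one gap per residue. */
    for (i = 0; i <= m; i++) F[i][0] = (int)i * GAP;
    for (j = 0; j <= n; j++) F[0][j] = (int)j * GAP;

    /* Each cell extends the best of three optimal sub-alignments. */
    for (i = 1; i <= m; i++)
        for (j = 1; j <= n; j++) {
            int s = (x[i - 1] == y[j - 1]) ? MATCH : MISMATCH;
            F[i][j] = max3(F[i - 1][j - 1] + s,   /* align x[i-1] with y[j-1] */
                           F[i - 1][j] + GAP,     /* gap in y */
                           F[i][j - 1] + GAP);    /* gap in x */
        }
    return F[m][n];
}
```

Each cell is computed once from its three neighbours, which is the "compute a table only once" property the excerpt refers to. Recovering the alignment itself would additionally require storing or re-deriving the choice made in each cell.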
[...] into some blocks before the progressive alignment. This is based on the additive properties of the alignment scores and on the assumption that alignments of remote subsequence parts are independent. Numerical multiple sequence alignment results with reference to the BAliBASE database have been obtained to evaluate the SIROA method and to compare it with other approaches.

Key Words: Multiple Sequence Alignment, Iterative Algorithm, Score

[...] Sequence Alignment Method

Firstly, we consider "pairwise sequence alignment", which means we only need to align two sequences. Generally, there are two kinds of sequence alignment, global and local. Global alignment optimizes the alignment over the full length of the sequences. It is more appropriate for comparing sequences that are expected to share similarity over their entire length. As for local [...]

[...] (original corner) of a cube to the other corner (end corner). Dynamic programming methods are central to computational sequence analysis. The methods I will introduce in this thesis make use of the dynamic programming algorithm.

Figure 1.4: The optimal path of an alignment of 3 sequences and the actual optimal alignment (right).

1.2.2 Global Alignment and N-W Algorithm

The dynamic [...]

[...] local alignments are constructed by starting at the highest-scoring positions in the scoring matrix and tracing back, following the pointers stored during the matrix-filling steps. The alignment stops at a box that scores zero. Note that, as with global alignment, the local alignment produced by the S-W algorithm is not unique if there are equal derivations at one or more points.

1.3 Multiple Sequence Alignment

[...]
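The local-alignment traceback described in the S-W excerpt above (start at the highest-scoring cell, stop at a cell that scores zero) can be sketched as follows. The scoring values, `MAXLEN` and the function name `sw_local` are assumptions for illustration only. Instead of stored pointers, this sketch re-derives the move at each cell from the scores, which is equivalent for a linear gap cost.

```c
#include <assert.h>
#include <string.h>

#define MAXLEN 256   /* hypothetical cap on sequence length for this sketch */

/* Hypothetical scoring scheme (not taken from the thesis). */
enum { MATCH = 2, MISMATCH = -1, GAP = -2 };

static int H[MAXLEN + 1][MAXLEN + 1];   /* local-alignment score matrix */

static int sub(char a, char b) { return a == b ? MATCH : MISMATCH; }

/* Returns the best local score. On return the aligned segments are the
   half-open ranges x[*xs..*xe) and y[*ys..*ye). */
int sw_local(const char *x, const char *y,
             size_t *xs, size_t *xe, size_t *ys, size_t *ye)
{
    size_t m = strlen(x), n = strlen(y), i, j, bi = 0, bj = 0;
    int best = 0;

    assert(m <= MAXLEN && n <= MAXLEN);

    for (i = 0; i <= m; i++) H[i][0] = 0;
    for (j = 0; j <= n; j++) H[0][j] = 0;

    /* Matrix filling: scores are clamped at zero so an alignment may start anywhere. */
    for (i = 1; i <= m; i++)
        for (j = 1; j <= n; j++) {
            int h = H[i - 1][j - 1] + sub(x[i - 1], y[j - 1]);
            if (H[i - 1][j] + GAP > h) h = H[i - 1][j] + GAP;
            if (H[i][j - 1] + GAP > h) h = H[i][j - 1] + GAP;
            if (h < 0) h = 0;
            H[i][j] = h;
            if (h > best) { best = h; bi = i; bj = j; }   /* first maximum wins ties */
        }

    /* Traceback from the highest-scoring cell until a cell that scores zero. */
    *xe = bi; *ye = bj;
    i = bi; j = bj;
    while (H[i][j] > 0) {
        if (H[i][j] == H[i - 1][j - 1] + sub(x[i - 1], y[j - 1])) { i--; j--; }
        else if (H[i][j] == H[i - 1][j] + GAP) i--;
        else j--;
    }
    *xs = i; *ys = j;
    return best;
}
```

With these assumed scores, several local alignments of the example sequences HEAGAWGHEE and PAWHEAE tie for the best score, which is the non-uniqueness the excerpt notes; this sketch simply reports the first maximum it encounters in row order.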
[...] the alignment sequences. Thus, if we do not modify the DP algorithm, it can be slow and memory intensive for long sequences.

1.3.1 Carrillo-Lipman Algorithm and MSA Program

In 1988, Carrillo and Lipman introduced a method, used in the Multiple Sequence Alignment (MSA) program, to reduce the number of cells to be examined in the dynamic programming algorithm. The MSA imposes a pairwise alignment for each pair [...]

[...] optimal alignment that has been computed up to that point. The DP algorithm aligns two sequences starting from their ends and uses a scoring scheme for matches, mismatches and gaps. The alignment corresponding to the path with the highest score is the optimal one. The dynamic programming approach is guaranteed to provide the optimal alignment. Figure 1.4 shows the alignment of 3 sequences in terms of a path traversing from a [...]

[...] additional heuristics including sequence weighting, position-specific gap penalties and weight matrix choice.

2.1.1 The Basic Algorithm of the CLUSTALW

The CLUSTALW program first performs a pairwise alignment of all sequences and calculates a similarity matrix that represents the similarity of each pair of sequences. The program then uses the alignment score matrix to produce a guide tree. Finally, the sequences [...]
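The first stage described above, scoring every pair of sequences to obtain a symmetric matrix from which a guide tree is later built, can be sketched as below. This is an illustration under stated assumptions, not CLUSTALW's actual scoring: the simple match/mismatch/gap values, the limits `MAXLEN` and `MAXSEQ`, and the names `pair_score` and `score_matrix` are all hypothetical. The sketch also reports the highest-scoring pair, which is the kind of pair a tree-building step would join first.

```c
#include <assert.h>
#include <string.h>

#define MAXLEN 256   /* hypothetical cap on sequence length */
#define MAXSEQ 16    /* hypothetical cap on number of sequences */

/* Hypothetical scoring scheme, not CLUSTALW's. */
enum { MATCH = 1, MISMATCH = -1, GAP = -2 };

/* Global alignment score of x and y, keeping only two rows of the DP matrix. */
static int pair_score(const char *x, const char *y)
{
    int prev[MAXLEN + 1], curr[MAXLEN + 1];
    size_t m = strlen(x), n = strlen(y), i, j;

    assert(n <= MAXLEN);
    for (j = 0; j <= n; j++) prev[j] = (int)j * GAP;
    for (i = 1; i <= m; i++) {
        curr[0] = (int)i * GAP;
        for (j = 1; j <= n; j++) {
            int best = prev[j - 1] + (x[i - 1] == y[j - 1] ? MATCH : MISMATCH);
            if (prev[j] + GAP > best) best = prev[j] + GAP;
            if (curr[j - 1] + GAP > best) best = curr[j - 1] + GAP;
            curr[j] = best;
        }
        memcpy(prev, curr, (n + 1) * sizeof prev[0]);
    }
    return prev[n];
}

/* Fill the symmetric k-by-k score matrix S and report the most similar
   pair of distinct sequences in (*best_a, *best_b), with *best_a < *best_b. */
void score_matrix(const char *seqs[], int k, int S[MAXSEQ][MAXSEQ],
                  int *best_a, int *best_b)
{
    int a, b;

    assert(k >= 2 && k <= MAXSEQ);
    *best_a = 0;
    *best_b = 1;
    for (a = 0; a < k; a++) {
        S[a][a] = pair_score(seqs[a], seqs[a]);
        for (b = a + 1; b < k; b++) {
            S[a][b] = S[b][a] = pair_score(seqs[a], seqs[b]);
            if (S[a][b] > S[*best_a][*best_b]) { *best_a = a; *best_b = b; }
        }
    }
}
```

A caller would pass an array of sequence strings and a `MAXSEQ` by `MAXSEQ` integer matrix; the filled matrix plays the role of the similarity matrix from which a guide tree could then be constructed.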