(Master's thesis) Parallel indexing of genomic data

Internship context

I completed my internship with the Symbiose team (footnote 1) at the Institute for Research in Computer Science and Random Systems (IRISA). The team's primary focus is bioinformatics, specifically modeling genomic data to help molecular biologists formulate and discover new knowledge. The team is organized around three main areas: linguistic analysis of sequences, analysis and identification of dynamic systems, and architecture and parallelism. Within the last area, two projects are underway: RDisk, a reconfigurable disk cluster, and ReMiX, a reconfigurable memory cluster. Both projects aim to parallelize costly genomic processing in order to speed it up significantly, targeting supercomputers, computing grids, and specialized architectures.

1 http://www.irisa.fr/symbiose

Biological context

DNA sequence databases

In 1965, the first sequence of 100 bases was published, marking a significant milestone in genetic research. By the late 1970s, advances in sequencing techniques allowed sequences to be produced 100 times faster. In response to the growing interest in analyzing these data, the biological community established a database in 1978 to collect, organize, and distribute the sequences and their annotations. This initiative led to the creation of the major genomic databases: the European Molecular Biology Laboratory (EMBL) database in Europe, GenBank in the United States, and the DNA Data Bank of Japan (DDBJ). These databases are continuously updated through international collaboration. Since its inception in 1982, GenBank has grown exponentially, its size doubling approximately every 15 to 16 months.

Databases can be organized in several structures: text, XML, relational data, and so on. We take the FASTA format as an example. In this structure, DNA and amino acid sequences are treated as unstructured data. They are stored in large text files, which can be either personal files kept in individual workspaces or public files from the various sequence banks. A sequence in FASTA format consists of two components: a definition line and the DNA sequence itself.

2 http://www.ncbi.nlm.nih.gov/Genbank

– the definition line starts with the character ">", which allows several sequences to be stored in a single file. The ">" character is followed by a unique identifier (ID) and a brief description. The description is free-form but must not contain any line-ending characters;

– the DNA sequence immediately follows the definition line. The sequence is given as lines of at most 80 characters.
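The two-part record structure described above can be read with a few lines of Python. This is a minimal sketch, not the parser used in the thesis: the function names `read_fasta` and `_record` and the toy records are illustrative assumptions, and the identifier is assumed to be separated from the description by the first space.

```python
import io
from typing import Iterator, TextIO


def _record(header: str, chunks: list[str]) -> tuple[str, str, str]:
    # Assumption: the identifier ends at the first space of the definition line.
    ident, _, description = header.partition(" ")
    return ident, description, "".join(chunks)


def read_fasta(handle: TextIO) -> Iterator[tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record."""
    header = None
    chunks: list[str] = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            # Sequence lines (at most 80 characters each) are concatenated.
            chunks.append(line.strip())
    if header is not None:
        yield _record(header, chunks)


# Toy input (hypothetical records, for illustration only).
example = io.StringIO(
    ">seq1 first toy sequence\nTGCCTGCATG\nTATACC\n"
    ">seq2 second toy sequence\nCTGAAC\n"
)
records = list(read_fasta(example))
```

Reading the file as a stream keeps memory use independent of the bank size, which matters for banks of several gigabytes.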

Similarity search

When investigating the function of an unknown gene or protein, researchers look for similarities between its amino acid or DNA base sequence and those of known, well-studied molecules. Similarity is a measure of resemblance between two protein or nucleic acid sequences, determined by the percentage of identity between the compared sequences. Similarity search consists of identifying similar regions between a query sequence and a target sequence.

Alignment search heuristics

Similarity search relies on sequence alignment. The types of alignment are:

– local alignment: used to detect local regions of high similarity between two sequences of different lengths;

– global alignment: used to measure the overall similarity between two sequences by aligning them over their entire length. During alignment, gaps can be introduced, represented by a "-" to indicate the absence of a letter in one of the sequences. Biologically, these gaps correspond to insertions or deletions, mutation events that either add a new segment to a gene or remove part of it.

Searching for similar sequences involves content-based retrieval within large data sets, requiring a thorough examination of all the data in the given collection.

There are several methods for finding alignments. The early algorithms, such as the Smith-Waterman algorithm introduced in 1981, use dynamic programming and have quadratic complexity. In 1990, BLAST, which is based on heuristics, emerged as a significant advance in alignment search.
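To make the quadratic cost of dynamic programming concrete, here is a minimal sketch of the Smith-Waterman local alignment score. The scoring values (match +2, mismatch -1, linear gap penalty -2) are illustrative assumptions and do not come from the thesis; only the score is computed, not the alignment itself.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)          # previous row of the DP matrix
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

The two nested loops visit every pair of positions, so the running time is proportional to the product of the two sequence lengths. This is what makes an exhaustive scan of a large bank expensive and motivates heuristics such as BLAST.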

BLAST compares nucleotide sequences of a few hundred to several thousand bases in a reasonable time. It has established itself as the standard tool for exploring genomic databases. However, BLAST becomes slow for systematic searches in very large databases ($10^9$ to $10^{10}$ residues). There are two problems: computational capacity and data access time. There are two ways to improve performance:

– a parallel implementation of BLAST on a cluster of workstations: TurboBLAST [22], mpiBLAST [7] and parallelBLAST [23];

– the addition of hardware accelerators: specialized processing units (in ASIC or reconfigurable technology [RDisk]) distributed on expansion cards.

Search by parallel indexing

We can improve the performance of genomic similarity search by indexing the data. The index lets us go directly to the relevant information. The goal is to search within a smaller subset of the database, which makes the search more efficient.

Internship objectives

The index contains the positions of every occurrence of each word of length W, which makes it straightforward to search for any word of that length. A simple lookup in the index table locates all the sequences in the original text that contain the word. Note, however, that the index table is significantly larger than the original text from which it is built.

The objective of my internship was to study the parallelization of index-based search. More precisely, the aims were:

– to build a model of similarity search by indexing;

– to propose a scheduling algorithm (to be determined) that distributes tasks so as to optimize execution time;

– to study the duplication rate of part of the data;

– to validate the model by implementation on a PC cluster.

In the rest of this report, Section 2 examines similarity search through indexing in detail. Section 3 introduces a model of similarity search by parallel indexing. Section 4 presents the task scheduling algorithm. Finally, in Section 5, we validate the model through implementation on a 32-node cluster.

Indexing for similarity search

Indexing is a technique that speeds up data retrieval and has been applied successfully in many areas of computing. Numerous indexing mechanisms have been developed, each tailored to specific types of data and search requirements. While text indexing is well advanced, some data types remain hard to index, in particular biological sequence data, for which precise indexing criteria are lacking. The main difficulty is that searches do not look for exact matches of biological sequences or subsequences but rely on approximate matching.

Principle of indexing

Example of search via indexing

Consider a lexicographic search in a library of fifteen unorganized books, in which we want to identify all titles that contain the term "plateforme" (platform). Variants of the word, such as "plate-forme", and typographical errors may exist. We have three options for this search:

– either read the books one by one, noting those that contain words starting with "plate", then check that each match is an acceptable form of the word;

– or generate all possible (but acceptable) variants of the word and search for them one by one in the books;

– or build a table that associates the first five letters of every word in the library with the number of the book in which it is found.

The first solution requires reading all the books for every query; this is the BLAST approach. The second solution may generate a large number of words, and the library must be searched for each of them; the longer the search word, the more words are generated. This method is known as query preprocessing. The third solution is indexing.

When an index is built, each book is read only once. After this initial reading, any searched word can be located quickly. The index table consists of a column of keys (or entries) and a column that lists the occurrences of these keys.

Indexing genomic databases

Indexing for similarity search in genomic databases differs significantly from traditional text search. Unlike natural language text, genomic sequences have no spaces between words, so distinct terms cannot be identified. In this context, a "word" can be any substring of nucleotides from the database. Indexing therefore uses all the substrings of a given length W, as illustrated in the accompanying figure.
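The idea of indexing every substring of length W can be sketched as follows. This is an illustrative toy, not the index format of the thesis: the function name `build_word_index` and the two short sequences are assumptions, and positions are 0-based.

```python
from collections import defaultdict


def build_word_index(sequences, w):
    """Map each word of length w to the list of (sequence number, position)."""
    index = defaultdict(list)
    for seq_no, seq in enumerate(sequences, start=1):
        # Overlapping words: one word starts at every position of the sequence.
        for i in range(len(seq) - w + 1):
            index[seq[i:i + w]].append((seq_no, i))
    return index


# Toy bank of two sequences (hypothetical data).
index = build_word_index(["tctagc", "agct"], w=2)
```

A lookup such as `index["ag"]` then returns every place where the word occurs without rereading the sequences. With a four-letter alphabet there are at most 4^W distinct keys.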

Previous work

Data structures

– Suffix trees: a data structure that reflects the internal characteristics of the sequences. In a suffix tree, each branch connecting internal nodes represents a unit of the indexed text, and each leaf gives the starting position of the word spelled out along the path from the root to the parent of that leaf. This structure makes it possible to search for any subsequence efficiently. A compression algorithm is used to save space.

Fig 2.1 – Suffix tree for the text "tctagc"

– Suffix arrays [17]: arrays that contain the indices of all the suffixes of the text, sorted in lexicographic (alphabetical) order. Each suffix is a string that starts at a given position in the text and ends at the end of the text. A suffix array is in effect an array version of the suffix tree that takes less memory (fig 2.2).

Fig 2.2 – Suffix array for the text "tctagc"
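As an illustration of the suffix array described above, the following sketch builds the array for the text "tctagc" and uses it to locate a pattern. The construction by direct sorting is a deliberately naive assumption chosen for brevity (it is quadratic in the worst case), and the function names are hypothetical.

```python
def build_suffix_array(text):
    """Indices of all suffixes of text, sorted lexicographically (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Start positions of pattern in text, found by binary search."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        # Compare the pattern with the same-length prefix of the suffix.
        if text[start:start + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # All suffixes that begin with the pattern are adjacent in the array.
    while lo < len(suffix_array) and text.startswith(pattern, suffix_array[lo]):
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)


text = "tctagc"
sa = build_suffix_array(text)   # suffixes in order: agc, c, ctagc, gc, tagc, tctagc
```

Because the array stores only integers, it takes less memory than the corresponding tree while still supporting a logarithmic-time search for the first match.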

– q-grams: an indexing scheme that records all subsequences of length q, known as q-grams, along with their positions in the text. It allows efficient search for subsequences of the fixed length q.

Search

Gonzalo Navarro discusses three search methods in [11]:

– neighborhood generation: all the sequences derived from the query that lie within an edit distance k are generated, and each of them is then searched for in the index;

– partitioning into exact matches: substrings of the query are selected, each substring is searched for in the index, and the surrounding text areas are then compared;

– intermediate partitioning: substrings are extracted; then, for each of them, the close neighborhoods are searched for in the index.

The first method takes longer when the query string is long, because it generates larger neighborhoods. Rigoutsos and Califano used the second method in their work, which applies graph-based and non-matching indexing techniques to genomic databases.

The indexes, or lookup tables, are highly redundant and based on a probabilistic model. For every interval of length W, the FLASH search structure stores in a hash table all possible contiguous and non-contiguous subsequences of length m that begin with the first base of the interval, where m is less than W. As a result, the index is very large; the authors mention an index of 2.8 GB for a database.

Indexing methods for similarity search

Index table format

The index table we have chosen stores, in addition to the numbers of the sequences that contain an occurrence of the reference word in the database, the V letters preceding the word and the V letters following it. We refer to these as the left neighborhood (VG) and the right neighborhood (VD).

This additional information makes it possible to select the best results during query processing: only the entries whose left and right neighborhoods are most similar to those of the query are kept, without consulting the database itself. The selection uses a scoring function that assesses neighborhood similarity, assigns a score, and returns only the candidates whose score exceeds a given threshold. The index table has $4^W$ entries, one per possible word of length W.

Table 1: General structure of the index table

Key      Occurrences
word0    VG0.0 VD0.0 NbSeq0.0 | VG0.1 VD0.1 NbSeq0.1 | VG0.2 VD0.2 NbSeq0.2
word1    VG1.0 VD1.0 NbSeq1.0 | VG1.1 VD1.1 NbSeq1.1
word2    VG2.0 VD2.0 NbSeq2.0
...
wordN    ... | VGN.1 VDN.1 NbSeqN.1

Reference method

The index, called the bank index, is built from a FASTA-format bank F. Let S be a sequence of F, m_i the word of length W that starts at index i in S, and V the length of the left and right neighborhoods. The words m_i are not taken at every position: one word is taken every J characters. For every sequence S of F and every selected word m_i of S with V ≤ i ≤ |S| - (V + W), the following are added to the entry for m_i in the index:

– the number of the sequence S in F;

– the word VG of length V read at index i-V;

– the word VD of length V read at index i+W.

For example, suppose the bank F contains the following two sequences, with word length W=2, neighborhood length V=3 and step J=2:

Seq No 1: TGCCTGCATGTATACCTGCTCA
Seq No 2: CTGAACACATGCAGTGCCTAAGAA

Table 2: Contents of the index table (columns: key word, occurrences)

During query processing, every word of length W of the query, overlapping words included, is looked up in the index table. If at least one occurrence of the word is found, the neighboring letters are compared and a score is computed from the number of matching bases. If the score meets or exceeds a predetermined threshold, the number of the sequence containing the occurrence is added to the result list. Because this method indexes only one word every J characters, the index is considerably smaller, which matters given how large the index is relative to the database. The neighborhood information stored with each occurrence (left and right neighborhoods) enlarges the table in memory, and reducing the number of indexed words offsets this. The price of this reduction is a less sensitive search.
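The reference method can be sketched on the two example sequences given above (W=2, V=3, J=2). This is a minimal illustration under stated assumptions: indices are 0-based, sampling starts at i=V, the right neighborhood is read immediately after the word, and the names `build_context_index`, `matching_bases` and `filter_query`, as well as the toy query, are hypothetical.

```python
from collections import defaultdict


def build_context_index(sequences, w, v, j):
    """Index one word every j characters with its left/right neighborhoods."""
    index = defaultdict(list)
    for seq_no, s in enumerate(sequences, start=1):
        # V <= i <= |S| - (V + W), stepping by J (assumed to start at i = V).
        for i in range(v, len(s) - (v + w) + 1, j):
            left, right = s[i - v:i], s[i + w:i + w + v]
            index[s[i:i + w]].append((left, right, seq_no))
    return index


def matching_bases(a, b):
    """Number of positions at which the two strings carry the same base."""
    return sum(x == y for x, y in zip(a, b))


def filter_query(query, index, w, v, threshold):
    """Sequence numbers whose neighborhoods score at least the threshold."""
    hits = set()
    for i in range(v, len(query) - (v + w) + 1):      # overlapping query words
        q_left, q_right = query[i - v:i], query[i + w:i + w + v]
        for left, right, seq_no in index.get(query[i:i + w], ()):
            score = matching_bases(q_left, left) + matching_bases(q_right, right)
            if score >= threshold:
                hits.add(seq_no)
    return hits


bank = ["TGCCTGCATGTATACCTGCTCA", "CTGAACACATGCAGTGCCTAAGAA"]
idx = build_context_index(bank, w=2, v=3, j=2)
```

Under these assumptions the entry for the word CA holds three occurrences, all from sequence 2, with neighborhood pairs (GAA, CAT), (ACA, TGC) and (ATG, GTG). The database itself is never consulted during filtering: the stored neighborhoods are enough to score a candidate.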

Removing low-complexity regions

The previous method uses no criterion for selecting reference words. To improve sensitivity, we can select reference words with a complexity criterion that keeps only the most informative words. This avoids low-complexity regions such as AAAAAAACAAAAAAAAATAAAAAA. For a sequence S of F, index construction then proceeds as follows:

– compute the complexity of each position j of the sequence: $X_j = (n_A + 1)(n_C + 1)(n_G + 1)(n_T + 1)$, where $n_A$, $n_C$, $n_G$ and $n_T$ are the numbers of bases A, C, G and T in the window $[j-x, j+x]$, with $x < j < |S| - x$;

– keep a word m_i only if its complexity exceeds a predetermined threshold, with V ≤ i ≤ |S| - (V + W).

To the entry corresponding to m_i, we add:

– the word VG of length V read at index i-V;

– the word VD of length V read at index i+W.

For example, take the following sequence with word length W=2, neighborhood length V=3, window [i-2, i+2] and threshold 11.

In this example, line b shows the characters of the original sequence (line a) whose complexity exceeds the threshold of 11. The words inserted into the index table are extracted from line b.
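The complexity criterion can be tried out with a short sketch. The window half-width x=2 and the threshold 11 follow the example above; the test sequence is a made-up illustration, positions are 0-based, and the sketch evaluates every position whose window fits entirely inside the sequence, which is an assumption about the boundary handling.

```python
from collections import Counter


def complexity(seq, j, x):
    """X_j = (nA+1)(nC+1)(nG+1)(nT+1) over the window [j-x, j+x]."""
    counts = Counter(seq[j - x:j + x + 1])
    result = 1
    for base in "ACGT":
        result *= counts[base] + 1
    return result


def high_complexity_positions(seq, x, threshold):
    """Positions whose window complexity exceeds the threshold."""
    return [j for j in range(x, len(seq) - x)
            if complexity(seq, j, x) > threshold]


# Hypothetical sequence: a poly-A run followed by a mixed region.
positions = high_complexity_positions("AAAAAAACGTAC", x=2, threshold=11)
```

With a five-letter window, a run of a single base scores 6 and a window with four identical bases scores 10, so a threshold of 11 rejects both, while any window with at least two bases of a second kind passes.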

During query processing, the same criterion must be applied as during indexing. If there is a significant match between the query sequence and a target, selecting the same seeds in both phases guarantees that the index table contains at least one entry corresponding to the matching region.

Discussion

The latter method is more sensitive than the reference method, because the indexed and searched words are not selected arbitrarily. This targeted selection makes the search more effective.

Conclusion

This approach not only avoids the low-complexity regions of the bank but also speeds up query processing considerably: fewer words indexed during the indexing stage means fewer words to look up in the index.

We consider genomic sequence banks of 100 Gbp. The sequences are cut into small segments, the reference words, and occurrences of the same word of size W are grouped together. We call such a group a cluster; cluster sizes vary considerably. The indexed bank contains C = $4^W$ clusters. In the end we have two types of banks: an indexed bank and a bank in FASTA format.

Parallel indexing for similarity search

Principle

The filtering task

The query sequence is a DNA sequence for which we seek similar sequences. We split the query sequence into N query words, each of length W. For each query word, we go directly to the corresponding reference word in the indexed bank. We then compare the neighborhoods of the query word with those of the occurrences stored in the cluster to assess their similarity. This comparison of neighborhoods for one query word is what we call a filtering task. For example, consider the query word CA, with its left and right neighborhoods in the query sequence, and the index table of the earlier example.

Task = compare with cluster CA

To process the query word CA, we first access the cluster CA in the indexed bank. We then compare the left (CCT) and right (CAT) neighborhoods of the query word CA with the three pairs of occurrence neighborhoods ({GAA, CAT}, {ACA, TGC}, {ATG, GTG}) to assess their similarity. A candidate is selected only if its similarity score exceeds a given threshold.
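The filtering task for the query word CA can be worked through numerically. The sketch below assumes the score is the number of identical bases at matching positions of the two neighborhoods, and it treats a score equal to the threshold as passing; the threshold value 3 and the sequence number 2 attached to each occurrence are illustrative assumptions.

```python
def neighborhood_score(q_left, q_right, o_left, o_right):
    """Number of identical bases between query and occurrence neighborhoods."""
    return sum(a == b for a, b in zip(q_left + q_right, o_left + o_right))


def filtering_task(q_left, q_right, cluster, threshold):
    """Sequence numbers of the occurrences that reach the threshold."""
    return [seq_no for o_left, o_right, seq_no in cluster
            if neighborhood_score(q_left, q_right, o_left, o_right) >= threshold]


# Cluster CA: (left neighborhood, right neighborhood, sequence number).
cluster_ca = [("GAA", "CAT", 2), ("ACA", "TGC", 2), ("ATG", "GTG", 2)]
scores = [neighborhood_score("CCT", "CAT", left, right)
          for left, right, _ in cluster_ca]
selected = filtering_task("CCT", "CAT", cluster_ca, threshold=3)
```

Only the first occurrence shares its whole right neighborhood with the query, so it is the only candidate kept. The cost of a task grows with the number of occurrences in the cluster, which is why unequal cluster sizes lead to unequal task times.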

Search

We have two banks: a sequence bank in FASTA format and an indexed bank. The query sequence is split into N query words, and each word gives rise to one filtering task. A query is executed in four steps:

– first, for a given task i, we go directly to the cluster associated with query word i and compare the neighborhoods of the query word with those stored in the cluster. The result of task i is a set of sequence numbers;

– second, the results of the N tasks are merged to obtain the overall set of sequence numbers, since the same sequence number may appear two, three or more times in the results;

– third, from this set of sequence numbers, a FASTA bank is built by extracting the sequences directly from the FASTA bank;

– finally, similarities are searched for using BLAST with two inputs: the query and the bank that has just been built.

Modeling execution time

The execution time of a filtering task and the extraction time of a sequence are given by the following two formulas:

where:

Tacces: data access time;

Tf: data transfer rate (MB/s);

Texec: comparison throughput (MB/s).

To speed up the search, similarity searches are run in parallel on a cluster of machines, with the genomic banks (indexed and in FASTA format) distributed locally across the nodes. Specifically, the N comparison tasks and the construction of the FASTA bank are carried out in parallel.

On a cluster of machines, K nodes execute the N tasks in parallel. On average, each node processes N/K filtering tasks and extracts M/K sequences. The total processing time of a query follows from these quantities.

The execution time of a query is computed from the parameters of the cluster. Within the Symbiose team, three platforms are available: a PC cluster, RDisk, and ReMiX.

Platforms studied

PC cluster

The platform consists of 32 nodes, each a PC with local memory and a hard drive. Communication and synchronization between nodes rely on a communication library, either MPI or RMI.

The data are distributed locally on each node. The query sequence is split into a set of query words that are distributed to the nodes. The query words are processed sequentially and independently on the 32 nodes. The results of the query words are sent to a dedicated node, which merges them to obtain the set of sequence numbers to align. From this set, a bank in FASTA format is built. On this platform, no inter-node communication is needed except to send the query words and to merge the results. The parameters of the PC cluster are as follows:

Fig 3.2 – Configuration of the PC cluster

RDisk

RDisk is a reconfigurable disk cluster that consists of a host computer connected to 48 RDisk cards through an Ethernet port. Each RDisk card carries a hard drive and a filter implemented on an FPGA (Field-Programmable Gate Array), a programmable logic chip. An RDisk card plays the same role as a node of the previous platform. The genomic banks are distributed across the 48 RDisk cards, so that the heuristic step of the neighborhood comparison algorithm can be replaced by a filtering step performed directly at the output of the hard drives. The host splits the query sequence into a set of query words and sends them to each RDisk card. The FPGA filter compares the pairs of neighborhoods, and the sequence numbers are sent back to the host, which merges the results and requests the extraction of the sequences from the cards.

On this platform, the clusters read from the IDE bus are transferred to the FPGA filter. Execution begins as soon as a node receives the address and size of the clusters to read. This approach eliminates the time spent on neighborhood comparison.

ReMiX

ReMiX is a cluster of reconfigurable memories made up of four machines, each connected to two RMEM (Reconfigurable Memory) cards. Each RMEM card carries an FPGA filter and a large FLASH memory. Unlike the RDisk platform, where the banks are distributed across hard drives, ReMiX stores them locally in FLASH memory. A server splits the query sequence into a set of query words, which are sent to independent servers that work in parallel. The partial results are gathered on a dedicated server for the overall processing. Each RMEM card has 64 gigabytes of memory, for a total of 512 gigabytes across the eight cards.

For ReMiX, the clusters read from the FLASH memory are transferred to the FPGA filter, where they are processed in the same manner as on the RDisk platform.

Conclusion

In reality, cluster sizes are highly unequal, which creates challenges in balancing the load of nodes according to the nature of the requested sequence.

The number of tasks assigned to each node can therefore vary considerably. To address this, we propose duplicating part of the data locally on each node. Duplication makes it possible to use a scheduling algorithm that optimizes execution time, with each node accessing only a subset of the clusters, even when cluster sizes differ greatly.

Table 3: average, maximum and minimum number of occurrences in a cluster

Table 4: the ten largest and ten smallest numbers of occurrences for W=9 (a) and W (b)

1 The size of the FASTA bank is 9 gigabytes

The goal of scheduling is to distribute N tasks over K nodes while keeping the load balanced. Specifically, the tasks and the extraction of the FASTA sequences are assigned to the K nodes so that the total execution times of the nodes, including the time taken to extract the FASTA sequences, are as balanced as possible.

Scheduling principle

Distributing the genomic data

The genomic data are distributed according to their structure: the indexed bank contains $4^W$ clusters and the FASTA-format bank contains n sequences. The clusters, and the sequences, can be taken in their natural order. For example, with W=2 there are $4^2 = 16$ clusters, named cluster0, cluster1, ..., cluster15. We propose two ways of distributing the clusters. The first partitions the clusters into K blocks of contiguous clusters (K = the number of processors), as shown in figure 4.1a. The second distributes them non-contiguously, according to the cluster number modulo K, as shown in figure 4.1b. In our experiments, cluster sizes are very unequal. The second method is therefore the better one, because it distributes the clusters almost randomly.
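The two distribution schemes can be written down directly for the 16 clusters and 4 processors of the example. The function names are illustrative; clusters are identified by their number, and the contiguous scheme assumes blocks of equal size rounded up.

```python
def contiguous_distribution(num_clusters, k):
    """Partition cluster numbers into k blocks of consecutive clusters."""
    block = -(-num_clusters // k)        # ceiling division
    return [list(range(p * block, min((p + 1) * block, num_clusters)))
            for p in range(k)]


def modulo_distribution(num_clusters, k):
    """Assign cluster c to processor c mod k (non-contiguous distribution)."""
    return [list(range(p, num_clusters, k)) for p in range(k)]


contiguous = contiguous_distribution(16, 4)
interleaved = modulo_distribution(16, 4)
```

With the modulo scheme the owner of a cluster is obtained by a single arithmetic operation, and neighboring clusters, which may have similar sizes, end up on different processors.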

Fig 4.1 – Distribution of 16 clusters, contiguous and non-contiguous, over 4 processors

Fig 4.2 – Distribution and duplication of 16 contiguous clusters over 4 processors

Duplicating part of the data locally

Once the data have been distributed, tasks can be assigned to processors. The execution time of a task depends on the size of its cluster. The goal of scheduling is to assign the requested tasks to the K processors so that the total execution times are balanced. This can be achieved by duplicating part of the data locally. With duplication, there are several possible processors for each task, whereas without duplication a task can be assigned to only one processor.

Fig 4.3 – Local duplication of part of the data at level 2 (contiguous and non-contiguous distributions)

Scheduling algorithm

The tasks in our problem are unique and independent, and their priority is not random. Graham introduced the LPT (Largest Processing Time) scheduling rule, which orders tasks by decreasing execution time. We consider tasks Tj {j=1, ..., N}, each characterized by an execution time given by the function ex(Tj), processors Pi {i=0, 1, ..., K-1}, and R, the level of data duplication. We have developed two scheduling algorithms.

In the first approach, scheduling a task involves two steps. First, the cluster corresponding to the task is searched for in the first part, L0, of the clusters held by each node. For instance, with R=2 the data of each node is divided into two parts, L0 and L1, and the search is limited to L0. Second, once a matching cluster is found on node i, the task is assigned to the least loaded of the R nodes {i, i-1, ..., i-R+1}.

for each task Tj of the query {j=1, 2, ..., N}
    i = 0
    while no cluster corresponding to Tj has been found and i < K
        search for the cluster in the first part (L0) of the data of node i
        if found
            pro_assign = the least loaded of the R nodes {i, i-1, ..., i-R+1}
            assign task Tj to node pro_assign
        else
            i = i + 1
        end if
    end while
    if no cluster corresponding to Tj has been found
        error assigning task Tj
    end if
end for

The complexity of the first approach is given by:

$O\big(N \log_2 N + N (R + \log_2 C)\big)$, where C is the number of clusters and N the number of tasks.

Fig 4.4 – Scheduling time of the filtering tasks for 48 processors, W=9 (in microseconds). Panel (a): scheduling time as a function of the number of tasks. Panel (b): scheduling time as a function of the data concurrency level.

The second approach schedules tasks using the distribution pattern of the clusters. The clusters are distributed non-contiguously over the K nodes, modulo K. This pattern makes it possible to find the node that holds the cluster of a given task immediately, and the task is then assigned to the least loaded of the R nodes {i, i-1, ..., i-R+1}.

for each task Tj of the query {j=1, 2, ..., N}
    i = modulo(Tj, K), the processor that holds the cluster of Tj
    if (i ≥ 0) and (i < K)
        pro_assign = the least loaded of the R processors {i, i-1, ..., i-R+1}
        assign task Tj to node pro_assign
    else
        error assigning task Tj
    end if
end for
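The modulo-based approach can be sketched in a few lines. Several points are assumptions rather than details confirmed by the text: each task is represented by its cluster number and an execution time, tasks are processed in decreasing order of execution time (the LPT rule), and the candidate set {i, i-1, ..., i-R+1} wraps around modulo K. The task list is made up for illustration.

```python
def schedule_by_modulo(tasks, k, r):
    """Assign (cluster_id, exec_time) tasks to k nodes with duplication level r.

    Returns the list of (cluster_id, node) placements and the load per node.
    Requires 1 <= r <= k.
    """
    load = [0.0] * k
    placement = []
    # LPT rule (assumption): handle the longest tasks first.
    for cluster_id, exec_time in sorted(tasks, key=lambda t: t[1], reverse=True):
        i = cluster_id % k                      # node that owns the cluster
        # Nodes holding a copy of the cluster; wrap-around is an assumption.
        candidates = [(i - d) % k for d in range(r)]
        node = min(candidates, key=lambda n: load[n])
        load[node] += exec_time
        placement.append((cluster_id, node))
    return placement, load


# Hypothetical query: three tasks hit clusters owned by node 0.
tasks = [(0, 5.0), (4, 4.0), (8, 3.0), (1, 1.0)]
_, load_r1 = schedule_by_modulo(tasks, k=4, r=1)
placement_r2, load_r2 = schedule_by_modulo(tasks, k=4, r=2)
```

In this toy case the most loaded node carries 12 time units without duplication (r=1) and 7 with duplication level 2, which illustrates how a second copy of each cluster gives the scheduler room to balance the load.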

The complexity of the second approach is given by:

Fig 4.5 – Scheduling time of the filtering tasks for 48 processors, W=9 (in microseconds). Panel (a): scheduling time as a function of the number of tasks. Panel (b): scheduling time as a function of the data concurrency level.

The second approach is faster than the first. When the number of tasks is fixed and the data concurrency level (the duplication level R) is increased, the scheduling time remains almost unchanged.

Application to the scheduling of filtering tasks

Suppose a query contains N tasks. These tasks consist of N comparisons of neighboring letters, each with a score computed from the number of identical bases. The N tasks are executed on the clusters of the indexed bank.

If the data are not duplicated locally, each task can be assigned to only one processor. When part of the data is duplicated locally R times, a task can be assigned to any one of R processors. The execution time of the N filtering tasks is determined by the most loaded node. The formula for the time of the filtering step is based on the number of tasks assigned to each node (nk).
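The formula itself did not survive in this copy of the text. One plausible reading of "determined by the most loaded node", given the execution-time function ex(Tj) defined earlier, is the following; it is offered as an assumption, with $T_{k,j}$ denoting the j-th of the $n_k$ tasks assigned to node k (a notation introduced here for illustration).

```latex
T_{\text{filtering}} \;=\; \max_{0 \le k < K} \; \sum_{j=1}^{n_k} ex\!\left(T_{k,j}\right),
\qquad \sum_{k=0}^{K-1} n_k = N
```

Under this reading, duplication does not change the total amount of work; it only lowers the maximum by letting the scheduler move tasks away from the busiest node.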

Duplicating part of the data helps balance the load of the processors and significantly reduces the overall processing time of a query. As figure 4.6 shows, the time saved through local cluster duplication is substantial, particularly at level 2. However, duplication also requires additional storage space.

Fig 4.6 – Execution time of the filtering tasks (in seconds) on each platform (PC cluster, RDisk, ReMiX) as a function of the data duplication level (query length 2077; FASTA bank size 9 GB; 32 nodes for the PC cluster, 48 for RDisk and 8 for ReMiX)

Application to the scheduling of sequences

After the filtering and merging stages, we need to construct a database in FASTA format (M sequences) Currently, data access times remain costly, with the IDE protocol averaging 7 ms and the SCSI protocol at 4.7 ms To address this, the construction of the FASTA database is performed in parallel, distributing sequences across nodes efficiently By reducing the construction time, we also create local duplicates of some sequences The total time to build a database of M sequences is determined by the most loaded node The time for this construction step can be calculated using the formula where mk represents the number of sequences extracted at node k.

Figure 4.7 shows that the construction time of the FASTA-format database does not decrease much when the duplication level is increased.

[Figure: three panels (PC cluster, RDisk, ReMiX); x-axis: data concurrency level; y-axis: construction time of fbank (s)]

Fig 4.7 – Construction time of the database on each platform for each sequence duplication level (M = 10,000)

Simulation

We compute the query processing time with the formula $T_{requete} = N \cdot T_{Filtrage} + M \cdot T_{Extraction}$. The parameters are: an EST (Expressed Sequence Tag) database containing approximately 9 billion base pairs distributed across 16 million sequences; query sequences of 300 to 2100 bases; and a cluster duplication level varying from 1 to 6. Figure 4.8 shows that the time gained through data duplication is greatest at level 2. The best duplication level for the ReMiX platform is 2.
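The query-time model can be written down directly. The sketch below gives the sequential form of the formula and a per-node variant in which each step is bounded by its most loaded node; the function names and all numeric values used with them are illustrative assumptions, not measurements from the simulation.

```python
def query_time(n_tasks, t_filter, m_sequences, t_extract):
    """Sequential estimate: N * T_filtering + M * T_extraction (seconds)."""
    return n_tasks * t_filter + m_sequences * t_extract


def parallel_query_time(task_loads, t_filter, seq_loads, t_extract):
    """Parallel estimate: each step lasts as long as its most loaded node.

    task_loads[k] is the number of filtering tasks on node k (n_k) and
    seq_loads[k] the number of sequences extracted on node k (m_k).
    Communication time is ignored in this simplified model.
    """
    return max(task_loads) * t_filter + max(seq_loads) * t_extract
```

For example, `query_time(2000, 0.001, 10000, 0.0047)` evaluates the model for hypothetical values of 2000 tasks at 1 ms each and 10,000 extractions at 4.7 ms each.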

[Figure: three panels (PC cluster, RDisk, ReMiX); x-axis: data concurrency level; y-axis: query execution time (s); one curve per query length: q300, q600, q900, q1200, q1500, q1800, q2100]

Fig 4.8 – Query execution time for different cluster duplication levels

Conclusion

Duplicating data does not significantly reduce the extraction time from the FASTA-format database, yet it increases storage requirements. We therefore duplicate data only for the filtering stage, with a duplication level between 1 and 3. Figures 4.5 and 4.8 show that the scheduling time of the filtering tasks is negligible compared with the total execution time on all three platforms.

Platform and library

– platform: we run the experiments on a cluster of machines with the following characteristics:

Nodes: parasol (parasol01, parasol02, ..., parasol32)
Type: SUN Fire V20z
CPU: 2.2 GHz
Memory: 2 GB
Network: Gigabit Ethernet
Hard disk: SCSI protocol, 73 GB per node
Operating system: Linux

– library: MPI (Message-Passing Interface) is a library of functions, usable from C, Fortran, and C++, that exploits multiprocessor machines through message passing. The MPI programming model is based on communicating processes that cooperate to execute tasks, each process holding its own data. The processes can communicate and synchronize with one another.

Implementation

In this model, there is one server and K stations. Information about the clusters, including their size and number, is stored on the server. The two databases, "index" and "FASTA", are distributed across the stations. For the indexed database, we use a table

Indexing parameters and results

Parameters

The reference database for all tests is an EST bank containing approximately 9 billion base pairs distributed across 16 million sequences. For the tests, we used words of size W and a neighborhood of size V. To evaluate the alignments during query processing, we implemented a scoring function computed by dynamic programming, with a threshold of 23 on the sum of the scores of the two neighborhoods. To compare the actual results on the PC cluster platform against the theoretical values, we need to specify the parameters of this platform:

In addition, we assess the performance of our implementation by varying the number of nodes in the cluster (8, 16, 24, and 32 nodes). We measure:

– the speedup of the filtering stage (N tasks);

– the speedup of the construction of the FASTA-format database (fbank);

– the speedup of the execution of a query;
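The three measurements above can be derived from timings in the same way. The sketch below computes speedup and parallel efficiency relative to the smallest configuration measured; taking the smallest node count as the baseline is an assumption of this example, as are the function name `speedups` and any timings passed to it.

```python
def speedups(times_by_nodes):
    """Relative speedup and efficiency against the smallest node count.

    times_by_nodes maps a node count to a measured time in seconds.
    Returns {nodes: (speedup, efficiency)}.
    """
    base_nodes = min(times_by_nodes)
    base_time = times_by_nodes[base_nodes]
    result = {}
    for nodes, t in sorted(times_by_nodes.items()):
        speedup = base_time / t
        # Ideal speedup relative to the baseline is nodes / base_nodes.
        efficiency = speedup / (nodes / base_nodes)
        result[nodes] = (speedup, efficiency)
    return result
```

An efficiency that drops as nodes are added indicates sub-linear scaling, as would be expected when cluster sizes are uneven or communication grows with the node count.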

Results

Figure 5.3 shows the performance of the filtering stage for query sequences of 300 to 3100 bases. This experiment involves no duplication. The speedup of the index filtering step does not grow linearly, because cluster sizes are highly uneven. In general, the actual results slightly exceed the theoretical values. The speedup of this stage is also influenced by the data access time, which averages 4.7 milliseconds.

Increasing the number of nodes reduces the amount of data held by each node. The processing time of this step also depends on the length of the query sequence.

[Figure: two panels, (a) theory and (b) implementation; x-axis: number of nodes; y-axis: filtering step time (s); one curve per query length: query-300, query-600, query-900, query-1200, query-1500, query-1800, query-2100, query-2400, query-2700, query-3100]

Fig 5.3 – Execution time of the filtering step for different numbers of nodes (no data duplication)

To reduce the processing time of the filtering step, we duplicate part of the clusters. This duplication lets us assign tasks so that the load on each node is equivalent. In this experiment, the number of nodes was fixed at 32. The query sequence length varies from 600 to 3100 bases, and the cluster concurrency (duplication) level varies from 1 to 3. The result resembles the theoretical calculation: the time gained between the first and second levels is greater than that gained between the second and third levels (see Figure 5.4).

[Figure: two panels, (a) theory and (b) implementation; x-axis: data duplication level; y-axis: filtering processing time (s); one curve per query length: query-600, query-900, query-1200, query-1500, query-1800, query-2100, query-2700, query-3100]

Fig 5.4 – Execution time of the filtering step for different cluster concurrency levels (K2)

The extraction performance of the FASTA-format database scales linearly. The average sequence length is approximately 600 base pairs, and because sequence sizes are this uniform, the extraction time depends primarily on data access speed. Distributing sequences randomly across nodes balances the extraction workload. In addition, the time required to build the database is unaffected by the sequence length.
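The balancing effect of random placement is easy to check with a small simulation. The sketch below places each sequence on a uniformly random node and reports the per-node counts; the function name `distribute_randomly`, the fixed seed, and the sizes used in the usage note are assumptions for illustration.

```python
import random
from collections import Counter


def distribute_randomly(num_sequences, num_nodes, seed=0):
    """Place each sequence on a uniformly random node; return per-node counts."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = Counter(rng.randrange(num_nodes) for _ in range(num_sequences))
    return [counts.get(k, 0) for k in range(num_nodes)]
```

Calling `distribute_randomly(10000, 32)` and comparing `max(...)` with the ideal share of 10000 / 32 sequences per node shows how far the most loaded node deviates from a perfect balance.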

[Figure: two panels, (a) theory and (b) implementation; x-axis: number of nodes; y-axis: database extraction time (s); one curve per query length: query-300, query-600, query-900, query-1200, query-1500, query-1800, query-2100, query-2400, query-2700, query-3100]

Fig 5.5 – Extraction time of the database for different numbers of nodes


For query processing performance, Figure 5.6 shows that long query sequences benefit more from parallel execution than short ones, because short query sequences involve fewer tasks. However, adding nodes to the cluster increases the communication time between nodes. The figure also indicates that the theoretical performance is better than the actual performance.

[Figure: two panels, (a) theory and (b) implementation; x-axis: number of nodes; y-axis: query processing time (s); one curve per query length: query-300, query-600, query-900, query-1200, query-1500, query-1800, query-2100, query-2400, query-2700, query-3100]

Fig 5.6 – Query processing time for different numbers of nodes

The execution time of each step of a query is computed theoretically from the parameters of the machine cluster. Tables 5 and 6 give the theoretical (T) and actual (R) execution times for each step. They show that, for most sequences, the actual time of the filtering step and of the construction of the FASTA-format database exceeds the theoretical value. The actual query processing time also includes the communication time between nodes, which accounts for the discrepancy between actual and theoretical times. Overall, the difference between theory and practice is not large.

Query size | Filtering time (T) | Filtering time (R) | Extraction time (T) | Extraction time (R) | Query time (T) | Query time (R)

Table 5: comparison between the theoretical and actual query results for 24 nodes (duplication=1)

Query size | Filtering time (T) | Filtering time (R) | Extraction time (T) | Extraction time (R) | Query time (T) | Query time (R)

Table 6: comparison between the theoretical and actual query results for 24 nodes (duplication=2)

The extraction time of sequences, shown in the sixth column of Tables 5 and 6, is quite costly. In our system, each node has 2 gigabytes of memory, and the local FASTA database is smaller than the available memory. We therefore optimized extraction by copying the local FASTA database into memory, which significantly reduces the processing time. Table 7 gives the extraction times measured on a 24-processor cluster.
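The in-memory optimization can be sketched as follows: the local FASTA file is parsed once into a dictionary, and later extractions become dictionary lookups instead of disk accesses. The function names `load_fasta` and `extract`, and the choice of the first word of the header line as the identifier, are assumptions made for this example.

```python
def load_fasta(lines):
    """Parse FASTA-formatted lines into a dict {identifier: sequence}."""
    bank = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                bank[name] = "".join(chunks)
            # The identifier is taken to be the first word after '>'.
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        bank[name] = "".join(chunks)
    return bank


def extract(bank, identifiers):
    """Return the requested sequences as FASTA text, served from memory."""
    return "".join(f">{i}\n{bank[i]}\n" for i in identifiers)
```

Any iterable of lines can be passed to `load_fasta`, including an open file object, so a node would load its local database once at startup and answer every subsequent extraction request from the resulting dictionary.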

Query size | Number of sequences | Database loaded in memory | Database not loaded in memory

Table 7: Extraction time of M sequences

Theoretically, the processing time of a query is $T_{requete} = N \cdot T_{Filtrage} + M \cdot T_{Extraction}$, which neglects the communication time between nodes. Although tasks execute independently, communication is still needed to share scheduling data and to combine results. Figure 5.7 shows the breakdown of the actual query processing time: the communication time needed to build the database and to disseminate information remains significant.

Fig 5.7 – Breakdown of the query processing time for 16 nodes (in seconds)

We compared the performance of our approach, IndexBLAST, with mpiBLAST (1.4.0), a parallel version of BLAST. Table 8 summarizes the execution times measured on a 24-processor cluster with an index duplication level of 2.

Conclusion

Query size | Filtering time | Number of sequences | Extraction time | Merging time | Alignment time | IndexBLAST time | mpiBLAST time

Table 8: comparison between IndexBLAST and mpiBLAST for 24 nodes

The alignment time (column 6) is the time required to run the BLAST software sequentially on the FASTA database of M sequences. The last two columns of the table give the total execution times. The total time for IndexBLAST includes filtering, extraction, merging of the database, communication, and alignment. In general, the times in the last two columns are nearly identical.

By grouping identical words of size W, indexing lets us directly target regions that are potentially similar to the query sequence, which accelerates the search. The implementation is parallel, and a query is executed in two main stages: index filtering and database extraction. These stages are distinct but dependent: the second can start only once the first has completed. As a result, some processors remain idle during query execution. In addition, the data exchanged for task scheduling and sequences is relatively small, whereas the data needed to build the FASTA database is substantial: point-to-point communication for its construction takes 0.4 seconds.
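The idea of grouping identical words of size W can be illustrated with a small in-memory index. In the sketch below, every dictionary entry lists the occurrences of one word, and a query is matched by looking up each of its own words. The names `build_index` and `candidate_hits` and the dictionary layout are assumptions for illustration; the on-disk organization used in the implementation is not shown here.

```python
from collections import defaultdict


def build_index(sequences, w):
    """Map every word of length w to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index


def candidate_hits(index, query, w):
    """Yield (query offset, sequence id, sequence offset) for each shared word."""
    for q_pos in range(len(query) - w + 1):
        # .get avoids creating empty entries for words absent from the index.
        for seq_id, s_pos in index.get(query[q_pos:q_pos + w], ()):
            yield q_pos, seq_id, s_pos
```

Each hit identifies an anchor point whose neighborhood would then be scored during the filtering stage.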

Having studied the behavior of similarity search through parallel indexing, we conclude that:

– duplicating the clusters beyond level 2 brings little additional gain;

– duplicating the FASTA database is unnecessary;

This report presents a similarity search method based on indexing in the field of bioinformatics, and the performance of parallel indexing on three platforms: PC clusters, reconfigurable disk clusters (RDisk), and reconfigurable memory clusters (ReMiX). The parallel indexing described here can improve query processing performance. We examined the local data duplication rate and proposed a task scheduling algorithm whose scheduling time is negligible compared with the total execution time. We analyzed the query processing time on the constructed model and compared the theoretical and actual processing times for the PC cluster platform. The results indicate significant potential gains on DNA databases compared with the reference software in the field, mpiBLAST. However, there are two main directions for improving our approach.

The parallel implementation carried out to validate our approach can be improved. The proposed overall scheme (Fig 3.1) includes sequential components that limit overall performance, in particular the final alignment step. Executing this step in parallel on all nodes, each using the dataset produced by sequence extraction, would significantly reduce the time it requires.

The analysis of the results indicates that most of the time in our approach is spent on filtering. Currently, indexing relies on a heuristic that identifies anchor points formed by words of W consecutive characters. However, recent studies have explored more subtle search patterns, known as spaced seeds, which enhance sensitivity and improve indexing accuracy.

less demanding in terms of memory space.
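To show how a spaced seed differs from a word of W consecutive characters, the sketch below extracts index keys using a binary mask: only the positions marked `1` contribute to the key, so a mismatch under a `0` does not prevent a hit. The function name `spaced_words` and the mask `1101` are hypothetical choices for illustration, not parameters taken from this work.

```python
def spaced_words(sequence, mask):
    """Yield (offset, key) where key keeps only the letters under '1' in mask."""
    keep = [i for i, bit in enumerate(mask) if bit == "1"]
    span = len(mask)
    for pos in range(len(sequence) - span + 1):
        yield pos, "".join(sequence[pos + i] for i in keep)
```

With the mask `1101`, the windows `ACGT` and `ACTT` both produce the key `ACT`, whereas they would be distinct words under contiguous indexing with W = 4.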



Title	Indexation Parallèle Des Données Génomiques
Author	Nguyen Van Hoa
Supervisor	Dominique Lavenier
Institution	IRISA
Field	Genomics
Type	Thesis
Year of publication	2005
City	Rennes
