(Master's thesis) Integration of a workflow engine into a grid environment

Bioinformatics on the grid

Bioinformatics is a multidisciplinary research field that brings together biologists, computer scientists, mathematicians, and physicists to address scientific problems posed by biology. The term also covers all the computational applications that arise from this research. The discipline is often called "in-silico biology," by analogy with "in-vitro" (in glass) and "in-vivo" (within the living).

Bioinformatics aims to generate new knowledge about how the cells of living organisms function, evolve, and stay healthy. It initially focused on "genomics," the study of the structure, function, and evolution of genomes. It then became clear that genomics gives only a static view of cells and fails to capture their temporal dynamics. This realization led to "post-genomics," which investigates when and under what conditions genes trigger protein production, and how these proteins contribute to cellular function.

Consulting and enriching databases is an essential part of researchers' work, and it has caused a rapid increase in both the quantity and variety of available information. This abundance is welcome, but the management of this information must not hinder the effective use of this vast and still partially untapped reservoir of knowledge.

Bioinformatics therefore inherits two heavy tasks: [2]

• Increasing processing speed, either by using powerful computing infrastructures or by designing new algorithms

• Ensuring that new data are structured and easily accessible

Both needs can be met through the use of grid technology, which includes "supercomputers" (computational grids) and "massive distributed databases" (data grids).

A grid is a software system that provides users with nearly unlimited computing power and data storage capabilities. It gives easy, transparent access to a large number of distributed computing resources at large scale, and only requires a connection to a high-speed network such as the Internet.

Bioinformatics applications are well suited to grid environments because they need to handle geographically distributed genomic databases. The main challenge is to deliver transparent, secure, and scalable grid bioinformatics services. Significant efforts have recently been made to develop computing platforms tailored to bioinformatics using grid technology.

Bioinformatics platform of the PCSV team

The Bioinformatics Services layer

This layer contains the web services that correspond to the bioinformatics applications or simulations available on the platform, such as Autodock, BLAST, and ClustalW. These services are deployed according to the technical recommendations of the Embrace project (A European Model for Bioinformatics Research and Community Education). Each web service provides a set of main operations:

• Checking the status of a task

• Retrieving the result of a task

Figure 1-1: The PCSV team platform

A web service may include additional operations when necessary. For instance, in the context of drug discovery workflows, the autodock web service has an extra operation called "thresholder," which checks the energy of a molecular docking against a specified threshold.

Thanks to these layers, users can work remotely with the platform, while the other components remain transparent to them. They simply use the service operations to submit and manage their simulation tasks, without needing to understand the underlying processes. The services are also flexible to use: users can interconnect different services to create bioinformatics workflows that combine several simulations.

The server components of these services are written in Java with the Axis library. They are deployed on the team server.

The Management layer

This layer can be divided into two modules: the Task Manager and the Information System.

These two modules are autonomous. They do not exchange information directly.

The roles of the Task Manager are:

• Storing information about tasks: the name of the bioinformatics service that submitted the task, the name of the user who submitted it, a unique task identifier (TaskId), and a character string (Arguments) in which the user can include additional information.

The Task Manager exposes web services through which users and applications can work with it.

These web services were developed in C with the gSOAP library. The server component is hosted on the team server, while the clients are binary applications. These clients can be launched from various locations, including the team server, a remote machine, or a grid worker node (WN).

The Information System is built on the EGEE (Enabling Grids for E-sciencE) data grid, using the AMGA metadata catalog and the MySQL database manager. AMGA is a metadata service for the grid: it holds the descriptions and physical locations of files, and it lets jobs query and update data stored on the grid.

On the team platform, the purpose and use of AMGA differ slightly. Together with MySQL, AMGA provides a single database that stores all the schemas related to simulations and tasks. The primary role of the Information System is to keep all the results of the simulations and task executions run on the platform.

Client APIs are available for querying, adding, and updating content in the Information System. These APIs exist in three programming languages: C++, Java, and Python. Like the Task Manager web services, they can be called at any time.

AMGA's client APIs are used across all layers of the platform. In the Bioinformatics Services layer, the Java APIs are integrated into the web services, while in the WISDOM Production Environment, the Python APIs are used by the agents.

The WISDOM Production Environment

The WISDOM (Wide In Silico Docking On Malaria) Production Environment

Developed over several years, the WISDOM Production Environment initially focused on discovering drugs against malaria. It lets researchers submit and manage thousands of jobs on the grid and run millions of molecular docking simulations to identify ligands that bind to the plasmepsin protein of the malaria parasite and inhibit its action. Its objective has since expanded to cover various types of simulations and bioinformatics analysis tools.

The general capabilities of the environment are the creation and management of the "agents" that retrieve and process tasks. In practice, agents are grid jobs:

• Creating agents means submitting jobs to the computing grid

• Managing agents means managing job states

• Agent behaviour corresponds to the job payload

Platform operations

There are two main operations on the platform: task submission and agent submission.

Task submission proceeds as follows:

• The user specifies the input arguments of a task and submits it through the corresponding bioinformatics service

• The bioinformatics service invokes the client of the createTask web service to create a new task in the Task Manager

• The bioinformatics web service invokes the AMGA client APIs to create a new entry in the "simulations" table of the Information System database

Depending on the bioinformatics service, the output may be a task identifier, a simulation identifier, or a project identifier. With this identifier, users can check the status of the corresponding task, simulation, or project and retrieve its results.

Agent submission proceeds as follows:

• The user specifies the required arguments (number of agents, name of the virtual organization, etc.) in the configuration file of the WISDOM Production Environment

• The user launches the WISDOM Production Environment

• The WISDOM Production Environment submits the agents (the grid jobs) and manages their states

• Each agent runs an infinite loop with the following steps (on the worker node of the computing grid):

• It uses the getTask service client to retrieve a new task from the Task Manager. The task also carries the identifier of the corresponding simulation

• It uses the simulation identifier to retrieve the information about the simulation. With this information, the agent can download the data and applications needed to run the simulation

• It processes the simulation results:
o It stores the output files on the data grid
o It updates the Information System

• It deletes the task from the Task Manager if the simulation succeeded

• It puts the task back in the queue if the simulation failed

Because each agent runs an infinite loop, its lifetime is determined by the configuration of the EGEE grid. This lifetime is generally 24 hours.

An agent can process several tasks. The approach is a hybrid of the PUSH and PULL modes: in the WISDOM Production Environment, agents are pushed to the grid, and once running they pull the tasks they execute.
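The agent loop can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the platform's code: InMemoryTaskManager is a hypothetical in-memory stand-in for the Task Manager web services (getTask, deleteTask, setTaskWaiting), the simulate callback stands for downloading the data and running the simulation, and max_iterations replaces the lifetime imposed by the grid. Unlike a real agent, the sketch stops as soon as the queue is empty.

```python
from collections import deque


class InMemoryTaskManager:
    """Hypothetical stand-in for the Task Manager web services."""

    def __init__(self, tasks):
        self.queue = deque(tasks)  # tasks waiting for an agent
        self.running = {}          # tasks currently held by an agent

    def get_task(self):
        """Hand out the next waiting task, or None if the queue is empty."""
        if not self.queue:
            return None
        task = self.queue.popleft()
        self.running[task["task_id"]] = task
        return task

    def delete_task(self, task_id):
        """Remove a task whose simulation succeeded."""
        self.running.pop(task_id)

    def set_task_waiting(self, task_id):
        """Put a failed task back at the end of the queue."""
        self.queue.append(self.running.pop(task_id))


def run_agent(manager, simulate, max_iterations):
    """Pull tasks until none is left or the iteration budget is spent."""
    done = []
    for _ in range(max_iterations):  # stands in for the agent lifetime
        task = manager.get_task()
        if task is None:
            break
        if simulate(task):
            manager.delete_task(task["task_id"])
            done.append(task["task_id"])
        else:
            manager.set_task_waiting(task["task_id"])
    return done
```

Because a failed task goes back to the queue, any agent (including the same one) can pick it up again later, which is what makes the pull model tolerant to individual job failures.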

Interconnection of services

At present, users must invoke each bioinformatics service operation independently. The platform has no application for creating and managing interconnections between operations, whether they belong to the same service or to different services.

For instance, suppose a biologist wants to build a two-stage workflow in which the first stage runs simulations of Service A, the second stage runs simulations of Service B, and the output of the first stage is the input of the second. The biologist must carry out each step by hand:

• Submit the simulation of Service A;

• Wait for this simulation to finish;

• Retrieve the output of this simulation;

• Submit the simulation of Service B.

The platform therefore needs a workflow manager that can create the interconnection between services and execute it automatically.
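The manual sequence that a biologist currently follows (submit A, wait, retrieve, submit B) can be sketched as below. FakeService is a hypothetical in-memory service invented for the illustration; its method names and the polling interval are assumptions and do not reflect the real operations of the platform's services.

```python
import time


class FakeService:
    """Hypothetical service whose simulations finish after a few status checks."""

    def __init__(self, name, polls_needed=2):
        self.name = name
        self.polls_needed = polls_needed
        self.inputs = {}  # simulation id -> submitted input
        self.polls = {}   # simulation id -> number of status checks so far

    def submit(self, data):
        sim_id = f"{self.name}-{len(self.inputs) + 1}"
        self.inputs[sim_id] = data
        self.polls[sim_id] = 0
        return sim_id

    def is_finished(self, sim_id):
        self.polls[sim_id] += 1
        return self.polls[sim_id] >= self.polls_needed

    def get_result(self, sim_id):
        return f"{self.name}({self.inputs[sim_id]})"


def run_two_stage(service_a, service_b, first_input, poll_seconds=0.0):
    """Chain two services by hand: the output of A becomes the input of B."""
    sim_a = service_a.submit(first_input)    # submit the Service A simulation
    while not service_a.is_finished(sim_a):  # wait for it to finish
        time.sleep(poll_seconds)
    output_a = service_a.get_result(sim_a)   # retrieve its output
    return service_b.submit(output_a)        # submit the Service B simulation
```

A workflow manager takes over exactly this glue code, so the user only describes which output feeds which input.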

Chapter 2 Workflow management on the platform

Objectives

The platform currently lacks workflow management. To execute workflows that consist of several successive steps, there are two ways to proceed:

• Using the bioinformatics services: the operations of the bioinformatics services must be launched one after another (see the section "Interconnection of services" in the previous chapter)

• Working directly with the management layer: users write a script that encapsulates all the workflow steps and submit it directly to the Task Manager as a single task. However, it is then hard to monitor the progress of each individual step, so the approach lacks flexibility.

To solve this problem on the platform, a workflow manager is needed to handle the interconnections between service operations. This solution offers two key advantages:

• Flexibility: users can create any workflow, which the manager interprets and executes step by step

• Simplicity: the workflow executed on the grid is very close to the user's workflow

Figure 2-1: The flexibility of creating workflows by interconnecting services

Another reason for implementing a workflow manager on the platform is the involvement of the PCSV team in the ANR GWENDIA project (Grid Workflow Efficient Enactment for Data Intensive Applications). GWENDIA aims to provide efficient workflow management systems that handle and process large volumes of scientific data on large-scale infrastructures such as computing grids. The team's role in this project is to develop a workflow that facilitates in-vitro drug discovery.

Through its participation in the GWENDIA project, the team has selected the MOTEUR workflow engine for integration into the platform. Developed by the RAINBOW team, which is also involved in the project, MOTEUR is optimized for applications that manage large volumes of data on grid infrastructures.

Figure 2-2: Using the workflow engine with the platform

Scufl - a workflow specification language

Scufl (Simple Conceptual Unified Flow Language) is a data-flow oriented language that essentially describes the pipeline of an application. The main components of a Scufl workflow are processors, of which several types are predefined:

• String constants: processors that are activated only once and whose output is a constant character string

• Web service: processors that can invoke an operation of a web service

• Beanshells: processors that can execute a piece of Java code

• Source: processors that represent workflow inputs

• Sink: processors that represent workflow outputs

Each source can hold several data segments over which the workflow must iterate. Their contents are not specified in the Scufl document: they are independent of the workflow description and are only known at execution time.

Scufl processors have input and output ports that can each handle multiple data elements. Ports are interconnected by data links, which act as pipes between the output port of one processor and the input port of another. An output port can be connected to several input ports, in which case the data are broadcast to all connected inputs. Conversely, several output ports can be linked to a single input port, where the data are buffered in order of arrival. The workflow is driven entirely by the presence or absence of data on the input ports: a processor is activated only when all its input ports contain data. Variables cannot be defined in Scufl. Composition operators are used to define iteration strategies over a processor's input ports, which control how the data elements arriving on those ports are combined.

There are two composition operators: "dot" and "cross." The dot operator is a "one-to-one" operator: it pairs each data element of the first series with the corresponding element of the second series, in the order in which they are defined. The cross operator is a "many-to-many" operator: it combines every data element of the first series with every data element of the second series.
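The difference between the two operators can be illustrated with a short Python sketch. This is an analogy, not MOTEUR's implementation: the function names are illustrative, and truncating to the shorter series in dot (the behaviour of zip) is an assumption made for the sketch.

```python
from itertools import product


def dot(first, second):
    """One-to-one composition: pair elements by position."""
    return list(zip(first, second))


def cross(first, second):
    """Many-to-many composition: pair every element with every other."""
    return list(product(first, second))


ligands = ["A0", "A1"]
targets = ["B0", "B1"]
pairs_dot = dot(ligands, targets)      # one invocation per position
pairs_cross = cross(ligands, targets)  # one invocation per combination
```

With n elements on each port, dot yields n invocations of the processor whereas cross yields n × n, which is why the choice of operator directly controls the number of jobs a workflow generates.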

Figure 2-3: The "dot" operator and the "cross" operator

The MOTEUR workflow engine

Introduction

MOTEUR (hoMe-made OpTimisEd scUfl enactoR) is a workflow engine developed by the RAINBOW team at the I3S laboratory, Polytech Nice-Sophia. Its main function is to interpret and execute workflows written in the Scufl language. MOTEUR exploits several levels of parallelism and groups tasks to reduce application execution time. It also uses a generic web service wrapper, which improves the reusability of application code without requiring it to deal with the specifics of computing grids.

Workflow file

To implement a workflow, the following must be specified:

• The processors with their ports

• The compositions between the ports of a processor

• The links between the ports of the processors

All this information is described with XML tags and stored in a text file. The tags used by MOTEUR are simple yet expressive, so users can define workflows easily.

MOTEUR currently has no dedicated tool for workflow specification. Users can create workflows with any text editor, such as Vim, Nano, or Screems, as long as they follow MOTEUR's XML format and use the specified tags.

For example, the simple workflow below computes the sum of two numbers:

There are 4 processors in this workflow: two sources, one Beanshell, and one sink.

(Integer.parseInt(int0)+Integer.parseInt(int1)).toString();

int0

int1

result

Data file

The data file is a text file that follows the XML standard. It uses three tags to specify the names of the sources and the values of their data elements.

For example, the data file for the workflow above is as follows:

Service parallelism and data parallelism

• Service parallelism: MOTEUR can launch independent processors simultaneously

• Data parallelism: MOTEUR can process the data elements arriving on a processor's ports simultaneously. When this capability is enabled, MOTEUR creates several instances of the processor so that the data elements are processed concurrently.
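Data parallelism can be pictured with the following sketch, which applies one processor to several data elements concurrently. It uses Python's standard thread pool purely as an analogy: MOTEUR's actual scheduling is not described here, and run_data_parallel and max_workers are names chosen for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_data_parallel(processor, items, max_workers=4):
    """Apply `processor` to every item concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each item is handled by its own call, like one processor instance
        # per data element; map() returns results in the order of `items`.
        return list(pool.map(processor, items))


def square(value):
    """Toy processor used for the example."""
    return value * value


results = run_data_parallel(square, [1, 2, 3])
```

Service parallelism is the complementary case: instead of one processor running on many items, several independent processors run at the same time.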

MOTEUR and the grid

To work with the computing grid, the RAINBOW team is developing a web service called GASW (Generic Application Service Wrapper). Through the GASWexecution operation of this service, MOTEUR can submit jobs to the grid and manage their statuses. Using the Java Native Interface (JNI), both the client and server sides of GASW are integrated into MOTEUR.

MOTEUR can work with two grids: EGEE and Grid5000.

http://egee1.unice.fr/gasw_service.wsdl

Through the service ports, users specify the arguments of a job and retrieve its outputs. For instance, for a job on the EGEE grid, the executable file and the sandbox inputs must be specified. All these arguments are contained in the description file, which is therefore a mandatory port of the service.

Below is an example of a description file for a job that executes the script CreateTask.sh:

The team's Drug Discovery workflow

The drug discovery workflow of the PCSV team relies mainly on molecular docking simulations. Docking is a method that predicts the preferred orientation of one molecule when it binds to another to form a stable complex.

Figure 2-6: The main idea of molecular docking

Docking is commonly used to predict the binding orientation of a small molecule with respect to a larger target. In drug discovery, the small molecules are called ligands and the target is usually a protein.

Figure 2-7: The ligand bound to the protein

The input arguments of a docking simulation are [13]:

• The ligand description file

• The target description file

Its outputs are:

• The best conformation of the ligand

• The mean energy spent to form the bond. This value is crucial because it indicates binding stability: the lower the mean energy, the more stable the bond.

In drug discovery, docking makes it possible to screen large numbers of ligands and identify those that bind to and inhibit key proteins of viruses or parasites. For instance, in the search for an anti-malaria drug, docking can be used to find ligands that inhibit the plasmepsin protein and so prevent the malaria parasite from attacking red blood cells.

Several docking programs exist: Dock, Autodock, Flexx. The team uses version 3.1 of Autodock.

In general, the team's drug discovery workflow is based on two phases of docking simulation:

• Step 1 (docking phase 1): dock a ligand to a target protein. The key outputs of this phase are the mean energy and the best conformation of the ligand. From the best conformation, an improved ligand description file is generated, referred to as the "new ligand"

• Step 2 (Thresholder): compare the mean energy from phase 1 with a threshold. If the energy is below the threshold, the bond is stable and the workflow moves on to phase 2. Otherwise, it stops

• Step 3 (docking phase 2): dock the new ligand to the same protein
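The three steps can be summarized by the control flow below. It is only a sketch under assumptions: dock is a hypothetical callable standing for an Autodock run that returns the mean energy and the new ligand, and returning None when the threshold is not passed is a convention chosen for the example.

```python
def drug_discovery(ligand, target, threshold, dock):
    """Two-phase docking with a threshold test in between.

    `dock(ligand, target)` is assumed to return a tuple
    (mean_energy, new_ligand).
    """
    mean_energy, new_ligand = dock(ligand, target)  # step 1: docking phase 1
    if mean_energy >= threshold:                    # step 2: thresholder
        return None                                 # bond not stable: stop
    return dock(new_ligand, target)                 # step 3: docking phase 2


# Toy docking function with made-up energies, for illustration only.
ENERGIES = {"L": -7.5, "L*": -9.0}


def fake_dock(ligand, target):
    return ENERGIES[ligand], ligand + "*"
```

With these made-up values, a threshold of -5.0 lets the ligand through to phase 2, while a threshold of -8.0 stops the workflow after phase 1.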

Chapter 3 Implementation of a workflow manager on the platform

Objective

The team platform lets two types of users work with the grid environment:

• Biologists, who use the bioinformatics services to submit simulation tasks for execution on the grid. They do not need a grid certificate, only permission to access the platform's services

• Platform administrators, who launch the WISDOM Production Environment to submit the agents that execute tasks on the grid

Because task submission and agent submission are independent, it is better to create two separate workflows:

• The TaskSubmission workflow automates the interconnection of services. It covers task submission and related functions such as monitoring task completion and retrieving results. In this internship, the TaskSubmission workflow is specialized into the drug discovery workflow for the GWENDIA project.

Figure 3-1: Operation of the Drug Discovery workflow

• The AgentSubmission workflow plays the role of the production environment, with two main functions: submitting agents and managing them.

Figure 3-2: The steps of the agent submission workflow

Both workflows are launched and managed by MOTEUR.

The TaskSubmission workflow can be used with either the WISDOM Production Environment or the AgentSubmission workflow. In this internship, AgentSubmission is used together with the Bioinformatics Services layer and the Management layer to build a new platform for the GWENDIA project.

Figure 3-3: The agent submission workflow plays the role of the platform's production environment

Problems

Reading the description files of the bioinformatics services

All the WSDL description files of the team's bioinformatics services are written in the "Document/Literal Wrapped" type, which is the most advanced and widely used format for WSDL files. The main characteristics of this type are:

• Each input message has a single part

• The part is an element

• The element has the same name as the operation

• The element's complex type has no attributes

Table 2: Comparison of a message description between the wrapped and non-wrapped types

So far, MOTEUR has only been tested with web services whose descriptions are written in the "non-wrapped" type. In the "wrapped" type, every input and output message consists of a single part, which raises two problems for MOTEUR:

• If each processor is given a single input port and a single output port, it is unclear how to specify the data file and how to define a composition between the processor's input ports

• If the traditional workflow file format is used, the current version of MOTEUR cannot read the input parameters correctly

The current version of MOTEUR must therefore be improved so that it can invoke web services described by any type of WSDL.

The problem with the current platform

The current platform already provides a service called "docking" for Autodock simulations. However, this service cannot be used directly to build a two-phase drug discovery workflow with Autodock: it is designed to submit multiple Autodock simulation tasks, but it cannot compare the mean energy against a given threshold.

Modifying or extending the existing service is not an option, because this could conflict with its current use on the platform. A new bioinformatics service called “docking_wf” is therefore needed, designed specifically for building workflows in GWENDIA.

Improving MOTEUR to handle “doc/lit wrapped” web services

The improvement of MOTEUR

The current version of MOTEUR determines the number, names, and types of the ports of each processor with the method org.apache.axis.wsdl.symbolTable.Parameter::getQName().getLocalPart().

This works with the "non-wrapped" type. With the "wrapped" type, however, the name obtained for the message part does not correspond to the names of the elements used as the ports of each processor in MOTEUR. To solve this, the method above must be replaced with org.apache.axis.wsdl.symbolTable.Parameter::getName().

This method lets MOTEUR recognize exactly the number and names of the sub-elements in each message part.

Compared with the original version (version 080417), three methods of the DynamicInvoker class must be modified.

3.3.2 Examining whether the existing workflow and data file formats can be used with the new version of MOTEUR

3.3.2.1 Workflow file formats

The current MOTEUR workflow file format can still be used: no tags or properties need to be added. There is one point to note: to specify the input and output ports of a processor, the user must read the WSDL file manually. The number, names, and types of the input ports are those of the sub-elements of the element that has the same name as the operation (the third characteristic of the "doc/lit wrapped" type listed above).

For example, the processor corresponding to the submitOneDocking operation has 3 input ports:

• request: complex type (tns:inputType)

• path: basic type (xsd:string)

• project_id: basic type (xsd:string)
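The rule for deriving ports from a message description can be sketched as follows. The sketch does not use Axis or parse real WSDL: a message is modelled as a list of plain dictionaries, and that structure, the function name input_ports, and the part name "parameters" are all assumptions made for the illustration.

```python
def input_ports(operation, parts):
    """Return the port names for the input message of `operation`.

    Each part is a dict with the keys "name", "element" (the name of the
    element the part refers to, or None) and "children" (the names of that
    element's sub-elements).
    """
    # Wrapped type: a single part whose element is named after the operation.
    wrapped = len(parts) == 1 and parts[0]["element"] == operation
    if wrapped:
        # The ports are the sub-elements of the wrapper element.
        return list(parts[0]["children"])
    # Non-wrapped type: each message part is a port.
    return [part["name"] for part in parts]


wrapped_message = [
    {
        "name": "parameters",
        "element": "submitOneDocking",
        "children": ["request", "path", "project_id"],
    }
]
ports = input_ports("submitOneDocking", wrapped_message)
```

For the wrapped message above, the sketch yields the three ports request, path, and project_id rather than a single port for the wrapper part.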

> $1/log 2>&1 echo simulationID = $simId >> $1/log 2>&1 echo finished >> $1/log 2>&1

The isFinished operation

The purpose of this operation is to check whether the entry for the simulation exists in the "hits" table of the Information System database. The agent creates this entry after each simulation has run.

This operation has 1 input and 1 output:

• The input is the simulation identifier

• The output has two cases:
o "false" if the simulation has not finished
o the simulation identifier if the simulation has finished

An additional class named CheckResults contacts the Information System to retrieve the necessary information. It issues two queries with the simulation identifier:

• Check that a simulation with this identifier exists

• If so, check that the “hits” table contains an entry whose simulationId attribute equals the input identifier.
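The two queries can be sketched with an in-memory SQLite database standing in for the Information System. This is an illustration only: the real system is reached through the AMGA client APIs, the id column of the simulations table is a hypothetical name, and returning "false" for an unknown simulation is an assumption.

```python
import sqlite3


def make_information_system():
    """Create a toy database with the two tables used by the check."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE simulations (id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE hits (simulationId TEXT, mean_energy REAL)")
    return conn


def is_finished(conn, simulation_id):
    """Return the simulation identifier if a hit exists, else "false"."""
    # Query 1: does a simulation with this identifier exist?
    known = conn.execute(
        "SELECT 1 FROM simulations WHERE id = ?", (simulation_id,)
    ).fetchone()
    if known is None:
        return "false"
    # Query 2: has the agent written the matching entry in "hits"?
    hit = conn.execute(
        "SELECT 1 FROM hits WHERE simulationId = ?", (simulation_id,)
    ).fetchone()
    return simulation_id if hit is not None else "false"
```

Returning the identifier itself on success lets the next processor in a workflow receive it directly on its input port.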

The thresholder operation

This operation checks whether the mean energy of an AutoDock simulation is below a given threshold. If the mean energy from phase 1 of AutoDock is below the threshold, the simulation proceeds to phase 2.

This operation has 2 inputs:

No.  Name          Description
1    simulationId  Identifier of the simulation whose mean energy must be tested
2    eval          The threshold value

Table 4: The inputs of the thresholder operation

This operation has a single output, named result:

• If the mean energy passes the threshold, the result is the identifier of the new ligand created during phase 1

• If the mean energy does not pass the threshold, the result is MOTEUR_VOID

An additional class named CheckThresholder contacts the Information System to retrieve the necessary information:

• Query the “hits” table to obtain the value of the mean_energy attribute for the simulation

• Compare this value with the input value “eval” (the threshold)

• If the mean energy is less than eval, search the "ligands" table for the ligand whose source attribute matches the simulation identifier, and return its identifier to the skeleton. Otherwise (mean energy greater than or equal to eval), return "MOTEUR_VOID" to the skeleton.
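The decision logic of the operation can be sketched as below. The two tables are modelled as plain dictionaries, which is an assumption of the sketch; the real CheckThresholder class queries the Information System instead.

```python
MOTEUR_VOID = "MOTEUR_VOID"


def thresholder(simulation_id, threshold, hits, ligands):
    """Return the new ligand identifier, or MOTEUR_VOID.

    `hits` maps a simulation identifier to a row with a "mean_energy" value;
    `ligands` maps a ligand identifier to a row with a "source" value.
    """
    mean_energy = hits[simulation_id]["mean_energy"]
    if mean_energy < threshold:
        # Look for the ligand produced by this simulation.
        for ligand_id, row in ligands.items():
            if row["source"] == simulation_id:
                return ligand_id
    return MOTEUR_VOID


# Toy tables with made-up values, for illustration only.
hits_table = {"sim1": {"mean_energy": -9.2}, "sim2": {"mean_energy": -3.0}}
ligands_table = {"lig42": {"source": "sim1"}, "lig43": {"source": "sim2"}}
```

In this sketch, a simulation that passes the threshold but has no matching ligand also yields MOTEUR_VOID, which is a choice made for the example.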

Implementation of the TaskSubmission workflow

The submitDocking processor

This processor invokes the submitDocking operation. It has four input ports: ligand_id, target_id, project_id, and user. Its single output port provides the identifier of the submitted simulation.

The isFinished processor

This processor invokes the isFinished operation. It has a single input port, simulationId, and a single output port, which carries the result ("false" or the simulationId).

The thresholder processor

This processor invokes the thresholder operation. It has 2 input ports, “simulationId” and “eval”, and a single output port, which carries the result ("MOTEUR_VOID" or the ligand_id of the ligand created during phase 1).

Figure 3-5: The task submission workflow specialized for Drug Discovery

The problem of interconnecting the processors

The isFinished operation must be invoked repeatedly, at regular intervals, until the corresponding task has completed. MOTEUR currently has no processor type that supports this kind of repeated invocation. The closest option is the GASW processor, but GASW processors are specialized for checking job states. Because of the limited duration of the internship, the three processors above were only tested in isolation. A complete task submission workflow can only be implemented once such a processor type has been developed.
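The behaviour expected from such a processor is a polling loop, sketched below. The function name, the default interval, and the upper bound on the number of polls are assumptions made for the illustration; is_finished is any callable that follows the contract of the isFinished operation ("false" or the simulation identifier).

```python
import time


def wait_until_finished(is_finished, simulation_id,
                        interval_seconds=60.0, max_polls=100):
    """Poll `is_finished` until it stops returning "false"."""
    for _ in range(max_polls):
        if is_finished(simulation_id) != "false":
            return simulation_id
        time.sleep(interval_seconds)  # wait before the next check
    raise TimeoutError(
        f"simulation {simulation_id} not finished after {max_polls} polls"
    )
```

Bounding the number of polls keeps the workflow from waiting forever if the agent that holds the task is lost.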

The AgentSubmission workflow

Workflow design

This workflow contains a single processor, named AgentSubmission, with 3 input ports and 1 output port. The names and roles of the ports are as follows:

2    input1  Name of the storage element of the data grid (SE)
3    input2  Name of the virtual organization of the grid (VO)
4    result  Name of a text file, which is the execution log file of the agent

Table 5: The inputs of the AgentSubmission processor of the workflow

The executable file of this processor is named AgentSubmission.sh.

The sandbox of this processor has 2 inputs:

• jobAgent_wf.sh: this script manages all the execution stages of an agent throughout its lifetime, that is, for as long as the corresponding grid job is kept on the worker node of the computing grid

• scripts.tar.gz: this archive contains all the client-side binaries of the web services that work with the Task Manager (getTask, deleteTask, setTaskWaiting, etc.)

Both files are stored on the server egee1.unice.fr. MOTEUR can download them from these paths:

• http://egee1.unice.fr/dd/jobAgent_wf.sh

• http://egee1.unice.fr/dd/scripts.tar.gz

The XML description file of this processor is therefore:

The workflow file

Because the workflow contains a single processor, the workflow file is very simple.

The first element invokes the GASWexecution operation, while the second is a constant string that gives the address of the description file for GASWexecution. When this workflow is launched, only the AgentSubmission processor appears in the interface, because AgentSubmission_Descriptor is a constant.

There are 3 source processors, named JobNumber, SE, and VO, which correspond to the 3 input ports of the AgentSubmission processor.

There is a single sink processor, named "result," which refers to the output file. Note that the output file is stored on the Biomed storage element and can be copied with the lcg-cp command.

http://egee1.unice.fr/dd/AgentSubmission_Descriptor.xml

Service definition of function ns GASWexecution

http://egee1.unice.fr/gasw_service.wsdl

The data file

The data file has the following form:

cirigridse01.univ-bpclermont.fr

Of the 3 sources above, the values of SE and VO are constants.

The number of data elements in the JobNumber source determines the number of agents launched simultaneously on the grid (thanks to MOTEUR's data parallelism).

Scripts used by the workflow

• jobAgent_wf.sh: the script that implements the behaviour of each agent

• docking_wf.sh: the script that runs the simulation (Autodock in this case)

3.6.4.1 The script that implements the behaviour of each agent: jobAgent_wf.sh

The link from which MOTEUR can fetch this script: http://egee1.unice.fr/dd/jobAgent_wf.sh

This script runs an infinite loop, retrieving necessary information from the TaskManager via the getTask command It then invokes the service script, specifically docking_wf.sh in this instance Depending on the execution status of the service—whether it succeeds or fails—the corresponding task will either be removed or requeued This script operates throughout the lifecycle of each agent and the jobAgent_wf.sh utilizes three parameters.

• $ 1: pour créer les jobName avec la forme: MOTEUR_J $ 1

• $ 2: nom d'élément de storage (SE)

# $4 JOB RETRY COUNT, not used in this file

The script first extracts the helper scripts from the compressed archive and sets their permissions. It then loops indefinitely, fetching tasks with `./getTask`, which returns the task details: service, user, task ID, and arguments. If the service is not "none", the script checks whether the service directory exists. If the service is already deployed, it sets a flag; otherwise, it removes any existing service directory and creates a new one.

To deploy the service, the script copies its archive from the storage element with `lcg-cp --vo $vo lfn:DDMoteur/services/${service}.tar.gz file:`pwd`/${service}.tar.gz`, extracts it with `tar -xzf $service.tar.gz`, sets the permissions to 755 with `chmod -R 755 *`, and moves the service into place with `mv $service service/`. It then enters the service directory with `cd service/$service` and runs the service inside an if statement that tests whether it succeeded. The end of the loop reads:

        /usr/bin/time -p -o time.txt ./$service.sh $job $se $vo $user $taskId
        ./deleteTask $job
      else
        echo failure
        cd /
        ./setTaskWaiting $job
      fi
    else
      echo failure
      ./setTaskWaiting $job
    fi
  else
    echo no task
  fi
done
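The control flow of the agent can be summarized by the following sketch. It is not the team's script: get_task, delete_task, set_task_waiting, and run_service are hypothetical stubs that replace the Task Manager clients and the service script, the task queue is a plain text file, and the tasks are identified by their task ID for simplicity. Unlike the real agent, which never stops, the sketch ends when the queue is empty so that it can be run as a demonstration.

```shell
#!/bin/sh
# Stubs standing in for the Task Manager clients (getTask, deleteTask,
# setTaskWaiting); the queue is a text file with one "service taskId" per line.
QUEUE=${TMPDIR:-/tmp}/agent_queue.$$
LOG=${TMPDIR:-/tmp}/agent_log.$$

get_task() {
    # Print the next task and remove it from the queue, or "none" if empty.
    [ -s "$QUEUE" ] || { echo "none"; return; }
    head -n 1 "$QUEUE"
    tail -n +2 "$QUEUE" > "$QUEUE.tmp" && mv "$QUEUE.tmp" "$QUEUE"
}
delete_task()      { echo "deleted $1"  >> "$LOG"; }
set_task_waiting() { echo "requeued $1" >> "$LOG"; }
run_service()      { [ "$1" = "docking_wf" ]; }   # only docking_wf succeeds

# One agent: fetch a task, run its service, then delete or requeue the task.
agent_loop() {
    while :; do
        set -- $(get_task)
        [ "$1" = "none" ] && { echo "no task"; break; }
        if run_service "$1"; then
            delete_task "$2"
        else
            set_task_waiting "$2"
        fi
    done
}

# Demonstration with two tasks: one succeeds, one fails.
printf '%s\n' "docking_wf 41" "unknown_service 42" > "$QUEUE"
agent_loop
cat "$LOG"
rm -f "$QUEUE" "$LOG"
```

A task is removed only after its service has succeeded, so a task whose service fails stays available for another agent.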

3.6.4.2 The script for the simulation

The script is stored in the compressed package docking_wf.tar.gz, located in the data grid at the logical path: lfn:DDMoteur/services/docking_wf.tar.gz.

This script is adapted from the team's earlier script:

1 Work with the data grid instead of a directory on the team server

Use the path lfn:DDMoteur instead of ftp://osguser:fAHb5lt9@amga02.lpc-rd.fr

2 Create and store the new ligand based on the result of the AutoDock simulation

Add the new commands to the script

To add a new entry to the Information System database, a new Python application named ligand_insert.py was created. It inserts a new record into the "ligand" table with three attributes: name, pdbq_file, and src.

The src attribute is the path to the corresponding entry in the "simulation" table, that is, the simulation that produced this new ligand.

Table 6: Changes relative to the current version of the script for the AutoDock simulation

For details of the steps of this script, see Annex 1.

3.6.4.3 The script used as the executable in the grid-job description file

The script jobAgent_wf.sh could be used directly as the executable in the grid-job description file. However, its parameter list must be adapted to match what MOTEUR passes. We therefore use another script, AgentSubmission.sh, to invoke jobAgent_wf.sh.

AgentSubmission.sh is used as the executable of the grid job. The command that launches this script is:

./AgentSubmission.sh -jobNumber number -SE se -vo VO -f result

The value of jobNumber is therefore $2, the value of SE is $4, and the value of VO is $6. For example:

./AgentSubmission.sh -jobNumber 1 -SE cirigridse01.univ-bpclermont.fr

The script is very simple:

./jobAgent_wf.sh $2 $4 $6 > tmp
mv tmp $8
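The positional mapping used by this wrapper can be tried out with the sketch below. The function job_agent is a hypothetical stand-in for jobAgent_wf.sh that only echoes its three parameters, and the VO name biomed and the result path are assumed values chosen for the example. Option names are not parsed; as in the wrapper, only the positions $2, $4, $6, and $8 matter.

```shell
#!/bin/sh
# Hypothetical stand-in for jobAgent_wf.sh: echoes its three parameters.
job_agent() {
    echo "job=MOTEUR_J$1 se=$2 vo=$3"
}

# Same positional mapping as the wrapper:
#   -jobNumber $2  -SE $4  -vo $6  -f $8
# The output goes to a temporary file first, then to the result file $8.
agent_submission() {
    job_agent "$2" "$4" "$6" > "$8.tmp" && mv "$8.tmp" "$8"
}

# Demonstration (the VO name and the result path are assumed values).
result=${TMPDIR:-/tmp}/agent_result.$$
agent_submission -jobNumber 1 -SE cirigridse01.univ-bpclermont.fr -vo biomed -f "$result"
cat "$result"
rm -f "$result"
```

Writing to a temporary file and renaming it afterwards means the result file only appears once the agent has finished writing its output.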

The updated version of MOTEUR can now invoke the operations of the team's services. It was also used to test the docking_wf service and the processors of the TaskSubmission workflow. In addition, the AgentSubmission workflow was tested on the EGEE computing grid.

These results were presented in the poster and demonstration session of the EGEE 2008 conference in Istanbul, Turkey.

Testing the improvement of MOTEUR

The older version of MOTEUR works with services whose description files are written in the "rpc/encoded" style. The port types can be either basic or complex.

After the improvement, the current version of MOTEUR has been tested successfully with services whose description files are written in either style, "rpc/encoded" or "document/literal wrapped". The port types can again be either basic or complex.

The new version of MOTEUR has also been tested with several existing workflows of the RAINBOW team. It runs them correctly, without any modification to these workflows.

Evaluation of the docking_wf service

The docking_wf service is deployed on the team server. It is available at http://amga02.lpc-rd.fr:8080/axis2/services/docking_wf?wsdl

All the operations of the service (submitDocking, isFinished, and thresholder) were tested using MOTEUR. For each operation there is a test workflow:

Test workflow, corresponding operation, and description

1 submitTest_wf (operation: submitDocking). This workflow has four sources, named ligand_id, target_id, project_id, and user. Its sink is the identifier of the new simulation.

2 isFinishedTest_wf (operation: isFinished). This workflow has one source, named simulationId. Depending on the execution state of the simulation, the value of its sink is either "false" or the identifier of the simulation.

3 thresholderTest_wf (operation: thresholder). This workflow has two sources, named simulationId and eval. Depending on the comparison between the mean energy and the threshold, the value of its sink is either "MOTEUR_VOID" or the identifier of the new ligand.

Table 7: The list of test workflows for the TaskSubmission workflow

All these workflows were tested successfully. The value of the sink shows whether each operation works correctly.
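The sink values of isFinished and thresholder can be illustrated with the sketch below. It does not reproduce the service code: the direction of the comparison (a ligand is kept when the mean energy is at or below the threshold) is an assumption, since the text only says that the two values are compared, and the state label "done" is hypothetical.

```shell
#!/bin/sh
# Assumed rule: the new ligand is kept when the mean energy is at or below
# the threshold; the direction of the comparison is not stated in the text.
thresholder() {
    # $1: mean energy, $2: threshold (eval), $3: identifier of the new ligand
    if awk -v e="$1" -v t="$2" 'BEGIN { exit !(e + 0 <= t + 0) }'; then
        echo "$3"
    else
        echo "MOTEUR_VOID"
    fi
}

# isFinished yields the simulation identifier once the run is finished and
# "false" otherwise ("done" is a hypothetical state label).
is_finished() {
    # $1: simulation identifier, $2: execution state
    if [ "$2" = "done" ]; then
        echo "$1"
    else
        echo "false"
    fi
}

thresholder -9.5 -8.0 17    # prints 17
thresholder -6.2 -8.0 17    # prints MOTEUR_VOID
is_finished 30 running      # prints false
is_finished 30 done         # prints 30
```

The comparison is delegated to awk because the shell's own test command only compares integers, while docking energies are decimal values.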

Evaluation of the two workflows TaskSubmission and AgentSubmission

The TaskSubmission workflow is currently incomplete, so it cannot yet be launched together with the AgentSubmission workflow. However, the AgentSubmission workflow and the test workflows of the TaskSubmission processors can be run one after another.

Workflow, input, and how to verify

1 submitTest_wf

• Observe the Task Manager to verify that the new tasks are created

• Observe the Information System to verify that the new simulations are created

2 AgentSubmission: jobNumber = 20, to submit 20 agents to the EGEE grid

• Observe the MOTEUR graphical interface to check the states of the agents

• Observe the TaskManager to check whether the tasks are executed

• Observe the Information System to check the updates

3 isFinishedTest_wf: the identifiers of the 30 simulations submitted earlier

• The MOTEUR graphical interface and the list of entries in the table

4 thresholderTest_wf: the list of identifiers of the simulations

Compare the sink values in the MOTEUR graphical interface with

• the values of the mean_energy attribute in the table

• the values of the src attribute in the ligand table

Table 8: The test steps for the two workflows

The AgentSubmission workflow and the test workflows for the processors of the TaskSubmission workflow were tested successfully. For details, see Annex 2.

Conclusion

My thesis focuses on integrating the MOTEUR workflow engine into a grid environment by deploying it on the PCSV team's platform. MOTEUR has been enhanced so that it can invoke the team's bioinformatics services. Two workflows, TaskSubmission and AgentSubmission, have been implemented. Building on the bioinformatics services developed for the EMBRACE project, the docking_wf service is deployed on the platform to create a drug discovery workflow for the GWENDIA project.

Perspectives

The current version of MOTEUR needs a new class of processors that invoke a service operation at regular time intervals. These processors will be used to check whether tasks have completed within the TaskSubmission workflow. Once this class of processors exists, the implementation of the TaskSubmission workflow can be finalized.

The interconnection of the services can also be modified to create other bioinformatics workflows, not only the drug discovery workflow.

Currently, the submission and management of agents rely on the GASW processor of MOTEUR. This workflow can only manage agent states in the same way as grid jobs: an agent is resubmitted when an error occurs on the grid worker node. The workflow needs to be extended so that agents can also be resubmitted according to the number of tasks on the platform.

Once the two improvements above are complete, the platform can enter the production phase with a very large number of tasks and agents.
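The behaviour expected from such a periodic processor can be sketched in shell as follows. This is only an illustration of the polling idea, not a MOTEUR processor: poll_until_finished, fake_is_finished, and the identifier sim_30 are hypothetical names, and the stub simply answers "false" twice before returning an identifier, in the manner of the isFinished operation.

```shell
#!/bin/sh
# Call a check command every $2 seconds, at most $3 times.
# Print the first answer that is not "false"; fail if none arrives in time.
poll_until_finished() (
    check=$1
    interval=$2
    attempts=$3
    n=0
    while [ "$n" -lt "$attempts" ]; do
        answer=$($check)
        if [ "$answer" != "false" ]; then
            echo "$answer"
            exit 0
        fi
        n=$((n + 1))
        sleep "$interval"
    done
    exit 1
)

# Hypothetical stub for isFinished: answers "false" twice, then an identifier.
# The call count is kept in a file because each call runs in a subshell.
COUNTER=${TMPDIR:-/tmp}/poll_count.$$
fake_is_finished() {
    c=$(cat "$COUNTER" 2>/dev/null || echo 0)
    c=$((c + 1))
    echo "$c" > "$COUNTER"
    if [ "$c" -ge 3 ]; then echo "sim_30"; else echo "false"; fi
}

# Demonstration: succeeds on the third call (interval 0 keeps the demo fast).
poll_until_finished fake_is_finished 0 5
rm -f "$COUNTER"
```

Bounding the number of attempts is a design choice of the sketch; a production processor would also need a policy for simulations that never finish.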


Annex 1 The steps of the script docking_wf.sh

1 Assign values to the variables and set up the environment

At the beginning of the script, a command decompresses the autodock.tar.bz2 archive, which contains all the applications needed for the AutoDock simulation. This archive is stored inside docking_wf.tar.gz, which was previously copied by the jobAgent_wf.sh script.

VO=$3
tar -jxf autodock.tar.bz2
export AUTODOCK_UTI=
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./libstdc++-libc6.1-2.so.3
ulimit -s unlimited

2 Retrieve from the Information System the information for the parameters file, the ligand, and the target (the protein), and download them to the current directory

# get the parameters files from the database
project_id=`./getAttr_query.py /moteur/testMoteurDD/results/autodock/simulation/entry_$i project_id`
tmp=`./selectAttr_query.py
    "/moteur/testMoteurDD/results/autodock/project:project_id =
    /moteur/testMoteurDD/results/autodock/project:program_options`
software=`echo $tmp | tr '|' ' ' | awk '{print $1}' | tr '/' ' ' | awk '{print $NF}'`
echo software = $software

# download the parameters file from the lfn
lcg-cp --vo $VO lfn:${software} file:`pwd`/dpf3gen.awk

# get the information about the ligand and target
ligand_id=`./getAttr_query.py /moteur/testMoteurDD/results/autodock/simulation/entry_$i ligand_id`
echo ligand_id = $ligand_id
ligandSearchResults=`./getAttr_query.py /moteur/testMoteurDD/results/autodock/ligands/entry_${ligand_id} name pdbq_file`
ligand=`echo $ligandSearchResults | awk '{print $1}'`
ligandName=`basename $ligand pdbq`
target_id=`./getAttr_query.py /moteur/testMoteurDD/results/autodock/simulation/entry_$i target_id`
targetSearchResults=`./getAttr_query.py /moteur/testMoteurDD/results/autodock/target/entry_${target_id} name pdbqs`
target=`echo $targetSearchResults | awk '{print $1}'`
echo target = $target
echo ligand = $ligand

# download the target
if [ ! -e ${target}.tar.gz ]
then
    lcg-cp --vo $VO lfn:DDMoteur/targets/${target}.tar.gz file:`pwd`/${target}.tar.gz
    tar -zxf ${target}.tar.gz
fi

# download the ligand from the lfn
if [ ! -e ${ligand} ]
then
    lcg-cp --vo $VO lfn:DDMoteur/ligands/${ligand}.pdbq file:`pwd`/${ligand}.pdbq
fi
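The line that computes the software variable chains four filters, and its effect is easier to see on a concrete value. The sketch below wraps the same pipeline in a function and applies it to a made-up program_options value; both the function name and the sample value are hypothetical and serve only to illustrate the parsing.

```shell
#!/bin/sh
# Same pipeline as in the script: keep the first '|'-separated field,
# then keep the last '/'-separated component of that field.
last_component_of_first_field() {
    echo "$1" | tr '|' ' ' | awk '{print $1}' | tr '/' ' ' | awk '{print $NF}'
}

# Hypothetical program_options value, used only to illustrate the parsing.
last_component_of_first_field "options/autodock/dpf3gen.awk|extra|fields"
```

The pipeline splits fields on whitespace after substituting the separators, so it relies on the value itself containing no spaces.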

3 Run the docking and check the result

$AUTODOCK_UTI/mkdpf3 ${ligand}.pdbq ${target}.pdbqs

$AUTODOCK_UTI/autodock3 -p ${ligandName}.${target}.dpf -l

To check whether the AutoDock run succeeded, the script examines the log file ${ligand}_${target}_$i.dlg. If the log contains the phrase "autodock3: Successful Completion", the docking of ${ligand} with ${target} succeeded. If the log file is missing or AutoDock did not run properly, the run is reported as unsuccessful and the script exits with an error message.
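The success test described above amounts to searching the log for the completion marker. A minimal sketch follows, assuming the marker appears verbatim on one line of the log; the function name docking_succeeded and the demonstration file are hypothetical.

```shell
#!/bin/sh
# Succeeds only if the log file exists and contains AutoDock's completion marker.
docking_succeeded() {
    [ -f "$1" ] && grep -q "autodock3: Successful Completion" "$1"
}

# Demonstration with a made-up log file.
log=${TMPDIR:-/tmp}/demo_$$.dlg
echo "autodock3: Successful Completion on host example" > "$log"
if docking_succeeded "$log"; then
    echo "docking succeeded"
else
    echo "docking failed"
fi
rm -f "$log"
```

Testing for the file first makes a missing log count as a failure, which covers the case where AutoDock did not start at all.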

4 Create the new ligand, store it in the storage element, and update the Information System

# create the new ligand and store it on the lfn

The script extracts the best docking result with `./get_best_docking.sh ${ligand}_${target}_$i.dlg`. It then calls `lcg-del` on `lfn:DDMoteur/ligands/${ligand}_${target}_$i.pdbq` to remove any earlier copy, and `lcg-cr` with `-d $SE` to copy the local file `file:`pwd`/${ligand}_${target}_$i.pdbq` to the storage element and register it under that logical file name. Finally, it records the new ligand in the Information System:

./ligand_insert.py ${ligand}_${target}_$i.pdbq \
    DDMoteur/ligands/${ligand}_${target}_$i.pdbq \
    /moteur/testMoteurDD/results/autodock/simulation/entry_$i

5 Compress and store the dlg file in the storage element and update the Information System

The script compresses the log with `gzip ${ligand}_${target}_$i.dlg` and sets the variable `dlg_file` to the logical path of the compressed file, under `DDMoteur/dlgs/` with the extension `.dlg.gz`. It then calls `lcg-del` to delete any existing copy in the virtual organization (VO) and `lcg-cr` to copy the gzipped file, given as `file:` followed by its local path, to the storage element (SE) and register it. The dlg_file attribute of the simulation entry is then updated:

/moteur/testMoteurDD/results/autodock/simulation/entry_$i dlg_file

To update the "hits" and "file" tables in the Information System, the script decompresses the dlg file with `gunzip `pwd`/${ligand}_${target}_$i.dlg.gz`, initializes `hitid` to 0, and sets the number of runs with `numrun=50`. It then loops over each run from 1 to `numrun` to process the results:

./get-run ${ligand}_${target}_$i.dlg $run > output
if [ $? = 0 ]
then
    rank=`grep Rank output | cut -d "=" -f 2`
    echo rank $rank >> bkp
    if [ $rank = 1 ]
    then
        energy_level=`grep Docked output | cut -d "=" -f 2 | awk '{print $1}'`
        mean_energy=`./PickLowestofCL.py ${ligand}_${target}_$i.dlg | grep "\_1 " | awk '{print $5}'`
        echo meanenergy $mean_energy >> bkp
        cluster_count=`grep cluster output | cut -d "=" -f 2`
        ./get-coordinates ${ligand}_${target}_$i > coordinates
        ./hits_insert.py $hitid $i $rank $energy_level $run \
            $mean_energy $cluster_count coordinates

7 Delete the parameters file and the current results to prepare for the next simulation

# remove some files
rm ${ligandName}*
rm dpf3gen.awk
if [ `grep "success" ${1}.status | wc -l` -gt 0 ]
then
    exit 0
else
    exit 1
fi

Annex 2 Results of testing the two workflows: TaskSubmission and AgentSubmission

1 Submit simulations with the submitTest_wf workflow

Figure (Annex) 1: Observing the MOTEUR interface, the TaskManager, and the Simulation table of the Information System to verify the submitTest_wf workflow (1)

Figure (Annex) 2: Observing the MOTEUR interface, the TaskManager, and the Simulation table of the Information System to verify the submitTest_wf workflow (2)

2 Launch agents with the AgentSubmission workflow

Figure (Annex) 3: Observing the MOTEUR interface to check the states of the agents

Figure (Annex) 4: Observing the Task Manager to check the execution of the agents' tasks

Figure (Annex) 5: Observing the "hits" table of the Information System to verify the execution of the agents

3 Launch the isFinishedTest_wf workflow

Figure (Annex) 6: Observing the MOTEUR interface to verify the operation of the isFinishedTest_wf workflow

4 Launch the thresholderTest_wf workflow

Figure (Annex) 7: Observing the MOTEUR interface to check the list of input simulations

Figure (Annex) 8: Observing the MOTEUR interface and the "mean_energy" attribute of the "hits" table to verify the thresholderTest_wf workflow (1)

Title	Intégration D’un Moteur De Workflow Sur Un Environnement De Grille
Author	Tran Tuan Tu
Supervisor	M. Vincent Breton - Directeur de Recherche
School	Institut de la Francophonie pour l’Informatique
Major	Master de l’IFI
Type	thesis
Year of publication	2008
City	Hanoi
