Problem Statement
The primary focus of this work lies in the field of bioinformatics, specifically in the research and discovery of new drugs for dangerous diseases such as HIV, Ebola, and dengue fever using computational techniques. The challenge is twofold: the design of new medications, which is a lengthy and costly process, and the deployment of numerous docking simulations on a computing grid. Existing tools lack straightforward procedures for regular users, such as biologists and chemists, to efficiently manage resources for large-scale molecular docking. Consequently, these users face significant difficulties when using these applications, resulting in considerable losses of time and money in the pursuit of new treatments for neglected diseases.
Our Contribution
Our contribution focuses on developing a web portal for virtual screening that uses a computing grid to enhance the discovery of new drugs for serious and neglected diseases. We offer a user-friendly interface designed for non-expert users, such as chemists, biologists, and medical professionals, who may lack experience in computing and grid technology. To promote interoperability between the web portal and computing grid services, we propose an architecture that enables reliable analysis and processing of end-user requests.
Thesis Outline
This thesis is structured into four main sections: a literature review, design and implementation, demonstration and results, and conclusions and future perspectives. The first section provides an overview of virtual screening, docking, and the AutoDock tool, followed by a discussion of grid computing technology, the GVSS portal, the WISDOM platform used in drug discovery, and DIRAC. The second section focuses on the implementation of the portal, detailing the proposed architecture, design, and execution. The penultimate section presents a demonstration of the portal along with the results obtained. Finally, the thesis concludes with a general summary and outlines future perspectives.
Today, scientific projects generate and analyze unprecedented amounts of information, requiring an extraordinary level of computing power. Leading this data-processing challenge are the LHC experiments at CERN, which accumulate tens of petabytes of data annually, but other scientific fields are also nearing these limits. Consequently, users must be able to exploit globally distributed resources easily and efficiently. Various studies describe the development and deployment of applications on grid computing infrastructures, demonstrating effective resource utilization. Since users are rarely experts in computing and grid systems, they require a means of accessing the necessary grid resources that conceals the complexity of the underlying infrastructure. This section details the virtual screening technique and the docking approach, the principles of grid computing, and its role in discovering new treatments for neglected and dangerous diseases. We also introduce the GVSS portal and the WISDOM platform, designed to facilitate access to grid computing services, as well as DIRAC.
– Ligand: a structure, generally a small molecule, that binds to a binding site.
– Receptor: a structure, generally a protein, that contains the active binding site.
– Binding site: the active regions of the protein that physically interact with the ligand to form a compound.
In-silico Drug Design
Computer-aided drug design uses computational chemistry for the discovery, enhancement, and analysis of drugs and biologically active molecules. This technology plays a crucial role at several stages of the drug development process:
• In identifying potentially therapeutic compounds, using virtual screening.
• In optimizing the affinity and selectivity of candidate molecules toward lead compounds ("leads"), also called prototypes.
• In optimizing the lead with respect to the desired pharmacological properties while maintaining the molecule's good affinity.
All of these stages where computational tools intervene are shown in the following summary diagram.
Figure 1 – In-silico drug design process [11]
Virtual Screening
Introduction
Virtual screening encompasses a range of computational techniques aimed at exploring compound databases to discover new molecules. These methods are often likened to filters that help create sets of molecules sharing specific properties, allowing the selection of those most likely to interact with a given target.
Virtual screening is widely used today to identify new bioactive substances and to predict the binding of a vast database of ligands to specific targets, with the aim of pinpointing the most promising compounds. The method focuses on identifying small molecules that interact with target protein sites for further analysis and treatment. More specifically, virtual screening is defined as the automated evaluation of large compound libraries by computer programs. It refers to in-silico techniques: computer-based methods, mathematical models, and simulations that aid drug discovery and identify new compounds likely to bind to a known 3D target molecule.
Figure 2 – In-silico virtual screening (http://serimedis.inserm.fr)
Given the rapid growth in the number of known proteins, virtual screening continues to gain importance.
High-throughput screening is an effective method for discovering new inhibitors and medications, particularly during the early phases of drug development. Its primary aim is to select, from diverse chemical libraries, a reduced set of molecules that show superior potential activity against a targeted therapeutic protein. The process focuses on identifying the essential structural motifs in ligand-receptor interactions and on distinguishing the most promising compounds within oriented chemical libraries made up of molecules from the same series.
Virtual screening is a highly valuable tool that accelerates the discovery of new treatments by searching libraries of small molecules to identify the structures most likely to bind to drug targets, typically protein receptors. Its effectiveness depends on the amount of information available about a given disease target. Virtual screening techniques have become essential in medicinal chemistry, enhancing the drug discovery phase, and are routinely employed both in public research laboratories and in major pharmaceutical companies.
Discovering New Drugs with Virtual Screening
Virtual screening is the most commonly used in-silico strategy for identifying compounds ("hits") in the search for new drugs. It has become an integral part of most bioactive-compound research programs, whether academic or industrial, as a crucial complement to high-throughput biological screening. Virtual screening allows large chemical libraries (over 10^6 molecules) to be explored for compounds active against a specific therapeutic target. The process significantly reduces the initial chemical library to a short list of the most promising compounds, often markedly enriching the proportion of molecules active against the target.
The use of in-silico screening significantly improves hit rates compared with random selection from a chemical library, leading to remarkable reductions in both the time and the cost of identifying new compounds. By conducting in-silico screening before smaller-scale biological assays, researchers can tailor the number of in-vitro tests to budget and time constraints. When conditions allow, biological screening can be performed alongside virtual screening to evaluate its effectiveness and refine the parameters of the software used.
The relevance of the molecules used is crucial to the success of virtual screening, even more so than the algorithms employed to identify interactions. A diverse compound library is essential to ensure a thorough exploration of chemical space and thereby increase the likelihood of discovering new compounds. Additionally, to avoid wasting time on molecules whose characteristics are incompatible with pharmaceutical interest, the screening process typically includes a preliminary filtering step. This task, often handled by specialized programs, is vital for the efficiency of the screening process and serves, for example, to exclude toxic compounds. Then, only compounds obeying simple empirical definitions of the active-molecule profile are retained.
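Such empirical profile filters can be sketched in a few lines. A common example is Lipinski's rule of five, applied here to hypothetical molecule records whose properties are assumed to be precomputed (real pipelines derive them with cheminformatics toolkits):

```python
# Illustrative pre-filtering of a compound library using Lipinski's
# rule of five. Molecule records, names, and property values are
# hypothetical examples; properties are assumed precomputed.

def passes_rule_of_five(mol):
    """Return True if the molecule obeys Lipinski's empirical profile."""
    return (mol["mol_weight"] <= 500        # molecular weight (Da)
            and mol["logp"] <= 5            # octanol-water partition coeff.
            and mol["h_donors"] <= 5        # hydrogen-bond donors
            and mol["h_acceptors"] <= 10)   # hydrogen-bond acceptors

library = [
    {"name": "cmpd-001", "mol_weight": 342.4, "logp": 2.1, "h_donors": 2, "h_acceptors": 5},
    {"name": "cmpd-002", "mol_weight": 712.9, "logp": 6.3, "h_donors": 6, "h_acceptors": 12},
]

# Only compounds matching the empirical active-molecule profile survive.
shortlist = [m for m in library if passes_rule_of_five(m)]
```

Filters of this kind are cheap to evaluate, which is why they are applied before the far more expensive docking stage.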
The Different Virtual Screening Strategies
There are two distinct approaches to virtual screening, depending on the nature of the available experimental information. The first, known as "structure-based virtual screening", relies on the target's structure and often involves protein-ligand docking algorithms. It estimates the structural complementarity of each screened molecule with the active site in question. These techniques, however, tend to be more computationally intensive and typically require greater expertise.
The second method, known as "ligand-based virtual screening", relies on sufficient knowledge of one or more reference active molecules. This approach is quick and relatively easy to implement; its main drawback is its dependence on the reference information used to build the affinity-prediction model. While the two approaches are often used separately, combining them during screening can improve the chances of identifying new "hits". In this work, we use the structure-based approach.
High-Throughput Virtual Screening
Molecular docking simulation is a valuable method for predicting the interaction potential of small molecules at protein binding sites, which is essential for Structure-Based Drug Discovery (SBDD). Various docking programs, including DOCK, GOLD, AutoDock, Glide, LigandFit, and FlexX, have proven effective in the in-silico drug discovery pipeline. The fundamental approach of molecular docking is to generate all possible conformations of a docked molecule and evaluate their orientations to identify the most favorable binding mode using a scoring function. Exhaustively searching all valid conformations of a compound, however, is time-consuming, underlining the need for efficient docking simulations. Large-scale, high-throughput screening (HTS) will therefore consume substantial computing resources.
Docking thousands of compounds to a target protein requires several teraflops per task, yet existing tools lack straightforward procedures for regular users to organize resources efficiently for large-scale molecular docking. Grid technology heralds a new era in virtual screening thanks to its effectiveness and cost-efficiency. Traditional in-vitro testing is typically very expensive when conducted at scale; virtual screening gives scientists an effective tool for selecting candidate compounds for in-vitro testing.
Consequently, high-throughput virtual screening could save enormous amounts of money compared with classical in-vitro tests.
Conclusion
We introduced the concept of virtual screening strategies, which use computational approaches to predict the properties of molecular libraries. With the significant rise in publicly available experimental data, the field has made remarkable progress in the throughput, quality, and diversity of its predictions. An overview of in-silico screening applications was provided, focusing on specific use cases and future developments. Virtual screening offers a complementary solution to traditional screening methods.
High-throughput screening (HTS) benefits from innovative computational techniques that streamline the experimental testing of molecules. The primary advantage of in-silico virtual screening is its ability to produce a concise list of candidate molecules for experimental evaluation, thereby reducing costs and saving time. The method allows a wide array of molecules to be explored rapidly, letting researchers focus on the most promising candidates in subsequent experimental phases. The inherent challenges of high-throughput techniques and of chemical-compound optimization have spurred the development of new strategies, including molecular docking-based virtual screening methods.
Docking
Introduction
Modeling the structure of a protein-ligand complex is crucial for understanding the binding interactions between a potential ligand compound and its therapeutic protein target. This process plays a significant role in modern structure-based drug design, facilitating the development of effective medications.
Molecular docking, also known simply as "docking" or "ligand docking", is a highly useful process that aims to predict the structure of a molecular complex formed by small molecules in protein binding sites, accelerating in-silico drug research and discovery. It involves determining the 3D structure of protein complexes at the atomic scale, allowing a better understanding of their biological function. Docking specifically entails finding the optimal position of a ligand within a receptor's binding site so as to maximize interactions, and evaluating ligand-protein interactions to discriminate the experimentally observed pose from the others.
Historically, early docking tools followed the "lock-and-key" principle, where the ligand acts as the key, fitting geometrically into the active site of the receptor, which serves as the lock.
Ligands are small molecules designed to inhibit the activity of a protein, which serves as the receptor. Docking predicts intermolecular structures in 3D, including the binding modes and possible conformations of a ligand in a receptor, while also calculating the binding energy. The technique assesses binding strength, complex energy, and the types of interactions produced, estimating the binding affinity between the two molecules. It therefore plays a crucial role in deciding which candidate ligand will interact most effectively with a target protein receptor.
Protein-ligand docking is a technique used to assess the structure, position, and orientation of a protein when it interacts with small molecules known as ligands. Its primary aim is to predict and rank the structures resulting from the association between a given ligand and a target protein of known 3D structure.
Figure 3 – Protein-ligand docking
Ligand-protein docking remains the most commonly used method, as it allows thousands, even millions, of molecules to be evaluated rapidly.
A docking program should generate the expected binding modes for ligands of known position in the active site within a reasonable timeframe. To achieve this, the conformational search algorithm must explore the conformational space thoroughly and efficiently. The quality of the docking is typically assessed by measuring the Root Mean Square Deviation (RMSD) of the atoms between the docked pose and the experimentally observed pose, when available.
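The RMSD criterion can be computed directly from paired atomic coordinates. A minimal sketch, assuming the two poses are already atom-matched and superimposed (the coordinates below are hypothetical):

```python
import math

def rmsd(coords_a, coords_b):
    """Root Mean Square Deviation between two equally sized sets of
    atomic coordinates (lists of (x, y, z) tuples)."""
    assert len(coords_a) == len(coords_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

# Hypothetical 3-atom docked pose vs. experimentally observed pose,
# every atom shifted by 1 unit along y:
pose = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
ref  = [(1.0, 1.0, 0.0), (2.0, 1.0, 0.0), (3.0, 1.0, 0.0)]
deviation = rmsd(pose, ref)  # 1.0
```

The lower the RMSD, the closer the predicted pose is to the experimental one.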
Docking Approaches
Docking approaches differ in their conditions of application and in the type of information they provide. The relevance of a particular docking program depends primarily on how well these characteristics match the system under study. The effectiveness of the chosen algorithm is then a trade-off between execution speed and result accuracy.
Depending on the desired outcome and the required level of precision, three degrees of molecular flexibility are typically considered: rigid (both molecules are treated as rigid), semi-flexible (one rigid molecule and one flexible), and flexible (both molecules are flexible). The semi-flexible model is most often applied to protein-ligand interactions: the smaller ligand is regarded as flexible while the protein is treated as rigid, to simplify the system.
The docking process consists of making a small organic molecule interact with the receptor, generally a protein. The docking technique comprises four main steps:
1. Prepare the protein files.
2. Prepare the ligand files.
3. Prepare the grid parameter files.
4. Prepare the docking parameter files.
The diagram below clearly shows the docking steps.
Docking Principle
Molecular docking proceeds in two complementary stages. The first, docking proper, searches for ligand conformations that can establish ideal interactions with the receptor, using search algorithms such as genetic algorithms and Monte Carlo methods. The second, scoring, employs mathematical functions to distinguish correct docking poses from incorrect ones. These functions estimate interaction strength and binding affinity, allowing the interaction energy between ligand and receptor to be computed rapidly in order to identify the most favorable conformations.
The formula used for scoring is the following:
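AutoDock, the program used in this work, relies on a semi-empirical free-energy function. A commonly cited decomposition of the estimated binding free energy, a sketch of the general form rather than the exact calibrated expression, is:

```latex
\Delta G_{bind} \approx \Delta G_{vdW} + \Delta G_{hbond} + \Delta G_{elec}
                      + \Delta G_{desolv} + \Delta G_{tor}
```

where the terms account, respectively, for van der Waals contacts, hydrogen bonding, electrostatics, desolvation, and the entropic cost of freezing rotatable bonds; each term is weighted by an empirically calibrated coefficient.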
The figure below illustrates the docking/scoring principle, where R represents a receptor structure while A, B, and C represent the small molecules.
Figure 5 – Docking/scoring illustration [6]
Docking can be understood qualitatively by observing the ligand entity within the protein cavity, as well as quantitatively through the analysis of data derived from scoring functions.
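The two-stage search/score loop can be sketched with a toy Monte Carlo search. Here the one-dimensional "pose" coordinate and the quadratic scoring function are stand-ins for the real rotational, translational, and torsional degrees of freedom and for calibrated force-field scores:

```python
import math
import random

def score(x):
    """Hypothetical scoring function: lower is better, minimum at x = 2.0.
    A stand-in for a real force-field score."""
    return (x - 2.0) ** 2

def monte_carlo_dock(steps=5000, temperature=0.1, seed=42):
    """Metropolis Monte Carlo search over a single 'pose' coordinate."""
    rng = random.Random(seed)
    x = rng.uniform(-10.0, 10.0)              # random starting pose
    best_x, best_s = x, score(x)
    for _ in range(steps):
        candidate = x + rng.gauss(0.0, 0.5)   # random pose perturbation
        delta = score(candidate) - score(x)
        # Metropolis criterion: always accept improvements; occasionally
        # accept worse poses so the search can escape local minima.
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            x = candidate
        if score(x) < best_s:
            best_x, best_s = x, score(x)
    return best_x, best_s

best_pose, best_score = monte_carlo_dock()
```

Real docking programs follow the same pattern at much larger scale: propose a pose, score it, and keep track of the best-scoring conformations found.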
Docking Tools
At present, more than 30 molecular docking programs (commercial or otherwise) are available [6]. The most frequently cited are AutoDock, GOLD, FlexX, DOCK, and ICM, powerful tools that enable rapid screening of extensive compound libraries. These programs typically rely on specialized algorithms, such as genetic algorithms and simulated annealing. Their protocols consist of two essential steps, docking and scoring, which are critical for effective molecular interaction analysis.
To perform the docking task, molecular docking tools generate a series of different ligand binding poses and use a scoring function to evaluate the ligand binding affinities of the generated poses in order to determine the best binding mode.
Figure 6 – Comparison of docking programs [16]
As the figure above shows, AutoDock is the most cited and most widely used of the docking programs.
Conclusion
The docking process is a crucial first step in drug design, involving the interaction between a small organic molecule and a protein receptor. A primary advantage of protein-ligand docking methods is their ability to generate structural hypotheses about how a small molecule interacts with its macromolecular target. Studies indicate that some docking algorithms are more reliable than others at reproducing the experimental binding modes of ligands, but these techniques often demand more computation time and resources. Conversely, projects that require the virtual screening of millions of compounds are better served by simpler algorithms that favor speed and cost-efficiency through approximations. The number of available docking programs has grown considerably over recent decades, with popular examples including LigandFit, FlexX, and AutoDock. In this study, we used the AutoDock program.
Docking is a type of application that can easily be distributed across a grid. The EGEE project (Enabling Grids for E-sciencE), funded by the European Commission, has made numerous computing and storage resources available. Its goal is to build on the latest advances in grid technologies and to operate a 24/7 grid infrastructure service.
AutoDock
Docking with AutoDock
AutoDock requires the types, charges, and bonding lists of each atom to perform the docking procedure. First, the Protein Data Bank (PDB) must be searched, at websites such as http://www.pdb.org and http://www.rcsb.org, for the PDB files corresponding to the protein and the ligand.
Figure 7 – Docking procedure with AutoDock
The docking procedure with AutoDock breaks down into several steps:
1. Prepare the protein input file. In this step a PDBQT file (Protein Data Bank, Partial Charge (Q), & Atom Type (T)) is created, containing the atoms and their partial charges.
The user has two options for preparing the protein: either use the ADT tool, or run the following command:
> /usr/local/MGLTools-1.5.6/bin/pythonsh /usr/local/MGLTools-1.5.6/MGLToolsPckgs/AutoDockTools/Utilities24/prepare_receptor4.py -r protein.pdb
2. Prepare the ligand input file. This step is very similar to the protein preparation: we create a PDBQT file for the ligand. The preparation is carried out as follows:
> /usr/local/MGLTools-1.5.6/bin/pythonsh /usr/local/MGLTools-1.5.6/MGLToolsPckgs/AutoDockTools/Utilities24/prepare_ligand4.py -l ligand.pdb
3. Generate the grid parameter file. We must now define the 3D space that AutoDock will consider for the docking. In this phase we create the input files for AutoGrid4, which will produce the various map files and the grid parameter file "gpf" (grid parameter file).
> input ligand.pdbqt & protein.pdbqt
> /usr/local/MGLTools-1.5.6/bin/pythonsh /usr/local/MGLTools-1.5.6/MGLToolsPckgs/AutoDockTools/Utilities24/prepare_gpf4.py -l ligand.pdbqt -r protein.pdbqt
4. Generate the map files and the grid data file. In the previous step we created the grid parameter file; we now use AutoGrid4 to generate the various map files and the main grid data file.
> input protein.pdbqt & protein.gpf
After launching autogrid, several new files with the .map extension are created, one for each ligand atom type, along with auxiliary files. These files play a crucial role in the docking process.
5. Generate the docking parameter file. This step consists of preparing the docking file (dpf).
> input ligand.pdbqt & protein.pdbqt
> /usr/local/MGLTools-1.5.6/bin/pythonsh /usr/local/MGLTools-1.5.6/MGLToolsPckgs/AutoDockTools/Utilities24/prepare_dpf4.py -l ligand.pdbqt -r protein.pdbqt
You can prepare the parameter files for grid and docking without using the ADT tool by utilizing a shell script (see appendix) to create these files.
The results of this script are, respectively, the dpf ("docking parameter file") and gpf ("grid parameter file") files.
6. At this stage we will have created a whole set of files. The penultimate step consists of running autodock with the command below:
> output result.dlg protein_ligand.gpf
> autodock4 -p protein_ligand.dpf -l result.dlg
7. The last step is devoted to analyzing the docking results, once the docking procedure has completed successfully. The best docking results are the conformations with the lowest energy. AutoDock can perform a first analysis of the results by grouping solutions into classes (clusters) according to their spatial proximity. The proximity between two solutions is measured by the Root Mean Square Deviation (RMSD) of their atomic coordinates. If the RMSD between molecules is below a threshold distance, the two solutions belong to the same class. This distance threshold is called the "cluster tolerance" and its default value in AutoDock is 0.5. The parameter is passed to AutoDock through the "dpf" parameter file before the docking is launched.
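The clustering described in this last step can be sketched as follows. Coordinates and energies are hypothetical, the RMSD is a simplified per-coordinate version, and conformations are processed in order of increasing energy so that each cluster is seeded by its lowest-energy member:

```python
import math

# Sketch of AutoDock-style result clustering: a conformation joins an
# existing cluster if its RMSD to the cluster seed is below the
# tolerance; otherwise it starts a new cluster.

def rmsd(a, b):
    """Simplified RMSD over flat coordinate lists of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))

def cluster(conformations, tolerance=0.5):
    """conformations: list of (energy, coords) pairs."""
    clusters = []
    for energy, coords in sorted(conformations):      # lowest energy first
        for members in clusters:
            seed_coords = members[0][1]               # lowest-energy member
            if rmsd(coords, seed_coords) < tolerance:
                members.append((energy, coords))
                break
        else:
            clusters.append([(energy, coords)])
    return clusters

# Three hypothetical docking runs: two nearby low-energy poses and one
# distant higher-energy pose.
runs = [(-7.2, [0.0, 0.0]), (-7.0, [0.1, 0.1]), (-5.4, [3.0, 3.0])]
groups = cluster(runs)
```

With the 0.5 tolerance, the two nearby poses fall into one cluster and the distant pose forms its own, mirroring how AutoDock ranks clusters by their best energy.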
Conclusion
As described in the docking section, the docking process with AutoDock involves several steps that require files to be prepared in advance. Understanding the docking process is crucial for advancing our knowledge of molecular interaction mechanisms and for developing predictive tools in medicine. In this section we presented the docking procedure using AutoDock 4.2 with AutoDockTools and applied the docking steps to a concrete example, mastering a technique that will help us in the next phase: launching docking jobs on the computing grid.
Grid Computing
Introduction
Researchers today work on understanding climate change, oceanographic studies, monitoring and modeling environmental pollution, materials science, combustion processes, drug design, molecular simulation, and data processing in particle physics. They face computational challenges that require more powerful processors, greater data storage capacity, and better analysis and visualization tools. Recent advances in high-speed networking have enabled the creation of high-performance distributed systems on a global scale, often built from clusters of PCs or parallel computing resources. Parallel scientific applications, however, are inherently resource-intensive, and it is useful to run them elsewhere when local resources, such as laboratory clusters or computing centers, are insufficient. Corporate computers rarely operate at full capacity, so exploiting their idle time can free up significant computing power and storage space, often at lower cost than investing in new hardware. Grid computing technologies enable the secure sharing of data and programs across many computers, from desktops and personal computers to supercomputers. This networking makes it possible to build a virtual system with immense computational power and storage capacity, essential for scientific or technical projects that demand extensive processing cycles or large data volumes.
Grid Computing
Grid computing is an emerging technology that aims to provide the scientific community with virtually unlimited computing resources. At its most ambitious, it is a software infrastructure that unites numerous distributed computing resources, databases, and specialized applications worldwide. Prabhu defines grid computing as "a set of computing resources distributed over a local or wide area network that appears to an end user or a large application as a virtual computing system."
The goal of grid computing is to achieve flexible, coordinated resource sharing and cooperative problem solving within virtual organizations (VOs).
Grid computing was originally conceived as a network of computers whose computational and storage resources are shared on demand. It provides protocols, applications, and development tools for dynamic, large-scale resource sharing that is tightly controlled, defining who shares what and under which conditions. A grid system is inherently dynamic, as resource providers and users come and go over time. The technology enables virtual organizations that pool complementary skills and resources across multiple institutions, presenting a cohesive whole to people working toward a common goal too complex for a single team. Grid technologies support the sharing, exchange, discovery, selection, and aggregation of diverse, geographically distributed resources over the Internet, including sensors, computers, databases, visualization devices, and scientific instruments. Grid computing is widely applied in fields such as chemistry, bioinformatics, mathematics, and biomedicine.
Figure 8 – The computing grid
Virtual Organization
A computing grid supports multiple virtual organizations, which share its resources among themselves. A Virtual Organization (VO) is a group of researchers with similar scientific interests and requirements who collaborate with other members and share resources, such as data, software, programs, CPU, and storage space, regardless of geographical location. Each virtual organization manages its own member list according to its needs and objectives, and researchers must join a VO to use the computing resources provided by EGI, the European Grid Infrastructure.
EGI (European Grid Infrastructure) is the continuation of the EGEE project; it aims to sustain the grid infrastructure by making it accessible to all scientific disciplines while integrating innovations in distributed computing. EGI provides the support, services, and tools that enable Virtual Organization (VO) members to exploit their resources effectively. EGI currently hosts over 200 VOs, representing communities with diverse interests including earth sciences, medicine, bioinformatics, computer science, mathematics, and life sciences.
General Architecture of a Computing Grid
The architecture of a computing grid is structured in layers. While each project may have its own architecture, a general framework is helpful for understanding the key concepts of grids; its layers are outlined below.
• The Fabric layer
The lowest layer interacts directly with the hardware to provide the shared resources. Physically, this layer supplies the essential resources: processors for computation, databases, directories, and network resources.
• The Network layer
This layer implements the core communication and authentication protocols required for grid transactions. The communication protocols enable data exchange between fabric-layer resources, while the authentication protocols build on the communication services to provide secure mechanisms for verifying the identities of users and resources.
• The Resource layer
This layer uses the services of the connectivity and fabric layers to collect information about resource characteristics and to monitor and control the resources.
The resource layer deals only with the essential characteristics of individual resources and their individual behavior, without considering their global interactions. That responsibility falls to the collective layer, which addresses the interplay between resources.
• The Collective layer
This layer manages interactions between resources, handling scheduling and the co-allocation of resources when users request several at once. It decides which computing resource should execute a task based on estimated cost. It also provides data replication services and is responsible for monitoring and failure detection.
• The Application layer
The top layer of the model consists of the software that uses the grid to deliver resources to users, whether for computation or data retrieval. Applications draw on the services of every layer of the architecture.
Figure 9 – Layers of the computing grid
Grid Components
This section discusses the key components of a grid computing environment. Depending on the design and intended use of the grid application, some of these components are required while others are not, and in some cases they can be combined. The components of a grid computing infrastructure include:
• The grid portal
A grid portal serves as the interface for service requesters, including private, public, and commercial users, allowing them to design and access a wide array of resources, services, applications, and tools. It hides the complexity of the underlying network architecture from end users.
• The information service
The information service provides details about the available resources, including their total capacity, current availability, usage, and pricing. This data is later used by the grid portal and the resource planner to identify computing resources that match user demand.
• The Resource Broker
A Resource Broker serves as an intermediary between a requesting service and the available service providers within a grid. Its primary role is to dynamically identify available resources and select the most suitable ones to allocate to a given job, ensuring efficient resource management and optimal job execution.
• The resource scheduler
Once resources have been identified, the next step is to schedule the work across the available resources. A resource scheduler is needed because certain tasks take priority over others and some jobs require greater autonomy.
• Grid users
Grid users are the resource consumers of a computing grid, spanning categories as diverse as scientists, military personnel, educators, businesses, and medical professionals. Their classification depends primarily on the problems they aim to solve using the grid infrastructure.
• The grid resource manager
The grid resource manager assesses resource needs, executes jobs, monitors their status, and returns outputs upon completion. It interacts with the resource broker to allocate resources and assigns tasks to the appropriate ones. The manager must also authenticate users and verify their authorization to access resources before assigning jobs.
Grid operation
The computing grid operates on the principle of resource pooling: a vast array of distributed computing resources is interconnected through a high-speed network. These resources are provisioned from various geographic locations, enabling efficient and scalable computing.
The grid system operates by associating each created job with a "jobstep" and a set of "workunits." These workunits are ready to be executed on grid resources and contain all the necessary information: the data, the required parameters, and the program to run. Agents installed on each machine in the grid connect to the grid server at regular intervals to retrieve jobs, following a "pull" model. Before downloading data, an agent checks its cache to avoid unnecessary transfers. Once the data is verified, the agent executes the scientific program, archives the results, and sends the result archive back to the grid server.
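The agent loop just described can be sketched as follows. The server methods, workunit fields, and cache policy below are assumptions made for illustration, not the API of any particular grid middleware.

```python
class GridAgent:
    """Sketch of a grid worker agent following a pull model: the agent
    asks the server for work at intervals, caches downloaded data, runs
    the program, and sends the result archive back."""

    def __init__(self):
        self.cache = {}  # data checksum -> previously downloaded data

    def fetch_data(self, server, workunit):
        # Check the local cache first to avoid an unnecessary transfer.
        checksum = workunit["data_checksum"]
        if checksum not in self.cache:
            self.cache[checksum] = server.download(checksum)
        return self.cache[checksum]

    def poll_once(self, server):
        # Pull model: the agent requests work, not the other way around.
        workunit = server.next_workunit()
        if workunit is None:
            return None  # nothing to do this interval
        data = self.fetch_data(server, workunit)
        result = workunit["program"](data, workunit["parameters"])
        # Send the result back to the grid server.
        server.upload_result(workunit["id"], result)
        return result
```

In a real deployment `poll_once` would run on a timer and the upload would ship an archive file; the cache check is what lets a second workunit over the same input skip the download entirely.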
Each completed job is linked to one or more results, which the user can download together. The key steps in the operation of the computing grid and the interaction among its components are illustrated in the figure below.
Figure 10 – Computing grid architecture [10]
As the figure above shows, the grid components operate as follows:
• Grid users submit their jobs to the grid's Resource Broker.
• The Resource Broker performs resource discovery and obtains pricing information using the information service.
• The grid Resource Manager authenticates the user and verifies that the user's account holds sufficient credit to deploy grid resources.
• The Resource Scheduler then executes the job on resources with appropriate cost and performance.
• The broker collects the results and passes them back to the grid user.
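The broker's cost-based choice in the steps above can be condensed into a small sketch. The dictionary fields (`free_cpus`, `price`) are illustrative assumptions, not a real information-service schema.

```python
def broker_select(job, resources):
    """Pick the cheapest resource able to run a job, in the spirit of
    the broker workflow above. Field names are illustrative."""
    # Resource discovery result, filtered to those that can host the job.
    candidates = [r for r in resources if r["free_cpus"] >= job["cpus"]]
    if not candidates:
        raise RuntimeError("no suitable resource found")
    # Cost-based selection, as described for the broker.
    return min(candidates, key=lambda r: r["price"])
```

A real broker would also weigh queue lengths, data locality, and user authorization, but the shape of the decision is the same.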
Advantages and challenges of the grid
The advantages of such an architecture are numerous and undeniable. Examples include:
• Deployment of unused resources
Grid computing is a powerful concept designed to harness the untapped computing power of idle PCs. Nowadays, many computers remain underutilized for extended periods, with their processors rarely reaching full capacity. This technology allows the downtime of hundreds or thousands of computers and servers to be used and sold to anyone in need of massive computing power.
• Based on a client/server architecture
The computing grid is built on a robust and secure client/server architecture, specifically tailored to meet the unique requirements of grid computing technology.
• Better return on hardware
Machines are clearly underutilized in many industries. A grid system is therefore an ideal solution: economically beneficial for businesses, practically advantageous for users, and allowing better resource optimization.
The research challenges faced by current grid computing technologies include:
• Dynamicity: grid resources are managed and controlled by multiple organizations and can enter or leave the grid at any time. This constant flux can increase the load on the grid.
• Administration: grid technology is essentially a pool of shared resources that requires heavy system administration to coordinate properly.
• Power: the grid provides many computing services that consume a great deal of electrical energy, so an uninterruptible power supply is essential.
Conclusion
In this part, we have seen that the computing power required for fundamental scientific research often exceeds what current technology can offer.
The computing grid revolutionizes how researchers access resources, on the principle that a complex application can be broken down into smaller, independent tasks.
Grid computing technology offers an economically appealing solution by harnessing the unused computing power and storage space of a vast array of computers connected through a network. It has proven effective across many sectors, including commerce, enterprise, education, science, research, and development. Virtualization removes geographical and economic constraints on resources, enabling large projects to be completed in a shorter timeframe, and reduces reliance on a central server or supercomputer. However, it is crucial for grid technology to address security and privacy concerns.
The GVSS portal
Introduction
Since the first global avian influenza data challenge in 2005, the Academia Sinica Grid Computing Center (ASGCC) has focused on developing and refining virtual screening for neglected and emerging diseases, including avian influenza and dengue fever. Molecular docking simulation is a time-consuming process that requires an exhaustive search over all possible conformations of a compound. This large-scale in-silico process benefits from high-speed grid computing technology, which provides intensive computational power and efficient data management. The e-infrastructure (EUAsia VO) supports in-silico drug discovery for epidemic diseases in Asia.
GAP (Grid Application Platform) and GVSS (Grid-enabled Virtual Screening Services) were developed around the AutoDock 3.0.5 docking engine. GAP is a high-level application development environment for creating grid application services, while GVSS is a Java-based graphical user interface designed to facilitate large-scale molecular docking in the gLite grid environment. End users of GVSS can specify target compounds, set docking parameters, monitor docking jobs and computational resources, visualize and refine docking results, and download final outcomes. There are also ongoing efforts to enhance biomedical activities and integrate more dynamic resources to support large-scale virtual screening simulations in Asia. For instance, scientists studying new target structures need to know how to model and prepare those targets with AutoDockTools, which highlights the need for a user-friendly interface enabling collaboration, job submission, progress tracking, docking visualization, and result analysis.
Users prepare virtual screening files in the GVSS graphical user interface and select computing grid resources on which to submit jobs. These computational tasks are managed by GAP/DIANE, which distributes computing agents across the grid. The calculation results are handled by AMGA, a metadata catalog that records where results are stored on the grid's storage elements.
Figure 11 – GVSS portal (http://gvss2.twgrid.org/)
To facilitate large-scale molecular docking in a grid environment, ASGC developed the GVSS (Grid-enabled Virtual Screening Services) application, which integrates the gLite DIANE2/GANGA middleware and the AMGA metadata catalog from EGEE. All computational tasks are managed by GAP/DIANE to distribute work effectively to grid computing workers. The calculation results are stored and organized using AMGA, which serves as a metadata catalog over the storage elements. GVSS uses AutoDock as its docking engine and results from the integration of several frameworks designed for grid computing applications.
The GAP platform
GAP (Grid Application Platform) is a high-level application development environment for creating grid application services using the MVC (Model-View-Controller) approach. It divides grid application development into three main stages: application porting ("gridification"), complex job workflow design, and customized user interface development. Corresponding to these stages, the GAP system comprises three sub-frameworks: the core framework, the application framework, and the presentation framework.
The core framework provides an abstraction layer over the underlying distributed computing environment, hiding the complexities of user and job management behind a well-defined set of Java APIs. Through object-oriented design, the core framework has also been extended with a high-level job management interface based on DIANE.
The application framework introduces an action-based approach for developing advanced workflows and complex applications that address real scientific problems. By using the core framework APIs, application developers can focus on designing workflows without worrying about the underlying details, or about changes in the computing environment where the jobs will run.
Unlike the core and application frameworks, the GAP presentation framework is left open: applications are free to choose their preferred Java-based interface technology, such as a web portal or a graphical interface.
GVSS architecture
In the GVSS service, AMGA is used to manage the indexing of distributed docking results. Following a data analysis workflow, a comprehensive set of metadata on the compound library, target proteins, and docking results is curated by the participating biologists. The DIANE framework has been integrated for distributed job management in the GVSS service, giving complete control over job submission and management on the grid and minimizing the effort required to interact with the grid environment. In addition, a Java graphical application gives end users access to GVSS services, leveraging GAP's core and application frameworks to streamline communication with the computing grid.
Figure 12 – Architecture of the GAP virtual screening service (GVSS) [7]
Conclusion
GVSS is designed to predict how small molecules interact with receptors, significantly reducing costs by drawing on computing grid resources on demand. The GVSS portal enhances drug discovery by giving users simultaneous, instant access to grid resources while hiding the complexity of the grid environment from end users.
Platforms used
WISDOM
WISDOM (Wide In Silico Docking On Malaria) is an initiative launched in 2005 to leverage advanced information technologies and large-scale docking applications in the search for drugs against malaria and other neglected diseases. The primary objective of WISDOM is to demonstrate the effectiveness of grid computing in researching treatments for these critical health issues. Collaborating closely with EGEE, WISDOM uses its infrastructure to process vast amounts of data. The initiative marks a significant step towards in-silico drug research on a grid infrastructure. The WISDOM Production Environment (WPE), developed by the LPC in Clermont-Ferrand, France, has successfully supported the project in identifying new inhibitors for malaria. The platform hides the complexity of grid computing, enabling users to access grid resources efficiently for their calculations.
WISDOM is a middleware designed as an experiment management environment: it manages data and jobs and distributes the workload across the integrated resources, regardless of differing technological standards, and it enables the creation of web services that interact seamlessly with the system. WISDOM functions as a set of generic services providing an abstraction layer over the resources and generic management of data and jobs, so that application services can use the underlying services transparently. The WISDOM initiative has three main objectives: the biological objective is to propose new inhibitors for a protein family produced by Plasmodium; the biomedical objective is to deploy an in-silico docking application on a computing grid infrastructure; and the grid objective is to deploy a computationally intensive application that generates substantial data, in order to test the grid infrastructure and its services. Users interact only with high-level services and do not need to understand the underlying grid resource mechanisms.
The WISDOM Production Environment (WPE) is an intermediary software layer installed on computing resources to manage data and jobs while distributing the workload across all integrated resources. It allows the creation of web services that interact with the system. The four main components of WPE are:
• The Task Manager interacts with the client and hosts the tasks created by the client.
• The Job Manager submits jobs to the computing elements (CEs), ensuring that the tasks overseen by the Task Manager are executed efficiently.
• The WISDOM Information System (WIS) uses AMGA (ARDA Metadata Grid Application) to store all the metadata required by the Job Manager.
• The Data Manager handles files on the computing grid.
The "Job Manager" module receives requests and submits pilot jobs to the computing grid for execution within the "Task Manager." This process requires a certificate corresponding to the virtual organization where the jobs will be submitted Tasks are recorded and managed by the Task Manager, with agents retrieving and executing tasks on the grid Additionally, the "WISDOM Information System" (WIS) monitors the status of agents and oversees pilot agent information on the grid, while the data manager handles files in batch mode on the grid.
The goal of WISDOM is to demonstrate the effectiveness of grid computing in the search for treatments for neglected diseases, generating a large volume of data efficiently and cost-effectively by leveraging grid infrastructure and resources. Despite its successful application at large scale, the platform has certain limitations:
• Its information service slows down after long execution times.
• Pilot agents are killed for unknown reasons.
• The WPE platform stops when the Task Manager remains empty for a long period.
DIRAC
DIRAC (Distributed Infrastructure with Remote Agent Control) is a software framework for distributed computing, originally developed for the LHCb high-energy physics experiment at CERN's LHC. It offers a complete solution for one or several user communities needing access to distributed resources.
It creates a layer between users and computing grid resources, offering a unified interface to a variety of heterogeneous providers. This integration ensures interoperability, transparency, and reliability in the use of resources.
Figure 14 – DIRAC middleware (http://diracgrid.org/)
The modular architecture of DIRAC is designed for easy expansion to meet specific application needs, following the service-oriented architecture (SOA) paradigm. DIRAC components fall into four groups: resources, services, agents, and interfaces.
1. Services
All DIRAC services are developed in Python and implemented as XML-RPC servers. The standard Python library offers a complete implementation of the XML-RPC protocol for both servers and clients.
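As a minimal illustration of this style of service, the Python standard library alone suffices to stand up an XML-RPC server and client. The `ping` method below is invented for demonstration and is not part of DIRAC's actual interface.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Stand up a toy XML-RPC service on an OS-assigned port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
port = server.server_address[1]
# Register a single method; "ping" is purely illustrative.
server.register_function(lambda name: "pong:" + name, "ping")
threading.Thread(target=server.serve_forever, daemon=True).start()

# A client calls the remote method as if it were local.
client = ServerProxy(f"http://127.0.0.1:{port}")
reply = client.ping("dirac")
server.shutdown()
```

Real DIRAC services add authentication, service discovery, and a richer protocol on top, but the remote-procedure-call shape is the same.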
2. Computing Element
The Computing Element (CE) in DIRAC is an API that abstracts the common job manipulation operations performed by batch computing systems, and also provides access to state information about the computational resource.
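The idea of a CE abstraction can be sketched as an abstract interface with pluggable backends. The method names below are assumptions made for illustration, not DIRAC's real API.

```python
from abc import ABC, abstractmethod

class ComputingElement(ABC):
    """Abstract interface over a batch system, in the spirit of the CE
    described above. Method names are illustrative assumptions."""

    @abstractmethod
    def submit_job(self, executable, arguments):
        """Submit a job; return a job identifier."""

    @abstractmethod
    def get_status(self, job_id):
        """Return the current state of a submitted job."""

class LocalCE(ComputingElement):
    """Trivial backend that completes jobs immediately, for testing.
    A real backend would wrap a batch system such as HTCondor or PBS."""

    def __init__(self):
        self.jobs = {}

    def submit_job(self, executable, arguments):
        job_id = len(self.jobs) + 1
        self.jobs[job_id] = "Done"  # a real CE would report Waiting/Running
        return job_id

    def get_status(self, job_id):
        return self.jobs.get(job_id, "Unknown")
```

Code written against `ComputingElement` never sees which batch system executes the job, which is exactly the point of the abstraction.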
3. Workload Management System
The Workload Management System (WMS) consists of three main components: the Job Management Service (JMS), distributed agents executing close to the DIRAC computing elements, and the Job Wrapper. The JMS is a collection of services that receive and queue jobs, serve jobs in response to agent requests, and provide job status information. Agents continuously monitor the availability of computing elements (CEs), retrieve jobs from the JMS, and steer job execution on the local computing resources. The Job Wrapper prepares job execution on the worker node, retrieves the job's input sandbox, sends job status updates to the JMS, and returns the job output. Jobs are described using the Job Description Language (JDL).
4. Data Management System
The Data Management System encompasses file catalog services that track the available data sets and their replicas, along with tools for data access and replication.
A database that tracks executed jobs and metadata of available datasets also maintains information on the physical replicas of files.
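The replica-tracking role described above amounts to a mapping from logical file names to their physical copies. The sketch below is a toy illustration of that idea, not AMGA or DIRAC code; the LFN and URLs are invented.

```python
class ReplicaCatalog:
    """Toy mapping of logical file names (LFNs) to physical replicas,
    illustrating the catalog role described above."""

    def __init__(self):
        self.replicas = {}  # LFN -> {storage element: physical file name}

    def register(self, lfn, storage_element, pfn):
        # Record one more physical copy of the logical file.
        self.replicas.setdefault(lfn, {})[storage_element] = pfn

    def lookup(self, lfn):
        # All known physical copies of the logical file, by storage element.
        return self.replicas.get(lfn, {})
```

Replication tools then consult `lookup` to pick the closest copy or to decide where a new replica is needed.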
The Storage Element (SE) integrates a standard server, such as GridFTP, with configuration service data for access management The SE API enables the dynamic integration of transport protocol modules, facilitating access to the SE.
• File transfer service
File transfer is a delicate operation, susceptible to failures and to network or storage errors in the associated software services. A reliable file transfer service (RFTS) therefore retries failed operations automatically until they complete successfully.
The Configuration Service (CS) provides the configuration parameters required by other services, ensuring collaboration between agents and jobs.
The DIRAC platform deploys pilot agents on worker nodes as regular jobs, using the grid's scheduling mechanisms. The user first creates a task; a pilot agent is then submitted to the grid to execute it.
Submitting jobs to the computing grid through the DIRAC middleware requires a dedicated language, the Job Description Language (JDL), which is the standard way to describe jobs in the grid computing environment. Below is an example of a JDL script that submits a docking job to the computing grid.
InputSandbox  = {"dock1.sh", "LFN:/biomed/user/l/louacheni/file.tar.gz"};
OutputSandbox = {"std.out", "std.err", "fileDock1.tar.bz2"};
DIRAC is a versatile product used by many partners in different contexts to build distributed computing systems that draw on diverse computational resources: individual computers, clusters, or computing grids. Its modular architecture allows rapid adaptation to the needs of different user communities, easing their access to grid resources and services. Key advantages include ease of installation and configuration, efficient operation of its various services, and the capacity to manage large volumes of data. Its primary objective is to make distributed computing environments more efficient and accessible.
We aim to help users communicate effectively with the grid environment, allowing them to submit, control, and monitor their jobs seamlessly. For this purpose, we have chosen DIRAC as a key component in the development of the portal.
The WPE and DIRAC platforms share the goal of letting users communicate with the computing grid environment to submit, control, and monitor jobs. Their architectures, however, differ significantly. In WPE, a pilot agent is submitted to the grid first; the user then creates a task in the Task Manager, and a single WPE pilot agent executes multiple tasks. In DIRAC, the user creates the task first, and a pilot agent is then submitted to the grid for that specific task; a DIRAC pilot agent executes a single task.