40 The new biology and the Grid

Kim Baldridge and Philip E. Bourne
University of California, San Diego, California, United States

Grid Computing – Making the Global Infrastructure a Reality. Edited by F. Berman, A. Hey and G. Fox. © 2003 John Wiley & Sons, Ltd. ISBN: 0-470-85319-0.

40.1 INTRODUCTION

Computational biology is undergoing a revolution from a traditionally compute-intensive science conducted by individuals and small research groups to a high-throughput, data-driven science conducted by teams working in both academia and industry. It is this new biology as a data-driven science in the era of Grid Computing that is the subject of this chapter. This chapter is written from the perspective of bioinformatics specialists who seek to fully capitalize on the promise of the Grid and who are working with computer scientists and technologists developing biological applications for the Grid.

To understand what has been developed and what is proposed for utilizing the Grid in the new biology era, it is useful to review the ‘first wave’ of computational biology application models. In the next section, we describe the first wave of computational models used to date for computational biology and computational chemistry.

40.1.1 The first wave: compute-driven biology applications

The first computational models for biology and chemistry were developed for the classical von Neumann machine model, that is, for sequential, scalar processors. With the emergence of parallel computing, biological applications were developed that could take advantage of multiple-processor architectures with distributed or shared memory and locally located disk space to execute a collection of tasks. Applications that compute the molecular structure or electronic interactions of a protein fragment are examples of programs developed to take advantage of emerging computational technologies.

As distributed-memory parallel architectures became more prevalent, computational biologists became familiar with message-passing library toolkits, first with Parallel Virtual Machine (PVM) and more recently with the Message Passing Interface (MPI). This enabled biologists to take advantage of distributed computational models as a target for executing applications whose structure is that of a pipelined set of stages, each dependent on the completion of a previous stage. In pipelined applications, the computation involved in each stage can be relatively independent of the others. For example, one computer may perform molecular computations and immediately stream results to another computer for visualization and analysis of the data generated. Another application scenario is that of a computer used to collect data from an instrument (say, a tilt series from an electron microscope), which is then transferred to a supercomputer with a large shared memory to perform a volumetric reconstruction, which is in turn rendered on yet another high-performance graphics engine. The distribution of the application pipeline is driven by the number and type of tasks to be performed, the available architectures that can support each task, and the I/O requirements between tasks.
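A two-stage pipeline of this kind can be written directly against MPI. The following is a minimal sketch in Python with mpi4py (ours, not from the chapter); the compute() and analyse() functions are placeholders for the molecular-computation and visualization stages.

```python
# pipeline.py -- run with, for example:  mpiexec -n 2 python pipeline.py
# A minimal two-stage MPI pipeline: rank 0 computes, rank 1 analyses/visualizes.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def compute(step):
    # Placeholder for the molecular computation stage.
    return {"step": step, "values": [step * 0.1] * 8}

def analyse(result):
    # Placeholder for the visualization/analysis stage.
    print("analysed step", result["step"],
          "mean =", sum(result["values"]) / len(result["values"]))

if rank == 0:                          # producer stage
    for step in range(10):
        comm.send(compute(step), dest=1, tag=0)
    comm.send(None, dest=1, tag=0)     # sentinel: no more work
elif rank == 1:                        # consumer stage
    while True:
        result = comm.recv(source=0, tag=0)
        if result is None:
            break
        analyse(result)
# Any additional ranks simply exit.
```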
While the need to support these applications continues to be very important in computational biology, an emerging challenge is to support a new generation of applications that analyze and/or process immense amounts of input/output data. In such applications, the computation on each of a large number of data points can be relatively small, and the ‘results’ of an application are provided by the analysis, and often the visualization, of the input/output data. For such applications, the challenge to infrastructure developers is to provide a software environment that promotes application performance and can leverage large numbers of computational resources for simultaneous data analysis and processing. In this chapter, we consider these new applications that are forming the next wave of computational biology.

40.1.2 The next wave: data-driven applications

The next wave of computational biology is characterized by high-throughput, high-technology, data-driven applications. The focus on genomics, exemplified by the human genome project, will engender new science impacting a wide spectrum of areas, from crop production to personalized medicine. And this is just the beginning. The amount of raw DNA sequence being deposited in the public databases doubles every 6 to 8 months. Bioinformatics and computational biology have become a prime focus of academic and industrial research. The core of this research is the analysis and synthesis of immense amounts of data, resulting in a new generation of applications that require information technology as a vehicle for the next generation of advances.

Bioinformatics grew out of the human genome project in the early 1990s. The requests for proposals for the physical and genetic mapping of specific chromosomes called for developments in informatics and computer science, not just for data management but for innovations in algorithms and the application of those algorithms to synergistically improve the rate and accuracy of the genetic mapping. A new generation of scientists was born, whose demand still significantly outweighs their supply, and who have been brought up on commodity hardware architectures and fast turnaround. This is a generation that contributed significantly to the fast adoption of the Web by biologists and that wants instant gratification, a generation that makes a strong distinction between wall-clock time and CPU time. It makes no difference if an application runs 10 times as fast on a high-performance architecture (minimizing execution time) if you have to wait 10 times as long for a result by sitting in a long queue (maximizing turnaround time). In data-driven biology, turnaround time is important in part because of sampling: a partial result is generally useful while the full result is being generated. We will see specific examples of this subsequently; for now, let us better grasp the scientific field we wish the Grid to support.
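Both quantitative points are easy to check with a few lines of arithmetic. The sketch below is purely illustrative: the doubling-time figures come from the text, while the queue-wait numbers are invented to make the turnaround point concrete.

```python
# growth_and_turnaround.py -- back-of-the-envelope checks (illustrative only)

# Doubling every 6 to 8 months implies roughly a 2.8x to 4x growth per year.
for months in (6, 8):
    print(f"doubling every {months} months -> x{2 ** (12 / months):.1f} per year")

# Wall-clock versus CPU time: a 10x faster machine buys nothing if the queue
# wait grows by a larger factor.  Hypothetical numbers, in hours:
desktop_run, desktop_wait = 10.0, 0.0
hpc_run, hpc_wait = desktop_run / 10, 99.0
print("desktop turnaround:", desktop_run + desktop_wait, "h")
print("HPC turnaround:    ", hpc_run + hpc_wait, "h")
```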
The complexity of new biology applications reflects exponential growth rates at different levels of biological complexity. This is illustrated in Figure 40.1, which highlights representative activities at different levels of biological complexity. While bioinformatics is currently focusing on the molecular level, this is just the beginning. Molecules form complexes that are located in different parts of the cell. Cells differentiate into different types, forming organs such as the brain and the liver. Increasingly complex biological systems generate increasingly large and complex biological data sets. If we do not solve the problems of processing the data at the level of the molecule, we will not solve problems of higher-order biological complexity.

[Figure 40.1 From biological data comes knowledge and discovery. The figure plots levels of biological complexity (sequence, structure, assembly, sub-cellular, cellular, organ, higher life) against the years 1990 to 2005, together with the growth of data volume, computing power, and sequencing technology. Representative milestones include ESTs, the E. coli, yeast, C. elegans, and human genomes, gene chips, virus and ribosome structures, a model metabolic pathway of E. coli, genetic circuits, neuronal and cardiac modeling, and brain mapping; a lower track traces the path from biological experiment through data, information, and knowledge to discovery (collect, characterize, compare, model, infer).]

Technology has catalyzed the development of the new biology, as shown on the right vertical axis of Figure 40.1. To date, Moore’s Law has at least allowed data processing to keep approximate pace with the rate of data produced. Moreover, the cost of disks, as well as the communication-access revolution brought about by the Web, has enabled the science to flourish. Today, it costs approximately 1% of what it did 10 to 15 years ago to sequence one DNA base pair. With the current focus on genomics, data rates are anticipated to far outweigh Moore’s Law in the near future, making Grid and cluster technologies more critical for the new biology to flourish.

Now and in the near future, a critical class of new biology applications will involve large-scale data production, data analysis and synthesis, and access through the Web and/or advanced visualization tools to the processed data, held in high-performance databases ideally federated with other types of data. In the next section, we illustrate in more detail two new biology applications that fit this profile.

40.2 BIOINFORMATICS GRID APPLICATIONS TODAY

The two applications in this section require large-scale data analysis and management, wide access through Web portals, and visualization. In the sections below, we describe CEPAR (Combinatorial Extension in PARallel), a computational biology application, and CHEMPORT, a computational chemistry framework.

40.2.1 Example 1: CEPAR and CEPort – 3D protein structure comparison

The human genome, and the 800 other less advertised but very important genomes that have been mapped, encode genes. Those genes are the blueprints for the proteins that are synthesized by reading the genes. It is the proteins that are considered the building blocks of life. Proteins control all cellular processes and define us as a species and as individuals. A step on the way to understanding protein function is protein structure – the 3D arrangement that recognizes other proteins, drugs, and so on. The growth in the number and complexity of protein structures has undergone the same revolution as shown in Figure 40.1, and can be observed in the evolution of the Protein Data Bank (PDB; http://www.pdb.org), the international repository for protein structure data.

A key element in understanding the relationship between biological structure and function is to characterize all known protein structures. From such a characterization comes the ability to infer the function of a protein once its structure has been determined, since similar structure implies similar function. High-throughput structure determination is now happening in what is known as structural genomics – a follow-on to the human genome project in which one objective is to determine all protein structures encoded by the genome of an organism.

While a typical protein consists of 300 of any one of 20 different amino acids – a total of 20^300 possibilities, more than all the atoms in the universe – nature has performed her own reduction, both in the number of sequences and in the number of protein structures as defined by discrete folds. The number of unique protein folds is currently estimated at between 1000 and 10 000. These folds need to be characterized, and all new structures tested to see whether they conform to an existing fold or represent a new fold. In short, characterizing how all proteins fold requires that they be compared to each other in 3D in a pairwise fashion.

With approximately 30 000 protein chains currently available in the PDB, and with each pair taking 30 s to compare on a typical desktop processor using any one of several algorithms, computing all pairwise comparisons is a (30 000 × 30 000/2) × 30 s problem, that is, a total of 428 CPU years on one processor.
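A quick check of this estimate, using the same round numbers as the text:

```python
# cpu_years.py -- reproduce the chapter's back-of-the-envelope cost estimate
chains = 30_000
seconds_per_pair = 30
pairs = chains * chains // 2                     # ~4.5e8 pairwise comparisons
cpu_seconds = pairs * seconds_per_pair
cpu_years = cpu_seconds / (365.25 * 24 * 3600)
print(f"{pairs:.2e} pairs, {cpu_years:.0f} CPU years")   # ~428 CPU years
```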
Using a combination of data reduction (a pre-filtering step that permits one structure to represent a number of similar structures), data-organization optimization, and efficient scheduling, this computation was performed on 1000 processors of the 1.7-Teraflop IBM Blue Horizon in a matter of days using our Combinatorial Extension (CE) algorithm for pairwise structure comparison [1]. The result is a database of comparisons that is used by a worldwide community of users 5 to 10 thousand times per month and has led to a number of interesting discoveries cited in over 80 research papers. The resulting database is maintained by the San Diego Supercomputer Center (SDSC) and is available at http://cl.sdsc.edu/ce.html [2]. The procedure to compute and update this database as new structures become available is equally amenable to Grid and cluster architectures, and a Web portal has been established to permit users to submit their own structures for comparison.

In the next section, we describe the optimization used to reduce execution time and increase the applicability of the CE application to distributed and Grid resources. The result is a new version of the CE algorithm that we refer to as CEPAR. CEPAR distributes each 3D comparison of two protein chains to a separate processor for analysis. Since each pairwise comparison represents an independent calculation, this is an embarrassingly parallel problem.
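Because every pairwise comparison is independent, the workload decomposes naturally into a stream of chain pairs. The sketch below (ours, not the CEPAR code) generates that stream and groups it into chunks, the ‘assignment packets’ referred to in the optimization list in the next section:

```python
# packets.py -- illustrative decomposition of the all-against-all comparison
from itertools import combinations, islice

def assignment_packets(chain_ids, packet_size=100):
    """Yield chunks ('assignment packets') of independent pairwise comparison tasks."""
    pairs = combinations(chain_ids, 2)          # every unordered pair exactly once
    while True:
        packet = list(islice(pairs, packet_size))
        if not packet:
            return
        yield packet

if __name__ == "__main__":
    chains = [f"chain_{i:05d}" for i in range(1000)]         # stand-in identifiers
    count = sum(1 for _ in assignment_packets(chains, packet_size=500))
    print(count, "packets of up to 500 pairs")               # 999 packets here
```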
40.2.1.1 Optimizations of CEPAR

The optimization of CEPAR involves structuring CE as an efficient and scalable master/worker algorithm. While initially implemented on 1024 processors of Blue Horizon, the algorithm and the optimizations undertaken can execute equally well on a Grid platform. The addition of resources available on demand through the Grid is an important next step for problems requiring computational and data-integration resources of this magnitude.

We have employed algorithmic and optimization strategies, based on numerical studies of CEPAR, that have made a major impact on performance and scalability. To illustrate what can be done in distributed environments, we discuss them here. The intent is to familiarize the reader with one approach to optimizing a bioinformatics application for the Grid.

Using a trial version of the algorithm without optimization (Figure 40.2), performance bottlenecks were identified. The algorithm was then redesigned and implemented with the following optimizations:

1. The assignment packets (chunks of data to be worked on) are buffered in advance.
2. The master-processor algorithm prioritizes incoming messages from workers, since such messages influence the course of further calculations.
3. Workers processing a data stream that no longer poses any interest (based on a result from another worker) are halted. We call this early stopping.
4. Standard single-processor optimization techniques are applied to the master processor.

[Figure 40.2 Scalability of CEPAR running on a sample database of 3422 data points (protein chains): speedup versus number of processors, up to 1024. The circles show the performance of the trial version of the code. The triangles show the improved performance after improvements 1, 2, and 4 were added to the trial version. The squares show the performance based on timing obtained with an early stopping criterion (improvement 3). The diamonds illustrate ideal scaling.]

With these optimizations, the scalability and performance of CEPAR were significantly improved, as can be seen from Figure 40.2. The MPI implementation on the master processor is straightforward, but it was essential to use buffered sends (or another means, such as asynchronous sends) in order to avoid communication-channel congestion. In summary, with 1024 processors the CEPAR algorithm outperforms CE (no parallel optimization) by 30 to 1 and scales well. It is anticipated that this scaling would continue even on a larger number of processors.

One final point concerns the end-process load imbalance: a large number of processors can remain idle while the final few do their job. We chose to handle this by breaking runs involving a large number of processors into two separate runs. The first run does most of the work and exits when the early stopping criterion is met; the second run then completes the task for the outliers using a small number of processors, thus freeing the remaining processors for other users. Ease of use of the software is maintained through an automatic two-step job-processing utility.

CEPAR has been developed to support our current research efforts on PDB structure-similarity analysis on the Grid. The CEPAR software uses MPI, which is a universal standard for interprocessor communication; it is therefore suitable for running in any parallel environment that has an implementation of MPI, including PC clusters and Grids. There is no dependence on the particular structural alignment algorithm or on the specific application. The CEPAR design provides a framework that can be applied to other problems facing computational biologists today in which large numbers of data points need to be processed in an embarrassingly parallel way; pairwise sequence comparison, as described subsequently, is an example. Researchers and programmers working on parallel software for these problems may find useful the information on the bottlenecks and the optimization techniques used to overcome them, as well as the general approach of using numerical studies to aid algorithm design, briefly reported here and given in more detail in Reference [1].
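The following mpi4py sketch shows the general shape of such a master/worker scheme with asynchronous sends and an early-stopping signal. It illustrates the ideas only and is not the CEPAR implementation: the packet size, the stopping threshold, and the placeholder compare_pair() function are all invented.

```python
# master_worker.py -- an illustrative sketch only (not the CEPAR source).
# Run with, for example:  mpiexec -n 8 python master_worker.py
from mpi4py import MPI
from itertools import combinations, islice

TAG_WORK, TAG_RESULT, TAG_STOP = 1, 2, 3

def compare_pair(pair):
    # Placeholder for the pairwise 3D structure comparison (~30 s per pair in CE).
    i, j = pair
    return (i, j, abs(i - j) % 7)                  # fake similarity score

def packets(n_chains=200, size=50):
    pairs = combinations(range(n_chains), 2)
    while True:
        chunk = list(islice(pairs, size))
        if not chunk:
            return
        yield chunk

def master(comm, nworkers):
    work, results, pending = packets(), [], []
    enough = 2000                                  # invented early-stopping threshold
    active = 0
    for w in range(1, nworkers + 1):               # prime workers with async sends
        first = next(work, None)
        tag = TAG_WORK if first is not None else TAG_STOP
        pending.append(comm.isend(first, dest=w, tag=tag))
        active += tag == TAG_WORK
    status = MPI.Status()
    while active:                                  # handle results as they arrive
        data = comm.recv(source=MPI.ANY_SOURCE, tag=TAG_RESULT, status=status)
        results.extend(data)
        nxt = None if len(results) >= enough else next(work, None)
        tag = TAG_WORK if nxt is not None else TAG_STOP   # early stop or work exhausted
        pending.append(comm.isend(nxt, dest=status.Get_source(), tag=tag))
        active -= tag == TAG_STOP
    MPI.Request.Waitall(pending)
    print("collected", len(results), "comparisons")

def worker(comm):
    status = MPI.Status()
    while True:
        packet = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == TAG_STOP:
            break
        comm.send([compare_pair(p) for p in packet], dest=0, tag=TAG_RESULT)

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    master(comm, comm.Get_size() - 1) if comm.Get_rank() == 0 else worker(comm)
```

Handling worker messages as soon as they arrive and replying with either more work or a stop tag mirrors optimizations 1 to 3 above; the nonblocking isend calls stand in for the buffered or asynchronous sends the text recommends for avoiding congestion at the master.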
But what of the naive user wishing to take advantage of high-performance Grid computing?

40.2.1.2 CEPAR portals

One feature of CEPAR is the ability to allow users worldwide to provide their own structures for comparison and alignment against the existing database of structures. This service currently runs on a Sun Enterprise server as part of the CE Web site (http://cl.sdsc.edu/ce.html) outlined above. Each computation takes, on average, three hours of CPU time for a single user request. On occasion, this service must be turned off, as the number of requests for structure comparisons far outweighs what can be processed on a Sun Enterprise server. To overcome this shortage of compute resources, a Grid portal has been established to handle this situation (https://gridport.npaci.edu/CE/) using SDSC’s GridPort technology [3]. The portal allows this computation to be done using additional resources when available. The initial target compute resources for the portal are the IBM Blue Horizon, a 64-node Sun Enterprise server, and a Linux PC cluster of 64 nodes.

The GridPort Toolkit [3] is composed of a collection of modules that are used to provide portal services running on a Web server, together with template Web pages needed to implement a Web portal. The function of GridPort is simply to act as a Web front end to Globus services [4], which provide a virtualization layer for distributed resources. The only requirements for adding a new high-performance computing (HPC) resource to the portal are that the CE program be recompiled on the new architecture and that Globus services be running on it. Together, these technologies allowed the development of a portal with the following capabilities:

• Secure and encrypted access for each user to his/her HPC accounts, allowing submission, monitoring, and deletion of jobs, and file management;
• Separation of the client application (CE) and the Web portal services onto separate servers;
• A single, common point of access to multiple heterogeneous compute resources;
• Availability of real-time status information on each compute machine;
• Easy adaptability (e.g. addition of newly available compute resources, modification of user interfaces, etc.).

40.2.1.3 Work in progress

While the CE portal is operational, much work remains to be done. A high priority is the implementation of a distributed file system for the databases, user input files, jobs in progress, and results. A single shared, persistent file space is a key component of the distributed abstract machine model on which GridPort was built. At present, files must be explicitly transferred from the server to the compute machine and back again; while this process is invisible to the user, from the point of view of portal development and administration it is not the most elegant solution to the problem (a sketch of this staging step follows below). Furthermore, the present system requires that the all-against-all database be stored locally on the file system of each compute machine, which means that database updates must be carried out individually on each machine. These problems could be solved by placing all user files, along with the databases, in a shared file system that is available to the Web server and all HPC machines. Adding Storage Resource Broker (SRB) [5] capability to the portal would achieve this. Work is presently ongoing on automatically creating an SRB collection for each registered GridPort user; once this is complete, SRB will be added to the CE portal.
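As a rough illustration of that explicit staging step, the sketch below copies a user-uploaded structure to a compute machine and retrieves the result afterwards. The host name, account, paths, and the use of scp as the transfer mechanism are all assumptions made for illustration; the real portal moves files through its own GridPort/Globus machinery.

```python
# stage_job.py -- illustrative only; host, account, and paths are hypothetical
import subprocess

HOST = "hpc.example.org"          # assumed compute resource
USER = "ceportal"                 # assumed portal account

def stage_and_run(structure_file, remote_dir="/scratch/ce/jobs/job001"):
    """Copy a user-uploaded structure to the compute machine, then fetch results."""
    remote = f"{USER}@{HOST}:{remote_dir}/"
    subprocess.run(["scp", structure_file, remote], check=True)       # stage in
    # ... the portal would now submit the CE comparison job via Globus/GridPort ...
    subprocess.run(["scp", f"{remote}results.txt", "."], check=True)  # stage out

if __name__ == "__main__":
    stage_and_run("query.pdb")
```

A shared SRB-backed file space, as proposed above, would make both transfers unnecessary: the Web server and every HPC machine would simply see the same files.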
Another feature that could be added to the portal is the automatic selection of the compute machine. Once ‘real-world’ data on CPU allocation and turnaround time becomes available, it should be possible to write scripts that inspect the queue status on each compute machine and allocate each new CE search to the machine expected to produce results in the shortest time.
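Such a selection script might look like the following sketch. It is entirely hypothetical: query_queue() is a stub standing in for a real queue-status probe, and the per-machine timings are invented.

```python
# pick_machine.py -- choose the compute resource with the shortest expected turnaround
# (illustrative sketch; replace query_queue() with a real queue-status probe)

MACHINES = {
    "blue-horizon":  {"avg_run_h": 1.0},   # hypothetical per-machine CE runtimes
    "sun-e10k":      {"avg_run_h": 3.0},
    "linux-cluster": {"avg_run_h": 2.0},
}

def query_queue(machine):
    """Stub: return (jobs_waiting, avg_wait_per_job_h) for a machine."""
    fake = {"blue-horizon": (40, 0.5), "sun-e10k": (2, 0.5), "linux-cluster": (5, 0.5)}
    return fake[machine]

def expected_turnaround(machine):
    waiting, per_job = query_queue(machine)
    return waiting * per_job + MACHINES[machine]["avg_run_h"]

def pick_machine():
    return min(MACHINES, key=expected_turnaround)

if __name__ == "__main__":
    for m in MACHINES:
        print(f"{m:15s} expected turnaround ~{expected_turnaround(m):.1f} h")
    print("submit to:", pick_machine())
```

Note that the fastest machine is not necessarily the chosen one: a long queue on the most powerful resource can make a slower but idle machine the better choice, which is exactly the wall-clock versus CPU-time distinction drawn earlier.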
Note that the current job-status monitoring system could also be improved. Work is under way to add an event daemon to the GridPort system, such that compute machines could notify the portal directly when, for example, searches are scheduled, start, and finish. This would alleviate the portal’s reliance on intermittent inspection of the queue of each HPC machine and provide near-instantaneous status updates. Such a system would also allow the portal to be regularly updated with other information, such as warnings when compute machines are about to go down for scheduled maintenance, broadcast messages from HPC system administrators, and so on.

40.2.2 Example 2: Chemport – a quantum mechanical biomedical framework

The success of highly efficient, composite software for molecular structure and dynamics prediction has driven the proliferation of computational tools and the development of first-generation cheminformatics for data storage, analysis, mining, management, and presentation. However, these first-generation cheminformatics tools do not meet the needs of today’s researchers. Massive volumes of data that span the molecular scale are now routinely being created, both experimentally and computationally, and are available for access by an expanding scope of research. What is required to continue progress is the integration of the individual ‘pieces’ of the methodologies involved and the facilitation of the computations in the most efficient manner possible.

Towards meeting these goals, applications and technology specialists have made considerable progress towards solving some of the problems associated with integrating the algorithms to span the molecular scale, computationally and through the data, as well as providing infrastructure to remove the complexity of logging on to an HPC system in order to submit jobs, retrieve results, and supply ‘hooks’ into other codes. In this section, we give an example of a framework that serves as a working environment for researchers and demonstrates new uses of the Grid for computational chemistry and biochemistry studies.

Using GridPort technologies [3], as described for CEPAR, our efforts began with the creation of a portal for carrying out chemistry computations for understanding various details of structure and property for molecular systems – the General Atomic and Molecular Electronic Structure System (GAMESS) [6] quantum chemistry portal (http://gridport.npaci.edu/gamess). The GAMESS software has been deployed on a variety of computational platforms, including both distributed- and shared-memory platforms. The job submission page from the GAMESS portal is shown in Figure 40.3.

[Figure 40.3 The job submission page from the SDSC GAMESS portal.]

The portal uses Grid technologies such as SDSC’s GridPort toolkit [3], the SDSC SRB [5], and Globus [7] to assemble and monitor jobs, as well as to store the results. One goal in the creation of a new architecture is to improve the user experience by streamlining job creation and management.
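On the compute resource itself, the portal ultimately has to write out the user’s input deck and launch GAMESS. A minimal sketch of that step is shown below, assuming the conventional rungms launch script distributed with GAMESS; the script’s arguments and the input-deck fragment shown are illustrative and vary from site to site.

```python
# run_gamess_job.py -- sketch of what a portal back end might execute on the
# compute resource.  Assumes GAMESS's conventional 'rungms' launch script is on
# the PATH; its arguments (job name, version, core count) differ between sites.
import subprocess
from pathlib import Path

def run_gamess(job_name: str, input_deck: str, ncpus: int = 4) -> Path:
    """Write the user-supplied input deck and launch GAMESS, returning the log path."""
    Path(f"{job_name}.inp").write_text(input_deck)   # deck assembled by the portal form
    log = Path(f"{job_name}.log")
    with log.open("w") as out:
        subprocess.run(["rungms", job_name, "00", str(ncpus)],   # site-specific args
                       stdout=out, stderr=subprocess.STDOUT, check=True)
    return log

if __name__ == "__main__":
    deck = " $CONTRL SCFTYP=RHF RUNTYP=ENERGY $END\n"   # fragment for illustration only
    print(run_gamess("portal_job_001", deck, ncpus=4))
```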
Related molecular sequence, structure, and property software has been created using similar frameworks, including the AMBER [8] classical molecular dynamics portal, the EULER [9] genetic sequencing program, and the Adaptive Poisson-Boltzmann Solver (APBS) [10] program for calculating electrostatic potential surfaces around biomolecules. Each type of molecular computational software provides a level of understanding of molecular structure that can be used for a larger-scale understanding of function. What is needed next are strategies to link the molecular-scale technologies through the data and/or through novel algorithmic strategies. Both involve additional Grid technologies.

Development of portal infrastructure has enabled considerable progress towards the integration across scale from molecules to cells, linking the wealth of ligand-based data present in the PDB and detailed molecular-scale quantum chemical structure and property data. As such, accurate quantum mechanical data that has hitherto been under-utilized will be made accessible to the nonexpert for integrated molecule-to-cell studies, including visualization and analysis, to aid in the understanding of more detailed molecular recognition and interaction studies than is currently available or sufficiently reliable. The resulting QM-PDB framework (Figure 40.4) integrates robust computational quantum chemistry software (e.g. GAMESS) with the associated visualization and analysis toolkit QMView [11] and an associated prototype Quantum Mechanical (QM) database facility, together with the PDB. Educational tools and models are also integrated into the framework.

[Figure 40.4 Topology of the QM-PDB framework: a QM compute engine (e.g. GAMESS) and a Quantum Mechanical Database (QM-DB) are linked over the Internet, within a Quantum Mechanical Biomedical Framework (QM-BF), to the QMView tools (structure, vibrations, molecular orbitals, electrostatic and solvent surfaces, reaction paths, biopolymer properties), to the Protein Data Bank (experimental characterization), and to computational modeling that spans quantum mechanics (highly accurate, small molecules), semi-empirical methods (moderate accuracy, moderate-sized molecules), and empirical methods (low accuracy, large complexes).]

With the creation of Grid-based toolkits and associated environment spaces, researchers can begin to ask more complex questions in a variety of contexts over a broader range of scales, using seamless, transparent computing access. As more realistic molecular computations are enabled, extending well into the nanosecond and even microsecond range at a faster turnaround time, and as problems that simply could not fit within the phys- [...]

...emerging Grid standards (e.g. the Globus Toolkit) and dynamic reconfigurability. By defining XML schema to describe both resources and application codes, and interfaces, using emerging Grid standards (such as Web services, SOAP [13], the Open Grid Services Architecture), and building user-friendly interfaces like science portals, the next-generation portal will include a ‘pluggable’ event-driven model in which Grid-enabled... [...]

...users
• To ensure that the science performed on the Grid constitutes the next generation of advances and not just proof-of-concept computations
• To accept feedback from bioinformaticians that is used in the design and implementation of the current environment and to improve the next generation of Grid infrastructure

40.3.1 A future data-driven application – the encyclopedia... [...]

...generation of Chemport will require the ability to build complex and dynamically reconfigurable workflows. At present, GridPort only facilitates the creation of Web-based portals, limiting its potential use. The next step is to develop an architecture that integrates the myriad of tools (GridPort, XML, the Simple Object Access Protocol (SOAP) [12]) into a unified system. Ideally, this architecture will provide... [...]

...instances (DataMarts) are derived for different user groups and made accessible to users and applications. The Grid is a prime facilitator of the comparative analysis needed for putative assignment. The pipeline consists of a series of applications well known in bioinformatics and outlined in Figure 40.7.

[Figure 40.7 (fragment): Grid-ported applications turn sequence data from genomic sequencing projects into pipeline data held in MySQL DataMart(s).]

[...]

1. Distribute the database to which every target sequence is to be compared to each node on the Grid. We use the nonredundant (NR) protein sequence database and the PFAM databases, which contain all unique sequences and sequences organized by families, respectively.
2. Schedule and distribute each of the 10^7 target sequences to nodes on the Grid.
3. Run PSI-BLAST on each node.
4. Retain the PSI-BLAST profiles in secondary storage... [...]

...is not necessary to await the final result before useful science can be done. Once two genomes have been completed, comparative proteomics can begin. Interestingly, the major challenge in using the Grid for EOL is not technical at this stage, but sociological (how should policies for sharing and accounting of data and computational resources be formulated?) and logistical... [...]

...dedicated set of processors rather than accessing a much larger set of processors that must be shared with others. The challenge for Grid researchers is to enable disciplinary researchers to easily use a hybrid of local dedicated resources and national and international Grids in a way that enhances the science and maximizes the potential for new disciplinary advances. If this is not challenge enough,... [...]

[Figure 40.5 The three-layered architecture and the communication through SOAP and GRAM protocols over the Internet: a Web client calls Web services over SOAP in a resource layer that reaches SRB-managed storage.]

40.3 THE CHALLENGES OF THE GRID FROM THE PERSPECTIVE OF BIOINFORMATICS RESEARCHERS

Disciplinary researchers care most about results, not the infrastructure used to achieve those results. Computational scientists care about optimizing... [...]

REFERENCES

3. ...(2000) Development of Web toolkits for computational science portals: the NPACI HotPage. Proc. of the Ninth IEEE International Symposium on High Performance Distributed Computing, 2000.
4. Foster, I. and Kesselman, C. (1997) Globus: a metacomputing infrastructure toolkit. International Journal of Supercomputer Applications, 11, 115–128.
5. Baru, C., Moore, R., Rajasekar, A. and Wan, M. (1998) Proc. CASCON ’98... [...]
6. ...1347.
7. Foster, I. and Kesselman, C. (1998) IPPS/SPDP ’98 Heterogeneous Workshop, 1998, pp. 4–18.
8. http://gridport.npaci.edu/amber
9. Pevzner, P., Tang, H. and Waterman, M. S. (2001) An Eulerian approach to DNA fragment assembly. Proceedings of the National Academy of Sciences (USA), 98, 9748–9753, http://gridport.npaci.edu/euler
10. Baker, N. A., Sept, D., Holst, M. J. and McCammon, J. A. (2001) The adaptive... [...]