PROCESS MIGRATION IN CLUSTERS AND CLUSTER GRIDS*

József Kovács
MTA SZTAKI, Parallel and Distributed Systems Laboratory
H-1518 Budapest, P.O. Box 63, Hungary
smith@sztaki.hu

* The work presented in this paper has been supported by the Hungarian Chemistrygrid OMFB-00580/2003 project, the Hungarian Supergrid OMFB-00728/2002 project, the Hungarian IHM 4671/1/2003 project and the Hungarian Research Fund No. T042459.

Abstract
The paper describes two working modes of the parallel program checkpointing mechanism of P-GRADE and its potential application in the nationwide Hungarian ClusterGrid (CG) project. The first generation architecture of ClusterGrid enables the migration of parallel processes among friendly Condor pools. In the second generation CG, Condor flocking is disabled, so a new technique is introduced to interrupt the whole parallel application and take it out of the Condor scheduler together with its checkpoint files. The latter mechanism enables a parallel application to be completely removed from the Condor pool after checkpointing and to be resumed under another, non-friendly Condor pool after resubmission. The checkpointing mechanism can automatically (without user interaction) support generic PVM programs created by the P-GRADE Grid programming environment.

Keywords: message-passing parallel programs, graphical programming environment, checkpointing, migration, cluster, grid, PVM, Condor

1. Introduction

Process migration in distributed systems is a special event when a process running on a resource is redeployed onto another one in such a way that the migration does not cause any change in the process execution. To provide this capability, special techniques are necessary to save the whole memory image of the target process and to reconstruct it. This technique is called checkpointing. During checkpointing a tool suspends the execution of the process, collects all the internal state information necessary for resumption and terminates the process. Later a new process is created and all the collected information is restored for the process to continue its execution without any modification.

Such a migration mechanism can be used to advantage in several scenarios such as load balancing, utilisation of free resources (high throughput computing), fault-tolerant execution or resource-requirement driven migration. When using a job scheduler, most of the above cases can only be supported by some external checkpointing mechanism, since automatic checkpointing of parallel jobs is rarely solved within a job scheduler. For example, the Condor [11] system can only guarantee the automatic checkpointing of sequential jobs, but only provides user-level support for fault-tolerant execution of Master/Worker PVM jobs.

When building a huge ClusterGrid we should aim at making the Grid [4] capable of scheduling parallel applications effectively, otherwise these applications will fail due to the dynamic behaviour of the execution environment.

Beyond the execution of a parallel program, another important aspect for a Grid end-user is the creation of a Grid application. Unfortunately, there are no widely accepted graphical tools for high-level development of parallel applications. This is exactly the aim of the P-GRADE [9] (Parallel Grid Run-time and Application Development Environment) Grid programming environment that has been developed by MTA SZTAKI. P-GRADE currently generates [3] either PVM or MPI code from the same graphical notation according to the users' needs.

In this paper we show how an external checkpointing mechanism can be plugged into a scheduler by our tool without requiring any changes to the scheduler, making a huge nationwide ClusterGrid capable of executing parallel applications with full support for automatic checkpointing. The paper details two working modes: migration among friendly (flocked) Condor pools and migration among non-friendly (independent) Condor pools. Both are related to the different layouts of the evolving Hungarian ClusterGrid project.
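As a concrete, much simplified illustration of this save/restore cycle, the following C sketch checkpoints a single iterative process at application level: on a termination signal it writes its state to a file and exits, and at start-up it resumes from that file if one exists. This is not the P-GRADE mechanism, which saves the complete memory image transparently through a linked checkpoint library; the state structure, signal choice and file name here are hypothetical.

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

#define CKPT_FILE "work.ckpt"          /* hypothetical checkpoint file name */

struct state { long iteration; double partial_sum; };

static volatile sig_atomic_t ckpt_requested = 0;

static void on_term(int sig) { (void)sig; ckpt_requested = 1; }

/* Save the state needed for resumption, then terminate the process. */
static void save_and_exit(const struct state *s)
{
    FILE *f = fopen(CKPT_FILE, "wb");
    if (f) { fwrite(s, sizeof *s, 1, f); fclose(f); }
    exit(0);
}

/* Restore a previous state if a checkpoint exists, otherwise start fresh. */
static void restore_or_init(struct state *s)
{
    FILE *f = fopen(CKPT_FILE, "rb");
    if (f && fread(s, sizeof *s, 1, f) == 1) { fclose(f); return; }
    if (f) fclose(f);
    s->iteration = 0;
    s->partial_sum = 0.0;
}

int main(void)
{
    struct state s;
    signal(SIGTERM, on_term);          /* e.g. sent when a node is vacated */
    restore_or_init(&s);
    for (; s.iteration < 100000000L; s.iteration++) {
        s.partial_sum += 1.0 / (double)(s.iteration + 1);
        if (ckpt_requested)
            save_and_exit(&s);         /* interrupted: save and terminate  */
    }
    printf("sum = %f\n", s.partial_sum);
    remove(CKPT_FILE);                 /* finished: drop the checkpoint    */
    return 0;
}
```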
2. The Hungarian ClusterGrid Project

The ClusterGrid project was started in the spring of 2002, when the Hungarian Ministry of Education initiated a procurement project which aimed at equipping most of the Hungarian universities, high schools and public libraries with high-capacity computational resources.

The ClusterGrid project aims to integrate the Intel processor based PCs into a single, large, countrywide interconnected set of clusters. The PCs are provided by the participating Hungarian institutes; the central infrastructure and the coordination are provided by NIIF/HUNGARNET, the operator of the Hungarian Academic Network. Every contributor uses its PCs for its own purposes during official working hours, such as education or office work, and offers the infrastructure for high-throughput computation whenever the machines are not used for other purposes, i.e. during the night hours and at unoccupied weekends. The combined use of "day-shift" (i.e. individual mode) and "night-shift" (i.e. grid mode) enables us to utilise CPU cycles (which would have been lost anyway) to provide a firm computational infrastructure to the national research community.

By the end of summer 2002, 99 PC labs had been installed throughout the country, each lab consisting of 20 PCs, a single server and a firewall machine. The resources of the PCs in each lab were accumulated by the Condor software and the pools were flocked to each other, creating a huge Condor pool containing 2000 machines. A Virtual Private Network was built connecting all the nodes and a single entry point was defined to submit applications. This period is referred to as the 1st generation architecture of ClusterGrid.

From September 2003, a new grid layout has been established, referred to as the 2nd generation architecture. It was changed to support decentralised submission of applications and to add an intelligent brokering layer above the Condor pools, which are not flocked to each other any more.

Currently both sequential jobs and parallel jobs parallelised with the Parallel Virtual Machine (PVM) library are supported. Automatic checkpointing works for statically linked sequential jobs only, thus no parallel job can run longer than 10 hours (the duration of a night-shift operation) or 60 hours (the duration of a weekend operation). User-level checkpointing can be applied to both sequential and parallel jobs without any execution time restriction. For more detailed information, please refer to [12].

3. The P-GRADE software development tool

P-GRADE [5] provides a complete, integrated, graphical solution (including design, debugging, testing, monitoring, load balancing, checkpointing, performance analysis and visualization) for the development and execution of parallel applications on clusters, Grid systems and supercomputers.
The high-level graphical environment of P-GRADE reduces the need for programming competence; thus non-professional programmers can use the same environment on traditional supercomputers, clusters or Grid solutions.

To overcome the execution time limitation for parallel jobs we introduced a new checkpointing technique in P-GRADE, where different execution modes can be distinguished. In interactive mode the application is started by P-GRADE directly, which means it logs into a cluster, prepares the execution environment, starts the PVM or MPI application and takes care of the program. In this case it is possible to use the checkpoint system with a load balancer attached to it. In job mode the execution of the application is supervised by a job scheduler like Condor or SGE after submission. When using the Condor job scheduler, P-GRADE is able to integrate automatic checkpointing capability into the application. In this case the parallel application can be migrated by Condor among the nodes of its pool, or it is even possible to remove the job from the queue after checkpointing and to transfer the checkpoint files representing the interrupted state to another pool and continue the execution after the job is resubmitted to the new pool. To enable one of the execution modes mentioned above, the user only needs to make some changes in the "Application Settings" dialog of P-GRADE and submit the application. No changes are required in the application code.

4. Migration in the 1st generation ClusterGrid

The P-GRADE compiler generates [3] executables which contain the code of the client processes defined by the user and an extra process, called the grapnel server, which coordinates the run-time set-up of the application. The client processes contain the user code, the message-passing primitives and the so-called grapnel (GRAPhical NEt Language) library that manages logical connections among them. To set up the application, first the Grapnel Server (GS) (see Figure 1) comes to life and then it creates the client processes containing the user computation.

Before starting the execution of the application, an instance of the Checkpoint Server (CS) is started in order to transfer checkpoint files to/from the checkpoint libraries dynamically linked to the application. Each process of the application automatically loads the checkpoint library at start-up, which checks for the existence of a previous checkpoint file of the process by connecting to the Checkpoint Server. If it finds a checkpoint file for the process, resumption of the process is automatically initiated by restoring the process image from the checkpoint file, otherwise the process is started from the beginning. To provide application-wide consistent checkpointing, the communication primitives are modified to perform the necessary protocol among the user processes and between the user processes and the server.

In a Condor based Grid, like the 1st generation ClusterGrid, the P-GRADE checkpoint system is prepared for the dynamic behaviour of the PVM virtual machine organised by Condor. Under Condor the PVM runtime environment is slightly modified by the Condor developers in order to give fault-tolerant execution support to Master-Worker (MW) type parallel applications.

The basic principle of fault-tolerant MW type execution in Condor is that the Master process spawns workers to perform the calculation and continuously monitors whether the workers successfully finish their calculation. In case of a failure the Master process simply spawns new workers, passing the unfinished work to them. The situation when a worker fails to finish its calculation usually comes from the fact that Condor removes the worker because the executor node is no longer available. This action is called vacation of the machine containing the PVM process. In this case the master node receives a notification message indicating that a particular node has been removed from the PVM machine. As an answer, the Master process tries to add new PVM host(s) to the virtual machine with the help of Condor and gets notified when this is done successfully. Afterwards it spawns new worker(s).
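The following sketch illustrates this notification-driven master/worker principle directly with PVM 3 calls: the master registers a PvmTaskExit notification for every worker it spawns, and whenever a worker disappears (for example because its machine has been vacated) it respawns the worker and resends the unfinished unit of work. It only illustrates the principle described above, not code from Condor's MW support or from the Grapnel Server; the "worker" executable, the message tags and the work representation are hypothetical.

```c
#include <stdio.h>
#include <pvm3.h>

#define NUNITS      4
#define TAG_WORK    1   /* master -> worker: the index of one work unit   */
#define TAG_RESULT  2   /* worker -> master: the index of a finished unit */
#define TAG_EXIT    3   /* delivered by the pvmd when a worker disappears */

/* Spawn one (hypothetical) worker for unit u and watch it for failure. */
static int start_worker(int u)
{
    int tid;
    pvm_spawn("worker", NULL, PvmTaskDefault, "", 1, &tid);
    pvm_notify(PvmTaskExit, TAG_EXIT, 1, &tid);
    pvm_initsend(PvmDataDefault);
    pvm_pkint(&u, 1, 1);
    pvm_send(tid, TAG_WORK);
    return tid;
}

int main(void)
{
    int owner[NUNITS], finished[NUNITS] = {0};
    int u, tid, done = 0;

    pvm_mytid();                               /* enrol the master in PVM */
    for (u = 0; u < NUNITS; u++)
        owner[u] = start_worker(u);

    while (done < NUNITS) {
        int bufid = pvm_recv(-1, -1);          /* wait for any message    */
        int bytes, tag, src;
        pvm_bufinfo(bufid, &bytes, &tag, &src);

        if (tag == TAG_RESULT) {               /* a unit was completed    */
            pvm_upkint(&u, 1, 1);
            if (!finished[u]) { finished[u] = 1; done++; }
        } else if (tag == TAG_EXIT) {          /* a worker was lost       */
            pvm_upkint(&tid, 1, 1);            /* tid of the exited task  */
            for (u = 0; u < NUNITS; u++)
                if (owner[u] == tid && !finished[u])
                    owner[u] = start_worker(u); /* redo the lost unit     */
        }
    }
    printf("all %d work units finished\n", NUNITS);
    pvm_exit();
    return 0;
}
```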
A P-GRADE application continuously requires a minimum number of nodes to execute its processes. Whenever the number of nodes decreases below this minimum, the Grapnel Server (GS) tries to extend the set of PVM machines above the critical level. This means that the GS process works exactly the same way as the Master process in the Condor MW system.

Whenever a process is to be killed (e.g. because its node is being vacated), an application-wide checkpoint is performed and the exited process is resumed on another node. The application-wide checkpointing is driven by the GS, but it can be initiated by any user process (A, B, C) which detects that Condor is trying to kill it. After the notification the GS sends a checkpoint signal and message to every user process, which results in the user processes making a coordinated checkpoint. It is started with a message synchronisation among the processes and finishes with saving the memory image of the individual processes. Now that the application is saved, the terminating processes exit to be resumed on other nodes.

At this point the GS waits for the decision of Condor, which tries to find underloaded nodes either in the home Condor pool of the submit machine or in a friendly Condor pool. The resume phase is performed only when the PVM master process (GS) receives a notification from Condor about new host(s) connected to the PVM virtual machine. For every new node a process is spawned and resumed from the stored checkpoint file. When every terminated process has been resumed on a new node allocated by Condor, the application can continue its execution.

Figure 1. Migration phases under Condor.

This working mode enables the PVM application to continuously adapt itself to the changing PVM virtual machine by migrating processes from the machines being vacated to new ones that have just been added. Figure 1 shows the main steps of the migration between friendly Condor pools. This working mode is fully compatible with the first generation architecture of the nationwide Hungarian ClusterGrid project.
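As a rough sketch of this server-driven coordination, the fragment below shows how a GS-like master could use PVM host-delete notifications as the trigger for an application-wide checkpoint before the affected processes are respawned. It is a simplified illustration under several assumptions, not the Grapnel Server's actual code: the "grapnel_client" executable, the message tags and the fixed process count are hypothetical, and the interaction with the Checkpoint Server and with Condor's host allocation is only indicated in comments.

```c
#include <stdio.h>
#include <pvm3.h>

#define NPROC          3
#define TAG_HOST_DEL  10    /* pvmd notification: a host left the VM      */
#define TAG_CKPT      11    /* GS -> client: write your checkpoint now    */
#define TAG_CKPT_DONE 12    /* client -> GS: checkpoint finished          */

static int client[NPROC];

/* Ask every user process to checkpoint and wait until all have done so. */
static void application_wide_checkpoint(void)
{
    int i, dummy = 0;
    for (i = 0; i < NPROC; i++) {
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&dummy, 1, 1);
        pvm_send(client[i], TAG_CKPT);
    }
    for (i = 0; i < NPROC; i++)
        pvm_recv(client[i], TAG_CKPT_DONE);
}

int main(void)
{
    int i, nhost, narch, dtid;
    struct pvmhostinfo *hosts;

    pvm_mytid();
    pvm_config(&nhost, &narch, &hosts);        /* current virtual machine  */
    for (i = 0; i < nhost; i++) {              /* watch every host for     */
        dtid = hosts[i].hi_tid;                /* vacation by Condor       */
        pvm_notify(PvmHostDelete, TAG_HOST_DEL, 1, &dtid);
    }
    for (i = 0; i < NPROC; i++)                /* start the user processes */
        pvm_spawn("grapnel_client", NULL, PvmTaskDefault, "", 1, &client[i]);

    for (;;) {
        pvm_recv(-1, TAG_HOST_DEL);            /* a node is being vacated   */
        pvm_upkint(&dtid, 1, 1);               /* pvmd tid of the lost host */
        application_wide_checkpoint();
        /* ... here the GS would wait until Condor attaches a replacement
           host, respawn the processes that ran on the vacated node and let
           them resume from their checkpoint files held by the CS; when the
           whole application has finished it leaves this loop and calls
           pvm_exit() ... */
    }
}
```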
5. Migration in the 2nd generation ClusterGrid

Notice that in the previous solution the Application Server (GS) and Checkpoint Server (CS) processes must remain on the submit machine during the whole execution, even if every user process (A, B, C in Figure 1) of the application migrates to another pool through flocking. Since flocking is not used in the 2nd generation ClusterGrid, the application must be checkpointed and removed from the pool. A broker then allocates a new pool, transfers the checkpoint files and resubmits the job, after which the application should be able to resume its execution.

In order to checkpoint the whole application, the checkpoint phase is initiated by the broker (part of the ClusterGrid architecture) simply by removing the application from the pool. When the application server detects that it is about to be killed, it performs a checkpoint of each process of the application, shuts down all user processes, checkpoints itself and exits. This phase is similar to the case when all the processes are prepared for migration, but it completes with an additional server self-checkpoint and termination. As a preparation, the server creates a file status table in its memory to memorise the open files used by the application and also stores the status of each user process.

When the broker has successfully allocated a new pool, it transfers the executable, checkpoint and data or parameter files and resubmits the application. When resubmitted, the server process first comes to life and the checkpoint library linked to it automatically checks for a matching checkpoint file by querying the checkpoint server. The address of the checkpoint server is passed as a parameter (or can optionally be taken from an environment variable). When the checkpoint file is found, the server (GS) resumes, data files are reopened based on the information stored in the file status table and finally every user process is re-spawned, so the application is rebuilt.

This solution enables the parallel application to be migrated among different sites; it is no longer limited to execution under the same Condor pool during its whole lifetime. Details of the checkpointing mechanism can be found in [6].
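Purely as an illustration of this start-up decision, the sketch below shows how a resubmitted server process might locate the Checkpoint Server from a command-line parameter or an environment variable and decide whether to resume or to start from scratch. The option name, the CKPT_SERVER variable, the job identifier and the stand-in availability check are all hypothetical; in the real system the linked checkpoint library performs this query against the Checkpoint Server transparently.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the real query to the Checkpoint Server: here we merely
 * look for a local file named after the job.  A real implementation would
 * contact the server whose address was resolved below. */
static int checkpoint_available(const char *server, const char *job_id)
{
    char path[256];
    FILE *f;
    (void)server;
    snprintf(path, sizeof path, "%s.ckpt", job_id);
    f = fopen(path, "rb");
    if (f) { fclose(f); return 1; }
    return 0;
}

int main(int argc, char **argv)
{
    const char *server = NULL, *job_id = "grapnel_server";
    int i;

    for (i = 1; i < argc - 1; i++)             /* -ckptsrv <host:port>     */
        if (strcmp(argv[i], "-ckptsrv") == 0)
            server = argv[i + 1];
    if (server == NULL)
        server = getenv("CKPT_SERVER");        /* fallback: environment    */

    if (server != NULL && checkpoint_available(server, job_id)) {
        printf("resuming from checkpoint held by %s\n", server);
        /* ... restore the server state, reopen files from the file status
           table and re-spawn every user process from its checkpoint ... */
    } else {
        printf("no previous checkpoint, starting from the beginning\n");
        /* ... normal start-up: create the user processes from scratch ... */
    }
    return 0;
}
```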
6. Performance and Related Work

Regarding performance, the overall time spent on migration consists of checkpoint writing, checkpoint reading, the allocation of new resources and some coordination overhead. The time a complete migration of a process takes also includes the response time of the resource scheduling system, e.g. the time while Condor vacates a machine, the matchmaking mechanism finds a new resource, allocates it, initialises the pvmd and notifies the application. Finally, the cost of message synchronisation and coordination is negligible, less than one percent of the overall migration time.

Condor [8], MPVM [1], DPVM [2], Fail-Safe PVM [7] and CoCheck [10] are other software systems supporting adaptive parallel application execution, including checkpointing and migration facilities. The main drawbacks of these systems are that they modify PVM, build a complex execution system, require special support, need root privileges, require a predefined topology, need operating system support, etc. In contrast to these systems, our solution makes parallel applications capable of being checkpointed, migrated or executed in a fault-tolerant way at the application level, and it does not require any support from the execution environment or from PVM.

7. Conclusion

In this paper a checkpointing mechanism has been introduced which enables parallel applications to be migrated partially among friendly Condor pools in the 1st generation Hungarian ClusterGrid and to be migrated among independent (non-friendly) Condor pools in the 2nd generation ClusterGrid.

As a consequence, the P-GRADE checkpoint system can guarantee the execution of any PVM job in a Condor-based Grid system like ClusterGrid. Notice that the Condor system can only guarantee the execution of sequential jobs and special Master/Worker PVM jobs; for generic PVM jobs Condor cannot provide checkpointing. Therefore, the developed checkpointing mechanism significantly extends the robustness of any Condor-based Grid system.

An essential highlight of this checkpointing system is that the checkpoint information can be transferred among Condor pools, while the native Condor checkpointer cannot provide this capability, so non-flocked Condor pools cannot exchange checkpointed applications, not even with the help of an external module. Moreover, the migration facility presented in this paper does not need any modification either in the message-passing layer or in the scheduling and execution system.

In the current solution the checkpointing mechanism is an integrated part of P-GRADE, so the current system only supports parallel applications created by the P-GRADE environment. In the future, a roll-back mechanism is going to be integrated into the current solution to support high-level fault tolerance, and an MPI extension is planned as well.

References

[1] J. Casas, D. Clark, R. Konuru, S. Otto, R. Prouty, and J. Walpole, "MPVM: A Migration Transparent Version of PVM", Technical Report CSE-95-002, 1, 1995.
[2] L. Dikken, F. van der Linden, J. J. J. Vesseur, and P. M. A. Sloot, "DynamicPVM: Dynamic Load Balancing on Parallel Systems", in W. Gentzsch and U. Harms, editors, Lecture Notes in Computer Science 797, High Performance Computing and Networking, Proceedings Volume II, Networking and Tools, pages 273-277, Munich, Germany, April 1994, Springer Verlag.
[3] D. Drótos, G. Dózsa, and P. Kacsuk, "GRAPNEL to C Translation in the GRADE Environment", Parallel Program Development for Cluster Computing: Methodology, Tools and Integrated Environments, Nova Science Publishers, Inc., pp. 249-263, 2001.
[4] I. Foster, C. Kesselman, and S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations", International Journal of Supercomputer Applications, 15(3), 2001.
[5] P. Kacsuk, "Visual Parallel Programming on SGI Machines", invited paper, Proc. of the SGI Users Conference, Krakow, Poland, pp. 37-56, 2000.
[6] J. Kovács and P. Kacsuk, "Server Based Migration of Parallel Applications", Proc. of DAPSYS 2002, Linz, pp. 30-37, 2002.
[7] J. Leon, A. L. Fisher, and P. Steenkiste, "Fail-safe PVM: A Portable Package for Distributed Programming with Transparent Recovery", Technical Report CMU-CS-93-124, February 1993.
[8] M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny, "Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System", Technical Report #1346, Computer Sciences Department, University of Wisconsin, April 1997.
[9] P-GRADE Parallel Grid Run-time and Application Development Environment: http://www.lpds.sztaki.hu/pgrade
[10] G. Stellner, "Consistent Checkpoints of PVM Applications", in Proc. of the 1st European PVM Users Group Meeting, 1994.
[11] D. Thain, T. Tannenbaum, and M. Livny, "Condor and the Grid", in F. Berman, A. J. G. Hey, and G. Fox, editors, Grid Computing: Making The Global Infrastructure a Reality, John Wiley, 2003.
[12] http://www.clustergrid.iif.hu