Just-In-Time Workload Management: Scalable Resource Sharing on the Open Science Grid

Cover Page

U.S. Department of Energy Office of Science
Scientific Discovery through Advanced Computation Solicitation

Just-In-Time Workload Management: Scalable Resource Sharing on the Open Science Grid

A Collaborative DOE SciDAC Science Application Project and Partnership
For the period July 1, 2006 – June 30, 2009

Lead Principal Investigator
Torre Wenaus
Physics Department, Brookhaven National Laboratory
Tel: 631 821 6157
Email: wenaus@bnl.gov

Co-Principal Investigators
Miron Livny
Dept of Computer Science, University of Wisconsin Madison, Madison, WI, 53705
Ph: 608 262 0856
miron@cs.wisc.edu

Frank Würthwein
Physics Department, University of California San Diego, San Diego, CA
Ph: 858 774 7035
fkw@fnal.gov

DOE/Office of Science Program Office: Office of Advanced Scientific Computing and Research
DOE/Office of Science Program Office Technical Contact: Mary Anne Scott, (301) 903-6368, scott@er.doe.gov

Table of Contents

Cover Page
Abstract
A Project Narrative
  A.1 Background and Significance
  A.2 Preliminary Studies and Project Drivers
    A.2.a Condor
    A.2.b The Panda Workload Manager
    A.2.c Just-in-time workload management at CMS
    A.2.d Experience of the Principals
  A.3 Research Design and Methods
    A.3.a Technical Approach
    A.3.b Evolving Panda in a HEP/CS Partnership
    A.3.c Program of Work
    A.3.d Management and Organization
    A.3.e Software Development Approach
    A.3.f Work Assignments
    A.3.g Milestones and Deliverables
  A.4 Consortium Arrangements – OSG Collaboration
B Biographical Sketches of Investigators
  B.1 Miron Livny
  B.2 Torre Wenaus
  B.3 Frank Würthwein
C Current and Pending Support of Investigators
  C.1 Miron Livny
  C.2 Torre Wenaus
  C.3 Frank Würthwein
D Description of Facilities and Resources
  D.1 Brookhaven National Laboratory
  D.2 UW-Madison
  D.3 UCSD
E Budget Summary
F Per Institution Statements of Work (SOWs) / Tasks+Milestones
  F.1 UW-Madison
  F.2 Brookhaven National Laboratory
  F.3 UCSD
G Letter of Support from Open Science Grid (OSG)
H Science Application Partnership Computer Science Costs
I Bibliography

Abstract

The Large Hadron Collider (LHC) experiments are collaborating with computer scientists to develop grid computing as the enabler for data-intensive, globally distributed computing at the unprecedented scales required by LHC science. The first LHC physics run in 2008 is expected to provide sensitivity to exciting new physics within weeks. The 2008 run will deliver 10PB each to ATLAS and CMS, requiring 100MSpecInt2k at 50 centers worldwide to support physics analysis by communities of order 1000 physicists.

LHC's collaboration with the grid computing community is taking place within projects that also involve other applications sciences. US projects like the Trillium projects, Grid3 and recently Open Science Grid (OSG), collaborating with others overseas, have delivered a grid infrastructure and applications that are now in production in many experiments, layered over a common baseline grid infrastructure that is now being hardened as the foundation for systems scalable to LHC requirements.

We propose a three year partnership between ATLAS and CMS collaborators and computer scientists from the Condor project to build on this foundation to develop and deploy a workload management system capable of meeting key science-driven requirements in opportunistic resource utilization, diverse usage modes from production to analysis, managing dynamic workloads, and automation, which are relevant both for the LHC and the broader science community of the OSG. The partners involved bring demonstrated expertise in developing and successfully deploying systems and supporting middleware following the highly scalable “just in time” workflow design the project will employ.

A Project Narrative

In the following sections we describe the proposed project technically and organizationally. Section A.1 provides some background on the HEP computing challenge at the Large Hadron Collider (LHC) [1], the approach the LHC experiments have taken to addressing this challenge through collaboration with computer science on data-intensive grid computing, the status of this program, and the motivation and significance of this proposal in this context. Section A.2 describes work done to date that motivates and informs the objectives and workplan of this proposal, and establishes the partners involved as highly qualified to carry the program to success. Section A.3 describes the specific program of work: our architectural and technical approach to workflow management and the advantages thereof; the form our application/computer science partnership will take; how we will organize ourselves and apply manpower; and milestones/deliverables.
Finally, Section A.4 gives the specifics of the relationship between this project and the Open Science Grid Consortium [2].

A.1 Background and Significance

The Large Hadron Collider (LHC) experiments are collaborating with computer scientists to develop grid computing as the enabler for data-intensive, globally distributed computing at the unprecedented scales required by the LHC science program. The LHC computing challenge is immense and has no margin for error. The luminosity expected in the first LHC physics run in 2008 is sufficient to provide sensitivity to new physics such as supersymmetry within a few weeks. The computing systems must be ready, and at scale: the 2008 run will deliver about 10PB each to ATLAS and CMS (the volume for a nominal LHC year), requiring about 100MSpecInt2k [3] at about 50 centers worldwide to analyze in order to understand detector performance and extract the physics.

There will be 500-1000 physicists per experiment actively engaged worldwide in data analysis and detector performance studies; the computing infrastructure must support this individual usage as well as it supports managed production. The priorities and workloads will change frequently and urgently and must be accommodated quickly. Computing will be resource-limited, demanding a system capable of opportunistically exploiting a diverse and dynamic array of computing sites and resources. Operations manpower will be very limited, so the system must be highly automated and robust against instabilities and failures.

LHC's collaboration with the grid computing community to meet these challenges is well-established, and is taking place within projects that involve other applications sciences as well. While the LHC's requirements differ in scale from other domains, in most respects they do not differ in kind, and these projects have established tools and systems used in common across domains. In the US, these projects include the Trillium projects (PPDG [4], GriPhyN [5], iVDGL [6]), Grid3 [7] and most recently the Open Science Grid. They have delivered a grid computing infrastructure and grid-capable applications that are now in production in many experiments. Both in the US and in Europe the efforts to date have established a common baseline grid computing infrastructure. This is now being consolidated and hardened to serve as the foundation for systems able to scale to LHC requirements. The work has also established an invaluable experience base to inform the remaining effort to deliver systems meeting all the requirements of LHC computing.

We propose to develop improvements in workload management that are necessary to meet the LHC requirements described above. The targeted requirements (opportunistic resource utilization, good support for diverse usage modes from managed production to individual analysis, flexible and fast management of dynamic workloads, extensive automation to simplify operations, and needed scalability) are relevant to all or most grid users, not just the LHC. For this reason we anticipate that this work will be widely beneficial within the OSG consortium.

Experience with existing systems (both our own and others) has demonstrated the advantages of a "late binding" or "just-in-time" workload management system. In conventional workload management, the 'payload' of a job dispatched to the grid (the processing task to be performed) is sent as an intrinsic part of the job at time of submission to the grid. In a late binding scheme, the submitted job is merely a container for a payload to be acquired later, once the job successfully launches on a worker node. When the container ('pilot') job launches on a worker node, it contacts a queue manager and requests a task. In this way work is pulled to worker nodes that are acquired by a resource harvesting system that is largely decoupled from the application's workload management. This scheme offers a number of benefits:

• It enables opportunistic acquisition of resources on any site or grid to which pilot jobs can be delivered, with the VO then able to deliver work appropriate to the capabilities of the resource (as reported by the pilot).
• It allows the VO to flexibly and quickly adjust workflows to changing requests and priorities, since all tasks are held in a VO-managed queue until the moment of their release (to a pilot) for processing.
• It provides robustness against site and worker node failures; sites and worker nodes that do not successfully launch a communicating pilot will be ignored by workload management.
• It is capable of very short job launch latencies, not bounded by the latency of submitting a grid job, because the workload management sees a steady stream of communicating pilots to which it can dispatch tasks immediately as they arrive. This is particularly valuable for supporting distributed interactive analysis.
• It maximizes uniformity across heterogeneous grids and scheduling systems, as seen by the VO's workload management; the heterogeneity is primarily isolated in the harvesting system, with the queue management and pilot interaction systems common across all environments.
• It allows sophisticated, dynamic, VO-managed brokerage to take place in the decision process by which the workload manager selects a task to deliver to a requesting pilot. For example, data placement constraints can easily be applied; the brokerage may require that input data be pre-positioned at a site to avoid data transfer latencies and failure modes. VO policies such as user priorities and quotas can also be imposed in a homogeneous way across heterogeneous resources.

These benefits are not expectations in the abstract for such a system; they are seen in deployed systems, as will be described in the next section.
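The late-binding scheme described above can be illustrated with a minimal sketch. It is not the Panda, GlideCAF, or Condor implementation; the names `Task`, `Pilot`, `TaskQueue`, and `request_task`, the site labels, and the dataset names are all hypothetical, chosen only for illustration. The sketch assumes a single in-memory queue and a brokerage rule that combines task priority with a data pre-placement constraint.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    """A payload held in the VO-managed queue (hypothetical structure)."""
    name: str
    priority: int
    input_dataset: Optional[str] = None  # must already be at the pilot's site


@dataclass
class Pilot:
    """What a pilot reports when it starts on a worker node (assumed fields)."""
    site: str
    datasets_at_site: frozenset = frozenset()


class TaskQueue:
    """Tasks stay here until the moment a pilot asks for work."""

    def __init__(self) -> None:
        self._tasks: List[Task] = []

    def submit(self, task: Task) -> None:
        # Submission only queues the task; no resource is bound yet.
        self._tasks.append(task)

    def request_task(self, pilot: Pilot) -> Optional[Task]:
        # Brokerage: keep only tasks whose input data is pre-placed at the site.
        eligible = [
            t for t in self._tasks
            if t.input_dataset is None or t.input_dataset in pilot.datasets_at_site
        ]
        if not eligible:
            return None  # the pilot has nothing to run and can simply exit
        # Late binding: the choice is made now, using current priorities.
        best = max(eligible, key=lambda t: t.priority)
        self._tasks.remove(best)
        return best


if __name__ == "__main__":
    queue = TaskQueue()
    queue.submit(Task("reco", priority=5, input_dataset="dsA"))
    queue.submit(Task("sim", priority=1))
    # A pilot that launched at a site (label is illustrative) holding dsA.
    pilot = Pilot(site="site-1", datasets_at_site=frozenset({"dsA"}))
    task = queue.request_task(pilot)
    print(task.name if task else "no work")
```

Because a task is bound to a resource only inside `request_task`, a priority changed while the task is still queued takes effect at the next pilot request, and a site that never launches a working pilot never receives work.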
The deployed systems to date are, however, experiment specific; this proposal seeks to provide a broadly usable system deployed to and supported on the OSG.

A.2 Preliminary Studies and Project Drivers

The late binding approach has been used in various forms by LHCb [8], ALICE [9], and CDF [10] in the DIRAC [11], ALIEN, and GlideCAF systems respectively, with very good success. This approach is also closely related to the way Condor employs matchmaking to bind resources and consumers. This proposal leverages a recent entry among the late binding systems, the Panda system developed by US ATLAS [12], together with Condor and GlideCAF from CDF/CMS. In the following subsections we describe these systems as they exist today and as they relate to the workload management objectives of this proposal.

A.2.a Condor

Dependable and effective access to large amounts of sustained computing power, often referred to as high-throughput computing, is critical to today’s scientists and engineers. The Condor Project at the University of Wisconsin-Madison (UW-Madison) Department of Computer Sciences has been engaged in research, software development and software deployment in this area since 1985 [13] and consists of a team of about 35 staff and students. The Condor High Throughput Computing System software (“Condor”) is an established distributed workload management system for large scale and compute intensive applications, with facilities for resource monitoring and management, job scheduling, priority schemes, and workflow supervision. Condor provides easy access to large amounts of dependable and reliable computational power over prolonged periods of time by effectively harnessing all available resources, including both dedicated compute clusters and non-dedicated machines under the control of interactive users or autonomous batch systems.

Condor’s unique architecture and mechanisms enable it to perform particularly well in opportunistic environments [14]. Opportunistic mechanisms such as process checkpoint/migration [15] and redirected I/O allow Condor to effectively harness non-dedicated desktop workstations as well as dedicated compute clusters, and Condor’s special attention to fault-tolerance enables it to provide large amounts of computational throughput over prolonged periods of time or in a grid environment that crosses different administrative domains.

Originally, the Condor job submission agent could launch jobs only upon Condor-managed resources. In 2001, Condor-G [16] was developed as an enhanced submission agent that can launch and supervise jobs upon resources controlled by a growing list of workload management and grid middleware systems, permitting computing environments that cross administrative boundaries, a primary requirement for grid computing. Condor-G (which stands for Condor to Grid) was originally developed to submit jobs to Globus Toolkit (GT) 2.x middleware via the GRAM protocol [17], but has since been extended to support submission to GT4, Nordugrid, PBS, and others. Used as a front-end to a computational grid, Condor-G can manage thousands of jobs destined to run at distributed sites. Condor-G provides job monitoring, logging, notification, policy enforcement, fault tolerance, and credential management, and it can handle complex job interdependencies. Of course, Condor can also launch jobs upon remote Condor pools; this is sometimes referred to as Condor-C (which stands for Condor to Condor), and it allows multiple Condor sites to work together.

It is not uncommon for both Condor-G and Condor to be utilized within a computational grid deployment. For example, Condor-G can be used as the reliable submission and job management service for one or more sites, the Condor High Throughput Computing system can be used as the fabric management service (a grid “generator”) for one or more sites, and finally Globus Toolkit services can be used as the bridge between them.

One disadvantage of common grid job submission protocols today, such as GRAM, is that they usually result in either over-subscription (by submitting jobs to multiple queues at once) or under-subscription (by submitting jobs to potentially long queues). This problem can be solved with a technique called Condor GlideIn. To take advantage of both the powerful reach of general-purpose protocols such as GRAM and the full Condor machinery, a personal Condor pool may be carved out of remote resources [18]. This requires three steps. In the first step, Condor-G is used to submit the standard Condor servers as jobs to remote batch systems. From the remote system's perspective, the Condor servers are ordinary jobs with no special privileges. In the second step, the servers begin executing and contact a personal Condor matchmaker started by the user. In step three, the user may submit normal jobs to the Condor agent, which are then “just-in-time” matched to and executed on the remote resources.

The term “Condor” has become an umbrella term that refers to the services collectively offered by Condor, Condor-G, and Condor-C.

A.2.b The Panda Workload Manager

US ATLAS launched development of the Panda (Production ANd Distributed Analysis) system in August 2005 with an architectural approach driven by the need for major improvements in system throughput, scalability, operations manpower requirements, and efficient integrated data/processing management relative to the previous generation production system in order to meet LHC requirements. Key design elements are:

• Support for a full range of usages from managed production to group and user level production to individual interactive distributed analysis.
• Just-in-time workload delivery to pilots on processing nodes, as described above.
• A system-wide task queue that holds all jobs until pilot dispatch, allowing very flexible and dynamic brokerage coupled to data distribution policy and real-time resource availability.
• Data-driven system design, with data management playing a central role and data pre-placement a prerequisite to workload dispatch to a site.
• Tight integration with the ATLAS distributed data management system Don Quixote (DQ2) [19], using DQ2’s model of datasets (file collections) and subscriptions to them as the basis of data management and movement.
• Major attention to monitoring and automation to make operations workload low and problem diagnostics rapid.
• A pilot job delivery subsystem able to employ multiple job scheduler implementations transparently to the rest of the system (currently Condor-G and PBS).
• Minimal requirements to include a site: pilot delivery (via grid or local queue), outbound HTTP, and data management support.
• Lightweight, highly scalable communication protocols based on REST-style [20] HTTP communication.

The Panda architecture is shown in Figure 1. Within the ATLAS production system Panda functions as a regional (OSG) ‘executor’ interacting with an ATLAS production system ‘supervisor’, Eowyn, to receive and report production work. Panda’s support for a multiplicity of workload sources and types is reflected in a number of ‘regional usage interfaces’ in addition to the ATLAS interface (all supported by a common Panda client interface) to submit OSG regional production, user jobs, and distributed analysis jobs. The Panda server receives work from these front ends into a job queue, upon which a brokerage module operates to prioritize and assign work on the basis of job type, priority, input data and their locality, and available CPU resources. Allocation of job blocks to sites is followed by the dispatch of input data to those sites, handled by a data service interacting with the distributed data management system; jobs are not released for processing until the data arrives. When data dispatch completes, jobs are made available to a job dispatcher.

An independent subsystem manages the scheduling of pilot jobs to deliver them to worker nodes via a range of scheduling systems. A pilot upon launching on a worker node contacts the dispatcher and receives an available job appropriate to the site. (If no appropriate job is available, the pilot immediately exits.) An important attribute of this scheme for interactive analysis, where minimal latency from job submission to launch is important, is that the pilot dispatch mechanism bypasses any latencies in the scheduling system for submitting and launching the pilot itself.

Figure 1: Panda architecture

In Figure 2 the current implementation of Panda is shown, in which all components of the architecture are realized. Implemented front ends are the ATLAS production system, a command line equivalent of the same, and two distributed analysis systems: pathena (an interface to the ATLAS offline software framework, Athena) and the DIAL distributed analysis system. The Panda server containing the central components of the system is implemented in Python (as are all components of Panda, and the DDM system DQ2) and runs under Apache as a web service (in the REST sense; communication is based on HTTP GET/POST with the messaging contained in the URL and optionally a message payload of various formats). MySQL databases implement the job queue and all metadata and monitoring repositories. Condor-G and PBS are the implemented schedulers in the pilot scheduling (resource harvesting) subsystem. A monitoring server works with the MySQL DBs, including a logging DB populated by system components recording incidents via a simple web service behind the standard Python logging module, to provide web browser based monitoring and browsing of the system and its jobs and data.

Figure 2: Current Panda implementation

Figure 3 shows Panda’s DQ2-based automated data handling. All data handling is at the dataset level (file collections, with a ‘data block’ being an immutable dataset). Sites are subscribed to datasets to trigger automated dataflow, and distributed (HTTP URL) callbacks provide notification of transfer completion and are used to trigger job release on data arrival and other chained operations. This automated dataflow, together with enforced data pre-placement as a precondition to job dispatch, has been key to minimizing operational manpower and maximizing robustness against transfer failures and SE problems.

Figure 3: Panda dataflow

Panda has evolved rapidly from a proof-of-concept prototype to a deployed system that took over US ATLAS production responsibilities in December 2005. Panda is now operating as an integral part of the ATLAS production system, managing US production at five Tier 1 and Tier 2 centers. It has processed 11,000 jobs/day peak to date, limited by available CPU resources; no Panda scaling limits have been seen, and the Panda design target is ~100k jobs/day. In ATLAS Computing System Commissioning production, Panda has processed 30% of the 113k ATLAS total through early February. In January it exceeded its target efficiency (job failure rate not arising from the workload itself) for this early phase of development, 90%, and by late February was typically

Title	Just-In-Time Workload Management: Scalable Resource Sharing on the Open Science Grid
Authors	Torre Wenaus, Miron Livny, Frank Würthwein
Institution	University of Wisconsin Madison
Field	Computer Science
Type	collaborative project
Year of publication	2006-2009
City	Madison
