Proposal for an Interactive Grid Analysis Environment Service Architecture

Julian Bunn (julian@cacr.caltech.edu), Rick Cavanaugh (cavanaug@phys.ufl.edu), Iosif Legrand (iosif.legrand@cern.ch), Harvey Newman (newman@hep.caltech.edu), Suresh Singh (suresh@cacr.caltech.edu), Conrad Steenberg (conrad@hep.caltech.edu), Michael Thomas (thomas@hep.caltech.edu), Frank van Lingen (fvlingen@caltech.edu)

1 Introduction

This document proposes an Interactive Grid Analysis Environment Service Architecture, with a focus on High Energy Physics applications, together with a set of work packages and milestones to implement this architecture. The ideas outlined here are motivated by the lack of distributed environments dedicated to interactive Grid analysis, and by the RTAG activity ARDA (Architectural Roadmap towards Distributed Analysis) [1] within the LHC. The foundation of the architecture described here will be based on web services and peer-to-peer systems. Web services have been identified as suitable for Grid applications [2], and peer-to-peer technologies are well suited to environments composed of dynamic resource providers [3]. The goal of this document is to identify services that can be used within the GAE. Besides the reuse of existing services (components), this document describes the characteristics of an interactive Grid Analysis Environment (GAE) and identifies the new Grid services and functionality that are needed to enable distributed interactive analysis. One of the major differences between a GAE and production- or batch-oriented Grids is that the behavior of the users is much less predictable. The aggregate behavior of a collection of users is too complex for humans to steer resource usage "manually". Instead, applications are needed that act on behalf of humans in an autonomous fashion to steer and optimize resource usage.
Furthermore, this document identifies several components that can be used for policy management within the Grid. Policy management prevents certain users from allocating more resources than they are entitled to. While this architecture is being developed to solve specific problems related to CMS data analysis [4], many of the services described here can be used in other Grid systems. In addition, while some of the services have already been developed, or are being developed, they have never before been assembled into such an interactive system. A Grid as used within High Energy Physics consists of a heterogeneous collection of resources and components. A secondary goal of this proposal is therefore to define platform- and language-neutral APIs that allow smooth operation between the heterogeneous Grid resources and components and minimize dependencies between components. This secondary goal is closely related to work done in the PPDG CS 11 work group [5] and is also stated in the HEPCAL II document [6]. Section 2 discusses several characteristics of interactive analysis. Requirements and use cases are discussed in Section 3. Section 4 describes the GAE and the different components it is comprised of. Based on that description, Section 5 discusses several scenarios that show the interaction between different GAE components in typical analysis sessions. Sections 6 and 7 map the identified GAE components to the use cases identified for interactive analysis in HEPCAL [7] and HEPCAL II [6], and to existing components and applications that can be used in the development of the GAE. Finally, Section 8 maps the components described in the architecture section to work packages and milestones. The content of this proposal focuses on interactive analysis; however, components developed for an interactive analysis environment could also be used in a batch or production environment, and vice versa (see Sections 2 and 4).
The research problems, service components, and requirements addressed by this proposal are not only important for interactive analysis; batch analysis will benefit as well. The work outlined in this proposal will pave the way for a high energy physics data Grid in which production, batch, and interactive behavior coexist.

2 Interactivity in a Grid Environment

Part of the motivation for this proposal is the difference between batch and interactive analysis, and the focus on batch analysis within Grid software development for High Energy Physics. This section explains the characteristics of interactive analysis and how it differs from batch analysis. Additional information on interactive and batch analysis can be found in HEPCAL II [6].

2.1 Interactive versus Batch Analysis

The structure of a batch analysis environment is well known within high energy physics [8]. A large number of computing jobs are split up into a number of processing steps, arranged in a directed acyclic graph, and executed in parallel on a computing farm. A batch analysis session is fairly static in that the Directed Acyclic Graph (DAG) structure of the processing steps being executed is known in advance. The only interaction between the user and the batch job is the ability to see the progress of the computing job and the ability to restart processing steps that may have failed due to error. A typical batch analysis session involves the following operations: sign-on to the Grid, specification of requested datasets, generation and review of an execution plan (including the resource allocation for executing the plan), plan submission, plan monitoring and steering, and collection of analysis results. A typical interactive analysis session is quite similar to a batch analysis session. The main difference is that the execution plan for interactive analysis is an iterative loop that builds on the results of previous commands.
The structure of the job graph is not known in advance as it is with batch processing. The user submits a command to be executed, analyzes the results, and then submits new commands that build upon the results of the previous commands. Both interactive and batch analysis make use of "processing instructions". For batch analysis, these processing instructions are comprised of shared libraries, interpreted scripts, and command parameters that are all known in advance, or can be automatically or programmatically determined from the results of an earlier processing job. In an interactive analysis session, however, the processing instructions are not known in advance. The end user analyzes the results of each processing step and changes the parameters, libraries, and scripts for succeeding processing steps. Human intervention, as opposed to computer automation, determines the details of each processing step. As humans tend to be more impatient than computers, processing instructions in an interactive session will have a much shorter execution time than processing instructions in batch jobs. Low latency is required between the submission of a processing instruction and the receipt of its results. Interactive and batch analysis can be seen as the two extremes of an analysis continuum: it is possible to have an analysis session in which both batch- and interactive-style analysis is done. Furthermore, it is difficult to assign a "batch" label to any particular analysis: with sufficiently low latency and enough computing power, the response time of a batch analysis could be short enough that it becomes part of a larger interactive analysis. The next subsections describe several issues that are important for interactive analysis, some of which also apply in a batch environment. However, none of the current Grid batch environments have addressed these issues in a way that would make them usable within an interactive environment.
2.2 Execution state

Execution state is the entire set of information needed to recreate the current analysis session. This includes logical dataset names, processing scripts, shared libraries, processing steps already taken, processing parameters, data provenance, and any temporary results (both logical and physical). It is assumed that the size of the execution state will be small enough to be easily replicated to other locations. Once an interactive session is finished, all the execution steps and their order are known from the execution state. As such, the "recorded" interactive session can be "replayed" as a batch session with minimal intervention from the user. Batch and interactive analysis both involve the allocation of execution nodes on which the actual processing occurs. In batch analysis, there is little advantage to reusing the same execution node multiple times, since the execution state and order are known in advance. Interactive analysis, however, benefits from the reuse of an execution node, since much of the state of the interactive session resides on the execution node. The state for an interactive session cannot be completely stored on the execution node: local scheduling policies and resource failures can make an execution node unavailable for the rest of an interactive session. As such:

- The interactive client must be able to transfer the state of the interactive session to another execution node with no required involvement from the user. The transition from one execution node to another (due to node failure) should be transparent to the user.
- The only reliable source of the current interactive execution state must be stored in the client application (or, for limited-resource clients such as hand-held devices, on the Grid service entry host). The state stored on an individual execution node must be treated as a cache of this canonical state stored in the client application.
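As an illustration, the canonical client-side execution state described above could be captured as a small, serializable record. The following Python sketch is illustrative only: the class and field names are our assumptions, not part of the proposal.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ExecutionState:
    """Canonical interactive-session state, held by the client application.

    Execution nodes only ever hold a cache of this record; the client
    can ship it with every command so a failed node can be replaced.
    """
    datasets: list = field(default_factory=list)      # logical dataset names
    scripts: dict = field(default_factory=dict)       # name -> processing script
    libraries: list = field(default_factory=list)     # shared library identifiers
    history: list = field(default_factory=list)       # ordered processing steps taken
    parameters: dict = field(default_factory=dict)    # current processing parameters
    temp_results: list = field(default_factory=list)  # logical names of temporary results

    def record_step(self, command: str) -> None:
        """Append a completed processing step so the session can be replayed."""
        self.history.append(command)

    def serialize(self) -> str:
        """Small JSON snapshot, cheap enough to send with each command."""
        return json.dumps(asdict(self))

    @classmethod
    def deserialize(cls, blob: str) -> "ExecutionState":
        """Rebuild the state on a new execution node from the snapshot."""
        return cls(**json.loads(blob))
```

Replaying a finished interactive session as a batch job then reduces to re-executing `history` in order on a batch execution node.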
This execution-node state cache can be leveraged to provide lower execution latencies by reusing the same execution node as much as possible during an interactive session. The state for an interactive session is sent with every interactive command so that execution node failures can be handled transparently to the client application (more on this below).

Figure 2.1 Execution state

Figure 2.1 shows how state is managed during an interactive session, as described in Scenario 1.

Scenario 1, State Information: The Grid scheduler locates a suitable Job Processing and Execution Service (JPES) instance to handle the job. The JPES then sends the command to an execution node. The execution node executes the command and returns the new state back to the client (one possibility is to store the execution state in a CVS repository, which would track not only the state but also the state history). The execution node retains a local cache of the application state for future processing. This is shown as step 1 in Figure 2.1. The client application then sends the second command to a Grid scheduler. The Grid scheduler attempts to use the previous execution service to handle the command. When the execution service receives the command, it attempts to use the previous execution node. If the node is still available, it is used to handle the command. This is also shown by step 1 in Figure 2.1. If the previous execution node is not available (it was rescheduled for a higher priority task, or it experienced some hardware failure), the execution service returns a failure code back to the scheduler, which then attempts to locate another suitable execution service to handle the command (execution service 2). A new execution node will be allocated (step 2 in the figure). The new execution node uses the state that was sent along with the command to recreate the environment of the previous execution node and begins to process new commands.
This is shown by step 2.1 in Figure 2.1. If the execution service becomes unavailable, the scheduler sends the command to another available execution service instance (step 3 in the figure). As before, the state is sent along with the command so that the new execution node can recreate the environment of the previous execution node. By sending all interactive commands through a Grid scheduler, the client application never needs to know about execution node failures, or about execution service failures (except for monitoring purposes). State information (job traceability) is also described in HEPCAL II [6].

2.3 Steering and Intelligent Agents

In order to satisfy users' demand for responsiveness in an interactive system, constant status and monitoring feedback must be provided. In any Grid system it must be possible for the user to redirect slow-running jobs to faster execution nodes, and to perform other redirections based on perceived optimizations. By giving users this control, certain patterns of behavior will start to appear and will give valuable information on how to tune the entire system for optimum performance. As certain optimization strategies are discovered, it will become possible to write intelligent agents that perform these optimizations on behalf of the user. These agents will be able to reschedule jobs and re-prioritize requests based on the currently available resources. Intelligent agents can also be used to monitor resource usage profiles and match them to the profiles desired by the greater Grid community. A steering service would interoperate with Grid schedulers: as users decide on constraints on where to run their jobs or what data to access, schedulers need to translate these into a concrete plan that can be executed.

2.4 Resource Policies

Some resources in a Grid environment are scarce (CPU time, storage space, bandwidth), while others are abundant (services). Not all users of the Grid will be given equal priority for all resources.
This is true for the entire spectrum of interactive and batch systems. In addition to policies on who can use which resource, there also needs to be an accounting mechanism that keeps track of who is using which resources, together with a history of resource usage. One way of adjusting user priorities is by setting local policies. Managers of the scarce resources can set local policies that give preference to certain users and groups. This is analogous to UNIX process priorities, except that priorities are given to users instead of processes. A more sophisticated way to manage priorities is through the use of a Grid Economy. Users and groups in the Grid are given some number of Grid Dollars ("Higgies"). These dollars can be spent to procure more resources for more important Grid tasks. Similarly, users and groups would "sell" their own resources to earn more Higgies to spend on their own Grid tasks. Large groups (such as research institutes) could distribute Higgies to individual researchers in the form of a Grid Salary, giving a higher salary to researchers who they believe will use their Higgies more wisely. Within this proposal we recognize the concept of a Grid Economy as important; however, it will not be addressed in the first phases of the GAE.

3 Requirements

Use cases and requirements for a GAE have been described in detail in HEPCAL [7], HEPCAL II [6], and the PPDG CS 11 requirements document [9]. Some of the requirements focus on specific components that will be used within the GAE. Section 6 associates GAE components with use cases that are relevant for interactive analysis. While requirements and use cases have been finalized in HEPCAL and PPDG CS 11, during the development of the GAE the requirements and use cases can be subject to change and improvement as the developers of the system gain more understanding of the complexity of interactive analysis within a Grid environment.
One GAE requirement has already been described in section 2.2: the interactive client must be able to transfer the state of the interactive session to another execution environment (node or farm) with no required involvement from the user. The requirements in [7], [6], and [9] focus on physics analysis, mainly from a single-user point of view. This section lists several requirements that the overall GAE system needs to satisfy. As stated in [7] section 2.5, physics analysis is less predictable than other physics processes such as reconstruction and simulation. Furthermore, physicists from all over the world will access and share resources, and these resources change, move, or disappear over time. In addition to the requirements stated in [7], [6], and [9], interactivity in a Grid imposes the following requirements:

Requirement 1, Fault tolerance: Data loss and loss of time due to component failures should be minimized. A corollary of this is that the web services must operate asynchronously, so that the application can continue to run in the face of a slow or unresponsive service.

Requirement 2, Adaptable: Access to data will be unpredictable; however, patterns may be discovered that can be exploited to optimize the overall performance of the system.

Requirement 3, Scalable: The system must be able to support an ever-increasing number of resources (CPU, network bandwidth, storage space) without any significant loss in responsiveness. The system must also be able to support increasing numbers of users. As more users enter the system, the system will respond predictably and will not become unusable for everyone.

Requirement 4, Robust: The system should not crash if one site crashes (no chain reaction).

Requirement 5, Transparent: It should be possible for Grid clients to get feedback on the estimated state of Grid resources. This allows clients to make better decisions on which resources to use within the Grid.
Requirement 6, Traceability: If the performance of Grid resources drops, there should be diagnostic tools available to analyze these anomalies.

Requirement 7, Secure: Security within the GAE will be certificate-based: users get a certificate/key pair from a certificate authority. This certificate allows a user to access services within the Grid using a proxy that acts on behalf of the user for authentication. Depending on the security settings and policies (e.g. VO management settings) of the different Grid sites and of the individual services of each site, a user can access the services offered by a particular site. Clarens [10] is one of the applications that supports certificate-based security and VO management.

Requirement 8, Independence: The components described in the architecture will be deployed as stand-alone components: there will be no specific dependency between any two components. For example, it should not be the case that in order to use replica manager "X" one must also deploy metadata catalog "Y". Stand-alone components and the use of web services make it possible to "plug in" different components (e.g. different metadata catalogs) that export the same interface as used within the GAE.

Requirement 9, Policies: It should not be possible for users or groups to allocate more resources on the Grid than they are allowed to. There should be a mechanism in the Grid that allows organizations to specify policies, and that monitors the use of resources by users and groups based on these policies. Policies describe a measure of "fair use of Grid resources" for all users and groups involved (provided all users agree on this policy). It should be noted that deciding on a policy cannot be solved by any piece of software, but is subject to discussion between the groups and users affected by that policy.
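Requirement 9 implies a service that tracks usage against quotas and keeps an accounting history. The following Python sketch shows one minimal way such a quota-enforcing accounting service could look; the class name, resource kinds, and units (e.g. "cpu_hours") are assumptions for illustration, not part of the proposal.

```python
class QuotaAccounting:
    """Toy per-user resource accounting against locally set quotas.

    A real GAE policy service would be a distributed web service;
    this only illustrates the grant/deny and history-keeping logic.
    """
    def __init__(self):
        self.quotas = {}   # (user, resource) -> allowed amount
        self.usage = {}    # (user, resource) -> consumed amount
        self.history = []  # accounting records for later auditing

    def set_quota(self, user, resource, amount):
        """A resource manager sets a local policy for a user."""
        self.quotas[(user, resource)] = amount

    def request(self, user, resource, amount):
        """Grant the allocation only if it stays within the user's quota."""
        key = (user, resource)
        used = self.usage.get(key, 0)
        allowed = self.quotas.get(key, 0)
        if used + amount > allowed:
            self.history.append((user, resource, amount, "denied"))
            return False
        self.usage[key] = used + amount
        self.history.append((user, resource, amount, "granted"))
        return True
```

A scheduling or steering service would consult such a service before dispatching jobs, and could mine the accounting history for enforcement strategies.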
4 Architecture

The first subsection describes the high-level view of the architecture, while the second subsection identifies the different services that are needed for interactive analysis within such a high-level architecture.

4.1 Peer Groups

The LHC experiments' computing models were initially based on a hierarchical organization of a Tier 0, multiple Tier 1s, multiple Tier 2s, etc. Most of the data will "flow" from the Tier 0 to the Tier 1s, and from the Tier 1s to the Tier 2s. Furthermore, the Tier 0 and Tier 1s are powerful in terms of CPU power, data storage, and bandwidth. This hierarchical model has been described in "Data Intensive Grids for High-Energy Physics" [11]. Data are organized such that the institutes (represented by tiers) will in most cases be physically close to the data they want to analyze. The hierarchical model described in [11] is a good starting point, but there will be certain dynamics within the LHC computing/data model. Although it is possible to make predictions on how to organize data, the access patterns of the end users doing analysis will be unpredictable. Depending on how "hot" or "cold" data are, attention will shift to different data sets. Furthermore, multiple geographically dispersed users might be interested in data that are geographically dispersed over the different tiers. Such a geographically dispersed user group for particular data can lead to data replication in order to prevent "hot spots" in data access.

Figure 4.1. Peers and Super Peers

Figure 4.1 shows a modification of the hierarchical model: a hierarchical peer-to-peer model, in which the different Tier-x centers act as peers. The thickness of the arrows represents the data movement between peers, while the size of a peer represents its power in terms of CPU and storage. The green part of each peer shows resources that are idle, yellow shows resources being used, and red shows resources that are offline. Data are associated with every peer.
Red-colored data represent "hot" data that are accessed often; blue-colored data represent "cold" data that are accessed less frequently. The hierarchical model is the basis on which data will be distributed; however, the unpredictable behavior of physics analysis as a whole will lead to data and jobs being moved around between different peers. These data and job movements outside the hierarchical model, although relatively small compared to the data movement within the hierarchical model, will still be substantial. When users submit jobs to a peer, middleware will discover what resources are available on other peers to execute the job request. Although a large number of job requests will follow the hierarchical model, other job requests will not. The more powerful peers (super peers) will receive more job and data requests and will host a wider variety of services. In Figure 4.1, the T0, the T1s, and one T2 act as super peers. It is not unlikely that certain Tier 2s could be more "powerful" than a Tier 1 in terms of this measure, and over time the relations between "tier power" can change, due to future hardware and software upgrades. As such, the hierarchical model will form the basis of the peer-to-peer model, but this model is not fixed and can change over time, due to the self-organizing capabilities of the Grid. A peer-to-peer architecture addresses Requirements 3 and 4 on scalability and robustness. The services developed within the GAE should be robust and scalable enough to be deployed in the (hierarchical) peer-to-peer model.

4.2 Services

Based on the discussion of interactive analysis (section 2), the use cases mentioned in HEPCAL, and the requirements above, it is possible to identify a set of core services for the GAE. Within this section the word "service" is used to describe the different components within the architecture, because these components will be implemented as web services within the GAE.
While they have been selected for their necessity in an interactive Grid environment, many of these services could be used in other Grid environments. In order to satisfy the fault-tolerance requirement, all services must be able to operate asynchronously. The following services (components) have been identified for the GAE:

Sign-on and Virtual Organization: These services provide authentication, authorization, and access control for data and computing resources. Virtual Organization management services are also provided to assist in maintaining and modifying the membership of virtual organizations. The Virtual Organization service relates to Requirements 7 and 9 on security and policies.

Data Catalog / Virtual Data: These services provide data lookup capabilities. They allow datasets to be looked up based on physics expressions or other metadata, and return lists of logical filenames. The various HEP experiments will have their own implementations of this service to handle experiment-specific metadata and dataset naming.

Lookup: Lookup services are the main entry point for access to the Grid. The locations of the various services are never known in advance; they must be located using a lookup service. The lookup services allow dynamic lookup of the Grid services, eliminating the need to know service locations in advance. Decentralized peer-to-peer technologies are key to the operation of the lookup services, keeping their service listings accurate and up to date. Three possible architectures for lookup service implementations are described below. Each architecture builds upon the previous one, making it possible to implement the simplest solution [...]
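The lookup-service role described above can be illustrated with a toy in-memory registry. This Python sketch is our own illustration, not one of the three architectures the proposal describes; the lease-based expiry is an assumption about how listings could be kept accurate when providers disappear without deregistering.

```python
import time

class LookupService:
    """Toy in-memory service registry illustrating the lookup-service role.

    Real GAE lookup services would be distributed, peer-to-peer web
    services; here, leases keep listings accurate by expiring entries
    from providers that stop renewing their registration.
    """
    def __init__(self, lease_seconds=300):
        self.lease_seconds = lease_seconds
        self.registry = {}  # service type -> {location: lease expiry time}

    def register(self, service_type, location, now=None):
        """A service provider (re-)announces itself; renewing extends the lease."""
        now = time.time() if now is None else now
        self.registry.setdefault(service_type, {})[location] = now + self.lease_seconds

    def lookup(self, service_type, now=None):
        """Return currently live locations; expired leases are dropped."""
        now = time.time() if now is None else now
        entries = self.registry.get(service_type, {})
        live = {loc: exp for loc, exp in entries.items() if exp > now}
        self.registry[service_type] = live
        return sorted(live)
```

A client would call `lookup("JPES")` (or any other service type) to discover entry points instead of hard-coding service locations.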
2. "DS Meta-Data update" [7]: Catalog/Virtual Data
3. "DS Meta-Data access" [7]: Catalog/Virtual Data
4. "Dataset registration" [7]: Catalog/Virtual Data
5. "Virtual Dataset Declaration" [7]: Catalog/Virtual Data
6. "(Virtual) Dataset Access" [7]: Catalog/Virtual Data
7. "Dataset Access Cost Evaluation" [7]: Estimator
8. "Data Set Replication" [7]: Replica Management, Replica Selection, Catalog
9. "Data Set Deletion, Browsing, Update" [7]: Catalog
10. "Job Submission" [7]: Sign-on, Catalog, Steering, Scheduler, Job Execution, Monitoring [...]

8.3 WP2.2: Set up data sources for analysis. Gather analysis data sources such that they can be accessed via Clarens web services within an analysis scenario in the GAE. Furthermore, create (or copy) analysis scripts (e.g. ROOT/JAS) that can be used on this analysis data. Milestone 2: Multiple Grid services and analysis services are available on the testbed; preferably, not every service is available everywhere, but every service is available more than once. A client [...]

[...] "Overview and Project Structure", Proceedings of CHEP, La Jolla, California, March 2003. 15: C. Cioffi, S. Eckmann, D. Malon, A. Vanaichine, "POOL File Catalog, Collection and Metadata Components", Proceedings of CHEP, La Jolla, California, March 2003. 16: Ian Foster, Jens Vockler, Michael Wilde, Yong Zhao, "Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation", 14th International Conference on Scientific and Statistical Database Management, 2002. 17: JXTA, http://www.jxta.org/ [...]

[...] as an intermediary between the scheduler, the execution service, and the metadata service, eliminating the need for the application to communicate with these services directly. 4) The Process Wrapper uses a lookup service to locate Scheduling and Metadata services on the Grid. A Metadata catalog service is used to map any metadata in the abstract plan into logical data filenames. 5) The Process Wrapper sends the abstract plan to the scheduler. The scheduler further resolves the [...] 29: S.
Bagnasco, P. Cerello, R. Barbera, P. Buncic, F. Carminati, P. Saiz, "AliEn-EDG Interoperability in Alice", Proceedings of CHEP, La Jolla, California, March 2003. 30: Maarten Ballintijn, Rene Brun, Fons Rademakers, Gunther Roland, "The PROOF Distributed Parallel Analysis Framework Based on ROOT", Proceedings of CHEP, La Jolla, California, March 2003. 31: D. Adams, "DIAL: Distributed Interactive Analysis of Large Datasets", Proceedings of [...]

[...] efficient processing. The Replica Management service makes use of lower-level Replica Location, Metadata, and Replica Optimization services to control the movement of data around the Grid. Replica management will prevent bottlenecks in data access on the Grid and lowers the chance of losing valuable data; as such, replica management adds to Grid robustness and scalability (Requirements 3 and 4). Replica Selection: This service is used to locate the optimum replica to use for processing. It is [...]

[...] local resources. For example, individual resource providers (as well as a Virtual Organization as a whole) may assign usage quotas to intended users, and job monitoring mechanisms may be used to tally accounting information for the user. As such, the architecture requires services for storing quota and accounting information so that other services (such as a scheduling service or a steering service) may mine that information for enforcement strategies. Due to the distributed nature of the [...]

A short description of most of the VDT components follows. The Chimera Virtual Data System is a tool for specifying how a data product is or was produced. Chimera consists of a Virtual Data Language for specifying derived-data dependencies in terms of a DAG, a database for persistently storing the DAG representation, an abstract planner for planning a workflow that is location- and data-existence-independent, and a concrete planner for mapping an [...]
Application uses a Lookup Service to locate a suitable Process Wrapper service on the Grid. 3) Application sends an abstract job plan to the Process Wrapper. This abstract plan can be as abstract or as concrete as necessary: if the user knows in advance where data are stored or which execution nodes must be used, that information can be included in the plan; otherwise the plan can leave these as abstract entries to be resolved by the scheduler. The Process Wrapper acts as an intermediary between the scheduler, the execution service, and the metadata service [...]

[...] the infrastructure needed to support this view in a way that is consistent with metadata services at other levels of a persistence architecture. The Chimera Virtual Data System (VDS) [16] provides a catalog that can be used by application environments to describe a set of application programs ("transformations") and then track all the data files produced by executing those applications.
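The Process Wrapper flow above (resolve metadata into logical filenames, then hand the plan to the scheduler) can be sketched in a few lines of Python. The class and method names, the plan layout, and the stand-in services are all illustrative assumptions, not APIs from the proposal.

```python
class ProcessWrapper:
    """Sketch of the Process Wrapper role: it mediates between the client,
    the metadata catalog, and the scheduler, so the client application
    never talks to those services directly."""
    def __init__(self, metadata_catalog, scheduler):
        self.metadata_catalog = metadata_catalog
        self.scheduler = scheduler

    def submit(self, abstract_plan):
        # 1) Resolve any metadata expression into logical filenames.
        plan = dict(abstract_plan)
        if "metadata_query" in plan:
            plan["logical_files"] = self.metadata_catalog.resolve(plan.pop("metadata_query"))
        # 2) Hand the (partially) concrete plan to the scheduler, which
        #    resolves any remaining abstract entries and dispatches it.
        return self.scheduler.schedule(plan)

# Minimal stand-ins for the services the wrapper talks to:
class FakeCatalog:
    def resolve(self, query):
        # Map a metadata query to logical filenames (hypothetical LFN scheme).
        return [f"lfn:{query}/file1", f"lfn:{query}/file2"]

class FakeScheduler:
    def schedule(self, plan):
        return {"status": "submitted", "plan": plan}
```

A plan that already names concrete files or execution nodes would pass through the wrapper unchanged, matching the "as abstract or as concrete as necessary" behavior described above.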