INTEGRATED RESEARCH IN GRID COMPUTING
Fault-injection and Dependability Benchmarking

[Figure 2. FCI fault-injection results. Plot title: Impact of faults; x-axis: probability of fault appearance (0.2 to 0.8); legend: Added and Completed series.]

corrected data and the at-most-once semantics has been followed in the execution of the SOAP service. This was the synthetic application used in the experiments described here. Other Web Services and Grid Services are currently under assessment with the QUAKE tool.

The testing infrastructure was a cluster of 12 machines running Linux and Java 1.4. The SOAP service ran on a central (dual-processor) node of the cluster, on a Tomcat-Axis server running on top of Linux. As far as we know, most Java-based Web Services and Grid Services currently use Tomcat-Axis, so we were interested in evaluating the robustness of this middleware. Of those 12 machines, one ran the SUT application, another was dedicated to the BMS system, and the remaining 10 ran instances of the clients, which in practice were the workload generators.

For the results presented here we chose the following parameters:

1. Default configuration parameters of the JVM, Tomcat and Axis.
2. The Tomcat JVM ran with the implicit Java garbage collector.
3. Overall, the client machines send 1 million SOAP requests.
4. The requests follow the "continuous-burst" distribution.
5. There is no retransmission of SOAP requests when a client gets a response error. This way there are no repeated messages and the "at-most-once" semantics is not violated.
6. No fault-load is introduced in the SUT system. We ran the SOAP services on a dedicated server and all the operating system resources were available to the application.
This means we are testing a Web Service in a normal environment without any perturbations at the system level.

The results of the first experiment are presented in Figure 3. The figure shows the number of requests per second served by the SOAP service over time. In this benchmark run, the client machines sent 1 million requests to the SOAP service running on a dedicated dual-processor machine. We used the default configuration of Tomcat/Axis, which allocates a JVM with 64 MB. This first run produced striking results: the test took 31 minutes, and only 73,740 of the 1 million requests were processed (about 7.37% of the total). The remaining requests were not processed by the server because of "out of memory" errors. We observed that this failure was directly related to memory leaks in the Tomcat/Axis middleware.

[Figure 3. Results for the first test run, with default configuration. x-axis: time (min); y-axis: requests per second, ticks from 50 to 250.]

More interesting than this result is the type of failure that occurred in the SUT server: the Tomcat processes did not crash; they were left in a completely hung state, in which even the Tomcat shutdown command was unable to restart the server. It was necessary to explicitly kill all the processes and restart the Tomcat server. This is the type of failure that would require human intervention in a production system, and such failures are very expensive to handle for that reason. When systems start growing in complexity, their management becomes almost impossible [16]. The vision for autonomic computing defended by IBM researchers is entirely shared by the authors of this paper, who recognize the strategic importance of creating self-healing Grid Services.

From the first test run it was clear that the SOAP server was under-configured in terms of memory for the selected workload.
So, in the second test run the memory of the Tomcat JVM was increased from 64 MB to 1 GB. The results are presented in Figure 4. This time the SOAP server did not crash and executed all 1 million requests. The total turnaround time of the execution was 737 minutes. The peaks that show up in the graph are due to the execution of the garbage collector on the server side. We can conclude from the graph that the SOAP service keeps running but the throughput (requests per second) drops heavily over time, which leads to the observation: the SOAP service does not crash, but it runs slower and slower over time.

[Figure 4. Results for the second test run, with a JVM of 1 GB. x-axis: time (min); y-axis: requests per second, ticks from 50 to 300.]

Once again the reason for this performance drop has to do with memory leakage in the SOAP middleware. One point of concern, even in this configuration, is the sharp decrease in the QoS level of the SOAP service: at the beginning it was able to sustain about 230 requests per second. At the end of the test run the throughput was less than 20 requests per second, that is, less than 10% of the initial throughput. This observation led us to ask: how can we improve the throughput of the SOAP service and maintain it at acceptable levels? How can we provide some self-healing mechanisms for this SOAP server? How can we prevent the SOAP server from failing and being left in a hung state?

With these questions in mind we started thinking about applying a software-rejuvenation technique to increase the throughput of the SOAP service. The decision was to implement preventive rebooting, both to avoid a zombie crash (hung state) of the server and to keep the server from falling to a lower level of throughput: when the throughput decreased to 20% of the initial throughput, the watchdog restarted the Tomcat/Axis server.
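The restart rule described above can be illustrated with a minimal sketch. This is not the watchdog used in the experiments: the class name, the restart hook, and the choice to re-learn the baseline after each restart are assumptions made for illustration. Only the 20% threshold comes from the text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RejuvenationWatchdog:
    """Trigger a restart when throughput falls to a fraction of its baseline."""
    restart: Callable[[], None]   # hypothetical hook: drain requests, then restart
    threshold: float = 0.20       # 20% of the initial throughput, as in the text
    baseline: float = 0.0         # first throughput sample after a (re)start
    restarts: int = 0

    def observe(self, requests_per_second: float) -> bool:
        """Feed one throughput sample; return True if a restart was triggered."""
        if self.baseline <= 0.0:
            # Assumption: the first sample after start-up is the initial throughput.
            self.baseline = requests_per_second
            return False
        if requests_per_second <= self.threshold * self.baseline:
            self.restart()
            self.restarts += 1
            self.baseline = 0.0   # assumption: re-learn the baseline after a restart
            return True
        return False
```

With an initial sample of 230 requests per second, this sketch triggers the hook once a sample drops to 46 or below, then waits for the next sample to set a fresh baseline.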
This restart was done in a clean way: the SOAP server closed the service to new requests, and all ongoing requests were finished before the shutdown-restart was applied to Tomcat. At the end of the test run the correctness of the application was successfully verified. In Figure 5 we present the results of this test run. The deep dips in the throughput level correspond to restart events. Each shutdown-restart of Tomcat took between 14 and 16 seconds.

[Figure 5. Results for the fourth test run with a preventive shutdown of the server. x-axis: time (min).]

At first sight it seems that this technique would not produce interesting results, since it creates some seconds of downtime at the SOAP server. As can be seen in the figure, there were 15 preventive restarts, which may have resulted in about 225 seconds of downtime overall (15 restarts at roughly 15 seconds each). So this technique is not good from the point of view of the availability metric. But the result obtained for the turnaround metric is quite interesting: the total turnaround time of the test run was 146 minutes. This means the SOAP service was about 5 times faster than in the second test run (737 minutes against 146 minutes). It is clear that this "wise-reboot" technique can increase the sustained throughput of the SOAP server and avoid the zombie crashes that would normally require human intervention.

There are more results taken with the QUAKE tool, but this small set of results is clearly representative of the interest of using dependability benchmarking to assess the robustness of SOAP services and Grid services.

5. Conclusions and Current Status

We reviewed several available tools for software fault injection and dependability benchmarking for grids.
We emphasized the FAIL-FCI fault injector developed by INRIA and the QUAKE dependability benchmark developed by the University of Coimbra. The FAIL-FCI tool has so far provided only preliminary results on desktop grid middleware (XtremWeb) and P2P middleware (the FreePastry Distributed Hash Table). These results made it possible to identify quantitative failure points in both tested middleware, as well as qualitative issues concerning the failure recovery of XtremWeb. With the QUAKE tool we have been conducting the following experimental studies: (a) assessing the reliability of different middleware for client/server applications; (b) studying the reliability of OGSA-DAI and other tools from GT4.

Acknowledgments

This research work is carried out in part under the FP6 Network of Excellence CoreGRID funded by the European Commission (Contract IST-2002-004265).

References

[1] P. Koopman, H. Madeira. "Dependability Benchmarking & Prediction: A Grand Challenge Technology Problem", Proc. 1st IEEE Int. Workshop on Real-Time Mission-Critical Systems: Grand Challenge Problems, Phoenix, Arizona, USA, Nov 1999.
[2] S. Ghosh, A. P. Mathur. "Issues in Testing Distributed Component-Based Systems", 1st Int. ICSE Workshop on Testing Distributed Component-Based Systems, 1999.
[3] H. Madeira, M. Zenha Rela, F. Moreira, and J. G. Silva. "Rifle: A general purpose pin-level fault injector". In European Dependable Computing Conference, pages 199-216, 1994.
[4] S. Dawson, F. Jahanian, and T. Mitton. "Orchestra: A fault injection environment for distributed systems". Proc. 26th International Symposium on Fault-Tolerant Computing (FTCS), pages 404-414, Sendai, Japan, June 1996.
[5] D. T. Stott et al. "Nftape: a framework for assessing dependability in distributed systems with lightweight fault injectors". In Proceedings of the IEEE International Computer Performance and Dependability Symposium, pages 91-100, March 2000.
[6] R. Chandra, R. M. Lefever, M. Cukier, and W. H. Sanders. "Loki: A state-driven fault injector for distributed systems". In Proc. of the Int. Conf. on Dependable Systems and Networks, June 2000.
[7] http://www.lri.fr/fci/GdX
[8] S. Lumetta and D. Culler. "The Mantis parallel debugger". In Proceedings of SPDT'96: SIGMETRICS Symposium on Parallel and Distributed Tools, pages 118-126, Philadelphia, Pennsylvania, May 1996.
[9] William Hoarau and Sébastien Tixeuil. "A language-driven tool for fault injection in distributed applications". In Proceedings of the IEEE/ACM Workshop GRID 2005, page to appear, Seattle, USA, November 2005.
[10] M. Vieira and H. Madeira. "A Dependability Benchmark for OLTP Application Environments", Proc. 29th Int. Conf. on Very Large Data Bases (VLDB-03), Berlin, Germany, 2003.
[11] K. Buchacker and O. Tschaeche. "TPC Benchmark-c version 5.2 Dependability Benchmark Extensions", http://www.faumachine.org/papers/tpcc-depend.pdf, 2003.
[12] D. Wilson, B. Murphy and L. Spainhower. "Progress on Defining Standardized Classes of Computing the Dependability of Computer Systems", Proc. DSN 2002, Workshop on Dependability Benchmarking, Washington, D.C., USA, 2002.
[13] A. Kalakech, K. Kanoun, Y. Crouzet and A. Arlat. "Benchmarking the Dependability of Windows NT, 2000 and XP", Proc. Int. Conf. on Dependable Systems and Networks (DSN 2004), Florence, Italy, 2004.
[14] J. Duraes, H. Madeira. "Characterization of Operating Systems Behaviour in the Presence of Faulty Drivers Through Software Fault Emulation", in Proc. 2002 Pacific Rim Int. Symposium Dependable Computing (PRDC-2002), pp. 201-209, Tsukuba, Japan, 2002.
[15] A. Brown, L. Chung, and D. Patterson. "Including the Human Factor in Dependability Benchmarks", Proc. of the 2002 DSN Workshop on Dependability Benchmarking, Washington, D.C., June 2002.
[16] A. Brown, L. Chung, W. Kakes, C. Ling, D. A.
Patterson. "Dependability Benchmarking of Human-Assisted Recovery Processes", Dependable Computing and Communications, DSN 2004, Florence, Italy, June 2004.
[17] A. Brown and D. Patterson. "Towards Availability Benchmarks: A Case Study of Software RAID Systems", Proc. of the 2000 USENIX Annual Technical Conference, San Diego, CA, June 2000.
[18] J. Zhu, J. Mauro, I. Pramanick. "R3 - A Framework for Availability Benchmarking", Proc. Int. Conf. on Dependable Systems and Networks (DSN 2003), USA, 2003.
[19] J. Zhu, J. Mauro, and I. Pramanick. "Robustness Benchmarking for Hardware Maintenance Events", in Proc. Int. Conf. on Dependable Systems and Networks (DSN 2003), pp. 115-122, San Francisco, CA, USA, IEEE CS Press, 2003.
[20] J. Mauro, J. Zhu, I. Pramanick. "The System Recovery Benchmark", in Proc. 2004 Pacific Rim Int. Symp. on Dependable Computing, Papeete, Polynesia, 2004.
[21] S. Lightstone, J. Hellerstein, W. Tetzlaff, P. Janson, E. Lassettre, C. Norton, B. Rajaraman and L. Spainhower. "Towards Benchmarking Autonomic Computing Maturity", 1st IEEE Conf. on Industrial Automatics (INDIN-2003), Canada, August 2003.
[22] A. Brown, J. Hellerstein, M. Hogstrom, T. Lau, S. Lightstone, P. Shum, M. P. Yost. "Benchmarking Autonomic Capabilities: Promises and Pitfalls", Proc. Int. Conf. on Autonomic Computing (ICAC'04), 2004.
[23] A. Brown and J. Hellerstein. "An Approach to Benchmarking Configuration Complexity", Proc. of the 11th ACM SIGOPS European Workshop, Leuven, Belgium, September 2004.
[24] A. Brown, C. Redlin. "Measuring the Effectiveness of Self-Healing Autonomic Systems", Proc. 2nd Int. Conf. on Autonomic Computing (ICAC'05), 2005.
[25] J. Duraes, M. Vieira and H. Madeira. "Dependability Benchmarking of Web-Servers", Proc. 23rd International Conference, SAFECOMP 2004, Potsdam, Germany, September 2004. Lecture Notes in Computer Science, Volume 3219/2004.
[26] William Hoarau, Sébastien Tixeuil, and Fabien Vauchelles. "Easy fault injection and stress testing with FAIL-FCI". Technical Report 1421, Laboratoire de Recherche en Informatique, University Paris Sud, October 2005.

USER MANAGEMENT FOR VIRTUAL ORGANIZATIONS

Jiří Denemark, Luděk Matyska, Miroslav Ruda
Institute of Computer Science, Masaryk University, Botanická 68a, 602 00 Brno, Czech Republic
{jirka,ludek,ruda}@ics.muni.cz

Michal Jankowski, Norbert Meyer, Pawel Wolniewicz
Poznan Supercomputing and Networking Center, ul. Noskowskiego 10, 61-704 Poznan, Poland
{jankowsk,meyer,pawelw}@man.poznan.pl

Abstract

Scalable and fine-grained Grid authorization requires moving away from grid-mapfile based access control and 1-to-1 mappings to individual OS user accounts. This is recognized and addressed by virtual organization (VO) authorization services, e.g. VOMS/LCAS and CAS. They, however, do not address user OS account management and isolation/sandboxing requirements, such as flexible pooling of accounts while maintaining auditing records. This paper describes some existing systems for user management for VOs and provides a list of requirements for a new user management system, on which our current research is focused.

Keywords: user management, virtual organization, accounting, authorization, authentication, encapsulation, logging, LCAS, LCMAPS, VOMS, VUS, Perun

1. Introduction

The main aim of the user management system is controlled, secure access to grid resources. Security requires authentication of the user and authorization based on the combined security policy of the resource provider and of the user's virtual organization. The second important requirement is the ability to log user activities for accounting and auditing, and to make these data available both to the resource provider and to the user's virtual organization. From the user's point of view, an important feature is single sign-on.
The problem of user management is non-trivial in an environment that includes a large number of computing resources, data, and hundreds or even thousands of users participating in many virtual organizations. The complexity arises from the time required for administration tasks and from the need to automate these tasks. There are many solutions that attempt to fulfill these basic requirements and solve this problem, but none of them, to the best of our knowledge, solves it in a comprehensive and satisfactory way.

2. Definitions

Virtual organization (VO) is a set of individuals and/or institutions that allows its members to share resources in a controlled manner, so that they may collaborate to achieve a shared goal [1].

We assume that virtual organizations may form hierarchies. The hierarchy of VOs is useful for user management on the VO side (delegation of administrative burden to sub-organizations in the case of big organizations) and for accounting (sub-organizations may correspond to real institutions and departments that are responsible for paying the bills). The hierarchy forms a Directed Acyclic Graph (DAG) where the VOs are vertices and the edges represent relations between them (see [3], where sub-organizations are called "groups"). A user may be a member of many VOs; in particular, a member of a sub-organization is also a member of the parent organization.

The privileges the organization wants to grant the user, related to the tasks he is supposed to perform, are connected with user roles. The roles are defined across the hierarchy of VOs and managed in an independent structure, although the authorities of VOs are responsible for defining roles. One user may have multiple roles and is responsible for selecting the required role when accessing a resource.

Any special rights to resources, expressed e.g. by an ACL [2], are called capabilities. The capabilities may be used to express any rights of a specific user, e.g., some file is writable only by the owner.

Resource provider (RP) is an abstract entity that owns and offers some resources (e.g. services, CPU, disks, data, etc.) to the grid users.

[Figure 1. Hierarchy of Virtual Organizations, User Roles and Capabilities. Diagram labels: CoreGrid, FHG, PSNC, CG.Admin, CG.Developer, CG.User, CG.PSNC.User, CoreGrid staff; legend: VO, roles, capabilities. The capability shown is a WebDAV-style ACL entry (D:ace) granting the D:read privilege to the principal D:all.]

By the virtual environment we understand the encapsulation of user jobs in order to give them a limited set of privileges and to be able to identify the user and the organization on whose behalf the job acts. Example implementations are virtual accounts [8], virtual machines, and sandboxes [5].

3. Existing Solutions

In this section we provide a brief description of several systems that try to cope with user management in the context of virtual organizations.

3.1 Perun

Perun [9] provides a repository of complex authorization data, as well as tools to manage the data. The data are used to generate the configuration of the authorization services themselves (from UNIX user accounts through grid-mapfiles to the VOMS database). In turn, these services are used to enforce authorization policies.

Perun makes use of a central configuration repository which models an ideal world, i.e. what the resources should look like. In this central repository all the necessary (and possibly very complex) integrity constraints are relatively easy to enforce. The repository is complemented by a change propagation mechanism which detects changes, generates consistent configuration snapshots of atomic pieces of the managed systems, and tries to deliver them to their final destinations, appropriately dealing with resource or network failures. In this way, the real world is forced to follow the ideal one as closely as possible.

[...]
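The change propagation idea described for Perun can be sketched roughly as follows. This is an illustrative sketch only, not Perun's implementation: the function names, the digest-based change detection, and the deliver hook are all assumptions. It shows one way to detect changes against an ideal-world repository, deliver whole snapshots, and keep failed deliveries pending for a later round.

```python
import hashlib
import json
from typing import Callable, Dict


def snapshot_digest(config: dict) -> str:
    """Stable fingerprint of one target's desired configuration."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def propagate(desired: Dict[str, dict],
              delivered: Dict[str, str],
              deliver: Callable[[str, dict], bool]) -> Dict[str, str]:
    """Push changed snapshots; return the targets whose delivery failed.

    desired   -- the "ideal world": target name -> full configuration
    delivered -- digest of the snapshot each target last accepted (updated in place)
    deliver   -- hypothetical transport hook; returns False on a resource or network failure
    """
    pending = {}
    for target, config in desired.items():
        digest = snapshot_digest(config)
        if delivered.get(target) == digest:
            continue                  # the real world already matches the ideal one
        if deliver(target, config):   # a whole snapshot, never a partial update
            delivered[target] = digest
        else:
            pending[target] = digest  # left for the next propagation round
    return pending
```

Calling propagate again with the same arguments retries only the targets that are still out of date, which is one simple way to tolerate temporary failures.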
... self-contained so that the RP does not need to contact any external entity to obtain any information (such as VO, role(s), capabilities) required to authorize the user. This additional information must be stored within the request in an expandable way.

The authorization module should be plug-in-based in order to allow flexible configuration (use a different set of plug-ins...

... other important design assumption is being concordant with the existing standards and trends in the area of grid computing, especially the web service (WS) approach. The WS-Stateful Resource [7] technology seems to be especially promising for our purpose, as it allows for easy modeling of the virtual environment and managing its life cycle.

RequestCredential (push model)...

... existing solutions, allowing combination of their features.

7. Acknowledgment

This work has been supported by the CESNET Research Intent (MSM6383917201) and the EU CoreGRID NoE (FP6-004265).

References

[1] I. Foster, C. Kesselman, S. Tuecke. The Anatomy of the Grid: Enabling Scalable Virtual Organizations, International J. Supercomputer Applications, 15(3), 2001.
[2] ...
... J. Hahkala, K. Lorentey. Managing Dynamic User Communities in a Grid of Autonomous Resources, Computing in High Energy and Nuclear Physics, La Jolla, California, 24-28 March 2003.
[4] K. Keahey, V. Welch, S. Lang, B. Liu, S. Meder. Fine-Grain Authorization Policies in the GRID: Design and Implementation. 1st International Workshop on Middleware for Grid Computing, 2003.
[5] K. Keahey, K. Doering, I. Foster. From Sandbox...
[Figure 3. Virtual Environment Information Service. Diagram labels: User (GetMyAccounting); VO Manager (GetVOAccounnting, GetVOLogging); Resource Owner (GetResAccounnting, GetResLogging, SetPricing); Virtual Environment Information Service; Authorization Module (pluggable); Virtual Environment Database; VOMS, CAS, Gridmapfile.]

For billing purposes the accounting information must...

... Environments in the Grid, 5th International Workshop in Grid Computing (Grid 2004), Pittsburgh, PA, November 2004.
[6] K. Keahey, I. Foster, T. Freeman, X. Zhang, D. Garlon. Virtual Workspaces in the Grid, Europar 2005, Pisa, Italy, August 2005.
[7] I. Foster, J. Frey, S. Graham, S. Tuecke, K. Czajkowski, D. Ferguson, F. Leymann, M. Nally, I. Sedukhin, D. Snelling, T. Storey, W. Vambenepe, S. Weerawarana. Modeling Stateful...

... Trimintzios
European Network and Information Security Agency, P.O. Box 1309, 71001, Heraklio, Greece
panagiotis.trimintzios@enisa.eu.int

Abstract

This paper focuses on the integration of passive and active network monitoring techniques in Grid systems. We propose a number of performance metrics for assessing the quality of the connectivity, and describe the required measurement methods for obtaining these...

... second important issue is fine-grained authorization [4], which allows limiting user access rights to specific resources. The authorization is based on the triplet VO, role, capabilities [2] and is done on the computing node. The RP policy defines privileges for a given VO-role pair and interprets the capabilities. The RP policy may limit the privileges in any way, including denying access altogether. The virtual...
based grids, Cracow '04 Grid Workshop Proceedings, December 2004.
[9] Aleš Křenek and Zora Sebestianová. Perun - Fault-Tolerant Management of Grid Resources, Cracow '04 Grid Workshop Proceedings, December 2004.
[10] I. Foster. Globus Toolkit Version 4: Software for Service-Oriented Systems. IFIP International Conference on Network and Parallel Computing, Springer-Verlag LNCS 3779, pp. 2-13, 2005.

ON THE INTEGRATION...

[...]

The core of the system is completely independent of the structure and semantics of the configuration data, hence the system is easily extensible.

3.2 Virtual User System

The Virtual User System (VUS) [8] is an extension of the system that runs users' jobs (e.g. scheduling system, Globus Gatekeeper, etc.) and allows running jobs without having a personal ...