
16 Virtualization Services for Data Grids

Reagan W. Moore and Chaitan Baru
University of California, San Diego, California, United States

Grid Computing – Making the Global Infrastructure a Reality. Edited by F. Berman, A. Hey and G. Fox. © 2003 John Wiley & Sons, Ltd. ISBN: 0-470-85319-0

16.1 INTRODUCTION

The management of data within Grids is a challenging problem. It requires providing easy access to distributed, heterogeneous data that may reside in different 'administrative domains', may be represented by heterogeneous data formats, and/or may have different semantic meaning. Since applications may access data from a variety of storage repositories, for example, file systems, database systems, Web sites, document management systems, scientific databases, and so on, there is a need to define a higher-level abstraction for data organization. This is generally referred to as a data collection. A data collection contains named entities that may in actuality be stored in any of the repositories mentioned above. Data collections can be organized hierarchically, so that a collection may contain subcollections, and so on.

An important issue in building collections is naming. Typically, the naming conventions are dependent upon the type of digital entity. While there are technologies to build uniform namespaces, they are typically different for each type of digital entity. For example, a Geographic Information System may use one approach to organize and name spatial data, database systems may use another approach for structured data, and file systems may use yet another for files. Data Grids must provide the capability to assemble (and, further, integrate) such disparate data into coherent collections. This problem is further exacerbated owing to the fact that data collections can persist longer than their supporting software systems.
Thus, the organization of data collections may need to be preserved across multiple generations of supporting infrastructure. There is a need for technologies that allow a data collection to be preserved while software evolves [1]. Data Grids must provide services, or mechanisms, to address both the data naming and the data persistence issues.

Data Grids provide a set of virtualization services to enable management and integration of data that are distributed across multiple sites and storage systems. These include services for organization, storage, discovery, and knowledge-based retrieval of digital entities – such as output from word processing systems, sensor data, and application output – and associated information. Some of the key services are naming, location transparency, federation, and information integration. Data Grid applications may extract information from data residing at multiple sites, and even different sets of information from the same data. Knowledge representation and management services are needed to represent the different semantic relationships among information repositories.

This chapter provides a survey of data management and integration concepts used in Data Grids. Further details can be found in References [2, 3]. In the following sections, we define 'digital entities' in the Grid as combinations of data, information, and knowledge, and define the requirements for persistence (Section 16.2). We discuss issues in managing data collections at the levels of data, information, and knowledge. The state of the art in Data Grid technology is discussed, including the design of a persistent archive infrastructure based upon the convergence of approaches across several different extant Data Grids (Section 16.3). Approaches to information integration are also described on the basis of data warehousing, database integration, and semantic-based data mediation (Section 16.4).
We conclude in Section 16.5 with a statement of future research challenges.

16.2 DIGITAL ENTITIES

Digital entities are bit streams that can only be interpreted and displayed through a supporting infrastructure. Examples of digital entities include sensor data, output from simulations, and even output from word processing programs. Sensor data usually are generated by instruments that convert a numerical measurement into a series of bits (that can be interpreted through a data model). The output from simulation codes must also be interpreted, typically by applying a data format; the output from word processing programs requires interpretation by the program that generated the corresponding file.

Digital entities inherently are composed of data, information (metadata tags), and knowledge in the form of logical relationships between metadata tags or structural relationships defined by the data model. More generally, it is possible for these relationships to be statistical, based, for example, on the results of knowledge discovery and data mining techniques.

Digital entities reside within a supporting infrastructure that includes software systems, hardware systems, and encoding standards for semantic tags, structural data models, and presentation formats. The challenge in managing digital entities is not just the management of the data bits but also the management of the infrastructure required to interpret, manipulate, and display these entities or images of reality [4].

In managing digital entities, we can provide Grid support for the procedures that generate these digital entities. Treating the processes used to generate digital entities as first-class objects gives rise to the notion of 'virtual' digital entities, or virtual data. This is similar to the notion of a view in database systems.
Rather than actually retrieving a digital entity, the Grid can simply invoke the process for creating that entity when there is a request to access such an entity. By careful management, it is possible to derive multiple digital entities from a single virtual entity. Managing this process is the focus of virtual Data Grid projects [5].

The virtual data issue arises in the context of long-term data persistence as well, when a digital entity may need to be accessed possibly years after its creation. A typical example is the preservation of the engineering design drawings for each aeroplane that is in commercial use. The lifetime of an aeroplane is measured in decades, and the design drawings typically have to be preserved through multiple generations of software technology. Either the application that was used to create the design drawings is preserved, in a process called emulation, or the information and the knowledge content is preserved, in a process called migration. We can maintain digital entities by either preserving the processes used to create their information and knowledge content, or by explicitly characterizing their information and knowledge content and then preserving the characterization.

There is strong commonality between the software systems that implement virtual Data Grids and emulation environments. These approaches to data management focus on the ability to maintain the process needed to manipulate a digital entity, either by characterization of the process or by wrapping of the original application. There is also a strong commonality between the software systems that implement Data Grids and migration environments. In this discussion, we will show that similar infrastructure can be used to implement both emulation and migration environments. In emulation, a level of abstraction makes it possible to characterize the presentation application independent of the underlying operating system.
Similarly, in migration, for a given data model it is possible to characterize the information and knowledge content of a digital entity independent of the supporting software infrastructure. We will also look at the various extraction-transformation-load (ETL) issues involved in information integration across such multiple archives, across subdisciplines and disciplines. In the next section, we discuss the notion of long-term persistence of digital entities.

16.2.1 Long-term persistence

Digital entity management would be an easier task if the underlying software and hardware infrastructure remained invariant over time. With technological innovation, new infrastructure may provide lower cost, improved functionality, or higher performance. New systems appear roughly every 18 months. In 10 years, the underlying infrastructure may have evolved through six generations. Given that the infrastructure continues to evolve, one approach to digital entity management is to try to keep the interfaces between the infrastructure components invariant.

Standards communities attempt to encapsulate infrastructure through definition of an interface, data model, or protocol specification. Everyone who adheres to the specification uses the defined standard. When new software is written that supports the standard, all applications that also follow the standard can manipulate the digital entity. Emulation specifies a mapping from the original interface (e.g. operating system calls) to the new interface. Thus, emulation is a mapping between interface standards. Migration specifies a mapping from the original encoding format of a data model to a new encoding format. Thus, migration is a mapping between encoding format standards. Preservation can thus be viewed as the establishment of a mechanism to maintain mappings from the current interface or data model standard to the oldest interface or data model standard that is of interest.
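As an illustration, migration as a mapping between encoding format standards can be sketched as a chain of pairwise format conversions that brings an old entity forward to the current format. The format names (v1, v2, v3) and the byte-level conversions below are invented for the example; they stand in for real transformative migrations, not any particular archive's formats.

```python
# Hypothetical sketch: preservation as maintained mappings between
# successive encoding formats. Format names and conversions are invented.

def v1_to_v2(doc: bytes) -> bytes:
    """Migrate the (hypothetical) v1 encoding to v2."""
    return doc.replace(b"FMT1", b"FMT2")

def v2_to_v3(doc: bytes) -> bytes:
    """Migrate the (hypothetical) v2 encoding to v3."""
    return doc.replace(b"FMT2", b"FMT3")

# The migration environment keeps the ordered list of format standards
# and the mapping defined between each adjacent pair.
VERSIONS = ["v1", "v2", "v3"]
MIGRATIONS = {("v1", "v2"): v1_to_v2, ("v2", "v3"): v2_to_v3}

def migrate(doc: bytes, src: str, dst: str) -> bytes:
    """Compose the pairwise mappings to carry an entity from src to dst."""
    i, j = VERSIONS.index(src), VERSIONS.index(dst)
    for a, b in zip(VERSIONS[i:j], VERSIONS[i + 1:j + 1]):
        doc = MIGRATIONS[(a, b)](doc)
    return doc
```

The point of the composition is that only the mapping between adjacent format generations must be maintained; access to the oldest format of interest falls out of the chain.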
We can characterize virtual Data Grids and digital libraries by the mappings they provide between either the digital entities and the generating application, or the digital entity file name and the collection attributes. Each type of mapping is equivalent to a level of abstraction. A data management system can be specified by the levels of abstraction that are needed to support interoperability across all the software and hardware infrastructure components required to manipulate digital entities. Specification of the infrastructure components requires a concise definition of what is meant by data, information, and knowledge.

16.3 DATA, INFORMATION, AND KNOWLEDGE

It is possible to use computer science–based specifications to describe what data, information, and knowledge represent [6]. In the simplest possible terms, they may be described as follows:

• Data corresponds to the bits (zeroes and ones) that comprise a digital entity.

• Information corresponds to any semantic tag associated with the bits. The tags assign semantic meaning to the bits and provide context. The semantically tagged data can be extracted as attributes that are managed in a database as metadata.

• Knowledge corresponds to any relationship that is defined between information attributes or that is inherent within the data model. The types of relationships are closely tied to the data model used to define a digital entity. At a minimum, semantic/logical relationships can be defined between attribute tags, and spatial/structural, temporal/procedural, and systemic/epistemological relationships can be used to characterize data models.

Data and information that are gathered and represented, as described above, at the subdisciplinary and disciplinary levels must then be integrated in order to create cross-disciplinary information and knowledge. Thus, information integration at all levels becomes a key issue in distributed data management.
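The three levels can be made concrete with a small sketch. The entity, tags, and relationships below are hypothetical; the only point is the separation of raw bits (data), semantically tagged attributes (information), and relationships between attributes (knowledge).

```python
# Illustrative separation of data, information, and knowledge.
from dataclasses import dataclass, field

@dataclass
class DigitalEntity:
    data: bytes                                  # Data: the raw bits
    metadata: dict = field(default_factory=dict) # Information: semantic tags

# Knowledge: relationships defined between information attributes,
# expressed here as (subject, relation, object) triples. The relation
# names 'is_a' and 'has_a' follow the logical relationships discussed
# in the text; the specific terms are invented.
relationships = [
    ("temperature", "is_a", "measurement"),
    ("sensor_id", "has_a", "calibration_record"),
]

entity = DigitalEntity(
    data=b"\x00\x01\x02\x03",
    metadata={"sensor_id": "S-17", "temperature": "281.4", "units": "K"},
)
```

In a Data Grid, the `metadata` attributes would be extracted into a database, while the `data` bytes remain in a storage repository and the triples live in a knowledge repository.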
16.3.1 A unifying abstraction

Typically, data are managed as files in a storage repository, information is managed as metadata in a database, and knowledge is managed as relationships in a knowledge repository. This is done at a subdisciplinary or disciplinary level (in science) or at an agency or department level (in government or private sector organizations). For each type of repository, mechanisms are provided for organizing and manipulating the associated digital entity. Files are manipulated in file systems, information in digital libraries, and knowledge in inference engines. A Data Grid defines the interoperability mechanisms for interacting with multiple versions of each type of repository. These levels of interoperability can be captured in a single diagram that addresses the ingestion, management, and access of data, information, and knowledge [7]. Figure 16.1 shows an extension of the two-dimensional diagram with a third dimension to indicate the requirement for information integration across various boundaries, for example, disciplines and/or organizations.

In two dimensions, the diagram is a 3 × 3 data management matrix that characterizes the data handling systems in the lowest row, the information handling systems in the middle row, and the knowledge handling systems in the top row. The ingestion mechanisms used to import digital entities into management systems are characterized in the left column. Management systems for repositories are characterized in the middle column, and access systems are characterized in the right column. Each ingestion system, management system, and access system is represented as a rectangle. The interoperability mechanisms for mapping between data, information, and knowledge systems are represented by the Grid that interconnects the rectangles.
[Figure 16.1: Characterization of digital entity management systems. The figure shows the 3 × 3 matrix of data, information, and knowledge rows against ingestion, management, and access columns, with cross-disciplinary information integration as a third dimension. Labeled elements include the storage repository (fields, containers, folders), the information repository (attribute-based query), the knowledge repository for rules (relationships between concepts; knowledge- or topic-based query; feature-based query), the data handling system (Storage Resource Broker), model-based mediation, and encoding standards such as MCAT/HDF, GridS, XML DTD, SDLIP, XTM DTD, and KQL.]

The rectangles in the left column of the data management matrix represent mechanisms for organizing digital entities. Data may be organized as fields in a record, aggregated into containers such as tape archive (tar) files, and then sorted into folders in a logical namespace. Information may be characterized as attributes, with associated semantic meanings, that are organized in a schema. Knowledge is represented as relationships between concepts that also have associated semantic meanings and that may be organized in a relationship schema or ontology [8]. The third dimension repeats the 3 × 3 matrix, but for different disciplinary databases or for information systems from different organizations. A major challenge is to define techniques and technologies for integration across layers in the third dimension, as discussed in Section 16.4.

The rectangles in the middle column represent instances of repositories. Data are stored as files in storage repositories such as file systems and archives, and in databases as binary large objects or 'blobs'. Information is stored as metadata attributes in databases. Knowledge is stored as relationships in a knowledge repository. There are many possible choices for each type of repository.
For instance, file systems tightly couple the management of attributes about files (location on disk, length, last update time, owner) with the management of the files, and store the attributes as i-nodes intermixed with the data files. On the other hand, Data Grids need to use a logical namespace to organize and manage attributes about digital entities stored at multiple sites [9]. The attributes could be stored in a database, with the digital entities stored in a separate storage system, such as an archive [10], with referential integrity between the two. The ability to separate the management of the attributes (information) from the management of the data makes it possible for Data Grids to build uniform namespaces that span multiple storage systems and, thereby, provide location virtualization services.

The rectangles in the right column represent the standard query mechanisms used for discovery and access to each type of repository. Typically, files in storage systems are accessed by using explicit file names. The person accessing a file system is expected to know the names of all of the relevant files in the file system. Queries against the files require some form of feature-based analysis, performed by executing an application to determine if a particular characteristic is present in a file. Information in a database is queried by specifying operations on attribute–value pairs, typically written in SQL. The person accessing the database is expected to know the names and meaning of all of the attributes and the expected ranges of attribute values. Knowledge-based access may rely upon the concepts used to describe a discipline or a business. An example is the emerging topic map ISO standard [11], which makes it possible to define terms that will be used in a business, and then map from these terms to the attribute names implemented within a local database.
In this case, the knowledge relationships that are exploited are logical relationships between semantic terms. An example is the use of 'is a' and 'has a' logical relationships to define whether two terms are semantically equivalent or subordinate [12–15].

The Grid in the data management matrix that interconnects the rectangles represents the interoperability mechanisms that make up the levels of abstraction. The lower row of the Grid represents a data handling system used to manage access to multiple storage systems [16]. An example of such a system is the San Diego Supercomputer Center (SDSC) Storage Resource Broker (SRB) [17, 18]. The system provides a uniform storage system abstraction for accessing data stored in file systems, archives, and binary large objects (blobs) in databases. The system uses a logical namespace to manage attributes about the data. This is similar to the 'simple federation' mentioned in Reference [2].

The upper row of the Grid in the data management matrix represents the mediation systems used to map from the concepts described in a knowledge space to the attributes used in a collection. An example is the SDSC model-based mediation system used to interconnect multiple data collections [19]. The terms used by a discipline can be organized in a concept space that defines their semantic or logical relationships. The concept space is typically drawn as a directed graph, with the links representing the logical relationships and the nodes representing the terms. Links are established between the concepts and the attributes used in the data collections. By mapping from attribute names used in each collection to common terms defined in the concept space, it is possible to define the equivalent semantics between the attributes used to organize disparate collections.
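A minimal sketch of this kind of mediation follows. The collection, table, and attribute names are invented for illustration; this is not the SDSC mediator's interface, only the idea of mapping a shared concept onto each collection's local attribute name and rewriting a concept-level predicate accordingly.

```python
# Hypothetical concept-space mediation: one shared concept is mapped to
# the local attribute name used in each collection, and a concept-level
# query is rewritten into per-collection SQL. All names are invented.

CONCEPT_MAP = {
    "air_temperature": {"sensors_db": "temp_k", "climate_db": "t_celsius"},
}
TABLES = {"sensors_db": "readings", "climate_db": "observations"}

def concept_query(concept: str, op: str, value: str) -> dict:
    """Rewrite one concept-level predicate into SQL for each collection."""
    queries = {}
    for collection, attr in CONCEPT_MAP[concept].items():
        queries[collection] = (
            f"SELECT * FROM {TABLES[collection]} WHERE {attr} {op} {value}"
        )
    return queries

q = concept_query("air_temperature", ">", "280")
# q["sensors_db"] == "SELECT * FROM readings WHERE temp_k > 280"
```

A real mediator must also reconcile value representations (here, kelvin versus Celsius), not just attribute names, which is one reason the mappings carry semantic rather than purely syntactic content.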
Queries against the concept space are then automatically mapped into SQL queries against the data collections, enabling the discovery of digital entities without having to know the names of the data collection attributes.

The left column of the Grid in the data management matrix represents the encoding standards used for each type of digital entity. For data, the Hierarchical Data Format version 5 (HDF) [20] may be used to annotate the data models that organize bits into files. The Metadata CATalog system (MCAT) [21] provides an encoding standard for aggregating files into containers before storage in an archive. For information annotation, the XML (Extensible Markup Language) syntax [22] provides a standard markup language. The information annotated with XML can be organized into an XML schema. For relationships, there are multiple choices for a markup language. The Resource Description Framework (RDF) [23] may be used to specify a relationship between two terms. The ISO 13250 Topic Map standard [11] provides a way to specify typed associations (relationships) between topics (concepts) and occurrences (links) to attribute names in collections. Again, the relationships can be organized in an XML Topic Map Document Type Definition.

The right column of the Grid in the data management matrix represents the standard access mechanisms that can be used to interact with a repository. For data access, a standard set of operations used by Data Grids is the Unix file system access operations (open, close, read, write, seek, stat, sync). Grid data access mechanisms support these operations on storage systems. For information access, the Simple Digital Library Interoperability Protocol (SDLIP) [24] provides a standard way to retrieve results from search engines. For knowledge access, there are multiple existing mechanisms, including the Knowledge Query and Manipulation Language (KQML) [25], for interacting with a knowledge repository.
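The data-access row above can be sketched as a single set of operations that every repository type implements, in the spirit of the Unix-style operations just listed. The backends below are toy in-memory stand-ins for a file system and an archive; the class and method names are invented, not a real Grid API.

```python
# Sketch of a uniform data-access abstraction over heterogeneous
# repositories. Only read/write are shown; a fuller interface would
# mirror the Unix operations (open, close, read, write, seek, stat, sync).
from abc import ABC, abstractmethod

class Repository(ABC):
    @abstractmethod
    def write(self, name: str, data: bytes) -> None: ...
    @abstractmethod
    def read(self, name: str) -> bytes: ...

class FileSystemRepo(Repository):
    """Toy stand-in for a file system: named files."""
    def __init__(self):
        self.files = {}
    def write(self, name, data):
        self.files[name] = data
    def read(self, name):
        return self.files[name]

class ArchiveRepo(Repository):
    """Toy stand-in for an archive that aggregates entities into a container."""
    def __init__(self):
        self.container = []
    def write(self, name, data):
        self.container.append((name, data))
    def read(self, name):
        return next(d for n, d in self.container if n == name)

def store_everywhere(repos, name, data):
    """The Grid issues the same operation regardless of repository type."""
    for r in repos:
        r.write(name, data)

repos = [FileSystemRepo(), ArchiveRepo()]
store_everywhere(repos, "entity.dat", b"\x01\x02")
```

The design point is that callers program against `Repository` alone, so a new storage system is integrated by writing one adapter rather than changing every application.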
The interoperability mechanisms between information and data, and between knowledge and information, represent the levels of abstraction needed to implement a digital entity management system. Data, information, and knowledge management systems have been integrated into distributed data collections, persistent archives, digital libraries, and Data Grids. Each of these systems spans a slightly different portion of the 3 × 3 data management matrix.

Digital libraries are focused on the central row of the data management matrix and the manipulation and presentation of information related to a collection [26]. Digital libraries are now extending their capabilities to support the implementation of logical namespaces, making it possible to create a personal digital library that points to material that is stored somewhere else on the Web, as well as to material that is stored on your local disk. Web interfaces represent one form of knowledge access system (the upper right-hand corner of the Grid), in that they organize the information extraction methods and the presentation of the results for a particular discipline. Examples are portals, such as the Biology Workbench [27], that are developed to tie together interactions with multiple Web sites and applications to provide a uniform access point. Data Grids are focused on the lower row of the Grid and seek to tie together multiple storage systems and create a logical namespace [28].

The rows of the data management matrix describe different naming conventions. At the lowest level, a data model specifies a 'namespace' for describing the structure within a digital entity. An example is the METS standard [29] for describing composite objects, such as multimedia files. At the next higher level, storage systems use file names to identify all the digital entities on their physical media.
At the next higher level, databases use attributes to characterize digital entities, with the attribute names forming a common namespace. At the next higher level, Data Grids use a logical namespace to create global, persistent identifiers that span multiple storage systems and databases. Finally, at the knowledge level, concept spaces are used to span multiple Data Grids [30].

16.3.2 Virtualization and levels of data abstraction

The management of persistence of data has been discussed through two types of mechanisms: emulation of the viewing application, and migration of the digital entity to new data formats. Both approaches can be viewed as parts of a continuum, based upon the choice for data abstraction. From the perspective of emulation, software infrastructure is characterized not by the ability to annotate data, information, and knowledge but by the systems used to interconnect an application to storage and display systems. Figure 16.2 shows the levels of interoperability for emulation. Note that emulation must address not only possible changes in operating systems but also changes in storage systems and display systems. Since any component of the hardware and software infrastructure may change over time, emulation needs to be able to deal with changes not only in the operating system calls but also in the storage and the display systems.

An application can be wrapped to map the original operating system calls used by the application to a new set of operating system calls. Conversely, an operating system can be wrapped by adding support for the old system calls, either as issued by the application or as used by old storage and display systems. Finally, the storage systems can be abstracted through use of a data handling system, which maps from the protocols used by the storage systems to the protocols required by the application.
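The wrapping just described amounts to an adapter: a layer that exposes the old interface an application was written against and forwards each call to the current system's interface. The call names below (`write_file`, `put_object`, and so on) are hypothetical, chosen only to show the mapping.

```python
# Sketch of emulation as interface mapping. A legacy application keeps
# issuing its original calls; the wrapper translates each one into the
# current system's interface. All interfaces here are invented.

class CurrentStorage:
    """Today's (hypothetical) object-store interface."""
    def __init__(self):
        self._blobs = {}
    def put_object(self, key, blob):
        self._blobs[key] = blob
    def get_object(self, key):
        return self._blobs[key]

class LegacyWrapper:
    """Exposes the old file-style calls and maps them onto the new interface."""
    def __init__(self, backend: CurrentStorage):
        self._backend = backend
    def write_file(self, name, data):          # old call as issued by the app...
        self._backend.put_object(name, data)   # ...mapped to the new call
    def read_file(self, name):
        return self._backend.get_object(name)

legacy = LegacyWrapper(CurrentStorage())
legacy.write_file("design.drw", b"wing rev 7")
```

When the backend interface changes again, only the wrapper's forwarding methods need to be updated; the legacy application is untouched, which is exactly the sense in which the abstraction level is 'migrated forward in time'.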
A similar approach can be used to build a display system abstraction that maps from the protocols required by the display to the protocols used by the application. Thus, the choice of abstraction level can be varied from the application, to the operating system, to the storage and display systems.

[Figure 16.2: Levels of interoperability for emulation. The figure shows an application connected to a display system and a storage system through a display system abstraction and a storage system abstraction, with the operating system wrapped ('wrap system calls', 'add operating system call') and a digital entity abstraction above the digital entity.]

Migration puts the abstraction level at the data model. In this context, the transformative migration of a digital entity to a new encoding format is equivalent to wrapping the digital entity to create a new digital object. Once an abstraction level is chosen, software infrastructure is written to manage the associated mappings between protocols. The abstraction level typically maps from an original protocol to a current protocol. When new protocols are developed, the abstraction level must be modified to correctly interoperate with the new protocol. Both the migration approach and the emulation approach require that the chosen level of abstraction be migrated forward in time as the underlying infrastructure evolves. The difference in approach between emulation and migration is mainly concerned with the choice of the desired level of abstraction.

We can make the choice more concrete by explicitly defining abstraction levels for data and knowledge. Similar abstraction levels can be defined for information catalogs in information repositories. A Data Grid specifies virtualization services and a set of abstractions for interoperating with multiple types of storage systems. It is possible to define an abstraction for storage that encompasses file systems, databases, archives, Web sites, and essentially all types of storage systems, as shown in Figure 16.3.
Abstraction levels differentiate between the digital entity and the infrastructure that is used to store or manipulate the digital entity. In Figure 16.3, this differentiation is explicitly defined.

[Figure 16.3: Storage system abstraction for Data Grids. A 2 × 2 matrix: the abstraction for the digital entity (logical: file name; physical: data model – syntax, structure) over the files themselves, and the abstraction for the repository (logical: namespace; physical: data handling system, SRB/MCAT) over the file system or archive.]

The abstraction for the storage repository is presented in the bottom row of the 2 × 2 matrix. The abstraction for the digital entity that is put into the storage repository is presented in the upper row of the 2 × 2 matrix. The storage repository has a physical instantiation, represented as the box labeled 'File system, archive'. The abstraction for the physical repository has both a logical namespace used to reference items deposited into the storage, as well as a set of physical operations that can be performed upon the items. Similarly, the digital entity has a physical instantiation, a logical namespace, and an associated set of physical operations. The name used to label the physical entity does not have to be the same as the name used to manage the physical entity within the storage system. The set of operations that can be performed upon the digital entity is determined by its data model and is typically supported by aggregating multiple operations within the physical storage system.

This characterization of abstraction levels can be made concrete by separately considering the mechanisms used to support Data Grids that span multiple storage systems and the knowledge management systems that span multiple knowledge repositories. The storage system abstraction for a Data Grid uses a logical namespace to reference digital entities that may be located on storage systems at different sites.
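A minimal sketch of such a logical namespace follows, assuming an in-memory catalog and storage. The class and method names are invented for illustration and are not the SRB/MCAT interface; the point is that attributes live in a catalog, bits live in storage, and the logical name users see never changes when the bits move.

```python
# Sketch of a logical namespace: a catalog maps a logical name to
# attributes, including the current (site, physical name) location,
# while the bits sit in a separate storage layer. Names are invented.

class LogicalNamespace:
    def __init__(self):
        self.catalog = {}   # logical name -> attributes (the "information")
        self.storage = {}   # (site, physical name) -> bytes (the "data")

    def register(self, logical_name, site, physical_name, data, **attrs):
        self.storage[(site, physical_name)] = data
        self.catalog[logical_name] = {"site": site, "physical": physical_name, **attrs}

    def read(self, logical_name):
        """Access by logical name; callers never see the physical location."""
        entry = self.catalog[logical_name]
        return self.storage[(entry["site"], entry["physical"])]

    def relocate(self, logical_name, new_site, new_physical):
        """Copy the bits to a new site and repoint the catalog entry;
        users keep using the same logical name (location virtualization)."""
        data = self.read(logical_name)
        self.storage[(new_site, new_physical)] = data
        self.catalog[logical_name].update(site=new_site, physical=new_physical)

ns = LogicalNamespace()
ns.register("/collection/run42.dat", "siteA", "/vol3/x91.dat", b"results", owner="moore")
ns.relocate("/collection/run42.dat", "siteB", "/tape/arch/x91")
```

Referential integrity between catalog and storage is what a real system must enforce; here it holds trivially because both live in one process.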
The logical namespace provides a global identifier and maintains the mapping to the physical file names. Each of the data, information, and knowledge abstractions in a Data Grid introduces a new namespace for characterizing digital entities. For example, consider the following levels of naming:

• Document level: definition of the structure of multimedia or compound documents through use of an Archival Information Packet (Open Archival Information System [31]) [...]

32. Allcock, W. E. (2002) Argonne National Laboratory.
33. Grid Forum Remote Data Access Working Group, http://www.sdsc.edu/GridForum/RemoteData/
34. Moore, R. (2002) Persistent Archive Concept Paper, Global Grid Forum 5, Edinburgh, Scotland, July 21–26, 2002.
35. European Data Grid, http://eu-datagrid.web.cern.ch/eu-datagrid/
36. Scientific Data Management in the Environmental Molecular Sciences [...]

[...] use, managing data collections that contain millions of digital entities and aggregate terabytes in size. It is noteworthy that, across the many implementations, a common approach is emerging. There are a number of important projects in the areas of Data Grids and Grid toolkits. They include the SRB Data Grid described here; the European DataGrid replication environment [...] the European DataGrid and the Particle Physics Data Grid, augmented with an additional product of the European DataGrid for storing and retrieving metadata in relational databases, called Spitfire, and other components (EDG); the Scientific Data Management [36] Data Grid from Pacific Northwest Laboratory (PNL) (SDM); the Globus toolkit [37]; the SAM (Sequential Access using Metadata) Data Grid from Fermi [...]
[...] varied. Although all of the Grids provided some form of an error message, the number of error messages varied from less than 10 to over 1000 for the SRB. The Grids statically tuned the network parameters (window size and buffer size) for transmission over wide-area networks. Most of the Grids provide interfaces to the GridFTP transport protocol. The most common access APIs to the Data Grids are a C++ I/O library [...] interface.

The Grids are implemented as distributed client-server architectures. Most of the Grids support federation of the servers, enabling third-party transfer. All of the Grids provide access to storage systems located at remote sites, including at least one archival storage system. The Grids also currently use a single catalog server to manage the logical namespace attributes. All of the Data Grids provide [...]

[...] and Wan, M. (1999) Data intensive computing, in Foster, I. and Kesselman, C. (eds) The Grid: Blueprint for a New Computing Infrastructure. San Francisco, CA: Morgan Kaufmann Publishers.
3. Raman, V., Narang, I., Crone, C., Haas, L., Malaika, S., Mukai, T., Wolfson, D. and Baru, C. (2002) Data Access and Management Services on the Grid, Technical Report, submission to Global Grid Forum 5, Edinburgh, Scotland [...]

[...] widely used knowledge support utilities.

16.3.3 Data Grid infrastructure

The Data Grid community is developing a consensus on the fundamental capabilities that should be provided by Data Grids [33]. In addition, the Persistent Archive Research Group of the Global Grid Forum [34] is developing a consensus on the additional capabilities that are needed in Data Grids to support the implementation of a persistent [...]
[...] the Joint Center for Structural Genomics (Data Grid), and the National Institutes of Health Biomedical Informatics Research Network (BIRN) (Data Grid). The seven data-management systems listed above included not only Data Grids but also distributed data collections, digital libraries, and persistent archives. However, at the core of each system was a Data Grid that supported access to distributed data [...]

[...] many of the Grids also support asynchronous registration of attributes. Most of the Grids support synchronous replica creation and provide data access through parallel I/O. The Grids check transmission status and support data transport restart at the application level. Writes to the system are done synchronously, with standard error messages returned to the user. The error messages provided by the Grids to [...]

[...] A Data Grid defines and implements a standard set of operations for the storage systems that are managed by the Grid. In doing so, it needs to develop 'wrappers' for these systems to ensure that all systems can respond to Grid requests. As an example, in the SRB, the set of operations is an extension of those provided by the Unix file system. A similar set of operations is being defined for the XIO Grid [...]
