Goel & Buyya
Copyright © 2007, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.

Replication can be done either at the storage-array level or at the host level. In array-level replication, data is copied from one disk array to another. Thus, array-level replication is mostly homogeneous. The arrays are linked by a dedicated channel. Host-level replication is independent of the disk array used. Since the arrays used in different hosts can differ, host-level replication has to deal with heterogeneity. Host-level replication uses TCP/IP (transmission-control protocol/Internet protocol) for data transfer. Replication in a SAN can also be divided into two main categories based on the mode of replication: (a) synchronous and (b) asynchronous, as discussed earlier.

Survey of Distributed Data-Storage Systems and Replication Strategies Used

A brief explanation of the systems in Table 3 follows.

Arjuna (Parrington et al., 1995) supports both active and passive replication. Passive replication is like primary-copy replication, and all updates are redirected to the primary copy. The updates can be propagated after the transaction has committed. In active replication, mutual consistency is maintained and the replicated object can be accessed at any site.

Coda (Kistler & Satyanarayanan, 1992) is a network-distributed file system. A group of servers can fulfil the client's read request. Updates are generally applied to all participating servers. Thus, it uses a ROWA protocol. The motivation behind this design was to increase availability, so that if one server fails, other servers can take over and the request can be satisfied without the client's knowledge.

The Deceit (Siegel et al., 1990) distributed file system is implemented on top of the Isis (Birman & Joseph, 1987) distributed system. It provides full network-file-system (NFS) capability with concurrent reads and writes.
It uses write tokens and stability notification to control file replicas (Siegel et al.). Deceit provides variable file semantics that offer a range of consistency guarantees (from no consistency to semantics consistency). However, the main focus of Deceit is not on consistency, but on providing variable file semantics in a replicated NFS server (Triantafillou, 1997).

Harp (Liskov, 1991) uses a primary-copy replication protocol. Harp is a server protocol and there is no support for client caching (Triantafillou & Nelson, 1997). In Harp, file systems are divided into groups. For each group, a primary site, a set of secondary sites, and a set of witness sites are designated. If the primary site is unavailable, a new primary site is chosen from the secondary sites. If not enough primary and secondary sites are available, a witness is promoted to act as a secondary site. The data from such a witness are backed up on tape so that, if it is the only surviving site, the data can still be retrieved. Read and write operations follow the typical ROWA protocol.

Mariposa (Sidell et al., 1996) was designed at the University of California (Berkeley) in 1993 and 1994. The basic design principles behind Mariposa were the scalability of distributed data servers (up to 10,000) and the local autonomy of sites. Mariposa implements an asynchronous replica-control protocol; thus, distributed data may be stale at certain sites. The updates are propagated to other replicas within a time limit. Therefore, it could be implemented in systems where applications can afford stale data within a specified time window.
Mariposa uses an economic approach in replica management, where a site buys a copy from another site and negotiates to pay for update streams (Sidell et al.).

Oracle (Baumgartel, 2002) is a successful commercial company that provides data-management solutions. Oracle provides a wide range of replication solutions. It supports basic and advanced replication. Basic replication supports read-only queries, while advanced replication supports update operations. Advanced replication supports synchronous and asynchronous replication for update requests. It uses 2PC for synchronous replication. 2PC ensures that either all cohorts of the distributed transaction complete successfully or the completed part of the transaction is rolled back.

Pegasus (Ahmed et al., 1991) is an object-oriented DBMS designed to support multiple heterogeneous data sources. It supports Object Structured Query Language (SQL). Pegasus maps a heterogeneous object model to a common Pegasus object model. Pegasus supports global consistency in replicated environments and respects integrity constraints. Thus, Pegasus supports synchronous replication.

Sybase (Sybase FAQ, 2003) provides the Sybase replication server to implement replication. Sybase supports the replication of stored procedure calls. It implements replication at the transaction level and not at the table level (Helal, Hedaya, & Bhargava, 1996). Only the rows affected by a transaction at the primary site are replicated to remote sites. The log-transfer manager (LTM) passes the changed records to the local replication server. The local replication server then communicates the changes to the appropriate distributed replication servers. Changes can then be applied to the replicated rows. The replication server ensures that all transactions are executed in the correct order to maintain the consistency of data. Sybase mainly implements asynchronous replication.
To implement synchronous replication, the user should add his or her own code and a 2PC protocol (http://www.dbmsmag.com/9705d15.html).

Peer-to-Peer Systems

P2P networks are a type of overlay network that uses the computing power and bandwidth of the participants in the network rather than concentrating it in a relatively small number of servers (Oram, 2001). The term peer-to-peer reflects the fact that all participants have equal capability and are treated equally, unlike in the client-server model, where clients and servers have different capabilities. Some P2P networks use the client-server model for certain functions (e.g., Napster uses the client-server model for searching; Oram). Those networks that use the P2P model for all functions, for example, Gnutella (Oram), are referred to as pure P2P systems. A brief classification of P2P systems is given below.

Types of Peer-to-Peer Systems

Today, P2P systems produce a large share of Internet traffic. A P2P system relies on the computing power and bandwidth of participants rather than on central servers. Each host has a set of neighbours.

P2P systems are classified into two categories.

1. Centralised P2P systems: Centralised P2P systems have a central directory server to which the users submit requests, as is the case for Napster (Oram, 2001). The central directory keeps information regarding file locations at different peers. After the files are located, the peers communicate among themselves. Clearly, centralised systems have a single point of failure, and they scale poorly when the number of clients ranges in the millions.

2. Decentralised P2P systems: Decentralised P2P systems do not have any central servers.
Hosts form an ad hoc network among themselves on top of the existing Internet infrastructure, which is known as the overlay network. Based on two factors, (a) the network topology and (b) the file location, decentralised P2P systems are classified into the following two categories.

   (i) Structured decentralised: In a structured architecture, the network topology is tightly controlled and the file locations are such that they are easier to find (i.e., not at random locations). The structured architecture can also be classified into two categories: (a) loosely structured and (b) highly structured. Loosely structured systems place the file based on some hints, as with Freenet (Oram, 2001). In highly structured systems, the file locations are precisely determined with the help of techniques such as hash tables.

   (ii) Unstructured: Unstructured systems do not have any control over the network topology or the placement of the files over the network. Examples of such systems include Gnutella, KaZaA, and so forth (Oram, 2001). Since there is no structure, a node locates a file by querying its neighbours.

Flooding is the most common query method used in such an unstructured environment. Gnutella uses the flooding method to query. In unstructured systems, since the P2P network topology is unrelated to the location of data, the set of nodes receiving a particular query is unrelated to the content of the query.

The most general P2P architecture is the decentralised, unstructured architecture. Most research in P2P systems has focused on architectural issues, search techniques, legal issues, and so forth. Very limited literature is available for unstructured P2P systems.
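As a rough illustration of the flooding query described above, the following Python sketch propagates a query hop by hop through an overlay and counts the messages sent. The function name `flood_query`, the adjacency-dictionary representation of the overlay, the TTL (time-to-live) bound, and the rule that a node silently drops a query it has already seen are all assumptions made for this example; they are not taken from the Gnutella specification.

```python
from collections import deque


def flood_query(neighbours, source, holders, ttl):
    """Flood a query from `source` through an unstructured overlay.

    neighbours: dict mapping each node to a list of its neighbour nodes
    holders:    set of nodes that store a replica of the requested file
    ttl:        maximum number of hops the query may travel

    Returns (hops_to_nearest_replica_or_None, messages_sent).
    Hypothetical sketch: flooding continues until the TTL expires, because
    nodes on other branches cannot know that a replica was already found.
    """
    seen = {source}
    frontier = deque([(source, None, 0)])  # (node, sender, hops so far)
    messages = 0
    first_hit = 0 if source in holders else None

    while frontier:
        node, sender, hops = frontier.popleft()
        if hops == ttl:
            continue  # TTL exhausted: this node does not forward further
        for peer in neighbours[node]:
            if peer == sender:
                continue  # never send the query straight back
            messages += 1
            if peer in seen:
                continue  # receiver drops a duplicate query
            seen.add(peer)
            if peer in holders and first_hit is None:
                first_hit = hops + 1
            frontier.append((peer, node, hops + 1))

    return first_hit, messages
```

Calling the function with successive TTL values on the same overlay gives a feel for how the message count grows with each additional hop, and placing an extra node in `holders` shows how a nearby replica shortens the distance to the first hit.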
Replication in unstructured P2P systems can improve the performance of the system, as the desired data can be found near the requesting node. Especially in flooding algorithms, reducing the search even by one hop can drastically reduce the number of messages in the system. Table 4 shows different P2P systems.

Table 4. Examples of different types of P2P systems

Type                        Example
Centralised                 Napster
Decentralised structured    Freenet (loosely structured)
                            Distributed hash table (DHT) (highly structured)
                            FastTrack
                            eDonkey
Decentralised unstructured  Gnutella

A challenging problem in unstructured P2P systems is that the network topology is independent of the data location. Thus, the nodes receiving queries can be completely unrelated to the content of the query. Consequently, the receiving nodes also do not have any idea of where to forward the request to locate the data quickly. To minimise the number of hops before the data are found, data can be proactively replicated at more than one site.

Replication Strategies in P2P Systems

Based on Size of Files (Granularity)

1. Full-file replication: Full files are replicated at multiple peers based upon which node downloads the file. This strategy is used in Gnutella. This strategy is simple to implement. However, replicating larger files as one single file can be cumbersome in terms of space and time (Bhagwan, Moore, Savage, & Voelker, 2002).

2. Block-level replication: This replication divides each file into an ordered sequence of fixed-size blocks. This is also advantageous if a single peer cannot store a whole file. Block-level replication is used by eDonkey. A limitation of block-level replication is that, during file downloading, enough peers must be available to assemble and reconstruct the whole file.
Even if a single block is unavailable, the file cannot be reconstructed. To overcome this problem, erasure codes (ECs), such as Reed-Solomon (Pless, 1998), are used.

3. Erasure-code replication: This provides the capability for original files to be reconstructed from fewer available blocks. For example, $k$ original blocks can be reconstructed from $l$ ($l$ is close to $k$) coded blocks taken from a set of $ek$ ($e$ is a small constant) coded blocks (Bhagwan et al., 2002). In Reed-Solomon codes, the source data are passed through a data encoder, which adds redundant bits (parity) to the pieces of data. After the pieces are retrieved later, they are sent through a decoder process. The decoder attempts to recover the original data even if some blocks are missing. Adding EC to block-level replication can improve the availability of the files because it can tolerate the unavailability of certain blocks.

Figure 5. Classification of replication schemes in P2P systems
[Tree diagram. Root: replication scheme in P2P.
 Based on file granularity: full file (e.g., Gnutella); block level (e.g., Freenet); erasure codes.
 Based on replica distribution: uniform distribution; block-level distribution; square-root distribution.
 Based on replica-creation strategy: owner or requester site (e.g., Gnutella); path replication (e.g., Freenet); random.]

Based on Replica Distribution

The following need to be defined. Consider that each file $i$ is replicated on $r_i$ nodes. Let the total number of files (including replicas) in the network be denoted as $R$ (Cohen & Shenker, 2002):

$$R = \sum_{i=1}^{m} r_i,$$

where $m$ is the number of individual files or objects.

(i) Uniform: The uniform replication strategy replicates everything equally. Thus, from the above equation, the replication distribution for the uniform strategy can be defined as follows:

$$r_i = R / m.$$
(ii) Proportional: The number of replicas is proportional to their popularity. Thus, a query for a popular data item is more likely to find the data close to the site where the query was submitted.

$$r_i \propto q_i,$$

where $q_i$ is the relative popularity of the file or object (in terms of the number of queries issued for the $i$th file), with

$$\sum_{i=1}^{m} q_i = 1.$$

If all objects were equally popular, then $q_i = 1/m$. However, results have shown that object popularity follows a Zipf-like distribution in systems such as Napster and Gnutella. Thus, the query distribution is as follows:

$$q_i \propto 1 / i^{\alpha},$$

where $\alpha$ is close to unity.

(iii) Square root: The number of replicas of a file $i$ is proportional to the square root of the query distribution $q_i$.

$$r_i \propto \sqrt{q_i}$$

The necessity of square-root replication is clear from the following discussion. The uniform and proportional strategies have been shown to have the same search space, as follows.

$m$: number of files
$n$: number of sites
$r_i$: number of replicas for the $i$th file
$R$: total number of files (including replicas)

The average search size for file $i$ is $A_i = n / r_i$. Hence, the overall average search size is $A = \sum_i q_i A_i$. The average number of files per site is $R/n$; the derivation below takes $n/R = 1$.

Following the above equations, the average search size for the uniform replication strategy is as follows. Since $r_i = R/m$, the following equations are true.

$$A = \sum_i q_i \frac{n}{r_i} \quad \text{(replacing the value of } A_i\text{)}$$

$$A = \sum_i q_i \frac{mn}{R}$$

$$A = m \quad \text{(as } \sum_{i=1}^{m} q_i = 1 \text{ and } n/R = 1\text{)} \qquad (1)$$

The average search size for the proportional replication strategy is as follows. Since $r_i = R q_i$ (as $r_i \propto q_i$ and $\sum_{i=1}^{m} q_i = 1$), the following are true.
$$A = \sum_i q_i \frac{n}{r_i} \quad \text{(replacing the value of } A_i\text{)}$$

$$A = \sum_i q_i \frac{n}{R q_i}$$

$$A = m \quad \text{(as } \sum_{i=1}^{m} q_i = 1,\ n/R = 1, \text{ and } \sum_{i=1}^{m} q_i / q_i = m \text{ for proportional replication)} \qquad (2)$$

It is clear from Equations 1 and 2 that the average search size is the same in the uniform and proportional replication strategies. It has also been shown in the literature (Cohen & Shenker, 2002) that the average search size is minimised when $r_i \propto \sqrt{q_i}$, which, with $n/R = 1$ as above, gives

$$A_{\text{optimal}} = \Big( \sum_i \sqrt{q_i} \Big)^2.$$

This is known as square-root replication.

Based on Replica-Creation Strategy

1. Owner replication: The object is replicated only at the requester node once the file is found. For example, Gnutella (Oram, 2001) uses owner replication.

2. Path replication: The file is replicated at all the nodes along the path through which the request is satisfied. For example, Freenet uses path replication.

3. Random replication: The random-replication algorithm creates the same number of replicas as path replication. However, it distributes the replicas in a random order rather than following the topological order. It has been shown in Lv, Cao, Cohen, Li, and Shenker (2002) that the factor of improvement in path replication is close to 3, and in random replication, the improvement factor is approximately 4.

Figure 5 summarises the classification of replication schemes in P2P systems, as discussed above.

Replication Strategy for Read-Only Requests

Replica Selection Based on Replica Location and User Preference

The replicas are selected based on users' preferences and the replica location. Vazhkudai, Tuecke, and Foster (2001) propose a strategy that uses Condor's ClassAds (classified advertisements; Raman, Livny, & Solomon, 1998) to rank the sites' suitability in the storage context.
The application requiring access to a file presents its requirement to the broker in the form of ClassAds. The broker then searches for, matches, and accesses the file that meets the requirements published in the ClassAds.

Dynamic replica-creation strategies discussed in Ranganathan and Foster (2001) are as follows:

1. Best client: Each node maintains a record of the access history for each replica, that is, which data item is being accessed by which site. If the access frequency of a replica exceeds a threshold, a replica is created at the requester site.

2. Cascading replication: This strategy can be used in the tiered architecture discussed above. Instead of replicating the data at the best client, the replica is created at the next level on the path to the best client. This strategy evenly distributes the storage space, and other lower-level sites are in close proximity to the replica.

3. Fast spread: Fast spread replicates the file in each node along the path to the best client. This is similar to path replication in P2P systems.

Since the storage space is limited, there must be an efficient method to delete files from the sites. The replacement strategy proposed in Ranganathan and Foster (2001) deletes the most unpopular files once the storage space of the node is exhausted. The age of the file at the node is also considered when deciding the unpopularity of the file.

Economy-Based Replication Policies

The basic principle behind economy-based policies is to use the socioeconomic concepts of emergent marketplace behaviour, where local optimisation leads to global optimisation. This could be thought of as an auction, where each site tries to buy a data item to create the replica at its own node and generate revenue in the future by selling the replica to other interested nodes.
Various economy-based protocols, such as those in Carman, Zini, Serafini, and Stockinger (2002) and Bell, Cameron, Carvajal-Schiaffino, Millar, Stockinger, and Zini (2003), have been proposed, which dynamically replicate and delete the files based on the future return on the investment. Bell et al. use a reverse-auction protocol to determine where the replica should be created.

For example, the following rule is used in Carman et al. (2002). A file request (FR) is considered to be an n-tuple of the form

$$FR_i = \langle t_i, o_i, g_i, n_i, r_i, s_i, p_i \rangle,$$

where the following are true.

$t_i$: time stamp at which the file was requested
$o_i$, $g_i$, and $n_i$: together represent the logical file being requested ($o_i$ is the virtual organisation to which the file belongs, $g_i$ is the group, and $n_i$ is the file identification number)
$r_i$ and $s_i$: represent the element requesting and supplying the file, respectively
$p_i$: represents the price paid for the file (price could be virtual money)

To maximise the profit, the future value of the file is defined over the average lifetime of the file storage $T_{av}$.

$$V(F, T_k) = \sum_{i=k+1}^{k+n} p_i \, \delta(F, F_i) \, \delta(s, s_i),$$

where $V$ represents the value of the file, $p_i$ represents the price paid for the file, $s$ is the local storage element, and $F$ represents the triple $(o, g, n)$. $\delta$ is a function that returns 1 if the arguments are equal and 0 if they differ. The investment cost is determined by the difference in cost between the price paid and the expected price if the file is sold immediately.

As the storage space of the site is limited, the choice of whether it is worth deleting an existing file must be made before replicating a file. Thus, the investment decision between purchasing a new file and keeping an old file depends on the change in profit between the two strategies.
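The value function above can be made concrete with a short Python sketch. It evaluates $V(F, T_k)$ over a given sequence of file requests; in practice the future requests are not known and would have to be predicted, so the sequence passed in here is assumed to be a forecast or a replayed log. The names `FileRequest`, `delta`, and `file_value`, and the choice of a Python list whose element 0 holds $FR_1$, are illustrative assumptions rather than part of the protocol in Carman et al. (2002).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRequest:
    """One file request FR_i = <t, o, g, n, r, s, p> (hypothetical container)."""
    t: float  # time stamp of the request
    o: str    # virtual organisation owning the file
    g: str    # group
    n: int    # file identification number
    r: str    # requesting element
    s: str    # supplying (storage) element
    p: float  # price paid, possibly in virtual money


def delta(a, b):
    """Return 1 if the arguments are equal and 0 if they differ."""
    return 1 if a == b else 0


def file_value(requests, logical_file, storage, k, n):
    """Evaluate V(F, T_k) = sum_{i=k+1}^{k+n} p_i * delta(F, F_i) * delta(s, s_i).

    requests:     list in which requests[0] is FR_1 (assumed forecast or log)
    logical_file: the triple (o, g, n) identifying F
    storage:      the local storage element s
    """
    window = requests[k:k + n]  # FR_{k+1} .. FR_{k+n}
    return sum(
        fr.p * delta(logical_file, (fr.o, fr.g, fr.n)) * delta(storage, fr.s)
        for fr in window
    )
```

Only requests for the same logical file that are served by the local storage element contribute to the sum, which is why both delta factors appear. A site could compare this value for a candidate file with that of a file it already stores when weighing the investment decision described above.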
Figure 6. A tiered or hierarchical architecture of a data grid for the particle physics accelerator at the European Organization for Nuclear Research (CERN)
[Diagram: a data-production site (e.g., CERN) at the top tier, followed by tiers of regional, local, and participating sites, with end users at the lowest tier.]
Parrington, G. D., Shrivastava, S. K., Wheater, S. M., & Little, M. C. (1995). The design and implementation of Arjuna. USENIX Computing Systems Journal, 8(2), 255-308.
Pless, V. (1998). Introduction.