
Building Secure and Reliable Network Applications, part 9



DOCUMENT INFORMATION

Basic information

Pages: 51
Size: 363.74 KB

Content

To this author, the implication is that while both models introduce reliability into distributed systems, they deal with very different reliability goals: recoverability on the one hand, and availability on the other. While the models can be integrated so that one could use transactions within a virtual synchrony context and vice versa, there seems to be little hope that they could be merged into a single model that would provide all forms of reliability in a single, highly transparent environment. Integration and co-existence are for this reason a more promising direction, and seem to be the one favored by industry and research groups.

21.5 Weak Consistency Models

There are some applications in which one desires most aspects of the transactional model, but where serializability in the strict sense is not practical to implement. Important among these are distributed systems in which a database must be accessed from a remote node that is sometimes partitioned away from the system. In this situation, even if the remote node has a full copy of the database, it is potentially limited to read-only access. Even worse, the impossibility of building a non-blocking commit protocol for partitioned settings potentially prevents these read-only transactions from executing on the most current state of the database, since a network partitioning failure can leave a commit protocol in the "prepared" state at the remote site.

In practice, many distributed systems treat remote copies of databases as a form of second-class citizen. Such databases are often updated by periodic transfer of the log of recently committed transactions, and are used only for read-only queries. Update transactions execute on a primary copy of the database. This approach avoids the need for a multi-phase commit but has limited opportunity to benefit from the parallelism inherent in a distributed architecture. Moreover, the delay before updates reach the remote copies may be substantial, so remote transactions will often execute against a stale copy of the database, with outcomes that may be inconsistent with the external environment in obvious ways. For example, a remote banking system may fail to reflect a recent deposit for hours or days.

In the subsections that follow we briefly present some of the mechanisms that have been proposed as extensions to the transactional model to improve its usefulness in settings such as these.

21.5.1 Epsilon serializability

Originally proposed by Pu, this is a model in which a pre-agreed strategy is used to limit the possible divergence between a primary database and its remote replicas [Pu93]. The "epsilon" refers to the case where the database contains numeric data, and it is agreed that any value read by a transaction is within ε of the correct one. For example, suppose that a remote transaction is executed to determine the current value of a bank balance, and the result obtained is $500. If ε = $100, we can conclude that the actual balance in the database (in the primary version) is no less than $400 and no more than $600. The benefit of this approach is that it relaxes the need to run costly synchronization protocols between remote copies of a database and the primary: such protocols are only needed if an update might violate the constraint.

Continuing our example, suppose that we know that there are two replicas and one primary copy of the database. We can now allocate ranges within which these copies can independently perform update operations without interacting with one another to confirm that it is safe to do so. Thus, the primary copy and each replica might be limited to a maximum cumulative update of $50 (larger updates would require a standard locking protocol). Even if the primary and one replica each perform the maximum increment of $50 to the balance, the remaining replica still sees a value that is within $100 of the true value, and this remains true for any update that the third replica might undertake. In general, the rule for this model is that the minimum and maximum cumulative updates done by the "other copies" must be bounded by ε to ensure that a given copy will see a value within ε of the true one.
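To make the budget idea concrete, here is a minimal sketch (not Pu's protocol; the class, the two-replica configuration, and the dollar figures follow the example above and are otherwise assumptions): each copy applies an update locally only while its cumulative unsynchronized change stays within its allotted share of ε, and otherwise falls back to the normal locking protocol.

```python
class EpsilonCopy:
    """One copy of a numeric item under epsilon serializability (illustrative sketch)."""

    def __init__(self, name, value, budget):
        self.name = name
        self.value = value        # this copy's local view of the item
        self.budget = budget      # bound on unsynchronized cumulative local updates
        self.pending = 0.0        # net local updates not yet synchronized with the other copies

    def update(self, delta):
        """Apply a local update if the budget allows it; otherwise the caller must
        fall back to the normal protocol that locks and synchronizes all copies."""
        if abs(self.pending + delta) > self.budget:
            return False
        self.pending += delta
        self.value += delta
        return True


# Epsilon = $100, split as a $50 budget per copy: a read at any one copy is then
# within $100 (the sum of the other two budgets) of the true balance.
primary = EpsilonCopy("primary", 500.0, 50.0)
replica1 = EpsilonCopy("replica1", 500.0, 50.0)
replica2 = EpsilonCopy("replica2", 500.0, 50.0)

assert primary.update(50.0)        # fits within the local budget
assert replica1.update(50.0)       # fits within the local budget
assert not replica2.update(75.0)   # exceeds the budget: escalate to the locking protocol
print(replica2.value)              # 500.0, within $100 of the true balance of 600.0
```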
21.5.2 Weak and strong consistency in partitioned database systems

During periods when a database system may be completely disconnected from other replicas of the same database, we will in general be unable to determine a safe serialization order for transactions originating at that disconnected copy. Suppose that we want to implement a database system for use by soldiers in the field, where communication may be severely disrupted. The database could be a map showing troop positions, depots, the state of roads and bridges, and major targets. In such a situation, one can imagine transactions of varying degrees of urgency. A fairly routine transaction might be to update the record showing where an enemy outpost is located, indicating that there has been "no change" in the status of the outpost. At the other extreme would be an emergency query seeking to locate the closest medic or supply depot capable of servicing a given vehicle. Serializability considerations underlie the consistency and correctness of the real database, but one would not necessarily want to wait for serializability to be guaranteed before making an "informed guess" about the location of a medical team. Thus, even if a transactional system requires time to achieve a completely stable ordering on transactions, there may be cases in which one would want it to process at least certain classes of transactions against the information presently available to it.

In his doctoral thesis, Amir addressed this problem using the Transis system as a framework within which he constructed a working solution [Ami95; see also CGS85, AKA93, TTPD95]. His basic approach was to consider only transactions that can be represented as a single multicast to the database, which is understood to be managed by a process group of servers. (This is a fairly common assumption in transactional systems; in fact, most transactional applications originate with a single database operation that can be represented as a multicast or remote procedure call.) Amir's approach was to use abcast (the dynamically uniform or "safe" form) to distribute update transactions among the servers, which were designed to use a serialization order that is deterministically related to the incoming abcast order. Queries were implemented as local transactions requiring no interaction with remote database servers.

As we saw earlier, dynamically uniform abcast protocols must wait during partitioning failures in all but the primary component of the partitioned system. Thus, Amir's approach is subject to blocking at a site that has become partitioned away from the main system. Such a site may, in the general case, have a queue of undeliverable and partially ordered abcasts that are waiting either for a final determination of their relative ordering or for a guarantee that dynamic uniformity will be achieved. Each such abcast corresponds to an update transaction that could change the database state, perhaps in an order-sensitive way, and that cannot safely be applied until this information is known.

What Amir does next depends on the type of request presented to the system. If a request is urgent, it can be executed either against the last known completely safe state (ignoring these incomplete transactions) or against an approximation to the correct and current state (by applying these transactions, evaluating the database query, and then aborting the entire transaction). Finally, a normal update can simply wait until the safe and global ordering for the corresponding transaction is known, which may not occur until communication has been reestablished with remote sites.
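This request-handling policy can be sketched as follows (a toy model, not Amir's Transis implementation; the class and method names are invented). Updates arrive as totally ordered multicasts; a partitioned site accumulates tentative ones; an urgent query is answered either from the last safe state or from a throwaway copy with the tentative updates applied and then discarded.

```python
import copy

class DisconnectedReplica:
    """Toy model of a database replica that may be partitioned away (illustration only)."""

    def __init__(self):
        self.safe_state = {}       # state reflecting only safely delivered (uniform) abcasts
        self.tentative = []        # updates received but not yet known to be safely ordered
        self.normal_queue = []     # normal updates waiting for the global ordering

    def on_safe_abcast(self, update):
        """An update whose global order and dynamic uniformity are established."""
        update(self.safe_state)

    def on_tentative_abcast(self, update):
        """An update trapped by a partition: its ordering and uniformity are still unknown."""
        self.tentative.append(update)

    def urgent_query(self, query, use_tentative):
        if not use_tentative:
            return query(self.safe_state)          # last known completely safe state
        scratch = copy.deepcopy(self.safe_state)   # approximate state, then discard it
        for update in self.tentative:
            update(scratch)
        return query(scratch)                      # evaluated, then "aborted" (scratch dropped)

    def normal_update(self, update):
        self.normal_queue.append(update)           # blocks until the partition heals


# Example: a map record is updated tentatively while the site is partitioned.
replica = DisconnectedReplica()
replica.on_safe_abcast(lambda db: db.update(outpost="active"))
replica.on_tentative_abcast(lambda db: db.update(outpost="destroyed"))
print(replica.urgent_query(lambda db: db["outpost"], use_tentative=False))  # active
print(replica.urgent_query(lambda db: db["outpost"], use_tentative=True))   # destroyed
```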
Amir's work is not the only effort to have arrived at this solution to the problem. Working independently, a group at Xerox Parc developed a very similar approach to disconnected availability in the Bayou system [TTPD95]. Their work is not expressed in terms of process groups and totally ordered, dynamically uniform multicast, but the key ideas are the same. In other ways, the Bayou system is more sophisticated than the Transis-based one: it includes a substantial amount of constraint checking and automatic correction of inconsistencies that can creep into a database if urgent updates are permitted in a disconnected mode. Bayou is designed to support distributed management of calendars and scheduling of meetings in large organizations, a time-consuming activity that often requires approximate decision making because some desired participants may be on the road or otherwise unavailable at the time a meeting must be scheduled.

21.5.3 Transactions on multi-database systems

The Phoenix system [Mal96], developed by Malloth, Guerraoui, Raynal, Schiper and Wilhelm, adopts a similar philosophy but considers a different aspect of the problem. Starting with the same model as is used in Amir's work and in Bayou, where each transaction is initiated from a single multicast to the database servers, which form a process group, this effort asked how transactions that operate upon multiple objects could be accommodated. Such considerations led them to propose a generalized multi-group atomic broadcast that is totally ordered, dynamically uniform, and failure atomic over the multiple process groups to which it is sent [SR96]. The point of using this approach is that if a database is represented in fragments that are managed by separate servers, each of which is implemented as a process group, a single multicast would not otherwise suffice to perform the desired updates. The Phoenix protocol used for this purpose is similar to the extended three-phase commit developed by Keidar for the Transis system, and is considerably more efficient than sending multiple concurrent and asynchronous multicasts to the process groups and then running a multi-phase commit on the full set of participants. Moreover, whereas such a multi-step protocol would leave serious unresolved questions insofar as the view-synchronous addressing aspects of the virtual synchrony model are concerned, the Phoenix protocol can be proved to guarantee this property within all of the destination groups.
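To see why a multi-group primitive is needed, consider the following deliberately naive sketch (this is not the Phoenix protocol: it funnels every multi-group transaction through a single global sequencer, a bottleneck that Phoenix is designed to avoid). If each fragment's group ordered its traffic independently, two transactions spanning the same two groups could be applied in opposite orders at the two fragments; a shared sequence number removes that ambiguity.

```python
from itertools import count

class MultiGroupSequencer:
    """Naive global sequencer assigning one total order across all process groups."""

    def __init__(self, groups):
        self.groups = groups              # group name -> list of per-server delivery queues
        self.next_seq = count(1)

    def abcast(self, destinations, transaction):
        seq = next(self.next_seq)         # a single order shared by every destination group
        for name in destinations:
            for server_queue in self.groups[name]:
                server_queue.append((seq, transaction))


# Two database fragments, each managed by its own group of two servers.
groups = {"accounts": [[], []], "audit": [[], []]}
sequencer = MultiGroupSequencer(groups)
sequencer.abcast(["accounts", "audit"], "T1: debit and log it")
sequencer.abcast(["accounts", "audit"], "T2: credit and log it")

# Every server in every destination group delivers T1 before T2; with independent
# per-group ordering, the two fragments could have disagreed on this order.
for name, servers in groups.items():
    for queue in servers:
        print(name, [seq for seq, _ in queue])    # accounts [1, 2], audit [1, 2]
```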
21.5.4 Linearizability

Herlihy and Wing studied consistency issues from a more theoretical perspective [HW90]. In a paper on the linearizability model of database consistency, they suggested that object-oriented systems may find the full nested serializability model overly constraining, but can still benefit from some forms of ordering guarantee. A nested execution is linearizable if the invocations on each object, considered independently of other objects, leave that object in a state that could have been reached by some sequential execution of the same operations, in an order consistent with the causal ordering on the original invocation sequence. In other words, this model says that an object may reorder the operations upon it and interleave their execution, provided that it behaves as if it had executed the operations one by one, in some order consistent with the (causal) order in which the invocations were presented to it.

Linearizability may seem like a very simple and obvious idea, but there are many distributed systems in which servers might not be guaranteed to respect this property. Such servers can allow concurrent transactions to interfere with one another, or may reorder operations in ways that violate intuition (for example, by executing a read-only operation on a state that is sufficiently old to be lacking some updates that were issued before the read by the same source). At the same time, notice that traditional serializability can be viewed as an extension of linearizability (although serializability does not require that the causal order of invocations be respected, few database systems intentionally violate this property). Herlihy and Wing argue that if designers of concurrent objects at least prove them to achieve linearizability, the objects will behave in an intuitive and consistent way when used in a complex distributed system; should one then wish to go further and superimpose a transactional structure over such a system, doing so simply requires stronger concurrency control. This author is inclined to agree: linearizability seems like an appropriate "weakest" consistency guarantee for the objects used in a distributed environment. The Herlihy and Wing paper develops this idea by presenting proof rules for demonstrating that an object implementation achieves linearizability; however, we will not discuss this issue here.
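To make the definition concrete, the following brute-force checker decides whether a small history over a single read/write register is linearizable (an illustrative toy, not Herlihy and Wing's proof technique; it enforces only each client's own issue order, a simplification of the causal order used in the definition, and assumes the register is initially 0).

```python
from itertools import permutations

def legal_register_run(ops):
    """True if executing ops sequentially gives each read the value it reported."""
    value = 0                                  # assumed initial value of the register
    for client, kind, v in ops:
        if kind == "write":
            value = v
        elif v != value:                       # a read returned something other than the current value
            return False
    return True

def linearizable(history):
    """Brute-force check: does some reordering that preserves each client's own
    issue order explain the observed reads?  (Exponential; fine for tiny examples.)"""
    n = len(history)
    for order in permutations(range(n)):
        per_client = {}
        ok = True
        for pos in order:
            client = history[pos][0]
            if per_client.get(client, -1) > pos:   # would reorder one client's own operations
                ok = False
                break
            per_client[client] = pos
        if ok and legal_register_run([history[i] for i in order]):
            return True
    return False


# Client A writes 1 and then reads 0: no per-object reordering can explain this,
# because A's read follows A's own write, so the history is not linearizable.
bad = [("A", "write", 1), ("A", "read", 0)]
# Here B's read of 0 may simply have been ordered before A's write: linearizable.
good = [("A", "write", 1), ("B", "read", 0)]
print(linearizable(bad), linearizable(good))   # False True
```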
21.5.5 Transactions in Real-Time Systems

The option of using transactional reliability in real-time systems has been considered by a number of researchers, but the resulting techniques have apparently seen relatively little use in commercial products. There are a number of approaches that can be taken to this problem. Davidson is known for work on transactional concurrency control subject to real-time constraints; her approach involves extending the scheduling mechanisms used in transactional systems (notably, timestamped transactional systems) to try to satisfy the additional constraints associated with the need to perform operations before a deadline expires.

Amir, in work on the Transis project, has looked at transactional architectures in which data is replicated and it may be necessary to perform a transaction with weaker consistency than the normal serializability model because of temporal or communication constraints [Ami95]. For example, Amir considers the case of a mobile and disconnected user who urgently requires the results of a query, even at the risk that the database replica on which it will be executed is somewhat out of date. The Bayou project, described earlier, uses a similar approach. These methods can be considered "real-time" to the degree to which they might be used to return a result for a query whose temporal constraint is inconsistent with the need to run a normal concurrency control algorithm, which can delay a transaction for an unpredictable period of time.

Broadly, however, the transactional model is fairly complex and consequently ill-suited for use in settings where the temporal constraints have fine granularity relative to the time needed to execute a typical transaction. In environments where there is substantial "breathing room," transactions may be a useful technique even if there are real-time constraints that should be taken into account; but as the temporal demands on the system rise, more and more deviation from the pure serializability model is typically needed in order to continue to guarantee timely response.

21.6 Advanced Replication Techniques

Looking to the future, one of the more exciting research directions of which the author is aware involves the use of process groups as a form of coherently replicated cache to accelerate access to a database. The idea can be understood as a synthesis of Liskov's work on the Harp file system [LGGJ91], the author's work on Isis [BR94], and research by Seltzer and others on log-structured database systems [Sel93]. However, this author is not aware of any publication in which the contributions of these disparate systems are unified.

To understand the motivation for this work, it may help to briefly review the normal approach to replication in database systems. As was noted earlier, one can replicate a data item by maintaining multiple copies of that item on servers that will fail independently, and updating the item using a transaction that either writes all copies or at least writes to a majority of copies of the item. However, such transactions are slowed by the quorum read and commit operations: the former will now be a distributed operation and hence subject to high overhead, while the latter is a cost not paid in the non-distributed or non-replicated case. For this reason, most commercial database systems are operated in a non-distributed manner, even in the case of technologies such as Tuxedo or Encina that were developed specifically to support distributed transactional applications.

Moreover, many commercial database systems provide a weak form of replication for high-availability applications, in which the absolute guarantees of a traditional serializability model are relaxed to improve performance. The specific approach is often as follows. The database system is replicated between a primary and a backup server, whose roles will be interchanged if the primary fails and later is repaired and recovers. The primary server will, while running, maintain a log of committed transactions, periodically transmitting it to the backup, which applies the corresponding updates.
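A minimal sketch of this primary/backup log-shipping scheme (the class names and the batching policy are assumptions, not taken from any particular product) makes the lag between commit at the primary and application at the backup explicit.

```python
class Primary:
    def __init__(self, backup, batch_size=3):
        self.state = {}
        self.log = []                 # committed but possibly not yet shipped transactions
        self.shipped = 0              # index of the first unshipped log record
        self.backup = backup
        self.batch_size = batch_size

    def commit(self, key, value):
        self.state[key] = value
        self.log.append((key, value))            # the client sees the commit immediately
        if len(self.log) - self.shipped >= self.batch_size:
            self.ship_log()                      # the backup catches up only in batches

    def ship_log(self):
        self.backup.apply(self.log[self.shipped:])
        self.shipped = len(self.log)


class Backup:
    def __init__(self):
        self.state = {}

    def apply(self, records):
        for key, value in records:
            self.state[key] = value


backup = Backup()
primary = Primary(backup)
for i in range(5):
    primary.commit(f"acct{i}", 100 + i)

# Two committed transactions have not yet reached the backup: if the primary
# crashed now, the backup would take over without them (the window of vulnerability).
print(len(primary.log) - primary.shipped, "transactions at risk")   # 2
```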
Notice that this protocol has a "window of vulnerability." If a primary server is performing transactions rapidly, perhaps hundreds of them per second, the backup may lag by hundreds or thousands of transactions because of the delay associated with preparing and sending the log records. Should the primary server now crash, these transactions will be trapped in the log records: they are committed, and the client has potentially seen the result, but the backup will take over in a state that does not yet reflect the corresponding updates. Later, when the primary restarts, the lost transactions will be recovered and, hopefully, can still be applied without invalidating other actions that occurred in the interim; otherwise, a human operator is asked to intervene and correct the problem. The benefit of the architecture is that it gives higher availability without loss of performance; the hidden cost, however, is the risk that transactions will be "rolled back" by a failure, creating noticeable inconsistencies and a potentially difficult repair problem.

[Figure 21-4: Many commercial database products achieve high availability using a weak replication policy that can have a window of vulnerability. In this example, the red transactions have not been logged to the backup and hence can be lost if the primary fails; the green transactions, on the other hand, are "stable" and will not be lost even if the primary fails. Although lost transactions will be recovered when the primary restarts, it may not be possible to reapply the updates automatically; a human operator intervenes in such cases.]

As it happens, we can do a less costly job of replicating a database using process groups, and may actually gain performance by doing so! The idea is the following. Suppose that one were to consider a database as being represented by a checkpoint and a log of subsequent updates. At any point in time, the state of the database could be constructed by loading the checkpoint and then applying the updates to it; if the log were to grow too long, it could be truncated by forming a new checkpoint. This isn't an unusual way to view database systems: Seltzer's work on log-structured databases [Sel93] in fact implemented a database this way and demonstrated some performance benefits by doing so, and Liskov's research on Harp (a non-transactional file store implemented using a log-based architecture) employed a similar idea, albeit in a system with non-volatile RAM memory. Indeed, within the file system community, Rosenblum's work on LFS (a log-structured file system) revolutionized the architecture of many file system products [RO91]. So it is entirely reasonable to adopt a similar approach to database systems.

Now, given a checkpoint and log representation of the database, a database server can be viewed as a process that caches the database contents in high-speed volatile memory. Each time the server is launched, it reconstructs this cached state from the most recent checkpoint and log of updates, and subsequently transactions are executed out of the cache. To commit a transaction in this model, it suffices to force a description of the transaction to the log (perhaps as little as the transactional request itself and the serialization order that was used). The database state maintained in volatile memory by the server can safely be discarded after a failure, hence the costly disk accesses associated with the standard database server architecture are avoided. Meanwhile, the log itself becomes an append-only structure that is almost never reread, permitting very efficient storage on disk. This is precisely the sort of logging studied by Rosenblum for file systems and Seltzer for database systems, and it is known to be very cost effective; for small to moderately sized databases it can work well. Subsequent research has suggested that this approach can also be applied to very large databases.
But now our process group technology offers a path to further performance improvements through parallelism. What we can do is to use the lightweight replication methods of Chapter 15 to replicate the volatile, "cached" database state within a group of database servers, which can then use one of the load-balancing techniques of Section 15.3.3 to subdivide the work of performing transactions. Within this process group, there is no need to run a multi-phase commit protocol! To see this, notice that just as the non-replicated volatile server is merely a cache of the database log, so the replicated group is merely a volatile, cached database state.

When we claim that there is no need to run a multi-phase commit protocol here, it may at first seem that such a claim is incorrect, since the log records associated with the transaction do need to be forced to disk (or to NVRAM, if we use the Harp approach), and if there is more than one log, this activity will need to be coordinated to ensure that either all logs reflect the committed transaction or that none does. For availability, it may actually be necessary to replicate the log, and if this is done, a multi-phase commit would be unavoidable. However, in many settings it might make sense to use just a single log server; for example, if the logging device is itself a RAID disk, then the intrinsic fault tolerance of the RAID technology could be adequate to provide the degree of availability desired for our purposes. Thus, it may be better to say that there is no inherent reason for a multi-phase commit protocol here, although in specific cases one may be needed.

The primary challenge associated with this approach is to implement a suitable concurrency control scheme in support of it. While optimistic methods are favored in the traditional work on distributed databases, it is not clear that they represent the best solution for this style of group-structured replication married to a log-structured database architecture. In the case of pessimistic locking, a solution is known from the work of Joseph and this author in the mid-1980s. In the approach developed by Joseph, data is replicated within a process group [Jos86]. Reads are done from any single local copy, and writes are done by issuing an asynchronous cbcast to the full group. Locking is done by obtaining local read locks and replicated write locks, the latter using a token-based scheme. The issue now arises of read locks that can be broken by a failure; this is addressed by re-registering read locks at an operational database server if one of the group members fails. In the scheme Joseph explored, such re-registration occurs during the flush protocol used to reconfigure the group membership. Next, Joseph introduced a rule whereby a write lock must be granted in the same process group view in which it was requested.
If a write lock is requested in a group view in which process p belongs to the group, and p fails before the lock is granted (perhaps because of a read lock some transaction held at process p), this rule forces the transaction requesting the write lock to release any locks it successfully acquired and to repeat its request. The repeated request will occur after the read lock has been re-registered, avoiding the need to abort a transaction because its read locks were broken by a failure. In such an approach, the need to support unilateral transaction abort is eliminated, because the log now provides persistence, and locks can never be lost within the process group (unless all its members fail, which is a special case). Transaction commit becomes an asynchronous cbcast, with the same low cost as the protocol used to do writes.

[Figure 21-5: Future database systems may gain performance benefits by exploiting process groups as scaleable "parallel" front-ends that cache the database in volatile memory and run transactions against this coherently cached state in a load-balanced manner. A persistent log is used to provide failure atomicity; because the log is a write-only structure, it can be optimized to give very high performance. The log would record a description of each update transaction and the serialization order that was used; only committed transactions need be logged in this model.]

Readers familiar with transactional concurrency control may be puzzled by the similarity of this scheme to what is called the available copies replication method, an approach that is known to yield non-serializable executions [BHG87]. In fact, however, there is a subtle difference between Joseph's scheme and the available copies scheme, namely that Joseph's approach depends on group membership changes to trigger lock re-registration, whereas the available copies scheme does not. Since group membership, in the virtual synchrony model, involves a consensus protocol that provides consistent failure notification throughout the operational part of a system, the inconsistent failure detections that arise in the available copies approach do not occur. This somewhat obscure observation does not seem to be widely known within the database community.
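The view rule can be illustrated with a small simulation (a toy sketch with invented names, not Joseph's actual implementation): each write-lock request is stamped with the view in which it was made, read locks held at a failed member are re-registered during the view change, and a write request left over from an old view must release its locks and retry.

```python
class GroupLockManager:
    """Toy sketch of Joseph-style locking within a process group (not the real protocol)."""

    def __init__(self, members):
        self.view = 1
        self.members = set(members)
        self.read_locks = {}        # item -> {member: transaction} for locally held read locks
        self.pending = []           # pending write-lock requests: (txn, item, view_requested)

    def register_read(self, member, txn, item):
        self.read_locks.setdefault(item, {})[member] = txn

    def request_write(self, txn, item):
        self.pending.append((txn, item, self.view))

    def member_failed(self, failed, surviving_reregistrations):
        """Flush protocol: install a new view and re-register read locks elsewhere."""
        self.members.discard(failed)
        self.view += 1
        for item, holders in self.read_locks.items():
            holders.pop(failed, None)
        for member, txn, item in surviving_reregistrations:
            self.register_read(member, txn, item)

    def grant_writes(self):
        """Grant unblocked writes; requests from an older view must be retried."""
        results = []
        for txn, item, requested_in in self.pending:
            if requested_in != self.view:
                results.append((txn, "release locks and retry in view %d" % self.view))
            elif not self.read_locks.get(item):
                results.append((txn, "granted"))
            else:
                results.append((txn, "blocked by readers"))
        return results


mgr = GroupLockManager({"p", "q", "r"})
mgr.register_read("p", "T1", "x")            # T1 holds a read lock on x at process p
mgr.request_write("T2", "x")                 # T2 asks for a write lock on x in view 1
mgr.member_failed("p", [("q", "T1", "x")])   # p fails; T1's read lock is re-registered at q
print(mgr.grant_writes())                    # T2 must release and retry: it predates view 2
```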
Using Joseph's pessimistic locking scheme, a transaction that does not experience any failures will be able to do local reads at any copy of the replicated data objects on which it operates. The update and commit protocols both permit immediate local action at the group member where the transaction is active, together with an asynchronous cbcast to inform the other members of the event. Only the acquisition of a write lock, and the need to force the transaction description and commit record (including the serialization order that was used) to the log, involve a potential delay. This overhead, however, is counterbalanced by the performance benefits that come with scaleable parallelism.

The result of this effort represents an interesting mixture of process group replication and database persistence properties. On the one hand, we get the benefit of high-speed memory-mapped database access, and can use the very lightweight non-uniform replication techniques that achieved such good performance in previous chapters. Moreover, we can potentially do load-balancing or other sorts of parallel processing within the group. Yet the logging method also gives us the persistence properties normally associated with transactions, and the concurrency control scheme provides for traditional transactional serializability. Moreover, this benefit is available without special hardware (such as NVRAM), although NVRAM would clearly be beneficial if one wanted to replicate the log itself for higher availability. To the author, the approach seems to offer the best of both worlds.

The integration of transactional constructs and process groups thus represents fertile territory for additional research, particularly of an experimental nature. As noted earlier, it is clear that developers of reliable distributed systems need group mechanisms for high availability and transactional ones for persistence and recoverability of critical data. Integrated solutions that offer both options in a clean way could lead to a much more complete and effective programming environment for developing the sorts of robust distributed applications that will be needed in complex environments.

21.7 Related Readings

Chapter 26 includes a review of some of the major research projects in this area, which we will not attempt to duplicate here. For a general treatment of transactions, this author favors [GR93, BHG87]. On the nested transaction model, see [Mos82]. On disconnected operation in transactional systems, [Ami95, CGS85, AKA93, TTPD95]. On log-based transactional architectures, [LGGJ91, Jos86, Sel93, BR94].

22. Probabilistic Protocols

The protocols considered in previous chapters of this textbook share certain basic assumptions concerning the way that a distributed behavior, or a notion of distributed consistency, is derived from the local behaviors of system components. Although we have explored a number of styles of protocol, the general pattern involves reasoning about the possible system states observable by a correct process, and generalizing from this to properties that are shared by sets of correct processes. This approach could be characterized as a "deductive" style of distributed computing, in which the causal history prior to an event is used to deduce system properties, and the possible deductions by different processes are shown to be consistent in the sense that, through exchanges of messages, they will not encounter direct contradictions in the deduced distributed state. In support of this style of computing we have reviewed a type of distributed system architecture that is hierarchical in structure, or perhaps (as in Transis) composed of a set of hierarchical structures linked by some form of wide-area protocol.

There is little doubt that this leads to an effective technology for building very complex, highly reliable distributed systems. One might wonder, however, whether there are other ways to achieve meaningful forms of consistent distributed behavior, and if so, whether the corresponding protocols might have advantages that would favor their use under conditions where the protocols we have seen up to now, for whatever reason, encounter limitations. This line of reasoning has motivated some researchers to explore other styles of reliable distributed protocol, in which weaker assumptions are made about the behavior of the component programs but stronger ones are made about the network. Such an approach results in a form of protection against misbehavior in which a process fails to respect the rules of the protocol yet is not detected as having failed.
In this chapter we discuss the use of probabilistic techniques to implement reliable broadcast protocols and replicated data objects. Although we will see that there are important limitations on the resulting protocols, they also represent an interesting design point that may be of practical value in important classes of distributed computing systems. Probabilistic protocols are not likely to replace the more traditional deductive protocols anytime soon, but they can be a useful addition to our repertoire of "tools" for constructing reliable distributed systems, particularly in settings where the load and timing properties of the system components are extremely predictable.

22.1 Probabilistic Protocols

The protocols we will be looking at in this section are scaleable and probabilistically reliable. Unlike the protocols presented previously, they are based on a probabilistic system model somewhat similar to the "synchronous" model that we considered in our discussion of real-time protocols. In contrast to the asynchronous model, no mechanism for detecting failure is required. These protocols are scaleable in two senses. First, the message costs and latencies of the protocols grow slowly with the system size. Second, the reliability of the protocols, expressed in terms of the probability of a failed run of a protocol, approaches 0 exponentially fast as the number of processes is increased. This scaleable reliability is achieved through a form of gossip protocol that is strongly self-stabilizing. Such a system has the property that if it is disrupted into an inconsistent state, it will automatically restore itself to a consistent one given a sufficient period of time without failures. Our protocols (particularly those for handling replicated data) also have this property.

The basic idea with which we will work is illustrated in Figure 22-1, which shows a possible execution of a form of gossip protocol developed by Demers and others at Xerox Parc [DGHI87]. In this example of a "push" gossip protocol, messages are diffused through a randomized flooding mechanism. The first time a process receives a message, it selects some fixed number of destinations from the set of processes that have not yet received it. The number of such destinations is said to be the fanout of the protocol, and the processes selected are picked randomly (a bit vector, carried on the messages, indicates which processes have received them). As these processes receive the message, they relay it in the same manner. Subsequently, if a process receives a duplicate copy of a message it has seen before, it discards the message silently. Gossip protocols will typically flood the network within a logarithmic number of rounds. This behavior is very similar to that of a biological epidemic, hence such protocols are also known as epidemic protocols [Bai75]. Notice that although each process may receive a message many times, the computational cost of detecting duplicates and discarding them is likely to be low. On the other hand, the cost of relaying them is a fixed function of the fanout regardless of the size of the network; this is cited as one of the benefits of the approach. The randomness of the protocols has the benefit of overcoming failures of individual processes, in contrast with protocols where each process has a specific role to play and must play it correctly, or fail detectably, for the protocol itself to terminate correctly.
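A round-based simulation of the push scheme conveys the logarithmic behavior (a sketch for intuition only; the fanout of 3, the lossless network, and the strict round structure are simplifying assumptions).

```python
import random

def push_gossip(n, fanout=3, seed=1):
    """Rounds until all n processes receive a message pushed with the given fanout.

    Simplifying assumptions: no message loss, no crashes, and the bit vector
    carried on messages is modeled as exact knowledge of who has the message.
    """
    rng = random.Random(seed)
    received = {0}                        # process 0 initiates the gossip
    frontier = {0}                        # processes that relay in the next round
    rounds = 0
    while len(received) < n:
        rounds += 1
        next_frontier = set()
        for _ in frontier:                # each newly infected process relays once
            remaining = [q for q in range(n) if q not in received]
            for q in rng.sample(remaining, min(fanout, len(remaining))):
                received.add(q)           # a later duplicate would simply be discarded
                next_frontier.add(q)
        frontier = next_frontier
    return rounds

for n in (10, 100, 1000):
    print(n, push_gossip(n))              # round count grows roughly logarithmically in n
```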
Our figure illustrates a push protocol, in the sense that processes with data push it to other processes that lack it by gossiping. A "pull" style of gossip can also be defined: in this approach, a process periodically solicits messages from some set of randomly selected processes. Moreover, the two schemes can be combined. Demers and his colleagues have provided an analysis of the convergence and scaling properties of gossip protocols based on pushing, pulling, and combined mechanisms, and have shown how these can overcome failures [DGHI89]. They prove that both classes of protocols converge towards flooding exponentially quickly, and demonstrate that they can be applied to real problems.

The motivation for their work was a scaling problem that arose in the wide-area mail system developed at Parc in the 1980s. As this system was used on a larger and larger scale, it began to exhibit consistency problems and had difficulties accommodating mobile users. Demers and his colleagues showed that by reimplementing the email system to use a gossip broadcast protocol they could overcome these problems, helping to ensure timely and consistent email services that were location independent and inexpensive.

[Figure 22-1: A "push" gossip protocol. Each process that receives a message picks some number of destinations (the "fanout," 3 in the example shown) randomly among the processes not known to have received it. Within a few rounds, the message has reached all destinations (above, a process that has received the message is shown in gray). A "pull" protocol can be used to complement this behavior: a process periodically selects a few processes and solicits messages from them. Both approaches exhibit the exponential convergence typical of epidemics in densely populated biological systems.]

22.2 Other applications of gossip protocols

The protocol of Demers is neither the first nor the only one to explore gossip-style information dissemination as a tool for communication in distributed systems. Other relevant work in this area includes [ABM87], an information diffusion protocol that uses a technique similar to the one presented above, and [Gol91a, GT92], which use gossip as the mechanism underlying a group membership algorithm for wide-area applications. For reasons of brevity, however, we will not treat these papers in detail in the current chapter.

22.3 Hayden's pbcast primitive

In the style of protocol explored at Xerox, the actual rate at which messages flood the network is not guaranteed, because of failures. Instead, these protocols guarantee that, given enough time, eventually either all or no correct processes will deliver a message. This property is called eventual convergence. Although eventual convergence is sufficient for many uses, the property is weaker than the guarantees of the protocols we used earlier to replicate data and perform synchronization, because eventual convergence does not provide bounds on message latency or ordering properties. Hayden has shown how gossip protocols can be extended to have these properties [HB93], and in this section we present the protocol he developed for this purpose. Hayden calls his protocol pbcast, characterizing it as a probabilistic analog of the abcast protocol for process groups. The pbcast protocol is based on a number of assumptions about the environment, which may not hold in typical distributed systems.
Thus, after presenting the protocol, we will need to ask ourselves when it could appropriately be applied. If used in a setting where these assumptions are not valid, pbcast might not perform as well as the analysis would otherwise suggest.

Specifically, pbcast is designed for a static set of processes that communicate synchronously over a fully connected, point-to-point network. The processes have unique, totally ordered identifiers and can toss weighted, independent random coins. Runs of the system proceed in a sequence of rounds in which messages sent in the current round are delivered in the next. There are two types of failures, both probabilistic in nature. The first is process failure: there is an independent, per-process probability of at most f_p that a process suffers a crash failure during the finite duration of a protocol. Such processes are called faulty. The second is message omission failure: there is an independent, per-message probability of at most f_m that a message between non-faulty processes experiences a send omission failure. The message omission failure events and the process failure events are mutually independent. In this model there are no malicious faults, spurious messages, or corruption of messages. We expect both f_p and f_m to be small probabilities. (For example, unless otherwise stated, the values used in the graphs in this section are f_m = 0.05 and f_p = 0.001.)

The impact of the failure model above can be visualized by thinking of the power that would be available to an adversary who seeks to cause a run of the protocol to fail by manipulating the system within the bounds of the model. Such an adversary has these capabilities and restrictions:

• An adversary cannot use knowledge of future probabilistic outcomes, interfere with the random coin tosses made by processes, cause correlated (non-independent) failures to occur, or do anything not enumerated below.
• An adversary has complete knowledge of the history of the current run of the protocol.
• At the beginning of a run of the protocol, the adversary has the ability to individually set process failure rates, within the bounds [0, f_p].
• For faulty processes, the adversary can choose an arbitrary point of failure.

[...]
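The failure model itself is easy to simulate (a toy sketch of the model just described, not of pbcast; the round count, fanout, and gossip rule are placeholders chosen only to exercise the f_p and f_m parameters).

```python
import random

def run_once(n, rounds, fanout, f_p, f_m, rng):
    """One run of round-based push gossip under the probabilistic failure model."""
    faulty = {p for p in range(n) if rng.random() < f_p}    # independent crash failures
    delivered = set() if 0 in faulty else {0}               # process 0 initiates
    for _ in range(rounds):
        new = set()
        for p in delivered:
            for q in rng.sample(range(n), fanout):
                if q not in faulty and rng.random() >= f_m: # message omitted with probability f_m
                    new.add(q)
        delivered |= new
    correct = n - len(faulty)
    return len(delivered), correct

def estimate_reliability(n=100, rounds=7, fanout=2, f_p=0.001, f_m=0.05, trials=200, seed=1):
    """Fraction of runs in which nearly every correct process delivered the message."""
    rng = random.Random(seed)
    good = 0
    for _ in range(trials):
        delivered, correct = run_once(n, rounds, fanout, f_p, f_m, rng)
        if correct and delivered >= 0.9 * correct:
            good += 1
    return good / trials

print(estimate_reliability())
```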
