Kenneth P. Birman - Building Secure and Reliable Network Applications

Figure 13-2: Distorted timelines that might correspond to faster or slower executions of the processes illustrated in the previous figure. Here we have redrawn the earlier execution to make an inconsistent cut appear to be physically instantaneous, by slowing down process p1 (dotted lines) and speeding up p3 (jagged). But notice that to get the cut "straight" we now have message e travelling "backwards" in time, an impossibility! The black cuts in the earlier figure, in contrast, can all be straightened without such problems.

This lends intuition to the idea that a consistent cut is a state that could have occurred at an instant in time, while an inconsistent cut is a state that could not have occurred in real time.

A very simple logical clock can be constructed by associating a counter with each process and message in the system. Let LTp be the logical time for process p (the value of p's copy of this counter), and let LTm be the logical time associated with message m (also called the logical timestamp of m). The following rules are used to maintain these counters: LTp is incremented each time p performs an event; when p sends m, LTm is set to the current value of LTp; and when p receives m, LTp is set to max(LTp, LTm) and then incremented.

Figure 13-4: 2PC extended to handle participant failures. (Participant rules shown in the figure: commit => make change permanent; abort => delete temp area; after failure: for each pending protocol, contact the coordinator to learn the outcome.)

Consider next the case where the coordinator fails during a 2PC protocol. If we are willing to wait for the coordinator to recover, the protocol requires few changes to deal with this situation. The first change is to modify the coordinator to save its commit decision to persistent storage before sending commit or abort messages to the participants.8 Upon recovery, the coordinator is now guaranteed to have available the information needed to terminate the protocol, which it can do by simply retransmitting the final commit or abort message. A participant that is not in the precommit state would acknowledge such a message but take no
action; a participant waiting in the precommit state would terminate the protocol upon receipt of it.

It is actually sufficient for the coordinator to save only commit decisions in persistent storage. After a failure, a recovering coordinator can safely presume the protocol to have aborted if it finds no commit record; the advantage of such a change is to make the abort case less costly, by removing a disk I/O operation from the "critical path" before the abort can be acted upon. The elimination of a single disk I/O operation may seem like a minor optimization, but in fact it can be quite significant, in light of the roughly 10-fold latency difference between a typical disk I/O operation (10-25ms) and a typical network communication operation (perhaps 1-4ms latency). One doesn't often have an opportunity to obtain an order-of-magnitude performance improvement in a critical path, hence these are the sorts of engineering decisions that can have very important implications for overall system performance!

The remaining operations are all initiated by processes that belong to the system. These, too, might need to be reissued if the GMS process contacted to perform the operation fails before responding (the failure would be detected when a new GMS membership list is delivered to a process waiting for a response, and the GMS member it is waiting for is found to have been dropped from the list). It is clear that exactly the same approach can be used to solve this problem. Each request need only be uniquely identifiable, for example using the process identifier of the invoking process and some form of counter (request 17 from process p on host h).

The central issue is thus reduced to replication of data within the GMS, or within similar groups of processes. We will postpone this problem momentarily, returning below when we give a protocol for implementing replicated data within dynamically defined groups of processes.
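The presumed-abort recovery rule described above can be sketched in a few lines. This is a minimal illustration, not Birman's code or a complete 2PC implementation; the class and method names are invented for the example, and a simple set stands in for the coordinator's persistent log.

```python
# Sketch of presumed-abort recovery in 2PC: the coordinator forces only
# COMMIT decisions to stable storage, so on recovery the absence of a
# commit record safely implies that the transaction aborted.

class Coordinator:
    def __init__(self):
        self.commit_log = set()  # stands in for persistent storage

    def decide(self, txn, commit):
        if commit:
            self.commit_log.add(txn)  # forced disk write only on the commit path
            return "COMMIT"
        return "ABORT"                # no disk I/O before acting on an abort

    def outcome_after_recovery(self, txn):
        # A recovering coordinator (or one queried by a blocked participant)
        # presumes abort whenever no commit record survives.
        return "COMMIT" if txn in self.commit_log else "ABORT"

class Participant:
    def __init__(self, coordinator):
        self.coordinator = coordinator
        self.pending = set()  # transactions left in the precommit state

    def recover(self):
        # After failure: for each pending protocol, contact the coordinator
        # to learn the outcome.
        return {t: self.coordinator.outcome_after_recovery(t) for t in self.pending}
```

For instance, a participant that crashed in the precommit state for transactions t1 and t2, where the coordinator had committed only t1, learns COMMIT for t1 and ABORT for t2, and the abort path never touched the disk.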
13.10.3 GMS Notifications With Bounded Delay

If the processes within a system possess synchronized clocks, it is possible to bound the delay before a process becomes aware that it has been partitioned from the system. Consider a system in which the health of a process is monitored by the continued reception of some form of "still alive" messages received from it; if no such message is received after delay σ, any of the processes monitoring that process can report it as faulty to the GMS. (Normally, such a process would also cease to accept incoming messages from the faulty process, and would also gossip with other processes to ensure that if p considers q to have failed, then any process that receives a message from p will also begin to shun messages from q.) Now, assume further that all processes which receive a "still alive" message acknowledge it. In this setting, p will become aware that it may have been partitioned from the system within a maximum delay of 2*ε+σ, where ε represents the maximum latency of the communication channels. More precisely, p will discover that it has been partitioned from the system 2*ε+σ time units after it last had contact with a majority of the previous primary component of the GMS. In such situations, it would be appropriate for p to break any channels it has to system members, and to cease taking actions on behalf of the system as a whole. Thus, although the GMS may run its protocol to exclude p as early as 2*ε time units before p discovers that it has been partitioned from the main system, there is a bound on this delay. The implication is that the new primary system component can safely break locks held by p, or otherwise take over actions for which p was responsible, after 2*ε time units have elapsed.

Figure 13-13: If channel delays are bounded, a process can detect that it has been partitioned from the primary component within a bounded time interval, making it safe for the primary component to take over actions from it, even where externally visible effects may be involved. (The figure shows a timeline for processes p0, p1, and p2, marked t, t+ε, t+σ, t+ε+σ, and t+2ε+σ, with the last ping, the last ack, the timeout at p0, and the point at which the state becomes "safe".) Above, the gray region denotes a period during which the new primary process will be unable to take over because there is some possibility that the old primary process is still operational in a non-primary component and may still be initiating "authoritative actions". At the end of the gray period a new primary process can be appointed within the primary component. There may be a period of real time during which no primary process was active, but there is no risk that two were simultaneously active. One can also bias a system in the other direction, so that there will always be at least one primary active, provided that the rate of failures is limited.

Reasoning such as this is only possible in systems where clocks are synchronized to a known precision, and in which the delays associated with communication channels are also known. In practice, such values are rarely known with any accuracy, but coarse approximations may exist. Thus, in a system where message passing primitives provide expected latencies of a few milliseconds, one might take ε to be a much larger number, like one second or ten seconds. Although extremely conservative, such an approach would in practice be quite safe. Later we will examine real-time issues more closely, to ask how much better we can do, but it is useful to keep in mind that very coarse-grained real-time problems are often simple in distributed systems where the equivalent fine-grained real-time problems would be very difficult or provably impossible. At the same time, even a coarse-grained rule such as this one would only be safe if there was good reason to believe that the value of ε was a "safe" approximation. Some systems provide no
guarantees of this sort at all, in which case incorrect behavior could result if a period of extreme overload or some other unusual condition caused the ε limit to be exceeded.

To summarize, the "core primary partition GMS" protocol must satisfy the following properties:

• C-GMS-1: The system membership takes the form of "system views". There is an initial system view, which is predetermined at the time the system starts. Subsequent views differ by the addition or deletion of processes.
• C-GMS-2: Only processes that request to be added to the system are added. Only processes that are suspected of failure, or that request to leave the system, are deleted.
• C-GMS-3: A majority of the processes in view i of the system must acquiesce in the composition of view i+1 of the system.
• C-GMS-4: Starting from an initial system view, subsequences of a single sequence of "system views" are reported to system members. Each system member observes such a subsequence starting with the view in which it was first added to the system, and continuing until it fails, leaves the system, or is excluded from the system.
• C-GMS-5: If process p suspects process q of being faulty, then if the core-GMS service is able to report new views, either q will be dropped from the system, or p will be dropped, or both.
• C-GMS-6: In a system with synchronized clocks and bounded message latencies, any process dropped from the system view will know that this has occurred within bounded time.

As noted above, the core GMS protocol will not always be able to make progress: there are patterns of failures and communication problems that can prevent it from reporting new system views. For this reason, C-GMS-5 is a conditional liveness property: if the core GMS is able to report new views, then it eventually acts upon process add or delete requests. It is not yet clear what conditions represent the weakest environment within which liveness of
the GMS can always be guaranteed. For the protocol given above, the core GMS will make progress provided that at most a minority of processes from view i fail or are suspected as having failed during the period needed to execute the two- or three-phase commit protocol used to install new views. Such a characterization may seem evasive, since such a protocol may execute extremely rapidly in some settings and extremely slowly in others. However, unless the timing properties of the system are sufficiently strong to support estimation of the time needed to run the protocol, this seems to be as strong a statement as can be made. We note that the failure detector called W in the work of Chandra and Toueg is characterized in terms somewhat similar to this [CT91, CHT92]. Very recent work by several researchers [BDM95, Gue95, FKMB95] has shown that the W failure detector can be adapted to asynchronous systems in which messages can be lost during failures, or in which processes can be "killed" because the majority of processes in the system consider them to be malfunctioning. Although fairly theoretical in nature, these studies are shedding light on the conditions under which problems such as membership agreement can always be solved, and those under which agreement may not always be possible (the theoreticians are fond of calling the latter settings ones in which the problem is "impossible"). To present this work here, however, would require a lengthy theoretical digression, which would be out of keeping with the generally practical tone of the remainder of the text. Accordingly, we cite this work and simply encourage the interested reader to turn to the papers for further detail.

13.10.4 Extending the GMS to Allow Partition and Merge Events

Research on the Transis system, at Hebrew University in Jerusalem, has yielded insights into the extension of protocols such as the one used to implement our primary component GMS, so that it can permit continued operation during partitionings that leave no
primary component, or allow activity in a non-primary component, reconciling the resulting system state when partitions later remerge [ADKM92b, Mal94]. Some of this work was done jointly with the Totem project at U.C. Santa Barbara [MAMA94]. Briefly, the approach is as follows. In Ricciardi's protocols, when the GMS is unable to obtain a majority vote in favor of a proposed new view, the protocol ceases to make progress. In the extended protocol, such a GMS can continue to produce new views, but no longer considers itself to be the primary partition of the system. Of course, there is also a complementary case in which the GMS encounters some other GMS and the two merge their membership views. It may now be the case that one GMS or the other was the primary component of the system, in which case the new merged GMS will also be primary for the system. On the other hand, perhaps a primary component fragmented in such a way that none of the surviving components considers itself to be the primary one. When this occurs, it may be that later such components will remerge, and primaryness can then be "deduced" by study of the joint histories of the two components. Thus, one can extend the GMS to make progress even when partitioning occurs.

Some recent work at the University of Bologna, on a system named Relacs, has refined this approach into one that is notable for its simplicity and clarity. Ozalp Babaoglu, working with Alberto Bartoli and Gianluca Dini, has demonstrated that a very small set of extensions to a view-synchronous environment suffice to support EVS-like functionality. They call their model Enriched View Synchrony, and describe it in a technical report that appeared shortly before this text went to press [BBD96]. Very briefly, Enriched View Synchrony arranges to deliver only non-overlapping group views within different components of a partitioned system. The reasoning behind this is that overlapping views can
cause applications to briefly believe that the same process or site resides on both sides of a partition, leading to inconsistent behavior. They then provide a set of predicates by which a component can determine whether or not it has a quorum that would permit direct update of the global system state, together with algorithmic tools for assisting in the state merge problem that arises when communication is reestablished. The author is not aware of any implementation of this model yet, but the primitives are simple, and an implementation in a system such as Horus (Chapter 18) would not be difficult.

Having described these approaches, there remains an important question: whether or not it is desirable to allow a GMS to make progress in this manner. We defer this point until Chapter 16. In addition, as was noted in a footnote above, Keidar and Dolev have shown that there are cases in which no component is ever the primary one for the system, and yet dynamically uniform actions can still be performed through a type of gossip that occurs whenever the network becomes reconnected and two non-minority components succeed in communicating [KD95]. Although interesting, this protocol is costly: prior to taking any action, a majority of all the processes in the system must be known to have seen the action. Indeed, Keidar and Dolev develop their solution for a static membership model, in which the GMS tracks subsets of a known maximum system membership. The majority requirement makes this protocol costly; hence, although it is potentially useful in the context of wide-area systems that experience frequent partition failures, it is not likely that one would use it directly in the local-area communication layers of a system. We will return to this issue in Chapter 15, in conjunction with the model called Extended Virtual Synchrony.

13.11 Dynamic Process Groups and Group Communication

When the GMS is used to ensure system-wide agreement on failure and join events, the illusion of a failstop computing
environment is created [SM94]. For example, if one were to implement the 3PC protocol of Section 13.6.1.2 using the notifications of the GMS service as a failure detection mechanism, the 3PC protocol would be non-blocking, provided, of course, that the GMS service itself is able to remain active and continue to output failure detections. The same power that the GMS brings to the 3PC problem can also be exploited to solve other problems, such as data replication, and offers us ways to do so that can be remarkably inexpensive relative to the quorum update solutions presented previously. Yet we remain able to say that these systems are reliable in a strong sense: under the conditions when the GMS can make progress, such protocols will also make progress, and will maintain their consistency properties continuously, at least when permission to initiate new actions is limited to the primary component in the event that the system experiences a partitioning failure.

Figure 13-14: A dynamically uniform multicast involves a more costly protocol, but if the message is delivered to any destination, the system guarantees that the remaining destinations will also receive it. This is sometimes called a "safe" delivery, in the sense that it is safe to take actions that leave externally visible effects with respect to which the remainder of the system must be consistent. However, a non-uniform multicast is often safe for applications in which the action taken upon receipt of the message has only internal effects on the system state, or when consistency with respect to external actions can be established in other ways, for example from the semantics of the application.

In the subsections that follow, we develop this idea into an environment for computing with what are called virtually synchronous process groups. We begin by focusing on a simpler problem, closely related to 2PC, namely the reliable
delivery of a message to a statically defined group of processes. Not surprisingly, our solution will be easily understood in terms of the 2PC protocol: delivering messages in the first phase if internal consistency is all that we require, and doing so in the second phase if dynamic uniformity (external consistency) is needed. We will then show how this solution can be extended to provide ordering on the delivery of messages; later, such ordered and reliable communication protocols will be used to implement replicated data and locking. Next, we show how the same protocols can also be used to implement dynamic groups of processes. In contrast to the dynamic membership protocols used in the GMS, however, these protocols will be quite a bit simpler and less costly. Next, we introduce a synchronization mechanism that allows us to characterize these protocols as failure-atomic with respect to group membership changes; this implements a model called the view synchrony model. Finally, we show how view synchrony can support a more extensive execution model called virtual synchrony, which supports a particularly simple and efficient style of fault-tolerant computing. Thus, step by step, we will show how to build up a reliable and consistent computing environment starting with the protocols embodied in the group membership service.

Up to the present we have focused on protocols in terms of a single group of processes at a time, but the introduction of sophisticated protocols and tools in support of process group computing also creates the likelihood that a system will need to support a great many process groups simultaneously, and that a single distributed application may embody considerable numbers of groups, perhaps many groups per process that is present. Such developments have important performance implications, and will motivate us to reexamine our protocols.

Finally, we will turn to the software engineering issues
associated with support for process group computing. This topic, which is addressed in the next chapter of the textbook, will center on a specific software system developed by the author and his colleagues, called Horus. The chapter also reviews a number of other systems, however, and in fact one of the key goals of Horus is to be able to support the features of some of these other systems within a common framework.

13.11.1 Group Communication Primitives

A group communication primitive is a procedure for sending a message to a set of processes that can be addressed without knowledge of the current membership of the set. Recall that we discussed the notion of a hardware broadcast capable of delivering a single message to every computer connected to some sort of communications device. Group communication primitives would normally transmit to subsets of the full membership of a computing system, so we use the term multicast to describe their behavior. A multicast is a protocol that sends a message from one sender process to multiple destination processes, which deliver it. Suppose that we know that the current composition of some group G is {p0, ..., pk}. What properties should a multicast to G satisfy?
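Before turning to the possible answers, the addressing transparency in this definition can be made concrete with a small sketch: the sender names the group, not its members, and the membership is resolved only at the moment of the send. All names here are illustrative inventions, not the interface of any real system, and the sketch deliberately provides none of the reliability or ordering properties discussed below.

```python
# Sketch of group addressing: a sender multicasts to a group handle without
# knowing its current membership; joins and leaves change what a multicast
# reaches without any change to the sending code.

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.delivered = []      # messages this process has delivered, in order

    def deliver(self, message):
        self.delivered.append(message)

class Group:
    def __init__(self, name):
        self.name = name
        self.members = []        # current view; maintained elsewhere (e.g. by a GMS)

    def join(self, process):
        self.members.append(process)

    def multicast(self, message):
        # Membership is resolved here, at send time, not by the caller.
        for p in list(self.members):
            p.deliver(message)

G = Group("G")
p0, p1 = Process("p0"), Process("p1")
G.join(p0); G.join(p1)
G.multicast("m0")                # the sender never names p0 or p1 explicitly
```

A process that joins after a multicast was sent simply never sees it, which is consistent with the atomicity caveat illustrated later in Figure 13-16.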
The answer to this question will depend upon the application. As will become clear in Chapters 16 and 17, there are a great number of "reliable" applications for which a multicast with relatively weak properties would suffice. For example, an application that is simply seeking information that any of the members of G can provide might multicast an inquiry to G as part of an RPC-style protocol that requires a single reply, taking the first one that is received. Such a multicast would ideally avoid sending the message to the full membership of G, resorting instead to heuristics for selecting a member that is likely to respond quickly (like one on the same machine as the sender), and implementing the multicast as an RPC to this member that falls back to some other strategy if no local process is found, or if it fails to respond before a timeout elapses. One might argue that this is hardly a legitimate implementation of a multicast, since it often behaves like an RPC protocol, but there are systems that implement precisely this functionality and find it extremely useful.

Figure 13-15: Non-dynamically uniform message delivery: the message is delivered to one destination, q, but then the sender and q crash, and the message is delivered to none of the remaining destinations.

A multimedia system might use a similar multicast, but with real-time rate-control or latency properties that correspond to the requirements of the display software [RS92]. As groupware uses of distributed systems become increasingly important, one can predict that vendors of web browsers will focus on offering such functions, and that telecommunications service providers will provide the corresponding communications support. Such support would probably need to be aware of the video encoding that is used, for example MPEG, so that it can recognize and drop data selectively if a line becomes congested or a data frame is
contaminated or will arrive too late to be displayed.

Figure 13-16: Neither form of atomicity guarantees that a message will actually be delivered to a destination that fails before having an opportunity to deliver the message, or that joins the system after the message is sent.

Distributed systems that use groups as a structuring construct may require guarantees of a different nature, and we now focus on those. A slightly more ambitious multicast primitive that could be useful in such a group-oriented application might work by sending the message to the full membership of the destination set, but without providing reliability, ordering, flow control, or any form of feedback in regard to the outcome of the operation. Such a multicast could be implemented by invoking the IP multicast (or UDP multicast) transport primitives that we discussed in Section 3.3.1. The user whose application requires any of these properties would implement some form of end-to-end protocol to achieve them.

A more elaborate form of multicast would be a failure-atomic multicast, which guarantees that, for a specified class of failures, the multicast will either reach all of its destinations, or none of them. As we just observed, there are really two forms of failure atomicity that might both be interesting, depending on the circumstance. A failure-atomic multicast is dynamically uniform if it guarantees that if any process delivers the multicast, then all processes that remain operational will do so, regardless of whether or not the initial recipient remains operational subsequent to delivering the message. A failure-atomic multicast that is not dynamically uniform would guarantee only that if one waits long enough, one will find either that all the destinations that remained operational delivered the message, or that none did so. To avoid trivial outcomes, both primitives require that the message be delivered eventually if the sender doesn't fail.11

To reiterate a point made
earlier, the key difference between a dynamically uniform protocol and one that is merely failure-atomic but non-uniform has to do with the obligation that arises when the first delivery event occurs. From the perspective of a recipient process p, if m is sent using a protocol that provides dynamic uniformity, then when p delivers m it also knows that any future execution of the system in which a set of processes remains operational will also guarantee the delivery of m within its remaining destinations among that set of processes, as illustrated in Figure 13-14. (We state it this way because processes that join after m was sent are not required to deliver m.)

11 Such a definition leaves open the potential for another sort of trivial solution: one in which the act of invoking the multicast primitive causes the sender to be excluded from the system as faulty. A rigorous non-triviality requirement would also exclude this sort of behavior, and there may be other trivial cases that this author is not aware of. However, as was noted early in the textbook, our focus here is on reliability as a practical engineering discipline, and not on the development of a mathematics of reliability. The author is convinced that such a mathematics is urgently needed, and is concerned that minor problems such as this one could have subtle and undesired implications in correctness proofs. However, the logical formalism that would permit a problem such as this to be specified completely rigorously remains elusive, apparently because of the self-defined character of a system that maintains its own membership and seeks an internal consistency guarantee but not an external one. This author is hopeful that with further progress in the area of specification, limitations such as these can be overcome in the near future.

On the other hand, if process p receives a non-uniform multicast m, p knows that if both the sender of m and p crash or are excluded from
the system membership, m may not reach its other destinations, as seen in Figure 13-15. Dynamic uniformity is a costly property to provide, but p would want this guarantee if its actions upon receiving m will leave some externally visible trace that the system must know about, such as redirecting an airplane or issuing money from an automatic teller. A non-dynamic failure atomicity rule would be adequate for most internal actions, like updating a data structure maintained by p, and even for some external ones, like displaying a quote on a financial analyst's workstation, or updating an image in a collaborative work session. In these cases, one may want the highest possible performance and not be willing to pay a steep cost for the dynamic uniformity property, because the guarantee it provides is not actually a necessary one.

Notice that neither property ensures that a message will reach all of its destinations, because no protocol can be sure that a destination will not crash before having an opportunity to deliver the message, as seen in Figure 13-16.

In this section of the textbook, the word "failure" is understood to refer to the reporting of failure events by the GMS [SM94]. Problems that result in the detection of an apparent failure by a system member are reported to the GMS but do not directly trigger any actions (except that messages from apparently faulty processes are ignored) until the GMS officially notifies the full system that the failure has occurred. Thus, although one could develop failure-atomic multicasts against a variety of failure models, we will not be doing so in this section.

13.12 Delivery Ordering Options

Turning now to multicast delivery ordering, let us start by considering a multicast that offers no guarantees whatsoever. Using such a multicast, a process that sends two messages m0 and m1 concurrently would have no assurances at all about their relative order of delivery or relative atomicity. That is, suppose that m0 was the message sent first. Not
only might m1 reach any destinations that it shares with m0 first, but a failure of the sender might result in a scenario where m1 was delivered atomically to all its destinations but m0 was not delivered to any process that remains operational (Figure 13-17). Such an outcome would be atomic on a per-multicast basis, but might not be a very useful primitive from the perspective of the application developer! Thus, we should ask what forms of order a multicast primitive can guarantee, but also ask how order is connected to atomicity in our failure-atomicity model.

Figure 13-17: An unordered multicast provides no guarantees. Here, m0 was sent before m1, but is received after m1 at destination p0. The reception order for m2, sent concurrently by process r, is different at each of its destinations.

We will be studying a hierarchy of increasingly ordered delivery properties. The weakest of these is usually called "sender order" or "FIFO order", and requires that if the same process sends m0 and m1, then m0 will be delivered before m1 at any destinations they have in common. A slightly stronger ordering property is called causal delivery order, and says that if send(m0)→send(m1), then m0 will be delivered before m1 at any destinations they have in common (Figure 13-19). Still stronger is an order whereby any processes that receive the same two messages receive them in the same order: if at process p, deliv(m0)→deliv(m1), then m0 will be delivered before m1 at all destinations they have in common. This is sometimes called a totally ordered delivery protocol, but to call it that is something of a misnomer, since one can imagine a number of ordering properties that would be total in this respect without necessarily implying the existence of a single system-wide total ordering on all the messages sent in the system. The reason for this is that our definition focuses on delivery orders where messages
overlap, but doesn't actually relate these orders to an acyclic system-wide ordering. The Transis project calls this type of locally ordered multicast an "agreed" order, and we like this term too: the destinations agree on the order, even for multicasts that may have been initiated concurrently and hence that may be unordered by their senders (Figure 13-20). However, the agreed order is more commonly called a "total" order or an "atomic" delivery order in the systems that support multicast communication and in the papers in the literature.

Figure 13-18: Sender-ordered or "FIFO" multicast. Notice that m2, which is sent concurrently, is unordered with respect to m0 and m1.

Figure 13-19: Causally ordered multicast delivery. Here m0 is sent before m1 in a causal sense, because a message is sent from q0 to q1 after m0 was sent, and before q1 sends m1. Perhaps q0 has requested that q1 send m1. m0 is consequently delivered before m1 at destinations that receive both messages. Multicast m2 is sent concurrently, and no ordering guarantees are provided. In this example, m2 is delivered after m1 by p0 and before m1 by p1.

One can extend the agreed order into a causal agreed order (now one requires that if the sending events were ordered by causality, the delivery order will respect the causal send order), or into a system-wide agreed order (one requires that there exist a single system-wide total order on messages, such that the delivery ordering used at any individual process is consistent with the message ordering in this system total order). Later we will see why these are not identical orderings. Moreover, in systems that have multiple process groups, the issue will arise of how to extend ordering properties to span multiple process groups.

Figure 13-20: When using a totally ordered multicast primitive, p0 and p1 receive exactly the same multicasts, and
the messages are delivered in identical orders. Above, the order also happens to be causal, but this is not a specific guarantee of the primitive.

Wilhelm and Schiper have proposed that total ordering be further classified as weak or strong, in terms analogous to the dynamically uniform and non-uniform delivery properties. A weak total ordering property would be one guaranteed to hold only at correct processes, namely those that remain operational until the protocol terminates. A strong total ordering property would hold even at faulty processes, namely those that fail after delivering messages but before the protocol as a whole has terminated.

For example, suppose that a protocol fixes the delivery ordering for messages m1 and m2 at process p, delivering m1 first. If p fails, a weak total ordering would permit the delivery of m2 before m1 at some other process q that survives the failure, even though this order is not the one seen by p. Like dynamic uniformity, the argument for strong total ordering is that it may be required if the ordering of messages has externally visible consequences, which could be noticed by an external observer who interacts with a process that later fails, and then interacts with some other process that remained operational. Naturally, this guarantee has a price, and one would prefer to use a less costly weak protocol in settings where such a guarantee is not required.

Let us now return to the issue raised briefly above, concerning the connection between the ordering properties for a set of multicasts and their failure atomicity properties. To avoid creating an excessive number of possible multicast protocols, we will assume here that the developer of a reliable application will typically want the specified ordering property to extend into the failure atomicity properties of the primitives used, too. That is, in a situation where the ordering property of a
multicast would imply that message m0 should be delivered before m1 if they have any destinations in common, we will require that if m1 is delivered successfully, then m0 must be too, whether or not they actually have common destinations. This is sometimes called a gap-freedom guarantee: it is the constraint that failures cannot leave holes or gaps in the ordered past of the system. Such a gap is seen in Figure 13-21.

Figure 13-21: In this undesirable scenario, the failure of q0 leaves a "causal gap" in the message delivery order, preventing q1 from communicating with members of G. If m1 is delivered, the causal ordering property would be violated, because send(m0)→send(m1). But m0 will never be delivered. Thus q1 is logically partitioned from G!

Notice that this rule is stated so that it would apply even if m0 and m1 have no destinations in common! The reason for this is that ordering requirements are normally transitive: if m0 is before m1, and m1 is before m2, then m0 is also before m2, and we would like both delivery ordering obligations and failure atomicity obligations to be guaranteed between m0 and m2. Had we instead required that "in a situation where the ordering property of a multicast implies that message m0 should be delivered before m1, then if they have any destinations in common, we will also require that if m1 is delivered successfully, then m0 must be too," the delivery atomicity requirement might not apply between m0 and m2.

Lacking a gap-freedom guarantee, one can imagine runs of a system that would leave orphaned processes that are technically prohibited from communicating with one another. For example, in Figure 13-21, q1 sends message m1 to the members of group G causally after m0 was sent by q0 to G. The members of G are now required to deliver m0 before delivering m1. However, if the failure atomicity rule is such that the failure of q0 could prevent m0 from ever being delivered, this ordering obligation can only be
satisfied by never delivering m1. One could say that q1 has been partitioned from G by the ordering obligations of the system! Thus, if a system provides ordering guarantees and failure atomicity guarantees, it should normally extend the latter to encompass the former.

Yet an additional question arises if a process sends multicasts to a group while processes are joining or leaving it. In these cases the membership of the group will be in flux at the time that the message is sent, and one can imagine a number of ways to interpret group atomicity that a system could implement. We will defer this problem for the present, returning to it in Section 13.12.2.

13.12.1.1 Non-Uniform Failure-Atomic Group Multicast

Consider the following simple, but inefficient, group multicast protocol. The sender adds a header to its message listing the membership of the destination group at the time that it sends the message. It now transmits the message to the members of the group, perhaps taking advantage of a hardware multicast feature if one is available, and otherwise transmitting the message over stream-style reliable connections to the destinations. (However, unlike a conventional stream protocol, here we will assume that the connection is only broken if the GMS reports that one of the endpoints has left the system.) Upon reception of a message, the destination processes deliver it immediately, but then resend it to the remaining destinations. Again, each process uses reliable stream-style channels for this retransmission stage, breaking the channel only if the GMS reports the departure of an endpoint. A participant will now receive one copy of the message from the sender, and one from each non-failed participant other than itself. After delivery of the initial copy, it therefore discards any duplicates.

We will now argue that this protocol is failure-atomic, although not dynamically uniform. To see that it is failure-atomic, assume
that some process pi receives and delivers a copy of the message and remains operational. Failure atomicity tells us that all other destinations that remain operational must also receive and deliver the message. It is clear that this will occur, since the only condition under which pi would fail to forward a message to pj would be if the GMS reports that pi has failed, or if it reports that pj has failed. But we assumed that pi does not fail, and the output of the GMS can be trusted in this environment. Thus, the protocol achieves failure-atomicity.

To see that the protocol is not dynamically uniform, consider the situation if the sender sends a copy of the message only to process pi and then both processes fail. In this case, pi may have delivered the message and then executed for some extended period of time before crashing or detecting that it has been partitioned from the system. The message has thus been delivered to one of the destinations, and that destination may well have acted on it in a visible way, and yet none of the processes that remain operational will ever receive it. As we noted earlier, this often will not pose a problem for the application, but it is a behavior that the developer must anticipate and treat appropriately in his or her application.

Figure 13-22: A very simple reliable multicast protocol. The initial round of messages triggers a second round of messages as each recipient echoes the incoming message to the other destinations.

As can be seen in Figure 13-22, this simple protocol is a costly one: to send a message to n destinations requires O(n²) messages. Of course, with hardware broadcast functions, or if the network is not a bottleneck, the cost will be lower, but the protocol still requires each process to send and receive each message approximately n times.

But now, suppose that we delay the "retransmission" stage of the protocol, doing this only if the GMS
informs the participants that the sender has failed. This change yields a less costly protocol, which requires n messages (or just one, if hardware broadcast is an option), but in which the participants may need to save a copy of each message indefinitely. They would do this "just in case" the sender fails.

Recall that we are transmitting messages over a reliable stream. It follows that within the lower levels of the communication system, there is an occasional acknowledgment flowing from each participant back to the sender. If we tap into this information, the sender will "know" when the participants have all received copies of its message. It can now send a second-phase message out, informing the participants that it is safe to delete the saved copy of each message, although they must still save the message "identification" information to reject duplicates if the sender happens to crash midway through this stage. At this stage the participants can disable their retransmission logic and discard the saved copy of the message (although not its identification information), since any retransmitted message would be a duplicate. Later, the sender could run still a third phase, telling the participants that they can safely delete even the message identification information, because after the second phase there will be no risk of a failure that would cause the message to be retransmitted by the participants.

But now a further optimization is possible. There is no real hurry to run the third phase of this protocol, and even the second phase can be delayed to some degree. Moreover, most processes that send a multicast will tend to send a subsequent one soon afterwards: this principle is well known from all forms of operating systems and database software, and can be summarized by the maxim that the most likely action by any process is to repeat the same action it took most recently. Accordingly, it makes sense to delay sending out messages for the second and third phases of the protocol, in the hope that a new multicast will be initiated and this information can be piggybacked onto the first stage of an outgoing message associated with that subsequent protocol!

Figure 13-23: An improved 3-phase protocol. Ideally, the second and third phases would be piggybacked onto other multicasts from the same sender to the same set of destinations, and hence would not require "extra" messages.

In this manner, we arrive at a solution, illustrated in Figure 13-23, that has an average cost of n messages per multicast, or just one if hardware broadcast can be exploited, plus some sort of background cost associated with the overhead to implement a reliable stream channel. When a failure does occur, any pending multicast will suddenly generate as many as n² additional messages, but even this effect can potentially be mitigated. For example, since the GMS provides the same membership list to all processes and the message itself carried the list of its destinations, the participants can delay briefly in the hope that some jointly identifiable "lowest ranked" participant will turn out to have received the message and will terminate the protocol on behalf of all. We omit the details of such a solution, but any serious system for reliable distributed computing would implement a variety of such mechanisms to keep costs down to an absolute minimum, and to maximize the value of each message actually transmitted, using piggybacking, delaying tactics, and hardware broadcast.

13.12.1.2 Dynamically Uniform Failure-Atomic Group Multicast

We can extend the above protocol to one that is dynamically uniform, but doing so requires that no process deliver the message until it is known that the processes in the destination group all have a copy. (In some cases it may be sufficient to know that a majority have a copy, but we will not concern ourselves with these sorts
of special cases now, because they are typically limited to the processes that actually run the GMS protocol.) We could accomplish this in the original inefficient protocol of Figure 13-22, by modifying the original non-uniform protocol to delay the delivery of messages until a copy has been received from every destination that is still present in the membership list provided by the GMS. However, such a protocol would suffer from the inefficiencies that led us to optimize the original protocol into the one in Figure 13-23. Accordingly, it makes more sense to focus on that improved protocol.

Here, it can be seen that an additional round of messages will be needed before the multicast can be delivered initially; the remainder of the protocol can then be used without change (Figure 13-24). Unfortunately, though, this initial round also delays the delivery of the messages to their destinations. In the original protocol, a message could be delivered as soon as it reached a destination for the first time; thus the latency to delivery is precisely the latency from the sender to a given destination for a single "hop." Now the latency might be substantially increased: for a dynamically uniform delivery, we will need to wait for a round-trip to the slowest process in the set of destinations, and then one more hop until the sender has time to inform the destinations that it is safe to deliver the messages.

Figure 13-24: A dynamically uniform version of the optimized, reliable multicast protocol. Latency to delivery may be much higher, because no process can deliver the message until all processes have received and saved a copy. Here, the third and fourth phases can piggyback on other multicasts, but the first two stages may need to be executed as promptly as possible, to avoid increasing the latency still further. Latency is often a key performance factor.

In practice, this may represent an increase in latency of a factor of ten
or more. Thus, while dynamically uniform guarantees are sometimes needed, the developer of a distributed application should request this property only when it is genuinely necessary, or performance (to the degree that latency is a factor in performance) will suffer badly.

13.12.2 Dynamic Process Groups

When we introduced the GMS, our system became very dynamic, allowing processes to join and leave at will. But not all processes in the system will be part of the same application, and the protocols presented in the previous section are therefore assumed to be sent to groups of processes that represent subsets of the full system membership. This is seen in Figure 13-25, which illustrates the structure of a hypothetical trading system, in which services (replicated for improved performance or availability) implement theoretical pricing calculations. Here we have one big system, with many small groups in it. How should the membership of such a subgroup be managed?

Figure 13-25: A distributed trading system may have both "static" and "dynamic" uses for process groups. The historical database, replicated for load-balancing and availability, is tied to the databases themselves and hence can be viewed as static. This is also true of the market data feeds, which are often redundant for fault-tolerance. Other parts of the system, however, such as the analytics (replicated for parallelism) and the client interface processes (one or more per trader), are highly dynamic groups. For uniformity of the model, it makes sense to adopt a dynamic group model, but to keep in mind that some of these groups in fact manage physical resources.

In this section, we introduce a membership management protocol based on the idea that a single process within each group
will serve as the "coordinator" for membership changes. If a process wishes to join the group, or voluntarily leaves the group, this coordinator will update the group membership accordingly. (The role of being coordinator will really be handled by the layer of software that implements groups, so this won't be visible to the application process itself.) Additionally, the coordinator will monitor the members (through the GMS, and by periodically pinging them to verify that they are still healthy), excluding any failed processes from the membership much as in the case of a process that leaves voluntarily.

In the approach we present here, all processes that belong to a group maintain a local copy of the current membership list. We call this the "view" of the group, and will say that each time the membership of the group changes, a "new view" of the group is reported to the members. Our protocol will have the property that all group members see the identical sequence of group views within any given component of a partitioned system. In practice, we will mostly be interested in primary-component partitions, and in these cases, we will simply say that all processes either see identical views for a group or, if excluded from the primary component, cease to see new views and eventually detect that they are partitioned, at which point a process may terminate or attempt to rejoin the system much as a new process would.

The members of a group depend upon their coordinator for the reporting of new views, and consequently monitor the liveness of the coordinator by periodically pinging it. If the coordinator appears to be faulty, the member or members that detect this report the situation to the GMS in the usual manner, simultaneously cutting off communication to the coordinator and starting to piggyback or "gossip" this information on messages to other members, which similarly cut their channels to the coordinator and, if necessary, relay this information to the GMS. The GMS will
eventually report that the coordinator has failed, at which point the lowest ranked of the remaining members takes over as the new coordinator, and similarly if this process fails in its turn.
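The causal delivery order described earlier, and the gap scenario of Figure 13-21, can be illustrated concretely. The sketch below is our own and not code from the book: it uses vector timestamps, one common way to implement causal delivery, and the class name `CausalReceiver` and its buffering scheme are illustrative assumptions. A receiver buffers a message until every causally prior message has been delivered, which is exactly why q1's m1 cannot be delivered while q0's m0 is missing.

```python
# Sketch of causal delivery using vector timestamps (our illustration,
# not the book's own protocol code). Each multicast carries its sender's
# id and a vector clock vt; a receiver delays delivery of (sender, vt)
# until:
#   vt[sender] == local[sender] + 1          (next message from sender)
#   vt[k]      <= local[k] for all k != sender (all causal predecessors seen)

class CausalReceiver:
    def __init__(self, n):
        self.vt = [0] * n          # messages delivered so far, per sender
        self.pending = []          # buffered messages awaiting delivery
        self.delivered = []        # delivery order (for inspection)

    def _deliverable(self, sender, vt):
        return all(vt[k] == self.vt[k] + 1 if k == sender
                   else vt[k] <= self.vt[k]
                   for k in range(len(vt)))

    def receive(self, sender, vt, payload):
        self.pending.append((sender, vt, payload))
        progress = True
        while progress:            # deliver everything that has become ready
            progress = False
            for msg in list(self.pending):
                s, v, p = msg
                if self._deliverable(s, v):
                    self.pending.remove(msg)
                    self.vt[s] += 1
                    self.delivered.append(p)
                    progress = True

# The causal gap of Figure 13-21: q1's m1 (which causally follows q0's
# m0) arrives first; the receiver holds it until m0 arrives.
r = CausalReceiver(2)
r.receive(1, [1, 1], "m1")         # q1 sent m1 after delivering m0
assert r.delivered == []           # m1 buffered: m0 still missing
r.receive(0, [1, 0], "m0")
assert r.delivered == ["m0", "m1"]
```

Note that if m0 were never to arrive, as when q0 fails before m0 reaches anyone, m1 would remain buffered forever: this is the gap-freedom problem in executable form.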
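The non-uniform failure-atomic multicast of Section 13.12.1.1 can also be sketched as a small simulation. This is our own illustration of the echo scheme of Figure 13-22, not an implementation from the book; the `Process` class and the list-based "network" are simplifying assumptions (no real channels or GMS are modeled). It shows the key property argued in the text: even if the sender's copy reaches only one destination before the sender crashes, the echo round spreads the message to every operational member, each delivering it exactly once.

```python
# Sketch of the simple failure-atomic (but not dynamically uniform)
# multicast of Figure 13-22: every recipient delivers the first copy it
# receives, echoes it to the other destinations, and discards duplicates.

class Process:
    def __init__(self, name, group):
        self.name = name
        self.group = group         # destination list carried in the header
        self.seen = set()          # message ids, for duplicate suppression
        self.delivered = []

    def receive(self, msg_id, payload, network):
        if msg_id in self.seen:
            return                 # duplicate: already delivered once
        self.seen.add(msg_id)
        self.delivered.append(payload)
        for peer in self.group:    # echo to all other destinations
            if peer is not self:
                network.append((peer, msg_id, payload))

group = []
p0, p1, p2 = (Process(n, group) for n in ("p0", "p1", "p2"))
group.extend([p0, p1, p2])

# The sender crashes after its copy reaches only p0.
network = [(p0, "m0", "hello")]
while network:                     # drain the simulated network
    dest, mid, payload = network.pop(0)
    dest.receive(mid, payload, network)

# All surviving members delivered the message exactly once.
assert all(p.delivered == ["hello"] for p in group)
```

The O(n²) message cost is visible here too: each of the three processes echoes to the other two, so the single multicast generates six echo messages on top of the sender's original transmission.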
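Finally, the coordinator-based view management of Section 13.12.2 can be reduced to a bookkeeping sketch. This is our own illustration under strong simplifying assumptions: no failure detection, pinging, or GMS interaction is modeled, and the class name `GroupCoordinator` is hypothetical. It captures only the invariant the text describes: members see a single, identical sequence of views, and rank (position in the membership list) determines who would take over as coordinator.

```python
# Sketch of coordinator-driven group views (our illustration of the
# bookkeeping only). Each membership change produces a new numbered
# view; the history is the identical view sequence every member sees.

class GroupCoordinator:
    def __init__(self, initial_members):
        self.view_id = 0
        self.members = list(initial_members)  # ranked: members[0] is lowest
        self.history = [(self.view_id, tuple(self.members))]

    def _new_view(self):
        self.view_id += 1
        self.history.append((self.view_id, tuple(self.members)))

    def join(self, p):
        self.members.append(p)
        self._new_view()

    def leave(self, p):                       # voluntary leave or exclusion
        self.members.remove(p)
        self._new_view()

coord = GroupCoordinator(["a", "b"])
coord.join("c")        # view 1: (a, b, c)
coord.leave("a")       # view 2: (b, c); if "a" was coordinator, the
                       # lowest-ranked survivor "b" would take over
assert coord.history[-1] == (2, ("b", "c"))
```

A real implementation would drive `leave` from failure reports delivered by the GMS and would have to handle a coordinator crash mid-update, which is where the two-phase view-installation machinery of the full protocol comes in.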