Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design. Martin L. Shooman. Copyright © 2002 John Wiley & Sons, Inc. ISBNs: 0-471-29342-3 (Hardback); 0-471-22460-X (Electronic).

7 RELIABILITY OPTIMIZATION

7.1 INTRODUCTION

The preceding chapters of this book discussed a wide range of different techniques for enhancing system or device fault tolerance. In some applications, only one of these techniques is practical, and there is little choice among the methods. However, in a fair number of applications, two or more techniques are feasible, and the question arises as to which technique is the most cost-effective. To address this problem, given two alternatives, one can always use one technique for design A and the other technique for design B; one can then analyze both designs A and B to study the trade-offs. In the case of a standby or repairable system, if redundancy is employed at a component level, there are many choices based on the number of spares and on which components will be spared. At the top level, many systems appear as a series string of elements, and the question arises of how we are to distribute the redundancy in a cost-effective manner among the series string. Specifically, we assume that the number of redundant elements that can be added is limited by cost, weight, volume, or some similar constraint. The object is to determine the set of redundant components that still meets the constraint and raises the reliability by the largest amount. Some authors refer to this as redundancy optimization [Barlow, 1965]. Two practical works, Fragola [1973] and Mancino [1986], are given in the references that illustrate the design of a system with a high degree of parallel components; the reader should consult these papers after studying the material in this chapter. In some ways, this chapter can be considered an extension of the material in an earlier chapter; however, in this chapter we discuss the optimization approach,
where rather than having the redundancy apply to a single element, it is distributed over the entire system in such a way that it optimizes reliability.

The optimization approach has been studied in the past but is infrequently used in practice, for many reasons: (a) the system designer does not understand the older techniques and the resulting mathematical formulation; (b) the solution takes too long; (c) the parameters are not well known; and (d) constraints change rapidly and invalidate the previous solution. We propose a technique that is clear, simple to explain, and results in the rapid calculation of a family of good suboptimal solutions along with the optimal solution. The designer is then free to choose among this family of solutions, and if the design features or parameters change, the calculations can be repeated with modest effort.

We now postulate that the design of fault-tolerant systems can be divided into three classes. In the first class, only one design approach (e.g., parallel, standby, voting) is possible, or intuition and experience point to a single approach. It is then simple to decide on the level of redundancy required to meet the design goal or the level allowed by the constraint. To simplify our discussion, we will refer to cost, but we must keep in mind that all the techniques to be discussed can be adapted to any other single constraint or, in many cases, multiple constraints. Typical multiple constraints are cost, reliability, volume, and weight. Sometimes the optimum solution will not satisfy the reliability goal; then either the cost constraint must be increased or the reliability goal must be lowered. If there are two or three alternative designs, we would merely repeat the optimization for each as discussed previously and choose the best result. The second class is one in which there are many alternatives within the design approach, because we can apply redundancy at the subsystem level to many
subsystems. The third class, where a mixed strategy is being considered, also has many combinations. To deal with the complexity of third-class designs, we will use computer computations and an optimization approach to guide us in choosing the best alternative or set of alternatives.

7.2 OPTIMUM VERSUS GOOD SOLUTIONS

Because of practical considerations, an approximate optimization yielding a good system is favored over an exact one yielding the best solution. The parameters of the solution, as well as the failure rates, weight, volume, and cost, are generally known only approximately at the beginning of a design; moreover, in some cases we know only the function that the component must perform, not how that function will be implemented. Thus the range of possible parameters is often very broad, and to look for an exact optimum when the parameters are known only over a broad range may be an elegant mathematical formulation, but it is not a practical engineering solution. In fact, sometimes choosing the exact optimum can involve considerable risk if the solution is very sensitive to small changes in parameters.

To illustrate, let us assume that there are two design parameters, x and y, and the resulting reliability is z. We can visualize the solution as a surface in x, y, z space, where the reliability is plotted along the vertical z-axis as the two design parameters vary in the horizontal xy-plane. Thus our solution is a surface lying above the xy-plane, and the height (z) of the surface is our reliability, which ranges between zero and unity. Suppose our surface has two maxima: one where the surface is a tall, thin spire with reliability zs = 0.98 at the peak, which occurs at (xs, ys); and the other where the surface is broad and the reliability reaches zb = 0.96 at a small peak located at (xb, yb) in the center of a broad plateau having a height of 0.94. Clearly, if we choose the spire as our design and if parameters x or y are a
little different from (xs, ys), the reliability may be much lower, below 0.96 and even below 0.94, because of the steep slopes on the flanks of the spire. Thus the broad maximum of 0.96 is probably a better design and carries less risk, since even if the parameters differ somewhat from (xb, yb), we still have the broad plateau where the reliability is 0.94. Most exact optimization techniques would choose the spire and not even reveal the broad peak and plateau as other possibilities, especially if the points (xs, ys) and (xb, yb) were well separated. Thus it is important to find a means of calculating the sensitivity of the solution to parameter variations, or of calculating a range of good solutions close to the optimum.

There has been much emphasis in the theoretical literature on how to find an exact optimum. The brute-force approach is to enumerate all possible combinations and calculate the resulting reliability; however, except for small problems, this approach requires long or intractable computations. An alternate approach uses dynamic programming to reduce the number of combinations that must be evaluated by breaking the main optimization into a sequence of carefully formulated suboptimizations [Bierman, 1969; Hiller, 1974; Messinger, 1970].

The approach that this chapter recommends is a two-step procedure. We assume that the problem in question is a large system. Generally, at the top level of a large system, the problem can be modeled as a series connection of a number of subsystems. The process of apportionment (see Lloyd [1977, Appendix 9A]) is used to allocate the system reliability (or availability) goal among the various subsystems and is the first step of the procedure. This process should reduce a large problem to a number of smaller subproblems, whose optimization we can approach by using a bounded enumeration procedure. One can greatly reduce the size of the solution space by establishing a sequence of bounds; the resulting subsystem
optimization is well within the power of a modern PC, and solution times are reasonable.

Of course, the first step in the process, apportionment, is generally a good one, but it is not necessarily an optimum one. It does, however, fit in well with the philosophy alluded to in the previous section: that a broad, easy-to-achieve, easy-to-understand suboptimum is preferred in a practical case. As described later in this chapter, allocation tends to divert more resources to the "weakest link in the chain."

There are other important practical arguments for simplified semioptimum techniques instead of exact mathematical optimization. In practice, optimizing a design is a difficult problem for many reasons. Designers, often harried by schedule and costs, look for a feasible solution that meets the performance parameters; thus reliability may be treated as an afterthought. This approach seldom leads to a design with optimum reliability, much less a good suboptimal design. The opposite extreme is the classic optimization approach, in which a mathematical model of the system is formulated along with constraints on cost, volume, weight, and so forth; all the allowable combinations of redundant parallel and standby components are permitted; and the underlying integer programming problem is solved. The latter approach is seldom taken, for the previously stated reasons: (a) the system designer does not understand the mathematical formulation or the solution process; (b) the solution takes too long; (c) the parameters are not well known; and (d) the constraints rapidly change and invalidate the previous solution. Therefore, clear, simple, and rapid calculation of a family of good suboptimal solutions is a sensible approach. The study of this family should reveal which solutions, if any, are very sensitive to changes in the model parameters. Furthermore, the computations are simple enough that they can be repeated should significant changes occur
during the design process. Establishing such a range of solutions is an ideal way to ensure that reliability receives adequate consideration among the various conflicting constraints and system objectives during the trade-off process, which is the preferred approach to choosing a good, well-balanced design.

7.3 A MATHEMATICAL STATEMENT OF THE OPTIMIZATION PROBLEM

One can easily define the classic optimization approach as a mathematical model of the system formulated along with constraints on cost, volume, weight, and so forth, in which all the allowable combinations of redundant parallel and standby components are permitted and the underlying integer programming problem must be solved.

We begin with a series model for the system with k components, where x1 is the event success of element one, x̄1 is the event failure of element one, and P(x1) = 1 − P(x̄1) is the probability of success of element one, which is its reliability, r1 (see Fig. 7.1). Clearly, the components in the foregoing mathematical model can be subsystems if we wish. The system reliability is given by the probability of the event in which all the components succeed (the intersection of their successes):

    Rs = P(x1 ∩ x2 ∩ ··· ∩ xk)    (7.1a)

[Figure 7.1: A series system of k components.]

If we assume that all the elements are independent, Eq. (7.1a) becomes

    Rs = ∏_{i=1}^{k} Ri    (7.1b)

We will let the single constraint on our design be the cost, for illustrative purposes; the total cost, c, is given by the sum of the individual component costs, ci:

    c = Σ_{i=1}^{k} ci    (7.2)

We assume that the system reliability given by Eq. (7.1b) is below the system specification or goal, Rg, and that the designer must improve the reliability of the system to meet this specification. (In the highly unusual case where the initial design exceeds the reliability specification, the initial design can be used with a built-in safety factor, or else the designer can consider
using cheaper, shorter-lifetime parts to save money; the latter is sometimes a risky procedure.) We further assume that the maximum allowable system cost, c0, is in general sufficiently greater than c that funds can be expended (e.g., redundant components added) to meet the reliability goal. If the goal cannot be reached, the best solution is the one with the highest reliability within the allowable cost constraint. In the case where more than one solution exceeds the reliability goal within the cost constraint, it is useful to display a number of "good" solutions. Since we wish the mathematical optimization to serve a practical engineering design process, we should be aware that the designer may choose to just meet the reliability goal with one of the suboptimal solutions and save some money. Alternatively, there may be secondary factors that favor a good suboptimal solution (e.g., the sensitivity and risk factors discussed in the preceding section).

There are three conventional approaches to improving the reliability of the system posed in the preceding paragraph:

1. Improve the reliability of the basic elements, ri, by allocating some or all of the cost budget, c0, to fund redesign for higher reliability.
2. Place components in parallel with the subsystems that operate continuously (see Fig. 7.2). This is ordinary parallel redundancy (hot redundancy).
3. Place components in parallel (standby) with the k subsystems and switch them in when an on-line failure is detected (cold redundancy).

[Figure 7.2: The choice of redundant components to optimize the reliability of the series system of Fig. 7.1.]

There are also strategies that combine these three approaches. Such combined approaches, as well as reliability improvement by redesign, are discussed later in this chapter and also in the problems. Most of the chapter focuses on the second and third approaches of the preceding list: hot and cold redundancy.

7.4 PARALLEL AND STANDBY
REDUNDANCY

7.4.1 Parallel Redundancy

Assuming that we employ parallel redundancy (ordinary redundancy, hot redundancy) to optimize the system reliability Rs, we employ nk elements in parallel to raise the reliability of each subsystem, which we denote by Rk (see Fig. 7.2). The reliability of a parallel subsystem of ni independent components is most easily formulated in terms of the probability of failure, (1 − ri)^{ni}. For the structure of Fig. 7.2, where all failures are independent, Eq. (7.1b) becomes

    Rs = ∏_{i=1}^{k} (1 − [1 − ri]^{ni})    (7.3)

and Eq. (7.2) becomes

    c = Σ_{i=1}^{k} ni ci    (7.4)

We can develop a similar formulation for standby redundancy.

7.4.2 Standby Redundancy

In the case of standby systems, it is well known that the probability of failure is governed by the Poisson distribution (see Section A5.4):

    P(x; m) = m^x e^{−m} / x!    (7.5)

where

    x = the number of failures
    m = the expected number of failures

A standby subsystem succeeds if there are fewer failures than the number of available components, xi < ni; thus, for a system that is to be improved by standby redundancy, Eq. (7.3) becomes

    Rs = ∏_{i=1}^{k} Σ_{xi=0}^{ni − 1} P(xi; mi)    (7.6)

and, of course, the system cost is still computed from Eq. (7.4).

7.5 HIERARCHICAL DECOMPOSITION

This section examines the way a designer deals with a complex problem and attempts to extract the engineering principles that should be employed. This leads to a number of viewpoints, from which some simple approaches emerge. The objective is to develop an approach that allows the designer to decompose a complex system into a manageable architecture.

7.5.1 Decomposition

Systems engineering generally deals with large, complex structures that, when taken as a whole (in the gestalt), are often beyond the "intellectual span of control." Thus the first principle in approaching such a design is to decompose the problem into a hierarchy of subproblems. This initial decomposition stops when the complexity of the resulting
components is reduced to a level that puts it within the "intellectual span of control" of one manager or senior designer. This approach is generally called divide and conquer and is presented for use on complex problems in books on algorithms [Aho, 1974, p. 60; Cormen, 1992, p. 12]. The term probably comes from the ancient political maxim divide et impera ("divide and rule") cited by Machiavelli [Bartlett, 1968, p. 150b], or possibly from early principles of military strategy.

7.5.2 Graph Model

Although the decomposition of a large system is generally guided by experience and intuition, there are some guidelines that can be used to guide the process. We begin by examining the structure of the decomposition.

[Figure 7.3: A tree model of a hierarchical decomposition illustrating some graph nomenclature.]

One can describe a hierarchical block diagram of a system in more precise terms if we view it as a mathematical graph [Cormen, 1992, pp. 93–94]. We replace each box in the block diagram by a vertex (node), leaving the connecting lines to form the edges (branches) of the graph. Since information can flow in both directions, this is an undirected graph; if information can flow in only one direction, however, the graph is a directed graph, and an arrowhead is drawn on the edge to indicate the direction. A path in the graph is a continuous sequence of vertices from the start vertex to the end vertex. If the end vertex is the same as the start vertex, then this (closed) path is called a cycle (loop). A graph without cycles in which all the nodes are connected is called a tree (the graph corresponding to a hierarchical block diagram is a tree). The top vertex of a tree is called the root (root node). In general, a node in the tree that corresponds to a component with subcomponents is called a parent of the subcomponents, which are called children. The root node is considered to be at depth 0 (level 0);
its children are at depth 1 (level 1). In general, if a parent node is at level n, then its children are at level n + 1. The largest depth of any vertex is called the depth of the tree. The number of children that a parent has is its out-degree, and the number of parents connected to a child is its in-degree. A node that has no children is the end node (terminal node) of a path from the root node and is called a leaf node (external node); nonleaf nodes are called internal nodes. An example illustrating some of this nomenclature is given in Fig. 7.3.

7.5.3 Decomposition and Span of Control

If we wish our decomposition to be modeled by a tree, then the in-degree must always be one to prevent cycles or inputs to a stage entering from more than one stage. Sometimes, however, it is necessary to have more than one input to a node, in which case one must worry about synchronization and coupling between the various nodes: if node x has inputs from nodes p and q, then any change in either p or q will affect node x. Imposing this restriction on our hierarchical decomposition leads to simplicity in the interfacing of the various system elements.

We now discuss the appropriate size of the out-degree. If we wish to decompose the system, then the minimum out-degree at each node must be two, although such a small out-degree will result in a tree of great height. Of course, if any node has a great number of children (a large out-degree), we begin to strain the intellectual span of control. The experimental psychologist Miller [1956] studied a large number of experiments related to sensory perception and concluded that humans can process about 5–9 levels of "complexity." (A discussion of how Miller's numbers relate to the number of mental discriminations that one can make appears in Shooman [1983, pp. 194–195].)
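The trade-off just described, where a small out-degree yields a tall tree while a large out-degree strains the span of control, is easy to check numerically. The sketch below (function names are illustrative, not from the text) computes the depth a full tree needs to accommodate a given number of lowest-level units, and the total node count of a full tree, for several out-degrees:

```python
def depth_needed(n_leaves, out_degree):
    # Smallest depth h such that out_degree**h >= n_leaves,
    # i.e., the height of the shallowest full tree with enough leaves
    h = 0
    while out_degree ** h < n_leaves:
        h += 1
    return h

def total_nodes(out_degree, h):
    # Nodes in a full tree with leaves at depth h:
    # 1 + d + d^2 + ... + d^h, summed as a geometric series
    return (out_degree ** (h + 1) - 1) // (out_degree - 1)

# A system with 1,000 lowest-level units: out-degree 2 forces many levels,
# while an out-degree near Miller's range keeps the hierarchy shallow.
for d in (2, 5, 7, 9):
    print("out-degree", d, "-> depth", depth_needed(1000, d))
```

With out-degree 2 the hierarchy is far deeper than with out-degree 7, which illustrates why a mid-sized out-degree is preferred.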
If we specify the out-degree to be seven for each node and all the leaves (terminal nodes) to be at level (depth) h, then the number of leaves at level h, NLh, is given by

    NLh = 7^h    (7.7)

In practice, each leaf is the lowest level of replaceable unit, which is generally called a line replaceable unit (LRU). In the case of software, we would probably call the analog of an LRU a module or an object. The total number of nodes, N, in the graph can be computed if we assume that all the leaves appear at level h:

    N = NL0 + NL1 + NL2 + ··· + NLh    (7.8a)

If each parent node has seven children, Eq. (7.8a) becomes

    N = 1 + 7 + 7^2 + ··· + 7^h    (7.8b)

Using the formula for the sum of the terms in a geometric progression,

    N = a(r^n − 1)/(r − 1)    (7.9a)

where

    r = the common ratio (in our case, 7)
    n = the number of terms (in our case, h + 1)
    a = the first term (in our case, 1)

Substitution in Eq. (7.9a) yields

    N = (7^{h+1} − 1)/6    (7.9b)

If h = 2, we have N = (7^3 − 1)/6 = 57. We can check this by substitution in Eq. (7.8b), yielding 1 + 7 + 49 = 57.

7.5.4 Interface and Computation Structures

Another way of viewing a decomposition structure is to think in terms of two classes of structures, interfaces and computational elements, a breakdown that applies to either hardware or software. In the case of hardware, the computational elements are LRUs; for software, they are modules or classes. In the case of hardware, the interfaces are analog or digital signals (electrical, light, sound) passed from one element (depth, level) to another; the joining of mechanical surfaces, hydraulic or pneumatic fluids; or similar physical phenomena. In the case of software, the interfaces are generally messages, variables, or parameters passed between procedures or objects. Both hardware and software have errors (failure rates, reliability) associated with either the computational elements or the interfaces. If we again assume that leaves appear only at the lowest level of the tree, the number of
computational elements is given by the last term in Eq. (7.8a), NLh. In counting interfaces, there is the interface out of an element at level i and the interface into the corresponding element at level i + 1. In electrical terms, we might call these the output impedance and the corresponding input impedance. In the case of software, we would probably be talking about the passing of parameters and their scope between a procedure call and the procedure that is called, or else the passing of messages between classes and objects. For both hardware and software, we count the interface (information-out/information-in) pair as a single interface. Thus all modules except those at level 0 have a single associated interface pair. There is no structural interface at level 0; however, let us consider the system specifications as a single interface at level 0. Thus we can use Eqs. (7.8) and (7.9) to count the number of interfaces, which is equivalent to the number of elements. Continuing the foregoing example where h = 2, we have 7^2 = 49 computational elements and (7^3 − 1)/6 = 57 interfaces. Of course, in a practical example, not all the leaves will appear at depth (level) h, since some of the paths will terminate before level h; thus the preceding computations and formulas can only be considered upper bounds on an actual (less idealized) problem. One can use these formulas for numbers of interfaces and computational units to conjecture models for complexity, errors, reliability, and cost.

7.5.5 System and Subsystem Reliabilities

The structure of the system at level 1 in the graph model of the hierarchical decomposition is a group of subsystems equal in number to the out-degree of the root node. Based on Miller's work, we have decided to let the out-degree be 7 (5 to 9). As an example, let us consider an overview of an air traffic control (ATC) system for an airport [Gilbert, 1973, p. 39, Fig. 61]. Level 0 in our decomposition is the "air traffic control system." At level 1, we have the major subsystems that are given in
Table 7.1. An expert designer of a new ATC system might view things a little differently (in fact, two expert designers working for different companies might ...
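As a concrete numerical illustration of the formulation in Sections 7.3 and 7.4, the sketch below evaluates Eqs. (7.1b) through (7.6) for a small three-subsystem series string and brute-force-enumerates hot-redundancy allocations under a cost budget. All parameter values (ri, ci, c0, and the range of ni) are invented for illustration, and the exhaustive search is the simple enumeration the chapter cautions against for large problems; it is tractable only at this toy size.

```python
import math
from itertools import product

def series_reliability(r):
    # Eq. (7.1b): product of the k element reliabilities
    rs = 1.0
    for ri in r:
        rs *= ri
    return rs

def parallel_factor(ri, ni):
    # One factor of Eq. (7.3): 1 - (1 - ri)^ni for ni hot-redundant copies
    return 1.0 - (1.0 - ri) ** ni

def standby_factor(mi, ni):
    # One factor of Eq. (7.6): Poisson probability of fewer than ni failures
    return sum(mi ** x * math.exp(-mi) / math.factorial(x) for x in range(ni))

def total_cost(c, n):
    # Eq. (7.4): sum of ni * ci
    return sum(ni * ci for ni, ci in zip(n, c))

# Invented illustrative parameters for a three-subsystem series string
r = [0.85, 0.90, 0.95]   # element reliabilities ri
c = [2.0, 3.0, 1.5]      # element costs ci
c0 = 20.0                # assumed cost budget

# Brute-force enumeration of hot-redundancy allocations, ni in 1..3
best = (0.0, None)
for n in product(range(1, 4), repeat=3):
    if total_cost(c, n) <= c0:
        rs = 1.0
        for ri, ni in zip(r, n):
            rs *= parallel_factor(ri, ni)
        best = max(best, (rs, n))

print("best allocation:", best[1], "Rs = %.6f" % best[0])
```

Replacing `parallel_factor` with `standby_factor` (with mi the expected number of failures of subsystem i) gives the cold-redundancy version of the same search; the bounded enumeration recommended in this chapter prunes this search space rather than visiting every combination.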
