614 5 Safety and Risk in Engineering Design

e) Critical Risk Theory in Designing for Safety

In applying critical risk theory to a series process engineering system, the following modelling approach is taken: assume the system consists of k independent components with expected useful life lengths z_1, z_2, z_3, ..., z_k, all of which must function for the system to be able to function, and where the useful life length of the system is Z. Denoting the survival function of the useful life expectancy of Z by \bar{F}, and of z_i by \bar{F}_i (i = 1, 2, 3, ..., k), then

Z = \min(z_1, z_2, z_3, \ldots, z_k)   (5.52)

\bar{F}_i(z_i) = P_{00}(0, z_i)

Then:

\bar{F}(Z) = \prod_{i=1}^{k} \bar{F}_i(Z)

The hazard rate, represented by the intensity function, can now be formulated as

h(Z) = \sum_{i=1}^{k} h_i(Z)   (5.53)

The probability of failure resulting from critical risk i is expressed as

P_{0i}(0, \infty) = \int_0^{\infty} \bar{F}(z) \cdot h_i(z) \, dz   (5.54)

Using the expression for the hazard rate h_i(z) of the useful life expectancy of Z_i, the survival function of the useful life expectancy of the series process engineering system is then expressed as

\bar{F}(Z) = \exp\left[ -\sum_{i=1}^{k} \int_0^{Z} \frac{f(z \mid C = i)}{\bar{F}(z)} \, dz \right]   (5.55)

f) The Concept of Delayed Fatalities

In assessing the safety of a complex process, critical risk may be considered as resulting in fatalities due to an accident. These fatalities can be classified as immediate or as delayed. It is the delayed fatalities that are of primary interest in high-risk engineered installations such as nuclear reactors (NUREG 75/014 1975; NUREG/CR-0400 1978). Critical risk analysis applies equally well to delayed fatalities as to immediate fatalities. To model the impact of delayed fatalities in the assessment of safety in engineering design, consider the effect of a new constant risk, with intensity h(y), which is delayed for time d.
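Before turning to the delayed-risk model, the series-system relations in Eqs. 5.52–5.55 can be checked numerically. The Python sketch below uses three hypothetical components with constant (exponential) hazard rates, integrates Eq. 5.54 by the trapezoidal rule, and confirms that the critical-risk probabilities P_0i(0, ∞) sum to one; all numerical values are assumed for illustration only.

```python
import math

# Hypothetical series system: three independent components with constant
# hazard rates (exponential useful-life distributions).
lams = [0.002, 0.005, 0.003]       # assumed hazard rates, per hour
Lam = sum(lams)                    # system hazard rate h(Z), Eq. 5.53

def surv(z):
    """System survival function: the product of the component survival
    functions, here exp(-sum(lam_i) * z)."""
    return math.exp(-Lam * z)

def p_crit(lam_i, n=100_000, zmax=5000.0):
    """P_0i(0, inf) = integral of F-bar(z) * h_i(z) dz (Eq. 5.54),
    evaluated with the trapezoidal rule (h_i(z) = lam_i, constant)."""
    h = zmax / n
    total = 0.5 * lam_i * (surv(0.0) + surv(zmax))
    total += lam_i * sum(surv(k * h) for k in range(1, n))
    return total * h

probs = [p_crit(l) for l in lams]
print([round(p, 3) for p in probs])   # -> lam_i / Lam: [0.2, 0.5, 0.3]
print(round(sum(probs), 3))           # -> 1.0 (some risk is always critical)
```

For constant hazards the integral has the closed form λ_i/Λ, so the numerical result can be verified directly: each component's probability of being the critical risk is its share of the total hazard rate.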
The model parameters include the following expressions (Thompson 1988):

5.2 Theoretical Overview of Safety and Risk in Engineering Design 615

The intensity function for the new risk is:

h_{new}(y) = 0 for y \le d
h_{new}(y) = \lambda for y > d

The probability that the new risk is the critical risk (resulting in fatality) is (from Eq. 5.51)

\pi_i = P(Y \le \infty, C = i) = P_{0i}(0, \infty)   (5.56)

P_{0i}(0, \infty) = \int_0^{\infty} \bar{F}(y) \cdot h_i(y) \, dy   (5.57)

P_d(0, \infty) = \int_d^{\infty} \lambda e^{-\lambda (y - d)} \bar{F}(y) \, dy = \lambda \int_d^{\infty} \bar{F}(y) \, dy + o(\lambda)   (5.58)

The expected useful life with the new risk delayed is expressed as (from Eq. 5.49)

\mu_{new} = \int_0^d \bar{F}(y) \, dy + \int_d^{\infty} e^{-\lambda (y - d)} \bar{F}(y) \, dy   (5.59)
         = \mu - \int_d^{\infty} \left( 1 - e^{-\lambda (y - d)} \right) \bar{F}(y) \, dy
         = \mu - \lambda \int_d^{\infty} (y - d) \, \bar{F}(y) \, dy + o(\lambda)

The US Nuclear Regulatory Commission's Reactor Safety Study (NUREG 75/014 1975) also presents nuclear risk in comparison with the critical risk of other types of accidents. For example, the annual chance of fatality from vehicle accidents in the USA is given as 1 in 4,000, whereas for nuclear reactor accidents the value is 1 in 5 billion.

5.2.3.2 Fault-Tree Analysis for Safety Systems Design

For potentially hazardous process engineering systems, it is required statutory practice to conduct a quantitative assessment of the safety features at the engineering design stage. The design is assessed by predicting the probability that the safety systems might fail to perform their intended task of either preventing or reducing the consequences of hazardous events. This type of assessment is best carried out in the preliminary design phase, when the system has sufficient detail for a meaningful analysis and can still be easily modified. Several methods have been developed for predicting the likelihood that systems will fail, and for making assessments on avoiding such failure or mitigating its consequences. Such methods include Markov analysis, fault-tree analysis, root cause and common cause analysis, cause-consequence analysis, and simulation.
Fault-tree analysis (FTA) is the most frequently used of these methods in the assessment of safety protection systems for systems design.

a) Assessment of Safety Protection Systems

The criterion used to determine the adequacy of the safety system is usually a comparison with specific target values related to the system's probability of functioning on demand. The initial preliminary design specification is assessed by predicting its likelihood of failure to perform according to the design intent. The predicted performance is then compared to that which is considered acceptable. If system performance is not acceptable, deficiencies in the design are removed through redesign, and the assessment repeated. With all the various options for establishing the design criteria of system configuration, level of redundancy and/or diversity, reliability, availability and maintainability, there is little chance that this approach will ensure that the design reaches its final detail phase with all options adequately assessed. For safety systems whose consequence of failure is seen as catastrophic, it is important to optimise performance with consideration of all the required design criteria, and not just adequate performance at the best cost. The target values should be used as a minimum acceptance level, and the design should be optimised with respect to performance within the constraints of the design criteria. These analysis methods are well developed and can be incorporated into a computerised automatic design assessment cycle that terminates when optimal system performance is achieved within the set constraints.

Safety systems are designed to operate only when certain conditions occur, and function to prevent these conditions from developing into hazardous events with catastrophic consequences.
As such, there are specific features common to all safety protection systems—for example, all safety systems have sensing devices that repeatedly monitor the process for the occurrence of an initiating event. These sensors usually measure some process variable, and transmit the state of the variable to a controller, such as a programmable logic controller (PLC) or a distributed control system (DCS). The controller determines whether the state of the process variable is acceptable by comparing the input signal to a set point. When the variable exceeds the alarm limit of the set point, the necessary protective action is activated. This protective action may either prevent a hazardous event from occurring, or reduce its consequences.

There are several design options with respect to the structure and operation of a safety system where, from a design assessment point of view, the level of redundancy and the level of diversity are perhaps the most important. The safety system must be designed to have a high likelihood of operability on demand. Thus, single component failures should not be able to prevent the system from functioning. One means of achieving this is by incorporating redundancy or diversity into the system's configuration. Redundancy duplicates items of equipment (assemblies, sub-assemblies and/or components) in a system, while diversity includes totally different equipment to achieve the same function. However, increased levels of redundancy and diversity can also increase the number of system failures. To counteract this problem, partial redundancy is opted for—e.g. the system trips when k out of n sensors indicate a failed condition. It is specifically as a result of the assessment of safety in engineering design during the preliminary design phase that decisions are made on where to incorporate redundancy or diversity, and whether full or partial redundancy is appropriate.
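The redundancy trade-off described above can be quantified with a small binomial sketch: the probability that at least k of n independent channels work on demand, set against the probability of a spurious trip. The per-channel probabilities below are assumed values, not figures from the text.

```python
from math import comb

def at_least(k, n, p):
    """Binomial tail: probability that at least k of n independent
    channels are in a given state, each with probability p."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

p_work = 0.90    # assumed per-channel probability of working on demand
q_false = 0.01   # assumed per-channel probability of a spurious (false) trip

for label, k, n in [("single channel", 1, 1),
                    ("1-out-of-2    ", 1, 2),
                    ("2-out-of-3    ", 2, 3)]:
    # A k-out-of-n system works on demand if >= k channels work, and trips
    # spuriously if >= k channels raise a false trip.
    print(label, round(at_least(k, n, p_work), 4),
          round(at_least(k, n, q_false), 6))
```

On these assumed figures, duplicating channels (1-out-of-2) improves the probability of working on demand but roughly doubles the spurious-trip probability, whereas 2-out-of-3 voting improves both relative to a single channel, which is the motivation for partial redundancy.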
b) Design Optimisation in Designing for Safety

The objective of design optimisation in designing for safety is to minimise system unreliability (i.e. the probability of component failure) and system unavailability (i.e. the probability of system failure on demand) by manipulating the design variables such that the design criteria constraints are not violated. However, the nature of the design variables, as well as of the design criteria constraints, engenders a complexity problem in design optimisation.

Commonly with mathematical optimisation, an objective function defines how the characteristics that are to be optimised relate to the variables. In the case where an objective function cannot be explicitly defined, some form of the function must be assumed, and the region defined over which the approximate function can be considered acceptable. Design criteria constraints fall into two categories: those that can be determined from an objective function relating to the design variables, which can be assessed mathematically, and those that cannot be easily expressed as a function, and can be assessed only through analysis. In the former case, a computational method is used to solve the design optimisation problem of a safety system. The method takes the form of an iterative scheme that produces a sequence of system designs gradually improving the safety system performance. When the design can no longer be improved due to restrictions of the design criteria constraints, the optimisation procedure terminates (Andrews 1994).

Assessment of the preliminary design of a safety system might require improvements to system performance.
This could imply developing a means of expressing system performance as a function of the design variables

Q_{system} = f(V_1, V_2, V_3, \ldots, V_n)   (5.60)

where V_1, V_2, V_3, ..., V_n are the design variables, typically including:

• the number of high-pressure valves,
• the number of pressure transmitters,
• the level of redundancy of valves,
• the number of transmitters required to trip.

It is computationally difficult to develop a function Q that can consider all design options. However, with the use of a Taylor series expansion, the following expression is obtained

f(x + \Delta x) = f(x) + g^T \Delta x + \frac{1}{2} \Delta x^T \cdot G \cdot \Delta x   (5.61)

where:
\Delta x = the change in the design vector
g = the gradient vector
G = the Hessian matrix.

The gradient g(x) is the vector of first-order partial derivatives of f(x)

g(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)   (5.62)

The Hessian matrix G(x) is a square symmetric matrix of second derivatives, given as

G(x) = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1 \partial x_1} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n \partial x_n}
\end{bmatrix}   (5.63)

Truncating Eq. 5.61 after the linear term in \Delta x means that the function f(x + \Delta x) can be evaluated provided that the gradient vector can be obtained, that is, \partial f / \partial x_i for each design parameter. Since integer design variables are being dealt with, \partial f / \partial x_i cannot be strictly formulated; but if a smooth curve is taken to link all discrete points, giving the marginal distribution of f as a function of x_i, then \partial f / \partial x_i can be obtained. Partial derivatives can be used to determine how values of f are improved by updating each x_i by \Delta x_i. A fault tree can be developed to obtain f(x + \Delta x) for each x_i, provided x_i + \Delta x_i is integer; finite differences can then be used to estimate \partial f / \partial x_i.
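The finite-difference gradient estimate over integer design variables can be sketched as follows. The performance function here is purely illustrative, standing in for a value that would in practice be obtained from fault-tree analysis at each integer design point; the variable names are hypothetical.

```python
def q_system(x):
    """Illustrative system-unavailability surrogate over integer design
    variables x = (n_valves, n_transmitters); in practice this value would
    come from analysing a fault tree for the design x."""
    n_v, n_t = x
    return 0.05 ** n_v + 0.02 ** n_t      # assumed smooth marginal behaviour

def grad_fd(f, x):
    """Forward finite differences with unit steps (the smallest admissible
    change for integer variables), approximating the gradient vector g."""
    g = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += 1
        g.append(f(xp) - f(x))            # (f(x + e_i) - f(x)) / 1
    return g

x0 = [1, 1]
g = grad_fd(q_system, x0)
print([round(v, 5) for v in g])   # both negative: adding equipment helps

# Linear (truncated Taylor) prediction of f at x0 + (1, 1):
pred = q_system(x0) + sum(g)
print(round(pred, 5), round(q_system([2, 2]), 5))
```

Because this surrogate is separable in the design variables, the linear prediction happens to match the exact value; for a general (coupled) performance function the truncated expansion is only valid near x0, which is why the solution space must be restricted to a neighbourhood of the current design.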
Obtaining f(x + \Delta x) from a separate fault tree for each design point would, however, require a large number of fault trees to be produced and analysed, which usually results in this option not being pursued from a practical viewpoint.

Since truncating the Taylor series of Eq. 5.61 at a finite number of terms provides only an approximation of f(x + \Delta x), the solution space over which this approximation is acceptable also needs to be defined. This is accomplished by setting up a solution space in the neighbourhood of the design's specific target variables. This procedure results in an iterative scheme, with the optimal solution approached by sequential optimisation.

c) Assessment of Safety Systems with FTA

Where design criteria constraints can be assessed only through analysis, fault-tree analysis (FTA) is applied. In the assessment of the performance of a safety system, a fault tree is constructed and analysed for two basic system failure modes: failure to work on demand, and spurious system trips. Fault trees are analysed in the design optimisation problem to obtain numerical estimates of the partial derivatives of system performance with respect to each design variable. This information is required to produce the objective function coefficients. However, the requirement to draw fault trees for several potential system designs, representing the causes of the two system failure modes, would make the optimisation method impractical: manual development of a new tree for each assessment would be too time-consuming. One approach to resolving this difficulty is to utilise computer-automated fault-tree synthesis programs; at present, these have not been developed far enough to accomplish such a task. An alternative approach has been developed to construct a fault tree for systems design using house events (Andrews et al. 1986).
House events can be included in the structure of fault trees, and either occur with certainty (event set to TRUE) or do not occur with certainty (event set to FALSE). Their inclusion in a fault-tree model has the effect of turning branches of the tree on or off. Thus, a single fault tree can be constructed that, by defining the status of the house events, can represent the causes of system failure on demand for any of several potential designs.

An example of a sub-system fault tree that develops the causes of dormant failure of a high-pressure protection system, alternatively termed a high-integrity protection system (HIPS), is illustrated in Fig. 5.22. In this example, the function of the HIPS sub-system is to prevent a high-pressure surge passing through the process, thereby protecting the process equipment from exceeding its individual pressure ratings. The HIPS utilises transmitters that determine when the pipeline pressure exceeds the allowed limit. The transmitters relay a signal to a controller that activates the HIPS valves to close down the pipeline. The design variables for optimisation of the HIPS sub-system include six house events (refer to Fig. 5.22), which can be summarised in the following criteria:

• what type of valve should be fitted,
• whether high-pressure valve type 1 should be fitted, or not,
• whether high-pressure valve type 2 should be fitted, or not.

The house events in the fault tree represent the following conditions:

H1 – HIPS valve 1 fitted
NH1 – HIPS valve 1 not fitted
H2 – HIPS valve 2 fitted
NH2 – HIPS valve 2 not fitted
V1 – Valve type 1 selected
V2 – Valve type 2 selected.

Considering first the bottom left-hand branch in Fig. 5.22, which represents 'HIPS valve 1 fails stuck', this event will depend on which type of valve has been selected in the design.

Fig. 5.22 Fault tree of dormant failure of a high-integrity protection system (HIPS; Andrews 1994)
If type 1 has been fitted, then V1 is set to TRUE. If type 2 is fitted, then V2 is set to TRUE. This provides the correct causes of the event being developed, as a function of which valve is fitted. One of either V1 or V2 must be set. Furthermore, if no HIPS option is included in the system design, then house events NH1 and NH2 will both be set (i.e. TRUE). Once these events are set, the output event from the OR gates into which they feed will also be true. At the next level up in the tree structure, both inputs to the AND gate will have occurred and, therefore, the HIPS system will not provide protection. Where HIPS valves are fitted, the appropriate house events NH1 or NH2 will be set to FALSE, requiring a component failure event to render the HIPS sub-system inactive. By using house events in this manner, all design options can be represented in a single fault tree.

Another fault tree can be constructed using the same technique to represent the causes of spurious system failure for each potential design. The fault trees are then analysed to obtain numerical estimates of the partial derivatives of system performance with respect to each design variable. This information is required to produce the objective function coefficients in the design optimisation problem. The objective function is then derived by truncating the Taylor series at the linear term of the gradient vector, g, ignoring the quadratic term of the Hessian matrix. This truncation means that a finite number of terms provides an approximation of the objective function, with a valid representation of Q_{system} only within the neighbourhood of the target design variables. Additional constraints are therefore included to restrict the solution space to the neighbourhood of the design's specific target variables. The objective function is then evaluated in the restricted design space, and the optimal design selected.
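The house-event mechanism can be illustrated with a minimal Boolean evaluation of the HIPS branch logic described above. The tree structure below is a simplified reading of Fig. 5.22 with hypothetical basic-event names, not a reproduction of the published tree.

```python
def hips_fails(house, basic):
    """Top event: HIPS fails to stop a high-pressure surge.  Each redundant
    valve branch fails if that valve is not fitted (NHi set TRUE) or the
    fitted valve sticks; which 'sticks' cause applies depends on the valve
    type selected (house events V1 / V2).  Simplified view of Fig. 5.22."""
    def valve_branch(fitted_not, stuck_t1, stuck_t2):
        stuck = ((house["V1"] and basic[stuck_t1]) or
                 (house["V2"] and basic[stuck_t2]))
        return house[fitted_not] or stuck          # OR gate
    return (valve_branch("NH1", "v1_t1_stuck", "v1_t2_stuck") and
            valve_branch("NH2", "v2_t1_stuck", "v2_t2_stuck"))  # AND gate

basic = {"v1_t1_stuck": False, "v1_t2_stuck": False,
         "v2_t1_stuck": False, "v2_t2_stuck": False}

# Design option 1: no HIPS valves fitted -> the top event is TRUE regardless
# of component states (no protection available).
no_hips = {"NH1": True, "NH2": True, "V1": False, "V2": False}
print(hips_fails(no_hips, basic))

# Design option 2: both valves fitted, type 1 selected -> the top event now
# requires both fitted valves to stick.
both = {"NH1": False, "NH2": False, "V1": True, "V2": False}
print(hips_fails(both, basic))
print(hips_fails(both, {**basic, "v1_t1_stuck": True, "v2_t1_stuck": True}))
```

Switching the house-event dictionary is all that is needed to move between candidate designs, which is exactly why a single tree with house events can replace a family of manually drawn trees.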
5.2.3.3 Common Cause Failures in Root Cause Analysis

The concept of multiple failures arising from a common cause was first studied on a formal basis during the application of root cause analysis in the nuclear power industry. In order to obtain sufficiently high levels of reliability and safety in critical risk control circuits, redundancy was introduced. In applying redundancy, several items can be used in parallel with only one required to be in working order. Although this approach increases system reliability, it leads to large increases in false alarms, measured in what is termed the false alarm rate (FAR). This is overcome, however, by utilising a concept termed voting redundancy; in its simplest arrangement, this is two out of three, where the circuit function is retained if two or three items are in working order. This not only improves reliability and safety but also reduces the FAR. Voting redundancy has the added advantage that a system can tolerate the failure of some items in a redundant set, allowing failed items to be taken out of service for repair or replacement (electronic control components such as sensors, circuit boards, etc. are usually replaced).

a) Defining CMF and CCF

It has become evident from practical experience in the process industry that, in many cases, the levels of reliability and safety actually being obtained have fallen short of the predicted design values. This is due largely to common root causes leading to the concurrent failure of several items. The concept of common mode failures (CMF) was developed from studies into this problem. It was subsequently recognised that multiple failures could arise from common weaknesses, where a particular item (assembly and/or component) was used in various locations on a plant.
Furthermore, the term common cause failure (CCF) was applied to the root causes of these failure modes, manifested not only at the parts and component level but also including the effects of the working environment on the item, e.g. the effects from the assembly, sub-system and system levels, as well as the process and environmental conditions. Consequently, the term dependent failure was used to include both CMF and CCF, although CMF is, in effect, a subset of CCF. Many terms relating to the integrity of systems and components, especially reliability and maintainability, were formally defined and included in a range of military standards. However, it took some time before CMF and CCF were formally defined in the nuclear energy industry.

The UK Atomic Energy Authority (AEA) has defined CMF as follows (Edwards et al. 1979): "A common-mode failure (CMF) is the result of an event which, because of dependencies, causes a coincidence of failure states of components in two or more separate channels of a redundancy system, leading to the defined system failing to perform its intended function".

The UK Atomic Energy Authority has also defined CCF as follows (Watson 1981): "A common-cause failure is the inability of multiple first in line items to perform as required in a defined critical time period, due to a single underlying defect or physical phenomena, such that the end effect is judged to be a loss of one or more systems".

CCF can arise from both engineering and operational causes:

• Engineering causes can be related to the engineering design as well as the manufacturing, installation and construction stages. Of these, engineering design covers the execution of the design requirement and functional deficiencies, while the manufacturing, installation and construction stages cover the activities of fabrication and inspection, packaging, handling and transportation, and installation and/or construction. Plant commissioning is often also included in the engineering causes.

• Operational causes can be separated into procedural causes and environmental effects. The procedural causes cover all aspects of maintenance and operation of the equipment, while environmental causes are quite diverse in that they include not only conditions within the process (influenced partly by the process parameters and the materials handled in the process) but also external environmental conditions such as climatic conditions, and extreme events such as fire, floods, earthquakes, etc.

Typical examples of actual causes of CCF are (Andrews et al. 1993):

• Identical manufacturing defects in similar components.
• Maintenance errors made by the same maintenance crews.
• Operational errors made by the same operating crews.
• Components in the same location subject to the same stresses.

Since the earliest applications of CCF analysis, two methods have been extensively used to allow for such events: the cut-off probability method and the beta factor method. The cut-off probability method proposes limiting values of failure probability to account for the effect of CCF. The basis of this is the assumption that, because of CCF, system reliability can never exceed an upper limit determined by the configuration of the system. These upper levels of systems unreliability were generically given as shown in Table 5.12 (Bourne et al. 1981).

Table 5.12 Upper levels of systems unreliability due to CCF

Systems configuration        Minimum failure probability
Single instrument            10^-2
Redundant system             10^-3
Partially diverse system     10^-4
Fully diverse system         10^-5
Two diverse systems          10^-6

The beta factor method assumes that a proportion, β, of the total failure rate of a component arises from CCF. It follows, therefore, that the proportion (1 − β) arises from independent failures.
This can be expressed as

\lambda_t = \lambda_i + \lambda_{ccf}   (5.64)

where:
\lambda_t = the total failure rate
\lambda_i = the independent failure rate
\lambda_{ccf} = the common cause failure rate.

From this equation it follows that

\lambda_{ccf} = \beta \cdot \lambda_t \quad \text{and} \quad \lambda_i = (1 - \beta) \cdot \lambda_t   (5.65)

The results from the beta factor method must, however, be considered with some pessimism, because they need to be modified for higher levels of redundancy than the simple one-out-of-two case. Although in theory CCF can occur, it does not follow that it will. The probability of failure of all three items of a two-out-of-three redundancy system due to CCF is likely to be lower than the probability of two failing (Andrews et al. 1993). The cut-off method is thus extensively used where there are no relevant field data, or where any available database is inadequate, and serves as a suitable guide in the preliminary design phase for determining the limiting values of failure probability to account for the effect of CCF. It is also quite usual in such circumstances to use the beta factor method, but this requires engineering judgment on the appropriate values for beta—in itself probably no more accurate than using the cut-off method. A combination of both methods in the assessment of reliability and safety due to CCF in engineering design is best suited for application by expert judgment.
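A minimal numerical sketch of the beta factor split in Eqs. 5.64 and 5.65 follows. The failure rate, beta value and proof-test interval are assumed for illustration, and the one-out-of-two unavailability expressions are standard fractional-dead-time approximations for periodically proof-tested channels, not formulas taken from the text.

```python
# Beta factor method (Eqs. 5.64 and 5.65): a fraction beta of the total
# component failure rate is attributed to common cause failure.
lam_total = 1.0e-5      # total failure rate per hour (assumed value)
beta = 0.1              # assumed CCF fraction
tau = 4380.0            # proof-test interval in hours (assumed value)

lam_ccf = beta * lam_total          # Eq. 5.65: common cause part
lam_ind = (1 - beta) * lam_total    # Eq. 5.65: independent part
assert abs((lam_ind + lam_ccf) - lam_total) < 1e-20   # Eq. 5.64 holds

# Mean fractional dead time of a 1-out-of-2 redundant pair with periodic
# proof testing (standard approximations): coincident independent failures
# plus the common cause contribution.
pfd_independent = (lam_ind * tau) ** 2 / 3.0
pfd_ccf = lam_ccf * tau / 2.0
print(f"{pfd_independent:.2e}  {pfd_ccf:.2e}")
```

On these assumed numbers the common cause term dominates the coincident-independent term, illustrating why a redundancy prediction that ignores CCF is optimistic, and why the predicted reliability of redundant systems falls short in practice.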