3.2 Theoretical Overview of Reliability and Performance in Engineering Design

Table 3.7 Failure mode effect severity classifications

Item no.  Classification    Description
1         Catastrophic (A)  The occurrence of failure may result in death or equipment loss
2         Critical (B)      The occurrence of failure may result in severe injury or major system damage leading to loss
3         Marginal (C)      The occurrence of failure may result in minor injury or minor system damage leading to loss
4         Minor (D)         The failure is not serious enough to lead to injury or system damage, but it will result in repair or in unscheduled maintenance

Table 3.8 Qualitative failure probability levels

Item  Probability level  Term                 Description
1     I                  Frequent             High probability of occurrence during the item operational period
2     II                 Reasonably probable  Moderate probability of occurrence during the item operational period
3     III                Occasional           Occasional probability of occurrence during the item operational period
4     IV                 Remote               Unlikely probability of occurrence during the item operational period
5     V                  Extremely unlikely   Zero chance of occurrence during the item operational period

Fig. 3.17 Criticality matrix (Dhillon 1999)

Table 3.9 Failure effect probability guideline values

Item no.  Failure effect description  Probability value of F
1         No effect                   0
2         Actual loss                 1.0
3         Probable loss               0.10 < F < 1.00
4         Possible loss               0 < F < 0.10

where:
$K_{fm}$ = the failure mode criticality number.
$\theta$ = the failure mode ratio, or the probability that a component will fail in the particular failure mode of interest. More specifically, it is the fraction of the component failure rate that can be allocated to the failure mode under consideration. When all failure modes of a component are specified, the sum of the allocations equals unity.
$F$ = the conditional probability that the failure effect results in the indicated severity classification or category, given that the failure mode occurs.
The values of $F$ are based on an analyst’s judgment, and these values are quantified according to Table 3.9.
$T$ = the operational time expressed in hours or cycles.
$\lambda$ = the component failure rate.

The item criticality number $K_i$ is calculated separately for each severity class. Thus, the total of the criticality numbers of all the failure modes of a component in the severity class of interest is given by the summation of the variables of Eq. (3.20), as indicated in

$$K_i = \sum_{j=1}^{n} (K_{fm})_j = \sum_{j=1}^{n} (F \theta \lambda T)_j , \tag{3.21}$$

where $n$ is the number of item failure modes that fall under the severity classification under consideration.

When a component’s failure mode results in multiple severity class effects, each with its own occurrence probability, then only the most important is used in the calculation of the criticality number $K_i$ (Agarwala 1990). This can lead to erroneously low $K_i$ values for the less critical severity categories. In order to rectify this error, it is recommended to compute $F$ values for all severity categories associated with a failure mode, and ultimately include only contributions of $K_i$ for category B, C and D failures (Bowles et al. 1994).

c) FMECA Data Sources and Users

Design-related information required for the FMECA includes system schematics, functional block diagrams, equipment detail drawings, pipe and instrument diagrams (P&IDs), design descriptions, relevant specifications, reliability data, available field service data, effects of operational and environmental stress, configuration management data, operating specifications and limits, and interface specifications.
Usually, an FMECA satisfies the needs of many groups during the engineering design process, including not only the different engineering disciplines but also quality assurance, reliability and maintainability specialists, systems engineering, logistics support, system safety, various regulatory agencies, and manufacturing contractors. Some specific FMECA-related factors and their corresponding data retrieval sources are given as follows (Bowles et al. 1994).

FMECA-related factors and their corresponding data sources:
• Failure modes, causes and rates (manufacturer’s database, field experience).
• Failure effects (design engineer, reliability engineer, safety engineer).
• Item identification numbers (parts list).
• Failure detection method (design engineer, maintenance engineer).
• Function (client requirements, design engineer).
• Failure probability/severity classification (safety engineer).
• Item nomenclature/functional specifications (parts list, design engineer).
• Mission phase/operational mode (design engineer).

The FMEA worksheet (Moss et al. 1996) is tabular in format to provide a systematic approach to the analysis. The column headings of a standard FMEA worksheet generally are:
• Item identity/description: a unique identification code and description of each item.
• Function: a brief description of the function performed by the item.
• Failure mode: each item failure mode is listed separately, as there may be several for an item.
• Possible causes: the likely causes of each postulated failure mode.
• Failure detection method: features of the design through which failure can be recognised.
• Failure effect (local level): the effect of the failure on the item’s function.
• Compensating provisions: provisions that could mitigate the effect of the failure.
• Remarks: comments on the effect of failure, including any potential design changes.
FMEA extension into FMECA worksheet

If the analysis is extended to quantify the severity and probability of failure (or failure rate) of the equipment, as defined in a failure modes and effects criticality analysis (FMECA), further columns are added to the FMEA worksheet, such as:

Failure consequence (system level): the consequences of the failure mode on system operation.

Severity: the level of severity of the consequence of each failure mode, classified as:
Level 1: minor, with no consequence on functional performance
Level 2: major, with degradation of system functional performance
Level 3: critical, with a severe reduction in the performance of system function resulting in a change in the system operational state
Level 4: catastrophic, with complete loss of system function.

Loss frequency: the expected frequency of loss resulting from each failure mode, either as a failure rate or as a failure probability. The latter is usually estimated for the operating time interval as a proportion of the overall system failure rate or failure probability (FP). The levels generally employed for processes are:
i) Very low probability: <0.01 FP
ii) Low probability: 0.01–0.1 FP
iii) Medium probability: 0.1–0.2 FP
iv) High probability: >0.2 FP

Component failure rate $\lambda_p$: the overall failure rate of the component in its operational mode and environment. Where appropriate, application and environmental factors may be applied to adjust for the difference between the conditions associated with the generic failure rate data and the operating stresses under which the item is to be used.

Failure mode proportion $\alpha$: the fraction of the overall failure rate related to the failure mode under consideration.

Probability of failure consequence $\beta$: the conditional probability that a failure consequence occurs.

Operational failure rate $\lambda_o$: the product of $\lambda_p$, $\alpha$ and $\beta$.

Data source: the source of the failure rate (or failure probability) data.
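The quantitative worksheet columns lend themselves to a small calculation. The following is a minimal sketch, not a prescribed procedure: it computes the operational failure rate $\lambda_o = \lambda_p \alpha \beta$ for each failure mode and then, treating $\lambda_p$, $\alpha$ and $\beta$ as playing the roles of $\lambda$, $\theta$ and $F$ in Eq. (3.21) (an assumed reading of the two notations), sums $\lambda_o T$ per severity level. The worksheet rows, the failure mode identifiers and the 1000-hour operating time are hypothetical values chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical worksheet rows (illustrative values only):
# (failure mode id, severity level 1-4, lambda_p per hour, alpha, beta)
WORKSHEET = [
    ("FM-01", 4, 2.0e-5, 0.6, 1.0),
    ("FM-02", 2, 2.0e-5, 0.4, 0.5),
    ("FM-03", 4, 5.0e-6, 1.0, 0.1),
]


def operational_failure_rate(lambda_p, alpha, beta):
    """Operational failure rate: the product of lambda_p, alpha and beta."""
    return lambda_p * alpha * beta


def criticality_by_severity(rows, hours):
    """Sum lambda_o * T over the failure modes in each severity level.

    Assumes lambda_p, alpha, beta correspond to lambda, theta, F of Eq. (3.21).
    """
    totals = defaultdict(float)
    for _mode, severity, lambda_p, alpha, beta in rows:
        totals[severity] += operational_failure_rate(lambda_p, alpha, beta) * hours
    return dict(totals)


totals = criticality_by_severity(WORKSHEET, hours=1000.0)
for severity in sorted(totals, reverse=True):
    print(f"severity level {severity}: criticality number {totals[severity]:.4g}")
```

Keeping the sum per severity level mirrors the statement that the item criticality number is calculated separately for each severity class.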
For FMECAs, a criticality matrix is constructed that relates loss frequency to severity for each failure mode. Failure mode identification numbers are entered in the appropriate cell of the matrix according to their loss frequency and severity, to identify each critical item failure mode. Thus:

Criticality = Severity × Loss frequency,

or:

Criticality = Severity × Operational failure rate.

3.2.2.6 Fault-Tree Analysis in Reliability Assessment

There are two approaches that can be used to analyse the causal relationships between equipment and system failures (Moss et al. 1996). These are inductive or forward analysis, and deductive or backward analysis. FMEA is an example of inductive analysis. As previously considered, it starts with a set of equipment failure conditions and proceeds forwards, identifying the possible consequences; this is a ‘what happens if’ approach. Fault-tree analysis is a deductive ‘what can cause this’ approach, and is used to identify the causal relationships leading to a specific system failure mode, the ‘top event’.

The fault tree is developed from this top, undesired event, in branches showing the different event paths. Equipment failure events represented in the tree are progressively redefined in terms of lower-level events until the basic events are reached, for which substantial failure data must be available. The events are combined logically by use of gate symbols as shown in Fig. 3.18, which illustrates the structure of a typical fault tree. In this case, the basic event combinations are developed that could result in total loss of output from a simple cooling water system. Using this failure logic diagram, the probability of the top event or the top event frequency can then be calculated by providing information on the basic event probabilities.
The top event and the system boundary must be chosen with care so that the analysis is neither too broad nor too narrow to produce the results required. The specification of the system boundary is particularly important to the success of the analysis. Many cooling water systems have external power supplies and other services such as a water supply. It would not be practical to trace all possible causes of failure of these services back through the distribution and generation systems, nor would this extra detail provide any useful information concerning the system being assessed. The location of the external boundary will be partially decided by the aspect of system performance that is of interest; however, it is also important to define the external boundary in the time domain. Process start-up or shutdown conditions can generate different hazards from steady-state operation, and it may be necessary to trace any possible faults that could occur.

Fig. 3.18 Simple fault tree of cooling water system (top event: total loss of output; intermediate events: filter failure, pump failure, valve failure; pump failure developed into failure of power supply and failure of both pumps, the latter into failure of pump A and failure of pump B)

In Fig. 3.18, failure of both pump A and pump B, or failure of the power supply, results in overall pump failure; pump failure, filter failure or valve failure in turn results in total loss of output of the cooling water system. This approach is clearly depicted in the structure of the fault tree of Fig. 3.18, in that the basic events are combined in an event hierarchy, from the lower component/sub-assembly levels to the higher assembly/systems levels of the cooling water system’s systems breakdown structure (SBS).
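The gate logic described for Fig. 3.18 can be sketched in a few lines of code. The example below is illustrative only: it assumes that the basic events are statistically independent, that pump failure is an OR of power supply failure and the failure of both pumps, and that the failure of both pumps is an AND of the two individual pump failures, as the description of the figure suggests. The basic event probabilities are hypothetical and are not taken from the source.

```python
def p_or(*probs):
    """OR gate for independent events: 1 - product of (1 - p)."""
    none_occur = 1.0
    for p in probs:
        none_occur *= 1.0 - p
    return 1.0 - none_occur


def p_and(*probs):
    """AND gate for independent events: product of the probabilities."""
    all_occur = 1.0
    for p in probs:
        all_occur *= p
    return all_occur


# Hypothetical basic event probabilities (illustrative values only).
BASIC = {
    "filter": 0.01,
    "valve": 0.02,
    "power": 0.005,
    "pump_a": 0.1,
    "pump_b": 0.1,
}


def top_event_probability(b):
    """Probability of total loss of output for the assumed tree structure."""
    both_pumps = p_and(b["pump_a"], b["pump_b"])
    pump_failure = p_or(b["power"], both_pumps)
    return p_or(b["filter"], pump_failure, b["valve"])


print(f"P(total loss of output) = {top_event_probability(BASIC):.6f}")
```

With these numbers the redundant pump pair contributes no more than the single filter, which illustrates why redundancy appears as an AND gate in the tree.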
a) Fault-Tree Analysis Steps

The detailed steps required to perform a fault-tree analysis within the reliability assessment procedure for equipment design can be summarised as follows (Andrews et al. 1993):
• Step 1: System configuration understanding.
• Step 2: Identification of system failure states.
• Step 3: Logic model generation.
• Step 4: Qualitative evaluation of the logic model.
• Step 5: Equipment failure analysis.
• Step 6: Quantitative evaluation of the logic model.
• Step 7: Uncertainty analysis.
• Step 8: Sensitivity/importance analysis.

Many of these steps are the same, whatever system and/or equipment is being analysed, though there are some aspects that require special attention, particularly the systems interface when mechanical and electrical equipment is involved. Once the first four steps have been conducted, a qualitative evaluation of the fault-tree logical model is necessary to review whether system configuration and system failure states are correctly understood. The minimal cut sets (combinations of equipment failures that provide the necessary and sufficient conditions for system failure) are then produced.

To progress even further with reliability assessment using fault-tree analysis, the probability of equipment failure, $q(t)$, may be determined together with equipment maintainability in the form of a repair rate

$$q(t) = \frac{\lambda}{\lambda + \nu}\left(1 - e^{-(\lambda + \nu)t}\right) . \tag{3.22}$$

Equation (3.22) is for revealed failures, where $\lambda$ is the failure rate and $\nu$ the repair rate. Equation (3.23) is for unrevealed failures, where $q_{AV}$ is the average unavailability, $\tau$ is the mean time to repair, and $\theta$ is the test interval

$$q_{AV} = \lambda\left(\tau + \theta/2\right) . \tag{3.23}$$

For safety systems that are normally inactive, failures are revealed only during test or actual use, which means that the unrevealed failure model is appropriate for these systems.
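As a quick numerical illustration of Eqs. (3.22) and (3.23), the sketch below evaluates both models directly. The function names and the parameter values (failure rate, repair rate, mean time to repair and test interval, all in hours or per hour) are hypothetical and chosen only to show the orders of magnitude involved.

```python
import math


def failure_probability_revealed(lam, nu, t):
    """Eq. (3.22): q(t) for revealed failures, constant rates lam and nu."""
    return lam / (lam + nu) * (1.0 - math.exp(-(lam + nu) * t))


def average_unavailability_unrevealed(lam, tau, theta):
    """Eq. (3.23): q_AV for unrevealed failures.

    tau is the mean time to repair, theta the test interval.
    """
    return lam * (tau + theta / 2.0)


# Hypothetical values: one failure per 10,000 h, 10 h repairs, monthly tests.
lam = 1.0e-4
nu = 0.1
q_1000h = failure_probability_revealed(lam, nu, t=1000.0)
q_av = average_unavailability_unrevealed(lam, tau=10.0, theta=720.0)
print(f"revealed failures, q(1000 h) = {q_1000h:.3e}")
print(f"unrevealed failures, q_AV    = {q_av:.3e}")
```

For large $t$ the revealed-failure result approaches $\lambda/(\lambda+\nu)$, whereas the unrevealed-failure result is dominated by the test interval whenever $\theta/2$ is much larger than $\tau$.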
However, the underlying assumption in both of these models is that the failure and repair rates are constant, giving a negative exponential distribution for the probability of failure (repair) prior to time $t$. Constant failure rates are associated with random failure events, as indicated by the useful life period of the hazard rate curve, considered in detail in Section 3.2.3. However, mechanical equipment subject to wear, corrosion, fatigue, etc. may in many cases not conform to this assumption (Andrews et al. 1993).

When either the failure or repair rates are not constant, and the probability density functions for the times to failure $f(t)$ and repair $g(t)$ are available, then they can be combined to give the unconditional failure intensity $w(t)$ and unconditional repair intensity $\nu(t)$ by solving the following simultaneous integral equations

$$w(t) = f(t) + \int_0^t f(t-u)\,\nu(u)\,du , \tag{3.24}$$

$$\nu(t) = \int_0^t g(t-u)\,w(u)\,du . \tag{3.25}$$

Having solved these equations, the equipment failure probability is then given by

$$q(t) = \int_0^t \left[w(u) - \nu(u)\right]du . \tag{3.26}$$

For the case of constant failure and repair rates, the probability density functions for the times to failure and repair are given as

$$f(t) = \lambda e^{-\lambda t} , \tag{3.27}$$

$$g(t) = \nu e^{-\nu t} . \tag{3.28}$$

Equations (3.24) and (3.25) can then be solved by Laplace transforms. Substituting the solution obtained into Eq. (3.26) yields Eq. (3.22). For more complex distributions of failure and repair times, numerical solutions may be required.

With the equipment failure data produced at Step 5, fault-tree quantification gives the system failure probability, the system failure rate, and the expected number of system failures. Where failure and repair distributions have been specified for the analysis, confidence intervals can be determined at Step 7. Step 8 produces the importance rankings for the basic events, identifying the equipment that provides the most significant contribution to system failure.
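Where a numerical solution of Eqs. (3.24) to (3.26) is needed, one possible approach (an assumption of this sketch, not a method given in the source) is to discretise the convolution integrals on a uniform time grid with the trapezoidal rule. The example uses the exponential densities of Eqs. (3.27) and (3.28) with hypothetical rates, so that the result can be compared with the closed form of Eq. (3.22). In the code, `v` stands for the unconditional repair intensity $\nu(t)$.

```python
import math


def solve_failure_probability(f, g, t_end, steps):
    """Solve Eqs. (3.24)-(3.26) on a uniform grid with the trapezoidal rule.

    f, g: p.d.f.s of the times to failure and to repair (callables).
    Returns the list of q(t_k) for t_k = k * h, k = 0..steps.
    """
    h = t_end / steps
    fv = [f(k * h) for k in range(steps + 1)]
    gv = [g(k * h) for k in range(steps + 1)]
    w = [fv[0]]  # w(0) = f(0): the convolution integral vanishes at t = 0
    v = [0.0]    # v(0) = 0
    q = [0.0]
    for k in range(1, steps + 1):
        # Parts of the two trapezoidal convolution sums that are already known.
        conv_fv = 0.5 * fv[k] * v[0] + sum(fv[k - j] * v[j] for j in range(1, k))
        conv_gw = 0.5 * gv[k] * w[0] + sum(gv[k - j] * w[j] for j in range(1, k))
        a_known = fv[k] + h * conv_fv
        b_known = h * conv_gw
        # The end-point terms couple w_k and v_k: solve the 2x2 linear system
        # w_k = a_known + a * v_k and v_k = b_known + b * w_k.
        a = 0.5 * h * fv[0]
        b = 0.5 * h * gv[0]
        w_k = (a_known + a * b_known) / (1.0 - a * b)
        v_k = b_known + b * w_k
        w.append(w_k)
        v.append(v_k)
        # Eq. (3.26): accumulate the integral of w - v with the trapezoidal rule.
        q.append(q[-1] + 0.5 * h * ((w[k - 1] - v[k - 1]) + (w_k - v_k)))
    return q


# Hypothetical constant rates, used only to check against Eq. (3.22).
lam, nu = 0.01, 0.1
q = solve_failure_probability(
    lambda t: lam * math.exp(-lam * t),
    lambda t: nu * math.exp(-nu * t),
    t_end=50.0,
    steps=500,
)
closed_form = lam / (lam + nu) * (1.0 - math.exp(-(lam + nu) * 50.0))
print(f"numerical q(50) = {q[-1]:.6f}, Eq. (3.22) gives {closed_form:.6f}")
```

The same routine accepts any other pair of densities, for example a wear-out distribution for the time to failure, although the step size would then need to be checked for convergence.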
Fault trees in reliability assessments of integrated engineering systems are significantly more complex than that illustrated in Fig. 3.18. With complex engineering designs, fault-tree methodology includes the concepts of availability and maintainability. This is considered in greater detail in Chapter 4, Availability and Maintainability in Engineering Design.

b) Fault-Tree Analysis and Safety and Risk Assessment

The main use of fault trees in designing for reliability is in safety and risk studies. Fault trees provide a useful representation of the different failure paths, and this can lead to safety and risk assessments of systems and processes even without considering failure and repair data, which does cause some difficulties (Moss et al. 1996). In many cases, fault trees and failure mode and effect analysis (FMEA) are employed in combination: the FMEA to define the effects and consequences of specific equipment failures, and the fault tree (or several fault trees) to identify and quantify the paths that lead to equipment failure and to high safety risks.

3.2.3 Theoretical Overview of Reliability Evaluation in Detail Design

Reliability evaluation determines the reliability and criticality values for each individual item of equipment at the lower systems levels of the systems breakdown structure. Reliability evaluation determines the failure rates and failure rate patterns of components, not only for functional failures that occur at random intervals but for wear-out failures as well. Reliability evaluation is considered in the detail design phase of the engineering design process, to the extent of determining the frequencies with which failures occur over a specified period of time, based on component failure rates.

The most applicable methodology for reliability evaluation in the detail design phase includes basic concepts of mathematical modelling such as:
• The hazard rate function.
(To represent the failure rate pattern of a component by evaluating the ratio between its failure probability density and its reliability function.)
• The exponential failure distribution.
(To define the probability of failure and the reliability function of a component when it is subject only to functional failures that occur at random intervals.)
• The Weibull failure distribution.
(To determine component criticality for wear-out failures, rather than random failures.)
• Two-state device reliability networks.
(A component is said to have two states if it either operates or fails.)
• Three-state device reliability networks.
(A three-state component has one operational state and two failure states.)

3.2.3.1 The Hazard Rate Function

The hazard rate function represents the failure rate pattern as the ratio between a particular probability density function (p.d.f.) and its reliability function, which is the complement of its cumulative distribution function (c.d.f.). For continuous random variables, the cumulative distribution function is defined by

$$F(t) = \int_{-\infty}^{t} f(x)\,dx , \tag{3.29}$$

where:
$f(x)$ = the probability density function of the distribution of value $x$ over the interval $-\infty$ to $t$.

In the case where $t \to \infty$, the cumulative distribution function is unity

$$F(\infty) = \int_{-\infty}^{\infty} f(x)\,dx = 1 . \tag{3.30}$$

The probability density function is the derivative of the cumulative distribution function, as follows

$$\frac{dF(t)}{dt} = \frac{d}{dt}\left[\int_{-\infty}^{t} f(x)\,dx\right] = f(t) . \tag{3.31}$$

The reliability function over a period of time $t$ is the difference between the cumulative distribution function where $t \to \infty$ and the cumulative distribution function in the period of time $t$ or, alternatively, it is the cumulative distribution function of failure over a period of time $t$ subtracted from unity

$$R(t) = 1 - F(t) . \tag{3.32}$$

The hazard rate function is then defined as

$$\lambda(t) = \frac{f(t)}{R(t)} \tag{3.33}$$

or

$$\lambda(t) = \frac{f(t)}{1 - F(t)} .$$
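The definition in Eq. (3.33) can be checked numerically. The sketch below is illustrative only: it evaluates $\lambda(t) = f(t)/(1 - F(t))$ for an exponential distribution and for a Weibull distribution. The two-parameter Weibull form (shape and scale) used here is the standard textbook form and is an assumption of this example, as are all parameter values and function names.

```python
import math


def hazard_rate(pdf, cdf, t):
    """Eq. (3.33): lambda(t) = f(t) / (1 - F(t))."""
    return pdf(t) / (1.0 - cdf(t))


def exponential(lam):
    """Return (pdf, cdf) of the exponential distribution with rate lam."""
    return (
        lambda t: lam * math.exp(-lam * t),
        lambda t: 1.0 - math.exp(-lam * t),
    )


def weibull(shape, scale):
    """Return (pdf, cdf) of the two-parameter Weibull distribution (t > 0)."""
    def pdf(t):
        x = t / scale
        return (shape / scale) * x ** (shape - 1.0) * math.exp(-(x ** shape))

    def cdf(t):
        return 1.0 - math.exp(-((t / scale) ** shape))

    return pdf, cdf


# Hypothetical parameters, chosen only to show the three hazard rate patterns.
cases = {
    "exponential, rate 0.01": exponential(0.01),
    "Weibull, shape 0.5": weibull(0.5, 100.0),
    "Weibull, shape 2.0": weibull(2.0, 100.0),
}
for name, (pdf, cdf) in cases.items():
    rates = [hazard_rate(pdf, cdf, t) for t in (10.0, 50.0, 90.0)]
    print(name, [f"{r:.5f}" for r in rates])
```

Under these assumptions the exponential case returns a constant hazard rate, a shape parameter below one gives a decreasing hazard rate, and a shape parameter above one gives an increasing hazard rate.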
Thus, the hazard rate function can be used to represent the hazard rate curve of several different probability density functions, particularly the exponential or Poisson function, in which $\lambda(t)$ is a constant, and the Weibull function, in which $\lambda(t)$ is either decreasing or increasing.

a) Review of the Hazard Rate Curve

A hazard rate curve is shown in Fig. 3.19. This curve is used to represent the failure rate pattern of equipment (i.e. assemblies and predominantly components; EPRI 1974). Failure rate representation of electronic components is a prime example, in which case only the middle portion (useful life period), or the constant failure rate region of the curve, is considered.

As can be seen in Fig. 3.19, the hazard rate curve may be divided into three distinct regions or parts (i.e. decreasing, constant, and increasing hazard rate). The decreasing hazard rate region of the curve is designated the ‘burn-in period’, or ‘infant mortality period’. The ‘burn-in period’ failures, known as ‘early failures’, are the result of design, manufacturing or construction defects in new equipment. As the ‘burn-in period’ progresses, equipment failures decrease, until the beginning of the constant failure rate region, which is the middle portion of the curve and is designated the ‘useful life period’ of equipment. Failures occurring during the ‘useful life period’ are known as ‘random failures’ because they occur unpredictably. This period starts from the end of the ‘burn-in period’ and finishes at the beginning of the ‘wear-out phase’.

Fig. 3.19 Failure hazard curve (life characteristic curve or risk profile)