5 Safety and Risk in Engineering Design

Table 5.18 (continued)

| Component | Failure description | Failure mode | Failure causes | Defect. MATL & LAB ($)/failure (incl. damage) | Econ. $/failure (prod. loss) | Total $/failure (prod. and repair) | Risk | Cost criticality rating |
|---|---|---|---|---|---|---|---|---|
| Instrument loop (press. 2) | Fails to detect low pressure condition | TLF | Low pressure switch fails due to corrosion or mechanical damage | $10,000 | $0 | $10,000 | 2.00 | Low cost |
| Instrument loop (press. 2) | Fails to detect low pressure condition | TLF | Pressure switch relay or cabling failure | $10,000 | $0 | $10,000 | 2.00 | Low cost |
| Instrument loop (press. 2) | Fails to provide output signal for alarm | TLF | PLC alarm function or indicator fails | $10,000 | $0 | $10,000 | 2.00 | Low cost |

5.2 Theoretical Overview of Safety and Risk in Engineering Design

Table 5.19 FMECA for process and cost criticality

| Component | Failure description | Failure mode | Failure consequences | Total $/failure (prod. and repair) | Cost risk | MTBF (months) | Process criticality rating | Cost criticality rating | Maintenance frequency |
|---|---|---|---|---|---|---|---|---|---|
| Control valve | Fails to open | TLF | Production | $73,850 | 6.00 | 12 | Medium criticality | Medium cost | 12 monthly |
| Control valve | Fails to open | TLF | Production | $32,600 | 6.00 | 6 | Medium criticality | Medium cost | 12 monthly |
| Control valve | Fails to seal/close | TLF | Production | $43,250 | 6.00 | 6 | Medium criticality | Medium cost | 6 monthly |
| Control valve | Fails to seal/close | TLF | Production | $43,250 | 6.00 | 4 | Medium criticality | Medium cost | 6 monthly |
| Instrument loop (press. 1) | Fails to provide accurate pressure indication | TLF | Maint. | $500 | 2.00 | 3 | Low criticality | Low cost | 3 monthly |
| Instrument loop (press. 2) | Fails to detect low pressure condition | TLF | Maint. | $10,000 | 2.00 | 3 | Low criticality | Low cost | 3 monthly |
| Instrument loop (press. 2) | Fails to detect low pressure condition | TLF | Maint. | $10,000 | 2.00 | 4 | Low criticality | Low cost | 3 monthly |
| Instrument loop (press. 2) | Fails to provide output signal for alarm | TLF | Maint. | $10,000 | 2.00 | 4 | Low criticality | Low cost | 3 monthly |

qualitative assessment values for the likelihood of occurrence and the impact that the risk may have on costs. Assessment values for risk may be designated as indicated previously, where risk has been defined as the result of multiplying the consequence of the failure mode (i.e. its severity) by the probability of failure (i.e. its likelihood):

Risk (R) = Severity × Probability (or Likelihood)

Severity

The use of qualitative assessment scales for determining the severity of a failure consequence is common in risk analysis, where severity criteria are designated a value ranging from 10 to 1. The most severe consequence is valued at 10 (disabling injury, life risk), whereas no safety risk is valued at 1, or 0, as indicated in the risk assessment scale in Table 5.20.

Likelihood

Many different scales have been developed for determining the likelihood of failure occurrence. One commonly used scale is expressed in terms of 'probability qualifiers' given as:

Actual occurrence = 0.95 to 1.00,
Probable occurrence = 0.50 to 0.95, and
Possible occurrence = less than 0.50.

Criticality

Once an overall total and an overall average value of risk have been assessed according to the risk assessment scale, a criticality rating can be defined for each failure mode, using the following expression:

Criticality (C) = Risk × Failure rate

Failure Rate

If the failure rate for the item cannot be determined from available data, a representative estimation for failure rate in high-corrosive process applications can be used.
This is done by the following qualifying values:

| Qualification | Failure rate (×10⁻⁴) |
|---|---|
| Very low | <100 |
| Low | 100 to 500 |
| Medium | 500 to 1,000 |
| High | 1,000 to 5,000 |
| Very high | >5,000 |

Table 5.20 Risk assessment scale (risk assessment values: degree of severity × probability)

| Estimated degree of safety | Severity criteria | Actual (0.95 to 1.00): Deg. | Prob. | Risk | Probable (0.50 to 0.95): Deg. | Prob. | Risk | Possible (0.01 to 0.05): Deg. | Prob. | Risk |
|---|---|---|---|---|---|---|---|---|---|---|
| (Disabling injury) | Life risk | 10 | | | 10 | | | 10 | | |
| | Loss risk | 9 | | | 9 | | | 9 | | |
| | Health risk | 8 | | | 8 | | | 8 | | |
| (Reported accident) | People risk | 7 | | | 7 | | | 7 | | |
| | Process risk | 6 | | | 6 | | | 6 | | |
| | Product risk | 5 | | | 5 | | | 5 | | |
| (Physical condition) | Damage risk | 4 | | | 4 | | | 4 | | |
| | Defects risk | 3 | | | 3 | | | 3 | | |
| | Loss risk | 2 | | | 2 | | | 2 | | |
| (No safety risk) | | 1 | | | 1 | | | 1 | | |
| Overall risk | | | | Total | | | Total | | | Total |
| Overall average | | | | Average | | | Average | | | Average |

e) Qualitative Criticality Analysis

Qualitative criticality analysis is structured in a failure modes and safety effects (FMSE) analysis, in contrast to the standard FMECA, which is based on failure rates, MTBF and MTTR. The outcome of the FMSE, given in Table 5.21, indicates that the dominant failure modes that are the key shutdown drivers in determining the optimum maintenance frequency are the two control valve failure modes of medium criticality and scheduled frequency of 6 months. All other tasks relating to the control valve can be re-scheduled into this half-yearly shut. This implies that the annual scheduled service of the control valve can be premature with a low risk impact, and the quarterly scheduled checks or component replacements of the pressure instrument loops (pressure gauges and switches) can be delayed with low risk impact. A cost criticality analysis can now be conducted on the basis of the shutdown frequency of 6 months being the estimated likelihood of failure for all the relevant failure modes.
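The risk and criticality expressions used in the FMSE can be sketched in a few lines of Python. This is only an illustrative sketch: the class and field names are hypothetical, and the input values are two control valve rows read from Tables 5.19 and 5.21 (likelihood, severity rating and MTBF in months).

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    description: str
    likelihood: float    # (1) likelihood of occurrence, as a fraction
    severity: int        # (2) severity rating on the 10-to-1 scale
    mtbf_months: float   # mean time between failures, in months

    @property
    def risk(self) -> float:
        # (3) risk = probability x severity
        return self.likelihood * self.severity

    @property
    def failure_rate(self) -> float:
        # (4) failure rate = 1 / MTBF
        return 1.0 / self.mtbf_months

    @property
    def criticality(self) -> float:
        # (5) criticality = risk x failure rate
        return self.risk * self.failure_rate


# Input values read from the control valve rows of Tables 5.19 and 5.21.
modes = [
    FailureMode("Fails to open (solenoid/actuator)", 0.75, 6, 12),
    FailureMode("Fails to seal/close (stem seized)", 1.00, 6, 4),
]

for m in modes:
    print(f"{m.description}: risk={m.risk:.2f}, "
          f"rate={m.failure_rate:.3f}, criticality={m.criticality:.3f}")
```

The first mode evaluates to a criticality of 0.375 with an unrounded failure rate; Table 5.21 lists 0.37 because it multiplies by the rate already rounded to 0.083.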
This approach is repeated for all those items of equipment initially found to be critical items according to a ranking of their consequences of failure. The task seems formidable but, following the Pareto principle (or 80–20 rule), in most cases 80% of cost risk consequences are due to only 20% of all components. Table 5.21 shows the application of qualitative risk assessment in the form of an FMSE for process criticality of the control valve given in Table 5.19.

Table 5.21 Qualitative risk-based FMSE for process criticality, where (1) = likelihood of occurrence (%), (2) = severity of the consequence (rating), (3) = risk (probability × severity), (4) = failure rate (1/MTBF), (5) = criticality (risk × failure rate)

| Component | Failure description | Failure mode | Failure consequences | Failure causes | (1) | (2) | (3) | (4) | (5) | Criticality rating |
|---|---|---|---|---|---|---|---|---|---|---|
| Control valve | Fails to open | TLF | Production | Solenoid valve fails, failed cylinder actuator or air receiver failure | 75% | 6 | 4.50 | 0.083 | 0.37 | Low criticality |
| Control valve | Fails to open | TLF | Production | No PLC output due to module electronic fault or cabling | 75% | 6 | 4.50 | 0.167 | 0.75 | Low criticality |
| Control valve | Fails to seal/close | TLF | Production | Valve disk damaged due to corrosion wear (same causes as 'fails to open') | 100% | 6 | 6.00 | 0.167 | 1.0 | Medium criticality |
| Control valve | Fails to seal/close | TLF | Production | Valve stem cylinders seized due to chemical deposition or corrosion | 100% | 6 | 6.00 | 0.25 | 1.5 | Medium criticality |
| Instrument loop (press. 1) | Fails to provide accurate pressure indication | TLF | Maint. | Restricted sensing port due to blockage of chemical or physical accumulation | 100% | 2 | 2.00 | 0.33 | 0.66 | Low criticality |
| Instrument loop (press. 2) | Fails to detect low pressure condition | TLF | Maint. | Low pressure switch fails due to corrosion or mechanical damage | 100% | 2 | 2.00 | 0.33 | 0.66 | Low criticality |
| Instrument loop (press. 2) | Fails to detect low pressure condition | TLF | Maint. | Pressure switch relay or cabling failure | 75% | 2 | 1.50 | 0.25 | 0.38 | Low criticality |
| Instrument loop (press. 2) | Fails to provide output signal for alarm | TLF | Maint. | PLC alarm function or indicator fails | 100% | 2 | 2.00 | 0.25 | 0.5 | Low criticality |

f) Residual Life Evaluation

Component residual life, in the context of a renewal/replacement process that is typically carried out during scheduled preventive maintenance shutdowns in process plant, is in effect equivalent to the time elapsed between shutdowns. This is, however, not the true residual life of the component based on its reliability characteristics. The difference between the two provides a suitable means of comparison for maintenance optimisation of safety-critical components.

Optimum maintenance intervals are best determined through the method of equipment age analysis, which identifies the rate of component deterioration and potential failure ages. The risk-based maintenance technique of residual life assessment is ideally applied in equipment age analysis where the frequencies of preventive maintenance activities in shutdown programs can be optimised. However, residual life is widely used in modelling stochastic processes during detail engineering design, and is one of the random variables that determines the design requirements for component renewal/replacement; the other being the component age once the process design has progressed beyond the engineered installation stage, and has been in operation for some time.

In reliability theory, residual life appears as the time until the next failure, whereas for the renewal/replacement process it is normally expressed as a mathematical function of conditional reliability in which the residual life is determined from the component age.
The mean residual life or remaining life expectancy function at a specific component age is defined to be the expected remaining life given survival to that age. It is a concept of obvious interest in maintenance optimisation, and most important in process reliability.

g) Failure Probability, Reliability and Residual Life

There are fundamentally two measures of reliability: the failure density function, which quantifies how many components would fail at different time points (i.e. a combination of how many components survive at each point, and the risk of failure in the interval up to the following time point), and the hazard rate, which is the conditional chance of failure, assuming the equipment has survived so far. It is the hazard rate that is essential for decisions about how long equipment can be left in service with a related risk of failure, or whether it should be renewed or replaced. Component failure density in a common series system configuration (or in a complex system reduced to a simple series configuration) is defined by the following function

$$f_i(t) = \lim_{\Delta t \to 0} \frac{\alpha_S(t) - \alpha_S(t + \Delta t)}{\alpha_0 \, \Delta t} \tag{5.85}$$

where:
$f_i(t)$ = the ith component failure density
$\Delta t$ = the time interval
$\alpha_0$ = the total number of components in operation at time $t = 0$
$\alpha_S$ = the number of components surviving at time $t$ or $t + \Delta t$.

The ith component cumulative distribution function (failure probability) is defined by the following expression

$$F_i(t) = \int_0^t f_i(t)\,dt \tag{5.86}$$

and the ith component reliability is defined by:

$$R_i(t) = 1 - F_i(t)$$

Substituting the equation for $F_i(t)$ in the equation for $R_i(t)$ leads to

$$R_i(t) = 1 - \int_0^t f_i(t)\,dt \tag{5.87}$$

However, a commonly used alternative expression for $R_i(t)$ is

$$R_i(t) = e^{-\int_0^t \lambda_i(t)\,dt} \tag{5.88}$$

where:
$\lambda_i(t)$ = the ith component hazard rate or instantaneous failure rate.
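As a rough numerical illustration of Eq. 5.88, the sketch below integrates a hazard rate with the midpoint rule and exponentiates the result. The function names, the choice of a Weibull hazard, and the parameter values (shape 2.0, characteristic life 12 months, constant rate 1/12 per month) are assumptions made for the example, not values from the text.

```python
import math


def reliability(hazard, t, steps=10_000):
    """Evaluate R(t) = exp(-integral_0^t hazard(u) du) by the midpoint rule."""
    if t <= 0:
        return 1.0
    dt = t / steps
    cumulative = sum(hazard((k + 0.5) * dt) for k in range(steps)) * dt
    return math.exp(-cumulative)


def weibull_hazard(beta, eta):
    """Hazard rate of a Weibull distribution (illustrative assumption)."""
    return lambda u: (beta / eta) * (u / eta) ** (beta - 1)


# Hypothetical parameters: shape 2.0, characteristic life 12 months.
beta, eta = 2.0, 12.0
r_numeric = reliability(weibull_hazard(beta, eta), 6.0)
r_closed = math.exp(-((6.0 / eta) ** beta))  # closed-form Weibull reliability
print(r_numeric, r_closed)

# Constant hazard rate: the integral collapses to lam * t.
lam = 1.0 / 12.0
r_const = reliability(lambda u: lam, 6.0)
print(r_const, math.exp(-lam * 6.0))
```

The numerical and closed-form values should agree closely in both cases, which is a simple check that a time-varying hazard rate and a constant one are handled by the same expression.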
With Eq. 5.88, a component failure time can follow any statistical distribution function of which the hazard rate is known. For a constant hazard rate $\lambda_i$, the expression for $R_i(t)$ reduces to

$$R_i(t) = e^{-\lambda_i t} \tag{5.89}$$

The mean time between failures (MTBF) is defined by the following expression

$$\mathrm{MTBF} = \int_0^\infty R(t)\,dt \tag{5.90}$$

For components in series, $R(t) = \prod_{i=1}^{n} R_i(t) = e^{-\left(\sum_{i=1}^{n} \lambda_i\right) t}$. Substituting this expression and integrating gives the model for MTBF; in effect, this is the inverse of the sum of the component hazard rates, or instantaneous failure rates, of all the components in the series

$$\mathrm{MTBF} = \left( \sum_{i=1}^{n} \lambda_i \right)^{-1} \tag{5.91}$$

where:
$\lambda_i$ = the ith component hazard rate or instantaneous failure rate.

Residual life

Let $T$ denote the time to failure. The survival function can then be expressed as

$$R(t) = P(T > t) \tag{5.92}$$

The conditional survival function of a component that has survived (without failure) up to time $x$ can now be formulated as

$$R(t \mid x) = P(T > t + x \mid T > x) = \frac{P(T > t + x)}{P(T > x)} = \frac{R(t + x)}{R(x)} \tag{5.93}$$

$R(t \mid x)$ denotes the probability that a component (of age $x$) will survive an extra time $t$. The mean residual life (MRL) of a component of age $x$ can thus be expressed as

$$\mathrm{MRL}(x) = \int_0^\infty R(t \mid x)\,dt \tag{5.94}$$

If $x = 0$, then the initial age is zero, implying a new item, and $\mathrm{MRL}(0) = \mathrm{MTTF}$, the mean time to fail. The difference between MTBF and MTTF is in their application. Although both are similarly calculated, MTBF is applied to components that are repaired, and MTTF to components that are replaced.

The mean residual life (MRL) function or remaining life expectancy function at age $x$ is defined to be the expected remaining life given survival to age $x$.
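The conditional survival function and Eq. 5.94 can be evaluated numerically, as in the following sketch. It assumes a Weibull survival function with a hypothetical characteristic life of 12 months and truncates the infinite integral at a finite horizon; the function names and parameter values are illustrative only.

```python
import math


def weibull_survival(beta, eta):
    """Survival function R(t) of a Weibull distribution (assumed model)."""
    return lambda t: math.exp(-((t / eta) ** beta))


def mean_residual_life(R, x, horizon, steps=100_000):
    """MRL(x) = integral_0^inf R(t + x) / R(x) dt, truncated at `horizon`."""
    dt = horizon / steps
    area = sum(R(x + (k + 0.5) * dt) for k in range(steps)) * dt
    return area / R(x)


eta = 12.0  # hypothetical characteristic life, in months
results = {}
for beta in (1.0, 2.0):
    R = weibull_survival(beta, eta)
    mttf = mean_residual_life(R, 0.0, horizon=40 * eta)   # MRL(0) = MTTF
    mrl_6 = mean_residual_life(R, 6.0, horizon=40 * eta)  # age 6 months
    results[beta] = (mttf, mrl_6)
    print(f"beta={beta}: MTTF={mttf:.3f}, MRL(6)={mrl_6:.3f}")
```

With shape parameter 1.0 (the exponential case) the residual life at age 6 months equals the MTTF, while with shape parameter 2.0 (an increasing failure rate) the residual life at that age is shorter than the MTTF.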
Consider now the reliable life for the one-parameter exponential distribution, compared to the residual life

$$h(x) = \frac{\mathrm{MRL}(x)}{\mathrm{MTTF}} \tag{5.95}$$

Certain characteristics of the comparison between the mean residual life MRL and the mean time to fail MTTF are the following:

• When the time to failure for an item, $T$, has an exponential distribution (i.e. constant hazard rate), then the function $h(x) = 1$ for all $x$ and MRL = MTTF.
• When $T$ has a Weibull distribution with shape parameter $\beta < 1$ (i.e. a decreasing failure rate), then $h(x)$ is an increasing function.
• When $T$ has a Weibull distribution with shape parameter $\beta > 1$ (i.e. an increasing failure rate), then $h(x)$ is a decreasing function.

Thus, in the case of scheduled preventive maintenance activities with frequencies less than their MTTF, the cost/risk of premature renewal or replacement is the loss of potential equipment life (accumulated over all components), equivalent to the sum of the differences between the residual life of each component and the scheduled frequency. Similarly, for those scheduled preventive maintenance activities with frequencies greater than their MTTF, the cost/risk of delayed renewal or replacement is the cost of losses (accumulated over all components) due to forced shutdowns as a result of failure, plus the cost of repair to the failed component and to any consequential damage. The likelihood of failure is equivalent to the ratio of the differences between the MTTF of each component and the scheduled frequency, divided by the differences between the residual life of each component and the scheduled frequency.

Table 5.22 shows the replacement of (1) = likelihood of occurrence and (4) = failure rate with the calculated residual life values, in the FMSE of Table 5.21.

h) Sensitivity Testing

Sensitivity testing in FMSE considers limits of the likelihood of failure.
This is done by representing the likelihood as a statistical distribution (usually, the standard normal distribution), and determining the variance and standard deviation of the range of likelihood values. Sensitivity testing in this case is thus a statistical measure of how well a likelihood test correctly identifies a failure condition. This is illustrated in the concept tabulated below. The sensitivity is the proportion of 'true positives' or true likelihood of failure, and is a parameter of the test.

Specificity in the concept diagram is a statistical measure of how well a likelihood test correctly identifies the negative cases, or those cases that do not result in a failure condition. The significance level of the sensitivity test is a statistical hypothesis testing concept. It is defined as the probability of making a decision to reject the null hypothesis when the null hypothesis is actually true (a decision known as a type I error, or 'false positive determination'). The decision is made using the P-value of the hypothesis test. If the P-value is less than the significance level, then the null hypothesis is rejected. The smaller the P-value, the more significant the result is considered to be.

Smaller α-levels of the hypothesis test give greater confidence in a determination of significance, but run a greater risk of failing to reject a false null hypothesis (a type II error, or 'false negative determination'). Selection of an α-level therefore involves a compromise between the tendency towards a type I error and the tendency towards a type II error. A common misconception is that a statistically significant result is always of practical significance. One of the more common problems in significance testing of sensitivity is the tendency for multiple comparisons to yield spurious significant differences even where the null hypothesis is true.
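The multiple-comparisons effect can be quantified with a short calculation. The sketch below assumes that the comparisons are statistically independent and that every null hypothesis is true; the function names are hypothetical and the α-level of 5% is used purely as an illustration.

```python
def familywise_error_rate(alpha: float, comparisons: int) -> float:
    """Probability of at least one type I error across independent tests."""
    return 1.0 - (1.0 - alpha) ** comparisons


def expected_false_positives(alpha: float, comparisons: int) -> float:
    """Expected number of spurious significant results if every null is true."""
    return alpha * comparisons


alpha = 0.05
for m in (1, 5, 20):
    print(m,
          round(familywise_error_rate(alpha, m), 3),
          expected_false_positives(alpha, m))
```

Under these assumptions, the chance of at least one false positive grows quickly with the number of failure modes compared, which is why a single significant result among many comparisons should be treated with caution.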
For example, in a comparison study of the likelihood of failure of several failure modes using an α-level of 5%, about one comparison in twenty can be expected to yield a significant result despite the null hypothesis being true.

During a sensitivity analysis, the values of the specified sensitivity variables are modified with changes to the expected value. For one-way sensitivity analyses, one variable is changed at a time. For two-way sensitivity analyses, two variables are changed simultaneously. For more sophisticated sensitivity analyses, an FMSE what-if analysis is conducted. The differences between the outcomes of the qualitative risk-based FMSE and related cost risk for different expected values can then be