AVAILABILITY MEASUREMENT
SESSION NMS-2201

NMS-2201 9627_05_2004_c2 © 2004 Cisco Systems, Inc. All rights reserved.

Agenda
• Introduction
• Availability Measurement Methodologies
    Trouble Ticketing
    Device Reachability: ICMP (Ping), SA Agent, COOL
    SNMP: Uptime, Ping-MIB, COOL, EEM, SA Agent
    Application
• Developing an Availability 'Culture'

Associated Sessions
• NMS-1N01: Intro to Network Management
• NMS-1N02: Intro to SNMP and MIBs
• NMS-1N04: Intro to Service Assurance Agent
• NMS-1N41: Introduction to Performance Management
• NMS-2042: Performance Measurement with Cisco IOS®
• ACC-2010: Deploying Mobility in HA Wireless LANs
• NMS-2202: How Cisco Achieved HA in Its LAN
• RST-2514: HA in Campus Network Deployments
• NMS-4043: Advanced Service Assurance Agent
• RST-4312: High Availability in Routing

INTRODUCTION
WHY MEASURE AVAILABILITY?

Why Measure Availability?
• Baseline the network
• Identify areas for network improvement
• Measure the impact of improvement projects

Why Should We Care About Network Availability?
• Where are we now? (baseline)
• Where are we going? (business objectives)
• How do we best get from where we are now to where we are going? (improvements)
• "What if we can't get there from here?"

Why Should We Care About Network Availability?
Recent Studies by Sage Research Determined That US-Based Service Providers Encountered:
• Percent of downtime that is unscheduled: 44%
• 18% of customers experience over 100 hours of unscheduled downtime, or an availability of 98.5%
• Average cost of network downtime per year: $21.6 million, or $2,169 per minute!
Downtime Costs Too Much!
SOURCE: Sage Research, IP Service Provider Downtime Study: Analysis of Downtime Causes, Costs and Containment Strategies, August 17, 2001, Prepared for Cisco SPLOB

Cause of Network Outages
• Technology: 20%
    Hardware
    Links
    Design
    Environmental issues
    Natural disasters
• User Error and Process: 40%
    Change management
    Process consistency
• Software and Application: 40%
    Software issues
    Performance and load
    Scaling
Source: Gartner Group

Top Three Causes of Network Outages
• Congestive degradation
• Network design
• Capacity (unanticipated peaks)
• WAN failure (e.g., major fiber cut or carrier failure)
• Solutions validation
• Power
• Software quality
• Critical services failure (e.g., DNS/DHCP)
• Inadvertent configuration change
• Change management
• Protocol implementations and misbehavior
• Hardware fault

Method for Attaining a Highly-Available Network
Or a Road to Five Nines
• Establish a standard measurement method
• Define business goals as related to metrics
• Categorize failures, root causes, and improvements
• Take action for root cause resolution and improvement implementation

Where Are We Going? Or What Are Your Business Goals?
• Financial
    ROI
    Economic Value Added
    Revenue/Employee
• Productivity
• Time to market
• Organizational mission
• Customer perspective
    Satisfaction
    Retention
    Market Share
Define Your 'End-State'. What Is Your Goal?

Why Availability for Business Requirements?
• Availability as a basis for productivity data
    Measurement of total-factor productivity
    Benchmarking the organization
    Overall organizational performance metric
• Availability as a basis for organizational competency
    Availability as a core competency
    Availability improvement as an innovation metric
• Resource allocation information
    Identify defects
    Identify root cause
    Measure MTTR (tied to process)

It Takes a Design Effort to Achieve HA
• Hardware and Software Design
• Process Design
• Network and Physical Plant Design

INTRODUCTION
WHAT IS NETWORK AVAILABILITY?

What Is High Availability?
High Availability Means an Average End User Will Experience Less than Five Minutes Downtime per Year

Availability    Downtime per Year (24x7x365)
99.000%         3 Days 15 Hours 36 Minutes
99.500%         1 Day 19 Hours 48 Minutes
99.900%         8 Hours 46 Minutes
99.950%         4 Hours 23 Minutes
99.990%         53 Minutes
99.999%         5 Minutes
99.9999%        30 Seconds

Availability Definition
• Availability definition is based on business objectives
    Is it the user experience you are interested in measuring?
    Are some users more important than others?
• Availability groups?
    Definitions of different groups
• Exceptions to the availability definition
    e.g., the CEO should never experience a 'network' problem

How You Define Availability
• Define availability perspective (customer, business, etc.)
• Define availability groups and levels of redundancy
• Define an outage
• Define impact to network
    Ensure SLAs are compatible with outage definition
    Understand how maintenance windows affect outage definition
    Identify how to handle DNS and DHCP within definition of Layer outage
    Examine component level sparing strategy
• Define what to measure
• Define measurement accuracy requirements

Network Design
What Is Reliability?
• "Reliability" is often used as a general term that refers to the quality of a product
    Failure rate
    MTBF (Mean Time Between Failures) or MTTF (Mean Time To Failure)
    Engineered availability
• Reliability is defined as the probability of survival (or no failure) for a stated length of time

MTBF Defined
• MTBF stands for Mean Time Between Failures
• MTTF stands for Mean Time to Failure
    This is the average length of time between failures (MTBF) or to a failure (MTTF)
    More technically, it is the mean time to go from an OPERATIONAL STATE to a NON-OPERATIONAL STATE
    MTBF is usually used for repairable systems, and MTTF is used for non-repairable systems
• MTTR stands for Mean Time to Repair

One Method of Calculating Availability
• Availability = MTBF / (MTBF + MTTR)
• What is the availability of a computer with MTBF = 10,000 hrs and MTTR = 12 hrs?
    A = 10000 ÷ (10000 + 12) = 99.88%
• Annual uptime: 8,760 hrs/year x 0.9988 = 8,749.5 hrs
• Conversely, annual downtime: 8,760 hrs/year x (1 - 0.9988) = 10.5 hrs

Complex Redundancy
m-of-n redundancy
Examples: 1-of-2, 2-of-3, 2-of-4, 8-of-10
"Pure Active Parallel"

More Complex Redundancy
• Pure active parallel
    All components are on
• Standby redundant
    Backup components are not operating
• Perfect switching
    Switch-over is immediate and without fail
• Switchover reliability
    The probability of switchover when it is not perfect
• Load sharing
    All units are on and workload is distributed

Networks Consist of Series-Parallel
• Combinations of in-series and redundant components
[Diagram: components in series: A, B1/B2 (1-of-2), C, D1/D2/D3 (2-of-3), E, F]

Failure Rate
• The number of failures per time:
    Failures/hour
    Failures/day
    Failures/week
    Failures/10^6 hours
    Failures/10^9 hours ⇒ called "FITs" ("Failures in Time")

Approximating MTBF
• 13 units are tested in a lab for 1,000 hours, with 2 failures occurring
• Another 4 units were tested for 6,000 hours, with 1 failure occurring
• The failed units are repaired (or replaced)
• What is the approximate MTBF?

Approximating MTBF (Cont.)
• MTBF = (13*1000 + 4*6000) / (1 + 2) = 37,000 / 3 = 12,333 hours

Frequency Modeling
Time-to-Failure Frequency Distributions:
• Exponential
• Normal
• Log-Normal
• Weibull
[Figure: frequency versus time-to-failure for each distribution, with the MTBF marked]

Constant Failure Rate
The Exponential Distribution
• The exponential function: f(t) = λe^(-λt), t > 0
    Failure rate, λ, IS CONSTANT
    λ = 1/MTBF
• If MTBF = 2,500 hrs, what is the failure rate?
• λ = 1/2500 = 0.0004 failures/hr

Failure Rate
The "Bathtub" Curve
[Figure: failure rate versus time, in three regions: Infant Mortality (DECREASING failure rate), "Useful Life" Period (CONSTANT failure rate), Wear-Out (INCREASING failure rate)]

The Exponential Reliability Formula
• Commonly used for electronic equipment
• The exponential reliability formula:
    R(t) = e^(-λt) or R(t) = e^(-t/MTBF)

Calculating Reliability
• A certain Cisco router has an MTBF of 100,000 hrs; what is the annual reliability?
    Annual reliability is the reliability for one year, or 8,760 hrs
    R = e^(-8760/100000) = 91.6%
• This says that the probability of no failure in one year is 91.6%; or, 91.6% of all units will survive one year

ADDITIONAL TROUBLE TICKETING SLIDES

Essential Data Elements

Parameter                   Format             Description
Date                        dd/mmm/yy          Date ticket issued
Ticket                      Alphanumeric       Trouble ticket number
Start Date                  dd/mmm/yy          Date of fault
Start Time                  hh:mm              Time of fault
Resolution Date             dd/mmm/yy          Date of resolution
Resolution Time             hh:mm              Time of resolution
Customers Impacted          Integer            Number of customers that lost service; number impacted or names of customers impacted
Problem Description         String             Outline of the problem
Root Cause                  String             HW, SW, process, environmental, etc.
Component/Part/SW Version   Alphanumeric       For HW problems include product ID; for SW include release version
Type                        Planned/Unplanned  Identify whether the event was due to planned maintenance activity or an unplanned outage
Resolution                  String             Description of action taken to fix the problem

Note: The above is the minimum data set; however, if other information is captured, it should be provided

HA Metrics/NAIS Synergy
[Diagram: two analysis activities linked by referrals: "Referral for Analysis", "Analyzed Trouble Ticket Data", and "Referral for Process/Procedural Improvement"]
Data Analysis
• Baseline availability
• Determine DPM (Defects Per Million)
• Network reliability improvement analysis
• Trouble Tickets by:
    Planned/Unplanned
    Root Cause
    Resolution
    Equipment
• MTTR
Operational Process and Procedures Analysis
• Problem management
    Definitions
    Data accuracy
    Collection processes
• Fault management
• Resiliency assessment
• Change management
• Performance management
• Availability management
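The start and resolution fields in the minimum ticket data set are enough to derive MTTR and a ticket-based availability figure. The Python sketch below is one possible way to do that, not a method taken from the session: the ticket records, the 30-day measurement period, and the function names are all hypothetical, and it assumes that outages do not overlap and that each ticket represents a full outage of the measured service.

```python
from datetime import datetime

# dd/mmm/yy hh:mm, matching the formats in the data-element table.
# Note: %b (month abbreviation) is locale-dependent; English is assumed.
FMT = "%d/%b/%y %H:%M"

# Hypothetical tickets, invented purely for illustration.
tickets = [
    {"start": "03/May/04 10:00", "resolved": "03/May/04 12:30", "type": "Unplanned"},
    {"start": "10/May/04 01:00", "resolved": "10/May/04 02:00", "type": "Planned"},
    {"start": "21/May/04 08:15", "resolved": "21/May/04 09:45", "type": "Unplanned"},
]


def outage_hours(ticket):
    """Outage duration in hours, from fault start to resolution."""
    start = datetime.strptime(ticket["start"], FMT)
    end = datetime.strptime(ticket["resolved"], FMT)
    return (end - start).total_seconds() / 3600.0


def summarize(tickets, period_hours, include_planned=False):
    """Return (MTTR, total downtime, availability) for the period.

    Assumes non-overlapping outages; planned events are excluded
    unless include_planned is True.
    """
    durations = [
        outage_hours(t)
        for t in tickets
        if include_planned or t["type"] == "Unplanned"
    ]
    downtime = sum(durations)
    mttr = downtime / len(durations) if durations else 0.0
    availability = 1.0 - downtime / period_hours
    return mttr, downtime, availability


# Assumed 30-day measurement period.
mttr, downtime, availability = summarize(tickets, period_hours=30 * 24)
print(f"MTTR: {mttr:.2f} h, downtime: {downtime:.2f} h, "
      f"availability: {availability:.4%}")
```

Whether planned maintenance counts as downtime depends on the outage definition agreed for the network, which is why the sketch makes it a parameter rather than a fixed choice.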
ADDITIONAL SA AGENT SLIDES

SA Agent: How It Works
[Diagram: a Management Application communicating via SNMP with SA Agent source routers]
• User configures Collectors through Mgmt Application GUI
• Mgmt Application provisions Source routers with Collectors
• Source router measures and stores performance data, e.g.:
    Response time
    Availability
• Application retrieves data from Source routers once an hour
• Data is written to a database
• Source router evaluates SLAs, sends SNMP Traps
• Source router stores latest data point and hours of aggregated points
• Reports are generated

SAA Monitoring IP Core
[Diagram: routers R1, R2, R3 and probes P1, P2, P3 around the IP core, reporting to a Management System]

Monitoring Customer IP Reachability
[Diagram: probes P1 to PN in customer networks Nw1 to NwN polling test points TP1 to TPx]
P1-Pn Service Assurance Agent ICMP Polls to a Test Point in the IP Core

Service Assurance Agent Features
• Measures Service Level Agreement (SLA) metrics
    Packet Loss
    Response time
    Throughput
    Availability
    Jitter
• Evaluates SLAs
• Proactively sends notification of SLA violations

SA Agent Impact on Devices
• Low impact on CPU utilization
• 18k memory per SA agent
• SAA rtr low-memory

Monitored Network Availability Calculation
• Not calculated:
    Already have availability baseline
    Fault type, frequency and downtime may be
    more useful
    Faults directly measured from management system(s)

Monitored Network Availability Assumptions
• All connections below IP are fixed
• Management systems can be notified of all fixed connection state changes
• All (L2) events impact on IP (L3) service

ADDITIONAL COOL SLIDES

CLIs
Configuration CLI Commands
    [no] cool run
    [no] cool interface interface-name(idb)
    [no] cool physical-FRU-entity entity-index (int)
    [no] cool group-interface group-objectID(string)
    [no] cool add-cpu objectID threshold duration
    [no] cool remote-device dest-IP(paddr) obj-descr(string) rate(int) repeat(int) [local-ip(paddr) mode(int) ]
    [no] cool if-filter group-objectID (string)
Display CLI Commands
    Router#show cool event-table []     displays all if not specified
    Router#show cool object-table []    displays all object types if not specified
    Router#show cool fru-entity
Exec CLI Commands
    Router#clear cool event-table
    Router#clear cool persistent-files

Measurement Example: Router Device Outage
Reload (Operational), Power Outage, or Device H/W Failure
• Type: interface(1), physicalEntity(2), Process(3), and remoteObject(4)
• Index: the corresponding MIB table index; if the type is physicalEntity(2), the index in the ENTITY-MIB
• Status: Up (1), Down (2)
• Last-change: last object status change time
• AOT: Accumulated Outage Time (sec)
• NAF: Number of Accumulated Failures

Measurement Example: Cisco IOS S/W Outage
Standby RP in Slot Crash Using "Address Error (4) Test Crash"; AdEL Exception →
It Is Caused Purely by Cisco IOS S/W
Standby RP Crash Using "Jump to Zero (5) Test Crash"; Bp Exception → It Can Be Caused by S/W, H/W, or Operation

Measurement Example: Linecard Outage
• Add a Linecard
• Reset the Linecard
• Down Event Captured
• Up Event Captured
• AOT and NAF Updated

Measurement Example: Interface Outage

Configure to monitor all interfaces whose names include the string ATM2/0, except ATM2/0.3:
12406-R1202(config)#cool group-interface ATM2/0
12406-R1202(config)#no cool group-interface ATM2/0.3

Object Table:
sh cool object | include ATM2/0
33  1054859087  0  ATM2/0.1
35  1054859088  0  ATM2/0.2
39  1054859090  0  ATM2/0.4
41  1054859090  0  ATM2/0.5

Shut ATM2/0 (Interface Down):
12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#shut

Down Event Captured:
show cool event-table
**** COOL Event Table ****
type  index  event  time-stamp  interval  hist_id  object-name
      33            1054859105  18                 ATM2/0.1
      35            1054859106  18                 ATM2/0.2
      39            1054859107  17                 ATM2/0.4
      41            1054859108  18                 ATM2/0.5

No Shut ATM2/0 (Interface Up):
12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#no shut

Up Event Captured:
show cool event-table
**** COOL Event Table ****
type  index  event  time-stamp  interval  hist_id  object-name
      33            1054859146  41                 ATM2/0.1
      35            1054859147  41                 ATM2/0.2
      39            1054859149  42                 ATM2/0.4
      41            1054859150  42                 ATM2/0.5

Object Table Shows AOT and NAF:
sh cool object | include ATM2/0
33 1054859087 41 35 1054859088 41 39 1054859090 42 41 1054859090 42 1 1 ATM2/0.1 ATM2/0.2 ATM2/0.4 ATM2/0.5

Measurement Example: Remote Device Outage
12406-R1202(config)#cool remote-device 50.1.1.2 remobj.1 30 50.1.1.1
12406-R1202(config)#cool remote-device
50.1.2.2 remobj.2 30 50.1.2.1
12406-R1202(config)#cool remote-device 50.1.3.2 remobj.3 30 50.1.3.1

Remote Devices Are Added; Object Table:
sh cool object-table | include remobj
1 1054867061 0 remobj.1 1054867063 0 remobj.2 1054867065 0 remobj.3

Shut Down the Interface Link Between the Remote Device and Router:
12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#shut

Down Event Captured:
4 5 1054867105 1054867108 1054867130 42 47 65 10 remobj.2 remobj.1 remobj.3

No Shut the Interface Link:
12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#no shut

Up Event Captured:
4 4 4 1054867171 1054867193 1054867200 63 63 95 10 remobj.1 remobj.3 remobj.2

Object Table Shows AOT and NAF:
sh cool object-table | include remobj
1 1054867061 63 remobj.1 1054867063 63 remobj.2 1054867065 95 remobj.3
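The AOT and NAF counters reported by COOL can be tied back to the formula Availability = MTBF / (MTBF + MTTR). The Python sketch below shows one way to make that connection; it is an illustration under stated assumptions, not part of COOL itself. The function name is hypothetical, the 24-hour measurement period is assumed, and the NAF value of 1 for remobj.1 is assumed from the single outage in the example (only its AOT of 63 seconds is taken from the object table).

```python
def availability_from_cool(aot_seconds, naf, period_seconds):
    """Derive availability, MTBF, and MTTR from COOL-style counters.

    aot_seconds    -- Accumulated Outage Time over the period (sec)
    naf            -- Number of Accumulated Failures over the period
    period_seconds -- length of the measurement period (an assumption
                      supplied by the caller, not reported by COOL)
    """
    uptime = period_seconds - aot_seconds
    availability = uptime / period_seconds
    if naf == 0:
        # No failures observed: MTBF and MTTR cannot be estimated.
        return availability, None, None
    mtbf = uptime / naf        # mean operating time per failure
    mttr = aot_seconds / naf   # mean outage time per failure
    return availability, mtbf, mttr


# remobj.1: AOT = 63 s from the object table; NAF = 1 and a 24-hour
# measurement period are assumptions made for this illustration.
availability, mtbf, mttr = availability_from_cool(63, 1, 24 * 3600)

# The two routes to availability agree: uptime / period on one side,
# MTBF / (MTBF + MTTR) on the other.
print(f"availability = {availability:.5%}")
print(f"MTBF / (MTBF + MTTR) = {mtbf / (mtbf + mttr):.5%}")
```

Because MTBF and MTTR are both divided by the same failure count, the ratio MTBF / (MTBF + MTTR) reduces to uptime over total time, so the choice of measurement period is the main assumption that affects the result.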