Software Fault Tolerance Techniques and Implementation

Limits of Liability and Disclaimer of Warranty

Every reasonable attempt has been made to ensure the accuracy, completeness, and correctness of the information contained in this book at the time of writing. However, neither the author nor the publisher, Artech House, Inc., shall be responsible or liable in negligence or otherwise, in respect to any inaccuracy or omission herein. The author and the publisher make no representation that this information is suitable for every application to which a reader may attempt to apply the information. Many of the techniques and theories are still subject to academic debate. The author and Artech House make no warranty of any kind, expressed or implied, including warranties of fitness for a particular purpose, with regard to the information contained in this book, all of which is provided as is. Without derogating from the generality of the foregoing, neither the author nor the publisher shall be liable for any direct, indirect, incidental, or consequential damages or loss caused by or arising from any information or advice, inaccuracy, or omission herein. This work is published with the understanding that the author and Artech House are supplying information, but are not attempting to render engineering judgment or other professional services.

For a complete listing of the Artech House Computing Library, turn to the back of this book.

Software Fault Tolerance Techniques and Implementation
Laura L. Pullum
Artech House
Boston London
www.artechhouse.com

Library of Congress Cataloging-in-Publication Data
Pullum, Laura.
Software fault tolerance techniques and implementation / Laura Pullum.
p. cm. - (Artech House computing library)
Includes bibliographical references and index.
ISBN 1-58053-137-7 (alk. paper)
1. Fault-tolerant computing. 2. Computer software-Reliability. I. Title. II. Series.
QA76.9.F38 P85 2001
005.1-dc21 2001035915

British Library Cataloguing in Publication Data
Pullum, Laura
Software fault tolerance techniques and implementation. - (Artech House computing library)
1. Computer software-Development 2. Software failures
I. Title
005.1’2
ISBN 1-58053-470-8

Cover design by Igor Valdman

© 2001 ARTECH HOUSE, INC.
685 Canton Street
Norwood, MA 02062

All rights reserved. Printed and bound in the United States of America. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher.

All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Artech House cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark.

International Standard Book Number: 1-58053-137-7
Library of Congress Catalog Card Number: 2001035915
10 9 8 7 6 5 4 3 2 1

Contents

Preface
Acknowledgments

1 Introduction
  1.1 A Few Definitions
  1.2 Organization and Intended Use
  1.3 Means to Achieve Dependable Software
    1.3.1 Fault Avoidance or Prevention
    1.3.2 Fault Removal
    1.3.3 Fault/Failure Forecasting
    1.3.4 Fault Tolerance
  1.4 Types of Recovery
    1.4.1 Backward Recovery
    1.4.2 Forward Recovery
  1.5 Types of Redundancy for Software Fault Tolerance
    1.5.1 Software Redundancy
    1.5.2 Information or Data Redundancy
    1.5.3 Temporal Redundancy
  1.6 Summary
  References

2 Structuring Redundancy for Software Fault Tolerance
  2.1 Robust Software
  2.2 Design Diversity
    2.2.1 Case Studies and Experiments in Design Diversity
    2.2.2 Levels of Diversity and Fault Tolerance Application
    2.2.3 Factors Influencing Diversity
  2.3 Data Diversity
    2.3.1 Overview of Data Re-expression
    2.3.2 Output Types and Related Data Re-expression
    2.3.3 Example Data Re-expression Algorithms
  2.4 Temporal Diversity
  2.5 Architectural Structure for Diverse Software
  2.6 Structure for Development of Diverse Software
    2.6.1 Xu and Randell Framework
    2.6.2 Daniels, Kim, and Vouk Framework
  2.7 Summary
  References

3 Design Methods, Programming Techniques, and Issues
  3.1 Problems and Issues
    3.1.1 Similar Errors and a Lack of Diversity
    3.1.2 Consistent Comparison Problem
    3.1.3 Domino Effect
    3.1.4 Overhead
  3.2 Programming Techniques
    3.2.1 Assertions
    3.2.2 Checkpointing
    3.2.3 Atomic Actions
  3.3 Dependable System Development Model and N-Version Software Paradigm
    3.3.1 Design Considerations
    3.3.2 Dependable System Development Model
    3.3.3 Design Paradigm for N-Version Programming
  3.4 Summary
  References

4 Design Diverse Software Fault Tolerance Techniques
  4.1 Recovery Blocks
    4.1.1 Recovery Block Operation
    4.1.2 Recovery Block Example
    4.1.3 Recovery Block Issues and Discussion
  4.2 N-Version Programming
    4.2.1 N-Version Programming Operation
    4.2.2 N-Version Programming Example
    4.2.3 N-Version Programming Issues and Discussion
  4.3 Distributed Recovery Blocks
    4.3.1 Distributed Recovery Block Operation
    4.3.2 Distributed Recovery Block Example
    4.3.3 Distributed Recovery Block Issues and Discussion
  4.4 N Self-Checking Programming
    4.4.1 N Self-Checking Programming Operation
    4.4.2 N Self-Checking Programming Example
    4.4.3 N Self-Checking Programming Issues and Discussion
  4.5 Consensus Recovery Block
    4.5.1 Consensus Recovery Block Operation
    4.5.2 Consensus Recovery Block Example
    4.5.3 Consensus Recovery Block Issues and Discussion
  4.6 Acceptance Voting
    4.6.1 Acceptance Voting Operation
    4.6.2 Acceptance Voting Example
    4.6.3 Acceptance Voting Issues and Discussion
  4.7 Technique Comparisons
    4.7.1 N-Version Programming and Recovery Block Technique Comparisons
    4.7.2 Recovery Block and Distributed Recovery Block Technique Comparisons
    4.7.3 Consensus Recovery Block, Recovery Block Technique, and N-Version Programming Comparisons
    4.7.4 Acceptance Voting, Consensus Recovery Block, Recovery Block Technique, and N-Version Programming Comparisons
  References

5 Data Diverse Software Fault Tolerance Techniques
  5.1 Retry Blocks
    5.1.1 Retry Block Operation
    5.1.2 Retry Block Example
    5.1.3 Retry Block Issues and Discussion
  5.2 N-Copy Programming
    5.2.1 N-Copy Programming Operation
    5.2.2 N-Copy Programming Example
    5.2.3 N-Copy Programming Issues and Discussion
  5.3 Two-Pass Adjudicators
    5.3.1 Two-Pass Adjudicator Operation
    5.3.2 Two-Pass Adjudicators and Multiple Correct Results
    5.3.3 Two-Pass Adjudicator Example
    5.3.4 Two-Pass Adjudicator Issues and Discussion
  5.4 Summary
  References

6 Other Software Fault Tolerance Techniques
  6.1 N-Version Programming Variants
    6.1.1 N-Version Programming with Tie-Breaker and Acceptance Test Operation
    6.1.2 N-Version Programming with Tie-Breaker and Acceptance Test Example
  6.2 Resourceful Systems
  6.3 Data-Driven Dependability Assurance Scheme
  6.4 Self-Configuring Optimal Programming
    6.4.1 Self-Configuring Optimal Programming Operation
    6.4.2 Self-Configuring Optimal Programming Example
    6.4.3 Self-Configuring Optimal Programming Issues and Discussion
  6.5 Other Techniques
  6.6 Summary
  References

7 Adjudicating the Results
  7.1 Voters
    7.1.1 Exact Majority Voter
    7.1.2 Median Voter
    7.1.3 Mean Voter
    7.1.4 Consensus Voter
    7.1.5 Comparison Tolerances and the Formal Majority Voter
    7.1.6 Dynamic Majority and Consensus Voters
    7.1.7 Summary of Voters Discussed
    7.1.8 Other Voters
  7.2, 7.2.1, 7.2.2, 7.2.3 ...

[...]

Figure 1.1 A proposed guide to reading this book. (Labels recoverable from the figure: "Sections 7.1.1-7.1.3, 7.1.7, 7.2"; "Sections 7.1.4-7.1.6, 7.1.8"; "Alternate path".)

1.3 Means to Achieve Dependable Software

We have stated that the need for dependable software in general, and software fault tolerance in particular, arises from the pervasiveness of software, ...

... software dependability. Fault/failure forecasting can indicate the need for fault tolerance.

1.3.4 Fault Tolerance

One way to reduce the risks of software design faults, and thus enhance software dependability, is to use software fault tolerance techniques. Software fault tolerance techniques are employed during the procurement, or development, of the software. They enable a system to tolerate software faults ...

... validation (V&V) methods, and eliminating the detected faults. Fault removal techniques contribute to system dependability using software testing, formal inspection, and formal design proofs.

1.3.2.1 Software Testing

The most common fault removal techniques involve testing. An overview of software-testing techniques is provided by the author in [26]. The ...

... groups: (1) those that are employed during the software construction process (fault avoidance and fault tolerance), and (2) those that contribute to validation of the software after it is developed (fault removal and fault forecasting). Briefly, the techniques are:

• Fault avoidance or prevention: to avoid or prevent fault introduction and occurrence;
• Fault removal: to detect the existence of faults and ...
However, it will be treated separately to introduce ...

Figure 1.5 Various views of redundant software: (a) all replicas on a single hardware ... (Panels (a) through (d) each show software replicas SW1, SW2, ..., SWn feeding an adjudicator, hosted on one or more hardware units HW1, HW2, ..., HWn. HW = Hardware; SW = Software.)

... single version software techniques, multiple version software techniques, or multiple data representation techniques.

1.3.4.1 Single Version Software Environment

In a single version software environment, these techniques are used to partially tolerate software design faults: monitoring techniques, atomicity of actions, decision verification, and exception handling.

1.3.4.2 Multiple Version Software Environment

...

... recovery. Techniques using forward recovery include N-version programming (NVP), N-copy programming (NCP), and the distributed recovery block (DRB) technique (which has the effect of forward recovery).

1.5 Types of Redundancy for Software Fault Tolerance

A key supporting concept for fault tolerance is redundancy, that is, additional resources that would not be required if fault tolerance ...

... of fault avoidance techniques on system dependability. Despite fault prevention efforts, faults are created, so fault removal is needed.

1.3.2 Fault Removal

Fault removal techniques are dependability-enhancing techniques employed during software verification and validation. These techniques improve software dependability by detecting existing faults, using verification and validation (V&V) methods, and ...

... applicable recovery technique for software fault tolerance. It is used frequently, despite its overhead. The recovery block (RcB) technique and most distributed systems incorporating software fault tolerance employ backward recovery.

1.4.2 Forward Recovery

As stated earlier, after an error occurs in a program, recovery techniques attempt to return the ...
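To make the two recovery types named in this section more concrete, the following minimal Python sketch contrasts them. It is an illustration under stated assumptions, not the book's implementation: the function names (recovery_block, n_version), the dictionary-based state, the acceptance test, and the toy variants are all hypothetical. Backward recovery is shown in the style of a recovery block (checkpoint, run a variant, roll back if the acceptance test rejects the result). Forward recovery is shown in the style of N-version programming, with a simple exact-majority vote standing in for the adjudicator.

```python
import copy
from collections import Counter


def recovery_block(state, variants, acceptance_test):
    """Backward recovery (hypothetical sketch): checkpoint, try, roll back."""
    checkpoint = copy.deepcopy(state)        # save a known-good state
    for variant in variants:
        try:
            result = variant(state)
            if acceptance_test(result):
                return result                # result accepted, no rollback
        except Exception:
            pass                             # an exception counts as a detected error
        # Roll the caller's state object back to the checkpoint in place.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("all variants failed the acceptance test")


def n_version(value, versions):
    """Forward recovery (hypothetical sketch): run all versions, then vote."""
    results = [version(value) for version in versions]   # results must be hashable
    winner, votes = Counter(results).most_common(1)[0]
    if votes * 2 > len(results):             # strict majority required
        return winner
    raise RuntimeError("no majority among version results")


# Toy variants, assumed for illustration only.
def faulty_sqrt(state):
    state["calls"] += 1                      # side effect that must be undone
    return -1.0                              # deliberately wrong answer


def good_sqrt(state):
    state["calls"] += 1
    return state["x"] ** 0.5


state = {"x": 9.0, "calls": 0}
root = recovery_block(
    state,
    [faulty_sqrt, good_sqrt],
    lambda r: r >= 0 and abs(r * r - 9.0) < 1e-9,
)

# Two versions agree, one disagrees; the majority result masks the error.
voted = n_version(4, [lambda v: v * v, lambda v: v ** 2, lambda v: v + v + 1])
```

In this sketch the faulty variant's side effect on the state is undone before the alternate runs, which is the defining property of backward recovery; the voting function never restores a prior state and instead continues with the majority result, which is why it illustrates forward recovery.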