However, there is one more step that typically precedes application of CRV tests: initialization. Consider Fig. 4.11.

Fig. 4.11. Saving initialized systems for CRV

After a system has been initialized, assigning values to all variables of condition and writing them into their respective registers in the target and in the context, excitation with stimuli generated by the test generator begins. Substantial gains in efficiency can be made by avoiding the common preamble that activates and initializes the system. Simply dump the state of the system in a location accessible to all available simulation engines (the "simulation farm"). Then, as CRV jobs are dispatched to the networked simulation engines in the simulation farm, this saved state is sent along as well.

Fig. 4.12. CRV using previously initialized system(s)

In Fig. 4.12 the tasks within the rectangle labeled "Dispatcher" would typically be handled by a main CRV script that sends jobs to each of a large number of simulation engines in the simulation farm. The remaining tasks are those executed on the individual engines in the farm. Of course, if some faulty behavior is observed during the course of activation and initialization, the resulting state will probably not be valid and should not be used for subsequent CRV tests. Instead, the system state and all other necessary diagnostic data are saved for failure analysis.
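A main CRV script of this kind can be quite small. The following is a minimal sketch in Python; the run_sim wrapper, the file locations, and the job count are hypothetical placeholders rather than any particular farm's interface.

```python
import random
import subprocess
from pathlib import Path

# Hypothetical locations; a real farm would use a shared filesystem and
# a grid scheduler rather than bare subprocess calls.
SAVED_STATE = Path("/farm/shared/post_init_state.snap")  # dumped once, after init
RESULTS_DIR = Path("/farm/shared/results")
NUM_JOBS = 500

def submit_job(seed: int) -> None:
    """Dispatch one CRV job, sending the saved state along with it.

    Each engine restores the post-initialization snapshot and runs its
    randomized test from there, skipping the common preamble entirely.
    """
    subprocess.Popen([
        "run_sim",                       # hypothetical simulation wrapper
        "--restore", str(SAVED_STATE),   # begin from the saved state
        "--seed", str(seed),             # per-job random seed
        "--log", str(RESULTS_DIR / f"job_{seed}.log"),
    ])

if __name__ == "__main__":
    random.seed(2024)  # reproducible choice of per-job seeds
    for seed in random.sample(range(1_000_000), NUM_JOBS):
        submit_job(seed)
```

In practice such a dispatcher would also collect results and rerun failing seeds, but the essential point is visible here: the expensive activation and initialization happen once, and every job inherits the saved state.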
Another common practice that avoids unnecessary simulation cycles is the magical initialization of memory in the target's context, and sometimes even in the target itself. Consider a processor example. A processor's cache control logic is not necessarily well exercised by an extremely long loop that fills the cache, so tests that are intended (weighted) to exercise the execution pipeline might simply be executed from some saved post-activation state whose cache has been magically initialized (in zero time) with the program of interest.

4.13.5 Static vs. Dynamic Test Generation (§ 5)

There are numerous advantages to using dynamically generated tests, and modern commercially available tools readily accommodate this type of generation. A statically generated test is one whose entire trajectory of stimuli is known in advance, much like a deterministic test except that the test has been generated with random value assignment driving its creation. A dynamically generated test is one whose trajectory is determined by current or recent state, with random value assignment being performed as the test is simulated. The progress of the test is made more useful by anticipating or detecting indirect conditions, yielding greater functional coverage with fewer simulation cycles. This practice is well established in the industry.
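The distinction between the two styles is easy to see in code. Below is a minimal sketch in Python: the static generator fixes the whole stimulus trajectory up front, while the dynamic one consults the state of the target as the test runs. The dut object, fifo_nearly_full, apply, and DRAIN_OPS are illustrative names, not a real API.

```python
import random

# Hypothetical opcodes that drain a FIFO; illustrative only.
DRAIN_OPS = [0xF0, 0xF1]

def generate_static_test(length: int, seed: int) -> list[int]:
    """Static generation: the entire trajectory of stimuli is fixed
    before simulation begins, with random values driving its creation."""
    rng = random.Random(seed)
    return [rng.randrange(256) for _ in range(length)]

def run_dynamic_test(dut, length: int, seed: int) -> None:
    """Dynamic generation: each stimulus is chosen while the test runs,
    based on the current or recent state of the target."""
    rng = random.Random(seed)
    for _ in range(length):
        if dut.fifo_nearly_full():        # probe target state (illustrative)
            stim = rng.choice(DRAIN_OPS)  # steer toward an interesting corner
        else:
            stim = rng.randrange(256)
        dut.apply(stim)                   # drive the stimulus (illustrative)
```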
Certain classes of test are not readily generated dynamically, such as a program for a processor. However, activity that is independent of such a test can still be generated dynamically. For example, an interrupt might be caused to occur at a particular time relative to the instruction pipeline.

4.13.6 Halting Individual Tests (§ 5)

Another advantage of using dynamic test generation is that test failures can be detected on the fly. This means that simulation cycles following faulty behavior are not wasted, because the target has already "headed into the weeds." However, it is usually worthwhile to let simulation proceed for a few dozen or a few hundred cycles to capture execution history that may be useful for failure analysis. Some tests may need to wait for quiescence before halting the simulation, so that all internal state has reached its final value before the final state is analyzed for correctness and completeness.
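A test-control loop that embodies both ideas, halting on the fly after a short grace period and otherwise waiting for quiescence, might be sketched as follows. The dut and checker interfaces are assumptions for illustration, not part of any tool.

```python
def run_test(dut, checker, max_cycles: int, grace_cycles: int = 200) -> bool:
    """Halt a test on the fly when faulty behavior is detected.

    After the first failure we keep simulating for a short grace period,
    so the recorded history remains useful for failure analysis; a clean
    run instead waits for quiescence before checking final state.
    """
    failed_at = None
    for cycle in range(max_cycles):
        dut.step()                               # advance one cycle (illustrative)
        if failed_at is None and not checker.ok(dut):
            failed_at = cycle                    # first faulty behavior seen
        if failed_at is not None and cycle - failed_at >= grace_cycles:
            break                                # enough post-failure history
    if failed_at is not None:
        return False
    dut.wait_for_quiescence()                    # let internal state settle
    return checker.final_state_ok(dut)           # correctness and completeness
```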
4.13.7 Sanity Checking and Other Tests (§ 5)

Even after a target has survived an overwhelming volume of CRV testing, it is always prudent to apply some "sanity check" tests as a precautionary measure. For example, verification of a JPEG codec should also use actual image files, with visual confirmation and approval of the generated images. Boot-up firmware and initialization software might also be useful as alternative sources of tests.

One species of "actual software" that one encounters eventually is the "compliance test." These test suites are typically developed to demonstrate that the requirements of some particular industry standard (such as USB or PCI Express) have been met. It is not uncommon for hard prototypes (and sometimes, actual products) to pass their relevant compliance tests and yet still contain bugs. We might better refer to these tests as "non-compliance" tests because, if any test fails, the device under test is not compliant with the requirements as implemented in the test.

Not only should one apply such "sanity check" tests; one should also determine whether any of these tests increases coverage above that which has been accumulated via CRV and pre-CRV activation testing. This may reveal coverage holes in the CRV testing.

Another method for checking the sanity of an overall regression suite is to add one or more bugs deliberately to the RTL and then determine whether the regression suite is able to detect any faulty behavior as a consequence of the "seeded bug(s)." Future verification tools may provide the capability to grade a regression suite based on its ability to expose such bugs.
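Readers familiar with software testing will recognize this idea as mutation testing. A minimal sketch of such grading, assuming hand-written mutations and a run_regression command that exist only for illustration, might look like this:

```python
import subprocess
from pathlib import Path

RTL = Path("rtl/alu.v")  # hypothetical RTL file

# Hand-written mutations: (original fragment, deliberately buggy fragment).
SEEDED_BUGS = [
    ("a + b", "a - b"),        # wrong operator
    ("if (en)", "if (!en)"),   # inverted enable
]

def regression_passes() -> bool:
    """Run the regression suite; 'run_regression' is a placeholder command."""
    return subprocess.run(["run_regression"]).returncode == 0

def grade_suite() -> float:
    """Return the fraction of seeded bugs that the regression suite exposes."""
    golden = RTL.read_text()
    exposed = 0
    for original, buggy in SEEDED_BUGS:
        RTL.write_text(golden.replace(original, buggy, 1))  # seed one bug
        if not regression_passes():    # a failure means the bug was detected
            exposed += 1
    RTL.write_text(golden)             # always restore the clean RTL
    return exposed / len(SEEDED_BUGS)
```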
4.13.8 Gate-level Simulation (§ 5)

Gate-level simulation constitutes simulation of a (usually) synthesized gate-level model with back-annotated timing. (Such a netlist is a connectivity of gates at a nearly meaningless level of abstraction from the point of view of the functional specifications, but at a very meaningful one for the electrical specifications: area, power, and timing.) The objectives of this step are to verify that synthesis generated the correct netlist, that scan and clock-tree insertion do not change circuit functionality, that the chip will meet timing (by exercising the timing-critical paths), and that any hand edits to the netlist have not introduced a bug. This gate-level model is also used for generating manufacturing tests and for evaluating fault coverage.

Static timing analysis tools typically handle the job of ensuring that the chip will meet timing requirements, but the additional confidence gained from successful gate-level simulations provides a cross-check for these static timing tools, provided that the timing-critical paths can be exercised. Manufacturing tests are typically achieved through full scan testing, or BIST if implemented.

The term "formal verification" is also applied to equivalence checking of the RTL against the synthesized netlist, the netlist with test logic inserted, the netlist with the clock tree inserted, and any hand edits. Hand edits may be needed to accomplish a bug fix by modifying the masks (masks are very expensive, so we want to retain them whenever we can), and these modifications must be modeled in RTL; equivalence checking ensures that the two remain equivalent. Equivalence checkers find any non-correspondence between a gate-level netlist and the RTL from which it was synthesized. Gate-level simulation with a selected subset of the regression suite serves as a cross-check on equivalence checking.

Fault coverage can be estimated using only a gate-level model with suitable stuck-at models. Gate-level simulations will typically run much slower than simulations of RTL because the level of abstraction is lower, resulting in many more elements to evaluate as signal changes propagate through the target.

4.13.9 Generating Production Test Vectors (§ 5)

Production test vectors typically consist largely of automatically generated vectors for the testability morph of the target. Test time is expensive, so achieving high fault coverage in as little time as possible is very important. Fortunately, ATPG (automatic test pattern generation) does the heavy lifting for us, producing vectors for the testability morph.

If the test fixture for the manufactured device is able to drive and observe all I/O ports, then some fraction (or all) of the functional regression suite can also be run on the tester. One advantage is that manufactured prototypes can be tested even though a complete set of manufacturing tests is not yet available; it is not uncommon for generation of production test vectors to lag production of prototype devices.

4.14 Change Management (§ 6)

Change is constant, as they say. Effective project management expects change and is prepared to accommodate it. Changes come from within, as a consequence of invention and discovery, as well as from without, as a consequence of market and schedule pressures.

To facilitate a fluid and nimble response to changes from whatever source, it is worthwhile to discuss how such changes will be handled early in the planning stage. Agreements on how to manage such changes might already be part of the management practices of the organization, in which case the verification plan need merely cite these established practices. For example, the verification manager might be allocated a budget for resources and then be empowered to spend those resources as needed to achieve the project's goals on schedule. On the other hand, various levels of approval up the management chain might be needed before changing course.

When changes are imposed by external factors, such as a change in product requirements, it is invaluable to document these changes in writing. If specifications are changed, then these changes must be interpreted into the standard framework and propagated throughout the testbench as needed. If changes are made to schedule or resources, it is vital that these changes be understood in terms of their consequences to the project. It may be useful to establish a "drop dead" date, after which no further changes in requirements can be accommodated (so called because this is when engineering tells marketing to "drop dead" if they bring new requirements for the design).

A shorter schedule or a loss of resources will usually entail assuming greater risk at tape-out than was originally planned. On the other hand, the availability of an advanced tool or the discovery of some more efficient process might permit a more aggressive risk goal than was originally planned.

4.15 Organizing the Teams (§ 7)

There are many ways to organize the various teams of engineers and managers that work together to produce a new product, but there are some common elements to each. A representative organization of the development effort is illustrated in Fig. 4.13.

Fig. 4.13. Typical team organization

There are typically three major efforts in the development of a new (or revised) IC, and these efforts are often undertaken by three separate teams. First, there is the need to create the RTL that performs the required functions, synthesizes into the intended die area, and meets the timing requirements. A separate RTL team is often responsible for this effort.

Second, there is the need to create the artwork that is transformed into the mask set for the IC. This artwork must match the RTL in terms of logical functionality, fit into the die area, meet the timing requirements, and meet the design rules for the manufacturing process.

Third, there is the need to create the test environment and the resulting regression suite that demonstrates, to a sufficiently high level of confidence, that the synthesizable RTL functions correctly and completely according to the specifications. Additionally, a gate-level model with back-annotated timing information derived from the artwork might be subjected to the same (or, more likely, an abbreviated) regression as the source RTL.

In effect, the RTL team and the verification team are each producing, independently and in separate computer languages, a functional implementation of the specification documents. The likelihood that both teams will duplicate a design error is very small, so each is checking the work of the other, thereby assuring a quality result. (The diagram in Fig. 4.13 is only a rough approximation of the countless interactions of team members, and it would be an error to restrict communications to only those shown in the diagram, confining the creative process to some graphic in a book. The map is not the territory.)

During the course of the project there will invariably be some number of bugs that must be exposed, and changes made, so that functionality, area, and timing requirements are met. The presence of these bugs is revealed in failures that must be analyzed for cause and for the changes needed to eliminate whatever faulty behavior led to the failure. The verification team, having ruled out anything in the test environment as having caused the failure, relays the data associated with the failure to the RTL team for analysis. When the changes needed to eliminate the faulty behavior are made, the revised RTL is relayed to both the verification team and the artwork team.

Similarly, if a signal path fails to meet its timing budget, or if a block fails to meet its area budget, the artwork team rules out any cause within their domain (such as routing or transistor sizing) before relaying the failure data back to the RTL team for their analysis. Again, the RTL is revised as necessary and relayed back to the artwork team and to the verification team. Of course, this is a highly simplified view of team interaction, and the reality is usually much more dynamic and not nearly so cut and dried.

RTL revisions are not necessarily relayed for each and every change, but might instead be accumulated for a more orderly check-in process, perhaps daily or weekly or at some other interval. One thing is required, however: everyone must be working on the identical version of the RTL.

As mentioned earlier, it is valuable to adopt an "always working" methodology whereby any changes that fail regression are rejected. Having a "quick test" that screens for common problems can prevent a broken check-in of new code from reaching the CRV processes prematurely.

Proper check-in discipline should be enforced at all times, usually by the curator of the overall environment. This is a member of the verification team who is in a position to remain aware of all changes made by everyone contributing to regressions, whether RTL from the RTL team, testbench components from the verification team, or gate-level models from the artwork team. This engineer might also be empowered to manage the simulation farm and to ensure that optimal productivity is obtained from this scarce asset.

Some tests fail due to external factors, such as a disk becoming full or someone from the nightly cleaning crew unplugging a computer. Some tests fail due to bugs in the environment control software that dispatches simulation jobs and retrieves their results. Some tests fail due to bugs in the test environment. If the foregoing causes have been ruled out, then the cause of the failure is most likely one or more bugs in the RTL. This is about the right time for a hand-off from the verification team to the RTL team.

Triage, sifting through failed tests to find likely common causes and sorting them accordingly, is an important role within the verification team. This function is often performed most effectively by the engineer responsible for integrating the many contributions (RTL, test environment), i.e. the curator.

Finding all paths that lead to the same faulty behavior gives us what we need to know to eliminate the faulty behavior, as well as to generate the test that exercises the changes (not that this will be an easy task). One method that has proven useful for exploring the functional space in the proximity of some faulty behavior is sweep testing, or functional shmoo testing: by varying the values of one variable while keeping the others constant, one may gain a more comprehensive understanding of which variables contribute to the problem. Again, this is not necessarily easy, but it may be needed for critical functionality.
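A sweep of this kind is simple to automate. The sketch below assumes a run_test callback that returns pass or fail for a given configuration; the variable names are illustrative only.

```python
def shmoo(run_test, variable, sweep_values, fixed):
    """Functional shmoo: sweep one variable of condition while holding
    the others constant, recording pass/fail at each point."""
    results = {}
    for value in sweep_values:
        config = dict(fixed)
        config[variable] = value
        results[value] = run_test(config)  # True if the test passes
    return results

# Example: sweep FIFO depth while burst length and clock ratio stay fixed.
# fail_map = shmoo(run_test, "fifo_depth", range(1, 65),
#                  {"burst_len": 8, "clk_ratio": 2})
```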
We know that we can create VTGs for any subgraph in the functional space. This may be a useful technique in verifying changes made to fix bugs and eliminate faulty behavior: coverage of the VTG constructed for the affected variables can tell us how thoroughly the changes have been exercised.

For bugs requiring changes in multiple modules created by different engineers, it may be worthwhile to analyze the behavior for sensitivity to context, to activation, to operation, and, of course, to excitation. This comprehensive analysis may facilitate the design of the changes that eliminate the faulty behavior, or the design of workarounds (avoidance of the bug by constraining software's use of some functionality).

For example, if some faulty behavior requires extensive changes to the RTL, or if time does not permit designing the changes, and the behavior occurs only for one value of a variable of condition, then the behavior can be avoided by no longer using that value. Of course, this is feasible only if the other consequences of not using that value are still acceptable within the sphere of intended usage for the design. The design of a workaround is usually not a trivial exercise.

4.15.1 Failure Analysis (§ 7)

Members of the verification team are on the front lines of failure analysis.
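A first automated pass at failure analysis often resembles the triage described earlier: bucket failing jobs by likely cause, handing to the RTL team only those failures for which the other causes have been ruled out. The log patterns and categories below are illustrative assumptions, not the output of any real tool.

```python
import re
from pathlib import Path

# Illustrative patterns, checked in order; the first match wins.
TRIAGE_RULES = [
    ("external",  re.compile(r"disk full|host unreachable|killed by signal")),
    ("dispatch",  re.compile(r"dispatcher error|job timeout")),
    ("testbench", re.compile(r"TB_ASSERT|scoreboard internal error")),
]

def triage(log_dir: Path) -> dict:
    """Bucket failing-test logs by likely cause.

    Anything left after external, dispatch, and testbench causes have been
    ruled out is presumed to be an RTL bug and handed to the RTL team.
    """
    buckets = {cause: [] for cause, _ in TRIAGE_RULES}
    buckets["rtl"] = []
    for log in sorted(log_dir.glob("*.log")):
        text = log.read_text(errors="replace")
        for cause, pattern in TRIAGE_RULES:
            if pattern.search(text):
                buckets[cause].append(log)
                break
        else:
            buckets["rtl"].append(log)  # foregoing causes ruled out
    return buckets
```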