Model-Based Design for Embedded Systems, Part 3

a translation of the binary code into SystemC code generates fast code compared to an interpreting ISS, as no decoding of instructions is needed, and the generated SystemC code can easily be used within a SystemC simulation environment. However, this approach has some major disadvantages. One main drawback is that the same problems that have to be solved in static compilation (binary translation) have to be solved here (e.g., the addresses of calculated branch targets have to be determined). Another disadvantage is that the automatically generated code is not easily read by humans.

2.4.1 Back-Annotation of WCET/BCET Values

In this section, we describe our approach in more detail. Figure 2.6 shows an overview of the approach. First, the C source code is translated by an ordinary C (cross-)compiler into binary code for the embedded processor (source processor). After that, our back-annotation tool reads the object file and a description of the source processor. This description covers both the architecture and the instruction set of the processor.

Figure 2.4 shows an example of the architecture description. It contains information about the resources of the processor (Figure 2.4a), which is used for modeling the pipeline. It also describes the properties of the instruction cache (Figure 2.4b) and the data cache (Figure 2.4c). In addition, such a description can contain information about the branch prediction of the processor.

FIGURE 2.3 Back-annotation of WCET/BCET values. Annotation of the C code for a basic block: the C code corresponding to the basic block; delay(statically predicted number of cycles); the C code corresponding to the cache analysis blocks of the basic block; delay(cycleCalculationICache(tag, iStart, iEnd)); delay(cycleCalculationForConditionalBranch()); and, if necessary (e.g., before an I/O access), a call of the consume function, consume(getTaskTime());. Architectural model: cache model and branch prediction model. (From Schnerr, J. et al., High-performance timing simulation of embedded software, in: Proceedings of the 45th Design Automation Conference (DAC), Anaheim, CA, pp. 290–295, June 2008. Copyright: ACM. Used with permission.)

SystemC-Based Performance Analysis of Embedded Systems

FIGURE 2.4 Example for a description of the architecture. (a) Resources: FI, DI, EX, WB. (b) Instruction cache: 2, 8, 4096, lru. (c) Data cache: 2, 8, 4096, lru, write-back.

Figure 2.5 shows an example of the description of the instruction set. This description contains information about the structure of the bit image of the instruction code (Figure 2.5c). It also contains the information needed to determine the timing behavior of instructions, including instructions that are executed in context with other instructions (Figure 2.5d). Furthermore, for debugging and documentation purposes, more information about the instruction can be given (Figure 2.5a and b).

Using this description, the object code is decoded and translated into an intermediate representation consisting of a list of objects, each of which represents one intermediate instruction. In the next step, the basic blocks of the program are determined from this intermediate representation, which yields a list of basic blocks. After that, the execution time of each basic block is calculated statically with respect to the provided pipeline model of the source processor. This calculation step is described in more detail in Section 2.4.3.

Subsequently, the correspondences between the C source code and the binary code are identified. Then the back-annotation itself takes place, by automated code instrumentation for cycle generation and dynamic cycle correction. The structure and functionality of this code are described in Section 2.4.2.

Not every impact of the processor architecture on the number of cycles can
be predicted statically. Therefore, if dynamic, data-dependent effects (e.g., branch prediction and caches) have to be taken into account, additional code needs to be added. Further details concerning this code are described in Section 2.4.5.

FIGURE 2.5 Example for a description of an instruction. Fields: a 4, b 4, c 4, d 4, n 2. (a) addsc.a Ac, Ab, Da, n. (b) Left-shift the contents of data register Da by the amount specified by n, where n can be 0, 1, 2, or 3. Add that value to the contents of address register Ab and put the result in address register Ac. (c) c0110000000nba00000001. (d) FI 1, DI 1, EX 1, WB 1.

During back-annotation, the C program is transformed into a cycle-accurate SystemC program that can be compiled and executed on the processor of the simulation host (target processor). One advantage of this approach is the fast execution of the annotated code, as the C source code does not need major changes for back-annotation. Moreover, the generated SystemC code can easily be used within a SystemC simulation environment. The difficulty with this approach is finding the parts of the C source code that correspond to the binary code if the compiler optimizes or changes the structure of the binary code too much. If this happens, recompilation techniques [4] have to be used to find the correspondences.

FIGURE 2.6 General principle for a basic block annotation. Elements shown: C source code; C compiler; binary code; processor description; back-annotation tool, comprising the analysis of the binary code (construction of the intermediate representation, building of basic blocks, static cycle calculation), the analysis of the source code (finding correspondences between C source code and binary code), and the back-annotation (insertion of cycle generation code, insertion of dynamic correction code); annotated SystemC program. (Copyright: ACM. Used with permission.)

2.4.2 Annotation of SystemC Code

The left-hand side of Figure 2.3 shows the necessary annotation of a piece of C code that corresponds to a basic block. The right-hand side of this figure shows the cache model and the branch prediction model that are used during runtime.

As described in further detail in Section 2.4.7, a function delay is used for accumulating the execution time of an annotated basic block during simulation. At the beginning of the annotated basic block code, the annotation tool adds a call of the delay function whose parameter is the statically determined number of cycles this basic block would use on the source processor. How this number is calculated is described in more detail in Section 2.4.3.

In modern processor architectures, the impact of the architecture on the number of executed cycles cannot be completely predicted statically. The branch prediction and the caches of a processor in particular have a significant impact on the number of cycles used. Therefore, the statically determined number of cycles has to be corrected dynamically. The partitioning of the basic block for the calculation of additional cycles caused by instruction cache misses, as shown in Figure 2.3, is explained in Section 2.4.5.

If there is a conditional branch at the end of a basic block, branch prediction has to be considered and possible correction cycles have to be added. This is described in more detail in Section 2.4.5.

As shown in Figure 2.3, the back-annotation tool adds a call to the consume function, which performs cycle generation, at the end of each basic block code. If necessary, this call generates the number of cycles this basic block would need on the source processor. How the consume function works is described in Section 2.4.7.

To balance fast execution of the code against the highest possible accuracy, the annotation tool can be parameterized with different accuracy levels for the generated code. The first and the fastest one is a purely static
prediction. The second one additionally models the branch prediction. The third one also takes the instruction caches into account dynamically. The cycle calculation at these different levels is discussed in more detail in the following sections.

2.4.3 Static Cycle Calculation of a Basic Block

In modern architectures, pipeline effects, superscalarity, and caches have an important impact on the execution time. Because of this, calculating the execution time of a basic block by summing the execution or latency times of its individual instructions is very inaccurate. Therefore, the incorporation of a pipeline model per basic block becomes necessary [21]. This model helps to statically predict pipeline effects and the effects of superscalarity. To generate this model, information about the instruction set and the pipelines of the processor is needed. This information is contained in the processor description that is used by the annotation tool. The tool uses a model of the pipeline to determine which instructions of the basic block will be executed in parallel on a superscalar processor and which combinations of instructions in the basic block will cause pipeline stalls. Details are described in the next section.

With the information gained by basic block modeling, a prediction is carried out. This prediction determines the number of cycles the basic block would need on the source processor. Section 2.4.5 shows how this prediction is improved during runtime and how a cache model is included.

2.4.4 Modeling of Pipeline for a Basic Block

As previously mentioned, the processor description contains information about the resources the processor has and about the resources a certain instruction uses. This information is used to build a resource usage model that specifies microarchitectural details of the processor.

For this model, it is assumed that all units in the processor, such as functional units, pipeline stages, registers, and ports, form a set of resources. These resources can be allocated or released by every instruction that is executed. In other words, the resource usage model is based on the assumption that every instruction, when executed, allocates a set of resources and carries out an action. As the execution proceeds, the allocated resources and the carried-out actions change. If two instructions wait for the same resource, the conflict is resolved by allocating the resource to the instruction that entered the pipeline earlier. This model is powerful enough to describe pipelines, superscalar execution, and other microarchitectures.

2.4.4.1 Modeling with the Help of Reservation Tables

The timing information of every program construct can be described with a reservation table. Originally, reservation tables were proposed to describe and analyze the activities in a pipeline [32]. Traditionally, they have been used to detect conflicts when scheduling instructions [25]. In a reservation table, the vertical dimension represents the pipeline stages and the horizontal dimension represents time. Figure 2.7 shows an example of a basic block and the corresponding reservation table. In the figure, every entry in the reservation table shows that the corresponding pipeline stage is used in the particular time slot. The entry consists of the number of the instruction that uses the resource. The timing interdependencies between the instructions of a basic block are analyzed using the composition of their basic block. In the reservation table, not only conflicts that occur because of the different pipeline stages but also data dependencies between the instructions can be considered.

FIGURE 2.7 Example of a reservation table for a basic block. Basic block: add d1,d2,d3; add d4,d5,d6; ld a3,[a2]0; ld a4,[a5]0; sub d7,d8,d9. Resources (vertical axis): FI, DI, EX, and WB of the int-Pipeline and FI, DI, EX, and WB of the ls-Pipeline. Horizontal axis: time in clock cycles.

2.4.4.1.1 Structural Hazards

In the following, the modeling of the instructions in a pipeline using reservation tables is described [12,32]. To determine at which time after the start of an instruction the execution of a new instruction can start without causing a collision, these reservation tables have to be analyzed. One way to determine whether two instructions can be started K time units apart is to overlap the reservation table with itself using an offset of K time units. If a used resource is overlapped by another, there will be a collision in this segment and K is a forbidden latency. Otherwise, no collision will occur and K is an allowed latency.

2.4.4.1.2 Data Hazards

The time delay caused by data hazards is modeled in the same way as the delay caused by structural hazards. As the result of pipelining an instruction sequence should be the same as the result of executing the instructions sequentially, register accesses should occur in the same order as they do in the program. This restriction is comparable with the usage of pipeline stages in program order, and it can therefore be modeled by an extension of the reservation table.

2.4.4.1.3 Control Hazards

Some processors (like the MIPS R3000 [12]) use delayed branches to avoid the waiting cycle that would otherwise occur because of the control hazard. This can be modeled by adding a delay slot to the basic block with the branch instruction. Such modeling is possible because the instruction in the delay slot is executed regardless of the result of the branch instruction.

2.4.4.2 Calculation of Pipeline Overlapping

In order to model the impact of architectural components such as pipelines, the state of these components has to be known when the basic block is entered. If the state is known, then it is possible to find out the gain that
results from the use of this component. If it is known that, in the control-flow graph of the program, node e_i is the predecessor of node e_j, and the pipeline state after the execution of node e_i is also known, then the information about this state can be used to calculate the execution time of node e_j. This means that the gain resulting from the fact that node e_i is executed before node e_j can be calculated.

The gain is calculated for every pair of succeeding basic blocks using the pipeline overlapping. This pipeline overlapping is determined using reservation tables [29]. Appending the reservation table of a basic block to the reservation table of another basic block works the same way as appending an instruction to this reservation table. Therefore, it is sufficient to consider only the first and the last columns. The number of columns that have to be considered does not have to be larger than the maximum number of cycles for which a single instruction can stay in the pipeline [21].

2.4.5 Dynamic Correction of Cycle Prediction

As previously described, the actual cycle count a processor needs for executing a sequence of instructions cannot be predicted correctly in all cases. This is the case if, for example, a conditional branch at the end of a basic block produces a pipeline flush, or if additional delays occur because of misses in the instruction cache. The combination of static analysis and dynamic execution provides a well-suited solution for this problem, since the statically unpredictable effects of branch and cache behavior can be determined during execution. This is done by inserting appropriate function calls into the translated basic blocks. These calls interact with the architectural model in order to determine the additional number of cycles caused by mispredicted branches and cache misses. At the end of each basic block, the generation of previously calculated cycles (static cycles plus correction
cycles) can occur (Figure 2.3).

2.4.5.1 Branch Prediction

Conditional branches have different cycle times depending on four different cases, which result from the combination of correctly predicted and mispredicted branches with taken and non-taken branches. A correctly predicted branch needs fewer cycles for execution than a mispredicted one. Furthermore, additional cycles can be needed if a correctly predicted branch is taken, as the branch target has to be calculated and loaded into the program counter. This problem is solved by implementing a model of the branch prediction and by comparing the predicted branch behavior with the executed branch behavior. If dynamic branch prediction is used, a model of the underlying state machine is implemented and its results are compared with the executed branch behavior. The cycle count of the case that occurred is calculated and added to the cumulative cycle count before the next basic block is entered.

2.4.5.2 Instruction Cache

Figure 2.3 shows that, for the simulation of the instruction cache, every basic block of the translated program has to be divided into several cache analysis blocks. A cache analysis block extends until the tag changes or the basic block ends. After that, a function call to the cache handling model is added. This code uses a cache model to find out possible cache hits or misses. The cache simulation is explained in more detail in the next few paragraphs, starting with a description of the cache model.

FIGURE 2.8 Correspondence between C code, assembler code, and cache line. Labels shown: C program (C_stmnt1 to C_stmnt4, cycleCalcICache); binary code (asm_inst1, asm_instl, asm_instl+1, asm_inst2l, asm_inst2l+1, asm_instn); cache model (v, tag, lru, data). (Copyright: ACM. Used with permission.)
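The branch prediction correction of Section 2.4.5.1 can be illustrated with a minimal C++ sketch. It is not the authors' implementation: the two-bit saturating counter is only one common dynamic prediction scheme, assumed here for illustration, and the names TwoBitPredictor and BranchPenalties, the parameter list of cycleCalculationForConditionalBranch, and all penalty values are hypothetical. In the approach described above, the actual values would come from the processor description.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical correction cycles for the four cases named in the text.
// Real values would be taken from the processor description.
struct BranchPenalties {
    unsigned predictedNotTaken;     // correctly predicted, not taken
    unsigned predictedTaken;        // correctly predicted, taken
    unsigned mispredictedNotTaken;  // predicted taken, but not taken
    unsigned mispredictedTaken;     // predicted not taken, but taken
};

// Assumed predictor model: a two-bit saturating counter.
// States 0 and 1 predict "not taken", states 2 and 3 predict "taken".
class TwoBitPredictor {
public:
    explicit TwoBitPredictor(std::uint8_t initialState = 1)
        : state_(initialState) {
        assert(initialState <= 3);
    }

    bool predictTaken() const { return state_ >= 2; }

    // Move the counter one step toward the observed outcome.
    void update(bool taken) {
        if (taken && state_ < 3) {
            ++state_;
        } else if (!taken && state_ > 0) {
            --state_;
        }
    }

private:
    std::uint8_t state_;
};

// Compares the modeled prediction with the branch outcome observed while
// the annotated code runs, trains the predictor, and returns the
// correction cycles for the case that occurred.
inline unsigned cycleCalculationForConditionalBranch(TwoBitPredictor& predictor,
                                                     bool taken,
                                                     const BranchPenalties& p) {
    const bool predictedTaken = predictor.predictTaken();
    predictor.update(taken);
    if (predictedTaken == taken) {
        return taken ? p.predictedTaken : p.predictedNotTaken;
    }
    return taken ? p.mispredictedTaken : p.mispredictedNotTaken;
}
```

In annotated code, the returned value would be passed to the delay function at the end of a basic block that ends in a conditional branch, so that the correction is added to the cumulative cycle count before the next basic block is entered.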
2.4.5.3 Cache Model

The cache model, shown on the right-hand side of Figure 2.8, contains data space that is used for the administration of the cache. In this space, the valid bit, the cache tag, and the least recently used (lru) information (used for the replacement strategy) are saved for each cache set during runtime. The number of cache tags and the corresponding number of valid bits depend on the associativity of the cache (e.g., for a two-way set-associative cache, two sets of tags and valid bits are needed).

2.4.5.4 Cache Analysis Blocks

In the middle of Figure 2.8, the C source code that corresponds to a basic block is divided into several smaller blocks, the so-called cache analysis blocks. These blocks are needed to take the effects of instruction caches into account. Each of these blocks contains the part of a basic block that fits into a single cache line. As every machine language instruction in such a cache analysis block has the same tag and the same cache index, the addresses of the instructions can be used to determine how a basic block has to be divided into cache analysis blocks. This is because each address consists of the tag information and the cache index.

The cache index information (iStart to iEnd in Figure 2.3) is used to determine at which cache position the instruction with this address is cached. The tag information is used to determine which address was cached, as there can be multiple addresses with the same cache index. Therefore, a changed cache tag can easily be detected during the traversal of the binary code with respect to the cache parameters. The block offset information is not needed for the cache simulation, as no real caching of data takes place.

After the tag has changed, or at the end of a basic block, a call to a function that handles the simulated cache and calculates the additional cycles of cache misses is added to this block. More details about this function are described in the next
section.

    int cycleCalculationICache(tag, iStart, iEnd)
    {
        for index = iStart to iEnd
        {
            if tag is found in index and valid bit is set then
            {
                // cache hit: no additional cycles are needed
                renew lru information
                return 0
            }
            else
            {
                // cache miss
                use lru information to determine tag to overwrite
                write new tag
                set valid bit of written tag
                renew lru information
                return additional cycles needed for cache miss
            }
        }
    }

Listing 2.1 Function for cache cycle correction

2.4.5.5 Cycle Calculation Code

As previously mentioned, each cache analysis block is characterized by a combination of tag and cache-set index information. At the end of each basic block, a call to a function is included. During runtime, this function determines whether the different cache analysis blocks that the basic block consists of are in the simulated cache or not. This way, cache misses are detected.

The function is shown in Listing 2.1. It has the tag and the range of cache-set indices (iStart to iEnd) as parameters. To find out whether there is a cache hit or a cache miss, the function checks whether the tag of each cache analysis block can be found in the specified set and whether the valid bit for the found tag is set. If the tag is found and the valid bit is set, the block is already cached (cache hit) and no additional cycles are needed. Only the lru information has to be renewed. In all other cases, the lru information has to be used to determine which tag has to be overwritten. After that, the new tag has to be written in place of the old one, and the valid bit for this tag has to be set. The lru information has to be renewed as well. In the final step, the additional cycles are returned and added to the cycle correction counter.

References

1. K. Albers, F. Bodmann, and F. Slomka. Hierarchical event streams and event dependency graphs: A new computational model for embedded real-time systems. In Proceedings of the 18th
Euromicro Conference on Real-Time Systems (ECRTS), Dresden, Germany, pp. 97–106, 2006.
2. J. Aynsley. OSCI TLM2 User Manual. Open SystemC Initiative (OSCI), November 2007.
3. J. Bryans, H. Bowman, and J. Derrick. Model checking stochastic automata. ACM Transactions on Computational Logic (TOCL), 4(4):452–492, 2003.
4. C. Cifuentes. Reverse compilation techniques. PhD thesis, Queensland University of Technology, Brisbane, Australia, November 19, 1994.
5. CoWare Inc. CoWare Processor Designer. http://www.coware.com/PDF/products/ProcessorDesigner.pdf
6. L. B. de Brisolara, Marcio F. da S. Oliveira, R. Redin, L. C. Lamb, L. Carro, and F. R. Wagner. Using UML as front-end for heterogeneous software code generation strategies. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, Munich, Germany, pp. 504–509, 2008.
7. A. Donlin. Transaction level modeling: Flows and use models. In Proceedings of the 2nd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), San Jose, CA, pp. 75–80, 2004.
8. T. Grötker, S. Liao, G. Martin, and S. Swan. System Design with SystemC. Kluwer, Dordrecht, the Netherlands, 2002.
9. M. González Harbour, J. J. Gutiérrez García, J. C. Palencia Gutiérrez, and J. M. Drake Moyano. MAST: Modeling and analysis suite for real time applications. In Proceedings of the 13th Euromicro Conference on Real-Time Systems (ECRTS), Delft, the Netherlands, pp. 125–134, 2001.
10. H. Heinecke. Automotive open system architecture – An industry-wide initiative to manage the complexity of emerging automotive E/E architectures. In Convergence International Congress & Exposition On Transportation Electronics, Detroit, MI, 2004.
11. R. Henia, A. Hamann, M. Jersak, R. Racu, K. Richter, and R. Ernst. System level performance analysis – the SymTA/S approach. IEE Proceedings Computers and Digital Techniques, 152(2):148–166, March 2005.
12. Y. Hur, Y. H. Bae, S.-S. Lim, S.-K. Kim, B.-D. Rhee, S. L. Min, C. Y. Park, H. Shin, and C.-S. Kim. Worst case timing analysis of RISC
processors: R3000/R3010 case study. In Proceedings of the IEEE Real-Time Systems Symposium (RTSS), Pisa, Italy, pp. 308–319, 1995.
13. Y. Hwang, S. Abdi, and D. Gajski. Cycle-approximate retargetable performance estimation at the transaction level. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, Munich, Germany, pp. 3–8, 2008.
14. IEEE Computer Society. IEEE Standard SystemC Language Reference Manual, March 2006.
15. Infineon Technologies AG. TC10GP Unified 32-bit Microcontroller-DSP – User’s Manual, 2000.
16. Infineon Technologies Corp. TriCore 32-bit Unified Processor Core – Volume 1: v1.3 Core Architecture, 2005.
17. S. Kraemer, L. Gao, J. Weinstock, R. Leupers, G. Ascheid, and H. Meyr. HySim: A fast simulation framework for embedded software development. In Proceedings of the 5th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Salzburg, Austria, pp. 75–80, 2007.
18. M. Krause, O. Bringmann, and W. Rosenstiel. Target software generation: An approach for automatic mapping of SystemC specifications onto real-time operating systems. Design Automation for Embedded Systems, 10(4):229–251, December 2005.
19. M. Krause, O. Bringmann, and W. Rosenstiel. Hardware-dependent Software: Principles and Practice, Chapter 10: Verification of AUTOSAR Software by SystemC-based virtual prototyping, pp. 261–293. Springer, Netherlands, 2009.
20. S. Künzli, F. Poletti, L. Benini, and L. Thiele. Combining simulation and formal methods for system-level performance analysis. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, Munich, Germany, pp. 236–241, 2006.
21. S.-S. Lim, Y. H. Bae, G. T. Jang, B.-D. Rhee, S. L. Min, C. Y. Park, H. Shin, K. Park, S.-M. Moon, and C. S. Kim. An accurate worst case timing analysis for RISC processors. IEEE Transactions on Software Engineering, 21(7):593–604, 1995.
22. R. Marculescu and A. Nandi. Probabilistic application modeling for system-level performance
analysis. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE), Munich, Germany, pp. 572–579, 2001.
23. M. Ajmone Marsan, G. Conte, and G. Balbo. A class of generalized stochastic Petri nets for the performance evaluation of multiprocessor systems. ACM Transactions on Computer Systems, 2(2):93–122, 1984.
24. The MathWorks, Inc. Real-Time Workshop Embedded Coder 5, September 2007.
25. Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers, San Francisco, CA, 1997.
26. A. Nohl, G. Braun, O. Schliebusch, R. Leupers, H. Meyr, and A. Hoffmann. A universal technique for fast and flexible instruction-set architecture simulation. In Proceedings of the 39th Design Automation Conference (DAC), New York, pp. 22–27, 2002.
27. C. Norström, A. Wall, and W. Yi. Timed automata as task models for event-driven systems. In Proceedings of the Sixth International Conference on Real-Time Computing Systems and Applications (RTCSA), Hong Kong, China, pp. 182–189, 1999.
28. OPNET Technologies, Inc. http://www.opnet.com
29. G. Ottosson and M. Sjödin. Worst-case execution time analysis for modern hardware architectures. In Proceedings of the ACM SIGPLAN 1997 Workshop on Languages, Compilers, and Tools for Real-Time Systems (LCTRTS ’97), Las Vegas, NV, pp. 47–55, 1997.
30. M. Oyamada, F. R. Wagner, M. Bonaciu, W. O. Cesário, and A. A. Jerraya. Software performance estimation in MPSoC design. In Proceedings of the 12th Asia and South Pacific Design Automation Conference (ASP-DAC), Yokohama, Japan, pp. 38–43, 2007.
31. P. Pop, P. Eles, Z. Peng, and T. Pop. Analysis and optimization of distributed real-time embedded systems. In Proceedings of the 41st Design Automation Conference (DAC), San Diego, CA, pp. 593–625, 2004.
32. C. V. Ramamoorthy and H. F. Li. Pipeline architecture. ACM Computing Surveys, 9(1):61–102, 1977.
33. K. Richter, M. Jersak, and R. Ernst. A formal approach to MpSoC performance verification. Computer, 36(4):60–67, 2003.
34. K. Richter, D. Ziegenbein, M.
Jersak, and R. Ernst. Model composition for scheduling analysis in platform design. In Proceedings of the 39th Design Automation Conference (DAC), New Orleans, LA, pp. 287–292, 2002.
35. G. Schirner, A. Gerstlauer, and R. Dömer. Abstract, multifaceted modeling of embedded processors for system level design. In Proceedings of the 12th Asia and South Pacific Design Automation Conference (ASP-DAC), Yokohama, Japan, pp. 384–389, 2007.
36. J. Schnerr, O. Bringmann, and W. Rosenstiel. Cycle accurate binary translation for simulation acceleration in rapid prototyping of SoCs. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, Munich, Germany, pp. 792–797, 2005.
37. J. Schnerr, O. Bringmann, A. Viehl, and W. Rosenstiel. High-performance timing simulation of embedded software. In Proceedings of the 45th Design Automation Conference (DAC), Anaheim, CA, pp. 290–295, June 2008.
38. J. Schnerr, G. Haug, and W. Rosenstiel. Instruction set emulation for rapid prototyping of SoCs. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, Munich, Germany, pp. 562–567, 2003.
39. A. Siebenborn, O. Bringmann, and W. Rosenstiel. Communication analysis for network-on-chip design. In International Conference on Parallel Computing in Electrical Engineering (PARELEC), Dresden, Germany, pp. 315–320, 2004.
40. A. Siebenborn, O. Bringmann, and W. Rosenstiel. Communication analysis for system-on-chip design. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, Paris, France, pp. 648–655, 2004.
41. A. Siebenborn, A. Viehl, O. Bringmann, and W. Rosenstiel. Control-flow aware communication and conflict analysis of parallel processes. In Proceedings of the 12th Asia and South Pacific Design Automation Conference (ASP-DAC), Yokohama, Japan, pp. 32–37, 2007.
42. E. W. Stark and S. A. Smolka. Compositional analysis of expected delays in networks of probabilistic I/O automata. In IEEE Symposium on Logic in Computer Science, Indianapolis, IN, pp. 466–477, 1998.
43. Synopsys, Inc. Synopsys Virtual
Platforms. http://www.synopsys.com/products/designware/virtual_platforms.html
44. L. Thiele, S. Chakraborty, and M. Naedele. Real-time calculus for scheduling hard real-time systems. In IEEE International Symposium on Circuits and Systems (ISCAS), Geneva, Switzerland, volume 4, pp. 101–104, 2000.
45. VaST Systems Technology. CoMET. http://www.vastsystems.com/docs/CoMET_mar2007.pdf
46. A. Viehl, M. Schwarz, O. Bringmann, and W. Rosenstiel. Probabilistic performance risk analysis at system-level. In Proceedings of the 5th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Salzburg, Austria, pp. 185–190, 2007.
47. A. Viehl, T. Schönwald, O. Bringmann, and W. Rosenstiel. Formal performance analysis and simulation of UML/SysML models for ESL design. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, Munich, Germany, pp. 242–247, 2006.
48. T. Wild, A. Herkersdorf, and G.-Y. Lee. TAPES – Trace-based architecture performance evaluation with SystemC. Design Automation for Embedded Systems, 10(2–3):157–179, September 2005.
49. A. Yakovlev, L. Gomes, and L. Lavagno, editors. Hardware Design and Petri Nets. Kluwer Academic Publishers, Dordrecht, the Netherlands, March 2000.

Formal Performance Analysis for Real-Time Heterogeneous Embedded Systems

Simon Schliecker, Jonas Rox, Rafik Henia, Razvan Racu, Arne Hamann, and Rolf Ernst

CONTENTS

3.1 Introduction
3.2 Formal Multiprocessor Performance Analysis
    3.2.1 Application Model
    3.2.2 Event Streams
    3.2.3 Local Component Analysis
    3.2.4 Compositional System-Level Analysis Loop
3.3 From Distributed Systems to MPSoCs
    3.3.1 Deriving Output Event Models
    3.3.2 Response Time Analysis in the Presence of Shared Memory Accesses
    3.3.3 Deriving Aggregate Busy Time
3.4 Hierarchical Communication
3.5 Scenario-Aware Analysis
    3.5.1 Echo Effect
    3.5.2 Compositional Scenario-Aware Analysis
3.6 Sensitivity Analysis
    3.6.1 Performance Characterization
    3.6.2 Performance Slack
3.7 Robustness Optimization
    3.7.1 Use-Cases for Design Robustness
    3.7.2 Evaluating Design Robustness
    3.7.3 Robustness Metrics
        3.7.3.1 Static Design Robustness
        3.7.3.2 Dynamic Design Robustness
3.8 Experiments
    3.8.1 Analyzing Scenario
    3.8.2 Analyzing Scenario
    3.8.3 Considering Scenario Change
    3.8.4 Optimizing Design
    3.8.5 System Dimensioning
3.9 Conclusion
References

3.1 Introduction

Formal approaches to system performance modeling have always been used in the design of real-time systems. With increasing system complexity, there is a growing demand for the use of more sophisticated formal methods in a wider range of systems to improve system predictability and to determine system robustness to changes, enhancements, and design pitfalls. This demand can be addressed by the significant progress made in the last couple of years in performance modeling and analysis on all levels of abstraction. New modular models and methods now allow the analysis of large-scale, heterogeneous systems, providing reliable data on transitional load situations, end-to-end timing, memory usage, and packet losses. A compositional performance analysis decomposes the analysis of the system into the analysis of individual components and their interaction, providing a versatile method to approach real-world architectures. Early industrial adopters are already using such formal methods for the early evaluation and exploration of a design, as well as for a formally complete performance verification toward the end of the design cycle, neither of which could be achieved solely with simulation-based approaches.

The formal methods, as presented in this chapter, are based on abstract load and execution data, and are thus applicable even before executable hardware or software models are available. Such data can even be estimates derived from previous product generations,
similar implementations, or simply engineering competence, allowing for first evaluations of the application and the architecture. This already allows tuning an architecture for maximum robustness against changes in system execution and communication load, reducing the risk of late and expensive redesigns. During the design process, these models can be iteratively refined, eventually leading to a verifiable performance model of the final implementation.

The multitude of diverse programming and architectural design paradigms, often used together in the same system, calls for formal methods that can be easily extended to consider the corresponding timing effects. For example, formal performance analysis methods are also becoming increasingly important in the domain of tightly integrated multiprocessor system-on-chips (MPSoCs). Although such components promise to deliver higher performance at a reduced production cost and power consumption, they introduce a new level of integration complexity. As in distributed embedded systems, multiprocessing comes at the cost of higher timing complexity of interdependent computation, communication, and data storage operations.

Also, many embedded systems (distributed or integrated) feature communication layers that introduce a hierarchical timing structure into the communication. This is addressed in this chapter with a formal representation and accurate modeling of the timing effects induced during transmission.

Finally, today’s embedded systems deliver a multitude of different software functions, each of which can be particularly important in a specific situation (e.g., in automotive systems: an electronic stability program (ESP) and a parking assistance). A hardware platform designed to execute all of these functions at the same time will be expensive and effectively overdimensioned, given that the scenarios are often mutually exclusive. Thus, in order to supply the desired functions at a competitive cost, systems are
only dimensioned for subsets of the supplied functions, so-called scenarios, which are investigated individually. This, however, poses new pitfalls when dimensioning distributed systems under real-time constraints. It becomes mandatory to also consider the scenario-transition phase to prevent timing failures.

This chapter presents an overview of a general, modular, and formal performance analysis framework, which has successfully accommodated many extensions. First, we present its basic procedure in Section 3.2. Several extensions are provided in the subsequent sections to address specific properties of real systems: Section 3.3 visits multi-core architectures and their implications on performance; hierarchical communication, as is common in automotive networks, is addressed in Section 3.4; and the dynamic behavior of switching between different application scenarios during runtime is investigated in Section 3.5. Furthermore, we present a methodology to systematically investigate the sensitivity of a given system configuration and to explore the design space for optimal configurations in Sections 3.6 and 3.7. In an experimental section (Section 3.8), we investigate timing bottlenecks in an example heterogeneous automotive architecture, and show how to improve the performance guided by sensitivity analysis and system exploration.

3.2 Formal Multiprocessor Performance Analysis

In past years, compositional performance analysis approaches [6,14,16] have received increasing attention in the real-time systems community. Compositional performance analyses exhibit great flexibility and scalability for timing and performance analyses of complex, distributed embedded real-time systems. Their basic idea is to integrate local performance analysis techniques, for example, scheduling analysis techniques known from real-time research, into system-level analyses. This composition is achieved by connecting the components’ inputs and outputs by stream representations of their communication
behaviors using event models. This procedure is illustrated in Sections 3.2.1 through 3.2.4.

3.2.1 Application Model

An embedded system consists of hardware and software components interacting with each other to realize a set of functionalities. The traditional approach to formal performance analysis is performed bottom-up. First, the behavior of the individual functions needs to be investigated in detail to gather all relevant data, such as the execution time. This information can then be used to derive the behavior within individual components, accounting for local scheduling interference. Finally, the system-level timing is derived on the basis of the lower-level results.

For an efficient system-level performance verification, embedded systems are modeled with the highest possible level of abstraction. The smallest unit modeling performance characteristics at the application level is called a task. Furthermore, to distinguish computation and communication, tasks are categorized into computational and communication tasks. The hardware platform is modeled by computational and communication resources, which are referred to as CPUs and buses, respectively. Tasks are mapped on resources in order to be executed. To resolve conflicting requests, each resource is associated with a scheduler.

Tasks are activated and executed due to activating events that can be generated in a multitude of ways, including timer expiration and task chaining according to inter-task dependencies. Each task is assumed to have one input first-in first-out (FIFO) buffer. In the basic task model, a task reads its activating data solely from its input FIFO and writes data into the input FIFOs of dependent tasks. This basic model of a task is depicted in Figure 3.1a.

Various extensions of this model also exist. For example, if the task may be suspended during its execution, this can be modeled with the requesting-task model presented in Section 3.3. Also, the direct task
activation model has been extended to more complex activation conditions and semantics [10].

3.2.2 Event Streams

The timing properties of the arrival of workload, i.e., activating events, at the task inputs are described with an activation model. Instead of considering each activation individually, as simulation does, formal performance analysis abstracts from individual activating events to event streams. Generally, event streams can be described using the upper and lower event-arrival functions, η+ and η−, as follows.

FIGURE 3.1 Task execution model, panels (a) and (b). Labels in the figure: activation, local task execution, termination, system-level transactions.

Definition 3.1 (Upper Event-Arrival Function, η+) The upper event-arrival function, η+(Δt), specifies the maximum number of events that may occur in the event stream during any time interval of size Δt.

Definition 3.2 (Lower Event-Arrival Function, η−) The lower event-arrival function, η−(Δt), specifies the minimum number of events that may occur in the event stream during any time interval of size Δt.

Correspondingly, an event model can also be specified using the functions δ−(n) and δ+(n) that represent the minimum and maximum distances between any n events in the stream. This representation is more useful for latency considerations, while the η-functions better express the resource loads. Each can be derived from the other (as they are “pseudo-inverse,” as defined in [5]).

Different parameterized event models have been developed to efficiently describe the timings of events in the system [6,14]. One popular and computationally efficient abstraction for representing event streams is provided by so-called standard event models [33], as visualized in Figure 3.2. Standard event models capture the key properties of event streams using three parameters: the activation period, P; the activation jitter, J; and the minimum distance, d. Periodic event models have one parameter P stating that each event
arrives periodically at exactly every P time units. This simple model can be extended with the notion of jitter, leading to periodic with jitter event models, which are described by two parameters, namely, P and J. Events generally occur periodically, yet they can jitter around their exact position within a jitter interval of size J. If the jitter value is larger than the period, then two or more events can occur simultaneously, leading to bursts. To describe bursty event models, periodic with jitter event models can be extended with the parameter d_min capturing the minimum distance between the occurrences of any two events.

FIGURE 3.2 Standard event models: periodic, periodic with jitter, and periodic with burst. Each panel plots η+ and η− over Δt.

3.2.3 Local Component Analysis

Based on the underlying resource-sharing strategy, as well as stream representations of the incoming workload modeled through the activating event models, local component analyses systematically derive worst-case scenarios to calculate worst-case (sometimes also best-case) task response times (WCRT and BCRT), that is, the time between task activation and task completion, for all tasks sharing the same component (i.e., the processor). Thereby, local component analyses guarantee that all observable response times fall into the calculated [best-case, worst-case] interval. These analyses are therefore considered conservative.

Note that different approaches use different models of computation to perform local component analyses. For instance, SymTA/S [14,43] is based on the algebraic solution of response time formulas using the sliding-window technique proposed by, for example, Lehoczky [23], whereas the real-time calculus utilizes arrival curves and service curves to characterize the workload and processing capabilities of components, and determines their real-time behavior [6]. These concepts are based on the network calculus.
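To make the local analysis step more concrete, the following is a minimal sketch, not the analysis implemented in SymTA/S or the real-time calculus. It assumes the closed forms commonly used for the standard event models of Section 3.2.2, namely η+(Δt) = min(⌈(Δt + J)/P⌉, ⌈Δt/d_min⌉) and η−(Δt) = max(0, ⌊(Δt − J)/P⌋), and it assumes a static-priority preemptive scheduler with independent tasks in which only the first activation in a busy window matters. All class and function names (`EventModel`, `Task`, `worst_case_response_time`) and the example numbers are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for b > 0."""
    return -(-a // b)


@dataclass(frozen=True)
class EventModel:
    """Standard event model with period P, jitter J, and minimum distance d_min."""
    period: int
    jitter: int = 0
    d_min: int = 0  # 0 means "no minimum-distance constraint" (assumption)

    def eta_plus(self, dt: int) -> int:
        """Assumed upper bound on the number of events in any window of size dt."""
        if dt <= 0:
            return 0
        bound = ceil_div(dt + self.jitter, self.period)
        if self.d_min > 0:
            bound = min(bound, ceil_div(dt, self.d_min))
        return bound

    def eta_minus(self, dt: int) -> int:
        """Assumed lower bound on the number of events in any window of size dt."""
        return max(0, (dt - self.jitter) // self.period)


@dataclass(frozen=True)
class Task:
    name: str
    wcet: int               # worst-case execution time
    activation: EventModel  # activating event model


def worst_case_response_time(task: Task,
                             higher_priority: Sequence[Task],
                             horizon: int = 100_000) -> Optional[int]:
    """Iterate w = C + sum(eta_plus_j(w) * C_j) until it stops changing.

    Returns None if no fixed point is found below the horizon, which is
    treated here as "not schedulable".
    """
    w = task.wcet
    while w <= horizon:
        interference = sum(hp.activation.eta_plus(w) * hp.wcet
                           for hp in higher_priority)
        w_next = task.wcet + interference
        if w_next == w:
            return w
        w = w_next
    return None


# Hypothetical task set, listed from highest to lowest priority.
t1 = Task("T1", wcet=1, activation=EventModel(period=4, jitter=2))
t2 = Task("T2", wcet=2, activation=EventModel(period=6))
t3 = Task("T3", wcet=3, activation=EventModel(period=12))

for task, hp in ((t1, []), (t2, [t1]), (t3, [t1, t2])):
    print(task.name, worst_case_response_time(task, hp))
```

The sketch uses integer time units so that the ceiling and floor operations are exact. A full analysis in the spirit of the sliding-window technique would also examine later activations within the same busy window.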
For details on the network calculus, please refer to [5].

Additionally, local component analyses determine the communication behaviors at the outputs of the analyzed tasks by considering the effects of scheduling. The basic model assumes that tasks produce output events at the end of each execution. Like the input timing behavior, the output event timing behavior can also be captured by event models. The output event models can then be derived for every task, based on the local response time analysis. For instance, standard event models used by SymTA/S allow the specification of very simple rules to obtain output event models during the local component analysis. Note that in the simplest case (i.e., if tasks produce exactly one output event for each activating event), the output event model period equals the activation period. A discussion on how output event model periods are determined for more complex semantics (when considering rate transitions) can be found in [19]. The output event model jitter, J_out, is calculated by adding the response time jitter, that is, the difference between the maximum and minimum response times, R_max − R_min, to the activating event model jitter, J_in [33]:

J_out = J_in + (R_max − R_min)    (3.1)

The output event model calculation can also be performed for general event models that are specified solely with the upper and lower event-arrival functions. This method will be applied in Section 3.4 to hierarchical event models (HEMs).

Recently, a more exact output jitter calculation algorithm was proposed for the local component analysis based on standard event models [15] and general event models [43]. The approaches exploit the fact that the response time of a task activation is correlated with the timings of preceding events: the task activation arriving with worst-case jitter does not necessarily experience the worst-case response time.

3.2.4 Compositional System-Level Analysis Loop

On the highest level of the timing hierarchy, the compositional system-level
analysis [6,14] derives the system’s timing properties from the lower-level results. For this, the local component analysis (as explained in Section 3.2.3) is alternated with the output event model propagation. The basic idea is visualized on the right-hand side of Figure 3.3. (The shared-resource analysis depicted on the left-hand side will be explained in Section 3.3.) In each global iteration of the compositional system-level analysis, input event model assumptions are used to perform local scheduling analyses for all components. From this, their response times and output event models are derived as described above. Afterward, the calculated output event models are propagated to the connected components, where they are used as activating input event models for the subsequent global iteration. This iterative analysis represents a fixed-point problem. If all calculated output event models remain unmodified after an iteration, convergence is reached and the last calculated task response times are valid [20,34].

To successfully apply the compositional system-level analysis, the input event models of all components need to be known or must be computable by the local component analysis. For systems containing feedback between two or more components, this is not the case, and, thus, the system-level analysis cannot be performed without additional measures. The concrete strategies to overcome this issue depend on the component types and their input event models. One possibility is the so-called starting point generation of SymTA/S [33].

FIGURE 3.3 MPSoC performance analysis loop. Blocks in the figure: environment model, input event model, shared resource access analysis, local scheduling analysis, derive output event models; repeated until convergence or nonschedulability.

3.3 From Distributed Systems to MPSoCs

The described procedure appropriately covers the behaviors of hardware and software tasks that consume all relevant data upon activation and
produce output data into a single FIFO. This represents the prevailing design practice in many real-time operating systems [24] and parallel programming concepts [21]. However, it is also common, particularly in MPSoCs, to access shared resources such as a memory during the execution of a task. The diverse interactions and correlations between integrated system components then pose fundamental challenges to the timing predictions. Figure 3.4 shows an example dual-core system in which three tasks access the same shared memory during execution. In this section, the scope of the above approach is extended to cover such behaviors.

FIGURE 3.4 Multicore component with three requesting tasks (T1, T2, T3) that access the same shared memory during execution.

For this purpose, the model of the task is extended to include local execution as well as memory transactions during the execution [38]. While the classical task model is represented as an execution time interval (Figure 3.1a), a so-called requesting task performs transactions during its execution, as depicted in Figure 3.1b. The depicted task requires three chunks of data from an external resource. It issues a request and may only continue execution after the transaction is transmitted over the bus, processed on the remote component, and transmitted back to the requesting source. Thus, whenever a transaction has been issued, but is not finished, the task is not ready.

The accessed shared resources may be logical shared resources (as in [27]), but for the scope of this chapter, we assume that the accesses go to a shared memory. Such memory accesses may be explicit data-fetch operations or implicit cache misses.

The timing of such memory accesses, especially cache misses, is extremely difficult to predict accurately. Therefore, an analysis cannot predict the timing of each individual transaction with an acceptable effort. Instead, a shared-resource access analysis algorithm will be utilized in Section 3.3.3 that subsumes all transactions of a task execution and the interference by other system activities. Even though this presumes a highly unfavorable and unlikely coincidence of events, this approach is much less conservative than the consideration of individual transactions.

The memory is considered to be a separate component, and an analysis must be available for it to predict the accumulated latency of a set of memory requests. For this analysis to work, the event models for the number of requests issued from the various processors are required. The outer analysis loop in the procedure of Figure 3.3, as described in Section 3.2, provides these event models for task activations throughout the system. These task activating event models allow the derivation of bounds on the task’s number of requests to the shared resource. These bounds can be used by the shared-resource analysis to derive the transaction latencies. The processor’s scheduling analysis finally needs to account for the delays experienced during the task execution by integrating the transaction latencies. This intermediate analysis is shown on the left-hand side of Figure 3.3. As it is based on the current task activating event model assumptions of the outer analysis, the shared-resource analysis possibly needs to be repeated when the event models are refined.

In order to embed the analysis of requesting tasks into the compositional analysis framework described in Section 3.2, three major building blocks are required:

1. Deriving the number of transactions issued by a task and by all tasks on a processor.
2. Deriving the latency experienced by a set of transactions on the shared resource.
3. Integrating the transaction latency into the tasks’ worst-case response times.

These three steps will be carried out in the following. We begin with the local investigation of deriving the number of initiated transactions (Section 3.3.1) and the extended worst-case response time analysis (Section 3.3.2). Finally,
we turn to the system-level problem of deriving the transaction latency (Section 3.3.3).

3.3.1 Deriving Output Event Models

For each individual task activation, the number of issued requests can be bounded by closely investigating the task’s internal control flow. For example, a task may explicitly fetch data each time it executes a for-loop that is repeated several times. By multiplying the maximum number of loop iterations by the amount of fetched data, a bound on the memory accesses can be derived. Focused on the worst-case execution time problem, previous research has provided various methods to find the longest path through such a program description with the help of integer linear programming (see [49]). Implicit data fetches such as cache misses are more complicated to capture, as they only occur during runtime and cannot be directly identified ...
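The loop-based bound on explicit fetches described above can be sketched as follows. This is an illustrative simplification, not the method of [49]: it assumes that every fetch loop runs for its maximum iteration count on one and the same path, that loops are not nested, and that implicit fetches such as cache misses are ignored. The names `FetchLoop` and `max_requests_per_activation` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FetchLoop:
    """A loop that explicitly fetches data in every iteration (assumed model)."""
    max_iterations: int         # upper bound on the loop iteration count
    fetches_per_iteration: int  # explicit fetch requests issued per iteration


def max_requests_per_activation(loops: Sequence[FetchLoop],
                                straight_line_fetches: int = 0) -> int:
    """Coarse upper bound on explicit requests issued by one task activation.

    Each loop contributes max_iterations * fetches_per_iteration; fetches
    outside of loops are added once.
    """
    return straight_line_fetches + sum(
        loop.max_iterations * loop.fetches_per_iteration for loop in loops
    )


# Hypothetical task body: two fetch loops plus three fetches outside of loops.
bound = max_requests_per_activation(
    [FetchLoop(max_iterations=10, fetches_per_iteration=2),
     FetchLoop(max_iterations=4, fetches_per_iteration=1)],
    straight_line_fetches=3,
)
print(bound)
```

Because the sketch sums over all loops regardless of whether they can execute together, it may overestimate the request count; a path analysis would tighten the bound by excluding infeasible combinations.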
