Nicolescu/Model-Based Design for Embedded Systems 67842_C002 Finals Page 46 2009-10-13 46 Model-Based Design for Embedded Systems

2.4.6 Consideration of Task Switches

In modern embedded systems, software performance simulation has to handle task switching and multiple interrupts. Cooperative task scheduling can already be handled by the previously mentioned approach, since the presented cache model is able to cope with nonpreemptive task switches. Interrupts and preemptive task scheduling can be handled in the same way as each other, because task preemption is usually implemented by using software interrupts. Therefore, the incorporation of interrupts is discussed in the following.

Software interrupts had to be included in the SystemC model. This has been achieved by the automatic insertion of dedicated preemption points after each cycle calculation. This approach allows different user-defined task scheduling policies to be integrated, with each task switch generating a software interrupt. Since the cycle calculation is completed before a task switch is executed, and a global cache and branch prediction model is used, no other changes are necessary. A minor deviation of the cycle count for certain processes can occur, because the actual task switch is carried out with a small delay caused by the projection of the task preemption from the binary-code level to the C/C++ source-code level. Nevertheless, the cumulative cycle count is still correct. The accuracy can be increased by inserting the cycle calculation code after each C/C++ statement. If the additional delay caused by the context switch itself has to be included, the (binary) code of the context switch routine can be treated like any other code.

2.4.7 Preemption of Software Tasks

For the modeling of unconditional time delays, SystemC provides the function wait(sc_time).
The call of wait(Δt) by a SystemC thread at the simulation time t suspends the calling thread until the simulation time t+Δt is reached; after that, the thread continues its execution with the subsequent instruction. The duration Δt is independent of the number of other tasks that are active in the system at that time. Therefore, the wait function is suitable for modeling the delay of hardware functionality, as hardware is inherently parallel. In contrast, software tasks can only be executed if they are allocated to a corresponding execution unit. This means that the execution of a software task is suspended as soon as the execution unit is withdrawn by the operating system. In order to model the software timing behavior, two functions have to be used. The first is the delay(int) function, shown in Listing 2.2. As previously mentioned, this function is used for the fine-grained addition of time. The second is the consume(sc_time) function, which performs a coarse-grained consumption of the accumulated delays. This function is an extension of the function wait(sc_time) with an appropriate condition as needed. Listing 2.3 shows such a consume(sc_time) function.

    int taskTime;                                // accumulated delay, in units of t_PERIOD
    const sc_time t_PERIOD(timePeriod, SC_NS);   // duration of one unit

    // Fine-grained addition of c units to the accumulated delay.
    void delay(int c) { taskTime += c; }

    // Accumulated delay converted into simulation time.
    sc_time getTaskTime() { return taskTime * t_PERIOD; }

    // Clears the accumulated delay.
    void resetTaskTime() { taskTime = 0; }

Listing 2.2 The delay function.

If a software task calls the consume function with a time value, T, as a parameter, the function decrements the time only while the calling software task is in the state RUNNING. If the RTOS scheduler withdraws the execution unit by changing the execution state, the decrementation of the time in the consume function is suspended.
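The suspension behavior described above can be illustrated without a SystemC kernel. The following is a minimal sketch in plain C++, not the implementation used in this chapter: the type TimeNs, the struct PreemptionWindow, and the function consumeFinishTime are hypothetical names introduced here, and the scheduler is reduced to a fixed, sorted list of time windows in which the task is assumed not to be RUNNING. Under these assumptions, the function computes the simulated time at which a call to consume with the value T would return.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for sc_time: simulated time in nanoseconds.
using TimeNs = std::uint64_t;

// Hypothetical half-open window [begin, end) during which the scheduler
// has withdrawn the execution unit, i.e., the task is not RUNNING.
struct PreemptionWindow {
    TimeNs begin;
    TimeNs end;
};

// Returns the simulated time at which a task that starts consuming T at
// time `start` has used up T, counting only time outside the windows.
// Assumption: windows are sorted by time and do not overlap.
TimeNs consumeFinishTime(TimeNs start, TimeNs T,
                         const std::vector<PreemptionWindow>& windows) {
    TimeNs now = start;
    TimeNs remaining = T;
    TimeNs previousEnd = 0;
    for (const PreemptionWindow& w : windows) {
        assert(previousEnd <= w.begin && w.begin <= w.end);
        previousEnd = w.end;
        if (w.end <= now) continue;              // window is already over
        if (w.begin > now) {
            const TimeNs runnable = w.begin - now;   // RUNNING until preempted
            if (runnable >= remaining) return now + remaining;
            remaining -= runnable;               // only RUNNING time is consumed
        }
        now = w.end;   // suspended: time advances, remaining stays unchanged
    }
    return now + remaining;                      // no further preemption
}
```

For example, a task that starts consuming T = 100 ns at time 0 and loses its execution unit from 30 ns to 50 ns finishes at 120 ns instead of 100 ns, because the 20 ns spent outside the RUNNING state do not reduce T.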
When the scheduler changes the state back to RUNNING, the software task is allocated an execution unit again, and the decrementation of the time that was suspended before continues.

    void consume(sc_time T) {
      // Loop while time remains to be consumed or the task is not in the
      // required state (_state; RUNNING, according to the text).
      while (T > SC_ZERO_TIME || state != _state) {
        if (signals.empty()) {
          sc_time time = sc_time_stamp();
          // Wait until T has elapsed or a signal event occurs.
          wait(T, signal_event);
          // Only time spent in the required state is deducted from T.
          if (state == _state)
            T -= sc_time_stamp() - time;
        }
      }
    }

Listing 2.3 The consume function.

2.5 Experimental Results

In order to test the execution speed and the accuracy of the translated code, a few examples were compiled with a C compiler into object code for the Infineon TriCore processor [15]. This object code was also used to generate annotated SystemC code from the C code, as described in Section 2.4.1. As a reference, the execution speed and the cycle count of the TriCore code were measured on a TriCore TC10GP evaluation board and on a TriCore ISS [16]. The examples consist of two filters (fir and ellip) and two programs that are part of audio-decoding routines (dpcm and subband).

FIGURE 2.9 Comparison of speed, in million instructions per second, for dpcm, fir, ellip, and subband on the TriCore evaluation board, annotated SystemC 1, annotated SystemC 2, and the TriCore ISS. (Copyright: ACM. Used with permission.)

Figure 2.9 shows the comparison of the execution speed of the generated code with the execution speed of the TriCore evaluation board and the ISS. The execution speed in this figure is given in millions of TriCore instructions per second. The Athlon 64 processor running the SystemC code and the ISS had a clock rate of 2.4 GHz. The TriCore processor of the evaluation board ran at 48 MHz.
With the annotated SystemC code, two different types of annotation were used: the first generates the cycles after the execution of each basic block, whereas the second adds cycles to a cycle counter after each basic block and generates them only when necessary (e.g., when communication with the hardware takes place). The second type is much more efficient, as depicted in Figure 2.9.

The execution speed of the TriCore processor ranges from 36.8 to 50.8 million instructions per second, whereas the execution speed of the annotated SystemC models with immediate cycle generation ranges from 3.5 to 5.7 million simulated TriCore instructions per second. This means that the SystemC model is only about ten times slower than the real processor. The execution speed of the annotated SystemC code with on-demand cycle generation ranges from 11.2 to 149.9 million TriCore instructions per second.

In order to compare the SystemC execution speed with that of a conventional ISS, the same examples were run using the TriCore ISS. The result was an execution speed ranging from 1.5 to 2.4 million instructions per second. This means our approach delivers an execution speed increase of up to 91%.

A comparison of the number of simulated cycles of the generated SystemC code using branch prediction and cache simulation with the number of executed cycles of the TriCore evaluation board is shown in Figure 2.10. The deviation of the cycle counts of the translated programs (with branch prediction and caches included) from the cycle counts measured on the evaluation board ranges from 4% for the program fir to 7% for the program dpcm. This is in the same range as that of a conventional ISS.

FIGURE 2.10 Comparison of cycle accuracy: cycle counts for dpcm, fir, ellip, and subband on the TriCore evaluation board, annotated SystemC 2, and the TriCore ISS. (Copyright: ACM. Used with permission.)

2.6 Outlook

Since clock frequencies cannot be scaled as readily as the number of cores, modern processor architectures exploit multiple cores to satisfy increasing computational demands. The cores can share architectural resources, such as data caches, to speed up access to common data. Therefore, access conflicts and coherency protocols have a potential impact on the runtimes of the tasks executing on the cores.

The incorporation of multiple cores is directly supported by our SystemC approach. Parallel tasks can easily be assigned to different cores, and the instrumentation of the code with cycle information can be carried out independently. However, shared caches can have a significant impact on the number of executed cycles. This can be addressed by including a shared cache model that executes global cache coherence protocols, such as the MESI protocol. A cycle calculation after each C/C++ statement is strongly recommended here to increase the accuracy.

2.7 Conclusions

This chapter presented a methodology for the SystemC-based performance analysis of embedded systems. To obtain high accuracy with an acceptable runtime, a hybrid approach for a high-performance timing simulation of the embedded software was given. The approach was implemented in an automated design flow. The methodology is based on the generation of SystemC code from the original C code and the back-annotation of the statically determined cycle information into the generated code. Additionally, the impact of data dependencies on the software runtime is handled analytically during simulation.
Promising experimental results from the application of the implemented design flow were presented. These results show a high execution performance of the timed embedded software model as well as good accuracy. Furthermore, the created SystemC models representing the timed embedded software could be easily integrated into virtual SystemC prototypes because of the generated TLM interfaces.

References

1. K. Albers, F. Bodmann, and F. Slomka. Hierarchical event streams and event dependency graphs: A new computational model for embedded real-time systems. In Proceedings of the 18th Euromicro Conference on Real-Time Systems (ECRTS), Dresden, Germany, pp. 97–106, 2006.
2. J. Aynsley. OSCI TLM2 User Manual. Open SystemC Initiative (OSCI), November 2007.
3. J. Bryans, H. Bowman, and J. Derrick. Model checking stochastic automata. ACM Transactions on Computational Logic (TOCL), 4(4):452–492, 2003.
4. C. Cifuentes. Reverse compilation techniques. PhD thesis, Queensland University of Technology, Brisbane, Australia, November 19, 1994.
5. CoWare Inc. CoWare Processor Designer. http://www.coware.com/PDF/products/ProcessorDesigner.pdf.
6. L. B. de Brisolara, Marcio F. da S. Oliveira, R. Redin, L. C. Lamb, L. Carro, and F. R. Wagner. Using UML as front-end for heterogeneous software code generation strategies. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, Munich, Germany, pp. 504–509, 2008.
7. A. Donlin. Transaction level modeling: Flows and use models. In Proceedings of the 2nd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), San Jose, CA, pp. 75–80, 2004.
8. T. Grötker, S. Liao, G. Martin, and S. Swan. System Design with SystemC. Kluwer, Dordrecht, the Netherlands, 2002.
9. M. González Harbour, J. J. Gutiérrez García, J. C. Palencia Gutiérrez, and J. M. Drake Moyano. MAST: Modeling and analysis suite for real time applications. In Proceedings of the 13th Euromicro Conference on Real-Time Systems (ECRTS), Delft, the Netherlands, pp. 125–134, 2001.
10. H. Heinecke. Automotive open system architecture – An industry-wide initiative to manage the complexity of emerging automotive E/E architectures. In Convergence International Congress & Exposition On Transportation Electronics, Detroit, MI, 2004.
11. R. Henia, A. Hamann, M. Jersak, R. Racu, K. Richter, and R. Ernst. System level performance analysis – the SymTA/S approach. IEE Proceedings Computers and Digital Techniques, 152(2):148–166, March 2005.
12. Y. Hur, Y. H. Bae, S.-S. Lim, S.-K. Kim, B.-D. Rhee, S. L. Min, C. Y. Park, H. Shin, and C.-S. Kim. Worst case timing analysis of RISC processors: R3000/R3010 case study. In Proceedings of the IEEE Real-Time Systems Symposium (RTSS), Pisa, Italy, pp. 308–319, 1995.
13. Y. Hwang, S. Abdi, and D. Gajski. Cycle-approximate retargetable performance estimation at the transaction level. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, Munich, Germany, pp. 3–8, 2008.
14. IEEE Computer Society. IEEE Standard SystemC Language Reference Manual, March 2006.
15. Infineon Technologies AG. TC10GP Unified 32-bit Microcontroller-DSP – User's Manual, 2000.
16. Infineon Technologies Corp. TriCore™ 32-bit Unified Processor Core – Volume 1: v1.3 Core Architecture, 2005.
17. S. Kraemer, L. Gao, J. Weinstock, R. Leupers, G. Ascheid, and H. Meyr. HySim: A fast simulation framework for embedded software development. In Proceedings of the 5th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Salzburg, Austria, pp. 75–80, 2007.
18. M. Krause, O. Bringmann, and W. Rosenstiel. Target software generation: An approach for automatic mapping of SystemC specifications onto real-time operating systems. Design Automation for Embedded Systems, 10(4):229–251, December 2005.
19. M. Krause, O. Bringmann, and W. Rosenstiel. Hardware-dependent Software: Principles and Practice, Chapter 10: Verification of AUTOSAR software by SystemC-based virtual prototyping, pp. 261–293. Springer, Netherlands, 2009.
20. S. Künzli, F. Poletti, L. Benini, and L. Thiele. Combining simulation and formal methods for system-level performance analysis. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, Munich, Germany, pp. 236–241, 2006.
21. S.-S. Lim, Y. H. Bae, G. T. Jang, B.-D. Rhee, S. L. Min, C. Y. Park, H. Shin, K. Park, S.-M. Moon, and C. S. Kim. An accurate worst case timing analysis for RISC processors. IEEE Transactions on Software Engineering, 21(7):593–604, 1995.
22. R. Marculescu and A. Nandi. Probabilistic application modeling for system-level performance analysis. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE), Munich, Germany, pp. 572–579, 2001.
23. M. Ajmone Marsan, G. Conte, and G. Balbo. A class of generalized stochastic Petri nets for the performance evaluation of multiprocessor systems. ACM Transactions on Computer Systems, 2(2):93–122, 1984.
24. The MathWorks, Inc. Real-Time Workshop® Embedded Coder 5, September 2007.
25. Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers, San Francisco, CA, 1997.
26. A. Nohl, G. Braun, O. Schliebusch, R. Leupers, H. Meyr, and A. Hoffmann. A universal technique for fast and flexible instruction-set architecture simulation. In Proceedings of the 39th Design Automation Conference (DAC), New York, pp. 22–27, 2002.
27. C. Norström, A. Wall, and W. Yi. Timed automata as task models for event-driven systems. In Proceedings of the Sixth International Conference on Real-Time Computing Systems and Applications (RTCSA), Hong Kong, China, pp. 182–189, 1999.
28. OPNET Technologies, Inc. http://www.opnet.com.
29. G. Ottosson and M. Sjödin. Worst-case execution time analysis for modern hardware architectures. In Proceedings of the ACM SIGPLAN 1997 Workshop on Languages, Compilers, and Tools for Real-Time Systems (LCT-RTS '97), Las Vegas, NV, pp. 47–55, 1997.
30. M. Oyamada, F. R. Wagner, M. Bonaciu, W. O. Cesário, and A. A. Jerraya. Software performance estimation in MPSoC design. In Proceedings of the 12th Asia and South Pacific Design Automation Conference (ASP-DAC), Yokohama, Japan, pp. 38–43, 2007.
31. P. Pop, P. Eles, Z. Peng, and T. Pop. Analysis and optimization of distributed real-time embedded systems. In Proceedings of the 41st Design Automation Conference (DAC), San Diego, CA, pp. 593–625, 2004.
32. C. V. Ramamoorthy and H. F. Li. Pipeline architecture. ACM Computing Surveys, 9(1):61–102, 1977.
33. K. Richter, M. Jersak, and R. Ernst. A formal approach to MpSoC performance verification. Computer, 36(4):60–67, 2003.
34. K. Richter, D. Ziegenbein, M. Jersak, and R. Ernst. Model composition for scheduling analysis in platform design. In Proceedings of the 39th Design Automation Conference (DAC), New Orleans, LA, pp. 287–292, 2002.
35. G. Schirner, A. Gerstlauer, and R. Dömer. Abstract, multifaceted modeling of embedded processors for system level design. In Proceedings of the 12th Asia and South Pacific Design Automation Conference (ASP-DAC), Yokohama, Japan, pp. 384–389, 2007.
36. J. Schnerr, O. Bringmann, and W. Rosenstiel. Cycle accurate binary translation for simulation acceleration in rapid prototyping of SoCs. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, Munich, Germany, pp. 792–797, 2005.
37. J. Schnerr, O. Bringmann, A. Viehl, and W. Rosenstiel. High-performance timing simulation of embedded software. In Proceedings of the 45th Design Automation Conference (DAC), Anaheim, CA, pp. 290–295, June 2008.
38. J. Schnerr, G. Haug, and W. Rosenstiel. Instruction set emulation for rapid prototyping of SoCs. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, Munich, Germany, pp. 562–567, 2003.
39. A. Siebenborn, O. Bringmann, and W. Rosenstiel. Communication analysis for network-on-chip design. In International Conference on Parallel Computing in Electrical Engineering (PARELEC), Dresden, Germany, pp. 315–320, 2004.
40. A. Siebenborn, O. Bringmann, and W. Rosenstiel. Communication analysis for system-on-chip design. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, Paris, France, pp. 648–655, 2004.
41. A. Siebenborn, A. Viehl, O. Bringmann, and W. Rosenstiel. Control-flow aware communication and conflict analysis of parallel processes. In Proceedings of the 12th Asia and South Pacific Design Automation Conference (ASP-DAC), Yokohama, Japan, pp. 32–37, 2007.
42. E. W. Stark and S. A. Smolka. Compositional analysis of expected delays in networks of probabilistic I/O automata. In IEEE Symposium on Logic in Computer Science, Indianapolis, IN, pp. 466–477, 1998.
43. Synopsys, Inc. Synopsys Virtual Platforms. http://www.synopsys.com/products/designware/virtual_platforms.html.
44. L. Thiele, S. Chakraborty, and M. Naedele. Real-time calculus for scheduling hard real-time systems. In IEEE International Symposium on Circuits and Systems (ISCAS), Geneva, Switzerland, volume 4, pp. 101–104, 2000.
45. VaST Systems Technology. CoMET®. http://www.vastsystems.com/docs/CoMET_mar2007.pdf.
46. A. Viehl, M. Schwarz, O. Bringmann, and W. Rosenstiel. Probabilistic performance risk analysis at system-level. In Proceedings of the 5th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Salzburg, Austria, pp. 185–190, 2007.
47. A. Viehl, T. Schönwald, O. Bringmann, and W. Rosenstiel. Formal performance analysis and simulation of UML/SysML models for ESL design. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, Munich, Germany, pp. 242–247, 2006.
48. T. Wild, A. Herkersdorf, and G.-Y. Lee. TAPES – Trace-based architecture performance evaluation with SystemC. Design Automation for Embedded Systems, 10(2–3):157–179, September 2005.
49. A. Yakovlev, L. Gomes, and L. Lavagno, editors. Hardware Design and Petri Nets. Kluwer Academic Publishers, Dordrecht, the Netherlands, March 2000.