Instruction cache optimizations for embedded systems

INSTRUCTION CACHE OPTIMIZATIONS FOR EMBEDDED SYSTEMS

YUN LIANG
(B.Eng, TONGJI UNIVERSITY SHANGHAI, CHINA)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2010

Acknowledgements

First of all, I would like to express my deepest gratitude to my Ph.D advisor, Professor Tulika Mitra, for her constant guidance and encouragement during my five years of graduate study. Her persistent guidance helped me stay on track in my research. Without her help this dissertation would not have been possible.

I am grateful to my dissertation committee members, Professors Wong Weng Fai, Teo Yong Meng and Sri Parameswaran, for their time and thoughtful comments. Thanks are also due to Professors Abhik Roychoudhury and Samarjit Chakraborty. It has been an honor for me to work with them throughout my graduate study. I have greatly benefitted from the discussions I have had with them.

I would like to thank the National University of Singapore for funding me with a research scholarship and offering me the teaching opportunities to support my last year of study. My thanks also go to the administrative staff in the School of Computing, National University of Singapore, for their support during my study.

I would like to thank my friends in NUS for assisting and helping me in my research: Ju Lei, Ge Zhiguo, Huynh Phung Huynh, Unmesh D. Bordoloi, Joon Edward Sim, Ankit Goel, Ramkumar Jayaseelan, Vivy Suhendra, Pan Yu, Li Xianfeng, Liu Haibin, Liu Shanshan, Kathy Nguyen Dang, Andrei Hagiescu and David Lo. My graduate life at NUS would not have been interesting and fun without them.

I would like to extend heartfelt gratitude to my parents for their never-ending love and faith in me and for encouraging me to pursue my dreams. They were a great source of encouragement during my graduate study, especially when I found it difficult to carry on. Thank you for always being there.

Finally, this dissertation would not have been possible without the support of my wife Chen Dan. She sacrificed a great deal ever since I started my graduate study, but she was never one to complain. The hardest part has been the last year, when I was working as a teaching assistant and she was looking for jobs. In spite of all the difficulties, Chen Dan has always been supportive. Thank you for your love and understanding.

Contents

Acknowledgements
Contents
Abstract
List of Publications
List of Tables
List of Figures

1 Introduction
  1.1 Embedded System Design
  1.2 Memory Optimization for Embedded System
  1.3 Thesis Contributions
  1.4 Thesis Organization
2 Background
  2.1 Cache
  2.2 Cache Locking
3 Literature Review
  3.1 Application Specific Memory Optimization
  3.2 Design Space Exploration of Caches
    3.2.1 Trace Driven Simulation
    3.2.2 Analytical Modeling
    3.2.3 Hybrid Approach
  3.3 Cache Locking
    3.3.1 Hard Real-time Systems
    3.3.2 General Embedded Systems
  3.4 Code Layout
  3.5 Cache Modeling for Timing Analysis
4 Cache Modeling via Static Program Analysis
  4.1 Introduction
  4.2 Analysis Framework
  4.3 Cache Modeling
    4.3.1 Concrete Cache States
    4.3.2 Probabilistic Cache States
  4.4 Static Cache Analysis
    4.4.1 Analysis of DAG
    4.4.2 Analysis of Loop
    4.4.3 Special case for Direct Mapped Cache
    4.4.4 Analysis of Whole Program
  4.5 Cache Hierarchy Analysis
  4.6 Experimental Evaluation
    4.6.1 Level-1 Cache
    4.6.2 Multi-level Caches
  4.7 Summary
5 Design Space Exploration of Caches
  5.1 Introduction
  5.2 General Binomial Tree (GBT)
  5.3 Probabilistic GBT
    5.3.1 Concatenation of Probabilistic GBTs
    5.3.2 Combining GBTs in a Probabilistic GBT
    5.3.3 Bounding the size of Probabilistic GBT
    5.3.4 Cache Hit Rate of a Memory Block
  5.4 Static Cache Analysis
  5.5 Experimental Evaluation
  5.6 Summary
6 Instruction Cache Locking
  6.1 Introduction
  6.2 Cache Locking Problem
  6.3 Cache Locking Algorithm
    6.3.1 Optimal Algorithm
    6.3.2 Heuristic Approach
  6.4 Experimental Evaluation
  6.5 Summary
7 Procedure Placement
  7.1 Introduction
  7.2 Procedure Placement Problem
  7.3 Intermediate Blocks Profile
  7.4 Procedure Placement Algorithm
  7.5 Neutral Procedure Placement
  7.6 Experimental Evaluation
    7.6.1 Layout for a Specific Cache Configuration
    7.6.2 Neutral Layout
  7.7 Summary
8 Putting it All Together
  8.1 Integrated Optimization Flow
  8.2 Experimental Evaluation
9 Conclusion
  9.1 Thesis Contributions
  9.2 Future Directions
Bibliography

Abstract

The application specific nature of embedded systems creates the opportunity to design a customized system-on-chip (SoC) platform for a particular application or an application domain. The cache memory subsystem bears significant importance as it bridges the performance gap between the fast processor and the slow main memory.
In particular, the instruction cache, which is employed by most embedded systems, is one of the foremost power consuming and performance determining microarchitectural features, as instructions are fetched almost every clock cycle. Thus, careful tuning and optimization of instruction cache memory can lead to significant performance gain and energy saving.

The objective of this thesis is to exploit application characteristics for instruction cache optimizations. The application characteristics we use include branch probability, loop bound, temporal reuse profile and intermediate blocks profile. These application characteristics are identified through profiling and exploited by our subsequent analytical approach. We consider both hardware and software solutions.

The first part of the thesis focuses on hardware optimization: identifying the best cache configurations to match the specific temporal and spatial localities of a given application through an analytical approach. We first develop a static program analysis to accurately model the cache behavior of a specific cache configuration. Then, we extend our analysis by taking the structural relations among the related cache configurations into account. Our analysis can estimate the cache hit rates for a set of cache configurations with varying number of sets and associativity in one pass as long as the cache line size remains constant. The input to our analysis is simply the branch probabilities and loop bounds, which are significantly more compact than the memory address traces required by trace-driven simulators and other trace-based analytical works.

The second part of the thesis focuses on software optimizations. We propose techniques to tailor the program to the underlying instruction cache parameters. First, we develop a framework to improve the average-case program performance through static instruction cache locking.
We introduce the temporal reuse profile to accurately and efficiently model the cost and benefit of locking memory blocks in the cache. We propose two cache locking algorithms: an optimal algorithm based on branch-and-bound search and a heuristic approach. Second, we propose an efficient algorithm to place procedures in memory for a specific cache configuration such that cache conflicts are minimized. As a result, both performance and energy consumption are improved. Our efficient algorithm is based on the intermediate blocks profile, which accurately but compactly models the cost-benefit of procedure placement for both direct mapped and set associative caches. Finally, we propose an integrated instruction cache optimization framework by combining all the techniques together.

Chapter 9
Conclusion

9.1 Thesis Contributions

The application specific nature of embedded systems creates the opportunity to design a customized system-on-chip (SoC) platform for a particular application or an application domain. With the knowledge of application characteristics, many cache parameters can be customized to meet various design goals. This is especially true for parameterizable embedded systems. Furthermore, the program code can be transformed in various ways to fit the underlying cache architectures. The optimized memory architecture and program code can improve the performance and energy consumption significantly.

The objective of this thesis is to utilize application characteristics so as to achieve significant cache performance improvements. Application characteristics used in this thesis include basic block execution count profile (branch probability, loop bound), temporal reuse profile and intermediate blocks profile. These application characteristics are identified through profiling and exploited by our subsequent analytical approach. In this thesis, we consider both hardware (architecture) and software optimization solutions.
For hardware (architecture) solutions, we propose techniques to customize the instruction cache according to the specific temporal and spatial localities of a given application. For software solutions, we propose techniques to tailor the program to the underlying instruction cache parameters. More concretely, the contributions of this thesis are:

• a static program analysis that accurately and efficiently models the cache behavior of a specific cache configuration;
• an analytical approach that accurately explores the cache design space with multiple cache configurations in a single pass;
• precise cache modeling using the temporal reuse profile, and two static instruction cache locking algorithms for performance improvement;
• an improved procedure placement algorithm for set associative caches using the intermediate blocks profile, and an algorithm for a neutral code layout with good portability.

9.2 Future Directions

Though the techniques developed in this thesis are mainly for instruction caches, most of them can be applied to data caches. First, the probabilistic cache state proposed in this thesis is a general concept that can be used for data caches as well [81], with special optimizations for space. Second, the cache locking techniques can be used for data caches, given the corresponding program data reference trace. Finally, the procedure placement technique can be used for data caches by replacing procedures with data segments.

With the advent of multi-core architectures, the embedded computing world is moving in the direction of multiprocessing. This opens up new challenges for embedded system designers. The main challenges arise from the mapping and scheduling of parallel tasks, conflict modeling of shared resources such as cache and communication media, and timing unpredictability caused by cache warm-up due to task migration and preemption.
In this thesis, the techniques we developed are mainly targeted at a single core. When attempting to extend the local optimization techniques to global optimization, the interactions among the cores have to be taken into account. Though multi-core architectures offer better performance and energy saving opportunities, the application characteristics are not utilized and incorporated well in the current hardware and software design flow. Thus, there is a huge gap between the potential performance that can be offered by multi-core architectures and the way that they are being used. In this thesis, we consider the software transformations for a single core instruction cache. For multi-core architectures, our goal is to develop compilation techniques that can optimize high level programs for application-specific multi-core architectures by utilizing the knowledge of application and system resources (processor cores, memory units, communication bandwidth).

Bibliography

[1] 3rd Generation Intel XScale Microarchitecture Developer’s Manual. Intel, May 2007. http://www.intel.com/design/intelxscale.
[2] ADSP-BF533 Processor Hardware Reference. Analog Devices, April 2009. http://www.analog.com/static/imported-files/processor_manuals/bf533_hwr_Rev3.4.pdf.
[3] Arc International. In http://www.arccores.com, 2005.
[4] ARM Cortex A-8 Technical Reference Manual. Arm, Revised March 2004. http://www.arm.com/products/CPUs/families/ARMCortexFamily.html.
[5] ARM Embedded Processor. In http://www.arm.com, 2005.
[6] ARM1156T2-S Technical Reference Manual. Arm, Revised July 2007. http://www.arm.com/products/CPUs/families/ARM11Family.html.
[7] IBM Systems and Technology Group. PowerPC: IBM Microelectronics. http://www-306.ibm.com/chips/techlib/techlib.nsf/productfamilies/PowerPC.
[8] NIOS Embedded Processor. In http://www.altera.com, 2005.
[9] Xtensa Processor Generator. In http://www.tensilica.com, 2005.
[10] D. H. Albonesi. Selective cache ways: on-demand cache resource allocation. In MICRO 32: Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture, 1999.
[11] M. Alt et al. Cache behavior prediction by abstract interpretation. In SAS ’96: Proceedings of the Third International Symposium on Static Analysis, 1996.
[12] K. Anand and R. Barua. Instruction cache locking inside a binary rewriter. In CASES ’09: Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems, 2009.
[13] R. Arnold et al. Bounding worst-case instruction cache performance. In RTSS ’94: Proceedings of the 15th IEEE Real-Time Systems Symposium, 1994.
[14] T. Austin, E. Larson, and D. Ernst. Simplescalar: An infrastructure for computer system modeling. IEEE Computer, 35(2), 2002.
[15] R. Balasubramonian et al. Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures. In MICRO 33: Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture, 2000.
[16] T. Ball. Efficiently counting program events with support for on-line queries. ACM Transactions on Programming Languages and Systems, 16(5), 1994.
[17] S. Bartolini and C. A. Prete. Optimizing instruction cache performance of embedded systems. ACM Trans. Embed. Comput. Syst., 4(4):934–965, 2005.
[18] L. Benini et al. From architecture to layout: Partitioned memory synthesis for embedded systems-on-chip. In DAC ’01: Proceedings of the 38th annual Design Automation Conference, 2001.
[19] L. Benini, A. Macii, and M. Poncino. Energy-aware design of embedded memories: A survey of technologies, architectures, and optimization techniques. ACM Trans. Embed. Comput. Syst., 2(1):5–32, 2003.
[20] E. Berg and E. Hagersten. Statcache: a probabilistic approach to efficient and accurate data locality analysis. In ISPASS ’04: Proceedings of the 2004 IEEE International Symposium on Performance Analysis of Systems and Software, 2004.
[21] K. Beyls and E. H. D’Hollander. Reuse distance as a metric for cache behavior. In Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems, 2001.
[22] K. Beyls and E. H. D’Hollander. Reuse distance-based cache hint selection. In Euro-Par ’02: Proceedings of the 8th International Euro-Par Conference on Parallel Processing, 2002.
[23] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for architectural-level power analysis and optimizations. In ISCA ’00: Proceedings of the 27th annual international symposium on Computer architecture, 2000.
[24] B. Buck and J. K. Hollingsworth. An api for runtime code patching. Int. J. High Perform. Comput. Appl., 14(4), 2000.
[25] A. M. Campoy et al. Cache contents selection for statically-locked instruction caches: An algorithm comparison. In ECRTS ’05: Proceedings of the 17th Euromicro Conference on Real-Time Systems, 2005.
[26] C. Cascaval and D. A. Padua. Estimating cache misses and locality using stack distances. In ICS ’03: Proceedings of the 17th annual international conference on Supercomputing, 2003.
[27] S. Chatterjee et al. Exact analysis of the cache behavior of nested loops. SIGPLAN Not., 36(5):286–297, 2001.
[28] C. Ding and Y. Zhong. Predicting whole-program locality through reuse distance analysis. SIGPLAN Not., 38(5):245–257, 2003.
[29] J. Edler and M. D. Hill. Dinero iv trace-driven uniprocessor cache simulator. http://www.cs.wisc.edu/~markhill/DineroIV/.
[30] H. Falk, S. Plazar, and H. Theiling. Compile-time decided instruction cache locking using worst-case execution paths. In CODES+ISSS ’07: Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis, 2007.
[31] C. Ferdinand and R. Wilhelm. On predicting data cache behaviour for real-time systems. In ACM SIGPLAN Workshop 1998 on Languages, Compilers, and Tools for Embedded System, 1998.
[32] C. Ferdinand and R. Wilhelm. Fast and efficient cache behavior prediction for real-time systems. Real-Time Systems, 17(2/3):131–181, 1999.
[33] G. Gebhard and S. Altmeyer. Optimal task placement to improve cache performance. In EMSOFT ’07: Proceedings of the 7th ACM & IEEE international conference on Embedded software, 2007.
[34] A. Ghosh and T. Givargis. Analytical design space exploration of caches for embedded systems. In DATE ’03: Proceedings of the conference on Design, Automation and Test in Europe, 2003.
[35] A. Ghosh and T. Givargis. Cache optimization for embedded processor cores: An analytical approach. ACM Trans. Des. Autom. Electron. Syst., 9(4):419–440, 2004.
[36] S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: a compiler framework for analyzing and tuning memory behavior. ACM Trans. Program. Lang. Syst., 21(4):703–746, 1999.
[37] T. Givargis, F. Vahid, and J. Henkel. Fast cache and bus power estimation for parameterized system-on-a-chip design. In DATE ’00: Proceedings of the conference on Design, automation and test in Europe, 2000.
[38] T. Givargis, F. Vahid, and J. Henkel. System-level exploration for pareto-optimal configurations in parameterized systems-on-a-chip. In ICCAD ’01: Proceedings of the 2001 IEEE/ACM international conference on Computer-aided design, 2001.
[39] N. Gloy et al. Procedure placement using temporal ordering information. In MICRO 30: Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, 1997.
[40] N. Gloy and M. D. Smith. Procedure placement using temporal-ordering information. ACM Trans. Program. Lang. Syst., 21(5):977–1027, 1999.
[41] C. Goldfeder. Frequency-based code placement for embedded multiprocessors. In DAC ’05: Proceedings of the 42nd annual Design Automation Conference, 2005.
[42] A. Gordon-Ross et al. A one-shot configurable-cache tuner for improved energy and performance. In DATE ’07: Proceedings of the conference on Design, automation and test in Europe, 2007.
[43] A. Gordon-Ross, F. Vahid, and N. Dutt. Automatic tuning of two-level caches to embedded applications. In DATE ’04: Proceedings of the conference on Design, automation and test in Europe, 2004.
[44] A. Gordon-Ross, F. Vahid, and N. Dutt. Fast configurable-cache tuning with a unified second-level cache. In ISLPED ’05: Proceedings of the 2005 international symposium on Low power design, 2005.
[45] C. Guillon et al. Procedure placement using temporal-ordering information: dealing with code size expansion. In CASES ’04: Proceedings of the 2004 international conference on Compilers, architecture, and synthesis for embedded systems, 2004.
[46] M. R. Guthaus et al. Mibench: A free, commercially representative embedded benchmark suite. In Workshop on Workload Characterization, 2001.
[47] M. S. Haque, A. Janapsatya, and S. Parameswaran. Susesim: a fast simulation strategy to find optimal l1 cache configuration for embedded systems. In CODES+ISSS ’09: Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis, 2009.
[48] D. Hardy and I. Puaut. WCET analysis of multi-level non-inclusive set-associative instruction caches. In RTSS ’08: Proceedings of the 2008 Real-Time Systems Symposium, 2008.
[49] J. S. Harper, D. J. Kerbyson, and G. R. Nudd. Analytical modeling of set-associative cache behavior. IEEE Trans. Comput., 48(10):1009–1024, 1999.
[50] A. H. Hashemi, D. R. Kaeli, and B. Calder. Efficient procedure mapping using cache line coloring. SIGPLAN Not., 32(5):171–182, 1997.
[51] J. L. Hennessy and D. A. Patterson. Computer architecture: a quantitative approach. Morgan Kaufmann Publishers Inc., 2002.
[52] M. D. Hill and A. J. Smith. Evaluating associativity in cpu caches. IEEE Transactions on Computers, 38(12), 1989.
[53] W. W. Hwu and P. P. Chang. Achieving high instruction cache performance with an optimizing compiler. SIGARCH Comput. Archit. News, 17(3):242–251, 1989.
[54] A. Janapsatya et al. Instruction trace compression for rapid instruction cache simulation. In DATE ’07: Proceedings of the conference on Design, Automation and Test in Europe, 2007.
[55] J. Kalamatianos et al. Analysis of temporal-based program behavior for improved instruction cache performance. IEEE Trans. Comput., 48(2):168–175, 1999.
[56] J. Kin et al. Power efficient mediaprocessors: design space exploration. In DAC ’99: Proceedings of the 36th annual ACM/IEEE Design Automation Conference, 1999.
[57] X. F. Li et al. Design space exploration of caches using compressed traces. In ICS ’04: Proceedings of the 18th annual international conference on Supercomputing, 2004.
[58] Y. Li et al. Hardware-software co-design of embedded reconfigurable architectures. In DAC ’00: Proceedings of the 37th annual Design Automation Conference, 2000.
[59] Y. Li and J. Henkel. A framework for estimation and minimizing energy dissipation of embedded hw/sw systems. In DAC ’98: Proceedings of the 35th annual Design Automation Conference, 1998.
[60] Y. Li and W. Wolf. Hardware/software co-synthesis with memory hierarchies. In ICCAD ’98: Proceedings of the 1998 IEEE/ACM international conference on Computer-aided design, 1998.
[61] Y. S. Li, S. Malik, and A. Wolfe. Performance estimation of embedded software with instruction cache modeling. ACM Trans. Des. Autom. Electron. Syst., 4(3):257–279, 1999.
[62] Y. Liang and T. Mitra. Instruction cache locking using temporal reuse profile. In DAC ’10: Proceedings of the 47th annual Design Automation Conference.
[63] Y. Liang and T. Mitra. Cache modeling in probabilistic execution time analysis. In DAC ’08: Proceedings of the 45th annual Design Automation Conference, 2008.
[64] Y. Liang and T. Mitra. Static analysis for fast and accurate design space exploration of caches. In CODES+ISSS ’08: Proceedings of the 6th IEEE/ACM/IFIP international conference on Hardware/Software codesign and system synthesis, 2008.
[65] S. Lim et al. An accurate worst case timing analysis for risc processors. IEEE Trans. Softw. Eng., 21(7):593–604, 1995.
[66] T. Liu, M. Li, and C. J. Xue. Minimizing WCET for real-time embedded systems via static instruction cache locking. In RTAS ’09: Proceedings of the 15th IEEE Real-Time and Embedded Technology and Applications Symposium, 2009.
[67] P. Lokuciejewski, H. Falk, and P. Marwedel. WCET-driven cache-based procedure positioning optimizations. In ECRTS ’08: Proceedings of the 2008 Euromicro Conference on Real-Time Systems, 2008.
[68] R. L. Mattson et al. Evaluation techniques for storage hierarchies. IBM Systems Journal, 9(2), 1970.
[69] P. Mishra, M. Mamidipaka, and N. Dutt. Processor-memory coexploration using an architecture description language. ACM Trans. Embed. Comput. Syst., 3(1):140–162, 2004.
[70] J. Montanaro et al. A 160-mhz, 32-b, 0.5-w cmos risc microprocessor. Digital Tech. J., 9(1), 1997.
[71] F. Mueller. Timing analysis for instruction caches. Real-Time Syst., 18(2-3):217–247, 2000.
[72] N. Nguyen, A. Dominguez, and R. Barua. Memory allocation for embedded systems with a compile-time-unknown scratch-pad size. In CASES ’05: Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems, 2005.
[73] M. Palesi and T. Givargis. Multi-objective design space exploration using genetic algorithms. In CODES ’02: Proceedings of the tenth international symposium on Hardware/software codesign, 2002.
[74] P. R. Panda, N. D. Dutt, and A. Nicolau. Architectural exploration and optimization of local memory in embedded systems. In ISSS ’97: Proceedings of the 10th international symposium on System synthesis, 1997.
[75] P. R. Panda et al. Data and memory optimization techniques for embedded systems. ACM Trans. Des. Autom. Electron. Syst., 6(2):149–206, 2001.
[76] S. Parameswaran and J. Henkel. I-copes: fast instruction code placement for embedded systems to improve performance and energy efficiency. In ICCAD ’01: Proceedings of the 2001 IEEE/ACM international conference on Computer-aided design, 2001.
[77] D. A. Patterson and J. L. Hennessy. Computer Organization and Design: The Hardware/software Interface. Morgan Kaufmann, 1998.
[78] P. Petrov and A. Orailoğlu. Towards effective embedded processors in codesigns: customizable partitioned caches. In CODES ’01: Proceedings of the ninth international symposium on Hardware/software codesign, 2001.
[79] K. Pettis and R. C. Hansen. Profile guided code positioning. SIGPLAN Not., 25(6):16–27, 1990.
[80] I. Puaut and D. Decotigny. Low-complexity algorithms for static cache locking in multitasking hard real-time systems. In RTSS ’02: Proceedings of the 23rd IEEE Real-Time Systems Symposium, 2002.
[81] V. Puranik, T. Mitra, and Y. N. Srikant. Probabilistic modeling of data cache behavior. In EMSOFT ’09: Proceedings of the seventh ACM international conference on Embedded software, 2009.
[82] P. Puschner and A. Burns. Guest editorial: A review of worst-case execution-time analysis. Real-Time Syst., 18(2-3):115–128, 2000.
[83] G. Rajaram and V. Rajaraman. A probabilistic method for calculating hit ratios in direct mapped caches. Journal of Network and Computer Applications, 19(3), 1996.
[84] J. Robertson and K. Gala. Instruction and data cache locking on the e300 processor core. Freescale Semiconductor, Inc., 2006.
[85] R. Sen and Y. N. Srikant. WCET estimation for executables in the presence of data caches. In EMSOFT ’07: Proceedings of the 7th ACM & IEEE international conference on Embedded software, 2007.
[86] W. Shiue and C. Chakrabarti. Memory exploration for low power, embedded systems. In DAC ’99: Proceedings of the 36th annual Design Automation Conference, 1999.
[87] W. Shiue, S. Udayanarayanan, and C. Chakrabarti. Data memory design and exploration for low-power embedded systems. ACM Trans. Des. Autom. Electron. Syst., 6(4):553–568, 2001.
[88] A. Shrivastava, I. Issenin, and N. Dutt. Compilation techniques for energy reduction in horizontally partitioned cache architectures. In CASES ’05: Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems, 2005.
[89] S. J. E. Wilton and N. P. Jouppi. Cacti: An enhanced cache access and cycle time model. IEEE Journal of Solid-State Circuits, 31:677–688, 1996.
[90] R. A. Sugumar and S. G. Abraham. Set-associative cache simulation using generalized binomial trees. ACM Transactions on Computer Systems, 13(1), 1995.
[91] V. Suhendra et al. WCET centric data allocation to scratchpad memory. In RTSS ’05: Proceedings of the 26th IEEE International Real-Time Systems Symposium, 2005.
[92] V. Suhendra and T. Mitra. Exploring locking & partitioning for predictable shared caches on multi-cores. In DAC ’08: Proceedings of the 45th annual Design Automation Conference, 2008.
[93] L. Thiele and R. Wilhelm. Design for timing predictability. Real-Time Syst., 28(2-3):157–177, 2004.
[94] H. Tomiyama and H. Yasuura. Code placement techniques for cache miss rate reduction. ACM Trans. Des. Autom. Electron. Syst., 2(4):410–429, 1997.
[95] R. A. Uhlig and T. N. Mudge. Trace-driven memory simulation: a survey. ACM Comput. Surv., 29(2):128–170, 1997.
[96] A. V. Veidenbaum et al. Adapting cache line size to application behavior. In ICS ’99: Proceedings of the 13th international conference on Supercomputing, 1999.
[97] X. Vera, B. Lisper, and J. Xue. Data cache locking for higher program predictability. In SIGMETRICS ’03: Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, 2003.
[98] X. Vera, B. Lisper, and J. Xue. Data caches in multitasking hard real-time systems. In RTSS ’03: Proceedings of the 24th IEEE International Real-Time Systems Symposium, 2003.
[99] P. Viana et al. Configurable cache subsetting for fast cache tuning. In DAC ’06: Proceedings of the 43rd annual Design Automation Conference, 2006.
[100] W. H. Wang and J. L. Baer. Efficient trace-driven simulation methods for cache performance analysis. ACM Trans. Comput. Syst., 9(3):222–241, 1991.
[101] R. T. White et al. Timing analysis for data caches and set-associative caches. In RTAS ’97: Proceedings of the 3rd IEEE Real-Time Technology and Applications Symposium (RTAS ’97), 1997.
[102] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In PLDI ’91: Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation, 1991.
[103] Z. Wu and W. Wolf. Iterative cache simulation of embedded cpus with trace stripping. In CODES ’99: Proceedings of the seventh international workshop on Hardware/software codesign, 1999.
[104] L. Xue, O. Ozcan, and K. Mahmut. In DAC ’07: Proceedings of the 44th annual Design Automation Conference.
[105] H. Yang et al. Improving power efficiency with compiler-assisted cache replacement. J. Embedded Comput., 1(4):487–499, 2005.
[106] C. Zhang, F. Vahid, and W. Najjar. A highly configurable cache architecture for embedded systems. SIGARCH Comput. Archit. News, 31(2), 2003.
[107] E. Zitzler, K. Deb, and L. Thiele. Comparison of multiobjective evolutionary algorithms: Empirical results. Evol. Comput., 8(2):173–195, 2000.

[...]

... memory and the cache. A cache is divided into K sets. Each cache set, in turn, is divided into A cache blocks, where A is the associativity of the cache. For a direct-mapped cache A = 1, for a set-associative cache A > 1, and for a fully associative cache K = 1. In other words, a direct-mapped cache has only one cache block per set, whereas a fully-associative cache has only one cache set. Now the cache size ...
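
The cache organization described in the excerpt above can be illustrated with a minimal Python sketch. It is not code from the thesis: the class and field names are hypothetical, and the line size parameter, the capacity relation (sets times ways times line size) and the modulo placement of memory blocks into sets are the conventional textbook choices, assumed here because the excerpt is truncated before it states them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    """Hypothetical container for the cache parameters named in the text."""
    num_sets: int        # K: number of cache sets
    associativity: int   # A: cache blocks per set
    line_size: int       # bytes per cache block (assumed parameter)

    @property
    def size_bytes(self) -> int:
        # Capacity under the usual relation: K sets x A blocks x line size.
        return self.num_sets * self.associativity * self.line_size

    def block_number(self, address: int) -> int:
        # Memory block that contains the given byte address.
        return address // self.line_size

    def set_index(self, address: int) -> int:
        # Assumed conventional placement: block number modulo number of sets.
        return self.block_number(address) % self.num_sets

    def kind(self) -> str:
        # Classification from the text; A = 1 is checked first, so the
        # degenerate one-block cache (K = 1, A = 1) reports direct-mapped.
        if self.associativity == 1:
            return "direct-mapped"
        if self.num_sets == 1:
            return "fully associative"
        return "set-associative"


# Three ways to organize the same capacity with 32-byte lines.
direct = CacheConfig(num_sets=128, associativity=1, line_size=32)
two_way = CacheConfig(num_sets=64, associativity=2, line_size=32)
full = CacheConfig(num_sets=1, associativity=128, line_size=32)

for cfg in (direct, two_way, full):
    print(cfg.kind(), cfg.size_bytes, cfg.set_index(0x1234))
```

All three configurations hold 4096 bytes; they differ only in how that capacity is split into sets and blocks per set, which is the dimension the design space exploration varies while the line size is held constant.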
... the cache instead of main memory, which consumes more power and incurs a longer delay per access. In this thesis, we focus on the instruction cache, which is present in almost all embedded systems. The instruction cache is one of the foremost power-consuming and performance-determining microarchitectural features of modern embedded systems, as instructions are fetched almost every clock cycle. For example, instruction ...

... related to instruction cache exploration and optimization for embedded systems. Chapter 4 presents a static program analysis technique to model the cache behavior of a particular application. Chapter 5 extends the static program analysis of Chapter 4 for efficient instruction cache design space exploration. Chapter 6 discusses employing cache locking to improve the average-case execution time of general embedded ...

... we will summarize the techniques that employ cache locking to improve the timing predictability of hard real-time systems and the average-case performance of general embedded systems.

3.3.1 Hard Real-time Systems

Instruction cache locking has been employed in hard real-time systems for better timing predictability [80, 25, 30, 66]. In hard real-time systems, worst case execution time (WCET) is an ...

... will be cache hits. However, most existing cache locking techniques are proposed for improving the predictability of hard real-time systems. Using cache locking to improve the performance of general embedded systems has not been explored. We observe that cache locking can be quite effective in improving the average-case execution time of general embedded applications as well. We propose precise cache modeling ...
... optimizing cache memory design for embedded systems has received a lot of attention from the research community [75, 106, 10, 15, 86, 104, 69, 59, 60, 88, 18, 78, 19, 96]. In this thesis, we focus on design space exploration of caches, that is, determining the best instruction cache parameters from the vast number of cache configurations for a given application, and on software optimizations, namely instruction cache locking ...

... be tailored for the specific cache architectures. Cache-aware program transformations allow the modified application to utilize the underlying cache more efficiently. For architecture customization, the system designer can choose an on-chip cache configuration that is suited to a particular application and customize the caches for it. However, the cache design parameters include the size of the cache, the ...

... the cache returned from design space exploration may be too big. Hence, we also consider software-based instruction cache optimization techniques to further improve performance. For software solutions, since the underlying instruction cache parameters are known, the program code can be appropriately tailored for the specific cache architecture. More concretely, for software optimizations, ...

... 27% of the total power is spent by the instruction cache for the StrongARM 110 processor [70]. Thus, careful tuning and optimization of instruction cache memory can lead to significant performance gain and energy saving. Instruction cache performance can be improved via hardware (architecture) means and software means. From an architectural perspective, caches can be customized for the specific temporal and spatial ...
Design Space Exploration of Caches

One of the most effective cache optimizations is to tune cache parameters for the specific application. The tuning process is done through cache design space exploration. More concretely, for an application-specific embedded system, we can choose a specific cache configuration from the huge cache design space to meet the design constraints (i.e., performance, energy and hardware ...

... the performance gap between the fast processor and the slow main memory. In particular, the instruction cache, which is employed by most embedded systems, is one of the foremost power-consuming and performance ...

... optimization of instruction cache memory can lead to significant performance gain and energy saving. The objective of this thesis is to exploit application characteristics for instruction cache optimizations.
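As a rough illustration of what design space exploration involves, the sketch below exhaustively simulates every candidate configuration on an address trace and keeps the one with the fewest misses under a size budget. This brute-force, trace-driven search is a naive baseline chosen for clarity; it is not the static-analysis-based method the thesis develops. The function names, the LRU replacement policy, the synthetic trace, and the candidate parameter values are all assumptions made for the example.

```python
from collections import OrderedDict
from itertools import product


def count_misses(trace, num_sets, associativity, block_size):
    """Simulate an LRU set-associative cache and return the number of misses."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for address in trace:
        block = address // block_size
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == associativity:
                lines.popitem(last=False)  # evict the least recently used block
            lines[block] = True
    return misses


def explore(trace, set_options, way_options, block_options, max_size_bytes):
    """Try every configuration that fits within the size budget.

    Returns (misses, size_bytes, (num_sets, associativity, block_size)) for the
    configuration with the fewest misses. Ties go to the smaller cache, and
    remaining ties are broken by ordinary tuple ordering of the parameters.
    """
    best = None
    for k, a, b in product(set_options, way_options, block_options):
        size = k * a * b
        if size > max_size_bytes:
            continue
        candidate = (count_misses(trace, k, a, b), size, (k, a, b))
        if best is None or candidate < best:
            best = candidate
    return best


# Hypothetical instruction-fetch trace: a loop body of 48 four-byte
# instructions starting at 0x1000, executed 10 times.
trace = [0x1000 + 4 * i for i in range(48)] * 10
best = explore(trace, [1, 2, 4, 8], [1, 2, 4], [8, 16, 32], max_size_bytes=256)
```

Even this toy search space has 36 configurations, each requiring a full pass over the trace, which hints at why simulating every point becomes expensive for realistic parameter ranges and long traces.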
