Instruction Cache Optimizations in Embedded Real-Time Systems

INSTRUCTION CACHE OPTIMIZATIONS IN EMBEDDED REAL-TIME SYSTEMS

DING HUPING
(B.Eng., Harbin Institute of Technology)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2013

Acknowledgements

First of all, my gratitude goes to my Ph.D. advisor, Prof. Tulika Mitra, for her persistent and generous guidance on my research. She is full of wisdom, and I have benefited a lot from her insightful comments and advice. I also thank her for her patience and encouragement during my study, especially in difficult times. She also offered me a research assistant position in the last year of my study. Without her help, this thesis would not have been possible.

I would like to thank my thesis committee members for their time and valuable comments.

I would like to express my sincere gratitude to Prof. Wong Weng-Fai for his guidance in the early stage of my Ph.D. study. He is generous and kind, and helped me a lot.

I am also grateful to Dr. Liang Yun of Peking University for our research collaborations. I collaborated with him on most of my research work, and it has been a great pleasure to work with him.

I also thank my friends and lab mates, Sudipta Chattopadhyay, Wang Chundong, Qi Dawei, Chen Jie, Chen Liang, Mihai Pricopi and Thannirmalai Somu Muthukaruppan, for their help in the research work and the fun in daily life.

I also give my sincere gratitude to my girlfriend Fu Qinqin, the beautiful and thoughtful girl, for being together with me for over four years. She brought me happiness during my Ph.D. study and encourages me to pursue my dreams. I thank her for her patience and great love.

I also want to thank my parents and my little sister. They have always supported me in pursuing my dreams. I thank them for their support, encouragement and great love.

The work presented in this thesis was partially supported by Singapore Ministry of Education Academic Research Fund Tier 2, MOE2009-T2-1-033.
Contents

Acknowledgements
Contents
Abstract
List of Publications
List of Tables
List of Figures

1 Introduction
    1.1 Embedded Real-time Systems
    1.2 Cache Modeling and Optimization
        1.2.1 Cache in Uni-Processor
        1.2.2 Shared Cache in Multi-core Processors
    1.3 Research Aims
    1.4 Thesis Contributions
    1.5 Thesis Organization

2 Background
    2.1 Cache
    2.2 Cache Locking
    2.3 Worst-case Execution Time Computation
        2.3.1 Micro-architectural Modeling
        2.3.2 Program Path Analysis

3 Literature Review
    3.1 Cache Analysis in Uni-processor
        3.1.1 Intra-task Cache Conflict Analysis
        3.1.2 Inter-task Cache Interference Analysis
    3.2 Cache Analysis in Multi-core
    3.3 Cache Locking
        3.3.1 Cache Locking for Single Task
        3.3.2 Cache Locking in Multitasking
    3.4 Memory Optimizations in Multi-core Processors
    3.5 Other Optimizations for Worst-case Performance
        3.5.1 Cache Partitioning
        3.5.2 Code Layout Optimization
        3.5.3 Scratchpad Memory

4 Partial Cache Locking for Single Task
    4.1 Overview
    4.2 Motivating Example
    4.3 Cache Modeling
        4.3.1 Cache States
    4.4 Partial Cache Locking Algorithms
        4.4.1 Optimal solution with concrete cache states
        4.4.2 Heuristic with abstract cache states
    4.5 Experimental Evaluation
        4.5.1 Experimental Setup
        4.5.2 Partial Cache Locking vs. Static Analysis
        4.5.3 Partial versus Full Cache Locking
        4.5.4 Impact of Different Associativity
        4.5.5 Impact of Different Block Sizes
        4.5.6 Optimal vs. Heuristic Approach
        4.5.7 Percentage of Lines Locked
    4.6 Discussion
    4.7 Summary

5 Partial Cache Locking for Multitasking
    5.1 Overview
    5.2 Motivating Example
        5.2.1 WCET Comparison of Various Locking Schemes
        5.2.2 Scheduling Results of RMS
    5.3 System Model
    5.4 Framework Overview
    5.5 WCET and CRPD Analysis
        5.5.1 Intra-Task WCET
        5.5.2 Inter-Task CRPD
    5.6 Locking Algorithm for Multitasking
        5.6.1 Cost-benefit analysis within a task
        5.6.2 Cost-benefit analysis of other tasks
        5.6.3 Memory block selection strategy
        5.6.4 Integrated Locking + Analysis Algorithms
    5.7 Experimental Evaluation
        5.7.1 Experiments Setup
        5.7.2 CPU Utilization Comparison
        5.7.3 Response Time Speed-up
        5.7.4 CPU Utilization Breakdown
        5.7.5 Unlocked Cache Space
        5.7.6 Runtime of Our Approach
    5.8 Discussion
    5.9 Summary

6 Dynamic Cache Locking
    6.1 Overview
    6.2 Motivating Example
    6.3 Cache Modeling and Locking
        6.3.1 Cache Modeling
        6.3.2 Cache Locking Mechanism
    6.4 Dynamic Cache Locking Algorithm
        6.4.1 Framework Overview
        6.4.2 WCET Analysis
        6.4.3 Resilience Analysis
        6.4.4 Locking Slot Analysis
        6.4.5 Memory Block Selection
        6.4.6 Complexity Analysis
    6.5 Experimental Evaluation
        6.5.1 Experimental Setup
        6.5.2 Comparison with Static Approaches
        6.5.3 Comparison with Region-based Approach
        6.5.4 Runtime of Different Methods
    6.6 Discussion
    6.7 Summary

7 Cache Locking for Shared Cache Multi-core Processors
    7.1 Overview
    7.2 Motivating Example for Task Mapping
    7.3 Task Model and System Architecture
    7.4 Task Mapping Framework Overview
    7.5 Components of the Task Mapping Framework
        7.5.1 Intra-Task Cache Analysis
        7.5.2 WCRT Estimation
        7.5.3 ILP Formulation for Task Mapping
    7.6 Cache Locking in Multi-core Processors
        7.6.1 Locking Mechanisms
        7.6.2 Locking Algorithm for Multi-core Processors
    7.7 Experimental Evaluation
        7.7.1 Experimental Setup
        7.7.2 DEBIE Case Study
        7.7.3 Synthetic Task Graphs
        7.7.4 Impact of Different Number of Cores
        7.7.5 L1 Block Size vs. L2 Block Size
    7.8 Discussion
    7.9 Summary

8 Conclusion
    8.1 Thesis Contribution
    8.2 Future Directions

Bibliography

Abstract

Applications in embedded real-time systems are required to meet their timing constraints. A missed deadline in a hard real-time system can have catastrophic effects. Thus, the worst-case performance of applications plays an important role in the schedulability of hard real-time systems.
However, due to the existence of micro-architectural features, such as caches, worst-case timing analysis becomes intractable. Caches are widely employed in modern embedded real-time systems. They bridge the performance gap between the fast CPU and the slow off-chip memory. However, they also introduce timing unpredictability in real-time systems, as it is not known statically whether a memory block is in the cache or in the main memory.

Existing approaches to the timing unpredictability of caches usually employ static cache analysis or cache locking techniques. Cache analysis statically models the cache behavior. However, it may not produce accurate results because it relies on conservative estimation. Cache locking locks the entire cache with selected memory blocks and guarantees predictable timing. Nevertheless, such an aggressive locking technique may have a negative impact on the execution time, as the unlocked memory blocks cannot reside in the cache and exploit their locality.

In this thesis, we propose a partial cache locking technique to optimize the worst-case performance of embedded real-time systems. Partial cache locking locks only a part of the cache space, while the rest of the cache remains free and can be used by the unlocked memory blocks to exploit their cache locality. Thus, static cache analysis is still required for the unlocked cache space, while the locked cache contents are selected through accurate cost-benefit analysis. By integrating static cache analysis and cache locking, our partial cache locking approach can achieve the best of these two techniques.

We first explore cache optimization in uni-processors. We propose static partial instruction cache locking for a single task to minimize the WCET (Worst-case Execution Time), where intra-task cache conflicts are carefully handled.
An optimal approach based on concrete cache state analysis and a time-efficient heuristic method based on abstract cache analysis are developed to select the cache contents. Substantial improvement in WCET is achieved, compared to the state-of-the-art static cache analysis approach and the full cache locking method. We extend our approach to multitasking real-time systems, where both intra-task cache conflicts and inter-task interference are considered. Our approach takes the global effects on all tasks into account and selects the memory blocks that are most beneficial in improving the schedulability/utilization.

Subsequently, we explore dynamic cache locking for a single task. We propose a loop-based dynamic partial cache locking approach to minimize the WCET. Our approach can better capture the dynamic program behavior, compared to static cache locking. An ILP (Integer Linear Programming) formulation with global optimization is developed to allocate the amount of locked cache space for each loop, and the most beneficial memory blocks are selected to fill this space.

Finally, we also apply partial cache locking in multi-core processors with shared cache, where the inter-core cache interference from concurrently executing tasks must also be carefully handled. Prior to cache locking, an ILP-based task mapping approach is proposed to optimize the WCRT (Worst-case Response Time) of multitasking applications. Based on the generated task mapping, we lock the memory blocks in the private L1 cache, which not only reduces the number of cache misses in the L1 cache but also reduces the number of accesses to the L2 cache. Experimental evaluation shows further improvement in WCRT for multitasking applications via cache locking.

In summary, this thesis proposes and studies partial instruction cache locking in the context of different architectures and system models in embedded real-time systems.
The worst-case performance of the applications is greatly improved, compared to the existing approaches.

List of Publications

• WCET-Centric Partial Instruction Cache Locking. Huping Ding, Yun Liang and Tulika Mitra. In Proceedings of the 49th Annual Design Automation Conference (DAC ’12), June 2012.
• Timing Analysis of Concurrent Programs Running on Shared Cache Multi-cores. Yun Liang, Huping Ding, Tulika Mitra, Abhik Roychoudhury, Yan Li, Vivy Suhendra. Real-Time Systems Journal, Volume 48, Issue 6, 2012.
• Shared Cache Aware Task Mapping for WCRT Minimization. Huping Ding, Yun Liang and Tulika Mitra. In Proceedings of the 18th Asia and South Pacific Design Automation Conference (ASP-DAC ’13), January 2013.
• Integrated Instruction Cache Analysis and Locking in Multitasking Real-time Systems. Huping Ding, Yun Liang and Tulika Mitra. In Proceedings of the 50th Annual Design Automation Conference (DAC ’13), June 2013.
• WCET-Centric Dynamic Instruction Cache Locking. Huping Ding, Yun Liang and Tulika Mitra. In Proceedings of Design Automation and Test in Europe (DATE ’14), March 2014.

[...] block size can also outperform locking at L2 block size granularity, e.g., for the DEBIE benchmark. Locking at L2 block size can completely eliminate accesses to the corresponding L2 memory block in the L2 cache once a memory block is locked, which reduces shared L2 cache interference. On the other hand, locking at L1 block size granularity is more fine-grained and flexible than locking at L2 block size granularity.

7.8 Discussion

A two-step framework is proposed to improve the WCRT for multitasking applications in multi-core processors with shared cache. However, we only consider homogeneous multi-core processors, and the tasks are assumed to execute in a non-preemptive fashion. In the task mapping step, we approximate the interference modeling and the WCET computation modeling, in order to reduce the complexity of the ILP formulation.
Thus, the resulting task mapping may occasionally be suboptimal.

7.9 Summary

In this chapter, we perform cache locking in multi-core processors with shared cache. Prior to cache locking, a cache-aware task mapping approach is proposed to minimize the WCRT of concurrent tasks. Caches are modeled through abstract interpretation, and an ILP formulation is employed for task mapping. Both the cache conflicts in the L2 cache and the workload balance are considered in our approach. Cache locking further reduces the WCRT using the resultant task mapping. We statically lock the memory blocks in the private L1 cache for each task. Both L1 block size granularity and L2 block size granularity are explored. Our cache locking approach not only reduces the number of cache misses in the private L1 cache but also reduces the number of accesses to the shared L2 cache.

Experimental results with both synthetic task graphs and real-world benchmarks show that both the task mapping approach and the cache locking technique substantially improve the WCRT. Our task mapping approach returns the best task mapping in most of the cases, and it is more runtime-efficient than an exhaustive enumeration approach that produces the optimal solution. Cache locking is complementary to task mapping and further improves the WCRT.

Chapter 8

Conclusion

8.1 Thesis Contribution

Timing constraints are an important feature of embedded real-time systems. Applications in real-time systems are required to meet their deadlines in order to guarantee proper functioning. Worst-case performance thus becomes a crucial metric in the schedulability analysis of real-time systems. In this thesis, we study cache optimizations by proposing partial cache locking in embedded real-time systems, in order to improve the worst-case performance of applications.

Our partial cache locking integrates static cache analysis and cache locking.
With partial cache locking, only a portion of the cache is locked with memory blocks, while the free cache space can still be used by the unlocked memory blocks to exploit their cache locality. Accurate cost-benefit analysis is performed based on static cache analysis, in order to select the most beneficial memory blocks that can minimize the worst-case execution time. Our partial cache locking achieves the best of both static cache analysis and cache locking.

Our partial cache locking is studied in different architectures and system models in embedded real-time systems. In uni-processors, static partial instruction cache locking is first developed for a single task, in order to improve the WCET. We carefully model the intra-task cache conflicts, as well as the cost and benefit of locking a memory block. An optimal approach based on concrete cache states and a time-efficient heuristic method based on abstract cache states are proposed to select the memory blocks that are most beneficial in improving the WCET.

Then, we extend partial cache locking to multitasking real-time systems, where both intra-task and inter-task cache conflicts are carefully considered. With our approach, each task may lock a portion of the cache, while there is still unlocked cache space that is shared by all the tasks in a time-multiplexed style. As the cache is shared by all the tasks, locking a memory block has a global effect and can impact both WCET and CRPD. We propose a greedy selection method that iteratively selects the memory blocks while accounting for the global effect of cache locking. Our approach improves the schedulability/utilization for both RMS (Rate Monotonic Scheduling) and EDF (Earliest Deadline First) scheduling policies.

Static partial cache locking is also extended to dynamic cache locking in this thesis, in order to further improve the WCET for a single task.
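As a rough illustration of why lower WCET estimates translate into better schedulability under RMS and EDF, the sketch below applies the classical utilization tests of Liu and Layland [71]: a sufficient bound of n(2^(1/n) - 1) for RMS and a bound of 1 for EDF. It assumes independent periodic tasks with deadlines equal to periods and WCET values that already include any CRPD. The task sets and function names are hypothetical and are not taken from the thesis experiments.

```python
def utilization(tasks):
    """Total CPU utilization for tasks given as (wcet, period) pairs."""
    return sum(wcet / period for wcet, period in tasks)


def rms_bound(n):
    """Liu and Layland utilization bound for n tasks under RMS (sufficient only)."""
    return n * (2 ** (1 / n) - 1)


def passes_rms_test(tasks):
    """True if the task set passes the sufficient RMS utilization test."""
    return utilization(tasks) <= rms_bound(len(tasks))


def passes_edf_test(tasks):
    """True if total utilization does not exceed 1 (EDF, deadline = period)."""
    return utilization(tasks) <= 1.0


# Hypothetical (wcet, period) pairs before and after a WCET reduction.
before_locking = [(4, 10), (6, 20), (10, 40)]
after_locking = [(3, 10), (4, 20), (8, 40)]

for name, tasks in (("before", before_locking), ("after", after_locking)):
    print(name,
          "U = %.2f" % utilization(tasks),
          "RMS test:", passes_rms_test(tasks),
          "EDF test:", passes_edf_test(tasks))
```

In this made-up example the first task set exceeds the RMS bound for three tasks while the second stays below it. Failing the RMS utilization test does not by itself prove that a task set is unschedulable, since the bound is only sufficient.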
Compared to region-based approaches that partition the program into different regions, we propose a flexible loop-based dynamic cache locking approach. Our approach locks the memory blocks at the entry point of a loop and unlocks them at the corresponding exit point. Memory blocks from the same loop can be locked at different program points, taking the global optimization of the WCET into account. Thus, we not only select the memory blocks to be locked, but also decide the program points where they should be locked.

Finally, we also study partial cache locking in multi-core processors with shared cache. Inter-core cache conflicts are considered due to the existence of the shared cache. A two-step framework is proposed to minimize the WCRT. Prior to cache locking, a task mapping method is adopted to minimize the WCRT. The task mapping approach considers both the workload balance and the shared cache conflicts. Based on the resultant task mapping, we further improve the WCRT via partial cache locking.

8.2 Future Directions

The data cache, which provides fast access to program data, is as important as the instruction cache in embedded real-time systems. Although we only study instruction cache optimizations for embedded real-time systems in this thesis, our techniques can be extended to the data cache. Our partial cache locking technique relies on static cache analysis to select the beneficial memory blocks to lock. Recently, Huynh et al. [51] proposed a scope-aware static data cache analysis method based on persistence analysis. They adopt the data address analysis technique proposed in [108]. With the data address analysis framework and the scope-aware abstract cache state analysis, we believe that our partial cache locking technique can be easily extended to the data cache.

As we have mentioned, cache locking can reduce cache conflicts. However, cache locking cannot completely eliminate them.
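To make the residual-conflict claim concrete, the following minimal Python sketch simulates a single set of a 2-way LRU cache that is accessed cyclically by three conflicting blocks, which is the scenario discussed next. The function name simulate_set, the ten-iteration trace, and the assumption that a locked block is preloaded (so it never misses) are illustrative choices and not part of the thesis framework.

```python
from collections import OrderedDict


def simulate_set(accesses, ways, locked=()):
    """Count misses in one LRU cache set.

    Blocks in `locked` are assumed to be preloaded and are never evicted;
    the remaining ways are managed with LRU replacement.
    """
    free_ways = ways - len(locked)
    if free_ways < 0:
        raise ValueError("more locked blocks than ways")
    lru = OrderedDict()  # unlocked blocks, least recently used first
    misses = 0
    for block in accesses:
        if block in locked:
            continue  # locked block: always a hit
        if block in lru:
            lru.move_to_end(block)  # hit: refresh recency
            continue
        misses += 1
        if free_ways == 0:
            continue  # no unlocked way left, block cannot be cached
        if len(lru) >= free_ways:
            lru.popitem(last=False)  # evict the least recently used block
        lru[block] = True
    return misses


# Hypothetical loop that touches m1, m2, m3 once per iteration, 10 iterations.
trace = ["m1", "m2", "m3"] * 10

misses_unlocked = simulate_set(trace, ways=2)
misses_m1_locked = simulate_set(trace, ways=2, locked=("m1",))
print("no locking:", misses_unlocked, "misses")
print("m1 locked: ", misses_m1_locked, "misses")
```

Under these assumptions every access misses when nothing is locked, whereas locking m1 removes its misses but leaves m2 and m3 evicting each other in the single remaining way.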
Suppose there are three memory blocks m1, m2 and m3 in a loop, and they are mapped to the same cache set of a 2-way set-associative cache. Clearly, these three memory blocks conflict in the cache set. When we lock one of them, all accesses to the locked memory block are cache hits, while the other two memory blocks can still conflict with each other. Recently, compiler-assisted code positioning approaches have been proposed to optimize the WCET [76, 37]. These methods change the layout of the program code, which may completely eliminate some of the cache conflicts. Thus, we believe that a combined approach of partial cache locking and code positioning may produce better results.

In this thesis, we address the trade-off between performance and predictability. Our approach not only guarantees timing predictability but also improves the worst-case performance.

Bibliography

[1] ARM920T technical reference manual. http://infocenter.arm.com.
[2] ARM940T technical reference manual. http://infocenter.arm.com.
[3] IBM ILOG CPLEX Optimizer. http://www-01.ibm.com/software/commerce/optimization/cplex-optimizer.
[4] PowerPC 440 embedded core. https://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_440_Embedded_Core.
[5] IDT 79RC64574/RC64575 user reference manual, Mar. 2000. Integrated Device Technology.
[6] Intel XScale core developers manual. http://www.intel.com/design/intelxscale, Jan. 2004.
[7] ADSP-BF53x/BF56x Blackfin processor programming reference, Feb. 2006. Analog Devices, Inc.
[8] AbsInt. aiT: worst-case execution time analyzers. http://www.absint.com/ait/.
[9] S. Altmeyer, C. Maiza, and J. Reineke. Resilience analysis: tightening the CRPD bound for set-associative caches. In Proceedings of the ACM SIGPLAN/SIGBED 2010 conference on Languages, compilers, and tools for embedded systems, LCTES ’10, pages 153–162, 2010.
[10] S. Altmeyer and C. Maiza Burguière. Cache-related preemption delay via useful cache blocks: Survey and redefinition.
J. Syst. Archit., 57(7):707–719, Aug. 2011.
[11] K. Anand and R. Barua. Instruction cache locking inside a binary rewriter. In Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems, CASES ’09, pages 185–194, 2009.
[12] J. H. Anderson and J. M. Calandrino. Parallel task scheduling on multicore platforms. SIGBED Rev., 3(1):1–6, Jan. 2006.
[13] J. H. Anderson, J. M. Calandrino, and U. C. Devi. Real-time scheduling on multicore platforms. In Proceedings of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS ’06, pages 179–190, 2006.
[14] L. C. Aparicio, J. Segarra, C. Rodríguez, and V. Viñals. Improving the WCET computation in the presence of a lockable instruction cache in multitasking real-time systems. J. Syst. Archit., 57(7):695–706, Aug. 2011.
[15] A. Arnaud and I. Puaut. Dynamic instruction cache locking in hard real-time systems. In Proceedings of the 14th International Conference on Real-Time and Network Systems, RTNS ’06, 2006.
[16] R. Arnold, F. Mueller, D. Whalley, and M. Harmon. Bounding worst-case instruction cache performance. In Proceedings of the 15th IEEE Real-Time Systems Symposium, RTSS ’94, pages 172–181, 1994.
[17] C. Ballabriga and H. Casse. Improving the first-miss computation in set-associative instruction caches. In Proceedings of the 2008 Euromicro Conference on Real-Time Systems, ECRTS ’08, pages 341–350, 2008.
[18] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel. Scratchpad memory: design alternative for cache on-chip memory in embedded systems. In Proceedings of the 10th international symposium on Hardware/software codesign, CODES ’02, pages 73–78, 2002.
[19] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for architectural-level power analysis and optimizations. In Proceedings of the 27th annual International Symposium on Computer Architecture, ISCA ’00, pages 83–94, 2000.
[20] B. Buck and J. K. Hollingsworth.
An API for runtime code patching. Int. J. High Perform. Comput. Appl., 14(4):317–329, Nov. 2000.
[21] D. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. SIGARCH Comput. Archit. News, 25(3):13–25, June 1997.
[22] J. M. Calandrino and J. H. Anderson. Cache-aware real-time scheduling on multicore platforms: Heuristics and a case study. In Proceedings of the 2008 Euromicro Conference on Real-Time Systems, ECRTS ’08, pages 299–308, 2008.
[23] A. M. Campoy, I. Puaut, A. P. Ivars, and J. V. B. Mataix. Cache contents selection for statically-locked instruction caches: An algorithm comparison. In Proceedings of the 17th Euromicro Conference on Real-Time Systems, ECRTS ’05, pages 49–56, 2005.
[24] J. F. Cantin and M. D. Hill. Cache performance for SPEC CPU2000 benchmarks. http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data, May 2003.
[25] S. Chatterjee, E. Parker, P. J. Hanlon, and A. R. Lebeck. Exact analysis of the cache behavior of nested loops. In Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation, PLDI ’01, pages 286–297, 2001.
[26] S. Chattopadhyay, C. L. Kee, A. Roychoudhury, T. Kelter, P. Marwedel, and H. Falk. A unified WCET analysis framework for multi-core platforms. In Proceedings of the IEEE 18th Real Time and Embedded Technology and Applications Symposium, RTAS ’12, pages 99–108, 2012.
[27] S. Chattopadhyay and A. Roychoudhury. Static bus schedule aware scratchpad allocation in multiprocessors. In Proceedings of the 2011 SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems, LCTES ’11, pages 11–20, 2011.
[28] S. Chattopadhyay, A. Roychoudhury, and T. Mitra. Modeling shared cache and bus in multi-cores for timing analysis. In Proceedings of the 13th International Workshop on Software & Compilers for Embedded Systems, SCOPES ’10, pages 6:1–6:10, 2010.
[29] H. Chetto and M. Chetto. Some results of the earliest deadline scheduling algorithm. IEEE Trans. Softw.
Eng., 15(10):1261–1269, Oct. 1989.
[30] A. Colin and I. Puaut. Worst case execution time analysis for a processor with branch prediction. Real-Time Syst., 18(2/3):249–274, May 2000.
[31] P. Cousot and R. Cousot. Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In Proceedings of the 4th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, POPL ’77, pages 238–252, 1977.
[32] C. Cullmann. Cache persistence analysis: Theory and practice. ACM Trans. Embed. Comput. Syst., 12(1s):40:1–40:25, Mar. 2013.
[33] J.-F. Deverge and I. Puaut. WCET-directed dynamic scratchpad memory allocation of data. In Proceedings of the 19th Euromicro Conference on Real-Time Systems, ECRTS ’07, pages 179–190, 2007.
[34] R. P. Dick, D. L. Rhodes, and W. Wolf. TGFF: task graphs for free. In Proceedings of the 6th international workshop on Hardware/software codesign, CODES/CASHE ’98, pages 97–101, 1998.
[35] European Space Agency. DEBIE – First standard space debris monitoring instrument. http://gate.etamax.de/edid/publicaccess/debie1.php, 2008.
[36] H. Falk and J. C. Kleinsorge. Optimal static WCET-aware scratchpad allocation of program code. In Proceedings of the 46th Annual Design Automation Conference, DAC ’09, pages 732–737, 2009.
[37] H. Falk and H. Kotthaus. WCET-driven cache-aware code positioning. In Proceedings of the 14th international conference on Compilers, architectures and synthesis for embedded systems, CASES ’11, pages 145–154, 2011.
[38] H. Falk, S. Plazar, and H. Theiling. Compile-time decided instruction cache locking using worst-case execution paths. In Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis, CODES+ISSS ’07, pages 143–148, 2007.
[39] A. Fedorova, M. Seltzer, and M. D. Smith. Cache-fair thread scheduling for multicore processors. Technical Report TR-17-06, Harvard University, 2006.
[40] A. Fedorova, M.
Seltzer, M. D. Smith, and C. Small. CASC: A cache-aware scheduling algorithm for multithreaded chip multiprocessors. Technical Report TR-2005-0142, Sun Labs, 2005.
[41] C. Ferdinand, F. Martin, R. Wilhelm, and M. Alt. Cache behavior prediction by abstract interpretation. Sci. Comput. Program., 35(2-3):163–189, Nov. 1999.
[42] C. Ferdinand and R. Wilhelm. On predicting data cache behavior for real-time systems. In Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers, and Tools for Embedded Systems, LCTES ’98, pages 16–30, 1998.
[43] G. Gebhard and S. Altmeyer. Optimal task placement to improve cache performance. In Proceedings of the 7th ACM & IEEE international conference on Embedded software, EMSOFT ’07, pages 259–268, 2007.
[44] S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: a compiler framework for analyzing and tuning memory behavior. ACM Trans. Program. Lang. Syst., 21(4):703–746, July 1999.
[45] C. Guillon, F. Rastello, T. Bidault, and F. Bouchez. Procedure placement using temporal-ordering information: Dealing with code size expansion. J. Embedded Comput., 1(4):437–459, Dec. 2005.
[46] J. Gustafsson, A. Betts, A. Ermedahl, and B. Lisper. The Mälardalen WCET benchmarks - past, present and future. In Proceedings of the 10th International Workshop on Worst-Case Execution Time Analysis, WCET ’11, pages 136–146, 2010.
[47] D. Hardy, T. Piquet, and I. Puaut. Using bypass to tighten WCET estimates for multi-core processors with shared instruction caches. In Proceedings of the 30th IEEE Real-Time Systems Symposium, RTSS ’09, pages 68–77, 2009.
[48] D. Hardy and I. Puaut. WCET analysis of multi-level non-inclusive set-associative instruction caches. In Proceedings of the 29th Real-Time Systems Symposium, RTSS ’08, pages 456–466, 2008.
[49] C. A. Healy, R. D. Arnold, F. Mueller, M. G. Harmon, and D. B. Whalley. Bounding pipeline and instruction cache performance. IEEE Trans. Comput., 48(1):53–70, Jan. 1999.
[50] J. L. Hennessy and D. A.
Patterson. Computer Architecture, Fourth Edition: A Quantitative Approach. Morgan Kaufmann Publishers Inc., 2006.
[51] B. K. Huynh, L. Ju, and A. Roychoudhury. Scope-aware data cache analysis for WCET estimation. In Proceedings of the 17th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS ’11, pages 203–212, 2011.
[52] L. Ju, S. Chakraborty, and A. Roychoudhury. Accounting for cache-related preemption delay in dynamic priority schedulability analysis. In Proceedings of the conference on Design, automation and test in Europe, DATE ’07, pages 1623–1628, 2007.
[53] T. Kelter, H. Falk, P. Marwedel, S. Chattopadhyay, and A. Roychoudhury. Bus-aware multicore WCET analysis through TDMA offset bounds. In Proceedings of the 23rd Euromicro Conference on Real-Time Systems, ECRTS ’11, pages 3–12, 2011.
[54] J. C. Kleinsorge, H. Falk, and P. Marwedel. A synergetic approach to accurate analysis of cache-related preemption delay. In Proceedings of the 9th ACM international conference on Embedded software, EMSOFT ’11, pages 329–338, 2011.
[55] M. Langenbach, S. Thesing, and R. Heckmann. Pipeline modeling for timing analysis. In Proceedings of the 9th International Symposium on Static Analysis, SAS ’02, pages 294–309, 2002.
[56] C.-G. Lee, J. Hahn, Y.-M. Seo, S. L. Min, R. Ha, S. Hong, C. Y. Park, M. Lee, and C. S. Kim. Analysis of cache-related preemption delay in fixed-priority preemptive scheduling. IEEE Trans. Comput., 47(6):700–713, June 1998.
[57] B. Lesage, D. Hardy, and I. Puaut. WCET analysis of multi-level set-associative data caches. In Proceedings of the 9th Intl. Workshop on Worst-Case Execution Time WCET Analysis, WCET ’09, 2009.
[58] B. Lesage, D. Hardy, and I. Puaut. Shared data cache conflicts reduction for WCET computation in multi-core architectures. In Proceedings of the 18th International Conference on Real-Time and Network Systems, RTNS ’10, 2010.
[59] X. Li, Y. Liang, T. Mitra, and A. Roychoudhury.
Chronos: A timing analyzer for embedded software. Sci. Comput. Program., 69(1-3):56–67, Dec. 2007. 144 [60] X. Li, T. Mitra, and A. Roychoudhury. Modeling control speculation for timing analysis. Real-Time Syst., 29(1):27–58, Jan. 2005. [61] X. Li, A. Roychoudhury, and T. Mitra. Modeling out-of-order processors for wcet analysis. Real-Time Syst., 34(3):195–227, Nov. 2006. [62] Y. Li, V. Suhendra, Y. Liang, T. Mitra, and A. Roychoudhury. Timing analysis of concurrent programs running on shared cache multi-cores. In Proceedings of the 30th IEEE Real-Time Systems Symposium, RTSS ’09, pages 57–67, 2009. [63] Y.-T. S. Li and S. Malik. Performance analysis of embedded software using implicit path enumeration. In Proceedings of the 32nd annual ACM/IEEE Design Automation Conference, DAC ’95, pages 456–461, 1995. [64] Y.-T. S. Li, S. Malik, and A. Wolfe. Efficient microarchitecture modeling and path analysis for real-time software. In Proceedings of the 16th IEEE Real-Time Systems Symposium, RTSS ’95, pages 298–, 1995. [65] Y.-T. S. Li, S. Malik, and A. Wolfe. Cache modeling for real-time software: beyond direct mapped instruction caches. In Proceedings of the 17th IEEE Real-Time Systems Symposium, RTSS ’96, pages 254–, 1996. [66] Y.-T. S. Li, S. Malik, and A. Wolfe. Performance estimation of embedded software with instruction cache modeling. ACM Trans. Des. Autom. Electron. Syst., 4(3):257–279, July 1999. [67] Y. Liang, H. Ding, T. Mitra, A. Roychoudhury, Y. Li, and V. Suhendra. Timing analysis of concurrent programs running on shared cache multicores. Real-Time Syst., 48(6):638–680, Nov. 2012. [68] Y. Liang and T. Mitra. Cache modeling in probabilistic execution time analysis. In Proceedings of the 45th annual Design Automation Conference, DAC ’08, pages 319–324, 2008. [69] Y. Liang and T. Mitra. Improved procedure placement for set associative caches. 
In Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems, CASES ’10, pages 147–156, 2010. 145 [70] Y. Liang and T. Mitra. Instruction cache locking using temporal reuse profile. In Proceedings of the 47th Design Automation Conference, DAC ’10, pages 344–349, 2010. [71] C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogramming in a hard-real-time environment. J. ACM, 20(1):46–61, Jan. 1973. [72] T. Liu, M. Li, and C. J. Xue. Minimizing WCET for real-time embedded systems via static instruction cache locking. In Proceedings of the 15th IEEE Symposium on Real-Time and Embedded Technology and Applications, RTAS ’09, pages 35–44, 2009. [73] T. Liu, M. Li, and C. J. Xue. Instruction cache locking for embedded systems using probability profile. J. Signal Process. Syst., 69(2):173– 188, Nov. 2012. [74] T. Liu, M. Li, and C. J. Xue. Instruction cache locking for multi-task realtime embedded systems. Real-Time Syst., 48(2):166–197, Mar. 2012. [75] T. Liu, Y. Zhao, M. Li, and C. J. Xue. Task assignment with cache partitioning and locking for WCET minimization on MPSoC. In Proceedings of the 39th International Conference on Parallel Processing, ICPP ’10, pages 573–582, 2010. [76] P. Lokuciejewski, H. Falk, and P. Marwedel. WCET-driven cache-based procedure positioning optimizations. In Proceedings of the 2008 Euromicro Conference on Real-Time Systems, ECRTS ’08, pages 321–330, 2008. [77] T. Lundqvist and P. Stenström. An integrated path and timing analysis method based on cycle-level symbolic execution. Real-Time Syst., 17(23):183–207, Dec. 1999. [78] T. Lundqvist and P. Stenström. Timing anomalies in dynamically scheduled microprocessors. In Proceedings of the 20th IEEE Real-Time Systems Symposium, RTSS ’99, pages 12–, 1999. [79] F. Martin, M. Alt, R. Wilhelm, and C. Ferdinand. Analysis of loops. In In Proceedings of the 7th International Conference on Compiler Construction, CC ’98, pages 80–94, 1998. 
[80] F. Mueller. Compiler support for software-based cache partitioning. In Proceedings of the ACM SIGPLAN 1995 workshop on Languages, compilers, & tools for real-time systems, LCTES ’95, pages 125–133, 1995.
[81] F. Mueller. Timing analysis for instruction caches. Real-Time Syst., 18(2/3):217–247, May 2000.
[82] H. S. Negi, T. Mitra, and A. Roychoudhury. Accurate estimation of cache-related preemption delay. In Proceedings of the 1st IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis, CODES+ISSS ’03, pages 201–206, 2003.
[83] C. Y. Park. Predicting program execution times by analyzing static and dynamic program paths. Real-Time Syst., 5(1):31–62, Mar. 1993.
[84] S. Plazar, J. C. Kleinsorge, P. Marwedel, and H. Falk. WCET-aware static locking of instruction caches. In Proceedings of the 10th International Symposium on Code Generation and Optimization, CGO ’12, pages 44–52, 2012.
[85] S. Plazar, P. Lokuciejewski, and P. Marwedel. WCET-aware software based cache partitioning for multi-task real-time systems. In Proceedings of the 9th Intl. Workshop on Worst-Case Execution Time Analysis, WCET ’09, 2009.
[86] I. Puaut. WCET-centric software-controlled instruction caches for hard real-time systems. In Proceedings of the 18th Euromicro Conference on Real-Time Systems, ECRTS ’06, pages 217–226, 2006.
[87] I. Puaut and D. Decotigny. Low-complexity algorithms for static cache locking in multitasking hard real-time systems. In Proceedings of the 23rd IEEE Real-Time Systems Symposium, RTSS ’02, pages 114–, 2002.
[88] I. Puaut and C. Pais. Scratchpad memories vs locked caches in hard real-time systems: a quantitative comparison. In Proceedings of the conference on Design, automation and test in Europe, DATE ’07, pages 1484–1489, 2007.
[89] J. E. Sasinowski and J. K. Strosnider. A dynamic programming algorithm for cache memory partitioning for real-time systems. IEEE Trans. Comput., 42(8):997–1001, Aug. 1993.
[90] J. Schneider and C. Ferdinand. Pipeline behavior prediction for superscalar processors by abstract interpretation. In Proceedings of the ACM SIGPLAN 1999 workshop on Languages, compilers, and tools for embedded systems, LCTES ’99, pages 35–44, 1999.
[91] R. Sen and Y. N. Srikant. WCET estimation for executables in the presence of data caches. In Proceedings of the 7th ACM & IEEE international conference on Embedded software, EMSOFT ’07, pages 203–212, 2007.
[92] A. C. Shaw. Reasoning about time in higher-level language software. IEEE Trans. Softw. Eng., 15(7):875–889, July 1989.
[93] Y. N. Srikant and P. Shankar. The Compiler Design Handbook: Optimizations and Machine Code Generation, Second Edition. CRC Press, Inc., 2nd edition, 2007.
[94] F. Stappert, A. Ermedahl, and J. Engblom. Efficient longest executable path search for programs with complex flows and pipeline effects. In Proceedings of the 2001 international conference on Compilers, architecture, and synthesis for embedded systems, CASES ’01, pages 132–140, 2001.
[95] J. Staschulat and R. Ernst. Scalable precision cache analysis for real-time software. ACM Trans. Embed. Comput. Syst., 6(4), Sept. 2007.
[96] V. Suhendra and T. Mitra. Exploring locking & partitioning for predictable shared caches on multi-cores. In Proceedings of the 45th annual Design Automation Conference, DAC ’08, pages 300–303, 2008.
[97] V. Suhendra, T. Mitra, A. Roychoudhury, and T. Chen. WCET centric data allocation to scratchpad memory. In Proceedings of the 26th IEEE International Real-Time Systems Symposium, RTSS ’05, pages 223–232, 2005.
[98] V. Suhendra, T. Mitra, A. Roychoudhury, and T. Chen. Efficient detection and exploitation of infeasible paths for software timing analysis. In Proceedings of the 43rd annual Design Automation Conference, DAC ’06, pages 358–363, 2006.
[99] V. Suhendra, A. Roychoudhury, and T. Mitra. Scratchpad allocation for concurrent embedded software. ACM Trans. Program. Lang. Syst., 32(4):13:1–13:47, Apr. 2010.
[100] Y. Tan and V. Mooney. Integrated intra- and inter-task cache analysis for preemptive multi-tasking real-time systems. In Proceedings of the 8th International Workshop, SCOPES 2004, in: Lecture Notes on Computer Science, LNCS 3199, SCOPES ’04, pages 182–199, 2004.
[101] H. Theiling, C. Ferdinand, and R. Wilhelm. Fast and precise WCET prediction by separated cache and path analyses. Real-Time Syst., 18(2/3):157–179, May 2000.
[102] L. Thiele and R. Wilhelm. Design for timing predictability. Real-Time Syst., 28(2-3):157–177, Nov. 2004.
[103] H. Tomiyama and N. D. Dutt. Program path analysis to bound cache-related preemption delay in preemptive real-time systems. In Proceedings of the 8th International Workshop on Hardware/Software Codesign, CODES ’00, pages 67–71, 2000.
[104] X. Vera, B. Lisper, and J. Xue. Data cache locking for tight timing calculations. ACM Trans. Embed. Comput. Syst., 7(1):4:1–4:38, Dec. 2007.
[105] M. Verma, K. Petzold, L. Wehmeyer, H. Falk, and P. Marwedel. Scratchpad sharing strategies for multiprocess embedded systems: a first approach. In The 3rd Workshop on Embedded Systems for Real-Time Multimedia, ESTIMedia ’05, pages 115–120, 2005.
[106] Q. Wan, H. Wu, and J. Xue. WCET-aware data selection and allocation for scratchpad memory. In Proceedings of the 13th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory for Embedded Systems, LCTES ’12, pages 41–50, 2012.
[107] I. Wenzel, R. Kirner, P. Puschner, and B. Rieder. Principles of timing anomalies in superscalar processors. In Proceedings of the 5th International Conference on Quality Software, QSIC ’05, pages 295–306, 2005.
[108] R. T. White, F. Mueller, C. Healy, D. Whalley, and M. Harmon. Timing analysis for data and wrap-around fill caches. Real-Time Syst., 17(2-3):209–233, Dec. 1999.
[109] R. Wilhelm, J. Engblom, A. Ermedahl, N. Holsti, S. Thesing, D. Whalley, G. Bernat, C. Ferdinand, R. Heckmann, T. Mitra, F. Mueller, I. Puaut, P. Puschner, J. Staschulat, and P. Stenström. The worst-case execution-time problem - overview of methods and survey of tools. ACM Trans. Embed. Comput. Syst., 7(3):36:1–36:53, May 2008.
[110] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation, PLDI ’91, pages 30–44, 1991.
[111] H. Wu, J. Xue, and S. Parameswaran. Optimal WCET-aware code selection for scratchpad memory. In Proceedings of the 10th ACM international conference on Embedded software, EMSOFT ’10, pages 59–68, 2010.
[112] J. Yan and W. Zhang. WCET analysis for multi-core processors with shared L2 instruction caches. In Proceedings of the 14th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS ’08, pages 80–89, 2008.
[113] W. Zhang and J. Yan. Accurately estimating worst-case execution time for multi-core processors with shared direct-mapped instruction caches. In Proceedings of the 15th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, RTCSA ’09, pages 455–463, 2009.
[114] W. Zhao, D. Whalley, C. Healy, and F. Mueller. WCET code positioning. In Proceedings of the 25th IEEE International Real-Time Systems Symposium, RTSS ’04, pages 81–91, 2004.

[...]

... corresponding time deadlines, while no timing constraint is required in general-purpose computer systems. Real-time systems that have timing constraints can be classified into two types: soft real-time systems and hard real-time systems. In soft real-time systems, the timing constraint is elastic. Missing a deadline in a soft real-time system only results in a loss of QoS, not in the failure of the system. Thus, the time...
employed in modern embedded real-time systems. It stores copies of instructions and speeds up instruction fetch in the processor. The instruction cache is accessed by the CPU almost every cycle, and it significantly influences the average-case performance of the processor. Moreover, the instruction cache also consumes a large part of the processor's power [19]. In embedded real-time systems, instruction...

3.1 Cache Analysis in Uni-processor

We introduce the existing cache analysis approaches that target both intra-task cache conflict and inter-task cache interference in uni-processors.

3.1.1 Intra-task Cache Conflict Analysis

Caches make worst-case timing analysis in real-time systems challenging, because they make the timing unpredictable. Conservatively assuming that all memory accesses are cache...

dynamic cache locking for single task. Finally, we consider cache optimizations in multi-core processors with a shared cache.

1.4 Thesis Contributions

In this thesis, we perform post-compilation instruction cache optimizations via partial cache locking in embedded real-time systems. We select the locked contents based on a static analysis of the program binary executable. We make the following contributions in...

partial cache locking for single task to multitasking in uni-processors, in order to improve the schedulability/utilization of real-time systems. In our approach, each task statically locks a portion of the cache, while there is still unlocked cache space that is shared by all tasks in a time-multiplexed style. Locking a memory block in multitasking real-time systems influences both WCET and CRPD (Cache-related...
locking, cache locking is performed at the granularity of cache ways. When a cache way is locked, all the sets in this particular way are locked. Way locking is employed in [1], [2], and so on. Line locking, in contrast, allows a different number of cache lines to be locked in different cache sets. Thus, compared to way locking, line locking is more flexible and fine-grained. Line locking is used in [6], [5], and so on. Cache...

feature, there are also real-time constraints in embedded systems, such as timing constraints. With timing constraints, embedded systems are not merely required to produce correct results, but must also meet real-time response requirements in order to guarantee the quality of service (QoS) or proper functioning. In other words, applications on embedded real-time systems need to complete...

unpredictable timing in embedded real-time systems. To deal with the timing unpredictability of caches, many approaches have been proposed, including static cache analysis and cache locking.

Static Cache Analysis

Static cache analysis statically analyzes the program and models the cache in order to capture the cache behavior of the program. It is commonly used to model the intra-task cache...

carefully handled in timing analysis. Instruction cache modeling attracts a lot of attention in micro-architectural modeling. One of the best-known approaches to instruction cache modeling is abstract interpretation [101]. This method is also used in the modeling of multi-level caches [48] and shared caches [62]. In the abstract interpretation approach, abstract cache states are defined at each program point to represent...
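
To make the notion of an abstract cache state more concrete, the sketch below models a single cache set under LRU "must" analysis, a formulation commonly used in abstract-interpretation-based cache analysis. It is a simplified illustration under stated assumptions, not the thesis's implementation: the function names `must_update`, `must_join`, and `always_hit` and the block labels are hypothetical, and only one set is modeled.

```python
from typing import Dict

# Abstract "must" state of ONE cache set (illustrative assumption):
# maps a memory block to an upper bound on its LRU age (0 = most recent).
# A block that appears in the state is guaranteed to be cached.
AbstractSet = Dict[str, int]


def must_update(state: AbstractSet, block: str, assoc: int) -> AbstractSet:
    """Abstract effect of accessing `block` in a set with `assoc` ways."""
    # A block that is not guaranteed to be cached is treated as age `assoc`.
    old_age = state.get(block, assoc)
    new_state: AbstractSet = {}
    for other, age in state.items():
        if other == block:
            continue
        # Only blocks younger than the accessed block grow older by one.
        new_age = age + 1 if age < old_age else age
        if new_age < assoc:  # age >= assoc means "may have been evicted"
            new_state[other] = new_age
    new_state[block] = 0  # the accessed block becomes the youngest
    return new_state


def must_join(a: AbstractSet, b: AbstractSet) -> AbstractSet:
    """Merge states at a control-flow join: keep blocks guaranteed on both
    paths, with the larger (more pessimistic) age bound."""
    return {blk: max(a[blk], b[blk]) for blk in a.keys() & b.keys()}


def always_hit(state: AbstractSet, block: str) -> bool:
    """An access is a guaranteed hit if the block is in the must state."""
    return block in state


# Hypothetical 2-way set and block names m1..m3.
state: AbstractSet = {}
for blk in ["m1", "m2", "m3"]:
    state = must_update(state, blk, assoc=2)
joined = must_join({"m1": 0, "m2": 1}, {"m2": 0, "m3": 1})
```

A block that is present in the must state at a program point is guaranteed to be in the cache there, so an access to it can be classified as a guaranteed hit in the timing analysis; every other access has to be treated conservatively.
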
al. [47] reduce inter-core interference in the shared cache by bypassing static single-usage blocks from the shared cache via compile-time analysis. In [96] and [75], cache partitioning is employed in the shared cache to eliminate inter-core cache interference. However, cache partitioning may limit shared cache performance, as each task can only use a portion of the shared cache.

1.3 Research...

in cache misses, due to such intra-task cache conflict in T...

In preemptive multitasking real-time systems, multiple tasks are scheduled on the same processor. Inter-task interference in the cache...

for multitasking applications via cache locking. In summary, this thesis proposes and studies partial instruction cache locking in the context of different architectures and system models in embedded real-time...
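
The difference between way locking and line locking described earlier can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the thesis's method: a set-associative LRU cache, block-to-set mapping by `block % num_sets`, locked blocks preloaded before execution (so they never miss), and hypothetical helper names (`way_locking_reservation`, `line_locking_reservation`, `count_misses`).

```python
from collections import Counter, OrderedDict
from typing import List, Sequence


def way_locking_reservation(num_sets: int, locked_ways: int) -> List[int]:
    """Way locking: the same number of lines is reserved in EVERY set."""
    return [locked_ways] * num_sets


def line_locking_reservation(num_sets: int,
                             locked_blocks: Sequence[int]) -> List[int]:
    """Line locking: each set reserves only the lines it actually locks."""
    per_set = Counter(block % num_sets for block in locked_blocks)
    return [per_set[s] for s in range(num_sets)]


def count_misses(trace: Sequence[int], num_sets: int, assoc: int,
                 locked_blocks: Sequence[int], reserved: List[int]) -> int:
    """Count misses of `trace` when `reserved[s]` lines of set s are locked."""
    locked = set(locked_blocks)
    needed = line_locking_reservation(num_sets, sorted(locked))
    if any(n > r or r > assoc for n, r in zip(needed, reserved)):
        raise ValueError("locked blocks do not fit in the reserved lines")
    # Unlocked part of each set, kept in LRU order (oldest entry first).
    lru = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in trace:
        if block in locked:
            continue  # assumption: locked blocks are preloaded, always hit
        s = block % num_sets
        capacity = assoc - reserved[s]
        if block in lru[s]:
            lru[s].move_to_end(block)  # hit: refresh LRU position
            continue
        misses += 1
        if capacity == 0:
            continue  # the whole set is locked; the block cannot be cached
        if len(lru[s]) >= capacity:
            lru[s].popitem(last=False)  # evict the least recently used block
        lru[s][block] = True
    return misses


# Hypothetical 2-set, 2-way cache; block 0 is locked, blocks 1 and 3 share set 1.
TRACE = [0, 1, 3] * 3
way_misses = count_misses(TRACE, 2, 2, [0], way_locking_reservation(2, 1))
line_misses = count_misses(TRACE, 2, 2, [0], line_locking_reservation(2, [0]))
```

In this toy trace, way locking also reserves a line in set 1 even though nothing is locked there, so blocks 1 and 3 keep evicting each other (6 misses). Line locking leaves set 1 fully usable, so only the two cold misses remain. This is the flexibility that partial, line-granularity locking offers.
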
