EMBEDDED SOFTWARE FOR SoC

Edited by
Ahmed Amine Jerraya, TIMA Laboratory, France
Sungjoo Yoo, TIMA Laboratory, France
Diederik Verkest, IMEC, Belgium
and
Norbert Wehn, University of Kaiserslautern, Germany

KLUWER ACADEMIC PUBLISHERS
NEW YORK, BOSTON, DORDRECHT, LONDON, MOSCOW

eBook ISBN: 0-306-48709-8
Print ISBN: 1-4020-7528-6

©2004 Springer Science + Business Media, Inc.
Print ©2003 Kluwer Academic Publishers, Dordrecht

All rights reserved. No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher.

Created in the United States of America

Visit Springer's eBookstore at: http://www.ebooks.kluweronline.com
and the Springer Global Website Online at: http://www.springeronline.com

DEDICATION

This book is dedicated to all designers working in hardware hell.

TABLE OF CONTENTS

Dedication
Contents
Preface
Introduction

PART I: EMBEDDED OPERATING SYSTEMS FOR SOC

Chapter 1
APPLICATION MAPPING TO A HARDWARE PLATFORM THROUGH AUTOMATED CODE GENERATION TARGETING A RTOS
Monica Besana and Michele Borgatti

Chapter 2
FORMAL METHODS FOR INTEGRATION OF AUTOMOTIVE SOFTWARE
Marek Jersak, Kai Richter, Razvan Racu, Jan Staschulat, Rolf Ernst, Jörn-Christian Braam and Fabian Wolf

Chapter 3
LIGHTWEIGHT IMPLEMENTATION OF THE POSIX THREADS API FOR AN ON-CHIP MIPS MULTIPROCESSOR WITH VCI INTERCONNECT
Frédéric Pétrot, Pascal Gomez and Denis Hommais

Chapter 4
DETECTING SOFT ERRORS BY A PURELY SOFTWARE APPROACH: METHOD, TOOLS AND EXPERIMENTAL RESULTS
B. Nicolescu and R. Velazco

PART II: OPERATING SYSTEM ABSTRACTION AND TARGETING

Chapter 5
RTOS MODELLING FOR SYSTEM LEVEL DESIGN
Andreas Gerstlauer, Haobo Yu and Daniel D. Gajski

Chapter 6
MODELING AND INTEGRATION OF PERIPHERAL DEVICES IN EMBEDDED SYSTEMS
Shaojie Wang, Sharad Malik and Reinaldo A. Bergamaschi
Chapter 7
SYSTEMATIC EMBEDDED SOFTWARE GENERATION FROM SYSTEMC
F. Herrera, H. Posadas, P. Sánchez and E. Villar

PART III: EMBEDDED SOFTWARE DESIGN AND IMPLEMENTATION

Chapter 8
EXPLORING SW PERFORMANCE USING SOC TRANSACTION-LEVEL MODELING
Imed Moussa, Thierry Grellier and Giang Nguyen

Chapter 9
A FLEXIBLE OBJECT-ORIENTED SOFTWARE ARCHITECTURE FOR SMART WIRELESS COMMUNICATION DEVICES
Marco Götze

Chapter 10
SCHEDULING AND TIMING ANALYSIS OF HW/SW ON-CHIP COMMUNICATION IN MP SOC DESIGN
Youngchul Cho, Ganghee Lee, Kiyoung Choi, Sungjoo Yoo and Nacer-Eddine Zergainoh

Chapter 11
EVALUATION OF APPLYING SPECC TO THE INTEGRATED DESIGN METHOD OF DEVICE DRIVER AND DEVICE
Shinya Honda and Hiroaki Takada

Chapter 12
INTERACTIVE RAY TRACING ON RECONFIGURABLE SIMD MORPHOSYS
H. Du, M. Sanchez-Elez, N. Tabrizi, N. Bagherzadeh, M. L. Anido and M. Fernandez

Chapter 13
PORTING A NETWORK CRYPTOGRAPHIC SERVICE TO THE RMC2000
Stephen Jan, Paolo de Dios, and Stephen A. Edwards

PART IV: EMBEDDED OPERATING SYSTEMS FOR SOC

Chapter 14
INTRODUCTION TO HARDWARE ABSTRACTION LAYERS FOR SOC
Sungjoo Yoo and Ahmed A. Jerraya

Chapter 15
HARDWARE/SOFTWARE PARTITIONING OF OPERATING SYSTEMS
Vincent J. Mooney III

Chapter 16
EMBEDDED SW IN DIGITAL AM-FM CHIPSET
M. Sarlotte, B. Candaele, J. Quevremont and D. Merel

PART V: SOFTWARE OPTIMISATION FOR EMBEDDED SYSTEMS

Chapter 17
CONTROL FLOW DRIVEN SPLITTING OF LOOP NESTS AT THE SOURCE CODE LEVEL
Heiko Falk, Peter Marwedel and Francky Catthoor

Chapter 18
DATA SPACE ORIENTED SCHEDULING
M. Kandemir, G. Chen, W. Zhang and I. Kolcu

Chapter 19
COMPILER-DIRECTED ILP EXTRACTION FOR CLUSTERED VLIW/EPIC MACHINES
Satish Pillai and Margarida F. Jacome

Chapter 20
STATE SPACE COMPRESSION IN HISTORY DRIVEN QUASI-STATIC SCHEDULING
Antonio G. Lomeña, Marisa López-Vallejo, Yosinori Watanabe and Alex Kondratyev

Chapter 21
SIMULATION TRACE VERIFICATION FOR QUANTITATIVE CONSTRAINTS
Xi Chen, Harry Hsieh, Felice Balarin and
Yosinori Watanabe

PART VI: ENERGY AWARE SOFTWARE TECHNIQUES

Chapter 22
EFFICIENT POWER/PERFORMANCE ANALYSIS OF EMBEDDED AND GENERAL PURPOSE SOFTWARE APPLICATIONS
Venkata Syam P. Rapaka and Diana Marculescu

Chapter 23
DYNAMIC PARALLELIZATION OF ARRAY BASED ON-CHIP MULTIPROCESSOR APPLICATIONS
M. Kandemir, W. Zhang and M. Karakoy

Chapter 24
SDRAM-ENERGY-AWARE MEMORY ALLOCATION FOR DYNAMIC MULTI-MEDIA APPLICATIONS ON MULTI-PROCESSOR PLATFORMS
P. Marchal, J. I. Gomez, D. Bruni, L. Benini, L. Piñuel, F. Catthoor and H. Corporaal

Chapter 37
Low Energy Associative Data Caches for Embedded Systems

[...] be updated when a line is allocated again. This is the approach used in the design presented here.

4. EXPERIMENTAL SETUP

To evaluate the WDU design, the Wattch version 1.02 simulator [1] was augmented with a model for the WDU. Based on SimpleScalar [2], Wattch is a simulator for a superscalar processor that can simulate the energy consumption of all major components of a CPU. The CMOS process parameters for the simulated architecture are 400 MHz clock and feature size. The processor modeled uses a memory and cache organization based on XScale [5]: 32 KB data and instruction L1 caches with 32 byte lines and cycles latency, no L2 cache, 50 cycle main memory access latency. The machine is in-order and has a load/store queue with 32 entries. It is 2-issue and has one of each of the following units: integer unit, floating point unit and multiplication/division unit, all with cycle latency. The branch predictor is bimodal and has 128 entries. The instruction and data TLBs are fully associative and have 32 entries.

4.1 The WDU energy consumption model

The WDU tags and way storage are modeled using a Wattch model for a fully associative cache. The processor modeled is 32 bit and has a virtually indexed L1 data cache with 32 byte lines, so the WDU tags are 32 - 5 = 27 bits wide (a 32 byte line needs 5 offset bits), and the data store is 1, 2, 3, 4 or 5 bits wide for a 2, 4, 8, 16 or 32-way set associative L1, respectively. The
energy consumption of the modulo counter is insignificant compared to the rest of the WDU. The energy consumption model takes into account the energy consumed by the different units when idle. For a processor with a physically tagged cache the WDU would be substantially smaller, and so would its energy consumption. Cacti3 [10] has been used to model and check the timing parameters of the WDU in the desired technology.

4.2 Benchmarks

MiBench [3] is a publicly available benchmark suite designed to be representative of several embedded system domains. The benchmarks are divided into six categories targeting different parts of the embedded systems market. The suites are: Automotive and Industrial Control (basicmath, bitcount, susan (edges, corners and smoothing)), Consumer Devices (jpeg encode and decode, lame, tiff2bw, tiff2rgba, tiffdither, tiffmedian, typeset), Office Automation (ghostscript, ispell, stringsearch), Networking (dijkstra, patricia), Security (blowfish encode and decode, pgp sign and verify, rijndael encode and decode, sha) and Telecommunications (CRC32, FFT direct and inverse, adpcm encode and decode, gsm encode and decode). All the benchmarks were compiled with the -O3 compiler flag and were simulated to completion using the “large” input set. Various cache associativities and WDU sizes have been simulated; all the other processor parameters were kept constant during this exploration.

5. PERFORMANCE EVALUATION

Figure 37-3 shows the percentage of load/store instructions for which an 8, 16, 32 or 64-entry WDU can determine the correct cache way. An 8-entry WDU can determine the way for between 51 and 98% of the load/store instructions, with an average of 82%. With a few exceptions (susan_s, tiff2bw, tiff2rgba, pgp, adpcm, gsm_u), increasing the WDU size to 16 entries results in a significant improvement in the number of instructions with way determination for the majority of benchmarks. The increase from 16 to 32 entries only improves
the performance for a few benchmarks, and the increase from 32 to 64 for even fewer benchmarks. Figure 37-4 shows the percentage data cache energy consumption savings for a 32-way cache when using an 8, 16, 32 or 64-entry WDU. For space and legibility reasons all other results show only averages; the complete set of results can be found in [9].

A summary of the average number of instructions for which way determination worked for 2, 4, 8, 16 and 32-way set-associative L1 data cache and 8, 16, 32 and 64-entry WDU is presented in Figure 37-5. It is remarkable that the WDU detects a similar number of instructions independent of the L1 cache associativity. Increasing the WDU size from 8 to 16 produces the highest increase in the percentage of instructions with way determination, from 82% to 88%. The corresponding values for a 32 and 64-entry WDU are 91% and 93%.

Figure 37-6 shows the average data cache energy consumption savings for the MiBench benchmark suite due to using the WDU, compared to a system that does not have a WDU. When computing the energy consumption savings, the WDU energy consumption is added to the D-cache energy consumption. For all the associativities studied the 16-entry WDU has the best implementation cost/energy savings ratio. Its average D-cache energy consumption savings of 36%, 56%, 66%, 72% and 76% for, respectively, a 2, 4, 8, 16 and 32-way set associative cache are within 1% of the energy consumption savings of a 32-entry WDU for a given associativity. The even smaller 8-entry WDU is within at most 3% of the best case. For the 64-entry WDU the WDU energy consumption overhead becomes higher than the additional energy savings due to the increased number of WDU entries, so the 64-entry WDU performs worse than the 32-entry one for a given associativity.

Figure 37-7 shows the percentage of total processor energy consumption reduction when using a WDU. For a 16-entry
WDU the energy consumption savings are 3.73%, 6.37%, 7.21%, 9.59% and 13.86% for, respectively, a 2, 4, 8, 16 and 32-way set associative L1 data cache. The total processor energy savings are greater for higher levels of associativity due to the increased D-cache energy consumption savings and to the increased share of the D-cache energy consumption in the total energy consumption budget.

The energy consumption savings for a data cache system using a WDU vary significantly with the associativity of the data cache. Figure 37-8 shows the data cache energy consumption savings when using a 32-entry WDU with data caches of different associativities. The size of the cache is the same in all cases: 32 KB. For a 4-way data cache the energy savings grow significantly to 56% compared to 34% for a 2-way cache, on average. The savings don’t increase as much for 8, 16 and 32-way caches.

5.1 Comparison with way prediction

Figure 37-9 shows the average percentage of load/store instructions for which the way can be determined by a WDU of different sizes or can be predicted by a Most Recently Used (MRU) way predictor (as presented in [4]). For the MRU way predictor the percentage of instructions for which the way is predicted correctly decreases when the cache associativity increases. For 2 and 4-way set associative caches the way predictor has a greater coverage than any WDU. For higher associativities the WDU has better coverage; for a 32-way cache the 16-entry WDU already has better performance than the MRU way predictor.

Figure 37-10 shows the data cache energy consumption savings for different sizes of WDU and the MRU way predictor. For a 2-way cache the data cache power savings are smaller for the MRU predictor than for the WDU. Although the MRU predictor has a higher prediction rate, the energy consumption overhead of the MRU predictor reduces the total data
cache power savings. For higher associativities the predictor overhead decreases, but so does the prediction rate so, except for small WDU sizes, the WDU has better energy savings.

6. CONCLUSIONS

This paper addresses the problem of the increased energy consumption of associative data caches in modern embedded processors. A design was presented for a Way Determination Unit (WDU) that reduces the D-cache energy consumption by allowing the cache controller to access only one cache way for a load/store operation. Reducing the number of way accesses greatly reduces the energy consumption of the data cache.

Unlike previous work, our design is not a predictor. It does not incur mis-prediction penalties and it does not require changes in the ISA or in the compiler. Not having mis-predictions is an important feature for an embedded system designer, as the WDU does not introduce any new non-deterministic behavior in program execution. The energy consumption reduction is achieved with no performance penalty and it grows with the increase in the associativity of the cache.

The WDU components, a small fully associative cache and a modulo counter, are well understood, simple devices that can be easily synthesized. It was shown that a very small (8–16 entries) WDU adds very little to the design gate count, but can still provide significant energy consumption savings.

The WDU evaluation was done on a 32-bit processor with a virtually indexed L1 cache. For a machine with a physically indexed cache the WDU overhead would be even smaller, resulting in higher energy consumption savings.

REFERENCES

1. D. Brooks, V. Tiwari, and M. Martonosi. “Wattch: A Framework for Architectural-Level Power Analysis and Optimizations.” In ISCA, pp. 83–94, 2000.
2. D. Burger and T. M. Austin. “The Simplescalar Tool Set, Version 2.0.” Technical Report TR-97-1342, University of Wisconsin-Madison, 1997.
3. M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. “Mibench: A Free, Commercially Representative Embedded
Benchmark Suite.” In IEEE 4th Annual Workshop on Workload Characterization, pp. 83–94, 2001.
4. K. Inoue, T. Ishihara, and K. Murakami. “Way-Predicting Set-Associative Cache for High Performance and Low Energy Consumption.” In ACM/IEEE International Symposium on Low Power Electronics and Design, pp. 273–275, 1999.
5. Intel. Intel XScale Microarchitecture, 2001.
6. R. E. Kessler. “The Alpha 21264 Microprocessor.” IEEE Micro, Vol. 19, No. 2, pp. 24–36, March/April 1999.
7. A. Klaiber. “The Technology Behind Crusoe Processors.” Technical Report, Transmeta Corporation, January 2000.
8. Motorola. MPC7450 RISC Microprocessor Family User’s Manual, 2001.
9. D. Nicolaescu, A. Veidenbaum, and A. Nicolau. “Reducing Power Consumption for High-Associativity Data Caches in Embedded Processors.” In Proceedings DATE03, 2003.
10. P. Shivakumar and N. P. Jouppi. Cacti 3.0: An Integrated Cache Timing, Power, and Area Model.
11. W. Tang, A. Veidenbaum, A. Nicolau, and R. Gupta. Simultaneous Way-Footprint Prediction and Branch Prediction for Energy Savings in Set-Associative Instruction Caches.
12. E. Witchel, S. Larsen, C. S. Ananian, and K. Asanovic. “Direct Addressed Caches for Reduced Power Consumption.” In Proceedings of the 34th Annual International Symposium on Microarchitectures, 2001.

INDEX

abstract communication access latency of cache access pattern address computation AES affine condition affinity AHB multi-layer analyzable reference APIs APOS application hotspots Application Specific Instruction-Set Processor (ASIP) architecture mapping architectural model ARM core array-based applications array interleaving assembly assertion checker assertion language automatic and application-specific design of HAL automation automotive automotive software bandwidth bank switching banked memory banking of cache behavioral model bit patterns in data bit-flips Bitline scaling Buffer Overflows Bus Models cache cache bypassing cache effects cache optimization
cache performance call-by-need certifiable software chip-scale parallelism clock gating clustered co-design co-scheduling code generation communication channels communication refinement communication scheduling communications compilers computation reuse concurrency configurable processor configurability context switching control flow coprocessor selection coroutines correct by construction code synthesis COTS cryptography data assignment data dependency data reuse data transformation data upsets data-dominated data-flow analysis debugging design design flow design space exploration detection efficiency device driver device driver synthesis device programming interface distributed real-time systems dynamic behavior dynamic C dynamic parallelization dynamic power management dynamic scheduling dynamic voltage scaling eCOS effect-less electronic control unit certification embedded benchmarks embedded OS embedded processor embedded processors embedded RTOS embedded software embedded software streaming embedded systems energy consumption energy efficiency EPIC error containment error rate Ethernet event driven finite state machine event handling fault containment fault tolerance filesystem flow aggregation formal timing analysis framework functional unit assignment genetic algorithm HAL standard hardware abstraction layer hardware dependent software hardware detection hardware software interface synthesis hardware/software co-design hardware/software optimization hardware/software partitioning hash table high associativity high performance HW/SW partitioning If-statement ILP extraction inter-processor communication interactive ray tracing Intermediate representation interrupt handling intervals instruction reuse Instruction selection IP based embedded system design iSSL Kahn process network kernel libraries link to implementation load balance locality locality improvement Logic of Constraints (LOC) loop iteration loop nest splitting loop optimization loop
transformation low energy low power mapping marking equation memory layouts memory recycling middleware microcontroller model refinement modeling modulo scheduling MP SoC multi-media applications multi-processor, multi-tasked applications Multi-rate Multimedia multiprocessor embedded kernel multiprocessor SoC multitasking multithread nested-loop hierarchy network processors network-on-chip networking on-chip communication on-chip communicationanalyzable reference on-chip generic interconnect on-chip multiprocessing on-demand computation on-line checkpointing on/off instructions, operating systems optimization OS-driven Software packet flows parameter passing pareto points partitioning performance analysis performance constraint performance estimation peripheral device modeling petri nets physical layer picture-in-picture (PIP) pipeline pipeline latches pipelined cache pipeline stall platform independence platform-based HW/SW co-design polytope posix threads porting power estimation predication process process scheduling product line programming languages protected OS quantitative constraint quasi-static scheduling rabbit reachability graph reactive systems real time operating systems (RTOS) real-time real-time systems region detection reliability model Rijndael RMC2000 resource conflict RTOS RTOS characterization RTOS model/modeling and abstraction RTOS modeling runtime optimization safety critical safety-critical system Satisfiability schedulability analysis scheduling analysis sdram selfishness Sequence-Loss SET SEU SIMD Reconfigurable architecture SimpleScalar simulation simulation model of HAL simulation monitor simulation speedup simulation verification simultaneous multithreading SoC sockets software engineering software generation software integration software performance validation software synthesis software-detection source code transformation speculation speech recognition specification methodology static/dynamic energy stochastic communication
synchronization Errors system-level design system-level design language (SLDL) (system-level) design methodology system-level modeling System-on-Chip System on Chip Bus SystemC SWCD system-on-a-chip software integration Software Architectural Transformations Software Streaming SoC software reuse State space compression Software synthesis task management task migration task synchronization & communication TCP thread time-sharing time-triggered architecture time/delay modeling timing back-annotation transaction Level Modeling TLM trace analysis tracking device ultra-dependable computer applications UML virtual chip interconnect Voice encoder/decoder VLIW way determination X-by-Wire

[...] embedded software optimization ones.

Introduction

To understand embedded software design for SoC, we need to know current issues in embedded software design. We want to classify the issues into two parts: software reuse for SoC integration and architecture-specific software optimization. Architecture-specific software optimization has been studied for decades. On the other side, software reuse for ...

... importance of embedded software in the design of a System-on-Chip. Embedded Software for SoC covers all software related aspects of SoC design. Embedded and application-domain specific operating systems, interplay between application, operating system, and architecture. System architecture for future SoC, application-specific architectures based on embedded processors and requiring sophisticated hardware/software ...

... of architectures, the embedded software, and the interaction between the embedded software, the SoC architecture, and the applications for which the SoC is designed. This book collects contributions from the Embedded Software Forum of the Design, Automation and Test in Europe Conference (DATE 03) that took place in March 2003 in Munich, Germany. The success of the Embedded Software Forum at DATE reflects ...
exploiting the characteristics of underlying hardware. Embedded software design is not a novel topic. Then, why do people consider that embedded software design is more and more important for SoC these days? A simple, maybe not yet complete, answer is that we are more and more dealing with platform-based design for SoC [2]. Platform-based SoC design means to design SoC with relatively fixed architectures. This ...

... Virtual Socket Interface Alliance, or by anyone) to enable platform-based SoC design by reusing software components. In SoC design with multi-layer software architecture, another important problem is the validation and evaluation of reused software on the platform. Main issues are related to software validation without the final platform and, on the other hand, to assess the performance of the reused software ...

... reusing software components as well as hardware components, SoC design becomes an integration of reused software and hardware components. When SoC designers do SoC integration with a platform and a multi-layer software architecture, the first question can be ‘what is the API that gives an abstraction of my platform?’ We call the API that abstracts a platform ‘platform API’. Considering the multi-layer software ...

... architectures. Embedded software for applications in the domains of automotive, avionics, multimedia, telecom, networking, ... This book is a must-read for SoC designers that want to broaden their horizons to include the ever-growing embedded software content of their next SoC design. In addition the book will provide embedded software designers invaluable insights into the constraints imposed by the use of embedded ...
levels of software. We think that there has been little research work covering both the abstraction levels of software and hardware in this problem.

GUIDE TO THIS BOOK

The book is organised into 10 parts corresponding to sessions presented at the Embedded Systems Forum at DATE’03. Both software reuse for SoC and application specific software optimisations are covered. The topic of Software reuse for SoC integration ...

... generation of software layers, in chapters 6 and 11. SoC integration in chapters 10, 12 and 13. Architecture-specific software optimization problems are mainly addressed in five parts, “Software Optimization for Embedded Systems”, “Embedded System Architecture”, “Transformations for Real-Time Software”, “Energy Aware Software Techniques”, and “Low Power Software”. The key issues addressed are: Sub-system-specific ...

... methodology for safe integration of automotive software functions where required performance information is exchanged while each partner’s IP is protected. We claim that in principle performance requirements and constraints (timing, memory consumption) for each software component and for the complete ECU can be formally validated, and believe that ultimately such formal analysis will be required for legal ...