Nicolescu/Model-Based Design for Embedded Systems 67842_C011 Finals Page 336 2009-10-2 336 Model-Based Design for Embedded Systems

MONTIUMC. The MONTIUM design methodology for mapping DSP applications onto the MONTIUM TP is divided into three steps:

1. The high-level description of the DSP application is analyzed, and computationally intensive DSP kernels are identified.
2. The identified DSP kernels, or parts of them, are mapped onto one or multiple MONTIUM TPs that are available in a SoC. The DSP operations are programmed on the MONTIUM TP using MONTIUMC.
3. Depending on the layout of the SoC in which the MONTIUM processing tiles are applied, the MONTIUM processing tiles are configured for a particular DSP kernel or part of a DSP kernel. Furthermore, the channels in the NoC between the processing tiles are configured.

ANNABELLE Heterogeneous System-on-Chip

In this section, the prototype ANNABELLE SoC is described according to the heterogeneous SoC template mentioned before. The ANNABELLE SoC is intended for digital radio broadcasting receivers (e.g., digital audio broadcasting, Digital Radio Mondiale). Figure 11.6 shows the overall architecture of the ANNABELLE SoC. The ANNABELLE SoC consists of an ARM926 GPP with a five-layer AMBA AHB, four MONTIUM TPs, an NoC, a Viterbi decoder, two ADCs, two DDCs, a DMA controller, SRAM/SDRAM memory interfaces, and external bus interfaces.

The four MONTIUM TPs and the NoC are arranged in a reconfigurable subsystem, labelled “reconfigurable fabric.” The reconfigurable fabric is connected to the AHB bus and serves as a slave to the AMBA system. A configurable clock controller generates the clocks for the individual MONTIUM TPs. Every MONTIUM TP has its own adjustable clock and runs at its own speed. A prototype chip of the ANNABELLE SoC has been produced using the Atmel 130 nm CMOS process [8].
The reconfigurable fabric that is integrated in the ANNABELLE SoC is shown in detail in Figure 11.7. The reconfigurable fabric acts as a reconfigurable coprocessor for the ARM926 processor. Computationally intensive DSP algorithms are typically offloaded from the ARM926 processor and processed on the coarse-grained reconfigurable MONTIUM TPs inside the reconfigurable fabric. The reconfigurable fabric contains four MONTIUM TPs, each of which is connected via a CCU to a circuit-switched NoC. The reconfigurable fabric is connected to the AMBA system through an AHB–NoC bridge interface. Configurations generated at design time can be loaded onto the MONTIUM TPs at run time. The reconfigurable fabric provides “block mode” and “streaming mode” computation services.

FIGURE 11.6 Block diagram of the ANNABELLE SoC.

FIGURE 11.7 The ANNABELLE SoC reconfigurable fabric.

For ASIC synthesis, worst-case military conditions are assumed. In particular, the supply voltage is 1.1 V and the temperature is 125 °C. Results obtained with the synthesis are as follows:

• The area of one MONTIUM core is 3.5 mm², of which 0.2 mm² is for the CCU and 3.3 mm² is for the MONTIUM TP (including memory).
• With Synopsys tooling we estimated that the MONTIUM TP, within the ANNABELLE ASIC realization, can implement an FIR filter at about 100 MHz or an FFT at 50 MHz.
  The worst-case clock frequency of the ANNABELLE chip is 25 MHz.
• With the Synopsys PrimePower tool, we estimated the energy consumption using placed and routed netlists. The following section provides some of the results.

Average Power Consumption

To determine the average power consumption of the ANNABELLE as accurately as possible, we performed a number of power estimations on the placed and routed netlist using the Synopsys Power Compiler. Table 11.2 provides the dynamic power consumption in mW/MHz of various MONTIUM blocks for three well-known DSP algorithms. These figures show that the overhead of the sequencer and decoder is low: less than 16% of the total dynamic power consumption. For the FIR-5 algorithm the memory is not used. Finally, Table 11.3 compares the energy consumption of the MONTIUM and the ARM926 on ANNABELLE.

TABLE 11.2
Dynamic Power Consumption of One MONTIUM on ANNABELLE, Energy (mW/MHz)

Module      FIR-5   FFT-512   FFT-288
Datapath    0.19    0.24      0.15
Memories    0.0     0.27      0.21
Sequencer   0.02    0.07      0.05
Decoders    0.0     0         0.0
CCU         0.02    0.02      0.02
Total       0.23    0.60      0.43

TABLE 11.3
Energy Comparison of MONTIUM/ARM926

Algorithm   MONTIUM (μJ)   ARM926 (μJ)   Ratio
FIR-5       0.243          -             -
FFT-112     0.357          9             25
FFT-176     0.616          16            26
FFT-256     0.707          14            20
FFT-288     1.001          23            23
FFT-512     1.563          30            19
FFT-1920    5.054          168           33

Locality of Reference

As mentioned above, locality of reference is an important design parameter. One of the reasons for the excellent energy figures of the MONTIUM is the use of locality of reference. To illustrate this, Table 11.4 gives the number of memory references local to the cores compared to the number of off-core communications. These figures are, as expected, algorithm dependent.
Therefore, in this table, we chose three well-known algorithms in the streaming DSP application domain: a 1024p FFT, a 200 tap FIR filter, and a part of a Turbo decoder (SISO algorithm [17]). The results show that for these algorithms 80%–99% of the memory references are local (within a tile).

TABLE 11.4
Internal and External Memory References per Execution of an Algorithm

                               Number of Memory References
Algorithm                      Internal   External   Ratio
1024p FFT                      51200      4096       12.5
200 tap FIR                    405        2          202.5
SISO algorithm (N softbits)    18*N       3*N        6

Partial Dynamic Reconfiguration

One of the advantages of a multicore SoC organization is that each individual core can be reconfigured while the other cores are operational. In the MONTIUM, the configuration memory is organized as a RAM. This means that to reconfigure the MONTIUM, the entire configuration memory need not be rewritten, but only the parts that are changed. Furthermore, because the MONTIUM has a coarse-grained reconfigurable architecture, the configuration memory is relatively small. The MONTIUM has a configuration size of only 2.6 kB.

Table 11.5 gives some examples of reconfigurations. To reconfigure a MONTIUM from executing a 1024 point FFT to executing a 1024 point inverse FFT requires updating the scaling and twiddle factors. Updating these factors requires at most 522 clock cycles in total (at most 10 for the scaling factors plus 512 for the twiddle factors). To change the coefficients of a 200 tap FIR filter requires at most 80 clock cycles.

TABLE 11.5
Reconfiguration of Algorithms on the MONTIUM

Algorithm            Change                Size         # cycles
1024p FFT to iFFT    Scaling factors       ≤150 bit     ≤10
                     Twiddle factors       16384 bit    512
200 tap FIR          Filter coefficients   ≤3200 bit    ≤80

11.3.2 Aspex Linedancer

The Linedancer [4] is an “associative” processor and it is an example of a homogeneous SoC.
Associative processing is the property of instructions to execute only on those PEs where a certain value in their data register matches a value in the instruction. Associative processing is built around an intelligent memory concept: content addressable memory (CAM). In standard computer memory (random access memory or RAM), the user supplies a memory address and the RAM returns the data word stored at that address. A CAM, in contrast, is designed such that the user supplies a data word and the CAM searches its entire memory to see if that data word is stored anywhere in it. If the data word is found, the CAM returns a tag list of one or more storage addresses where the word was found. Each CAM line that contains a word can be seen as a processor element (PE), and each tag list element as a 1-bit condition register. Depending on this register, the aggregate associative processor can either instruct the PEs to continue processing on the indicated subset, or return the involved words one after another for further processing. Several implementations are possible, varying from bit serial to word parallel, but the latest implementations [4,5] can perform the involved lookups in parallel in a single clock cycle.

In general the Linedancer belongs to the subclass of massively parallel SIMD architectures, with typically more than 512 processors. This SIMD subclass is well suited to data parallelism, for example, in signal, image, and video processing; text retrieval; and large databases. The associative functions furthermore allow the processor to function like an intelligent memory (CAM), permitting high-speed searching and data-dependent image processing operations (such as median filters and object recognition/labeling).
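The CAM lookup and the tag-list mechanism described above can be illustrated with a small software model. The following Python sketch is an assumption-laden illustration, not the Linedancer implementation: the class name `ContentAddressableMemory`, its methods, and the optional `mask` argument are hypothetical names chosen for this example, and the loop over lines is sequential, whereas the hardware compares all lines in parallel.

```python
class ContentAddressableMemory:
    """Toy software model of a CAM whose lines double as 1-bit-tagged PEs.

    Hypothetical illustration only; real hardware compares all lines at once.
    """

    def __init__(self, words):
        # One stored word per CAM line (non-negative integers).
        self.words = list(words)

    def search(self, value, mask=~0):
        """Return the tag list: one flag per line, set where the masked word matches."""
        return [(w & mask) == (value & mask) for w in self.words]

    def matching_addresses(self, value, mask=~0):
        """Return the storage addresses (line indices) where the word was found."""
        return [addr for addr, tag in enumerate(self.search(value, mask)) if tag]

    def apply_where(self, tags, op):
        """Associative SIMD step: apply `op` only on lines whose tag is set."""
        self.words = [op(w) if t else w for w, t in zip(self.words, tags)]


# Usage: search for the word 7, then increment only the matching lines.
cam = ContentAddressableMemory([7, 3, 7, 12])
tags = cam.search(7)                     # tag list acts as per-PE condition bits
addresses = cam.matching_addresses(7)    # lines 0 and 2 hold the word 7
cam.apply_where(tags, lambda w: w + 1)   # only lines 0 and 2 are modified
```

The tag list plays the role of the per-PE condition register: a single broadcast operation is applied, and each line decides locally whether to take part.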
The so-called “ASProCore” of the Linedancer is designed around a very large number (up to 4,096) of simple PEs arranged in a line; see Figure 11.8. Application areas are diverse but have in common the simple processing of very large amounts of data, from samples in 1D streams to pixels in 2D or 3D images. Examples include software defined radio (e.g., WiMAX), broadcast (video compression), medical imaging (3D reconstruction), and high-end printers, in particular raster image processing (RIP).

FIGURE 11.8 The scalable architecture of Linedancer.

FIGURE 11.9 The architecture of Linedancer’s associative string processor (ASProCore).

In the following sections, the associative processor (ASProCore) and the Linedancer family are introduced. At the end, we present the development tool chain and a brief conclusion on the Linedancer application domain.

ASProCore Architecture

Each PE has a 1-bit ALU, a 32- to 64-bit fully associative memory array, and 128 bits of extended memory. See Figure 11.9 for a detailed view of the ASProCore architecture. The processors are connected in a 1D network, in effect a 4K-bit shift register, between the indicated “left link port” (LLP) and “right link port” (RLP). The network allows data to be shared between PEs with minimum overhead. The ASProCore also has a separate word-serial, bit-parallel memory, the primary data store (PDS), for high-speed data input.
The on-chip DMA engine automatically translates 2D and 3D images into the 1D array, passing the data through the PDS. The 1D architecture allows for linear scaling of performance, memory, and communication, provided the application is expressed in a scalable manner. The Linedancer also features a single or dual 32-bit RISC core (P1 and HD, respectively) for sequential processing and for controlling the ASProCore.

Linedancer Hardware Architecture

The current Linedancers, the P1 and the HD, have been realized in a 0.13 μm CMOS process. They have one and two 32-bit SPARC cores, respectively, with 128 kB internal program memory. Available system clock frequencies are 300, 350, and 400 MHz.

The Linedancer-P1 integrates an associative processor (ASProCore, with 4K PEs), a single SPARC core with a 4 kB instruction cache, and a DMA controller capable of transferring 64 bit at 66 MHz over a PCI interface, as shown in Figure 11.10. It further hosts 128 kB internal data memory. The chip consumes 3.5 W typical at 300 MHz.

FIGURE 11.10 The Linedancer-P1 layout.

The Linedancer-HD integrates two associative processors (2 × 2K PEs), two SPARC cores, each with an 8 kB instruction cache and a 4 kB data cache, four internal DMA engines, and an external data channel capable of transferring 64 bit at 133 MHz over a PCI-X interface, as shown in Figure 11.11. The ASProCore has been extended with a chordal ring inter-PE communication network that allows for faster 2D and 3D image processing. It further hosts four external DDR2 DRAM interfaces, eight dedicated streaming data I/O ports (up to 3.2 GB/s), and 1088 kB internal data memory. The chip consumes 4.5 W typical at 300 MHz.
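As a rough orientation on the external data channels quoted above, the peak transfer rates follow directly from bus width and clock frequency. The Python sketch below is a back-of-the-envelope calculation, assuming one full-width transfer per clock cycle and no protocol overhead (an idealization; sustained rates on a real bus are lower). The function name `peak_bandwidth_mb_per_s` is hypothetical.

```python
def peak_bandwidth_mb_per_s(bus_width_bits, clock_mhz):
    """Idealized peak rate in MB/s (1 MB = 10**6 bytes).

    Assumes one full-width transfer per clock cycle and no protocol overhead.
    """
    return bus_width_bits / 8 * clock_mhz


# Figures taken from the text: 64 bit at 66 MHz (P1, PCI) and
# 64 bit at 133 MHz (HD, PCI-X).
pci_p1 = peak_bandwidth_mb_per_s(64, 66)     # 528.0 MB/s
pcix_hd = peak_bandwidth_mb_per_s(64, 133)   # 1064.0 MB/s
```

Under these assumptions, the 3.2 GB/s quoted for the dedicated streaming data I/O ports of the Linedancer-HD is roughly three times the idealized peak of its PCI-X channel.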
Design Methodology

The software development environment for Linedancer consists of a compiler, linker, and debugger. The Linedancer is programmed in C, with some parallel extensions to support the ASProCore processing array. The toolchain is based on the GNU compiler framework, with dedicated pre- and postprocessing tools to compile and optimize the parallel extensions to C.

Associative SIMD processing adds an extra dimension to massively parallel processing, enabling new views on problem modeling and the subsequent implementation (for example, in searching/sorting and data-dependent image processing). The Linedancer’s 1D architecture scales better than the 2D arrays often used in multi-ALU arrays such as PACT’s XPP [6] or Tilera’s 2D multicore array [7]. Because of the large size of the array, power consumption is relatively high compared to the MONTIUM processor, which prevents its use in handheld devices.

FIGURE 11.11 The Linedancer-HD layout.

11.3.3 PACT-XPP

The eXtreme processing platform (XPP) is an example of a homogeneous array structure. It is a run-time reconfigurable coarse-grained data processing architecture. The XPP provides parallel processing power for high-bandwidth data such as video and audio processing. The XPP targets streaming DSP applications in the multimedia and telecommunications domain [10,20].

Architecture

The XPP architecture is based on a hierarchical array of coarse-grained, adaptive computing elements, called processing array elements (PAEs).
The PAEs are clustered in processing array clusters (PACs). All PAEs in the XPP architecture are connected through a packet-oriented communication network. Figure 11.12 shows the hierarchical structure of the XPP array and the PAEs clustered in a PAC.

Three types of PAEs are distinguished in the XPP array: ALU-PAE, RAM-PAE, and FNC-PAE. The ALU-PAE contains a multiplier and is used for DSP operations. The RAM-PAE contains a RAM to store data. The FNC-PAE is a unique sequential VLIW-like processor core. The FNC-PAEs are dedicated to the control flow and sequential sections of applications. Every PAC contains ALU-PAEs, RAM-PAEs, and FNC-PAEs. The PAEs operate according to a data flow principle; a PAE starts processing data as soon as all required input packets are available. If a required packet is not yet available, the pipeline stalls until the packet is received.

FIGURE 11.12 The structure of an XPP array composed of four PACs. (From Baumgarte, V. et al., J. Supercomput., 26(2), 167, September 2003.)

Each PAC is controlled by a configuration manager (CM). The CM is responsible for writing configuration data into the configurable objects of the PAC. Multi-PAC XPP arrays contain additional CMs for concurrent configuration data handling, arranged in a hierarchical tree of CMs. The top CM, called the supervising CM (SCM), has an external interface, not shown in Figure 11.12, that connects the supervising CM to an external configuration memory.

Design Methodology

DSP algorithms are directly mapped onto the XPP array according to their data flow graphs. The flow graph nodes define the functionality and operations of the PAEs, whereas the edges define the connections between the PAEs.
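The data flow principle described above (a PAE processes data as soon as all required input packets are available, and stalls otherwise) can be illustrated with a small simulation. The Python sketch below is a hypothetical software model, not NML and not the XPP toolchain: the names `DataflowNode` and `run` are assumptions made for this example, and every node is assumed to have at least one input.

```python
from collections import deque


class DataflowNode:
    """Hypothetical PAE-like node: fires only when every input holds a packet."""

    def __init__(self, name, op, num_inputs):
        self.name = name
        self.op = op                                     # function applied to one packet per input
        self.inputs = [deque() for _ in range(num_inputs)]
        self.consumers = []                              # edges: (downstream node, input port)

    def connect(self, consumer, port):
        """Add an edge from this node's output to a consumer's input port."""
        self.consumers.append((consumer, port))

    def ready(self):
        # Firing rule: all required input packets must be available.
        return all(len(q) > 0 for q in self.inputs)

    def fire(self):
        args = [q.popleft() for q in self.inputs]
        result = self.op(*args)
        for node, port in self.consumers:
            node.inputs[port].append(result)
        return result


def run(nodes):
    """Fire ready nodes until every node stalls; collect outputs of sink nodes."""
    outputs = []
    progress = True
    while progress:
        progress = False
        for node in nodes:
            if node.ready():
                result = node.fire()
                if not node.consumers:
                    outputs.append(result)
                progress = True
    return outputs


# Flow graph for y = (a * b) + c: nodes are operations, edges are connections.
mul = DataflowNode("mul", lambda a, b: a * b, 2)
add = DataflowNode("add", lambda x, c: x + c, 2)
mul.connect(add, 0)

mul.inputs[0].extend([2, 3])   # two packets for a
mul.inputs[1].extend([4, 5])   # two packets for b
add.inputs[1].append(10)       # only one packet for c

first = run([mul, add])        # add fires once, then stalls waiting for c
add.inputs[1].append(20)       # the missing packet arrives
second = run([mul, add])       # add resumes
```

In this run the adder fires once, stalls while its second `c` packet is missing, and resumes as soon as that packet arrives, which mirrors the stall behavior described in the text.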
The XPP array is programmed using the native mapping language (NML); see [20]. In NML descriptions, the PAEs are explicitly allocated and the connections between the PAEs are specified. Optionally, the allocated PAEs are placed onto the XPP array. NML also includes statements to support configuration handling. Configuration handling is an explicit part of the application description.

A vectorizing C compiler is available to translate C functions to NML modules. The vectorizing compiler for the XPP array analyzes the code for data dependencies, vectorizes those code sections automatically, and generates highly parallel code for the XPP array. The vectorizing C compiler is typically used to program “regular” DSP operations that are mapped on ALU-PAEs and RAM-PAEs of the XPP array. Furthermore, a coarse-grained parallelization into several FNC-PAE threads is very useful when “irregular” DSP operations exist in an application. This allows even irregular, control-dominated code to run in parallel on several FNC-PAEs. The FNC-PAE C compiler is similar to a conventional RISC compiler extended with VLIW features to take advantage of ILP within the DSP algorithms.

11.3.4 Tilera

The Tile64 [7] is a TP based on the mesh architecture that was originally developed for the RAW machine [26]. The chip consists of a grid of processor tiles arranged in a network (see Figure 11.13), where each tile consists of a GPP, a cache, and a nonblocking router that the tile uses to communicate with the other tiles on the chip.

FIGURE 11.13 Tile64 processor.

The Tilera processor architecture incorporates a 2D array of homogeneous, general-purpose cores. Next to each processor there is a switch that connects

