Nicolescu/Model-Based Design for Embedded Systems 67842_C008 Finals Page 216 2009-10-14 216 Model-Based Design for Embedded Systems

the necessary information needed for each translation step. Based on the task-dependency information that tells how to connect the tasks, the translator determines the number of intertask communication channels. Based on the period and deadline information of tasks, the run-time system is synthesized. With the memory map information of each processor, the translator defines the shared variables in the shared region.

To support a new target architecture in the proposed workflow, we have to add translation rules of the generic API to the translator, make a target-specific OpenMP translator for data-parallel tasks, and apply the generation rule of task-scheduling codes tailored for the target OS. Each step of the CIC translator is explained in this section.

8.5.1 Generic API Translation

Since the CIC task code uses generic APIs for target-independent specification, the generic APIs must be translated to target-dependent APIs. If the target processor has an OS installed, generic APIs are translated into OS APIs; otherwise, they are translated into communication APIs that are defined by directly accessing the hardware devices. We implement the OS API library and the communication API library, both optimized for each target architecture.

For most generic APIs, API translation is achieved by simple redefinition of the API function. Figure 8.6a shows an example where the translator replaces the MQ_RECEIVE API with a “read_port” function for a target processor with pthread support. The read_port function is defined using pthread APIs and the memcpy C library function. However, some APIs need additional treatment: for example, the READ API needs different function prototypes depending on the target architecture, as illustrated in Figure 8.6b. Maeng et al. [14] presented a rule-based translation technique that is general enough to translate any API if the translation rule is defined in a pattern-list file.

(a) Generic API:

    MQ_RECEIVE(port_id, buf, size);

Translated code:

    int read_port(int channel_id, unsigned char *buf, int len) {

        pthread_mutex_lock(channel_mutex);

        memcpy(buf, channel->start, len);

        pthread_mutex_unlock(channel_mutex);
    }

(b) Generic API:

    file = OPEN("input.dat", O_RDONLY);
    READ(file, data, 100);
    CLOSE(file);

Translated code, first target:

    #include <stdio.h>
    file = fopen("input.dat", "r");
    fread(data, 1, 100, file);
    fclose(file);

Translated code, second target:

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    file = open("input.dat", O_RDONLY);
    read(file, data, 100);
    close(file);

FIGURE 8.6 Examples of generic API translation: (a) MQ_RECEIVE operation, (b) READ operation.

8.5.2 HW-Interfacing Code Generation

If there is a code segment contained within a HW pragma section and its translation rule exists in an architecture information file, the CIC translator replaces the code segment with the HW-interfacing code, considering the parameters of the HW accelerator and the buffer variables that are defined in the architecture section of the CIC. The translation rule of the HW-interfacing code for a specific HW is specified separately as a HW-interface library code.

Note that some HW accelerators work together with other HW IPs. For example, a HW accelerator may notify the processor of its completion through an interrupt; in this case an interrupt controller is needed. The CIC translator generates a combination of the HW accelerator and the interrupt controller, as shown in the next section.

8.5.3 OpenMP Translator

If an OpenMP compiler is available for the target, then task codes with OpenMP directives can be used easily.
Otherwise, we need to translate the task code with OpenMP directives to a parallel code. Note that we do not need a general OpenMP translator, since we use OpenMP directives only to specify the data-parallel CIC task. But we have to make a separate OpenMP translator for each target architecture in order to achieve optimal performance.

For a distributed memory architecture, we developed an OpenMP translator that translates an OpenMP task code to MPI code using a minimal subset of the MPI library, for the following reasons: (1) MPI is a standard that is easily ported to various software platforms. (2) Porting the MPI library is much easier than modifying the OpenMP translator itself for the new target architecture.

Figure 8.7 shows the structure of the translated MPI program. As shown in the figure, the translated code has the master–worker structure: the master processor executes the entire code, while the worker processors execute the parallel region only. When the master processor meets the parallel region, it broadcasts the shared data to the worker processors. Then, all processors concurrently execute the parallel region. The master processor synchronizes all the processors at the end of the parallel loop and collects the results from the worker processors. For performance optimization, we have to minimize the amount of interprocessor communication.

[Figure 8.7: one master processor and three worker processors. Every processor initializes. The master works alone until the parallel region starts, where the shared data is broadcast ("BCast share data") to all processors. All four processors then work in the parallel region. At the end of the parallel region the workers send the shared data, the master receives and updates it, and the master works alone again.]

FIGURE 8.7 The workflow of translated MPI codes.
(From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

8.5.4 Scheduling Code Generation

The last step of the proposed CIC translator is to generate the task-scheduling code for each processor core. There will be many tasks mapped to each processor, with different real-time constraints and dependency information. We remind the reader that a task code is defined by three functions: “{task_name}_init(), {task_name}_go(), and {task_name}_wrapup().” The generated scheduling code initializes the mapped tasks by calling “{task_name}_init()” and wraps them up after the scheduling loop finishes its execution, by calling “{task_name}_wrapup().”

The main body of the scheduling code differs depending on whether there is an OS available for the target processor. If there is an OS that is POSIX-compliant, we generate a thread-based scheduling code, as shown in Figure 8.8a. A POSIX thread is created for each task (lines 17 and 18) with an assigned priority level if available. The thread, as shown in lines 3 to 5, executes the main body of the task, “{task_name}_go(),” and schedules itself based on its timing constraints by calling the “sleep()” function. If the OS is not POSIX-compliant, the CIC translator should be extended to generate the OS-specific scheduling code.

If there is no available OS for the target processor, the translator should synthesize the run-time scheduler that schedules the mapped tasks. The CIC translator generates a data structure of each task, containing the three main functions of the task (“init(), go(), and wrapup()”). With this data structure, a real-time scheduler is synthesized by the CIC translator. Figure 8.8b shows the pseudocode of a generated scheduling code. The generated scheduling code may be changed by replacing the function “void scheduler()” or “int get_next_task()” to support another scheduling algorithm.

(a)

     1. void *thread_task_0_func(void *argv) {
     2.
     3.     task_0_go();
     4.     get_time(&time);
     5.     sleep(task_0->next_period - time); // sleep for the remaining time
     6.
     7. }
     8. int main() {
     9.
    10.     pthread_t thread_task_0;
    11.     sched_param thread_task_0_param;
    12.
    13.     thread_task_0_param.sched_priority = 0;
    14.     pthread_attr_setschedparam( , &thread_task_0_param);
    15.
    16.     task_init(); /* {task_name}_init() functions are called */
    17.     pthread_create(&thread_task_0,
    18.         &thread_task_0_attr, thread_task_0_func, NULL);
    19.
    20.     task_wrapup(); /* {task_name}_wrapup() functions are called */
    21. }

(b)

    typedef struct {
        void (*init)();
        int (*go)();
        void (*wrapup)();
        int period, priority, ;
    } task;

    task taskInfo[] = { {task1_init, task1_go, task1_wrapup, 100, 0}
                      , {task2_init, task2_go, task2_wrapup, 200, 0} };

    void scheduler() {
        while (all_task_done() == FALSE) {
            int taskId = get_next_task();
            taskInfo[taskId].go();
        }
    }

    int main() {
        init();      /* {task_name}_init() functions are called */
        scheduler(); /* scheduler code */
        wrapup();    /* {task_name}_wrapup() functions are called */
        return 0;
    }

FIGURE 8.8 Pseudocode of generated scheduling code: (a) if OS is available, and (b) if OS is not available. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

8.6 Preliminary Experiments

An embedded software development framework based on the proposed methodology, named HOPES, is under development. While it allows the use of any model for the initial specification, the current implementation is being done with the PeaCE model. The PeaCE model is the one used in the PeaCE hardware–software codesign environment for multimedia embedded systems design [15].
To verify the viability of the proposed programming methodology, we built a virtual prototyping system, based on the Carbon SoC Designer [16], that consists of multiple subsystems of arm926ej-s connected to each other through a shared bus, as shown in Figure 8.9. The H.263 decoder depicted in Figure 8.3 is used for the preliminary experiments.

[Figure 8.9: two arm926ej-s subsystems, each with an interrupt controller, a local memory, and the HW blocks HW1 and HW2, together with HW3 and a shared memory.]

FIGURE 8.9 The target architecture for preliminary experiments. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

8.6.1 Design Space Exploration

We specified the functional parallelism of the H.263 decoder with six tasks, as shown in Figure 8.3, where each task is assigned an index. For data parallelism, the data-parallel region of the motion compensation task is specified with an OpenMP directive. In this experiment, we explored the design space of parallelizing the algorithm, considering both functional and data parallelism simultaneously. As is evident in Figure 8.3, tasks 1 to 3 can be executed in parallel; thus, they are mapped to multiple processors with three configurations, as shown in Table 8.1. For example, in the second configuration, task 1 is mapped to processor 1 and the other tasks are mapped to processor 0.

TABLE 8.1
Task Mapping to Processors

               The Configuration of Task Mapping
Processor Id   1                          2                          3
0              Task 0, Task 1, Task 2,    Task 0, Task 2, Task 3,    Task 0, Task 3,
               Task 3, Task 4, Task 5     Task 4, Task 5             Task 4, Task 5
1              N/A                        Task 1                     Task 1
2              N/A                        N/A                        Task 2

Source: Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.
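The three mappings of Table 8.1 can also be written down as a small lookup table. The following C sketch is only an illustration: the array layout and the names NUM_TASKS, NUM_CONFIGS, task_to_processor, and processors_used are assumptions made here, and they do not reproduce the format of the architecture information file.

```c
#define NUM_TASKS   6   /* Task 0 to Task 5 of the H.263 decoder */
#define NUM_CONFIGS 3   /* configurations 1 to 3 of Table 8.1 */

/* task_to_processor[c][t] is the processor id that runs task t in
   configuration c + 1 (hypothetical layout, values taken from Table 8.1). */
static const int task_to_processor[NUM_CONFIGS][NUM_TASKS] = {
    {0, 0, 0, 0, 0, 0},   /* configuration 1: all tasks on processor 0 */
    {0, 1, 0, 0, 0, 0},   /* configuration 2: task 1 on processor 1 */
    {0, 1, 2, 0, 0, 0},   /* configuration 3: task 1 on 1, task 2 on 2 */
};

/* Count how many distinct processors a configuration uses.
   Assumes processor ids are smaller than NUM_TASKS. */
int processors_used(int config)
{
    int used[NUM_TASKS] = {0};
    int count = 0;
    for (int t = 0; t < NUM_TASKS; t++) {
        int p = task_to_processor[config][t];
        if (!used[p]) {
            used[p] = 1;
            count++;
        }
    }
    return count;
}
```

With this encoding, switching from one configuration to another only changes a row index, which mirrors the idea that a configuration is selected by editing the mapping information rather than the task code.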
TABLE 8.2
Execution Cycles for Nine Configurations

The Number of Processors   The Configuration of Task Mapping
for Data-Parallelism       1              2              3
No OpenMP                  158,099,172    146,464,503    146,557,779
2                          167,119,458    152,753,214    153,127,710
4                          168,640,527    154,159,995    155,415,942

Source: Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.

For each configuration of task mapping, we parallelized task 4 using one, two, and four processors. As a result, we have prepared nine configurations in total, as illustrated in Table 8.2. In the proposed framework, each configuration is simply specified by changing the task-mapping information in the architecture information file. The CIC translator generates the executable C codes automatically.

Table 8.2 shows the performance result for these nine configurations. For functional parallelism, the best performance is obtained by using two processors, as reported in the first row (“No OpenMP” case). The H.263 decoder algorithm uses a 4:1:1 frame format, so the computation of Y macroblock decoding is about four times larger than that of the U and V macroblocks. Therefore, macroblock decoding of U and V can be merged onto one processor while macroblock decoding of Y runs on another processor. No performance gain is obtained by exploiting data parallelism, because the computation workload of motion compensation is not large enough to outweigh the communication overhead incurred by parallel execution.

8.6.2 HW-Interfacing Code Generation

Next, we accelerated the code segment of IDCT in the macroblock decoding tasks (task 1 to task 3) with a HW accelerator, as shown in Figure 8.10a. We use the RealView SoC designer to model the entire system including the HW accelerator. Two kinds of inverse discrete cosine transformation (IDCT) accelerator are used. One uses an interrupt signal for completion notification, and the other uses polling to detect the completion. The latter is specified in the architecture section as illustrated in Figure 8.10b, where the library name of the HW-interfacing code is set to IDCT_slave and its base address to 0x2F000000.

(a)

    #pragma hardware IDCT (output.data, input.data) {
        /* code segments for IDCT */
    }

(b)

    <hardware>
        <name>IDCT</name>
        <protocol>IDCT_slave</protocol>
        <param>0x2F000000</param>
    </hardware>

(c)

    <hardware>
        <name>IDCT</name>
        <protocol>IDCT_interrupt</protocol>
        <param>0x2F000000</param>
    </hardware>
    <hardware>
        <name>IRQ_CONTROLLER</name>
        <protocol>irq_controller</protocol>
        <param>0xA801000</param>
    </hardware>

FIGURE 8.10 (a) Code segment wrapped with HW pragma and architecture section information of IDCT, (b) when interrupt is not used, and (c) when interrupt is used. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

Figure 8.11a shows the assigned address map of the IDCT accelerator and Figure 8.11b shows the generated HW-interfacing code. This code is substituted for the code segment contained within a HW pragma section. In Figure 8.11b, bold letters are changeable according to the parameters specified in a task code and in the architecture information file; they specify the base address for the HW interface data structure and the input and output port names of the associated CIC task. Note that the interfacing code uses polling at line 6 of Figure 8.11b. If we use the accelerator with interrupt, an interrupt controller is additionally attached to the target platform, as shown in Figure 8.10c, with information on the code library name, IRQ_CONTROLLER, and its base address 0xA801000.
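The polling protocol can also be sketched without a fixed base address, so that the same sequence can be exercised on a host against an ordinary array. In the sketch below, the function name idct_run_polling, the IDCT_* index constants, and the choice to pass the register block as a pointer are assumptions made for illustration. The word indices follow the address map of Figure 8.11a, and the code is not the output of the CIC translator.

```c
/* Word indices into the register block (byte offset / 4), following the
   address map of Figure 8.11a. The constant names are hypothetical. */
enum {
    IDCT_SEMAPHORE = 0,   /* byte offset 0, read */
    IDCT_START     = 1,   /* byte offset 4, write */
    IDCT_COMPLETE  = 2,   /* byte offset 8, read */
    IDCT_CLEAR     = 3,   /* byte offset 12, write */
    IDCT_INPUT     = 16,  /* byte offsets 64 to 191, write */
    IDCT_OUTPUT    = 48,  /* byte offsets 192 to 319, read */
    IDCT_WORDS     = 32   /* 128 bytes of data in each direction */
};

/* Run one IDCT on the accelerator mapped at base, using polling. */
void idct_run_polling(volatile unsigned int *base,
                      const unsigned int *in, unsigned int *out)
{
    int i;

    while (base[IDCT_SEMAPHORE] == 1)
        ;                               /* wait until the accelerator is free */
    for (i = 0; i < IDCT_WORDS; i++)
        base[IDCT_INPUT + i] = in[i];   /* copy the input block */
    base[IDCT_START] = 1;               /* send the start signal */
    while (base[IDCT_COMPLETE] == 0)
        ;                               /* poll the complete flag */
    for (i = 0; i < IDCT_WORDS; i++)
        out[i] = base[IDCT_OUTPUT + i]; /* copy the result back */
    base[IDCT_CLEAR] = 1;               /* clear and unlock the hardware */
}
```

On the target, base would be the address given in the architecture section, for example (volatile unsigned int *)0x2F000000. On a host, a plain array whose complete flag is preset to 1 can stand in for the device, which makes the copy and handshake order easy to check.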
The new IDCT accelerator has the same address map as the previous one, except for the complete flag. The address of the complete flag (address 8 in Figure 8.11a) is assigned to “interrupt clear.”

(a)

    Address (Offset)   I/O Type   Comment
    0                  Read       Semaphore
    4                  Write      IDCT start
    8                  Read       Complete flag
    12                 Write      IDCT clear
    64 to 191          Write      Input data
    192 to 319         Read       Output data

(b)

    1. int i;
    2. volatile unsigned int *idct_base = (volatile unsigned int *) 0x2F000000;
    3. while (idct_base[0] == 1); // try to obtain hardware resource
    4. for (i = 0; i < 32; i++) idct_base[i+16] = ((unsigned int *)(input.data))[i];
    5. idct_base[1] = 1; // send start signal to IDCT accelerator
    6. while (idct_base[2] == 0); // wait for completion of IDCT operation
    7. for (i = 0; i < 32; i++) ((unsigned int *)(output.data))[i] = idct_base[i+48];
    8. idct_base[3] = 1; // clear and unlock hardware

FIGURE 8.11 (a) The address map of IDCT, and (b) its generated interfacing code. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

Figure 8.12a shows the generated interfacing code for the IDCT with interrupt. Note that the interfacing code does not access the HW to check the completion of IDCT, but checks the variable “complete.” In the generated code of the interrupt handler, this variable is set to 1 (Figure 8.12b). The initialization code for the interrupt controller (“initDevices()”) is also generated and called in the “{task_name}_init()” function.

8.6.3 Scheduling Code Generation

We generated the task-scheduling code of the H.263 decoder while changing the working conditions, OS support, and scheduling policy. At first, we used the eCos real-time OS for arm926ej-s in the RealView SoC designer, and generated the scheduling code, the pseudocode of which is shown in Figure 8.13.
In function cyg_user_start() of eCos, each task is created as a thread. The CIC translator generates the parameters needed for thread creation, such as stack variable information and stack size (fifth and sixth parameters of cyg_thread_create()). Moreover, we placed “{task_name}_go” in a while loop inside the created thread (lines 10 to 14 of Figure 8.13). Function {task_name}_init() is called in init_task(). Note that TE_main() is also created as a thread. TE_main() checks whether the execution of all tasks is finished, and calls “{task_name}_wrapup()” in wrapup_task() before finishing the entire program.

(a)

    int complete;

    volatile unsigned int *idct_base = (volatile unsigned int *) 0x2F000000;
    while (idct_base[0] == 1); // try to obtain hardware resource
    complete = 0;
    for (i = 0; i < 32; i++) idct_base[i+16] = ((unsigned int *)(input.data))[i];
    idct_base[1] = 1; // send start signal to IDCT accelerator
    while (complete == 0); // wait for completion of IDCT operation
    for (i = 0; i < 32; i++) ((unsigned int *)(output.data))[i] = idct_base[i+48];
    idct_base[3] = 1; // clear and unlock hardware

(b)

    extern int complete;
    __irq void IRQ_Handler() {
        IRQ_CLEAR();      // interrupt clear of interrupt controller
        idct_base[2] = 1; // interrupt clear of IDCT
        complete = 1;
    }
    void initDevices() {
        IRQ_INIT(); // initialize the interrupt controller
    }

FIGURE 8.12 (a) Interfacing code for the IDCT with interrupt, and (b) the interrupt handler code. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

For a processor without OS support, the current CIC translator supports two kinds of scheduling code: default and rate-monotonic scheduling (RMS). The default scheduler simply keeps the execution frequency of the tasks consistent with the ratio of their periods.
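As a minimal sketch of the two policies for a processor without OS support, the C code below implements one get_next_task() variant for each. The sched_task structure, its ready, time_count, and last_run fields, the selection_counter variable, and the function names get_next_task_default and get_next_task_rms are assumptions introduced for illustration; the code generated by the CIC translator may differ.

```c
#include <stddef.h>

/* Hypothetical per-task bookkeeping for a scheduler without OS support. */
typedef struct {
    int  period;      /* task period */
    int  ready;       /* nonzero if the task is executable now */
    long time_count;  /* accumulated periods; the smallest value runs next */
    long last_run;    /* value of selection_counter when last selected */
} sched_task;

static long selection_counter = 0;

/* Default policy: among executable tasks, take the smallest time count,
   break ties in favor of the task that has waited longest, then advance
   the time count of the selected task by its period. */
int get_next_task_default(sched_task *tasks, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (!tasks[i].ready)
            continue;
        if (best < 0 ||
            tasks[i].time_count < tasks[best].time_count ||
            (tasks[i].time_count == tasks[best].time_count &&
             tasks[i].last_run < tasks[best].last_run))
            best = (int)i;
    }
    if (best >= 0) {
        tasks[best].time_count += tasks[best].period;
        tasks[best].last_run = ++selection_counter;
    }
    return best;  /* -1 if no task is executable */
}

/* Rate-monotonic policy: the executable task with the smallest period wins.
   Updating the task information afterwards is left to the caller here. */
int get_next_task_rms(const sched_task *tasks, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (!tasks[i].ready)
            continue;
        if (best < 0 || tasks[i].period < tasks[best].period)
            best = (int)i;
    }
    return best;  /* -1 if no task is executable */
}
```

With two always-ready tasks of periods 100 and 200, the default variant selects the first task four times and the second task twice in the first six selections, which is the 2:1 execution ratio implied by the periods.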
Figure 8.14a and b show the pseudocode of function get_next_task(), which is called in the function scheduler() of Figure 8.8b, for the default and RMS schedulers, respectively.

     1. void cyg_user_start(void) {
     2.     cyg_thread_create(taskInfo[0]->priority, TE_task_0,
     3.         (cyg_addrword_t)0, "TE_task_0", (void *)&TaskStk[0],
     4.         TASK_STK_SIZE-1, &handle[0], &thread[0]);
     5.
     6.     init_task();
     7.     cyg_thread_resume(handle[0]);
     8.
     9. }
    10. void TE_task_0(cyg_addrword_t data) {
    11.     while (!finished)
    12.         if (this task is executable) taskInfo[0]->go();
    13.         else cyg_thread_yield();
    14. }
    15. void TE_main(cyg_addrword_t data) {
    16.     while (1)
    17.         if (all_task_is_done()) {
    18.             wrapup_task();
    19.             exit(1);
    20.         }
    21. }

FIGURE 8.13 Pseudocode of an automatically generated scheduler for eCos. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

(a)

    int get_next_task() {
        a. find executable tasks
        b. find the tasks that have the smallest value of time count
        c. select the task that has not been executed for the longest time
        d. add period to the time count of the selected task
        e. return the selected task id
    }

(b)

    int get_next_task() {
        a. find executable tasks
        b. select the task that has the smallest period
        c. update task information
        d. return the selected task id
    }

FIGURE 8.14 Pseudocode of “get_next_task()” without OS support: (a) default, and (b) RMS scheduler. (From Kwon, S. et al., ACM Trans. Des. Autom. Electron. Syst., 13, Article 39, July 2008. With permission.)

8.6.4 Productivity Analysis

For the productivity analysis, we recorded the elapsed time to manually modify the software (including debugging time) when we change the target architecture and task mapping. Such manual modification was performed by an expert programmer who is a PhD student.

For a fair comparison of automatic code generation and manual-coding overhead, we made the following assumptions. First, the application task codes are prepared and functionally verified. We chose an H.263 decoder as the application code that consists of six tasks, as illustrated in Figure 8.3. Second, the simulation environment is completely prepared for the initial configuration, as shown in Figure 8.15a. We chose the RealView SoC designer as the target simulator and prepared two different kinds of HW IPs for the IDCT function block. Third, the software environment for the target system is prepared, which includes the run-time scheduler and the target-dependent API library.