Model-Based Design for Embedded Systems - P8


… object request broker (HORBA) when the support of small-grain parallelism is needed. Our most recent developments in MultiFlex are mostly focused on supporting the streaming programming model, as well as its interaction with the client–server model. SMP subsystems are still of interest, and they are becoming increasingly well supported commercially [14,21]. Moreover, our focus is on data-intensive applications in multimedia and communications. For these applications, we have concentrated primarily on the streaming and client–server programming models, for which explicit, communication-centric approaches seem most appropriate.

This chapter introduces the MultiFlex framework, which is specialized in supporting the streaming and client–server programming models. However, we focus primarily on our recent streaming programming model and mapping tools.

7.2.1 Iterative Mapping Flow

MultiFlex supports an iterative process, using initial mapping results to guide the stepwise refinement and optimization of the application-to-platform mapping. Different assignment and scheduling strategies can be employed in this process. An overview of the MultiFlex toolset, which supports the client–server and streaming programming models, is given in Figure 7.2.

FIGURE 7.2 MultiFlex toolset overview.

The design methodology requires three inputs:

• The application specification—the application can be specified as a set of communicating blocks; it can be programmed using the streaming model or client–server programming model semantics.
• Application-specific information (e.g., quality-of-service requirements, measured or estimated execution characteristics of the application, data I/O characteristics, etc.).
• The abstract platform specification—this information includes the main characteristics of the target platform that will execute the application.

An intermediate representation (IR) is used to express the high-level application in a language-neutral form. It is translated automatically from one or more user-level capture environments. The internal structure of the application capture is strongly inspired by the Fractal component model [23]. Although we have focused mostly on the IR-to-platform mapping stages, we have experimented with graphical capture from a commercial toolset [7], and a textual capture language similar to StreamIt [3] has also been explored.

In the MultiFlex approach, the IR is mapped, transformed, and scheduled; finally, the application is transformed into target code that can run on the platform. There is a flexibility/performance trade-off between what can be calculated and compiled statically and what can be evaluated at run-time.
As shown in Figure 7.2, our approach is currently implemented using a combination of both, allowing a certain degree of adaptive behavior while making use of more powerful offline static tools when possible. Finally, the MultiFlex visualization and performance analysis tools help to validate the final results or to provide information for improving the results through further iterations.

7.2.2 Streaming Programming Model

As introduced above, the streaming programming model [1] has been designed for use with data-dominated applications. In this computing model, an application is organized into streams and computational kernels to expose its inherent locality and concurrency. Streams represent the flow of data, while kernels are computational tasks that manipulate and transform the data. Many data-oriented applications can easily be seen as sequences of transformations applied to a data stream. Examples of languages based on the streaming computing model are ESTEREL [4], Lucid [5], StreamIt [3], and Brook [2]. Frameworks for stream computing visualization are also available (e.g., Ptolemy [6] and Simulink® [7]).

In essence, our streaming programming model is well suited to a distributed-memory, parallel architecture (although mapping onto shared-memory platforms is possible), and it favors an implementation using software libraries invoked from the traditional sequential C language, rather than proposing language extensions or a completely new execution model.

The entry to the mapping tools uses an XML-based IR that describes the application as a topology with semantic tags on tasks. During the mapping process, the semantic information is used to generate the schedulers and all the glue necessary to execute the tasks according to their firing conditions. In summary, the objectives of the streaming design flow are:

• To refine the application mapping in an iterative process, rather than having a one-way, top-down code generation
• To support multiple streaming execution models and firing conditions
• To support both restricted synchronous data-flow and more dynamic data-flow blocks
• To be controlled by the user to achieve the mechanical transformations, rather than making decisions for the user

We first present the mapping flow in Section 7.3, and at the end of that section, we give more details on the streaming programming model.

7.3 MultiFlex Streaming Mapping Flow

The MultiFlex technology includes support for a range of streaming programming model variants. Streaming applications can be used alone or in interoperation with client–server applications. The MultiFlex streaming tool flow is illustrated in Figure 7.3. The different stages of this flow are described in the next sections.

The application mapping begins with the assignment of the application blocks to the platform resources. The IR transformations consist mainly of splitting and/or clustering the application blocks; they are performed for optimization purposes (e.g., memory optimization). The transformations also imply the insertion of communication mechanisms (e.g., FIFOs and local buffers). The scheduling defines the sharing of a processor between several blocks of the application. Most of the IR mapping, transforming, and scheduling is realized statically (at compilation time), rather than dynamically (at run-time).
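To make the stream/kernel vocabulary of Section 7.2.2 concrete, the short C sketch below implements a two-kernel pipeline connected by a bounded FIFO. It is purely illustrative: the FIFO type, the kernel signatures, and the hand-written schedule are assumptions made for this example and are not part of the MultiFlex API, where the per-processor schedule and the communication glue are generated by the tools.

```c
#include <stdio.h>

#define FIFO_CAP 8   /* capacity of the stream buffer, in tokens */

/* A bounded FIFO models a stream: a typed flow of tokens between kernels. */
typedef struct {
    int data[FIFO_CAP];
    int head, tail, count;
} fifo_t;

static int fifo_push(fifo_t *f, int v) {
    if (f->count == FIFO_CAP) return 0;      /* no free space: producer must wait */
    f->data[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_CAP;
    f->count++;
    return 1;
}

static int fifo_pop(fifo_t *f, int *v) {
    if (f->count == 0) return 0;             /* no data: consumer must wait */
    *v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return 1;
}

/* Kernel 1: produces one token per firing (e.g., a sample source). */
static void source_kernel(fifo_t *out, int i) {
    fifo_push(out, i);
}

/* Kernel 2: consumes one token, transforms it, and prints the result. */
static void scale_kernel(fifo_t *in) {
    int v;
    if (fifo_pop(in, &v))
        printf("scaled token: %d\n", 2 * v);
}

int main(void) {
    fifo_t stream = { .head = 0, .tail = 0, .count = 0 };
    /* A trivial static schedule alternating the two kernels. */
    for (int i = 0; i < 4; i++) {
        source_kernel(&stream, i);
        scale_kernel(&stream);
    }
    return 0;
}
```

In MultiFlex, the equivalent structure is captured as blocks and channels in the IR, and the per-processor schedulers are generated by the mapping tools rather than written by hand.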
The methodology targets large-scale multicore platforms that include a uniform, layered communication network based on STMicroelectronics' network-on-chip (NoC) backbone infrastructure [18] and a small number of H/W-based communication IPs for efficient data transfer (e.g., stream-oriented DMAs or message-passing accelerators [9]). Although we consider our methodology to be compatible with the integration of application-specific hardware accelerators using high-level hardware synthesis, we are not currently targeting such platforms.

FIGURE 7.3 MultiFlex tool flow for streaming applications.

7.3.1 Abstraction Levels

In the MultiFlex methodology, a data-dominated application is gradually mapped onto a multicore platform by passing through several abstractions:

• The application level—at this level, the application is organized as a set of communicating blocks. The targeted architecture is completely abstracted.
• The partitioning level—at this level, the application blocks are grouped into partitions; each partition will be executed on a PE of the target architecture. PEs can be instruction-set programmable processors, reconfigurable hardware, or standard hardware.
• The communication level—at this level, the scheduling and the communication mechanisms used on each processor between the different blocks forming a partition are detailed.
• The target architecture level—at this level, the final code executed on the targeted platforms is generated.

Table 7.2 summarizes the different abstractions, models, and tools provided by MultiFlex in order to map complex data-oriented applications onto multiprocessor platforms.

TABLE 7.2 Abstraction, Models, and Tools in MultiFlex

Abstraction Level          | Model                                                                       | Refinement Tool
Application level          | Set of communicating blocks                                                 | Textual or graphical front-end
Partition level            | Set of communicating blocks and directives to assign blocks to processors   | MpAssign
Communication level        | Set of communicating blocks and required communication components           | MpCompose
Target architecture level  | Final code loaded and executed on the target platform                       | Component-based compilation back-end

7.3.2 Application Functional Capture

The application is functionally captured as a set of communicating blocks. A basic (or primitive) block consists of a behavior that implements a known interface. The implementation part of the block uses streaming application programming interface (API) calls to get input and output data buffers to communicate with other tasks. Blocks are connected through communication channels (in short, channels) via their interfaces. The basic blocks can be grouped into hierarchical blocks, or composites.
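As a rough sketch of this capture structure (blocks exposing ports, connected through channels, and grouped into a composite), the example below uses a hypothetical C-level assembly API. The names block_new and channel_connect and the fixed-size containers are assumptions made for illustration; they are not the actual MultiFlex component API, and the equivalent topology information is what the XML-based IR records.

```c
#include <stdio.h>
#include <string.h>

/* Minimal stand-ins for a component-assembly API (illustrative only). */
typedef struct { char name[32]; } block_t;
typedef struct { block_t *src; int src_port; block_t *dst; int dst_port; } channel_t;
typedef struct { block_t *blocks[8]; int n_blocks; channel_t chans[8]; int n_chans; } composite_t;

/* Declare a new basic block inside a composite. */
static block_t *block_new(composite_t *c, const char *name) {
    static block_t pool[8];
    block_t *b = &pool[c->n_blocks];
    strncpy(b->name, name, sizeof b->name - 1);
    c->blocks[c->n_blocks++] = b;
    return b;
}

/* Connect an output port of one block to an input port of another. */
static void channel_connect(composite_t *c, block_t *src, int sp, block_t *dst, int dp) {
    channel_t ch = { src, sp, dst, dp };
    c->chans[c->n_chans++] = ch;
}

int main(void) {
    /* Assemble a three-block pipeline B1 -> B2 -> B3 inside one composite. */
    composite_t top = { 0 };
    block_t *b1 = block_new(&top, "B1");
    block_t *b2 = block_new(&top, "B2");
    block_t *b3 = block_new(&top, "B3");
    channel_connect(&top, b1, 0, b2, 0);
    channel_connect(&top, b2, 0, b3, 0);

    /* Dump the topology: essentially the connectivity the XML-based IR would record. */
    for (int i = 0; i < top.n_chans; i++)
        printf("%s.out%d -> %s.in%d\n",
               top.chans[i].src->name, top.chans[i].src_port,
               top.chans[i].dst->name, top.chans[i].dst_port);
    return 0;
}
```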
The main types of basic blocks supported in the MultiFlex approach are:

• Simple data-flow block: This type of block consumes and produces tokens on all inputs and outputs, respectively, when executed. It is launched when there is data available at all inputs and there is sufficient free space in downstream components for all outputs to write the results.
• Synchronous client–server block: This block needs to perform one or many remote procedure calls before being able to push data on its output interface. It must therefore be scheduled differently than the simple data-flow block.
• Server block: This block can be executed once all the arguments of the call are available. Often, this type of block can be used to model a H/W coprocessor.
• Delay memory: This type of block can be used to store a given number of data tokens (an explicit state).

Figure 7.4 gives the graphical representation of a streaming application capture that interacts with a client–server application. Here, we focus mostly on streaming applications.

FIGURE 7.4 Application functional capture.

From the point of view of the application programmer, the first step is to split the application into processing blocks with buffer-based I/O ports. User code corresponding to the block behavior is written using the C language. Using component structures, each block has its private state and implements a constructor (init), a work section (process), and a destructor (end). To obtain access to I/O port data buffers, the blocks have to use a predefined API. A run-to-completion execution model is proposed as a compromise between programming and mapping flexibility. The user can extend the local schedulers to allow the local control of the components, based on application-specific control interfaces. The dataflow graph may contain blocks that use client–server semantics, with application-specific interfaces, to perform remote object calls that can be dispatched to a pool of servers.

7.3.3 Application Constraints

The following application constraints are used by the MultiFlex streaming tools:

1. Block profiling information: for a given block, the average number of clock cycles required for the block execution on a target processor.
2. Communication volume: for a given channel, the size of the data exchanged on that channel.
3. User assignment directives. Three types of directives are supported by the tool:
   a. Assign a block to a specific processor
   b. Assign two blocks to the same processor (can be any processor)
   c. Assign two blocks to any two different processors

7.3.4 The High-Level Platform Specification

The high-level platform specification is an abstraction of the processing, communication, and storage resources of the target platform. In the current implementation, the information stored is as follows:

• Number and type of PEs.
• Program and data memory size constraints (for each programmable PE).
• Information on the NoC topology. Our target platform uses the STNoC, which is based on the "Spidergon" topology [18]. We include latency measures for single-hop and multi-hop communication.
• Constraints on the communication engines: the number of physical links available for communication with the NoC.

7.3.5 Intermediate Format

MultiFlex relies on intermediate representations (IRs) to capture the application, the constraints, and the high-level platform descriptions. The topology of the application—the block declarations and their connectivity—is expressed using an XML-based intermediate format. It is also used to store task annotations, such as the block execution semantics. Other block annotations are used for the application profiling and block assignments. Edges are annotated with the communication volume information.

The IR is designed to support the refinement of the application as it is iteratively mapped to the platform. This implies supporting the multiple abstraction levels involved in the assignment and mapping process described in the next sections.

7.3.6 Model Assumptions and Distinctive Features

In this section, we provide more details about the streaming model. This background information will help in explaining the mapping tools in the next section.

The task specification includes the data type for each I/O port as well as the maximum amount of data consumed or produced on these ports. This information is an important characteristic of the application capture because it is at the foundation of our streaming model: each task has a known computation grain size. This means that we know the amount of data required to fire the process function of the task for a single iteration without starving on input data, and we know the maximum amount of output data that can be produced each time. This is a requirement for the nonblocking, or run-to-completion, execution of the task, which simplifies the scheduling and communication infrastructure and reduces the system overhead. Finally, we can quantify the computation requirements of each task for a single iteration.

The run-to-completion execution model allows dissociating the scheduling of the tasks from the actual processing function, providing clear scheduling points. Application developers focus on implementing and optimizing the task functions (using the C language) and on expressing the functionality in a way that is natural for the application, without trying to balance the task loads in the first place. This means each task can work on a different data packet size and have a different computation load. The assignment and scheduling of the tasks can be done in a separate phase (usually performed later), allowing the exploration of the mapping parameters, such as the task assignment and the FIFO and buffer sizes, without changing the functionality of the tasks: a basic principle enabling correct-by-construction automated refinement.

The run-to-completion execution model is a compromise: it requires more constrained programming but leads to higher flexibility in terms of mapping. However, in certain cases, we have no choice but to support multiple concurrent execution contexts. We use cooperative threading to schedule special tasks that use a mix of streaming and client–server constructs. Such tasks are able to invoke remote services via client–server (DSOC) calls, including synchronous methods (with return values) that cause the caller task to block, waiting for an answer.
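To make these two properties concrete (a per-task grain size declared on the ports, and nonblocking, run-to-completion firing), the following C sketch shows one way such a scheduler loop could look. The task descriptor, the port-rate fields, and the helper functions are illustrative assumptions, not the MultiFlex runtime API; in the real flow this logic is generated by the tools from the IR annotations.

```c
#include <stdio.h>

#define MAX_PORTS 4

/* Per-port grain size: how many tokens one firing consumes or produces. */
typedef struct {
    int tokens_per_fire;   /* declared maximum rate for this port              */
    int available;         /* tokens currently queued (input) or free (output) */
} port_t;

/* A task with a known grain size and a run-to-completion process() function. */
typedef struct {
    port_t in[MAX_PORTS];   int n_in;
    port_t out[MAX_PORTS];  int n_out;
    void (*process)(void *state);   /* runs to completion, never blocks */
    void *state;
} task_t;

/* Firing rule of a simple data-flow block: enough data on every input
 * and enough free space on every output. */
static int can_fire(const task_t *t) {
    for (int i = 0; i < t->n_in; i++)
        if (t->in[i].available < t->in[i].tokens_per_fire) return 0;
    for (int i = 0; i < t->n_out; i++)
        if (t->out[i].available < t->out[i].tokens_per_fire) return 0;
    return 1;
}

/* Non-preemptive scheduler pass: each fired task runs to completion,
 * which gives the scheduler clear scheduling points between firings. */
static void schedule_once(task_t *tasks, int n) {
    for (int i = 0; i < n; i++) {
        if (!can_fire(&tasks[i])) continue;
        tasks[i].process(tasks[i].state);
        for (int p = 0; p < tasks[i].n_in; p++)
            tasks[i].in[p].available -= tasks[i].in[p].tokens_per_fire;
        for (int p = 0; p < tasks[i].n_out; p++)
            tasks[i].out[p].available -= tasks[i].out[p].tokens_per_fire;
    }
}

/* Tiny demonstration task: doubles an integer held in its private state. */
static void double_it(void *state) { *(int *)state *= 2; }

int main(void) {
    int value = 21;
    task_t t = { .n_in = 1, .n_out = 1, .process = double_it, .state = &value };
    t.in[0]  = (port_t){ .tokens_per_fire = 1, .available = 1 };
    t.out[0] = (port_t){ .tokens_per_fire = 1, .available = 1 };
    schedule_once(&t, 1);
    printf("fired once, state = %d\n", value);   /* prints 42 */
    return 0;
}
```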
In addition, we are evaluating the pros and cons of supporting tasks with unrestricted I/O and very fine-grain communication. To be able to eventually run several tasks of this nature on the same processor, we may need a software kernel, or we may make use of hardware threading if the underlying platform provides it.

To be able to choose the correct scheduler to deploy on each PE, we have introduced semantic tags, which describe the high-level behavior type of each task. This information is stored in the IR. We have defined a small set of task types, listed previously in Section 7.3.2. This allows a mix of execution models and firing conditions, thus providing a rich programming environment. Having clear semantic tags ensures that the mapping tools can optimize the scheduling and communications on each processor, rather than systematically supporting all features and being designed for the worst case.

The nonblocking execution is only one characteristic of streaming compared with our DSOC client–server message-passing programming model. As opposed to DSOC, our streaming programming model does not provide data marshaling (although, in principle, this could be integrated in the case of heterogeneous streaming subsystems).

When compared with asynchronous concurrent components, another distinction of the streaming model is its data-driven scheduling. In event-based programming, asynchronous calls (of unknown size) can be generated during the execution of a single reaction, and those calls must be queued. The quantity of events may result in complex triggering protocols that have to be defined and implemented by the application programmer. This remains a well-known drawback of event-based systems. With the data-flow approach, the clear data-triggered execution semantics and the specification of I/O data ports resolve the scheduling, memory-management, and memory-ownership problems inherent in asynchronous remote method invocations.

Finally, another characteristic of our implementation of the streaming programming model, which is also shared with our SMP and DSOC models, is that application code is reused "as is," i.e., no source code transformations are performed. We see two beneficial consequences of this common approach. In terms of debugging, it is an asset, since the programmer can use a standard C source-level debugger to verify the unmodified code of the task core functions. The other main advantage is related to profiling. Once again, it is relatively easy for an application engineer to understand and optimize the task functions with a profiling report, because the source code is untouched.

7.4 MultiFlex Streaming Mapping Tools

7.4.1 Task Assignment Tool

The main objective of the MpAssign tool (see Figure 7.5) is to assign application blocks to processors while optimizing two objectives:

1. Balance the task load on all processors
2. Minimize the inter-processor communication load

FIGURE 7.5 MpAssign tool.
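As a rough illustration of how these two objectives can be combined, the sketch below performs a greedy, list-style assignment driven by a weighted cost in the spirit of the estimator detailed below (Equation 7.1). The task and platform data, the weights, and the omission of the look-ahead term are all assumptions made for this example; this is not the actual MpAssign implementation, which additionally handles readiness rules, critical-path traversal, and user directives.

```c
#include <float.h>
#include <stdio.h>

#define N_TASKS 5
#define N_PROCS 2

/* Illustrative inputs: per-task processing cost and pairwise communication volume. */
static const double proc_cost[N_TASKS]         = { 10, 20, 15, 5, 25 };
static const double comm_vol[N_TASKS][N_TASKS] = {
    {0, 4, 0, 0, 0},
    {0, 0, 6, 2, 0},
    {0, 0, 0, 0, 3},
    {0, 0, 0, 0, 1},
    {0, 0, 0, 0, 0},
};

static int    assignment[N_TASKS];   /* -1 while a task is unassigned */
static double load[N_PROCS];         /* accumulated processing cost   */

/* Weighted cost of placing task t on processor p (cf. Equation 7.1,
 * with the look-ahead term Csucc omitted for brevity). */
static double cost(int t, int p, double w1, double w2) {
    double c_proc = load[p] + proc_cost[t];   /* load-balancing term         */
    double c_comm = 0.0;                      /* traffic with assigned tasks */
    for (int u = 0; u < N_TASKS; u++)
        if (assignment[u] >= 0 && assignment[u] != p)
            c_comm += comm_vol[u][t];
    return w1 * c_proc + w2 * c_comm;
}

int main(void) {
    const double w1 = 1.0, w2 = 2.0;          /* designer-chosen weights */
    for (int t = 0; t < N_TASKS; t++) assignment[t] = -1;

    /* Greedy list order: tasks are taken in index order (assumed topological). */
    for (int t = 0; t < N_TASKS; t++) {
        int best_p = 0;
        double best_c = DBL_MAX;
        for (int p = 0; p < N_PROCS; p++) {
            double c = cost(t, p, w1, w2);
            if (c < best_c) { best_c = c; best_p = p; }
        }
        assignment[t] = best_p;
        load[best_p] += proc_cost[t];
        printf("task %d -> PE%d (cost %.1f)\n", t, best_p, best_c);
    }
    return 0;
}
```

Raising w2 relative to w1 makes the heuristic favor co-locating communicating tasks over balancing the processing load, which is precisely the trade-off the designer-chosen weights of Equation 7.1 expose.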
The inter-processor communication cost is given by the data volume exchanged between two processors, related to each task.

The tool receives as inputs the application capture, the application constraints, and the high-level platform specification. The output of the tool is a set of assignment directives specifying which blocks are mapped on each processor, the average load of each processor, and the cost of each inter-processor communication. The lower portion of Figure 7.5 gives a visual representation of the MpAssign output: the tool provides a visual display of the resulting block assignments to processors.

The algorithm implemented in the MpAssign tool is inspired by Marculescu's research [10] and is based on graph traversal approaches, where ready tasks are assigned iteratively, giving priority to the task with the maximal variance between its two minimal assignment costs. The two main graph traversal approaches implemented in MpAssign are

• The list-based approach, using mainly the breadth-first principle—a task is ready if all its predecessors are assigned
• The path-based approach, using mainly the depth-first principle—a task is ready if one predecessor is assigned and it is on the critical path

A cost estimator C(t, p) of assigning a task t to a processor p is used. This cost estimator is computed using the following equation:

    C(t, p) = w1 · Cproc + w2 · Ccomm + w3 · Csucc        (7.1)

where
Cproc is the additional average processing cost incurred when task t is assigned to processor p,
Ccomm is the communication cost required for the communication of task t with the preceding tasks, and
Csucc is a look-ahead cost concerning the successor tasks, i.e., the minimal cost estimate of mapping a number of successor tasks; this assumes a state-space exploration for a predefined look-ahead depth.

The weight wi associated with each cost factor (Cproc, Ccomm, and Csucc) indicates the significance of that factor in the total cost C(t, p) compared with the other factors. The factors are weighted by the designer to set their relative importance.

7.4.2 Task Refinement and Communication Generation Tools

The main objective of the MpCompose tool (see Figure 7.6) is to generate one application graph per PE, each graph containing the desired computation blocks from the application, one local scheduler, and the required communication components. To perform this functionality, MpCompose requires the following three inputs: [...]

… parallel-programming framework for MPSoC, ACM Transactions on Design Automation of Electronic Systems (TODAES), Vol. 13, No. 3, Article 39, July 2008.

… design methodology of MPSoC, most efforts have focused on the design of the hardware architecture. But the real bottleneck will be software design, as preverified hardware platforms tend to be reused in platform-based designs. Unlike …

… increasing need for flexibility in multimedia SoCs for consumer applications is leading to a new class of programmable, multiprocessor solutions. The high computation and data bandwidth requirements of these applications pose new challenges in the expression of the applications, the platform architectures to support them, and the application-to-platform mapping …
… C code. But automatic parallelization of a C code has been successful only for a limited class of applications, after a long period of extensive research [7]. In order to increase the design productivity of embedded software, we propose a novel methodology for embedded software design based on a parallel programming model, called a common intermediate …

The scheduler interleaves communication and processing at the block level. For each input port, the scheduler checks whether data is available in the local memory. If not, it checks whether the input FIFO is empty; if the FIFO is not empty, the scheduler orders it to perform the transfer into local memory. This is typically done by coprocessors such as DMAs or specialized …

… system, embedded software is not easy to debug at run time. Furthermore, software failure may not be tolerated in safety-critical applications, so the correctness of the embedded software should be guaranteed at compile time. Embedded software design is very challenging since it amounts to parallel programming for nontrivial heterogeneous multiprocessors with diverse communication architectures and design …

… Pacific Design Automation Conference), Yokohama, Japan, January 2007, pp. 749–750.

12. P.G. Paulin, C. Pilkington, M. Langevin, E. Bensoudane, D. Lyonnard, O. Benny, B. Lavigueur, D. Lo, G. Beltrame, V. Gagné, and G. Nicolescu, Parallel programming models for a multi-processor SoC platform applied to networking and multimedia, IEEE Transactions on VLSI Journal, 14(7), July 2006, 667–680. …
Nicolescu /Model-Based Design for Embedded Systems 67842_C007 Finals Page 190 2009-10-2 190 Model-Based Design for Embedded Systems TABLE 7.2 Abstraction,. perform this functionality, MpCompose requires the following three inputs: Nicolescu /Model-Based Design for Embedded Systems 67842_C007 Finals Page 196 2009-10-2 196 Model-Based Design for Embedded


Table of Contents

  • Contents

  • Preface

  • Introduction

  • Contributors

  • Part I: Real-Time and Performance Analysis in Heterogeneous Embedded Systems

    • Chapter 1. Performance Prediction of Distributed Platforms

    • Chapter 2. SystemC-Based Performance Analysis of Embedded Systems

    • Chapter 3. Formal Performance Analysis for Real-Time Heterogeneous Embedded Systems

    • Chapter 4. Model-Based Framework for Schedulability Analysis Using UPPAAL 4.1

    • Chapter 5. Modeling and Analysis Framework for Embedded Systems

    • Chapter 6. TrueTime: Simulation Tool for Performance Analysis of Real-Time Embedded Systems

  • Part II: Design Tools and Methodology for Multiprocessor System-on-Chip

    • Chapter 7. MPSoC Platform Mapping Tools for Data-Dominated Applications

    • Chapter 8. Retargetable, Embedded Software Design Methodology for Multiprocessor-Embedded Systems

    • Chapter 9. Programming Models for MPSoC

    • Chapter 10. Platform-Based Design and Frameworks: Metropolis and Metro II

    • Chapter 11. Reconfigurable Multicore Architectures for Streaming Applications

    • Chapter 12. FPGA Platforms for Embedded Systems

  • Part III: Design Tools and Methodology for Multidomain Embedded Systems

    • Chapter 13. Modeling, Verification, and Testing Using Timed and Hybrid Automata

    • Chapter 14. Semantics of Domain-Specific Modeling Languages

    • Chapter 15. Multi-Viewpoint State Machines for Rich Component Models

    • Chapter 16. Generic Methodology for the Design of Continuous/Discrete Co-Simulation Tools
