3 Formal Performance Analysis for Real-Time Heterogeneous Embedded Systems

Simon Schliecker, Jonas Rox, Rafik Henia, Razvan Racu, Arne Hamann, and Rolf Ernst

CONTENTS
3.1 Introduction
3.2 Formal Multiprocessor Performance Analysis
    3.2.1 Application Model
    3.2.2 Event Streams
    3.2.3 Local Component Analysis
    3.2.4 Compositional System-Level Analysis Loop
3.3 From Distributed Systems to MPSoCs
    3.3.1 Deriving Output Event Models
    3.3.2 Response Time Analysis in the Presence of Shared Memory Accesses
    3.3.3 Deriving Aggregate Busy Time
3.4 Hierarchical Communication
3.5 Scenario-Aware Analysis
    3.5.1 Echo Effect
    3.5.2 Compositional Scenario-Aware Analysis
3.6 Sensitivity Analysis
    3.6.1 Performance Characterization
    3.6.2 Performance Slack
3.7 Robustness Optimization
    3.7.1 Use-Cases for Design Robustness
    3.7.2 Evaluating Design Robustness
    3.7.3 Robustness Metrics
        3.7.3.1 Static Design Robustness
        3.7.3.2 Dynamic Design Robustness
3.8 Experiments
    3.8.1 Analyzing Scenario 1
    3.8.2 Analyzing Scenario 2
    3.8.3 Considering Scenario Change
    3.8.4 Optimizing Design
    3.8.5 System Dimensioning
3.9 Conclusion
References

3.1 Introduction

Formal approaches to system performance modeling have always been used in the design of real-time systems. With increasing system complexity, there is a growing demand for more sophisticated formal methods in a wider range of systems to improve system predictability and to determine system robustness against changes, enhancements, and design pitfalls. This demand can be addressed by the significant progress made over the last couple of years in performance modeling and analysis on all levels of abstraction.

New modular models and methods now allow the analysis of large-scale, heterogeneous systems, providing reliable data on transitional load situations, end-to-end timing, memory usage, and packet losses. A compositional performance analysis decomposes the system into the analysis of individual components and their interactions, providing a versatile method for approaching real-world architectures. Early industrial adopters are already using such formal methods for the early evaluation and exploration of a design, as well as for a formally complete performance verification toward the end of the design cycle; neither could be achieved with simulation-based approaches alone.

The formal methods presented in this chapter are based on abstract load and execution data, and are thus applicable even before executable hardware or software models are available. Such data can even be estimates derived from previous product generations, similar implementations, or simply engineering judgment, allowing first evaluations of the application and the architecture. This already allows tuning an architecture for maximum robustness against changes in system execution and communication load, reducing the risk of late and expensive redesigns. During the design process, these models can be iteratively refined, eventually leading to a verifiable performance model of the final implementation.
The multitude of diverse programming and architectural design paradigms, often used together in the same system, calls for formal methods that can be easily extended to consider the corresponding timing effects. For example, formal performance analysis methods are becoming increasingly important in the domain of tightly integrated multiprocessor systems-on-chip (MPSoCs). Although such components promise to deliver higher performance at reduced production cost and power consumption, they introduce a new level of integration complexity. As in distributed embedded systems, multiprocessing comes at the cost of a higher timing complexity of interdependent computation, communication, and data storage operations.

Also, many embedded systems (distributed or integrated) feature communication layers that introduce a hierarchical timing structure into the communication. This is addressed in this chapter with a formal representation and accurate modeling of the timing effects induced during transmission.

Finally, today's embedded systems deliver a multitude of different software functions, each of which can be particularly important in a specific situation (e.g., in automotive systems: an electronic stability program (ESP) and a parking assistant). A hardware platform designed to execute all of these functions at the same time would be expensive and effectively over-dimensioned, given that the scenarios are often mutually exclusive. Thus, in order to supply the desired functions at a competitive cost, systems are dimensioned only for subsets of the supplied functions, so-called scenarios, which are investigated individually. This, however, poses new pitfalls when dimensioning distributed systems under real-time constraints. It becomes mandatory to also consider the scenario-transition phase to prevent timing failures.

This chapter presents an overview of a general, modular, and formal performance analysis framework that has successfully accommodated many extensions. First, we present its basic procedure in Section 3.2. Several extensions are provided in the subsequent sections to address specific properties of real systems: Section 3.3 visits multicore architectures and their implications for performance; hierarchical communication, as is common in automotive networks, is addressed in Section 3.4; and the dynamic behavior of switching between different application scenarios during runtime is investigated in Section 3.5. Furthermore, we present a methodology to systematically investigate the sensitivity of a given system configuration and to explore the design space for optimal configurations in Sections 3.6 and 3.7. In an experimental section (Section 3.8), we investigate timing bottlenecks in an example heterogeneous automotive architecture and show how to improve its performance, guided by sensitivity analysis and system exploration.

3.2 Formal Multiprocessor Performance Analysis

In past years, compositional performance analysis approaches [6,14,16] have received increasing attention in the real-time systems community. Compositional performance analyses exhibit great flexibility and scalability for timing and performance analyses of complex, distributed embedded real-time systems. Their basic idea is to integrate local performance analysis techniques, for example, scheduling analysis techniques known from real-time research, into system-level analyses.
This composition is achieved by connecting the components' inputs and outputs through stream representations of their communication behavior using event models. This procedure is illustrated in Sections 3.2.1 through 3.2.4.

3.2.1 Application Model

An embedded system consists of hardware and software components interacting with each other to realize a set of functionalities. The traditional approach to formal performance analysis proceeds bottom-up. First, the behavior of the individual functions is investigated in detail to gather all relevant data, such as the execution time. This information can then be used to derive the behavior within individual components, accounting for local scheduling interference. Finally, the system-level timing is derived on the basis of the lower-level results.

For an efficient system-level performance verification, embedded systems are modeled at the highest possible level of abstraction. The smallest unit modeling performance characteristics at the application level is called a task. Furthermore, to distinguish computation and communication, tasks are categorized into computational and communication tasks. The hardware platform is modeled by computational and communication resources, referred to as CPUs and buses, respectively. Tasks are mapped onto resources in order to be executed. To resolve conflicting requests, each resource is associated with a scheduler.

Tasks are activated and executed due to activating events, which can be generated in a multitude of ways, including timer expiration and task chaining according to inter-task dependencies. Each task is assumed to have one input first-in first-out (FIFO) buffer. In the basic task model, a task reads its activating data solely from its input FIFO and writes data into the input FIFOs of dependent tasks. This basic model of a task is depicted in Figure 3.1a. Various extensions of this model also exist. For example, if the task may be suspended during its execution, this can be modeled with the requesting-task model presented in Section 3.3. Also, the direct task activation model has been extended to more complex activation conditions and semantics [10].

FIGURE 3.1: Task execution model. (a) Basic task: activation, local task execution, termination. (b) Requesting task: system-level transactions interleaved with the local task execution.

3.2.2 Event Streams

The timing properties of the arrival of workload, i.e., activating events, at the task inputs are described with an activation model. Instead of considering each activation individually, as simulation does, formal performance analysis abstracts from individual activating events to event streams. Generally, event streams can be described using the upper and lower event-arrival functions, η+ and η−, as follows.

Definition 3.1 (Upper Event-Arrival Function, η+) The upper event-arrival function, η+(Δt), specifies the maximum number of events that may occur in the event stream during any time interval of size Δt.

Definition 3.2 (Lower Event-Arrival Function, η−) The lower event-arrival function, η−(Δt), specifies the minimum number of events that may occur in the event stream during any time interval of size Δt.
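To make Definitions 3.1 and 3.2 concrete, the sketch below estimates η+ and η− empirically from a finite trace of event timestamps by sliding a window of size Δt across the trace. This is an illustrative aid only, not part of the chapter's analysis framework; the function names and the trace-based approach are assumptions made for this example, and a finite trace can of course only approximate the true bounds of the underlying event stream.

```python
from bisect import bisect_left, bisect_right

def eta_plus_trace(trace, dt):
    """Maximum number of events observed in any half-open window [t, t + dt).
    A maximizing window can always be left-aligned on an event, so it suffices
    to try every event timestamp as the window start."""
    best = 0
    for t in trace:                       # trace: sorted list of event timestamps
        best = max(best, bisect_left(trace, t + dt) - bisect_left(trace, t))
    return best

def eta_minus_trace(trace, dt):
    """Minimum number of events observed in any window (t, t + dt] that lies
    fully inside the observed trace (a trace-based estimate of eta-minus)."""
    if not trace or dt > trace[-1] - trace[0]:
        return 0                          # window larger than the observation span
    worst = len(trace)
    for t in trace:                       # minimizing windows open just after an event
        if t + dt <= trace[-1]:
            worst = min(worst, bisect_right(trace, t + dt) - bisect_right(trace, t))
    return worst

# A strictly periodic stream with period 10: any interval of length 25
# contains at most 3 and at least 2 events.
trace = [0, 10, 20, 30, 40, 50]
print(eta_plus_trace(trace, 25), eta_minus_trace(trace, 25))   # -> 3 2
```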
Correspondingly, an event model can also be specified using the functions δ−(n) and δ+(n), which represent the minimum and maximum distances between any n events in the stream. This representation is more useful for latency considerations, while the η-functions better express the resource load. Each can be derived from the other (as they are "pseudo-inverse," as defined in [5]). Different parameterized event models have been developed to efficiently describe the timing of events in the system [6,14].

One popular and computationally efficient abstraction for representing event streams is provided by the so-called standard event models [33], as visualized in Figure 3.2. Standard event models capture the key properties of event streams using three parameters: the activation period, P; the activation jitter, J; and the minimum distance, d_min. Periodic event models have one parameter, P, stating that each event arrives periodically at exactly every P time units. This simple model can be extended with the notion of jitter, leading to periodic-with-jitter event models, which are described by two parameters, namely, P and J. Events generally occur periodically, yet they can jitter around their exact position within a jitter interval of size J. If the jitter value is larger than the period, then two or more events can occur simultaneously, leading to bursts. To describe bursty event models, periodic-with-jitter event models can be extended with the parameter d_min, capturing the minimum distance between the occurrences of any two events.

FIGURE 3.2: Standard event models. The three panels plot η+ and η− over the interval size Δt for a periodic, a periodic-with-jitter, and a periodic-with-burst event stream.

3.2.3 Local Component Analysis

Based on the underlying resource-sharing strategy, as well as stream representations of the incoming workload modeled through the activating event models, local component analyses systematically derive worst-case scenarios to calculate worst-case (and sometimes also best-case) task response times (WCRT and BCRT), that is, the time between task activation and task completion, for all tasks sharing the same component (i.e., the processor). Thereby, local component analyses guarantee that all observable response times fall into the calculated [best-case, worst-case] interval. These analyses are therefore considered conservative.

Note that different approaches use different models of computation to perform local component analyses. For instance, SymTA/S [14,43] is based on the algebraic solution of response time formulas using the sliding-window technique proposed by, for example, Lehoczky [23], whereas the real-time calculus utilizes arrival curves and service curves to characterize the workload and processing capabilities of components and to determine their real-time behavior [6]. These concepts are based on the network calculus; for details please refer to [5].

Additionally, local component analyses determine the communication behavior at the outputs of the analyzed tasks by considering the effects of scheduling. The basic model assumes that tasks produce output events at the end of each execution. Like the input timing behavior, the output event timing behavior can also be captured by event models. The output event models can then be derived for every task, based on the local response time analysis.
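Before turning to the derivation of output event models in more detail, the following sketch makes the standard event models of Figure 3.2 concrete by turning the parameters P, J, and d_min into the event-arrival and distance functions discussed above, using the closed-form bounds commonly given for this model in the compositional analysis literature. Exact boundary conventions (open versus closed intervals) vary between publications, so treat these formulas as one plausible formulation rather than the chapter's precise definitions.

```python
import math

def eta_plus(dt, P, J=0.0, d_min=0.0):
    """Upper bound on the events arriving in any interval of size dt
    for a periodic / periodic-with-jitter / periodic-with-burst stream."""
    if dt <= 0:
        return 0
    bound = math.ceil((dt + J) / P)
    if d_min > 0:
        bound = min(bound, math.ceil(dt / d_min))  # burst limited by minimum distance
    return bound

def eta_minus(dt, P, J=0.0):
    """Lower bound on the events arriving in any interval of size dt."""
    return 0 if dt <= J else math.floor((dt - J) / P)

def delta_minus(n, P, J=0.0, d_min=0.0):
    """Minimum time spanned by any n consecutive events (0 for n < 2)."""
    return max((n - 1) * d_min, (n - 1) * P - J, 0.0) if n >= 2 else 0.0

def delta_plus(n, P, J=0.0):
    """Maximum time spanned by any n consecutive events (0 for n < 2)."""
    return (n - 1) * P + J if n >= 2 else 0.0

# Periodic with burst: P = 10, J = 25, d_min = 1 (jitter larger than the period).
print(eta_plus(10, P=10, J=25, d_min=1))   # -> 4: up to four events in 10 time units
print(eta_minus(10, P=10, J=25))           # -> 0: possibly no event at all
print(delta_minus(4, P=10, J=25, d_min=1)) # -> 5: four events may span as little as 5
```

Up to the mentioned boundary conventions, δ−(n) is simply the smallest Δt for which η+(Δt) reaches n, reflecting the pseudo-inverse relationship noted above.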
The standard event models used by SymTA/S, for instance, allow very simple rules for obtaining output event models during the local component analysis. Note that in the simplest case (i.e., if tasks produce exactly one output event for each activating event) the output event model period equals the activation period. A discussion of how output event model periods are determined for more complex semantics (when considering rate transitions) can be found in [19]. The output event model jitter, J_out, is calculated by adding the response time jitter, that is, the difference between the maximum and minimum response times, R_max − R_min, to the activating event model jitter, J_in [33]:

    J_out = J_in + (R_max − R_min)    (3.1)

The output event model calculation can also be performed for general event models that are specified solely by the upper and lower event-arrival functions. This method will be applied in Section 3.4 to hierarchical event models (HEMs). Recently, more exact output jitter calculation algorithms were proposed for the local component analysis based on standard event models [15] and general event models [43]. These approaches exploit the fact that the response time of a task activation is correlated with the timing of preceding events: the task activation arriving with worst-case jitter does not necessarily experience the worst-case response time.

3.2.4 Compositional System-Level Analysis Loop

On the highest level of the timing hierarchy, the compositional system-level analysis [6,14] derives the system's timing properties from the lower-level results. For this, the local component analysis (as explained in Section 3.2.3) is alternated with the propagation of output event models. The basic idea is visualized on the right-hand side of Figure 3.3. (The shared-resource analysis depicted on the left-hand side will be explained in Section 3.3.)

In each global iteration of the compositional system-level analysis, input event model assumptions are used to perform local scheduling analyses for all components. From this, their response times and output event models are derived as described above. Afterward, the calculated output event models are propagated to the connected components, where they are used as activating input event models for the subsequent global iteration. Obviously, this iterative analysis represents a fixed-point problem. If all calculated output event models remain unmodified after an iteration, convergence is reached and the last calculated task response times are valid [20,34].

To successfully apply the compositional system-level analysis, the input event models of all components need to be known or must be computable by the local component analysis. Obviously, for systems containing feedback between two or more components, this is not the case, and thus the system-level analysis cannot be performed without additional measures. The concrete strategies to overcome this issue depend on the component types and their input event models. One possibility is the so-called starting-point generation of SymTA/S [33].

FIGURE 3.3: MPSoC performance analysis loop. On the right-hand side, local scheduling analysis and the derivation of output event models are alternated until convergence or non-schedulability, starting from the input event models provided by the environment model; the left-hand side adds the shared-resource access analysis of Section 3.3.
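The global iteration just described can be summarized in a few lines of code. The sketch below is a deliberately simplified rendering of the right-hand side of Figure 3.3 (no shared-resource analysis, no starting-point generation, and no divergence detection beyond an iteration limit). The component interface, a local_analysis method returning response times and output event models, is an assumption made for this illustration, not the SymTA/S API.

```python
def jitter_propagation(j_in, r_min, r_max):
    """Equation 3.1: output jitter = input jitter + response-time jitter."""
    return j_in + (r_max - r_min)

def system_level_analysis(components, environment_models, max_iterations=100):
    """Simplified compositional system-level analysis loop.

    components: mapping name -> component object exposing `input_ports` and a
        local_analysis(input_models) method returning (response_times, output_models).
    environment_models: event models injected by the environment (external inputs).
    Both interfaces are assumptions of this sketch.
    """
    event_models = dict(environment_models)      # current event model assumptions
    for _ in range(max_iterations):
        new_models = dict(environment_models)
        response_times = {}
        for name, comp in components.items():
            # Local scheduling analysis under the current input assumptions.
            inputs = {p: m for p, m in event_models.items() if p in comp.input_ports}
            times, outputs = comp.local_analysis(inputs)
            response_times[name] = times
            # Propagate derived output event models to the connected inputs.
            new_models.update(outputs)
        if new_models == event_models:            # fixed point reached:
            return response_times                 # the last response times are valid
        event_models = new_models
    raise RuntimeError("No convergence within the iteration budget")
```

In a full implementation the loop additionally watches for non-schedulability (e.g., unbounded response times), so that the analysis terminates with a negative verdict instead of iterating indefinitely, as indicated in Figure 3.3.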
3.3 From Distributed Systems to MPSoCs

The described procedure appropriately covers the behavior of hardware and software tasks that consume all relevant data upon activation and produce output data into a single FIFO. This represents the prevailing design practice in many real-time operating systems [24] and parallel programming concepts [21]. However, it is also common, particularly in MPSoCs, to access shared resources such as a memory during the execution of a task. The diverse interactions and correlations between the integrated system components then pose fundamental challenges to the timing predictions. Figure 3.4 shows an example dual-core system in which three tasks access the same shared memory during execution. In this section, the scope of the above approach is extended to cover such behavior.

FIGURE 3.4: Multicore component with three requesting tasks (T1, T2, T3) that access the same shared memory during execution.

For this purpose, the task model is extended to include local execution as well as memory transactions issued during that execution [38]. While the classical task model is represented as an execution time interval (Figure 3.1a), a so-called requesting task performs transactions during its execution, as depicted in Figure 3.1b. The depicted task requires three chunks of data from an external resource. It issues a request and may only continue execution after the transaction has been transmitted over the bus, processed on the remote component, and transmitted back to the requesting source. Thus, whenever a transaction has been issued but is not yet finished, the task is not ready.

The accesses may target logical shared resources (as in [27]), but for the scope of this chapter, we assume that they go to a shared memory. Such memory accesses may be explicit data-fetch operations or implicit cache misses. The timing of such memory accesses, especially cache misses, is extremely difficult to predict accurately. Therefore, an analysis cannot predict the timing of each individual transaction with acceptable effort. Instead, a shared-resource access analysis algorithm will be utilized in Section 3.3.3 that subsumes all transactions of a task execution and the interference by other system activities. Even though this presumes a highly unfavorable and unlikely coincidence of events, this approach is much less conservative than the consideration of individual transactions.

The memory is considered a separate component, and an analysis must be available for it that predicts the accumulated latency of a set of memory requests. For this analysis to work, event models for the number of requests issued by the various processors are required. The outer analysis loop in the procedure of Figure 3.3, as described in Section 3.2, provides these event models for task activations throughout the system. These task-activating event models allow the derivation of bounds on a task's number of requests to the shared resource. These bounds can be used by the shared-resource analysis to derive the transaction latencies. The processor's scheduling analysis finally needs to account for the delays experienced during the task execution by integrating the transaction latencies.
This intermediate shared-resource analysis is shown on the left-hand side of Figure 3.3. As it is based on the current task-activating event model assumptions of the outer analysis, the shared-resource analysis may need to be repeated when the event models are refined.

In order to embed the analysis of requesting tasks into the compositional analysis framework described in Section 3.2, three major building blocks are required:

1. Deriving the number of transactions issued by a task and by all tasks on a processor
2. Deriving the latency experienced by a set of transactions on the shared resource
3. Integrating the transaction latency into the tasks' worst-case response times

These three steps are carried out in the following. We begin with the local investigation of deriving the number of initiated transactions (Section 3.3.1) and the extended worst-case response time analysis (Section 3.3.2). Finally, we turn to the system-level problem of deriving the transaction latency (Section 3.3.3).

3.3.1 Deriving Output Event Models

For each individual task activation, the number of issued requests can be bounded by closely investigating the task's internal control flow. For example, a task may explicitly fetch data each time it executes a for-loop that is repeated several times. By multiplying the maximum number of loop iterations with the amount of data fetched per iteration, a bound on the memory accesses can be derived. Focusing on the worst-case execution time problem, previous research has provided various methods to find the longest path through such a program description with the help of integer linear programming (see [49]). Implicit data fetches such as cache misses are more complicated to capture, as they only occur during runtime and cannot be directly identified.
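As a small illustration of this explicit-fetch bound, the sketch below multiplies a loop bound by the number of fetches per iteration to obtain a per-activation request count, and then combines it with the task's activation event model to bound the requests issued in a time window. The function names, parameters, and the simple window bound are assumptions made for this example; the chapter's actual request-bound derivation and the shared-memory analysis of Sections 3.3.2 and 3.3.3 are more involved.

```python
import math

def max_requests_per_activation(max_loop_iterations, fetches_per_iteration,
                                fetches_outside_loop=0):
    """Bound on the explicit memory requests of a single task activation,
    derived from control-flow data (loop bound times fetches per iteration)."""
    return max_loop_iterations * fetches_per_iteration + fetches_outside_loop

def request_eta_plus(dt, activation_eta_plus, requests_per_activation, wcrt=0.0):
    """Bound on the requests issued in any window of size dt.

    Activations whose execution overlaps the window can contribute requests, so
    the activation event model is evaluated over the window enlarged by the
    task's worst-case response time (wcrt=0 counts only activations arriving
    inside the window)."""
    return activation_eta_plus(dt + wcrt) * requests_per_activation

# Example: a task activated every 5 ms (no jitter) fetches 2 data words in each
# of at most 8 loop iterations, i.e., at most 16 requests per activation.
act_eta_plus = lambda dt: math.ceil(dt / 5.0)        # purely periodic activations
n_req = max_requests_per_activation(8, 2)            # -> 16
print(request_eta_plus(10.0, act_eta_plus, n_req))   # -> 32, for activations arriving
                                                     #    within a 10 ms window
```

A per-processor request model can then be obtained, for instance, by adding up the request bounds of all requesting tasks mapped to that processor, which is the kind of input the shared-resource access analysis in Figure 3.3 consumes.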