Model-Based Design for Embedded Systems - P7

Nicolescu/Model-Based Design for Embedded Systems 67842_C002 Finals Page 36 2009-10-13 36 Model-Based Design for Embedded Systems

a translation of the binary code into the SystemC code generates a fast code compared to an interpreting ISS, as no decoding of instructions is needed and the generated SystemC code can be easily used within a SystemC simulation environment. However, this approach has some major disadvantages. One main drawback is that the same problems that have to be solved in static compilation (binary translation) have to be solved here as well (e.g., the addresses of calculated branch targets have to be determined). Another disadvantage is that the automatically generated code is hard for humans to read.

2.4.1 Back-Annotation of WCET/BCET Values

In this section, we describe our approach in more detail. Figure 2.6 shows an overview of the approach. First, the C source code is translated by an ordinary C (cross-)compiler into binary code for the embedded processor (source processor). After that, our back-annotation tool reads the object file and a description of the source processor. This description covers both the architecture and the instruction set of the processor.

Figure 2.4 shows an example of an architecture description. It lists the resources of the processor (Figure 2.4a), which are used for modeling the pipeline. It also describes the properties of the instruction cache (Figure 2.4b) and the data cache (Figure 2.4c). In addition, such a description can contain information about the branch prediction of the processor.

[Figure 2.3: annotation of the C code for a basic block, next to the architectural model (cache model and branch prediction model). Labels in the annotated code: delay(statically predicted number of cycles); C code corresponding to a basic block; C code corresponding to the cache analysis blocks of the basic block; delay(cycleCalculationICache(tag, iStart, iEnd)); delay(cycleCalculationForConditionalBranch()); consume(getTaskTime()), the function call of the consume function if necessary (e.g., before I/O access).]

FIGURE 2.3 Back-annotation of WCET/BCET values. (From Schnerr, J. et al., High-performance timing simulation of embedded software, in: Proceedings of the 45th Design Automation Conference (DAC), Anaheim, CA, pp. 290–295, June 2008. Copyright: ACM. Used with permission.)

    <architecture>
      <!-- (a) -->
      <resource>FI</resource>
      <resource>DI</resource>
      <resource>EX</resource>
      <resource>WB</resource>
      <!-- (b) -->
      <icache>
        <associativity>2</associativity>
        <cachelinesize>8</cachelinesize>
        <cachesize>4096</cachesize>
        <replacement>lru</replacement>
      </icache>
      <!-- (c) -->
      <dcache>
        <associativity>2</associativity>
        <cachelinesize>8</cachelinesize>
        <cachesize>4096</cachesize>
        <replacement>lru</replacement>
        <writebackpolicy>write-back</writebackpolicy>
      </dcache>
    </architecture>

FIGURE 2.4 Example for a description of the architecture.

Figure 2.5 shows an example of an instruction set description. This description contains information about the structure of the bit image of the instruction code (Figure 2.5c). It also contains the information needed to determine the timing behavior of an instruction, both on its own and when it is executed in context with other instructions (Figure 2.5d). Furthermore, additional information about the instruction can be given for debugging and documentation purposes (Figure 2.5a and b).

Using this description, the object code is decoded and translated into an intermediate representation consisting of a list of objects. Each of these objects represents one intermediate instruction. In the next step, the basic blocks of the program are determined from this list of intermediate instructions.
The result is a list of basic blocks. After that, the execution time of each basic block is calculated statically with respect to the pipeline model provided for the source processor. This calculation step is described in more detail in Section 2.4.3. Subsequently, the correspondences between the C source code and the binary code are identified. Then, the back-annotation takes place: the code is instrumented automatically for cycle generation and dynamic cycle correction. The structure and functionality of this code are described in Section 2.4.2.

Not every impact of the processor architecture on the number of cycles can be predicted statically. Therefore, if dynamic, data-dependent effects (e.g., branch prediction and caches) have to be taken into account, additional code needs to be added. Further details concerning this code are described in Section 2.4.5.

    <processor>
      <defr>a4</defr>
      <defr>b4</defr>
      <defr>c4</defr>
      <defr>d4</defr>
      <def>n2</def>
      ...
      <!-- 0x06000001 addsc.a Ac, Ab, Da, n (RRS) -->
      <instruction>
        <!-- (a) -->
        <syntax>
          addsc.a A<par>c</par>,A<par>b</par>,D<par>a</par>,<par>n</par>
        </syntax>
        <!-- (b) -->
        <description>
          Left-shift the contents of data register Da by the amount
          specified by n, where n can be 0, 1, 2, or 3. Add that value
          to the contents of address register Ab and put the result in
          address register Ac.
        </description>
        <!-- (c) -->
        <image>
          <par>c</par>0110000000<par>n</par><par>b</par><par>a</par>00000001
        </image>
        <!-- (d) -->
        <uses>FI 1</uses>
        <uses>DI 1</uses>
        <uses>EX 1</uses>
        <uses>WB 1</uses>
      </instruction>
      ...
    </processor>

FIGURE 2.5 Example for a description of an instruction.

During back-annotation, the C program is transformed into a cycle-accurate SystemC program that can be compiled and executed on the processor of the simulation host (target processor).
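The step that builds the list of basic blocks from the list of intermediate instructions can be illustrated with the classic leader rule (entry instruction, branch targets, and instructions following a branch start a new block). The following is only a minimal sketch under assumptions: the record type IrInstr, its fields, and the function find_leaders are hypothetical names and are not taken from the back-annotation tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical intermediate instruction: one entry per decoded instruction. */
typedef struct {
    bool is_branch; /* instruction may change the control flow          */
    int  target;    /* list index of the branch target, -1 if none or unknown */
} IrInstr;

/* Marks the first instruction (leader) of every basic block in leader[]
   and returns the number of basic blocks. leader[] must hold n entries. */
size_t find_leaders(const IrInstr *code, size_t n, bool *leader)
{
    assert(n == 0 || (code != NULL && leader != NULL));

    for (size_t i = 0; i < n; i++)
        leader[i] = false;
    if (n > 0)
        leader[0] = true;                      /* rule 1: program entry */

    for (size_t i = 0; i < n; i++) {
        if (!code[i].is_branch)
            continue;
        if (code[i].target >= 0 && (size_t)code[i].target < n)
            leader[code[i].target] = true;     /* rule 2: branch target */
        if (i + 1 < n)
            leader[i + 1] = true;              /* rule 3: instruction after a branch */
    }

    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (leader[i])
            count++;
    return count;
}
```

For a list of five instructions in which instruction 1 branches to instruction 4 and instruction 3 branches to instruction 0, the sketch marks instructions 0, 2, and 4 as leaders, which gives three basic blocks. Calculated branch targets, which the text names as a problem of binary translation, are not handled here; they are simply represented by a target of -1.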
One advantage of this approach is the fast execution of the annotated code, as the C source code does not need major changes for back-annotation. Moreover, the generated SystemC code can be easily used within a SystemC simulation environment. The difficulty of this approach is finding the parts of the C source code that correspond to the binary code if the compiler optimizes or changes the structure of the binary code too much. If this happens, recompilation techniques [4] have to be used to find the correspondences.

2.4.2 Annotation of SystemC Code

The left-hand side of Figure 2.3 shows the annotation required for a piece of C code that corresponds to a basic block. The right-hand side of this figure shows the cache model and the branch prediction model that are used during runtime.

[Figure 2.6: flow diagram of the back-annotation tool. Inputs: C source code, binary code produced by the C compiler, and the processor description. Analysis of the binary code: construction of intermediate representation, building of basic blocks, static cycle calculation. Analysis of the source code: find correspondences between C source code and binary code. Back-annotation: insertion of cycle generation code, insertion of dynamic correction code. Output: annotated SystemC program.]

FIGURE 2.6 General principle for a basic block annotation. (Copyright: ACM. Used with permission.)

As described in further detail in Section 2.4.7, a function delay is used for accumulating the execution time of an annotated basic block during simulation. At the beginning of the annotated basic block code, the annotation tool adds a call of the delay function whose parameter is the statically determined number of cycles this basic block would use on the source processor. How this number is calculated is described in more detail in Section 2.4.3.
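To make the role of the delay function concrete, the following sketch shows one possible shape of an annotated basic block in plain C. It is an illustration under assumptions, not the code generated by the tool: the function names delay, consume, and getTaskTime are taken from Figure 2.3, but their bodies, the counters pending_cycles and simulated_time, the helper simulated_cycles, and the cycle count 7 are hypothetical. In a SystemC environment, consume would presumably advance the simulation time with a wait instead of incrementing a counter.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t pending_cycles = 0; /* cycles accumulated since the last consume() */
static uint64_t simulated_time = 0; /* stand-in for the SystemC simulation time    */

/* Accumulates cycles without advancing the simulated time. */
void delay(uint64_t cycles) { pending_cycles += cycles; }

/* Returns the cycles accumulated so far (name as in Figure 2.3). */
uint64_t getTaskTime(void) { return pending_cycles; }

/* Generates the given number of accumulated cycles (assumed behavior). */
void consume(uint64_t cycles)
{
    assert(cycles <= pending_cycles);
    simulated_time += cycles;
    pending_cycles -= cycles;
}

/* Hypothetical accessor used to inspect the simulated time. */
uint64_t simulated_cycles(void) { return simulated_time; }

/* One annotated basic block: static cycles first, then the original code. */
int annotated_block(int a, int b)
{
    delay(7);                /* hypothetical statically predicted cycle count */
    int c = a + b;           /* original C code of the basic block            */
    consume(getTaskTime());  /* cycle generation at the end of the block      */
    return c;
}
```

Calling annotated_block(2, 3) returns 5 and advances the simulated time by the 7 cycles assumed for this block. Keeping accumulation (delay) separate from cycle generation (consume) is what allows the generated cycles to be produced only where necessary.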
In modern processor architectures, the impact of the processor architecture on the number of executed cycles cannot be completely predicted statically. In particular, the branch prediction and the caches of a processor have a significant impact on the number of cycles used. Therefore, the statically determined number of cycles has to be corrected dynamically. The partitioning of the basic block for the calculation of the additional cycles of instruction cache misses, as shown in Figure 2.3, is explained in Section 2.4.5.

If there is a conditional branch at the end of a basic block, branch prediction has to be considered and possible correction cycles have to be added. This is described in more detail in Section 2.4.5.

As shown in Figure 2.3, the back-annotation tool adds a call to the consume function, which performs cycle generation, at the end of each basic block code. If necessary, this call generates the number of cycles this basic block would need on the source processor. How the consume function works is described in Section 2.4.7.

To balance execution speed against accuracy, the annotation tool can be parameterized with one of several accuracy levels for the generated code. The first and fastest level is a purely static prediction. The second level additionally models the branch prediction. The third level also takes instruction caches into account dynamically. The cycle calculation at these different levels is discussed in more detail in the following sections.

2.4.3 Static Cycle Calculation of a Basic Block

In modern architectures, pipeline effects, superscalarity, and caches have an important impact on the execution time.
Because of this, calculating the execution time of a basic block by summing the execution or latency times of its individual instructions is very inaccurate. Therefore, the incorporation of a pipeline model per basic block becomes necessary [21]. This model helps to statically predict pipeline effects and the effects of superscalarity.

To generate this model, information about the instruction set and the pipelines of the processor is needed. This information is contained in the processor description that is used by the annotation tool. Based on it, the tool uses a model of the pipeline to determine which instructions of the basic block will be executed in parallel on a superscalar processor and which combinations of instructions in the basic block will cause pipeline stalls. Details are described in the next section.

With the information gained by modeling the basic block, a prediction is carried out. This prediction determines the number of cycles the basic block would have needed on the source processor. Section 2.4.5 shows how this prediction is improved during runtime and how a cache model is included.

2.4.4 Modeling of Pipeline for a Basic Block

As previously mentioned, the processor description contains information about the resources of the processor and about the resources a certain instruction uses. This information is used to build a resource usage model that specifies microarchitectural details of the processor.

For this model, it is assumed that all units in the processor, such as functional units, pipeline stages, registers, and ports, form a set of resources. These resources can be allocated or released by every instruction that is executed.
This means that the resource usage model is based on the assumption that whenever an instruction is executed, it allocates a set of resources and carries out an action. As the execution proceeds, the allocated resources and the actions carried out change. If two instructions wait for the same resource, the conflict is resolved by allocating the resource to the instruction that entered the pipeline earlier. This model is powerful enough to describe pipelines, superscalar execution, and other microarchitectures.

2.4.4.1 Modeling with the Help of Reservation Tables

The timing information of every program construct can be described with a reservation table. Originally, reservation tables were proposed to describe and analyze the activities in a pipeline [32]. Traditionally, reservation tables were used to detect conflicts for the scheduling of instructions [25]. In a reservation table, the vertical dimension represents the pipeline stages and the horizontal dimension represents the time. Figure 2.7 shows an example of a basic block and the corresponding reservation table. In the figure, every entry in the reservation table shows that the corresponding pipeline stage is used in the particular time slot. The entry consists of the number of the instruction that uses the resource. The timing interdependencies between the instructions of a basic block are analyzed using the composition of the basic block. In the reservation table, not only conflicts that occur because of the different pipeline stages, but also data dependencies between the instructions can be considered.

[Figure 2.7: reservation table for a basic block. Vertical axis: resources (stages FI, DI, EX, and WB of the int-pipeline and of the ls-pipeline). Horizontal axis: time in clock cycles. Basic block:
    1  add d1,d2,d3
    2  add d4,d5,d6
    3  ld  a3,[a2]0
    4  ld  a4,[a5]0
    5  sub d7,d8,d9]

FIGURE 2.7 Example of a reservation table for a basic block.
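The way a reservation table is filled can be sketched in a few lines of C. The sketch below is a simplification under stated assumptions: it models a single in-order pipeline with four resources instead of the two pipelines of Figure 2.7, an instruction passes through the resources in a fixed order, and a conflict delays the start of the later instruction. The type ResTable and the functions rt_init and rt_place are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#define N_RES 4   /* resources, e.g., FI, DI, EX, WB */
#define MAX_T 64  /* table width in clock cycles     */

typedef struct {
    int cell[N_RES][MAX_T]; /* 0 = free, otherwise number of the instruction */
    int last_start;         /* start cycle of the previously placed instruction */
    int length;             /* number of clock cycles used so far */
} ResTable;

void rt_init(ResTable *t) { memset(t, 0, sizeof *t); }

/* Places one instruction. uses[r] is the number of cycles the instruction
   holds resource r; resources are passed through in the order 0..N_RES-1.
   Returns the start cycle, or -1 if the table is too small. */
int rt_place(ResTable *t, int instr_no, const int uses[N_RES])
{
    assert(instr_no != 0); /* 0 is reserved for free slots */

    for (int start = t->last_start; start < MAX_T; start++) {
        int time = start, ok = 1;

        /* Check that every slot the instruction needs is still free. */
        for (int r = 0; r < N_RES && ok; r++)
            for (int k = 0; k < uses[r] && ok; k++, time++)
                if (time >= MAX_T || t->cell[r][time] != 0)
                    ok = 0;
        if (!ok)
            continue; /* an earlier instruction keeps the resource */

        /* Allocate the slots. */
        time = start;
        for (int r = 0; r < N_RES; r++)
            for (int k = 0; k < uses[r]; k++, time++)
                t->cell[r][time] = instr_no;

        t->last_start = start;
        if (time > t->length)
            t->length = time;
        return start;
    }
    return -1;
}
```

With three instructions that use every resource for one cycle, the instructions start in cycles 0, 1, and 2, and the table is six cycles long. If the first instruction holds the third resource for two cycles, the second instruction cannot start before cycle 2, because the resource stays with the instruction that entered the pipeline earlier.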
2.4.4.1.1 Structural Hazards

In the following, the modeling of the instructions in a pipeline using reservation tables is described [12,32]. To determine how long after the start of an instruction the execution of a new instruction can start without causing a collision, these reservation tables have to be analyzed. One way to determine whether two instructions can be started K time units apart is to overlap the reservation table with itself using an offset of K time units. If a used resource is overlapped by another, there will be a collision in this segment and K is a forbidden latency. Otherwise, no collision will occur and K is an allowed latency.

2.4.4.1.2 Data Hazards

The time delay caused by data hazards is modeled in the same way as the delay caused by structural hazards. As the result of pipelining an instruction sequence should be the same as the result of executing the instructions sequentially, register accesses should occur in the same order as in the program. This restriction is comparable with the usage of pipeline stages in program order and can therefore be modeled by an extension of the reservation table.

2.4.4.1.3 Control Hazards

Some processors (like the MIPS R3000 [12]) use delayed branches to avoid the waiting cycle that would otherwise occur because of the control hazard. This can be modeled by adding a delay slot to the basic block with the branch instruction. Such modeling is possible because the instruction in the delay slot is executed regardless of the result of the branch instruction.

2.4.4.2 Calculation of Pipeline Overlapping

In order to model the impact of architectural components such as pipelines, the state of these components has to be known when the basic block is entered.
If the state is known, it is possible to find out the gain that results from the use of this component. If it is known that, in the control-flow graph of the program, node e_i is the predecessor of node e_j, and the pipeline state after the execution of node e_i is also known, then the information about this state can be used to calculate the execution time of node e_j. This means that the gain resulting from the fact that node e_i is executed before node e_j can be calculated.

The gain is calculated for every pair of succeeding basic blocks using the pipeline overlapping. This pipeline overlapping is determined using reservation tables [29]. Appending the reservation table of a basic block to the reservation table of another basic block works the same way as appending an instruction to this reservation table. Therefore, it is sufficient to consider only the first and the last columns. The number of columns that have to be considered does not have to be larger than the maximum number of cycles for which a single instruction can stay in the pipeline [21].

2.4.5 Dynamic Correction of Cycle Prediction

As previously described, the actual cycle count a processor needs for executing a sequence of instructions cannot be predicted correctly in all cases. This is the case if, for example, a conditional branch at the end of a basic block produces a pipeline flush, or if additional delays occur because of misses in the instruction cache. The combination of static analysis and dynamic execution provides a well-suited solution for this problem, since statically unpredictable effects of branch and cache behavior can be determined during execution. This is done by inserting appropriate function calls into the translated basic blocks.
These calls interact with the architectural model in order to determine the additional number of cycles caused by mispredicted branch and cache behavior. At the end of each basic block, the generation of the previously calculated cycles (static cycles plus correction cycles) can occur (Figure 2.3).

2.4.5.1 Branch Prediction

Conditional branches have different cycle times depending on four cases that result from combining predicted and mispredicted branches with taken and non-taken branches. A correctly predicted branch needs fewer cycles for execution than a mispredicted one. Furthermore, additional cycles can be needed if a correctly predicted branch is taken, as the branch target has to be calculated and loaded into the program counter. This problem is solved by implementing a model of the branch prediction and by comparing the predicted branch behavior with the executed branch behavior. If dynamic branch prediction is used, a model of the underlying state machine is implemented and its results are compared with the executed branch behavior. The cycle count of each possible case is calculated and added to the cumulative cycle count before the next basic block is entered.

2.4.5.2 Instruction Cache

Figure 2.3 shows that for the simulation of the instruction cache, every basic block of the translated program has to be divided into several cache analysis blocks. Each cache analysis block extends until the tag changes or the basic block ends. After that, a function call to the cache handling model is added. This code uses a cache model to find out possible cache hits or misses.

The cache simulation is explained in more detail in the next few paragraphs. This explanation starts with a description of the cache model.
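As an illustration of the branch prediction correction described in Section 2.4.5.1, the following sketch models a dynamic predictor as a two-bit saturating counter and returns correction cycles for the four cases. This is a sketch under assumptions: the text does not state which state machine the source processor uses, the cycle values in the enumeration are invented placeholders, and the names BranchPredictor and branch_correction_cycles are hypothetical. Unlike cycleCalculationForConditionalBranch() in Figure 2.3, the sketch takes the predictor state and the executed branch direction as explicit parameters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical correction cycles for the four cases; real values would
   come from the processor description. */
enum {
    CYC_PRED_NOT_TAKEN    = 0, /* correctly predicted, not taken */
    CYC_PRED_TAKEN        = 1, /* correctly predicted, taken     */
    CYC_MISPRED_NOT_TAKEN = 2, /* mispredicted, not taken        */
    CYC_MISPRED_TAKEN     = 3  /* mispredicted, taken            */
};

/* Two-bit saturating counter: states 0 and 1 predict "not taken",
   states 2 and 3 predict "taken". */
typedef struct {
    unsigned state;
} BranchPredictor;

/* Compares the prediction with the executed branch behavior, updates the
   state machine, and returns the correction cycles for this branch. */
int branch_correction_cycles(BranchPredictor *bp, bool taken)
{
    assert(bp->state <= 3);

    bool predicted_taken = bp->state >= 2;
    bool correct = (predicted_taken == taken);

    if (taken) {
        if (bp->state < 3)
            bp->state++;
    } else {
        if (bp->state > 0)
            bp->state--;
    }

    if (correct)
        return taken ? CYC_PRED_TAKEN : CYC_PRED_NOT_TAKEN;
    return taken ? CYC_MISPRED_TAKEN : CYC_MISPRED_NOT_TAKEN;
}
```

Starting from state 0, a branch that is taken three times in a row is mispredicted twice and predicted correctly the third time. In the annotated code, the returned value would be passed to delay so that it is added to the cumulative cycle count before the next basic block is entered.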
[Figure 2.8: the C program (C_stmnt 1 to C_stmnt 4), the corresponding binary code (asm_inst 1 to asm_inst n, grouped as asm_inst 1 ... asm_inst l, asm_inst l+1 ... asm_inst 2l, asm_inst 2l+1 ...), and the cache model with the fields tag, v, lru, and data, which is accessed by cycleCalcICache.]

FIGURE 2.8 Correspondence C—assembler—cache line. (Copyright: ACM. Used with permission.)

2.4.5.3 Cache Model

The cache model, shown on the right-hand side of Figure 2.8, contains the data space that is used for the administration of the cache. In this space, the valid bit, the cache tag, and the least recently used (lru) information (used by the replacement strategy) are stored for each cache set during runtime. The number of cache tags and the corresponding number of valid bits that are needed depend on the associativity of the cache (e.g., for a two-way set-associative cache, two sets of tags and valid bits are needed).

2.4.5.4 Cache Analysis Blocks

In the middle of Figure 2.8, the C source code that corresponds to a basic block is divided into several smaller blocks, the so-called cache analysis blocks. These blocks are needed to account for the effects of instruction caches. Each of these blocks contains the part of a basic block that fits into a single cache line. As every machine language instruction in such a cache analysis block has the same tag and the same cache index, the addresses of the instructions can be used to determine how a basic block has to be divided into cache analysis blocks. This is because each address consists of the tag information and the cache index. The cache index information (iStart to iEnd in Figure 2.3) is used to determine at which cache position the instruction with this address is cached. The tag information is used to determine which address was cached, as there can be multiple addresses with the same cache index.
Therefore, a changed cache tag can be easily determined during the traversal of the binary code with respect to the cache parameters. The block offset information is not needed for the cache simulation, as no real caching of data takes place. After the tag has changed or at the end of a basic block, a function call that handles the simulated cache and calculates the additional cycles of cache misses is added to this block. More details about this function are given in the next section.

    int cycleCalculationICache( tag, iStart, iEnd )
    {
      for index = iStart to iEnd
      {
        if tag is found in index and valid bit is set then
        {
          // cache hit
          renew lru information
          return 0
        }
        else
        {
          // cache miss
          use lru information to determine tag to overwrite
          write new tag
          set valid bit of written tag
          renew lru information
          return additional cycles needed for cache miss
        }
      }
    }

Listing 2.1 Function for cache cycle correction.

2.4.5.5 Cycle Calculation Code

As previously mentioned, each cache analysis block is characterized by a combination of tag and cache-set index information. At the end of each basic block, a call to a function is included. During runtime, this function determines whether the different cache analysis blocks that the basic block consists of are in the simulated cache or not. In this way, cache misses are detected.

The function is shown in Listing 2.1. It has the tag and the range of cache-set indices (iStart to iEnd) as parameters. To find out whether there is a cache hit or a cache miss, the function checks whether the tag of each cache analysis block can be found in the specified set and whether the valid bit for the found tag is set. If the tag can be found and the valid bit is set, the block is already cached (cache hit) and no additional cycles are needed. Only the lru information has to be renewed.
In all other cases, the lru information has to be used to determine which tag has to be overwritten. After that, the new tag has to be written in place of the old one, and the valid bit for this tag has to be set. The lru information has to be renewed as well. In the final step, the additional cycles are returned and added to the cycle correction counter.
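The behavior described for Listing 2.1 can be made concrete with a small C sketch of a two-way set-associative instruction cache model. Several points are assumptions and not taken from the text: the cache parameters of Figure 2.4 are interpreted as bytes, the miss penalty MISS_CYCLES is an invented value, the cache state is passed as an explicit first parameter, and the sketch accumulates the miss cycles over all set indices in the range iStart to iEnd before returning, which is one possible reading of the listing. The helpers addr_index and addr_tag are hypothetical and show how an instruction address splits into cache index and tag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WAYS        2     /* associativity (Figure 2.4)                  */
#define LINE_SIZE   8     /* cache line size, assumed to be in bytes     */
#define CACHE_SIZE  4096  /* cache size, assumed to be in bytes          */
#define SETS        (CACHE_SIZE / (LINE_SIZE * WAYS))
#define MISS_CYCLES 10    /* hypothetical penalty per cache miss         */

typedef struct {
    bool     valid[SETS][WAYS];
    uint32_t tag[SETS][WAYS];
    int      lru[SETS];   /* way to overwrite next; enough for two ways */
} ICache;

/* Split an instruction address into cache-set index and tag. */
uint32_t addr_index(uint32_t addr) { return (addr / LINE_SIZE) % SETS; }
uint32_t addr_tag(uint32_t addr)   { return addr / (LINE_SIZE * SETS); }

/* Returns the additional cycles caused by misses for the sets
   iStart..iEnd, all of which are accessed with the same tag. */
int cycleCalculationICache(ICache *c, uint32_t tag,
                           uint32_t iStart, uint32_t iEnd)
{
    assert(iStart <= iEnd && iEnd < SETS);

    int cycles = 0;
    for (uint32_t index = iStart; index <= iEnd; index++) {
        int way = -1;
        for (int w = 0; w < WAYS; w++)
            if (c->valid[index][w] && c->tag[index][w] == tag)
                way = w;                   /* cache hit */

        if (way < 0) {                     /* cache miss */
            way = c->lru[index];           /* tag to overwrite */
            c->tag[index][way] = tag;      /* write new tag */
            c->valid[index][way] = true;   /* set valid bit */
            cycles += MISS_CYCLES;
        }
        c->lru[index] = 1 - way;           /* renew lru information */
    }
    return cycles;
}
```

A cache object has to be zero-initialized before use (for example, ICache c = {0};). In the annotated code, the return value would be handed to delay, as in delay(cycleCalculationICache(tag, iStart, iEnd)) in Figure 2.3. The block offset bits of the address are ignored, which matches the remark that no real caching of data takes place.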
