ARM System Developer’s Guide, part 9 (doc)

Chapter 15
The Future of the Architecture

John Rayfield

15.1 Advanced DSP and SIMD Support in ARMv6
    15.1.1 SIMD Arithmetic Operations
    15.1.2 Packing Instructions
    15.1.3 Complex Arithmetic Support
    15.1.4 Saturation Instructions
    15.1.5 Sum of Absolute Differences Instructions
    15.1.6 Dual 16-Bit Multiply Instructions
    15.1.7 Most Significant Word Multiplies
    15.1.8 Cryptographic Multiplication Extensions
15.2 System and Multiprocessor Support Additions to ARMv6
    15.2.1 Mixed-Endianness Support
    15.2.2 Exception Processing
    15.2.3 Multiprocessing Synchronization Primitives
15.3 ARMv6 Implementations
15.4 Future Technologies beyond ARMv6
    15.4.1 TrustZone
    15.4.2 Thumb-2
15.5 Summary

In October 1999, ARM began to consider the future direction of the architecture that would eventually become ARMv6, first implemented in a new product called ARM1136J-S. By this time, ARM already had designs for many different applications, and the future requirements of each of those designs needed to be evaluated, as well as the new application areas for which ARM would be used in the future.

As system-on-chip designs have become more sophisticated, ARM processors have become the central processors in systems with multiple processing elements and subsystems. In particular, the portable and mobile computing markets were introducing new software and performance challenges for ARM. Areas that needed addressing were digital signal processing (DSP) and video performance for portable devices, interworking mixed-endian systems such as TCP/IP, and efficient synchronization in multiprocessing environments. The challenge for ARM was to address all of these market requirements and yet maintain its competitive advantage in computational efficiency (computing power per mW) as the best in the industry.

This chapter describes the components within ARMv6 introduced by ARM to address these market requirements, including enhanced DSP support and support for a multiprocessing environment.
The chapter also introduces the first high-performance ARMv6 implementations and, in addition to the ARMv6 technologies, one of ARM’s latest technologies, TrustZone.

15.1 Advanced DSP and SIMD Support in ARMv6

Early in the ARMv6 project, ARM considered how to improve the DSP and media processing capabilities of the architecture beyond the ARMv5E extensions described in Section 3.7. This work was carried out very closely with the ARM1136J-S engineering team, which was in the early stages of developing the microarchitecture for the product.

SIMD (Single Instruction Multiple Data) is a popular technique used to garner considerable data parallelism and is particularly effective in very math-intensive routines that are commonly used in DSP, video, and graphics processing algorithms. SIMD is attractive for high code density and low power since the number of instructions executed (and hence memory system accesses) is kept low. The price for this efficiency is the reduced flexibility of having to compute things arranged in certain blocked data patterns; this, however, works very well in many image and signal processing algorithms.

Using the standard ARM design philosophy of computational efficiency with very low power, ARM came up with a simple and elegant way of slicing up the existing ARM 32-bit datapath into four 8-bit or two 16-bit slices. Unlike many existing SIMD architectures that add separate datapaths for the SIMD operations, this method allows the SIMD to be added to the base ARM architecture with very little extra hardware cost.

The ARMv6 architecture includes this “lightweight” SIMD approach that costs virtually nothing in terms of extra complexity (gate count) and therefore power. At the same time the new instructions can improve the processing throughput of some algorithms by up to two times for 16-bit data or four times for 8-bit data.
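The slicing idea can be illustrated with a small C reference model. This is only a sketch, not ARM's definition of any instruction: the function names `uadd8_model` and `uadd16_model` are hypothetical, and the model covers just the wrapping unsigned adds (the behavior of UADD8 and UADD16 listed in Section 15.1.1), ignoring condition codes and flags.

```c
#include <stdint.h>

/* Hypothetical reference model: treat a 32-bit word as four independent
   unsigned 8-bit lanes and add lane by lane. Each lane wraps modulo 256,
   and no carry crosses from one lane into the next. */
uint32_t uadd8_model(uint32_t rn, uint32_t rm)
{
    uint32_t rd = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (rn >> (8 * lane)) & 0xFFu;
        uint32_t b = (rm >> (8 * lane)) & 0xFFu;
        rd |= ((a + b) & 0xFFu) << (8 * lane);
    }
    return rd;
}

/* Hypothetical reference model: the same word treated as two unsigned
   16-bit lanes, each wrapping modulo 65536. */
uint32_t uadd16_model(uint32_t rn, uint32_t rm)
{
    uint32_t lo = ((rn & 0xFFFFu) + (rm & 0xFFFFu)) & 0xFFFFu;
    uint32_t hi = ((rn >> 16) + (rm >> 16)) & 0xFFFFu;
    return (hi << 16) | lo;
}
```

For instance, `uadd8_model(0x01FF7F10, 0x01010101)` yields `0x02008011`: the lane holding 0xFF wraps to 0x00 without disturbing its neighbor, which is exactly what an ordinary 32-bit ADD would not guarantee.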
In common with most operations in the ARM instruction set architecture, all of these new instructions are executed conditionally, as described in Section 2.2.6. You can find a full description of all ARMv6 instructions in the instruction set tables of Appendix A.

15.1.1 SIMD Arithmetic Operations

Table 15.1 shows a summary of the 8-bit SIMD operations. Each byte result is formed from the arithmetic operation on each of the corresponding byte slices through the source operands. The results of these 8-bit operations may require that up to 9 bits be represented, which either causes a wraparound or a saturation to take place, depending on the particular instruction used.

In addition to the 8-bit SIMD operations, there is an extensive range of dual 16-bit operations, shown in Table 15.2. Each halfword (16-bit) result is formed from the arithmetic operation on each of the corresponding 16-bit slices through the source operands. The results may need 17 bits to be stored; in this case they either wrap around or, with the saturating version of the instruction, are saturated to the range of a 16-bit result.

Table 15.1 8-bit SIMD arithmetic operations.

Instruction                     Description
SADD8{<cond>} Rd, Rn, Rm        Signed 8-bit SIMD add
SSUB8{<cond>} Rd, Rn, Rm        Signed 8-bit SIMD subtract
UADD8{<cond>} Rd, Rn, Rm        Unsigned 8-bit SIMD add
USUB8{<cond>} Rd, Rn, Rm        Unsigned 8-bit SIMD subtract
QADD8{<cond>} Rd, Rn, Rm        Signed saturating 8-bit SIMD add
QSUB8{<cond>} Rd, Rn, Rm        Signed saturating 8-bit SIMD subtract
UQADD8{<cond>} Rd, Rn, Rm       Unsigned saturating 8-bit SIMD add
UQSUB8{<cond>} Rd, Rn, Rm       Unsigned saturating 8-bit SIMD subtract

Table 15.2 16-bit SIMD arithmetic operations.

Instruction                     Description
SADD16{<cond>} Rd, Rn, Rm       Signed add of the 16-bit pairs
SSUB16{<cond>} Rd, Rn, Rm       Signed subtract of the 16-bit pairs
UADD16{<cond>} Rd, Rn, Rm       Unsigned add of the 16-bit pairs
USUB16{<cond>} Rd, Rn, Rm       Unsigned subtract of the 16-bit pairs
QADD16{<cond>} Rd, Rn, Rm       Signed saturating add of the 16-bit pairs
QSUB16{<cond>} Rd, Rn, Rm       Signed saturating subtract of the 16-bit pairs
UQADD16{<cond>} Rd, Rn, Rm      Unsigned saturating add of the 16-bit pairs
UQSUB16{<cond>} Rd, Rn, Rm      Unsigned saturating subtract of the 16-bit pairs

Operands for the SIMD instructions are not always found in the correct order within the source registers; to improve the efficiency of dealing with these situations, there are 16-bit SIMD operations that perform swapping of the 16-bit words of one operand register. These operations allow a great deal of flexibility in dealing with halfwords that may be aligned in different ways in memory and are particularly useful when working with 16-bit complex number pairs that are packed into 32-bit registers. There are signed, unsigned, saturating signed, and saturating unsigned versions of these operations, as shown in Table 15.3.

The X in the instruction mnemonic signifies that the two halfwords in Rm are swapped before the operations are applied so that operations like the following take place:

    Rd[15:0]  = Rn[15:0]  - Rm[31:16]
    Rd[31:16] = Rn[31:16] + Rm[15:0]

Table 15.3 16-bit SIMD arithmetic operations with swap.

Instruction                     Description
SADDSUBX{<cond>} Rd, Rn, Rm     Signed upper add, lower subtract, with swap of halfwords in Rm
UADDSUBX{<cond>} Rd, Rn, Rm     Unsigned upper add, lower subtract, with swap of halfwords in Rm
QADDSUBX{<cond>} Rd, Rn, Rm     Signed saturating upper add, lower subtract, with swap of halfwords in Rm
UQADDSUBX{<cond>} Rd, Rn, Rm    Unsigned saturating upper add, lower subtract, with swap of halfwords in Rm
SSUBADDX{<cond>} Rd, Rn, Rm     Signed upper subtract, lower add, with swap of halfwords in Rm
USUBADDX{<cond>} Rd, Rn, Rm     Unsigned upper subtract, lower add, with swap of halfwords in Rm
QSUBADDX{<cond>} Rd, Rn, Rm     Signed saturating upper subtract, lower add, with swap of halfwords in Rm
UQSUBADDX{<cond>} Rd, Rn, Rm    Unsigned saturating upper subtract, lower add, with swap of halfwords in Rm

The addition of the SIMD operations means there is now a need for some way of showing an overflow or a carry from each SIMD slice through the datapath. The cpsr as originally described in Section 2.2.5 is modified by adding four additional flag bits to represent each 8-bit slice of the datapath. The newly modified cpsr register with the GE bits is shown in Figure 15.1 and Table 15.4. The functionality of each GE bit is that of a “greater than or equal” flag for each slice through the datapath.

Operating systems already save the cpsr register on a context switch, so adding these bits to the cpsr has little effect on OS support for the architecture.

In addition to basic arithmetic operations on the SIMD data slices, there is considerable use for operations that allow the picking of individual data elements within the datapath and forming new ensembles of these elements. A select instruction SEL can independently select each 8-bit field from one source register Rn or another source register Rm, depending on the associated GE flag.

[Figure 15.1: cpsr layout for ARMv6, bits 31 down to 0, with fields N, Z, C, V, Q, J, GE[3:0], E, A, I, F, T, mode, and reserved (Res) bits.]

Table 15.4 cpsr fields for ARMv6.

Field     Use
N         Negative flag. Records bit 31 of the result of flag-setting operations.
Z         Zero flag. Records if the result of a flag-setting operation is zero.
C         Carry flag. Records unsigned overflow for addition, not-borrow for subtraction, and is also used by the shifting circuit. See Table A.3.
V         Overflow flag. Records signed overflows for flag-setting operations.
Q         Saturation flag. Certain saturating operations set this flag on saturation. See for example QADD in Appendix A (ARMv5E and above).
J         J = 1 indicates Java execution (must have T = 0). Use the BXJ instruction to change this bit (ARMv5J and above).
Res       These bits are reserved for future expansion. Software should preserve the values in these bits.
GE[3:0]   The SIMD greater-or-equal flags. See SADD in Appendix A (ARMv6).
E         Controls the data endianness. See SETEND in Appendix A (ARMv6).
A         A = 1 disables imprecise data aborts (ARMv6).
I         I = 1 disables IRQ interrupts.
F         F = 1 disables FIQ interrupts.
T         T = 1 indicates Thumb state. T = 0 indicates ARM state. Use the BX or BLX instructions to change this bit (ARMv4T and above).
mode      The current processor mode. See Table B.4.

    SEL Rd, Rn, Rm

    Rd[31:24] = GE[3] ? Rn[31:24] : Rm[31:24]
    Rd[23:16] = GE[2] ? Rn[23:16] : Rm[23:16]
    Rd[15:08] = GE[1] ? Rn[15:08] : Rm[15:08]
    Rd[07:00] = GE[0] ? Rn[07:00] : Rm[07:00]

These instructions, together with the other SIMD operations, can be used very effectively to implement the core of the Viterbi algorithm, which is used extensively for symbol recovery in communication systems. Since the Viterbi algorithm is essentially a statistical maximum likelihood selection algorithm, it is also used in such areas as speech and handwriting recognition engines. The core of Viterbi is an operation that is commonly known as add-compare-select (ACS), and in fact many DSP processors have customized ACS instructions.
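The SEL semantics above, together with the way an unsigned byte subtract sets the GE flags, can be modeled in C to show why the pair acts as a per-lane compare-and-select. This is a sketch under stated assumptions: the names `simd_result`, `usub8_model`, `sel_model`, and `acs_min4` are hypothetical, the GE flags are returned in a struct rather than written to the cpsr, and GE[i] is taken to mean "lane i of the first operand is greater than or equal to lane i of the second" for the unsigned subtract.

```c
#include <stdint.h>

/* Hypothetical container: the lane results plus a 4-bit GE mask. */
typedef struct {
    uint32_t value;
    unsigned ge;
} simd_result;

/* Model of an unsigned 8-bit SIMD subtract. Each lane wraps modulo 256,
   and GE[lane] is set when no borrow occurs (a >= b). */
simd_result usub8_model(uint32_t rn, uint32_t rm)
{
    simd_result r = { 0u, 0u };
    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (rn >> (8 * lane)) & 0xFFu;
        uint32_t b = (rm >> (8 * lane)) & 0xFFu;
        r.value |= ((a - b) & 0xFFu) << (8 * lane);
        if (a >= b)
            r.ge |= 1u << lane;
    }
    return r;
}

/* Model of SEL: per lane, take the byte from rn if GE is set, else from rm. */
uint32_t sel_model(uint32_t rn, uint32_t rm, unsigned ge)
{
    uint32_t rd = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t mask = 0xFFu << (8 * lane);
        rd |= ((ge >> lane) & 1u) ? (rn & mask) : (rm & mask);
    }
    return rd;
}

/* Compare-select step: keep the smaller of two unsigned 8-bit path
   metrics in each of the four lanes. */
uint32_t acs_min4(uint32_t p1, uint32_t p2)
{
    simd_result cmp = usub8_model(p1, p2);   /* GE[i] = (p1[i] >= p2[i]) */
    return sel_model(p2, p1, cmp.ge);        /* GE set: p2 is not larger */
}
```

Note the operand order in the select step: because GE is set when the first metric is at least as large, passing the second metric as the first SEL operand picks the minimum in every lane.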
With its parallel (SIMD) add, subtract (which can be used to compare), and selection instructions, ARMv6 can implement an extremely efficient add-compare-select:

    UADD8   Rp1, Rs1, Rb1    ; path 1 = state 1 + branch 1 (metric update)
    UADD8   Rp2, Rs2, Rb2    ; path 2 = state 2 + branch 2 (metric update)
    USUB8   Rt, Rp1, Rp2     ; compare metrics - setting the SIMD flags
    SEL     Rd, Rp2, Rp1     ; choose best (smallest) metric

This kernel performs the ACS operation on four paths in parallel and takes a total of 4 cycles on the ARM1136J-S. The same sequence coded for the ARMv5TE instruction set must perform each of the operations serially, taking at least 16 cycles. Thus the add-compare-select function is four times faster on ARM1136J-S for 8-bit metrics.

15.1.2 Packing Instructions

The ARMv6 architecture includes a new set of packing instructions, shown in Table 15.5, that are used to construct new 32-bit packed data from pairs of 16-bit values in different source registers. The second operand can be optionally shifted. Packing instructions are particularly useful for pairing 16-bit values so that you can make use of the 16-bit SIMD processing instructions described earlier.

Table 15.5 Packing instructions.

Instruction                                        Description
PKHTB{<cond>} Rd, Rn, Rm {, ASR #<shift_imm>}      Pack the top 16 bits of Rn with the bottom 16 bits of the shifted Rm into the destination Rd
PKHBT{<cond>} Rd, Rn, Rm {, LSL #<shift_imm>}      Pack the top 16 bits of the shifted Rm with the bottom 16 bits of Rn into the destination Rd

15.1.3 Complex Arithmetic Support

Complex arithmetic is commonly used in communication signal processing, and in particular in the implementations of transform algorithms such as the Fast Fourier Transform as described in Chapter 8. Much of the implementation detail examined in that chapter concerns the efficient implementation of the complex multiplication using ARMv4 or ARMv5E instruction sets.
ARMv6 adds new multiply instructions to accelerate complex multiplication, shown in Table 15.6. Both of these operations optionally swap the order of the two 16-bit halves of source operand Rs if you specify the X suffix.

Table 15.6 Instructions to support 16-bit complex multiplication.

Instruction                       Description
SMUAD{X}{<cond>} Rd, Rm, Rs       Dual 16-bit signed multiply and add
SMUSD{X}{<cond>} Rd, Rm, Rs       Dual 16-bit signed multiply and subtract

Example 15.1 In this example Ra and Rb hold complex numbers with 16-bit coefficients, packed with the real part in the lower half of a register and the imaginary part in the upper half. We multiply Ra and Rb to produce a new complex number Rc. The code assumes that the 16-bit values represent Q15 fractions. Here is the code for ARMv6:

    SMUSD   Rt, Ra, Rb            ; real*real-imag*imag at Q30
    SMUADX  Rc, Ra, Rb            ; real*imag+imag*real at Q30
    QADD    Rt, Rt, Rt            ; convert to Q31 & saturate
    QADD    Rc, Rc, Rc            ; convert to Q31 & saturate
    PKHTB   Rc, Rc, Rt, ASR #16   ; pack results

Compare this with an ARMv5TE implementation:

    SMULBB  Rc, Ra, Rb            ; real*real
    SMULTT  Rt, Ra, Rb            ; imag*imag
    QSUB    Rt, Rc, Rt            ; real*real-imag*imag at Q30
    SMULTB  Rc, Ra, Rb            ; imag*real
    SMLABT  Rc, Ra, Rb, Rc        ; + real*imag at Q30
    QADD    Rt, Rt, Rt            ; convert to Q31 & saturate
    QADD    Rc, Rc, Rc            ; convert to Q31 & saturate
    MOV     Rc, Rc, LSR #16
    MOV     Rt, Rt, LSR #16
    ORR     Rt, Rt, Rc, LSL #16   ; pack results

There are 10 cycles for ARMv5E versus 5 cycles for ARMv6. Clearly, for any algorithm doing very intense complex maths, a two times improvement in performance can be gained for the complex multiply. ■

15.1.4 Saturation Instructions

Saturating arithmetic was first addressed with the E extensions that were added to the ARMv5TE architecture, which was introduced with the ARM966E and ARM946E products. ARMv6 takes this further with individual and more flexible saturation instructions that can operate on 32-bit words and 16-bit halfwords.
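Saturation at an arbitrary bit position, the operation these instructions provide, can be sketched in C as a clamp to an n-bit range. The helper names `ssat_model` and `usat_model` are hypothetical, and the accepted ranges of n (1 to 32 for the signed case, 0 to 31 for the unsigned case) are stated here as assumptions of the model. The sketch ignores the optional shift of the source operand and does not model the Q flag.

```c
#include <stdint.h>

/* Hypothetical model of signed saturation: clamp x to the range of an
   n-bit two's complement value, -2^(n-1) .. 2^(n-1) - 1.
   Precondition (assumed): 1 <= n <= 32. */
int32_t ssat_model(int32_t x, unsigned n)
{
    int64_t max = ((int64_t)1 << (n - 1)) - 1;
    int64_t min = -max - 1;
    if (x > max)
        return (int32_t)max;
    if (x < min)
        return (int32_t)min;
    return x;
}

/* Hypothetical model of unsigned saturation: clamp a signed input to
   the range 0 .. 2^n - 1.
   Precondition (assumed): 0 <= n <= 31. */
uint32_t usat_model(int32_t x, unsigned n)
{
    int64_t max = ((int64_t)1 << n) - 1;
    if (x < 0)
        return 0u;
    if (x > max)
        return (uint32_t)max;
    return (uint32_t)x;
}
```

With n = 16 the signed helper clamps 40000 to 32767 and -40000 to -32768, which is the familiar 16-bit audio clipping case; with n = 8 the unsigned helper clamps a signed intermediate to a valid pixel value in 0 to 255.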
In addition to these instructions, shown in Table 15.7, there are the new saturating arithmetic SIMD operations that have already been described in Section 15.1.1.

Table 15.7 Saturation instructions.

Instruction                                Description
SSAT Rd, #<BitPosition>, Rm{, <Shift>}     Signed 32-bit saturation at an arbitrary bit position. Shift can be an LSL or ASR.
SSAT16{<cond>} Rd, #<immed>, Rm            Signed dual 16-bit saturation at the same position in both halves.
USAT Rd, #<BitPosition>, Rm{, <Shift>}     Unsigned 32-bit saturation at an arbitrary bit position. Shift can be an LSL or ASR.
USAT16{<cond>} Rd, #<immed>, Rm            Unsigned dual 16-bit saturation at the same position in both halves.

Note that in the 32-bit versions of these saturation operations there is an optional shift (LSL or ASR) of the source register Rm before saturation, allowing scaling to take place in the same instruction.

15.1.5 Sum of Absolute Differences Instructions

The two new instructions USAD8 and USADA8 are probably the most application-specific within the ARMv6 architecture. They compute the sum of the absolute differences between pairs of 8-bit values and are particularly useful in motion video compression algorithms such as MPEG or H.263, including motion estimation algorithms that measure motion by comparing blocks using many sum-of-absolute-difference operations (see Figure 15.2). Table 15.8 lists these instructions.

Table 15.8 Sum of absolute differences.

Instruction                        Description
USAD8{<cond>} Rd, Rm, Rs           Sum of absolute differences
USADA8{<cond>} Rd, Rm, Rs, Rn      Accumulated sum of absolute differences

To compare an N × N square at (x, y) in image p_1 with an N × N square p_2, we calculate the accumulated sum of absolute differences:

\[ a(x, y) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| p_1(x + i,\, y + j) - p_2(i, j) \right| \]

[Figure 15.2: Sum-of-absolute-differences operation. Four absdiff units take the byte lanes of Rm and Rs; their outputs are summed with Rn to produce Rd.]
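The accumulated sum defined above, and the per-word operation of Figure 15.2, can be sketched in C as follows. The names `usada8_model`, `usad8_model`, and `block_sad` are hypothetical, and the block routine assumes that N is a multiple of 4 and that each image is addressed through a row stride in bytes; neither assumption comes from the text.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the accumulating form: add the absolute
   differences of the four byte lanes of rm and rs to rn. */
uint32_t usada8_model(uint32_t rm, uint32_t rs, uint32_t rn)
{
    uint32_t sum = rn;
    for (int lane = 0; lane < 4; lane++) {
        int a = (int)((rm >> (8 * lane)) & 0xFFu);
        int b = (int)((rs >> (8 * lane)) & 0xFFu);
        sum += (uint32_t)(a > b ? a - b : b - a);
    }
    return sum;
}

/* Hypothetical model of the non-accumulating form: starts from zero. */
uint32_t usad8_model(uint32_t rm, uint32_t rs)
{
    return usada8_model(rm, rs, 0u);
}

/* Sum of absolute differences over an n x n block, four pixels per step.
   p1 points at pixel (x, y) of the first image, p2 at the reference block.
   Assumes n is a multiple of 4; strides are in bytes. */
uint32_t block_sad(const uint8_t *p1, size_t stride1,
                   const uint8_t *p2, size_t stride2, size_t n)
{
    uint32_t acc = 0;
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < n; i += 4) {
            uint32_t w1, w2;
            /* memcpy avoids alignment assumptions on the pixel rows. */
            memcpy(&w1, p1 + j * stride1 + i, sizeof w1);
            memcpy(&w2, p2 + j * stride2 + i, sizeof w2);
            acc = usada8_model(w1, w2, acc);
        }
    }
    return acc;
}
```

The byte order of the loaded words does not affect the result, since both words are loaded the same way and the four lane differences are simply summed.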
To implement this using the new instructions, use the following sequence to compute the sum of absolute differences of four pixels:

    LDR     p1, [p1Ptr], #4     ; load 4 pixels from p1
    LDR     p2, [p2Ptr], #4     ; load 4 pixels from p2
                                ; load delay-slot
                                ; load delay-slot
    USADA8  acc, p1, p2, acc    ; accumulate sum abs diff

There is a tremendous performance advantage for this algorithm over an ARMv5TE implementation. There is a four times improvement in performance from the 8-bit SIMD alone. Additionally, the USADA8 operation includes the accumulation operation. The USAD8 operation will typically be used to carry out the setup into the loop before there is an existing accumulated value.

15.1.6 Dual 16-Bit Multiply Instructions

ARMv5TE introduced considerable DSP performance to ARM, but ARMv6 takes this much further. Implementations of ARMv6 (such as ARM1136J) have a dual 16 × 16 multiply capability, which is comparable with many high-end dedicated DSP devices. Table 15.9 lists these instructions.

[...]

... version
ARM1136J-S      ARMv6J
ARM1156T2-S     ARMv6 + Thumb-2
ARM1176JZ-S     ARMv6J + TrustZone

Appendix A
ARM and Thumb Assembler Instructions

A.1 Using This Appendix
A.2 Syntax
    A.2.1 Optional Expressions
    A.2.2 Register Names
    A.2.3 Values Stored as Immediates
    A.2.4 Condition Codes and Flags
    A.2.5 Shift Operations
A.3 Alphabetical List of ARM and Thumb Instructions
A.4 ARM Assembler Quick Reference
    A.4.1 ARM Assembler Variables
    A.4.2 ARM Assembler Labels
    A.4.3 ARM Assembler Expressions
    A.4.4 ARM Assembler Directives
A.5 GNU Assembler Quick Reference
    A.5.1 GNU Assembler Directives

This appendix lists the ARM and Thumb instructions available up to, and including, ARM architecture ARMv6, which was just released at the time of writing. We list...
15.2 System and Multiprocessor Support Additions to ARMv6

As systems become more complicated, they incorporate multiple processors and processing engines. These engines may share different views of memory and even use different endiannesses (byte order). To support communication in these systems, ARMv6 adds support for mixed-endian systems, fast exception processing, and...

address in Rn and flag if successful in Rd (Rd = 0 if successful)

15.3 ARMv6 Implementations

ARM completed development of ARM1136J in December 2002, and at this writing consumer products are being designed with this core. The ARM1136J pipeline is the most sophisticated ARM implementation to date. As shown in Figure 15.4, it has an eight-stage pipeline with separate...

are two of a number of standard fields described in Section A.2. Rd and Rn denote ARM registers. The instruction is only executed if the

Table A.1 Instruction types.

Type        Meaning
ARMvX       32-bit ARM instruction first appearing in ARM architecture version X
THUMBvX     16-bit Thumb instruction first appearing in Thumb architecture version...
MACRO

It is common for operating systems to save the return state of an interrupt or exception on a stack. ARMv6 adds the instructions in Table 15.13 to improve the efficiency of this operation, which can occur very frequently in interrupt/scheduler driven systems.

15.2.3 Multiprocessing Synchronization Primitives

As system-on-chip (SoC) architectures have become more sophisticated, ARM cores are now often found...
when ARM announced the ARM1176JZ-S. The fundamental idea is that operating systems (even on embedded devices) are now so complex that it is very hard to verify security and correctness in the software. The ARM solution to this problem is to add new operating “states” to the architecture where only a small verifiable software kernel will run, and this will provide services to the larger operating system...

reference guides to the ARM and GNU assemblers armasm and gas. We have designed this appendix for practical programming use, both for writing assembly code and for interpreting disassembly output. It is not intended as a definitive architectural ARM reference. In particular, we do not list the exhaustive details of each instruction bitmap encoding and behavior. For this level of detail, see the ARM Architecture...

in the ARM1156T2-S processor. Details of this architecture are not public at the time of writing.

15.5 Summary

The ARM architecture is not a static constant but is being developed and improved to suit the applications required by today’s consumer devices. Although the ARMv5TE architecture was very successful at adding some DSP support to the ARM, the ARMv6...

instruction is available from the listed ARM architecture version onwards. Table A.1 shows the entries possible for this column. Note that there is no direct correlation between the Thumb architecture number and the ARM architecture number. The THUMBv1 architecture is used in ARMv4T processors; the THUMBv2 architecture, in ARMv5T processors; and the THUMBv3 architecture, in ARMv6 processors. Each instruction...
