Chapter 8 Digital Signal Processing

or in integer C:

    Y[t] = isqrt( X[t] << (2*d-n) );

The function isqrt finds the nearest integer to the square root of the integer. See Section 7.4 for efficient implementation of square root operations.

8.1.7 Summary: How to Represent a Digital Signal

To choose a representation for a signal value, use the following criteria:

■ Use a floating-point representation for prototyping algorithms. Do not use floating point in applications where speed is critical. Most ARM implementations do not include hardware floating-point support.
■ Use a fixed-point representation for DSP applications where speed is critical with moderate dynamic range. The ARM cores provide good support for 8-, 16- and 32-bit fixed-point DSP.
■ For applications requiring speed and high dynamic range, use a block-floating or logarithmic representation.

Table 8.2 summarizes how you can implement standard operations in fixed-point arithmetic. It assumes there are three signals x[t], c[t], y[t], that have Qn, Qm, Qd representations X[t], C[t], Y[t], respectively. In other words:

$X[t] = 2^n x[t], \quad C[t] = 2^m c[t], \quad Y[t] = 2^d y[t]$ (8.19)

to the nearest integer.

To make the table more concise, we use <<< as shorthand for an operation that is either a left or right shift according to the sign of the shift amount. Formally:

    x <<< s := x << s              if s>=0
               x >> (-s)           if s<0 and rounding is not required
               (x+round) >> (-s)   if s<0 and rounding is required

    round   := (1 << (-1-s))       if 0.5 should round up
               (1 << (-1-s))-1     if 0.5 should round down

Table 8.2 Summary of standard fixed-point operations.

    Signal operation     Integer fixed-point equivalent
    y[t]=x[t]            Y[t]=X[t] <<< (d-n);
    y[t]=x[t]+c[t]       Y[t]=(X[t] <<< (d-n))+(C[t] <<< (d-m));
    y[t]=x[t]-c[t]       Y[t]=(X[t] <<< (d-n))-(C[t] <<< (d-m));
    y[t]=x[t]*c[t]       Y[t]=(X[t]*C[t]) <<< (d-n-m);
    y[t]=x[t]/c[t]       Y[t]=(X[t] <<< (d-n+m))/C[t];
    y[t]=sqrt(x[t])      Y[t]=isqrt(X[t] <<< (2*d-n));

You must always check the precision and dynamic range of the intermediate and output values. Ensure that there are no overflows or unacceptable losses of precision. These considerations determine the representations and size to use for the container integers.

These equations are the most general form. In practice, for addition and subtraction we usually take d = n = m. For multiplication we usually take d = n + m or d = n. Since you know d, n, and m at compile time, you can eliminate shifts by zero.

8.2 Introduction to DSP on the ARM

This section begins by looking at the features of the ARM architecture that are useful for writing DSP applications. We look at each common ARM implementation in turn, highlighting its strengths and weaknesses for DSP.

The ARM core is not a dedicated DSP. There is no single instruction that issues a multiply accumulate and data fetch in parallel. However, by reusing loaded data you can achieve a respectable DSP performance. The key idea is to use block algorithms that calculate several results at once, and thus require less memory bandwidth, increase performance, and decrease power consumption compared with calculating single results.

The ARM also differs from a standard DSP when it comes to precision and saturation. In general, ARM does not provide operations that saturate automatically. Saturating versions of operations usually cost additional cycles.
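To make the cost of saturation concrete, here is a minimal C sketch of a saturating 32-bit add. The function name sat_add32 is hypothetical and does not come from the text; the sketch only illustrates the extra range checks that a plain wrapping add does not need.

```c
#include <stdint.h>

/* Hypothetical helper: add two 32-bit values and clamp the result to the
 * 32-bit range instead of letting it wrap around on overflow. */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;  /* exact: cannot overflow 64 bits */

    if (sum > INT32_MAX) {
        return INT32_MAX;                   /* clamp positive overflow */
    }
    if (sum < INT32_MIN) {
        return INT32_MIN;                   /* clamp negative overflow */
    }
    return (int32_t)sum;                    /* in range: ordinary result */
}
```

The two comparisons after the add are the additional work. Choosing a wider accumulator or extra scaling so that the sum can never leave the 32-bit range removes them entirely.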
Section 7.7 covered saturating operations on the ARM. On the other hand, ARM supports extended-precision 32-bit multiplied by 32-bit to 64-bit operations very well. These operations are particularly important for CD-quality audio applications, which require intermediate precision at greater than 16 bits.

From ARM9 onwards, ARM implementations use a multistage execute pipeline for loads and multiplies, which introduces potential processor interlocks. If you load a value and then use it in either of the following two instructions, the processor may stall for a number of cycles waiting for the loaded value to arrive. Similarly, if you use the result of a multiply in the following instruction, this may cause stall cycles. It is particularly important to schedule code to avoid these stalls. See the discussion in Section 6.3 on instruction scheduling.

Summary Guidelines for Writing DSP Code for ARM

■ Design the DSP algorithm so that saturation is not required because saturation will cost extra cycles. Use extended-precision arithmetic or additional scaling rather than saturation.
■ Design the DSP algorithm to minimize loads and stores. Once you load a data item, then perform as many operations that use the datum as possible. You can often do this by calculating several output results at once. Another way of increasing reuse is to concatenate several operations. For example, you could perform a dot product and signal scale at the same time, while only loading the data once.
■ Write ARM assembly to avoid processor interlocks. The results of load and multiply instructions are often not available to the next instruction without adding stall cycles. Sometimes the results will not be available for several cycles. Refer to Appendix D for details of instruction cycle timings.
■ There are 14 registers available for general use on the ARM, r0 to r12 and r14. Design the DSP algorithm so that the inner loop will require 14 registers or fewer.
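The load-reuse guideline above can be sketched in C. This is an illustration under stated assumptions, not code from the text: the function name dot_and_scale, the output array y, and the Q15 gain parameter are all hypothetical. It performs a dot product and a signal scale in one pass, so each sample is loaded only once.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the dot product of x[] and c[], and in the
 * same pass writes y[i] = x[i] * gain, where gain_q15 is a Q15 fraction.
 * Assumes n is small enough that the 32-bit accumulator cannot overflow. */
static int32_t dot_and_scale(const int16_t *x, const int16_t *c,
                             int16_t *y, int16_t gain_q15, size_t n)
{
    int32_t acc = 0;

    for (size_t i = 0; i < n; i++) {
        int32_t xi = x[i];          /* single load, reused twice below */

        acc += xi * c[i];           /* dot-product term */

        /* Q15 scale. Right-shifting a negative value is implementation-
         * defined in C; this sketch assumes an arithmetic shift. */
        y[i] = (int16_t)((xi * gain_q15) >> 15);
    }
    return acc;
}
```

For example, with x = {2, 4, 6, 8}, c = {1, 2, 3, 4} and gain_q15 = 16384 (0.5 in Q15), the function returns 60 and fills y with {1, 2, 3, 4}. Two separate loops would produce the same values but would load every sample twice.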
In the following sections we look at each of the standard ARM cores in turn. We implement a dot-product example for each core. A dot-product is one of the simplest DSP operations and highlights the difference among different ARM implementations. A dot-product combines N samples from two signals x_t and c_t to produce a correlation value a:

$a = \sum_{i=0}^{N-1} c_i x_i$ (8.20)

The C interface to the dot-product function is

    int dot_product(sample *x, coefficient *c, unsigned int N);

where

■ sample is the type to hold a 16-bit audio sample, usually a short
■ coefficient is the type to hold a 16-bit coefficient, usually a short
■ x[i] and c[i] are two arrays of length N (the data and coefficients)
■ the function returns the accumulated 32-bit integer dot product a

8.2.1 DSP on the ARM7TDMI

The ARM7TDMI has a 32-bit by 8-bit per cycle multiply array with early termination. It takes four cycles for a 16-bit by 16-bit to 32-bit multiply accumulate. Load instructions take three cycles and store instructions two cycles for zero-wait-state memory or cache. See Section D.2 in Appendix D for details of cycle timings for ARM7TDMI instructions.

Summary Guidelines for Writing DSP Code for the ARM7TDMI

■ Load instructions are slow, taking three cycles to load a single value. To access memory efficiently use load and store multiple instructions LDM and STM. Load and store multiples only require a single cycle for each additional word transferred after the first word. This often means it is more efficient to store 16-bit data values in 32-bit words.
■ The multiply instructions use early termination based on the second operand in the product Rs. For predictable performance use the second operand to specify constant coefficients or multiples.
■ Multiply is one cycle faster than multiply accumulate. It is sometimes useful to split an MLA instruction into separate MUL and ADD instructions.
  You can then use a barrel shift with the ADD to perform a scaled accumulate.
■ You can often multiply by fixed coefficients faster using arithmetic instructions with shifts. For example, 240x = (x << 8) − (x << 4). For any fixed coefficient of the form ±2^a ± 2^b ± 2^c, ADD and SUB with shift give a faster multiply accumulate than MLA.

Example 8.2 This example shows a 16-bit dot-product optimized for the ARM7TDMI. Each MLA takes a worst case of four cycles. We store the 16-bit input samples in 32-bit words so that we can use the LDM instruction to load them efficiently.

    x       RN 0    ; input array x[]
    c       RN 1    ; input array c[]
    N       RN 2    ; number of samples (a multiple of 5)
    acc     RN 3    ; accumulator
    x_0     RN 4    ; elements from array x[]
    x_1     RN 5
    x_2     RN 6
    x_3     RN 7
    x_4     RN 8
    c_0     RN 9    ; elements from array c[]
    c_1     RN 10
    c_2     RN 11
    c_3     RN 12
    c_4     RN 14

            ; int dot_16by16_arm7m(int *x, int *c, unsigned N)
    dot_16by16_arm7m
            STMFD   sp!, {r4-r11, lr}
            MOV     acc, #0
    loop_7m
            ; accumulate 5 products
            LDMIA   x!, {x_0, x_1, x_2, x_3, x_4}
            LDMIA   c!, {c_0, c_1, c_2, c_3, c_4}
            MLA     acc, x_0, c_0, acc
            MLA     acc, x_1, c_1, acc
            MLA     acc, x_2, c_2, acc
            MLA     acc, x_3, c_3, acc
            MLA     acc, x_4, c_4, acc
            SUBS    N, N, #5
            BGT     loop_7m
            MOV     r0, acc
            LDMFD   sp!, {r4-r11, pc}

This code assumes that the number of samples N is a multiple of five. Therefore we can use a five-word load multiple to increase data bandwidth. The cost per load is 7/5 = 1.4 cycles compared to 3 cycles per load if we had used LDR or LDRSH. The inner loop requires a worst case of 7 + 7 + 5 × 4 + 1 + 3 = 38 cycles to process each block of 5 products from the sum. This gives the ARM7TDMI a DSP rating of 38/5 = 7.6 cycles per tap for a 16-bit dot-product. The block filter algorithm of Section 8.3 gives a much better performance per tap if you are calculating multiple products. ■

8.2.2 DSP on the ARM9TDMI

The ARM9TDMI has the same 32-bit by 8-bit per cycle multiplier array with early termination as the ARM7TDMI.
However, load and store operations are much faster compared to the ARM7TDMI. They take one cycle provided that you do not attempt to use the loaded value for two cycles after the load instruction. See Section D.3 in Appendix D for cycle timings of ARM9TDMI instructions.

Summary Writing DSP Code for the ARM9TDMI

■ Load instructions are fast as long as you schedule the code to avoid using the loaded value for two cycles. There is no advantage to using load multiples. Therefore you should store 16-bit data in 16-bit short type arrays. Use the LDRSH instruction to load the data.
■ The multiply instructions use early termination based on the second operand in the product Rs. For predictable performance use the second operand to specify constant coefficients or multiples.
■ Multiply is the same speed as multiply accumulate. Try to use the MLA instruction rather than a separate multiply and add.
■ You can often multiply by fixed coefficients faster using arithmetic instructions with shifts. For example, 240x = (x << 8) − (x << 4). For any fixed coefficient of the form ±2^a ± 2^b ± 2^c, ADD and SUB with shift give a faster multiply accumulate than using MLA.

Example 8.3 This example shows a 16-bit dot-product optimized for the ARM9TDMI. Each MLA takes a worst case of four cycles. We store the 16-bit input samples in 16-bit short integers, since there is no advantage in using LDM rather than LDRSH, and using LDRSH reduces the memory size of the data.
    x       RN 0    ; input array x[]
    c       RN 1    ; input array c[]
    N       RN 2    ; number of samples (a multiple of 4)
    acc     RN 3    ; accumulator
    x_0     RN 4    ; elements from array x[]
    x_1     RN 5
    c_0     RN 9    ; elements from array c[]
    c_1     RN 10

            ; int dot_16by16_arm9m(short *x, short *c, unsigned N)
    dot_16by16_arm9m
            STMFD   sp!, {r4-r5, r9-r10, lr}
            MOV     acc, #0
            LDRSH   x_0, [x], #2
            LDRSH   c_0, [c], #2
    loop_9m
            ; accumulate 4 products
            SUBS    N, N, #4
            LDRSH   x_1, [x], #2
            LDRSH   c_1, [c], #2
            MLA     acc, x_0, c_0, acc
            LDRSH   x_0, [x], #2
            LDRSH   c_0, [c], #2
            MLA     acc, x_1, c_1, acc
            LDRSH   x_1, [x], #2
            LDRSH   c_1, [c], #2
            MLA     acc, x_0, c_0, acc
            LDRGTSH x_0, [x], #2
            LDRGTSH c_0, [c], #2
            MLA     acc, x_1, c_1, acc
            BGT     loop_9m
            MOV     r0, acc
            LDMFD   sp!, {r4-r5, r9-r10, pc}

We have assumed that the number of samples N is a multiple of four. Therefore we can unroll the loop four times to increase performance. The code is scheduled so that there are four instructions between a load and the use of the loaded value. This uses the preload tricks of Section 6.3.1.1:

■ The loads are double buffered. We use x_0, c_0 while we are loading x_1, c_1 and vice versa.
■ We load the initial values x_0, c_0 before the inner loop starts. This initiates the double buffer process.
■ We are always loading one pair of values ahead of the ones we are using. Therefore we must avoid the last pair of loads or we will read off the end of the arrays. We do this by having a loop counter that counts down to zero on the last loop. Then we can make the final loads conditional on N > 0.

The inner loop requires 28 cycles per loop, giving 28/4 = 7 cycles per tap. See Section 8.3 for more efficient block filter implementations. ■

8.2.3 DSP on the StrongARM

The StrongARM core SA-1 has a 32-bit by 12-bit per cycle signed multiply array with early termination. If you attempt to use a multiply result in the following instruction, or start a new multiply, then the core will stall for one cycle.
Load instructions take one cycle, except for signed byte and halfword loads, which take two cycles. There is a one-cycle delay before you can use the loaded value. See Section D.4 in Appendix D for details of the StrongARM instruction cycle timings.

Summary Writing DSP Code for the StrongARM

■ Avoid signed byte and halfword loads. Schedule the code to avoid using the loaded value for one cycle. There is no advantage to using load multiples.
■ The multiply instructions use early termination based on the second operand in the product Rs. For predictable performance use the second operand to specify constant coefficients or multiples.
■ Multiply is the same speed as multiply accumulate. Try to use the MLA instruction rather than a separate multiply and add.

Example 8.4 This example shows a 16-bit dot-product. Since a signed 16-bit load requires two cycles, it is more efficient to use 32-bit data containers for the StrongARM. To schedule StrongARM code, one trick is to interleave loads and multiplies.

    x       RN 0    ; input array x[]
    c       RN 1    ; input array c[]
    N       RN 2    ; number of samples (a multiple of 4)
    acc     RN 3    ; accumulator
    x_0     RN 4    ; elements from array x[]
    x_1     RN 5
    c_0     RN 9    ; elements from array c[]
    c_1     RN 10

            ; int dot_16by16_SA1(int *x, int *c, unsigned N)
    dot_16by16_SA1
            STMFD   sp!, {r4-r5, r9-r10, lr}
            MOV     acc, #0
            LDR     x_0, [x], #4
            LDR     c_0, [c], #4
    loop_sa
            ; accumulate 4 products
            SUBS    N, N, #4
            LDR     x_1, [x], #4
            LDR     c_1, [c], #4
            MLA     acc, x_0, c_0, acc
            LDR     x_0, [x], #4
            LDR     c_0, [c], #4
            MLA     acc, x_1, c_1, acc
            LDR     x_1, [x], #4
            LDR     c_1, [c], #4
            MLA     acc, x_0, c_0, acc
            LDRGT   x_0, [x], #4
            LDRGT   c_0, [c], #4
            MLA     acc, x_1, c_1, acc
            BGT     loop_sa
            MOV     r0, acc
            LDMFD   sp!, {r4-r5, r9-r10, pc}

We have assumed that the number of samples N is a multiple of four and so have unrolled by four times. For worst-case 16-bit coefficients, each multiply requires two cycles. We have scheduled to remove all load and multiply use interlocks.
The inner loop uses 19 cycles to process 4 taps, giving a rating of 19/4 = 4.75 cycles per tap. ■

8.2.4 DSP on the ARM9E

The ARM9E core has a very fast pipelined multiplier array that performs a 32-bit by 16-bit multiply in a single issue cycle. The result is not available on the next cycle unless you use the result as the accumulator in a multiply accumulate operation. The load and store operations are the same speed as on the ARM9TDMI. See Section D.5 in Appendix D for details of the ARM9E instruction cycle times.

To access the fast multiplier, you will need to use the multiply instructions defined in the ARMv5TE architecture extensions. For 16-bit by 16-bit products use SMULxy and SMLAxy. See Appendix A for a full list of ARM multiply instructions.

Summary Writing DSP Code for the ARM9E

■ The ARMv5TE architecture multiply operations are capable of unpacking 16-bit halves from 32-bit words and multiplying them. For best load bandwidth you should use word load instructions to load packed 16-bit data items. As for the ARM9TDMI you should schedule code to avoid load use interlocks.
■ The multiply operations do not early terminate. Therefore you should only use MUL and MLA for multiplying 32-bit integers. For 16-bit values use SMULxy and SMLAxy.
■ Multiply is the same speed as multiply accumulate. Try to use the SMLAxy instruction rather than a separate multiply and add.

Example 8.5 This example shows the dot-product for the ARM9E. It assumes that the ARM is configured for a little-endian memory system. If the ARM is configured for a big-endian memory system, then you need to swap the B and T instruction suffixes. You can use macros to do this for you automatically as in Example 8.11. We use the naming convention x_10 to mean that the top 16 bits of the register hold x_1 and the bottom 16 bits x_0.
    x       RN 0    ; input array x[]
    c       RN 1    ; input array c[]
    N       RN 2    ; number of samples (a multiple of 8)
    acc     RN 3    ; accumulator
    x_10    RN 4    ; packed elements from array x[]
    x_32    RN 5
    c_10    RN 9    ; packed elements from array c[]
    c_32    RN 10

            ; int dot_16by16_arm9e(short *x, short *c, unsigned N)
    dot_16by16_arm9e
            STMFD   sp!, {r4-r5, r9-r10, lr}
            MOV     acc, #0
            LDR     x_10, [x], #4
            LDR     c_10, [c], #4
    loop_9e
            ; accumulate 8 products
            SUBS    N, N, #8
            LDR     x_32, [x], #4
            SMLABB  acc, x_10, c_10, acc
            LDR     c_32, [c], #4
            SMLATT  acc, x_10, c_10, acc
            LDR     x_10, [x], #4
            SMLABB  acc, x_32, c_32, acc
            LDR     c_10, [c], #4
            SMLATT  acc, x_32, c_32, acc
            LDR     x_32, [x], #4
            SMLABB  acc, x_10, c_10, acc
            LDR     c_32, [c], #4
            SMLATT  acc, x_10, c_10, acc
            LDRGT   x_10, [x], #4
            SMLABB  acc, x_32, c_32, acc
            LDRGT   c_10, [c], #4
            SMLATT  acc, x_32, c_32, acc
            BGT     loop_9e
            MOV     r0, acc
            LDMFD   sp!, {r4-r5, r9-r10, pc}

We have unrolled eight times, assuming that N is a multiple of eight. Each load instruction reads two 16-bit values, giving a high memory bandwidth. The inner loop requires 20 cycles to accumulate 8 products, a rating of 20/8 = 2.5 cycles per tap. A block filter gives even greater efficiency. ■

8.2.5 DSP on the ARM10E

Like ARM9E, the ARM10E core also implements ARM architecture ARMv5TE. The range and speed of multiply operations is the same as for the ARM9E, except that the 16-bit multiply accumulate requires two cycles rather than one. For details of the ARM10E core cycle timings, see Section D.6 in Appendix D.

The ARM10E implements a background loading mechanism to accelerate load and store multiples. A load or store multiple instruction issues in one cycle. The operation will run in the background, and if you attempt to use the value before the background load completes, then the core will stall. ARM10E uses a 64-bit-wide data path that can transfer two registers on every background cycle. If the address isn't 64-bit aligned, then only 32 bits can be transferred on the first cycle.
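As a small illustration of the 64-bit alignment point, the following C sketch requests 8-byte alignment for the data arrays and checks an address at run time. It is an assumption-laden sketch, not code from the text: the names is_aligned64, samples, and coeffs are hypothetical, the array length of 40 is arbitrary, and alignas requires a C11 compiler.

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p lies on a 64-bit (8-byte) boundary.
 * Converting a pointer to uintptr_t is implementation-defined, but it
 * yields the plain address on common ARM toolchains. */
static int is_aligned64(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}

/* Hypothetical sample and coefficient buffers. alignas(8) asks the
 * compiler to place each array on a 64-bit boundary, so a load multiple
 * starting at element 0 begins on an aligned address. */
static alignas(8) int16_t samples[40];
static alignas(8) int16_t coeffs[40];
```

A caller could assert is_aligned64(samples) before entering an LDM-based inner loop. Note that a pointer advanced by an odd number of words, such as samples + 2, is no longer 64-bit aligned.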
Summary Writing DSP Code for the ARM10E

■ Load and store multiples run in the background to give a high memory bandwidth. Use load and store multiples whenever possible. Be careful to schedule the code so that it does not use data before the background load has completed.
■ Ensure data arrays are 64-bit aligned so that load and store multiple operations can transfer two words per cycle.
■ The multiply operations do not early terminate. Therefore you should only use MUL and MLA for multiplying 32-bit integers. For 16-bit values use SMULxy and SMLAxy.
■ The SMLAxy instruction takes one cycle more than SMULxy. It may be useful to split a multiply accumulate into a separate multiply and add.

Example 8.6 In the example code the number of samples N is a multiple of 10.

    x       RN 0    ; input array x[]
    c       RN 1    ; input array c[]
    N       RN 2    ; number of samples (a multiple of 10)
    acc     RN 3    ; accumulator

[...]

...
            SMLATB  a_5, x_10, c_32, a_5
            SMLATT  a_0, x_32, c_32, a_0
            LDR     x_32, [x], #4
            SMLABT  a_1, x_54, c_32, a_1
            SMLATT  a_2, x_54, c_32, a_2
            SMLABT  a_3, x_10, c_32, a_3
            SMLATT  a_4, x_10, c_32, a_4
            SMLABT  a_5, x_32, c_32, a_5
            SMLABB  a_0, x_54, c_10, a_0
            SMLATB  a_1, x_54, c_10, a_1
            SMLABB  a_2, x_10, c_10, a_2
            SMLATB  a_3, x_10, c_10, a_3
            SMLABB  a_4, x_32, c_10, a_4
            SMLATB  a_5, x_32, c_10, a_5
            SMLATT  a_0, x_54, c_10,...

... a_3
            SMLABB  a_4, x_54, c_10, a_4
            SMLATB  a_5, x_54, c_10, a_5
            SMLATT  a_0, x_10, c_10, a_0
            LDR     x_10, [x], #4       ; load two coefficients
            SMLABT  a_1, x_32, c_10, a_1
            SMLATT  a_2, x_32, c_10, a_2
            SMLABT  a_3, x_54, c_10, a_3
            SMLATT  a_4, x_54, c_10, a_4
            SMLABT  a_5, x_10, c_10, a_5
            LDR     c_10, [c], #4
            SMLABB  a_0, x_32, c_32, a_0
            SMLATB  a_1, x_32, c_32, a_1
            SMLABB  a_2, x_54, c_32, a_2
            SMLATB  a_3, x_54, c_32, a_3
            SMLABB...
            LDMFD   sp!, {N, M}

Table 8.6 32-bit by 32-bit filter timing

    Processor    Inner loop cycles    Filter rating cycles/tap
    ARM7TDMI     54                   54/6 = 9
    ARM9TDMI     50                   50/6 = 8.3
    StrongARM    31                   31/6 = 5.2
    ARM9E        26                   26/6 = 4.3
    ARM10E       22                   22/6 = 3.7
    XScale       22                   22/6 = 3.7

            STMIA   a!, {a_0l, a_0h, a_1l, a_1h, a_2l, a_2h}
            SUB     c, c, M, LSL#2
            SUB     x, x, M, LSL#2
            ADD     x, x, #(3-2)*4
            SUBS    ...
            BGT     ...
            LDMFD   ...

... 10 samples, or 2.5 cycles per tap. ■

8.2.6 DSP on the Intel XScale

The Intel XScale implements version ARMv5TE of the ARM architecture like ARM9E and ARM10E. The timings of load and multiply instructions are similar to the ARM9E, and code you've optimized for the ARM9E should run efficiently on XScale. See Section D.7 in Appendix D for details of the XScale core cycle...

...

$y_t = -0.45 x_t + 0.9 x_{t-1} - 0.45 x_{t-2}$ (8.26)

Suppose we represent x_i and c_i by Qn, Qm 16-bit fixed-point signals X[t] and C[i]. Then,

$C[0] = -0.45 \times 2^m, \quad C[1] = 0.90 \times 2^m, \quad C[2] = -0.45 \times 2^m$ (8.27)

Since X[t] is a 16-bit integer, $|X[t]| \le 2^{15}$, and so, using the first inequality above,

$|A[t]| \le 2^{15} \times 1.8 \times 2^m = 1.8 \times 2^{15+m}$ (8.28)

A[t] will not overflow a 32-bit integer, provided that m ≤ 15. So, take...

...         x_54, [x], #4
            SMLABT  a_1, x_10, c_10, a_1
            SMLATT  a_2, x_10, c_10, a_2
            SMLABT  a_3, x_32, c_10, a_3

Table 8.4 ARMv5TE 16-bit block filter timings

    Processor    Inner loop cycles    Filter rating cycles/tap
    ARM9E        46                   46/36 = 1.28
    ARM10E       78                   78/36 = 2.17
    XScale       46                   46/36 = 1.28

            SMLATT  a_4, x_32, c_10, a_4
            SMLABT  a_5, x_54, c_10, a_5
            BGT     ...
            LDMFD   ...
            STMIA   ...
            SUB     ...
            SUB     ...
            SUBS    ...
            BGT     ...
            LDMFD   ...

... it processes 16 filter taps in 76 cycles, giving a block FIR rating of 4.75 cycles/tap. This code also works well for other ARMv4 architecture processors such as the StrongARM. On StrongARM the inner loop requires 61 cycles, or 3.81 cycles/tap. ■

Example 8.11 The ARM9E has a faster multiplier than previous ARM processors. The ARMv5TE 16-bit multiply instructions also unpack 16-bit data when two 16-bit values...

...
            SUBS    M, M, #1
            BGT     iir_next_biquad
            LDMFD   sp!, {r4-r11, pc}

Table 8.7 ARMv4T IIR timings

    Processor    Cycles per loop    Cycles per biquad-sample
    ARM9TDMI     44                 22
    StrongARM    33                 16.5

The timings on ARM9TDMI and StrongARM are shown in Table 8.7. ■

Example 8.16 With ARMv5TE processors, we can pack two 16-bit values into each register. This means we can store the state and coefficients...

...         {c_54, c_76, c_98}
            SMLABB  acc, x_32, c_32, acc
            SMLATT  acc, x_32, c_32, acc
            LDMGTIA x!, {x_10, x_32}
            SMLABB  acc, x_54, c_54, acc
            SMLATT  acc, x_54, c_54, acc
            SMLABB  acc, x_76, c_76, acc
            LDMGTIA c!, {c_10, c_32}
            SMLATT  acc, x_76, c_76, acc
            SMLABB  acc, x_98, c_98, acc
            SMLATT  acc, x_98, c_98, acc
            BGT     loop_10
            MOV     r0, acc
            LDMFD   sp!, {r4-r11, pc}

The inner loop requires 25 cycles to process 10 samples, or 2.5...

...         c_32, a_2
            a_3, x_2, c_32, a_3
            a_4, x_3, c_32, a_4
            next_tap32_arm9e
            sp!, {a, N, M}
            a!, {a_0, a_1, a_2, a_3, a_4}
            c, c, M, LSL#1
            x, x, M, LSL#2
            x, x, #(5-4)*4
            N, N, #5
            next_sample32_arm9e
            sp!, {r4-r11, pc}

Each iteration of the inner loop updates five filter outputs, accumulating four products to each. Table 8.5 gives cycle counts for architecture ARMv5TE processors. ■

Example High-quality audio applications...