394 FIXED-POINT MATHEMATICS APPENDIX A

does not require this large a range, as only magnitudes up to 2^32 need to be representable, and for colors even 2^10 is enough. The precision of these fixed-point numbers is fixed (1/65536), whereas the precision of floats depends on the magnitude of the values. Values close to zero have very high accuracy: two consecutive floats at around 1.0 have a precision of 1/16777216, floats at around 250.0 have roughly the same precision as fixed-point numbers, while larger numbers become more inaccurate (two consecutive floats around 17 million are more than 1.0 units apart). OpenGL requires only an accuracy of one part in 10^5, which is a little under 17 bits; single-precision floats have 24 bits of accuracy.

Below are C macros for converting from float to fixed and vice versa:

    #define float_to_fixed( a ) (int)((a) * (1<<16))
    #define fixed_to_float( a ) (((float)(a)) / (1<<16))

These are "quick-and-dirty" versions of conversion. float_to_fixed can overflow if the magnitude of the float value is too great, or underflow if it is too small. float_to_fixed can also be made slightly more accurate by rounding. For example, asymmetric arithmetic rounding works by adding 0.5 to the number before truncating it to an integer, e.g., (int)floor((a) * 65536.0f + 0.5f). Finally, note that some of these conversions are expensive on some processors and thus should not be used in performance-critical code such as inner loops.

Here are some 16.16 fixed-point numbers, expressed in hexadecimal, and the corresponding decimal numbers:

    0x00010000     1.0
    0x00020000     2.0
    0x00100000    16.0
    0x00008000     0.5
    0x00004000     0.25
    0x00002000     0.125
    0x00000001     1.0/65536
    0xffffffff    -1.0/65536
    0xfffe0000    -2.0

Depending on the situation it may make sense to move the decimal point to some other location, although 16.16 is a good general choice.
For example, if you are only interested in numbers between zero and one (but excluding one), you should move the decimal point all the way to the left; if you use 32 bits, we denote that format with u0.32 (here u stands for unsigned). In rasterization, the number of sub-pixel bits and the size of the screen in pixels determine the number of bits you should have on the right side of the decimal point. Signed 16.16 is a compromise that is relatively easy to use, and gives the same relative importance to numbers between zero and one as to values above one.

In the upcoming examples we also use other fixed-point formats. For example, a 32.32 fixed-point value would be stored using 64 bits and could be converted to a float by dividing it by 2^32, whereas 32.16 would take 48 bits and have 32 integer and 16 decimal bits, and 32.0 would denote a regular 32-bit signed integer. To distinguish between unsigned (such as u0.32) and signed two's complement fixed-point formats, we prefix unsigned formats with u.

In this appendix, we first go through fixed-point processing in C. We then show what you can do by using assembly language, and conclude with a section on fixed-point programming in Java.

A.1 FIXED-POINT METHODS IN C

In this section we first discuss the basic fixed-point operations, followed by the shared exponent approach for vector operations, and conclude with an example that precalculates trigonometric functions in a table.

A.1.1 BASIC OPERATIONS

The addition of two fixed-point numbers is usually very straightforward (and subtraction is just a signed add):

    #define add_fixed_fixed( a, b ) ((a)+(b))

We have to watch out, though: the operation may overflow. As opposed to floats, the overflow is totally silent; there is no warning about the result being wrong.
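To see how quietly this goes wrong, consider the following sketch. It is not from the original text: the helper name add_fixed_wrapping and the use of <stdint.h> types are assumptions made for illustration. Because signed overflow is undefined behavior in standard C, the sketch performs the addition on unsigned values to model the two's complement wraparound that typical 32-bit hardware produces.

```c
#include <stdint.h>

/* Hypothetical helper (not from the text): adds two 16.16 values with
 * explicit two's complement wraparound. The unsigned addition is well
 * defined; converting the result back to int32_t is implementation-defined
 * in older C standards but wraps on mainstream compilers. */
static int32_t add_fixed_wrapping(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}
```

Adding 20000.0 to itself (20000 << 16 in 16.16) should give 40000.0, which is above the largest representable value of just under 32768.0. The wrapped result is -25536.0 (that is, 40000 - 65536), and nothing signals the error.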
Therefore, you should always insert debugging code into your fixed-point math, the main idea being that the results before and after truncating from 64-bit integers to 32-bit integers have to agree.[1] Here is an example of how that can be done:

    #if defined(DEBUG)
    int add_fixed_fixed_chk( int a, int b )
    {
        int64 bigresult = ((int64)a) + ((int64)b);
        /* narrow the 64-bit sum; computing a + b directly in int would
           be undefined behavior in C when it overflows */
        int smallresult = (int)bigresult;
        assert(smallresult == bigresult);
        return smallresult;
    }
    #endif

    #if defined(DEBUG)
    # define add_fixed_fixed( a, b ) add_fixed_fixed_chk( a, b )
    #else
    # define add_fixed_fixed( a, b ) ((a)+(b))
    #endif

[1] Code examples are not directly portable. Minimally you have to select the correct platform 64-bit type. Examples: long long, __int64, int64.

Another point to note is that these fixed-point routines should always be macros or inlined functions, not called through regular functions. The function calling overhead would take away most of the speed benefits of fixed-point programming. For the debug versions, using regular functions is fine, though.

Multiplications are more complicated than additions. Let us analyze the case of multiplying two 16.16 numbers and storing the result into another 16.16 number. When we multiply two 16.16 numbers, the accurate result is a 32.32 number. We ignore the last 16 bits of the result simply by shifting right 16 steps, yielding a 32.16 number. If all the remaining bits are zero, either one or both of the operands were zero, or we underflowed, i.e., the magnitude of the result was too small to be represented in a 16.16 fixed-point number. Similarly, if the result is too large to fit in 16.16, we overflow. But if the result is representable as a 16.16 number, we can simply take the lowest 32 bits. Note that the intermediate result must be stored in a 64-bit integer, unless the magnitude of the result is known to be under 1.0 before multiplication.
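A small worked case may help make these formats concrete. The sketch below is an illustration rather than part of the original text; the function names mul_16_16_wide and narrow_to_16_16 and the <stdint.h> types are assumptions. It exposes the full 32.32 intermediate so that the bit patterns can be inspected.

```c
#include <stdint.h>

/* Hypothetical helper: full-width product of two 16.16 values.
 * The 64-bit result is in 32.32 format (32 fractional bits). */
static int64_t mul_16_16_wide(int32_t a, int32_t b)
{
    return (int64_t)a * (int64_t)b;
}

/* Hypothetical helper: drop 16 fractional bits (giving 32.16) and keep
 * the lowest 32 bits, which is valid only if the value fits in 16.16.
 * Right-shifting a negative value is implementation-defined in C; it is
 * an arithmetic shift on mainstream compilers. */
static int32_t narrow_to_16_16(int64_t wide)
{
    return (int32_t)(wide >> 16);
}
```

For 1.5 x 2.5, the operands are 0x00018000 and 0x00028000. The wide product is 0x00000003C0000000, which is 3.75 in 32.32, and narrowing gives 0x0003C000, which is 3.75 in 16.16. The wide product already needs 34 bits, so a plain 32-bit multiplication would have overflowed before the shift.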
We are finally ready to define multiplication:

    #define mul_fixed_fixed( a, b ) (int)(((int64)(a)*(int64)(b)) >> 16)

If one of the multiplicands is an int, then the inputs are 16.16 and 32.0, the result is 48.16, and we can omit the shift operation:

    #define mul_fixed_int( a, b ) (int)((int64)(a) * (int64)(b))

Multiplications overflow even more easily than additions. The following example shows how you can check for overflows in debug builds:

    #if defined(DEBUG)
    int mul_fixed_fixed_chk( int a, int b )
    {
        int64 bigresult = (((int64)a) * ((int64)b)) >> 16;
        /* bits 31..63 must all be copies of the sign bit
           (all 0's or all 1's) for the result to fit in 32 bits */
        int64 sign = (bigresult >> 31);
        assert( (sign == 0) || (sign == -1) );
        return (int)bigresult;
    }
    #endif

Note also that multiplications by a power of two are typically faster when done with shifts instead of normal multiplication. For example:

    assert((a << 4) == (a * 16));

Let us then see how division works. Dividing two 16.16 numbers directly gives you an integer, and loses precision in the process. However, as we want the result to be 16.16, we should shift the numerator left 16 steps and store it in an int64 before the division. This also avoids losing the fractional bits. Here are several versions of the division with different arguments (fixed or int), producing a 16.16 result:

    #define div_fixed_fixed( a, b ) (int)( (((int64)(a))<<16) / (b) )
    #define div_int_int( a, b )     (int)( (((int64)(a))<<16) / (b) )
    #define div_int_fixed( a, b )   (int)( (((int64)(a))<<32) / (b) )
    #define div_fixed_int( a, b )   ((a) / (b))

These simple versions do not check for overflows, nor do they trap the case b = 0.

Division is usually a much slower operation than multiplication. If the range of divisors is small enough, it may be possible to precalculate a table of reciprocals and perform a multiplication instead. With a wider range one can use a sparse table of reciprocals and interpolate between the nearest entries.
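The reciprocal-table idea can be sketched as follows. This is an illustration under stated assumptions, not the book's code: divisors are assumed to be plain integers in [1, 255], and the names recip_table, init_recip_table, and div_fixed_int_recip are hypothetical. Each table entry holds 1/i as a rounded 16.16 value, so the division becomes a 16.16 multiplication.

```c
#include <stdint.h>

#define RECIP_MAX 256   /* assumed divisor range: 1 .. RECIP_MAX - 1 */

/* recip_table[i] holds 1/i in 16.16 format; entry 0 is unused */
static int32_t recip_table[RECIP_MAX];

/* Fill the table once at startup. */
static void init_recip_table(void)
{
    int i;
    recip_table[0] = 0;
    for (i = 1; i < RECIP_MAX; i++)
        recip_table[i] = (int32_t)((65536 + i / 2) / i);  /* rounded 65536/i */
}

/* Approximate a / b for a 16.16 value a and an integer divisor b,
 * replacing the division with a table lookup and a multiplication.
 * The caller must guarantee 1 <= b < RECIP_MAX. */
static int32_t div_fixed_int_recip(int32_t a, int b)
{
    return (int32_t)(((int64_t)a * (int64_t)recip_table[b]) >> 16);
}
```

Because each reciprocal is rounded to 16 fractional bits, results can differ slightly from a true division: 4.0 / 2 is exact (0x00020000), but 3.0 / 3 comes out as 0x0000FFFF rather than 0x00010000. Whether that error is acceptable depends on the application.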
For slightly more precision, we can incorporate rounding into the fixed-point operations. Rounding works much the same way as when converting a float to a fixed-point number: add 0.5 before truncating to an integer. Since we use integer division in the operations, we just have to add 0.5 before the division. For multiplication this is easy and fairly cheap: since our divider is the fixed value of 1 << 16, we add one half of that, 1 << 15, before the shift:

    #define mul_fixed_fixed_round( a, b ) \
        (int)( ((int64)(a) * (int64)(b) + (1<<15)) >> 16)

Similarly, for correct rounding in division of a by b, we should add b/2 to a before dividing by b.

A.1.2 SHARED EXPONENTS

Sometimes the range that is required for calculations is too great to fit into 32-bit registers. In some of those cases you can still avoid the use of full floating point. For example, you can create your own floating-point operations that do not deal with the trickiest parts of the IEEE standard, e.g., the handling of infinities, NaNs (Not-a-Numbers), or floating-point exceptions.

However, with vector operations, which are often needed in 3D graphics, another possibility is to store the meaningful bits, the mantissas, separately into integers, perform integer calculations using them, and to share the exponent across all terms. For example, if you need to calculate a dot product of a floating-point vector against a vector of integer or fixed-point numbers, you could normalize the floating-point vector to a common base exponent, perform the multiplications and additions in fixed point, and finally, if needed, adjust the base exponent depending on the result. Another name for this practice of shared exponents is block floating point.

Using a shared exponent may lead to underflow, truncating some of the terms to zero. In some cases such truncation may lead to a large error.
Here is a somewhat contrived example of a worst-case error: [1.0e40, 1.0e8, 1.0e8, 1.0e8] · [0, 32768, 32768, 32768]. With a shared exponent the first vector becomes [1, 0, 0, 0] * 1e40, which, when dotted with the second vector, produces a result that is very different from the true answer.

The resulting number sequence, mantissas together with the shared exponent, is really a vectorized floating-point number and needs to be treated as such in the subsequent calculations, up to the point where the exponent can finally be eliminated. It may seem that since the exponent must be normalized in the end in any case, we are not saving much. Keep in mind, though, that the most expensive operations are only performed once for the full dot product. It may even be possible that the required multiplication and addition operations can be done with efficient multiply-and-accumulate (MAC) operations in assembler if the processor supports such operations.

Conversion from floating-point vectors into vectorized floating point is only useful in situations where the cost of conversion can be amortized somehow. For example, if you run 50 dot products where the floating-point vector stays the same and the fixed-point vectors vary, this method can save a lot of computation. An example where you might need this kind of functionality is in your physics library. A software implementation of vertex array transformation by modelview and projection matrices is another example where this approach could be attempted: multiplication of a homogeneous vertex with a 4 × 4 matrix can be done with four dot products.

Many processors support operations that can be used for normalizing the result. For example, ARM processors with the ARMv5 instruction set or later support the CLZ instruction that counts the number of leading zero bits in an integer.
Even when the processor supports these operations, they are typically exposed only as compiler-specific intrinsic functions or through inline assembler. For example, a portable version of count-leading-zeros can be implemented as follows:

    /* Table stores the CLZ value for a byte */
    static unsigned char clz_table[256] = { 8, 7, 6, 6, /* ... */ };

    INLINE int clz_unsigned( unsigned int num )
    {
        int res = 24;
        if (num >> 16) { num >>= 16; res -= 16; }
        if (num > 255) { num >>= 8;  res -= 8; }
        return clz_table[num] + res;
    }

The GCC compiler has a built-in function for CLZ that can be used like this (note that its result is undefined when num is zero):

    INLINE int clz_unsigned( unsigned int num )
    {
        return __builtin_clz(num);
    }

The built-in is compiled to the ARM CLZ opcode when compiling for an ARM target. The performance of the table-based routine depends on the processor architecture, and for some processors it may be faster to calculate the result with arithmetic instructions instead of table lookups. In comparison, the ARM assembly variant of the same thing is:

    INLINE int clz_unsigned( unsigned int num )
    {
        int result;
        __asm { clz result, num }
        return result;
    }

A.1.3 TRIGONOMETRIC OPERATIONS

The use of trigonometric functions such as sin, cos, or arctan can be expensive both in floating-point and fixed-point domains. But since these functions are repeating, symmetric, have a compact range [-1,1], and can sometimes be expressed in terms of each other (e.g., sin(θ + 90°) = cos(θ)), you can precalculate them directly into tables and store the results in fixed point.

A case in point is sin (and from that cos), for which only a 90° segment needs to be tabulated, and the rest can be obtained through the symmetry and continuity properties of sin. Since the table needs to be indexed by an integer, the input parameter needs to be discretized as well.
Quantizing 90° to 1024 steps usually gives a good trade-off between accuracy, table size, and ease of manipulation of angle values (since 1024 is a power of two). The following code precalculates such a table.

    short sintable[1024];
    int ang;
    for( ang = 0; ang < 1024; ang++ )
    {
        /* angle_in_radians = ang/1024 * pi/2 */
        double rad_angle = (ang * PI) / (1024.0 * 2.0);
        sintable[ang] = (short)( -sin(rad_angle) * 32768.0 );
    }

In the loop we first convert the table index into radians. Using that value we evaluate sin and scale the result to the chosen fixed-point range. The values of sin vary from 0.0 to 1.0 within the first quadrant. If we multiply the value 1.0 of sin by 32768.0 and convert to short, the result overflows, since the largest value a 16-bit short can hold is 32767. A solution is to negate the sin values in the table and negate them back after the value is read from the table.

Here is an example function for extracting values for sin. Note that the return value is sin scaled by 32768.0.

    INLINE int fixed_sin( int angle )
    {
        int phase  = angle & (1024 + 2048);
        int subang = angle & 1023;
        if ( phase == 0 )
            return -(int)sintable[ subang ];
        else if ( phase == 1024 )
            return -(int)sintable[ 1023 - subang ];
        else if ( phase == 2048 )
            return (int)sintable[ subang ];
        else
            return (int)sintable[ 1023 - subang ];
    }

A.2 FIXED-POINT METHODS IN ASSEMBLY LANGUAGE

Typically all processors have instructions that are helpful for fixed-point computations. For example, most processors support multiplication of two 32-bit values into a 64-bit result. However, it may be difficult for the compiler to find the optimal instruction sequence for the C code; direct assembly code is sometimes the only way to achieve good performance. Depending on the compiler and the processor, improvements of more than 2× can often be achieved using optimized assembly code.

Let us take the fixed-point multiplication covered earlier as an example.
If you multiply two 32-bit integers, the result will also be a 32-bit integer, which may overflow before you have a chance to shift the result back into a safe range. Even if the target processor supports the optimized multiplication, it may be impossible to get a compiler to generate such assembly instructions. To be safe, you have to promote at least one of the arguments to a 64-bit integer. There are two solutions to this dilemma.

The first (easy) solution is to use a good optimizing compiler that detects the casts around the operands, and then performs a narrower and faster multiplication. You might even be able to study the machine code sequences that the compiler produces to learn how to express operations so that they lead to efficient machine code. The second solution is to use inlined assembly and explicitly use the narrowest multiply that you can get away with.

Here we show an example of how to do fixed-point operations using ARM assembler. The ARM processor is a RISC-type processor with sixteen 32-bit registers (r0-r15), out of which r15 is reserved for the program counter (PC) and r13 for the stack pointer (SP), and r14 is typically used as a link register (LR); the rest are available for arbitrary use. All ARM opcodes can be given a conditional check based on which the operation is either executed or ignored. All data opcodes have three-register forms where a constant shift operation can be applied to the rightmost register operand with no performance cost. For example, the following C code

    INLINE int foo( int a, int b )
    {
        int t = a + (b >> 16);
        if (t < 0) return -t;
        else return t;
    }

executes in just two cycles when converted to ARM:

    adds  r0,r2,r3,asr #16  ; r0 = r2 + (r3 >> 16) and update flags
    rsbmi r0,r0,#0          ; if result (r0) was negative, r0 = 0 - r0 (reverse subtract)

For more details about ARM assembler, see www.arm.com/documentation.
Note that the following examples are not optimized for any particular ARM implementation. The pipelining rules for different ARM variants, as well as for different implementations of each variant, can differ.

The following example code multiplies a u0.32 fixed-point number with another u0.32 fixed-point number and stores the high 32 bits of the result in register r0.

    ; assuming:
    ;   r2 = input value 0
    ;   r3 = input value 1
    umull r1,r0,r2,r3       ; (high:low) r0:r1 = r2*r3
    ; result is directly in r0 register, low bits in r1

In the example above there is no need to actually shift the result by 32, as we can directly store the high bits of the result in the correct register. To fully utilize this increased control of operations and intermediate result ranges, you should combine primitive operations (add, sub, mul) into larger blocks. The following example shows how to compute the dot product of a normalized vec4 with a vertex or a normal vector represented as 16.16 fixed point.

We want to make the code run as fast as possible and we have selected the fixed-point ranges accordingly.
In the example we have chosen the range of the normalized vector of the transformation matrix to be 0.30, as we are going to accumulate the results of four multiplications together, and we need 2 bits of extra room for accumulation:

    ; input:
    ;   r0 = pointer to the 16.16 vector data (will be looped over)
    ;   r1-r4 = vec4 (assumed to be same over N input vectors) X,Y,Z,W
    ;
    ; in the code:
    ;   r8 = high 32 bits of the accumulated 64-bit number
    ;   r7 = low 32 bits of the accumulated 64-bit number

    ldr   r5,[r0],#4        ; r5 = *r0++; (x)
    ldr   r6,[r0],#4        ; r6 = *r0++; (y)
    smull r7,r8,r1,r5       ; multiply X*x: (low:high) r7:r8 = r1 * r5
    ldr   r5,[r0],#4        ; r5 = *r0++; (z)
    smlal r7,r8,r2,r6       ; multiply AND accumulate Y*y
    ldr   r6,[r0],#4        ; r6 = *r0++; (w)
    smlal r7,r8,r3,r5       ; multiply AND accumulate Z*z
    smlal r7,r8,r4,r6       ; multiply AND accumulate W*w
    ; 64-bit output is in r8:r7,
    ; we take the high 32 bits (r8 register) directly

As we implemented the whole operation as one vec4 · vec4 dot product instead of a collection of primitive fixed-point operations, we avoided intermediate shifts and thus improved the accuracy of the result. By using the 0.30 fixed-point format we reduced the accuracy of the input vector by 2 bits, but usually the effect is negligible: remember that even IEEE floats have only 24 significant bits. With careful selection of ranges, we avoided overflows altogether and eliminated a 64-bit shift operation which would require several cycles. By using ARM-specific multiply-and-accumulate instructions that operate directly in 64 bits, we avoided doing 64-bit accumulations that usually require two assembly opcodes: ADD and ADC (add with carry).

In the previous example the multiplication was done in fixed point. If the input values, e.g., vertex positions, are small, some accuracy is lost in the final output because of the fixed position of the decimal point. For more accuracy, the exponents should be tracked as well.
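The accuracy loss for small inputs can be illustrated with a C model of the dot product above. This is a sketch under assumptions rather than a translation endorsed by the original text: the function name dot4_high32 and the <stdint.h> types are hypothetical. Multiplying a value with 30 fractional bits by a 16.16 value gives 46 fractional bits, so keeping only the high 32 bits leaves 14 fractional bits in the result.

```c
#include <stdint.h>

/* Hypothetical C model of the assembly dot product above.
 * m[] holds X,Y,Z,W with 30 fractional bits; v[] holds x,y,z,w in 16.16.
 * The 64-bit sum has 46 fractional bits; the returned high word has 14.
 * Right-shifting a negative value is implementation-defined in C; it is
 * an arithmetic shift on mainstream compilers. */
static int32_t dot4_high32(const int32_t m[4], const int32_t v[4])
{
    int64_t acc = 0;
    int i;
    for (i = 0; i < 4; i++)
        acc += (int64_t)m[i] * (int64_t)v[i];   /* like SMULL / SMLAL */
    return (int32_t)(acc >> 32);                /* keep the high word (r8) */
}
```

With X = 0.5 (1 << 29) and x = 2.0 (0x00020000), the result is 1 << 14, which is 1.0 with 14 fractional bits. With the same X and a tiny input x = 3/65536 (raw value 3), the exact product is 1.5/65536, but the high word is 0: the whole value falls into the discarded low bits.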
In the following example the input matrix is stored in a format where each matrix column has a common exponent and the scalar parts are normalized to that exponent. The code shows how one row is multiplied. Note that this particular variant assumes availability of the ARMv5 instruction CLZ and will thus not run on ARMv4 devices.

    ; input:
    ;   r0 = pointer to the 16.16 vector data
    ;   r1 = pointer to the matrix (format: x0 y0 z0 w0 e0 x1 ...)
    ;
    ; in the code:
    ;   r2-r6 = X,Y,Z,W,E (exponent)

    ldmia r1!,{r2-r6}       ; r2 = *r1++; r3 = *r1++; ... r6 = *r1++;
    ldr   r7,[r0],#4        ; r7 = *r0++; (x)
    smull r8,r9,r2,r7       ; multiply X*x
    ldr   r7,[r0],#4        ; r7 = *r0++; (y)
    smlal r8,r9,r3,r7       ; multiply and accumulate Y*y
    ldr   r7,[r0],#4        ; r7 = *r0++; (z)
    smlal r8,r9,r4,r7       ; multiply and accumulate Z*z
    ldr   r7,[r0],#4        ; r7 = *r0++; (w)
    smlal r8,r9,r5,r7       ; multiply and accumulate W*w

    ; Code below does not do tight normalization (e.g., if
    ; we have number 0x00000000 00000001, we don't return
    ; 0x40000000, but we subtract the exponent with 32 and return
    ; 0x00000001). This is because we do only highest-bit
    ; counting in the high 32 bits of the result. No accuracy
    ; is lost due to this at this stage.
    ;
    ; If tight normalization is required, it can be added with
    ; extra comparisons.

    ; The following opcode (eor) calculates the rough abs(r9)
    ; value. Positive values stay the same, but negative
    ; values are bit-inverted -> outcome of ~abs(-1) = 0 etc.
    ; This is enough for our range calculation. Note that we
    ; use arithmetic shift that extends the sign bits.
    ; It is used to get a mask of 111's for negative numbers
    ; and a mask of 000's for positive numbers.

    eor   r7,r9,r9,asr #31  ; r7 = r9 ^ (r9 >> 31)
    clz   r7,r7             ; Count Leading Zeros of abs(high) [0,32]
    subs  r7,r7,#1          ; We don't shift if CLZ gives 1 (changes sign)

    ; note: if (clz - 1) resulted in -1, we just want to take the high
    ; value of the result and not touch the exponent at all.
    ; This is achieved by appending rest of the opcodes with
    ; PL (plus) conditional.
    ; note2: ARM register shift with zero returns the original value
    ; and register shift with 32 returns zero. The code below
    ; works thus for any shift value from 0 to 32 that can come
    ; from the CLZ instruction above.

    subpl r6,r6,r7          ; subtract from the base exponent
    rsbpl r3,r7,#32         ; calculate 32-shift value to r3
    movpl r9,r9,lsl r7      ; r9 = high bits << (leading zeros - 1)
    orrpl r9,r9,r8,lsr r3   ; r9 |= low bits >> (32 - (leading zeros - 1))
    ; output in r9 (scalar) and r6 (exponent)
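The normalization at the end of the listing can be modeled in C as follows. This sketch is an illustration, not part of the original text: the names clz32 and normalize_high are hypothetical, and <stdint.h> types are assumed. It mirrors the EOR, CLZ, and conditional shift steps, and may be useful as a reference when testing an assembly implementation.

```c
#include <stdint.h>

/* Portable count-leading-zeros; returns 32 for an input of 0. */
static int clz32(uint32_t x)
{
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Hypothetical C model of the normalization. high:low is the signed
 * 64-bit accumulator (r9:r8) and *exponent is the shared exponent (r6).
 * Returns the normalized high word and updates the exponent. */
static int32_t normalize_high(int32_t high, uint32_t low, int32_t *exponent)
{
    /* Same value as EOR with ASR #31: negative inputs are bit-inverted.
     * Written so that no negative number is shifted in C. */
    uint32_t mag = (high < 0) ? ~(uint32_t)high : (uint32_t)high;
    int shift = clz32(mag) - 1;     /* redundant sign bits, 0..31 */

    if (shift > 0) {
        /* C does not define shifts by 32, so shift == 0 is skipped;
         * the assembly ORs in zero in that case, with the same result. */
        uint32_t bits = ((uint32_t)high << shift) | (low >> (32 - shift));
        high = (int32_t)bits;       /* wraps on mainstream compilers */
        *exponent -= shift;
    }
    return high;
}
```

For example, the accumulator 0x00000001 80000000 with exponent 0 normalizes to a high word of 0x60000000 with exponent -30.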