3.11.6 Floating-Point Comparison Operations
AVX2 provides two instructions for comparing floating-point values:
Instruction        Based on    Description
ucomiss S1, S2     S2 - S1     Compare single precision
ucomisd S1, S2     S2 - S1     Compare double precision
These instructions are similar to the CMP instructions (see Section 3.6), in that they compare operands S1 and S2 (but in the opposite order one might expect) and set the condition codes to indicate their relative values. As with cmpq, they follow the ATT-format convention of listing the operands in reverse order. Argument S2 must be in an XMM register, while S1 can be either in an XMM register or in memory.
The floating-point comparison instructions set three condition codes: the zero flag ZF, the carry flag CF, and the parity flag PF. We did not document the parity flag in Section 3.6.1, because it is not commonly found in GCC-generated x86 code.
For integer operations, this flag is set when the most recent arithmetic or logical operation yielded a value where the least significant byte has even parity (i.e., an even number of ones in the byte). For floating-point comparisons, however, the flag is set when either operand is NaN. By convention, any comparison in C is considered to fail when one of the arguments is NaN, and this flag is used to detect such a condition. For example, even the comparison x == x yields 0 when x is NaN.
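The NaN convention can be observed directly from C. The following is a minimal sketch, assuming an IEEE 754 floating-point implementation that provides the NAN macro in <math.h> and a build without options such as -ffast-math that relax NaN handling. The names all_comparisons_fail and check_nan_comparisons are hypothetical helpers introduced only for this illustration.

```c
#include <assert.h>
#include <math.h>   /* NAN */

/* Hypothetical helper: returns 1 when every comparison involving x
 * evaluates to 0, which is the behavior described for NaN. */
int all_comparisons_fail(float x)
{
    return !(x < 0.0f) && !(x == 0.0f) && !(x > 0.0f) && !(x == x);
}

/* Hypothetical self-check showing the rule on a few sample values. */
void check_nan_comparisons(void)
{
    assert(all_comparisons_fail(NAN));     /* NaN is unordered with everything */
    assert(!all_comparisons_fail(0.0f));   /* 0.0 == 0.0 holds */
    assert(!all_comparisons_fail(-3.5f));  /* -3.5 < 0 holds */
}
```

Only a NaN argument makes all four tests fail at once; for any ordinary value, x == x alone is enough to make the helper return 0.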
The condition codes are set as follows:

Ordering S2:S1    CF    ZF    PF
Unordered          1     1     1
S2 < S1            1     0     0
S2 = S1            0     1     0
S2 > S1            0     0     0
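The table above can be modeled in C to make the flag settings concrete. This is only a sketch under stated assumptions: the type fp_flags and the functions ucomiss_model, ja_taken, jb_taken, je_taken, jp_taken, and setbe_value are hypothetical names, not part of any library, and the branch conditions are the usual unsigned-comparison definitions (ja: CF and ZF both clear; jb: CF set; setbe: CF or ZF set).

```c
#include <assert.h>
#include <math.h>   /* isnan, NAN */

/* Hypothetical record of the three flags written by ucomiss S1, S2. */
typedef struct { int cf, zf, pf; } fp_flags;

/* Model of the table: the comparison is based on S2 relative to S1. */
fp_flags ucomiss_model(float s1, float s2)
{
    fp_flags f = {0, 0, 0};
    if (isnan(s1) || isnan(s2)) {
        f.cf = f.zf = f.pf = 1;   /* unordered: all three flags set */
    } else if (s2 < s1) {
        f.cf = 1;
    } else if (s2 == s1) {
        f.zf = 1;
    }                             /* s2 > s1: all flags remain clear */
    return f;
}

/* Conditions tested by the branch and set instructions. */
int ja_taken(fp_flags f)    { return !f.cf && !f.zf; }
int jb_taken(fp_flags f)    { return f.cf; }
int je_taken(fp_flags f)    { return f.zf; }
int jp_taken(fp_flags f)    { return f.pf; }
int setbe_value(fp_flags f) { return f.cf || f.zf; }

/* Hypothetical self-check: one assertion group per row of the table. */
void check_flag_table(void)
{
    fp_flags un = ucomiss_model(NAN, 1.0f);
    fp_flags lt = ucomiss_model(2.0f, 1.0f);   /* S2 < S1 */
    fp_flags eq = ucomiss_model(1.0f, 1.0f);   /* S2 = S1 */
    fp_flags gt = ucomiss_model(1.0f, 2.0f);   /* S2 > S1 */
    assert(un.cf && un.zf && un.pf && jp_taken(un));
    assert(lt.cf && !lt.zf && !lt.pf && jb_taken(lt));
    assert(!eq.cf && eq.zf && !eq.pf && je_taken(eq));
    assert(!gt.cf && !gt.zf && !gt.pf && ja_taken(gt));
}
```

Note that in this model je_taken also holds for the unordered case, because ZF is set there too. Code that must distinguish equality from NaN therefore needs to test the parity flag before testing for equality.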
The unordered case occurs when either operand is NaN. This can be detected with the parity flag. Commonly, the jp (for "jump on parity") instruction is used to conditionally jump when a floating-point comparison yields an unordered result. Except for this case, the values of the carry and zero flags are the same as those for an unsigned comparison: ZF is set when the two operands are equal, and CF is set when S2 < S1. Instructions such as ja and jb are used to conditionally jump on various combinations of these flags.

(a) C code

typedef enum {NEG, ZERO, POS, OTHER} range_t;

range_t find_range(float x)
{
    int result;
    if (x < 0)
        result = NEG;
    else if (x == 0)
        result = ZERO;
    else if (x > 0)
        result = POS;
    else
        result = OTHER;
    return result;
}

(b) Generated assembly code

    range_t find_range(float x)
    x in %xmm0
1   find_range:
2     vxorps    %xmm1, %xmm1, %xmm1      Set %xmm1 = 0
3     vucomiss  %xmm0, %xmm1             Compare 0:x
4     ja        .L5                      If >, goto neg
5     vucomiss  %xmm1, %xmm0             Compare x:0
6     jp        .L8                      If NaN, goto posornan
7     movl      $1, %eax                 result = ZERO
8     je        .L3                      If =, goto done
9   .L8:                               posornan:
10    vucomiss  .LC0(%rip), %xmm0        Compare x:0
11    setbe     %al                      Set result = NaN ? 1 : 0
12    movzbl    %al, %eax                Zero-extend
13    addl      $2, %eax                 result += 2 (POS for > 0, OTHER for NaN)
14    ret                                Return
15  .L5:                               neg:
16    movl      $0, %eax                 result = NEG
17  .L3:                               done:
18    rep; ret                           Return

Figure 3.51 Illustration of conditional branching in floating-point code.
As an example of floating-point comparisons, the C function of Figure 3.51(a) classifies argument x according to its relation to 0.0, returning an enumerated type as the result. Enumerated types in C are encoded as integers, and so the possible function values are: 0 (NEG), 1 (ZERO), 2 (POS), and 3 (OTHER). This final outcome occurs when the value of x is NaN.
GCC generates the code shown in Figure 3.51(b) for find_range. The code is not very efficient: it compares x to 0.0 three times, even though the required information could be obtained with a single comparison. It also generates floating-point constant 0.0 twice, once using vxorps and once by reading the value from memory. Let us trace the flow of the function for the four possible comparison results:
x < 0.0    The ja branch on line 4 will be taken, jumping to the end with a return value of 0.
x = 0.0    The ja (line 4) and jp (line 6) branches will not be taken, but the je branch (line 8) will, returning with %eax equal to 1.
x > 0.0    None of the three branches will be taken. The setbe (line 11) will yield 0, and this will be incremented by the addl instruction (line 13) to give a return value of 2.
x = NaN    The jp branch (line 6) will be taken. The third vucomiss instruction (line 10) will set both the carry and the zero flag, and so the setbe instruction (line 11) and the following instruction will set %eax to 1. This gets incremented by the addl instruction (line 13) to give a return value of 3.
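The four traces above can be checked with a goto-style C rendering of the assembly's control flow. This is a hand-written sketch rather than compiler output: find_range_goto and check_find_range are hypothetical names, the label names follow the annotations in Figure 3.51(b), and each flag test is expressed as the C comparison it is assumed to correspond to.

```c
#include <assert.h>
#include <math.h>   /* isnan, NAN */

typedef enum {NEG, ZERO, POS, OTHER} range_t;

/* The function of Figure 3.51(a). */
range_t find_range(float x)
{
    int result;
    if (x < 0)       result = NEG;
    else if (x == 0) result = ZERO;
    else if (x > 0)  result = POS;
    else             result = OTHER;
    return result;
}

/* Hypothetical goto version following the branches of Figure 3.51(b). */
range_t find_range_goto(float x)
{
    int result;
    if (0.0f > x)             /* lines 3-4: ja, false when x is NaN */
        goto neg;
    if (isnan(x))             /* lines 5-6: jp, taken when unordered */
        goto posornan;
    result = ZERO;            /* line 7 */
    if (x == 0.0f)            /* line 8: je */
        goto done;
posornan:
    result = !(x > 0.0f);     /* lines 10-12: setbe gives 1 unless x > 0 */
    result += 2;              /* line 13: POS for x > 0, OTHER for NaN */
    return result;            /* line 14 */
neg:
    result = NEG;             /* line 16 */
done:
    return result;            /* line 18 */
}

/* Hypothetical self-check: one input per traced case. */
void check_find_range(void)
{
    float inputs[4] = {-1.5f, 0.0f, 2.5f, NAN};
    range_t expected[4] = {NEG, ZERO, POS, OTHER};
    for (int i = 0; i < 4; i++) {
        assert(find_range(inputs[i]) == expected[i]);
        assert(find_range_goto(inputs[i]) == expected[i]);
    }
}
```

The sketch keeps the redundant comparisons of the generated code on purpose, so that each C test lines up with one vucomiss instruction and the branch that follows it.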
In Homework Problems 3.73 and 3.74, you are challenged to hand-generate more efficient implementations of find_range.
Practice Problem
Function funct3 has the following prototype:
double funct3(int *ap, double b, long c, float *dp);
For this function, GCC generates the following code:
double funct3(int *ap, double b, long c, float *dp)
ap in %rdi, b in %xmm0, c in %rsi, dp in %rdx
funct3:
  vmovss      (%rdx), %xmm1
  vcvtsi2sd   (%rdi), %xmm2, %xmm2
  vucomisd    %xmm2, %xmm0
  jbe         .L8
  vcvtsi2ssq  %rsi, %xmm0, %xmm0
  vmulss      %xmm1, %xmm0, %xmm1
  vunpcklps   %xmm1, %xmm1, %xmm1
  vcvtps2pd   %xmm1, %xmm0
  ret
.L8:
  vaddss      %xmm1, %xmm1, %xmm1
  vcvtsi2ssq  %rsi, %xmm0, %xmm0
  vaddss      %xmm1, %xmm0, %xmm0
  vunpcklps   %xmm0, %xmm0, %xmm0
  vcvtps2pd   %xmm0, %xmm0
  ret
Write a C version of funct3.
3.11.7 Observations about Floating-Point Code
We see that the general style of machine code generated for operating on floating-point data with AVX2 is similar to what we have seen for operating on integer data. Both use a collection of registers to hold and operate on values, and they use these registers for passing function arguments.
Of course, there are many complexities in dealing with the different data types and the rules for evaluating expressions containing a mixture of data types, and AVX2 code involves many more different instructions and formats than is usually seen with functions that perform only integer arithmetic.
AVX2 also has the potential to make computations run faster by performing parallel operations on packed data. Compiler developers are working on automating the conversion of scalar code to parallel code, but currently the most reliable way to achieve higher performance through parallelism is to use the extensions to the C language supported by GCC for manipulating vectors of data. See Web Aside OPT:SIMD on page 546 to see how this can be done.
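As a small taste of those extensions, the following sketch uses GCC's vector_size attribute to operate on eight floats at once. It is an illustration under stated assumptions, not the method of the Web Aside: vec8_t, muladd8, and check_muladd8 are hypothetical names, the code requires GCC or a compatible compiler, and whether the arithmetic is actually compiled to 256-bit AVX instructions depends on the target options (for example -mavx2).

```c
#include <assert.h>
#include <string.h>   /* memcpy */

/* GCC vector extension: a 32-byte vector holding eight floats. */
typedef float vec8_t __attribute__((vector_size(32)));

/* Hypothetical helper: out[i] = a[i] * b[i] + c[i] for i = 0..7. */
void muladd8(const float *a, const float *b, const float *c, float *out)
{
    vec8_t va, vb, vc, vr;
    memcpy(&va, a, sizeof va);   /* load eight floats into each vector */
    memcpy(&vb, b, sizeof vb);
    memcpy(&vc, c, sizeof vc);
    vr = va * vb + vc;           /* element-wise on all eight lanes */
    memcpy(out, &vr, sizeof vr); /* store the eight results */
}

/* Hypothetical self-check using values that are exact in float. */
void check_muladd8(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {2, 2, 2, 2, 2, 2, 2, 2};
    float c[8] = {0.5f, 0.5f, 0.5f, 0.5f, 0.5f, 0.5f, 0.5f, 0.5f};
    float out[8];
    muladd8(a, b, c, out);
    for (int i = 0; i < 8; i++)
        assert(out[i] == a[i] * 2.0f + 0.5f);
}
```

The operators * and + apply to whole vectors, so the source contains no explicit loop over the lanes. Passing arrays through pointers and copying with memcpy avoids alignment requirements on the caller's data.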