What is the difference between "maximum" and "maximum number" in the description of NEON intrinsics? E.g. (from https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics)
float32x4_t vmaxq_f32 (float32x4_t a, float32x4_t b)
Floating-point Maximum (vector). This instruction compares corresponding vector elements in the two source SIMD&FP registers, places the larger of each of the two floating-point values into a vector, and writes the vector to the destination SIMD&FP register.
and
float32x4_t vmaxnmq_f32 (float32x4_t a, float32x4_t b)
Floating-point Maximum Number (vector). This instruction compares corresponding vector elements in the two source SIMD&FP registers, writes the larger of the two floating-point values into a vector, and writes the vector to the destination SIMD&FP register.
Is it just different treatment of NaNs?
Yes.
The entry for FMAXNM in the Armv8 Architecture Reference Manual says:
NaNs are handled according to the IEEE 754-2008 standard. If one vector element is numeric and the other is a quiet NaN, the result placed in the vector is the numerical value, otherwise the result is identical to FMAX.
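A quick way to see the difference (an untested sketch, assuming an AArch64 target with <arm_neon.h>): with a NaN in one input, vmaxq_f32 propagates the NaN, while vmaxnmq_f32 returns the numeric operand.

#include <arm_neon.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    float32x4_t a = vdupq_n_f32(NAN);   // quiet NaN in every lane
    float32x4_t b = vdupq_n_f32(1.0f);

    float32x4_t m  = vmaxq_f32(a, b);    // each lane: NaN (FMAX propagates NaNs)
    float32x4_t mn = vmaxnmq_f32(a, b);  // each lane: 1.0 (FMAXNM returns the number)

    printf("vmaxq_f32:   %f\n", vgetq_lane_f32(m, 0));
    printf("vmaxnmq_f32: %f\n", vgetq_lane_f32(mn, 0));
    return 0;
}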
I have AVX (no AVX2 or AVX-512). I have a vector of 32-bit values (only the 4 lowest bits of each element are used; the rest are always zero):
[ 1010, 0000, 0000, 0000, 0000, 1010, 1010, 0000]
Internally, I keep the vector as __m256 because of bitwise operations, and the bits represent "float numbers". I need to export a single 8-bit number from the vector, containing 1 for each non-zero element and 0 for each zero element.
So for the above example, I need the 8-bit number 10000110.
My idea is to use _mm256_cmp_ps and then _mm256_movemask_ps. However, for the cmp, I don't know whether it will work correctly when the values are not exactly floats and can be any "junk". In that case, which comparison predicate should I use for cmp?
Or is there any other solution?
Conceptually, what you're doing should work. Floats with the upper 24 bits zero are valid floats. However, they are denormal.
While it should work, there are two potential problems:
If the FP mode is set to flush denormals to zero, then they will all be treated as zero (thus breaking that approach).
Because these are denormals, you may end up taking massive performance penalties depending on whether the hardware can handle them natively.
Alternative Approach:
Since the upper 24 bits are zero, you can normalize them. Then do the floating-point comparison.
(Warning: untested code)
#include <immintrin.h>

int to_mask(__m256 data) {
    const __m256 MASK = _mm256_set1_ps(8388608.);  // 2^23
    data = _mm256_or_ps(data, MASK);               // each element becomes 2^23 + x
    data = _mm256_cmp_ps(data, MASK, _CMP_NEQ_UQ); // all-ones where x != 0
    return _mm256_movemask_ps(data);               // collect the sign bits
}
Here, data is your input where the upper 24 bits of each "float" are zero. Let's call each of these 8-bit integers x.
OR'ing with the bit pattern of 2^23 sets the exponent field, so each element becomes a normalized float with value 2^23 + x.
Then you compare against 2^23 as a float, which sets the mask bit only where x is non-zero.
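For example, feeding in the vector from the question (an untested sketch that assumes the to_mask function above; element 0 is listed first, and movemask maps element i to bit i):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    // the example vector [1010, 0000, 0000, 0000, 0000, 1010, 1010, 0000]
    __m256i bits = _mm256_setr_epi32(0xA, 0, 0, 0, 0, 0xA, 0xA, 0);
    int mask = to_mask(_mm256_castsi256_ps(bits));
    printf("%02x\n", mask);  // prints 61: bits 0, 5 and 6 set
    return 0;
}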
Alternate answer, for future readers who do have AVX2:
You can cast to __m256i and use SIMD integer compares.
This avoids any problems with DAZ treating these small-integer bit patterns as exactly zero, and any microcode assists for denormal (aka subnormal) inputs.
There might be 1 extra cycle of bypass latency between vpcmpeqd and vmovmskps on some CPUs, but you still come out ahead because the integer compare has lower latency than an FP compare.
#include <immintrin.h>

int nonzero_positions_avx2(__m256 v)
{
    __m256i vi = _mm256_castps_si256(v);
    // lanes equal to zero become all-ones, non-zero lanes become all-zeros
    vi = _mm256_cmpeq_epi32(vi, _mm256_setzero_si256());
    // movemask collects the sign bits; invert so 1 marks the non-zero lanes
    return ~_mm256_movemask_ps(_mm256_castsi256_ps(vi)) & 0xFF;
}
I would like to know how vminnm works. Since the pseudo-code is a bit unclear, I am not able to understand the exact function of this instruction.
vminnm.f32 d3, d5, d13
where
d5 = 0xffd5432100000000
d13 = 0x7ff0056000000000
Result:
d3 = 0x7fc0000000000000
How do we arrive at this result?
This is the definition in the ARM Reference Manual for that instruction.
(NaN means Not a Number: the value is not a valid floating-point number.)
VMINNM
This instruction determines the floating point minimum number.
It handles NaNs in consistence with the IEEE754-2008 specification. It returns the numerical operand when one operand is numerical and the other is a quiet NaN, but otherwise the result is identical to floating-point VMIN.
This instruction is not conditional.
vminnm.f32 d3, d5, d13
In your example, the values in d5 and d13 are compared element by element and the resulting minima are stored in d3. Take into consideration that you are dealing with vectors: each register holds two elements, each a 32-bit floating-point value.
The value 0xffd5432100000000 is a valid 64-bit double, but not two valid 32-bit floats: 0xffd54321 is not a number (NaN) and 0x00000000 is 0, so when you compare these values you need to be aware of the width of the elements you are comparing. (You can check the values of floating points here.)
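Working through the example lane by lane: the upper lanes 0xffd54321 and 0x7ff00560 both have an all-ones exponent field and a set mantissa MSB, so both are quiet NaNs; with NaNs in both operands there is no numerical value to return, and (assuming the FPSCR default-NaN behavior) the result lane is the default quiet NaN, 0x7fc00000. The lower lanes are both 0x00000000, so their minimum is 0. Concatenating the lanes gives d3 = 0x7fc0000000000000.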
New to C programming, and I've been told to avoid unions which in general makes perfect sense and I agree with. However, as part of an academic exercise I'm writing an emulator for hardware single-precision floating point addition by doing bit manipulation operations on unsigned 32-bit integers. I only mention that to explain why I want to use unions; I'm having no trouble with the emulation.
In order to test this emulator, I wrote a test program. But of course I'm trying to find the bit representation of floats on my hardware, so I thought this could be the perfect use for a union. I wrote this union:
typedef union {
    float floatRep;
    uint32_t unsignedIntRep;
} FloatExaminer;
This way, I can initialize a float with the floatRep member and then examine the bits with the unsignedIntRep member.
This worked most of the time, but when I got to NaN addition, I started running into trouble. The exact situation was that I wrote a function to automate these tests. The gist of it was this:
void addTest(float op1, float op2){
    FloatExaminer result;
    result.floatRep = op1 + op2;
    printf("%f + %f = %f\n", op1, op2, result.floatRep);
    // print bit pattern as well
    printf("Bit pattern of result: %08x\n", result.unsignedIntRep);
}
OK, now for the confusing part:
I added two NaNs with different mantissa bit patterns so I could differentiate between them. On my particular hardware, the addition is supposed to return the second NaN operand (making it quiet if it was signalling). (I'll explain how I know this below.) However, passing the bit patterns op1=0x7fc00001, op2=0x7fc00002 would return op1, 0x7fc00001, every time!
I know it's supposed to return the second operand because I tried--outside the function--initializing as an integer and casting to a float as below:
uint32_t intRep1 = 0x7fc00001;
uint32_t intRep2 = 0x7fc00002;
float *op1 = (float *) &intRep1;
float *op2 = (float *) &intRep2;
float result = *op1 + *op2;
uint32_t *intResult = (uint32_t *)&result;
printf("%08x", *intResult); //bit pattern 0x7fc00002
In the end, I've concluded that unions are evil and I should never use them. However, does anyone know why I'm getting the result I am? Did I make a stupid mistake or assumption? (I understand that hardware architecture varies, but this just seems bizarre.)
I'm assuming that when you say "my particular hardware", you are referring to an Intel processor using SSE floating point. But in fact, that architecture has a different rule, according to the Intel® 64 and IA-32 Architectures Software Developer's Manual. Here's a summary of Table 4.7 ("Rules for handling NaNs") from Volume 1 of that documentation, which describes the handling of NaNs in arithmetic operations (QNaN is a quiet NaN; SNaN is a signalling NaN; I've only included information about two-operand instructions):
SNaN and QNaN
x87 FPU — QNaN source operand.
SSE — First source operand, converted to a QNaN.
Two SNaNs
x87 FPU — SNaN source operand with the larger significand, converted to a QNaN.
SSE — First source operand, converted to a QNaN.
Two QNaNs
x87 FPU — QNaN source operand with the larger significand.
SSE — First source operand.
NaN and a floating-point value
x87/SSE — NaN source operand, converted to a QNaN.
SSE floating point machine instructions generally have the form op xmm1, xmm2/m32, where the first operand is the destination register and the second operand is either a register or a memory location. The instruction will then do, in effect, xmm1 <- xmm1 (op) xmm2/m32, so the first operand is both the left-hand operand of the operation and the destination. That's the meaning of "first operand" in the above chart. AVX adds three-operand instructions, where the destination might be a different register; it is then the third operand and does not figure in the above chart. The x87 FPU uses a stack-based architecture, where the top of the stack is always one of the operands and the result replaces either the top of the stack or the other operand; it will be noted that the rules in the above chart do not attempt to decide which operand is "first", relying instead on a simple comparison.
Now, suppose we're generating code for an SSE machine, and we have to handle the C statement:
a = b + c;
where none of those variables are in a register. That means we might emit code something like this: (I'm not using real instructions here, but the principle is the same)
LOAD r1, b (r1 <- b)
ADD r1, c (r1 <- r1 + c)
STORE r1, a (a <- r1)
But we could also do this, with (almost) the same result:
LOAD r1, c (r1 <- c)
ADD r1, b (r1 <- r1 + b)
STORE r1, a (a <- r1)
That will have precisely the same effect, except for additions involving NaNs (and only when using SSE). Since arithmetic involving NaNs is unspecified by the C standard, there is no reason why the compiler should care which of these two options it chooses. In particular, if r1 happened to already have the value c in it, the compiler would probably choose the second option, since it saves a load instruction. (And who is going to complain? We all want the compiler to generate code which runs as quickly as possible, no?)
So, in short, the order of the operands of the ADD instruction will vary with the intricate details of how the compiler chooses to optimize the code, and the particular state of the registers at the moment the addition operator is being emitted. It is possible that this will be affected by the use of a union, but it is equally or more likely that it has to do with the fact that in your code using the union, the values being added are arguments to the function and are therefore already placed in registers.
Indeed, different versions of gcc, and different optimization settings, produce different results for your code. And forcing the compiler to emit x87 FPU instructions produces yet different results, because the hardware operates according to a different logic.
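A minimal sketch you can experiment with (untested; from_bits/to_bits are hypothetical helpers, and memcpy is used for the reinterpretation to avoid strict-aliasing problems). Depending on the compiler, optimization level, and register allocation, either payload may survive:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static float from_bits(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }
static uint32_t to_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }

int main(void) {
    float a = from_bits(0x7fc00001);  // quiet NaN, payload 1
    float b = from_bits(0x7fc00002);  // quiet NaN, payload 2
    // On SSE, the surviving payload is that of the "first" operand, i.e.
    // whichever value the compiler placed in the destination register.
    printf("a+b = %08x\n", to_bits(a + b));
    printf("b+a = %08x\n", to_bits(b + a));
    return 0;
}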
Note:
If you want some bedtime reading, you can download the entire Intel SDM (currently 4,684 pages / 23.3MB, but it keeps on getting bigger) from their site.
To implement real numbers between 0 and 1, one usually uses ANSI floats or doubles. But fixed precision numbers between 0 and 1 (decimals modulo 1) can be efficiently implemented as 32 bit integers or 16 bit words, which add like normal integers/words, but which multiply the "wrong way", meaning that when you multiply X times Y, you keep the high order bits of the product. This is equivalent to multiplying 0.X and 0.Y, where all the bits of X are behind the decimal point. Likewise, signed numbers between -1 and 1 are also implementable this way with one extra bit and a shift.
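For the unsigned mod-1 case, a plain-C sketch of the "keep the high bits" multiply (mul_mod1 is a hypothetical helper name; it assumes 32-bit fractions and a 64-bit intermediate):

#include <stdint.h>

// a and b represent the fractions a/2^32 and b/2^32; their product is
// (a*b)/2^64, so the high 32 bits of the 64-bit product give the result
// as a fraction over 2^32 again
uint32_t mul_mod1(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

For instance, 0x80000000 represents 0.5, and mul_mod1(0x80000000, 0x80000000) yields 0x40000000, i.e. 0.25.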
How would one implement fixed-precision mod 1 or mod 2 in C (especially using MMX or SSE)?
I think this representation could be useful for efficient representation of unitary matrices, for numerically intensive physics simulations. Integer quantities make for better use of MMX/SSE, but you need higher-level access to PMULHW.
If 16 bit fixed point arithmetic is sufficient and you are on x86 or a similar architecture, you can directly use SSE.
The SSSE3 instruction pmulhrsw directly implements signed Q0.15 fixed-point multiplication (mod 2 as you call it, from -1..+1) in hardware. Addition is no different from the standard 16-bit vector operations, just using paddw.
So a library which handles multiplication and addition of eight signed 16 bit fixed point variables at a time could look like this:
#include <tmmintrin.h>  // SSSE3 intrinsics

typedef __m128i fixed16_t;  // eight signed Q0.15 values

fixed16_t mul(fixed16_t a, fixed16_t b) {
    return _mm_mulhrs_epi16(a, b);  // rounded high half of the 16x16 product
}

fixed16_t add(fixed16_t a, fixed16_t b) {
    return _mm_add_epi16(a, b);
}
Permission granted to use it in any way you like ;-)
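For reference, this is what each 16-bit lane of pmulhrsw computes, written out in scalar C (a sketch, with a hypothetical helper name):

#include <stdint.h>

// per-lane operation of pmulhrsw: the rounded high half of the
// 16x16-bit signed product, i.e. (a*b + 0x4000) >> 15
int16_t mulhrs_scalar(int16_t a, int16_t b) {
    return (int16_t)(((int32_t)a * b + 0x4000) >> 15);
}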
I have a fairly complicated function that takes several double values that represent two vectors in 3-space of the form (magnitude, latitude, longitude) where latitude and longitude are in radians, and an angle. The purpose of the function is to rotate the first vector around the second by the angle specified and return the resultant vector. I have already verified that the code is logically correct and works.
The expected purpose of the function is for graphics, so double precision is not necessary; however, on the target platform, trig (and sqrt) functions that take floats (sinf, cosf, atan2f, asinf, acosf and sqrtf specifically) work faster on doubles than on floats (probably because the instruction to calculate such values may actually require a double; if a float is passed, the value must be cast to a double, which requires copying it to an area with more memory -- i.e. overhead). As a result, all of the variables involved in the function are double precision.
Here is the issue: I am trying to optimize my function so that it can be called more times per second. I have therefore replaced the calls to sin, cos, sqrt, et cetera with calls to the floating point versions of those functions, as they result in a 3-4 times speed increase overall. This works for almost all inputs; however, if the input vectors are close to parallel with the standard unit vectors (i, j, or k), round-off errors for the various functions build up enough to cause later calls to sqrtf or inverse trig functions (asinf, acosf, atan2f) to pass arguments that are just barely outside of the domain of those functions.
So, I am left with this dilemma: either I can call only double-precision functions and avoid the problem (ending up with a limit of about 1,300,000 vector operations per second), or I can try to come up with something else. Ultimately, I would like a way to sanitize the input to the inverse trig functions to take care of edge cases (it is trivial to do so for sqrtf: just use fabsf). Branching is not an option, as even a single conditional statement adds so much overhead that any performance gains are lost.
So, any ideas?
Edit: someone expressed confusion over my using doubles versus floating-point operations. The function is much faster if I actually store all my values in double-size containers (i.e. double-type variables) than if I store them in float-size containers. However, single-precision trig operations are faster than double-precision trig operations for obvious reasons.
Basically, you need to find a numerically stable algorithm that solves your problem. There are no generic solutions to this kind of thing; it needs to be done for your specific case, using concepts such as the condition number of the individual steps. And it may in fact be impossible if the underlying problem is itself ill-conditioned.
Single precision floating point inherently introduces error. So, you need to build your math so that all comparisons have a certain degree of "slop" by using an epsilon factor, and you need to sanitize inputs to functions with limited domains.
The former is easy enough when branching, e.g.
bool IsAlmostEqual( float a, float b ) { return fabsf(a-b) < 0.001f; }             // absolute error
bool IsAlmostEqual( float a, float b ) { return fabsf(a-b) < fabsf(a) * 0.0001f; } // relative error
but that's messy. Clamping domain inputs is a little trickier, but better. The key is to use conditional move operators, which in general do something like
float ExampleOfConditionalMoveIntrinsic( float comparand, float a, float b )
{ return comparand >= 0.0f ? a : b ; }
in a single op, without incurring a branch.
These vary depending on architecture. On the x87 floating point unit you can do it with the FCMOV conditional-move op, but that is clumsy because it depends on condition flags being set previously, so it's slow. Also, there isn't a consistent compiler intrinsic for cmov. This is one of the reasons why we avoid x87 floating point in favor of SSE2 scalar math where possible.
Conditional move is much better supported in SSE by pairing a comparison operator with a bitwise AND. This is preferable even for scalar math:
// assuming you've already used _mm_load_ss to load your floats onto registers
__m128 fsel( __m128 comparand, __m128 a, __m128 b )
{
    __m128 zero = _mm_setzero_ps();
    // set low word of mask to all 1s if comparand > 0
    __m128 mask = _mm_cmpgt_ss( comparand, zero );
    a = _mm_and_ss( a, mask );     // a = a & mask
    b = _mm_andnot_ss( mask, b );  // b = ~mask & b
    return _mm_or_ss( a, b );      // return a | b
}
Compilers are better, but not great, about emitting this sort of pattern for ternaries when SSE2 scalar math is enabled. You can do that with the compiler flag /arch:sse2 on MSVC or -mfpmath=sse on GCC.
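For the specific problem in the question, clamping inputs into the domain of asinf/acosf, you don't even need the compare-and-mask trick: SSE's scalar min/max operations do a branch-free clamp directly. An untested sketch (clamp_unit is a hypothetical helper name):

#include <xmmintrin.h>

// branch-free clamp of x to [-1.0f, +1.0f] using scalar SSE min/max
static inline float clamp_unit(float x)
{
    __m128 v = _mm_set_ss(x);
    v = _mm_min_ss(v, _mm_set_ss(1.0f));   // v = min(v, 1.0)
    v = _mm_max_ss(v, _mm_set_ss(-1.0f));  // v = max(v, -1.0)
    _mm_store_ss(&x, v);
    return x;  // now safe to pass to asinf/acosf
}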
On the PowerPC and many other RISC architectures, fsel() is a hardware opcode and thus usually a compiler intrinsic as well.
Have you looked at the Graphics Programming Black Book or perhaps handing the calculations off to your GPU?