Have problem understanding this assembly code from CS:APP [duplicate] - c

I am reading about x86-64 (and assembly in general) through the book "Computer Systems: A Programmer's Perspective" (3rd edition). The author, in agreement with other sources from the web, states that idivq takes only one operand - just as this one claims. But then, some chapters later, the author gives an example with the instruction idivq $9, %rcx.
Two operands? I first thought this was a mistake, but it happens repeatedly in the book from that point on.
Also, the dividend is supposed to come from the quantity in registers %rdx (high-order 64 bits) and %rax (low-order 64 bits) - so if that is fixed by the architecture, it does not seem possible for the second operand to specify the dividend.
Here is an example of such an exercise (shown as a picture in the book rather than typed out here). It claims that GCC emits idivq $9, %rcx when compiling a short C function.

That's a mistake. Only imul has immediate and 2-register forms.
mul, div, or idiv still only exist in the one-operand form introduced with 8086, using RDX:RAX as the implicit double-width operand for output (and input for division).
Or EDX:EAX, DX:AX, or AH:AL, depending on operand-size of course. Consult an ISA reference like Intel's manual, not this book! https://www.felixcloutier.com/x86/idiv
Also see When and why do we sign extend and use cdq with mul/div? and Why should EDX be 0 before using the DIV instruction?
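For reference, the only valid syntax looks like this (a minimal sketch; the one explicit operand is the divisor, and it can be a register or memory, never an immediate):

# signed: rax = rdi / rsi, rdx = rdi % rsi   (AT&T syntax)
movq    %rdi, %rax        # low half of the dividend must be in RAX
cqto                      # sign-extend RAX into RDX:RAX (AT&T name for CQO)
idivq   %rsi              # one explicit operand only: the divisor
                          # quotient -> RAX, remainder -> RDX
# unsigned: replace cqto with  xor %edx, %edx  and use divq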
x86-64's only hardware division instructions are idiv and div. 64-bit mode removed aam, which does 8-bit division by an immediate. (Dividing in Assembler x86 and Displaying Time in Assembly has an example of using aam in 16-bit mode).
Of course for division by constants idiv and div (and aam) are very inefficient. Use shifts for powers of 2, or a multiplicative inverse otherwise, unless you're optimizing for code-size instead of performance.
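For instance, a signed x/8 can be done with a shift plus a fix-up for negative dividends (a minimal sketch with the value in/out in %rax, clobbering %rcx; compilers pick registers differently):

leaq    7(%rax), %rcx     # x + 7: bias needed so negative values truncate toward zero
testq   %rax, %rax
cmovs   %rcx, %rax        # use the biased value only if x < 0
sarq    $3, %rax          # arithmetic right shift by 3 = divide by 8
# unsigned x/8 is just  shrq $3, %rax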
CS:APP 3e Global Edition apparently has multiple serious x86-64 instruction-set mistakes like this in practice problems, claiming that GCC emits impossible instructions. Not just typos or subtle mistakes, but misleading nonsense that's very obviously wrong to people familiar with the x86-64 instruction set. It's not just a syntax mistake, it's trying to use instructions that aren't encodable (no syntax can exist to express them, other than a macro that expands to multiple instructions; defining idivq as a pseudo-instruction using a macro would be pretty weird).
e.g. I correctly guessed missing part of a function, but gcc generated assembly code doesn't match the answer is another one where it suggests that (%rbx, %rdi, %rsi) and (%rsi, %rsi, 9) are valid addressing modes! The scale factor is actually a 2-bit shift count so these are total garbage and a sign of a serious lack of knowledge by the authors about the ISA they're teaching, not a typo.
Their code won't assemble with any AT&T syntax assembler.
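For comparison, the scale in a SIB byte is a 2-bit shift count, so only 1, 2, 4, or 8 are encodable; multiplying an index by 9 needs base + scaled-index, e.g.:

leaq    (%rsi,%rsi,8), %rax    # rax = rsi + rsi*8 = rsi*9: valid, scale = 8
# leaq  (%rsi,%rsi,9), %rax    # rejected by gas: 9 is not an encodable scale factor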
Also What does this x86-64 addq instruction mean, which only have one operand? (From CSAPP book 3rd Edition) is another example, where they have a nonsensical addq %eax instead of inc %rdx, and a mismatched operand-size in a mov store.
It seems that they're just making stuff up and claiming it was emitted by GCC. IDK if they start with real GCC output and edit it into what they think is a better example, or actually write it by hand from scratch without testing it.
GCC's actual output would have used multiplication by a magic constant (fixed-point multiplicative inverse) to divide by 9 (even at -O0, but this is clearly not debug-mode code. They could have used -Os).
Presumably they didn't want to talk about Why does GCC use multiplication by a strange number in implementing integer division? and replaced that block of code with their made-up instruction. From context you can probably figure out where they expect the output to go; perhaps they mean rcx /= 9.
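To give an idea of what that looks like, here's a hand-written sketch of the multiplicative-inverse technique for 32-bit signed x/9 (954437177 = ceil(2^33/9) is the standard magic constant for this divisor; GCC's exact instruction sequence will differ):

# int32 x in %edi, x/9 returned in %eax (truncated toward zero, like C)
movslq  %edi, %rax
imulq   $954437177, %rax     # x * ceil(2^33 / 9)
sarq    $33, %rax            # floor(x * 954437177 / 2^33) = floor(x/9) for x >= 0
movl    %edi, %edx
sarl    $31, %edx            # -1 if x was negative, else 0
subl    %edx, %eax           # add 1 for negative x to truncate toward zero
ret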
These errors are from 3rd-party practice problems in the Global Edition
From the publisher's web site (https://csapp.cs.cmu.edu/3e/errata.html)
Note on the Global Edition: Unfortunately, the publisher arranged for the generation of a different set of practice and homework problems in the global edition. The person doing this didn't do a very good job, and so these problems and their solutions have many errors. We have not created an errata for this edition.
So CS:APP 3e is probably a good textbook, as long as you get the North American edition, or ignore the practice / homework problems. This explains the huge disconnect between the textbook's reputation and wide use vs. the serious and obvious (to people familiar with x86-64 asm) errors like this one that go beyond sloppy into don't-know-the-language territory.
How a hypothetical idiv reg, reg or idiv $imm, reg would be designed
Also, the dividend should be given from the quantity in registers %rdx (high-order 64 bits) and %rax (low-order 64 bits) - so if this is defined in the architecture then it does not seem possible that the second operand could be a specified dividend.
If Intel or AMD had introduced a new, more convenient form of div or idiv, they would have designed it to use a single-width dividend, because that's how compilers always use it.
Most languages are like C and implicitly promote both operands for + - * / to the same type and produce a result of that width. Of course if the inputs are known to be narrow that can be optimized away. (e.g. using one imul r32 to implement a * (int64_t)b).
But div and idiv fault if the quotient overflows so it's not safe to use a single 32-bit idiv when compiling int32_t q = (int64_t)a / (int32_t)b.
Compilers always use xor edx,edx before DIV, or cdq / cqo before IDIV, to actually do n / n => n-bit division.
Real full-width division using a dividend that isn't just zero- or sign-extended is only done by hand with intrinsics or asm (because gcc/clang and other compilers don't know when the optimization is safe), or in gcc helper functions that do e.g. 64-bit / 64-bit division in 32-bit code. (Or 128-bit division in 64-bit code).
So what would be most helpful is a div/idiv that avoids the extra instruction to set up RDX, too, as well as minimizing the number of implicit register operands. (Like imul r32, r/m32 and imul r32, r/m32, imm do: making the common case of non-widening multiplication more convenient with no implicit registers. That's Intel-syntax like the manuals, destination first)
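In AT&T syntax (destination last, unlike the manuals) those existing non-widening imul forms look like:

imull   %esi, %edi           # edi *= esi              (imul r32, r/m32)
imull   $10, %esi, %eax      # eax  = esi * 10         (imul r32, r/m32, imm)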
The simplest way would be a 2-operand instruction that did dst /= src. Or maybe replaced both operands with quotient and remainder. Using a VEX encoding for 3 operands like BMI1 andn, you could maybe have
idivx remainder_dst, dividend, divisor. With the 2nd operand also an output for the quotient. Or you could have the remainder written to RDX with a non-destructive destination for the quotient.
Or, more likely, optimize for the simple case where only the quotient is needed: idivx quot, dividend, divisor, not storing the remainder anywhere. You can always use regular idiv when you also want the remainder.
BMI2 mulx uses an implicit rdx input operand because its purpose is to allow multiple dep chains of add-with-carry for extended-precision multiply. So it still has to produce 2 outputs. But this hypothetical new form of idiv would exist to save code-size and uops around normal uses of idiv that aren't widening. So 386 imul reg, reg/mem is the point of comparison, not BMI2 mulx.
IDK if it would make sense to introduce an immediate form of idivx as well; you'd only use it for code-size reasons. Multiplicative inverses are a more efficient way to divide by constants, so there's very little real-world use-case for such an instruction.

I think your book has made a mistake.
idivq only has one operand. If I try to assemble this snippet:
idivq $9, %rcx
I get this error:
test.s: Assembler messages:
test.s:1: Error: operand type mismatch for `idiv'
This works:
idivq %rcx
but you probably already know that.
It may also be a macro (unlikely, but possible; credit to @Hans Passant for this).
Perhaps you should contact the book's author so that they can add an entry to the errata.

Interestingly, gas seems to allow the following:
mov $20, %rax
mov $0, %rdx
mov $5, %rcx
idivq %rcx, %rax
ret
This is still performing the one-operand division under the hood, but it LOOKS like a two-operand form. As long as the first operand is a register and the second operand is specifically %rax, this assembles. However, in general idivq requires the one-operand form.

Related

question about an assembly code correspondence to a C code practice question [duplicate]


Can we store a floating point in a regular register?

As I understand it, floating-point values are stored in XMM registers and not in general-purpose registers such as eax, so I did an experiment:
float a = 5;
in this case, a is stored as 1084227584 in the XMM register.
Here is an assembly version:
.text
.global _start
.LCO:
.long 1084227584
_start:
mov .LCO, %eax
movss .LCO, %xmm0
Executing the above assembly and debugging it with gdb shows that the value in eax is 1084227584, however the value in ymm0 is 5.
Here are my questions:
1- What's so special about the XMM registers? Besides the SIMD instructions, are they the only type of registers to store floating points?
Why can't I set the same bits in a regular register?
2- Are float and double values always stored as a floating point?
Can we never store them as a fixed point in C or assembly?
however the value in ymm0 is 5.
The bit-pattern in ymm0 is 1084227584. The float interpretation of that number is 5.0.
But you can print /x $xmm0.v4_int32 to see a hex representation of the bits in xmm0.
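For this particular value the binary32 encoding works out as follows:

5.0  =  1.25 * 2^2
sign = 0,  biased exponent = 2 + 127 = 129 = 0b10000001,  fraction bits = 0100...0
=>  0 10000001 01000000000000000000000  =  0x40A00000  =  1084227584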
What's so special about the XMM registers? beside the SIMD instructions, are they the only type of registers to store floating points?
No, in asm everything is just bytes.
Some compilers will use an integer register to copy a float or double from one memory location to another, if not doing any computation on it. (Integer instructions are often smaller.) e.g. clang will do this: https://godbolt.org/z/76EWMY
void copy(float *d, float *s) { *d = *s; }
# clang8.0 -O3 targeting x86-64 System V
copy: # #copy
mov eax, dword ptr [rsi]
mov dword ptr [rdi], eax
ret
XMM/YMM/ZMM registers are special because they're the only registers that FP ALU instructions exist for (ignoring x87, which is only used for 80-bit long double in x86-64).
addsd xmm0, xmm1 (add scalar double) has no equivalent for integer registers.
Usually FP and integer data don't mingle very much, so providing a whole separate set of architectural registers allows more space for more data to be in registers. (Given the same instruction-encoding constraints, it's a choice between 16 FP + 16 GP integer vs. 16 unified registers, not vs. 32 unified registers).
Plus, a major microarchitectural benefit of a separate register file is that it can be physically close to the FP ALUs, while the integer register file can be physically close to the integer ALUs. For more, see Is there any architecture that uses the same register space for scalar integer and floating point operations?
are float and double values always stored as a floating point? can we never store them as a fixed point in C or assembly?
x86 compilers use float = IEEE754 binary32 https://en.wikipedia.org/wiki/Single-precision_floating-point_format. (And double = IEEE754 binary64). This is specified as part of the ABI.
Internally the as-if rule allows the compiler to do whatever it wants, as long as the final result is identical. (Or with -ffast-math, to pretend that FP math is associative, and assume NaN/Inf aren't possible.)
Compilers can't just randomly choose a different object representation for some float that other separately-compiled functions might look at.
There might be rare cases for locals that are never visible to other functions where a "human compiler" (hand-writing asm to implement C) could prove that fixed-point was safe. Or more likely, that the float values were exact integers small enough that double wouldn't round them, so your fixed-point could degenerate to integer (except maybe for a final step).
But it would be rare to know this much about possible values without just being able to do constant propagation and optimize everything away. That's why I say a human would have to be involved, to prove things the compiler wouldn't know to look for.
I think in theory you could have a C implementation that did use a fixed-point float or double. ISO C puts very little restrictions on what float and double actually are.
But float.h constants like FLT_RADIX and DBL_MAX_EXP have interactions that might not make sense for a fixed-point format, which has a constant distance between representable values, instead of them being much closer together near 0 and much farther apart for large numbers. (Rounding error of 0.5 ulp is relative to the magnitude, instead of absolute.)
Still, most programs don't actually do things that would break if the "mantissa" and exponent limits didn't correspond to what you'd expect for DBL_MIN and DBL_MAX.
Another interesting possibility is to make float and double based on the Posit format (similar to traditional floating-point, but with a variable-length exponent encoding. https://www.johndcook.com/blog/2018/04/11/anatomy-of-a-posit-number/ https://posithub.org/index).
Modern hardware, especially Intel CPUs, has very good support for IEEE float/double, so fixed-point is often not a win. There are some nice SIMD instructions for 16-bit fixed-point, though, like high-half-only multiply, and even pmulhrsw which does fixed-point rounding.
But general 32-bit integer multiply has worse throughput than packed-float multiply. (Because the SIMD ALUs optimized for float/double only need 24x24-bit significand multipliers per 32 bits of vector element. Modern Intel CPUs run integer multiply and shift on the FMA execution units, with 2 uops per clock throughput.)
are they the only type of registers to store floating points?
No. There are the 80-bit floating-point registers (fp0-fp7) in the 8087-compatible FPU which should still be present in most modern CPUs.
Most 32-bit programs use these registers.
Can we store a floating point in a regular [integer] register?
Yes. 30 years ago many PCs contained a CPU without 80x87 FPU, so there were no fp0-fp7 registers. CPUs with XMM registers came even later.
We find a similar situation in mobile devices today.
What's so special about the XMM registers?
Using the 80x87 FPU seems to be more complicated than using XMM registers. Furthermore, I'm not sure if using the 80x87 is allowed in 64-bit programs in every operating system.
If you store a floating-point value in an integer register (such as eax), you don't have any instructions performing arithmetic: On x86 CPUs, there is no instruction for doing a multiplication or addition of floating-point values that are stored in integer registers.
In the case of CPUs without FPU, you have to do floating-point emulation. This means you have to perform one floating-point operation by doing multiple integer operations - just like you would do it with paper and pencil.
However, if you only want to store a floating-point value, you can of course also use an integer register. The same is true for copying a value or checking if two values are equal and similar operations.
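A small sketch of that in AT&T syntax (hypothetical example: a float at (%rsi) copied to (%rdi) and compared against one at (%rdx)):

movl    (%rsi), %eax       # load the 4-byte object representation of a float
movl    %eax, (%rdi)       # store it: an exact copy, no FP instruction needed
cmpl    (%rdx), %eax       # bitwise equality test (beware: it treats -0.0 and +0.0
                           # as different, and identical NaN patterns as equal)
movd    %eax, %xmm0        # move the same 32 bits into an XMM register for real FP math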
Can we never store them as a fixed point in C or assembly?
Fixed point is used a lot when using CPUs that do not have an FPU.
For example, when using 8- or 16-bit CPUs, which are still used in the automotive industry, consumer devices, or PC peripheral devices.
However, I doubt that there are C compilers that automatically translate the keyword "float" to fixed point.

What is the instruction that gives branchless FP min and max on x86?

To quote (thanks to the author for developing and sharing the algorithm!):
https://tavianator.com/fast-branchless-raybounding-box-intersections/
Since modern floating-point instruction sets can compute min and max without branches
Corresponding code by the author is just
double dmnsn_min(double a, double b)
{
return a < b ? a : b;
}
I'm familiar with e.g. _mm_max_ps, but that's a vector instruction. The code above obviously is meant to be used in a scalar form.
Question:
What is the scalar branchless minmax instruction on x86? Is it a sequence of instructions?
Is it safe to assume it's going to be applied, or how do I call it?
Does it make sense to bother about branchless-ness of min/max? From what I understand, for a raytracer and / or other viz software, given a ray - box intersection routine, there is no reliable pattern for the branch predictor to pick up, hence it does make sense to eliminate the branch. Am I right about this?
Most importantly, the algorithm discussed is built around comparing against (+/-) INFINITY. Is this reliable w.r.t the (unknown) instruction we're discussing and the floating-point standard?
Just in case: I'm familiar with Use of min and max functions in C++, believe it's related but not quite my question.
Warning: Beware of compilers treating _mm_min_ps / _mm_max_ps (and _pd) intrinsics as commutative even in strict FP (not fast-math) mode; even though the asm instruction isn't. GCC specifically seems to have this bug: PR72867 which was fixed in GCC7 but may be back or never fixed for _mm_min_ss etc. scalar intrinsics (_mm_max_ss has different behavior between clang and gcc, GCC bugzilla PR99497).
GCC knows how the asm instructions themselves work, and doesn't have this problem when using them to implement strict FP semantics in plain scalar code, only with the C/C++ intrinsics.
Unfortunately there isn't a single instruction that implements fmin(a,b) (with guaranteed NaN propagation), so you have to choose between easy detection of problems vs. higher performance.
Most vector FP instructions have scalar equivalents. MINSS / MAXSS / MINSD / MAXSD are what you want. They handle +/-Infinity the way you'd expect.
MINSS a,b exactly implements (a<b) ? a : b according to IEEE rules, with everything that implies about signed-zero, NaN, and Infinities. (i.e. it keeps the source operand, b, on unordered.) This means C++ compilers can use them for std::min(b,a) and std::max(b,a), because those functions are based on the same expression. Note the b,a operand order for the std:: functions, opposite Intel-syntax for x86 asm, but matching AT&T syntax.
MAXSS a,b exactly implements (b<a) ? a : b, again keeping the source operand (b) on unordered. Like std::max(b,a).
Looping over an array with x = std::min(arr[i], x); (i.e. minss or maxss xmm0, [rsi]) will take a NaN from memory if one is present, and then take whatever non-NaN element is next because that compare will be unordered. So you'll get the min or max of the elements following the last NaN. You normally don't want this, so it's only good for arrays that don't contain NaN. But it means you can start with float v = NAN; outside a loop, instead of the first element or FLT_MAX or +Infinity, and might simplify handling possibly-empty lists. It's also convenient in asm, allowing init with pcmpeqd xmm0,xmm0 to generate an all-ones bit-pattern (a negative QNAN), but unfortunately GCC's NAN uses a different bit-pattern.
Demo/proof on the Godbolt compiler explorer, including showing that v = std::min(v, arr[i]); (or max) ignores NaNs in the array, at the cost of having to load into a register and then minss into that register.
(Note that min of an array should use vectors, not scalar; preferably with multiple accumulators to hide FP latency. At the end, reduce to one vector then do horizontal min of it, just like summing an array or doing a dot product.)
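For a concrete picture of that scalar loop pattern, here's a minimal sketch in AT&T syntax (hypothetical minarray function; assumes at least one element and no NaNs, and a real version should be vectorized as noted above):

# float minarray(const float *p /* rdi */, size_t n /* rsi, assumed >= 1 */)
minarray:
movss   (%rdi), %xmm0            # running min = p[0]
movl    $1, %eax
.Lloop:
cmpq    %rsi, %rax
jae     .Ldone
minss   (%rdi,%rax,4), %xmm0     # xmm0 = (xmm0 < p[i]) ? xmm0 : p[i]
incq    %rax
jmp     .Lloop
.Ldone:
ret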
Don't try to use _mm_min_ss on scalar floats; the intrinsic is only available with __m128 operands, and Intel's intrinsics don't provide any way to get a scalar float into the low element of a __m128 without zeroing the high elements or somehow doing extra work. Most compilers will actually emit the useless instructions to do that even if the final result doesn't depend on anything in the upper elements. (Clang can often avoid it, though, applying the as-if rule to the contents of dead vector elements.) There's nothing like __m256 _mm256_castps128_ps256 (__m128 a) to just cast a float to a __m128 with garbage in the upper elements. I consider this a design flaw. :/
But fortunately you don't need to do this manually, compilers know how to use SSE/SSE2 min/max for you. Just write your C such that they can. The function in your question is ideal: as shown below (Godbolt link):
// can and does inline to a single MINSD instruction, and can auto-vectorize easily
static inline double
dmnsn_min(double a, double b) {
return a < b ? a : b;
}
Note their asymmetric behaviour with NaN: if the operands are unordered, dest=src (i.e. it takes the second operand if either operand is NaN). This can be useful for SIMD conditional updates, see below.
(a and b are unordered if either of them is NaN. That means a<b, a==b, and a>b are all false. See Bruce Dawson's series of articles on floating point for lots of FP gotchas.)
The corresponding _mm_min_ss / _mm_min_ps intrinsics may or may not have this behaviour, depending on the compiler.
I think the intrinsics are supposed to have the same operand-order semantics as the asm instructions, but gcc has treated the operands to _mm_min_ps as commutative even without -ffast-math for a long time, gcc4.4 or maybe earlier. GCC 7 finally changed it to match ICC and clang.
Intel's online intrinsics finder doesn't document that behaviour for the function, but it's maybe not supposed to be exhaustive. The asm insn ref manual doesn't say the intrinsic doesn't have that property; it just lists _mm_min_ss as the intrinsic for MINSS.
When I googled on "_mm_min_ps" NaN, I found this real code and some other discussion of using the intrinsic to handle NaNs, so clearly many people expect the intrinsic to behave like the asm instruction. (This came up for some code I was writing yesterday, and I was already thinking of writing this up as a self-answered Q&A.)
Given the existence of this longstanding gcc bug, portable code that wants to take advantage of MINPS's NaN handling needs to take precautions. The standard gcc version on many existing Linux distros will mis-compile your code if it depends on the order of operands to _mm_min_ps. So you probably need an #ifdef to detect actual gcc (not clang etc), and an alternative. Or just do it differently in the first place :/ Perhaps with a _mm_cmplt_ps and boolean AND/ANDNOT/OR.
Enabling -ffast-math also makes _mm_min_ps commutative on all compilers.
As usual, compilers know how to use the instruction set to implement C semantics correctly. MINSS and MAXSS are faster than anything you could do with a branch anyway, so just write code that can compile to one of those.
The commutative-_mm_min_ps issue applies to only the intrinsic: gcc knows exactly how MINSS/MINPS work, and uses them to correctly implement strict FP semantics (when you don't use -ffast-math).
You don't usually need to do anything special to get decent scalar code out of a compiler. But if you are going to spend time caring about what instructions the compiler uses, you should probably start by manually vectorizing your code if the compiler isn't doing that.
(There may be rare cases where a branch is best, if the condition almost always goes one way and latency is more important than throughput. MINPS latency is ~3 cycles, but a perfectly predicted branch adds 0 cycles to the dependency chain of the critical path.)
In C++, use std::min and std::max, which are defined in terms of > or <, and don't have the same requirements on NaN behaviour that fmin and fmax do. Avoid fmin and fmax for performance unless you need their NaN behaviour.
In C, I think just write your own min and max functions (or macros if you do it safely).
C & asm on the Godbolt compiler explorer
float minfloat(float a, float b) {
return (a<b) ? a : b;
}
# any decent compiler (gcc, clang, icc), without any -ffast-math or anything:
minss xmm0, xmm1
ret
// C++
float minfloat_std(float a, float b) { return std::min(a,b); }
# This implementation of std::min uses (b<a) ? b : a;
# So it can produce the result only in the register that b was in
# This isn't worse (when inlined), just opposite
minss xmm1, xmm0
movaps xmm0, xmm1
ret
float minfloat_fmin(float a, float b) { return fminf(a, b); }
# clang inlines fmin; other compilers just tailcall it.
minfloat_fmin(float, float):
movaps xmm2, xmm0
cmpunordss xmm2, xmm2
movaps xmm3, xmm2
andps xmm3, xmm1
minss xmm1, xmm0
andnps xmm2, xmm1
orps xmm2, xmm3
movaps xmm0, xmm2
ret
# Obviously you don't want this if you don't need it.
If you want to use _mm_min_ss / _mm_min_ps yourself, write code that lets the compiler make good asm even without -ffast-math.
If you don't expect NaNs, or want to handle them specially, write stuff like
lowest = _mm_min_ps(lowest, some_loop_variable);
so the register holding lowest can be updated in-place (even without AVX).
Taking advantage of MINPS's NaN behaviour:
Say your scalar code is something like
if(some condition)
lowest = min(lowest, x);
Assume the condition can be vectorized with CMPPS, so you have a vector of elements with the bits all set or all clear. (Or maybe you can get away with ANDPS/ORPS/XORPS on floats directly, if you just care about their sign and don't care about negative zero. This creates a truth value in the sign bit, with garbage elsewhere. BLENDVPS looks at only the sign bit, so this can be super useful. Or you can broadcast the sign bit with PSRAD xmm, 31.)
The straight-forward way to implement this would be to blend x with +Inf based on the condition mask. Or do newval = min(lowest, x); and blend newval into lowest. (either BLENDVPS or AND/ANDNOT/OR).
But the trick is that all-one-bits is a NaN, and a bitwise OR will propagate it. So:
__m128 inverse_condition = _mm_cmplt_ps(foo, bar);
__m128 x = whatever;
x = _mm_or_ps(x, inverse_condition); // turn elements into NaN where the mask is all-ones
lowest = _mm_min_ps(x, lowest); // NaN elements in x mean no change in lowest
// REQUIRES NON-COMMUTATIVE _mm_min_ps: no -ffast-math
// AND DOESN'T WORK AT ALL WITH MOST GCC VERSIONS.
So with only SSE2, we've done a conditional MINPS in two extra instructions (ORPS and MOVAPS, unless loop unrolling allows the MOVAPS to disappear).
The alternative without SSE4.1 BLENDVPS is ANDPS/ANDNPS/ORPS to blend, plus an extra MOVAPS. ORPS is more efficient than BLENDVPS anyway (it's 2 uops on most CPUs).
Peter Cordes's answer is great, I just figured I'd jump in with some shorter point-by-point answers:
What is the scalar branchless minmax instruction on x86? Is it a sequence of instructions?
I was referring to minss/minsd. And even other architectures without such instructions should be able to do this branchlessly with conditional moves.
Is it safe to assume it's going to be applied, or how do I call it?
gcc and clang will both optimize (a < b) ? a : b to minss/minsd, so I don't bother using intrinsics. Can't speak to other compilers though.
Does it make sense to bother about branchless-ness of min/max? From what I understand, for a raytracer and / or other viz software, given a ray - box intersection routine, there is no reliable pattern for the branch predictor to pick up, hence it does make sense to eliminate the branch. Am I right about this?
The individual a < b tests are pretty much completely unpredictable, so it is very important to avoid branching for those. Tests like if (ray.dir.x != 0.0) are very predictable, so avoiding those branches is less important, but it does shrink the code size and make it easier to vectorize. The most important part is probably removing the divisions though.
Most importantly, the algorithm discussed is built around comparing against (+/-) INFINITY. Is this reliable w.r.t the (unknown) instruction we're discussing and the floating-point standard?
Yes, minss/minsd behave exactly like (a < b) ? a : b, including their treatment of infinities and NaNs.
Also, I wrote a followup post to the one you referenced that talks about NaNs and min/max in more detail.

Any preference to SHUFPD or PSHUFD for reversing two packed double in an XMM?

Today's question is fairly short. Consider the following toy C program shuffle.c for reversing two packed doubles in register xmm0:
#include <stdio.h>
void main () {
double x[2] = {0.0, 1.0};
asm volatile (
"movupd (%[x]), %%xmm0\n\t"
"shufpd $1, %%xmm0, %%xmm0\n\t" /* method 1 */
//"pshufd $78, %%xmm0, %%xmm0\n\t" /* method 2 */
"movupd %%xmm0, (%[x])\n\t"
:
: [x] "r" (x)
: "xmm0", "memory");
printf("x[0] = %.2f, x[1] = %.2f\n", x[0], x[1]);
}
After a test run with gcc -msse3 -o shuffle shuffle.c && ./shuffle, both methods/instructions return the correct result x[0] = 1.00, x[1] = 0.00. This page says that shufpd has a latency of 6 cycles, while the Intel intrinsics guide says that pshufd only has a latency of 1 cycle. This sounds like a strong argument for pshufd. However, this instruction is really meant for packed integers. When using it on packed doubles, will there be any penalty associated with the "wrong type"?
As a similar question, I have also heard that the instruction movaps is 1 byte smaller than movapd, and that they do the same thing: read 128 bits from a 16-byte aligned address. So can we always use the former for moves (between XMMs) / loads (from memory) / stores (to memory)? This seems crazy; I think there must be some reason to reject this. Can someone give me an explanation? Thank you.
You'll always get correct results, but it can matter for performance.
Prefer FP shuffles for FP data that will be an input to FP math instructions (like addps or vfma..., as opposed to insns like xorps).
This avoids any extra bypass-delay latency on some microarchitectures, including potentially current Intel chips. See Agner Fog's microarchitecture guide. AMD Bulldozer-family does all shuffles in the vector-integer domain, so there's a bypass delay whichever shuffle you use.
If it saves instructions, it can be worth it to use an integer shuffle anyway. (But usually it's the other way around, where you want to use shufps to combine data from two integer vectors. That's fine in even more cases, and mostly a problem only on Nehalem, IIRC.)
http://x86.renejeschke.de/html/file_module_x86_id_293.html lists the latency for CPUID 0F3n/0F2n CPUs, i.e. Pentium4 (family 0xF model 2 (Northwood) / model 3 (Prescott)). Those numbers are obviously totally irrelevant, and don't even match Agner Fog's P4 table for shufpd.
Intel's intrinsics guide sometimes has numbers that don't match experimental testing, either. See Agner Fog's instruction tables for good latency/throughput numbers, and microarch guides to understand the details.
movaps vs. movapd: No existing microarchitectures care which you use. It would be possible for someone in the future to design an x86 CPU that kept double vectors separate from float vectors internally, but for now the only distinction has been int vs. FP.
Always prefer the ps instruction when the behaviour is identical (xorps over xorpd, movhps over movhpd).
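e.g. the ps encodings skip the 66 operand-size prefix (AT&T syntax):

xorps   %xmm1, %xmm0        # 0F 57 C1    -> 3 bytes
xorpd   %xmm1, %xmm0        # 66 0F 57 C1 -> 4 bytes, same effect on the register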
Some compilers (maybe both gcc and clang, I forget) will compile a _mm_store_si128 integer vector store to movaps, because there's no performance downside on any existing hardware, and it's one byte shorter.
IIRC, there's also no perf downside to loading integer vector data with movaps / movups, but I'm less sure about that.
There is a perf downside to using the wrong mov instruction for a reg-reg move, though. movdqa xmm1, xmm2 between two FP instructions is bad on Nehalem.
re: your inline asm:
It doesn't need to be volatile, and you could drop the "memory" clobber if you used a 16 byte struct or something as a "+m" input/output operand. Or a "+x" vector-register operand for an __m128d variable.
You'll probably get better results from intrinsics than from inline asm, unless you write whole loops in inline asm or stand-alone functions.
See the x86 tag wiki for a link to my inline asm guide.

Intel Vs. AT&T syntax when addressing xmm and floating instruction

Hello everyone
I am working on writing an assembly program, and before I start I would like to learn how AT&T and Intel syntax differ when addressing xmm and fp operands. I know that for regular instructions, a push operating on a byte is written "pushb" in AT&T while it is "push byte" in Intel. Can anyone provide a similar comparison for xmm or fp instructions? In short, I want to know how xmm operands are addressed.
Thanks in advance
I'm not an AT&T fan/user, but the first place to start for Intel syntax would be the Intel developer manuals (volumes 2A and 2B contain the instruction references). These list the sizes each instruction operates on, which almost all Intel-syntax assemblers will try to deduce if not specified (push will try to narrow the variable or align it, depending on settings); otherwise you'll generally be using qword/dword for fp (for the likes of fld) and dword/qword/dqword for mmx/sse ops.
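Roughly, for comparison (AT&T on the left, Intel-style on the right): the operand order is reversed, registers get a % prefix, and the operand size becomes a suffix on the mnemonic instead of a dword ptr / qword ptr on the memory operand. For example:

movss  (%rdi), %xmm0      <->  movss  xmm0, dword ptr [rdi]
addsd  %xmm1, %xmm0       <->  addsd  xmm0, xmm1
movapd 16(%rsi), %xmm2    <->  movapd xmm2, xmmword ptr [rsi+16]
flds   (%rdi)             <->  fld    dword ptr [rdi]
fldl   (%rsi)             <->  fld    qword ptr [rsi]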
