Is an extra move somehow faster when doing division-by-multiplication?

Consider this function:
unsigned long f(unsigned long x) {
return x / 7;
}
With -O3, Clang turns the division into a multiplication, as expected:
f: # @f
movabs rcx, 2635249153387078803
mov rax, rdi
mul rcx
sub rdi, rdx
shr rdi
lea rax, [rdi + rdx]
shr rax, 2
ret
GCC does basically the same thing, except for using rdx where Clang uses rcx. But they both appear to be doing an extra move. Why not this instead?
f:
movabs rax, 2635249153387078803
mul rdi
sub rdi, rdx
shr rdi
lea rax, [rdi + rdx]
shr rax, 2
ret
In particular, they both put the numerator in rax, but by putting the magic number there instead, you avoid having to move the numerator at all. If this is actually better, I'm surprised that neither GCC nor Clang do it this way, since it feels so obvious. Is there some microarchitectural reason that their way is actually faster than my way?
Godbolt link.

This very much looks like a missed optimization by both gcc and clang; no benefit to that extra mov.
If it's not already reported, GCC and LLVM both accept missed-optimization bug reports: https://bugs.llvm.org/ and https://gcc.gnu.org/bugzilla/. For GCC there's even a bug tag "missed-optimization".
Wasted mov instructions are unfortunately not rare, especially when looking at tiny functions where the input / output regs are nailed down by the calling convention, not up to the register allocator. They do still happen in loops sometimes, like doing a bunch of extra work each iteration so everything is in the right places for the code that runs once after a loop. /facepalm.
Zero-latency mov (mov-elimination) helps reduce the cost of such missed optimizations (and cases where mov isn't avoidable), but it still takes a front-end uop so it's pretty much strictly worse. (Except by chance where it helps alignment of something later, but if that's the reason then a nop would have been as good).
And it takes up space in the ROB, reducing how far ahead out-of-order exec can see past a cache miss or other stall. mov is never truly free, only the execution-unit and latency part is eliminated - Can x86's MOV really be "free"? Why can't I reproduce this at all?
My total guess about compiler internals:
Probably gcc/clang's internal machinery needs to learn that this division pattern is commutative and can take the input value in some other register and put the constant in RAX.
In a loop they'd want the constant in some other register so they could reuse it, but hopefully the compiler could still figure that out for cases where it's useful.
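Footnote: you can model what both versions compute in plain C if you want to convince yourself the two operand orders give the same result. This is just a sketch of the asm above (using the GCC/Clang unsigned __int128 extension for the high-half multiply), with comments mapping each line back to the instructions:
#include <assert.h>
#include <stdint.h>
/* High 64 bits of the 128-bit product, then the add-back fixup for the
   implicit 2^64 bit of the magic constant. */
static uint64_t div7(uint64_t x)
{
    uint64_t hi = (uint64_t)(((unsigned __int128)x * 2635249153387078803u) >> 64); /* mul */
    uint64_t t = (x - hi) >> 1;  /* sub rdi, rdx ; shr rdi */
    return (t + hi) >> 2;        /* lea rax, [rdi + rdx] ; shr rax, 2 */
}
int main(void)
{
    for (uint64_t x = 0; x < 10000000; x++)
        assert(div7(x) == x / 7);
    assert(div7(-1ull) == -1ull / 7); /* spot-check the top of the range */
    return 0;
}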

Visual Studio 2015 generates the code you expected, rcx = input dividend:
mov rax, 2635249153387078803
mul rcx
sub rcx, rdx
shr rcx, 1
lea rax, QWORD PTR [rdx+rcx]
shr rax, 2
A divisor of 7 needs a 65 bit multiplier to get the proper accuracy.
floor((2^(64+ceil(log2(7))))/7)+1 = floor((2^67)/7)+1 = 21081993227096630419
Removing the most significant bit, 2^64, results in 21081993227096630419 - 2^64 = 2635249153387078803, which is the multiplier actually used in the code.
The generated code compensates for the missing 2^64 bit, which is explained in figure 4.1 and equation 4.5 in this pdf file:
https://gmplib.org/~tege/divcnst-pldi94.pdf
Further explanation can be seen in this prior answer:
Why does GCC use multiplication by a strange number in implementing integer division?
If the 65 bit multiplier has a trailing 0 bit, then it can be shifted right 1 bit to result in a 64 bit multiplier, reducing the number of instructions. For example if dividing by 5:
floor((2^(64+ceil(log2(5))))/5)+1 = floor((2^67)/5)+1 = 29514790517935282586
29514790517935282586 >> 1 = 14757395258967641293
mov rax, -3689348814741910323 ; == 14757395258967641293 == 0cccccccccccccccdH
mul rcx
shr rdx, 2
mov rax, rdx
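For reference, here's a quick sketch that reproduces those multipliers from the formula above (using the GCC/Clang unsigned __int128 extension; this is just the math, not how any particular compiler implements it):
#include <stdio.h>
int main(void)
{
    unsigned long long d = 7; /* divisor; try 5 as well */
    int l = 0;                /* l = ceil(log2(d)) for d > 1 */
    while ((1ull << l) < d)
        l++;
    /* floor(2^(64+l) / d) + 1, per the divcnst-pldi94 paper */
    unsigned __int128 m = (((unsigned __int128)1 << (64 + l)) / d) + 1;
    if ((m & 1) == 0) /* trailing 0 bit: shift down to a 64-bit multiplier */
        printf("64-bit multiplier: %llu\n", (unsigned long long)(m >> 1));
    else              /* 65-bit: the 2^64 bit is implicit; the code compensates */
        printf("65-bit multiplier, low 64 bits: %llu\n", (unsigned long long)m);
    return 0;
}
With d = 7 this prints 2635249153387078803, and with d = 5 it prints 14757395258967641293, matching the constants above.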

Your version does not appear to be faster.
Edit: Register renaming means the extra mov does not actually have to move any data. The renamer can merely point the mov's destination at the same entry in the PRF [physical register file] as its source. So, the mov is possibly eliminated entirely (mov-elimination).
I've coded up both your asm versions:
Here is orig.s:
.file "orig.c"
.text
.p2align 4,,15
.globl orig
.type orig, @function
orig:
.LFB0:
.cfi_startproc
movabsq $2635249153387078803, %rdx
movq %rdi, %rax
mulq %rdx
subq %rdx, %rdi
shrq %rdi
leaq (%rdx,%rdi), %rax
shrq $2, %rax
ret
.cfi_endproc
.LFE0:
.size orig, .-orig
.ident "GCC: (GNU) 8.3.1 20190223 (Red Hat 8.3.1-2)"
.section .note.GNU-stack,"",@progbits
And, fix1.s:
.file "fix1.c"
.text
.p2align 4,,15
.globl fix1
.type fix1, @function
fix1:
.LFB0:
.cfi_startproc
movabsq $2635249153387078803, %rax
mulq %rdi
subq %rdx, %rdi
shrq %rdi
leaq (%rdx,%rdi), %rax
shrq $2, %rax
ret
.cfi_endproc
.LFE0:
.size fix1, .-fix1
.ident "GCC: (GNU) 8.3.1 20190223 (Red Hat 8.3.1-2)"
.section .note.GNU-stack,"",@progbits
Here is a test program, main.c. (You may need to vary the iteration constant in test):
#include <stdio.h>
#include <time.h>
typedef unsigned long ulong;
ulong orig(ulong);
ulong fix1(ulong);
typedef ulong (*fnc_p)(ulong);
typedef long long tsc_t;
static inline tsc_t
tscget(void)
{
struct timespec ts;
tsc_t tsc;
clock_gettime(CLOCK_MONOTONIC,&ts);
tsc = ts.tv_sec;
tsc *= 1000000000;
tsc += ts.tv_nsec;
return tsc;
}
tsc_t
test(fnc_p fnc)
{
tsc_t beg;
tsc_t end;
ulong tot = 0;
beg = tscget();
for (ulong cnt = 10000000; cnt > 0; --cnt)
tot += fnc(cnt);
end = tscget();
end -= beg;
return end;
}
int
main(void)
{
tsc_t odif = test(orig);
tsc_t fdif = test(fix1);
printf("odif=%lld fdif=%lld (%lld)\n",odif,fdif,odif - fdif);
return 0;
}
Build with:
gcc -O3 -o main main.c orig.s fix1.s
Here are the test results of 20 runs:
odif=43937784 fdif=34104334 (9833450)
odif=39791246 fdif=42641752 (-2850506)
odif=25818191 fdif=25586750 (231441)
odif=35056015 fdif=25276729 (9779286)
odif=43955175 fdif=31112246 (12842929)
odif=25731472 fdif=25493826 (237646)
odif=25627395 fdif=26202191 (-574796)
odif=28029957 fdif=25627366 (2402591)
odif=25828608 fdif=26291294 (-462686)
odif=25690703 fdif=25703610 (-12907)
odif=25908418 fdif=26411828 (-503410)
odif=25690776 fdif=25673766 (17010)
odif=25992890 fdif=25982718 (10172)
odif=25693459 fdif=25636974 (56485)
odif=26572724 fdif=25870050 (702674)
odif=25627334 fdif=25621802 (5532)
odif=27760054 fdif=27382748 (377306)
odif=26343245 fdif=26195134 (148111)
odif=27289865 fdif=25840818 (1449047)
odif=25985794 fdif=25721351 (264443)
UPDATE:
Your data doesn't appear to support your conclusion, unless I'm misinterpreting something.
Like I said, you may have to vary the number of iterations (up or down), or do many runs and take the minimum. Otherwise, I would expect the final number on each line to be more or less invariant, either positive or negative, not swinging +/-. It may be difficult to measure without a better test.
You should note that modern x86 models (e.g. Sandy Bridge or later) are massively superscalar and reorder instructions, along with fusing uops, so I wouldn't count on a literal translation. For example, see: https://www.realworldtech.com/sandy-bridge/
Here's a better(?) version, but it still shows the same thing: namely, that sometimes the original is faster and sometimes the "improved" version is faster.
#include <stdio.h>
#include <time.h>
typedef unsigned long ulong;
ulong orig(ulong);
ulong fix1(ulong);
typedef ulong (*fnc_p)(ulong);
typedef long long tsc_t;
typedef struct {
tsc_t orig;
tsc_t fix1;
} bnc_t;
#define BNCMAX 100
bnc_t bnclist[BNCMAX];
static inline tsc_t
tscget(void)
{
struct timespec ts;
tsc_t tsc;
clock_gettime(CLOCK_MONOTONIC,&ts);
tsc = ts.tv_sec;
tsc *= 1000000000;
tsc += ts.tv_nsec;
return tsc;
}
tsc_t
test(fnc_p fnc)
{
tsc_t beg;
tsc_t end;
ulong tot = 0;
beg = tscget();
for (ulong cnt = 10000000; cnt > 0; --cnt)
tot += fnc(cnt);
end = tscget();
end -= beg;
return end;
}
void
run(bnc_t *bnc)
{
tsc_t odif = test(orig);
tsc_t fdif = test(fix1);
bnc->orig = odif;
bnc->fix1 = fdif;
}
int
main(void)
{
bnc_t *bnc;
for (int pass = 0; pass < BNCMAX; ++pass) {
bnc = &bnclist[pass];
run(bnc);
}
for (int pass = 0; pass < BNCMAX; ++pass) {
bnc = &bnclist[pass];
printf("orig=%lld fix1=%lld (%lld)\n",
bnc->orig,bnc->fix1,bnc->orig - bnc->fix1);
}
return 0;
}
And, here's the output (no real change):
orig=31588215 fix1=26821473 (4766742)
orig=25748732 fix1=25917183 (-168451)
orig=25805426 fix1=25635759 (169667)
orig=25479642 fix1=26037620 (-557978)
orig=26668860 fix1=25959444 (709416)
orig=26047616 fix1=25540493 (507123)
orig=25772292 fix1=25460041 (312251)
orig=25709852 fix1=26172701 (-462849)
orig=26124151 fix1=25766472 (357679)
orig=25539018 fix1=26845018 (-1306000)
orig=26884105 fix1=26869566 (14539)
orig=26184938 fix1=27826408 (-1641470)
orig=25841934 fix1=25482603 (359331)
orig=25509107 fix1=25436511 (72596)
orig=25448812 fix1=25473302 (-24490)
orig=25433894 fix1=25812646 (-378752)
orig=25868190 fix1=26180032 (-311842)
orig=25451573 fix1=25503657 (-52084)
orig=25393540 fix1=25484952 (-91412)
orig=26032526 fix1=26825219 (-792693)
orig=25859126 fix1=25529430 (329696)
orig=25692214 fix1=25431668 (260546)
orig=25463849 fix1=25370236 (93613)
orig=25650185 fix1=25401441 (248744)
orig=25702951 fix1=26858126 (-1155175)
orig=26187072 fix1=25800102 (386970)
orig=26493916 fix1=25591639 (902277)
orig=26456983 fix1=25724181 (732802)
orig=25842746 fix1=26119019 (-276273)
orig=26654148 fix1=29452577 (-2798429)
orig=27936505 fix1=28494045 (-557540)
orig=30067162 fix1=27029523 (3037639)
orig=25785637 fix1=25856415 (-70778)
orig=25521760 fix1=25286859 (234901)
orig=25433035 fix1=25626380 (-193345)
orig=25373358 fix1=25541615 (-168257)
orig=25846496 fix1=25446494 (400002)
orig=25368198 fix1=25321934 (46264)
orig=25615453 fix1=28574223 (-2958770)
orig=26660896 fix1=25508745 (1152151)
orig=25891979 fix1=25546436 (345543)
orig=25296369 fix1=25382779 (-86410)
orig=25438794 fix1=25372736 (66058)
orig=25531652 fix1=25498422 (33230)
orig=25977272 fix1=25456931 (520341)
orig=25336327 fix1=25423638 (-87311)
orig=26037148 fix1=25313703 (723445)
orig=25314995 fix1=25538181 (-223186)
orig=26638367 fix1=26446762 (191605)
orig=25915537 fix1=25633327 (282210)
orig=25409105 fix1=25287069 (122036)
orig=25633931 fix1=26423463 (-789532)
orig=26074523 fix1=26524398 (-449875)
orig=25602157 fix1=25580893 (21264)
orig=25490481 fix1=25557287 (-66806)
orig=25666843 fix1=25496179 (170664)
orig=26573635 fix1=25796737 (776898)
orig=26133811 fix1=26226840 (-93029)
orig=28262664 fix1=26022265 (2240399)
orig=25336820 fix1=25683095 (-346275)
orig=25899602 fix1=25660778 (238824)
orig=25440453 fix1=25630320 (-189867)
orig=25356601 fix1=25422670 (-66069)
orig=25419887 fix1=25611533 (-191646)
orig=25766460 fix1=25596927 (169533)
orig=25619510 fix1=25449303 (170207)
orig=25359373 fix1=25380306 (-20933)
orig=25474687 fix1=27194210 (-1719523)
orig=26389253 fix1=26709738 (-320485)
orig=26132999 fix1=25671907 (461092)
orig=25416724 fix1=25540911 (-124187)
orig=25440277 fix1=25364387 (75890)
orig=25704885 fix1=25661456 (43429)
orig=25544376 fix1=25380520 (163856)
orig=25340926 fix1=25956342 (-615416)
orig=25383668 fix1=25397807 (-14139)
orig=25636178 fix1=25769479 (-133301)
orig=26237022 fix1=29897502 (-3660480)
orig=28235814 fix1=25475574 (2760240)
orig=25457466 fix1=25450557 (6909)
orig=25775658 fix1=25802380 (-26722)
orig=27577521 fix1=25444772 (2132749)
orig=25380927 fix1=25409250 (-28323)
orig=25417872 fix1=25336530 (81342)
orig=25995656 fix1=26338512 (-342856)
orig=25553088 fix1=25334495 (218593)
orig=25416197 fix1=25521031 (-104834)
orig=29150160 fix1=25717390 (3432770)
orig=26026892 fix1=26916678 (-889786)
orig=25694048 fix1=25496660 (197388)
orig=25576011 fix1=25676045 (-100034)
orig=25461907 fix1=25462593 (-686)
orig=25736879 fix1=27349093 (-1612214)
orig=25687558 fix1=25829963 (-142405)
orig=25492417 fix1=25752421 (-260004)
orig=25559702 fix1=25423874 (135828)
orig=25799145 fix1=28961932 (-3162787)
orig=25912111 fix1=26018163 (-106052)
orig=25725927 fix1=25794091 (-68164)
orig=25528795 fix1=25855893 (-327098)
UPDATE #2:
Here's my newest test version:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
typedef unsigned long ulong;
ulong orig(ulong);
ulong fix1(ulong);
typedef ulong (*fnc_p)(ulong);
typedef long long tsc_t;
typedef struct {
tsc_t dif;
ulong tot;
} test_t;
typedef struct {
test_t orig;
test_t fix1;
} bnc_t;
#define BNCMAX 100
bnc_t bnclist[BNCMAX];
ulong itermax;
static inline tsc_t
tscget(void)
{
struct timespec ts;
tsc_t tsc;
clock_gettime(CLOCK_MONOTONIC,&ts);
tsc = ts.tv_sec;
tsc *= 1000000000;
tsc += ts.tv_nsec;
return tsc;
}
tsc_t
test(test_t *tst,fnc_p fnc)
{
tsc_t beg;
tsc_t end;
ulong tot = 0;
beg = tscget();
for (ulong cnt = itermax; cnt > 0; --cnt)
tot += fnc(cnt);
end = tscget();
end -= beg;
tst->dif = end;
tst->tot = tot;
return end;
}
void
run(bnc_t *bnc)
{
test(&bnc->orig,orig);
test(&bnc->fix1,fix1);
}
int
main(int argc,char **argv)
{
bnc_t *bnc;
test_t bestorig;
test_t bestfix1;
--argc;
++argv;
if (argc > 0)
itermax = atoll(*argv);
else
itermax = 10000000;
for (int pass = 0; pass < BNCMAX; ++pass) {
bnc = &bnclist[pass];
run(bnc);
}
bnc = &bnclist[0];
bestorig = bnc->orig;
bestfix1 = bnc->fix1;
for (int pass = 0; pass < BNCMAX; ++pass) {
bnc = &bnclist[pass];
printf("orig=%lld fix1=%lld (%lld)\n",
bnc->orig.dif,bnc->fix1.dif,bnc->orig.dif - bnc->fix1.dif);
if (bnc->orig.tot != bnc->fix1.tot)
printf("FAIL: orig=%ld fix1=%ld\n",bnc->orig.tot,bnc->fix1.tot);
if (bnc->orig.dif < bestorig.dif)
bestorig = bnc->orig;
if (bnc->fix1.dif < bestfix1.dif)
bestfix1 = bnc->fix1;
}
printf("\n");
printf("itermax=%ld\n",itermax);
printf("orig=%lld\n",bestorig.dif);
printf("fix1=%lld\n",bestfix1.dif);
return 0;
}


FreeBSD syscall clobbering more registers than Linux? Inline asm different behaviour between optimization levels

Recently I was playing with FreeBSD system calls. I had no problem with the i386 part since it's well documented here, but I can't find the same document for x86_64.
I saw people using the same approach as on Linux, but in plain assembly, not C. I suppose in my case the system call is actually changing some register which is used at a high optimization level, so it gives different behaviour.
/* for SYS_* constants */
#include <sys/syscall.h>
/* for types like size_t */
#include <unistd.h>
ssize_t sys_write(int fd, const void *data, size_t size){
register long res __asm__("rax");
register long arg0 __asm__("rdi") = fd;
register long arg1 __asm__("rsi") = (long)data;
register long arg2 __asm__("rdx") = size;
__asm__ __volatile__(
"syscall"
: "=r" (res)
: "0" (SYS_write), "r" (arg0), "r" (arg1), "r" (arg2)
: "rcx", "r11", "memory"
);
return res;
}
int main(){
for(int i = 0; i < 1000; i++){
char a = 0;
int some_invalid_fd = -1;
sys_write(some_invalid_fd, &a, 1);
}
return 0;
}
In the above code I just expect it to call sys_write 1000 times and then return from main. I use truss to check the system calls and their parameters. Everything works fine with -O0, but when I go to -O3 the for loop gets stuck forever. I believe the system call is changing the i variable or the 1000 to something weird.
Dump of assembler code for function main:
0x0000000000201900 <+0>: push %rbp
0x0000000000201901 <+1>: mov %rsp,%rbp
0x0000000000201904 <+4>: mov $0x3e8,%r8d
0x000000000020190a <+10>: lea -0x1(%rbp),%rsi
0x000000000020190e <+14>: mov $0x1,%edx
0x0000000000201913 <+19>: mov $0xffffffffffffffff,%rdi
0x000000000020191a <+26>: nopw 0x0(%rax,%rax,1)
0x0000000000201920 <+32>: movb $0x0,-0x1(%rbp)
0x0000000000201924 <+36>: mov $0x4,%eax
0x0000000000201929 <+41>: syscall
0x000000000020192b <+43>: add $0xffffffff,%r8d
0x000000000020192f <+47>: jne 0x201920 <main+32>
0x0000000000201931 <+49>: xor %eax,%eax
0x0000000000201933 <+51>: pop %rbp
0x0000000000201934 <+52>: ret
What is wrong with sys_write()? Why does the for loop get stuck?
Optimization level determines where clang decides to keep its loop counter: in memory (unoptimized) or in a register, in this case r8d (optimized). R8D is a logical choice for the compiler: it's a call-clobbered reg it can use without saving at the start/end of main, and you've told it all the registers it could use without a REX prefix (like ECX) are either inputs / outputs or clobbers for the asm statement.
Note: if FreeBSD is like MacOS, system call error / no-error status is returned in CF (the carry flag), not via RAX being in the -4095..-1 range. In that case, you'd want a GCC6 flag-output operand like "=@ccc" (err) for int err (inside #ifdef __GCC_ASM_FLAG_OUTPUTS__ - example), or a setc %cl in the template to materialize a boolean manually. (CL is a good choice because you can just use it as an output instead of a clobber.)
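An untested sketch of that flag-output approach (assuming FreeBSD really does use the CF convention, and reusing the includes from the question; sys_write_cf is a made-up name):
ssize_t sys_write_cf(int fd, const void *data, size_t size){
    register long res __asm__("rax");
    register long arg0 __asm__("rdi") = fd;
    register long arg1 __asm__("rsi") = (long)data;
    register long arg2 __asm__("rdx") = size;
    int err; /* materialized from CF after syscall: 1 = error */
    __asm__ __volatile__(
        "syscall"
        : "=r" (res), "=@ccc" (err) /* needs __GCC_ASM_FLAG_OUTPUTS__ (GCC6+) */
        : "0" (SYS_write), "r" (arg0), "r" (arg1), "r" (arg2)
        : "rcx", "r11", "memory", "r8", "r9", "r10"
    );
    return err ? -res : res; /* under that convention, RAX holds the errno code on error */
}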
FreeBSD's syscall handling trashes R8, R9, and R10, in addition to the bare minimum clobbering that Linux does: RAX (retval) and RCX / R11. (The syscall instruction itself uses those two to save RIP / RFLAGS so the kernel can find its way back to user-space, so the kernel never even sees the original values.)
Possibly also RDX, we're not sure; the comments call it "return value 2" (i.e. as part of a RDX:RAX return value?). We also don't know what future-proof ABI guarantees FreeBSD intends to maintain in future kernels.
You can't assume R8-R10 are zero after syscall because they're actually preserved instead of zeroed when tracing / single-stepping. (Because then the kernel chooses not to return via sysret, for the same reason as Linux: hardware / design bugs make that unsafe if registers might have been modified by ptrace while inside the system call. e.g. attempting to sysret with a non-canonical RIP will #GP in ring 0 (kernel mode) on Intel CPUs! That's a disaster because RSP = user stack at that point.)
The relevant kernel code is the sysret path (well spotted by @NateEldredge; I found the syscall entry point by searching for swapgs, but hadn't gotten to looking at the return path).
The function-call-preserved registers don't need to be restored by that code because calling a C function didn't destroy them in the first place. And the code does restore the function-call-clobbered "legacy" registers RDI, RSI, and RDX.
R8-R11 are the registers that are call-clobbered in the function-calling convention, and that are outside the original 8 x86 registers. So that's what makes them "special". (R11 doesn't get zeroed; syscall/sysret uses it for RFLAGS, so that's the value you'll find there after syscall)
Zeroing is slightly faster than loading them, and in the normal case (syscall instruction inside a libc wrapper function) you're about to return to a caller that's only assuming the function-calling convention, and thus will assume that R8-R11 are trashed (same for RDI, RSI, RDX, and RCX, although FreeBSD does bother to restore those for some reason.)
This zeroing only happens when not single-stepping or tracing (e.g. truss or GDB si). The syscall entry point into an amd64 kernel (Github) does save all the incoming registers, so they're available to be restored by other ways out of the kernel.
Updated asm() wrapper
// Should be fixed for FreeBSD, plus other improvements
ssize_t sys_write(int fd, const void *data, size_t size){
register ssize_t res __asm__("rax");
register int arg0 __asm__("edi") = fd;
register const void *arg1 __asm__("rsi") = data; // you can use real types
register size_t arg2 __asm__("rdx") = size;
__asm__ __volatile__(
"syscall"
// RDX *maybe* clobbered
: "=a" (res), "+r" (arg2)
// RDI, RSI preserved
: "a" (SYS_write), "r" (arg0), "r" (arg1)
// An arg in R10, R8, or R9 definitely would be
: "rcx", "r11", "memory", "r8", "r9", "r10" ////// The fix: r8-r10
// see below for a version that avoids the "memory" clobber with a dummy input operand
);
return res;
}
Use "+r" output/input operands with any args that need register long arg3 asm("r10") or similar for r8 or r9.
This is inside a wrapper function so the modified value of the C variables get thrown away, forcing repeated calls to set up the args every time. That would be the "defensive" approach until another answer identifies more definitely-non-trashed registers.
I did break *0x000000000020192b, then info registers when the break happened. r8 is zero. The program still gets stuck in this case.
I assume that r8 wasn't zero before you did that GDB continue across the syscall instruction. Yes, that test confirms that the FreeBSD kernel is trashing r8 when not single-stepping. (And behaving in a way that matches what we see in the source code.)
Note that you can tell the compiler that a write system call only reads memory (not writes) using a dummy "m" input operand instead of a "memory" clobber. That would let it hoist the store of c out of the loop. (How can I indicate that the memory *pointed* to by an inline ASM argument may be used?)
i.e. "m"(*(const char (*)[size]) data) as an input instead of a "memory" clobber.
If you're going to write specific wrappers for each syscall you use, instead of a generic wrapper you use for every 3-operand syscall that just casts all operands to unsigned long, this is the advantage you can get from doing that.
Speaking of which, there's absolutely no point in making your syscall args all be long; making user-space sign-extend int fd into a 64-bit register is just wasted instructions. The kernel ABI will (almost certainly) ignore the high bytes of registers for narrow args, like Linux does. (Again, unless you're making a generic syscall3 wrapper that you just use with different SYS_ numbers to define write, read, and other 3-operand system calls; then you would cast everything to register-width and just use a "memory" clobber).
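A hypothetical sketch of that generic flavour (syscall3 and my_write are made-up names; it keeps the conservative clobbers from above):
/* Everything register-width; "memory" clobber because a generic wrapper
   can't know which syscalls read or write which memory. */
static inline long syscall3(long num, long a0, long a1, long a2)
{
    long ret;
    __asm__ __volatile__("syscall"
        : "=a" (ret), "+D" (a0), "+S" (a1), "+d" (a2)
        : "a" (num)
        : "rcx", "r11", "memory"
#ifndef __linux__
        , "r8", "r9", "r10" /* see above: FreeBSD trashes these */
#endif
    );
    return ret;
}
#define my_write(fd, buf, len) syscall3(SYS_write, (long)(fd), (long)(buf), (long)(len))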
I made these changes for my modified version below.
Also note that for RDI, RSI, and RDX, there are specific-register letter constraints which you can use instead of register-asm locals, just like you're doing for the return value in RAX ("=a"). BTW, you don't really need a matching constraint for the call number, just use an "a" input; it's easier to read because you don't need to look at another operand to check that you're matching the right output.
// assuming RDX *is* clobbered.
// could remove the + if it isn't.
ssize_t sys_write(int fd, const void *data, size_t size)
{
// register long arg3 __asm__("r10") = ??;
// register-asm is useful for R8 and up
ssize_t res;
__asm__ __volatile__("syscall"
// RDX
: "=a" (res), "+d" (size)
// EAX/RAX RDI RSI
: "a" (SYS_write), "D" (fd), "S" (data),
"m" (*(const char (*)[size]) data) // tells compiler this mem is an input
: "rcx", "r11" //, "memory"
#ifndef __linux__
, "r8", "r9", "r10" // Linux always restores these
#endif
);
return res;
}
Some people prefer register ... asm("") for all the operands because you get to use the full register name, and don't have to remember the totally-non-obvious "D" for RDI/EDI/DI/DIL vs. "d" for RDX/EDX/DX/DL
Here's a test framework to work with. It is [loosely] modeled on a H/W logic analyzer and/or things like dtrace.
It will save registers before and after the syscall instruction in a large global buffer.
After the loop terminates it will dump out a trace of all the register values that were stored.
It is multiple files. To extract:
save the code below to a file (e.g. /tmp/archive).
Create a directory: (e.g.) /tmp/extract
cd to /tmp/extract.
Then do: perl /tmp/archive -go.
It will create some subdirectories: /tmp/extract/syscall and /tmp/extract/snaplib and store a few files there.
cd to the program target directory (e.g.) cd /tmp/extract/syscall
build with: make
Then, run with: ./syscall
Here is the file:
Edit: I've added a check for overflow of the snaplist buffer in the snapreg function. If the buffer is full, dumpall is called automatically. This is good in general but also necessary if the loop in main never terminates (i.e. without the check the post-loop dump would never occur)
Edit: And, I've added optional "x86_64 red zone" support
#!/usr/bin/perl
# FILE: ovcbin/ovcext.pm 755
# ovcbin/ovcext.pm -- ovrcat archive extractor
#
# this is a self extracting archive
# after the __DATA__ line, files are separated by:
# % filename
ovcext_cmd(@ARGV);
exit(0);
sub ovcext_cmd
{
my(@argv) = @_;
local($xfdata);
local($xfdiv,$divcur,%ovcdiv_lookup);
$pgmtail = "ovcext";
ovcinit();
ovcopt(\@argv,qw(opt_go opt_f opt_t));
$xfdata = "ovrcat::DATA";
$xfdata = \*$xfdata;
ovceval($xfdata);
ovcfifo($zipflg_all);
ovcline($xfdata);
$code = ovcwait();
ovcclose(\$xfdata);
ovcdiv();
ovczipd_spl()
if ($zipflg_spl);
}
sub ovceval
{
my($xfdata) = @_;
my($buf,$err);
{
$buf = <$xfdata>;
chomp($buf);
last unless ($buf =~ s/^%\s+([\#\$;])/$1/);
eval($buf);
$err = $@;
unless ($err) {
undef($buf);
last;
}
chomp($err);
$err = " (" . $err . ")"
}
sysfault("ovceval: bad options line -- '%s'%s\n",$buf,$err)
if (defined($buf));
}
sub ovcline
{
my($xfdata) = @_;
my($buf);
my($tail);
while ($buf = <$xfdata>) {
chomp($buf);
if ($buf =~ /^%\s+(.+)$/) {
$tail = $1;
ovcdiv($tail);
next;
}
print($xfdiv $buf,"\n")
if (ref($xfdiv));
}
}
sub ovcdiv
{
my($ofile) = @_;
my($mode);
my($xfcur);
my($err,$prt);
($ofile,$mode) = split(" ",$ofile);
$mode = oct($mode);
$mode &= 0777;
{
unless (defined($ofile)) {
while ((undef,$divcur) = each(%ovcdiv_lookup)) {
close($divcur->{div_xfdst});
}
last;
}
$ofile = ovctail($ofile);
$divcur = $ovcdiv_lookup{$ofile};
if (ref($divcur)) {
$xfdiv = $divcur->{div_xfdst};
last;
}
undef($xfdiv);
if (-e $ofile) {
msg("ovcdiv: file '%s' already exists -- ",$ofile);
unless ($opt_f) {
msg("rerun with -f to force\n");
last;
}
msg("overwriting!\n");
}
unless (defined($err)) {
ovcmkdir($1)
if ($ofile =~ m,^(.+)/[^/]+$,);
}
msg("$pgmtail: %s %s",ovcnogo("extracting"),$ofile);
msg(" chmod %3.3o",$mode)
if ($mode);
msg("\n");
last unless ($opt_go);
last if (defined($err));
$xfcur = ovcopen(">$ofile");
$divcur = {};
$ovcdiv_lookup{$ofile} = $divcur;
if ($mode) {
chmod($mode,$xfcur);
$divcur->{div_mode} = $mode;
}
$divcur->{div_xfdst} = $xfcur;
$xfdiv = $xfcur;
}
}
sub ovcinit
{
{
last if (defined($ztmp));
$ztmp = "/tmp/ovrcat_zip";
$PWD = $ENV{PWD};
$quo_2 = '"';
$ztmp_inp = $ztmp . "_0";
$ztmp_out = $ztmp . "_1";
$ztmp_perl = $ztmp . "_perl";
ovcunlink();
$ovcdbg = ($ENV{"ZPXHOWOVC"} != 0);
}
}
sub ovcunlink
{
_ovcunlink($ztmp_inp,1);
_ovcunlink($ztmp_out,1);
_ovcunlink($ztmp_perl,($pgmtail ne "ovcext") || $opt_go);
}
sub _ovcunlink
{
my($file,$rmflg) = @_;
my($found,$tag);
{
last unless (defined($file));
$found = (-e $file);
$tag //= "notfound"
unless ($found);
$tag //= $rmflg ? "cleaning" : "keeping";
msg("ovcunlink: %s %s ...\n",$tag,$file)
if (($found or $ovcdbg) and (! $ovcunlink_quiet));
unlink($file)
if ($rmflg and $found);
}
}
sub ovcopt
{
my($argv) = @_;
my($opt);
while (1) {
$opt = $argv->[0];
last unless ($opt =~ s/^-/opt_/);
shift(@$argv);
$$opt = 1;
}
}
sub ovctail
{
my($file,$sub) = @_;
my(@file);
$file =~ s,^/,,;
@file = split("/",$file);
$sub //= 2;
@file = splice(@file,-$sub)
if (@file >= $sub);
$file = join("/",@file);
$file;
}
sub ovcmkdir
{
my($odir) = @_;
my(@lhs,@rhs);
@rhs = split("/",$odir);
foreach $rhs (@rhs) {
push(@lhs,$rhs);
$odir = join("/",@lhs);
if ($opt_go) {
next if (-d $odir);
}
else {
next if ($ovcmkdir{$odir});
$ovcmkdir{$odir} = 1;
}
msg("$pgmtail: %s %s ...\n",ovcnogo("mkdir"),$odir);
next unless ($opt_go);
mkdir($odir) or
sysfault("$pgmtail: unable to mkdir '%s' -- $!\n",$odir);
}
}
sub ovcopen
{
my($file,$who) = @_;
my($xf);
$who //= $pgmtail;
$who //= "ovcopen";
open($xf,$file) or
sysfault("$who: unable to open '%s' -- $!\n",$file);
$xf;
}
sub ovcclose
{
my($xfp) = @_;
my($ref);
my($xf);
{
$ref = ref($xfp);
last unless ($ref);
if ($ref eq "GLOB") {
close($xfp);
last;
}
if ($ref eq "REF") {
$xf = $$xfp;
if (ref($xf) eq "GLOB") {
close($xf);
undef($$xfp);
}
}
}
undef($xf);
$xf;
}
sub ovcnogo
{
my($str) = @_;
unless ($opt_go) {
$str = "NOGO-$str";
$nogo_msg = 1;
}
$str;
}
sub ovcdbg
{
if ($ovcdbg) {
printf(STDERR @_);
}
}
sub msg
{
printf(STDERR @_);
}
sub msgv
{
$_ = join(" ",@_);
print(STDERR $_,"\n");
}
sub sysfault
{
printf(STDERR @_);
exit(1);
}
sub ovcfifo
{
}
sub ovcwait
{
my($code);
if ($pid_fifo) {
waitpid($pid_fifo,0);
$code = $? >> 8;
}
$code;
}
sub prtstr
{
my($val,$fmtpos,$fmtneg) = @_;
{
unless (defined($val)) {
$val = "undef";
last;
}
if (ref($val)) {
$val = sprintf("(%s)",$val);
last;
}
$fmtpos //= "'%s'";
if (defined($fmtneg) && ($val <= 0)) {
$val = sprintf($fmtneg,$val);
last;
}
$val = sprintf($fmtpos,$val);
}
$val;
}
sub prtnum
{
my($val) = @_;
$val = prtstr($val,"%d");
$val;
}
END {
msg("$pgmtail: rerun with -go to actually do it\n")
if ($nogo_msg);
ovcunlink();
}
1;
package ovrcat;
__DATA__
% ;
% syscall/syscall.c
/* for SYS_* constants */
#include <sys/syscall.h>
/* for types like size_t */
#include <unistd.h>
#include <snaplib/snaplib.h>
ssize_t
my_write(int fd, const void *data, size_t size)
{
register long res __asm__("rax");
register long arg0 __asm__("rdi") = fd;
register long arg1 __asm__("rsi") = (long)data;
register long arg2 __asm__("rdx") = size;
__asm__ __volatile__(
SNAPNOW
"\tsyscall\n"
SNAPNOW
: "=r" (res)
: "0" (SYS_write), "r" (arg0), "r" (arg1), "r" (arg2)
: "rcx", "r11", "memory"
);
return res;
}
int
main(void)
{
for (int i = 0; i < 8000; i++) {
char a = 0;
int some_invalid_fd = -1;
my_write(some_invalid_fd, &a, 1);
}
snapreg_dumpall();
return 0;
}
% snaplib/snaplib.h
// snaplib/snaplib.h -- register save/dump
#ifndef _snaplib_snaplib_h_
#define _snaplib_snaplib_h_
#ifdef _SNAPLIB_GLO_
#define EXTRN_SNAPLIB /**/
#else
#define EXTRN_SNAPLIB extern
#endif
#ifdef RED_ZONE
#define SNAPNOW \
"\tsubq\t$128,%%rsp\n" \
"\tcall\tsnapreg\n" \
"\taddq\t$128,%%rsp\n"
#else
#define SNAPNOW "\tcall\tsnapreg\n"
#endif
typedef unsigned long reg_t;
#ifndef SNAPREG
#define SNAPREG (1500 * 2)
#endif
typedef struct {
reg_t snap_regs[16];
} __attribute__((packed)) snapreg_t;
typedef snapreg_t *snapreg_p;
EXTRN_SNAPLIB snapreg_t snaplist[SNAPREG];
#ifdef _SNAPLIB_GLO_
snapreg_p snapcur = &snaplist[0];
snapreg_p snapend = &snaplist[SNAPREG];
#else
extern snapreg_p snapcur;
extern snapreg_p snapend;
#endif
#include <snaplib/snaplib.proto>
#include <snaplib/snapgen.h>
#endif
% snaplib/snapall.c
// snaplib/snapall.c -- dump routines
#define _SNAPLIB_GLO_
#include <snaplib/snaplib.h>
#include <stdio.h>
#include <stdlib.h>
void
snapreg_dumpall(void)
{
snapreg_p cur = snaplist;
snapreg_p endp = (snapreg_p) snapcur;
int idx = 0;
for (; cur < endp; ++cur, ++idx) {
printf("\n");
printf("%d:\n",idx);
snapreg_dumpgen(cur);
}
snapcur = snaplist;
}
// snapreg_crash -- invoke dump and abort
void
snapreg_crash(void)
{
snapreg_dumpall();
exit(9);
}
// snapreg_dumpone -- dump single element
void
snapreg_dumpone(snapreg_p cur,int regidx,const char *regname)
{
reg_t regval = cur->snap_regs[regidx];
printf(" %3s %16.16lX %ld\n",regname,regval,regval);
}
% snaplib/snapreg.s
.text
.globl snapreg
snapreg:
push %r14
push %r15
movq snapcur(%rip),%r15
movq %rax,0(%r15)
movq %rbx,8(%r15)
movq %rcx,16(%r15)
movq %rdx,24(%r15)
movq %rsi,32(%r15)
movq %rdi,40(%r15)
movq %rbp,48(%r15)
movq %rsp,56(%r15)
movq %r8,64(%r15)
movq %r9,72(%r15)
movq %r10,80(%r15)
movq %r11,88(%r15)
movq %r12,96(%r15)
movq %r13,104(%r15)
movq %r14,112(%r15)
movq 0(%rsp),%r14
movq %r14,120(%r15)
addq $128,%r15
movq %r15,snapcur(%rip)
cmpq snapend(%rip),%r15
jae snapreg_crash
pop %r15
pop %r14
ret
% snaplib/snapgen.h
#ifndef _snapreg_snapgen_h_
#define _snapreg_snapgen_h_
static inline void
snapreg_dumpgen(snapreg_p cur)
{
snapreg_dumpone(cur,0,"rax");
snapreg_dumpone(cur,1,"rbx");
snapreg_dumpone(cur,2,"rcx");
snapreg_dumpone(cur,3,"rdx");
snapreg_dumpone(cur,4,"rsi");
snapreg_dumpone(cur,5,"rdi");
snapreg_dumpone(cur,6,"rbp");
snapreg_dumpone(cur,7,"rsp");
snapreg_dumpone(cur,8,"r8");
snapreg_dumpone(cur,9,"r9");
snapreg_dumpone(cur,10,"r10");
snapreg_dumpone(cur,11,"r11");
snapreg_dumpone(cur,12,"r12");
snapreg_dumpone(cur,13,"r13");
snapreg_dumpone(cur,14,"r14");
snapreg_dumpone(cur,15,"r15");
}
#endif
% snaplib/snaplib.proto
// /home/cae/OBJ/ovrgen/snaplib/snaplib.proto -- prototypes
// FILE: /home/cae/preserve/ovrbnc/snaplib/snapall.c
// snaplib/snapall.c -- dump routines
void
snapreg_dumpall(void);
// snapreg_crash -- invoke dump and abort
void
snapreg_crash(void);
// snapreg_dumpone -- dump single element
void
snapreg_dumpone(snapreg_p cur,int regidx,const char *regname);
% syscall/Makefile
# /home/cae/preserve/ovrbnc/syscall -- makefile
PGMTGT += syscall
LIBSRC += ../snaplib/snapreg.s
LIBSRC += ../snaplib/snapall.c
ifndef COPTS
COPTS += -O2
endif
CFLAGS += $(COPTS)
CFLAGS += -mno-red-zone
CFLAGS += -g
CFLAGS += -Wall
CFLAGS += -Werror
CFLAGS += -I..
all: $(PGMTGT)
syscall: syscall.c $(CURSRC) $(LIBSRC)
cc -o syscall $(CFLAGS) syscall.c $(CURSRC) $(LIBSRC)
clean:
rm -f $(PGMTGT)

divide and store quotient and remainder in different arrays

The standard div() function returns a div_t struct, for example:
/* div example */
#include <stdio.h> /* printf */
#include <stdlib.h> /* div, div_t */
int main ()
{
div_t divresult;
divresult = div (38,5);
printf ("38 div 5 => %d, remainder %d.\n", divresult.quot, divresult.rem);
return 0;
}
My case is a bit different; I have this
#define NUM_ELTS 21433
int main ()
{
unsigned int quotients[NUM_ELTS];
unsigned int remainders[NUM_ELTS];
int i;
for(i=0;i<NUM_ELTS;i++) {
divide_single_instruction(&quotients[i],&remainders[i]);
}
}
I know that the assembly language for division does everything in a single instruction, so I need to do the same here to save on CPU cycles, which is basically to move the quotient from EAX and the remainder from EDX into the memory locations where my arrays are stored. How can this be done without including asm {} or SSE intrinsics in my C code? It has to be portable.
Since you're writing to the arrays in-place (replacing numerator and denominator with quotient and remainder) you should store the results to temporary variables before writing to the arrays.
void foo (unsigned *num, unsigned *den, int n) {
int i;
for(i=0;i<n;i++) {
unsigned q = num[i]/den[i], r = num[i]%den[i];
num[i] = q, den[i] = r;
}
}
produces this main loop assembly
.L5:
movl (%rdi,%rcx,4), %eax
xorl %edx, %edx
divl (%rsi,%rcx,4)
movl %eax, (%rdi,%rcx,4)
movl %edx, (%rsi,%rcx,4)
addq $1, %rcx
cmpl %ecx, %r8d
jg .L5
There are some more complicated cases where it helps to save the quotient and remainder when they are first used. For example in testing for primes by trial division you often see a loop like this
for (p = 3; p <= n/p; p += 2)
if (!(n % p)) return 0;
It turns out that GCC does not use the remainder from the first division and therefore it does the division instruction twice which is unnecessary. To fix this you can save the remainder when the first division is done like this:
for (p = 3, q=n/p, r=n%p; p <= q; p += 2, q = n/p, r=n%p)
if (!r) return 0;
This speeds up the result by a factor of two.
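Written out as a complete function, that looks like this (just a sketch built around the loop above):
#include <stdbool.h>
/* Trial division; q = n/p and r = n%p are computed together, so the
   compiler can get both from one div instruction per iteration. */
static bool is_prime(unsigned n)
{
    if (n < 2)
        return false;
    if (n % 2 == 0)
        return n == 2;
    unsigned p, q, r;
    for (p = 3, q = n / p, r = n % p; p <= q; p += 2, q = n / p, r = n % p)
        if (!r)
            return false;
    return true;
}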
So in general GCC does a good job particularly if you save the quotient and remainder when they are first calculated.
The general rule here is to trust your compiler to do something fast. You can always disassemble the code and check that the compiler is doing something sane. It's important to realise that a good compiler knows a lot about the machine, often more than you or me.
Also let's assume you have a good reason for needing to "count cycles".
For your example code I agree that the x86 "idiv" instruction is the obvious choice. Let's see what my compiler (MS visual C 2013) will do if I just write out the most naive code I can
#include <stdio.h>
struct divresult {
int quot;
int rem;
};
struct divresult divrem(int num, int den)
{
return (struct divresult) { num / den, num % den };
}
int main()
{
struct divresult res = divrem(5, 2);
printf("%d, %d", res.quot, res.rem);
}
And the compiler gives us:
struct divresult res = divrem(5, 2);
printf("%d, %d", res.quot, res.rem);
01121000 push 1
01121002 push 2
01121004 push 1123018h
01121009 call dword ptr ds:[1122090h] ;;; this is printf()
Wow, I was outsmarted by the compiler. Visual C knows how division works so it just precalculated the result and inserted constants. It didn't even bother to include my function in the final code. We have to read in the integers from console to force it to actually do the calculation:
int main()
{
int num, den;
scanf("%d, %d", &num, &den);
struct divresult res = divrem(num, den);
printf("%d, %d", res.quot, res.rem);
}
Now we get:
struct divresult res = divrem(num, den);
01071023 mov eax,dword ptr [num]
01071026 cdq
01071027 idiv eax,dword ptr [den]
printf("%d, %d", res.quot, res.rem);
0107102A push edx
0107102B push eax
0107102C push 1073020h
01071031 call dword ptr ds:[1072090h] ;;; printf()
So you see, the compiler (or this compiler at least) already does what you want, or something even more clever.
From this we learn to trust the compiler and only second-guess it when we know it isn't doing a good enough job already.

Toggle a given range of bits of an unsigned int in C

I am trying to replace the following piece of code
// code version 1
unsigned int time_stx = 11; // given range start
unsigned int time_enx = 19; // given range end
unsigned int time = 0; // desired output
while(time_stx < time_enx) time |= (1 << time_stx++);
with the following one without a loop
// code version 2
unsigned int time_stx = 11;
unsigned int time_enx = 19;
unsigned int time = (1 << time_enx) - (1 << time_stx);
It turns out that in code version 1, time = 522240, while in code version 2, time = 0, when I use
printf("%u\n", time);
to compare the results. I would like to know why this is happening and whether there is any faster way to toggle bits in a given range. My compiler is gcc (Debian 4.9.2-10) 4.9.2.
Edit:
Thank you for your replies. I made a silly mistake, and I feel embarrassed posting my question without further inspecting my code. I did
unsigned int time_stx = 11;
unsigned int time_enx = 19;
unsigned int time1 = 0;
while(time_stx < time_enx) time1 |= (1 << time_stx++); // version 1
//// what I should, but forgotten to do
// time_stx = 11;
// time_enx = 19;
// where time_stx = time_enx now...
unsigned int time2 = (1 << time_enx) - (1 << time_stx); // version 2
// then obviously
printf("time1 = %u\n", time1); // time1 = 522240
printf("time2 = %u\n", time2); // time2 = 0
I am so sorry for any inconvenience incurred.
Remark: both time_stx and time_enx are generated at run time and are not fixed.
As suggested, I made a mistake and the problem is solved now. Thank you!!
Read Bit twiddling hacks. Even if the answer isn't in there, you'll be better educated on bit twiddling. Also, the original code is simply setting the bits in the range; toggling means turning 1 bits into 0 bits and vice versa (normally achieved using ^ or xor).
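To make the set/toggle distinction concrete, here's a minimal sketch (assuming 0 <= start <= end <= 31 so the shifts stay well-defined; the helper names are mine):
/* Mask with bits [start, end) set. For end == 32 you'd need something like
   (end == 32 ? ~0u : (1u << end)) - (1u << start) instead. */
static unsigned range_mask(unsigned start, unsigned end)
{
    return (1u << end) - (1u << start);
}
static unsigned set_bits(unsigned v, unsigned start, unsigned end)
{
    return v | range_mask(start, end); /* what the question's loop does */
}
static unsigned toggle_bits(unsigned v, unsigned start, unsigned end)
{
    return v ^ range_mask(start, end); /* actual toggling: xor flips the range */
}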
As to the code, I converted three variants of the expression into the following C code:
#include <stdio.h>
static void print(unsigned int v)
{
printf("0x%.8X = %u\n", v, v);
}
static void bit_setter1(void)
{
unsigned int time_stx = 11; // given range start
unsigned int time_enx = 19; // given range end
unsigned int time = 0; // desired output
while (time_stx < time_enx)
time |= (1 << time_stx++);
print(time);
}
static void bit_setter2(void)
{
unsigned int time_stx = 11;
unsigned int time_enx = 19;
unsigned int time = (1 << time_enx) - (1 << time_stx);
print(time);
}
static void bit_setter3(void)
{
unsigned int time = 0xFF << 11;
print(time);
}
int main(void)
{
bit_setter1();
bit_setter2();
bit_setter3();
return 0;
}
When I look at the assembler for it (GCC 5.1.0 on Mac OS X 10.10.3), I get:
.globl _main
_main:
LFB5:
LM1:
LVL0:
subq $8, %rsp
LCFI0:
LBB28:
LBB29:
LBB30:
LBB31:
LM2:
movl $522240, %edx
movl $522240, %esi
leaq LC0(%rip), %rdi
xorl %eax, %eax
call _printf
LVL1:
LBE31:
LBE30:
LBE29:
LBE28:
LBB32:
LBB33:
LBB34:
LBB35:
movl $522240, %edx
movl $522240, %esi
xorl %eax, %eax
leaq LC0(%rip), %rdi
call _printf
LVL2:
LBE35:
LBE34:
LBE33:
LBE32:
LBB36:
LBB37:
LBB38:
LBB39:
movl $522240, %edx
movl $522240, %esi
xorl %eax, %eax
leaq LC0(%rip), %rdi
call _printf
LVL3:
LBE39:
LBE38:
LBE37:
LBE36:
LM3:
xorl %eax, %eax
addq $8, %rsp
LCFI1:
ret
That's an amazingly large collection of labels!
The compiler has fully evaluated all three minimal bit_setterN() functions and inlined them, along with the call to print, into the body of main(). That includes evaluating the expressions to 522240 each time.
Compilers are good at optimization. Write clear code and let them at it, and they will optimize better than you can. Clearly, if the 11 and 19 are not fixed in your code (they're some sort of computed variables which can vary at runtime), then the precomputation isn't as easy (and bit_setter3() is a non-starter). Then the non-loop code will work OK, as will the loop code.
For the record, the output is:
0x0007F800 = 522240
0x0007F800 = 522240
0x0007F800 = 522240
If your Debian compiler is giving you a zero from one of the code fragments, then there's either a difference between what you compiled and what you posted, or there's a bug in the compiler. On the whole, and no disrespect intended, it is more likely that you've made a mistake than that the compiler has a bug in it that shows up in code as simple as this.

Peculiar instruction sequence generated from straightforward C "if" condition

I am trying to debug some simple C code under gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3 for x86-64. The code is built with CFLAGS += -std=c99 -g -Wall -O0
#include <errno.h>
#include <stdio.h>
#include <string.h>
#pragma pack(1)
int main (int argc, char **argv)
{
FILE *f = fopen ("the_file", "r"); /* error checking removed for clarity */
struct {
short len;
short itm [4];
char nul;
} f00f;
int n = fread (&f00f, 1, sizeof f00f, f);
if (f00f.nul ||
f00f.len != 0x900 ||
f00f.itm [0] != 0xf00f ||
f00f.itm [1] != 0xf00f ||
f00f.itm [2] != 0xf00f ||
f00f.itm [3] != 0xf00f)
{
fprintf (stderr, "bitfile_hdr F00F data err:\n"
"\tNUL: 0x%x\n"
"\tlen: 0x%hx should be 0x900\n"
"\tf00f: 0x%hx\n"
"\tf00f: 0x%hx\n"
"\tf00f: 0x%hx\n"
"\tf00f: 0x%hx\n"
, f00f.nul, f00f.len,
f00f.itm[0], f00f.itm[1], f00f.itm[2], f00f.itm[3]
);
return 1;
}
return 0;
}
The data matches what the test expects, and—weirdly—the error message displays the correct data:
$ ./bit_parse
bitfile_hdr F00F data err:
NUL: 0x0
len: 0x900 should be 0x900
f00f: 0xf00f
f00f: 0xf00f
f00f: 0xf00f
f00f: 0xf00f
Running it under gdb and examining the structure also shows correct data.
(gdb) p /x f00f
$1 = {len = 0x900, itm = {0xf00f, 0xf00f, 0xf00f, 0xf00f}, nul = 0x0}
Since that didn't make sense, I examined the instructions from inside gdb to reveal coding pathologies. The instructions corresponding to the non-functioning if are:
0x0000000000400736 <+210>: movzwl -0x38(%rbp),%eax
0x000000000040073a <+214>: movswl %ax,%r8d
0x000000000040073e <+218>: movzwl -0x3a(%rbp),%eax
0x0000000000400742 <+222>: movswl %ax,%edi
0x0000000000400745 <+225>: movzwl -0x3c(%rbp),%eax
0x0000000000400749 <+229>: movswl %ax,%r9d
0x000000000040074d <+233>: movzwl -0x3e(%rbp),%eax
0x0000000000400751 <+237>: movswl %ax,%r10d
0x0000000000400755 <+241>: movzwl -0x40(%rbp),%eax
0x0000000000400759 <+245>: movswl %ax,%ecx
0x000000000040075c <+248>: movzbl -0x36(%rbp),%eax
0x0000000000400760 <+252>: movsbl %al,%edx
0x0000000000400763 <+255>: mov $0x4008d8,%esi
0x0000000000400768 <+260>: mov 0x2008d1(%rip),%rax # 0x601040 <stderr@@GLIBC_2.2.5>
0x000000000040076f <+267>: mov %r8d,0x8(%rsp)
0x0000000000400774 <+272>: mov %edi,(%rsp)
0x0000000000400777 <+275>: mov %r10d,%r8d
0x000000000040077a <+278>: mov %rax,%rdi
0x000000000040077d <+281>: mov $0x0,%eax
0x0000000000400782 <+286>: callq 0x400550 <fprintf@plt>
0x0000000000400787 <+291>: mov $0x6,%eax
0x000000000040078c <+296>: add $0x50,%rsp
0x0000000000400790 <+300>: pop %rbx
0x0000000000400791 <+301>: pop %r12
0x0000000000400793 <+303>: pop %rbp
0x0000000000400794 <+304>: retq
It is really hard to see how this could implement a conditional.
Anyone see why this (mis)behaves as it does?
Probably on your platform, short is 16 bits wide. Therefore no short can equal 0xf00f and the condition f00f.itm [0] != 0xf00f is always true. The compiler optimized accordingly.
You may have meant unsigned short in the definition of struct f00f, but this is only one way to fix it, of course. You could also compare f00f.itm [0] to (short)0xf00f, but if you meant f00f.itm[i] to be compared to 0xf00f, you definitely should have used unsigned short in the definition.
short val = 0xf00f; assigns the value -4081 to val.
You get hit by integer promotion rules.
f00f.itm [0] != 0xf00f
converts the short in f00f.itm [0] to an int, and that's -4081. 0xf00f as an int is 61455, and those two are not equal. Since the value is converted to an unsigned short when you print out the values (by using %hx), the issue isn't visible in the output.
Use unsigned values in your struct since you seem to treat the values as unsigned:
struct {
unsigned short len;
unsigned short itm [4];
char nul;
} f00f;
This sample program might make you understand what's going on a bit better:
#include <stdio.h>
int main(int argc,char *arga[])
{
short x = 0xf00f;
int y = 0xf00f;
printf("x = 0x%hx y = 0x%x\n", x, y);
printf("x = %d y = %d\n", x, y);
printf("x==y: %d\n", x == y);
return 0;
}

Getting started with Intel x86 SSE SIMD instructions

I want to learn more about using the SSE.
What ways are there to learn, besides the obvious reading the Intel® 64 and IA-32 Architectures Software Developer's Manuals?
Mainly I'm interested to work with the GCC X86 Built-in Functions.
First, I don't recommend using the built-in functions - they are not portable (across compilers of the same arch).
Use intrinsics; GCC does a wonderful job optimizing SSE intrinsics into even more optimized code. You can always have a peek at the assembly and see how to use SSE to its full potential.
Intrinsics are easy - just like normal function calls:
#include <immintrin.h> // portable to all x86 compilers
int main()
{
__m128 vector1 = _mm_set_ps(4.0, 3.0, 2.0, 1.0); // high element first, opposite of C array order. Use _mm_setr_ps if you want "little endian" element order in the source.
__m128 vector2 = _mm_set_ps(7.0, 8.0, 9.0, 0.0);
__m128 sum = _mm_add_ps(vector1, vector2); // result = vector1 + vector 2
vector1 = _mm_shuffle_ps(vector1, vector1, _MM_SHUFFLE(0,1,2,3));
// vector1 is now (1, 2, 3, 4) (above shuffle reversed it)
return 0;
}
Use _mm_load_ps or _mm_loadu_ps to load data from arrays.
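For example, here's a minimal sketch that adds two float arrays four elements at a time (assuming n is a multiple of 4; _mm_loadu_ps tolerates unaligned pointers, while _mm_load_ps requires 16-byte alignment):
#include <immintrin.h>
void add_arrays(float *dst, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i); /* unaligned 4-float load */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
}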
Of course there are way more options; SSE is really powerful and in my opinion relatively easy to learn.
See also https://stackoverflow.com/tags/sse/info for some links to guides.
Since you asked for resources:
A practical guide to using SSE with C++: Good conceptual overview on how to use SSE effectively, with examples.
MSDN Listing of Compiler Intrinsics: Comprehensive reference for all your intrinsic needs. It's MSDN, but pretty much all the intrinsics listed here are supported by GCC and ICC as well.
Christopher Wright's SSE Page: Quick reference on the meanings of the SSE opcodes. I guess the Intel Manuals can serve the same function, but this is faster.
It's probably best to write most of your code in intrinsics, but do check the objdump of your compiler's output to make sure that it's producing efficient code. SIMD code generation is still a fairly new technology and it's very possible that the compiler might get it wrong in some cases.
I find Dr. Agner Fog's research & optimization guides very valuable! He also has some libraries & testing tools that I have not tried yet.
http://www.agner.org/optimize/
Step 1: write some assembly manually
I recommend that you first try to write your own assembly manually to see and control exactly what is happening when you start learning.
Then the question becomes how to observe what is happening in the program, and the answers are:
GDB
use the C standard library to print and assert things
Using the C standard library yourself requires a little bit of work, but nothing much. I have for example done this work nicely for you on Linux in the following files of my test setup:
lkmc.h
lkmc.c
lkmc/x86_64.h
Using those helpers, I then start playing around with the basics, such as:
load and store data to / from memory into SSE registers
add integers and floating point numbers of different sizes
assert that the results are what I expect
addpd.S
#include <lkmc.h>
LKMC_PROLOGUE
.data
.align 16
addps_input0: .float 1.5, 2.5, 3.5, 4.5
addps_input1: .float 5.5, 6.5, 7.5, 8.5
addps_expect: .float 7.0, 9.0, 11.0, 13.0
addpd_input0: .double 1.5, 2.5
addpd_input1: .double 5.5, 6.5
addpd_expect: .double 7.0, 9.0
.bss
.align 16
output: .skip 16
.text
/* 4x 32-bit */
movaps addps_input0, %xmm0
movaps addps_input1, %xmm1
addps %xmm1, %xmm0
movaps %xmm0, output
LKMC_ASSERT_MEMCMP(output, addps_expect, $0x10)
/* 2x 64-bit */
movaps addpd_input0, %xmm0
movaps addpd_input1, %xmm1
addpd %xmm1, %xmm0
movaps %xmm0, output
LKMC_ASSERT_MEMCMP(output, addpd_expect, $0x10)
LKMC_EPILOGUE
GitHub upstream.
paddq.S
#include <lkmc.h>
LKMC_PROLOGUE
.data
.align 16
input0: .long 0xF1F1F1F1, 0xF2F2F2F2, 0xF3F3F3F3, 0xF4F4F4F4
input1: .long 0x12121212, 0x13131313, 0x14141414, 0x15151515
paddb_expect: .long 0x03030303, 0x05050505, 0x07070707, 0x09090909
paddw_expect: .long 0x04030403, 0x06050605, 0x08070807, 0x0A090A09
paddd_expect: .long 0x04040403, 0x06060605, 0x08080807, 0x0A0A0A09
paddq_expect: .long 0x04040403, 0x06060606, 0x08080807, 0x0A0A0A0A
.bss
.align 16
output: .skip 16
.text
movaps input1, %xmm1
/* 16x 8bit */
movaps input0, %xmm0
paddb %xmm1, %xmm0
movaps %xmm0, output
LKMC_ASSERT_MEMCMP(output, paddb_expect, $0x10)
/* 8x 16-bit */
movaps input0, %xmm0
paddw %xmm1, %xmm0
movaps %xmm0, output
LKMC_ASSERT_MEMCMP(output, paddw_expect, $0x10)
/* 4x 32-bit */
movaps input0, %xmm0
paddd %xmm1, %xmm0
movaps %xmm0, output
LKMC_ASSERT_MEMCMP(output, paddd_expect, $0x10)
/* 2x 64-bit */
movaps input0, %xmm0
paddq %xmm1, %xmm0
movaps %xmm0, output
LKMC_ASSERT_MEMCMP(output, paddq_expect, $0x10)
LKMC_EPILOGUE
GitHub upstream.
Step 2: write some intrinsics
For production code however, you will likely want to use the pre-existing intrinsics instead of raw assembly as mentioned at: https://stackoverflow.com/a/1390802/895245
So now I try to convert the previous examples into more or less equivalent C code with intrinsics.
addpq.c
#include <assert.h>
#include <string.h>
#include <x86intrin.h>
float global_input0[] __attribute__((aligned(16))) = {1.5f, 2.5f, 3.5f, 4.5f};
float global_input1[] __attribute__((aligned(16))) = {5.5f, 6.5f, 7.5f, 8.5f};
float global_output[4] __attribute__((aligned(16)));
float global_expected[] __attribute__((aligned(16))) = {7.0f, 9.0f, 11.0f, 13.0f};
int main(void) {
/* 32-bit add (addps). */
{
__m128 input0 = _mm_set_ps(1.5f, 2.5f, 3.5f, 4.5f);
__m128 input1 = _mm_set_ps(5.5f, 6.5f, 7.5f, 8.5f);
__m128 output = _mm_add_ps(input0, input1);
/* _mm_extract_ps returns int instead of float:
* * https://stackoverflow.com/questions/5526658/intel-sse-why-does-mm-extract-ps-return-int-instead-of-float
* * https://stackoverflow.com/questions/3130169/how-to-convert-a-hex-float-to-a-float-in-c-c-using-mm-extract-ps-sse-gcc-inst
* so we must use instead: _MM_EXTRACT_FLOAT
*/
float f;
_MM_EXTRACT_FLOAT(f, output, 3);
assert(f == 7.0f);
_MM_EXTRACT_FLOAT(f, output, 2);
assert(f == 9.0f);
_MM_EXTRACT_FLOAT(f, output, 1);
assert(f == 11.0f);
_MM_EXTRACT_FLOAT(f, output, 0);
assert(f == 13.0f);
/* And we also have _mm_cvtss_f32 + _mm_shuffle_ps, */
assert(_mm_cvtss_f32(output) == 13.0f);
assert(_mm_cvtss_f32(_mm_shuffle_ps(output, output, 1)) == 11.0f);
assert(_mm_cvtss_f32(_mm_shuffle_ps(output, output, 2)) == 9.0f);
assert(_mm_cvtss_f32(_mm_shuffle_ps(output, output, 3)) == 7.0f);
}
/* Now from memory. */
{
__m128 *input0 = (__m128 *)global_input0;
__m128 *input1 = (__m128 *)global_input1;
_mm_store_ps(global_output, _mm_add_ps(*input0, *input1));
assert(!memcmp(global_output, global_expected, sizeof(global_output)));
}
/* 64-bit add (addpd). */
{
__m128d input0 = _mm_set_pd(1.5, 2.5);
__m128d input1 = _mm_set_pd(5.5, 6.5);
__m128d output = _mm_add_pd(input0, input1);
/* OK, and this is how we get the doubles out:
* with _mm_cvtsd_f64 + _mm_unpackhi_pd
* https://stackoverflow.com/questions/19359372/mm-cvtsd-f64-analogon-for-higher-order-floating-point
*/
assert(_mm_cvtsd_f64(output) == 9.0);
assert(_mm_cvtsd_f64(_mm_unpackhi_pd(output, output)) == 7.0);
}
return 0;
}
GitHub upstream.
paddq.c
#include <assert.h>
#include <inttypes.h>
#include <string.h>
#include <x86intrin.h>
uint32_t global_input0[] __attribute__((aligned(16))) = {1, 2, 3, 4};
uint32_t global_input1[] __attribute__((aligned(16))) = {5, 6, 7, 8};
uint32_t global_output[4] __attribute__((aligned(16)));
uint32_t global_expected[] __attribute__((aligned(16))) = {6, 8, 10, 12};
int main(void) {
/* 32-bit add hello world. */
{
__m128i input0 = _mm_set_epi32(1, 2, 3, 4);
__m128i input1 = _mm_set_epi32(5, 6, 7, 8);
__m128i output = _mm_add_epi32(input0, input1);
/* _mm_extract_epi32 mentioned at:
* https://stackoverflow.com/questions/12495467/how-to-store-the-contents-of-a-m128d-simd-vector-as-doubles-without-accessing/56404421#56404421 */
assert(_mm_extract_epi32(output, 3) == 6);
assert(_mm_extract_epi32(output, 2) == 8);
assert(_mm_extract_epi32(output, 1) == 10);
assert(_mm_extract_epi32(output, 0) == 12);
}
/* Now from memory. */
{
__m128i *input0 = (__m128i *)global_input0;
__m128i *input1 = (__m128i *)global_input1;
_mm_store_si128((__m128i *)global_output, _mm_add_epi32(*input0, *input1));
assert(!memcmp(global_output, global_expected, sizeof(global_output)));
}
/* Now a bunch of other sizes. */
{
__m128i input0 = _mm_set_epi32(0xF1F1F1F1, 0xF2F2F2F2, 0xF3F3F3F3, 0xF4F4F4F4);
__m128i input1 = _mm_set_epi32(0x12121212, 0x13131313, 0x14141414, 0x15151515);
__m128i output;
/* 8-bit integers (paddb) */
output = _mm_add_epi8(input0, input1);
assert(_mm_extract_epi32(output, 3) == 0x03030303);
assert(_mm_extract_epi32(output, 2) == 0x05050505);
assert(_mm_extract_epi32(output, 1) == 0x07070707);
assert(_mm_extract_epi32(output, 0) == 0x09090909);
/* 16-bit integers (paddw) */
output = _mm_add_epi16(input0, input1);
assert(_mm_extract_epi32(output, 3) == 0x04030403);
assert(_mm_extract_epi32(output, 2) == 0x06050605);
assert(_mm_extract_epi32(output, 1) == 0x08070807);
assert(_mm_extract_epi32(output, 0) == 0x0A090A09);
/* 32-bit integers (paddd) */
output = _mm_add_epi32(input0, input1);
assert(_mm_extract_epi32(output, 3) == 0x04040403);
assert(_mm_extract_epi32(output, 2) == 0x06060605);
assert(_mm_extract_epi32(output, 1) == 0x08080807);
assert(_mm_extract_epi32(output, 0) == 0x0A0A0A09);
/* 64-bit integers (paddq) */
output = _mm_add_epi64(input0, input1);
assert(_mm_extract_epi32(output, 3) == 0x04040404);
assert(_mm_extract_epi32(output, 2) == 0x06060605);
assert(_mm_extract_epi32(output, 1) == 0x08080808);
assert(_mm_extract_epi32(output, 0) == 0x0A0A0A09);
}
return 0;
}
GitHub upstream.
Step 3: go and optimize some code and benchmark it
The final, and most important and hard step, is of course to actually use the intrinsics to make your code fast, and then to benchmark your improvement.
Doing so, will likely require you to learn a bit about the x86 microarchitecture, which I don't know myself. CPU vs IO bound will likely be one of the things that comes up: What do the terms "CPU bound" and "I/O bound" mean?
As mentioned at: https://stackoverflow.com/a/12172046/895245 this will almost inevitably involve reading Agner Fog's documentation, which appear to be better than anything Intel itself has published.
Hopefully however steps 1 and 2 will serve as a basis to at least experiment with functional non-performance aspects and quickly see what instructions are doing.
TODO: produce a minimal interesting example of such optimization here.
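Until then, here's the rough shape such an experiment could take (a sketch only: the SSE version reassociates the float additions, so results can differ in the last bits, and a single vector accumulator leaves latency on the table):
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>
#define N (1 << 20) /* element count, multiple of 4 */
static float buf[N];
static float sum_scalar(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}
static float sum_sse(const float *a, int n)
{
    __m128 v = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
        v = _mm_add_ps(v, _mm_loadu_ps(a + i));
    float lane[4];
    _mm_storeu_ps(lane, v); /* naive horizontal sum at the end */
    return lane[0] + lane[1] + lane[2] + lane[3];
}
static long ns_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000L + ts.tv_nsec;
}
int main(void)
{
    for (int i = 0; i < N; i++)
        buf[i] = 1.0f;
    long t0 = ns_now();
    float s1 = sum_scalar(buf, N);
    long t1 = ns_now();
    float s2 = sum_sse(buf, N);
    long t2 = ns_now();
    printf("scalar=%f in %ld ns, sse=%f in %ld ns\n", s1, t1 - t0, s2, t2 - t1);
    return 0;
}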
You can use the SIMD-Visualiser to graphically visualize and animate the operations. It'll greatly help understanding how the data lanes are processed
