Emit DIV instruction instead of __udivti3 (C)

Consider the following code:
unsigned long long div(unsigned long long a, unsigned long long b, unsigned long long c) {
    unsigned __int128 d = (unsigned __int128)a * (unsigned __int128)b;
    return d / c;
}
When compiled with x86-64 gcc 10 or clang 10, both with -O3, it emits a call to __udivti3 instead of a single DIVQ instruction:
div:
    mov rax, rdi
    mov r8, rdx
    sub rsp, 8
    xor ecx, ecx
    mul rsi
    mov r9, rax
    mov rsi, rdx
    mov rdx, r8
    mov rdi, r9
    call __udivti3
    add rsp, 8
    ret
At least in my testing, the former is much slower than the (already) slow latter, hence the question: is there a way to make a modern compiler emit DIVQ for the above code?
Edit: let's assume the quotient fits into a 64-bit register.

div will fault if the quotient doesn't fit in 64 bits. Doing (a*b) / c with mul + a single div isn't safe in the general case (doesn't implement the abstract-machine semantics for every possible input), therefore a compiler can't generate asm that way for x86-64.
Even if you do give the compiler enough info to figure out that the division can't overflow (i.e. that high_half < divisor), unfortunately gcc/clang still won't ever optimize it to a single div with a non-zero high-half dividend (RDX).
You need an intrinsic or inline asm to explicitly do a 128 / 64-bit => 64-bit division. e.g. Intrinsics for 128 multiplication and division has GNU C inline asm that looks right, handling the low/high halves separately.
Unfortunately GNU C doesn't have an intrinsic for this. MSVC does, though: Unsigned 128-bit division on 64-bit machine has links.
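For reference, a minimal sketch of that kind of inline asm (my own illustration under the stated assumptions, not the code from the linked answers): the 64x64 => 128-bit multiply is left to the compiler via unsigned __int128, and a single divq is forced for the division. muldiv_u64 is a made-up name.
#include <stdint.h>

// Sketch only: 128 / 64-bit => 64-bit division with a single divq.
// Precondition: the quotient must fit in 64 bits, i.e. the high half of a*b
// must be < c, otherwise the div instruction faults (#DE).
static inline uint64_t muldiv_u64(uint64_t a, uint64_t b, uint64_t c)
{
    unsigned __int128 d = (unsigned __int128)a * b;    // compiler emits one mulq
    uint64_t lo = (uint64_t)d, hi = (uint64_t)(d >> 64);
    uint64_t quot, rem;
    __asm__("divq %[c]"                 // RAX = RDX:RAX / c, RDX = remainder
            : "=a"(quot), "=d"(rem)
            : "a"(lo), "d"(hi), [c] "rm"(c));
    return quot;
}
With GCC this typically compiles to a mul plus a single div, i.e. the sequence discussed above, with the overflow responsibility shifted to the caller.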

Related

When joining four 1-byte vars into one 4-byte word, which is the faster way to shift and OR? (comparing generated assembly code)

So I'm currently studying bit-wise operators and bit-manipulation, and I have come across two different ways to combine four 1-byte words into one 4-byte wide word.
The two ways are given below.
After finding these two methods, I compared the disassembly generated for each (compiled using gcc 11 with the -O2 flag). I don't have much background in disassembly or the code it generates; all I know is that shorter code is usually faster (most of the time, I guess... maybe there are some exceptions). The two methods seem to produce the same number of lines of disassembly, so I guess they have the same performance?
I also got curious about the order of the instructions: the first method seems to alternate instructions (sal>or>sal>or>sal>or), while the second one is more uniform (sal>sal>sal>or>or>mov>or). Does this have a significant effect on performance, say if we were dealing with a larger word?
Two methods
int method1(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
    int combine = 0;
    combine = byte4;
    combine <<= 8;
    combine |= byte3;
    combine <<= 8;
    combine |= byte2;
    combine <<= 8;
    combine |= byte1;
    return combine;
}
int method2(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
    int combine = 0, temp;
    temp = byte4;
    temp <<= 24;
    combine |= temp;
    temp = byte3;
    temp <<= 16;
    combine |= temp;
    temp = byte2;
    temp <<= 8;
    combine |= temp;
    temp = byte1;
    combine |= temp;
    return combine;
}
Disassembly
// method1(unsigned char, unsigned char, unsigned char, unsigned char):
    movzx edi, dil
    movzx esi, sil
    movzx edx, dl
    movzx eax, cl
    sal edi, 8
    or esi, edi
    sal esi, 8
    or edx, esi
    sal edx, 8
    or eax, edx
    ret
// method2(unsigned char, unsigned char, unsigned char, unsigned char):
    movzx edx, dl
    movzx ecx, cl
    movzx esi, sil
    sal edi, 24
    sal edx, 8
    sal esi, 16
    or edx, ecx
    or edx, esi
    mov eax, edx
    or eax, edi
    ret
This might be "premature optimization", but I just want to know if there is a difference.
now for the two methods it seems that they have the same numbers/counts of lines in the generated disassembly code, so I guess they have the same performance?
That hasn't been true for decades. Modern CPUs can execute instructions in parallel if they're independent of each other. See
How many CPU cycles are needed for each assembly instruction?
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
In your case, method 2 is clearly better (with GCC11 -O2 specifically) because the 3 shifts can happen in parallel, leaving only a chain of or instructions. (Most modern x86 CPUs only have 2 shift units, but the 3rd shift can be happening in parallel with the first OR).
Your first version has one long dependency chain of shift/or/shift/or (after the movzx zero-extension), so it has the same throughput but worse latency. If it's not on the critical path of some larger computation, performance would be similar.
The first version also has a redundant movzx edi, dil, because GCC11 -O2 doesn't realize that the high bits will eventually get shifted out the top of the register by the 3x8 = 24 bits of shifting. Unfortunately GCC chose to movzx into the same register (RDI into RDI), not for example movzx eax, dil which would let mov-elimination work.
The second one has a wasted mov eax, edx because GCC didn't realize it should do one of the movzx operations into EAX, instead of zero-extending each reg into itself (defeating mov-elimination). It also could have used lea eax, [edx + edi] to merge into a different reg, because it could have proved that those values couldn't have any overlapping bit-positions, so | and + would be equivalent.
That wasted mov generally only happens in small functions; apparently GCC's register allocator has a hard time when values need to be in specific hard registers. If it had its choice of where to produce the value, it would just end up with it in EDX.
So on Intel CPUs, yes by coincidence of different missed optimizations, both versions have 3 non-eliminated movzx and one instruction which can benefit from mov-elimination. On AMD CPUs, movzx is never eliminated, so the movzx eax, cl doesn't benefit.
Unfortunately, your compiler chooses to do all three or operations in sequence, instead of a tree of dependencies. (a|b) | (c|d) would be lower worst-case latency than (((a|b) | c) | d): a critical path length of 2 from all inputs, vs. 3 from d, 2 from c, and 1 from a or b. (Writing it in C those different ways doesn't actually change how compilers make asm, because they know that OR is associative. I'm just using that familiar syntax to represent the dependency pattern of the asm.)
So if all four inputs were ready at the same time, combining pairs would be lower latency, although it's impossible for most CPUs to produce three shift results in the same cycle.
I was able to hand-hold earlier GCC (GCC5) into making this dependency pattern (Godbolt compiler explorer). I used unsigned to avoid ISO C undefined behaviour. (GNU C does define the behaviour of left shifts even when a set bit is shifted in or out of the sign bit.)
int method4(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
    unsigned combine = (unsigned)byte4 << 24;
    combine |= byte3 << 16;
    unsigned combine_low = (byte2 << 8) | byte1;
    combine |= combine_low;
    return combine;
}
# GCC5.5 -O3
method4(unsigned char, unsigned char, unsigned char, unsigned char):
    movzx eax, sil
    sal edi, 24
    movzx ecx, cl
    sal eax, 16
    or edi, eax      # (byte4<<24) | (byte3<<16)
    movzx eax, dl
    sal eax, 8
    or eax, ecx      # (byte2<<8) | byte1
    or eax, edi      # combine halves
    ret
But GCC11.2 makes the same asm for this vs. method2. That would be good if it was optimal, but it's not.
the first method seems to alternate other instructions sal>or>sal>or>sal>or, while the second one is more uniform sal>sal>sal>or>or>mov>or
The dependency chains are the key factor. Mainstream x86 has been out-of-order exec for over 2 decades, and there haven't been any in-order exec x86 CPUs sold for years. So instruction scheduling (ordering of independent instructions) generally isn't a big deal over very small distances like a few instructions. Of course, in the alternating shl/or version, they're not independent so you couldn't reorder them without breaking it or rewriting it.
Can we do better with partial-register shenanigans?
This part is only relevant if you're a compiler / JIT developer trying to get a compiler to do a better job for source like this. I'm not sure there's a clear win here, although maybe yes if we can't inline this function so the movzx instructions are actually needed.
We can certainly save instructions, but even modern Intel CPUs still have partial-register merging penalties for high-8-bit registers like AH. And it seems the AH-merging uop can only issue in a cycle by itself, so it effectively costs at least 4 instructions of front-end bandwidth.
    movzx eax, dl
    mov ah, cl
    shl eax, 16      # partial-register merging when reading EAX
    mov ah, sil      # oops, not encodeable: SIL needs a REX prefix, which means AH can't be used
    mov al, dil
Or maybe this, which avoids partial-register stalls and false dependencies on Intel Haswell and later. (And also on uarches that don't rename partial regs at all, like all AMD, and Intel Silvermont-family including the E-cores on Alder Lake.)
# good on Haswell / AMD
# partial-register stalls on Intel P6 family (Nehalem and earlier)
merge4(high=EDI, mid_hi=ESI, mid_lo=EDX, low=ECX):
    mov eax, edi     # mov-elimination possible on IvB and later, also AMD Zen
                     # zero-extension not needed because more shifts are coming
    shl eax, 8
    shl edx, 8
    mov al, sil      # AX = incoming DIL:SIL
    mov dl, cl       # DX = incoming DL:CL
    shl eax, 16
    mov ax, dx       # EAX = incoming DIL:SIL:DL:CL
    ret
This is using 8-bit and 16-bit mov as an ALU merge operation pretty much like movzx + or, i.e. a bitfield insert into the low 8 or low 16. I avoided ever moving into AH or other high-8 registers, so there's no partial-register merging on Haswell or later.
This is only 7 total instructions (not counting ret), all of them single-uop. And one of them is just a mov which can often be optimized away when inlining, because the compiler will have its choice of which registers to have the value in. (Unless the original value of the high byte alone is still needed in a register). It will often know it already has a value zero-extended in a register after inlining, but this version doesn't depend on that.
Of course, if you were eventually storing this value to memory, doing 4 byte stores would likely be good, especially if it's not about to be reloaded. (Store-forwarding stall.)
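For example, a sketch of that byte-store idea (store4 and its destination pointer p are hypothetical; this assumes a little-endian target such as x86, with byte1 as the least-significant byte of the stored word):
#include <stdint.h>

// Write the four bytes straight to memory instead of merging them in a register.
static void store4(uint8_t *p, uint8_t byte4, uint8_t byte3, uint8_t byte2, uint8_t byte1)
{
    p[0] = byte1;   // least-significant byte of the 32-bit value at p
    p[1] = byte2;
    p[2] = byte3;
    p[3] = byte4;   // most-significant byte
}
Reloading the combined 32-bit value soon after these narrow stores is what would cause the store-forwarding stall mentioned above.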
Related:
Why doesn't GCC use partial registers? (partial regs on other CPUs, and why my "good" version would be bad on crusty old Nehalem and earlier CPUs)
How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent (the basis for the partial-register versions above)
Why does GCC chose dword movl to copy a long shift count to CL? - re: using 32-bit mov to copy an incoming uint8_t. (Although that's saying why it's fine even on CPUs that do rename DIL separately from the full RDI, because callers will have written the full register.)
Semi-related:
Doing this without shifts, just partial-register moves, as an asm exercise: How do I shift left in assembly with ADD, not using SHL? (sometimes using store/reload, which will cause store-forwarding stalls).
I completely agree with Émerick Poulin:
So the real answer would be to benchmark it on your target machine to see which one is faster.
Nevertheless, I created a "method3()" and disassembled all three with gcc version 10.3.0, with -O0 and -O2. Here's a summary of the -O2 results:
Method3:
int method3(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
    int combine = (byte4 << 24) | (byte3 << 16) | (byte2 << 8) | byte1;
    return combine;
}
gcc -O2 -S:
;method1:
    sall    $8, %eax
    orl     %edx, %eax
    sall    $8, %eax
    movl    %eax, %edx
    movzbl  %r8b, %eax
    orl     %edx, %eax
    sall    $8, %eax
    orl     %r9d, %eax
    ...
;method2:
    sall    $8, %r8d
    sall    $16, %edx
    orl     %r9d, %r8d
    sall    $24, %eax
    orl     %edx, %r8d
    orl     %r8d, %eax
    ...
;method3:
    sall    $8, %r8d
    sall    $16, %edx
    orl     %r9d, %r8d
    sall    $24, %eax
    orl     %edx, %r8d
    orl     %r8d, %eax
method2 has fewer instructions than method1, and method2 compiles to exactly the same code as method3. method1 also has a couple of moves, which are more "expensive" than or or shift.
Without optimizations, method 2 seems to be a tiny bit faster.
This is an online benchmark of the code provided: https://quick-bench.com/q/eyiiXkxYVyoogefHZZMH_IoeJss
However, it is difficult to get accurate metrics from tiny operations like this.
It also depends on the speed of the instructions on a given CPU (different instructions can take more or less clock cycles).
Furthermore, bitwise operators tend to be pretty fast since they are "basic" instructions.
So the real answer would be to benchmark it on your target machine to see which one is faster.
A C-only, portable alternative:
#include <stdint.h>

unsigned int portable1(unsigned char byte4, unsigned char byte3,
                       unsigned char byte2, unsigned char byte1)
{
    union tag1
    {
        uint32_t result;
        uint8_t a[4];
    } u;
    u.a[0] = byte1;
    u.a[1] = byte2;
    u.a[2] = byte3;
    u.a[3] = byte4;
    return u.result;
}
This should generate 4 MOVs and a register load in most environments. If the union is defined at file scope (global), the final load (or MOV) disappears from the function. We can also easily allow for a 9-bit byte and for Intel versus network byte ordering...
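A quick usage sketch of the same union trick with the standard <stdint.h> type names (the hex values are just illustrative, and the printed result assumes a little-endian machine such as x86):
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    union { uint32_t result; uint8_t a[4]; } u;
    u.a[0] = 0x44;   // byte1: least-significant byte on little-endian
    u.a[1] = 0x33;
    u.a[2] = 0x22;
    u.a[3] = 0x11;   // byte4: most-significant byte
    printf("0x%08X\n", (unsigned)u.result);   // prints 0x11223344 on little-endian x86
    return 0;
}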

Why does the asm function in C take more time than the C function?

I wrote a simple multiplication function in C, and another in assembly code, using GCC's "asm" keyword.
I took the execution time for each of them, and although their times are pretty close, the C function is a little faster than the one in assembly code.
I would like to know why, since I expected the asm one to be faster. Is it because of the extra "call" (I don't know what word to use) to GCC's "asm" keyword?
Here is the C function:
int multiply (int a, int b){return a*b;}
And here is the asm one in the C file:
int asmMultiply(int a, int b) {
    asm ("imull %1,%0;"
         : "+r" (a)
         : "r" (b)
        );
    return a;
}
my main where I take the times:
#include <stdio.h>
#include <time.h>

int main(){
    int n = 50000;
    clock_t asmClock = clock();
    while(n > 0){
        asmMultiply(4, 5);
        n--;
    }
    asmClock = clock() - asmClock;
    double asmTime = ((double)asmClock) / CLOCKS_PER_SEC;

    clock_t cClock = clock();
    n = 50000;
    while(n > 0){
        multiply(4, 5);
        n--;
    }
    cClock = clock() - cClock;
    double cTime = ((double)cClock) / CLOCKS_PER_SEC;

    printf("Asm time: %f\n", asmTime);
    printf("C code time: %f\n", cTime);
    return 0;
}
Thanks!
The assembly function is doing more work than the C function (this refers to the original revision of the question's asm, which used a separate mult variable): it's initializing mult, then doing the multiplication and assigning the result to mult, and then pushing the value from mult into the return location.
Compilers are good at optimizing; you won't easily beat them on basic arithmetic.
If you really want improvement, use static inline int multiply(int a, int b) { return a * b; }. Or just write a * b (or the equivalent) in the calling code instead of int x = multiply(a, b);.
This attempt to microbenchmark is too naive in almost every way possible for you to get any meaningful results.
Even if you fixed the surface problems (so the code didn't optimize away), there are major deep problems before you can conclude anything about when your asm would be better than the C * operator.
(Hint: probably never. Compilers already know how to optimally multiply integers, and understand the semantics of that operation. Forcing it to use imul instead of auto-vectorizing or doing other optimizations is going to be a loss.)
Both timed regions are empty because both multiplies can optimize away. (The asm is not asm volatile, and you don't use the result.) You're only measuring noise and/or CPU frequency ramp-up to max turbo before the clock() overhead.
And even if they weren't, a single imul instruction is basically unmeasurable with a function with as much overhead as clock(). Maybe if you serialized with lfence to force the CPU to wait for imul to retire, before rdtsc... See RDTSCP in NASM always returns the same value
Or you compiled with optimization disabled, which is pointless.
You basically can't measure a C * operator vs. inline asm without some kind of context involving a loop. And then it will be for that context, dependent on what optimizations you defeated by using inline asm. (And what if anything you did to stop the compiler from optimizing away work for the pure C version.)
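To illustrate the "loop plus a way to keep the result alive" point, here is one possible shape for a less-broken microbenchmark. This is a sketch, not a recommendation; sink() is a hypothetical helper that uses an empty asm statement so the compiler has to actually compute each product.
#include <stdio.h>
#include <time.h>

static inline void sink(int v)
{
    // Empty asm with v as an input: the compiler must materialize v,
    // so the multiply feeding it can't be optimized away.
    __asm__ volatile("" : : "r"(v));
}

int main(void)
{
    volatile int x = 4;                 // runtime value, defeats constant-folding
    clock_t start = clock();
    for (long i = 0; i < 100000000; i++)
        sink(x * 5);
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("time: %f\n", seconds);
    return 0;
}
Even this only measures the throughput of that particular loop on one CPU, with frequency ramp-up and clock() granularity still in play, so it answers a much narrower question than "is my asm faster than *".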
Measuring only one number for a single x86 instruction doesn't tell you much about it. You need to measure latency, throughput, and front-end uop cost to properly characterize its cost. Modern x86 CPUs are superscalar out-of-order pipelined, so the sum of costs for 2 instructions depends on whether they're dependent on each other, and other surrounding context. How many CPU cycles are needed for each assembly instruction?
The stand-alone definitions of the functions are identical, after your change to let the compiler pick registers, and your asm could inline somewhat efficiently, but it's still optimization-defeating. gcc knows that 5*4 = 20 at compile time, so if you did use the result multiply(4,5) could optimize to an immediate 20. But gcc doesn't know what the asm does, so it just has to feed it the inputs at least once. (non-volatile means it can CSE the result if you used asmMultiply(4,5) in a loop, though.)
So among other things, inline asm defeats constant propagation. This matters even if only one of the inputs is a constant, and the other is a runtime variable. Many small integer multipliers can be implemented with one or 2 LEA instructions or a shift (with lower latency than the 3c for imul on modern x86).
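For instance, a small constant multiply like the following typically compiles to a single LEA rather than an IMUL (the asm in the comment is a typical gcc -O3 result and may vary by version):
int times9(int a) { return a * 9; }
// typical gcc -O3 output:
//     lea eax, [rdi + rdi*8]
//     ret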
https://gcc.gnu.org/wiki/DontUseInlineAsm
The only use-case I could imagine asm helping is if a compiler used 2x LEA instructions in a situation that's actually front-end bound, where imul $constant, %[src], %[dst] would let it copy-and-multiply with 1 uop instead of 2. But your asm removes the possibility of using immediates (you only allowed register constraints), and GNU C inline asm doesn't let you use a different template for an immediate vs. a register arg. Maybe if you used multi-alternative constraints and a matching register constraint for the register-only part? But no, you'd still have to have something like asm("%2, %1, %0" :...) and that can't work for reg,reg.
You could use if(__builtin_constant_p(a)) { asm using imul-immediate } else { return a*b; }, which would work with GCC to let you defeat LEA. Or just require a constant multiplier anyway, since you'd only ever want to use this for a specific gcc version to work around a specific missed-optimization. (i.e. it's so niche that in practice you wouldn't ever do this.)
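A sketch of that __builtin_constant_p idea (GNU C only; mul_force_imul is a made-up name, and as noted this is so niche you wouldn't actually do it). It needs optimization enabled so the dead branch, and its "i" constraint, gets discarded when b isn't a compile-time constant.
static inline int mul_force_imul(int a, int b)
{
    if (__builtin_constant_p(b)) {
        int result;
        __asm__("imull %2, %1, %0"      // AT&T order: imull $imm, %src, %dst (copy-and-multiply in one instruction)
                : "=r"(result)
                : "r"(a), "i"(b));
        return result;
    }
    return a * b;                       // let the compiler use LEA / shifts / imul as it sees fit
}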
Your code on the Godbolt compiler explorer, with clang7.0 -O3 for the x86-64 System V calling convention:
# clang7.0 -O3 (the functions both inline and optimize away)
main:                                   # @main
    push rbx
    sub rsp, 16
    call clock
    mov rbx, rax                        # save the return value
    call clock
    sub rax, rbx                        # end - start time
    cvtsi2sd xmm0, rax
    divsd xmm0, qword ptr [rip + .LCPI2_0]
    movsd qword ptr [rsp + 8], xmm0     # 8-byte Spill
    call clock
    mov rbx, rax
    call clock
    sub rax, rbx                        # same block again for the 2nd group
    xorps xmm0, xmm0
    cvtsi2sd xmm0, rax
    divsd xmm0, qword ptr [rip + .LCPI2_0]
    movsd qword ptr [rsp], xmm0         # 8-byte Spill
    mov edi, offset .L.str
    mov al, 1
    movsd xmm0, qword ptr [rsp + 8]     # 8-byte Reload
    call printf
    mov edi, offset .L.str.1
    mov al, 1
    movsd xmm0, qword ptr [rsp]         # 8-byte Reload
    call printf
    xor eax, eax
    add rsp, 16
    pop rbx
    ret
TL:DR: if you want to understand inline asm performance on this fine-grained level of detail, you need to understand how compilers optimize in the first place.
How to remove "noise" from GCC/clang assembly output?
C++ code for testing the Collatz conjecture faster than hand-written assembly - why?
Modern x86 cost model
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

Dividing in Assembly

I am trying to write a calculator in C, based on the Linux command dc. The structure of the program is not so important; all you need to know is that I get two numbers and want to divide them when typing /. Therefore, I send these two numbers to an assembly function that does the division (see code below). But this works for positive numbers only.
When typing 999 3 / it returns 333, which is correct, but when typing -999 3 / I get the strange number 1431655432, and when typing two negative numbers like -999 -3 / I get 0 every time.
The code in assembly is:
section .text
global _div
_div:
    push rbp            ; Save caller state
    mov rbp, rsp
    mov rax, rdi        ; Copy function args to registers: leftmost...
    mov rbx, rsi        ; Next argument...
    cqo
    idiv rbx            ; divide 2 arguments
    mov [rbp-8], rax
    pop rbp             ; Restore caller state
    ret
Your comments say you are passing integers to _div. If you are using int, those are 32-bit values:
extern int _div (int a, int b);
When passed to the function, a will be in the bottom 32 bits of RDI and b will be in the bottom 32 bits of RSI. The upper 32 bits of the arguments can be garbage; they are often zero, but that doesn't have to be the case.
If you use a 64-bit register as a divisor with IDIV then the division is RDX:RAX / the 64-bit divisor (in your case RBX). The problem here is that you are using the full 64-bit registers to do 32-bit division. If we assume for argument's sake that the upper bits of RDI and RSI were originally 0, then RAX (copied from RDI) would be 0x00000000FFFFFC19 and RBX (copied from RSI) would be 0x0000000000000003. CQO sign-extends RAX into RDX:RAX; the uppermost bit of RAX is zero, so RDX would be zero. The division would look like:
0x000000000000000000000000FFFFFC19 / 0x0000000000000003 = 0x55555408
0x55555408 happens to be 1431655432 (decimal), which is the result you were seeing. One fix for this is to use 32-bit registers for the division. To sign-extend EAX (the lower 32 bits of RAX) into EDX, you can use CDQ instead of CQO. You can then divide EDX:EAX by EBX. This should get you the 32-bit signed division you are looking for. The code would look like:
    cdq
    idiv ebx            ; divide 2 arguments EDX:EAX by EBX
Be aware that RBX, RBP, and R12 to R15 all need to be preserved by your function if you modify them (they are non-volatile, call-preserved registers in the x86-64 System V ABI). If you modify RBX you need to make sure you save and restore it, like you do with RBP. A better alternative is to use one of the volatile registers, like RCX, instead of RBX.
You don't need the intermediate register to place the divisor into. You could have used RSI (or ESI in the fixed version) directly instead of moving it to a register like RBX.
Your issue has to do with how arguments are passed to _div.
Assuming your _div's prototype is:
int64_t _div(int32_t, int32_t);
Then, the arguments are passed in edi and esi (i.e., 32-bit signed integers), the upper halves of the registers rdi and rsi are undefined.
Sign extension is needed when assigning edi and esi to rax and rbx for performing a 64-bit signed division (for performing a 64-bit unsigned division zero extension would be needed instead).
That is, instead of:
    mov rax, rdi
    mov rbx, rsi
use the instruction movsx, which sign extends the source, on edi and esi:
    movsx rax, edi
    movsx rbx, esi
Using true 64-bit operands for the 64-bit division
The previous approach consists of performing a 64-bit division on "fake" 64-bit operands (i.e., sign-extended 32-bit operands). Mixing 64-bit instructions with "32-bit operands" is usually not a very good idea because it may result in worse performance and larger code size.
A better approach would be to simply change the C prototype of your _div function to accept actual 64-bit arguments, i.e.:
int64_t _div(int64_t, int64_t);
This way, the argument will be passed to rdi and rsi (i.e., already 64-bit signed integers) and a 64-bit division will be performed on true 64-bit integers.
Using a 32-bit division instead
You may also want to consider using the 32-bit idiv if it suits your needs, since it performs faster than a 64-bit division and the resulting code size is smaller (no REX prefix):
...
    mov eax, edi
    mov ebx, esi
    cdq
    idiv ebx
...
_div's prototype would be:
int32_t _div(int32_t, int32_t);
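For completeness, a sketch of the C side under that last prototype (the expected outputs assume the cdq / idiv fix shown above):
#include <stdio.h>
#include <stdint.h>

int32_t _div(int32_t a, int32_t b);     /* implemented in the assembly above */

int main(void)
{
    printf("%d\n", _div(999, 3));       /* 333 */
    printf("%d\n", _div(-999, 3));      /* -333 with the fixed version */
    return 0;
}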

Coaxing GCC to emit REPE CMPSB

How can I coax the GCC compiler into emitting the REPE CMPSB instruction in plain C, without the "asm" and "_emit" keywords, calls to an included library, or compiler intrinsics?
I tried some C code like the one listed below, but unsuccessfully:
unsigned int repe_cmpsb(unsigned char *esi, unsigned char *edi, unsigned int ecx) {
    for (; ((*esi == *edi) && (ecx != 0)); esi++, edi++, ecx--);
    return ecx;
}
See how GCC compiles it at this link:
https://godbolt.org/g/obJbpq
P.S.
I realize that there are no guarantees that the compiler will compile C code in a certain way, but I'd like to coax it anyway, for fun and just to see how smart it is.
rep cmps isn't fast; it's >= 2 cycles per count throughput on Haswell, for example, plus startup overhead. (http://agner.org/optimize). You can get a regular byte-at-a-time loop to go at 1 compare per clock (modern CPUs can run 2 loads per clock) even when you have to check for a match and for a 0 terminator, if you write it carefully.
InstLatx64 numbers agree: Haswell can manage 1 cycle per byte for rep cmpsb, but that's total bandwidth (i.e. 2 cycles to compare 1 byte from each string).
Only rep movs and rep stos have "fast strings" support in current x86 CPUs. (i.e. microcoded implementations that internally use wider loads/stores when alignment and lack of overlap allow.)
The "smart" thing for modern CPUs is to use SSE2 pcmpeqb / pmovmskb. (But gcc and clang don't know how to vectorize loops with an iteration count that isn't known before loop entry; i.e. they can't vectorize search loops. ICC can, though.)
However, gcc will for some reason inline repz cmpsb for strcmp against short fixed strings. Presumably it doesn't know any smarter patterns for inlining strcmp, and the startup overhead may still be better than the overhead of a function call to a dynamic library function. Or maybe not, I haven't tested. Anyway, it's not horrible for code size in a block of code that compares something against a bunch of fixed strings.
#include <string.h>

int string_equal(const char *s) {
    return 0 == strcmp(s, "test1");
}
gcc7.3 -O3 output from Godbolt
.LC0:
    .string "test1"
string_equal:
    mov     rsi, rdi
    mov     ecx, 6
    mov     edi, OFFSET FLAT:.LC0
    repz cmpsb
    setne   al
    movzx   eax, al
    ret
If you don't booleanize the result somehow, gcc generates a -1 / 0 / +1 result with seta / setb / sub / movzx. (Causing a partial-register stall on Intel before IvyBridge, and a false dependency on other CPUs, because it uses 32-bit sub on the setcc results, /facepalm. Fortunately most code only needs a 2-way result from strcmp, not 3-way).
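For comparison, the non-booleanized use looks like this; under the same gcc7.3 -O3 assumption, gcc inlines the same repz cmpsb and then builds the -1 / 0 / +1 result with the setcc sequence described above:
#include <string.h>

int string_cmp3(const char *s)
{
    return strcmp(s, "test1");   // 3-way result forces the seta/setb/sub/movzx tail
}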
gcc only does this with fixed-length string constants, otherwise it wouldn't know how to set rcx.
The results are totally different for memcmp: gcc does a pretty good job, in this case using a DWORD and a WORD cmp, with no rep string instructions.
int cmp_mem(const char *s) {
    return 0 == memcmp(s, "test1", 6);
}
    cmp     DWORD PTR [rdi], 1953719668   # 0x74736574
    je      .L8
.L5:
    mov     eax, 1
    xor     eax, 1        # missed optimization here after the memcmp pattern; should just xor eax,eax
    ret
.L8:
    xor     eax, eax
    cmp     WORD PTR [rdi+4], 49          # check last 2 bytes
    jne     .L5
    xor     eax, 1
    ret
Controlling this behaviour
The manual says that -mstringop-strategy=libcall should force a library call, but it doesn't work. No change in asm output.
Neither does -mno-inline-stringops-dynamically -mno-inline-all-stringops.
It seems this part of the GCC docs is obsolete. I haven't investigated further with larger string literals, or fixed size but non-constant strings, or similar.

GCC hotpatching?

When I compile this piece of code
unsigned char A[] = {1, 2, 3, 4};

unsigned int
f (unsigned int x)
{
    return A[x];
}
gcc outputs
    mov edi, edi
    movzx eax, BYTE PTR A[rdi]
    ret
on a x86_64 machine.
The question is: what is the nop instruction (mov edi, edi) there for?
I am using gcc 4.4.4.
In 64-bit mode, mov edi, edi is not a no-op. What it does is set the top 32 bits of rdi to 0. Here the index x is a 32-bit unsigned int being used in a 64-bit addressing mode (BYTE PTR A[rdi]), so the compiler has to zero-extend it to the full width of rdi first.
This is a special case of the general fact that all 32-bit operations clear the top 32 bits of the destination register in 64-bit mode. (This allows a more efficient CPU design than leaving them unchanged, and is perhaps more useful as well.)
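One way to see that the mov really is there for zero-extension (a sketch; the asm in the comment is a typical gcc -O2 result and varies by version): give the function an index that is already 64 bits wide and the extra mov disappears.
unsigned char A[] = {1, 2, 3, 4};

unsigned int g(unsigned long x)     /* 64-bit index: already full-width in rdi */
{
    return A[x];
    /* typical gcc -O2 output:
         movzx eax, BYTE PTR A[rdi]
         ret                        */
}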
