Why is adding a superfluous mask and bitshift more optimizable? - c

While writing an integer-to-hex-string function I noticed that I had an unnecessary mask and bit shift, but when I removed it, the code actually got bigger (by about 8-fold).
char *i2s(int n){
    static char buf[(sizeof(int)<<1)+1]={0};
    int i=0;
    while(i<(sizeof(int)<<1)+1){ /* mask the ith hex, shift it to lsb */
        // buf[i++]='0'+(0xf&(n>>((sizeof(int)<<3)-i<<2))); /* less optimizable ??? */
        buf[i++]='0'+(0xf&((n&(0xf<<((sizeof(int)<<3)-i<<2)))>>((sizeof(int)<<3)-i<<2)));
        if(buf[i-1]>'9')buf[i-1]+=('A'-'0'-10); /* handle A-F */
    }
    for(i=0;buf[i++]=='0';)
        /*find first non-zero*/;
    return (char *)buf+i;
}
With the extra bit shift and mask and compiled with gcc -S -O3, the loops unroll and it reduces to:
movb $48, buf.1247
xorl %eax, %eax
movb $48, buf.1247+1
movb $48, buf.1247+2
movb $48, buf.1247+3
movb $48, buf.1247+4
movb $48, buf.1247+5
movb $48, buf.1247+6
movb $48, buf.1247+7
movb $48, buf.1247+8
.p2align 4,,7
.p2align 3
.L26:
movzbl buf.1247(%eax), %edx
addl $1, %eax
cmpb $48, %dl
je .L26
addl $buf.1247, %eax
ret
Which is what I expected for 32-bit x86 (it should be similar, but with twice as many movb-like ops, for 64-bit); however, without the seemingly redundant mask and bit shift, gcc can't seem to unroll and optimize it.
Any ideas why this would happen? I am guessing it has to do with gcc being (overly?) cautious with the sign bit. (There is no >>> operator in C, so right-shifting a value whose sign bit is set pads with 1s instead of 0s.)

It seems you're using gcc4.7, since newer gcc versions generate different code than what you show.
gcc is able to see that your longer expression with the extra shifting and masking is always '0' + 0, but not for the shorter expression.
clang sees through them both, and optimizes them to a constant independent of the function arg n, so this is probably just a missed-optimization for gcc. When gcc or clang manage to optimize away the first loop to just storing a constant, the asm for the whole function never even references the function arg, n.
Obviously this means your function is buggy! And that's not the only bug.
off-by-one in the first loop, so you write 9 bytes, leaving no terminating 0. (Otherwise the search loop could optimize away too, and just return a pointer to the last byte. As written, it has to search off the end of the static array until it finds a non-'0' byte. Writing a 0 (not '0') before the search loop unfortunately doesn't help clang or gcc get rid of the search loop.)
off-by-one in the search loop so you always return buf+1 or later because you used buf[i++] in the condition instead of a for() loop with the increment after the check.
undefined behaviour from using i++ and i in the same statement with no sequence point separating them.
Apparently assuming that CHAR_BIT is 8. (Something like static char buf[CHAR_BIT*sizeof(n)/4 + 1], but actually you need to round up when dividing the bit count by 4, e.g. (CHAR_BIT*sizeof(n)+3)/4 + 1.)
clang and gcc both warn about the precedence of - vs. << (the additive - binds tighter, so (sizeof(int)<<3)-i<<2 parses as ((sizeof(int)<<3)-i)<<2, not (sizeof(int)<<3)-(i<<2)), but I didn't try to find exactly where you went wrong. Getting the ith nibble of an integer is much simpler than you make it: buf[i]='0'+ (0x0f & (n >> (4*i)));
That compiles to pretty clunky code, though. gcc probably does better with #Fabio's suggestion to do tmp >>= 4 repeatedly. If the compiler leaves that loop rolled up, it can still use shr reg, imm8 instead of needing a variable-shift. (clang and gcc don't seem to optimize the n>>(4*i) into repeated shifts by 4.)
In both cases, gcc is fully unrolling the first loop. It's quite large when each iteration includes actual shifting, comparing, and branching or branchless handling of hex digits from A to F.
It's quite small when it can see that all it has to do is store 48 == 0x30 == '0'. (Unfortunately, it doesn't coalesce the 9 byte stores into wider stores the way clang does).
I put a bugfixed version on godbolt, along with your original.
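For reference, a bugfixed version might look something like the following (a minimal sketch, not necessarily identical to the Godbolt version; it still assumes CHAR_BIT == 8, prints uppercase hex, skips leading zeros, but always leaves at least one digit and a real NUL terminator):
char *i2s_fixed(int n)
{
    static char buf[sizeof(int)*2 + 1];        /* 2 hex digits per byte + NUL */
    unsigned un = (unsigned)n;                 /* unsigned, so >> shifts in zeros, not copies of the sign bit */
    size_t i;
    for (i = 0; i < sizeof(int)*2; i++) {
        unsigned nib = 0xf & (un >> (4*(sizeof(int)*2 - 1 - i)));  /* most-significant nibble first */
        buf[i] = (char)(nib < 10 ? '0' + nib : 'A' + (nib - 10));
    }
    buf[i] = '\0';
    for (i = 0; buf[i] == '0' && buf[i+1] != '\0'; i++)
        ;                                      /* skip leading zeros, but keep the last digit so 0 prints as "0" */
    return buf + i;
}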
Fabio's answer has a more optimized version. I was just trying to figure out what gcc was doing with yours, since Fabio had already provided a good version that should compile to more efficient code. (I optimized mine a bit too, but didn't replace the n>>(4*i) with n>>=4.)
gcc6.3 makes very amusing code for your bigger expression. It unrolls the search loop and optimizes away some of the compares, but keeps a lot of the conditional branches!
i2s_orig:
mov BYTE PTR buf.1406+3, 48
mov BYTE PTR buf.1406, 48
cmp BYTE PTR buf.1406+3, 48
mov BYTE PTR buf.1406+1, 48
mov BYTE PTR buf.1406+2, 48
mov BYTE PTR buf.1406+4, 48
mov BYTE PTR buf.1406+5, 48
mov BYTE PTR buf.1406+6, 48
mov BYTE PTR buf.1406+7, 48
mov BYTE PTR buf.1406+8, 48
mov BYTE PTR buf.1406+9, 0
jne .L7 # testing flags from the compare earlier
jne .L8
jne .L9
jne .L10
jne .L11
sete al
movzx eax, al
add eax, 8
.L3:
add eax, OFFSET FLAT:buf.1406
ret
.L7:
mov eax, 3
jmp .L3
... more of the same, setting eax to 4, or 5, etc.
Putting multiple jne instructions in a row is obviously useless: nothing re-tests the flags in between, so if the first one falls through, every later one falls through too.

I think it has to do with the fact that in the longer version, you are left-shifting by ((sizeof(int)<<3)-i<<2) and then right-shifting by that same value later in the expression, so the compiler is able to optimize based on that fact.
Regarding the right-shifting, C (and C++) can do the equivalent of both of Java's '>>' and '>>>' operators. It's just that in [GNU] C the result of "x >> y" depends on whether x is signed or unsigned. If x is signed, a shift-right-arithmetic (SRA, sign-extending) is used; if x is unsigned, a shift-right-logical (SRL, zero-extending) is used. This way, >> can be used to divide by powers of 2 for both negative and positive numbers (though for negative values the result rounds toward negative infinity, so it is not exactly the same as division by 2).
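A minimal illustration of that difference (the signed case is implementation-defined by the standard, but GCC on x86 sign-extends):
#include <stdio.h>

int main(void)
{
    int          s = -16;
    unsigned int u = 0xFFFFFFF0u;
    /* With GCC on x86: s >> 2 is an arithmetic shift (prints -4),
       u >> 2 is a logical shift (prints 3ffffffc). */
    printf("%d %x\n", s >> 2, u >> 2);
    return 0;
}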
Unrolling loops is often no longer a good idea because: 1) newer processors come with a micro-op buffer which will often speed up small loops, 2) code bloat makes instruction caching less efficient by taking up more space in L1i. Micro-benchmarks will hide that effect.
The algorithm doesn't have to be that complicated. Also, your algorithm has a problem: it returns "0" for multiples of 16, and for 0 itself it returns an empty string.
Below is a rewrite of the algorithm which is branch-free except for the loop exit check (or completely branch-free if the compiler decides to unroll it). It is faster, generates shorter code, and fixes the multiple-of-16 bug.
Branch-free code is desirable because there is a big penalty (15-20 clock cycles) if the CPU mispredicts a branch. Compare that to the bit operations in the algorithm: they only take 1 clock cycle each, and the CPU is able to execute 3 or 4 of them in the same clock cycle.
const char* i2s_brcfree(int n)
{
    static char buf[sizeof(n)*2+1] = {0};
    unsigned int nibble_shifter = n;
    for(char* p = buf+sizeof(buf)-2; p >= buf; --p, nibble_shifter>>=4){
        const char curr_nibble = nibble_shifter & 0xF; // look only at lowest 4 bits
        char digit = '0' + curr_nibble;
        // "promote" to hex if nibble is over 9,
        // conditionally adding the difference between ('0'+nibble) and 'A'
        enum{ dec2hex_offset = ('A'-'0'-0xA) }; // compile time constant
        digit += dec2hex_offset & -(curr_nibble > 9); // conditional add
        *p = digit;
    }
    return buf;
}
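For example, a hypothetical driver for the function above (the exact output assumes a 32-bit int):
#include <stdio.h>

int main(void)
{
    printf("%s\n", i2s_brcfree(0x1A2B3C4D)); /* prints 1A2B3C4D */
    printf("%s\n", i2s_brcfree(16));         /* prints 00000010: fixed width, so multiples of 16 are fine */
    printf("%s\n", i2s_brcfree(0));          /* prints 00000000 instead of an empty string */
    return 0;
}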
Edit: C leaves the result of right-shifting negative numbers implementation-defined. I only know that GCC and Visual Studio implement it as an arithmetic (sign-extending) shift on the x86 architecture.

Related

When joining four 1-byte vars into one 4-byte word, which is a faster way to shift and OR ? (comparing generated assembly code)

So I'm currently studying bitwise operators and bit manipulation, and I have come across two different ways to combine four 1-byte values into one 4-byte word.
The two ways are given below.
After finding these two methods, I compared the disassembly generated for each (compiled using gcc 11 with the -O2 flag). I don't have much background in disassembly or the code it generates; all I know is that shorter code is usually faster (most of the time, I guess... maybe there are some exceptions). For the two methods, they seem to have the same number of lines of generated disassembly, so I guess they have the same performance?
I also got curious about the order of the instructions: the first method seems to alternate instructions, sal>or>sal>or>sal>or, while the second one is more uniform, sal>sal>sal>or>or>mov>or. Does this have some significant effect on performance, say, if we are dealing with a larger word?
Two methods
int method1(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
    int combine = 0;
    combine = byte4;
    combine <<= 8;
    combine |= byte3;
    combine <<= 8;
    combine |= byte2;
    combine <<= 8;
    combine |= byte1;
    return combine;
}
int method2(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
    int combine = 0, temp;
    temp = byte4;
    temp <<= 24;
    combine |= temp;
    temp = byte3;
    temp <<= 16;
    combine |= temp;
    temp = byte2;
    temp <<= 8;
    combine |= temp;
    temp = byte1;
    combine |= temp;
    return combine;
}
Disassembly
// method1(unsigned char, unsigned char, unsigned char, unsigned char):
movzx edi, dil
movzx esi, sil
movzx edx, dl
movzx eax, cl
sal edi, 8
or esi, edi
sal esi, 8
or edx, esi
sal edx, 8
or eax, edx
ret
// method2(unsigned char, unsigned char, unsigned char, unsigned char):
movzx edx, dl
movzx ecx, cl
movzx esi, sil
sal edi, 24
sal edx, 8
sal esi, 16
or edx, ecx
or edx, esi
mov eax, edx
or eax, edi
ret
This might be "premature optimization", but I just want to know if there is a difference.
now for the two methods it seems that they have the same numbers/counts of lines in the generated disassembly code, so I guess they have the same performance?
That hasn't been true for decades. Modern CPUs can execute instructions in parallel if they're independent of each other. See
How many CPU cycles are needed for each assembly instruction?
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
In your case, method 2 is clearly better (with GCC11 -O2 specifically) because the 3 shifts can happen in parallel, leaving only a chain of or instructions. (Most modern x86 CPUs only have 2 shift units, but the 3rd shift can be happening in parallel with the first OR).
Your first version has one long dependency chain of shift/or/shift/or (after the movzx zero-extension), so it has the same throughput but worse latency. If it's not on the critical path of some larger computation, performance would be similar.
The first version also has a redundant movzx edi, dil, because GCC11 -O2 doesn't realize that the high bits will eventually get shifted out the top of the register by the 3x8 = 24 bits of shifting. Unfortunately GCC chose to movzx into the same register (RDI into RDI), not for example movzx eax, dil which would let mov-elimination work.
The second one has a wasted mov eax, edx because GCC didn't realize it should do one of the movzx operations into EAX, instead of zero-extending each reg into itself (defeating mov-elimination). It also could have used lea eax, [edx + edi] to merge into a different reg, because it could have proved that those values couldn't have any overlapping bit-positions, so | and + would be equivalent.
That wasted mov generally only happens in small functions; apparently GCC's register allocator has a hard time when values need to be in specific hard registers. If it had its choice of where to produce the value, it would just end up with it in EDX.
So on Intel CPUs, yes by coincidence of different missed optimizations, both versions have 3 non-eliminated movzx and one instruction which can benefit from mov-elimination. On AMD CPUs, movzx is never eliminated, so the movzx eax, cl doesn't benefit.
Unfortunately, your compiler chooses to do all three or operations in sequence, instead of a tree of dependencies. (a|b) | (c|d) would be lower worst-case latency than (((a|b) | c) | d): a critical path length of 2 from all inputs, vs. 3 from a or b, 2 from c, 1 from d. (Writing it in C those different ways doesn't actually change how compilers make asm, because they know that OR is associative. I'm using that familiar syntax to represent the dependency pattern of the assembly.)
So if all four inputs were ready at the same time, combining pairs would be lower latency, although it's impossible for most CPUs to produce three shift results in the same cycle.
I was able to hand-hold earlier GCC (GCC5) into making this dependency pattern (Godbolt compiler explorer). I used unsigned to avoid ISO C undefined behaviour. (GNU C does define the behaviour of left shifts even when a set bit is shifted in or out of the sign bit.)
int method4(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
    unsigned combine = (unsigned)byte4 << 24;
    combine |= byte3 << 16;
    unsigned combine_low = (byte2 << 8) | byte1;
    combine |= combine_low;
    return combine;
}
# GCC5.5 -O3
method4(unsigned char, unsigned char, unsigned char, unsigned char):
movzx eax, sil
sal edi, 24
movzx ecx, cl
sal eax, 16
or edi, eax # (byte4<<24)|(byte3<<16)
movzx eax, dl
sal eax, 8
or eax, ecx # (byte2<<8) | (byte1)
or eax, edi # combine halves
ret
But GCC11.2 makes the same asm for this vs. method2. That would be good if it was optimal, but it's not.
the first method seems to alternate other instructions sal>or>sal>or>sal>or, while the second one is more uniform sal>sal>sal>or>or>mov>or
The dependency chains are the key factor. Mainstream x86 has been out-of-order exec for over 2 decades, and there haven't been any in-order exec x86 CPUs sold for years. So instruction scheduling (ordering of independent instructions) generally isn't a big deal over very small distances like a few instructions. Of course, in the alternating shl/or version, they're not independent so you couldn't reorder them without breaking it or rewriting it.
Can we do better with partial-register shenanigans?
This part is only relevant if you're a compiler / JIT developer trying to get a compiler to do a better job for source like this. I'm not sure there's a clear win here, although maybe yes if we can't inline this function so the movzx instructions are actually needed.
We can certainly save instructions, but even modern Intel CPUs still have partial-register merging penalties for high-8-bit registers like AH. And it seems the AH-merging uop can only issue in a cycle by itself, so it effectively costs at least 4 instructions of front-end bandwidth.
movzx eax, dl
mov ah, cl
shl eax, 16 # partial register merging when reading EAX
mov ah, sil # oops, not encodeable, needs a REX for SIL which means it can't use AH
mov al, dil
Or maybe this, which avoids partial-register stalls and false dependencies on Intel Haswell and later. (And also on uarches that don't rename partial regs at all, like all AMD, and Intel Silvermont-family including the E-cores on Alder Lake.)
# good on Haswell / AMD
# partial-register stalls on Intel P6 family (Nehalem and earlier)
merge4(high=EDI, mid_hi=ESI, mid_lo=EDX, low=ECX):
mov eax, edi # mov-elimination possible on IvB and later, also AMD Zen
# zero-extension not needed because more shifts are coming
shl eax, 8
shl edx, 8
mov al, sil # AX = incoming DIL:SIL
mov dl, cl # DX = incoming DL:CL
shl eax, 16
mov ax, dx # EAX = incoming DIL:SIL:DL:CL
ret
This is using 8-bit and 16-bit mov as an ALU merge operation pretty much like movzx + or, i.e. a bitfield insert into the low 8 or low 16. I avoided ever moving into AH or other high-8 registers, so there's no partial-register merging on Haswell or later.
This is only 7 total instructions (not counting ret), all of them single-uop. And one of them is just a mov which can often be optimized away when inlining, because the compiler will have its choice of which registers to have the value in. (Unless the original value of the high byte alone is still needed in a register). It will often know it already has a value zero-extended in a register after inlining, but this version doesn't depend on that.
Of course, if you were eventually storing this value to memory, doing 4 byte stores would likely be good, especially if it's not about to be reloaded. (Store-forwarding stall.)
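For instance, if the combined value only ever ends up in memory, a sketch like this (the store4 helper and the little-endian layout are my assumptions, not from the question) lets the compiler emit four plain byte stores instead of the shift/OR chain:
#include <stdint.h>

/* Write the four bytes directly into the destination buffer, byte1 lowest.
   Reading the buffer back as a little-endian uint32_t gives the same value
   that method1/method2 compute in a register. */
void store4(uint8_t *dst, uint8_t byte4, uint8_t byte3, uint8_t byte2, uint8_t byte1)
{
    dst[0] = byte1;
    dst[1] = byte2;
    dst[2] = byte3;
    dst[3] = byte4;
}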
Related:
Why doesn't GCC use partial registers? - partial regs on other CPUs, and why my "good" version would be bad on crusty old Nehalem and earlier CPUs.
How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent - the basis for the partial-register versions here.
Why does GCC chose dword movl to copy a long shift count to CL? - re: using 32-bit mov to copy an incoming uint8_t. (Although that's saying why it's fine even on CPUs that do rename DIL separate from the full RDI, because callers will have written the full register.)
Semi-related:
Doing this without shifts, just partial-register moves, as an asm exercise: How do I shift left in assembly with ADD, not using SHL? sometimes using store/reload which will cause store-forwarding stalls.
I completely agree with Émerick Poulin:
So the real answer, would be to benchmark it on your target machine to
see which one is faster.
Nevertheless, I created a "method3()" and disassembled all three with gcc version 10.3.0, with -O0 and -O2. Here's a summary of the -O2 results:
Method3:
int method3(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
    int combine = (byte4 << 24)|(byte3 << 16)|(byte2 << 8)|byte1;
    return combine;
}
gcc -O2 -S:
;method1:
sall $8, %eax
orl %edx, %eax
sall $8, %eax
movl %eax, %edx
movzbl %r8b, %eax
orl %edx, %eax
sall $8, %eax
orl %r9d, %eax
...
;method2:
sall $8, %r8d
sall $16, %edx
orl %r9d, %r8d
sall $24, %eax
orl %edx, %r8d
orl %r8d, %eax
...
;method3:
sall $8, %r8d
sall $16, %edx
orl %r9d, %r8d
sall $24, %eax
orl %edx, %r8d
orl %r8d, %eax
In the listings above, method2 compiles to exactly the same code as method3, and both come out shorter than method1. method1 also has a couple of "moves", which are more "expensive" than "or" or "shift".
Without optimizations, method 2 seems to be a tiny bit faster.
This is an online benchmark of the code provided: https://quick-bench.com/q/eyiiXkxYVyoogefHZZMH_IoeJss
However, it is difficult to get accurate metrics from tiny operations like this.
It also depends on the speed of the instructions on a given CPU (different instructions can take more or less clock cycles).
Furthermore, bitwise operators tend to be pretty fast since they are "basic" instructions.
So the real answer, would be to benchmark it on your target machine to see which one is faster.
A portable, C-only version:
#include <stdint.h>

unsigned int portable1(unsigned char byte4, unsigned char byte3,
                       unsigned char byte2, unsigned char byte1)
{
    union tag1
    {
        uint32_t result;
        uint8_t  a[4];
    } u;
    u.a[0] = byte1;
    u.a[1] = byte2;
    u.a[2] = byte3;
    u.a[3] = byte4;
    return u.result;
}
This should generate 4 MOVs and a register load in most environments. If the union is defined at module scope or global, the final load (or MOV) disappears from the function. We can also easily allow for a 9-bit byte and Intel versus network byte ordering...
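A memcpy-based variant (my sketch, not part of the answer above) gives the same host-endian result, is equally portable C, and sidesteps any C++ objections to union type-punning; compilers typically turn it into the same few MOVs:
#include <stdint.h>
#include <string.h>

uint32_t portable2(uint8_t byte4, uint8_t byte3, uint8_t byte2, uint8_t byte1)
{
    uint8_t a[4] = { byte1, byte2, byte3, byte4 };  /* same host-endian layout as the union */
    uint32_t result;
    memcpy(&result, a, sizeof result);              /* compilers optimize this to a single 4-byte load */
    return result;
}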

how to derive the types of the following data types from assembly code

I came across an exercise, as I am still trying to familiarise myself with assembly code.
I am unsure how to derive the types in a given struct from the assembly code and the skeleton C code. Could someone teach me how this should be done?
This is the assembly code, where rcx and rdx hold the arguments i and j respectively.
randFunc:
movslq %ecx,%rcx // move i into rcx
movslq %edx, %rdx // move j into rdx
leaq (%rcx,%rcx,2), %rax //3i into rax
leaq (%rdx,%rdx,2), %rdx // 3j into rdx
salq $5, %rax // shift arith left 32? so 32*3i = 96i
leaq (%rax,%rdx,8), %rax //24j + 96i into rax
leaq matrixtotest(%rip), %rdx //store address of the matrixtotest in rdx
addq %rax, %rdx //jump to 24th row, 6th column variable
cmpb $10, 2(%rdx) //add 2 to that variable and compare to 10
jg .L5 //if greater than 10 then go to l5
movq 8(%rdx), %rax // else add 8 to the rdx number and store in rax
movzwl (%rdx), %edx //move the val in rdx (unsigned) to edx as an int
subl %edx, %eax //take (val+8) -(val) = 8? (not sure)
ret
.L5:
movl 16(%rdx),%eax //move 1 row down and return? not sure about this
ret
This is the C code:
struct mat{
    typeA a;
    typeB b;
    typeC c;
    typeD d;
};
struct mat matrixtotest[M][N];
int randFunc(int i, int j){
    return __1__? __2__ : __3__;
}
How do I derive the types of the variables a,b,c,d? And what is happening in the 1) 2) 3) parts of the return statement ?
Please help me, I'm very confused about what's happening and how to derive the types of the struct from this assembly.
Any help is appreciated, thank you.
Due to the cmpb $10, 2(%rdx) you have a byte-sized something at offset 2. Due to the movzwl (%rdx), %edx you have a 2-byte unsigned something at offset 0. Due to the movq 8(%rdx), %rax you have an 8-byte something at offset 8. Finally, due to the movl 16(%rdx), %eax you have a 4-byte something at offset 16. Now sizes don't map to types directly, but one possibility would be:
struct mat{
    uint16_t a;
    int8_t b;
    int64_t c;
    int32_t d;
};
You can use unsigned short, signed char, long, int if you know their sizes.
The size of the structure is 24 bytes, with padding at the end due to the alignment requirement of the 8-byte field. From the 96i (the row stride) and the 24-byte element size you can deduce N = 96/24 = 4, probably. M is unknown. As such, 24j + 96i accesses item matrixtotest[i][j]. The rest should be clear.
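If it helps to see the pieces put together, here is one possible reconstruction consistent with those sizes (M is not recoverable from the asm, so the value below is just a placeholder, and the ternary could equally be written with the condition inverted):
#include <stdint.h>

enum { M = 8, N = 4 };          /* N = 96/24 = 4; M = 8 is a placeholder */

struct mat {
    uint16_t a;                 /* movzwl (%rdx), %edx : 2-byte unsigned at offset 0 */
    int8_t   b;                 /* cmpb $10, 2(%rdx)   : 1-byte signed at offset 2   */
    int64_t  c;                 /* movq 8(%rdx), %rax  : 8-byte at offset 8          */
    int32_t  d;                 /* movl 16(%rdx), %eax : 4-byte at offset 16         */
};                              /* 24 bytes total, padded for the 8-byte member      */

struct mat matrixtotest[M][N];

int randFunc(int i, int j) {
    /* jg .L5 is taken when b > 10 and returns d; the fall-through returns c - a */
    return matrixtotest[i][j].b > 10
         ? matrixtotest[i][j].d
         : (int)(matrixtotest[i][j].c - matrixtotest[i][j].a);
}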
How do I derive the types of the variables a,b,c,d?
You want to see how variables are used, which will give you a very strong indication as to their size & sign.  (These indications are not always perfect, but the best we can do with limited information, i.e. missing source code, and will suffice for your exercise.)
So, just work the code, one instruction after another to see what they do by the definitions they have in the assembler and their mapping to the instruction set, paying particular attention to the sizes, signs, and offsets specified by the instructions.
Let's start for example with the first instruction: movslq ecx, rcx — this is saying that the first parameter (which is found in ecx), is a 32-bit signed number.
Since rcx is Windows ABI first parameter, and the assembly code is asking for ecx to be sign extended into rcx, then we know that this parameter is a signed 32-bit integer.  And you proceed to the next instruction, to glean what you can from it — and so on.
And what is happening in the 1) 2) 3) parts of the return statement ?
The ?: operator is a ternary operator known as a conditional.  If the condition, placeholder __1__, is true, it will choose the __2__ value and if false it will choose __3__.  This is usually (but not always) organized as an if-then-else branching pattern, where the then-part represents placeholder __2__ and the else part placeholder __3__.
That if-then-else branching pattern looks something like this in assembly/machine code:
if <condition> /* here __1__ */ is false goto elsePart;
<then-part> // here __2__
goto ifDone;
elsePart:
<else-part> // here __3__
ifDone:
So, when you get to an if-then-else construct, you can fit that into the ternary operator place holders.
That code is nicely commented, but somewhat lacking in size, sign, and offset information.  So, follow along and derive that missing information from the way the instructions tell the CPU what sizes, signs, and offsets to use.
As Jester describes, if the code indexes into the array, because it is two-dimensional, it uses two indexes.  The indexing takes the given indexes and computes the address of the element.  As such, the first index finds the row, and so must skip ahead one row for each value in the index.  The second index must skip ahead one element for each value in the index.  Thus, by the formula in the comments: 24j + 96i, we can say that a row is 96 bytes long and an element (the struct) is 24 bytes long.

Coaxing GCC to emit REPE CMPSB

How can I coax the GCC compiler to emit the REPE CMPSB instruction in plain C, without the "asm" and "_emit" keywords, calls to an included library, or compiler intrinsics?
I tried some C code like the one listed below, but unsuccessfully:
unsigned int repe_cmpsb(unsigned char *esi, unsigned char *edi, unsigned int ecx) {
    for (; ((*esi == *edi) && (ecx != 0)); esi++, edi++, ecx--);
    return ecx;
}
See how GCC compiles it at this link:
https://godbolt.org/g/obJbpq
P.S.
I realize that there are no guarantees that the compiler compiles a C code in a certain way, but I'd like to coax it anyway for fun and just to see how smart it is.
rep cmps isn't fast; it's >= 2 cycles per count throughput on Haswell, for example, plus startup overhead. (http://agner.org/optimize). You can get a regular byte-at-a-time loop to go at 1 compare per clock (modern CPUs can run 2 loads per clock) even when you have to check for a match and for a 0 terminator, if you write it carefully.
InstLatx64 numbers agree: Haswell can manage 1 cycle per byte for rep cmpsb, but that's total bandwidth (i.e. 2 cycles to compare 1 byte from each string).
Only rep movs and rep stos have "fast strings" support in current x86 CPUs. (i.e. microcoded implementations that internally use wider loads/stores when alignment and lack of overlap allow.)
The "smart" thing for modern CPUs is to use SSE2 pcmpeqb / pmovmskb. (But gcc and clang don't know how to vectorize loops with an iteration count that isn't known before loop entry; i.e. they can't vectorize search loops. ICC can, though.)
However, gcc will for some reason inline repz cmpsb for strcmp against short fixed strings. Presumably it doesn't know any smarter patterns for inlining strcmp, and the startup overhead may still be better than the overhead of a function call to a dynamic library function. Or maybe not, I haven't tested. Anyway, it's not horrible for code size in a block of code that compares something against a bunch of fixed strings.
#include <string.h>
int string_equal(const char *s) {
    return 0 == strcmp(s, "test1");
}
gcc7.3 -O3 output from Godbolt
.LC0:
.string "test1"
string_cmp:
mov rsi, rdi
mov ecx, 6
mov edi, OFFSET FLAT:.LC0
repz cmpsb
setne al
movzx eax, al
ret
If you don't booleanize the result somehow, gcc generates a -1 / 0 / +1 result with seta / setb / sub / movzx. (Causing a partial-register stall on Intel before IvyBridge, and a false dependency on other CPUs, because it uses 32-bit sub on the setcc results, /facepalm. Fortunately most code only needs a 2-way result from strcmp, not 3-way).
gcc only does this with fixed-length string constants, otherwise it wouldn't know how to set rcx.
The results are totally different for memcmp: gcc does a pretty good job, in this case using a DWORD and a WORD cmp, with no rep string instructions.
int cmp_mem(const char *s) {
    return 0 == memcmp(s, "test1", 6);
}
cmp DWORD PTR [rdi], 1953719668 # 0x74736574
je .L8
.L5:
mov eax, 1
xor eax, 1 # missed optimization here after the memcmp pattern; should just xor eax,eax
ret
.L8:
xor eax, eax
cmp WORD PTR [rdi+4], 49 # check last 2 bytes
jne .L5
xor eax, 1
ret
Controlling this behaviour
The manual says that -mstringop-strategy=libcall should force a library call, but it doesn't work. No change in asm output.
Neither does -mno-inline-stringops-dynamically -mno-inline-all-stringops.
It seems this part of the GCC docs is obsolete. I haven't investigated further with larger string literals, or fixed size but non-constant strings, or similar.

Understanding gcc output for if (a>=3)

I thought since the condition is a >= 3, it should use jl (less).
But gcc used jle (less or equal).
It makes no sense to me; why did the compiler do this?
You're getting mixed up by a transformation the compiler made on the way from the C source to the asm implementation. gcc's output implements your function this way:
a = 5;
if (a<=2) goto ret0;
return 1;
ret0:
return 0;
It's all clunky and redundant because you compiled with -O0, so it stores a to memory and then reloads it, so you could modify it with a debugger if you set a breakpoint and still have the code "work".
See also How to remove "noise" from GCC/clang assembly output?
Compilers generally prefer to reduce the magnitude of a comparison constant, so it's more likely to fit in a sign-extended 8-bit immediate instead of needing a 32-bit immediate in the machine code.
We can get some nice compact code by writing a function that takes an arg, so it won't optimize away when we enable optimizations.
int cmp(int a) {
    return a >= 128; // In C, a boolean converts to int as 0 or 1
}
gcc -O3 on Godbolt, targeting the x86-64 ABI (same as your code):
xorl %eax, %eax # whole RAX = 0
cmpl $127, %edi
setg %al # al = (edi>127) ? 1 : 0
ret
So it transformed a >=128 into a >127 comparison. This saves 3 bytes of machine code, because cmp $127, %edi can use the cmp $imm8, r/m32 encoding (cmp r/m32, imm8 in Intel syntax in Intel's manual), but 128 would have to use cmp $imm32, r/m32.
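The same canonicalization happens even when both constants already fit in an imm8, which is presumably what produced the jle in your original question; a quick check (assuming gcc -O3 on x86-64 behaves as in the 128 case above):
// gcc -O3: both become a cmp with the constant reduced by one, followed by setg
int ge3(int a)   { return a >= 3;   }  // cmpl $2, %edi   / setg %al
int ge128(int a) { return a >= 128; }  // cmpl $127, %edi / setg %al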
BTW, comparisons and conditions make sense in Intel syntax, but are backwards in AT&T syntax. For example, cmp edi, 127 / jg is taken if edi > 127.
But in AT&T syntax, it's cmp $127, %edi, so you have to mentally reverse the operands, or think of a > instead of a <.
The assembly code is comparing a to two, not three. That's why it uses jle. If a is less than or equal to two it logically follows that a IS NOT greater than or equal to 3, and therefore 0 should be returned.

Assembly: access quad-word, double word, and byte quantity of same register in function

I have some assembly code generated from C that I'm trying to make sense of. One part I just can't understand:
movslq %edx,%rcx
movzbl (%rdi,%rcx,1),%ecx
test %cl,%cl
What doesn't make sense is that %rcx, %ecx, and %cl are all in the same register (quad word, double word, and byte, respectively). How could a data type access all three in the same function like this?
Having a char* makes it improbable to access %ecx in this way, and similarly having an int* makes accessing %cl unlikely. I simply have no idea what data type could be stored in %rcx.
re: your comment: You can tell it's a byte array because it's scaling %rcx by 1.
Like Michael said in a comment, this is doing
int func(char *array /* rdi */, int pos /* edx */)
{
    if (array[pos]) ...;
    // or maybe
    int tmpi = array[pos];
    char tmpc = tmpi;
    if (tmpc) ...;
}
edx has to get sign-extended to 64 bits (here into %rcx) before being used as an offset in an effective address. If it were unsigned, it would still need to be zero-extended (e.g. mov %edx, %ecx, which zero-extends into %rcx). The ABI doesn't guarantee that the upper 32 bits of a register are zeroed or sign-extended when the parameter being passed in a register is smaller than 64 bits.
In general, it's better to write at least 32b of a register, to avoid a false dependency on the previous contents on some CPUs. Only Intel P6/SnB family CPUs track the pieces of integer registers separately (and insert an extra uop to merge them with the old contents if you do something like read %ecx after writing %cl.)
So it's perfectly reasonable for a compiler to emit that code with the zero-extending movzbl load instead of just mov (%rdi,%rcx,1), %cl. It will potentially run faster on Silvermont and AMD. (And P4. Optimizations for old CPUs do hang around in compiler source code...)
