Trouble understanding gcc's assembly output - c

While writing some C code, I decided to compile it to assembly and read it--I just sort of do this from time to time, as an exercise to keep me thinking about what the machine is doing every time I write a statement in C.
Anyways, I wrote these two lines in C
asm(";move old_string[i] to new_string[x]");
new_string[x] = old_string[i];
asm(";shift old_string[i+1] into new_string[x]");
new_string[x] |= old_string[i + 1] << 8;
(old_string is an array of char, and new_string is an array of unsigned short, so given two chars, 42 and 43, this will put 4342 into new_string[x])
Which produced the following output:
#move old_string[i] to new_string[x]
movl -20(%ebp), %esi #put address of first char of old_string in esi
movsbw (%edi,%esi),%dx #put first char into dx
movw %dx, (%ecx,%ebx,2) #put first char into new_string
#shift old_string[i+1] into new_string[x]
movsbl 1(%esi,%edi),%eax #put old_string[i+1] into eax
sall $8, %eax #shift it left by 8 bits
orl %edx, %eax #or edx into it
movw %ax, (%ecx,%ebx,2) #?
(I'm commenting it myself, so I can follow what's going on).
I compiled it with -O3, so I could also sort of see how the compiler optimizes certain constructs. Anyways, I'm sure this is probably simple, but here's what I don't get:
the first section copies a char out of old_string[i], and then movw's it (from dx) to (%ecx,%ebx,2). Then the next section copies old_string[i+1], shifts it, ors it, and then puts it into the same place from ax. It puts two 16 bit values into the same place? Wouldn't this not work?
Also, it shifts old_string[i+1] to the high-order dword of eax, then ors edx (new_string[x]) into it... then puts ax into the memory! Wouldn't ax just contain what was already in new_string[x]? so it saves the same thing to the same place in memory twice?
Is there something I'm missing? Also, I'm fairly certain that the rest of the compiled program isn't relevant to this snippet... I've read around before and after, to find where each array and different variables are stored, and what the registers' values would be upon reaching that code--I think that this is the only piece of the assembly that matters for these lines of C.
--
oh, turns out GNU assembly comments are started with a #.

Okay, so it was pretty simple after all.
I figured it out with a pen and paper, writing down each step, what it did to each register, and then wrote down the contents of each register given an initial starting value...
What got me was that it was using 32 bit and 16 bit registers for 16 and 8 bit data types...
This is what I thought was happening:
first value put into memory as, say, 0001 (I was thinking 01).
second value (02) loaded into 32 bit register (so it was like, 00000002, I was thinking, 0002)
second value shifted left 8 bits (00000200, I was thinking, 0200)
first value (00000001, I thought 0001) or'd into second value (00000201, I thought 0201)
16 bit register put into memory (0201, I was thinking, just 01 again).
I didn't get why it wrote it to memory twice though, or why it was using 32 bit registers (well, actually, my guess is that a 32 bit processor is way faster at working with 32 bit values than it is with 8 and 16 bit values, but that's a totally uneducated guess), so I tried rewriting it:
movl -20(%ebp), %esi #gets pointer to old_string
movsbw (%edi,%esi),%dx #old_string[i] -> dx (0001)
movsbw 1(%edi,%esi),%ax #old_string[i + 1] -> ax (0002)
salw $8, %ax #shift ax left (0200)
orw %dx, %ax #or dx into ax (0201)
movw %ax,(%ecx,%ebx,2) #doesn't write to memory until end
This worked exactly the same.
I don't know if this is an optimization or not (aside from taking one memory write out, which obviously is), but if it is, I know it's not really worth it and didn't gain me anything. In any case, I get what this code is doing now, thanks for the help all.
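For reference, here's a minimal, standalone C version of the packing that's going on (the array values here are made up just for illustration):
#include <stdio.h>

int main(void)
{
    char old_string[] = { 0x42, 0x43 };
    unsigned short new_string[1];

    new_string[0] = old_string[0];           /* low byte, like the movsbw/movw pair */
    new_string[0] |= old_string[1] << 8;     /* high byte, like the movsbl/sall/orl */

    printf("%04X\n", new_string[0]);         /* prints 4342 */
    return 0;
}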

I'm not sure what's not to understand, unless I'm missing something.
The first 3 instructions load a byte from old_string into dx and store it to your new_string.
The next 3 instructions use what's already in dx, combine old_string[i+1] with it, and store the result as a 16-bit value (from ax) to new_string.

Also, it shifts old_string[i+1] to the high-order dword of eax, then
ors edx (new_string[x]) into it... then puts ax into the memory! Wouldn't
ax just contain what was already in new_string[x]? so it saves the same
thing to the same place in memory twice?
Now you see why optimizers are a Good Thing. That kind of redundant code shows up pretty often in unoptimized, generated code, because the generated code comes more or less from templates that don't "know" what happened before or after.

Related

How is it possible for a BITWISE AND operation to take more CPU clocks than an ARITHMETIC ADDITION operation in a C program?

I wanted to test if bitwise operations really are faster to execute than arithmetic operation. I thought they were.
I wrote a small C program to test this hypothesis, and to my surprise the addition takes less time on average than the bitwise AND operation. This is surprising to me and I cannot understand why this is happening.
From what I know, for addition the carry from the less significant bits has to be propagated to the next bits, because the result depends on the carry too. It does not make sense to me that a logic operator is slower than addition.
My code is below:
#include <stdio.h>
#include <time.h>

int main()
{
    int x = 10;
    int y = 25;
    int z = x + y;
    printf("Sum of x+y = %i", z);

    time_t start = clock();
    for (int i = 0; i < 100000; i++)
        z = x + y;
    time_t stop = clock();
    printf("\n\nArithmetic instructions take: %d", stop - start);

    start = clock();
    for (int i = 0; i < 100000; i++)
        z = x & y;
    stop = clock();
    printf("\n\nLogic instructions take: %d", stop - start);
}
Some of the results:
Arithmetic instructions take: 327
Logic instructions take: 360
Arithmetic instructions take: 271
Logic instructions take: 271
Arithmetic instructions take: 287
Logic instructions take: 294
Arithmetic instructions take: 279
Logic instructions take: 266
Arithmetic instructions take: 265
Logic instructions take: 296
These results are taken from consecutive runs of the program.
As you can see, on average the logic operation takes longer than the arithmetic one.
OK, let's take this "measuring" and blow it up; 100k iterations is a bit few:
#include <stdio.h>
#include <time.h>

#define limit 10000000000

int main()
{
    int x = 10, y = 25, z;

    time_t start = clock();
    for (long long i = 0; i < limit; i++)
        z = x + y;
    time_t stop = clock();
    printf("Arithmetic instructions take: %ld\n", stop - start);

    start = clock();
    for (long long i = 0; i < limit; i++)
        z = x & y;
    stop = clock();
    printf("Logic instructions take: %ld\n", stop - start);
}
this will run a bit longer. First let's try with no optimization:
thomas@TS-VB:~/src$ g++ -o trash trash.c
thomas@TS-VB:~/src$ ./trash
Arithmetic instructions take: 21910636
Logic instructions take: 21890332
you see, both loops take about the same time.
compiling with -S reveals why (only relevant part of .s file shown here) :
// this is the assembly for the first loop
.L3:
movl 32(%esp), %eax
movl 28(%esp), %edx
addl %edx, %eax // <<-- ADD
movl %eax, 40(%esp)
addl $1, 48(%esp)
adcl $0, 52(%esp)
.L2:
cmpl $2, 52(%esp)
jl .L3
cmpl $2, 52(%esp)
jg .L9
cmpl $1410065407, 48(%esp)
jbe .L3
// this is the one for the second
.L9:
movl 32(%esp), %eax
movl 28(%esp), %edx
andl %edx, %eax // <<--- AND
movl %eax, 40(%esp)
addl $1, 56(%esp)
adcl $0, 60(%esp)
.L5:
cmpl $2, 60(%esp)
jl .L6
cmpl $2, 60(%esp)
jg .L10
cmpl $1410065407, 56(%esp)
jbe .L6
.L10:
Looking into the CPU's instruction set tells us that both ADD and AND take the same number of cycles --> the 2 loops will run the same amount of time.
Now with optimization:
thomas@TS-VB:~/src$ g++ -O3 -o trash trash.c
thomas@TS-VB:~/src$ ./trash
Arithmetic instructions take: 112
Logic instructions take: 74
The loop has been optimized away. The value calculated is never needed, so the compiler decided not to run it at all.
Conclusion: if you shoot 3 times into the forest and hit 2 boars and 1 rabbit, it doesn't mean there are twice as many boars as rabbits in there.
Let's start by looking at your code. The loops don't actually do anything. Any reasonable compiler will see that you don't use the variable z after the first call to printf, so it is perfectly safe to throw away. Of course, the compiler doesn't have to do it, but any reasonable compiler with any reasonable levels of optimisations enabled will do so.
Let's look at what a compiler did to your code (standard clang with -O2 optimisation level):
leaq L_.str(%rip), %rdi
movl $35, %esi
xorl %eax, %eax
callq _printf
This is the first printf ("Sum of ..."); notice how the generated code didn't actually add anything. The compiler knows the values of x and y, so it just calculated their sum and calls printf with 35.
callq _clock
movq %rax, %rbx
callq _clock
Call clock, save its result in a temporary register, call clock again,
movq %rax, %rcx
subq %rbx, %rcx
leaq L_.str.1(%rip), %rdi
xorl %eax, %eax
movq %rcx, %rsi
Subtract start from end, set up arguments for printf,
callq _printf
Call printf.
The second loop is removed in an equivalent way. There are no loops because the compiler does the reasonable thing - it notices that z is not used after you modify it in the loop, so the compiler throws away any stores to it. And then since nothing is stored into it, it can also throw away x+y. And now, since the body of the loop is not doing anything, the loop can be thrown away. So your code essentially becomes:
printf("\n\nArithmetic instructions take: %d", clock() - clock());
Now, why is this relevant? It illustrates some important concepts. The compiler doesn't slavishly translate one statement at a time into code. It reads all (or as much as possible) of your code, tries to figure out what you actually mean, and then generates code that behaves "as if" it executed all those statements. The language and compiler only care about preserving what we could call observable side effects. If computing a value is not observable, it doesn't need to be computed. The time it takes to execute some code is not a side effect we care about, so the compiler doesn't care about preserving it; after all, we want our code to be as fast as possible, so we'd like the time to execute something to be not observable at all.
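As a hedged illustration of "observable": declaring z volatile makes every store to it an observable side effect, so the compiler can no longer delete the loop. A minimal sketch of the measurement with that one change:
#include <stdio.h>
#include <time.h>

int main(void)
{
    int x = 10, y = 25;
    volatile int z;                  /* volatile: each store must actually happen */

    time_t start = clock();
    for (int i = 0; i < 100000; i++)
        z = x + y;                   /* no longer dead code */
    time_t stop = clock();

    printf("Arithmetic instructions take: %ld\n", (long)(stop - start));
    return 0;
}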
The second reason this is relevant: it is pretty useless to measure how long something takes if you compiled it without optimisations. This is the loop in your code compiled without optimisations:
LBB0_1:
cmpl $100000, -28(%rbp)
jge LBB0_4
movl -8(%rbp), %eax
addl -12(%rbp), %eax
movl %eax, -16(%rbp)
movl -28(%rbp), %eax
addl $1, %eax
movl %eax, -28(%rbp)
jmp LBB0_1
LBB0_4:
You thought you were measuring the addl instruction here. But the whole loop contains so much more. In fact, most of the time in the loop is spent maintaining the loop, not executing your instruction: most of it goes to reading and writing values on the stack and updating the loop variable. Any timing you happen to measure will be totally dominated by the infrastructure of the loop rather than the operation you want to measure.
You loop very few times. I'm pretty certain that most of the time you actually measure is spent in clock() and not the code you're actually trying to measure. clock needs to do quite some work, reading time is often quite expensive.
Then we get to the question of the actual instructions you care about. They take the exact same amount of time. Here's the canonical source of everything related to instruction timing on x86.
But. It's very hard and almost pretty useless to reason about individual instructions. Pretty much every CPU in the past few decades is superscalar. This means that it will execute many instructions at the same time. What matters for how much time things take is more dependencies between instructions (can't start executing an instruction before its inputs are ready if those inputs are computed by previous instructions) rather than the actual instruction. While you can do dozens of calculations in registers per nanosecond, it can take hundreds of nanoseconds to fetch data from main memory. So even if we know that an instruction takes one cycle and your CPU does two cycles per nanosecond (it's usually around that), it can mean that the number of instructions we can finish in 100ns can be anywhere between 1 (if you need to wait for main memory) and 12800 (I don't know the real exact numbers, but I remember that Skylake can retire 64 floating point operations per cycle).
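To make "dependencies between instructions" concrete, here is an illustrative sketch (mine, not a rigorous benchmark): in the first loop every addition needs the previous result, forming one long chain, while the second splits the work into four independent chains that a superscalar CPU can overlap.
#include <stdio.h>

/* One long dependency chain: each add must wait for the previous one. */
long chain_sum(long n)
{
    long a = 0;
    for (long i = 0; i < n; i++)
        a += i;                      /* depends on last iteration's a */
    return a;
}

/* Four independent chains the CPU can execute in parallel.
 * (Assumes n is a multiple of 4, just to keep the sketch short.) */
long split_sum(long n)
{
    long a = 0, b = 0, c = 0, d = 0;
    for (long i = 0; i < n; i += 4) {
        a += i;
        b += i + 1;
        c += i + 2;
        d += i + 3;
    }
    return a + b + c + d;
}

int main(void)
{
    printf("%ld %ld\n", chain_sum(1000000), split_sum(1000000));
    return 0;
}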
This is why microbenchmarks aren't seriously done anymore. If minor variations in how things are done can affect the outcome by a factor of twelve thousand, you can quickly see why measuring individual instructions is useless. Most measurements today are done on large parts of programs or whole programs. I do this a lot at work and I've had several situations where improving an algorithm changed memory access patterns that even though the algorithm could be mathematically proven faster, the behavior of the whole program suffered because of changed memory access patterns or such.
Sorry for such a rambling answer, but I'm trying to get to why even though there is a simple answer to your question: "your measurement method is bad" and also a real answer: "they are the same", there are actually interesting reasons why the question itself is unanswerable.
This is just a few minutes' worth of work; I'd rather also demonstrate bare metal and other such things, but it's not worth the time right now.
I tested some functions to see what the calling convention is, and also noted that for add it is generating
400600: 8d 04 37 lea (%rdi,%rsi,1),%eax
400603: c3 retq
for and
400610: 89 f8 mov %edi,%eax
400612: 21 f0 and %esi,%eax
400614: c3 retq
three instructions instead of two, five bytes instead of four; these bits of information both do and don't matter, depending. But to make it more fair, I'll do the same thing for each operation.
I also want the do-it-a-zillion-times loop tightly coupled and hand-written rather than compiler-generated, as compiled code may end up creating some variation. And lastly the alignment: try to make that fair too.
.balign 32
nop
.balign 256
.globl and_test
and_test:
mov %edi,%eax
and %esi,%eax
sub $1,%edx
jne and_test
retq
.balign 32
nop
.balign 256
.globl add_test
add_test:
mov %edi,%eax
add %esi,%eax
sub $1,%edx
jne add_test
retq
.balign 256
nop
derived from yours
#include <stdio.h>
#include <time.h>

unsigned int add_test ( unsigned int a, unsigned int b, unsigned int x );
unsigned int and_test ( unsigned int a, unsigned int b, unsigned int x );

int main()
{
    int x = 10;
    int y = 25;
    time_t start, stop;

    for (int j = 0; j < 10; j++)
    {
        start = clock();
        add_test(10, 25, 2000000000);
        stop = clock();
        printf("%u %u\n", j, (int)(stop - start));
    }

    for (int j = 0; j < 10; j++)
    {
        start = clock();
        and_test(10, 25, 2000000000);
        stop = clock();
        printf("%u %u\n", j, (int)(stop - start));
    }
    return (0);
}
First run: as expected, the first loop took longer, as it wasn't in the cache? It should not have taken that much longer, though, so that doesn't make sense; perhaps other reasons...
0 605678
1 520204
2 521311
3 520050
4 521455
5 520213
6 520315
7 520197
8 520253
9 519743
0 520475
1 520221
2 520109
3 520319
4 521128
5 520974
6 520584
7 520875
8 519944
9 521062
but we stay fairly consistent. Second run: the times stay somewhat consistent.
0 599558
1 515120
2 516035
3 515863
4 515809
5 516069
6 516578
7 516359
8 516170
9 515986
0 516403
1 516666
2 516842
3 516710
4 516932
5 516380
6 517392
7 515999
8 516861
9 517047
Note that this is 2 billion loops, four instructions per loop. CLOCKS_PER_SEC here is 1000000; about 516000 ticks is 0.516 seconds, which at 3.4GHz is roughly 1.75 billion clocks for 2 billion loops: 0.8772 clocks per loop, or 0.2193 clocks per instruction. How is that possible? Superscalar processor.
A LOT more work could be done here; this was just a few minutes' worth, and hopefully it is enough to demonstrate (as others already have) that you can't really see the difference with a test like this.
I could do a demo with something more linear, like an ARM, where we could read the clock/timer register as part of the code under test; here, the calls into the clock code are themselves part of what is being timed and can vary. Hopefully that is not necessary. The results are much more consistent when using SRAM, controlling all the instructions under test, etc., and with that you can see alignment differences and the cost of the cache read on the first loop but not the remaining ones (a few clocks total, although 10ms as we see here, hmm, might be on par for an x86 system, I don't know; benchmarking x86 is a near-complete waste of time, there is no fun in it, and the results don't translate to other x86 computers that well).
As pointed out in your other question that was closed as a duplicate (I hate using links here; I should learn how to cut and paste pictures--TODO):
https://en.wikipedia.org/wiki/AND_gate
https://en.wikipedia.org/wiki/Adder_(electronics)
Assuming that feeding the math/logic operation is the same for add and and, and we are only trying to measure the difference between them, you are correct: the AND is faster. Without getting into further detail, you can see that an AND has only one stage/gate, whereas a full adder takes three levels; back-of-the-envelope math says roughly three times as much time for the signals to settle once the inputs change. BUT... although there are some exceptions, chips are not designed to take advantage of this (multiply and divide vs add/and/xor/etc. are different; they are, or are more likely to be).
One would design these simple operations to take one clock cycle: on a clock, the inputs to the combinational logic (the actual AND or ADD) are latched; on the next clock, the result is latched from the other end and begins its journey to the register file, or out of the core to memory, etc. At some point in the design you synthesize into the gates available for the foundry/process you are using, then do timing analysis/closure on that and look for long poles in the tent. It is extremely unlikely (impossible) that the add is a long pole; both add and and are very, very short poles. But at that point you determine what your max clock rate is; if you wanted a 4GHz processor but the result is 2.7, you need to take those long poles and turn them into two or more clock operations.
The time difference between an add and an and (the add should be longer) is so small that it is in the noise; it is all within one clock cycle, so even a functional simulation of the logic design would not show a difference. You would need to implement an AND and a full adder in, say, pspice, using transistors and other components, then feed step changes into the inputs and watch how long they take to settle. That, or build them from discrete components from Radio Shack and try it, although the result might be too fast for your scope, so use pspice or other.
Think of writing equations to solve something: you can write one long equation, or you can break it into multiple smaller ones with intermediate variables.
this
a = b+c+d+e+f+g;
vs
x=b+c;
y=d+e;
z=f+g;
a=x+y;
a=a+z;
One clock vs 5 clocks, but each of the 5 clocks can be faster if this was the longest pole in the tent; all the other logic feeding it is that much faster too. (Actually x, y, z can happen in one clock, then either a=x+y+z in the next, or make it two more.)
Multiply and divide are different simply because the logic explodes exponentially; there is no magic to multiply or divide, they have to work the same way we do things on pencil and paper. You can take shortcuts with binary if you think about it, as you can only multiply by 0 or 1 before shifting and adding to the accumulator. The logic equations for one clock still explode exponentially, and then you can do parallel things. It burns a ton of chip real estate, so you may choose to make multiply as well as divide take more than one clock and hide those in the pipeline, or you can choose to burn a significant amount of chip real estate... Look at the documentation for some of the ARM cores: at compile time (when you compile/synthesize the core) you can choose a one-clock or multi-clock multiply, to balance chip size vs performance. With x86 we don't buy the IP and make the chips ourselves, so it is up to Intel how they do it, and it is very likely microcoded, so by just microcoding you can tweak how things happen, or do it in an ALU-type operation.
So you might detect multiply or divide performance vs add/and with a test like this, but either they did it in one clock and you can't ever see it, or they may have buried it in two or more steps in the pipe so that it averages out; to see it you would need access to a chip sim. Using timers and running something a billion times is fun, but to actually see instruction performance you need a chip sim, and you need to tweak the code to keep the code not under test from affecting the results.

Assembly: access quad-word, double word, and byte quantity of same register in function

I have some assembly code generated from C that I'm trying to make sense of. One part I just can't understand:
movslq %edx,%rcx
movzbl (%rdi,%rcx,1),%ecx
test %cl,%cl
What doesn't make sense is that %rcx, %ecx, and %cl are all in the same register (quad word, double word, and byte, respectively). How could a data type access all three in the same function like this?
Having a char* makes it improbable to access %ecx in this way, and similarly having an int* makes accessing %cl unlikely. I simply have no idea what data type could be stored in %rcx.
re: your comment: You can tell it's a byte array because it's scaling %rcx by 1.
Like Michael said in a comment, this is doing
int func(char *array /* rdi */, int pos /* edx */)
{
    if (array[pos]) ...;

    // or maybe
    int tmpi = array[pos];
    char tmpc = tmpi;
    if (tmpc) ...;
}
The 32-bit pos (here in edx) has to get sign-extended to 64 bits before being used as an offset in an effective address. If it were unsigned, it would still need to be zero-extended (e.g. mov %edx, %ecx). The ABI doesn't guarantee that the upper 32 bits of a register are zeroed or sign-extended when the parameter being passed in a register is smaller than 64 bits.
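A minimal pair to see both extensions (my example, assuming gcc targeting x86-64 SysV; the instructions in the comments are the typical codegen, not guaranteed):
/* signed index: compilers sign-extend first */
char get_signed(char *array, int pos)
{
    return array[pos];   /* e.g. movslq %esi,%rsi ; movzbl (%rdi,%rsi),%eax */
}

/* unsigned index: a plain 32-bit mov zero-extends into the full register */
char get_unsigned(char *array, unsigned pos)
{
    return array[pos];   /* e.g. movl %esi,%esi ; movzbl (%rdi,%rsi),%eax */
}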
In general, it's better to write at least 32b of a register, to avoid a false dependency on the previous contents on some CPUs. Only Intel P6/SnB family CPUs track the pieces of integer registers separately (and insert an extra uop to merge them with the old contents if you do something like read %ecx after writing %cl.)
So it's perfectly reasonable for a compiler to emit that code with the zero-extending movzbl load instead of just mov (%rdi,%rcx,1), %cl. It will potentially run faster on Silvermont and AMD. (And P4. Optimizations for old CPUs do hang around in compiler source code...)

Assembly language - How it works

I am really new at learning assembly language and have just started digging into it, so I was wondering if some of you could help me figure a problem out. I have a homework assignment that asks me to compare assembly instructions to C code and say which C function is equivalent to the assembly. Here are the assembly instructions:
pushl %ebp // What i think is happening here is that we are creating more space for the function.
movl %esp,%ebp // Here i think we are moving the stack pointer to the old base pointer.
movl 8(%ebp),%edx // Here we are taking parameter int a and storing it in %edx
movl 12(%ebp),%eax // Here we are taking parameter int b and storing it in %eax
cmpl %eax,%edx // Here i think we are comparing int a and b ( b > a ) ?
jge .L3 // Jump to .L3 if b is greater than a - else continue the instructions
movl %edx,%eax // If the term is not met here it will return b
.L3:
movl %ebp,%esp // Starting to finish the function
popl %ebp // Putting the base pointer in the right place
ret // return
I am trying to comment it based on my understanding of this - but I might be totally wrong. The C functions, one of which is supposed to be equivalent, are:
int fun1(int a, int b)
{
    unsigned ua = (unsigned) a;
    if (ua < b)
        return b;
    else
        return ua;
}

int fun2(int a, int b)
{
    if (b < a)
        return b;
    else
        return a;
}

int fun3(int a, int b)
{
    if (a < b)
        return a;
    else
        return b;
}
I think the correct answer is fun3 .. but I'm not quite sure.
First off, welcome to StackOverflow. Great place, really it is.
Now for starters, let me help you; a lot; a whole lot.
You have good comments that help both you and me and everyone else tremendously, but they are so ugly that reading them is painful.
Here's how to fix that: white space, lots of it, blank lines, and grouping the instructions into small groups that are related to each other.
More to the point: after a conditional jump, insert one blank line; after an absolute jump, insert two blank lines. (Old tricks; they work great for readability.)
Secondly, line up the comments so that they are neatly arranged. It looks a thousand times better.
Here's your stuff, with 90 seconds of text arranging by me. Believe me, the professionals will respect you a thousand times better with this kind of source code...
pushl %ebp            // What i think is happening here is that we are creating more space for the function.
movl  %esp,%ebp       // Here i think we are moving the stack pointer to the old base pointer.

movl  8(%ebp),%edx    // Here we are taking parameter int a and storing it in %edx
movl  12(%ebp),%eax   // Here we are taking parameter int b and storing it in %eax

cmpl  %eax,%edx       // Here i think we are comparing int a and b ( b > a ) ?
                      // No. Think like this: "What is the value of edx with respect to the value of eax ?"
jge   .L3             // edx is greater or equal, so return the value in eax as it is

movl  %edx,%eax       // If the term is not met here it will return b
                      // (pssst, I think you're wrong; think it through again)

.L3:
movl  %ebp,%esp       // Starting to finish the function
popl  %ebp            // Putting the base pointer in the right place
ret                   // return
Now, back to your problem at hand. What he's getting at is the "sense" of the compare instruction and the related JGE instruction.
Here's the confuse-o-matic stuff you need to comprehend to survive these sorts of "academic experiences"
This biz, the cmpl %eax,%edx instruction, is one of the forms of the "compare" instructions
Try to form an idea something like this when you see that syntax, "...What is the value of the destination operand with respect to the source operand ?..."
Caveat: I am absolutely no good with the AT&T syntax, so anybody is welcome to correct me on this.
Anyway, in this specific case, you can phrase the idea in your mind like this...
"...I see cmpl %eax,%edx so I think: With respect to eax, the value in edx is..."
You then complete that sentence in your mind with the "sense" of the next instruction which is a conditional jump.
The paradigmatic process in the human brain works out to form a sentence like this...
"...With respect to eax, the value in edx is greater or equal, so I jump..."
So, if you are correct about the locations of a and b, then you can do the paradigmatic brain scrambler and get something like this...
"...With respect to the value in b, that value in a is greater or equal, so I will jump..."
To get a grasp of this, take note that JGE is the "opposite sense" if you will, of JL (i.e., "Jump if less than")
Okay, now it so happens that return in C is related to the ret instruction in assembly language, but it isn't the same thing.
When C programmers say "...That function returns an int..." what they mean is...
The assembly language subroutine will place a value in Eax
The subroutine will then fix the stack and put it back in neat order
The subroutine will then execute its Ret instruction
One more item of obfuscation is thrown in your face now.
These following conditional jumps are applicable to Signed arithmetic comparison operations...
JG
JGE
JNG
JL
JLE
JNL
There it is ! The trap waiting to screw you up in all this !
Do you want to do signed or unsigned compares ???
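A small C illustration of the trap (my example): the same < compiles to a signed comparison for int and an unsigned comparison for unsigned int.
/* signed: the compiler uses the signed family (jl/jge, or cmovl) */
int min_signed(int a, int b)
{
    return (a < b) ? a : b;
}

/* unsigned: the same < uses the unsigned family (jb/jae, or cmovb) */
unsigned min_unsigned(unsigned a, unsigned b)
{
    return (a < b) ? a : b;
}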
By the way, I've never seen anybody do anything like that first function where an unsigned number is compared with a signed number. Is that even legal ?
So anyway, we put all these facts together, and we get: This assembly language routine returns the value in a if it is less than the value in b otherwise it returns the value in b.
These values are evaluated as signed integers.
(I think I got that right; somebody check my logic. I really don't like that assembler's syntax at all.)
So anyway, I am reasonably certain that you don't want to ask people on the internet to provide you with the specific answer to your specific homework question, so I'll leave it up to you to figure it out from this explanation.
Hopefully, I have explained enough of the logic and the "sense" of comparisons and the signed and unsigned biz so that you can get your brain around this.
Oh, and disclaimer again, I always use the Intel syntax (e.g., Masm, Tasm, Nasm, whatever) so if I got something backwards here, feel free to correct it for me.

Why is adding a superfluous mask and bitshift more optimizable?

While writing an integer to hex string function I noticed that I had an unnecessary mask and bit shift, but when I removed it, the code actually got bigger (by about 8-fold)
char *i2s(int n){
    static char buf[(sizeof(int)<<1)+1]={0};
    int i=0;
    while(i<(sizeof(int)<<1)+1){ /* mask the ith hex, shift it to lsb */
        // buf[i++]='0'+(0xf&(n>>((sizeof(int)<<3)-i<<2))); /* less optimizable ??? */
        buf[i++]='0'+(0xf&((n&(0xf<<((sizeof(int)<<3)-i<<2)))>>((sizeof(int)<<3)-i<<2)));
        if(buf[i-1]>'9')buf[i-1]+=('A'-'0'-10); /* handle A-F */
    }
    for(i=0;buf[i++]=='0';)
        /*find first non-zero*/;
    return (char *)buf+i;
}
With the extra bit shift and mask and compiled with gcc -S -O3, the loops unroll and it reduces to:
movb $48, buf.1247
xorl %eax, %eax
movb $48, buf.1247+1
movb $48, buf.1247+2
movb $48, buf.1247+3
movb $48, buf.1247+4
movb $48, buf.1247+5
movb $48, buf.1247+6
movb $48, buf.1247+7
movb $48, buf.1247+8
.p2align 4,,7
.p2align 3
.L26:
movzbl buf.1247(%eax), %edx
addl $1, %eax
cmpb $48, %dl
je .L26
addl $buf.1247, %eax
ret
Which is what I expected for 32 bit x86 (64-bit should be similar, but with twice as many movb-like ops); however, without the seemingly redundant mask and bit shift, gcc can't seem to unroll and optimize it.
Any ideas why this would happen? I am guessing it has to do with gcc being (overly?) cautious with the sign bit. (There is no >>> operator in C, so right-shifting fills with 1s instead of 0s if the value is signed and its sign bit is set.)
It seems you're using gcc4.7, since newer gcc versions generate different code than what you show.
gcc is able to see that your longer expression with the extra shifting and masking is always '0' + 0, but not for the shorter expression.
clang sees through them both, and optimizes them to a constant independent of the function arg n, so this is probably just a missed-optimization for gcc. When gcc or clang manage to optimize away the first loop to just storing a constant, the asm for the whole function never even references the function arg, n.
Obviously this means your function is buggy! And that's not the only bug.
off-by-one in the first loop, so you write 9 bytes, leaving no terminating 0. (Otherwise the search loop could optimize away too, and just return a pointer to the last byte. As written, it has to search off the end of the static array until it finds a non-'0' byte. Writing a 0 (not '0') before the search loop unfortunately doesn't help clang or gcc get rid of the search loop.)
off-by-one in the search loop so you always return buf+1 or later because you used buf[i++] in the condition instead of a for() loop with the increment after the check.
undefined behaviour from using i++ and i in the same statement with no sequence point separating them.
Apparently assuming that CHAR_BIT is 8. (Something like static char buf[CHAR_BIT*sizeof(n)/4 + 1], but actually you need to round up the division.)
clang and gcc both warn about - having lower precedence than <<, but I didn't try to find exactly where you went wrong. Getting the ith nibble of an integer is much simpler than you make it: buf[i]='0'+ (0x0f & (n >> (4*i)));
That compiles to pretty clunky code, though. gcc probably does better with @Fabio's suggestion to do tmp >>= 4 repeatedly. If the compiler leaves that loop rolled up, it can still use shr reg, imm8 instead of needing a variable-shift. (clang and gcc don't seem to optimize the n>>(4*i) into repeated shifts by 4.)
In both cases, gcc is fully unrolling the first loop. It's quite large when each iteration includes actual shifting, comparing, and branching or branchless handling of hex digits from A to F.
It's quite small when it can see that all it has to do is store 48 == 0x30 == '0'. (Unfortunately, it doesn't coalesce the 9 byte stores into wider stores the way clang does).
I put a bugfixed version on godbolt, along with your original.
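(The godbolt link isn't reproduced here, so here's a sketch along the same lines--my reconstruction of the fixes listed above, not necessarily the linked code:)
/* Sketch of a bugfixed version: writes exactly 2*sizeof(int) digits,
 * keeps the terminating 0, shifts an unsigned copy so zeros come in
 * from the top, and stops the skip loop at the last digit so n == 0
 * returns "0" instead of an empty string. */
char *i2s_fixed(int n)
{
    static char buf[sizeof(int) * 2 + 1];   /* 8 digits + '\0' for 32-bit int */
    unsigned un = n;
    int i;
    for (i = 0; i < (int)(sizeof(int) * 2); i++) {
        int nib = 0x0f & (un >> (4 * (sizeof(int) * 2 - 1 - i)));
        buf[i] = (char)(nib < 10 ? '0' + nib : 'A' + nib - 10);
    }
    buf[i] = '\0';
    /* skip leading zeros, but keep at least one digit */
    for (i = 0; buf[i] == '0' && buf[i + 1] != '\0'; i++)
        ;
    return buf + i;
}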
Fabio's answer has a more optimized version. I was just trying to figure out what gcc was doing with yours, since Fabio had already provided a good version that should compile to more efficient code. (I optimized mine a bit too, but didn't replace the n>>(4*i) with n>>=4.)
gcc6.3 makes very amusing code for your bigger expression. It unrolls the search loop and optimizes away some of the compares, but keeps a lot of the conditional branches!
i2s_orig:
mov BYTE PTR buf.1406+3, 48
mov BYTE PTR buf.1406, 48
cmp BYTE PTR buf.1406+3, 48
mov BYTE PTR buf.1406+1, 48
mov BYTE PTR buf.1406+2, 48
mov BYTE PTR buf.1406+4, 48
mov BYTE PTR buf.1406+5, 48
mov BYTE PTR buf.1406+6, 48
mov BYTE PTR buf.1406+7, 48
mov BYTE PTR buf.1406+8, 48
mov BYTE PTR buf.1406+9, 0
jne .L7 # testing flags from the compare earlier
jne .L8
jne .L9
jne .L10
jne .L11
sete al
movzx eax, al
add eax, 8
.L3:
add eax, OFFSET FLAT:buf.1406
ret
.L7:
mov eax, 3
jmp .L3
... more of the same, setting eax to 4, or 5, etc.
Putting multiple jne instructions in a row is obviously useless.
I think it has to do with the fact that in the longer version you are left-shifting by ((sizeof(int)<<3)-i<<2) and then right-shifting by that same value later in the expression, so the compiler is able to optimise based on that fact.
Regarding the right-shifting, C++ can do the equivalent of both of Java's operators '>>' and '>>>'. It's just that in [GNU] C++ the result of "x >> y" will depend on whether x is signed or unsigned. If x is signed, then shift-right-arithmetic (SRA, sign-extending) is used, and if x is unsigned, then shift-right-logical (SRL, zero-extending) is used. This way, >> can be used to divide by 2 for both negative and positive numbers.
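A small demonstration of the two behaviors (my example; the signed result is what GCC on x86 produces, since the standard leaves it implementation-defined):
#include <stdio.h>

int main(void)
{
    int s = -16;                 /* bit pattern 0xFFFFFFF0 */
    unsigned u = 0xFFFFFFF0u;
    printf("%d\n", s >> 2);      /* -4: arithmetic shift, sign bits shifted in */
    printf("%u\n", u >> 2);      /* 1073741820: logical shift, zeros shifted in */
    return 0;
}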
Unrolling loops is no longer a good idea because: 1) Newer processors come with a micro-op buffer which will often speed up small loops, 2) code bloat makes instruction caching less efficient by taking up more space in L1i. Micro-benchmarks will hide that effect.
The algorithm doesn't have to be that complicated. Also, your algorithm has a problem that it returns '0' for multiples of 16, and for 0 itself it returns an empty string.
Below is a rewrite of the algo which is branch-free except for the loop exit check (or completely branch-free if the compiler decides to unroll it). It is faster, generates shorter code, and fixes the multiple-of-16 bug.
Branch-free code is desirable because there is a big penalty (15-20 clock cycles) if the CPU mispredicts a branch. Compare that to the bit operations in the algo: they take only 1 clock cycle each, and the CPU is able to execute 3 or 4 of them in the same clock cycle.
const char* i2s_brcfree(int n)
{
    static char buf[sizeof(n)*2 + 1] = {0};
    unsigned int nibble_shifter = n;
    for (char* p = buf + sizeof(buf) - 2; p >= buf; --p, nibble_shifter >>= 4) {
        const char curr_nibble = nibble_shifter & 0xF; // look only at lowest 4 bits
        char digit = '0' + curr_nibble;
        // "promote" to hex if nibble is over 9,
        // conditionally adding the difference between ('0'+nibble) and 'A'
        enum { dec2hex_offset = ('A' - '0' - 0xA) }; // compile time constant
        digit += dec2hex_offset & -(curr_nibble > 9); // conditional add
        *p = digit;
    }
    return buf;
}
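A quick usage sketch (my addition): unlike the original, this version keeps the leading zeros and returns the full-width string.
#include <stdio.h>

int main(void)
{
    printf("%s\n", i2s_brcfree(0x2A)); /* prints 0000002A */
    printf("%s\n", i2s_brcfree(16));   /* prints 00000010: the multiple-of-16 case now works */
    return 0;
}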
Edit: C++ does not define the result of right shifting negative numbers; it is implementation-defined. I only know that GCC and Visual Studio implement it as an arithmetic shift on the x86 architecture.

What does 0x4 do in "movl $0x2d, 0x4(%esp)"?

I am looking into assembly code generated by GCC. But I don't understand:
movl $0x2d, 0x4(%esp)
In the second operand, what does 0x4 stand for? An offset address? And what is the use of the EAX register?
movl $0x2d, 0x4(%esp) means to take the current value of the stack pointer (%esp), add 4 (0x4) then store the long (32-bit) value 0x2d into that location.
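For context (an assumption on my part, since the surrounding code isn't shown): this pattern typically appears when gcc stores an outgoing call argument instead of pushing it. A hypothetical C fragment that could produce such a store (0x2d is 45):
/* hypothetical example: 32-bit gcc often materializes call arguments
 * with esp-relative movl stores rather than push instructions */
void callee(int a, int b);

void caller(int a)
{
    callee(a, 45);   /* the 45 (0x2d) could become: movl $0x2d, 0x4(%esp) */
}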
The eax register is one of the general purpose 32-bit registers. x86 architecture specifies the following 32-bit registers:
eax Accumulator Register
ebx Base Register
ecx Counter Register
edx Data Register
esi Source Index
edi Destination Index
ebp Base Pointer
esp Stack Pointer
and the names and purposes of some of them harken back to the days of the Intel 8080.
This page gives a good overview on the Intel-type registers. The first four of those in the above list can also be accessed as a 16-bit or two 8-bit values as well. For example:
3322222222221111111111
10987654321098765432109876543210
<--------------eax------------->
                <------ax------>
                <--ah--><--al-->
The pointer and index registers do not allow use of 8-bit parts but you can have, for example, the 16-bit bp.
0x4(%esp) means *(%esp + 4) where * mean dereferencing.
The statement means store the immediate value 0x2d into some local variable occupying the 4th offset on the stack.
(The code you've shown is in AT&T syntax. In Intel syntax it would be mov dword ptr [esp+4], 2Dh.)
0x4 in the second operand is an offset from the value of the register in the parens. EAX is a general-purpose register used in assembly code (computations, storing temporary values, etc.); formally it's called the "accumulator register", but that's more historic than relevant.
You can read this page about the x86 architecture. Most relevant to your question are the sections on Addressing modes and General purpose registers
GCC assembly mnemonics carry an operand-size suffix--byte (b), word (w), long (l), and so on--such as:
movb
movw
movl
Registers are prefixed with a percentage sign (%).
Constants are prefixed with a dollar sign ($).
In the above example in your question, it means an offset of 4 bytes from the stack pointer (esp).
Hope this helps,
Best regards,
Tom.
You're accessing something four bytes above where the stack pointer points. In GCC output this often indicates a parameter (I think -- positive offsets are parameters and negative ones are local variables, if I remember correctly, though that rule of thumb is really about ebp-relative addressing). You're writing, in other words, the value 0x2d into a parameter. If you gave more context I could probably tell you what was going on in the whole procedure.
