Assembly language - How it works - c

I am really new to assembly language and have just started digging into it, so I was wondering if some of you could help me figure out one problem. I have a homework assignment which asks me to compare assembly language instructions to C code and decide which C code is equivalent to the assembly instructions. Here are the assembly instructions:
pushl %ebp // What i think is happening here is that we are creating more space for the function.
movl %esp,%ebp // Here i think we are moving the stack pointer to the old base pointer.
movl 8(%ebp),%edx // Here we are taking parameter int a and storing it in %edx
movl 12(%ebp),%eax // Here we are taking parameter int b and storing it in %eax
cmpl %eax,%edx // Here i think we are comparing int a and b ( b > a ) ?
jge .L3 // Jump to .L3 if b is greater than a - else continue the instructions
movl %edx,%eax // If the term is not met here it will return b
.L3:
movl %ebp,%esp // Starting to finish the function
popl %ebp // Putting the base pointer in the right place
ret // return
I have tried to comment it based on my understanding of this, but I might be totally wrong. The candidate C functions, one of which is supposed to be equivalent, are:
int fun1(int a, int b)
{
    unsigned ua = (unsigned) a;
    if (ua < b)
        return b;
    else
        return ua;
}

int fun2(int a, int b)
{
    if (b < a)
        return b;
    else
        return a;
}

int fun3(int a, int b)
{
    if (a < b)
        return a;
    else
        return b;
}
I think the correct answer is fun3, but I'm not quite sure.

First off, welcome to StackOverflow. Great place, really it is.
Now for starters, let me help you; a lot; a whole lot.
You have good comments that help both you and me and everyone else tremendously, but they are so ugly that reading them is painful.
Here's how to fix that: white space, lots of it, blank lines, and grouping the instructions into small groups that are related to each other.
More to the point, after a conditional jump, insert one blank line, after an absolute jump, insert two blank lines. (Old tricks, work great for readability)
Secondly, line up the comments so that they are neatly arranged. It looks a thousand times better.
Here's your stuff, with 90 seconds of text arranging by me. Believe me, the professionals will respect you a thousand times better with this kind of source code...
pushl %ebp          // What i think is happening here is that we are creating more space for the function.
movl  %esp,%ebp     // Here i think we are moving the stack pointer to the old base pointer.

movl  8(%ebp),%edx  // Here we are taking parameter int a and storing it in %edx
movl  12(%ebp),%eax // Here we are taking parameter int b and storing it in %eax

cmpl  %eax,%edx     // Here i think we are comparing int a and b ( b > a ) ?
                    // No. Think like this: "What is the value of edx with respect to the value of eax ?"
jge   .L3           // edx is greater or equal, so return the value in eax as it is

movl  %edx,%eax     // If the term is not met here it will return b
                    // (pssst, I think you're wrong; think it through again)
.L3:
movl  %ebp,%esp     // Starting to finish the function
popl  %ebp          // Putting the base pointer in the right place
ret                 // return
Now, back to your problem at hand. What he's getting at is the "sense" of the compare instruction and the related JGE instruction.
Here's the confuse-o-matic stuff you need to comprehend to survive these sorts of "academic experiences"
This biz, the cmpl %eax,%edx instruction, is one of the forms of the "compare" instructions
Try to form an idea something like this when you see that syntax, "...What is the value of the destination operand with respect to the source operand ?..."
Caveat: I am absolutely no good with the AT&T syntax, so anybody is welcome to correct me on this.
Anyway, in this specific case, you can phrase the idea in your mind like this...
"...I see cmpl %eax,%edx so I think: With respect to eax, the value in edx is..."
You then complete that sentence in your mind with the "sense" of the next instruction which is a conditional jump.
The paradigmatic process in the human brain works out to form a sentence like this...
"...With respect to eax, the value in edx is greater or equal, so I jump..."
So, if you are correct about the locations of a and b, then you can do the paradigmatic brain scrambler and get something like this...
"...With respect to the value in b, that value in a is greater or equal, so I will jump..."
To get a grasp of this, take note that JGE is the "opposite sense" if you will, of JL (i.e., "Jump if less than")
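To make this concrete, here is a neutral illustration (a made-up function, deliberately not one of your homework candidates) of how a signed comparison in C lines up with cmpl and the JGE/JL family:

/* Hypothetical example: a signed "greater or equal" test. */
int ge(int x, int y)
{
    /* With x in %edx and y in %eax this becomes cmpl %eax,%edx plus a
       conditional jump: "with respect to y, the value in x is greater
       or equal..." (the compiler may emit jge, or invert it to jl,
       depending on how it lays out the two branches). */
    if (x >= y)
        return 1;
    return 0;
}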
Okay, now it so happens that return in C is related to the ret instruction in assembly language, but it isn't the same thing.
When C programmers say "...That function returns an int..." what they mean is...
The assembly language subroutine will place a value in Eax
The subroutine will then fix the stack and put it back in neat order
The subroutine will then execute its Ret instruction
One more item of obfuscation is thrown in your face now.
These following conditional jumps are applicable to Signed arithmetic comparison operations...
JG
JGE
JNG
JL
JLE
JNL
There it is! The trap waiting to screw you up in all this!
Do you want to do signed or unsigned compares??? (For the record, the unsigned counterparts are JA, JAE, JNA, JB, JBE, and JNB; the compiler picks one family or the other based on the types being compared.)
By the way, I've never seen anybody do anything like that first function where an unsigned number is compared with a signed number. Is that even legal ?
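For the record, it is legal C: the usual arithmetic conversions turn the signed operand into an unsigned one, and that conversion is exactly what pushes a compiler toward the unsigned jump family (JB, JAE, and friends) instead of JL/JGE. A small demonstration:

#include <stdio.h>

int main(void)
{
    unsigned ua = 1;
    int b = -1;

    /* b is converted to unsigned for the comparison: (unsigned)-1 is
       UINT_MAX (4294967295 with 32-bit unsigned), so this really asks
       whether 1 < 4294967295. */
    if (ua < b)
        puts("ua < b holds, because b became UINT_MAX in the comparison");
    return 0;
}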
So anyway, we put all these facts together, and we get: this assembly language routine returns the value in a if it is less than the value in b; otherwise it returns the value in b.
These values are evaluated as signed integers.
(I think I got that right; somebody check my logic. I really don't like that assembler's syntax at all.)
So anyway, I am reasonably certain that you don't want to ask people on the internet to provide you with the specific answer to your specific homework question, so I'll leave it up to you to figure it out from this explanation.
Hopefully, I have explained enough of the logic and the "sense" of comparisons and the signed and unsigned biz so that you can get your brain around this.
Oh, and disclaimer again, I always use the Intel syntax (e.g., Masm, Tasm, Nasm, whatever) so if I got something backwards here, feel free to correct it for me.

Related

how to derive the types of the following data types from assembly code

I came across an exercise, as I am still trying to familiarise myself with assembly code.
I am unsure how to derive the types for a given struct, given the assembly code and the skeleton C code. Could someone teach me how this should be done?
This is the assembly code, where ecx and edx hold the arguments i and j respectively.
randFunc:
    movslq %ecx, %rcx               // move i into rcx
    movslq %edx, %rdx               // move j into rdx
    leaq   (%rcx,%rcx,2), %rax      // 3i into rax
    leaq   (%rdx,%rdx,2), %rdx      // 3j into rdx
    salq   $5, %rax                 // shift arithmetic left by 5 (i.e. *32), so 32*3i = 96i
    leaq   (%rax,%rdx,8), %rax      // 24j + 96i into rax
    leaq   matrixtotest(%rip), %rdx // store address of the matrixtotest in rdx
    addq   %rax, %rdx               // jump to 24th row, 6th column variable
    cmpb   $10, 2(%rdx)             // add 2 to that variable and compare to 10
    jg     .L5                      // if greater than 10 then go to .L5
    movq   8(%rdx), %rax            // else add 8 to the rdx number and store in rax
    movzwl (%rdx), %edx             // move the val in rdx (unsigned) to edx as an int
    subl   %edx, %eax               // take (val+8) - (val) = 8? (not sure)
    ret
.L5:
    movl   16(%rdx), %eax           // move 1 row down and return? not sure about this
    ret
This is the C code:
struct mat {
    typeA a;
    typeB b;
    typeC c;
    typeD d;
};

struct mat matrixtotest[M][N];

int randFunc(int i, int j) {
    return __1__ ? __2__ : __3__;
}
How do I derive the types of the variables a, b, c, d? And what is happening in the 1) 2) 3) parts of the return statement?
Please help me, I'm very confused about what's happening and how to derive the types of the struct from this assembly.
Any help is appreciated, thank you.
Due to the cmpb $10, 2(%rdx) you have a byte-sized something at offset 2. Due to the movzwl (%rdx), %edx you have a 2-byte unsigned something at offset 0. Due to the movq 8(%rdx), %rax you have an 8-byte something at offset 8. Finally, due to the movl 16(%rdx), %eax you have a 4-byte something at offset 16. Now sizes don't map to types directly, but one possibility would be:
struct mat {
    uint16_t a;
    int8_t   b;
    int64_t  c;
    int32_t  d;
};
You can use unsigned short, signed char, long, int if you know their sizes.
The size of the structure is 24 bytes, with padding at the end due to the alignment requirement of the 8-byte field. From the 96i you can probably deduce N=4. M is unknown. As such, 24j + 96i accesses item matrixtotest[i][j]. The rest should be clear.
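If you want to double-check a guessed layout like this, you can let the compiler verify the offsets and the 24-byte stride for you (a quick C11 sanity check, assuming a typical x86-64 ABI):

#include <stddef.h>
#include <stdint.h>

struct mat {
    uint16_t a;   /* offset 0:  movzwl (%rdx), %edx  */
    int8_t   b;   /* offset 2:  cmpb $10, 2(%rdx)    */
    int64_t  c;   /* offset 8:  movq 8(%rdx), %rax   */
    int32_t  d;   /* offset 16: movl 16(%rdx), %eax  */
};

_Static_assert(offsetof(struct mat, b) == 2,  "b at offset 2");
_Static_assert(offsetof(struct mat, c) == 8,  "c at offset 8");
_Static_assert(offsetof(struct mat, d) == 16, "d at offset 16");
_Static_assert(sizeof(struct mat) == 24,      "24-byte element stride");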
How do I derive the types of the variables a,b,c,d?
You want to see how variables are used, which will give you a very strong indication as to their size & sign.  (These indications are not always perfect, but the best we can do with limited information, i.e. missing source code, and will suffice for your exercise.)
So, just work the code, one instruction after another to see what they do by the definitions they have in the assembler and their mapping to the instruction set, paying particular attention to the sizes, signs, and offsets specified by the instructions.
Let's start for example with the first instruction: movslq %ecx, %rcx — this is saying that the first parameter (which is found in ecx) is a 32-bit signed number.
Since rcx is the Windows ABI's first parameter register, and the assembly code is asking for ecx to be sign-extended into rcx, we know that this parameter is a signed 32-bit integer.  And you proceed to the next instruction, to glean what you can from it — and so on.
And what is happening in the 1) 2) 3) parts of the return statement?
The ?: operator is a ternary operator known as a conditional.  If the condition, placeholder __1__, is true, it will choose the __2__ value and if false it will choose __3__.  This is usually (but not always) organized as an if-then-else branching pattern, where the then-part represents placeholder __2__ and the else part placeholder __3__.
That if-then-else branching pattern looks something like this in assembly/machine code:
    if <condition> /* here __1__ */ is false goto elsePart;
    <then-part>    // here __2__
    goto ifDone;
elsePart:
    <else-part>    // here __3__
ifDone:
So, when you get to an if-then-else construct, you can fit that into the ternary operator place holders.
That code is nicely commented, but somewhat absent size, sign, and offset information.  So, follow along and derive that missing information from the way the instructions tell the CPU what sizes, signs, and offsets to use.
As Jester describes, when the code indexes into the array, it uses two indexes because the array is two-dimensional. The indexing takes the given indexes and computes the address of the element: the first index selects the row, and so must skip ahead one whole row for each unit of that index, while the second index skips ahead one element per unit. Thus, from the formula in the comments, 24j + 96i, we can say that a row is 96 bytes long and an element (the struct) is 24 bytes long.
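Put as C, the address arithmetic that the leaq/salq sequence performs looks like this (a sketch; element is a made-up helper name, with N fixed at the value 4 deduced above):

/* How 24j + 96i reaches matrixtotest[i][j]: one row is N * 24 = 96
   bytes and one element is 24 bytes (using the struct mat guessed above). */
struct mat *element(struct mat (*m)[4], long i, long j)
{
    char *base = (char *)m;
    return (struct mat *)(base + 96 * i + 24 * j);   /* same as &m[i][j] */
}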

Understanding the difference between ++i and i++ at the Assembly Level

I know that variations of this question has been asked here multiple times, but I'm not asking what is the difference between the two. Just would like some help understanding the assembly behind both forms.
I think my question is more related to the whys than to the what of the difference.
I'm reading Prata's C Primer Plus and in the part dealing with the increment operator ++ and the difference between using i++ or ++i the author says that if the operator is used by itself, such as ego++; it doesn't matter which form we use.
If we look at the dissasembly of the following code (compiled with Xcode, Apple LLVM version 9.0.0 (clang-900.0.39.2)):
int main(void)
{
    int a = 1, b = 1;
    a++;
    ++b;
    return 0;
}
we can see that indeed the form used doesn't matter, since the assembly code is the same for both (both variables would print out a 2 to the screen).
Initialization of a and b:
0x100000f8d <+13>: movl $0x1, -0x8(%rbp)
0x100000f94 <+20>: movl $0x1, -0xc(%rbp)
Assembly for a++:
0x100000f9b <+27>: movl -0x8(%rbp), %ecx
0x100000f9e <+30>: addl $0x1, %ecx
0x100000fa1 <+33>: movl %ecx, -0x8(%rbp)
Assembly for ++b:
0x100000fa4 <+36>: movl -0xc(%rbp), %ecx
0x100000fa7 <+39>: addl $0x1, %ecx
0x100000faa <+42>: movl %ecx, -0xc(%rbp)
Then the author states that when the operator and its operand are part of a larger expression, as in an assignment statement for example, the use of prefix or postfix does make a difference.
For example:
int main(void)
{
    int a = 1, b = 1;
    int c, d;
    c = a++;
    d = ++b;
    return 0;
}
This would print 1 and 2 for c and d, respectively.
And:
Initialization of a and b:
0x100000f46 <+22>: movl $0x1, -0x8(%rbp)
0x100000f4d <+29>: movl $0x1, -0xc(%rbp)
Assembly for c = a++; :
0x100000f54 <+36>: movl -0x8(%rbp), %eax // eax = a = 1
0x100000f57 <+39>: movl %eax, %ecx // ecx = 1
0x100000f59 <+41>: addl $0x1, %ecx // ecx = 2
0x100000f5c <+44>: movl %ecx, -0x8(%rbp) // a = 2
0x100000f5f <+47>: movl %eax, -0x10(%rbp) // c = eax = 1
Assembly for d = ++b; :
0x100000f62 <+50>: movl -0xc(%rbp), %eax // eax = b = 1
0x100000f65 <+53>: addl $0x1, %eax // eax = 2
0x100000f68 <+56>: movl %eax, -0xc(%rbp) // b = eax = 2
0x100000f6b <+59>: movl %eax, -0x14(%rbp) // d = eax = 2
Clearly the assembly code is different for the assignments:
The form c = a++; includes the use of the registers eax and ecx. It uses ecx for performing the increment of a by 1, but uses eax for the assignment.
The form d = ++b; uses eax for both the increment of b by 1 and the assignment.
My question is:
Why is that?
What determines that c = a++; requires two registers instead of just one (ecx for example)?
In the following statements:
a++;
++b;
neither of the evaluation of the expressions a++ and ++b is used. Here the compiler is actually only interested in the side effects of these operators (i.e.: incrementing the operand by one). In this context, both operators behave in the same way. So, it's no wonder that these statements result in the same assembly code.
However, in the following statements:
c = a++;
d = ++b;
the evaluation of the expressions a++ and ++b is relevant to the compiler because they have to be stored in c and d, respectively:
d = ++b;: b is incremented and the result of this increment assigned to d.
c = a++; : the value of a is first assigned to c and then a is incremented.
Therefore, these operators behave differently in this context. So, it would make sense to result in different assembly code, at least in the beginning, without more aggressive optimizations enabled.
A good compiler would replace this whole code with c = 1; d = 2;. And if those variables aren't used in turn, the whole program is one big NOP - there should be no machine code generated at all.
But you do get machine code, so you are not enabling the optimizer correctly. Discussing the efficiency of non-optimized C code is quite pointless.
Discussing a particular compiler's failure to optimize the code might be meaningful, if a specific compiler is mentioned. Which isn't the case here.
All this code shows is that your compiler isn't doing a good job, possibly because you didn't enable optimizations, and that's it. No other conclusions can be made. In particular, no meaningful discussion about the behavior of i++ versus ++i is possible.
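To see that in action, feed the optimizer a version whose result is actually used (a hypothetical function, compiled with something like gcc -O2):

int f(void)
{
    int a = 1, b = 1;
    int c = a++;    /* c = 1, a = 2 */
    int d = ++b;    /* d = 2, b = 2 */
    return c + d;   /* everything above folds away at compile time */
}

With optimization enabled, the whole function reduces to essentially movl $3, %eax followed by ret; the increments never happen at run time.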
Your test has flaws: the compiler optimized your code by replacing your computation with a result it could easily predict.
The compiler can, and will, calculate the result in advance during compilation and avoid emitting 'jmp' instructions (jumping back to the top of the while loop each time the condition still holds).
If you try this code:
int a = 0;
int i = 0;
while (i++ < 10)
{
    a += i;
}
The assembly will not use a single jmp instruction.
It will directly assign the value of ½n(n + 1), here (0.5 * 10 * 11) = 55, to the register holding the value of the 'a' variable.
You would have the following assembly output:
mov eax, 55 ; a register
mov ecx, 11 ; i register, this line only if i is still used after.
Whether you write:
int i = 0;
while (i++ < 10)
{
    ...
}
or
int i = 0;
while (++i < 11)
{
    ...
}
will also result in the same assembly output.
If you had a much more complex code you would be able to witness differences in the assembly code.
a = ++i;
would translate into:
inc rcx          ; increase i by 1; RCX holds the current value of both a and i
mov rax, rcx     ; a = i;
and a = i++; into:
lea rax, [rcx+1] ; RAX now holds the new i; RCX keeps the old value, which is a
or, without lea, into:
mov rax, rcx     ; a = the old value of i
inc rcx          ; increase i by 1
(edit: See comment below)
Both the expressions ++i and i++ have the effect of incrementing i. The difference is that ++i produces a result (a value stored somewhere, for example in a machine register, that can be used within other expressions) equal to the new value of i, whereas i++ produces a result equal to the original value of i.
So, assuming we start with i having a value of 2, the statement
b = ++i;
has the effect of setting both b and i equal to 3, whereas
b = i++;
has the effect of setting b equal to 2 and i equal to 3.
In the first case, there is no need to keep track of the original value of i after incrementing i whereas in the second there is. One way of doing this is for the compiler to employ an additional register for i++ compared with ++i.
This is not needed for a trivial expression like
i++;
since the compiler can immediately detect that the original value of i will not be used (i.e. is discarded).
For simple expressions like b = i++ the compiler could - in principle at least - avoid using an additional register, by simply storing the original value of i in b before incrementing i. However, in slightly more complex expressions such as
c = i++ - *p++; // p is a pointer
it can be much more difficult for the compiler to eliminate the need to store old and new values of i and p (unless, of course, the compiler looks ahead and determines how (or if) c, i, and p (and *p) are being used in subsequent code). In more complex expressions (involving multiple variables and interacting operations) the analysis needed can be significant.
It then comes down to implementation choices by developers/designers of the compiler. Practically, compiler vendors compete pretty heavily on compilation time (getting compilation times as small as possible) and, in doing so, may choose not to do all possible code transformations that remove unneeded uses of temporaries (or machine registers).
You compiled with optimization disabled! For gcc and LLVM, that means each C statement is compiled independently, so you can modify variables in memory with a debugger, and even jump to a different source line. To support this, the compiler can't optimize between C statements at all, and in fact spills / reloads everything between statements.
So the major flaw in your analysis is that you're looking at an asm implementation of that statement where the inputs and outputs are memory, not registers. This is totally unrealistic: compilers keep most "hot" values in registers inside inner loops, and don't need separate copies of a value just because it's assigned to multiple C variables.
Compilers generally (and LLVM in particular, I think) transform the input program into an SSA (Static Single Assignment) internal representation. This is how they track data flow, not according to C variables. (This is why I said "hot values", not "hot variables". A loop induction variable might be totally optimized away into a pointer-increment / compare against end_pointer in a loop over arr[i++]).
c = ++i; produces one value with 2 references to it (one for c, one for i). The result can stay in a single register. If it doesn't optimize into part of some other operation, the asm implementation could be as simple as inc %ecx, with the compiler just using ecx/rcx everywhere that c or i is read before the next modification of either. If the next modification of c can't be done non-destructively (e.g. with a copy-and-modify like lea (,%rcx,4), %edx or shrx %eax, %ecx, %edx), then a mov instruction to copy the register will be emitted.
d = b++; produces one new value, and makes d a reference to the old value of b. It's syntactic sugar for d=b; b+=1;, and compiles into SSA the same as that would. x86 has a copy-and-add instruction, called lea. The compiler doesn't care which register holds which value (except in loops, especially without unrolling, when the end of the loop has to have values in the right registers to jump to the beginning of the loop). But other than that, the compiler can do lea 1(%rbx), %edx to leave %ebx unmodified and make EDX hold the incremented value.
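A tiny made-up function shows the situation where both the old and the new value stay live at once, which is exactly where that copy-and-add lea earns its keep:

int use_both(int b)
{
    int d = b++;    /* d refers to the old value, b to the new one */
    return d * b;   /* both values are needed here, so the compiler keeps
                       the old value in one register and can build the
                       incremented one with a single lea in another */
}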
An additional minor flaw in your test is that with optimization disabled, the compiler is trying to compile quickly, not well, so it doesn't look for all possible peephole optimizations even within the statement that it does allow itself to optimize.
If the value of c or d is never read, then it's the same as if you had never done the assignment in the first place. (In un-optimized code, every value is implicitly read by the memory barrier between statements.)
What determines that c = a++; requires two registers instead of just one (ecx for example)?
The surrounding code, as always. +1 can be optimized into other operations, e.g. done with an LEA as part of a shift and/or add. Or built in to an addressing mode.
Or before/after negation, use the 2's complement identity that -x == ~x+1, and use NOT instead of NEG. (Although often you're adding the negated value to something, so it turns into a SUB instead of NEG + ADD, so there isn't a stand-alone NEG you can turn into a NOT.)
++ prefix or postfix is too simple to look at on its own; you always have to consider where the input comes from (does the incremented value have to end up back in memory right away or eventually?) and how the incremented and original values are used.
Basically, un-optimized code is un-interesting. Look at optimized code for short functions. See Matt Godbolt's talk at CppCon2017: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”, and also How to remove "noise" from GCC/clang assembly output? for more about looking at compiler asm output.

GCC floating point error if optimizer enabled [closed]

I'm looking for a workaround for a GCC optimizer bug. The bug was in v4.5 and is still present in v5.3.0, alas. Here's the problem C code snippet (part of a printf-like func):
d *= factor;
if ((d > 0) && (d < 1))      /* 9.9e-1 instead of 0.99e0 */
{
    exp--;
    d *= 10;
}
else if (!sci && (d >= 10))  /* 1e0 instead of 10e-1 */
{
    exp++;
    d /= 10;
}
With -O1 or -O2, this code does not produce correct results. But, if I insert a function call like srand() or similar after "d *= factor" and before the "if", then the code compiles right and produces the expected result.
I've tried inserting other things there in place of the function call, to try to nudge the compiler into a different state w/o the bug, but so far only a function call seems to work.
Which leads to the question: any better suggestions for a workaround?
I haven't been able to produce a small test case for this or I would have reported it as a GCC bug. If I extract the above code segment from the larger function it works fine; it only fails when part of the long and complicated overall function.
Here is another snippet very similar in form to the first one, which also has the same problem. If I insert a function call between the assignment and the if, the problem goes away.
gl = gc->mgr * pat_round(l / gc->mgr);
dl = l - gl;
if (((dl * gc->last_dl) < 0) &&
    ((gl == gc->last_gl) || ((gl * gc->last_gl) < 0)))
{
    ...
}
These are doubles being compared in both cases.
Here is the assembler for the hacked version of the first snippet, with a srand(0) call stuck in to make it work (I marked the instructions which change with a '*'):
.L461:
    fldl    (%esp)
    fmull   24(%esp)
    fstpl   (%esp)
    subl    $12, %esp
    .cfi_def_cfa_offset 4252
    pushl   $0
    .cfi_def_cfa_offset 4256
    call    srand
    addl    $16, %esp
    .cfi_def_cfa_offset 4240
    fldz
    fldl    (%esp)
    fucomi  %st(1), %st
    fstp    %st(1)
    jbe     .L559
    fld1
    fucomip %st(1), %st
    jbe     .L560
    decl    %ebp
    fmuls   .LC2
    fstpl   (%esp)
    jmp     .L457
and here is the same thing with the srand(0) removed-- this is the non-working version:
.L461:
    fldl    (%esp)
    fmull   24(%esp)
    fstl    (%esp)
    fldz
    fxch    %st(1)
    fucomi  %st(1), %st
    fstp    %st(1)
    jbe     .L559
    fld1
    fucomip %st(1), %st
    jbe     .L560
    decl    %ebp
    fmuls   .LC2
    fstpl   (%esp)
    jmp     .L457
A few details would help here, such as a brief description of the unexpected behaviour. Lacking that, we are forced to fall back on telepathy and crystal balls, which are notoriously unreliable.
According to my ouija board, your problem is that at the termination of that snippet, you expect d to be in the half closed range [1, 10). But it turns out that d is actually 10. When you interpose a function call, however, d mysteriously changes to the correct value.
This can happen, and it is not an optimiser bug. Although double has a fixed precision, and is nominally used throughout the computation, the compiler is permitted to perform intermediate computations in a higher precision, which can result in variables appearing to be of a more precise type at various points in their lifetime.
Now, let's telepathically ascertain that the product d * factor is just slightly less than 1, if computed precisely. It is so close to 1, in fact, that if it were rounded to 53 bits of precision, it would round to 1.0. But it happens to have 64 bits of precision at that point, so it's a tiny bit less. Now we multiply by 10, (because it was less than 1, as per the test) and round the result to 53 bits because we no longer keep the value in a register. The rounded value will turn out to be 10, contravening the expectation (except the expectation of the spirit world, who knew it all along.)
Interposing a function call will force the compiler to save the value in the floating point register, so it will be corrected to 53 bits before the comparison with 1, and thus will compare equal, not less.
Of course, all of the above is just a flight of fantasy since it has no basis whatsoever in reported evidence. If it turns out to have any resemblance with reality, that will just be one of those inexplicable coincidences.
In that hypothetical case, forcing the compiler to use SSE for floating point arithmetic would avoid the excess-precision computations, since SSE performs double arithmetic in genuine 64-bit precision rather than in 80-bit x87 temporaries. Alternatively, you can tell GCC to work harder to avoid 80-bit intermediates. See the -mfpmath=sse option, and also -ffloat-store and -fexcess-precision=standard (SSE is generally the best).
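If changing compiler flags isn't an option, you can also force the rounding yourself at the points that matter. A minimal sketch of the idea (force_double is a made-up helper name):

/* Round an x87 temporary to true 53-bit double precision by forcing a
   store to memory; the volatile qualifier stops the compiler from
   keeping the value in an 80-bit register. */
static double force_double(double x)
{
    volatile double v = x;
    return v;
}

Used as d = force_double(d * factor); before the comparisons, it forces the same spill to memory that the interposed srand() call happened to cause.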

Which is the most efficient way in C to check that at least one of two integers is zero?

I have code that is doing a lot of these comparison operations. I was wondering which is the most efficient one to use. Is it likely that the compiler will correct it if I intentionally choose the "wrong" one?
int a, b;
// Assign a value to a and b.
// Now check whether either is zero.
// The worst?
if (a * b == 0) // ...
// The best?
if (a & b == 0) // ...
// The most obvious?
if (a == 0 || b == 0) // ...
Other ideas?
In general, if there's a fast way of doing a simple thing, you can assume the compiler will do it that fast way. And remember that the compiler is outputting machine language, not C -- the fastest method probably can't be correctly represented as a set of C constructs.
Also, the third method there is the only one that always works. The first one fails if a and b are 1<<16, and the second you already know doesn't work.
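A quick demonstration of that first failure mode (assuming 32-bit int; strictly speaking the signed overflow is undefined behaviour, but on typical hardware the product simply wraps):

#include <stdio.h>

int main(void)
{
    int a = 1 << 16, b = 1 << 16;   /* both nonzero */

    /* 2^16 * 2^16 = 2^32, which wraps to 0 in a 32-bit int, so the
       multiply-based test reports a zero operand that isn't there. */
    if (a * b == 0)
        puts("a * b == 0, yet neither a nor b is zero");
    return 0;
}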
It's possible to see which variant generates fewer assembly instructions, but it's a separate matter to see which one actually executes in less time.
To help you analyze the first matter, learn to use your C compiler's command-line flags to capture its intermediate output. GCC is a common choice for a C compiler. Let's look at its unoptimized assembly code for two different programs.
#include <stdio.h>

void report_either_zero()
{
    int a = 1;
    int b = 0;
    if (a == 0 || b == 0)
    {
        puts("One of them is zero.");
    }
}
Save that text to a file such as zero-test.c, and run the following command:
gcc -S zero-test.c
GCC will emit a file called zero-test.s, which is the assembly code it would normally submit to the assembler as it generates object code.
Let's look at the relevant fragment of the assembly code. I'm using gcc version 4.2.1 on Mac OS X generating x86 64-bit instructions.
_report_either_zero:
Leh_func_begin1:
    pushq %rbp
Ltmp0:
    movq  %rsp, %rbp
Ltmp1:
    subq  $32, %rsp
Ltmp2:
    movl  %edi, -4(%rbp)
    movq  %rsi, -16(%rbp)
    movl  $1, -20(%rbp)      // a = 1
    movl  $0, -24(%rbp)      // b = 0
    movl  -20(%rbp), %eax    // Get ready to compare a.
    cmpl  $0, %eax           // Does zero equal a?
    je    LBB1_2             // If so, go to label LBB1_2.
    movl  -24(%rbp), %eax    // Otherwise, get ready to compare b.
    cmpl  $0, %eax           // Does zero equal b?
    jne   LBB1_3             // If not, go to label LBB1_3.
LBB1_2:
    leaq  L_.str(%rip), %rax
    movq  %rax, %rdi
    callq _puts              // Otherwise, write the string to standard output.
LBB1_3:
    addq  $32, %rsp
    popq  %rbp
    ret
Leh_func_end1:
You can see where we load the integer values 1 and 0, then prepare to compare the first to zero, and then do the same with the second if the first is nonzero.
Now let's try a different approach with the comparison, to see how the assembly code changes. Note that this is not the same predicate; this one checks whether both numbers are zero.
#include <stdio.h>

void report_both_zero()
{
    int a = 1;
    int b = 0;
    if (!(a | b))
    {
        puts("Both of them are zero.");
    }
}
The assembly code is a little different:
_report_both_zero:
Leh_func_begin1:
    pushq %rbp
Ltmp0:
    movq  %rsp, %rbp
Ltmp1:
    subq  $16, %rsp
Ltmp2:
    movl  $1, -4(%rbp)       // a = 1
    movl  $0, -8(%rbp)       // b = 0
    movl  -4(%rbp), %eax     // Get ready to operate on a.
    movl  -8(%rbp), %ecx     // Get ready to operate on b too.
    orl   %ecx, %eax         // Combine a and b via bitwise OR.
    cmpl  $0, %eax           // Does zero equal the result?
    jne   LBB1_2             // If not, go to label LBB1_2.
    leaq  L_.str(%rip), %rax
    movq  %rax, %rdi
    callq _puts              // Otherwise, write the string to standard output.
LBB1_2:
    addq  $16, %rsp
    popq  %rbp
    ret
Leh_func_end1:
If the first number is zero, the first variant does less work—in terms of the number of assembly instructions involved—by avoiding a second register move. If the first number is not zero, the second variant does less work by avoiding a second comparison to zero.
The question now is whether "move, move, bitwise or, compare" runs faster than "move, compare, move, compare." The answer could come down to things like whether the branch predictor learns how often the first integer is zero, and how consistent that is.
If you ask the compiler to optimize this code, the example is too simple; the compiler decides at compile time that no comparison is necessary, and just condenses that code to an unconditional request to write the string. It's interesting to change the code to operate on parameters rather than constants, and see how the optimizer handles the situation differently.
Variant one:
#include <stdio.h>

void report_either_zero(int a, int b)
{
    if (a == 0 || b == 0)
    {
        puts("One of them is zero.");
    }
}
Variant two (again, a different predicate):
#include <stdio.h>

void report_both_zero(int a, int b)
{
    if (!(a | b))
    {
        puts("Both of them are zero.");
    }
}
Generate the optimized assembly code with this command:
gcc -O -S zero-test.c
Let us know what you find.
This will not likely have much (if any, given modern compiler optimizers) effect on the overall performance of your app. If you really must know, you should write some code to test the performance of each for your compiler. However, as a best guess, I'd say...
if ( !( a && b ) )
This will short-circuit if the first happens to be 0.
If you want to find whether or not one of two integers is zero using one comparison instruction ...
if ((a << b) == a)
If a is zero, then no amount of shifting it to the left will change its value.
If b is zero, then there is no shifting performed.
There is indeed undefined behaviour should b be negative, or greater than or equal to the width of int in bits, so the trick is only safe when b is known to be in range.
However, due to the non-intuitiveness, it would be strongly recommended to implement this as a macro (with an appropriate comment).
Hope this helps.
The most efficient is certainly the most obvious, if by efficiency, you are measuring the programmer's time.
If by measuring efficiency using the processor's time, profiling your candidate solution is the best way to answer - for the target machine you profiled.
But this exercise demonstrated a pitfall of programmer optimization. The 3 candidates are not functionally equivalent for all int.
If you want a functionally equivalent alternative...
I think the last candidate and a 4th one deserve comparison.
if ((a == 0) || (b == 0))
if ((a == 0) | (b == 0))
Due to the variation of compilers, optimization and CPU branch prediction, one should profile, rather than pontificate, to determine relative performance. OTOH, a good optimizing compiler may give you the same code for both.
I recommend the code that is easiest to maintain.
There's no "most efficient way to do it in C", if by "efficiency" you mean the efficiency of the compiled code.
Firstly, even if we assume that the compiler translates C language operator into their "obvious" machine counterparts (i.e. C multiplication into machine multiplication etc) the efficiency of each method will differ from one hardware platform to the other. Even if we restrict our consideration to a very specific sequence of instructions on a very specific hardware platform, it still can exhibit different performance in different surrounding contexts, depending, for example, on how well the whole thing agrees with the branch prediction heuristic in the given CPU.
Secondly, modern C compilers rarely translate C operators into their "obvious" machine counterparts. Often the instructions used in machine code will have very little in common with the C code. It is possible that many "completely different" methods of performing the check at C level will actually be translated into the same sequence of machine instructions by a smart compiler. At the same time, the same C code might get translated into different sequences of machine instructions when the surrounding contexts are different.
In other words, there's no meaningful answer to your question, unless you really really really localize it to a specific hardware platform, specific compiler version and specific set of compilation settings. And that will make it too localized to be useful.
That usually means that in general case the best way to do it is to write the most readable code. Just do
if (a == 0 || b == 0)
The readability of the code will not only help the human reader to understand it, but will also increase the probability of the compiler properly interpreting your intent and generating the most optimal code.
But if you really have to squeeze the last CPU cycle out of your performance-critical code, you have to try different versions and compare their relative efficiency manually.

Trouble understanding gcc's assembly output

While writing some C code, I decided to compile it to assembly and read it--I just sort of, do this from time to time--sort of an exercise to keep me thinking about what the machine is doing every time I write a statement in C.
Anyways, I wrote these two lines in C
asm(";move old_string[i] to new_string[x]");
new_string[x] = old_string[i];
asm(";shift old_string[i+1] into new_string[x]");
new_string[x] |= old_string[i + 1] << 8;
(old_string is an array of char, and new_string is an array of unsigned short, so given two chars, 42 and 43, this will put 4342 into new_string[x])
Which produced the following output:
#move old_string[i] to new_string[x]
movl -20(%ebp), %esi #put address of first char of old_string in esi
movsbw (%edi,%esi),%dx #put first char into dx
movw %dx, (%ecx,%ebx,2) #put first char into new_string
#shift old_string[i+1] into new_string[x]
movsbl 1(%esi,%edi),%eax #put old_string[i+1] into eax
sall $8, %eax #shift it left by 8 bits
orl %edx, %eax #or edx into it
movw %ax, (%ecx,%ebx,2) #?
(I'm commenting it myself, so I can follow what's going on).
I compiled it with -O3, so I could also sort of see how the compiler optimizes certain constructs. Anyways, I'm sure this is probably simple, but here's what I don't get:
the first section copies a char out of old_string[i], and then movw's it (from dx) to (%ecx,%ebx,2). Then the next section copies old_string[i+1], shifts it, ors it, and then puts it into the same place from ax. It puts two 16-bit values into the same place? Wouldn't this not work?
Also, it shifts old_string[i+1] to the high-order dword of eax, then ors edx (new_string[x]) into it... then puts ax into the memory! Wouldn't ax just contain what was already in new_string[x]? so it saves the same thing to the same place in memory twice?
Is there something I'm missing? Also, I'm fairly certain that the rest of the compiled program isn't relevant to this snippet... I've read around before and after, to find where each array and different variables are stored, and what the registers' values would be upon reaching that code--I think that this is the only piece of the assembly that matters for these lines of C.
--
oh, turns out GNU assembly comments are started with a #.
Okay, so it was pretty simple after all.
I figured it out with a pen and paper, writing down each step, what it did to each register, and then wrote down the contents of each register given an initial starting value...
What got me was that it was using 32 bit and 16 bit registers for 16 and 8 bit data types...
This is what I thought was happening:
first value put into memory as, say, 0001 (I was thinking 01).
second value (02) loaded into 32 bit register (so it was like, 00000002, I was thinking, 0002)
second value shifted left 8 bits (00000200, I was thinking, 0200)
first value (00000001, I thought 0001) or'd into second value (00000201, I thought 0201)
16 bit register put into memory (0201, I was thinking, just 01 again).
I didn't get why it wrote it to memory twice though, or why it was using 32 bit registers (well, actually, my guess is that a 32 bit processor is way faster at working with 32 bit values than it is with 8 and 16 bit values, but that's a totally uneducated guess), so I tried rewriting it:
movl -20(%ebp), %esi #gets pointer to old_string
movsbw (%edi,%esi),%dx #old_string[i] -> dx (0001)
movsbw 1(%edi,%esi),%ax #old_string[i + 1] -> ax (0002)
salw $8, %ax #shift ax left (0200)
orw %dx, %ax #or dx into ax (0201)
movw %ax,(%ecx,%ebx,2) #doesn't write to memory until end
This worked exactly the same.
I don't know if this is an optimization or not (aside from taking one memory write out, which obviously is), but if it is, I know it's not really worth it and didn't gain me anything. In any case, I get what this code is doing now, thanks for the help all.
I'm not sure what's not to understand, unless I'm missing something.
The first 3 instructions load a byte from old_string into dx and stores that to your new_string.
The next 3 instructions utilize what's already in dx and combines old_string[i+1] with it, and stores it as a 16-bit value (ax) to new_string.
Also, it shifts old_string[i+1] to the high-order dword of eax, then
ors edx (new_string[x]) into it... then puts ax into the memory! Wouldn't
ax just contain what was already in new_string[x]? so it saves the same
thing to the same place in memory twice?
Now you see why optimizers are a Good Thing. That kind of redundant code shows up pretty often in unoptimized, generated code, because the generated code comes more or less from templates that don't "know" what happened before or after.
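For contrast, the same two-statement operation can be written as a single expression; a sketch (the casts through unsigned char avoid the sign-extension that movsbw performs on negative chars, so this matches the intent rather than the letter of the original):

#include <stddef.h>

/* Low byte from old_string[i], high byte from old_string[i + 1],
   packed into one 16-bit element of new_string. */
static void pack_pair(unsigned short *new_string, size_t x,
                      const char *old_string, size_t i)
{
    new_string[x] = (unsigned short)((unsigned char)old_string[i]
                  | ((unsigned char)old_string[i + 1] << 8));
}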
