GCC floating point error if optimizer enabled [closed]

GCC floating point error if optimizer enabled [closed] - c

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 7 years ago.
Improve this question
I'm looking for a workaround for a GCC optimizer bug. The bug was in v4.5 and is still present in v5.3.0, alas. Here's the problem C code snippet (part of a printf-like func):
d *= factor;
if ((d > 0) && (d < 1)) /* 9.9e-1 instead of 0.99e0 */
{
exp--;
d *= 10;
}
else if (!sci && (d >= 10)) /* 1e0 instead of 10e-1 */
{
exp++;
d /= 10;
}
With -O1 or -O2, this code does not produce correct results. But, if I insert a function call like srand() or similar after "d *= factor" and before the "if", then the code compiles right and produces the expected result.
I've tried inserting other things there in place of the function call, to try to nudge the compiler into a different state w/o the bug, but so far only a function call seems to work.
Which leads to the question: any better suggestions for a workaround?
I haven't been able to produce a small test case for this or I would have reported it as a GCC bug. If I extract the above code segment from the larger function it works fine; it only fails when part of the long and complicated overall function.
Here is another snippet very similar in form to the first one, which also has the same problem. If I insert a function call between the assignment and the if, the problem goes away.
gl = gc->mgr * pat_round(l / gc->mgr);
dl = l - gl;
if (((dl * gc->last_dl) < 0) &&
((gl == gc->last_gl) || ((gl * gc->last_gl) < 0)))
{
...
}
These are doubles being compared in both cases.
Here is the assembler for the hacked version of the first snippet, with a srand(0) call stuck in to make it work (I marked the instructions which change with a '*'):
.L461:
fldl (%esp)
fmull 24(%esp)
fstpl (%esp)
subl $12, %esp
.cfi_def_cfa_offset 4252
pushl $0
.cfi_def_cfa_offset 4256
call srand
addl $16, %esp
.cfi_def_cfa_offset 4240
fldz
fldl (%esp)
fucomi %st(1), %st
fstp %st(1)
jbe .L559
fld1
fucomip %st(1), %st
jbe .L560
decl %ebp
fmuls .LC2
fstpl (%esp)
jmp .L457
and here is the same thing with the srand(0) removed-- this is the non-working version:
.L461:
fldl (%esp)
fmull 24(%esp)
fstl (%esp)
fldz
fxch %st(1)
fucomi %st(1), %st
fstp %st(1)
jbe .L559
fld1
fucomip %st(1), %st
jbe .L560
decl %ebp
fmuls .LC2
fstpl (%esp)
jmp .L457

A few details would help here, such as a brief description of the unexpected behaviour. Lacking that, we are forced to fall back on telepathy and crystal balls, which are notoriously unreliable.
According to my ouija board, your problem is that at the termination of that snippet, you expect d to be in the half closed range [1, 10). But it turns out that d is actually 10. When you interpose a function call, however, d mysteriously changes to the correct value.
This can happen, and it is not an optimiser bug. Although double has a fixed precision, and is nominally used throughout the computation, the compiler is permitted to perform intermediate computations in a higher precision, which can result in variables appearing to be of a more precise type at various points in their lifetime.
Now, let's telepathically ascertain that the product d * factor is just slightly less than 1, if computed precisely. It is so close to 1, in fact, that if it were rounded to 53 bits of precision, it would round to 1.0. But it happens to have 64 bits of precision at that point, so it's a tiny bit less. Now we multiply by 10, (because it was less than 1, as per the test) and round the result to 53 bits because we no longer keep the value in a register. The rounded value will turn out to be 10, contravening the expectation (except the expectation of the spirit world, who knew it all along.)
Interposing a function call will force the compiler to save the value in the floating point register, so it will be corrected to 53 bits before the comparison with 1, and thus will compare equal, not less.
Of course, all of the above is just a flight of fantasy since it has no basis whatsoever in reported evidence. If it turns out to have any resemblance with reality, that will just be one of those inexplicable coincidences.
In that hypothetical case, forcing the compiler to use SSE for floating point arithmetic would avoid the excess precision computations, since SSE registers are only 64 bits. Alternatively, you can tell GCC to work harder to avoid 80-bit intermediates. See the -mfpmath=sse option, and also -ffloat-store and -fexcess-precision=standard (SSE is generally rhe best.)

Related

How do I translate an optimized x86-64 asm loop back to a C for loop?

I have the following:
foo:
movl $0, %eax //result = 0
cmpq %rsi, %rdi // rdi = x, rsi = y?
jle .L2
.L3:
addq %rdi, %rax //result = result + i?
subq $1, %rdi //decrement?
cmp %rdi, rsi
jl .L3
.L2
rep
ret
And I'm trying to translate it to:
long foo(long x, long y)
{
long i, result = 0;
for (i= ; ; ){
//??
}
return result;
}
I don't know what cmpq %rsi, %rdi mean.
Why isn't there another &eax for long i?
I would love some help in figuring this out. I don't know what I'm missing - I been going through my notes, textbook, and rest of the internet and I am stuck. It's a review question, and I've been at it for hours.

Assuming this is a function taking 2 parameters. Assuming this is using the gcc amd64 calling convention, it will pass the two parameters in rdi and rsi. In your C function you call these x and y.
long foo(long x /*rdi*/, long y /*rsi*/)
{
//movl $0, %eax
long result = 0; /* rax */
//cmpq %rsi, %rdi
//jle .L2
if (x > y) {
do {
//addq %rdi, %rax
result += x;
//subq $1, %rdi
--x;
//cmp %rdi, rsi
//jl .L3
} while (x > y);
}
return result;
}

I don't know what cmpq %rsi, %rdi mean
That's AT&T syntax for cmp rdi, rsi. https://www.felixcloutier.com/x86/CMP.html
You can look up the details of what a single instruction does in an ISA manual.
More importantly, cmp/jcc like cmp %rsi,%rdi/jl is like jump if rdi<rsi.
Assembly - JG/JNLE/JL/JNGE after CMP. If you go through all the details of how cmp sets flags, and which flags each jcc condition checks, you can verify that it's correct, but it's much easier to just use the semantic meaning of JL = Jump on Less-than (assuming flags were set by a cmp) to remember what they do.
(It's reversed because of AT&T syntax; jcc predicates have the right semantic meaning for Intel syntax. This is one of the major reasons I usually prefer Intel syntax, but you can get used to AT&T syntax.)
From the use of rdi and rsi as inputs (reading them without / before writing them), they're the arg-passing registers. So this is the x86-64 System V calling convention, where integer args are passed in RDI, RSI, RDX, RCX, R8, R9, then on the stack. (What are the calling conventions for UNIX & Linux system calls on i386 and x86-64 covers function calls as well as system calls). The other major x86-64 calling convention is Windows x64, which passes the first 2 args in RCX and RDX (if they're both integer types).
So yes, x=RDI and y=RSI. And yes, result=RAX. (writing to EAX zero-extends into RAX).
From the code structure (not storing/reloading every C variable to memory between statements), it's compiled with some level of optimization enabled, so the for() loop turned into a normal asm loop with the conditional branch at the bottom. Why are loops always compiled into "do...while" style (tail jump)? (#BrianWalker's answer shows the asm loop transliterated back to C, with no attempt to form it back into an idiomatic for loop.)
From the cmp/jcc ahead of the loop, we can tell that the compiler can't prove the loop runs a non-zero number of iterations. So whatever the for() loop condition is, it might be false the first time. (That's unsurprising given signed integers.)
Since we don't see a separate register being used for i, we can conclude that optimization reused another var's register for i. Like probably for(i=x;, and then with the original value of x being unused for the rest of the function, it's "dead" and the compiler can just use RDI as i, destroying the original value of x.
I guessed i=x instead of y because RDI is the arg register that's modified inside the loop. We expect that the C source modifies i and result inside the loop, and presumably doesn't modify it's input variables x and y. It would make no sense to do i=y and then do stuff like x--, although that would be another valid way of decompiling.
cmp %rdi, %rsi / jl .L3 means the loop condition to (re)enter the loop is rsi-rdi < 0 (signed), or i<y.
The cmp/jcc before the loop is checking the opposite condition; notice that the operands are reversed and it's checking jle, i.e. jng. So that makes sense, it really is same loop condition peeled out of the loop and implemented differently. Thus it's compatible with the C source being a plain for() loop with one condition.
sub $1, %rdi is obviously i-- or --i. We can do that inside the for(), or at the bottom of the loop body. The simplest and most idiomatic place to put it is in the 3rd section of the for(;;) statement.
addq %rdi, %rax is obviously adding i to result. We already know what RDI and RAX are in this function.
Putting the pieces together, we arrive at:
long foo(long x, long y)
{
long i, result = 0;
for (i= x ; i>y ; i-- ){
result += i;
}
return result;
}
Which compiler made this code?
From the .L3: label names, this looks like output from gcc. (Which somehow got corrupted, removing the : from .L2, and more importantly removing the % from %rsi in one cmp. Make sure you copy/paste code into SO questions to avoid this.)
So it may be possible with the right gcc version/options to get exactly this asm back out for some C input. It's probably gcc -O1, because movl $0, %eax rules out -O2 and higher (where GCC would look for the xor %eax,%eax peephole optimization for zeroing a register efficiently). But it's not -O0 because that would be storing/reloading the loop counter to memory. And -Og (optimize a bit, for debugging) likes to use a jmp to the loop condition instead of a separate cmp/jcc to skip the loop. This level of detail is basically irrelevant for simply decompiling to C that does the same thing.
The rep ret is another sign of gcc; gcc7 and earlier used this in their default tune=generic output for ret that's reached as a branch target or a fall-through from a jcc, because of AMD K8/K10 branch prediction. What does `rep ret` mean?
gcc8 and later will still use it with -mtune=k8 or -mtune=barcelona. But we can rule that out because that tuning option would use dec %rdi instead of subq $1, %rdi. (Only a few modern CPUs have any problems with inc/dec leaving CF unmodified, for register operands. INC instruction vs ADD 1: Does it matter?)
gcc4.8 and later put rep ret on the same line. gcc4.7 and earlier print it as you've shown, with the rep prefix on the line before.
gcc4.7 and later like to put the initial branch before the mov $0, %eax, which looks like a missed optimization. It means they need a separate return 0 path out of the function, which contains another mov $0, %eax.
gcc4.6.4 -O1 reproduces your output exactly, for the source shown above, on the Godbolt compiler explorer
# compiled with gcc4.6.4 -O1 -fverbose-asm
foo:
movl $0, %eax #, result
cmpq %rsi, %rdi # y, x
jle .L2 #,
.L3:
addq %rdi, %rax # i, result
subq $1, %rdi #, i
cmpq %rdi, %rsi # i, y
jl .L3 #,
.L2:
rep
ret
So does this other version which uses i=y. Of course there are many things we could add that would optimize away, like maybe i=y+1 and then having a loop condition like x>--i. (Signed overflow is undefined behaviour in C, so the compiler can assume it doesn't happen.)
// also the same asm output, using i=y but modifying x in the loop.
long foo2(long x, long y) {
long i, result = 0;
for (i= y ; x>i ; x-- ){
result += x;
}
return result;
}
In practice the way I actually reversed this:
I copy/pasted the C template into Godbolt (https://godbolt.org/). I could see right away (from the mov $0 instead of xor-zero, and from the label names) that it looked like gcc -O1 output, so I put in that command line option and picked an old-ish version of gcc like gcc6. (Turns out this asm was actually from a much older gcc).
I tried an initial guess like x<y based on the cmp/jcc, and i++ (before I'd actually read the rest of the asm carefully at all), because for loops often use i++. The trivial-looking infinite-loop asm output showed me that was obviously wrong :P
I guessed that i=x, but after taking a wrong turn with a version that did result += x but i--, I realized that i was a distraction and at first simplified by not using i at all. I just used x-- while first reversing it because obviously RDI=x. (I know the x86-64 System V calling convention well enough to see that instantly.)
After looking at the loop body, the result += x and x-- were totally obvious from the add and sub instructions.
cmp/jl was obviously a something < something loop condition involving the 2 input vars.
I wasn't sure I if it was x<y or y<x, and newer gcc versions were using jne as the loop condition. I think at that point I cheated and looked at Brian's answer to check it really was x > y, instead of taking a minute to work through the actual logic. But once I had figured out it was x--, only x>y made sense. The other one would be true until wraparound if it entered the loop at all, but signed overflow is undefined behaviour in C.
Then I looked at some older gcc versions to see if any made asm more like in the question.
Then I went back and replaced x with i inside the loop.
If this seems kind of haphazard and slapdash, that's because this loop is so tiny that I didn't expect to have any trouble figuring it out, and I was more interested in finding source + gcc version that exactly reproduced it, rather than the original problem of just reversing it at all.
(I'm not saying beginners should find it that easy, I'm just documenting my thought process in case anyone's curious.)

How is it possible that BITWISE AND operation to take more CPU clocks than ARITHMETIC ADDITION operation in a C program?

I wanted to test if bitwise operations really are faster to execute than arithmetic operation. I thought they were.
I wrote a small C program to test this hypothesis and to my surprise the addition takes less on average than bitwise AND operation. This is surprising to me and I cannot understand why this is happening.
From what I know for addition the carry from the less significant bits should be carried to the next bits because the result depends on the carry too. It does not make sense to me that a logic operator is slower than addition.
My cod is below:
#include<stdio.h>
#include<time.h>
int main()
{
int x=10;
int y=25;
int z=x+y;
printf("Sum of x+y = %i", z);
time_t start = clock();
for(int i=0;i<100000;i++)z=x+y;
time_t stop = clock();
printf("\n\nArithmetic instructions take: %d",stop-start);
start = clock();
for(int i=0;i<100000;i++)z=x&y;
stop = clock();
printf("\n\nLogic instructions take: %d",stop-start);
}
Some of the results:
Arithmetic instructions take: 327
Logic instructions take: 360
Arithmetic instructions take: 271
Logic instructions take: 271
Arithmetic instructions take: 287
Logic instructions take: 294
Arithmetic instructions take: 279
Logic instructions take: 266
Arithmetic instructions take: 265
Logic instructions take: 296
These results are taken from consecutive runs of the program.
As you can see on average the logic operator takes longer on average than the arithmetic operator.

ok, let's take this "measuring" and blow it up, 100k is a bit few
#include<stdio.h>
#include<time.h>
#define limit 10000000000
int main()
{
int x=10, y=25, z;
time_t start = clock();
for(long long i=0;i<limit;i++)z=x+y;
time_t stop = clock();
printf("Arithmetic instructions take: %ld\n",stop-start);
start = clock();
for(long long i=0;i<limit;i++)z=x&y;
stop = clock();
printf("Logic instructions take: %ld\n",stop-start);
}
this will run a bit longer. First let's try with no optimization:
thomas#TS-VB:~/src$ g++ -o trash trash.c
thomas#TS-VB:~/src$ ./trash
Arithmetic instructions take: 21910636
Logic instructions take: 21890332
you see, both loops take about the same time.
compiling with -S reveals why (only relevant part of .s file shown here) :
// this is the assembly for the first loop
.L3:
movl 32(%esp), %eax
movl 28(%esp), %edx
addl %edx, %eax // <<-- ADD
movl %eax, 40(%esp)
addl $1, 48(%esp)
adcl $0, 52(%esp)
.L2:
cmpl $2, 52(%esp)
jl .L3
cmpl $2, 52(%esp)
jg .L9
cmpl $1410065407, 48(%esp)
jbe .L3
// this is the one for the second
.L9:
movl 32(%esp), %eax
movl 28(%esp), %edx
andl %edx, %eax // <<--- AND
movl %eax, 40(%esp)
addl $1, 56(%esp)
adcl $0, 60(%esp)
.L5:
cmpl $2, 60(%esp)
jl .L6
cmpl $2, 60(%esp)
jg .L10
cmpl $1410065407, 56(%esp)
jbe .L6
.L10:
looking into the CPUs instruction set tells us, both ADD and AND will take the same amount of cycles --> the 2 loops will run the same amount of time
Now with optimization:
thomas#TS-VB:~/src$ g++ -O3 -o trash trash.c
thomas#TS-VB:~/src$ ./trash
Arithmetic instructions take: 112
Logic instructions take: 74
The loop has been optimized away. The value calculated is never needed, so the compiler decided to not run it at all
conclusion: If you shoot 3 times into the forest, and hit 2 boars and 1 rabbit, it doesn't mean there a twice as much boars than rabbits inside

Let's start by looking at your code. The loops don't actually do anything. Any reasonable compiler will see that you don't use the variable z after the first call to printf, so it is perfectly safe to throw away. Of course, the compiler doesn't have to do it, but any reasonable compiler with any reasonable levels of optimisations enabled will do so.
Let's look at what a compiler did to your code (standard clang with -O2 optimisation level):
leaq L_.str(%rip), %rdi
movl $35, %esi
xorl %eax, %eax
callq _printf
This is the first printf ("Sum of ..."), notice how the generated code didn't actually add anything, the compiler knows the values of x and y and just calculated its sum and calls printf with 35.
callq _clock
movq %rax, %rbx
callq _clock
Call clock, save its result in a temporary register, call clock again,
movq %rax, %rcx
subq %rbx, %rcx
leaq L_.str.1(%rip), %rdi
xorl %eax, %eax
movq %rcx, %rsi
Subtract start from end, set up arguments for printf,
callq _printf
Call printf.
The second loop is removed in an equivalent way. There are no loops because the compiler does the reasonable thing - it notices that z is not used after you modify it in the loop, so the compiler throws away any stores to it. And then since nothing is stored into it, it can also throw away x+y. And now, since the body of the loop is not doing anything, the loop can be thrown away. So your code essentially becomes:
printf("\n\nArithmetic instructions take: %d", clock() - clock());
Now, why is this relevant. It is relevant to understand some important concepts. The compiler doesn't slavishly translate one statement at a time into code. The compiler reads all (or as much as possible) of your code, tries to figure out what you actually mean and then generates code that behaves "as if" it executed all those statements. The language and compiler only care about preserving what we could call observable side effects. If computing a value is not observable, it doesn't need to be computed. The time to execute some code is not a side effect we care about, so the compiler doesn't care about preserving it, after all we want our code to be as fast as possible, so we'd like the time to execute something to be not observable at all.
The second part why this is relevant. It is pretty useless to measure how long something takes if you compiled it without optimisations. This is the loop in your code compiled without optimisations:
LBB0_1:
cmpl $100000, -28(%rbp)
jge LBB0_4
movl -8(%rbp), %eax
addl -12(%rbp), %eax
movl %eax, -16(%rbp)
movl -28(%rbp), %eax
addl $1, %eax
movl %eax, -28(%rbp)
jmp LBB0_1
LBB0_4:
You thought you were measuring the addl instruction here. But the whole loop contains so much more. In fact, most of the time in the loop is spent maintaining the loop, not actually executing your instruction. Most of the time is spent reading and writing values on the stack and calculating the loop variable. Any timing you happen to measure will be totally dominated by the infrastructure of the loop rather than operation you want to measure.
You loop very few times. I'm pretty certain that most of the time you actually measure is spent in clock() and not the code you're actually trying to measure. clock needs to do quite some work, reading time is often quite expensive.
Then we get to the question of the actual instructions you care about. They take the exact same amount of time. Here's the canonical source of everything related to instruction timing on x86.
But. It's very hard and almost pretty useless to reason about individual instructions. Pretty much every CPU in the past few decades is superscalar. This means that it will execute many instructions at the same time. What matters for how much time things take is more dependencies between instructions (can't start executing an instruction before its inputs are ready if those inputs are computed by previous instructions) rather than the actual instruction. While you can do dozens of calculations in registers per nanosecond, it can take hundreds of nanoseconds to fetch data from main memory. So even if we know that an instruction takes one cycle and your CPU does two cycles per nanosecond (it's usually around that), it can mean that the number of instructions we can finish in 100ns can be anywhere between 1 (if you need to wait for main memory) and 12800 (I don't know the real exact numbers, but I remember that Skylake can retire 64 floating point operations per cycle).
This is why microbenchmarks aren't seriously done anymore. If minor variations in how things are done can affect the outcome by a factor of twelve thousand, you can quickly see why measuring individual instructions is useless. Most measurements today are done on large parts of programs or whole programs. I do this a lot at work and I've had several situations where improving an algorithm changed memory access patterns that even though the algorithm could be mathematically proven faster, the behavior of the whole program suffered because of changed memory access patterns or such.
Sorry for such a rambling answer, but I'm trying to get to why even though there is a simple answer to your question: "your measurement method is bad" and also a real answer: "they are the same", there are actually interesting reasons why the question itself is unanswerable.

This is just a few minutes worth of work, would rather also demonstrate bare metal and other such things but not worth the time right now.
Through testing of some functions to see what the calling convention is and also noting that for add it is generating
400600: 8d 04 37 lea (%rdi,%rsi,1),%eax
400603: c3 retq
for and
400610: 89 f8 mov %edi,%eax
400612: 21 f0 and %esi,%eax
400614: c3 retq
three instructions instead of two, five bytes instead of four, these bits if information both do and dont matter depending. But to make it more fair will do the same thing for each operation.
Also want the do it a zillion times loop closely coupled and not compiled as that may end up creating some variation. And lastly the alignment try to make that fair.
.balign 32
nop
.balign 256
.globl and_test
and_test:
mov %edi,%eax
and %esi,%eax
sub $1,%edx
jne and_test
retq
.balign 32
nop
.balign 256
.globl add_test
add_test:
mov %edi,%eax
add %esi,%eax
sub $1,%edx
jne add_test
retq
.balign 256
nop
derived from yours
#include<stdio.h>
#include<time.h>
unsigned int add_test ( unsigned int a, unsigned int b, unsigned int x );
unsigned int and_test ( unsigned int a, unsigned int b, unsigned int x );
int main()
{
int x=10;
int y=25;
time_t start,stop;
for(int j=0;j<10;j++)
{
start = clock();
add_test(10,25,2000000000);
stop = clock();
printf("%u %u\n",j,(int)(stop-start));
}
for(int j=0;j<10;j++)
{
start = clock();
and_test(10,25,2000000000);
stop = clock();
printf("%u %u\n",j,(int)(stop-start));
}
return(0);
}
first run as expected the first loop took longer as it wasnt in the cache? should not have taken that much longer so that doesnt make sense, perhaps other reasons...
0 605678
1 520204
2 521311
3 520050
4 521455
5 520213
6 520315
7 520197
8 520253
9 519743
0 520475
1 520221
2 520109
3 520319
4 521128
5 520974
6 520584
7 520875
8 519944
9 521062
but we stay fairly consistent. second run, the times stay somewhat consistent.
0 599558
1 515120
2 516035
3 515863
4 515809
5 516069
6 516578
7 516359
8 516170
9 515986
0 516403
1 516666
2 516842
3 516710
4 516932
5 516380
6 517392
7 515999
8 516861
9 517047
note that this is 2 billion loops. four instructions per. my clocks per second is 1000000 at 3.4ghz 0.8772 clocks per loop or 0.2193 clocks per instruction, how is that possible? superscaler processor.
A LOT more work could be done, here, this was just a few minutes worth and hopefully it is just enough to demonstrate (as others have already as well) that you cant really see the difference with a test like this.
I could do a demo with something more linear like an arm and something we could read the clock/timer register as part of the code under test, as the calling of the clock code are all part of the code under test and can vary here. Hopefully that is not necessary, the results are much more consistent though using sram, controlling all the instructions under test, etc. and with that you can see alignment differences you can see the cost of the cache read on the first loop but not the remaining, etc...(a few clocks total although 10ms as we see here, hmm, might be on par for an x86 system dont know benchmarking x86 is a near complete waste of time, no fun in it and the results dont translate to other x86 computers that well)
As pointed out in your other question that was closed as a duplicate, and I hate using links here, should learn how to cut and paste pictures (TODO).
https://en.wikipedia.org/wiki/AND_gate
https://en.wikipedia.org/wiki/Adder_(electronics)
Assuming feeding the math/logic operation for add and and is the same and we are only trying to measure the difference between them you are correct the AND is faster not getting into further detail you can see an and only has one stage/gate. Where a full adder takes three levels, back of the envelope math, three times as much time to settle the signals once the inputs change than the AND....BUT....Although there are some exceptions, chips are not designed to take advantage of this (well multiply and divide vs add/and/xor/etc, yes they are or are more likely to be). One would design these simple operations to be one clock cycle, on a clock the inputs to the combinational logic (the actual AND or ADD) are latched, on the next clock cycle the result is latched from the other end and begins its journey to the register file or out the core to memory, etc...At some point in the design you do synthesis into the gates available for the foundry/process you are using then do a timing analysis/closure on that and look for long poles in the tent. Extremely unlikely (impossible) that the add is a long pole, both add and and are very very short poles, but you determine at that point what your max clock rate is, if you wanted a 4ghz procerssor but the result is 2.7 well you need to take those long poles and turn them into two or more clock operations. the time it takes to do an add vs and which should vary add should be longer, is so fast and in the noise that it is all within a clock cycle so even if you did a functional simulation of the logic design you would not see the difference you need to implement an and and a full adder in say pspice using transistors and other components, then feed step changes into the inputs and watch how long it takes to settle, that or build them from discrete components from radio shack and try, although the results might be too fast for your scope, so use pspice or other.
think of writing equations to solve something you can write maybe a long equation or you can break that into multiple smaller ones with intermediate variables
this
a = b+c+d+e+f+g;
vs
x=b+c;
y=d+e;
z=f+g;
a=x+y;
a=a+z;
one clock vs 5 clocks, but each of the 5 clocks can be faster if this was the longest pole in the tent. all the other logic is that much faster too feeding this. (actually x,y,z can be one clock, then either a=x+y+z in the next or make it two more)
multiply and divide are different simply because the logic explodes exponentially, there is no magic to multiply or divide they have to work the same way we do things on pencil and paper. you can take shortcuts with binary if you think about it. as you can only multiply by 0 or 1 before shifing and adding to the accumulator. the logic equations for one clock still explode exponentially and then you can do parallel things. it burns a ton of chip realestate, so you may choose to make multiply as well as divide take more than one clock and hide those in the pipeline. or you can choose to burn a significant amount of chip realestate...Look at the documentation for some of the arm cores you can at compile time (when you compile/synthesize the core) choose a one clock or multi-clock multiply to balance chip size vs performance. x86 we dont buy the IP and make the chips ourselves so it is up to intel how they do it, and very likely microcoded so by just microcoding you can tweak how things happen or do it in an alu type operation.
So you might detect multiply or divide performance vs add/and with a test like this but either they did it in one clock and you cant ever see it, or they may have buried it in two or more steps in the pipe so that it averages out right and to see it you would need access to a chip sim. using timers and running something a billion times is fun, but to actually see instruction performance you need a chip sim and need to tweak the code to keep the code not under test from affecting the results.

Assembly language - How it works

I am really new at learning assembly language and just started digging in to it so I was wondering if maybe some of you guys could help me figure one problem out. I have a homework assignment which tells me to compare assembly language instructions to c code and tell me which c code is equivalent to the assembly instructions. So here is the assembly instructions:
pushl %ebp // What i think is happening here is that we are creating more space for the function.
movl %esp,%ebp // Here i think we are moving the stack pointer to the old base pointer.
movl 8(%ebp),%edx // Here we are taking parameter int a and storing it in %edx
movl 12(%ebp),%eax // Here we are taking parameter int b and storing it in %eax
cmpl %eax,%edx // Here i think we are comparing int a and b ( b > a ) ?
jge .L3 // Jump to .L3 if b is greater than a - else continue the instructions
movl %edx,%eax // If the term is not met here it will return b
.L3:
movl %ebp,%esp // Starting to finish the function
popl %ebp // Putting the base pointer in the right place
ret // return
I am trying to comment it out based on my understanding of this - but I might be totally wrong about this. The options for C functions which one of are suppose to be equivalent to are:
int fun1(int a, int b)
{
unsigned ua = (unsigned) a;
if (ua < b)
return b;
else
return ua;
}
int fun2(int a, int b)
{
if (b < a)
return b;
else
return a;
}
int fun3(int a, int b)
{
if (a < b)
return a;
else
return b;
}
I think the correct answer is fun3 .. but I'm not quite sure.

First off, welcome to StackOverflow. Great place, really it is.
Now for starters, let me help you; a lot; a whole lot.
You have good comments that help both you and me and everyone else tremendously, but they are so ugly that reading them is painful.
Here's how to fix that: white space, lots of it, blank lines, and grouping the instructions into small groups that are related to each other.
More to the point, after a conditional jump, insert one blank line, after an absolute jump, insert two blank lines. (Old tricks, work great for readability)
Secondly, line up the comments so that they are neatly arranged. It looks a thousand times better.
Here's your stuff, with 90 seconds of text arranging by me. Believe me, the professionals will respect you a thousand times better with this kind of source code...
pushl %ebp // What i think is happening here is that we are creating more space for the function.
movl %esp,%ebp // Here i think we are moving the stack pointer to the old base pointer.
movl 8(%ebp),%edx // Here we are taking parameter int a and storing it in %edx
movl 12(%ebp),%eax // Here we are taking parameter int b and storing it in %eax
cmpl %eax,%edx // Here i think we are comparing int a and b ( b > a ) ?
// No, Think like this: "What is the value of edx with respect to the value of eax ?"
jge .L3 // edx is greater, so return the value in eax as it is
movl %edx,%eax // If the term is not met here it will return b
// (pssst, I think you're wrong; think it through again)
.L3:
movl %ebp,%esp // Starting to finish the function
popl %ebp // Putting the base pointer in the right place
ret // return
Now, back to your problem at hand. What he's getting at is the "sense" of the compare instruction and the related JGE instruction.
Here's the confuse-o-matic stuff you need to comprehend to survive these sorts of "academic experiences"
This biz, the cmpl %eax,%edx instruction, is one of the forms of the "compare" instructions
Try to form an idea something like this when you see that syntax, "...What is the value of the destination operand with respect to the source operand ?..."
Caveat: I am absolutely no good with the AT&T syntax, so anybody is welcome to correct me on this.
Anyway, in this specific case, you can phrase the idea in your mind like this...
"...I see cmpl %eax,%edx so I think: With respect to eax, the value in edx is..."
You then complete that sentence in your mind with the "sense" of the next instruction which is a conditional jump.
The paradigmatic process in the human brain works out to form a sentence like this...
"...With respect to eax, the value in edx is greater or equal, so I jump..."
So, if you are correct about the locations of a and b, then you can do the paradigmatic brain scrambler and get something like this...
"...With respect to the value in b, that value in a is greater or equal, so I will jump..."
To get a grasp of this, take note that JGE is the "opposite sense" if you will, of JL (i.e., "Jump if less than")
Okay, now it so happens that return in C is related to the ret instruction in assembly language, but it isn't the same thing.
When C programmers say "...That function returns an int..." what they mean is...
The assembly language subroutine will place a value in Eax
The subroutine will then fix the stack and put it back in neat order
The subroutine will then execute its Ret instruction
One more item of obfuscation is thrown in your face now.
These following conditional jumps are applicable to Signed arithmetic comparison operations...
JG
JGE
JNG
JL
JLE
JNL
There it is ! The trap waiting to screw you up in all this !
Do you want to do signed or unsigned compares ???
By the way, I've never seen anybody do anything like that first function where an unsigned number is compared with a signed number. Is that even legal ?
So anyway, we put all these facts together, and we get: This assembly language routine returns the value in a if it is less than the value in b otherwise it returns the value in b.
These values are evaluated as signed integers.
(I think I got that right; somebody check my logic. I really don't like that assembler's syntax at all.)
So anyway, I am reasonably certain that you don't want to ask people on the internet to provide you with the specific answer to your specific homework question, so I'll leave it up to you to figure it out from this explanation.
Hopefully, I have explained enough of the logic and the "sense" of comparisons and the signed and unsigned biz so that you can get your brain around this.
Oh, and disclaimer again, I always use the Intel syntax (e.g., Masm, Tasm, Nasm, whatever) so if I got something backwards here, feel free to correct it for me.

Which is the most efficient way in C to check that at least one of two integers is zero?

I have code that is doing a lot of these comparison operations. I was wondering which is the most efficient one to use. Is it likely that the compiler will correct it if I intentionally choose the "wrong" one?
int a, b;
// Assign a value to a and b.
// Now check whether either is zero.
// The worst?
if (a * b == 0) // ...
// The best?
if (a & b == 0) // ...
// The most obvious?
if (a == 0 || b == 0) // ...
Other ideas?

In general, if there's a fast way of doing a simple thing, you can assume the compiler will do it that fast way. And remember that the compiler is outputting machine language, not C -- the fastest method probably can't be correctly represented as a set of C constructs.
Also, the third method there is the only one that always works. The first one fails if a and b are 1<<16, and the second you already know doesn't work.

It's possible to see which variant generates fewer assembly instructions, but it's a separate matter to see which one actually executes in less time.
To help you analyze the first matter, learn to use your C compiler's command-line flags to capture its intermediate output. GCC is a common choice for a C compiler. Let's look at its unoptimized assembly code for two different programs.
#include <stdio.h>
void report_either_zero()
{
int a = 1;
int b = 0;
if (a == 0 || b == 0)
{
puts("One of them is zero.");
}
}
Save that text to a file such as zero-test.c, and run the following command:
gcc -S zero-test.c
GCC will emit a file called zero-test.s, which is the assembly code it would normally submit to the assembler as it generates object code.
Let's look at the relevant fragment of the assembly code. I'm using gcc version 4.2.1 on Mac OS X generating x86 64-bit instructions.
_report_either_zero:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
subq $32, %rsp
Ltmp2:
movl %edi, -4(%rbp)
movq %rsi, -16(%rbp)
movl $1, -20(%rbp) // a = 1
movl $0, -24(%rbp) // b = 0
movl -24(%rbp), %eax // Get ready to compare a.
cmpl $0, %eax // Does zero equal a?
je LBB1_2 // If so, go to label LBB1_2.
movl -24(%rbp), %eax // Otherwise, get ready to compare b.
cmpl $0, %eax // Does zero equal b?
jne LBB1_3 // If not, go to label LBB1_3.
LBB1_2:
leaq L_.str(%rip), %rax
movq %rax, %rdi
callq _puts // Otherwise, write the string to standard output.
LBB1_3:
addq $32, %rsp
popq %rbp
ret
Leh_func_end1:
You can see where the we load the integer values 1 and 0 into registers, then prepare to compare the first to zero, and then again with the second if the first is nonzero.
Now let's try a different approach with the comparison, to see how the assembly code changes. Note that this is not the same predicate; this one checks whether both numbers are zero.
#include <stdio.h>
void report_both_zero()
{
int a = 1;
int b = 0;
if (!(a | b))
{
puts("Both of them are zero.");
}
}
The assembly code is a little different:
_report_both_zero:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
subq $16, %rsp
Ltmp2:
movl $1, -4(%rbp) // a = 1
movl $0, -8(%rbp) // b = 0
movl -4(%rbp), %eax // Get ready to operate on a.
movl -8(%rbp), %ecx // Get ready to operate on b too.
orl %ecx, %eax // Combine a and b via bitwise OR.
cmpl $0, %eax // Does zero equal the result?
jne LBB1_2 // If not, go to label LBB1_2.
leaq L_.str(%rip), %rax
movq %rax, %rdi
callq _puts // Otherwise, write the string to standard output.
LBB1_2:
addq $16, %rsp
popq %rbp
ret
Leh_func_end1:
If the first number is zero, the first variant does less work—in terms of the number of assembly instructions involved—by avoiding a second register move. If the first number is not zero, the second variant does less work by avoiding a second comparison to zero.
The question now is whether "move, move, bitwise or, compare" runs faster that "move, compare, move, compare." The answer could come down to things like whether the processor learns to predict how often the first integer is zero, and whether it is or not consistently.
If you ask the compiler to optimize this code, the example is too simple; the compiler decides at compile time that no comparison is necessary, and just condenses that code to an unconditional request to write the string. It's interesting to change the code to operate on parameters rather than constants, and see how the optimizer handles the situation differently.
Variant one:
#include <stdio.h>
void report_either_zero(int a, int b)
{
if (a == 0 || b == 0)
{
puts("One of them is zero.");
}
}
Variant two (again, a different predicate):
#include <stdio.h>
void report_both_zero(int a, int b)
{
if (!(a | b))
{
puts("Both of them are zero.");
}
}
Generate the optimized assembly code with this command:
gcc -O -S zero-test.c
Let us know what you find.

This will not likely have much (if any, given modern compiler optimizers) effect on the overall performance of your app. If you really must know, you should write some code to test the performance of each for your compiler. However, as a best guess, I'd say...
if ( !( a && b ) )
This will short-circuit if the first happens to be 0.

If you want to find whether or not one of two integers is zero using one comparison instruction ...
if ((a << b) == a)
If a is zero, then no amount of shifting it to the left will change its value.
If b is zero, then there is no shifting performed.
It is possible (I am too lazy to check) that there is some undefined behaviour should b be negative or really large.
However, due to the non-intuitiveness, it would be strongly recommended to implement this as a macro (with an appropriate comment).
Hope this helps.

The most efficient is certainly the most obvious, if by efficiency, you are measuring the programmer's time.
If by measuring efficiency using the processor's time, profiling your candidate solution is the best way to answer - for the target machine you profiled.
But this exercise demonstrated a pitfall of programmer optimization. The 3 candidates are not functionally equivalent for all int.
If you was a functional equivalent alternative...
I think the last candidate and a 4th one deserve comparison.
if ((a == 0) || (b == 0))
if ((a == 0) | (b == 0))
Due to the variation of compilers, optimization and CPU branch prediction, one should profile, rather than pontificate, to determine relative performance. OTOH, a good optimizing compiler may give you the same code for both.
I recommend the code that is easiest to maintain.

There's no "most efficient way to do it in C", if by "efficiency" you mean the efficiency of the compiled code.
Firstly, even if we assume that the compiler translates C language operator into their "obvious" machine counterparts (i.e. C multiplication into machine multiplication etc) the efficiency of each method will differ from one hardware platform to the other. Even if we restrict our consideration to a very specific sequence of instructions on a very specific hardware platform, it still can exhibit different performance in different surrounding contexts, depending, for example, on how well the whole thing agrees with the branch prediction heuristic in the given CPU.
Secondly, modern C compilers rarely translate C operators into their "obvious" machine counterparts. Often the instructions used in machine code will have very little in common with the C code. It is possible that many "completely different" methods of performing the check at C level will actually be translated into the same sequence of machine instructions by a smart compiler. At the same time the same C code might get translated into different sequences machine instructions when the surrounding contexts are different.
In other words, there's no meaningful answer to your question, unless you really really really localize it to a specific hardware platform, specific compiler version and specific set of compilation settings. And that will make it too localized to be useful.
That usually means that in general case the best way to do it is to write the most readable code. Just do
if (a == 0 || b == 0)
The readability of the code will not only help the human reader to understand it, but will also increase the probability of the compiler properly interpreting your intent and generating the most optimal code.
But if you really have to squeeze the last CPU cycle out of your performance-critical code, you have to try different versions and compare their relative efficiency manually.

Which operator is faster (> or >=), (< or <=)? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Is < cheaper (faster) than <=, and similarly, is > cheaper (faster) than >=?
Disclaimer: I know I could measure but that will be on my machine only and I am not sure if the answer could be "implementation specific" or something like that.

TL;DR
There appears to be little-to-no difference between the four operators, as they all perform in about the same time for me (may be different on different systems!). So, when in doubt, just use the operator that makes the most sense for the situation (especially when messing with C++).
So, without further ado, here is the long explanation:
Assuming integer comparison:
As far as assembly generated, the results are platform dependent. On my computer (Apple LLVM Compiler 4.0, x86_64), the results (generated assembly is as follows):
a < b (uses 'setl'):
movl $10, -8(%rbp)
movl $15, -12(%rbp)
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setl %cl
andb $1, %cl
movzbl %cl, %eax
popq %rbp
ret
a <= b (uses 'setle'):
movl $10, -8(%rbp)
movl $15, -12(%rbp)
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setle %cl
andb $1, %cl
movzbl %cl, %eax
popq %rbp
ret
a > b (uses 'setg'):
movl $10, -8(%rbp)
movl $15, -12(%rbp)
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setg %cl
andb $1, %cl
movzbl %cl, %eax
popq %rbp
ret
a >= b (uses 'setge'):
movl $10, -8(%rbp)
movl $15, -12(%rbp)
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setge %cl
andb $1, %cl
movzbl %cl, %eax
popq %rbp
ret
Which isn't really telling me much. So, we skip to a benchmark:
And ladies & gentlemen, the results are in, I created the following test program (I am aware that 'clock' isn't the best way to calculate results like this, but it'll have to do for now).
#include <time.h>
#include <stdio.h>
#define ITERS 100000000
int v = 0;
void testL()
{
clock_t start = clock();
v = 0;
for (int i = 0; i < ITERS; i++) {
v = i < v;
}
printf("%s: %lu\n", __FUNCTION__, clock() - start);
}
void testLE()
{
clock_t start = clock();
v = 0;
for (int i = 0; i < ITERS; i++)
{
v = i <= v;
}
printf("%s: %lu\n", __FUNCTION__, clock() - start);
}
void testG()
{
clock_t start = clock();
v = 0;
for (int i = 0; i < ITERS; i++) {
v = i > v;
}
printf("%s: %lu\n", __FUNCTION__, clock() - start);
}
void testGE()
{
clock_t start = clock();
v = 0;
for (int i = 0; i < ITERS; i++) {
v = i >= v;
}
printf("%s: %lu\n", __FUNCTION__, clock() - start);
}
int main()
{
testL();
testLE();
testG();
testGE();
}
Which, on my machine (compiled with -O0), gives me this (5 separate runs):
testL: 337848
testLE: 338237
testG: 337888
testGE: 337787
testL: 337768
testLE: 338110
testG: 337406
testGE: 337926
testL: 338958
testLE: 338948
testG: 337705
testGE: 337829
testL: 339805
testLE: 339634
testG: 337413
testGE: 337900
testL: 340490
testLE: 339030
testG: 337298
testGE: 337593
I would argue that the differences between these operators are minor at best, and don't hold much weight in a modern computing world.

it varies, first start at examining different instruction sets and how how the compilers use those instruction sets. Take the openrisc 32 for example, which is clearly mips inspired but does conditionals differently. For the or32 there are compare and set flag instructions, compare these two registers if less than or equal unsigned then set the flag, compare these two registers if equal set the flag. Then there are two conditional branch instructions branch on flag set and branch on flag clear. The compiler has to follow one of these paths, but less, than, less than or equal, greater than, etc are all going to use the same number of instructions, same execution time for a conditional branch and same execution time for not doing the conditional branch.
Now it is definitely going to be true for most architectures that performing the branch takes longer than not performing the branch because of having to flush and re-fill the pipe. Some do branch prediction, etc to help with that problem.
Now some architectures the size of the instruction may vary, compare gpr0 and gpr1 vs compare gpr0 and the immediate number 1234, may require a larger instruction, you will see this a lot with x86 for example. so although both cases may be a branch if less than how you encode the less depending on what registers happen to hold what values can make a performance difference (sure x86 does a lot of pipelining, lots of caching, etc to make up for these issues). Another similar example is mips and or32, where r0 is always a zero, it is not really a general purpose register, if you write to it it doesnt change, it is hardwired to a zero, so a compare if equal to 0 MIGHT cost you more than a compare if equal to some other number if an extra instruction or two is required to fill a gpr with that immediate so that the compare can happen, worst case is having to evict a register to the stack or memory, to free up the register to put the immediate in there so that the compare can happen.
Some architectures have conditional execution like arm, for the full arm (not thumb) instructions you can on a per instruction basis execute, so if you had code
if(i==7) j=5; else j=9;
the pseudo code for arm would be
cmp i,#7
moveq j,#5
movne j,#7
there is no actual branch, so no pipeline issues you flywheel right on through, very fast.
One architecture to another if that is an interesting comparison some as mentioned, mips, or32, you have to specifically perform some sort of instruction for the comparision, others like x86, msp430 and the vast majority each alu operation changes the flags, arm and the like change flags if you tell it to change flags otherwise dont as shown above. so a
while(--len)
{
//do something
}
loop the subtract of 1 also sets the flags, if the stuff in the loop was simple enough you could make the whole thing conditional, so you save on separate compare and branch instructions and you save in the pipeline penalty. Mips solves this a little by compare and branch are one instruction, and they execute one instruction after the branch to save a little in the pipe.
The general answer is that you will not see a difference, the number of instructions, execuition time, etc are the same for the various conditionals. special cases like small immediates vs big immediates, etc may have an effect for corner cases, or the compiler may simply choose to do it all differently depending on what comparison you do. If you try to re-write your algorithm to have it give the same answer but use a less than instead of a greater than and equal, you could be changing the code enough to get a different instruction stream. Likewise if you perform too simple of a performance test, the compiler can/will optimize out the comparison complete and just generate the results, which could vary depending on your test code causing different execution. The key to all of this is disassemble the things you want to compare and see how the instructions differ. That will tell you if you should expect to see any execution differences.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight