I know this title duplicates "Is this tail-recursion?", but my case is different and I could not get a clue from the previous question, so I have to ask again.
The code:
int sum_n(int n, int *sum)
{
return n && (*sum += n) && sum_n(n - 1, sum);
}
It is said (in a foreign-language source) that tail recursion has two features:
The function is self-called
Only constant stack space is needed
So are these two features the only key factors for judging tail recursion? And does the logical operator && in the return statement affect tail recursion or not?
Most of all, is the above code tail recursion?
As written, it's a bit iffy. Reason being, technically, the function should regain control in order to && the result to know what to return. (This is easily optimized away, though, and most compilers will probably do so.)
In order to make sure it's tail-recursive, simply avoid doing anything with the result other than returning it.
int sum_n(int n, int *sum)
{
if (!(n && (*sum += n))) return 0;
return sum_n(n - 1, sum);
}
This very function, I think, is not a tail recursion.
First, I agree that, this following form is tail recursion:
int sum_n(int n, int *sum)
{
int tmp = n && (*sum += n);
if (!tmp)
return 0;
else
return sum_n(n - 1, sum);
}
However the above function is not literally equivalent to your original one. The code you provide shall be equivalent to this:
int sum_n(int n, int *sum)
{
int tmp = n && (*sum += n);
if (!tmp)
return 0;
else
return sum_n(n - 1, sum) ? 1 : 0;
}
The difference is that the return value of sum_n(n - 1, sum) cannot be used directly as the return value of your function: it must first be normalized to 0 or 1 (as if converted from int to _Bool), because that is what && yields.
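A small sketch of that normalization may help. The helper names `logical_and` and `pass_through` are made up for illustration; the only fact relied on is that in C the result of `&&` is an int equal to 0 or 1, never the value of the right operand.

```c
/* Hypothetical helper, for illustration only: `a && b` evaluates to
 * exactly 0 or 1 (type int), not to the value of `b`. */
int logical_and(int a, int b)
{
    return a && b;
}

/* What a plain tail call would hand back instead: the callee's value,
 * passed through unchanged. */
int pass_through(int a, int b)
{
    return a ? b : 0;
}
```

With a = 5 and b = 7 the first helper yields 1 and the second yields 7, which is why the caller formally has work left to do after the recursive call returns.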
Setting aside the academic definition of tail recursion, clang 3.3 does compile sum() as a loop, not as recursion.
.file "tail.c"
.text
.globl sum
.align 16, 0x90
.type sum,@function
sum: # @sum
.cfi_startproc
# BB#0:
testl %edi, %edi
je .LBB0_6
# BB#1: # %.lr.ph
movl (%rsi), %eax
.align 16, 0x90
.LBB0_3: # =>This Inner Loop Header: Depth=1
addl %edi, %eax
je .LBB0_4
# BB#2: # %tailrecurse
# in Loop: Header=BB0_3 Depth=1
decl %edi
jne .LBB0_3
# BB#5: # %tailrecurse._crit_edge
movl %eax, (%rsi)
.LBB0_6:
xorl %eax, %eax
ret
.LBB0_4: # %split
movl $0, (%rsi)
xorl %eax, %eax
ret
.Ltmp0:
.size sum, .Ltmp0-sum
.cfi_endproc
.section ".note.GNU-stack","",@progbits
compiled with command:
$ clang -c tail.c -S -O2
clang version:
$ clang -v
clang version 3.3 (tags/RELEASE_33/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
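For readers who prefer C to assembly, here is a hand-written loop that, as far as I can tell, does what the clang output above does. The name `sum_n_loop` is hypothetical, and the sketch is a reading of the listing, not compiler output.

```c
/* The function from the question, unchanged. */
int sum_n(int n, int *sum)
{
    return n && (*sum += n) && sum_n(n - 1, sum);
}

/* Hand-written loop mirroring the generated code: keep adding n to *sum
 * until either n reaches 0 or the running sum becomes 0. Every path
 * returns 0, just as the recursive version does whenever it terminates. */
int sum_n_loop(int n, int *sum)
{
    while (n != 0) {
        *sum += n;
        if (*sum == 0)
            break;
        n--;
    }
    return 0;
}
```

Starting from *sum == 0, both versions leave 10 in *sum for n = 4 and return 0.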
EDIT 1: Added another example (showing that GCC is, in principle, capable of doing what I want to achieve) and some more discussion at the end of this question.
EDIT 2: Found the malloc function attribute, which should do what I want. Please take a look at the very end of the question.
This is a question about how to tell the compiler that stores to a memory area are not visible outside of a region (and thus could be optimized away). To illustrate what I mean, let's take a look at the following code
int f (int a)
{
int v[2];
v[0] = a;
v[1] = 0;
while (v[0]-- > 0)
v[1] += v[0];
return v[1];
}
gcc -O2 generates the following assembly code (x86-64 gcc, trunk, on https://godbolt.org):
f:
leal -1(%rdi), %edx
xorl %eax, %eax
testl %edi, %edi
jle .L4
.L3:
addl %edx, %eax
subl $1, %edx
cmpl $-1, %edx
jne .L3
ret
.L4:
ret
As one can see, the loads and stores into the array v are gone after optimization.
Now consider the following code:
int g (int a, int *v)
{
v[0] = a;
v[1] = 0;
while (v[0]-- > 0)
v[1] += v[0];
return v[1];
}
The difference is that v is not (stack-) allocated in the function, but provided as an argument. The result of gcc -O2 in this case is:
g:
leal -1(%rdi), %edx
movl $0, 4(%rsi)
xorl %eax, %eax
movl %edx, (%rsi)
testl %edi, %edi
jle .L4
.L3:
addl %edx, %eax
subl $1, %edx
cmpl $-1, %edx
jne .L3
movl %eax, 4(%rsi)
movl $-1, (%rsi)
ret
.L4:
ret
Clearly, the code has to store the final values of v[0] and v[1] in memory as they may be observable.
Now, what I am looking for is a way to tell the compiler that the memory pointed to by v in the second example isn't accessible any more after the function g has returned so that the compiler could optimize away the memory accesses.
To have an even simpler example:
void h (int *v)
{
v[0] = 0;
}
If the memory pointed to by v isn't accessible after h returns, it should be possible to simplify the function to a single ret.
I tried to achieve what I want by playing with the strict aliasing rules but haven't succeeded.
ADDED IN EDIT 1:
GCC seems to have the necessary code built-in as the following example shows:
#include <stdlib.h>
int h (int a)
{
int *v = malloc (2 * sizeof (int));
v[0] = a;
v[1] = 0;
while (v[0]-- > 0)
v[1] += v[0];
return v[1];
}
The generated code contains no loads and stores:
h:
leal -1(%rdi), %edx
xorl %eax, %eax
testl %edi, %edi
jle .L4
.L3:
addl %edx, %eax
subl $1, %edx
cmpl $-1, %edx
jne .L3
ret
.L4:
ret
In other words, GCC knows that changing the memory area pointed to by v is not observable through any side-effect of malloc. For purposes like this one, GCC has __builtin_malloc.
So I can also ask: How can user code (say a user version of malloc) make use of this functionality?
ADDED IN EDIT 2:
GCC has the following function attribute:
malloc
This tells the compiler that a function is malloc-like, i.e., that the pointer P returned by the function cannot alias any other pointer valid when the function returns, and moreover no pointers to valid objects occur in any storage addressed by P.
Using this attribute can improve optimization. Compiler predicts that a function with the attribute returns non-null in most cases. Functions like malloc and calloc have this property because they return a pointer to uninitialized or zeroed-out storage. However, functions like realloc do not have this property, as they can return a pointer to storage containing pointers.
It seems to do what I want as the following example shows:
__attribute__ (( malloc )) int *m (int *h);
int i (int a, int *h)
{
int *v = m (h);
v[0] = a;
v[1] = 0;
while (v[0]-- > 0)
v[1] += v[0];
return v[1];
}
The generated assembler code has no loads and stores:
i:
pushq %rbx
movl %edi, %ebx
movq %rsi, %rdi
call m
testl %ebx, %ebx
jle .L4
leal -1(%rbx), %edx
xorl %eax, %eax
.L3:
addl %edx, %eax
subl $1, %edx
cmpl $-1, %edx
jne .L3
popq %rbx
ret
.L4:
xorl %eax, %eax
popq %rbx
ret
However, as soon as the compiler sees a definition of m, it may forget about the attribute. For example, this is the case when the following definition is given:
__attribute__ (( malloc )) int *m (int *h)
{
return h;
}
In that case, the function is inlined and the compiler forgets about the attribute, yielding the same code as the function g.
P.S.: Initially, I thought that the restrict keyword may help, but it doesn't seem so.
EDIT: Discussion about the noinline attribute added at the end.
Using the following function definition, one can achieve the goal of my question:
__attribute__ (( malloc, noinline )) static void *get_restricted_ptr (void *p)
{
return p;
}
This function get_restricted_ptr simply returns its pointer argument but informs the compiler that the returned pointer P cannot alias any other pointer valid when the function returns, and moreover no pointers to valid objects occur in any storage addressed by P.
The use of this function is demonstrated here:
int i (int a, int *h)
{
int *v = get_restricted_ptr (h);
v[0] = a;
v[1] = 0;
while (v[0]-- > 0)
v[1] += v[0];
return v[1];
}
The generated code does not contain loads and stores:
i:
leal -1(%rdi), %edx
xorl %eax, %eax
testl %edi, %edi
jle .L6
.L5:
addl %edx, %eax
subl $1, %edx
cmpl $-1, %edx
jne .L5
ret
.L6:
ret
ADDED IN EDIT: If the noinline attribute is left out, GCC ignores the malloc attribute. Apparently, in this case, the function gets inlined first so that there is no function call any more for which GCC would check the malloc attribute. (One can discuss whether this behaviour should be considered a bug in GCC.) With the noinline attribute, the function doesn't get inlined. Then, due to the malloc attribute, GCC understands that the call to that function is unnecessary and removes it completely.
Unfortunately, this means that the (trivial) function won't be inlined when its call is not eliminated due to the malloc attribute.
Both functions have side effects, so the memory reads and stores cannot be optimized out:
void h (int *v)
{
v[0] = 0;
}
and
int g (int a, int *v)
{
v[0] = a;
v[1] = 0;
while (v[0]-- > 0)
v[1] += v[0];
return v[1];
}
The side effects have to be observable outside the function scope. Inlined functions may behave differently, because after inlining the side effects only have to be observable outside the enclosing code.
inline int g (int a, int *v)
{
v[0] = a;
v[1] = 0;
while (v[0]-- > 0)
v[1] += v[0];
return v[1];
}
void h(void)
{
int x[2],y ;
g(y,x);
}
This code will be optimized to just a simple return.
You can promise the compiler that the memory is accessed only through the given pointer, which allows easier optimizations, by using the restrict keyword. But of course your code must keep this promise.
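As a minimal sketch of what such a promise looks like (the function `accumulate` is a made-up example, not code from the question): with `restrict`, the compiler is allowed to assume that `*total` and the elements of `src` do not overlap, so it may keep the running total in a register. Whether a particular compiler also drops stores it considers dead is not guaranteed; the question reports that `restrict` did not help there.

```c
/* Hypothetical example: `restrict` promises that, inside this function,
 * the object behind `total` is accessed only through `total`, and the
 * array behind `src` only through `src`. */
void accumulate(int *restrict total, const int *restrict src, int n)
{
    for (int i = 0; i < n; i++)
        *total += src[i];   /* the compiler may keep *total in a register */
}
```

Calling it with overlapping arguments (for example `accumulate(&a[0], a, 3)`) would break the promise and is undefined behavior.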
For C, the only restriction is that the compiler has to ensure that the code behaves the same. If the compiler can prove that the code behaves the same then it can and will remove the stores.
For example, I put this into https://godbolt.org/ :
void h (int *v)
{
v[0] = 0;
}
void foo() {
int v[2] = {1, 2};
h(v);
}
And told it to use GCC 8.2 and "-O3", and got this output:
h(int*):
mov DWORD PTR [rdi], 0
ret
foo():
ret
Note that there are two different versions of the function h() in the output. The first version exists in case other code (in other object files) wants to use the function (and may be discarded by the linker). The second version of h() was inlined directly into foo() and then optimised down to absolutely nothing.
If you change the code to this:
static void h (int *v)
{
v[0] = 0;
}
void foo() {
int v[2] = {1, 2};
h(v);
}
Then it tells the compiler that the version of h() that only existed for linking with other object files isn't needed, so the compiler only generates the second version of h() and the output becomes this:
foo():
ret
Of course no compiler's optimizer is perfect: for more complex code (and for different compilers, including different versions of GCC) results might be different (the compiler may fail to do this optimization). This is purely a limitation of the compiler's optimizer and not a limitation of C itself.
For cases where the compiler's optimiser isn't good enough, there are 4 possible solutions:
get a better compiler
improve the compiler's optimiser (e.g. send an email to the compiler's developers that includes a minimal example and cross your fingers)
modify the code to make it easier for the compiler's optimiser (e.g. copy the input array into a local array, like "void h(int *v) { int temp[2]; temp[0] = v[0]; temp[1] = v[1]; ... }")
shrug and say "Oh, that's a pity" and do nothing
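The third option can be sketched for the function g from the question. The name `g_local` is hypothetical, and the rewrite is only valid under the assumption that the caller never reads v after the call; it simply stops touching caller-visible memory.

```c
/* Version from the question: every update goes through memory that the
 * caller can observe, so the final stores must be kept. */
int g(int a, int *v)
{
    v[0] = a;
    v[1] = 0;
    while (v[0]-- > 0)
        v[1] += v[0];
    return v[1];
}

/* Hypothetical rewrite: the same arithmetic on locals. Nothing here is
 * visible to the caller, so the compiler is free to keep both values in
 * registers. */
int g_local(int a)
{
    int counter = a;
    int total = 0;
    while (counter-- > 0)
        total += counter;
    return total;
}
```

Both return the same value for the same a; only the side effect on v differs.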
My task was to print all whole numbers from 2 to N whose binary representation contains more '1' bits than '0' bits.
int CountOnes(unsigned int x)
{
unsigned int iPassedNumber = x; // number to be modified
unsigned int iOriginalNumber = iPassedNumber;
unsigned int iNumbOfOnes = 0;
while (iPassedNumber > 0)
{
iPassedNumber = iPassedNumber >> 1 << 1; //if LSB was '1', it turns to '0'
if (iOriginalNumber - iPassedNumber == 1) //if difference == 1, then we increment numb of '1'
{
++iNumbOfOnes;
}
iOriginalNumber = iPassedNumber >> 1; //do this to operate with the next bit
iPassedNumber = iOriginalNumber;
}
return (iNumbOfOnes);
}
Here is my function to calculate the number of '1' bits in binary. It was my homework in college. However, my teacher said that it would be more efficient to use
{
if (n % 2 == 1)
++CountOnes;
else if (n % 2 == 0)
++CountZeros;
}
In the end, I just got confused and don't know which is better. What do you think about this?
I used gcc compiler for the experiment below. Your compiler may be different, so you may have to do things a bit differently to get a similar effect.
When trying to figure out the most optimized method for doing something, you want to see what kind of code the compiler produces. Look at the CPU's manual to see which operations are fast and which are slow on that particular architecture, although there are general guidelines too. And of course look for ways to reduce the number of instructions the CPU has to perform.
I decided to show you a few different methods (not exhaustive) and give you a sample of how to go about looking at optimization of small functions (like this one) manually. There are more sophisticated tools that help with larger and more complex functions, however this approach should work with pretty much anything:
Note
All assembly code was produced using:
gcc -O99 -o foo -fprofile-generate foo.c
followed by
gcc -O99 -o foo -fprofile-use foo.c
On -fprofile-generate
The double compile really lets gcc work (although -O99 most likely does that already); however, mileage may vary based on which version of gcc you are using.
On with it:
Method I (you)
Here is the disassembly of your function:
CountOnes_you:
.LFB20:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
je .L5
.p2align 4,,10
.p2align 3
.L4:
movl %edi, %edx
xorl %ecx, %ecx
andl $-2, %edx
subl %edx, %edi
cmpl $1, %edi
movl %edx, %edi
sete %cl
addl %ecx, %eax
shrl %edi
jne .L4
rep ret
.p2align 4,,10
.p2align 3
.L5:
rep ret
.cfi_endproc
At a glance
Approximately 9 instructions in a loop, until the loop exits
Method II (teacher)
Here is a function which uses your teacher's algo:
int CountOnes_teacher(unsigned int x)
{
unsigned int one_count = 0;
while(x) {
if(x%2)
++one_count;
x >>= 1;
}
return one_count;
}
Here's the disassembly of that:
CountOnes_teacher:
.LFB21:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
je .L12
.p2align 4,,10
.p2align 3
.L11:
movl %edi, %edx
andl $1, %edx
cmpl $1, %edx
sbbl $-1, %eax
shrl %edi
jne .L11
rep ret
.p2align 4,,10
.p2align 3
.L12:
rep ret
.cfi_endproc
At a glance:
5 instructions in a loop until the loop exits
Method III
Here is Kernighan's method:
int CountOnes_K(unsigned int x) {
unsigned int count;
for (count = 0; x; count++) {
x &= x - 1; // clear least sig bit
}
return count;
}
Here's the disassembly:
CountOnes_k:
.LFB22:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
je .L19
.p2align 4,,10
.p2align 3
.L18:
leal -1(%rdi), %edx
addl $1, %eax
andl %edx, %edi
jne .L18 ; loop is here
rep ret
.p2align 4,,10
.p2align 3
.L19:
rep ret
.cfi_endproc
At a glance
3 instructions in a loop.
Some commentary before continuing
As you can see the compiler doesn't really use the best way when you employ % to count (which was used by both you and your teacher).
Kernighan's method is pretty optimized (the fewest operations in the loop). It is instructive to compare Kernighan's method to the naive method of counting; while on the surface they may look the same, they're really not!
for (c = 0; v; v >>= 1)
{
c += v & 1;
}
This method sucks compared to Kernighan's. Here, if you have, say, only the 32nd bit set, this loop will run 32 times, whereas Kernighan's runs once per set bit, so just once!
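To make the difference in iteration counts concrete, here is a small sketch. Both helpers (the names are made up) return how many times the loop body runs, not the bit count, and the example assumes unsigned int is at least 32 bits wide.

```c
/* Naive loop: one iteration per bit position up to the highest set bit. */
unsigned iterations_naive(unsigned int v)
{
    unsigned iters = 0;
    for (; v; v >>= 1)
        iters++;
    return iters;
}

/* Kernighan's loop: v &= v - 1 clears the lowest set bit, so there is
 * exactly one iteration per set bit. */
unsigned iterations_kernighan(unsigned int v)
{
    unsigned iters = 0;
    for (; v; v &= v - 1)
        iters++;
    return iters;
}
```

For 0x80000000 the naive loop runs 32 times and Kernighan's runs once; for 0xFF both run 8 times.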
But all these methods are still rather sub-par because they loop.
If we combine a couple of other pieces of (implicit) knowledge into our algorithms, we can get rid of loops altogether. Those are the size of our number in bits and the size of a character in bits. With these pieces, and given that we have a 64-bit register, we can filter out bits in chunks of 14, 24 or 32 bits.
So for instance, if we look at a 14-bit number then we can simply count the bits by:
(n * 0x200040008001ULL & 0x111111111111111ULL) % 0xf;
This uses % but only once, and works for all numbers between 0x0 and 0x3fff.
For 24 bits we apply a similar trick to each 12-bit half:
((n & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f
+ (((n & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL)
% 0x1f;
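The two expressions above can be wrapped into functions and checked against a plain loop. The function names are hypothetical; the constants are the ones from the text, and the sketch assumes unsigned long long is 64 bits wide.

```c
/* Reference implementation: a plain shift-and-test loop. */
unsigned popcount_loop(unsigned int v)
{
    unsigned c = 0;
    for (; v; v >>= 1)
        c += v & 1;
    return c;
}

/* Valid only for v <= 0x3fff (14 bits). The multiply makes four shifted
 * copies of v, the mask keeps each input bit exactly once at a position
 * that is a multiple of 4, and % 0xf adds those bits up. */
unsigned popcount14(unsigned int v)
{
    return (unsigned)((v * 0x200040008001ULL & 0x111111111111111ULL) % 0xf);
}

/* Valid only for v <= 0xffffff (24 bits): the same idea applied to the
 * low and the high 12-bit half, with bits spaced 5 apart and % 0x1f. */
unsigned popcount24(unsigned int v)
{
    unsigned c;
    c  = (unsigned)(((v & 0xfff) * 0x1001001001001ULL
                     & 0x84210842108421ULL) % 0x1f);
    c += (unsigned)((((v & 0xfff000) >> 12) * 0x1001001001001ULL
                     & 0x84210842108421ULL) % 0x1f);
    return c;
}
```

Passing a value outside the stated range gives a wrong count, so a caller would have to mask or check the input first.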
But we can generalize this concept by noticing the pattern in the numbers above: the magic numbers are really just complements (look at the hex numbers closely: 0x8000 + 0x400 + 0x200 + 0x1), shifted.
We can generalize and then shrink the ideas here, giving us the most optimized method for counting bits (up to 128 bits) (no loops) O(1):
unsigned int CountOnes_best(unsigned int n) {
const unsigned char_bits = sizeof(unsigned char) << 3;
typedef __typeof__(n) T; // T is unsigned int in this case;
n = n - ((n >> 1) & (T)~(T)0/3); // reuse n as a temporary
n = (n & (T)~(T)0/15*3) + ((n >> 2) & (T)~(T)0/15*3);
n = (n + (n >> 4)) & (T)~(T)0/255*15;
return (T)(n * ((T)~(T)0/255)) >> (sizeof(T) - 1) * char_bits;
}
CountOnes_best:
.LFB23:
.cfi_startproc
movl %edi, %eax
shrl %eax
andl $1431655765, %eax
subl %eax, %edi
movl %edi, %edx
shrl $2, %edi
andl $858993459, %edx
andl $858993459, %edi
addl %edx, %edi
movl %edi, %ecx
shrl $4, %ecx
addl %edi, %ecx
andl $252645135, %ecx
imull $16843009, %ecx, %eax
shrl $24, %eax
ret
.cfi_endproc
This may be a bit of a jump (how the heck did we get from the previous methods to here?), but just take your time to go over it.
The most optimized method was first mentioned in the Software Optimization Guide for AMD Athlon™ 64 and Opteron™ Processors; my URL for it is broken. It is also well explained on the very excellent C bit twiddling page.
I highly recommend going over the content of that page; it really is a fantastic read.
Even better than your teacher's suggestion:
if( n & 1 ) {
++ CountOnes;
}
else {
++ CountZeros;
}
n % 2 has an implicit divide operation which the compiler is likely to optimise, but you should not rely on it - divide is a complex operation that takes longer on some platforms. Moreover there are only two options 1 or 0, so if it is not a one, it is a zero - there is no need for the second test in the else block.
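There is one more difference worth knowing, sketched below with made-up helper names. For the unsigned values in the question both tests agree, but for a signed int the remainder takes the sign of the dividend (since C99), so `n % 2 == 1` misses negative odd numbers. The bitwise version assumes a two's complement representation, which is what common platforms use.

```c
/* Remainder test: for negative odd n, n % 2 is -1, so this returns 0. */
int is_odd_mod(int n)
{
    return n % 2 == 1;
}

/* Bit test: looks only at the lowest bit (assumes two's complement). */
int is_odd_and(int n)
{
    return (n & 1) != 0;
}
```

For n = -3 the first helper reports "even" and the second reports "odd".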
Your original code is overcomplex and hard to follow. If you want to assess the "efficiency" of an algorithm, consider the number of operations performed per iteration, and the number of iterations. Also the number of variables involved. In your case there are 10 operations per iteration and three variables (but you omitted to count the zeros so you'd need four variables to complete the assignment). The following:
unsigned int n = x; // number to be modified
int ones = 0 ;
int zeroes = 0 ;
while( n > 0 )
{
if( (n & 1) != 0 )
{
++ones ;
}
else
{
++zeroes ;
}
n >>= 1 ;
}
has only 7 operations (counting >>= as two - shift and assign). More importantly perhaps, it is much easier to follow.
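Wrapped into a function that answers the actual assignment question, the loop could look like the sketch below. The name `has_more_ones` is an assumption, and zeros are counted only up to the highest set bit (no leading zeros), which is my reading of the task.

```c
/* Returns 1 if the binary representation of x, without leading zeros,
 * contains more 1 bits than 0 bits; returns 0 otherwise. */
int has_more_ones(unsigned int x)
{
    unsigned int n = x;
    int ones = 0;
    int zeroes = 0;

    while (n > 0) {
        if ((n & 1) != 0)
            ++ones;
        else
            ++zeroes;
        n >>= 1;   /* move on to the next bit */
    }
    return ones > zeroes;
}
```

The assignment then reduces to a loop from 2 to N that prints every i for which has_more_ones(i) is nonzero.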
Let's assume that I have a Boolean function that receives a random number as its argument and then returns True if the random number is 200, 201 or 202 and returns False for other values.
The question is which one of the following functions is more efficient?
f1:
bool f1(int number) {
if (number >= 200 && number <= 202)
return true;
return false;
}
f2:
bool f2(int number) {
if (number == 200 || number == 201 || number == 202)
return true;
return false;
}
The question is which one of the following functions is more efficient?
The C11 standard (read n1570) does not care about (or speak of) efficiency.
An optimizing compiler could generate the same code for both functions (and several of them do).
My GCC 7.1 compiler on Linux/x86-64 generates the same code with gcc -fverbose-asm -O2 -S:
.text
.p2align 4,,15
.globl f1
.type f1, @function
f1:
.LFB0:
.cfi_startproc
# abraham.c:3: if (number >= 200 && number <= 202)
subl $200, %edi #, tmp92
cmpl $2, %edi #, tmp92
setbe %al #, tmp93
# abraham.c:6: }
ret
.cfi_endproc
.LFE0:
.size f1, .-f1
.p2align 4,,15
.globl f2
.type f2, @function
f2:
.LFB3:
.cfi_startproc
subl $200, %edi #, tmp92
cmpl $2, %edi #, tmp92
setbe %al #, tmp93
ret
.cfi_endproc
.LFE3:
.size f2, .-f2
.ident "GCC: (Debian 7.1.0-2) 7.1.0"
.section .note.GNU-stack,"",@progbits
BTW clang-4.0 -fverbose-asm -S -O2 also generates the same code for both functions, but different code than gcc:
.type f1,@function
f1: # @f1
.cfi_startproc
# BB#0:
addl $-200, %edi
cmpl $3, %edi
setb %al
retq
.Lfunc_end0:
.size f1, .Lfunc_end0-f1
.cfi_endproc
And if performance matters that much to you, I recommend defining both functions as static inline in some common included header.
If you really care about performance, benchmark (after asking the compiler to optimize, e.g. with gcc -Wall -O2 with GCC). But read more about premature optimization, notably the fallacy of premature optimization. Notice that asking about performance without enabling optimization is contradictory.
Most of the time, you should choose whatever is more readable.
I benchmarked it (something you should learn how to do):
#!/bin/sh -e
cat > bench.c <<EOF
#include <stdio.h>
#include <stdlib.h>
_Bool f1(int number) { return (number >= 200 && number <= 202); }
_Bool f2(int number) { return (number == 200 || number == 201 || number == 202); }
int main(int c, char **v)
{
int it = c>1 ? atoi(v[1]) : 1000000000;
int cnt=0;
for(int j=0; j<10;j++)
for(int i=0;i<it;i++){
#ifdef F2
cnt+=f2(i);
#else
cnt+=f1(i);
#endif
}
printf("%d\n", cnt);
}
EOF
gcc -O3 bench.c
./a.out 1
time ./a.out
gcc -DF2 -O3 bench.c
./a.out 1
time ./a.out
Couldn't measure a statistically significant difference.
Then I checked the generated assembly and gcc generates the same output for both cases, starting at -O1 (clang isn't so smart):
f1:
subl $200, %edi
cmpl $2, %edi
setbe %al
ret
f2:
subl $200, %edi
cmpl $2, %edi
setbe %al
ret
(looks like a pretty neat optimization trick)
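Written back in C, the trick behind that sub/cmp/setbe sequence looks roughly like this. The name `f_range_trick` is hypothetical: subtracting the lower bound in unsigned arithmetic makes every value below 200 wrap around to a huge number, so a single unsigned comparison covers both ends of the range.

```c
#include <stdbool.h>

/* Straightforward version from the question. */
bool f1(int number)
{
    return number >= 200 && number <= 202;
}

/* Hand-written form of what the compiler generated: one subtraction and
 * one unsigned comparison. Unsigned arithmetic wraps, so there is no
 * overflow problem for any int input. */
bool f_range_trick(int number)
{
    return (unsigned)number - 200u <= 2u;
}
```

There is no reason to write this by hand in real code, since the compiler already does it; the sketch is only meant to explain the generated assembly.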
So the answer is the usual: do the more readable thing and leave optimization to the optimizer until you've measured and found it's not doing its job as well as it could.
Just look at the assembly code and count the instructions executed for the if statement:
bool f1(int number) {
011D9160 push ebp
011D9161 mov ebp,esp
011D9163 sub esp,0C0h
011D9169 push ebx
011D916A push esi
011D916B push edi
011D916C lea edi,[ebp-0C0h]
011D9172 mov ecx,30h
011D9177 mov eax,0CCCCCCCCh
011D917C rep stos dword ptr es:[edi]
if (number >= 200 && number <= 202)
011D917E cmp dword ptr [number],0C8h
011D9185 jl f1+34h (011D9194h)
011D9187 cmp dword ptr [number],0CAh
011D918E jg f1+34h (011D9194h)
return true;
011D9190 mov al,1
011D9192 jmp f1+36h (011D9196h)
return false;
011D9194 xor al,al
}
bool f2(int number) {
011D91B0 push ebp
011D91B1 mov ebp,esp
011D91B3 sub esp,0C0h
011D91B9 push ebx
011D91BA push esi
011D91BB push edi
011D91BC lea edi,[ebp-0C0h]
011D91C2 mov ecx,30h
011D91C7 mov eax,0CCCCCCCCh
011D91CC rep stos dword ptr es:[edi]
if (number == 200 || number == 201 || number == 202)
011D91CE cmp dword ptr [number],0C8h
011D91D5 je f2+39h (011D91E9h)
011D91D7 cmp dword ptr [number],0C9h
011D91DE je f2+39h (011D91E9h)
011D91E0 cmp dword ptr [number],0CAh
011D91E7 jne f2+3Dh (011D91EDh)
return true;
011D91E9 mov al,1
011D91EB jmp f2+3Fh (011D91EFh)
return false;
011D91ED xor al,al
}
For function f1:
When number equals 200, f1 executes cmp, jl, cmp, jg, mov, jmp: 6 instructions.
When number equals 201, f1 executes cmp, jl, cmp, jg, mov, jmp: 6 instructions.
When number equals 202, f1 executes cmp, jl, cmp, jg, mov, jmp: 6 instructions.
When number is less than 200, f1 executes cmp, jl, xor: 3 instructions.
When number is greater than 202, f1 executes cmp, jl, cmp, jg, xor: 5 instructions.
For function f2:
When number equals 200, f2 executes cmp, je, mov, jmp: 4 instructions.
When number equals 201, f2 executes cmp, je, cmp, je, mov, jmp: 6 instructions.
When number equals 202, f2 executes cmp, je, cmp, je, cmp, jne, mov, jmp: 8 instructions.
For other numbers, f2 executes cmp, je, cmp, je, cmp, jne, xor: 7 instructions.
So when the result is true, the performance is the same on average, since 6+6+6 = 4+6+8.
But when the result is false, f1 is better than f2.
Two more options for you to measure
#include <stdbool.h>
bool f3(int n) {
if (n < 200) return 0;
if (n > 202) return 0;
return 1;
}
or
#include <stdbool.h>
bool f4(int n) {
return (n >= 200) * (n <= 202);
}
I am writing a function which checks whether two integers are the same. I wrote it in two different ways, and I want to know if there is any performance difference.
Technique 1
int checkEqual(int a ,int b)
{
if (a == b)
return 1; //it means they were equal
else
return 0;
}
Technique 2
int checkEqual(int a ,int b)
{
if (!(a - b))
return 1; //it means they are equal
else
return 0;
}
In short, there is no difference in performance.
I compiled each technique using gcc-4.8.2 with the -O2 -S options (-S generates assembly code).
Technique 1
checkEqual1:
.LFB24:
.cfi_startproc
xorl %eax, %eax
cmpl %esi, %edi
sete %al
ret
Technique 2
checkEqual2:
.LFB25:
.cfi_startproc
xorl %eax, %eax
cmpl %esi, %edi
sete %al
ret
These are exactly the same assembly code.
So the two versions will provide the same performance.
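One caveat the identical assembly does not show: for int arguments, `a - b` can overflow (for example INT_MIN - 1), and signed overflow is undefined behavior in C, so Technique 2 is not strictly equivalent to `a == b` for all inputs. If the subtraction form is wanted anyway, a sketch that avoids the problem (the name `checkEqual_sub` is made up) does the subtraction in unsigned arithmetic, which wraps instead of overflowing.

```c
/* Subtraction-based equality test without signed overflow: converting
 * both operands to unsigned makes the subtraction wrap around, and the
 * difference is 0 exactly when a == b. */
int checkEqual_sub(int a, int b)
{
    return !((unsigned)a - (unsigned)b);
}
```

Plain `a == b` remains the simpler and more readable choice.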
Appendix
bool checkEqual3(int a, int b) { return a == b; }
provides
checkEqual3:
.LFB26:
.cfi_startproc
xorl %eax, %eax
cmpl %esi, %edi
sete %al
ret
exactly the same assembly code too!
It doesn't make any sense whatsoever to discuss manual code optimization without a specific system in mind.
That being said, you should always leave optimizations like these to the compiler and focus on writing as readable code as possible.
Your code can be made more readable by using only one return statement. Also, indent your code.
int checkEqual (int a, int b)
{
return a == b;
}
Here's my demo program:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int cmp(const void *d1, const void *d2)
{
int a, b;
a = *(int const *) d1;
b = *(int const *) d2;
if (a > b)
return 1;
else if (a == b)
return 0;
return -1;
}
int main()
{
int seed = time(NULL);
srandom(seed);
int i, n, max = 32768, a[max];
for (n=0; n < max; n++) {
int r = random() % 256;
a[n] = r;
}
qsort(a, max, sizeof(int), cmp);
clock_t beg = clock();
long long int sum = 0;
for (i=0; i < 20000; i++)
{
for (n=0; n < max; n++) {
if (a[n] >= 128)
sum += a[n];
}
}
clock_t end = clock();
double sec = (end - beg) / CLOCKS_PER_SEC;
printf("sec: %f\n", sec);
printf("sum: %lld\n", sum);
return 0;
}
unsorted
sec: 5.000000
sum: 63043880000
sorted
sec: 1.000000
sum: 62925420000
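A side note on the timings: they come out as whole seconds because `(end - beg) / CLOCKS_PER_SEC` is evaluated in integer arithmetic before being stored in the double (assuming clock_t is an integer type, as it is on common platforms). A small helper, with the hypothetical name `elapsed_seconds`, that keeps the fractional part could look like this:

```c
#include <time.h>

/* Converts a clock() interval to seconds, doing the division in
 * floating point so the fractional part is not truncated. */
double elapsed_seconds(clock_t beg, clock_t end)
{
    return (double)(end - beg) / CLOCKS_PER_SEC;
}
```

In the demo it would be used as `double sec = elapsed_seconds(beg, end);`.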
Here's an assembly diff of two versions of the program, one with qsort and one without:
--- unsorted.s
+++ sorted.s
@@ -58,7 +58,7 @@
shrl $4, %eax
sall $4, %eax
subl %eax, %esp
- leal 4(%esp), %eax
+ leal 16(%esp), %eax
addl $15, %eax
shrl $4, %eax
sall $4, %eax
@@ -83,6 +83,13 @@
movl -16(%ebp), %eax
cmpl -24(%ebp), %eax
jl .L7
+ movl -24(%ebp), %eax
+ movl $cmp, 12(%esp)
+ movl $4, 8(%esp)
+ movl %eax, 4(%esp)
+ movl -32(%ebp), %eax
+ movl %eax, (%esp)
+ call qsort
movl $0, -48(%ebp)
movl $0, -44(%ebp)
movl $0, -12(%ebp)
As far as I understand the assembly output, the sorted version just has more code due to passing values to qsort, but I don't see any branching optimization/prediction/whatever thing. Maybe I'm looking in the wrong direction?
Branch prediction is not something you will see at the assembly code level; it is done by the CPU itself.
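One way to convince yourself of this is to take the data-dependent branch out of the loop and time both variants on sorted and unsorted input. The sketch below is only an illustration with made-up function names; whether the compiler really emits branch-free code for the second version depends on the compiler and its options.

```c
/* Loop body as in the question: the if depends on the data, so how well
 * the CPU predicts it depends on the order of the values. */
long long sum_branchy(const int *a, int n)
{
    long long sum = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] >= 128)
            sum += a[i];
    }
    return sum;
}

/* Same result without an if: the comparison yields 0 or 1, and the
 * element is multiplied by that value before being added. */
long long sum_branchless(const int *a, int n)
{
    long long sum = 0;
    for (int i = 0; i < n; i++)
        sum += (long long)a[i] * (a[i] >= 128);
    return sum;
}
```

If the second version is compiled without a conditional jump, its run time should no longer depend much on whether the array is sorted.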
Built-in Function: long __builtin_expect (long exp, long c)
You may use __builtin_expect to provide the compiler with branch prediction information. In general, you should prefer to use actual
profile feedback for this (-fprofile-arcs), as programmers are
notoriously bad at predicting how their programs actually perform.
However, there are applications in which this data is hard to collect.
The return value is the value of exp, which should be an integral expression. The semantics of the built-in are that it is expected that
exp == c. For example:
if (__builtin_expect (x, 0))
foo ();
indicates that we do not expect to call foo, since we expect x to be zero. Since you are limited to integral expressions for exp, you
should use constructions such as
if (__builtin_expect (ptr != NULL, 1))
foo (*ptr);
when testing pointer or floating-point values.
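A common way to use the built-in is through a pair of macros, sketched below. The macro names LIKELY/UNLIKELY and the function `first_or_default` are assumptions for illustration; the fallback branch of the #if makes the code compile on compilers that lack `__builtin_expect`.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
/* !!(x) normalizes any scalar condition to 0 or 1 before it is passed on. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

/* Hypothetical function: returns *ptr, or `fallback` in the (expected to
 * be rare) case that ptr is NULL. The hint only affects how the compiler
 * lays out the code; the result is the same with or without it. */
int first_or_default(const int *ptr, int fallback)
{
    if (UNLIKELY(ptr == NULL))
        return fallback;
    return *ptr;
}
```

As the quoted documentation says, profile feedback is usually the better source of this information; the hint is for cases where profiling is hard to do.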
Otherwise the branch prediction is determined by the processor...
Branch prediction predicts the branch target and enables the
processor to begin executing instructions long before the branch true
execution path is known. All branches utilize the branch prediction
unit (BPU) for prediction. This unit predicts the target address not
only based on the EIP of the branch but also based on the execution
path through which execution reached this EIP. The BPU can
efficiently predict the following branch types:
• Conditional branches.
• Direct calls and jumps.
• Indirect calls and jumps.
• Returns.
The microarchitecture tries to overcome this problem by feeding the most probable branch into the pipeline and execut[ing] it speculatively.
...Using various methods of branch prediction.