gcc branch prediction

Here's my demo program:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int cmp(const void *d1, const void *d2)
{
int a, b;
a = *(int const *) d1;
b = *(int const *) d2;
if (a > b)
return 1;
else if (a == b)
return 0;
return -1;
}
int main()
{
int seed = time(NULL);
srandom(seed);
int i, n, max = 32768, a[max];
for (n=0; n < max; n++) {
int r = random() % 256;
a[n] = r;
}
qsort(a, max, sizeof(int), cmp);
clock_t beg = clock();
long long int sum = 0;
for (i=0; i < 20000; i++)
{
for (n=0; n < max; n++) {
if (a[n] >= 128)
sum += a[n];
}
}
clock_t end = clock();
double sec = (end - beg) / CLOCKS_PER_SEC;
printf("sec: %f\n", sec);
printf("sum: %lld\n", sum);
return 0;
}
unsorted
sec: 5.000000
sum: 63043880000
sorted
sec: 1.000000
sum: 62925420000
Here's an assembly diff of two versions of the program, one with qsort and one without:
--- unsorted.s
+++ sorted.s
@@ -58,7 +58,7 @@
shrl $4, %eax
sall $4, %eax
subl %eax, %esp
- leal 4(%esp), %eax
+ leal 16(%esp), %eax
addl $15, %eax
shrl $4, %eax
sall $4, %eax
@@ -83,6 +83,13 @@
movl -16(%ebp), %eax
cmpl -24(%ebp), %eax
jl .L7
+ movl -24(%ebp), %eax
+ movl $cmp, 12(%esp)
+ movl $4, 8(%esp)
+ movl %eax, 4(%esp)
+ movl -32(%ebp), %eax
+ movl %eax, (%esp)
+ call qsort
movl $0, -48(%ebp)
movl $0, -44(%ebp)
movl $0, -12(%ebp)
As far as I understand the assembly output, the sorted version just has more code due to passing values to qsort, but I don't see any branching optimization/prediction/whatever thing. Maybe I'm looking in the wrong direction?

Branch prediction is not something you will see at the assembly code level; it is done by the CPU itself.

Built-in Function: long __builtin_expect (long exp, long c)
You may use __builtin_expect to provide the compiler with branch prediction information. In general, you should prefer to use actual
profile feedback for this (-fprofile-arcs), as programmers are
notoriously bad at predicting how their programs actually perform.
However, there are applications in which this data is hard to collect.
The return value is the value of exp, which should be an integral expression. The semantics of the built-in are that it is expected that
exp == c. For example:
if (__builtin_expect (x, 0))
foo ();
indicates that we do not expect to call foo, since we expect x to be zero. Since you are limited to integral expressions for exp, you
should use constructions such as
if (__builtin_expect (ptr != NULL, 1))
foo (*ptr);
when testing pointer or floating-point values.
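A common way to use this built-in is to wrap it in a pair of macros. The following is only a minimal sketch, assuming GCC or Clang; the macro names likely/unlikely and the function sum_at_least are illustrative and not part of the quoted documentation.

```c
#include <stddef.h>

/* Hypothetical convenience macros; !! normalizes any truthy value to 1. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Sums the elements that are >= threshold; returns -1 for a NULL array. */
long long sum_at_least(const int *a, size_t n, int threshold)
{
    long long sum = 0;
    size_t i;

    if (unlikely(a == NULL))    /* hint: the error path is expected to be rare */
        return -1;

    for (i = 0; i < n; i++) {
        if (a[i] >= threshold)
            sum += a[i];
    }
    return sum;
}
```

The hint only influences how the compiler lays out the code; the prediction made at run time is still up to the CPU.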
Otherwise the branch prediction is determined by the processor...
Branch prediction predicts the branch target and enables the
processor to begin executing instructions long before the branch true
execution path is known. All branches utilize the branch prediction
unit (BPU) for prediction. This unit predicts the target address not
only based on the EIP of the branch but also based on the execution
path through which execution reached this EIP. The BPU can
efficiently predict the following branch types:
• Conditional branches.
• Direct calls and jumps.
• Indirect calls and jumps.
• Returns.
The microarchitecture tries to overcome this problem by feeding the most probable branch into the pipeline and execut[ing] it speculatively.
...Using various methods of branch prediction.
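One way to check that the slowdown comes from mispredicted branches rather than from the generated code is to remove the data-dependent branch from the inner loop. The sketch below is not part of the quoted material: the function names are made up, and it assumes a 32-bit int and that >> on a negative int is an arithmetic shift (true for GCC, but implementation-defined in C).

```c
#include <stddef.h>

/* Reference version with the data-dependent branch from the question. */
long long sum_branchy(const int *a, size_t n)
{
    long long sum = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        if (a[i] >= 128)
            sum += a[i];
    }
    return sum;
}

/* Branch-free version: builds an all-ones or all-zeros mask instead.
 * Assumes 32-bit int and arithmetic right shift of negative values. */
long long sum_branchless(const int *a, size_t n)
{
    long long sum = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        int t = (a[i] - 128) >> 31;   /* -1 if a[i] < 128, else 0 */
        sum += ~t & a[i];             /* keeps a[i] only when a[i] >= 128 */
    }
    return sum;
}
```

With the branch gone, the running time should depend much less on whether the array is sorted, although an optimizing compiler may already apply a similar transformation on its own.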

Related

GCC: Optimizing away memory loads and stores

EDIT 1: Added another example (showing that GCC is, in principle, capable of doing what I want to achieve) and some more discussion at the end of this question.
EDIT 2: Found the malloc function attribute, which should do what I want. Please take a look at the very end of the question.
This is a question about how to tell the compiler that stores to a memory area are not visible outside of a region (and thus could be optimized away). To illustrate what I mean, let's take a look at the following code
int f (int a)
{
int v[2];
v[0] = a;
v[1] = 0;
while (v[0]-- > 0)
v[1] += v[0];
return v[1];
}
gcc -O2 generates the following assembly code (x86-64 gcc, trunk, on https://godbolt.org):
f:
leal -1(%rdi), %edx
xorl %eax, %eax
testl %edi, %edi
jle .L4
.L3:
addl %edx, %eax
subl $1, %edx
cmpl $-1, %edx
jne .L3
ret
.L4:
ret
As one can see, the loads and stores into the array v are gone after optimization.
Now consider the following code:
int g (int a, int *v)
{
v[0] = a;
v[1] = 0;
while (v[0]-- > 0)
v[1] += v[0];
return v[1];
}
The difference is that v is not (stack-) allocated in the function, but provided as an argument. The result of gcc -O2 in this case is:
g:
leal -1(%rdi), %edx
movl $0, 4(%rsi)
xorl %eax, %eax
movl %edx, (%rsi)
testl %edi, %edi
jle .L4
.L3:
addl %edx, %eax
subl $1, %edx
cmpl $-1, %edx
jne .L3
movl %eax, 4(%rsi)
movl $-1, (%rsi)
ret
.L4:
ret
Clearly, the code has to store the final values of v[0] and v[1] in memory as they may be observable.
Now, what I am looking for is a way to tell the compiler that the memory pointed to by v in the second example isn't accessible any more after the function g has returned so that the compiler could optimize away the memory accesses.
To have an even simpler example:
void h (int *v)
{
v[0] = 0;
}
If the memory pointed to by v isn't accessible after h returns, it should be possible to simplify the function to a single ret.
I tried to achieve what I want by playing with the strict aliasing rules but haven't succeeded.
ADDED IN EDIT 1:
GCC seems to have the necessary code built-in as the following example shows:
#include <stdlib.h>
int h (int a)
{
int *v = malloc (2 * sizeof (int));
v[0] = a;
v[1] = 0;
while (v[0]-- > 0)
v[1] += v[0];
return v[1];
}
The generated code contains no loads and stores:
h:
leal -1(%rdi), %edx
xorl %eax, %eax
testl %edi, %edi
jle .L4
.L3:
addl %edx, %eax
subl $1, %edx
cmpl $-1, %edx
jne .L3
ret
.L4:
ret
In other words, GCC knows that changing the memory area pointed to by v is not observable through any side-effect of malloc. For purposes like this one, GCC has __builtin_malloc.
So I can also ask: How can user code (say a user version of malloc) make use of this functionality?
ADDED IN EDIT 2:
GCC has the following function attribute:
malloc
This tells the compiler that a function is malloc-like, i.e., that the pointer P returned by the function cannot alias any other pointer valid when the function returns, and moreover no pointers to valid objects occur in any storage addressed by P.
Using this attribute can improve optimization. Compiler predicts that a function with the attribute returns non-null in most cases. Functions like malloc and calloc have this property because they return a pointer to uninitialized or zeroed-out storage. However, functions like realloc do not have this property, as they can return a pointer to storage containing pointers.
It seems to do what I want as the following example shows:
__attribute__ (( malloc )) int *m (int *h);
int i (int a, int *h)
{
int *v = m (h);
v[0] = a;
v[1] = 0;
while (v[0]-- > 0)
v[1] += v[0];
return v[1];
}
The generated assembler code has no loads and stores:
i:
pushq %rbx
movl %edi, %ebx
movq %rsi, %rdi
call m
testl %ebx, %ebx
jle .L4
leal -1(%rbx), %edx
xorl %eax, %eax
.L3:
addl %edx, %eax
subl $1, %edx
cmpl $-1, %edx
jne .L3
popq %rbx
ret
.L4:
xorl %eax, %eax
popq %rbx
ret
However, as soon as the compiler sees a definition of m, it may forget about the attribute. For example, this is the case when the following definition is given:
__attribute__ (( malloc )) int *m (int *h)
{
return h;
}
In that case, the function is inlined and the compiler forgets about the attribute, yielding the same code as the function g.
P.S.: Initially, I thought that the restrict keyword may help, but it doesn't seem so.

EDIT: Discussion about the noinline attribute added at the end.
Using the following function definition, one can achieve the goal of my question:
__attribute__ (( malloc, noinline )) static void *get_restricted_ptr (void *p)
{
return p;
}
This function get_restricted_ptr simply returns its pointer argument but informs the compiler that the returned pointer P cannot alias any other pointer valid when the function returns, and moreover no pointers to valid objects occur in any storage addressed by P.
The use of this function is demonstrated here:
int i (int a, int *h)
{
int *v = get_restricted_ptr (h);
v[0] = a;
v[1] = 0;
while (v[0]-- > 0)
v[1] += v[0];
return v[1];
}
The generated code does not contain loads and stores:
i:
leal -1(%rdi), %edx
xorl %eax, %eax
testl %edi, %edi
jle .L6
.L5:
addl %edx, %eax
subl $1, %edx
cmpl $-1, %edx
jne .L5
ret
.L6:
ret
ADDED IN EDIT: If the noinline attribute is left out, GCC ignores the malloc attribute. Apparently, in this case, the function gets inlined first so that there is no function call any more for which GCC would check the malloc attribute. (One can discuss whether this behaviour should be considered a bug in GCC.) With the noinline attribute, the function doesn't get inlined. Then, due to the malloc attribute, GCC understands that the call to that function is unnecessary and removes it completely.
Unfortunately, this means that the (trivial) function won't be inlined when its call is not eliminated due to the malloc attribute.

Both functions have side effects and memory reads & stores cannot be optimized out
void h (int *v)
{
v[0] = 0;
}
and
int g (int a, int *v)
{
v[0] = a;
v[1] = 0;
while (v[0]-- > 0)
v[1] += v[0];
return v[1];
}
The side effects have to be observable outside the function scope. Inlined functions may behave differently, because after inlining the side effects only have to be observable outside the enclosing code.
inline int g (int a, int *v)
{
v[0] = a;
v[1] = 0;
while (v[0]-- > 0)
v[1] += v[0];
return v[1];
}
void h(void)
{
int x[2], y = 0; // initialize y so that g() does not read an indeterminate value
g(y,x);
}
This code will be optimized to just a simple return.
You can promise the compiler that the memory is not accessed through any other pointer, which allows easier optimizations, by using the keyword restrict. But of course your code must keep this promise.
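To make the restrict promise concrete, here is a small sketch with two hypothetical functions (the names are not from the thread). Note that restrict only promises that the pointers do not alias; it does not say that the memory is dead after the function returns, so stores that the caller can observe still have to be performed.

```c
/* Without restrict, the compiler must assume that out and in may point
 * to the same int, so *in is reloaded after the first store to *out. */
void add_twice(int *out, const int *in)
{
    *out += *in;
    *out += *in;
}

/* With restrict, the caller promises that the two pointers do not alias,
 * so *in may be loaded once and reused. Calling this with out == in
 * would break the promise and is undefined behavior. */
void add_twice_restrict(int *restrict out, const int *restrict in)
{
    *out += *in;
    *out += *in;
}
```

Comparing the assembly of the two functions at -O2 is a quick way to see whether your compiler takes advantage of the promise.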

For C, the only restriction is that the compiler has to ensure that the code behaves the same. If the compiler can prove that the code behaves the same then it can and will remove the stores.
For example, I put this into https://godbolt.org/ :
void h (int *v)
{
v[0] = 0;
}
void foo() {
int v[2] = {1, 2};
h(v);
}
And told it to use GCC 8.2 and "-O3", and got this output:
h(int*):
mov DWORD PTR [rdi], 0
ret
foo():
ret
Note that there are two different versions of the function h() in the output. The first version exists in case other code (in other object files) wants to use the function (and may be discarded by the linker). The second version of h() was inlined directly into foo() and then optimised down to absolutely nothing.
If you change the code to this:
static void h (int *v)
{
v[0] = 0;
}
void foo() {
int v[2] = {1, 2};
h(v);
}
Then it tells the compiler that the version of h() that only existed for linking with other object files isn't needed, so the compiler only generates the second version of h() and the output becomes this:
foo():
ret
Of course, no compiler's optimizer is perfect: for more complex code (and for different compilers, including different versions of GCC) results might be different (the compiler may fail to do this optimization). This is purely a limitation of the compiler's optimizer and not a limitation of C itself.
For cases where the compiler's optimiser isn't good enough, there are 4 possible solutions:
get a better compiler
improve the compiler's optimiser (e.g. send an email to the compiler's developers that includes a minimal example and cross your fingers)
modify the code to make it easier for the compiler's optimiser (e.g. copy the input array into a local array, like "void h(int *v) { int temp[2]; temp[0] = v[0]; temp[1] = v[1]; ... }")
shrug and say "Oh, that's a pity" and do nothing

Counting '1' in number in C

My task was to print all whole numbers from 2 to N whose binary representation contains more '1' digits than '0' digits.
int CountOnes(unsigned int x)
{
unsigned int iPassedNumber = x; // number to be modified
unsigned int iOriginalNumber = iPassedNumber;
unsigned int iNumbOfOnes = 0;
while (iPassedNumber > 0)
{
iPassedNumber = iPassedNumber >> 1 << 1; //if LSB was '1', it turns to '0'
if (iOriginalNumber - iPassedNumber == 1) //if difference == 1, then we increment the number of '1's
{
++iNumbOfOnes;
}
iOriginalNumber = iPassedNumber >> 1; //do this to operate with the next bit
iPassedNumber = iOriginalNumber;
}
return (iNumbOfOnes);
}
Here is my function to calculate the number of '1' bits in binary. It was my homework in college. However, my teacher said that it would be more efficient to do this:
{
if(n%2==1)
++CountOnes;
else if(n%2==0)
++CountZeros;
}
In the end, I just got confused and don't know which is better. What do you think about this?

I used gcc compiler for the experiment below. Your compiler may be different, so you may have to do things a bit differently to get a similar effect.
When trying to figure out the most optimized method for doing something, you want to see what kind of code the compiler produces. Look at the CPU's manual to see which operations are fast and which are slow on that particular architecture, although there are general guidelines too. And, of course, look for ways to reduce the number of instructions the CPU has to execute.
I decided to show you a few different methods (not exhaustive) and give you a sample of how to go about looking at optimization of small functions (like this one) manually. There are more sophisticated tools that help with larger and more complex functions, however this approach should work with pretty much anything:
Note
All assembly code was produced using:
gcc -O99 -o foo -fprofile-generate foo.c
followed by
gcc -O99 -o foo -fprofile-use foo.c
On -fprofile-generate
The two-pass compile lets gcc optimize using real profile data (although -O99 most likely does most of the work already); however, mileage may vary depending on which version of gcc you are using.
On with it:
Method I (you)
Here is the disassembly of your function:
CountOnes_you:
.LFB20:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
je .L5
.p2align 4,,10
.p2align 3
.L4:
movl %edi, %edx
xorl %ecx, %ecx
andl $-2, %edx
subl %edx, %edi
cmpl $1, %edi
movl %edx, %edi
sete %cl
addl %ecx, %eax
shrl %edi
jne .L4
rep ret
.p2align 4,,10
.p2align 3
.L5:
rep ret
.cfi_endproc
At a glance
Approximately 9 instructions in a loop, until the loop exits
Method II (teacher)
Here is a function which uses your teacher's algo:
int CountOnes_teacher(unsigned int x)
{
unsigned int one_count = 0;
while(x) {
if(x%2)
++one_count;
x >>= 1;
}
return one_count;
}
Here's the disassembly of that:
CountOnes_teacher:
.LFB21:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
je .L12
.p2align 4,,10
.p2align 3
.L11:
movl %edi, %edx
andl $1, %edx
cmpl $1, %edx
sbbl $-1, %eax
shrl %edi
jne .L11
rep ret
.p2align 4,,10
.p2align 3
.L12:
rep ret
.cfi_endproc
At a glance:
5 instructions in a loop until the loop exits
Method III
Here is Kernighan's method:
int CountOnes_K(unsigned int x) {
unsigned int count;
for(count = 0; x; count++) {
x &= x - 1; // clear the least significant set bit
}
return count;
}
Here's the disassembly:
CountOnes_k:
.LFB22:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
je .L19
.p2align 4,,10
.p2align 3
.L18:
leal -1(%rdi), %edx
addl $1, %eax
andl %edx, %edi
jne .L18 ; loop is here
rep ret
.p2align 4,,10
.p2align 3
.L19:
rep ret
.cfi_endproc
At a glance
3 instructions in a loop.
Some commentary before continuing
As you can see, the compiler doesn't really produce the best code when you employ % to count (your teacher's approach; your own shift-and-subtract version fares no better).
Kernighan's method is pretty well optimized (it has the fewest operations in the loop). It is instructive to compare Kernighan's method to the naive method of counting: on the surface they may look the same, but they really are not!
for (c = 0; v; v >>= 1)
{
c += v & 1;
}
This method is much worse than Kernighan's. Here, if only the 32nd bit is set, this loop will run 32 times, whereas Kernighan's loop runs once per set bit, so just once!
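The difference in iteration counts can be made visible with two small helpers. This is a sketch for illustration only: the function names are hypothetical, and the 32-iteration figure assumes a 32-bit unsigned int.

```c
/* Naive loop: one iteration per bit position up to the highest set bit. */
unsigned naive_iterations(unsigned int v)
{
    unsigned iters = 0;
    for (; v; v >>= 1)
        iters++;
    return iters;
}

/* Kernighan's loop: one iteration per set bit, because v &= v - 1
 * clears the lowest set bit each time. */
unsigned kernighan_iterations(unsigned int v)
{
    unsigned iters = 0;
    for (; v; v &= v - 1)
        iters++;
    return iters;
}
```

For an input such as 0xFF both loops run 8 times, so the advantage only shows up when the set bits are sparse.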
But all these methods are still rather sub-par because they loop.
If we combine a couple of other pieces of (implicit) knowledge into our algorithms, we can get rid of loops altogether. Those are the size of our number in bits and the size of a character in bits. With these pieces, and given that we have a 64-bit register, we can count bits in chunks of 14, 24 or 32 bits.
So for instance, if we look at a 14-bit number then we can simply count the bits by:
(n * 0x200040008001ULL & 0x111111111111111ULL) % 0xf;
uses % but only once for all numbers between 0x0 and 0x3fff
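The 14-bit expression can be checked against a plain loop. In this sketch the function names are hypothetical, and count14 is only meant for inputs in the range 0x0 to 0x3fff.

```c
/* Reference: count set bits one at a time. */
unsigned count_loop(unsigned int n)
{
    unsigned c = 0;
    for (; n; n >>= 1)
        c += n & 1;
    return c;
}

/* The 14-bit trick from the text. The multiplication makes four copies of n
 * at bit offsets 0, 15, 30 and 45; the mask keeps every fourth bit, so each
 * bit of n lands in its own hex digit; % 0xf then adds up the hex digits. */
unsigned count14(unsigned int n)
{
    return (unsigned)((n * 0x200040008001ULL & 0x111111111111111ULL) % 0xf);
}

/* Returns 1 if both functions agree on every 14-bit input, otherwise 0. */
int count14_matches_loop(void)
{
    unsigned int n;
    for (n = 0; n <= 0x3fff; n++) {
        if (count14(n) != count_loop(n))
            return 0;
    }
    return 1;
}
```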
For 24 bits we split the number into two 12-bit halves and apply a similar trick to each half:
((n & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f
+ (((n & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL)
% 0x1f;
But we can generalize this concept by looking at the patterns in the numbers above: the magic multipliers are just a single bit repeated at a fixed stride (look at the hex numbers closely: 0x200040008001 has bits 0, 15, 30 and 45 set), so the multiplication produces shifted copies of n.
We can generalize and then shrink the ideas here, giving us the most optimized method for counting bits (up to 128 bits), with no loops, in O(1):
unsigned int CountOnes_best(unsigned int n) {
const unsigned char_bits = sizeof(unsigned char) << 3;
typedef __typeof__(n) T; // T is unsigned int in this case;
n = n - ((n >> 1) & (T)~(T)0/3); // reuse n as a temporary
n = (n & (T)~(T)0/15*3) + ((n >> 2) & (T)~(T)0/15*3);
n = (n + (n >> 4)) & (T)~(T)0/255*15;
return (T)(n * ((T)~(T)0/255)) >> (sizeof(T) - 1) * char_bits;
}
CountOnes_best:
.LFB23:
.cfi_startproc
movl %edi, %eax
shrl %eax
andl $1431655765, %eax
subl %eax, %edi
movl %edi, %edx
shrl $2, %edi
andl $858993459, %edx
andl $858993459, %edi
addl %edx, %edi
movl %edi, %ecx
shrl $4, %ecx
addl %edi, %ecx
andl $252645135, %ecx
imull $16843009, %ecx, %eax
shrl $24, %eax
ret
.cfi_endproc
This may be a bit of a jump from the previous methods (how the heck did we get from there to here?), but just take your time to go over it.
The most optimized method was first mentioned in the Software Optimization Guide for AMD Athlon™ 64 and Opteron™ Processors; my URL for it is broken. It is also well explained on the very excellent C bit twiddling page.
I highly recommend going over the content of that page; it really is a fantastic read.

Even better than your teacher's suggestion:
if( n & 1 ) {
++ CountOnes;
}
else {
++ CountZeros;
}
n % 2 has an implicit divide operation which the compiler is likely to optimise, but you should not rely on it - divide is a complex operation that takes longer on some platforms. Moreover there are only two options 1 or 0, so if it is not a one, it is a zero - there is no need for the second test in the else block.
Your original code is overly complex and hard to follow. If you want to assess the "efficiency" of an algorithm, consider the number of operations performed per iteration, and the number of iterations. Also the number of variables involved. In your case there are 10 operations per iteration and three variables (but you omitted to count the zeros so you'd need four variables to complete the assignment). The following:
unsigned int n = x; // number to be modified
int ones = 0 ;
int zeroes = 0 ;
while( n > 0 )
{
if( (n & 1) != 0 )
{
++ones ;
}
else
{
++zeroes ;
}
n >>= 1 ;
}
has only 7 operations (counting >>= as two - shift and assign). More importantly perhaps, it is much easier to follow.
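Wrapped into a function, the loop above is enough to solve the original task. This is a minimal sketch; the function name more_ones_than_zeros is made up, and leading zeros are not counted.

```c
/* Returns 1 if the binary form of x (without leading zeros) contains
 * more '1' digits than '0' digits, otherwise 0. */
int more_ones_than_zeros(unsigned int x)
{
    int ones = 0;
    int zeroes = 0;

    while (x > 0) {
        if ((x & 1) != 0)
            ++ones;
        else
            ++zeroes;
        x >>= 1;
    }
    return ones > zeroes;
}
```

To print the requested numbers, loop from 2 to N and print each value for which the function returns nonzero.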

Implementing a timing, cache attack resistant sort in C

Disclaimer: I am well aware implementing your own crypto is a very bad idea. This is part of a master thesis, the code will not be used in practice.
As part of a larger cryptographic algorithm, I need to sort an array of constant length (small, 24 to be precise), without leaking any information on the contents of this array. As far as I know (please correct me if these are not sufficient to prevent timing and cache attacks), this means:
The sort should run in the same amount of cycles in terms of the length of the array, regardless of the particular values of the array
The sort should not branch or access memory depending on the particular values of the array
Do any such implementations exist? If not, are there any good resources on this type of programming?
To be honest, I'm even struggling with the easier subproblem, namely finding the smallest value of an array.
double arr[24]; // some input
double min = DBL_MAX;
int i;
for (i = 0; i < 24; ++i) {
if (arr[i] < min) {
min = arr[i];
}
}
Would adding an else with a dummy assignment be sufficient to make it timing-safe? If so, how do I ensure the compiler (GCC in my case) doesn't undo my hard work? Would this be susceptible to cache attacks?

Use a sorting network, a series of comparisons and swaps.
The swap call must not be dependent on the comparison. It must be implemented in a way to execute the same amount of instructions, regardless of the comparison result.
Like this:
#include <stdbool.h>
void swap( int* a , int* b , bool c )
{
const int min = c ? *b : *a; // read both values before writing either
const int max = c ? *a : *b;
*a = min;
*b = max;
}
swap( &array[0] , &array[1] , array[0] > array[1] );
Then find the sorting network and use the swaps. Here is a generator that does that for you: http://pages.ripco.net/~jgamble/nw.html
Example for 4 elements, the numbers are array indices, generated by the above link:
SWAP(0, 1);
SWAP(2, 3);
SWAP(0, 2);
SWAP(1, 3);
SWAP(1, 2);
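The 4-element network above can be sketched as follows. This is an illustration under assumptions, not the answerer's code: it uses uint32_t elements and a mask-based compare-exchange instead of the ternary operator, and the names cswap and sort4 are hypothetical. A compiler is still free to emit branches, so the generated assembly should be inspected.

```c
#include <stdint.h>

/* Branch-free compare-exchange: afterwards *a <= *b. */
static void cswap(uint32_t *a, uint32_t *b)
{
    uint32_t x = *a;
    uint32_t y = *b;
    /* The 64-bit subtraction wraps (top bit set) exactly when y < x,
     * so mask is all ones when the pair is out of order, else zero. */
    uint32_t mask = (uint32_t)0 - (uint32_t)(((uint64_t)y - (uint64_t)x) >> 63);
    uint32_t diff = (x ^ y) & mask;

    *a = x ^ diff;   /* y if out of order, else x */
    *b = y ^ diff;   /* x if out of order, else y */
}

#define SWAP(i, j) cswap(&v[i], &v[j])

/* Sorts exactly four elements with the fixed comparison sequence above. */
void sort4(uint32_t v[4])
{
    SWAP(0, 1);
    SWAP(2, 3);
    SWAP(0, 2);
    SWAP(1, 3);
    SWAP(1, 2);
}
```

The sequence of comparisons and memory accesses is the same for every input, which is the property the question asks for.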

This is a very dumb bubble sort that actually works and doesn't branch or change memory access behavior depending on input data. Not sure if this can be plugged into another sorting algorithm, they need their compares separate from the swaps, but maybe it's possible, working on that now.
#include <stdint.h>
static void
cmp_and_swap(uint32_t *ap, uint32_t *bp)
{
uint32_t a = *ap;
uint32_t b = *bp;
int64_t c = (int64_t)a - (int64_t)b;
uint32_t sign = ((uint64_t)c >> 63);
uint32_t min = a * sign + b * (sign ^ 1);
uint32_t max = b * sign + a * (sign ^ 1);
*ap = min;
*bp = max;
}
void
timing_sort(uint32_t *arr, int n)
{
int i, j;
for (i = n - 1; i >= 0; i--) {
for (j = 0; j < i; j++) {
cmp_and_swap(&arr[j], &arr[j + 1]);
}
}
}
The cmp_and_swap function compiles to (Apple LLVM version 7.3.0 (clang-703.0.29), compiled with -O3):
_cmp_and_swap:
00000001000009e0 pushq %rbp
00000001000009e1 movq %rsp, %rbp
00000001000009e4 movl (%rdi), %r8d
00000001000009e7 movl (%rsi), %r9d
00000001000009ea movq %r8, %rdx
00000001000009ed subq %r9, %rdx
00000001000009f0 shrq $0x3f, %rdx
00000001000009f4 movl %edx, %r10d
00000001000009f7 negl %r10d
00000001000009fa orl $-0x2, %edx
00000001000009fd incl %edx
00000001000009ff movl %r9d, %ecx
0000000100000a02 andl %edx, %ecx
0000000100000a04 andl %r8d, %edx
0000000100000a07 movl %r8d, %eax
0000000100000a0a andl %r10d, %eax
0000000100000a0d addl %eax, %ecx
0000000100000a0f andl %r9d, %r10d
0000000100000a12 addl %r10d, %edx
0000000100000a15 movl %ecx, (%rdi)
0000000100000a17 movl %edx, (%rsi)
0000000100000a19 popq %rbp
0000000100000a1a retq
0000000100000a1b nopl (%rax,%rax)
Only memory accesses are reading and writing of the array, no branches. The compiler did figure out what the multiplication actually does, quite clever actually, but it didn't use branches for that.
The casts to int64_t are necessary to avoid overflows. I'm pretty sure it can be written cleaner.
As requested, here's a compare function for doubles:
#include <math.h>
void
cmp_and_swap(double *ap, double *bp)
{
double a = *ap;
double b = *bp;
int sign = !!signbit(a - b); // signbit() returns nonzero, not necessarily 1
double min = a * sign + b * (sign ^ 1);
double max = b * sign + a * (sign ^ 1);
*ap = min;
*bp = max;
}
Compiled code is branchless and doesn't change memory access pattern depending on input data.

A very trivial, constant-time (but also highly inefficient) sort is to:
have a src and destination array
for each element in the (sorted) destination array, iterate through the complete source array to find the element that belongs exactly into this position.
No early breaks, (nearly) constant timing, not depending on even partial sortedness of the source.
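The idea above could be sketched like this. Everything here is an assumption made for illustration: the element type uint32_t, the helper names, and the use of masks instead of comparisons so that the source contains no data-dependent branch or array index. Whether the compiled code is really constant-time has to be verified on the generated assembly.

```c
#include <stdint.h>
#include <stddef.h>

#define ARR_LEN 24   /* array length from the question */

/* 1 if x < y, else 0: the 64-bit subtraction wraps exactly when x < y. */
static uint32_t ct_lt(uint32_t x, uint32_t y)
{
    return (uint32_t)(((uint64_t)x - (uint64_t)y) >> 63);
}

/* 1 if x == y, else 0: (d | -d) has its top bit set exactly when d != 0. */
static uint32_t ct_eq(uint32_t x, uint32_t y)
{
    uint32_t d = x ^ y;
    return 1u ^ ((uint32_t)(d | ((uint32_t)0 - d)) >> 31);
}

void rank_sort(uint32_t dst[ARR_LEN], const uint32_t src[ARR_LEN])
{
    uint32_t rank[ARR_LEN];
    size_t i, j, k;

    /* rank[j] = number of elements that must come before src[j];
     * ties are broken by index so that all ranks are distinct. */
    for (j = 0; j < ARR_LEN; j++) {
        uint32_t r = 0;
        for (i = 0; i < ARR_LEN; i++) {
            r += ct_lt(src[i], src[j])
               | (ct_eq(src[i], src[j]) & ct_lt((uint32_t)i, (uint32_t)j));
        }
        rank[j] = r;
    }

    /* For each destination slot, scan the whole source array and keep
     * (via a mask) the single element whose rank equals the slot. */
    for (k = 0; k < ARR_LEN; k++) {
        uint32_t out = 0;
        for (j = 0; j < ARR_LEN; j++)
            out |= src[j] & ((uint32_t)0 - ct_eq(rank[j], (uint32_t)k));
        dst[k] = out;
    }
}
```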

Is this tail recursion even though a logical-and is present?

I know this title duplicates "Is this tail-recursion?", but my case is different and I could not get a clue from the previous question, so I have to ask again.
The code:
int sum_n(int n, int *sum)
{
return n && (*sum += n) && sum_n(n - 1, sum);
}
It is said (in a foreign-language source) that tail recursion has two features:
The function is self-called
Only constant stack space is needed
So are these two features the only key factors for judging tail recursion? And does the logical operator && in the return statement affect tail recursion or not?
Most of all, is the above code tail recursion?

As written, it's a bit iffy. Reason being, technically, the function should regain control in order to && the result to know what to return. (This is easily optimized away, though, and most compilers will probably do so.)
In order to make sure it's tail-recursive, simply avoid doing anything with the result other than returning it.
int sum_n(int n, int *sum)
{
if (!(n && (*sum += n))) return 0;
return sum_n(n - 1, sum);
}
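Because nothing happens after the recursive call in this form, the call can be replaced by a jump back to the top of the function. The sketch below shows a hand-written loop with the same behavior; sum_n_loop is a hypothetical name, and a real compiler is not obliged to produce exactly this.

```c
/* Recursive form from the answer above. */
int sum_n(int n, int *sum)
{
    if (!(n && (*sum += n))) return 0;
    return sum_n(n - 1, sum);
}

/* Equivalent loop: each tail call becomes "decrement n and start over". */
int sum_n_loop(int n, int *sum)
{
    while (n && (*sum += n))
        n--;
    return 0;
}
```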

This very function, I think, is not a tail recursion.
First, I agree that, this following form is tail recursion:
int sum_n(int n, int *sum)
{
int tmp = n && (*sum += n);
if (!tmp)
return 0;
else
return sum_n(n - 1, sum);
}
However, the above function is not literally equivalent to your original one. The code you provided is equivalent to this:
int sum_n(int n, int *sum)
{
int tmp = n && (*sum += n);
if (!tmp)
return 0;
else
return sum_n(n - 1, sum) ? 1 : 0;
}
The difference is that the return value of sum_n(n - 1, sum) cannot be used directly as the return value of your function: it has to be converted from int to _Bool (0 or 1).

Not talking about the academic definition of tail recursion, however, clang 3.3 does compile sum() as a loop, not recursion.
.file "tail.c"
.text
.globl sum
.align 16, 0x90
.type sum,@function
sum: # @sum
.cfi_startproc
# BB#0:
testl %edi, %edi
je .LBB0_6
# BB#1: # %.lr.ph
movl (%rsi), %eax
.align 16, 0x90
.LBB0_3: # =>This Inner Loop Header: Depth=1
addl %edi, %eax
je .LBB0_4
# BB#2: # %tailrecurse
# in Loop: Header=BB0_3 Depth=1
decl %edi
jne .LBB0_3
# BB#5: # %tailrecurse._crit_edge
movl %eax, (%rsi)
.LBB0_6:
xorl %eax, %eax
ret
.LBB0_4: # %split
movl $0, (%rsi)
xorl %eax, %eax
ret
.Ltmp0:
.size sum, .Ltmp0-sum
.cfi_endproc
.section ".note.GNU-stack","",@progbits
compiled with command:
$ clang -c tail.c -S -O2
clang version:
$ clang -v
clang version 3.3 (tags/RELEASE_33/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix

Two very similar functions involving sin() exhibit vastly different performance -- why?

Consider the following two programs that perform the same computations in two different ways:
// v1.c
#include <stdio.h>
#include <math.h>
int main(void) {
int i, j;
int nbr_values = 8192;
int n_iter = 100000;
float x;
for (j = 0; j < nbr_values; j++) {
x = 1;
for (i = 0; i < n_iter; i++)
x = sin(x);
}
printf("%f\n", x);
return 0;
}
and
// v2.c
#include <stdio.h>
#include <math.h>
int main(void) {
int i, j;
int nbr_values = 8192;
int n_iter = 100000;
float x[nbr_values];
for (i = 0; i < nbr_values; ++i) {
x[i] = 1;
}
for (i = 0; i < n_iter; i++) {
for (j = 0; j < nbr_values; ++j) {
x[j] = sin(x[j]);
}
}
printf("%f\n", x[0]);
return 0;
}
When I compile them using gcc 4.7.2 with -O3 -ffast-math and run on a Sandy Bridge box, the second program is twice as fast as the first one.
Why is that?
One suspect is the data dependency between successive iterations of the i loop in v1. However, I don't quite see what the full explanation might be.
(Question inspired by Why is my python/numpy example faster than pure C implementation?)
EDIT:
Here is the generated assembly for v1:
movl $8192, %ebp
pushq %rbx
LCFI1:
subq $8, %rsp
LCFI2:
.align 4
L2:
movl $100000, %ebx
movss LC0(%rip), %xmm0
jmp L5
.align 4
L3:
call _sinf
L5:
subl $1, %ebx
jne L3
subl $1, %ebp
.p2align 4,,2
jne L2
and for v2:
movl $100000, %r14d
.align 4
L8:
xorl %ebx, %ebx
.align 4
L9:
movss (%r12,%rbx), %xmm0
call _sinf
movss %xmm0, (%r12,%rbx)
addq $4, %rbx
cmpq $32768, %rbx
jne L9
subl $1, %r14d
jne L8

Ignore the loop structure altogether, and only think about the sequence of calls to sin. v1 does the following:
x <-- sin(x)
x <-- sin(x)
x <-- sin(x)
...
that is, each computation of sin( ) cannot begin until the result of the previous call is available; it must wait for the entirety of the previous computation. This means that for the N = 8192 × 100000 = 819200000 calls to sin, the total time required is N times the latency of a single sin evaluation.
In v2, by contrast, you do the following:
x[0] <-- sin(x[0])
x[1] <-- sin(x[1])
x[2] <-- sin(x[2])
...
notice that each call to sin does not depend on the previous call. Effectively, the calls to sin are all independent, and the processor can begin on each as soon as the necessary register and ALU resources are available (without waiting for the previous computation to be completed). Thus, the time required is a function of the throughput of the sin function, not the latency, and so v2 can finish in significantly less time.
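The two access patterns can be written side by side. This is only a sketch: step is a hypothetical, cheap stand-in for sin so that the example needs no math library, and the function names are made up. How much the hardware actually overlaps the independent calls depends on the CPU and on how the real sin is implemented.

```c
/* Hypothetical stand-in for sin(); any pure function would do. */
static float step(float x)
{
    return x - x * x * x / 6.0f;
}

/* v1 style: one long dependency chain; every call needs the previous result. */
float chain_one(float x, int n_iter)
{
    int i;
    for (i = 0; i < n_iter; i++)
        x = step(x);
    return x;
}

/* v2 style: four chains advanced in lockstep. The four calls inside one
 * iteration do not depend on each other, so they may overlap in flight. */
void chain_four(float x[4], int n_iter)
{
    int i;
    for (i = 0; i < n_iter; i++) {
        x[0] = step(x[0]);
        x[1] = step(x[1]);
        x[2] = step(x[2]);
        x[3] = step(x[3]);
    }
}
```

Both functions compute the same values; only the order in which the work is exposed to the processor differs.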
I should also note that DeadMG is right that v1 and v2 are formally equivalent, and in a perfect world the compiler would optimize both of them into a single chain of 100000 sin evaluations (or simply evaluate the result at compile time). Sadly, we live in an imperfect world.

In the first example, it runs 100000 loops of sin, 8192 times.
In the second example, it runs 8192 loops of sin, 100000 times.
Other than that and storing the result differently, I don't see any difference.
However, what does make a difference is that the input is being changed for each loop in the second case. So I suspect what happens is that the sin value, at certain times in the loop, gets much easier to calculate. And that can make a big difference. Calculating sin is not entirely trivial, and it's a series calculation that loops until the exit condition is hit.
