Consider the following two programs that perform the same computations in two different ways:
// v1.c
#include <stdio.h>
#include <math.h>
int main(void) {
int i, j;
int nbr_values = 8192;
int n_iter = 100000;
float x;
for (j = 0; j < nbr_values; j++) {
x = 1;
for (i = 0; i < n_iter; i++)
x = sin(x);
}
printf("%f\n", x);
return 0;
}
and
// v2.c
#include <stdio.h>
#include <math.h>
int main(void) {
int i, j;
int nbr_values = 8192;
int n_iter = 100000;
float x[nbr_values];
for (i = 0; i < nbr_values; ++i) {
x[i] = 1;
}
for (i = 0; i < n_iter; i++) {
for (j = 0; j < nbr_values; ++j) {
x[j] = sin(x[j]);
}
}
printf("%f\n", x[0]);
return 0;
}
When I compile them using gcc 4.7.2 with -O3 -ffast-math and run on a Sandy Bridge box, the second program is twice as fast as the first one.
Why is that?
One suspect is the data dependency between successive iterations of the i loop in v1. However, I don't quite see what the full explanation might be.
(Question inspired by Why is my python/numpy example faster than pure C implementation?)
EDIT:
Here is the generated assembly for v1:
movl $8192, %ebp
pushq %rbx
LCFI1:
subq $8, %rsp
LCFI2:
.align 4
L2:
movl $100000, %ebx
movss LC0(%rip), %xmm0
jmp L5
.align 4
L3:
call _sinf
L5:
subl $1, %ebx
jne L3
subl $1, %ebp
.p2align 4,,2
jne L2
and for v2:
movl $100000, %r14d
.align 4
L8:
xorl %ebx, %ebx
.align 4
L9:
movss (%r12,%rbx), %xmm0
call _sinf
movss %xmm0, (%r12,%rbx)
addq $4, %rbx
cmpq $32768, %rbx
jne L9
subl $1, %r14d
jne L8
Ignore the loop structure altogether, and only think about the sequence of calls to sin. v1 does the following:
x <-- sin(x)
x <-- sin(x)
x <-- sin(x)
...
that is, each computation of sin() cannot begin until the result of the previous call is available; it must wait for the entirety of the previous computation. This means that for N calls to sin, the total time required is N times the latency of a single sin evaluation; here N = 8192 × 100000 = 819,200,000.
In v2, by contrast, you do the following:
x[0] <-- sin(x[0])
x[1] <-- sin(x[1])
x[2] <-- sin(x[2])
...
notice that each call to sin does not depend on the previous call. Effectively, the calls to sin are all independent, and the processor can begin on each as soon as the necessary register and ALU resources are available (without waiting for the previous computation to be completed). Thus, the time required is a function of the throughput of the sin function, not the latency, and so v2 can finish in significantly less time.
I should also note that DeadMG is right that v1 and v2 are formally equivalent, and in a perfect world the compiler would optimize both of them into a single chain of 100000 sin evaluations (or simply evaluate the result at compile time). Sadly, we live in an imperfect world.
In the first example, it runs 100000 loops of sin, 8192 times.
In the second example, it runs 8192 loops of sin, 100000 times.
Other than that and storing the result differently, I don't see any difference.
However, what does make a difference is that the input changes for each call in the second case. So I suspect that what happens is that the sin value, at certain times in the loop, gets much easier to calculate, and that can make a big difference. Calculating sin is not entirely trivial; it's a series calculation that loops until its exit condition is hit.
Related
Question
Say you have a simple function that returns a value based on a lookup table, for example:
See edit about assumptions.
uint32_t
lookup0(uint32_t r) {
static const uint32_t tbl[] = { 0, 1, 2, 3 };
if(r >= (sizeof(tbl) / sizeof(tbl[0]))) {
__builtin_unreachable();
}
/* Can replace with: `return r`. */
return tbl[r];
}
uint32_t
lookup1(uint32_t r) {
static const uint32_t tbl[] = { 0, 0, 1, 1 };
if(r >= (sizeof(tbl) / sizeof(tbl[0]))) {
__builtin_unreachable();
}
/* Can replace with: `return r / 2`. */
return tbl[r];
}
Is there any superoptimization infrastructure or algorithm that can go from the lookup table to the optimized ALU implementation?
Motivation
The motivation is that I'm building some locks for NUMA machines and want to be able to configure my code generically. It's pretty common in NUMA locks to need a cpu_id -> numa_node mapping. I can obviously set up the lookup table during configuration, but since I'm fighting for every drop of memory bandwidth I can get, I am hoping to find a generic solution that will be able to cover most layouts.
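For concreteness, here is a sketch of how such a table might be populated at configuration time (this assumes libnuma's numa_node_of_cpu() is available; MAX_CPUS and the array name are hypothetical):

#include <numa.h>      /* link with -lnuma */
#include <stdint.h>

#define MAX_CPUS 1024                 /* hypothetical upper bound */
static uint8_t cpu_to_node[MAX_CPUS]; /* the cpu_id -> numa_node LUT */

static void init_cpu_to_node(void)
{
    for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
        int node = numa_node_of_cpu(cpu);     /* -1 if the cpu doesn't exist */
        cpu_to_node[cpu] = (node < 0) ? 0 : (uint8_t)node;
    }
}

The question is what to do after this step: instead of keeping cpu_to_node around and paying for the memory traffic, I'd like to turn it into a small branchless computation.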
Looking at what modern compilers do:
Neither clang nor gcc is able to do this at the moment.
Clang is able to get lookup0 if you rewrite it as a switch/case statement.
lookup0(unsigned int): # #lookup0(unsigned int)
movl %edi, %eax
movl lookup0(unsigned int)::tbl(,%rax,4), %eax
retq
...
case0(unsigned int): # #case0(unsigned int)
movl %edi, %eax
retq
but can't get lookup1.
lookup1(unsigned int): # #lookup1(unsigned int)
movl %edi, %eax
movl .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
retq
...
case1(unsigned int): # #case1(unsigned int)
movl %edi, %eax
movl .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
retq
GCC can't get either.
lookup0(unsigned int):
movl %edi, %edi
movl lookup0(unsigned int)::tbl(,%rdi,4), %eax
ret
lookup1(unsigned int):
movl %edi, %edi
movl lookup1(unsigned int)::tbl(,%rdi,4), %eax
ret
case0(unsigned int):
leal -1(%rdi), %eax
cmpl $2, %eax
movl $0, %eax
cmovbe %edi, %eax
ret
case1(unsigned int):
subl $2, %edi
xorl %eax, %eax
cmpl $1, %edi
setbe %al
ret
I imagine I can cover a fair amount of the necessary cases with some custom brute-force approach, but was hoping this was a solved problem.
Edit:
The only true assumptions are:
All inputs have an index in the LUT.
All values are positive (I think that makes things easier), which will be true for just about any sys-config that's online.
(Edit4) I would add one more assumption: the LUT is dense. That is, it covers a range [<low_bound>, <high_bound>] but nothing outside of that range.
In my case, for CPU topology, I would generally expect sizeof(LUT) >= <max_value_in_lut>, but that is specific to the one example I gave and there would be some counter-examples.
Edit2:
I wrote a pretty simple optimizer that does a reasonable job for the CPU topologies I've tested here. But obviously it could be a lot better.
Edit3:
There seems to be some confusion about the question/initial example (I should have been clearer).
The examples lookup0/lookup1 are arbitrary. I am hoping to find a solution that can scale beyond 4 indexes and with different values.
The use case I have in mind is CPU topology so ~256 - 1024 is where I would expect the upper bound in size but for a generic LUT it could obviously get much larger.
The best "generic" solution I am aware of is the following:
int compute(int r)
{
static const int T[] = {0,0,1,1};
const int lut_size = sizeof(T) / sizeof(T[0]);
int result = 0;
for(int i=0 ; i<lut_size ; ++i)
result += (r == i) * T[i];
return result;
}
At -O3, GCC and Clang unroll the loop, propagate constants, and generate intermediate code similar to the following:
int compute(int r)
{
return (r == 0) * 0 + (r == 1) * 0 + (r == 2) * 1 + (r == 3) * 1;
}
The GCC/Clang optimizers know that the multiplication can be replaced with conditional moves (developers often use this as a trick to guide compilers into generating branchless assembly).
The resulting assembly is the following for Clang:
compute:
xor ecx, ecx
cmp edi, 2
sete cl
xor eax, eax
cmp edi, 3
sete al
add eax, ecx
ret
The same applies to GCC. There are no branches or memory accesses (at least as long as the values are small). Multiplications by small values are also replaced with the fast lea instruction.
A more complete test is available on Godbolt.
Note that this method should work for bigger tables, but if the table is too big the loop will not be automatically unrolled. You can tell the compiler to unroll more aggressively with compilation flags or pragmas (see the sketch below). That being said, for a very large table a plain LUT will likely be faster anyway, since loading and executing a huge stretch of generated code is slow in that pathological case.
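As an illustration, here is a sketch of forcing the unrolling with a pragma (this assumes GCC 8+ for #pragma GCC unroll; Clang has the equivalent #pragma clang loop unroll_count(N); the 16-entry table is made up):

int compute16(int r)
{
    static const int T[16] = {0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3};
    const int lut_size = sizeof(T) / sizeof(T[0]);
    int result = 0;
    /* force full unrolling so constant propagation can turn the table
       into ALU operations; -funroll-loops is the flag-based alternative */
    #pragma GCC unroll 16
    for (int i = 0; i < lut_size; ++i)
        result += (r == i) * T[i];
    return result;
}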
You could pack the array into a long integer and use bit shifts and masking to extract the result.
For example, the table {2,0,3,1} could be handled with:
uint32_t lookup0(uint32_t r) {
static const uint32_t tbl = (2u << 0) | (0u << 8) |
(3u << 16) | (1u << 24);
return (tbl >> (8 * r)) & 0xff;
}
It produces relatively nice assembly:
lookup0: # #lookup0
lea ecx, [8*rdi]
mov eax, 16973826
shr eax, cl
movzx eax, al
ret
Not perfect but branchless and with no indirection.
This method is quite generic and it could support vectorization by "looking up" multiple inputs at the same time.
There are a few tricks to allow handling larger arrays, like using longer integers (e.g. uint64_t or the __uint128_t extension); a sketch of the 64-bit variant follows.
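For instance, a sketch of the same trick with a hypothetical 8-entry table packed into a uint64_t (8 bits per entry, so every value must fit in a byte):

#include <stdint.h>

uint32_t lookup8(uint32_t r) {
    /* hypothetical table {2,0,3,1,7,5,6,4}, one byte per entry */
    static const uint64_t tbl = (uint64_t)2 <<  0 | (uint64_t)0 <<  8 |
                                (uint64_t)3 << 16 | (uint64_t)1 << 24 |
                                (uint64_t)7 << 32 | (uint64_t)5 << 40 |
                                (uint64_t)6 << 48 | (uint64_t)4 << 56;
    return (uint32_t)(tbl >> (8 * r)) & 0xff;
}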
Another approach is to split the bits of the value, e.g. into a high and a low byte, look each part up in its own packed table, and combine the results using bitwise operations.
So I have a question regarding the performance of two different coding techniques. Can you help me understand which one is faster/better and why?
Here is the first technique:
int x, y, i;
for(i=0; i<10; i++)
{
//do stuff with x and y
}
//reset x and y to zero
x=0;
y=0;
And here is the second one:
int i;
for(i=0; i<10; i++)
{
int x, y;
//do the same stuff with x and y as above
}
So which coding technique is better?
Also, if you know a better technique and/or any site/article etc. where I can read about this and other performance-related topics, I would love to have that as well!
It does not matter at all, because compilers don't automatically translate variable declaration to memory or register allocation. The difference between the two samples is that in the first case the variables are visible outside of the loop body, and in the second case they are not. However this difference is at the C level only, and if you don't use the variables outside the loop it will result in the same compiled code.
The compiler has two options for where to store a local variable: either on the stack or in a register. For each variable you use in your program, the compiler has to choose where it is going to live. If it lives on the stack, the stack pointer needs to be decremented to make room for the variable. But this decrement will not happen at the point of the variable declaration; typically it will be done at the beginning of the function: the stack pointer will be decremented only once, by an amount sufficient to hold all of the stack-allocated variables. If the variable is only going to live in a register, nothing needs to be done up front, and the register will simply be used as the destination the first time you assign to it. The important thing is that the compiler can and will re-use memory locations and registers that were previously used for variables which are now out of scope.
For illustration, I made two test programs. I used 10000 iterations instead of 10 because otherwise the compiler would unroll the loop at high optimization levels. The programs use rand to make for a quick and portable demo, but it should not be used in production code.
declare_once.c :
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
int main(void) {
srand(time(NULL));
int x, y, i;
for (i = 0; i < 10000; i++) {
x = rand();
y = rand();
printf("Got %d and %d !\n", x, y);
}
return 0;
}
redeclare.c is the same except for the loop which is :
for (i = 0; i < 10000; i++) {
int x, y;
x = rand();
y = rand();
printf("Got %d and %d !\n", x, y);
}
I compiled the programs using Apple's LLVM version 7.3.0 on x86_64 Mac. I asked it for assembly output which I reproduced below, leaving out the parts unrelated to the question.
clang -O0 -S declare_once.c -o declare_once.S :
_main:
## Function prologue
pushq %rbp
movq %rsp, %rbp ## Move the old value of the stack
## pointer (%rsp) to the base pointer
## (%rbp), which will be used to
## address stack variables
subq $32, %rsp ## Decrement the stack pointer by 32
## to make room for up to 32 bytes
## worth of stack variables including
## x and y
## Removed code that calls srand
movl $0, -16(%rbp) ## i = 0. i has been assigned to the 4
## bytes starting at address -16(%rbp),
## which means 16 less than the base
## pointer (so here, 16 more than the
## stack pointer).
LBB0_1:
cmpl $10, -16(%rbp)
jge LBB0_4
callq _rand ## Call rand. The return value will be in %eax
movl %eax, -8(%rbp) ## Assign the return value of rand to x.
## x has been assigned to the 4 bytes
## starting at -8(%rbp)
callq _rand
leaq L_.str(%rip), %rdi
movl %eax, -12(%rbp) ## Assign the return value of rand to y.
## y has been assigned to the 4 bytes
## starting at -12(%rbp)
movl -8(%rbp), %esi
movl -12(%rbp), %edx
movb $0, %al
callq _printf
movl %eax, -20(%rbp)
movl -16(%rbp), %eax
addl $1, %eax
movl %eax, -16(%rbp)
jmp LBB0_1
LBB0_4:
xorl %eax, %eax
addq $32, %rsp ## Add 32 to the stack pointer :
## deallocate all stack variables
## including x and y
popq %rbp
retq
The assembly output for redeclare.c is almost exactly the same, except that for some reason x and y get assigned to -16(%rbp) and -12(%rbp) respectively, and i gets assigned to -8(%rbp). I copy-pasted only the loop :
movl $0, -16(%rbp)
LBB0_1:
cmpl $10, -16(%rbp)
jge LBB0_4
callq _rand
movl %eax, -8(%rbp) ## x = rand();
callq _rand
leaq L_.str(%rip), %rdi
movl %eax, -12(%rbp) ## y = rand();
movl -8(%rbp), %esi
movl -12(%rbp), %edx
movb $0, %al
callq _printf
movl %eax, -20(%rbp)
movl -16(%rbp), %eax
addl $1, %eax
movl %eax, -16(%rbp)
jmp LBB0_1
So we see that even at -O0 the generated code is the same. The important thing to note is that the same memory locations are reused for x and y in each loop iteration, even though they are separate variables at each iteration from the C language point of view.
At -O3 the variables are kept in registers, and both programs output the exact same assembly.
clang -O3 -S declare_once.c -o declare_once.S :
movl $10000, %ebx ## i will be in %ebx. The compiler decided
## to count down from 10000 because
## comparisons to 0 are less expensive,
## so it actually does i = 10000.
leaq L_.str(%rip), %r14
.align 4, 0x90
LBB0_1:
callq _rand
movl %eax, %r15d ## x = rand(). x has been assigned to
## register %r15d (32 less significant
## bits of r15)
callq _rand
movl %eax, %ecx ## y = rand(). y has been assigned to
## register %ecx
xorl %eax, %eax
movq %r14, %rdi
movl %r15d, %esi
movl %ecx, %edx
callq _printf
decl %ebx
jne LBB0_1
So again, no differences between the two versions, and even though in redeclare.c we have different variables at each iteration, the same registers are re-used so that there is no allocation overhead.
Keep in mind that everything I said applies to variables that are assigned in each loop iteration, which seems to be what you were thinking. If on the other hand you want to use the same values for all iterations, of course the assignment should be done before the loop.
Declaring the variables in the inner-most scope where you'll use them:
int i;
for(i=0; i<10; i++)
{
int x, y;
//do the same stuff with x and y as above
}
is always going to be preferred. The biggest improvement is that you've limited the scope of the x and y variables. This prevents you from accidentally using them where you didn't intend to.
Even if you use "the same" variables again:
int i;
for(i=0; i<10; i++)
{
int x, y;
//do the same stuff with x and y as above
}
for(i=0; i<10; i++)
{
int x, y;
//do the same stuff with x and y as above
}
there will be no performance impact whatsoever. The statement int x, y has practically no effect at runtime.
Most modern compilers will calculate the total size of all local variables, and emit code to reserve the space on the stack (e.g. sub esp, 90h) once in the function prologue. The space for these variables will almost certainly be re-used from one "version" of x to the next. It's purely a lexical construct that the compiler uses to keep you from using that "space" on the stack where you didn't intend to.
It should not matter because you need to initialize the variables in either case. Additionally, the first case sets x and y after they are no longer being used. As a result, the reset is not needed.
Here is the first technique:
int x=0, y=0, i;
for(i=0; i<10; i++)
{
//do stuff with x and y
// x and y stay at the value they get set to during the pass
}
// x and y need to be reset if you want to use them again.
// or would retain whatever they became during the last pass.
If you had wanted x and y to be reset to 0 inside the loop, then you would need to say
Here is the first technique:
int x, y, i;
for(i=0; i<10; i++)
{
//reset x and y to zero
x=0;
y=0;
//do stuff with x and y
// Now x and y get reset before the next pass
}
The second procedure makes x and y local in scope, so they are dropped at the end of the loop. In practice the values they were set to during one pass carry over into the next pass, because the compiler sets the variables up once rather than re-creating them at run time for every pass; you will not be paying to define (and initialize) the variables on each pass through the loop.
And here is the second one:
int i;
for(i=0; i<10; i++)
{
int x=0, y=0;
//do the same stuff with x and y as above
// Usually x and y are only set to 0 at the start of the first pass.
}
Best Practices
So which coding technique is better?
As others have pointed out, given a sufficiently mature/modern compiler the performance difference will likely be nil due to optimization. Instead, the preferred code is determined by a set of ideas known as best practices.
Limiting Scope
"Scope" describes the range of access in your code. Assuming the intended scope is to be limited to within the loop itself, x and y should be declared inside the loop as the compiler will prevent you from using them later on in your function. However, in your OP you show them being reset, which implies they will be used again later for other purposes. In this case, you must declare them towards the top (e.g. outside the loop) so you can use them later.
Here's some code you can use to demonstrate the limiting of the scope:
#include <stdio.h>
#define IS_SCOPE_LIMITED
int main ( void )
{
int i;
#ifndef IS_SCOPE_LIMITED
int x, y; // compiler will not complain, scope is generous
#endif
for(i=0; i<10; i++)
{
#ifdef IS_SCOPE_LIMITED
int x, y; // compiler will complain about use outside of loop
#endif
x = i;
y = x+1;
y++;
}
printf("X is %d and Y is %d\n", x, y);
}
To test the scope, comment out the #define towards the top. Compile with gcc -Wall loopVars.c -o loopVars and run with ./loopVars.
Benchmarking and Profiling
If you're still concerned about performance, possibly because you have some obscure operations involving these variables, then test, test, and test again! (try benchmarking or profiling your code). Again, with optimizations you probably won't find significant (if any) differences because the compiler will have done all this (allocation of variable space) prior to runtime.
UPDATE
To demonstrate this another way, you could remove the #ifdef and the #ifndef from the code (also removing each #endif), and add a line immediately preceding the printf such as x=2; y=3;. What you will find is the code will compile and run but the output will be "X is 2 and Y is 3". This is legal because the two scopes prevent the identically-named variables from competing with each other. Of course, this is a bad idea because you now have multiple variables within the same piece of code with identical names and with more complex code this will not be as easy to read and maintain.
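For reference, a sketch of what that modified program looks like (only meant to make the shadowing concrete):

#include <stdio.h>

int main ( void )
{
    int i;
    int x, y;             // outer x and y
    for(i=0; i<10; i++)
    {
        int x, y;         // inner x and y shadow the outer pair
        x = i;
        y = x+1;
        y++;
    }
    x = 2; y = 3;         // assigns the outer pair only
    printf("X is %d and Y is %d\n", x, y);   // prints "X is 2 and Y is 3"
    return 0;
}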
In the specific case of int variables, it makes little (or no) difference.
For variables of more complex types, especially something with a constructor that (for example) allocates some memory dynamically, re-creating the variable every iteration of a loop may be substantially slower than re-initializing it instead. For example:
#include <vector>
#include <chrono>
#include <numeric>
#include <iostream>
unsigned long long versionA() {
std::vector<int> x;
unsigned long long total = 0;
for (int j = 0; j < 1000; j++) {
x.clear();
for (int i = 0; i < 1000; i++)
x.push_back(i);
total += std::accumulate(x.begin(), x.end(), 0ULL);
}
return total;
}
unsigned long long versionB() {
unsigned long long total = 0;
for (int j = 0; j < 1000; j++) {
std::vector<int> x;
for (int i = 0; i < 1000; i++)
x.push_back(i);
total += std::accumulate(x.begin(), x.end(), 0ULL);
}
return total;
}
template <class F>
void timer(F f) {
using namespace std::chrono;
auto start = high_resolution_clock::now();
auto result = f();
auto stop = high_resolution_clock::now();
std::cout << "Result: " << result << "\n";
std::cout << "Time: " << duration_cast<microseconds>(stop - start).count() << "\n";
}
int main() {
timer(versionA);
timer(versionB);
}
At least when I run it, there's a fairly substantial difference between the two methods:
Result: 499500000
Time: 5114
Result: 499500000
Time: 13196
In this case, creating a new vector every iteration takes more than twice as long as clearing an existing vector every iteration instead.
For what it's worth, there are probably two separate factors contributing to the speed difference:
initial creation of the vector.
Re-allocating memory as elements are added to the vector.
When we clear() a vector, that removes the existing elements but retains the memory that's currently allocated, so in a case like this, where we use the same size every iteration of the outer loop, the version that just resets the vector doesn't need to allocate any memory on subsequent iterations. If we add x.reserve(1000); immediately after defining the vector in versionB, the difference shrinks substantially (in my testing the two are not quite tied in speed, but pretty close).
Disclaimer: I am well aware implementing your own crypto is a very bad idea. This is part of a master thesis, the code will not be used in practice.
As part of a larger cryptographic algorithm, I need to sort an array of constant length (small, 24 to be precise), without leaking any information on the contents of this array. As far as I know (please correct me if these are not sufficient to prevent timing and cache attacks), this means:
The sort should run in the same number of cycles for a given array length, regardless of the particular values in the array
The sort should not branch or access memory depending on the particular values of the array
Do any such implementations exist? If not, are there any good resources on this type of programming?
To be honest, I'm even struggling with the easier subproblem, namely finding the smallest value of an array.
double arr[24]; // some input
double min = DBL_MAX;
int i;
for (i = 0; i < 24; ++i) {
if (arr[i] < min) {
min = arr[i];
}
}
Would adding an else with a dummy assignment be sufficient to make it timing-safe? If so, how do I ensure the compiler (GCC in my case) doesn't undo my hard work? Would this be susceptible to cache attacks?
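For what it's worth, here is one branchless way I could write that min loop; it's only a sketch (it assumes the values are never NaN, and I'd still have to check the generated assembly to be sure the compiler keeps the select branch-free):

#include <float.h>

double branchless_min(const double arr[24])
{
    double min = DBL_MAX;
    for (int i = 0; i < 24; ++i) {
        int take = arr[i] < min;                 /* 0 or 1 */
        min = take * arr[i] + (1 - take) * min;  /* arithmetic select, no if */
    }
    return min;
}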
Use a sorting network, a series of comparisons and swaps.
The swap call must not be dependent on the comparison. It must be implemented in a way that executes the same number of instructions regardless of the comparison result.
Like this:
#include <stdbool.h>

void swap( int* a , int* b , bool c )
{
    /* both values are computed and selected via the flag; this is intended
       to compile to conditional moves, not branches */
    const int min = c ? *b : *a;
    const int max = c ? *a : *b;
    *a = min;
    *b = max;
}
swap( &array[0] , &array[1] , array[0] > array[1] );
Then find the sorting network and use the swaps. Here is a generator that does that for you: http://pages.ripco.net/~jgamble/nw.html
Example for 4 elements, the numbers are array indices, generated by the above link:
SWAP(0, 1);
SWAP(2, 3);
SWAP(0, 2);
SWAP(1, 3);
SWAP(1, 2);
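A minimal sketch of how the generator's output might be wired to the branchless swap above (the SWAP macro and the sort4 wrapper are my own, not produced by the generator):

#include <stdbool.h>

/* the branchless swap from above, repeated so the sketch is self-contained */
static void swap(int *a, int *b, bool c)
{
    const int min = c ? *b : *a;
    const int max = c ? *a : *b;
    *a = min;
    *b = max;
}

/* lets the generator's SWAP(i, j) lines be pasted verbatim */
#define SWAP(i, j) swap(&array[(i)], &array[(j)], array[(i)] > array[(j)])

static void sort4(int *array)
{
    SWAP(0, 1);
    SWAP(2, 3);
    SWAP(0, 2);
    SWAP(1, 3);
    SWAP(1, 2);
}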
Here is a very dumb bubble sort that actually works and doesn't branch or change its memory access behavior depending on the input data. I'm not sure whether this compare-and-swap can be plugged into other sorting algorithms, since they tend to need their compares separate from their swaps, but maybe it's possible; I'm working on that now.
#include <stdint.h>
static void
cmp_and_swap(uint32_t *ap, uint32_t *bp)
{
uint32_t a = *ap;
uint32_t b = *bp;
int64_t c = (int64_t)a - (int64_t)b;
uint32_t sign = ((uint64_t)c >> 63);
uint32_t min = a * sign + b * (sign ^ 1);
uint32_t max = b * sign + a * (sign ^ 1);
*ap = min;
*bp = max;
}
void
timing_sort(uint32_t *arr, int n)
{
int i, j;
for (i = n - 1; i >= 0; i--) {
for (j = 0; j < i; j++) {
cmp_and_swap(&arr[j], &arr[j + 1]);
}
}
}
The cmp_and_swap function compiles to (Apple LLVM version 7.3.0 (clang-703.0.29), compiled with -O3):
_cmp_and_swap:
00000001000009e0 pushq %rbp
00000001000009e1 movq %rsp, %rbp
00000001000009e4 movl (%rdi), %r8d
00000001000009e7 movl (%rsi), %r9d
00000001000009ea movq %r8, %rdx
00000001000009ed subq %r9, %rdx
00000001000009f0 shrq $0x3f, %rdx
00000001000009f4 movl %edx, %r10d
00000001000009f7 negl %r10d
00000001000009fa orl $-0x2, %edx
00000001000009fd incl %edx
00000001000009ff movl %r9d, %ecx
0000000100000a02 andl %edx, %ecx
0000000100000a04 andl %r8d, %edx
0000000100000a07 movl %r8d, %eax
0000000100000a0a andl %r10d, %eax
0000000100000a0d addl %eax, %ecx
0000000100000a0f andl %r9d, %r10d
0000000100000a12 addl %r10d, %edx
0000000100000a15 movl %ecx, (%rdi)
0000000100000a17 movl %edx, (%rsi)
0000000100000a19 popq %rbp
0000000100000a1a retq
0000000100000a1b nopl (%rax,%rax)
The only memory accesses are the reads and writes of the array elements, and there are no branches. The compiler did figure out what the multiplication actually does, which is quite clever, but it didn't use branches for it.
The casts to int64_t are necessary to avoid overflows. I'm pretty sure it can be written cleaner.
As requested, here's a compare function for doubles:
#include <math.h>

void
cmp_and_swap(double *ap, double *bp)
{
    double a = *ap;
    double b = *bp;
    /* signbit() is only guaranteed to be zero/nonzero, so normalize it to 0 or 1 */
    int sign = !!signbit(a - b);
    double min = a * sign + b * (sign ^ 1);
    double max = b * sign + a * (sign ^ 1);
    *ap = min;
    *bp = max;
}
Compiled code is branchless and doesn't change memory access pattern depending on input data.
A very trivial, constant-time (but also highly inefficient) sort is to
have a source and a destination array, and
for each element of the (sorted) destination array, iterate through the complete source array to find the element that belongs exactly in that position.
No early breaks, (nearly) constant timing, and no dependence on even partial sortedness of the source; a sketch follows.
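A sketch of that idea, using uint32_t elements as in the earlier answer (for doubles the select would need an arithmetic formulation like the one shown there). Both phases always touch every element, so the memory access pattern does not depend on the data; whether each individual comparison stays branch-free still has to be checked in the generated assembly. Ties are broken by index so every rank is distinct:

#include <stddef.h>
#include <stdint.h>

#define N 24

void rank_sort(const uint32_t src[N], uint32_t dst[N])
{
    size_t rank[N];

    /* phase 1: rank of src[i] = number of smaller elements, plus equal
       elements with a lower index */
    for (size_t i = 0; i < N; i++) {
        size_t r = 0;
        for (size_t j = 0; j < N; j++)
            r += (src[j] < src[i]) | ((src[j] == src[i]) & (j < i));
        rank[i] = r;
    }

    /* phase 2: fill every destination slot by scanning the whole source,
       keeping src[i] only when its rank matches the slot */
    for (size_t k = 0; k < N; k++) {
        uint32_t out = 0;
        for (size_t i = 0; i < N; i++)
            out |= src[i] & (uint32_t)-(uint32_t)(rank[i] == k);
        dst[k] = out;
    }
}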
Here's my demo program:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int cmp(const void *d1, const void *d2)
{
int a, b;
a = *(int const *) d1;
b = *(int const *) d2;
if (a > b)
return 1;
else if (a == b)
return 0;
return -1;
}
int main()
{
int seed = time(NULL);
srandom(seed);
int i, n, max = 32768, a[max];
for (n=0; n < max; n++) {
int r = random() % 256;
a[n] = r;
}
qsort(a, max, sizeof(int), cmp);
clock_t beg = clock();
long long int sum = 0;
for (i=0; i < 20000; i++)
{
for (n=0; n < max; n++) {
if (a[n] >= 128)
sum += a[n];
}
}
clock_t end = clock();
double sec = (end - beg) / CLOCKS_PER_SEC;
printf("sec: %f\n", sec);
printf("sum: %lld\n", sum);
return 0;
}
unsorted
sec: 5.000000
sum: 63043880000
sorted
sec: 1.000000
sum: 62925420000
Here's an assembly diff of two versions of the program, one with qsort and one without:
--- unsorted.s
+++ sorted.s
@@ -58,7 +58,7 @@
shrl $4, %eax
sall $4, %eax
subl %eax, %esp
- leal 4(%esp), %eax
+ leal 16(%esp), %eax
addl $15, %eax
shrl $4, %eax
sall $4, %eax
@@ -83,6 +83,13 @@
movl -16(%ebp), %eax
cmpl -24(%ebp), %eax
jl .L7
+ movl -24(%ebp), %eax
+ movl $cmp, 12(%esp)
+ movl $4, 8(%esp)
+ movl %eax, 4(%esp)
+ movl -32(%ebp), %eax
+ movl %eax, (%esp)
+ call qsort
movl $0, -48(%ebp)
movl $0, -44(%ebp)
movl $0, -12(%ebp)
As far as I understand the assembly output, the sorted version just has more code due to passing values to qsort, but I don't see any branching optimization/prediction/whatever thing. Maybe I'm looking in the wrong direction?
Branch prediction is not something you will see at the assembly code level; it is done by the CPU itself.
Built-in Function: long __builtin_expect (long exp, long c)
You may use __builtin_expect to provide the compiler with branch prediction information. In general, you should prefer to use actual
profile feedback for this (-fprofile-arcs), as programmers are
notoriously bad at predicting how their programs actually perform.
However, there are applications in which this data is hard to collect.
The return value is the value of exp, which should be an integral expression. The semantics of the built-in are that it is expected that
exp == c. For example:
if (__builtin_expect (x, 0))
foo ();
indicates that we do not expect to call foo, since we expect x to be zero. Since you are limited to integral expressions for exp, you
should use constructions such as
if (__builtin_expect (ptr != NULL, 1))
foo (*ptr);
when testing pointer or floating-point values.
Otherwise the branch prediction is determined by the processor...
Branch prediction predicts the branch target and enables the
processor to begin executing instructions long before the branch true
execution path is known. All branches utilize the branch prediction
unit (BPU) for prediction. This unit predicts the target address not
only based on the EIP of the branch but also based on the execution
path through which execution reached this EIP. The BPU can
efficiently predict the following branch types:
• Conditional branches.
• Direct calls and jumps.
• Indirect calls and jumps.
• Returns.
The microarchitecture tries to overcome this problem by feeding the most probable branch into the pipeline and execut[ing] it speculatively.
...Using various methods of branch prediction.
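For completeness, here is how __builtin_expect might be applied to the branch in your loop. This is only a sketch (the likely macro is my own wrapper), and with unsorted random data the hint is of limited value, since the branch is taken about half the time anyway:

#define likely(x) __builtin_expect(!!(x), 1)

long long sum_large(const int *a, int max)
{
    long long sum = 0;
    for (int n = 0; n < max; n++) {
        /* hint that the branch is usually taken */
        if (likely(a[n] >= 128))
            sum += a[n];
    }
    return sum;
}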
This post is closely related to another one I posted some days ago. This time, I wrote a simple piece of code that adds a pair of arrays element by element, multiplies the result by the values in another array, and stores it in a fourth array; all variables are double-precision floating point.
I made two versions of that code: one with SSE instructions, using intrinsic calls, and another one without them. I then compiled them with gcc at the -O0 optimization level. They are shown below:
// SSE VERSION
#define N 10000
#define NTIMES 100000
#include <time.h>
#include <stdio.h>
#include <xmmintrin.h>
#include <pmmintrin.h>
double a[N] __attribute__((aligned(16)));
double b[N] __attribute__((aligned(16)));
double c[N] __attribute__((aligned(16)));
double r[N] __attribute__((aligned(16)));
int main(void){
int i, times;
for( times = 0; times < NTIMES; times++ ){
for( i = 0; i <N; i+= 2){
__m128d mm_a = _mm_load_pd( &a[i] );
_mm_prefetch( &a[i+4], _MM_HINT_T0 );
__m128d mm_b = _mm_load_pd( &b[i] );
_mm_prefetch( &b[i+4] , _MM_HINT_T0 );
__m128d mm_c = _mm_load_pd( &c[i] );
_mm_prefetch( &c[i+4] , _MM_HINT_T0 );
__m128d mm_r;
mm_r = _mm_add_pd( mm_a, mm_b );
mm_a = _mm_mul_pd( mm_r , mm_c );
_mm_store_pd( &r[i], mm_a );
}
}
}
//NO SSE VERSION
//same definitions as before
int main(void){
int i, times;
for( times = 0; times < NTIMES; times++ ){
for( i = 0; i < N; i++ ){
r[i] = (a[i]+b[i])*c[i];
}
}
}
When compiling them with -O0, gcc makes use of XMM/MMX registers and SSE instructions, unless specifically given the -mno-sse (and related) options. I inspected the assembly code generated for the second version and noticed that it makes use of movsd, addsd and mulsd instructions. So it uses SSE instructions, but only those that operate on the lowest part of the registers, if I am not mistaken. The assembly code generated for the first version made use, as expected, of the addpd and mulpd instructions, though a considerably larger amount of assembly code was generated.
Anyway, the first version should, as far as I know, profit more from the SIMD paradigm, since each iteration computes two result values. Even so, the second version performs roughly 25 percent faster than the first one. I also ran a test with single-precision values and got similar results. What's the reason for that?
Vectorization in GCC is enabled at -O3. That's why at -O0, you see only the ordinary scalar SSE2 instructions (movsd, addsd, etc). Using GCC 4.6.1 and your second example:
#define N 10000
#define NTIMES 100000
double a[N] __attribute__ ((aligned (16)));
double b[N] __attribute__ ((aligned (16)));
double c[N] __attribute__ ((aligned (16)));
double r[N] __attribute__ ((aligned (16)));
int
main (void)
{
int i, times;
for (times = 0; times < NTIMES; times++)
{
for (i = 0; i < N; ++i)
r[i] = (a[i] + b[i]) * c[i];
}
return 0;
}
and compiling with gcc -S -O3 -msse2 sse.c produces for the inner loop the following instructions, which is pretty good:
.L3:
movapd a(%eax), %xmm0
addpd b(%eax), %xmm0
mulpd c(%eax), %xmm0
movapd %xmm0, r(%eax)
addl $16, %eax
cmpl $80000, %eax
jne .L3
As you can see, with vectorization enabled GCC emits code to perform two loop iterations in parallel. It can be improved, though - this code uses the lower 128 bits of the SSE registers, but it can use the full 256-bit YMM registers by enabling the AVX encoding of SSE instructions (if available on the machine). So, compiling the same program with gcc -S -O3 -msse2 -mavx sse.c gives for the inner loop:
.L3:
vmovapd a(%eax), %ymm0
vaddpd b(%eax), %ymm0, %ymm0
vmulpd c(%eax), %ymm0, %ymm0
vmovapd %ymm0, r(%eax)
addl $32, %eax
cmpl $80000, %eax
jne .L3
Note the v in front of each instruction and that the instructions use the 256-bit YMM registers; four iterations of the original loop are executed in parallel.
I would like to extend chill's answer and draw your attention to the fact that GCC seems unable to make the same smart use of AVX instructions when iterating backwards.
Just replace the inner loop in chill's sample code with:
for (i = N-1; i >= 0; --i)
r[i] = (a[i] + b[i]) * c[i];
GCC (4.8.4) with options -S -O3 -mavx produces:
.L5:
vmovsd a+79992(%rax), %xmm0
subq $8, %rax
vaddsd b+80000(%rax), %xmm0, %xmm0
vmulsd c+80000(%rax), %xmm0, %xmm0
vmovsd %xmm0, r+80000(%rax)
cmpq $-80000, %rax
jne .L5