Assembly code's length can indicate execution speed? - c

I'm learning C, consider the following code snippet:
#include <stdio.h>
int main(void) {
int fahr;
float calc;
for (fahr = 300; fahr >= 0; fahr = fahr - 20) {
calc = (5.0 / 9.0) * (fahr - 32);
printf("%3d %6.1f\n", fahr, calc);
}
return 0;
}
Which is printing Celsius to Fahrenheit conversion table from 300 to 0. I compile this with:
$ clang -std=c11 -Wall -g -O3 -march=native main.c -o main
I also generate assembly code with this command:
$ clang -std=c11 -Wall -S -masm=intel -O3 -march=native main.c -o main
Which is generating 1.26kb file and 71 lines.
I slightly edited the code and moved the logic into another function which is initalized at main():
#include <stdio.h>
void foo(void) {
int fahr;
float calc;
for (fahr = 300; fahr >= 0; fahr = fahr - 20) {
calc = (5.0 / 9.0) * (fahr - 32);
printf("%3d %6.1f\n", fahr, calc);
}
}
int main(void) {
foo();
return 0;
}
This will generate 2.33kb assembly code with 128 lines.
Running both programs with time ./main I see no difference in execution speed.
My question is, does it matter anything trying to optimize your C programs by assembly code's length?

It seems that you are comparing the sizes of the .S files generated by GCC, since that obviously make no sense, I'm just pretending you were confronting the binary size of two, GCC generated, code snippets.
While, having all other conditions the same, a shorter code size may gives an increase in speed (due to an higher code density), in general x86 CPUs are complex enough to require a decoupling between optimizations for code size and optimizations for code speed.
Specifically if you aim at code speed you should optimize for... code speed. Sometime this require choosing the shortest snippet, sometime it doesn't.
Consider the classic example of compiler optimization, multiplication by powers of two:
int i = 4;
i = i * 8;
This may be badly translated as:
;NO optimizations at all
mov eax, 4 ;i = 4 B804000000 0-1 clocks
imul eax, 8 ;i = i * 8 6BC009 3 clocks
;eax = i 8 bytes total 3-4 clocks total
;Slightly optimized
;4*8 gives no sign issue, we can use shl
mov eax, 4 ;i = 4 B804000000 0-1 clocks
shl eax, 3 ;i = i * 8 C1E003 1 clock
;eax = i 8 bytes total 1-2 clocks total
Both snippets have the same code length but the second performs nearly as twice as faster.
This is a very basic example1, where there is not even much need to take the micro-architecture into account.
Another more subtle example is the following, taken from Agner Fog discussion of Partial register stalls2:
;Version A Version B
mov al, byte ptr [mem8] movzx ebx, byte ptr [mem8]
mov ebx, eax and eax, 0ffffff00h
or ebx, eax
;7 bytes 14 bytes
Both versions give the same result but Version B is 5-6 clocks faster than Version A despite the former being twice the size of the latter.
The answer is then no, code size is not enough; it may be a tie-breaker though.
If you really are interested into optimizing assembly you will enjoy these two readings:
Agner Fog's classics.
Intel optimization manual
The first link also have a manual to optimize C and C++ code.
If you write in C remember that the most impacting optimizations are 1) How data is represented/stored, i.e. Data structures 2) How data is processed, i.e. Algorithms.
There are the macro optimizations.
Taking into account the generated assembly is shifting into micro optimization and there the most useful tools are 1) A smart compiler 2) A good set of intrinsics3.
1 So simple to be optimized out in practice.
2 Maybe a little obsolete now but it serves the purpose.
3 Built-in, non standard, functions that translate into specific assembly instructions.

As always, the answer is "it depends". Sometimes making code longer makes it more efficient: for example, the CPU doesn't have to waste extra instructions jumping after every loop. A classic example (literally 'classic': 1983!) is "Duff's Device". The following code
register short *to, *from;
register count;
{
do { /* count > 0 assumed */
*to = *from++;
} while(--count > 0);
}
was made much faster by using this much larger, and more complicated, code:
register short *to, *from;
register count;
{
register n = (count + 7) / 8;
switch (count % 8) {
case 0: do { *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
} while (--n > 0);
}
}
But that can be taken to extremes: making code too large increases cache misses and all sorts of other issues. In short: "premature optimisation is evil" - you need to test your before-and-after, and often on multiple platforms, before deciding that it is a good idea.
And I'll ask you: is the second version of the above code "better" than the first version? It's less readable, less maintainable, and far more complex than the code it replaces.

The code that actually runs is the same in both cases, after inlining. The 2nd way is bigger because it also has to emit a stand-alone definition of the function, not inlined into main.
You would have avoided this if you'd used static on the function, so the compiler would know that nothing could call it from outside the compilation unit, and thus a stand-alone definition wasn't needed if it was inlined into its only caller.
Also, most of the .s lines in compiler output are comments or assembler directives, not instructions. So you're not even counting instructions.
The Godbolt compiler explorer is a good way to look at compiler asm output, with just the instructions and actually-used labels. Have a look at your code there.
Counting the total number of instructions in the executable is totally bogus if there are loops or branches. Or especially function calls inside loops, like in this case. Dynamic instruction count (how many instructions actually ran, i.e. counting each time through loops and so on) is very roughly correlated with performance, but some code runs at 4 instructions per cycle, while some runs at well below 1 (e.g. lots of div or sqrt, cache misses, and/or branch mispredicts).
To learn more about what makes code run slow or fast, see the x86 tag wiki, especially Agner Fog's stuff.
I also recently wrote an answer to Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs. Thinking of diabolical ways to make a program run slower is a fun exercise.

Related

Working inline assembly in C for bit parity?

I'm trying to compute the bit parity of a large number of uint64's. By bit parity I mean a function that accepts a uint64 and outputs 0 if the number of set bits is even, and 1 otherwise.
Currently I'm using the following function (by #Troyseph, found here):
uint parity64(uint64 n){
n ^= n >> 1;
n ^= n >> 2;
n = (n & 0x1111111111111111) * 0x1111111111111111;
return (n >> 60) & 1;
}
The same SO page has the following assembly routine (by #papadp):
.code
; bool CheckParity(size_t Result)
CheckParity PROC
mov rax, 0
add rcx, 0
jnp jmp_over
mov rax, 1
jmp_over:
ret
CheckParity ENDP
END
which takes advantage of the machine's parity flag. But I cannot get it to work with my C program (I know next to no assembly).
Question. How can I include the above (or similar) code as inline assembly in my C source file, so that the parity64() function runs that instead?
(I'm using GCC with 64-bit Ubuntu 14 on an Intel Xeon Haswell)
In case it's of any help, the parity64() function is called inside the following routine:
uint bindot(uint64* a, uint64* b, uint64 entries){
uint parity = 0;
for(uint i=0; i<entries; ++i)
parity ^= parity64(a[i] & b[i]); // Running sum!
return parity;
}
(This is supposed to be the "dot product" of two vectors over the field Z/2Z, aka. GF(2).)
This may sound a bit harsh, but I believe it needs to be said. Please don't take it personally; I don't mean it as an insult, especially since you already admitted that you "know next to no assembly." But if you think code like this:
CheckParity PROC
mov rax, 0
add rcx, 0
jnp jmp_over
mov rax, 1
jmp_over:
ret
CheckParity ENDP
will beat what a C compiler generates, then you really have no business using inline assembly. In just those 5 lines of code, I see 2 instructions that are glaringly sub-optimal. It could be optimized by just rewriting it slightly:
xor eax, eax
test ecx, ecx ; logically, should use RCX, but see below for behavior of PF
jnp jmp_over
mov eax, 1 ; or possibly even "inc eax"; would need to verify
jmp_over:
ret
Or, if you have random input values that are likely to foil the branch predictor (i.e., there is no predictable pattern to the parity of the input values), then it would be faster yet to remove the branch, writing it as:
xor eax, eax
test ecx, ecx
setp al
ret
Or perhaps the equivalent (which will be faster on certain processors, but not necessarily all):
xor eax, eax
test ecx, ecx
mov ecx, 1
cmovp eax, ecx
ret
And these are just the improvements I could see off the top of my head, given my existing knowledge of the x86 ISA and previous benchmarks that I have conducted. But lest anyone be fooled, this is undoubtedly not the fastest code, because (borrowing from Michael Abrash), "there ain't no such thing as the fastest code"—someone can virtually always make it faster yet.
There are enough problems with using inline assembly when you're an expert assembly-language programmer and a wizard when it comes to the intricacies of the x86 ISA. Optimizers are pretty darn good nowadays, which means it's hard enough for a true guru to produce better code (though certainly not impossible). It also takes trustworthy benchmarks that will verify your assumptions and confirm that your optimized inline assembly is actually faster. Never commit yourself to using inline assembly to outsmart the compiler's optimizer without running a good benchmark. I see no evidence in your question that you've done anything like this. I'm speculating here, but it looks like you saw that the code was written in assembly and assumed that meant it would be faster. That is rarely the case. C compilers ultimately emit assembly language code, too, and it is often more optimal than what us humans are capable of producing, given a finite amount of time and resources, much less limited expertise.
In this particular case, there is a notion that inline assembly will be faster than the C compiler's output, since the C compiler won't be able to intelligently use the x86 architecture's built-in parity flag (PF) to its benefit. And you might be right, but it's a pretty shaky assumption, far from universalizable. As I've said, optimizing compilers are pretty smart nowadays, and they do optimize to a particular architecture (assuming you specify the right options), so it would not at all surprise me that an optimizer would emit code that used PF. You'd have to look at the disassembly to see for sure.
As an example of what I mean, consider the highly specialized BSWAP instruction that x86 provides. You might naïvely think that inline assembly would be required to take advantage of it, but it isn't. The following C code compiles to a BSWAP instruction on almost all major compilers:
uint32 SwapBytes(uint32 x)
{
return ((x << 24) & 0xff000000 ) |
((x << 8) & 0x00ff0000 ) |
((x >> 8) & 0x0000ff00 ) |
((x >> 24) & 0x000000ff );
}
The performance will be equivalent, if not better, because the optimizer has more knowledge about what the code does. In fact, a major benefit this form has over inline assembly is that the compiler can perform constant folding with this code (i.e., when called with a compile-time constant). Plus, the code is more readable (at least, to a C programmer), much less error-prone, and considerably easier to maintain than if you'd used inline assembly. Oh, and did I mention it's reasonably portable if you ever wanted to target an architecture other than x86?
I know I'm making a big deal of this, and I want you to understand that I say this as someone who enjoys the challenge of writing highly-tuned assembly code that beats the compiler's optimizer in performance. But every time I do it, it's just that: a challenge, which comes with sacrifices. It isn't a panacea, and you need to remember to check your assumptions, including:
Is this code actually a bottleneck in my application, such that optimizing it would even make any perceptible difference?
Is the optimizer actually emitting sub-optimal machine language instructions for the code that I have written?
Am I wrong in what I naïvely think is sub-optimal? Maybe the optimizer knows more than I do about the target architecture, and what looks like slow or sub-optimal code is actually faster. (Remember that less code is not necessarily faster.)
Have I tested it in a meaningful, real-world benchmark, and proven that the compiler-generated code is slow and that my inline assembly is actually faster?
Is there absolutely no way that I can tweak the C code to persuade the optimizer to emit better machine code that is close, equal to, or even superior to the performance of my inline assembly?
In an attempt to answer some of these questions, I set up a little benchmark. (Using MSVC, because that's what I have handy; if you're targeting GCC, it's best to use that compiler, but we can still get a general idea. I use and recommend Google's benchmarking library.) And I immediately ran into problems. See, I first run my benchmarks in "debugging" mode, with assertions compiled in that verify that my "tweaked"/"optimized" code is actually producing the same results for all test cases as the original code (that is presumably known to be working/correct). In this case, an assertion immediately fired. It turns out that the CheckParity routine written in assembly language does not return identical results to the parity64 routine written in C! Uh-oh. Well, that's another bullet we need to add to the above list:
Have I ensured that my "optimized" code is returning the correct results?
This one is especially critical, because it's easy to make something faster if you also make it wrong. :-) I jest, but not entirely, because I've done this many times in the pursuit of faster code.
I believe Michael Petch has already pointed out the reason for the discrepancy: in the x86 implementation, the parity flag (PF) only concerns itself with the bits in the low byte, not the entire value. If that's all you need, then great. But even then, we can go back to the C code and further optimize it to do less work, which will make it faster—perhaps faster than the assembly code, eliminating the one advantage that inline assembly ever had.
For now, let's assume that you need the parity of the full value, since that's the original implementation you had that was working, and you're just trying to make it faster without changing its behavior. Thus, we need to fix the assembly code's logic before we can even proceed with meaningfully benchmarking it. Fortunately, since I am writing this answer late, Ajay Brahmakshatriya (with collaboration from others) has already done that work, saving me the extra effort.
…except, not quite. When I first drafted this answer, my benchmark revealed that draft 9 of his "tweaked" code still did not produce the same result as the original C function, so it's unsuitable according to our test cases. You say in a comment that his code "works" for you, which means either (A) the original C code was doing extra work, making it needlessly slow, meaning that you can probably tweak it to beat the inline assembly at its own game, or worse, (B) you have insufficient test cases and the new "optimized" code is actually a bug lying in wait. Since that time, Ped7g suggested a couple of fixes, which both fixed the bug causing the incorrect result to be returned, and further improved the code. The amount of input required here, and the number of drafts that he has gone through, should serve as testament to the difficulty of writing correct inline assembly to beat the compiler. But we're not even done yet! His inline assembly remains incorrectly written. SETcc instructions require an 8-bit register as their operand, but his code doesn't use a register specifier to request that, meaning that the code either won't compile (because Clang is smart enough to detect this error) or will compile on GCC but won't execute properly because that instruction has an invalid operand.
Have I convinced you about the importance of testing yet? I'll take it on faith, and move on to the benchmarking part. The benchmark results use the final draft of Ajay's code, with Ped7g's improvements, and my additional tweaks. I also compare some of the other solutions from that question you linked, modified for 64-bit integers, plus a couple of my own invention. Here are my benchmark results (mobile Haswell i7-4850HQ):
Benchmark Time CPU Iterations
-------------------------------------------------------------------
Naive 36 ns 36 ns 19478261
OriginalCCode 4 ns 4 ns 194782609
Ajay_Brahmakshatriya_Tweaked 4 ns 4 ns 194782609
Shreyas_Shivalkar 37 ns 37 ns 17920000
TypeIA 5 ns 5 ns 154482759
TypeIA_Tweaked 4 ns 4 ns 160000000
has_even_parity 227 ns 229 ns 3200000
has_even_parity_Tweaked 36 ns 36 ns 19478261
GCC_builtin_parityll 4 ns 4 ns 186666667
PopCount 3 ns 3 ns 248888889
PopCount_Downlevel 5 ns 5 ns 100000000
Now, keep in mind that these are for randomly-generated 64-bit input values, which disrupts branch prediction. If your input values are biased in a predictable way, either towards parity or non-parity, then the branch predictor will work for you, rather than against you, and certain approaches may be faster. This underscores the importance of benchmarking against data that simulates real-world use cases. (That said, when I write general library functions, I tend to optimize for random inputs, balancing size and speed.)
Notice how the original C function compares to the others. I'm going to make the claim that optimizing it any further is probably a big fat waste of time. So hopefully you learned something more general from this answer, rather than just scrolled down to copy-paste the code snippets. :-)
The Naive function is a completely unoptimized sanity check to determine the parity, taken from here. I used it to validate even your original C code, and also to provide a baseline for the benchmarks. Since it loops through each bit, one-by-one, it is relatively slow, as expected:
unsigned int Naive(uint64 n)
{
bool parity = false;
while (n)
{
parity = !parity;
n &= (n - 1);
}
return parity;
}
OriginalCCode is exactly what it sounds like—it's the original C code that you had, as shown in the question. Notice how it posts up at exactly the same time as the tweaked/corrected version of Ajay Brahmakshatriya's inline assembly code! Now, since I ran this benchmark in MSVC, which doesn't support inline assembly for 64-bit builds, I had to use an external assembly module containing the function, and call it from there, which introduced some additional overhead. With GCC's inline assembly, the compiler probably would have been able to inline the code, thus eliding a function call. So on GCC, you might see the inline-assembly version be up to a nanosecond faster (or maybe not). Is that worth it? You be the judge. For reference, this is the code I tested for Ajay_Brahmakshatriya_Tweaked:
Ajay_Brahmakshatriya_Tweaked PROC
mov rax, rcx ; Windows 64-bit calling convention passes parameter in ECX (System V uses EDI)
shr rax, 32
xor rcx, rax
mov rax, rcx
shr rax, 16
xor rcx, rax
mov rax, rcx
shr rax, 8
xor eax, ecx ; Ped7g's TEST is redundant; XOR already sets PF
setnp al
movzx eax, al
ret
Ajay_Brahmakshatriya_Tweaked ENDP
The function named Shreyas_Shivalkar is from his answer here, which is just a variation on the loop-through-each-bit theme, and is, in keeping with expectations, slow:
Shreyas_Shivalkar PROC
; unsigned int parity = 0;
; while (x != 0)
; {
; parity ^= x;
; x >>= 1;
; }
; return (parity & 0x1);
xor eax, eax
test rcx, rcx
je SHORT Finished
Process:
xor eax, ecx
shr rcx, 1
jne SHORT Process
Finished:
and eax, 1
ret
Shreyas_Shivalkar ENDP
TypeIA and TypeIA_Tweaked are the code from this answer, modified to support 64-bit values, and my tweaked version. They parallelize the operation, resulting in a significant speed improvement over the loop-through-each-bit strategy. The "tweaked" version is based on an optimization originally suggested by Mathew Hendry to Sean Eron Anderson's Bit Twiddling Hacks, and does net us a tiny speed-up over the original.
unsigned int TypeIA(uint64 n)
{
n ^= n >> 32;
n ^= n >> 16;
n ^= n >> 8;
n ^= n >> 4;
n ^= n >> 2;
n ^= n >> 1;
return !((~n) & 1);
}
unsigned int TypeIA_Tweaked(uint64 n)
{
n ^= n >> 32;
n ^= n >> 16;
n ^= n >> 8;
n ^= n >> 4;
n &= 0xf;
return ((0x6996 >> n) & 1);
}
has_even_parity is based on the accepted answer to that question, modified to support 64-bit values. I knew this would be slow, since it's yet another loop-through-each-bit strategy, but obviously someone thought it was a good approach. It's interesting to see just how slow it actually is, even compared to what I termed the "naïve" approach, which does essentially the same thing, but faster, with less-complicated code.
unsigned int has_even_parity(uint64 n)
{
uint64 count = 0;
uint64 b = 1;
for (uint64 i = 0; i < 64; ++i)
{
if (n & (b << i)) { ++count; }
}
return (count % 2);
}
has_even_parity_Tweaked is an alternate version of the above that saves a branch by taking advantage of the fact that Boolean values are implicitly convertible into 0 and 1. It is substantially faster than the original, clocking in at a time comparable to the "naïve" approach:
unsigned int has_even_parity_Tweaked(uint64 n)
{
uint64 count = 0;
uint64 b = 1;
for (uint64 i = 0; i < 64; ++i)
{
count += static_cast<int>(static_cast<bool>(n & (b << i)));
}
return (count % 2);
}
Now we get into the good stuff. The function GCC_builtin_parityll consists of the assembly code that GCC would emit if you used its __builtin_parityll intrinsic. Several others have suggested that you use this intrinsic, and I must echo their endorsement. Its performance is on par with the best we've seen so far, and it has a couple of additional advantages: (1) it keeps the code simple and readable (simpler than the C version); (2) it is portable to different architectures, and can be expected to remain fast there, too; (3) as GCC improves its implementation, your code may get faster with a simple recompile. You get all the benefits of inline assembly, without any of the drawbacks.
GCC_builtin_parityll PROC ; GCC's __builtin_parityll
mov edx, ecx
shr rcx, 32
xor edx, ecx
mov eax, edx
shr edx, 16
xor eax, edx
xor al, ah
setnp al
movzx eax, al
ret
GCC_builtin_parityll ENDP
PopCount is an optimized implementation of my own invention. To come up with this, I went back and considered what we were actually trying to do. The definition of "parity" is an even number of set bits. Therefore, it can be calculated simply by counting the number of set bits and testing to see if that count is even or odd. That's two logical operations. As luck would have it, on recent generations of x86 processors (Intel Nehalem or AMD Barcelona, and newer), there is an instruction that counts the number of set bits—POPCNT (population count, or Hamming weight)—which allows us to write assembly code that does this in two operations.
(Okay, actually three instructions, because there is a bug in the implementation of POPCNT on certain microarchitectures that creates a false dependency on its destination register, and to ensure we get maximum throughput from the code, we need to break this dependency by pre-clearing the destination register. Fortunately, this a very cheap operation, one that can generally be handled for "free" by register renaming.)
PopCount PROC
xor eax, eax ; break false dependency
popcnt rax, rcx
and eax, 1
ret
PopCount ENDP
In fact, as it turns out, GCC knows to emit exactly this code for the __builtin_parityll intrinsic when you target a microarchitecture that supports POPCNT (otherwise, it uses the fallback implementation shown below). As you can see from the benchmarks, this is the fastest code yet. It isn't a major difference, so it's unlikely to matter unless you're doing this repeatedly within a tight loop, but it is a measurable difference and presumably you wouldn't be optimizing this so heavily unless your profiler indicated that this was a hot-spot.
But the POPCNT instruction does have the drawback of not being available on older processors, so I also measured a "fallback" version of the code that does a population count with a sequence of universally-supported instructions. That is the PopCount_Downlevel function, taken from my private library, originally adapted from this answer and other sources.
PopCount_Downlevel PROC
mov rax, rcx
shr rax, 1
mov rdx, 5555555555555555h
and rax, rdx
sub rcx, rax
mov rax, 3333333333333333h
mov rdx, rcx
and rcx, rax
shr rdx, 2
and rdx, rax
add rdx, rcx
mov rcx, 0FF0F0F0F0F0F0F0Fh
mov rax, rdx
shr rax, 4
add rax, rdx
mov rdx, 0FF01010101010101h
and rax, rcx
imul rax, rdx
shr rax, 56
and eax, 1
ret
PopCount_Downlevel ENDP
As you can see from the benchmarks, all of the bit-twiddling instructions that are required here exact a cost in performance. It is slower than POPCNT, but supported on all systems and still reasonably quick. If you needed a bit count anyway, this would be the best solution, especially since it can be written in pure C without the need to resort to inline assembly, potentially yielding even more speed:
unsigned int PopCount_Downlevel(uint64 n)
{
uint64 temp = n - ((n >> 1) & 0x5555555555555555ULL);
temp = (temp & 0x3333333333333333ULL) + ((temp >> 2) & 0x3333333333333333ULL);
temp = (temp + (temp >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
temp = (temp * 0x0101010101010101ULL) >> 56;
return (temp & 1);
}
But run your own benchmarks to see if you wouldn't be better off with one of the other implementations, like OriginalCCode, which simplifies the operation and thus requires fewer total instructions. Fun fact: Intel's compiler (ICC) always uses a population count-based algorithm to implement __builtin_parityll; it emits a POPCNT instruction if the target architecture supports it, or otherwise, it simulates it using essentially the same code as I've shown here.
Or, better yet, just forget the whole complicated mess and let your compiler deal with it. That's what built-ins are for, and there's one for precisely this purpose.
Because C sucks when handling bit operations, I suggest using gcc built in functions, in this case __builtin_parityl(). See:
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
You will have to use extended inline assembly (which is a gcc extension) to get the similar effect.
Your parity64 function can be changed as follows -
uint parity64_unsafe_and_broken(uint64 n){
uint result = 0;
__asm__("addq $0, %0" : : "r"(n) :);
// editor's note: compiler-generated instructions here can destroy EFLAGS
// Don't depending on FLAGS / regs surviving between asm statements
// also, jumping out of an asm statement safely requires asm goto
__asm__("jnp 1f");
__asm__("movl $1, %0" : "=r"(result) : : );
__asm__("1:");
return result;
}
But as commented by #MichaelPetch the parity flag is computed only on the lower 8 bits. So this will work for your if your n is less than 255. For bigger numbers you will have to use the code you mentioned in your question.
To get it working for 64 bits you can collapse the parity of the 32 bit integer into single byte by doing
n = (n >> 32) ^ n;
n = (n >> 16) ^ n;
n = (n >> 8) ^ n;
This code will have to be just at the start of the function before the assembly.
You will have to check how it affects the performance.
The most optimized I could get it is
uint parity64(uint64 n){
unsigned char result = 0;
n = (n >> 32) ^ n;
n = (n >> 16) ^ n;
n = (n >> 8) ^ n;
__asm__("test %1, %1 \n\t"
"setp %0"
: "+r"(result)
: "r"(n)
:
);
return result;
}
How can I include the above (or similar) code as inline assembly in my C source file, so that the parity64() function runs that instead?
This is an XY problem... You think you need to inline that assembly to gain from its benefits, so you asked about how to inline it... but you don't need to inline it.
You shouldn't include assembly into your C source code, because in this case you don't need to, and the better alternative (in terms of portability and maintainability) is to keep the two pieces of source code separate, compile them separately and use the linker to link them.
In parity64.c you should have your portable version (with a wrapper named bool CheckParity(size_t result)), which you can default to in non-x86/64 situations.
You can compile this to an object file like so: gcc -c parity64.c -o parity64.o
... and then link the object code generated from assembly, with the C code: gcc bindot.c parity64.o -o bindot
In parity64_x86.s you might have the following assembly code from your question:
.code
; bool CheckParity(size_t Result)
CheckParity PROC
mov rax, 0
add rcx, 0
jnp jmp_over
mov rax, 1
jmp_over:
ret
CheckParity ENDP
END
You can compile this to an alternative parity64.o object file object code using gcc with this command: gcc -c parity64_x86.s -o parity64.o
... and then link the object code generated like so: gcc bindot.c parity64.o -o bindot
Similarly, if you wanted to use __builtin_parityl instead (as suggested by hdantes answer, you could (and should) once again keep that code separate (in the same place you keep other gcc/x86 optimisations) from your portable code. In parity64_x86.c you might have:
bool CheckParity(size_t result) {
return __builtin_parityl(result);
}
To compile this, your command would be: gcc -c parity64_x86.c -o parity64.o
... and then link the object code generated like so: gcc bindot.c parity64.o -o bindot
On a side-note, if you'd like to inspect the assembly gcc would produce from this: gcc -S parity64_x86.c
Comments in your assembly indicate that the equivalent function prototype in C would be bool CheckParity(size_t Result), so with that in mind, here's what bindot.c might look like:
extern bool CheckParity(size_t Result);
uint64_t bindot(uint64_t *a, uint64_t *b, size_t entries){
uint64_t parity = 0;
for(size_t i = 0; i < entries; ++i)
parity ^= a[i] & b[i]; // Running sum!
return CheckParity(parity);
}
You can build this and link it to any of the above parity64.o versions like so: gcc bindot.c parity64.o -o bindot...
I highly recommend reading the manual for your compiler, when you have the time...

SSE intrinsics without compiler optimization

I am new to SSE intrinsics and try to optimise my code by it. Here is my program about counting array elements which are equal to the given value.
I changed my code to SSE version but the speed almost doesn't change. I am wondering whether I use SSE in a wrong way...
This code is for an assignment where we're not allowed to enable compiler optimization options.
No SSE version:
int get_freq(const float* matrix, float value) {
int freq = 0;
for (ssize_t i = start; i < end; i++) {
if (fabsf(matrix[i] - value) <= FLT_EPSILON) {
freq++;
}
}
return freq;
}
SSE version:
#include <immintrin.h>
#include <math.h>
#include <float.h>
#define GETLOAD(n) __m128 load##n = _mm_load_ps(&matrix[i + 4 * n])
#define GETEQU(n) __m128 check##n = _mm_and_ps(_mm_cmpeq_ps(load##n, value), and_value)
#define GETCOUNT(n) count = _mm_add_ps(count, check##n)
int get_freq(const float* matrix, float givenValue, ssize_t g_elements) {
int freq = 0;
int i;
__m128 value = _mm_set1_ps(givenValue);
__m128 count = _mm_setzero_ps();
__m128 and_value = _mm_set1_ps(0x00000001);
for (i = 0; i + 15 < g_elements; i += 16) {
GETLOAD(0); GETLOAD(1); GETLOAD(2); GETLOAD(3);
GETEQU(0); GETEQU(1); GETEQU(2); GETEQU(3);
GETCOUNT(0);GETCOUNT(1);GETCOUNT(2);GETCOUNT(3);
}
__m128 shuffle_a = _mm_shuffle_ps(count, count, _MM_SHUFFLE(1, 0, 3, 2));
count = _mm_add_ps(count, shuffle_a);
__m128 shuffle_b = _mm_shuffle_ps(count, count, _MM_SHUFFLE(2, 3, 0, 1));
count = _mm_add_ps(count, shuffle_b);
freq = _mm_cvtss_si32(count);
for (; i < g_elements; i++) {
if (fabsf(matrix[i] - givenValue) <= FLT_EPSILON) {
freq++;
}
}
return freq;
}
If you need to compile with -O0, then do as much as possible in a single statement. In normal code, int a=foo(); bar(a); will compile to the same asm as bar(foo()), but in -O0 code, the second version will probably be faster, because it doesn't store the result to memory and then reload it for the next statement.
-O0 is designed to give the most predictable results from debugging, which is why everything is stored to memory after every statement. This is obviously horrible for performance.
I wrote a big answer a while ago for a different question from someone else with a stupid assignment like yours that required them to optimize for -O0. Some of that may help.
Don't try too hard on this assignment. Probably most of the "tricks" that you figure out that make your code run faster with -O0 will only matter for -O0, but make no difference with optimization enabled.
In real life, code is typically compiled with clang or gcc -O2 at least, and sometimes -O3 -march=haswell or whatever to auto-vectorize. (Once it's debugged and you're ready to optimize.)
Re: your update:
Now it compiles, and the horrible asm from the SSE version can be seen. I put it on godbolt along with a version of the scalar code that actually compiles, too. Intrinsics usually compile very badly with optimization disabled, with the inline functions still having args and return values that result in actual load/store round trips (store-forwarding latency) even with __attribute__((always_inline)). See Demonstrator code failing to show 4 times faster SIMD speed with optimization disabled for example.
The scalar version comes out a lot less bad. Its source does everything in one expression, so temporaries stay in registers. The loop counter is still in memory, though, bottlenecking it to at best one iteration per 6 cycles on Haswell, for example. (See the x86 tag wiki for optimization resources.)
BTW, a vectorized fabsf() is easy, see Fastest way to compute absolute value using SSE. That and an SSE compare for less-than should do the trick to give you the same semantics as your scalar code. (But makes it even harder to get -O0 to not suck).
You might do better just manually unrolling your scalar version one or two times, because -O0 sucks too much.
Some compilers are pretty good about doing optimization of vectors. Did you check the generated assembly of optimized build of both versions? Isn't the "naive" version actually using SIMD or other optimization techniques?

How to optimize these loops (with compiler optimization disabled)?

I need to optimize some for-loops for speed (for a school assignment) without using compiler optimization flags.
Given a specific Linux server (owned by the school), a satisfactory improvement is to make it run under 7 seconds, and a great improvement is to make it run under 5 seconds. This code that I have right here gets about 5.6 seconds. I am thinking I may need to use pointers with this in some way to get it to go faster, but I'm not really sure. What options do I have?
The file must remain 50 lines or less (not counting comments).
#include <stdio.h>
#include <stdlib.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
register double sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0, sum5 = 0, sum6 = 0, sum7 = 0, sum8 = 0, sum9 = 0;
register int j;
// ... and this one.
printf("CS201 - Asgmt 4 - \n");
for (i = 0; i < N_TIMES; i++)
{
// You can change anything between this comment ...
for (j = 0; j < ARRAY_SIZE; j += 10)
{
sum += array[j];
sum1 += array[j + 1];
sum2 += array[j + 2];
sum3 += array[j + 3];
sum4 += array[j + 4];
sum5 += array[j + 5];
sum6 += array[j + 6];
sum7 += array[j + 7];
sum8 += array[j + 8];
sum9 += array[j + 9];
}
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
}
// You can add some final code between this comment ...
sum += sum1 + sum2 + sum3 + sum4 + sum5 + sum6 + sum7 + sum8 + sum9;
// ... and this one.
return 0;
}
Re-posting a modified version of my answer from optimized sum of an array of doubles in C, since that question got voted down to -5. The OP of the other question phrased it more as "what else is possible", so I took him at his word and info-dumped about vectorizing and tuning for current CPU hardware. :)
The OP of that question eventually said he wasn't allowed to use compiler options higher than -O0, which I guess is the case here, too.
Summary:
Why using -O0 distorts things (unfairly penalizes things that are fine in normal code for a normal compiler). Using -O0 (the gcc/clang default) so your loops don't optimize away is not a valid excuse or a useful way to find out what will be faster with normal optimization enabled. (See also Idiomatic way of performance evaluation? for more about benchmark methods and pitfalls, like ways to enable optimization but still stop the compiler from optimizing away the work you want to measure.)
Stuff that's wrong with the assignment.
Types of optimizations. FP latency vs. throughput, and dependency chains. Link to Agner Fog's site. (Essential reading for optimization).
Experiments getting the compiler to optimize it (after fixing it to not optimize away). Best result with auto-vectorization (no source changes): gcc: half as fast as an optimal vectorized loop. clang: same speed as a hand-vectorized loop.
Some more comments on why bigger expressions are a perf win with -O0 only.
Source changes to get good performance without -ffast-math, making the code closer to what we want the compiler to do. Also some rules-lawyering ideas that would be useless in the real-world.
Vectorizing the loop with GCC architecture-neutral vectors, to see how close the auto-vectorizing compilers came to matching the performance of ideal asm code (since I checked the compiler output).
I think the point of the assignment is to sort of teach assembly-language performance optimizations using C with no compiler optimizations. This is silly. It's mixing up things the compiler will do for you in real life with things that do require source-level changes.
See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?
-O0 doesn't just "not optimize", it makes the compiler store variables to memory after every statement instead of keeping them in registers. It does this so you get the "expected" results if you set a breakpoint with gdb and modify the value (in memory) of a C variable. Or even if you jump to another line in the same function. So each C statement has to be compiled to an independent block of asm that starts and ends with all variables in memory. For a modern portable compiler like gcc which already transforms through multiple internal representations of program flow on the way from source to asm, this part of -O0 requires explicitly de-optimizing its graph of data flow back into separate C statements. These store/reloads lengthen every loop-carried dependency chain so it's horrible for tiny loops if the loop counter is kept in memory. (e.g. 1 cycle per iteration for inc reg vs. 6c for inc [mem], creating a bottleneck on loop counter updates in tight loops).
With gcc -O0, the register keyword lets gcc keep a var in a register instead of memory, and thus can make a big difference in tight loops (Example on the Godbolt Compiler explorer). But that's only with -O0. In real code, register is meaningless: the compiler attempts to optimally use the available registers for variables and temporaries. register is already deprecated in ISO C++11 (but not C11), and there's a proposal to remove it from the language along with other obsolete stuff like trigraphs.
With an extra variables involved, -O0 hurts array indexing a bit more than pointer incrementing.
Array indexing usually makes code easier to read. Compilers sometimes fail to optimize stuff like array[i*width + j*width*height], so it's a good idea to change the source to do the strength-reduction optimization of turning the multiplies into += adds.
At an asm level, array indexing vs. pointer incrementing are close to the same performance. (x86 for example has addressing modes like [rsi + rdx*4] which are as fast as [rdi]. except on Sandybridge and later.) It's the compiler's job to optimize your code by using pointer incrementing even when the source uses array indexing, when that's faster.
For good performance, you have to be aware of what compilers can and can't do. Some optimizations are "brittle", and a small seemingly-innocent change to the source will stop the compiler from doing an optimization that was essential for some code to run fast. (e.g. pulling a constant computation out of a loop, or proving something about how different branch conditions are related to each other, and simplifying.)
Besides all that, it's a crap sample because it doesn't have anything to stop a smart compiler from optimizing away the entire thing. It doesn't even print the sum. Even gcc -O1 (instead of -O3) threw away some of the looping.
(You can fix this by printing sum at the end. gcc and clang don't seem to realize that calloc returns zeroed memory, and optimize it away to 0.0. See my code below.)
Normally you'd put your code in a function, and call it in a loop from main() in another file. And compile them separately, without whole-program cross-file optimisation, so the compiler can't do optimisations based on the compile-time constants you call it with. The repeat-loop being wrapped so tightly around the actual loop over the array is causing havoc with gcc's optimizer (see below).
Also, the other version of this question had an uninitialized variable kicking around. It looks like long int help was introduced by the OP of that question, not the prof. So I will have to downgrade my "utter nonsense" to merely "silly", because the code doesn't even print the result at the end. That's the most common way of getting the compiler not to optimize everything away in a microbenchmark like this.
I assume your prof mentioned a few things about performance. There are a crapton of different things that could come into play here, many of which I assume didn't get mentioned in a 2nd-year CS class.
Besides multithreading with openmp, there's vectorizing with SIMD. There are also optimizations for modern pipelined CPUs: specifically, avoid having one long dependency chain.
Further essential reading:
Agner Fog's guides for optimizing C and asm for x86. Some of it applies to all CPUs.
What Every Programmer Should Know About Memory
Your compiler manual is also essential, esp. for floating point code. Floating point has limited precision, and is not associative. The final sum does depend on which order you do the additions in. Usually the difference in rounding error is small, so the compiler can get a big speedup by re-ordering things if you use -ffast-math to allow it.
Instead of just unrolling, keep multiple accumulators which you only add up at the end, like you're doing with the sum0..sum9 unroll-by-10. FP instructions have medium latency but high throughput, so you need to keep multiple FP operations in flight to keep the floating point execution units saturated.
If you need the result of the last op to be complete before the next one can start, you're limited by latency. For FP add, that's one per 3 cycles. In Intel Sandybridge, IvB, Haswell, and Broadwell, the throughput of FP add is one per cycle. So you need to keep at least 3 independent ops that can be in flight at once to saturate the machine. For Skylake, it's 2 per cycle with latency of 4 clocks. (On the plus side for Skylake, FMA is down to 4 cycle latency.)
In this case, there's also basic stuff like pulling things out of the loop, e.g. help += ARRAY_SIZE.
Compiler Options
Lets start by seeing what the compiler can do for us.
I started out with the original inner loop, with just help += ARRAY_SIZE pulled out, and adding a printf at the end so gcc doesn't optimize everything away. Let's try some compiler options and see what we can achieve with gcc 4.9.2 (on my i5 2500k Sandybridge. 3.8GHz max turbo (slight OC), 3.3GHz sustained (irrelevant for this short benchmark)):
gcc -O0 fast-loop-cs201.c -o fl: 16.43s performance is a total joke. Variables are stored to memory after every operation, and re-loaded before the next. This is a bottleneck, and adds a lot of latency. Not to mention losing out on actual optimisations. Timing / tuning code with -O0 is not useful.
-O1: 4.87s
-O2: 4.89s
-O3: 2.453s (uses SSE to do 2 at once. I'm of course using a 64bit system, so hardware support for -msse2 is baseline.)
-O3 -ffast-math -funroll-loops: 2.439s
-O3 -march=sandybridge -ffast-math -funroll-loops: 1.275s (uses AVX to do 4 at once.)
-Ofast ...: no gain
-O3 -ftree-parallelize-loops=4 -march=sandybridge -ffast-math -funroll-loops: 0m2.375s real, 0m8.500s user. Looks like locking overhead killed it. It only spawns the 4 threads total, but the inner loop is too short for it to be a win: it collects the sums every time, instead of giving each thread 1/4 of the outer loop iterations.
-Ofast -fprofile-generate -march=sandybridge -ffast-math, run it, then
-Ofast -fprofile-use -march=sandybridge -ffast-math: 1.275s. profile-guided optimization is a good idea when you can exercise all the relevant code-paths, so the compiler can make better unrolling / inlining decisions.
clang-3.5 -Ofast -march=native -ffast-math: 1.070s. (clang 3.5 is too old to support -march=sandybridge. You should prefer to use a compiler version that's new enough to know about the target architecture you're tuning for, esp. if using -march to make code that doesn't need to run on older architectures.)
gcc -O3 vectorizes in a hilarious way: The inner loop does 2 (or 4) iterations of the outer loop in parallel, by broadcasting one array element to all elements of an xmm (or ymm) register, and doing an addpd on that. So it sees the same values are being added repeatedly, but even -ffast-math doesn't let gcc just turn it into a multiply. Or switch the loops.
clang-3.5 vectorizes a lot better: it vectorizes the inner loop, instead of the outer, so it doesn't need to broadcast. It even uses 4 vector registers as 4 separate accumulators. It knows that calloc only returns 16-byte aligned memory (on x86-64 System V), and when tuning for Sandybridge (before Haswell) it knows that 32-byte loads have a big penalty when misaligned. And that splitting them isn't too expensive since a 32-byte load takes 2 cycles in a load port anyway.
vmovupd -0x60(%rbx,%rcx,8),%xmm4
vinsertf128 $0x1,-0x50(%rbx,%rcx,8),%ymm4,%ymm4
This is worse on later CPUs, especially when the data does happen to be aligned at run-time; see Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd? about GCC versions where -mavx256-split-unaligned-load was on by default with -mtune=generic.
It's actually slower when I tell it that the array is aligned. (with a stupid hack like array = (double*)((ptrdiff_t)array & ~31); which actually generates an instruction to mask off the low 5 bits, because clang-3.5 doesn't support gcc's __builtin_assume_aligned.) In that case it uses a tight loop of 4x vaddpd mem, %ymm, %ymm. It only runs about 0.65 insns per cycle (and 0.93 uops / cycle), according to perf, so the bottleneck isn't front-end.
I checked with a debugger, and calloc is indeed returning a pointer that's an odd multiple of 16. (glibc for large allocations tends to allocate new pages, and put bookkeeping info in the initial bytes, always misaligning to any boundary wider than 16.) So half the 32B memory accesses are crossing a cache line, causing a big slowdown. It is slightly faster to do two separate 16B loads when your pointer is 16B-aligned but not 32B-aligned, on Sandybridge. (gcc enables -mavx256-split-unaligned-load and ...-store for -march=sandybridge, and also for the default tune=generic with -mavx, which is not so good especially for Haswell or with memory that's usually aligned by the compiler doesn't know about it.)
Source level changes
As we can see from clang beating gcc, multiple accumulators are excellent. The most obvious way to do this would be:
for (j = 0; j < ARRAY_SIZE; j+=4) { // unroll 4 times
sum0 += array[j];
sum1 += array[j+1];
sum2 += array[j+2];
sum3 += array[j+3];
}
and then don't collect the 4 accumulators into one until after the end of the outer loop.
Your (from the other question) source change of
sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];
actually has a similar effect, thanks to out-of-order execution. Each group of 10 is a separate dependency chain. order-of-operations rules say the j values get added together first, and then added to sum. So the loop-carried dependency chain is still only the latency of one FP add, and there's lots of independent work for each group of 10. Each group is a separate dependency chain of 9 adds, and takes few enough instructions for the out-of-order execution hardware to see the start of the next chain and, and find the parallelism to keep those medium latency, high throughput FP execution units fed.
With -O0, as your silly assignment apparently requires, values are stored to RAM at the end of every statement. Writing longer expressions without updating any variables, even temporaries, will make -O0 run faster, but it's not a useful optimisation. Don't waste your time on changes that only help with -O0, esp. not at the expense of readability.
Using 4 accumulator variables and not adding them together until the end of the outer loop defeats clang's auto-vectorizer. It still runs in only 1.66s (vs. 4.89 for gcc's non-vectorized -O2 with one accumulator). Even gcc -O2 without -ffast-math also gets 1.66s for this source change. Note that ARRAY_SIZE is known to be a multiple of 4, so I didn't include any cleanup code to handle the last up-to-3 elements (or to avoid reading past the end of the array, which would happen as written now). It's really easy to get something wrong and read past the end of the array when doing this.
GCC, on the other hand, does vectorize this, but it also pessimises (un-optimises) the inner loop into a single dependency chain. I think it's doing multiple iterations of the outer loop, again.
Using gcc's platform-independent vector extensions, I wrote a version which compiles into apparently-optimal code:
// compile with gcc -g -Wall -std=gnu11 -Ofast -fno-tree-vectorize -march=native fast-loop-cs201.vec.c -o fl3-vec
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>
#include <string.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
long int help = 0;
typedef double v4df __attribute__ ((vector_size (8*4)));
v4df sum0={0}, sum1={0}, sum2={0}, sum3={0};
const size_t array_bytes = ARRAY_SIZE*sizeof(double);
double *aligned_array = NULL;
// this more-than-declaration could go in an if(i == 0) block for strict compliance with the rules
if ( posix_memalign((void**)&aligned_array, 32, array_bytes) ) {
exit (1);
}
memcpy(aligned_array, array, array_bytes); // In this one case: faster to align once and have no extra overhead for N_TIMES through the loop
// ... and this one.
// Please change 'your name' to your actual name.
printf("CS201 - Asgmt 4 - I. Forgot\n");
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
/*
#if defined(__GNUC__) && (__GNUC__ * 100 + __GNUC_MINOR__) >= 407 // GCC 4.7 or later.
array = __builtin_assume_aligned(array, 32);
#else
// force-align for other compilers. This loop-invariant will be done outside the loop.
array = (double*) ((ptrdiff_t)array & ~31);
#endif
*/
assert ( ARRAY_SIZE / (4*4) == (ARRAY_SIZE+15) / (4*4) ); // We don't have a cleanup loop to handle where the array size isn't a multiple of 16
// incrementing pointers can be more efficient than indexing arrays
// esp. on recent Intel where micro-fusion only works with one-register addressing modes
// of course, the compiler can always generate pointer-incrementing asm from array-indexing source
const double *start = aligned_array;
while ( (ptrdiff_t)start & 31 ) {
// annoying loops like this are the reason people use aligned buffers
sum += *start++; // scalar until we reach 32B alignment
// in practice, this loop doesn't run, because we copy into an aligned buffer
// This will also require a cleanup loop, and break our multiple-of-16 doubles assumption.
}
const v4df *end = (v4df *)(aligned_array+ARRAY_SIZE);
for (const v4df *p = (v4df *)start ; p+3 < end; p+=4) {
sum0 += p[0]; // p+=4 increments the pointer by 4 * 4 * 8 bytes
sum1 += p[1]; // make sure you keep track of what you're incrementing
sum2 += p[2];
sum3 += p[3];
}
// the compiler might be smart enough to pull this out of the inner loop
// in fact, gcc turns this into a 64bit movabs outside of both loops :P
help+= ARRAY_SIZE;
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
/* You could argue legalese and say that
if (i == 0) {
for (j ...)
sum += array[j];
sum *= N_TIMES;
}
* still does as many adds in its *INNER LOOP*, but it just doesn't run it as often
*/
}
// You can add some final code between this comment ...
sum0 = (sum0 + sum1) + (sum2 + sum3);
sum += sum0[0] + sum0[1] + sum0[2] + sum0[3];
printf("sum = %g; help=%ld\n", sum, help); // defeat the compiler.
free (aligned_array);
free (array); // not strictly necessary, because this is the end of main(). Leaving it out for this special case is a bad example for a CS class, though.
// ... and this one.
return 0;
}
The inner loop compiles to:
4007c0: c5 e5 58 19 vaddpd (%rcx),%ymm3,%ymm3
4007c4: 48 83 e9 80 sub $0xffffffffffffff80,%rcx # subtract -128, because -128 fits in imm8 instead of requiring an imm32 to encode add $128, %rcx
4007c8: c5 f5 58 49 a0 vaddpd -0x60(%rcx),%ymm1,%ymm1 # one-register addressing mode can micro-fuse
4007cd: c5 ed 58 51 c0 vaddpd -0x40(%rcx),%ymm2,%ymm2
4007d2: c5 fd 58 41 e0 vaddpd -0x20(%rcx),%ymm0,%ymm0
4007d7: 4c 39 c1 cmp %r8,%rcx # compare with end with p
4007da: 75 e4 jne 4007c0 <main+0xb0>
(For more, see online compiler output at the godbolt compiler explorer. The -xc compiler option compiles as C, not C++. The inner loop is from .L3 to jne .L3. See the x86 tag wiki for x86 asm links. See also this q&a about micro-fusion not happening on SnB-family, which Agner Fog's guides don't cover).
performance:
$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,stalled-cycles-frontend,stalled-cycles-backend,L1-dcache-load-misses,cache-misses ./fl3-vec
CS201 - Asgmt 4 - I. Forgot
sum = 0; help=6000000000
Performance counter stats for './fl3-vec':
1086.571078 task-clock (msec) # 1.000 CPUs utilized
4,072,679,849 cycles # 3.748 GHz
2,629,419,883 instructions # 0.65 insns per cycle
# 1.27 stalled cycles per insn
4,028,715,968 r1b1 # 3707.733 M/sec # unfused uops
2,257,875,023 r10e # 2077.982 M/sec # fused uops. lower than insns because of macro-fusion
3,328,275,626 stalled-cycles-frontend # 81.72% frontend cycles idle
1,648,011,059 stalled-cycles-backend # 40.47% backend cycles idle
751,736,741 L1-dcache-load-misses # 691.843 M/sec
18,772 cache-misses # 0.017 M/sec
1.086925466 seconds time elapsed
I still don't know why it's getting such low instructions per cycle. The inner loop is using 4 separate accumulators, and I checked with gdb that the pointers are aligned. So cache-bank conflicts shouldn't be the problem. Sandybridge L2 cache can sustain one 32B transfers per cycle, which should keep up with the one 32B FP vector add per cycle.
32B loads from L1 take 2 cycles (it wasn't until Haswell that Intel made 32B loads a single-cycle operation). However, there are 2 load ports, so the sustained throughput is 32B per cycle (which we're not reaching).
Perhaps the loads need to be pipelined ahead of when they're used, to minimize having the ROB (re-order buffer) fill up when a load stalls? But the perf counters indicate a fairly high L1 cache hit rate, so hardware prefetch from L2 to L1 seems to be doing its job.
0.65 instructions per cycle is only about half way to saturating the vector FP adder. This is frustrating. Even IACA says the loop should run in 4 cycles per iteration. (i.e. saturate the load ports and port1 (where the FP adder lives)) :/
update: I guess L2 bandwidth was the problem after all. There aren't enough line-fill buffers to keep enough misses in flight to sustain the peak throughput every cycle. L2 sustained bandwidth is less than peak on Intel SnB / Haswell / Skylake CPUs.
See also Single Threaded Memory Bandwidth on Sandy Bridge (Intel forum thread, with much discussion about what limits throughput, and how latency * max_concurrency is one possible bottleneck. See also the "Latency Bound Platforms" part of the answer to Enhanced REP MOVSB for memcpy limited memory concurrency is a bottleneck for loads as well as stores, but for loads prefetch into L2 does mean you might not be limited purely by Line Fill buffers for outstanding L1D misses.
Reducing ARRAY_SIZE to 1008 (multiple of 16), and increasing N_TIMES by a factor of 10, brought the runtime down to 0.5s. That's 1.68 insns per cycle. (The inner loop is 7 total instructions for 4 FP adds, thus we are finally saturating the vector FP add unit, and the load ports.) Loop tiling is a much better solution, see below.
Intel CPUs only have 32k each L1-data and L1-instruction caches. I think your array would just barely fit in the 64kiB L1D on an AMD K10 (Istanbul) CPU, but not Bulldozer-family (16kiB L1D) or Ryzen (32kiB L1D).
Gcc's attempt to vectorize by broadcasting the same value into a parallel add doesn't seem so crazy. If it had managed to get this right (using multiple accumulators to hide latency), that would have allowed it to saturate the vector FP adder with only half the memory bandwidth. As-is, it was pretty much a wash, probably because of overhead in broadcasting.
Also, it's pretty silly. The N_TIMES is a just a make-work repeat. We don't actually want to optimize for doing the identical work multiple times. Unless we want to win at silly assignments like this. A source-level way to do this would be to increment i in the part of the code we're allowed to modify:
for (...) {
sum += a[j] + a[j] + a[j] + a[j];
}
i += 3; // The inner loop does 4 total iterations of the outer loop
More realistically, to deal with this you could interchange your loops (loop over the array once, adding each value N_TIMES times). I think I've read that Intel's compiler will sometimes do that for you.
A more general technique is called cache blocking, or loop tiling. The idea is to work on your input data in small blocks that fit in cache. Depending on your algorithm, it can be possible to do various stages of thing on a chunk, then repeat for the next chunk, instead of having each stage loop over the whole input. As always, once you know the right name for a trick (and that it exists at all), you can google up a ton of info.
You could rules-lawyer your way into putting an interchanged loop inside an if (i == 0) block in the part of the code you're allowed to modify. It would still do the same number of additions, but in a more cache-optimal order.
You may be on the right track, though you'll need to measure it to be certain (my normal advice to measure, not guess seems a little superfluous here since the whole point of the assignment is to measure).
Optimising compilers will probably not see much of a difference since they're pretty clever about that sort of stuff but, since we don't know what optimisation level it will be compiling at, you may get a substantial improvement.
To use pointers in the inner loop is a simple matter of first adding a pointer variable:
register double *pj;
then changing the loop to:
for (pj = &(array[0]); pj < &(array[ARRAY_SIZE]); j++) {
sum += *j++;
sum1 += *j++;
sum2 += *j++;
sum3 += *j++;
sum4 += *j++;
sum5 += *j++;
sum6 += *j++;
sum7 += *j++;
sum8 += *j++;
sum9 += *j;
}
This keeps the amount of additions the same within the loop (assuming you're counting += and ++ as addition operators, of course) but basically uses pointers rather than array indexes.
With no optimisation1 on my system, this drops it from 9.868 seconds (CPU time) to 4.84 seconds. Your mileage may vary.
1 With optimisation level -O3, both are reported as taking 0.001 seconds so, as mentioned, the optimisers are pretty clever. However, given you're seeing 5+ seconds, I'd suggest it wasn't been compiled with optimisation on.
As an aside, this is a good reason why it's usually advisable to write your code in a readable manner and let the compiler take care of getting it running faster. While my meager attempts at optimisation roughly doubled the speed, using -O3 made it run some ten thousand times faster :-)
Before anything else, try to change compiler settings to produce faster code. There is general optimisation, and the compiler might do auto vectorisation.
What you would always do is try several approaches and check what is fastest. As a target, try to get to one cycle per addition or better.
Number of iterations per loop: You add up 10 sums simultaneously. It might be that your processor doesn't have enough registers for that, or it has more. I'd measure the time for 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... sums per loop.
Number of sums: Having more than one sum means that latency doesn't bite you, just throughput. But more than four or six might not be helpful. Try four sums, with 4, 8, 12, 16 iterations per loop. Or six sums, with 6, 12, 18 iterations.
Caching: You are running through an array of 80,000 bytes. Probably more than L1 cache. Split the array into 2 or 4 parts. Do an outer loop iterating over the two or four subarrays, the next loop from 0 to N_TIMES - 1, and the inner loop adding up values.
And then you can try using vector operations, or multi-threading your code, or using the GPU to do the work.
And if you are forced to use no optimisation, then the "register" keyword might actually work.

optimized sum of an array of doubles in C [duplicate]

This question already has answers here:
How to optimize these loops (with compiler optimization disabled)?
(3 answers)
Closed 5 years ago.
I've got an assignment where I must take a program and make it more efficient in terms of time.
the original code is:
#include <stdio.h>
#include <stdlib.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
long int help;
// ... and this one.
// Please change 'your name' to your actual name.
printf("CS201 - Asgmt 4 - I. Forgot\n");
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
int j;
for (j = 0; j < ARRAY_SIZE; j++) {
sum += array[j];
help++;
}
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
}
// You can add some final code between this comment ...
// ... and this one.
return 0;
}
I almost exclusively modified the second for loop by changing it to
double *j=array;
double *p=array+ARRAY_SIZE;
for(; j<p;j+=10){
sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];
{
this on its own was able to reduce the time down to the criteria...
it already seems to work but are there any mistakes i'm not seeing?
I posted an improved version of this answer on a duplicate of this: C loop optimization help for final assignment. It was originally just a repost, but then I made some changes to answer the differences in that question. I forget what's different, but you should probably read that one instead. Maybe I should just delete this one.
See also other optimization guides in the x86 tag wiki.
First of all, it's a really crap sample because it doesn't have anything to stop a smart compiler from optimizing away the entire thing. It doesn't even print the sum. Even gcc -O1 (instead of -O3) threw away some of the looping.
Normally you'd put your code in a function, and call it in a loop from main() in another file. And compile them separately, without whole-program cross-file optimisation, so the compiler can't do optimisations based on the compile-time constants you call it with. The repeat-loop being wrapped so tightly around the actual loop over the array is causing havoc with gcc's optimizer (see below).
Also:
gcc -Wall -O3 -march=native fast-loop-cs201.c -o fl
fast-loop-cs201.c: In function ‘main’:
fast-loop-cs201.c:17:14: warning: ‘help’ is used uninitialized in this function [-Wuninitialized]
long int help;
I have to agree with EOF's disparaging remarks about your prof. Giving out code that optimizes away to nothing, and with uninitialized variables, is utter nonsense.
Some people are saying in comments that "the compiler doesn't matter", and that you're supposed to do optimize your C source for the CPU microarchitecture, rather than letting the compiler do it. This is crap: for good performance, you have to be aware of what compilers can do, and can't do. Some optimizations are "brittle", and a small seemingly-innocent change to the source will stop the compiler from doing something.
I assume your prof mentioned a few things about performance. There are a crapton of different things that could come into play here, many of which I assume didn't get mentioned in a 2nd-year CS class.
Besides multithreading with openmp, there's vectorizing with SIMD. There are also optimizations for modern pipelined CPUs: specifically, avoid having one long dependency chain.
Further essential reading:
Agner Fog's guides for optimizing C and asm for x86. Some of it applies to all CPUs.
What Every Programmer Should Know About Memory
Your compiler manual is also essential, esp. for floating point code. Floating point has limited precision, and is not associative. The final sum does depend on which order you do the additions in. However, usually the difference in rounding error is small. So the compiler can get a big speedup by re-ordering things if you use -ffast-math to allow it. This may have been what your unroll-by-10 allowed.
Instead of just unrolling, keeping multiple accumulators which you only add up at the end can keep the floating point execution units saturated, because FP instructions have latency != throughput. If you need the result of the last op to be complete before the next one can start, you're limited by latency. For FP add, that's one per 3 cycles. In Intel Sandybridge, IvB, Haswell, and Broadwell, the throughput of FP add is one per cycle. So you need to keep at least 3 independent ops that can be in flight at once to saturate the machine. For Skylake, it's 2 per cycle with latency of 4 clocks. (On the plus side for Skylake, FMA is down to 4 cycle latency.)
In this case, there's also basic stuff like pulling things out of the loop, e.g. help += ARRAY_SIZE.
compiler options
I started out with the original inner loop, with just help += ARRAY_SIZE pulled out, and adding a printf at the end so gcc doesn't optimize everything away. Let's try some compiler options and see what we can achieve with gcc 4.9.2 (on my i5 2500k Sandybridge. 3.8GHz max turbo (slight OC), 3.3GHz sustained (irrelevant for this short benchmark)):
gcc -O0 fast-loop-cs201.c -o fl: 16.43s performance is a total joke. Variables are stored to memory after every operation, and re-loaded before the next. This is a bottleneck, and adds a lot of latency. Not to mention losing out on actual optimisations. Timing / tuning code with -O0 is not useful.
-O1: 4.87s
-O2: 4.89s
-O3: 2.453s (uses SSE to do 2 at once. I'm of course using a 64bit system, so hardware support for -msse2 is baseline.)
-O3 -ffast-math -funroll-loops: 2.439s
-O3 -march=sandybridge -ffast-math -funroll-loops: 1.275s (uses AVX to do 4 at once.)
-Ofast ...: no gain
-O3 -ftree-parallelize-loops=4 -march=sandybridge -ffast-math -funroll-loops: 0m2.375s real, 0m8.500s user. Looks like locking overhead killed it. It only spawns the 4 threads total, but the inner loop is too short for it to be a win (because it collects the sums every time, instead of giving one thread the first 1/4 of the outer loop iterations).
-Ofast -fprofile-generate -march=sandybridge -ffast-math, run it, then
-Ofast -fprofile-use -march=sandybridge -ffast-math: 1.275s
clang-3.5 -Ofast -march=native -ffast-math: 1.070s. (clang doesn't support -march=sandybridge).
gcc -O3 vectorizes in a hilarious way: The inner loop does 2 (or 4) iterations of the outer loop in parallel, by broadcasting one array element to all elements of an xmm (or ymm) register, and doing an addpd on that. So it sees the same values are being added repeatedly, but even -ffast-math doesn't let gcc just turn it into a multiply. Or switch the loops.
clang-3.5 vectorizes a lot better: it vectorizes the inner loop, instead of the outer, so it doesn't need to broadcast. It even uses 4 vector registers as 4 separate accumulators. However, it doesn't assume that calloc returns aligned memory, and for some reason it thinks the best bet is a pair of 128b loads.
vmovupd -0x60(%rbx,%rcx,8),%xmm4`
vinsertf128 $0x1,-0x50(%rbx,%rcx,8),%ymm4,%ymm4
It's actually slower when I tell it that the array is aligned. (with a stupid hack like array = (double*)((ptrdiff_t)array & ~31); which actually generates an instruction to mask off the low 5 bits, because clang-3.5 doesn't support gcc's __builtin_assume_aligned.) I think the way the tight loop of 4x vaddpd mem, %ymmX,%ymmX is aligned puts cmp $0x271c,%rcx crossing a 32B boundary, so it can't macro-fuse with jne. uop throughput shouldn't be an issue, though, since this code is only getting 0.65insns per cycle (and 0.93 uops / cycle), according to perf.
Ahh, I checked with a debugger, and calloc is only returning a 16B-aligned pointer. So half the 32B memory accesses are crossing a cache line, causing a big slowdown. I guess it is slightly faster to do two separate 16B loads when your pointer is 16B-aligned but not 32B-aligned, on Sandybridge. The compiler is making a good choice here.
Source level changes
As we can see from clang beating gcc, multiple accumulators are excellent. The most obvious way to do this would be:
for (j = 0; j < ARRAY_SIZE; j+=4) { // unroll 4 times
sum0 += array[j];
sum1 += array[j+1];
sum2 += array[j+2];
sum3 += array[j+3];
}
and then don't collect the 4 accumulators into one until after the end of the outer loop.
Your source change of
sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];
actually has a similar effect, thanks to out-of-order execution. Each group of 10 is a separate dependency chain. order-of-operations rules say the j values get added together first, and then added to sum. So the loop-carried dependency chain is still only the latency of one FP add, and there's lots of independent work for each group of 10. Each group is a separate dependency chain of 9 adds, and takes few enough instructions for the out-of-order execution hardware to see the start of the next chain and, and find the parallelism to keep those medium latency, high throughput FP execution units fed.
With -O0, as your silly assignment apparently requires, values are stored to RAM at the end of every statement. (Technically, at every "sequence point", as the C standards call it.) Writing longer expressions without updating any variables, even temporaries, will make -O0 run faster, but it's not a useful optimisation. Don't waste your time on changes that only help with -O0, esp. not at the expense of readability.
Using 4-accumulators and not adding them together until the end of the outer loop defeats clang's auto-vectorizer. It still runs in only 1.66s (vs. 4.89 for gcc's non-vectorized -O2 with one accumulator). Even gcc -O2 without -ffast-math also gets 1.66s for this source change. Note that ARRAY_SIZE is known to be a multiple of 4, so I didn't include any cleanup code to handle the last up-to-3 elements (or to avoid reading past the end of the array, which would happen as written now). It's really easy to get something wrong and read past the end of the array when doing this.
gcc, on the other hand, does vectorize this, but it also pessimises (un-optimises) the inner loop into a single dependency chain. I think it's doing multiple iterations of the outer loop, again.
Using gcc's platform-independent vector extensions, I wrote a version which compiles into apparently-optimal code:
// compile with gcc -g -Wall -std=gnu11 -Ofast -fno-tree-vectorize -march=native fast-loop-cs201.vec.c -o fl3-vec
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>
#include <string.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
long int help = 0;
typedef double v4df __attribute__ ((vector_size (8*4)));
v4df sum0={0}, sum1={0}, sum2={0}, sum3={0};
const size_t array_bytes = ARRAY_SIZE*sizeof(double);
double *aligned_array = NULL;
// this more-than-declaration could go in an if(i == 0) block for strict compliance with the rules
if ( posix_memalign((void**)&aligned_array, 32, array_bytes) ) {
exit (1);
}
memcpy(aligned_array, array, array_bytes); // In this one case: faster to align once and have no extra overhead for N_TIMES through the loop
// ... and this one.
// Please change 'your name' to your actual name.
printf("CS201 - Asgmt 4 - I. Forgot\n");
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
/*
#if defined(__GNUC__) && (__GNUC__ * 100 + __GNUC_MINOR__) >= 407 // GCC 4.7 or later.
array = __builtin_assume_aligned(array, 32);
#else
// force-align for other compilers. This loop-invariant will be done outside the loop.
array = (double*) ((ptrdiff_t)array & ~31);
#endif
*/
assert ( ARRAY_SIZE / (4*4) == (ARRAY_SIZE+15) / (4*4) ); // We don't have a cleanup loop to handle where the array size isn't a multiple of 16
// incrementing pointers can be more efficient than indexing arrays
// esp. on recent Intel where micro-fusion only works with one-register addressing modes
// of course, the compiler can always generate pointer-incrementing asm from array-indexing source
const double *start = aligned_array;
while ( (ptrdiff_t)start & 31 ) {
// annoying loops like this are the reason people use aligned buffers
sum += *start++; // scalar until we reach 32B alignment
// in practice, this loop doesn't run, because we copy into an aligned buffer
// This will also require a cleanup loop, and break our multiple-of-16 doubles assumption.
}
const v4df *end = (v4df *)(aligned_array+ARRAY_SIZE);
for (const v4df *p = (v4df *)start ; p+3 < end; p+=4) {
sum0 += p[0]; // p+=4 increments the pointer by 4 * 4 * 8 bytes
sum1 += p[1]; // make sure you keep track of what you're incrementing
sum2 += p[2];
sum3 += p[3];
}
// the compiler might be smart enough to pull this out of the inner loop
// in fact, gcc turns this into a 64bit movabs outside of both loops :P
help+= ARRAY_SIZE;
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
/* You could argue legalese and say that
if (i == 0) {
for (j ...)
sum += array[j];
sum *= N_TIMES;
}
* still does as many adds in its *INNER LOOP*, but it just doesn't run it as often
*/
}
// You can add some final code between this comment ...
sum0 = (sum0 + sum1) + (sum2 + sum3);
sum += sum0[0] + sum0[1] + sum0[2] + sum0[3];
printf("sum = %g; help=%ld\n", sum, help); // defeat the compiler.
free (aligned_array);
free (array); // not strictly necessary, because this is the end of main(). Leaving it out for this special case is a bad example for a CS class, though.
// ... and this one.
return 0;
}
The inner loop compiles to:
4007c0: c5 e5 58 19 vaddpd (%rcx),%ymm3,%ymm3
4007c4: 48 83 e9 80 sub $0xffffffffffffff80,%rcx # subtract -128, because -128 fits in imm8 instead of requiring an imm32 to encode add $128, %rcx
4007c8: c5 f5 58 49 a0 vaddpd -0x60(%rcx),%ymm1,%ymm1 # one-register addressing mode can micro-fuse
4007cd: c5 ed 58 51 c0 vaddpd -0x40(%rcx),%ymm2,%ymm2
4007d2: c5 fd 58 41 e0 vaddpd -0x20(%rcx),%ymm0,%ymm0
4007d7: 4c 39 c1 cmp %r8,%rcx # compare with end with p
4007da: 75 e4 jne 4007c0 <main+0xb0>
(For more, see online compiler output at godbolt. Note I had to cast the return value of calloc, because godbolt uses C++ compilers, not C compilers. The inner loop is from .L3 to jne .L3. See https://stackoverflow.com/tags/x86/info for x86 asm links. See also Micro fusion and addressing modes, because this Sandybridge change hasn't made it into Agner Fog's manuals yet.).
performance:
$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,stalled-cycles-frontend,stalled-cycles-backend,L1-dcache-load-misses,cache-misses ./fl3-vec
CS201 - Asgmt 4 - I. Forgot
sum = 0; help=6000000000
Performance counter stats for './fl3-vec':
1086.571078 task-clock (msec) # 1.000 CPUs utilized
4,072,679,849 cycles # 3.748 GHz
2,629,419,883 instructions # 0.65 insns per cycle
# 1.27 stalled cycles per insn
4,028,715,968 r1b1 # 3707.733 M/sec # unfused uops
2,257,875,023 r10e # 2077.982 M/sec # fused uops. lower than insns because of macro-fusion
3,328,275,626 stalled-cycles-frontend # 81.72% frontend cycles idle
1,648,011,059 stalled-cycles-backend # 40.47% backend cycles idle
751,736,741 L1-dcache-load-misses # 691.843 M/sec
18,772 cache-misses # 0.017 M/sec
1.086925466 seconds time elapsed
I still don't know why it's getting such low instructions per cycle. The inner loop is using 4 separate accumulators, and I checked with gdb that the pointers are aligned. So cache-bank conflicts shouldn't be the problem. Sandybridge L2 cache can sustain one 32B transfers per cycle, which should keep up with the one 32B FP vector add per cycle.
Loads 32B loads from L1 take 2 cycles (it wasn't until Haswell that Intel made 32B loads a single-cycle operation). However, there are 2 load ports, so the sustained throughput is 32B per cycle (which we're not reaching).
Perhaps the loads need to be pipelined ahead of when they're used, to minimize having the ROB (re-order buffer) fill up when a load stalls? But the perf counters indicate a fairly high L1 cache hit rate, so hardware prefetch from L2 to L1 seems to be doing its job.
0.65 instructions per cycle is only about half way to saturating the vector FP adder. This is frustrating. Even IACA says the loop should run in 4 cycles per iteration. (i.e. saturate the load ports and port1 (where the FP adder lives)) :/
update: I guess L2 latency was the problem after all. Reducing ARRAY_SIZE to 1008 (multiple of 16), and increasing N_TIMES by a factor of 10, brought the runtime down to 0.5s. That's 1.68 insns per cycle. (The inner loop is 7 total instructions for 4 FP adds, thus we are finally saturating the vector FP add unit, and the load ports.) IDK why the HW prefetcher can't get ahead after one stall, and then stay ahead. Possibly software prefetch could help? Maybe somehow avoid having the HW prefetcher run past the array, and instead start prefetching the start of the array again. (loop tiling is a much better solution, see below.)
Intel CPUs only have 32k each L1-data and L1-instruction caches. I think your array would just barely fit in the L1 on an AMD CPU.
Gcc's attempt to vectorize by broadcasting the same value into a parallel add doesn't seem so crazy. If it had managed to get this right (using multiple accumulators to hide latency), that would have allowed it to saturate the vector FP adder with only half the memory bandwidth. As-is, it was pretty much a wash, probably because of overhead in broadcasting.
Also, it's pretty silly. The N_TIMES is a just a make-work repeat. We don't actually want to optimize for doing the identical work multiple times. Unless we want to win at silly assignments like this. A source-level way to do this would be to increment i in the part of the code we're allowed to modify:
for (...) {
sum += a[j] + a[j] + a[j] + a[j];
}
i += 3; // The inner loop does 4 total iterations of the outer loop
More realistically, to deal with this you could interchange your loops (loop over the array once, adding each value N_TIMES times). I think I've read that Intel's compiler will sometimes do that for you.
A more general technique is called cache blocking, or loop tiling. The idea is to work on your input data in small blocks that fit in cache. Depending on your algorithm, it can be possible to do various stages of thing on a chunk, then repeat for the next chunk, instead of having each stage loop over the whole input. As always, once you know the right name for a trick (and that it exists at all), you can google up a ton of info.
You could rules-lawyer your way into putting an interchanged loop inside an if (i == 0) block in the part of the code you're allowed to modify. It would still do the same number of additions, but in a more cache-optimal order.
I would try this for the inner loop:
double* tmp = array;
for (j = 0; j < ARRAY_SIZE; j++) {
sum += *tmp; // Use a pointer
tmp++; // because it is faster to increment the pointer
// than it is to recalculate array+j every time
help++;
}
or better
double* tmp = array;
double* end = array + ARRAY_SIZE; // Get rid of variable j by calculating
// the end criteria and
while (tmp != end) { // just compare if the end is reached
sum += *tmp;
tmp++;
help++;
}
I think You should read about openmp library if You could use multithreaded. But this is so simple example that I think could not be optimized.
Certain thing is that You don't need to declare i and j before for loop. That would do:
for (int i = 0; i < N_TIMES; i++)

GCC auto-vectorization has no effect on runtime, even when supposedly "profitable"

I've spent the past few days reading about autovectorization with gcc 4.7. I followed some examples I saw online, and the setup seems to be correct. But when I actually run with the code and compare between vectorization on or off, there isn't a noticeable difference in runtime.
Here's the code I've been working with:
#include <string.h>
#include <stdlib.h>
#include <emmintrin.h>
#include <stdio.h>
#include <math.h>
int main(int argc, char** argv) {
long b = strtol(argv[2], NULL, 0);
unsigned long long int i;
unsigned long long int n = (int)pow(2,29);
float total = 0;
float *__restrict__ x1;
float *__restrict__ y1;
posix_memalign((void *)&x1, 16, sizeof(float)*n);
posix_memalign((void *)&y1, 16, sizeof(float)*n);
float *__restrict__ x = __builtin_assume_aligned(x1,16);
float *__restrict__ y = __builtin_assume_aligned(y1,16);
for (i=0;i<n;i++) {
x[i] = i;
y[i] = i;
}
for (i=0; i<n; i++) {
y[i] += x[i];
}
printf("y[%li]: \t\t\t\t%f\n", b,y[b]);
printf("correct answer: \t\t\t%f\n", (b)*2);
return 0;
}
Some of this stuff seems redundant to me, but was necessary to get the compiler to understand what was going on (especially the fact that the data were aligned). The "b" variable that's read from command line is just there because I was paranoid about the compiler optimizing away the loop entirely.
Here is the compiler command when vectorizeration is enabled:
gcc47 -ftree-vectorizer-verbose=3 -msse2 -lm -O2 -finline-functions -funswitch-loops -fpredictive-commoning -fgcse-after-reload -fipa-cp-clone test.c -ftree-vectorize -o v
Basically, this is equivalent to just using -O3. I put the flags in myself so that all I needed to do was remove "ftree-vectorize" and be able to test the result sans vectorization.
Here is the output of the ftree-vectorize-verbose flag, to show that the code is in fact being vectorized:
Analyzing loop at test.c:29
29: vect_model_load_cost: aligned.
29: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
29: vect_model_load_cost: aligned.
29: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
29: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
29: vect_model_store_cost: aligned.
29: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
29: cost model: Adding cost of checks for loop versioning aliasing.
29: Cost model analysis:
Vector inside of loop cost: 4
Vector outside of loop cost: 4
Scalar iteration cost: 4
Scalar outside cost: 1
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 2
29: Profitability threshold = 3
Vectorizing loop at test.c:29
29: Profitability threshold is 3 loop iterations.
29: created 1 versioning for alias checks.
29: LOOP VECTORIZED.
Analyzing loop at test.c:24
24: vect_model_induction_cost: inside_cost = 2, outside_cost = 2 .
24: vect_model_simple_cost: inside_cost = 2, outside_cost = 0 .
24: not vectorized: relevant stmt not supported: D.5806_18 = (float) D.5823_58;
test.c:7: note: vectorized 1 loops in function.
Note that the vectorization is profitable after 3 iterations, and I'm running with 2^29~=500,000,000 iterations. So I should expect a vastly different runtime with vectorization turned off, right?
Well, here are the runtimes of the code (I ran it 20 times in a row):
59.082s
79.385s
57.557s
57.264s
53.588s
54.300s
53.645s
69.044s
57.238s
59.366s
56.314s
55.224s
57.308s
57.682s
56.083s
369.590s
59.963s
55.683s
54.979s
62.309s
Throwing away that weird ~370s outlier, that gives a mean runtime of 58.7s, with a standard deviation of 6.0s.
Next, I'll compile with the same command as before, but with no -ftree-vectorize flag:
gcc47 -ftree-vectorizer-verbose=3 -msse2 -lm -O2 -finline-functions -funswitch-loops -fpredictive-commoning -fgcse-after-reload -fipa-cp-clone test.c -o nov
Again running the program 20 times in a row yields the following times:
69.471s
57.134s
56.240s
57.040s
55.787s
56.530s
60.010s
60.187s
324.227s
56.377s
55.337s
54.110s
56.164s
59.919s
493.468s
63.876s
57.389s
55.553s
54.908s
56.828s
Again throwing away outliers, this gives a mean runtimee of 57.9s with a standard deviation of 3.6s.
So these two versions have statistically indistinguishable runtimes.
Can anyone point me to what I'm doing wrong? Does the "profitability threshold" spit out by the compiler not mean what I think it means? I really appreciate any help people can give me, I've been trying to figure this out for the past week.
EDIT:
I implemented the change that #nilspipenbrinck suggested, and it seems to have worked. I stuck the vectorized loop in a function, and called that function a boatload of times. The relative run-times are now 24.0s (sigma of <0.1s) for no vectorization vs 20.8s (sigma of <0.2s) for vectorization, or a 13% speed improvement. Not as much as I was hoping for, but at least now I know its working! Thanks for taking the time to look at my question and write an answer, I really appreciate it.
You don't do much arithmetic. Therefore the runtime of your test code is memory bound. E.g. you spend most of the time by moving the data between the CPU and memory.
Furthermore your n is very large with 2^29 elements. Therefore you don't benefit from the first and second level cache in any way.
If you want to see improvements with SSE, use a smaller n such that you only touch 8 or 16 kilobyte of data. Also make sure that the data is 'hot' e.g. it has recently been accessed by the CPU. That way the data does not have to be moved from main memory but it gets moved from the caches which is several magnitudes faster.
As an alternative you could also do a lot more arithmetic. This would give the memory prefetch system a chance to fetch the data from main memory in the background while you utilize the CPU doing math.
Summarized: If the arithmetic is faster than your system can move the memory around you will not see any benefits. Memory access times will be the bottleneck and the few cycles you save using the SSE instruction set will get lost in the noise of memory access timings.
There are several factors that determine how profitable will be to vectorize code. In this case (basing this on the output you provided) the compiler is only vectorizing one loop, I would think that's the second one because the first one would be typically ignored since there is not enough computation being done for it to be profitable to vectorize.
The running times you are posting are for the whole code, not for the loop alone, so there is only so much vectorizing will do for the overall running time. If you really want to see how much improvement is there from vectorization I would suggest running a profiler such as AMD Code XL, Intel Vtune, OProfile and such, it will tell you specifically for that loop how much improvement in terms of time and performance you are making.
Right now I'm working in evaluation of vectorizing compilers, and I would code running upt to 60 times faster with vectorization, other times the speedup is not as impressive, and it all depends on the loop, the compiler and the architecture you are using.

Resources