cilk plus array notation not vectorized with gcc 4.9.0 - c

I'm trying to figure out why gcc 4.9.0 won't vectorize a simple array addition when using gcc 4.9.0, using -O -ftree-vectorize:
int a[256], b[256], c[256];
foo () {
int i;
a[:] = b[:] + c[:];
}
From looking at the assembler produced this loop hasn't been vectorized and with the -fopt-info-vec-all flag I get a lot of output telling me why vectorization failed, beginning with:
>testvec.c:5: note: ===== analyze_loop_nest =====
>testvec.c:5: note: === vect_analyze_loop_form ===
>testvec.c:5: note: not vectorized: control flow in loop.
>testvec.c:5: note: bad loop form.
which is puzzling, since there's not control flow in the loop. Vectorization of the for loop using standard array notation for the same operation works fine.

It looks like only the latest version of GCC (6.1) can vectorize your example:
http://melpon.org/wandbox/permlink/LOIweYNRRLXeJsZf

Related

Why doesn't the loop vectorize after certain number of elements?

I have made a matrix-vector multiplication function which is nicely auto-vectorized, when array is under 16x16 size, compiling with GCC 11.2 results in vectorized code:
#define no_el 16
void mv_mul(float arr[no_el][no_el],float a1[no_el],float a2[no_el])
{
for(int i=0;i<no_el;i++)
{
a2[i]=0;
for(int j=0;j<no_el;j++)
{
a2[i]+=arr[i][j]*a1[j];
}
}
}
However, when I increase number of elements to 24, 32, etc. the compiler gives these warnings:
Output of x86-64 gcc 11.2 (Compiler #1)
<source>:7:18: missed: couldn't vectorize loop
<source>:12:11: missed: not vectorized: complicated access pattern.
<source>:10:18: missed: couldn't vectorize loop
<source>:12:11: missed: not vectorized: complicated access pattern.
<source>:5:7: note: vectorized 0 loops in function.
<source>:15:1: note: ***** Analysis failed with vector mode V8SF
<source>:15:1: note: ***** Skipping vector mode V32QI, which would repeat the analysis for V8SF
And the code is not vectorized.
Is there anything that can be done ?
As tstanisl commented, adding restrict keyword solved the issue.

Force the compiler duplicate c inline function code which been called from inside a loop [duplicate]

How can I tell GCC to unroll a particular loop?
I have used the CUDA SDK where loops can be unrolled manually using #pragma unroll. Is there a similar feature for gcc? I googled a bit but could not find anything.
GCC gives you a few different ways of handling this:
Use #pragma directives, like #pragma GCC optimize ("string"...), as seen in the GCC docs. Note that the pragma makes the optimizations global for the remaining functions. If you used #pragma push_options and pop_options macros cleverly, you could probably define this around just one function like so:
#pragma GCC push_options
#pragma GCC optimize ("unroll-loops")
//add 5 to each element of the int array.
void add5(int a[20]) {
int i = 19;
for(; i > 0; i--) {
a[i] += 5;
}
}
#pragma GCC pop_options
Annotate individual functions with GCC's attribute syntax: check the GCC function attribute docs for a more detailed dissertation on the subject. An example:
//add 5 to each element of the int array.
__attribute__((optimize("unroll-loops")))
void add5(int a[20]) {
int i = 19;
for(; i > 0; i--) {
a[i] += 5;
}
}
Note: I'm not sure how good GCC is at unrolling reverse-iterated loops (I did it to get Markdown to play nice with my code). The examples should compile fine, though.
GCC 8 has gained a new pragma that allows you to control how loop unrolling is done:
#pragma GCC unroll n
Quoting from the manual:
You can use this pragma to control how many times a loop should be
unrolled. It must be placed immediately before a for, while or do loop
or a #pragma GCC ivdep, and applies only to the loop that follows. n
is an integer constant expression specifying the unrolling factor. The
values of 0 and 1 block any unrolling of the loop.
-funroll-loops might be helpful (though it turns on loop-unrolling globally, not per-loop). I'm not sure whether there's a #pragma to do the same...

Initializing array in C - execution time

int a[5] = {0};
VS
typedef struct
{
int a[5];
} ArrStruct;
ArrStruct arrStruct;
sizeA = sizeof(arrStruct.a)/sizeof(int);
for (it = 0 ; it < sizeA ; ++it)
arrStruct.a[it] = 0;
Does initializing by for loop takes more execution time? if so, why?
It depends upon the compiler and the optimization flags.
On recent GCC (e.g. 4.8 or 4.9) with gcc -O3 (or probably even -O1 or -O2) it should not matter, since the same code would be emitted (GCC has even an optimization which would transform your loop into a builtin_memset which would be further optimized).
On some compilers, it could happen that the int a[5] = {0}; might be faster, because the compiler might emit e.g. vector instruction (or on x86 a rep stosw) to clear an array.
The best thing is to examine the generated (gimple representation and) assembler code (e.g. with gcc -fdump-tree-gimple -O3 -fverbose-asm -mtune=native -S) and to benchmark. Most of the cases it does not matter. Be sure to enable optimizations when compiling.
Generally, don't care about such micro-optimization; a good optimizing compiler is better than you have time to code.
It depends on the scope of the variables. For a static or global variable, the first initialization
int a[5]={0};
may be done at compile time, while the loop is run at, well, run time. Thus there is no "execution" associated with the former.
You may find the discussion of this question (and in particular this answer ) interesting.

Why is clang optimizing out my array even when using -O0 flag?

I am trying to debug the following C Program using GDB:
// Program to generate a user specified number of
// fibonacci numbers using variable length arrays
// Chapter 7 Program 8 2013-07-14
#include <stdio.h>
int main(void)
{
int i, numFibs;
printf("How many fibonacci numbers do you want (between 1 and 75)?\n");
scanf("%i", &numFibs);
if (numFibs < 1 || numFibs > 75)
{
printf("Between 1 and 75 remember?\n");
return 1;
}
unsigned long long int fibonacci[numFibs];
fibonacci[0] = 0; // by definition
fibonacci[1] = 1; // by definition
for(i = 2; i < numFibs; i++)
fibonacci[i] = fibonacci[i-2] + fibonacci[i-1];
for(i = 0; i < numFibs; i++)
printf("%llu ", fibonacci[i]);
printf("\n");
return 0;
}
The issue I am having is when trying to compile the code using:
clang -ggdb3 -O0 -Wall -Werror 7_8_FibonacciVarLengthArrays.c
When I try to run gdb on the a.out file created and I am stepping through the program execution. Anytime after the fibonacci[] array is decalared and I type:
info locals
the result says fibonacci <value optimized out> (until after the first iteration of my for loop) which then results in fibonacci holding the address 0xbffff128 for the rest of the program (but dereferencing that address does not appear to contain any meaningful data).
I am just confused why clang appears to be optimizing out this array when the -O0 flag is used?
I can use gcc to compile this code and the value displays as expected when using GDB....
Any thoughts?
Thank you.
You don't mention which version of clang you are using. I tried it with both 3.2 and a recent SVN install (3.4).
The code generated by the two versions looks pretty similar to me, but the debugging information is different. The clang 3.2 (which comes from a default ubuntu 13.04 install) produces an error when I try to examine fibonacci in gdb:
fibonacci = <error reading variable fibonacci (DWARF-2 expression error: DW_OP_reg operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.)>
In the code compiled with clang 3.4, it all works fine. In neither case is the array "optimized out"; it's clearly allocated on the stack.
So I suspect the oddity that you're seeing has more to do with the emission of debugging information than with the actual code.
gdb does not yet support debugging stack allocated variable-length arrays. See https://sourceware.org/gdb/wiki/VariableLengthArray
Use a compile time constant or malloc to allocate fibonacci so that it will be visible to gdb.
See also GDB reports "no symbol in current context" upon array initialization
clang is not "optimizing out" the array at all! The array is declared as a variable-length array on the stack, so it has to be explicitly allocated (using techniques similar to those used by alloca()) when its declaration is reached. The starting address of the array is unknown until that process is complete.

Tell gcc to specifically unroll a loop

How can I tell GCC to unroll a particular loop?
I have used the CUDA SDK where loops can be unrolled manually using #pragma unroll. Is there a similar feature for gcc? I googled a bit but could not find anything.
GCC gives you a few different ways of handling this:
Use #pragma directives, like #pragma GCC optimize ("string"...), as seen in the GCC docs. Note that the pragma makes the optimizations global for the remaining functions. If you used #pragma push_options and pop_options macros cleverly, you could probably define this around just one function like so:
#pragma GCC push_options
#pragma GCC optimize ("unroll-loops")
//add 5 to each element of the int array.
void add5(int a[20]) {
int i = 19;
for(; i > 0; i--) {
a[i] += 5;
}
}
#pragma GCC pop_options
Annotate individual functions with GCC's attribute syntax: check the GCC function attribute docs for a more detailed dissertation on the subject. An example:
//add 5 to each element of the int array.
__attribute__((optimize("unroll-loops")))
void add5(int a[20]) {
int i = 19;
for(; i > 0; i--) {
a[i] += 5;
}
}
Note: I'm not sure how good GCC is at unrolling reverse-iterated loops (I did it to get Markdown to play nice with my code). The examples should compile fine, though.
GCC 8 has gained a new pragma that allows you to control how loop unrolling is done:
#pragma GCC unroll n
Quoting from the manual:
You can use this pragma to control how many times a loop should be
unrolled. It must be placed immediately before a for, while or do loop
or a #pragma GCC ivdep, and applies only to the loop that follows. n
is an integer constant expression specifying the unrolling factor. The
values of 0 and 1 block any unrolling of the loop.
-funroll-loops might be helpful (though it turns on loop-unrolling globally, not per-loop). I'm not sure whether there's a #pragma to do the same...

Resources