Here is an excerpt from my C code. I hope I copied all the relevant parts.
#define SIN_LEN 22050
#define CALC_N 4100
#define CHUNK_LEN 22050
float __attribute__((aligned(16))) sin_array[SIN_LEN];
float __attribute__((aligned(16))) cos_array[SIN_LEN];
short __attribute__((aligned(16))) data[CHUNK_LEN];
size_t __attribute__((aligned(16))) a1[CHUNK_LEN];
float sinus_sum __attribute__((aligned(16)));
float cosinus_sum __attribute__((aligned(16)));
float __attribute__((aligned(16))) temp1[CHUNK_LEN];
float __attribute__((aligned(16))) temp2[CHUNK_LEN];
void sin_summ(long n, float* restrict sin_s, float* restrict cos_s ) {
sinus_sum = 0;
cosinus_sum = 0;
for (int i = 0; i < CHUNK_LEN; i++ )
a1[i] = (n*i) % SIN_LEN;
for (int i = 0; i < CHUNK_LEN; i++) {
temp1[i] = sin_array[a1[i]];
temp2[i] = cos_array[a1[i]];
}
for (int i = 0; i < CHUNK_LEN; i++) {
sinus_sum += data[i] * temp1[i];
cosinus_sum += data[i] * temp2[i];
}
*sin_s = sinus_sum;
*cos_s = cosinus_sum;
return;
}
Function sin_summ is called in a loop too. The GCC parameters are -O3 -march=armv7-a -mtune=cortex-a8 -mfloat-abi=hard -ffast-math -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=1 -funsafe-math-optimizations -fstrict-aliasing -Wstrict-aliasing. The architecture is armv7-neon (BeagleBone Black).
The problem is that loops #1 and #3 are vectorized, but #2 is not. I guess it is because of the [a1[i]] indexing. Can somebody please suggest how to make it vectorizable? Right now a single non-vectorized loop runs 40% faster than what is written here.
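For reference, the single-loop variant mentioned above could look roughly like the sketch below. It is not the poster's actual code: the name sin_summ_fused and the parameter list are made up for illustration, and it assumes n is non-negative. ARMv7 NEON has no gather-load instruction, so the indexed loads stay scalar either way. What the sketch does remove is the a1/temp1/temp2 round trip through memory and the per-element modulo, by stepping the table index incrementally.

```c
#include <assert.h>
#include <stddef.h>

#define SIN_LEN 22050
#define CHUNK_LEN 22050

/* Hypothetical fused version of sin_summ: one pass, no temporary arrays.
 * Assumption: n >= 0, so (n * i) % SIN_LEN can be stepped incrementally. */
void sin_summ_fused(long n, const short *restrict data,
                    const float *restrict sin_tab,
                    const float *restrict cos_tab,
                    float *restrict sin_s, float *restrict cos_s)
{
    assert(n >= 0);
    size_t step = (size_t)(n % SIN_LEN); /* index advance per sample */
    size_t idx = 0;                      /* always equals (n * i) % SIN_LEN */
    float s = 0.0f, c = 0.0f;

    for (size_t i = 0; i < CHUNK_LEN; i++) {
        s += data[i] * sin_tab[idx];
        c += data[i] * cos_tab[idx];
        idx += step;
        if (idx >= SIN_LEN)              /* idx + step < 2 * SIN_LEN */
            idx -= SIN_LEN;
    }
    *sin_s = s;
    *cos_s = c;
}
```

Whether this beats the three-loop version on a Cortex-A8 would have to be measured; the sketch only shows the index arithmetic.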
Related
I have just begun playing around with vectorising my code. My matrix-vector multiplication code is not being auto-vectorised by gcc, and I'd like to know why. This pastebin contains the output from -fopt-info-vec-missed.
I'm having trouble understanding what the output is telling me and seeing how it matches up with what I've written in code.
For instance, I see a number of lines saying "not enough data-refs in basic block", and I can't find much detail about this online with a Google search. I also see that there are issues relating to memory alignment, e.g. "Unknown misalignment, naturally aligned" and "vector alignment may not be reachable". All of my memory allocation was for double types using malloc, which I believed was guaranteed to be aligned for that type.
Environment: compiling with gcc on WSL2
gcc -v: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
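On the alignment point: malloc only guarantees alignment suitable for the standard scalar types (typically 16 bytes with glibc on x86-64), which is less than the 32 bytes an AVX vector covers. That may be what the alignment notes refer to. A minimal sketch of requesting 32-byte alignment with C11 aligned_alloc follows; the helper name alloc_doubles_aligned32 is hypothetical, and the sketch does not check for overflow in the size computation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate `count` doubles on a 32-byte boundary.
 * Release the result with free(). Returns NULL on failure. */
double *alloc_doubles_aligned32(size_t count)
{
    size_t bytes = count * sizeof(double);
    /* C11 aligned_alloc expects the size to be a multiple of the alignment */
    size_t rounded = (bytes + 31) & ~(size_t)31;
    double *p = aligned_alloc(32, rounded);
    assert(p == NULL || (uintptr_t)p % 32 == 0);
    return p;
}
```

The compiler cannot see this alignment through a plain double * parameter, so it may still emit unaligned loads; on recent x86 cores those usually cost little when the data happens to be aligned.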
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define N 4000 // Matrix size will be N x N
#define T 1
//gcc -fopenmp -g vectorisation.c -o main -O3 -march=native -fopt-info-vec-missed=missed.txt
void doParallelComputation(double *A, double *V, double *results, unsigned long matrixSize, int numThreads)
{
omp_set_num_threads(numThreads);
unsigned long i, j;
#pragma omp parallel for simd private(j)
for (i = 0; i < matrixSize; i++)
{
// double *AHead = &A[i * matrixSize];
// double tmp = 0;
for (j = 0; j < matrixSize; j++)
{
results[i] += A[i * matrixSize + j] * V[j];
// also tried tmp += A[i * matrixSize + j] * V[j];
}
// results[i] = tmp;
}
}
void genRandVector(double *S, unsigned long size)
{
srand(time(0));
unsigned long i;
for (i = 0; i < size; i++)
{
double n = rand() % 5;
S[i] = n;
}
}
void genRandMatrix(double *A, unsigned long size)
{
srand(time(0));
unsigned long i, j;
for (i = 0; i < size; i++)
{
for (j = 0; j < size; j++)
{
double n = rand() % 5;
A[i*size + j] = n;
}
}
}
int main(int argc, char *argv[])
{
double *V = (double *)malloc(N * sizeof(double)); // v in our A*v = parV computation
double *parV = (double *)malloc(N * sizeof(double)); // Parallel computed vector
double *A = (double *)malloc(N * N * sizeof(double)); // NxN Matrix to multiply by V
genRandVector(V, N);
doParallelComputation(A, V, parV, N, T);
free(parV);
free(A);
free(V);
return 0;
}
Adding double *restrict results to promise non-overlapping input/output helped, without OpenMP but with -ffast-math. https://godbolt.org/z/qaPh1v
You need to tell OpenMP about reductions specifically, to let it relax FP-math associativity. (-ffast-math doesn't help the OpenMP vectorizer). With that as well, we get what you want:
#pragma omp simd reduction(+:tmp)
With just restrict and no -ffast-math or -fopenmp, you get total garbage: it does a SIMD FP multiply, but then unpacks that for 4x vaddsd into the scalar accumulator, not helping hide FP latency at all.
With restrict and -fopenmp (without fast-math), it just does scalar FMA.
With restrict and -ffast-math (without -fopenmp or #pragma commented) it auto-vectorizes nicely: vfmadd231pd ymm inside the loop, shuffle / add horizontal sum outside. (But doesn't parallelize). https://godbolt.org/z/f36oG3
With restrict and -ffast-math (with -fopenmp) it still doesn't auto-vectorize. The OpenMP vectorizer is different, and maybe doesn't take advantage of fast-math, instead needing you to tell it about reductions?
Also note that with your data layout, the loop you want to parallelize (outer) is different from the loop you want to vectorize with SIMD (inner). Both the input "vectors" for the inner dot-product loop are in contiguous memory, so it makes the most sense to read those, instead of trying to SIMD shuffle data from 4 different rows into one vector to accumulate 4 results[i+0..3] values in 1 vector.
However, unrolling the outer loop by 4 to use each V[j+0..3] with data from 4 different rows would improve computational intensity (closer to 1 load per FMA, rather than 2).
(As long as V[] and a row of the matrix fits in L1d cache, this is good. If not, it's actually pretty bad and should get cache-blocked. Or actually if you unroll the outer loop, 4 rows of the matrix.)
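A rough sketch of that outer-loop unrolling, without OpenMP or cache blocking, could look like this. The function name matvec_unroll4 is hypothetical, and to keep it short the sketch assumes n is a multiple of 4. Each V[j] is loaded once and feeds four multiply-adds, one per row.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: results = A * V for a row-major n x n matrix,
 * processing 4 rows per outer iteration. Assumes n % 4 == 0. */
void matvec_unroll4(const double *restrict A, const double *restrict V,
                    double *restrict results, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        const double *r0 = &A[(i + 0) * n];
        const double *r1 = &A[(i + 1) * n];
        const double *r2 = &A[(i + 2) * n];
        const double *r3 = &A[(i + 3) * n];
        double t0 = 0, t1 = 0, t2 = 0, t3 = 0;

        for (size_t j = 0; j < n; j++) {
            double v = V[j];            /* one load shared by four rows */
            t0 += r0[j] * v;
            t1 += r1[j] * v;
            t2 += r2[j] * v;
            t3 += r3[j] * v;
        }
        results[i + 0] = t0;            /* write-only, no zero-init needed */
        results[i + 1] = t1;
        results[i + 2] = t2;
        results[i + 3] = t3;
    }
}
```

Whether the compiler vectorizes the inner loop here still depends on the flags discussed above (-ffast-math, or an OpenMP simd reduction over all four accumulators).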
Also note that double tmp = 0; would be a good idea: your current version adds into results[i], reading it before writing. That would require zero-initializing results before you could use it as a pure output.
Auto-vec auto-par version:
I think this is correct; the asm looks like it auto-parallelized as well as auto-vectorizing the inner loop.
void doParallelComputation(double *restrict A, double *restrict V, double *restrict results, unsigned long matrixSize, int numThreads)
{
omp_set_num_threads(numThreads);
unsigned long i, j;
#pragma omp parallel for private(j)
for (i = 0; i < matrixSize; i++)
{
// double *AHead = &A[i * matrixSize];
double tmp = 0;
// TODO: unroll outer loop and cache-block it.
#pragma omp simd reduction(+:tmp)
for (j = 0; j < matrixSize; j++)
{
//results[i] += A[i * matrixSize + j] * V[j];
tmp += A[i * matrixSize + j] * V[j]; //
}
results[i] = tmp; // write-only to results, not adding to old value.
}
}
Compiles (Godbolt) with a vectorized inner loop inside the OpenMPified helper function doParallelComputation._omp_fn.0:
# gcc7.5 -xc -O3 -fopenmp -march=skylake
.L6:
add rdx, 1 # loop counter; newer GCC just compares the end-pointer
vmovupd ymm2, YMMWORD PTR [rcx+rax] # 32-byte load
vfmadd231pd ymm0, ymm2, YMMWORD PTR [rsi+rax] # 32-byte memory-source FMA
add rax, 32 # pointer increment
cmp rdi, rdx
ja .L6
Then a horizontal sum of mediocre efficiency after the loop; unfortunately the OpenMP vectorizer isn't as smart as the "normal" -ftree-vectorize vectorizer, but that requires -ffast-math to do anything here.
I test the following simple function
void mul(double *a, double *b) {
for (int i = 0; i<N; i++) a[i] *= b[i];
}
with very large arrays so that it is memory bandwidth bound. The test code I use is below. When I compile with -O2 it takes 1.7 seconds. When I compile with -O2 -mavx it takes only 1.0 seconds. The non vex-encoded scalar operations are 70% slower! Why is this?
Here is the assembly for -O2 and -O2 -mavx.
https://godbolt.org/g/w4p60f
System: i7-6700HQ @ 2.60GHz (Skylake), 32 GB mem, Ubuntu 16.10, GCC 6.3
Test code
//gcc -O2 -fopenmp test.c
//or
//gcc -O2 -mavx -fopenmp test.c
#include <string.h>
#include <stdio.h>
#include <x86intrin.h>
#include <omp.h>
#define N 1000000
#define R 1000
void mul(double *a, double *b) {
for (int i = 0; i<N; i++) a[i] *= b[i];
}
int main() {
double *a = (double*)_mm_malloc(sizeof *a * N, 32);
double *b = (double*)_mm_malloc(sizeof *b * N, 32);
//b must be initialized to get the correct bandwidth!!!
memset(a, 1, sizeof *a * N);
memset(b, 1, sizeof *b * N);
double dtime;
const double mem = 3.0*sizeof(double)*N*R/1024/1024/1024; // 3.0 avoids truncating integer division
const double maxbw = 34.1;
dtime = -omp_get_wtime();
for(int i=0; i<R; i++) mul(a,b);
dtime += omp_get_wtime();
printf("time %.2f s, %.1f GB/s, efficency %.1f%%\n", dtime, mem/dtime, 100*mem/dtime/maxbw);
_mm_free(a), _mm_free(b);
}
The problem is related to a dirty upper half of an AVX register after calling omp_get_wtime(). This is a problem particularly for Skylake processors.
The first time I read about this problem was here. Since then other people have observed this problem: here and here.
Using gdb I found that omp_get_wtime() calls clock_gettime. I rewrote my code to use clock_gettime() and I see the same problem.
void fix_avx() { __asm__ __volatile__ ( "vzeroupper" : : : ); }
void fix_sse() { }
void (*fix)();
double get_wtime() {
struct timespec time;
clock_gettime(CLOCK_MONOTONIC, &time);
#ifndef __AVX__
fix();
#endif
return time.tv_sec + 1E-9*time.tv_nsec;
}
void dispatch() {
fix = fix_sse;
#if defined(__INTEL_COMPILER)
if (_may_i_use_cpu_feature (_FEATURE_AVX)) fix = fix_avx;
#else
#if defined(__GNUC__) && !defined(__clang__)
__builtin_cpu_init();
#endif
if(__builtin_cpu_supports("avx")) fix = fix_avx;
#endif
}
Stepping through code with gdb I see that the first time clock_gettime is called it calls _dl_runtime_resolve_avx(). I believe the problem is in this function based on this comment. This function appears to only be called the first time clock_gettime is called.
With GCC the problem goes away when I use __asm__ __volatile__ ( "vzeroupper" : : : ); after the first call to clock_gettime. With Clang, however (using clang -O2 -fno-vectorize, since Clang vectorizes even at -O2), it only goes away when I use it after every call to clock_gettime.
Here is the code I used to test this (with GCC 6.3 and Clang 3.8)
#include <string.h>
#include <stdio.h>
#include <x86intrin.h>
#include <time.h>
void fix_avx() { __asm__ __volatile__ ( "vzeroupper" : : : ); }
void fix_sse() { }
void (*fix)();
double get_wtime() {
struct timespec time;
clock_gettime(CLOCK_MONOTONIC, &time);
#ifndef __AVX__
fix();
#endif
return time.tv_sec + 1E-9*time.tv_nsec;
}
void dispatch() {
fix = fix_sse;
#if defined(__INTEL_COMPILER)
if (_may_i_use_cpu_feature (_FEATURE_AVX)) fix = fix_avx;
#else
#if defined(__GNUC__) && !defined(__clang__)
__builtin_cpu_init();
#endif
if(__builtin_cpu_supports("avx")) fix = fix_avx;
#endif
}
#define N 1000000
#define R 1000
void mul(double *a, double *b) {
for (int i = 0; i<N; i++) a[i] *= b[i];
}
int main() {
dispatch();
const double mem = 3.0*sizeof(double)*N*R/1024/1024/1024; // 3.0 avoids truncating integer division
const double maxbw = 34.1;
double *a = (double*)_mm_malloc(sizeof *a * N, 32);
double *b = (double*)_mm_malloc(sizeof *b * N, 32);
//b must be initialized to get the correct bandwidth!!!
memset(a, 1, sizeof *a * N);
memset(b, 1, sizeof *b * N);
double dtime;
//dtime = get_wtime(); // call once to fix GCC
//printf("%f\n", dtime);
//fix = fix_sse;
dtime = -get_wtime();
for(int i=0; i<R; i++) mul(a,b);
dtime += get_wtime();
printf("time %.2f s, %.1f GB/s, efficency %.1f%%\n", dtime, mem/dtime, 100*mem/dtime/maxbw);
_mm_free(a), _mm_free(b);
}
If I disable lazy function call resolution with -z now (e.g. clang -O2 -fno-vectorize -z now foo.c) then Clang only needs __asm__ __volatile__ ( "vzeroupper" : : : ); after the first call to clock_gettime just like GCC.
I expected that with -z now I would only need __asm__ __volatile__ ( "vzeroupper" : : : ); right after main() but I still need it after the first call to clock_gettime.
I have been trying for hours and it is driving me crazy. The last error I get is:
demo_cblas.c:(.text+0x83): undefined reference to `clapack_sgetrf'
demo_cblas.c:(.text+0xa3): undefined reference to `clapack_sgetri'
I am compiling the code using
/usr/bin/gcc -o demo_cblas demo_cblas.c -L /usr/lib64 -l :libgfortran.so.3 -L /usr/lib64 \
-llapack -L /usr/lib64 -lblas
I tried with and without libgfortran, and with different compilers: gcc-33, gcc-47, gcc-48. The test code is not mine; it comes from this forum ...
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include "clapack.h"
#include "cblas.h"
void invertMatrix(float *a, unsigned int height){
int info, ipiv[height];
info = clapack_sgetrf(CblasColMajor, height, height, a, height, ipiv);
info = clapack_sgetri(CblasColMajor, height, a, height, ipiv);
}
void displayMatrix(float *a, unsigned int height, unsigned int width)
{
int i, j;
for(i = 0; i < height; i++){
for(j = 0; j < width; j++)
{
printf("%1.3f ", a[height*j + i]);
}
printf("\n");
}
printf("\n");
}
int main(int argc, char *argv[])
{
int i;
float a[9], b[9], c[9];
srand(time(NULL));
for(i = 0; i < 9; i++)
{
a[i] = 1.0f*rand()/RAND_MAX;
b[i] = a[i];
}
displayMatrix(a, 3, 3);
return 0;
}
I am on Suse 12.3 64bits. In /usr/lib64 I have liblapack.a liblapack.so, ... and libblas.a libblas.so, ... and libgfortran.so.3
The same code without the function "invertMatrix" (the one using the library) compiles fine.
Any ideas or suggestions?
Thank you all for your help.
Vava
I'm quite positive that you also need to link to libcblas, which is the c wrapper library for libblas. Note that libblas is a FORTRAN library which therefore does not contain the function clapack_* you're calling.
I've just got this working on FreeBSD with:
gcc -o test test.c \
-llapack -lblas -lalapack -lcblas
I'd installed math/atlas (from ports) and the lapack and blas packages.
See my question here
I have the following c function declaration:
float Sum2d( const unsigned int nRows, const unsigned int mCols, float arr[nRows][mCols] )
{
float sumAll = 0;
// I would like to make this change illegal!
arr[0][0] = 15;
for (int i = 0; i < nRows; i++)
for (int j = 0; j < mCols; j++)
sumAll += arr[i][j];
return sumAll;
}
Using the code:
int main()
{
// define a 2d float array
float myArr2d[3][2] = {{1,2}, {3,4}, {5,6}};
// calculate the sum
float sum = Sum2d(3, 2, myArr2d);
// print the sum
printf("%f\n", myOpResult);
// return 1
return 1;
}
This function works well, yet there's one problem: the elements of arr can be altered in the Sum2d() function.
How can I change Sum2d()'s prototype to prevent any changes to arr's elements?
Multidimensional arrays with const qualification are difficult to handle. Basically you have the choice of casting non-const arrays at every call site, avoiding such const arrays as arguments completely, or working around the problem with some sophisticated macros. This is a longer story; you may read up on it here.
I don't know what compiler you're using, but that doesn't compile for me as C or C++.
But regardless, just making arr const should suffice.
Change the prototype of the function to use const with float.
Also, you have specified nRows / mCols in the array argument, which C only allows from C99 onward (as a variable-length array parameter). If you don't know the bounds of the array, use a double pointer.
This approach doesn't prevent casting away const inside the function.
#include <stdio.h>
float Sum2d( const unsigned int nRows, const unsigned int mCols, const float arr[][2] )
{
float sumAll = 0;
// I would like to make this change illegal!
//arr[0][0] = 15;
for (int i = 0; i < nRows; i++)
for (int j = 0; j < mCols; j++)
sumAll += arr[i][j];
return sumAll;
}
int main()
{
// define a 2d float array
float myArr2d[3][2] = {{1,2}, {3,4}, {5,6}};
// calculate the sum
float sum = Sum2d(3, 2, (const float (*)[2])myArr2d);
// print the sum
printf("%f\n", sum);
// return 1
return 1;
}
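The fixed column count [2] in the code above can also be written with the variable-length form from the question. This is only a sketch, assuming a C99 compiler with VLA support; the name Sum2dConst is made up to avoid clashing with Sum2d. The cast at the call site is still needed, because a float (*)[2] does not convert implicitly to a pointer to an array of const float.

```c
#include <assert.h>

/* Hypothetical variant of Sum2d: the elements of arr are read-only.
 * Assumes C99 variable-length array parameters are available. */
float Sum2dConst(unsigned int nRows, unsigned int mCols,
                 const float arr[nRows][mCols])
{
    assert(nRows > 0 && mCols > 0);
    float sumAll = 0;
    /* arr[0][0] = 15;   would be rejected: the element type is const */
    for (unsigned int i = 0; i < nRows; i++)
        for (unsigned int j = 0; j < mCols; j++)
            sumAll += arr[i][j];
    return sumAll;
}
```

A call would then look like Sum2dConst(3, 2, (const float (*)[2])myArr2d).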
I suppose you are using the following command line:
gcc <file.c> -o out -std=c99
Running on Debian Squeeze
$ gcc array.c -o array -std=c99
$ gcc --version
gcc (Debian 4.4.5-8) 4.4.5
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I have the following C program (a simplification of my actual use case which exhibits the same behavior)
#include <stdlib.h>
#include <math.h>
int main(int argc, char ** argv) {
const float * __restrict__ const input = malloc(20000*sizeof(float));
float * __restrict__ const output = malloc(20000*sizeof(float));
unsigned int pos=0;
while(1) {
unsigned int rest=100;
for(unsigned int i=pos;i<pos+rest; i++) {
output[i] = input[i] * 0.1;
}
pos+=rest;
if(pos>10000) {
break;
}
}
}
When I compile with
-O3 -g -Wall -ftree-vectorizer-verbose=5 -msse -msse2 -msse3 -march=native -mtune=native --std=c99 -fPIC -ffast-math
I get the output
main.c:10: note: not vectorized: unhandled data-ref
where 10 is the line of the inner for loop. When I looked up why it might say this, it seemed to be saying that the pointers could be aliased, but they can't be in my code, as I have the __restrict keyword. They also suggested including the -msse flags, but they don't seem to do anything either. Any help?
It certainly seems like a bug. In the following, equivalent functions, foo() is vectorised but bar() is not, when compiling for an x86-64 target:
void foo(const float * restrict input, float * restrict output)
{
unsigned int pos;
for (pos = 0; pos < 10100; pos++)
output[pos] = input[pos] * 0.1;
}
void bar(const float * restrict input, float * restrict output)
{
unsigned int pos;
unsigned int i;
for (pos = 0; pos <= 10000; pos += 100)
for (i = 0; i < 100; i++)
output[pos + i] = input[pos + i] * 0.1;
}
Adding the -m32 flag, to compile for an x86 target instead, causes both functions to be vectorised.
It doesn't like the outer loop format which is preventing it from understanding the inner loop. I can get it to vectorize if I just fold it into a single loop:
#include <stdlib.h>
#include <math.h>
int main(int argc, char ** argv) {
const float * __restrict__ input = malloc(20000*sizeof(float));
float * __restrict__ output = malloc(20000*sizeof(float));
for(unsigned int i=0; i<=10100; i++) {
output[i] = input[i] * 0.1f;
}
}
(note that I didn't think too hard about how to properly translate the pos+rest limit into a single for loop condition, it may be wrong)
You might be able to take advantage of this by putting a simplified inner loop into a function which you call with pointers and a count. Even when it is inlined again it may work fine. This is assuming you deleted parts of your while() loop that I have just simplified away but you need to retain.
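The helper-function idea could be sketched as follows. The names scale_chunk and scale_in_chunks are hypothetical, and whether GCC actually vectorizes the helper after inlining depends on the compiler version and flags, so check the vectorizer report rather than relying on it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: a simple counted loop the vectorizer can analyse
 * on its own, independent of the caller's loop structure. */
static void scale_chunk(const float *restrict in, float *restrict out,
                        size_t count)
{
    for (size_t i = 0; i < count; i++)
        out[i] = in[i] * 0.1f;
}

/* Outer driver: walks the buffers in pieces of `chunk` elements and
 * hands each piece to the helper with plain pointers and a count. */
void scale_in_chunks(const float *restrict input, float *restrict output,
                     size_t total, size_t chunk)
{
    assert(chunk > 0);
    for (size_t pos = 0; pos < total; pos += chunk) {
        size_t n = (total - pos < chunk) ? total - pos : chunk;
        scale_chunk(input + pos, output + pos, n);
    }
}
```

Any per-chunk work that was removed from the original while() loop would go in the driver, next to the call to scale_chunk.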
try:
const float * __restrict__ input = ...;
float * __restrict__ output = ...;
experiment a bit by changing things around:
#include <stdlib.h>
#include <math.h>
int main(int argc, char ** argv) {
const float * __restrict__ input = new float[20000];
float * __restrict__ output = new float[20000];
unsigned int pos=0;
while(1) {
unsigned int rest=100;
output += pos;
input += pos;
for(unsigned int i=0;i<rest; ++i) {
output[i] = input[i] * 0.1;
}
pos+=rest;
if(pos>10000) {
break;
}
}
}
g++ -O3 -g -Wall -ftree-vectorizer-verbose=7 -msse -msse2 -msse3 -c test.cpp
test.cpp:14: note: versioning for alias required: can't determine dependence between *D.4096_24 and *D.4095_21
test.cpp:14: note: mark for run-time aliasing test between *D.4096_24 and *D.4095_21
test.cpp:14: note: Alignment of access forced using versioning.
test.cpp:14: note: Vectorizing an unaligned access.
test.cpp:14: note: vect_model_load_cost: unaligned supported by hardware.
test.cpp:14: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
test.cpp:14: note: vect_model_simple_cost: inside_cost = 2, outside_cost = 0 .
test.cpp:14: note: vect_model_simple_cost: inside_cost = 2, outside_cost = 1 .
test.cpp:14: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
test.cpp:14: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
test.cpp:14: note: cost model: Adding cost of checks for loop versioning to treat misalignment.
test.cpp:14: note: cost model: Adding cost of checks for loop versioning aliasing.
test.cpp:14: note: Cost model analysis:
Vector inside of loop cost: 8
Vector outside of loop cost: 6
Scalar iteration cost: 5
Scalar outside cost: 1
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 2
test.cpp:14: note: Profitability threshold = 3
test.cpp:14: note: Vectorization may not be profitable.
test.cpp:14: note: create runtime check for data references *D.4096_24 and *D.4095_21
test.cpp:14: note: created 1 versioning for alias checks.
test.cpp:14: note: LOOP VECTORIZED.
test.cpp:4: note: vectorized 1 loops in function.