Is using pragma omp simd like this correct? - c

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define pow(x) ((x) * (x))
#define NUM_THREADS 8
#define wmax 1000
#define Nv 2
#define N 5
int b=0;
float Points[N][Nv]={ {0,1}, {3,4}, {1,2}, {5,1} ,{8,9}};
float length[wmax+1]={0};
float EuclDist(float* Ne, float* Pe) {
int i;
float s = 0;
for (i = 0; i < Nv; i++) {
s += pow(Ne[i] - Pe[i]);
}
return s;
}
void DistanceFinder(float* a[]){
int i;
#pragma omp simd
for (i=1;i<N+1;i++){
length[b] += EuclDist(&a[i],&a[i-1]);
}
//printf(" %f\n", length[b]);
}
void NewRoute(){
//some irrelevant things
DistanceFinder(Points);
}
int main(){
omp_set_num_threads(NUM_THREADS);
do{
b+=1;
NewRoute();
} while (b<wmax);
}
Trying to parallelize this loop and trying different things, tried this one.
Seems to be the fastest, however is it correct to use SIMD like that? Because I'm using a previous iteration (i and i - 1). The results I see though are correct weirdly or not.

Seems to be the fastest, however is it correct to use SIMD like that?
First, there is a race condition that needs to be fixed, namely during the updates of the array length[b]. Moreover, you are accessing memory outside the array a; (iterating from 1 to N + 1), and you are passing &a[i]. You can fix the race condition by using OpenMP reduction clause:
void DistanceFinder(float* a[]){
int i;
float sum = 0;
float tmp;
#pragma omp simd private(tmp) reduction(+:sum)
for (i=1;i<N;i++){
tmp = EuclDist(a[i], a[i-1]);
sum += tmp;
}
length[b] += sum;
}
Furthermore, you need to provide a version of EuclDist as follows:
#pragma omp declare simd uniform(Ne, Pe)
float EuclDist(float* Ne, float* Pe) {
int i;
float s = 0;
for (i = 0; i < Nv; i++)
s += pow(Ne[i] - Pe[i]);
return s;
}
Because I'm using a previous iteration (i and i - 1).
In your case, it is okay, since the array a is just being read.
The results I see though are correct weirdly or not.
Very-likely there was no vectorization taking place. Regardless, it would still be undefined behavior due to the aforementioned race condition.
You can simplify your code so that it increases the likelihood of the vectorization actually happening, for instance:
void DistanceFinder(float* a[]){
int i;
float sum = 0;
float tmp;
#pragma omp simd private(tmp) reduction(+:sum)
for (i=1;i<N;i++){
tmp = pow(a[i][0] - a[i-1][0]) + pow(a[i][1] - a[i-1][1])
sum += tmp;
}
length[b] += sum;
}
A further change that you can do to improve the performance of your code is to allocate the matrix (that is passed as a parameter of the function DistanceFinder) in a manner that when you iterate over its rows (i.e., a[i]) you would be iterating over continuous memory address.
For instance, you could pass two arrays a1 and a2 to represent the first and second columns of the matrix a:
void DistanceFinder(float a1[], float a2[]){
int i;
float sum = 0;
float tmp;
#pragma omp simd private(tmp) reduction(+:sum)
for (i=1;i<N;i++){
tmp = pow(a1[i] - a1[i-1]) + pow(a2[i][1] - a2[i-1][1])
sum += tmp;
}
length[b] += sum;
}

Related

gcc not autovectorising matrix-vector multiplication

I have just begun playing around with my vectorising code. My matrix-vector multiplication code is not being autovectorised by gcc, I’d like to know why. This pastebin contains the output from -fopt-info-vec-missed.
I’m having trouble understanding what the output is telling me and seeing how it matches up to what I’ve written in code.
For instance, I see a number of lines saying not enough data-refs in basic block, I can’t find much detail online with a google search about this. I also see that there’s issues relating to memory alignment e.g. Unknown misalignment, naturally aligned and vector alignment may not be reachable. All of my memory allocation was for double types using malloc, which I believed was guaranteed to be aligned for that type.
Environment: compiling with gcc on WSL2
gcc -v: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define N 4000 // Matrix size will be N x N
#define T 1
//gcc -fopenmp -g vectorisation.c -o main -O3 -march=native -fopt-info-vec-missed=missed.txt
void doParallelComputation(double *A, double *V, double *results, unsigned long matrixSize, int numThreads)
{
omp_set_num_threads(numThreads);
unsigned long i, j;
#pragma omp parallel for simd private(j)
for (i = 0; i < matrixSize; i++)
{
// double *AHead = &A[i * matrixSize];
// double tmp = 0;
for (j = 0; j < matrixSize; j++)
{
results[i] += A[i * matrixSize + j] * V[j];
// also tried tmp += A[i * matrixSize + j] * V[j];
}
// results[i] = tmp;
}
}
void genRandVector(double *S, unsigned long size)
{
srand(time(0));
unsigned long i;
for (i = 0; i < size; i++)
{
double n = rand() % 5;
S[i] = n;
}
}
void genRandMatrix(double *A, unsigned long size)
{
srand(time(0));
unsigned long i, j;
for (i = 0; i < size; i++)
{
for (j = 0; j < size; j++)
{
double n = rand() % 5;
A[i*size + j] = n;
}
}
}
int main(int argc, char *argv[])
{
double *V = (double *)malloc(N * sizeof(double)); // v in our A*v = parV computation
double *parV = (double *)malloc(N * sizeof(double)); // Parallel computed vector
double *A = (double *)malloc(N * N * sizeof(double)); // NxN Matrix to multiply by V
genRandVector(V, N);
doParallelComputation(A, V, parV, N, T);
free(parV);
free(A);
free(V);
return 0;
}
Adding double *restrict results to promise non-overlapping input/output helped, without OpenMP but with -ffast-math. https://godbolt.org/z/qaPh1v
You need to tell OpenMP about reductions specifically, to let it relax FP-math associativity. (-ffast-math doesn't help the OpenMP vectorizer). With that as well, we get what you want:
#pragma omp simd reduction(+:tmp)
With just restrict and no -ffast-math or -fopenmp, you get total garbage: it does a SIMD FP multiply, but then unpacks that for 4x vaddsd into the scalar accumulator, not helping hide FP latency at all.
With restrict and -fopenmp (without fast-math), it just does scalar FMA.
With restrict and -ffast-math (without -fopenmp or #pragma commented) it auto-vectorizes nicely: vfmadd231pd ymm inside the loop, shuffle / add horizontal sum outside. (But doesn't parallelize). https://godbolt.org/z/f36oG3
With restrict and -ffast-math (with -fopenmp) it still doesn't auto-vectorize. The OpenMP vectorizer is different, and maybe doesn't take advantage of fast-math, instead needing you to tell it about reductions?
Also note that with your data layout, the loop you want to parallelize (outer) is different from the loop you want to vectorize with SIMD (inner). Both the input "vectors" for the inner dot-product loop are in contiguous memory so it makes the most sense to read those, instead of trying to SIMD shuffle data from 4 different columns into one vector to accumulate 4 result[i+0..3] results in 1 vector.
However, unrolling the outer loop by 4 to use each V[j+0..3] with data from 4 different columns would improve computational intensity (closer to 1 load per FMA, rather than 2)
(As long as V[] and a row of the matrix fits in L1d cache, this is good. If not, it's actually pretty bad and should get cache-blocked. Or actually if you unroll the outer loop, 4 rows of the matrix.)
Also note that double tmp = 0; would be a good idea: your current version adds into result[i], reading it before writing. That would require zero-init before you could use it as a pure output.
Auto-vec auto-par version:
I think this is correct; the asm looks like it auto-parallelized as well as auto-vectorizing the inner loop.
void doParallelComputation(double *restrict A, double *restrict V, double *restrict results, unsigned long matrixSize, int numThreads)
{
omp_set_num_threads(numThreads);
unsigned long i, j;
#pragma omp parallel for private(j)
for (i = 0; i < matrixSize; i++)
{
// double *AHead = &A[i * matrixSize];
double tmp = 0;
// TODO: unroll outer loop and cache-block it.
#pragma omp simd reduction(+:tmp)
for (j = 0; j < matrixSize; j++)
{
//results[i] += A[i * matrixSize + j] * V[j];
tmp += A[i * matrixSize + j] * V[j]; //
}
results[i] = tmp; // write-only to results, not adding to old value.
}
}
Compiles (Godbolt) with a vectorized inner loop inside the OpenMPified helper function doParallelComputation._omp_fn.0:
# gcc7.5 -xc -O3 -fopenmp -march=skylake
.L6:
add rdx, 1 # loop counter; newer GCC just compares the end-pointer
vmovupd ymm2, YMMWORD PTR [rcx+rax] # 32-byte load
vfmadd231pd ymm0, ymm2, YMMWORD PTR [rsi+rax] # 32-byte memory-source FMA
add rax, 32 # pointer increment
cmp rdi, rdx
ja .L6
Then a horizontal sum of mediocre efficiency after the loop; unfortunately the OpenMP vectorizer isn't as smart as the "normal" -ftree-vectorize vectorizer, but that requires -ffast-math to do anything here.

With openmp in C, how can I parallelize a for loop that contains a nested comparison function for qsort?

I want to parallelize a for loop which contains a nested comparison function for qsort:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(){
int i;
#pragma omp parallel for
for(i = 0; i < 100; i++){
int *index= (int *) malloc(sizeof(int)*10);
double *tmp_array = (double*) malloc(sizeof(double)*10);
int j;
for(j=0; j<10; j++){
tmp_array[j] = rand();
index[j] = j;
}
// QuickSort the index array based on tmp_array:
int simcmp(const void *a, const void *b){
int ia = *(int *)a;
int ib = *(int *)b;
if ((tmp_array[ia] - tmp_array[ib]) > 1e-12){
return -1;
}else{
return 1;
}
}
qsort(index, 10, sizeof(*index), simcmp);
free(index);
free(tmp_array);
}
return 0;
}
When I try to compile this, I get the error:
internal compiler error: in get_expr_operands, at tree-ssa-operands.c:881
}
As far as I can tell, this error is due to the nested comparison function. Is there a way to make openmp work with this nested comparison function? If not, is there a good way to achieve a similar result without a nested comparison function?
Edit:
I'm using GNU C compiler where nested functions are permitted. The code compiles and runs fine without the pragma statement. I can't define simcmp outside of the for loop because tmp_array would then have to be a global variable, which would mess up the multi-threading. However, if somebody has a suggestion to achieve the same result without a nested function, that would be most welcome.
I realize this has been self answered, but here are some standard C and OpenMP options. The qsort_r function is a good classic choice, but it's worth noting that qsort_s is part of the c11 standard, and thus is portable wherever c11 is offered (which does not include Windows, they don't quite offer c99 yet).
As to doing it in OpenMP without the nested comparison function, still using original qsort, there are two ways that come to mind. First is to use the classic global variable in combination with OpenMP threadprivate:
static int *index = NULL;
static double *tmp_array = NULL;
#pragma omp threadprivate(index, tmp_array)
int simcmp(const void *a, const void *b){
int ia = *(int *)a;
int ib = *(int *)b;
double aa = ((double *)tmp_array)[ia];
double bb = ((double *)tmp_array)[ib];
if ((aa - bb) > 1e-12){
return -1;
}else{
return 1;
}
}
int main(){
int i;
#pragma omp parallel for
for(i = 0; i < 100; i++){
index= (int *) malloc(sizeof(int)*10);
tmp_array = (double*) malloc(sizeof(double)*10);
int j;
for(j=0; j<10; j++){
tmp_array[j] = rand();
index[j] = j;
}
// QuickSort the index array based on tmp_array:
qsort_r(index, 10, sizeof(*index), simcmp, tmp_array);
free(index);
free(tmp_array);
}
return 0;
}
The version above causes every thread in the parallel region to use a private copy of the global variables index and tmp_array, which takes care of the issue. This is probably the most portable version you can write in standard C and OpenMP, with the only likely incompatible platforms being those that do not implement thread local memory (some microcontrollers, etc.).
If you want to avoid the global variable and still have portability and use OpenMP, then I would recommend using C++11 and the std::sort algorithm with a lambda:
std::sort(index, index+10, [=](const int& a, const int& b){
if ((tmp_array[a] - tmp_array[b]) > 1e-12){
return -1;
}else{
return 1;
}
});
I solved my problem with qsort_r, which allows you to pass an additional pointer to the comparison function.
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int simcmp(const void *a, const void *b, void *tmp_array){
int ia = *(int *)a;
int ib = *(int *)b;
double aa = ((double *)tmp_array)[ia];
double bb = ((double *)tmp_array)[ib];
if ((aa - bb) > 1e-12){
return -1;
}else{
return 1;
}
}
int main(){
int i;
#pragma omp parallel for
for(i = 0; i < 100; i++){
int *index= (int *) malloc(sizeof(int)*10);
double *tmp_array = (double*) malloc(sizeof(double)*10);
int j;
for(j=0; j<10; j++){
tmp_array[j] = rand();
index[j] = j;
}
// QuickSort the index array based on tmp_array:
qsort_r(index, 10, sizeof(*index), simcmp, tmp_array);
free(index);
free(tmp_array);
}
return 0;
}
This compiles and runs with no issue. However, it is not completely ideal as qsort_r is platform and compiler dependent. There is a portable version of qsort_r here where the author summarizes my problem nicely:
If you want to qsort() an array with a comparison operator that takes
parameters you need to use global variables to pass those parameters
(not possible when writing multithreaded code), or use qsort_r/qsort_s
which are not portable (there are separate GNU/BSD/Windows versions
and they all take different arguments).

No result using openmp

I'm tryng to add all the members of an array using openmp this way
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char *argv[])
{
int v[] ={1,2,3,4,5,6,7,8,9};
int sum = 0;
#pragma omp parallel private(v, sum)
{
#pragma reduction(+: sum)
{
for (int i = 0; i < sizeof(v)/sizeof(int); i++){
sum += v[i];
}
}
}
printf("%d\n",sum);
}
But when I print sum the result is 0
You are very confused about data-sharing attributes and work-sharing for OpenMP. This answer does not attempt to properly teach them to you, but only give you a concise specific example.
Your code does not make any sense and does not compile.
You do not need multiple regions or such, and there are only two variables. v - which is defined outside, is read by all and must be shared - which it implicitly is because it is defined outside. Then there is sum, which is a reduction variable.
Further, you need to apply worksharing (for) to the loop. So in the end it looks like this:
int v[] ={1,2,3,4,5,6,7,8,9};
int sum = 0;
#pragma omp parallel for reduction(+: sum)
for (int i = 0; i < sizeof(v)/sizeof(int); i++){
sum += v[i];
}
printf("%d\n",sum);
Note there are private variables in this example. Private variables are very dangerous because they are uninitialized inside the parallel region, simply don't use them explicitly. If you need something local, declare it inside the parallel region.

OpenMP parallelization (Block Matrix Mult)

I'm attempting to implement block matrix multiplication and making it more parallelized.
This is my code :
int i,j,jj,k,kk;
float sum;
int en = 4 * (2048/4);
#pragma omp parallel for collapse(2)
for(i=0;i<2048;i++) {
for(j=0;j<2048;j++) {
C[i][j]=0;
}
}
for (kk=0;kk<en;kk+=4) {
for(jj=0;jj<en;jj+=4) {
for(i=0;i<2048;i++) {
for(j=jj;j<jj+4;j++) {
sum = C[i][j];
for(k=kk;k<kk+4;k++) {
sum+=A[i][k]*B[k][j];
}
C[i][j] = sum;
}
}
}
}
I've been playing around with OpenMP but still have had no luck in figuring what the best way to have this done in the least amount of time.
Getting good performance from matrix multiplication is a big job. Since "The best code is the code I don't have to write", a much better use of your time would be to understand how to use a BLAS library.
If you are using X86 processors, the Intel Math Kernel Library (MKL) is available free, and includes optimized, parallelized, matrix multiplication operations.
https://software.intel.com/en-us/articles/free-mkl
(FWIW, I work for Intel, but not on MKL :-))
I recently started looking into dense matrix multiplication (GEMM)again. It turns out the Clang compiler is really good at optimization GEMM without needing any intrinsics (GCC still needs intrinsics). The following code gets 60% of the peak FLOPS of my four core/eight hardware thread Skylake system. It uses block matrix multiplication.
Hyper-threading gives worse performance so you make sure you only use threads equal to the number of cores and bind threads to prevent thread migration.
export OMP_PROC_BIND=true
export OMP_NUM_THREADS=4
Then compile like this
clang -Ofast -march=native -fopenmp -Wall gemm_so.c
The code
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <omp.h>
#include <x86intrin.h>
#define SM 80
typedef __attribute((aligned(64))) float * restrict fast_float;
static void reorder2(fast_float a, fast_float b, int n) {
for(int i=0; i<SM; i++) memcpy(&b[i*SM], &a[i*n], sizeof(float)*SM);
}
static void kernel(fast_float a, fast_float b, fast_float c, int n) {
for(int i=0; i<SM; i++) {
for(int k=0; k<SM; k++) {
for(int j=0; j<SM; j++) {
c[i*n + j] += a[i*n + k]*b[k*SM + j];
}
}
}
}
void gemm(fast_float a, fast_float b, fast_float c, int n) {
int bk = n/SM;
#pragma omp parallel
{
float *b2 = _mm_malloc(sizeof(float)*SM*SM, 64);
#pragma omp for collapse(3)
for(int i=0; i<bk; i++) {
for(int j=0; j<bk; j++) {
for(int k=0; k<bk; k++) {
reorder2(&b[SM*(k*n + j)], b2, n);
kernel(&a[SM*(i*n+k)], b2, &c[SM*(i*n+j)], n);
}
}
}
_mm_free(b2);
}
}
static int doublecmp(const void *x, const void *y) { return *(double*)x < *(double*)y ? -1 : *(double*)x > *(double*)y; }
double median(double *x, int n) {
qsort(x, n, sizeof(double), doublecmp);
return 0.5f*(x[n/2] + x[(n-1)/2]);
}
int main(void) {
int cores = 4;
double frequency = 3.1; // i7-6700HQ turbo 4 cores
double peak = 32*cores*frequency;
int n = SM*10*2;
int mem = sizeof(float) * n * n;
float *a = _mm_malloc(mem, 64);
float *b = _mm_malloc(mem, 64);
float *c = _mm_malloc(mem, 64);
memset(a, 1, mem), memset(b, 1, mem);
printf("%dx%d matrix\n", n, n);
printf("memory of matrices: %.2f MB\n", 3.0*mem*1E-6);
printf("peak SP GFLOPS %.2f\n", peak);
puts("");
while(1) {
int r = 10;
double times[r];
for(int j=0; j<r; j++) {
times[j] = -omp_get_wtime();
gemm(a, b, c, n);
times[j] += omp_get_wtime();
}
double flop = 2.0*1E-9*n*n*n; //GFLOP
double time_mid = median(times, r);
double flops_low = flop/times[r-1], flops_mid = flop/time_mid, flops_high = flop/times[0];
printf("%.2f %.2f %.2f %.2f\n", 100*flops_low/peak, 100*flops_mid/peak, 100*flops_high/peak, flops_high);
}
}
This does GEMM 10 times per iteration of an infinite loop and prints the low, median, and high ratio of FLOPS to peak_FLOPS and finally the median FLOPS.
You will need to adjust the following lines
int cores = 4;
double frequency = 3.1; // i7-6700HQ turbo 4 cores
double peak = 32*cores*frequency;
to the number of physical cores, frequency for all cores (with turbo if enabled), and the number of floating pointer operations per core which is 16 for Core2-Ivy Bridge, 32 for Haswell-Kaby Lake, and 64 for the Xeon Phi Knights Landing.
This code may be less efficient with NUMA systems. It does not do nearly as well with Knight Landing (I just started looking into this).

OpenMP gathering data (join data?) after parallel for

What I am looking for is what is the best way to gather all the data from the parallel for loops into one variable. OpenMP seems to have a different routine then I am used to seeing as I started learning OpenMPI first which has scatter and gather routines.
Calculating PI (embarrassingly parallel routine)
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_STEPS 100
#define CHUNKSIZE 20
int main(int argc, char *argv[])
{
double step, x, pi, sum=0.0;
int i, chunk;
chunk = CHUNKSIZE;
step = 1.0/(double)NUM_STEPS;
#pragma omp parallel shared(chunk) private(i,x,sum,step)
{
#pragma omp for schedule(dynamic,chunk)
for(i = 0; i < NUM_STEPS; i++)
{
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
printf("Thread %d: i = %i sum = %f \n",tid,i,sum);
}
pi = step * sum;
}
EDIT: It seems that I could use an array sum[*NUM_STEPS / CHUNKSIZE*] and sum the array into one value, or would it be better to use some sort of blocking routine to sum the product of each iteration
Add this clause to your #pragma omp parallel ... statement:
reduction(+ : pi)
Then just do pi += step * sum; at the end of the parallel region. (Notice the plus!) OpenMP will then automagically sum up the partial sums for you.
Lets see, I am not quite sure what happens, because I havn't got deterministic behaviour on the finished application, but I have something looks like it resembles π. I removed the #pragma omp parallel shared(chunk) and changed the #pragma omp for schedule(dynamic,chunk) to #pragma omp parallel for schedule(dynamic) reduction(+:sum).
#pragma omp parallel for schedule(dynamic) reduction(+:sum)
This requires some explanation, I removed the schedules chunk just to make it all simpler (for me). The part that you are interested in is the reduction(+:sum) which is a normal reduce opeartion with the operator + and using the variable sum.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_STEPS 100
int main(int argc, char *argv[])
{
double step, x, pi, sum=0.0;
int i;
step = 1.0/(double)NUM_STEPS;
#pragma omp parallel for schedule(dynamic) reduction(+:sum)
for(i = 0; i < NUM_STEPS; i++)
{
x = (i+0.5)*step;
sum +=4.0/(1.0+x*x);
printf("Thread %%d: i = %i sum = %f \n",i,sum);
}
pi = step * sum;
printf("pi=%lf\n", pi);
}

Resources