Parallelise 2 for loops with OpenMP - c

So I have this function that I have to parallelize with OpenMP static scheduling for n threads:
void computeAccelerations(){
    int i,j;
    for(i=0;i<bodies;i++){
        accelerations[i].x = 0; accelerations[i].y = 0; accelerations[i].z = 0;
        for(j=0;j<bodies;j++){
            if(i!=j){
                //accelerations[i] = addVectors(accelerations[i],scaleVector(GravConstant*masses[j]/pow(mod(subtractVectors(positions[i],positions[j])),3),subtractVectors(positions[j],positions[i])));
                vector sij = {positions[i].x-positions[j].x,positions[i].y-positions[j].y,positions[i].z-positions[j].z};
                vector sji = {positions[j].x-positions[i].x,positions[j].y-positions[i].y,positions[j].z-positions[i].z};
                double mod = sqrt(sij.x*sij.x + sij.y*sij.y + sij.z*sij.z);
                double mod3 = mod * mod * mod;
                double s = GravConstant*masses[j]/mod3;
                vector S = {s*sji.x,s*sji.y,s*sji.z};
                accelerations[i].x+=S.x;accelerations[i].y+=S.y;accelerations[i].z+=S.z;
            }
        }
    }
}
I tried to do something like:
void computeAccelerations_static(int num_of_threads){
    int i,j;
    #pragma omp parallel for num_threads(num_of_threads) schedule(static)
    for(i=0;i<bodies;i++){
        accelerations[i].x = 0; accelerations[i].y = 0; accelerations[i].z = 0;
        for(j=0;j<bodies;j++){
            if(i!=j){
                //accelerations[i] = addVectors(accelerations[i],scaleVector(GravConstant*masses[j]/pow(mod(subtractVectors(positions[i],positions[j])),3),subtractVectors(positions[j],positions[i])));
                vector sij = {positions[i].x-positions[j].x,positions[i].y-positions[j].y,positions[i].z-positions[j].z};
                vector sji = {positions[j].x-positions[i].x,positions[j].y-positions[i].y,positions[j].z-positions[i].z};
                double mod = sqrt(sij.x*sij.x + sij.y*sij.y + sij.z*sij.z);
                double mod3 = mod * mod * mod;
                double s = GravConstant*masses[j]/mod3;
                vector S = {s*sji.x,s*sji.y,s*sji.z};
                accelerations[i].x+=S.x;accelerations[i].y+=S.y;accelerations[i].z+=S.z;
            }
        }
    }
}
It seems natural to just add the #pragma omp parallel for num_threads(num_of_threads) schedule(static), but the results aren't correct.
I think there is some kind of false sharing on accelerations[i], but I don't know how to approach it. I'd appreciate any kind of help. Thank you.

In your loop nest, only the iterations of the outer loop are parallelized. Because i is the loop-control variable of that loop, each thread gets its own private copy, although as a matter of style it would still be better to declare i in the loop control block.
j is another matter. It is declared outside the parallel region and it is not the control variable of a parallelized loop, so it is shared among the threads. Because every thread executing i-loop iterations manipulates the shared variable j, you have a huge problem with data races. This can be resolved (among other alternatives) by moving the declaration of j into the parallel region, preferably into the control block of its associated loop.
Overall, then:
// int i, j;
#pragma omp parallel for num_threads(num_of_threads) schedule(static)
for (int i = 0; i < bodies; i++) {
accelerations[i].x = 0;
accelerations[i].y = 0;
accelerations[i].z = 0;
for (int j = 0; j < bodies; j++) {
if (i != j) {
//accelerations[i] = addVectors(accelerations[i],scaleVector(GravConstant*masses[j]/pow(mod(subtractVectors(positions[i],positions[j])),3),subtractVectors(positions[j],positions[i])));
vector sij = { positions[i].x - positions[j].x,
positions[i].y - positions[j].y,
positions[i].z - positions[j].z };
vector sji = { positions[j].x - positions[i].x,
positions[j].y - positions[i].y,
positions[j].z - positions[i].z };
double mod = sqrt(sij.x * sij.x + sij.y * sij.y + sij.z * sij.z);
double mod3 = mod * mod * mod;
double s = GravConstant * masses[j] / mod3;
vector S = { s * sji.x, s * sji.y, s * sji.z };
accelerations[i].x += S.x;
accelerations[i].y += S.y;
accelerations[i].z += S.z;
}
}
}
Note also that computing sji appears to be wasteful, as in mathematical terms it is just -sij, and neither sji nor sij is modified. I would probably reduce the above to something more like this:
#pragma omp parallel for num_threads(num_of_threads) schedule(static)
for (int i = 0; i < bodies; i++) {
accelerations[i].x = 0;
accelerations[i].y = 0;
accelerations[i].z = 0;
for (int j = 0; j < bodies; j++) {
if (i != j) {
vector sij = { positions[i].x - positions[j].x,
positions[i].y - positions[j].y,
positions[i].z - positions[j].z };
double mod = sqrt(sij.x * sij.x + sij.y * sij.y + sij.z * sij.z);
double mod3 = mod * mod * mod;
double s = GravConstant * masses[j] / mod3;
accelerations[i].x -= s * sij.x;
accelerations[i].y -= s * sij.y;
accelerations[i].z -= s * sij.z;
}
}
}
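For completeness, the other kind of alternative mentioned above (keeping the declarations of i and j where they are and instead listing j in a private clause) would look like this; this is only a sketch using the reduced loop body, and i needs no clause because, as the control variable of the parallelized loop, it is already predetermined private:
void computeAccelerations_static(int num_of_threads) {
    int i, j;
    #pragma omp parallel for private(j) num_threads(num_of_threads) schedule(static)
    for (i = 0; i < bodies; i++) {
        accelerations[i].x = 0;
        accelerations[i].y = 0;
        accelerations[i].z = 0;
        for (j = 0; j < bodies; j++) {
            if (i != j) {
                vector sij = { positions[i].x - positions[j].x,
                               positions[i].y - positions[j].y,
                               positions[i].z - positions[j].z };
                double mod = sqrt(sij.x * sij.x + sij.y * sij.y + sij.z * sij.z);
                double s = GravConstant * masses[j] / (mod * mod * mod);
                accelerations[i].x -= s * sij.x;
                accelerations[i].y -= s * sij.y;
                accelerations[i].z -= s * sij.z;
            }
        }
    }
}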

Related

Vectorising a nested loop with AVX2

I am trying to vectorise the inner loop of the following nested loop. Firstly, is this good practice, or should one avoid attempting to vectorise nested loops?
The following works; it already has some basic loop unrolling.
int sparsemv(struct mesh *A, const double * const x, double * const y) {
const int nrow = (const int) A->local_nrow;
int j = 0;
double sum = 0.0;
#pragma omp parallel for private(j, sum)
for (int i=0; i< nrow; i++) {
sum = 0.0;
const double * const cur_vals = (const double * const) A->ptr_to_vals_in_row[i];
const int * const cur_inds = (const int * const) A->ptr_to_inds_in_row[i];
const int cur_nnz = (const int) A->nnz_in_row[i];
int unroll = (cur_nnz/4)*4;
for (j=0; j< unroll; j+=4) {
sum += cur_vals[j] * x[cur_inds[j]];
sum += cur_vals[j+1] * x[cur_inds[j+1]];
sum += cur_vals[j+2] * x[cur_inds[j+2]];
sum += cur_vals[j+3] * x[cur_inds[j+3]];
}
for (; j < cur_nnz; j++) {
sum += cur_vals[j] * x[cur_inds[j]];
}
y[i] = sum;
}
return 0;
}
However, when I try to vectorise using 256-bit vector registers with AVX2, I get either incorrect answers or seg faults. x and y are aligned but A is not, so for the moment all loading and storing is done with unaligned operations, since that is the only way I avoid the seg faults:
int sparsemv(struct mesh *A, const double * const x, double * const y) {
const int nrow = (const int) A->local_nrow;
int j = 0;
double sum = 0.0;
#pragma omp parallel for private(j, sum)
for (int i=0; i< nrow; i++) {
sum = 0.0;
const double * const cur_vals = (const double * const) A->ptr_to_vals_in_row[i];
const int * const cur_inds = (const int * const) A->ptr_to_inds_in_row[i];
const int cur_nnz = (const int) A->nnz_in_row[i];
int unroll = (cur_nnz/4)*4;
__m256d sumVec = _mm256_set1_pd(sum);
for (j=0; j< unroll; j+=4) {
__m256d cur_valsVec = _mm256_loadu_pd(cur_vals + j);
__m256d xVec = _mm256_loadu_pd(x + cur_inds[j]);
sumVec = _mm256_add_pd(sumVec, _mm256_mul_pd(cur_valsVec, xVec));
}
_mm256_storeu_pd(y + i, sumVec); // Is this storing in y + i + 1, 2 and 3 as well?
for (; j < cur_nnz; j++) {
sum += cur_vals[j] * x[cur_inds[j]];
}
y[i] += sum;
}
return 0;
}
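For what it's worth, two things in this version look off: _mm256_loadu_pd(x + cur_inds[j]) loads four contiguous doubles starting at x[cur_inds[j]] rather than the four indexed elements, and _mm256_storeu_pd(y + i, sumVec) does indeed overwrite y[i] through y[i+3]. A sketch of how the loop could be written instead, assuming the same struct mesh as above and 32-bit column indices (the gather and the horizontal sum are the changed parts):
#include <immintrin.h>

int sparsemv(struct mesh *A, const double * const x, double * const y) {
  const int nrow = (int) A->local_nrow;
  #pragma omp parallel for
  for (int i = 0; i < nrow; i++) {
    const double * const cur_vals = (const double * const) A->ptr_to_vals_in_row[i];
    const int * const cur_inds = (const int * const) A->ptr_to_inds_in_row[i];
    const int cur_nnz = (const int) A->nnz_in_row[i];
    const int unroll = (cur_nnz / 4) * 4;
    __m256d sumVec = _mm256_setzero_pd();
    int j;
    for (j = 0; j < unroll; j += 4) {
      __m256d cur_valsVec = _mm256_loadu_pd(cur_vals + j);
      /* gather x[cur_inds[j]] .. x[cur_inds[j+3]]; scale 8 = sizeof(double) */
      __m128i idx = _mm_loadu_si128((const __m128i *)(cur_inds + j));
      __m256d xVec = _mm256_i32gather_pd(x, idx, 8);
      sumVec = _mm256_add_pd(sumVec, _mm256_mul_pd(cur_valsVec, xVec));
    }
    /* horizontal sum of the four vector lanes into a scalar */
    __m128d lo = _mm256_castpd256_pd128(sumVec);
    __m128d hi = _mm256_extractf128_pd(sumVec, 1);
    lo = _mm_add_pd(lo, hi);
    double sum = _mm_cvtsd_f64(lo) + _mm_cvtsd_f64(_mm_unpackhi_pd(lo, lo));
    /* scalar remainder */
    for (; j < cur_nnz; j++)
      sum += cur_vals[j] * x[cur_inds[j]];
    y[i] = sum;   /* single scalar store per row */
  }
  return 0;
}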

C - parallelize recurrence omp

I have a problem: I have to parallelize this piece of code with OpenMP.
There is a data-dependency problem and I don't know how to solve it.
Any suggestions?
for (n = 2; n < N+1; n++) {
dz = *(dynamic_d + n-1)*z;
*(dynamic_A + n) = *(dynamic_A + n-1) + dz * (*(dynamic_A + n-2));
*(dynamic_B + n) = *(dynamic_B + n-1) + dz * (*(dynamic_B + n-2));
}
You cannot parallelize the loop iterations due to the dependency, but you can split the computation of dynamic_A vs dynamic_B using sections:
#pragma omp parallel sections
{
#pragma omp section
{
// NOTE: Declare n and dz locally so that they are private!
for (int n = 2; n < N+1; n++) {
my_type dz = dynamic_d[n-1] * z;
dynamic_A[n] = dynamic_A[n-1] + dz * dynamic_A[n-2];
}
}
#pragma omp section
{
for (int n = 2; n < N+1; n++) {
my_type dz = dynamic_d[n-1] * z;
dynamic_B[n] = dynamic_B[n-1] + dz * dynamic_B[n-2];
}
}
}
Please use array indexing instead of the unwieldy pointer-arithmetic dereferencing.
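That is, the original serial loop written with array indexing would read (a sketch; double is assumed for dz, since its declaration is not shown):
for (int n = 2; n < N + 1; n++) {
    double dz = dynamic_d[n - 1] * z;
    dynamic_A[n] = dynamic_A[n - 1] + dz * dynamic_A[n - 2];
    dynamic_B[n] = dynamic_B[n - 1] + dz * dynamic_B[n - 2];
}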

Cost function not decreasing in gradient descent implementation

I am trying to implement batch gradient descent in C. The problem is that my cost function increases dramatically on every iteration and I am not able to understand what is wrong. I checked my code several times and it seems to me that I coded the formulas exactly. Do you have any suggestions or ideas about what might be wrong in the implementation?
My data set is here: https://archive.ics.uci.edu/ml/datasets/Housing
And I reference these slides for the algorithm (I googled this): http://asv.informatik.uni-leipzig.de/uploads/document/file_link/527/TMI04.2_linear_regression.pdf
I read the data set correctly into main memory. The part below shows how I store the data set in main memory. It is straightforward.
//Definitions
#define NUM_OF_ATTRIBUTES 13
#define NUM_OF_SETS 506
#define LEARNING_RATE 0.07
//Data holder
struct data_set_s
{
double x_val[NUM_OF_SETS][NUM_OF_ATTRIBUTES + 1];
double y_val[NUM_OF_SETS];
double teta_val[NUM_OF_ATTRIBUTES + 1];
};
//RAM
struct data_set_s data_set;
Teta values are initialized to 0 and x0 values are initialized to 1.
The section below is the hypothesis function, which is the standard polynomial function.
double perform_hypothesis_a(unsigned short set_index)
{
double result;
int i;
result = 0;
for(i = 0; i < NUM_OF_ATTRIBUTES + 1; i++)
result += data_set.teta_val[i] * data_set.x_val[set_index][i];
return result;
}
The section below is the cost function.
double perform_simplified_cost_func(double (*hypothesis_func)(unsigned short))
{
double result, val;
int i;
result = 0;
for(i = 0; i < NUM_OF_SETS; i++)
{
val = hypothesis_func(i) - data_set.y_val[i];
result += pow(val, 2);
}
result = result / (double)(2 * NUM_OF_SETS);
return result;
}
The section below is the gradient descent function.
double perform_simplified_gradient_descent(double (*hypothesis_func)(unsigned short))
{
double temp_teta_val[NUM_OF_ATTRIBUTES + 1], summation, val;
int i, j, k;
for(i = 0; i < NUM_OF_ATTRIBUTES + 1; i++)
temp_teta_val[i] = 0;
for(i = 0; i < 10; i++) //assume this is "while not converged"
{
for(j = 0; j < NUM_OF_ATTRIBUTES + 1; j++)
{
summation = 0;
for(k = 0; k < NUM_OF_SETS; k++)
{
summation += (hypothesis_func(k) - data_set.y_val[k]) * data_set.x_val[k][j];
}
val = ((double)LEARNING_RATE * summation) / NUM_OF_SETS;
temp_teta_val[j] = data_set.teta_val[j] - val;
}
for(j = 0; j < NUM_OF_ATTRIBUTES + 1; j++)
{
data_set.teta_val[j] = temp_teta_val[j];
}
printf("%lg\n ", perform_simplified_cost_func(hypothesis_func));
}
return 1;
}
While it seems correct to me, when I print the cost function at the end of every gradient descent iteration, it goes like: 1.09104e+011, 5.234e+019, 2.51262e+028, 1.20621e+037...
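For reference, the update that the nested loops above are meant to implement is the standard batch gradient-descent rule, applied simultaneously for every j (with m = NUM_OF_SETS, alpha = LEARNING_RATE, d = NUM_OF_ATTRIBUTES and x_0 = 1):
\theta_j \leftarrow \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \qquad h_\theta(x) = \sum_{k=0}^{d} \theta_k x_k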

OpenMP with C program

I am having a hard time using OpenMP with C to parallelize this method. I was wondering if anyone could help and possibly tell me what is wrong with my parallelization.
void blur(float **out, float **in) {
// assumes "padding" to avoid messy border cases
int i, j, r, c;
float tmp, term;
term = 1.0 / 157.0;
#pragma omp parallel num_threads(8)
#pragma omp for private(r,c)
for (i = 0; i < N-4; i++) {
for (j = 0; j < N-4; j++) {
tmp = 0.0;
for (r = 0; r < 5; r++) {
for (c = 0; c < 5; c++) {
tmp += in[i+r][j+c] * mask[r][c];
}
}
out[i+2][j+2] = term * tmp;
}
}
}
You should either declare tmp inside the loop:
// replacing the tmp = 0.0; assignment inside the j loop:
float tmp = 0.0;
or list tmp as a private variable on the work-sharing directive:
// instead of the original omp for line:
#pragma omp for private(j,r,c,tmp)
Otherwise it is treated as a variable shared among the threads. Note that j needs the same treatment: it is declared outside the parallel region and is not the control variable of the parallelized loop, so it must be made private as well (hence its inclusion in the clause above).
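Putting the fixes together, a sketch of the corrected function (taking the first option, and assuming N and mask are file-scope as in the original):
void blur(float **out, float **in) {
    // assumes "padding" to avoid messy border cases
    const float term = 1.0f / 157.0f;
    #pragma omp parallel for num_threads(8)
    for (int i = 0; i < N-4; i++) {
        for (int j = 0; j < N-4; j++) {
            float tmp = 0.0f;    // declared inside the loop, hence private
            for (int r = 0; r < 5; r++) {
                for (int c = 0; c < 5; c++) {
                    tmp += in[i+r][j+c] * mask[r][c];
                }
            }
            out[i+2][j+2] = term * tmp;
        }
    }
}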

OpenMP C Normal Random Generator

I was doing a C assignment for parallel computing, where I have to implement some sort of Monte Carlo simulation with an efficient, thread-safe normal random generator using the Box-Muller transform. I generate 2 vectors of uniform random numbers X and Y, with the condition that X is in (0,1] and Y is in [0,1].
But I'm not sure that my way of sampling uniform random numbers from the half-open interval (0,1] is right.
Did anyone encounter something similar?
I'm using the following code:
double* StandardNormalRandom(long int N){
double *X = NULL, *Y = NULL, *U = NULL;
X = vUniformRandom_0(N / 2);
Y = vUniformRandom(N / 2);
#pragma omp parallel for
for (i = 0; i<N/2; i++){
U[2*i] = sqrt(-2 * log(X[i]))*sin(Y[i] * 2 * pi);
U[2*i + 1] = sqrt(-2 * log(X[i]))*cos(Y[i] * 2 * pi);
}
return U;
}
double* NormalRandom(long int N, double mu, double sigma2)
{
double *U = NULL, stdev = sqrt(sigma2);
U = StandardNormalRandom(N);
#pragma omp parallel for
for (int i = 0; i < N; i++) U[i] = mu + stdev*U[i];
return U;
}
Here is the part of my UniformRandom function that is also implemented in parallel:
#pragma omp parallel for firstprivate(i)
for (long int j = 0; j < N;j++)
{
if (i == 0){
int tn = omp_get_thread_num();
I[tn] = S[tn];
i++;
}
else
{
I[j] = (a*I[j - 1] + c) % m;
}
}
}
#pragma omp parallel for
for (long int j = 0; j < N; j++)
U[j] = (double)I[j] / (m+1.0);
In the StandardNormalRandom function, I will assume that the pointer U has been allocated with size N, in which case this function looks fine to me.
The same goes for the function NormalRandom.
However, for the function UniformRandom (which is missing some parts, so I'll have to assume a few things): if the line I[j] = (a*I[j - 1] + c) % m + 1; is the body of a loop under an omp parallel for, then you will have some issues. Since you cannot know the order in which the threads execute the iterations, the thread handling a given j cannot rely on I[j - 1] already holding its final value, as it may not have been computed yet by the thread that owns iteration j - 1 (I is shared by default), so this recurrence cannot be parallelized that way.
Hope it helps!
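As a side note on thread safety: the usual way to make the uniform stage thread-safe is to give every thread its own generator state, so that no iteration reads a value another thread has yet to write. A minimal sketch, not the poster's generator; the function name, the seeding scheme and the use of Knuth's MMIX LCG constants are all illustrative assumptions.
#include <omp.h>
#include <stdint.h>

/* Sketch: one 64-bit LCG state per thread, so there is no loop-carried
   dependence across iterations. fill_uniform and its seeding are
   hypothetical; the multiplier/increment are Knuth's MMIX constants. */
void fill_uniform(double *U, long N, uint64_t seed)
{
    #pragma omp parallel
    {
        /* derive a distinct starting state for each thread */
        uint64_t s = seed ^ (0x9E3779B97F4A7C15ULL * (uint64_t)(omp_get_thread_num() + 1));
        #pragma omp for
        for (long j = 0; j < N; j++) {
            s = s * 6364136223846793005ULL + 1442695040888963407ULL;
            U[j] = (double)(s >> 11) / 9007199254740992.0;   /* uniform in [0,1) */
        }
    }
}
For the Box-Muller X vector, which must avoid log(0), a value in (0,1] can then be taken as 1.0 - U[j]. Note that this naive per-thread seeding gives no guarantee that the streams of different threads do not overlap.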
