I am having a hard time using OpenMP with C to parallelize this method. I was wondering if anyone could help and tell me what is wrong with my parallelization.
void blur(float **out, float **in) {
    // assumes "padding" to avoid messy border cases
    int i, j, r, c;
    float tmp, term;
    term = 1.0 / 157.0;
    #pragma omp parallel num_threads(8)
    #pragma omp for private(r,c)
    for (i = 0; i < N-4; i++) {
        for (j = 0; j < N-4; j++) {
            tmp = 0.0;
            for (r = 0; r < 5; r++) {
                for (c = 0; c < 5; c++) {
                    tmp += in[i+r][j+c] * mask[r][c];
                }
            }
            out[i+2][j+2] = term * tmp;
        }
    }
}
You should either declare tmp inside the loop, right where it is first assigned:

float tmp = 0.0;

or add it to the private clause of the work-sharing directive:

#pragma omp for private(j,r,c,tmp)

Otherwise tmp is treated as a variable shared among the threads, and every thread writes to the same copy, which is a data race. Note that j has the same problem: it is declared before the parallel region and is not the control variable of the parallelized loop, so it is shared as well and must also be made private (or declared inside the i-loop).
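Putting both fixes together, a corrected version might look like the following minimal sketch (assuming, as in the question, that N and mask are globals; it also merges the two directives into a single parallel for, which is not required but keeps the example short):

void blur(float **out, float **in) {
    // assumes "padding" to avoid messy border cases
    const float term = 1.0 / 157.0;
    #pragma omp parallel for num_threads(8)
    for (int i = 0; i < N-4; i++) {
        for (int j = 0; j < N-4; j++) {   // j declared inside the region: private
            float tmp = 0.0f;             // tmp declared inside the region: private
            for (int r = 0; r < 5; r++) {
                for (int c = 0; c < 5; c++) {
                    tmp += in[i+r][j+c] * mask[r][c];
                }
            }
            out[i+2][j+2] = term * tmp;
        }
    }
}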
So I have this function that I have to parallelize with OpenMP static scheduling for n threads
void computeAccelerations() {
    int i, j;
    for (i = 0; i < bodies; i++) {
        accelerations[i].x = 0; accelerations[i].y = 0; accelerations[i].z = 0;
        for (j = 0; j < bodies; j++) {
            if (i != j) {
                //accelerations[i] = addVectors(accelerations[i],scaleVector(GravConstant*masses[j]/pow(mod(subtractVectors(positions[i],positions[j])),3),subtractVectors(positions[j],positions[i])));
                vector sij = {positions[i].x-positions[j].x, positions[i].y-positions[j].y, positions[i].z-positions[j].z};
                vector sji = {positions[j].x-positions[i].x, positions[j].y-positions[i].y, positions[j].z-positions[i].z};
                double mod = sqrt(sij.x*sij.x + sij.y*sij.y + sij.z*sij.z);
                double mod3 = mod * mod * mod;
                double s = GravConstant*masses[j]/mod3;
                vector S = {s*sji.x, s*sji.y, s*sji.z};
                accelerations[i].x += S.x; accelerations[i].y += S.y; accelerations[i].z += S.z;
            }
        }
    }
}
I tried to do something like:
void computeAccelerations_static(int num_of_threads) {
    int i, j;
    #pragma omp parallel for num_threads(num_of_threads) schedule(static)
    for (i = 0; i < bodies; i++) {
        accelerations[i].x = 0; accelerations[i].y = 0; accelerations[i].z = 0;
        for (j = 0; j < bodies; j++) {
            if (i != j) {
                //accelerations[i] = addVectors(accelerations[i],scaleVector(GravConstant*masses[j]/pow(mod(subtractVectors(positions[i],positions[j])),3),subtractVectors(positions[j],positions[i])));
                vector sij = {positions[i].x-positions[j].x, positions[i].y-positions[j].y, positions[i].z-positions[j].z};
                vector sji = {positions[j].x-positions[i].x, positions[j].y-positions[i].y, positions[j].z-positions[i].z};
                double mod = sqrt(sij.x*sij.x + sij.y*sij.y + sij.z*sij.z);
                double mod3 = mod * mod * mod;
                double s = GravConstant*masses[j]/mod3;
                vector S = {s*sji.x, s*sji.y, s*sji.z};
                accelerations[i].x += S.x; accelerations[i].y += S.y; accelerations[i].z += S.z;
            }
        }
    }
}
It seemed natural to just add #pragma omp parallel for num_threads(num_of_threads) schedule(static), but the result isn't correct.
I think there is some kind of false sharing with the accelerations[i] but I don't know how to approach it. I appreciate any kind of help. Thank you.
In your loop nest, only the iterations of the outer loop are parallelized. Because i is the loop-control variable, each thread gets its own private copy, but as a matter of style, it would be better to declare i in the loop control block.
j is another matter. It is declared outside the parallel region and it is not the control variable of a parallelized loop. As a result, it is shared among the threads. Because each of the threads executing i-loop iterations manipulates the shared variable j, you have a huge problem with data races. This can be resolved (among other alternatives) by moving the declaration of j into the parallel region, preferably into the control block of its associated loop.
Overall, then:
// int i, j;
#pragma omp parallel for num_threads(num_of_threads) schedule(static)
for (int i = 0; i < bodies; i++) {
    accelerations[i].x = 0;
    accelerations[i].y = 0;
    accelerations[i].z = 0;
    for (int j = 0; j < bodies; j++) {
        if (i != j) {
            //accelerations[i] = addVectors(accelerations[i],scaleVector(GravConstant*masses[j]/pow(mod(subtractVectors(positions[i],positions[j])),3),subtractVectors(positions[j],positions[i])));
            vector sij = { positions[i].x - positions[j].x,
                           positions[i].y - positions[j].y,
                           positions[i].z - positions[j].z };
            vector sji = { positions[j].x - positions[i].x,
                           positions[j].y - positions[i].y,
                           positions[j].z - positions[i].z };
            double mod = sqrt(sij.x * sij.x + sij.y * sij.y + sij.z * sij.z);
            double mod3 = mod * mod * mod;
            double s = GravConstant * masses[j] / mod3;
            vector S = { s * sji.x, s * sji.y, s * sji.z };
            accelerations[i].x += S.x;
            accelerations[i].y += S.y;
            accelerations[i].z += S.z;
        }
    }
}
Note also that computing sji appears to be wasteful, as in mathematical terms it is just -sij, and neither sji nor sij is modified. I would probably reduce the above to something more like this:
#pragma omp parallel for num_threads(num_of_threads) schedule(static)
for (int i = 0; i < bodies; i++) {
    accelerations[i].x = 0;
    accelerations[i].y = 0;
    accelerations[i].z = 0;
    for (int j = 0; j < bodies; j++) {
        if (i != j) {
            vector sij = { positions[i].x - positions[j].x,
                           positions[i].y - positions[j].y,
                           positions[i].z - positions[j].z };
            double mod = sqrt(sij.x * sij.x + sij.y * sij.y + sij.z * sij.z);
            double mod3 = mod * mod * mod;
            double s = GravConstant * masses[j] / mod3;
            accelerations[i].x -= s * sij.x;
            accelerations[i].y -= s * sij.y;
            accelerations[i].z -= s * sij.z;
        }
    }
}
I have a program that works with arrays and outputs a single number. To parallelize the program I use OpenMP, but the problem is that after adding the directives I started getting results that differ from the results of the program without parallelization. Can anyone tell me where I made a mistake?
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <math.h>
#include <float.h>

#define LOOP_COUNT 100
#define MIN_1_ARRAY_VALUE 1
#define MAX_1_ARRAY_VALUE 10
#define MIN_2_ARRAY_VALUE 10
#define MAX_2_ARRAY_VALUE 100
#define PI 3.1415926535897932384626433
#define thread_count 4

double generate_random(unsigned int* seed, int min, int max) {
    return (double)((rand_r(seed) % 10) + (1.0 / (rand_r(seed) % (max - min + 1))));
}

double map_1(double value) {
    return pow(value / PI, 3);
}

double map_2(double value) {
    return fabs(tan(value));
}
void dwarf_sort(int n, double mass[]) {
    int i = 1;
    int j = 2;
    while (i < n) {
        if (mass[i-1] < mass[i]) {
            i = j;
            j = j + 1;
        }
        else {
            double tmp = mass[i];
            mass[i] = mass[i - 1];
            mass[i - 1] = tmp;
            --i;
            if (i == 0) {
                i = j;
                j = j + 1;
            }
        }
    }
}
int main(int argc, char* argv[]) {
    printf("lab work 3 in processing...!\n");
    int trial_counter;
    int array_size = atoi(argv[1]);
    struct timeval before, after;
    long time_diff;
    gettimeofday(&before, NULL);
    for (trial_counter = 0; trial_counter < LOOP_COUNT; trial_counter++) {
        double arr1[array_size];
        double arr2[array_size / 2];
        double arr2_copy[array_size / 2];
        double arr2_min = DBL_MAX;
        unsigned int tempValue = trial_counter;
        unsigned int *currentSeed = &tempValue;
        //stage 1 - init
        #pragma omp parallel num_threads(thread_count)
        {
            #pragma omp parallel for default(none) shared(arr1, currentSeed, array_size) schedule(guided, thread_count)
            for (int i = 0; i < array_size; i++) {
                arr1[i] = generate_random(currentSeed, MIN_1_ARRAY_VALUE, MAX_1_ARRAY_VALUE);
                // printf("arr[%d] = %f\n", i, arr1[i]);
            }
            #pragma omp parallel for default(none) shared(arr2, arr2_copy, array_size, currentSeed, arr2_min) schedule(guided, thread_count)
            for (int i = 0; i < array_size / 2; i++) {
                double value = generate_random(currentSeed, MIN_2_ARRAY_VALUE, MAX_2_ARRAY_VALUE);
                arr2[i] = value;
                arr2_copy[i] = value;
                if (value < arr2_min) {
                    arr2_min = value;
                }
            }
            #pragma omp parallel for default(none) shared(arr1, array_size) schedule(guided, thread_count)
            for (int i = 0; i < array_size; i++) {
                arr1[i] = map_1(arr1[i]);
            }
            #pragma omp parallel for default(none) shared(arr2, arr2_copy, array_size) schedule(guided, thread_count)
            for (int i = 1; i < array_size / 2; i++) {
                #pragma omp critical
                arr2[i] = map_2(arr2_copy[i] + arr2_copy[i - 1]);
            }
            #pragma omp parallel for default(none) shared(arr2, arr1, array_size) schedule(guided, thread_count)
            for (int i = 0; i < array_size / 2; i++) {
                arr2[i] = pow(arr1[i], arr2[i]);
            }
            #pragma omp parallel sections
            {
                #pragma omp section
                {
                    dwarf_sort((int) array_size / 2, arr2);
                }
            }
            double final_sum = 0;
            for (int i = 0; i < array_size / 2; i++) {
                if (((int) arr2[i]) / 2 == 0) {
                    final_sum += sin(arr2[i]);
                }
            }
            // printf("Iteration %d, value: %f\n", trial_counter, final_sum);
        }
    }
    gettimeofday(&after, NULL);
    time_diff = 1000 * (after.tv_sec - before.tv_sec) + (after.tv_usec - before.tv_usec) / 1000;
    printf("\nN=%d. Milliseconds passed: %ld\n", array_size, time_diff);
    return 0;
}
rand_r is thread-safe only if each thread has its own seed, or if the threads are guaranteed to operate on the seed in an exclusive way (e.g. using an expensive critical section). That is not the case in your code: currentSeed is shared between the threads, so you have a race condition, since multiple threads can mutate it simultaneously. You need to use a thread-private seed, with a different value per thread so that the threads do not all produce the same sequence. The seed of each thread can be initialized from a shared array filled by the main thread (e.g. naively [0, 1, 2, 3, etc.]).
The thing is, you will still get different results when using a different number of threads with this approach. One solution is to split your data set into independent chunks, each with an associated seed, and then possibly compute the chunks in parallel.
Note also that using a #pragma omp parallel for inside a #pragma omp parallel region causes many additional threads to be created (i.e. over-subscription through nested parallelism). This is generally very inefficient. You should use #pragma omp for instead.
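For illustration only, here is a minimal sketch of how the first initialization loop could be written with a per-thread seed and a plain #pragma omp for inside the existing parallel region (include omp.h for omp_get_thread_num; the seeds array and the way it is filled here are just one possible choice, not taken from the original code):

unsigned int seeds[thread_count];
for (int t = 0; t < thread_count; t++)
    seeds[t] = trial_counter * thread_count + t;   // one distinct seed per thread

#pragma omp parallel num_threads(thread_count)
{
    // each thread keeps a private copy of its own seed
    unsigned int mySeed = seeds[omp_get_thread_num()];

    #pragma omp for schedule(guided, thread_count)
    for (int i = 0; i < array_size; i++) {
        arr1[i] = generate_random(&mySeed, MIN_1_ARRAY_VALUE, MAX_1_ARRAY_VALUE);
    }

    // ... the remaining loops would use #pragma omp for in the same way ...
}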
I'm trying to make this code run in parallel. It's a chunk of code from a big project. I thought I'd start parallelizing slowly to see if there is a problem, step by step (I don't know if that's a good tactic, so please let me know).
double best_nearby(double delta[MAXVARS], double point[MAXVARS], double prevbest, int nvars)
{
    double z[MAXVARS];
    double minf, ftmp;
    int i;

    minf = prevbest;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for shared(nvars,point,z) private(i)
    for (i = 0; i < nvars; i++)
        z[i] = point[i];
    for (i = 0; i < nvars; i++) {
        z[i] = point[i] + delta[i];
        ftmp = f(z, nvars);
        if (ftmp < minf)
            minf = ftmp;
        else {
            delta[i] = 0.0 - delta[i];
            z[i] = point[i] + delta[i];
            ftmp = f(z, nvars);
            if (ftmp < minf)
                minf = ftmp;
            else
                z[i] = point[i];
        }
    }
    for (i = 0; i < nvars; i++)
        point[i] = z[i];
    return (minf);
}
NUM_THREADS is #defined
The function has some more lines, but they are the same in the parallel and the serial versions.
The serial code takes about 130 s on average, while the parallel version takes something like 400 s. It baffles me that such a small change can lead to such an increase in execution time. Any ideas on why this happens? Thank you in advance!
double f(double *x, int n) {
    double fv;
    int i;

    funevals++;
    fv = 0.0;
    for (i = 0; i < n-1; i++)   /* rosenbrock */
        fv = fv + 100.0*pow((x[i+1]-x[i]*x[i]),2) + pow((x[i]-1.0),2);
    return fv;
}
Currently, you are not parallelizing much. You can start by parallelizing the f function, since it looks computationally demanding:
double f(double *x, int n) {
    ..
    double fv = 0.0;
    #pragma omp parallel for reduction(+:fv)
    for (int i = 0; i < n-1; i++)
        fv = fv + 100.0*pow((x[i+1]-x[i]*x[i]),2) + pow((x[i]-1.0),2);
    return fv;
}
Test and check the results. After that, you can try to expand the scope of the parallelization to include the outermost loop as well.
I am unable to reduce the execution time of multiple FFTs using OpenMP. I tried parallelizing the outermost loop, but this degraded the performance.
typedef struct { float r; float i; } cmplx_f32_t;

double src[2*128];
double dst[2*128];
double w[128];
cmplx_f32_t data[128][4][256];

cffti(128, w);
for (k = 0; k < 128; k++)
{
    for (j = 0; j < 4; j++)
    {
        for (i = 0; i < 2*32; i++)
        {
            src[i] = data[i/2][j][k].r;
            src[i+1] = data[i/2][j][k].i;
        }
        cfft2(128, src, dst, w, 1);
    }
}
cffti and cfft2 are as given in the example at https://people.sc.fsu.edu/~jburkardt/c_src/fft_openmp/fft_openmp.html
If I disable the #pragma omp directives in the fft_openmp.c file, the run time is about 11 ms. If I use #pragma omp, the total execution time is about 220 ms.
I was doing a C assignment for parallel computing, where I have to implement a Monte Carlo simulation with an efficient, thread-safe normal random generator using the Box-Muller transform. I generate two vectors of uniform random numbers, X and Y, with the condition that X is in (0,1] and Y is in [0,1].
But I'm not sure that my way of sampling uniform random numbers from the half-open interval (0,1] is right.
Has anyone encountered something similar?
I'm using the following code:
double* StandardNormalRandom(long int N) {
    double *X = NULL, *Y = NULL, *U = NULL;
    X = vUniformRandom_0(N / 2);
    Y = vUniformRandom(N / 2);
    #pragma omp parallel for
    for (i = 0; i < N/2; i++) {
        U[2*i]     = sqrt(-2 * log(X[i])) * sin(Y[i] * 2 * pi);
        U[2*i + 1] = sqrt(-2 * log(X[i])) * cos(Y[i] * 2 * pi);
    }
    return U;
}

double* NormalRandom(long int N, double mu, double sigma2)
{
    double *U = NULL, stdev = sqrt(sigma2);
    U = StandardNormalRandom(N);
    #pragma omp parallel for
    for (int i = 0; i < N; i++) U[i] = mu + stdev*U[i];
    return U;
}
Here is the relevant bit of my UniformRandom function, also implemented in parallel:
#pragma omp parallel for firstprivate(i)
for (long int j = 0; j < N; j++)
{
    if (i == 0) {
        int tn = omp_get_thread_num();
        I[tn] = S[tn];
        i++;
    }
    else
    {
        I[j] = (a*I[j - 1] + c) % m;
    }
}
}

#pragma omp parallel for
for (long int j = 0; j < N; j++)
    U[j] = (double)I[j] / (m + 1.0);
In the StandardNormalRandom function I will assume that the pointer U has been allocated with size N, in which case the function looks fine to me.
The same goes for the function NormalRandom.
However, for the function UniformRandom (which is missing some parts, so I'll have to assume some things): if the line I[j] = (a*I[j - 1] + c) % m; is the body of a loop under an omp parallel for, then you will have some issues. Since you can't know the order in which the threads execute, the current thread (with a fixed value of j) can't rely on the value of I[j - 1], because that value could be modified at any time (I is shared by default).
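To make that dependence concrete, here is a minimal sketch of one common workaround, assumed rather than taken from the code above: give every thread its own generator state (seeded from the question's S array), so that no iteration ever reads an I[j - 1] that another thread is writing. Note that this changes the generated sequence, and the integer types are assumed to match those of I and S:

#pragma omp parallel
{
    // thread-local state: the recurrence never touches another thread's values
    unsigned long state = S[omp_get_thread_num()];

    #pragma omp for
    for (long int j = 0; j < N; j++) {
        state = (a * state + c) % m;       // linear congruential step, thread-local
        I[j] = state;
        U[j] = (double)I[j] / (m + 1.0);
    }
}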
Hope it helps!