I have the following code:
for (int i = 0; i < veryLargeArraySize; i++) {
    int value = A[i];
    if (B[value] < MAX_VALUE) {
        B[value]++;
    }
}
I want to use an OpenMP worksharing construct here, but my issue is the synchronization on the B array: all parallel threads can access any element of B, which is very large (which makes using locks difficult, since I would need too many of them).
#pragma omp critical is a serious overhead here. Atomic is not possible because of the if.
Does anyone have a good suggestion on how I might do this?
Here's what I've found out and done.
I've read on some forums that parallel histogram calculation is generally a bad idea, since it may be slower and less efficient than the sequential calculation.
However, I needed to do it (for the assignment), so what I did is the following:
Parallel processing of the A array (the image) to determine the actual range of values for the histogram (the B array): find the MIN and MAX of A[i].
int min_value = INT_MAX, max_value = 0;   // <limits.h> needed for INT_MAX
#pragma omp parallel for reduction(min:min_value) reduction(max:max_value)
for (int i = 0; i < veryLargeArraySize; i++){
    const unsigned int value = A[i];
    if(max_value < value) max_value = value;
    if(min_value > value) min_value = value;
}
int size_of_histo = max_value - min_value + 1;
That way, we can (potentially) reduce the actual histogram size from, e.g., 1M elements (allocated in array B) to 50K elements (allocated in sharedHisto)
Allocate a shared array, such as:
int num_threads = omp_get_max_threads();   // omp_get_num_threads() would return 1 outside a parallel region
int* sharedHisto = (int*) calloc(num_threads * size_of_histo, sizeof(int));
Each thread is assigned a part of the sharedHisto, and can update it without synchronization
#pragma omp parallel default(shared)
{
    int my_id = omp_get_thread_num();   // must be queried inside the parallel region
    #pragma omp for
    for (int i = 0; i < veryLargeArraySize; i++){
        int value = A[i];
        // my_id * size_of_histo points to the beginning of this thread's
        // part of sharedHisto; value - min_value selects the histogram bin
        sharedHisto[my_id * size_of_histo + value - min_value]++;
    }
}
Now, perform a reduction (as stated here: Reducing on array in OpenMp)
#pragma omp parallel default(shared)
{
    // Every thread is in charge of part of the reduced histogram
    // (size_of_histo bins in total)
    int my_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    int chunk = (size_of_histo + num_threads - 1) / num_threads;
    int start = my_id * chunk;
    int end = (start + chunk > size_of_histo) ? size_of_histo : start + chunk;
    // The iteration range is already split by hand, so no "omp for" here
    for (int i = start; i < end; i++){
        int value = B[i + min_value];
        for (int j = 0; j < num_threads; j++){
            value += sharedHisto[j * size_of_histo + i];
        }
        B[i + min_value] = (value > MAX_VALUE) ? MAX_VALUE : value;
    }
}
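As a side note, if the compiler supports OpenMP 4.5 array-section reductions, the per-thread copies and the manual merge could be expressed more compactly. This is only a minimal sketch of that idea (assuming OpenMP 4.5+ and the same A, B, min_value, size_of_histo and MAX_VALUE as above):
// Sketch: one compact histogram, reduced by OpenMP itself
int* histo = (int*) calloc(size_of_histo, sizeof(int));
#pragma omp parallel for reduction(+:histo[:size_of_histo])
for (int i = 0; i < veryLargeArraySize; i++){
    histo[A[i] - min_value]++;
}
// Merge into B, saturating at MAX_VALUE
#pragma omp parallel for
for (int i = 0; i < size_of_histo; i++){
    int value = B[i + min_value] + histo[i];
    B[i + min_value] = (value > MAX_VALUE) ? MAX_VALUE : value;
}
free(histo);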
I need to find the max value in a matrix using OpenMP. It is my first experience with OpenMP; previously I did this task using pthreads.
I wrote this code, but it does not work:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

void MatrixFIller(int nrows, int* m) {
    for (int i = 0; i < nrows; i++) {
        for (int j = 0; j < nrows; j++) {
            *(m + i * nrows + j) = rand() % 200;
        }
    }
}

#define dimension 9
#define number_of_threads 4

int main() {
    srand(time(NULL));
    int matrix[dimension][dimension];
    int local_max = -1;
    int final_max = -1;
    int j = 0;
    MatrixFIller(dimension, &matrix[0][0]);
    for (int i = 0; i < dimension; i++) {
        for (int j = 0; j < dimension; j++) {
            printf("%d\t", matrix[i][j]);
        }
        printf("\n");
    }
    omp_set_num_threads(number_of_threads);
    #pragma omp parallel private(local_max)
    {
        #pragma omp for
        for (j = 0; j < dimension * dimension; j++) {
            if (*(matrix + (int)((j) / dimension) * dimension + (j - dimension * ((int)(j / dimension)))) > local_max) {
                local_max = *(matrix + (int)((j) / dimension) * dimension + (j - dimension * ((int)((j) / dimension))));
            }
        }
        #pragma omp critical
        if (local_max > final_max) { final_max = local_max; }
    }
    printf("Max value of matrix with dimension %d is %d", dimension, final_max);
}
The idea is that in the pragma for, each thread finds its local max, and after that it is compared with the global max value in the pragma critical. Why is it not correct? Thanks!
When entering the parallel region, local_max is uninitialized: the private clause creates variables that are local to each thread, and that's it; they are not initialized to any value. If you want them to be initialized with the value local_max had before entering the parallel region, you have to use the firstprivate clause instead.
However, it would actually be better to declare (and initialize) local_max inside the parallel region.
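For instance, a minimal sketch of that variant, keeping your critical-section merge (and assuming the same matrix, dimension and final_max as in your code):
#pragma omp parallel
{
    int local_max = -1;   // declared inside the region, hence private and initialized
    #pragma omp for
    for (int j = 0; j < dimension * dimension; j++) {
        int v = matrix[j / dimension][j % dimension];
        if (v > local_max) local_max = v;
    }
    #pragma omp critical
    if (local_max > final_max) final_max = local_max;
}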
Also, you may have a look at the reduction clause (with the max option), which will make the code even simpler:
#pragma omp parallel for reduction(max:final_max)
for (j = 0; j < dimension * dimension; j++) {
    if (*(matrix + (int)((j) / dimension) * dimension + (j - dimension * ((int)(j / dimension)))) > final_max) {
        final_max = *(matrix + (int)((j) / dimension) * dimension + (j - dimension * ((int)((j) / dimension))));
    }
}
EDIT
Following Laci's comment about the incorrectness of the arithmetic: all of your index calculations look correct but are not easy to read. Since you have a 2D array from the beginning, it is simpler to use two loops, and possibly to tell OpenMP to parallelize them both using the collapse clause. (By the way, as far as possible, declare the loop indices within the for(): this avoids always wondering which ones should be declared private or not.)
#pragma omp parallel for reduction(max:final_max) collapse(2)
for (int i = 0; i < dimension; i++) {
    for (int j = 0; j < dimension; j++) {
        if (matrix[i][j] > final_max) {
            final_max = matrix[i][j];
        }
    }
}
I have a program that works with arrays and outputs a single number. To parallelize the program, I use OpenMP, but the problem is that after adding the directives, I started getting answers that do not match the answers of the program without parallelization. Can anyone tell me where I made a mistake?
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <math.h>
#include <float.h>
#define LOOP_COUNT 100
#define MIN_1_ARRAY_VALUE 1
#define MAX_1_ARRAY_VALUE 10
#define MIN_2_ARRAY_VALUE 10
#define MAX_2_ARRAY_VALUE 100
#define PI 3.1415926535897932384626433
#define thread_count 4
double generate_random(unsigned int* seed, int min, int max) {
return (double)((rand_r(seed) % 10) + (1.0 / (rand_r(seed) % (max - min + 1))));
}
double map_1(double value) {
return pow(value / PI, 3);
}
double map_2(double value) {
return fabs(tan(value));
}
void dwarf_sort(int n, double mass[]) {
int i = 1;
int j = 2;
while (i < n) {
if (mass[i-1]<mass[i]) {
i = j;
j = j + 1;
}
else
{
double tmp = mass[i];
mass[i] = mass[i - 1];
mass[i - 1] = tmp;
--i;
if (i==0)
{
i = j;
j = j + 1;
}
}
}
}
int main(int argc, char* argv[]) {
printf("lab work 3 in processing...!\n");
int trial_counter;
int array_size = atoi(argv[1]);
struct timeval before, after;
long time_diff;
gettimeofday(&before, NULL);
for (trial_counter = 0; trial_counter < LOOP_COUNT; trial_counter++) {
double arr1[array_size];
double arr2[array_size / 2];
double arr2_copy[array_size / 2];
double arr2_min = DBL_MAX;
unsigned int tempValue = trial_counter;
unsigned int *currentSeed = &tempValue;
//stage 1 - init
#pragma omp parallel num_threads(thread_count)
{
#pragma omp parallel for default(none) shared(arr1, currentSeed, array_size) schedule(guided, thread_count)
for (int i = 0; i < array_size; i++) {
arr1[i] = generate_random(currentSeed, MIN_1_ARRAY_VALUE, MAX_1_ARRAY_VALUE);
// printf("arr[%d] = %f\n", i, arr1[i]);
}
#pragma omp parallel for default(none) shared(arr2, arr2_copy, array_size, currentSeed, arr2_min) schedule(guided, thread_count)
for (int i = 0; i < array_size / 2; i++) {
double value = generate_random(currentSeed, MIN_2_ARRAY_VALUE, MAX_2_ARRAY_VALUE);
arr2[i] = value;
arr2_copy[i] = value;
if (value < arr2_min) {
arr2_min = value;
}
}
#pragma omp parallel for default(none) shared(arr1, array_size) schedule(guided, thread_count)
for (int i = 0; i < array_size; i++) {
arr1[i] = map_1(arr1[i]);
}
#pragma omp parallel for default(none) shared(arr2, arr2_copy, array_size) schedule(guided, thread_count)
for (int i = 1; i < array_size / 2; i++) {
#pragma omp critical
arr2[i] = map_2(arr2_copy[i] + arr2_copy[i - 1]);
}
#pragma omp parallel for default(none) shared(arr2, arr1, array_size) schedule(guided, thread_count)
for (int i = 0; i < array_size / 2; i++) {
arr2[i] = pow(arr1[i], arr2[i]);
}
#pragma omp parallel sections
{
#pragma omp section
{
dwarf_sort((int) array_size / 2, arr2);
}
}
double final_sum = 0;
for (int i = 0; i < array_size / 2; i++) {
if (((int) arr2[i]) / 2 == 0) {
final_sum += sin(arr2[i]);
}
}
// printf("Iteration %d, value: %f\n", trial_counter, final_sum);
}
}
gettimeofday(&after, NULL);
time_diff = 1000 * (after.tv_sec - before.tv_sec) + (after.tv_usec - before.tv_usec) / 1000;
printf("\nN=%d. Milliseconds passed: %ld\n", array_size, time_diff);
return 0;
}
rand_r is thread-safe only if each thread has its own seed or if threads are guaranteed to operate on the seed in an exclusive way (e.g. using an expensive critical section). This is not the case in your code. Indeed, currentSeed is shared between threads, so it causes a race condition, since multiple threads can mutate it simultaneously. You need to use a thread-private seed (with a different value per thread, so that the threads do not all generate the same sequence). The seed of each thread can be initialized from a shared array filled by the main thread (e.g. naively [0, 1, 2, 3, etc.]).
The thing is, you will still get different results using a different number of threads with this approach. One solution is to split your data set into independent chunks, each with an associated seed, and then possibly compute the chunks in parallel.
Note that using a #pragma omp parallel for inside a #pragma omp parallel region causes many threads to be created (i.e. over-subscription). This is generally very inefficient. You should use #pragma omp for instead.
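For illustration, here is a minimal sketch of the per-thread seed idea combined with #pragma omp for (the seeding formula is just an assumption; it reuses trial_counter, thread_count, array_size, arr1 and generate_random from your code, and needs <omp.h> for omp_get_thread_num):
#pragma omp parallel num_threads(thread_count)
{
    // One private seed per thread; any scheme giving distinct values per thread works
    unsigned int mySeed = (unsigned int)(trial_counter * thread_count + omp_get_thread_num());
    #pragma omp for schedule(static)
    for (int i = 0; i < array_size; i++) {
        arr1[i] = generate_random(&mySeed, MIN_1_ARRAY_VALUE, MAX_1_ARRAY_VALUE);
    }
    // The other loops would likewise use "#pragma omp for" inside this single region
}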
I'm trying to parallelize the following Radix Sort algorithm C code using OpenMP but I have some doubts about using the OpenMP clauses. In particular, there are some loops where I doubt that they can be parallelized at all.
Here is the code I'm working on:
unsigned getMax(size_t n, unsigned arr[n]) {
unsigned mx = arr[0];
unsigned i;
#pragma omp parallel for reduction(max:mx) private(i)
for (i = 1; i < n; i++)
if (arr[i] > mx)
mx = arr[i];
return mx;
}
void countSort(size_t n, unsigned arr[n], unsigned exp) {
unsigned output[n]; // output array
int i, count[10] = { 0 };
// Store count of occurrences in count[]
#pragma omp parallel for private(i)
for (i = 0; i < n; i++) {
#pragma omp atomic
count[(arr[i] / exp) % 10]++; }
for (i = 1; i < 10; i++)
count[i] += count[i - 1];
// Build the output array
#pragma omp parallel for private(i)
for (i = (int) n - 1; i >= 0; i--) {
#pragma omp atomic write
output[count[(arr[i] / exp) % 10] - 1] = arr[i];
count[(arr[i] / exp) % 10]--;
}
#pragma omp parallel for private(i)
for (i = 0; i < n; i++)
arr[i] = output[i];
}
// The main function to that sorts arr[] of size n using Radix Sort
void radixsort(size_t n, unsigned arr[n], int threads) {
omp_set_num_threads(threads);
unsigned m = getMax(n, arr);
unsigned exp;
for (exp = 1; m / exp > 0; exp *= 10)
countSort(n, arr, exp);
}
In particular, I'm not sure if for loops like the following can be parallelized or not:
for (i = 1; i < 10; i++)
count[i] += count[i - 1];
#pragma omp parallel for private(i)
for (i = (int) n - 1; i >= 0; i--) {
#pragma omp atomic write
output[count[(arr[i] / exp) % 10] - 1] = arr[i];
count[(arr[i] / exp) % 10]--;
}
I'm asking for help on the specific OMP clauses I should use; other comments on the code shown are also welcome.
First of all, to make parallelization worthwhile, a reasonable amount of work is needed; otherwise the parallel overhead is bigger than the gain from parallelization. This is definitely the case in your example, since you create the output array on the stack (so it cannot be big enough). Comments on your code:
Both loops you mention in your question depend on the order of execution, so they cannot be parallelized easily/efficiently. Note also that there is a race condition when the count array is accessed.
If you select a base which is a power of 2 (2^k), you can get rid of the expensive integer division and use fast bitwise/shift operators instead.
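For example, with base 256 (2^8) the digit extraction in countSort could be replaced by a shift and a mask; shift here stands for 0, 8, 16 or 24 depending on the pass, and count would then need 256 entries:
unsigned digit = (arr[i] >> shift) & 0xFFu;   // instead of (arr[i] / exp) % 10
count[digit]++;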
Always define your variables in their minimal required scope. So instead of
unsigned i;
#pragma omp parallel for reduction(max:mx) private(i)
for (i = 1; i < n; i++) ....
the following code is preferred:
#pragma omp parallel for reduction(max:mx)
for (unsigned i = 1; i < n; i++) ....
To copy your array, memcpy can be used: memcpy(arr,output,n*sizeof(output[0]))
In this loop
#pragma omp parallel for private(i)
for (i = 0; i < n; i++) {
#pragma omp atomic
count[(arr[i] / exp) % 10]++; }
you can use reduction instead of atomic operation:
#pragma omp parallel for private(i) reduction(+:count[:10])
for (i = 0; i < n; i++) {
count[(arr[i] / exp) % 10]++; }
Radix sort can be parallelized if you split up the data. One way to do this is to use a most significant digit radix sort for the first pass, to create multiple logical bins. For example, if using base 256 (2^8), you end up with 256 bins, which radix sort can then sort in parallel, based on the number of logical cores on your system. With 4 cores, you can sort 4 bins at a time. This relies on having somewhat uniform distribution of the most significant digit, so that the bins are somewhat equal in size.
Trying to optimize the first pass may not help much, since you'll need atomic read/write to update a bin index, and the random-access writes to anywhere in the destination array will create cache conflicts.
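A rough sketch of the per-bin parallelization described above, where bin_start[] and bin_count[] are hypothetical results of an MSD first pass over base-256 digits and sort_bin() stands in for a sequential sort of the remaining digits:
// Each bin is a contiguous, independent slice of arr, so bins can be sorted in parallel
#pragma omp parallel for schedule(dynamic)
for (int b = 0; b < 256; b++) {
    sort_bin(bin_count[b], &arr[bin_start[b]]);
}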
I am trying to compare the timings of two algorithms; one runs in serial and the other in parallel, and I am using omp_get_wtime() to get the execution times of the functions.
When the number of threads is increased to 2, the execution time of the serial part doubles, even though the time of the parallel part is correctly reduced.
As the number of threads in the parallel portion is increased, the time of the serial portion grows proportionally.
Sample code of the portion that is being parallelized:
void constraints(int pop[], double gain[], double gammma[], double snr, int numberOfLinks, int chromosomes, int links, int (*cv)[])
{
    int i, j, k;
    double interference[chromosomes * links];
    #pragma omp parallel for private(i,j,k) num_threads(4)
    for (i = 0; i < chromosomes; i++)
    {
        (*cv)[i] = 0;
        for (j = 0; j < links; j++)
        {
            interference[i * links + j] = 0;
            for (k = 0; k < links; k++)
            {
                if (k != j)
                {
                    interference[i * links + j] = interference[i * links + j] + pop[i * links + k] * gain[k * numberOfLinks + j];
                }
            }
            if (((gain[j * numberOfLinks + j] / (interference[i * links + j] + 1 / snr)) < gammma[j]) && (pop[i * links + j] == 1))
            {
                (*cv)[i] = (*cv)[i] + 1;
            }
        }
    }
}
Could anyone please let me know why this is happening?
I was doing a C assignment for parallel computing, where I have to implement a Monte Carlo simulation with an efficient thread-safe normal random generator using the Box-Muller transform. I generate 2 vectors of uniform random numbers, X and Y, with the condition that X is in (0,1] and Y is in [0,1].
But I'm not sure that my way of sampling uniform random numbers from the half-open interval (0,1] is right.
Did anyone encounter something similar?
I'm using the following code:
double* StandardNormalRandom(long int N){
double *X = NULL, *Y = NULL, *U = NULL;
X = vUniformRandom_0(N / 2);
Y = vUniformRandom(N / 2);
#pragma omp parallel for
for (i = 0; i<N/2; i++){
U[2*i] = sqrt(-2 * log(X[i]))*sin(Y[i] * 2 * pi);
U[2*i + 1] = sqrt(-2 * log(X[i]))*cos(Y[i] * 2 * pi);
}
return U;
}
double* NormalRandom(long int N, double mu, double sigma2)
{
double *U = NULL, stdev = sqrt(sigma2);
U = StandardNormalRandom(N);
#pragma omp parallel for
for (int i = 0; i < N; i++) U[i] = mu + stdev*U[i];
return U;
}
here is the bit of my UniformRandom function also implemented in parallel:
#pragma omp parallel for firstprivate(i)
for (long int j = 0; j < N;j++)
{
if (i == 0){
int tn = omp_get_thread_num();
I[tn] = S[tn];
i++;
}
else
{
I[j] = (a*I[j - 1] + c) % m;
}
}
}
#pragma omp parallel for
for (long int j = 0; j < N; j++)
U[j] = (double)I[j] / (m+1.0);
In the StandardNormalRandom function, I will assume that the pointer U has been allocated to the size N, in which case this function looks fine to me.
The same goes for the function NormalRandom.
However, for the function UniformRandom (which is missing some parts, so I'll have to assume a few things): if the line I[j] = (a*I[j - 1] + c) % m; is the body of a loop with an omp parallel for, then you will have some issues. Since you can't know the order of execution of the threads, the current thread (with a fixed value of j) can't rely on the value of I[j - 1], as this value could be modified at any time (I is shared by default).
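For illustration only (this is not in the posted code), one common workaround for that dependency is to give each thread its own LCG state seeded from S, accepting that the generated stream then differs from the sequential one; a, c, m, S, I and N are taken from your fragment, the state type should be adjusted to your declarations, and <omp.h> is needed for omp_get_thread_num:
#pragma omp parallel
{
    long int state = S[omp_get_thread_num()];   // per-thread state
    #pragma omp for
    for (long int j = 0; j < N; j++) {
        state = (a * state + c) % m;   // each thread advances only its own stream
        I[j] = state;
    }
}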
Hope it helps!