I'm trying to parallelize the following Radix Sort algorithm C code using OpenMP but I have some doubts about using the OpenMP clauses. In particular, there are some loops where I doubt that they can be parallelized at all.
Here is the code I'm working on:
unsigned getMax(size_t n, unsigned arr[n]) {
    unsigned mx = arr[0];
    unsigned i;
    #pragma omp parallel for reduction(max:mx) private(i)
    for (i = 1; i < n; i++)
        if (arr[i] > mx)
            mx = arr[i];
    return mx;
}
void countSort(size_t n, unsigned arr[n], unsigned exp) {
    unsigned output[n]; // output array
    int i, count[10] = { 0 };

    // Store count of occurrences in count[]
    #pragma omp parallel for private(i)
    for (i = 0; i < n; i++) {
        #pragma omp atomic
        count[(arr[i] / exp) % 10]++;
    }

    for (i = 1; i < 10; i++)
        count[i] += count[i - 1];

    // Build the output array
    #pragma omp parallel for private(i)
    for (i = (int) n - 1; i >= 0; i--) {
        #pragma omp atomic write
        output[count[(arr[i] / exp) % 10] - 1] = arr[i];
        count[(arr[i] / exp) % 10]--;
    }

    #pragma omp parallel for private(i)
    for (i = 0; i < n; i++)
        arr[i] = output[i];
}
// The main function that sorts arr[] of size n using Radix Sort
void radixsort(size_t n, unsigned arr[n], int threads) {
    omp_set_num_threads(threads);
    unsigned m = getMax(n, arr);
    unsigned exp;
    for (exp = 1; m / exp > 0; exp *= 10)
        countSort(n, arr, exp);
}
In particular, I'm not sure if for loops like the following can be parallelized or not:
for (i = 1; i < 10; i++)
    count[i] += count[i - 1];

#pragma omp parallel for private(i)
for (i = (int) n - 1; i >= 0; i--) {
    #pragma omp atomic write
    output[count[(arr[i] / exp) % 10] - 1] = arr[i];
    count[(arr[i] / exp) % 10]--;
}
I'm asking for help on the specific OMP clauses I should use; other comments on the code shown are also welcome.
First of all, parallelization only pays off when each loop does a reasonable amount of work; otherwise the parallel overheads outweigh the gains. That is very likely the case in your example, since you create the output array on the stack as a VLA, so n cannot be particularly large anyway. Comments on your code:
Both loops you mention in your question depend on the order of execution, so they cannot be parallelized easily or efficiently; the prefix-sum loop runs over only 10 elements anyway, so there is nothing to gain there. Note also that there is a race condition in the second loop, because the count array is read and decremented by several threads at once.
If you select a base which is a power of 2 (2^k), you can get rid of the expensive integer division and modulo and use fast bitwise shift/mask operations instead.
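As a minimal sketch with base 256 (2^8), assuming a pass counter pass that counts digits from the least significant end (the names here are illustrative, not from your code):

unsigned digit = (arr[i] >> (8 * pass)) & 0xFFu; // replaces (arr[i] / exp) % 10

With this base, count needs 256 entries instead of 10.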
Always define your variables in their minimal required scope. So instead of
unsigned i;
#pragma omp parallel for reduction(max:mx) private(i)
for (i = 1; i < n; i++) ....
the following code is preferred:
#pragma omp parallel for reduction(max:mx)
for (unsigned i = 1; i < n; i++) ....
To copy output back into arr, memcpy (from <string.h>) can be used: memcpy(arr, output, n * sizeof(output[0]));
In this loop
#pragma omp parallel for private(i)
for (i = 0; i < n; i++) {
    #pragma omp atomic
    count[(arr[i] / exp) % 10]++;
}
you can use an array-section reduction (requires OpenMP 4.5 or later) instead of the atomic operation:

#pragma omp parallel for private(i) reduction(+:count[:10])
for (i = 0; i < n; i++) {
    count[(arr[i] / exp) % 10]++;
}
Radix sort can be parallelized if you split up the data. One way to do this is to use a most-significant-digit radix sort for the first pass to create multiple logical bins. For example, with base 256 (2^8) you end up with 256 bins, which can then be sorted independently in parallel, according to the number of logical cores on your system; with 4 cores you can sort 4 bins at a time. This relies on a reasonably uniform distribution of the most significant digit, so that the bins end up roughly equal in size.
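A minimal sketch of the second phase, assuming the MSD pass has already produced bin pointers bins[] and lengths bin_len[], and that lsd_radix_sort is a serial sort of a single bin (all three names are hypothetical):

#pragma omp parallel for schedule(dynamic)
for (int b = 0; b < 256; b++)
    lsd_radix_sort(bins[b], bin_len[b]);

schedule(dynamic) helps when the bins are not perfectly balanced.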
Trying to parallelize the first pass itself may not help much, since you would need an atomic read-modify-write to update each bin index, and the scattered writes into the destination array cause cache-line conflicts between threads.
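For reference, claiming a destination slot atomically would look roughly like this sketch, where next[] holding the next free index of each bin is a hypothetical name:

size_t slot;
#pragma omp atomic capture
slot = next[digit]++;
dst[slot] = arr[i];

Every element then pays for an atomic operation, which is exactly the overhead the bin-splitting approach avoids.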
I made this parallel code to share the iterations in pairs like first and last, first+1 and last-1, and so on. But I don't know how to improve the code in each of the two parallel sections, because I have an inner loop in the sections and I can't think of any way to simplify it. Thanks.
This isn't about which values are stored in x or y. I use this sections design because the requirement is to execute the iterations from 0 to N in pairs like 0 and N, 1 and N-1, 2 and N-2, but I would like to know if I can optimize the inner loops while keeping this model.
int x = 0, y = 0, k, i, j, h;
#pragma omp parallel private(i, h) reduction(+:x, y)
{
    #pragma omp sections
    {
        #pragma omp section
        {
            for (i = 0; i < N/2; i++)
            {
                C[i] = 0;
                for (j = 0; j < N; j++)
                {
                    C[i] += MAT[i][j] * B[j];
                }
                x += C[i];
            }
        }
        #pragma omp section
        {
            for (h = N-1; h >= N/2; h--)
            {
                C[h] = 0;
                for (k = 0; k < N; k++)
                {
                    C[h] += MAT[h][k] * B[k];
                }
                y += C[h];
            }
        }
    }
}
x = x + y;
Using sections seems like the wrong approach here; a #pragma omp for is more appropriate. Also note that you forgot to declare j (and k) private.
int x = 0, y = 0, i, j;
#pragma omp parallel private(i,j) reduction(+:x, y)
{
    #pragma omp for nowait
    for (i = 0; i < N/2; i++) {
        // local variable to make life easier on the compiler
        int ci = 0;
        for (j = 0; j < N; j++)
            ci += MAT[i][j] * B[j];
        x += ci;
        C[i] = ci;
    }
    #pragma omp for nowait
    for (i = N/2; i < N; i++) {
        int ci = 0;
        for (j = 0; j < N; j++)
            ci += MAT[i][j] * B[j];
        y += ci;
        C[i] = ci;
    }
}
x = x + y;
Also, I'm not sure but if you just want x as your final output, you can simplify the code even further:
int x = 0, i, j;
#pragma omp parallel for reduction(+:x) private(i,j)
for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
        x += MAT[i][j] * B[j];
The sections construct is meant to distribute different tasks to different threads, and each section block marks a different task, so you will not be able to execute the iterations in the order you want. I answered a similar question here:
Distribution of loop iterations between threads with a specific order
But I want to clarify that the requirement to use sections is that each block must be independent of the other blocks.
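A minimal illustration of that rule, using two hypothetical, unrelated arrays P and Q (not from the question):

#pragma omp parallel sections
{
    #pragma omp section
    for (int i = 0; i < N; i++) P[i] = 0;
    #pragma omp section
    for (int i = 0; i < N; i++) Q[i] = 1;
}

Each section is handed to a single thread as a whole, which is exactly why the two halves of your matrix-vector product get no inner parallelism this way.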
A section gets only one thread, so you can't make the loops inside it parallel. How about:
make one parallel loop over all N iterations at the top level,
then inside each iteration use a conditional to decide whether to accumulate into x or y? A sketch follows.
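A minimal sketch of that idea, reusing the names from the question:

#pragma omp parallel for reduction(+:x, y)
for (int i = 0; i < N; i++) {
    int ci = 0;
    for (int j = 0; j < N; j++)
        ci += MAT[i][j] * B[j];
    C[i] = ci;
    if (i < N/2) x += ci;  // first half accumulates into x
    else         y += ci;  // second half accumulates into y
}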
That said, #Homer512's solution looks correct to me too.
Unable to reduce the execution time of multiple FFTs using OpenMP.
Tried parallelizing the outermost loop, but this degraded the performance.
typedef struct { float r; float i; } cmplx_f32_t;

double src[2*128];
double dst[2*128];
double w[128];
cmplx_f32_t data[128][4][256];

cffti(128, w);
for (k = 0; k < 128; k++)
{
    for (j = 0; j < 4; j++)
    {
        for (i = 0; i < 2*32; i++)
        {
            src[i] = data[i/2][j][k].r;
            src[i+1] = data[i/2][j][k].i;
        }
        cfft2(128, src, dst, w, 1);
    }
}
cffti and cfft2 are as given in the example at https://people.sc.fsu.edu/~jburkardt/c_src/fft_openmp/fft_openmp.html
If I disable the #pragma omp directives in the fft_openmp.c file, the run time is about 11 ms. With them enabled, the total execution time is about 220 ms.
I'm trying to optimize a program as an experiment.
When I parallelized the first two outer loops (over "it" and "i") I saw a significant difference in execution time, but when I tried to parallelize the innermost loop the program became much slower than the sequential one. I also tried using a reduction, but the result was the same.
Is this something I should expect, or did I make a mistake in the parallelization?
When I use the "nowait" clause it runs faster than the two previous parallelizations.
#pragma omp parallel private(it, i, j) firstprivate(u, sigma, dt, mu)
{
    for (it = 0; it < itime; it++) {
        for (i = 0; i < n; i++) {
            sum = 0.0;
            #pragma omp for schedule(static)
            for (j = 0; j < n; j += 1) {
                sum += sigma[i * n + j] * (u[j] - u[i]);
            }
            #pragma omp atomic write
            uplus[i] = (u[i] + dt * (mu - u[i])) + dt * sum / divide;
            if (u[i] > uth) {
                #pragma omp atomic write
                uplus[i] = 0.0;
                if (it >= ttransient) {
                    #pragma omp atomic
                    omega1[i] += 1.0;
                }
            }
        }
    }
} // omp end
The following function reads data from a file in chunks and processes each loaded chunk. To speed this up, I thought I would use OpenMP in the for loop so that the work is divided between the threads, as follows:
void read_process(FILE *fp_read, double *centroids, int total) {
    int i, j, c, dim = 16, chunk_size = 10000, num_itr;
    double *buffer = calloc(total * dim, sizeof(double));
    num_itr = total / chunk_size;

    for (c = 0; c < total; ++c) {
        fread(buffer, sizeof(double), chunk_size * dim, fp_read);
        #pragma omp parallel private(i, j)
        {
            #pragma omp for
            for (i = 0; i < chunk_size; i++) {
                for (j = 0; j < dim; j++) {
                    #pragma omp atomic update
                    centroids[j] += buffer[i * dim + j];
                }
            }
        }
    }
    free(buffer);
    fclose(fp_read);
}
Without OpenMP, my code works fine. However, adding the #pragma directives causes the program to stop and print the word Hangup in the terminal with no further explanation of what it hung on. Some folks on StackOverflow answered other questions with this error message saying it is probably caused by a race condition, but I don't think that is the case here because I am using atomic, which serializes the concurrent updates. Am I right? Do you see an issue with my code? How can I improve it?
Thank you very much.
What you want to do is an array reduction. If you have a compiler that supports OpenMP 4.5 then you don't need to change your serial code. You can do
#pragma omp parallel for private(j) reduction(+:centroids[:dim])
for (i = 0; i < chunk_size; i++) {
    for (j = 0; j < dim; j++) {
        centroids[j] += buffer[i*dim + j];
    }
}
Otherwise you can do the array reduction by hand. Here is one solution
#pragma omp parallel private(j)
{
    // per-thread partial sums; dim is not a compile-time constant, so this
    // is a VLA and cannot be initialized with = {0} -- zero it explicitly
    double tmp[dim];
    for (j = 0; j < dim; j++) tmp[j] = 0.0;

    #pragma omp for
    for (i = 0; i < chunk_size; i++) {
        for (j = 0; j < dim; j++) {
            tmp[j] += buffer[i*dim + j];
        }
    }

    #pragma omp critical
    for (int k = 0; k < dim; k++) centroids[k] += tmp[k];
}
Your current solution causes massive false sharing, since every thread writes to the same handful of cache lines holding centroids. Both of the solutions above fix this problem by giving each thread a private copy of centroids.
As long as dim << chunk_size, these are good solutions.
I have the following code:
for (int i = 0; i < veryLargeArraySize; i++) {
    int value = A[i];
    if (B[value] < MAX_VALUE) {
        B[value]++;
    }
}
I want to use an OpenMP worksharing construct here, but my issue is the synchronization on the B array: all parallel threads can access any element of B, which is very large, so per-element locks are impractical because I would need too many of them.
#pragma omp critical is a serious overhead here, and atomic is not possible because of the if.
Does anyone have a good suggestion on how I might do this?
Here's what I've found out and done.
I've read on some forums that parallel histogram calculation is generally a bad idea, since it may be slower and less efficient than the sequential calculation.
However, I needed to do it (for the assignment), so what I did is the following:
First, process the A array (the image) in parallel to determine the actual range of values it contains, which determines the histogram (B array) size: find the MIN and MAX of A[i].
int min_value = INT_MAX, max_value = 0;   // INT_MAX from <limits.h>
#pragma omp parallel for reduction(min:min_value) reduction(max:max_value)
for (i = 0; i < veryLargeArraySize; i++) {
    const int value = A[i];
    if (max_value < value) max_value = value;
    if (min_value > value) min_value = value;
}
int size_of_histo = max_value - min_value + 1;
That way, we can (potentially) reduce the actual histogram size from, e.g., 1M elements (allocated in array B) to 50K elements (allocated in sharedHisto)
Allocate a shared array, such as:
int num_threads = omp_get_max_threads(); // outside a parallel region, omp_get_num_threads() would return 1
int* sharedHisto = (int*) calloc(num_threads * size_of_histo, sizeof(int));
Each thread is assigned a part of the sharedHisto, and can update it without synchronization
#pragma omp parallel default(shared) private(i)
{
    int my_id = omp_get_thread_num(); // must be queried inside the parallel region
    #pragma omp for
    for (i = 0; i < veryLargeArraySize; i++) {
        int value = A[i];
        // my_id * size_of_histo points to the beginning of this thread's
        // part of sharedHisto; value - min_value selects the histogram bin
        sharedHisto[my_id * size_of_histo + value - min_value]++;
    }
}
Now, perform a reduction (as stated here: Reducing on array in OpenMp)
#pragma omp parallel
{
    // Every thread is in charge of one contiguous part of the reduced
    // histogram B, which has size_of_histo entries in total
    int my_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    int chunk = (size_of_histo + num_threads - 1) / num_threads;
    int start = my_id * chunk;
    int end = (start + chunk > size_of_histo) ? size_of_histo : start + chunk;

    // The iteration range is already split by hand, so no omp for here
    for (int i = start; i < end; i++) {
        for (int j = 0; j < num_threads; j++) {
            int value = B[i + min_value] + sharedHisto[j * size_of_histo + i];
            if (value > MAX_VALUE) B[i + min_value] = MAX_VALUE;
            else B[i + min_value] = value;
        }
    }
}