Clang OpenMP. Find max value in matrix N x N - c

I need to find the max value in a matrix using OpenMP. It is my first experience with OpenMP; previously I did this task using pthreads.
I wrote this code, but it does not work:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

void MatrixFIller(int nrows, int* m) {
    for (int i = 0; i < nrows; i++) {
        for (int j = 0; j < nrows; j++) {
            *(m + i * nrows + j) = rand() % 200;
        }
    }
}

#define dimension 9
#define number_of_threads 4

int main() {
    srand(time(NULL));
    int matrix[dimension][dimension];
    int local_max = -1;
    int final_max = -1;
    int j = 0;
    MatrixFIller(dimension, &matrix[0][0]);
    for (int i = 0; i < dimension; i++) {
        for (int j = 0; j < dimension; j++) {
            printf("%d\t", matrix[i][j]);
        }
        printf("\n");
    }
    omp_set_num_threads(number_of_threads);
#pragma omp parallel private(local_max)
    {
#pragma omp for
        for (j = 0; j < dimension * dimension; j++) {
            if (*(matrix + (int)((j) / dimension) * dimension + (j - dimension * ((int)(j / dimension)))) > local_max) {
                local_max = *(matrix + (int)((j) / dimension) * dimension + (j - dimension * ((int)((j) / dimension))));
            }
        }
#pragma omp critical
        if (local_max > final_max) { final_max = local_max; }
    }
    printf("Max value of matrix with dimension %d is %d", dimension, final_max);
}
The idea is that in the pragma for, each thread finds its local max, and after that each local max is compared with the global max value in the pragma critical. Why is it not correct? Thanks!

When entering the parallel region, local_max is uninitialized: the private clause creates variables that are local to each thread, and that's it; they are not initialized to any value. If you want them to be initialized with the value local_max had before entering the parallel region, you have to use the firstprivate clause instead.
However, it would actually be better to declare (and initialize) local_max inside the parallel region.
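For instance, a minimal sketch of that suggestion (same logic as your code, with local_max declared inside the region so it is private by construction and explicitly initialized):
#pragma omp parallel
{
    int local_max = -1;  // private by construction, explicitly initialized
#pragma omp for
    for (int j = 0; j < dimension * dimension; j++) {
        if (matrix[j / dimension][j % dimension] > local_max) {
            local_max = matrix[j / dimension][j % dimension];
        }
    }
#pragma omp critical
    if (local_max > final_max) { final_max = local_max; }
}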
Also, you may have a look at the reduction clause (with the max option), which will make the code even simpler:
#pragma omp parallel for reduction(max:final_max)
for (j = 0; j < dimension * dimension; j++) {
    if (*(matrix + (int)((j) / dimension) * dimension + (j - dimension * ((int)(j / dimension)))) > final_max) {
        final_max = *(matrix + (int)((j) / dimension) * dimension + (j - dimension * ((int)((j) / dimension))));
    }
}
EDIT
Following Laci's comment questioning the arithmetic: all of your index calculations actually look correct, but they are not easy to read. Since you have a 2D array from the beginning, it is simpler to write two loops, and possibly to tell OpenMP to parallelize them both using the collapse clause. (By the way, as far as possible, declare the loop indices within the for(): this avoids always wondering which ones should be declared private.)
#pragma omp parallel for reduction(max:final_max) collapse(2)
for (int i = 0; i < dimension; i++) {
    for (int j = 0; j < dimension; j++) {
        if (matrix[i][j] > final_max) {
            final_max = matrix[i][j];
        }
    }
}

Related

Numbers not randomized after runs

I'm trying to create an OpenMP program that randomizes double arrays and runs the values through the formula: y[i] = (a[i] * b[i]) + c[i] + (d[i] * e[i]) + (f[i] / 2);
If I run the program multiple times, the Y[] values are the same, even though they are supposed to be randomized when the arrays are initialized in the first #pragma omp for. Any ideas as to why this might be happening?
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

#define ARRAY_SIZE 10

double randfrom(double min, double max);

double randfrom(double min, double max)
{
    double range = (max - min);
    double div = RAND_MAX / range;
    return min + (rand() / div);
}

int main() {
    int i;
    double a[ARRAY_SIZE], b[ARRAY_SIZE], c[ARRAY_SIZE], d[ARRAY_SIZE], e[ARRAY_SIZE], f[ARRAY_SIZE], y[ARRAY_SIZE];
    double min, max;
    int imin, imax;
    /* A[10] consists of random numbers between 1 and 100
       B[10] consists of random numbers between 10 and 50
       C[10] consists of random numbers between 1 and 10
       D[10] consists of random numbers between 1 and 50
       E[10] consists of random numbers between 1 and 5
       F[10] consists of random numbers between 10 and 80 */
    srand(time(NULL));
#pragma omp parallel
    {
#pragma omp parallel for
        for (i = 0; i < ARRAY_SIZE; i++) {
            a[i] = randfrom(1, 100);
            b[i] = randfrom(10, 50);
            c[i] = randfrom(1, 50);
            d[i] = randfrom(1, 50);
            e[i] = randfrom(1, 5);
            f[i] = randfrom(10, 80);
        }
    }
    printf("This is the parallel Print\n\n\n");
#pragma omp parallel shared(a,b,c,d,e,f,y) private(i)
    {
        //Y=(A*B)+C+(D*E)+(F/2)
#pragma omp for schedule(dynamic) nowait
        for (i = 0; i < ARRAY_SIZE; i++) {
            /*printf("A[%d]%.2f",i, a[i]);
            printf("\n\n");
            printf("B[%d]%.2f", i, b[i]);
            printf("\n\n");
            printf("C[%d]%.2f", i, c[i]);
            printf("\n\n");
            printf("D[%d]%.2f", i, d[i]);
            printf("\n\n");
            printf("E[%d]%.2f", i, e[i]);
            printf("\n\n");
            printf("F[%d]%.2f", i, f[i]);
            printf("\n\n");*/
            y[i] = (a[i] * b[i]) + c[i] + (d[i] * e[i]) + (f[i] / 2);
            printf("Y[%d]=%.2f\n", i, y[i]);
        }
    }
#pragma omp parallel shared(y, min,imin,max,imax) private(i)
    {
        //min
#pragma omp for schedule(dynamic) nowait
        for (i = 0; i < ARRAY_SIZE; i++) {
            if (i == 0) {
                min = y[i];
                imin = i;
            }
            else {
                if (y[i] < min) {
                    min = y[i];
                    imin = i;
                }
            }
        }
        //max
#pragma omp for schedule(dynamic) nowait
        for (i = 0; i < ARRAY_SIZE; i++) {
            if (i == 0) {
                max = y[i];
                imax = i;
            }
            else {
                if (y[i] > max) {
                    max = y[i];
                    imax = i;
                }
            }
        }
    }
    printf("min y[%d] = %.2f\nmax y[%d] = %.2f\n", imin, min, imax, max);
    return 0;
}
First of all, I would like to emphasize that OpenMP has significant overheads, so you need a reasonable amount of work in your code, otherwise the overhead will be bigger than the gain from parallelization. In your code the work is that small, so the fastest solution is the serial code. However, you mentioned that your goal is to learn OpenMP, so I will show you how to do it.
In your previous post's comments, @paleonix linked a post (How to generate random numbers in parallel?) which answers your question about random numbers. One of the solutions is to use rand_r.
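For illustration only, a minimal sketch of the rand_r idea (the seeding scheme here is just an example, not a recommendation):
#pragma omp parallel
{
    // each thread owns its seed, so rand_r() touches no shared state
    unsigned int seed = omp_get_thread_num() + 1;
#pragma omp for
    for (int i = 0; i < ARRAY_SIZE; i++) {
        a[i] = 1 + (rand_r(&seed) / (double)RAND_MAX) * 99;  // in [1,100]
    }
}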
Your code has a data race when searching for the minimum and maximum values of array Y. If you only need to find the minimum/maximum value, it is very easy, because you can use a reduction like this:
double max = y[0];
#pragma omp parallel for default(none) shared(y) reduction(max:max)
for (int i = 1; i < ARRAY_SIZE; i++) {
    if (y[i] > max) {
        max = y[i];
    }
}
But in your case you also need the indices of the minimum and maximum values, so it is a bit more complicated. You have to use a critical section to make sure that other threads cannot change max, min, imax and imin while you are updating them. It can be done the following way (e.g. for finding the minimum value):
#pragma omp parallel for
for (int i = 0; i < ARRAY_SIZE; i++) {
    if (y[i] < min) {
        #pragma omp critical
        if (y[i] < min) {
            min = y[i];
            imin = i;
        }
    }
}
Note that if (y[i] < min) appears twice, because after the first comparison another thread may change the value of min, so inside the critical region, before updating min and imin, you have to check it again. You can find the maximum value exactly the same way.
Always use your variables at their minimum required scope.
It is also recommended to use the default(none) clause in your OpenMP parallel region, so that you have to explicitly define the sharing attributes of all your variables.
You can fill the array and find its minimum/maximum values in a single loop and print their values in a different serial loop.
If you set min and max before the loop, you can get rid of the extra comparison if (i == 0) used inside the loop.
Putting it together:
double threadsafe_rand(unsigned int* seed, double min, double max)
{
    double range = (max - min);
    double div = RAND_MAX / range;
    return min + (rand_r(seed) / div);
}
In main:
double min = DBL_MAX;
double max = -DBL_MAX;
#pragma omp parallel default(none) shared(a,b,c,d,e,f,y,imin,imax,min,max)
{
    unsigned int seed = omp_get_thread_num();
    #pragma omp for
    for (int i = 0; i < ARRAY_SIZE; i++) {
        a[i] = threadsafe_rand(&seed, 1, 100);
        b[i] = threadsafe_rand(&seed, 10, 50);
        c[i] = threadsafe_rand(&seed, 1, 10);
        d[i] = threadsafe_rand(&seed, 1, 50);
        e[i] = threadsafe_rand(&seed, 1, 5);
        f[i] = threadsafe_rand(&seed, 10, 80);
        y[i] = (a[i] * b[i]) + c[i] + (d[i] * e[i]) + (f[i] / 2);
        if (y[i] < min) {
            #pragma omp critical
            if (y[i] < min) {
                min = y[i];
                imin = i;
            }
        }
        if (y[i] > max) {
            #pragma omp critical
            if (y[i] > max) {
                max = y[i];
                imax = i;
            }
        }
    }
}
// printout
for (int i = 0; i < ARRAY_SIZE; i++) {
    printf("Y[%d]=%.2f\n", i, y[i]);
}
printf("min y[%d] = %.2f\nmax y[%d] = %.2f\n", imin, min, imax, max);
Update:
I have updated the code according to @Qubit's and @JérômeRichard's suggestions:
I used the 'Really minimal PCG32 code' / (c) 2014 M.E. O'Neill / from https://www.pcg-random.org/download.html. Note that I do not intend to properly handle the seeding of this simple random number generator. If you would like to do so, please use a complete random number generator library.
I have changed the code to use user-defined reductions. Indeed, this makes the code much more efficient, but it is not really beginner friendly. It would take a very long post to explain them properly, so if you are interested in the details, please read a book about OpenMP; a brief annotated sketch of the mechanics follows this list.
I have reduced the number of divisions in threadsafe_rand
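Very briefly, and only as a sketch of the mechanics (this duplicates the declaration used in the program below, with explanatory comments added):
/* A user-defined reduction has three parts:
   - an identifier (custom_min), later used as reduction(custom_min:var);
   - a combiner, telling OpenMP how to merge two partial results:
     omp_in is one thread's partial result, omp_out accumulates the merge;
   - an initializer, giving each thread's private copy its starting value. */
struct v {
    double value;
    int i;
};
#pragma omp declare reduction(custom_min : struct v : \
    omp_out = omp_in.value < omp_out.value ? omp_in : omp_out) \
    initializer(omp_priv = {DBL_MAX, 0})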
The updated code:
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <float.h>
#include <limits.h>
#include <omp.h>

#define ARRAY_SIZE 10

// *Really* minimal PCG32 code / (c) 2014 M.E. O'Neill / pcg-random.org
// Licensed under Apache License 2.0 (NO WARRANTY, etc. see website)
typedef struct { uint64_t state; uint64_t inc; } pcg32_random_t;

static inline uint32_t pcg32_random_r(pcg32_random_t* rng)
{
    uint64_t oldstate = rng->state;
    // Advance internal state
    rng->state = oldstate * 6364136223846793005ULL + (rng->inc | 1);
    // Calculate output function (XSH RR), uses old state for max ILP
    uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;
    uint32_t rot = oldstate >> 59u;
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}

static inline double threadsafe_rand(pcg32_random_t* seed, double min, double max)
{
    const double tmp = 1.0 / UINT32_MAX;
    return min + tmp * (max - min) * pcg32_random_r(seed);
}

struct v {
    double value;
    int i;
};

#pragma omp declare reduction(custom_min : struct v : \
    omp_out = omp_in.value < omp_out.value ? omp_in : omp_out) \
    initializer(omp_priv = {DBL_MAX, 0})

#pragma omp declare reduction(custom_max : struct v : \
    omp_out = omp_in.value > omp_out.value ? omp_in : omp_out) \
    initializer(omp_priv = {-DBL_MAX, 0})

int main() {
    double a[ARRAY_SIZE], b[ARRAY_SIZE], c[ARRAY_SIZE], d[ARRAY_SIZE], e[ARRAY_SIZE], f[ARRAY_SIZE], y[ARRAY_SIZE];
    struct v max = {-DBL_MAX, 0};
    struct v min = {DBL_MAX, 0};
#pragma omp parallel default(none) shared(a,b,c,d,e,f,y) reduction(custom_min:min) reduction(custom_max:max)
    {
        // per-thread PRNG state, seeded differently on each thread
        pcg32_random_t seed = {omp_get_thread_num()*7842 + time(NULL)%2299, 1234 + omp_get_thread_num()};
#pragma omp for
        for (int i = 0; i < ARRAY_SIZE; i++) {
            a[i] = threadsafe_rand(&seed, 1, 100);
            b[i] = threadsafe_rand(&seed, 10, 50);
            c[i] = threadsafe_rand(&seed, 1, 10);
            d[i] = threadsafe_rand(&seed, 1, 50);
            e[i] = threadsafe_rand(&seed, 1, 5);
            f[i] = threadsafe_rand(&seed, 10, 80);
            y[i] = (a[i] * b[i]) + c[i] + (d[i] * e[i]) + (f[i] / 2);
            // min and max are reduction variables: each thread updates its
            // private copy, and the combiners merge the copies at the end
            if (y[i] < min.value) {
                min.value = y[i];
                min.i = i;
            }
            if (y[i] > max.value) {
                max.value = y[i];
                max.i = i;
            }
        }
    }
    // printout
    for (int i = 0; i < ARRAY_SIZE; i++) {
        printf("Y[%d]=%.2f\n", i, y[i]);
    }
    printf("min y[%d] = %.2f\nmax y[%d] = %.2f\n", min.i, min.value, max.i, max.value);
    return 0;
}

Loop transformation for data dependence and parallelization

I have a nested for loop for iterating over a three-dimensional space (one loop for each dimension). The nested loop forms part of a stencil-based matrix solver which has an operation with a data dependence. I have gone through a lot of links/online material going into the details of loop transformations, and it seems like loop skewing can help me. Though it is pretty straightforward for a 2D grid (consisting of two loop nests), I find it a bit difficult to extend to 3D. The loop looks like this:
#pragma omp parallel num_threads(NTt) default(none) private(i,j,k, mythread, dummy) shared(STA,res_sparse_s,COEFF,p_sparse_s, ap_sparse_s,h_sparse_s,RLL, pipi_sparse, normres_sparse, riri_sparse,riri_sparse2,noemer_sparse, nx, ny, nz, nv, PeriodicBoundaryX, PeriodicBoundaryY, PeriodicBoundaryZ)
{
    mythread = omp_get_thread_num();//0

    // loop 1
#pragma omp for reduction(+:pipi_sparse)
    for (i = 1; i <= nx; i++)
        for (j = 1; j <= ny; j++)
            for (k = 1; k <= nz; k++)
            {
                dummy = COEFF[i][j][k][6] * p_sparse_s[i][j][k];

                if (PeriodicBoundaryX && i == 1)  dummy += COEFF[i][j][k][0] * p_sparse_s[nx][j][k];
                else                              dummy += COEFF[i][j][k][0] * p_sparse_s[i-1][j][k];
                if (PeriodicBoundaryX && i == nx) dummy += COEFF[i][j][k][1] * p_sparse_s[1][j][k];
                else                              dummy += COEFF[i][j][k][1] * p_sparse_s[i+1][j][k];

                if (PeriodicBoundaryY && j == 1)  dummy += COEFF[i][j][k][2] * p_sparse_s[i][ny][k];
                else                              dummy += COEFF[i][j][k][2] * p_sparse_s[i][j-1][k];
                if (PeriodicBoundaryY && j == ny) dummy += COEFF[i][j][k][3] * p_sparse_s[i][1][k];
                else                              dummy += COEFF[i][j][k][3] * p_sparse_s[i][j+1][k];

                if (PeriodicBoundaryZ && k == 1)  dummy += COEFF[i][j][k][4] * p_sparse_s[i][j][nz];
                else                              dummy += COEFF[i][j][k][4] * p_sparse_s[i][j][k-1];
                if (PeriodicBoundaryZ && k == nz) dummy += COEFF[i][j][k][5] * p_sparse_s[i][j][1];
                else                              dummy += COEFF[i][j][k][5] * p_sparse_s[i][j][k+1];

                ap_sparse_s[i][j][k] = dummy;
                pipi_sparse += p_sparse_s[i][j][k] * ap_sparse_s[i][j][k];
            }

    // loop 2
    // FORWARD
#pragma omp for schedule(static, nx/NTt)
    for (i = 1; i <= nx; i++)
        for (j = 1; j <= ny; j++)
            for (k = 1; k <= nz; k++)
            {
                dummy = res_sparse_s[i][j][k];

                dummy -= COEFF[i][j][k][7] * RLL[i-1][j][k];
                if (PeriodicBoundaryX && i == nx) dummy -= COEFF[i][j][k][8] * RLL[1][j][k];

                dummy -= COEFF[i][j][k][2] * RLL[i][j-1][k];
                if (PeriodicBoundaryY && j == ny) dummy -= COEFF[i][j][k][3] * RLL[i][1][k];

                dummy -= COEFF[i][j][k][4] * RLL[i][j][k-1];
                if (PeriodicBoundaryZ && k == nz) dummy -= COEFF[i][j][k][5] * RLL[i][j][1];

                RLL[i][j][k] = dummy / h_sparse_s[i][j][k];
            }

    // loop 3
    // BACKWARD
#pragma omp for schedule(static, nx/NTt)
    for (i = nx; i >= 1; i--)
        for (j = ny; j >= 1; j--)
            for (k = nz; k >= 1; k--)
            {
                dummy = RLL[i][j][k] * h_sparse_s[i][j][k];

                if (PeriodicBoundaryX && i == 1) dummy -= COEFF[i][j][k][7] * RLL[nx][j][k];
                dummy -= COEFF[i][j][k][8] * RLL[i+1][j][k];

                if (PeriodicBoundaryY && j == 1) dummy -= COEFF[i][j][k][2] * RLL[i][ny][k];
                dummy -= COEFF[i][j][k][3] * RLL[i][j+1][k];

                if (PeriodicBoundaryZ && k == 1) dummy -= COEFF[i][j][k][4] * RLL[i][j][nz];
                dummy -= COEFF[i][j][k][5] * RLL[i][j][k+1];

                RLL[i][j][k] = dummy / h_sparse_s[i][j][k];
            }
}
Loop 1 -> data dependence of [i][j][k] on [i-1], [i+1], [j-1], [j+1], [k-1] and [k+1], although the values of p_sparse_s are read-only
Loop 2 -> data dependence of [i][j][k] on [i-1], [j-1], [k-1]
Loop 3 -> data dependence of [i][j][k] on [i+1], [j+1], [k+1]
EDIT
COEFF[i][j][k][NUM] are just generic coefficients (constant numbers) defined for each point in the 3D space. There are 9 such coefficients corresponding to the neighboring points, hence COEFF[][][][0], COEFF[][][][1], ..., COEFF[][][][8].
EDIT
Find a small code below that has a data dependence. I have tried to skew the inner k loop with respect to the i and j loops so that the k loop can be vectorized. The problem is that the code gives absolutely correct answers when running serially, but gives some weird answers if I enforce parallelism or vectorization of the inner loop.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

typedef double lr;

#define nx 4
#define ny 4
#define nz 4

void
print3dmatrix(double a[nx+2][ny+2][nz+2])
{
    for (int i = 1; i <= nx; i++) {
        for (int j = 1; j <= ny; j++) {
            for (int k = 1; k <= nz; k++) {
                printf("%f ", a[i][j][k]);
            }
            printf("\n");
        }
        printf("\n");
    }
}

int
main()
{
    double a[nx+2][ny+2][nz+2];
    double b[nx+2][ny+2][nz+2];

    srand(3461833726);

    // matrix filling
    // b is just a copy of a
    for (int i = 0; i < nx+2; i++)
        for (int j = 0; j < ny+2; j++)
            for (int k = 0; k < nz+2; k++)
            {
                a[i][j][k] = rand() % 5;
                b[i][j][k] = a[i][j][k];
            }

    // loop 1
    //#pragma omp parallel for num_threads(1)
    for (int i = 1; i <= nx; i++)
        for (int j = 1; j <= ny; j++)
            for (int k = 1; k <= nz; k++)
            {
                a[i][j][k] = -1*a[i-1][j][k] - 1*a[i][j-1][k] - 1*a[i][j][k-1] + 4*a[i][j][k];
            }

    print3dmatrix(a);
    printf("******************************\n");

    // loop 2
    //#pragma omp parallel for num_threads(1)
    for (int i = 1; i <= nx; i++)
        for (int j = 1; j <= ny; j++)
            // #pragma omp simd
            for (int m = j+1; m <= j+nz; m++)
            {
                b[i][j][m-j] = -1*b[i-1][j][m-j] - 1*b[i][j-1][m-j] - 1*b[i][j][m-j-1] + 4*b[i][j][m-j];
            }

    print3dmatrix(b);
    printf("=========================\n");
    return 0;
}
See: loop skewing for vectorisation.

OPENMP - Parallelize Schwarz algorithm with preconditions

I need to parallelize the Schwarz algorithm below, but I do not know how to deal with the precondition and the fact that there are nested loops.
I have to use OpenMP or MPI.
void ssor_forward_sweep(int n, int i1, int i2, int j1, int j2, int k1, int k2, double* restrict Ax, double w)
{
#define AX(i,j,k) (Ax[((k)*n+(j))*n+(i)])
    int i, j, k;
    double xx, xn, xe, xu;
    for (k = k1; k < k2; ++k) {
        for (j = j1; j < j2; ++j) {
            for (i = i1; i < i2; ++i) {
                xx = AX(i,j,k);
                xn = (i > 0) ? AX(i-1,j,k) : 0;
                xe = (j > 0) ? AX(i,j-1,k) : 0;
                xu = (k > 0) ? AX(i,j,k-1) : 0;
                AX(i,j,k) = (xx+xn+xe+xu)/6*w;
            }
        }
    }
#undef AX
}
Taking into account that each iteration uses values computed by earlier iterations, how can this function be parallelized to get the best time?
I have already tried to parallelize the loops two by two, or to split the domain into blocks (like the 3D Jacobi stencil), but without success...
Thank you very much!
Unfortunately, the inter-iteration data dependency limits the amount of parallelism you can obtain from your nested loops.
You can use tasks with dependences, which is the easiest approach: the OpenMP runtime library takes care of the scheduling and you focus only on your algorithm. Another good side is that there is no synchronization at the end of any loop, only between dependent parts of the code.
#pragma omp parallel
#pragma omp single
for (int k = 0; k < k2; k += BLOCK_SIZE) {
    for (int j = 0; j < j2; j += BLOCK_SIZE) {
        for (int i = 0; i < i2; i += BLOCK_SIZE) {
            #pragma omp task depend(in: AX(i-1,j,k), AX(i,j-1,k), AX(i,j,k-1)) \
                             depend(out: AX(i,j,k))
            {
                // your code here
            }
        }
    }
}
Tasks are sometimes a bit more expensive than parallel loops (depending on the workload and synchronization granularity), so another alternative is the wavefront parallelization pattern, which basically transforms the iteration space so that the elements in the inner loop are independent of each other (so you can use a parallel for there).
Whichever approach you take, I strongly suggest turning your algorithm into a blocked one: split your 3-nested loop so the computation is done in two stages:
Iterate among fixed-size blocks/cubes (let's call your new induction variables ii, jj and kk).
For each block, call the original serial version of your loop.
The goal of blocking is to increase the granularity of the parallel part, so that the parallelization overhead is not as noticeable.
Here is some pseudocode for the blocking part:
#define min(a,b) ((a)<(b)?(a):(b))

// Inter-block iterations
for (int kk = 0; kk < k2; kk += BLOCK_SIZE) {
    for (int jj = 0; jj < j2; jj += BLOCK_SIZE) {
        for (int ii = 0; ii < i2; ii += BLOCK_SIZE) {
            // Intra-block iterations
            for (int k = kk; k < min(k2, kk+BLOCK_SIZE); k++) {
                for (int j = jj; j < min(j2, jj+BLOCK_SIZE); j++) {
                    for (int i = ii; i < min(i2, ii+BLOCK_SIZE); i++) {
                        // Your code goes here
                    }
                }
            }
        }
    }
}
In the case of the wavefront parallelization, the last step is turning the outer loops (the inter-block iterations) into a wavefront, so that you iterate over the elements that are not dependent on each other. In a 3D iteration space, this is basically a diagonal plane that advances from (0,0,0) to (i2,j2,k2).
I'm going to show an example of the 2D wavefront, because it is easier to understand.
#define min(a,b) ((a)<(b)?(a):(b))

#pragma omp parallel
for (int d = 0; d < i2 + j2 - 1; d++) {
    // Cells on the anti-diagonal i + j == d are independent of each other
    int ilow  = (d - j2 + 1 > 0) ? d - j2 + 1 : 0;  // smallest valid i (keeps j < j2)
    int ihigh = min(d, i2 - 1);                      // largest valid i
    // Iterations in the inner loop are independent
    // Implicit thread barrier (synchronization) at the end of the loop
    #pragma omp for
    for (int i = ilow; i <= ihigh; i++) {
        int j = d - i;
        // your code here
    }
}
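For the 3D case, one possible sketch of the same idea (my own illustration, not tuned: it simply guards invalid i values instead of computing tight per-plane bounds) would be:
#pragma omp parallel
for (int d = 0; d < i2 + j2 + k2 - 2; d++) {
    // All cells on the plane i + j + k == d are independent of each other
    #pragma omp for collapse(2)
    for (int k = 0; k < k2; k++) {
        for (int j = 0; j < j2; j++) {
            int i = d - k - j;
            if (i >= 0 && i < i2) {
                // your code here
            }
        }
    }
}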

OpenMP - how to efficiently synchronize field update

I have the following code:
for (int i = 0; i < veryLargeArraySize; i++) {
    int value = A[i];
    if (B[value] < MAX_VALUE) {
        B[value]++;
    }
}
I want to use an OpenMP worksharing construct here, but my issue is the synchronization on the B array: all parallel threads can access any element of array B, which is very large (this made the use of locks difficult, since I'd need too many of them).
#pragma omp critical is a serious overhead here, and atomic is not possible because of the if.
Does anyone have a good suggestion on how I might do this?
Here's what I've found out and done.
I've read on some forums that parallel histogram calculation is generally a bad idea, since it may be slower and less efficient than the sequential calculation.
However, I needed to do it (for the assignment), so what I did is the following:
Process the A array (the image) in parallel to determine the actual range of values for the histogram (the B array): find the MIN and MAX of A[i].
int min_value = INT_MAX, max_value = INT_MIN;  // needs <limits.h>
#pragma omp parallel for reduction(min:min_value) reduction(max:max_value)
for (i = 0; i < veryLargeArraySize; i++) {
    const int value = A[i];
    if (max_value < value) max_value = value;
    if (min_value > value) min_value = value;
}
int size_of_histo = max_value - min_value + 1;
That way, we can (potentially) reduce the actual histogram size from, e.g., 1M elements (allocated in array B) to 50K elements (allocated in sharedHisto)
Allocate a shared array, such as:
int num_threads = omp_get_max_threads();  // omp_get_num_threads() returns 1 outside a parallel region
int* sharedHisto = (int*) calloc(num_threads * size_of_histo, sizeof(int));
Each thread is assigned a part of sharedHisto and can update it without synchronization:
#pragma omp parallel private(i)
{
    int my_id = omp_get_thread_num();  // must be called inside the parallel region
    #pragma omp for
    for (i = 0; i < veryLargeArraySize; i++) {
        int value = A[i];
        // my_id * size_of_histo positions to the beginning of this
        // thread's part of sharedHisto;
        // value - min_value positions to the actual histo bin
        sharedHisto[my_id * size_of_histo + value - min_value]++;
    }
}
Now, perform a reduction (as stated here: Reducing on array in OpenMP):
#pragma omp parallel
{
    // Every thread is in charge of one contiguous part of the reduced
    // histogram (B), of at most 'chunk' bins, so no worksharing
    // directive is needed here
    int my_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    int chunk = (size_of_histo + num_threads - 1) / num_threads;
    int start = my_id * chunk;
    int end = (start + chunk > size_of_histo) ? size_of_histo : start + chunk;
    for (int i = start; i < end; i++) {
        for (int j = 0; j < num_threads; j++) {
            int value = B[i + min_value] + sharedHisto[j * size_of_histo + i];
            if (value > MAX_VALUE) B[i + min_value] = MAX_VALUE;
            else B[i + min_value] = value;
        }
    }
}
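As an aside, since OpenMP 4.5 you can also reduce over array sections directly, which removes the manual sharedHisto bookkeeping. A minimal sketch of that alternative (assuming the MAX_VALUE cap can be applied after counting):
int* histo = (int*) calloc(size_of_histo, sizeof(int));
#pragma omp parallel for reduction(+: histo[:size_of_histo])
for (int i = 0; i < veryLargeArraySize; i++) {
    histo[A[i] - min_value]++;  // each thread increments a private, zero-initialized copy
}
// apply the counts to B with the MAX_VALUE cap, serially
for (int i = 0; i < size_of_histo; i++) {
    int value = B[i + min_value] + histo[i];
    B[i + min_value] = (value > MAX_VALUE) ? MAX_VALUE : value;
}
free(histo);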

How to efficiently store a triangular matrix in memory?

I want to store a lower triangular matrix in memory without storing all the zeros.
The way I have implemented it is by allocating space for i + 1 elements in the ith row.
However, I am new to dynamic memory allocation in C, and something seems to be wrong with my first allocation.
int main()
{
    int i, j;
    int **mat1;
    int dim;
    scanf("%d", &dim);
    *mat1 = (int**) calloc(dim, sizeof(int*));
    for (i = 0; i < dim; i++)
        mat1[i] = (int*) calloc(i + 1, sizeof(int));
    for (i = 0; i < dim; i++)
    {
        for (j = 0; j < i + 1; j++)
        {
            scanf("%d", &mat1[i][j]);
        }
    }
    /* Print the matrix without the zeros */
    for (i = 0; i < dim; i++)
    {
        for (j = 0; j < (i + 1); j++)
        {
            printf("%d%c", mat1[i][j], j != (dim-1) ? ' ' : '\n');
        }
    }
    return 0;
}
If you want to conserve space and avoid the overhead of allocating every row of the matrix, you can implement a triangular matrix by using clever indexing of a single array.
A lower triangular matrix (including diagonals) has the following properties:
Dimension   Matrix     Elements/row   Total elements
    1       x . . .         1               1
    2       x x . .         2               3
    3       x x x .         3               6
    4       x x x x         4              10
    ...
The total number of elements for a given dimension d is:
size(d) = 1 + 2 + 3 + ... + d = d(d+1)/2
If you lay the rows out consecutively in a single array, you can use the formula above to calculate the offset of a given row and column (both zero-based) inside the matrix: the r rows above row r hold 1 + 2 + ... + r = size(r) elements, so
index(r,c) = size(r) + c
For example, element (2,1) of a lower matrix lives at index size(2) + 1 = 3 + 1 = 4.
The formulas above are for the lower triangular matrix. You can access the upper matrix as if it was a lower matrix by simply reversing the indexes:
index((d-1)-r, (d-1)-c)
If you have concerns about changing the orientation of the array, you can devise a different offset calculation for the upper array, such as:
uindex(r,c) = size(d)-size(d-r) + c-r
Sample code:
#include <time.h>
#include <stdio.h>
#include <stdlib.h>

#define TRM_SIZE(dim) (((dim)*((dim)+1))/2)
#define TRM_OFFSET(r,c) (TRM_SIZE(r)+(c))
#define TRM_INDEX(m,r,c) ((r)<(c) ? 0 : (m)[TRM_OFFSET((r),(c))])
#define TRM_UINDEX(m,r,c,d) ((r)>(c) ? 0 : (m)[TRM_SIZE(d)-TRM_SIZE((d)-(r))+(c)-(r)])

#define UMACRO 0

int main (void)
{
    int i, j, k, dimension;
    int *ml, *mu, *mr;

    printf("Enter dimension: ");
    if (scanf("%2d", &dimension) != 1) {
        return 1;
    }

    ml = calloc(TRM_SIZE(dimension), sizeof *ml);
    mu = calloc(TRM_SIZE(dimension), sizeof *mu);
    mr = calloc(dimension*dimension, sizeof *mr);
    if (!ml || !mu || !mr) {
        free(ml);
        free(mu);
        free(mr);
        return 2;
    }

    /* Initialization */
    srand(time(0));
    for (i = 0; i < TRM_SIZE(dimension); i++) {
        ml[i] = 100.0*rand() / RAND_MAX;
        mu[i] = 100.0*rand() / RAND_MAX;
    }

    /* Multiplication */
    for (i = 0; i < dimension; i++) {
        for (j = 0; j < dimension; j++) {
            for (k = 0; k < dimension; k++) {
                mr[i*dimension + j] +=
#if UMACRO
                    TRM_INDEX(ml, i, k) *
                    TRM_UINDEX(mu, k, j, dimension);
#else
                    TRM_INDEX(ml, i, k) *
                    TRM_INDEX(mu, dimension-1-k, dimension-1-j);
#endif
            }
        }
    }

    /* Output */
    puts("Lower array");
    for (i = 0; i < dimension; i++) {
        for (j = 0; j < dimension; j++) {
            printf(" %2d", TRM_INDEX(ml, i, j));
        }
        putchar('\n');
    }
    puts("Upper array");
    for (i = 0; i < dimension; i++) {
        for (j = 0; j < dimension; j++) {
#if UMACRO
            printf(" %2d", TRM_UINDEX(mu, i, j, dimension));
#else
            printf(" %2d", TRM_INDEX(mu, dimension-1-i, dimension-1-j));
#endif
        }
        putchar('\n');
    }
    puts("Result");
    for (i = 0; i < dimension; i++) {
        for (j = 0; j < dimension; j++) {
            printf(" %5d", mr[i*dimension + j]);
        }
        putchar('\n');
    }

    free(mu);
    free(ml);
    free(mr);
    return 0;
}
Note that this is a trivial example. You could extend it to wrap the matrix pointer inside a structure that also stores the type of the matrix (upper or lower triangular, or square) and the dimensions, and write access functions that operate appropriately depending on the type of matrix.
For any non-trivial use of matrices, you should probably use a third-party library that specializes in matrices.
mat1 = calloc(dim, sizeof(int*));
mat1 is a double pointer. You need to allocate memory for your array of pointers first, and later you need to allocate memory for each of the pointers individually. There is no need to cast the result of calloc().
You are dereferencing mat1 on the line *mat1 = (int**) calloc(dim, sizeof(int*)); before it has even been set to point anywhere. You are allocating an array of pointers to int, but you are not assigning that allocation to mat1; you are assigning it to the dereference of mat1, which is uninitialized, so we don't know what it points to.
So this line:
// ERROR: You are saying an unknown memory location should have the value of calloc.
*mat1 = (int**)calloc(dim,sizeof(int*));
Should change to:
// OK: Now you are assigning the allocation to the pointer variable.
mat1 = (int**)calloc(dim,sizeof(int*));
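Putting it together, a minimal corrected sketch of the allocation (assuming dim has been read successfully, and omitting error checking for brevity):
int **mat1 = calloc(dim, sizeof *mat1);        /* the array of row pointers itself */
for (int i = 0; i < dim; i++) {
    mat1[i] = calloc(i + 1, sizeof *mat1[i]);  /* i + 1 ints in row i */
}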
