I have read this post on how to fuse a loop. The goal is to fuse my double for loop in order to parallelize it with OpenMP. The reason I don't use collapse(2) is that the inner loop has dependencies on the outer one. I have also read this relevant post.
My problem, though, is that when I fuse the loop I get a Segmentation Fault, which seems pretty fuzzy to me. I am fairly sure I am making the right conversion. Unfortunately there is no way I can provide a minimal reproducible example, as my program has a ton of functions that call one another. Here is my initial loop, though:
for(int i=0; i<size; i++)
{
    int counter = 0;
    for(int j=0; j<size; j++)
    {
        if (i==j)
            continue;
        if(arr[size * i + j])
        {
            graph->nodes[i]->degree++;
            graph->nodes[i]->neighbours[counter] = (Node*)malloc(sizeof(Node));
            graph->nodes[i]->neighbours[counter] = graph->nodes[j];
            counter++;
        }
    }
}
where graph is a pointer to a struct and graph->nodes is an array of pointers to the graph's nodes. The same goes for graph->nodes[i]->neighbours: an array of pointers (pointed to by a pointer that is itself pointed to by another pointer - sorry).
As you can see, it is the counter variable that restricts me from using #pragma omp parallel for collapse(2). Below you can see my converted loop:
for(int n=0; n<size*size; n++)
{
    int i = n / size;
    int j = n % size;
    int counter = 0;
    for(int j=0; j<size; j++)
    {
        if (i==j)
            continue;
        if(arr[size * i + j])
        {
            graph->nodes[i]->degree++;
            graph->nodes[i]->neighbours[counter] = (Node*)malloc(sizeof(Node));
            graph->nodes[i]->neighbours[counter] = graph->nodes[j];
            counter++;
        }
    }
}
I have tried debugging with valgrind, and what's ultra weird is that the Segmentation Fault does not appear to be on these specific lines, although it happens only when I make the loop conversion.
Mini disclaimer: as you may guess, because of these pointer-to-pointer-to-pointer variables I use lots of mallocs.
I don't expect you to reproduce the error with the code I have posted, which is why my question is more of a general one: how could a loop fusion theoretically cause a segfault error?
I think in your converted loop you got i and j mixed up.
It should be int i = n % size;, not j.
n / size always equals 0.
Related
I have a problem with my code; it should print the number of appearances of each number. I want to parallelize this code with OpenMP, and I tried to use reduction for arrays, but it obviously didn't work as I wanted.
The error is a segmentation fault. Should some variables be private, or is the problem the way I'm trying to use the reduction?
I think each thread should count some part of the array, and then the parts should be merged somehow.
#pragma omp parallel for reduction (+: reasult[:i])
for (i = 0; i < M; i++) {
    for(j = 0; j < N; j++) {
        if ( numbers[j] == i){
            result[i]++;
        }
    }
}
where N is a big number telling how many numbers I have, numbers is the array of all the numbers, and result is the array with the count of each number.
First, you have a typo in the name:
#pragma omp parallel for reduction (+: reasult[:i])
should actually be "result", not "reasult".
Nonetheless, why are you sectioning the array with result[:i]? Based on your code, it seems that you wanted to reduce the entire array, namely:
#pragma omp parallel for reduction (+: result)
for (i = 0; i < M; i++)
    for(j = 0; j < N; j++)
        if ( numbers[j] == i)
            result[i]++;
If your compiler does not support the OpenMP 4.5 array reduction feature, you can alternatively implement the reduction explicitly (check this SO thread to see how).
As pointed out by @Hristo Iliev in the comments:
Provided that M * sizeof(result[0]) / #threads is a multiple of the
cache line size, and even if it isn't when the value of M is large
enough, there is absolutely no need to involve reduction in the
process. Unless the program is running on a NUMA system, that is.
Assuming that the aforementioned conditions are met: if you look carefully at how the outermost loop iterations (i.e., over the variable i) are assigned to the threads, and since the variable i is used to access the result array, each thread updates a different position of the result array. Therefore, you can simplify your code to:
#pragma omp parallel for
for (i = 0; i < M; i++)
    for(j = 0; j < N; j++)
        if ( numbers[j] == i)
            result[i]++;
My primary aim is to demonstrate how virtualization differs from containerization by benchmarking a matrix multiplication algorithm in C and Java across various environments and drawing a suitable conclusion.
I chose this algorithm because matrix multiplication is very frequently used across Computer Science fields, mostly with large sizes; I want my code to handle at least a 2000x2000 matrix so that the difference between the two approaches is apparent.
I use GCC on Linux and the default C compiler in Code::Blocks on Windows (I do not know which version of GCC it uses).
The problem is that when I run the code on Windows, it accepts sizes only up to 490x490 and dumps core if I exceed that. Linux manages to go further, but cannot get beyond 590x590.
I initially thought that my machine's memory was the reason and asked a few friends with much better machines to run the same code, but the result was the same.
FYI: I'm running a Pentium N3540 and 4GB of DDR3 RAM. My friends are running i7-8750H with 16GB DDR4 and another one with an i5-9300H with 8GB DDR4.
Here is the code I wrote:
#include <stdio.h>
#include <stdlib.h>

#define MAX 10

int main()
{
    long i, j, k, m, n;

    printf("Enter the row dimension of the matrix: ");
    scanf("%ld", &m);
    printf("Enter the column dimension of the matrix: ");
    scanf("%ld", &n);

    long mat1[m][n], mat2[m][n], mat3[m][n];

    for(i=0; i<m; i++)
        for(j=0; j<n; j++)
        {
            mat1[i][j] = (long)rand()%MAX;
            mat2[i][j] = (long)rand()%MAX;
            mat3[i][j] = 0;
        }

    printf("\n\nThe matrix 1 is: \n");
    for(i=0; i<m; i++)
    {
        for(j=0; j<n; j++)
        {
            printf("%d\t", (int)mat1[i][j]);
        }
        printf("\n");
    }

    printf("\n\nThe matrix 2 is: \n");
    for(i=0; i<m; i++)
    {
        for(j=0; j<n; j++)
        {
            printf("%d\t", (int)mat2[i][j]);
        }
        printf("\n");
    }

    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                mat3[i][j] += mat1[i][k] * mat2[k][j];

    printf("\n\nThe resultant matrix is: \n");
    for(i=0; i<m; i++)
    {
        for(j=0; j<n; j++)
        {
            printf("%d\t", (int)mat3[i][j]);
        }
        printf("\n");
    }

    return 0;
}
When you do
long mat1[m][n], mat2[m][n], mat3[m][n];
you create an object (aka a variable) with automatic storage duration. That means the object is automatically created when the function is entered and automatically destroyed when the function exits.
The C standard does not describe how this shall be done; that is left to the system implementing the standard. The most common way is to use what is called a stack: a memory area that is pre-allocated for your program. Whenever your program calls a function, any variables defined inside it can be placed on that stack. This allows very simple and fast allocation of memory for such variables.
However, it has one drawback: the stack has a limited (and rather small) size. So if a function uses huge variables, you may run out of stack memory. Unfortunately, most systems don't detect that until it's too late.
The simple rule to avoid this is: Do not define huge variables with automatic storage duration (aka huge function local variables).
So for your specific example you should replace:
long mat1[m][n]
with
// This defines mat1 as a pointer to an array of n longs and allocates
// memory for m such arrays - in other words, an m by n matrix of long.
long (*mat1)[n] = malloc(m * sizeof *mat1);
if (mat1 == NULL)
{
    // Out of memory
    exit(1);
}

// From here on you can use mat1 as a 2D matrix, for example:
mat1[4][9] = 42;

...

// Once you are done using mat1, you need to free the memory:
free(mat1);
Apart from @EdHeal's suggestion of using malloc(), you may also simply declare the arrays static, which gives them static storage duration and places them outside the stack - assuming you don't mind fixed, compile-time sizes, since a static array cannot have a variable length.
Since here you must make arrays of variable length anyway, malloc() is naturally suited to your use case.
The stack is for small things, the heap for large things - and in your own words, these are matrices of "huge sizes" :)
As a side note to those more experienced in C programming: why don't more C textbooks teach the stack/heap distinction and the size limitations of non-malloc()ed automatic variables? Why must every newbie learn by falling flat on their face and being told about this on SO?
I implemented a 2D median filter in C, for an image of size 1440x1440 with floating-point values. To start, I tried it with a simple 3x3 kernel. Here's the code.
#define kernelSize 3

void sort(float *array2sort, int n)
{
    float temp;
    for(int i=0; i < n-1; i++)
        for(int j=0; j < n-1-i; j++)
            if(array2sort[j] > array2sort[j+1])
            {
                temp = array2sort[j];
                array2sort[j] = array2sort[j+1];
                array2sort[j+1] = temp;
            }
}

void medianFilter(float *input, float *output)
{
    int halfKernelSize = kernelSize/2;
    float neighbourhood[kernelSize*kernelSize];
    for(int i=0+halfKernelSize; i<(1440-halfKernelSize); i++)
        for(int j=0+halfKernelSize; j<(1440-halfKernelSize); j++)
        {
            for(int ii=-halfKernelSize; ii<halfKernelSize+1; ii++)
                for(int jj=-halfKernelSize; jj<halfKernelSize+1; jj++)
                    neighbourhood[(ii+halfKernelSize)*kernelSize+(jj+halfKernelSize)] = input[(i+ii)*1440+(j+jj)];
            sort(neighbourhood, kernelSize*kernelSize);
            output[(i)*1440+(j)] = neighbourhood[(kernelSize*kernelSize)/2+1];
        }
}
Now, in order to verify that the code is fine, I took an image and added salt & pepper noise to it using MATLAB, then ran the above code on it. I can see the noise getting reduced ALMOST completely, with a few dots remaining. If I increase the kernel size to 5x5, the noise does get filtered completely. But the worrying fact for me is that MATLAB's median filter is able to remove the noise completely even with a 3x3 kernel. That leaves me in doubt. Please have a look at the code and let me know if there is some fault in the filter implementation, or whether the MATLAB code takes some additional steps.
I think the median value calculated from the neighbourhood buffer is wrong.
It should be neighbourhood[(kernelSize*kernelSize)/2].
Can you try with this correction?
I am optimizing an application and I am stuck on a loop. I want to vectorize a nested loop. What this loop does is multiply two arrays' values and add them up, then find the minimum value among the results. The original code is below.
float min = 0xffffffff;
for(i=0; i<limit_x; i++){
    for(j=0; j<limit_y; j++){
        for(k=0; k<limit_z; k++){
            temp += x[i][k] * y[j][k];
        }
        if(min > temp){
            min = temp;
        }
    }
}
As a matter of course, the vectorization occurs at the innermost loop, and it is guaranteed that x and y are aligned to the vector register width. In this code, I think the vectorization is not contiguous because of comparing and setting the minimum value. So I modified the code as below.
float min = 0xffffffff;
for(i=0; i<limit_x; i++){
    for(j=0; j<limit_y; j++){
        for(k=0; k<limit_z; k++){
            temps[i] += x[i][k] * y[i][k];
        }
    }
}
for(i=0; i<limit_x; i++){
    if(min > temps[i]){
        min = temps[i];
    }
}
I expected this to improve performance, because it would keep doing vectorized multiplications over the whole data without being disturbed by the comparison for the minimum. But the execution timing differs from my expectation: the previous version is slightly faster than the new code.
Can anyone explain why?
Background: the overall program is designed to carry out 2D DIC between a reference image and 1800 target images (for tomographic reconstruction). In my code, there is this for loop block:
for (k=0; k<kmax; k++)
{
    K=nm12+(k*(h-n+1))/(kmax-1);
    printf("\nk=%d\nL= ", K);
    for (l=0; l<lmax; l++)
    {
        ///For each subset, calculate and store its mean and standard deviation.
        ///Also want to know the sum and sum of squares of the subset, in two sections, stored in fm/df[k][l][0 and 1].
        L=nm12+(l*(w-n+1))/(lmax-1);
        printf("%d ", L);
        fm[k][l][0]=0;
        df[k][l][0]=0;
        fm[k][l][1]=0;
        df[k][l][1]=0;
        ///loops are j then i as it is more efficient (saves m-1 recalculations of b=j+L)
        for (j=0; j<m; j++)
        {
            b=j+L;
            for (i=0; i<M; i++)
            {
                a=i+K;
                fm[k][l][0]+=ref[a][b];
                df[k][l][0]+=ref[a][b]*ref[a][b];
            }
            for (i=M; i<m; i++)
            {
                a=i+K;
                fm[k][l][1]+=ref[a][b];
                df[k][l][1]+=ref[a][b]*ref[a][b];
            }
        }
        fm[k][l][2]=m2r*(fm[k][l][1]+fm[k][l][0]);
        df[k][l][2]=sqrt(df[k][l][1]+df[k][l][0]-m2*fm[k][l][2]*fm[k][l][2]);
        a+=1;
    }
}
Each time l reaches 10, the line df[k][l][2]=sqrt(df[k][l][1]+df[k][l][0]-m2*fm[k][l][2]*fm[k][l][2]); appears to no longer be executed. By this I mean the debugger shows that the value of df[k][l][2] is not changed from zero to the sum correctly. Also, df[k][l][0 and 1] remain fixed regardless of k and l, as long as l>=10.
kmax=15, lmax=20, n=121, m=21, M=(3*m)/4=15, nm12=(n-m+1)/2=50.
The arrays fm and df are double arrays, declared double fm[kmax][lmax][3], df[kmax][lmax][3];
Also, the line a+=1; is just there to be used as a breakpoint to check the value of df[k][l][2], and has no effect on the code's functionality.
Any help as to why this is happening, how to fix it, etc. will be much appreciated!
EDIT: MORE INFO.
The array ref (containing the reference image pixel values) is a dynamic array, with memory allocated using malloc, in this code block:
double **dark, **flat, **ref, **target, **target2, ***gm, ***dg;
dark=(double**)malloc(h * sizeof(double*));
flat=(double**)malloc(h * sizeof(double*));
ref=(double**)malloc(h * sizeof(double*));
target=(double**)malloc(h * sizeof(double*));
target2=(double**)malloc(h * sizeof(double*));
size_t wd=w*sizeof(double);
for (a=0; a<h; a++)
{
    dark[a]=(double*)malloc(wd);
    flat[a]=(double*)malloc(wd);
    ref[a]=(double*)malloc(wd);
    target[a]=(double*)malloc(wd);
    target2[a]=(double*)malloc(wd);
}
where h=1040 and w=1388 the dimensions of the image.
You don't mention much about which compiler, IDE or framework you're using. But a way to isolate the problem is to create a new, small (console) project containing only the snippet you've posted. This way you'll eliminate most kinds of input/thread/stack/memory/compiler issues.
And if it doesn't solve the mystery, the project will be small enough to post in full here on Stack Overflow, for us to take apart and ponder.
Ergo: you should create a self-contained unit test for your algorithm.