Background: the overall program is designed to carry out 2D DIC between a reference image and 1800 target images (for tomographic reconstruction). In my code, there is this for loop block:
for (k=0; k<kmax; k++)
{
K=nm12+(k*(h-n+1))/(kmax-1);
printf("\nk=%d\nL= ", K);
for (l=0; l<lmax; l++)
{
///For each subset, calculate and store its mean and standard deviation.
///Also want to know the sum and sum of squares of subset, but in two sections, stored in fm/df[k][l][0 and 1].
L=nm12+(l*(w-n+1))/(lmax-1);
printf("%d ", L);
fm[k][l][0]=0;
df[k][l][0]=0;
fm[k][l][1]=0;
df[k][l][1]=0;
///loops are j then i as it is more efficient (saves m-1 recalculations of b=j+L)
for (j=0; j<m; j++)
{
b=j+L;
for (i=0; i<M; i++)
{
a=i+K;
fm[k][l][0]+=ref[a][b];
df[k][l][0]+=ref[a][b]*ref[a][b];
}
for (i=M; i<m; i++)
{
a=i+K;
fm[k][l][1]+=ref[a][b];
df[k][l][1]+=ref[a][b]*ref[a][b];
}
}
fm[k][l][2]=m2r*(fm[k][l][1]+fm[k][l][0]);
df[k][l][2]=sqrt(df[k][l][1]+df[k][l][0]-m2*fm[k][l][2]*fm[k][l][2]);
a+=1;
}
}
Each time l reaches 10, the line df[k][l][2]=sqrt(df[k][l][1]+df[k][l][0]-m2*fm[k][l][2]*fm[k][l][2]); appears to no longer be executed. By this I mean the debugger shows that the value of df[k][l][2] is not changed from zero to the correct sum. Also, df[k][l][0] and df[k][l][1] remain fixed regardless of k and l, as long as l>=10.
kmax=15, lmax=20, n=121, m=21, M=(3*m)/4=15, nm12=(n-m+1)/2=50.
The arrays fm and df are double arrays, declared double fm[kmax][lmax][3], df[kmax][lmax][3];
Also, the line a+=1; is just there to be used as a breakpoint to check the value of df[k][l][2], and has no effect on the code's functionality.
Any help as to why this is happening, how to fix it, etc. will be much appreciated!
EDIT: MORE INFO.
The array ref (containing the reference image pixel values) is a dynamic array, with memory allocated using malloc, in this code block:
double **dark, **flat, **ref, **target, **target2, ***gm, ***dg;
dark=(double**)malloc(h * sizeof(double*));
flat=(double**)malloc(h * sizeof(double*));
ref=(double**)malloc(h * sizeof(double*));
target=(double**)malloc(h * sizeof(double*));
target2=(double**)malloc(h * sizeof(double*));
size_t wd=w*sizeof(double);
for (a=0; a<h; a++)
{
dark[a]=(double*)malloc(wd);
flat[a]=(double*)malloc(wd);
ref[a]=(double*)malloc(wd);
target[a]=(double*)malloc(wd);
target2[a]=(double*)malloc(wd);
}
where h=1040 and w=1388 are the dimensions of the image.
You don't mention much about which compiler, IDE or framework you're using. But a way to isolate the problem is to create a new small (console) project containing only the snippet you've posted. This way you'll eliminate most kinds of input/thread/stack/memory/compiler etc. issues.
And if it doesn't, it'll be small enough to post the whole sample here on stackoverflow, for us to take apart and ponder.
Ergo you should create a self-contained unit test for your algorithm, along the lines of the sketch below.
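Something like this hypothetical skeleton would do (m2 and m2r aren't defined in the question, so their assumed meanings are flagged in the comments, and the image is replaced by a synthetic ramp; compile with -lm):
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define H 1040
#define W 1388
#define KMAX 15
#define LMAX 20
#define N 121
#define SUB 21                   /* subset size m */
#define SPLIT ((3*SUB)/4)        /* split point M = 15 */
#define NM12 ((N - SUB + 1)/2)   /* = 50 */

int main(void)
{
    static double fm[KMAX][LMAX][3], df[KMAX][LMAX][3];
    const double m2 = (double)SUB * SUB;  /* assumption: m2 = m^2 */
    const double m2r = 1.0 / m2;          /* assumption: m2r = 1/m^2, so fm[k][l][2] is the mean */

    /* Synthetic reference image: a simple ramp makes the sums easy to verify. */
    double **ref = malloc(H * sizeof *ref);
    for (int a = 0; a < H; a++) {
        ref[a] = malloc(W * sizeof **ref);
        for (int b = 0; b < W; b++)
            ref[a][b] = a + b;
    }

    for (int k = 0; k < KMAX; k++) {
        int K = NM12 + (k * (H - N + 1)) / (KMAX - 1);
        for (int l = 0; l < LMAX; l++) {
            int L = NM12 + (l * (W - N + 1)) / (LMAX - 1);
            fm[k][l][0] = fm[k][l][1] = 0;
            df[k][l][0] = df[k][l][1] = 0;
            for (int j = 0; j < SUB; j++) {
                int b = j + L;
                for (int i = 0; i < SUB; i++) {
                    int a = i + K;
                    int half = (i < SPLIT) ? 0 : 1;  /* two sections, as in the post */
                    fm[k][l][half] += ref[a][b];
                    df[k][l][half] += ref[a][b] * ref[a][b];
                }
            }
            fm[k][l][2] = m2r * (fm[k][l][0] + fm[k][l][1]);
            df[k][l][2] = sqrt(df[k][l][0] + df[k][l][1]
                               - m2 * fm[k][l][2] * fm[k][l][2]);
            printf("k=%2d l=%2d mean=%10.3f sd=%10.3f\n",
                   k, l, fm[k][l][2], df[k][l][2]);
        }
    }

    for (int a = 0; a < H; a++)
        free(ref[a]);
    free(ref);
    return 0;
}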
My primary aim is to demonstrate how virtualization differs from containerization by benchmarking a matrix multiplication algorithm in C and Java over various environments and drawing up a suitable conclusion.
The reason I chose this algorithm is that matrix multiplication is used very frequently across Computer Science fields, and most of those uses deal with large sizes. I wish to optimize my code to handle at least a 2000x2000 matrix, so that the difference between the two approaches is apparent.
I use GCC on Linux and the default compiler for C in Code::Blocks on Windows (I do not know which version of GCC it uses).
The problem is that when I run the code on Windows, the program accepts sizes only up to 490x490 and dumps core if I exceed that. Linux manages a bit more, but cannot go beyond 590x590.
I initially thought that my machine's memory was the reason and asked a few friends with much better machines to run the same code, but the result was still the same.
FYI: I'm running a Pentium N3540 and 4GB of DDR3 RAM. My friends are running i7-8750H with 16GB DDR4 and another one with an i5-9300H with 8GB DDR4.
Here is the code I wrote:
#include <stdio.h>
#include <stdlib.h>
#define MAX 10
int main()
{
long i, j, k, m, n;
printf("Enter the row dimension of the matrix: "); scanf("%ld", &m);
printf("Enter the column dimension of the matrix: "); scanf("%ld", &n);
long mat1[m][n], mat2[m][n], mat3[m][n];
for(i=0; i<m; i++)
for(j=0; j<n; j++)
{
mat1[i][j] = (long)rand()%MAX;
mat2[i][j] = (long)rand()%MAX;
mat3[i][j] = 0;
}
printf("\n\nThe matrix 1 is: \n");
for(i=0; i<m; i++)
{
for(j=0; j<n; j++)
{
printf("%d\t", (int)mat1[i][j]);
}
printf("\n");
}
printf("\n\nThe matrix 2 is: \n");
for(i=0; i<m; i++)
{
for(j=0; j<n; j++)
{
printf("%d\t", (int)mat2[i][j]);
}
printf("\n");
}
for (i = 0; i < m; i++)
for (j = 0; j < n; j++)
for (k = 0; k < n; k++)
mat3[i][j] += mat1[i][k] * mat2[k][j];
printf("\n\nThe resultant matrix is: \n");
for(i=0; i<m; i++)
{
for(j=0; j<n; j++)
{
printf("%d\t", (int)mat3[i][j]);
}
printf("\n");
}
return 0;
}
When you do
long mat1[m][n], mat2[m][n], mat3[m][n];
you create an object (aka a variable) with automatic storage duration. That means that the object is automatically created once the function is executed and automatically destroyed when the function exits.
The C standard does not describe how this shall be done. That is left to the system implementing the standard. The most common way is to use what is called a stack. It's a memory area that is pre-allocated for your program. Whenever your program calls functions any variables defined inside the function can be placed on that stack. This allows for very simple and fast allocation of memory for such variables.
However, it has one drawback - the stack has a limited (and rather small) size. So if a function uses huge variables, you may run out of stack memory. Unfortunately, most systems don't detect that until it's too late.
The simple rule to avoid this is: Do not define huge variables with automatic storage duration (aka huge function local variables).
So for your specific example you should replace:
long mat1[m][n]
with
// This defines mat1 as a pointer to an array of n longs and allocates
// memory for m such arrays - in other words, an m by n matrix of long.
long (*mat1)[n] = malloc(m * sizeof *mat1);
if (mat1 == NULL)
{
    // Out of memory
    exit(1);
}
// From here on you can use mat1 as a 2D matrix, for example:
mat1[4][9] = 42;
...
// Once you are done using mat1, you need to free the memory:
free(mat1);
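Applied to all three matrices, a minimal fragment in the same spirit (to go inside main after reading m and n; needs <stdio.h> and <stdlib.h>) might look like this:
long (*mat1)[n] = malloc(m * sizeof *mat1);
long (*mat2)[n] = malloc(m * sizeof *mat2);
long (*mat3)[n] = malloc(m * sizeof *mat3);
if (mat1 == NULL || mat2 == NULL || mat3 == NULL)
{
    fprintf(stderr, "Out of memory\n");
    exit(1);
}
/* ... fill and multiply exactly as before; the mat1[i][j] syntax is unchanged ... */
free(mat1);
free(mat2);
free(mat3);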
Apart from @EdHeal's suggestion of using malloc(), you may also simply declare the variable static, forcing it to be placed outside the stack in static storage - assuming you don't mind fixed dimensions, since a static array cannot be a variable-length array.
Since here you must make arrays of variable length anyway, malloc() is naturally suited to your use case.
The stack is for small things; the heap is for large things - and in your own words, these are matrices of "huge sizes" :)
As a side note to those more experienced in C programming - why don't more C textbooks teach about the stack/heap distinction and the size limitations of non-malloc()ed automatic variables? Why must every newbie learn by falling flat on their face and being told about this on S.O.?
I have read this post on how to fuse a loop. The goal is to fuse my double for loop in order to parallelize it with OpenMP. The reason why I don't use collapse(2) is that the inner loop has dependencies on the outer one. I have also read this relevant post.
My problem, though, is that when I fuse my loop I get a Segmentation Fault, and that sounds pretty fuzzy. I am pretty sure I am making the right conversion. Unfortunately, there is no way I can provide a minimal reproducible example, as my program has a ton of functions that call one another. Here is my initial loop though:
for(int i=0; i<size; i++)
{
int counter = 0;
for(int j=0; j<size; j++)
{
if (i==j)
continue;
if(arr[size * i + j])
{
graph->nodes[i]->degree++;
graph->nodes[i]->neighbours[counter] = (Node*)malloc(sizeof(Node)); /* note: this block is leaked, since the next line overwrites the pointer */
graph->nodes[i]->neighbours[counter] = graph->nodes[j];
counter++;
}
}
}
where graph is a pointer to Struct and graph->nodes is an array of pointers to the graph's nodes. Same goes for graph->nodes[i]->neighbours. An array of pointers (pointed to by a pointer pointed to by another pointer - sorry).
As you can see, the fact that I am using the counter variable restricts me from using #pragma omp parallel for collapse(2). Below you can see my converted loop:
for(int n=0; n<size*size; n++)
{
int i = n / size;
int j = n % size;
int counter = 0;
for(int j=0; j<size; j++) /* note: this inner loop re-declares j, shadowing the j computed from n above */
{
if (i==j)
continue;
if(arr[size * i + j])
{
graph->nodes[i]->degree++;
graph->nodes[i]->neighbours[counter] = (Node*)malloc(sizeof(Node));
graph->nodes[i]->neighbours[counter] = graph->nodes[j];
counter++;
}
}
}
I have tried debugging with valgrind, and what's ultra weird is that the Segmentation Fault does not appear to be on these specific lines, although it only happens when I make the loop conversion.
Mini disclaimer: as you may guess, because of these pointer-to-pointer-to-pointer variables, I use lots of mallocs.
I don't expect you to get the same error with the code that I have posted; that is why my question is more of a general one: how, theoretically, could a loop fusion cause a segfault error?
I think in your converted loop you got i and j mixed up.
It should be int i = n % size;, not j.
n / size always equals 0.
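For what it's worth, here is a hypothetical sketch of a genuinely fused (still sequential) version. The key point: counter must restart on every new row, i.e. whenever j == 0. If the fused loop carries counter across rows, neighbours[counter] indexes past the end of a row's neighbour array, and that out-of-bounds write is exactly the kind of corruption that surfaces as a segfault somewhere else entirely:
int counter = 0;
for (int n = 0; n < size * size; n++)
{
    int i = n / size;
    int j = n % size;
    if (j == 0)
        counter = 0;  /* new row: restart the neighbour count */
    if (i == j)
        continue;
    if (arr[size * i + j])
    {
        graph->nodes[i]->degree++;
        /* store the pointer directly; the malloc in the original was leaked anyway */
        graph->nodes[i]->neighbours[counter] = graph->nodes[j];
        counter++;
    }
}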
I implemented a 2D median filter in C for an image of size 1440x1440 with floating-point values. To start, I tried it with a simple 3x3 kernel. Here's the code.
#define kernelSize 3
void sort(float *array2sort, int n)
{
float temp;
for(int i=0; i < n-1; i++)
for(int j=0; j < n-1-i; j++)
if(array2sort[j] > array2sort[j+1])
{
temp = array2sort[j];
array2sort[j] = array2sort[j+1];
array2sort[j+1] = temp;
}
}
void medianFilter(float *input, float *output)
{
int halfKernelSize = kernelSize/2;
float neighbourhood[kernelSize*kernelSize];
for(int i=0+halfKernelSize; i<(1440-halfKernelSize); i++)
for(int j=0+halfKernelSize; j<(1440-halfKernelSize); j++)
{
for(int ii=-halfKernelSize; ii<halfKernelSize+1; ii++)
for(int jj=-halfKernelSize; jj<halfKernelSize+1; jj++)
neighbourhood[(ii+halfKernelSize)*kernelSize+(jj+halfKernelSize)] = input[(i+ii)*1440+(j+jj)];
sort(neighbourhood, kernelSize*kernelSize);
output[(i)*1440+(j)] = neighbourhood[(kernelSize*kernelSize)/2+1];
}
}
Now, in order to verify that the code is fine, I took an image and added salt & pepper noise to it using MATLAB, then tried the above code on it. I can see the noise getting reduced ALMOST completely, with a few dots remaining. If I increase the kernel size to 5x5, the noise does get filtered completely. But the worrying fact for me is that the MATLAB median filter is able to remove the noise completely even with a kernel of size 3x3. That leaves me in doubt. Please have a look at the code and let me know if there is some fault in the filter implementation, or whether the MATLAB code is taking some additional steps.
I think the median value calculated from the neighbourhood buffer is wrong.
It should have been neighbourhood[(kernelSize*kernelSize)/2]: with kernelSize*kernelSize = 9 sorted values, the median is the 5th element, at zero-based index 9/2 = 4, whereas 9/2+1 = 5 picks the 6th.
Can you try with this correction?
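A quick standalone check (a hypothetical test, not from the post) makes the off-by-one visible:
#include <stdio.h>

int main(void)
{
    /* 9 already-sorted values: the median of 1..9 is 5, sitting at index 4. */
    float sorted[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    printf("index 9/2   = %d -> %g (the median)\n", 9/2, sorted[9/2]);
    printf("index 9/2+1 = %d -> %g (one too high)\n", 9/2 + 1, sorted[9/2 + 1]);
    return 0;
}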
I'm working on a homework assignment, and I've been stuck for hours on my solution. The problem we've been given is to optimize the following code so that it runs faster, regardless of how messy it becomes. We're supposed to use techniques like exploiting cache blocking and loop unrolling.
Problem:
//transpose a dim x dim matrix into dst by swapping all i,j with j,i
void transpose(int *dst, int *src, int dim) {
int i, j;
for(i = 0; i < dim; i++) {
for(j = 0; j < dim; j++) {
dst[j*dim + i] = src[i*dim + j];
}
}
}
What I have so far:
//attempt 1
void transpose(int *dst, int *src, int dim) {
int i, j, id, jd;
id = 0;
for(i = 0; i < dim; i++, id+=dim) {
jd = 0;
for(j = 0; j < dim; j++, jd+=dim) {
dst[jd + i] = src[id + j];
}
}
}
//attempt 2
void transpose(int *dst, int *src, int dim) {
int i, j, id;
int *pd, *ps;
id = 0;
for(i = 0; i < dim; i++, id+=dim) {
pd = dst + i;
ps = src + id;
for(j = 0; j < dim; j++) {
*pd = *ps++;
pd += dim;
}
}
}
Some ideas, please correct me if I'm wrong:
I have thought about loop unrolling, but I don't think it would help, because we don't know whether the NxN matrix has prime dimensions. Checking for that would add excess calculations, which would just slow the function down.
Cache blocking wouldn't be very useful, because no matter what, we will be accessing one array linearly (1,2,3,4) while accessing the other in jumps of N. While we can get the function to exploit the cache and read the src block faster, it will still take a long time to write those values into the dst matrix.
I have also tried using pointers instead of array subscripts, but I don't think that actually speeds up the program in any way.
Any help would be greatly appreciated.
Thanks
Cache blocking can be useful. As an example, let's say we have a cache line size of 64 bytes (which is what x86 uses these days). So for a matrix large enough to exceed the cache size, if we transpose a 16x16 block (since sizeof(int) == 4, 16 ints fit in a cache line, assuming the matrix is aligned on a cache line boundary), we need to load 32 cache lines from memory (16 from the source matrix, and 16 from the destination matrix before we can dirty them) and store another 16 lines (even though the stores are not sequential). In contrast, without cache blocking, transposing the equivalent 16x16 elements requires us to load 16 cache lines from the source matrix, but 16*16 = 256 cache lines to be loaded and then stored for the destination matrix.
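A minimal sketch of such a blocked transpose (hypothetical: BLOCK = 16 to match the 64-byte-line arithmetic above, and dim assumed to be a multiple of BLOCK for brevity):
#define BLOCK 16  /* 16 ints * 4 bytes = one 64-byte cache line */

void transpose_blocked(int *dst, int *src, int dim) {
    /* Visit the matrix in BLOCK x BLOCK tiles so each tile touches only
       BLOCK source lines and BLOCK destination lines at a time. */
    for (int ib = 0; ib < dim; ib += BLOCK)
        for (int jb = 0; jb < dim; jb += BLOCK)
            for (int i = ib; i < ib + BLOCK; i++)
                for (int j = jb; j < jb + BLOCK; j++)
                    dst[j*dim + i] = src[i*dim + j];
}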
Unrolling is useful for large matrices.
You'll need some code to deal with the excess elements if the matrix size isn't a multiple of the unroll factor. But this will be outside the most critical loop, so for a large matrix it's worth it.
Regarding the direction of accesses - it may be better to read linearly and write in jumps of N, rather than vice versa. This is because read operations block the CPU, while write operations don't (up to a limit).
Other suggestions:
1. Can you use parallelization? OpenMP can help (though if you're expected to deliver single CPU performance, it's no good).
2. Disassemble the function and read it, focusing on the innermost loop. You may find things you wouldn't notice in C code.
3. Using decreasing counters (stopping at 0) might be slightly more efficient than increasing counters.
4. The compiler must assume that src and dst may alias (point to the same or overlapping memory), which limits its optimization options. If you could somehow tell the compiler that they can't overlap, it may be a great help. However, I'm not sure how to do that (maybe use the restrict qualifier, as sketched below).
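For instance, in C99 the signature would become the following (only valid if the caller guarantees that dst and src never overlap):
void transpose(int * restrict dst, int * restrict src, int dim);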
Messiness is not a problem, so: I would add a transposed flag to each matrix. This flag indicates whether the stored data array of a matrix is to be interpreted in normal or transposed order.
All matrix operations should receive these new flags in addition to each matrix parameter. Inside each operation, implement the code for all possible combinations of flags. Perhaps macros can save redundant writing here.
In this new implementation, matrix transposition just toggles the flag: the space and time needed for the transpose operation is constant.
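A sketch of the idea (the struct and accessor names are hypothetical; real code would thread the flag through every matrix operation):
typedef struct {
    int *data;
    int dim;
    int transposed;  /* 0: data[i*dim+j] is element (i,j); 1: it is element (j,i) */
} Matrix;

/* O(1) "transpose": just toggle the flag. */
void transpose_flag(Matrix *m) { m->transposed = !m->transposed; }

/* Element access respects the flag. */
int get(const Matrix *m, int i, int j) {
    return m->transposed ? m->data[j * m->dim + i]
                         : m->data[i * m->dim + j];
}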
Just an idea how to implement unrolling:
void transpose(int *dst, int *src, int dim) {
int i, j;
const int dim1 = (dim / 4) * 4;
for(i = 0; i < dim; i++) {
for(j = 0; j < dim1; j+=4) {
dst[j*dim + i] = src[i*dim + j];
dst[(j+1)*dim + i] = src[i*dim + (j+1)];
dst[(j+2)*dim + i] = src[i*dim + (j+2)];
dst[(j+3)*dim + i] = src[i*dim + (j+3)];
}
for( ; j < dim; j++) {
dst[j*dim + i] = src[i*dim + j];
}
__builtin_prefetch (&src[(i+1)*dim], 0, 1);
}
}
Of course you should hoist computations like i*dim out of the inner loop, as you already did in your attempts.
Cache prefetching can be used for the source matrix.
You probably know this, but register int tells the compiler that it would be smart to keep the variable in a register. Making the ints unsigned may also make things go a little bit faster.
I'm fairly new to C, not having had much need for anything faster than Python in most of my research. However, it turns out that recent work I've been doing requires the computation of fairly large vectors/matrices, and therefore a C+MPI solution might be in order.
Mathematically speaking, the task is very simple. I have a lot of vectors of dimensionality ~40k and wish to compute the Kronecker product of selected pairs of these vectors, and then sum these Kronecker products.
The question is, how do I do this efficiently? Is there anything wrong with the following structure of code, using for loops, to achieve this?
The function kron described below takes vectors A and B of length vector_size and computes their Kronecker product, which it stores in C, a vector_size*vector_size matrix.
void kron(int *A, int *B, int *C, int vector_size) {
int i,j;
for(i = 0; i < vector_size; i++) {
for (j = 0; j < vector_size; j++) {
C[i*vector_size+j] = A[i] * B[j];
}
}
return;
}
This seems fine to me, and certainly (if I've not made some silly syntax error) produces the right result, but I have a sneaking suspicion that nested for loops are not optimal. If there's another way I should be going about this, please let me know. Suggestions welcome.
I thank you for your patience and any advice you may have. Once again, I'm very inexperienced with C, but Googling around has brought me little joy with this query.
Since your loop bodies are all completely independent, there is certainly a way to accelerate this. The easiest would be to take advantage of several cores before thinking of MPI. OpenMP should do quite fine on this.
#pragma omp parallel for
for(int i = 0; i < vector_size; i++) {
    for (int j = 0; j < vector_size; j++) {
        C[i*vector_size + j] = A[i] * B[j];
    }
}
This is supported by many compilers nowadays.
You could also try to drag some common expressions out of the inner loop, but decent compilers, e.g. gcc, icc or clang, should do this quite well all by themselves:
#pragma omp parallel for
for(int i = 0; i < vector_size; ++i) {
    int const x = A[i];
    int * vec = &C[i*vector_size];  /* row i of the flat result matrix */
    for (int j = 0; j < vector_size; ++j) {
        vec[j] = x * B[j];
    }
}
BTW, indexing with int is usually not the right thing to do. size_t is the correct typedef for everything that has to do with indexing and sizes of objects.
For double-precision vectors (single-precision and complex are similar), you can use the BLAS routine DGER (rank-one update) or similar to do the products one-at-a-time, since they are all on vectors. How many vectors are you multiplying? Remember that adding a bunch of vector outer products (which you can treat the Kronecker products as) ends up as a matrix-matrix multiplication, which BLAS's DGEMM can handle efficiently. You might need to write your own routines if you truly need integer operations, though.
If your compiler supports C99 (and you never pass the same vector as A and B), consider compiling in a C99-supporting mode and changing your function signature to:
void kron(int * restrict A, int * restrict B, int * restrict C, int vector_size);
The restrict keyword promises the compiler that the arrays pointed to by A, B and C do not alias (overlap). With your code as written, the compiler must re-load A[i] on every execution of the inner loop, because it must be conservative and assume that your stores to C[] can modify values in A[]. Under restrict, the compiler can assume that this will not happen.
Solution found (thanks to @Jeremiah Willcock): GSL's BLAS bindings seem to do the trick beautifully. If we're progressively selecting pairs of vectors A and B and adding their product to some 'running total' vector/matrix C, the following modified version of the above kron function
void kronadd(int *A, int *B, int *C, int vector_size, int alpha) {
    int i,j;
    for(i = 0; i < vector_size; i++) {
        for (j = 0; j < vector_size; j++) {
            C[i*vector_size+j] += alpha * A[i] * B[j];  /* accumulate into C, as DGER does */
        }
    }
    return;
}
precisely corresponds to the BLAS DGER function (accessible as gsl_blas_dger), functionally speaking. The initial kron function is then DGER with alpha = 1 and C being a zero-initialised matrix/vector of the correct dimensionality.
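For concreteness, a hypothetical sketch of the GSL call (double precision; kronadd_gsl is my name, not GSL's):
#include <gsl/gsl_blas.h>

/* C := alpha * A B^T + C - one rank-one update per selected pair of vectors. */
void kronadd_gsl(double alpha, const gsl_vector *A, const gsl_vector *B,
                 gsl_matrix *C)
{
    gsl_blas_dger(alpha, A, B, C);
}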
It turns out, it might well be easier to simply use python bindings for these libraries, in the end. However, I think I've learned a lot while trying to figure this stuff out. There are some more helpful suggestions in the other responses, do check them out if you have the same sort of problem to deal with. Thanks everyone!
This is a common enough problem in numerical computing circles that really the best thing to do would be to use a well-debugged package like Matlab (or one of its Free Software clones).
You could probably even find a python binding to it, so you can get rid of C.
All of the above is (probably) going to be faster than code written strictly in python. If you need more speed than that, I'd suggest a couple of things:
Look into using Fortran instead of C. Fortran compilers tend to be better at optimizing numerical computations (one exception would be if you are using gcc, since both its C and Fortran compilers use the same backend).
Consider parallelizing your algorithm. There are variants of Fortran I know that have parallel loop statements. I think there are some C addons around that do the same thing. If you are using a PC (and single-precision) you could also consider using your video card's GPU, which is essentially a really cheap array processor.
Another optimisation that would be easy to implement: if you know that the inner dimension of your arrays is divisible by n, then add n assignment statements to the body of the loop, reducing the number of necessary iterations, with corresponding changes to the loop counting.
This strategy can be generalised by using a switch statement around the outer loop, with cases for array sizes divisible by two, three, four and five, or whatever is most common. It can give quite a big performance win and is compatible with suggestions 1 and 3 for further optimisation/parallelisation. A good compiler may even do something like this for you (aka loop unrolling). A sketch follows.
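For instance, a hypothetical unrolled variant of the kron function above (n = 4; here the remainder is handled by a cleanup loop rather than a switch, which is simpler but equivalent in spirit):
void kron_unrolled(int *A, int *B, int *C, int vector_size) {
    int i, j;
    int limit = (vector_size / 4) * 4;   /* largest multiple of 4 <= vector_size */
    for (i = 0; i < vector_size; i++) {
        int a = A[i];
        int *row = &C[i * vector_size];
        for (j = 0; j < limit; j += 4) { /* four assignments per iteration */
            row[j]     = a * B[j];
            row[j + 1] = a * B[j + 1];
            row[j + 2] = a * B[j + 2];
            row[j + 3] = a * B[j + 3];
        }
        for (; j < vector_size; j++)     /* leftover elements */
            row[j] = a * B[j];
    }
}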
Another optimisation would be to make use of pointer arithmetic to avoid the array indexing. Something like this should do the trick:
int i, j;
for(i = 0; i < vector_size; i++) {
    int d = *A++;           /* cache A[i] in a local, advancing A once per row */
    int *e = B;             /* rewind B at the start of each row */
    for (j = 0; j < vector_size; j++) {
        *C++ = *e++ * d;    /* C[i*vector_size + j] = A[i] * B[j] */
    }
}
This also avoids accessing the value of A[i] multiple times by caching it in a local variable, which might give you a minor speed boost. (Note that this version is not parallelisable since it alters the value of the pointers, but would still work with loop unrolling.)
To solve your problem, I think you should try Eigen 3; it's a C++ library that implements all the matrix functions you'll need!
If you have time, go and look at its documentation! =)
Good luck!
/* Kronecker product of two general matrices, row-major storage. */
uint32_t rA = 3;
uint32_t cA = 5;
uint32_t lda = cA;
uint32_t rB = 5;
uint32_t cB = 3;
uint32_t ldb = cB;
uint32_t rC = rA*rB;
uint32_t cC = cA*cB;
uint32_t ldc = cC;
double *A = (double *)malloc(rA*cA*sizeof(double));
double *B = (double *)malloc(rB*cB*sizeof(double));
double *C = (double *)malloc(rC*cC*sizeof(double));
for (uint32_t i=0, allA=rA*cA; i<allA; i++)
A[i]=i;
for (uint32_t i=0, allB=rB*cB; i<allB; i++)
B[i]=i;
for (uint32_t i=0, allC=rC*cC; i<allC; i++)
C[i]=0;
/* Element (rowA*rB + rowB, colA*cB + colB) of C equals
   A[rowA][colA] * B[rowB][colB], where rowA = i/lda, colA = i%lda,
   rowB = j/ldb, colB = j%ldb. */
for (uint32_t i=0, allA=rA*cA; i<allA; i++)
{
    for (uint32_t j=0, allB=rB*cB; j<allB; j++)
        C[((i/lda)*rB + j/ldb)*ldc
          + (i%lda)*cB + j%ldb] = A[i]*B[j];
}