I implemented a 2D median filter in C for an image of size 1440x1440 with floating-point values. To start, I tried it with a simple 3x3 kernel. Here's the code.
#define kernelSize 3

void sort(float *array2sort, int n)
{
    float temp;
    for(int i=0; i < n-1; i++)
        for(int j=0; j < n-1-i; j++)
            if(array2sort[j] > array2sort[j+1])
            {
                temp = array2sort[j];
                array2sort[j] = array2sort[j+1];
                array2sort[j+1] = temp;
            }
}
void medianFilter(float *input, float *output)
{
    int halfKernelSize = kernelSize/2;
    float neighbourhood[kernelSize*kernelSize];
    for(int i=halfKernelSize; i<(1440-halfKernelSize); i++)
        for(int j=halfKernelSize; j<(1440-halfKernelSize); j++)
        {
            for(int ii=-halfKernelSize; ii<halfKernelSize+1; ii++)
                for(int jj=-halfKernelSize; jj<halfKernelSize+1; jj++)
                    neighbourhood[(ii+halfKernelSize)*kernelSize+(jj+halfKernelSize)] = input[(i+ii)*1440+(j+jj)];
            sort(neighbourhood, kernelSize*kernelSize);
            output[i*1440+j] = neighbourhood[(kernelSize*kernelSize)/2+1];
        }
}
Now, in order to verify that the code is fine, I took an image and added salt & pepper noise to it using MATLAB, then ran the above code on it. I can see the noise reduced almost completely, with a few dots remaining. If I increase the kernel size to 5x5, the noise does get filtered completely. But the worrying fact for me is that MATLAB's median filter removes the noise completely even with a 3x3 kernel. That leaves me in doubt. Please have a look at the code and let me know whether there is some fault in the filter implementation, or whether the MATLAB code is taking some additional steps.
I think the median value calculated from the neighbourhood buffer is wrong.
It should have been neighbourhood[(kernelSize*kernelSize)/2].
Can you try it with this correction?
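For reference, a minimal sketch of the corrected median pick, assuming an odd kernel size so the sorted buffer has a single middle element (the helper name is illustrative):

#include <stddef.h>

/* For kernelSize = 3 the sorted neighbourhood holds 9 values at indices
   0..8, so the median is the middle one at index 9/2 = 4. Indexing with
   9/2 + 1 = 5 returns the value just above the median, which can let
   isolated noise pixels survive. */
static float median_of_sorted(const float *sorted, size_t n)
{
    return sorted[n / 2];  /* n odd: exact middle element */
}

With that, the output line becomes output[i*1440+j] = median_of_sorted(neighbourhood, kernelSize*kernelSize);.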
I have read this post on how to fuse a loop. The goal is to fuse my double for loop in order to parallelize it with OpenMP. The reason I don't use collapse(2) is that the inner loop has dependencies on the outer one. I have also read this relevant post.
My problem, though, is that when I fuse my loop I get a segmentation fault, and that sounds pretty fuzzy. I am pretty sure I am making the right conversion. Unfortunately, there is no way I can provide a minimal reproducible example, as my program has a ton of functions that call one another. Here is my initial loop, though:
for(int i=0; i<size; i++)
{
    int counter = 0;
    for(int j=0; j<size; j++)
    {
        if (i==j)
            continue;
        if(arr[size * i + j])
        {
            graph->nodes[i]->degree++;
            graph->nodes[i]->neighbours[counter] = (Node*)malloc(sizeof(Node));
            graph->nodes[i]->neighbours[counter] = graph->nodes[j];
            counter++;
        }
    }
}
where graph is a pointer to a struct and graph->nodes is an array of pointers to the graph's nodes. Same goes for graph->nodes[i]->neighbours: an array of pointers (pointed to by a pointer pointed to by another pointer - sorry).
As you can see, it is the counter variable that prevents me from using #pragma omp parallel for collapse(2). Below you can see my converted loop:
for(int n=0; n<size*size; n++)
{
    int i = n / size;
    int j = n % size;
    int counter = 0;
    if (i==j)
        continue;
    if(arr[size * i + j])
    {
        graph->nodes[i]->degree++;
        graph->nodes[i]->neighbours[counter] = (Node*)malloc(sizeof(Node));
        graph->nodes[i]->neighbours[counter] = graph->nodes[j];
        counter++;
    }
}
I have tried debugging with Valgrind, and what's ultra weird is that the segmentation fault does not appear to be on these specific lines, although it happens only when I make the loop conversion.
Mini disclaimer: as you may guess, because of these pointer-to-pointer-to-pointer variables, I use lots of mallocs.
I don't expect you to get the same error with the code I have posted, which is why my question is more of a general one: how, theoretically, could a loop fusion cause a segfault error?
I think in your converted loop you got i and j mixed up.
It should be int i = n % size;, not j.
n / size always equals 0.
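As a side note: if the end goal is just OpenMP parallelism rather than fusion for its own sake, the outer loop alone can be parallelized, since each i owns its own counter. A minimal sketch under assumed types (the Node/Graph definitions here are illustrative stand-ins, not your actual ones):

/* Illustrative stand-ins for the real structures. */
typedef struct Node {
    int degree;
    struct Node **neighbours;  /* assumed preallocated with room for size-1 entries */
} Node;

typedef struct {
    Node **nodes;
} Graph;

void build_adjacency(Graph *graph, const int *arr, int size)
{
    #pragma omp parallel for
    for (int i = 0; i < size; i++) {
        int counter = 0;  /* private to each i, so no cross-iteration dependency */
        for (int j = 0; j < size; j++) {
            if (i == j)
                continue;
            if (arr[size * i + j]) {
                graph->nodes[i]->degree++;
                /* store the existing node pointer directly; the original
                   malloc followed by an immediate reassignment leaks a Node */
                graph->nodes[i]->neighbours[counter++] = graph->nodes[j];
            }
        }
    }
}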
I am optimizing an application and I have been stuck on a loop. I just want to vectorize a nested loop. All this loop does is multiply values from two arrays and accumulate the sums, then find the minimum value among the results. The original code is below.
float min = 0xffffffff;
for(i=0; i<limit_x; i++){
    for(j=0; j<limit_y; j++){
        for(k=0; k<limit_z; k++){
            temp += x[i][k] * y[j][k];
        }
        if(min > temp){
            min = temp;
        }
    }
}
Naturally, the vectorization happens at the innermost loop, and it is guaranteed that x and y are aligned to the vector register width. In this code, I think vectorization is disturbed by the compare-and-set of the minimum value. So I modified the code as below.
float min = 0xffffffff;
for(i=0; i<limit_x; i++){
    for(j=0; j<limit_y; j++){
        for(k=0; k<limit_z; k++){
            temps[i] += x[i][k] * y[i][k];
        }
    }
}
for(i=0; i<limit_x; i++){
    if(min > temps[i]){
        min = temps[i];
    }
}
I expected this to improve performance, because it would keep doing vectorized multiplications over the whole data without being disturbed by the compare-and-minimum step. But the measured timing differs from my expectation: the previous version is slightly faster than the new code.
Can anyone explain why?
Background: the overall program is designed to carry out 2D DIC between a reference image and 1800 target images (for tomographic reconstruction). In my code, there is this for-loop block:
for (k=0; k<kmax; k++)
{
    K=nm12+(k*(h-n+1))/(kmax-1);
    printf("\nk=%d\nL= ", K);
    for (l=0; l<lmax; l++)
    {
        ///For each subset, calculate and store its mean and standard deviation.
        ///Also want to know the sum and sum of squares of the subset, in two sections, stored in fm/df[k][l][0 and 1].
        L=nm12+(l*(w-n+1))/(lmax-1);
        printf("%d ", L);
        fm[k][l][0]=0;
        df[k][l][0]=0;
        fm[k][l][1]=0;
        df[k][l][1]=0;
        ///loops are j then i as it is more efficient (saves m-1 recalculations of b=j+L)
        for (j=0; j<m; j++)
        {
            b=j+L;
            for (i=0; i<M; i++)
            {
                a=i+K;
                fm[k][l][0]+=ref[a][b];
                df[k][l][0]+=ref[a][b]*ref[a][b];
            }
            for (i=M; i<m; i++)
            {
                a=i+K;
                fm[k][l][1]+=ref[a][b];
                df[k][l][1]+=ref[a][b]*ref[a][b];
            }
        }
        fm[k][l][2]=m2r*(fm[k][l][1]+fm[k][l][0]);
        df[k][l][2]=sqrt(df[k][l][1]+df[k][l][0]-m2*fm[k][l][2]*fm[k][l][2]);
        a+=1;
    }
}
Each time l reaches 10, the line df[k][l][2]=sqrt(df[k][l][1]+df[k][l][0]-m2*fm[k][l][2]*fm[k][l][2]); appears to no longer be executed. By this I mean that the debugger shows the value of df[k][l][2] is not changed from zero to the sum correctly. Also, df[k][l][0 and 1] remain fixed regardless of k and l, as long as l>=10.
kmax=15, lmax=20, n=121, m=21, M=(3*m)/4=15, nm12=(n-m+1)/2=50.
The arrays fm and df are double arrays, declared double fm[kmax][lmax][3], df[kmax][lmax][3];
Also, the line a+=1; is just there to be used as a breakpoint to check the value of df[k][l][2], and has no effect on the code's functionality.
Any help as to why this is happening, and how to fix it, will be much appreciated!
EDIT: MORE INFO.
The array ref (containing the reference image pixel values) is a dynamic array, with memory allocated using malloc, in this code block:
double **dark, **flat, **ref, **target, **target2, ***gm, ***dg;
dark=(double**)malloc(h * sizeof(double*));
flat=(double**)malloc(h * sizeof(double*));
ref=(double**)malloc(h * sizeof(double*));
target=(double**)malloc(h * sizeof(double*));
target2=(double**)malloc(h * sizeof(double*));
size_t wd=w*sizeof(double);
for (a=0; a<h; a++)
{
    dark[a]=(double*)malloc(wd);
    flat[a]=(double*)malloc(wd);
    ref[a]=(double*)malloc(wd);
    target[a]=(double*)malloc(wd);
    target2[a]=(double*)malloc(wd);
}
where h=1040 and w=1388 are the dimensions of the image.
You don't mention much about what compiler, IDE or framework you're using. But a way to isolate the problem is to create a new small (console) project containing only the snippet you've posted. This way you'll eliminate most kinds of input/thread/stack/memory/compiler etc. issues.
If the problem disappears there, it came from something you just eliminated; and if it doesn't, the project will be small enough to post whole here on Stack Overflow, for us to take apart and ponder.
Ergo, you should create a self-contained unit test for your algorithm.
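For instance, a self-contained harness for this snippet might look like the sketch below. The dimensions are taken from the question; m2 and m2r are never defined in the post, so they are assumed here to be the mean and standard-deviation normalisation constants (1/(m*m) and m*m) purely as placeholders:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

enum { H = 1040, W = 1388, KMAX = 15, LMAX = 20, NSUB = 121, MSUB = 21 };

int main(void)
{
    const int Mq   = (3 * MSUB) / 4;        /* 15, the question's M */
    const int nm12 = (NSUB - MSUB + 1) / 2; /* 50 */
    const double m2r = 1.0 / (MSUB * MSUB); /* assumed mean normalisation */
    const double m2  = MSUB * MSUB;         /* assumed count in the std-dev formula */

    /* Synthetic reference image instead of real pixel data. */
    double **ref = malloc(H * sizeof *ref);
    for (int a = 0; a < H; a++) {
        ref[a] = malloc(W * sizeof **ref);
        for (int b = 0; b < W; b++)
            ref[a][b] = (a * 31 + b * 17) % 255;
    }

    static double fm[KMAX][LMAX][3], df[KMAX][LMAX][3];
    for (int k = 0; k < KMAX; k++) {
        int K = nm12 + (k * (H - NSUB + 1)) / (KMAX - 1);
        for (int l = 0; l < LMAX; l++) {
            int L = nm12 + (l * (W - NSUB + 1)) / (LMAX - 1);
            fm[k][l][0] = df[k][l][0] = fm[k][l][1] = df[k][l][1] = 0;
            for (int j = 0; j < MSUB; j++) {
                int b = j + L;
                for (int i = 0; i < MSUB; i++) {
                    int a = i + K;
                    int half = (i < Mq) ? 0 : 1;  /* the question's two i-loops */
                    fm[k][l][half] += ref[a][b];
                    df[k][l][half] += ref[a][b] * ref[a][b];
                }
            }
            fm[k][l][2] = m2r * (fm[k][l][1] + fm[k][l][0]);
            df[k][l][2] = sqrt(df[k][l][1] + df[k][l][0]
                               - m2 * fm[k][l][2] * fm[k][l][2]);
            printf("k=%d l=%d df2=%g\n", k, l, df[k][l][2]);
        }
    }
    return 0;  /* mallocs intentionally not freed in a throwaway test */
}

If the small version behaves, the bug lives in the surrounding code; if it misbehaves, it is now short enough to post in full.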
I'm working on a homework assignment, and I've been stuck for hours on my solution. The problem we've been given is to optimize the following code so that it runs faster, regardless of how messy it becomes. We're supposed to use things like cache blocking and loop unrolling.
Problem:
//transpose a dim x dim matrix into dst by swapping all i,j with j,i
void transpose(int *dst, int *src, int dim) {
    int i, j;
    for(i = 0; i < dim; i++) {
        for(j = 0; j < dim; j++) {
            dst[j*dim + i] = src[i*dim + j];
        }
    }
}
What I have so far:
//attempt 1
void transpose(int *dst, int *src, int dim) {
    int i, j, id, jd;
    id = 0;
    for(i = 0; i < dim; i++, id+=dim) {
        jd = 0;
        for(j = 0; j < dim; j++, jd+=dim) {
            dst[jd + i] = src[id + j];
        }
    }
}
//attempt 2
void transpose(int *dst, int *src, int dim) {
    int i, j, id;
    int *pd, *ps;
    id = 0;
    for(i = 0; i < dim; i++, id+=dim) {
        pd = dst + i;
        ps = src + id;
        for(j = 0; j < dim; j++) {
            *pd = *ps++;
            pd += dim;
        }
    }
}
Some ideas - please correct me if I'm wrong:
I have thought about loop unrolling, but I don't think it would help, because we don't know whether the NxN matrix has prime dimensions. Checking for that would add extra calculations that just slow the function down.
Cache blocks wouldn't be very useful, because no matter what, we access one array linearly (1, 2, 3, 4, ...) while the other is accessed in jumps of N. While we can make the function abuse the cache and access the src block faster, it will still take a long time to write those values into the dst matrix.
I have also tried using pointers instead of array subscripts, but I don't think that actually speeds up the program in any way.
Any help would be greatly appreciated.
Thanks
Cache blocking can be useful. For example, let's say we have a cache line size of 64 bytes (which is what x86 uses these days). For a matrix large enough to exceed the cache size, if we transpose a 16x16 block (since sizeof(int) == 4, 16 ints fit in a cache line, assuming the matrix is aligned on a cache-line boundary), we need to load 32 cache lines from memory (16 from the source matrix, and 16 from the destination matrix before we can dirty them) and store another 16 lines (even though the stores are not sequential). In contrast, without cache blocking, transposing the equivalent 16x16 elements requires us to load 16 cache lines from the source matrix, but 16*16 = 256 cache lines to be loaded and then stored for the destination matrix.
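To make that concrete, here is a minimal sketch of a blocked transpose. BLOCK = 16 matches the 16-ints-per-64-byte-line assumption above; the right value for a given machine would need measuring.

/* Walk the matrix in BLOCK x BLOCK tiles so that the cache lines touched
   in both src and dst stay resident while a tile is processed. */
#define BLOCK 16
void transpose_blocked(int *dst, int *src, int dim)
{
    for (int ib = 0; ib < dim; ib += BLOCK) {
        for (int jb = 0; jb < dim; jb += BLOCK) {
            /* clamp the tile at the matrix edge */
            int imax = ib + BLOCK < dim ? ib + BLOCK : dim;
            int jmax = jb + BLOCK < dim ? jb + BLOCK : dim;
            for (int i = ib; i < imax; i++)
                for (int j = jb; j < jmax; j++)
                    dst[j * dim + i] = src[i * dim + j];
        }
    }
}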
Unrolling is useful for large matrices.
You'll need some code to deal with the excess elements if the matrix size isn't a multiple of the unroll factor. But this will be outside the most critical loop, so for a large matrix it's worth it.
Regarding the direction of accesses: it may be better to read linearly and write in jumps of N, rather than vice versa. This is because read operations block the CPU, while write operations don't (up to a limit).
Other suggestions:
1. Can you use parallelization? OpenMP can help (though if you're expected to deliver single CPU performance, it's no good).
2. Disassemble the function and read it, focusing on the innermost loop. You may find things you wouldn't notice in C code.
3. Using decreasing counters (stopping at 0) might be slightly more efficient than increasing counters.
4. The compiler must assume that src and dst may alias (point to the same or overlapping memory), which limits its optimization options. If you could somehow tell the compiler that they can't overlap, it may be a great help (the C99 restrict qualifier is meant for exactly this - see the sketch below).
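A minimal sketch of point 4, assuming C99: restrict promises the compiler that the two buffers never overlap, which frees it to reorder and vectorize more aggressively.

/* Same transpose as above, but dst and src are declared non-aliasing. */
void transpose_restrict(int * restrict dst, const int * restrict src, int dim)
{
    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++)
            dst[j * dim + i] = src[i * dim + j];
}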
Messiness is not a problem, so: I would add a transposed flag to each matrix. The flag indicates whether the stored data array of a matrix is to be interpreted in normal or transposed order.
All matrix operations then receive these new flags in addition to each matrix parameter. Inside each operation, implement the code for all possible combinations of flags. Perhaps macros can save redundant writing here.
In this new implementation, matrix transposition just toggles the flag: the space and time needed for the transpose operation is constant.
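A rough sketch of the idea (the type and accessor names are illustrative):

/* The data array is never moved; the flag says how (i,j) maps to it. */
typedef struct {
    int *data;
    int dim;
    int transposed;  /* 0 = row-major as stored, 1 = interpret as transposed */
} Matrix;

static inline int mat_get(const Matrix *m, int i, int j)
{
    return m->transposed ? m->data[j * m->dim + i]
                         : m->data[i * m->dim + j];
}

static inline void mat_transpose(Matrix *m)
{
    m->transposed = !m->transposed;  /* O(1): no data movement at all */
}

The cost moves into every other operation, which now has to branch on the flag combinations, so whether this wins depends on how often you transpose versus how often you read.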
Just an idea of how to implement unrolling:
void transpose(int *dst, int *src, int dim) {
    int i, j;
    const int dim1 = (dim / 4) * 4;
    for(i = 0; i < dim; i++) {
        for(j = 0; j < dim1; j+=4) {
            dst[j*dim + i] = src[i*dim + j];
            dst[(j+1)*dim + i] = src[i*dim + (j+1)];
            dst[(j+2)*dim + i] = src[i*dim + (j+2)];
            dst[(j+3)*dim + i] = src[i*dim + (j+3)];
        }
        for( ; j < dim; j++) {
            dst[j*dim + i] = src[i*dim + j];
        }
        __builtin_prefetch (&src[(i+1)*dim], 0, 1);
    }
}
Of course, you should hoist index computations (like i*dim) out of the inner loop, as you already did in your attempts.
A cache prefetch can be used for the source matrix.
You probably know this, but: register int tells the compiler it would be smart to keep that variable in a register. Also, making the ints unsigned may make things go a little bit faster.
#include <stdio.h>
#include <time.h>

int main()
{
    clock_t start;
    double d;
    long int n,i,j;
    scanf("%ld",&n);
    n=100000;
    j=2;
    start=clock();
    printf("\n%ld",j);
    for(j=3;j<=n;j+=2)
    {
        for(i=3;i*i<=j;i+=2)
            if(j%i==0)
                break;
        if(i*i>j)
            printf("\n%ld",j);
    }
    d=(clock()-start)/(double)CLOCKS_PER_SEC;
    printf("\n%f",d);
}
I got a running time of 0.015 s for n=100000 with the above program.
I also implemented the Sieve of Eratosthenes algorithm in C and got a running time of 0.046 s for n=100000.
How is my algorithm above faster than the sieve I implemented?
What is the time complexity of my above program?
My sieve implementation:
#define LISTSIZE 100000 //Number of integers to sieve

#include <stdio.h>
#include <math.h>
#include <time.h>

int main()
{
    clock_t start;
    double d;
    long int list[LISTSIZE],i,j;
    int listMax = (int)sqrt(LISTSIZE), primeEstimate = (int)(LISTSIZE/log(LISTSIZE));

    for(int i=0; i < LISTSIZE; i++)
        list[i] = i+2;

    start=clock();
    for(i=0; i < listMax; i++)
    {
        //If the entry has been set to 0 ('removed'), skip it
        if(list[i] > 0)
        {
            //Remove all multiples of this prime
            //Starting from the next entry in the list
            //And going up in steps of size i
            for(j = i+1; j < LISTSIZE; j++)
            {
                if((list[j] % list[i]) == 0)
                    list[j] = 0;
            }
        }
    }
    d=(clock()-start)/(double)CLOCKS_PER_SEC;

    //Output the primes
    int primesFound = 0;
    for(int i=0; i < LISTSIZE; i++)
    {
        if(list[i] > 0)
        {
            primesFound++;
            printf("%ld\n", list[i]);
        }
    }
    printf("\n%f",d);
    return 0;
}
There are a number of things that might influence your result. To be sure, we would need to see the code for your sieve implementation. Also, what is the resolution of the clock function on your computer? If the implementation does not allow for a high degree of accuracy at the millisecond level, then your results could be within the margin of error for your measurement.
I suspect the problem lies here:
//Remove all multiples of this prime
//Starting from the next entry in the list
//And going up in steps of size i
for(j = i+1; j < LISTSIZE; j++)
{
    if((list[j] % list[i]) == 0)
        list[j] = 0;
}
This is a poor way to remove all of the multiples of the prime: it tests every remaining entry with a division. Why not step directly through the multiples instead? This version should be much faster:
//Remove all multiples of this prime
//Stepping through the list
//In steps of size list[i], with no divisions
for(j = list[i]; j < LISTSIZE; j+=list[i])
{
    list[j] = 0;
}
What is the time complexity of my above program?
To empirically measure the time complexity of your program, you need more than one data point. Run your program for multiple values of N, then make a graph of N vs. time. You can do this using a spreadsheet, GNUplot, or graph paper and pencil. You can also use software and/or plain old mathematics to find a polynomial curve that fits your data.
Non-empirically: much has been written (and lectured in computer science classes) about analyzing computational complexity. The Wikipedia article on computational complexity theory might provide some starting points for further reading.
Your sieve implementation is incorrect; that's the reason why it is so slow:
you shouldn't make it an array of numbers, but an array of flags (you may still use int as the data type, but char would do as well)
you shouldn't be using index shifts for the array, but list[i] should determine whether i is a prime or not (and not whether i+2 is a prime)
you should start the elimination with i=2
with these modifications, you should follow 1800 INFORMATION's advice, and cancel all multiples of i with a loop that goes in steps of i, not steps of 1
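Pulling those points together, a minimal corrected sieve might look like this sketch (flags indexed by the number itself, elimination in steps of i; starting the inner loop at i*i is a further standard refinement):

#include <stdio.h>
#include <string.h>

#define LISTSIZE 100000

int main(void)
{
    static char is_prime[LISTSIZE];        /* is_prime[i] != 0 means i is prime */
    memset(is_prime, 1, sizeof is_prime);
    is_prime[0] = is_prime[1] = 0;

    for (long i = 2; i * i < LISTSIZE; i++)
        if (is_prime[i])
            for (long j = i * i; j < LISTSIZE; j += i)  /* steps of i, no division */
                is_prime[j] = 0;

    long primesFound = 0;
    for (long i = 2; i < LISTSIZE; i++)
        primesFound += is_prime[i];
    printf("%ld primes below %d\n", primesFound, LISTSIZE);
    return 0;
}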
Just for your time complexity:
You have an outer loop of ~listMax iterations and an inner loop of at most LISTSIZE iterations. This means your complexity is
O(sqrt(n)*n)
where n = LISTSIZE. It is actually a bit lower, since the inner loop gets shorter each time and is only run for the entries not yet crossed out, but that's difficult to calculate exactly. Since the O-notation offers an upper bound, O(sqrt(n)*n) should be OK.
The behaviour is difficult to predict, but you should take into account that accessing memory is not cheap... it's probably faster to just calculate it again for small primes.
Those run times are too small to be meaningful. The system clock resolution is not accurate to that kind of level.
What you should do to get accurate timing information is run your algorithm in a loop. Repeat it a few thousand times to get the run time up to at least a second, then you can divide the time by the number of loops.
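A minimal sketch of that approach (run_sieve is a hypothetical placeholder for the algorithm under test):

#include <stdio.h>
#include <time.h>

#define REPS 1000  /* enough repetitions to push total time past a second */

/* Hypothetical placeholder: substitute the algorithm being timed. */
static void run_sieve(void) { /* ... */ }

int main(void)
{
    clock_t start = clock();
    for (int r = 0; r < REPS; r++)
        run_sieve();
    double per_run = (clock() - start) / (double)CLOCKS_PER_SEC / REPS;
    printf("%f seconds per run\n", per_run);
    return 0;
}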