I have been using Java and am quite new to C. I tried to write a function that generates a random pixel array with malloc. I free the memory in another function after using that random array. I think my concept is fine, but I am not sure whether I wrote the code properly and whether it really frees the heap memory. It would be great if you could look at the code and tell me if it works.
pixel* randomPalette(int colors){
    int i, x;
    pixel *randomArr = (pixel *)malloc(sizeof(pixel) * colors);
    srand(time(NULL)); //generate a random seed
    for (i = 0; i < colors; i++){
        x = rand() % 256;
        randomArr[i].r = x; randomArr[i].g = x; randomArr[i].b = x;
    }
    return randomArr;
}
void QuantizeA(pixel* Im, int width, int height){
    //create a random palette of 8 RGB colors
    const int num = 8;
    pixel* Arr = randomPalette(num);
    //find the min distance between pixel and palette color
    int x, y, z;
    int min = 195075; // max distance is 255^2 + 255^2 + 255^2
    int pos = 0;
    for (x = 0; x < height; x++){
        for (y = 0; y < width; y++){
            //compare distance of the pixel to each palette color
            for (z = 0; z < num; z++) {
                if (distance(Im[pos], Arr[z]) < min){
                    Im[pos].r = Arr[pos].r;
                    Im[pos].g = Arr[pos].g;
                    Im[pos].b = Arr[pos].b;
                }
            }
            pos++; //go to next pixel
        }
    }
    glutPostRedisplay();
    free(Arr);
}
As far as the memory allocation/deallocation goes, your code is fine (I did not check your logic). However, there are two things to note here:
Check for the success of malloc() before using the returned value.
You need to call srand(time(NULL)); only once at the start of your program, possibly in main().
As a suggestion, you can always use the memcheck tool from Valgrind to check for memory-leak related issues, if you suspect any.
Also, please see this discussion on why not to cast the return value of malloc() and family in C.
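For example, a minimal sketch of your allocation function with those points applied might look like this (the error handling shown is just one possible choice, and it assumes your pixel type and the usual headers):

pixel *randomPalette(int colors)
{
    /* srand(time(NULL)) is assumed to have been called once, e.g. in main() */
    pixel *randomArr = malloc(sizeof(pixel) * colors);  /* no cast needed in C */
    if (randomArr == NULL) {
        fprintf(stderr, "malloc failed\n");
        return NULL;                    /* the caller must check for NULL */
    }
    for (int i = 0; i < colors; i++) {
        int x = rand() % 256;
        randomArr[i].r = x;
        randomArr[i].g = x;
        randomArr[i].b = x;
    }
    return randomArr;
}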
When you assign the new minimum distance:
Im[pos].r = Arr[pos].r;
you use the wrong index. It should be:
Im[pos].r = Arr[z].r;
As a side note: You cannot compare two structs with the comparison operators, not even for equality, but C allows you to assign one struct to another, which effectively copies the contents. So you don't need to copy all components, you can just say:
Im[pos] = Arr[z];
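Putting both fixes together, the inner comparison could look roughly like this (best_z and best_dist are names introduced here for illustration, and distance() is assumed to return an int, as your min variable suggests):

int best_z = 0;
int best_dist = distance(Im[pos], Arr[0]);
for (z = 1; z < num; z++) {
    int d = distance(Im[pos], Arr[z]);
    if (d < best_dist) {
        best_dist = d;
        best_z = z;
    }
}
Im[pos] = Arr[best_z];  /* struct assignment copies r, g and b in one go */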
Related
I want to make a C function for an FIR filter. It has two input arrays and one output array.
Both input arrays are constant numbers; I want to use them for the computation of the filter output, and after the computation delete them and keep only the output array of the function. This is my code, but it does not work:
#include <stdlib.h>

float * filter(float *PATIENTSIGNAL, float *FILTERCOEF, int lengthofpatient, int lengthoffilter){
    static float FIROUT[8000];
    int i, j;
    float temp = 0;
    float* SIGNAL;
    float* COEF;
    SIGNAL = malloc(lengthofpatient * sizeof(float));
    COEF = malloc(lengthoffilter * sizeof(float));
}
    for (j = 0; j <= lengthofpatient; j++){
        temp = SIGNAL[j] * COEF[0];
        for (i = 1; i <= lengthoffilter; i++){
            if ((j - i) >= 0){
                temp += SIGNAL[j - i] * COEF[i];
            }
            FIROUT[j] = temp;
        }
    }
    free(SIGNAL);
    free(COEF);
    free(PATIENTSIGNAL);
    return FIROUT;
}
There are several problems in your code:
Unnecessary } after line COEF = malloc(lengthoffilter*sizeof(float));.
for (j = 0; j <= lengthofpatient; j++). This loops once more than required; the same goes for the i loop. pmg mentioned it in the comments.
temp += SIGNAL[j - i] * COEF[i]; will not give you the desired outcome, as you have not initialized either SIGNAL or COEF.
What is the purpose of float *PATIENTSIGNAL, float *FILTERCOEF in the function parameters?
As a wild guess, I think you need these two lines to initialize SIGNAL and COEF (memcpy() is declared in <string.h>, and the sizes are in bytes):
memcpy(SIGNAL, PATIENTSIGNAL, lengthofpatient * sizeof(float));
memcpy(COEF, FILTERCOEF, lengthoffilter * sizeof(float));
Don't free PATIENTSIGNAL in your local function. Let this be done by the function caller.
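Putting those points together, a corrected version might look roughly like this (only a sketch: the fixed 8000-element output buffer and the copying of the inputs are kept from your original code, and the caller is expected to check for NULL):

#include <stdlib.h>
#include <string.h>

float *filter(const float *PATIENTSIGNAL, const float *FILTERCOEF,
              int lengthofpatient, int lengthoffilter)
{
    static float FIROUT[8000];
    float *SIGNAL = malloc(lengthofpatient * sizeof(float));
    float *COEF   = malloc(lengthoffilter * sizeof(float));
    if (SIGNAL == NULL || COEF == NULL) {
        free(SIGNAL);
        free(COEF);
        return NULL;
    }
    memcpy(SIGNAL, PATIENTSIGNAL, lengthofpatient * sizeof(float));
    memcpy(COEF, FILTERCOEF, lengthoffilter * sizeof(float));

    for (int j = 0; j < lengthofpatient; j++) {      /* note: <, not <= */
        float temp = SIGNAL[j] * COEF[0];
        for (int i = 1; i < lengthoffilter; i++) {
            if (j - i >= 0)
                temp += SIGNAL[j - i] * COEF[i];
        }
        FIROUT[j] = temp;
    }

    free(SIGNAL);
    free(COEF);          /* PATIENTSIGNAL is left for the caller to free */
    return FIROUT;
}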
I want to do some calculations with matrices of size 2048*2048, for example. But the simulator stops working and does not run the code. I figured out that the problem is the size and type of the variable. For example, I ran the simple code below to check whether I am right or not. It should print 1 after declaring variable A, but it does not work.
Please note that I use Code::Blocks. WFM is a function that writes a float matrix to a text file; it works properly, because I have already checked it with other matrices.
int main()
{
    float A[2048][2048];
    printf("1");
    float *AP = &(A[0][0]);
    const char *File_Name = "example.txt";
    int counter = 0;
    for(int i = 0; i < 2048; i++)
        for(int j = 0; j < 2048; j++)
        {
            A[i][j] = counter;
            ++counter;
        }
    WFM(AP, 2048, 2048, File_Name, ' ');
    return 0;
}
Any help or suggestion on how to deal with this problem, and with even larger matrices, is appreciated.
Thanks
float A[2048][2048];
which requires approximately 2K * 2K * 4 = 16 MB of stack memory (with 4-byte floats). But typically the stack size of the process is far less than that. Please allocate it dynamically using the malloc() family.
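For example, here is a sketch of one way to do that while keeping the A[i][j] syntax (just an illustration; the error handling is one possible choice):

float (*A)[2048] = malloc(2048 * sizeof *A);  /* one heap block of 2048*2048 floats */
if (A == NULL) {
    perror("malloc");
    exit(1);
}
/* ... A[i][j] can be used exactly as before ... */
free(A);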
float A[2048][2048];
This may be too large for a local array; you should allocate the memory dynamically with a function such as malloc(). For example, you could do this:
float *A = malloc(2048*2048*sizeof(float));
if (A == 0)
{
    perror("malloc");
    exit(1);
}
float *AP = A;
int counter = 0;
for(int i = 0; i < 2048; i++)
    for(int j = 0; j < 2048; j++)
    {
        *(A + 2048*i + j) = counter;
        ++counter;
    }
And when you no longer need A, you can free it with free(A);.
Helpful links about efficiency pitfalls of large arrays with power-of-2 sizes (offered by @LưuVĩnhPhúc):
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
Why is my program slow when looping over exactly 8192 elements?
Matrix multiplication: Small difference in matrix size, large difference in timings
Looking at Mark Harris's reduction example, I am trying to see if I can have threads store intermediate values without a reduction operation:
For example CPU code:
for(int i = 0; i < ntr; i++)
{
for(int j = 0; j < pos* posdir; j++)
{
val = x[i] * arr[j];
if(val > 0.0)
{
out[xcount] = val*x[i];
xcount += 1;
}
}
}
Equivalent GPU code:
const int threads = 64;
num_blocks = ntr/threads;
__global__ void test_g(float *in1, float *in2, float *out1, int *ct, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[threads];
    __shared__ float t2[threads];
    int gcount = 0;
    for(int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i%posdir];
        }
        __syncthreads();
        for(int i = 0; i < 32; i++)
        {
            t2[i] = t1[i] * in1[tid];
            if(t2[i] > 0){
                out1[gcount] = t2[i] * in1[tid];
                gcount = gcount + 1;
            }
        }
    }
    ct[0] = gcount;
}
What I am trying to do here is the following:
(1) Store 32 values of in2 in the shared memory variable t1,
(2) For each value of i and in1[tid], calculate t2[i],
(3) if t2[i] > 0 for that particular combination of i, write t2[i]*in1[tid] to out1[gcount]
But my output is all wrong. I am not even able to get a count of all the times t2[i] is greater than 0.
Any suggestions on how to save the value of gcount for each i and tid? While debugging, I find that for block (0,0,0) and thread (0,0,0) I can sequentially see the values of t2 being updated. After the CUDA kernel switches focus to block (0,0,0) and thread (32,0,0), the values of out1[0] are overwritten again. How can I store the values of out1 for each thread and write them to the output?
I have tried two approaches so far (suggested by @paseolatis on the NVIDIA forums):
(1) defined offset=tid*32; and replace out1[gcount] with out1[offset+gcount],
(2) defined
__device__ int totgcount=0; // this line before main()
atomicAdd(&totgcount,1);
out1[totgcount]=t2[i] * in1[tid];
int *h_xc = (int*) malloc(sizeof(int) * 1);
cudaMemcpyFromSymbol(h_xc, totgcount, sizeof(int)*1, cudaMemcpyDeviceToHost);
printf("GPU: xcount = %d\n", h_xc[0]); // Output looks like this: GPU: xcount = 1928669800
Any suggestions? Thanks in advance !
OK let's compare your description of what the code should do with what you have posted (this is sometimes called rubber duck debugging).
Store 32 values of in2 in shared memory variable t1
Your kernel contains this:
if (threadIdx.x < 32) {
t1[threadIdx.x] = in2[i%posdir];
}
which is effectively loading the same value from in2 into every value of t1. I suspect you want something more like this:
if (threadIdx.x < 32) {
t1[threadIdx.x] = in2[i+threadIdx.x];
}
For each value of i and in1[tid], calculate t2[i],
This part is OK, but why is t2 needed in shared memory at all? It is only an intermediate result which can be discarded after the inner iteration is completed. You could easily have something like:
float inval = in1[tid];
.......
for(int i = 0; i < 32; i++)
{
float result = t1[i] * inval;
......
if t2[i] > 0 for that particular combination of i, write
t2[i]*in1[tid] to out1[gcount]
This is where the problems really start. Here you do this:
if(t2[i] > 0){
out1[gcount] = t2[i] * in1[tid];
gcount = gcount + 1;
}
This is a memory race. gcount is a thread local variable, so each thread will, at different times, overwrite any given out1[gcount] with its own value. What you must have, for this code to work correctly as written, is to have gcount as a global memory variable and use atomic memory updates to ensure that each thread uses a unique value of gcount each time it outputs a value. But be warned that atomic memory access is very expensive if it is used often (this is why I asked about how many output points there are per kernel launch in a comment).
The resulting kernel might look something like this:
__device__ int gcount; // must be set to zero before the kernel launch

__global__ void test_g(float *in1, float *in2, float *out1, int posdir, int pos)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    __shared__ float t1[32];
    float ival = in1[tid];

    for(int i = 0; i < posdir*pos; i += 32) {
        if (threadIdx.x < 32) {
            t1[threadIdx.x] = in2[i+threadIdx.x];
        }
        __syncthreads();

        for(int j = 0; j < 32; j++)
        {
            float tval = t1[j] * ival;
            if(tval > 0){
                int idx = atomicAdd(&gcount, 1);
                out1[idx] = tval * ival;
            }
        }
    }
}
Disclaimer: written in browser, never been compiled or tested, use at own risk.....
Note that your write to ct was also a memory race, but with gcount now a global value, you can read the value after the kernel without the need for ct.
EDIT: It seems that you are having some problems with zeroing gcount before running the kernel. To do this, you will need to use something like cudaMemcpyToSymbol or perhaps cudaGetSymbolAddress and cudaMemset. It might look something like:
const int zero = 0;
cudaMemcpyToSymbol("gcount", &zero, sizeof(int), 0, cudaMemcpyHostToDevice);
Again, usual disclaimer: written in browser, never been compiled or tested, use at own risk.....
A better way to do what you are doing is to give each thread its own output, and let it increment its own count and enter values - this way, the double-for loop can happen in parallel in any order, which is what the GPU does well. The output is wrong because the threads share the out1 array, so they'll all overwrite it.
You should also move the code to copy into shared memory into a separate loop, with a __syncthreads() after. With the __syncthreads() out of the loop, you should get better performance - this means that your shared array will have to be the size of in2 - if this is a problem, there's a better way to deal with this at the end of this answer.
You also should move the threadIdx.x < 32 check to the outside. So your code will look something like this:
if (threadIdx.x < 32) {
    for(int i = threadIdx.x; i < posdir*pos; i += 32) {
        t1[i] = in2[i];
    }
}
__syncthreads();
for(int i = threadIdx.x; i < posdir*pos; i += 32) {
    for(int j = 0; j < 32; j++)
    {
        ...
    }
}
Then put a __syncthreads(), an atomic addition of gcount += count, and a copy from the local output array to a global one - this part is sequential, and will hurt performance. If you can, I would just have a global list of pointers to the arrays for each local one, and put them together on the CPU.
Another change is that you don't need shared memory for t2 - it doesn't help you. And the way you are doing this, it seems like it works only if you are using a single block. To get good performance out of most NVIDIA GPUs, you should partition this into multiple blocks. You can tailor this to your shared memory constraint. Of course, you don't have a __syncthreads() between blocks, so the threads in each block have to go over the whole range for the inner loop, and a partition of the outer loop.
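As a rough illustration of the "one output slice per thread" idea (not your original code: max_per_thread is an assumed upper bound on outputs per thread, out1 would have to be sized ntr * max_per_thread, and the shared-memory staging is left out to keep the sketch short):

__global__ void test_g(const float *in1, const float *in2, float *out1,
                       int *ct, int posdir, int pos, int max_per_thread)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float ival = in1[tid];
    int count = 0;
    for (int i = 0; i < posdir * pos; i++) {
        float val = in2[i] * ival;
        if (val > 0.0f) {
            /* each thread owns its own slice of out1, so there is no race */
            out1[tid * max_per_thread + count] = val * ival;
            count++;
        }
    }
    ct[tid] = count;   /* per-thread count; compact the slices on the host afterwards */
}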
Here is my function that tests whether a point (x, y) is in the Mandelbrot set after MAX_ITERATION (255) iterations. It should return 0 if it is not, 1 if it is.
int isMandelbrot (int x, int y) {
    int i;
    int j;
    double Re[255];
    double Im[255];
    double a;
    double b;
    double dist;
    double finaldist;
    int check;

    i = 0;
    Re[0] = 0;
    Im[0] = 0;
    j = -1;
    a = 0;
    b = 0;

    while (i < MAX_ITERATION) {
        a = Re[j];
        b = Im[j];
        Re[i] = ((a*a) - (b*b)) + x;
        Im[i] = (2 * a * b) + y;
        i++;
        j++;
    }

    finaldist = sqrt(pow(Re[MAX_ITERATION],2) + pow(Im[MAX_ITERATION],2));

    if (dist > 2) { //not in mandelbrot
        check = 0;
    } else if (dist <= 2) { //in mandelbrot set
        check = 1;
    }
    return check;
}
Given that it's correct (can someone verify... or write a more efficient one?).
Here is my code to print it; however, it does not work (it reports that all points are in the set). What have I done wrong here?
int main(void) {
    double col;
    double row;
    int checkSet;

    row = -4;
    col = -1;

    while (row < 1.0) {
        while (col < 1.0) {
            checkSet = isMandelbrot(row, col);
            if (checkSet == 1) {
                printf("-");
            } else if (checkSet == 0) {
                printf("*");
            }
            col = col + 0.5;
        }
        col = -1;
        row = row + 0.5;
        printf("\n");
    }
    return 0;
}
There are some bugs in your code. For example, you do this:
a = Re[j];
b = Im[j];
But at the first iteration, j = -1, so you're getting the value at index -1 of the arrays. That is not what you wanted to do.
Also, why are Re and Im arrays - do you really need to keep track of all the intermediate results in the calculation?
Wikipedia contains pseudocode for the algorithm, you might want to check your own code against that.
Another bug: your function takes int arguments, so the values of your double inputs will be truncated (i.e. the fractional part will be discarded).
You should probably be checking for escape inside the while loop. That is to say, if ((a*a + b*b) > 4) at any time, then that pixel has escaped, end of story. By continuing to iterate those pixels, as well as wasting CPU cycles, you let the values grow without bound; they seem to exceed what can be represented in a double - the result is NaN, so your finaldist computation produces garbage.
I think you would benefit from more resolution in your main. Your code as you've put it here isn't computing enough pixels to really see much of the set.
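Pulling those points together, here is a sketch of what the function could look like, with double parameters, the escape test inside the loop, and without the arrays (it follows the standard escape-time algorithm and is only one way to write it):

#define MAX_ITERATION 255

/* Returns 1 if (x, y) appears to be in the Mandelbrot set, 0 otherwise. */
int isMandelbrot(double x, double y)        /* double, so the inputs are not truncated */
{
    double a = 0.0, b = 0.0;                /* the current iterate z = a + bi */
    for (int i = 0; i < MAX_ITERATION; i++) {
        double a2 = (a * a) - (b * b) + x;  /* z = z*z + c, with c = x + yi */
        double b2 = (2.0 * a * b) + y;
        a = a2;
        b = b2;
        if ((a * a) + (b * b) > 4.0)        /* |z| > 2: escaped, not in the set */
            return 0;
    }
    return 1;                               /* never escaped within MAX_ITERATION */
}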
Say we are going to insert an element into a malloc'd array. I know where and how to insert, but I'm having trouble shuffling every succeeding element down by 1. What would be the technical approach for this? Thanks.
| x x x x x x x x x x x | original array
| x x x x x 0 x x x x x x | new array
Suppose the "memmove" function is not available to us...
Yes, if you need to do this without memmove, you can do it with a simple loop. Note that you might also need to use realloc first, to expand the size of the allocated array so that it can fit the new element.
The trick to this is having the loop move each element one forward, starting from the last one. A moment's reflection should tell you why this is necessary.
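For instance, here is a small self-contained sketch of the realloc-then-shift idea; the concrete values (five elements, inserting 99 at index 2) are made up for illustration:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t count = 5, pos = 2;              /* current size and insertion index */
    int *arr = malloc(count * sizeof *arr);
    if (arr == NULL)
        return 1;
    for (size_t i = 0; i < count; i++)
        arr[i] = (int)i * 10;               /* 0 10 20 30 40 */

    int *tmp = realloc(arr, (count + 1) * sizeof *arr);  /* make room for one more */
    if (tmp == NULL) {                      /* the old block is still valid on failure */
        free(arr);
        return 1;
    }
    arr = tmp;

    for (size_t i = count; i > pos; i--)    /* shift the tail up, starting from the last element */
        arr[i] = arr[i - 1];
    arr[pos] = 99;
    count++;

    for (size_t i = 0; i < count; i++)
        printf("%d ", arr[i]);              /* prints: 0 10 99 20 30 40 */
    putchar('\n');
    free(arr);
    return 0;
}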
The basic principle is the same whether the array is dynamically allocated, statically allocated or automatically allocated. The main difference is that if there is insufficient room in a dynamically allocated array, you can reallocate it with more space (subject to some system-imposed limits). Assuming there is enough space in the array, you could use memmove() to copy the section of the array after the target location up one space, and then set the target location to the inserted value. Or you could write a loop to do the job.
int *dynarr = malloc(24 * sizeof(*dynarr));
int idx = 0;
dynarr[idx++] = 0;
dynarr[idx++] = 23;
dynarr[idx++] = 34;
dynarr[idx++] = 9;
dynarr[idx++] = 15;
Now insert at position n = 2:
memmove(&dynarr[n+1], &dynarr[n], (idx - n) * sizeof(int));
dynarr[n] = 19;
idx++;
That's a bulk move, an assignment, and increment the counter because there's one more element in the array.
Since the question was edited to disallow memmove(), here is a solution with simple array indexing, assuming that the same initialization sequence is used:
int i;
int n = 2;
for (i = idx; i > n; i--)
{
    dynarr[i] = dynarr[i-1];
}
dynarr[n] = 19;
idx++;
Complete example code:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
static void print_array(int *a, int n)
{
    int i;
    for (i = 0; i < n; i++)
    {
        printf("a[%d] = %d\n", i, a[i]);
    }
}

int main()
{
    {
        int *dynarr = malloc(24 * sizeof(*dynarr));
        int idx = 0;
        dynarr[idx++] = 0;
        dynarr[idx++] = 23;
        dynarr[idx++] = 34;
        dynarr[idx++] = 9;
        dynarr[idx++] = 15;
        printf("Before insert\n");
        print_array(dynarr, idx);
        int n = 2;
        memmove(&dynarr[n+1], &dynarr[n], (idx - n) * sizeof(int));
        dynarr[n] = 19;
        idx++;
        printf("After insert\n");
        print_array(dynarr, idx);
        free(dynarr);
    }
    {
        int *dynarr = malloc(24 * sizeof(*dynarr));
        int idx = 0;
        dynarr[idx++] = 0;
        dynarr[idx++] = 23;
        dynarr[idx++] = 34;
        dynarr[idx++] = 9;
        dynarr[idx++] = 15;
        printf("Before insert\n");
        print_array(dynarr, idx);
        int n = 2;
        int i;
        for (i = idx; i > n; i--)
        {
            dynarr[i] = dynarr[i-1];
        }
        dynarr[n] = 19;
        idx++;
        printf("After insert\n");
        print_array(dynarr, idx);
        free(dynarr);
    }
    return(0);
}
As Don suggested, memmove() will allow moving part of this array, in order to make room for the new element.
Depending on the size of the elements in the array, you may also consider storing only pointers in the array, allowing easier/faster re-shuffling of the array at the cost of an extra indirection when accessing individual elements (and also at the cost of having to manage individual element-sized memory blocks). Whether this approach makes sense depends on how much the array elements get reorganized, as well as on their size.
Alert: in view of the added "picture" in the question, memmove(), or indeed any operation, may be impossible, if the memory move implies writing past the size of memory originally allocated!
If this is really what is desired, the idea of an array of pointers may be more appropriate as this allows allocating an over-sized memory block initially (for the array proper) and to allocate (or dispose of) individual elements as needed.
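To illustrate the pointer-array idea, here is a small self-contained sketch; the element type, the counts, and the insertion index are made up for the example, and error checks are omitted for brevity:

#include <stdio.h>
#include <stdlib.h>

typedef struct { int payload; } elem;   /* stand-in for a "large" element type */

int main(void)
{
    size_t capacity = 8, count = 0, pos = 1;
    elem **slots = malloc(capacity * sizeof *slots);   /* over-sized array of pointers */

    for (int v = 0; v < 3; v++) {                      /* fill three elements */
        slots[count] = malloc(sizeof **slots);
        slots[count]->payload = v * 10;
        count++;
    }

    elem *e = malloc(sizeof *e);                       /* the element to insert */
    e->payload = 99;
    for (size_t i = count; i > pos; i--)               /* shift pointers, not whole elements */
        slots[i] = slots[i - 1];
    slots[pos] = e;
    count++;

    for (size_t i = 0; i < count; i++)
        printf("%d ", slots[i]->payload);              /* prints: 0 99 10 20 */
    putchar('\n');

    for (size_t i = 0; i < count; i++)
        free(slots[i]);
    free(slots);
    return 0;
}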
Edit: "We're not allowed to use memmove()" indicates some form of homework. (BTW do tag it as such !!!)
To better help you, we need to understand the particular premise of the question. Here's what appears to be the situation:
1) we readily have an array, containing, say, N elements.
2) the array is on the heap, i.e. it was allocated using malloc() (or related functions).
3) the effective size of the malloc-ated block of memory is bigger than that of the array.
Is #3 true?
4) Depending on #3, we either need to allocate a new (bigger) memory block and copy the array. We expect this copy would be done in 3 steps:
- copy the elements that precede the new element
- copy the new element
- copy the elements that come after the new element
Or, if we have enough room in the existing block, we only need two steps:
- "shift" the elements that are supposed to come after the new element (this can be done one element at a time, if we wish to avoid memcpy)
- copy the new element.
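For the first case (no spare room in the block), a minimal sketch of the allocate-and-copy approach might look like this; the concrete sizes, the value 99, and the variable names are illustrative only, and memcpy() could just as well be replaced by plain loops if library copy routines are off-limits:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t n = 5, pos = 2;          /* old element count and insertion index */
    int value = 99;                 /* element to insert */
    int *src = malloc(n * sizeof *src);
    if (src == NULL)
        return 1;
    for (size_t i = 0; i < n; i++)
        src[i] = (int)i;

    int *dst = malloc((n + 1) * sizeof *dst);   /* new, bigger block */
    if (dst == NULL) {
        free(src);
        return 1;
    }
    memcpy(dst, src, pos * sizeof *src);                        /* step 1: elements before the slot */
    dst[pos] = value;                                           /* step 2: the new element */
    memcpy(dst + pos + 1, src + pos, (n - pos) * sizeof *src);  /* step 3: elements after the slot */
    free(src);

    for (size_t i = 0; i <= n; i++)
        printf("%d ", dst[i]);      /* prints: 0 1 99 2 3 4 */
    putchar('\n');
    free(dst);
    return 0;
}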