This is part of my implementation of the k-means algorithm. I have two blocks of memory, both of equal size, such that *cluster_center is the current center of each cluster and *new_centroids represents the new centroid after taking the mean of the cluster's points:
double *cluster_center = malloc((k * dim) * sizeof(double));
double *new_centroids = malloc((k * dim) * sizeof(double));
I have the following loop to copy the results from new_centroids to cluster_center, which works with no issues:
for (int i = 0; i < k; ++i) {
memcpy(&cluster_center[i * dim], &new_centroids[i * dim], dim * sizeof(double));
}
In fact, I want to know if C has a built-in function to compare the values of both blocks, since I want to terminate my algorithm once the values of *new_centroids and *cluster_center are the same (i.e., the centers didn't change). I really don't know how to do that.
Thank you
The function you're looking for is memcmp (memory compare). Immediately after you execute a statement:
memcpy(destination, source, size);
then
memcmp(destination, source, size);
should return zero.
Related
I am performing Compressed Sparse Row matrix-vector multiplication (CSR SpMV). This involves dividing the array A into multiple chunks and passing each chunk by reference to a function. However, only the first chunk (A[0], starting at the beginning of the array) is processed correctly. Starting from the second iteration, A[0 + chunkIndex], the function reads from a different address beyond the total array's address range, although the indices are correct.
For reference:
The SPMV kernel is:
void serial_matvec(size_t TS, double *A, int *JA, int *IA, double *X, double *Y)
{
double sum;
for (int i = 0; i < TS; ++i)
{
sum = 0.0;
for (int j = IA[i]; j < IA[i + 1]; ++j)
{
sum += A[j] * X[JA[j]]; // the error is here: the function reads a different
// address of A and JA, so the access
// is out of bounds
}
Y[i] = sum;
}
}
and it is called this way:
int chunkIndex = 0;
for(size_t k = 0; k < rows/TS; ++k)
{
chunkIndex = IA[k * TS];
serial_matvec(TS, &A[chunkIndex], &JA[chunkIndex], &IA[k*TS], &X[0], &Y[k*TS]);
}
Assume I process an (8x8) matrix and 2 rows per chunk, so loop k runs rows/TS = 4 times; the chunkIndex and arrays passed to the function will be as follows:
chunkIndex: 0 --> loop k = 0, &A[0], &JA[0]
chunkIndex: 16 --> loop k = 1, &A[16], &JA[16] //[ERROR here, function reads different address]
chunkIndex: 32 --> loop k = 2, &A[32], &JA[32] //[ERROR here, function reads different address]
chunkIndex: 48 --> loop k = 3, &A[48], &JA[48] //[ERROR here, function reads different address]
When I run the code, only the first chunk executes correctly; for the other 3 chunks the memory is corrupted and the array pointers jump beyond the array bounds.
I've checked all the indices of all the parameters manually and they are correct, yet when I print the addresses they are not the ones I expect. (I've been debugging this for 3 days now.)
I used valgrind and it reported:
Invalid read of size 8 and Use of uninitialised value of size 8 at the sum += A[j] * X[JA[j]]; line
I compiled it with -g -fsanitize=address and I got
heap-buffer-overflow
I tried to access these chunks manually outside the function, and they are correct, so what can cause the heap memory to be corrupted like this ?
The code is here. This is the minimal example I can provide.
The problem was that I was using global indices (indices relative to the arrays in main) when indexing the portion of the array (chunk) passed to the function, hence the out-of-bounds accesses.
The solution is to index the sub-arrays from 0 at each function call, but that raised another problem: at each call I process TS rows, and each row has a different number of non-zeros.
As an example, see the picture of chunk 1 (sorry for my bad handwriting, it is easier to show it this way). As you can see, we need 3 indices: one for the TS rows processed per chunk (i), one because each row has a different number of non-zeros (j), and a third to index the sub-array passed in (l), which was the original problem.
and the serial_matvec function will be as following:
void serial_matvec(size_t TS, const double *A, const int *JA, const int *IA,
const double *X, double *Y) {
int l = 0;
for (int i = 0; i < TS; ++i) {
for (int j = 0; j < (IA[i + 1] - IA[i]); ++j) {
Y[i] += A[l] * X[JA[l]];
l++;
}
}
}
The complete code with a test is here. If anyone has a more elegant solution, you are more than welcome.
I'm trying to allocate memory for a double type array to use it with GNU Scientific Library.
The code I'm using for this is something like
double *x_i, *y_i, *x_e, *y_e, data[MAX_SIZE][2];
int n_i, n_e, n_data;
...
x_i = (double *)malloc(n_i * sizeof(double));
y_i = (double *)malloc(n_i * sizeof(double));
x_e = (double *)malloc(n_e * sizeof(double));
y_e = (double *)malloc(n_e * sizeof(double));
for (int i = 0; i < n_data; i++){
if (data[i][1] > 0){
x_e[i] = data[i][0];
y_e[i] = data[i][1];
}
else{
x_i[i] = data[i][0];
y_i[i] = data[i][1];
}
}
With n_i + n_e = n_data.
Apparently, sizeof(x/y_e/i) after malloc is 8, but it should be around 50*sizeof(double). The values assigned in the for loop to x_e/i[i] and y_e/i[i] are not consistent; they change as I change the order of assignment, sometimes returning -nan. The values of data[][], n_i, n_e and n_data are what I expect, and if I print x_e/i[i] and y_e/i[i] inside the for loop they look correct, but outside that loop they change.
Thank you, and sorry if this is a redundant or bad formulated question.
In addition to sizeof(x/y_e/i) actually returning sizeof(double*) (as other users have mentioned), your for-loop is incorrect:
If n_i + n_e == n_data, then you will always hit a case where attempting to access x/y_e/i[i] will take you out of bounds. To avoid this, you could keep track of x/y_e's and x/y_i's indices separately, perhaps like this:
int i_e = 0, i_i = 0;
for (int i = 0; i < n_data; i++){
if (data[i][1] > 0){
x_e[i_e] = data[i][0];
y_e[i_e] = data[i][1];
i_e++;
}
else{
x_i[i_i] = data[i][0];
y_i[i_i] = data[i][1];
i_i++;
}
}
By the end of the loop, you should find that i_e + i_i == n_data and that the values assigned to x_e/i and y_e/i are consistent.
First, sizeof(x_i), with x_i being of type double*, is the size of a pointer (probably 8 on your system), not the size of the memory allocated. There is no way to get the size of the memory block a pointer points to solely from the pointer.
Second, the "changing values" in the portions of the arrays that are not set in the loop are caused by not having initialized that memory. Reading it invokes undefined behaviour, most likely printing "garbage". To avoid this, use calloc instead of malloc: calloc zero-initializes the allocated block, and in IEEE 754 a double with all bits set to 0 represents the value 0.0:
x_i = calloc(n_i, sizeof(double));
I have a problem understanding the memory usage of the following code:
typedef struct list{
uint64_t*** entrys;
int dimension;
uint64_t len;
} list;
void init_list(list * t, uint64_t dim, uint64_t length, int amount_reg)
{
t->dimension = dim;
t->len=length;
t->entrys = (uint64_t ***) malloc(sizeof(uint64_t**)*length);
uint64_t i;
for(i=0;i<length;i++)
{
t->entrys[i] = (uint64_t **) malloc(sizeof(uint64_t *)*dim);
int j;
for(j=0;j<dim;j++)
{
t->entrys[i][j]=(uint64_t *) malloc(sizeof(uint64_t)*amount_reg);
}
}
}
int main()
{
list * table = (list *) malloc(sizeof(list));
init_list(table,3,2048*2048,2);
_getch();
}
What I want to do is allocate a 3D array of uint64_t elements, like table[4194304][3][2].
The task manager shows a memory usage of 560 MB.
If i try to calculate the memory usage on my own i can't comprehend that value.
Here is my calculation (for a x64 System):
2^20 * 8 Byte (first dimension pointers)
+ 2^20 * 3 * 8 Byte (second dimension pointers)
+ 2^20 * 3 * 2 * 8 Byte (for the values itsself)
= 2^20 * 8 Byte * 10 = 80MB
Maybe I'm totally wrong with that calculation, or my code generates a huge amount of overhead?
If so, is there a way to make this program more memory efficient?
I can't imagine that ~2^23 uint64_t values need so much memory (since 2^23 * 8 bytes is just 64 MB).
Your code does 2²² · 4 + 1 = 16777217 calls to malloc(). For each allocated memory region, malloc() does a little bookkeeping. This adds up when you do that many calls to malloc(). You can reduce the overhead by calling malloc() fewer times like this:
void init_list(list * t, int dim, uint64_t length, int amount_reg)
{
uint64_t ***entries = malloc(sizeof *entries * length);
uint64_t **seconds = malloc(sizeof *seconds * length * dim);
uint64_t *thirds = malloc(sizeof *thirds * length * dim * amount_reg);
uint64_t i, j;
t->dimension = dim;
t->len = length;
t->entrys = entries;
for (i = 0; i < length; i++) {
t->entrys[i] = seconds + dim * i;
for (j = 0; j < dim; j++)
t->entrys[i][j] = thirds + amount_reg * j + amount_reg * dim * i;
}
}
Here we call malloc() only three times, and memory usage goes down from 561272 KiB to 332020 KiB. Why is the memory usage still so high? Because you made a mistake in your computations. The allocations allocate this much memory:
entries: sizeof(uint64_t**) * length = 8 · 2²²
seconds: sizeof(uint64_t*) * length * dim = 8 · 2²² · 3
thirds: sizeof(uint64_t) * length * dim * amount_reg = 8 · 2²² · 3 · 2
All together we have (1 + 3 + 6) · 8 · 2²² = 335544320 bytes (327680 KiB or 320 MiB) of RAM which closely matches the amount of memory observed.
How can you reduce this amount further? Consider transposing your array so the axes are sorted in ascending order of size. This way you waste much less memory in pointers. You could also consider allocating space for the values only and doing index computations manually. This can speed up the code a lot (less memory accesses) and saves memory but is tedious to program.
4194304 is not 2^20, it's 2^22, so your calculation is off by at least a factor of 4. You also allocate sets of pointers that point to the other data, and those take space themselves. In your code, the first malloc allocates 2048*2048 pointers, not a single pointer to that many items.
You should also use best practice for dynamic allocation:
1) Do not cast the malloc return
2) always use expression = malloc(count * sizeof *expression); This way you can never get the sizes wrong, no matter how many pointer levels you use in the expression. E.g.
t->entrys = malloc(length * sizeof *t->entrys);
t->entrys[i] = malloc(dim * sizeof *t->entrys[i]);
t->entrys[i][j] = malloc(amount_reg * sizeof *t->entrys[i][j]);
I noticed strange (incorrect) behavior after compiling and executing a CUDA script, and was able to isolate it to the following minimal example. First I define an export-to-CSV function for integer arrays (just for debugging convenience):
#include <stdio.h>
#include <stdlib.h>
void int1DExportCSV(int *ptr, int n){
FILE *f;
f = fopen("1D IntOutput.CSV", "w");
int i = 0;
for (i = 0; i < n-1; i++){
fprintf(f, "%i,", ptr[i]);
}
fprintf(f, "%i", ptr[n-1]);
}
Then I defined a kernel function which increases a certain element of an input array by one:
__global__ void kernel(int *ptr){
int x = blockIdx.x;
int y = blockIdx.y;
int offset = x + gridDim.x * y;
ptr[offset] += 1;
}
The main function allocates an array a initialized to zeros, allocates an empty array b, and allocates a device copy of a called dev_a:
#define DIM 64
int main(void){
int *a;
a = (int*)malloc(DIM*DIM*sizeof(int));
int i;
for(i = 0; i < DIM*DIM; i++){
a[i] = 0;
}
int *b;
b = (int*)malloc(DIM*DIM*sizeof(int));
int *dev_a;
cudaMalloc( (void**)&dev_a, sizeof(int)*DIM*DIM );
cudaMemcpy( dev_a, a, DIM*DIM*sizeof(int), cudaMemcpyHostToDevice );
Then I feed dev_a into a DIM-by-DIM-by-DIM grid of blocks, each with DIM threads, copy the results back, and export them to CSV:
dim3 blocks(DIM,DIM,DIM);
kernel<<<blocks,DIM>>>(dev_a);
cudaMemcpy( b, dev_a, sizeof(int)*DIM*DIM, cudaMemcpyDeviceToHost );
cudaFree(dev_a);
int1DExportCSV(b, DIM*DIM);
}
The resulting CSV file is DIM*DIM in length, and is filled with DIM's. However, while the length is correct, it should be filled with DIM*DIM's, since I am essentially launching a DIM*DIM*DIM*DIM hypercube of threads, in which the last two dimensions are all devoted to incrementing a unique element of the device array dev_a by one.
My first reaction was to suspect that the ptr[offset] += 1 step might be the culprit, since multiple threads are potentially executing this step at the exact same time, so each thread might be updating a stale copy of ptr[offset] while unaware that a bunch of other threads are doing the same. However, I don't know enough about the "taboos of CUDA" to tell whether this is a reasonable guess.
Hardware problems are (to the best of my knowledge) not an issue; I am using a GTX560 Ti, so launching a 3-dimensional grid of blocks is allowed, and my thread count per block is 64, well below the maximum of 1024 imposed by the Fermi architecture.
Am I making a simple mistake? Or is there a subtle error in my example?
Additionally, I noticed that when I increase DIM to 256, the resulting array appears to be filled with random integers between 290 and 430! I am completely baffled by this behavior.
No, it's not safe. The threads in a block are stepping on each other.
Your threads in each threadblock are all updating the same location in memory:
ptr[offset] += 1;
offset is the same for every thread in the block:
int x = blockIdx.x;
int y = blockIdx.y;
int offset = x + gridDim.x * y;
That is a no-no. The results are undefined.
Instead use atomics:
atomicAdd(ptr+offset, 1);
or a parallel reduction method of some sort.
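Putting that together, a corrected kernel might look like the sketch below (same kernel body as the question, with only the unsafe increment replaced). With dim3 blocks(DIM,DIM,DIM) and DIM threads per block, each offset is then incremented gridDim.z * blockDim.x = DIM*DIM times, matching the result the question expected.

```cuda
__global__ void kernel(int *ptr){
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + gridDim.x * y;
    // atomicAdd serializes the read-modify-write, so no increments are lost
    atomicAdd(ptr + offset, 1);
}
```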
I'm trying to allocate a large space of contiguous memory in C and print it out to the user. My strategy is to create two pointers (one a pointer to double, one a pointer to pointer to double), malloc the pointer-to-pointer to the entire size (m * n in this case), then malloc the second one to the size of m. The last step is to iterate through m and perform pointer arithmetic so that the addresses of the doubles in the large array are contiguous. Here is my code. But when I print the addresses they don't seem to be contiguous (or in any sort of order). How do I print out the memory addresses of the doubles (all of value 0.0) correctly?
/* correct solution, with correct formatting */
/*The total number of bytes allocated was: 4
0x7fd5e1c038c0 - 1
0x7fd5e1c038c8 - 2
0x7fd5e1c038d0 - 3
0x7fd5e1c038d8 - 4*/
double **dmatrix(size_t m, size_t n);
int main(int argc, char const *argv[])
{
int m,n,i;
double ** f;
m = n = 2;
i = 0;
f = dmatrix(sizeof(m), sizeof(n));
printf("%s %d\n", "The total number of bytes allocated was: ", m * n);
for (i=0;i<n*m;++i) {
printf("%p - %d\n ", &f[i], i + 1);
}
return 0;
}
double **dmatrix(size_t m, size_t n) {
double ** ptr1 = (double **)malloc(sizeof(double *) * m * n);
double * ptr2 = (double *)malloc(sizeof(double) * m);
int i;
for (i = 0; i < n; i++){
ptr1[i] = ptr2+m*i;
}
return ptr1;
}
Remember that memory is just memory. It sounds trite, but so many people seem to think of memory allocation and memory management in C as some magic voodoo. It isn't. At the end of the day you allocate whatever memory you need, and free it when you're done.
So start with the most basic question: If you had a need for 'n' double values, how would you allocate them?
double *d1d = calloc(n, sizeof(double));
// ... use d1d like an array (d1d[0] = 100.00, etc. ...
free(d1d);
Simple enough. Next question, in two parts, where the first part has nothing to do with memory allocation (yet):
How many double values are in a 2D array that is m*n in size?
How can we allocate enough memory to hold them all.
Answers:
There are m*n doubles in a m*n 2D-matrix of doubles
Allocate enough memory to hold (m*n) doubles.
Seems simple enough:
size_t m=10;
size_t n=20;
double *d2d = calloc(m*n, sizeof(double));
But how do we access the actual elements? A little math is in order. Knowing m and n, you can simply do this:
size_t i = 3; // value you want in the major index (0..(m-1)).
size_t j = 4; // value you want in the minor index (0..(n-1)).
d2d[i*n+j] = 100.0;
Is there a simpler way to do this? In standard C, yes; in C++, no. Standard C supports a very handy capability that generates the proper code to declare dynamically-sized indexable arrays:
size_t m=10;
size_t n=20;
double (*d2d)[n] = calloc(m, sizeof(*d2d));
Can't stress this enough: Standard C supports this, C++ does NOT. If you're using C++ you may want to write an object class to do this all for you anyway, so it won't be mentioned beyond that.
So what does the above actually do? Well, first, it should be obvious we are still allocating the same amount of memory we were allocating before. That is, m*n elements, each sizeof(double) large. But you're probably asking yourself, "What is with that variable declaration?" That needs a little explaining.
There is a clear and present difference between this:
double *ptrs[n]; // declares an array of `n` pointers to doubles.
and this:
double (*ptr)[n]; // declares a pointer to an array of `n` doubles.
The compiler is now aware of how wide each row is (n doubles in each row), so we can now reference elements in the array using two indexes:
size_t m=10;
size_t n=20;
double (*d2d)[n] = calloc(m, sizeof(*d2d));
d2d[2][5] = 100.0; // does the 2*n+5 math for you.
free(d2d);
Can we extend this to 3D? Of course, the math starts looking a little weird, but it is still just offset calculations into a big'ol'block'o'ram. First the "do-your-own-math" way, indexing with [i,j,k]:
size_t l=10;
size_t m=20;
size_t n=30;
double *d3d = calloc(l*m*n, sizeof(double));
size_t i=3;
size_t j=4;
size_t k=5;
d3d[i*m*n + j*n + k] = 100.0;
free(d3d);
You need to stare at the math for a minute to really gel on how it computes where the double value in that big block of RAM actually is. Using the above dimensions and desired indexes, the "raw" index is:
i*m*n = 3*20*30 = 1800
j*n = 4*30 = 120
k = 5 = 5
======================
i*m*n+j*n+k = 1925
So we're hitting the 1925th element in that big linear block. Let's do another. What about [0,1,2]?
i*m*n = 0*20*30 = 0
j*n = 1*30 = 30
k = 2 = 2
======================
i*m*n+j*n+k = 32
I.e. the 32nd element in the linear array.
It should be obvious by now that so long as you stay within the self-prescribed bounds of your array, i:[0..(l-1)], j:[0..(m-1)], and k:[0..(n-1)] any valid index trio will locate a unique value in the linear array that no other valid trio will also locate.
Finally, we use the same array pointer declaration like we did before with a 2D array, but extend it to 3D:
size_t l=10;
size_t m=20;
size_t n=30;
double (*d3d)[m][n] = calloc(l, sizeof(*d3d));
d3d[3][4][5] = 100.0;
free(d3d);
Again, all this really does is the same math we were doing before by hand, but letting the compiler do it for us.
I realize it may be a bit much to wrap your head around, but it is important. If it is paramount that you have contiguous-memory matrices (like feeding a matrix to a graphics rendering library like OpenGL, etc.), you can do it relatively painlessly using the above techniques.
Finally, you might wonder why anyone would do the whole pointer-arrays-to-pointer-arrays-to-values thing in the first place if you can do it like this. There are a lot of reasons. Suppose you're replacing rows: swapping a pointer is easy; copying an entire row is expensive. Suppose you're replacing an entire table dimension (m*n) in your 3D array (l*m*n): even more so, swapping a pointer is easy; copying an entire m*n table is expensive. And the not-so-obvious answer: what if the row widths need to be independent from row to row (i.e. row0 can be 5 elements, row1 can be 6 elements)? A fixed l*m*n allocation simply doesn't work then.
Best of luck.
Never mind, I figured it out.
/* The total number of bytes allocated was: 8
0x7fb35ac038c0 - 1
0x7fb35ac038c8 - 2
0x7fb35ac038d0 - 3
0x7fb35ac038d8 - 4
0x7fb35ac038e0 - 5
0x7fb35ac038e8 - 6
0x7fb35ac038f0 - 7
0x7fb35ac038f8 - 8 */
double ***d3darr(size_t l, size_t m, size_t n);
int main(int argc, char const *argv[])
{
size_t m, n, l, i;
double *** f;
m = n = l = 10;
f = d3darr(l, m, n);
printf("%s %zu\n", "The total number of bytes allocated was: ", m * n * l * sizeof(double));
for (i = 0; i < n * m * l; ++i) {
printf("%p - %zu\n", (void *)&f[i / (m * n)][(i / n) % m][i % n], i + 1);
}
return 0;
}
double ***d3darr(size_t l, size_t m, size_t n){
double *** ptr1 = malloc(sizeof(double **) * l);
double ** ptr2 = malloc(sizeof(double *) * l * m);
double * ptr3 = malloc(sizeof(double) * l * m * n);
size_t i, j;
for (i = 0; i < l; ++i) {
ptr1[i] = ptr2 + m * i;
for (j = 0; j < m; ++j){
ptr1[i][j] = ptr3 + n * (m * i + j);
}
}
return ptr1;
}