I referred to this, this and this SO post before asking this question:
While teaching dynamic memory allocation to a friend, I wrote a simple program, whose snippet is below:
matrix = (int**) malloc (sizeof(int*) * m);
for (i = 0; i < m; ++i)
matrix[i] = (int*) malloc (sizeof(int) * n);
for (i = 0; i < m; ++i)
for (j = 0; j < n; ++j)
matrix[i][j] = rand() % 100; /*some random value*/
for (i = 0; i < m; ++i)
{
for (j = 0; j < n; ++j)
{
printf("(%8u)%-5d", &matrix[i][j], matrix[i][j]);
}
/* Print the element just after the end of the current row */
printf("(%8u)%-5d", matrix[i] + n, *(matrix[i] + n));
/* Print the second element after the end of the current row */
printf("(%8u)%-5d", matrix[i] + n + 1, *(matrix[i] + n + 1));
}
When I run this program as
unmesh@unmesh-laptop:~/teaching/C/Day3$ ./matrix
Enter number of rows: 3
Enter number of columns: 3
(141189144)1 (141189148)2 (141189152)3 **(141189156)17** (141189160)2
(141189160)2 (141189164)3 (141189168)4 **(141189172)17** (141189176)3
(141189176)3 (141189180)4 (141189184)5 (141189188)135105(141189192)0
I am curious about the value 17. If I am not mistaken, there are three calls to malloc for the rows in this invocation, so the memory may not be contiguous, as can be observed.
But when I run the program with m=n=4 or m=n=5, I see the value 25 after each row.
For m=n=6 and m=n=7, value 33 is observed.
More interesting is the fact that when n is odd (n=3, n=5, ...) this value is stored immediately after the row ends.
example row when m=n=3
Values 1 2 3 17
Address 100 104 108 112
Next row starts from 116
When n is even, it is stored after 4 bytes
example row when m=n=2
Values 1 2 0 17
Address 100 104 108 112
Next row starts from 116
The question is: where do these values 17/25/33 come from?
Why are they always the same? I hope they are not garbage and have some meaning.
I fail to deduce it. Please help.
You're seeing the internal bookkeeping information malloc stores to keep track of blocks that have been allocated. The precise nature and size of this information varies from system to system, but it is often the case that malloc rounds up the size requested so that, when combined with its bookkeeping info, the resulting block is a multiple of the largest alignment generally required for your machine. In your case, it looks like the alignment is 8 bytes (two ints), leading to the even/odd behavior you see. The 17/25/33 values you see are likely the sizes of the allocated blocks (including the padding and size info) with the lowest bit set to indicate an in-use block.
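The arithmetic behind that explanation can be sketched in a few lines of C. This is only a model under stated assumptions, not the code of any particular allocator: it assumes a 4-byte size field, 8-byte alignment, a 16-byte minimum block and 4-byte ints, which is what the numbers in the question suggest. All function and macro names below are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout, inferred from the question's output (not guaranteed by C). */
#define SIZE_FIELD ((size_t)4)  /* bytes used for the bookkeeping size field */
#define ALIGNMENT  ((size_t)8)  /* blocks are rounded up to a multiple of 8  */
#define MIN_BLOCK  ((size_t)16) /* smallest block the model hands out        */
#define FLAG_BIT   ((size_t)1)  /* low bit used as a flag, as described      */

/* Block size the model allocator reserves for a request of `request` bytes. */
size_t model_block_size(size_t request)
{
    size_t block = request + SIZE_FIELD;                     /* add bookkeeping */
    block = (block + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT; /* round up to 8   */
    return block < MIN_BLOCK ? MIN_BLOCK : block;
}

/* Value found in the size field: the block size with the flag bit set. */
size_t model_size_field(size_t request)
{
    return model_block_size(request) | FLAG_BIT;
}

/* Byte offset from the start of a row to the size field that follows it. */
size_t model_field_offset(size_t request)
{
    return model_block_size(request) - SIZE_FIELD;
}

/* Compares the model with the values reported in the question. */
void check_observed_values(void)
{
    assert(model_size_field(2 * 4) == 17); /* n = 2 */
    assert(model_size_field(3 * 4) == 17); /* n = 3 */
    assert(model_size_field(4 * 4) == 25); /* n = 4 */
    assert(model_size_field(5 * 4) == 25); /* n = 5 */
    assert(model_size_field(6 * 4) == 33); /* n = 6 */
    assert(model_size_field(7 * 4) == 33); /* n = 7 */
}
```

Under these assumptions model_field_offset(3 * 4) is 12, i.e. the value sits directly after a 3-int row, while model_field_offset(2 * 4) is also 12, which leaves 4 bytes of padding after a 2-int row. That matches the odd/even pattern in the question.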
I am performing Compressed Sparse Row matrix-vector multiplication (CSR SpMV). This involves dividing the array A into multiple chunks and passing each chunk by reference to a function. However, only the first chunk (A[0], starting at the beginning of the array) is handled correctly. From the second iteration on (A[0 + chunkIndex]), when the function reads the sub-array it jumps to an address beyond the total array address range, although the indices are correct.
For reference:
The SPMV kernel is:
void serial_matvec(size_t TS, double *A, int *JA, int *IA, double *X, double *Y)
{
double sum;
for (int i = 0; i < TS; ++i)
{
sum = 0.0;
for (int j = IA[i]; j < IA[i + 1]; ++j)
{
sum += A[j] * X[JA[j]]; // the error is here: the function reads different
// addresses of A and JA, so the access
// is out of bounds
}
Y[i] = sum;
}
}
and it is called this way:
int chunkIndex = 0;
for(size_t k = 0; k < rows/TS; ++k)
{
chunkIndex = IA[k * TS];
serial_matvec(TS, &A[chunkIndex], &JA[chunkIndex], &IA[k*TS], &X[0], &Y[k*TS]);
}
Assume I process an 8x8 matrix with 2 rows per chunk, so the loop over k runs rows/TS = 4 times. The chunkIndex and the arrays passed to the function are as follows:
chunkIndex: 0 --> loop k = 0, &A[0], &JA[0]
chunkIndex: --> loop k = 1, &A[16], &JA[16] //[ERROR here, function reads different address]
chunkIndex: --> loop k = 2, &A[32], &JA[32] //[ERROR here, function reads different address]
chunkIndex: --> loop k = 3, &A[48], &JA[48] //[ERROR here, function reads different address]
When I run the code, only the first chunk executes correctly; for the other 3 chunks the memory appears corrupted and the array pointers jump beyond the bounds of the arrays.
I've checked the indices of all the parameters manually and they are all correct; however, when I print the addresses they are not the same. (I have been debugging this for 3 days now.)
I used valgrind and it reported:
Invalid read of size 8 and Use of uninitialised value of size 8 at the sum += A[j] * X[JA[j]]; line
I compiled it with -g -fsanitize=address and I got
heap-buffer-overflow
I tried to access these chunks manually outside the function, and they are correct, so what can cause the heap memory to be corrupted like this ?
The code is here, This is the minimum I can do.
The problem was that I was using global indices (the indices used in main) to index the portion of the array (chunk) passed to the function, hence the out-of-bounds accesses.
The solution is to start indexing the sub-arrays from 0 in each function call, but that raised another problem: each call processes TS rows, and each row has a different number of non-zeros.
As an example, see the picture of chunk 1 (sorry for my bad handwriting, it is easier this way). As you can see, we need three indices: i for the TS rows processed per chunk, j because each row has a different number of non-zeros, and l to index the sub-array passed in, which was the original problem.
The serial_matvec function then becomes:
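void serial_matvec(size_t TS, const double *A, const int *JA, const int *IA,
                   const double *X, double *Y) {
  int l = 0; // running index into the chunk-local A and JA, starts at 0
  for (size_t i = 0; i < TS; ++i) {
    double sum = 0.0; // accumulate locally so Y need not be zeroed beforehand
    // IA[i + 1] - IA[i] is the number of non-zeros in row i of this chunk
    for (int j = 0; j < (IA[i + 1] - IA[i]); ++j) {
      sum += A[l] * X[JA[l]];
      l++;
    }
    Y[i] = sum;
  }
}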
The complete code with a test is here. If anyone has a more elegant solution, you are more than welcome.
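One possible way to avoid the extra counter l, offered only as a sketch: subtract the chunk's first row pointer IA[0] from every row pointer, which turns the global offsets back into chunk-local ones. The names chunk_matvec and chunked_spmv are hypothetical, and the driver assumes that rows is a multiple of TS, as in the calling loop from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant of serial_matvec. A and JA point at the first
   non-zero of the chunk, IA points at the chunk's first row pointer. */
void chunk_matvec(size_t TS, const double *A, const int *JA, const int *IA,
                  const double *X, double *Y)
{
    const int base = IA[0]; /* global offset of this chunk's first non-zero */
    for (size_t i = 0; i < TS; ++i) {
        double sum = 0.0;
        /* shift the global row bounds so they index the chunk-local arrays */
        for (int j = IA[i] - base; j < IA[i + 1] - base; ++j)
            sum += A[j] * X[JA[j]];
        Y[i] = sum;
    }
}

/* Hypothetical driver: multiplies the whole matrix TS rows at a time. */
void chunked_spmv(size_t rows, size_t TS, const double *A, const int *JA,
                  const int *IA, const double *X, double *Y)
{
    assert(TS > 0 && rows % TS == 0); /* assumption: chunks divide evenly */
    for (size_t k = 0; k < rows / TS; ++k) {
        const int chunkIndex = IA[k * TS];
        chunk_matvec(TS, &A[chunkIndex], &JA[chunkIndex], &IA[k * TS],
                     X, &Y[k * TS]);
    }
}
```

X is still indexed with the global column numbers stored in JA, so it is passed whole; only A, JA, IA and Y are offset per chunk.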
I am new to GPU programming (and rather rusty in C) so this might be a rather basic question with an obvious bug in my code. What I am trying to do is take a 2 dimensional array and find the sum of each row (that is, sum across the columns of every row). So if I have a 2D array that contains:
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 2 4 6 8 10 12 14 16 18
I want to get an array that contains the following out:
45
45
90
The code I have so far is not returning the correct output and I'm not sure why. I'm guessing it is because I am not handling the indexing in the kernel properly. But it could be that I am not using the memory correctly since I adapted this from an over-simplified 1 dimensional example and the CUDA Programming Guide (section 3.2.2) makes a rather big and not very well described jump for a beginner between 1 and 2 dimensional arrays.
My incorrect attempt:
#include <stdio.h>
#include <stdlib.h>
// start with a small array to test
#define ROW 3
#define COL 10
__global__ void collapse( int *a, int *c){
/*
Sum along the columns for each row of the 2D array.
*/
int total = 0;
// Loop to get total, seems wrong for GPUs but I don't know a better way
for (int i=0; i < COL; i++){
total = total + a[threadIdx.y + i];
}
c[threadIdx.x] = total;
}
int main( void ){
int array[ROW][COL]; // host copies of a, c
int c[ROW];
int *dev_a; // device copies of a, c (just pointers)
int *dev_c;
// get the size of the arrays I will need
int size_2d = ROW * COL * sizeof(int);
int size_c = ROW * sizeof(int);
// Allocate the memory
cudaMalloc( (void**)&dev_a, size_2d);
cudaMalloc( (void**)&dev_c, size_c);
// Populate the 2D array on host with something small and known as a test
for (int i=0; i < ROW; i++){
if (i == ROW - 1){
for (int j=0; j < COL; j++){
array[i][j] = (j*2);
printf("%i ", array[i][j]);
}
} else {
for (int j=0; j < COL; j++){
array[i][j] = j;
printf("%i ", array[i][j]);
}
}
printf("\n");
}
// Copy the memory
cudaMemcpy( dev_a, array, size_2d, cudaMemcpyHostToDevice );
cudaMemcpy( dev_c, c, size_c, cudaMemcpyHostToDevice );
// Run the kernel function
collapse<<< ROW, COL >>>(dev_a, dev_c);
// copy the output back to the host
cudaMemcpy( c, dev_c, size_c, cudaMemcpyDeviceToHost );
// Print the output
printf("\n");
for (int i = 0; i < ROW; i++){
printf("%i\n", c[i]);
}
// Release the memory
cudaFree( dev_a );
cudaFree( dev_c );
}
Output:
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 2 4 6 8 10 12 14 16 18
45
45
45
You are correct, it's an indexing issue. Your kernel will generate a correct answer if you replace this:
total = total + a[threadIdx.y + i];
with this:
total = total + a[blockIdx.x*COL + i];
and this:
c[threadIdx.x] = total;
with this:
c[blockIdx.x] = total;
However there's more to say than that.
Any time you're having trouble with a CUDA code, you should use proper CUDA error checking. The second issue above was definitely resulting in a memory access error, and you may have gotten a hint of this with error checking. You should also run your codes with cuda-memcheck (compute-sanitizer in newer CUDA toolkits), which will do an extra-tight job of bounds checking, and it would definitely catch the out-of-bounds access your kernel was making.
I think you may be confused with kernel launch syntax: <<<ROW, COL>>> You may be thinking that this maps into 2D thread coordinates (I'm just guessing, since you used threadIdx.y in a kernel where it has no meaning.) However the first parameter is the number of blocks to be launched, and the second is the number of threads per block. If you provide scalar quantities (as you have) for both of these, you will be launching a 1D grid of 1D threadblocks, and your .y variables won't really be meaningful (for indexing). So one takeaway is that threadIdx.y doesn't do anything useful in this setup (it is always zero).
To fix that, we could make the first change listed at the beginning of this answer. Note that when we launch 3 blocks, each block will have a unique blockIdx.x so we can use that for indexing, and we have to multiply that by the "width" of your array to generate proper indexing.
Since the second parameter is the number of threads per block, your indexing into C also didn't make sense. C only has 3 elements (which is sensible) but each block had 10 threads, and in each block the threads were trying to index into the "first 10" locations in C (each thread in a block has a unique value for threadIdx.x). But after the first 3 locations, there is no extra storage in C.
Now possibly the biggest issue. Each thread in a block is doing exactly the same thing in the loop. Your code does not differentiate behavior of threads. You can write code that gives the correct answer this way, but it's not sensible from a performance standpoint.
To fix this last issue, the canonical answer is to use a parallel reduction. That's an involved topic, and there are many questions about it here on the SO tag, so I'll not try to cover it, but point out to you that there is a good tutorial here along with the accompanying CUDA sample code that you can study. If you want to see a parallel reduction on the matrix rows, for example, you could look at this question/answer. It happens to be performing a max-reduction instead of a sum-reduction, but the differences are minor. You can also use an atomic method as suggested in the other answer, but that is generally not considered a "high-performance" approach, because the throughput of atomic operations is more limited than what is achievable with the ordinary CUDA memory bandwidth.
You also seem to be generally confused about the CUDA kernel execution model, so continued reading of the programming guide (that you've already linked) is a good starting point.
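To make the corrected indexing concrete, here is a serial C stand-in for the fixed kernel. It is a sketch only and not CUDA code: the loop variable block plays the role of blockIdx.x, the matrix is assumed to be stored row-major in one flat array, and the function name collapse_rows is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Serial stand-in for the corrected kernel: one outer iteration per block. */
void collapse_rows(const int *a, int *c, size_t rows, size_t cols)
{
    assert(a != NULL && c != NULL);
    for (size_t block = 0; block < rows; ++block) { /* block ~ blockIdx.x */
        int total = 0;
        for (size_t i = 0; i < cols; ++i)
            total += a[block * cols + i]; /* row offset = block * width */
        c[block] = total;                 /* one result per row */
    }
}
```

With the 3 x 10 test matrix from the question, this fills c with 45, 45 and 90, which is the output the corrected kernel should produce.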
I have a problem that I can not figure out (it is probably an easy solution but I can not see it).
The thing is, I have a program that generates all the possible combinations of numbers. The program asks for the size of the set and the size of the subsets and generates all the possible combinations accordingly. So far so good ... now ...
I want to write some routines that check certain conditions in order to eliminate some of those combinations. One of those routines checks the array to exclude the combinations that contain more than a given number of consecutive numbers; for this the program asks for the maximum number of consecutive numbers allowed. For example:
Size of the set ? : 10 (stores in n)
size of the subset?: 10 (stores in k)
maximum of seq num: 10 (stores in maxp)
The array is called comb[] (integers); it is initialized as
for (i = 0; i < k; i++)
comb[i] = i;
but I have trouble with the routine that excludes certain combinations. The routine is
int todel (int comb[], int k)
{
int i, j, seq;
for (i = 0, seq = 0; (i+maxp) < k; i++)
{
int j = 0;
for (j = i; j < maxp; j++)
{
fprintf(stderr, "checkin comb %d with comb %d\n", j, j+1);
if (comb[j] == (comb[j+1] - 1))
{
seq++;
}
if (seq >= maxp) return 1;
}
}
return 0;
}
If I have a set of 10, a subset of 10 and a max allowed of 10, the program does not need to exclude anything.
But for a set of 10, a subset of 9 and a max allowed of 1, the program should exclude all 10 combinations. As it is, the program allows the following combination:
0,2,3,4,5,6,7,8,9
and it should exclude it: 0,2 does not match the criterion, but 2,3 and all the following pairs do.
Another thing is that if I set the maximum allowed to 0, it takes all combinations as valid instead of none.
I know the fix should not be very hard and I am missing something really dumb.
I hope some insight from you (probably insults too).
Thank you !
Running into a strange scenario where malloc seems to allocate more memory than I ask for:
void function (int array [], int numberOfElements) {
int *secondArray = malloc(sizeof(int) * numberOfElements/2);
for (int i = 0; i < numberOfElements / 2; i++) {
secondArray[i] = array[i];
}
}
Let's say array holds some 10 numbers. When I print out secondArray after the above code, I get:
So first of all, the array should have 5 elements. But second, why the 0's at the end? I'm mallocing space for only 10/2 = 5 ints.
EDIT:
printing code:
for (int d = 0; d < numberOfElements; d++) {
printf("%i ", secondArray[d]);
}
hmm I might have just answered my own question here, I'm guessing it's the printing beyond secondArray that shows 0, not the array itself.
-
Actually, the problem is that I was also not doing this:
secondArray[numberOfElements] = '\0';
That is why it was printing beyond.
malloc is actually allocating exactly the right amount.
However, you're accessing memory beyond the allocation.
What exists there is completely undefined and could really be anything.
In your case, it was one "junk" number and four zeroes.
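A minimal sketch of keeping the copy and the print loop within the same bound. The helper names copy_first_half and print_ints, and the outCount parameter, are hypothetical and not part of the original code; the point is only that the element count used for malloc is the same count used when reading the array back.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: copies the first half of `array` into a new buffer
   and reports the number of elements through `outCount`. Caller frees. */
int *copy_first_half(const int array[], int numberOfElements, int *outCount)
{
    assert(numberOfElements >= 0 && outCount != NULL);
    int count = numberOfElements / 2;
    int *secondArray = malloc(sizeof *secondArray * (size_t)count);
    if (secondArray == NULL) { /* allocation failed, or count was 0 */
        *outCount = 0;
        return NULL;
    }
    for (int i = 0; i < count; i++)
        secondArray[i] = array[i];
    *outCount = count;
    return secondArray;
}

/* Prints exactly `count` elements, never more than were allocated. */
void print_ints(const int *values, int count)
{
    for (int d = 0; d < count; d++)
        printf("%i ", values[d]);
    printf("\n");
}
```

Calling print_ints with the count returned through outCount (5 for a 10-element input) stays inside the allocation, so no stray values appear.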
You are just lucky. malloc can, and sometimes does, request more memory from the OS than you asked for, because the OS hands out memory in pages. Sometimes it does not even need to ask the OS, because it requested extra memory earlier. So malloc may have obtained a whole page of memory, more than enough to satisfy your request, and the extra memory happens to be filled with zeros.
You are in the land of undefined behaviour. So all bets are off.
/** It prints 0 0 0 0 because C does no array bounds checking. If you
 * allocate room for 5 elements but loop up to numberOfElements, e.g.
 * for(i = 0; i < numberOfElements; i++), the loop also reads the
 * positions after the 5 you stored, and those happen to contain 0 here.
 */
int *secondArray = malloc(sizeof(int) * numberOfElements/2); // whether you allocate this much
int *secondArray = malloc(sizeof(int));
// ^^^ or only this much, reading past the end is not checked and may still print something
Code:
#include<stdio.h>
int main(void)
{
int i, j;
for(j = i+1, i=1; i<=5; j++, i++)
printf("%d %d\n", i, j);
return 0;
}
Output:
1 66
2 67
3 68
4 69
5 70
Can anyone explain the output of this code?
i is uninitialized when you set j=i+1. So j (initially) could be almost anything.
In your code, i and j are not initialized at the time of declaration.
In the for loop you assign j = i + 1 first, so j gets a garbage value, whereas i is then assigned 1.
The loop then increments i and j and prints their values: i goes from 1 to 5, and j counts up from its initial garbage value (66 in your output) in step with i, reaching 70 in the last line printed.
Edit On the basis of comments:
If you don't assign an initial value upon declaration, the variable occupies a memory location that may still contain whatever was stored there last.
The runtime does not clear that memory when the variable is created (this keeps performance high), so the initial value of the variable is garbage.
j is assigned i + 1 before i is assigned 1, so at that point i holds an arbitrary value. In the above case i happened to hold 65, which made j start at 66. This arbitrary value can differ between systems and between runs.
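A sketch of the loop with the two initializers swapped, so that i is assigned before it is read. The comma operator evaluates its operands left to right, which is why the order matters. The helper name fill_pairs and its array parameters are hypothetical, added only so the values can be inspected as well as printed.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: prints the pairs and also records them in the arrays.
   Returns the number of pairs written. */
int fill_pairs(int first[], int second[], int count)
{
    assert(count >= 0);
    int i, j, n = 0;
    /* i = 1 runs first, so j = i + 1 reads a defined value and starts at 2 */
    for (i = 1, j = i + 1; i <= count; i++, j++) {
        printf("%d %d\n", i, j);
        first[n] = i;
        second[n] = j;
        n++;
    }
    return n;
}
```

With count set to 5 the loop prints the pairs 1 2 through 5 6 on every run, instead of starting j from whatever happened to be in memory.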