Core dump while iterating through arrays

I'm having problems with the following code. newrows is a parameter passed directly to the function I'm working in; elements is computed a bit earlier from another parameter. For some combinations of newrows and elements I get a core dump, while other combinations work fine. When the core dump occurs, it usually happens after 20,000 to 25,000 iterations; in the cases that work, the loop runs up to 40,000 iterations.
int32_t newimage[newrows][elements][3];
int32_t pixelcounter[newrows][elements];
//int32_t norm, angle, rohmax;
//double r, alpha, beta, m, mu;

//initialize arrays
for(i = 0; i < newrows; i++){
    for(j = 0; j < elements; j++){
        pixelcounter[i][j] = 0;
        newimage[i][j][0] = 0;
        newimage[i][j][1] = 0;
        newimage[i][j][2] = 0;
    }
}
combination that works fine: newrows = 200, elements = 188
combination that leads to a core dump: newrows = 200, elements = 376
I am using Linux, btw :-)

This is most likely a stack-space issue. Note that newimage and pixelcounter are allocated in the stack frame of whatever function they are declared in, and you can quickly run out of stack trying to allocate that much data. Your 3-D array newimage alone grows as
#bytes = newrows * elements * 3 * sizeof(int32_t) = newrows * elements * 12
I cleaned up your program (a good piece of advice is to try and present programs that compile, so people can help you quicker!):
#include <stdio.h>
#include <stdint.h>

void test(size_t newrows, size_t elements) {
    int32_t newimage[newrows][elements][3];
    int32_t pixelcounter[newrows][elements];

    //initialize arrays
    for (size_t i = 0; i < newrows; i++) {
        for (size_t j = 0; j < elements; j++) {
            pixelcounter[i][j] = 0;
            newimage[i][j][0] = 0;
            newimage[i][j][1] = 0;
            newimage[i][j][2] = 0;
        }
    }
}

int main(void) {
    printf("Size of integer = %zu\n", sizeof(int));
    for (size_t i = 700; ; i += 10) {
        printf("Testing (%zu, %zu)\n", i, i);
        test(i, i);
    }
    return 0;
}
And running this, I see:
Size of integer = 4
Testing (700, 700)
Testing (710, 710)
Testing (720, 720)
Testing (730, 730)
[3] 13482 segmentation fault (core dumped) ./a.out
So the crash happens somewhere between n = 720 and n = 730. Counting both arrays, the function needs n^2 * 3 * 4 bytes for newimage plus n^2 * 4 bytes for pixelcounter, i.e. 16 * n^2 bytes in total: about 7.9 MiB at n = 720 and about 8.1 MiB at n = 730, right around the typical default stack limit of 8 MiB (see ulimit -s) on my 64-bit Linux computer. It might be different on your computer.
The solution in this case is to allocate your arrays on the heap, where you will have a lot more memory to work with. More information about heap-allocating multidimensional arrays can be found in How does C allocate space for a 2D (3D...) array when using malloc?.
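As a minimal sketch (mine, not from the original answer), C99 pointers-to-VLA move the data to the heap while keeping the same indexing syntax:

int32_t (*newimage)[elements][3] = malloc(sizeof(int32_t[newrows][elements][3]));
int32_t (*pixelcounter)[elements] = malloc(sizeof(int32_t[newrows][elements]));
if (newimage == NULL || pixelcounter == NULL) {
    /* handle the allocation failure instead of crashing */
}
/* ... newimage[i][j][k] and pixelcounter[i][j] work exactly as before ... */
free(pixelcounter);
free(newimage);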

Related

Why do I have a Run-Time Check Failure #2 in C when I have enough space and there isn't much data in the array?

I'm writing this code in C for some offline games, but when I run it, it says "Run-Time Check Failure #2" and "stack around the variable was corrupted". I searched the internet and saw some answers, but I think there's nothing wrong with my code.
#include <stdio.h>

int main(void) {
    int a[16];
    int player = 32;
    for (int i = 0; i < sizeof(a); i++) {
        if (player + 1 == i) {
            a[i] = 254;
        }
        else {
            a[i] = 32;
        }
    }
    printf("%d", a[15]);
    return 0;
}
Your loop runs from 0 to sizeof(a), and sizeof(a) is the size in bytes of your array.
Each int is (typically) 4 bytes, so the total size of the array is 64 bytes, and the variable i goes from 0 to 63.
But the valid indices of the array are only 0-15, because the array was declared [16].
The standard way to iterate over an array like this is:
#define count_of_array(x) (sizeof(x) / sizeof(*x))
for (int i = 0; i < count_of_array(a); i++) { ... }
The count_of_array macro calculates the number of elements in the array by taking the total size of the array, and dividing by the size of one element.
In your example, it would be (64 / 4) == 16.
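One caveat worth adding (my note, not part of the original answer): count_of_array only works while a is a true array. Once the array is passed to a function it decays to a pointer, and sizeof then measures the pointer itself:

void f(int *p) {
    /* sizeof(p) / sizeof(*p) is sizeof(int*) / sizeof(int),
       e.g. 8 / 4 = 2 on a typical 64-bit machine, not the
       caller's element count. Pass the length as a separate
       parameter instead. */
}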
sizeof(a) is not the number of elements in a, but rather how many bytes a occupies.
a has 16 ints. The size of int depends on the implementation: a lot of C implementations make int 4 bytes, but some make it 2 bytes. So sizeof(a) == 64 or sizeof(a) == 32. Either way, that's not what you want.
You declared int a[16];, so the element count of a is 16.
So, change your for loop into:
for (int i = 0; i < 16; i++)
You're indexing past the end of the array, trying to touch memory that doesn't belong to your program. sizeof(a) returns 64 (depending on the C implementation, actually), which is the total number of bytes your int array occupies.
There are good reasons for not hard-coding the number of iterations when looping over an array.
For example, you might realloc the memory (if the array was allocated with malloc) to grow or shrink the array, making it harder to keep track of its size at any given point. Or maybe the size of the array depends on user input. Or something else altogether.
There's no good reason to avoid saying for (int i = 0; i < 16; i++) in this particular case, though. What I would do is declare const int foo = 16; and then use foo instead of the magic number, both in the array declaration and in the for loop, so that if you ever need to change the size, you only change it in one place. Otherwise, if you really do want sizeof() (perhaps for one of the reasons above), divide sizeof(array) by the size of one element, sizeof(type of the array). For example:
#include <stdio.h>

const int ARRAY_SIZE = 30;

int main(void)
{
    int a[ARRAY_SIZE];
    for (int i = 0; i < sizeof(a) / sizeof(int); i++)
        a[i] = 100;
    // I'd use for(int i = 0; i < ARRAY_SIZE; i++) though
}

What causes my array address to be corrupted (change) when passed to function?

I am performing compressed sparse row matrix-vector multiplication (CSR SpMV). This involves dividing the array A into multiple chunks and passing each chunk by reference to a function. However, only the first chunk (A[0], starting at the beginning of the array) is processed correctly. From the second chunk on (A[0 + chunkIndex]), the function reads from a different address, beyond the range of the array, even though the indices are correct.
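(For readers who don't have the CSR layout in mind, a tiny illustrative example of the three arrays; this is my illustration, not the poster's:)

/* CSR stores only the non-zeros of a matrix. For
 *     [10  0  0]
 *     [ 0 20 30]
 *     [ 0  0 40]
 * the arrays are:
 *     double A[]  = {10, 20, 30, 40}; // non-zero values
 *     int    JA[] = {0, 1, 2, 2};     // column index of each value
 *     int    IA[] = {0, 1, 3, 4};     // row i owns A[IA[i]] .. A[IA[i+1]-1]
 */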
For reference:
The SPMV kernel is:
void serial_matvec(size_t TS, double *A, int *JA, int *IA, double *X, double *Y)
{
    double sum;
    for (int i = 0; i < TS; ++i)
    {
        sum = 0.0;
        for (int j = IA[i]; j < IA[i + 1]; ++j)
        {
            sum += A[j] * X[JA[j]]; // the error is here: the function reads a different
                                    // address of A and JA, so the access
                                    // is out of bounds
        }
        Y[i] = sum;
    }
}
and it is called this way:
int chunkIndex = 0;
for (size_t k = 0; k < rows/TS; ++k)
{
    chunkIndex = IA[k * TS];
    serial_matvec(TS, &A[chunkIndex], &JA[chunkIndex], &IA[k*TS], &X[0], &Y[k*TS]);
}
Assume I process an (8x8) matrix with 2 rows per chunk, so loop k makes rows/TS = 4 iterations, and the chunkIndex and arrays passed to the function are as follows:
chunkIndex: 0  --> loop k = 0, &A[0],  &JA[0]
chunkIndex: 16 --> loop k = 1, &A[16], &JA[16] // [ERROR here, function reads a different address]
chunkIndex: 32 --> loop k = 2, &A[32], &JA[32] // [ERROR here, function reads a different address]
chunkIndex: 48 --> loop k = 3, &A[48], &JA[48] // [ERROR here, function reads a different address]
When I run the code, only the first chunk executes correctly; for the other 3 chunks the array pointers jump beyond the bounds of the arrays.
I've checked all the indices of all the parameters manually and they are correct, yet when I print the addresses they are not the same (I've been debugging this for 3 days now).
I used valgrind and it reported:
Invalid read of size 8 and Use of uninitialised value of size 8 at the sum += A[j] * X[JA[j]]; line
I compiled it with -g -fsanitize=address and I got
heap-buffer-overflow
I tried to access these chunks manually outside the function, and they are correct, so what can cause the heap memory to be corrupted like this?
The code is here; this is the minimum I can reduce it to.
The problem was that I was using global indices (indices relative to main) when indexing the portion of the array (the chunk) passed to the function, hence the out-of-bounds accesses.
The solution is to index the sub-arrays from 0 at each function call, but that raised another problem: each call processes TS rows, and each row has a different number of non-zeros.
As an example, see the picture of chunk 1 (sorry for my bad handwriting, it is easier this way). As you can see, we need three indices: one for the TS rows processed per chunk (i), another because each row has a different number of non-zeros (j), and a third to index the sub-array that was passed (l), which was the original problem.
and the serial_matvec function becomes:
void serial_matvec(size_t TS, const double *A, const int *JA, const int *IA,
                   const double *X, double *Y) {
    int l = 0;
    for (int i = 0; i < TS; ++i) {
        for (int j = 0; j < (IA[i + 1] - IA[i]); ++j) {
            Y[i] += A[l] * X[JA[l]];
            l++;
        }
    }
}
The complete code with a test is here. If anyone has a more elegant solution, you are more than welcome.
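For completeness, a sketch of how the fixed function would then be called (reconstructed from the description above, not taken from the linked code). A and JA are still offset to the chunk's first non-zero and IA to the chunk's first row, but all indexing inside the function is now chunk-local:

for (size_t k = 0; k < rows / TS; ++k) {
    int chunkIndex = IA[k * TS];
    /* A/JA point at the chunk's first non-zero; IA points at the chunk's
       first row, so IA[i+1] - IA[i] still yields the per-row lengths. */
    serial_matvec(TS, &A[chunkIndex], &JA[chunkIndex], &IA[k * TS], X, &Y[k * TS]);
}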

malloc crashes embedded system

I am trying to multiply matrices of arbitrary sizes on a Cortex-M4 core. I DO need malloc...
But I don't understand why the first call works and the second call doesn't; it just jumps to the default interrupt handler FaultISR.
Here is the disassembly (screenshot not included); it fails when executing the BL instruction.
function calls:
multiplyMatrices( &transFRotMatrix[0][0], 3, 3, &sunMeasurements[0][0], 3, 1, *orbitalSunVector); // works fine
multiplyMatrices( &firstRotMatrix[0][0], 3, 3, &orbitalTMFV[0][0], 3, 1, *inertialTMFV);          // doesn't work
code:
void multiplyMatrices(float *transposedMatrix, int height1, int width1, float *iSunVector, int height2, int width2, float *orbitalSunVector)
{
    int y = 0;
    int x = 0;
    int row = 0;
    int column = 0;
    int k = 0;
    int k2 = 0;
    float result = 0;
    float *output2 = NULL;
    int i = 0;
    int j = 0;

    i = 0;
    k = 0;
    k2 = 0;
    if (width1 != height2)
    {
        //printf("unmatching matrices, error.\n\n");
        return;
    }
    output2 = malloc(height1 * width2 * sizeof(float)); // <---- jumps to FaultISR
    while (k < width1) // number of rows of the 1st matrix
    {
        for (j = 0; j < height2; j++) // number of rows of the 2nd matrix
        {
            result += (*((transposedMatrix + k*width1) + j)) * (*((iSunVector + j*width2) + k2)); // 1st variable: number of columns of the 2nd matrix -- 2nd variable after the plus: number of columns of the 2nd matrix
            //printf("%f * %f\t + ", (*((transposedMatrix+k*width1)+j)), (*((iSunVector+j*width2)+k2)));
        }
        output2[row * width1 + column] = result;
        k2++;
        x++;
        column++;
        if (x == width2) // number of columns of the 2nd matrix
        {
            k2 = 0;
            x = 0;
            column = 0;
            row++;
            y++;
            k++;
        }
        result = 0;
    }
    // intermediate result
    for (i = 0; i < height1; i++)
    {
        for (j = 0; j < width2; j++)
        {
            orbitalSunVector[j * height1 + i] = output2[i * width1 + j]; // output2[i][j];
        }
    }
    free(output2);
}
You are overflowing your output2 matrix in both loops due to an incorrect index calculation. You have:
output2[row*width1 + column] = result;
...
orbitalSunVector[j*height1 + i] = output2[i*width1 + j];
but you should be using width2 in both cases since the final matrix is width2 * height1 in size (as it is allocated):
output2[row*width2 + column] = result;
...
orbitalSunVector[j*height1 + i] = output2[i*width2 + j];
I didn't check any of your other indexes but I would test the function with a few known cases to make sure it outputs the correct results. If you had done more debugging and checked the array indexes it should have been easy to spot.
Note that the reason it worked for you the first time but not the second time is due to undefined behaviour (UB). As soon as you write past the end of output2 you invoke UB and anything can happen. For you it happened to show up as a fault on the second call. For me it happened to fault on the first call. If you're really unlucky it may not ever fault and just silently corrupt data.
Do you use printf in other places of your code?
This page recommends starting at 0x400 for heap size, which is 1024 decimal:
It is recommended to start with a reasonable heap size like 0x400 when
there is limited dynamic allocation (like printf() calls in the code),
and increase it as needed depending on the application.
You have 512 today; you could at least try to double that, as per TI's recommendation, and see where it leads you.
This is a related question. If you do not have a tool to watch heap allocation on the fly, try to manually fill the heap at startup (set it to known values, such as ASCII '#==#', 0xDEADBEEF, or whatever recognizable pattern), then run to just before the crash and look at the heap contents in the Memory window. My best guess is that you'll find the heap is full.
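A minimal sketch of that heap-painting idea, assuming your linker script exposes the heap bounds as symbols (the names __heap_start__ and __heap_end__ here are hypothetical; use whatever yours defines):

#include <stdint.h>
#include <string.h>

extern uint8_t __heap_start__[]; /* assumed linker-script symbols */
extern uint8_t __heap_end__[];

/* Call once at startup, before the first malloc: paint the heap with a
   recognizable pattern so the debugger's Memory window shows how much
   of it real allocations have overwritten by the time of the crash. */
void paint_heap(void)
{
    memset(__heap_start__, 0xA5, (size_t)(__heap_end__ - __heap_start__));
}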
Please also look at the fault status registers while you are in the FaultISR; often there is something somewhere telling you why you came there.
I am not sure about TI's implementation of malloc, but it may save an error value somewhere. I would not bet on that one, though, since in that case it would probably just return NULL rather than crash.

CUDA: Is it safe to apply `+=` in parallel to elements of an array located on the device?

I noticed strange (incorrect) behavior after compiling and executing a CUDA script, and was able to isolate it to the following minimal example. First I define an export-to-CSV function for integer arrays (just for debugging convenience):
#include <stdio.h>
#include <stdlib.h>

void int1DExportCSV(int *ptr, int n){
    FILE *f;
    f = fopen("1D IntOutput.CSV", "w");
    int i = 0;
    for (i = 0; i < n-1; i++){
        fprintf(f, "%i,", ptr[i]);
    }
    fprintf(f, "%i", ptr[n-1]);
    fclose(f); // close and flush the file
}
Then I defined a kernel function which increases a certain element of an input array by one:
__global__ void kernel(int *ptr){
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + gridDim.x * y;
    ptr[offset] += 1;
}
The main function allocates an array a filled with zeros, allocates an empty array b, and allocates a device copy of a called dev_a:
#define DIM 64

int main(void){
    int *a;
    a = (int*)malloc(DIM*DIM*sizeof(int));
    int i;
    for(i = 0; i < DIM*DIM; i++){
        a[i] = 0;
    }
    int *b;
    b = (int*)malloc(DIM*DIM*sizeof(int));
    int *dev_a;
    cudaMalloc( (void**)&dev_a, sizeof(int)*DIM*DIM );
    cudaMemcpy( dev_a, a, DIM*DIM*sizeof(int), cudaMemcpyHostToDevice );
Then I feed dev_a into a DIM-by-DIM-by-DIM grid of blocks, each with DIM threads, copy the results back, and export them to CSV:
    dim3 blocks(DIM,DIM,DIM);
    kernel<<<blocks,DIM>>>(dev_a);
    cudaMemcpy( b, dev_a, sizeof(int)*DIM*DIM, cudaMemcpyDeviceToHost );
    cudaFree(dev_a);
    int1DExportCSV(b, DIM*DIM);
}
The resulting CSV file is DIM*DIM entries long, and it is filled with DIM's. The length is correct, but it should be filled with DIM*DIM's, since I am essentially launching a DIM*DIM*DIM*DIM hypercube of threads in which the last two dimensions are all devoted to incrementing a unique element of the device array dev_a by one.
My first reaction was to suspect that the ptr[offset] += 1 step might be the culprit, since multiple threads potentially execute it at exactly the same time, so each thread might be updating a stale copy of ptr[offset], unaware that a bunch of other threads are doing the same. However, I don't know enough about the taboos of CUDA to tell whether this is a reasonable guess.
Hardware limitations are (to the best of my knowledge) not an issue: I am using a GTX 560 Ti, so launching a 3-dimensional grid of blocks is allowed, and my thread count per block is 64, well below the maximum of 1024 imposed by the Fermi architecture.
Am I making a simple mistake? Or is there a subtle error in my example?
Additionally, I noticed that when I increase DIM to 256, the resulting array appears to be filled with random integers between 290 and 430! I am completely baffled by this behavior.
No, it's not safe. The threads in a block are stepping on each other.
Your threads in each threadblock are all updating the same location in memory:
ptr[offset] += 1;
offset is the same for every thread in the block:
int x = blockIdx.x;
int y = blockIdx.y;
int offset = x + gridDim.x * y;
That is a no-no. The results are undefined.
Instead use atomics:
atomicAdd(ptr+offset, 1);
or a parallel reduction method of some sort.
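Putting it together, a minimal sketch of the corrected kernel (same indexing as the question's code, with the atomic read-modify-write in place):

__global__ void kernel(int *ptr){
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + gridDim.x * y;
    atomicAdd(ptr + offset, 1); // colliding increments are serialized, none are lost
}

With this change, every one of the DIM*DIM increments aimed at each element is actually counted, at the cost of serializing the colliding updates.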

Estimating the memory footprint of an Erlang data structure

Being a former C programmer and a current Erlang hacker, one question has popped up: how do I estimate the memory footprint of my Erlang data structures?
Let's say I had an array of 1k integers in C. Estimating its memory demand is easy: the size of the array times the size of an integer. 1k 32-bit integers take up 4 KB of memory, plus some constant number of pointers and indices.
In Erlang, however, estimating memory usage is more complicated: how much memory does an entry in Erlang's array structure take up, and how do I estimate the size of a dynamically sized integer?
I have noticed that scanning over integers in an array is fairly slow in Erlang: scanning an array of about 1M integers takes almost a second, whereas a simple piece of C code does it in around 2 ms. This is most likely due to the amount of memory taken up by the data structure.
I'm asking this not because I'm a speed freak, but because estimating memory has, at least in my experience, been a good way of determining the scalability of software.
My test code:
first the C code:
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <iostream>

class DynamicArray {
protected:
    int* array;
    unsigned int size;
    unsigned int max_size;
public:
    DynamicArray() {
        array = new int[1];
        size = 0;
        max_size = 1;
    }
    ~DynamicArray() {
        delete[] array;
    }
    void insert(int value) {
        if (size == max_size) {
            int* old_array = array;
            array = new int[size * 2];
            memcpy(array, old_array, sizeof(int) * size);
            max_size *= 2;
            delete[] old_array;
        }
        array[size] = value;
        size++;
    }
    inline int read(unsigned idx) const {
        return array[idx];
    }
    void print_array() {
        for (unsigned i = 0; i != size; i++)
            printf("%d ", array[i]);
        printf("\n ");
    }
    int size_of() const {
        return max_size * sizeof(int);
    }
};

void test_array(int test) {
    printf(" %d ", test);
    clock_t t1, t2;
    t1 = clock();
    DynamicArray arr;
    for (int i = 0; i != test; i++) {
        //arr.print_array();
        arr.insert(i);
    }
    int val = 0;
    for (int i = 0; i != test; i++)
        val += arr.read(i);
    printf(" size %g MB ", (arr.size_of() / (1024 * 1024.0)));
    t2 = clock();
    float diff((float)t2 - (float)t1);
    std::cout << diff / 1000 << " ms"; // assumes CLOCKS_PER_SEC == 1000000, as on Linux
    printf(" %d \n", val == ((1 + test) * test) / 2);
}

int main(int argc, char** argv) {
    int size = atoi(argv[1]);
    printf(" -- STARTING --\n");
    test_array(size);
    return 0;
}
and the Erlang code:
-module(test).
-export([go/1]).

construct_list(Arr, Idx, Idx) ->
    Arr;
construct_list(Arr, Idx, Max) ->
    construct_list(array:set(Idx, Idx, Arr), Idx + 1, Max).

sum_list(_Arr, Idx, Idx, Sum) ->
    Sum;
sum_list(Arr, Idx, Max, Sum) ->
    sum_list(Arr, Idx + 1, Max, array:get(Idx, Arr) + Sum).

go(Size) ->
    A0 = array:new(Size),
    A1 = construct_list(A0, 0, Size),
    sum_list(A1, 0, Size, 0).
Timing the C code:
bash-3.2$ g++ -O3 test.cc -o test
bash-3.2$ ./test 1000000
-- STARTING --
1000000 size 4 MB 5.511 ms 0
and the Erlang code:
1> f(Time), {Time, _} =timer:tc(test, go, [1000000]), Time/1000.0.
2189.418
First, an Erlang variable is always just a single word (32 or 64 bits, depending on your machine). Two or more bits of the word are used as a type tag. The remainder can hold an "immediate" value, such as a "fixnum" integer, an atom, an empty list ([]), or a Pid; or it can hold a pointer to data stored on the heap (tuple, list, "bignum" integer, float, etc.). A tuple has a header word specifying its type and length, followed by one word per element. A list cell on the heap uses only 2 words (its pointer already encodes the type): the head and tail elements.
For example: if A={foo,1,[]}, then A is a word pointing to a word on the heap saying "I'm a 3-tuple" followed by 3 words containing the atom foo, the fixnum 1, and the empty list, respectively. If A=[1,2], then A is a word saying "I'm a list cell pointer" pointing to the head word (containing the fixnum 1) of the first cell; and the following tail word of the cell is yet another list cell pointer, pointing to a head word containing the 2 and followed by a tail word containing the empty list. A float is represented by a header word and 8 bytes of double precision floating-point data. A bignum or a binary is a header word plus as many words as needed to hold the data. And so on. See e.g. http://stenmans.org/happi_blog/?p=176 for some more info.
To estimate size, you need to know how your data is structured in terms of tuples and lists, and you need to know the size of your integers (if too large, they will be stored as bignums instead of fixnums; the limit is 28 bits including the sign on a 32-bit machine, and 60 bits on a 64-bit machine). For example, by this accounting a list of 1,000 fixnums occupies 2,000 heap words, about 16 KB on a 64-bit machine.
Edit: https://github.com/happi/theBeamBook is a newer good resource on the internals of the BEAM Erlang virtual machine.
Is this what you want?
1> erts_debug:size([1,2]).
4
With it you can at least figure out how big a term is. The size returned is in words.
Erlang stores oversized integers as arbitrary-precision bignums, so you cannot really estimate them the same way as in C; you can only predict how large your integers will get and calculate the average number of bytes needed to store them.
Check http://www.erlang.org/doc/efficiency_guide/advanced.html, and you can use the erlang:memory() function to determine the actual amount.
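For instance, on a 64-bit machine you can sanity-check the flat size of a 1,000-element integer list in the shell (erts_debug:size/1 counts words, two per list cell, so multiply by the word size for bytes):

1> erts_debug:size(lists:seq(1, 1000)).
2000
2> erts_debug:size(lists:seq(1, 1000)) * erlang:system_info(wordsize).
16000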