malloc crashes embedded system - c

I am trying to multiply matrices of arbitrary sizes on a Cortex-M4 core. I DO need malloc...
But I don't understand why the first call works while the second call doesn't: it just jumps to the default interrupt handler, FaultISR. The disassembly shows it fails when executing the BL instruction.
Function calls:
multiplyMatrices( &transFRotMatrix[0][0], 3, 3, &sunMeasurements[0][0], 3, 1, *orbitalSunVector); //works fine
multiplyMatrices( &firstRotMatrix[0][0], 3, 3, &orbitalTMFV[0][0], 3, 1, *inertialTMFV); //does not work
Code:
void multiplyMatrices(float *transposedMatrix, int height1, int width1, float *iSunVector, int height2, int width2, float *orbitalSunVector)
{
    int y = 0;
    int x = 0;
    int row = 0;
    int column = 0;
    int k = 0;
    int k2 = 0;
    float result = 0;
    float *output2 = NULL;
    int i = 0;
    int j = 0;

    if(width1 != height2)
    {
        //printf("unmatching matrices, error.\n\n");
        return;
    }
    output2 = malloc(height1 * width2 * sizeof(float)); //<---- jumps to FaultISR
    while(k < width1) //number of rows of the 1st matrix
    {
        for(j = 0; j < height2; j++) //number of rows of the 2nd matrix
        {
            result += (*((transposedMatrix + k*width1) + j)) * (*((iSunVector + j*width2) + k2)); //1st var: number of columns of the 2nd matrix -- 2nd variable after the plus = number of columns of the 2nd matrix
            //printf("%f * %f\t + ", (*((transposedMatrix+k*width1)+j)), (*((iSunVector+j*width2)+k2)));
        }
        output2[row * width1 + column] = result;
        k2++;
        x++;
        column++;
        if(x == width2) //number of columns of the 2nd matrix
        {
            k2 = 0;
            x = 0;
            column = 0;
            row++;
            y++;
            k++;
        }
        result = 0;
    }
    //intermediate result
    for(i = 0; i < height1; i++)
    {
        for(j = 0; j < width2; j++)
        {
            orbitalSunVector[j * height1 + i] = output2[i * width1 + j]; //output2[i][j];
        }
    }
    free(output2);
}

You are overflowing your output2 matrix in both loops due to an incorrect index calculation. You have:
output2[row*width1 + column] = result;
...
orbitalSunVector[j*height1 + i] = output2[i*width1 + j];
but you should be using width2 in both cases since the final matrix is width2 * height1 in size (as it is allocated):
output2[row*width2 + column] = result;
...
orbitalSunVector[j*height1 + i] = output2[i*width2 + j];
I didn't check any of your other indexes but I would test the function with a few known cases to make sure it outputs the correct results. If you had done more debugging and checked the array indexes it should have been easy to spot.
Note that the reason it worked for you the first time but not the second time is due to undefined behaviour (UB). As soon as you write past the end of output2 you invoke UB and anything can happen. For you it happened to show up as a fault on the second call. For me it happened to fault on the first call. If you're really unlucky it may not ever fault and just silently corrupt data.
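For reference, a minimal sketch of the function with both index fixes applied might look like the following. This is my own cleaned-up version, not the asker's exact code; I've also added a NULL check on malloc, which matters on a heap-constrained target:

```c
#include <stdlib.h>

/* Multiplies an h1 x w1 matrix by an h2 x w2 matrix (w1 == h2) and
   writes the transposed product into out, as the question's code does. */
void multiplyMatrices(const float *m1, int h1, int w1,
                      const float *m2, int h2, int w2, float *out)
{
    if (w1 != h2)
        return;

    float *tmp = malloc((size_t)h1 * w2 * sizeof *tmp);
    if (tmp == NULL)            /* allocation can fail on a small heap */
        return;

    for (int r = 0; r < h1; r++)
        for (int c = 0; c < w2; c++) {
            float acc = 0.0f;
            for (int j = 0; j < w1; j++)
                acc += m1[r * w1 + j] * m2[j * w2 + c];
            tmp[r * w2 + c] = acc;      /* width2 stride, not width1 */
        }

    /* transposed copy-out, as in the original final loop */
    for (int i = 0; i < h1; i++)
        for (int j = 0; j < w2; j++)
            out[j * h1 + i] = tmp[i * w2 + j];   /* width2 here too */

    free(tmp);
}
```

With the strides corrected, every store lands inside the `h1 * w2` allocation, so the heap is no longer corrupted.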

Do you use printf in other places of your code?
This page recommends starting at 0x400 for heap size, which is 1024 decimal:
It is recommended to start with a reasonable heap size like 0x400 when
there is limited dynamic allocation (like printf() calls in the code),
and increase it as needed depending on the application.
You have 512 today, you could at least try to double that if possible, as per TI's recommendation, and see where this leads you.
This is a related question. If you do not have a tool to watch heap allocation on the fly, try to manually fill the heap at startup (memset it with a recognizable pattern such as 0xDEADBEEF or ASCII '#==#'), then run to just before the usual crash and inspect the heap contents in the Memory window. My best guess is that you'll find the heap is full.
Please also look at the fault status registers while you are in the FaultISR. Often there is something somewhere telling you why you got there.
I am not sure about TI's implementation of malloc, but it may save an error value. I wouldn't bet on that, since it would probably return NULL on failure rather than crash.
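The fill-and-inspect idea can be sketched portably like this. On the target you'd run `heap_paint` over the region between your toolchain's heap bounds (linker symbol names such as `__heap_start`/`__heap_end` vary per vendor and are an assumption here); the second helper then estimates the high-water mark by counting untouched words from the top:

```c
#include <stdint.h>
#include <stddef.h>

#define HEAP_FILL 0xDEADBEEFu

/* Fill a memory region with a recognizable pattern at startup. */
void heap_paint(uint32_t *start, size_t words)
{
    for (size_t i = 0; i < words; i++)
        start[i] = HEAP_FILL;
}

/* Count how many words at the end of the region still hold the
   pattern: an estimate of how much heap was never touched. */
size_t heap_untouched_words(const uint32_t *start, size_t words)
{
    size_t free_words = 0;
    while (free_words < words &&
           start[words - 1 - free_words] == HEAP_FILL)
        free_words++;
    return free_words;
}
```

If `heap_untouched_words` returns 0 (or near 0) just before the crash, the heap is exhausted and the TI recommendation to grow it applies.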

Related

Why this code works when i is smaller but gives a segmentation fault when i is larger?

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int *x, *y;
    x = malloc(sizeof(int));
    for (int i = 0; i < 4; i++)
        x[i] = i + 1;
    y = x;
    for (int i = 0; i < 4; i++)
        printf("%d ", y[i]);
}
This works and outputs 1 2 3 4.
But when the bound is i < 1000000 it gives a segmentation fault.
Can someone explain this?
You need to allocate a large enough buffer. You only allocate sizeof(int), which is typically 4 bytes and large enough to hold only one int. You can't store 1000000 elements in that. It worked for 4 elements by pure chance, probably because although you were overwriting memory, you didn't clobber anything important. Something like this is what you should use:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int count = 1000000;
    int *x, *y;
    x = malloc(sizeof(int) * count);
    for (int i = 0; i < count; i++)
        x[i] = i + 1;
    y = x;
    for (int i = 0; i < count; i++)
        printf("%d ", y[i]);
}
Undefined behaviour is undefined; you cannot count on any particular outcome.
You have memory allocated for one integer; the moment you dereference memory outside that range (i.e., any index from 1 up), you invoke UB. The only valid access is x[0], and x[0] only.
You only allocated memory for one int:
x = malloc(sizeof(int)); // malloc allocates a memory chunk to only hold one int object.
Indexing x at x[i] = i+1; or y at printf("%d ", y[i]); with any value of i other than 0 invokes undefined behavior, because you would be writing to and reading from memory you never allocated.
"then this means if I don't have any enough buffer, it also will give a segmentation fault for i < 4?"
Exactly. You know that is the bad thing on undefined behavior. It does not need to provide wrong results or errors. So, the i < 4 code is broken, too.
Since you written to "only" 12 bytes after the allocated memory (since sizeof(int) common is 4), it might have worked because there was no other necessary information in memory thereafter, but your code is absolutely broken nonetheless.
you defined less memory than the memory you used causing your program to write after that memory zone and alterate the stack fo the program, this is also the case of the buffer overflow vulnerability in C and C++, increment the buffer size

C - Function call itself causes segfault in recursive calls

I'm fighting a segfault error that I can't understand: I have a recursive function that expands over an array representing pixels. Starting at an index, it expands around that index to build groups of pixels by calling itself left, right, up and down (i.e., index - 1, index + 1, ...). For debugging purposes, I have a printf call at the very first line of the function, and one just before each of the 4 recursive calls. What I don't get is that I end up with a segfault during the recursion at the recursive call itself (I get the print that is just before the call, but not the one at function start).
void explore(Pixel *array, int j, Tache *cur_pound, int *it, Pixel previous){
    printf("%d\n", j); // I DONT GET THIS PRINT AT LAST RECURSIVE CALL
    // out of bounds
    if(j > sizeX * sizeY)
        return;
    // already explored index
    if(array[j].explored == 1){
        return;
    }
    // too big a color difference between this pixel and the reference one
    if(abs((int)array[j].r - previous.r) > SEUIL || abs((int)array[j].g - previous.g) > SEUIL || abs((int)array[j].b - previous.b) > SEUIL){
        return;
    }
    array[j].explored = 1;
    cur_pound->limits[*it] = j;
    (*it)++;
    // recursion
    if(j + 1 < sizeX * sizeY && array[j+1].explored != 1){
        printf("before SF\n"); // I GET THIS PRINTF
        explore(array, j + 1, cur_pound, it, previous);
    }
    // 3 other recursive calls removed for simplicity here
}
About my data structures: a Tache * is a struct that contains 3 GLubytes and limits, an int * holding every pixel index that belongs to the group. A Pixel contains 3 GLubytes and a char flagging whether the pixel has already been visited by the function. The array given to the function as the first argument is an array of Pixel representing my image.
it is an int holding the current index into limits, so the function knows where to add a new index.
limits is initialised to -1 outside this function and allocated with malloc(size * sizeof(int)), where size is the width of the image multiplied by its height.
This is how the inital call is done :
void taches_de_couleur(Image *i){
    int j, k, y, size, it;
    GLubyte *im;
    Pixel *array;

    sizeX = i->sizeX;
    sizeY = i->sizeY;
    k = 0;
    size = sizeX * sizeY;
    array = malloc(size * sizeof(Pixel));
    im = i->data;
    /* build the array from image data */
    for(j = 0; j < 3 * size; j += 3){
        array[k].explored = 0;
        array[k].r = i->data[j];
        array[k].g = i->data[j + 1];
        array[k].b = i->data[j + 2];
        k++;
    }
    Tache *new_pound;
    new_pound = malloc(sizeof(Tache));
    new_pound->limits = malloc(size * sizeof(int));
    int x = 0;
    while(x < size){
        new_pound->limits[x] = -1;
        x++;
    }
    it = 0;
    explore(array, 0, new_pound, &it, array[0]);
}
Note that the program does not produce any segfault when working with small images (the biggest I could do was 512x384 px).
This has been giving me a headache for a week now; I can't figure out what is causing the segfault, and that's why I'm asking if you can see anything obvious here. I can add the second function that calls explore if need be, but that part seems fine.
EDIT: this is the output gdb gives me when I run it with an image that is too big:
Thread 1 "palette" received signal SIGSEGV, Segmentation fault.
0x00007ffff7b730be in __GI___libc_write (fd=1, buf=0x555555592770,
nbytes=7)
at ../sysdeps/unix/sysv/linux/write.c:26
26 ../sysdeps/unix/sysv/linux/write.c: No such file or directory.
EDIT: Since I'm failing to provide enough resources, see https://github.com/BruhP8/TachesDeCouleur for the full project.
Thanks in advance
What I don't get is that I end up with a segfault during the recursion at the recursive call itself (I get the print that is just before the call, but not the one at function start).
That is an almost sure sign of stack exhaustion.
Run your program under a debugger and examine the instruction which causes the segfault. Chances are it will be one of the stack manipulation instructions (CALL, PUSH), or a stack dereference that follows a stack-pointer decrement. You can also look at the value of the $SP register and compare it to the bounds of the stack segment (from /proc/$pid/maps if you are on Linux).
The code you've shown does not appear to allocate any stack, so the problem is likely in the code you omitted.
Note that the program does not produce any SF when working with small images
That is another sign: you are probably allocating a new image on the stack, and the larger the image, the fewer levels of recursion you can achieve.
P.S. On Linux, default stack size is often 8MiB. Try ulimit -s unlimited -- if that allows the program to recur deeper, that would be a sure sign that my guess is correct. But don't use ulimit -s unlimited as a fix (it's not).
Update:
With the full source code, I was able to build the palette program. Each recursive call to explore only takes 48 bytes of stack (which isn't much).
But with default 8MiB stack, that limits the total recursion to (8 << 20) / 48 == 174762 levels deep.
TL;DR: if your recursive procedure requires one level of recursion per pixel, then you would not be able to process large images. You must rewrite the procedure to be iterative instead.
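A sketch of what the iterative rewrite could look like, using an explicit heap-allocated worklist instead of the call stack. The Pixel/Tache details from the question are elided; this hypothetical `flood_fill` just shows the control structure on a plain int grid:

```c
#include <stdlib.h>
#include <stdbool.h>

/* Iterative flood fill over a width*height grid. Marks every cell
   4-connected to `start` whose value matches grid[start], using an
   explicit stack of pending indices instead of recursion.
   Returns the number of cells filled, or -1 on allocation failure. */
int flood_fill(int *grid, bool *explored, int width, int height, int start)
{
    int size = width * height;
    /* each processed cell pushes at most 4 neighbours */
    int *stack = malloc((4 * (size_t)size + 1) * sizeof *stack);
    if (stack == NULL)
        return -1;

    int target = grid[start];
    int top = 0, filled = 0;
    stack[top++] = start;

    while (top > 0) {
        int j = stack[--top];
        if (j < 0 || j >= size || explored[j] || grid[j] != target)
            continue;                   /* bounds checked on pop */
        explored[j] = true;
        filled++;
        if (j % width != 0)         stack[top++] = j - 1;  /* left  */
        if (j % width != width - 1) stack[top++] = j + 1;  /* right */
        stack[top++] = j - width;                          /* up    */
        stack[top++] = j + width;                          /* down  */
    }
    free(stack);
    return filled;
}
```

The recursion depth no longer depends on the image size; memory use is bounded by the explicitly allocated worklist, which lives on the heap where large images are not a problem.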
It seems the first boundary check in your code should be:
if( j >= sizeX * sizeY )
and not
if( j > sizeX * sizeY )
(As the last element of your array is array[size - 1] and not array[size])

CUDA: Is it safe to apply `+=` in parallel to elements of an array located on the device?

I noticed strange (incorrect) behavior after compiling and executing a CUDA script, and was able to isolate it to the following minimal example. First I define an export-to-CSV function for integer arrays (just for debugging convenience):
#include <stdio.h>
#include <stdlib.h>
void int1DExportCSV(int *ptr, int n){
    FILE *f = fopen("1D IntOutput.CSV", "w");
    int i;
    for (i = 0; i < n - 1; i++){
        fprintf(f, "%i,", ptr[i]);
    }
    fprintf(f, "%i", ptr[n - 1]);
    fclose(f); // flush and close, otherwise output may be lost
}
Then I defined a kernel function which increases a certain element of an input array by one:
__global__ void kernel(int *ptr){
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + gridDim.x * y;
    ptr[offset] += 1;
}
The main routine allocates a zero-filled array a, allocates an empty array b, and allocates a device copy of a called dev_a:
#define DIM 64

int main(void){
    int *a;
    a = (int*)malloc(DIM*DIM*sizeof(int));
    int i;
    for(i = 0; i < DIM*DIM; i++){
        a[i] = 0;
    }
    int *b;
    b = (int*)malloc(DIM*DIM*sizeof(int));
    int *dev_a;
    cudaMalloc( (void**)&dev_a, sizeof(int)*DIM*DIM );
    cudaMemcpy( dev_a, a, DIM*DIM*sizeof(int), cudaMemcpyHostToDevice );
Then I feed dev_a into a DIM-by-DIM-by-DIM grid of blocks, each with DIM threads, copy the results back, and export them to CSV:
    dim3 blocks(DIM,DIM,DIM);
    kernel<<<blocks,DIM>>>(dev_a);
    cudaMemcpy( b, dev_a, sizeof(int)*DIM*DIM, cudaMemcpyDeviceToHost );
    cudaFree(dev_a);
    int1DExportCSV(b, DIM*DIM);
}
The resulting CSV file is DIM*DIM in length, and is filled with DIM's. However, while the length is correct, it should be filled with DIM*DIM's, since I am essentially launching a DIM*DIM*DIM*DIM hypercube of threads, in which the last two dimensions are all devoted to incrementing a unique element of the device array dev_a by one.
My first reaction was to suspect that the ptr[offset] += 1 step might be the culprit, since multiple threads potentially execute it at the exact same time, so each thread might be updating a stale copy of ptr[offset] while unaware that other threads are doing the same. However, I don't know enough about the taboos of CUDA to tell whether this is a reasonable guess.
Hardware problems are (to the best of my knowledge) not an issue; I am using a GTX560 Ti, so launching a 3-dimensional grid of blocks is allowed, and my thread count per block is 64, well below the maximum of 1024 imposed by the Fermi architecture.
Am I making a simple mistake? Or is there a subtle error in my example?
Additionally, I noticed that when I increase DIM to 256, the resulting array appears to be filled with random integers between 290 and 430! I am completely baffled by this behavior.
No, it's not safe. The threads in a block are stepping on each other.
Your threads in each threadblock are all updating the same location in memory:
ptr[offset] += 1;
offset is the same for every thread in the block:
int x = blockIdx.x;
int y = blockIdx.y;
int offset = x + gridDim.x * y;
That is a no-no. The results are undefined.
Instead use atomics:
atomicAdd(ptr+offset, 1);
or a parallel reduction method of some sort.
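The same hazard exists on the CPU side. A hedged C11 sketch (plain pthreads, not the CUDA API) of why an atomic read-modify-write is needed: with a plain `counter += 1` the threads would overwrite each other's updates and the total would typically come up short, which is exactly the DIM-instead-of-DIM*DIM symptom in the question. `atomic_fetch_add` is the CPU analogue of `atomicAdd`:

```c
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 8
#define NITERS   100000

static atomic_int counter = 0;

/* Each thread performs the CPU analogue of atomicAdd(ptr, 1). */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++)
        atomic_fetch_add(&counter, 1);  /* one indivisible read-modify-write */
    return NULL;
}

int run_counter(void)
{
    pthread_t tids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    return atomic_load(&counter);
}
```

With the atomic, every one of the NTHREADS * NITERS increments is counted, regardless of interleaving.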

core dump while iterating through arrays

I'm having some problems with the following code. newrows is a parameter passed directly to the function I'm working in; elements is computed a bit earlier from another parameter. For some combinations of newrows and elements I get a core dump, while other combinations work fine. Usually, when the core dump occurs there have been 20000 to 25000 iterations; when everything works fine there have been up to 40000 iterations.
int32_t newimage[newrows][elements][3];
int32_t pixelcounter[newrows][elements];
//int32_t norm, angle, rohmax;
//double r, alpha, beta, m, mu;

//initialize arrays
for(i = 0; i < newrows; i++){
    for(j = 0; j < elements; j++){
        pixelcounter[i][j] = 0;
        newimage[i][j][0] = 0;
        newimage[i][j][1] = 0;
        newimage[i][j][2] = 0;
    }
}
combination that works fine: 200 : 188
combination that leads to core dump: 200 : 376
I am using Linux btw :-)
This is most likely a stack space issue. Note that newimage and pixelcounter are variable-length arrays allocated in the stack frame of whatever function they are declared in, and you can quickly run out of space trying to allocate large amounts of data that way. Your 3D array newimage alone grows as
#bytes = newrows * elements * 3 * sizeof(int32_t)
I cleaned up your program (a good piece of advice is to try and present programs that compile, so can people can help you quicker!):
#include <stdio.h>
#include <stdint.h>

void test(size_t newrows, size_t elements) {
    int32_t newimage[newrows][elements][3];
    int32_t pixelcounter[newrows][elements];
    //initialize arrays
    for(size_t i = 0; i < newrows; i++) {
        for(size_t j = 0; j < elements; j++) {
            pixelcounter[i][j] = 0;
            newimage[i][j][0] = 0;
            newimage[i][j][1] = 0;
            newimage[i][j][2] = 0;
        }
    }
}

int main(void) {
    printf("Size of integer = %zu\n", sizeof(int));
    for (size_t i = 700; ; i += 10) {
        printf("Testing (%zu, %zu)\n", i, i);
        test(i, i);
    }
    return 0;
}
And running this, I see:
Size of integer = 4
Testing (700, 700)
Testing (710, 710)
Testing (720, 720)
Testing (730, 730)
[3] 13482 segmentation fault (core dumped) ./a.out
So it fails somewhere between 720^2 * 3 * 4 and 730^2 * 3 * 4 bytes for newimage alone, about 6 MiB (counting pixelcounter as well brings the total to roughly 8 MiB, the default Linux stack limit) on my 64-bit Linux computer; it might be different on yours.
The solution in this case is allocate your arrays on the heap, where you will have a lot more memory to work with. More information about heap-allocating multidimensional arrays can be found in How does C allocate space for a 2D (3D...) array when using malloc?.
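One way to do that in C99 while keeping the convenient [i][j][k] indexing is a pointer to a variable-length array type. A minimal sketch, using the question's array shapes (the helper name and out-parameter style are mine):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Heap-allocate the two arrays from the question; returns 0 on success.
   newimage keeps the [row][element][channel] indexing. */
int alloc_image(size_t newrows, size_t elements,
                int32_t (**newimage)[elements][3],
                int32_t (**pixelcounter)[elements])
{
    *newimage = malloc(newrows * sizeof **newimage);
    *pixelcounter = malloc(newrows * sizeof **pixelcounter);
    if (*newimage == NULL || *pixelcounter == NULL) {
        free(*newimage);
        free(*pixelcounter);
        return -1;
    }
    /* zero-initialize, replacing the nested loops */
    memset(*newimage, 0, newrows * sizeof **newimage);
    memset(*pixelcounter, 0, newrows * sizeof **pixelcounter);
    return 0;
}
```

Unlike the VLA-on-the-stack version, this fails gracefully (malloc returns NULL) instead of crashing when the arrays are too big; remember to free both arrays when done.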

Optimizing array transposing function

I'm working on a homework assignment, and I've been stuck for hours on my solution. The problem we've been given is to optimize the following code, so that it runs faster, regardless of how messy it becomes. We're supposed to use stuff like exploiting cache blocks and loop unrolling.
Problem:
//transpose a dim x dim matrix into dst by swapping all i,j with j,i
void transpose(int *dst, int *src, int dim) {
    int i, j;
    for(i = 0; i < dim; i++) {
        for(j = 0; j < dim; j++) {
            dst[j*dim + i] = src[i*dim + j];
        }
    }
}
What I have so far:
//attempt 1
void transpose(int *dst, int *src, int dim) {
    int i, j, id, jd;
    id = 0;
    for(i = 0; i < dim; i++, id += dim) {
        jd = 0;
        for(j = 0; j < dim; j++, jd += dim) {
            dst[jd + i] = src[id + j];
        }
    }
}

//attempt 2
void transpose(int *dst, int *src, int dim) {
    int i, j, id;
    int *pd, *ps;
    id = 0;
    for(i = 0; i < dim; i++, id += dim) {
        pd = dst + i;
        ps = src + id;
        for(j = 0; j < dim; j++) {
            *pd = *ps++;
            pd += dim;
        }
    }
}
Some ideas, please correct me if I'm wrong:
I have thought about loop unrolling, but I don't think it would help, because we don't know whether the NxN matrix has prime dimensions. Checking for that would add extra work that just slows the function down.
Cache blocks wouldn't be very useful, because no matter what, we will access one array linearly (1,2,3,4) while the other in jumps of N. While we can get the function to exploit the cache for the src accesses, it will still take a long time to store into the dst matrix.
I have also tried using pointers instead of array indexing, but I don't think that actually speeds up the program in any way.
Any help would be greatly appreciated.
Thanks
Cache blocking can be useful. For example, let's say we have a cache line size of 64 bytes (which is what x86 uses these days) and a matrix larger than the cache. Since sizeof(int) == 4, 16 ints fit in a cache line; assuming the matrix is aligned on a cache-line boundary, transposing a 16x16 block means loading 32 cache lines from memory (16 from the source matrix, 16 from the destination matrix before we can dirty them) and storing another 16 lines, even though the stores are not sequential. In contrast, without cache blocking, transposing the equivalent 16x16 elements requires us to load 16 cache lines from the source matrix, but 16*16 = 256 cache lines to be loaded and then stored for the destination matrix.
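A hedged sketch of what such a 16x16 blocked transpose could look like (the function name and tail handling for dimensions that aren't a multiple of the block are mine, not from the question):

```c
/* Cache-blocked transpose: process the matrix in BLOCK x BLOCK tiles
   so both src reads and dst writes stay within a few cache lines.
   BLOCK = 16 matches 16 ints per 64-byte line, as described above. */
#define BLOCK 16

void transpose_blocked(int *dst, int *src, int dim) {
    for (int ib = 0; ib < dim; ib += BLOCK) {
        for (int jb = 0; jb < dim; jb += BLOCK) {
            /* clamp the tile at the matrix edge */
            int imax = ib + BLOCK < dim ? ib + BLOCK : dim;
            int jmax = jb + BLOCK < dim ? jb + BLOCK : dim;
            for (int i = ib; i < imax; i++)
                for (int j = jb; j < jmax; j++)
                    dst[j * dim + i] = src[i * dim + j];
        }
    }
}
```

The element-wise formula is identical to the original; only the iteration order changes, trading nothing for much better locality on the strided side.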
Unrolling is useful for large matrices.
You'll need some code to deal with the excess elements if the matrix size isn't a multiple of the unroll factor. But this will be outside the most critical loop, so for a large matrix it's worth it.
Regarding the direction of accesses - it may be better to read linearly and write in jumps of N, rather than vice versa. This is because read operations block the CPU, while write operations don't (up to a limit).
Other suggestions:
1. Can you use parallelization? OpenMP can help (though if you're expected to deliver single CPU performance, it's no good).
2. Disassemble the function and read it, focusing on the innermost loop. You may find things you wouldn't notice in C code.
3. Using decreasing counters (stopping at 0) might be slightly more efficient than increasing counters.
4. The compiler must assume that src and dst may alias (point to the same or overlapping memory), which limits its optimization options. If you tell the compiler they can't overlap, for instance with the restrict qualifier, it may be a great help.
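A sketch of how the C99 restrict qualifier could be applied to the signature (the function name is mine); the qualifier is a promise from the caller that the two buffers never overlap, which the compiler may exploit:

```c
/* restrict tells the compiler dst and src never alias, so it can
   reorder and vectorize the loads and stores more aggressively. */
void transpose_restrict(int * restrict dst, const int * restrict src, int dim) {
    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++)
            dst[j * dim + i] = src[i * dim + j];
}
```

Note that passing overlapping buffers to a restrict-qualified function is undefined behavior, so this only applies because a transpose into a separate destination never aliases.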
Messiness is not a problem, so: I would add a transposed flag to each matrix. The flag indicates whether the stored data array is to be interpreted in normal or transposed order.
All matrix operations receive these flags in addition to each matrix parameter, and implement the code for all possible combinations of flags. Perhaps macros can save redundant writing here.
In this new implementation, transposition just toggles the flag: the space and time needed for the transpose operation is constant.
Just an idea how to implement unrolling:
void transpose(int *dst, int *src, int dim) {
    int i, j;
    const int dim1 = (dim / 4) * 4;
    for(i = 0; i < dim; i++) {
        for(j = 0; j < dim1; j += 4) {
            dst[j*dim + i]     = src[i*dim + j];
            dst[(j+1)*dim + i] = src[i*dim + (j+1)];
            dst[(j+2)*dim + i] = src[i*dim + (j+2)];
            dst[(j+3)*dim + i] = src[i*dim + (j+3)];
        }
        for( ; j < dim; j++) {
            dst[j*dim + i] = src[i*dim + j];
        }
        __builtin_prefetch(&src[(i+1)*dim], 0, 1);
    }
}
Of course you should hoist invariant computations (like i*dim) out of the inner loop, as you already did in your attempts. Cache prefetch can be used for the source matrix, as __builtin_prefetch does above.
You probably know this, but: register int tells the compiler it would be smart to keep the variable in a register (modern compilers largely ignore the hint), and making the ints unsigned may make things go a little bit faster.