The compiler was reporting errors such as "Array size too large", "Structure size too large", and "too much global data defined in a file". Can you show me how to allocate this memory dynamically?
struct
{
doublereal a[25000000];
} _BLNK__;
static doublereal x[22500] /* was [3][7500] */;
static doublereal vn[12], del, eul[22500] /* was [3][1500] */;
Allocate the data on the heap, rather than on the stack. Use pointers and allocate the memory in an initialization routine.
Also, do some arithmetic to work out whether you have enough memory, e.g. 25000000 * 16 bytes => 400 MB of memory. (I have no idea how big doublereal is.)
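For example, a minimal sketch of that approach (assuming f2c's doublereal, which f2c.h normally defines as double; init_storage is an illustrative name):
#include <stdlib.h>
typedef double doublereal; /* f2c.h defines doublereal as double */
static doublereal *a; /* was: doublereal a[25000000]; in the struct */
/* Call once at startup; returns 0 on allocation failure. */
int init_storage(void)
{
    a = malloc(25000000 * sizeof *a); /* ~200 MB if doublereal is 8 bytes */
    return a != NULL;
}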
Try dynamic memory allocation with malloc and a pointer, like:
typedef struct
{
doublereal a[25000000];
} _BLNK__;
...
{
_BLNK__ *mypointer = malloc(sizeof *mypointer);
if (mypointer == NULL) { /* handle allocation failure */ }
mypointer->a[0] = 0;
mypointer->a[1] = 1;
...
free(mypointer);
}
...
The statement
doublereal a[25000000];
allocates memory on the stack. There is a strict limit on the size of the stack; on a Linux or OS X system you can find it by running:
$ ulimit -s
8192
which is 8192 KB = 8 MB.
You are trying to allocate 25000000 * 8 = 200000000 bytes ≈ 190 MB, which is much larger than that limit.
You have three choices:
1) reduce the size of the array
2) dynamically allocate memory (doublereal *a = (doublereal *)malloc(sizeof(doublereal) * 25000000))
3) increase stack size (but this requires administrative privileges on all machines that this program will run on)
#include <stdlib.h>
#define LEN_A (25000000)
typedef struct
{
    doublereal* a; /* heap-allocated instead of a 25000000-element array */
} _BLNK__; /* typedef is needed so "_BLNK__ Item;" in main() compiles */
#define LEN_X (22500)
#define LEN_VN (12)
#define LEN_EUL (22500)
#define INIT_BLNK(x) ((x).a = (doublereal*)malloc(LEN_A * sizeof(doublereal)))
#define FREE_BLNK(x) do { if ((x).a != NULL) free((x).a); } while (0)
static doublereal *x;
static doublereal *vn,del,*eul;
int main()
{
_BLNK__ Item;
x = (doublereal*)malloc(LEN_X*sizeof(doublereal));
vn = (doublereal*)malloc(LEN_VN*sizeof(doublereal));
eul = (doublereal*)malloc(LEN_EUL*sizeof(doublereal));
INIT_BLNK(Item);
//Do whatever you wish
//Return memory to the OS
free(x);
free(vn);
free(eul);
FREE_BLNK(Item);
return 0;
}
Try using this. I wrote the code directly here, so if there are any compiler errors, fix them as needed.
Related
I have a rather huge recursive function (I write in C), and while I have no doubt that the scenario where stack overflow happens is extremely unlikely, it is still possible. What I wonder is whether you can detect that the stack is about to overflow within a few iterations, so you can do an emergency stop without crashing the program.
In the C programming language itself, that is not possible. In general, you can't easily know that you have run out of stack before running out. I recommend instead placing a configurable hard limit on the recursion depth in your implementation, so you can simply abort when the depth is exceeded, as sketched below. You could also rewrite your algorithm to use an auxiliary data structure instead of using the stack through recursion; this gives you greater flexibility to detect an out-of-memory condition, since malloc() tells you when it fails.
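For example, a minimal sketch of the depth-limit idea (node_t and MAX_DEPTH are illustrative; tune the cap to your platform's stack size):
#include <stdio.h>
/* hypothetical tree node, just to make the example self-contained */
typedef struct node { struct node *left, *right; } node_t;
#define MAX_DEPTH 10000 /* configurable hard limit on recursion depth */
/* Returns 0 on success, -1 if the depth cap was hit and the walk aborted. */
static int walk(const node_t *n, int depth)
{
    if (n == NULL)
        return 0;
    if (depth >= MAX_DEPTH)
        return -1; /* emergency stop instead of a stack overflow */
    if (walk(n->left, depth + 1) != 0)
        return -1;
    return walk(n->right, depth + 1);
}
int main(void)
{
    node_t leaf = { NULL, NULL };
    node_t root = { &leaf, NULL };
    printf("walk returned %d\n", walk(&root, 0));
    return 0;
}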
However, you can get something similar with a procedure like this on UNIX-like systems:
Use setrlimit to set a soft stack limit lower than the hard stack limit
Establish signal handlers for both SIGSEGV and SIGBUS to get notified of stack overflows. Some operating systems produce SIGSEGV for these, others SIGBUS.
If you get such a signal and determine that it comes from a stack overflow, raise the soft stack limit with setrlimit and set a global variable to record that this occurred. Make the variable volatile so the optimizer doesn't foil your plans.
In your code, at each recursion step, check if this variable is set. If it is, abort.
This may not work everywhere and requires platform-specific code to find out that the signal came from a stack overflow. Not all systems (notably, early 68000 systems) can continue normal processing after getting a SIGSEGV or SIGBUS.
A similar approach was used by the Bourne shell for memory allocation.
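A minimal sketch of those steps for Linux follows (assumptions: the stack grows downwards, the platform can resume after SIGSEGV, and getrlimit/setrlimit may be called from a handler, which POSIX does not strictly guarantee):
#include <signal.h>
#include <stdio.h>
#include <sys/resource.h>
static volatile sig_atomic_t stack_exhausted = 0;
static char altstack[64 * 1024]; /* the handler cannot run on the overflowed stack */
static void on_overflow(int sig)
{
    /* step 3: raise the soft limit so execution can resume, and set the flag */
    struct rlimit rl;
    (void)sig;
    getrlimit(RLIMIT_STACK, &rl);
    rl.rlim_cur = rl.rlim_max;
    setrlimit(RLIMIT_STACK, &rl);
    stack_exhausted = 1;
}
static void install(void)
{
    stack_t ss = { 0 };
    struct sigaction sa = { 0 };
    struct rlimit rl;
    ss.ss_sp = altstack; /* handler must run on an alternate stack */
    ss.ss_size = sizeof altstack;
    sigaltstack(&ss, NULL);
    sa.sa_handler = on_overflow;
    sa.sa_flags = SA_ONSTACK;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL); /* step 2: some systems raise SIGSEGV... */
    sigaction(SIGBUS, &sa, NULL);  /* ...others SIGBUS */
    getrlimit(RLIMIT_STACK, &rl);  /* step 1: soft limit below the hard limit */
    rl.rlim_cur = 1024 * 1024;     /* 1 MB; assumed to be below rlim_max */
    setrlimit(RLIMIT_STACK, &rl);
}
static void recurse(int depth)
{
    char pad[4096];       /* consume some stack per frame */
    pad[0] = (char)depth; /* touch it so it isn't optimized away */
    (void)pad;
    if (stack_exhausted) { /* step 4: check the flag on every recursion step */
        printf("aborting recursion at depth %d\n", depth);
        return;
    }
    recurse(depth + 1);
}
int main(void)
{
    install();
    recurse(0);
    return 0;
}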
Here's a simple solution that works for Win32. It actually resembles what Wossname already posted, but is less icky :)
unsigned int get_stack_address( void )
{
unsigned int r = 0;
__asm mov dword ptr [r], esp;
return r;
}
void rec( int x, const unsigned int begin_address )
{
// abort once roughly 100 000 bytes of stack have been consumed
if ( begin_address - get_stack_address() > 100000 )
{
//std::cout << "Recursion level " << x << " stack too high" << std::endl;
return;
}
rec( x + 1, begin_address );
}
int main( void )
{
int x = 0;
rec(x,get_stack_address());
}
Here's a naive method, but it's a bit icky...
When you enter the function for the first time, you could store the address of one of the variables declared in that function. Store that value outside your function (e.g. in a global). In subsequent calls, compare the current address of that variable with the cached copy. The deeper you recurse, the further apart these two values will be.
This will most likely cause compiler warnings (storing addresses of temporary variables) but it does have the benefit of giving you a fairly accurate way of knowing exactly how much stack you're using.
Can't say I really recommend this but it would work.
#include <stdio.h>
#include <stdlib.h> /* for abs() */
char* start = NULL;
void recurse()
{
    char marker = '#';
    int depth;
    if(start == NULL)
        start = &marker;
    depth = abs((int)(&marker - start)); /* distance from the outermost frame */
    printf("depth: %d\n", depth);
    if(depth < 1000)
        recurse();
    else
        start = NULL;
}
int main()
{
recurse();
return 0;
}
An alternative method is to learn the stack limit at the start of the program, and each time in your recursive function to check whether this limit has been approached (within some safety margin, say 64 kb). If so, abort; if not, continue.
The stack limit on POSIX systems can be learned by using getrlimit system call.
Example code that is thread-safe (note: the code assumes that the stack grows downwards, as on x86!):
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>
void *stack_limit;
#define SAFETY_MARGIN (64 * 1024) // 64 kb
void recurse(int level)
{
void *stack_top = &stack_top;
if (stack_top <= stack_limit) {
printf("stack limit reached at recursion level %d\n", level);
return;
}
recurse(level + 1);
}
int get_max_stack_size(void)
{
struct rlimit rl;
int ret = getrlimit(RLIMIT_STACK, &rl);
if (ret != 0) {
return 1024 * 1024 * 8; // 8 MB is the default on many platforms
}
printf("max stack size: %d\n", (int)rl.rlim_cur);
return rl.rlim_cur;
}
int main (int argc, char *argv[])
{
int x;
stack_limit = (char *)&x - get_max_stack_size() + SAFETY_MARGIN;
recurse(0);
return 0;
}
Output:
max stack size: 8388608
stack limit reached at recursion level 174549
I am working on a connect-four game simulator in C.
https://en.wikipedia.org/wiki/Connect_Four
The first step is to create a board environment for the game. I went ahead and made a data type board_t, a struct that includes a dynamically allocated array saving the moves played in a single-dimension array. board_t also includes the height and width of the board, so entries can be retrieved correctly.
I initialize this board in the board_create() function, and use the initialized board_t variable in a board_can_play() function to check whether any play is possible on a given board. Here is the code.
#include <stdlib.h>
#include <stdbool.h>
#include <assert.h>
#define PLAYER_BLUE 2
#define PLAYER_YELLOW 1
#define PLAYER_EMPTY 0
typedef unsigned char player_t;
typedef struct board_t
{
unsigned int width;
unsigned int height;
unsigned int run;
player_t * moves;
} board_t;
bool board_create (board_t ** b, unsigned int height, unsigned int width, unsigned int run, const player_t * i)
{
//Declare a board_t variable temp_b where parameters will be saved.
board_t temp_b;
//Create a pointer and malloc a memory location based on width and height.
temp_b.moves = malloc(sizeof(unsigned char)*(height*width));
//Iterate through the moves and initialize with the given player_t
int j;
for (j = 0; j < width*height; j++)
{
temp_b.moves[j] = PLAYER_EMPTY;
}
//Input all the values to temp_b
temp_b.height = height;
temp_b.width = width;
temp_b.run = run;
//Make a temporary pointer and assign that pointer to *b.
board_t * temp_b_ptr = malloc(sizeof(board_t));
temp_b_ptr = &temp_b;
*b = temp_b_ptr;
return true;
}
/// Return true if the specified player can make a move on the
/// board
bool board_can_play (const board_t * b, player_t p)
{
unsigned int i;
unsigned int height = board_get_height(b);
unsigned int width = board_get_width(b);
for(i = (height-1)*width; i < height*width; i++)
{
if (b->moves[i] == PLAYER_EMPTY)
{
return true;
}
}
return false;
}
However, whenever I access the board_t *b inside board_can_play(), the program gives a segmentation fault. More specifically,
if (b->moves[i] == PLAYER_EMPTY)
This line is giving me a segmentation fault. Also, functions that worked well in main() are not working here in board_can_play(). For instance,
unsigned int height = board_get_height(b);
unsigned int width = board_get_width(b);
are supposed to return 3 and 3, but are returning 2 and 419678. I have spent about 7 hours on this now, but cannot figure out what is going on.
In the if statement that gives you the segfault,
if (b->moves[i] == PLAYER_EMPTY)
the problem is not how moves was allocated, but how b itself was allocated. In board_create(), you are returning a pointer to a temporary object here:
board_t * temp_b_ptr = malloc(sizeof(board_t));
temp_b_ptr = &temp_b;
*b = temp_b_ptr;
The malloc'ed pointer is lost (you overwrite it) and you simply return (through *b) a pointer to a local variable, which ceases to exist when board_create() returns.
So move the allocation to the top and use temp_b_ptr instead of temp_b:
board_t *temp_b_ptr = malloc(sizeof(board_t));
if( !temp_b_ptr ) {
/* error handling */
}
....
....
*b = temp_b_ptr;
I would approach your problem in the following way. Note that I have stubbed in some error handling, as well as adding a method to destroy the board when done.
The following code compiles without warnings on Ubuntu 14.01 LTS, using gcc-4.8.2. I compile it with the following command line:
gcc -g -std=c99 -pedantic -Wall connect4.c -o connect4
Now, on to the code. You didn't provide a main, so I created a quick stub main:
#include <stdlib.h>
#include <stdbool.h>
#include <stdio.h>
#include <assert.h>
#define PLAYER_BLUE 2
#define PLAYER_YELLOW 1
#define PLAYER_EMPTY 0
typedef unsigned char player_t;
typedef struct board_t
{
unsigned int width;
unsigned int height;
unsigned int run;
player_t * moves;
} board_t;
bool board_create(board_t** b, unsigned int height, unsigned int width);
void board_destroy(board_t** b);
int board_get_height(const board_t* b);
int board_get_width(const board_t* b);
int main(int argc, char** argv)
{
board_t* pBoard = NULL;
if(board_create(&pBoard, 4, 4))
{
printf("board dimensions: %d by %d\n", board_get_height(pBoard), board_get_width(pBoard));
// TODO : put game logic here...
board_destroy(&pBoard);
}
else
{
fprintf(stderr, "failed to initialize the board structure\n");
}
return 0;
}
Not a lot to see in main, much as you would expect. Next is the board_create function. Note that I deleted the run and player_t parameters because I didn't see you use them in your code.
bool board_create(board_t** b, unsigned int height, unsigned int width)
{
bool bRet = false;
if(*b != NULL) // we already have a board struct laying about
{
board_destroy(b);
}
if(NULL != (*b = malloc(sizeof(board_t))))
{
(*b)->width = width;
(*b)->height = height;
if(NULL != ((*b)->moves = malloc(sizeof(player_t) * (height * width)))) /* element size, not pointer size */
{
for(unsigned int j = 0; j < height * width; j++)
(*b)->moves[j] = PLAYER_EMPTY;
bRet = true;
}
else
{
/* TODO : handle allocation error of moves array */
}
}
else
{
/* TODO : handle allocation error of board struct */
}
return bRet;
}
A couple of comments on this function:
First, a bit of defensive programming: I check that the board structure has not already been allocated. If it has, I destroy the previous board before creating a new one. This prevents a memory leak: if a board had been allocated and we then called this function again, we would overwrite the pointer to the original board and lose our 'handle' to it.
Notice that every call to malloc is checked to make sure we actually got the memory we wanted. I tend to place the check in the same statement as the malloc, but that is personal preference.
The return value is now actually meaningful. In your original code, you returned true regardless of whether the allocations succeeded. Notice that I only return true after both allocations have been performed and succeeded.
OK, on to the new function I added, board_destroy:
void board_destroy(board_t** b)
{
if(*b != NULL) // no board struct, nothing to do..
{
if((*b)->moves != NULL)
{
free((*b)->moves);
}
free(*b);
*b = NULL;
}
}
Some comments on this function:
A bit more defensive programming: I check that we actually have a board structure to get rid of before doing any work.
Remember that your board structure holds a dynamically allocated array, so you need to free that array first (freeing the board structure first would mean losing your only reference to the moves array, and leaking it).
Before freeing the moves array, I again check that it exists.
Once the moves array is destroyed, I destroy the board structure and set the pointer back to NULL (in case we want to reuse the board pointer in main).
You didn't provide implementation details of the board_get_* functions, but from their usage, I suspect that you have them implemented as:
int board_get_height(const board_t* b)
{
return (b->height);
}
int board_get_width(const board_t* b)
{
return (b->width);
}
I didn't do anything with your board_can_play function, as I am not sure how you intend to use it.
A quick run of the above code:
******#ubuntu:~/junk$ ./connect4
board dimensions: 4 by 4
******#ubuntu:~/junk$
My personal opinion is that when doing lots of memory allocations and frees in C or C++, you should run your program under valgrind periodically to make sure you are not leaking memory or hitting other memory-related errors. Below is a sample run of this code under valgrind:
*****#ubuntu:~/junk$ valgrind --tool=memcheck --leak-check=full ./connect4
==4265== Memcheck, a memory error detector
==4265== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==4265== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==4265== Command: ./connect4
==4265==
board dimensions: 4 by 4
==4265==
==4265== HEAP SUMMARY:
==4265== in use at exit: 0 bytes in 0 blocks
==4265== total heap usage: 2 allocs, 2 frees, 40 bytes allocated
==4265==
==4265== All heap blocks were freed -- no leaks are possible
==4265==
==4265== For counts of detected and suppressed errors, rerun with: -v
==4265== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Hope this helps,
T.
The following static allocation gives a segmentation fault:
double U[100][2048][2048];
But the following dynamic allocation goes fine:
int i, j;
double ***U = (double ***)malloc(100 * sizeof(double **));
for(i=0;i<100;i++)
{
U[i] = (double **)malloc(2048 * sizeof(double *));
for(j=0;j<2048;j++)
{
U[i][j] = (double *)malloc(2048*sizeof(double));
}
}
The ulimit is set to unlimited on Linux.
Can anyone give me a hint on what's happening?
When you say the ulimit is set to unlimited, are you using the -s option? Otherwise it doesn't change the stack limit, only the file size limit.
There appear to be stack limits regardless, though. I can allocate:
double *u = malloc(200*2048*2048*(sizeof(double))); // 6gb contiguous memory
And running the binary I get:
VmData: 6553660 kB
However, if I allocate on the stack, it's:
double u[200][2048][2048];
VmStk: 2359308 kB
Which is clearly not correct (suggesting overflow). With the original allocations, the two give the same results:
Array: VmStk: 3276820 kB
malloc: VmData: 3276860 kB
However, running the stack version, I cannot generate a segfault no matter what the size of the array -- even if it's more than the total memory actually on the system, if -s unlimited is set.
EDIT:
I did a test with malloc in a loop until it failed:
VmData: 137435723384 kB // my system doesn't quite have 131068gb RAM
Stack usage never gets above 4gb, however.
Assuming your machine actually has enough free memory to allocate 3.125 GiB of data, the difference most likely lies in the fact that the static allocation needs all of this memory to be contiguous (it's actually a 3-dimensional array), while the dynamic allocation only needs contiguous blocks of about 2048*8 = 16 KiB (it's an array of pointers to arrays of pointers to quite small actual arrays).
It is also possible that your operating system uses swap files for heap memory when it runs out, but not for stack memory.
There is a very good discussion of Linux memory management, and specifically the stack, here: 9.7 Stack overflow; it is worth the read.
You can use this command to find out what your current stack soft limit is
ulimit -s
On Mac OS X the hard limit is 64MB, see How to change the stack size using ulimit or per process on Mac OS X for a C or Ruby program?
You can modify the stack limit at run-time from your program, see Change stack size for a C++ application in Linux during compilation with GNU compiler
I combined your code with the sample there; here's a working program:
#include <stdio.h>
#include <sys/resource.h>
unsigned myrand() {
static unsigned x = 1;
return (x = x * 1664525 + 1013904223);
}
void increase_stack( rlim_t stack_size )
{
rlim_t MIN_STACK = 1024 * 1024;
stack_size += MIN_STACK;
struct rlimit rl;
int result;
result = getrlimit(RLIMIT_STACK, &rl);
if (result == 0)
{
if (rl.rlim_cur < stack_size)
{
rl.rlim_cur = stack_size;
result = setrlimit(RLIMIT_STACK, &rl);
if (result != 0)
{
fprintf(stderr, "setrlimit returned result = %d\n", result);
}
}
}
}
void my_func() {
double U[100][2048][2048];
int i,j,k;
for(i=0;i<100;++i)
for(j=0;j<2048;++j)
for(k=0;k<2048;++k)
U[i][j][k] = myrand();
double sum = 0;
int n;
for(n=0;n<1000;++n)
sum += U[myrand()%100][myrand()%2048][myrand()%2048];
printf("sum=%g\n",sum);
}
int main() {
increase_stack( sizeof(double) * 100 * 2048 * 2048 );
my_func();
return 0;
}
You are hitting the stack limit. By default on Windows the stack is 1 MB, but it can grow if there is enough memory.
On many *nix systems the default stack size is 512 KB.
You are trying to allocate 2048 * 2048 * 100 * 8 bytes, which is over 2^31 bytes (more than 3 GB), far beyond any default stack limit. If you have a lot of virtual memory available and still want to allocate this on the stack, use a different stack limit while linking the application.
Linux:
How to increase the gcc executable stack size?
Change stack size for a C++ application in Linux during compilation with GNU compiler
Windows:
http://msdn.microsoft.com/en-us/library/tdkhxaks%28v=vs.110%29.aspx
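For example (hedged: exact flags depend on the toolchain; these are the MinGW/GNU ld and MSVC spellings, and on Linux the limit is a per-process setting rather than something baked into the binary):
gcc main.c -Wl,--stack,536870912       # MinGW/Windows: reserve a 512 MB stack at link time
cl main.c /link /STACK:536870912       # MSVC: the /STACK linker option from the MSDN link above
ulimit -s unlimited                    # Linux: raise the limit in the shell instead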
I need to dynamically allocate some arrays inside the kernel function. How can I do that?
My code is something like this:
__global__ void func(float *grid_d,int n, int nn){
int i,j;
float x[n],y[nn];
//Do some really cool and heavy computations here that takes hours.
}
But that will not work. If this were host code I could use malloc. cudaMalloc needs a pointer on the host, and it allocates memory on the device; inside the kernel function I don't have the host pointer.
So, what should I do?
If it takes too long (a few seconds) to allocate all the arrays (I need about 4 of size n and 5 of size nn), that won't be a problem, since the kernel will probably run for at least 20 minutes.
Dynamic memory allocation is only supported on compute capability 2.x and newer hardware. You can use either the C++ new keyword or malloc in the kernel, so your example could become:
__global__ void func(float *grid_d,int n, int nn){
    int i,j;
    float *x = new float[n], *y = new float[nn];
    // ... computations ...
    delete [] x; delete [] y; // the runtime heap outlives the kernel, so free explicitly
}
This allocates memory on a local memory runtime heap which has the lifetime of the context, so make sure you free the memory after the kernel finishes running if your intention is not to use the memory again. You should also note that runtime heap memory cannot be accessed directly from the host APIs, so you cannot pass a pointer allocated inside a kernel as an argument to cudaMemcpy, for example.
@talonmies answered your question on how to dynamically allocate memory within a kernel. This is intended as a supplemental answer, addressing the performance of __device__ malloc() and an alternative you might want to consider.
Allocating memory dynamically in the kernel can be tempting because it allows GPU code to look more like CPU code. But it can seriously affect performance. I wrote a self-contained test and have included it below. The test launches some 2.6 million threads. Each thread populates 16 integers of global memory with some values derived from the thread index, then sums up the values and returns the sum.
The test implements two approaches. The first approach uses __device__ malloc() and the second approach uses memory that is allocated before the kernel runs.
On my 2.0 device, the kernel runs in 1500ms when using __device__ malloc() and 27ms when using pre-allocated memory. In other words, the test takes 56x longer to run when memory is allocated dynamically within the kernel. The time includes the outer loop cudaMalloc() / cudaFree(), which is not part of the kernel. If the same kernel is launched many times with the same number of threads, as is often the case, the cost of the cudaMalloc() / cudaFree() is amortized over all the kernel launches. That brings the difference even higher, to around 60x.
Speculating, I think that the performance hit is in part caused by implicit serialization. The GPU must probably serialize all simultaneous calls to __device__ malloc() in order to provide separate chunks of memory to each caller.
The version that does not use __device__ malloc() allocates all the GPU memory before running the kernel. A pointer to the memory is passed to the kernel. Each thread calculates an index into the previously allocated memory instead of using a __device__ malloc().
The potential issue with allocating memory up front is that, if only some threads need to allocate memory, and it is not known which threads those are, it will be necessary to allocate memory for all the threads. If there is not enough memory for that, it might be more efficient to reduce the number of threads per kernel call than to use __device__ malloc(). Other workarounds would probably end up reimplementing what __device__ malloc() is doing in the background, and would see a similar performance hit.
Test the performance of __device__ malloc():
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
const int N_ITEMS(16);
#define USE_DYNAMIC_MALLOC
__global__ void test_malloc(int* totals)
{
int tx(blockIdx.x * blockDim.x + threadIdx.x);
int* s(new int[N_ITEMS]);
for (int i(0); i < N_ITEMS; ++i) {
s[i] = tx * i;
}
int total(0);
for (int i(0); i < N_ITEMS; ++i) {
total += s[i];
}
totals[tx] = total;
delete[] s;
}
__global__ void test_malloc_2(int* items, int* totals)
{
int tx(blockIdx.x * blockDim.x + threadIdx.x);
int* s(items + tx * N_ITEMS);
for (int i(0); i < N_ITEMS; ++i) {
s[i] = tx * i;
}
int total(0);
for (int i(0); i < N_ITEMS; ++i) {
total += s[i];
}
totals[tx] = total;
}
int main()
{
cudaError_t cuda_status;
cudaSetDevice(0);
int blocks_per_launch(1024 * 10);
int threads_per_block(256);
int threads_per_launch(blocks_per_launch * threads_per_block);
int* totals_d;
cudaMalloc((void**)&totals_d, threads_per_launch * sizeof(int));
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaDeviceSynchronize();
cudaEventRecord(start, 0);
#ifdef USE_DYNAMIC_MALLOC
cudaDeviceSetLimit(cudaLimitMallocHeapSize, threads_per_launch * N_ITEMS * sizeof(int));
test_malloc<<<blocks_per_launch, threads_per_block>>>(totals_d);
#else
int* items_d;
cudaMalloc((void**)&items_d, threads_per_launch * sizeof(int) * N_ITEMS);
test_malloc_2<<<blocks_per_launch, threads_per_block>>>(items_d, totals_d);
cudaFree(items_d);
#endif
cuda_status = cudaDeviceSynchronize();
if (cuda_status != cudaSuccess) {
printf("Error: %d\n", cuda_status);
exit(1);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
printf("Elapsed: %f\n", elapsedTime);
int* totals_h(new int[threads_per_launch]);
cuda_status = cudaMemcpy(totals_h, totals_d, threads_per_launch * sizeof(int), cudaMemcpyDeviceToHost);
if (cuda_status != cudaSuccess) {
printf("Error: %d\n", cuda_status);
exit(1);
}
for (int i(0); i < 10; ++i) {
printf("%d ", totals_h[i]);
}
printf("\n");
cudaFree(totals_d);
delete[] totals_h;
return cuda_status;
}
Output:
C:\rd\projects\test_cuda_malloc\Release>test_cuda_malloc.exe
Elapsed: 27.311169
0 120 240 360 480 600 720 840 960 1080
C:\rd\projects\test_cuda_malloc\Release>test_cuda_malloc.exe
Elapsed: 1516.711914
0 120 240 360 480 600 720 840 960 1080
If the values of n and nn are known before the kernel is called, then why not cudaMalloc the memory on the host side and pass the device memory pointer to the kernel?
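A rough sketch of that suggestion (x_all, y_all and num_threads are illustrative names): give every thread its own slice of two large pre-allocated slabs, sized n and nn elements per thread:
__global__ void func(float *grid_d, float *x_all, float *y_all, int n, int nn)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float *x = x_all + (size_t)tid * n;  // this thread's private slice of x
    float *y = y_all + (size_t)tid * nn; // this thread's private slice of y
    // ... heavy computations using x and y ...
}
// Host side, before the launch:
//   float *x_all, *y_all;
//   cudaMalloc(&x_all, (size_t)num_threads * n * sizeof(float));
//   cudaMalloc(&y_all, (size_t)num_threads * nn * sizeof(float));
//   func<<<blocks, threads_per_block>>>(grid_d, x_all, y_all, n, nn);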
I ran an experiment based on the concepts in @rogerdahl's post. Assumptions:
4MB of memory allocated in 64B chunks.
1 GPU block with 32 threads (one warp) in that block
Run on a P100
The malloc+free calls local to the GPU seemed to be much faster than the cudaMalloc + cudaFree calls. The program's output:
Starting timer for cuda malloc timer
Stopping timer for cuda malloc timer
timer for cuda malloc timer took 1.169631s
Starting timer for device malloc timer
Stopping timer for device malloc timer
timer for device malloc timer took 0.029794s
I'm leaving out the code for timer.h and timer.cpp, but here's the code for the test itself:
#include "cuda_runtime.h"
#include <stdio.h>
#include <thrust/system/cuda/error.h>
#include "timer.h"
static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)
const int BLOCK_COUNT = 1;
const int THREADS_PER_BLOCK = 32;
const int ITERATIONS = 1 << 12;
const int ITERATIONS_PER_BLOCKTHREAD = ITERATIONS / (BLOCK_COUNT * THREADS_PER_BLOCK);
const int ARRAY_SIZE = 64;
void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err) {
if (err == cudaSuccess)
return;
std::cerr << statement<<" returned " << cudaGetErrorString(err) << "("<<err<< ") at "<<file<<":"<<line << std::endl;
exit (1);
}
__global__ void mallocai() {
for (int i = 0; i < ITERATIONS_PER_BLOCKTHREAD; ++i) {
int * foo;
foo = (int *) malloc(sizeof(int) * ARRAY_SIZE);
free(foo);
}
}
int main() {
Timer cuda_malloc_timer("cuda malloc timer");
for (int i = 0; i < ITERATIONS; ++ i) {
if (i == 1) cuda_malloc_timer.start(); // let it warm up one cycle
int * foo;
cudaMalloc(&foo, sizeof(int) * ARRAY_SIZE);
cudaFree(foo);
}
cuda_malloc_timer.stop_and_report();
CUDA_CHECK_RETURN(cudaDeviceSynchronize());
Timer device_malloc_timer("device malloc timer");
device_malloc_timer.start();
mallocai<<<BLOCK_COUNT, THREADS_PER_BLOCK>>>();
CUDA_CHECK_RETURN(cudaDeviceSynchronize());
device_malloc_timer.stop_and_report();
}
If you find mistakes, please let me know in the comments, and I'll try to fix them.
And I ran them again with everything larger:
const int BLOCK_COUNT = 56;
const int THREADS_PER_BLOCK = 1024;
const int ITERATIONS = 1 << 18;
const int ITERATIONS_PER_BLOCKTHREAD = ITERATIONS / (BLOCK_COUNT * THREADS_PER_BLOCK);
const int ARRAY_SIZE = 1024;
And cudaMalloc was still slower by a lot:
Starting timer for cuda malloc timer
Stopping timer for cuda malloc timer
timer for cuda malloc timer took 74.878016s
Starting timer for device malloc timer
Stopping timer for device malloc timer
timer for device malloc timer took 0.167331s
Maybe you should test
cudaMalloc(&foo,sizeof(int) * ARRAY_SIZE * ITERATIONS);
cudaFree(foo);
instead of
for (int i = 0; i < ITERATIONS; ++ i) {
if (i == 1) cuda_malloc_timer.start(); // let it warm up one cycle
int * foo;
cudaMalloc(&foo, sizeof(int) * ARRAY_SIZE);
cudaFree(foo);
}
programming language: C
platform: ARM
Compiler: ADS 1.2
I need to keep track of simple malloc/free calls in my project. I just need to get a very basic idea of how much heap memory is required when the program has allocated all its resources. Therefore, I have provided wrappers for the malloc/free calls. In these wrappers I need to increment a running memory count when malloc is called and decrement it when free is called. The malloc case is straightforward, as I have the size to allocate from the caller. I am wondering how to deal with the free case, as I need to store the pointer/size mapping somewhere. This being C, I do not have a standard map to implement this easily.
I am trying to avoid linking in any libraries so would prefer *.c/h implementation.
So I am wondering if there is already a simple implementation someone can point me to. If not, this is motivation to go ahead and implement one.
EDIT: Purely for debugging and this code is not shipped with the product.
EDIT: Initial implementation based on answer from Makis. I would appreciate feedback on this.
EDIT: Reworked implementation
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <assert.h>
#include <string.h>
#include <limits.h>
static size_t gnCurrentMemory = 0;
static size_t gnPeakMemory = 0;
void *MemAlloc (size_t nSize)
{
void *pMem = malloc(sizeof(size_t) + nSize);
if (pMem)
{
size_t *pSize = (size_t *)pMem;
memcpy(pSize, &nSize, sizeof(nSize));
gnCurrentMemory += nSize;
if (gnCurrentMemory > gnPeakMemory)
{
gnPeakMemory = gnCurrentMemory;
}
printf("PMemAlloc (%#X) - Size (%d), Current (%d), Peak (%d)\n",
pSize + 1, nSize, gnCurrentMemory, gnPeakMemory);
return(pSize + 1);
}
return NULL;
}
void MemFree (void *pMem)
{
if(pMem)
{
size_t *pSize = (size_t *)pMem;
// Get the size
--pSize;
assert(gnCurrentMemory >= *pSize);
printf("PMemFree (%#X) - Size (%d), Current (%d), Peak (%d)\n",
pMem, *pSize, gnCurrentMemory, gnPeakMemory);
gnCurrentMemory -= *pSize;
free(pSize);
}
}
#define BUFFERSIZE (1024*1024)
typedef struct
{
bool flag;
int buffer[BUFFERSIZE];
bool bools[BUFFERSIZE];
} sample_buffer;
typedef struct
{
unsigned int whichbuffer;
char ch;
} buffer_info;
int main(void)
{
unsigned int i;
buffer_info *bufferinfo;
sample_buffer *mybuffer;
char *pCh;
printf("Tesint MemAlloc - MemFree\n");
mybuffer = (sample_buffer *) MemAlloc(sizeof(sample_buffer));
if (mybuffer == NULL)
{
printf("ERROR ALLOCATING mybuffer\n");
return EXIT_FAILURE;
}
bufferinfo = (buffer_info *) MemAlloc(sizeof(buffer_info));
if (bufferinfo == NULL)
{
printf("ERROR ALLOCATING bufferinfo\n");
MemFree(mybuffer);
return EXIT_FAILURE;
}
pCh = (char *)MemAlloc(sizeof(char));
printf("finished malloc\n");
// fill allocated memory with integers and read back some values
for(i = 0; i < BUFFERSIZE; ++i)
{
mybuffer->buffer[i] = i;
mybuffer->bools[i] = true;
bufferinfo->whichbuffer = (unsigned int)(i/100);
}
MemFree(bufferinfo);
MemFree(mybuffer);
if(pCh)
{
MemFree(pCh);
}
return EXIT_SUCCESS;
}
You could allocate a few extra bytes in your wrapper and put either an id (if you want to be able to couple malloc() and free()) or just the size there. Just malloc() that much more memory, store the information at the beginning of the memory block, and move the pointer you return that many bytes forward.
This can, btw, also easily be used for fence pointers/finger-prints and such.
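A minimal sketch of that header idea (the layout and magic value are illustrative): store the size plus a magic word in front of the block, and verify the magic on free:
#include <stdio.h>
#include <stdlib.h>
#define FENCE_MAGIC 0xFDC0FFEEu
typedef struct { size_t size; unsigned magic; } header_t;
void *dbg_malloc(size_t size)
{
    header_t *h = malloc(sizeof *h + size);
    if (h == NULL)
        return NULL;
    h->size = size;         /* remembered so dbg_free can do the accounting */
    h->magic = FENCE_MAGIC; /* fence: detects corruption and foreign pointers */
    return h + 1;           /* caller sees only the bytes after the header */
}
void dbg_free(void *p)
{
    header_t *h;
    if (p == NULL)
        return;
    h = (header_t *)p - 1;  /* step back to the hidden header */
    if (h->magic != FENCE_MAGIC)
        fprintf(stderr, "dbg_free: bad fence on %p\n", p);
    free(h);
}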
Either you can have access to internal tables used by malloc/free (see this question: Where Do malloc() / free() Store Allocated Sizes and Addresses? for some hints), or you have to manage your own tables in your wrappers.
You could always use valgrind instead of rolling your own implementation. If you don't care about the amount of memory you allocate, you could use an even simpler implementation. (I did this really quickly, so there could be errors, and I realize it is not the most efficient implementation. pAllocedStorage should be given an initial size and grow by some factor on resize, etc., but you get the idea.)
EDIT: I missed that this was for ARM; to my knowledge valgrind is not available on ARM, so that might not be an option.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
static size_t indexAllocedStorage = 0;
static size_t *pAllocedStorage = NULL;
static unsigned int free_calls = 0;
static unsigned long long int total_mem_alloced = 0;
void *
my_malloc(size_t size){
size_t *temp;
void *p = malloc(size);
if(p == NULL){
fprintf(stderr,"my_malloc malloc failed, %s", strerror(errno));
exit(EXIT_FAILURE);
}
total_mem_alloced += size;
temp = (size_t *)realloc(pAllocedStorage, (indexAllocedStorage+1) * sizeof(size_t));
if(temp == NULL){
fprintf(stderr,"my_malloc realloc failed, %s", strerror(errno));
exit(EXIT_FAILURE);
}
pAllocedStorage = temp;
pAllocedStorage[indexAllocedStorage++] = (size_t)p;
return p;
}
void
my_free(void *p){
size_t i;
int found = 0;
for(i = 0; i < indexAllocedStorage; i++){
if(pAllocedStorage[i] == (size_t)p){
pAllocedStorage[i] = (size_t)NULL;
found = 1;
break;
}
}
if(!found){
printf("Free Called on unknown\n");
}
free_calls++;
free(p);
}
void
free_check(void) {
size_t i;
printf("checking freed memeory\n");
for(i = 0; i < indexAllocedStorage; i++){
if(pAllocedStorage[i] != (size_t)NULL){
printf( "Memory leak %X\n", (unsigned int)pAllocedStorage[i]);
free((void *)pAllocedStorage[i]);
}
}
free(pAllocedStorage);
pAllocedStorage = NULL;
}
I would use rmalloc. It is a simple library (actually it is only two files) to debug memory usage, but it also has support for statistics. Since you already wrapper functions it should be very easy to use rmalloc for it. Keep in mind that you also need to replace strdup, etc.
Your program may also need to intercept realloc(), calloc(), getcwd() (as it may allocate memory when buffer is NULL in some implementations) and maybe strdup() or a similar function, if it is supported by your compiler
If you are running on x86 you could just run your binary under valgrind and it would gather all this information for you, using the standard implementation of malloc and free. Simple.
I've been trying out some of the same techniques mentioned on this page and wound up here from a google search. I know this question is old, but wanted to add for the record...
1) Does your operating system not provide any tools to see how much heap memory is in use in a running process? I see you're talking about ARM, so this may well be the case. In most full-featured OSes, this is just a matter of using a cmd-line tool to see the heap size.
2) If available in your libc, sbrk(0) on most platforms will tell you the end address of your data segment. If you have it, all you need to do is store that address at the start of your program (say, startBrk=sbrk(0)), then at any time your allocated size is sbrk(0) - startBrk.
3) If shared objects can be used, you're dynamically linking to your libc, and your OS's runtime loader has something like an LD_PRELOAD environment variable, you might find it more useful to build your own shared object that defines the actual libc functions with the same symbols (malloc(), not MemAlloc()), then have the loader load your lib first and "interpose" the libc functions. You can further obtain the addresses of the actual libc functions with dlsym() and the RTLD_NEXT flag so you can do what you are doing above without having to recompile all your code to use your malloc/free wrappers. It is then just a runtime decision when you start your program (or any program that fits the description in the first sentence) where you set an environment variable like LD_PRELOAD=mymemdebug.so and then run it. (google for shared object interposition.. it's a great technique and one used by many debuggers/profilers)