Why do very large stack allocations fail despite unlimited ulimit? - c

The following static allocation gives a segmentation fault:
double U[100][2048][2048];
But the following dynamic allocation works fine:
int i, j;
double ***U = (double ***)malloc(100 * sizeof(double **));
for (i = 0; i < 100; i++)
{
    U[i] = (double **)malloc(2048 * sizeof(double *));
    for (j = 0; j < 2048; j++)
    {
        U[i][j] = (double *)malloc(2048 * sizeof(double));
    }
}
The ulimit is set to unlimited on Linux.
Can anyone give me a hint on what's happening?

When you say the ulimit is set to unlimited, are you using the -s option? Otherwise ulimit only changes the file size limit, not the stack limit.
There appear to be stack limits regardless, though. I can allocate:
double *u = malloc(200 * 2048 * 2048 * sizeof(double)); // ~6.25 GB of contiguous memory
And running the binary I get:
VmData: 6553660 kB
However, if I allocate on the stack, it's:
double u[200][2048][2048];
VmStk: 2359308 kB
which is clearly not correct (it should be about 6553600 kB; the reported value appears to have wrapped at 4 GiB, suggesting overflow). With the original allocations, the two give the same results:
Array: VmStk: 3276820 kB
malloc: VmData: 3276860 kB
However, running the stack version, I cannot generate a segfault no matter what the size of the array -- even if it's more than the total memory actually on the system, if -s unlimited is set.
EDIT:
I did a test with malloc in a loop until it failed:
VmData: 137435723384 kB // my system doesn't quite have 131068gb RAM
Stack usage never gets above 4gb, however.
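(For reference, those VmStk/VmData numbers come from /proc/self/status; here is a minimal Linux-only sketch of reading them from inside the process itself. The allocation shown is just a placeholder for whatever you want to measure.)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print the VmStk and VmData lines from /proc/self/status (Linux only). */
static void print_mem_usage(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];

    if (f == NULL)
        return;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "VmStk:", 6) == 0 || strncmp(line, "VmData:", 7) == 0)
            fputs(line, stdout);
    }
    fclose(f);
}

int main(void)
{
    /* placeholder allocation to make VmData visibly grow */
    double *u = malloc(200UL * 2048 * 2048 * sizeof(double));
    print_mem_usage();
    free(u);
    return 0;
}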

Assuming your machine actually has enough free memory to allocate 3.125 GiB of data, the difference most likely lies in the fact that the static allocation needs all of this memory to be contiguous (it's actually a 3-dimensional array), while the dynamic allocation only needs contiguous blocks of about 2048*8 = 16 KiB (it's an array of pointers to arrays of pointers to quite small actual arrays).
It is also possible that your operating system uses swap files for heap memory when it runs out, but not for stack memory.
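(Not part of the answer above, but as a middle ground between the two approaches: you can allocate one contiguous heap block and index it manually, which keeps the data contiguous while avoiding both the stack and the pointer-to-pointer indirection. A minimal sketch, assuming a 64-bit system with enough free memory.)
#include <stdlib.h>

#define NX 100
#define NY 2048
#define NZ 2048

int main(void)
{
    /* One contiguous 3.125 GiB heap block instead of 100*2048 small ones. */
    double *U = malloc((size_t)NX * NY * NZ * sizeof *U);
    if (U == NULL)
        return 1;

    /* Element [i][j][k] lives at U[(i*NY + j)*NZ + k]. */
    U[(3 * NY + 5) * NZ + 7] = 42.0;

    free(U);
    return 0;
}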

There is a very good discussion of Linux memory management - and specifically the stack - here: 9.7 Stack overflow; it is worth the read.
You can use this command to find out what your current stack soft limit is:
ulimit -s
On Mac OS X the hard limit is 64MB, see How to change the stack size using ulimit or per process on Mac OS X for a C or Ruby program?
You can modify the stack limit at run-time from your program, see Change stack size for a C++ application in Linux during compilation with GNU compiler
I combined your code with the sample there; here's a working program:
#include <stdio.h>
#include <sys/resource.h>

unsigned myrand() {
    static unsigned x = 1;
    return (x = x * 1664525 + 1013904223);
}

void increase_stack( rlim_t stack_size )
{
    rlim_t MIN_STACK = 1024 * 1024;
    stack_size += MIN_STACK;

    struct rlimit rl;
    int result;

    result = getrlimit(RLIMIT_STACK, &rl);
    if (result == 0)
    {
        if (rl.rlim_cur < stack_size)
        {
            rl.rlim_cur = stack_size;
            result = setrlimit(RLIMIT_STACK, &rl);
            if (result != 0)
            {
                fprintf(stderr, "setrlimit returned result = %d\n", result);
            }
        }
    }
}

void my_func() {
    double U[100][2048][2048];
    int i,j,k;
    for(i=0;i<100;++i)
        for(j=0;j<2048;++j)
            for(k=0;k<2048;++k)
                U[i][j][k] = myrand();

    double sum = 0;
    int n;
    for(n=0;n<1000;++n)
        sum += U[myrand()%100][myrand()%2048][myrand()%2048];
    printf("sum=%g\n",sum);
}

int main() {
    increase_stack( sizeof(double) * 100 * 2048 * 2048 );
    my_func();
    return 0;
}

You are hitting a limit of the stack. By default on Windows the stack is 1 MB, but it can grow more if there is enough memory.
On many *nix systems the default stack size is also small (typically 8 MB for the main thread, and as little as 512 KB for secondary threads).
You are trying to allocate 2048 * 2048 * 100 * 8 bytes, which is about 3.1 GB - far more than any default stack limit. If you have a lot of virtual memory available and still want to allocate this on the stack, raise the stack limit when linking the application (or at run time on Linux):
Linux:
How to increase the gcc executable stack size?
Change stack size for a C++ application in Linux during compilation with GNU compiler
Windows:
http://msdn.microsoft.com/en-us/library/tdkhxaks%28v=vs.110%29.aspx
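(An alternative that avoids changing process-wide limits altogether, not covered by the links above: run the allocation-heavy code in a thread whose stack size you set explicitly with pthread_attr_setstacksize. A minimal POSIX sketch, assuming a 64-bit system with enough virtual memory; compile with -pthread. The 4 GiB figure is just an illustrative choice that comfortably fits the 3.125 GiB array.)
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;
    /* The big automatic array lives on this thread's private stack. */
    double U[100][2048][2048];
    U[0][0][0] = 1.0;
    printf("allocated %zu bytes on the thread stack\n", sizeof U);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    /* Request a 4 GiB stack for the worker thread. */
    pthread_attr_setstacksize(&attr, (size_t)4 * 1024 * 1024 * 1024);

    if (pthread_create(&tid, &attr, worker, NULL) != 0) {
        perror("pthread_create");
        return 1;
    }
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}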

Related

Why does my program occupy 32KB and not 11200B? (Valgrind's massif)

The n-body program does this at the beginning:
real4 *pin = (real4*)malloc(n * sizeof(real4));
real4 *pout = (real4*)malloc(n * sizeof(real4));
real3 *v = (real3*)malloc(n * sizeof(real3));
real3 *f = (real3*)malloc(n * sizeof(real3));
the total size of this should be (if n = 100): 100*32 + 100*32 + 100*24 + 100*24 = 11200 B, but Valgrind's massif reports about 32 KB.
I am not familiar with massif, but when talking about heap memory there are two numbers that are interesting: how much memory the allocator has requested from the OS, and how much the allocator has handed to your program through malloc(). If your program has requested ~11 KB, it is reasonable to think that the allocator may have requested a round number like 32 KB from the OS. The allocator typically requests memory in large blocks from the OS, since kernel calls are slow (and for a few other reasons).
So I would guess that the 32 KB you are seeing is what the allocator has acquired from the OS, ready to be handed to your program through any additional malloc() calls that may happen.
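(If you want to see both numbers yourself, glibc exposes them through mallinfo(); a rough, glibc-specific sketch - the exact figures will differ from massif's report, but the arena value shows how much heap the allocator obtained from the OS versus the uordblks value actually handed out.)
#include <malloc.h>  /* mallinfo() is glibc-specific */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Roughly mirror the question: ~11200 bytes of small allocations. */
    void *a = malloc(3200), *b = malloc(3200), *c = malloc(2400), *d = malloc(2400);

    struct mallinfo mi = mallinfo();
    printf("requested by program : ~11200 bytes\n");
    printf("in use (uordblks)    : %d bytes\n", mi.uordblks);
    printf("heap from OS (arena) : %d bytes\n", mi.arena);

    free(a); free(b); free(c); free(d);
    return 0;
}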

Get maximum available heap memory

I'm currently trying to figure out the maximum memory that is able to be allocated through the malloc() command in C.
Until now I've tried a simple algorithm that increments a counter which is then used as the allocation size. If malloc() returns NULL, I know that there is not enough memory available.
ULONG ulMaxSize = 0;

for (ULONG ulSize = /*0x40036FF0*/ 0x40A00000; ulSize <= 0xffffffff; ulSize++)
{
    void* pBuffer = malloc(ulSize);

    if (pBuffer == NULL)
    {
        ulMaxSize = ulSize - 1;
        break;
    }

    free(pBuffer);
}

void* pMaxBuffer = malloc(ulMaxSize);
However, this algorithm takes very long to execute, since malloc() has turned out to be a time-consuming operation.
My question is: is there a more efficient algorithm to find the maximum amount of memory that can be allocated?
The maximum memory that can be allocated depends mostly on a few factors:
Address space limits on the process (max memory, virtual memory and friends).
Virtual space available
Physical space available
Fragmentation, which will limit the size of contiguous memory blocks.
... Other limits ...
From your description (extreme slowness) it looks like the process starts using swap, which is VERY slow compared to real memory.
Consider the following alternative:
For address space limits, look at ulimit -a (or use getrlimit to access the same data from a C program) - look for 'max memory size' and 'virtual memory'.
For swap space and physical memory, use top.
ulimit -a (filtered)
data seg size (kbytes, -d) unlimited
max memory size (kbytes, -m) 2048
stack size (kbytes, -s) 8192
virtual memory (kbytes, -v) unlimited
From a practical point of view, given that a program does not have control over system resources, you should focus on 'max memory size'.
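(A minimal sketch of reading those same limits from C via getrlimit, as mentioned above; the constants map roughly onto the ulimit flags shown in the listing above.)
#include <stdio.h>
#include <sys/resource.h>

/* Print one resource limit, in the spirit of `ulimit -a`. */
static void show(const char *name, int resource)
{
    struct rlimit rl;
    if (getrlimit(resource, &rl) == 0)
        printf("%-14s soft=%llu hard=%llu\n", name,
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
}

int main(void)
{
    show("RLIMIT_DATA", RLIMIT_DATA);    /* data seg size   (-d) */
    show("RLIMIT_RSS", RLIMIT_RSS);      /* max memory size (-m) */
    show("RLIMIT_STACK", RLIMIT_STACK);  /* stack size      (-s) */
    show("RLIMIT_AS", RLIMIT_AS);        /* virtual memory  (-v) */
    return 0;
}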
For physical and virtual memory figures, use an OS-specific API (a sysinfo sketch follows this list):
sysinfo on Linux, or read /proc/meminfo
GlobalMemoryStatusEx on Win32
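(A minimal Linux-only sketch of the sysinfo route; the same figures can also be parsed from /proc/meminfo.)
#include <stdio.h>
#include <sys/sysinfo.h>  /* Linux-specific */

int main(void)
{
    struct sysinfo si;
    if (sysinfo(&si) != 0)
        return 1;

    /* Sizes are reported in units of si.mem_unit bytes. */
    printf("total RAM : %llu MB\n",
           (unsigned long long)si.totalram * si.mem_unit / (1024 * 1024));
    printf("free RAM  : %llu MB\n",
           (unsigned long long)si.freeram * si.mem_unit / (1024 * 1024));
    printf("total swap: %llu MB\n",
           (unsigned long long)si.totalswap * si.mem_unit / (1024 * 1024));
    return 0;
}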
You can also do a binary search, although it is not recommended, as the state of the system is in flux and the result can vary over time:
ULONG getMax() {
    ULONG min = 0x0;
    ULONG max = 0xffffffff;

    void* t = malloc(max);
    if (t != NULL) {
        free(t);
        return max;
    }

    while (max - min > 1) {
        ULONG mid = min + (max - min) / 2;
        t = malloc(mid);
        if (t == NULL) {
            max = mid;
            continue;
        }
        free(t);
        min = mid;
    }
    return min;
}

Detecting stack overflows during runtime beforehand

I have a rather huge recursive function (written in C), and while the scenario where a stack overflow happens is extremely unlikely, it is still possible. What I wonder is whether you can detect, a few iterations in advance, that the stack is about to overflow, so you can do an emergency stop without crashing the program.
In the C programming language itself, that is not possible; in general, you can't easily know that you are about to run out of stack before you actually do. I recommend instead placing a configurable hard limit on the recursion depth in your implementation, so you can simply abort when the depth is exceeded (a sketch follows below). You could also rewrite your algorithm to use an auxiliary data structure instead of the call stack; this gives you greater flexibility to detect an out-of-memory condition, since malloc() tells you when it fails.
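(A minimal sketch of the depth-limit idea; the function names and the limit value are invented for illustration.)
#include <stdio.h>

#define MAX_DEPTH 10000  /* configurable hard limit; tune for your stack size */

/* Returns 0 on success, -1 if the depth limit was hit. */
static int work(int n, int depth)
{
    if (depth > MAX_DEPTH)
        return -1;          /* emergency stop instead of overflowing */
    if (n == 0)
        return 0;
    return work(n - 1, depth + 1);
}

int main(void)
{
    if (work(1000000, 0) != 0)
        fprintf(stderr, "recursion depth limit exceeded, aborting cleanly\n");
    return 0;
}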
However, you can get something similar with a procedure like this on UNIX-like systems:
Use setrlimit to set a soft stack limit lower than the hard stack limit
Establish signal handlers for both SIGSEGV and SIGBUS to get notified of stack overflows. Some operating systems produce SIGSEGV for these, others SIGBUS.
If you get such a signal and determine that it comes from a stack overflow, raise the soft stack limit with setrlimit and set a global variable to record that this occurred. Make the variable volatile so the optimizer doesn't foil your plans.
In your code, at each recursion step, check if this variable is set. If it is, abort.
This may not work everywhere and requires platform-specific code to find out whether the signal came from a stack overflow. Not all systems (notably, early 68000 systems) can continue normal processing after getting a SIGSEGV or SIGBUS.
A similar approach was used by the Bourne shell for memory allocation.
Here's a simple solution that works for Win32. It actually resembles what Wossname already posted, but is less icky :)
unsigned int get_stack_address( void )
{
    unsigned int r = 0;
    __asm mov dword ptr [r], esp;
    return r;
}

void rec( int x, const unsigned int begin_address )
{
    // stop once roughly 100 000 bytes of stack have been consumed
    if ( begin_address - get_stack_address() > 100000 )
    {
        //std::cout << "Recursion level " << x << " stack too high" << std::endl;
        return;
    }
    rec( x + 1, begin_address );
}

int main( void )
{
    int x = 0;
    rec( x, get_stack_address() );
}
Here's a naive method, but it's a bit icky...
When you enter the function for the first time you could store the address of one of your variables declared in that function. Store that value outside your function (e.g. in a global). In subsequent calls compare the current address of that variable with the cached copy. The deeper you recurse the further apart these two values will be.
This will most likely cause compiler warnings (storing addresses of temporary variables) but it does have the benefit of giving you a fairly accurate way of knowing exactly how much stack you're using.
Can't say I really recommend this but it would work.
#include <stdio.h>
#include <stdlib.h> /* for abs() */

char* start = NULL;

void recurse()
{
    char marker = '#';

    if(start == NULL)
        start = &marker;

    printf("depth: %d\n", abs(&marker - start));

    if(abs(&marker - start) < 1000)
        recurse();
    else
        start = NULL;
}

int main()
{
    recurse();
    return 0;
}
An alternative method is to learn the stack limit at the start of the program, and each time in your recursive function to check whether this limit has been approached (within some safety margin, say 64 kb). If so, abort; if not, continue.
The stack limit on POSIX systems can be learned by using getrlimit system call.
Example code that is thread-safe (note: the code assumes that the stack grows downwards, as on x86!):
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

void *stack_limit;

#define SAFETY_MARGIN (64 * 1024) // 64 kb

void recurse(int level)
{
    void *stack_top = &stack_top;

    if (stack_top <= stack_limit) {
        printf("stack limit reached at recursion level %d\n", level);
        return;
    }

    recurse(level + 1);
}

int get_max_stack_size(void)
{
    struct rlimit rl;
    int ret = getrlimit(RLIMIT_STACK, &rl);
    if (ret != 0) {
        return 1024 * 1024 * 8; // 8 MB is the default on many platforms
    }
    printf("max stack size: %d\n", (int)rl.rlim_cur);
    return rl.rlim_cur;
}

int main (int argc, char *argv[])
{
    int x;
    stack_limit = (char *)&x - get_max_stack_size() + SAFETY_MARGIN;
    recurse(0);
    return 0;
}
Output:
max stack size: 8388608
stack limit reached at recursion level 174549

How to dynamically allocate arrays inside a kernel?

I need to dynamically allocate some arrays inside the kernel function. How can I do that?
My code is something like this:
__global__ void func(float *grid_d, int n, int nn){
    int i, j;
    float x[n], y[nn];
    //Do some really cool and heavy computations here that take hours.
}
But that will not work. If this were host code I could use malloc. cudaMalloc needs a pointer on the host and another on the device, and inside the kernel function I don't have the host pointer.
So, what should I do?
If it takes a few seconds to allocate all the arrays (I need about 4 of size n and 5 of size nn), that won't be a problem, since the kernel will probably run for at least 20 minutes.
Dynamic memory allocation is only supported on compute capability 2.x and newer hardware. You can use either the C++ new keyword or malloc in the kernel, so your example could become:
__global__ void func(float *grid_d, int n, int nn){
    int i, j;
    float *x = new float[n], *y = new float[nn];
}
This allocates memory on a local memory runtime heap which has the lifetime of the context, so make sure you free the memory after the kernel finishes running if your intention is not to use the memory again. You should also note that runtime heap memory cannot be accessed directly from the host APIs, so you cannot pass a pointer allocated inside a kernel as an argument to cudaMemcpy, for example.
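(One practical note: that device-side heap defaults to 8 MB; if your per-thread allocations are large you will need to raise it with cudaDeviceSetLimit before the first kernel launch. A minimal sketch, with the sizes and launch configuration invented for illustration.)
#include <cuda_runtime.h>

__global__ void func(float *grid_d, int n, int nn)
{
    (void)grid_d;
    // Per-thread allocations from the device-side runtime heap.
    float *x = new float[n];
    float *y = new float[nn];
    if (x != NULL && y != NULL) {
        x[0] = 1.0f;
        y[0] = 2.0f;
    }
    delete[] x;
    delete[] y;
}

int main()
{
    const int n = 256, nn = 512, threads = 128, blocks = 4;

    // Reserve enough device heap for every thread's x and y (plus slack).
    size_t heap_bytes = (size_t)blocks * threads * (n + nn) * sizeof(float) * 2;
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_bytes);

    func<<<blocks, threads>>>(NULL, n, nn);
    cudaDeviceSynchronize();
    return 0;
}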
#talonmies answered your question on how to dynamically allocate memory within a kernel. This is intended as a supplemental answer, addressing performance of __device__ malloc() and an alternative you might want to consider.
Allocating memory dynamically in the kernel can be tempting because it allows GPU code to look more like CPU code. But it can seriously affect performance. I wrote a self contained test and have included it below. The test launches some 2.6 million threads. Each thread populates 16 integers of global memory with some values derived from the thread index, then sums up the values and returns the sum.
The test implements two approaches. The first approach uses __device__ malloc() and the second approach uses memory that is allocated before the kernel runs.
On my 2.0 device, the kernel runs in 1500ms when using __device__ malloc() and 27ms when using pre-allocated memory. In other words, the test takes 56x longer to run when memory is allocated dynamically within the kernel. The time includes the outer loop cudaMalloc() / cudaFree(), which is not part of the kernel. If the same kernel is launched many times with the same number of threads, as is often the case, the cost of the cudaMalloc() / cudaFree() is amortized over all the kernel launches. That brings the difference even higher, to around 60x.
Speculating, I think that the performance hit is in part caused by implicit serialization. The GPU must probably serialize all simultaneous calls to __device__ malloc() in order to provide separate chunks of memory to each caller.
The version that does not use __device__ malloc() allocates all the GPU memory before running the kernel. A pointer to the memory is passed to the kernel. Each thread calculates an index into the previously allocated memory instead of using a __device__ malloc().
The potential issue with allocating memory up front is that, if only some threads need to allocate memory and it is not known which threads those are, it will be necessary to allocate memory for all the threads. If there is not enough memory for that, it might be more efficient to reduce the number of threads per kernel call than to use __device__ malloc(). Other workarounds would probably end up reimplementing what __device__ malloc() is doing in the background, and would see a similar performance hit.
Test the performance of __device__ malloc():
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
const int N_ITEMS(16);
#define USE_DYNAMIC_MALLOC
__global__ void test_malloc(int* totals)
{
int tx(blockIdx.x * blockDim.x + threadIdx.x);
int* s(new int[N_ITEMS]);
for (int i(0); i < N_ITEMS; ++i) {
s[i] = tx * i;
}
int total(0);
for (int i(0); i < N_ITEMS; ++i) {
total += s[i];
}
totals[tx] = total;
delete[] s;
}
__global__ void test_malloc_2(int* items, int* totals)
{
int tx(blockIdx.x * blockDim.x + threadIdx.x);
int* s(items + tx * N_ITEMS);
for (int i(0); i < N_ITEMS; ++i) {
s[i] = tx * i;
}
int total(0);
for (int i(0); i < N_ITEMS; ++i) {
total += s[i];
}
totals[tx] = total;
}
int main()
{
cudaError_t cuda_status;
cudaSetDevice(0);
int blocks_per_launch(1024 * 10);
int threads_per_block(256);
int threads_per_launch(blocks_per_launch * threads_per_block);
int* totals_d;
cudaMalloc((void**)&totals_d, threads_per_launch * sizeof(int));
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaDeviceSynchronize();
cudaEventRecord(start, 0);
#ifdef USE_DYNAMIC_MALLOC
cudaDeviceSetLimit(cudaLimitMallocHeapSize, threads_per_launch * N_ITEMS * sizeof(int));
test_malloc<<<blocks_per_launch, threads_per_block>>>(totals_d);
#else
int* items_d;
cudaMalloc((void**)&items_d, threads_per_launch * sizeof(int) * N_ITEMS);
test_malloc_2<<<blocks_per_launch, threads_per_block>>>(items_d, totals_d);
cudaFree(items_d);
#endif
cuda_status = cudaDeviceSynchronize();
if (cuda_status != cudaSuccess) {
printf("Error: %d\n", cuda_status);
exit(1);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
printf("Elapsed: %f\n", elapsedTime);
int* totals_h(new int[threads_per_launch]);
cuda_status = cudaMemcpy(totals_h, totals_d, threads_per_launch * sizeof(int), cudaMemcpyDeviceToHost);
if (cuda_status != cudaSuccess) {
printf("Error: %d\n", cuda_status);
exit(1);
}
for (int i(0); i < 10; ++i) {
printf("%d ", totals_h[i]);
}
printf("\n");
cudaFree(totals_d);
delete[] totals_h;
return cuda_status;
}
Output:
C:\rd\projects\test_cuda_malloc\Release>test_cuda_malloc.exe
Elapsed: 27.311169
0 120 240 360 480 600 720 840 960 1080
C:\rd\projects\test_cuda_malloc\Release>test_cuda_malloc.exe
Elapsed: 1516.711914
0 120 240 360 480 600 720 840 960 1080
If the values of n and nn are known before the kernel is called, why not cudaMalloc the memory on the host side and pass the device memory pointer to the kernel?
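(A minimal sketch of that approach, with sizes and the launch configuration invented for illustration: allocate once with cudaMalloc on the host and let each thread carve out its own slice.)
#include <cuda_runtime.h>

__global__ void func(float *x_all, float *y_all, int n, int nn)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread uses its own pre-allocated slice instead of device malloc/new.
    float *x = x_all + tid * n;
    float *y = y_all + tid * nn;

    x[0] = 1.0f;
    y[0] = 2.0f;
}

int main()
{
    const int n = 256, nn = 512, threads = 128, blocks = 4;
    float *x_all, *y_all;

    cudaMalloc(&x_all, (size_t)blocks * threads * n  * sizeof(float));
    cudaMalloc(&y_all, (size_t)blocks * threads * nn * sizeof(float));

    func<<<blocks, threads>>>(x_all, y_all, n, nn);
    cudaDeviceSynchronize();

    cudaFree(x_all);
    cudaFree(y_all);
    return 0;
}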
Ran an experiment based on the concepts in #rogerdahl's post. Assumptions:
4MB of memory allocated in 64B chunks.
1 GPU block and 32 warp threads in that block
Run on a P100
The malloc+free calls local to the GPU seemed to be much faster than the cudaMalloc + cudaFree calls. The program's output:
Starting timer for cuda malloc timer
Stopping timer for cuda malloc timer
timer for cuda malloc timer took 1.169631s
Starting timer for device malloc timer
Stopping timer for device malloc timer
timer for device malloc timer took 0.029794s
I'm leaving out the code for timer.h and timer.cpp, but here's the code for the test itself:
#include "cuda_runtime.h"
#include <stdio.h>
#include <thrust/system/cuda/error.h>
#include "timer.h"
static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)
const int BLOCK_COUNT = 1;
const int THREADS_PER_BLOCK = 32;
const int ITERATIONS = 1 << 12;
const int ITERATIONS_PER_BLOCKTHREAD = ITERATIONS / (BLOCK_COUNT * THREADS_PER_BLOCK);
const int ARRAY_SIZE = 64;
void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err) {
if (err == cudaSuccess)
return;
std::cerr << statement<<" returned " << cudaGetErrorString(err) << "("<<err<< ") at "<<file<<":"<<line << std::endl;
exit (1);
}
__global__ void mallocai() {
for (int i = 0; i < ITERATIONS_PER_BLOCKTHREAD; ++i) {
int * foo;
foo = (int *) malloc(sizeof(int) * ARRAY_SIZE);
free(foo);
}
}
int main() {
Timer cuda_malloc_timer("cuda malloc timer");
for (int i = 0; i < ITERATIONS; ++ i) {
if (i == 1) cuda_malloc_timer.start(); // let it warm up one cycle
int * foo;
cudaMalloc(&foo, sizeof(int) * ARRAY_SIZE);
cudaFree(foo);
}
cuda_malloc_timer.stop_and_report();
CUDA_CHECK_RETURN(cudaDeviceSynchronize());
Timer device_malloc_timer("device malloc timer");
device_malloc_timer.start();
mallocai<<<BLOCK_COUNT, THREADS_PER_BLOCK>>>();
CUDA_CHECK_RETURN(cudaDeviceSynchronize());
device_malloc_timer.stop_and_report();
}
If you find mistakes, please let me know in the comments, and I'll try to fix them.
And I ran them again with larger everything:
const int BLOCK_COUNT = 56;
const int THREADS_PER_BLOCK = 1024;
const int ITERATIONS = 1 << 18;
const int ITERATIONS_PER_BLOCKTHREAD = ITERATIONS / (BLOCK_COUNT * THREADS_PER_BLOCK);
const int ARRAY_SIZE = 1024;
And cudaMalloc was still slower by a lot:
Starting timer for cuda malloc timer
Stopping timer for cuda malloc timer
timer for cuda malloc timer took 74.878016s
Starting timer for device malloc timer
Stopping timer for device malloc timer
timer for device malloc timer took 0.167331s
Maybe you should test
cudaMalloc(&foo,sizeof(int) * ARRAY_SIZE * ITERATIONS);
cudaFree(foo);
instead of:
for (int i = 0; i < ITERATIONS; ++i) {
    if (i == 1) cuda_malloc_timer.start(); // let it warm up one cycle
    int * foo;
    cudaMalloc(&foo, sizeof(int) * ARRAY_SIZE);
    cudaFree(foo);
}

How to fix memory bug in my c program?

It was showing errors such as "Array size too large", "Structure size too large", and "too much global data defined in a file". Please show me how to allocate this memory dynamically.
struct
{
    doublereal a[25000000];
} _BLNK__;

static doublereal x[22500]  /* was [3][7500] */;
static doublereal vn[12], del, eul[22500]  /* was [3][1500] */;
Allocate the data on the heap, rather than on the stack. Use pointers and allocate the memory in an initialization routine.
Also, do some calculations to work out whether you have enough memory, e.g. 25000000 * 16 bytes => 400 MB of memory (no idea how big doublereal is).
Try dynamic memory allocation with malloc and a pointer, like:
typedef struct
{
    doublereal a[25000000];
} _BLNK__;
...
{
    _BLNK__ *mypointer = malloc(sizeof *mypointer);
    mypointer->a[0] = 0;
    mypointer->a[1] = 1;
    ...
    free(mypointer);
}
...
...
The statement
doublereal a[25000000];
allocates memory on the stack (when it appears inside a function). There is a strict limit on the size of the stack, and you can find it on a Linux or OS X system by running:
$ ulimit -s
8192
which is 8192K = 8M.
You are trying to allocate 25000000 * 8 = 200000000 bytes = 190 M on a 32 bit system, which is much larger than the limit.
You have three choices:
1) reduce the size of the array
2) dynamically allocate memory (doublereal *a = (doublereal *)malloc(sizeof(doublereal) * 25000000))
3) increase the stack size (but this has to be configured on every machine the program will run on, and may require administrative privileges)
#include <stdlib.h>

#define LEN_A (25000000)

typedef struct   /* typedef so that _BLNK__ can be used as a type below */
{
    doublereal* a;
} _BLNK__;

#define LEN_X   (22500)
#define LEN_VN  (12)
#define LEN_EUL (22500)

#define INIT_BLNK(x) x.a = (doublereal*)malloc(LEN_A * sizeof(doublereal))
#define FREE_BLNK(x) if (x.a != 0) free(x.a)

static doublereal *x;
static doublereal *vn, del, *eul;

int main()
{
    _BLNK__ Item;

    x   = (doublereal*)malloc(LEN_X * sizeof(doublereal));
    vn  = (doublereal*)malloc(LEN_VN * sizeof(doublereal));
    eul = (doublereal*)malloc(LEN_EUL * sizeof(doublereal));
    INIT_BLNK(Item);

    //Do whatever you wish

    //Return memory to the OS
    free(x);
    free(vn);
    free(eul);
    FREE_BLNK(Item);
    return 0;
}
Try using this. I just wrote the code here, so if there are any compiler errors, try to fix them.
