clang eager to inline at cost of stack

clang eager to inline at cost of stack - c

Apple clang version 11.0.0 (clang-1100.0.33.17)
Rewording with example as far too many people got fixated on "recursive" when it was about stack usage, and the one recursive example was the lowest fruit (fix that, and you win more!)
The project has recently moved to mark all functions "static" that are not prototyped in headers, and only used in the specific source file.
llvm appears to be quite eager to inline functions, which is often desirable, especially in userland.
However, in kernel, there is a fixed stack of 16KB. Sometimes, the inlining does the wrong thing.
Example:
#include <stdio.h>
#include <string.h>
#include <stdint.h>
// clang -O2 -g -Wframe-larger-than=1 -o stack stack.c
#define current_stack_pointer ({ \
register unsigned long esp asm("esp"); \
asm("" : "=r"(esp)); \
esp; \
})
__attribute__((noinline))
static void lower(unsigned long top)
{
uint64_t usage[100];
for (int i = 0; i < 100; i++)
usage[i] = usage[i] + 1;
printf("%s; stack now %lu : deepest - we care about this one\n",
__func__, top - current_stack_pointer);
}
// Changing this one to be inlined or not. It is only called once
// on the way down the stack, it's not "part" of the deep stack, and
// it is undesirable that its "cost" is pushed on the deep stack when
// inlined.
//__attribute__((noinline))
__attribute__((always_inline))
static void step_one(unsigned long top)
{
uint64_t usage[200];
for (int i = 0; i < 200; i++)
usage[i] = usage[i] + 1;
printf("%s; stack now %lu\n", __func__, top - current_stack_pointer);
}
__attribute__((noinline))
static void start(unsigned long top)
{
uint64_t usage[100];
for (int i = 0; i < 100; i++)
usage[i] = usage[i] + 1;
printf("%s; stack now %lu\n", __func__, top - current_stack_pointer);
step_one(top);
lower(top);
printf("%s; stack now %lu\n", __func__, top - current_stack_pointer);
}
int main(int argc, char **argv)
{
uint64_t usage[100];
unsigned long top;
top = current_stack_pointer;
printf("%s; stack now %lu\n", __func__, top - current_stack_pointer);
// Make it use stack space, ignore this, just to set some variables
for (int i = 0; i < 100; i++)
usage[i] = usage[i] + 1;
start(top);
}
So here we go, force step_one() to be noinline:
stack.c:55:5: warning: stack frame size of 824 bytes in function 'main'
stack.c:41:13: warning: stack frame size of 856 bytes in function 'start'
stack.c:30:13: warning: stack frame size of 1624 bytes in function 'step_one'
stack.c:13:13: warning: stack frame size of 824 bytes in function 'lower'
# ./stack
main; stack now 0
start; stack now 864
step_one; stack now 2496
lower; stack now 1696 : deepest - we care about this one
That is great, even though we called step_one on the way down, and the stack grew to handle it, it is released and lower is not affected.
Hurrah.
Now, changing step_one to be inlined
stack.c:55:5: warning: stack frame size of 824 bytes in function 'main'
stack.c:41:13: warning: stack frame size of 2440 bytes in function 'start'
stack.c:13:13: warning: stack frame size of 824 bytes in function 'lower'
# ./stack
main; stack now 0
start; stack now 2448
step_one; stack now 2448
lower; stack now 3280 : deepest - we care about this one
Here we are, step_one was inlined, and its cost is now part of start and as we descend into lower that cost is still taking stack space.
Hurroo.
This is unfortunate. For kernel files, inlining functions can make it worse (with stacksize in mind) and it at times it has gone from a frame-size of "88" to "1800". due to inlining.
I suspect it has already been answered, there is no way to tell clang to prefer a lean stack over inlining benefits.

Are there any way to ask clang to limit stack usage as a compile option for some source files?
I don't know of any, and wouldn't expect it to exist (now or in the future).
What you're hoping for only makes sense in the presence of recursion (without recursion, "reduce stack size" means "always inline to reduce stack space wasted by function epilogue/prologue"). Currently (as I understand it); clang doesn't even know if a function is recursive and just has a "max. cutoff depth for caller contexts" approach.
Manually figuring out what to "noline" is time-consuming (for the human)...
It would be annoying and possibly error prone; but shouldn't be time-consuming (given that recursion is rare, even when developers have no reason to care about stack size).
The problem with putting __attribute__(noinline) on the log function is that it will be applied to all callers and prevent the log function from being inlined when it is beneficial. What you'd really want is the reverse - e.g. some kind of (hypothetical) __attribute__(noinline_callees) you can put on the recursive function. That behavior could be achieved by putting the recursive function in its own separate file/compilation unit; but that will probably end up being even more annoying.

So far the best option we have come across is to use -finline-hint-functions for the kernel compile, then it will only inline functions that we ask it to. Changing the default away from try-to-inline-everything, and we have a chance to keep the stack small.

Related

how many recursive function calls causes stack overflow?

I am working on a simulation problem written in c, the main part of my program is a recursive function.
when the recursive depth reaches approximately 500000 it seems stack overflow occurs.
Q1 : I want to know that is this normal?
Q2 : in general how many recursive function calls causes stack overflow?
Q3 : in the code below, removing local variable neighbor can prevent from stack overflow?
my code:
/*
* recursive function to form Wolff Cluster(= WC)
*/
void grow_Wolff_cluster(lattic* l, Wolff* wolff, site *seed){
/*a neighbor of site seed*/
site* neighbor;
/*go through all neighbors of seed*/
for (int i = 0 ; i < neighbors ; ++i) {
neighbor = seed->neighbors[i];
/*add to WC according to the Wolff Algorithm*/
if(neighbor->spin == seed->spin && neighbor->WC == -1 && ((double)rand() / RAND_MAX) < add_probability)
{
wolff->Wolff_cluster[wolff->WC_pos] = neighbor;
wolff->WC_pos++; // the number of sites that is added to WC
neighbor->WC = 1; // for avoiding of multiple addition of site
neighbor->X = 0;
///controller_site_added_to_WC();
/*continue growing Wolff cluster(recursion)*/
grow_Wolff_cluster(l, wolff, neighbor);
}
}
}

I want to know that is this normal?
Yes. There's only so much stack size.
In the code below, removing local variable neighbor can prevent from stack overflow?
No. Even with no variables and no return values the function calls themselves must be stored in the stack so the stack can eventually be unwound.
For example...
void recurse() {
recurse();
}
int main (void)
{
recurse();
}
This still overflows the stack.
$ ./test
ASAN:DEADLYSIGNAL
=================================================================
==94371==ERROR: AddressSanitizer: stack-overflow on address 0x7ffee7f80ff8 (pc 0x00010747ff14 bp 0x7ffee7f81000 sp 0x7ffee7f81000 T0)
#0 0x10747ff13 in recurse (/Users/schwern/tmp/./test+0x100000f13)
SUMMARY: AddressSanitizer: stack-overflow (/Users/schwern/tmp/./test+0x100000f13) in recurse
==94371==ABORTING
Abort trap: 6
In general how many recursive function calls causes stack overflow?
That depends on your environment and function calls. Here on OS X 10.13 I'm limited to 8192K by default.
$ ulimit -s
8192
This simple example with clang -g can recurse 261976 times. With -O3 I can't get it to overflow, I suspect compiler optimizations have eliminated my simple recursion.
#include <stdio.h>
void recurse() {
puts("Recurse");
recurse();
}
int main (void)
{
recurse();
}
Add an integer argument and it's 261933 times.
#include <stdio.h>
void recurse(int cnt) {
printf("Recurse %d\n", cnt);
recurse(++cnt);
}
int main (void)
{
recurse(1);
}
Add a double argument, now it's 174622 times.
#include <stdio.h>
void recurse(int cnt, double foo) {
printf("Recurse %d %f\n", cnt, foo);
recurse(++cnt, foo);
}
int main (void)
{
recurse(1, 2.3);
}
Add some stack variables and it's 104773 times.
#include <stdio.h>
void recurse(int cnt, double foo) {
double this = 42.0;
double that = 41.0;
double other = 40.0;
double thing = 39.0;
printf("Recurse %d %f %f %f %f %f\n", cnt, foo, this, that, other, thing);
recurse(++cnt, foo);
}
int main (void)
{
recurse(1, 2.3);
}
And so on. But I can increase my stack size in this shell and get twice the calls.
$ ./test 2> /dev/null | wc -l
174622
$ ulimit -s 16384
$ ./test 2> /dev/null | wc -l
349385
I have a hard upper limit to how big I can make the stack of 65,532K or 64M.
$ ulimit -Hs
65532

A stack overflow isn’t defined by the C standard, but by the implementation. The C standard defines a language with unlimited stack space (among other resources) but does have a section about how implementations are allowed to impose limits.
Usually it’s the operating system that actually first creates the error. The OS doesn’t care about how many calls you make, but about the total size of the stack. The stack is composed of stack frames, one for each function call. Usually a stack frame consists of some combination of the following five things (as an approximation; details can vary a lot between systems):
The parameters to the function call (probably not actually here, in this case; they’re probably in registers, although this doesn’t actually buy anything with recursion).
The return address of the function call (in this case, the address of the ++i instruction in the for loop).
The base pointer where the previous stack frame starts
Local variables (at least those that don’t go in registers)
Any registers the caller wants to save when it makes a new function call, so the called function doesn’t overwrite them (some registers may be saved by the caller instead, but it doesn’t particularly matter for stack size analysis). This is why passing parameters in registers doesn’t help much in this case; they’ll end up on the stack sooner or later.
Because some of these (specifically, 1., 4., and 5.) can vary in size by a lot, it can be difficult to estimate how big an average stack frame is, although it’s easier in this case because of the recursion. Different systems also have different stack sizes; it currently looks like by default I can have 8 MiB for a stack, but an embedded system would probably have a lot less.
This also explains why removing a local variable gives you more available function calls; you reduced the size of each of the 500,000 stack frames.
If you want to increase the amount of stack space available, look into the setrlimit(2) function (on Linux like the OP; it may be different on other systems). First, though, you might want to try debugging and refactoring to make sure you need all that stack space.

Yes and no - if you come across a stack overflow in your code, it could mean a few things
Your algorithm is not implemented in a way that respects the amount of memory on the stack you have been given. You may adjust this amount to suit the needs of the algorithm.
If this is the case, it's more common to change the algorithm to more efficiently utilize the stack, rather than add more memory. Converting a recursive function to an iterative one, for example, saves a lot of precious memory.
It's a bug trying to eat all your RAM. You forgot a base case in the recursion or mistakenly called the same function. We've all done it at least 2 times.
It's not necessarily how many calls cause an overflow - it's dependent upon how much memory each individual call takes up on a stack frame. Each function call uses up stack memory until the call returns. Stack memory is statically allocated -- you can't change it at runtime (in a sane world). It's a last-in-first-out (LIFO) data structure behind the scenes.
It's not preventing it, it's just changing how many calls to grow_Wolff_cluster it takes to overflow the stack memory. On a 32-bit system, removing neighbor from the function costs a call to grow_Wolff_cluster 4 bytes less. It adds up quickly when you multiply that in the hundreds of thousands.
I suggest you learn more about how stacks work for you. Here's a good resource over on the software engineering stack exchange. And another here on stack overflow (zing!)

For each time a function recurs, your program takes more memory on the stack, the memory it takes for each function depends upon the function and variables within it. The number of recursions that can be done of a function is entirely dependant upon your system.
There is no general number of recursions that will cause stack overflow.
Removing the variable 'neighbour' will allow for the function to recur further as each recursion takes less memory, but it will still eventually cause stack overflow.

This is a simple c# function that will show you how many iteration your computer can take before stack overflow (as a reference, I have run up to 10478):
private void button3_Click(object sender, EventArgs e)
{
Int32 lngMax = 0;
StackIt(ref lngMax);
}
private void StackIt(ref Int32 plngMax, Int32 plngStack = 0)
{
if (plngStack > plngMax)
{
plngMax = plngStack;
Console.WriteLine(plngMax.ToString());
}
plngStack++;
StackIt(ref plngMax, plngStack);
}
in this simple case, the condition check: "if (plngStack > plngMax)" could be removed,
but if you got a real recursive function, this check will help you localize the problem.

How to corrupt the stack in a C program

I have to change the designated section of function_b so that it changes the stack in such a way that the program prints:
Executing function_a
Executing function_b
Finished!
At this point it also prints Executed function_b in between Executing function_b and Finished!.
I have the following code and I have to fill something in, in the part where it says // ... insert code here
#include <stdio.h>
void function_b(void){
char buffer[4];
// ... insert code here
fprintf(stdout, "Executing function_b\n");
}
void function_a(void) {
int beacon = 0x0b1c2d3;
fprintf(stdout, "Executing function_a\n");
function_b();
fprintf(stdout, "Executed function_b\n");
}
int main(void) {
function_a();
fprintf(stdout, "Finished!\n");
return 0;
}
I am using Ubuntu Linux with the gcc compiler. I compile the program with the following options: -g -fno-stack-protector -fno-omit-frame-pointer. I am using an intel processor.

Here is a solution, not exactly stable across environments, but works for me on x86_64 processor on Windows/MinGW64.
It may not work for you out of the box, but still, you might want to use a similar approach.
void function_b(void) {
char buffer[4];
buffer[0] = 0xa1; // part 1
buffer[1] = 0xb2;
buffer[2] = 0xc3;
buffer[3] = 0x04;
register int * rsp asm ("rsp"); // part 2
register size_t r10 asm ("r10");
r10 = 0;
while (*rsp != 0x04c3b2a1) {rsp++; r10++;} // part 3
while (*rsp != 0x00b1c2d3) rsp++; // part 4
rsp -= r10; // part 5
rsp = (int *) ((size_t) rsp & ~0xF); // part 6
fprintf(stdout, "Executing function_b\n");
}
The trick is that each of function_a and function_b have only one local variable, and we can find the address of that variable just by searching around in the memory.
First, we put a signature in the buffer, let it be the 4-byte integer 0x04c3b2a1 (remember that x86_64 is little-endian).
After that, we declare two variables to represent the registers: rsp is the stack pointer, and r10 is just some unused register.
This allows to not use asm statements later in the code, while still being able to use the registers directly.
It is important that the variables don't actually take stack memory, they are references to processor registers themselves.
After that, we move the stack pointer in 4-byte increments (since the size of int is 4 bytes) until we get to the buffer. We have to remember the offset from the stack pointer to the first variable here, and we use r10 to store it.
Next, we want to know how far in the stack are the instances of function_b and function_a. A good approximation is how far are buffer and beacon, so we now search for beacon.
After that, we have to push back from beacon, the first variable of function_a, to the start of instance of the whole function_a on the stack.
That we do by subtracting the value stored in r10.
Finally, here comes a werider bit.
At least on my configuration, the stack happens to be 16-byte aligned, and while the buffer array is aligned to the left of a 16-byte block, the beacon variable is aligned to the right of such block.
Or is it something with a similar effect and different explanation?..
Anyway, so we just clear the last four bits of the stack pointer to make it 16-byte aligned again.
The 32-bit GCC doesn't align anything for me, so you might want to skip or alter this line.
When working on a solution, I found the following macro useful:
#ifdef DEBUG
#define show_sp() \
do { \
register void * rsp asm ("rsp"); \
fprintf(stdout, "stack pointer is %016X\n", rsp); \
} while (0);
#else
#define show_sp() do{}while(0);
#endif
After this, when you insert a show_sp(); in your code and compile with -DDEBUG, it prints what is the value of stack pointer at the respective moment.
When compiling without -DDEBUG, the macro just compiles to an empty statement.
Of course, other variables and registers can be printed in a similar way.

ok, let assume that epilogue (i.e code at } line) of function_a and for function_b is the same
despite functions A and B not symmetric, we can assume this because it have the same signature (no parameters, no return value), same calling conventions and same size of local variables (4 byte - int beacon = 0x0b1c2d3 vs char buffer[4];) and with optimization - both must be dropped because unused. but we must not use additional local variables in function_b for not break this assumption. most problematic point here - what is function_A or function_B will be use nonvolatile registers (and as result save it in prologue and restore in epilogue) - but however look like here no place for this.
so my next code based on this assumption - epilogueA == epilogueB (really solution of #Gassa also based on it.
also need very clearly state that function_a and function_b must not be inline. this is very important - without this any solution impossible. so I let yourself add noinline attribute to function_a and function_b. note - not code change but attribute add, which author of this task implicitly implies but not clearly stated. don't know how in GCC mark function as noinline but in CL __declspec(noinline) for this used.
next code I write for CL compiler where exist next intrinsic function
void * _AddressOfReturnAddress();
but I think that GCC also must have the analog of this function. also I use
void* _ReturnAddress();
but however really _ReturnAddress() == *(void**)_AddressOfReturnAddress() and we can use _AddressOfReturnAddress() only. simply using _ReturnAddress() make source (but not binary - it equal) code smaller and more readable.
and next code is work for both x86 and x64. and this code work (tested) with any optimization.
despite I use 2 global variables - code is thread safe - really we can call main from multiple threads in concurrent, call it multiple time - but all will be worked correct (only of course how I say at begin if epilogueA == epilogueB)
hope comments in code enough self explained
__declspec(noinline) void function_b(void){
char buffer[4];
buffer[0] = 0;
static void *IPa, *IPb;
// save the IPa address
_InterlockedCompareExchangePointer(&IPa, _ReturnAddress(), 0);
if (_ReturnAddress() == IPa)
{
// we called from function_a
function_b();
// <-- IPb
if (_ReturnAddress() == IPa)
{
// we called from function_a, change return address for return to IPb instead IPa
*(void**)_AddressOfReturnAddress() = IPb;
return;
}
// we at stack of function_a here.
// we must be really at point IPa
// and execute fprintf(stdout, "Executed function_b\n"); + '}' (epilogueA)
// but we will execute fprintf(stdout, "Executing function_b\n"); + '}' (epilogueB)
// assume that epilogueA == epilogueB
}
else
{
// we called from function_b
IPb = _ReturnAddress();
return;
}
fprintf(stdout, "Executing function_b\n");
// epilogueB
}
__declspec(noinline) void function_a(void) {
int beacon = 0x0b1c2d3;
fprintf(stdout, "Executing function_a\n");
function_b();
// <-- IPa
fprintf(stdout, "Executed function_b\n");
// epilogueA
}
int main(void) {
function_a();
fprintf(stdout, "Finished!\n");
return 0;
}

Detecting stack overflows during runtime beforehand

I have a rather huge recursive function (also, I write in C), and while I have no doubt that the scenario where stack overflow happens is extremely unlikely, it is still possible. What I wonder is whether you can detect if stack is going to get overflown within a few iterations, so you can do an emergency stop without crashing the program.

In the C programming language itself, that is not possible. In general, you can't know easily that you ran out of stack before running out. I recommend you to instead place a configurable hard limit on the recursion depth in your implementation, so you can simply abort when the depth is exceeded. You could also rewrite your algorithm to use an auxillary data structure instead of using the stack through recursion, this gives you greater flexibility to detect an out-of-memory condition; malloc() tells you when it fails.
However, you can get something similar with a procedure like this on UNIX-like systems:
Use setrlimit to set a soft stack limit lower than the hard stack limit
Establish signal handlers for both SIGSEGV and SIGBUS to get notified of stack overflows. Some operating systems produce SIGSEGV for these, others SIGBUS.
If you get such a signal and determine that it comes from a stack overflow, raise the soft stack limit with setrlimit and set a global variable to identify that this occured. Make the variable volatile so the optimizer doesn't foil your plains.
In your code, at each recursion step, check if this variable is set. If it is, abort.
This may not work everywhere and required platform specific code to find out that the signal came from a stack overflow. Not all systems (notably, early 68000 systems) can continue normal processing after getting a SIGSEGV or SIGBUS.
A similar approach was used by the Bourne shell for memory allocation.

Heres a simple solution that works for win-32. Actually resembles what Wossname already posted but less icky :)
unsigned int get_stack_address( void )
{
unsigned int r = 0;
__asm mov dword ptr [r], esp;
return r;
}
void rec( int x, const unsigned int begin_address )
{
// here just put 100 000 bytes of memory
if ( begin_address - get_stack_address() > 100000 )
{
//std::cout << "Recursion level " << x << " stack too high" << std::endl;
return;
}
rec( x + 1, begin_address );
}
int main( void )
{
int x = 0;
rec(x,get_stack_address());
}

Here's a naive method, but it's a bit icky...
When you enter the function for the first time you could store the address of one of your variables declared in that function. Store that value outside your function (e.g. in a global). In subsequent calls compare the current address of that variable with the cached copy. The deeper you recurse the further apart these two values will be.
This will most likely cause compiler warnings (storing addresses of temporary variables) but it does have the benefit of giving you a fairly accurate way of knowing exactly how much stack you're using.
Can't say I really recommend this but it would work.
#include <stdio.h>
char* start = NULL;
void recurse()
{
char marker = '#';
if(start == NULL)
start = &marker;
printf("depth: %d\n", abs(&marker - start));
if(abs(&marker - start) < 1000)
recurse();
else
start = NULL;
}
int main()
{
recurse();
return 0;
}

An alternative method is to learn the stack limit at the start of the program, and each time in your recursive function to check whether this limit has been approached (within some safety margin, say 64 kb). If so, abort; if not, continue.
The stack limit on POSIX systems can be learned by using getrlimit system call.
Example code that is thread-safe: (note: it code assumes that stack grows backwards, as on x86!)
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>
void *stack_limit;
#define SAFETY_MARGIN (64 * 1024) // 64 kb
void recurse(int level)
{
void *stack_top = &stack_top;
if (stack_top <= stack_limit) {
printf("stack limit reached at recursion level %d\n", level);
return;
}
recurse(level + 1);
}
int get_max_stack_size(void)
{
struct rlimit rl;
int ret = getrlimit(RLIMIT_STACK, &rl);
if (ret != 0) {
return 1024 * 1024 * 8; // 8 MB is the default on many platforms
}
printf("max stack size: %d\n", (int)rl.rlim_cur);
return rl.rlim_cur;
}
int main (int argc, char *argv[])
{
int x;
stack_limit = (char *)&x - get_max_stack_size() + SAFETY_MARGIN;
recurse(0);
return 0;
}
Output:
max stack size: 8388608
stack limit reached at recursion level 174549

snprintf() prints garbage floats with newlib nano

I am running a bare metal embedded system with an ARM Cortex-M3 (STM32F205). When I try to use snprintf() with float numbers, e.g.:
float f;
f = 1.23;
snprintf(s, 20, "%5.2f", f);
I get garbage into s. The format seems to be honored, i.e. the garbage is a well-formed string with digits, decimal point, and two trailing digits. However, if I repeat the snprintf, the string may change between two calls.
Floating point mathematics seems to work otherwise, and snprintf works with integers, e.g.:
snprintf(s, 20, "%10d", 1234567);
I use the newlib-nano implementation with the -u _printf_float linker switch. The compiler is arm-none-eabi-gcc.
I do have a strong suspicion of memory allocation problems, as integers are printed without any hiccups, but floats act as if they got corrupted in the process. The printf family functions call malloc with floats, not with integers.
The only piece of code not belonging to newlib I am using in this context is my _sbrk(), which is required by malloc.
caddr_t _sbrk(int incr)
{
extern char _Heap_Begin; // Defined by the linker.
extern char _Heap_Limit; // Defined by the linker.
static char* current_heap_end;
char* current_block_address;
// first allocation
if (current_heap_end == 0)
current_heap_end = &_Heap_Begin;
current_block_address = current_heap_end;
// increment and align to 4-octet border
incr = (incr + 3) & (~3);
current_heap_end += incr;
// Overflow?
if (current_heap_end > &_Heap_Limit)
{
errno = ENOMEM;
current_heap_end = current_block_address;
return (caddr_t) - 1;
}
return (caddr_t)current_block_address;
}
As far as I have been able to track, this should work. It seems that no-one ever calls it with negative increments, but I guess that is due to the design of the newlib malloc. The only slightly odd thing is that the first call to _sbrk has a zero increment. (But this may be just malloc's curiosity about the starting address of the heap.)
The stack should not collide with the heap, as there is around 60 KiB RAM for the two. The linker script may be insane, but at least the heap and stack addresses seem to be correct.

As it may happen that someone else gets bitten by the same bug, I post an answer to my own question. However, it was #Notlikethat 's comment which suggested the correct answer.
This is a lesson of Thou shall not steal. I borrowed the gcc linker script which came with the STMCubeMX code generator. Unfortunately, the script along with the startup file is broken.
The relevant part of the original linker script:
_estack = 0x2000ffff;
and its counterparts in the startup script:
Reset_Handler:
ldr sp, =_estack /* set stack pointer */
...
g_pfnVectors:
.word _estack
.word Reset_Handler
...
The first interrupt vector position (at 0) should always point to the startup stack top. When the reset interrupt is reached, it also loads the stack pointer. (As far as I can say, the latter one is unnecessary as the HW anyway reloads the SP from the 0th vector before calling the reset handler.)
The Cortex-M stack pointer should always point to the last item in the stack. At startup there are no items in the stack and thus the pointer should point to the first address above the actual memory, 0x020010000 in this case. With the original linker script the stack pointer is set to 0x0200ffff, which actually results in sp = 0x0200fffc (the hardware forces word-aligned stack). After this the stack is misaligned by 4.
I changed the linker script by removing the constant definition of _estack and replacing it by _stacktop as shown below. The memory definitions were there before. I changed the name just to see where the value is used.
MEMORY
{
FLASH (rx) : ORIGIN = 0x8000000, LENGTH = 128K
RAM (xrw) : ORIGIN = 0x20000000, LENGTH = 64K
}
_stacktop = ORIGIN(RAM) + LENGTH(RAM);
After this the value of _stacktop is 0x20010000, and my numbers float beautifully... The same problem could arise with any external (library) function using double length parameters, as the ARM Cortex ABI states that the stack must be aligned to 8 octets when calling external functions.

snprintf accepts size as second argument. You might want to go through this example http://www.cplusplus.com/reference/cstdio/snprintf/
/* snprintf example */
#include <stdio.h>
int main ()
{
char buffer [100];
int cx;
cx = snprintf ( buffer, 100, "The half of %d is %d", 60, 60/2 );
snprintf ( buffer+cx, 100-cx, ", and the half of that is %d.", 60/2/2 );
puts (buffer);
return 0;
}

Why do very large stack allocations fail despite unlimited ulimit?

The following static allocation gives segmentation fault
double U[100][2048][2048];
But the following dynamic allocation goes fine
double ***U = (double ***)malloc(100 * sizeof(double **));
for(i=0;i<100;i++)
{
U[i] = (double **)malloc(2048 * sizeof(double *));
for(j=0;j<2048;j++)
{
U[i][j] = (double *)malloc(2048*sizeof(double));
}
}
The ulimit is set to unlimited in linux.
Can anyone give me some hint on whats happening?

When you say the ulimit is set to unlimited, are you using the -s option? As otherwise this doesn't change the stack limit, only the file size limit.
There appear to be stack limits regardless, though. I can allocate:
double *u = malloc(200*2048*2048*(sizeof(double))); // 6gb contiguous memory
And running the binary I get:
VmData: 6553660 kB
However, if I allocate on the stack, it's:
double u[200][2048][2048];
VmStk: 2359308 kB
Which is clearly not correct (suggesting overflow). With the original allocations, the two give the same results:
Array: VmStk: 3276820 kB
malloc: VmData: 3276860 kB
However, running the stack version, I cannot generate a segfault no matter what the size of the array -- even if it's more than the total memory actually on the system, if -s unlimited is set.
EDIT:
I did a test with malloc in a loop until it failed:
VmData: 137435723384 kB // my system doesn't quite have 131068gb RAM
Stack usage never gets above 4gb, however.

Assuming your machine actually has enough free memory to allocate 3.125 GiB of data, the difference most likely lies in the fact that the static allocation needs all of this memory to be contiguous (it's actually a 3-dimensional array), while the dynamic allocation only needs contiguous blocks of about 2048*8 = 16 KiB (it's an array of pointers to arrays of pointers to quite small actual arrays).
It is also possible that your operating system uses swap files for heap memory when it runs out, but not for stack memory.

There is a very good discussion of Linux memory management - and specifically the stack - here: 9.7 Stack overflow, it is worth the read.
You can use this command to find out what your current stack soft limit is
ulimit -s
On Mac OS X the hard limit is 64MB, see How to change the stack size using ulimit or per process on Mac OS X for a C or Ruby program?
You can modify the stack limit at run-time from your program, see Change stack size for a C++ application in Linux during compilation with GNU compiler
I combined your code with the sample there, here's a working program
#include <stdio.h>
#include <sys/resource.h>
unsigned myrand() {
static unsigned x = 1;
return (x = x * 1664525 + 1013904223);
}
void increase_stack( rlim_t stack_size )
{
rlim_t MIN_STACK = 1024 * 1024;
stack_size += MIN_STACK;
struct rlimit rl;
int result;
result = getrlimit(RLIMIT_STACK, &rl);
if (result == 0)
{
if (rl.rlim_cur < stack_size)
{
rl.rlim_cur = stack_size;
result = setrlimit(RLIMIT_STACK, &rl);
if (result != 0)
{
fprintf(stderr, "setrlimit returned result = %d\n", result);
}
}
}
}
void my_func() {
double U[100][2048][2048];
int i,j,k;
for(i=0;i<100;++i)
for(j=0;j<2048;++j)
for(k=0;k<2048;++k)
U[i][j][k] = myrand();
double sum = 0;
int n;
for(n=0;n<1000;++n)
sum += U[myrand()%100][myrand()%2048][myrand()%2048];
printf("sum=%g\n",sum);
}
int main() {
increase_stack( sizeof(double) * 100 * 2048 * 2048 );
my_func();
return 0;
}

You are hitting a limit of the stack. By default on Windows, the stack is 1M but can grow more if there is enough memory.
On many *nix systems default stack size is 512K.
You are trying to allocate 2048 * 2048 * 100 * 8 bytes, which is over 2^25 (over 2G for stack). If you have a lot of virtual memory available and still want to allocate this on stack, use a different stack limit while linking the application.
Linux:
How to increase the gcc executable stack size?
Change stack size for a C++ application in Linux during compilation with GNU compiler
Windows:
http://msdn.microsoft.com/en-us/library/tdkhxaks%28v=vs.110%29.aspx

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight