How is stack memory consumption calculated? - c

I need to calculate the stack memory consumption of my program.
gcc's -fstack-usage only reports the stack usage of each function on its own; as far as I understand, it does not include the stack needed by functions called from that function.
void test1(){
uint32_t stackmemory[100];
function1(); //needs additional stack, say 200 Bytes
uint32_t stackmemory2[100];
}
void test2(){
uint32_t stackmemory[100];
uint32_t stackmemory2[100];
function1(); //needs additional stack, say 200 Bytes
}
Which test() function uses less stack? I would say test1(), as the stack is freed after the function1() call. Or does this depend on the optimization level -Os/-O2...?
Does the compiler allocate memory in test1() for all its static variables, as soon as the function is entered? Or is stackmemory2[100] allocated when the line is reached?

In general you need to combine call-graph information with the .su files generated by -fstack-usage to find the deepest stack usage starting from a specific function. Starting at main() or a thread entry-point will then give you the worst-case usage for that thread.
Helpfully the work to create such a tool has been done for you as discussed here, using a Perl script from here.
ARM's armcc compiler (as used in Keil ARM-MDK) has this functionality built-in and can include detailed stack analysis in the link map, including the worst-case call path and warnings of non-deterministic stack usage (due to recursion for example).
In my experience observing the behaviour of several compilers, the stack frame is typically defined for the lifetime of the function, regardless of the lifetime and scope of the variables declared in it. So the two versions in that case will not differ. Unless they are declared volatile, the optimiser will likely remove both arrays in any event. However, you should not rely on any observations in this respect being universal: this is defined by the implementation rather than by the language.
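As a rough illustration of combining call-graph information with per-function figures, the sketch below walks a small hand-written call graph and returns the deepest path. The struct func_info layout, the MAX_CALLEES limit and any byte counts fed into it are hypothetical; a real tool would read the frame sizes from the .su files and the edges from a call-graph dump, and would also have to account for per-call overhead that this sketch ignores. It assumes the graph has no cycles, i.e. no recursion.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLEES 4

/* Hypothetical record: one function's own frame size (as a .su file would
   report it) plus the indices of the functions it calls. */
struct func_info {
    const char *name;
    size_t frame_bytes;
    int callees[MAX_CALLEES];
    int callee_count;
};

/* Worst-case stack depth in bytes when execution starts at funcs[idx].
   Assumes an acyclic call graph; a cycle would make this recurse forever. */
size_t worst_case_stack(const struct func_info *funcs, int idx)
{
    size_t deepest = 0;

    assert(funcs != NULL);
    assert(funcs[idx].callee_count <= MAX_CALLEES);

    for (int i = 0; i < funcs[idx].callee_count; i++) {
        size_t depth = worst_case_stack(funcs, funcs[idx].callees[i]);
        if (depth > deepest) {
            deepest = depth;   /* keep only the most expensive callee path */
        }
    }
    return funcs[idx].frame_bytes + deepest;
}
```

Calling worst_case_stack with the index of main() or of a thread entry point gives the estimate for that thread. Only the deepest callee counts at each level, because sibling calls reuse the same stack region one after the other.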

Which test() function uses less stack? I would say test1(), as the
stack is freed after the function1() call. Or does this depend on the
optimization level -Os/-O2...?
They allocate exactly the same memory on the stack. And exactly at the same moment :)
test1:
push rbp
mov rbp, rsp
sub rsp, 800 <- on stack allocation
mov eax, 0
call function1
nop
leave
ret
test2:
push rbp
mov rbp, rsp
sub rsp, 800 <- on stack allocation
mov eax, 0
call function1
nop
leave
ret
The only difference is that, with optimizations on, unused variables are removed completely.
Does the compiler allocate memory in test1() for all its static
variables
You don't have any static variables in your example, only automatic ones.
Static variables are not allocated on the stack, because they retain their values between function calls.
godbolt.org/z/j9-VQq

Related

Absolute worst case stack size based on automatic variables

In a C99 program, under the (theoretical) assumption that I'm not using variable-length arrays, and each of my automatic variables can only exist once at a time in the whole stack (by forbidding circular function calls and explicit recursion), if I sum up all the space they are consuming, could I declare that this is the maximal stack size that can ever happen?
A bit of context here: I told a friend that I wrote a program not using dynamic memory allocation ("malloc") and allocate all memory static (by modeling all my state variables in a struct, which I then declared global). He then told me that if I'm using automatic variables, I still make use of dynamic memory. I argued that my automatic variables are not state variables but control variables, so my program is still to be considered static. We then discussed that there has to be a way to make a statement about the absolute worst-case behaviour about my program, so I came up with the above question.
Bonus question: If the assumptions above hold, I could simply declare all automatic variables static and would end up with a "truly" static program?
Even if array sizes are constant a C implementation could allocate arrays and even structures dynamically. I'm not aware of any that do (anyone) and it would appear quite unhelpful. But the C Standard doesn't make such guarantees.
There is also (almost certainly) some further overhead in the stack frame (the data added to the stack on call and released on return).
You would need to declare all your functions as taking no parameters and returning void to ensure no program variables in the stack. Finally the 'return address' of where execution of a function is to continue after return is pushed onto the stack (at least logically).
So having moved all parameters, automatic variables and return values to your 'state' struct, there will still be something going onto the stack - probably.
I say probably because I'm aware of a (non-standard) embedded C compiler that forbids recursion and can determine the maximum size of the stack by examining the call tree of the whole program and identifying the call chain that reaches the peak size of the stack.
You could achieve this with a monstrous pile of goto statements (some conditional, where a function is logically called from two places) or by duplicating code.
It's often important in embedded code on devices with tiny memory to avoid any dynamic memory allocation and know that any 'stack-space' will never overflow.
I'm happy this is a theoretical discussion. What you suggest is a mad way to write code and would throw away most of the (ultimately limited) services C provides as infrastructure for procedural coding (pretty much the call stack).
Footnote: See the comment below about the 8-bit PIC architecture.
Bonus question: If the assumptions above hold, I could simply declare
all automatic variables static and would end up with a "truly" static
program?
No. This would change the function of the program. static variables are initialized only once.
Compare these two functions:
int canReturn0Or1(void)
{
static unsigned a=0;
a++;
if(a>1)
{
return 1;
}
return 0;
}
int willAlwaysReturn0(void)
{
unsigned a=0;
a++;
if(a>1)
{
return 1;
}
return 0;
}
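To make the difference concrete, here is a small sketch that calls each function twice. The two functions are repeated from above so that the block compiles on its own; the helper second_call_result is a hypothetical name introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

int canReturn0Or1(void)
{
    static unsigned a = 0;   /* initialized once, keeps its value between calls */
    a++;
    if (a > 1) {
        return 1;
    }
    return 0;
}

int willAlwaysReturn0(void)
{
    unsigned a = 0;          /* re-initialized on every call */
    a++;
    if (a > 1) {
        return 1;
    }
    return 0;
}

/* Hypothetical helper: call fn twice and return the result of the second call. */
int second_call_result(int (*fn)(void))
{
    assert(fn != NULL);
    (void)fn();
    return fn();
}
```

With the static version the second call sees the value left behind by the first one, so second_call_result(canReturn0Or1) yields 1, while second_call_result(willAlwaysReturn0) yields 0.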
In a C99 program, under the (theoretical) assumption that I'm not using variable-length arrays, and each of my automatic variables can only exist once at a time in the whole stack (by forbidding circular function calls and explicit recursion), if I sum up all the space they are consuming, could I declare that this is the maximal stack size that can ever happen?
No, because of function pointers..... Read n1570.
Consider the following code, where rand(3) is some pseudo random number generator (it could also be some input from a sensor) :
#include <stdlib.h> /* rand, NULL */
typedef int foosig(int);
int foo(int x) {
foosig* fptr = (x>rand())?&foo:NULL;
if (fptr)
return (*fptr)(x);
else
return x+rand();
}
An optimizing compiler (such as some recent GCC suitably invoked with enough optimizations) would make a tail-recursive call for (*fptr)(x). Some other compiler won't.
Depending on how you compile that code, it would use a bounded stack or could produce a stack overflow. With some ABI and calling conventions, both the argument and the result could go through a processor register and would not consume any stack space.
Experiment with a recent GCC (e.g. on Linux/x86-64, some GCC 10 in 2020) invoked as gcc -O2 -fverbose-asm -S foo.c then look inside foo.s. Change the -O2 to a -O0.
Observe that the naive recursive factorial function could be compiled into some iterative machine code with a good enough C compiler and optimizer. In practice GCC 10 on Linux compiling the below code:
int fact(int n)
{
if (n<1) return 1;
else return n*fact(n-1);
}
as gcc -O3 -fverbose-asm tmp/fact.c -S -o tmp/fact.s produces the following assembler code:
.type fact, @function
fact:
.LFB0:
.cfi_startproc
endbr64
# tmp/fact.c:3: if (n<1) return 1;
movl $1, %eax #, <retval>
testl %edi, %edi # n
jle .L1 #,
.p2align 4,,10
.p2align 3
.L2:
imull %edi, %eax # n, <retval>
subl $1, %edi #, n
jne .L2 #,
.L1:
# tmp/fact.c:5: }
ret
.cfi_endproc
.LFE0:
.size fact, .-fact
.ident "GCC: (Ubuntu 10.2.0-5ubuntu1~20.04) 10.2.0"
And you can observe that the call stack is not increasing above.
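Read back into C, the loop in that listing corresponds roughly to the function below. This is a hand translation for illustration, not compiler output; the names fact_iter and check_equivalence are made up here, and fact is repeated so that the block is self-contained.

```c
#include <assert.h>

/* The recursive original, repeated for comparison. */
int fact(int n)
{
    if (n < 1) return 1;
    else return n * fact(n - 1);
}

/* Hand-written equivalent of the emitted loop: there is no call
   instruction, so the stack does not grow with n. */
int fact_iter(int n)
{
    int result = 1;          /* movl $1, %eax */
    if (n < 1) {             /* testl %edi, %edi ; jle .L1 */
        return result;
    }
    do {
        result *= n;         /* imull %edi, %eax */
        n -= 1;              /* subl $1, %edi */
    } while (n != 0);        /* jne .L2 */
    return result;
}

/* Quick check that both versions agree for small inputs
   (kept small to stay clear of int overflow). */
void check_equivalence(void)
{
    for (int n = -1; n <= 10; n++) {
        assert(fact(n) == fact_iter(n));
    }
}
```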
If you have serious and documented arguments against GCC, please submit a bug report.
BTW, you could write your own GCC plugin which would choose to randomly apply or not such an optimization. I believe it stays conforming to the C standard.
The above optimization is essential for many compilers generating C code, such as Chicken/Scheme or Bigloo.
A related theorem is Rice's theorem. See also this draft report funded by the CHARIOT project.
See also the Compcert project.

Inline and stack frame control

The following are artificial examples. Clearly, compiler optimizations will dramatically change the final outcome. However, and I cannot stress this enough: by temporarily disabling optimizations I intend to get an upper bound on stack usage, and I expect that further compiler optimization can only improve the situation.
The discussion is centered around GCC only. I would like to have fine control over how automatic variables get released from the stack. Scoping with blocks does not ensure that memory will be released when automatic variables go out of scope. Functions, as far as I know, do ensure that.
However, when inlining, what is the case? For example:
inline __attribute__((always_inline)) void foo()
{
uint8_t buffer1[100];
// Stack Size Measurement A
// Do something
}
void bar()
{
foo();
uint8_t buffer2[100];
// Stack Size Measurement B
// Do something else
}
Can I always expect that at measurement point B, the stack will only contain buffer2 and buffer1 has been released?
Apart from function calls (which result in additional stack usage) is there any way I can have fine control over stack deallocations?
I would like to have fine control over how automatic variables get released from the stack.
Lots of confusion here. The optimizing compiler could store some automatic variables only in registers, without using any slot in the call frame. The C language specification (n1570) does not require any call stack.
And a given register, or slot in the call frame, can be reused for different purposes (e.g. different automatic variables in different parts of the function). Register allocation is a significant role of compilers.
Can I always expect that at measurement point B, the stack will only contain buffer2 and buffer1 has been released?
Certainly not. The compiler could prove that at some later point in your code, the space for buffer1 is not useful anymore so reuse that space for other purposes.
is there any way I can have fine control over stack deallocations?
No, there is not. The call stack is an implementation detail, and might not be used (or be "abused" in your point of view) by the compiler and the generated code.
For a silly example, if buffer1 is not used in foo, the compiler might not allocate space for it. And some clever compilers might allocate just 8 bytes for it, if they can prove that only the first 8 bytes of buffer1 are used.
More seriously, in some cases, GCC is able to do tail-call optimizations.
You may be interested in invoking GCC with -fstack-reuse=all, -Os, -Wstack-usage=256, -fstack-usage, and other options.
Of course, the concrete stack usage depends upon the optimization levels. You might also inspect the generated assembler code, e.g. with -S -O2 -fverbose-asm
For example, the following code e.c:
int f(int x, int y) {
int t[100];
t[0] = x;
t[1] = y;
return t[0]+t[1];
}
when compiled with GCC 8.1 on Linux/Debian/x86-64 using gcc -S -fverbose-asm -O2 e.c gives in e.s
.text
.p2align 4,,15
.globl f
.type f, @function
f:
.LFB0:
.cfi_startproc
# e.c:5: return t[0]+t[1];
leal (%rdi,%rsi), %eax #, tmp90
# e.c:6: }
ret
.cfi_endproc
.LFE0:
.size f, .-f
and you see that the stack frame is not grown by 100*4 bytes. And this is still the case with:
int f(int x, int y, int n) {
int t[n];
t[0] = x;
t[1] = y;
return t[0]+t[1];
}
which actually generates the same machine code as above. And if, instead of the + above, I call some inline int add(int u, int v) { return u+v; }, the generated code does not change.
Be aware of the as-if rule, and of the tricky notion of undefined behavior (if n was 1 above, it is UB).
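To experiment with the options mentioned above, it helps to have a function whose buffer the optimizer cannot simply discard. The sketch below is one hypothetical way to get that: the array is volatile, so every store and load is an observable access and, in practice, the array has to live in memory. The function name and the 64-element size are arbitrary choices. Compiling it with something like gcc -O2 -fstack-usage -c should produce a .su file next to the object file that reports the frame size per function; the exact number depends on the GCC version, target and options.

```c
#include <assert.h>
#include <stdint.h>

#define BUF_LEN 64

/* The local buffer alone is 256 bytes, which makes it a convenient
   candidate for trying out -Wstack-usage=256. */
static_assert(sizeof(uint32_t) * BUF_LEN == 256, "buffer should be 256 bytes");

/* Fill a local buffer and sum it. `volatile` keeps the compiler from
   folding the whole computation into registers. */
uint32_t sum_with_stack_buffer(uint32_t seed)
{
    volatile uint32_t buf[BUF_LEN];
    uint32_t sum = 0;

    for (uint32_t i = 0; i < BUF_LEN; i++) {
        buf[i] = seed + i;
    }
    for (uint32_t i = 0; i < BUF_LEN; i++) {
        sum += buf[i];
    }
    return sum;
}
```

Removing the volatile qualifier and recompiling is a quick way to see how much of the frame survives optimization.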
Can I always expect that at measurement B, the stack will only contain buffer2 and buffer1 has been released?
No. It's going to depend on GCC version, target, optimization level, options.
Apart from function calls (which result in additional stack usage) is there any way I can have fine control over stack deallocations?
Your requirement is so specific that I guess you will likely have to write the code yourself in assembler.
mov BYTE PTR [rbp-20], 1 and mov BYTE PTR [rbp-10], 2 only show offsets relative to rbp within the stack frame. At run time, both versions have the same peak stack usage.
There are two differences between calling foo() and inlining it:
1) With a function call, buffer1 is released on exit from foo(). With inlining, buffer1 is kept until exit from bar(), which means the peak stack usage lasts longer.
2) A function call adds some overhead compared with inlining, such as saving stack frame information.

Stack pointer in assembly moves more than the required number of bytes for storage of auto variables [duplicate]

This question comes from answering Stack Overflow question Why do books say, “the compiler allocates space for variables in memory”?, where I tried to demonstrate to the OP what happens when you allocate a variable on the stack and how the compiler generates code that knows the size of memory to allocate. Apparently the compiler allocates much more space than what is needed.
However, when compiling the following
#include <iostream>
using namespace std;
int main()
{
int foo;
return 0;
}
You get the following assembler output with Visual C++ 2012 compiled in debug mode with no optimisations on:
int main()
{
00A31CC0 push ebp
00A31CC1 mov ebp,esp
00A31CC3 sub esp,0CCh // Allocates 204 bytes here.
00A31CC9 push ebx
00A31CCA push esi
00A31CCB push edi
00A31CCC lea edi,[ebp-0CCh]
00A31CD2 mov ecx,33h
00A31CD7 mov eax,0CCCCCCCCh
00A31CDC rep stos dword ptr es:[edi]
int foo;
return 0;
00A31CDE xor eax,eax
}
Adding one more int to my program changes the commented line above to the following:
00B81CC3 sub esp,0D8h // Allocate 216 bytes
The question raised by @JamesKanze on my answer linked above is why the compiler (and apparently not only Visual C++, although I haven't done the experiment with another compiler) allocated 204 and 216 bytes respectively, when the first case needs only four and the second only eight.
This program creates a 32-bit executable.
From a technical perspective, why may it need to allocate 204 bytes instead of just 4?
EDIT:
Calling two functions and creating a double and two int in main, you get
01374493 sub esp,0E8h // 232 bytes
For the same program as the edit above, it does this in release mode (no optimizations):
sub esp, 8 // Two ints
movsd QWORD PTR [esp], xmm0 // I suspect this is where my `double` goes
This extra space is generated by the /ZI compile option, which enables Edit + Continue. The extra space is available for local variables that you might add when you edit code while debugging.
You are also seeing the effect of /RTC: it initializes all local variables to 0xcccccccc so that it is easier to diagnose problems caused by forgetting to initialize variables. Of course, none of this code is generated in the default Release configuration settings.
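The numbers in the debug listing fit this explanation. rep stos dword writes ecx doublewords of four bytes each, so ecx = 33h covers exactly the 0CCh bytes reserved by sub esp. A small check of that arithmetic, with helper names made up for illustration:

```c
#include <assert.h>

/* Bytes written by `rep stos dword` for a given count in ecx. */
unsigned fill_bytes(unsigned dword_count)
{
    return dword_count * 4u;
}

/* Cross-check the constants that appear in the listing. */
void check_listing_numbers(void)
{
    assert(fill_bytes(0x33) == 0xCC);   /* 51 dwords * 4 = 204 bytes */
    assert(0xD8 - 0xCC == 12);          /* one more int grew the frame by 12 bytes */
}
```

So the fill pattern is written over the whole reserved area, not just over the four bytes that foo itself needs.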

Is 'goto' smart for correct working with stack variables in C (not C++)

(Sorry for bad English.)
Question 1.
void foo(void)
{
goto inside;
for (;;) {
int stack_var = 42;
inside:
...
}
}
Will space on the stack be allocated for stack_var when I goto the inside label? I.e. can I correctly use the stack_var variable within ...?
Question 2.
void foo(void)
{
for (;;) {
int stack_var = 42;
...
goto outside;
}
outside:
...
}
Will the stack space of stack_var be deallocated when I goto the outside label? E.g. is it correct to do return within ...?
In other words, is goto smart enough to work correctly with stack variables (automatic (de)allocation as I move through blocks), or is it just a dumb jump?
Question 1:
can I correctly use the stack_var variable within ...?
The code in ... can write to stack_var. However, this variable is uninitialized because the execution flow jumped over the initialization, so the code should not read from it without having written to it first.
From the C99 standard, 6.8:3
The initializers of objects that have automatic storage duration […] are evaluated and the values are stored in the objects (including storing an indeterminate value in objects without an initializer) each time the declaration is reached in the order of execution
My compiler compiles the function below to a piece of assembly that sometimes returns the uninitialized contents of x:
int f(int c){
if (c) goto L;
int x = 42;
L:
return x;
}
cmpl $0, %eax
jne LBB1_2
movl $42, -16(%rbp)
LBB1_2:
movl -16(%rbp), %eax
...
popq %rbp
ret
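One way to avoid reading an indeterminate value is to make sure the variable is written on every path that reaches the label, for instance by declaring and initializing it before the first goto. A minimal sketch, where f_fixed, check_f_fixed and the default value 0 are illustrative choices rather than part of the original answer:

```c
#include <assert.h>

int f_fixed(int c)
{
    int x = 0;          /* initialized before any jump can skip it */
    if (c) goto L;
    x = 42;
L:
    return x;           /* always reads a value that was written */
}

/* Quick self-check of both paths. */
void check_f_fixed(void)
{
    assert(f_fixed(0) == 42);   /* fall-through path stores 42 */
    assert(f_fixed(1) == 0);    /* jump path keeps the default */
}
```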
Question 2:
Will the stack space of stack_var be deallocated when I goto the outside label?
Yes, you can expect the memory reserved for stack_var to be reclaimed as soon as the variable goes out of scope.
There are two different issues:
Lexical scoping of variables inside C code. A C variable only makes sense inside the block in which it is declared. You could imagine that the compiler renames variables to unique names, which have meaning only inside that block.
Call frames in the generated code. A good optimizing compiler usually allocates the call frame of the current function on the machine call stack at the beginning of the function. A given location in that call frame, called a slot, can be (and usually is) reused by the compiler for several local variables (or other purposes).
A local variable can also be kept in a register only (without any slot in the call frame), and that register will obviously be reused for various purposes.
You are probably hitting undefined behavior in your first case. After the goto inside, stack_var is uninitialized.
I suggest you compile with gcc -Wall and improve the code until no warnings are given.

Regarding stack reuse of a function calling itself?

If a function calls itself while defining variables at the same time,
would it result in a stack overflow? Is there any option in gcc to reuse the same stack?
void funcnew(void)
{
int a=10;
int b=20;
funcnew();
return ;
}
can a function reuse the stack-frame which it used earlier?
What is the option in gcc to reuse the same frame in tail recursion?
Yes. See
-foptimize-sibling-calls
Optimize sibling and tail recursive calls.
Enabled at levels -O2, -O3, -Os.
Your function is compiled to:
funcstack:
.LFB0:
.cfi_startproc
xorl %eax, %eax
jmp func
.cfi_endproc
(note the jump to func)
Reusing the stack frame when a function ends with a call is a well-known optimisation called tail call removal. In its full generality it involves manipulating the stack to put the parameters in the correct place and replacing the function call with a jump to the start of the function. It is mandated by some languages (Scheme, for instance) for recursive calls (a recursive call is the natural way to express a loop in these languages). As shown above, gcc implements the optimisation for C, but I'm not sure which other compilers have it, so I would not depend on it for portable code. Note also that I don't know what restrictions apply to it: I'm not sure, for instance, that gcc will manipulate the stack if the parameter types are different.
Even without defining the parameters you'd get a stack overflow, since the return address is also pushed onto the stack.
It is (I've learned this recently) possible that the compiler optimizes your tail recursion into a loop (which makes the stack not grow at all). Link to tail recursion question on SO
No, each recursion is a new stack frame. If the recursion is infinitely deep, then the stack needed is also infinite, so you get a stack overflow.
Yes, in some cases the compiler may be able to perform something called tail call optimization. You should check with your compiler manual. (AProgrammer seems to have quoted the GCC manual in his answer.)
This is an essential optimization when implementing for example functional languages, where such code occurs frequently.
You can't do away with the stack frame altogether, as it is needed for the return address, unless you are using tail recursion and your compiler has optimised it into a loop. But to be completely technically honest, you can do away with all the variables in the frame by making them static. However, this is almost certainly not what you want to do, and you should not do it without knowing exactly what you are doing, which, as you had to ask this question, you don't.
As others have noted, it is only possible if (1) your compiler supports tail call optimization, and (2) if your function is eligible for such an optimization. The optimization is to reuse the existing stack and perform a JMP (i.e., a GOTO in assembly) instead of a CALL.
In fact, your example function is indeed eligible for such an optimization. The reason is that the last thing your function does before returning is call itself; it doesn't have to do anything after the last call to funcnew(). However, only certain compilers will perform such an optimization. GCC, for instance, will do it. For more info, see Which, if any, C++ compilers do tail-recursion optimization?
The classic material on this is the factorial function. Let's make a recursive factorial function that is not eligible for tail call optimization (TCO).
int fact(int n)
{
if ( n == 1 ) return 1;
return n*fact(n-1);
}
The last thing it does is to multiply n with the result from fact(n-1). By somehow eliminating this last operation, we would be able to reuse the stack. Let's introduce an accumulator variable that will compute the answer for us:
int fact_helper(int n, int acc)
{
if ( n == 1 ) return acc;
return fact_helper(n-1, n*acc);
}
int fact_acc(int n)
{
return fact_helper(n, 1);
}
The function fact_helper does the work, while fact_acc is just a convenience function to initialize the accumulator variable.
Note how the last thing fact_helper does is to call itself. This CALL can be converted to a JMP by reusing the existing stack for the variables.
With GCC, you can verify that it is optimized to a jump by looking at the generated assembly, for instance gcc -c -O3 -Wa,-a,-ad fact.c:
...
37 L12:
38 0040 0FAFC2 imull %edx, %eax
39 0043 83EA01 subl $1, %edx
40 0046 83FA01 cmpl $1, %edx
41 0049 75F5 jne L12
...
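Written back as C, the loop in that listing is roughly what the tail call turns into. The sketch below puts a hand-written loop next to fact_helper so the two can be compared; fact_loop is a name introduced here, and like the original it only handles n >= 1.

```c
#include <assert.h>

/* Tail-recursive version from the answer above. */
int fact_helper(int n, int acc)
{
    if (n == 1) return acc;
    return fact_helper(n - 1, n * acc);
}

/* What the tail call amounts to once it becomes a jump: the parameters
   are updated in place and control goes back to the top. */
int fact_loop(int n)
{
    int acc = 1;

    assert(n >= 1);      /* n < 1 is outside what either version handles */
    while (n != 1) {
        acc = n * acc;   /* imull %edx, %eax */
        n = n - 1;       /* subl $1, %edx */
    }
    return acc;
}
```

Both functions perform the same multiplications in the same order; the loop version just makes explicit that no new frame is needed per step.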
Some programming languages, such as Scheme, will actually guarantee that proper implementations will perform such optimizations. They will even do it for non-recursive tail calls.
