In a C99 program, under the (theoretical) assumption that I'm not using variable-length arrays, and each of my automatic variables can only exist once at a time in the whole stack (by forbidding circular function calls and explicit recursion), if I sum up all the space they are consuming, could I declare that this is the maximal stack size that can ever happen?
A bit of context here: I told a friend that I wrote a program that does not use dynamic memory allocation ("malloc") and allocates all memory statically (by modeling all my state variables in a struct, which I then declared global). He then told me that if I'm using automatic variables, I still make use of dynamic memory. I argued that my automatic variables are not state variables but control variables, so my program is still to be considered static. We then discussed that there has to be a way to make a statement about the absolute worst-case behaviour of my program, so I came up with the above question.
Bonus question: If the assumptions above hold, I could simply declare all automatic variables static and would end up with a "truly" static program?
Even if array sizes are constant, a C implementation could allocate arrays and even structures dynamically. I'm not aware of any that do (is anyone?), and it would appear quite unhelpful. But the C Standard doesn't make such guarantees.
There is also (almost certainly) some further overhead in the stack frame (the data added to the stack on call and released on return).
You would need to declare all your functions as taking no parameters and returning void to ensure no program variables are on the stack. Finally, the 'return address' of where execution of a function is to continue after return is pushed onto the stack (at least logically).
So having moved all parameters, automatic variables and return values to your 'state' struct, there will still be something going on with the stack - probably.
I say probably because I'm aware of a (non-standard) embedded C compiler that forbids recursion and can determine the maximum size of the stack by examining the call tree of the whole program, identifying the call chain that reaches the peak stack size.
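As a toy model of what such a tool computes (the data structures here are entirely made up; and note the analysis itself recurses, even though the analysed program must not):

#include <stddef.h>

struct func {
    size_t frame_size;           /* bytes of stack used by this function's frame */
    const struct func **callees; /* functions it may call (no cycles, by premise) */
    size_t n_callees;
};

/* Worst-case stack usage = own frame + deepest callee chain. */
size_t peak_stack(const struct func *f)
{
    size_t worst = 0;
    for (size_t i = 0; i < f->n_callees; i++) {
        size_t s = peak_stack(f->callees[i]);
        if (s > worst)
            worst = s;
    }
    return f->frame_size + worst;
}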
You could avoid the stack-consuming calls altogether with a monstrous pile of goto statements (some conditional, where a function is logically called from two places), or by duplicating code.
It's often important in embedded code on devices with tiny memory to avoid any dynamic memory allocation and know that any 'stack-space' will never overflow.
I'm happy this is a theoretical discussion. What you suggest is a mad way to write code and would throw away most of the (ultimately limited) services C provides as infrastructure for procedural coding (pretty much the call stack).
Footnote: See the comment below about the 8-bit PIC architecture.
Bonus question: If the assumptions above hold, I could simply declare all automatic variables static and would end up with a "truly" static program?
No. This would change the behaviour of the program. static variables are initialized only once.
Compare these two functions:
int canReturn0Or1(void)
{
    static unsigned a = 0;
    a++;
    if (a > 1)
    {
        return 1;
    }
    return 0;
}

int willAlwaysReturn0(void)
{
    unsigned a = 0;
    a++;
    if (a > 1)
    {
        return 1;
    }
    return 0;
}
In a C99 program, under the (theoretical) assumption that I'm not using variable-length arrays, and each of my automatic variables can only exist once at a time in the whole stack (by forbidding circular function calls and explicit recursion), if I sum up all the space they are consuming, could I declare that this is the maximal stack size that can ever happen?
No, because of function pointers. Read n1570 (the C11 standard draft).
Consider the following code, where rand(3) is some pseudo-random number generator (it could also be some input from a sensor):
#include <stdlib.h>   /* for rand() */

typedef int foosig(int);

int foo(int x) {
    foosig* fptr = (x > rand()) ? &foo : NULL;
    if (fptr)
        return (*fptr)(x);
    else
        return x + rand();
}
An optimizing compiler (such as a recent GCC suitably invoked with enough optimizations) would compile the (*fptr)(x) call as a tail call. Some other compiler won't.
Depending on how you compile that code, it would use a bounded stack or could produce a stack overflow. With some ABIs and calling conventions, both the argument and the result could go through a processor register and won't consume any stack space.
Experiment with a recent GCC (e.g. on Linux/x86-64, some GCC 10 in 2020) invoked as gcc -O2 -fverbose-asm -S foo.c then look inside foo.s. Change the -O2 to a -O0.
Observe that the naive recursive factorial function could be compiled into some iterative machine code with a good enough C compiler and optimizer. In practice GCC 10 on Linux compiling the below code:
int fact(int n)
{
    if (n < 1) return 1;
    else return n * fact(n - 1);
}
as gcc -O3 -fverbose-asm tmp/fact.c -S -o tmp/fact.s produces the following assembler code:
        .type   fact, @function
fact:
.LFB0:
        .cfi_startproc
        endbr64
# tmp/fact.c:3:   if (n<1) return 1;
        movl    $1, %eax        #, <retval>
        testl   %edi, %edi      # n
        jle     .L1     #,
        .p2align 4,,10
        .p2align 3
.L2:
        imull   %edi, %eax      # n, <retval>
        subl    $1, %edi        #, n
        jne     .L2     #,
.L1:
# tmp/fact.c:5: }
        ret
        .cfi_endproc
.LFE0:
        .size   fact, .-fact
        .ident  "GCC: (Ubuntu 10.2.0-5ubuntu1~20.04) 10.2.0"
And you can observe that the call stack is not increasing above.
If you have serious and documented arguments against GCC, please submit a bug report.
BTW, you could write your own GCC plugin which would choose randomly whether to apply such an optimization. I believe it would stay conforming to the C standard.
The above optimization is essential for many compilers generating C code, such as Chicken/Scheme or Bigloo.
A related theorem is Rice's theorem. See also this draft report funded by the CHARIOT project.
See also the Compcert project.
The following are artificial examples. Clearly compiler optimizations will dramatically change the final outcome. However, and I cannot stress this enough: by temporarily disabling optimizations, I intend to obtain an upper bound on stack usage; I expect that further compiler optimization can only improve the situation.
The discussion is centered around GCC only. I would like to have fine control over how automatic variables get released from the stack. Scoping with blocks does not ensure that memory will be released when automatic variables go out of scope. Functions, as far as I know, do ensure that.
However, when inlining, what is the case? For example:
#include <stdint.h>   /* for uint8_t */

inline __attribute__((always_inline)) void foo()
{
    uint8_t buffer1[100];
    // Stack Size Measurement A
    // Do something
}

void bar()
{
    foo();
    uint8_t buffer2[100];
    // Stack Size Measurement B
    // Do something else
}
Can I always expect that at measurement point B, the stack will only contain buffer2 and buffer1 has been released?
Apart from function calls (which result in additional stack usage) is there any way I can have fine control over stack deallocations?
I would like to have fine control over how automatic variables get released from the stack.
Lots of confusion here. The optimizing compiler could store some automatic variables only in registers, without using any slot in the call frame. The C language specification (n1570) does not require any call stack.
And a given register, or slot in the call frame, can be reused for different purposes (e.g. different automatic variables in different parts of the function). Register allocation is a significant role of compilers.
Can I always expect that at measurement point B, the stack will only contain buffer2 and buffer1 has been released?
Certainly not. The compiler could prove that at some later point in your code, the space for buffer1 is not useful anymore so reuse that space for other purposes.
is there any way I can have fine control over stack deallocations?
No, there is not. The call stack is an implementation detail, and might not be used (or be "abused" in your point of view) by the compiler and the generated code.
For a silly example, if buffer1 is not used in foo, the compiler might not allocate space for it at all. And some clever compilers might allocate just 8 bytes for it, if they can prove that only the first 8 bytes of buffer1 are used.
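As a toy illustration of that point (a sketch only; the function name is made up and no compiler is obliged to behave this way):

#include <stdint.h>

uint8_t first_byte_only(int x)
{
    uint8_t buffer1[100];       /* only element 0 is ever touched */
    buffer1[0] = (uint8_t)x;
    return buffer1[0];          /* may live entirely in a register,
                                   with no stack slot for buffer1 at all */
}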
More seriously, in some cases, GCC is able to do tail-call optimizations.
You should be interested in invoking GCC with -fstack-reuse=all, -Os,
-Wstack-usage=256, -fstack-usage, and other options.
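For instance (a sketch; the frame size shown is illustrative and the .su format can vary across GCC versions), -fstack-usage makes GCC write a .su file next to each object file, one line per function, giving the frame size in bytes and whether it is static, dynamic, or bounded:

gcc -O2 -fstack-usage -c e.c
cat e.su        # something like:
# e.c:1:5:f     16      static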
Of course, the concrete stack usage depends upon the optimization levels. You might also inspect the generated assembler code, e.g. with -S -O2 -fverbose-asm
For example, the following code e.c:
int f(int x, int y) {
    int t[100];
    t[0] = x;
    t[1] = y;
    return t[0] + t[1];
}
when compiled with GCC 8.1 on Linux/Debian/x86-64 using gcc -S -fverbose-asm -O2 e.c gives, in e.s:
        .text
        .p2align 4,,15
        .globl  f
        .type   f, @function
f:
.LFB0:
        .cfi_startproc
# e.c:5:    return t[0]+t[1];
        leal    (%rdi,%rsi), %eax       #, tmp90
# e.c:6: }
        ret
        .cfi_endproc
.LFE0:
        .size   f, .-f
and you see that the stack frame is not grown by 100*4 bytes. And this is still the case with:
int f(int x, int y, int n) {
    int t[n];
    t[0] = x;
    t[1] = y;
    return t[0] + t[1];
}
which actually generates the same machine code as above. And if, instead of the + above, I call some inline int add(int u, int v) { return u+v; }, the generated code does not change.
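For reference, that variant spelled out (a direct transcription of the sentence above):

static inline int add(int u, int v) { return u + v; }

int f(int x, int y, int n) {
    int t[n];                 /* VLA; UB if n < 2, see below */
    t[0] = x;
    t[1] = y;
    return add(t[0], t[1]);   /* same lea/ret as the earlier listing */
}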
Be aware of the as-if rule, and of the tricky notion of undefined behavior (if n was 1 above, it is UB).
Can I always expect that at measurement B, the stack will only contain buffer2 and buffer1 has been released?
No. It's going to depend on GCC version, target, optimization level, options.
Apart from function calls (which result in additional stack usage) is there any way I can have fine control over stack deallocations?
Your requirement is so specific that I guess you will likely have to write the code yourself in assembler.
mov BYTE PTR [rbp-20], 1 and mov BYTE PTR [rbp-10], 2 only show the variables' relative offsets within the stack frame. When considering the run-time situation, they have the same peak stack usage.
There are two differences between inlining and not:
1) In function-call mode, buffer1 is released on exit from foo(). In inline mode, buffer1 is not released until exit from bar(), which means the peak stack usage lasts longer.
2) A function call adds a little overhead, such as saving stack-frame information, compared with inline mode.
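If the goal is merely to ensure that buffer1's space is given back before buffer2 is live, one compiler-specific lever is to force a real call with GCC's noinline attribute; a sketch, still not a guarantee the standard gives you:

#include <stdint.h>

__attribute__((noinline)) static void foo(void)
{
    uint8_t buffer1[100];
    buffer1[0] = 0;   /* touch it so it isn't optimized away */
    (void)buffer1;
}   /* real call boundary: foo's frame, including buffer1, is popped here */

void bar(void)
{
    foo();                 /* after this returns, buffer1's space is released */
    uint8_t buffer2[100];
    buffer2[0] = 0;
    (void)buffer2;
}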
How do I get the argument values (argc, argv) using inline assembly in C, without glibc?
I require this code for Linux on the x86_64 and i386 architectures. If you know how to do it on macOS or Windows, please post that as well.
void exit(int code)
{
    //This function is not important!
    //...
}

void _start()
{
    //How do I get the argument values using inline assembly
    //in C without glibc?
    //argc
    //argv
    exit(0);
}
Update:
https://gist.github.com/apsun/deccca33244471c1849d29cc6bb5c78e
and
#define ReadRdi(To) asm("movq %%rdi,%0" : "=r"(To));
#define ReadRsi(To) asm("movq %%rsi,%0" : "=r"(To));
long argcL;
long argvL;
ReadRdi(argcL);
ReadRsi(argvL);
int argc = (int) argcL;
//char **argv = (char **) argvL;
exit(argc);
But it still returns 0, so this code is wrong. Please help.
As specified in the comment, argc and argv are provided on the stack, so you cannot use a regular C function to get them, even with inline assembly, as the compiler will touch the stack pointer to allocate the local variables, set up the stack frame, and so on; hence, _start must be written in assembly, as is done in glibc (x86; x86_64). A small stub can be written to just grab the stuff and forward it to your "real" C entry point according to the regular calling convention.
Here a minimal example of a program (both for x86 and x86_64) that reads argc and argv, prints all the values in argv on stdout (separated by newline) and exits using argc as status code; it can be compiled with the usual gcc -nostdlib (and -static to make sure ld.so isn't involved; not that it does any harm here).
#ifdef __x86_64__
asm(
".global _start\n"
"_start:\n"
" xorl %ebp,%ebp\n" // mark outermost stack frame
" movq 0(%rsp),%rdi\n" // get argc
" lea 8(%rsp),%rsi\n" // the arguments are pushed just below, so argv = %rbp + 8
" call bare_main\n" // call our bare_main
" movq %rax,%rdi\n" // take the main return code and use it as first argument for...
" movl $60,%eax\n" // ... the exit syscall
" syscall\n"
" int3\n"); // just in case
asm(
"bare_write:\n" // write syscall wrapper; the calling convention is pretty much ok as is
" movq $1,%rax\n" // 1 = write syscall on x86_64
" syscall\n"
" ret\n");
#endif
#ifdef __i386__
asm(
".global _start\n"
"_start:\n"
" xorl %ebp,%ebp\n" // mark outermost stack frame
" movl 0(%esp),%edi\n" // argc is on the top of the stack
" lea 4(%esp),%esi\n" // as above, but with 4-byte pointers
" sub $8,%esp\n" // the start starts 16-byte aligned, we have to push 2*4 bytes; "waste" 8 bytes
" pushl %esi\n" // to keep it aligned after pushing our arguments
" pushl %edi\n"
" call bare_main\n" // call our bare_main
" add $8,%esp\n" // fix the stack after call (actually useless here)
" movl %eax,%ebx\n" // take the main return code and use it as first argument for...
" movl $1,%eax\n" // ... the exit syscall
" int $0x80\n"
" int3\n"); // just in case
asm(
"bare_write:\n" // write syscall wrapper; convert the user-mode calling convention to the syscall convention
" pushl %ebx\n" // ebx is callee-preserved
" movl 8(%esp),%ebx\n" // just move stuff from the stack to the correct registers
" movl 12(%esp),%ecx\n"
" movl 16(%esp),%edx\n"
" mov $4,%eax\n" // 4 = write syscall on i386
" int $0x80\n"
" popl %ebx\n" // restore ebx
" ret\n"); // notice: the return value is already ok in %eax
#endif
int bare_write(int fd, const void *buf, unsigned count);

unsigned my_strlen(const char *ch) {
    const char *ptr;
    for (ptr = ch; *ptr; ++ptr);
    return ptr - ch;
}

int bare_main(int argc, char *argv[]) {
    for (int i = 0; i < argc; ++i) {
        int len = my_strlen(argv[i]);
        bare_write(1, argv[i], len);
        bare_write(1, "\n", 1);
    }
    return argc;
}
Notice that here several subtleties are ignored - in particular, the atexit bit. All the documentation about the machine-specific startup state has been extracted from the comments in the two glibc files linked above.
This answer is for x86-64, 64-bit Linux ABI, only. All the other OSes and ABIs mentioned will be broadly similar, but different enough in the fine details that you will need to write your custom _start once for each.
You are looking for the specification of the initial process state in the "x86-64 psABI", or, to give it its full title, "System V Application Binary Interface, AMD64 Architecture Processor Supplement (With LP64 and ILP32 Programming Models)". I will reproduce figure 3.9, "Initial Process Stack", here:
Purpose                          Start Address    Length
------------------------------------------------------------------------
Information block, including     varies
argument strings, environment
strings, auxiliary information
...
------------------------------------------------------------------------
Null auxiliary vector entry                       1 eightbyte
Auxiliary vector entries...                       2 eightbytes each
0                                                 eightbyte
Environment pointers...                           1 eightbyte each
0                                8+8*argc+%rsp    eightbyte
Argument pointers...             8+%rsp           argc eightbytes
Argument count                   %rsp             eightbyte
It goes on to say that the initial registers are unspecified except
for %rsp, which is of course the stack pointer, and %rdx, which may contain "a function pointer to register with atexit".
So all the information you are looking for is already present in memory, but it hasn't been laid out according to the normal calling convention, which means you must write _start in assembly language. It is _start's responsibility to set everything up to call main with, based on the above. A minimal _start would look something like this:
_start:
xorl %ebp, %ebp # mark the deepest stack frame
# Current Linux doesn't pass an atexit function,
# so you could leave out this part of what the ABI doc says you should do
# You can't just keep the function pointer in a call-preserved register
# and call it manually, even if you know the program won't call exit
# directly, because atexit functions must be called in reverse order
# of registration; this one, if it exists, is meant to be called last.
testq %rdx, %rdx # is there "a function pointer to
je skip_atexit # register with atexit"?
movq %rdx, %rdi # if so, do it
call atexit
skip_atexit:
movq (%rsp), %rdi # load argc
leaq 8(%rsp), %rsi # calc argv (pointer to the array on the stack)
leaq 8(%rsp,%rdi,8), %rdx # calc envp (starts after the NULL terminator for argv[])
call main
movl %eax, %edi # pass return value of main to exit
call exit
hlt # should never get here
(Completely untested.)
(In case you're wondering why there's no adjustment to maintain stack pointer alignment, this is because upon a normal procedure call, 8(%rsp) is 16-byte aligned, but when _start is called, %rsp itself is 16-byte aligned. Each call instruction displaces %rsp down by eight, producing the alignment situation expected by normal compiled functions.)
A more thorough _start would do more things, such as clearing all the other registers, arranging for greater stack pointer alignment than the default if desired, calling into the C library's own initialization functions, setting up environ, initializing the state used by thread-local storage, doing something constructive with the auxiliary vector, etc.
You should also be aware that if there is a dynamic linker (PT_INTERP program header in the executable), it receives control before _start does. Glibc's ld.so cannot be used with any C library other than glibc itself; if you are writing your own C library, and you want to support dynamic linkage, you will also need to write your own ld.so. (Yes, this is unfortunate; ideally, the dynamic linker would be a separate development project and its complete interface would be specified.)
As a quick and dirty hack, you can make an executable with a compiled C function as the ELF entry point. Just make sure you use exit or _exit instead of returning.
(Link with gcc -nostartfiles to omit CRT but still link other libraries, and write a _start() in C. Beware of ABI violations like stack alignment, e.g. use -mincoming-stack-boundary=2 or an __attribute__ on _start, as in Compiling without libc)
If it's dynamically linked, you can still use glibc functions on Linux (because the dynamic linker runs glibc's init functions). Not all systems are like this, e.g. on cygwin you definitely can't call libc functions if you (or the CRT start code) hasn't called the libc init functions in the correct order. I'm not sure it's even guaranteed that this works on Linux, so don't depend on it except for experimentation on your own system.
I have used a C _start(void){ ... } + calling _exit() for making a static executable to microbenchmark some compiler-generated code with less startup overhead for perf stat ./a.out.
Glibc's _exit() works even if glibc wasn't initialized (gcc -O3 -static), or use inline asm to run xor %edi,%edi / mov $60, %eax / syscall (sys_exit(0) on Linux) so you don't have to even statically link libc. (gcc -O3 -nostdlib)
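A minimal sketch of that inline-asm variant (assuming x86-64 Linux and GCC-style inline asm; build with gcc -O3 -nostdlib):

void _start(void)
{
    // sys_exit(0): rax = 60 (SYS_exit), rdi = exit status.
    // The syscall instruction clobbers rcx and r11.
    asm volatile("syscall"
                 : /* no outputs */
                 : "a"(60L), "D"(0L)
                 : "rcx", "r11", "memory");
    __builtin_unreachable();   // sys_exit never returns
}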
With even more dirty hacking and UB, you can access argc and argv by knowing the x86-64 System V ABI that you're compiling for (see #zwol's answer for a quote from the ABI doc), and how the process startup state differs from the function calling convention:
argc is where the return address would be for a normal function (pointed to by RSP). GNU C has a builtin for accessing the return address of the current function (or for walking up the stack).
argv[0] is where the 7th integer/pointer arg should be (the first stack arg, just above the return address). Taking its address and using that as an array happens to work (or at least seems to)!
// Works only for the x86-64 SystemV ABI; only tested on Linux.
// DO NOT USE THIS EXCEPT FOR EXPERIMENTS ON YOUR OWN COMPUTER.
#include <stdio.h>
#include <stdlib.h>

// tell gcc *this* function is called with a misaligned RSP
__attribute__((force_align_arg_pointer))
void _start(int dummy1, int dummy2, int dummy3, int dummy4, int dummy5, int dummy6, // register args
            char *argv0) {
    int argc = (int)(long)__builtin_return_address(0); // load (%rsp), casts to silence gcc warnings.
    char **argv = &argv0;
    printf("argc = %d, argv[argc-1] = %s\n", argc, argv[argc-1]);
    printf("%f\n", 1.234); // segfaults if RSP is misaligned
    exit(0);
    //_exit(0); // without flushing stdio buffers!
}
# with a version without the FP printf
peter@volta:~/src/SO$ gcc -nostartfiles _start.c -o bare_start
peter@volta:~/src/SO$ ./bare_start
argc = 1, argv[argc-1] = ./bare_start
peter@volta:~/src/SO$ ./bare_start abc def hij
argc = 4, argv[argc-1] = hij
peter@volta:~/src/SO$ file bare_start
bare_start: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=af27c8416b31bb74628ef9eec51a8fc84e49550c, not stripped
# I could have used -fno-pie -no-pie to make a non-PIE executable
This works with or without optimization, with gcc7.3. I was worried that without optimization, the address of argv0 would be below rbp where it copies the arg, rather than its original location. But apparently it works.
gcc -nostartfiles links glibc but not the CRT start files.
gcc -nostdlib omits both libraries and CRT startup files.
Very little of this is guaranteed to work, but it does in practice work with current gcc on current x86-64 Linux, and has worked in the past for years. If it breaks, you get to keep both pieces. IDK what C features are broken by omitting the CRT startup code and just relying on the dynamic linker to run glibc init functions. Also, taking the address of an arg and accessing pointers above it is UB, so you could maybe get broken code-gen. gcc7.3 happens to do what you'd expect in this case.
Things that definitely break
atexit() cleanup, e.g. flushing stdio buffers.
static destructors for static objects in dynamically-linked libraries. (On entry to _start, RDX is a function pointer you should register with atexit for this reason. In a dynamically linked executable, the dynamic linker runs before your _start and sets RDX before jumping to your _start. Statically linked executables have RDX=0 under Linux.)
gcc -mincoming-stack-boundary=3 (i.e. 2^3 = 8 bytes) is another way to get gcc to realign the stack, because the -mpreferred-stack-boundary=4 default of 2^4 = 16 is still in place. But that makes gcc assume under-aligned RSP for all functions, not just for _start, which is why I looked in the docs and found an attribute that was intended for 32-bit when the ABI transitioned from only requiring 4-byte stack alignment to the current requirement of 16-byte alignment for ESP in 32-bit mode.
The SysV ABI requirement for 64-bit mode has always been 16-byte alignment, but gcc options let you make code that doesn't follow the ABI.
#include <stdlib.h>   /* for atoi */

// test call to a function the compiler can't inline
// to see if gcc emits extra code to re-align the stack
// like it would if we'd used -mincoming-stack-boundary=3 to assume *all* functions
// have only 8-byte (2^3) aligned RSP on entry, with the default -mpreferred-stack-boundary=4
void foo() {
    int i = 0;
    atoi(NULL);
}
With -mincoming-stack-boundary=3, we get stack-realignment code there, where we don't need it. gcc's stack-realignment code is pretty clunky, so we'd like to avoid that. (Not that you'd really ever use this to compile a significant program where you care about efficiency, please only use this stupid computer trick as a learning experiment.)
But anyway, see the code on the Godbolt compiler explorer with and without -mpreferred-stack-boundary=3.
When I attempt to compile this code as it is, I receive the compiler message "error: incompatible types in return". I marked the location of the error in my code (see the comment in f2 below). If I take the line out, then the compiler is happy.
The problem is that I want to return a value representing invalid input to the function (which in this case means calling f2(2)). I only want a struct returned with data if the function is called without using 2 as a parameter.
I feel the only two ways to go is to either:
make the function return a struct pointer instead of a struct directly, but then my caller function will look funny as I have to change y.b to y->b, and the operation may be slower due to the extra step of fetching data in memory.
Allocate extra memory, zero-fill it, and set the return value to the struct at that location in memory (example: return x[nnn]; instead of return x[0];). This approach will use more memory and some processing to zero-fill it.
Ultimately, I'm looking for a solution that will be fastest and cleanest (in terms of code) in the long run. If I have to be stuck with using -> to address members of elements then I guess that's the way to go.
Does anyone have a solution that uses the least cpu power?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    long a;
    char b;
} ab;

static char dat[500];

ab f2(int set) {
    ab* x = (ab*)dat;
    if (set == 2) { return NULL; }   // <-- error: incompatible types in return
    if (set == 1) {
        x->a = 1;
        x->b = '1';
        x++;
        x->a = 2;
        x->b = '2';
        x = (ab*)dat;
    }
    return x[0];
}

int main() {
    ab y;
    y = f2(1);
    printf("%c", y.b);
    y.b = 'D';
    y = f2(0);
    printf("%c", y.b);
    return 0;
}
If you care about speed, it is implementation specific.
Notice that the Linux x86-64 ABI defines that a struct of exactly two scalar members (that is, integers, doubles, or pointers, which all fit in a single machine register, but not structs etc., which are aggregate data) is returned through two registers (without going through the stack), and that is quite fast.
BTW
if (set==2){return NULL;} //wrong
is obviously wrong. You could code:
if (set==2) return (ab){0,0};
Also,
ab* x=(ab*)dat; // suspicious
looks suspicious to me (since you return x[0]; later). You are not guaranteed that dat is suitably aligned (e.g. to 8 or 16 bytes), and on some platforms (notably x86-64) if dat is misaligned you are at least losing performance (actually, it is undefined behavior).
BTW, I would suggest always returning with instructions like return (ab){l,c}; (where l is an expression convertible to long and c is an expression convertible to char); this is probably the easiest to read, and will be optimized to load the two return registers.
Of course if you care about performance, for benchmarking purposes, you should enable optimizations (and warnings), e.g. compile with gcc -Wall -Wextra -O2 -march=native if using GCC; on my system (Linux/x86-64 with GCC 5.2) the small function
ab give_both(long x, char y)
{
    return (ab){x, y};
}
is compiled (with gcc -O2 -march=native -fverbose-asm -S) into:
        .globl  give_both
        .type   give_both, @function
give_both:
.LFB0:
        .file 1 "ab.c"
        .loc 1 7 0
        .cfi_startproc
.LVL0:
        .loc 1 7 0
        xorl    %edx, %edx      # D.2139
        movq    %rdi, %rax      # x, x
        movb    %sil, %dl       # y, D.2139
        ret
        .cfi_endproc
You see that all the code is using registers, and no memory is used at all.
I would use the return value as an error code, and the caller passes in a pointer to his struct such as:
int f2(int set, ab *outAb); // returns -1 if invalid input and 0 otherwise
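A sketch of how that might look for the question's f2 (the payload values here are illustrative):

/* returns -1 if invalid input and 0 otherwise */
int f2(int set, ab *outAb)
{
    if (set == 2)
        return -1;            /* invalid input: *outAb is left untouched */
    outAb->a = 1;             /* fill in whatever payload applies */
    outAb->b = '1';
    return 0;                 /* success */
}

/* caller:
       ab y;
       if (f2(1, &y) == 0)
           printf("%c", y.b);   // read y only on success
*/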
Let's say I have pseudocode like this:
main() {
    BOOL b = get_bool_from_environment(); //get it from a file, network, registry, whatever
    while (true) {
        do_stuff(b);
    }
}

do_stuff(BOOL b) {
    if (b)
        path_a();
    else
        path_b();
}
Now, since we know that the external environment can influence get_bool_from_environment() to potentially produce either a true or false result, then we know that the code for both the true and false branches of if(b) must be included in the binary. We can't simply omit path_a(); or path_b(); from the code.
BUT -- we only set BOOL b the one time, and we always reuse the same value after program initialization.
If I were to make this valid C code and then compile it using gcc -O0, the if(b) would be repeatedly evaluated on the processor each time that do_stuff(b) is invoked, which inserts what are, in my opinion, needless instructions into the pipeline for a branch that is basically static after initialization.
If I were to assume that I actually had a compiler that was as stupid as gcc -O0, I would re-write this code to include a function pointer, and two separate functions, do_stuff_a() and do_stuff_b(), which don't perform the if(b) test, but simply go ahead and perform one of the two paths. Then, in main(), I would assign the function pointer based on the value of b, and call that function in the loop. This eliminates the branch, though it admittedly adds a memory access for the function pointer dereference (given the architecture implementation, I don't think I really need to worry about that).
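A sketch of that manual rewrite in C (the typedef and helper names are my own):

_Bool get_bool_from_environment(void);
void path_a(void);
void path_b(void);

typedef void (*stuff_fn)(void);

static void do_stuff_a(void) { path_a(); }   /* no if(b) test */
static void do_stuff_b(void) { path_b(); }

int main(void) {
    _Bool b = get_bool_from_environment();
    stuff_fn f = b ? do_stuff_a : do_stuff_b;  /* decide once, after initialization */
    while (1)
        f();                                   /* indirect call, no branch on b */
}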
Is it possible, even in principle, for a compiler to take code of the same style as the original pseudocode sample, and to realize that the test is unnecessary once the value of b is assigned once in main()? If so, what is the theoretical name for this compiler optimization, and can you please give an example of an actual compiler implementation (open source or otherwise) which does this?
I realize that compilers can't generate dynamic code at runtime, and the only types of systems that could do that in principle would be bytecode virtual machines or interpreters (e.g. Java, .NET, Ruby, etc.) -- so the question remains whether or not it is possible to do this statically and generate code that contains both the path_a(); branch and the path_b() branch, but avoid evaluating the conditional test if(b) for every call of do_stuff(b);.
If you tell your compiler to optimise, you have a good chance that the if(b) is evaluated only once.
Slightly modifying the given example, using the standard _Bool instead of BOOL, and adding the missing return types and declarations,
_Bool get_bool_from_environment(void);
void path_a(void);
void path_b(void);

void do_stuff(_Bool b) {
    if (b)
        path_a();
    else
        path_b();
}

int main(void) {
    _Bool b = get_bool_from_environment(); //get it from a file, network, registry, whatever
    while (1) {
        do_stuff(b);
    }
}
the (relevant part of the) produced assembly by clang -O3 [clang-3.0] is
callq get_bool_from_environment
cmpb $1, %al
jne .LBB1_2
.align 16, 0x90
.LBB1_1: # %do_stuff.exit.backedge.us
# =>This Inner Loop Header: Depth=1
callq path_a
jmp .LBB1_1
.align 16, 0x90
.LBB1_2: # %do_stuff.exit.backedge
# =>This Inner Loop Header: Depth=1
callq path_b
jmp .LBB1_2
b is tested only once, and main jumps into an infinite loop of either path_a or path_b depending on the value of b. If path_a and path_b are small enough, they would be inlined (I strongly expect). With -O and -O2, the code produced by clang would evaluate b in each iteration of the loop.
gcc (4.6.2) behaves similarly with -O3:
call get_bool_from_environment
testb %al, %al
jne .L8
.p2align 4,,10
.p2align 3
.L9:
call path_b
.p2align 4,,6
jmp .L9
.L8:
.p2align 4,,8
call path_a
.p2align 4,,8
call path_a
.p2align 4,,5
jmp .L8
Oddly, it unrolled the loop for path_a, but not for path_b. With -O2 or -O, it would however call do_stuff in the infinite loop.
Hence to
Is it possible, even in principle, for a compiler to take code of the same style as the original pseudocode sample, and to realize that the test is unnecessary once the value of b is assigned once in main()?
the answer is a definitive Yes, it is possible for compilers to recognize this and take advantage of that fact. Good compilers do when asked to optimise hard.
If so, what is the theoretical name for this compiler optimization, and can you please give an example of an actual compiler implementation (open source or otherwise) which does this?
I don't know an official name for the whole transformation, but the key step, hoisting a loop-invariant conditional out of a loop and duplicating the loop body, is commonly called loop unswitching (here enabled by inlining and constant propagation). Two implementations doing it are gcc and clang (at least, recent enough releases).
If a function calls itself while defining variables at the same time, would it result in a stack overflow? Is there any option in gcc to reuse the same stack frame?
void funcnew(void)
{
    int a = 10;
    int b = 20;
    funcnew();
    return;
}
Can a function reuse the stack frame which it used earlier? What is the option in gcc to reuse the same frame in tail recursion?
Yes. See
-foptimize-sibling-calls
Optimize sibling and tail recursive calls.
Enabled at levels -O2, -O3, -Os.
Your function is compiled to:
funcstack:
.LFB0:
.cfi_startproc
xorl %eax, %eax
jmp func
.cfi_endproc
(note the jump to func)
Reusing the stack frame when a function ends with a call (in full generality this includes manipulating the stack to put the parameters in the correct place and replacing the function call by a jump to the start of the function) is a well-known optimisation called tail call removal. It is mandated by some languages (Scheme, for instance) for recursive calls (a recursive call is the natural way to express a loop in these languages). As given above, gcc has the optimisation implemented for C, but I'm not sure which other compilers have it; I would not depend on it for portable code. And note that I don't know what restrictions there are on it; I'm not sure, for instance, that gcc will manipulate the stack if the parameter types are different.
Even without defining the variables you'd get a stack overflow, since the return address is also pushed onto the stack.
It is (I've learned this recently) possible that the compiler optimizes your tail recursion into a loop (which makes the stack not grow at all). Link to tail recursion question on SO
No, each recursion is a new stack frame. If the recursion is infinitely deep, then the stack needed is also infinite, so you get a stack overflow.
Yes, in some cases the compiler may be able to perform something called tail call optimization. You should check with your compiler manual. (AProgrammer seems to have quoted the GCC manual in his answer.)
This is an essential optimization when implementing for example functional languages, where such code occurs frequently.
You can't do away with the stack frame altogether, as it is needed for the return address, unless you are using tail recursion and your compiler has optimised it to a loop. But to be completely technically honest, you can do away with all the variables in the frame by making them static. However, this is almost certainly not what you want to do, and you should not do it without knowing exactly what you are doing, which, as you had to ask this question, you don't.
As others have noted, it is only possible if (1) your compiler supports tail call optimization, and (2) if your function is eligible for such an optimization. The optimization is to reuse the existing stack and perform a JMP (i.e., a GOTO in assembly) instead of a CALL.
In fact, your example function is indeed eligible for such an optimization. The reason is that the last thing your function does before returning is call itself; it doesn't have to do anything after the last call to funcnew(). However, only certain compilers will perform such an optimization. GCC, for instance, will do it. For more info, see Which, if any, C++ compilers do tail-recursion optimization?
The classic material on this is the factorial function. Let's make a recursive factorial function that is not eligible for tail call optimization (TCO).
int fact(int n)
{
    if (n == 1) return 1;
    return n * fact(n - 1);
}
The last thing it does is to multiply n with the result from fact(n-1). By somehow eliminating this last operation, we would be able to reuse the stack. Let's introduce an accumulator variable that will compute the answer for us:
int fact_helper(int n, int acc)
{
    if (n == 1) return acc;
    return fact_helper(n - 1, n * acc);
}

int fact_acc(int n)
{
    return fact_helper(n, 1);
}
The function fact_helper does the work, while fact_acc is just a convenience function to initialize the accumulator variable.
Note how the last thing fact_helper does is to call itself. This CALL can be converted to a JMP by reusing the existing stack for the variables.
With GCC, you can verify that it is optimized to a jump by looking at the generated assembly, for instance gcc -c -O3 -Wa,-a,-ad fact.c:
...
37 L12:
38 0040 0FAFC2 imull %edx, %eax
39 0043 83EA01 subl $1, %edx
40 0046 83FA01 cmpl $1, %edx
41 0049 75F5 jne L12
...
Some programming languages, such as Scheme, will actually guarantee that proper implementations will perform such optimizations. They will even do it for non-recursive tail calls.