How does GCC implement variable-length arrays? - c

How does GCC implement Variable-length arrays (VLAs)? Are such arrays essentially pointers to the dynamically allocated storage such as returned by alloca?
The other alternative I could think of is that such an array is allocated as the last variable in a function, so that the offsets of the other variables are known at compile time. However, the offset of a second VLA would then again not be known at compile time.

Here's the allocation code (x86 - the x64 code is similar) for the following example line taken from some GCC docs for VLA support:
char str[strlen (s1) + strlen (s2) + 1];
where the calculation for strlen (s1) + strlen (s2) + 1 is in eax (GCC MinGW 4.8.1 - no optimizations):
mov edx, eax
sub edx, 1
mov DWORD PTR [ebp-12], edx
mov edx, 16
sub edx, 1
add eax, edx
mov ecx, 16
mov edx, 0
div ecx
imul eax, eax, 16
call ___chkstk_ms
sub esp, eax
lea eax, [esp+8]
add eax, 0
mov DWORD PTR [ebp-16], eax
So it looks to be essentially alloca().
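In C terms, the sequence above corresponds roughly to the following sketch (illustrative only: the helper names len and alloc are made up, and the 16-byte rounding plus the ___chkstk_ms stack probe are what this particular GCC target emits, not something the language requires):
size_t len = strlen(s1) + strlen(s2) + 1;   /* the value that ends up in eax */
size_t alloc = (len + 15) / 16 * 16;        /* round up to a 16-byte multiple for stack alignment */
char *str = alloca(alloc);                  /* the sub esp, eax - preceded by the page probe */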

Well, these are just a few wild stabs in the dark, based on the restrictions around VLA's, but anyway:
VLA's can't be:
extern
struct members
static
declared with unspecified bounds (save for function prototype)
All this points to VLA's being allocated on the stack, rather than the heap. So yes, VLA's probably are the last chunks of stack memory allocated whenever a new block is entered (block as in block scope: loops, functions, branches and so on).
That's also why VLA's increase the risk of stack overflow, in some cases significantly (word of warning: don't even think about using VLA's in combination with recursive function calls, for example!).
This is also why out-of-bounds access is very likely to cause issues: once the block ends, anything pointing to what was VLA memory is pointing to invalid memory.
On the plus side, this is also why these arrays are thread safe (each thread has its own stack), and why they're faster than heap memory.
The size of a VLA can't be:
an extern value
zero or negative
The extern restriction is pretty self-evident, as is the non-zero, non-negative one... however: if the variable that specifies the size of a VLA is a signed int, for example, the compiler won't produce an error: the evaluation, and thus the allocation, of a VLA is done at run time, not compile time. Hence the size of a VLA can't, and needn't, be known at compile time.
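A minimal illustration of that last point (hypothetical snippet):
int n = -3;   /* compiles without complaint */
char buf[n];  /* undefined behaviour at run time: a VLA size must evaluate to a value greater than zero */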
As MichaelBurr rightly pointed out, VLA's are very similar to alloca memory, with one, IMHO, crucial distinction: memory allocated by alloca is valid from the point of allocation, and throughout the rest of the function. VLA's are block scoped, so the memory is freed once you exit the block in which a VLA is used:
#include <stdio.h>
#include <alloca.h> /* alloca is non-standard; on some platforms it is declared in <malloc.h> */

void alloca_diff( void )
{
    char *alloca_c, *vla_c;
    for (int i = 1; i < 10; ++i)
    {
        char *alloca_mem = alloca(i * sizeof(*alloca_mem));
        alloca_c = alloca_mem; // points to memory that stays valid until the function returns
        char vla_arr[i];
        vla_c = vla_arr;       // will dangle: the VLA's lifetime ends with the block
    } // end of scope, VLA memory is freed
    printf("alloca: %c\n", *alloca_c); // fine (though the memory is uninitialized)
    printf("vla: %c\n", *vla_c);       // undefined behaviour... avoid!
} // end of function: alloca memory is freed, irrespective of block scope

Related

How does the compiler allocate memory without knowing the size at compile time?

I wrote a C program that accepts an integer from the user, uses that value as the size of an integer array, declares an array of that size, and confirms it by checking the size of the array.
Code:
#include <stdio.h>
int main(int argc, char const *argv[])
{
int n;
scanf("%d",&n);
int k[n];
printf("%ld",sizeof(k));
return 0;
}
and surprisingly it is correct! The program is able to create the array of required size.
But all static memory allocation is done at compile time, and during compile time the value of n is not known, so how come the compiler is able to allocate memory of required size?
If we can allocate the required memory just like that then what is the use of dynamic allocation using malloc() and calloc()?
This is not a "static memory allocation". Your array k is a Variable Length Array (VLA), which means that memory for this array is allocated at run time. The size will be determined by the run-time value of n.
The language specification does not dictate any specific allocation mechanism, but in a typical implementation your k will usually end up being a simple int * pointer with the actual memory block being allocated on the stack at run time.
For a VLA, the sizeof operator is evaluated at run time as well, which is why you obtain the correct value from it in your experiment. Just use %zu (not %ld) to print values of type size_t.
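For example, the printf line from the question is better written as:
printf("%zu\n", sizeof k);   /* sizeof applied to a VLA is computed at run time; %zu matches size_t */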
The primary purpose of malloc (and other dynamic memory allocation functions) is to override the scope-based lifetime rules, which apply to local objects. I.e. memory allocated with malloc remains allocated "forever", or until you explicitly deallocate it with free. Memory allocated with malloc does not get automatically deallocated at the end of the block.
VLA, as in your example, does not provide this "scope-defeating" functionality. Your array k still obeys regular scope-based lifetime rules: its lifetime ends at the end of the block. For this reason, in general case, VLA cannot possibly replace malloc and other dynamic memory allocation functions.
But in specific cases when you don't need to "defeat scope" and just use malloc to allocate a run-time sized array, VLA might indeed be seen as a replacement for malloc. Just keep in mind, again, that VLAs are typically allocated on the stack and allocating large chunks of memory on the stack to this day remains a rather questionable programming practice.
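As a rough sketch of that lifetime difference (the function names here are made up for illustration):
#include <stdlib.h>

int *broken(size_t n) {
    int a[n];                         /* VLA: lifetime ends with the enclosing block */
    return a;                         /* wrong - the caller receives a dangling pointer */
}

int *works(size_t n) {
    return malloc(n * sizeof(int));   /* stays allocated until the caller calls free() */
}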
In C, the means by which a compiler supports VLAs (variable length arrays) is up to the compiler - it doesn't have to use malloc(), and can (and often does) use what is sometimes called "stack" memory - e.g. using system specific functions like alloca() that are not part of standard C. If it does use stack, the maximum size of an array is typically much smaller than is possible using malloc(), because modern operating systems allow programs a much smaller quota of stack memory.
Memory for variable length arrays clearly can't be statically allocated. It can, however, be allocated on the stack. Generally this involves the use of a "frame pointer" to keep track of the location of the function's stack frame in the face of dynamically determined changes to the stack pointer.
When I tried to compile your program, it seemed that the variable length array got optimised out. So I modified your code to force the compiler to actually allocate the array.
#include <stdio.h>
int main(int argc, char const *argv[])
{
int n;
scanf("%d",&n);
int k[n];
printf("%s %ld",k,sizeof(k));
return 0;
}
Godbolt compiling for arm using gcc 6.3 (using arm because I can read arm ASM) compiles this to https://godbolt.org/g/5ZnHfa. (comments mine)
main:
push {fp, lr} ; Save fp and lr on the stack
add fp, sp, #4 ; Create a "frame pointer" so we know where
; our stack frame is even after applying a
; dynamic offset to the stack pointer.
sub sp, sp, #8 ; allocate 8 bytes on the stack (8 rather
; than 4 due to ABI alignment
; requirements)
sub r1, fp, #8 ; load r1 with a pointer to n
ldr r0, .L3 ; load pointer to format string for scanf
; into r0
bl scanf ; call scanf (arguments in r0 and r1)
ldr r2, [fp, #-8] ; load r2 with value of n
ldr r0, .L3+4 ; load pointer to format string for printf
; into r0
lsl r2, r2, #2 ; multiply n by 4
add r3, r2, #10 ; add 10 to n*4 (not sure why it used 10,
; 7 would seem sufficient)
bic r3, r3, #7 ; and clear the low bits so it is a
; multiple of 8 (stack alignment again)
sub sp, sp, r3 ; actually allocate the dynamic array on
; the stack
mov r1, sp ; store a pointer to the dynamic size array
; in r1
bl printf ; call printf (arguments in r0, r1 and r2)
mov r0, #0 ; set r0 to 0
sub sp, fp, #4 ; use the frame pointer to restore the
; stack pointer
pop {fp, lr} ; restore fp and lr
bx lr ; return to the caller (return value in r0)
.L3:
.word .LC0
.word .LC1
.LC0:
.ascii "%d\000"
.LC1:
.ascii "%s %ld\000"
The memory for this construct, which is called "variable length array", VLA, is allocated on the stack, in a similar way to alloca. Exactly how this happens depends on exactly which compiler you're using, but essentially it's a case of calculating the size when it is known, and then subtracting [1] the total size from the stack-pointer.
You do need malloc and friends because this allocation "dies" when you leave the function. [And it's not valid in standard C++]
[1] For typical processors that use a stack that "grows towards zero".
When it is said that the compiler allocates memory for variables at compile time, it means that the placement of those variables is decided upon and embedded in the executable code that the compiler generates, not that the compiler is making space for them available while it works.
The actual dynamic memory allocation is carried out by the generated program when it runs.

Why is gcc allowed to speculatively load from a struct?

Example Showing the gcc Optimization and User Code that May Fault
The function 'foo' in the snippet below will load only one of the struct members A or B; well at least that is the intention of the unoptimized code.
typedef struct {
int A;
int B;
} Pair;
int foo(const Pair *P, int c) {
int x;
if (c)
x = P->A;
else
x = P->B;
return c/102 + x;
}
Here is what gcc -O3 gives:
mov eax, esi
mov edx, -1600085855
test esi, esi
mov ecx, DWORD PTR [rdi+4] <-- ***load P->B***
cmovne ecx, DWORD PTR [rdi] <-- ***load P->A***
imul edx
lea eax, [rdx+rsi]
sar esi, 31
sar eax, 6
sub eax, esi
add eax, ecx
ret
So it appears that gcc is allowed to speculatively load both struct members in order to eliminate branching. But then, is the following code considered undefined behavior or is the gcc optimization above illegal?
#include <stdlib.h>
int naughty_caller(int c) {
Pair *P = (Pair*)malloc(sizeof(Pair)-1); // *** Allocation is enough for A but not for B ***
if (!P) return -1;
P->A = 0x42; // *** Initializing allocation only where it is guaranteed to be allocated ***
int res = foo(P, 1); // *** Passing c=1 to foo should ensure only P->A is accessed? ***
free(P);
return res;
}
If the load speculation will happen in the above scenario there is a chance that loading P->B will cause an exception because the last byte of P->B may lie in unallocated memory. This exception will not happen if the optimization is turned off.
The Question
Is the gcc optimization shown above of load speculation legal? Where does the spec say or imply that it's ok?
If the optimization is legal, how does the code in 'naughty_caller' turn out to be undefined behavior?
Reading a variable (that was not declared as volatile) is not considered to be a "side effect" as specified by the C standard. So the program is free to read a location and then discard the result, as far as the C standard is concerned.
This is very common. Suppose you request 1 byte of data from a 4 byte integer. The compiler may then read the whole 32 bits if that's faster (aligned read), and then discard everything but the requested byte. Your example is similar to this but the compiler decided to read the whole struct.
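For instance (an illustrative snippet, not taken from the question): reading one byte of an int through an unsigned char pointer is well-defined, and the compiler is free to implement it as a full aligned load that discards three of the four bytes:
int v = 0x11223344;
unsigned char low = *(unsigned char *)&v;   /* may be compiled as a 32-bit load plus a truncation */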
Formally this is found in the behavior of "the abstract machine", C11 chapter 5.1.2.3. Given that the compiler follows the rules specified there, it is free to do as it pleases. And the only rules listed are regarding volatile objects and sequencing of instructions. Reading a different struct member in a volatile struct would not be ok.
As for the case of allocating too little memory for the whole struct, that's undefined behavior. Because the memory layout of the struct is usually not for the programmer to decide - for example the compiler is allowed to add padding at the end. If there's not enough memory allocated, you might end up accessing forbidden memory even though your code only works with the first member of the struct.
No, if *P is allocated correctly P->B will never be in unallocated memory. It might not be initialized, that is all.
The compiler has every right to do what it does. The only thing that is not allowed is to fault on the access of P->B with the excuse that it is not initialized. But what it does and how it does all of this is at the discretion of the implementation and not your concern.
If you cast a pointer to a block returned by malloc to Pair* that is not guaranteed to be wide enough to hold a Pair the behavior of your program is undefined.
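The fix on the caller's side is simply to allocate the full object; here is a sketch of a corrected caller (careful_caller is a hypothetical name):
#include <stdlib.h>

int careful_caller(int c) {
    Pair *P = malloc(sizeof *P);   /* full size, including any trailing padding */
    if (!P) return -1;
    P->A = 0x42;
    int res = foo(P, c);           /* a speculative load of P->B can no longer touch unallocated memory */
    free(P);
    return res;
}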
This is perfectly legal because reading some memory location isn't considered an observable behavior in the general case (volatile would change this).
Your example code is indeed undefined behavior, but I can't find any passage in the standard docs that explicitly states this. But I think it's enough to have a look at the rules for effective types ... from N1570, §6.5 p6:
If a value is stored into an object having no declared type through an
lvalue having a type that is not a character type, then the type of the lvalue becomes the
effective type of the object for that access and for subsequent accesses that do not modify
the stored value.
So, your write access to *P actually gives that object the type Pair -- therefore it just extends into memory you didn't allocate, the result is an out of bounds access.
A postfix expression followed by the -> operator and an identifier designates a member of a structure or union object. The value is that of the named member of the object to which the first expression points
If invoking the expression P->A is well-defined, then P must actually point to an object of type struct Pair, and consequently P->B is well-defined as well.
A -> operator on a Pair * implies that there's a whole Pair object fully allocated. (@Hurkyl quotes the standard.)
x86 (like any normal architecture) doesn't have side-effects for accessing normal allocated memory, so x86 memory semantics are compatible with the C abstract machine's semantics for non-volatile memory. Compilers can speculatively load if/when they think that will be a performance win on target microarchitecture they're tuning for in any given situation.
Note that on x86 memory protection operates with page granularity. The compiler could unroll a loop or vectorize with SIMD in a way that reads outside an object, as long as all pages touched contain some bytes of the object. Is it safe to read past the end of a buffer within the same page on x86 and x64?. libc strlen() implementations hand-written in assembly do this, but AFAIK gcc doesn't, instead using scalar loops for the leftover elements at the end of an auto-vectorized loop even where it already aligned the pointers with a (fully unrolled) startup loop. (Perhaps because it would make runtime bounds-checking with valgrind difficult.)
To get the behaviour you were expecting, use a const int * arg.
An array is a single object, but pointers are different from arrays. (Even with inlining into a context where both array elements are known to be accessible, I wasn't able to get gcc to emit code like it does for the struct, so if its struct code is a win, it's a missed optimization not to do the same on arrays when it's also safe.)
In C, you're allowed to pass this function a pointer to a single int, as long as c is non-zero. When compiling for x86, gcc has to assume that it could be pointing to the last int in a page, with the following page unmapped.
Source + gcc and clang output for this and other variations on the Godbolt compiler explorer
// exactly equivalent to const int p[2]
int load_pointer(const int *p, int c) {
int x;
if (c)
x = p[0];
else
x = p[1]; // gcc missed optimization: still does an add with c known to be zero
return c + x;
}
load_pointer: # gcc7.2 -O3
test esi, esi
jne .L9
mov eax, DWORD PTR [rdi+4]
add eax, esi # missed optimization: esi=0 here so this is a no-op
ret
.L9:
mov eax, DWORD PTR [rdi]
add eax, esi
ret
In C, you can sort of pass an array object (by reference) to a function, guaranteeing to the function that it's allowed to touch all the memory even if the C abstract machine doesn't. The syntax is int p[static 2].
int load_array(const int p[static 2], int c) {
... // same body
}
But gcc doesn't take advantage, and emits identical code to load_pointer.
Off topic: clang compiles all versions (struct and array) the same way, using a cmov to branchlessly compute a load address.
lea rax, [rdi + 4]
test esi, esi
cmovne rax, rdi
add esi, dword ptr [rax]
mov eax, esi # missed optimization: mov on the critical path
ret
This isn't necessarily good: it has higher latency than gcc's struct code, because the load address is dependent on a couple extra ALU uops. It is pretty good if both addresses aren't safe to read and a branch would predict poorly.
We can get better code for the same strategy from gcc and clang, using setcc (1 uop with 1c latency on all CPUs except some really ancient ones), instead of cmovcc (2 uops on Intel before Skylake). xor-zeroing is always cheaper than an LEA, too.
int load_pointer_v3(const int *p, int c) {
int offset = (c==0);
int x = p[offset];
return c + x;
}
xor eax, eax
test esi, esi
sete al
add esi, dword ptr [rdi + 4*rax]
mov eax, esi
ret
gcc and clang both put the final mov on the critical path. And on Intel Sandybridge-family, the indexed addressing mode doesn't stay micro-fused with the add. So this would be better, like what it does in the branching version:
xor eax, eax
test esi, esi
sete al
mov eax, dword ptr [rdi + 4*rax]
add eax, esi
ret
Simple addressing modes like [rdi] or [rdi+4] have 1c lower latency than others on Intel SnB-family CPUs, so this might actually be worse latency on Skylake (where cmov is cheap). The test and lea can run in parallel.
After inlining, that final mov probably wouldn't exist, and it could just add into esi.
This is always allowed under the "as-if" rule if no conforming program can tell the difference. For example, an implementation could guarantee that after each block allocated with malloc, there are at least eight bytes that can be accessed without side effects. In that situation, the compiler can generate code that would be undefined behaviour if you wrote it in your code. So it would be legal for the compiler to read P[1] whenever P[0] is correctly allocated, even if that would be undefined behaviour in your own code.
But in your case, if you don't allocate enough memory for a struct, then reading any member is undefined behaviour. So here the compiler is allowed to do this, even if reading P->B crashes.

x86 mov instruction in C pointer of different size

I'm trying to replicate an x86 mov instruction, such as mov %ecx,-0x4(%ebp) in C and am confused about how to do it. I have an int array for the registers and an int displacement. How would I move the value of %ecx into the memory address 4 less than the value stored in %ebp?
I have:
int* destAddress=(int*)(displacement + registers[destination]);
*destAddress=registers[source];
I'm getting a Warning: cast to pointer from integer of different size.
mov %ecx,-0x4(%ebp)
or, in Intel syntax:
mov DWORD PTR [ebp-4], ecx
is storing the value in ECX into the memory location [ebp-4].
EBP is the "base pointer" and is commonly used (in unoptimized code) to access data on the stack. Based on the negative offset, this instruction is almost certainly storing the value of ECX into the first DWORD-sized local variable.
If you wanted to translate this to C, it would be:
int local = value;
assuming that value is mapped to the ECX register, and local is a local variable allocated on the stack. Really, that's it.
[Except that a C compiler would generally put a local variable like this in a register, so this would really translate to something more like mov edx, ecx. The only time it would spill to stack would be if it ran out of registers (which isn't uncommon in the very register-poor x86 ISA). Alternatively, you could force it to spill by making the variable volatile: volatile int local = value;. But there is no good reason for doing that in real code.]
There is pointer dereferencing going on here under the hood, of course, as you see in the assembly-language instruction, but it doesn't manifest in the C representation.
If you wanted to get some pointer notation in there, say you had an array of values allocated on the stack, and wanted to initialize its first member:
int array[4];
array[0] = value; // set first element of array to 'value' (== ECX)
The displacement (-4) won't appear at all in the C code. The C compiler handles that.
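If you do need the raw pointer arithmetic from the question (you really are forming an address yourself), the "cast to pointer from integer of different size" warning can be silenced by going through uintptr_t; this is only a sketch, and it assumes registers[destination] + displacement is actually a valid address in your process:
#include <stdint.h>

int *destAddress = (int *)(uintptr_t)(displacement + registers[destination]);
*destAddress = registers[source];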

Understanding space allocation for variables on stack

I want to understand how space allocation is done for variables on stack.
Here for this C program with no variables
main() { return 0; }
Its disassembly is
push ebp
mov ebp, esp
sub esp, 0c0h
main() { int i = 10; }
The disassembly for this program is
push ebp
mov ebp, esp
sub esp, 0cch
I am initializing an int variable, whose size is 4 bytes. But in the above disassembly the compiler is allocating 12 bytes (0CC - 0C0).
For the following program
main() { long long int i = 10LL; }
The disassembly is
push ebp
mov ebp, esp
sub esp, 0D0h
In the above disassembly the compiler is allocating 16 bytes (0D0 - 0C0) for a long long int, whose size is 8 bytes.
Why is the compiler assigning 12 bytes (4 extra bytes; I would expect 8-byte or 16-byte alignment) for an int, whose size is 4 bytes, and 16 bytes for a long long int, whose size is 8 bytes?
Can someone please clarify this.
Thanks.
The compiler is free to allocate as much extra storage as it wants. The C standard does not dictate constraints on the stack allocation.
EDIT:
I did some experimentation on godbolt with the ICC compiler, the only compiler that generates code like your example. I disproved what I said before about the arguments to main. I also tried creating some character arrays and found that the stack is always allocated in increments of 16 bytes. A char array of 1-16 bytes causes a 16-byte allocation, 17-32 bytes causes a 32-byte allocation, and so on.
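A small experiment you can reproduce on godbolt (sketch; compile without optimization and compare the sub instruction in each prologue):
void probe16(void) { volatile char buf[16]; buf[0] = 0; }   /* expect a 16-byte stack allocation */
void probe17(void) { volatile char buf[17]; buf[0] = 0; }   /* expect the allocation to jump to 32 bytes */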

Is declaration of variables expensive?

While coding in C, I came across the below situation.
int function ()
{
if (!somecondition) return false;
internalStructure *str1;
internalStructure *str2;
char *dataPointer;
float xyz;
/* do something here with the above local variables */
}
Considering the if statement in the above code can return from the function, I can declare the variables in two places.
Before the if statement.
After the if statement.
As a programmer, I would think to keep the variable declaration after if Statement.
Does the declaration place cost something? Or is there some other reason to prefer one way over the other?
In C99 and later (or with the common conforming extension to C89), you are free to mix statements and declarations.
Just as in earlier versions (only more so as compilers got smarter and more aggressive), the compiler decides how to allocate registers and stack, or do any number of other optimizations conforming to the as-if-rule.
That means performance-wise, there's no expectation of any difference.
Anyway, that was not the reason such was allowed:
It was for restricting scope, and thus reducing the context a human must keep in mind when interpreting and verifying your code.
Do whatever makes sense, but current coding style recommends putting variable declarations as close to their usage as possible
In reality, variable declarations are free on virtually every compiler after the first one. This is because virtually all processors manage their stack with a stack pointer (and possibly a frame pointer). For example, consider two functions:
int foo() {
int x;
return 5; // aren't we a silly little function now
}
int bar() {
int x;
int y;
return 5; // still wasting our time...
}
If I were to compile these on a modern compiler (and tell it not to be smart and optimize out my unused local variables), I'd see something like this (32-bit x86 assembly; other targets are similar):
foo:
push ebp
mov ebp, esp
sub esp, 8 ; 1. this is the first line which is different between the two
mov eax, 5 ; this is how we return the value
add esp, 8 ; 2. this is the second line which is different between the two
ret
bar:
push ebp
mov ebp, esp
sub esp, 16 ; 1. this is the first line which is different between the two
mov eax, 5 ; this is how we return the value
add esp, 16 ; 2. this is the second line which is different between the two
ret
Note: both functions have the same number of opcodes!
This is because virtually all compilers will allocate all of the space they need up front (barring fancy things like alloca which are handled separately). In fact, on x64, it is mandatory that they do so in this efficient manner.
(Edit: As Forss pointed out, the compiler may optimize some of the local variables into registers. More technically, I should be arguing that the first variable to "spill over" onto the stack costs 2 opcodes, and the rest are free.)
For the same reasons, compilers will collect all of the local variable declarations, and allocate space for them right up front. C89 requires all declarations to be up front because the language was designed so that it could be compiled in a single pass. For a C89 compiler to know how much space to allocate, it needed to see all of the variables before emitting the rest of the code. In modern languages, like C99 and C++, compilers are expected to be much smarter than they were back in 1972, so this restriction is relaxed for developer convenience.
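A small illustration of the relaxed rule (sketch; the two functions compile to essentially the same code):
int sum89(int n) {
    int i;                        /* C89: all declarations before the first statement */
    int total = 0;
    for (i = 0; i < n; i++)
        total += i;
    return total;
}

int sum99(int n) {
    int total = 0;
    for (int i = 0; i < n; i++)   /* C99: declared where it is first needed */
        total += i;
    return total;
}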
Modern coding practices suggest putting the variables close to their usage
This has nothing to do with compilers (which obviously could not care one way or another). It has been found that most human programmers read code better if the variables are put close to where they are used. This is just a style guide, so feel free to disagree with it, but there is a remarkable consensus amongst developers that this is the "right way."
Now for a few corner cases:
If you are using C++ with constructors, the compiler will allocate the space up front (since it's faster to do it that way, and doesn't hurt). However, the variable will not be constructed in that space until the correct location in the flow of the code. In some cases, this means putting the variables close to their use can even be faster than putting them up front... flow control might direct us around the variable declaration, in which case the constructor doesn't even need to be called.
alloca is handled on a layer above this. For those who are curious, alloca implementations tend to have the effect of moving the stack pointer down some arbitrary amount. Functions using alloca are required to keep track of this space in one way or another, and make sure the stack pointer gets re-adjusted upwards before leaving.
There may be a case where you usually need 16-bytes of stack space, but on one condition you need to allocate a local array of 50kB. No matter where you put your variables in the code, virtually all compilers will allocate 50kB+16B of stack space every time the function gets called. This rarely matters, but in obsessively recursive code this could overflow the stack. You either have to move the code working with the 50kB array into its own function, or use alloca.
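A sketch of that workaround (the names are made up): hoisting the rare, large buffer into its own helper keeps the common path's stack frame small - though note that an aggressive inliner can undo this, so real code sometimes marks the helper as non-inlinable:
#include <string.h>

static void rare_case(void) {
    char big[50 * 1024];          /* the 50 kB lives on the stack only while this helper runs */
    memset(big, 0, sizeof big);
    /* ... work with big ... */
}

void common_case(int need_big) {
    if (need_big)
        rare_case();
    /* otherwise only the usual handful of bytes of locals is needed */
}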
Some platforms (ex: Windows) need a special function call in the prologue if you allocate more than a page worth of stack space. This should not change analysis very much at all (in implementation, it is a very fast leaf function that just pokes 1 word per page).
In C, I believe all variable declarations are applied as if they were at the top of the function declaration; if you declare them in a block, I think it's just a scoping thing (I don't think it's the same in C++). The compiler will perform all optimizations on the variables, and some may even effectively disappear in the machine code in higher optimizations. The compiler will then decide how much space is needed by the variables, and then later, during execution, create a space known as the stack where the variables live.
When a function is called, all of the variables that are used by your function are put on the stack, along with information about the function that is called (i.e. the return address, parameters, etc.). It doesn't matter where the variable was declared, just that it was declared - and it will be allocated onto the stack, regardless.
Declaring variables isn't "expensive," per se; if it's easy enough to be not used as a variable, the compiler will probably remove it as a variable.
Check this out:
Wikipedia on call stacks, Some other place on the stack
Of course, all of this is implementation-dependent and system-dependent.
Yes, it can cost clarity. If there is a case where the function must do nothing at all on some condition, (as when finding the global false, in your case), then placing the check at the top, where you show it above, is surely easier to understand - something that is essential while debugging and/or documenting.
It ultimately depends on the compiler but usually all locals are allocated at the beginning of the function.
However, the cost of allocating local variables is very small as they are put on the stack (or are put in a register after optimization).
Keep the declaration as close to where it's used as possible. Ideally inside nested blocks. So in this case it would make no sense to declare the variables above the if statement.
The best practice is to adopt a lazy approach, i.e., declare them only when you really need them ;) (and not before). It results in the following benefit:
Code is more readable if those variables are declared as near to the place of usage as possible.
If you have this
int function ()
{
{
sometype foo;
bool somecondition;
/* do something with foo and compute somecondition */
if (!somecondition) return false;
}
internalStructure *str1;
internalStructure *str2;
char *dataPointer;
float xyz;
/* do something here with the above local variables */
}
then the stack space reserved for foo and somecondition can obviously be reused for str1 etc., so by declaring after the if, you may save stack space. Depending on the optimization capabilities of the compiler, the saving of stack space may also take place if you flatten the function by removing the inner pair of braces, or if you do declare str1 etc. before the if; however, this requires the compiler/optimizer to notice that the scopes do not "really" overlap. By positioning the declarations after the if you facilitate this behaviour even without optimization - not to mention the improved code readability.
Whenever you allocate local variables in a C scope (such as a function), they have no default initialization code (such as C++ constructors). And since they're not dynamically allocated (they're just uninitialized pointers), no additional (and potentially expensive) functions need to be invoked (e.g. malloc) in order to prepare/allocate them.
Due to the way the stack works, allocating a stack variable simply means decrementing the stack pointer (i.e. increasing the stack size, because on most architectures, it grows downwards) in order to make room for it. From the CPU's perspective, this means executing a simple SUB instruction: SUB rsp, 4 (in case your variable is 4 bytes large--such as a regular 32-bit integer).
Moreover, when you declare multiple variables, your compiler is smart enough to actually group them together into one large SUB rsp, XX instruction, where XX is the total size of a scope's local variables. In theory. In practice, something a little different happens.
In situations like these, I find GCC explorer to be an invaluable tool when it comes to finding out (with tremendous ease) what happens "under the hood" of the compiler.
So let's take a look at what happens when you actually write a function like this: GCC explorer link.
C code
int function(int a, int b) {
int x, y, z, t;
if(a == 2) { return 15; }
x = 1;
y = 2;
z = 3;
t = 4;
return x + y + z + t + a + b;
}
Resulting assembly
function(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
cmp DWORD PTR [rbp-20], 2
jne .L2
mov eax, 15
jmp .L3
.L2:
-- snip --
.L3:
pop rbp
ret
As it turns out, GCC is even smarter than that. It doesn't even perform the SUB instruction at all to allocate the local variables. It just (internally) assumes that the space is "occupied", but doesn't add any instructions to update the stack pointer (e.g. SUB rsp, XX). This means that the stack pointer is not kept up to date but, since in this case no more PUSH instructions are performed (and no rsp-relative lookups) after the stack space is used, there's no issue.
Here's an example where no additional variables are declared: http://goo.gl/3TV4hE
C code
int function(int a, int b) {
if(a == 2) { return 15; }
return a + b;
}
Resulting assembly
function(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
cmp DWORD PTR [rbp-4], 2
jne .L2
mov eax, 15
jmp .L3
.L2:
mov edx, DWORD PTR [rbp-4]
mov eax, DWORD PTR [rbp-8]
add eax, edx
.L3:
pop rbp
ret
If you take a look at the code before the premature return (jmp .L3, which jumps to the cleanup and return code), no additional instructions are invoked to "prepare" the stack variables. The only difference is that the function parameters a and b, which are stored in the edi and esi registers, are loaded onto the stack at a higher address than in the first example ([rbp-4] and [rbp - 8]). This is because no additional space has been "allocated" for the local variables like in the first example. So, as you can see, the only "overhead" for adding those local variables is a change in a subtraction term (i.e. not even adding an additional subtraction operation).
So, in your case, there is virtually no cost for simply declaring stack variables.
I prefer keeping the "early out" condition at the top of the function, in addition to documenting why we are doing it. If we put it after a bunch of variable declarations, someone not familiar with the code could easily miss it, unless they know they have to look for it.
Documenting the "early out" condition alone is not always sufficient, it is better to make it clear in the code as well. Putting the early out condition at the top also makes it easier to keep the document in sync with the code, for instance, if we later decide to remove the early out condition, or to add more such conditions.
If it actually mattered the only way to avoid allocating the variables is likely to be:
int function_unchecked();
int function ()
{
if (!someGlobalValue) return false;
return function_unchecked();
}
int function_unchecked() {
internalStructure *str1;
internalStructure *str2;
char *dataPointer;
float xyz;
/* do something here with the above local variables */
}
But in practice I think you'll find no performance benefit. If anything a minuscule overhead.
Of course if you were coding C++ and some of those local variables had non-trivial constructors you would probably need to place them after the check. But even then I don't think it would help to split the function.
If you declare the variables after the if statement and return from the function immediately, the compiler does not commit memory on the stack for them.
