In C, where is static variable stored in memory? Suppose there are two static variables, one local to a function and the other global. How is this entry maintained in symbol table? Please explain.
In C, they can be stored wherever the implementation sees fit. The C standard does not dictate how the implementation does things, only how it behaves.
Typically, all static storage duration variables (statics within a function and all variables outside a function) will be stored in the same region, regardless of whether they at at file level or within a function.
That bit in parentheses above is important. Outside of a function, static doesn't decide the storage duration of a variable like it does within a function. It decides whether the variable is visible outside of the current translation unit. All variables outside of functions are static storage duration.
And, regarding the symbol table, that's a construct that exists only during the build process. Once an executable is generated, there are no symbols (debugging information excluded of course, but that has nothing to do with the execution of code). All references to variables at that point will almost certainly be hard-coded addresses or offsets.
In other words, it's the compiler that figures out which variable you're referring to with a name.
You can see an example here as to how the variables are stored. Consider the following little C program:
#include <stdio.h>
int var1;
static int var2;
int main (void) {
int var3;
static int var4;
var1 = 111;
var2 = 222;
var3 = 333;
var4 = 444;
return 0;
}
This generates the following assembly:
.file "qq.c"
.comm var1,4,4
.local var2
.comm var2,4,4
.text
.globl main
.type main, #function
main:
pushl %ebp
movl %esp, %ebp
subl $16, %esp
movl $111, var1
movl $222, var2
movl $333, -4(%ebp)
movl $444, var4.1705
movl $0, %eax
leave
ret
.size main, .-main
.local var4.1705
.comm var4.1705,4,4
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",#progbits
And you can see that var1, var2 and var4 (the static storage duration ones) all have a .comm line to mark them as common entries, subject to consolidation by the linker.
In addition, var2, var3 and var4 (the ones that are invisible outside the current transdlation unit) all have a .local line, so that the linker won't use them for satisfying unresolved externals in other object file.
And, by examining the output of ld --verbose while linking a file, you can see that all common entries end up in the .bss area:
.bss :
{
*(.dynbss)
*(.bss .bss.* .gnu.linkonce.b.*)
*(COMMON)
: : :
}
It's impossible to generalize to every compiler, but this is how it's most often done.
There will be a block of memory set aside by the linker for variables which are initialized at load time but modifiable at run time. All static variables will be placed in this block no matter if they are local or global.
Given the following source:
static int a_static_var = 5;
void foo(void)
{
static int a_static_var = 6;
return;
}
Visual Studio compiles the variable as follows (at lest in this instance - details will vary from compiler to compiler and depend on options):
_DATA SEGMENT
_a_static_var DD 05H
?a_static_var#?1??foo##9#9 DD 06H ; `foo'::`2'::a_static_var
_DATA ENDS
So both static variables end up in the data segment - the static that's scoped to a function has it's name mangled in such a way that it will not 'match up' with a similar variable ion a different function or source file.
Compiler implementations are free to handle this is whatever manner, but the general idea will usually be similar.
Related
I have set the size of a symbol in assembly using the .size directive of GNU assembler, how do I access this size in C?
void my_func(void){}
asm(
"_my_stub:\n\t"
"call my_func\n\t"
".size _my_stub, .-_my_stub"
);
extern char* _my_stub;
int main(){
/* print size of _my_stub here */
}
Here is relevant objdump
0000000000000007 <_my_stub>:
_my_stub():
7: e8 00 00 00 00 callq c <main>
Here is relevant portion of readelf
Symbol table '.symtab' contains 14 entries:
Num: Value Size Type Bind Vis Ndx Name
5: 0000000000000007 5 NOTYPE LOCAL DEFAULT 1 _my_stub
I can see from the objdump and symbol table that the size of _my_stub is 5. How can I get this value in C?
I don't know of a way to access the size attribute from within gas. As an alternative, how about replacing .size with
hhh:
.long .-_my_stub # 32-bit integer constant, regardless of the size of a C long
and
extern const uint32_t gfoo asm("hhh");
// asm("asm_name") sidesteps the _name vs. name issue across OSes
I can't try it here, but I expect you should be able to printf("%ld\n", gfoo);. I tried using .equ so this would be a constant rather than allocating memory for it, but I never got it to work right.
This does leave the question as to the purpose of the .size attribute. Why set it if you you can't read it? I'm not an expert, but I've got a theory:
According to the docs, for COFF outputs, .size must be within a .def/.endef. Looking at .def, we see that it's used to Begin defining debugging information for a symbol name.
While ELF doesn't have the same nesting requirement, it seems plausible to assume that debugging is the intent there too. If this is only intended to be used by debuggers, it (kinda) makes sense that there's no way to access it from within the assembler.
I guess you just want to get the size of a subset of a code segment or data segment. Here is an assembly example (GAS&AT style) you can refer to:
target_start:
// Put your code segment or data segment here
// ...
target_end = .
// Use .global to export the label
.global target_size
target_size = target_end - target_start
In C/C++ source file, you can use label target_size by extern long target_size.
Note: this example hasn't been tested.
Consider a C function (with external linkage) like the following one:
void f(void **p)
{
/* do something with *p */
}
Now assume that f is being called in a way such that p points to the return address of f on the stack, as in the following code (assuming the System V AMD64 ABI):
leaq -8(%rsp), %rdi
callq f
What may happen is that the code of f modifies the return address on the stack by assigning a value to *p. Thus the compiler will have to treat the return address on the stack as a volatile value. How can I tell the compiler, gcc in my case, that the return address is volatile?
Otherwise, the compiler could, at least in principle, generate the following code for f:
pushq %rbp
movq 8(%rsp), %r10
pushq %r10
## do something with (%rdi)
popq %r10
popq %rbp
addq 8,%rsp
jmpq *%r10
Admittedly, it is unlikely that a compiler would ever generate code like this but it does not seem to be forbidden without any further function attributes. And this code wouldn't notice if the return address on the stack is being modified in the middle of the function because the original return address is already retrieved at the beginning of the function.
P.S.: As has been suggested by Peter Cordes, I should better explain the purpose of my question: It is about garbage collecting dynamically generated machine code using a moving garbage collector: The function f stands for the garbage collector. The callee of f may be a function whose code is being moved around while f is running, so I came up with the idea of letting f know the return address so that f may modify it accordingly to whether the memory area the return address points to has been moved around or not.
Using the SysV ABI (Linux, FreeBSD, Solaris, Mac OS X / macOS) on AMD64/x86-64, you only need a trivial assembly function wrapped around the actual garbage collector function.
The following f.s defines void f(void *), and calls the real GC, real_f(void *, void **), with the added second parameter pointing to the return address.
.file "f.s"
.text
.p2align 4,,15
.globl f
.type f, #function
f:
movq %rsp, %rsi
call real_f
ret
.size f, .-f
If real_f() already has two other parameters, use %rdx (for the third) instead of %rsi. If three to five, use %rcx, %r8, or %r9, respectively. SysV ABI on AMD64/x86-64 only supports up to six non-floating-point parameters in registers.
Let's test the above with a small example.c:
#include <stdlib.h>
#include <stdio.h>
extern void f(void *);
void real_f(void *arg, void **retval)
{
printf("real_f(): Returning to %p instead of %p.\n", arg, *retval);
*retval = arg;
}
int main(void)
{
printf("Function and label addresses:\n");
printf("%p f()\n", f);
printf("%p real_f()\n", real_f);
printf("%p one_call:\n", &&one_call);
printf("%p one_fail:\n", &&one_fail);
printf("%p one_skip:\n", &&one_skip);
printf("\n");
printf("f(one_skip):\n");
fflush(stdout);
one_call:
f(&&one_skip);
one_fail:
printf("At one_fail.\n");
fflush(stdout);
one_skip:
printf("At one_skip.\n");
fflush(stdout);
return EXIT_SUCCESS;
}
Note that the above relies on both GCC behaviour (&& providing the address of a label) as well as GCC behaviour on AMD64/x86-64 architecture (object and function pointers being interchangeable), as well as the C compiler not making any of the myriad optimizations they are allowed to do to the code in main().
(It does not matter if real_f() is optimized; it's just that I was too lazy to work out a better example in main(). For example, one that creates a small function in an executable data segment that calls f(), with real_f() moving that data segment, and correspondingly adjusting the return address. That would match OP's scenario, and is just about the only practical use case for this kind of manipulation I can think of. Instead, I just hacked a crude example that might or might not work for others.)
Also, we might wish to declare f() as having two parameters (they would be passed in %rdi and %rsi) too, with the second being irrelevant, to make sure the compiler does not expect %rsi to stay unchanged. (If I recall correctly, the SysV ABI lets us clobber it, but I might remember wrong.)
On this particular machine, compiling the above with
gcc -Wall -O0 f.s example.c -o example
running it
./example
produces
Function and label addresses:
0x400650 f()
0x400659 real_f()
0x400729 one_call:
0x400733 one_fail:
0x40074c one_skip:
f(one_skip):
real_f(): Returning to 0x40074c instead of 0x400733.
At one_skip.
Note that if you tell GCC to optimize the code (say, -O2), it will make assumptions about the code in main() it is perfectly allowed to do by the C standard, but which may lead to all three labels having the exact same address. This happens on my particular machine and GCC-5.4.0, and of course causes an endless loop. It does not reflect on the implementation of f() or real_f() at all, only that my example in main() is quite poor. I'm lazy.
It is said that we can write multiple declarations but only one definition. Now if I implement my own strcpy function with the same prototype :
char * strcpy ( char * destination, const char * source );
Then am I not redefining the existing library function? Shouldn't this display an error? Or is it somehow related to the fact that the library functions are provided in object code form?
EDIT: Running the following code on my machine says "Segmentation fault (core dumped)". I am working on linux and have compiled without using any flags.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *strcpy(char *destination, const char *source);
int main(){
char *s = strcpy("a", "b");
printf("\nThe function ran successfully\n");
return 0;
}
char *strcpy(char *destination, const char *source){
printf("in duplicate function strcpy");
return "a";
}
Please note that I am not trying to implement the function. I am just trying to redefine a function and asking for the consequences.
EDIT 2:
After applying the suggested changes by Mats, the program no longer gives a segmentation fault although I am still redefining the function.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *strcpy(char *destination, const char *source);
int main(){
char *s = strcpy("a", "b");
printf("\nThe function ran successfully\n");
return 0;
}
char *strcpy(char *destination, const char *source){
printf("in duplicate function strcpy");
return "a";
}
C11(ISO/IEC 9899:201x) §7.1.3 Reserved Identifiers
— Each macro name in any of the following subclauses (including the future library
directions) is reserved for use as specified if any of its associated headers is included;
unless explicitly stated otherwise.
— All identifiers with external linkage in any of the following subclauses (including the
future library directions) are always reserved for use as identifiers with external
linkage.
— Each identifier with file scope listed in any of the following subclauses (including the
future library directions) is reserved for use as a macro name and as an identifier with
file scope in the same name space if any of its associated headers is included.
If the program declares or defines an identifier in a context in which it is reserved, or defines a reserved identifier as a macro name, the behavior is undefined. Note that this doesn't mean you can't do that, as this post shows, it can be done within gcc and glibc.
glibc §1.3.3 Reserved Names proveds a clearer reason:
The names of all library types, macros, variables and functions that come from the ISO C standard are reserved unconditionally; your program may not redefine these names. All other library names are reserved if your program explicitly includes the header file that defines or declares them. There are several reasons for these restrictions:
Other people reading your code could get very confused if you were using a function named exit to do something completely different from what the standard exit function does, for example. Preventing this situation helps to make your programs easier to understand and contributes to modularity and maintainability.
It avoids the possibility of a user accidentally redefining a library function that is called by other library functions. If redefinition were allowed, those other functions would not work properly.
It allows the compiler to do whatever special optimizations it pleases on calls to these functions, without the possibility that they may have been redefined by the user. Some library facilities, such as those for dealing with variadic arguments (see Variadic Functions) and non-local exits (see Non-Local Exits), actually require a considerable amount of cooperation on the part of the C compiler, and with respect to the implementation, it might be easier for the compiler to treat these as built-in parts of the language.
That's almost certainly because you are passing in a destination that is a "string literal".
char *s = strcpy("a", "b");
Along with the compiler knowing "I can do strcpy inline", so your function never gets called.
You are trying to copy "b" over the string literal "a", and that won't work.
Make a char a[2]; and strcpy(a, "b"); and it will run - it probably won't call your strcpy function, because the compiler inlines small strcpy even if you don't have optimisation available.
Putting the matter of trying to modify non-modifiable memory aside, keep in mind that you are formally not allowed to redefine standard library functions.
However, in some implementations you might notice that providing another definition for standard library function does not trigger the usual "multiple definition" error. This happens because in such implementations standard library functions are defined as so called "weak symbols". Foe example, GCC standard library is known for that.
The direct consequence of that is that when you define your own "version" of standard library function with external linkage, your definition overrides the "weak" standard definition for the entire program. You will notice that not only your code now calls your version of the function, but also all class from all pre-compiled [third-party] libraries are also dispatched to your definition. It is intended as a feature, but you have to be aware of it to avoid "using" this feature inadvertently.
You can read about it here, for one example
How to replace C standard library function ?
This feature of the implementation doesn't violate the language specification, since it operates within uncharted area of undefined behavior not governed by any standard requirements.
Of course, the calls that use intrinsic/inline implementation of some standard library function will not be affected by the redefinition.
Your question is misleading.
The problem that you see has nothing to do with the re-implementation of a library function.
You are just trying to write non-writable memory, that is the memory where the string literal a exists.
To put it simple, the following program gives a segmentation fault on my machine (compiled with gcc 4.7.3, no flags):
#include <string.h>
int main(int argc, const char *argv[])
{
strcpy("a", "b");
return 0;
}
But then, why the segmentation fault if you are calling a version of strcpy (yours) that doesn't write the non-writable memory? Simply because your function is not being called.
If you compile your code with the -S flag and have a look at the assembly code that the compiler generates for it, there will be no call to strcpy (because the compiler has "inlined" that call, the only relevant call that you can see from main, is a call to puts).
.file "test.c"
.section .rodata
.LC0:
.string "a"
.align 8
.LC1:
.string "\nThe function ran successfully"
.text
.globl main
.type main, #function
main:
.LFB2:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movw $98, .LC0(%rip)
movq $.LC0, -8(%rbp)
movl $.LC1, %edi
call puts
movl $0, %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE2:
.size main, .-main
.section .rodata
.LC2:
.string "in duplicate function strcpy"
.text
.globl strcpy
.type strcpy, #function
strcpy:
.LFB3:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movq %rdi, -8(%rbp)
movq %rsi, -16(%rbp)
movl $.LC2, %edi
movl $0, %eax
call printf
movl $.LC0, %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE3:
.size strcpy, .-strcpy
.ident "GCC: (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3"
.
I think Yu Hao answer has a great explanation for this, the quote from the standard:
The names of all library types, macros, variables and functions that
come from the ISO C standard are reserved unconditionally; your
program may not redefine these names. All other library names are
reserved if your program explicitly includes the header file that
defines or declares them. There are several reasons for these
restrictions:
[...]
It allows the compiler to do whatever special optimizations it pleases
on calls to these functions, without the possibility that they may
have been redefined by the user.
your example can operate in this way : ( with strdup )
char *strcpy(char *destination, const char *source);
int main(){
char *s = strcpy(strdup("a"), strdup("b"));
printf("\nThe function ran successfully\n");
return 0;
}
char *strcpy(char *destination, const char *source){
printf("in duplicate function strcpy");
return strdup("a");
}
output :
in duplicate function strcpy
The function ran successfully
The way to interpret this rule is that you cannot have multiple definitions of a function end up in the final linked object (the executable). So, if all the objects included in the link have only one definition of a function, then you are good. Keeping this in mind, consider the following scenarios.
Let's say you redefine a function somefunction() that is defined in some library. Your function is in main.c (main.o) and in the library the function is in an a object named someobject.o (in the libray). Remember that in the final link, the linker only looks for unresolved symbols in the libraries. Because somefunction() is resolved already from main.o, the linker does not even look for it in the libraries and does not pull in someobject.o. The final link has only one definition of the function, and things are fine.
Now imagine that there is another symbol anotherfunction() defined in someobject.o that you also happen to call. The linker will try to resolve anotherfunction() from someobject.o, and pull it in from the library, and it will become a part of the final link. Now you have two definitions of somefunction() in the final link - one from main.o and another from someobject.o, and the linker will throw an error.
I use this one frequently:
void my_strcpy(char *dest, char *src)
{
int i;
i = 0;
while (src[i])
{
dest[i] = src[i];
i++;
}
dest[i] = '\0';
}
and you can also do strncpy just by modify one line
void my_strncpy(char *dest, char *src, int n)
{
int i;
i = 0;
while (src[i] && i < n)
{
dest[i] = src[i];
i++;
}
dest[i] = '\0';
}
Building on my last question i'm trying to figure out how .local and .comm directives work exactly and in particular how they affect linkage and duration in C.
So I've run the following experiment:
static int value;
which produces the following assembly code (using gcc):
.local value
.comm value,4,4
When initialized to zero yields the same assembly code (using gcc):
.local value
.comm value,4,4
This sounds logical because in both cases i would expect that the variable will be stored in the bss segment. Moreover, after investigating using ld --verbose it looks that all .comm variables are indeed placed in the bss segment:
.bss :
{
*(.dynbss)
*(.bss .bss.* .gnu.linkonce.b.*)
*(COMMON)
// ...
}
When i initialize however my variable to a value other than zero, the compiler defines the variable in the data segment as i would expected, but produces the following output:
.data
.align 4
.type value, #object
.size value, 4
value:
.long 1
Besides the different segments (bss and data respectively) which thanks to your help previously i now understand, my variable has been defined as .local and .comm in the first example but not in the second. Could anyone explain that difference between the two outputs produced from each case?
The .local directive marks a symbol as a local, non-externally-visible symbol, and creates it if it doesn't already exist. It's necessary for 0-initialized local symbols, because .comm declares but does not define symbols. For the 1-initialized variant, the symbol itself (value:) declares the symbol.
Using .local and .comm is essentially a bit of a hack (or at least a shorthand); the alternative would be to place the symbol into .bss explicitly:
.bss
.align 4
.type value, #object
.size value, 4
value:
.zero 4
Linux kernel zeros the virtual memory of a process after allocation due to security reasons. So, the compiler already knows that the memory will be filled with zeros and does an optimization: if some variable is initialized to 0, there's no need to keep space for it in a executable file (.data section actually takes some space in ELF executable, whereas .bss section stores only its length assuming that its initial contents will be zeros).
(Sorry for bad English.)
Question 1.
void foo(void)
{
goto inside;
for (;;) {
int stack_var = 42;
inside:
...
}
}
Will be a place in stack allocated for the stack_var when I goto the inside label? I.e. can I correctly use the stack_var variable within ...?
Question 2.
void foo(void)
{
for (;;) {
int stack_var = 42;
...
goto outside;
}
outside:
...
}
Will be a place in stack of the stack_var deallocated when I goto the outside label? E.g. is it correct to do return within ...?
In other words, is goto smart for correct working with stack variables (automatic (de)allocation when I walk through blocks), or it's just a stupid jump?
Question 1:
can I correctly use the stack_var variable within ...?
The code in ... can write to stack_var. However, this variable is uninitialized because the execution flow jumped over the initialization, so the code should not read from it without having written to it first.
From the C99 standard, 6.8:3
The initializers of objects that have automatic storage duration […] are evaluated and the values are stored in the objects (including storing an indeterminate value in objects without an initializer) each time the declaration is reached in the order of execution
My compiler compiles the function below to a piece of assembly that sometimes returns the uninitialized contents of x:
int f(int c){
if (c) goto L;
int x = 42;
L:
return x;
}
cmpl $0, %eax
jne LBB1_2
movl $42, -16(%rbp)
LBB1_2:
movl -16(%rbp), %eax
...
popq %rbp
ret
Question 2:
Will be a place in stack of the stack_var deallocated when I goto the outside label?
Yes, you can expect the memory reserved for stack_var to be reclaimed as soon as the variable goes out of scope.
There are two different issues:
lexical scoping of variables inside C code. A C variable only makes sense inside the block in which it is declared. You could imagine that the compiler is renaming variables to unique names, which have sense only inside the scope block.
call frames in the generated code. A good optimizing compiler usually allocate the call frame of the current function on the machine class stack at the beginning of the function. A given location in that call frame, called a slot can (and usually is) reused by the compiler for several local variables (or other purposes).
And a local variable can be kept in a register only (without any slot in the call frame), and that register will obviously be reused for various purposes.
You are probably hurting undefined behavior for your first case. After the goto inside the stack_var is uninitialized.
I suggest you to compile with gcc -Wall and to improve the code till no warnings are given.