How are variable names stored in memory in C? - c

In C, let's say you have a variable called variable_name. Let's say it's located at 0xaaaaaaaa, and at that memory address, you have the integer 123. So in other words, variable_name contains 123.
I'm looking for clarification around the phrasing "variable_name is located at 0xaaaaaaaa". How does the compiler recognize that the string "variable_name" is associated with that particular memory address? Is the string "variable_name" stored somewhere in memory? Does the compiler just substitute variable_name for 0xaaaaaaaa whenever it sees it, and if so, wouldn't it have to use memory in order to make that substitution?

Variable names don't exist anymore after the compiler runs (barring special cases like exported globals in shared libraries or debug symbols). The entire act of compilation is intended to take the symbolic names and algorithms represented by your source code and turn them into native machine instructions. So yes, if you have a global variable_name, and the compiler and linker decide to put it at 0xaaaaaaaa, then wherever it is used in the code, it will just be accessed via that address.
So to answer your literal questions:
How does the compiler recognize that the string "variable_name" is associated with that particular memory address?
The toolchain (compiler and linker) works together to assign a memory location for the variable. It's the compiler's job to keep track of all the references, and the linker puts in the right addresses later.
Is the string "variable_name" stored somewhere in memory?
Only while the compiler is running.
Does the compiler just substitute variable_name for 0xaaaaaaaa whenever it sees it, and if so, wouldn't it have to use memory in order to make that substitution?
Yes, that's pretty much what happens, except it's a two-stage job with the linker. And yes, it uses memory, but it's the compiler's memory, not anything at runtime for your program.
An example might help you understand. Let's try out this program:
int x = 12;
int main(void)
{
return x;
}
Pretty straightforward, right? OK. Let's take this program, and compile it and look at the disassembly:
$ cc -Wall -Werror -Wextra -O3 example.c -o example
$ otool -tV example
example:
(__TEXT,__text) section
_main:
0000000100000f60 pushq %rbp
0000000100000f61 movq %rsp,%rbp
0000000100000f64 movl 0x00000096(%rip),%eax
0000000100000f6a popq %rbp
0000000100000f6b ret
See that movl line? It's grabbing the global variable (in an instruction-pointer relative way, in this case). No more mention of x.
Now let's make it a bit more complicated and add a local variable:
int x = 12;
int main(void)
{
volatile int y = 4;
return x + y;
}
The disassembly for this program is:
(__TEXT,__text) section
_main:
0000000100000f60 pushq %rbp
0000000100000f61 movq %rsp,%rbp
0000000100000f64 movl $0x00000004,0xfc(%rbp)
0000000100000f6b movl 0x0000008f(%rip),%eax
0000000100000f71 addl 0xfc(%rbp),%eax
0000000100000f74 popq %rbp
0000000100000f75 ret
Now there are two movl instructions and an addl instruction. You can see that the first movl is initializing y, which it's decided will be on the stack (base pointer - 4). Then the next movl gets the global x into a register eax, and the addl adds y to that value. But as you can see, the literal x and y strings don't exist anymore. They were conveniences for you, the programmer, but the computer certainly doesn't care about them at execution time.
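To see the same idea from inside C, here is a minimal sketch. The helper names address_of_x and read_through_address are made up for illustration. It shows that the running program can reach x through nothing but a number, its address:

```c
#include <assert.h>
#include <stdint.h>

int x = 12; /* global: the compiler and linker pick one fixed location for it */

/* The address of x as a plain integer. At run time this number, not the
 * string "x", is what the machine code works with. */
uintptr_t address_of_x(void)
{
    return (uintptr_t)&x;
}

/* Read an int given only its address; no variable name is involved. */
int read_through_address(uintptr_t address)
{
    assert(address != 0);
    return *(int *)address;
}
```

Calling read_through_address(address_of_x()) gives back whatever x currently holds, even though the callee never mentions x.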

A C compiler first creates a symbol table, which stores the relationship between the variable name and where it's located in memory. When compiling, it uses this table to replace all instances of the variable with a specific memory location, as others have stated. You can find a lot more on it on the Wikipedia page.
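As a rough illustration of what such a table does, here is a toy version in C. This is only a sketch: the struct layout, the fixed capacity, and the names symtab_add and symtab_lookup are assumptions made for the example, not how any particular compiler implements it. The point is that this data lives in the compiler's memory, not in your program's.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SYMBOLS 16

/* One entry: a source-level name and the location assigned to it. */
struct symbol {
    const char *name;
    unsigned long address;
};

struct symbol_table {
    struct symbol entries[MAX_SYMBOLS];
    size_t count;
};

/* Record a name/address pair; returns 0 on success, -1 if the table is full. */
int symtab_add(struct symbol_table *table, const char *name, unsigned long address)
{
    assert(table != NULL && name != NULL);
    if (table->count == MAX_SYMBOLS)
        return -1;
    table->entries[table->count].name = name;
    table->entries[table->count].address = address;
    table->count++;
    return 0;
}

/* Look a name up; returns 0 and stores the address on success, -1 if unknown. */
int symtab_lookup(const struct symbol_table *table, const char *name,
                  unsigned long *address)
{
    size_t i;
    assert(table != NULL && name != NULL && address != NULL);
    for (i = 0; i < table->count; i++) {
        if (strcmp(table->entries[i].name, name) == 0) {
            *address = table->entries[i].address;
            return 0;
        }
    }
    return -1;
}
```

Each time the toy "compiler" meets variable_name in the source, it would call symtab_lookup and emit the address it gets back instead of the name.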

All variable names are substituted by the toolchain. First the compiler replaces them with references, and later the linker replaces those references with addresses.
In other words, the variable names are no longer available once the compiler has finished.

This is what's called an implementation detail. While what you describe is the case in all compilers I've ever used, it's not required to be the case. A C compiler could put every variable in a hashtable and look them up at runtime (or something like that) and in fact early JavaScript interpreters did exactly that (now, they do Just-In-Time compilation that results in something much more raw.)
Specifically for common compilers like VC++, GCC, and LLVM: the compiler will generally assign a variable to a location in memory. Variables of global or static scope get a fixed address that doesn't change while the program is running, while variables within a function get a stack address, that is, an address relative to the current stack pointer, which changes every time a function is called. (This is an oversimplification.) Stack addresses become invalid as soon as the function returns, but have the benefit of having effectively zero overhead to use.
Once a variable has an address assigned to it, there is no further need for the name of the variable, so it is discarded. Depending on the kind of name, the name may be discarded at preprocess time (for macro names), compile time (for static and local variables/functions), and link time (for global variables/functions.) If a symbol is exported (made visible to other programs so they can access it), the name will usually remain somewhere in a "symbol table" which does take up a trivial amount of memory and disk space.
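The difference between the two kinds of storage can be observed from C itself. The following is a small sketch with made-up function names: the global reports the same address no matter who asks, while each live call frame gets its own slot for its local.

```c
#include <assert.h>
#include <stdint.h>

int global_counter = 0; /* static storage: one address for the whole run */

/* Address of the global, as seen from whichever frame calls this. */
uintptr_t global_address(void)
{
    return (uintptr_t)&global_counter;
}

/* Recurse `depth` levels; at the bottom, report whether the innermost local
 * lives at a different address than the outermost one. Both objects are
 * alive at the moment of the comparison, so they must be distinct. */
static int innermost_differs(int depth, const volatile int *outermost)
{
    volatile int local = depth;
    if (depth == 0)
        return &local != outermost;
    return innermost_differs(depth - 1, outermost);
}

/* Returns 1 if a callee's local occupies a different slot than the caller's. */
int locals_get_separate_slots(int depth)
{
    volatile int outer = -1;
    assert(depth >= 0);
    return innermost_differs(depth, &outer);
}
```

Nothing here depends on the names global_counter, local, or outer surviving into the executable; only the addresses matter.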

Does the compiler just substitute variable_name for 0xaaaaaaaa whenever it sees it
Yes.
and if so, wouldn't it have to use memory in order to make that substitution?
Yes. But that memory is used by the compiler while it compiles your code, not by your program at runtime, so why would you care?

Related

Is _start() a function?

It stands to reason that, for executable code to be called a function, it should conform to the function calling convention of the platform it's running on.
However, _start() does not; for example in this reference implementation there is no return address on the stack:
.section .text
.global _start
_start:
# Set up end of the stack frame linked list.
movq $0, %rbp
pushq %rbp # rip=0
pushq %rbp # rbp=0
movq %rsp, %rbp
# We need those in a moment when we call main.
pushq %rsi
pushq %rdi
# Prepare signals, memory allocation, stdio and such.
call initialize_standard_library
# Run the global constructors.
call _init
# Restore argc and argv.
popq %rdi
popq %rsi
# Run main
call main
# Terminate the process with the exit code.
movl %eax, %edi
call exit
.size _start, . - _start
Yet it's called a function in a myriad of sources. A number of questions and answers on StackOverflow also refer to it as a function.
Is a function simply a group of instructions identified by the address to the entry point, or must it conform to the calling convention? The C standard does not seem to define the concept of a function, neither do the gcc and clang docs. What is the authoritative source that defines this concept?
As for the lack of a return making a piece of code not a function: even a function written in C does not have to have a return instruction in it:
int call_fn(int(*fn)()) {
return fn();
}
This function, with proper optimizations, compiles down to a single jmp instruction: https://godbolt.org/z/nxT9qTvaf
call_fn(int (*)()): # #call_fn(int (*)())
jmp rdi # TAILCALL
In general, I don't think the C or the C++ standard would define anything about stuff written in assembly. A common calling convention helps for making direct calls into functions written in other languages, but you can still call functions using other calling conventions using a trampoline.
It stands to reason that, for executable code to be called a function, it should conform to the function calling convention of the platform it's running on.
"Function" is the primary idea here; "calling convention" is subsidiary to that. As such, I think a more supportable claim would be that for every function, there is a convention for calling it.
Interoperability considerations lead to standardization of calling conventions, but there is no One True calling convention, not even on a per-platform basis. Even subject to the influence of interoperability, there are platforms that support multiple standard calling conventions. In any case the existence of standard calling conventions does not necessarily relegate code with other conventions for entry and exit to non-function-hood.
Is a function simply a group of instructions identified by the address to the entry point, or must it conform to the calling convention?
This is a question of the definition of "function". There is room for variation on this, and in practice, different definitions apply in different contexts. For example, the question refers to the C language specification, but this speaks to the meaning of "function" in the context of C source code, not assembly or machine code.
In practice, in various languages and contexts, there are
functions with identifiers and functions without;
functions that return a value and functions that don't;
functions with a single entry point and functions with multiple entry points;
functions with a single exit point and functions with multiple exit points;
functions that always return to the caller, functions that usually return, functions that occasionally return, and functions that never return;
a wide variety of patterns for how functions receive data to operate on, how they return data to their caller (if they do so), and what invariants they do and do not ensure
other dimensions of variation, too
Thus, no, I do not accept in any universal sense that a piece of code needs to conform to a particular calling convention to be called a "function", and I also do not accept "a group of instructions identified by the address to the entry point" as a satisfactory universal definition.
Is _start() a function?
A _start() function such as is provided by GCC / Glibc satisfies some relevant definitions of the term. I have no problem with calling it a "function".
There seems to be this idea going around in the newer programming models that all running code is inside functions; but in the beginning this was not so, and if we look at the old languages we can observe this.
Drawing from lisp:
(format t "Hello, World!")
This is hello world in Common Lisp, and it is not a function in any normal sense. For comparison, here it is as a function:
(defun hello ()
(format t "Hello, World!"))
(hello)
And from near the other root of all programming languages; here is Fortran (source):
PROGRAM FUNDEM
C Declarations for main program
REAL A,B,C
REAL AV, AVSQ1, AVSQ2
REAL AVRAGE
C Enter the data
DATA A,B,C/5.0,2.0,3.0/
C Calculate the average of the numbers
AV = AVRAGE(A,B,C)
AVSQ1 = AVRAGE(A,B,C) **2
AVSQ2 = AVRAGE(A**2,B**2,C**2)
PRINT *,'Statistical Analysis'
PRINT *,'The average of the numbers is:',AV
PRINT *,'The average squared of the numbers: ',AVSQ1
PRINT *,'The average of the squares is: ', AVSQ2
END
REAL FUNCTION AVRAGE(X,Y,Z)
REAL X,Y,Z,SUM
SUM = X + Y + Z
AVRAGE = SUM /3.0
RETURN
END
Yup that's top level statements and a function definition. Fortran has three things, the PROGRAM, SUBROUTINEs, and FUNCTIONs.
And again, we can do the same kind of example in QuickBasic:
CALL Hello
Sub Hello()
Print "Hello, World"
End Sub
QuickBasic was kind of funny; you never even tried to name the entry point and whatever .OBJ file was first in the build script was where the entry point was.
There's a general recurring theme here. In all of these, the top level isn't very function-like. The compiler would add stuff to the beginning of the entry point for you so that runtime initialization worked correctly.
Now what happened in C? C took a different path. The initialization routines were written in their own file that calls main() and the compiler just compiles main() as it would any other function and has no capacity for emitting code that runs at top level. Thus, the entry point (traditionally called _start but doesn't have to be) is not and cannot be written in C.
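To make the division of labour concrete, here is a sketch in plain C of what the startup code conceptually does. It is not a real entry point: run_like_start, initialize_runtime, and example_main are hypothetical names, and a real _start would pass the status to exit() and never return.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*main_fn)(int argc, char **argv);

static int runtime_ready = 0; /* stands in for stdio, heap, and signal setup */

/* Hypothetical stand-in for the library initialization done before main. */
static void initialize_runtime(void)
{
    runtime_ready = 1;
}

/* Initialize, call main like any other function, and hand main's return
 * value on as the exit status. This sketch returns the status instead of
 * exiting so that it can be inspected. */
int run_like_start(main_fn program_main, int argc, char **argv)
{
    assert(program_main != NULL);
    initialize_runtime();
    return program_main(argc, argv);
}

/* An example "main" that relies on the runtime having been set up. */
int example_main(int argc, char **argv)
{
    (void)argv;
    return runtime_ready ? argc : -1;
}
```

The part that cannot be written this way is the very first step: obtaining argc and argv from wherever the operating system left them, which is why the real thing is assembly.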
Don't get me wrong here, if you were to compile any of these on a Unix platform today and look at the resulting .o files you would see the modern compilers emit a main() function with the top level code in it. This is because of the preeminence of the C runtime and not because of any need for it to be a function. Had the other languages carried around in their runtimes the definitions of the system calls like they used to, this would not need to be.
Thus the process entry point is not a function.
We can take this argument one step farther; suppose (and I have seen news articles reference a thing kind of like this) we had a full native Java compiler that emitted .o files and linked against .so files providing the Java runtime; we could then ask Is _start a class method? The answer isn't no. The answer is the question makes no sense because you can't get a valid Java reference to the symbol. The same silly thing happens in C, we just need to pick a different platform. On DOS FAR model, _start is exported as PROC NEAR but void _start() expects a PROC FAR. The emitted link-time fixup is of the wrong size and trying to take the address of _start results in undefined behavior.
You are mixing fields. You can't apply "text specification" to oranges.
the C standard does not seem to define the concept of a function
C is a language. In the C language, the text like the following:
void func();
is a function declaration of a function func.
Is _start() a function?
The text you posted is not in the C language. There are no function declarations or definitions in it.
As you stated, the term function is not defined in the C standard. I would assume that the English language understanding of the term "function" applies here, as to any other word in the C standard.
I see in Merriam-Webster that a "function" is a computer subroutine, where a subroutine is a sequence of computer instructions for performing a specified task that can be used repeatedly.
Clearly, _start is a function - it is a sequence of instructions to be executed repeatedly, it is executed on a computer, and it also operates on variables in the form of registers.
The text you posted represents the function _start in the form of a text using assembly language. It is not possible to represent the function _start in the C programming language.
(It is also not possible to express oranges, yet they exist in the real world. My point is, you can take any other word in the C standard, like, I don't know, "international", and ask "Are oranges international?". Applying C standard and "language-lawyer" tag to abstract contexts is not going to give you answers. Bottom line is that the C standard is a specification - it tells what happens when, it is not a dictionary.)
Is a function simply a group of instructions identified by the address to the entry point, or must it conform to the calling convention?
See Merriam-Webster function.
What is the authoritative source that defines this concept?
I googled and "There is no official agency that makes rules for English language".
The C standard is created by http://www.open-std.org/jtc1/sc22/wg14/ .

gcc 8.2+ doesn't always align the stack before a call on x86?

The current (Linux) version of the SysV i386 ABI requires 16-byte stack alignment before a call:
The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed on stack) byte boundary. In other words, the value (%esp + 4) is always a multiple of 16 (32) when control is transferred to the function entry point.
On GCC 8.1 this code aligns the stack to 16-byte boundary prior to the call to callee: (Godbolt)
source          # bytes
call            4
push ebp        4
sub esp, 24     24
sub esp, 4      4
push eax        4
push eax        4
push eax        4
Total           48
On all versions of GCC 8.2 and later, it aligns to a 4-byte boundary: (Godbolt)
source          # bytes
call            4
push ebp        4
sub esp, 16     16
push eax        4
push eax        4
push eax        4
Total           36
Easily verifiable if we shorten or raise the number of parameters required by callee.
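The two tallies can be checked mechanically. This is just the arithmetic as a sketch, assuming the count starts at the return address pushed by the call into the function (so a total that is a multiple of 16 means ESP is 16-byte aligned at the call to callee). The helper names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the bytes by which each instruction moves ESP. */
unsigned total_bytes(const unsigned *adjustments, size_t count)
{
    unsigned total = 0;
    size_t i;
    assert(adjustments != NULL || count == 0);
    for (i = 0; i < count; i++)
        total += adjustments[i];
    return total;
}

/* Distance past the last multiple of `boundary`; 0 means aligned. */
unsigned misalignment(unsigned total, unsigned boundary)
{
    assert(boundary != 0);
    return total % boundary;
}
```

For the GCC 8.1 sequence {4, 4, 24, 4, 4, 4, 4} the total is 48 and the misalignment against 16 is 0. For the GCC 8.2 sequence {4, 4, 16, 4, 4, 4} the total is 36, which is 4 past a 16-byte boundary.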
Changing -mpreferred-stack-boundary bizarrely changes the operand to the sub instruction, but does nothing to change the actual stack alignment: (Godbolt)
So, uh, what gives?
Since you provided a definition of the function in the same translation unit, apparently GCC sees that the function doesn't care about stack alignment and doesn't bother much with it. And apparently this basic inter-procedural analysis / optimization (IPA) is on by default even at -O0.
Turns out this option even has an obvious name when I searched for "ipa" options in the manual: -fipa-stack-alignment is on by default even at -O0. Manually turning it off with -fno-ipa-stack-alignment results in what you expected, a second sub whose value depends on the number of pushes (Godbolt), making sure ESP is aligned by 16 before a call like modern Linux versions of the i386 SysV ABI use.
Or if you change the definition to just a declaration, then the resulting asm is as expected, fully respecting -mpreferred-stack-boundary.
void callee(void* a, void* b) {
}
to
void callee(void* a, void* b);
Using -fPIC also forces GCC to not assume anything about the callee, so it does respect the possibility of function interposition (e.g. via LD_PRELOAD) with the appropriate option.
Without compiling for a shared library, GCC is allowed to assume that any definition it sees for a global function is the definition, thanks to ISO C's one-definition-rule.
If you use __attribute__((noipa)) on the function definition, then call sites won't assume anything based on the definition. Just like if you'd renamed the definition (so you could still look at it) and provided only a declaration of the name the caller uses.
If you just want to stop inlining, you can use __attribute__((noinline,noclone)) instead, to still allow the callsite to be like it would if the optimizer simply chose not to inline, but could still see this definition. That may or may not be what you want.
See also How to remove "noise" from GCC/clang assembly output? re: writing functions whose asm is interesting to look at, and compiler options.
And BTW, I found it easiest to change the declaration / definition to variadic, so I could add or remove args with only a change to the caller. I was still able to reproduce your result of that not changing the sub amount even when the push amount changes with an extra arg, when there's a definition, but not with just a declaration.
void callee(void* a, ...) // {} // comment out a body or not
;

C language variable in executable file

Take a simple C file like the one below:
int main()
{
int a = 999;
return 0;
}
After compiling and linking with gcc, it generates an executable file (e.g. .exe, .out).
But when I open (NOT RUN) the executable file in an editor, I cannot find the value of variable 'a', the number 999, which in hex is 0x3E7.
My question is:
Does the value 999 exist in the executable file?
If not, where is the value stored? How does the executable get the value at run time?
P.S.: I have a little knowledge of memory sections like .data, .bss, .text, etc. and of assembly language. Even so, I cannot find it using OllyDbg.
There is no reason for the compiler to put the value 999 anywhere, since it is not used anywhere. The program has the same observable behaviour regardless of whether 999 is somewhere in memory or not.
ISO/IEC 9899:TC2 - 5.1.2.3 Program execution:
In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).
One of the important functions of optimization is removing unused stuff. The behaviour of the program doesn't depend at all on the 999, and a isn't volatile, so the assignment isn't part of any visible side-effect, and the program is exactly equivalent to int main(){return 0;}
It's a lot easier to look at compiler output in asm form.
On the Godbolt compiler explorer, you can set it up so you can see asm output for gcc -O0 and gcc -O1 at the same time, in different panes. https://godbolt.org/z/wNHEnN.
mov DWORD PTR [rbp-4], 999 is there at -O0, along with stack-frame setup fluff. So the -O0 output will have 999 as a dword immediate operand to a mov instruction. The variable is local, so you won't find a symbol-table entry for it in the .data or .rdata sections (like you would with a global with a static initializer).
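If you want to see the three cases side by side in a compiler explorer, something like the following sketch works. The function names are made up for the example. What the compiler actually emits depends on the compiler and flags, so treat the comments as what it is allowed to do, not as guaranteed output.

```c
#include <assert.h>
#include <stddef.h>

/* The store to `a` has no observable effect, so an optimizing compiler is
 * free to drop it, and 999 need not appear in the binary at all. */
int unused_local(void)
{
    int a = 999;
    (void)a; /* silences unused-variable warnings; still not an observable use */
    return 0;
}

/* Accesses to a volatile object count as observable behaviour, so the store
 * and the load both have to be performed. */
int volatile_local(void)
{
    volatile int a = 999;
    return a;
}

/* A global with a static initializer: its value is part of the program's
 * initialized data rather than an immediate inside an instruction. */
int global_a = 999;

/* Read an int through a pointer, e.g. read_int(&global_a). */
int read_int(const int *p)
{
    assert(p != NULL);
    return *p;
}
```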
See also How to remove "noise" from GCC/clang assembly output? for more about looking at compiler output.

Measuring size of a function generated with Clang/LLVM?

Recently, when working on a project, I had a need to measure the size of a C function in order to be able to copy it somewhere else, but was not able to find any "clean" solutions (ultimately, I just wanted to have a label inserted at the end of the function that I could reference).
Having written the LLVM backend for this architecture (while it may look like ARM, it isn't) and knowing that it emitted assembly code for that architecture, I opted for the following hack (I think the comment explains it quite well):
/***************************************************************************
* if ENABLE_SDRAM_CALLGATE is enabled, this function should NEVER be called
* from C code as it will corrupt the stack pointer, since it returns before
* its epilog. this is done because clang does not provide a way to get the
* size of the function so we insert a label with inline asm to measure the
* function. in addition to that, it should not call any non-forceinlined
* functions to avoid generating a PC relative branch (which would fail if
* the function has been copied)
**************************************************************************/
void sdram_init_late(sdram_param_t* P) {
/* ... */
#ifdef ENABLE_SDRAM_CALLGATE
asm(
"b lr\n"
".globl sdram_init_late_END\n"
"sdram_init_late_END:"
);
#endif
}
It worked as desired but required some assembler glue code in order to call it and is a pretty dirty hack that only worked because I could assume several things about the code generation process.
I've also considered other ways of doing this which would work better if LLVM was emitting machine code (since this approach would break once I added an MC emitter to my LLVM backend). The approach I considered involved taking the function and searching for the terminator instruction (which would either be a b lr instruction or a variation of pop ..., lr) but that could also introduce additional complications (though it seemed better than my original solution).
Can anyone suggest a cleaner way of getting the size of a C function without having to resort to incredibly ugly and unreliable hacks such as the ones outlined above?
I think you're right that there aren't any truly portable ways to do this. Compilers are allowed to re-order functions, so taking the address of the next function in source order isn't safe (but does work in some cases).
If you can parse the object file (maybe with libbfd), you might be able to get function sizes from that.
clang's asm output has this metadata (the .size assembler directive after every function), but I'm not sure whether it ends up in the object file.
int foo(int a) { return a * a * 2; }
## clang-3.8 -O3 for amd64:
## some debug-info lines manually removed
.globl foo
foo:
.Lfunc_begin0:
.cfi_startproc
imul edi, edi
lea eax, [rdi + rdi]
ret
.Lfunc_end0:
.size foo, .Lfunc_end0-foo ####### This line
Compiling this to a .o with clang-3.8 -O3 -Wall -Wextra func-size.c -c, I can then do:
$ readelf --symbols func-size.o
Symbol table '.symtab' contains 4 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 FILE LOCAL DEFAULT ABS func-size.c
2: 0000000000000000 0 SECTION LOCAL DEFAULT 2
3: 0000000000000000 7 FUNC GLOBAL DEFAULT 2 foo ### This line
The three instructions total 7 bytes, which matches up with the size output here. It doesn't include the padding to align the entry point, or the next function: the .align directives are outside the two labels that are subtracted to calculate the .size.
This probably doesn't work well for stripped executables. Even their global functions won't still be present in the symbol table of the executable. So you might need a two-step build process:
compile your "normal" code
get sizes of functions you care about into a table, using readelf | some text processing > sizes.c
compile sizes.c
link everything together
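The generated sizes.c from that two-step build might look something like this sketch. The struct, the array name, and the lookup function are assumptions for illustration. The single entry is copied from the readelf listing above, and in a real build a script would write the array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct function_size {
    const char *name;
    size_t size; /* bytes, from the Size column of readelf --symbols */
};

/* In the two-step build this array would be generated from readelf output. */
static const struct function_size function_sizes[] = {
    { "foo", 7 },
};

/* Returns the recorded size of `name`, or 0 if it is not in the table. */
size_t function_size_of(const char *name)
{
    size_t i;
    assert(name != NULL);
    for (i = 0; i < sizeof function_sizes / sizeof function_sizes[0]; i++) {
        if (strcmp(function_sizes[i].name, name) == 0)
            return function_sizes[i].size;
    }
    return 0;
}
```

Code that wants to copy foo somewhere else would then ask function_size_of("foo") for the number of bytes to copy.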
Caveat
A really clever compiler could compile multiple similar functions to share a common implementation. So one of the functions jumps into the middle of the other function body. If you're lucky, all the functions are grouped together, with the "size" of each measuring from its entry point all the way to the end of the blocks of code it uses. (But that overlap would make the total sizes add up to more than the size of the file.)
Current compilers don't do this, but you can prevent it by putting the function in a separate compilation unit, and not using whole-program link-time optimization.
A compiler could decide to put a conditionally-executed block of code before the function entry point, so the branch can use a shorter encoding for a small displacement. This makes that block look like a static "helper" function which probably wouldn't be included in the "size" calculation for function. Current compilers never do this, either, though.
Another idea, which I'm not confident is safe:
Put an asm volatile with just a label definition at the end of your function, and then assume the function size is at most that + 32 bytes or something. So when you copy the function, you allocate a buffer 32B larger than your "calculated" size. Hopefully there's only a "ret" insn beyond the label, but actually it probably goes before the function epilogue which pops all the call-preserved registers it used.
I don't think the optimizer can duplicate an asm volatile statement, so it would force the compiler to jump to a common epilogue instead of duplicating the epilogue like it might sometimes for early-out conditions.
But I'm not sure there's an upper bound on how much could end up after the asm volatile.

Does assembly code ignore const keyword?

Does assembly code when used(linked) in a c project ignore const keyword declared before C datatypes and functions?? And can it modify the contents of the datatypes??
Does assembly code when used(linked) in a c project ignore const
keyword declared before C datatypes and functions??
Yes, the const keyword is utterly ignored by assembly code.
And can it modify the contents of the datatypes??
If the compiler was able to place a const location in a read-only segment, then assembly code trying to modify it will cause a segmentation fault. Otherwise, unpredictable behavior will result, because the compiler may have optimized parts of the code, but not others, under the assumption that const locations were not modified.
And can it modify the contents of the datatypes??
Maybe, maybe not. If the original object was declared const then the compiler might emit it into a read-only data segment, which would be loaded into a read-only memory page at runtime. Writing to that page, even from assembly, would trigger a runtime exception (access violation or segmentation fault).
You won't receive a compile-time error, but at runtime your program might crash or behave erratically.
The compiler uses the datatypes you declared in C to decide how to store the information in memory. Everything is binary at the end of the day (int, long, char, etc.), so there are no datatypes once you get to the low-level code.
The question is not very precise. What does "ignores" mean?
Assembly language does not have the same concept of const as C language does. So, assembly cannot ignore it. It simply has no idea about it.
Yet the assembly code generated by C compiler for a C program might easily be affected by the placement of const keywords in your C program.
In other words, assembly code can be affected by const keywords. But once that assembly code is built, the const keyword is no longer necessary.
To say that assembler can modify something declared as const is not exactly correct either. If you declare a variable as const, in some cases the compiler might be smart enough to eliminate that variable entirely, replacing it with immediate value of that variable. This means that that const variable might disappear from the final code entirely, leaving nothing for the assembly code to "modify".
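The distinction between a const view of an object and an object that is itself const can be shown in C alone. This is a sketch with made-up names. The first function is only well defined when the pointed-to object was not declared const.

```c
#include <assert.h>
#include <stddef.h>

/* Write through a pointer-to-const by casting the qualifier away. This is
 * well defined only if the underlying object is not itself const; calling it
 * with the address of a const object would be undefined behaviour. */
int overwrite_through_view(const int *view, int new_value)
{
    assert(view != NULL);
    *(int *)view = new_value;
    return *view;
}

/* An object declared const must never be written. The compiler may place it
 * in read-only storage or fold its value into immediates. */
static const int limit = 10;

int read_const_object(void)
{
    return limit;
}
```

So const restricts what the C source may do through a given name or pointer. Whether a write from elsewhere (assembly included) works, faults, or goes unnoticed depends on where the object ended up and what the optimizer assumed.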
GCC places global variables marked as const in a separate section, called .rodata.
The .rodata section is also used for storing string constants.
Since the contents of the .rodata section will not be modified, they can be placed in Flash. The linker script has to be modified to accommodate this.
#include <stdio.h>
const int a = 10 ;
int main ( void ) {
return a ;
}
.section .rodata
.align 4
a:
.long 10
gcc with -O0:
movl a(%rip), %eax // constant variable, loaded from memory
addq $32, %rsp
popq %rbp
gcc with -O3:
movl $10, %eax // numeric constant
addq $40, %rsp
ret

Resources