How variables of .bss are resolved in final image?

Pardon me if this question is too trivial! It is known that the final executable does not allocate space for uninitialized data within the image. But I want to know: how do references to symbols within .bss get resolved?
Does the object file contain only the addresses of these .bss variables somewhere else and NOT allocate space for them? If so, where are these resolved addresses stored?
For example, suppose a C module has the following global variables:
int x[10];
char chArray[100];
The space for the above variables may not be present in the image, but how does one reference them? Where are their addresses resolved?
Thanks in advance!
/MS

.bss symbols get resolved just like any other symbol the compiler (or assembler) generates. Usually this works by placing related symbols in "sections". For example, a compiler might place program code in a section called ".text" (for historical reasons ;-), initialized data in a section called ".data", and uninitialized data in a section called ".bss".
For example:
int i = 4;
int x[10];
char chArray[100];
int main(int argc, char**argv)
{
}
produces (with gcc -S):
.file "test.c"
.globl i
.data
.align 4
.type i, #object
.size i, 4
i:
.long 4
.text
.globl main
.type main, #function
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $4, %esp
addl $4, %esp
popl %ecx
popl %ebp
leal -4(%ecx), %esp
ret
.size main, .-main
.comm x,40,32
.comm chArray,100,32
.ident "GCC: (GNU) 4.3.2 20081105 (Red Hat 4.3.2-7)"
.section .note.GNU-stack,"",#progbits
The .data directive tells the assembler to place i in the data section, and ".long 4" gives it its initial value. When the file is assembled, i will be defined at offset 0 in the data section.
The .text directive will place main in the .text section, again with an offset of zero.
The interesting thing about this example is that x and chArray are defined using the .comm directive, not placed in .bss directly. Both are given only a size, not an offset (yet).
When the linker gets the object files, it links them together by combining all the sections with the same name and adjusting the symbol offsets accordingly. It also gives each section an absolute address at which it should be loaded.
The symbols defined by the .comm directive are combined (if multiple definitions with the same name exist) and placed in the .bss section. It's at this point that they are given their address.
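Seen from the C side, the result of this resolution is that x and chArray are ordinary objects with real addresses and zeroed storage once the program is loaded. The following is a minimal sketch, assuming a hosted C implementation; the helper names all_zero and bss_objects_are_zeroed are made up for illustration and do not come from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Uninitialized globals, as in the question; a typical toolchain
 * ends up placing them in .bss (directly or via .comm). */
int x[10];
char chArray[100];

/* Returns 1 if every byte of the n-byte object at p is zero, else 0. */
int all_zero(const void *p, size_t n)
{
    const unsigned char *bytes = p;
    for (size_t i = 0; i < n; i++) {
        if (bytes[i] != 0) {
            return 0;
        }
    }
    return 1;
}

/* Hypothetical check: both arrays have storage at run time and start
 * out zeroed, even though the image carries no bytes for them. */
int bss_objects_are_zeroed(void)
{
    assert(sizeof x == 10 * sizeof(int));
    return all_zero(x, sizeof x) && all_zero(chArray, sizeof chArray);
}
```

Calling bss_objects_are_zeroed() from main before anything writes to the arrays should return 1, because C requires objects with static storage duration to be zero-initialized; how the loader provides that zeroed memory is up to the platform.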

Related

GCC-generated assembly for a C program: do I need these sections and symbols if I am programming in assembly and not in C?

This is a simple program in C.
char a;
void main(){};
It caused the following assembly to be generated, starting with:
.text
.globl a
.bss
.type a, #object
.size a, 1
I would like to know how to interpret the above.
I see .text; I believe the . is just a symbol and text means the start of the code section.
Then I see .globl, so I believe the variables or functions that come right after it will be global. Or do I need to write a section name, e.g. .text, right before all variables and functions? That is the question.
Then you see .bss; I assume all uninitialized variables are declared after it.
Finally I see something akin to the global variable my C program had, char a,
like
.type a, #object
So .type tells what it is; I assume it is of object type, as indicated by the # and object in .type a, #object.
Next comes the size, which is 1 for a char, hence this line:
.size a, 1
so I assume if I had global int a; then that would be
.size a,4
char is 1 byte
int is 4 bytes
then moving on
I have
a:
So the first few lines become the following (call this code 1):
# my comment 1
# my comment 2
.text
.globl a
.bss
.type a, #object
.size a, 1
a:
So the question is why a: is at the bottom
What if I write it like this instead (call this code 2):
a:
.text
.globl a
.bss
.type a, #object
.size a, 1
so I like to know is code 1 and code 2 same? That is, does it matter whether a: appears last, as in code 1, or first, as in code 2?
So from the above, my a is in .text and .globl and .bss, its .type is #object, and its size is 1 byte. This is a lot of code to define just one char variable. Is my understanding correct, or should I doubt it?
Moving on, next comes the global main, which is in the .text section and marked .globl.
So I see
.zero 1
.text
.globl main
.type main, #function
main:
I would rather not care about the .zero 1 line, but if I am wrong to ignore it, please tell me what it is for. so again have my gcc place main in .zero (some section???) and the .text section, plus .globl, and the type is #function. So now I know the type comes after the comma, as in .type main, #function and in .type a, #object.
Then I hit something I could not make sense of: searching for .LFB0: brought zero Google search results.
is .LFB0: a some section of program that my x86-64 processor can run
And .cfi_startproc relates to .eh_frame; I read that .eh_frame is a section that lives in the loaded part of the program. so I like to know if I am coding in assembly can I ignore .cfi_startproc line. But what is the point of it? Does it mean that everything after it is loaded into memory or registers and belongs to .eh_frame?
main:
.LFB0:
.cfi_startproc
endbr64
pushq %rbp #
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp #,
.cfi_def_cfa_register 6
So if I am writing a simple assembly program similar to the above C program, do I need everything from .LFB0: through movq %rsp, %rbp and .cfi_def_cfa_register 6? if not needed then I can assume my program will become
.text
.globl a
.bss
.type a, #object
.size a, 1
a:
.zero 1
.text
.globl main
.type main, #function
main:
.cfi_startproc
pushq %rbp
movq %rsp, %rbp
nop
popq %rbp
ret
.cfi_endproc
So my full program becomes the above. Can anyone please tell me how to assemble it with nasm?
I believe I have to save it with .s or .S extension which one s small or large S? I am coding in Ubuntu
This is the gcc-generated code:
.file "test.c"
# GNU C17 (Ubuntu 11.2.0-7ubuntu2) version 11.2.0 (x86_64-linux-gnu)
# compiled by GNU C version 11.2.0, GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.0, isl version isl-0.24-GMP
# GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
# options passed: -mtune=generic -march=x86-64 -fasynchronous-unwind-tables -fstack-protector-strong -fstack-clash-protection -fcf-protection
.text
.globl a
.bss
.type a, #object
.size a, 1
a:
.zero 1
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
endbr64
pushq %rbp #
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp #,
.cfi_def_cfa_register 6
# test.c:2: void main(){};
nop
popq %rbp #
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 11.2.0-7ubuntu2) 11.2.0"
.section .note.GNU-stack,"",#progbits
.section .note.gnu.property,"a"
.align 8
.long 1f - 0f
.long 4f - 1f
.long 5
0:
.string "GNU"
1:
.align 8
.long 0xc0000002
.long 3f - 2f
2:
.long 0x3
3:
.align 8
4:

.text is a directive that tells the assembler to start a program code section (the “text” section of the program, a read-only executable section containing mostly instructions to be executed). It is here because GCC without optimization always puts a .text at the top of the file, even if it's about to switch to another section (like .bss in this case) and then back to .text when it's ready to emit some bytes into that section (in your case, a definition for main). GCC does still parse the whole compilation unit before emitting any asm, though; it's not just compiling one global variable / function at a time as it goes along.
.globl a is a directive that tells the assembler that a is a “global” symbol, so its definition should be listed as an external symbol for the linker to link with.
.bss is a directive that tells the assembler to start the “block starting symbol” section (which will contain data that is initialized to zero or, on some systems, mostly older, is not initialized).
.type a, #object and .size a, 1 are directives that describe the type and size of an object named a. The assembler adds this information to the symbol table or other information in the object file it outputs. It is useful for debuggers to know about the types of objects.
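The number in .size is the object's size in bytes, which is what sizeof reports on the C side. As a rough sketch of the question's guess that a global int would get .size 4: the variable counter and both function names below are hypothetical additions, and the 4-byte int is an assumption about common x86-64 Linux ABIs, not something C guarantees.

```c
#include <assert.h>
#include <stddef.h>

char a;      /* matches ".size a, 1" in the listing */
int counter; /* hypothetical second global, not in the original program */

/* Size in bytes of a; sizeof(char) is 1 by definition in C. */
size_t size_of_a(void)
{
    return sizeof a;
}

/* Size in bytes of counter; commonly 4 on x86-64 Linux, but the C
 * standard only fixes it as sizeof(int). */
size_t size_of_counter(void)
{
    assert(sizeof counter == sizeof(int));
    return sizeof counter;
}
```

Compiling a file like this with gcc -S on the asker's platform would be one way to see which .size value the compiler actually emits for each global.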
a: is a label. It acts to define the symbol. As the assembler reads assembly, it counts bytes in the section it is currently generating. Each data declaration or instruction takes up some bytes, and the assembler counts those. When it sees a label, it associates the label with the current count. (This is commonly called the program counter even when it is counting data bytes.) When the assembler writes information about a to the symbol table, it will include the number of bytes it is from the beginning of the section. When the program is loaded into memory, this offset is used to calculate the address where the object a will be in memory.
So the question is why a: is at the bottom
a: must be after .bss because a will be put into the section the assembler is currently working on, so that needs to be set to the desired section before declaring the label. The location of a relative to the other directives might be flexible, so that reordering them would have no consequence.
so I like to know is code 1 and code 2 same?
No, a: must appear after .bss so that it is put into the correct section.
.zero 1 says to emit 1 zero byte in the current section. Like (almost?) all directives GCC uses, it's well documented in the GNU assembler manual: https://sourceware.org/binutils/docs/as/Zero.html
so again have my gcc place main in .zero
No, .text starts (or switches back to) the code section, so main will be in the code section.
is .LFB0: a some section of program that my x86-64 processor can run
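Anything ending with a colon is a label, not a section. .LFB0 is a local label the compiler emits to mark the beginning of the function (.LFE0 marks its end) so that other generated data, such as debug or unwind information, can refer to it. Nothing in this listing references it, so you can leave it out of hand-written assembly.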
so I like to know if I am coding in assembly can I ignore .cfi_startproc line.
When writing assembly for simple functions without exception handling and related features, you can ignore .cfi_startproc and other call-frame information directives that generate metadata that goes in the .eh_frame section. (Which is not executed, it's just there as data in the file for exception handlers and debuggers to read.)
… if not needed then I can assume my program will become…
If you are omitting some of the .cfi… directives, I would omit all of them, unless you look into what they do and determine which ones can be omitted selectively.
I believe I have to save it with .s or .S extension which one s small or large S?
With GCC and Clang, assembly files ending in .S are processed by the “preprocessor” before assembly, and assembly files ending in .s are not. This is the preprocessor familiar from C, with #define, #if, and other directives. Other tools may not do this. If you are not using preprocessor features, it generally does not matter whether you use .s or .S.

Understanding a few of the 'helper' gnu-as directives

I have compiled a program main.c with about two lines of code to see what directives gcc / gas add to the unoptimized assembly file, using:
gcc -o main.s main.c -S
I can look up the concise description of each directive on the gas directive page, but was hoping someone could give a bit more context to some of these directives and what its practical usage is (for example, in gdb or the linker or wherever downstream). Here is the full assembly file with the items in question below:
.file "main.c"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $4, -8(%rbp)
movl $6, -4(%rbp)
movl -8(%rbp), %edx
movl -4(%rbp), %eax
addl %edx, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0"
.section .note.GNU-stack,"",#progbits
.file: it seems this is halfway obsolete, based on the manual's note: "This statement may go away in future: it is only recognized to be compatible with old as programs." But given that it is still there, where or how is this currently being used?
.ident: it seems like this gives the same thing as doing gcc --version. Is this used at all beyond giving helper information on the 'gcc' that was used to issue the command, or how is this used?
.section .note...: I have seen .section .text, .section .bss, and so on, but I've never come across a .note, and doing a ctrl-f to search for note doesn't give anything on this page. What is this line doing with the three arguments? And the #progbits ?
.size: given that the directives take up no space, this gives the distance from the first instruction in main (pushq %rbp) to just past the last one (ret), which is the length of the main function. But again, what is this used for? Also, it says on the as page that "It is only permitted inside .def/.endef pairs.", but this isn't inside those pairs, right?
.section .text.startup,"ax",#progbits -- what is text.startup, the ax looks like it means allocatable+executable, but what or where is the text.startup ?

Regarding variables in a process memory

In the following code:
int main()
{
int i;
char* s = "Hello";
i = 10;
}
In memory:
10 should go in stack
address of "hello" should go in stack
"Hello" should be stored in read only memory
In the process memory, where are i and s? Where do they reside?

The variable names are just a convenience for the programmer, so that he can refer to them. The values themselves are stored wherever the compiler sees fit to place them, but the names are discarded.
If the optimizer decides that a certain variable has a small enough scope and there are enough registers available, the variable you refer to as i may not even have a storage place in the process memory, because it can be kept in a register as well.
So it mostly depends on the compiler decision, where a certain variable goes. Static and global variables are always in the process memory, but local variables may not be on the stack.

For this program:
The locally-scoped variables i and s are on the stack.
The string "Hello" is stored in the program .rodata section.
The value 10 ($10) and the address of the "Hello" string (.LC0) are stored into i and s by the two mov instructions in main, right after the function prologue.
You can see this all happening with e.g. gcc -S -o bar.s bar.c which will output the assembly language code for bar.c:
.file "bar.c"
.section .rodata
.LC0:
.string "Hello"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq $.LC0, -8(%rbp)
movl $10, -12(%rbp)
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3"
.section .note.GNU-stack,"",#progbits
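One way to observe the split between the local pointer and the string it points to is that the literal remains valid after the function's stack frame is gone. This is only a sketch: the function names greeting and greeting_is_valid_after_return are invented for illustration and are not part of the question's program.

```c
#include <assert.h>
#include <string.h>

/* s and i are locals (stack slots or registers); the literal "Hello"
 * has static storage duration and outlives this call. */
const char *greeting(void)
{
    const char *s = "Hello";
    int i = 10;
    (void)i; /* i is never read; silences unused-variable warnings */
    return s;
}

/* Uses the returned pointer after greeting()'s frame is gone. */
int greeting_is_valid_after_return(void)
{
    const char *p = greeting();
    assert(p != NULL);
    return strcmp(p, "Hello") == 0;
}
```

Returning s is safe because only the pointer variable was local; returning the address of i would not be.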

Here i and s are local variables, so they are stored in the stack segment. The names i and s themselves are just representations in the language.

Can I list the local variables in a stack by using some system functions in C language

The C function backtrace just returns the series of function calls for the program, but I want to list all the local variables in my program, just like info locals in gdb. Any idea if this can be done? Thanks

Generally, no. You should move away from thinking about a "stack" as some sort of god given factum. A call stack is merely a common implementation technique for C. It has no intrinsic meaning or required semantics. Automatic variables ("local variables", as you say) have to behave in a certain way, and sometimes that means that they are written onto the call stack. However, it is entirely conceivable that local variables are never realized in memory at all -- they may instead only ever be stored in a processor register, or eliminated entirely if an equivalent program can be formulated without them.
So, no, there is no language-intrinsic mechanism for enumerating local variables. As you say, the debugger can do so to some extent (depending on debug symbols being present and subject to optimizations); perhaps you can find a library that can process debug symbols from within a running program.
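Since the language cannot enumerate locals for you, a common workaround is to name the variables you care about explicitly and let the preprocessor turn each name into a string. The sketch below assumes int variables only; the macro FORMAT_INT_VAR and the function describe_counter are hypothetical names, not an existing API.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes "name = value" for an int variable into
 * buf. #var stringizes the identifier at preprocessing time, which is
 * the only point where the name still exists. */
#define FORMAT_INT_VAR(buf, size, var) \
    snprintf((buf), (size), "%s = %d", #var, (var))

/* Formats one local variable by name; returns snprintf's result. */
int describe_counter(char *out, size_t size)
{
    int counter = 42;
    assert(out != NULL && size > 0);
    return FORMAT_INT_VAR(out, size, counter);
}
```

This only covers variables you list by hand in the source, so it is a logging aid rather than a substitute for the debugger's info locals.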

If this is just for occasional debugging, then you can invoke the debugger. However, since the debugger itself will freeze your program, you need an intermediary to capture the output. You can, for example, use system, and redirect the output to a file, then read the file afterwards. In the example below, the file gdbcmds.txt contains the line info locals.
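/* Requires <stdio.h>, <stdlib.h>, and <unistd.h>. */
char buf[512];
FILE *gdb;
/* Attach gdb to this process in batch mode and capture its output. */
snprintf(buf, sizeof(buf), "gdb -batch -x gdbcmds.txt -p %d > gdbout.txt",
         (int)getpid());
system(buf);
gdb = fopen("gdbout.txt", "r");
if (gdb != NULL) { /* the file may be missing if gdb could not run */
    while (fgets(buf, sizeof(buf), gdb) != NULL) {
        printf("%s", buf);
    }
    fclose(gdb);
}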

First, note that backtrace is not a standard C library function, but a GNU-specific extension.
In general, it's difficult to impossible to retrieve local variable information from compiled code, especially if it was compiled without debugging or with optimization enabled. If debugging isn't turned on, variable names and types are generally not preserved in the resulting machine code.
For example, take the following ridiculously simple code:
#include <stdio.h>
#include <math.h>
int main(void)
{
int x = 1, y = 2, z;
z = 2 * y - x;
printf("x = %d, y = %d, z = %d\n", x, y, z);
return 0;
}
Here's the resulting assembly output, with no debugging or optimization:
.file "varinfo.c"
.version "01.01"
gcc2_compiled.:
.section .rodata
.LC0:
.string "x = %d, y = %d, z = %d\n"
.text
.align 4
.globl main
.type main,#function
main:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
movl $1, -4(%ebp)
movl $2, -8(%ebp)
movl -8(%ebp), %eax
movl %eax, %eax
sall $1, %eax
subl -4(%ebp), %eax
movl %eax, -12(%ebp)
pushl -12(%ebp)
pushl -8(%ebp)
pushl -4(%ebp)
pushl $.LC0
call printf
addl $16, %esp
movl $0, %eax
leave
ret
.Lfe1:
.size main,.Lfe1-main
.ident "GCC: (GNU) 2.96 20000731 (Red Hat Linux 7.2 2.96-112.7.2)"
x, y, and z are referred to through -4(%ebp), -8(%ebp), and -12(%ebp) respectively. There's nothing to indicate that they're integers other than the instructions used to perform the arithmetic.
It's even better with optimization (-O1) turned on:
.file "varinfo.c"
.version "01.01"
gcc2_compiled.:
.section .rodata.str1.1,"ams",#progbits,1
.LC0:
.string "x = %d, y = %d, z = %d\n"
.text
.align 4
.globl main
.type main,#function
main:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
pushl $3
pushl $2
pushl $1
pushl $.LC0
call printf
movl $0, %eax
leave
ret
.Lfe1:
.size main,.Lfe1-main
.ident "GCC: (GNU) 2.96 20000731 (Red Hat Linux 7.2 2.96-112.7.2)"
In this case, the compiler was able to do some static analysis and compute the value z at compile time; there's no need to set aside any memory for any of the variables at all, because the compiler already knows what those values have to be.
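The same computation can be isolated in a small function to make the folding easy to inspect. This is a sketch rather than the answer's exact program: compute_z is a made-up name, and whether a given compiler reduces it to a constant depends on its version and flags.

```c
#include <assert.h>

/* Same arithmetic as in main above. All inputs are compile-time
 * constants, so an optimizing compiler is free to reduce the body to
 * "return 3" and never give x, y, or z a memory location. */
int compute_z(void)
{
    int x = 1, y = 2, z;
    z = 2 * y - x;
    assert(z == 3); /* 2 * 2 - 1 */
    return z;
}
```

Comparing the gcc -S output for this function with and without -O1 is a quick way to see the stack slots disappear.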

What's the difference between these two types of code?

Here are two code snippets that produce the same output.
char *p = "abc";
::printf("%s",p);
And
::printf("%s","abc");
Is there any difference as to where the "abc" string is stored in memory?
I once heard that in the second code, the "abc" string is placed by the compiler in read-only memory (the .text part?)
How to tell this difference from code if any?
Many thanks.
Update
My current understanding is:
when we write:
char *p="abc"
Though this seems to be only a declarative statement, the compiler will in fact generate imperative instructions for it. These instructions allocate space within the stack frame of the containing function; it could look like this:
subl $4, %esp
Then the address of the "abc" string is moved to that allocated space; it could look like this:
movl $abc_string_address, -4(%ebp)
The "abc" string is stored in the executable file image. But where in memory it (I mean the string) will be loaded depends entirely on the implementation of the compiler/linker. If it is loaded into the read-only part of the process's address space (i.e. the protection bit of the memory page is flagged as read-only), then p points to read-only memory; if it is loaded into the r/w part, the memory p points to is writable.
Correct me if I am wrong. Now I am looking into the assembly code generated by the gcc to have a confirmation for my understanding. I'll update this thread again shortly.

Is there any difference as to where the "abc" string is stored in memory?
Nope, and that is true for both. String literals are stored in the read-only segment. However, if you declare your variable as a char[] it will be copied onto the stack, i.e., not read only.
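The difference between the pointer form and the char[] form can be sketched as follows. The function names modify_array_copy and literal_length are illustrative only; the pointer is declared const here because writing through a pointer to a string literal is undefined behavior in C.

```c
#include <assert.h>
#include <string.h>

/* Array form: the literal's bytes are copied into buf, a local array
 * with its own writable storage. */
char modify_array_copy(void)
{
    char buf[] = "abc";
    assert(sizeof buf == 4); /* 'a', 'b', 'c', and the terminating '\0' */
    buf[0] = 'x';            /* fine: this changes the copy only */
    return buf[0];
}

/* Pointer form: p holds the address of the literal itself, so this
 * code only reads through it. */
size_t literal_length(void)
{
    const char *p = "abc";
    return strlen(p);
}
```

Here only buf has writable storage of its own; p is just a pointer-sized local that refers to the shared literal.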

There is no difference in where the string literal is stored. The only difference is that the former also allocates space on the stack for the variable to store the pointer.

No difference besides a char pointer allocated on the stack for the first one.
They both use a string literal, delimited by double quotes.

Yes, they won't be stored in different locations. Both strings are known at compile time, so the compiler emits "abc" into the program's static data when it generates the assembly code. The .bss section is for uninitialized data, so the string does not go there.
Try compiling with gcc and the -S option (capital S; lowercase -s strips symbols instead). It will generate a .s file, which is assembly code. The "abc" string will be under the .rodata section, the read-only counterpart of the .data section you may know from NASM assembly.
Here's the assembly code if you don't want to do the work:
This is for char* c = "abc"; printf("%s\n", c);
Note how this file has more lines of code than the other one: this code allocates a pointer variable and prints through that variable, while the other solution doesn't use a variable and just references a static memory address.
.file "test.c"
.section .rodata
.LC0:
.string "abc"
.text
.globl main
.type main, #function
main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $32, %esp
movl $.LC0, 28(%esp)
movl 28(%esp), %eax
movl %eax, (%esp)
call puts
movl $0, %eax
leave
ret
.size main, .-main
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",#progbits
And this is for printf("abc\n");
.file "test2.c"
.section .rodata
.LC0:
.string "abc"
.text
.globl main
.type main, #function
main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $16, %esp
movl $.LC0, (%esp)
call puts
movl $0, %eax
leave
ret
.size main, .-main
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",#progbits
To your edit:
I don't think any compiler would put p in read-only memory, since you declared it as a variable in the code. It's not a hidden/protected variable generated by the compiler; it's a variable you can use whenever you want.
If you write char* p = "abc"; the compiler subtracts the pointer's size from the stack pointer to make room for p, and a later instruction puts the address of the "abc" string (which itself lives in read-only memory) into a register or directly into that stack slot. If the compiler needs the register for something else, it saves the value to the stack.
If you write printf("abc"); no variable is allocated, since the compiler knows the string's value at compile time. It just inserts a number there (relative to the start of the executable file), and the program can read the content of that part of memory.
With this option, you can compile it, generate a .exe, then open it in a hex editor, search for the "abc" string, and change it to "cba" or whatever (it will probably be near the beginning or the end of the file), provided the compiler generates a simple .exe like this, which is probable.
