Memory layout of a c program - c

I am reading this article http://www.geeksforgeeks.org/memory-layout-of-c-program/,
it said " Uninitialized variable stored in bss", "Initialized variable stored in Data segment"
My question is why we need to have 2 separate segments for variables? 1. BSS 2. Data segment?
Why not just put everything into 1 segment?

BSS takes up no space in the program image. It just indicates how large the BSS section is and the runtime will set that memory to zero.
The data section is filled with the initial values for the variables so it takes space in the program image file.

To my knowledge, uninitialized variables (in .bss) are (or should be) zerod out when entering the program. Initialised variables (.data) get a specific value.
This means that in the executable of your program (stored on disk), the .data segment must be included byte per byte (since each variable has a potentially different value). The .bss however, must not be saved byte per byte. One must only know the size to reserve in memory when loading the executable. The program knows the offset of each variable in .bss
To zero out all the uninitialized variables, a few assembler instructions will do (for x86: rep stosw with some register settings for instance).
Conclusion: loading and initialisation time for .data is lot worse than for large .bss segments, since the .data must be loaded from disk, and .bss is only to be reserved on the fly with very few cpu instructions.

Related

Why do we need .bss segment? [duplicate]

What I know is that global and static variables are stored in the .data segment, and uninitialized data are in the .bss segment. What I don't understand is why do we have dedicated segment for uninitialized variables? If an uninitialized variable has a value assigned at run time, does the variable exist still in the .bss segment only?
In the following program, a is in the .data segment, and b is in the .bss segment; is that correct? Kindly correct me if my understanding is wrong.
#include <stdio.h>
#include <stdlib.h>
int a[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9};
int b[20]; /* Uninitialized, so in the .bss and will not occupy space for 20 * sizeof (int) */
int main ()
{
;
}
Also, consider following program,
#include <stdio.h>
#include <stdlib.h>
int var[10]; /* Uninitialized so in .bss */
int main ()
{
var[0] = 20 /* **Initialized, where this 'var' will be ?** */
}
The reason is to reduce program size. Imagine that your C program runs on an embedded system, where the code and all constants are saved in true ROM (flash memory). In such systems, an initial "copy-down" must be executed to set all static storage duration objects, before main() is called. It will typically go like this pseudo:
for(i=0; i<all_explicitly_initialized_objects; i++)
{
.data[i] = init_value[i];
}
memset(.bss,
0,
all_implicitly_initialized_objects);
Where .data and .bss are stored in RAM, but init_value is stored in ROM. If it had been one segment, then the ROM had to be filled up with a lot of zeroes, increasing ROM size significantly.
RAM-based executables work similarly, though of course they have no true ROM.
Also, memset is likely some very efficient inline assembler, meaning that the startup copy-down can be executed faster.
The .bss segment is an optimization. The entire .bss segment is described by a single number, probably 4 bytes or 8 bytes, that gives its size in the running process, whereas the .data section is as big as the sum of sizes of the initialized variables. Thus, the .bss makes the executables smaller and quicker to load. Otherwise, the variables could be in the .data segment with explicit initialization to zeroes; the program would be hard-pressed to tell the difference. (In detail, the address of the objects in .bss would probably be different from the address if it was in the .data segment.)
In the first program, a would be in the .data segment and b would be in the .bss segment of the executable. Once the program is loaded, the distinction becomes immaterial. At run time, b occupies 20 * sizeof(int) bytes.
In the second program, var is allocated space and the assignment in main() modifies that space. It so happens that the space for var was described in the .bss segment rather than the .data segment, but that doesn't affect the way the program behaves when running.
From Assembly Language Step-by-Step: Programming with Linux by Jeff Duntemann, regarding the .data section:
The .data section contains data definitions of initialized data items. Initialized
data is data that has a value before the program begins running. These values
are part of the executable file. They are loaded into memory when the
executable file is loaded into memory for execution.
The important thing to remember about the .data section is that the
more initialized data items you define, the larger the executable file
will be, and the longer it will take to load it from disk into memory
when you run it.
and the .bss section:
Not all data items need to have values before the program begins running.
When you’re reading data from a disk file, for example, you need to have a
place for the data to go after it comes in from disk. Data buffers like that are
defined in the .bss section of your program. You set aside some number of
bytes for a buffer and give the buffer a name, but you don’t say what values
are to be present in the buffer.
There’s a crucial difference between data items defined in the .data
section and data items defined in the .bss section: data items in the
.data section add to the size of your executable file. Data items in
the .bss section do not. A buffer that takes up 16,000 bytes (or more,
sometimes much more) can be defined in .bss and add almost nothing
(about 50 bytes for the description) to the executable file size.
Well, first of all, those variables in your example aren't uninitialized; C specifies that static variables not otherwise initialized are initialized to 0.
So the reason for .bss is to have smaller executables, saving space and allowing faster loading of the program, as the loader can just allocate a bunch of zeroes instead of having to copy the data from disk.
When running the program, the program loader will load .data and .bss into memory. Writes into objects residing in .data or .bss thus only go to memory, they are not flushed to the binary on disk at any point.
The System V ABI 4.1 (1997) (AKA ELF specification) also contains the answer:
.bss This section holds uninitialized data that contribute to the
program’s memory image. By definition, the system initializes the
data with zeros when the program begins to run. The section occupies no file space, as indicated by the section type, SHT_NOBITS.
says that the section name .bss is reserved and has special effects, in particular it occupies no file space, thus the advantage over .data.
The downside is of course that all bytes must be set to 0 when the OS puts them on memory, which is more restrictive, but a common use case, and works fine for uninitialized variables.
The SHT_NOBITS section type documentation repeats that affirmation:
sh_size This member gives the section’s size in bytes. Unless the section type is SHT_NOBITS, the section occupies sh_size
bytes in the file. A section of type SHT_NOBITS may have a non-zero
size, but it occupies no space in the file.
The C standard says nothing about sections, but we can easily verify where the variable is stored in Linux with objdump and readelf, and conclude that uninitialized globals are in fact stored in the .bss. See for example this answer: What happens to a declared, uninitialized variable in C?
The wikipedia article .bss provides a nice historical explanation, given that the term is from the mid-1950's (yippee my birthday;-).
Back in the day, every bit was precious, so any method for signalling reserved empty space, was useful. This (.bss) is the one that has stuck.
.data sections are for space that is not empty, rather it will have (your) defined values entered into it.

Set .data segment size in C, dynamically

Is there a means by which to manipulate the .data segment size in C without increasing compile size of the binary (i.e. setting the size without setting any variables within)?
Linux programs have two data sections: ".data" and ".bss". The ".data" is used for variables with initial value (static int x=5), while the ".bss" is used for variables that start with 0 (static int x). Adding data to '.data' will result in space to hold the initial value.
Consider going for the ".bss" section, which will have little impact on object size.

Why is there no content for the .bss section in an object (ELF) file?

This question confused me a lot. As far as I know, .bss section is for saving data that initialized but not used yet. But I don't understand what 'content' here mean and why there is no content here?
Thanks for any helps!
The quick response is: Well, there's no content to fill the .bss with, so there's no sense in putting any data on the executable in relation to that section. Only the positions of the variables are stored, but that belongs to another ELF section.
.bss section is where your program has all the uninitialized variables (by default all initialized to zero) The linker only needs to know the actual size of this region and the actual variable positions, but not the values, because its contents are obvious, independently of the nature or the distribution of the variables put there.
When your program is loaded, the kernel normally assigns a read-only segment for the unmodifiable text of the program (.text section) and also puts in that segment the contents of the initialized const variables (.rodata section) so in case yo attempt to modify something there, you get an exception. Then comes the initialized data section with the initial values of all the initialized variables of your program (.data section) and the uninitialized ones (.bss section)
The data segment (look how I call different a section and a load segment) is given more space, the sum of .data and .bss sections, to hold all the variables (both are included, so that's the reason it uses its length) but while the contents of the .data section have to be filled from the file, the contents of the .bss section don't, because all are zeroed by the operating system, before allowing the user process to access the allocated segment. That's not true for small systems, where the operating system doesn't fill the data with zeros... but there, the compiler adds some code to zero all the .bss segment, so again, there's no need to copy any data from the executable file.
The historic (and main) reason for this behaviour is that the pages the kernel assigns that have to be loaded with your program, are cleared to zero for security reasons (so you cannot luckily get a page full of other users' passwords, or other sensible information) so there's no reason to fill it with zeros again and nothing has to be copied there, there's no reason to put anything on the executable file. The pages the kernel maintains normally are zeroed only when they are going to be given to a user, but maintain (as they are designed for that purpose) the information until they are overwritten.
There's no content in the BSS (Block started By Symbol) section because it would be wasted storage. The contents of the BSS is all zeros and it is cleared by the startup code before main is called. Think of the BSS as a run-length compressed block of bytes. All you need to know to uncompress that block is the value (0) and the length, which is stored in the ELF entry for the BSS.
Your notion of "data that [is] initialized but not used yet" is a bit off. Consider that all sections in an ELF file are somehow "not used yet". The text segment may or may not become used (it may contain dead/unreachable code). The data segment may or may not be used at all (you can define objects never used by code).

where should the .bss section of ELF file take in memory?

It is known that .bss section was not stored in the disk, but the .bss section in memory should be initialized to zero. but where should it take in the memory? Is there any information displayed in the ELF header or the Is the .bss section likely to appear next to the data section, or something else??
The BSS is between the data and the heap, as detailed in this marvelous article.
You can find out the size of each section using size:
cnicutar#lemon:~$ size try
text data bss dec hex filename
1108 496 16 1620 654 try
To know where the bss segment will be in memory, it is sufficient to run readelf -S program, and check the Addr column on the .bss row.
In most cases, you will also see that the initialized data section (.data) comes immediately before. That is, you will see that Addr+Size of the .data section matches the starting address of the .bss section.
However, that is not always necessarily the case. These are historical conventions, and the ELF specification (to be read alongside the platform specific supplement, for instance Chapter 5 in the one covering 32-bit x86 machines) allows for much more sophisticated configurations, and not all of them are supported by Linux.
For instance, the section may not be called .bss at all. The only 2 properties that make a BSS section such are:
The section is marked with SHT_NOBITS (that is, it takes space in memory but none on the storage) which shows up as NOBITS in readelf's output.
It maps to a loadable (PT_LOAD), readable (PF_R), and writeable (PF_W) segment. Such a segment is also shorter on storage than it is in memory (p_filesz < p_memsz).
You can have multiple BSS sections: PowerPC executables may have .sbss and .sbss2 for uninitialized data variables.
Finally, the BSS section is not necessarily adjacent to the data section or the heap. If you check the Linux kernel (more in particular the load_elf_binary function) you can see that the BSS sections (or more precisely, the segment it maps to) may even be interleaved with code and initialized data. The Linux kernel manages to sort that out.

What is the advantage of having a .bss section?

What is the benefit of having 2 sections - .data and .bss for process scope variables. Why not just have one? I know what each section is used for. I am using gcc.
.bss consumes "memory" but not space within the executable file. Its sole purpose is to hold zero-initialized data (as you know).
.data (and related sections such as rodata) do actually consume space within the executable file, and usually holds strings, integers, and perhaps even entire objects.
There is a lot of zero-initialized data in a typical program, so having that data not consume extra space in the output file is a significant bonus.
As for the multiple .*data sections... .rodata/.data can be used as a hint for memory protection (disallow overwriting .rodata, allow read/write to .data).

Resources