Why size of extern variables is 0 in ELF? - c

Extern declarations of variables in a c program result in size 0 in ELF. Why isn't actual size stored in ELF when known? For cases like incomplete arrays I understand there is no size information but for other cases it should be possible to store size.
I tried some simple codes and verified in ELF size emitted is zero.
// file1.c
extern int var;
int main()
{
var = 2;
}
// file 2.c
long long int var = 8;
gcc -c file1.c
readelf -s file1.o
...
9: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND var
...
gcc -c file2.c
readelf -s file2.o
...
7: 0000000000000000 8 OBJECT GLOBAL DEFAULT 2 var
...
If size of var was stored as 4 in file1.o, linker can actually detect potential mismatch due to size when linking with file2.o.
So why isn't size emitted as it can help catch some subtle issues like this?

In file1, var is just a placeholder. It will not occupy any memory. The extern identifier is to indicate to the compiler and the linker that the variable var is stored elsewhere.
It would be wrong to have two storage locations for the single variable var as you have suggested.
It is a quirk of the C language that you can define a different type for extern variable and a different type for the underlying global variable as you have done in your example. This is one of the reasons that we have static analysis tools.

Related

Example of a translation unit vs file scope

Taking the following program as an example:
// myprogram.c
#include<stdio.h>
int a, b;
int main(void)
{
printf("A: %d\n", a);
printf("B: %d\n", b);
}
// friendsprogram.c
int a=1;
static int b=2;
$ gcc myprogram.c friendsprogram.c -o out; ./out
A: 1
B: 0
How would the translation units be classified in the above? And how what that be different than just the contents of the file "myprogram.c" and "friendsprogram.c"? Does the translation unit ever depend on the command that is issued to the compiler? For example, If I change the command to just:
$ gcc myprogram.c -o out; ./out
My output becomes:
A: 0
B: 0
A translation unit is a source file along with all of its included headers that is compiled as a single unit.
In this example, myprogram.c along with the header stdio.h is one translation unit. The file friendsprogram.c is another translation unit.
Note that this doesn't change when you compile like this:
gcc myprogram.c friendsprogram.c -o out
Because this command line combines compiling and linking into a single step. A temporary object file is created for myprogram.c and another for friendsprogram.c, then those object files are linked to create the file "out".
What's happening is a side effect of old lenient compiler/linker behavior and a so-called "common" section.
int a;
The C spec says this global-scope variable is initialized to zero. You would expect this to go into the .bss (zero-initialized data) section of the executable.
But in GCC <10, the variable is put into the "common" section when that file (translation unit) is compiled.
int a=1;
Now you've provided an initialization, and this variable will go into the .data section.
But when the linker links these two object files together, rather than issue a "multiple definitions" (for the same name) error, it will do something controversial, and merge them into one variable, due to the common section semantics.
By passing -fno-common, or using GCC >= 10, the common section is not used, and the linker will issue an error and refuse to link your program.
So what should you do? Simple: provide only one definition for any name.
If you really want to use global variables (undesirable in general), and you want to put them in a separate translation unit (weird), use extern in your other files:
data.h
// Declaration: Tells everyone that 'a' exists somewhere
extern int a;
data.c
#include "data.h"
// Definition: defines the variable and its initial value
int a = 42;
main.c
#include <stdio.h>
#include "data.h"
int main(void)
{
printf("a = %d\n", a);
}

Why does global variable definition in C header file work? [duplicate]

This question already has answers here:
What happens if I define the same variable in each of two .c files without using "extern"?
(3 answers)
Closed 2 years ago.
From what I saw across many many stackoverflow questions among other places, the way to define globals is to define them in exactly one .c file, then declare it as an extern in a header file which then gets included in the required .c files.
However, today I saw in a codebase global variable definition in the header file and I got into arguing, but he insisted it will work. Now, I had no idea why, so I created a small project to test it out real quick:
a.c
#include <stdio.h>
#include "a.h"
int main()
{
p1.x = 5;
p1.x = 4;
com = 6;
change();
printf("p1 = %d, %d\ncom = %d\n", p1.x, p1.y, com);
return 0;
}
b.c
#include "a.h"
void change(void)
{
p1.x = 7;
p1.y = 9;
com = 1;
}
a.h
typedef struct coord{
int x;
int y;
} coord;
coord p1;
int com;
void change(void);
Makefile
all:
gcc -c a.c -o a.o
gcc -c b.c -o b.o
gcc a.o b.o -o run.out
clean:
rm a.o b.o run.out
Output
p1 = 7, 9
com = 1
How is this working? Is this an artifact of the way I've set up the test? Is it that newer gcc has managed to catch this condition? Or is my interpretation of the whole thing completely wrong? Please help...
This relies on so called "common symbols" which are an extension to standard C's notion of tentative definitions (https://port70.net/~nsz/c/c11/n1570.html#6.9.2p2), except most UNIX linkers make it work across translation units too (and many even with shared dynamic libaries)
AFAIK, the feature has existed since pretty much forever and it had something to do with fortran compatibility/similarity.
It works by the compiler placing giving uninitialized (tentative) globals a special "common" category (shown in the nm utility as "C", which stands for "common").
Example of data symbol categories:
#!/bin/sh -eu
(
cat <<EOF
int common_symbol; //C
int zero_init_symbol = 0; //B
int data_init_symbol = 4; //D
const int const_symbol = 4; //R
EOF
) | gcc -xc - -c -o data_symbol_types.o
nm data_symbol_types.o
Output:
0000000000000004 C common_symbol
0000000000000000 R const_symbol
0000000000000000 D data_init_symbol
0000000000000000 B zero_init_symbol
Whenever a linker sees multiple redefinitions for a particular symbol, it usually generates linkers errors.
But when those redefinitions are in the common category, the linker will merge them into one.
Also, if there are N-1 common definitions for a particular symbol and one non-tentative definition (in the R,D, or B category), then all the definitions are merged into the one nontentative definition and also no error is generated.
In other cases you get symbol redefinition errors.
Although common symbols are widely supported, they aren't technically standard C and relying on them is theoretically undefined behavior (even though in practice it often works).
clang and tinycc, as far as I've noticed, do not generate common symbols (there you should get a redefinition error). On gcc, common symbol generation can be disabled with -fno-common.
(Ian Lance Taylor's serios on linkers has more info on common symbols and it also mentions how linkers even allow merging differently sized common symbols, using the largest size for the final object: https://www.airs.com/blog/archives/42 . I believe this weird trick was once used by libc's to some effect)
That program should not compile (well it should compile, but you'll have double definition errors in your linking phase) due to how the variables are defined in your header file.
A header file informs the compiler about external environment it normally cannog guess by itself, as external variables defined in other modules.
As your question deals with this, I'll try to explain the correct way to define a global variable in one module, and how to inform the compiler about it in other modules.
Let's say you have a module A.c with some variable defined in it:
A.c
int I_am_a_global_variable; /* you can even initialize it */
well, normally to make the compiler know when compiling other modules that you have that variable defined elsewhere, you need to say something like (the trick is in the extern keyword used to say that it is not defined here):
B.c
extern int I_am_a_global_variable; /* you cannot initialize it, as it is defined elsewhere */
As this is a property of the module A.c, we can write a A.h file, stating that somewhere else in the program, there's a variable named I_am_a_global_variable of type int, in order to be able to access it.
A.h
extern int I_am_a_global_variable; /* as above, you cannot initialize the variable here */
and, instead of declaring it in B.c, we can include the file A.h in B.c to ensure that the variable is declared as the author of B.c wanted to.
So now B.c is:
B.c
#include "A.h"
void some_function() {
/* ... */
I_am_a_global_variable = /* some complicated expression */;
}
this ensures that if the author of B.c decides to change the type or the declaration of the variable, he can do changing the file A.h and all the files that #include it should be recompiled (you can do this in the Makefile for example)
A.c
#include "A.h" /* extern int I_am_a_global_variable; */
int I_am_a_global_variable = 27;
In order to prevent errors, it is good that A.c also #includes the file A.h, so the declaration
extern int I_am_a_global_variable; /* as above, you cannot initialize the variable here */
and the final definition (that is included in A.c):
int I_am_a_global_variable = 23; /* I have initialized it to a non-default value to show how to do it */
are consistent between them (consider the author changes the type of I_am_a_global_variable to double and forgets to change the declaration in A.h, the compiler will complaint about non-matching declaration and definition, when compiling A.c (which now includes A.h).
Why I say that you will have double definition errors when linking?
Well, if you compile several modules with the statement (result of #includeing the file A.h in several modules) with the statement:
#include "A.h" /* this has an extern int I_am_a_global_variable; that informs the
* compiler that the variable is defined elsewhere, but see below */
int I_am_a_global_variable; /* here is _elsewhere_ :) */
then all those modules will have a global variable I_m_a_global_variable, initialized to 0, because the compiler defined it in every module (you don't say that the variable is defined elsewhere, you are stating it to declare and define it in this compilation unit) and when you link all the modules together you'll end with several definitions of a variable with the same name at several places, and the references from other modules using this variable will don't know which one is to be used.
The compiler doesn't know anything of other compilations for an application when it is compiling module A, so you need some means to tell it what is happening around. The same as you use function prototypes to indicate it that there's a function somewhere that takes some number of arguments of types A, B, C, etc. and returns a value of type Z, you need to tell it that there's a variable defined elsewhere that has type X, so all the accesses you do to it in this module will be compiled correctly.

How to determine the address range of global variables in a shared library at runtime?

At runtime, are global variables in a loaded shared library guaranteed to occupy a contiguous memory region? If so, is it possible to find out that address range?
Context: we want to have multiple "instances" of a shared library (e.g. a protocol stack implementation) in memory for simulation purposes (e.g. to simulate a network with multiple hosts/routers). One of the approaches we are trying is to load the library only once, but emulate additional instances by creating and maintaining "shadow" sets of global variables, and switch between instances by memcpy()'ing the appropriate shadow set in/out of the memory area occupied by the global variables of the library. (Alternative approaches like using dlmopen() to load the library multiple times, or introducing indirection inside the shared lib to access global vars have their limitations and difficulties too.)
Things we tried:
Using dl_iterate_phdr() to find the data segment of the shared lib. The resulting address range was not too useful, because (1) it did not point to an area containing the actual global variables but to the segment as loaded from the ELF file (in readonly memory), and (2) it contained not only the global vars but also additional internal data structures.
Added start/end guard variables in C to the library code, and ensured (via linker script) that they are placed at the start and end of the .data section in the shared object. (We verified that with objdump -t.) The idea was that at runtime, all global variables would be located in the address range between the two guard variables. However, our observation was that the relative order of the actual variables in memory was quite different than what would follow from the addresses in the shared object. A typical output was:
$ objdump -t libx.so | grep '\.data'
0000000000601020 l d .data 0000000000000000 .data
0000000000601020 l O .data 0000000000000000 __dso_handle
0000000000601038 l O .data 0000000000000000 __TMC_END__
0000000000601030 g O .data 0000000000000004 custom_data_end_marker
0000000000601028 g O .data 0000000000000004 custom_data_begin_marker
0000000000601034 g .data 0000000000000000 _edata
000000000060102c g O .data 0000000000000004 global_var
$ ./prog
# output from dl_iterate_phdr()
name=./libx.so (7 segments)
header 0: type=1 flags=5 start=0x7fab69fb0000 end=0x7fab69fb07ac size=1964
header 1: type=1 flags=6 start=0x7fab6a1b0e08 end=0x7fab6a1b1038 size=560 <--- data segment
header 2: type=2 flags=6 start=0x7fab6a1b0e18 end=0x7fab6a1b0fd8 size=448
header 3: type=4 flags=4 start=0x7fab69fb01c8 end=0x7fab69fb01ec size=36
header 4: type=1685382480 flags=4 start=0x7fab69fb0708 end=0x7fab69fb072c size=36
header 5: type=1685382481 flags=6 start=0x7fab69bb0000 end=0x7fab69bb0000 size=0
header 6: type=1685382482 flags=4 start=0x7fab6a1b0e08 end=0x7fab6a1b1000 size=504
# addresses obtained via dlsym() are consistent with the objdump output:
dlsym('custom_data_begin_marker') = 0x7fab6a1b1028
dlsym('custom_data_end_marker') = 0x7fab6a1b1030 <-- between the begin and end markers
# actual addresses: at completely different address range, AND in completely different order!
&custom_data_begin_marker = 0x55d613f8e018
&custom_data_end_marker = 0x55d613f8e010 <-- end marker precedes begin marker!
&global_var = 0x55d613f8e01c <-- after both markers!
Which means the "guard variables" approach does not work.
Maybe we should iterate over the Global Offset Table (GOT) and collect the addresses of global variables from there? However, there doesn't seem to be an official way for doing that, if it's possible at all.
Is there something we overlooked? I'll be happy to clarify or post our test code if needed.
EDIT: To clarify, the shared library in question is a 3rd party library whose source code we prefer not to modify, hence the quest for the above general solution.
EDIT2: As further clarification, the following code outlines what I would like to be able to do:
// x.c -- source for the shared library
#include <stdio.h>
int global_var = 10;
void bar() {
global_var++;
printf("global_var=%d\n", global_var);
}
// a.c -- main program
#include <stdlib.h>
#include <dlfcn.h>
#include <memory.h>
struct memrange {
void *ptr;
size_t size;
};
extern int global_var;
void bar();
struct memrange query_globals_address_range(const char *so_file)
{
struct memrange result;
// TODO what generic solution can we use here instead of the next two specific lines?
result.ptr = &global_var;
result.size = sizeof(int);
return result;
}
struct memrange g_range;
void *allocGlobals()
{
// allocate shadow set and initialize it with actual global vars
void *globals = malloc(g_range.size);
memcpy(globals, g_range.ptr, g_range.size);
return globals;
}
void callBar(void *globals) {
memcpy(g_range.ptr, globals, g_range.size); // overwrite globals from shadow set
bar();
memcpy(globals, g_range.ptr, g_range.size); // save changes into shadow set
}
int main(int argc, char *argv[])
{
g_range = query_globals_address_range("./libx.so");
// allocate two shadow sets of global vars
void *globals1 = allocGlobals();
void *globals2 = allocGlobals();
// call bar() in the library with a few times with each
callBar(globals1);
callBar(globals2);
callBar(globals2);
callBar(globals1);
callBar(globals1);
return 0;
}
Build+run script:
#! /bin/sh
gcc -c -g -fPIC x.c -shared -o libx.so
gcc a.c -g -L. -lx -ldl -o prog
LD_LIBRARY_PATH=. ./prog
EDIT3: Added dl_iterate_phdr() output
Shared libraries are compiled as Position-Independent Code. That means that unlike executables, addresses are not fixed, but are rather decided during dynamic linkage.
From a software engineering standpoint, the best approach is to use objects (structs) to represent all your data and avoid global variables (such data structures are typically called "contexts"). All API functions then take a context argument, which allows you to have multiple contexts in the same process.
At runtime, are global variables in a loaded shared library guaranteed to occupy a contiguous memory region?
Yes: on any ELF platform (such as Linux) all writable globals are typically grouped into a single writable PT_LOAD segment, and that segment is located at a fixed address (determined at the library load time).
If so, is it possible to find out that address range?
Certainly. You can find the library load address using dl_iterate_phdr, and iterate over the program segments that it gives you. One of the program headers will have .p_type == PT_LOAD, .p_flags == PF_R|PF_W. The address range you want is [dlpi_addr + phdr->p_vaddr, dlpi_addr + phdr->p_vaddr + phdr->p_memsz).
Here:
# actual addresses: completely different order:
you are actually looking at the address of the GOT entries in the main executable, and not the addresses of the variables themselves.

Static function and variable get exported in shared library

So far I've assumed that objects with static linkage (i.e. static functions and static variables) in C do not collide with other objects (of static or external linkage) in other compilation units (i.e. .c files) so I've used "short" names for internal helper functions rather than prefixing everything with the library name. Recently a user of my library experienced a crash due to a name collision with an exported function from another shared library. On investigation it turned out that several of my static functions are part of the symbol table of the shared library. Since it happens with several GCC major versions I assume I'm missing something (such a major bug would be noticed and fixed).
I managed to get it down to the following minimum example:
#include <stdbool.h>
#include <stdlib.h>
bool ext_func_a(void *param_a, char const *param_b, void *param_c);
bool ext_func_b(void *param_a);
static bool bool_a, bool_b;
static void parse_bool_var(char *doc, char const *var_name, bool *var)
{
char *var_obj = NULL;
if (!ext_func_a(doc, var_name, &var_obj)) {
return;
}
*var = ext_func_b(var_obj);
}
static void parse_config(void)
{
char *root_obj = getenv("FOO");
parse_bool_var(root_obj, "bool_a", &bool_a);
parse_bool_var(root_obj, "bool_b", &bool_b);
}
void libexample_init(void)
{
parse_config();
}
Both the static variable bool_a and the static function parse_bool_var are visible in the symbol table of the object file and the shared library:
$ gcc -Wall -Wextra -std=c11 -O2 -fPIC -c -o example.o example.c
$ objdump -t example.o|egrep 'parse_bool|bool_a'
0000000000000000 l O .bss 0000000000000001 bool_a
0000000000000000 l F .text 0000000000000050 parse_bool_var
$ gcc -shared -Wl,-soname,libexample.so.1 -o libexample.so.1.1 x.o -fPIC
$ nm libexample.so.1.1 |egrep 'parse_bool|bool_a'
0000000000200b79 b bool_a
0000000000000770 t parse_bool_var
I've dived into C11, Ulrich Drepper's "How to Write Shared Libraries" and a couple of other sources explaining visibility of symbols, but I'm still at a loss. Why are bool_a and parse_bool_var ending up in the dynamic symbol table even though they're declared static?
The lower case letter in the second column of nm output means they're local (if they were upper case, it would be a different story).
Those symbols won't conflict with other symbols of the same name and AFAIK, are basically there only for debugging purposes.
Local symbols won't go into the dynamic symbol table (printable with nm -D but only in shared libraries) either and they are strippable along with exported symbols (upper case letters in the second column of nm output) that aren't dynamic.
(As you will have learned from Drepper's How to Write Shared Libraries, you can control visibility with -fvisibility=(default|hidden) (shouldn't use protected) and with visibility attributes.)

Why the int type takes up 8 bytes in BSS section but 4 bytes in DATA section

I am trying to learn the structure of executable files of C program. My environment is GCC and 64bit Intel processor.
Consider the following C code a.cc.
#include <cstdlib>
#include <cstdio>
int x;
int main(){
printf("%d\n", sizeof(x));
return 10;
}
The size -o a shows
text data bss dec hex filename
1134 552 8 1694 69e a
After I added another initialized global variable y.
int y=10;
The size a shows (where a is the name of the executable file from a.cc)
text data bss dec hex filename
1134 556 12 1702 6a6 a
As we know, the BSS section stores the size of uninitialized global variables and DATA stores initialized ones.
Why int takes up 8 bytes in BSS? The sizeof(x) in my code shows that the int actually takes up 4 bytes.
The int y=10 added 4 bytes to DATA which makes sense since int should take 4 bytes. But, why does it adds 4 bytes to BSS?
The difference between two size commands stays the same after deleting the two lines #include ....
Update:
I think my understanding of BSS is wrong. It may not store the uninitialized global variables. As the Wikipedia says "The size that BSS will require at runtime is recorded in the object file, but BSS (unlike the data segment) doesn't take up any actual space in the object file." For example, even the one line C code int main(){} has bss 8.
Does the 8 or 16 of BSS comes from alignment?
It doesn't, it takes up 4 bytes regardless of which segment it's in. You can use the nm tool (from the GNU binutils package) with the -S argument to get the names and sizes of all of the symbols in the object file. You're likely seeing secondary affects of the compiler including or not including certain other symbols for whatever reasons.
For example:
$ cat a1.c
int x;
$ cat a2.c
int x = 1;
$ gcc -c a1.c a2.c
$ nm -S a1.o a2.o
a1.o:
0000000000000004 0000000000000004 C x
a2.o:
0000000000000000 0000000000000004 D x
One object file has a 4-byte object named x in the uninitialized data segment (C), while the other object file has a 4-byte object named x in the initialized data segment (D).

Resources