Remove dead code when linking static library into dynamic library - c

Suppose I have the following files:
libmy_static_lib.c:
#include <stdio.h>
void func1(void){
printf("func1() called from a static library\n");
}
void unused_func1(void){
printf("printing from the unused function1\n");
}
void unused_func2(void){
printf("printing from unused function2\n");
}
libmy_static_lib.h:
void func(void);
void unused_func1(void);
void unused_func2(void);
my_prog.c:
#include "libmy_static_lib.h"
#include <stdio.h>
void func_in_my_prog()
{
printf("in my prog\n");
func1();
}
And here is how I link the library:
# build the static library libmy_static_lib.a
gcc -fPIC -c -fdata-sections --function-sections -c libmy_static_lib.c -o libmy_static_lib.o
ar rcs libmy_static_lib.a libmy_static_lib.o
# build libmy_static_lib.a into a new shared library
gcc -fPIC -c ./my_prog.c -o ./my_prog.o
gcc -Wl,--gc-sections -shared -m64 -o libmy_shared_lib.so ./my_prog.o -L. -l:libmy_static_lib.a
There are 2 functions in libmy_static_lib.c that are not used, and from this post, I think
gcc fdata-sections --function-sections
should create a symbol for each function, and
gcc -Wl,--gc-sections
should remove the unused symbols when linking
however when I run
nm libmy_shared_lib.so
It is showing that these 2 unused functions are also being linked into the shared library.
Any suggestions on how to have gcc remove the unused functions automatically?
Edit:
I am able to use the above options for gcc to remove the unused functions if I am linking a static library directly to executable. But it doesn't remove the unused functions if I link the static library to a shared library.

You can use a version script to mark the entry points in combination with -ffunction-sections and --gc-sections.
For example, consider this C file (example.c):
int
foo (void)
{
return 17;
}
int
bar (void)
{
return 251;
}
And this version script, called version.script:
{
global: foo;
local: *;
};
Compile and link the sources like this:
gcc -Wl,--gc-sections -shared -ffunction-sections -Wl,--version-script=version.script example.c
If you look at the output of objdump -d --reloc a.out, you will notice that only foo is included in the shared object, but not bar.
When removing functions in this way, the linker will take indirect dependencies into account. For example, if you turn foo into this:
void *
foo (void)
{
extern int bar (void);
return bar;
}
the linker will put both foo and bar into the shared object because both are needed, even though only bar is exported.
(Obviously, this will not work on all platforms, but ELF supports this.)

You're creating a library, and your symbols aren't static, so it's normal that the linker doesn't remove any global symbols.
This -gc-sections option is designed for executables. The linker starts from the entrypoint (main) and discovers the function calls. It marks the sections that are used, and discards the others.
A library doesn't have 1 entrypoint, it has as many entrypoints as global symbols, which explains that it cannot clean your symbols. What if someone uses your .h file in his program and calls the "unused" functions?
To find out which functions aren't "used", I'd suggest that you convert void func_in_my_prog() to int main() (or copy the source into a modified one containing a main()), then create an executable with the sources, and add -Wl,-Map=mapfile.txt option when linking to create a mapfile.
gcc -Wl,--gc-sections -Wl,--Map=mapfile.txt -fdata-sections -ffunction-sections libmy_static_lib.c my_prog.c
This mapfile contains the discarded symbols:
Discarded input sections
.drectve 0x00000000 0x54 c:/gnatpro/17.1/bin/../lib/gcc/i686-pc-mingw32/6.2.1/crt2.o
.drectve 0x00000000 0x1c c:/gnatpro/17.1/bin/../lib/gcc/i686-pc-
...
.text$unused_func1
0x00000000 0x14 C:\Users\xx\AppData\Local\Temp\ccOOESqJ.o
.text$unused_func2
0x00000000 0x14 C:\Users\xx\AppData\Local\Temp\ccOOESqJ.o
.rdata$zzz 0x00000000 0x38 C:\Users\xx\AppData\Local\Temp\ccOOESqJ.o
...
now we see that the unused functions have been removed. They don't appear in the final executable anymore.
There are existing tools that do that (using this technique but not requiring a main), for instance Callcatcher. One can also easily create a tool to disassemble the library and check for symbols defined but not called (I've written such tools in python several times and it's so much easier to parse assembly than from high-level code)
To cleanup, you can delete the unused functions manually from your sources (one must be careful with object-oriented languages and dispatching calls when using existing/custom assembly analysis tools. On the other hand, the compiler isn't going to remove a section that could be used, so that is safe)
You can also remove the relevant sections in the library file, avoiding to change source code, for instance by removing sections:
$ objcopy --remove-section .text$unused_func1 --remove-section text$unused_func2 libmy_static_lib.a stripped.a
$ nm stripped.a
libmy_static_lib.o:
00000000 b .bss
00000000 d .data
00000000 r .rdata
00000000 r .rdata$zzz
00000000 t .text
00000000 t .text$func1
00000000 T _func1
U _puts

Related

Why there are no .rel.dyn/.got.plt section in dynamic ELF files?

I have code like this
// test_printf.c
#include <stdio.h>
int f(){
printf("aaa %d\n", 1);
}
And I compile it with the following code
gcc -shared -fPIC -m32 -g -c -o test_printf.so test_printf.c
I think if run readelf -S test_printf.so, I will see .rel.dyn and .rel.plt. This is because both of these sections act like what .rel.data and .rel.text in static linked programs do.
For example, in my program, since printf is an external symbol, and is referred by my test_printf.so. So when I look into test_printf.so's relocation table, there should be one entry names printf. I check that, and the entry exists.
Then I think, since printf is an external symbol, its location should be determined at runtime. However, we must allocate a .got.plt section for this printf function, and the section should be in both a dynamicly linked executive and a dynamic library(test_printf.so).
However, when I run readelf -S, there is no .got.plt section, and I am confused about this. Is this section not necessary in a dynamic library(test_printf.so)? I don't think it is possible. Suppose test_printf.so is finally linked with executive program a, then how can a know where is the .got.plt section for printf? Does this .got.plt finally generated in a?
Meanwhile, I have a question 2. Are .rel.dyn and .rel.plt always present, if there are .got and .got.plt?
.rel.dyn and .got.plt can be present in a shared library .elf it depends on the structure of the functions in the library, this is in https://www.technovelty.org/linux/plt-and-got-the-key-to-code-sharing-and-dynamic-libraries.html
.rela.dyn is present if on extern function is included :
$ cat test.c
extern int foo;
int function(void) {
return foo;
}
$ gcc -shared -fPIC -o libtest.so test.c
Also see https://eli.thegreenplace.net/2011/08/25/load-time-relocation-of-shared-libraries/ and
https://eli.thegreenplace.net/2011/11/03/position-independent-code-pic-in-shared-libraries
https://blog.ramdoot.in/how-can-i-link-a-static-library-to-a-dynamic-library-e1f25c8095ef (linking static libraries into a dynamic library)

does the dynamic linker modify the reference after executable is copied to memory?

let's say we have the following code:
main.c
extern void some_function();
int main()
{
some_function();
return 0;
}
mylib.c
void some_function()
{
...
}
and I have create a share library mylib.so and link i to executable object file prog
linux> gcc -shared -fpic -o mylib.so mylib.c
linux> gcc -o prog main.c ./mylib.so
and let's say the picture below is the executable object format of prog
By dynamic linking, we know that none of the code or data sections from mylib.so are actually copied into the executable prog2l at this point. Instead, the linker copies some relocation and symbol table information that will allow references to code and data in mylib.so to be resolved at load time.
and I just want double check if my understanding is correct:
when prog is loaded into the memory by the loader as the picture below show
then the dynamic linker will modify the .data section of prog in memory so that it can be linked/relocated to the instruction address of some_function in the .text section of mylib.so.
Is my understanding correct?
Fairly close. The dynamic linker will modify something in the data segment, not specifically the .data section - segments are a coarser-grained thing corresponding to how the file is mapped into memory rather than the original semantic breakdown. The actual section is usually called .got or .got.plt but may vary by platform. The modification is not "relocating it to the instruction address" but resolving a relocation reference to the function name to get the address it was loaded at, and filling that address in.

Function with same names in different files - C

I have two functions with same name and want to use it in my application.
I referred various answers like here and here but couldn't get clear solution.
I have following functions
// xxxx_input.h
int8_t input_system_init(InputParams params);
int8_t input_system_easy_load(uint32_t interval_ms);
// yyyy_input.h
int8_t input_system_init(InputParams params);
int8_t input_system_easy_load(uint32_t interval_ms);
Reason there are two files is xxxx_input and yyyy_input work way different internally.
Modifying the function isn't easy since the code is provided by external party and we have to keep the xxxx_input files.
What we can do is modify yyyy_input.h but functions like input_system_easy_load are to be kept consistent as they are called from different places.
Is there a way we can achieve the same?
I tried replacing xxxx_input with yyyy_input.h but since the include directory already contains the same function it gives error.
input_system_init multiply defined (by xxxx_input.o and yyyy_input.o).
If you have the source code to the functions defined in xxxx_input.h and yyyy_input.h, you could compile both modules with command line options redefining the function names via the preprocessor:
gcc -Dinput_system_init=xxxx_input_system_init -Dinput_system_easy_load=xxxx_input_system_easy_load xxxx_input.c
gcc -Dinput_system_init=yyyy_input_system_init -Dinput_system_easy_load=yyyy_input_system_easy_load yyyy_input.c
You would then compile your code with modified prototypes and you could link all 3 modules together.
If the modules are provided in object form only, you could define wrapper functions xxxx_input_system_init and xxxx_input_system_easy_load that you would link with the xxxx_input.o to produce a dynamic library, and the same for the yyyy alternatives. You would use modified prototypes in your module and would link it with the dynamic libraries.
Mike Kinghan showed a simpler approach for object files and libraries on systems where objcopy is available.
To get modified prototypes automatically, you can use this include file:
my_input_system.h:
#define input_system_init xxxx_input_system_init
#define input_system_easy_load xxxx_input_system_easy_load
#include "xxxx_input.h"
#undef input_system_init
#undef input_system_easy_load
#define input_system_init yyyy_input_system_init
#define input_system_easy_load yyyy_input_system_easy_load
#include "yyyy_input.h"
#undef input_system_init
#undef input_system_easy_load
/* prevent direct use of the redefined functions */
#define input_system_init do_not_use_input_system_init#
#define input_system_easy_load do_not_use_input_system_easy_load#
I'll walk through a solution you can use if your supplier has given you object files compiled
with GCC or Clang for GNU/Linux computers.
My supplier has given me a header foo_a.h file that declares function foo
$cat foo_a.h
#pragma once
extern void foo(void);
and the matching object file foo.o:
$ nm foo_a.o
0000000000000000 T foo
U _GLOBAL_OFFSET_TABLE_
U puts
that defines foo.
Likewise they've given me a header foo_b.h that also declares foo
$ cat foo_b.h
#pragma once
extern void foo(void);
and the matching object file foo_b.o
$ nm foo_b.o
0000000000000000 T foo
U _GLOBAL_OFFSET_TABLE_
U puts
that also defines foo.
The functions foo_a.o:foo and foo_b.o:foo do different
things (or different variations of the same thing). I want to do both these things in the same program,
prog.c:
$ cat prog.c
extern void foo_a(void);
extern void foo_b(void);
int main(void)
{
foo_a(); // Calls `foo_a.o:foo`
foo_b(); // Calls `foo_b.o:foo`
return 0;
}
I can make such a program as follows:
$ objcopy --redefine-sym foo=foo_a foo_a.o prog_foo_a.o
$ objcopy --redefine-sym foo=foo_b foo_b.o prog_foo_b.o
Now I have made a copy prog_foo_a.o of foo_a.o in which the symbol foo is
renamed foo_a, and a copy prog_foo_b.o of foo_b.o in which
the symbol foo is renamed foo_b.
Then I compile and link like this:
$ gcc -c -Wall -Wextra prog.c
$ gcc -o prog prog.o prog_foo_a.o prog_foo_b.o
And prog runs like:
$ ./prog
foo_a
foo_b
Perhaps my supplier has given me foo_a.o within a static library liba.a
that also contains other object files that refer to foo_a.o:foo? And similarly
with foo_b.o.
That's OK. Instead of:
$ objcopy --redefine-sym foo=foo_a foo_a.o prog_foo_a.o
$ objcopy --redefine-sym foo=foo_b foo_b.o prog_foo_b.o
I will run:
$ objcopy --redefine-sym foo=foo_a liba.a libprog_a.a
$ objcopy --redefine-sym foo=foo_b libb.a libprog_b.a
and this will give me a new static library libprog_a.a
in which foo is renamed foo_a in all the object files in the
library. Similarly foo is renamed foo_b throughout libprog_b.a.
Then I'll link prog:
$ gcc -o prog prog.o -L. -lprog_a -lprog_b
Consider a potential drawback with this solution. Possibly my supplier has
given me foo_a.o and foo_b.o with debugging information in them, and I want to
used it for debugging my prog with gdb?
I have changed the original symbol names, foo_a.o:foo to foo_a and
foo_b.o:foo to foo_b, but I haven't changed the debugging info associated
with those symbols. Debugging with gdb will still work, but some of the
debugging output will be incorrect and possibly confusing. E.g. if I put
a breakpoint on foo_a, gdb will run to it and stop, but it will say
it has stopped at foo from file foo_a.c. And if I then breakpoint at foo_b, gdb will run to
it and again say it is at foo, but from file foo_b.c. If the person doing
the debugging doesn't know how the program was built, this would certainly be
confusing.
But giving you debugging info with binaries is not far from giving you the
source code, so as you haven't got source code you likely don't have
debugging info and are not concerned about it.

C symbol visibility in static archives

I have files foo.c bar.c and baz.c, plus wrapper code myfn.c defining a function myfn() that uses code and data from those other files.
I would like to create something like an object file or archive, myfn.o or libmyfn.a, so that myfn() can be made available to other projects without also exporting a load of symbols from {foo,bar,baz}.o as well.
What's the right way to do that in Linux/gcc? Thanks.
Update: I've found one way of doing it. I should've emphasised originally that this was about static archives, not DSOs. Anyway, the recipe:
#define PUBLIC __attribute__ ((visibility("default"))) then mark myfn() as PUBLIC in myfn.c. Don't mark anything else PUBLIC.
Compile objects with gcc -c foo.c bar.c baz.c myfn.c -fvisibility=hidden, which marks everything as hidden except for myfn().
Create a convenience archive using ld's partial-linking switch: ld -r foo.o bar.o baz.o myfn.o -o libmyfn.a
Localise everything that wasn't PUBLIC like so: objcopy --localize-hidden libmyfn.a
Now nm says myfn is the only global symbol in libmyfn.a and subsequent linking into other programs works just fine: gcc -o main main.c -L. -lmyfn (here, the program calls myfn(); if it tried to call foo() then compilation would fail).
If I use ar instead of ld -r in step 3 then compilation fails in step 5: I guess ar hasn't linked foo etc to myfn, and no longer can once those functions are localised, whereas ld -r resolves the link before it gets localised-away.
I'd welcome any response that confirms this is the "right" way, or describes a slicker way of achieving the same.
Unfortunately, C linkage for globals is all-or-nothing, in the sense that the globals of all modules would be available in libmyfn.a's final list of external symbols.
gcc tool chain offers an extension that lets you hide symbols from outside users, while making them available to other translation units in your library:
foo.h:
void foo();
foo.c:
void foo() __attribute__ ((visibility ("hidden")));
myfn.h:
void myfn();
myfn.c:
#include <stdio.h>
#include "foo.h"
void myfn() {
printf("calling foo...\n");
foo();
printf("calling foo again...\n");
foo();
}
For portability, you would probably benefit from making a macro for __attribute__ ((visibility ("hidden"))), and placing it in a conditional compilation block conditioned on gcc.
In addition, Linux offers a utility called strip, which lets you remove some of the symbols from compiled object files. Options -N and -K let you identify individual symbols that you want to keep or remove.
Start with this to build a static library
gcc -c -O2 foo.c bar.c baz.c myfn.c
ar av libmyfunctions.a foo.o bar.o baz.o myfn.o
Compile and link with other programs like:
gcc -O2 program.c -lmyfunctions -o myprogram
Now your libmyfunctions.a will ultimately have extra stuff from the source that isn't required by the code in myfn.c But the linker should do a reasonable job of removing this when it creates the final program.
Suppose myfn.c has function myfun() which you want to use in other three files foo.c, bar.c & baz.c
Now create a shared library from code in myfn.c viz libmyf.a
Use this function call myfun() in other three files. Declare function as extern in these files. Now you can create object code of these thee files and link the libmyf.a at linking phase.
Refer to following link for using shared libraries.
http://www.cprogramming.com/tutorial/shared-libraries-linux-gcc.html

when dlopen one so, it's symbol is not covered by main symbol, why?

libp2.c
#include <stdio.h>
void pixman()
{
printf("pixman in libp1\n");
}
libc2.c
#include <stdio.h>
void pixman();
void cairo()
{
printf("cairo2\n");
pixman();
}
main.c
#include <stdio.h>
#include <dlfcn.h>
void pixman()
{
printf("pixman in main\n");
}
int main()
{
pixman();
void* handle=NULL;
void (*callfun)();
handle=dlopen("/home/zpeng/test/so_test/libc2.so",RTLD_LAZY);
callfun = (void(*)())dlsym(handle, "cairo");
callfun();
...
}
compile
gcc -c libp2.c -fPIC -olibp2.o
rm libp2.a
ar -rs libp2.a libp2.o
gcc -shared -fPIC libc2.c ./libp2.a -o libc2.so
gcc main.c -ldl -L. -g
the result:
pixman in main
cairo2
pixman: libp2
why the last is not "pixman in main"?
I see the symbols processing(LD_DEBUG=symbols), it begins with :
21180: symbol=pixman; lookup in file=./a.out
21180: symbol=pixman; lookup in file=/lib64/libdl.so.2
21180: symbol=pixman; lookup in file=/lib64/tls/libc.so.6
21180: symbol=pixman; lookup in file=/lib64/ld-linux-x86-64.so.2
21180: symbol=pixman; lookup in file=/home/zpeng/test/so_test/libc2.so
if I add -lc2 or -rdynamic to gcc main cmd , it will generate:
pixman in main
cairo2
pixman in main
My questions:
why lookup symbol in a.out but not get the result and continue to search libc2.so when without -rdynamic and -lc2 ?
Why the last is not "pixman in main"?
That's because shared libraries have their own global offset table or GOT. When you use the cairo function in libc2.so, the pixman function that will be called is the same function that was resolved when compiling the .so file in the first place.
That is:
# creates object file only -- contains first pixman implementation
gcc -c libp2.c -fPIC -olibp2.o
# just turns the object file into an archive
ar -rs libp2.a libp2.o
# creates the .so file -- all symbols in libc2.c are resolved here
# and you passed in the .a file for that purpose. The .a file containing the
# first pixman implementation gets put in libc2.so.
gcc -shared -fPIC libc2.c ./libp2.a -o libc2.so
After this, anyone using libc2.so will get the copy stored in libc2.so. The lookup order you post is for a.out I believe and it's right. It looks for pixman in a.out, then libc2.so, and so on.
Why lookup symbol in a.out but not get the result and continue to search libc2.so when without -rdynamic and -lc2?
The rdynamic option loads ALL symbols to the dynamic symbol table -- not just the ones it thinks are used (lc2 has the same effect). When you load all those symbols you have a conflict -- the pixman function. The main.c implementation is used in this case. As others have pointed out, this will probably generate a warning.
You need to compile the sources that get archived into the .a file with -fvisibility=hidden, to indicate that, although they are global functions, they are not meant to be used outside the resulting library but are instead meant to resolve symbols inside the library. That will cause the symbols in the .a file to appear with the qualifier " t " in nm -a instead of " T ", which is used for symbols available to other libraries.
It just auto binded to LOCAL symbol,
Since there not __attribute__((visibility("default"))) explicit in libp2.c, the compiler auto bind this function calling to LOCAL .symtab, instead of .dynsym
appendix1: more about ELF header: readelf -s xxx.lib
appendix2: keyword of ld argument -Bsymbolic-functions

Resources