I understand that symbol tables are created by the compiler to help with its process.
They exist per object file for when they are being linked together.
Assume:
void test(void){
//
}
int main(){
return 0;
}
Compiling the above with gcc and running nm a.out shows:
0000000100000fa0 T _main
0000000100000f90 T _test
Why are these symbols still needed? Why doesn't the linker remove them once done? Aren't they potentially a security risk, letting hackers read the source?
Edit
Is this what you mean by debugging a release binary (the ones compiled without -g)?
Assume:
int test2(){
int *p = (int*) 0x123;
return *p;
}
int test1(){
return test2();
}
int main(){
return test1();
}
This segfaults in test2. Running gdb ./a.out and then where shows:
(gdb) where
#0 0x000055555555460a in test2 ()
#1 0x000055555555461c in test1 ()
#2 0x000055555555462c in main ()
But stripping a.out and doing the same shows:
(gdb) where
#0 0x000055555555460a in ?? ()
#1 0x000055555555461c in ?? ()
#2 0x000055555555462c in ?? ()
Is this what you mean by keeping symbol tables for debugging release builds? Is this the normal way of doing it? Are there other tools used?
Why are these symbols still needed?
They are not needed for correctness of execution, but they are helpful for debugging.
Some programs can record their own stack trace (e.g. TCMalloc performs allocation sampling), and report it on a crash (or other kinds of errors).
While all such stack traces could be symbolized off-line (given a binary which did contain symbols), it is often much more convenient for the program to produce a symbolized stack trace, so you don't need to find a matching binary.
Consider a case where you have 1000s of different applications running in the cloud at multiple versions, and you get 100 reports of a crash. Are they the same crash, or are there different causes?
If all you have are bunches of hex numbers, it's hard to tell. You'd have to find a matching binary for each instance, symbolize it, and compare to all the other ones (automation could help here).
But if you have the stack traces in symbolized form, it's pretty easy to tell at a glance.
This does come with a little bit of cost: your binaries are perhaps 1% larger than they have to be.
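As a rough illustration of a program recording its own stack trace, here is a minimal sketch that assumes glibc's backtrace() and backtrace_symbols_fd() from <execinfo.h> (a glibc extension, not standard C). The helper name print_stack_trace is hypothetical.

```c
#include <execinfo.h>   /* backtrace, backtrace_symbols_fd (glibc extension) */
#include <unistd.h>     /* STDERR_FILENO */

#define MAX_FRAMES 64

/* Hypothetical helper: write the current call stack to stderr,
 * one line per frame, and return the number of frames captured. */
int print_stack_trace(void)
{
    void *frames[MAX_FRAMES];
    int count = backtrace(frames, MAX_FRAMES);

    /* Prints a function name only where symbol information is
     * available; otherwise the line shows just an address. */
    backtrace_symbols_fd(frames, count, STDERR_FILENO);
    return count;
}
```

With glibc the names come from the dynamic symbol table, so the executable usually has to be linked with -rdynamic for its own functions to show up by name; without that, the output is the "bunch of hex numbers" case described above.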
why doesn't the linker remove them once done?
You have to remember traditional UNIX roots. In the environment in which UNIX was developed everybody had access to the source for all UNIX utilities (including ld), and debuggability was way more important than keeping things secret. So I am not at all surprised that this default (keep symbols) was chosen.
Compare this to the choice made by Microsoft: keep everything in separate .DBG (later .PDB) files.
aren't they potentially a security risk for hackers to read the source?
They are helpful in reverse engineering, yes. They don't contain the source, so unless the source is already open, they don't add that much.
Still, if your program contains something like CheckLicense(), this helps hackers to concentrate their efforts on bypassing your license checks.
Which is why commercial binaries are often shipped fully-stripped.
Update:
Is this what you mean by keeping symbol tables for debugging release builds?
Yes.
is this the normal way of doing it?
It's one way of doing it.
are there other tools used?
Yes: see best practice below.
P.S. The best practice is to build your binaries with full debug info:
gcc -c -g -O2 foo.c bar.c
gcc -g -o app.dbg foo.o bar.o ...
Then keep the full debug binary app.dbg for when you need to debug crashes, but ship a fully-stripped version app to your customers:
strip app.dbg -o app
P.P.S.
gcc -g is used for gdb. gcc without -g still has symbol tables.
Sooner or later you will find out that you must perform debugging on a binary that is built without -g (such as when the binary built without -g crashes, but one built with -g does not).
When that moment comes, your job will be much easier if the binary still has a symbol table.
I have a program called parser that I compile with the -g flag. This is my makefile:
parser: header.h parser.c
gcc -g header.h parser.c -o parser
clean:
rm -f parser a.out
The code for one function in parser.c is:
int _find(char *html , struct html_tag **obj)
{
char temp[strlen("<end")+1];
memcpy(temp,"<end",strlen("<end")+1);
...
...
.
return 0;
}
What I would like when I debug parser is the ability to change lines of code after hitting a breakpoint, while stepping through the above function with n. If that is not gdb's job, is there an open-source solution for actually changing the code, and possibly saving it, so that when I step to the next statement the changed statement (possibly with a different array index) is what executes? Is there an open-source tool for this, or can it be done in gdb, and do I need particular compile options?
I know I can assign values to variables at runtime in gdb, but is that all? Is there anything that is actually capable of changing the source as well?
Most C implementations are compiled. The source code is analyzed and translated to processor instructions. This translation would be difficult to do on a piecewise basis. That is, given some small change in the source code, it would be practically impossible to update the executable file to represent those changes. As part of the translation, the compiler transforms and intertwines statements, assigns processor registers to be used for computing parts of expressions, designates places in memory to hold data, and more. When source code is changed slightly, this may result in a new compilation happening to use a different register in one place or needing more or less memory in a particular function, which results in data moving back or forth. Merging these changes into the running program would require figuring out all the differences, moving things in memory, rearranging what is in what processor register, and so on. For practical purposes, these changes are impossible.
GDB does not support this.
(Apple’s developer tools may have some feature like this. I saw it demonstrated for the Swift programming language but have not used it.)
I'm running OS X 10.12 and I'm developing a basic text-based operating system. I have developed a boot loader and that seems to be running fine. My only problem is that when I attempt to compile my kernel into pure binary, the linker won't work. I have done some research and I think that this is because of the fact OS X runs the Darwin linker and not the GNU linker. Because of this, I have downloaded and installed the GNU binutils. However, it still won't work...
Here is my kernel:
void main() {
// Create pointer to a character and point it to the first cell of video
// memory (i.e. the top-left)
char* video_memory = (char*) 0xb8000;
// At that address, put an x
*video_memory = 'x';
}
And this is when I attempt to compile it:
Hazims-MacBook-Pro:32 bit root# gcc -ffreestanding -c kernel.c -o kernel.o
Hazims-MacBook-Pro:32 bit root# ld -o kernel.bin -T text 0x1000 kernel.o --oformat binary
ld: unknown option: -T
Hazims-MacBook-Pro:32 bit root#
I would love to know how to solve this issue. Thank you for your time.
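-T is not an option of the Apple (Darwin) ld that is being run here, which is why it is rejected. It is a GNU ld option, and gcc also accepts -T and passes it through to the linker. Have a look at this: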
With these components you can now actually build the final kernel. We use the compiler as the linker as it allows it greater control over the link process. Note that if your kernel is written in C++, you should use the C++ compiler instead.
You can then link your kernel using:
i686-elf-gcc -T linker.ld -o myos.bin -ffreestanding -O2 -nostdlib boot.o kernel.o -lgcc
Note: Some tutorials suggest linking with i686-elf-ld rather than the compiler, however this prevents the compiler from performing various tasks during linking.
The file myos.bin is now your kernel (all other files are no longer needed). Note that we are linking against libgcc, which implements various runtime routines that your cross-compiler depends on. Leaving it out will give you problems in the future. If you did not build and install libgcc as part of your cross-compiler, you should go back now and build a cross-compiler with libgcc. The compiler depends on this library and will use it regardless of whether you provide it or not.
This is all taken directly from OSDev, which documents the entire process, including a bare-bones kernel, very clearly.
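Once linking works, the single-character write in the question's kernel generalizes to a small text routine. This is only a sketch, assuming standard VGA text mode (80x25 cells of 16 bits each, character code in the low byte, attribute in the high byte). The names vga_cell and vga_write are hypothetical and come from neither the question nor OSDev. It uses only freestanding headers, and taking the buffer as a parameter means it can be tried on an ordinary array before it is pointed at 0xb8000.

```c
#include <stddef.h>
#include <stdint.h>

enum { VGA_COLS = 80, VGA_ROWS = 25 };

/* Pack a character and an attribute byte into one text-mode cell. */
static uint16_t vga_cell(char c, uint8_t attr)
{
    return (uint16_t)((uint16_t)(unsigned char)c | ((uint16_t)attr << 8));
}

/* Write s starting at (row, col) and stop at the end of the screen.
 * Returns the number of cells written. In a kernel, buf would be
 * (volatile uint16_t *)0xb8000. */
size_t vga_write(volatile uint16_t *buf, size_t row, size_t col,
                 const char *s, uint8_t attr)
{
    size_t pos = row * VGA_COLS + col;
    size_t n = 0;

    while (s[n] != '\0' && pos + n < VGA_COLS * VGA_ROWS) {
        buf[pos + n] = vga_cell(s[n], attr);
        n++;
    }
    return n;
}
```

An attribute of 0x0F gives white text on a black background.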
You're correct in that you probably want binutils for this, especially if you're coding bare metal. While clang purports to be a cross compiler as is, it's far from optimal or usable here, for various reasons. I infer that you're developing on ARM; if so, you want this:
https://developer.arm.com/open-source/gnu-toolchain/gnu-rm
Aside from the fact that gcc does this markedly better than clang, there's also the issue that ld from the binutils package does not build on OS X. In some configurations it fails silently, so you may never have actually installed it despite watching libiberty etc. build; sometimes it will even go through the motions of compiling the source for that target and then just refuse to link it. To the fellow with the lousy tone blaming the OP: if you had relevant experience, i.e. had ever built this under these conditions, you would know that it is patently obnoxious. It'd be nice if you'd refrain from discouraging people from asking legitimate questions.
In the CXXfilt package they mumble about apple-darwin not being a target; try changing FAKE_TARGET from mn10003000-whatever (or whatever they used) to apple-rhapsody some time.
You're still in way better shape just building them from current if you, say, need to strip relocations from something or want to work on restoring static linkage to the system, which is missing by default from that clang installation as well. Anyhow, it's not really that ld couldn't work with Mach-O; code-wise it's all there, of that I am sure.
Regarding locating things in memory, you may want to refer to a linker script:
http://svn.screwjackllc.com/?p=noid.git;a=blob_plain;f=new_mbed_bs.link_script.ld
I have some code in there that places things directly in memory. Going with the linker script is more reproducible than doing it on the command line. It's a little complex, but what it does is set up a couple of regions of memory to be used with my memory allocators. You can use malloc, but you should prefer not to use actual malloc; dynamic memory is fine when it isn't dynamic...heh...
The script also sets flags for the stack and heap locations, although they are just markers, not loaded until go time. The stack and heap actually get placed by the startup code, which is in assembly and rather readable and well commented (hard to believe, I know). Neat trick: you have some persistence in volatile memory, so I set aside a very tiny bit to flip, and you can do things like have it control which bootloader runs on the next power cycle. Again, you are 100% correct regarding the linker; it seems you are headed in the right direction.
Incidentally, there are a ton of other ways to modify objects prior to loading them and to preload things in memory; check out objcopy and objdump. You can use gdb to dump srecs of structures in memory, note the address, and then, after assembly but before linking, use dd to insert the records you extracted with gdb back into the extracted sections. That is one of my favorite ways, just because it is the smartass route :D It is also one way to optimize things if you are ever tight on memory and need to precalculate constants. That way is actually closer to what ld is doing, just done by hand. The path of least resistance right now, though, is probably the linker script.
I'm running out of good ideas on how to crack this bug. I have 1000 lines of code that crashes every 2 or 3 runs. It is currently a prototype command line application written in C. An issue is that it's proprietary and I cannot give you the source, but I'd be happy to send a debug compiled executable to any brave soul on a Debian Squeeze x86_64 machine.
Here is what I got so far:
When I run it in GDB, it always completes successfully.
When I run it in Valgrind, it always completes successfully.
The issue seems to emanate from a recursive function call that is very basic. In an effort to pin point the error in this recursive function I wrote the same function in a separate application. It always completes successfully.
I built my own gcc 4.7.1 compiler, compiled my code with it and I'm still getting the same behavior.
FTPed my application to another machine to eliminate the risk of HW issues and I still get the same behavior.
FTPed my source code to another machine to eliminate the risk of a corrupt build environment and I still get the same behavior.
The application is single threaded and does no signal handling that might cause race conditions. I memset(,0,) all large objects
There are no exotic dependencies, the ldd follows below.
ldd gives me this:
ldd tst
linux-vdso.so.1 => (0x00007fff08bf0000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00007fe8c65cd000)
libm.so.6 => /lib/libm.so.6 (0x00007fe8c634b000)
libc.so.6 => /lib/libc.so.6 (0x00007fe8c5fe8000)
/lib64/ld-linux-x86-64.so.2 (0x00007fe8c67fc000)
Are there any tools out there that could help me?
What would be your next step if you were in my position?
Thanks!
This is what got me in the right direction: -Wextra. I was already using -Wall.
THANKS!!! This was really driving me crazy.
I suggested in comments:
to compile with -Wall -Wextra and improve the source code till no warnings are given;
to compile with both -g and -O; this is helpful to inspect dumped core files with gdb (you may want to set a big enough core dump size limit with e.g. the ulimit bash builtin);
to show your code to a colleague and explain the issue;
to use ltrace or strace.
Apparently -Wextra was helpful. It would be nice to understand why and how.
BTW, for larger programs, you could even add your own warnings to GCC by extending it with MELT; this may take days and is worthwhile mostly in big projects.
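The thread does not say which -Wextra warning exposed the bug, so the following is only a hypothetical illustration of the kind of mistake that -Wall lets through in C. GCC enables -Wsign-compare for C under -Wextra rather than -Wall, and mixed signed/unsigned comparisons can silently give the wrong answer. Both function names are invented for the example.

```c
#include <stddef.h>

/* Buggy on purpose: limit is converted to size_t for the comparison,
 * so a negative limit turns into a huge positive value.
 * gcc -Wall is silent here; gcc -Wextra reports -Wsign-compare. */
int below_limit_naive(size_t value, int limit)
{
    return value < limit;
}

/* Fixed: reject non-positive limits before comparing as unsigned. */
int below_limit_checked(size_t value, int limit)
{
    return limit > 0 && value < (size_t)limit;
}
```

For value 5 and limit -1 the naive version reports "below the limit" while the checked version does not.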
In this case, I think that you have some memory problems (look at the output of Valgrind carefully), because GDB and Valgrind change the original program by adding some memory tracking functions, so your original addresses are changed. You can compile with the -ggdb option, enable core dumps (ulimit -c unlimited), and then try to analyze what's going on. This link may help you:
http://en.wikipedia.org/wiki/Unusual_software_bug
Regards.
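The ulimit -c step mentioned in the answers can also be done from inside the program, which helps when the shell that launches it is not under your control. This is a minimal sketch using the POSIX getrlimit/setrlimit calls; the function name enable_core_dumps is hypothetical.

```c
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit.
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;

    lim.rlim_cur = lim.rlim_max;   /* the soft limit may go up to the hard limit */
    return setrlimit(RLIMIT_CORE, &lim);
}
```

If the hard limit itself is 0, this succeeds but still produces no core file; in that case the limit has to be raised by whoever starts the process.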
The title is clear: we can load a library with dlopen etc.
But how can I get the signature of functions in it?
This question cannot be answered in general. Technically, if you compiled your executable with exhaustive debugging information (the code may still be an optimized release version), then the executable will contain extra sections that provide some kind of reflection over the binary. On *nix systems (you referred to dlopen) this is implemented through DWARF debugging data in extra sections of the ELF binary. It works similarly for Mach-O universal binaries on Mac OS X.
Windows PEs, however, use a completely different format, so unfortunately DWARF is not truly cross-platform (actually, in the early development stages of my 3D engine I implemented an ELF/DWARF loader for Windows so that I could use a common format for the engine's various modules, so with some serious effort it can be done).
If you don't want to go into implementing your own loaders, or debugging information accessors, then you may embed the reflection information through some extra symbols exported (by some standard naming scheme) which refer to a table of function names, mapping to their signature. In the case of C source files writing a parser to extract the information from the source file itself is rather trivial. C++ OTOH is so notoriously difficult to parse correctly, that you need some fully fledged compiler to get it right. For this purpose GCCXML was developed, technically a GCC that emits the AST in XML form instead of an object binary. The emitted XML then is much easier to parse.
From the extracted information, create a source file with some kind of linked list/array/etc. structure describing each function. If you don't directly export each function's symbol but instead initialize some field in the reflection structure with the function pointer, you get a really nice and clean annotated exporting scheme. Technically you could place this information in a separate section of the binary, but putting it in the read-only data section does the job as well.
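One possible shape for such a reflection structure is sketched below. All names (fn_info, reflection_table, find_function, and the two sample functions) are hypothetical, and a real table would be generated from the parsed sources rather than written by hand.

```c
#include <stddef.h>
#include <string.h>

/* Generic function pointer type; cast back to the real type before calling. */
typedef void (*generic_fn)(void);

struct fn_info {
    const char *name;        /* lookup key */
    const char *signature;   /* human-readable prototype */
    generic_fn  fn;          /* address of the implementation */
};

/* The functions themselves are static: only the table is exported. */
static int add(int a, int b) { return a + b; }
static double half(double x) { return x / 2.0; }

const struct fn_info reflection_table[] = {
    { "add",  "int add(int, int)",   (generic_fn)add  },
    { "half", "double half(double)", (generic_fn)half },
};
const size_t reflection_count =
    sizeof reflection_table / sizeof reflection_table[0];

/* Linear search by name; returns NULL if the function is not listed. */
const struct fn_info *find_function(const char *name)
{
    for (size_t i = 0; i < reflection_count; i++)
        if (strcmp(reflection_table[i].name, name) == 0)
            return &reflection_table[i];
    return NULL;
}
```

A client that loads the library then needs only one well-known symbol (here reflection_table, plus its count) to discover every function and its recorded signature.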
However, if you're given a 3rd party binary (say, worst case scenario, it has been compiled from C source, with no debugging information and all symbols not externally referenced stripped), you're pretty much screwed. The best you could do is apply some binary analysis of the way the function accesses the various places in which parameters can be passed.
This will only tell you the number of parameters and the size of each parameter value, but not the type or name/meaning. When reverse engineering some program (e.g. malware analysis or security audit), identifying the type and meaning of the parameters passed to functions is one of the major efforts. Recently I came across some driver I had to reverse for debugging purposes, and you cannot believe how astounded I was by the fact that I found C++ symbols in a Linux kernel module (you can't use C++ in the Linux kernel in a sane way), but also relieved, because the C++ name mangling provided me with plenty of information.
On Linux (or Mac) you can use a combination of "nm" and "c++filt" (for C++ libraries)
nm mylibrary.so | c++filt
or
nm mylibrary.a | c++filt
"nm" will give you the mangled form and "c++filt" attempts to put them in a more human-readable format. You might want to use some options in nm to filter down the results, especially if the library is large (or you can "grep" the final output to find a particular item)
No, this is not possible. The signature of a function doesn't mean anything at runtime; it's a piece of information useful at compile time for the compiler to validate your program.
You can't. Either the library publishes a public API in a header, or you need to know the signature by some other means.
The parameters of a function at the lower level depend on how many arguments in the stack frame you consider and how you interpret them. Therefore, once the function is compiled into object code it is not possible to get the signature like that. One remote possibility is to disassemble the code and read how the function works to find out the number of parameters, but the types would still be difficult or impossible to determine. In a word, it is not possible.
This information is not available. Not even the debugger knows:
$ cat foo.c
#include <stdio.h>
#include <string.h>
int main(int argc, char* argv[])
{
char foo[10] = { 0 };
char bar[10] = { 0 };
printf("%s\n", "foo");
memcpy(bar, foo, sizeof(foo));
return 0;
}
$ gcc -g -o foo foo.c
$ gdb foo
Reading symbols from foo...done.
(gdb) b main
Breakpoint 1 at 0x4005f3: file foo.c, line 5.
(gdb) r
Starting program: foo
Breakpoint 1, main (argc=1, argv=0x7fffffffe3e8) at foo.c:5
5 {
(gdb) ptype printf
type = int ()
(gdb) ptype memcpy
type = int ()
(gdb)
G'day,
This has been asked before for VC++ but I am interested in the answer for Solaris.
I'm compiling and linking the following trivial C code:
#include <stdio.h>
int main() {
printf("Hello world!\n");
return 0;
}
using the command:
cc -o hello1 hello.c
and doing this a couple of times to get executables hello2 and hello3. This is being done on the same machine with the same compiler and in the same directory just at different times.
The sizes of the executables are the same but diff reports the binaries as differing and cmp -l goes crazy with a long list of differing locations.
Anyone know what cc is embedding in the executables to make them differ? Timestamps?
Edit: Stripping the executables as Chris suggested below makes diff report the two executables as identical.
cheers,
If you use "od -c" on the two binaries, and then use a side-by-side diff program, you can get an idea what the differences are. In the past when I have investigated the Sun compilers, it's usually a date string. You can also try stripping the executable to see if that removes the ELF section that has the difference in it.
If you take the exact same source code and compile it twice with Sun's compiler you will not get two exact-binary-duplicate files. There will be minor differences. As far as I know, it mostly just comes down to date/time issues.
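The thread does not establish how Sun's cc embeds its date string. As a hedged illustration of the general effect, the standard __DATE__ and __TIME__ macros are one common way for a build time to end up inside a binary, so that two builds of identical source differ byte for byte. The function name build_stamp is hypothetical.

```c
/* __DATE__ expands to "Mmm dd yyyy" and __TIME__ to "hh:mm:ss" at the
 * moment of compilation; the resulting string literal is stored in the
 * object file. */
const char *build_stamp(void)
{
    return __DATE__ " " __TIME__;
}
```

Compiling this twice a few seconds apart gives object files that cmp reports as different even though the source is unchanged.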