How to extract C source code from .so file?

I am working on previously developed software whose source code was compiled into Linux shared libraries (.so), and the source code itself is no longer available. Is there any tool that can extract source code from Linux shared libraries?
Thanks,
Ravi

There isn't. Once you compile your code, the source is not kept in the binary, only machine code (plus symbol names and debug information if the binary was built with them and not stripped).
Some may mention decompilers but those don't extract the source, they analyze the executable and produce some source that should have the same effect as the original one did.
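To illustrate "same effect, different source", here is a small sketch. Both functions below are invented for illustration: the first is what an author might have written, the second is the kind of output a decompiler might plausibly produce for a stripped binary (the names sub_401000, a1, v2 and so on are hypothetical, not the output of any particular tool).

```c
#include <stddef.h>

/* What the author might have written (hypothetical original). */
int sum_prices(const int *prices, size_t count)
{
    int total = 0;
    for (size_t i = 0; i < count; i++)
        total += prices[i];
    return total;
}

/* What a decompiler might plausibly emit for equivalent machine code.
   Names, types and loop shape are guesses (hypothetical output). */
int sub_401000(int *a1, unsigned long a2)
{
    int v2 = 0;
    int *v3 = a1;
    int *v4 = a1 + a2;
    while (v3 != v4)
        v2 += *v3++;
    return v2;
}
```

Both functions return the same result for the same input, but the names, comments and original structure are gone and have to be reconstructed by hand.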

You can try disassembling the object code to get the machine code mnemonics.
objdump -D --disassembler-options intel sjt.o gives Intel syntax assembly.
objdump -D --disassembler-options att sjt.o (or simply objdump -D sjt.o) gives AT&T syntax assembly.
But the original source code cannot be recovered this way. You can try to reverse the process by studying the disassembly and reconstructing the logic section by section, but it is painful work.

Disclaimer: I work for Hex-Rays SA.
The Hex-Rays decompiler is the only commercially available decompiler I know of that works well with real-life x86 and ARM code. It's true that you don't get the original source, but you get something which is equivalent to it. If you didn't strip your binary, you might even get the function names, or, with some luck, even types and local variables. However, even if you don't have symbol info, you don't have to stick to the first round of decompilation. The Hex-Rays decompiler is interactive - you can rename any variable or function, change variable types, create structure types to represent the structures in the original code, add comments and so on. With a little work you can recover a lot. And quite often what you need is not the whole original file, but some critical algorithm or function - and this Hex-Rays can usually provide to you.
Have a look at the demo videos and the comparison pages. Still think "staring at the assembly" is the same thing?

No. In general, this is impossible. Source is not packaged in compiled objects or libraries.

You cannot. But you can open it as an archive in 7-Zip. You can see the file type and size of each file separately in that. You can replace the files in it with your custom files.

Related

How can I get library function bodies in C?

As the title says, I want to know how library functions (like printf) are implemented in C. I am using the Borland C++ compiler.
They are defined in lib files (***.lib); the header files only have prototypes.
Lib files cannot be read in text editors.
So, please let me know how they can be read.

C is a compiled language, so the C source code gets translated to binary machine-language code.
Because of that, you can't see the actual source code of any given library you have.
If you want to know how it works, you can see if it's an open source library, find the source code of the particular revision that generated the version you're using, and read it.
If it's not open source, you could try decompiling - use a tool that tries to guess what the original source code could have been like for generating the machine code your library has. As you can guess, this is not an accurate process - compiling isn't an isomorphic process - and, as you probably wouldn't have guessed, it could be illegal - but I'm not really sure what conditions it depends on, if any.
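As for how a function like printf is built in the first place: it is ordinary C that walks the format string and pulls arguments off with <stdarg.h>. The sketch below is not Borland's implementation; mini_format is a hypothetical name, it writes into a buffer instead of the screen, and it only handles %d, %s, %c and %%.

```c
#include <stdarg.h>
#include <stddef.h>

/* Append one character if there is room, always leaving space for '\0'. */
static void put(char *out, size_t cap, size_t *len, char c)
{
    if (*len + 1 < cap)
        out[(*len)++] = c;
}

/* Hypothetical mini_format: supports %d, %s, %c and %% only.
   Returns the number of characters stored (excluding the '\0'). */
size_t mini_format(char *out, size_t cap, const char *fmt, ...)
{
    size_t len = 0;
    va_list ap;
    va_start(ap, fmt);
    for (; *fmt; fmt++) {
        if (*fmt != '%') { put(out, cap, &len, *fmt); continue; }
        fmt++;
        if (*fmt == '\0') break;            /* lone trailing '%' */
        switch (*fmt) {
        case 'd': {
            int v = va_arg(ap, int);
            /* Negate in unsigned arithmetic so INT_MIN is handled too. */
            unsigned int u = v < 0 ? 0u - (unsigned int)v : (unsigned int)v;
            char digits[16];
            int n = 0;
            if (v < 0) put(out, cap, &len, '-');
            do { digits[n++] = (char)('0' + u % 10); u /= 10; } while (u);
            while (n) put(out, cap, &len, digits[--n]);
            break;
        }
        case 's': {
            const char *s = va_arg(ap, const char *);
            while (*s) put(out, cap, &len, *s++);
            break;
        }
        case 'c': put(out, cap, &len, (char)va_arg(ap, int)); break;
        case '%': put(out, cap, &len, '%'); break;
        default:  put(out, cap, &len, '%'); put(out, cap, &len, *fmt); break;
        }
    }
    va_end(ap);
    if (cap) out[len] = '\0';
    return len;
}
```

Calling mini_format(buf, sizeof buf, "%s has %d items", "cart", 3) fills buf with the formatted text. A real printf does the same kind of formatting and then hands the bytes to the operating system's output routine.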

Usage differences between a.out, .ELF, .EXE, and .COFF

Don't get me wrong by looking at the question title - I know what they are (format for portable executable files). But my interest scope is slightly different
MY CONFUSION
I am involved in re-hosting/retargeting applications that are originally from third parties. The problem is that sometimes the formats for object codes are also in .elf, .COFF formats and still says, "Executable and linkable".
I am primarily a Windows user and know that when you compile and assemble your C/C++ code, you get something similar to .o or .obj. that are not executable (well, I never tried to execute them). But when you complete linking static and dynamic libraries and finish building, the executable appears. My understanding is that you can then go about and link that executable or "bash" test it with some form of script if necessary.
However, in Linux (or UNIX-like systems) there are .o files after you compile and assemble the C/C++ code. And once the linking is done, the executable is in a.out format (at least in Ubuntu distribution of Linux). It may very well be .elf in some other distrib. In my quick web search none of the sources mentioned anything about .o files as executables.
QUESTIONS
Therefore my questions are the following:
1. What are the true definitions of portable executables and object code?
2. How is it that Windows and UNIX platforms cover both executables and object code under the same file format (.COFF, .elf)?
3. Am I misinterpreting "Linkable"? My interpretation of "Linkable" is something that is compiled object code and can then be "linked" to other static/dynamic link libraries. Is this a stupid thought?
4. Based on question 1 (and perhaps 2), do I need to use symbol tables (e.g. .LUM or .MAP files) with object code then? Symbols as in debug symbols, and using them when re-hosting the executables/object files on a different machine.
Thanks in advance for the right nudges. Meanwhile, I will keep digging and update the question if necessary.
UPDATE
I have managed to dig this out from somewhere :( Seems like a lot to swallow to me.

I am primarily a Windows user and know that when you compile your C/C++ code, you get something similar to .o or .obj. that are not executable
Well, last time I compiled stuff on Windows, the result of the compilation was an .obj file, which is exactly what its name suggests: it's an object file. You're right in that it's not an executable in itself. It contains machine code which doesn't (yet) contain enough information to be directly run on the CPU.
However, in Linux (or UNIX-like systems) there are .o files after you compile the C/C++ code. And once the linking is done, the executable is in a.out format (at least in Ubuntu distribution of Linux). It may very well be .elf in some other distrib.
Living in the 90's, that is :P No modern compilers I am aware of target the a.out format as their default output format for object code. Maybe it's a misleading default of GCC to put the object code into a file called a.out when no explicit output file name is specified, but if you run the file command on a.out, you'll find out that it's an ELF file. The a.out format is ancient and it's kind of "de facto obsolete".
What are the true definitions of portable executables and object code?
You've already got the Wikipedia link to object files, here's the one to "Portable Executable".
How is it that Windows and UNIX platforms cover both executables and object code under the same file format (.COFF, .elf)?
Because the ELF format (and apparently COFF too) has been designed like so. And why not? It's just the very same machine code after all, it seems quite logical to use one file format during all the compilation steps. Just like we don't like when dynamic libraries and stand-alone executables have a different format. (That's why ELF is called ELF - it's an "Executable and Linkable Format".)
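A small sketch of how one format covers all of these cases: every ELF file starts with the same header, and a 2-byte e_type field at offset 16 says whether the file is a relocatable object, an executable or a shared object. The function name elf_kind and the returned strings are my own; the offsets and type values are the standard ELF ones.

```c
#include <stddef.h>
#include <string.h>

/* Classify an ELF image by its e_type field.
   buf holds the first bytes of the file, len is how many are valid. */
const char *elf_kind(const unsigned char *buf, size_t len)
{
    /* The magic is split into two literals so "\x7f" is not read as "\x7fE". */
    if (len < 18 || memcmp(buf, "\x7f" "ELF", 4) != 0)
        return "not ELF";

    /* e_ident[5] (EI_DATA): 1 = little-endian, 2 = big-endian. */
    unsigned type;
    if (buf[5] == 1)
        type = buf[16] | (buf[17] << 8);
    else if (buf[5] == 2)
        type = (buf[16] << 8) | buf[17];
    else
        return "ELF (unknown byte order)";

    switch (type) {
    case 1:  return "relocatable object (.o)";      /* ET_REL  */
    case 2:  return "executable";                   /* ET_EXEC */
    case 3:  return "shared object (.so or PIE)";   /* ET_DYN  */
    case 4:  return "core dump";                    /* ET_CORE */
    default: return "ELF (other type)";
    }
}
```

Reading the first 18 bytes of a .o file and of the linked a.out with fread and passing them to elf_kind should show the same container with a different type field.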
Am I misinterpreting "Linkable"?
I don't know. From your question it's not clear to me what you think "linkable" is. In general, it means that it's a file that can be linked against, i.e. a library.
Based on question 1. (and perhaps 2) do I need to use symbol tables (e.g. .LUM or .MAP files) with object code then? Symbols as in debug symbols and using them when re-hosting the object files on a different machine.
I think this one is not related to the executable format used. If you want to debug, you have to generate debugging information no matter what. But if you don't need to debug, then you're free to omit them of course.

How to write your own code generator backend for gcc?

I have created my very own (very simple) byte code language, and a virtual machine to execute it. It works fine, but now I'd like to use gcc (or any other freely available compiler) to generate byte code for this machine from a normal c program. So the question is, how do I modify or extend gcc so that it can output my own byte code? Note that I do NOT want to compile my byte code to machine code, I want to "compile" c-code to (my own) byte code.
I realize that this is a potentially large question, and it is possible that the best answer is "go look at the gcc source code". I just need some help with how to get started with this. I figure that there must be some articles or books on this subject that could describe the process to add a custom generator to gcc, but I haven't found anything by googling.

I am busy porting gcc to an 8-bit processor we designed earlier. It is kind of a difficult task for our machine because it is 8-bit and we have only one accumulator, but if you have more resources it can become easy. This is how we are trying to manage it with gcc 4.9 and using cygwin:
Download gcc 4.9 source
Add your architecture name to config.sub: around line 250, look for # Decode aliases for certain CPU-COMPANY combinations. In that list add | my_processor \
In that same file look for # Recognize the basic CPU types with company name. add yourself to the list: | my_processor-* \
Search for the file gcc/config.gcc, in the file look for case ${target} it is around line 880, add yourself in the following way:
;;
my_processor*-*-*)
c_target_objs="my_processor-c.o"
cxx_target_objs="my_processor-c.o"
target_has_targetm_common=no
tmake_file="${tmake_file} my_processor/t-my_processor"
;;
Create a folder gcc-4.9.0\gcc\config\my_processor
Copy files from an existing port and edit them, or create your own from scratch. In our project we copied all the files from the msp430 port and edited them all.
You should have the following files (not all files are mandatory):
my_processor.c
my_processor.h
my_processor.md
my_processor.opt
my_processor-c.c
my_processor.def
my_processor-protos.h
constraints.md
predicates.md
README.txt
t-my_processor
Create the directory gcc-4.9.0/build/object
From that directory, run ../../configure --target=my_processor --prefix=path for my compiler --enable-languages="c"
make
make install
Do a lot of research and debugging.
Have fun.

It is hard work.
For example, I also designed my own "architecture" with my own byte code and wanted to generate code for it from C/C++ with GCC. This is how I did it:
First, read everything about porting in the GCC manual.
Also do not forget to read GCC Internals.
Read a lot about compilers.
Also look at this question and the answers here.
Google for more information.
Ask yourself if you are really ready.
Be sure to have a very good coffee machine... you will need it.
Start adding machine-dependent files to gcc.
Compile gcc as a cross compiler (host differs from target).
Check the generated code in a hex editor.
Do more tests.
Now have fun with your own architecture :D
When you are finished, you can use C or C++, but only without OS-dependent libraries (you currently have no OS running on your architecture). If you need them, you should now compile other libraries with your cross compiler to build up a good framework.
PS: LLVM (Clang) is easier to port... maybe you want to start there?

It's not as hard as all that. If your target machine is reasonably like another, take its RTL (?) definitions as a starting point and amend them, then make compile test through the bootstrap stages; rinse and repeat until it works. You probably don't have to write any actual code, just machine definition templates.
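To make concrete what such a backend ultimately has to emit, here is a toy stack machine and the byte code a generator would need to produce for the C expression 2 + 3 * 4. Everything here is hypothetical (the opcodes, vm_run and example_program are invented for illustration and are not the asker's byte code or anything GCC produces).

```c
#include <stddef.h>

enum { OP_PUSH = 1, OP_ADD, OP_MUL, OP_HALT };   /* hypothetical opcodes */

/* Run a program on a tiny stack machine.
   Returns the top of stack at OP_HALT, or -1 on any error. */
int vm_run(const unsigned char *code, size_t len)
{
    int stack[32];
    size_t sp = 0, pc = 0;

    while (pc < len) {
        switch (code[pc++]) {
        case OP_PUSH:                       /* operand is the next byte */
            if (pc >= len || sp >= 32) return -1;
            stack[sp++] = code[pc++];
            break;
        case OP_ADD:
            if (sp < 2) return -1;
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            if (sp < 2) return -1;
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            return sp ? stack[sp - 1] : -1;
        default:
            return -1;                      /* unknown opcode */
        }
    }
    return -1;                              /* ran off the end without HALT */
}

/* What a code generator would have to emit for the expression 2 + 3 * 4:
   operands are pushed first, the multiplication is emitted before the addition. */
const unsigned char example_program[] = {
    OP_PUSH, 2, OP_PUSH, 3, OP_PUSH, 4, OP_MUL, OP_ADD, OP_HALT
};
```

In a GCC port this mapping is not written as C like the above; it is expressed in the machine description files, which tell GCC which instruction of your machine implements each operation.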

How do I compile a C program without all the bloat?

I'm trying to learn x86. I thought this would be quite easy to start with: I'll just compile a very small program containing basically nothing and see what the compiler gives me. The problem is that it gives me a ton of bloat ("This program cannot be run in DOS mode" and so on): a 25KB file for an empty main() calling one empty function.
How do I compile my code without all this bloat? (and why is it there in the first place?)

Executable formats contain a bit more than just the raw machine code for the CPU to execute. If you want only that, then the only option is (I think) a DOS .com file, which essentially is just a bunch of code loaded into memory and then jumped into. Some software (e.g. Volkov Commander) made clever use of that format to deliver quite a lot in very little executable code.
Anyway, the PE format which Windows uses contains a few things that are specially laid out:
A DOS stub saying "This program cannot be run in DOS mode" which is what you stumbled over
several sections containing things like program code, global variables, etc. that are each handled differently by the executable loader in the operating system
some other things, like import tables
You may not need some of those, but a compiler usually doesn't know you're trying to create a tiny executable. Usually nowadays the overhead is negligible.
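The layout described above can be checked by hand. This is a minimal sketch, assuming the whole file (or at least its headers) has been read into a buffer; pe_section_count and rd32 are names I made up. The offsets are the documented ones: "MZ" at offset 0, the 4-byte e_lfanew field at 0x3C pointing past the DOS stub to the "PE\0\0" signature, followed by the COFF header whose second field is the section count.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a little-endian 32-bit value from a byte buffer. */
static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Return the number of sections in a PE image,
   or -1 if the buffer does not look like a PE file. */
int pe_section_count(const unsigned char *buf, size_t len)
{
    /* DOS header: starts with "MZ" and is at least 0x40 bytes long. */
    if (len < 0x40 || buf[0] != 'M' || buf[1] != 'Z')
        return -1;

    /* e_lfanew: file offset of the PE signature; the DOS stub sits before it. */
    uint32_t pe = rd32(buf + 0x3C);
    if (pe > len || len - pe < 8)
        return -1;
    if (memcmp(buf + pe, "PE\0\0", 4) != 0)
        return -1;

    /* COFF header follows the signature: Machine (2 bytes), NumberOfSections (2 bytes). */
    return buf[pe + 6] | (buf[pe + 7] << 8);
}
```

Running this on the 25KB executable from the question would show how many sections the toolchain emitted in addition to the DOS stub.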
There is an article out there that strives to create the tiniest possible PE file, though.

You might get better result by digging up older compilers. If you want binaries that are very bare to the bone COM files are really that, so if you get hold of an old compiler that has support for generating COM binaries instead of EXE you should be set. There is a long list of free compilers at http://www.thefreecountry.com/compilers/cpp.shtml, I assume that Borland's Turbo C would be a good starting point.

The bloated module could be the loader (operating system required interface) attached by linker. Try adding a module with only something like:
void foo(){}
and look at the disassembly (I assume that's the format the compiler 'gives you'). Of course the details vary a lot between operating systems and compilers. There are so many!

How to write a linker

I have written a compiler for C that outputs byte code. The reason for this was to be able to write applications for an embedded platform that runs on multiple platforms.
I have the compiler and the assembler.
I need to write a linker, and am stuck.
The object format is a custom one, designed around the byte code interpreter, so I can't really use any existing linkers.
My biggest hurdle is how to organize the object code to output the linked binary.
Dynamic linking is not necessary, at this time.
I need to get static linking working first.

Ian Lance Taylor, one of the main developers of the gold linker (now part of binutils), posted a series of blog posts on how linkers work. You can find it here.

http://linker.iecc.com is the only book I know about this subject.

I second the Linkers and Loaders book. You state that your object format is a custom one. If the format is under your control, you could consider using the ELF format with your bytecode as a new machine architecture, a la x86, SPARC, ARM, etc. The GNU binutils sources are sufficiently malleable to allow you to incorporate your "architecture".
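On the question of how to organize the output, the core of a static linker is two passes: lay the objects out and record each one's base address, then patch every relocation with the final address of the symbol it names. The sketch below assumes a made-up object format (struct object, struct symbol, struct reloc and 16-bit little-endian absolute addresses are all hypothetical choices, not the asker's format). It does not detect duplicate symbols and handles only one section per object.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical object format: code bytes, exported symbols, and relocations
   that each name a 2-byte slot to patch with a symbol's final address. */
struct symbol { const char *name; size_t offset; };
struct reloc  { size_t offset; const char *name; };
struct object {
    const unsigned char *code; size_t code_len;
    const struct symbol *syms; size_t nsyms;
    const struct reloc  *rels; size_t nrels;
};

/* Look up a symbol's final address across all objects; -1 if undefined. */
static long resolve(const struct object *objs, const size_t *base,
                    size_t nobjs, const char *name)
{
    for (size_t i = 0; i < nobjs; i++)
        for (size_t s = 0; s < objs[i].nsyms; s++)
            if (strcmp(objs[i].syms[s].name, name) == 0)
                return (long)(base[i] + objs[i].syms[s].offset);
    return -1;
}

/* Link objs into out (capacity cap). Returns the image size, or -1 on error. */
long link_objects(const struct object *objs, size_t nobjs,
                  unsigned char *out, size_t cap)
{
    size_t base[16], total = 0;
    if (nobjs > 16) return -1;

    /* Pass 1: place the objects back to back and copy their code. */
    for (size_t i = 0; i < nobjs; i++) {
        if (objs[i].code_len > cap - total) return -1;
        base[i] = total;
        memcpy(out + total, objs[i].code, objs[i].code_len);
        total += objs[i].code_len;
    }

    /* Pass 2: patch every relocation with the resolved 16-bit address. */
    for (size_t i = 0; i < nobjs; i++)
        for (size_t r = 0; r < objs[i].nrels; r++) {
            long addr = resolve(objs, base, nobjs, objs[i].rels[r].name);
            size_t at = base[i] + objs[i].rels[r].offset;
            if (addr < 0 || addr > 0xFFFF) return -1;   /* undefined or too far */
            if (objs[i].rels[r].offset + 2 > objs[i].code_len) return -1;
            out[at]     = (unsigned char)(addr & 0xFF);
            out[at + 1] = (unsigned char)(addr >> 8);
        }
    return (long)total;
}
```

Real linkers add more on top of this (multiple sections, alignment, relative relocations, archives), but the layout pass followed by the relocation pass is the skeleton that the references above describe in detail.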
