How do I compile a C program without all the bloat?

I'm trying to learn x86. I thought this would be quite easy to start with: I'll just compile a very small program containing basically nothing and see what the compiler gives me. The problem is that it gives me a ton of bloat ("This program cannot be run in DOS mode" and so on): a 25 KB file for an empty main() calling one empty function.
How do I compile my code without all this bloat? (and why is it there in the first place?)

Executable formats contain a bit more than just the raw machine code for the CPU to execute. If you want only that, then the only option is (I think) a DOS .com file, which essentially is just a bunch of code loaded into memory and then jumped into. Some software (e.g. Volkov Commander) made clever use of that format to deliver quite a lot in very little executable code.
Anyway, the PE format which Windows uses contains a few things that are specially laid out:
A DOS stub saying "This program cannot be run in DOS mode" which is what you stumbled over
several sections containing things like program code, global variables, etc. that are each handled differently by the executable loader in the operating system
some other things, like import tables
You may not need some of those, but a compiler usually doesn't know you're trying to create a tiny executable. Usually nowadays the overhead is negligible.
There is an article out there that strives to create the tiniest possible PE file, though.
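To make the layout listed above a little more concrete, here is a minimal sketch in C that locates the PE header behind the DOS stub and reads the number of sections. It relies on the documented PE/COFF layout (the "MZ" magic at offset 0, the PE header offset stored at 0x3C, the "PE\0\0" signature, and NumberOfSections six bytes after the signature). The function names are made up for this illustration and the checks are deliberately minimal.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read little-endian integers from a byte buffer. */
static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Hypothetical helper: return the offset of the "PE\0\0" signature,
 * or -1 if the buffer does not look like a PE image.
 */
long pe_signature_offset(const unsigned char *img, size_t len)
{
    /* The DOS header is 64 bytes and starts with "MZ". */
    if (len < 0x40 || img[0] != 'M' || img[1] != 'Z')
        return -1;

    /* Offset 0x3C of the DOS header holds the PE header offset. */
    uint32_t off = read_le32(img + 0x3C);

    /* Need room for the signature (4), Machine (2), NumberOfSections (2). */
    if (off > len || len - off < 8)
        return -1;
    if (memcmp(img + off, "PE\0\0", 4) != 0)
        return -1;
    return (long)off;
}

/* Hypothetical helper: number of sections, or -1 on error. */
int pe_section_count(const unsigned char *img, size_t len)
{
    long off = pe_signature_offset(img, len);
    if (off < 0)
        return -1;
    return read_le16(img + off + 6);
}
```

Everything between the end of the 64-byte DOS header and the offset returned by pe_signature_offset is the DOS stub that prints the "cannot be run in DOS mode" message.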

You might get better results by digging up older compilers. If you want binaries that are truly bare-bones, COM files are exactly that, so if you get hold of an old compiler that can generate COM binaries instead of EXE you should be set. There is a long list of free compilers at http://www.thefreecountry.com/compilers/cpp.shtml; I assume that Borland's Turbo C would be a good starting point.

The bloat could be the loader code (the interface the operating system requires) that the linker attaches. Try adding a module containing only something like:
void foo(){}
and look at the disassembly (I assume that's the format the compiler 'gives you'). Of course the details vary a lot between operating systems and compilers. There are so many!

Related

How to get a c source code from the compiled code

I have the compiled C code in text format. I need to extract the source code by decompiling the machine code. How to do that?
"True" decompiling is, basically, impossible. Foremost, you can't "decompile" local names (in functions and source code files / modules). For those, you'll get something like, for int local variables: i1, i2... Of course, unless you also have debug information, which is not often the case.
Decompiling to "something" (which might not be very readable) is possible, but it usually relies on some heuristics, recognizing code patterns that compilers generate and can be fooled into generating strange (possibly even incorrect) C code. In practice that means that a decompiler usually works OK for a certain compiler with certain (default) compile options, but, not so nice with others.
Having said that, decompilers do exist, and you can try your luck with, say, Snowman.
As Srdjan has said, in general decompilation of a C (or C++) program is not possible. Too much information is lost during the compilation process. For example, consider a declaration such as int x: it is 'lost' because it does not directly produce any machine-level instruction. The compiler needs this information only for type checking.
However, it is possible to disassemble, which takes the compiled executable back up a level to assembly language. Interpreting the assembly might (will?) be difficult and certainly time consuming. There are several disassemblers available. If you have money, IDA-Pro is probably the industry standard in disassemblers and, if you do this type of work, well worth the several thousand dollars per license. There are also a number of open source disassemblers available; Google can find them.
That being said, there have been efforts to create decompilers. IDA-Pro has one, and you can look at http://boomerang.sourceforge.net/ in addition to Snowman, linked above.
Lastly, other languages are friendlier towards decompilation than C or C++. For example, a C# program can be decompiled with tools like dotPeek or ILSpy. Similarly, for Java there are a number of tools that can convert Java bytecode back into Java source.
Please post a sample of the "compiled C code in text format."
Perhaps then it will be easier to see what you are trying to achieve.
Typically it is not practical to reverse engineer assembly language into C, because much of the human-readable information, in the form of labels and variable names, is permanently lost in the compilation process.
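A small illustration of the point made in the answers above: the two C functions below differ only in their names and comments. Without debug information, an optimizing compiler will typically emit the same instructions for both, so a decompiler has no way to recover the first version and would at best produce something like the second. All names here are invented for the example.

```c
/* What the author wrote: names and comments carry the intent. */
int total_price_cents(int unit_price_cents, int quantity)
{
    int subtotal = unit_price_cents * quantity;  /* meaningful local name */
    return subtotal;
}

/* Roughly what a decompiler can give back: same arithmetic,
 * placeholder names, no comments. */
int f1(int a1, int a2)
{
    int i1 = a1 * a2;
    return i1;
}
```

Both functions compute the same result; only the information useful to a human reader differs, and that information is what does not survive compilation.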

What files need to be modified to compile for a custom architecture of an existing cpu with gcc?

I've been looking at examples of C code that is compiled for some lesser known processors (like ZPU) using the gcc cross compiler.
Most of the working examples I see assume a certain architecture (memory map and set of peripherals) and simply give you a recipe to compile for it, and they work.
However, I can find very little information on what needs to be modified if you use the same CPU with a different memory map and set of peripherals.
From what I've read, there are two main files that I need to make sure are done "right": the linker script that is used and crt0.o (which, if I need to modify it, means recompiling crt0.S, which is assembler). On this last one especially, I find very little information on what it is actually supposed to do (other than setting up reset there is no clear info, and I'm talking conceptually, not for a specific processor, although something for that would also be useful).
Can anyone tell me what the relationship is between the C files for the program code (bare metal development), the crt0.S (especially why it is needed) and a working linker script?
PS: Answers of the form "read this book" are welcome and I would love them.
PS: I realize this kind of question is usually vague and closed quickly, but I don't know where else to turn, so I ask for a bit of leniency.

Combining source code into a single file for optimization

I was aiming at reducing the size of the executable for my C project and I have tried all compiler/linker options, which have helped to some extent. My code consists of a lot of separate files. My question was whether combining all source code into a single file will help with optimization that I desire? I read somewhere that a compiler will optimize better if it finds all code in a single file in place of separate multiple files. Is that true?
A compiler can indeed optimize better when it finds needed code in the same compilable (*.c) file. If your program is longer than 1000 lines or so, you'll probably regret putting all the code in one file, because doing so will make your program hard to maintain, but if shorter than 500 lines, you might try the one file, and see if it does not help.
The crucial consideration is how often code in one compilable file calls or otherwise uses objects (including functions) defined in another. If there are few transfers of control across this boundary, then erasing the boundary will not help performance appreciably. Therefore, when coding for performance, the key is to put tightly related code in the same file.
I like your question a great deal. It is the right kind of question to ask, in my view; and, though the complete answer is not simple enough to treat fully in a Stackexchange answer, your pursuit of the answer will teach you much. Though you may not yet realize it, your question really regards linking, a subject every advancing programmer eventually has to learn. Your question regards symbol tables, inlining, the in-place construction of return values and several other subtle factors.
At any rate, if your program is shorter than 500 lines or so, then you have little to lose by trying the single-file approach. If longer than 1000 lines, then a single file is not recommended.
It depends on the compiler. The Intel C++ Composer XE for example can automatically optimize over multiple files (when building using icc -fast *.c *.cpp or icl /fast *.c *.cpp, for linux/windows respectively).
When you use Microsoft Visual Studio, or a derived product (like Atmel Studio for microcontrollers), every single source file is compiled on its own (i.e. one cl, icl, or gcc command is issued for every .c and .cpp file in the project). This means no optimization across source files, unless the toolchain's whole-program or link-time optimization is enabled.
For microcontroller projects I sometimes have to put everything in a single file in order to make it even fit in the limited flash memory on the controller. If your compiler/IDE does it like Visual Studio, you can use a trick: select all the source files and make them not participate in the build process (but leave them in the project), then create a file (I always use whole_program.c) and #include every single source (i.e. non-header) file in it. Note that including .c files is frowned upon by many high-level programmers, but sometimes you have to do it the dirty way, and with microcontrollers that's actually more often than not.
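As a rough sketch of what the whole_program.c trick buys you: once caller and callee end up in one translation unit, helpers can be made static, and the compiler is then free to inline them and drop the stand-alone copy. The file names in the comment and the two functions are assumptions made for the illustration; the bodies are written inline here only so that the example compiles on its own.

```c
/*
 * whole_program.c (hypothetical layout). In a real project the two
 * definitions below would live in their own source files and this
 * file would contain only lines such as:
 *
 *     #include "scale.c"
 *     #include "compute.c"
 */

/* Would come from scale.c. Because every caller is now visible in the
 * same translation unit, it can be static and is a candidate for
 * inlining. */
static int scale(int x)
{
    return x * 3;
}

/* Would come from compute.c. */
int compute(int v)
{
    return scale(v) + 1;
}
```

Whether this actually shrinks the binary depends on the compiler and its options, so it is worth comparing the size of the output before and after.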
My experience has been that with gnu/gcc the optimization happens within the single file plus its includes, which together make a single object. With clang/llvm whole-project optimization is quite easy, and I recommend it: DO NOT optimize in the clang step, use clang to get from C to bytecode, then use llvm-link to link all of your bytecode modules into one bytecode module. Then you can optimize the whole project, all source files optimized together, and llc adds more optimization as it heads for the target. Your best results come from telling clang, using the something-triple command line option, what your ultimate target is. For the gnu path to do the same thing, either use includes to make one big file compiled to one object, or, if there is a machine code level optimizer other than the few things the linker does, then that is where it would have to happen. Maybe gnu has an exposed IR file format, optimizer, and IR-to-target tool, but I think I would have seen that by now.
A number of my projects at http://github.com/dwelch67, although very simple programs, have llvm and gnu builds for the same source files. You can see that in the llvm builds I make a binary from unoptimized bytecode and also from optimized bytecode. (llvm's optimizer has problems with small while loops and sometimes generates non-working code. A very quick way to see whether it is you or them is to try the non-optimized llvm binary and the gnu binary: if they all behave the same, it is you; if only the optimized llvm binary doesn't work, it is them.)

ARM: rom.ld, ram.ld, scatter file, startup.s: what do all these files do?

I have programmed AVR microcontrollers, but I am new to ARM. I just looked at sample code for the SAM7S64 that comes with WinARM. I am confused about these files: rom.ld, ram.ld, the scatter file and cstartup.s. I never saw these kinds of files when I programmed the AVR. Please explain what each of these files does.
I have even more samples for you to ponder over http://github.com/dwelch67
Assume you have a toolchain that supports a specific instruction set. Tools often try to support different implementations. You might have a microcontroller with X amount of flash and Y amount of RAM. One chip might have the RAM at a different place than another, etc. The instruction set may be the same (or may itself have subtle changes), but in order to encode some of the instructions the toolchain eventually needs to know what your memory layout is. It is possible to write code for some processors that is purely position independent, but in general that is not necessarily a goal, as it has a cost. Tools also tend to have a Unix approach to things: they go from source language to object file, which doesn't know the memory layout yet and leaves some holes to be filled in later. You can get there from different languages depending on the toolchain and instruction set, maybe mixing Ada and C and other languages that compile to objects. Then the linker needs to combine all of those things. You as the programmer can, and sometimes have to, control what goes where. You want the vector table to be at the right place, you want your entry code perhaps to be at a certain place, and you definitely want .data in RAM ultimately and .text in flash.
For the gnu tools you tell the linker where things go using a linker script; other toolchains may have other methods. With gnu ld you can also use the ld command line. The .ld files you are seeing are there to control this. Sometimes this is buried in the bowels of the toolchain install: there is a default place where the default linker script will be found, and if that is fine then you don't need to craft a linker script and carry it around with the project. Depending on the tools you were using on the AVR, you either didn't need to mess with it (you were using assembly, avra or something where you control this with .org or other similar statements) or the toolchain/sandbox took care of it for you and it was buried (for example with the Arduino sandbox). For example, if you write a hello world program
#include <stdio.h>
int main ( void )
{
    printf("Hello World!\n");
    return(0);
}
and compile that on your desktop/laptop
gcc hello.c -o hello
there was a linker script involved, likely a nasty, scary, ugly one. But since you are content with the default linker script and layout for your operating system, you don't need to mess with it; it just works. For these microcontrollers, where one toolchain can support a vast array of chips and vendors, you start to have to deal with this. It is a good idea to keep the linker script with the project, as you don't know from one machine or person to the next what exact gnu cross compiler they have. It is not difficult to create projects that work on many gnu cross compiler installs if you keep a few things with the project rather than force them into the toolchain.
The other half of this, which with the gnu tools in particular has an intimate relationship with the linker script, is the startup code. Before your C program is called there are some expectations, for example that .data is in place and .bss has been zeroed. For a microcontroller you want .data saved in non-volatile memory so it is there when you start your C program, so it needs to be in flash. But it can't stay there, as .data is read/write, so before the entry point of the C code is called you need to copy .data from flash to the proper place in RAM. The linker script describes both where in flash to keep .data and where in RAM to copy it. The startup code, which you can name whatever you want (startup.s, start.s, crt0.s, etc.), gets variables filled in during the link stage so that it can copy .data to RAM, zero out .bss and set the stack pointer so you have a stack (another item you need for C to work); then it calls the C entry point. This is true for any other high-level language as well: if nothing else, everyone needs a stack pointer, so you need some startup code.
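The copy-and-zero step described above can be sketched in C as follows. This is only a host-side model under stated assumptions: on a real target the five addresses would not be passed as arguments but would come from symbols defined in the linker script (names such as _sidata, _sdata, _edata, _sbss and _ebss are common conventions, not something taken from the files in the question), and the stack pointer would be set in assembly before any C code runs.

```c
#include <stdint.h>

/*
 * Hypothetical model of what startup code does before calling main():
 *   data_load  - where the initial .data image is stored (flash)
 *   data_start - start of .data in RAM
 *   data_end   - end of .data in RAM (one past the last byte)
 *   bss_start  - start of .bss in RAM
 *   bss_end    - end of .bss in RAM (one past the last byte)
 */
void startup_init_memory(const uint8_t *data_load,
                         uint8_t *data_start, const uint8_t *data_end,
                         uint8_t *bss_start, const uint8_t *bss_end)
{
    /* Copy initialized variables from their load address to RAM. */
    while (data_start < data_end)
        *data_start++ = *data_load++;

    /* C expects uninitialized globals and statics to start at zero. */
    while (bss_start < bss_end)
        *bss_start++ = 0;
}
```

Plain loops are used instead of memcpy and memset because this code runs before the C environment is set up, and on a bare-metal target a C library may not be linked in at all.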
If you look at some of my examples you will see me doing linker scripts and startup code for avr processors as well.
It's hard to know exactly what each of the files (rom.ld, ram.ld, scatter file, cstartup.s) contains in your specific case. However, assuming their names are descriptive enough, I will give you an idea of what they are intended to do:
1- rom.ld/ram.ld: judging by the file extensions, these are "linker scripts". These files tell the linker where to put each of the memory sections of the object files (see GNU LD to learn all about linker scripts and their syntax).
2- cstartup.s: again judging by the extension, this appears to be code written in assembly. Generally, in this file the software developer initializes the microcontroller before passing control to your main application. Examples of actions performed by this file are:
Setup the ARM vectors
Configure the oscillator frequency
Initialize volatile memory
Call main()
3- Scatter: Personally I have never used this file. However, it appears to be a file used to control the memory layout of your application and how that is laid out in your micro (see reference). It appears to be a Keil-specific file that serves the same purpose as a linker script.

Removing unneeded code from gcc and mingw

I noticed that mingw adds a lot of code before calling main(). I assumed it's for parsing command line parameters, since one of those functions is called __getmainargs(). Lots of strings are also added to the final executable, such as mingwm.dll and some error strings (in case the app crashes) saying "mingw runtime error" or something like that.
My question is: is there a way to remove all this stuff? I don't need all these things. I tried tcc (Tiny C Compiler) and it did the job, but it is not cross-platform like gcc (Solaris/Mac).
any ideas?
thanks.
Yes, you really do need all those things. They're the startup and teardown code for the C environment that your code runs in.
Other than non-hosted environments such as low-level embedded solutions, you'll find pretty much all C environments have something like that. Things like /lib/crt0.o under some UNIX-like operating systems or crt0.obj under Windows.
They are vital to successful running of your code. You can freely omit library functions that you don't use (printf, abs and so on) but the startup code is needed.
Some of the things that it may perform are initialisation of atexit structures, argument parsing, initialisation of structures for the C runtime library, initialisation of C/C++ pre-main values and so forth.
It's highly OS-specific and, if there are things you don't want to do, you'll probably have to get the source code for it and take them out, in essence providing your own cut-down replacement for the object file.
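To give a feel for the argument parsing that such startup code performs before main() runs, here is a deliberately simplified sketch in C. It is not MinGW's implementation, and split_args is a made-up name: it only splits on spaces and ignores quoting, which real runtimes handle.

```c
/*
 * Hypothetical, simplified model of pre-main argument parsing:
 * split a raw command line into tokens in place and fill argv.
 * Returns the number of arguments found (at most max_args).
 */
int split_args(char *cmdline, char **argv, int max_args)
{
    int argc = 0;
    char *p = cmdline;

    while (*p != '\0' && argc < max_args) {
        while (*p == ' ')               /* skip separators */
            p++;
        if (*p == '\0')
            break;
        argv[argc++] = p;               /* start of a token */
        while (*p != '\0' && *p != ' ') /* walk to the end of the token */
            p++;
        if (*p == ' ')
            *p++ = '\0';                /* terminate the token */
    }
    return argc;
}
```

A runtime would do something along these lines with the command line it receives from the operating system and then call main(argc, argv), which is why some of this code is present even in a program that never looks at its arguments.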
You can safely assume that your toolchain does not include code that is not needed and could safely be left out.
Make sure you compiled without debug information, and run strip on the resulting executable. Anything more intrusive than that requires intimate knowledge of your toolchain, and can result in rather strange behaviour that will be hard to debug - i.e., if you have to ask how it could be done, you shouldn't try to do it.

Resources