Why does compiling a C program produce such a long binary? - c

I have heard that when a compiler compiles code, what it does is create a file that contains instructions that a machine can execute. According to this video, a simple program like int main(){ int i; i = 3; } should, when compiled, produce a file that's only several bytes long. So why does clang compile this into a file that's several kilobytes long?

This is likely due to libraries and startup code that get linked into your executable (an #include by itself only pulls in declarations; the linker is what adds library code), or to the compiler and linker including debugging information. Of course, an executable also contains a lot of OS-specific data which adds to the size; see this question for more detailed answers. If you're after a small executable, there are plenty of suggestions in the answers to this question.
EDIT: Reading more about it, the size comes down to C being a high-level language in the sense that it does not communicate directly with hardware, but rather talks to an operating system. Basically, main is not the real entry point of your program, and a lot goes on before it is even called. I strongly recommend reading through this blog post and its follow-up, and above all watching Matt Godbolt's insightful talk on the topic. These mostly concern gcc and GNU/Linux, but I think it's fair to assume that similar reasons apply to executable sizes on other operating systems as well.
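To see for yourself that code runs before main, here is a minimal sketch. It assumes GCC or Clang, because __attribute__((constructor)) is a compiler extension and not standard C; the names ran_before_main, early_setup and started_before_main are made up for illustration.

```c
/* Flag that is set by code which runs before main() is entered. */
static int ran_before_main = 0;

/* GCC/Clang extension (an assumption of this sketch): the startup code
   linked into the executable calls this function before it calls main(). */
__attribute__((constructor))
static void early_setup(void)
{
    ran_before_main = 1;
}

/* Returns 1 if early_setup() has already run, 0 otherwise. */
int started_before_main(void)
{
    return ran_before_main;
}
```

If you call started_before_main() as the very first statement of main, it should already report that early_setup has run, which is only possible because the startup code did some work first.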

Related

How can I know where function ends in memory(get the address)- c/c++

I'm looking for a simple way to find function ending in memory. I'm working on a project that will find problems on run time in other code, such as: code injection, viruses and so forth. My program will run with the code that is going to be checked on run time, so that I will have access to memory. I don't have access to the source code itself. I would like to examine only specific functions from it. I need to know where functions start and end in stack. I'm working with Windows 8.1 64-bit.
In general, you cannot find where a function ends in memory, because the compiler could have optimized, inlined, cloned or removed that function, split it into different parts, etc. That function could be some system call mostly implemented in the kernel, or some function in an external shared library ("outside" of your program's executable)... From the point of view of the C11 standard (see n1570), your question makes no sense. That standard defines the semantics of the language, i.e. properties of the behavior of the produced program. See also explanations in this answer.
On some computers (Harvard architecture) the code would stay in a different memory, so there is no point in asking where that function starts or ends.
If you restrict your question to a particular C implementation (that is a specific compiler with particular optimization settings, for a specific operating system and instruction set architecture and ABI) you might (in some cases, not in all of them) be able to find the "end of a function" (but that won't be simple, and won't be failproof). For example, you could post-process the assembler code and/or the object file produced by the compiler, inspect the ELF executable and its symbol table, examine DWARF debug information, etc...
Your question smells a lot like an XY problem, so you should motivate it with a lot more explanation and context.
I need to know where functions start and end in stack.
Functions don't sit on the stack, but mostly in the code segment of your executable (or library). What is on the call stack is a sequence of call frames. The organization of the call frames is specific to your ABI. Some compiler options (e.g. -fomit-frame-pointer) make it difficult to explore the call stack (without access to the source code and help from the compiler).
I don't have access to the source code itself. I would like to examine only specific functions from it.
Your problem is still ill-defined, probably undecidable, much more complex than what you believe (since related to the halting problem), and there is considerable literature related to it (read about decompiler, static code analysis, anti-virus & malware analysis). I recommend spending several months or years learning more about compilers (start with the Dragon Book), linkers, instruction set architecture, ABIs. Then look into several proceedings of conferences related to ACM SIGPLAN etc. On a practical side, study the assembler code generated by compilers (e.g. use GCC with gcc -O2 -S -fverbose-asm....); the CppCon 2017 talk: Matt Godbolt “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” is a nice introduction.
I'm working on a project that will find problems on run time in other code, such as: code injection, viruses and so forth.
I hope you can dedicate several years of full-time work to your ambitious project. It is probably much more difficult than you thought, because optimizing compilers are much more complex than you believe (and malware uses various complex tricks to hide itself from inspection). Malware research is really difficult, but interesting.

Is it possible to modify a C program which is running?

i was wondering if it is possible to modify a piece of C program (or other binary) while it is running ?
I wrote this small C program :
#include <stdio.h>
#include <stdint.h>

static uint32_t gcui32_val_A = 0xAABBCCDD;

int main(int argc, char *argv[]) {
    uint32_t ui32_val_B = 0;
    uint32_t ui32_cpt = 0;
    printf("\n\n Program SHOW\n\n");
    while(1) {
        if(gcui32_val_A != ui32_val_B) {
            printf("Value[%d] of A : %x\n",ui32_cpt,gcui32_val_A);
            ui32_val_B = gcui32_val_A;
            ui32_cpt++;
        }
    }
    return 0;
}
With a hex editor I'm able to find "0xAABBCCDD" and modify it when the program is stopped. The modification works when I relaunch the program. Cool!
I would like to do this while the program is running. Is it possible?
Here is a simple example to understand the phenomena and play a little with it but my true project is bigger.
I have an old DOS game called Dangerous Dave.
I'm able to modify the tiles by simply editing the binary (thanks to http://www.shikadi.net/moddingwiki/Dangerous_Dave)
I developed a small editor that does this pretty well and had fun with it.
I launch the DOS game by using DOSBOX, it works !
I would like to do this dynamically when the game is running. Is it possible ?
PS : I work under Debian 64bit
regards
I was wondering if it is possible to modify a piece of C program (or other binary) while it is running ?
Not in standard (and portable) C11. Read the n1570 specification to check. Notice that most of the time in practice, it is not the C source program (made of several translation units) which is running, but an executable result of some compiler & linker.
However, on Linux (e.g. Debian/Sid/x86-64) you could use some of the following tricks (often with function pointers):
use plugins, so design your program to accept them and define conventions about your plugins. A plugin is a shared object ELF file (some *.so) containing position-independent code (so it should be compiled with specific options). You'll use dlopen(3) & dlsym(3) to do the dynamic loading of the plugin.
use some JIT-compiling library, like GCCJIT or LLVM or libjit or asmjit.
alter your virtual address space (not recommended) manually, using mprotect(2) and mmap(2); then you could overwrite something in a code segment (you really should not do that). This might be tricky (e.g. because of ASLR) and brittle.
perhaps use debug related facilities, either with ptrace(2) or by scripting or extending the gdb debugger.
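As a rough illustration of the mprotect(2) and mmap(2) mechanism mentioned above, here is a minimal sketch that only toggles protections on a data page it allocated itself (it does not touch the real code segment). It assumes Linux with glibc; the function name patch_page_demo and the value 0x11223344 are made up for the example.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS when compiling in strict ISO mode */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: maps one private page, stores a value in it,
   write-protects the page, then makes it writable again and patches it.
   On success returns 0 and reports the value seen before and after the
   patch; on any failure returns -1. */
int patch_page_demo(uint32_t *before, uint32_t *after)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    uint32_t *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 0xAABBCCDDu;

    /* Read-only from here on: a store to p[0] would now raise SIGSEGV. */
    if (mprotect(p, (size_t)page, PROT_READ) != 0) {
        munmap(p, (size_t)page);
        return -1;
    }
    *before = p[0];

    /* Re-enable writes, the same step needed before patching any
       protected mapping of your own process. */
    if (mprotect(p, (size_t)page, PROT_READ | PROT_WRITE) != 0) {
        munmap(p, (size_t)page);
        return -1;
    }
    p[0] = 0x11223344u;
    *after = p[0];

    munmap(p, (size_t)page);
    return 0;
}
```

Patching real code works along the same lines but needs page-aligned addresses of the text mapping and is far more brittle, which is why it is not recommended.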
I suggest playing a bit with /proc/ (see proc(5)) and at least trying to run the following commands in a terminal:
cat /proc/self/maps
cat /proc/$$/maps
ls /proc/$$/fd/
(and read enough things to understand their outputs) to understand a bit more what a process "is".
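The same information can be read from inside a C program. The following is a minimal sketch, assuming Linux and the line layout documented in proc(5) (start-end followed by the permission string); the helper name mapping_perms is hypothetical. It looks up which mapping of the current process contains a given address and returns its permissions.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: finds the mapping in /proc/self/maps that contains
   addr and copies its permission string (e.g. "r-xp") into perms.
   Returns 0 if a mapping was found, -1 otherwise. */
int mapping_perms(const void *addr, char perms[5])
{
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps == NULL)
        return -1;

    uintptr_t target = (uintptr_t)addr;
    char line[4096];
    int found = -1;

    while (fgets(line, sizeof line, maps) != NULL) {
        uint64_t start, end;
        char p[5];

        /* Each line begins with "start-end perms", addresses in hex. */
        if (sscanf(line, "%" SCNx64 "-%" SCNx64 " %4s", &start, &end, p) != 3)
            continue;
        if (target >= start && target < end) {
            memcpy(perms, p, sizeof p);
            found = 0;
            break;
        }
    }
    fclose(maps);
    return found;
}
```

Passing the address of a function should typically report an executable mapping (an x in the third position), while the address of a local variable should land in a readable and writable one, which is a quick way to see that code and stack live in different mappings.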
So overwriting your text segment (if you really need to do that) is possible, but perhaps more tricky than what you believe !
(do you mind working for several weeks or months simply to improve some old gaming experience?)
Read also about homoiconic programming languages (try Common Lisp with SBCL), about dynamic software updating, about persistence, about application checkpointing, and about operating systems (I recommend: Operating Systems: Three Easy Pieces & OsDev wiki)
I work under Debian 64bit
I suppose you have programming skills and do know C. Then you should read ALP or some newer Linux programming book (and of course look into intro(2) & syscalls(2) & intro(3) and other man pages etc...)
BTW, in your particular case, perhaps the "OS" is DOSBOX (acting as some virtual machine). You might use strace(1) on DOSBOX (or on other commands or processes), or study its source code.
You mention games in your question. If you want to code some, consider libraries like SDL, SFML, Qt, GTK+, ....
Yes, you can modify a piece of code while it is running in C. You need a pointer into your program's memory area, and the compiled pieces of code that you want to change. Naturally this is considered a dangerous practice, with a lot of restrictions and many possibilities for error. However, it was common practice in the old days, when memory was precious.

Is there a reason even my tiniest .c files always compile to at least 128-kilobyte executables?

I am using Dev-C++, which compiles using GCC, on Windows 8.1, 64-bit.
I noticed that all my .c files always compiled to at least 128-kilobyte .exe files, no matter how small the source is. Even a simple "Hello, world!" was 128kb. Source files with more lines of code increased the size of the executable as I would expect, but all the files started off at at least 128kb, as if that's some sort of minimum size.
I know .exe's don't actually have a minimum size like that; .kkrieger is a full first-person shooter with 3d graphics and sound that all fit inside a single 96kb executable.
Trying to get to the bottom of this, I opened up my hello_world.exe in Notepad++. Perhaps my compiler adds a lengthy header that happens to be 128kb, I thought.
Unfortunately, I don't know enough about executables to be able to make sense of it, though I did find strings like "Address %p has no image-section VirtualQuery failed for %d bytes at address %p" buried among the usual garble of characters in an .exe.
Of course, this isn't a serious problem, but I'd like to know why it's happening.
Why is this 128kb minimum happening? Does it have something to do with my 64-bit OS, or perhaps with a quirk of my compiler?
Short answer: it depends.
Long answer: it depends on what operating system you have and how it handles executables.
Compilers do produce real machine code for your architecture (x86, ARM, ...), but they don't hand you a bare blob of it. After the compiler turns your source code into a .o (object) file, the linker takes that .o together with the libraries it uses and "links" it all together, in such a way that it forms a standard executable format. These "executable formats" are system-specific file formats that wrap the machine code in headers, section tables and other metadata; the OS loader reads that metadata to map the file into memory, and the CPU then executes the instructions directly.
For example, I'll talk about the executable format most commonly used on Linux devices: ELF, which comes in 32-bit (ELF32) and 64-bit (ELF64) flavours. ELF stands for Executable and Linkable Format. Every ELF file starts off with a 4-byte "magic number", which is simply the byte 0x7F followed by the string "ELF" in ASCII. The next byte is set to either 1 or 2, which signifies a 32-bit or 64-bit file, respectively. After that comes another byte that signifies the file's endianness (1 for little-endian, 2 for big-endian). A few more fields describe the file type, the target architecture, and so on, for a total of 52 bytes in the 32-bit header and 64 bytes in the 64-bit header.
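The magic number and class byte described above are easy to check by hand. Here is a minimal sketch; the function name elf_class is made up, and it only looks at the first five bytes of a buffer you have already read from a file.

```c
#include <stddef.h>

/* Hypothetical helper: returns 32 or 64 if buf starts with the ELF magic
   number followed by a valid class byte, or -1 otherwise. */
int elf_class(const unsigned char *buf, size_t len)
{
    if (len < 5)
        return -1;

    /* Bytes 0..3: 0x7F followed by "ELF" in ASCII. */
    if (buf[0] != 0x7F || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
        return -1;

    /* Byte 4: 1 means a 32-bit file, 2 means a 64-bit file. */
    switch (buf[4]) {
    case 1:
        return 32;
    case 2:
        return 64;
    default:
        return -1;
    }
}
```

On Linux you could feed it the first bytes of /proc/self/exe or of any binary in /bin to see which flavour it is.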
However, 64 bytes is not even close to the 128K that you have stated. That's because (aside from the fact that the Windows .exe format is usually much more complex) the libraries linked into your program are at fault here, and the C++ standard library is a common offender. For instance, let's have a look at a common use of the C++ iostream library:
#include <iostream>

int main()
{
    std::cout<<"Hello, World!"<<std::endl;
    return 0;
}
This program may compile to an extremely large executable on a Windows system, because once you use iostream, large parts of the C++ standard library get pulled into it if the library is linked statically (as some MinGW-based toolchains do), increasing your executable's size immensely.
So, how do we rectify this problem? Simple:
Use the C standard library implementation for C++!
#include <cstdio>

int main()
{
    printf("Hello, World!\n");
    return 0;
}
Simply using the original C standard library can often decrease your size from a couple hundred KBytes to a handful. This happens because g++ links the C++ standard library by default, and a statically linked iostream drags a lot of code along with it.
However, sometimes you absolutely need to use the C++-specific libraries. In that case, toolchains offer options to discard code you never call. With GCC and GNU ld, compiling with -ffunction-sections -fdata-sections and linking with -Wl,--gc-sections drops unreferenced sections, and -s strips the symbol tables. (The -nodefaultlibs option is something different: it tells the gcc driver not to link the standard libraries at all, so you have to supply replacements yourself, and this can very quickly break a TON of calls in programs that make a lot of standard C++ calls.) I'm not entirely sure what the equivalents are for other Windows toolchains, though.
So, in the end, I would worry more about simply re-writing your program to use the regular C functions instead of the new-fangled C++ functions, as amazing as they are. That is, if you're worried about size.

hidden routines linked in c program

Hello,
When one disassembles a win32 exe program compiled by a C compiler, it
shows that some compilers link some 'hidden' routines into it,
I think even if the C program is an empty one of 5 bytes or so.
I understand that those 5 bytes are wrapped in the PE .exe format, but
why put extra routines in? It seems unnecessary to me and even
somewhat annoys me. What are they? Can they be omitted? As I understand it,
a C program (not speaking about C++ right now, which I know has some
initial routines) should not need such complementary hidden functions.
Many thanks for an answer, maybe even a link to some extended info, because this
topic interests me a lot.
//edit
OK, here is some disassembly I did a while back.
(I have also tested Digital Mars and the old Borland command-line compiler;
both generate much more code, and I'm especially interested in bcc32,
but they do not include readable names/symbols in the disassembly,
so I will not post them here.)
These are somewhat readable, but I am not experienced in understanding
what it is ;-)
https://dl.dropbox.com/u/42887985/prog_devcpp.htm
https://dl.dropbox.com/u/42887985/prog_lcc.htm
https://dl.dropbox.com/u/42887985/prog_mingw.htm
https://dl.dropbox.com/u/42887985/prog_pelles.htm
Could someone add some explanatory comments on what is in here?
(I am afraid there may be some C++ sh*t in here; I am
interested in pure C add-ons, not C++,
but I am too tired now to verify that it was compiled in C
mode. The extension of the compiled empty-main program was .c,
so I assumed the output would be C, not C++.)
Thanks for longer explanations of what it is.
Since your win32 exe file is a dynamically linked object file, it will contain the necessary data needed by the dynamic linker to do its job, such as names of libraries to link to, and symbols that need resolving.
Even a program with an empty main() will link with the c-runtime and kernel32.dll libraries (and probably others? - a while since I last did Win32 dev).
You should also be aware that main() is only the entry point of your program as far as you are concerned: quite a bit has already gone on before this point, such as retrieving and tokenizing the command line, setting up the locale, creating stderr, stdin, and stdout, and setting up the other mechanisms required by the C runtime library, such as atexit(). Similarly, when your main() returns, the runtime does some clean-up, and at the very least needs to call the kernel to tell it that you're done.
As to whether it's necessary? Yes, unless you fancy writing your own program prologue and epilogue each time. There probably are ways of writing minimal, statically linked applications if you're sufficiently masochistic.
As for storage overhead, why are you getting so worked up? It's not enough to worry about.
There are several initialization functions that load whenever you run a program on Windows. These functions, among other things, call the main() function that you write - which is why you need either a main() or WinMain() function for your program to run. I'm not aware of other included functions though. Do you have some disassembly to show?
You haven't given much detail to go on, but I think most of what you're seeing is probably the routines of the specific C runtime library that your compiler works with.
For instance, there will be code at the entry point recorded in the portable executable header, which sets things up and then calls the main(int argc, char **argv) that you wrote in your C program.

How do i compile a c program without all the bloat?

I'm trying to learn x86. I thought this would be quite easy to start with: I'll just compile a very small program containing basically nothing and see what the compiler gives me. The problem is that it gives me a ton of bloat ("This program cannot be run in DOS mode" and so on): a 25KB file for an empty main() calling one empty function.
How do I compile my code without all this bloat? (and why is it there in the first place?)
Executable formats contain a bit more than just the raw machine code for the CPU to execute. If you want only that, then the only option is (I think) a DOS .com file, which essentially is just a bunch of code loaded into a page and then jumped into. Some software (e.g. Volkov Commander) made clever use of that format to deliver quite a lot in very little executable code.
Anyway, the PE format which Windows uses contains a few things that are specially laid out:
A DOS stub saying "This program cannot be run in DOS mode" which is what you stumbled over
several sections containing things like program code, global variables, etc. that are each handled differently by the executable loader in the operating system
some other things, like import tables
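To make that layout a bit more concrete, here is a minimal sketch that recognises a PE file from its first bytes. It relies on two documented facts of the format: the file starts with the DOS header magic "MZ", and the 4-byte little-endian field at offset 0x3C holds the file offset of the "PE\0\0" signature. The function name looks_like_pe is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if buf looks like a PE image (DOS header
   starting with "MZ" whose field at 0x3C points at "PE\0\0"), else 0. */
int looks_like_pe(const unsigned char *buf, size_t len)
{
    /* The DOS header is 64 bytes long and begins with "MZ". */
    if (len < 0x40 || buf[0] != 'M' || buf[1] != 'Z')
        return 0;

    /* Little-endian 32-bit offset of the PE signature. */
    uint32_t off = (uint32_t)buf[0x3C]
                 | (uint32_t)buf[0x3D] << 8
                 | (uint32_t)buf[0x3E] << 16
                 | (uint32_t)buf[0x3F] << 24;

    /* The 4-byte signature must lie completely inside the buffer. */
    if (off > len || len - off < 4)
        return 0;

    return buf[off] == 'P' && buf[off + 1] == 'E'
        && buf[off + 2] == 0 && buf[off + 3] == 0;
}
```

Everything between the DOS header and that signature is the DOS stub, which is where the "This program cannot be run in DOS mode" text usually sits.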
You may not need some of those, but a compiler usually doesn't know you're trying to create a tiny executable. Usually nowadays the overhead is negligible.
There is an article out there that strives to create the tiniest possible PE file, though.
You might get better results by digging up older compilers. If you want binaries that are pared to the bone, COM files are really that, so if you get hold of an old compiler that can generate COM binaries instead of EXE you should be set. There is a long list of free compilers at http://www.thefreecountry.com/compilers/cpp.shtml; I assume that Borland's Turbo C would be a good starting point.
The bloat could be the startup module (the interface the operating system requires) that the linker attaches. Try adding a module with only something like:
void foo(){}
and see the disassembly (I assume that's the format the compiler 'gives you'). Of course the details vary a lot between operating systems and compilers. There are so many!

Resources