I'm starting to program in Rust and one of the first things I noticed is that Rust produces large binaries. For example, Rust's "Hello world!" binary is ~600K large, while the equivalent C binary is ~8K large.
After some searching I found this SO post which explains that Rust binaries are large because all the needed libraries are statically linked. But isn't that the case for C as well? When I write #include <stdio.h> in C, aren't I statically linking the relevant I/O libraries as well? I have always assumed that answer is 'yes' but now I am doubting myself.
#include copies the file contents into the source file, but if the header is nothing more than function declarations, all that does is tell the compiler that those functions are available to be called in your code. The actual implementation may be defined in another file that needs to be linked in (either statically or dynamically) to your executable. If you look at stdio.h you will see that it consists almost entirely of function prototypes, macros and type definitions rather than implementations.
Many compilers provide options to do either static or dynamic linking for the standard libraries.
Related
When I include some function from a header file in a C++ program, does the entire header file code get copied to the final executable or only the machine code for the specific function is generated. For example, if I call std::sort from the <algorithm> header in C++, is the machine code generated only for the sort() function or for the entire <algorithm> header file.
I think that a similar question exists somewhere on Stack Overflow, but I have tried my best to find it (I glanced over it once, but lost the link). If you can point me to that, it would be wonderful.
You're mixing two distinct issues here:
Header files, handled by the preprocessor
Selective linking of code by the C++ linker
Header files
These are simply copied verbatim by the preprocessor into the place that includes them. All the code of algorithm is copied into the .cpp file when you #include <algorithm>.
Selective linking
Most modern linkers won't link in functions that aren't getting called in your application. I.e. write a function foo and never call it - its code won't get into the executable. So if you #include <algorithm> and only use sort here's what happens:
The preprocessor shoves the whole algorithm file into your source file
You call only sort
The linker analyzes this and only adds the compiled code of sort (and of the functions it calls, if any) to the executable. The other algorithms' code isn't added
That said, C++ templates complicate the matter a bit further. It's a complex issue to explain here, but in a nutshell - templates get expanded by the compiler for all the types that you're actually using. So if you have a vector of int and a vector of string, the compiler will generate two copies of the vector code (more precisely, of the member functions you actually use), one per type. Since you are using it (otherwise the compiler wouldn't generate it), the linker also places it into the executable.
In fact, the entire file is copied into .cpp file, and it depends on compiler/linker, if it picks up only 'needed' functions, or all of them.
In general, simplified summary:
debug configuration means compiling in all of non-template functions,
release configuration strips all unneeded functions.
Plus it depends on attributes: a function declared for export will never be stripped.
On the other side, template function variants are 'generated' when used, so only the ones you explicitly use are compiled in.
EDIT: header file code isn't generated, but in most cases hand-written.
If you #include a header file in your source code, it acts as if the text in that header was written in place of the #include preprocessor directive.
Generally headers contain declarations, i.e. information about what's inside a library. This way the compiler allows you to call things for which the code exists outside the current compilation unit (e.g. the .cpp file you are including the header from). When the program is linked into an executable that you can run, the linker decides what to include, usually based on what your program actually uses. Libraries may also be linked dynamically, meaning that the executable file does not actually include the library code but the library is linked at runtime.
It depends on the compiler. Most compilers today do flow analysis to prune out uncalled functions. http://en.wikipedia.org/wiki/Data-flow_analysis
I'm writing a C program where every bit of the executable size matters.
If, for example, only printf() from stdio.h is required in my program, would including the header actually cause everything in that library to be copied into the CMake-compiled executable?
CMake is just the build system generator. What ultimately goes into the final executable is decided by the linker and which options you use with it. Typical linkers will only link into the executable what they can determine to be necessary – unless you ask them to link everything. However there's some limits on how much they can reduce the footprint.
The rule of thumb is that if you use a function found in foo.o, then the whole of foo.o gets linked; hence, if size optimization is your goal, it's a good idea to give each function its own compilation unit.
What headers you use has no effect whatsoever, because headers are processed at compilation time, not linkage time.
Last but not least: in most implementations of the standard library, the printf family of functions is among the most heavyweight ones, so don't use them if you're beancounting.
As a principle, headers should be idempotent, that is, they should not affect the executable if the declarations are not used. stdlib.h should only have things like prototypes, preprocessor macro definitions and struct definitions; it should not contain executable code or variable definitions.
Standard library code is included by the linker as required. However, the C runtime library (RTL) might have this code in a DLL or shared object, depending on your platform. Using a DLL (or equivalent) does not affect the size of the executable file, but of course can affect the memory used. Since DLL code is shared between processes it is not uncommon for the C RTL to remain in memory, but, assuming dynamic linking, there will only be one copy, regardless of the number of C processes running. Most C RTLs will have some memory allocated per-process, but how much depends on the compiler/platform.
Okay, until this morning I was thoroughly confused between these terms. I guess I have got the difference, hopefully.
Firstly, the confusion was that since the preprocessor already includes the header files into the code which contains the functions, what library functions does linker link to the object file produced by the assembler/compiler? Part of the confusion primarily arose due to my ignorance about the difference between a header file and a library.
After a bit of googling, and stack-overflowing (is that the term? :p), I gathered that the header file mostly contains the function declarations whereas the actual implementation is in another binary file called the library (I am still not 100% sure about this).
So, suppose in the following program:-
#include <stdio.h>
int main(void)
{
    printf("whatever");
    return 0;
}
The preprocessor includes the contents of the header file in the code. The compiler/compiler+assembler does its work, and then finally linker combines this object file with another object file which actually has stored the way printf() works.
Am I correct in my understanding? I may be way off...so could you please help me?
Edit: I have always wondered about the C++ STL. It always confused me as to what it exactly is, a collection of all those headers or what? Now after reading the responses, can I say that STL is an object file/something that resembles an object file?
And also, I thought where I could read the function definitions of functions like pow(), sqrt() etc etc. I would open the header files and not find anything. So, is the function definition in the library in binary unreadable form?
A C source file goes through two main stages, (1) the preprocessor stage where the C source code is processed by the preprocessor utility which looks for preprocessor directives and performs those actions and (2) the compilation stage where the processed C source code is then actually compiled to produce object code files.
The preprocessor is a utility that does text manipulation. It takes as input a file that contains text (usually C source code) that may contain preprocessor directives and outputs a modified version of the file by applying any directives found to the text input to generate a text output.
The file does not have to be C source code because the preprocessor is doing text manipulation. I have seen the C Preprocessor used to extend the make utility by allowing preprocessor directives to be included in a make file. The make file with the C Preprocessor directives is run through the C Preprocessor utility and the resulting output is then fed into make to do the actual build of the make target.
Libraries and linking
A library is a file that contains object code of various functions. It is a way to package the output from several source files when they are compiled into a single file. Many times a library file is provided along with a header file (include file), typically with a .h file extension. The header file contains the function declarations, global variable declarations, as well as preprocessor directives needed for the library. So to use the library, you include the header file provided using the #include directive and you link with the library file.
A nice feature of a library file is that you are providing the compiled version of your source code and not the source code itself. On the other hand since the library file contains compiled source code, the compiler used to generate the library file must be compatible with the compiler being used to compile your own source code files.
There are two types of libraries commonly used. The first and older type is the static library. The second and more recent is the dynamic library (Dynamic Link Library or DLL in Windows and Shared Library or SO in Linux). The difference between the two is when the functions in the library are bound to the executable that is using the library file.
The linker is a utility that takes the various object files and library files to create the executable file. When an external or global function or variable is used in the C source file, a kind of marker is used to tell the linker that the address of the function or variable needs to be inserted at that point.
The C compiler only knows what is in the source it compiles and does not know what is in other files such as object files or libraries. So the linker's job is to take the various object files and libraries and to make the final connections between parts by replacing the markers with actual connections. So a linker is a utility that "links" together the various components, replacing the marker for a global function or variable in the object files and libraries with a link to the actual object code that was generated for that global function or variable.
During the linker stage is when the difference between a static library and a dynamic or shared library becomes evident. When a static library is used, the actual object code of the library is included in the application executable. When a dynamic or shared library is used, the object code included in the application executable is code to find the shared library and connect with it when the application is run.
In some cases the same global function name may be defined in several different object files or libraries. For definitions that come from libraries, the linker will normally just use the first one it comes across; duplicate definitions in the object files themselves are normally reported as an error.
Summary of compile and link
So the basic process for a compile and link of a C program is:
preprocessor utility generates the C source to be compiled
compiler compiles the C source into object code generating a set of object files
linker links the various object files along with any libraries into executable file
The above is the basic process however when using dynamic libraries it can get more complicated especially if part of the application being generated has dynamic libraries that it is generating.
The loader
There is also the stage of when the application is actually loaded into memory and execution starts. An operating system provides a utility, the loader, which reads the application executable file and loads it into memory and then starts the application running. The starting point or entry point for the executable is specified in the executable file so after the loader reads the executable file into memory it will then start the executable running by jumping to the entry point memory address.
One problem the linker can run into is that sometimes it may come across a marker when it is processing the object code files that requires an actual memory address. However the linker does not know the actual memory address because the address will vary depending on where in memory the application is loaded. So the linker marks that as something for the loader utility to fix when the loader is loading the executable into memory and getting ready to start it running.
With modern CPUs with hardware supported virtual address to physical address mapping or translation, this issue of actual memory address is seldom a problem. Each application is loaded at the same virtual address and the hardware address translation deals with the actual, physical address. However older CPUs or lower cost CPUs such as micro-controllers that are lacking the memory management unit (MMU) hardware support for address translation still need this issue addressed.
Entry points and the C Runtime
A final topic is the C Runtime and the main() and the executable entry point.
The C Runtime is object code provided by the compiler manufacturer that contains the entry point for an application that is written in C. The main() function is the entry point provided by the programmer writing the application however this is not the entry point that the loader sees. The main() function is called by the C Runtime after the application is started and the C Runtime code sets up the environment for the application.
The C Runtime is not the Standard C Library. The purpose of the C Runtime is to manage the runtime environment for the application. The purpose of the Standard C Library is to provide a set of useful utility functions so that a programmer doesn't have to create their own.
When the loader loads the application and jumps to the entry point provided by the C Runtime, the C Runtime then performs the various initialization actions needed to provide the proper runtime environment for the application. Once this is done, the C Runtime then calls the main() function so that the code created by the application developer or programmer starts to run. When the main() returns or when the exit() function is called, the C Runtime performs any actions needed to clean up and close out the application.
This is an extremely common source of confusion. I think the easiest way to understand what's happening is to take a simple example. Forget about libraries for a moment and consider the following:
$ cat main.c
extern int foo( void );
int main( void ) { return foo(); }
$ cat foo.c
int foo( void ) { return 0; }
$ cc -c main.c
$ cc -c foo.c
$ cc main.o foo.o
The declaration extern int foo( void ) is performing exactly the same function as the header file of a library. foo.o is performing the function of the library. If you understand this example, and why neither cc main.c nor cc main.o work, then you understand the difference between header files and libraries.
Yes, almost correct. Except that the linker does not only link object files, but also libraries. In this case, it is the C standard library (libc) that is linked to your object file. The rest of your assumptions about the compilation stages and the difference between a header and a library appear to be true.
I know that when we call any library function in our source code, the function definitions will be loaded into RAM (assuming dynamic linking) at run time.
But where exactly are the definitions of library functions stored?
If they are not in .c format, how are they stored?
If you need to get any function definition, you need to check the source code [That was obvious].
To get the function definitions which are part of a library [e.g. glibc], you have to get the source code of the library and browse through that. Usually, the library sources [in .c format, if that is what you mean] will be compiled to produce a library, either
static [usually, noted by .a]
dynamic [Usually, noted by .so, shared object]
to be linked with some source code to produce the final binary.
So, yes, they are in .c format (or at least in a human-readable format, I should say) which you can browse through.
Note: An online browsable version of glibc.
P.S - Sorry if my answer is biased towards Linux implementations; however, it is still valid for a Windows (XP) PC
The header file contains the declaration. It is in the header file named alloc.h, which you can find in the include folder. You have to specify the environment you are using. Header files are saved with the extension .h.
You can find an example Windows implementation of malloc here. On Windows, it's mostly a wrapper for WinAPI functions such as HeapAlloc. You can find other implementations of this and other functions in various opensource libraries.
Note that on Windows, a compiler doesn't have to provide implementations for the standard C functions, as they are all available in msvcrt.dll. You can't get the source code of these implementations, but you can still disassemble the DLL and look at the assembly.
I have a bunch of code I need to analyse and I don't know how to do it. The code uses, here and there, math functions from a header file I have included called math.h that came with my IDE. I am being asked to see how much space is used by including this. Specifically, is the compiler including all of the library functions or just the ones we use? There is no object file being created, so I think the library code is being compiled into the individual files. Any ideas of a slick way to figure this out? I can't just comment out the includes because then the code won't compile and I won't know a size, and if I comment out all the lines that use math functions it is not really representative.
You can use the objdump command to see the individual symbols inside your object files and the space they require.
Note that unless you're doing static compilation, library methods aren't generally copied into your produced binary, but only referenced (and brought in via the dynamic linker when your program is loaded).
As math.h is part of the standard C library, a copy of that library is basically guaranteed to always be in memory, so the additional memory and space requirements on dynamic linking are minimal. (During static linking, all symbols which aren't directly required by your program are discarded, and math functions don't tend to be very big, so usage should be fairly minimal there too).
The code in the header file is compiled into the object file of the .c file that includes it if the header contains the definitions of the functions; if it contains only declarations, the functions are merely referenced. The linker will then find a definition for each symbol and place it in your executable. If you are using dynamic linking, the OS will pull in the definition at run time.