I've been practicing C today and something came to my mind. When C code is run, does it load all the files needed for execution into memory? For example, do the main.c file and its header files get copied into memory? What happens if you have a complete C program that takes up 1 GB or something equally large?
A C program is first compiled into a binary executable, so header files, source files, etc. do not exist anymore at that point... unless you compiled your binary with debugging information (the -g flag).
This is a huge topic. Generally the executable is mapped into what's called virtual memory, which lets a process address more space than is physically available in your computer (through paging). When you try to access a code segment that is not yet loaded, it triggers a page fault and the OS fetches what's missing. Compilers will often reorder functions so that related code sits close together in memory, so most of the time you are executing only a small part of your binary.
If you look into specific domains such as HPC or embedded devices, the loading policies will likely be different.
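The loader relies on this same mechanism when it maps an executable. Here is a minimal sketch of demand paging done from user space, assuming a POSIX system (the file name big_binary is just a placeholder):

#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("big_binary", O_RDONLY); /* placeholder file name */
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) < 0) return 1;

    /* This reserves address space only; nothing is read from disk yet. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;

    /* Touching a byte triggers a page fault; the OS loads just that page. */
    printf("first byte: %d\n", p[0]);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}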
C is not an interpreted language but a compiled one.
This means that the original *.c source file is never loaded at execution time. Instead, the compiler processes it once to produce an executable file containing machine language.
Therefore, the size of the source file doesn't directly matter. It may well be very large because it covers a lot of different use cases, yet produce a tiny executable because only the applicable case is picked at compilation time. Most of the time, however, the executable size remains correlated with the size of its source, though that doesn't necessarily mean you end up with something huge.
Also, the *.h header files included at the top of C source files are not actually "importing" a dependency (the way use, require, or import would in other languages). An #include statement only inserts the content of a file at a given point, and these files usually contain only function prototypes, variable declarations and some preprocessor #define clauses, which form the API of an external resource that is linked to your program later.
These external resources are typically other object modules (when you have multiple *.c files within the same project and you don't want to recompile them all from scratch each time), static libraries or dynamic libraries. The latter are DLL files under Windows and *.so files under Unix. In this case, the operating system automatically loads the required libraries when you run your program.
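To make the textual-insertion point concrete, here is a small sketch (the names mylib.h, mylib.c and add() are made up for the example): the header carries only declarations, while the definition lives in a separately compiled file that the linker resolves later.

/* mylib.h -- only declarations; #include pastes this text into main.c */
#ifndef MYLIB_H
#define MYLIB_H
#define MYLIB_ANSWER 42
int add(int a, int b); /* prototype only; the body lives elsewhere */
#endif

/* mylib.c -- the definition, compiled separately into mylib.o */
int add(int a, int b) { return a + b; }

/* main.c */
#include <stdio.h>
#include "mylib.h" /* textual insertion, not an import */

int main(void)
{
    printf("%d\n", add(MYLIB_ANSWER, 0)); /* resolved to mylib.o at link time */
    return 0;
}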
Except for linking the object files of the current code and other included header files, what other tasks does the linker do? See the dummy code below:

#include <stdio.h>
#include <a.h> // another file, which has the declaration of the function "fun()"

int main()
{
    int a = 100;
    printf("print fun = %d", fun());
    return 0;
}
This answer applies to linked programs (e.g., C, C++), as opposed to interpreted programs (e.g., shell scripts, most byte-code languages). The shades of grey for byte-code languages are ignored here.
Linkers start by collecting the object modules (called .o files herein) and the libraries the command line points them to, building a list of all provided and referenced global names therein. By this time the actual source code has been left behind; ignoring debugging information (described later), effectively only the names of global variables and functions remain in the .o files, along with their associated values.
The object code tracks which parts of the .o files are variables and which parts are executable code, along with the names and locations of external entries. Some languages (C++) track argument signatures, while others (C) do not.
Keeping it simple: after collecting the .o files named on the command line, the linker's major job is to look through those .o files, extract the references to all externals, and then hunt through the libraries to satisfy those externals. The whole lot is bundled up into your executable or, in the case of dynamic libraries loaded during execution, appropriate links to the loadable modules are put in the executable.
The linking process assigns everything the linker puts into the executable a memory address to reside at. The linker puts all this into the executable, along with additional metadata such as file headers and optional debugging information, in a way loaders can easily extract when the program runs. The result is usually divided into program segments: a text segment for executable code, a data segment where the program's variables that have initial values live, and a BSS segment for variables that have no specific initial value and get the default value of binary zeros. These segments define the run-time layout of the executable.
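A small sketch of how C variables map onto those segments (segment names as in typical Unix executables; on such systems the size utility reports the three totals):

#include <stdio.h>

int initialized = 42; /* data segment: the value 42 is stored in the executable */
int uninitialized;    /* BSS segment: only its size is recorded; the loader zero-fills it */

int main(void)        /* text segment: executable machine code */
{
    printf("%d %d\n", initialized, uninitialized); /* prints: 42 0 */
    return 0;
}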
A few basic things, like the names and locations of functions, are often kept in all builds, but this can be so minimal as to be worthless for debugging most failures, which leads developers to use -g during development with most POSIX/Linux type compilers.
Debugging information for .o modules compiled under -g (or whatever) will be discarded if the linker is not told to keep it in the build. This information includes all of the external variable names, functions, their addresses, and more; all the stuff you can see with debuggers is included here, and it adds considerably to the size of executables. It often includes a mapping from the locations of functions, and the code therein, back to the original source files.
Linkers also identify where execution is to start. This is a small but critical thing, and it is not the main()-type function that beginners tend to believe is the start of their code's run, but some place deep within the language library that is automagically included in the build in perhaps obscure ways. This startup code makes sure everything needed for successful execution happens: getting ready to handle malloc(), using inherited open files, setting up environment variables, setting up the stack and heap, and the like. Once all this is done, main() is called to kick off the user's code. While the linker has nothing to do with writing this initialization code, nor the final completion code that runs when the program exits or main() returns, the linker does make sure this startup code is present, including the code that properly exits once the user's code is done.
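One way to see that this startup code is separate from main() is to replace it. This sketch assumes gcc on Linux, where the conventional entry symbol is _start and -nostartfiles omits the usual C runtime startup files; with the runtime gone, stdio and malloc() are not initialized, so only raw system call wrappers are safe:

/* start.c -- build with: gcc -nostartfiles start.c */
#include <unistd.h>

int main(void)
{
    static const char msg[] = "hello from main\n";
    write(1, msg, sizeof msg - 1); /* raw write; stdio is not set up */
    return 0;
}

void _start(void) /* the loader jumps here, not to main() */
{
    _exit(main()); /* _start must never return */
}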
Once the build is done, the linker's work is finished, with the big exception of possibly handling dynamic modules that get loaded when called. How these are handled is heavily dependent on the OS and the nature of the link. Sometimes a linker is involved and sometimes it is "just" loading a module file.
In my view, the "loader" that reads executables into the proper memory, zeroes out some chunks of memory, and performs other startup handling is a key partner of the linker. While a key partner, it is definitely a separate program.
Basically, the linker links object files (the result of compiling your various source code files (not headers), the standard library, the math library, the ncurses library, the big integer library, ...) and adds management data to make a recognizable executable (ELF format, EXE format, PE format, ...).
As far as I know, the compiler converts source code to machine code. But this code does not have any OS-related sections, and the linker adds them to the file.
But is it possible to make an executable without a linker?
Answering your question very literally: yes, it is possible to make an executable file without a linker. You don't need a compiler or linker to generate machine code; binaries are a series of opcodes and relevant information (offsets, addresses, etc.). You can open a binary editor, type out some opcodes to make a program, then save and run it.
Of course the binary will be processor specific, just as if you had compiled a native executable. Here's a reference to the Intel x86 opcodes:
http://ref.x86asm.net/coder32.html
If you're asking, however, "Can I compile a source file directly into an executable file without a linker?" then strictly speaking: no, unless the compiler has aspects of a linker integrated within it. The compiler generates intermediate objects that are passed to the linker to be "linked" into a binary such as a library or executable. Without the link step the pipeline is not complete.
Let's first make a statement that is to be considered true: compilers do not generate machine code that can be immediately executed (JITs do, but let's ignore that).
Instead they generate files (object, static, dynamic, executable) which describe what they contain, as well as groups of symbols. Symbols can be global variables or functions.
But symbols, just like the file itself, carry metadata, and this metadata is very important. The machine code stored in a symbol is the raw instructions for the target architecture, but it does not know where in memory it will end up.
While modern CPUs give each process its own address space, a symbol may not land, and probably won't land, at the same address twice. In recent times this is a security measure (address space layout randomization), but in the past it was simply what made dynamic linking work correctly.
So when the OS loads up an executable or shared library, it can place it wherever it wants, and by doing so makes the placement non-repeatable. Otherwise we'd all have to start caring and saying "this file contains 100% of the code I intend to execute". Usually, on load, the raw binary described by the symbol table gets transformed by patching it with the symbols' actual locations in RAM, making everything just work.
In summary, the compiler emits files that allow for dynamic patching of assembly prior to execution. If it didn't, we would be living in a very restrictive and problematic world.
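You can watch this non-repeatability directly. Assuming a position-independent executable with ASLR enabled (the default on most modern Linux systems), the printed address usually changes from run to run:

#include <stdio.h>

int foo(void) { return 0; }

int main(void)
{
    /* Casting a function pointer to void * is common POSIX practice,
       though not strictly guaranteed by ISO C. */
    printf("foo is at %p\n", (void *)foo);
    return 0;
}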
Linkers even have scripts to change how they operate. They are a very complex and delicate piece of software required to make our programs work.
Have a read of the PE-COFF and ELF standards if you want to get an idea of just how complex those formats really are.
Okay, until this morning I was thoroughly confused between these terms. I guess I have got the difference, hopefully.
Firstly, the confusion was this: since the preprocessor already includes the header files into the code, which contain the functions, what library functions does the linker link to the object file produced by the assembler/compiler? Part of the confusion primarily arose from my ignorance about the difference between a header file and a library.
After a bit of googling, and stack-overflowing (is that the term? :p), I gathered that the header file mostly contains the function declarations whereas the actual implementation is in another binary file called the library (I am still not 100% sure about this).
So, suppose we have the following program:
#include <stdio.h>

int main()
{
    printf("whatever");
    return 0;
}
The preprocessor includes the contents of the header file in the code. The compiler/compiler+assembler does its work, and then finally the linker combines this object file with another object file which actually stores how printf() works.
Am I correct in my understanding? I may be way off...so could you please help me?
Edit: I have always wondered about the C++ STL. It always confused me as to what it exactly is: a collection of all those headers, or what? Now after reading the responses, can I say that the STL is an object file/something that resembles an object file?
And also, I wondered where I could read the function definitions of functions like pow(), sqrt(), etc. I would open the header files and not find anything. So, is the function definition in the library, in binary unreadable form?
A C source file goes through two main stages: (1) the preprocessor stage, where the C source code is processed by the preprocessor utility, which looks for preprocessor directives and performs those actions, and (2) the compilation stage, where the processed C source code is then actually compiled to produce object code files.
The preprocessor is a utility that does text manipulation. It takes as input a file that contains text (usually C source code) that may contain preprocessor directives and outputs a modified version of the file by applying any directives found to the text input to generate a text output.
The file does not have to be C source code, because the preprocessor only does text manipulation. I have seen the C preprocessor used to extend the make utility by allowing preprocessor directives to be included in a makefile. The makefile with the C preprocessor directives is run through the preprocessor, and the resulting output is then fed into make to do the actual build of the make target.
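You can watch this text manipulation in isolation. With most Unix compilers, cc -E stops after preprocessing and prints the transformed text; in this sketch the #define vanishes and GREETING is replaced by its replacement text:

/* greet.c -- try: cc -E greet.c */
#define GREETING "hello"
const char *message = GREETING; /* becomes: const char *message = "hello"; */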
Libraries and linking
A library is a file that contains object code of various functions. It is a way to package the output from several source files when they are compiled into a single file. Many times a library file is provided along with a header file (include file), typically with a .h file extension. The header file contains the function declarations, global variable declarations, as well as preprocessor directives needed for the library. So to use the library, you include the header file provided using the #include directive and you link with the library file.
A nice feature of a library file is that you are providing the compiled version of your source code and not the source code itself. On the other hand since the library file contains compiled source code, the compiler used to generate the library file must be compatible with the compiler being used to compile your own source code files.
There are two types of libraries commonly used. The first and older type is the static library. The second and more recent is the dynamic library (Dynamic Link Library or DLL in Windows and Shared Library or SO in Linux). The difference between the two is when the functions in the library are bound to the executable that is using the library file.
The linker is a utility that takes the various object files and library files and creates the executable file. When an external or global function or variable is used in a C source file, a kind of marker is used to tell the linker that the address of the function or variable needs to be inserted at that point.
The C compiler only knows what is in the source it compiles and does not know what is in other files such as object files or libraries. So the linker's job is to take the various object files and libraries and to make the final connections between parts by replacing the markers with actual connections. So a linker is a utility that "links" together the various components, replacing the marker for a global function or variable in the object files and libraries with a link to the actual object code that was generated for that global function or variable.
During the linker stage is when the difference between a static library and a dynamic or shared library becomes evident. When a static library is used, the actual object code of the library is included in the application executable. When a dynamic or shared library is used, the object code included in the application executable is code to find the shared library and connect with it when the application is run.
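A sketch of both kinds of link with a typical Unix toolchain (the file names are made up; ar, -shared, -fPIC and -l are the usual tools and flags):

$ cc -c mylib.c                          # object code
$ ar rcs libmylib.a mylib.o              # bundle it into a static library
$ cc main.o libmylib.a -o app_static     # library object code copied into the executable
$ cc -shared -fPIC -o libmylib.so mylib.c
$ cc main.o -L. -lmylib -o app_dynamic   # executable only records a reference to libmylib.so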
In some cases the same global function name may be used in several different object files or libraries so the linker will normally just use the first one it comes across and issue a warning about others found.
Summary of compile and link
So the basic process for a compile and link of a C program is:
preprocessor utility generates the C source to be compiled
compiler compiles the C source into object code generating a set of object files
linker links the various object files along with any libraries into executable file
The above is the basic process; however, when using dynamic libraries it can get more complicated, especially if part of the application being generated consists of dynamic libraries that it is itself generating.
The loader
There is also the stage of when the application is actually loaded into memory and execution starts. An operating system provides a utility, the loader, which reads the application executable file and loads it into memory and then starts the application running. The starting point or entry point for the executable is specified in the executable file so after the loader reads the executable file into memory it will then start the executable running by jumping to the entry point memory address.
One problem the linker can run into is that sometimes it may come across a marker when it is processing the object code files that requires an actual memory address. However the linker does not know the actual memory address because the address will vary depending on where in memory the application is loaded. So the linker marks that as something for the loader utility to fix when the loader is loading the executable into memory and getting ready to start it running.
With modern CPUs with hardware supported virtual address to physical address mapping or translation, this issue of actual memory address is seldom a problem. Each application is loaded at the same virtual address and the hardware address translation deals with the actual, physical address. However older CPUs or lower cost CPUs such as micro-controllers that are lacking the memory management unit (MMU) hardware support for address translation still need this issue addressed.
Entry points and the C Runtime
A final topic is the C Runtime, main(), and the executable entry point.
The C Runtime is object code provided by the compiler manufacturer that contains the entry point for an application written in C. The main() function is the entry point provided by the programmer writing the application; however, this is not the entry point that the loader sees. The main() function is called by the C Runtime after the application is started and the C Runtime code has set up the environment for the application.
The C Runtime is not the Standard C Library. The purpose of the C Runtime is to manage the runtime environment for the application. The purpose of the Standard C Library is to provide a set of useful utility functions so that a programmer doesn't have to create their own.
When the loader loads the application and jumps to the entry point provided by the C Runtime, the C Runtime then performs the various initialization actions needed to provide the proper runtime environment for the application. Once this is done, the C Runtime then calls the main() function so that the code created by the application developer or programmer starts to run. When the main() returns or when the exit() function is called, the C Runtime performs any actions needed to clean up and close out the application.
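You can observe the C Runtime's shutdown phase from C itself with the standard atexit() hook, which the runtime invokes after main() returns:

#include <stdio.h>
#include <stdlib.h>

static void goodbye(void)
{
    puts("runtime cleanup, after main() has returned");
}

int main(void)
{
    atexit(goodbye); /* register a hook with the C Runtime */
    puts("inside main");
    return 0;        /* the runtime now runs goodbye(), flushes stdio, and exits */
}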
This is an extremely common source of confusion. I think the easiest way to understand what's happening is to take a simple example. Forget about libraries for a moment and consider the following:
$ cat main.c
extern int foo( void );
int main( void ) { return foo(); }
$ cat foo.c
int foo( void ) { return 0; }
$ cc -c main.c
$ cc -c foo.c
$ cc main.o foo.o
The declaration extern int foo( void ) is performing exactly the same function as the header file of a library. foo.o is performing the function of the library. If you understand this example, and why neither cc main.c nor cc main.o work, then you understand the difference between header files and libraries.
Yes, almost correct. Except that the linker does not link only object files, but libraries as well; in this case, it's the C standard library (libc) that is linked with your object file. The rest of your assumptions about the compilation stages and the difference between a header and a library appear to be true.
What is the difference between a .o file and a .lib file?
Conceptually, a compilation unit (the unit of code in a source file/object file) is either linked entirely or not at all. While some implementations, with significant levels of cooperation between the compiler and linker, are able to remove unused code from object files at link time, it doesn't change the issue that including 2 compilation units with conflicting symbol names in a program is an error.
As a practical example, suppose your library has two functions foo and bar and they're in an object file together. If I want to use bar, but my program already has an external symbol named foo, I'm stuck with an error. Even if or how the implementation might be able to resolve this problem for me, the code is still incorrect.
On the other hand, if I have a library file containing two separate object files, one with foo and the other with bar, only the one containing bar will get pulled into my program.
When writing libraries, you should avoid including multiple functions in the same object file unless it's essential that they be used together. Doing so will bloat up applications which link your library (statically) and increase the likelihood of symbol conflicts. Personally I prefer erring on the side of separate files when there's a doubt - it's even useful to put foo_create and foo_free in separate files if the latter is nontrivial so that short one-off programs that don't need to call foo_free can avoid pulling in the code for deep freeing (and possibly even avoid pulling in the implementation of free itself).
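A sketch of that layout with a typical Unix toolchain (file and function names are invented for the example):

$ cat foo.c
int foo(void) { return 1; }
$ cat bar.c
int bar(void) { return 2; }
$ cc -c foo.c bar.c
$ ar rcs libfoobar.a foo.o bar.o
$ cc main.c libfoobar.a     # if main() calls only bar(), foo.o is never pulled in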
A .LIB file is a collection of .OBJ files concatenated together with an index. There should be no difference in how the linker treats either.
Quoted from here:
What is the difference between .LIB and .OBJ files? (Visual Studio C++)
They are actually quite different, especially with older linkers.
The .o (or .obj) files are object files; they contain the compiler's generated code. It is still in an intermediate format; for example, most references are still unresolved. Usually there is a one-to-one mapping between the source file and the object file.
The .a (or .lib) files are archives, also known as libraries, and are sets of object files.
All operating systems have tools that allow you to add/remove/list object files to library files.
Another difference, especially with older linkers, is how the files are dealt with when linking them. Some linkers will place the complete object file into the final binary, regardless of what is actually being used, while they will only extract the useful information out of library files.
Nowadays most linkers are smart enough to remove all stuff that is not being used.
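With GNU toolchains, for instance, this relies on flags that put each function in its own section so the linker can discard the unreferenced ones (a sketch; both flags are standard gcc/ld options):

$ cc -ffunction-sections -fdata-sections -c main.c
$ cc -Wl,--gc-sections main.o -o app    # the linker garbage-collects unused sections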
The final images produced by compilers contain both a bin file and an ELF (Executable and Linkable Format) file. What is the difference between the two, especially the utility of the ELF file?
A bin file is a pure binary file with no memory fix-ups or relocations; more than likely it has explicit instructions to be loaded at a specific memory address. Whereas...
ELF files are Executable and Linkable Format files, which contain a symbol table and relocation information; that is, the file can be loaded at any memory address by the kernel, and automatically all the symbols used are adjusted by the offset from the memory address where it was loaded. Usually ELF files have a number of sections, such as 'data', 'text' and 'bss', to name but a few... it is within those sections that the run-time can calculate where to adjust the symbols' memory references dynamically at run-time.
A bin file is just the bits and bytes that go into the ROM or to a particular address from which you will run the program. You can take this data and load it directly as-is, but you need to know what the base address is, as that is normally not in there.
An ELF file contains the bin information, but it is surrounded by lots of other information: possible debug info, symbols, and markers that distinguish code from data within the binary. It allows for more than one chunk of binary data (when you dump one of these to a bin you get one big bin file with fill data padding it out to the next block). It tells you how much binary you have and how much BSS data there is that wants to be initialized to zeros (GNU tools have problems creating bin files correctly).
The ELF file format is a standard; ARM publishes its enhancements/variations on the standard. I recommend everyone write an ELF-parsing program to understand what is in there. Don't bother with a library; it is quite simple to just use the information and structures in the spec. Doing so helps to overcome the GNU problems with creating .bin files, and also helps with debugging linker scripts and other things that can mess up your bin or ELF output.
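As a starting point, here is a minimal sketch of such a parser, assuming a 64-bit ELF file and the Linux elf.h header (the struct layout comes straight from the spec):

/* elfpeek.c -- dump a few fields of a 64-bit ELF header */
#include <elf.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 2) return 1;
    FILE *f = fopen(argv[1], "rb");
    if (!f) return 1;

    Elf64_Ehdr eh;
    if (fread(&eh, sizeof eh, 1, f) != 1) return 1;

    /* Every ELF file starts with the magic bytes 0x7f 'E' 'L' 'F'. */
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0) {
        puts("not an ELF file");
        return 1;
    }
    printf("entry point:     0x%lx\n", (unsigned long)eh.e_entry);
    printf("section headers: %u at offset 0x%lx\n",
           (unsigned)eh.e_shnum, (unsigned long)eh.e_shoff);
    fclose(f);
    return 0;
}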
some resources:
ELF for the ARM architecture
http://infocenter.arm.com/help/topic/com.arm.doc.ihi0044d/IHI0044D_aaelf.pdf
ELF from wiki
http://en.wikipedia.org/wiki/Executable_and_Linkable_Format
The ELF format is generally the default output of compiling.
If you use GNU toolchains, you can translate it to binary format using objcopy, such as:
arm-elf-objcopy -O binary [elf-input-file] [binary-output-file]
or using the fromELF utility (built into most IDEs such as ADS):
fromelf -bin -o [binary-output-file] [elf-input-file]
A bin is the final way the memory looks before the CPU starts executing it.
An ELF is a cut-up/compressed version of that, which the CPU/MCU thus can't run directly.
The (dynamic) linker first has to sufficiently reverse that (and thus modify offsets back to the correct positions).
But there is no linker/OS on an MCU, hence you have to flash the bin instead.
Moreover, Ahmed Gamal is correct.
Compiling and linking are separate stages; the whole process is called "building", hence the GNU Compiler Collection has separate executables: one for the compiler (which technically outputs assembly), another for the assembler (which outputs object code in the ELF format), then one for the linker (which combines several object files into a single ELF file), and finally, at runtime, there is the dynamic linker, which effectively turns an ELF into a bin, but purely in memory, for the CPU to run.

Note that it is common to refer to the whole process as "compiling" (as in GCC's name itself), but that then causes confusion when the specifics are discussed, such as in this case, and Ahmed was clarifying.
It's a common problem due to the inexact nature of human language itself.
To avoid confusion: GCC outputs object code (after internally using the assembler) in the ELF format. The linker simply takes several of these files (with an .o extension) and produces a single combined result, probably even compressing them (into "a.out"). But all of them, even ".so" files, are ELF.

It is like several Word documents, each ending in ".chapter", all being combined into a final ".book", where all files technically use the same standard/format and hence could have had ".docx" as the extension. The bin is then kind of like converting the book into a ".txt" file while adding as much whitespace as necessary to match the size of the final book (printed on a single spool), with places for all the pictures to be overlaid.
I just want to correct a point here: the ELF file is produced by the linker, not the compiler.
The compiler's job ends after producing the object files (*.o) out of the source code files. The linker links all the .o files together and produces the ELF.