Object code and link time in C

When compiling, C produces object code before link time.
I wonder: is the object code already in binary form?
If so, what happens next, at link time?

Wikipedia says,
In computer science, an object file is an organised collection of named objects, and typically these objects are sequences of computer instructions in a machine code format, which may be directly executed by a computer's CPU. Object files are typically produced by a compiler as a result of processing a source code file. Object files contain compact code, and are often called "binaries".
A linker is typically used to generate an executable or library by amalgamating parts of object files together. Object files for embedded systems typically contain nothing but machine code, but generally object files also contain data for use by the code at runtime: relocation information, stack unwinding information, comments, program symbols (names of variables and functions) for linking and/or debugging purposes, and other debugging information.
Another great site has much more detailed info and a useful diagram, here:

Object files as produced by the C compiler essentially contain binary code with holes in each place where an address that is not yet known should go (addresses of functions called from other files, including libraries; addresses of variables from other files that are accessed in this one; ...).
It also contains a table indexed by symbol names ("x" or "_x" for a variable x, "f" or "_f" for a function f). For each such symbol, there is a status code ("defined here", "not defined here but used", ...) and the addresses of the holes in the binary code that need to be filled in with that symbol's address when it becomes known.
If you are using Unix (or gcc on Windows), you can print the latter table with the command "nm file.o".
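For example (a minimal sketch; the exact one-letter codes printed by nm, and whether uninitialised globals show up as "B" or "C", vary by platform and compiler version):

/* symbols.c -- compile with: gcc -c symbols.c, then inspect with: nm symbols.o */

int counter = 42;            /* defined here: nm typically lists it as "D" (data)     */
int buffer[256];             /* defined but uninitialised: typically "B" (bss)        */
extern int external_value;   /* not defined here but used: typically "U" (undefined)  */
int helper(void);            /* declared only; also becomes "U" because it is called  */

int add_to_counter(void)     /* defined here: typically "T" (text, i.e. code)         */
{
    counter += external_value;
    return helper();
}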

Yes, object code is usually in binary form. Just try opening it in your favorite text editor.
You can learn what linkers do here or here.

Related

Apart from linking the object files of the current code and of the other files whose headers are included, what other tasks does the linker perform? See the dummy code below:

#include <stdio.h>
#include <a.h>   // other file which has the declaration of function "fun()"

int main()
{
    int a = 100;
    printf("print fun = %d", fun());
    return 0;
}
This answer applies to linked programs (e.g., C, C++), as opposed to interpreted programs (e.g., shell scripts and most byte-code languages). The shades of grey for byte-code languages are ignored here.
Linkers start by collecting the object modules (called .o files herein) and the libraries that the command line points them at, building a list of all global names provided and referenced therein. By this time the actual source code has been left behind. Ignoring debugging information, described later, effectively only the names of global variables and functions remain in .o files, along with their associated values.
The object code tracks which parts of .o files are variables and which parts are executable code, along with the names and locations of external entries. Some languages (C++) track argument signatures, though others (C) do not.
Keeping it simple, after including the .o files named on the command line in the build, the linker's major job is to look through those .o files, extract references to all externals, and then hunt through the libraries to satisfy those externals. The whole lot is bundled up into your executable, or, in the case of dynamic libraries loaded during execution, appropriate links to loadable modules are put in the executable.
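As a sketch of what "satisfying an external" looks like in practice (the library and flag shown are the usual ones on POSIX systems; details vary by platform and optimisation level):

/* main.c -- sqrt() is an external: the compiler leaves a hole for its address.
 * Building with something like  gcc main.c -lm  lets the linker hunt through
 * the math library to satisfy the reference; on many systems, leaving out -lm
 * makes the link step fail with an "undefined reference to `sqrt'" style error.
 */
#include <math.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    (void)argv;
    printf("%f\n", sqrt((double)argc));  /* non-constant argument, so a real call is kept */
    return 0;
}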
The linking process assigns everything the linker puts into the executable a memory address to reside at. The linker puts all of this into the executable, along with additional metadata such as file headers and optional debugging information, in a way loaders can readily extract when the program runs. The result is usually divided into program segments: one for executable code, data segments where the program's variables that have initial values live, and a place for variables that have no specific initial value and get the default value of binary zeros. These segments provide the run-time layout of the executable.
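A small sketch of which variables typically end up in which segment (actual section names and placement depend on the toolchain; the size command prints per-segment totals):

/* segments.c -- compile with: gcc -c segments.c, then inspect with: size segments.o */

int initialised = 123;        /* has an initial value: usually lands in .data          */
int uninitialised[1024];      /* no initial value: usually .bss, zero-filled at load   */
const char message[] = "hi";  /* read-only data: usually .rodata                       */

int get(void)                 /* executable code: goes into the .text segment          */
{
    return initialised + uninitialised[0];
}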
A few basic things, like the names and locations of functions, are often kept in all builds, but this can be so minimal as to be worthless for debugging most failures, which leads developers to use -g during development with most POSIX/Linux type compilers.
Debugging information for .o modules compiled under -g (or whatever) will be discarded if the linker is not told to keep it in the build. This information includes all of the external variable names, functions, their addresses, and more; all the stuff you can see with debuggers is included here, and it adds considerably to the size of executables. It often includes associating the locations of functions, and the code therein, with the original source files.
Linkers also identify where execution is to start. A small but critical thing, and it is not the "main()" type function that beginners tend to believe is the start of their code's run, but some place deep within the language library that is automagically included in the build in perhaps obscure ways. This startup code makes sure everything needed for successful execution happens, like getting ready to handle malloc(), using inherited open files, setting up environment variables, and setting up the stack and heap. Once all this is done, main() is called to kick off the user's code. While the linker has nothing to do with writing this initialization code, nor the final completion code that runs when the program exits or main returns, the linker does make sure this startup code is present in the build, and that code also handles exiting properly when the user's code is done.
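One way to see that code really does run before main() (a minimal sketch using the GCC/Clang-specific constructor attribute; the hidden _start/crt startup code runs earlier still and is not something you write):

#include <stdio.h>

/* GCC/Clang extension: functions marked as constructors are called by the
 * C runtime's startup machinery before main() is entered. */
__attribute__((constructor))
static void before_main(void)
{
    puts("runs before main");
}

int main(void)
{
    puts("main starts here");
    return 0;
}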
Once the build is done the linker is done, with the big exception of possibly handling dynamic modules that get loaded when called. How those are handled is heavily dependent on the OS and the nature of the link. Sometimes a linker is involved and sometimes it is "just" loading a module file.
In my view the "loader", which reads executables into the proper memory, zeros out some chunks of memory, and does other start-up handling, is a key partner of the linker. While a key partner, it is definitely a separate program.
Basically, the linker links object files (the result of compiling your various source code files (not headers), the standard library, the math library, the ncurses library, the big integer library, ...) and adds management data to make a recognizable executable (ELF format, EXE format, PE format, ...).

Does a C program load everything to memory?

I've been practicing C today and something came to mind. Whenever C code is run, does it load all the files needed for execution into memory? Like, do the main.c file and its header files get copied into memory? What happens if you have a complete C program that takes up 1 GB or something large?
A C program is first compiled into a binary executable, so header files, source files, etc. do not exist anymore at this point... unless you compiled your binary with debugging information (-g flag).
This is a huge topic. Generally the executable is mapped into what's called virtual memory, which allows addressing more space than you have available in your computer's physical memory (through paging). When you try to access code segments that are not yet loaded, a page fault occurs and the OS fetches what's missing. Compilers will often reorder functions to avoid executing code from random memory locations, so most of the time you're executing only a small part of your binary.
If you look into specific domains such as HPC or embedded devices the loading policies will likely be different.
C is a compiled language, not an interpreted one.
This means that the original *.c source file is never loaded at execution time. Instead, the compiler processes it once to produce an executable file containing machine language.
Therefore, the size of the source file doesn't directly matter. It may well be very large if it contains a lot of different use cases, yet produce a tiny executable because only the applicable case is picked at compilation time. However, most of the time the executable size remains correlated with that of its source, but that doesn't necessarily mean it will end up being huge.
Also, the *.h header files included at the top of C source files do not actually "import" a dependency (the way use, require, or import would in other languages). The #include statement is only there to insert the content of a file at a given point, and these files usually contain only function prototypes, variable declarations and some preprocessor #define clauses, which form the API of an external resource that is linked to your program later.
These external resources are typically other object modules (when you have multiple *.c files within the same project and you don't need to recompile them all from scratch each time), static libraries, or dynamic libraries. The latter are DLL files under Windows and *.so files under Unix. In this case, the operating system automatically loads the required libraries when you run your program.
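A small sketch of that separation (the file names here are just for illustration): the header only carries a declaration that #include pastes into each source file, while the code for the function lives in a separately compiled object that the linker joins in.

/* util.h -- only a prototype; #include literally inserts these lines */
#ifndef UTIL_H
#define UTIL_H
int twice(int x);
#endif

/* util.c -- the real definition; compiled separately into util.o */
#include "util.h"
int twice(int x) { return 2 * x; }

/* main.c -- relies on the prototype only; the call is resolved at link time,
 * e.g.  gcc -c main.c util.c && gcc main.o util.o -o demo */
#include <stdio.h>
#include "util.h"
int main(void)
{
    printf("%d\n", twice(21));
    return 0;
}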

Is it possible to produce working binary without linker?

As far as I know, the compiler converts source code to machine code. But this code does not have any OS-related sections, and the linker adds them to the file.
But is it possible to make an executable without a linker?
Answering your question very literally: yes, it is possible to make an executable file without a linker; you don't need a compiler or linker to generate machine code. Binaries are a series of opcodes and relevant information (offsets, addresses, etc.). You can open a binary editor, type out some opcodes, and make a program. Save it and run it.
Of course the binary will be processor specific, just as if you had compiled a binary (native) executable. Here's a reference to the Intel x86 opcodes:
http://ref.x86asm.net/coder32.html.
If, however, you're asking, "Can I compile a source file directly into an executable file without a linker?" then speaking purely: no, unless the compiler has aspects of a linker integrated within it. The compiler generates intermediate objects that are passed on to the linker to "link" them into a binary such as a library or executable. Without the link step the pipeline is not complete.
Let's first make a statement that is to be considered true: compilers do not generate machine code that can be immediately executed (JITs do, but let's ignore that).
Instead they generate files (object, static, dynamic, executable) which describe what they contain, as well as groups of symbols. Symbols can be global variables or functions.
But symbols, just like the file itself, contain metadata. This metadata is very important. You see, the machine code stored for a symbol is the raw instructions for the target architecture, but it does not know where in memory it will end up.
While modern CPUs give each process its own address space, a symbol may not land, and probably won't land, at the same address twice. In very recent times this is a security measure, but in the past it was so that dynamic linking works correctly.
So when the OS loads an executable or shared library it can place it wherever it wants, and by doing so makes the placement non-repeatable. Otherwise we'd all have to start caring and saying "this file contains 100% of the code I intend to execute". Usually, on load, the raw binary referenced by the symbol table gets transformed by patching it with the symbol locations in RAM, making everything just work.
In summary, the compiler emits files that allow for dynamic patching of assembly prior to execution. If it didn't, we would be living in a very restrictive and problematic world.
Linkers even have scripts to change how they operate. They are a very complex and delicate piece of software required to make our programs work.
Have a read of the PE-COFF and ELF standards if you want to get an idea of just how complex those formats really are.
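A minimal sketch of such a patchable reference (the exact relocation type names printed by tools like objdump -r are architecture specific):

/* reloc.c -- compile with: gcc -c reloc.c, then run: objdump -r reloc.o
 * The object file cannot know where shared_total will live, so it records
 * a relocation entry naming the symbol; the linker and/or dynamic loader
 * patch in the real address later.
 */
extern long shared_total;     /* defined in some other translation unit */

void add_sample(long value)
{
    shared_total += value;    /* this access is the spot that gets patched */
}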

Add resources to ELF object, grouped in one section

A small program I made contains a lot of small bitmaps and sound clips that I would prefer to include into the binary itself (they need to be memory mapped anyway). In the MS PE/COFF standard, there is a specific description on how to include resources (the .rsrc section) that has a nice file system-like hierarchy. I have not found anything like that in the Linux ELF specification, thus I assume one is free to include these resources as seemed fit.
What I want to achieve is that I can include all resources in only one ELF section with a symbolic name on the start of each resource (so that I can address them from my C code). What I am doing now is using a small NASM file that has the following layout:
SECTION .rsrc
_resource_1:
    incbin "../rsrc/file_name_1"
_resource_1_length:
    dw $-_resource_1
_resource_2:
    incbin "../rsrc/file_name_2"
_resource_2_length:
    dw $-_resource_2
...
I can easily assemble this to an ELF object that can be linked with my C code. However, I dislike the use of assembly as that makes my code platform-dependent.
What would be a better way to achieve the same result?
This question has already been asked on stackoverflow, but the proposed solutions are not applicable to my case:
The solution proposed over here: C/C++ with GCC: Statically add resource files to executable/library
Including the resources as hex arrays in C code is not really useful, as that mixes the code and the data in one section. (Besides, it's not practical either, as I can't preview the resources once they are converted to arrays)
Using objcopy --add-section on every resource works, but then every resource gets its own section (including header and all that). That seems a little wasteful as I have around 120 files to include (each of +/- 4K).
You're wrong in saying that using hex arrays mixes data and code: ELF files will split them by default; in particular, if you define the hex array as a constant array, it will end up in .rodata. See an old post of mine for more details on .rodata.
Adding resources with objcopy --add-section does create multiple sections in the object file, but they should all be merged in the output executable, although then you would almost certainly have some extra padding. Another post on a related topic.
Your other alternative, if you want to go from the actual binary file (say a PNG) to ELF, is to use linker scripts, which allow you to build ELF files with arbitrary sections/symbols, reading the data from files. You'll still need custom build rules to produce your ELF file.
I'm actually surprised this kind of resource management is not used more often for ELF, especially since, for many small files, it'll improve filesystem performance quite quickly, as then you only have one file to map rather than many.
If your resources are not too large, you can translate them into C/C++ source code, for example as an unsigned char array. Then you can access them as global variables, and compile & link them like normal source code.
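For instance (a sketch; the array below is a hypothetical resource whose contents have been elided, and tools such as xxd -i can generate this kind of file from a real binary):

/* logo_png.c -- hypothetical resource converted to a C array */
const unsigned char logo_png[] = {
    0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a  /* ...rest of the file's bytes... */
};
const unsigned int logo_png_len = sizeof logo_png;

/* Elsewhere, after declaring:
 *   extern const unsigned char logo_png[];
 *   extern const unsigned int  logo_png_len;
 * the data can be used like any other read-only global. */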

What is the difference between ELF files and bin files?

The final images produced by compilers contain both a bin file and an Executable and Linkable Format (ELF) file. What is the difference between the two, and in particular what is the utility of the ELF file?
A bin file is a pure binary file with no memory fix-ups or relocations; more than likely it is meant to be loaded at a specific memory address. Whereas...
ELF files are Executable and Linkable Format files, which contain symbol look-up and relocation tables; that is, they can be loaded at any memory address by the kernel, and automatically all symbols used are adjusted to the offset from the memory address where the file was loaded. Usually ELF files have a number of sections, such as 'data', 'text', and 'bss', to name but a few... it is within those sections that the run-time can calculate where to adjust the symbol's memory references dynamically at run time.
A bin file is just the bits and bytes that go into the ROM or to a particular address from which you will run the program. You can take this data and load it directly as is, but you need to know what the base address is, as that is normally not in there.
An ELF file contains the bin information, but it is surrounded by lots of other information: possible debug info, symbols, and the ability to distinguish code from data within the binary. It allows for more than one chunk of binary data (when you dump one of these to a bin you get one big bin file with fill data to pad it out to the next block). It tells you how much binary you have and how much bss data there is that wants to be initialised to zeros (GNU tools have problems creating bin files correctly).
The ELF file format is a standard; ARM publishes its enhancements/variations on the standard. I recommend everyone write an ELF parsing program to understand what is in there; don't bother with a library, it is quite simple to just use the information and structures in the spec. Doing so helps to work around GNU problems with creating .bin files, as well as with debugging linker scripts and other things that can mess up your bin or ELF output.
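As a starting point for such a parser, here is a minimal sketch that only checks the ELF identification bytes at the start of the file (the offsets and values follow the e_ident layout in the ELF spec; a real parser would go on to read the full header and section table):

/* elfmagic.c -- usage: ./elfmagic some_file.o */
#include <stdio.h>

int main(int argc, char **argv)
{
    unsigned char ident[16];
    FILE *f;

    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    f = fopen(argv[1], "rb");
    if (!f) { perror(argv[1]); return 1; }
    if (fread(ident, 1, 16, f) != 16) { fclose(f); fprintf(stderr, "file too short\n"); return 1; }
    fclose(f);

    /* magic: 0x7f 'E' 'L' 'F'; ident[4] is the class, ident[5] the byte order */
    if (ident[0] == 0x7f && ident[1] == 'E' && ident[2] == 'L' && ident[3] == 'F')
        printf("ELF file, %s-bit, %s-endian\n",
               ident[4] == 1 ? "32" : "64",
               ident[5] == 1 ? "little" : "big");
    else
        printf("not an ELF file\n");
    return 0;
}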
some resources:
ELF for the ARM architecture
http://infocenter.arm.com/help/topic/com.arm.doc.ihi0044d/IHI0044D_aaelf.pdf
ELF from wiki
http://en.wikipedia.org/wiki/Executable_and_Linkable_Format
The ELF format is generally the default output of compiling.
If you use GNU toolchains, you can translate it to binary format by using objcopy, such as:
arm-elf-objcopy -O binary [elf-input-file] [binary-output-file]
or by using the fromELF utility (built into most IDEs such as ADS):
fromelf -bin -o [binary-output-file] [elf-input-file]
bin is the final way that the memory looks before the CPU starts executing it.
ELF is a cut-up/compressed version of that, which the CPU/MCU thus can't run directly.
The (dynamic) linker first has to sufficiently reverse that (and thus modify offsets back to the correct positions).
But there is no linker/OS on the MCU, hence you have to flash the bin instead.
Moreover, Ahmed Gamal is correct.
Compiling and linking are separate stages; the whole process is called "building", hence the GNU Compiler Collection has separate executables: one for the compiler (which technically outputs assembly), another one for the assembler (which outputs object code in the ELF format), then one for the linker (which combines several object files into a single ELF file), and finally, at runtime, there is the dynamic linker, which effectively turns an ELF into a bin, but purely in memory, for the CPU to run.
Note that it is common to refer to the whole process as "compiling" (as in GCC's name itself), but that then causes confusion when the specifics are discussed, such as in this case, and Ahmed was clarifying. It's a common problem due to the inexact nature of human language itself.
To avoid confusion, GCC outputs object code (after internally using the assembler) using the ELF format. The linker simply takes several of them (with an .o extension) and produces a single combined result, probably even compressing them (into "a.out"). But all of them, even ".so", are ELF.
It is like several Word documents, each ending in ".chapter", all being combined into a final ".book", where all files technically use the same standard/format and hence could have had ".docx" as the extension. The bin is then kind of like converting the book into a ".txt" file while adding as much whitespace as necessary to be equivalent to the size of the final book (printed on a single spool), with places for all the pictures to be overlaid.
I just want to correct a point here: the ELF file is produced by the linker, not the compiler.
The compiler's mission ends after producing the object files (*.o) from the source code files. The linker links all the .o files together and produces the ELF.
