How to write a linker

How to write a linker - c

I have written a compiler for C that outputs byte code. The reason for this was to be able to write applications for an embedded platform that runs on multiple platforms.
I have the compiler and the assembler.
I need to write a linker, and am stuck.
The object format is a custom one, designed around the byte code interpreter, so I cant really use any existing linkers.
My biggest hurdle is how to organize the object code to output the linked binary.
Dynamic linking is not necessary, at this time.
I need to get static linking working first.

Ian Lance Taylor, one of the main developers on the gold linker(now part of binutils), posted a series of blogs on how linkers work. You can find it here.

http://linker.iecc.com is the only book I know about this subject.

I second the Linkers and Loaders book. You state that your object format is a custom one. If the format is under your control, you could consider using the ELF format with your bytecode as a new machine architecture, a la x86, SPARC, ARM, etc. The GNU binutils sources are sufficiently malleable to allow you to incorporate your "architecture".

Related

Why doesn't (can't) the OS translate C code directly into machine language instead first translating it into assembly language?

As far as I've understood, when a program (written in C for example) is compiled, it is first translated into assembly language and then into machine language. Why can't (isn't) the "assembly language step" be skipped?

Your understanding is wrong, compilers do not necessarily translate C code into assembler. They usually perform several phases and have internal representations, but this doesn't necessarily resemble to a human readable assembler.
Here, I found a nice introduction for LLVM. LLVM is the compiler toolkit that is used for clang.

It is easier for the compiler developers.
It is possible to write a compiler that reads C and writes object code. However, this requires the compiler writer to write all the computations that encode instructions. Instruction encodings are intricate on some machines. Additionally, there are fields to fill in that depend on other interactions, such as how far away a branch target is, which depends on what instructions are between the branch and the target.
Additionally, part of the way a compiler is written is with patterns that say things like “To increment an object x, issue an increment instruction.” In order to write object code directly, you have to encode all the instructions you want to write into those patterns. That means your patterns must have some sort of language for describing instructions.
Well, we already have a language for that: assembly language. So it is simply easier to write your patterns in ways like “To increment an object x, issue inc x.”
Modern compilers have many layers. There is a front end that reads C text (or other languages) and turns it into a language internal to the compiler. There is an optimizer that operates on the internal language (or a representation of it) and tries to improve the code. There is a back end that turns the internal language into assembly language. There is an assembler that turns the assembly into object code. And there is a linker that links object code into an executable file.
As with many complex tasks, it is simply easier for human minds to work with a complex task when it is separated into nice pieces. This reduces bugs and improves the time it takes to work with software. It also makes software flexible, because we can change the front end to support a new language (e.g., Java instead of C) or change the back end to support a new processor (change from Intel assembly to PowerPC assembly). And changing one optimizer improves all the compilers, for Java and C and Intel and PowerPC.
The gcc command that we use to compile is actually just a driver that calls other programs that perform the front-end processing, the optimization, the assembly, and the linking. You can also call most of these phases separately, or use a switch to tell gcc to show you the commands it is using.
Additionally, GCC has a feature that allows developers to insert assembly language directly intermixed with the C code. This compels GCC to include an assembler.

The operating system does not do anything like that. This is the job of the compiler. And in fact, many do directly emit object files - you have to explicitly ask them to emit assembly code. Others choose not to because emitting a fully-featured object file requires expert knowledge about the various formats which exist for this. Assemblers have various convenience features which make the job easier, can (sometimes?) target multiple object file formats without changes in the assembly code. Also, it is a very useful feature to emit annotated assembly code, so not having a separate code generator only for direct object file emission saves you time without any restrictions (except needing an assembler), which makes it an attractive option when you have limited resources.

Depends on the compiler; there is no actual need for the assembly code.
Maybe the authors of whatever compiler you are talking about (GNU-CC?) considered it slightly easier for themselves if they didn't have to resolve certain things like branches themselves.

Assembly code is purely a convenient, somewhat-human-readable representation of the machine code and the symbolic references and relocations needed by the linker when putting together the output of different translation units. Without an intermediate assembly-language step, the compiler would also be responsible for generating the relocations in the form the linker needs, which is doable, but painful. Since an assembler with this capability already exists for processing hand-written assembly code, it makes sense to use it.

There is usually no assembler stage. MSVC (cl.exe) and GCC produce machine code (.obj, .o) right away.

A cross compiler can directly generate the machine code without the help of the OS where that cross compiler is installed.
For example, tornado package installed in windows can generate machine code for vxworks.

How to create a C compiler for custom CPU?

What would be the easiest way to create a C compiler for a custom CPU, assuming of course I already have an assembler for it?
Since a C compiler generates assembly, is there some way to just define standard bits and pieces of assembly code for the various C idioms, rebuild the compiler, and thereby obtain a cross compiler for the target hardware?
Preferably the compiler itself would be written in C, and build as a native executable for either Linux or Windows.
Please note: I am not asking how to write the compiler itself. I did take that course in college, I know about general compiler-compilers, etc. In this situation, I'd just like to configure some existing framework if at all possible. I don't want to modify the language, I just want to be able to target an arbitrary architecture. If the answer turns out to be "it doesn't work that way", that information will be useful to myself and anyone else who might make similar assumptions.

Quick overview/tutorial on writing a LLVM backend.
This document describes techniques for writing backends for LLVM which convert the LLVM representation to machine assembly code or other languages.
[ . . . ]
To create a static compiler (one that emits text assembly), you need to implement the following:
Describe the register set.
Describe the instruction set.
Describe the target machine.
Implement the assembly printer for the architecture.
Implement an instruction selector for the architecture.

There's the concept of a cross-compiler, ie., one that runs on one architecture, but targets a different one. You can see how GCC does it (for example) and add a new architecture to the set, if that's the compiler you want to extend.
Edit: I just spotted a question a few years ago on a GCC mailing list on how to add a new target and someone pointed to this

vbcc (at www.compilers.de) is a good and simple retargetable C-compiler written in C. It's much simpler than GCC/LLVM. It's so simple I was able to retarget the compiler to my own CPU with a few weeks of work without having any prior knowledge of compilers.

The short answer is that it doesn't work that way.
The longer answer is that it does take some effort to write a compiler for a new CPU type. You don't need to create a compiler from scratch, however. Most compilers are structured in several passes; here's a typical architecture (a lot of variations are possible):
Syntactic analysis (lexer and parser), and for C preprocessing, leading to an abstract syntax tree.
Type checking, leading to an annotated abstract syntax tree.
Intermediate code generation, leading to architecture-independent intermediate code. Some optimizations are performed at this stage.
Machine code generation, leading to assembly or directly to machine code. More optimizations are performed at this stage.
In this description, only step 4 is machine-dependent. So you can take a compiler where step 4 is clearly separated and plug in your own step 4. Doing this requires a deep understanding of the CPU and some understanding of the compiler internals, but you don't need to worry about what happens before.
Almost all CPUs that are not very small, very rare or very old have a backend (step 4) for GCC. The main documentation for writing a GCC backend is the GCC internals manual, in particular the chapters on machine descriptions and target descriptions. GCC is free software, so there is no licensing cost in using it.

1) Short answer:
"No. There's no such thing as a "compiler framework" where you can just add water (plug in your own assembly set), stir, and it's done."
2) Longer answer: it's certainly possible. But challenging. And likely expensive.
If you wanted to do it yourself, I'd start by looking at Gnu CC. It's already available for a large variety of CPUs and platforms.
3) Take a look at this link for more ideas (including the idea of "just build a library of functions and macros"), that would be my first suggestion:
http://www.instructables.com/answers/Custom-C-Compiler-for-homemade-instruction-set/

You can modify existing open source compilers such as GCC or Clang. Other answers have provided you with links about where to learn more. But these compilers are not designed to easily retargeted; they are "easier" to retarget than compilers than other compilers wired for specific targets.
But if you want a compiler that is relatively easy to retarget, you want one in which you can specify the machine architecture in explicit terms, and some tool generates the rest of the compiler (GCC does a bit of this; I don't think Clang/LLVM does much but I could be wrong here).
There's a lot of this in the literature, google "compiler-compiler".
But for a concrete solution for C, you should check out ACE, a compiler vendor that generates compilers on demand for customers. Not free, but I hear they produce very good compilers very quickly. I think it produces standard style binaries (ELF?) so it skips the assembler stage. (I have no experience or relationship with ACE.)
If you don't care about code quality, you can likely write a syntax-directed translation of C to assembler using a C AST. You can get C ASTs from GCC, Clang, maybe ANTLR, and from our DMS Software Reengineering Toolkit.

How do i compile a c program without all the bloat?

I'm trying to learn x86. I thought this would be quite easy to start with - i'll just compile a very small program basically containing nothing and see what the compiler gives me. The problem is that it gives me a ton of bloat. (This program cannot be run in dos-mode and so on) 25KB file containing an empty main() calling one empty function.
How do I compile my code without all this bloat? (and why is it there in the first place?)

Executable formats contain a bit more than just the raw machine code for the CPU to execute. If you want that then the only option is (I think) a DOS .com file which essentially is just a bunch of code loaded into a page and then jumped into. Some software (e.g. Volkov commander) made clever use of that format to deliver quite much in very little executable code.
Anyway, the PE format which Windows uses contains a few things that are specially laid out:
A DOS stub saying "This program cannot be run in DOS mode" which is what you stumbled over
several sections containing things like program code, global variables, etc. that are each handled differently by the executable loader in the operating system
some other things, like import tables
You may not need some of those, but a compiler usually doesn't know you're trying to create a tiny executable. Usually nowadays the overhead is negligible.
There is an article out there that strives to create the tiniest possible PE file, though.

You might get better result by digging up older compilers. If you want binaries that are very bare to the bone COM files are really that, so if you get hold of an old compiler that has support for generating COM binaries instead of EXE you should be set. There is a long list of free compilers at http://www.thefreecountry.com/compilers/cpp.shtml, I assume that Borland's Turbo C would be a good starting point.

The bloated module could be the loader (operating system required interface) attached by linker. Try adding a module with only something like:
void foo(){}
and see the disassembly (I assume that's the format the compiler 'gives you'). Of course the details vary much from operating systems and compilers. There are so many!

How to extract C source code from .so file?

I am working on previously developed software and source code is compiled as linux shared libraries (.so) and source code is not present. Is there any tool which can extract source code from the linux shared libraries?
Thanks,
Ravi

There isn't. Once you compile your code there is no trace of it left in the binary, only machine code.
Some may mention decompilers but those don't extract the source, they analyze the executable and produce some source that should have the same effect as the original one did.

You can try disassembling the object code and get the machine code mnemonics.
objdump -D --disassembler-options intel sjt.o to get Intel syntax assembly
objdump -D --disassembler-options att sjt.o or objdump -D sjt.o to get AT&T syntax assembly
But the original source code could never be found. You might try to reverse the process by studying and reconstruct the sections. It would be hell pain.

Disclaimer: I work for Hex-Rays SA.
The Hex-Rays decompiler is the only commercially available decompiler I know of that works well with real-life x86 and ARM code. It's true that you don't get the original source, but you get something which is equivalent to it. If you didn't strip your binary, you might even get the function names, or, with some luck, even types and local variables. However, even if you don't have symbol info, you don't have to stick to the first round of decompilation. The Hex-Rays decompiler is interactive - you can rename any variable or function, change variable types, create structure types to represent the structures in the original code, add comments and so on. With a little work you can recover a lot. And quite often what you need is not the whole original file, but some critical algorithm or function - and this Hex-Rays can usually provide to you.
Have a look at the demo videos and the comparison pages. Still think "staring at the assembly" is the same thing?

No. In general, this is impossible. Source is not packaged in compiled objects or libraries.

You cannot. But you can open it as an archive in 7-Zip. You can see the file type and size of each file separately in that. You can replace the files in it with your custom files.

Delphi dcu to obj

Is there a way to convert a Delphi .dcu file to an .obj file so that it can be linked using a compiler like GCC? I've not used Delphi for a couple of years but would like to use if for a project again if this is possible.

Delphi can output .obj files, but they are in a 32-bit variant of Intel OMF. GCC, on the other hand, works with ELF (Linux, most Unixes), COFF (on Windows) or Mach-O (Mac).
But that alone is not enough. It's hard to write much code without using the runtime library, and the implementation of the runtime library will be dependent on low-level details of the compiler and linker architecture, for things like correct order of initialization.
Moreover, there's more to compatibility than just the object file format; code on Linux, in particular, needs to be position-independent, which means it can't use absolute values to reference global symbols, but rather must index all its global data from a register or relative to the instruction pointer, so that the code can be relocated in memory without rewriting references.
DCU files are a serialization of the Delphi symbol tables and code generated for each proc, and are thus highly dependent on the implementation details of the compiler, which changes from one version to the next.
All this is to say that it's unlikely that you'd be able to get much Delphi (dcc32) code linking into a GNU environment, unless you restricted yourself to the absolute minimum of non-managed data types (no strings, no interfaces) and procedural code (no classes, no initialization section, no data that needs initialization, etc.)

(answer to various FPC remarks, but I need more room)
For a good understanding, you have to know that a delphi .dcu translates to two differernt FPC files, .ppu file with the mentioned symtable stuff, which includes non linkable code like inline functions and generic definitions and a .o which is mingw compatible (COFF) on Windows. Cygwin is mingw compatible too on linking level (but runtime is different and scary). Anyway, mingw32/64 is our reference gcc on Windows.
The PPU has a similar version problem as Delphi's DCU, probably for the same reasons. The ppu format is different nearly every major release. (so 2.0, 2.2, 2.4), and changes typically 2-3 times an year in the trunk
So while FPC on Windows uses own assemblers and linkers, the .o's it generates are still compatible with mingw32 In general FPC's output is very gcc compatible, and it is often possible to link in gcc static libs directly, allowing e.g. mysql and postgres linklibs to be linked into apps with a suitable license. (like e.g. GPL) On 64-bit they should be compatible too, but this is probably less tested than win32.
The textmode IDE even links in the entire GDB debugger in library form. GDB is one of the main reasons for gcc compatibility on Windows.
While Barry's points about the runtime in general hold for FPC too, it might be slightly easier to work around this. It might only require calling certain functions to initialize the FPC rtl from your startup code, and similarly for the finalize. Compile a minimal FPC program with -al and see the resulting assembler (in the .s file, most notably initializeunits and finalizeunits) Moreover the RTL is more flexible and probably more easily cut down to a minimum.
Of course as soon as you also require exceptions to work across gcc<->fpc bounderies you are out of luck. FPC does not use SEH, or any scheme compatible with anything else ATM. (contrary to Delphi, which uses SEH, which at least in theory should give you an advantage there, Barry?) OTOH, gcc might use its own libunwind instead of SEH.
Note that the default calling convention of FPC on x86 is Delphi compatible register, so you might need to insert proper cdecl (which should be gcc compatible) modifiers, or even can set it for entire units at a time using {$calling cdecl}
On *nix this is bog standard (e.g. apache modules), I don't know many people that do this on win32 though.
About compatibility; FPC can compile packages like Indy, Teechart, Zeos, ICS, Synapse, VST
and reams more with little or no mods. The dialect levels of released versions are a mix of D7 and up, with the focus on D7. The dialect level is slowly creeping to D2006 level in trunk versions. (with for in, class abstract etc)

Yes. Have a look at the Project Options dialog box:
(High-Res)

As far as I am aware, Delphi only supports the OMF object file format. You may want to try an object format converter such as Agner Fog's.

Since the DCU format is proprietary and has a tendency of changing from one version of Delphi to the next, there's probably no reliable way to convert a DCU to an OBJ. Your best bet is to build them in OBJ format in the first place, as per Andreas's answer.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight