How to put custom DWARF into the resulting binary of a C program?

I have two questions:
Is it possible to add custom DWARF to the resulting binary of a C program? (I explain later why I want to do this.)
How does DWARF work?
First of all, I don't understand DWARF. I tried to read some docs on dwarfstd.org, but I think it's too advanced for me. Maybe someone could give me some basic pointers to help me dig deeper (the entry point is a bit difficult for me).
Why do I want to do this? I like playing around with writing my own compiler and implementing my own language. My goal is a compiled language, not an interpreted or JIT-compiled one. So I have several options as a backend: C, raw opcodes, assembly, LLVM, and probably quite a few more.
Because LLVM is a C++ library (and I have no clue about C++), I tried it a little bit using the C wrapper. Since I'm a newbie in C too, I didn't get it working easily (but I didn't investigate much either). The problem with opcodes and assembly is that the learning curve is even steeper than LLVM's, and I'm even more of a newbie on that topic.
So I would like to use C as a backend... but I see a problem: debugging info. The resulting C file would have different function names than my source language and even different line numbers. I know that the line numbers can be patched up using the #line directive in C, but that's still not 100% perfect. So I'm looking for a really good solution for this before I start implementing something odd. I stumbled upon DWARF, and that's how I arrived at these questions.
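For what it's worth, this is roughly what I mean by using #line in the generated C (a made-up sketch; the file and function names are invented):

/* out.c -- hypothetical output of my compiler; fib.mylang is the original source */
#line 1 "fib.mylang"
int mylang_fib(int n)
{
#line 2 "fib.mylang"
    if (n <= 2) return 1;
#line 3 "fib.mylang"
    return mylang_fib(n - 1) + mylang_fib(n - 2);
}

The C compiler records the pretended file/line in the debug line table (DWARF's .debug_line on Linux), so a debugger steps through fib.mylang. What #line can't express is the mapping of variable names, types and scopes; that lives in other DWARF sections such as .debug_info, which is why I'm asking about emitting custom DWARF.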
If anyone knows a well-documented alternative to LLVM that would fit my requirements, you're welcome to tell me :)
My requirements for target platforms are at least: x86, x64 and ARM.

Related

Using multiple precision in bare-metal c

So I have a Raspberry Pi Zero and I followed along with this really cool tutorial to get a starting point for programming it in bare-metal C. Everything has been working well.
Now, for what I want to do, I need (unsigned) integers with a size of 256 or 512 bits, so I went looking for libraries. I found BigDigits and got it to work easily on my machine.
When I tried to compile it together with my actual bare-metal code, though (without even including it or using it anywhere in my code), it compiled and linked without warnings or errors, but my code doesn't work anymore, i.e. my Raspberry Pi doesn't do what it did before.
I'm still pretty new to bare-metal programming. I know that the library might use system functions that are not implemented and might therefore not work correctly. But I'm not even calling any BigDigits function, nor am I including any of its headers.
So why does it compile and link but not work? And how could I make it work, or are there other options that would be easier to use in a bare-metal C environment for arbitrary precision? I actually always know at compile time what precision I need, so I'd be happy to just have a uint256_t type or something like that, but I couldn't find anything like that.
Thanks in advance!
Bignum routines can be written in assembly, or included in C as external/linked (or inline) assembly code, as in Assembly big numbers calculator or http://x86asm.net/articles/working-with-big-numbers-using-x86-instructions/; however, that code has to be ported to ARM assembly (https://azeria-labs.com/arm-data-types-and-registers-part-2/). Mixing C and assembly is covered in https://www.devdungeon.com/content/how-mix-c-and-assembly and https://en.wikibooks.org/wiki/Embedded_Systems/Mixed_C_and_Assembly_Programming.
See also https://web.sonoma.edu/users/f/farahman/sonoma/courses/es310/310_arm/lectures/Chapter_3_Instructions_ARM.pdf, page 5.
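Since you always know the required width at compile time, another option is to skip a general bignum library and hand-roll fixed-width arithmetic in plain C. A minimal sketch (the type and function names are made up, and only addition is shown; it uses nothing beyond <stdint.h>, so it shouldn't drag in anything your bare-metal environment lacks):

#include <stdint.h>

/* 256-bit unsigned integer as four 64-bit limbs, least significant first. */
typedef struct {
    uint64_t limb[4];
} u256;

/* r = a + b; returns the final carry (1 if the result wrapped around). */
static uint64_t u256_add(u256 *r, const u256 *a, const u256 *b)
{
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t ai  = a->limb[i];
        uint64_t sum = ai + b->limb[i];
        uint64_t c1  = (sum < ai);            /* carry out of a + b      */
        r->limb[i]   = sum + carry;
        carry = c1 + (r->limb[i] < sum);      /* carry out of (+ carry)  */
    }
    return carry;
}

Subtraction works the same way with borrows; multiplication is more work (split into 32-bit limbs, or accumulate partial products into a wider type where available), but for fixed 256/512-bit sizes it stays manageable.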

How to get C source code from compiled code

I have the compiled C code in text format. I need to extract the source code by decompiling the machine code. How do I do that?
"True" decompiling is basically impossible. First of all, you can't "decompile" local names (in functions and source files / modules). For those you'll get something generic, e.g. i1, i2, ... for int local variables. Unless, of course, you also have debug information, which is not often the case.
Decompiling to "something" (which might not be very readable) is possible, but it usually relies on heuristics, recognizing code patterns that compilers generate, and it can be fooled into producing strange (possibly even incorrect) C code. In practice that means a decompiler usually works OK for a certain compiler with certain (default) compile options, but not so well with others.
Having said that, decompilers do exist and you can try your luck with, say, Snowman.
As Srdjan has said, in general decompilation of a C (or C++) program is not possible; too much information is lost during the compilation process. For example, consider a declaration such as int x: it is "lost" because it does not directly produce any machine-level instruction. The compiler needs that information only for type checking.
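To make that concrete with a tiny example (nothing compiler-specific about it): build the function below with optimizations and look at the output. The names x and scaled, the int type, and even the fact that there were two separate variables typically leave no trace in the generated instructions; they survive only in the debug information (DWARF/PDB), if that was emitted at all.

int scale(int x)
{
    int scaled = x * 3;   /* the names and the "int" type exist    */
    return scaled + 1;    /* only for the compiler's own checking  */
}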
It is, however, possible to disassemble, which takes the compiled executable back up a level, to assembly language. Interpreting the assembly might (will?) be difficult and is certainly time-consuming. There are several disassemblers available; if you have the money, IDA Pro is probably the industry standard, and if you are doing this type of work it is well worth the several thousand dollars per license. There are also a number of open-source disassemblers; Google can find them.
That being said, there have been efforts to create decompilers. IDA Pro has one, and you can also look at http://boomerang.sourceforge.net/ in addition to Snowman, linked above.
Lastly, other languages are friendlier towards decompilation than C or C++. For example, a C# program can be decompiled with tools like dotPeek or ILSpy. Similarly, for Java there are a number of tools that can convert Java bytecode back into Java source.
Please post a sample of the "compiled C code in text format."
Perhaps then it will be easier to see what you are trying to achieve.
Typically it is not practical to reverse engineer assembly language into C, because much of the human-readable information, in the form of labels and variable names, is permanently lost in the compilation process.

Convert C code to MASM32

This seems like a ridiculous question, but I really need an easy way to convert C code to MASM32 code (with the .if's and .while's). The code has a single function, but it uses structs (which, I believe, exist in MASM). I know there are a few questions like this here, and some on other sites too, but so far I couldn't find a solution to my specific problem (something MASM32-readable, not the C compiler's low-level, obfuscated pure assembly output). Does anyone know some sort of program that could make this miracle happen? It doesn't seem so difficult, as the macros in MASM are pretty much just an uglier version of C...
You can look for the command-line parameter of the Microsoft C compiler, cl, that emits an assembly listing; most C compilers provide something like it. The output .asm source may still need a few modifications for MASM, though.
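As a concrete starting point: with Microsoft's cl, the /FA family of switches asks for an assembler listing of a translation unit (a .asm file in MASM syntax) that you can then clean up by hand. The file below is just an invented example to try it on; expect the listing to contain prologue/epilogue code and decorated names rather than the high-level .if/.while macros:

/* sum_even.c -- toy translation unit; try e.g.:  cl /c /FA sum_even.c
   (should produce sum_even.asm -- written from memory, check the
   compiler documentation for the exact switch behaviour) */

struct range {
    int lo;
    int hi;
};

int sum_even(struct range r)
{
    int total = 0;
    int i;
    for (i = r.lo; i <= r.hi; i++) {
        if ((i % 2) == 0)   /* the kind of branch MASM's .if would express */
            total += i;
    }
    return total;
}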

How to create a C compiler for custom CPU?

What would be the easiest way to create a C compiler for a custom CPU, assuming of course I already have an assembler for it?
Since a C compiler generates assembly, is there some way to just define standard bits and pieces of assembly code for the various C idioms, rebuild the compiler, and thereby obtain a cross compiler for the target hardware?
Preferably the compiler itself would be written in C and built as a native executable for either Linux or Windows.
Please note: I am not asking how to write the compiler itself. I did take that course in college, I know about general compiler-compilers, etc. In this situation, I'd just like to configure some existing framework if at all possible. I don't want to modify the language, I just want to be able to target an arbitrary architecture. If the answer turns out to be "it doesn't work that way", that information will be useful to myself and anyone else who might make similar assumptions.
Here's a quick overview/tutorial on writing an LLVM backend:
This document describes techniques for writing backends for LLVM which convert the LLVM representation to machine assembly code or other languages.
[ . . . ]
To create a static compiler (one that emits text assembly), you need to implement the following:
Describe the register set.
Describe the instruction set.
Describe the target machine.
Implement the assembly printer for the architecture.
Implement an instruction selector for the architecture.
There's the concept of a cross-compiler, i.e., one that runs on one architecture but targets a different one. You can see how GCC does it (for example) and add a new architecture to the set, if that's the compiler you want to extend.
Edit: I just spotted a question from a few years ago on a GCC mailing list about how to add a new target, and someone pointed to this.
vbcc (at www.compilers.de) is a good and simple retargetable C compiler written in C. It's much simpler than GCC/LLVM. It's so simple that I was able to retarget the compiler to my own CPU with a few weeks of work, without having any prior knowledge of compilers.
The short answer is that it doesn't work that way.
The longer answer is that it does take some effort to write a compiler for a new CPU type. You don't need to create a compiler from scratch, however. Most compilers are structured in several passes; here's a typical architecture (a lot of variations are possible):
Syntactic analysis (lexer and parser), plus preprocessing for C, leading to an abstract syntax tree.
Type checking, leading to an annotated abstract syntax tree.
Intermediate code generation, leading to architecture-independent intermediate code. Some optimizations are performed at this stage.
Machine code generation, leading to assembly or directly to machine code. More optimizations are performed at this stage.
In this description, only step 4 is machine-dependent. So you can take a compiler where step 4 is clearly separated and plug in your own step 4. Doing this requires a deep understanding of the CPU and some understanding of the compiler internals, but you don't need to worry about what happens before.
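To make "plug in your own step 4" a bit more concrete, here is a deliberately tiny, made-up sketch (none of these names come from a real compiler): an architecture-independent instruction record, which is roughly what steps 1-3 hand you, plus one machine-dependent emitter that prints assembly text for a hypothetical target. A real backend also has to deal with register allocation, calling conventions, addressing modes and so on, but the separation looks like this:

#include <stdio.h>

/* A toy three-address IR: "load immediate" and "add". */
typedef enum { IR_LOADI, IR_ADD } ir_op;

typedef struct {
    ir_op op;
    int   dst, src1, src2;   /* virtual register numbers (src1 is the immediate for IR_LOADI) */
} ir_insn;

/* Step 4 for a hypothetical target: map virtual registers 1:1 onto
   machine registers r0..rN and print one instruction per IR op. */
static void emit_mytarget(const ir_insn *code, int n, FILE *out)
{
    for (int i = 0; i < n; i++) {
        switch (code[i].op) {
        case IR_LOADI: fprintf(out, "    mov r%d, #%d\n", code[i].dst, code[i].src1); break;
        case IR_ADD:   fprintf(out, "    add r%d, r%d, r%d\n", code[i].dst, code[i].src1, code[i].src2); break;
        }
    }
}

int main(void)
{
    /* IR for: r2 = 40 + 2 */
    ir_insn prog[] = {
        { IR_LOADI, 0, 40, 0 },
        { IR_LOADI, 1,  2, 0 },
        { IR_ADD,   2,  0, 1 },
    };
    emit_mytarget(prog, 3, stdout);
    return 0;
}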
Almost all CPUs that are not very small, very rare or very old have a backend (step 4) for GCC. The main documentation for writing a GCC backend is the GCC internals manual, in particular the chapters on machine descriptions and target descriptions. GCC is free software, so there is no licensing cost in using it.
1) Short answer:
"No. There's no such thing as a "compiler framework" where you can just add water (plug in your own assembly set), stir, and it's done."
2) Longer answer: it's certainly possible. But challenging. And likely expensive.
If you wanted to do it yourself, I'd start by looking at Gnu CC. It's already available for a large variety of CPUs and platforms.
3) Take a look at this link for more ideas (including the idea of "just build a library of functions and macros"); that would be my first suggestion:
http://www.instructables.com/answers/Custom-C-Compiler-for-homemade-instruction-set/
You can modify an existing open-source compiler such as GCC or Clang; other answers have provided links about where to learn more. But these compilers are not designed to be easily retargeted; they are merely "easier" to retarget than compilers wired for specific targets.
But if you want a compiler that is relatively easy to retarget, you want one in which you can specify the machine architecture in explicit terms and have some tool generate the rest of the compiler (GCC does a bit of this; I don't think Clang/LLVM does much, but I could be wrong here).
There's a lot of this in the literature; google "compiler-compiler".
But for a concrete solution for C, you should check out ACE, a compiler vendor that generates compilers on demand for customers. Not free, but I hear they produce very good compilers very quickly. I think it produces standard style binaries (ELF?) so it skips the assembler stage. (I have no experience or relationship with ACE.)
If you don't care about code quality, you can likely write a syntax-directed translation of C to assembler using a C AST. You can get C ASTs from GCC, Clang, maybe ANTLR, and from our DMS Software Reengineering Toolkit.
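For illustration, a toy version of that syntax-directed approach (everything here is invented rather than taken from any of the tools above): walk the AST bottom-up and emit one stack-machine-style instruction per node. The code quality is as poor as warned, but retargeting amounts to changing a handful of printf strings:

#include <stdio.h>

/* Minimal expression AST: integer constants and binary '+' / '*'. */
typedef struct node {
    char kind;                 /* 'k' = constant, '+' or '*' = operator */
    int  value;                /* used when kind == 'k' */
    struct node *lhs, *rhs;    /* used for operators */
} node;

/* Syntax-directed translation to a made-up stack machine:
   each node emits its children, then one instruction for itself. */
static void gen(const node *n, FILE *out)
{
    if (n->kind == 'k') {
        fprintf(out, "    push %d\n", n->value);
        return;
    }
    gen(n->lhs, out);
    gen(n->rhs, out);
    fprintf(out, "    %s\n", n->kind == '+' ? "add" : "mul");
}

int main(void)
{
    /* (2 + 3) * 4 */
    node two = { 'k', 2, 0, 0 }, three = { 'k', 3, 0, 0 }, four = { 'k', 4, 0, 0 };
    node sum  = { '+', 0, &two, &three };
    node prod = { '*', 0, &sum, &four };
    gen(&prod, stdout);
    return 0;
}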

Questions for compiling to LLVM

I've been playing around with LLVM hoping to learn how to use it.
However, my mind is boggled by the level of complexity of the interface.
Take, for example, their Fibonacci function:
int fib(int x) {
    if (x <= 2)
        return 1;
    return fib(x - 1) + fib(x - 2);
}
To get this to output LLVM IR, it takes 61 lines of code!!!
They also include Brainfuck, which is known for having the smallest compiler (200 bytes).
Unfortunately, with LLVM, it is over 600 lines (18 KB).
Is this the norm for compiler backends?
So far it seems like it would be far easier to do an assembly or C backend.
The problem lies with C++ and not LLVM.
Use a language designed for metaprogramming, like OCaml, and your compiler will be vastly smaller. For example: this OCaml Journal article describes an 87-line LLVM-based Brainfuck compiler; this mailing list post describes a complete programming language implementation, including the parser, that can compile the Fibonacci function (amongst other programs), with the whole compiler under 100 lines of OCaml using LLVM; and HLVM is a high-level virtual machine with multicore-capable garbage collection in under 2,000 lines of OCaml using LLVM.
Doesn't LLVM then optimise the IR depending on the specific architecture implemented in the back-end? The IR code is not directly translated 1:1 into the final binary. As far as I understand it, that's how it works. However, I have only started to play around with the back-end (I'm porting it over to a custom processor).
LLVM does require some boilerplate code, but once you understand it, it is really quite simple. Try looking for a simple GCC front end, and you will realize how clean LLVM is. I would definitely recommend LLVM over C or ASM. ASM is not portable at all, and generating source code is usually a bad thing, because it makes compiling slow.
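To give a feel for that boilerplate, here is roughly what fib looks like through the LLVM C API (llvm-c/Core.h). This is a sketch from memory against a recent LLVM; a few names differ between versions (e.g. LLVMBuildCall2 versus the older LLVMBuildCall), and all error handling is omitted:

#include <llvm-c/Core.h>

int main(void)
{
    LLVMModuleRef mod = LLVMModuleCreateWithName("fib_module");
    LLVMTypeRef i32 = LLVMInt32Type();
    LLVMTypeRef param_types[] = { i32 };
    LLVMTypeRef fib_type = LLVMFunctionType(i32, param_types, 1, 0);
    LLVMValueRef fib = LLVMAddFunction(mod, "fib", fib_type);

    LLVMBasicBlockRef entry = LLVMAppendBasicBlock(fib, "entry");
    LLVMBasicBlockRef base  = LLVMAppendBasicBlock(fib, "base");
    LLVMBasicBlockRef rec   = LLVMAppendBasicBlock(fib, "recurse");

    LLVMBuilderRef b = LLVMCreateBuilder();
    LLVMValueRef x = LLVMGetParam(fib, 0);

    /* entry: branch on x <= 2 */
    LLVMPositionBuilderAtEnd(b, entry);
    LLVMValueRef cond = LLVMBuildICmp(b, LLVMIntSLE, x, LLVMConstInt(i32, 2, 0), "cond");
    LLVMBuildCondBr(b, cond, base, rec);

    /* base: return 1 */
    LLVMPositionBuilderAtEnd(b, base);
    LLVMBuildRet(b, LLVMConstInt(i32, 1, 0));

    /* recurse: return fib(x - 1) + fib(x - 2) */
    LLVMPositionBuilderAtEnd(b, rec);
    LLVMValueRef a1[] = { LLVMBuildSub(b, x, LLVMConstInt(i32, 1, 0), "xm1") };
    LLVMValueRef a2[] = { LLVMBuildSub(b, x, LLVMConstInt(i32, 2, 0), "xm2") };
    LLVMValueRef f1 = LLVMBuildCall2(b, fib_type, fib, a1, 1, "f1");
    LLVMValueRef f2 = LLVMBuildCall2(b, fib_type, fib, a2, 1, "f2");
    LLVMBuildRet(b, LLVMBuildAdd(b, f1, f2, "sum"));

    LLVMDumpModule(mod);   /* prints the textual IR to stderr */
    LLVMDisposeBuilder(b);
    LLVMDisposeModule(mod);
    return 0;
}

It still has to be compiled and linked against the LLVM libraries (llvm-config --cflags and --libs core give the right flags), so it isn't tiny, but nearly all of it is mechanical.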
Intermediate representations can be a bit verbose, compared with non-virtual assembler. I learned that looking at .NET IL, though I never went much further than looking. I'm not really familiar with LLVM, but I guess it's the same issue.
It kind of makes sense when you think about it, though. One big difference is that IRs have to deal with a lot of metadata. In assembler there is very little - the processor implicitly defines a lot, and conventions for things like function calls are left to the programmer/compiler to define. That's convenient, but it creates big portability and interop issues.
Intermediate representations such as .NET and LLVM care about making sure that separately compiled components can work together - even components written in different languages and compiled by different compiler front ends. That means metadata is needed to describe what is going on at a higher level than e.g. arbitrary pushes, pops and loads that might be parameter handling, but could be just about anything. The payoff is pretty big, but there's a price to pay.
There are other issues, too. The intermediate representation isn't really meant to be written by humans, but it is meant to be readable. Also, it's meant to be general enough to survive a number of versions without a completely incompatible, from-scratch redesign.
Basically, in this context, explicit is almost always better than implicit, so verbosity is hard to avoid.

Resources