So I have a Raspberry Pi Zero, and I followed along with a really good tutorial as a starting point for programming it in bare-metal C. Everything has been working well.
Now, for what I want to do, I need unsigned integers with a size of 256 or 512 bits, so I went looking for libraries. I found BigDigits and got it to work easily on my machine.
When I tried to compile it with the rest of my actual bare-metal code, though (without even including it or using it anywhere in my code), it compiled and linked without warnings or errors, but my code doesn't work anymore, i.e. my Raspberry Pi no longer does what it did before.
I'm still pretty new to bare-metal programming. I know that there might be system functions used by the library that are not implemented and might therefore not work correctly. But I'm not even calling any BigDigits function, nor am I including any of their headers.
So why does it compile and link but not work? And how could I make it work, or are there other options that would be easier to use for arbitrary precision in a bare-metal C environment? I always know at compile time what precision I need, so I'd be happy with just a uint256_t type or something like that, but I couldn't find anything of the sort.
Thanks in advance!
Bignum libraries like this can be written in assembly, or included in C as external/linked (or inline) assembly code, as in Assembly big numbers calculator or http://x86asm.net/articles/working-with-big-numbers-using-x86-instructions/. That code would have to be ported to ARM assembly, though (https://azeria-labs.com/arm-data-types-and-registers-part-2/). Mixing assembly into C is covered at https://www.devdungeon.com/content/how-mix-c-and-assembly and https://en.wikibooks.org/wiki/Embedded_Systems/Mixed_C_and_Assembly_Programming
See also https://web.sonoma.edu/users/f/farahman/sonoma/courses/es310/310_arm/lectures/Chapter_3_Instructions_ARM.pdf, page 5.
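Since the precision is known at compile time, a full bignum library may not even be needed. Here is a minimal sketch in plain C (no assembly, no library), assuming a 256-bit value stored as four 64-bit limbs; the type name u256 and the function u256_add are invented for illustration and are not part of BigDigits or any other library.

#include <stdint.h>

typedef struct {
    uint64_t limb[4];                /* limb[0] is the least significant 64 bits */
} u256;

/* r = a + b (mod 2^256); returns the carry out of the top limb */
static uint64_t u256_add(u256 *r, const u256 *a, const u256 *b)
{
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t ai  = a->limb[i];
        uint64_t sum = ai + b->limb[i];
        uint64_t c1  = (sum < ai);          /* carry from a + b */
        r->limb[i]   = sum + carry;
        uint64_t c2  = (r->limb[i] < sum);  /* carry from adding the old carry */
        carry = c1 | c2;                    /* at most one of c1, c2 can be set */
    }
    return carry;
}

Subtraction, comparison, and shifts follow the same limb-by-limb pattern; multiplication needs double-width partial products (e.g. 32x32 -> 64), which is where hand-written assembly or a library starts to pay off.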
I have two questions:
Is it possible to add custom DWARF information to the resulting binary of a C program? (I explain below why I want to do this.)
How does DWARF work?
First of all, I don't understand DWARF. I tried to read some docs on dwarfstd.org, but I think it's too advanced for me. Maybe someone could give me some basic pointers to help me dig deeper (finding an entry point is a bit difficult for me).
Why do I want to do this? I like playing around with writing my own compiler and implementing my own language. My goal is a compiled language, not an interpreted or JIT-compiled one. So I have several options for a backend: C, opcodes, assembly, LLVM, and probably a lot more.
Because LLVM is a C++ library (and I have no clue about C++), I tried it a little using the C wrapper. Since I'm a newbie at C too, I didn't get it working easily (but I didn't investigate much). The problem with opcodes and assembly is that the learning curve is steeper than LLVM's, and I'm even less than a newbie on that topic.
So I would like to use C as a backend... but I see a problem: debugging info. The generated C file would have different function names than my source language, and even different line numbers. I know that line numbers can be fixed using the #line directive in C (sketched below), but it's still not 100% perfect. So I'm looking for a really good solution for this before I start implementing something odd. I stumbled upon DWARF, and that's how I got to these questions.
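For reference, this is roughly what I mean by the #line approach: a small sketch of generated C, where the file name mylang.src, the function, and the line numbers are all invented for illustration.

/* compiler-generated C: #line makes the C compiler's diagnostics and debug
   info point back at the original source instead of the generated file */
#line 12 "mylang.src"            /* the next line is reported as mylang.src:12 */
int my_add(int a, int b) {
#line 13 "mylang.src"
    return a + b;
}

The problem is that this only remaps file names and line numbers, not variable or function names, which is why I'm looking at DWARF.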
If anyone knows a well documented alternative to LLVM which would fit my requirements, your welcome to tell me :)
My requirements for target platform are at least: x86, x64 and ARM
I have the compiled C code in text format. I need to extract the source code by decompiling the machine code. How can I do that?
"True" decompiling is, basically, impossible. Foremost, you can't "decompile" local names (in functions and source code files / modules). For those, you'll get something like, for int local variables: i1, i2... Of course, unless you also have debug information, which is not often the case.
Decompiling to "something" (which might not be very readable) is possible, but it usually relies on heuristics that recognize the code patterns particular compilers generate, and it can be fooled into producing strange (possibly even incorrect) C code. In practice that means a decompiler usually works OK for a certain compiler with certain (default) compile options, but not so well with others.
Having said that, decompilers do exist, and you can try your luck with, say, Snowman.
As Srdjan has said, in general decompilation of a C (or C++) program is not possible; too much information is lost during the compilation process. For example, consider a declaration such as int x: it is 'lost' because it does not directly produce any machine-level instruction. The compiler needs that information only for type checking.
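To illustrate the kind of information that disappears (the function and variable names below are invented), the following two functions differ at the source level in names and types, yet a typical compiler will usually reduce both to essentially the same machine code, leaving a decompiler nothing from which to recover the original names or types.

int scale_price(int cents)
{
    int doubled = cents * 2;    /* the name "doubled" leaves no trace in the binary */
    return doubled;
}

unsigned scale_count(unsigned n)
{
    return n * 2u;              /* typically compiles to the same instructions as above */
}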
It is, however, possible to disassemble, which takes the compiled executable back up a level to assembly language. Interpreting the assembly might (will?) be difficult and is certainly time-consuming. There are several disassemblers available; if you have money, IDA Pro is probably the industry standard, and if you do this type of work it is well worth the several thousand dollars per license. There are also a number of open-source disassemblers, which Google can find for you.
That being said, there have been efforts to create decompilers: IDA Pro has one, and you can look at http://boomerang.sourceforge.net/ in addition to Snowman, linked above.
Lastly, other languages are friendlier towards decompilation than C or C++. For example, a C# program can be decompiled with tools like dotPeek or ILSpy. Similarly, for Java there are a number of tools that can convert Java bytecode back into Java source.
Please post a sample of the "compiled C code in text format."
Perhaps then it will be easier to see what you are trying to achieve.
Typically it is not practical to reverse-engineer assembly language into C, because much of the human-readable information, in the form of labels and variable names, is permanently lost in the compilation process.
I'm trying to learn x86. I thought this would be quite easy to start with: I'll just compile a very small program containing basically nothing and see what the compiler gives me. The problem is that it gives me a ton of bloat ("This program cannot be run in DOS mode" and so on): a 25 KB file containing an empty main() calling one empty function.
How do I compile my code without all this bloat? (and why is it there in the first place?)
Executable formats contain a bit more than just the raw machine code for the CPU to execute. If you want only that, then the only option is (I think) a DOS .com file, which essentially is just a bunch of code loaded into a page and then jumped into. Some software (e.g. Volkov Commander) made clever use of that format to deliver quite a lot in very little executable code.
Anyway, the PE format which Windows uses contains a few things that are specially laid out:
A DOS stub saying "This program cannot be run in DOS mode", which is what you stumbled over
several sections containing things like program code, global variables, etc. that are each handled differently by the executable loader in the operating system
some other things, like import tables
You may not need some of those, but a compiler usually doesn't know you're trying to create a tiny executable. Nowadays the overhead is usually negligible.
There is an article out there that strives to create the tiniest possible PE file, though.
You might get better results by digging up older compilers. If you want binaries that are bare to the bone, COM files really are that, so if you get hold of an old compiler that supports generating COM binaries instead of EXE you should be set. There is a long list of free compilers at http://www.thefreecountry.com/compilers/cpp.shtml; I assume Borland's Turbo C would be a good starting point.
The bloated part could be the loader interface (required by the operating system) attached by the linker. Try adding a module with only something like:
void foo(){}
and see the disassembly (I assume that's the format the compiler 'gives you'). Of course, the details vary a lot between operating systems and compilers; there are so many!
I noticed that MinGW adds a lot of code before calling main(). I assumed it's for parsing command-line parameters, since one of those functions is called __getmainargs(). Lots of strings are also added to the final executable, such as mingwm.dll and some error strings (in case the app crashes) that say "mingw runtime error" or something like that.
My question is: is there a way to remove all this stuff? I don't need these things. I tried TCC (the Tiny C Compiler) and it did the job, but it's not cross-platform like GCC (Solaris/Mac).
Any ideas?
Thanks.
Yes, you really do need all those things. They're the startup and teardown code for the C environment that your code runs in.
Other than in non-hosted environments such as low-level embedded systems, you'll find that pretty much all C environments have something like that: things like /lib/crt0.o under some UNIX-like operating systems, or crt0.obj under Windows.
It is vital to the successful running of your code. You can freely omit library functions that you don't use (printf, abs and so on), but the startup code is needed.
Some of the things it may do are initialising atexit structures, parsing arguments, setting up structures for the C runtime library, initialising C/C++ pre-main values, and so forth.
It's highly OS-specific and, if there are things you don't want to do, you'll probably have to get the source code for it and take them out, in essence providing your own cut-down replacement for the object file.
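To give a feel for what such a cut-down replacement can look like, here is a deliberately minimal sketch, assuming a GCC-style toolchain, linking with -nostartfiles, and a program that uses no C runtime features (no stdio, no atexit, no static constructors). The exit mechanism is OS-specific, so it is only hinted at in a comment.

/* hypothetical bare-bones replacement for crt0: _start is the usual default
   entry symbol for the GNU linker; real startup code would also set up the
   runtime, pass argc/argv to main and hand the return value to the OS */
int main(void);

void _start(void)
{
    int ret = main();
    (void)ret;        /* a real crt0 would pass this to the OS exit call */
    for (;;) { }      /* placeholder: stop here instead of returning to nowhere */
}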
You can safely assume that your toolchain does not include code that is not needed and could safely be left out.
Make sure you compiled without debug information, and run strip on the resulting executable. Anything more intrusive than that requires intimate knowledge of your toolchain, and can result in rather strange behaviour that will be hard to debug - i.e., if you have to ask how it could be done, you shouldn't try to do it.
I've been playing around with LLVM hoping to learn how to use it.
However, my mind is boggled by the level of complexity of the interface.
Take, for example, their Fibonacci function:
int fib(int x) {
    if (x <= 2)
        return 1;
    return fib(x - 1) + fib(x - 2);
}
To get this to output LLVM IR, it takes 61 lines of code!!!
They also include Brainfuck, which is known for having the smallest compiler (200 bytes).
Unfortunately, with LLVM, it is over 600 lines (18 KB).
Is this the norm for compiler backends?
So far it seems like it would be far easier to do an assembly or C backend.
The problem lies with C++ and not LLVM.
Use a language designed for metaprogramming, like OCaml, and your compiler will be vastly smaller. For example: this OCaml Journal article describes an 87-line LLVM-based Brainfuck compiler; this mailing list post describes a complete programming language implementation, including a parser, that can compile the Fibonacci function (amongst other programs), with the whole compiler under 100 lines of OCaml code using LLVM; and HLVM is a high-level virtual machine with multicore-capable garbage collection in under 2,000 lines of OCaml code using LLVM.
Doesn't LLVM then optimise the IR depending on the specific architecture implemented in the back-end? The IR code is not directly translated 1:1 into the final binary. As far as I understand it, that's how it works. However, I have only started to play around with the back-end (I'm porting it over to a custom processor).
LLVM does require some boilerplate code, but once you understand it, it is really quite simple. Try looking for a simple GCC front end, and you will realize how clean LLVM is. I would definitely recommend LLVM over C or ASM. ASM is not portable at all, and generating source code is usually a bad thing, because it makes compiling slow.
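To make "boilerplate" concrete, here is a sketch of building the fib function from the question with the public LLVM-C API (the plain C wrapper). The calls shown exist in the LLVM-C headers, but details such as LLVMBuildCall2 vary between LLVM versions, so treat this as an outline rather than a drop-in program.

#include <llvm-c/Core.h>

int main(void)
{
    LLVMModuleRef mod = LLVMModuleCreateWithName("fib_module");
    LLVMTypeRef   i32 = LLVMInt32Type();

    LLVMTypeRef  param_types[] = { i32 };
    LLVMTypeRef  fib_type = LLVMFunctionType(i32, param_types, 1, 0);
    LLVMValueRef fib = LLVMAddFunction(mod, "fib", fib_type);

    LLVMBasicBlockRef entry = LLVMAppendBasicBlock(fib, "entry");
    LLVMBasicBlockRef base  = LLVMAppendBasicBlock(fib, "base");
    LLVMBasicBlockRef rec   = LLVMAppendBasicBlock(fib, "recurse");

    LLVMBuilderRef b = LLVMCreateBuilder();

    /* entry: if (x <= 2) goto base; else goto recurse */
    LLVMPositionBuilderAtEnd(b, entry);
    LLVMValueRef x   = LLVMGetParam(fib, 0);
    LLVMValueRef two = LLVMConstInt(i32, 2, 0);
    LLVMValueRef cmp = LLVMBuildICmp(b, LLVMIntSLE, x, two, "cmp");
    LLVMBuildCondBr(b, cmp, base, rec);

    /* base case: return 1 */
    LLVMPositionBuilderAtEnd(b, base);
    LLVMBuildRet(b, LLVMConstInt(i32, 1, 0));

    /* recursive case: return fib(x - 1) + fib(x - 2) */
    LLVMPositionBuilderAtEnd(b, rec);
    LLVMValueRef x1 = LLVMBuildSub(b, x, LLVMConstInt(i32, 1, 0), "x1");
    LLVMValueRef x2 = LLVMBuildSub(b, x, two, "x2");
    LLVMValueRef c1 = LLVMBuildCall2(b, fib_type, fib, &x1, 1, "f1");
    LLVMValueRef c2 = LLVMBuildCall2(b, fib_type, fib, &x2, 1, "f2");
    LLVMBuildRet(b, LLVMBuildAdd(b, c1, c2, "sum"));

    LLVMDumpModule(mod);            /* print the textual IR to stderr */
    LLVMDisposeBuilder(b);
    LLVMDisposeModule(mod);
    return 0;
}

Most of it is mechanical; once the builder pattern clicks, adding more language constructs is mostly more of the same.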
Intermediate representations can be a bit verbose, compared with non-virtual assembler. I learned that looking at .NET IL, though I never went much further than looking. I'm not really familiar with LLVM, but I guess it's the same issue.
It kind of makes sense when you think about it, though. One big difference is that IRs have to deal with a lot of metadata. In assembler there is very little - the processor implicitly defines a lot, and conventions for things like function calls are left to the programmer/compiler to define. That's convenient, but it creates big portability and interop issues.
Intermediate representations such as .NET and LLVM care about making sure that separately compiled components can work together - even components written in different languages and compiled by different compiler front ends. That means metadata is needed to describe what is going on at a higher level than e.g. arbitrary pushes, pops and loads that might be parameter handling, but could be just about anything. The payoff is pretty big, but there's a price to pay.
There are other issues, too. The intermediate representation isn't really meant to be written by humans, but it is meant to be readable. Also, it's meant to be general enough to survive a number of versions without a completely incompatible from-scratch redesign.
Basically, in this context, explicit is almost always better than implicit, so verbosity is hard to avoid.