Transforming executables, objects or binaries between different architectures - c

Beforehand: This is just a nasty idea I had last night :-)
Think about the following scenario:
You have some arm-elf executable and for some reason you want to run it on your amd64 box without emulation.
To simplify the scenario, let's say we just want to deal with simple console applications which are just linked against libc and there are no additional architecture specific requirements.
If you want to transform binaries between different architectures you have to consider the following points:
Endianness of the architectures
bit-width of registers
functionality of different registers
Endianness should be one of the lesser problems.
If the bit-width of the destination registers is smaller than that of the source architecture, one could insert additional instructions to reproduce the same behaviour. The same applies to the functionality of registers.
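To make that concrete, here is a hedged C-level sketch of what "insert additional instructions" means: emulating a 64-bit addition on a target whose registers are only 32 bits wide, by splitting the value into halves and propagating the carry by hand (the function name and representation are invented for illustration):

```c
#include <stdint.h>

/* Add two 64-bit values represented as (high, low) 32-bit halves,
 * the way a translator would expand one wide add into several
 * narrow instructions. */
void add64(uint32_t ah, uint32_t al,
           uint32_t bh, uint32_t bl,
           uint32_t *rh, uint32_t *rl)
{
    uint32_t low = al + bl;
    uint32_t carry = (low < al);  /* unsigned wraparound signals a carry */
    *rl = low;
    *rh = ah + bh + carry;
}
```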
Finally (and before bashing this idea down), have a look at the following simple code snippet and its corresponding disassembly of the objects.
(The original C snippet and its corresponding ARM and AMD64 disassembly listings are not reproduced here.)
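Since the original listing isn't preserved, here is a representative stand-in of the kind of snippet such a comparison would use: a trivial function plus a printf call into libc (my example, not the original author's):

```c
#include <stdio.h>

static int add(int a, int b)
{
    return a + b;  /* typically a single instruction on both ARM and AMD64 */
}

int main(void)
{
    printf("%d\n", add(2, 3));  /* the libc call that would need mapping */
    return 0;
}
```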
In my opinion it should be possible to convert those objects between different architectures. Even function calls (like printf) could be mapped or wrapped to the destination architecture's libc.
And now my questions:
Has anyone already thought about realising this?
Is it actually possible?
Are there already some projects dealing with this issue?
Thanks in advance!

Related

Low level languages and their dependencies

I am trying to understand exactly what it means that low-level languages are machine-dependent.
Let's take C, for example: if it is machine-dependent, does that mean that a program compiled on one computer might not run on another?
In the end, processors execute machine code, which is basically a collection of binary numbers. The processor decodes each binary number to figure out what it is supposed to do. One binary number could mean "Add register X to register Y and store the result in register Z". Another binary number could mean "Store the content of register X into the memory address held by register Y". And so on...
The complete description of these decoding rules (i.e. binary number into operation) represents the processor's instruction set (aka its ISA).
A low-level language is a language where the code you write maps very closely to a specific processor's instruction set. Assembly is one obvious example. Since different processors may have different instruction sets, it's clear that an assembly program written for one processor's ISA can't be used on a processor with a different ISA.
Let's take C, for example: if it is machine-dependent, does that mean that a program compiled on one computer might not run on another?
Correct. A program compiled for one processor (family) can't run on another processor with a (completely) different ISA. The program needs to be recompiled.
Also notice that the target OS also plays a role. If you use the same processor but use different OS you'll also need to recompile.
There are at least 3 different kinds of languages.
A language that is so close to the target system's ISA that the source code can only be used on that specific target. Example: Assembly.
A language that allows you to write code that can be used on many different targets via a target-specific compilation. Example: C.
A language that allows you to write code that can be used on many different targets without a target-specific compilation. These still require some kind of target-specific runtime environment to be installed. Example: Java.
High-level languages are portable, meaning every architecture can run high-level programs (given a compiler or runtime for it); but, compared to low-level programs (like those written in assembly or even machine code), they are less efficient and consume more memory.
Low-level programs are known as "closer to the hardware" and so they are optimized for a certain type of hardware architecture/processor, making for faster programs that are relatively machine-dependent and not very portable.
So, a program compiled for one type of processor is not valid for other types; it needs to be recompiled.
In the beginning
When the first processors came out, there was no programming language whatsoever; you had a very long and very complicated documentation with a list of "opcodes": the codes you had to put into memory for a given operation to be executed by your processor. To create a program, you had to put a long string of numbers in memory and hope everything worked as documented.
Later came assembly languages. The point wasn't really to make algorithms easier to implement, or to make the program readable by any human without experience on the specific processor model you were working with; it was to save you from spending days and days looking things up in documentation. For this reason, there isn't "an assembly language" but thousands of them, one per instruction set (which, at the time, basically meant one per CPU model).
At this point in time, all languages were platform-dependent. If you decided to switch CPUs, you'd have to rewrite a significant portion (if not all) of your code. Recognizing that as a bit of a problem, someone created the first platform-independent language (according to this SE question it was FORTRAN in 1954) that could be compiled to run on any CPU architecture as long as someone made a compiler for it.
Fast forward a bit and C was invented. C is a platform-independent programming language, in the sense that any C program (as long as it conforms with the standard) can be compiled to run on any CPU (as long as this CPU has a C compiler). Once a C program has been compiled, the resulting file is a platform-dependent binary and will only be able to run on the architecture it was compiled for.
C is platform-dependent
There's an issue though: a processor is more than just a list of opcodes. Most processors have hardware control devices like watchdogs or timers that can be completely different from one architecture to another, even the way to talk to other devices can change completely. As such, if you want to actually run a program on a CPU, you have to include things that make it platform-dependent.
A real-life example of this is the Linux kernel. The majority of the kernel is written in C, but there's still around 1% written in different kinds of assembly. This assembly is required to do things such as initialize the CPU or use timers. Using this approach means Linux can run on your desktop x86_64 CPU, your ARM Android phone or a RISC-V SoC, but adding any new architecture isn't as simple as just "compile it with your architecture's compiler".
So... did I just say the only way to run a platform-independent program on an actual processor is to use platform-dependent code? Yes, for most architectures, you have to.
Or is it?
But there's a catch! That's only true if you want to run your code on bare metal (meaning: without an OS). One of the great things about using an OS is how abstracted everything is: you don't need to know how the kernel initializes the CPU, nor do you need to know how it gets its clock; you just need to know how to access those abstracted resources.
But if the way of accessing resources depends on the OS, aren't we back to square one? We could be, if not for the standard library! This library is used to access functions like printf in a defined way. It doesn't matter if you're working on Linux running on PowerPC or on ARM Windows; printf will always print things to the standard output the same way.
If you write standard C using only the standard library (and intend for your program to run in an OS) C is completely platform-independent!
EDIT: As said in the comments below, even that is not enough. It doesn't really have anything to do with specific CPUs, but some things such as the system function or the size of some types are documented as implementation-defined. To make C really platform-independent you need to make sure to only use well-defined functions of the standard library and learn some best practices (never rely on sizeof(int)==4, for instance).
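A minimal sketch of that best practice: use the fixed-width types from <stdint.h> instead of assuming anything about sizeof(int):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* int may be 2, 4, or 8 bytes depending on the platform... */
    printf("sizeof(int) = %zu\n", sizeof(int));

    /* ...but int32_t is exactly 4 bytes wherever it exists. */
    int32_t x = 42;
    printf("sizeof(int32_t) = %zu\n", sizeof(x));
    return 0;
}
```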
Thinking about 'what's a program' might help you understand your question. Is a program a collection of text (that you've typed in or otherwise manufactured) or is it something you run? Is it both?
In the case of a 'low-level' language like C, I'd say that the text is the program source, and that this is turned into a program (aka an executable) by a compiler. A program is something you can run. You need a C compiler for a system to be able to turn the program source into a program for that system. Once built, the program can only be run on systems close to the one it was compiled for. However, there is a more interesting, if more difficult question: can you at least keep the program source the same, so that all you need to do is recompile? The answer to this is 'sort-of no', I sort-of think. For example you can't, in pure C, read the state of the shift key. Of course operating systems provide such facilities and you can interface to those in C, but then such code depends on the OS. There might be libraries (e.g. the curses library) that provide such facilities across many OSes, and that can help to reduce the dependency, but no library can claim to portably cover all OSes.
In the case of a 'higher-level' language like Python, I'd say the text is both the program and the program source. There is no separate compilation stage with such languages, but you do need an interpreter on a system to be able to run your Python program on that system. However, the fact that this is happening may not be clear to the user, as you may well seem to be able to run your Python 'program' just by naming it, like you run your C programs. But this most likely comes down to the shell (the part of the OS that deals with commands) knowing about Python programs and invoking the interpreter for you. It can appear, then, that you can run your Python program anywhere, but in fact what you can do is pass the program to any Python interpreter.
In the zoo of programming there are not only many, very varied beasts, but new kinds of beasts arise all the time, and old beasts metamorphose. Terms like 'program', 'script' and even 'executable' are often used loosely.

Compiled on 32-bit and 64-bit but checksum is different

Two binaries were compiled, one on a 32-bit Windows 7 machine and the other on a 64-bit Windows 10 machine. All the source files and dependencies are the same; however, after compilation the checksums were compared and they are different. Could someone provide an explanation as to why?
If you're comparing a 32-bit build and a 64-bit build, you'll need to keep in mind that the compiled code will be almost completely different. 64-bit x86_64 and 32-bit x86 machine code not only vary considerably, but even if that were not the case, remember that pointers are twice the size in 64-bit code, so a lot of code will be structured differently and addresses in the code will be expressed as bigger pointers.
The technical reason is that the machine code is not the same between different architectures. Intel x86 and x86_64 have considerable variation in how the opcodes are expressed and in how programs are structured internally. If you want to read up on the differences, there's a lot of published material that can explain them, like Intel's own references.
As others have pointed out building twice on the same machine may not even produce a byte-for-byte matching binary. There will be slight differences in it, especially if there's a code-signing step.
In short, you can't do this and expect them to match.
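One easy-to-reproduce source of such mismatches, independent of architecture, is a build timestamp baked in via the standard __DATE__/__TIME__ macros; a minimal sketch:

```c
#include <stdio.h>

/* __DATE__ and __TIME__ expand to the compilation date and time,
 * so two builds of this file made a second apart already differ
 * byte-for-byte, and therefore in checksum. */
int main(void)
{
    printf("built on %s at %s\n", __DATE__, __TIME__);
    return 0;
}
```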
All the source files and dependencies are the same; however, after compilation the checksum was compared and they are different.
The checksum is calculated from the binary code that the compiler produces, not the source code. If anything is different, you should get a different checksum. Changes in the source code will change the binary, so a different checksum. But using a different compiler will also produce a different binary. Using the same compiler with different options will produce a different binary. Compiling the same program with the same compiler and the same compiler options on a different machine might produce a different binary. Compiling the same program for different processor architectures will definitely produce a different binary.

Is dividing a program into multiple files heavier than using one big file?

I am currently developing an embedded application, and the problem is that I have reached a point where the whole app is actually too heavy for the RAM.
So I am asking myself this question: would my compiled program be lighter if I refactored some of the files into one big file?
Thanks.
No, not in general.
The source-code organisation doesn't impact the memory use.
It might be possible to refactor the program to use less memory, if you for instance have pieces of functionality that both use significant amounts of memory but never run in parallel, but that's equally doable in a single file. It just requires making the sharing explicit.
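A hedged sketch of making that sharing explicit: overlay two working buffers that are never live at the same time in a union (the buffer names and sizes are invented for illustration):

```c
/* Two features that never run at the same time can share one region
 * of RAM by overlaying their working buffers in a union: the union
 * occupies 4096 bytes instead of 8192. */
static union {
    unsigned char audio_scratch[4096];  /* used only while mixing audio */
    unsigned char image_scratch[4096];  /* used only while decoding images */
} scratch;
```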
If your code would be 100% identical in the "big file" and the "many files" approach, the main difference is that addresses of variables and functions would be resolved (i.e. patched into the binary) by the linker rather than the compiler. This should normally result in the very same binary.
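As a small illustration (file names invented), the call below compiles to the same code either way; only who fills in the address of square differs:

```c
/* util.c */
int square(int x)
{
    return x * x;
}

/* main.c -- in the "many files" build the linker patches in the
 * address of square(); in a single-file build the compiler already
 * knows it. Either way the resulting binary is normally identical. */
int square(int x);

int main(void)
{
    return square(5);
}
```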
There are, however, architectures that have "short relative addressing modes" (Motorola 68k is an example) for calls and references, where the "big file approach" could actually result in (slightly) smaller code and data. The compiler cannot normally insert such short addressing modes into modules intended for linking because at compile time it is not known how far away the reference would be in the resulting program.
So, depending on the CPU you are using, you might actually gain a few bytes.

Possible to decompile DLL written in C?

I want to decompile a DLL that I believe was written in C. How can I do this?
Short answer: you can't.
Long answer: the compilation process for C/C++ is very lossy. The compiler makes a whole lot of high- and low-level optimizations to your code, and the resulting assembly code more often than not resembles nothing of your original code. Furthermore, there are different compilers on the market (each with several different active versions), and each generates its output a little differently. Without knowledge of which compiler was used, the task of decompiling becomes even more hopeless. At best, I've heard of some tools that can give you a partial decompilation, with bits of C code recognized here and there, but you're still going to have to read through a lot of assembly code to make sense of it.
That's by the way one of the reasons why copy protections on software are difficult to crack and require special assembly skills.
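As a hedged illustration of that lossiness, a decompiler typically hands back something like the second version of this function: the arithmetic survives, but the names and intent are gone (representative output, not from any particular tool):

```c
/* What you wrote: */
double celsius_to_fahrenheit(double celsius)
{
    return celsius * 9.0 / 5.0 + 32.0;
}

/* What a decompiler typically reconstructs: generic names,
 * no comments, and only as much type information as the
 * machine code itself implies. */
double sub_401000(double a1)
{
    return a1 * 9.0 / 5.0 + 32.0;
}
```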
It is possible, but extremely difficult, and it will take a ginormous amount of time even if you're pretty well versed in C, assembly and the intricacies of the operating system where this code is supposed to work.
The problem is, optimization makes compiled code hardly recognizable/understandable for humans.
Further, there will be ambiguities where the disassembler loses information: the same instruction can be encoded in different ways, and if the rest of the code depends on a particular encoding (something many disassemblers, or their users, fail to take into account), the resulting disassembly becomes incomplete or incorrect.
Self-modifying code complicates the matters as well.
See this question for more on the topic and the available tools.
You can, but only to a certain extent:
Optimizations could change the code
Symbols might have been stripped (a DLL allows its functions to be referred to by ordinal index instead of by symbol)
Some instruction combinations might not be convertible to C
and some other things I might have forgotten...

When to worry about endianness?

I have seen countless references about endianness and what it means. I have no problem with that...
However, my coding project is a simple game to run on Linux and Windows, on standard "gamer" hardware.
Do I need to worry about endianness in this case? When should I need to worry about it?
My code is simple C and SDL+GL, the only complex data are basic media files (png+wav+xm) and the game data is mostly strings, integer booleans (for flags and such) and static-sized arrays. So far no user has had issues, so I am wondering if adding checks is necessary (will be done later, but there are more urgent issues IMO).
The times when you need to worry about endianness:
you are sending binary data between machines or processes (using a network or file). If the machines may have different byte order, or the protocol used specifies a particular byte order (which it should), you'll need to deal with endianness.
you have code that accesses memory through pointers of different types (say you access an unsigned int variable through a char*; see the sketch below).
If you do these things you're dealing with byte order whether you know it or not - it might be that you're dealing with it by assuming it's one way or the other, which may work fine as long as your code doesn't have to deal with a different platform.
In a similar vein, you generally need to deal with alignment issues in those same cases and for similar reasons. Once again, you might be dealing with it by doing nothing and having everything work fine because you don't have to cross platform boundaries (which may come back to bite you down the road if that does become a requirement).
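A minimal sketch of the second case: inspecting an unsigned int through a char*, which also doubles as the classic run-time check of the host's byte order:

```c
#include <stdio.h>

int main(void)
{
    unsigned int value = 0x01020304;
    unsigned char *bytes = (unsigned char *)&value;  /* char* access is a legal type pun in C */

    /* On a little-endian machine the least significant byte comes first. */
    if (bytes[0] == 0x04)
        printf("little-endian\n");
    else
        printf("big-endian\n");
    return 0;
}
```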
If you mean a PC by "standard gamer hardware", then you don't have to worry about endianness as it will always be little endian on x86/x64. But if you want to port the project to other architectures, then you should design it endianness-independently.
Whenever you receive/transmit data over a network, remember to convert to/from network and host byte order. The C functions htons, htonl etc., or the equivalents in your language, should be used here.
Whenever you read multi-byte values (like UTF-16 characters or 32-bit ints) from a file, since that file might have originated on a system with different endianness. If the file is UTF-16 or UTF-32 it probably has a BOM (byte-order mark). Otherwise, the file format will have to specify endianness in some way.
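A hedged sketch of the file case: read the bytes individually and assemble them with shifts, so the code works regardless of the host's byte order (little-endian storage here is just an assumed file-format choice):

```c
#include <stdint.h>
#include <stdio.h>

/* Read a 32-bit value stored little-endian in the file,
 * independent of the host's own byte order. Returns 0 on
 * success, -1 on a short read. */
int read_u32_le(FILE *f, uint32_t *out)
{
    unsigned char b[4];
    if (fread(b, 1, 4, f) != 4)
        return -1;
    *out = (uint32_t)b[0]
         | (uint32_t)b[1] << 8
         | (uint32_t)b[2] << 16
         | (uint32_t)b[3] << 24;
    return 0;
}
```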
You only need to worry about it if your game needs to run on different hardware architectures. If you are positive that it will always run on Intel hardware then you can forget about it. If it will run on Linux, though, many people use architectures other than Intel, and you may end up having to think about it.
Are you distributing your game in source code form?
Because if you are distributing your game as a binary only, then you know exactly which processor families your game will run on. Also, the media files: are they user generated (possibly via a level editor) or are they really only meant to be supplied by yourself?
If this is a truly closed environment (you distribute binaries and the game assets are not intended to be customized) then you know your own endianness risks and I personally wouldn't fool with it.
However, if you are either distributing source and/or hoping people will customize their game, then you have a potential for concern. However, with most of the desktop/laptop computers around these days moving to x86 I would think this is a diminishing concern.
The problems occur with networking (in how the data is sent) and when you are doing bit fiddling on different processors, since different processors may store data differently in memory.
I believe PowerPC has the opposite endianness of the Intel boards. You might be able to have a routine that sets the endianness depending on the architecture? I'm not sure if you can actually tell what the hardware architecture is in code... maybe someone smarter than me knows the answer to that question.
Now, in reference to your statement about "standard" gamer hardware: consumer off-the-shelf solutions are really what most any standard gamer is using, so you're almost sure to get the same endianness across the board. I'm sure someone will disagree with me, but that's my $.02.
Ha... I just noticed there's a link showing up on the right that is related to the suggestion I had above:
Find Endianness through a c program
