Run dynamically generated assembly in C (GNU/Linux)

I'm writing a proof-of-concept JIT compiler in C, which at the moment is generating strings of assembly code. The inline assembly functionality in C only deals with string literals that are known at compile time, so I can't use it to run my generated-at-runtime code.
I've read about using mmap() to execute generated machine code at runtime, but I'd like to avoid working with machine code if possible.
Does anyone know of any solutions? I've thought of writing it to a file and invoking the assembler & linker on said file, but that'd be messy and slow.

I think that to be truly "JIT" you need to be time-sensitive, which means generating machine code directly. You might try adding debug code that emits both the machine code to run and the equivalent assembly source for verification: run the assembler on the latter, compare its output with the machine code you generated directly, and use that to debug/validate your encoder (when possible; sometimes assemblers insist on doing their own thing rather than what you asked for).

What I've done is generate C/C++/Fortran code, compile it on the fly, link it into a DLL, and dynamically load the DLL, all of which takes on the order of a few seconds at most.
You could do the same, except generate ASM.
It's a very effective technique when you need the speed of compiled code plus the flexibility (and run-time libraries) of the language you're generating.

Related

How to get C source code from the compiled code

I have the compiled C code in text format. I need to recover the source code by decompiling the machine code. How do I do that?
"True" decompiling is, basically, impossible. First of all, you can't "decompile" local names (in functions and in source files / modules). For int local variables, you'll get something like i1, i2, ... Of course, unless you also have debug information, which is rarely the case.
Decompiling to "something" (which may not be very readable) is possible, but it usually relies on heuristics that recognize the code patterns compilers generate, and those heuristics can be fooled into producing strange (possibly even incorrect) C code. In practice that means a decompiler usually works OK for a certain compiler with certain (default) compile options, but not so well with others.
Having said that, decompilers do exist, and you can try your luck with, say, Snowman.
As Srdjan has said, in general decompilation of a C (or C++) program is not possible. Too much information is lost during the compilation process. For example, consider a declaration such as int x; this is 'lost' because it does not directly produce any machine-level instruction. The compiler needs that information only for type checking.
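You can watch that information disappear yourself. A small demonstration, assuming gcc and binutils on Linux (demo.c is just a scratch file):

```shell
printf 'int f(void) { int x = 5; return x; }\n' > demo.c
gcc -O1 -c demo.c
nm demo.o                # the function f survives as a symbol; the local x is gone
gcc -O1 -S demo.c -o -   # the whole body compiles down to: movl $5, %eax ; ret
```

The declaration, the name x, and even the variable itself leave no trace; a decompiler can only invent a placeholder name, if it reconstructs a variable at all.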
Now, however, it is possible to disassemble, which takes the compiled executable back up a level to assembly language. However, interpreting the assembly might (will?) be difficult and certainly time-consuming. There are several disassemblers available; if you have money, IDA Pro is probably the industry standard, and if you do this type of work it is well worth the several thousand dollars per license. There are also a number of open-source disassemblers; Google can find them.
Now, that being said, there have been efforts to create decompilers; IDA Pro has one, and you can look at http://boomerang.sourceforge.net/ in addition to Snowman, linked above.
Lastly, other languages are friendlier towards decompilation than C or C++. For example, C# programs can be decompiled with tools like dotPeek or ILSpy. Similarly, for Java there are a number of tools that can convert Java bytecode back into Java source.
Please post a sample of the "compiled C code in text format."
Perhaps then it will be easier to see what you are trying to achieve.
Typically it is not practical to reverse engineer assembly language into C, because much of the human-readable information, in the form of labels and variable names, is permanently lost in the compilation process.

How to write inline Assembly with Turbo C 2.01?

I want to write some inline assembly in a DOS program which is compiled using Turbo C 2.01. When I write
asm {
nop
}
the compiler claims that in-line assembly is not allowed in function ....
Any ideas?
See the Turbo C user manual, page 430:
Inline assembly not allowed
Your source file contains inline assembly language statements and you are compiling it from within the Integrated Environment. You must use the TCC command to compile this source file.
I believe you also need to pass the -B option to TCC (page 455).
Alternatively you can use __emit__ (page 103) for relatively simple code entered as machine code rather than assembler mnemonics.
It seems an odd restriction to not allow inline assembly in the IDE. You might consider "upgrading" to Turbo C++ 3.0 which I believe does allow it. I would imagine that TC++ will compile C code when presented with a .c file, or that the IDE can be set to compile C explicitly. There's a manual for that too.
Turbo C converts C code directly into machine code without using an assembler phase, and thus cannot include assembly language source within a program. What it can do, however, is use the __emit directive to insert machine code. The cleanest way to use that is probably to use a separate assembler (or perhaps DEBUG) to process the code of interest by itself into a COM file, and then enter the byte values therein into an __emit directive. Parameters are stored in ascending order left to right, starting at either BP+4 (in tiny, small, or compact model) or BP+6 (medium, large, or huge). Local variables are stored at addresses below BP.
When using Turbo Pascal, it's possible to use a handy program called "inline assembler" to convert assembly-language source into a Turbo Pascal literal-code directive. Turbo Pascal's directive is formatted differently from C's (I like Pascal's better) and can accommodate labels in ways Turbo C's cannot. Still, using __emit may have far less impact on build times than trying to use inline assembly code.

Speed up compiled programs using runtime information like for example JVM does it?

Java programs can outperform compiled programming languages like C in specific tasks. This is because the JVM has runtime information, and does JIT compiling when necessary (I guess).
(example: http://benchmarksgame.alioth.debian.org/u32/performance.php?test=chameneosredux)
Is there anything like this for a compiled language?
(I am interested in C first of all.)
After compiling the source, the developer runs it and tries to mimic typical workload.
A tool gathers information about the run, and then according to this data, it recompiles again.
gcc has -fprofile-arcs
from the manpage:
-fprofile-arcs
Add code so that program flow arcs are instrumented. During execution the
program records how many times each branch and call is executed and how many
times it is taken or returns. When the compiled program exits it saves this
data to a file called auxname.gcda for each source file. The data may be
used for profile-directed optimizations (-fbranch-probabilities), or for
test coverage analysis (-ftest-coverage).
I don't think the JVM has ever really beaten well-optimized C code.
But to do something like that for C, you are looking for profile-guided optimization, where the compiler uses runtime information from a previous run to recompile the program.
Yes, there are tools like this; it's known as "profile-guided optimization".
There are a number of such optimizations. Important ones reduce backing-store paging and improve the use of your code caches. Many modern processors have one code cache, maybe a second-level code cache or a second-level unified data-and-code cache, and maybe a third level of cache.
The simplest thing to do is to move all of your most-frequently used functions to one place in the executable file, say at the beginning. More sophisticated is for less-frequently-taken branches to be moved into some completely different part of the file.
Some instruction set architectures, such as PowerPC, have branch prediction bits in their machine code. Profile-guided optimization tries to set these more advantageously.
Apple used to provide this for the Macintosh Programmer's Workshop, on Classic Mac OS, with a tool called "MrPlus". I think GCC can do it; I expect LLVM can, but I don't know how.

Why doesn't (can't) the OS translate C code directly into machine language instead first translating it into assembly language?

As far as I've understood, when a program (written in C for example) is compiled, it is first translated into assembly language and then into machine language. Why can't (isn't) the "assembly language step" be skipped?
Your understanding is wrong: compilers do not necessarily translate C code into assembly. They usually perform several phases with internal representations, but these don't necessarily resemble human-readable assembly.
Here is a nice introduction I found for LLVM, the compiler toolkit that clang is built on.
It is easier for the compiler developers.
It is possible to write a compiler that reads C and writes object code. However, this requires the compiler writer to write all the computations that encode instructions. Instruction encodings are intricate on some machines. Additionally, there are fields to fill in that depend on other interactions, such as how far away a branch target is, which depends on what instructions are between the branch and the target.
Additionally, part of the way a compiler is written is with patterns that say things like “To increment an object x, issue an increment instruction.” In order to write object code directly, you have to encode all the instructions you want to write into those patterns. That means your patterns must have some sort of language for describing instructions.
Well, we already have a language for that: assembly language. So it is simply easier to write your patterns in ways like “To increment an object x, issue inc x.”
Modern compilers have many layers. There is a front end that reads C text (or other languages) and turns it into a language internal to the compiler. There is an optimizer that operates on the internal language (or a representation of it) and tries to improve the code. There is a back end that turns the internal language into assembly language. There is an assembler that turns the assembly into object code. And there is a linker that links object code into an executable file.
As with many complex tasks, it is simply easier for human minds to work with a complex task when it is separated into nice pieces. This reduces bugs and improves the time it takes to work with software. It also makes software flexible, because we can change the front end to support a new language (e.g., Java instead of C) or change the back end to support a new processor (change from Intel assembly to PowerPC assembly). And changing one optimizer improves all the compilers, for Java and C and Intel and PowerPC.
The gcc command that we use to compile is actually just a driver that calls other programs that perform the front-end processing, the optimization, the assembly, and the linking. You can also call most of these phases separately, or use a switch to tell gcc to show you the commands it is using.
Additionally, GCC has a feature that allows developers to insert assembly language directly intermixed with the C code. This compels GCC to include an assembler.
The operating system does not do anything like that; this is the compiler's job. In fact, many compilers directly emit object files, and you have to explicitly ask them to emit assembly code. Others choose not to, because emitting a fully-featured object file requires expert knowledge of the various formats that exist for this. Assemblers have convenience features that make the job easier and can (sometimes?) target multiple object file formats without changes in the assembly code. Also, emitting annotated assembly code is a very useful feature, so not maintaining a separate code generator just for direct object-file emission saves effort at little cost (beyond needing an assembler), which makes it an attractive option when you have limited resources.
Depends on the compiler; there is no actual need for the assembly code.
Maybe the authors of whatever compiler you are talking about (GCC?) considered it slightly easier for themselves if they didn't have to resolve certain things, like branches, on their own.
Assembly code is purely a convenient, somewhat-human-readable representation of the machine code and the symbolic references and relocations needed by the linker when putting together the output of different translation units. Without an intermediate assembly-language step, the compiler would also be responsible for generating the relocations in the form the linker needs, which is doable, but painful. Since an assembler with this capability already exists for processing hand-written assembly code, it makes sense to use it.
There is not always a visible assembler stage. MSVC (cl.exe) produces machine code (.obj) right away; GCC does still run an assembler internally, but it pipes to it behind the scenes.
A cross compiler can directly generate machine code for a target other than the OS it is installed on.
For example, the Tornado package installed on Windows can generate machine code for VxWorks.

How do i compile a c program without all the bloat?

I'm trying to learn x86. I thought this would be quite easy to start with: I'll just compile a very small program containing basically nothing and see what the compiler gives me. The problem is that it gives me a ton of bloat ("This program cannot be run in DOS mode" and so on): a 25 KB file containing an empty main() calling one empty function.
How do I compile my code without all this bloat? (and why is it there in the first place?)
Executable formats contain a bit more than just the raw machine code for the CPU to execute. If you want only that, then the only option is (I think) a DOS .com file, which essentially is just a bunch of code loaded into a page and then jumped into. Some software (e.g. Volkov Commander) made clever use of that format to deliver quite a lot in very little executable code.
Anyway, the PE format which Windows uses contains a few things that are specially laid out:
- a DOS stub saying "This program cannot be run in DOS mode", which is what you stumbled over
- several sections containing things like program code, global variables, etc., each handled differently by the executable loader in the operating system
- some other things, like import tables
You may not need some of those, but a compiler usually doesn't know you're trying to create a tiny executable. Usually nowadays the overhead is negligible.
There is an article out there that strives to create the tiniest possible PE file, though.
You might get better results by digging up older compilers. If you want binaries that are bare to the bone, COM files are really that, so if you get hold of an old compiler that supports generating COM binaries instead of EXE you should be set. There is a long list of free compilers at http://www.thefreecountry.com/compilers/cpp.shtml; I assume Borland's Turbo C would be a good starting point.
The bloat could be the startup/loader code (the interface the operating system requires) attached by the linker. Try adding a module with only something like:
void foo(){}
and see the disassembly (I assume that's the format the compiler 'gives you'). Of course the details vary greatly across operating systems and compilers. There are so many!
