Tutorial/resource for implementing a VM in C

For self-education purposes I want to implement a simple virtual machine for a dynamic language, preferably in C. Something like the Lua VM, or Parrot, or the Python VM, but simpler. Are there any good resources/tutorials on achieving this, apart from looking at the code and design documentation of the existing VMs?
Edit: why the close vote? I don't understand - is this not programming? Please comment if there is a specific problem with my question.

I assume you want a virtual machine rather than a mere interpreter. I think they are two points on a continuum. An interpreter works on something close to the original representation of the program. A VM works on more primitive (and self-contained) instructions. This means you need a compilation stage to translate the one to the other. I don't know if you want to work on that first or if you even have an input syntax in mind yet.
For a dynamic language, you want somewhere that stores data (as key/value pairs) and some operations that act on it. The VM maintains the store. The program running on it is a sequence of instructions (including control flow). You need to define the set of instructions. I'd suggest a simple set to start with, like:
* basic arithmetic operations, including arithmetic comparisons, accessing the store
* basic control flow
* built-in print
You may want to use a stack-based approach to arithmetic computation, as many VMs do. There isn't much that's dynamic in the above yet. To get there we want two things: the ability to compute the names of variables at runtime (this just means string operations), and some treatment of code as data. This might be as simple as allowing function references.
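A minimal sketch of such a store in C, assuming string keys and integer values (the names store_set and store_get are inventions for this sketch, not from any existing VM):

#include <string.h>

/* A toy key/value store mapping variable names to integer values.
   Linear search is plenty for a first VM. */
#define MAX_VARS 64

static struct { char name[16]; int value; } vars[MAX_VARS];
static int nvars;

void store_set(const char *name, int value) {
    for (int i = 0; i < nvars; i++)
        if (strcmp(vars[i].name, name) == 0) { vars[i].value = value; return; }
    strncpy(vars[nvars].name, name, sizeof vars[nvars].name - 1);  /* no overflow checks: sketch only */
    vars[nvars++].value = value;
}

int store_get(const char *name) {
    for (int i = 0; i < nvars; i++)
        if (strcmp(vars[i].name, name) == 0) return vars[i].value;
    return 0;    /* unset names read as zero in this sketch */
}

Since the keys are plain strings, computing a variable's name at runtime later just means building a string and passing it to these same functions.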
Input to the VM would ideally be in bytecode. If you haven't got a compiler yet this could be generated from a basic assembler (which could be part of the VM).
The VM itself consists of the following loop (a C sketch follows the list):
1. Look at the bytecode instruction pointed to by the instruction pointer.
2. Execute the instruction:
* If it's an arithmetic instruction, update the store accordingly.
* If it's control flow, perform the test (if there is one) and set the instruction pointer.
* If it's print, print a value from the store.
3. Advance the instruction pointer to the next instruction.
4. Repeat from 1.
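To make the loop concrete, here is a minimal C sketch wired to the four single-byte-operand opcodes defined later in this answer, reusing the store_set/store_get helpers from the store sketch above (all names are invented for illustration):

#include <stdio.h>

/* store_set/store_get come from the string-keyed store sketch above. */
void store_set(const char *name, int value);
int  store_get(const char *name);

/* The example program from the end of this answer, as raw bytes.
   String constants for variable names live after the code. */
static const unsigned char program[] = {
    0x01, 0x05, 0x0E,        /*  0:      LOAD 5, .x     */
    0x01, 0x03, 0x10,        /*  3: .l1: LOAD 3, .y     */
    0x02, 0x0E, 0x10, 0x0E,  /*  6:      ADD .x, .y, .x */
    0x03, 0x0E,              /* 10:      PRINT .x       */
    0x04, 0x03,              /* 12:      GOTO .l1       */
    'x', 0,                  /* 14: .x: "x"             */
    'y', 0,                  /* 16: .y: "y"             */
};

static void run(const unsigned char *code) {
    int ip = 0;                        /* 1. instruction pointer */
    for (;;) {                         /* 4. repeat              */
        switch (code[ip]) {            /* 2. decode and execute  */
        case 0x01:                     /* LOAD x, k              */
            store_set((const char *)&code[code[ip + 2]], code[ip + 1]);
            ip += 3;                   /* 3. advance             */
            break;
        case 0x02:                     /* ADD k1, k2, k3         */
            store_set((const char *)&code[code[ip + 3]],
                      store_get((const char *)&code[code[ip + 1]]) +
                      store_get((const char *)&code[code[ip + 2]]));
            ip += 4;
            break;
        case 0x03:                     /* PRINT k                */
            printf("%d\n", store_get((const char *)&code[code[ip + 1]]));
            ip += 2;
            break;
        case 0x04:                     /* GOTO a                 */
            ip = code[ip + 1];
            break;
        default:
            return;                    /* unknown opcode: halt   */
        }
    }
}

int main(void) {
    run(program);   /* prints 8, 11, 14, ... (the example never halts) */
    return 0;
}

Run against the example program, this prints 8, 11, 14, ... forever, since the program ends with an unconditional GOTO back to .l1.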
Dealing with computed variable names might be tricky: an instruction needs to specify which variables the computed names are in. This could be done by allowing instructions to refer to a pool of string constants provided in the input.
An example program (in assembly and bytecode):
offset  bytecode (hex)  source
0       01 05 0E        // LOAD 5, .x
3       01 03 10        // .l1: LOAD 3, .y
6       02 0E 10 0E     // ADD .x, .y, .x
10      03 0E           // PRINT .x
12      04 03           // GOTO .l1
14      78 00           // .x: "x"
16      79 00           // .y: "y"
The instruction codes implied are:
"LOAD x, k" (01 x k) Load single byte x as an integer into variable named by string constant at offset k.
"ADD k1, k2, k3" (02 v1 v2 v3) Add two variables named by string constants k1 and k2 and put the sum in variable named by string constant k3.
"PRINT k" (03 k) Print variable named by string constant k.
"GOTO a" (04 a) Go to offset given by byte a.
You need variants for when variables are named by other variables, etc. (and the levels of indirection get tricky to reason about). The assembler looks at the arguments like "ADD .x, .y, .x" and generates the correct bytecode for adding from string constants (and not computed variables).

Well, it's not about implementing a VM in C, but since it was the last tab I had open before I saw this question, I feel like I need to point out an article about implementing a QBASIC bytecode compiler and virtual machine in JavaScript, using the <canvas> tag for display. It includes all of the source code to get enough of QBASIC implemented to run the "nibbles" game, and is the first in a series of articles on the compiler and bytecode interpreter; this one describes the VM, and he's promising future articles describing the compiler as well.
By the way, I didn't vote to close your question, but the close vote you got was as a duplicate of a question from last year on how to learn about implementing a virtual machine. I think this question (about a tutorial or something relatively simple) is different enough from that one that it should remain open, but you might want to refer to that one for some more advice.

Another resource to look at is the implementation of the Lua language. It is a register-based VM that has a good reputation for performance. The source code is in ANSI C89, and is generally very readable.
As with most high-performance scripting languages, the end user sees a readable, high-level dynamic language (with features like closures, tail calls, immutable strings, numbers and hash tables as the primary data types, functions as first-class values, and more). Source text is compiled to the VM's bytecode for execution by a VM implementation whose outline is pretty much as described in Edmund's answer.
A great deal of effort has gone into keeping the implementation of the VM itself both portable and efficient. If even more performance is needed, a just-in-time compiler from VM bytecode to native instructions exists for 32-bit x86, and is in beta release for 64-bit.

To start with (even if it's not C, but C++) you could take a look at muParser.
It's a math expression parser that uses a simple virtual machine to execute operations. It will take some time to understand everything, but this code is simpler than a complete VM able to run a real, complete program. (By the way, I'm designing a similar lib in C#; it's in its early stages, but future versions will allow compilation to .NET IL or maybe a new simple VM like muParser's.)
Another interesting thing is NekoVM (it executes .n bytecode files). It's an open source project written in C, and its main language (.neko) is meant to be generated by source-to-source compiler technology. In the same spirit, see Haxe from the same author (open source too).

Like you, I have been studying virtual machines and compilers, and one good book I can recommend is Compiler Design: Virtual Machines. It describes virtual machines for imperative, functional, logic, and object-oriented languages by giving the instruction set for each VM along with a tutorial on how to compile a higher-level language to that VM. I've only implemented the VM for the imperative language, and already it has been a very useful exercise.
If you're just starting out then another resource I can recommend is PL101. It is an interactive set of lessons in JavaScript that guides you through the process of implementing parsers and interpreters for various languages.

I am late to the party, but I would recommend Game Scripting Mastery, which walks you through writing a working scripting language and its VM from zero, with very few prerequisites.

Related

Why use low-level languages, or something close to them (C), for embedded systems rather than a high-level language, when everything will be compiled to machine code anyway?

I have searched but I couldn't find a clear answer. If we compile the code on a (powerful) computer, then we are only sending machine instructions to the memory of the embedded device. As I understand it, this should make no difference whatever language we use, because in the end we are sending only machine code to the embedded device; the compilation, which is the expensive phase, is already done by a powerful machine!
Why use a language like C? Why not Java? We are sending machine code in the end.
The answer partly lies in the runtime requirements and platform-provided expectations of a language: The size of the runtime for C is minimal - it needs a stack and that is about it to be able to start running code. For a compliant implementation static data initialisation is required, but you can run code without it - the initialisation itself could even be written in C, and even heap and standard library initialisation are optional, as is the presence of a library at all. It need have no OS dependencies, no interpreter and no virtual machine.
Most other languages require a great deal more runtime support and this is usually provided by an OS, runtime-library, or virtual machine. To operate "stand-alone" these languages would require that support to be "built-in" and would consequently be much larger - so much so that you may as well in many cases deploy a system with an OS and/or JVM for example in any case.
There are of course other reasons why particular languages are suited to embedded systems, such as hardware level access, performance and deterministic behaviour.
While the issue of a runtime environment and/or OS is a primary reason you do not often see higher-level languages in small embedded systems, it is by no means unheard of. The .Net Micro Framework for example allows C# to be used in embedded systems, and there are a number of embedded JVM implementations, and of course Linux distributions are widely embedded making language choice virtually unlimited. .Net Micro runs on a limited number of processor architectures, and requires a reasonably large memory (>256kb), and JVM implementations probably have similar requirements. Linux will not boot on less than about 16Mb ROM/4Mb RAM. Neither are particularly suited to hard real-time applications with deadlines in the microsecond domain.
C is more-or-less ubiquitous across 8, 16, 32 and 64 bit platforms and normally available for any architecture from day one, while support for other languages (other than perhaps C++ on 32 bit platforms at least) may be variable and patchy, and perhaps only available on more mature or widely used platforms.
From a developer's point of view, one important consideration is also the availability of cross-compilation tools for the target platform and language. It is therefore a virtuous circle where developers choose C (or increasingly also C++) because that is the most widely available tool, and tool/chip vendors provide C and C++ tool-chains because that is what developers demand. Add to that the third-party support in the form of libraries, open-source code, debuggers, RTOS etc., and it would be a brave (or foolish) developer to select a language with barely any support.
It is not just high-level languages that suffer in this way. I once worked on a project programmed in Forth - a language even lower-level than C - and it was a lonely experience. While there were enthusiastic advocates of the language, they were frankly a bit nuts, favouring language evangelism over commercial success.
C has, in short, reached critical-mass acceptance and is hard to dislodge. C++ benefits from broad interoperability with C and similarly minimal runtime requirements, and from tool-chains that normally support both languages. So the only barrier to adoption of C++ is largely developer inertia, and to some extent availability on 8- and 16-bit platforms.
You're misunderstanding things a bit. Let's start by explaining the foundation of how computers work internally. I'll use simple and practical concepts here. For the underlying theories, read about Turing machines. So, what's your machine made up of? All computers have two basic components: a processor and a memory.
The memory is a sequential group of "cells" that works sort of like a table. If you "write" a value into the Nth cell, you can then retrieve that same value by "reading" from the Nth cell. This allows computers to "remember" things. If a computer is to perform a calculation, it needs to retrieve input data for it from somewhere, and to output data from it into somewhere. That place is the memory. In practice, the memory is what we call RAM, short for random access memory.
Then we have the processor. Its job is to perform the actual calculations on memory. The actual operations that are to be performed are mandated by a program, that is, a series of instructions that the processor is able to understand and execute. The processor decodes and executes an instruction, then the next one, and so on until the program halts (stops) the machine. If the program is add cell #1 and cell #2 and store result in cell #3, the processor will grab the values at cells 1 and 2, add their values together, and store the result into cell 3.
Now, there's a somewhat intrinsic question: where is the program stored, if at all? First of all, a program can't simply be hardwired into the machine; otherwise, the system is no more a computer than your microwave. To this problem there are two distinct approaches/solutions: the Harvard architecture and the von Neumann architecture.
Basically, in the Harvard architecture, the data is stored in the memory, as always. The code (or program) is stored somewhere else, usually in read-only memory. In the von Neumann architecture, code is stored in memory and is just another form of data: code is data, and data is code. It's worth noting that most modern systems use the von Neumann architecture, for several reasons, including the fact that this is the only way to implement just-in-time compilation, an essential part of runtime systems for modern bytecode-based programming languages such as Java.
We now know what the machine does, and how it does it. However, how are both data and code stored? What's the "underlying format", and how should it be interpreted? You've probably heard of this thing called the binary numeral system. In our usual decimal numeral system we have ten digits, zero through nine. But why exactly ten digits? Couldn't there be eight, or sixteen, or sixty, or even two? Be aware that it's impractical to build a unary-based computational system.
You may have heard that computers are "logical and cold". Both are true... unless your machine has an AMD processor or a special kind of Pentium. Logic theory states that every logical predicate can be reduced to either "true" or "false"; that is to say, "true" and "false" are the basis of logic. Plus, computers are made up of electrical cruft, no? A light switch is either on or off, no? So, at the electrical level, we can easily recognize two voltage levels, right? And we want to handle logical stuff, such as numbers, in computers, right? So zero and one it is - indeed, they are the only feasible solution.
Now, taking all the theory into account, let's talk about programming languages and assembly languages. Assembly languages are a way to express binary instructions in a (supposedly) readable way to human programmers. For instance, something like this...
ADD 0, 1 # Add cells 0 and 1 together and store the result in cell 0
Could be translated by an assembler into something like...
110101110000000000000001
Both are equivalent, but humans will only understand the former, and processors will only understand the latter.
A compiler is a program that translates input data that is expected to conform to the rules of a given programming language into another, usually lower-level form. For instance, a C compiler may take this code...
x = some_function(y + z);
And translate it into assembly code such as (of course this is not real assembly, BTW!)...
# Assume x is at cell 1, y at cell 2, and z at cell 3.
# Assume that, when calling a function, the first argument
# is at cell 16, and the result is stored in cell 0.
MOVE 16, 2
ADD 16, 3
CALL some_function
MOVE 1, 0
And the assembler will spit out (no, this is not random)...
11101001000100000000001001101110000100000000001110111011101101111010101111101111110110100111010010000000100000000
Now, let's talk about another language, namely Java. Java's compiler does not give you assembly/raw binary code, but bytecode. Bytecode is... like a generic, higher-level form of assembly language that the CPU can't understand (there are exceptions), but that another program running directly on the CPU does. This means that the claim some badly educated people spread around, that "both interpreted and compiled programs ultimately boil down to machine code", is false. If, for example, the interpreter is written in C and has this line of code...
Bytecode some_bytecode;
/* ... */
execute_bytecode(&some_bytecode);
(Note: I won't translate that into assembly/binary again!) The processor executes the interpreter, and the interpreter's code executes the bytecode by performing the actions the bytecode specifies. Although this can severely degrade performance if not optimized correctly, that is not the problem per se; the problem is that things such as reflection, garbage collection, and exceptions add quite some overhead. For embedded systems, whose memories are small and whose processors are slow, this is overhead you don't want: you'd be wasting precious system resources on things you don't need. If C programs are slow on your Arduino, imagine a full-blown Java/Python program with all sorts of bells and whistles! Even if you translated the bytecode into machine code before putting it on the system, support must still be there for all that extra stuff - reflection, exceptions, garbage collection, etc. - and the result is basically the same unwanted overhead/waste.
On most other environments, this is not a big deal, as memory is cheap and abundant, and processors are fast and powerful. Embedded systems have special needs, they're special by themselves, and things are not free in that land.
Why use a language like C? Why not Java? We are sending machine code in the end.
No, Java code does not compile to machine code, it needs a virtual machine (the JVM) on the target system.
You're partly right about the compilation, but "higher-level" languages can still result in less efficient machine code. For instance, the language may include garbage collection, run-time correctness checks, may not be able to use all the "native" numeric types, etc.
In general it depends on the target. On small targets (i.e. microcontrollers like AVR) you don't have complex programs running, and you need to access the hardware directly (e.g. a UART). High-level languages like Java don't support accessing the hardware directly, so you usually end up with C.
In the case of C versus Java there's a major difference:
With C you compile the code and get a binary that runs directly on the target.
Java instead creates Java Bytecode. The target CPU cannot process that. Instead it requires running another program: the Java runtime environment. That translates the Java Bytecode to actual machine code. Obviously this is more work and thus requires more processing power. While this isn't much of a concern for standard PCs it is for small embedded devices. (Note: some CPUs do actually have support for running Java bytecode directly. Those are exceptions though.)
Generally speaking, the compile step isn't the issue -- the limited resources and special requirements of the target device are.
You misunderstand something: 'compiling' Java gives a different output than compiling a low-level language. It is true that both produce machine code, but in C's case the machine code is directly executable by the processor, whereas with Java the output is an intermediate stage, bytecode, which can't be executed by the processor. It needs extra work, a translation to machine code, which is the only directly executable format; since that takes extra time, C is an attractive choice because of its speed. With a low-level language you write your code and then compile for a target machine (you need to specify the target to the compiler, since each processor has its own machine code), and then your code is understandable by the processor.
On the other hand, C allows direct hardware access, which is not allowed in Java-like languages, even via an API.
It's an industry thing.
There are three kinds of high-level languages: interpreted (Lua, Python, JavaScript), compiled to bytecode (Java, C#), and compiled to machine code (C, C++, Fortran, COBOL, Pascal).
Yes, C is a high level language, and closer to java than to assembly.
High-level languages are popular for two reasons: memory management, and a wide standard library.
Managed memory comes with a cost: somebody must manage it. That's an issue not only for Java and C#, where somebody must implement a VM, but also for bare-metal C/C++, where someone must implement the memory allocation functions.
A wide standard library can't be supported by all targets, because there aren't enough resources; e.g., an AVR Arduino doesn't support the full C++ standard library.
C gained popularity because it can easily be converted to equivalent assembly code. Most statements can be converted, without optimization, to a bunch of fixed assembly instructions, so compilers are easy to write. And its standard is compact and easy to implement. C prevailed because it became the de facto standard for the lowest high-level language of any architecture.
So in the end, besides special snowflakes like Cython, Go, Rust, Haskell, etc., industry decided that machine code is compiled from C and C++, and most optimization efforts went that way.
Languages like Java decided to hide memory from the programmer, so good luck trying to interface with low-level stuff there. Since they do that by design, almost nobody bothers trying to bring them to compete with C. Realistically, Java without GC would be C++ with different syntax.
Finally, if all the industry money goes to one language, the cheapest/easiest thing to do is choosing that language.
You are right in that you can use any language that generates machine code. But Java is not one of them. Java, Python, and even some languages that compile to machine code may have heavy system requirements. You could use Pascal, and some folks do, but C won the C-vs-Pascal war many years ago. There are some other languages that fell by the wayside which you could use if you had a compiler for them, and there are some new languages you can use, but their tools are not as mature and they don't support as many targets as one would like. It is very unlikely that they will unseat C. C is just the right amount of power/freedom: low enough and high enough.
Java is an interpreted language and (like all interpreted languages) produces intermediate code that is not directly executable by the processor. So what you would send to the embedded device would be the bytecode, and you would need a JVM running on it, interpreting your code: clearly not feasible. As for compiled languages (C, C++...), you are right that in the end you send machine code to the device. However, consider that using high-level features of a language can produce much more machine code than you would expect. If you use polymorphism, for example, you have just a function call, but when you compile it the machine code explodes. Consider also that very often the use of dynamic memory (malloc, new...) is not feasible on an embedded device.

I want to create a simple assembler in C. Where should I begin? [duplicate]

This question already has answers here:
Building an assembler
(4 answers)
How Do You Make An Assembler? [closed]
(4 answers)
Closed 9 years ago.
I've recently been trying to immerse myself in the world of assembly programming with the eventual goal of creating my own programming language. I want my first real project to be a simple assembler written in C that will be able to assemble a very small portion of the x86 machine language and create a Windows executable. No macros, no linkers. Just assembly.
On paper, it seems simple enough. Assembly code comes in, machine code comes out.
But as soon as I start thinking about all the details, it suddenly becomes very daunting. What conventions does the operating system demand? How do I align data and calculate jumps? What does the inside of an executable even look like?
I'm feeling lost. There aren't any tutorials on this that I could find and looking at the source code of popular assemblers was not inspiring (I'm willing to try again, though).
Where do I go from here? How would you have done it? Are there any good tutorials or literature on this topic?
I have written a few myself (assemblers and disassemblers) and I would not start with x86. If you know x86 or any other instruction set, you can pick up the syntax for another instruction set in short order (an evening/afternoon), at least the lion's share of it. The act of writing an assembler (or disassembler) will definitely teach you an instruction set, fast, and you will know that instruction set better than many seasoned assembly programmers who have not examined the machine code at that level. msp430, pdp11, and thumb (not thumb2 extensions) (or mips or openrisc) are all good places to start: not a lot of instructions, not overly complicated, etc.
I recommend writing a disassembler first, and for that a fixed-length instruction set like arm or thumb or mips or openrisc, etc. If not, then at least use a disassembler (definitely choose an instruction set for which you already have an assembler, linker, and disassembler) and, with pencil and paper, understand the relationship between the machine code and the assembly - in particular the branches. They usually have one or more quirks, like the program counter being an instruction or two ahead when the offset is added, or, to gain another bit, measuring in whole instructions rather than bytes.
It is pretty easy to brute-force parse the text with a C program to read the instructions. A harder task, but perhaps as educational, would be to use bison/flex and learn those tools to create (an even more extreme brute-force) parser which then interfaces with your code to tell you what was found where.
The assembler itself is pretty straightforward: just read the ASCII and set the bits in the machine code. Branches and other pc-relative instructions are a little more painful, as they can take multiple passes through the source/tables to completely resolve.
mov r0,r1
mov r2 ,#1
The assembler begins by parsing the text a line at a time (a line being defined as the bytes up to a carriage return 0x0D or line feed 0x0A). Discard the white space (spaces and tabs) until you get to something non-white-space, then strncmp that against the known mnemonics. If you hit one, then parse the possible combinations of that instruction. In the simple case above, after the mov, skip over the white space to non-white-space; perhaps the first thing you find must be a register, then optional white space, then a comma. Remove the whitespace and comma and compare that against a table of strings, or just parse through it. Once that register is done, go past where the comma was found; let's say the next thing is either another register or an immediate. If an immediate, let's say it has to have a # sign; if a register, let's say it has to start with a lower- or upper-case 'r'. After parsing that register or immediate, make sure there is nothing else on the line that shouldn't be there. Build the machine code for this instruction, or at least as much of it as you can, and move on to the next line. It may be tedious, but it is not difficult to parse ASCII...
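As a concrete (if lazy) sketch of that parse for the two mov lines above, here the skip-whitespace-and-strncmp walk is compressed into sscanf calls; the 16-bit encodings are invented purely for illustration:

#include <stdio.h>

/* Parse one line of the toy assembly above into a 16-bit instruction.
   The 0x1000/0x2000 encodings are made up for this sketch. */
static int assemble_line(const char *line) {
    int rd, rs, imm;
    if (sscanf(line, " mov r%d , r%d", &rd, &rs) == 2)    /* mov rD,rS   */
        return 0x1000 | (rd << 4) | rs;
    if (sscanf(line, " mov r%d , #%d", &rd, &imm) == 2)   /* mov rD,#imm */
        return 0x2000 | (rd << 8) | (imm & 0xFF);
    return -1;                           /* not an instruction we know   */
}

int main(void) {
    printf("%04x\n", assemble_line("mov r0,r1"));   /* prints 1001 */
    printf("%04x\n", assemble_line("mov r2 ,#1"));  /* prints 2201 */
    return 0;
}

A real assembler would walk the characters as described above and emit the target's actual encodings; sscanf is just a shortcut to keep the sketch small.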
At a minimum you will want a table/array that accumulates the machine code/data as it is created, plus some method for marking instructions as incomplete - the pc-relative instructions to be completed on a future pass. You will also want a table/array that collects the labels you find, with the address/offset in the machine code table where each was found, and likewise the labels used in instructions as a destination/source, with the offset of the partially complete instruction they go with. After the first pass, go back through these tables until you have matched up all the label definitions with the labels used as a source or destination, using the label definition's address/offset to compute the distance to the instruction in question, and then finish creating the machine code for that instruction. (Some disassembly may be required, or use some other method for remembering what kind of encoding it was, when you come back later to finish building the machine code.)
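A sketch of those two tables and the label-resolution pass, with invented names, a fixed-size code buffer, and a made-up branch encoding (one absolute offset byte right after the opcode):

#include <string.h>

/* Tables described above: label definitions, and branch sites still
   waiting for a label. Sizes and encoding are invented for illustration. */
#define MAX_ENTRIES 128

struct entry { char name[32]; int offset; };

static struct entry labels[MAX_ENTRIES]; static int nlabels;  /* ".l1" -> 3  */
static struct entry fixups[MAX_ENTRIES]; static int nfixups;  /* branch sites */
static unsigned char code[4096];

/* After the first pass: patch each waiting branch with its target.
   Returns -1 if a label was used but never defined. */
static int resolve_fixups(void) {
    for (int f = 0; f < nfixups; f++) {
        int s;
        for (s = 0; s < nlabels; s++)
            if (strcmp(fixups[f].name, labels[s].name) == 0) break;
        if (s == nlabels) return -1;              /* undefined label */
        code[fixups[f].offset + 1] = (unsigned char)labels[s].offset;
        /* a pc-relative encoding would store target - (site + size) instead */
    }
    return 0;
}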
The next step is allowing for multiple source files, if that is something you want to allow. Now you have labels that don't get resolved by the assembler, so you have to leave placeholders in the output and emit some flavor of the longest jump/branch instruction, because you don't know how far away the destination will be - expect the worst. Then there is the output file format you choose to create/use, and then there is the linker, which is mostly simple, but you have to remember to fill in the machine code for the final pc-relative instructions - no harder than it was in the assembler itself.
Note: writing an assembler is not necessarily related to creating a programming language and then writing a compiler for it - separate things, different problems. Actually, if you want to make a new programming language, just use an existing assembler for an existing instruction set. It's not required, of course, but most teachings and tutorials use the bison/flex approach for programming languages, and there are many college lecture notes/resources out there for beginning compiler classes that you can use to get started, then modify to add the features of your language. The middle and back ends are a bigger challenge than the front end. There are many books on this topic and many online resources as well. As mentioned in another answer, LLVM is not a bad place to create a new programming language: the middle and back ends are done for you, and you only need to focus on the programming language itself, the front end.
You should look at LLVM. LLVM is a modular compiler back end; the most popular front end is Clang, for compiling C/C++/Objective-C. The good thing about LLVM is that you can pick the part of the compiler chain that you are interested in and just focus on that, ignoring all of the others. If you want to create your own language, write a parser that generates the LLVM internal representation, and for free you get all of the middle-layer target-independent optimisations and compilation to many different targets. Interested in a compiler for some exotic CPU? Write a compiler back end that takes the LLVM intermediate code and generates your assembly. Have some ideas about optimisation techniques, automatic threading perhaps? Write a middle layer which processes LLVM intermediate code. LLVM is a collection of libraries, not a standalone binary like GCC, so it is very easy to use in your own projects.
What you're looking for is not a tutorial or source code, it's a specification. See http://msdn.microsoft.com/en-us/library/windows/hardware/gg463119.aspx
Once you understand the specification of an executable, write a program to generate one. The executable you build should be as simple as possible. Once you have mastered that, then you can write a simple line-oriented parser that reads instruction names and numeric arguments to generate a block of code to plug into the exe. Later you can add symbols, branches, sections, whatever you want, and that's where something like http://www.davidsalomon.name/assem.advertis/asl.pdf will come in.
P.S. Carl Norum has a good point in the comment above. If your goal is create your own programming language, learning to write an assembler is irrelevant and is very much not the right way to start (unless the language you want to create is an assembly language). There are already assemblers that produce executables from assembler source, so your compiler could produce assembler source and you could avoid the work of recreating the assembler ... and you should. Or you could use something like LLVM, which will solve many other daunting problems of compiler construction. The odds are very small that you will ever actually produce your own programming language, but they're much smaller if you start from scratch and there's no need to. Decide what your goal is and use the best tools available to achieve it.

Reverse engineer "compiled" Perl vs. C?

Have a client that's claiming compiled C is harder to reverse engineer than pseudo-"compiled" Perl bytecode, or the like. Anyone have a way to prove, or disprove, this?
I don't know too much about Perl, but I'll give some examples of why reversing code compiled to assembly is so ugly.
The ugliest thing about reverse engineering C code is that the compilation removes all type information. This total lack of names and types is the worst part, IMO.
In a dynamically typed language the compiler needs to preserve much more of that information - in particular the names of fields/methods/..., since these are usually strings for which it is impossible to find every use.
There is plenty of other ugly stuff, such as whole-program optimization using different registers to pass parameters every time, or functions being inlined so that what was once a simple function appears in many places, often in slightly different forms due to optimizations.
The same registers and bytes on the stack get reused for different content inside a function. This gets especially ugly with arrays on the stack, since you have no way to know how big an array is and where it ends.
Then there are micro-optimizations, which can get annoying. For example, I once spent more than 15 minutes reversing a simple function that was once similar to return x/1600, because the compiler decided that divisions are slow and rewrote the division by a constant into several multiplications, additions, and bitwise operations.
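To illustrate, here is a reconstruction (not the actual function from that job) of how a compiler might implement x / 1600 for 32-bit unsigned x: shift out the factor of 64, then multiply by a rounded-up fixed-point reciprocal of 25. The reverse engineer sees only the shifts and the magic constant.

#include <assert.h>
#include <stdint.h>

/* x / 1600 without a divide instruction: 1600 = 64 * 25, so shift
   out the 64, then divide by 25 via its fixed-point reciprocal,
   0x51EB851F = ceil(2^35 / 25). */
static uint32_t div1600(uint32_t x) {
    return (uint32_t)(((uint64_t)(x >> 6) * 0x51EB851Fu) >> 35);
}

int main(void) {
    for (uint32_t x = 0; x < 2000000u; x++)
        assert(div1600(x) == x / 1600);   /* spot-check the identity */
    return 0;
}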
Perl is really easy to reverse engineer. The tool of choice is vi, vim, emacs or notepad.
That does raise the question of why they're worried about reverse engineering. Machine code is normally harder to turn back into something resembling the original source code than bytecode is, but for most nefarious activities that's irrelevant. If someone wants to copy your secrets or break your security, they can do enough without turning it back into a perfect representation of your original source code.
Reverse engineering code for a virtual machine is usually easier. A virtual machine is typically designed to be an easy target for the language. That means it typically represents the constructs of that language reasonably easily and directly.
If, however, you're dealing with a VM that wasn't designed for that particular language (e.g., Perl compiled to the JVM) that would frequently put you back much closer to working with code generated for real hardware -- i.e., you have to do whatever's necessary to target a pre-defined architecture instead of designing the target to fit the source.
Ok, there has been sufficient debate on this over the years, and the results have mostly been inconclusive... mainly because it doesn't matter.
For a motivated reverse engineer, both will be the same.
If you are using pseudo-exe makers like perl2exe then that will be easier to "decompile" than compiled C, as perl2exe does not compile the Perl at all; it's just a bit "hidden" (see http://www.net-security.org/vuln.php?id=2464 - this is really old, but the concept is probably still the same; I haven't researched it recently, so I don't know for sure, but I hope you get my point).
I would advise looking at the language which is best for the job, so that maintenance and development of the actual product can be done sensibly and sustainably.
Remember you _can_not_ stop a motivated adversary, you need to make it more expensive to reverse than to write it themselves.
These 4 should make it difficult (but again not impossible)...
[1] Insert noise code (random places, random code) which does pointless maths and complex data structure interaction (if done properly this will be a great headache if the purpose is to reverse the code rather than the functionality).
[2] Chain a few (different) code obfuscators on the source code as part of build process.
[3] Apply a software protection dongle, which will prevent code execution if the h/w is not present; this means physical access to the dongle's data is required before the rest of the reversing can take place: http://en.wikipedia.org/wiki/Software_protection_dongle
[4] There are always protectors (e.g. Themida http://www.oreans.com/themida.php) you can get which will be able to protect a .exe after it has been built (regardless of how it was compiled).
... That should give the reverser enough headache.
But remember that all this will also cost money, so you should always weigh up what is it that you are trying to achieve and then look at your options.
In short: both methods are equally insecure. Unless you are using a non-compiling perl-to-exe maker, in which case the natively compiled EXE wins.
I hope this helps.
C is harder to decompile than byte-compiled Perl code. Any Perl code that's been byte-compiled can be decompiled. Byte-compiled code is not machine code like in compiled C programs. Some others suggested using code obfuscation techniques, but those are just tricks to make code harder to read and won't affect the difficulty of decompiling the Perl source. The decompiled source may be harder to read, but there are many Perl de-obfuscation tools available, and even a Perl module:
http://metacpan.org/pod/B::Deobfuscate
Perl packing programs like Par, PerlAPP or Perl2exe won't offer source code protection either. At some point the source has to be extracted so Perl can execute the script. Even packers like PerlAPP and Perl2exe, which attempt some encryption techniques on the source, can be defeated with a debugger:
http://www.perlmonks.org/?displaytype=print;node_id=779752;replies=1
It'll stop someone from casually browsing your Perl code but even the packer has to unpack the script before it can be run. Anyone who's determined can get the source code.
Decompiling C is a different beast altogether: once it's compiled, it's machine code. With most C decompilers you end up with assembly code; some of the commercial C decompilers will take the assembly code and try to generate equivalent C code, but unless it's a really simple program they are seldom able to recreate the original code.

Automated tracing use of variables within source code

I'm working with a set of speech-processing routines (written in C) meant to be compiled with the mex command in MATLAB. There is a C function which I'm interested in accelerating using an FPGA.
The hardware takes specified input parameters through input ports, treats the rest of the inputs as constants to be hard-coded, and passes a particular variable somewhere within the C function, say foo, to the output port.
I am interested in tracing the computation graph (unsure if this is the right term to use) of foo, i.e. how foo relates to intermediate computed variables, which in turn eventually depend on input parameters and hard-coded constants. This is to allow me to flatten the logic so it can be coded in a hardware description language, as well as to remove irrelevant logic which does not affect the value of foo. The catch is that some intermediate variables are global, so tracing them is a headache.
Is there an automated tool which analyzes a given set of C headers and source files and provide a means of tracing how a specified variable is altered, with some kind of dependency graph of all variables used?
I think what you are looking for is a tool to do value analysis.
Among the tools available to do this, I think Code Surfer is probably the best out there. Of course, it is also quite expensive but if you are a student, they do have an academic license program. On the open-source side, Frama-C can also do this in a more limited fashion and has a much, much steeper learning curve. But it is free and will get you where you want to go.

Convert ASM to C (not reverse engineer)

I googled and saw a surprising number of flippant responses basically laughing at the asker for asking such a question.
Microchip provides some source code for free (I don't want to post it here in case that's a no-no; basically, google AN937, click the first link, and there's a link for "source code" and it's a zipped file). It's in ASM and when I look at it I start to go cross-eyed. I'd like to convert it to something resembling a C-type language so that I can follow along, because lines such as:
GLOBAL _24_bit_sub
movf BARGB2,w
subwf AARGB2,f
are probably very simple but they mean nothing to me.
There may be some automated ASM-to-C translator out there, but all I can find are people saying it's impossible. Frankly, it's impossible for it to be impossible. Both languages have structure, and that structure surely can be translated.
You can absolutely make a C program from assembler. The problem is that it may not look like what you are thinking - or maybe it will. My PIC is rusty, but using another assembler, say you had
add r1,r2
In C, let's say that becomes
r1 = r1 + r2;
Possibly more readable. You lose any sense of variable names, perhaps, as values jump from memory to registers and back and the registers are reused. If you are talking about the older PICs, with what, two registers - an accumulator and one other - it actually might be easier, because variables were in memory for the most part; you look at the address, something like
q = mem[0x12];
e = q;
q = mem[0x13];
e = e + q;
mem[0x12] = e;
Long and drawn out, but it is clear that mem[0x12] = mem[0x12] + mem[0x13];
These memory locations are likely variables that will not jump around the way compiled C code for a processor with a bunch of registers does. The PIC might make it easier to figure out the variables, and then you can do a search and replace to name them across the file.
What you are looking for is called static binary translation - not necessarily a translation from one binary to another (one processor to another), but in this case a translation from PIC binary to C. Ideally you would want to take the assembler given in the app note, assemble it to a binary using the Microchip tools, then do the translation. You can do dynamic binary translation as well, but you are even less likely to find one of those, and it doesn't normally result in C but in another binary. Ever wonder how those $15 joysticks at Wal-Mart with Pac-Man and Galaga work? The ROM from the arcade was converted using static binary translation, optimized and cleaned up, and the C (or whatever intermediate language) compiled for the new target processor in the handheld box. I imagine not all of them were done this way, but I'm pretty sure some were.
The million-dollar question: can you find a static binary translator for a PIC? Who knows; you probably have to write one yourself. And guess what that means: you write a disassembler, and instead of disassembling to an instruction in the native assembler syntax like add r0,r1, you have your disassembler print out r0=r0+r1;. By the time you finish this disassembler, though, you will know the PIC assembly language so well that you won't need the ASM-to-C translator. You have a chicken-and-egg problem.
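As a taste of that disassemble-straight-to-C idea, a sketch that decodes a made-up 16-bit encoding (not real PIC machine code) and prints C statements instead of mnemonics:

#include <stdio.h>

/* Disassemble an invented 16-bit instruction directly into a C statement:
   op 1: "rD = rS;"   op 2: "rD = rD + rS;" (encoding made up for this sketch). */
static void print_as_c(unsigned int insn) {
    unsigned int op = (insn >> 8) & 0xF;
    unsigned int rd = (insn >> 4) & 0xF;
    unsigned int rs = insn & 0xF;
    switch (op) {
    case 1:  printf("r%u = r%u;\n", rd, rs);        break;
    case 2:  printf("r%u = r%u + r%u;\n", rd, rd, rs); break;
    default: printf("/* unknown opcode %u */\n", op);  break;
    }
}

int main(void) {
    print_as_c(0x0101);   /* prints: r0 = r1;       */
    print_as_c(0x0201);   /* prints: r0 = r0 + r1;  */
    return 0;
}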
Getting the exact same source code back from a compiled program is basically impossible. But decompilers have been an area of research in computer science (e.g. the dcc decompiler, which was a PhD project).
There are various algorithms that can be used to do pattern matching on assembly code and generate equivalent C code, but it is very hard to do this in a general way that works well for all inputs.
You might want to check out Boomerang for a semi-recent open source effort at a generalized decompiler.
I once worked on a project where a significant part of the intellectual property was some serious algorithms coded up in x86 assembly code. To port the code to an embedded system, the developer of that code (not me) used a tool from an outfit called MicroAPL (if I recall correctly):
http://www.microapl.co.uk/asm2c/index.html
I was very, very surprised at how well the tool did.
On the other hand, I think it's one of those "if you have to ask, you can't afford it" type of things (their price ranges for a one-off conversion of a project work out to around 4 lines of assembly processed for a dollar).
But, often the assembly routines you get from a vendor are packaged as functions that can be called from C - so as long as the routines do what you want (on the processor you want to use), you might just need to assemble them and more or less forget about them - they're just library functions you call from C.
You can't deterministically convert assembly code to C. Interrupts, self-modifying code, and other low-level things have no representation in C other than inline assembly. There is only some extent to which an assembly-to-C process can work. Not to mention that the resultant C code will probably be harder to understand than actually reading the assembly code... unless you are using this as a basis to begin reimplementation of the assembly code in C, in which case it is somewhat useful. Check out the Hex-Rays plugin for IDA.
Yes, it's very possible to reverse-engineer assembler code to good quality C.
I work for MicroAPL, a company which produces a tool called Relogix to convert assembler code to C. It was mentioned in one of the other posts.
Please take a look at the examples on our web site:
http://www.microapl.co.uk/asm2c/index.html
There must be some automated ASM to C translator out there but all I can find are people saying its impossible. Frankly, its impossible for it to be impossible.
No, it's not. Compilation loses information: there is less information in the final object code than in the C source code. A decompiler cannot magically create that information from nothing, and so true decompilation is impossible.
It isn't impossible, just very hard. A skilled assembly and C programmer could probably do it, or you could look at using a decompiler. Some of these do quite a good job of converting the asm to C, although you will probably have to rename some variables and methods.
Check out this site for a list of decompilers available for the x86 architecture.
Check out this: decompiler
A decompiler is the name given to a computer program that performs the reverse operation to that of a compiler. That is, it translates a file containing information at a relatively low level of abstraction (usually designed to be computer readable rather than human readable) into a form having a higher level of abstraction (usually designed to be human readable).
Not easily possible.
One of the great advantages of C over ASM, apart from readability, was that it prevented "clever" programming tricks.
There are numerous things you can do in assembler that have no direct C equivalent, or that involve tortuous syntax in C.
The other problem is datatypes: most assemblers essentially have only two interchangeable datatypes, bytes and words. There may be some language constructs to define ints and floats etc., but there is no attempt to check that the memory is used as defined. So it's very difficult to map ASM storage to C data types.
In addition, all assembler storage is essentially a "struct": storage is laid out in the order it is defined (unlike C, where storage is ordered at the whim of the compiler). Many ASM programs depend on the exact storage layout - to achieve the same effect in C you would need to define all storage as part of a single struct.
Also, there are a lot of abused instructions (on olde-worldy IBM mainframes the LA, load address, instruction was regularly used to perform simple arithmetic, as it was faster and didn't need an overflow register).
While it may be technically possible to translate to C, the resulting C code would be less readable than the ASM code that was translated.
I can say with 99% certainty that there is no ready converter for this assembly language, so you need to write one. You can simply implement it by replacing each ASM command with a C function:
movf BARGB2,w -> c_movf(BARGB2,w);
subwf AARGB2,f -> c_subwf(AARGB2,f);
This part is easy :)
Then you need to implement each function. You can declare the registers as globals to make things easy. Also, you can use #defines rather than functions, calling functions from them if needed; this will help with argument/result processing.
#define c_subwf(x,y) /* I don't know this ASM, but some subtraction must happen here */
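A minimal sketch of the whole approach, assuming PIC16-style semantics as I read them (w is the working register; movf f,d copies a file register to the destination; subwf f,d stores f minus w; status flags are ignored here - check the datasheet before trusting any of this):

#include <stdio.h>

/* Registers as globals, instructions as macros, per the approach above.
   Destination d: 0 = the working register w, 1 = the file register itself. */
static unsigned char w, AARGB2, BARGB2;

#define c_movf(f, d)  ((d) ? (void)0          : (void)(w = (f)))
/* movf f,f only updates the Z flag, which this sketch ignores */
#define c_subwf(f, d) ((d) ? (void)((f) -= w) : (void)(w = (f) - w))

int main(void) {
    AARGB2 = 10; BARGB2 = 3;
    c_movf(BARGB2, 0);      /* movf  BARGB2,w : w = BARGB2   */
    c_subwf(AARGB2, 1);     /* subwf AARGB2,f : AARGB2 -= w  */
    printf("%d\n", AARGB2); /* prints 7 */
    return 0;
}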
A special case is ASM directives/labels; I think those can be converted with #defines only.
The fun starts when you reach CPU-specific features. These can be simple function calls with stack operations, or specific IO/memory operations. More fun are operations on the program counter register, used for calculations, or code that uses/counts ticks/latencies.
But there is another way, if this hardcore scenario happens. It's hardcore too :)
There is a technique named dynamic recompilation. It's used in many emulators.
You don't need to recompile your ASM, but the idea is almost the same. You can use all your #defines from the first step, but add support for the needed functionality to them (incrementing the PC/ticks). You also need to add some virtual environment for your code, such as memory/IO managers, etc.
Good luck :)
I think it is easier to pick up a book on PIC assembly and learn to read it. Assembler is generally quite simple to learn, as it is so low level.
Check out asm2c, a swift tool to transform DOS/PMODEW 386 TASM assembly code to C code.
It is difficult to convert a function from ASM to C, but doable by hand. Converting an entire program with a decompiler will give you code that can be impossible to understand, since too much of the structure was lost during compilation. Without meaningful variable and function names, the resulting C code is still very difficult to understand.
The output of a C compiler (especially unoptimised output) for a basic program could be translated back to C because of its repeated patterns and structures.
