Tool to produce self-referential programs? - theory

Many results in computability theory (such as Kleene's second recursion theorem) ensure that it is possible to construct programs that can operate over their own source code. For example, in Michael Sipser's "Introduction to the Theory of Computation," he proves a special case of the Recursion Theorem, which states that any program representing a function that accepts two strings and produces a string can be converted into an equivalent program where the second argument is equal to the program's own source code. Moreover, this process can be done automatically.
The construction that one uses to produce programs with access to their own source code is well-known (most theory of computation books contain it) and is often used to generate quines. My question is whether someone has written a general-purpose tool that accepts as input a program in some language (perhaps C, for example) that contains some placeholder for the source of the program, then processes the program to produce a new program with access to its own source code. This would make it possible, for example, to generate quines automatically, or to write programs that can introspect on their syntax trees (possibly enabling reflection in languages that don't already support it). If not, I was planning on writing my own version of such a tool, but I don't want to reinvent the wheel if this has already been done.
EDIT: Based on @Henning Makholm's suggestion, I decided to just sit down and implement such a program. The resulting program (which I've dubbed "kleene") accepts as input a C++ program and produces a new C++ program that can access its own source code by calling the function kleene::MySource(). This means that you could use the kleene program to transform this very simple program into a quine:
#include <iostream>
int main() {
std::cout << kleene::MySource() << std::endl;
}
If you're curious to check it out, it's available here on my website.

Lots of examples at the Wikipedia article and links therefrom. After looking at one or two it should be obvious how to build a quine generator for a given language that takes an arbitrary piece of payload code as input.
One problem with your reflection idea is that the program cannot, in general, know that what it has constructed is its own source code.

Our DMS Software Reengineering Toolkit is a program transformation system that will accept programs in arbitrary syntax (described to DMS in an explicit parameter called a "domain description"), parse them to ASTs, carry out analyses and transformations of the ASTs, and regenerate revised program text from the modified version.
DMS is of course coded in a language (actually a set of domain-specific languages) for which there are already DMS domain descriptions. So, DMS can read itself, and we use that capability to bootstrap additional DMS capabilities and optimize its performance.
So while we aren't producing quines, we are building programs with self-enhancing code.
And yes, your observation about such a tool providing reflection for arbitrary languages is smack on. Most reflection facilities provided in languages allow access only to those things the language-compiler folks thought of paramount importance to access at runtime, such as "method names". Things they weren't interested in, of course, aren't accessible; ever seen a reflection mechanism that will tell you what's in an expression? In a comment?
DMS provides complete access to all the details of the source code, by virtue of inspecting the code from outside, using general-purpose, complete mechanisms. If your language doesn't have reflection, DMS is the way to access the code and reason arbitrarily about it. Even if your language has reflection, DMS can reason about programs in your language in ways that your language cannot, because the language can't get access to its own detailed structure.

Related

How to get a c source code from the compiled code

I have the compiled C code in text format. I need to extract the source code by decompiling the machine code. How to do that?
"True" decompiling is, basically, impossible. Foremost, you can't "decompile" local names (in functions and source code files / modules). For those, you'll get something like, for int local variables: i1, i2... Of course, unless you also have debug information, which is not often the case.
Decompiling to "something" (which might not be very readable) is possible, but it usually relies on heuristics that recognize code patterns compilers generate, and those heuristics can be fooled into producing strange (possibly even incorrect) C code. In practice that means a decompiler usually works OK for a certain compiler with certain (default) compile options, but not so well with others.
Having said that, decompilers do exist and you can try your luck with, say, Snowman.
As Srdjan has said, in general decompilation of a C (or C++) program is not possible. There is too much information lost during the compilation process. For example, consider a declaration such as int x: it is 'lost' because it does not directly produce any machine-level instruction. The compiler needs this information only to do type checking.
It is, however, possible to disassemble, which takes the compiled executable back up one level to assembly language. Interpreting the assembly might (will?) be difficult and certainly time-consuming. There are several disassemblers available. If you have money, IDA Pro is probably the industry standard in disassemblers and, if you are doing this type of work, well worth the several thousand dollars per license. There are also a number of open source disassemblers available; Google can find them.
That being said, there have been efforts to create decompilers: IDA Pro has one, and you can look at http://boomerang.sourceforge.net/ in addition to Snowman linked above.
Lastly, other languages are more friendly towards decompilation than C or C++. For example, a C# program is decompilable with tools like dotPeek or ILSpy. Similarly, for Java there are a number of tools that can convert Java bytecode back into Java source.
Please post a sample of the "compiled C code in text format."
Perhaps then it will be easier to see what you are trying to achieve.
Typically it is not practical to reverse engineer assembly language into C because much of the human-readable information, in the form of labels and variable names, is permanently lost in the compilation process.

How to get abstract syntax tree of a `c` program in `GCC`

How can I get the abstract syntax tree of a C program in GCC?
I'm trying to automatically insert OpenMP pragmas into the input C program.
I need to analyze nested for loops to find dependencies so that I can insert appropriate OpenMP pragmas.
So basically what I want to do is traverse and analyze the abstract syntax tree of the input C program.
How do I achieve this?
You need full dataflow to find 'dependencies'. Then you will need to actually insert the OpenMP calls.
What you want is a program transformation system. GCC probably has the dependency information, but it is famously difficult to work with for custom projects. Others have mentioned Clang and Rose. Clang might be a decent choice, but custom analysis/transformation isn't its main purpose. Rose is designed to support custom tools, but IMHO is a rather complicated scheme to use in practice because of its use of the EDG front end, which isn't designed to support transformation.
[THE FOLLOWING TEXT WAS DELETED BY A MODERATOR. I HAVE PUT IT BACK, BECAUSE IT IS ONE OF THE VALID TRANSFORMATION SYSTEMS FOR THIS TASK. THE FACT THAT I AM RESPONSIBLE FOR IT IN NO WAY DIMINISHES ITS VALUE AS A USEFUL ANSWER TO THE OP.]
Our DMS Software Reengineering Toolkit with its C front end is explicitly designed to be a program transformation system. It has full data flow analysis (including points-to analysis, call graph construction and range analyses) tied to the AST in sensible ways. It provides source-to-source rewrite rules enabling changes to the ASTs expressed in surface syntax form; you can read the transformations rather than inspect a bunch of procedural code. With a modified AST, DMS can regenerate source code including the comments in a compilable form.
Not exactly an AST but GCCXML might help http://linux.die.net/man/1/gccxml
edit : as stated by Ira Baxter gccxml does not output information about function/methods bodies. Here's a fork that seems to fix that lack http://sourceforge.net/projects/gccxml-bodies/

Turning strings into code?

So let's say I have a string containing some code in C, predictably read from a file that has other things in it besides normal C code. How would I turn this string into code usable by the program? Do I have to write an entire interpreter, or is there a library that already does this for me? The code in question may call subroutines that I declared in my actual C file, so one that only accounts for stock C commands may not work.
Whoo. With C this is actually pretty hard.
You've basically got a couple of options:
interpret the code
To do this, you'll have to write an interpreter, and interpreting C is a fairly hard problem. There have been C interpreters available in the past, but I haven't read about one recently. In any case, unless you really, really need this, writing your own interpreter is a big project.
Googling does show a couple of open-source (partial) C interpreters, like picoc
compile and dynamically load
If you can capture the code and wrap it so it makes a syntactically complete C source file, then you can compile it into a C dynamically loadable library: a DLL in Windows, or a .so in most variants of UNIX. Then you could load the result at runtime.
Now, what normally would lead someone to do this is a need to be able to express some complicated scripting functions. Have you considered the possibility of using a different language? Python, Scheme (guile) and Lua are easily available to add as a scripting language to a C application.
C has nothing of this nature. That's because C is compiled, and the compiler needs to do a lot of building of the code before the code starts running (and hence before it could receive a string as input), so it can't really change on the fly that easily. Compiled languages have a rigidity to them while interpreted languages have a flexibility.
You're thinking of Perl, Python, PHP, etc. and so-called "fourth generation languages." I'm sure there's a technical term in c.s. for this flexibility, but C doesn't have it. You'll need to switch to one of these languages (and give up performance) if you have a task that requires this sort of string use much. Check out Perl's /e flag with regexes, for instance.
In C, you'll need to design your application so you don't need to do this. This is generally quite doable: for all its non-OO-ness and other deficiencies, many huge, complex applications run on well-written C just fine.

C to IEC 61131-3 IL compiler

I have a requirement for porting some existing C code to an IEC 61131-3 compliant PLC.
I have some options of splitting the code into discrete function blocks and weaving those blocks into a standard solution (Ladder, FB, Structured Text etc). But this would require carving up the C code in order to build each function block.
When looking at the IEC spec I realised that the IEC Instruction List form could be a target language for a compiler. The Wikipedia article lists two development tools:
CoDeSys
Beremiz
But these seem to be targeted at compiling IEC languages to C, not C to IEC.
Another possible solution is to push the C code through a C to Pascal translator and use that as a starting point for a Structured Text solution.
If not any of these I will go down the route of splitting the code up into function blocks.
Edit
As prompted by mlieson's reply I should have mentioned that the C code is an existing real-time control system. So the program's algorithms should already suit a PLC environment.
Maybe this answer comes too late but it is possible to call C code from CoDeSys thanks to an external library.
You can find documentation on the CoDeSys forum at http://forum-en.3s-software.com/viewtopic.php?t=620
That would let you use your C code in the PLC with minor modifications. You'll just have to define the function or function block interfaces.
My guess is that a C to Pascal translator will not get you near enough for being worth the trouble. Structured text looks a lot like Pascal, but there are differences that you will need to fix everywhere.
Not a big issue, but don't forget that a PLC's runtime environment is a bit different. A C application starts at main() and ends when main() returns. A PLC calls its main() over and over again, hundreds of times per second, and it never ends.
Usually, lengthy calculations and I/O need to be coded in a different fashion than a C application would use.
Unless your C source is many, many thousands of lines of code - rewrite it.
It is impossible. In short: IL is a 4GL (i.e. limited to its
domain, like the other IEC 61131-3 languages: ST, FBD, LD, SFC).
The C language is a 3GL.
To understand the problem, try to answer this question: how would you
express pointer manipulation in IL? For example, how would you call a
function through a pointer? What about interrupts? Low-level access to
peripheral devices?
(Really, there are more problems.)
BTW, there is the Reflex language, aka "C with processes". Reflex is a 4GL for the
control domain with C-like syntax. But the known translators produce
C code and Python code.
If the amount of code to convert is a few thousand lines, recoding by hand is probably your best bet.
If you have lots of code to convert, then an automated tool might be very effective.
Using the DMS Software Reengineering Toolkit we've built translators to map mechanical motion diagrams into RLL (PLC) code. DMS also has full C parser/analyzers/front ends. The pieces are there to build a C to RLL code.
This isn't an easy task. It likely takes 6-12 man-months to configure DMS to something resembling what you want. If that's less than what it takes to do by hand, then it's the right way to do it.
There are a few IEC development environments and target hardware that can use C blocks... I would also take a look at the reasons why it HAS to be an IEC 61131 compliant target. I have written extensively on compliance and why it doesn't mean squat.
SOFTplc corp can help, I'm sure, with user-defined loadable modules... and they can be in C.
Schneider also supports C function blocks...
LabVIEW too!! Not sure why IEC is important, that's all!! A compiler, if one existed, would create bad code for sure :)
Your best bet is to split your C code into smaller parts which can be recoded as PLC function blocks, and to use a C to Pascal converter for each block, which you will then rewrite in Structured Text. Prepare to do a lot of manual work, since automated conversion will probably disappoint you.
Also take a look at this page: http://www.control.com/thread/1026228786
Every time I've done this, I just parsed and converted it by hand from C directly to ST. I only ran into a few functions that required complete rewrites, although there was very little that dealt with pointers, which is something that ST generally chokes on, unfortunately.
Using the existing C code as blocks that are called by the PLC program would have the added advantage that the C blocks could run at the same periodicity that they did before, and their function is likely already well documented and tested. This would minimize any change in behavior relative to the existing control system. This is an architecture for controls with software PLCs that I have seen used before.

Is it possible to write code to write code?

I've heard that there are some things one cannot do as a computer programmer, but I don't know what they are. One thing that occurred to me recently was: wouldn't it be nice to have a class that could make a copy of the source of the program it runs, modify that program and add a method to the class that it is, and then run the copy of the program and terminate itself. Is it possible for code to write code?
If you want to learn about the limits of computability, read about the halting problem:
In computability theory, the halting problem is a decision problem which can be stated as follows: given a description of a program and a finite input, decide whether the program finishes running or will run forever, given that input.
Alan Turing proved in 1936 that a general algorithm to solve the halting problem for all possible program-input pairs cannot exist.
Start by looking at quines, then at Macro-Assemblers and then lex & yacc, and flex & bison. Then consider self-modifying code.
Here's a quine (formatted, use the output as the new input):
#include<stdio.h>
main()
{
char *a = "main(){char *a = %c%s%c; int b = '%c'; printf(a,b,a,b,b);}";
int b = '"';
printf(a,b,a,b,b);
}
Now if you're just looking for things programmers can't do, look beyond NP-complete problems (which are hard but solvable) to undecidable problems, which no algorithm can solve at all.
Sure it is. That's how a lot of viruses work!
Get your head around this: computability theory.
Yes, that's what most Lisp macros do (for just one example).
Yes it certainly is, though maybe not in the context you are referring to. Check out this post on T4.
If you look at functional programming, it offers many opportunities to write code that generates further code. The way that a language like Lisp doesn't differentiate between code and data is a significant part of its power.
Rails generates the various default model and controller classes from the database schema when it's creating a new application. It's quite standard to do this kind of thing with dynamic languages- I have a few bits of PHP around that generate php files, just because it was the simplest solution to the problem I was dealing with at the time.
So it is possible. As for the question you are asking, though- that is perhaps a little vague- what environment and language are you using? What do you expect the code to do and why does it need to be added to? A concrete example may bring more directly relevant responses.
Yes it is possible to create code generators.
Most of the time they take user input and produce valid code. But there are other possibilities.
Self-modifying programs are also possible, but they were more common in the DOS era.
Of course you can! In fact, if you use a dynamic language, the class can change itself (or another class) while the program is still running. It can even create new classes that didn't exist before. This is called metaprogramming, and it lets your code become very flexible.
You are confusing/conflating two meanings of the word "write". One meaning is the physical writing of bytes to a medium, and the other is designing software. Of course you can have the program do the former, if it was designed to do so.
The only way for a program to do something that the programmer did not explicitly intend it to do, is to behave like a living creature: mutate (incorporate in itself bits of environment), and replicate different mutants at different rates (to avoid complete extinction, if a mutation is terminal).
Sure it is. I wrote an effect for Paint.NET* that gives you an editor and allows you to write a graphical effect "on the fly". When you pause typing it compiles it to a dll, loads it and executes it. Now, in the editor, you only need to write the actual render function, everything else necessary to create a dll is written by the editor and sent to the C# compiler.
You can download it free here: http://www.boltbait.com/pdn/codelab/
In fact, there is even an option to see all the code that was written for you before it is sent to the compiler. The help file (linked above) talks all about it.
The source code is available to download from that page as well.
*Paint.NET is a free image editor that you can download here: http://getpaint.net
In relation to artificial intelligence, take a look at Evolutionary algorithms.
make a copy of the source of the program it runs, modify that program and add a method to the class that it is, and then run the copy of the program and terminate itself
You can also generate code, build it into a library instead of an executable, and then dynamically load the library without even exiting the program that is currently running.
Dynamic languages usually don't work quite as you suggest, in that they don't have a completely separate compilation step. It isn't necessary for a program to modify its own source code, recompile, and start from scratch. Typically the new functionality is compiled and linked in on the fly.
Common Lisp is a very good language to practice this in, but there are others where you can create code and run it then and there. Typically, this will be through a function called "eval" or something similar. Perl has an "eval" function, and it's generally common for scripting languages to have the ability.
There are a lot of programs that write other programs, such as yacc or bison, but they don't have the same dynamic quality you seem to be looking for.
Take a look at Langton's loop. This is one of the simplest examples of a self-reproducing "program".
There is a whole class of such things called "Code Generators". (Although a compiler also fits the description as you set it.) And those describe the two areas of these beasts.
Most code generators take some form of user input (most take a database schema) and produce source code which is then compiled.
More advanced ones can output executable code. With .NET, there's a whole namespace (System.CodeDom) dedicated to the creation of executable code. With these objects, you can take C# (or another language) code, compile it, and link it into your currently running program.
I do this in PHP.
To persist settings for a class, I keep a local variable called $data. $data is just a dictionary/hashtable/assoc-array (depending on where you come from).
When you load the class, it includes a php file which basically defines data. When I save the class, it writes the PHP out for each value of data. It's a slow write process (and there are currently some concurrency issues) but it's faster than light to read. So much faster (and lighter) than using a database.
Something like this wouldn't work for all languages. It works for me in PHP because PHP is very much on-the-fly.
It has always been possible to write code generators. With XML technology, the use of code generators can be an essential tool. Suppose you work for a company that has to deal with XML files from other companies. It is relatively straightforward to write a program that uses the XML parser to parse the new XML file and write another program that has all the callback functions set up to read XML files of that format. You would still have to edit the new program to make it specific to your needs, but the development time when a new XML file (new structure, new names) is cut down a lot by using this type of code generator. In my opinion, this is part of the strength of XML technology.
Lisp lisp lisp lisp :p
Joking aside, if you want code that generates code to run, and you've got time to lose learning it and breaking your mind with recursive stuff generating more code, try to learn Lisp :)
(eval '(or true false))
wouldn't it be nice to have a class that could make a copy of the source of the program it runs, modify that program and add a method to the class that it is, and then run the copy of the program and terminate itself
There are almost no cases where that would solve a problem that cannot be solved "better" using non-self-modifying code.
That said, there are some very common (useful) cases of code writing other code. The most obvious is any server-side web application, which generates HTML/JavaScript (well, HTML is markup, but it's identical in theory). Also, any script that alters a terminal's environment usually outputs a shell script that is eval'd by the parent shell. wxGlade generates code that creates bare-bones wx-based GUIs.
See our DMS Software Reengineering Toolkit. This is general purpose machinery to read and modify programs, or generate programs by assembling fragments.
This is one of the fundamental questions of Artificial Intelligence. Personally I hope it is not possible - otherwise soon I'll be out of a job!!! :)
It is called meta-programming and is both a nice way of writing useful programs, and an interesting research topic. Jacques Pitrat's Artificial Beings: the conscience of a conscious machine book should interest you a lot. It is mostly related to meta-knowledge based computer programs.
Another related term is multi-stage programming (because there are several stages of programs, each generating the next one).

Resources