Convert Bash Script to C. Is that possible? - c

I found the following Bash -> C converter.
Is it actually possible to convert from Bash to C this way?
Reason: is C faster than Bash? I want to run something as a daemon instead of a cron job.

It is possible, the question is what are the objectives of doing so. They could be a subset of:
Speed: interpreted scripts can be slower
Maintainability: perhaps you have a team that has more experience with C
Flexibility: the script is showing limitations on what can be achieved with reasonable effort
Integration: perhaps you already have a code base that you want to tightly integrate with the scripts
Portability
There are also other reasons, like scalability, efficiency, and probably a lot more.
Based on the objectives of the "conversion", there are quite a few ways to achieve a C equivalent, varying the amount of code that will be "native". As an example we can consider two extremes.
On one extreme, we have compiled C code that executes mostly as bash would: every line of the original script produces code equivalent to a fork/exec/wait sequence of system calls. The remaining work is mostly implementing equivalents of wildcard expansion, retrieving values from environment variables, synchronizing the forked processes, and setting up pipes with the appropriate system calls.
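To make the fork/exec/wait pattern concrete, here is a minimal sketch of what a single script line could turn into. The helper name run_command is made up for illustration, and this covers only the simplest case: no globbing, no pipes, no redirection.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run one command roughly the way a shell would: fork a child, exec the
 * program in it, and wait for it to finish.
 * Returns the child's exit status, or -1 if something went wrong.
 * run_command is a hypothetical helper name, not a library function. */
int run_command(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        /* Child: replace this process image with the requested program. */
        execvp(argv[0], argv);
        perror("execvp");   /* only reached if exec failed */
        _exit(127);         /* the code shells use for "command not found" */
    }
    /* Parent: block until the child terminates. */
    int status;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with an argument vector such as {"ls", "-l", NULL} would run ls -l and hand back its exit status. Everything else a shell does around that call still has to be written by hand.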
Notice that this "simple" conversion is already tons of work, probably more than just writing another shell interpreter. It also fails to meet many of the objectives above: portability-wise, it is still dependent on the operating system's syscalls, and performance-wise, the only gain is skipping the initial parsing of the command line.
On the other extreme, we have a complete rewrite in a more C fashion. This replaces all conditionals with C conditionals, replaces commands such as ls, cd and rm with the corresponding system and library calls, and possibly replaces string processing with appropriate libraries.
This might be better in achieving some of the objectives, but the cost would probably be even greater than the other way, also removing a lot of code reuse, since you'd have to implement function equivalents to simple commands.
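As a rough illustration of that cost, here is a sketch of what a one-line rm -f dir/*.tmp might become when written natively. The function name remove_by_suffix and the fixed path buffer size are assumptions made for this example.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: delete every entry in `dir` whose name ends with
 * `suffix` and that unlink() is able to remove, roughly what
 * `rm -f dir/*suffix` does in a script.
 * Returns the number of files removed, or -1 if `dir` cannot be opened. */
int remove_by_suffix(const char *dir, const char *suffix)
{
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    size_t slen = strlen(suffix);
    int removed = 0;
    struct dirent *entry;
    while ((entry = readdir(d)) != NULL) {
        size_t nlen = strlen(entry->d_name);
        if (entry->d_name[0] == '.')      /* like a shell glob, skip dotfiles */
            continue;
        if (nlen < slen || strcmp(entry->d_name + nlen - slen, suffix) != 0)
            continue;

        char path[4096];                  /* assumed large enough here */
        int n = snprintf(path, sizeof path, "%s/%s", dir, entry->d_name);
        if (n < 0 || (size_t)n >= sizeof path)
            continue;                     /* path would not fit, skip it */
        if (unlink(path) == 0)
            removed++;
    }
    closedir(d);
    return removed;
}
```

Around thirty lines of C stand in for one line of shell, and this still ignores most of what real glob matching does.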
As for a tool for automating this, I don't know of any, and if there are any they probably don't have widespread use, because converting Bash to C or C to Bash probably isn't a good idea. If such a need arises, it is probably a symptom of a design problem, and a redesign is probably a better solution. Programming languages and scripting languages are different tools for different jobs, even though there are areas of intersection between what can be done with them. In general:
Don't script in C, and don't code in Bash
It is better to know how and when to use the tools you have than to look for a generic universal tool (i.e. there are no such things as silver bullets).
I hope this helps a little =)

I'm sure someone has made a tool, just because they could, but I haven't seen one. If you need to run a bash script from C code, it's possible to just execute it directly via (for example) a call to system():
system("if [ -f /var/log/mail ]; then echo \"you've got mail! (file)\"; fi");
Other than that, I'm not aware of an easy way to "automatically" do it. As humans we can look at the above and equate that to:
/* requires <unistd.h> for access() and <stdio.h> for printf() */
if (access("/var/log/mail", F_OK) != -1)
    printf("you've got mail! (file)\n");
That is just one of a dozen ways it could be written. Doing this by hand is pretty easy; building what amounts to a bash->C compiler to do it automatically is obviously going to take a lot more effort.
So is it possible? Sure!
Example? Sorry, no.

There's a program I use to obfuscate code when I need that.
For some of the programs I've used it on, it does improve the speed; on others it slows the script down, but that's not why I use it. The main utility for me is that the binary cannot be changed or read by casual users.
Article: http://www.linux-magazine.com/Online/Features/SHC-Shell-Compiler
Developer's site: http://www.datsi.fi.upm.es/~frosal/sources/
As mentioned on one of those pages, it falls someplace between a gadget and an actual tool.

Related

Bash script vs. writing C program to call other programs

I have 3-5 (large-ish) programs that I need to merge so that they run sequentially. Speed is important, since this is for (near) real-time applications. (If there is a better way, let me know.)
Would it make more sense to write a Bash script to call these other programs or to write a C program that will most likely use fork() and/or exec()? Are the trade-offs between speed and difficulty/time spent coding in favor of one over the other? Are there other methods I should look into?
I apologize if my terminology is off or if there is not enough information. Also, please correct me so that I do not repeat the same mistakes in the future.
This is the perfect use case for bash. The overhead will be less than a millisecond, and maintenance cost is way lower than for a C program.
I suggest writing the bash script and debugging it. If the performance proves adequate, you are done. It is reasonable to expect bash on a modern system (hardware and kernel) to perform well.
Otherwise use the bash script as the specification for writing a replacement C program.
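If it does come to that, the C replacement for a script of the form a && b && c could look something like the sketch below. It uses posix_spawnp() rather than a hand-rolled fork()/exec(); the helper name run_sequence and the stop-on-first-failure policy are assumptions, not something the question specifies.

```c
#include <spawn.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper: run each NULL-terminated argv in `steps` in order,
 * stopping at the first one that cannot be started or exits non-zero,
 * like `a && b && c` in a script.
 * Returns the number of steps that completed successfully. */
int run_sequence(char *const *steps[], int nsteps)
{
    for (int i = 0; i < nsteps; i++) {
        pid_t pid;
        int err = posix_spawnp(&pid, steps[i][0], NULL, NULL,
                               steps[i], environ);
        if (err != 0) {
            fprintf(stderr, "%s: %s\n", steps[i][0], strerror(err));
            return i;
        }
        int status;
        if (waitpid(pid, &status, 0) < 0 ||
            !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            return i;
    }
    return nsteps;
}
```

Each program is still started as a separate process, exactly as bash would start it, so the time saved over the script is limited to the shell's own startup and parsing.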
The question posed is a classic question of "what is the correct tool for the job?". Bash, being a shell, is designed to launch applications: scripting the launch of multiple programs is as simple as listing the programs you need on separate lines along with the required arguments. But when the question is one of speed, it is one you can only answer by asking "How fast is fast enough?"
If you currently have separate programs you can string together and they can complete the job in an acceptable amount of time, why would you ever think about re-writing them for speed? The fact that you are asking the question shows that you are either (a) having difficulty with the time it takes for the current routines to complete, or (b) you just want to see how much faster you can make it go. We all like hot-rods, right?
Are there speed advantages to be had re-writing large shell apps that are number intensive in a compiled language like C? You bet, many times by hundreds of percent. But is it worth the time it takes to re-write? That's something only you can answer.

How much faster is C than R in practice?

I wrote a Gibbs sampler in R and decided to port it to C to see whether it would be faster. A lot of pages I have looked at claim that C will be up to 50 times faster, but every time I have used it, it's only about five or six times faster than R. My question is: is this to be expected, or are there tricks which I am not using which would make my C code significantly faster than this (like how using vectorization speeds up code in R)? I basically took the code and rewrote it in C, replacing matrix operations with for loops and making all the variables pointers.
Also, does anyone know of good resources for C from the point of view of an R programmer? There's an excellent book called The Art of R Programming by Matloff, but it seems to be written from the perspective of someone who already knows C.
Also, the screen tends to freeze when my C code is running in the standard R GUI for Windows. It doesn't crash; it unfreezes once the code has finished running, but it stops me from doing anything else in the GUI. Does anybody know how I could avoid this? I am calling the function using .C()
Many of the existing posts have explicit examples you can run, for example Darren Wilkinson has several posts on his blog analyzing this in different languages, and later even on different hardware (eg comparing his high-end laptop to his netbook and to a Raspberry Pi). Some of his posts are
the initial (then revised) post
another later post
and there are many more on his site -- these often compare C, Java, Python and more.
Now, I also turned this into a version using Rcpp -- see this blog post. We also used the same example in a comparison between Julia, Python and R/C++ at useR this summer, so you should find plenty of other examples and references. MCMC is widely used, and "easy pickings" for speedups.
Given these examples, allow me to add that I disagree with the two earlier comments your question received. The speed will not be the same, it is easy to do better in an example such as this, and your C/C++ skills will mostly determine how much better.
Finally, an often overlooked aspect is that the speed of the RNG matters a lot. Running down loops and adding things up is cheap -- doing "good" draws is not, and a lot of inter-system variation comes from that too.
About the GUI freezing, you might want to call R_CheckUserInterrupt and perhaps R_ProcessEvents every now and then.
I would say C, done properly, is much faster than R.
Some easy gains you could try:
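Set the compiler to optimize for speed (for example -O2 or -O3 with GCC or Clang).
Compile with the -march flag (for example -march=native) so the compiler can target your CPU.
Also, if you're using VS, make sure you're compiling with release options, not debug.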
Your observed performance difference will depend on a number of things: the type of operations that you are doing, how you write the C code, what type of compiler-level optimizations you use, your target CPU architecture, etc etc.
You can write basic, sloppy C and get something that works and runs with decent efficiency. You can also fine-tune your code for the unique characteristics of your target CPU - perhaps invoking specialized assembly instructions - and squeeze every last drop of performance out of the code. You could even write code that runs significantly slower than the R version. C gives you a lot of flexibility. The limiting factor here is how much time you want to put into writing and optimizing the C code.
The reverse is also true (duplicate the previous paragraph here, but swap "C" and "R").
I'm not trying to sound facetious, but there's really not a straightforward answer to your question. The only way to tell how much faster your C version would be is to write the code both ways and benchmark them.
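For the C side of such a benchmark, a minimal timing sketch might look like this. The names time_calls and sum_to_n are invented for the example, the workload is a placeholder, and it assumes a POSIX system where clock_gettime() is available.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical helper: wall-clock seconds taken by `reps` calls of fn(arg). */
double time_calls(void (*fn)(void *), void *arg, int reps)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < reps; i++)
        fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Placeholder workload (an assumption for illustration): sum 0..n-1 into a
 * volatile accumulator so the compiler cannot optimize the loop away. */
void sum_to_n(void *arg)
{
    long n = *(long *)arg;
    volatile long total = 0;
    for (long i = 0; i < n; i++)
        total += i;
}
```

Dividing the result by reps gives a per-call figure that can be set against the time R's system.time() reports for the equivalent R code. Repeating the measurement a few times helps separate real differences from noise.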

Manually translating code from one language to another

I often write code in MATLAB/Python to test whether my algorithm is feasible (and actually works). I then need to convert the entire code into C and sometimes into Fortran 90.
What would be a good way to manually convert a medium sized code from one language to another?
I have tried :
Converting the entire code from one into another and then testing it.
(Sometimes there are errors and bugs which just won't go away, and finding the source of the error becomes a problem)
Go line by line and check for consistency of outputs every few lines.
(Too time consuming)
Use converters like f2c.
(In my experience, they are extremely horrible. I link to a lot of libraries which have different function calls for C and Fortran)
Also:
I am fairly conversant with the programming languages I deal with so I don't need manuals or reference guides for my work (i.e. I know the syntax).
I am not asking this question specifically about MATLAB and C but rather as a translation paradigm.
Regarding the size, the codes are less than 100 lines long.
I don't want to call code in one language from another. Please don't suggest that.
Different languages call for different paradigms. You definitely don't write and design code the same way in eg. Matlab, Python, C# or C++. Even object hierarchies will change a lot depending on the language.
That said, if your code consists of a few interconnected procedures, then you may get away with a direct line by line translation (every language allows you to write two or three interconnected functions while remaining idiomatic). But this is the case only for the simplest programs.
Prototyping in a high level language and then implementing the same idea in a robust and clean way in a "production" language is a very good practice, but involves two very different steps:
1) Prototype in whatever language you want. Test, experiment, and convince yourself that the idea works. Pay attention to the big picture; don't focus on performance but on the high level ideas. Pay attention also to difficulties that you encounter when implementing, as you'll face them again in step 2.
2) Implement the idea from scratch in the production environment in language X. It will be quicker than if you had skipped the prototyping stage, since most of the difficulties have already been met in step 1. Use idiomatic X, and focus on correctness. Pay attention to corner cases, general robustness, and once it works correctly, performance. You'll notice that roughly half of your code is made of new things which did not appear in step 1 (e.g. error checking, corner case handling, input/output, unit testing, etc).
You can see that line by line translation is obviously not a good idea, since you don't translate into the same program.
Also, when not prototyping, I find myself throwing away the first version and making another one that I like better, i.e. I find myself prototyping! Implementing the same thing twice is not a loss of time, it is normal development flow.
You may want to consider using a higher level domain specific language with multiple backends (e.g., Matlab, C, Fortran), producing clean and idiomatic code for each target language, probably with some optimisations. If your problem domain is narrow and every piece of code is more or less typical, it should be fairly trivial to design and implement such a DSL.
Break the source down into pseudo-code with input/process/output and then write your new code base to fit that spec.
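Whichever route is taken, the port still has to be checked against the prototype's outputs. A minimal sketch of such a check in C follows, assuming the reference values were dumped from the MATLAB/Python version and that a mixed absolute/relative tolerance is acceptable. The name first_mismatch is made up for this example.

```c
#include <stddef.h>

/* Absolute value without pulling in <math.h>. */
static double abs_d(double x)
{
    return x < 0 ? -x : x;
}

/* Hypothetical helper: return the index of the first element where the
 * ported result differs from the prototype's reference output by more than
 * rel_tol * max(|reference|, 1), or -1 if the arrays agree everywhere.
 * A NaN on either side counts as a mismatch. */
long first_mismatch(const double *ported, const double *reference,
                    size_t n, double rel_tol)
{
    for (size_t i = 0; i < n; i++) {
        double diff = abs_d(ported[i] - reference[i]);
        double scale = abs_d(reference[i]);
        if (scale < 1.0)
            scale = 1.0;          /* fall back to an absolute tolerance */
        if (!(diff <= rel_tol * scale))
            return (long)i;       /* written this way so NaN also fails */
    }
    return -1;
}
```

Running this on the intermediate results of each function, rather than only on the final answer, narrows down where a translation bug was introduced without having to check every line.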

Find functional changes between two revisions of a file (compile diff?)

I'm looking for a tool that checks whether two (C) source code files generate the same binary so that I can find actual functional changes between two files and ignore mere coding style changes.
It would be great if this worked even within a file for different changesets, so a file may have changed in coding style on some places, but also had one functional patch added.
It's very very hard to write a program to figure out the "functional" result of another program. Such a program sounds like it would be necessary for this. I would guess that computer programs themselves are right about the most compact and machine-readable way we have to even describe functionality, so it's kind of hard to write a program that analyses a program and generates a "better" description.
Somehow abstracting out and "understanding" that coding style differences don't affect functionality also sounds very, very hard. I find it hard when manually reading other people's code somehow, because the differences in style can be pretty large, even though the end result might be the same in "my style".
I would be surprised if a solution wouldn't also require a solution to the halting problem, which is proven impossible for the general case.
The only way is to compile both with the same compiler options and do a binary diff.
It's not only style changes you'd have to look out for; someone may have extracted code to a function that gets inlined in an optimised build. This may, or may not, depending on compiler options and version, give the same binary.
Mapping binary back to source to "high level functionality" - unlikely.
Comparing two source files with respect to "high level functionality" (ignoring coding style) - possible:
http://cscope.sourceforge.net/
Alternative suggestion:
Write a tool that "normalizes" your source files - by applying the same formatting to both sets of code.
This can easily be automated.
For example:
1) checkout both from version control,
2) apply "standard format",
3) compare
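To show the idea behind steps 2 and 3, here is a deliberately crude sketch that treats "standard format" as nothing more than collapsing whitespace. The function names are invented, and a real setup would more likely run both versions through an existing formatter such as indent or clang-format before diffing.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, very crude normalizer: copy `src` into `dst`, collapsing
 * every run of whitespace to one space and dropping leading and trailing
 * whitespace. `dst` must have room for strlen(src) + 1 bytes.
 * It knows nothing about string literals, comments, or brace style. */
void normalize_ws(const char *src, char *dst)
{
    size_t out = 0;
    int pending_space = 0;
    for (; *src != '\0'; src++) {
        if (isspace((unsigned char)*src)) {
            pending_space = (out > 0);   /* ignore leading whitespace */
        } else {
            if (pending_space)
                dst[out++] = ' ';
            pending_space = 0;
            dst[out++] = *src;
        }
    }
    dst[out] = '\0';
}

/* Compare two sources after normalizing both.
 * Returns 1 if equal, 0 if different, -1 on allocation failure. */
int same_after_normalizing(const char *a, const char *b)
{
    char *na = malloc(strlen(a) + 1);
    char *nb = malloc(strlen(b) + 1);
    int result = -1;
    if (na != NULL && nb != NULL) {
        normalize_ws(a, na);
        normalize_ws(b, nb);
        result = (strcmp(na, nb) == 0);
    }
    free(na);
    free(nb);
    return result;
}
```

This treats differently indented or line-wrapped code as equal, but "x=1" and "x = 1" still compare as different, which is why a formatter that actually parses the language does a better job.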
If all you're interested in is whether they both "generate the same binary", then the easiest solution is simply to generate both binaries, and compare.
Note, however, that there are things that would result in binaries that are bitwise different, even though they're functionally identical:
Change in external function names
Optimisations
Reordering non-dependent code snippets
etc.
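The generate-both-and-compare check itself boils down to a byte-wise file comparison, which tools like cmp already provide. A minimal C sketch, with the made-up name files_identical, looks like this:

```c
#include <stdio.h>

/* Hypothetical helper: compare two files byte by byte.
 * Returns 1 if identical, 0 if they differ, -1 if either cannot be opened.
 * A read error is treated the same as end of file. */
int files_identical(const char *path_a, const char *path_b)
{
    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");
    int result = -1;
    if (fa != NULL && fb != NULL) {
        int ca, cb;
        do {
            ca = getc(fa);
            cb = getc(fb);
        } while (ca == cb && ca != EOF);
        result = (ca == cb);   /* equal only if both reached EOF together */
    }
    if (fa != NULL)
        fclose(fa);
    if (fb != NULL)
        fclose(fb);
    return result;
}
```

A result of 1 says the two revisions compiled to the same bytes. A result of 0 does not prove a functional change, for the reasons listed above.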
There is a branch of computer science that deals with concurrency and parallel processes.
One of the applications is deciding whether two systems are behaviorally equivalent (in some bisimulation relation (weak or strong)).
It is, however, computationally very difficult to decide whether two large systems are behaviorally equivalent. The usage is mainly for verification of small critical applications where we can't afford failure.

Reverse engineer "compiled" Perl vs. C?

I have a client who claims that compiled C is harder to reverse engineer than pseudo-"compiled" Perl byte-code, or the like. Does anyone have a way to prove or disprove this?
I don't know too much about perl, but I'll give some examples why reversing code compiled to assembly is so ugly.
The ugliest thing about reverse engineering C code is that compilation removes all type information. This total lack of names and types is the very worst part, IMO.
In a dynamically typed language the compiler needs to preserve much more of that information, in particular the names of fields, methods and so on, since these are usually looked up as strings and it is impossible to find every use.
There is plenty of other ugly stuff, such as whole-program optimization using different registers to pass parameters every time, or functions being inlined, so that what was once a simple function appears in many places, often in slightly different form due to optimizations.
The same registers and bytes on the stack get reused for different content inside a function. This gets especially ugly with arrays on the stack, since you have no way to know how big the array is and where it ends.
Then there are micro-optimizations, which can get annoying. For example, I once spent more than 15 minutes reversing a simple function that was originally something like return x/1600, because the compiler decided that divisions are slow and rewrote the division by a constant into several multiplications, additions and bitwise operations.
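To give a feel for what such a rewrite looks like, here is a sketch of one possible multiply-and-shift replacement for x/1600 on unsigned 32-bit values. The exact sequence a given compiler emits will differ (and the signed case needs extra fix-up steps); the constant below is simply ceil(2^35 / 25), and the function names are made up.

```c
#include <stdint.h>

/* What the programmer wrote. */
uint32_t div1600_plain(uint32_t x)
{
    return x / 1600;
}

/* One way to get the same result without a divide instruction.
 * 1600 = 64 * 25, so shift right by 6 first, then divide by 25 by
 * multiplying with ceil(2^35 / 25) = 0x51EB851F and shifting right by 35.
 * The intermediate product needs 64 bits. */
uint32_t div1600_mulshift(uint32_t x)
{
    uint64_t product = (uint64_t)(x >> 6) * UINT64_C(0x51EB851F);
    return (uint32_t)(product >> 35);
}
```

Both functions return the same value for every 32-bit input, but someone reading only the second form in a disassembly has to work backwards from a magic constant and two shifts to recover "divide by 1600".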
Perl is really easy to reverse engineer. The tool of choice is vi, vim, emacs or notepad.
That does raise the question of why they're worried about reverse engineering. It is normally more difficult to turn machine code back into something resembling the original source code than it is to do the same with byte-code, but for most nefarious activities that's irrelevant. If someone wants to copy your secrets or break your security, they can do enough without turning it back into a perfect representation of your original source code.
Reverse engineering code for a virtual machine is usually easier. A virtual machine is typically designed to be an easy target for the language. That means it typically represents the constructs of that language reasonably easily and directly.
If, however, you're dealing with a VM that wasn't designed for that particular language (e.g., Perl compiled to the JVM) that would frequently put you back much closer to working with code generated for real hardware -- i.e., you have to do whatever's necessary to target a pre-defined architecture instead of designing the target to fit the source.
Ok, there has been sufficient debate on this over the years, and mostly the results are never conclusive ... mainly because it doesn't matter.
For a motivated reverse engineer, both will be the same.
If you are using pseudo exe makers like perl2exe then that will be easier to "decompile" than compiled C, as perl2exe does not compile the perl at all, it's just a bit "hidden" (see http://www.net-security.org/vuln.php?id=2464 ; this is really old, but concept is probably still the same (I haven't researched so don't know for sure, but I hope you get my point) )
I would advise looking at which language is best for the job, so that maintenance and development of the actual product can be done sensibly and sustainably.
Remember you _can_not_ stop a motivated adversary, you need to make it more expensive to reverse than to write it themselves.
These 4 should make it difficult (but again not impossible)...
[1] Insert noise code (random places, random code) which does pointless maths and complex data structure interaction (if done properly this will be a great headache if the purpose is to reverse the code rather than the functionality).
[2] Chain a few (different) code obfuscators on the source code as part of build process.
[3] Apply a Software protection dongle which will prevent code execution if the h/w is not present, this will mean physical access to the dongle's data is required before rest of the reversing can take place : http://en.wikipedia.org/wiki/Software_protection_dongle
[4] There are always protectors (e.g. Themida http://www.oreans.com/themida.php) you can get which will be able to protect a .exe after it has been built (regardless of how it was compiled).
... That should give the reverser enough headache.
But remember that all this will also cost money, so you should always weigh up what it is that you are trying to achieve and then look at your options.
In short: Both methods are equally insecure. Unless you are using a non-compiling perl-to-exe maker in which case native compiled EXE wins.
I hope this helps.
C is harder to decompile than byte-compiled Perl code. Any Perl code that's been byte-compiled can be decompiled. Byte-compiled code is not machine code like in compiled C programs. Some others suggested using code obfuscation techniques. Those are just tricks to make code harder to read and won't affect the difficulty of decompiling the Perl source. The decompiled source may be harder to read, but there are many Perl de-obfuscation tools available and even a Perl module:
http://metacpan.org/pod/B::Deobfuscate
Perl packing programs like Par, PerlAPP or Perl2exe won't offer source code protection either. At some point the source has to be extracted so Perl can execute the script. Even packers like PerlAPP and Perl2exe, which attempt some encryption techniques on the source, can be defeated with a debugger:
http://www.perlmonks.org/?displaytype=print;node_id=779752;replies=1
It'll stop someone from casually browsing your Perl code but even the packer has to unpack the script before it can be run. Anyone who's determined can get the source code.
Decompiling C is a different beast altogether. Once it's compiled it's now machine code. You either end up with Assembly code with most C decompilers or some of the commercial C decompilers will take the Assembly code and try to generate equivalent C code but, unless it's a really simple program, seldom are able to recreate the original code.
