What is it called if two versions of a program perform the same job, but do it using different code? - theory

What is it called if two versions of a program perform the same job, but do it using different code?
Is it correct to say that the two versions are semantically equivalent, although the versions may consist of different code?
Say I have some goal to accomplish and both program versions perform the job. Is there a term expressing this relation?

We call this extensional equivalence. Two programs are extensionally equivalent if they do the same thing, but are possibly implemented in different ways. Note that each program is trivially extensionally equivalent to itself, since extensional equivalence is an equivalence relation.
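For example (an illustration of my own, not from the original question), the following two C functions are extensionally equivalent: they return the same value for every input n, but compute it with entirely different code.

/* Two extensionally equivalent implementations of the sum 1 + 2 + ... + n. */
unsigned long long sum_loop(unsigned n) {
    unsigned long long s = 0;                    /* iterate and accumulate */
    for (unsigned long long i = 1; i <= n; i++)
        s += i;
    return s;
}

unsigned long long sum_formula(unsigned n) {
    return (unsigned long long)n * (n + 1ULL) / 2;   /* closed form: n(n+1)/2 */
}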

Related

Why aren't binaries of different languages compatible with each other? How do you make them compatible?

A Swift app will convert its dynamic frameworks into binaries. And once something is a binary, it's no longer Swift/Ruby/Python, etc. It's machine code.
The same thing happens for a Python binary. So why isn't the resulting machine code compatible out of the box?
Is it just that a simple mapping is required to bridge one language to the other?
Say I needed to use a binary created from the Swift language in a Python-based app: would I need to expose the Swift headers to Python for it to work? Or is something else required?
I assume you're talking about making calls in one language to a library compiled in a different language.
At the assembly language level, there are standards (ABI, for Application Binary Interface) that define how function parameters are passed in registers, how values are returned, the behavior of the stack, etc. ABIs are architecture and operating-system-dependent. Usually any function that is exported in a library will follow the ABI.
It is plain that ABIs basically expect a C language model for functions: a single return value, a well-defined data type for each function parameter as well as the return value, the possibility of using pointers, etc.
Problems start to arise once you move to a higher-level language. C++ already introduces complications: whereas the name of a C function carries over to assembly essentially unchanged (often a _ character is prepended), C++ function names must encode data types, due to the possibility of overloaded functions with the same name but different parameters. Thus, names must be mangled and demangled -- this is why a prototype for a C function must be declared extern "C" in C++. Then there are the issues of classes (the this pointer, vtables), namespaces, and so on, which complicate matters further.
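The usual header idiom for this looks like the following (a generic sketch; mylib.h and add are made-up names):

/* mylib.h -- exposes a function with an unmangled C name to C++ callers */
#ifdef __cplusplus
extern "C" {
#endif

int add(int a, int b);   /* exported under the plain C name "add" */

#ifdef __cplusplus
}
#endif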
Then you have dynamically typed languages like Python. In truth, there is no such thing as dynamic typing at the assembly language level: the instruction encodings in machine language (i.e. the binary codes read by the CPU when executing) implicitly determine whether you're using an integer, floating-point, or SIMD instruction (and the width of the operands), which also determines which of the different register banks are accessed. Although the language makes dynamic typing transparent to you, at the assembly level the interpreter/JIT/compiler must resolve types somehow, because ultimately the CPU must be told exactly what data type to operate on.
This is why you can't directly call a C function (or in general any library function) from Python -- unlike a pure Python function which can disregard the types of its parameters, library functions must know the exact types of each parameter and the return type. Thus, you must use something like ctypes for Python, explicitly specifying the types in question for each function that needs to be called -- in a way, this is similar to function prototypes usually found in C headers. It is possible to write functions in C that are directly callable from Python (and, in that case, essentially from Python alone), but you'll have to jump through a few hoops.
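For illustration, a minimal sketch of the ctypes route (the function add and library libadd.so are made-up names; the Python side lives in comments so the example stays in one C file):

/* add.c -- build as a shared library, e.g.: gcc -shared -fPIC -o libadd.so add.c
 *
 * Python side (illustrative):
 *     import ctypes
 *     lib = ctypes.CDLL("./libadd.so")
 *     lib.add.argtypes = (ctypes.c_int, ctypes.c_int)
 *     lib.add.restype = ctypes.c_int
 *     print(lib.add(2, 3))   # -> 5
 */
int add(int a, int b) {
    return a + b;
}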
As for the particular language pairing you're interested in (Python/Swift), a cursory search came up with this thread in the Swift forums (this one, linked from there, may also be interesting). Reading the thread, there appear to be two feasible solutions at this time: first, use the @_cdecl attribute (which isn't officially supported) to make a C function, and then call it from Python using ctypes. The second and apparently more promising one is to use the @objc attribute in Swift, and use PyObjC in Python. I assume this will allow using some of the higher-level features of Swift, at least those that intersect with what Objective-C offers.

How do I test C functions with internal ifdefs for functional equivalency?

I have a library of C functions that I optimized internally using SIMD intrinsics. These functions all look something like this:
void add_array(...) {
#if defined(USE_SIMD)
    // SIMD code here ...
#else
    // Scalar code here ...
#endif
}
contained in individual files -- an add.c file for this one, for example.
Now, I would like to ensure that both variants of each function are functionally equivalent. I found that simply generating random (but valid) inputs for both variants and comparing the results suffices for my application. I think this is called monkey testing. The scalar code (or rather its output values) acts as the golden reference.
(Because of the SIMD intrinsics, formal verification is not an option.)
However, I have not found a scalable and sustainable way to run these tests from a C testing framework. So far, I manually copied the vector code into an extra vector function add_array_vector() and then ran both one after the other from the same main function test harness, comparing the "golden reference" output of add_array() with the values from the add_array_vector() variant. But this approach does not scale, since I have more than 100 of these functions that all use the #if/#else approach internally.
Since I have to run all code in a simulator (or on a bare-metal embedded device), I also can't interact with a file system. I need a single test binary that contains all tests and test data, and it has to report its results via a printf (UART) call.
What I see as my only option is to compile the functions twice: once without USE_SIMD and once with USE_SIMD defined. Then I would need to link these two function variants into the same main (my test harness). However, how do I ensure that both variants have different function names? Is there a way I can "name mangle" the USE_SIMD define into the function name? And how would I link them? (One possible approach is sketched after this question.)
Maybe I am completely on the wrong track here and there is a far simpler way to solve this. I surely can't be the first person who came across this core issue: ensuring that two variants of the same C (or C++) function are functionally equivalent.
Any help is greatly appreciated. Thanks
EDIT: I can't afford to print the numeric results via printf (or UART), as they are a serious bottleneck in this randomized brute-force approach. They dramatically reduce (by multiple orders of magnitude) the number of iterations/tests I can run per second. Printing the final outcome, or an error if one occurs, is fine. Printing every numerical test result value for "external validation" is not sustainable.
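One way to realize the renaming idea from the question (a sketch under the question's assumptions, not from the original post; VARIANT and the _simd/_scalar suffixes are made-up names) is to token-paste the build flavor into each function name and compile every file twice:

/* add.c -- compiled twice, once with and once without -DUSE_SIMD */
#if defined(USE_SIMD)
#define VARIANT(name) name##_simd
#else
#define VARIANT(name) name##_scalar
#endif

void VARIANT(add_array)(const float *a, const float *b, float *out, int n) {
#if defined(USE_SIMD)
    /* SIMD code here ... */
#else
    /* Scalar code here ... */
#endif
}

Build each flavor into its own object file (e.g. gcc -c add.c -o add_scalar.o and gcc -c -DUSE_SIMD add.c -o add_simd.o), link both objects into the single test binary, declare both prototypes (add_array_scalar, add_array_simd) in the harness, feed both the same random inputs, and printf only on mismatch or at the very end.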

Different Machine Codes for same piece of logic

Consider the following code:
char love[4]={'l','o','v','e'};
Will the machine code for love[1] and *(love+1) be the same or different? If different, why?
If you're asking whether they will reference the same memory location, the answer is yes. So will *(1+love) and 1[love].
If you're asking if the compiler will generate the same machine language under the covers, that depends entirely on the compiler. The ISO C standard does not dictate that level of detail.
It's generally more concerned with effects rather than implementation details.
Given that all four possibilities mean the same thing, I'd be surprised if a compiler generated different machine code under the covers - I'd expect a decent compiler to generate the most efficient version for all cases. However, as mentioned above, it's by no means mandatory.
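A quick way to check with your own compiler (an illustration of mine, not from the answers): compile something like the following with gcc -O2 -S and compare the generated assembly for the four functions.

/* All four spellings are defined to be identical by C's subscript rule:
   a[i] == *(a + i) == *(i + a) == i[a]. */
char love[4] = {'l', 'o', 'v', 'e'};

char f1(void) { return love[1]; }
char f2(void) { return *(love + 1); }
char f3(void) { return *(1 + love); }
char f4(void) { return 1[love]; }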

Assembly-level function fingerprint

I would like to determine, whether two functions in two executables were compiled from the same (C) source code, and would like to do so even if they were compiled by different compiler versions or with different compilation options. Currently, I'm considering implementing some kind of assembler-level function fingerprinting. The fingerprint of a function should have the properties that:
1. two functions compiled from the same source under different circumstances are likely to have the same fingerprint (or a similar one),
2. two functions compiled from different C source are likely to have different fingerprints,
3. (bonus) if the two source functions were similar, the fingerprints are also similar (for some reasonable definition of similar).
What I'm looking for right now is a set of properties of compiled functions that individually satisfy (1.) and taken together hopefully also (2.).
Assumptions
Of course this is generally impossible, but there might exist something that will work in most cases. Here are some assumptions that could make it easier:
Linux ELF binaries (without debugging information available, though),
not obfuscated in any way,
compiled by gcc,
on x86 Linux (an approach that can be implemented on other architectures would be nice).
Ideas
Unfortunately, I have little to no experience with assembly. Here are some ideas for the above-mentioned properties:
types of instructions contained in the function (i.e. floating point instructions, memory barriers)
memory accesses from the function (does it read/write from/to the heap? the stack?)
library functions called (their names should be available in the ELF; also their order shouldn't usually change)
shape of the control flow graph (I guess this will be highly dependent on the compiler)
Existing work
I was able to find only tangentially related work:
Automated approach which can identify crypto algorithms in compiled code: http://www.emma.rub.de/research/publications/automated-identification-cryptographic-primitives/
Fast Library Identification and Recognition Technology in IDA disassembler; identifies concrete instruction sequences, but still contains some possibly useful ideas: http://www.hex-rays.com/idapro/flirt.htm
Do you have any suggestions regarding the function properties? Or a different idea which also accomplishes my goal? Or was something similar already implemented and I completely missed it?
FLIRT uses byte-level pattern matching, so it breaks down with any changes in the instruction encodings (e.g. different register allocation/reordered instructions).
For graph matching, see BinDiff. While it's closed source, Halvar has described some of the approaches on his blog. They have even open-sourced some of the algorithms they use to generate fingerprints, in the form of the BinCrowd plugin.
In my opinion, the easiest way to do something like this would be to decompose the function's assembly back into some higher-level form where constructs (like for, while, function calls, etc.) exist, then match the structure of these higher-level constructs.
This would prevent instruction reordering, loop hoisting, loop unrolling, and any other optimizations from messing with the comparison. You could even (de)optimize these higher-level structures to their maximum on both ends to ensure they meet at the same point, so comparisons between unoptimized debug code and -O3 won't fail due to missing temporaries, lack of register spills, etc.
You can use something like boomerang as a basis for the decompilation (except you wouldn't spit out C code).
I suggest you approach this problem from the standpoint of the language the code was written in and what constraints that code puts on compiler optimization.
I'm not really familiar with the C standard, but C++ has the concept of "observable" behavior. The standard carefully defines this, and compilers are given great latitude in optimizing as long as the result gives the same observable behavior. My recommendation for trying to determine whether two functions are the same would be to try to determine what their observable behavior is (what I/O they do, and how they interact with other areas of memory and in what order).
If the problem set can be reduced to a small set of known C or C++ source code functions being compiled by n different compilers, each with m[n] different sets of compiler options, then a straightforward, if tedious, solution would be to compile the code with every combination of compiler and options and catalog the resulting instruction bytes, or more efficiently, their hash signature in a database.
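The hashing part of that catalog is straightforward. Here is a sketch using 64-bit FNV-1a (my choice of hash, not the answer's), assuming the function's instruction bytes have already been extracted from the binary (e.g. with objdump or libelf):

#include <stdint.h>
#include <stddef.h>

/* Hash a function's instruction bytes into a signature for the database. */
uint64_t fnv1a64(const uint8_t *bytes, size_t n) {
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV-1a offset basis */
    for (size_t i = 0; i < n; i++) {
        h ^= bytes[i];
        h *= 0x100000001b3ULL;            /* FNV-1a prime */
    }
    return h;
}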
The set of likely compiler options used is potentially large, but in actual practice engineers typically use a pretty standard and small set, usually just minimally optimized for debugging and fully optimized for release. Surveying many project configurations might reveal only two or three more in any engineering culture, usually reflecting prejudice or superstition (accurate or not) about how compilers work.
I suspect this approach is closest to what you actually want: a way of investigating suspected misappropriated source code. All the suggested techniques of reconstructing the compiler's parse tree might bear fruit, but have great potential for overlooked symmetric solutions or ambiguous unsolvable cases.

Compile-time trigonometry in C

I currently have code that looks like
while (very_long_loop) {
    ...
    y1 = getSomeValue();
    ...
    x1 = y1*cos(PI/2);
    x2 = y2*cos(SOME_CONSTANT);
    ...
    outputValues(x1, x2, ...);
}
The obvious optimization would be to compute the cosines ahead of time. I could do this by filling an array with the values, but I was wondering: would it be possible to make the compiler compute these at compile time?
Edit: I know that C doesn't have compile-time evaluation, but I was hoping there would be some weird and ugly way to do this with macros.
If you're lucky, you won't have to do anything: modern compilers do constant propagation for functions in the same translation unit and for intrinsic functions (which most likely include the math functions).
Look at the assembly to check if that's the case for your compiler and increase the optimization levels if necessary.
Nope. A pre-computed lookup table would be the only way. In fact, cosine (and sine) might even be implemented that way in your libraries.
Profile first, Optimise Later.
No, unfortunately.
I would recommend writing a little program (or script) that generates a list of these values (which you can then #include into the correct place), that is run as part of your build process.
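A sketch of such a generator (a hypothetical illustration; M_PI is POSIX, so define it yourself if your math.h lacks it): run it on the build host and redirect its output into a header that the real program #includes.

#include <stdio.h>
#include <math.h>

int main(void) {
    printf("/* generated file -- do not edit */\n");
    printf("#define COS_PI_2 %.17g\n", cos(M_PI / 2));
    /* ... one printf per constant the loop needs ... */
    return 0;
}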
By the way: cos(pi/2) = 0!
You assume that computing cos is more expensive than a table access. Perhaps this is not true on your architecture. Thus you should do some testing (profiling), as always with optimization ideas.
Instead of precomputing these values, you could use global variables to hold them, computed once at program startup.
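A sketch of that approach (names are illustrative, and 0.5 stands in for the question's SOME_CONSTANT):

#include <math.h>

static double cos_pi_2;           /* filled in once at startup */
static double cos_some_constant;

void init_constants(void) {       /* call once before entering the long loop */
    cos_pi_2 = cos(M_PI / 2);     /* M_PI is POSIX, not strictly ISO C */
    cos_some_constant = cos(0.5);
}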
No, C doesn't have the concept of compile-time evaluation of functions, nor even of symbolic constants if they are of type double. The only way to have them as immediate operands is to precompute them and then define them in macros. This is the way the C library does it for pi, for example.
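Sketched out (the second value assumes SOME_CONSTANT was 0.5, a made-up stand-in):

/* Precomputed by hand or by a generator, then used as immediate operands. */
#define COS_PI_2       0.0                  /* cos(pi/2) is exactly 0 */
#define COS_SOME_CONST 0.87758256189037272  /* cos(0.5), precomputed */

double scale(double y1, double y2) {
    return y1 * COS_PI_2 + y2 * COS_SOME_CONST;
}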
If you check the code and the compiler is not hoisting the constant values out of the loop, then do so yourself.
If the arguments to the trig functions are constant, as in your sample code, then either pre-compute them yourself or make them static variables so they are only computed once. If they vary between calls but are constant within the loop, move the computation outside the loop. If they vary between iterations of the loop, then a look-up table may be faster; and if lower accuracy is acceptable, implementing your own trig functions that halt the calculation at that lower accuracy is also an option.
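For that last option, a sketch of a reduced-accuracy cosine (a truncated Taylor series of my own, only reasonable for small |x| and purely illustrative):

/* cos(x) ~= 1 - x^2/2! + x^4/4! - x^6/6!, decent for |x| < ~1 */
double fast_cos(double x) {
    double x2 = x * x;
    return 1.0 - x2 / 2.0 + x2 * x2 / 24.0 - x2 * x2 * x2 / 720.0;
}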
I am struck with awe by Christoph's answer above.
So nothing needs to be done in this case, where gcc has some knowledge about the math functions. But if you have a function (maybe one you implemented yourself) which cannot be evaluated by your C compiler, or if your compiler is not so clever (or you need to fill complicated data structures, or some other reason), you can use some higher-level language to act as a macro processor. In the past, I have used eRuby for this purpose, but ePerl should work very well too and is another obvious, readily available, and more or less comfortable choice.
You can specify make rules for transforming files with extension .eruby (or .eperl, or whatever) into files with that extension stripped, so that, for example, if you write module.c.eruby or module.h.eruby, then make automatically knows how to generate module.c or module.h, respectively, and keeps them up to date. In your make rule you can easily generate a comment warning against editing the file directly.
If you are using Windows or something similar, then I am out of my depth in explaining how to make your favorite IDE run this transformation for you automatically. But I believe it should be possible, or you could just run make outside of your IDE whenever you need to change those .eruby (or whatever) files.
By the way, I have seen eLua used to implement Lua as a macro language in an incredibly small number of lines of code. Of course, any other scripting language with support for regular expressions and flexible layout rules should work as well (though Python is ill-suited for this purpose due to its strict whitespace rules).
