I was going to implement this piece of code in my smart contract:
(defun absBug:integer (num:integer)
  ;; This property fails
  @model [(property (>= result 0))]
  (if (= (- (* (- num 6) (+ num 11)) (* 42 num)) (* (* 64 7) 52270780833))
    (- 1)
    (abs num)))
I was wondering since I am implementing formal verification, will there be any latency or slow down once I deploy this contract onto any chain? Or is calculation done once and stored going forward?
(I know my code spits out the correct answer which I would have to adjust after the fact)
No, it does not affect latency or any other on-chain performance.
The purpose of formal verification is to prove that the contract is bug-free and deployable, hence it is run before deployment and not on the chain.
FYI, when you develop on pact-web, it runs formal verification by default. However, if you're developing the contract locally on your machine, you'll need to call (verify 'contract-name) to run the formal verification, and that is when all the computation takes place.
If I try this in Racket:
(expt 2 1000)
I get a number many times bigger than the number of atoms in the universe:
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376
I can even get crazier with (expt 2 10000) which still only takes a second on my T450 laptop. So as I understand, this is only possible because of tail recursion. Is this correct? If so, is Racket's tail recursion pure functional programming, or are there hidden side-effects going on behind the scenes? Also, when I see Common Lisp's loop, is it based on tail recursion under the hood? In general, I guess I'm wondering how these feats of recursion/looping are possible.
Racket uses a C library to implement large integers (bignums).
The library is called GMP:
https://gmplib.org/manual/Integer-Exponentiation.html
Now, the case of 2^n is pretty easy to handle in a binary representation.
You only need a 1 followed by n zeros. That is, GMP can compute the number very fast.
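As an illustration, here is a minimal C sketch using GMP directly (assuming GMP is installed; link with -lgmp). Since 2^n is just bit n set, mpz_setbit produces it without any multiplication:

#include <stdio.h>
#include <gmp.h>

int main(void)
{
    mpz_t x;
    mpz_init(x);           /* x = 0 */
    mpz_setbit(x, 1000);   /* set bit 1000: x = 2^1000, a 1 followed by 1000 zero bits */
    gmp_printf("%Zd\n", x);
    mpz_clear(x);
    return 0;
}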
Tail calling is a wonderful thing, but it's important to understand that it doesn't make it possible to compute things that wouldn't be computable otherwise. In general, any code that's written in (say) a functional language with tail-calling can be written in another language using a loop. The advantage of a language with tail-calling is that programmers don't need to rewrite their recursive calls to loops in order to allow their programs to run.
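As a minimal C sketch of that point (hypothetical names), here is the same summation written with a tail-recursive accumulator and as an explicit loop:

#include <stdio.h>

/* Tail-recursive style: the running total is threaded through as an
   accumulator. A compiler that eliminates tail calls turns this into a
   loop; one that doesn't will grow the stack instead. */
static long sum_rec(long n, long acc)
{
    if (n == 0)
        return acc;
    return sum_rec(n - 1, acc + n);
}

/* The same computation written directly as a loop; no tail calls needed. */
static long sum_loop(long n)
{
    long acc = 0;
    while (n > 0) {
        acc += n;
        n--;
    }
    return acc;
}

int main(void)
{
    printf("%ld %ld\n", sum_rec(10000, 0), sum_loop(10000));
    return 0;
}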
It looks like you're focusing here on the ability of Racket (and Scheme) to compute with very large numbers. This is because, by default, Racket and Scheme use "bignums" to represent integers. Packages with bignum functionality are available for many languages, including C, but they can make for extra work in languages without garbage collection, because their representations are not of a bounded size.
Also, when I see Common Lisp's loop, is it based on tail recursion under the hood?
This is an implementation detail, but most likely not. First, CL already allows TAGBODY blocks, which makes LOOP expressible in terms of CL constructs.
For example, if I macroexpand a simple LOOP:
(loop)
I obtain a rather uniform result across implementations.
;; SBCL
(BLOCK NIL (TAGBODY #:G869 (PROGN) (GO #:G869)))
;; CCL
(BLOCK NIL (TAGBODY #:G4 (PROGN) (GO #:G4)))
;; JSCL
(BLOCK NIL (TAGBODY #:G869 (PROGN) (GO #:G869)))
;; ECL
(BLOCK NIL (TAGBODY #:G109 (PROGN) (GO #:G109)))
;; ABCL
(BLOCK NIL (TAGBODY #:G44 (GO #:G44)))
Implementations are typically written in languages that have jumps or loops, or that can emulate them easily. Moreover, a lot of CL implementations are compiled and can target assembly language that has jump primitives. So usually, there is no need for an intermediate step that goes through tail-recursive functions.
That being said, implementing TAGBODY with tail-recursion seems doable.
For example, JSCL cuts the expressions inside a tagbody into separate methods, one for each label, and those methods are called when using go: https://github.com/jscl-project/jscl/blob/db07c5ebfa2e254a0154666465d6f7591ce66e37/src/compiler/compiler.lisp#L982
Moreover, if I let the loop run for a while, no stack overflow happens. In that case, however, this is not due to tail-call elimination (which, AFAIK, is not implemented in all browsers). It looks like the code for tagbody always has an implicit while loop, and that go throws exceptions for the tagbody to catch.
In short: I need to check Haskell's speed on simple operations, and I currently have poor results, but I'm not sure if I'm doing compilation/optimization right.
UPD: The problem is answered, see the comments; the trouble was the integer type (without a signature, the Haskell version defaults to arbitrary-precision Integer, while the C version uses a machine int)...
In detail:
I work on a project where a number of services do bulk processing on data, so at least a certain part of these services simply needs to be fast. They do some heavy calculations and manipulations on the data, not only extract-load. In other words, it is a matter of how many instances and hours are going to be spent on, say, each 1e15 records of data.
Currently we are considering adding a few more services to the project, and some of my colleagues are curious to try writing them in a different language from those already used. I'm more or less OK with it, but I insist we check the "core performance" of the proposed languages first. Admittedly, speed testing is hard and controversial, so I propose we use a very simple test, with simple operations and without complex data structures, etc. We agreed on the "poor recursive" Fibonacci function:
fib x
  | x < 2 = x
  | otherwise = fib (x - 2) + fib (x - 1)

main = print (fib 43)
I wrote it in several languages for comparison. The C version looks like:
#include <stdio.h>

int fib(int x) {
    return (x < 2) ? x : (fib(x - 2) + fib(x - 1));
}

int main(void) {
    printf("%d\n", fib(43));
}
I compile the first with ghc -O2 test.hs and the latter with gcc -O2 test.c. GHC version is 8.4.3. Right now I see the results differing by about 12 times (2.2 seconds for the C version, 3 seconds for the Java version, and 30 seconds for Haskell, on the same machine).
I wonder if I did everything right with the compiler and compiler options. I thought that since Haskell compiles to native code (?), it should be comparable to C and similar languages. I need hints on this.
P.S. Please don't say that the Fibonacci function could be written without "exponential" recursion (I know), but as I said, we need a test with a lot of simple calculations.
P.P.S. I don't mean that if we can't make Haskell faster we won't use it. But we would probably reserve it for some other service where time is spent mainly on input-output, so it won't matter. For the current bunch of "speed-critical" services it's just a matter of whether the customer will pay $10000 or $120000 for these instances monthly.
Say I am doing a simple task-- a lot. For this post I will use the example of reducing mod a power of 2-- but this can be any task.
There are many approaches, and it is difficult to determine which one is better. For example, to reduce a 64-bit unsigned integer a modulo 2^b, we can either do:
a - (a >> b << b)
a << (64 - b) >> (64 - b)
a & ((1 << b) - 1)
a % (1 << b)
a & array[b], where array[i] would contain the value ((1 << i) - 1) for various i's. I do not wish to assume that array[i] will be in the L1 cache.
Perhaps there are others. For each of these methods, it is fairly straightforward to determine its cost after looking at the assembly code-- e.g. two shifts and a minus; or two shifts, a single-byte minus, and another minus operation partially optimized away by the compiler.
However, it is difficult to determine which of these is actually faster on a given architecture. I have tried the following, but failed:
-- Read documentation on the architecture. When I can find it, it gives a clear answer as to how many cycles a shift or an & takes; however, it is still difficult to tell how many cycles a cached minus operation would take, or how much of a performance drawback I am incurring by pushing useful data out of the cache in order to load the array[b] data.
-- Run the same code many million times in a row. However, this results in the array remaining in the cache, and therefore giving a faster performance.
-- Run the same code many million times, and in between the runs run some code to put / read other data in the cache. However, there is a lot of variability in the running time, from me putting and accessing data in the cache, and the standard deviation is too big for the results to be reliable.
-- Thanks to a suggestion by @klutt, I pasted these methods into the real code, and the first three methods seem to have equal performance. What is probably happening is that all three methods can finish executing while the program is waiting for another value to be pulled from the cache later on in the program. However, in case the program changes later (e.g., fewer cache lookups), one of these methods might become preferable to the others.
If I do not wish to revisit my reduction mod 2^* code every time the program changes, is there another, better method, to measure which of these is faster?
A lot of these different ways to implement a task (especially if it's a really simple task), will be optimized by the compiler to the exact same machine code instructions. Have you tried exploring http://godbolt.org/ and seeing what machine code instructions each of the different methods compiles to? This might give you a better idea of which ones are better.
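For example, here is a minimal sketch of the methods from the question as separate functions (assuming a 64-bit unsigned a and 0 < b < 64), which makes it easy to compare the generated assembly side by side on Compiler Explorer:

#include <stdint.h>

/* The candidate reductions of a modulo 2^b as separate functions, so each
   one's assembly is easy to inspect and compare on godbolt.org. */

uint64_t mod_sub(uint64_t a, unsigned b)   { return a - (a >> b << b); }
uint64_t mod_shift(uint64_t a, unsigned b) { return a << (64 - b) >> (64 - b); }
uint64_t mod_mask(uint64_t a, unsigned b)  { return a & ((1ULL << b) - 1); }
uint64_t mod_rem(uint64_t a, unsigned b)   { return a % (1ULL << b); }

/* Table variant: mask_table[i] is expected to hold (1ULL << i) - 1. */
uint64_t mod_table(uint64_t a, unsigned b, const uint64_t *mask_table)
{
    return a & mask_table[b];
}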
I am trying to find the GCD of two numbers using two approaches. One uses subtraction:
#include <stdio.h>

/* min and max are not in the C standard library; simple macro versions: */
#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

int gcd2(int a, int b)
{
    if (a == 0)
        return b;
    else
        printf("Iteration\n");
    if (a >= b)
        a = a - b;
    else
        b = b - a;
    return gcd2(min(a, b), max(a, b));
}
and the other one uses the modulo operation:
int gcd1(int a, int b)
{
    if (a == 0)
        return b;
    else
        printf("Iteration\n");
    return gcd1(b % a, a);
}
I know that the number of iterations in gcd2 is greater than in gcd1, but gcd1 uses the modulo operation, which is also costly, so I wanted to know: are these two approaches the same in terms of run time?
Knuth covers gcd extensively in Volume 2 of "The Art of Computer Programming" section 4.5.2
His version of binary gcd is more sophisticated and uses these facts:
a) if u and v are even, gcd(u, v)=2gcd(u/2, v/2)
b) if u is even and v is odd, gcd(u, v) = gcd(u/2, v)
c) As in Euclid's algorithm, gcd(u, v) = gcd(u-v, v)
d) if u and v are both odd, then u-v is even and |u-v| < max(u, v).
For his model computer MIX, binary gcd is about 20% faster than Euclidean gcd. Your mileage may vary.
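For reference, here is a minimal C sketch of binary GCD built directly on facts (a) through (d) above; this is not Knuth's MIX version, and it assumes non-negative inputs:

#include <stdint.h>

uint64_t binary_gcd(uint64_t u, uint64_t v)
{
    if (u == 0) return v;
    if (v == 0) return u;

    /* (a) strip common factors of 2, remembering how many */
    int shift = 0;
    while (((u | v) & 1) == 0) { u >>= 1; v >>= 1; shift++; }

    /* (b) make u odd */
    while ((u & 1) == 0) u >>= 1;

    do {
        /* (b) make v odd */
        while ((v & 1) == 0) v >>= 1;
        /* (c)/(d): both are odd now, so the difference is even and smaller */
        if (u > v) { uint64_t t = u; u = v; v = t; }
        v -= u;
    } while (v != 0);

    return u << shift;   /* restore the common factors of 2 */
}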
For most practical purposes the version that uses the modulo operator should be faster, for two reasons:
(1) the subtraction-based version has to iterate more often and thus incurs more branch mis-predictions
(2) the modulo-based approach (and the binary version) are less vulnerable to the performance hitch mentioned by Henry, which occurs when one operand is much larger than the other
Also, the modulo-based and the shift-based version shift more work into specialised circuitry inside the processor, which gives them even more of an edge. The binary version is more complex to code; it can easily be made faster in languages that are close to the metal, like C/C++ or assembler, but not so easily in languages that are further away from the coal face. One reason is that C/C++ compilers can avoid some branches by employing conditional move instructions (e.g. CMOV) or at least allow equivalent bit trickery; compilers for other languages tend to lag behind C/C++ unless they share the same backend (as with the gnu compilers).
Things like python are a different story altogether, since the interpreter overhead dwarfs the instruction-level timing differences between additive ops, shift ops and mul/div ops, or the cost of branching (and branch mis-predictions).
That's a general view on things; it is always possible to construct pathological inputs that can prove any version to be inferior or superior.
In any case, Knuth's dictum regarding premature optimisation should be taken seriously. Barring proof to the contrary, the best code is the code that is so simple and clear that it is difficult to get wrong. Stick with the modulo version until you have proof that it's too slow. If and when you do have such proof, come back and we'll speed things up (since then there will be specifics known, specifics that can be leveraged).
Sometimes a loop where the CPU spends most of its time has a branch that mispredicts very often (near 0.5 probability). I've seen a few techniques on very isolated threads, but never a list. The ones I already know fix situations where the condition can be turned into a bool, and that 0/1 is then used in some way to change the computation. Are there other conditional branches that can be avoided?
e.g. (pseudocode)
loop () {
if (in[i] < C )
out[o++] = in[i++]
...
}
Can be rewritten, arguably losing some readability, with something like this:
loop() {
out[o] = in[i] // copy anyway, just don't increment
inc = in[i] < C // increment counters? (0 or 1)
o += inc
i += inc
}
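A common concrete relative of this pattern is branchless stream compaction. In the sketch below (hypothetical names), i always advances, unlike the fragment above, but the unconditional-store-plus-conditional-increment trick is the same:

#include <stddef.h>

/* Copy the elements of in[] that are < limit into out[], returning how many
   were kept. out[] must have room for n elements, because every element is
   written speculatively; only the output index advances conditionally. */
size_t filter_less_than(const int *in, size_t n, int *out, int limit)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        out[o] = in[i];        /* copy anyway, just don't always count it */
        o += (in[i] < limit);  /* 0 or 1: advance only if the element is kept */
    }
    return o;
}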
Also, I've seen techniques in the wild that change && to & in the conditional, in certain contexts that escape my mind right now. I'm a rookie at this level of optimization, but it sure feels like there's got to be more.
Using Matt Joiner's example:
if (b > a) b = a;
You could also do the following, without having to dig into assembly code:
bool if_else = b > a;
b = a * if_else + b * !if_else;
I believe the most common way to avoid branching is to leverage bit parallelism in reducing the total jumps present in your code. The longer the basic blocks, the less often the pipeline is flushed.
As someone else has mentioned, if you want to do more than unrolling loops, and providing branch hints, you're going to want to drop into assembly. Of course this should be done with utmost caution: your typical compiler can write better assembly in most cases than a human. Your best hope is to shave off rough edges, and make assumptions that the compiler cannot deduce.
Here's an example of the following C code:
if (b > a) b = a;
In assembly without any jumps, by using bit-manipulation (and extreme commenting):
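; assuming a is already in eax and b in ebx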
sub eax, ebx ; = a - b
sbb edx, edx ; = (b > a) ? 0xFFFFFFFF : 0
and edx, eax ; = (b > a) ? a - b : 0
add ebx, edx ; b = (b > a) ? b + (a - b) : b + 0
Note that while conditional moves are immediately jumped on by assembly enthusiasts, that's only because they're easily understood and provide a higher level language concept in a convenient single instruction. They are not necessarily faster, not available on older processors, and by mapping your C code into corresponding conditional move instructions you're just doing the work of the compiler.
The generalization of the example you give is "replace conditional evaluation with math"; conditional-branch avoidance largely boils down to that.
What's going on with replacing && with & is that, since && is short-circuit, it constitutes conditional evaluation in and of itself. & gets you the same logical results if both sides are either 0 or 1, and isn't short-circuit. Same applies to || and | except you don't need to make sure the sides are constrained to 0 or 1 (again, for logic purposes only, i.e. you're using the result only Booleanly).
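As a small hypothetical illustration in C (assuming both comparisons are cheap and side-effect free):

#include <stdbool.h>

/* Short-circuit form: the compiler may emit two conditional branches,
   because y > 0 must not be evaluated when x > 0 is false. */
static bool both_positive_branchy(int x, int y)
{
    return x > 0 && y > 0;
}

/* Bitwise form: both comparisons are always evaluated, each yielding 0 or 1,
   so & gives the same logical result without the short-circuit branch. */
static bool both_positive_branchless(int x, int y)
{
    return (x > 0) & (y > 0);
}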
At this level things are very hardware-dependent and compiler-dependent. Is the compiler you're using smart enough to compile < without control flow? gcc on x86 is smart enough; lcc is not. On older or embedded instruction sets it may not be possible to compute < without control flow.
Beyond this Cassandra-like warning, it's hard to make any helpful general statements. So here are some general statements that may be unhelpful:
Modern branch-prediction hardware is terrifyingly good. If you could find a real program where bad branch prediction costs more than 1%-2% slowdown, I'd be very surprised.
Performance counters or other tools that tell you where to find branch mispredictions are indispensable.
If you actually need to improve such code, I'd look into trace scheduling and loop unrolling:
Loop unrolling replicates loop bodies and gives your optimizer more control flow to work with (see the sketch after these two points).
Trace scheduling identifies which paths are most likely to be taken, and among other tricks, it can tweak the branch directions so that the branch-prediction hardware works better on the most common paths. With unrolled loops, there are more and longer paths, so the trace scheduler has more to work with.
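A minimal sketch of manual unrolling (hypothetical names; assumes n is a multiple of 4):

#include <stddef.h>

/* Straightforward loop. */
long sum_simple(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled by 4: each iteration now contains four bodies, giving the
   optimizer and the trace scheduler longer stretches of straight-line
   code to work with. */
long sum_unrolled(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i += 4)
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    return sum;
}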
I'd be leery of trying to code this myself in assembly. When the next chip comes out with new branch-prediction hardware, chances are excellent that all your hard work goes down the drain. Instead I'd look for a feedback-directed optimizing compiler.
An extension of the technique demonstrated in the original question applies when you have to do several nested tests to get an answer. You can build a small bitmask from the results of all the tests, and then "look up" the answer in a table.
if (a) {
    if (b) {
        result = q;
    } else {
        result = r;
    }
} else {
    if (b) {
        result = s;
    } else {
        result = t;
    }
}
If a and b are nearly random (e.g., from arbitrary data), and this is in a tight loop, then branch prediction failures can really slow this down. It can be written as:
// assuming a and b are bools and thus exactly 0 or 1 ...
static const int table[] = { t, s, r, q };
unsigned index = (a << 1) | b;
result = table[index];
You can generalize this to several conditionals. I've seen it done for 4. If the nesting gets that deep, though, you want to make sure that testing all of them is really faster than doing just the minimal tests suggested by short-circuit evaluation.
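A hypothetical sketch of that generalization with four tests (the table contents are arbitrary placeholders):

#include <stdio.h>

/* Pack four 0/1 test results into a 4-bit index and look the answer up
   in a 16-entry table; the values below are arbitrary placeholders. */
static const int table16[16] = { 0, 1, 2, 3, 4, 5, 6, 7,
                                 8, 9, 10, 11, 12, 13, 14, 15 };

static int classify(int a, int b, int c, int d)
{
    /* a, b, c, d are assumed to be exactly 0 or 1 */
    unsigned index = (a << 3) | (b << 2) | (c << 1) | d;
    return table16[index];
}

int main(void)
{
    printf("%d\n", classify(1, 0, 1, 1)); /* index 0b1011 = 11 */
    return 0;
}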
GCC is already smart enough to replace conditionals with simpler instructions. For example newer Intel processors provide cmov (conditional move). If you can use it, SSE2 provides some instructions to compare 4 integers (or 8 shorts, or 16 chars) at a time.
Additionally, to compute the minimum you can use (see these magic tricks):
min(x, y) = x+(((y-x)>>(WORDBITS-1))&(y-x))
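A hedged C sketch of that trick: it relies on an arithmetic right shift of a negative value (implementation-defined in C, though common in practice), and y - x must not overflow:

#include <stdint.h>

/* Branch-free minimum using the shift-and-mask trick above: (d >> 31) is
   all ones when y < x and zero otherwise, so we add either (y - x) or 0. */
static int32_t branchless_min(int32_t x, int32_t y)
{
    int32_t d = y - x;
    return x + ((d >> 31) & d);
}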
However, pay attention to things like:
c[i][j] = min(c[i][j], c[i][k] + c[j][k]); // from Floyd-Warshal algorithm
which, even though no jumps are implied, is much slower than
int tmp = c[i][k] + c[j][k];
if (tmp < c[i][j])
    c[i][j] = tmp;
My best guess is that in the first snippet you pollute the cache more often, while in the second you don't.
In my opinion if you're reaching down to this level of optimization, it's probably time to drop right into assembly language.
Essentially you're counting on the compiler generating a specific pattern of assembly to take advantage of this optimization in C anyway. It's difficult to guess exactly what code a compiler is going to generate, so you'd have to look at it anytime a small change is made - why not just do it in assembly and be done with it?
Most processors provide branch prediction that is better than 50%. In fact, if you get a 1% improvement in branch prediction, then you can probably publish a paper. There is a mountain of papers on this topic if you are interested.
You're better off worrying about cache hits and misses.
This level of optimization is unlikely to make a worthwhile difference in all but the hottest of hotspots. Assuming it does (without proving it in a specific case) is a form of guessing, and the first rule of optimization is don't act on guesses.