How expressive can we be with arrays in Z3(Py)? An example

My first question is whether I can express the following formula in Z3Py:
Exists i::Integer s.t. (0 <= i < |arr|) & (avg(arr) + t < arr[i])
This means: is there a position i, with 0 <= i < |arr|, whose value arr[i] is greater than the average of the array, avg(arr), plus a given threshold t?
I know this kind of expression can be written in Dafny, and (since Dafny uses Z3 underneath) I guess it can be done in Z3Py.
My second question is: how expressive is the decidable fragment involving arrays in Z3?
I read this paper showing that the full theory of arrays is not decidable (http://theory.stanford.edu/~arbrad/papers/arrays.pdf), only a concrete fragment, the array property fragment.
Is there any interesting paper/tutorial on what can and cannot be done with arrays+quantifiers+functions in Z3?

You found the best paper to read regarding reasoning with arrays, so I doubt there's a better resource or tutorial out there for you.
I think sequence logic (not yet officially supported by SMT-LIB, but z3 supports it) is the right logic to use for reasoning about these sorts of problems; see: https://microsoft.github.io/z3guide/docs/theories/Sequences/
Having said that, most properties about arrays/sequences of "arbitrary size" require inductive proofs. This is because most interesting functions on them are essentially recursive (or iterative), and induction is the only way to prove properties of such programs. While SMT solvers have improved significantly regarding support for recursive definitions and induction, they still don't perform anywhere near as well as a traditional theorem prover. (This is, of course, to be expected.)
I'd recommend looking at the sequence logic and playing around with recursive definitions. You might get some mileage out of that, though don't expect proofs for anything that requires induction, especially if the inductive hypothesis needs some clever invariant to be specified.
Note that if you know the length of your array concretely (e.g., 10, 15, or some other hopefully not-too-large number), then it's best to allocate the elements symbolically yourself and not use arrays/sequences at all. (And you can repeat your proof for lengths 0, 1, 2, ..., up to some fixed number.) But if you want proofs that work for arbitrary lengths, your best bet is to use sequences in z3, not arrays; with all the caveats I mentioned above.
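To make the fixed-length strategy concrete, here is a minimal Z3Py sketch (one possible encoding, not the only one; the names arr, t, n are mine, and division is avoided by multiplying the comparison through by n):
from z3 import Int, Solver, Or, Sum, sat

n = 5                                      # concrete, known length
arr = [Int('a_%d' % i) for i in range(n)]  # allocate elements symbolically
t = Int('t')

s = Solver()
s.add(t == 2)
# avg(arr) + t < arr[i]  is encoded as  Sum(arr) + n*t < n*arr[i]  (for n > 0),
# which sidesteps integer division. Or(...) plays the role of Exists i.
s.add(Or([Sum(arr) + n * t < n * arr[i] for i in range(n)]))

if s.check() == sat:
    print(s.model())
For arbitrary lengths you would switch to the sequence theory and a recursive sum definition, with the induction caveats mentioned above.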

Related

Computational Efficiency of Forward Mode Automatic vs Numeric vs Symbolic Differentiation

I am trying to solve a problem of finding the roots of a function using the Newton-Raphson (NR) method in the C language. The functions whose roots I would like to find are mostly polynomials, but they may also contain trigonometric and logarithmic terms.
The NR method requires the derivative of the function. There are 3 ways to implement differentiation:
Symbolic
Numerical
Automatic (with subtypes forward mode and reverse mode; for this particular question, I would like to focus on forward mode)
I have thousands of these functions all requiring finding roots in the quickest time possible.
From the little that I do know, automatic differentiation is in general quicker than symbolic because it handles the problem of "expression swell" a lot more efficiently.
My question therefore is, all other things being equal, which method of differentiation is more computationally efficient: Automatic Differentiation (and more specifically, forward mode) or Numeric differentiation?
If your functions are truly all polynomials, then symbolic derivatives are dirt simple. Letting the coefficients of the polynomial be stored in an array with entries p[k] = a_k, where index k corresponds to the coefficient of x^k, the derivative is represented by the array with entries dp[k] = (k+1) p[k+1]. For multivariate polynomials, this extends straightforwardly to multidimensional arrays. If your polynomials are not in standard form, e.g. if they include terms like (x-a)^2 or ((x-a)^2-b)^3 or whatever, a little bit of work is needed to transform them into standard form, but this is something you probably should be doing anyway.
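As a quick sketch of that coefficient-array representation (in Python for brevity; the names are illustrative):
def poly_derivative(p):
    # p[k] is the coefficient of x^k; dp[k] = (k+1) * p[k+1]
    return [(k + 1) * p[k + 1] for k in range(len(p) - 1)]

def poly_eval(p, x):
    # Horner's rule: O(len(p)) operations
    acc = 0.0
    for c in reversed(p):
        acc = acc * x + c
    return acc

# One Newton-Raphson step for p(x) = x^2 - 2, i.e. coefficients [-2, 0, 1]:
p = [-2.0, 0.0, 1.0]
dp = poly_derivative(p)
x = 1.0
x = x - poly_eval(p, x) / poly_eval(dp, x)   # 1.5, converging to sqrt(2)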
If the derivative is not available, you should consider using the secant or regula falsi methods. They have very decent convergence speed (order φ ≈ 1.618 instead of quadratic). An additional benefit of regula falsi is that its iterations remain confined to the initial interval, which allows reliable root separation (which Newton does not).
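A minimal secant-method sketch (Python for brevity; the tolerance and iteration cap are assumptions to tune for real use). Regula falsi differs only in maintaining a sign-changing bracket:
def secant(f, x0, x1, tol=1e-12, max_iter=100):
    # derivative-free root finding, convergence order phi ~ 1.618
    for _ in range(max_iter):
        f0, f1 = f(x0), f(x1)
        if f1 == f0:                       # flat secant line: give up
            break
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2
        x0, x1 = x1, x2
    return x1

print(secant(lambda x: x**2 - 2, 1.0, 2.0))   # ~1.4142135623730951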
Also note that in the case of numerical evaluation of the derivative, each Newton step requires several evaluations of the function, at best two of them. The effective convergence order per function evaluation then drops to √2 ≈ 1.414, which is outperformed by the derivative-free methods.
Also note that the symbolic expression of the derivative is often more costly to evaluate than the function itself. So one iteration of Newton costs at least two function evaluations, spoiling part of the benefit of the higher convergence rate.

Techniques for static code analysis in detecting integer overflows

I'm trying to find some effective techniques which I can base my integer-overflow detection tool on. I know there are many ready-made detection tools out there, but I'm trying to implement a simple one on my own, both for my personal interest in this area and also for my knowledge.
I know techniques like pattern matching and type inference, but I have read that more sophisticated code analysis techniques are required to detect integer overflows. There is also taint analysis, which can "flag" untrusted sources of data.
Is there some other technique, which I might not be aware of, which is capable of detecting integer overflows?
It may be worth trying the cppcheck static analysis tool, which claims to detect signed integer overflow as of version 1.67:
New checks:
- Detect shift by too many bits, signed integer overflow and dangerous sign conversion
Notice that it supports both C and C++ languages.
There is no overflow check for unsigned integers, as by the Standard unsigned types never overflow (they wrap around modulo 2^N).
Here is a basic example:
#include <stdio.h>

int main(void)
{
    int a = 2147483647;     /* INT_MAX on this platform */
    a = a + 1;              /* signed overflow: undefined behaviour */
    printf("%d\n", a);
    return 0;
}
For this code it reports:
$ ./cppcheck --platform=unix64 simple.c
Checking simple.c...
[simple.c:6]: (error) Signed integer overflow for expression 'a+1'
However, I wouldn't expect too much from it (at least with the current version), as a slightly different program:
int a = 2147483647;
a++;
passes without noticing overflow.
It seems you are looking for some sort of value range analysis, to detect when a range would exceed the set bounds. This is something that on the face of it seems simple, but is actually hard. There will be lots of false positives, and that's even before counting bugs in the implementation.
To ignore the details for a moment, you associate a pair [lower bound, upper bound] with every variable, and do some math to figure out the new bounds for every operator. For example if the code adds two variables, in your analysis you add the upper bounds together to form the new upper bound, and you add the lower bounds together to get the new lower bound.
But of course it's not that simple. Firstly, what about non-straight-line code? if's are not too bad: you can evaluate both sides and then take the union of the ranges afterwards (which can lose information! if two ranges have a gap between them, their union will span the gap). Loops require tricks; a naive implementation may run billions of iterations of analysis on a loop, or never terminate at all. Even if you use an abstract domain with no infinite ascending chains, you can still get into trouble. The keywords to solve this are "widening operator" and (optionally, but probably a good idea) "narrowing operator".
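To illustrate, here is a toy Python sketch of such an interval domain (all names are mine; a real implementation needs far more care):
INF = float('inf')

def add(a, b):
    # [a0, a1] + [b0, b1] = [a0 + b0, a1 + b1]
    return (a[0] + b[0], a[1] + b[1])

def join(a, b):
    # union of ranges at a control-flow merge; gaps between ranges are lost
    return (min(a[0], b[0]), max(a[1], b[1]))

def widen(old, new):
    # crude widening operator: any growing bound jumps straight to infinity,
    # guaranteeing that loop analysis terminates (at the cost of precision)
    lo = old[0] if new[0] >= old[0] else -INF
    hi = old[1] if new[1] <= old[1] else INF
    return (lo, hi)

print(join((1, 1), (100, 100)))   # (1, 100): the gap 2..99 is lost
print(widen((0, 0), (0, 1)))      # (0, inf): a counting loop stabilises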
It's even worse than that, because what is a variable? Your regular local variable of scalar type that never has its address taken isn't too bad. But what about arrays? Now you don't even know for sure which entry is being affected: the index itself may be a range! And then there's aliasing. That's far from a solved problem, and it causes many real-world tools to make really pessimistic assumptions.
Also, function calls. You're going to call functions from some context, hopefully a known one (if not, then it's simple: you know nothing). That makes it hard: not only is there suddenly a lot more state to keep track of at the same time, there may also be several places a function could be called from, including itself. The usual response is to re-evaluate the function when the range of one of its arguments has been expanded; once again this could take billions of steps if not done carefully. There are also algorithms that analyze a function separately for different calling contexts, which can give more accurate results, but it's easy to spend a lot of time analyzing contexts that aren't different enough to matter.
Anyway if you've made it this far, you could read Accurate Static Branch Prediction by Value Range Propagation and related papers to get a good idea of how to actually do this.
And that's not all. Considering only the ranges of individual variables, without caring about the relationships between them (keyword: non-relational abstract domain), does badly on things that are really simple for a human reader, such as subtracting two variables that are always close together in value: the analysis will produce a large range, on the assumption that they may be as far apart as their bounds allow. Even for something as trivial as
; assume x in [0 .. 10]
int y = x + 2;
int diff = y - x;
For a human reader, it's pretty obvious that diff = 2. In the analysis described so far, the conclusions would be that y in [2 .. 12] and diff in [-8 .. 12]. Now suppose the code continues with
int foo = diff + 2;
int bar = foo - diff;
Now we get foo in [-6 .. 14] and bar in [-18 .. 22]: even though bar is obviously 2 again, its range has doubled once more. This was a simple example, and you could make up some ad-hoc hacks to detect it, but it's a more general problem. This effect tends to blow up the ranges of variables quickly and generate lots of unnecessary warnings. A partial solution is assigning ranges to differences between variables; then you get what's called a difference-bound matrix (unsurprisingly, this is an example of a relational abstract domain). They can get big and slow for interprocedural analysis, or if you want to throw non-scalar variables at them too, and the algorithms start to get more complicated. And they only get you so far: if you throw a multiplication into the mix (that includes x + x and variants), things still go bad very fast.
So you can throw something else into the mix that can handle multiplication by a constant; see for example the paper Abstract Domains of Affine Relations. This is very different from ranges, and won't by itself tell you much about the ranges of your variables, but you could use it to get more accurate ranges.
The story doesn't end there, but this answer is getting long. I hope this does not discourage you from researching this topic, it's a topic that lends itself well to starting out simple and adding more and more interesting things to your analysis tool.
Checking integer overflows in C:
When you add two 32-bit numbers and get a 33-bit result, the lower 32 bits are written to the destination and the highest bit is signalled via a carry flag. Many languages, including C, don't provide a way to access this 'carry', so you can use the limits in <limits.h> to check before you perform an arithmetic operation. Consider unsigned ints a and b:
If MAX - b < a, we know for sure that a + b would cause an overflow. An example is given in this C FAQ.
Watch out: As chux pointed out, this example is problematic with signed integers, because it won't handle MAX - b or MIN + b if b < 0. The example solution in the second link (below) covers all cases.
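A sketch of that pre-check (in Python, whose integers don't overflow, so the constant UINT_MAX below simulates the C value from <limits.h>):
UINT_MAX = 2**32 - 1          # stand-in for the value from <limits.h>

def safe_add_u32(a, b):
    # the check from above: if MAX - b < a, then a + b would overflow
    if UINT_MAX - b < a:
        raise OverflowError("u32 addition would overflow")
    return a + b

print(safe_add_u32(2**31, 2**31 - 1))   # 4294967295, just fits
# safe_add_u32(2**31, 2**31) would raise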
Multiplying numbers can cause an overflow, too. A solution is to widen the first operand to a type of double the width, then do the multiplication. Something like:
(typecast)a*b
Watch out: (typecast)(a*b) would be incorrect, because it truncates first and typecasts afterwards.
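The same idea sketched in Python (unbounded integers stand in for the widened C type; in C the point is that (uint64_t)a * b widens before multiplying, while (uint64_t)(a * b) multiplies in 32 bits first and truncates):
UINT32_MAX = 2**32 - 1

def safe_mul_u32(a, b):
    wide = a * b                  # plays the role of (uint64_t)a * b in C
    if wide > UINT32_MAX:
        raise OverflowError("u32 multiplication would overflow")
    return wide

print(safe_mul_u32(65536, 65535))   # 4294901760, fits
# safe_mul_u32(65536, 65536) would raise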
A detailed technique for C can be found HERE. Using macros seems to be an easy and elegant solution.
I'd expect Frama-C to provide such a capability. Frama-C is focused on C source code, but I don't know which C dialects it handles. I believe it uses abstract interpretation to model values. I don't know if it specifically checks for overflows.
Our DMS Software Reengineering Toolkit has a variety of language front ends, including most major dialects of C. It provides control and data flow analysis, and also abstract interpretation for computing ranges, as foundations on which you can build an answer. My Google Tech Talk on DMS at about 0:28:30 specifically talks about how one can use DMS's abstract interpretation of value ranges to detect overflow (of an index into a buffer). A variation on checking the upper bound on array sizes is simply to check for values not exceeding 2^N. However, off-the-shelf DMS does not provide any specific overflow analysis for C code. There's room for the OP to do interesting work :=}

How to incorporate C or C++ code into my R code to speed up an MCMC program using a Metropolis-Hastings algorithm

I am seeking advice on how to incorporate C or C++ code into my R code to speed up an MCMC program that uses a Metropolis-Hastings algorithm. I am using an MCMC approach to model the likelihood, given various covariates, that an individual will be assigned a particular rank in a social status hierarchy by a third party (the judge): each judge (approx. 80, across 4 villages) was asked to rank a group of individuals (approx. 80, across 4 villages) based on their assessment of each individual's social status. Therefore, for each judge I have a vector of ranks corresponding to their judgement of each individual's position in the hierarchy.
To model this I assume that, when assigning ranks, judges are basing their decisions on the relative value of some latent measure of an individual's utility, u. Given this, it can then be assumed that a vector of ranks, r, produced by a given judge is a function of an unobserved vector, u, describing the utility of the individuals being ranked, where the individual with the kth highest value of u will be assigned the kth rank. I model u, using the covariates of interest, as a multivariate normally distributed variable and then determine the likelihood of the observed ranks, given the distribution of u generated by the model.
In addition to estimating the effect of, at most, 5 covariates, I also estimate hyperparameters describing the variance between judges and between items. Therefore, for every iteration of the chain I evaluate a multivariate normal density approximately 8-10 times. As a result, 5000 iterations can take up to 14 hours. Obviously, I need to run it for far more than 5000 iterations, so I need a means of dramatically speeding up the process. Given this, my questions are as follows:
(i) Am I right to assume that the best speed gains will be had by running some, if not all, of my chain in C or C++?
(ii) Assuming the answer to question (i) is yes, how do I go about this? For example, is there a way for me to retain all my R functions but simply do the looping in C or C++, i.e. can I call my R functions from C and then do the looping there?
(iii) I guess what I really want to know is how best to approach the incorporation of C or C++ code into my program.
First make sure your slow R version is correct. Debugging R code might be easier than debugging C code. Done that? Great. You now have correct code you can compare against.
Next, find out what is taking the time. Use Rprof to run your code and see what is taking the time. I did this for some code I inherited once, and discovered it was spending 90% of the time in the t() function. This was because the programmer had a matrix, A, and was doing t(A) in a zillion places. I did one tA=t(A) at the start, and replaced every t(A) with tA. Massive speedup for no effort. Profile your code first.
Now, you've found your bottleneck. Is it code you can speed up in R? Is it a loop that you can vectorise? Do that. Check your results against your gold standard correct code. Always. Yes, I know it's hard to compare algorithms that rely on random numbers, so set the seeds to the same values and try again.
Still not fast enough? Okay, now maybe you need to rewrite parts (the lowest level parts, generally, and those that were taking the most time in the profiling) in C or C++ or Fortran, or if you are really going for it, in GPU code.
Again, really check the code is giving the same answers as the correct R code. Really check it. If at this stage you find any bugs anywhere in the general method, fix them in what you thought was the correct R code and in your latest version, and rerun all your tests. Build lots of automatic tests. Run them often.
Read up about code refactoring. It's called refactoring because if you tell your boss you are rewriting your code, he or she will say 'why didn't you write it correctly first time?'. If you say you are refactoring your code, they'll say "hmmm... good". THIS ACTUALLY HAPPENS.
As others have said, Rcpp is made of win.
A complete example using R, C++ and Rcpp is provided by this blog post, which was inspired by a post on Darren Wilkinson's blog (and he has more follow-ups). The example is also included with recent releases of Rcpp in a directory RcppGibbs and should get you going.
I have a blog post which discusses exactly this topic, and which I suggest you take a look at:
http://darrenjw.wordpress.com/2011/07/31/faster-gibbs-sampling-mcmc-from-within-r/
(this post is more relevant than the post of mine that Dirk refers to).
I think the best method currently to integrate C or C++ is the Rcpp package by Dirk Eddelbuettel. You can find a lot of information on his website. There is also a talk at Google, available on YouTube, that might be interesting.
Check out this project:
https://github.com/armstrtw/rcppbugs
Also, here is a link to the R/Fin 2012 talk:
https://github.com/downloads/armstrtw/rcppbugs/rcppbugs.pdf
I would suggest benchmarking each step of the MCMC sampler to identify the bottleneck. If you put each full conditional or M-H step into a function, you can use the R compiler package, which might give you a 5%-10% speed gain. The next step is to use Rcpp.
I think it would be really nice to have a general-purpose Rcpp function which generates just one single draw using the M-H algorithm, given a likelihood function.
However, with Rcpp some things become difficult if you only know the R language: non-standard random distributions (especially truncated ones) and using arrays. You have to think more like a C programmer there.
The multivariate normal is actually a big issue in R. dmvnorm is very inefficient and slow; dmnorm is faster, but in some models it would give me NaNs sooner than dmvnorm does.
Neither takes an array of covariance matrices, so it is impossible to vectorize the code in many instances. As long as you have a common covariance matrix and common means, however, you can vectorize, which is the R-ish strategy for speeding things up (and the opposite of what you would do in C).
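To make the single-draw M-H step concrete, here is a language-agnostic sketch (written in Python for brevity; the names and the random-walk proposal are illustrative, and a real Rcpp version would follow the same shape):
import math, random

def mh_step(theta, log_post, proposal_sd):
    # one Metropolis-Hastings draw with a symmetric Gaussian random-walk proposal
    proposal = theta + random.gauss(0.0, proposal_sd)
    log_alpha = log_post(proposal) - log_post(theta)
    if random.random() < math.exp(min(0.0, log_alpha)):
        return proposal           # accept
    return theta                  # reject: keep the current value

# Example: sample from a standard normal via its log-density.
log_post = lambda x: -0.5 * x * x
theta, chain = 0.0, []
for _ in range(10000):
    theta = mh_step(theta, log_post, proposal_sd=1.0)
    chain.append(theta)
print(sum(chain) / len(chain))    # close to 0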

How fast is Data.Array?

The documentation of Data.Array reads:
Haskell provides indexable arrays, which may be thought of as
functions whose domains are isomorphic to contiguous subsets of the
integers. Functions restricted in this way can be implemented
efficiently; in particular, a programmer may reasonably expect rapid
access to the components.
I wonder how fast can (!) and (//) be. Can I expect O(1) complexity from these, as I would have from their imperative counterparts?
In general, yes, you should be able to expect O(1) from (!), although I'm not sure if that's guaranteed by the standard.
You might want to look at the vector package if you want faster arrays, though (it uses stream fusion). It is also better designed.
Note that (//) is probably O(n), though, because it has to traverse the list (just like an imperative program would). If you need a lot of mutation you can use MArray or MVector.

What is the best way of determining a loop invariant?

When using formal aspects to create some code is there a generic method of determining a loop invariant or will it be completely different depending on the problem?
It has already been pointed out that one and the same loop can have several invariants, and that computability is against you. That doesn't mean you cannot try.
You are, in fact, looking for an inductive invariant: the word invariant may also be used for a property that is true at each iteration but for which it is not enough to know that it holds at one iteration to deduce that it holds at the next. If I is an inductive invariant, then any consequence of I is an invariant, but may not be an inductive invariant.
You are probably trying to get an inductive invariant to prove a certain property (post-condition) of the loop in some defined circumstances (pre-conditions).
There are two heuristics that work quite well:
start with what you have (pre-conditions), and weaken until you have an inductive invariant. To get an intuition for how to weaken, apply one or several loop iterations forwards and see what ceases to be true in the formula you have.
start with what you want (post-conditions), and strengthen until you have an inductive invariant. To get an intuition for how to strengthen, apply one or several loop iterations backwards and see what needs to be added so that the post-condition can be deduced.
If you want the computer to help you in your practice, I can recommend the Jessie deductive verification plug-in for C programs in Frama-C. There are others, especially for Java and JML annotations, but I am less familiar with them. Trying out the invariants you think of is much faster than working out on paper whether they hold. I should point out that verifying that a property is an inductive invariant is also undecidable, but modern automatic provers do great on many simple examples. If you decide to go that route, get as many as you can from the list: Alt-Ergo, Simplify, Z3.
With the optional (and slightly difficult to install) library Apron, Jessie can also infer some simple invariants automatically.
It's actually trivial to generate loop invariants: true is a good one, for instance. It fulfills all three properties you want:
It holds before loop entry
It holds after each iteration
It holds after loop termination
But what you're after is probably the strongest loop invariant. Finding the strongest loop invariant, however, is sometimes even an undecidable task. See the article Inadequacy of Computable Loop Invariants.
I don't think it's easy to automate this. From wiki:
Because of the fundamental similarity of loops and recursive programs, proving partial correctness of loops with invariants is very similar to proving correctness of recursive programs via induction. In fact, the loop invariant is often the inductive property one has to prove of a recursive program that is equivalent to a given loop.
I've written about writing loop invariants in my blog, see Verifying Loops Part 2. The invariants needed to prove a loop correct typically comprise 2 parts:
A generalisation of the state that is intended when the loop terminates.
Extra bits needed to ensure that the loop body is well-formed (e.g. array indices in bounds).
(2) is straightforward. To derive (1), start with a predicate expressing the desired state after termination. Chances are it contains a 'forall' or 'exists' over some range of data. Now change the bounds of the 'forall' or 'exists' so that (a) they depend on variables modified by the loop (e.g. loop counters), and (b) the invariant is trivially true when the loop is first entered (usually by making the range of the 'forall' or 'exists' empty).
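A small executable illustration of this recipe (Python; the summing loop and names are mine, not from the post): the postcondition total == sum(a[0:n]) becomes the invariant by replacing the constant bound n with the loop counter i, which makes it trivially true on entry.
def array_sum(a):
    total, i = 0, 0
    while i < len(a):
        # (1) generalised postcondition: total == sum(a[0:i])
        # (2) well-formedness: 0 <= i <= len(a) keeps a[i] in bounds
        assert total == sum(a[0:i]) and 0 <= i <= len(a)
        total += a[i]
        i += 1
    assert total == sum(a)        # postcondition at termination
    return total

print(array_sum([3, 1, 4, 1, 5]))   # 14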
There are a number of heuristics for finding loop invariants. One good book about this is "Programming in the 1990s" by Ed Cohen. It's about how to find a good invariant by manipulating the postcondition by hand. Examples are: replace a constant by a variable, strengthen the invariant, ...
