Lean complains it can't see that a statement is decidable

I'm trying to define the following quantity partn:
variable pi : nat -> Prop
variable (Hdecp : ∀ p, decidable (pi p))
definition partn (n : nat) : nat := ∏ p ∈ (prime_factors n), (if pi p then p^(mult p n) else 1)
but get the error
error: failed to synthesize placeholder
pi : ℕ → Prop,
n p : ℕ
⊢ decidable (pi p)
How can I help Lean recognize that (pi p) is indeed decidable thanks to Hdecp?

edit: The elaborator can actually infer the instance completely on its own, as long as it's available in the definition's context:
variable (Hdecp : ∀ p, decidable (pi p))
include Hdecp
definition partn (n : nat) : nat := ∏ p ∈ (prime_factors n), (if pi p then p^(mult p n) else 1)
original answer (still useful if the instance has more complex hypotheses):
If you want to avoid the explicit call to ite, you can locally introduce the decidable instance:
definition partn (n : nat) : nat := ∏ p ∈ (prime_factors n),
have decidable (pi p), from Hdecp p,
if pi p then p^(mult p n) else 1

I found a solution:
definition partn (n : nat) : nat := ∏ p ∈ (prime_factors n), (#ite (pi p) (Hdecp p) nat (p^(mult p n)) 1)
which allows me to explicitly use Hdecp in my if-then-else

Related

In Dafny, how to assert that, if all elements in a sequence are less than some value, this also holds for a permutation of this sequence?

This is my first time asking a question on here, so I hope I have adequately followed the guidelines for asking a proper question.
For some quick context: I am currently trying to implement and verify a recursive version of Quicksort in Dafny. At this point, it seems that all there is left to do is to prove one last lemma (i.e., the implementation verifies completely when I remove this lemma's body; if I am not mistaken, this should mean that the implementation verifies completely when assuming this lemma holds).
Specifically, this lemma states that, if a sequence of values is currently properly partitioned around a pivot, then, if one permutes the (sub)sequences left and right of the pivot, the complete sequence is still a valid partition. Eventually, using this lemma, I essentially want to say that, if the subsequences left and right of the pivot get sorted, the complete sequence is still a valid partition; as a result, the complete sequence is sorted.
Now, I have tried to prove this lemma, but I get stuck on the part where I try to show that, if all values in a sequence are less than some value, then all values in a permutation of that sequence are also less than that value. Of course, I also need to show the equivalent property with "less than" replaced by "greater than or equal to", but I suppose that they are nearly identical, so knowing one would be sufficient.
The relevant part of the code is given below:
predicate Permutation(a: seq<int>, b: seq<int>)
requires 0 <= |a| == |b|
{
multiset(a) == multiset(b)
}
predicate Partitioned(a: seq<int>, lo: int, hi: int, pivotIndex: int)
requires 0 <= lo <= pivotIndex < hi <= |a|
{
(forall k :: lo <= k < pivotIndex ==> a[k] < a[pivotIndex])
&&
(forall k :: pivotIndex <= k < hi ==> a[k] >= a[pivotIndex])
}
lemma PermutationPreservesPartition(apre: seq<int>, apost: seq<int>, lo: int, hi: int, pivotIndex: int)
requires 0 <= lo <= pivotIndex < hi <= |apre| == |apost|
requires Partitioned(apre, lo, hi, pivotIndex)
requires Permutation(apre[lo..pivotIndex], apost[lo..pivotIndex])
requires Permutation(apre[pivotIndex + 1..hi], apost[pivotIndex + 1..hi])
requires apre[pivotIndex] == apost[pivotIndex]
ensures Partitioned(apost, lo, hi, pivotIndex)
{
}
I've tried several things, such as:
assert
Partitioned(apre, lo, hi, pivotIndex) && apre[pivotIndex] == apost[pivotIndex]
==>
(
(forall k :: lo <= k < pivotIndex ==> apre[k] < apost[pivotIndex])
&&
(forall k :: pivotIndex <= k < hi ==> apre[k] >= apost[pivotIndex])
);
assert
(forall k :: lo <= k < pivotIndex ==> apre[k] < apost[pivotIndex])
&&
(Permutation(apre[lo..pivotIndex], apost[lo..pivotIndex]))
==>
(forall k :: lo <= k < pivotIndex ==> apost[k] < apost[pivotIndex]);
However, here the second assertion already fails to verify.
After this first attempt, I figured that Dafny might not be able to verify this property between the sequences because the "Permutation" predicate uses the corresponding multisets instead of the sequences themselves. So, I tried to make the relation between the sequences more explicit by doing the following:
assert
Permutation(apre[lo..pivotIndex], apost[lo..pivotIndex])
==>
forall v :: v in multiset(apre[lo..pivotIndex]) <==> v in multiset(apost[lo..pivotIndex]);
assert
forall v :: v in multiset(apre[lo..pivotIndex]) <==> v in apre[lo..pivotIndex];
assert
forall v :: v in multiset(apost[lo..pivotIndex]) <==> v in apost[lo..pivotIndex];
assert
forall v :: v in apre[lo..pivotIndex] <==> v in apost[lo..pivotIndex];
assert
(
(forall v :: v in apre[lo..pivotIndex] <==> v in apost[lo..pivotIndex])
&&
(forall v :: v in apre[lo..pivotIndex] ==> v < apre[pivotIndex])
)
==>
(forall v :: v in apost[lo..pivotIndex] ==> v < apre[pivotIndex]);
assert
(
(forall v :: v in apost[lo..pivotIndex] ==> v < apre[pivotIndex])
&&
apre[pivotIndex] == apost[pivotIndex]
)
==>
(forall v :: v in apost[lo..pivotIndex] ==> v < apost[pivotIndex]);
This all verifies, which I thought was great, since there only seems one step left to connect this to the definition of "Partitioned", viz.:
assert
(forall v :: v in apost[lo..pivotIndex] ==> v < apost[pivotIndex])
==>
(forall k :: lo <= k < pivotIndex ==> apost[k] < apost[pivotIndex]);
Nevertheless, Dafny then fails to verify this assertion.
So, at this point, I am not sure how to convince Dafny that this lemma holds. I've tried looking at implementations of Quicksort in Dafny from other people, as well as any potentially relevant question I could find. However, this has, as of yet, been to no avail. I hope someone could help me out here.
My apologies for any potential ignorance regarding Dafny, I am just starting out with the language.
It is difficult to give a usable definition of "permutation". However, to prove the correctness of a sorting algorithm, you only need that the multiset of elements stays the same. For a sequence s, the expression multiset(s) gives you the multiset of elements of s. If you start with an array a, then a[..] gives you a sequence consisting of the elements of the array, so multiset(a[..]) gives you the multiset of elements in the array.
See https://github.com/dafny-lang/dafny/blob/master/Test/dafny3/GenericSort.dfy#L59 for an example.
Dafny's verifier cannot work out all properties of such multisets by itself. However, it generally does understand that the multiset of elements is unchanged when you swap two elements.
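That fact is easy to demonstrate directly. Below is a minimal, hedged sketch (the method name Swap is mine, not from the question): Dafny verifies that a simultaneous-assignment swap preserves the multiset of an array's elements without any extra hints.

method Swap(a: array<int>, i: int, j: int)
  requires 0 <= i < a.Length && 0 <= j < a.Length
  modifies a
  ensures multiset(a[..]) == multiset(old(a[..]))
{
  // The swap merely permutes elements, so the multiset is unchanged.
  a[i], a[j] := a[j], a[i];
}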

Type level programming to represent multidimensional arrays (Tensors)

I would like to have a type to represent multidimensional arrays (tensors) in a type-safe way, so I could write, for example: zero :: Tensor (5,3,2) Integer
that would represent a multidimensional array that has 5 elements, each of which has 3 elements, each of which has 2 elements, where all elements are Integers
How would you define this type using type level programming?
Edit:
After the wonderful answer by Alec, Which implemented this using GADTs,
I wonder if you could take this a step further and support multiple implementations of a class Tensor, of the operations on tensors, and of serialization of tensors,
such that you could have for example:
GPU or CPU implementations using C
pure Haskell implementations
implementation that only prints the graph of computation and does not compute anything
implementation which caches results on disk
parallel or distributed computation
etc...
All type safe and easy to use.
My intention is to make a library in Haskell much like TensorFlow but type-safe and much more extensible, using automatic differentiation (the ad library) and exact real arithmetic (the exact-real library)
I think a functional language like Haskell is much more appropriate for these things (for all things in my opinion) than the python ecosystem which sprouted somehow.
Haskell is purely functional, much more suitable for computational programming than Python
Haskell is much more efficient than python and can be compiled to binary
Haskell's laziness (arguably) removes the need to optimize the computation graph, and makes code much simpler that way
much more powerful abstractions in Haskell
Although I see the potential, I'm just not well versed enough (or smart enough) for this type-level programming, so I don't know how to implement such a thing in Haskell and get it to compile.
That's where I need your help.
Here is one way (here is a complete Gist). We stick to using Peano numbers instead of GHC's type level Nat just because induction works better on them.
{-# LANGUAGE GADTs, PolyKinds, DataKinds, TypeOperators, FlexibleInstances, FlexibleContexts #-}
import Data.Foldable
import Text.PrettyPrint.HughesPJClass
data Nat = Z | S Nat
-- Some type synonyms that simplify uses of 'Nat'
type N0 = Z
type N1 = S N0
type N2 = S N1
type N3 = S N2
type N4 = S N3
type N5 = S N4
type N6 = S N5
type N7 = S N6
type N8 = S N7
type N9 = S N8
-- Similar to lists, but indexed over their length
data Vector (dim :: Nat) a where
Nil :: Vector Z a
(:-) :: a -> Vector n a -> Vector (S n) a
infixr 5 :-
data Tensor (dim :: [Nat]) a where
Scalar :: a -> Tensor '[] a
Tensor :: Vector d (Tensor ds a) -> Tensor (d : ds) a
To display these types, we'll use the pretty package (which comes with GHC already).
instance (Foldable (Vector n), Pretty a) => Pretty (Vector n a) where
pPrint = braces . sep . punctuate (text ",") . map pPrint . toList
instance Pretty a => Pretty (Tensor '[] a) where
pPrint (Scalar x) = pPrint x
instance (Pretty (Tensor ds a), Pretty a, Foldable (Vector d)) => Pretty (Tensor (d : ds) a) where
pPrint (Tensor xs) = pPrint xs
Then here are instances of Foldable for our datatypes (nothing surprising here - I'm including this only because you need it for the Pretty instances to compile):
instance Foldable (Vector Z) where
foldMap f Nil = mempty
instance Foldable (Vector n) => Foldable (Vector (S n)) where
foldMap f (x :- xs) = f x `mappend` foldMap f xs
instance Foldable (Tensor '[]) where
foldMap f (Scalar x) = f x
instance (Foldable (Vector d), Foldable (Tensor ds)) => Foldable (Tensor (d : ds)) where
foldMap f (Tensor xs) = foldMap (foldMap f) xs
Finally, the part that answers your question: we can define Applicative (Vector n) and Applicative (Tensor ds) similar to how Applicative ZipList is defined (except pure doesn't return an empty list - it returns a list of the right length).
instance Applicative (Vector Z) where
pure _ = Nil
Nil <*> Nil = Nil
instance Applicative (Vector n) => Applicative (Vector (S n)) where
pure x = x :- pure x
(x :- xs) <*> (y :- ys) = x y :- (xs <*> ys)
instance Applicative (Tensor '[]) where
pure = Scalar
Scalar x <*> Scalar y = Scalar (x y)
instance (Applicative (Vector d), Applicative (Tensor ds)) => Applicative (Tensor (d : ds)) where
pure x = Tensor (pure (pure x))
Tensor xs <*> Tensor ys = Tensor ((<*>) <$> xs <*> ys)
Then, in GHCi, it is pretty trivial to make your zero function:
ghci> :set -XDataKinds
ghci> zero = pure 0
ghci> pPrint (zero :: Tensor [N5,N3,N2] Integer)
{{{0, 0}, {0, 0}, {0, 0}},
{{0, 0}, {0, 0}, {0, 0}},
{{0, 0}, {0, 0}, {0, 0}},
{{0, 0}, {0, 0}, {0, 0}},
{{0, 0}, {0, 0}, {0, 0}}}
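Continuing the session, the same Applicative structure gives elementwise operations for free. A small sketch, assuming the complete Gist (including its Functor instances) is loaded:

ghci> pPrint ((+) <$> zero <*> pure 5 :: Tensor [N2,N3] Integer)
{{5, 5, 5}, {5, 5, 5}}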

Vectorizable implementation of complementary error function erfcf()

The complementary error function, erfc, is a special function closely related to the standard normal distribution. It is frequently used in statistics and the natural sciences (e.g. diffusion problems) where the "tails" of this distribution need to be considered, and use of the error function, erf, is therefore not suitable.
The complementary error function was made available in the ISO C99 standard math library as the functions erfcf, erfc, and erfcl; these were subsequently adopted into ISO C++ as well. Thus source code can readily be found in open-source implementations of that library, for example in glibc.
However, many existing implementations are scalar in nature, while modern processor hardware is SIMD-oriented (either explicitly, as in x86 CPUs, or implicitly, as in GPUs). For performance reasons, a vectorizable implementation is therefore highly desirable. This means branches need to be avoided, except as part of select assignment. Likewise, extensive use of tables is not indicated, as parallelized lookup is often inefficient.
How would one go about constructing an efficient vectorizable implementation of the single-precision function erfcf()? The accuracy, as measured in ulp, should be roughly the same as glibc's scalar implementation, which has a maximum error of 3.12575 ulps (determined by exhaustive testing). The availability of fused multiply-add (FMA) can be assumed, as all major processor architectures (CPUs and GPUs) offer it at this time. While handling of floating-point status flags and errno can be ignored, denormals, infinities, and NaNs should be handled in accordance with the IEEE 754 bindings for ISO C.
After looking into various approaches, the one that seems most suitable is the algorithm proposed in the following paper:
M. M. Shepherd and J. G. Laframboise, "Chebyshev Approximation of (1 + 2 x) exp(x²) erfc x in 0 ≤ x < ∞." Mathematics of Computation, Volume 36, No. 153, January 1981, pp. 249-253 (online copy)
The basic idea of the paper is to create an approximation to (1 + 2 x) exp(x²) erfc(x), from which we can compute erfcx(x) by simply dividing by (1 + 2 x), and erfc(x) by then multiplying with exp(−x²). The tightly bounded range of the function, with function values roughly in [1, 1.3], and its general "flatness" lend themselves well to polynomial approximation. Numerical properties of this approach are further improved by narrowing the approximation interval: the original argument x is transformed by q = (x - K) / (x + K), where K is a suitably chosen constant, followed by computing p(q), where p is a polynomial.
Since erfc(-x) = 2 - erfc(x), we only need to consider the interval [0, ∞], which is mapped to the interval [-1, 1] by this transformation. For IEEE-754 single precision, erfcf() vanishes (becomes zero) for x > 10.0546875, so one needs to consider only x ∈ [0, 10.0546875). What is the "optimal" value of K for this range? I know of no mathematical analysis that would provide the answer; the paper suggests K = 3.75 based on experiments.
One can readily establish that for single-precision computation, a minimax polynomial approximation of degree 9 is sufficient for various values of K in that general vicinity. Systematically generating such approximations with the Remez algorithm, with K varying between 1.5 and 4 in steps of 1/16, the lowest approximation error is observed for K = {2, 2.625, 3.3125}. Of these, K = 2 is the most advantageous choice, since it lends itself to very accurate computation of (x - K) / (x + K), as shown in this question.
The value K = 2 and the input domain for x would suggest that it is necessary to use variant 4 from my answer; however, one can demonstrate experimentally that the less expensive variant 5 achieves the same accuracy here, which is likely due to the very shallow slope of the approximated function for q > -0.5, which causes any error in the argument q to be reduced by roughly a factor of ten.
Since computation of erfc() requires post-processing steps in addition to the initial approximation, it is clear that the accuracy of both of these computations must be high in order to achieve a sufficiently accurate final result. Error correcting techniques must be used.
One observes that the most significant coefficient in the polynomial approximation of (1 + 2 x) exp(x²) erfc(x) is of the form (1 + s), where s < 0.5. This means we can represent the leading coefficient more accurately by splitting off 1, and only using s in the polynomial. So instead of computing a polynomial p(q), then multiplying by the reciprocal r = 1 / (1 + 2 x), it is mathematically equivalent but numerically advantageous to compute the core approximation as p(q) + 1, and use p to compute fma (p, r, r).
The accuracy of the division can be enhanced by computing an initial quotient q from the reciprocal r, computing the residual e = (p + 1) - q * (1 + 2 x) with the help of an FMA, and then using e to apply the correction q = q + (e * r), again using an FMA.
Exponentiation has error-magnification properties, therefore the computation of exp(−x²) must be performed carefully. The availability of FMA trivially allows the computation of −x² as a double-float s_high + s_low. exp(x) is its own derivative, so one can compute exp(s_high + s_low) as exp(s_high) + exp(s_high) * s_low. This computation can be combined with the multiplication by the previous intermediate result r to yield r = r * exp(s_high) + r * exp(s_high) * s_low. By use of FMA, one ensures that the most significant term r * exp(s_high) is computed as accurately as possible.
Combining the steps above with a few simple selections to handle exceptional cases and negative arguments, one arrives at the following C code:
#include <math.h>  // fabsf, fmaf, ldexpf, INFINITY

float my_expf (float);
/* Compute complementary error function.
*
* Based on: M. M. Shepherd and J. G. Laframboise, "Chebyshev Approximation of
* (1+2x)exp(x^2)erfc x in 0 <= x < INF", Mathematics of Computation, Vol. 36,
* No. 153, January 1981, pp. 249-253.
*
* maximum error: 2.65184 ulps
*/
float my_erfcf (float x)
{
float a, d, e, p, q, r, s, t;
a = fabsf (x);
/* Compute q = (a-2)/(a+2) accurately. [0, 10.0546875] -> [-1, 0.66818] */
p = a + 2.0f;
r = 1.0f / p;
q = fmaf (-4.0f, r, 1.0f);
t = fmaf (q + 1.0f, -2.0f, a);
e = fmaf (-a, q, t);
q = fmaf (r, e, q);
/* Approximate (1+2*a)*exp(a*a)*erfc(a) as p(q)+1 for q in [-1, 0.66818] */
p = -0x1.a4a000p-12f; // -4.01139259e-4
p = fmaf (p, q, -0x1.42a260p-10f); // -1.23075210e-3
p = fmaf (p, q, 0x1.585714p-10f); // 1.31355342e-3
p = fmaf (p, q, 0x1.1adcc4p-07f); // 8.63227434e-3
p = fmaf (p, q, -0x1.081b82p-07f); // -8.05991981e-3
p = fmaf (p, q, -0x1.bc0b6ap-05f); // -5.42046614e-2
p = fmaf (p, q, 0x1.4ffc46p-03f); // 1.64055392e-1
p = fmaf (p, q, -0x1.540840p-03f); // -1.66031361e-1
p = fmaf (p, q, -0x1.7bf616p-04f); // -9.27639827e-2
p = fmaf (p, q, 0x1.1ba03ap-02f); // 2.76978403e-1
/* Divide (1+p) by (1+2*a) ==> exp(a*a)*erfc(a) */
d = fmaf (2.0f, a, 1.0f);
r = 1.0f / d;
q = fmaf (p, r, r); // q = (p+1)/(1+2*a)
e = fmaf (fmaf (q, -a, 0.5f), 2.0f, p - q); // residual: (p+1)-q*(1+2*a)
r = fmaf (e, r, q);
/* Multiply by exp(-a*a) ==> erfc(a) */
s = a * a;
e = my_expf (-s);
t = fmaf (-a, a, s);
r = fmaf (r, e, r * e * t);
/* Handle NaN, Inf arguments to erfc() */
if (!(a < INFINITY)) r = x + x;
/* Clamp result for large arguments */
if (a > 10.0546875f) r = 0.0f;
/* Handle negative arguments to erfc() */
if (x < 0.0f) r = 2.0f - r;
return r;
}
/* Compute exponential base e. Maximum ulp error = 0.86565 */
float my_expf (float a)
{
float c, f, r;
int i;
// exp(a) = exp(i + f); i = rint (a / log(2))
c = 0x1.800000p+23f; // 1.25829120e+7
r = fmaf (0x1.715476p+0f, a, c) - c; // 1.44269502e+0
f = fmaf (r, -0x1.62e400p-01f, a); // -6.93145752e-1 // log_2_hi
f = fmaf (r, -0x1.7f7d1cp-20f, f); // -1.42860677e-6 // log_2_lo
i = (int)r;
// approximate r = exp(f) on interval [-log(2)/2,+log(2)/2]
r = 0x1.694000p-10f; // 1.37805939e-3
r = fmaf (r, f, 0x1.125edcp-07f); // 8.37312452e-3
r = fmaf (r, f, 0x1.555b5ap-05f); // 4.16695364e-2
r = fmaf (r, f, 0x1.555450p-03f); // 1.66664720e-1
r = fmaf (r, f, 0x1.fffff6p-02f); // 4.99999851e-1
r = fmaf (r, f, 0x1.000000p+00f); // 1.00000000e+0
r = fmaf (r, f, 0x1.000000p+00f); // 1.00000000e+0
// exp(a) = 2**i * exp(f);
r = ldexpf (r, i);
if (!(fabsf (a) < 104.0f)) {
r = a + a; // handle NaNs
if (a < 0.0f) r = 0.0f;
if (a > 0.0f) r = INFINITY;
}
return r;
}
I used my own implementation of expf() in the above code to isolate my work from differences in the expf() implementations on different compute platforms, but any implementation of expf() whose maximum error is close to 0.5 ulp should work well. As noted in the comment above, when using my_expf(), my_erfcf() has a maximum error of 2.65184 ulps.
Provided a vectorizable expf() is available, the code above should vectorize without problem. I did a quick check with the Intel compiler 13.1.3.198. I put a call to my_erfcf() in a loop, added #include <mathimf.h>, replaced the call to my_expf() with a call to expf(), then compiled using these command line switches:
/Qstd=c99 /O3 /QxCORE-AVX2 /fp:precise /Qfma /Qimf-precision:high:expf /Qvec_report=2
The Intel compiler reported that the loop had been vectorized, which I double checked by inspection of the disassembled binary code.
Since my_erfcf() only uses reciprocals rather than full divisions, it is amenable to the use of fast reciprocal implementations, provided they deliver almost correctly-rounded results. For processors that provide a fast single-precision reciprocal approximation in hardware, this can easily be achieved by coupling this with a Halley iteration with cubic convergence. A (scalar) example of this approach for x86 processors is:
#include <math.h>       // fmaf
#include <xmmintrin.h>  // __m128, _mm_set_ss, _mm_rcp_ss, _mm_store_ss

/* Compute 1.0f / a almost correctly rounded. Halley iteration with cubic convergence */
float fast_recipf (float a)
{
__m128 t;
float e, r;
t = _mm_set_ss (a);
t = _mm_rcp_ss (t);
_mm_store_ss (&r, t);
e = fmaf (r, -a, 1.0f);
e = fmaf (e, e, e);
r = fmaf (e, r, r);
return r;
}

C code to Haskell

So, I would like to convert a part of C code to Haskell. I wrote this part (it's a simplified example of what I want to do) in C, but being the newbie I am in Haskell, I can't really make it work.
float g(int n, float a, float p, float s)
{
int c;
while (n>0)
{
c = n % 2;
if (!c) s += p;
else s -= p;
p *= a;
n--;
}
return s;
}
Anyone got any ideas/solutions?
Lee's translation is already pretty good (well, he confused the odd and even cases(1)), but he fell into a couple of performance traps.
g n a p s =
if n > 0
then
let c = n `mod` 2
s' = (if c == 0 then (-) else (+)) s p
p' = p * a
in g (n-1) a p' s'
else s
He used mod instead of rem. The latter maps to machine division, the former performs additional checks to ensure a non-negative result. Thus mod is a bit slower than rem, and if either satisfies the needs - because they yield identical results in the case where both arguments are non-negative; or because the result is only compared to 0 (both conditions are satisfied here) - rem is preferable. Even better, and a bit more idiomatic is to use even (which uses rem for the reasons mentioned above). The difference is not huge, though.
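For a concrete illustration (GHCi), the two operators differ only when an operand is negative:

ghci> 7 `mod` 2
1
ghci> (-7) `mod` 2
1
ghci> (-7) `rem` 2
-1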
No type signature. That means that the code is (type-class) polymorphic, and thus no strictness analysis is possible, nor any specialisations. If the code is used in the same module at a specific type, GHC can (and usually will, if optimisations are enabled) create a specialised version for that specific type that allows strictness analysis and some other optimisations (inlining of class methods like (+) etc.); in that case, one does not pay the polymorphism penalty. But if the use site is in a different module, that cannot happen. If (type-class) polymorphic code is desired, one should mark it INLINABLE or INLINE (for GHC < 7), so that its unfolding is exposed in the .hi file and the function can be specialised and optimised at the use site.
Since g is recursive, it cannot be inlined [meaning, GHC cannot inline it; in principle it is possible] at use sites, which often would enable more optimisations than a mere specialisation.
One technique that often allows better optimisation for recursive functions is the worker/wrapper transformation. One creates a wrapper that calls a recursive (local) worker, then the non-recursive wrapper can be inlined, and when the worker is called with known arguments, that can enable further optimisations like constant folding or, in the case of function arguments, inlining. In particular the latter often has an enormous impact, when combined with a static-argument-transformation (arguments that never change in the recursive calls are not passed as arguments to the recursive worker).
In this case, we only have one static argument of type Float, so a worker/wrapper transformation with a SAT typically makes no difference (as a rule of thumb, a SAT pays off when
the static argument is a function, or
several non-function arguments are static,
so by this rule, we shouldn't expect any benefit from w/w + SAT, and in general, there is none). Here we have one special case where w/w + SAT can make a big difference, and that is when the factor a is 1. GHC has {-# RULES #-} that eliminate multiplication by 1 for various types, and with such a short loop body, a multiplication more or less per iteration makes a difference: the running time is reduced by about 40% after points 3 and 4 have been applied. (There are no RULES for multiplication by 0 or by -1 for floating point types because 0*x = 0 resp. (-1)*x = -x don't hold for NaNs.) For all other a, the w/w + SATed
{-# INLINABLE g #-}
g n a p s = worker n p s
where
worker n p s
| n <= 0 = s
| otherwise = let s' = if even n then s + p else s - p
in worker (n-1) (p*a) s'
does not perform measurably different from the top-level recursive version with the same optimisations done.
Strictness. GHC's strictness analyser is good, but not perfect. It cannot see far enough through the algorithm to determine that the function is
strict in p if n >= 1 (assuming addition - (+) - is strict in both arguments)
also strict in a if n >= 2 (assuming strictness of (*) in both arguments)
and then produce a worker that is strict in both. Instead you get a worker that uses an unboxed Int# for n and an unboxed Float# for s (I'm using the type Int -> Float -> Float -> Float -> Float here, corresponding to the C), and boxed Floats for a and p. Thus in each iteration you get two unboxings and a re-boxing. That costs (relatively) a lot of time, since besides that it's just a bit of simple arithmetic and tests.
Help GHC along a bit, and make the worker (or g itself, if you don't do the worker/wrapper transform) strict in p (with a bang pattern, for example). That is enough to allow GHC to produce a worker using unboxed values throughout.
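A minimal sketch of just that change, a single bang on p (requires the BangPatterns extension); everything else is as in the worker/wrapper version above:

{-# LANGUAGE BangPatterns #-}
{-# INLINABLE g #-}
g :: Int -> Float -> Float -> Float -> Float
g n a p s = worker n p s
  where
    worker n !p s
      | n <= 0    = s
      | otherwise = let s' = if even n then s + p else s - p
                    in worker (n-1) (p*a) s'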
Using division to test parity (not applicable if the type is Int and the LLVM backend is used).
GHC's optimiser hasn't got down to the low-level bits very much yet, so the native code generator emits a division instruction for
x `rem` 2 == 0
and, when the rest of the loop body is as cheap as it is here, that costs a lot of time. LLVM's optimiser has already been taught to replace that with a bitmasking at type Int, so with ghc -O2 -fllvm you don't need to do that manually. With the native code generator, substituting that with
x .&. 1 == 0
(needs import Data.Bits of course) produces a significant speedup (on normal platforms where a bitwise and is much faster than a division).
The final result
{-# INLINABLE g #-}
g n a p s = worker n p s
where
worker k !ap acc
| k > 0 = worker (k-1) (ap*a) (if k .&. (1 :: Int) == 0 then acc + ap else acc - ap)
| otherwise = acc
performs not measurably different (for the tested values) from the result of gcc -O3 -msse2 loop.c, except for a = -1, where gcc replaces the multiplication with a negation (assuming all NaNs equivalent).
(1) He's not alone in that,
c = n % 2;
if (!c) s += p;
else s -= p;
seems to be really tricky, as far as I can see everybody(2) got that wrong.
(2) With one exception ;)
As a first step, let's simplify your code:
float g(int n, float a, float p, float s) {
if (n <= 0) return s;
float s2 = n % 2 == 0 ? s + p : s - p;
return g(n - 1, a, a*p, s2);
}
We have turned your original function into a recursive one that exhibits a certain structure. It's a sequence! We can turn this into Haskell conveniently:
gs :: Bool -> Float -> Float -> Float -> [Float]
gs nb a p s = s : gs (not nb) a (a*p) (if nb then s - p else s + p)
Finally we just need to index this list:
g :: Int -> Float -> Float -> Float -> Float
g n a p s = gs (even n) a p s !! n
The code is not tested, but it should work. If not, it's probably just an off-by-one error.
Here is how I would tackle this problem in Haskell. First, I observe that there are several loops merged into one here: we are
forming a geometric sequence (with common ratio -a, starting from a suitably signed version of p)
taking a prefix of the sequence
summing the result
So my solution follows this structure as well, with a tiny bit of s and p thrown in for good measure because that's what your code does. In a from-scratch version, I'd probably drop those two parameters entirely.
g n a p s = sum (s : take n (iterate (*(-a)) start)) where
start | odd n = -p
| otherwise = p
A fairly direct translation would be:
g n a p s =
if n > 0
then
let c = n `mod` 2
s' = (if c == 0 then (-) else (+)) s p
p' = p * a
in g (n-1) a p' s'
else s
Looking at the signature of the g function (i.e., float g(int n, float a, float p, float s)), you know that your Haskell function will receive 4 arguments and return a float, thus:
g :: Integer -> Float -> Float -> Float -> Float
Let us now look into the loop: we see that it runs while n > 0, so n <= 0 is the stop case, and n--; will be the decreasing step used on the recursive call. Therefore:
g :: Integer -> Float -> Float -> Float -> Float
g n a p s | n <= 0 = s
For n > 0, you have another conditional, if (!(n % 2)) s += p; else s -= p;, inside the loop. If n is odd then you will do s += p, p *= a, and n--. In Haskell it will be:
g :: Integer -> Float -> Float -> Float -> Float
g n a p s | n <= 0 = s
| odd n = g (n-1) a (p*a) (s+p)
If n is even then you will do s -= p, p *= a, and n--. Thus:
g :: Integer -> Float -> Float -> Float -> Float
g n a p s | n <= 0 = s
| odd n = g (n-1) a (p*a) (s+p)
| otherwise = g (n-1) a (p*a) (s-p)
To expand on #Landei's and #MathematicalOrchid's comments below the question: the algorithm proposed to solve the problem at hand is always O(n). However, if you realize that what you're actually doing is computing a partial sum of the geometric series, you can use the well-known summation formula:
g n a p s = s + (-1)^n * p * ((-a)^n - 1) / (-a - 1)
This will be faster as the exponentiation can be done faster than O(n) by repeated squaring or other clever methods, which are likely automatically employed for integer powers by modern compilers.
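As a quick sanity check (throwaway code of my own; the names gLoop and gClosed are hypothetical, with gLoop transcribing the C loop directly), one can compare the closed form against the loop. Note the closed form divides by zero at a = -1:

-- Direct transcription of the C loop: n even adds p, n odd subtracts p
gLoop :: Int -> Float -> Float -> Float -> Float
gLoop n a p s
  | n <= 0    = s
  | otherwise = gLoop (n - 1) a (p * a) (if even n then s + p else s - p)

-- Closed form from above; undefined (0/0) at a == -1
gClosed :: Int -> Float -> Float -> Float -> Float
gClosed n a p s = s + (-1) ^ n * p * ((-a) ^ n - 1) / (-a - 1)

main :: IO ()
main = mapM_ print
  [ (gLoop 3 2 1 0,    gClosed 3 2 1 0)    -- (-3.0,-3.0)
  , (gLoop 10 0.5 2 1, gClosed 10 0.5 2 1) -- (2.33203125,2.33203125)
  ]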
You can encode loops almost-naturally with the Haskell Prelude function until :: (a -> Bool) -> (a -> a) -> a -> a:
g :: Int -> Float -> Float -> Float -> Float
g n a p s =
fst.snd $
until ((<= 0).fst)
(\(n,(!s,!p)) -> (n-1, (if even n then s+p else s-p, p*a)))
(n,(s,p))
The bang-patterns !s and !p mark strictly-calculated intermediate variables, to prevent excessive laziness which would otherwise harm efficiency.
until pred step start repeatedly applies the step function, starting with the initial value start, until pred holds for the last generated value. It can be represented by the pseudocode:
def until (pred, step, start):
    while (true):
        if pred(start): return (start)
        start := step(start)

// well, actually:

def until (pred, step, start):
    if pred(start): return (start)
    call until(pred, step, step(start))
The first pseudocode is equivalent to the second (which is how until is actually implemented) in the presence of tail call optimization, which is why in many functional languages where TCO is present loops are encoded via recursion.
So in Haskell, until is coded as
until p f x | p x = x
| otherwise = until p f (f x)
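For example, doubling until the predicate holds:

ghci> until (> 100) (*2) 1
128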
But it could have been coded differently, making explicit the interim results:
until p f x = last $ go x -- or, last (go x)
where go x | p x = [x]
| otherwise = x : go (f x)
Using the Haskell standard higher-order functions break and iterate, this could be written as stream-processing code,
until p f x = let (_,(r:_)) = break p (iterate f x) in r
-- or: span (not.p) ....
or just
until p f x = head $ dropWhile (not.p) $ iterate f x -- or, equivalently,
-- head . dropWhile (not.p) . iterate f $ x
If TCO weren't present in a given Haskell implementation, the last version would be the one to use.
Hopefully this makes clearer how the stream-processing code from Daniel Wagner's answer comes about,
g n a p s = s + (sum . take n . iterate (*(-a)) $ if odd n then (-p) else p)
because the predicate involved is about counting down from n, and
fst . snd . head . dropWhile ((> 0).fst) $
iterate (\(n,(!s,!p)) -> (n-1, (if even n then s+p else s-p, p*a)))
(n,(s,p))
===
fst . snd . head . dropWhile ((> 0).fst) $
iterate (\(n,(!s,!p)) -> (n-1, (s+p, p*(-a))))
(n,(s, if odd n then (-p) else p)) -- 0 is even
===
fst . (!! n) $
iterate (\(!s,!p) -> (s+p, p*(-a)))
(s, if odd n then (-p) else p)
===
foldl' (+) s . take n . iterate (*(-a)) $ if odd n then (-p) else p
In pure FP, the stream-processing paradigm makes all history of a computation available, as a stream (list) of values.

Using list elements and indices together

I've always found it awkward to have a function or expression that requires use of the values, as well as indices, of a list (or array, applies just the same) in Haskell.
I wrote validQueens below while experimenting with the N-queens problem here ...
validQueens x =
and [abs (x!!i - x!!j) /= j-i | i<-[0..length x - 2], j<-[i+1..length x - 1]]
I didn't care for the use of indexing, all the plus and minuses, etc. It feels sloppy. I came up with the following:
enumerate x = zip [0..length x - 1] x
validQueens' :: [Int] -> Bool
validQueens' x = and [abs (snd j - snd i) /= fst j - fst i | i<-l, j<-l, fst j > fst i]
where l = enumerate x
being inspired by Python's enumerate (not that borrowing imperative concepts is necessarily a great idea). Seems better in concept, but snd and fst all over the place kinda sucks. It's also, at least at first glance, costlier both in time and space. I'm not sure whether or not I like it any better.
So in short, I am not really satisfied with either
Iterating through by index, bounded by lengths, or even worse, off-by-ones and twos
Index-element tuples
Has anyone found a pattern they find more elegant than either of the above? If not, is there any compelling reason one of the above methods is superior?
Borrowing enumerate is fine and encouraged. However, it can be made a bit lazier by refusing to calculate the length of its argument:
enumerate = zip [0..]
(In fact, it's common to just use zip [0..] without naming it enumerate.) It's not clear to me why you think your second example should be costlier in either time or space. Remember: indexing is O(n), where n is the index. Your complaint about the unwieldiness of fst and snd is justified, and can be remedied with pattern-matching:
validQueens' xs = and [abs (y - x) /= j - i | (i, x) <- l, (j, y) <- l, i < j]
where l = zip [0..] xs
Now, you might be a bit concerned about the efficiency of this double loop, since the clause (j, y) <- l is going to be running down the entire spine of l, when really we just want it to start where we left off with (i, x) <- l. So, let's write a function that implements that idea (tails comes from Data.List):
pairs :: [a] -> [(a, a)]
pairs xs = [(x, y) | x:ys <- tails xs, y <- ys]
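For example:

ghci> pairs [1,2,3]
[(1,2),(1,3),(2,3)]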
Having made this function, your function is not too hard to adapt. Pulling out the predicate into its own function, we can use all instead of and:
validSingleQueen ((i, x), (j, y)) = abs (y - x) /= j - i
validQueens' xs = all validSingleQueen (pairs (zip [0..] xs))
Or, if you prefer point-free notation:
validQueens' = all validSingleQueen . pairs . zip [0..]
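A quick check in GHCi (boards of my own choosing; [0,2,4,1,3] is a valid 5-queens placement, while [0,1] puts two queens on a shared diagonal):

ghci> validQueens' [0,2,4,1,3]
True
ghci> validQueens' [0,1]
False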
Index-element tuples are quite a common thing to do in Haskell. Because zip stops as soon as either list ends, you can write them as
enumerate x = zip [0..] x
which is both more elegant and more efficient (as it doesn't compute length x up front). In fact I wouldn't even bother naming it, as zip [0..] is so short.
This is definitely more efficient than iterating by index for lists, because !! is linear in the second argument due to lists being linked lists.
Another way you can make your program more elegant is to use pattern-matching instead of fst and snd:
validQueens' :: [Int] -> Bool
validQueens' x = and [abs (j2 - i2) /= j1 - i1 | (i1, i2) <- l, (j1, j2) <- l, j1 > i1]
where l = zip [0..] x
