Array in Haskell

How is an array created in Haskell using the constructor array? I mean, does it create the first element and so on? In that case, how does it read the association list?
For example, consider the following two programs:
ar :: Int -> Array Int Int
ar n = array (0, n-1) ((n-1, 1) : [(i, ar n ! (i+1)) | i <- [0 .. n-2]])
ar :: Int -> Array Int Int
ar n = array (0, n-1) ((0, 1) : [(i, ar n ! (i-1)) | i <- [1 .. n-1]])
Will these two have different time complexity?

That depends on the implementation, but in a reasonable implementation, both have the same complexity (linear in the array size).
In GHC's array implementation, if we look at the code
array (l,u) ies
    = let n = safeRangeSize (l,u)
      in unsafeArray' (l,u) n
                      [(safeIndex (l,u) n i, e) | (i, e) <- ies]
{-# INLINE unsafeArray' #-}
unsafeArray' :: Ix i => (i,i) -> Int -> [(Int, e)] -> Array i e
unsafeArray' (l,u) n@(I# n#) ies = runST (ST $ \s1# ->
    case newArray# n# arrEleBottom s1# of
        (# s2#, marr# #) ->
            foldr (fill marr#) (done l u n marr#) ies s2#)
{-# INLINE fill #-}
fill :: MutableArray# s e -> (Int, e) -> STRep s a -> STRep s a
-- NB: put the \s after the "=" so that 'fill'
--     inlines when applied to three args
fill marr# (I# i#, e) next
    = \s1# -> case writeArray# marr# i# e s1# of
                  s2# -> next s2#
we can see that first a new chunk of memory is allocated for the array and filled with arrEleBottom (an error call with the message "undefined array element"); then the elements supplied in the list are written to their respective indices, in the order they appear in the list.
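Those arrEleBottom thunks are observable from GHCi if you leave an index without an association, for example:
Prelude Data.Array> let a = array (0,2) [(0,'x'),(1,'y')] :: Array Int Char
Prelude Data.Array> a ! 1
'y'
Prelude Data.Array> a ! 2
*** Exception: (Array.!): undefined array element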
In general, since it is a boxed array, what is written to the array on construction is a thunk that specifies how to compute the value when it is needed (an explicitly specified value, like the literal 1 in your examples, results in a direct pointer to that value being written to the array).
When the evaluation of such a thunk is forced, it may in turn force the evaluation of further thunks in the array, if the thunk refers to other array elements, as it does here. In these specific examples, forcing any thunk forces all thunks later (respectively earlier) in the array, until the entry that doesn't refer to another array element is reached. In the first example, if the element at index 0 is the first to be forced, that builds a thunk of size proportional to the array length, which is then reduced; so forcing the first element has complexity O(n), after which all further elements are already evaluated and forcing them is O(1). In the second example the situation is symmetric: forcing the last element first incurs the total evaluation cost. If the elements are demanded in a different order, the cost of evaluating all thunks is spread across the requests for different elements, but the total cost is the same. The cost of evaluating any not-yet-evaluated thunk is proportional to its distance from the nearest already-evaluated thunk, and includes evaluating all thunks in between.
Since array access is constant time (except for cache effects, but those should not make a difference whether you fill the array forward or backward; they could make a big difference if the indices came in random order, yet even that would not affect time complexity), both have the same complexity.
Note, however, that using ar n to define the array elements carries the risk of multiple arrays being allocated (GHC does that when compiled without optimisations, and, as a quick test shows, it can happen even with optimisations). To make sure that only one is constructed, make it
ar n = result
  where
    result = array (0,n-1) (... result!index ...)
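For example, the second definition from the question becomes (same association list, only the self-reference changes):
import Data.Array

ar :: Int -> Array Int Int
ar n = result
  where
    -- every element thunk refers to the shared 'result',
    -- so only one array is allocated per call
    result = array (0, n-1) ((0,1) : [(i, result ! (i-1)) | i <- [1 .. n-1]])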

Comparing Arrays of Arrays

So I have two arrays, a and b, of varying size, each containing child arrays of the same length; both outer arrays are of the same type, as are the child arrays (float, for example).
I want to find all the matches for the child arrays of b within the child arrays of a.
Now I'm looking for a faster or better way to do this (perhaps CUDA or SIMD coding).
At the moment I have something like (F#):
let mutable result = 0.0
for a in arrayA do
    for b in arrayB do
        if a = b then
            result <- result + (a |> Array.sum)
My array a contains around 5 million elements and array b contains around 3000. Hence my performance-related issue.
You are using a brute-force algorithm to solve the problem. Suppose that A and B have sizes N and M respectively, and each small array you are checking for equality is K elements long. Your algorithm takes O(N M K) time in the worst case and O(N M + Z K) in the best case, where Z is the number of matches (which may be as large as N M).
Notice that each of your small arrays is essentially a string. You have two sets of strings, and you want to detect all equal pairs between them.
This problem can be solved with a hash table. Create a hash table with O(M) cells. In this table, store the strings of array B without duplication. After you have added all the strings from B, iterate over the strings from A and check whether they are present in the hash table. This is a randomized solution with O((M + N) K) time complexity on average, which is linear in the size of the input data.
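In Haskell (the question is F#, but the idea carries over) a minimal sketch of this approach could use Data.HashSet from the unordered-containers package, representing each child array as a list of Floats:
import qualified Data.HashSet as HS

-- build a set of the rows of b, then sum the rows of a that occur in it
sumMatches :: [[Float]] -> [[Float]] -> Float
sumMatches rowsA rowsB = sum [sum row | row <- rowsA, row `HS.member` setB]
  where setB = HS.fromList rowsB  -- O(M K) expected construction
Lookups are O(K) expected per row, which gives the O((M + N) K) average total.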
Also, you can solve the problem in a non-randomized way. Put all the strings into a single array X and sort them. During sorting, put strings from A after all equal strings from B. Note that you should remember which strings of X came from which array. You can either use some fast comparison sort, or use radix sort. In the latter case sorting is done in linear time, i.e. in O((M + N) K).
Now all the common strings are stored in X contiguously. You can iterate over X, maintaining the set of strings from B equal to the currently processed string. If you see a string different from the previous one, clear the set. If the string is from B, add it to the set. If it is from A, record that it is equal to the set of elements from B. This is a single pass over X with O(K) time per string, so it takes O((M + N) K) time in total.
If the length K of your strings is not tiny, you can vectorize the string operations. With the hash-table approach, most of the time would be spent computing string hashes. If you choose a polynomial hash modulo 2^32, it is easy to vectorize with SSE2. You also need fast string comparison, which can be done with the memcmp function and is easily vectorized too. The sorting solution needs only string comparisons. You might also want to implement a radix sort, which is not possible to vectorize, I'm afraid.
Efficient parallelization of both approaches is not very simple. For the first algorithm, you need a concurrent hash table. Actually, there are even some lock-free hash tables out there. For the second approach, you can parallelize the first step (quicksort is easy to parallelize, radix sort is not). The second step can be parallelized too if there are not too many equal strings: you can split the array X into almost equal pieces, breaking it only between two different strings.
You may save some time comparing large arrays by splitting them into smaller arrays and doing the equality check in parallel.
This chunk function is taken directly from F# Snippets
let chunk chunkSize (arr : _ array) =
    query {
        for idx in 0..(arr.Length - 1) do
        groupBy (idx / chunkSize) into g
        select (g |> Seq.map (fun idx -> arr.[idx]))
    }
Then do something like this to compare the arrays. I have chosen to split each array into 4 smaller chunks:
let fastArrayCompare a1 a2 = async {
    let! a =
        Seq.zip (chunk 4 a1) (chunk 4 a2)
        |> Seq.map (fun (a1', a2') -> async { return a1' = a2' })
        |> Async.Parallel
    return Array.TrueForAll (a, (fun t -> t)) }
Obviously you are now adding some extra time with the array splitting, but with lots of very large array comparisons you should make up this time and then some.

How to "invert" an array in linear time functionally rather than procedurally?

Say I have an array of integers A such that A[i] = j, and I want to "invert it"; that is, to create another array of integers B such that B[j] = i.
This is trivial to do procedurally in linear time in any language; here's a Python example:
def invert_procedurally(A):
    B = [None] * (max(A) + 1)
    for i, j in enumerate(A):
        B[j] = i
    return B
However, is there any way to do this functionally (as in functional programming, using map, reduce, or functions like those) in linear time?
The code might look something like this:
def invert_functionally(A):
    # We can't modify variables in FP; we can only return a value
    return map(???, A)  # What goes here?
If this is not possible, what is the best (most efficient) alternative when doing functional programming?
In this context are arrays mutable or immutable? Generally I'd expect the mutable case to be about as straightforward as your Python implementation, perhaps aside from a few wrinkles with types. I'll assume you're more interested in the immutable scenario.
This operation inverts the indices and elements, so it's also important to know something about what constitutes valid array indices and impose those same constraints on the elements. Haskell has a class for index constraints called Ix. Any Ix type is ordered and has a range implementation to make an ordered list of indices ranging from one specified index to another. I think this Haskell implementation does what you want.
import Data.Array.IArray

invertArray :: (Ix x) => Array x x -> Array x x
invertArray arr = listArray (low,high) newElems
    where oldElems = elems arr
          newElems = indices arr
          low      = minimum oldElems
          high     = maximum oldElems
Under the hood listArray uses zipWith and range to associate indices in the specified range to the listed elements. That part ought to be linear time, and so is the one-time operation of extracting elements and indices from an array.
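Conceptually (this is not GHC's exact code), that amounts to:
import Data.Array

-- pair the index range with the element list positionally
listArray' :: Ix i => (i, i) -> [e] -> Array i e
listArray' bnds xs = array bnds (zip (range bnds) xs)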
Whenever the sets of the input array's indices and elements differ, some elements of the result will be undefined, which, for better or worse, blows up when forced rather than acting as a quiet placeholder like Python's None. I believe you could overcome the undefined issue by implementing a new Ix a instance over the Maybe monad, for instance.
Quick side note: check out the invPerm example in the Haskell 98 Library Report. It does something similar to invertArray, but assumes up front that the input array's elements are a permutation of its indices.
A solution needing map and three operations:
toTuples views the array as a list of tuples (i,e), where i is the index and e the element in the array at that index.
fromTuples creates and loads an array from a list of tuples.
swap, which takes a tuple (a,b) and returns (b,a).
Hence the solution would be (in Haskellish notation):
invert = fromTuples . map swap . toTuples
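With Data.Array, a minimal concrete version of this sketch (assuming every element is a valid index, as in the question) might be:
import Data.Array
import Data.Tuple (swap)

-- toTuples ~ assocs, fromTuples ~ array; the result's bounds are the
-- extremes of the old elements
invert :: Array Int Int -> Array Int Int
invert arr = array (minimum es, maximum es) (map swap (assocs arr))
  where es = elems arr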

CUDA Fortran 4D array

My code is being slowed down by accesses to a 4D array in global memory.
I am using PGI compiler 2010.
The 4D array I am accessing is read only from the device and the size is known at run time.
I wanted to allocate it in texture memory, but found that my PGI version does not support textures. Since the size is known only at run time, it is not possible to use constant memory either.
Only one dimension is known at compile time, like this: MyFourD(100, x, y, z), where x, y, z are user input.
My first idea was to use pointers, but I am not familiar with pointers in Fortran.
If you have experience with such a situation, I would appreciate your help, because this alone makes my code five times slower than expected.
Following is a sample code of what I am trying to do
integer :: i, j, k
i = (blockIdx%x-1) * blockDim%x + threadIdx%x - 1
j = (blockIdx%y-1) * blockDim%y + threadIdx%y - 1
do k = 0, 100
    regvalue1 = somevalue1
    regvalue2 = somevalue2
    regvalue3 = somevalue3
    d_value(i,j,k) = d_value(i,j,k)                   &
                   + myFourdArray(10,i,j,k)*regvalue1 &
                   + myFourdArray(32,i,j,k)*regvalue2 &
                   + myFourdArray(45,i,j,k)*regvalue3
end do
I believe the answer from @Alexander Vogt is on the right track: I would also think about re-ordering the array storage. But I would try it like this:
integer :: i, j, k
i = (blockIdx%x-1) * blockDim%x + threadIdx%x - 1
j = (blockIdx%y-1) * blockDim%y + threadIdx%y - 1
do k = 0, 100
    regvalue1 = somevalue1
    regvalue2 = somevalue2
    regvalue3 = somevalue3
    d_value(i,j,k) = d_value(i,j,k)                   &
                   + myFourdArray(i,j,k,10)*regvalue1 &
                   + myFourdArray(i,j,k,32)*regvalue2 &
                   + myFourdArray(i,j,k,45)*regvalue3
end do
Note that the only change is to myFourdArray, there is no need for a change in data ordering in the d_value array.
The crux of this change is that we are allowing adjacent threads to access adjacent elements in myFourdArray and so we are allowing for coalesced access. Your original formulation forced adjacent threads to access elements that were separated by the length of the first dimension, and so did not allow for useful coalescing.
Whether in CUDA C or CUDA Fortran, threads are grouped in X first, then Y and then Z dimensions. So the rapidly varying thread subscript is X first. Therefore, in matrix access, we want this rapidly varying subscript to show up in the index that is also rapidly varying.
In Fortran this index is the first of a multiple-subscripted array.
In C, this index is the last of a multiple-subscripted array.
Your original code followed this convention for d_value by placing the X thread index (i) in the first array subscript position. But it broke this convention for myFourdArray by putting a constant in the first array subscript position. Thus your accesses to myFourdArray are noticeably slower.
When there is a loop in the code, we also don't want to place the loop variable first (for Fortran; last for C), i.e. k in this case, as Alexander Vogt did, because doing so also breaks coalescing. For each iteration of the loop, multiple threads execute in lockstep, and those threads should all access adjacent elements. This is facilitated by having the X thread-indexed subscript (e.g. i) first (for Fortran; last for C).
You could invert the indexing, i.e. let the first dimension change the fastest. Fortran is column-major!
do k = 0, 100
    regvalue1 = somevalue1
    regvalue2 = somevalue2
    regvalue3 = somevalue3
    d_value(k,i,j) = d_value(k,i,j) +      &
        myFourdArray(k,i,j,10)*regvalue1 + &
        myFourdArray(k,i,j,32)*regvalue2 + &
        myFourdArray(k,i,j,45)*regvalue3
end do
If the last (in the original case first) dimension is always fixed (and not too large), consider individual arrays instead.
In my experience, pointers do not change much in terms of speed-up when applied to large arrays. What you could try is strip-mining to optimize your loops in terms of cache access, but I do not know the compile option that enables this with the PGI compiler.
Ah, OK, it is a simple directive:
!$acc do vector
do k=...
enddo

Why the Haskell sequence function can't be lazy or why recursive monadic functions can't be lazy

From the question Listing all the contents of a directory by breadth-first order results in low efficiency, I learned that the low efficiency is due to a strange behavior of recursive monadic functions.
Try
sequence $ map return [1..]::[[Int]]
sequence $ map return [1..]::Maybe [Int]
and GHCi will fall into an endless calculation.
If we rewrite the sequence function in a more readable form, as follows:
sequence' [] = return []
sequence' (m:ms) = do { x <- m; xs <- sequence' ms; return (x:xs) }
and try:
sequence' $ map return [1..]::[[Int]]
sequence' $ map return [1..]::Maybe [Int]
we get the same situation, an endless loop.
Try a finite list instead, e.g.
sequence' $ map return [1..1000000] :: Maybe [Int]
and it will produce the expected result Just [1,2,3,...] after a long wait.
From what we tried, we can conclude that although the definition of sequence' looks lazy, it is strict: it has to evaluate the whole list before the result of sequence' can be printed.
Not only sequence': if we define a function
iterateM :: Monad m => (a -> m a) -> a -> m [a]
iterateM f x = f x >>= iterateM f >>= return . (x:)
and try
iterateM (return . (+1)) 0 :: Maybe [Int]
then an endless calculation occurs.
As we all know, the non-monadic iterate is defined much like the iterateM above, so why is iterate lazy while iterateM is strict?
As we can see from the above, both iterateM and sequence' are recursive monadic functions. Is there something strange about recursive monadic functions?
The problem isn't the definition of sequence, it's the operation of the underlying monad. In particular, it's the strictness of the monad's >>= operation that determines the strictness of sequence.
For a sufficiently lazy monad, it's entirely possible to run sequence on an infinite list and consume the result incrementally. Consider:
Prelude> :m + Control.Monad.Identity
Prelude Control.Monad.Identity> runIdentity (sequence $ map return [1..] :: Identity [Int])
and the list will be printed (consumed) incrementally as desired.
It may be enlightening to try this with Control.Monad.State.Strict and Control.Monad.State.Lazy:
-- will print the list
Prelude Control.Monad.State.Lazy> evalState (sequence $ map return [1..] :: State () [Int]) ()
-- loops
Prelude Control.Monad.State.Strict> evalState (sequence $ map return [1..] :: State () [Int]) ()
In the IO monad, >>= is by definition strict, since this strictness is exactly the property necessary to enable reasoning about effect sequencing. I think @jberryman's answer is a good demonstration of what is meant by a "strict >>=". For IO and other monads with a strict >>=, each expression in the list must be evaluated before sequence can return. With an infinite list of expressions, this isn't possible.
You're not quite grokking the mechanics of bind:
(>>=) :: Monad m => m a -> (a -> m b) -> m b
Here's an implementation of sequence that only works on lists of length 3:
sequence3 (ma:mb:mc:[]) = ma >>= (\a -> mb >>= (\b -> mc >>= (\c -> return [a,b,c])))
You see how we have to "run" each "monadic action" in the list before we can return the outer constructor (i.e. the outermost cons, or (:))? Try implementing it differently if you don't believe it.
This is one reason monads are useful for IO: there is an implicit sequencing of effects when you bind two actions.
You also have to be careful about using the terms "lazy" and "strict". It's true with sequence that you must traverse the whole list before the final result can be wrapped, but the following works perfectly well:
Prelude Control.Monad> sequence3 [Just undefined, Just undefined, Nothing]
Nothing
Monadic sequence cannot in general work lazily on infinite lists. Consider its signature:
sequence :: Monad m => [m a] -> m [a]
It combines all the monadic effects in its argument into a single effect. If you apply it to an infinite list, you would need to combine an infinite number of effects into one. For some monads this is possible; for some monads it is not.
As an example, consider sequence specialized to Maybe, as you did in your example:
sequence :: [Maybe a] -> Maybe [a]
The result is Just ... iff all elements in the list are Just .... If any element is Nothing, then the result is Nothing. This means that unless you examine all elements of the input, you cannot tell whether the result is Nothing or Just ....
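For instance:
Prelude> sequence [Just 1, Just 2, Nothing, Just 4]
Nothing
Prelude> sequence (map Just [1..3])
Just [1,2,3]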
The same applies for sequence specialized to []: sequence :: [[a]] -> [[a]]. If any of the elements of the argument is an empty list, the whole result is an empty list, like in sequence [[1],[2,3],[],[4]]. So in order to evaluate sequence on a list of lists, you have to examine all the elements to see what the result will look like.
On the other hand, sequence specialized to the Reader monad can process its argument lazily, because there is no real "effect" in Reader's monadic computation. If you define
inf :: Reader Int [Int]
inf = sequence $ map return [1..]
or perhaps
inf = sequence $ map (\x -> reader (* x)) [1..]
it will work lazily, as you can see by calling take 10 (runReader inf 3).

Dynamic programming with Data.Vector

I am using Data.Vector and currently need to compute the contents of a vector for use in computing a cryptographic hash (SHA-1). I created the following code.
dynamic :: a -> Int -> (Int -> Vector a -> a) -> Vector a
dynamic e n f =
    let start = Data.Vector.replicate n e
    in  step start 0
  where
    step vector i = if i == n
                    then vector
                    else step (vector // [(i, f i vector)]) (i+1)
I created this so that the function f filling out the vector has access to the partial results along the way. Surely something like this must already exist in Data.Vector, no?
The problem statement is the following: you are to solve a dynamic programming problem where the finished result is an array. You know the size of the array, and you have a recursive function for filling it out.
You probably already saw the function generate, which takes a size n and a function f of type Int -> a and then produces a Vector a of size n. What you probably weren't aware of is that when using this function you actually do have access to the partial results.
What I mean to say is that inside the function you pass to generate you can refer to the vector you're defining and due to Haskell's laziness it will work fine (unless you make it so that the different items of the vector depend on each other in a circular fashion, of course).
Example:
import Data.Vector

tenFibs = generate 10 fib
    where fib 0 = 0
          fib 1 = 1
          fib n = tenFibs ! (n-1) + tenFibs ! (n-2)
tenFibs is now a vector containing the first 10 Fibonacci numbers.
Maybe you could use one of Data.Vector's scan functions?
http://hackage.haskell.org/packages/archive/vector/0.6.0.2/doc/html/Data-Vector.html#32
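For recurrences where each entry depends only on an accumulation over the previous entries, a scan fits directly. A minimal sketch:
import qualified Data.Vector as V

-- each output element is the sum of all inputs up to that position
runningSums :: V.Vector Int -> V.Vector Int
runningSums = V.scanl1 (+)
For example, runningSums (V.fromList [1,2,3,4]) is the vector [1,3,6,10].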
