Haskell real-time update and lookup performance - arrays

I am writing a game-playing AI (aichallenge.org - Ants), which requires a lot of updating of, and referring to, data structures. I have tried both Arrays and Maps, but the basic problem seems to be that every update creates a new value, which makes it slow. The game boots you out if you take more than one second to make your move, so the application counts as "hard real-time". Is it possible to have the performance of mutable data structures in Haskell, or should I learn Python, or rewrite my code in OCaml?
I have completely rewritten the Ants "starter-pack". Changed from Arrays to Maps because my tests showed that Maps update much faster.
I ran the Maps version with profiling on, which showed that about 20% of the time is being taken by Map updates alone.
Here is a simple demonstration of how slow Array updates are.
import Data.Array (listArray, (//), (!))

slow_array =
  let arr = listArray (0,9999) (repeat 0)
      upd i ar = ar // [(i,i)]
  in foldr upd arr [0..9999]
Now evaluating slow_array!9999 takes almost 10 seconds! Although it would be faster to apply all the updates at once, the example models the real problem where the array must be updated each turn, and preferably each time you choose a move when planning your next turn.
Thanks to nponeccop and Tener for the reference to the vector modules. The following code is equivalent to my original example, but runs in 0.06 seconds instead of 10.
import qualified Data.Vector.Unboxed.Mutable as V

fast_vector :: IO (V.IOVector Int)
fast_vector = do
  vec <- V.new 10000
  V.set vec 0
  mapM_ (\i -> V.write vec i i) [0..9999]
  return vec

fv_read :: IO Int
fv_read = do
  v <- fast_vector
  V.read v 9999
Now, to incorporate this into my Ants code...

First of all, think about whether you can improve your algorithm. Also note that the default Ants.hs is not optimal and you need to roll your own.
Second, you should use a profiler to find where the performance problem is instead of relying on hand-waving. Haskell code is usually much faster than Python (10-30 times faster; see the Language Shootout for example comparisons) even with functional data structures, so you are probably doing something wrong.
Haskell supports mutable data pretty well. See the ST monad (state threads) and the mutable-array libraries that work in ST. Also take a look at the vector package. Finally, you can use Data Parallel Haskell, haskell-mpi or other ways of parallelization to load all available CPU cores, or even distribute work over several computers.
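For instance, here is a minimal sketch (not from the original posts) of the question's slow_array example redone with a mutable STUArray: every write happens in place inside runSTUArray, and only one immutable array is produced at the end.

import Control.Monad (forM_)
import Data.Array.ST (newArray, writeArray, runSTUArray)
import Data.Array.Unboxed (UArray, (!))

-- Fill an unboxed array in place; the ST monad keeps the mutation local.
fast_st_array :: UArray Int Int
fast_st_array = runSTUArray $ do
  arr <- newArray (0, 9999) 0
  forM_ [0..9999] $ \i -> writeArray arr i i
  return arr

-- fast_st_array ! 9999 returns 9999 essentially instantly.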
Are you using compiled code (e.g. cabal build or ghc --make), or are you using runhaskell or ghci? The latter are bytecode interpreters and produce much slower code than the native-code compiler. See the Cabal reference - it is the preferred way to build applications.
Also make sure you have optimization turned on (-O2 and other flags). Note that -O vs -O2 can make a difference, and try different backends including the new LLVM backend (-fllvm).
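For example (the file name here is hypothetical), a fully optimized native build could look like:

ghc -O2 --make MyBot.hs
ghc -O2 -fllvm --make MyBot.hs   # the same build, routed through the LLVM backend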

Updating arrays one element at a time is incredibly inefficient because each update involves making a copy of the whole array. Other data structures such as Map are implemented as trees and thus allow logarithmic-time updates. However, in general, updating functional data structures one element at a time is often sub-optimal, so you should try to take a step back and think about how you can implement something as a transformation of the whole structure at once instead of one element at a time.
For example, your slow_array example can be written much more efficiently by doing all the updates in one step, which only requires the array to be copied once.
faster_array =
  let arr = listArray (0,9999) (repeat 0)
  in arr // [(i,i) | i <- [0..9999]]
If you cannot think of an alternative to the imperative one-element-at-a-time algorithm, mutable data structures have been mentioned as another option.

You are basically asking for a mutable data structure. Apart from the standard libraries, I would recommend you look up this:
vector: http://hackage.haskell.org/package/vector
That said, I'm not so sure that you need them. There are neat algorithms for persistent data structures as well. A fast replacement for Data.Map is the hash map from this package:
unordered-containers: http://hackage.haskell.org/package/unordered-containers
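For illustration, here is a minimal sketch (not from the original answer) of using it: swapping Data.Map for Data.HashMap.Strict is mostly a drop-in change, with average O(1) inserts and lookups.

import qualified Data.HashMap.Strict as HM

-- A 10x10 grid of cells keyed by coordinates, with one cell updated.
grid :: HM.HashMap (Int, Int) Int
grid = HM.insert (3, 4) 42 (HM.fromList [((x, y), 0) | x <- [0..9], y <- [0..9]])

-- HM.lookup (3, 4) grid  ==>  Just 42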

Related

Raku: is there a SUPER fast way to turn an array into a string without the spaces separating the elements?

I need to convert thousands of binary byte strings, each about a megabyte long, into ASCII strings. This is what I have been doing, and it seems too slow:
sub fileToCorrectUTF8Str ($fileName) { # binary file
    my $finalString = "";
    my $fileBuf = slurp($fileName, :bin);
    for @$fileBuf { $finalString = $finalString ~ $_.chr; };
    return $finalString;
}
~@b turns @b into a string with all elements separated by spaces, but this is not what I want. If @b = < a b c d >, then ~@b is "a b c d"; but I just want "abcd", and I want to do this REALLY fast.
So, what is the best way? I can't really use hyper for parallelism because the final string is constructed sequentially. Or can I?
TL;DR On an old rakudo, .decode is about 100X as fast.
In longer form to match your code:
sub fileToCorrectUTF8Str ($fileName) { # binary file
    slurp($fileName, :bin).decode
}
Performance notes
First, here's what I wrote for testing:
# Create a file a million and 1 bytes long:
spurt 'foo', "1234\n6789\n" x 1e5 ~ 'Z', :bin;
# (`say` the last character to check work is done)
say .decode.substr(1e6) with slurp 'foo', :bin;
# (or use the original: fileToCorrectUTF8Str 'foo')
say now - INIT now;
On TIO.run's 2018.12 rakudo, the above .decode weighs in at about .05 seconds per million byte file instead of about 5 seconds for your solution.
You could/should of course test on your system and/or using later versions of rakudo. I would expect the difference to remain in the same order, but for the absolute times to improve markedly as the years roll by.[1]
Why is it 100X as fast?
Well, first, @ on a Buf / Blob explicitly forces raku to view the erstwhile single item (a buffer) as a plural thing (a list of elements aka multiple items). That means high-level iteration which, for a million-element buffer, is immediately a million high-level iterations/operations instead of just one high-level operation.
Second, using .decode not only avoids iteration but only incurs relatively slow method call overhead once per file whereas when iterating there are potentially a million .chr calls per file. Method calls are (at least semantically) late-bound which is in principle relatively costly compared to, for example, calling a sub instead of a method (subs are generally early bound).
That all said:
Remember Caveat Empty[1]. For example, rakudo's standard classes generate method caches, and it's plausible the compiler just in-lines the method anyway, so it's possible there is negligible overhead for the method call aspect.
See also the doc's Performance page, especially Use existing high performance code.
Is the Buf.Str error message LTA?
Update See Liz++'s comment.
If you try to use .Str on a Buf or Blob (or equivalent, such as using the ~ prefix on it) you'll get an exception. Currently the message is:
Cannot use a Buf as a string, but you called the Str method on it
The doc for .Str on a Buf/Blob currently says:
In order to convert to a Str you need to use .decode.
It's arguably LTA that the error message doesn't suggest the same thing.
Then again, before deciding what to do about this, if anything, we need to consider what, and how, folk could learn from anything that goes wrong, including signals about it, such as error messages, and also what and how they do in fact currently learn, and bias our reactions toward building the right culture and infrastructure.
In particular, if folk can easily connect between an error message they see, and online discussion that elaborates on it, that needs to be taken into account and perhaps encouraged and/or made easier.
For example, there's now this SO covering this issue with the error message in it, so a google is likely to get someone here. Leaning on that might well be a more appropriate path forward than changing the error message. Or it might not. The change would be easy...
Please consider commenting below and/or searching existing rakudo issues to see if improvement of the Buf.Str error message is being considered and/or whether you wish to open an issue to propose it be altered. Every rock moved is at least great exercise, and, as our collective effort becomes increasingly wise, improves (our view of) the mountain.
Footnotes
[1] As the well known Latin saying Caveat Empty goes, both absolute and relative performance of any particular raku feature, and more generally any particular code, is always subject to variation due to factors including one's system's capabilities, its load during the time it's running the code, and any optimization done by the compiler. Thus, for example, if your system is "empty", then your code may run faster. Or, as another example, if you wait a year or three for the compiler to get faster, advances in rakudo's performance continue to look promising.

MATLAB Bottleneck

I am working with big arrays (~6x40 million) and my code is showing serious bottlenecks. I am experienced at programming in MATLAB, but don't know much about its inner workings (memory management and such).
My code looks as follows (just the essentials; of course all variables are initialized, especially the arrays in loops; I just don't want to bomb you all with code):
First I read the file,
disp('Point cloud import and subsampling')
tic
fid=fopen(strcat(Name,'.dat'));
C=textscan(fid, '%d%d%f%f%f%d'); %<= Big!
fclose(fid);
then create arrays out of the contents,
y=C{1}(1:Subsampling:end)/Subsampling;
x=C{2}(1:Subsampling:end)/Subsampling;
%... and so on for the other rows
clear C %No one wants 400+ million doubles just lying around.
And clear the cell array (1), and create some images and arrays with the new values
for i=1:length(x)
PCImage(y(i)+SubSize(1)-maxy+1,x(i)+1-minx)=Reflectanse(i);
PixelCoordinates(y(i)+SubSize(1)-maxy+1,x(i)+1-minx,:)=Coordinates(i,:);
end
toc
Everything runs more or less smoothly until here, but then I manipulate some arrays
disp('Overlap alignment')
tic
PCImage=PCImage(:,[1:maxx/2-Overlap,maxx/2:end-Overlap]); %-30 overlap?
PixelCoordinates=PixelCoordinates(:,[1:maxx/2-Overlap,maxx/2:end-Overlap],:);
Sphere=Sphere(:,[1:maxx/2-Overlap,maxx/2:end-Overlap],:);
toc
and this is a big bottleneck, but it gets worse at the next step
disp('Planar view and point cloud matching')
tic
CompImage=zeros(max(SubSize(1),PCSize(1)),max(SubSize(2),PCSize(2)),3);
CompImage(1:SubSize(1),1:SubSize(2),2)=Subimage; %ExportImage Cyan
CompImage(1:SubSize(1),1:SubSize(2),3)=Subimage;
CompImage(1:PCSize(1),1:PCSize(2),1)=PCImage; %PointCloudImage Red
toc
Output
Point cloud import and subsampling
Elapsed time is 181.157182 seconds.
Overlap alignment
Elapsed time is 408.750932 seconds.
Planar view and point cloud matching
Elapsed time is 719.383807 seconds.
My questions are: will clearing unused objects like C in (1) have any effect? (It doesn't seem like it.)
Am I overlooking any other important mechanisms or rules of thumb, or is the whole thing just too much data and supposed to take this long?
When subsref is used, MATLAB makes a copy of the sub-referenced elements. This may be costly for large arrays. Often it will be faster to concatenate vectors like
res = [a,b,c];
This is not possible with the current code as written above, but if the code could be modified to make this work, it may save some time.
EDIT
For multi-dimensional arrays you need to use cat
CompImage = cat(dim,Subimage,Subimage,PCImage);
where dim is 3 for this example.

What is the most efficient way to read a CSV file into an Accelerate (or Repa) Array?

I am interested in playing around with the Accelerate library, and I would like to perform some operations on data stored inside of a CSV file. I've read this excellent introduction to Accelerate, but I'm not sure how I can go about reading CSVs into Accelerate efficiently. I've thought about this, and the only thing I can think of is to parse the entire CSV file into one long list, and then feed the entire list into Accelerate.
My data sets will be quite large, and it doesn't seem efficient to read a 1 GB+ file into memory only to copy it somewhere else. I noticed there was a CSV Enumerator package on Hackage, but I'm not sure how to use it with Accelerate's generate function. Another constraint is that it seems the dimensions of the Array, or at least the number of elements, must be known before generating an array with Accelerate.
Has anyone dealt with this kind of problem before?
Thanks!
I am not sure if this is 100% applicable to accelerate or repa, but here is one way I've handled this for Vector in the past:
import Control.Monad.Primitive (PrimMonad)
import Control.Monad.Trans.Class (lift)
import Data.Conduit (ConduitM, await)
import qualified Data.Vector.Generic as GV
import qualified Data.Vector.Generic.Mutable as GMV

-- | A hopefully-efficient sink that incrementally grows a vector from the input stream
sinkVector :: (PrimMonad m, GV.Vector v a) => Int -> ConduitM a o m (Int, v a)
sinkVector by = do
    v <- lift $ GMV.new by
    go 0 v
  where
    -- i is the index of the next element to be written by go
    -- also exactly the number of elements in v so far
    go i v = do
        res <- await
        case res of
          Nothing -> do
            v' <- lift $ GV.freeze $ GMV.slice 0 i v
            return $! (i, v')
          Just x -> do
            v' <- case GMV.length v == i of
                    True  -> lift $ GMV.grow v by
                    False -> return v
            lift $ GMV.write v' i x
            go (i+1) v'
It basically allocates `by` empty slots and proceeds to fill them. Once it hits the ceiling, it grows the underlying vector by that amount again. I haven't benchmarked anything, but it appears to perform OK in practice. I am curious to see if there will be other, more efficient answers here.
Hope this helps in some way. I do see there's a fromVector function in repa and perhaps that's your golden ticket in combination with this method.
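For example, here is a sketch (assuming repa 3's fromUnboxed; the fromVector mentioned above is its analogue in older repa versions) of wrapping the (length, vector) pair produced by such a sink as a one-dimensional repa array:

import qualified Data.Array.Repa as R
import Data.Array.Repa (Z(..), (:.)(..))
import qualified Data.Vector.Unboxed as U

-- Wrap an unboxed vector of known length as a 1-D repa array (no copy).
toRepa :: U.Unbox a => Int -> U.Vector a -> R.Array R.U R.DIM1 a
toRepa n v = R.fromUnboxed (Z :. n) v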
I haven't tried reading CSV files into repa but I recommend using cassava (http://hackage.haskell.org/package/cassava). Iirc I had a 1.5G file which I used to create my stats. With cassava, my program ran in a surprisingly small amount of memory. Here's an extended example of usage:
http://idontgetoutmuch.wordpress.com/2013/10/23/parking-in-westminster-an-analysis-in-haskell/
In the case of repa, if you add rows incrementally to an array (which it sounds like you want to do) then one would hope the space usage would also grow incrementally. It certainly is worth an experiment. And possibly also contacting the repa folks. Please report back on your results :-)
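As a hedged sketch (the file path and record type are illustrative, not from the post), cassava can decode the whole file into a Vector of rows, so the element count is known before you build the repa or Accelerate array:

import qualified Data.ByteString.Lazy as BL
import qualified Data.Csv as Csv
import qualified Data.Vector as V

-- Decode a headerless CSV of Double triples; V.length gives the row
-- count needed to fix the array extent afterwards.
readRows :: FilePath -> IO (V.Vector (Double, Double, Double))
readRows path = do
  bytes <- BL.readFile path
  case Csv.decode Csv.NoHeader bytes of
    Left err   -> error err
    Right rows -> return rows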

How to go about creating a Prolog program that can work backwards to determine steps needed to reach a goal

I'm not sure what exactly I'm trying to ask. I want to be able to make some code that can easily take an initial and final state and some rules, and determine paths/choices to get there.
So think, for example, of a game like Starcraft. To build a factory I need to have a barracks and a command center already built. So if I have nothing and I want a factory I might say ->Command Center->Barracks->Factory. Each thing takes time and resources, and that should be noted and considered in the path. If I want my factory at 5 minutes there are fewer options than if I want it at 10.
Also, the engine should be able to calculate available resources and utilize them effectively. Those three buildings might cost 600 total minerals but the engine should plan the Command Center when it would have 200 (or w/e it costs).
This would ultimately have requirements similar to 10 marines @ 5 minutes, infantry weapons upgrade at 6:30, 30 marines at 10 minutes, Factory @ 11, etc...
So, how do I go about doing something like this? My first thought was to use some procedural language and make all the decisions from the ground up. I could simulate the system, branching and making different choices. Ultimately, some choices are going to quickly make it impossible to reach goals later. (If I build 20 Supply Depots I'm probably not going to make that factory on time.)
So then I thought: weren't functional languages designed for this? I tried to write some Prolog but I've been having trouble with stuff like time and distance calculations. And I'm not sure of the best way to return the "plan".
I was thinking I could write:
depends_on(factory, barracks)
depends_on(barracks, command_center)
builds_from(marine, barracks)
build_time(command_center, 60)
build_time(barracks, 45)
build_time(factory, 30)
minerals(command_center, 400)
...
build(X) :-
depends_on(X, Y),
build_time(X, T),
minerals(X, M),
...
Here's where I get confused. I'm not sure how to construct this predicate and a query to get anything even close to what I want. I would have to somehow account for the rate at which minerals are gathered during the time spent building, and other possible paths with extra resources. If I only want 1 marine in 10 minutes I would want the engine to generate lots of plans, because there are lots of ways to end with 1 marine at 10 minutes (maybe cut it off after so many; not sure how you do that in Prolog).
I'm looking for advice on how to continue down this path, or advice about other options. I haven't been able to find anything more useful than Towers of Hanoi and ancestry examples for AI, so even some good articles explaining how to use Prolog to DO REAL THINGS would be amazing. And if I somehow can get these rules set up in a useful way, how do I get the "plans" Prolog came up with (ways to solve the query) other than writing to stdout like all the Towers of Hanoi examples do? Or is that the preferred way?
My other question is: my main code is in Ruby (and potentially other languages), and the options to communicate with Prolog are calling my Prolog program from within Ruby, accessing a virtual file system from within Prolog, or some kind of database structure (unlikely). I'm using SWI-Prolog atm; would I be better off doing this procedurally in Ruby, or would constructing this in a language like Prolog or Haskell be worth the extra effort of integrating?
I'm sorry if this is unclear, I appreciate any attempt to help, and I'll re-word things that are unclear.
Your question is typical and very common for users of procedural languages who first try Prolog. It is very easy to solve: You need to think in terms of relations between successive states of your world. A state of your world consists for example of the time elapsed, the minerals available, the things you already built etc. Such a state can be easily represented with a Prolog term, and could look for example like time_minerals_buildings(10, 10000, [barracks,factory]). Given such a state, you need to describe what the state's possible successor states look like. For example:
state_successor(State0, State) :-
    State0 = time_minerals_buildings(Time0, Minerals0, Buildings0),
    Time is Time0 + 1,
    can_build_new_building(Buildings0, Building),
    building_minerals(Building, MB),
    Minerals is Minerals0 - MB,
    Minerals >= 0,
    State = time_minerals_buildings(Time, Minerals, [Building|Buildings0]).
I am using the explicit naming convention (State0 -> State) to make clear that we are talking about successive states. You can of course also pull the unifications into the clause head. The example code is purely hypothetical and could look rather different in your final application. In this case, I am describing that the new state's elapsed time is the old state's time + 1, that the new amount of minerals decreases by the amount required to build Building, and that I have a predicate can_build_new_building(Bs, B), which is true when a new building B can be built assuming that the buildings given in Bs are already built. I assume it is a non-deterministic predicate in general, and will yield all possible answers (= new buildings that can be built) on backtracking, and I leave it as an exercise for you to define such a predicate.
Given such a predicate state_successor/2, which relates a state of the world to its direct possible successors, you can easily define a path of states that lead to a desired final state. In its simplest form, it will look similar to the following DCG that describes a list of successive states:
states(State0) -->
    (   { final_state(State0) } -> []
    ;   [State0],
        { state_successor(State0, State1) },
        states(State1)
    ).
You can then use for example iterative deepening to search for solutions:
?- initial_state(S0), length(Path, _), phrase(states(S0), Path).
Also, you can keep track of states you already considered and avoid re-exploring them etc.
The reason you get confused with the example code you posted is essentially that build/1 does not have enough arguments to describe what you want. You need at least two arguments: One is the current state of the world, and the other is a possible successor to this given state. Given such a relation, everything else you need can be described easily. I hope this answers your question.
Caveat: my Prolog is rusty and shallow, so this may be off base
Perhaps a 'difference engine' approach would be appropriate:
given a goal like 'build factory',
backwards-chaining relations would check for has-barracks and tell you first to build-barracks,
which would check for has-command-center and tell you to build-command-center,
and so on,
accumulating a plan (and costs) along the way
If this is practical, it may be more flexible than a state-based approach... or it may be the same thing wearing a different t-shirt!

How to roll a fast BVH representation in Haskell

I'm playing with a Haskell raytracer and currently use a BVH implementation that uses a naive binary tree to store the hierarchy:
data TreeBvh
   = Node Dimension TreeBvh TreeBvh AABB
   | Leaf AnyPrim AABB
where Dimension is either X, Y or Z (used for faster traversal) and AABB is my type for an axis-aligned bounding box. This is working reasonably well, but I'd really like to get this as fast as I possibly can. So my next step (when using C/C++) would be to use this tree to construct a flattened representation where the nodes are stored in an array, the "left" child immediately follows its parent node and the index of the right child is stored with the parent, so I have something like this:
data LinearNode
   = LinearNode Dimension Int AABB
   | LinearLeaf AnyPrim AABB

data LinearBvh
   = MkLinearBvh (Array Int LinearNode)
I didn't really try this one out yet, but I fear the performance would still be sub-par because I can't store LinearNode instances in a UArray, nor could I store the Int indexing the right child together with the Float values which make up the AABB in a single UArray (correct me if I got this wrong). And using two Arrays would mean bad cache coherency. So I'm basically looking for a way to efficiently store my tree so I can expect good performance for traversal. It should be
compact
have good locality properties
work with recent GHC compilers
should go through as few indirections as possible (going through a "thunk" can't help performance, so "unboxed" types would help I think)
If I understood you correctly, you want unboxed arrays of user-defined types? If so, check out the vector package, which also supports loop fusion. It's worth checking out the slides for High-Performance Haskell.
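As a rough sketch (the field layout and names are illustrative, not from the question), one way to get a flat, unboxed layout with vector is to split the node fields into parallel unboxed vectors (a structure-of-arrays), so traversal touches no boxed values or thunks:

import qualified Data.Vector.Unboxed as U

-- Flattened BVH as a structure of arrays: one unboxed vector per field.
-- Here a negative rightChild entry could mark a leaf, with the primitive
-- index stored as its negation.
data FlatBvh = FlatBvh
  { bvhAxis       :: !(U.Vector Int)                    -- 0 = X, 1 = Y, 2 = Z
  , bvhRightChild :: !(U.Vector Int)                    -- index of right child or leaf tag
  , bvhBoxMin     :: !(U.Vector (Float, Float, Float))  -- AABB minimum corner
  , bvhBoxMax     :: !(U.Vector (Float, Float, Float))  -- AABB maximum corner
  }

-- Reading one node's bounds indexes two unboxed vectors directly:
nodeBounds :: FlatBvh -> Int -> ((Float, Float, Float), (Float, Float, Float))
nodeBounds bvh i = (bvhBoxMin bvh U.! i, bvhBoxMax bvh U.! i)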
I should really point out that Haskell is not very good at giving the programmer a means of choosing data layout in memory.
You might be interested in storing the tree in a flat array in a cache-oblivious way ("Van Emde Boas tree"). It should work, but who knows. :)
(shameless plug: I've made a similar effort some time ago; I've used some advanced type system features of the ATS programming language to make the raytracer both safer and faster; see the code here: http://code.google.com/p/ats-miscellanea/ -- I didn't go very far yet, unfortunately)
What you're proposing was discovered years ago; it's called a bounding interval hierarchy (BIH).
