Repa creation from ByteString - arrays

Initially I have a ByteString, which I then unpack and convert into Int16s; this part of the process takes relatively little time. I then convert the list of Int16s into a Repa array with the following line:
Repa.fromListUnboxed (Z :. bytesOfDataPerImage `div` 2) listOfInts
According to the profiler this line is taking ~40% of CPU time, which could just be indicative that the computations I am performing don't warrant the use of Repa. Is there a more efficient route to take when going from ByteString to Repa array?
I have tried the Repa fromByteString function, though the transformation of
Array B DIM1 Word8 -> Array U DIM1 Int16
was pretty slow. I was performing this by first reshaping the array into a 2d array of Word8s, then folding into Int16s. Perhaps the Byte array was the right approach and my conversion method is just wrong.
convertImageData :: Array B DIM1 Word8 -> Array U DIM1 Int16
convertImageData !arr = Repa.foldS convertWords 0 (Repa.map fromIntegral (splitArray arr))
splitArray :: Array B DIM1 Word8 -> Array U DIM2 Word8
splitArray !arr = computeUnboxedS $ reshape (Z :. ((size $ extent arr) `div` 2) :. 2) arr
convertWords :: Int16 -> Int16 -> Int16
convertWords !word1 !word2 = (word1 `shiftL` 8) .|. word2
For some context this program is being benchmarked against the same program written in C/C++.

Your initial approach of converting into a list and then calling Repa.fromListUnboxed is certainly very slow, since all it does is force the elements of a list and then load them sequentially into the unboxed array. That is why the conversion into a list appears to take very little time: it only creates a bunch of thunks, and the actual computation happens when you load the list into the array.
Your second approach is definitely better, but there are still unnecessary steps. For example, there is no need to reshape the array; you can just pass the new size to the fromByteString function. So here is a slightly improved version:
bytesToRepaOriginal :: Bytes.ByteString -> Repa.Array Repa.U Repa.DIM1 Int16
bytesToRepaOriginal bs =
  Repa.foldS
    convertWords
    0
    (Repa.map fromIntegral $
     Repa.fromByteString (Repa.Z Repa.:. (Bytes.length bs `div` 2) Repa.:. 2) bs)
The fromByteString function and the B representation in Repa aren't particularly fast for some reason, so there is a faster way to do it, namely to construct the array by directly indexing the ByteString:
bytesToRepaP :: Monad m => Bytes.ByteString -> m (Repa.Array Repa.U Repa.DIM1 Int16)
bytesToRepaP bs =
  Repa.computeUnboxedP $
  Repa.fromFunction
    (Repa.Z Repa.:. (Bytes.length bs `div` 2))
    (\(Repa.Z Repa.:. i) ->
       let i' = i * 2
           f = Bytes.unsafeIndex bs
        in (fromIntegral (f i') `shiftL` 8) .|. fromIntegral (f (i' + 1)))
Switching to sequential computation with Repa.computeUnboxedS will give you roughly a 2x slowdown, but since we are trying to optimize, we should go all the way with parallel computation.
Not to steal all the thunder from the very nice Repa library, I'd also like to show how all of that would work with the newer massiv library:
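As a sanity check on the byte-combining step, independent of Repa, the same big-endian pairing can be sketched over a plain list of Word8. The names combineBE and decodeInt16s are made up for this illustration, and the sketch assumes the high byte comes first, as in convertWords from the question.

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Int (Int16)
import Data.Word (Word8)

-- Combine a high byte and a low byte into one Int16 (big-endian order).
-- combineBE is a hypothetical helper, not part of Repa or massiv.
combineBE :: Word8 -> Word8 -> Int16
combineBE hi lo = (fromIntegral hi `shiftL` 8) .|. fromIntegral lo

-- Decode consecutive byte pairs; a trailing odd byte is dropped,
-- mirroring the `div` 2 used for the array size.
decodeInt16s :: [Word8] -> [Int16]
decodeInt16s (hi : lo : rest) = combineBE hi lo : decodeInt16s rest
decodeInt16s _                = []

main :: IO ()
main = print (decodeInt16s [0x01, 0x02, 0xFF, 0xFE])  -- prints [258,-2]
```

A list version like this is only useful for checking results on small inputs; it has the same performance problem as the original list-based approach.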
import Data.Massiv.Array as Massiv
bytesToMassiv :: Bytes.ByteString -> Massiv.Array Massiv.U Massiv.Ix1 Int16
bytesToMassiv bs =
  Massiv.makeArrayR U Par (Bytes.length bs `div` 2)
    (\i ->
       let i' = i * 2
           f = Bytes.unsafeIndex bs
        in (fromIntegral (f i') `shiftL` 8) .|. fromIntegral (f (i' + 1)))
Just to present some concrete numbers showing the optimizations in action, here are stripped-down criterion benchmarks:
benchmarking fromByteString/Massiv Parallel
time 1.114 ms (1.079 ms .. 1.156 ms)
benchmarking fromByteString/Repa Parallel
time 1.954 ms (1.871 ms .. 2.040 ms)
benchmarking fromByteString/Repa Original
time 15.80 ms (15.67 ms .. 15.92 ms)


Knights tour in haskell getting a loop

I'm in the process of coding the knight's tour function, and I've got as far as this, where I'm getting an infinite loop in my ghci:
type Field = (Int, Int)
nextPositions :: Int -> Field -> [Field]
nextPositions n (x,y) = filter onBoard
    [(x+2,y-1),(x+2,y+1),(x-2,y-1),(x-2,y+1),(x+1,y-2),(x+1,y+2),(x-1,y-2),(x-1,y+2)]
  where onBoard (x,y) = x `elem` [1..n] && y `elem` [1..n]
type Path = [Field]
knightTour :: Int -> Field -> [Path]
knightTour n start = [posi:path | (posi,path) <- tour (n*n)]
  where tour 1 = [(start, [])]
        tour k = [(posi', posi:path) | (posi, path) <- tour (k-1), posi' <- (filter (`notElem` path) (nextPositions n posi))]
For example, knightTour 10 (4,4) does not give any output!
Any advice?
I think one of the main problems is checking if you have visited a square. This takes too much time. You should look for a data structure that makes that more efficient.
For small boards, for example up to 8×8, you can make use of a 64-bit integer for that. A 64-bit integer can be seen as 64 booleans, each of which represents whether the knight has already visited that square.
We can thus implement this with:
{-# LANGUAGE BangPatterns #-}
import Data.Bits(testBit, setBit)
import Data.Word(Word64)
testPosition :: Int -> Word64 -> (Int, Int) -> Bool
testPosition !n !w (!r, !c) = testBit w (n*r + c)
setPosition :: Int -> (Int, Int) -> Word64 -> Word64
setPosition !n (!r, !c) !w = setBit w (n*r + c)
nextPositions :: Int -> Word64 -> (Int, Int) -> [(Int, Int)]
nextPositions !n !w (!x, !y) = [ c
    | c@(x', y') <- [(x-1,y-2), (x-1,y+2), (x+1,y-2), (x+1,y+2), (x-2,y-1), (x-2,y+1), (x+2,y-1), (x+2,y+1)]
    , x' >= 0
    , y' >= 0
    , x' < n
    , y' < n
    , not (testPosition n w c)
    ]
knightTour :: Int -> (Int, Int) -> [[(Int, Int)]]
knightTour n p0 = go (n*n-1) (setPosition n p0 0) p0
    where go 0 _ _ = [[]]
          go !k !w !ps = [
              (ps':rs)
            | ps' <- nextPositions n w ps
            , rs <- go (k-1) (setPosition n ps' w) ps'
            ]
main = print (knightTour 6 (1,1))
If I compile this with the -O2 flag and run this locally for a 5×5 board where the knight starts at (1,1), all the solutions are generated in 0.32 seconds. For a 6×6 board, it takes 2.91 seconds to print the first solution, but it takes forever to find all solutions that start at (1,1). For an 8×8 board, the first solution was found in 185.76 seconds:
[(0,3),(1,5),(0,7),(2,6),(1,4),(0,2),(1,0),(2,2),(3,0),(4,2),(3,4),(4,6),(5,4),(6,2),(5,0),(3,1),(2,3),(3,5),(2,7),(0,6),(2,5),(1,3),(0,1),(2,0),(3,2),(2,4),(0,5),(1,7),(3,6),(4,4),(5,6),(7,7),(6,5),(7,3),(6,1),(4,0),(5,2),(7,1),(6,3),(7,5),(6,7),(5,5),(4,7),(6,6),(7,4),(5,3),(7,2),(6,0),(4,1),(3,3),(2,1),(0,0),(1,2),(0,4),(1,6),(3,7),(4,5),(5,7),(7,6),(6,4),(4,3),(5,1),(7,0)]
It is, however, not a good idea to solve this with a brute-force approach. If we assume an average branching factor of ~6 moves, then for a 6×6 board we already have roughly 6^36 ≈ 1.031×10^28 possible sequences to examine.
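The estimate is just the branching factor raised to the number of squares, which is easy to reproduce. The sketch below uses a made-up helper name, sequencesEstimate, and treats the branching factor of 6 as the rough assumption it is.

```haskell
-- Rough upper bound on the number of move sequences: branching ^ squares.
-- sequencesEstimate is a hypothetical name used only for this illustration.
sequencesEstimate :: Integer -> Int -> Integer
sequencesEstimate branching squares = branching ^ squares

main :: IO ()
main = do
  print (sequencesEstimate 6 36)  -- 6x6 board, about 1.03e28
  print (sequencesEstimate 6 64)  -- 8x8 board, far larger still
```

Integer is used instead of Int because the values overflow 64 bits.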
It is better to work with a divide and conquer approach. It is easy to split a board like 8×8 into four 4×4 boards. Then you determine places where you can hop from one board to another, and then you solve the subproblems for a 4×4 board. For small boards, you can easily store the solutions to go from any square to any other square on a 4×4 board, and then reuse these for all quadrants, so you save computational effort, by not calculating this a second time, especially since you do not need to store symmetrical queries multiple times. If you know how to go from (1,0) to (2,3) on a 4×4 board, you can easily use this to go from (3,0) to (2,3) on the same board, just by mirroring this.

What would be an idiomatic F# way to scale a list of (n-tuples or list) with another list, also arrays?

Given:
let weights = [0.5;0.4;0.3]
let X = [[2;3;4];[7;3;2];[5;3;6]]
what I want is: wX = [(0.5)*[2;3;4];(0.4)*[7;3;2];(0.3)*[5;3;6]]
I would like to know an elegant way to do this with lists as well as with arrays. Additional optimization information is welcome.
You write about a list of lists, but your code shows a list of tuples. Taking the liberty to adjust for that, a solution would be
let weights = [0.5;0.4;0.3]
let X = [[2;3;4];[7;3;2];[5;3;6]]
X
|> List.map2 (fun w x ->
    x
    |> List.map (fun xi ->
        (float xi) * w
    )
) weights
Depending on how comfortable you are with the syntax, you may prefer a oneliner like
List.map2 (fun w x -> List.map (float >> (*) w) x) weights X
The same library functions exist for sequences (Seq.map2, Seq.map) and arrays (in the Array module).
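For readers following the Haskell threads on this page, the same shape of computation can be sketched with zipWith. This is only a cross-language illustration and not part of the F# answer; scaleRows is a name invented here.

```haskell
-- Scale each row of integers by the weight at the same position.
-- Like List.map2 paired with List.map, but zipWith truncates to the
-- shorter list instead of failing on a length mismatch.
scaleRows :: [Double] -> [[Int]] -> [[Double]]
scaleRows = zipWith (\w row -> map ((w *) . fromIntegral) row)

main :: IO ()
main = mapM_ print (scaleRows [0.5, 0.4, 0.3] [[2, 3, 4], [7, 3, 2], [5, 3, 6]])
```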
This is much more than an answer to the specific question, but after a chat in the comments and learning that the question was specifically part of a neural network in F#, I am posting this, which covers the question and implements the feedforward part of a neural network. It makes use of MathNet Numerics.
This code is an F# translation of part of the Python code from Neural Networks and Deep Learning.
Python
def backprop(self, x, y):
    """Return a tuple ``(nabla_b, nabla_w)`` representing the
    gradient for the cost function C_x. ``nabla_b`` and
    ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
    to ``self.biases`` and ``self.weights``."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    # feedforward
    activation = x
    activations = [x] # list to store all the activations, layer by layer
    zs = [] # list to store all the z vectors, layer by layer
    for b, w in zip(self.biases, self.weights):
        z = np.dot(w, activation)+b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
F#
module NeuralNetwork1 =
    //# Third-party libraries
    open MathNet.Numerics.Distributions // Normal.Sample
    open MathNet.Numerics.LinearAlgebra // Matrix
    type Network(sizes : int array) =
        let mutable (_biases : Matrix<double> list) = []
        let mutable (_weights : Matrix<double> list) = []
        member __.Biases
            with get() = _biases
            and set value =
                _biases <- value
        member __.Weights
            with get() = _weights
            and set value =
                _weights <- value
        member __.Backprop (x : Matrix<double>) (y : Matrix<double>) =
            // Note: There is a separate member for feedforward. This one is only used within Backprop
            // Note: In the text layers are numbered from 1 to n with 1 being the input and n being the output
            // In the code layers are numbered from 0 to n-1 with 0 being the input and n-1 being the output
            // Layers
            // 1 2 3 Text
            // 0 1 2 Code
            // 784 -> 30 -> 10
            let feedforward () : (Matrix<double> list * Matrix<double> list) =
                let (bw : (Matrix<double> * Matrix<double>) list) = List.zip __.Biases __.Weights
                let rec feedfowardInner layer activation zs activations =
                    match layer with
                    | x when x < (__.NumLayers - 1) ->
                        let (bias, weight) = bw.[layer]
                        let z = weight * activation + bias
                        let activation = __.Sigmoid z
                        feedfowardInner (layer + 1) activation (z :: zs) (activation :: activations)
                    | _ ->
                        // Normally with recursive functions that build list for returning
                        // the final list(s) would be reversed before returning.
                        // However since the returned list will be accessed in reverse order
                        // for the backpropagation step, we leave them in the reverse order.
                        (zs, activations)
                feedfowardInner 0 x [] [x]
In weight * activation, the * is an overloaded operator operating on Matrix<double>.
Relating this back to your example data, and using MathNet Numerics arithmetic:
let weights = [0.5;0.4;0.3]
let X = [[2;3;4];[7;3;2];[5;3;6]]
first the values for X need to be converted to float
let x1 = [[2.0;3.0;4.0];[7.0;3.0;2.0];[5.0;3.0;6.0]]
Now notice that x1 is a matrix and weights is a vector
so we can just multiply
let wx1 = weights * x1
Since the way I validated the code was a bit more involved than most, I will explain it so that you don't have doubts about its validity.
When working with Neural Networks and in particular mini-batches, the starting numbers for the weights and biases are random and the generation of the mini-batches is also done randomly.
I know the original Python code was valid and I was able to run it successfully and get the same results as indicated in the book, meaning that the initial successes were within a couple of percent of the book and the graphs of the success were the same. I did this for several runs and several configurations of the neural network as discussed in the book. Then I ran the F# code and achieved the same graphs.
I also copied the starting random number sets from the Python code into the F# code so that while the data generated was random, both the Python and F# code used the same starting numbers, of which there are thousands. I then single stepped both the Python and F# code to verify that each individual function was returning a comparable float value, e.g. I put a break point on each line and made sure I checked each one. This actually took a few days because I had to write export and import code and massage the data from Python to F#.
See: How to determine type of nested data structures in Python?
I also tried a variation where I replaced the F# list with Linked list, but found no increase in speed, e.g. LinkedList<Matrix<double>>. Was an interesting exercise.
If I understand correctly,
let wX = weights |> List.map (fun w ->
    X |> List.map (fun (a, b, c) ->
        w * float a,
        w * float b,
        w * float c))
This is an alternate way to achieve this using Math.Net: https://numerics.mathdotnet.com/Matrix.html#Arithmetics

F# memoization efficiency - near 1 million elements

I'm working on an F# solution to this problem, where I need to find the generator element below 1,000,000 with the longest generated sequence.
I use a tail-recursive function that memoizes the previous results to speed up the calculation. This is my current implementation.
let memoize f =
    let cache = new Dictionary<_,_>(1000000)
    (fun x ->
        match cache.TryGetValue x with
        | true, v ->
            v
        | _ -> let v = f x
               cache.Add(x, v)
               v)
let rec memSequence =
    memoize (fun generator s ->
        if generator = 1 then s + 1
        else
            let state = s+1
            if even generator then memSequence(generator/2) state
            else memSequence(3*generator + 1) state )
let problem14 =
    Array.init 999999 (fun idx -> (idx+1, (memSequence (idx+1) 0))) |> Array.maxBy snd |> fst
It seems to work well until I want to calculate the lengths of the sequences generated by the first 100,000 numbers, but it slows down significantly beyond that. In fact, for 120,000 it doesn't seem to terminate. I had a feeling that it might be due to the Dictionary I use, but I read that this shouldn't be the case. Could you point out why this may be inefficient?
You're on the right track, but there's one thing very wrong in how you implement your memoization.
Your memoize function takes a function of one argument and returns a memoized version of it. When you use it in memSequence, however, you give it a curried, two-argument function. What then happens is that memoize takes the function and caches the result of partially applying it to the first argument only, i.e. it stores the closure resulting from applying the function to generator, and then proceeds to call those closures on s.
This means that your memoization effectively doesn't do anything - add some print statements in your memoize function and you'll see that you're still doing full recursion.
I think the underlying question may have been How to combine a memoizing function with a potentially costly calculating function that takes more than one argument?.
In this case, that second argument isn't needed. There's nothing inherently wrong with memoizing 2168612 elements (the size of the dictionary after the calculation).
Beware of overflow, since at 113383 the sequence surpasses System.Int32.MaxValue. A solution might thus look like this:
let memoRec f =
    let d = new System.Collections.Generic.Dictionary<_,_>()
    let rec g x =
        match d.TryGetValue x with
        | true, res -> res
        | _ -> let res = f g x in d.Add(x, res); res
    g
let collatzLong =
    memoRec (fun f n ->
        if n <= 1L then 0
        else 1 + f (if n % 2L = 0L then n / 2L else n * 3L + 1L) )
{0L .. 999999L}
|> Seq.map (fun i -> i, collatzLong i)
|> Seq.maxBy snd
|> fst
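The overflow warning can be checked with a small sketch that walks the sequence using arbitrary-precision integers and tests whether any term exceeds the 32-bit signed maximum. It is written in Haskell to match the other threads on this page; collatz and exceedsInt32 are names invented for this illustration, and the start value is assumed to be at least 1.

```haskell
import Data.Int (Int32)

-- The Collatz sequence from a start value down to 1.
-- Assumes start >= 1; other inputs would not terminate.
collatz :: Integer -> [Integer]
collatz 1 = [1]
collatz n
  | even n    = n : collatz (n `div` 2)
  | otherwise = n : collatz (3 * n + 1)

-- True when some term would not fit in a 32-bit signed integer.
exceedsInt32 :: Integer -> Bool
exceedsInt32 start = any (> fromIntegral (maxBound :: Int32)) (collatz start)

main :: IO ()
main = do
  print (exceedsInt32 27)      -- a small start value that stays in range
  print (exceedsInt32 113383)  -- the start value mentioned above
```

This is why the F# solution switches to 64-bit literals (the L suffix) for the sequence values.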

2d Array Sort in Haskell

I'm trying to teach myself Haskell (coming from OOP languages). Having a hard time grasping the immutable variables stuff. I'm trying to sort a 2d array in row major.
In java, for example (pseudo):
int array[3][3] = **initialize array here
for(i = 0; i<3; i++)
    for(j = 0; j<3; j++)
        if(array[i][j] < current_low)
            current_low = array[i][j]
How can I implement this same sort of thing in Haskell? If I create a temp array to add the low values to after each iteration, I won't be able to add to it because it is immutable, correct? Also, Haskell doesn't have loops, right?
Here's some useful stuff I know in Haskell:
main = do
    let a = [[10,4],[6,10],[5,2]] --assign random numbers
    print (a !! 0 !! 1) --will print a[0][1] in java notation
    --How can we loop through the values?
First, your Java code does not sort anything. It just finds the smallest element. And, well, there's a kind of obvious Haskell solution... guess what, the function is called minimum! Let's see what it does:
GHCi> :t minimum
minimum :: Ord a => [a] -> a
ok, so it takes a list of values that can be compared (hence Ord) and outputs a single value, namely the smallest. How do we apply this to a "2D list" (nested list)? Well, basically we need the minimum amongst all minima of the sub-lists. So we first replace the list of list with the list of minima
allMinima = map minimum a
...and then use minimum allMinima.
Written compactly:
main :: IO ()
main = do
    let a = [[10,4],[6,10],[5,2]] -- don't forget the indentation
    print (minimum $ map minimum a)
That's all!
Indeed "looping through values" is a very un-functional concept. We generally don't want to talk about single steps that need to be taken, rather think about properties of the result we want, and let the compiler figure out how to do it. So if we weren't allowed to use the pre-defined minimum, here's how to think about it:
If we have a list and look at a single value... under what circumstances is it the correct result? Well, if it's smaller than all other values. And what is the smallest of the other values? Exactly, the minimum amongst them.
minimum' :: Ord a => [a] -> a
minimum' (x:xs)
    | x < minimum' xs = x
If it's not smaller, then we just use the minimum of the other values
minimum' (x:xs)
    | x < minxs = x
    | otherwise = minxs
  where minxs = minimum' xs
One more thing: if we recurse through the list this way, there will at some point be no first element left to compare with something. To prevent that, we first need the special case of a single-element list:
minimum' :: Ord a => [a] -> a
minimum' [x] = x -- obviously smallest, since there's no other element.
minimum' (x:xs)
    | x < minxs = x
    | otherwise = minxs
  where minxs = minimum' xs
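The same recursion can also be expressed as a fold, which is how such loops are usually written once the pattern is familiar. The following is a sketch with invented names (minimumFold, minimumNested); like minimum', it is undefined on empty lists.

```haskell
-- Fold the list with min, using the last element as the starting value.
-- Partial: calling it on an empty list is an error, like minimum'.
minimumFold :: Ord a => [a] -> a
minimumFold = foldr1 min

-- Smallest element of a nested list: the minimum of the row minima.
minimumNested :: Ord a => [[a]] -> a
minimumNested = minimumFold . map minimumFold

main :: IO ()
main = print (minimumNested [[10, 4], [6, 10], [5, 2]])  -- prints 2
```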
Alright, well, I'll take a stab. Zach, this answer is intended to get you thinking in recursions and folds. Recursions, folds, and maps are the fundamental ways that loops are replaced in functional style. Just try to believe that, in reality, the question of nested looping rarely arises naturally in functional programming. When you actually need to do it, you'll often enter a special section of code, called a monad, in which you can do destructive writes in an imperative style. Here's an example. But since you asked for help with breaking out of loop thinking, I'm going to focus on that part of the answer instead. @leftaroundabout's answer is also very good, and you can fill in his definition of minimum here.
flatten :: [[a]] -> [a]
flatten [] = []
flatten xs = foldr (++) [] xs
squarize :: Int -> [a] -> [[a]]
squarize _ [] = []
squarize len xs = (take len xs) : (squarize len $ drop len xs)
crappySort :: Ord a => [a] -> [a]
crappySort [] = []
crappySort xs =
    let smallest = minimum xs
        rest = filter (smallest /=) xs
        count = (length xs) - (length rest)
    in
        replicate count smallest ++ crappySort rest
sortByThrees xs = squarize 3 $ crappySort $ flatten xs
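Once the flatten, sort, and re-chunk structure is clear, the hand-written sort can be swapped for the library one. The sketch below uses sort from Data.List; chunk and sortRowMajor are names invented here, and the row width is assumed to be positive.

```haskell
import Data.List (sort)

-- Split a flat list into rows of the given width.
-- Assumes the width is positive; a width of 0 would never terminate.
chunk :: Int -> [a] -> [[a]]
chunk _ [] = []
chunk n xs = take n xs : chunk n (drop n xs)

-- Sort a nested list in row-major order: flatten, sort, re-chunk.
sortRowMajor :: Ord a => Int -> [[a]] -> [[a]]
sortRowMajor width = chunk width . sort . concat

main :: IO ()
main = print (sortRowMajor 2 [[10, 4], [6, 10], [5, 2]])  -- prints [[2,4],[5,6],[10,10]]
```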

What is the best way to store and sort rectangular data in Haskell?

I have a handful of ASCII files containing around 17 million lines in total, and within each/most lines is a fixed 36-byte identifier. So my data is rectangular: I have a lot of rows of fixed width. Using Haskell, I want to read all the lines in, use a regex to extract the identifier (I'm fine up to there), then sort them and count the number of unique identifiers (very close to grep | sort | uniq). (I'm already parallelising by reading from each file in parallel.) Sounds like a simple problem, but...
I'm finding it hard to get decent performance out of this problem, even before the sorting stage. Here's as far as I've got. String is overkill for 36-bytes of ASCII, so I figured on using ByteString. But a (linked) list of size 17 million seems like a bad idea, so I tried IOVector ByteString. But this seems to be quite slow. I believe the garbage collection is suffering as I retain millions of small ByteStrings in the vector: the GC is taking at least 3 times as long as the code (according to +RTS -s) and I think it only gets worse as the program keeps running.
I was thinking that I should maybe use Repa or some sort of single giant ByteString/IOVector Char8/whatever (since I know the exact width of each row is 36) to store the data in one massive rectangular array for each thread, which should alleviate the GC problem. However, I do still need to sort the rows afterwards, and Repa seems to have no support for sorting, and I don't want to be writing sort algorithms myself. So I don't know how to have a giant rectangular array and yet still sort it.
Suggestions for libraries to use, GC flags to set, or anything else? The machine has 24 cores and 24GB of RAM, so I'm not constrained on hardware. I want to remain in Haskell because I have lots of related code (that is also parsing the same files and producing summary statistics) that is working fine, and I don't want to rewrite it.
I believe the garbage collection is suffering as I retain millions of small ByteStrings in the vector
Suspicious. Retained ByteStrings should not be collected. Maybe there is excessive data copying somewhere in your code?
ByteString is a header (8 bytes) + ForeignPtr Word8 ref (8 bytes) + Int offset (4 bytes) + Int length (4 bytes)
ForeignPtr is a header (8 bytes) + Addr# (8 bytes) + PlainPtr ref (8 bytes)
PlainPtr is a header (8 bytes) + MutableByteArray# ref (8 bytes)
(Revised according to https://stackoverflow.com/a/3256825/648955)
All in all, ByteString overhead is at least 64 bytes (correct me if some fields are shared).
Write your own data management: a big flat Word8 array and an ad hoc offset wrapper
newtype ByteId = ByteId { offset :: Word64 }
with Ord instance.
Overhead would be 8 bytes per identifier. Store offsets in unboxed Vector. Sort with something like this: http://hackage.haskell.org/packages/archive/vector-algorithms/0.5.4.2/doc/html/Data-Vector-Algorithms-Intro.html#v:sort
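A minimal sketch of the offset idea follows. To stay dependency-light it keeps the offsets in a plain list and sorts with sortBy from Data.List, rather than the unboxed Vector and vector-algorithms sort suggested above, so it shows the comparison logic only. recordAt and sortOffsets are invented names, and the record length is passed explicitly (36 in the question, 4 in the demo).

```haskell
import qualified Data.ByteString.Char8 as B
import Data.List (sortBy)
import Data.Ord (comparing)

-- View the fixed-width record starting at a byte offset.
-- B.take and B.drop are O(1) slices; no bytes are copied.
recordAt :: Int -> B.ByteString -> Int -> B.ByteString
recordAt len buf off = B.take len (B.drop off buf)

-- Sort offsets by the records they point at inside the big buffer.
sortOffsets :: Int -> B.ByteString -> [Int] -> [Int]
sortOffsets len buf = sortBy (comparing (recordAt len buf))

main :: IO ()
main = do
  let buf = B.pack "ccccaaaabbbb"      -- three 4-byte records
  print (sortOffsets 4 buf [0, 4, 8])  -- prints [4,8,0]
```

Passing the buffer to the comparison function avoids needing an Ord instance on the offset type, which would otherwise have to reach the buffer through some global reference.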
The Array family of types has built-in support for multi-dimensional arrays. The indices can be anything with an Ix instance, in particular for your case (Int, Int). It also does not support sorting, unfortunately.
But for your use case, do you really need sorting? If you have a map from identifiers to Int you can just increase the count as you go, and later select all keys with value 1. You can check out the bytestring-trie package, although for your use case it suggests to use hashmap.
Another algorithm would be to carry two sets (e.g. HashSet), one with identifiers seen exactly once, and one with identifiers seen more than once, and you update these sets while you go through the list.
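The two-set idea can be sketched as a single strict left fold. This version uses Data.Set from containers instead of HashSet, purely to avoid an extra dependency; classify is a name invented for the illustration.

```haskell
import Data.List (foldl')
import qualified Data.Set as Set

-- Walk the identifiers once, tracking those seen exactly once and
-- those seen more than once. Returns (seenOnce, seenMany).
classify :: Ord a => [a] -> (Set.Set a, Set.Set a)
classify = foldl' step (Set.empty, Set.empty)
  where
    step (once, many) x
      | x `Set.member` many = (once, many)
      | x `Set.member` once = (Set.delete x once, Set.insert x many)
      | otherwise           = (Set.insert x once, many)

main :: IO ()
main = do
  let (once, many) = classify ["a", "b", "a", "c", "a"]
  print (Set.toList once)                 -- prints ["b","c"]
  print (Set.size once + Set.size many)   -- number of distinct identifiers
```

The number of unique identifiers is the combined size of the two sets, so no sorting step is needed.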
Also, how do you read your file: If you read it as one large ByteString and carefully construct the small ByteString objects from it, they will actually be just pointers into the big chunk of memory with the large file, possibly eliminating your GC problems. But to assess that we’d need to see your code.
There are a couple of wrappers around mmap available that can give you either Ptrs to data in your file or a big ByteString. A ByteString is really just a pointer,offset,length tuple; splitting that big ByteString into a bunch of small ones is really just making a bunch of new tuples that point to subsets of the big one. Since you say each record is at a fixed offset in the file, you should be able to create a bunch of new ones without actually accessing any of the file at all via ByteString's take.
I don't have any good suggestions for the sorting part of the problem, but avoiding the copying of the file data in the first place ought to be a good start.
A trie should work. This code takes 45 mins on a file of 18 million lines, 6 million unique keys, on a dual-core laptop with 4 gig RAM:
--invoked: test.exe +RTS -K3.9G -c -h
import qualified Data.ByteString.Char8 as B
import qualified Data.Trie as T
file = "data.txt"
main = ret >>= print
ret = do -- retrieve the data
    ls <- B.readFile file >>= return.B.lines
    trie <- return $ tupleUp ls
    return $ T.size trie
tupleUp:: [B.ByteString] -> T.Trie Int
tupleUp l = foldl f T.empty l
f acc str = case T.lookup str acc
            of Nothing -> T.insert str 1 acc
               Just n -> T.adjust (+1) str acc
Here's the code used to generate the data file (6MM keys, then 3 copies into 1 file to get the 18MM keys):
import qualified Data.ByteString.Char8 as BS
import System.Random
import Data.List.Split
file = "data.txt"
numLines = 6e6 --17e6
chunkSize = 36
charSet = ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9']
-- generate the file
gen = do
    randgen <- getStdGen
    dat <- return $ t randgen
    writeFile file (unlines dat)
t gen = take (ceiling numLines) $ charChunks
    where
        charChunks = chunksOf chunkSize chars
        chars = map (charSet!!) rands
        rands = randomRs (0,(length charSet) -1) gen
main = gen
So, how fast can we be? Let's do some tests with a file generated by @ja.'s code:
cat data.txt > /dev/null
--> 0.17 seconds
The same in Haskell?
import qualified Data.ByteString as B
f = id
main = B.readFile "data.txt" >>= return . f >>= B.putStr
Timing?
time ./Test > /dev/null
--> 0.32 seconds
Takes twice as long, but I suppose it's not too bad. I'm using a strict bytestring because we want to chunk it up in a second.
Next, can we use Vector or is it too slow? Let's build a Vector of chunks and put them back together again. I use the blaze-builder package for optimized output.
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Vector as V
import qualified Blaze.ByteString.Builder as BB
import Data.Monoid
recordLen = 36
lineEndingLen = 2 -- Windows! change to 1 for Unix
numRecords = (`div` (recordLen + lineEndingLen)) . B.length
substr idx len = B.take len . B.drop idx
recordByIdx idx = substr (idx*(recordLen+lineEndingLen)) recordLen
mkVector :: B.ByteString -> V.Vector (B.ByteString)
mkVector bs = V.generate (numRecords bs) (\i -> recordByIdx i bs)
mkBS :: V.Vector (B.ByteString) -> L.ByteString
mkBS = BB.toLazyByteString . V.foldr foldToBS mempty
    where foldToBS :: B.ByteString -> BB.Builder -> BB.Builder
          foldToBS = mappend . BB.fromWrite . BB.writeByteString
main = B.readFile "data.txt" >>= return . mkBS . mkVector >>= L.putStr
How long does it take?
time ./Test2 > /dev/null
--> 1.06 seconds
Not bad at all! Even though you are using a regex to read the lines instead of my fixed chunk positions, we still can conclude that you can put your chunks in a Vector with no serious performance hits.
What's left? Sorting. Theoretically a bucket sort should be an ideal algorithm for this kind of problem. I tried implementing one myself:
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Vector as V
import qualified Data.Vector.Generic.Mutable as MV
import qualified Blaze.ByteString.Builder as BB
import Data.Monoid
import Control.Monad.ST
import Control.Monad.Primitive
recordLen = 36
lineEndingLen = 2 -- Windows! change to 1 for Unix
numRecords = (`div` (recordLen + lineEndingLen)) . B.length
substr idx len = B.take len . B.drop idx
recordByIdx idx = substr (idx*(recordLen+lineEndingLen)) (recordLen+lineEndingLen)
mkVector :: B.ByteString -> V.Vector (B.ByteString)
mkVector bs = V.generate (numRecords bs) (\i -> recordByIdx i bs)
mkBS :: V.Vector (B.ByteString) -> L.ByteString
mkBS = BB.toLazyByteString . V.foldr foldToBS mempty
    where foldToBS :: B.ByteString -> BB.Builder -> BB.Builder
          foldToBS = mappend . BB.fromWrite . BB.writeByteString
bucketSort :: Int -> V.Vector B.ByteString -> V.Vector B.ByteString
bucketSort chunkSize v = runST $ emptyBuckets >>= \bs ->
                                 go v bs lastIdx (chunkSize - 1)
    where lastIdx = V.length v - 1
          emptyBuckets :: ST s (V.MVector (PrimState (ST s)) [B.ByteString])
          emptyBuckets = V.thaw $ V.generate 256 (const [])
          go :: V.Vector B.ByteString ->
                V.MVector (PrimState (ST s)) [B.ByteString] ->
                Int -> Int -> ST s (V.Vector B.ByteString)
          go v _ _ (-1) = return v
          go _ buckets (-1) testIdx = do
              v' <- unbucket buckets
              bs <- emptyBuckets
              go v' bs lastIdx (testIdx - 1)
          go v buckets idx testIdx = do
              let testChunk = v V.! idx
                  testByte = fromIntegral $ testChunk `B.index` testIdx
              b <- MV.read buckets testByte
              MV.write buckets testByte (testChunk:b)
              go v buckets (idx-1) testIdx
          unbucket :: V.MVector (PrimState (ST s)) [B.ByteString] ->
                      ST s (V.Vector B.ByteString)
          unbucket v = do
              v' <- V.freeze v
              return . V.fromList . concat . V.toList $ v'
main = B.readFile "data.txt" >>= return . process >>= L.putStr
    where process = mkBS .
                    bucketSort (recordLen) .
                    mkVector
Testing it gave a time of about 1:50 minutes, which is probably acceptable. We are talking of an O(c*n) algorithm for n in the range of some millions and a constant c of 36*something. But I'm sure you can optimize it further.
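The idea behind the bucket sort above (one stable bucketing pass per byte position, from the last position to the first) can be shown in a few lines over plain Strings. This is only a sketch for small inputs: bucketPass and lsdSort are invented names, every string is assumed to have at least width characters, and characters are assumed to lie in the 0..255 range, as with ASCII identifiers.

```haskell
-- One stable bucketing pass on the character at position i.
-- Scans the input once per possible character value, so it is far
-- slower than mutable buckets, but the ordering logic is the same.
bucketPass :: Int -> [String] -> [String]
bucketPass i xs = concat [ [x | x <- xs, x !! i == c] | c <- ['\0' .. '\255'] ]

-- Least-significant-position-first sort of fixed-width strings.
lsdSort :: Int -> [String] -> [String]
lsdSort width xs = foldl (flip bucketPass) xs [width - 1, width - 2 .. 0]

main :: IO ()
main = print (lsdSort 3 ["cab", "abc", "bca", "aaa"])  -- prints ["aaa","abc","bca","cab"]
```

Each pass must be stable, since the order established by later positions has to survive when ties occur at an earlier position.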
Or you can just use the vector-algorithms package. Testing with a heap sort:
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Vector as V
import qualified Blaze.ByteString.Builder as BB
import Data.Vector.Algorithms.Heap (sort)
import Data.Monoid
import Control.Monad.ST
recordLen = 36
lineEndingLen = 2 -- Windows! change to 1 for Unix
numRecords = (`div` (recordLen + lineEndingLen)) . B.length
substr idx len = B.take len . B.drop idx
recordByIdx idx = substr (idx*(recordLen+lineEndingLen)) (recordLen+lineEndingLen)
mkVector :: B.ByteString -> V.Vector (B.ByteString)
mkVector bs = V.generate (numRecords bs) (\i -> recordByIdx i bs)
mkBS :: V.Vector (B.ByteString) -> L.ByteString
mkBS = BB.toLazyByteString . V.foldr foldToBS mempty
    where foldToBS :: B.ByteString -> BB.Builder -> BB.Builder
          foldToBS = mappend . BB.fromWrite . BB.writeByteString
sortIt v = runST $ do
    mv <- V.thaw v
    sort mv
    V.freeze mv
main = B.readFile "data.txt" >>= return . process >>= L.putStr
    where process = mkBS .
                    sortIt .
                    mkVector
This does the job in about 1:20 minutes on my machine, so right now it's faster than my bucket sort. Both of the final solutions consume something in the range of 1-1.2 GB of RAM.
Good enough?
