Slow array access in Haskell?

I'm doing this Car Game problem on Kattis: https://open.kattis.com/problems/cargame
There's a five-second time limit, but on the last test instance, my code takes longer than that to run. I'm fairly sure I'm doing the right thing from a big-O standpoint, so now I need to optimize it somehow.
I downloaded the test data from:
http://challenge.csc.kth.se/2013/challenge-2013.tar.bz2
From profiling, it seems like most of the running time is spent in containsSub, which is nothing more than an array access together with a tail-recursive call. Furthermore, it's only called about 100M times, so why does it take 6.5 seconds to run? (That's 6.5 s on my laptop; I've found Kattis is generally about twice as slow, so probably more like 13 seconds there.) On the statistics page, some of the C++ solutions run in under a second, and even some Python solutions just barely make it under the 5-second bar.
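(For reference, a profile like this is typically collected by building with GHC's profiling support and running with +RTS -p; the file names below are just placeholders, not from the original post.)
$ ghc -O2 -prof -fprof-auto cargame.hs
$ ./cargame < sample.in +RTS -p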
module Main where

import Control.Monad
import Data.Array (Array, (!), (//))
import qualified Data.Array as Array
import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString.Char8 as BS
import Data.Char
import Data.List
import Data.Maybe

main :: IO ()
main = do
  [n, m] <- readIntsLn
  dictWords <- replicateM n BS.getLine
  let suffixChains = map (\w -> (w, buildChain w)) dictWords
  replicateM_ m $ findChain suffixChains

noWordMsg :: ByteString
noWordMsg = BS.pack "No valid word"

findChain :: [(ByteString, WordChain)] -> IO ()
findChain suffixChains = do
  chrs <- liftM (BS.map toLower) BS.getLine
  BS.putStrLn
    ( case find (containsSub chrs . snd) suffixChains of
        Nothing -> noWordMsg
        Just (w, _) -> w
    )

readAsInt :: BS.ByteString -> Int
readAsInt = fst . fromJust . BS.readInt

readIntsLn :: IO [Int]
readIntsLn = liftM (map readAsInt . BS.words) BS.getLine

data WordChain = None | Rest (Array Char WordChain)

emptyChars :: WordChain
emptyChars = Rest . Array.listArray ('a', 'z') $ repeat None

buildChain :: ByteString -> WordChain
buildChain s =
  case BS.uncons s of
    Nothing -> emptyChars
    Just (hd, tl) ->
      let wc@(Rest m) = buildChain tl
      in Rest $ m // [(hd, wc)]

containsSub :: ByteString -> WordChain -> Bool
containsSub _ None = False
containsSub s (Rest m) =
  case BS.uncons s of
    Nothing -> True
    Just (hd, tl) -> containsSub tl (m ! hd)
EDIT: TAKE 2:
I tried building a lazy trie to avoid searching things I'd already searched. For instance, if I've already encountered a triplet beginning with 'a', then in the future I can skip anything which doesn't contain an 'a'. If I've already searched a triplet beginning with 'ab', I can skip anything which doesn't contain 'ab'. And if I've already searched the exact triplet 'abc', I can just return the same result as last time. In theory, this should give a significant speedup. In practice, the running time is identical.
Furthermore, without the seqs, profiling takes forever and gives bogus results (I couldn't guess why).
With the seqs, profiling says the bulk of the time is spent in forLetter, which is where the array accesses have been moved to, so again it looks like array access is the slow part.
{-# LANGUAGE TupleSections #-}
module Main where

import Control.Monad
import Data.Array (Array, (!), (//))
import qualified Data.Array as Array
import qualified Data.Array.Base as Base
import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString.Char8 as BS
import Data.Char
import Data.Functor
import Data.Maybe

main :: IO ()
main = do
  [n, m] <- readIntsLn
  dictWords <- replicateM n BS.getLine
  let suffixChainsL = map (\w -> (w, buildChain w)) dictWords
  let suffixChains = foldr seq suffixChainsL suffixChainsL
  suffixChains `seq` doProbs m suffixChains

noWordMsg :: ByteString
noWordMsg = BS.pack "No valid word"

doProbs :: Int -> [(ByteString, WordChain)] -> IO ()
doProbs m chains = replicateM_ m doProb
  where
    cf = findChain chains
    doProb = do
      chrs <- liftM (map toLower) getLine
      BS.putStrLn . fromMaybe noWordMsg $ cf chrs

findChain :: [(ByteString, WordChain)] -> String -> Maybe ByteString
findChain [] = const Nothing
findChain suffixChains@(shd : _) = doFind
  where
    letterMap :: Array Char (String -> Maybe ByteString)
    letterMap =
      Array.listArray ('a', 'z')
        [findChain (mapMaybe (forLetter hd) suffixChains) | hd <- [0 .. 25]]
    endRes = Just $ fst shd
    doFind :: String -> Maybe ByteString
    doFind [] = endRes
    doFind (hd : tl) = (letterMap ! hd) tl
    forLetter :: Int -> (ByteString, WordChain) -> Maybe (ByteString, WordChain)
    forLetter c (s, WC wc) = (s,) <$> wc `Base.unsafeAt` c

readAsInt :: BS.ByteString -> Int
readAsInt = fst . fromJust . BS.readInt

readIntsLn :: IO [Int]
readIntsLn = liftM (map readAsInt . BS.words) BS.getLine

newtype WordChain = WC (Array Char (Maybe WordChain))

emptyChars :: WordChain
emptyChars = WC . Array.listArray ('a', 'z') $ repeat Nothing

buildChain :: ByteString -> WordChain
buildChain = BS.foldr helper emptyChars
  where
    helper :: Char -> WordChain -> WordChain
    helper hd wc@(WC m) = m `seq` WC (m // [(hd, Just wc)])

The uncons call in containsSub creates a new ByteString. Try speeding it up by keeping track of the offset into the string with an index, e.g.:
containsSub' :: ByteString -> WordChain -> Bool
containsSub' str wc = go 0 wc
  where
    len = BS.length str
    go _ None = False
    go i (Rest m)
      | i >= len  = True
      | otherwise = go (i + 1) (m ! BS.index str i)

After much discussion on the #haskell and #ghc IRC channels, I found that the problem was related to this GHC bug: https://ghc.haskell.org/trac/ghc/ticket/1168
The solution was simply to change the definition of doProbs
doProbs m chains = cf `seq` replicateM_ m doProb
...
Or just to compile with -fno-state-hack
GHC's state-hack optimization was causing it to unnecessarily recompute cf (and the associated letterMap) on every call.
So the slowdown had nothing to do with array accesses.
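For illustration, here is a minimal sketch (mine, not from the thread) of the shape of the problem: a partial application whose expensive work the state hack can push under the lambda, together with the same seq workaround.
import Control.Monad (replicateM_)

-- `heavy n` is meant to do its costly work once, sharing `table`
-- across all later calls of the returned function.
heavy :: Int -> Int -> Int
heavy n = let table = sum [1 .. n] -- expensive precomputation
          in \x -> x + table

main :: IO ()
main = do
  let f = heavy 50000000
  -- With -O and the state hack, GHC may eta-expand `f` so that `table`
  -- is recomputed on every call; forcing the partial application first
  -- (or compiling with -fno-state-hack) restores the sharing.
  f `seq` replicateM_ 3 (print (f 1))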

Related

Combining MonadRandom with ST computations in a stack

I'm trying to write the Fisher-Yates shuffle using mutable arrays. So far, I have the following code:
module Main where
import Control.Monad.Random
import Control.Monad.Primitive
import Control.Monad.ST
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as MV
fisherYates :: (MonadRandom m, PrimMonad m) => MV.MVector (PrimState m) a -> m ()
fisherYates v = forM_ [0 .. l - 1] (\i -> do j <- getRandomR (i, l)
                                             MV.swap v i j)
  where l = MV.length v - 1
shuffle :: MonadRandom m => V.Vector a -> m (V.Vector a)
shuffle v = _ -- don't know how to write this
main :: IO ()
main = print . evalRand (shuffle . V.generate 10 $ id) $ mkStdGen 42
However, I am totally unsure how to define shuffle, which is meant to be a 'high-level wrapper' around the mutable vector operations. It seems, at least from my understanding, that I first have to 'run' the random 'part' of the stack, save the state, run the ST 'part' to get out an immutable vector, and then rewrap the whole thing. Additionally, I know I have to make use of thaw somewhere, but my attempts are coming up short. Could someone please tell me what I'm missing, and how I can do what I want?
I have two suggestions for you:
Select the right monad nesting.
Separate out your monad implementation from the logic of the algorithm.
You are trying to run the random monad last and use the ST internally, thus you need the ST to be a sort of monad transformer. Decide what your monad stack looks like - which monad is the transformer and which is the inner monad? The easy thing to do is make the ST monad the inner monad and the random monad the transformer (easy just because you have all the needed packages already).
Now make a small set of helper functions. It won't really pay off here -
the payoff is large for complex projects. Here's the monad stack and helpers I came up with:
{-# LANGUAGE RankNTypes #-}
module Main where
import System.Random (StdGen)
import Control.Monad.Random
import Control.Monad.Primitive
import Control.Monad.ST
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as MV
type KozM s a = RandT StdGen (ST s) a
Notice the transformer is RandT and the inner monad is ST s.
rnd :: (Int, Int) -> KozM s Int
rnd = getRandomR

swp :: MV.MVector s a -> Int -> Int -> KozM s ()
swp v i j = lift (MV.swap v i j)

freeze :: MV.MVector s a -> KozM s (V.Vector a)
freeze = lift . V.freeze

thaw :: V.Vector a -> KozM s (MV.MVector s a)
thaw = lift . V.thaw
All the operations you need to mutate the vector. Now we just need to run this monad so we can somehow escape to another useful context. I noticed you previously hard-coded an RNG (42) - I use a random one but whichever...
run :: (forall s. KozM s a) -> IO a -- Can be just `a` if you hard-code
                                    -- an RNG as done in the question
run m = do
  g <- newStdGen
  pure (runST (evalRandT m g))
Finally we can use this monad to implement Fisher-Yates:
fisherYates :: MV.MVector s a -> KozM s ()
fisherYates v = forM_ [0 .. l - 1] (\i -> do j <- rnd (i, l)
                                             swp v i j)
  where l = MV.length v - 1
At this point you might not feel like you learned anything; hopefully the run function was helpful, but I can see how you might feel this is getting verbose. The important thing to take note of here is how clean the rest of your code can be if you handle the plumbing of the monad above, so you don't have lift and module qualifiers polluting the logic of the possibly complex thing you actually set out to solve.
That said, here's the underwhelming shuffle:
shuffle :: V.Vector a -> KozM s (V.Vector a)
shuffle v = do
vm <- thaw v
fisherYates vm
freeze vm
The type is important. You had previously called evalRand on shuffle which implied it would be some sort of MonadRandom m and simultaneously have to call runST - a conflation of the monad logic and the algorithm logic that just hurts the brain.
The main is uninteresting:
main :: IO ()
main = print =<< (run (shuffle (V.generate 10 id)) :: IO (V.Vector Int))
EDIT: yes you can do this while keeping the methods more general. At some point you need to specify which monad you run or you can't have a main that will execute it, but all the logic can be general using MonadRandom and PrimMonad.
{-# LANGUAGE RankNTypes #-}
module Main where
import System.Random (StdGen)
import Control.Monad.Random
import Control.Monad.Primitive
import Control.Monad.ST
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as MV
type KozM s a = RandT StdGen (ST s) a
rnd :: MonadRandom m => (Int, Int) -> m Int
rnd = getRandomR
swp :: PrimMonad m => MV.MVector (PrimState m) a -> Int -> Int -> m ()
swp v i j = MV.swap v i j
-- freeze :: MV.MVector s a -> KozM s (V.Vector a)
-- thaw :: V.Vector a -> KozM s (MV.MVector s a)
freeze :: PrimMonad m => MV.MVector (PrimState m) a -> m (V.Vector a)
thaw :: PrimMonad m => V.Vector a -> m (MV.MVector (PrimState m) a)
freeze = V.freeze
thaw = V.thaw
-- Some monad libraries, like monadlib, have a generalized `run` class method.
-- This doesn't exist, to the best of my knowledge, for the intersection of ST
-- and mtl.
run :: (forall s. KozM s a) -> IO a -- Can be just `a` if you hard-code
                                    -- an RNG as done in the question
run m = do
  g <- newStdGen
  pure (runST (evalRandT m g))

-- fisherYates :: MV.MVector s a -> KozM s ()
fisherYates :: (MonadRandom m, PrimMonad m) => MV.MVector (PrimState m) a -> m ()
fisherYates v = forM_ [0 .. l - 1] (\i -> do j <- rnd (i, l)
                                             swp v i j)
  where l = MV.length v - 1

shuffle :: (MonadRandom m, PrimMonad m) => V.Vector a -> m (V.Vector a)
shuffle v = do
  vm <- thaw v
  fisherYates vm
  freeze vm

main :: IO ()
main = print =<< (run (shuffle (V.generate 10 id)) :: IO (V.Vector Int))

Haskell: Read a binary file backwards

I'm looking to find the last 32-bit word matching a particular pattern in a uint32 binary dump using Haskell. I am able to complete the task using last; however, the code must trawl through the entire file, so it is rather inefficient.
Is there a simple way to make readFile operate through the file in reverse? I believe this would solve the issue with the smallest change to the current code.
Here is my current code, for reference. I only began with Haskell this weekend, so I am sure it is quite ugly. It looks for the last 32-bit word whose two most significant bits are 0b10.
import System.Environment(getArgs)
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Internal as BL
import qualified Data.ByteString as BS
import Data.Binary.Get
import Data.Word
import Data.Bits
import Text.Printf(printf)
main = do
  args <- getArgs
  let file = args !! 0
  putStrLn $ "Find last 0xCXXXXXXX in " ++ file
  content <- BL.readFile file
  let packets = getPackets content
  putStrLn . show . getValue . last . filterTimes $ packets

-- Data
type Packet = Word32

-- filter where first 2 bits are 10
filterTimes :: [Packet] -> [Packet]
filterTimes = filter ((== 0x2) . tag)

-- get the first 2 bits
tag :: Packet -> Packet
tag rp =
  let tagSize = 2
  in shiftR rp (finiteBitSize rp - tagSize)

-- remove the tag bits
getValue :: Packet -> Packet
getValue =
  let tagSize = 2
      mask = complement $ rotateR (2 ^ tagSize - 1) tagSize
  in (.&.) mask

-- Input
-- Based on https://hackage.haskell.org/package/binary/docs/Data-Binary-Get.html
getPacket :: Get Packet
getPacket = do
  packet <- getWord32le
  return $! packet

getPackets :: BL.ByteString -> [Packet]
getPackets input0 = go decoder input0
  where
    decoder = runGetIncremental getPacket
    go :: Decoder Packet -> BL.ByteString -> [Packet]
    go (Done leftover _consumed packet) input =
      packet : go decoder (BL.chunk leftover input)
    go (Partial k) input =
      go (k . takeHeadChunk $ input) (dropHeadChunk input)
    go (Fail _leftover _consumed msg) _input =
      []

takeHeadChunk :: BL.ByteString -> Maybe BS.ByteString
takeHeadChunk lbs =
  case lbs of
    (BL.Chunk bs _) -> Just bs
    _ -> Nothing

dropHeadChunk :: BL.ByteString -> BL.ByteString
dropHeadChunk lbs =
  case lbs of
    (BL.Chunk _ lbs') -> lbs'
    _ -> BL.Empty
Some comments on your code:
You are using last, which could throw an exception. You should use lastMay from the safe package, which returns a Maybe.
Since you are just treating the file as a vector of Word32s, I don't think it's worth using Data.Binary.Get and the associated overhead and complexity that it entails. Just treat the file as a (perhaps lazy) ByteString and access every 4th byte or break it up into 4-byte substrings.
You can have a look at code which uses ByteStrings here. It implements the following approaches to the problem:
Read in the entire file as a lazy ByteString and produce a (lazy) list of 4-byte substrings. Return the last substring which satisfies the criteria.
intoWords :: BL.ByteString -> [BL.ByteString]
intoWords bs
  | BL.null a = []
  | otherwise = a : intoWords b
  where (a, b) = BL.splitAt 4 bs

-- find by breaking the file into 4-byte words
find_C0_v1 :: FilePath -> IO (Maybe BL.ByteString)
find_C0_v1 path = do
  contents <- BL.readFile path
  return $ lastMay . filter (\bs -> BL.index bs 0 == 0xC0) . intoWords $ contents
Read in the entire file as a lazy ByteString and access every 4th byte looking for a 0xC0. Return the last occurrence.
-- find by looking at every 4th byte
find_C0_v2 :: FilePath -> IO (Maybe BL.ByteString)
find_C0_v2 path = do
  contents <- BL.readFile path
  size <- fmap fromIntegral $ withFile path ReadMode hFileSize
  let wordAt i = BL.take 4 . BL.drop i $ contents
  return $ fmap wordAt $ lastMay $ filter (\i -> BL.index contents i == 0xC0) [0, 4 .. size - 1]
Read the file in backwards in chunks of 64K. Within each chunk (which is a strict ByteString) access every 4th byte looking for a 0xC0 starting from the end of the chunk. Return the first occurrence.
-- read a file backwards until a predicate returns a Just value
loopBlocks :: Int -> Handle -> Integer -> (BS.ByteString -> Integer -> Maybe a) -> IO (Maybe a)
loopBlocks blksize h top pred
  | top <= 0 = return Nothing
  | otherwise = do
      let offset = top - fromIntegral blksize
      hSeek h AbsoluteSeek offset
      blk <- BS.hGet h blksize
      case pred blk offset of
        Nothing -> loopBlocks blksize h offset pred
        x -> return x
-- find by reading backwards, looking at every 4th byte
find_C0_v3 :: FilePath -> IO (Maybe Integer)
find_C0_v3 path =
  withFile path ReadMode $ \h -> do
    size <- hFileSize h
    let top = size - (mod size 4)
        blksize = 64 * 1024 :: Int
    loopBlocks blksize h top $ \blk offset ->
      fmap ((+ offset) . fromIntegral) $ headMay $ filter (\i -> BS.index blk i == 0xC0) [blksize - 4, blksize - 8 .. 0]
The third method is the fastest even if it has to read in the entire file. The first method actually works pretty well. I wouldn't recommend the second at all - its performance degrades precipitously as the file size grows.
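(For completeness, a minimal driver to exercise the three variants; the main and getArgs plumbing here are my addition, not part of the original answer, and assume the three functions above are in scope.)
import System.Environment (getArgs)

main :: IO ()
main = do
  [path] <- getArgs
  find_C0_v1 path >>= print   -- lazy 4-byte substrings
  find_C0_v2 path >>= print   -- every 4th byte of one lazy ByteString
  find_C0_v3 path >>= print   -- 64K blocks, scanned backwards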
For any others who may be interested, I have adapted @ErikR's answer. This solution follows his proposed solution 3, but makes use of my existing code by stepping through blocks in reverse lazily.
This requires a few extra imports:
import System.IO
import Safe
import Data.Maybe
main becomes:
main = do
  args <- getArgs
  let file = args !! 0
  putStrLn $ "Find last 0xCXXXXXXX in " ++ file
  -- forward
  withFile file ReadMode $ \h -> do
    content <- BL.hGetContents h
    let packets = getPackets content
    putStrLn . show . getValue . last . filterTimes $ packets
  -- reverse
  withFile file ReadMode $ \h -> do
    size <- hFileSize h
    let blksize = 64 * 1024 :: Int
    chunks <- makeReverseChunks blksize h (fromIntegral size)
    putStrLn . show . getValue . (fromMaybe 0) . headMay . catMaybes . (map $ lastMay . filterTimes . getPackets) $ chunks
With an added helper function:
-- create list of data chunks, backwards in order through the file
makeReverseChunks :: Int -> Handle -> Int -> IO [BL.ByteString]
makeReverseChunks blksize h top
  | top == 0 = return []
  | top < 0 = error "negative file index"
  | otherwise = do
      let offset = max (top - fromIntegral blksize) 0
      hSeek h AbsoluteSeek (fromIntegral offset)
      blk <- BL.hGet h blksize
      rest <- makeReverseChunks blksize h offset
      return $ blk : rest
Here is a variation of the function makeReverseChunks. As written it is quite strict; moreover, the use of lazy bytestrings does not help if one keeps the blksize low enough. In order to achieve lazy reading one must use unsafeInterleaveIO. Here is a solution using strict bytestrings and lazy IO:
import qualified Data.ByteString as SBS
import System.IO.Unsafe (unsafeInterleaveIO)

-- create list of data chunks, backwards in order through the file
makeReverseChunks :: Int -> Handle -> Int -> IO [SBS.ByteString]
makeReverseChunks blksize h top
  | top == 0 = return []
  | top < 0 = error "negative file index"
  | otherwise = do
      let offset = max (top - fromIntegral blksize) 0
      hSeek h AbsoluteSeek (fromIntegral offset)
      blk <- SBS.hGet h blksize
      rest <- unsafeInterleaveIO $ makeReverseChunks blksize h offset
      return $ blk : rest

How to populate a 2D Array in Haskell

I am currently working on populating a grid in Haskell, defined as type Grid = UArray (Int, Int) Bool, where the two Ints are the x and y coordinates. I am trying to turn this into a 10-by-10 data structure where the values are Boolean. I can get the values as shown below; the listArray function is part of the Data.Array.Unboxed package.
import System.IO
import Control.Monad
import Data.Array.Unboxed
import Data.List (unfoldr)

type Board = UArray (Int, Int) Bool

progress :: Int -> Int -> Board -> Board
progress width height previous =
  listArray b (map f (range b))
  where b@((y1, x1), (y2, x2)) = bounds previous
        -- some basic math functions performed here

total :: [Bool] -> Int
total = length . filter id

board :: [String] -> (Int, Int, Board)
board l = (width, height, a)
  where (width, height) = (length $ head l, length l)
        a = listArray ((1, 1), (height, width)) $ concatMap f l
        f = map g
        g 'f' = False
        g _ = True

printGrid :: Int -> Board -> IO ()
printGrid width = mapM_ f . split width . elems
  where f = putStrLn . map g
        g False = 'f'
        g _ = 't'

split :: Int -> [a] -> [[a]]
split n = takeWhile (not . null) . unfoldr (Just . splitAt n)

-- used only as a test trying to create same structure programmatically
testBoard = board
  ["ffffffffff",
   "ffffffffff",
   "ffffffffff",
   "ffffffffff",
   "ffffffffff",
   "ffffffffff",
   "ffffffffff",
   "ffffffffff",
   "ffffffffff",
   "ffffffffff"]

printProgress :: Int -> (Int, Int, Board) -> IO ()
printProgress n (width, height, g) = mapM_ f $ take n $ iterate (progress width height) g
  where f g = do
          putStrLn "------------------------------"
          printGrid width g

main :: IO ()
main = do
  printProgress times testBoard
You can make a UArray using the array function or the listArray function. The first expects a list of indices and elements while the second expects a tuple indicating the index range and a list of elements that are paired with an index automatically (lexicographic order).
Data.Array.Unboxed> listArray ((0,0), (9,9)) (replicate 100 False) :: UArray (Int,Int) Bool
array ((0,0),(9,9)) [((0,0),False),((0,1),False),((0,2),False),((0,3),False),((0,4),False),((0,5),False),((0,6),False),((0,7),False),((0,8),False),((0,9),False),((1,0),False),((1,1),False),((1,2),False),((1,3),False),((1,4),False),((1,5),False),((1,6),False),((1,7),False),((1,8),False),((1,9),False),((2,0),False),((2,1),False),((2,2),False),((2,3),False),((2,4),False),((2,5),False),((2,6),False),((2,7),False),((2,8),False),((2,9),False),((3,0),False),((3,1),False),((3,2),False),((3,3),False),((3,4),False),((3,5),False),((3,6),False),((3,7),False),((3,8),False),((3,9),False),((4,0),False),((4,1),False),((4,2),False),((4,3),False),((4,4),False),((4,5),False),((4,6),False),((4,7),False),((4,8),False),((4,9),False),((5,0),False),((5,1),False),((5,2),False),((5,3),False),((5,4),False),((5,5),False),((5,6),False),((5,7),False),((5,8),False),((5,9),False),((6,0),False),((6,1),False),((6,2),False),((6,3),False),((6,4),False),((6,5),False),((6,6),False),((6,7),False),((6,8),False),((6,9),False),((7,0),False),((7,1),False),((7,2),False),((7,3),False),((7,4),False),((7,5),False),((7,6),False),((7,7),False),((7,8),False),((7,9),False),((8,0),False),((8,1),False),((8,2),False),((8,3),False),((8,4),False),((8,5),False),((8,6),False),((8,7),False),((8,8),False),((8,9),False),((9,0),False),((9,1),False),((9,2),False),((9,3),False),((9,4),False),((9,5),False),((9,6),False),((9,7),False),((9,8),False),((9,9),False)]
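For comparison, the same all-False grid built with the array function, which takes explicit (index, element) pairs (the output is identical to the above):
Data.Array.Unboxed> array ((0,0), (9,9)) [((i,j), False) | i <- [0..9], j <- [0..9]] :: UArray (Int,Int) Bool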
Going one step further, you can parse your input [String] into a board by simply concatenating and mapping a test for == 't':
stringsToGrid :: [String] -> Grid
stringsToGrid = listArray ((0,0), (9,9)) . map (== 't') . concat
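A quick GHCi sanity check (hypothetical, assuming the Grid type and stringsToGrid are in scope):
Data.Array.Unboxed> stringsToGrid (replicate 10 "ffffffffff") ! (3,4)
False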

What is the best way to store and sort rectangular data in Haskell?

I have a handful of ASCII files containing around 17 million lines in total, and within each/most lines is a fixed 36-byte identifier. So my data is rectangular: I have a lot of rows of fixed width. Using Haskell, I want to read all the lines in, use a regex to extract the identifier (I'm fine up to there), then sort them and count the number of unique identifiers (very close to grep | sort | uniq). (I'm already parallelising by reading from each file in parallel.) Sounds like a simple problem, but...
I'm finding it hard to get decent performance out of this problem, even before the sorting stage. Here's as far as I've got. String is overkill for 36-bytes of ASCII, so I figured on using ByteString. But a (linked) list of size 17 million seems like a bad idea, so I tried IOVector ByteString. But this seems to be quite slow. I believe the garbage collection is suffering as I retain millions of small ByteStrings in the vector: the GC is taking at least 3 times as long as the code (according to +RTS -s) and I think it only gets worse as the program keeps running.
I was thinking that I should maybe use Repa or some sort of single giant ByteString/IOVector Char8/whatever (since I know the exact width of each row is 36) to store the data in one massive rectangular array for each thread, which should alleviate the GC problem. However, I do still need to sort the rows afterwards, and Repa seems to have no support for sorting, and I don't want to be writing sort algorithms myself. So I don't know how to have a giant rectangular array and yet still sort it.
Suggestions for libraries to use, GC flags to set, or anything else? The machine has 24 cores and 24GB of RAM, so I'm not constrained on hardware. I want to remain in Haskell because I have lots of related code (that is also parsing the same files and producing summary statistics) that is working fine, and I don't want to rewrite it.
I believe the garbage collection is suffering as I retain millions of small ByteStrings in the vector
Suspicious: merely retaining ByteStrings should not be driving up GC time, since retained objects are not collected. Maybe there is excessive data copying somewhere in your code?
ByteString is a header (8 bytes) + ForeignPtr Word8 ref (8 bytes) + Int offset (4 bytes) + Int length (4 bytes)
ForeignPtr is a header (8 bytes) + Addr# (8 bytes) + PlainPtr ref (8 bytes)
PlainPtr is a header (8 bytes) + MutableByteArray# ref (8 bytes)
(Revised according to https://stackoverflow.com/a/3256825/648955)
All in all, ByteString overhead is at least (8+8+4+4) + (8+8+8) + (8+8) = 64 bytes (correct me if some fields are shared).
Write your own data management: big flat Word8 array and adhoc offset wrapper
newtype ByteId = ByteId { offset :: Word64 }
with Ord instance.
Overhead would be 8 bytes per identifier. Store offsets in unboxed Vector. Sort with something like this: http://hackage.haskell.org/packages/archive/vector-algorithms/0.5.4.2/doc/html/Data-Vector-Algorithms-Intro.html#v:sort
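For instance, a minimal sketch of that layout (the names sortOffsets and rowLen are illustrative, not from this answer, and it uses plain Int offsets rather than the Word64 newtype for simplicity): keep every row in one flat ByteString and sort a vector of offsets by comparing the 36-byte slices they point at.
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Algorithms.Intro as Intro
import Control.Monad.ST (runST)

rowLen :: Int
rowLen = 36

-- Sort offsets by comparing the fixed-width slices they point at;
-- only the offsets move, the 36-byte rows are never copied.
sortOffsets :: B.ByteString -> U.Vector Int -> U.Vector Int
sortOffsets blob offs = runST $ do
  mv <- U.thaw offs
  Intro.sortBy (\i j -> compare (slice i) (slice j)) mv
  U.freeze mv
  where slice i = B.take rowLen (B.drop i blob)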
The Array family of types has built-in support for multi-dimensional arrays. The indices can be anything with an Ix instance, in particular for your case (Int, Int). It also does not support sorting, unfortunately.
But for your use case, do you really need sorting? If you have a map from identifiers to Int you can just increase the count as you go, and later select all keys with value 1. You can check out the bytestring-trie package, although for your use case it suggests using a hashmap.
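For illustration, a small sketch of that counting approach using Data.HashMap.Strict from unordered-containers (countIds and seenOnce are my names, not from the answer):
import qualified Data.ByteString as B
import qualified Data.HashMap.Strict as HM
import Data.List (foldl')

-- Count how often each identifier occurs.
countIds :: [B.ByteString] -> HM.HashMap B.ByteString Int
countIds = foldl' (\m k -> HM.insertWith (+) k 1 m) HM.empty

-- Identifiers that occur exactly once.
seenOnce :: HM.HashMap B.ByteString Int -> [B.ByteString]
seenOnce = HM.keys . HM.filter (== 1)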
Another algorithm would be to carry two sets (e.g. HashSet), one with identifiers seen exactly once, and one with identifiers seen more than once, and you update these sets while you go through the list.
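A sketch of that two-set pass, again with assumed helper names, using Data.HashSet:
import qualified Data.ByteString as B
import qualified Data.HashSet as HS
import Data.List (foldl')

-- Walk the identifiers once, keeping a set of those seen exactly once
-- and a set of those seen more than once.
seenExactlyOnce :: [B.ByteString] -> HS.HashSet B.ByteString
seenExactlyOnce = fst . foldl' step (HS.empty, HS.empty)
  where
    step (once, more) x
      | x `HS.member` more = (once, more)
      | x `HS.member` once = (HS.delete x once, HS.insert x more)
      | otherwise          = (HS.insert x once, more)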
Also, how do you read your file: If you read it as one large ByteString and carefully construct the small ByteString objects from it, they will actually be just pointers into the big chunk of memory with the large file, possibly eliminating your GC problems. But to assess that we’d need to see your code.
There are a couple of wrappers around mmap available that can give you either Ptrs to data in your file or a big ByteString. A ByteString is really just a pointer,offset,length tuple; splitting that big ByteString into a bunch of small ones is really just making a bunch of new tuples that point to subsets of the big one. Since you say each record is at a fixed offset in the file, you should be able to create a bunch of new ones without actually accessing any of the file at all via ByteString's take.
I don't have any good suggestions for the sorting part of the problem, but avoiding the copying of the file data in the first place ought to be a good start.
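As a sketch of that zero-copy slicing (recordsOf is a hypothetical helper, not from the answer): B.take and B.drop are O(1) and share the underlying buffer.
import qualified Data.ByteString as B

-- Split a ByteString into fixed-width records without copying any bytes:
-- each result is just a (pointer, offset, length) view into the original.
recordsOf :: Int -> B.ByteString -> [B.ByteString]
recordsOf w bs
  | B.length bs < w = []
  | otherwise       = B.take w bs : recordsOf w (B.drop w bs)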
A trie should work. This code takes 45 mins on a file of 18 million lines, 6 million unique keys, on a dual-core laptop with 4 gig RAM:
--invoked: test.exe +RTS -K3.9G -c -h
import qualified Data.ByteString.Char8 as B
import qualified Data.Trie as T

file = "data.txt"

main = ret >>= print

ret = do -- retrieve the data
  ls <- B.readFile file >>= return . B.lines
  trie <- return $ tupleUp ls
  return $ T.size trie

tupleUp :: [B.ByteString] -> T.Trie Int
tupleUp l = foldl f T.empty l

f acc str = case T.lookup str acc of
  Nothing -> T.insert str 1 acc
  Just n  -> T.adjust (+1) str acc
Here's the code used to generate the data file (6MM keys, then 3 copies into 1 file to get the 18MM keys):
import qualified Data.ByteString.Char8 as BS
import System.Random
import Data.List.Split

file = "data.txt"
numLines = 6e6 --17e6
chunkSize = 36
charSet = ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9']

-- generate the file
gen = do
  randgen <- getStdGen
  dat <- return $ t randgen
  writeFile file (unlines dat)

t gen = take (ceiling numLines) $ charChunks
  where
    charChunks = chunksOf chunkSize chars
    chars = map (charSet !!) rands
    rands = randomRs (0, (length charSet) - 1) gen

main = gen
So, how fast can we be? Let's do some tests with a file generated by @ja.'s code:
cat data.txt > /dev/null
--> 0.17 seconds
The same in Haskell?
import qualified Data.ByteString as B
f = id
main = B.readFile "data.txt" >>= return . f >>= B.putStr
Timing?
time ./Test > /dev/null
--> 0.32 seconds
Takes twice as long, but I suppose it's not too bad. I'm using a strict bytestring because we want to chunk it up in a second.
Next, can we use Vector or is it too slow? Let's build a Vector of chunks and put them back together again. I use the blaze-builder package for optimized output.
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Vector as V
import qualified Blaze.ByteString.Builder as BB
import Data.Monoid

recordLen = 36
lineEndingLen = 2 -- Windows! change to 1 for Unix

numRecords = (`div` (recordLen + lineEndingLen)) . B.length

substr idx len = B.take len . B.drop idx

recordByIdx idx = substr (idx * (recordLen + lineEndingLen)) recordLen

mkVector :: B.ByteString -> V.Vector B.ByteString
mkVector bs = V.generate (numRecords bs) (\i -> recordByIdx i bs)

mkBS :: V.Vector B.ByteString -> L.ByteString
mkBS = BB.toLazyByteString . V.foldr foldToBS mempty
  where foldToBS :: B.ByteString -> BB.Builder -> BB.Builder
        foldToBS = mappend . BB.fromWrite . BB.writeByteString

main = B.readFile "data.txt" >>= return . mkBS . mkVector >>= L.putStr
How long does it take?
time ./Test2 > /dev/null
--> 1.06 seconds
Not bad at all! Even though you are using a regex to read the lines instead of my fixed chunk positions, we can still conclude that you can put your chunks in a Vector with no serious performance hits.
What's left? Sorting. Theoretically a bucket sort should be an ideal algorithm for this kind of problem. I tried implementing one myself:
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Vector as V
import qualified Data.Vector.Generic.Mutable as MV
import qualified Blaze.ByteString.Builder as BB
import Data.Monoid
import Control.Monad.ST
import Control.Monad.Primitive

recordLen = 36
lineEndingLen = 2 -- Windows! change to 1 for Unix

numRecords = (`div` (recordLen + lineEndingLen)) . B.length

substr idx len = B.take len . B.drop idx

recordByIdx idx = substr (idx * (recordLen + lineEndingLen)) (recordLen + lineEndingLen)

mkVector :: B.ByteString -> V.Vector B.ByteString
mkVector bs = V.generate (numRecords bs) (\i -> recordByIdx i bs)

mkBS :: V.Vector B.ByteString -> L.ByteString
mkBS = BB.toLazyByteString . V.foldr foldToBS mempty
  where foldToBS :: B.ByteString -> BB.Builder -> BB.Builder
        foldToBS = mappend . BB.fromWrite . BB.writeByteString

bucketSort :: Int -> V.Vector B.ByteString -> V.Vector B.ByteString
bucketSort chunkSize v = runST $ emptyBuckets >>= \bs ->
                         go v bs lastIdx (chunkSize - 1)
  where
    lastIdx = V.length v - 1

    emptyBuckets :: ST s (V.MVector (PrimState (ST s)) [B.ByteString])
    emptyBuckets = V.thaw $ V.generate 256 (const [])

    go :: V.Vector B.ByteString
       -> V.MVector (PrimState (ST s)) [B.ByteString]
       -> Int -> Int -> ST s (V.Vector B.ByteString)
    go v _ _ (-1) = return v
    go _ buckets (-1) testIdx = do
      v' <- unbucket buckets
      bs <- emptyBuckets
      go v' bs lastIdx (testIdx - 1)
    go v buckets idx testIdx = do
      let testChunk = v V.! idx
          testByte = fromIntegral $ testChunk `B.index` testIdx
      b <- MV.read buckets testByte
      MV.write buckets testByte (testChunk : b)
      go v buckets (idx - 1) testIdx

    unbucket :: V.MVector (PrimState (ST s)) [B.ByteString]
             -> ST s (V.Vector B.ByteString)
    unbucket v = do
      v' <- V.freeze v
      return . V.fromList . concat . V.toList $ v'

main = B.readFile "data.txt" >>= return . process >>= L.putStr
  where process = mkBS .
                  bucketSort recordLen .
                  mkVector
Testing it gave a time of about 1:50 minutes, which is probably acceptable. We are talking of an O(c*n) algorithm for n in the range of some millions and a constant c of 36*something. But I'm sure you can optimize it further.
Or you can just use the vector-algorithms package. Testing with a heap sort:
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Vector as V
import qualified Blaze.ByteString.Builder as BB
import Data.Vector.Algorithms.Heap (sort)
import Data.Monoid
import Control.Monad.ST

recordLen = 36
lineEndingLen = 2 -- Windows! change to 1 for Unix

numRecords = (`div` (recordLen + lineEndingLen)) . B.length

substr idx len = B.take len . B.drop idx

recordByIdx idx = substr (idx * (recordLen + lineEndingLen)) (recordLen + lineEndingLen)

mkVector :: B.ByteString -> V.Vector B.ByteString
mkVector bs = V.generate (numRecords bs) (\i -> recordByIdx i bs)

mkBS :: V.Vector B.ByteString -> L.ByteString
mkBS = BB.toLazyByteString . V.foldr foldToBS mempty
  where foldToBS :: B.ByteString -> BB.Builder -> BB.Builder
        foldToBS = mappend . BB.fromWrite . BB.writeByteString

sortIt v = runST $ do
  mv <- V.thaw v
  sort mv
  V.freeze mv

main = B.readFile "data.txt" >>= return . process >>= L.putStr
  where process = mkBS .
                  sortIt .
                  mkVector
This does the job in about 1:20 minutes on my machine, so right now it's faster than my bucket sort. Both of the final solutions consume something in the range of 1-1.2 GB of RAM.
Good enough?

Is there a function in haskell working like a mixture of accumArray and foldr?

Let me call the function accumrArray.

accumrArray
  :: (e' -> e -> e)  -- An accumulating function
  -> e               -- A default element
  -> (i, i)          -- The bounds of the array
  -> [(i, e')]       -- List of associations
  -> a i e           -- The array

accumrArray (:) [] (1,2) [(1,1),(2,2),(2,3)]  ===  array [(1,[1]), (2,[2,3])]
head $ (accumrArray (:) [] (1,1) [(1,x) | x <- [4..]]) ! 1  ===  4
How strange... I wrote this function a few days ago for someone else. The function first appeared in LML (I believe), but never made it into the Haskell array library.
Here you go:
{-# LANGUAGE ScopedTypeVariables #-}
import Data.Array
import System.IO.Unsafe
import Data.IORef
import Data.Array.MArray
import Data.Array.Base
import Control.Monad
import Data.Array.IO
accumArrayR :: forall a e i. Ix i => (a -> e -> e) -> e -> (i,i) -> [(i,a)] -> Array i e
accumArrayR f e bounds@(l,u) assocs = unsafePerformIO $ do
  ref <- newIORef assocs
  arr <- newArray_ bounds
  let _ = arr :: IOArray i e
  let n = safeRangeSize (l,u)
  let elem x = unsafePerformIO $ do
        ass <- readIORef ref
        let loop [] = writeIORef ref [] >> return e
            loop ((y,a):rest) = do
              let ix = safeIndex bounds n y
              let r = f a (elem x)
              unsafeWrite arr ix r
              if ix == x
                then writeIORef ref rest >> return r
                else loop rest
        loop ass
  forM_ [0..n] $ \ix -> unsafeWrite arr ix (elem ix)
  unsafeFreeze arr
A challenge for the reader: use accumArrayR to implement linear-time depth-first-search of a graph.
Edit I should mention that the function isn't thread-safe as written. Turning the IORef into an MVar would fix it, but there might be better ways.
Not the most efficient, but...
accumrArray f x b l = accumArray (flip f) x b (reverse l)
I would argue that
accumrArray f x b l = accumArray (flip f) x b (reverse l)
is indeed the best solution (credits to sclv's answer).
Its supposed "inefficiency" comes from the fact that foldr applies the function f from right to left.
However, since accumArray is strict, l can never be infinite; otherwise the program would never terminate.
Therefore, foldl (flip f) is just as good as a foldr.
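A quick GHCi check of the finite example from the question confirms the reverse-based definition:
ghci> accumArray (flip (:)) [] (1,2) (reverse [(1,1),(2,2),(2,3)])
array (1,2) [(1,[1]),(2,[2,3])]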
