I'm coding a knight's tour function, and this is as far as I've got. It appears to loop forever in GHCi:
type Field = (Int, Int)
nextPositions:: Int -> Field -> [Field]
nextPositions n (x,y) = filter onBoard
    [(x+2,y-1),(x+2,y+1),(x-2,y-1),(x-2,y+1),(x+1,y-2),(x+1,y+2),(x-1,y-2),(x-1,y+2)]
  where onBoard (x,y) = x `elem` [1..n] && y `elem` [1..n]
type Path = [Field]
knightTour :: Int -> Field -> [Path]
knightTour n start = [posi:path | (posi,path) <- tour (n*n)]
  where tour 1 = [(start, [])]
        tour k = [(posi', posi:path) | (posi, path) <- tour (k-1), posi' <- (filter (`notElem` path) (nextPositions n posi))]
For example, knightTour 10 (4,4) never produces any output!
Any advice?
I think one of the main problems is checking whether you have already visited a square. This takes too much time. You should look for a data structure that makes that check more efficient.
For small boards, for example up to 8×8, you can use a 64-bit integer for that. A 64-bit word can be seen as 64 booleans, each of which records whether the knight has already visited the corresponding square.
We can thus implement this with:
{-# LANGUAGE BangPatterns #-}
import Data.Bits(testBit, setBit)
import Data.Word(Word64)
testPosition :: Int -> Word64 -> (Int, Int) -> Bool
testPosition !n !w (!r, !c) = testBit w (n*r + c)
setPosition :: Int -> (Int, Int) -> Word64 -> Word64
setPosition !n (!r, !c) !w = setBit w (n*r + c)
nextPositions :: Int -> Word64 -> (Int, Int) -> [(Int, Int)]
nextPositions !n !w (!x, !y) = [ c
    | c@(x', y') <- [(x-1,y-2), (x-1,y+2), (x+1,y-2), (x+1,y+2), (x-2,y-1), (x-2,y+1), (x+2,y-1), (x+2,y+1)]
    , x' >= 0
    , y' >= 0
    , x' < n
    , y' < n
    , not (testPosition n w c)
    ]
knightTour :: Int -> (Int, Int) -> [[(Int, Int)]]
knightTour n p0 = go (n*n-1) (setPosition n p0 0) p0
  where go 0 _ _ = [[]]
        go !k !w !ps = [
            (ps':rs)
          | ps' <- nextPositions n w ps
          , rs <- go (k-1) (setPosition n ps' w) ps'
          ]
main = print (knightTour 6 (1,1))
If I compile this with the -O2 flag and run this locally for a 5×5 board where the knight starts at (1,1), all the solutions are generated in 0.32 seconds. For a 6×6 board, it takes 2.91 seconds to print the first solution, but it takes forever to find all solutions that start at (1,1). For an 8×8 board, the first solution was found in 185.76 seconds:
[(0,3),(1,5),(0,7),(2,6),(1,4),(0,2),(1,0),(2,2),(3,0),(4,2),(3,4),(4,6),(5,4),(6,2),(5,0),(3,1),(2,3),(3,5),(2,7),(0,6),(2,5),(1,3),(0,1),(2,0),(3,2),(2,4),(0,5),(1,7),(3,6),(4,4),(5,6),(7,7),(6,5),(7,3),(6,1),(4,0),(5,2),(7,1),(6,3),(7,5),(6,7),(5,5),(4,7),(6,6),(7,4),(5,3),(7,2),(6,0),(4,1),(3,3),(2,1),(0,0),(1,2),(0,4),(1,6),(3,7),(4,5),(5,7),(7,6),(6,4),(4,3),(5,1),(7,0)]
It is however not a good idea to solve this with a brute-force approach. If we assume an average branching factor of ~6 moves, then for a 6×6 board we already have about 6^36 ≈ 1.031×10^28 possible sequences to examine.
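To make the arithmetic behind that estimate concrete, here is a small sketch. The names sequencesBound and estimate are made up for this illustration, and the exponent 36 is simply the one that reproduces the figure quoted above.

```haskell
-- Rough upper bound on the number of move sequences a brute-force search
-- may have to examine: branching factor raised to the number of moves.
-- (Hypothetical helper, not part of the solver above.)
sequencesBound :: Integer -> Integer -> Integer
sequencesBound branching moves = branching ^ moves

-- The figure quoted in the text corresponds to 6^36.
estimate :: Double
estimate = fromIntegral (sequencesBound 6 36)
```

Evaluating estimate in GHCi gives a value a little above 1.03e28, which is why enumerating every sequence is hopeless even on a 6×6 board.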
It is better to work with a divide-and-conquer approach. It is easy to split a board like 8×8 into four 4×4 boards. You then determine the places where you can hop from one board to another, and solve the subproblems for a 4×4 board. For small boards, you can easily store the solutions for going from any square to any other square on a 4×4 board and reuse them for all quadrants. That saves computational effort because nothing is calculated a second time, especially since symmetrical queries do not need to be stored multiple times. If you know how to go from (1,0) to (2,3) on a 4×4 board, you can easily use this to go from (2,0) to (1,3) on the same board, just by mirroring the path.
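The symmetry argument can be sketched as follows. This is only an illustration of reusing a stored route through a left-right reflection, not a full divide-and-conquer solver; Square, mirrorSquare, mirrorPath, isKnightMove and validPath are names invented here, and coordinates are assumed to be 0-based.

```haskell
type Square = (Int, Int)

-- Reflect a square left-to-right on an n×n board (0-based coordinates).
mirrorSquare :: Int -> Square -> Square
mirrorSquare n (x, y) = (n - 1 - x, y)

-- Reflect every square of a stored route.
mirrorPath :: Int -> [Square] -> [Square]
mirrorPath n = map (mirrorSquare n)

-- True when two squares are exactly one knight move apart.
isKnightMove :: Square -> Square -> Bool
isKnightMove (x1, y1) (x2, y2) =
  (abs (x1 - x2), abs (y1 - y2)) `elem` [(1, 2), (2, 1)]

-- A route is valid when every consecutive pair is a knight move.
validPath :: [Square] -> Bool
validPath p = and (zipWith isKnightMove p (drop 1 p))
```

For instance, the route [(1,0),(0,2),(2,3)] on a 4×4 board is mirrored by mirrorPath 4 into [(2,0),(3,2),(1,3)], and validPath accepts both, so only one of the two needs to be stored.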
I'm doing this Car Game problem on Kattis: https://open.kattis.com/problems/cargame
There's a five-second time limit, but on the last instance, my code requires longer to run. I'm fairly sure I'm doing the right thing (from a big-O standpoint) so now I need to optimize it somehow.
I downloaded the test data from:
http://challenge.csc.kth.se/2013/challenge-2013.tar.bz2
From profiling, it seems that most of the running time is spent in containsSub, which is nothing more than an array access together with a tail-recursive call. Furthermore, it's only called about 100M times, so why does it take 6.5 seconds to run? (That is 6.5 s on my laptop; I've found Kattis is generally about twice as slow, so it is probably more like 13 seconds there.) On the statistics page, some of the C++ solutions run in under a second. Even some Python solutions just barely make it under the 5-second bar.
module Main where
import Control.Monad
import Data.Array (Array, (!), (//))
import qualified Data.Array as Array
import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString.Char8 as BS
import Data.Char
import Data.List
import Data.Maybe
main::IO()
main = do
  [n, m] <- readIntsLn
  dictWords <- replicateM n BS.getLine
  let suffixChains = map (\w -> (w, buildChain w)) dictWords
  replicateM_ m $ findChain suffixChains
noWordMsg :: ByteString
noWordMsg = BS.pack "No valid word"
findChain :: [(ByteString, WordChain)] -> IO ()
findChain suffixChains = do
  chrs <- liftM (BS.map toLower) BS.getLine
  BS.putStrLn
    (
      case find (containsSub chrs . snd) suffixChains of
        Nothing -> noWordMsg
        Just (w, _) -> w
    )
readAsInt :: BS.ByteString -> Int
readAsInt = fst . fromJust . BS.readInt
readIntsLn :: IO [Int]
readIntsLn = liftM (map readAsInt . BS.words) BS.getLine
data WordChain = None | Rest (Array Char WordChain)
emptyChars :: WordChain
emptyChars = Rest . Array.listArray ('a', 'z') $ repeat None
buildChain :: ByteString -> WordChain
buildChain s =
  case BS.uncons s of
    Nothing -> emptyChars
    Just (hd, tl) ->
      let wc@(Rest m) = buildChain tl in
      Rest $ m // [(hd, wc)]
containsSub :: ByteString -> WordChain -> Bool
containsSub _ None = False
containsSub s (Rest m) =
  case BS.uncons s of
    Nothing -> True
    Just (hd, tl) -> containsSub tl (m ! hd)
EDIT: TAKE 2:
I tried building a lazy trie to avoid searching things I'd already searched. So for instance, if I've already encountered a triplet beginning with 'a', then in the future I can skip anything which doesn't contain an 'a'. If I've already searched a triplet beginning 'ab', I can skip anything which doesn't contain 'ab'. And if I've already searched the exact triplet 'abc', I can just return the same result from last time. In theory, this should contribute a significant speedup. In practice the running time is identical.
Furthermore, without the seq's, profiling takes forever and gives bogus results (I couldn't guess why).
With the seqs, profiling says the bulk of the time is spent in forLetter (which is where the array accesses have been moved to, so again it looks like array access is the slow part).
{-# LANGUAGE TupleSections #-}
module Main where
import Control.Monad
import Data.Array (Array, (!), (//))
import qualified Data.Array as Array
import qualified Data.Array.Base as Base
import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString.Char8 as BS
import Data.Char
import Data.Functor
import Data.Maybe
main::IO()
main = do
  [n, m] <- readIntsLn
  dictWords <- replicateM n BS.getLine
  let suffixChainsL = map (\w -> (w, buildChain w)) dictWords
  let suffixChains = foldr seq suffixChainsL suffixChainsL
  suffixChains `seq` doProbs m suffixChains
noWordMsg :: ByteString
noWordMsg = BS.pack "No valid word"
doProbs :: Int -> [(ByteString, WordChain)] -> IO ()
doProbs m chains = replicateM_ m doProb
  where
    cf = findChain chains
    doProb =
      do
        chrs <- liftM (map toLower) getLine
        BS.putStrLn . fromMaybe noWordMsg $ cf chrs
findChain :: [(ByteString, WordChain)] -> String -> Maybe ByteString
findChain [] = const Nothing
findChain suffixChains@(shd : _) = doFind
  where
    letterMap :: Array Char (String -> Maybe ByteString)
    letterMap =
      Array.listArray ('a','z')
        [findChain (mapMaybe (forLetter hd) suffixChains) | hd <- [0..25]]
    endRes = Just $ fst shd
    doFind :: String -> Maybe ByteString
    doFind [] = endRes
    doFind (hd : tl) = (letterMap ! hd) tl
forLetter :: Int -> (ByteString, WordChain) -> Maybe (ByteString, WordChain)
forLetter c (s, WC wc) = (s,) <$> wc `Base.unsafeAt` c
readAsInt :: BS.ByteString -> Int
readAsInt = fst . fromJust . BS.readInt
readIntsLn :: IO [Int]
readIntsLn = liftM (map readAsInt . BS.words) BS.getLine
newtype WordChain = WC (Array Char (Maybe WordChain))
emptyChars :: WordChain
emptyChars = WC . Array.listArray ('a', 'z') $ repeat Nothing
buildChain :: ByteString -> WordChain
buildChain = BS.foldr helper emptyChars
  where
    helper :: Char -> WordChain -> WordChain
    helper hd wc@(WC m) = m `seq` WC (m // [(hd, Just wc)])
The uncons call in containsSub creates a new ByteString. Try speeding it up by keeping track of the offset into the string with an index, e.g.:
containsSub' :: ByteString -> WordChain -> Bool
containsSub' str wc = go 0 wc
  where len = BS.length str
        go _ None = False
        go i (Rest m) | i >= len = True
                      | otherwise = go (i+1) (m ! BS.index str i)
After much discussion on the #haskell and #ghc IRC channels, I found that the problem was related to this GHC bug: https://ghc.haskell.org/trac/ghc/ticket/1168
The solution was simply to change the definition of doProbs:
doProbs m chains = cf `seq` replicateM_ m doProb
...
Alternatively, just compile with -fno-state-hack.
GHC's state hack optimization was causing it to unnecessarily recompute cf (and the associated letterMap) on every call.
So it has nothing to do with array accesses.
I tried to learn how STArray works, but I couldn't. (The documentation is poor, or at least the documentation I found is.)
Anyway, I have the following algorithm, but it uses a lot of !!, which is slow. How can I convert it to use the STArray monad?
-- The Algorithm prints the primes present in [1 .. n]
main :: IO ()
main = print $ primesUpTo 100
type Nat = Int
primesUpTo :: Nat -> [Nat]
primesUpTo n = primesUpToAux n 2 [1]
primesUpToAux :: Nat -> Nat -> [Nat] -> [Nat]
primesUpToAux n current primes =
  if current > n
    then primes
    else primesUpToAux n (current + 1) newAcum
  where newAcum = case isPrime current primes of
          True -> primes++[current]
          False -> primes
isPrime :: Nat -> [Nat] -> Bool
isPrime 1 _ = True
isPrime 2 _ = True
isPrime x neededPrimes = isPrimeAux x neededPrimes 1
isPrimeAux x neededPrimes currentPrimeIndex =
  if sqrtOfX < currentPrime
    then True
    else if mod x currentPrime == 0
      then False
      else isPrimeAux x neededPrimes (currentPrimeIndex + 1)
  where
    sqrtOfX = sqrtNat x
    currentPrime = neededPrimes !! currentPrimeIndex
sqrtNat :: Nat -> Nat
sqrtNat = floor . sqrt . fromIntegral
Edit
Oops, the !! wasn't the problem. In the next version of the algorithm (below) I've removed the use of !!. I also fixed 1 being treated as a prime, which it is not, as pointed out by @pedrorodrigues.
main :: IO ()
main = print $ primesUpTo 20000
type Nat = Int
primesUpTo :: Nat -> [Nat]
primesUpTo n = primesUpToAux n 1 []
primesUpToAux :: Nat -> Nat -> [Nat] -> [Nat]
primesUpToAux n current primesAcum =
  if current > n
    then primesAcum
    else primesUpToAux n (current + 1) newPrimesAcum
  where newPrimesAcum = case isPrime current primesAcum of
          True -> primesAcum++[current]
          False -> primesAcum
isPrime :: Nat -> [Nat] -> Bool
isPrime 1 _ = False
isPrime 2 _ = True
isPrime x neededPrimes =
  if sqrtOfX < currentPrime
    then True
    else if mod x currentPrime == 0
      then False
      else isPrime x restOfPrimes
  where
    sqrtOfX = sqrtNat x
    currentPrime:restOfPrimes = neededPrimes
sqrtNat :: Nat -> Nat
sqrtNat = floor . sqrt . fromIntegral
So this is really two questions:
1.- How do I transform this algorithm to use arrays instead of lists? (This is for the sake of learning how to handle state and arrays in Haskell.)
Somebody already answered this in the comments, but by pointing to an example that is not explained very well.
2.- How do I eliminate the list concatenation that happens every time a new prime is found?
True -> primesAcum++[current]
Here's a more or less direct translation of your code into working with an unboxed array of integers:
import Control.Monad
import Control.Monad.ST
import Data.Array.ST
import Data.Array.Unboxed
import Control.Arrow
main :: IO ()
main = print . (length &&& last) . primesUpTo $ 1299709
primesUpTo :: Int -> [Int]
primesUpTo = takeWhile (> 0) . elems . primesUpToUA
primesUpToUA :: Int -> UArray Int Int
primesUpToUA n = runSTUArray $ do
    let k = floor( 1.25506 * fromIntegral n / log (fromIntegral n)) -- WP formula
    ar <- newArray (0,k+1) 0 -- all zeroes initially, two extra spaces
    let
      loop c i | c > n = return ar -- upper limit is reached
               | otherwise = do -- `i` primes currently in the array
                   b <- isPrime 0 i c -- is `c` a prime?
                   if b then do { writeArray ar i c ; loop (c+1) (i+1) }
                   else loop (c+1) i
      isPrime j i c | j >= i = return True -- `i` primes
                    | otherwise = do -- currently in the array
                        p <- readArray ar j
                        if p*p > c then return True
                        else if rem c p == 0 then return False
                        else isPrime (j+1) i c
    loop 2 0
This is more or less self-explanatory, when you read it slowly, one statement at a time.
Using arrays, there are no problems with list concatenation, as there are no lists. We use the array of primes while we're still adding new items to it.
Of course you can re-write your list-based code to behave better; the simplest re-write is
ps :: [Int]
ps = 2 : [i | i <- [3..],
              and [rem i p > 0 | p <- takeWhile ((<=i).(^2)) ps]]
primesTo n = takeWhile (<= n) ps
The key is to switch from recursive thinking to corecursive thinking - not how to add at the end, explicitly, but to define how a list is to be produced — and let the lazy semantics take care of the rest.
I have a handful of ASCII files containing around 17 million lines in total, and within each/most lines is a fixed 36-byte identifier. So my data is rectangular: I have a lot of rows of fixed width. Using Haskell, I want to read all the lines in, use a regex to extract the identifier (I'm fine up to there), then sort them and count the number of unique identifiers (very close to grep | sort | uniq). (I'm already parallelising by reading from each file in parallel.) Sounds like a simple problem, but...
I'm finding it hard to get decent performance out of this problem, even before the sorting stage. Here's as far as I've got. String is overkill for 36-bytes of ASCII, so I figured on using ByteString. But a (linked) list of size 17 million seems like a bad idea, so I tried IOVector ByteString. But this seems to be quite slow. I believe the garbage collection is suffering as I retain millions of small ByteStrings in the vector: the GC is taking at least 3 times as long as the code (according to +RTS -s) and I think it only gets worse as the program keeps running.
I was thinking that I should maybe use Repa or some sort of single giant ByteString/IOVector Char8/whatever (since I know the exact width of each row is 36) to store the data in one massive rectangular array for each thread, which should alleviate the GC problem. However, I do still need to sort the rows afterwards, and Repa seems to have no support for sorting, and I don't want to be writing sort algorithms myself. So I don't know how to have a giant rectangular array and yet still sort it.
Suggestions for libraries to use, GC flags to set, or anything else? The machine has 24 cores and 24GB of RAM, so I'm not constrained on hardware. I want to remain in Haskell because I have lots of related code (that is also parsing the same files and producing summary statistics) that is working fine, and I don't want to rewrite it.
I believe the garbage collection is suffering as I retain millions of small ByteStrings in the vector
Suspicious. Retained ByteStrings should not be collected. Maybe there is excessive data copying somewhere in your code?
ByteString is a header (8 bytes) + ForeignPtr Word8 ref (8 bytes) + Int offset (4 bytes) + Int length (4 bytes)
ForeignPtr is a header (8 bytes) + Addr# (8 bytes) + PlainPtr ref (8 bytes)
PlainPtr is a header (8 bytes) + MutableByteArray# ref (8 bytes)
(Revised according to https://stackoverflow.com/a/3256825/648955)
All in all, ByteString overhead is at least 64 bytes (correct me if some fields are shared).
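As a worked version of that sum, here is a small sketch that adds up the field sizes exactly as listed above and scales them to the 17 million identifiers mentioned in the question. The names are hypothetical, and the field sizes are taken from the list above rather than measured.

```haskell
-- Per-value overhead in bytes, using the field sizes listed above.
byteStringOverhead :: Int
byteStringOverhead = (8 + 8 + 4 + 4)   -- ByteString: header, ForeignPtr ref, offset, length
                   + (8 + 8 + 8)       -- ForeignPtr: header, Addr#, PlainPtr ref
                   + (8 + 8)           -- PlainPtr: header, MutableByteArray# ref

-- Total bookkeeping overhead for a given number of identifiers.
totalOverhead :: Int -> Int
totalOverhead count = count * byteStringOverhead

-- Payload alone: 36 bytes per identifier.
totalPayload :: Int -> Int
totalPayload count = count * 36
```

Under these assumptions, totalOverhead 17000000 is 1088000000 bytes against a payload of 612000000 bytes, so the wrappers would cost more than the data they describe.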
Write your own data management: a big flat Word8 array and an ad hoc offset wrapper
newtype ByteId = ByteId { offset :: Word64 }
with Ord instance.
Overhead would be 8 bytes per identifier. Store offsets in unboxed Vector. Sort with something like this: http://hackage.haskell.org/packages/archive/vector-algorithms/0.5.4.2/doc/html/Data-Vector-Algorithms-Intro.html#v:sort
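A minimal sketch of that idea, with several substitutions that are mine and not part of the suggestion above: a strict ByteString stands in for the flat Word8 array, the offset is an Int rather than a Word64 (because BS.drop takes an Int), the comparison takes the buffer as an explicit argument instead of going through an Ord instance, and Data.List's sortBy stands in for an unboxed Vector sorted with vector-algorithms.

```haskell
import qualified Data.ByteString.Char8 as BS
import Data.List (sortBy, groupBy)

-- Width of one identifier, as stated in the question.
idWidth :: Int
idWidth = 36

-- An identifier is represented only by its byte offset into the shared buffer.
newtype ByteId = ByteId { offset :: Int }

-- View the 36 bytes at an offset. BS.take and BS.drop share the underlying
-- buffer, so no identifier bytes are copied.
slice :: BS.ByteString -> ByteId -> BS.ByteString
slice buf (ByteId o) = BS.take idWidth (BS.drop o buf)

-- Compare two identifiers by the bytes they point at.
compareIds :: BS.ByteString -> ByteId -> ByteId -> Ordering
compareIds buf a b = compare (slice buf a) (slice buf b)

-- Sort the offsets, then count the runs of equal identifiers.
countUnique :: BS.ByteString -> [ByteId] -> Int
countUnique buf ids = length (groupBy sameId (sortBy (compareIds buf) ids))
  where sameId a b = compareIds buf a b == EQ
```

With a buffer holding three 36-byte records where the first and third are identical, countUnique buf (map ByteId [0, 36, 72]) yields 2. For 17 million rows you would want the unboxed vector and in-place sort suggested above rather than a list.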
The Array family of types has built-in support for multi-dimensional arrays. The indices can be anything with an Ix instance, in particular for your case (Int, Int). It also does not support sorting, unfortunately.
But for your use case, do you really need sorting? If you have a map from identifiers to Int you can just increase the count as you go, and later select all keys with value 1. You can check out the bytestring-trie package, although for your use case it suggests to use hashmap.
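The counting idea could look roughly like this. It is a sketch that uses Data.Map.Strict from containers instead of a trie or hash map, purely so that it needs nothing beyond the packages shipped with GHC; countIds, seenOnce and distinctCount are names made up here.

```haskell
import qualified Data.ByteString.Char8 as BS
import qualified Data.Map.Strict as Map
import Data.List (foldl')

-- Count how often each identifier occurs, in a single pass.
countIds :: [BS.ByteString] -> Map.Map BS.ByteString Int
countIds = foldl' (\m k -> Map.insertWith (+) k 1 m) Map.empty

-- Identifiers that occur exactly once (keys whose count is 1).
seenOnce :: [BS.ByteString] -> [BS.ByteString]
seenOnce = Map.keys . Map.filter (== 1) . countIds

-- Number of distinct identifiers, with no sorting step.
distinctCount :: [BS.ByteString] -> Int
distinctCount = Map.size . countIds
```

The keys come back in ascending order as a side effect of using an ordered map, so a separate sort is not needed if sorted output is wanted.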
Another algorithm would be to carry two sets (e.g. HashSet), one with identifiers seen exactly once, and one with identifiers seen more than once, and you update these sets while you go through the list.
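A sketch of the two-set variant, assuming Data.Set from containers in place of HashSet (so it only needs an Ord instance); Seen, step and classify are hypothetical names.

```haskell
import qualified Data.Set as Set
import Data.List (foldl')

-- First component: identifiers seen exactly once so far.
-- Second component: identifiers seen more than once.
type Seen a = (Set.Set a, Set.Set a)

-- Process one identifier and update both sets.
step :: Ord a => Seen a -> a -> Seen a
step (once, many) x
  | x `Set.member` many = (once, many)
  | x `Set.member` once = (Set.delete x once, Set.insert x many)
  | otherwise           = (Set.insert x once, many)

-- Fold the whole input through step.
classify :: Ord a => [a] -> Seen a
classify = foldl' step (Set.empty, Set.empty)
```

The number of distinct identifiers is then the sum of the two set sizes, and the first set alone answers the question of which identifiers are unique.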
Also, how do you read your file: If you read it as one large ByteString and carefully construct the small ByteString objects from it, they will actually be just pointers into the big chunk of memory with the large file, possibly eliminating your GC problems. But to assess that we’d need to see your code.
There are a couple of wrappers around mmap available that can give you either Ptrs to data in your file or a big ByteString. A ByteString is really just a pointer,offset,length tuple; splitting that big ByteString into a bunch of small ones is really just making a bunch of new tuples that point to subsets of the big one. Since you say each record is at a fixed offset in the file, you should be able to create a bunch of new ones without actually accessing any of the file at all via ByteString's take.
I don't have any good suggestions for the sorting part of the problem, but avoiding the copying of the file data in the first place ought to be a good start.
A trie should work. This code takes 45 mins on a file of 18 million lines, 6 million unique keys, on a dual-core laptop with 4 gig RAM:
--invoked: test.exe +RTS -K3.9G -c -h
import qualified Data.ByteString.Char8 as B
import qualified Data.Trie as T
file = "data.txt"
main = ret >>= print
ret = do -- retrieve the data
  ls <- B.readFile file >>= return.B.lines
  trie <- return $ tupleUp ls
  return $ T.size trie
tupleUp:: [B.ByteString] -> T.Trie Int
tupleUp l = foldl f T.empty l
f acc str = case T.lookup str acc
  of Nothing -> T.insert str 1 acc
     Just n -> T.adjust (+1) str acc
Here's the code used to generate the data file (6MM keys, then 3 copies into 1 file to get the 18MM keys):
import qualified Data.ByteString.Char8 as BS
import System.Random
import Data.List.Split
file = "data.txt"
numLines = 6e6 --17e6
chunkSize = 36
charSet = ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9']
-- generate the file
gen = do
  randgen <- getStdGen
  dat <- return $ t randgen
  writeFile file (unlines dat)
t gen = take (ceiling numLines) $ charChunks
  where
    charChunks = chunksOf chunkSize chars
    chars = map (charSet!!) rands
    rands = randomRs (0,(length charSet) -1) gen
main = gen
So, how fast can we be? Let's do some tests with a file generated by @ja.'s code:
cat data.txt > /dev/null
--> 0.17 seconds
The same in Haskell?
import qualified Data.ByteString as B
f = id
main = B.readFile "data.txt" >>= return . f >>= B.putStr
Timing?
time ./Test > /dev/null
--> 0.32 seconds
Takes twice as long, but I suppose it's not too bad. I'm using a strict ByteString because we want to chunk it up in a second.
Next, can we use Vector or is it too slow? Let's build a Vector of chunks and put them back together again. I use the blaze-builder package for optimized output.
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Vector as V
import qualified Blaze.ByteString.Builder as BB
import Data.Monoid
recordLen = 36
lineEndingLen = 2 -- Windows! change to 1 for Unix
numRecords = (`div` (recordLen + lineEndingLen)) . B.length
substr idx len = B.take len . B.drop idx
recordByIdx idx = substr (idx*(recordLen+lineEndingLen)) recordLen
mkVector :: B.ByteString -> V.Vector (B.ByteString)
mkVector bs = V.generate (numRecords bs) (\i -> recordByIdx i bs)
mkBS :: V.Vector (B.ByteString) -> L.ByteString
mkBS = BB.toLazyByteString . V.foldr foldToBS mempty
  where foldToBS :: B.ByteString -> BB.Builder -> BB.Builder
        foldToBS = mappend . BB.fromWrite . BB.writeByteString
main = B.readFile "data.txt" >>= return . mkBS . mkVector >>= L.putStr
How long does it take?
time ./Test2 > /dev/null
--> 1.06 seconds
Not bad at all! Even though you are using a regex to read the lines instead of my fixed chunk positions, we still can conclude that you can put your chunks in a Vector with no serious performance hits.
What's left? Sorting. Theoretically a bucket sort should be an ideal algorithm for this kind of problem. I tried implementing one myself:
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Vector as V
import qualified Data.Vector.Generic.Mutable as MV
import qualified Blaze.ByteString.Builder as BB
import Data.Monoid
import Control.Monad.ST
import Control.Monad.Primitive
recordLen = 36
lineEndingLen = 2 -- Windows! change to 1 for Unix
numRecords = (`div` (recordLen + lineEndingLen)) . B.length
substr idx len = B.take len . B.drop idx
recordByIdx idx = substr (idx*(recordLen+lineEndingLen)) (recordLen+lineEndingLen)
mkVector :: B.ByteString -> V.Vector (B.ByteString)
mkVector bs = V.generate (numRecords bs) (\i -> recordByIdx i bs)
mkBS :: V.Vector (B.ByteString) -> L.ByteString
mkBS = BB.toLazyByteString . V.foldr foldToBS mempty
  where foldToBS :: B.ByteString -> BB.Builder -> BB.Builder
        foldToBS = mappend . BB.fromWrite . BB.writeByteString
bucketSort :: Int -> V.Vector B.ByteString -> V.Vector B.ByteString
bucketSort chunkSize v = runST $ emptyBuckets >>= \bs ->
    go v bs lastIdx (chunkSize - 1)
  where lastIdx = V.length v - 1
        emptyBuckets :: ST s (V.MVector (PrimState (ST s)) [B.ByteString])
        emptyBuckets = V.thaw $ V.generate 256 (const [])
        go :: V.Vector B.ByteString ->
              V.MVector (PrimState (ST s)) [B.ByteString] ->
              Int -> Int -> ST s (V.Vector B.ByteString)
        go v _ _ (-1) = return v
        go _ buckets (-1) testIdx = do
          v' <- unbucket buckets
          bs <- emptyBuckets
          go v' bs lastIdx (testIdx - 1)
        go v buckets idx testIdx = do
          let testChunk = v V.! idx
              testByte = fromIntegral $ testChunk `B.index` testIdx
          b <- MV.read buckets testByte
          MV.write buckets testByte (testChunk:b)
          go v buckets (idx-1) testIdx
        unbucket :: V.MVector (PrimState (ST s)) [B.ByteString] ->
                    ST s (V.Vector B.ByteString)
        unbucket v = do
          v' <- V.freeze v
          return . V.fromList . concat . V.toList $ v'
main = B.readFile "data.txt" >>= return . process >>= L.putStr
  where process = mkBS .
                  bucketSort (recordLen) .
                  mkVector
Testing it gave a time of about 1:50 minutes, which is probably acceptable. We are talking of an O(c*n) algorithm for n in the range of some millions and a constant c of 36*something. But I'm sure you can optimize it further.
Or you can just use the vector-algorithms package. Testing with a heap sort:
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Vector as V
import qualified Blaze.ByteString.Builder as BB
import Data.Vector.Algorithms.Heap (sort)
import Data.Monoid
import Control.Monad.ST
recordLen = 36
lineEndingLen = 2 -- Windows! change to 1 for Unix
numRecords = (`div` (recordLen + lineEndingLen)) . B.length
substr idx len = B.take len . B.drop idx
recordByIdx idx = substr (idx*(recordLen+lineEndingLen)) (recordLen+lineEndingLen)
mkVector :: B.ByteString -> V.Vector (B.ByteString)
mkVector bs = V.generate (numRecords bs) (\i -> recordByIdx i bs)
mkBS :: V.Vector (B.ByteString) -> L.ByteString
mkBS = BB.toLazyByteString . V.foldr foldToBS mempty
  where foldToBS :: B.ByteString -> BB.Builder -> BB.Builder
        foldToBS = mappend . BB.fromWrite . BB.writeByteString
sortIt v = runST $ do
  mv <- V.thaw v
  sort mv
  V.freeze mv
main = B.readFile "data.txt" >>= return . process >>= L.putStr
  where process = mkBS .
                  sortIt .
                  mkVector
This does the job in about 1:20 minutes on my machine, so right now it's faster than my bucket sort. Both of the final solutions consume something in the range of 1-1.2 GB of RAM.
Good enough?
I'm using Gviz library from bioconductor. I input a tab delimited file containing CNV position that I need to plot on my chromosome ideogram.
My input file is defined by dat and has 4 columns
[1] chromosome
[2] start
[3] end
[4] width (could be '+' or '-' depending on the orientation of the Copy Number)
So I did this:
library(IRanges)
library(Gviz)
gen <- "mm9"
chr <- "chr1"
itrack <- IdeogramTrack(genome = gen, chromosome = chr)
gtrack <- GenomeAxisTrack()
dat <- read.delim("C:/R/1ips_chr1.txt", header = FALSE, sep ="\t")
s <- dat[2]
e <- dat[3]
l <- dat[4]
I get an error message when I use the columns taken from dat:
atrack1 <- AnnotationTrack( start = s, width = l , chromosome = chr, genome = gen, name = "Sample1")
Error : function (classes, fdef, mtable) : unable to find an inherited method for function ".buildRange", for signature "NULL", "data.frame", "NULL", "data.frame"
Obviously the way I pass the imported file (in dat) doesn't satisfy R. Can someone help me, please? :)
From the reference manual for the Gviz package (with which I am not familiar), the arguments start and width in the AnnotationTrack function need to be integer vectors. When you subset dat using the single square bracket [, the resulting object is a data.frame (see ?`[.data.frame` for more on this). Try instead
s <- dat[[2]]
e <- dat[[3]]
l <- dat[[4]]
to obtain integer vectors.