How to make my Haskell code use Laziness and Garbage collector - file

I wrote a Haskell code which has to solve the following problem : we have n files : f1, f2, f3 .... fn and I cut those files such a way that each slice has 100 lines
f1_1, f1_2, f1_3 .... f1_m
f2_1, f2_2, .... f2_n
...
fn_1, fn_2, .... fn_k
finally I construct a special data type (Dags) using slices in the following way
f1_1, f2_1, f3_1, .... fn_1 => Dag1
f1_2, f2_2, f3_2, ..... fn_2 => Dag2
....
f1_k, f2_k, f3_k, ..... fn_k => Dagk
the code that I wrote start by cutting all the files, then it couple the i-th elements of the results list and construct Dag using the final result list
it looks like this
-- # take a filename and cut the file in slices of 100 lines
sliceFile :: FilePath -> [[String]]
-- # take a list of lists and group the i-th elements into list
coupleIthElement :: [[String]] -> [[String]]
-- # take a list of lines and create a DAG
makeDags :: [String] -> Dag
-- # final code look like this
makeDag_ :: [FilePath] -> [Dag]
makeDags files = map makeDags $ coupleIthElement (concat (map sliceFile files))
The problem is that this code is non-efficient because :
it needs storing all the files in memory in list form
the garbage collector is not working efficiently since all fonctions need the results list of the previous fonction
How could I re-write my program to take advantage of garbage collector work and Laziness of Haskell ?
if not possible or easier, what can i do to be more efficient even a bit ?
thanks for reply
edit
coupleIthElement ["abc", "123", "xyz"] must return ["a1x","b2y","c3z"]
of cause the 100 lines are arbitrary selected using a particular criteria upon some element of the lines but i discard this aspect to make the problem more easier to understand,
another edition
data Dag = Dag ([(Int, String)], [((Int, Int), Int)]) deriving Show
test_dag = Dag ([(1, "a"),(2, "b"),(3, "c")],[((1,2),1),((1,3),1)])
test_dag2 = Dag ([],[])
the first list is each vertice define by the number and the label, the second list is the edges ((1,2),3) means edge between vertice 1 and 2 with the cost 3

A few points:
1) Have you considered using fgl? It's probably more efficient than your own Dag implementation. If you really need to use Dag, you could construct your graphs with fgl then convert them to Dag when they're complete.
2) It seems like you don't actually use the slices when constructing your graphs, rather they control how many graphs you have. If so, how about something like this:
dagFromHandles :: [Handle] -> IO Dag
dagFromHandles = fmap makeDags . mapM hGetLine
allDags :: [FilePath] -> IO [Dag]
allDags listOfFiles = do
handles <- mapM (flip openFile ReadMode) listOfFiles
replicateM 100 (dagFromHandles handles)
This assumes that each file has at least 100 lines, and any extra lines will be ignored. Even better would be if you had a function that would consume a Dag, then you could do
useDag :: Dag -> IO ()
runDags :: [FilePath] -> IO ()
runDags listOfFiles = do
handles <- mapM (flip openFile ReadMode) listOfFiles
replicateM_ 100 (dagFromHandles handles >>= useDag)
This should make more efficient use of garbage collection.
Of course this assumes that I understand the problem properly, and I'm not certain that I do. Note that concat (map sliceFile) should be a no-op (sliceFile would need to be in IO as you've defined the type, but ignoring that for now), so I don't see why you're bothering with it at all.

If it's not needed to process your file in slices, avoid this. Haskell does this automatically! In Haskell, you think of IO as a stream. Data is read from input, as soon as it's needed and discarded, as soon as it's unused. So for instance, this is an easy file-copying programm:
main = interact id
interact has the signature interact :: (String -> String) -> IO (), and feeds the input into a function which handles it and produces some output, which is written to stdout. This program is more efficient then most C-implementations, as the runtime automatically buffers the input and output.
If you want to understand laziness, you have to forget all the wisdom you learned as a imperative programmer and have to think about a program as a description to modify data, not as a set of instructions - data is only processed when needed!
The key point, why your data may be handled the wrong way is the multiple traversion of the list. Your function makeDags traverses the transposed the slices list one by one, so the elements of the original list may not be discarded. What you should try, is to write your function in a way like this:
sliceFile :: FilePath -> [[String]]
sliceFile fp = do
f <- readFile fp
let l = lines fp
slice [] = []
slice x = ll : slice ls where (ll,ls) = splitAt 100 x
return slice l
sliceFirstRow :: [[String]] -> ([String],[[String]])
sliceFirstRow list = unzip $ map (\(x:xs) -> (x,xs)) list
makeDags :: [[String]] -> [Dag]
makeDags [[]] = []
makeDags list = makeDag firstRow : makeDags restOfList where
(firstRow,restOfList) = sliceFirstRow list
This function may be a solution, since the first row is no longer referenced, when it's done. But in the most places, this is a result of laziness, so you could probably try to use seq to force building the Dags and allowing the IO data to be garbage-collected. (If you don't force building the dags, the data can't be garbage collected).
But anyway, I could provide a more helpfull answer, if you give some informations about what these dags are.

Related

How can I repeatedly read in shuffled lines of a large data file in Haskell?

I have a data file of 60k lines, where each line has ~1k comma separated Ints (that I want to immediately turn into Doubles).
I want to iterate over a sequence of random "batches" of 32 lines, where a batch is a random subset of all of the lines, and none of the batches share lines in common. Since there are 60k lines and 32 lines per batch, there should be 1875 batches.
I'm open to changing things if necessary, but I'd like them to be in the form of a list (of batches) that's lazily evaluated. The code that needs this is a foldM, where I'm using it like:
resulting_struct <- foldM fold_fn my_struct batch_list
so that it repeatedly calls fold_fn on the result of the current accumulator my_struct and the next element of batch_list.
I'm very confused. It was easy when I didn't need to shuffle them; I simply read them in and chunked them, and they were evaluated lazily, so I had no problems. Now I'm completely stuck and feel like I must be missing something simple.
I've tried the following:
Reading the file into a list of lines and naively shuffling the input. This doesn't work, as readFile is lazily evaluated, but it needs to read the whole file into memory to shuffle it randomly, and it quickly eats up all my ~8 GB RAM.
Getting the length of the file, and then creating a list of batches of shuffled indices from 0 to 60k that correspond to the line numbers that will be selected to form the batches. Then, when I want to actually get the data batches, I do:
ind_batches <- get_shuffled_ind_batches_from_file fname
batch_list <- mapM (get_data_batch_from_ind_batch fname) ind_batches
where:
get_shuffled_ind_batches_from_file :: String -> IO [[Int]]
get_shuffled_ind_batches_from_file fname = do
contents <- get_contents_from_file fname -- uses readFile, returns [[Double]]
let n_samps = length contents
ind = [0..(n_samps-1)]
shuffled_indices <- shuffle_list ind
let shuffled_ind_chunks = take 1800 $ chunksOf 32 shuffled_indices
return shuffled_ind_chunks
get_data_batch_from_ind_batch :: String -> [Int] -> IO [[Double]]
get_data_batch_from_ind_batch fname ind_chunk = do
contents <- get_contents_from_file fname
let data_batch = get_elems_at_indices contents ind_chunk
return data_batch
shuffle_list :: [a] -> IO [a]
shuffle_list xs = do
ar <- newArray n xs
forM [1..n] $ \i -> do
j <- randomRIO (i,n)
vi <- readArray ar i
vj <- readArray ar j
writeArray ar j vi
return vj
where
n = length xs
newArray :: Int -> [a] -> IO (IOArray Int a)
newArray n xs = newListArray (1,n) xs
get_elems_at_indices :: [a] -> [Int] -> [a]
get_elems_at_indices my_list ind_list = (map . (!!)) my_list ind_list
however, it seems like mapM evaluates immediately, which then tries to read in the file contents repeatedly (I think, the RAM blows up anyway).
A bit more searching told me that I could try using unsafeInterleaveIO to make it so it lazily evaluates an action, so I tried sticking it in like so:
get_data_batch_from_ind_batch :: String -> [Int] -> IO [[Double]]
get_data_batch_from_ind_batch fname ind_chunk = unsafeInterleaveIO $ do
contents <- get_contents_from_file fname
let data_batch = get_elems_at_indices contents ind_chunk
return data_batch
but no luck, same problem as above.
I feel like I've been banging my head against the wall here and must be missing something very simple. Someone suggested using streams or conduits instead, but when I looked at the documentation for them, it wasn't really clear to me how I could use them to solve this problem.
How can I read in a large data file and also shuffle it, without using up all my memory?
hGetContents will return the contents of the file lazily, but if you do much of anything with the result you will realize the whole file at once. I suggest reading the file once, and scanning over it for newlines, so that you can build an index of which chunk starts at which byte offset. That index will be quite small, so you can shuffle it easily. Then you can iterate through the index, each time opening the file and reading only a defined sub-range of it, and parsing only that one chunk.

Optimize a file writing operation in OCaml?

basically in my project, I am trying to write a list of strings into file like this:
val mutable rodata_list : (string*string) list = []
.....
let zip1 ll =
List.map (fun (h,e) -> h^e) ll in
let oc = open_out_gen [Open_append; Open_creat] 0o666 "final_data.s" in
List.iter (fun l -> Printf.fprintf oc "%s\n" l) (zip1 rodata_list);
Here is my problem, usually the rodata_list can reach as long as 800,000 size, and the above code on our server (64-bit, 32 core Intel(R) Xeon(R) CPU E5-2690 0 # 2.90GHz) takes about 3.5 seconds.. The OCaml version I use is 4.01.0.
This is not acceptable, especially as I have 4 piece of code like this to write into a file. Totally they could take me over 15 seconds..
I tried this:
Printf.fprintf oc "%s\n" (String.concat "\n" (zip1 rodata_list));
But no obvious improvement..
So I am wondering that, how to optimize this part? I appreciate any solutions. Thank you!
Don't use ^ to concatenate a bunch of strings in performance critical code, as it will lead to quadratic complexity;
Try not to rely on *printf functions, when performance matters (although in OCaml 4.02 it is pretty fast);
Don't apply several iterations on a list in a row, since OCaml doesn't have a deforesting. Try to do as much operations in an iteration as possible;
If you're using lists of 1 million elements, then you're actually doing something wrong. Try to use different data structure;
So, given the advices above we have the following:
List.iter (fun (x,y) ->
output_string oc x;
output_string oc y;
output_char oc '\n') rodata_list
Also, any optimizations should start from profiling, to get the profile you need to compile it with profiling info, for example like this:
ocamlbuild myprogram.p.native
Then you can run program to collect the profile, that can be read with gprof. My guess, that you will spend all the time not in the actual IO, or even concatenation, but in garbage collection, since your zip, will create millions of string.
How fast it should be
So to proof, that you're actually trying to optimize wrong part of your code, I've wrote this small program:
let rec init_rev acc = function
| 0 -> acc
| n -> init_rev (("hello", "world") :: acc) (n-1)
let () = List.iter (fun (x,y) ->
print_string x;
print_endline y) (init_rev [] 1000_000)
It creates a list of one million elements and outputs it:
$ ocamlbuild main.native
$ time ./main.native > data.txt
real 0m0.998s
user 0m0.211s
sys 0m0.783s
This is on macbook laptop. Moreover we spend most of the time in the system, with only 200ms in OCaml. And a simple loop for 1000_000 iterations without creating a list, takes only 11ms.
So, profile.

Reading file into types - Haskell

Right now I have two types:
type Rating = (String, Int)
type Film = (String, String, Int, [Rating])
I have a file that has this data in it:
"Blade Runner"
"Ridley Scott"
1982
("Amy",5), ("Bill",8), ("Ian",7), ("Kevin",9), ("Emma",4), ("Sam",7), ("Megan",4)
"The Fly"
"David Cronenberg"
1986
("Megan",4), ("Fred",7), ("Chris",5), ("Ian",0), ("Amy",6)
How can I look through then file storing all of the entries into something like FilmDatabase = [Film] ?
Haskell provides a unique way of sketching out your approach. Begin with what you know
module Main where
type Rating = (String, Int)
type Film = (String, String, Int, [Rating])
main :: IO ()
main = do
films <- readFilms "ratings.dat"
print films
Attempting to load this program into ghci will produce
films.hs:8:12: Not in scope: `readFilms'
It needs to know what readFilms is, so add just enough code to keep moving.
readFilms = undefined
It is a function that should do something related to Film data. Reload this code (with the :reload command or :r for short) to get
films.hs:9:3:
Ambiguous type variable `a0' in the constraint:
(Show a0) arising from the use of `print'
...
The type of print is
Prelude> :t print
print :: Show a => a -> IO ()
In other words, print takes a single argument that, informally, knows how to show itself (that is, convert its contents to a string) and creates an I/O action that when executed outputs that string. It’s more-or-less how you expect print to work:
Prelude> print 3
3
Prelude> print "hi"
"hi"
We know that we want to print the Film data from the file, but, although good, ghc can’t read our minds. But after adding a type hint
readFilms :: FilePath -> Film
readFilms = undefined
we get a new error.
films.hs:8:12:
Couldn't match expected type `IO t0'
with actual type `(String, String, Int, [Rating])'
Expected type: IO t0
Actual type: Film
In the return type of a call of `readFilms'
In a stmt of a 'do' expression: films <- readFilms "ratings.dat"
The error tells you that the compiler is confused about your story. You said readFilms should give it back a Film, but the way you called it in main, the computer should have to first perform some I/O and then give back Film data.
In Haskell, this is the difference between a pure string, say "JamieB", and a side effect, say reading your input from the keyboard after prompting you to input your Stack Overflow username.
So now we know we can sketch readFilms as
readFilms :: FilePath -> IO Film
readFilms = undefined
and the code compiles! (But we can’t yet run it.)
To dig down another layer, pretend that the name of a single movie is the only data in ratings.dat and put placeholders everywhere else to keep the typechecker happy.
readFilms :: FilePath -> IO Film
readFilms path = do
alldata <- readFile path
return (alldata, "", 0, [])
This version compiles, and you can even run it by entering main at the ghci prompt.
In dave4420’s answer are great hints about other functions to use. Think of the method above as putting together a jigsaw puzzle where the individual pieces are functions. For your program to be correct, all the types must fit together. You can make progress toward your final working program by taking little babysteps as above, and the typechecker will let you know if you have a mistake in your sketch.
Things to figure out:
How do you convert the whole blob of input to individual lines?
How do you figure out whether the line your program is examining is a title, a director, and so on?
How do you convert the year in your file (a String) to an Int to cooperate with your definition of Film?
How do you skip blank or empty lines?
How do you make readFilms accumulate and return a list of Film data?
Is this homework?
You might find these functions useful:
readFile :: FilePath -> IO String
lines :: String -> [String]
break :: (a -> Bool) -> [a] -> ([a], [a])
dropWhile :: (a -> Bool) -> [a] -> [a]
null :: [a] -> Bool
read :: Read a => String -> a
Remember that String is the same as [Char].
Some clues:
dropWhile null will get rid of empty lines from the start of a list
break null will split a list into the leading run of non-empty lines, and the rest of the list
Haskell has a great way of using the types to find the right function. For instance: In Gregs answer, he wants you to figure out (among other things) how to convert the year of the film from a String to an Int. Well, you need a function. What should be the type of that function? It takes a String and returns an Int, so the type should be String -> Int. Once you have that, go to Hoogle and enter that type. This will give you a list of functions with similar types. The function you need actually has a slightly different type - Read a => String -> a - so it is a bit down the list, but guessing a type and then scanning the resulting list is often a very useful strategy.

Growing arrays in Haskell

I have the following (imperative) algorithm that I want to implement in Haskell:
Given a sequence of pairs [(e0,s0), (e1,s1), (e2,s2),...,(en,sn)], where both "e" and "s" parts are natural numbers not necessarily different, at each time step one element of this sequence is randomly selected, let's say (ei,si), and based in the values of (ei,si), a new element is built and added to the sequence.
How can I implement this efficiently in Haskell? The need for random access would make it bad for lists, while the need for appending one element at a time would make it bad for arrays, as far as I know.
Thanks in advance.
I suggest using either Data.Set or Data.Sequence, depending on what you're needing it for. The latter in particular provides you with logarithmic index lookup (as opposed to linear for lists) and O(1) appending on either end.
"while the need for appending one element at a time would make it bad for arrays" Algorithmically, it seems like you want a dynamic array (aka vector, array list, etc.), which has amortized O(1) time to append an element. I don't know of a Haskell implementation of it off-hand, and it is not a very "functional" data structure, but it is definitely possible to implement it in Haskell in some kind of state monad.
If you know approx how much total elements you will need then you can create an array of such size which is "sparse" at first and then as need you can put elements in it.
Something like below can be used to represent this new array:
data MyArray = MyArray (Array Int Int) Int
(where the last Int represent how many elements are used in the array)
If you really need stop-and-start resizing, you could think about using the simple-rope package along with a StringLike instance for something like Vector. In particular, this might accommodate scenarios where you start out with a large array and are interested in relatively small additions.
That said, adding individual elements into the chunks of the rope may still induce a lot of copying. You will need to try out your specific case, but you should be prepared to use a mutable vector as you may not need pure intermediate results.
If you can build your array in one shot and just need the indexing behavior you describe, something like the following may suffice,
import Data.Array.IArray
test :: Array Int (Int,Int)
test = accumArray (flip const) (0,0) (0,20) [(i, f i) | i <- [0..19]]
where f 0 = (1,0)
f i = let (e,s) = test ! (i `div` 2) in (e*2,s+1)
Taking a note from ivanm, I think Sets are the way to go for this.
import Data.Set as Set
import System.Random (RandomGen, getStdGen)
startSet :: Set (Int, Int)
startSet = Set.fromList [(1,2), (3,4)] -- etc. Whatever the initial set is
-- grow the set by randomly producing "n" elements.
growSet :: (RandomGen g) => g -> Set (Int, Int) -> Int -> (Set (Int, Int), g)
growSet g s n | n <= 0 = (s, g)
| otherwise = growSet g'' s' (n-1)
where s' = Set.insert (x,y) s
((x,_), g') = randElem s g
((_,y), g'') = randElem s g'
randElem :: (RandomGen g) => Set a -> g -> (a, g)
randElem = undefined
main = do
g <- getStdGen
let (grownSet,_) = growSet g startSet 2
print $ grownSet -- or whatever you want to do with it
This assumes that randElem is an efficient, definable method for selecting a random element from a Set. (I asked this SO question regarding efficient implementations of such a method). One thing I realized upon writing up this implementation is that it may not suit your needs, since Sets cannot contain duplicate elements, and my algorithm has no way to give extra weight to pairings that appear multiple times in the list.

How to split a 110Mo file with Haskell

I have a file which look like this index : label, index's value contain keys in the range of 0... 100000000 and label can be any String value, I want split this file which has 110 Mo in many slices of 100 lines each an make some computation upon each slice. How can I do this?
123 : "acgbdv"
127 : "ytehdh"
129 : "yhdhgdt"
...
9898657 : "bdggdggd"
If you're using String IO, you can do the following:
import System.IO
import Control.Monad
-- | Process 100 lines
process100 :: [String] -> MyData
-- whatever this function does
loop :: [String] -> [MyData]
loop lns = go [] lns
where
go acc [] = reverse acc
go acc lns = let (this, next) = splitAt 100 lns in go (process100 this:acc) next
processFile :: FilePath -> IO [MyData]
processFile f = withFile f ReadMode (fmap (loop . lines) . hGetContents)
Note that this function will silently process the last chunk even if it isn't exactly 100 lines.
Packages like bytestring and text generally provide functions like lines and hGetContents so you should be able to easily adapt this function to any of them.
It's important to know what you're doing with the results of processing each slice, because you don't want to hold on to that data for longer than necessary. Ideally, after each slice is calculated the data would be entirely consumed and could be gc'd. Generally either the separate results get combined into a single data structure (a "fold"), or each one is dealt with separately (maybe outputting a line to a file or something similar). If it's a fold, you should change "loop" to look like this:
loopFold :: [String] -> MyData -- assuming there is a Monoid instance for MyData
loopFold lns = go mzero lns
where
go !acc [] = acc
go !acc lns = let (this, next) = splitAt 100 lns in go (process100 this `mappend` acc) next
The loopFold function uses bang patterns (enabled with "LANGUAGE BangPatterns" pragma) to force evaluation of the "MyData". Depending on what MyData is, you may need to use deepseq to make sure it's fully evaluated.
If instead you're writing each line to output, leave loop as it is and change processFile:
processFileMapping :: FilePath -> IO ()
processFileMapping f = withFile f ReadMode pf
where
pf = mapM_ (putStrLn . show) <=< fmap (loop . lines) . hGetContents
If you're interested in enumerator/iteratee style processing, this is a pretty simple problem. I can't give a good example without knowing what sort of work process100 is doing, but it would involve enumLines and take.
Is it necessary to process exactly 100 lines at a time, or do you just want to process in chunks for efficiency? If it's the latter, don't worry about it. You'd most likely be better off processing one line at a time, using either an actual fold function or a function similar to processFileMapping.

Resources