How to find if a file is binary - file

I am trying to read the text of all files in a folder with the following code:
readALine :: FilePath -> IO ()
readALine fname = do
  putStr . show $ "Filename: " ++ fname ++ "; "
  fs <- getFileSize fname
  if fs > 0
    then do
      hand <- openFile fname ReadMode
      fline <- hGetLine hand
      hClose hand
      print $ "First line: " <> fline
    else return ()
However, some of these files are binary. How can I find if a given file is binary? I could not find any such function in https://hoogle.haskell.org/?hoogle=binary%20file
Thanks for your help.
Edit: By binary I mean the file has unprintable characters. I am not sure of the proper term for these files.
I installed utf8-string and modified the code:
readALine :: FilePath -> IO ()
readALine fname = do
  putStr . show $ "Filename: " ++ fname ++ "; "
  fs <- getFileSize fname
  if fs > 0
    then do
      hand <- openFile fname ReadMode
      fline <- hGetLine hand
      hClose hand
      if isUTF8Encoded (unpack fline)
        then do
          print $ "Not binary file."
          print $ "First line: " <> fline
        else return ()
    else return ()
Now it works, but on encountering a 'binary' executable file (called esync.x), there is an error at the hGetLine hand expression:
"Filename: ./esync.x; "firstline2.hs: ./esync.x: hGetLine: invalid argument (invalid byte sequence)
How can I check the characters from the file handle itself?

The definition of binary is quite vague, but I will assume you mean content that is not valid UTF-8 text.
You should use toString from Data.ByteString.UTF8, which replaces invalid UTF-8 byte sequences with a replacement character instead of failing with an error.
Converting your example to use UTF-8 ByteStrings:
import Data.Monoid
import System.IO
import System.Directory
import qualified Data.ByteString as B
import qualified Data.ByteString.UTF8 as B

readALine :: FilePath -> IO ()
readALine fname = do
  putStr . show $ "Filename: " ++ fname ++ "; "
  fs <- getFileSize fname
  if fs > 0
    then do
      hand <- openFile fname ReadMode
      -- Read raw bytes, so no decoding error can occur here
      fline <- B.hGetLine hand
      hClose hand
      -- toString substitutes the replacement character for invalid UTF-8
      print $ "First line: " <> B.toString fline
    else return ()
This code doesn't fail on binary input, but it does not really detect binary content. If you want to detect binary, look for B.replacement_char in your data. To detect non-printable characters, you may also look for code points smaller than 32 (the space character), keeping in mind that tab, newline, and carriage return fall in that range too.
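As a rough illustration of the detection idea, here is a minimal sketch. It swaps in decodeUtf8' from the text package instead of searching for B.replacement_char, and the names looksBinary and isBinaryFile are hypothetical. The rule "invalid UTF-8, or a control character other than whitespace" is only one possible heuristic, not a standard definition of binary.

```haskell
import qualified Data.ByteString as BS
import Data.Char (isControl, isSpace)
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')

-- | Heuristic (an assumption of this sketch, not a standard definition):
-- bytes "look binary" if they are not valid UTF-8, or if they decode but
-- contain control characters other than whitespace (tab, newline, ...).
looksBinary :: BS.ByteString -> Bool
looksBinary bytes =
  case decodeUtf8' bytes of
    Left _     -> True
    Right text -> T.any (\c -> isControl c && not (isSpace c)) text

-- | Reads the whole file strictly, so it suits small files only.
isBinaryFile :: FilePath -> IO Bool
isBinaryFile path = looksBinary <$> BS.readFile path
```

Because the check works on raw bytes, it never goes through the handle's text decoding, which is where the hGetLine error in the question comes from. For large files one would probably inspect only a prefix, taking care not to cut a multi-byte UTF-8 sequence in half.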

Related

The Haskell way to do IO Loops (without explicit recursion)?

I want to read a list of strings separated by newlines from STDIN, until an empty line is encountered, and I want an action of the type IO [String]. Here is how I would do it with recursion:
myReadList :: IO [String]
myReadList = go []
  where
    go :: [String] -> IO [String]
    go l = do {
      inp <- getLine;
      if (inp == "") then
        return l;
      else go (inp:l);
    }
However, this method of using go obscures readability and is a pattern so common that one would ideally want to abstract this out.
So, this was my attempt:
whileM :: (Monad m) => (a -> Bool) -> [m a] -> m [a]
whileM p [] = return []
whileM p (x:xs) = do
  s <- x
  if p s
    then do
      l <- whileM p xs
      return (s:l)
    else
      return []

myReadList :: IO [String]
myReadList = whileM (/= "") (repeat getLine)
I am guessing there is some default implementation of this whileM or something similar already. However I cannot find it.
Could someone point out what is the most natural and elegant way to deal with this problem?
unfoldWhileM (from Control.Monad.Loops in the monad-loops package) is the same as your whileM, except that it takes an action (not a list) as its second argument.
myReadList = unfoldWhileM (/= "") getLine
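For readers who would rather not add a dependency, a hand-written equivalent is short. The sketch below uses a hypothetical name, collectWhileM, and is meant to behave like the whileM in the question with an action in place of a list; it is not the monad-loops source.

```haskell
-- | Hypothetical helper: run the action repeatedly, collecting results
-- for as long as the predicate holds. The first failing result is dropped.
collectWhileM :: Monad m => (a -> Bool) -> m a -> m [a]
collectWhileM p act = go
  where
    go = do
      x <- act
      if p x
        then (x :) <$> go
        else pure []

-- | Read lines from stdin until an empty line is entered.
myReadList :: IO [String]
myReadList = collectWhileM (/= "") getLine
```

Unlike the accumulator version in the question, this returns the lines in the order they were entered.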
Yes, for abstracting out the explicit recursion, as mentioned in the previous answer, there is the Control.Monad.Loops module (from the monad-loops package), which is useful. For those who are interested, here is a nice tutorial on Monad Loops.
However, there is another way. Previously, struggling with this job and knowing that Haskell is lazy by default, I first tried:
(sequence . repeat $ getLine) >>= return . takeWhile (/="q")
I expected the above to collect the entered lines into an IO [String]. Nah... It runs indefinitely, and IO actions don't look lazy at all. At this point System.IO.Lazy might come in handy too. It's a simple library with only two functions:
run :: T a -> IO a
interleave :: IO a -> T a
So run takes a lazy IO action and turns it into an IO action, and interleave does the opposite. Accordingly, if we rephrase the above function as:
import qualified System.IO.Lazy as LIO
gls = LIO.run (sequence . repeat $ LIO.interleave getLine) >>= return . takeWhile (/="q")
Prelude> gls >>= return . sum . fmap (read :: String -> Int)
1
2
3
4
q
10
A solution using the effectful streams of the streaming package:
import Streaming
import qualified Streaming.Prelude as S
main :: IO ()
main = do
  result <- S.toList_ . S.takeWhile (/="") . S.repeatM $ getLine
  print result
A solution that shows prompts, keeping them separated from the reading actions:
main :: IO ()
main = do
  result <- S.toList_
          $ S.zipWith (\_ s -> s)
                      (S.repeatM $ putStrLn "Write something: ")
                      (S.takeWhile (/="") . S.repeatM $ getLine)
  print result

Reading file with "US-ASCII" encoding in Haskell: hGetContents: invalid argument (invalid byte sequence)

I'm using Haskell for programming a parser, but this error is a wall I can't pass. Here is my code:
main = do
    arguments <- getArgs
    let fileName = head arguments
    fileContents <- readFile fileName
    converter <- open "UTF-8" Nothing
    let titleLength = length fileName
        titleWithoutExtension = take (titleLength - 4) fileName
        allNonEmptyLines = unlines $ tail $ filter (/= "") $ lines fileContents
When I try to read a file with "US-ASCII" encoding, I get the famous error hGetContents: invalid argument (invalid byte sequence). I've tried replacing "UTF-8" in my code with "US-ASCII", but the error persists. Is there a way to read these files, or any kind of file, while handling encoding problems?
You should use hSetEncoding to configure the file handle for a specific text encoding, e.g.:
import System.Environment
import System.IO

main = do
  (path : _) <- getArgs
  h <- openFile path ReadMode
  hSetEncoding h latin1
  contents <- hGetContents h
  -- no need to close h
  putStrLn $ show $ length contents
If your file contains non-ASCII characters and it's not UTF-8 encoded, then latin1 is a good bet, although it's not the only possibility.
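The same idea can be wrapped into a reusable helper. This is only a sketch: readFileLatin1 is a hypothetical name, and it forces the whole contents before the handle is closed, which trades memory for predictable closing.

```haskell
import System.IO

-- | Hypothetical helper: read a whole file, decoding it as Latin-1.
-- Latin-1 maps every byte to a character, so decoding cannot fail.
readFileLatin1 :: FilePath -> IO String
readFileLatin1 path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h latin1
    contents <- hGetContents h
    -- Force the lazy string before withFile closes the handle
    length contents `seq` return contents
```

If the real encoding is something other than Latin-1, the text will decode without error but some characters may come out wrong, so this is a fallback rather than a detection mechanism.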

Haskell -- main.hs:121:19: parse error on input `<-'

import System.Environment
import Control.Monad

getLines = liftM lines . readFile

main = do
    argv <- getArgs
    name <- getProgName
    if not (null argv)
        then let file = head argv
             list <- getLines file
             mapM_ putStrLn list
        else hPutStr stderr $ "usage: " ++ name ++ " number\n"
I'm not sure what I'm doing wrong and why I'm getting this error.
A let block should be followed either by more 'variable' assignments, or it should be ended. In this case, you want to align the next actions under the let. All of this should be in a do block.
So, you want to have a do right after your then, and you want to align the list <- ... and mapM_ ... lines with the let:
main = do
    argv <- getArgs
    name <- getProgName
    if not (null argv)
        then do
            let file = head argv
            list <- getLines file
            mapM_ putStrLn list
        else hPutStr stderr $ "usage: " ++ name ++ " number\n"

Convert all the elements in a file into a array in haskell

I have a file which contains a set of 200,000+ words and I want the program to read the data and store it in array and form a new array with all the 200,000+ words.
I wrote the code as
import System.IO
main = do
    handle <- openFile "words.txt" ReadMode
    contents <- hGetContents handle
    con <- lines contents
    putStrLn ( show con)
    hClose handle
But it gives a type error at line 5.
The text file is of the form
ABRIDGMENT
ABRIDGMENTS
ABRIM
ABRIN
ABRINS
ABRIS
and so on
What changes does the code need so that it can form an array of words?
I solved it in python (HTH)
def readFile():
    allWords = []
    for word in open("words.txt"):
        allWords.append(word.strip())
    return allWords
Maybe
readFile "words.txt" >>= return . words
with type
:: IO [String]
or you can write
getWordsFromFile :: String -> IO [String]
getWordsFromFile file = readFile file >>= return . words
and use as
main = do
    wordList <- getWordsFromFile "words.txt"
    putStrLn $ "File contains " ++ show (length wordList) ++ " words."
Very constructive comments from @sanityinc and @Sarah (thanks!):
@sanityinc: "Other options: fmap words $ readFile file or words <$> readFile file if you've imported <$> from Control.Applicative"
@Sarah: "To elaborate a bit, whenever you see foo >>= return . bar you can (and should) replace it with fmap bar foo because you're not actually using the extra powers that come with Monad and in most cases restricting yourself to a needlessly complex type is not beneficial. This will be even more true in the future where Applicative is a superclass of Monad"
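Putting the comments together, a minimal sketch might look like this. The second function, countLines, is a hypothetical name added to illustrate the likely cause of the type error in the question: lines contents is a pure value, so in a do block it is bound with let rather than with <-.

```haskell
-- | Read a file and split it into whitespace-separated words.
-- (<$>) is fmap: it applies the pure function 'words' inside IO.
getWordsFromFile :: FilePath -> IO [String]
getWordsFromFile file = words <$> readFile file

-- | Hypothetical helper: in a do block, a pure result is bound with let.
countLines :: FilePath -> IO Int
countLines file = do
  contents <- readFile file
  let wordList = lines contents
  return (length wordList)
```

With a recent base, (<$>) is exported by the Prelude, so no extra import is needed.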

Ensuring files are closed promptly

I am writing a daemon that reads something from a small file, modifies it, and writes it back to the same file. I need to make sure that each file is closed promptly after reading before I try to write to it. I also need to make sure each file is closed promptly after writing, because I might occasionally read from it again right away.
I have looked into using binary-strict instead of binary, but it seems that only provides a strict Get, not a strict Put. Same issue with System.IO.Strict. And from reading the binary-strict documentation, I'm not sure it really solves my problem of ensuring that files are promptly closed. What's the best way to handle this? DeepSeq?
Here's a highly simplified example that will give you an idea of the structure of my application. This example terminates with
*** Exception: test.dat: openBinaryFile: resource busy (file is locked)
for obvious reasons.
import Data.Binary ( Binary, encode, decode )
import Data.ByteString.Lazy as B ( readFile, writeFile )
import Codec.Compression.GZip ( compress, decompress )
encodeAndCompressFile :: Binary a => FilePath -> a -> IO ()
encodeAndCompressFile f = B.writeFile f . compress . encode
decodeAndDecompressFile :: Binary a => FilePath -> IO a
decodeAndDecompressFile f = return . decode . decompress =<< B.readFile f
main = do
  let i = 0 :: Int
  encodeAndCompressFile "test.dat" i
  doStuff

doStuff = do
  i <- decodeAndDecompressFile "test.dat" :: IO Int
  print i
  encodeAndCompressFile "test.dat" (i+1)
  doStuff
All 'puts' or 'writes' to files are strict. The act of writeFile demands all Haskell data be evaluated in order to put it on disk.
So what you need to concentrate on is the lazy reading of the input. In your example above you both lazily read the file, then lazily decode it.
Instead, try reading the file strictly (e.g. with strict bytestrings), and you'll be fine.
Consider using a package such as conduit, pipes, iteratee or enumerator. They provide many of the benefits of lazy IO (simpler code, potentially smaller memory footprint) without the lazy IO. Here's an example using conduit and cereal:
import Data.Conduit
import Data.Conduit.Binary (sinkFile, sourceFile)
import Data.Conduit.Cereal (sinkGet, sourcePut)
import Data.Conduit.Zlib (gzip, ungzip)
import Data.Serialize (Serialize, get, put)
encodeAndCompressFile :: Serialize a => FilePath -> a -> IO ()
encodeAndCompressFile f v =
  runResourceT $ sourcePut (put v) $$ gzip =$ sinkFile f

decodeAndDecompressFile :: Serialize a => FilePath -> IO a
decodeAndDecompressFile f = do
  val <- runResourceT $ sourceFile f $$ ungzip =$ sinkGet get
  case val of
    Right v -> return v
    Left err -> fail err

main = do
  let i = 0 :: Int
  encodeAndCompressFile "test.dat" i
  doStuff

doStuff = do
  i <- decodeAndDecompressFile "test.dat" :: IO Int
  print i
  encodeAndCompressFile "test.dat" (i+1)
  doStuff
An alternative to using conduits et al. would be to just use System.IO, which will allow you to control explicitly when files are closed with respect to the IO execution order.
You can use openBinaryFile followed by normal reading operations (probably the ones from Data.ByteString) and hClose when you're done with it, or withBinaryFile, which closes the file automatically (but beware this sort of problem).
Whatever the method you use, as Don said, you probably want to read as a strict bytestring and then convert the strict to lazy afterwards with fromChunks.
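A minimal sketch of the strict-read approach, assuming only the binary and bytestring packages: compression is left out to keep the example free of extra dependencies, and the names writeValue, readValue and bump are hypothetical.

```haskell
import Data.Binary (Binary, decode, encode)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import System.IO (IOMode (ReadMode), withBinaryFile)

-- | Write a value; BL.writeFile returns only after the data is written
-- and the handle is closed.
writeValue :: Binary a => FilePath -> a -> IO ()
writeValue f = BL.writeFile f . encode

-- | Read the whole file as a strict ByteString inside withBinaryFile, so
-- the handle is closed before decoding starts.
readValue :: Binary a => FilePath -> IO a
readValue f = do
  strict <- withBinaryFile f ReadMode BS.hGetContents
  return (decode (BL.fromChunks [strict]))

-- | Hypothetical read-modify-write step, mirroring doStuff in the question.
bump :: FilePath -> IO Int
bump f = do
  i <- readValue f
  writeValue f (i + 1)
  return (i + 1)
```

Because the strict read finishes before readValue returns, the later write to the same path should no longer hit the "file is locked" error from the question. The decompress step could be applied to the lazy ByteString produced by fromChunks in the same place.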

Resources