Haskell parse a string into array of arrays - arrays

I need to parse a String into an array of arrays. The string is split into arrays by the "\n" character, and each of those arrays is split by the "," or ";" signs.
Example: 4;5\n6,7 -----> [[4,5],[6,7]]
import qualified Text.Parsec as P (char,runP,noneOf,many,(<|>),eof)
import Text.ParserCombinators.Parsec
import Text.Parsec.String
import Text.Parsec.Char
import Text.PrettyPrint.HughesPJ
import Data.Maybe
newtype CSV = CSV [Row] deriving (Show,Eq)
type Row = [String]
parseCSV :: Parser CSV
parseCSV = do
  return $ CSV (endBy (sepBy (many (noneOf ",\n")) (Text.Parsec.Char.char ',')) (Text.Parsec.Char.char '\n'))
runParsec :: Parser a -> String -> Maybe a
runParsec parser input = case P.runP parser () "" input of
  Left _  -> Nothing
  Right a -> Just a
But when I try to run the code I get an error because of wrong data types.

Here is a solution that parses runParsec parseCSV "4;5\n6,7\n" into Just (CSV [["4","5"],["6","7"]]).
parseCSV :: Parser CSV
parseCSV = CSV <$> csv
  where
    csv = row `endBy` char '\n'
    row = many (noneOf ",;\n") `sepBy` oneOf ",;"

Related

numpy.loadtxt -> could not convert string to float: '-0,0118'

When I try to run this code:
x,y = np.genfromtxt('Diode_A_Upcd.txt', unpack = True, delimiter = ';' )
I get this error:
could not convert string to float: '-0,0118'
My data is in Diode_A_Upcd.txt.
I would like to store each row in individual arrays in order to then plot them.
The error message is due to the string being 0,0118: NumPy needs floats to be 0.0118, with a decimal point, not a decimal comma. np.loadtxt has an optional converters argument that allows a function to convert strings from one format to another.
import numpy as np

test_string = [ '0,123;5,234', '0,789;123,45']

def translate( s ):
    # loadtxt passes each field to the converter as bytes,
    # so replace the decimal comma with a decimal point
    s = s.replace( b',', b'.' )
    return s

conv = { 0: translate, 1: translate }
# Apply translate to both columns.
result = np.loadtxt( test_string, delimiter = ';', converters = conv )

result
# array([[1.2300e-01, 5.2340e+00],
#        [7.8900e-01, 1.2345e+02]])
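The asker's original call used np.genfromtxt, which also accepts a converters dict, so the same idea should carry over. A minimal sketch, assuming the file has exactly two semicolon-separated columns (the explicit encoding is an assumption, so the converter receives str rather than bytes):

import numpy as np

# hypothetical converter: turn '0,0118'-style fields into floats
def translate(s):
    return float(s.replace(',', '.'))

x, y = np.genfromtxt('Diode_A_Upcd.txt', unpack=True, delimiter=';',
                     converters={0: translate, 1: translate},
                     encoding='utf-8')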

Pandas, how to reset? - Shape of passed values is (1,1), indices imply (3,1)

I'm currently writing some code and am using pandas to export all of the data into csv files. My program runs multiple iterations until it has gone through all of the necessary files. Pandas re-writes one file each iteration, but when it moves on to the next file I need it to reset all of the data (I think).
Structure is roughly:
While loop > a few variables are named > program runs > dataframe=(pandas.DataFrame(averagepercentagelist,index=namelist,columns=header))
This part works with no problem for one file. When moving on to the next file, all of the arrays I use are reset, and this, I think, is why pandas gives the error Shape of passed values is (1,1), indices imply (3,1).
Please let me know if I need to explain it better.
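The error itself is easy to reproduce when the data list and the index list fall out of sync between iterations (for example, the value list was cleared but the name list kept growing); a minimal sketch:

import pandas as pd

data = ["12"]                        # 1 value -> shape (1, 1)
names = ["Name0", "Name1", "Name2"]  # 3 labels -> indices imply (3, 1)
pd.DataFrame(data, index=names, columns=["Average Percentage"])
# ValueError: Shape of passed values is (1, 1), indices imply (3, 1)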
EDIT:
while True:
    try:
        averagepercentagelist=[]
        namelist=[]
        columns=[]
        for row in database:
            averagepercentagelist=["12","23"]
            namelist=["Name0","Name1"]
            columns=["Average percentage"]
        dataframe=(pandas.DataFrame(averagepercentagelist,index=namelist,columns=header))
    except Exception as e:
        print(e)
        break
SNIPPET:
dataframe= (pandas.DataFrame(averagepercentagelist,index=namelist,columns=header))
currentcalculatedatafrane = 'averages' + currentcalculate
dataframeexportpath = os.path.join(ROOT_PATH,'Averages',currentcalculatedatafrane)
dataframe.to_csv(dataframeexportpath)
FULL PROGRAM SO FAR:
import csv
import os
import re
import pandas
import tkinter as tk
from tkinter import messagebox
from os.path import isfile, join
from os import listdir
import time

ROOT_PATH = os.path.dirname(os.path.abspath(__file__))
indexforcalcu=0
line_count=0
testlist=[]
namelist=[]
header=['Average Percentage']

def clearvariables():
    indexforcalcu=0
    testlist=[]

def findaverageofstudent(findaveragenumber,numoftests):
    total=0
    findaveragenumber = findaveragenumber/numoftests
    findaveragenumber = round(findaveragenumber, 1)
    return findaveragenumber

def removecharacters(nameforfunc):
    nameforfunc=str(nameforfunc)
    elem=re.sub("[{'}]", "",nameforfunc)
    return elem

def getallclasses():
    onlyfiles = [f for f in listdir(ROOT_PATH) if isfile(join(ROOT_PATH, f))]
    onlyfiles.remove("averagecalculatorv2.py")
    return onlyfiles

def findaveragefunc():
    indexforcalcu=-1
    while True:
        try:
            totaltests=0
            line_count=0
            averagepercentagelist=[]
            indexforcalcu=indexforcalcu+1
            allclasses=getallclasses()
            currentcalculate=allclasses[indexforcalcu]
            classpath = os.path.join(ROOT_PATH, currentcalculate)
            with open(classpath) as csv_file:
                classscoredb = csv.reader(csv_file, delimiter=',')
                for i, row in enumerate(classscoredb):
                    if line_count == 0:
                        while True:
                            try:
                                totaltests=totaltests+1
                                rowreader= {row[totaltests]}
                            except:
                                totaltests=totaltests-1
                                line_count = line_count + 1
                                break
                    else:
                        calculating_column_location=1
                        total=0
                        while True:
                            try:
                                total = total + int(row[calculating_column_location])
                                calculating_column_location = calculating_column_location + 1
                            except:
                                break
                        i=str(i)
                        name=row[0]
                        cleanname=removecharacters(nameforfunc=name)
                        namelist.append(cleanname)
                        findaveragenumbercal=findaverageofstudent(findaveragenumber=total,numoftests=totaltests)
                        averagepercentagelist.append(findaveragenumbercal)
                        line_count = line_count + 1
                        dataframe= (pandas.DataFrame(averagepercentagelist,index=namelist,columns=header))
                        currentcalculatedatafrane = 'averages' + i + currentcalculate
                        dataframeexportpath = os.path.join(ROOT_PATH,'Averages',currentcalculatedatafrane)
                        dataframe.to_csv(dataframeexportpath)
                        i=int(i)
        except Exception as e:
            print("ERROR!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n\n",e)
            break

def makenewclass():
    global newclassname
    getclassname=str(newclassname.get())
    if getclassname == "":
        messagebox.showerror("Error","The class name you have entered is invalid.")
    else:
        classname = getclassname + ".csv"
        with open(classname, mode='w') as employee_file:
            classwriter = csv.writer(employee_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
            classwriter.writerow(["Name","Test 1"])

root=tk.Tk()
root.title("Test result average finder")
findaveragebutton=tk.Button(root,text="Find Averages",command=findaveragefunc())
findaveragebutton.grid(row=2,column=2,padx=(10, 10),pady=(0,10))
classnamelabel=tk.Label(root, text="Class name:")
classnamelabel.grid(row=1, column=0,padx=(10,0),pady=(10,10))
newclassname = tk.Entry(root)
newclassname.grid(row=1,column=1,padx=(10, 10))
newclassbutton=tk.Button(root,text="Create new class",command=makenewclass)
newclassbutton.grid(row=1,column=2,padx=(0, 10),pady=(10,10))
root.mainloop()
Thanks in advance,
Sean
Use:
import glob, os
import pandas as pd

ROOT_PATH = os.path.dirname(os.path.abspath(__file__))

#extract all csv files to a list
files = glob.glob(f'{ROOT_PATH}/*.csv')
print (files)

#create a new folder if necessary
new = os.path.join(ROOT_PATH,'Averages')
if not os.path.exists(new):
    os.makedirs(new)

#loop over each file
for f in files:
    #create DataFrame and convert first column to index
    df = pd.read_csv(f, index_col=[0])
    #average each row, round it and create a one-column DataFrame
    avg = df.mean(axis=1).round(1).to_frame('Average Percentage')
    #remove index name if necessary
    avg.index.name = None
    print (avg)
    #create new path
    head, tail = os.path.split(f)
    path = os.path.join(head, 'Averages', tail)
    print (path)
    #write DataFrame to csv
    avg.to_csv(path)
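To illustrate what the avg line computes, here is a small self-contained example (the names and scores are made up):

import pandas as pd

df = pd.DataFrame({'Test 1': [50, 81], 'Test 2': [70, 90]},
                  index=['Name0', 'Name1'])
avg = df.mean(axis=1).round(1).to_frame('Average Percentage')
print(avg)
#        Average Percentage
# Name0                60.0
# Name1                85.5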

Spark 2.0.x dump a csv file from a dataframe containing one array of type string

I have a dataframe df that contains one column of type array.
df.show() looks like:
+--+-------------+---+------+
|ID|ArrayOfString|Age|Gender|
+--+-------------+---+------+
|1 | [A,B,D]     |22 | F    |
|2 | [A,Y]       |42 | M    |
|3 | [X]         |60 | F    |
+--+-------------+---+------+
I try to dump that df to a csv file as follows:
val dumpCSV = df.write.csv(path="/home/me/saveDF")
It is not working because of the column ArrayOfString. I get the error:
CSV data source does not support array string data type
The code works if I remove the column ArrayOfString. But I need to keep ArrayOfString!
What would be the best way to dump the dataframe to csv, including the column ArrayOfString (ArrayOfString should be dumped as one column in the CSV file)?
The reason why you are getting this error is that the csv file format doesn't support array types; you'll need to express the array as a string to be able to save it.
Try the following:
import org.apache.spark.sql.functions._

val stringify = udf((vs: Seq[String]) => vs match {
  case null => null
  case _    => s"""[${vs.mkString(",")}]"""
})

df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
or
import org.apache.spark.sql.Column
def stringify(c: Column) = concat(lit("["), concat_ws(",", c), lit("]"))
df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
PySpark implementation.
In this example, the field column_as_array is converted to column_as_str before saving.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def array_to_string(my_list):
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'

array_to_string_udf = udf(array_to_string, StringType())
df = df.withColumn('column_as_str', array_to_string_udf(df["column_as_array"]))
Then you can drop the old column (array type) before saving.
df.drop("column_as_array").write.csv(...)
No need for a UDF if you already know which fields contain arrays. You can simply use Spark's cast function:
import org.apache.spark.sql.functions._

val dumpCSV = df.withColumn("ArrayOfString", col("ArrayOfString").cast("string"))
                .write
                .csv(path="/home/me/saveDF")
Hope that helps.
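Assuming the array-to-string cast works the same way from Python (as with the Scala version above, this relies on the running Spark version supporting that cast), a PySpark sketch of this approach would be:

from pyspark.sql.functions import col

# cast the array column to its string representation, then save
df.withColumn("ArrayOfString", col("ArrayOfString").cast("string")) \
  .write.csv("/home/me/saveDF")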
Here is a method for converting all ArrayType (of any underlying type) columns of a DataFrame to StringType columns:
def stringifyArrays(dataFrame: DataFrame): DataFrame = {
  val colsToStringify = dataFrame.schema.filter(p => p.dataType.typeName == "array").map(p => p.name)
  colsToStringify.foldLeft(dataFrame)((df, c) => {
    df.withColumn(c, concat(lit("["), concat_ws(", ", col(c).cast("array<string>")), lit("]")))
  })
}
Also, it doesn't use a UDF.
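A PySpark sketch of the same schema-driven idea (the function name stringify_arrays is my own):

from pyspark.sql.functions import col, concat, concat_ws, lit

def stringify_arrays(df):
    # rewrite every ArrayType column as "[elem, elem, ...]"
    for field in df.schema.fields:
        if field.dataType.typeName() == "array":
            name = field.name
            df = df.withColumn(name, concat(lit("["),
                                            concat_ws(", ", col(name).cast("array<string>")),
                                            lit("]")))
    return df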
CSV is not the ideal export format, but if you just want to visually inspect your data, this will work [Scala]. Quick and dirty solution.
case class example ( id: String, ArrayOfString: String, Age: String, Gender: String)
df.rdd.map{line => example(line(0).toString, line(1).toString, line(2).toString , line(3).toString) }.toDF.write.csv("/tmp/example.csv")
To answer DreamerP's question (from one of the comments) :
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def array_to_string(my_list):
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'

array_to_string_udf = udf(array_to_string, StringType())
df = df.withColumn('Antecedent_as_str', array_to_string_udf(df["Antecedent"]))
df = df.withColumn('Consequent_as_str', array_to_string_udf(df["Consequent"]))
df = df.drop("Consequent")
df = df.drop("Antecedent")
df.write.csv("foldername")

How do I read (and parse) a file and then append to the same file without getting an exception?

I am trying to read from a file correctly in Haskell but I seem to get this error.
*** Exception: neo.txt: openFile: resource busy (file is locked)
This is my code.
import Data.Char
import Prelude
import Data.List
import Text.Printf
import Data.Tuple
import Data.Ord
import Control.Monad
import Control.Applicative((<*))
import Text.Parsec
  ( Parsec, ParseError, parse     -- Types and parser
  , between, noneOf, sepBy, many1 -- Combinators
  , char, spaces, digit, newline  -- Simple parsers
  )
These are the movie fields.
type Title = String
type Director = String
type Year = Int
type UserRatings = (String,Int)
type Film = (Title, Director, Year , [UserRatings])
type Period = (Year, Year)
type Database = [Film]
These are the parsers for all of the types, needed to read the file correctly.
-- Parse a string to a string
stringLit :: Parsec String u String
stringLit = between (char '"') (char '"') $ many1 $ noneOf "\"\n"
-- Parse a string to a list of strings
listOfStrings :: Parsec String u [String]
listOfStrings = stringLit `sepBy` (char ',' >> spaces)
-- Parse a string to an int
intLit :: Parsec String u Int
intLit = fmap read $ many1 digit
-- Or `read <$> many1 digit` with Control.Applicative
stringIntTuple :: Parsec String u (String , Int)
stringIntTuple = liftM2 (,) stringLit intLit
film :: Parsec String u Film
film = do
  -- alternatively `title <- stringLit <* newline` with Control.Applicative
  title <- stringLit
  newline
  director <- stringLit
  newline
  year <- intLit
  newline
  userRatings <- stringIntTuple
  newline
  return (title, director, year, [userRatings])
films :: Parsec String u [Film]
films = film `sepBy` newline
This is the main program (type main in WinGHCi to start the program).
-- The Main
main :: IO ()
main = do
  putStr "Enter your Username: "
  name <- getLine
  filmsDatabase <- loadFile "neo.txt"
  appendFile "neo.txt" (show filmsDatabase)
  putStrLn "Your changes to the database have been successfully saved."
This is the loadFile function
loadFile :: FilePath -> IO (Either ParseError [Film])
loadFile filename = do
  database <- readFile filename
  return $ parse films "Films" database
The txt file is named neo.txt and includes some movies like this:
"Blade Runner"
"Ridley Scott"
1982
("Amy",5), ("Bill",8), ("Ian",7), ("Kevin",9), ("Emma",4), ("Sam",7), ("Megan",4)
"The Fly"
"David Cronenberg"
1986
("Megan",4), ("Fred",7), ("Chris",5), ("Ian",0), ("Amy",6)
Just copy-paste everything, include the txt file in the same directory, and run it to see the error I described.
Whoopsy daisy, being lazy
tends to make file changes crazy.
File's not closed, as supposed
thus the error gets imposed.
This small guile, by loadFile
is what you must reconcile.
But don't fret, least not yet,
I will show you, let's get set.
Like many other functions that work with IO in System.IO, readFile doesn't actually consume any input. It's lazy. Therefore, the file doesn't get closed unless all of its content has been consumed (it is then half-closed):
The file is read lazily, on demand, as with getContents.
We can demonstrate this on a shorter example:
main = do
  let filename = "/tmp/example"
  writeFile filename "Hello "
  contents <- readFile filename
  appendFile filename "world!" -- error here
This will fail, since we never actually inspected contents (entirely). If you consume all the content (for example by printing it, taking its length, or similar), it won't fail anymore:
main = do
  let filename = "/tmp/example2"
  writeFile filename "Hello "
  content <- readFile filename
  putStrLn content
  appendFile filename "world!" -- no error
Therefore, we need either something that really closes the file, or we need to make sure that we've read all the contents before we try to append to the file.
For example, you can use withFile together with some "magic" function force that makes sure that the content really gets evaluated:
readFile' filename = withFile filename ReadMode $ \handle -> do
  theContent <- hGetContents handle
  force theContent
However, force is tricky to achieve. You could use bang patterns, but this will evaluate the list only to WHNF (basically just the first character). You could use the functions from deepseq, but that adds another dependency and is probably not allowed in your assignment/exercise.
Or you could use any function that will somehow make sure that all elements are evaluated or sequenced. In this case, we can use a small trick and mapM return:
readFile' filename = withFile filename ReadMode $ \handle -> do
  theContent <- hGetContents handle
  mapM return theContent
It's good enough, but you would use something like pipes or conduit instead in production.
The other method is to make sure that we've really used all the contents. This can be done by using another parsec parser method instead, namely runParserT. We can combine this with our withFile approach from above:
parseFile :: ParsecT String () IO a -> FilePath -> IO (Either ParseError a)
parseFile p filename = withFile filename ReadMode $ \handle ->
  hGetContents handle >>= runParserT p () filename
Again, withFile makes sure that we close the file. We can now use this in your loadFile:
loadFile :: FilePath -> IO (Either ParseError [Film])
loadFile filename = parseFile films filename
This version of loadFile won't keep the file locked anymore.
The problem is that readFile doesn't actually read the entire file into memory immediately; it opens the file and instantly returns a string. As you "look at" the string, behind the scenes the file is being read. So when readFile returns, the file it still open for reading, and you can't do anything else with it. This is called "lazy I/O", and many people consider it to be "evil" precisely because it tends to cause problems like the one you currently have.
There are several ways you can go about fixing this. Probably the simplest is to just force the whole string into memory before continuing. Calculating the length of the string will do that — but only if you "use" the length for something, because the length itself is lazy. (See how this rapidly becomes messy? This is why people avoid lazy I/O.)
The simplest thing you could try is printing the number of films loaded right before you try to append to the database.
main = do
  putStr "Enter your Username: "
  name <- getLine
  filmsDatabase <- loadFile "neo.txt"
  putStrLn $ "Loaded " ++ show (length filmsDatabase) ++ " films."
  appendFile "neo.txt" (show filmsDatabase)
  putStrLn "Your changes to the database have been successfully saved."
It's kind of evil that what looks like a simple print message is actually fundamental to making the code work though!
The other alternative is to save the new database under a different name, and then delete the old file and rename the new one over the top of the old one. This does have the advantage that if the program were to crash half way through saving, you haven't just lost all your stuff.

Find string in log files and return extra characters

How can I get Python to loop through a directory and find a specific string in each file located within that directory, then output a summary of what it found?
I want to search the long files for the following string:
FIRMWARE_VERSION = "2.15"
Only, the firmware version can be different in each file, so I want the script to report back whatever version it finds.
import glob
import os

print("The following list contains the firmware version of each server.\n")

os.chdir( "LOGS\\" )
for file in glob.glob('*.log'):
    with open(file) as f:
        contents = f.read()
        if 'FIRMWARE_VERSION = "' in contents:
            print (file + " = ???")
I was thinking I could use something like the following to return the extra characters, but it's not working:
file[:+5]
I want the output to look something like this:
server1.web.com = FIRMWARE_VERSION = "2.16"
server2.web.com = FIRMWARE_VERSION = "3.01"
server3.web.com = FIRMWARE_VERSION = "1.26"
server4.web.com = FIRMWARE_VERSION = "4.1"
server5.web.com = FIRMWARE_VERSION = "3.50"
Any suggestions on how I can do this?
You can use a regex to grab the text:
import re

for file in glob.glob('*.log'):
    with open(file) as f:
        contents = f.read()
        if 'FIRMWARE_VERSION = "' in contents:
            print (file + '=' + re.search(r'FIRMWARE_VERSION = "([\d.]+)"', contents).group(1))
In this case re.search will do the job, searching the file content with the following pattern:
r'FIRMWARE_VERSION = "([\d.]+)"'
which finds a number (digits and dots) between two double quotes. Alternatively, you can use the following, which matches anything right after FIRMWARE_VERSION = between two double quotes:
r'FIRMWARE_VERSION = (".*")'
