Scala: Parsing Array of String to a case class

I have created a case class like this:
case class StockPrice(
  quarter: Byte,
  stock: String,
  date: String,
  open: Double,
  high: Double,
  low: Double,
  close: Double,
  volume: Double,
  percent_change_price: Double,
  percent_change_volume_over_last_wk: Double,
  previous_weeks_volume: Double,
  next_weeks_open: Double,
  next_weeks_close: Double,
  percent_change_next_weeks_price: Double,
  days_to_next_dividend: Double,
  percent_return_next_dividend: Double
)
And I have thousands of lines as an Array of String like this:
1,AA,1/7/2011,$15.82,$16.72,$15.78,$16.42,239655616,3.79267,,,$16.71,$15.97,-4.42849,26,0.182704
1,AA,1/14/2011,$16.71,$16.71,$15.64,$15.97,242963398,-4.42849,1.380223028,239655616,$16.19,$15.79,-2.47066,19,0.187852
1,AA,1/21/2011,$16.19,$16.38,$15.60,$15.79,138428495,-2.47066,-43.02495926,242963398,$15.87,$16.13,1.63831,12,0.189994
1,AA,1/28/2011,$15.87,$16.63,$15.82,$16.13,151379173,1.63831,9.355500109,138428495,$16.18,$17.14,5.93325,5,0.185989
How can I parse the data from the Array into that case class?
Thank you for your help!

You can proceed as below (I've taken a simplified example).
Given your case class and data (lines):
// Your case-class
case class MyCaseClass(
  fieldByte: Byte,
  fieldString: String,
  fieldDouble: Double
)

// input data
val lines: List[String] = List(
  "1,AA,$1.1",
  "2,BB,$2.2",
  "3,CC,$3.3"
)
Note: you can read the lines from a text file as:
import scala.io.Source
val lines = Source.fromFile("my_file.txt").getLines.toList
You can have some utility methods for mapping (cleaning & parsing)
// remove '$' symbols from string
def removeDollars(line: String): String = line.replaceAll("\\$", "")
// split string into tokens and
// convert into MyCaseClass object
def parseLine(line: String): MyCaseClass = {
  val tokens: Seq[String] = line.split(",")
  MyCaseClass(
    fieldByte = tokens(0).toByte,
    fieldString = tokens(1),
    fieldDouble = tokens(2).toDouble
  )
}
And then use them to convert the strings into case-class objects:
// conversion
val myCaseClassObjects: Seq[MyCaseClass] = lines.map(removeDollars).map(parseLine)
As a more advanced (and generalized) approach, you can derive the mapping (parsing) function that converts tokens into the fields of your case class, using something like reflection or a typeclass.
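For instance, here is a minimal sketch of such a generalized approach, using a typeclass rather than runtime reflection; FieldParser and parseToken are made-up names for illustration, not a library API:

trait FieldParser[A] { def parse(token: String): A }

object FieldParser {
  implicit val byteParser: FieldParser[Byte]     = (t: String) => t.toByte
  implicit val stringParser: FieldParser[String] = (t: String) => t
  implicit val doubleParser: FieldParser[Double] = (t: String) => t.toDouble
}

// the requested field type drives which parser is applied
def parseToken[A](token: String)(implicit p: FieldParser[A]): A = p.parse(token)

parseToken[Byte]("1")     // 1: Byte
parseToken[Double]("1.1") // 1.1: Double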

Here's one way of doing it. I'd recommend splitting everything into lots of small, easy-to-manage functions; otherwise you will get lost trying to figure out where something is going wrong when it all starts throwing exceptions. Data setup:
val array = Array("1,AA,1/7/2011,$15.82,$16.72,$15.78,$16.42,239655616,3.79267,,,$16.71,$15.97,-4.42849,26,0.182704",
"1,AA,1/14/2011,$16.71,$16.71,$15.64,$15.97,242963398,-4.42849,1.380223028,239655616,$16.19,$15.79,-2.47066,19,0.187852",
"1,AA,1/21/2011,$16.19,$16.38,$15.60,$15.79,138428495,-2.47066,-43.02495926,242963398,$15.87,$16.13,1.63831,12,0.189994",
"1,AA,1/28/2011,$15.87,$16.63,$15.82,$16.13,151379173,1.63831,9.355500109,138428495,$16.18,$17.14,5.93325,5,0.185989")
case class StockPrice(quarter: Byte, stock: String, date: String, open: Double,
  high: Double, low: Double, close: Double, volume: Double, percent_change_price: Double,
  percent_change_volume_over_last_wk: Double, previous_weeks_volume: Double,
  next_weeks_open: Double, next_weeks_close: Double, percent_change_next_weeks_price: Double,
  days_to_next_dividend: Double, percent_return_next_dividend: Double
)
Function to turn Array[String] into Array[List[String]] and handle any empty fields (I've made an assumption here that you want empty fields to be 0. Change this as necessary):
def splitArray(arr: Array[String]): Array[List[String]] = {
  arr.map(
    _.replaceAll("\\$", "") // Remove $
      .split(",")           // Split by ,
      .map {
        case x if x.isEmpty => "0" // If empty
        case y => y                // If not empty
      }
      .toList
  )
}
Function to turn a List[String] into a StockPrice. Note that this will fall over if the List isn't exactly 16 items long, and also if any field doesn't map to the relevant .toDouble or .toByte; I'll leave you to handle that (a safer variant using Try is sketched after this block). The names here are pretty non-descriptive, so you can change those too:
def toStockPrice: List[String] => StockPrice = {
  case a :: b :: c :: d :: e :: f :: g :: h :: i :: j :: k :: l :: m :: n :: o :: p :: Nil =>
    StockPrice(a.toByte, b, c, d.toDouble, e.toDouble, f.toDouble, g.toDouble, h.toDouble,
      i.toDouble, j.toDouble, k.toDouble, l.toDouble, m.toDouble, n.toDouble, o.toDouble, p.toDouble)
}
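Here is the safer variant mentioned above, as a sketch: wrapping the conversions in Try turns a short List or a malformed number into None instead of an exception (toStockPriceSafe is a made-up name):

import scala.util.Try

def toStockPriceSafe(tokens: List[String]): Option[StockPrice] = tokens match {
  case a :: b :: c :: rest if rest.size == 13 =>
    Try {
      val d = rest.map(_.toDouble) // any bad number aborts the whole Try
      StockPrice(a.toByte, b, c, d(0), d(1), d(2), d(3), d(4), d(5), d(6),
        d(7), d(8), d(9), d(10), d(11), d(12))
    }.toOption
  case _ => None // wrong number of fields
}

With this, splitArr.flatMap(toStockPriceSafe) keeps only the rows that parse cleanly.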
A nice function to bring this all together:
def makeCaseClass(arr: Array[String]): Seq[StockPrice] = {
  val splitArr: Array[List[String]] = splitArray(arr)
  splitArr.map(toStockPrice)
}
Output:
println(makeCaseClass(array))
//ArraySeq(
// StockPrice(1,AA,1/7/2011,15.82,16.72,15.78,16.42,2.39655616E8,3.79267,0.0,0.0,16.71,15.97,-4.42849,26.0,0.182704),
// StockPrice(1,AA,1/14/2011,16.71,16.71,15.64,15.97,2.42963398E8,-4.42849,1.380223028,2.39655616E8,16.19,15.79,-2.47066,19.0,0.187852),
// StockPrice(1,AA,1/21/2011,16.19,16.38,15.6,15.79,1.38428495E8,-2.47066,-43.02495926,2.42963398E8,15.87,16.13,1.63831,12.0,0.189994),
// StockPrice(1,AA,1/28/2011,15.87,16.63,15.82,16.13,1.51379173E8,1.63831,9.355500109,1.38428495E8,16.18,17.14,5.93325,5.0,0.185989)
//)
Edit:
To explain the a :: b :: c ... bit: this is a way of assigning names to the items in a List or Seq, given that you know the List's size.
val ls = List(1, 2, 3)
val a :: b :: c :: Nil = List(1, 2, 3)
println(a == ls.head) // true
println(b == ls(1)) // true
println(c == ls(2)) // true
Note that the Nil is important: it matches the end of the List. Without it, c would be bound to List(3), because the remainder of the List is assigned to the last name in the pattern.
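A quick demonstration of that tail binding:

val x :: y :: rest = List(1, 2, 3)
println(rest) // List(3): without a terminating Nil, the last name takes the remaining tail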
You can use this in pattern matching as I have in order to do something with the results:
val ls = List(1, "b", true)
ls match {
  case a :: b :: c if c == true => println("this will not be printed")
  case a :: b :: c :: Nil if c == true => println(s"this will get printed because c == $c")
} // not exhaustive but you get the point
You can also use it if you know what you want each element in the List to be, like this:
val personCharacteristics = List("James", 26, "blue", 6, 85.4, "brown")
val name :: age :: eyeColour :: otherCharacteristics = personCharacteristics
println(s"Name: $name; Age: $age; Eye colour: $eyeColour")
// Name: James; Age: 26; Eye colour: blue
Obviously these examples are pretty trivial and not exactly what you'd write day-to-day as a professional Scala developer, but it's a handy thing to be aware of; I do still use this :: syntax at work sometimes.

Related

numpy.loadtxt -> could not convert string to float: '-0,0118'

When I try to run this code:
x,y = np.genfromtxt('Diode_A_Upcd.txt', unpack = True, delimiter = ';' )
I get the error in the title (could not convert string to float: '-0,0118'). My data is in Diode_A_Upcd.txt. I would like to store each row in individual arrays in order to then plot them.
The error message is due to the string being 0,0118: NumPy needs floats written as 0.0118, with a decimal point, not a decimal comma. np.loadtxt has an optional converters argument that allows a function to convert strings from one format to another.
import numpy as np

test_string = ['0,123;5,234', '0,789;123,45']

def translate(s):
    # the converter receives raw bytes here, hence the byte-string replace
    s = s.replace(b',', b'.')
    return s

# Apply translate to both columns.
conv = {0: translate, 1: translate}
result = np.loadtxt(test_string, delimiter=';', converters=conv)
result
# array([[1.2300e-01, 5.2340e+00],
#        [7.8900e-01, 1.2345e+02]])
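The same converters dictionary also works with np.genfromtxt, so the call from the question can be kept; a sketch, assuming Diode_A_Upcd.txt has two ';'-separated columns:

x, y = np.genfromtxt('Diode_A_Upcd.txt', delimiter=';',
                     converters={0: translate, 1: translate}, unpack=True)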

Declaring an empty array of type Edge Graphx

I'm reading data from a file to create the edges of a graph. I've declared an array and I'm adding edges to it one by one. This code works fine:
class AIRecipes()
case class edgeProperty(val relation: String, val usedIn: String) extends AIRecipes
var edgeArray = Array(Edge(0L, 0L, edgeProperty("", "")))
edgeArray = edgeArray ++ Array(Edge(VertexId, VertexId, edgeProperty("", "")) )
But in the first line, instead of declaring an extra edge with dummy values, I want to declare an empty array, like this:
var edgeArray = Array.empty[Edge[(Long, Long, Object)]]
edgeArray = edgeArray ++ Array(Edge(VertexId, VertexId, edgeProperty("", "")) )
But it gives me the following compilation error on '++':
type mismatch; found : Array[org.apache.spark.graphx.Edge[_ >:
(Long, Long, Object) with
net.sansa_stack.template.spark.rdf.TripleReader.edgeProperty <:
Product with Serializable]] required:
Array[org.apache.spark.graphx.Edge[(Long, Long, Object)]] Note:
org.apache.spark.graphx.Edge[_ >: (Long, Long, Object) with
net.sansa_stack.template.spark.rdf.TripleReader.edgeProperty <:
Product with Serializable] >: org.apache.spark.graphx.Edge[(Long,
Long, Object)], but class Array is invariant in type T. You may wish
to investigate a wildcard type such as _ >:
org.apache.spark.graphx.Edge[(Long, Long, Object)].
I also tried this:
edgeArray :+ Array(Edge(VertexId, VertexId, edgeProperty("", "")) )
It doesn't give me a compilation error, but nothing is added to the array.
The type of the first array is incorrect. Note that Edge is parameterized only by its property type, so the type of the expression you are trying to merge is Array[Edge[edgeProperty]]:
scala> :t Array(Edge(0L, 0L, edgeProperty("", "")))
Array[org.apache.spark.graphx.Edge[edgeProperty]]
while you defined the variable as Array.empty[Edge[(Long, Long, Object)]].
The Object part is the second problem. As you can read in the error message, Array (like any other mutable container) is invariant. So if you really want to go with Object you'll have to:
scala> var edgeArray = Array.empty[Edge[Object]]
edgeArray: Array[org.apache.spark.graphx.Edge[Object]] = Array()
scala> edgeArray = edgeArray ++ (Array(Edge(1L, 2L, edgeProperty("", "")) ): Array[Edge[Object]])
edgeArray: Array[org.apache.spark.graphx.Edge[Object]] = [Lorg.apache.spark.graphx.Edge;#338
but I would still recommend:
scala> var edgeArray = Array.empty[Edge[edgeProperty]]
edgeArray: Array[org.apache.spark.graphx.Edge[edgeProperty]] = Array()
scala> edgeArray = edgeArray ++ Array(Edge(1L, 2L, edgeProperty("", "")) )
edgeArray: Array[org.apache.spark.graphx.Edge[edgeProperty]] = [Lorg.apache.spark.graphx.Edge;#7d59e8d4
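Regarding the :+ attempt in the question: :+ returns a new array rather than mutating the one it is called on, and it appends its argument as a single element, so that line built a new array containing a nested Array and then threw it away. Assign the result back and append the Edge itself:

edgeArray = edgeArray :+ Edge(1L, 2L, edgeProperty("", ""))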

Scala read only certain parts of file

I'm trying to read an input file in Scala whose structure I know, but I only need every 9th entry. So far I have managed to read the whole thing using:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
The issue: this leaves me with a huge array (we're talking 20GB of data). Not only have I been forced to write some very ugly code to convert between RDD[Array[String]] and Array[String], but it has essentially made my code useless.
I've tried different approaches and mixes of .map(), .flatMap() and .reduceByKey(), but nothing actually puts my collected "cells" into the format I need.
Here's what is supposed to happen: Reading a folder of text files from our server, the code should read each "line" of text in the format:
*---------*
| NASDAQ: |
*---------*
exchange, stock_symbol, date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close
and only keep hold of the stock_symbol, as that is the identifier I'm counting. So far my approach has been to turn the entire thing into an array and collect every 9th index into a collected_cells var. The issue is that, based on my calculations and real-life results, that code would take 335 days to run (no joke).
Here's my current code for reference:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkNum {
  def main(args: Array[String]) {
    // Do some Scala voodoo
    val sc = new SparkContext(new SparkConf().setAppName("Spark Numerical"))

    // Set input file as per HDFS structure + input args
    val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
    val fields = lines.map(line => line.split(","))

    var collected_cells: Array[String] = new Array[String](0)
    //println("[MESSAGE] Length of CC: " + collected_cells.length)

    val divider: Long = 9
    val array_length = fields.count / divider
    val casted_length = array_length.toInt

    val indexedFields = fields.zipWithIndex
    val indexKey = indexedFields.map { case (k, v) => (v, k) }

    println("[MESSAGE] Number of lines: " + array_length)
    println("[MESSAGE] Casted length of: " + casted_length)

    for (i <- 1 to casted_length) {
      println("[URGENT DEBUG] Processing line " + i + " of " + casted_length)
      var index = 9 * i - 8
      println("[URGENT DEBUG] Index defined to be " + index)
      collected_cells :+ indexKey.lookup(index)
    }

    println("[MESSAGE] collected_cells size: " + collected_cells.length)

    val single_cells = collected_cells.flatMap(collected_cells => collected_cells)
    val counted_cells = single_cells.map(cell => (cell, 1)).reduceByKey { case (x, y) => x + y }

    // val result = counted_cells.reduceByKey((a,b) => (a+b))
    // val inmem = counted_cells.persist()
    //
    // // Collect driver into file to be put into user archive
    // inmem.saveAsTextFile("path to server location")
    // ==> Not necessary to save the result as processing time is recorded, not output
  }
}
The bottom part is currently commented out, as I was trying to debug it, but it acts as pseudo-code for what I need done. I should point out that I'm barely familiar with Scala, so things like the _ notation confuse the life out of me.
Thanks for your time.
There are some concepts that need clarification in the question:
When we execute this code:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
That does not result in a huge array the size of the data. Those expressions represent transformations of the base data, and they can be transformed further until we have reduced the data to the information set we desire; nothing is actually computed until an action forces it.
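A minimal sketch of that laziness, assuming a live SparkContext sc as in the question; no data is read until an action such as count is invoked:

val lines  = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0)) // transformation: no I/O yet
val fields = lines.map(line => line.split(","))                       // transformation: still no I/O
val n      = fields.count()                                           // action: triggers the actual read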
In this case, we want the stock_symbol field of a record encoded as CSV:
exchange, stock_symbol, date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close
I'm also going to assume that the data file contains a banner like this:
*---------*
| NASDAQ: |
*---------*
The first thing we're going to do is remove anything that looks like this banner. In fact, I'm going to assume that the first field is the name of a stock exchange, which starts with a letter. We will do this before any splitting, resulting in:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val validLines = lines.filter(line => !line.isEmpty && line.head.isLetter)
val fields = validLines.map(line => line.split(","))
It helps to write down the types of the variables, for peace of mind that we have the data types we expect. As our Scala skills progress, that might become less important. Let's rewrite the expressions above with types:
import org.apache.spark.rdd.RDD

val lines: RDD[String] = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val validLines: RDD[String] = lines.filter(line => !line.isEmpty && line.head.isLetter)
val fields: RDD[Array[String]] = validLines.map(line => line.split(","))
We are interested in the stock_symbol field, which is element #1 in a 0-based array:
val stockSymbols:RDD[String] = fields.map(record => record(1))
If we want to count the symbols, all that's left is to issue a count:
val totalSymbolCount = stockSymbols.count()
That's not very helpful because we have one entry for every record. Slightly more interesting questions would be:
How many different stock symbols we have?
val uniqueStockSymbols = stockSymbols.distinct.count()
How many records for each symbol do we have?
val countBySymbol = stockSymbols.map(s => (s,1)).reduceByKey(_+_)
In Spark 2.0, CSV support for DataFrames and Datasets is available out of the box.
Given that our data does not have a header row with the field names (as is usual with large datasets), we will need to provide the column names:
val stockDF = sparkSession.read.csv("/tmp/quotes_clean.csv").toDF("exchange", "symbol", "date", "open", "close", "volume", "price")
We can answer our questions very easily now:
val uniqueSymbols = stockDF.select("symbol").distinct().count
val recordsPerSymbol = stockDF.groupBy($"symbol").agg(count($"symbol"))
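For completeness, the two snippets above assume a SparkSession and the imports that enable the $"symbol" column syntax and the count aggregate; a sketch of that setup (the appName is arbitrary):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

val sparkSession = SparkSession.builder().appName("stock-counts").getOrCreate()
import sparkSession.implicits._ // enables $"symbol"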

Passing arrays to lua as function argument from Fortran

I am looking for Fortran examples (and interface functions) of passing arrays as arguments to Lua functions. I was able to use the fortlua project to get started, but the example provided passes one element at a time. I'd appreciate any help.
--Lua code
local q1
local q2

function getoutput(qout1, qout2)
  -- qout1 and qout2 are arrays with some dimension
  q1 = qout1
  q2 = qout2
end
-- in fortran I used
config_function('getoutput', args, 2, cstatus)
But setting up the args is where I need some help. The following code does the job for a scalar argument variable, but not for an array, I guess.
!> Evaluate a function in the config file and get its result.
FUNCTION config_function(name,args,nargs,status)
  REAL :: config_function
  CHARACTER(LEN=*) :: name
  REAL, DIMENSION(nargs) :: args
  REAL(KIND=c_double) :: anarg
  INTEGER :: nargs
  INTEGER :: status
  INTEGER :: iargs
  INTEGER(c_int) :: stackstart

  stackstart = lua_gettop(mluastate)

  config_function = 0
  status = 0
  CALL lua_getglobal(mluastate,TRIM(name)//C_NULL_CHAR)
  IF ( lua_type(mluastate,-1) .eq. LUA_TFUNCTION ) THEN
    DO iargs = 1,nargs
      anarg = args(iargs)
      CALL lua_pushnumber(mluastate,anarg)
    ENDDO
    IF (lua_pcall(mluastate,nargs,1,0) .eq. 0) THEN
      IF (lua_isnumber(mluastate,-1) .ne. 0) THEN
        config_function = lua_tonumber(mluastate,-1)
        CALL lua_settop(mluastate,-2)
      ELSE
        ! Nothing to pop here
        status=-3
      ENDIF
    ELSE
      CALL lua_settop(mluastate,-2)
      status=-2
    ENDIF
  ELSE
    CALL lua_settop(mluastate,-2)
    status=-1
  ENDIF
  IF (stackstart .ne. lua_gettop(mluastate)) THEN
    WRITE(*,*) 'The stack is a different size coming out of config_function'
  ENDIF
END FUNCTION config_function
To expand a little bit on my comment, here is a small program implementing an array argument with the help of Aotus:
program aot_vecarg_test
  use flu_binding, only: flu_State, flu_settop
  use aotus_module, only: open_config_file, close_config
  use aot_fun_module, only: aot_fun_type, aot_fun_do, &
    &                       aot_fun_put, aot_fun_open, &
    &                       aot_fun_close
  use aot_references_module, only: aot_reference_for, aot_reference_to_top
  use aot_table_module, only: aot_table_open, aot_table_close, &
    &                         aot_table_from_1Darray

  implicit none

  type(flu_State) :: conf
  type(aot_fun_type) :: luafun
  integer :: iError
  character(len=80) :: ErrString
  real :: args(2)
  integer :: argref
  integer :: arghandle

  args(1) = 1.0
  args(2) = 2.0

  call create_script('aot_vecarg_test_config.lua')
  write(*,*)
  write(*,*) 'Running aot_vecarg_test...'
  write(*,*) ' * open_config_file (aot_vecarg_test_config.lua)'
  call open_config_file(L = conf, filename = 'aot_vecarg_test_config.lua', &
    &                   ErrCode = iError, ErrString = ErrString)
  if (iError /= 0) then
    write(*,*) ' : unexpected FATAL Error occurred !!!'
    write(*,*) ' : Could not open the config file aot_vecarg_test_config.lua:'
    write(*,*) trim(ErrString)
    STOP
  end if
  write(*,*) ' : success.'

  ! Create a table with data
  call aot_table_from_1Darray( L = conf, &
    &                          thandle = arghandle, &
    &                          val = args )

  ! Create a reference to this table
  call flu_setTop(L = conf, n = arghandle)
  argref = aot_reference_for(L = conf)

  ! Start the processing of the function
  call aot_fun_open(L = conf, fun = luafun, key = 'print_array')

  ! Put the previously defined table onto the stack by using the reference
  call aot_reference_to_top(L = conf, ref = argref)

  ! Put the top of the stack to the argument list of the Lua function
  call aot_fun_put(L = conf, fun = luafun)

  ! Execute the Lua function
  call aot_fun_do(L = conf, fun = luafun, nresults = 0)
  call aot_fun_close(L = conf, fun = luafun)

  write(*,*) ' * close_conf'
  call close_config(conf)
  write(*,*) ' : success.'
  write(*,*) '... Done with aot_vecarg_test.'
  write(*,*) 'PASSED'

contains

  subroutine create_script(filename)
    character(len=*) :: filename

    open(file=trim(filename), unit=22, action='write', status='replace')
    write(22,*) '-- test script for vectorial argument'
    write(22,*) 'function print_array(x)'
    write(22,*) '  for i, num in ipairs(x) do'
    write(22,*) '    print("Lua:"..num)'
    write(22,*) '  end'
    write(22,*) 'end'
    close(22)
  end subroutine create_script

end program aot_vecarg_test
This makes use of a little helper routine, aot_table_from_1Darray, which creates a Lua table for an array of real numbers. Have a look at its code to see how data can be put into a table.
We then create a reference to this table, to look it up easily later on, and pass it as an argument to the Lua function.
The example creates the corresponding Lua script itself, which defines a simple function that expects a single table as input and prints each of the table's entries. Running this yields the following output:
Running aot_vecarg_test...
* open_config_file (aot_vecarg_test_config.lua)
: success.
Lua:1.0
Lua:2.0
* close_conf
: success.
... Done with aot_vecarg_test.
PASSED
Where the two lines starting with Lua are written by the Lua function print_array.
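In Lua terms, the Fortran calls above amount to something like this (an illustrative sketch, not generated code):

local t = {1.0, 2.0} -- the table built by aot_table_from_1Darray from args
print_array(t)       -- the reference to t is pushed and the function is called with it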
There are other possible solutions, but I hope this gives at least some idea of how this could be done. We could also think about extending the aot_fun_put interface to take care of arrays itself.

writing p-values to file in R

Can someone help me with this piece of code? In a loop I'm saving p-values in f, and then I want to write the p-values to a file, but I don't know which function to use. I'm getting an error with the write function.
{
  f = fisher.test(x, y = NULL, hybrid = FALSE, alternative = "greater",
                  conf.int = TRUE, conf.level = 0.95, simulate.p.value = FALSE)
  write(f, file = "fisher_pvalues.txt", sep = " ", append = TRUE)
}
Error in cat(list(...), file, sep, fill, labels, append) :
argument 1 (type 'list') cannot be handled by 'cat'
The return value from fisher.test is (if you read the docs):
Value:
  A list with class ‘"htest"’ containing the following components:
    p.value: the p-value of the test.
    conf.int: a confidence interval for the odds ratio. Only present in
              the 2 by 2 case and if argument ‘conf.int = TRUE’.
etc etc. R doesn't know how to write things like that to a file. More precisely, it doesn't know how YOU want it written to a file.
If you just want to write the p value, then get the p value and write that:
write(f$p.value,file="foo.values",append=TRUE)
f is an object of class 'htest', so writing it to a file will write much more than just the p-value.
If you do want to simply save a written representation of the results to a file, just as they appear on the screen, you can use capture.output() to do so:
Convictions <- matrix(c(2, 10, 15, 3),
                      nrow = 2,
                      dimnames = list(c("Dizygotic", "Monozygotic"),
                                      c("Convicted", "Not convicted")))
f <- fisher.test(Convictions, alternative = "less")
capture.output(f, file="fisher_pvalues.txt", append=TRUE)
More likely, you want to just store the p-value. In that case you need to extract it from f before writing it to the file, using code something like this:
write(paste("p-value from Experiment 1:", f$p.value, "\n"),
file = "fisher_pvalues.txt", append=TRUE)
