I have created a case class like this:
def case_class(): Unit = {
case class StockPrice(quarter : Byte,
stock : String,
date : String,
open : Double,
high : Double,
low : Double,
close : Double,
volume : Double,
percent_change_price : Double,
percent_change_volume_over_last_wk : Double,
previous_weeks_volume : Double,
next_weeks_open : Double,
next_weeks_close : Double,
percent_change_next_weeks_price : Double,
days_to_next_dividend : Double,
percent_return_next_dividend : Double
)
And I have thousands of line as Array of String like this:
1,AA,1/7/2011,$15.82,$16.72,$15.78,$16.42,239655616,3.79267,,,$16.71,$15.97,-4.42849,26,0.182704
1,AA,1/14/2011,$16.71,$16.71,$15.64,$15.97,242963398,-4.42849,1.380223028,239655616,$16.19,$15.79,-2.47066,19,0.187852
1,AA,1/21/2011,$16.19,$16.38,$15.60,$15.79,138428495,-2.47066,-43.02495926,242963398,$15.87,$16.13,1.63831,12,0.189994
1,AA,1/28/2011,$15.87,$16.63,$15.82,$16.13,151379173,1.63831,9.355500109,138428495,$16.18,$17.14,5.93325,5,0.185989
How Can I parse data from Array into that case class?
Thank you for your help!
You can proceed as below (I've taken simplified example)
Given your case class and data (lines)
// Your case-class
case class MyCaseClass(
fieldByte: Byte,
fieldString: String,
fieldDouble: Double
)
// input data
val lines: List[String] = List(
"1,AA,$1.1",
"2,BB,$2.2",
"3,CC,$3.3"
)
Note: you can read lines from a text file as
val lines = Source.fromFile("my_file.txt").getLines.toList
You can have some utility methods for mapping (cleaning & parsing)
// remove '$' symbols from string
def removeDollars(line: String): String = line.replaceAll("\\$", "")
// split string into tokens and
// convert into MyCaseClass object
def parseLine(line: String): MyCaseClass = {
val tokens: Seq[String] = line.split(",")
MyCaseClass(
fieldByte = tokens(0).toByte,
fieldString = tokens(1),
fieldDouble = tokens(2).toDouble
)
}
And then use them to convert strings into case-class objects
// conversion
val myCaseClassObjects: Seq[MyCaseClass] = lines.map(removeDollars).map(parseLine)
As a more advanced (and generalized) approach, you can generate the mapping (parsing) function for converting tokens into fields of your case-class using something like reflection, as told here
Here's one way of doing it. I'd recommend splitting everything you do up into lots of small, easy-to-manage functions, otherwise you will get lost trying to figure out where something is going wrong if it all starts throwing exceptions. Data setup:
val array = Array("1,AA,1/7/2011,$15.82,$16.72,$15.78,$16.42,239655616,3.79267,,,$16.71,$15.97,-4.42849,26,0.182704",
"1,AA,1/14/2011,$16.71,$16.71,$15.64,$15.97,242963398,-4.42849,1.380223028,239655616,$16.19,$15.79,-2.47066,19,0.187852",
"1,AA,1/21/2011,$16.19,$16.38,$15.60,$15.79,138428495,-2.47066,-43.02495926,242963398,$15.87,$16.13,1.63831,12,0.189994",
"1,AA,1/28/2011,$15.87,$16.63,$15.82,$16.13,151379173,1.63831,9.355500109,138428495,$16.18,$17.14,5.93325,5,0.185989")
case class StockPrice(quarter: Byte, stock: String, date: String, open: Double,
high: Double, low: Double, close: Double, volume: Double, percent_change_price: Double,
percent_change_volume_over_last_wk: Double, previous_weeks_volume: Double,
next_weeks_open: Double, next_weeks_close: Double, percent_change_next_weeks_price: Double,
days_to_next_dividend: Double, percent_return_next_dividend: Double
)
Function to turn Array[String] into Array[List[String]] and handle any empty fields (I've made an assumption here that you want empty fields to be 0. Change this as necessary):
def splitArray(arr: Array[String]): Array[List[String]] = {
arr.map(
_.replaceAll("\\$", "") // Remove $
.split(",") // Split by ,
.map {
case x if x.isEmpty => "0" // If empty
case y => y // If not empty
}
.toList
)
}
Function to turn a List[String] into a StockPrice. Note that this will fall over if the List isn't exactly 16 items long. I'll leave you to handle any of that. Also, the names are pretty non-descriptive so you can change that too. It will also fall over if your data doesn't map to the relevant .toDouble or toByte or whatever - you can handle this yourself too:
def toStockPrice: List[String] => StockPrice = {
case a :: b :: c :: d :: e :: f :: g :: h :: i :: j :: k :: l :: m :: n :: o :: p :: Nil =>
StockPrice(a.toByte, b, c, d.toDouble, e.toDouble, f.toDouble, g.toDouble, h.toDouble, i.toDouble, j.toDouble,
k.toDouble, l.toDouble, m.toDouble, n.toDouble, o.toDouble, p.toDouble)
}
A nice function to bring this all together:
def makeCaseClass(arr: Array[String]): Seq[StockPrice] = {
val splitArr: Array[List[String]] = splitArray(arr)
splitArr.map(toStockPrice)
}
Output:
println(makeCaseClass(array))
//ArraySeq(
// StockPrice(1,AA,1/7/2011,15.82,16.72,15.78,16.42,2.39655616E8,3.79267,0.0,0.0,16.71,15.97,-4.42849,26.0,0.182704),
// StockPrice(1,AA,1/14/2011,16.71,16.71,15.64,15.97,2.42963398E8,-4.42849,1.380223028,2.39655616E8,16.19,15.79,-2.47066,19.0,0.187852),
// StockPrice(1,AA,1/21/2011,16.19,16.38,15.6,15.79,1.38428495E8,-2.47066,-43.02495926,2.42963398E8,15.87,16.13,1.63831,12.0,0.189994),
// StockPrice(1,AA,1/28/2011,15.87,16.63,15.82,16.13,1.51379173E8,1.63831,9.355500109,1.38428495E8,16.18,17.14,5.93325,5.0,0.185989)
//)
Edit:
To explain the a :: b :: c ..... bit - this is a way of assigning names to items in a List or Seq, given you know the List's size.
val ls = List(1, 2, 3)
val a :: b :: c :: Nil = List(1, 2, 3)
println(a == ls.head) // true
println(b == ls(1)) // true
println(c == ls(2)) // true
Note that the Nil is important because it signifies the last element of the List being Nil. Without it, c would be equal to List(3) as the rest of any List is assigned to the last value in your definition.
You can use this in pattern matching as I have in order to do something with the results:
val ls = List(1, "b", true)
ls match {
case a :: b :: c if c == true => println("this will not be printed")
case a :: b :: c :: Nil if c == true => println(s"this will get printed because c == $c")
} // not exhaustive but you get the point
You can also use it if you know what you want each element in the List to be, like this:
val personCharacteristics = List("James", 26, "blue", 6, 85.4, "brown")
val name :: age :: eyeColour :: otherCharacteristics = personCharacteristics
println(s"Name: $name; Age: $age; Eye colour: $eyeColour")
// Name: James; Age: 26; Eye colour: blue
Obviously these examples are pretty trivial and not exactly what you'd see as a professional Scala developer (at least I don't), but it's a handy thing to be aware of as I do still use this :: syntax at work sometimes.
I start with an empty array, and a Hash of key, values.
I would like to iterate over the Hash and compare it against the empty array. If the value for each k,v pair doesn't already exist in the array, I would like to create an object with that value and then access an object method to append the key to an array inside the object.
This is my code
class Test
def initialize(name)
#name = name
#values = []
end
attr_accessor :name
def values=(value)
#values << value
end
def add(value)
#values.push(value)
end
end
l = []
n = {'server_1': 'cluster_x', 'server_2': 'cluster_y', 'server_3': 'cluster_z', 'server_4': 'cluster_x', 'server_5': 'cluster_y'}
n.each do |key, value|
l.any? do |a|
if a.name == value
a.add(key)
else
t = Test.new(value)
t.add(key)
l << t
end
end
end
p l
I would expect to see this:
[
#<Test:0x007ff8d10cd3a8 #name=:cluster_x, #values=["server_1, server_4"]>,
#<Test:0x007ff8d10cd2e0 #name=:cluster_y, #values=["server_2, server_5"]>,
#<Test:0x007ff8d10cd1f0 #name=:cluster_z, #values=["server_3"]>
]
Instead I just get an empty array.
I think that the condition if a.name == value is not being met and then the add method isn't being called.
#Cyzanfar gave me a clue as to what to look for, and I found the answer here
https://stackoverflow.com/a/34904864/5006720
n.each do |key, value|
found = l.detect {|e| e.name == value}
if found
found.add(key)
else
t = Test.new(value)
t.add(key)
l << t
end
end
#ARL you're almost there! The last thing you need to consider is when found actually returns an object since detect will find a matching one at some point.
n.each do |key, value|
found = l.detect {|e| e.name == value}
if found
found.add(key)
else
t = Test.new(value)
t.add(key)
l << t
end
end
You actually only want to add a new instance of Test when found return nil. This code should yield your desired output:
[
#<Test:0x007ff8d10cd3a8 #name=:cluster_x, #values=["server_1, server_4"]>,
#<Test:0x007ff8d10cd2e0 #name=:cluster_y, #values=["server_2, server_5"]>,
#<Test:0x007ff8d10cd1f0 #name=:cluster_z, #values=["server_3"]>
]
I observe two things in your code :
def values=(value)
#values << value
def add(value)
#values.push(value)
two methods do the same thing, pushing a value, as << is a kind of syntactic sugar meaning push
you have changed the meaning of values=, which is usually reserved for a setter method, equivalent to attire_writer :values.
Just to illustrate that there are many ways to do things in Ruby, I propose the following :
class Test
def initialize(name, value)
#name = name
#values = [value]
end
def add(value)
#values << value
end
end
h_cluster = {} # intermediate hash whose key is the cluster name
n = {'server_1': 'cluster_x', 'server_2': 'cluster_y', 'server_3': 'cluster_z',
'server_4': 'cluster_x', 'server_5': 'cluster_y'}
n.each do | server, cluster |
puts "server=#{server}, cluster=#{cluster}"
cluster_found = h_cluster[cluster] # does the key exist ? => nil or Test
# instance with servers list
puts "cluster_found=#{cluster_found.inspect}"
if cluster_found
then # add server to existing cluster
cluster_found.add(server)
else # create a new cluster
h_cluster[cluster] = Test.new(cluster, server)
end
end
p h_cluster.collect { | cluster, servers | servers }
Execution :
$ ruby -w t.rb
server=server_1, cluster=cluster_x
cluster_found=nil
server=server_2, cluster=cluster_y
cluster_found=nil
server=server_3, cluster=cluster_z
cluster_found=nil
server=server_4, cluster=cluster_x
cluster_found=#<Test:0x007fa7a619ae10 #name="cluster_x", #values=[:server_1]>
server=server_5, cluster=cluster_y
cluster_found=#<Test:0x007fa7a619ac58 #name="cluster_y", #values=[:server_2]>
[#<Test:0x007fa7a619ae10 #name="cluster_x", #values=[:server_1, :server_4]>,
#<Test:0x007fa7a619ac58 #name="cluster_y", #values=[:server_2, :server_5]>,
#<Test:0x007fa7a619aac8 #name="cluster_z", #values=[:server_3]>]
I'm trying to read an input file in Scala that I know the structure of, however I only need every 9th entry. So far I have managed to read the whole thing using:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
The issue, this leaves me with an array that is huge (we're talking 20GB of data). Not only have I seen myself forced to write some very ugly code in order to convert between RDD[Array[String]] and Array[String] but it's essentially made my code useless.
I've tried different approaches and mixes between using
.map()
.flatMap() and
.reduceByKey()
however nothing actually put my collected "cells" into the format that I need them to be.
Here's what is supposed to happen: Reading a folder of text files from our server, the code should read each "line" of text in the format:
*---------*
| NASDAQ: |
*---------*
exchange, stock_symbol, date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close
and only keep a hold of the stock_symbol as that is the identifier I'm counting. So far my attempts have been to turn the entire thing into an array only collect every 9th index from the first one into a collected_cells var. Issue is, based on my calculations and real life results, that code would take 335 days to run (no joke).
Here's my current code for reference:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SparkNum {
def main(args: Array[String]) {
// Do some Scala voodoo
val sc = new SparkContext(new SparkConf().setAppName("Spark Numerical"))
// Set input file as per HDFS structure + input args
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
var collected_cells:Array[String] = new Array[String](0)
//println("[MESSAGE] Length of CC: " + collected_cells.length)
val divider:Long = 9
val array_length = fields.count / divider
val casted_length = array_length.toInt
val indexedFields = fields.zipWithIndex
val indexKey = indexedFields.map{case (k,v) => (v,k)}
println("[MESSAGE] Number of lines: " + array_length)
println("[MESSAGE] Casted lenght of: " + casted_length)
for( i <- 1 to casted_length ) {
println("[URGENT DEBUG] Processin line " + i + " of " + casted_length)
var index = 9 * i - 8
println("[URGENT DEBUG] Index defined to be " + index)
collected_cells :+ indexKey.lookup(index)
}
println("[MESSAGE] collected_cells size: " + collected_cells.length)
val single_cells = collected_cells.flatMap(collected_cells => collected_cells);
val counted_cells = single_cells.map(cell => (cell, 1).reduceByKey{case (x, y) => x + y})
// val result = counted_cells.reduceByKey((a,b) => (a+b))
// val inmem = counted_cells.persist()
//
// // Collect driver into file to be put into user archive
// inmem.saveAsTextFile("path to server location")
// ==> Not necessary to save the result as processing time is recorded, not output
}
}
The bottom part is currently commented out as I tried to debug it, but it acts as pseudo-code for me to know what I need done. I may want to point out that I am next to not at all familiar with Scala and hence things like the _ notation confuse the life out of me.
Thanks for your time.
There are some concepts that need clarification in the question:
When we execute this code:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val fields = lines.map(line => line.split(","))
That does not result in a huge array of the size of the data. That expression represents a transformation of the base data. It can be further transformed until we reduce the data to the information set we desire.
In this case, we want the stock_symbol field of a record encoded a csv:
exchange, stock_symbol, date, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close
I'm also going to assume that the data file contains a banner like this:
*---------*
| NASDAQ: |
*---------*
The first thing we're going to do is to remove anything that looks like this banner. In fact, I'm going to assume that the first field is the name of a stock exchange that start with an alphanumeric character. We will do this before we do any splitting, resulting in:
val lines = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val validLines = lines.filter(line => !line.isEmpty && line.head.isLetter)
val fields = validLines.map(line => line.split(","))
It helps to write the types of the variables, to have peace of mind that we have the data types that we expect. As we progress in our Scala skills that might become less important. Let's rewrite the expression above with types:
val lines: RDD[String] = sc.textFile("hdfs://moonshot-ha-nameservice/" + args(0))
val validLines: RDD[String] = lines.filter(line => !line.isEmpty && line.head.isLetter)
val fields: RDD[Array[String]] = validLines.map(line => line.split(","))
We are interested in the stock_symbol field, which positionally is the element #1 in a 0-based array:
val stockSymbols:RDD[String] = fields.map(record => record(1))
If we want to count the symbols, all that's left is to issue a count:
val totalSymbolCount = stockSymbols.count()
That's not very helpful because we have one entry for every record. Slightly more interesting questions would be:
How many different stock symbols we have?
val uniqueStockSymbols = stockSymbols.distinct.count()
How many records for each symbol do we have?
val countBySymbol = stockSymbols.map(s => (s,1)).reduceByKey(_+_)
In Spark 2.0, CSV support for Dataframes and Datasets is available out of the box
Given that our data does not have a header row with the field names (what's usual in large datasets), we will need to provide the column names:
val stockDF = sparkSession.read.csv("/tmp/quotes_clean.csv").toDF("exchange", "symbol", "date", "open", "close", "volume", "price")
We can answer our questions very easy now:
val uniqueSymbols = stockDF.select("symbol").distinct().count
val recordsPerSymbol = stockDF.groupBy($"symbol").agg(count($"symbol"))