map reduce sum item weights in a string - arrays

I have a string like the following:
s = "eggs 103.24,eggs 345.22,milk 231.25,widgets 123.11,milk 14.2"
such that a pair of item and its corresponding weights is separated by a comma, and the item name and its weight is by a space. I want to get the sum of the weights for each item:
//scala.collection.immutable.Map[String,Double] = Map(eggs -> 448.46, milk -> 245.45, widgets -> 123.11)
I have done the following but got stuck on the steps of separating out the item and its weight:
s.split(",").map(w=>(w,1)).sortWith(_._1 < _._1)
//Array[(String, Int)] = Array((eggs 345.22,1), (milk 14.2,1), (milk 231.25,1), (widgets 103.24,1), (widgets 123.11,1))
I think to proceed, for each element in the array I need to separate out the item name and weight separated by space, but when I tried the following I got quite confused:
s.split(",").map(w=>(w,1)).sortWith(_._1 < _._1).map(w => w._1.split(" ") )
//Array[Array[String]] = Array(Array(eggs, 345.22), Array(milk, 14.2), Array(milk, 231.25), Array(widgets, 103.24), Array(widgets, 123.11))
I am not sure what the next steps should be to proceed the calculations.

If you guaranteed to have the string in this format (so no exceptions and edge cases handling) you can do something like that:
val s = "eggs 103.24,eggs 345.22,milk 231.25,widgets 123.11,milk 14.2"
val result = s
.split(",") // array of strings like "eggs 103.24"
.map(_.split(" ")) // sequence of arrays like ["egg", "103.24"]
.map { case Array(x, y) => (x, y.toFloat)} // convert to tuples (key, number)
.groupBy(_._1) // group by key
.map(t => (t._1, t._2.map(_._2).sum)) // process groups, results in Map(eggs -> 448.46, ...)

Similar to what #GuruStron proposed, but handling possible errors (by just ignoring any kind of malformed data).
Also this one requires Scala 2.13+, older versions won't work.
def mapReduce(data: String): Map[String, Double] =
data
.split(',')
.iterator
.map(_.split(' '))
.collect {
case Array(key, value) =>
key.trim.toLowerCase -> value.toDoubleOption.getOrElse(default = 0)
}.toList
.groupMapReduce(_._1)(_._2)(_ + _)

Related

How to add every element of a list at the end of every element of another list in scala?

I would like to add element of a list at the end of every element of another list.
I have :
val Cars_tmp :List[String] = List("Cars|10|Paris|5|Type|New|", "Cars|15|Paris|3|Type|New|")
=> Result : List[String] = List("Cars|10|Paris|5|Type|New|", "Cars|15|Paris|3|Type|New|")
val Values_tmp: List[String] = a.map(r => ((r.split("[|]")(1).toInt)/ (r.split("[|]")(3).toInt)).toString ).toList
=> Result : List[String] = List(2, 5)
I would like to have the following result (first element of Values_tmp is concatenate with first element of Cars_tmp, second element of Values_tmp is concatenate with second element of Cars_tmp...) like below:
List("Cars|10|Paris|5|Type|New|2", "Cars|15|Paris|3|Type|New|5")
I tried to do this:
Values_tmp.foldLeft( Seq[String](), Cars_tmp) { case ((acc, rest), elmt) => ((rest :+ elmt)::acc) }
I have the following error:
console>:28: error: type mismatch;
found : scala.collection.immutable.IndexedSeq[Any]
required: List[String]
Than you for your help.
Try to avoid zip, it "fails" silently when the iterables do not have the same size. (In your code, it seems obvious that the 2 lists have the same size, but for more complex code, this is not obvious.)
You can compute the "value" you need and concatenate it on the fly:
val Cars_tmp: List[String] = List("Cars|10|Paris|5|Type|New|", "Cars|15|Paris|3|Type|New|")
def getValue(str: String): String = {
val Array(_, a, _, b, _, _) = str.split('|') // Note the single quote for the split.
(a.toInt / b.toInt).toString
}
Cars_tmp.map(str => str + getValue(str))
I proposed an implementation of getValue using the unapply of Arrays, but you can keep your implementation !
def getValue(r: String) = ((r.split("[|]")(1).toInt)/ (r.split("[|]")(3).toInt)).toString

Create Tuple out of Array(Array[String) of Varying Sizes using Scala

I am new to scala and I am trying to make a Tuple pair out an RDD of type Array(Array[String]) that looks like:
(122abc,223cde,334vbn,445das),(221bca,321dsa),(231dsa,653asd,698poq,897qwa)
I am trying to create Tuple Pairs out of these arrays so that the first element of each array is key and and any other part of the array is a value. For example the output would look like:
122abc 223cde
122abc 334vbn
122abc 445das
221bca 321dsa
231dsa 653asd
231dsa 698poq
231dsa 897qwa
I can't figure out how to separate the first element from each array and then map it to every other element.
If I'm reading it correctly, the core of your question has to do with separating the head (first element) of the inner arrays from the tail (remaining elements), which you can use the head and tail methods. RDDs behave a lot like Scala lists, so you can do this all with what looks like pure Scala code.
Given the following input RDD:
val input: RDD[Array[Array[String]]] = sc.parallelize(
Seq(
Array(
Array("122abc","223cde","334vbn","445das"),
Array("221bca","321dsa"),
Array("231dsa","653asd","698poq","897qwa")
)
)
)
The following should do what you want:
val output: RDD[(String,String)] =
input.flatMap { arrArrStr: Array[Array[String]] =>
arrArrStr.flatMap { arrStrs: Array[String] =>
arrStrs.tail.map { value => arrStrs.head -> value }
}
}
And in fact, because of how the flatMap/map is composed, you could re-write it as a for-comprehension.:
val output: RDD[(String,String)] =
for {
arrArrStr: Array[Array[String]] <- input
arrStr: Array[String] <- arrArrStr
str: String <- arrStr.tail
} yield (arrStr.head -> str)
Which one you go with is ultimately a matter of personal preference (though in this case, I prefer the latter, as you don't have to indent code as much).
For verification:
output.collect().foreach(println)
Should print out:
(122abc,223cde)
(122abc,334vbn)
(122abc,445das)
(221bca,321dsa)
(231dsa,653asd)
(231dsa,698poq)
(231dsa,897qwa)
This is a classic fold operation; but folding in Spark is calling aggregate:
// Start with an empty array
data.aggregate(Array.empty[(String, String)]) {
// `arr.drop(1).map(e => (arr.head, e))` will create tuples of
// all elements in each row and the first element.
// Append this to the aggregate array.
case (acc, arr) => acc ++ arr.drop(1).map(e => (arr.head, e))
}
The solution is a non-Spark environment:
scala> val data = Array(Array("122abc","223cde","334vbn","445das"),Array("221bca","321dsa"),Array("231dsa","653asd","698poq","897qwa"))
scala> data.foldLeft(Array.empty[(String, String)]) { case (acc, arr) =>
| acc ++ arr.drop(1).map(e => (arr.head, e))
| }
res0: Array[(String, String)] = Array((122abc,223cde), (122abc,334vbn), (122abc,445das), (221bca,321dsa), (231dsa,653asd), (231dsa,698poq), (231dsa,897qwa))
Convert your input element to seq and all and then try to write the wrapper which will give you List(List(item1,item2), List(item1,item2),...)
Try below code
val seqs = Seq("122abc","223cde","334vbn","445das")++
Seq("221bca","321dsa")++
Seq("231dsa","653asd","698poq","897qwa")
Write a wrapper to convert seq into a pair of two
def toPairs[A](xs: Seq[A]): Seq[(A,A)] = xs.zip(xs.tail)
Now send your seq as params and it it will give your pair of two
toPairs(seqs).mkString(" ")
After making it to string you will get the output like
res8: String = (122abc,223cde) (223cde,334vbn) (334vbn,445das) (445das,221bca) (221bca,321dsa) (321dsa,231dsa) (231dsa,653asd) (653asd,698poq) (698poq,897qwa)
Now you can convert your string, however, you want.
Using df and explode.
val df = Seq(
Array("122abc","223cde","334vbn","445das"),
Array("221bca","321dsa"),
Array("231dsa","653asd","698poq","897qwa")
).toDF("arr")
val df2 = df.withColumn("key", 'arr(0)).withColumn("values",explode('arr)).filter('key =!= 'values).drop('arr).withColumn("tuple",struct('key,'values))
df2.show(false)
df2.rdd.map( x => Row( (x(0),x(1)) )).collect.foreach(println)
Output:
+------+------+---------------+
|key |values|tuple |
+------+------+---------------+
|122abc|223cde|[122abc,223cde]|
|122abc|334vbn|[122abc,334vbn]|
|122abc|445das|[122abc,445das]|
|221bca|321dsa|[221bca,321dsa]|
|231dsa|653asd|[231dsa,653asd]|
|231dsa|698poq|[231dsa,698poq]|
|231dsa|897qwa|[231dsa,897qwa]|
+------+------+---------------+
[(122abc,223cde)]
[(122abc,334vbn)]
[(122abc,445das)]
[(221bca,321dsa)]
[(231dsa,653asd)]
[(231dsa,698poq)]
[(231dsa,897qwa)]
Update1:
Using paired rdd
val df = Seq(
Array("122abc","223cde","334vbn","445das"),
Array("221bca","321dsa"),
Array("231dsa","653asd","698poq","897qwa")
).toDF("arr")
import scala.collection.mutable._
val rdd1 = df.rdd.map( x => { val y = x.getAs[mutable.WrappedArray[String]]("arr")(0); (y,x)} )
val pair = new PairRDDFunctions(rdd1)
pair.flatMapValues( x => x.getAs[mutable.WrappedArray[String]]("arr") )
.filter( x=> x._1 != x._2)
.collect.foreach(println)
Results:
(122abc,223cde)
(122abc,334vbn)
(122abc,445das)
(221bca,321dsa)
(231dsa,653asd)
(231dsa,698poq)
(231dsa,897qwa)

Scala: Using a value from an array to search in another array

I have two arrays:
GlobalArray:Array(Int,Array[String]) and SpecificArray:Array(Int,Int).
The first Int in both of them is a key and I would like to get the element corresponding to that key from the GlobalArray.
In pseudocode:
val v1
For each element of SpecificArray
Get the corresponding element from GlobalArray to use its Array[String]
If (sum <= 100)
for each String of the Array
update v1
// ... some calculation
sum += 1
println (v1)
I know using .map() I could go through each position of the SpecificArray, but so far I was able to do this:
SpecificArray.map{x => val in_global = GlobalArray.filter(e => (e._1 == x._1))
// I don't know how to follow
}
How about something like below, I would prefer to for comprehension code which has better readability.
var sum:Int = 0
var result:String = ""
for {
(k1,v1) <- SpecificArray //v1 is the second int from specific array
(k2,values) <- GlobalArray if k1 == k2 //values is the second array from global array
value <- values if sum < 100 //value is one value frome the values
_ = {sum+=1; result += s"(${v1}=${value})"} //Update something here with the v1 and value
} yield ()
println(result)
Note needs more optimization
Convert GlobalArray to Map for faster lookup.
val GlobalMap = GlobalArray.toMap
SpecificArray.flatMap(x => GlobalMap(x._1))
.foldLeft(0)((sum:Int, s:String) => {
if(sum<=100) {
// update v1
// some calculation
}
sum+1
})
If not all keys of SpecificArray is present in GlobalMap then use GlobalMap.getOrElse(x._1, Array())
How sum affects the logic and what exactly is v1 is not clear from your code, but it looks like you do search through GlobalArray many times. If this is so, it makes sense to convert this array into a more search-friendly data structure: Map. You can do it like this
val globalMap = GlobalArray.toMap
and then you may use to join the strings like this
println(SpecificArray.flatMap({case (k,v) => globalMap(k).map(s => (k,v,s))}).toList)
If all you need is strings you may use just
println(SpecificArray.flatMap({case (k,v) => globalMap(k)}).toList)
Note that this code assumes that for every key in the SpecificArray there will be a matching key in the GlobalArray. If this is not the case, you should use some other method to access the Map like getOrElse:
println(SpecificArray.flatMap({case (k,v) => globalMap.getOrElse(k, Array()).map(s => (k,v,s))}).toList)
Update
If sum is actually count and it works for whole "joined" data rather than for each key in the SpecificArray, you may use take instead of it. Code would go like this:
val joined = SpecificArray2.flatMap({case (k,v) => globalMap.getOrElse(k, Array()).map(s => (s,v))})
.take(100) // use take instead of sum
And then you may use joined whatever way you want. And updated demo that builds v1 as joined string of form v1 += String_of_GlobalArray + " = " + 2nd_Int_of_SpecificArray is here. The idea is to use mkString instead of explicit variable update.

Converting datatypes in Spark/Scala

I have a variable in scala called a which is as below
scala> a
res17: Array[org.apache.spark.sql.Row] = Array([0_42], [big], [baller], [bitch], [shoe] ..)
It is an array of lists which contains a single word.
I would like to convert it to a single array consisting of sequence of strings like shown below
Array[Seq[String]] = Array(WrappedArray(0_42,big,baller,shoe,?,since,eluid.........
Well the reason why I am trying to create an array of single wrapped array is I want to run word2vec model in spark using MLLIB.
The fit() function in this only takes iterable string.
scala> val model = word2vec.fit(b)
<console>:41: error: inferred type arguments [String] do not conform to method fit's type parameter bounds [S <: Iterable[String]]
The sample data you're listing is not an array of lists, but an array of Rows. An array of a single WrappedArray you're trying to create also doesn't seem to serve any meaningful purpose.
If you want to create an array of all the word strings in your Array[Row] data structure, you can simply use a map like in the following:
val df = Seq(
("0_42"), ("big"), ("baller"), ("bitch"), ("shoe"), ("?"), ("since"), ("eliud"), ("win")
).toDF("word")
val a = df.rdd.collect
// a: Array[org.apache.spark.sql.Row] = Array(
// [0_42], [big], [baller], [bitch], [shoe], [?], [since], [eliud], [win]
// )
import org.apache.spark.sql.Row
val b = a.map{ case Row(w: String) => w }
// b: Array[String] = Array(0_42, big, baller, bitch, shoe, ?, since, eliud, win)
[UPDATE]
If you do want to create an array of a single WrappedArray, here's one approach:
val b = Array( a.map{ case Row(w: String) => w }.toSeq )
// b: Array[Seq[String]] = Array(WrappedArray(
// 0_42, big, baller, bitch, shoe, ?, since, eliud, win
// ))
I finally got it working by doing the following
val db=a.map{ case Row(word: String) => word }
val model = word2vec.fit( b.map(l=>Seq(l)))

How do I remove duplicate values from my Multidimensional array in a Scala way?

I'm trying extract some values from a String. The string contains several lines with values. The values on each line are number, firstname, last name. Then I want to filter by a given pattern and remove the duplicate numbers.
This is my test:
test("Numbers should be unique") {
val s = Cool.prepareListAccordingToPattern(ALLOWED_PATTERN, "1234,örjan,nilsson\n4321,eva-lisa,nyman\n1234,eva,nilsson")
assert(s.length == 2, "Well that didn't work.. ")
info("Chopping seems to work. Filtered duplicate numbers. Expected 1234:4321, got: "+s(0)(0)+":"+s(1)(0))
}
The methods:
def prepareListAccordingToPattern(allowedPattern: String, s: String) : Array[Array[String]] = {
val lines = chop("\n", s)
val choppedUp = lines.map(line =>
chop(",", line)).filter(array =>
array.length == 3 && array(0).matches(allowedPattern)
)
choppedUp
}
def chop(splitSymbol: String, toChop: String) : Array[String] = {
toChop.split(splitSymbol)
}
My test fails as expected since I receive back a multidimensional array with duplicates:
[0]["1234","örjan","nilsson"]
[1]["4321","eva-lisa","nyman"]
[2]["1234","eva","nilsson"]
What I would like to do is to filter out the duplicated numbers, in this case "1234"
so that I get back:
[0]["1234","örjan","nilsson"]
[1]["4321","eva-lisa","nyman"]
How should I do this in a scala way? Maybe I could attack this problem differently?
val arr = Array(
Array("1234","rjan","nilsson"),
Array("4321","eva-lisa","nyman"),
Array("1234","eva","nilsson")
)
arr.groupBy( _(0)).map{ case (_, vs) => vs.head}.toArray
// Array(Array(1234, rjan, nilsson), Array(4321, eva-lisa, nyman))
If you have a collection of elements (in this case Array of Array[String]) and want to get single element with every value of some property (in this case property is the first string from Array[String]) you should group elements of collection based on this property (arr.groupBy( _(0))) and then somehow select one element from every group. In this case we picked up the first element (Array[String]) from every group.
If you want to select any (not necessary first) element for every group you could convert every element (Array[String]) to the key-value pair ((String, Array[String])) where key is the value of target property, and then convert this collection of pairs to Map:
val myMap = arr.map{ a => a(0) -> a }.toMap
// Map(1234 -> Array(1234, eva, nilsson), 4321 -> Array(4321, eva-lisa, nyman))
myMap.values.toArray
// Array(Array(1234, eva, nilsson), Array(4321, eva-lisa, nyman))
In this case you'll get the last element from every group.
A bit implicit, but should work:
val arr = Array(
Array("1234","rjan","nilsson"),
Array("4321","eva-lisa","nyman"),
Array("1234","eva","nilsson")
)
arr.view.reverse.map(x => x.head -> x).toMap.values
// Iterable[Array[String]] = MapLike(Array(1234, rjan, nilsson), Array(4321, eva-lisa, nyman))
Reverse here to override "eva","nilsson" with "rjan","nilsson", not vice versa

Resources