Scala Join Arrays and Reduce on Value

I have two Arrays of Key/Value Pairs Array[(String, Int)] that I want to join, but only return the minimum value when there's a match on the key.
val a = Array(("personA", 1), ("personB", 4), ("personC", 5))
val b = Array(("personC", 4), ("personA", 2))
Goal: c = Array((personA, 1), (personC, 4))
val c = a.join(b).collect()
results in: c = Array((personA, (1, 2)), (personC, (5, 4)))
I've tried to achieve this using the join method but am having difficulties reducing the values after they have been joined into a single array: Array[(String, (Int, Int))].

Try this:
val a = Array(("personA", 1), ("personB", 4), ("personC", 5))
val b = Array(("personC", 4), ("personA", 2))
val bMap = b.toMap
val cMap = a.toMap.filterKeys(bMap.contains).map {
case(k, v) => k -> Math.min(v, bMap(k))
}
val c = cMap.toArray
The toMap method converts the Array[(String, Int)] into a Map[String, Int]; filterKeys is then used to retain only the keys (strings) in a.toMap that are also in b.toMap. The map operation then chooses the minimum value of the two available values for each key, and creates a new map associating each key with that minimum value. Finally, we convert the resulting map back to an Array[(String, Int)] using toArray.
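As a variation on the steps above, here is a minimal sketch that skips the intermediate map built from a and instead walks a once with collect, looking each key up in bMap. It assumes plain Scala arrays (not Spark RDDs) and that keys are unique within each array; the name joined is only illustrative.

```scala
val a = Array(("personA", 1), ("personB", 4), ("personC", 5))
val b = Array(("personC", 4), ("personA", 2))

// Lookup table for the second array (assumes keys in b are unique).
val bMap: Map[String, Int] = b.toMap

// Keep only the keys present in both arrays, paired with the smaller value.
// The result follows the order of the elements in a.
val joined: Array[(String, Int)] = a.collect {
  case (k, v) if bMap.contains(k) => k -> math.min(v, bMap(k))
}
```

Because collect filters and maps in a single pass, no filterKeys call is needed, which may be convenient on Scala 2.13 and later where filterKeys on a strict Map is deprecated.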
UPDATED
BTW: I'm not sure where you get the Array.join method from: Array doesn't have such a method, so a.join(b) doesn't work for me. However, I suspect that a and b might be Apache Spark PairRDD collections (or something similar). If that's the case, then you can join a and b, then map the values to the minimum of each pair (a reduce operation is not what you want) as follows:
a.join(b).mapValues(v => Math.min(v._1, v._2)).collect
collect converts the result into an Array[(String, Int)] as you require.

Related

Convert an array with n elements to a tuple with n elements in Scala

I have an array of elements (numbers in my case)
var myArray= Array(1, 2, 3, 4, 5, 6)
//myArray: Array[Int] = Array(1, 2, 3, 4, 5, 6)
and I want to obtain a tuple that contains all of them:
var whatIwant= (1,2,3,4,5,6)
//whatIwant: (Int, Int, Int, Int, Int, Int) = (1,2,3,4,5,6)
I tried the following code but it doesn't work:
var tuple = myArray(0)
for (e <- myArray)
{
var tuple = tuple :+ e
}
//error: recursive variable tuple needs type
The trivial answer is this:
val whatIwant =
myArray match {
case Array(a, b, c, d, e, f) => (a, b, c, d, e, f)
case _ => (0, 0, 0, 0, 0, 0)
}
If you want to support different numbers of elements in myArray then you are in for a world of pain because you will lose all the type information associated with a tuple.
If you are using Spark you should probably use its mechanisms to generate the data you want directly rather than converting to Array first.
A tuple cannot grow to an arbitrary size. In earlier versions of Scala, a tuple could have at most 22 elements. Scala treats tuples with different numbers of elements as different classes, so you can't append elements to a tuple the way you can to a list.
Apart from utilizing metaprogramming techniques such as reflection, tuple objects may only be generated explicitly.
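To illustrate the explicit, fixed-size construction these answers describe, here is a small sketch of the pattern match wrapped in a function that returns an Option rather than a tuple of zeros when the array has the wrong length. The name toTuple6 is hypothetical, and the size 6 is fixed at compile time, which is exactly the limitation discussed above.

```scala
// Converts an array of exactly six Ints into a 6-tuple.
// Returns None for any other length instead of a placeholder tuple.
def toTuple6(xs: Array[Int]): Option[(Int, Int, Int, Int, Int, Int)] =
  xs match {
    case Array(a, b, c, d, e, f) => Some((a, b, c, d, e, f))
    case _                       => None
  }

val myArray = Array(1, 2, 3, 4, 5, 6)
val whatIwant = toTuple6(myArray)
```

Returning None makes a length mismatch visible to the caller, whereas the (0, 0, 0, 0, 0, 0) fallback cannot be told apart from a real array of zeros.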

How to filter out common keys of 2 different maps in scala

I want to retrieve the key/value pairs of a list/array (call it A) whose keys also exist in another list/array (call it B).
val B: List[String] = List("key1","key3") //I can refactor the type if needed
val paramNames: Array[String] = parameterNames // ["key1", "key2", "key3"]
val paramValues: Array[AnyRef] = args // [1, "value", Obj]
val A: Array[(String,AnyRef)] = paramNames.zip(paramValues) // [("key1", 1), ("key2", "value"), ("key3", Obj)]
//now I want to retrieve from A, all keys exist in B with their values
//to get [("key1", 1), ("key3", Obj)]
Simply use filter:
val C = A.filter(k => B.contains(k._1))
This keeps only the tuples whose keys are contained in B.
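If B is large, one possible refinement is to convert it to a Set first, so each membership check does not scan the whole list. The sketch below uses made-up sample values in place of parameterNames and args from the question, and the name wanted is illustrative.

```scala
val B: List[String] = List("key1", "key3")

// Sample stand-ins for the zipped parameter names and values.
val A: Array[(String, AnyRef)] =
  Array("key1" -> Int.box(1), "key2" -> "value", "key3" -> "obj")

// A Set can be applied like a function: wanted(k) is true when k is a member.
val wanted: Set[String] = B.toSet

val C: Array[(String, AnyRef)] = A.filter { case (k, _) => wanted(k) }
```

For a handful of keys the plain List version is just as good; the Set only matters when B grows.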

split dataframe into array scala

I am trying to split a DataFrame into multiple arrays according to their id.
So I have a table
id|name
12|a
12|b
12|c
13|x
13|y
13|z
and I want to get multiple vectors that look like:
<a,b,c> <x,y,z>
So I have managed to get all the different IDs using:
val ids=dataframe.select("id").distinct.collect.flatMap(_.toSeq)
and that would return 12 and 13.
And I have tried to get for each one of them the names:
val namesArray=ids.map(id=>dataframe.where($"id"===id))
but that doesn't seem to be the way, since it returns the column names and I need a way to get only the names out of it.
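If you already have a DataFrame with your data, you can do the following:
import org.apache.spark.sql.functions.collect_list
// toDF needs the session implicits in scope, e.g. import spark.implicits._
val yourDataSet = sc.parallelize(List((12, "a"), (12, "b"), (13, "y"), (13, "z"))).toDF("id", "val")
val requiredDataSet = yourDataSet
.groupBy("id")
.agg(collect_list("val").as("names")) // alias the aggregate so it can be selected by name
.select("names")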
Or you can achieve this by getting the underlying RDD and then transforming it:
val valueVectorRdd = dataframe.rdd
.map(row => (row.getInt(0), row.getString(1))) // (id, name) pairs
.groupByKey() // groups the names by id
.map { case (_, names) => names.toVector }
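To see the grouping logic without a Spark session, the same idea can be sketched on a plain Scala collection with groupBy. This is an analogy rather than Spark code: the rows value is made-up sample data, and namesById is an illustrative name.

```scala
// Sample (id, name) rows standing in for the table in the question.
val rows: List[(Int, String)] =
  List((12, "a"), (12, "b"), (12, "c"), (13, "x"), (13, "y"), (13, "z"))

// Group the rows by id, then keep only the names of each group.
val namesById: Map[Int, Vector[String]] =
  rows.groupBy(_._1).map { case (id, group) => id -> group.map(_._2).toVector }
```

On a local List, groupBy keeps the elements of each group in their original order; Spark's groupByKey and collect_list make no such guarantee across partitions.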

Element-wise sum of arrays in Scala

How do I compute element-wise sum of the Arrays?
val a = new Array[Int](5)
val b = new Array[Int](5)
// assign values
// desired output: Array -> [a(0)+b(0), a(1)+b(1), a(2)+b(2), a(3)+b(3), a(4)+b(4)]
a.zip(b).flatMap(_._1+_._2)
missing parameter type for expanded function
Try:
a.zip(b).map { case (x, y) => x + y }
When you use an underscore as a placeholder in a function definition, it can only appear once (for each function argument position, that is, but in this case flatMap takes a Function1, so there's only one). If you need to refer to an argument more than once, you can't use the placeholder syntax—you'll need to give the argument a name.
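A short sketch of the two ways to name the argument, using small sample arrays that are not from the question. Both forms take a single pair argument, which is why they work where _._1+_._2 does not.

```scala
val a = Array(1, 2, 3)
val b = Array(10, 20, 30)

// Name the pair and use the tuple accessors.
val viaAccessors = a.zip(b).map(p => p._1 + p._2)

// Destructure the pair with a pattern-matching function literal.
val viaPattern = a.zip(b).map { case (x, y) => x + y }
```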
As the other answers point out, you can use .map { case (x, y) => x + y } or the tuple accessor version, but it's also worth noting that if you want to avoid a bunch of tuple allocations in an intermediate collection, you can write the following:
scala> (a, b).zipped.map(_ + _)
res5: Array[Int] = Array(0, 0, 0, 0, 0)
Here zipped is a method that's available on pairs of collections that has a special map that takes a Function2, which means the only tuple that gets created is the (a, b) pair. The extra efficiency probably doesn't matter much in most cases, but the fact that you can pass a Function2 instead of a function from pairs means the syntax is often a little nicer as well.
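On Scala 2.13 and later, zipped on a pair of collections is deprecated in favor of lazyZip, which supports the same two-argument function style. The sketch below assumes one of those versions and uses sample values that are not from the question.

```scala
val a = Array(1, 2, 3)
val b = Array(10, 20, 30)

// lazyZip pairs the elements lazily, so map can take a two-argument
// function without building an intermediate array of tuples.
val sums = a.lazyZip(b).map(_ + _)
```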
// One-dimensional arrays
val x = Array(1, 2, 3, 40, 55)
val x1 = Array(1, 2, 3, 40, 55)
x.indices.map(i => x(i) + x1(i))
// Arrays of pairs
val y = Array((1, 2), (3, 4))
val y1 = Array((3, 5), (5, 7))
y.indices.map(i => (y(i)._1 + y1(i)._1, y(i)._2 + y1(i)._2))

How to check that an array contains a particular value in Scala 2.8?

I've got an array A of D unique (int, int) tuples.
I need to know if the array contains (X, Y) value.
Do I have to implement a search algorithm myself, or is there a standard function for this in Scala 2.8? I've looked at the documentation but couldn't find anything like that.
That seems easy (unless I'm missing something):
scala> val A = Array((1,2),(3,4))
A: Array[(Int, Int)] = Array((1,2), (3,4))
scala> A contains (1,2)
res0: Boolean = true
scala> A contains (5,6)
res1: Boolean = false
I think the API call you're looking for is in ArrayLike.
I found this nice way of doing it:
scala> var personArray = Array(("Alice", 1), ("Bob", 2), ("Carol", 3))
personArray: Array[(String, Int)] = Array((Alice,1), (Bob,2), (Carol,3))
scala> personArray.find(_ == ("Alice", 1))
res25: Option[(String, Int)] = Some((Alice,1))
scala> personArray.find(_ == ("Alic", 1))
res26: Option[(String, Int)] = None
scala> personArray.find(_ == ("Alic", 1)).getOrElse(("David", 1))
res27: (String, Int) = (David,1)
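To put the approaches above side by side, here is a small sketch with sample data: contains answers the yes/no question for an exact tuple, exists does the same for an arbitrary condition, and find returns the matching element itself. The value names are illustrative.

```scala
val pairs = Array((1, 2), (3, 4))

// Tuples compare by value, so contains works for an exact (X, Y) match.
val hasExact: Boolean = pairs.contains((1, 2))

// exists takes a predicate, e.g. "is there any pair whose first element is 3?"
val hasFirstThree: Boolean = pairs.exists { case (x, _) => x == 3 }

// find returns the first matching element wrapped in an Option.
val firstMatch: Option[(Int, Int)] = pairs.find { case (_, y) => y > 10 }
```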

Resources