Scala logical indexing with for comprehension - arrays

I'm trying to translate the following Matlab logical-indexing pattern into Scala code:
% x is an [Nx1] array of Int32
% y is an [Nx1] array of Int32
% myExpensiveFunction() processes batches of unique x.
ux = unique(x);
z = nan(size(x));
for i = 1:length(ux)
idx = x == ux(i);
z(idx) = myExpensiveFunction(x(idx), y(idx));
end
Assume I'm working with val x: Array[Int] in Scala. What is the best way to do this?
Edit: To clarify, I'm looking to process batches of (x,y) at a time, grouped by unique x, and return a result (z) with an order corresponding to the initial input. I'm open to sorting x, but eventually need to get back to the original unsorted order. My primary requirement is to handle all the indexing/mapping/sorting in a clear and reasonably efficient way.

Most of this is pretty straightforward in Scala; the only thing that's a bit out of the ordinary is finding the indices for each unique x. In Scala you'd do that with a `groupBy`. Since this is a really index-heavy method, I'm just going to give in and go with indices all the way:
val z = Array.fill(x.length)(Double.NaN)
x.indices.groupBy(i => x(i)).foreach{ case (xi, is) =>
is.foreach(i => z(i) = myExpensiveFunction(xi, y(i)))
}
z
assuming you can live with a lack of vectors going to myExpensiveFunction. If not,
val z = Array.fill(x.length)(Double.NaN)
x.indices.groupBy(i => x(i)).foreach{ case (xi, is) =>
val xs = Array.fill(is.length)(xi)
val ys = is.map(i => y(i)).toArray
val zs = myExpensiveFunction(xs, ys)
// zs is indexed by position within the batch, not by the original index
is.indices.foreach(j => z(is(j)) = zs(j))
}
z
This isn't the most natural way to do the computation in Scala, or the most efficient, but you don't care about efficiency if your expensive function is expensive, and it's the closest I can come to a literal translation.
(Translating your matlab-algorithms into almost everything else involves a certain amount of pain or rethinking, since the "natural" computations in matlab are not like those in most other languages.)
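For reference, here is a minimal self-contained sketch of the batched version written with for comprehensions. The names GroupedBatch and batchFn are made up for illustration; batchFn is only a stand-in for myExpensiveFunction and is assumed to return one result per element of the batch, in batch order.

```scala
object GroupedBatch {
  // Hypothetical stand-in for myExpensiveFunction: takes one batch (all
  // entries share the same x) and returns one value per batch element.
  // Here each y is simply divided by the batch total.
  def batchFn(xs: Array[Int], ys: Array[Int]): Array[Double] = {
    val total = ys.sum.toDouble
    ys.map(_ / total)
  }

  def process(x: Array[Int], y: Array[Int]): Array[Double] = {
    val z = Array.fill(x.length)(Double.NaN)
    // Group the original indices by the value of x at that index.
    for ((_, is) <- x.indices.groupBy(i => x(i))) {
      val zs = batchFn(is.map(i => x(i)).toArray, is.map(i => y(i)).toArray)
      // j is the position inside the batch, i the position in the input,
      // so the results land back in the original order.
      for ((i, j) <- is.zipWithIndex) z(i) = zs(j)
    }
    z
  }
}
```

With x = Array(1,2,1,2,3) and y = Array(1,2,3,4,5), the batch for x == 1 holds the y values 1 and 3, so positions 0 and 2 of the result become 0.25 and 0.75.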

The important point is to get Matlab's unique right. A simple solution would be to use a Set to determine the unique values:
val z = Array.fill(x.length)(Double.NaN)
val occurringValues = x.toSet
occurringValues.foreach{ value =>
val indices = x.indices.filter(i => x(i) == value)
for (i <- indices) {
z(i) = myExpensiveFunction(x(i), y(i))
}
}
Note: I assume that it is possible to change myExpensiveFunction to element-wise operation...

scala> def process(xs: Array[Int], ys: Array[Int], f: (Seq[Int], Seq[Int]) => Double): Array[Double] = {
| val ux = xs.distinct
| val zs = Array.fill(xs.size)(Double.NaN)
| for(x <- ux) {
| val idx = xs.indices.filter{ i => xs(i) == x }
| val res = f(idx.map(xs), idx.map(ys))
| idx foreach { i => zs(i) = res }
| }
| zs
| }
process: (xs: Array[Int], ys: Array[Int], f: (Seq[Int], Seq[Int]) => Double)Array[Double]
scala> val xs = Array(1,2,1,2,3)
xs: Array[Int] = Array(1, 2, 1, 2, 3)
scala> val ys = Array(1,2,3,4,5)
ys: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val f = (a: Seq[Int], b: Seq[Int]) => a.sum/b.sum.toDouble
f: (Seq[Int], Seq[Int]) => Double = <function2>
scala> process(xs, ys, f)
res0: Array[Double] = Array(0.5, 0.6666666666666666, 0.5, 0.6666666666666666, 0.6)

Related

Spark sequences [int] comparison with [String] sequence output

I was trying to compare integer wrapped arrays in two different columns and give the ratings as string:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
The DataFrame data has column A and B with wrapped array I would like to compare:
val data = Seq(
(Seq(1,2,3),Seq(4,5,6),Seq(7,8,9)),
(Seq(1,1,3),Seq(6,5,7),Seq(11,9,8))
).toDF("A","B","C")
And here is how it looks like:
data: org.apache.spark.sql.DataFrame = [A: array<int>, B: array<int> ... 1 more field]
+---------+---------+----------+
| A| B| C|
+---------+---------+----------+
|[1, 2, 3]|[4, 5, 6]| [7, 8, 9]|
|[1, 1, 3]|[6, 5, 7]|[11, 9, 8]|
+---------+---------+----------+
Then here is the user-defined function with which I would like to compare each pair of elements of the arrays in columns A and B, row by row, and give ratings using simple logic. For example, if A(1) > B(1) then D(1) is "Top". So for the first row of column D, I hope to have ["Top", "Top", "Top"]
def myToChar(num1: Seq[Int], num2: Seq[Int]): Seq[String] = {
val twozipped = num1.zip(num2)
for ((x,y) <- num1.zip(num2)) {
if (x > y) "Top"
if (x < y) "Well"
if (x == y) "Good"
}}
val udfToChar = udf(myToChar(_: Seq[Int], _: Seq[Int]))
val ouput = data.withColumn("D",udfToChar($"A",$"B"))
However, I kept getting the <console>:45: error: type mismatch; error information. Not sure if my udf() type definition is wrong and appreciate any guidance to correct my mistake.
Your myToChar definition is declared to return a Seq[String] - but its implementation doesn't - it returns Unit, because a for expression (without a yield clause) has Unit type.
You can fix this by fixing the implementation of the function:
Replace the for with a map operation
Chain the conditions with else if and finish with an else. With separate if statements the results of the earlier ones are simply discarded, and a trailing if without an else yields Unit for inputs that satisfy none of the conditions (the compiler can't conclude that your if conditions are exhaustive - it must assume there's also a possibility none of them would hold true)
So - a correct implementation would be:
def myToChar(num1: Seq[Int], num2: Seq[Int]): Seq[String] = {
num1.zip(num2).map { case (x, y) =>
if (x > y) "Top"
else if (x < y) "Well"
else "Good"
}
}
Or alternatively using pattern matching with guards:
def myToChar(num1: Seq[Int], num2: Seq[Int]): Seq[String] = {
num1.zip(num2).map {
case (x, y) if x > y => "Top"
case (x, y) if x < y => "Well"
case _ => "Good"
}
}
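As a side note on the Unit issue: a for expression does produce a value once it has a yield clause. A minimal sketch of the same logic in for/yield form, in plain Scala with no Spark involved (the names Ratings and rate are made up for this example):

```scala
object Ratings {
  // for ... yield desugars to map, so the result is a Seq[String];
  // without the yield the same loop would have type Unit.
  def rate(num1: Seq[Int], num2: Seq[Int]): Seq[String] =
    for ((x, y) <- num1.zip(num2)) yield {
      if (x > y) "Top"
      else if (x < y) "Well"
      else "Good"
    }
}
```

For example, Ratings.rate(Seq(5, 2, 3), Seq(4, 5, 3)) gives Seq("Top", "Well", "Good").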

Scala: iteration 2d array to do operation

A newbie here.
val arr_one = Array(Array(1, 2), Array(3, 4), Array(5, 6),Array(x, y)..and so on)
val arr_two = Array(Array(2,3), Array(4, 5), Array(6, 7))
var tempArr = ArrayBuffer[Double]()
I want to multiply arr_one and arr_two. for example
Iteration1 :Array(1*2+2*3, 1*4 +2*5, 1*6+2*7 ) assign to tempArr
Iteration2 :Array(3*2+4*3, 3*4 +4*5, 3*6+4*7) assign to tempArr
Iteration3 :Array(5*2+6*3, 5*4 +6*5, 5*6+6*7) assign to tempArr
I knew that if
val x = Array(1, 2) ; val y = Array(Array(2,3), Array(4, 5), Array(6, 7))
I can use y map {x zip _ map{case(a, b) => a * b} sum}
But if x has the same form as arr_one, I don't know how to use a for loop or something else to do that.
I really have no idea.
How can I do this in scala?
Really thanks.
I believe this does what you need, without any mutable state and "iterations" - it uses the "for-comprehension" syntax which is kind of a non-imperative for-loop - in other words, instead of changing state in each iteration, it returns a value which is the sequence of results per "iteration":
val result: Array[Array[Int]] = for (arr1 <- arr_one) yield {
for (arr2 <- arr_two) yield multArrays(arr1, arr2)
}
Assuming that multArrays has the following signature:
def multArrays(arr1: Array[Int], arr2: Array[Int]): Int
That calculates the value for each cell. A naive implementation (assuming arrays have size 2) would be:
def multArrays(arr1: Array[Int], arr2: Array[Int]): Int = {
arr1(0) * arr2(0) + arr1(1) * arr2(1)
}
But of course this can be generalized to any size arrays.
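One possible generalization, sketched with zip so that it works for rows of any length (the wrapper object MatMul and the helper multiplyAll are names I made up; note that zip silently truncates to the shorter row, so this assumes rows of equal length):

```scala
object MatMul {
  // Dot product of two rows; zip pairs elements up position by position.
  def multArrays(arr1: Array[Int], arr2: Array[Int]): Int =
    arr1.zip(arr2).map { case (a, b) => a * b }.sum

  // One output row per row of `one`, one cell per row of `two`.
  def multiplyAll(one: Array[Array[Int]], two: Array[Array[Int]]): Array[Array[Int]] =
    for (arr1 <- one) yield {
      for (arr2 <- two) yield multArrays(arr1, arr2)
    }
}
```

For the arrays in the question, the first row comes out as Array(8, 14, 20), i.e. 1*2+2*3, 1*4+2*5 and 1*6+2*7.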
Maybe this is what you need:
val tmp = arr_one map ((arr1) => {arr_two map (arr2 => (arr1 zip arr2) map {case(a, b) => a * b} reduce (_ + _))} )
And to get a Buffer, simply use:
val tmpArr = tmp.toBuffer

For comprehension over Option array

I am getting compilation error:
Error:(64, 9) type mismatch;
found : Array[(String, String)]
required: Option[?]
y <- x
^
in a fragment:
val z = Some(Array("a"->"b", "c" -> "d"))
val l = for(
x <- z;
y <- x
) yield y
Why does the generator over the Array not produce the items of the array? And where does the requirement for an Option come from?
To be more ridiculous, if I replace "yield" with println(y) then it does compile.
Scala version: 2.10.6
This is because of the way for expressions are translated into map, flatMap and foreach expressions. Let's first simplify your example:
val someArray: Some[Array[Int]] = Some(Array(1, 2, 3))
val l = for {
array: Array[Int] <- someArray
number: Int <- array
} yield number
In accordance with the relevant part of the Scala language specification, this first gets translated into
someArray.flatMap {case array => for (number <- array) yield number}
which in turn gets translated into
someArray.flatMap {case array => array.map{case number => number}}
The problem is that someArray.flatMap expects a function from Array[Int] to Option[B] for some type B, whereas we've provided a function from Array[Int] to Array[Int].
The reason the compilation error goes away if yield number is replaced by println(number) is that for loops are translated differently from for comprehensions: it will now be translated as someArray.foreach{case array => array.foreach {case item => println(item)}}, which doesn't have the same typing issues.
A possible solution is to begin by converting the Option to the kind of collection you want to end up with, so that its flatMap method will have the right signature:
val l = for {
array: Array[Int] <- someArray.toArray
number: Int <- array
} yield number
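Applied to the value from the question, the same idea can be sketched as follows. The object name OptionFlatten is made up, and I've used toSeq instead of toArray so the result is a plain Seq; the second value shows an equivalent form without a for comprehension.

```scala
object OptionFlatten {
  val z: Option[Array[(String, String)]] = Some(Array("a" -> "b", "c" -> "d"))

  // Converting the Option to a Seq first means the outer flatMap
  // accepts a collection-valued function, so the inner Array is fine.
  val viaFor: Seq[(String, String)] =
    for {
      x <- z.toSeq
      y <- x
    } yield y

  // Same result without a for comprehension: unwrap the Option,
  // falling back to an empty Seq for None.
  val viaGetOrElse: Seq[(String, String)] =
    z.map(_.toSeq).getOrElse(Seq.empty)
}
```

Both values contain the pairs ("a","b") and ("c","d"); for a None input both forms would give an empty Seq.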
It's the usual "option must be converted to mix monads" thing.
scala> for (x <- Option.option2Iterable(Some(List(1,2,3))); y <- x) yield y
res0: Iterable[Int] = List(1, 2, 3)
Compare
scala> for (x <- Some(List(1,2,3)); y <- x) yield y
<console>:12: error: type mismatch;
found : List[Int]
required: Option[?]
for (x <- Some(List(1,2,3)); y <- x) yield y
^
to
scala> Some(List(1,2,3)) flatMap (is => is map (i => i))
<console>:12: error: type mismatch;
found : List[Int]
required: Option[?]
Some(List(1,2,3)) flatMap (is => is map (i => i))
^
or
scala> for (x <- Some(List(1,2,3)).toSeq; y <- x) yield y
res3: Seq[Int] = List(1, 2, 3)

How to obtain nested WrappedArray

I need a read-only structure with fast indexed access and minimal overhead. The structure will be queried quite often by the application. So, as suggested on the net, I tried to use Arrays and cast them to IndexedSeq
scala> val wa : IndexedSeq[Int] = Array(1,2,3)
wa: IndexedSeq[Int] = WrappedArray(1, 2, 3)
So far, so good. But I need to use nested Arrays and there the problem lies.
val wa2d : IndexedSeq[IndexedSeq[Int]] = Array(Array(1,2), Array(3), Array())
<console>:8: error: type mismatch;
found : Array[Array[_ <: Int]]
required: IndexedSeq[IndexedSeq[Int]]
val wa2d : IndexedSeq[IndexedSeq[Int]] = Array(Array(1,2), Array(3), Array())
Scala compiler could not apply implicit conversion recursively.
scala> val wa2d : IndexedSeq[IndexedSeq[Int]] = Array(Array[Int](1,2) : IndexedSeq[Int], Array[Int](3) : IndexedSeq[Int], Array[Int]() : IndexedSeq[Int])
wa2d: IndexedSeq[IndexedSeq[Int]] = WrappedArray(WrappedArray(1, 2), WrappedArray(3), WrappedArray())
That worked as expected, but this form is too verbose, for each subarray I need to specify types twice. And I would like to avoid it completely. So I've tried another approach
scala> val wa2d : IndexedSeq[IndexedSeq[Int]] = Array(Array(1,2), Array(3), Array()).map(_.to[IndexedSeq])
wa2d: IndexedSeq[IndexedSeq[Int]] = ArraySeq(Vector(1, 2), Vector(3), Vector())
But all the WrappedArrays mysteriously disappeared and were replaced with ArraySeq and Vector.
So what is the least obscure way to define nested WrappedArrays?
Here is how you do it:
scala> def wrap[T](a: Array[Array[T]]): IndexedSeq[IndexedSeq[T]] = { val x = a.map(x => x: IndexedSeq[T]); x }
scala> wrap(Array(Array(1,2), Array(3,4)))
res13: IndexedSeq[IndexedSeq[Int]] = WrappedArray(WrappedArray(1, 2), WrappedArray(3, 4))
If you want to use implicit conversions, use this:
def wrap[T](a: Array[Array[T]]): IndexedSeq[IndexedSeq[T]] = { val x = a.map(x => x: IndexedSeq[T]); x }
implicit def nestedArrayIsNestedIndexedSeq[T](x: Array[Array[T]]): IndexedSeq[IndexedSeq[T]] = wrap(x)
val x: IndexedSeq[IndexedSeq[Int]] = Array(Array(1,2),Array(3,4))
And here is why you might not want to do it:
val th = ichi.bench.Thyme.warmed()
val a = (0 until 100).toArray
val b = a: IndexedSeq[Int]
def sumArray(a: Array[Int]): Int = { var i = 0; var sum = 0; while(i < a.length) { sum += a(i); i += 1 }; sum }
def sumIndexedSeq(a: IndexedSeq[Int]): Int = { var i = 0; var sum = 0; while(i < a.length) { sum += a(i); i += 1 }; sum }
scala> th.pbenchOff("")(sumArray(a))(sumIndexedSeq(b))
Benchmark comparison (in 439.6 ms)
Significantly different (p ~= 0)
Time ratio: 3.18875 95% CI 3.06446 - 3.31303 (n=30)
First 65.12 ns 95% CI 62.69 ns - 67.54 ns
Second 207.6 ns 95% CI 205.2 ns - 210.1 ns
res15: Int = 4950
The bottom line is that once you access your Array[Int] indirectly via WrappedArray[Int], primitives get boxed. So things get much slower. If you really need the full performance of arrays, you have to use them directly. And if you don't, just use a Vector and stop worrying about it.
I would just go with Vector for prototyping and then go to Array once/if you are sure that this is actually a performance bottleneck. Use a type alias so you can quickly switch from Vector to Array.
Somewhere in your package object:
type Vec[T] = Vector[T]
val Vec = Vector
// type Vec[T] = Array[T]
// val Vec = Array
Then you can write code like this
val grid = Vec(Vec(1,2), Vec(3,4))
and switch quickly to an array version in case you measure that this is actually a performance bottleneck.

Count elements of array A in array B with Scala

I have two arrays of strings, say
A = ('abc', 'joia', 'abas8', '09ma09', 'oiam0')
and
B = ('gfdg', '89jkjj', '09ma09', 'asda', '45645ghf', 'dgfdg', 'yui345gd', '6456ds', '456dfs3', 'abas8', 'sfgds').
What I want to do is simply to count, for every string in A, how many times it appears in B (if at all). For example, the resulting array here should be: C = (0, 0, 1, 1, 0). How can I do that?
Try this:
A.map(x => B.count(y => y == x))
You can do it the way idursun suggested and it will work, but it may not be as efficient as preparing the intersection first. If B is much bigger than A, this can give a massive speedup: the intersect method has better big-O complexity than doing a linear search through B for each element of A.
val A = Array("abc", "joia", "abas8", "09ma09", "oiam0")
val B = Array("gfdg", "89jkjj", "09ma09", "asda", "45645ghf", "dgfdg", "yui345gd", "6456ds", "456dfs3", "abas8", "sfgds")
val intersectCounts: Map[String, Int] =
A.intersect(B).map(s => s -> B.count(_ == s)).toMap
val count = A.map(intersectCounts.getOrElse(_, 0))
println(count.toSeq)
Result
(0, 0, 1, 1, 0)
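A related variant, sketched under the assumption that B fits in memory: count every distinct string in B once with groupBy, then look each element of A up in the resulting Map. That is a single pass over B plus one lookup per element of A. The names CountOccurrences and countIn are made up for this example.

```scala
object CountOccurrences {
  def countIn(a: Seq[String], b: Seq[String]): Seq[Int] = {
    // One pass over b: map each distinct string to its number of occurrences.
    val counts: Map[String, Int] =
      b.groupBy(identity).map { case (s, group) => s -> group.size }
    // Strings of a that never occur in b get 0.
    a.map(s => counts.getOrElse(s, 0))
  }
}
```

With the A and B from the question this gives the counts 0, 0, 1, 1, 0.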
Use a foldLeft construction as the yield off of each element of A:
val A = List("a","b")
val B = List("b","b")
val C = for (a <- A)
yield B.foldLeft(0) { case (totalc : Int, w : String) =>
totalc + (if (w == a) 1 else 0)
}
And the result:
C: List[Int] = List(0, 2)
