If I have an array of arrays (similar to a matrix) in Scala, what's an efficient way to sum up each column of the matrix? For example, if my array of arrays is like below:
val arr = Array(Array(1, 100, ...), Array(2, 200, ...), Array(3, 300, ...))
and I want to sum up each column (e.g., sum up the first element of all sub-arrays, sum up the second element of all sub-arrays, etc.) and get a new array like below:
newArr = Array(6, 600, ...)
How can I do this efficiently in Spark Scala?
There is a suitable .transpose method on List that can help here, although I can't say what its efficiency is like:
arr.toList.transpose.map(_.sum)
(then call .toArray if you specifically need the result as an array).
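For what it's worth, transpose is also defined directly on Array (via ArrayOps), so the List round trip can be skipped; a small sketch, assuming all inner arrays have the same length (which transpose requires):
val arr = Array(Array(1, 100), Array(2, 200), Array(3, 300))
val colSums: Array[Int] = arr.transpose.map(_.sum)  // Array(6, 600)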
Using breeze Vector:
scala> val arr = Array(Array(1, 100), Array(2, 200), Array(3, 300))
arr: Array[Array[Int]] = Array(Array(1, 100), Array(2, 200), Array(3, 300))
scala> arr.map(breeze.linalg.Vector(_)).reduce(_ + _)
res0: breeze.linalg.Vector[Int] = DenseVector(6, 600)
If your input is sparse you may consider using breeze.linalg.SparseVector.
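If you need the result back as a plain Array, as in the question, the breeze vector can be converted at the end; a small sketch, not from the original answer:
arr.map(breeze.linalg.DenseVector(_)).reduce(_ + _).toArray  // Array(6, 600)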
In practice, a linear algebra vector library, as mentioned by @zero323, will often be the better choice.
If you can't use a vector library, I suggest writing a function col2sum that can sum two columns -- even if they are not the same length -- and then using Array.reduce to extend this operation to N columns. Using reduce is valid because we know that sums do not depend on the order of operations (i.e. 1+2+3 == 3+2+1 == 3+1+2 == 6):
def col2sum(x: Array[Int], y: Array[Int]): Array[Int] = {
  x.zipAll(y, 0, 0).map(pair => pair._1 + pair._2)
}
def colsum(a: Array[Array[Int]]): Array[Int] = {
  a.reduce(col2sum)
}
val z = Array(Array(1, 2, 3, 4, 5), Array(2, 4, 6, 8, 10), Array(1, 9));
colsum(z)
--> Array[Int] = Array(4, 15, 9, 12, 15)
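One caveat (my addition, with a hypothetical colsumSafe name): reduce throws on an empty outer array, so if that case can occur, a fold with an empty seed is safer; it gives the same result for non-empty input because zipAll pads the shorter side with zeros:
def colsumSafe(a: Array[Array[Int]]): Array[Int] =
  a.foldLeft(Array.empty[Int])(col2sum)
colsumSafe(Array.empty[Array[Int]])  // Array()
colsumSafe(z)                        // Array(4, 15, 9, 12, 15)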
scala> val arr = Array(Array(1, 100), Array(2, 200), Array(3, 300 ))
arr: Array[Array[Int]] = Array(Array(1, 100), Array(2, 200), Array(3, 300))
scala> arr.flatten.zipWithIndex.groupBy(c => (c._2 + 1) % 2)
         .map(a => a._1 -> a._2.foldLeft(0)((sum, i) => sum + i._1))
res40: scala.collection.immutable.Map[Int,Int] = Map(1 -> 6, 0 -> 600)
Flatten the array and use zipWithIndex to pair each value with its index, groupBy the index modulo the number of columns to collect each column's values, then foldLeft to sum each column.
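Note that % 2 hard-codes two columns. A generalized sketch (my variation, assuming all rows have the same length) derives the column count from the first row and restores the column order at the end:
val numCols = arr.head.length
val colSums = arr.flatten.zipWithIndex
  .groupBy { case (_, i) => i % numCols }          // column index
  .toArray
  .sortBy(_._1)                                    // keep the original column order
  .map { case (_, cells) => cells.map(_._1).sum }  // Array(6, 600)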
object Solution extends App {
  val arr1 = Array(
    Array(1, 2, 3),
    Array(4, 5, 6)
  )
  var arr2 = Array.ofDim[Int](2, 3)
  Array.copy(arr1, 0, arr2, 0, arr1.length)
  arr1(0)(1) = 23
  println(arr1.map(_.mkString(",")).mkString("\n"))
  println()
  println(arr2.map(_.mkString(",")).mkString("\n"))
}
1,23,3
4,5,6
1,23,3
4,5,6
What is wrong? Why is the 23 appearing in both arrays?
Because Array in Scala (or, to be more precise, on the JVM, due to Scala's interop with Java) is a mutable structure, and you are performing a shallow copy, not a deep copy. That means you copy only the top-level array, not the whole structure recursively, so the nested row arrays are still shared.
A solution might look like this:
val source = Array(
  Array(1, 2, 3),
  Array(4, 5, 6)
)
val target = Array.ofDim[Int](2, 3)
source.zipWithIndex.foreach { case (row, index) =>
  Array.copy(row, 0, target(index), 0, row.length)  // copy every element of each row
}
target(0)(1) = 23
println(source.map(_.mkString(",")).mkString("\n"))
println()
println(target.map(_.mkString(",")).mkString("\n"))
which will print the following result:
1,2,3
4,5,6
1,23,3
4,5,6
Scastie example: https://scastie.scala-lang.org/lrrHyGqZRxKk7mZ6CbLoiA
UPDATE
As @Luis Miguel Mejía Suárez correctly pointed out in the comment section, zipWithIndex is a relatively expensive operation here. A more efficient version would be:
(0 until source.length).foreach { index =>
  Array.copy(source(index), 0, target(index), 0, source(index).length)
}
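For a two-level array of primitives like this one, a shorter alternative (my suggestion, not part of the original answer) is to clone each row, which copies the Int values themselves:
val target = source.map(_.clone())  // row-level deep copy: mutating target leaves source intact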
Array.copy uses System.arraycopy, which copies only the references held in the outer array, so after the copy both outer arrays share the same inner row arrays, and a change made through one is visible through the other. From the docs:
Copy one array to another. Equivalent to Java's System.arraycopy(src, srcPos, dest, destPos, length), except that this also works for polymorphic and boxed arrays.
Note that the passed-in dest array will be modified by this call.
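A quick way to see the sharing (my illustration): after Array.copy the outer arrays are distinct objects, but they reference the very same rows.
val arr1 = Array(Array(1, 2, 3), Array(4, 5, 6))
val arr2 = Array.ofDim[Int](2, 3)
Array.copy(arr1, 0, arr2, 0, arr1.length)
arr1 eq arr2        // false: two distinct outer arrays
arr1(0) eq arr2(0)  // true: the rows are shared, so arr1(0)(1) = 23 shows up in both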
You can try a simple map with identity:
scala> val arr1 = Array(Array(1,2,3),Array(4,5,6))
arr1: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6))
scala> val arr3 = arr1.map(_.map(identity))
arr3: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6))
scala> arr1(0)(1) = 23
scala> arr1
res16: Array[Array[Int]] = Array(Array(1, 23, 3), Array(4, 5, 6))
scala> arr3
res17: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6))
How do you convert from an Array[Int] of length n, into an Array[(Int,Option[Int])] of length Math.ceil(n / 2.0).toInt?
scala> val a = Array(1, 2, 3, 4)
a: Array[Int] = Array(1, 2, 3, 4)
The resultant Array for the example above would be Array((1, Some(2)), (3, Some(4))). If a were Array(1), then the desired result would be Array((1, None)).
Capice?
grouped is useful for breaking the array into pairs, and case is useful for dealing with the possibility of a leftover element:
def toPairs[T](xs: Array[T]): Array[(T, Option[T])] =
  xs.grouped(2)
    .map {
      case Array(a, b) => (a, Some(b))
      case Array(a)    => (a, None)
    }.toArray
scala> toPairs(Array(1, 2, 3, 4, 5))
res0: Array[(Int, Option[Int])] = Array((1,Some(2)), (3,Some(4)), (5,None))
Something similar to Seth's suggestion, just a little more concise.
scala> Array(1,2,3,4,5).grouped(2).map(x=> (x.head, x.tail.headOption)).toArray
res17: Array[(Int, Option[Int])] = Array((1,Some(2)), (3,Some(4)), (5,None))
In Python, this is how I would do it.
>>> x
array([10, 9, 8, 7, 6, 5, 4, 3, 2])
>>> x[np.array([3, 3, 1, 8])]
array([7, 7, 9, 2])
This doesn't work in the Scala Spark shell:
scala> val indices = Array(3,2,0)
indices: Array[Int] = Array(3, 2, 0)
scala> val A = Array(10,11,12,13,14,15)
A: Array[Int] = Array(10, 11, 12, 13, 14, 15)
scala> A(indices)
<console>:28: error: type mismatch;
found : Array[Int]
required: Int
A(indices)
The foreach method doesn't work either:
scala> indices.foreach(println(_))
3
2
0
scala> indices.foreach(A(_))
<no output>
What I want is the result of B:
scala> val B = Array(A(3),A(2),A(0))
B: Array[Int] = Array(13, 12, 10)
However, I don't want to hard code it like that because I don't know how long indices is or what would be in it.
The most concise way I can think of is to flip your mental model and put indices first:
indices map A
And I would potentially suggest using lift to return an Option:
indices map A.lift
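To make the difference concrete (my illustration, using the A and indices from the question): map A returns the raw values, while map A.lift wraps them in Option and returns None instead of throwing for an out-of-range index.
indices.map(A)            // Array(13, 12, 10)
indices.map(A.lift)       // Array(Some(13), Some(12), Some(10))
Array(3, 99).map(A.lift)  // Array(Some(13), None) -- no exception for index 99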
You can use map on indices, which maps each element to a new element based on a mapping lambda. Note that on Array, you get an element at an index with the apply method:
indices.map(index => A.apply(index))
You can leave off apply:
indices.map(index => A(index))
You can also use the underscore syntax:
indices.map(A(_))
When you're in a situation like this, you can even leave off the underscore:
indices.map(A)
And you can use the alternate space syntax:
indices map A
You were trying to use foreach, which returns Unit, and is only used for side effects. For example:
indices.foreach(index => println(A(index)))
indices.map(A).foreach(println)
indices map A foreach println
I create zeroed Arrays in Scala with
(0 until Nrows).map(_ => 0).toArray
but is there anything faster? map is slow.
I have the same question but with 1 instead of 0, i.e. I also want to speed up (0 until Nrows).map(_ => 1).toArray
Zero is the default value for an array of Ints, so just do this:
val array = new Array[Int](NRows)
If you want all those values to be 1s then use .fill() (with thanks to @gourlaysama):
val array = Array.fill(NRows)(1)
However, looking at how this works internally, it involves the creation of a few objects that you don't need. I suspect the following (uglier) approach may be quicker if speed is your main concern:
val array = new Array[Int](NRows)
for (i <- 0 until array.length) { array(i) = 1 }
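Another allocation-light option (my suggestion, not from the original answer) is to delegate the fill to the JDK:
val array = new Array[Int](NRows)
java.util.Arrays.fill(array, 1)  // sets every element to 1 without a Range or closure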
For multidimensional arrays consider Array.ofDim, for instance,
scala> val a = Array.ofDim[Int](3,3)
a: Array[Array[Int]] = Array(Array(0, 0, 0), Array(0, 0, 0), Array(0, 0, 0))
Likewise,
scala> val a = Array.ofDim[Int](3)
a: Array[Int] = Array(0, 0, 0)
In the context here,
val a = Array.ofDim[Int](NRows)
For setting (possibly nonzero) initial values, consider Array.tabulate, for instance,
scala> Array.tabulate(3,3)( (x,y) => 1)
res5: Array[Array[Int]] = Array(Array(1, 1, 1), Array(1, 1, 1), Array(1, 1, 1))
scala> Array.tabulate(3)( x => 1)
res18: Array[Int] = Array(1, 1, 1)
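For constant (non-index-dependent) initial values, Array.fill also has multi-dimensional overloads (my addition), which avoids the unused index parameters:
Array.fill(3, 3)(1)  // Array(Array(1, 1, 1), Array(1, 1, 1), Array(1, 1, 1))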
I have an array of arrays of integers. Like:
val t1 = Array(Array(1, 2, 3), Array(2), Array(4, 5, 6), Array(5, 6))
I want to remove the arrays that are subsets of another array. So, the result should be:
Array(Array(1, 2, 3), Array(4, 5, 6))
Ideally, these should be Sets, but in the context of my program they are arrays, and I don't want to convert them to sets for performance reasons.
I solved it this way in Scala, but I would like to know if there is a more elegant (and/or more efficient) way to do this:
def removeSubsets[T: ClassManifest](clusters: Array[Array[T]]) = {
  val sortedClusters = clusters.sortBy(-1 * _.length)
  sortedClusters.foldLeft(Array[Array[T]]()) { (acc, ele) =>
    val isASubset = acc.exists(arr => (ele diff arr).length == 0)
    if (isASubset) acc else acc :+ ele
  }
}
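For reference, here is a lightly modernized sketch of the same approach (my rewrite, under the hypothetical name removeSubsetsModern, not an accepted answer): ClassTag replaces the deprecated ClassManifest and isEmpty replaces length == 0, while the algorithm itself is unchanged.
import scala.reflect.ClassTag

def removeSubsetsModern[T: ClassTag](clusters: Array[Array[T]]): Array[Array[T]] = {
  val sortedClusters = clusters.sortBy(-1 * _.length)  // longest arrays first
  sortedClusters.foldLeft(Array.empty[Array[T]]) { (acc, ele) =>
    val isASubset = acc.exists(arr => (ele diff arr).isEmpty)
    if (isASubset) acc else acc :+ ele
  }
}

removeSubsetsModern(Array(Array(1, 2, 3), Array(2), Array(4, 5, 6), Array(5, 6)))
// Array(Array(1, 2, 3), Array(4, 5, 6))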