Removing arrays which are subsets of other arrays in Scala - arrays

I have an array of arrays of integers, like:
val t1 = Array(Array(1, 2, 3), Array(2), Array(4, 5, 6), Array(5, 6))
I want to remove the arrays that are subsets of another array. So, the result should be:
Array(Array(1, 2, 3), Array(4, 5, 6))
Ideally, these would be Sets, but in the context of my program they are arrays, and I don't want to convert them to sets for performance reasons.
I solved it this way in Scala, but I would like to know if there is a more elegant (and/or more efficient) way to do this:
def removeSubsets[T: ClassManifest](clusters: Array[Array[T]]) = {
  val sortedClusters = clusters.sortBy(-1 * _.length)
  sortedClusters.foldLeft(Array[Array[T]]()) { (acc, ele) =>
    val isASubset = acc.exists(arr => (ele diff arr).length == 0)
    if (isASubset) acc else acc :+ ele
  }
}
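For reference, a quick check of the function on the example input (note: ClassManifest is deprecated in later Scala versions, where ClassTag takes its place):

val t1 = Array(Array(1, 2, 3), Array(2), Array(4, 5, 6), Array(5, 6))
// The longest arrays are kept first; anything fully contained
// in an already-kept array is dropped.
removeSubsets(t1).map(_.mkString("Array(", ", ", ")")).foreach(println)
// Array(1, 2, 3)
// Array(4, 5, 6)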

Related

Scala Array Copy looks to be working differently

object Solution extends App {
  val arr1 = Array(
    Array(1, 2, 3),
    Array(4, 5, 6)
  )
  var arr2 = Array.ofDim[Int](2, 3)
  Array.copy(arr1, 0, arr2, 0, arr1.length)

  arr1(0)(1) = 23

  println(arr1.map(_.mkString(",")).mkString("\n"))
  println()
  println(arr2.map(_.mkString(",")).mkString("\n"))
}
1,23,3
4,5,6
1,23,3
4,5,6
What is wrong? Why is the 23 appearing in both arrays?
Because Array in Scala (or, to be more precise, on the JVM, due to Scala's interop with Java) is a mutable structure, and you are performing a shallow copy, not a deep copy. That means you copy only the outer structure (the top-level array in your case) and not the entire structure recursively, i.e. all the nested arrays: both outer arrays end up holding references to the same inner arrays.
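A minimal sketch that makes the aliasing visible (eq is reference equality):

val arr1 = Array(Array(1, 2, 3), Array(4, 5, 6))
val arr2 = Array.ofDim[Int](2, 3)
Array.copy(arr1, 0, arr2, 0, arr1.length)
// Only the row references were copied: both outer arrays
// point at the very same inner array objects.
println(arr1(0) eq arr2(0)) // true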
A solution might look like:
val source = Array(
  Array(1, 2, 3),
  Array(4, 5, 6)
)
val target = Array.ofDim[Int](2, 3)

// Copy each row's contents into the corresponding target row.
source.zipWithIndex.foreach { case (row, index) =>
  Array.copy(row, 0, target(index), 0, row.length)
}

target(0)(1) = 23

println(source.map(_.mkString(",")).mkString("\n"))
println()
println(target.map(_.mkString(",")).mkString("\n"))
which will print out this result:
1,2,3
4,5,6
1,23,3
4,5,6
Scastie example: https://scastie.scala-lang.org/lrrHyGqZRxKk7mZ6CbLoiA
UPDATE
As @Luis Miguel Mejía Suárez correctly pointed out in the comments, zipWithIndex is an expensive operation (it allocates a tuple per element). A more efficient solution would be:
(0 until source.length).foreach { index =>
  Array.copy(source(index), 0, target(index), 0, source(index).length)
}
Array.copy delegates to Java's System.arraycopy, which modifies the destination array in place. From the doc:
Copy one array to another. Equivalent to Java's System.arraycopy(src, srcPos, dest, destPos, length), except that this also works for polymorphic and boxed arrays.
Note that the passed-in dest array will be modified by this call.
You can try a simple map with identity:
scala> val arr1 = Array(Array(1,2,3),Array(4,5,6))
arr1: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6))
scala> val arr3 = arr1.map(_.map(identity))
arr3: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6))
scala> arr1(0)(1) = 23
scala> arr1
res16: Array[Array[Int]] = Array(Array(1, 23, 3), Array(4, 5, 6))
scala> arr3
res17: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6))
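Since Array#clone copies a one-dimensional array, cloning each row gives the same one-level-deep copy; a minimal sketch:

// New outer array, and a fresh copy of each inner row.
val copied = arr1.map(_.clone)
arr1(0)(1) = 99 // does not affect copied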

How to insert an element into an RDD array in Spark

Hi, I've tried to insert elements into an RDD[Array[String]] using Scala in Spark.
Here is an example.
val data: RDD[Array[String]] = Array(Array(1,2,3), Array(1,2,3,4), Array(1,2))
I want to make every array in this data have length 4.
If an array's length is less than 4, I want to fill it up with "NULL" values.
Here is the code I tried:
val newData = data.map(x =>
  if (x.length < 4) {
    for (i <- x.length until 4) {
      x.union("NULL")
    }
  } else {
    x
  }
)
But the result is Array[Any] = Array((), Array(1, 2, 3, 4), ()).
So I tried another way and used yield in the for loop.
val newData = data.map(x =>
  if (x.length < 4) {
    for (i <- x.length until 4) yield {
      x.union("NULL")
    }
  } else {
    x
  }
)
The result is Array[Object] = Array(Vector(Array(1, 2, 3, N, U, L, L)), Array(1, 2, 3, 4), Vector(Array(1, 2, N, U, L, L), Array(1, 2, N, U, L, L)))
Neither of these is what I want. I want to get back something like this:
RDD[Array[String]] = Array(Array(1,2,3,NULL), Array(1,2,3,4), Array(1,2,NULL,NULL)).
What should I do?
Is there a method to solve it?
union is a functional operation: it doesn't change the array x, it returns a new collection. You don't need a loop for this, though, and any loop implementation will probably be slower -- it's much better to create one new collection containing all the "NULL" values at once than to build a fresh collection every time you append a single value. Here's a function that should work for you:
def fillNull(x: Array[Int], desiredLength: Int): Array[String] = {
  x.map(_.toString) ++ Array.fill(desiredLength - x.length)("NULL")
}

val newData = data.map(fillNull(_, 4))
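For example, applied outside Spark (a quick check, assuming the numeric rows shown above):

fillNull(Array(1, 2), 4)       // Array("1", "2", "NULL", "NULL")
fillNull(Array(1, 2, 3, 4), 4) // Array("1", "2", "3", "4")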
I solved your use case with the following code:
val initialRDD = sparkContext.parallelize(Array(
  Array[Any](1, 2, 3),
  Array[Any](1, 2, 3, 4),
  Array[Any](1, 2, 3)
))

val transformedRDD = initialRDD.map(array =>
  if (array.length < 4) {
    // Pre-fill a length-4 array with "NULL" (the element type must be
    // Any rather than AnyVal, since "NULL" is a String), then copy the
    // existing elements over the front of it.
    val transformedArray = Array.fill[Any](4)("NULL")
    Array.copy(array, 0, transformedArray, 0, array.length)
    transformedArray
  } else {
    array
  }
)
val result = transformedRDD.collect()
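For what it's worth, the standard library's padTo does this padding in a single call; a sketch, assuming the rows are mapped to strings first:

// Pad each row out to length 4, filling the tail with "NULL".
val newData = data.map(row => row.map(_.toString).padTo(4, "NULL"))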

Multidimensional Array zip array in Scala

I have two arrays, like:
val one = Array(1, 2, 3, 4)
val two = Array(4, 5, 6, 7)
var three = one zip two map{case(a, b) => a * b}
It's ok.
But I have a multidimensional Array and a one-dimensional array now, like this:
val mulOne = Array(Array(1, 2, 3, 4),Array(5, 6, 7, 8),Array(9, 10, 11, 12))
val one_arr = Array(1, 2, 3, 4)
I would like to multiply them. How can I do this in Scala?
You could use:
val tmpArr = mulOne.map(_ zip one_arr).map(_.map{case(a,b) => a*b})
This would give you Array(Array(1*1, 2*2, 3*3, 4*4), Array(5*1, 6*2, 7*3, 8*4), Array(9*1, 10*2, 11*3, 12*4)).
Here mulOne.map(_ zip one_arr) zips each internal array of mulOne with the corresponding elements of one_arr to create pairs like Array(Array((1,1), (2,2), (3,3), (4,4)), ..) (note: I have used placeholder syntax). In the next step, .map(_.map{case(a,b) => a*b}) multiplies the elements of each pair to give output like Array(Array(1, 4, 9, 16), ..).
Then you could use:
tmpArr.map(_.reduce(_ + _))
to sum each internal array and get Array(30, 70, 110).
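The two steps can also be fused into one pass per row; a minimal sketch, equivalent to the zip/map/reduce pipeline above:

val rowSums = mulOne.map { row =>
  row.zip(one_arr).map { case (a, b) => a * b }.sum
}
// rowSums: Array(30, 70, 110)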
Try this
mulOne.map{x => (x, one_arr)}.map { case(one, two) => one zip two map{case(a, b) => a * b} }
mulOne.map{x => (x, one_arr)} => for every array inside mulOne, create a tuple of it together with one_arr.
.map { case(one, two) => one zip two map{case(a, b) => a * b} } is basically the operation you performed in your first example, applied to every tuple created in the first step.
Using a for comprehension like this,
val prods =
  for {
    xs <- mulOne
    zx = one_arr zip xs
    (a, b) <- zx
  } yield a * b
and so
prods.sum
delivers the final result.
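Note that prods is a flat Array of all twelve products, so prods.sum is the grand total (210 for the sample data); if you want the per-row sums Array(30, 70, 110) instead, use one of the nested approaches above.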

In Scala, how do you convert an Array with n elements into an Array of tuples?

How do you convert from an Array[Int] of length n into an Array[(Int, Option[Int])] of length Math.ceil(n / 2.0).toInt?
scala> val a = Array(1, 2, 3, 4)
a: Array[Int] = Array(1, 2, 3, 4)
The resultant Array for the example above would be Array((1, Some(2)), (3, Some(4))). If a were Array(1), then the desired result would be Array((1, None)).
Capisce?
grouped is useful for breaking the array into pairs, and case is useful for dealing with the possibility of a leftover element:
def toPairs[T](xs: Array[T]): Array[(T, Option[T])] =
  xs.grouped(2)
    .map {
      case Array(a, b) => (a, Some(b))
      case Array(a)    => (a, None)
    }.toArray
scala> toPairs(Array(1, 2, 3, 4, 5))
res0: Array[(Int, Option[Int])] = Array((1,Some(2)), (3,Some(4)), (5,None))
Something similar to Seth's suggestion, just a little more concise.
scala> Array(1,2,3,4,5).grouped(2).map(x=> (x.head, x.tail.headOption)).toArray
res17: Array[(Int, Option[Int])] = Array((1,Some(2)), (3,Some(4)), (5,None))

How to sum up every column of a Scala array?

If I have an array of arrays (similar to a matrix) in Scala, what's an efficient way to sum up each column of the matrix? For example, if my array of arrays is like below:
val arr = Array(Array(1, 100, ...), Array(2, 200, ...), Array(3, 300, ...))
and I want to sum up each column (e.g., sum up the first element of all sub-arrays, sum up the second element of all sub-arrays, etc.) and get a new array like below:
newArr = Array(6, 600, ...)
How can I do this efficiently in Spark Scala?
There is a suitable .transpose method on List that can help here, although I can't say what its efficiency is like:
arr.toList.transpose.map(_.sum)
(then call .toArray if you specifically need the result as an array).
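For example (a quick check; note that transpose requires all rows to have the same length and throws an IllegalArgumentException otherwise):

val arr = Array(Array(1, 100), Array(2, 200), Array(3, 300))
arr.toList.transpose.map(_.sum) // List(6, 600)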
Using breeze Vector:
scala> val arr = Array(Array(1, 100), Array(2, 200), Array(3, 300))
arr: Array[Array[Int]] = Array(Array(1, 100), Array(2, 200), Array(3, 300))
scala> arr.map(breeze.linalg.Vector(_)).reduce(_ + _)
res0: breeze.linalg.Vector[Int] = DenseVector(6, 600)
If your input is sparse you may consider using breeze.linalg.SparseVector.
In practice, a linear algebra vector library, as mentioned by @zero323, will often be the better choice.
If you can't use a vector library, I suggest writing a function col2sum that adds two rows element-wise -- even when they are not the same length -- and then using Array.reduce to extend this operation to N rows. Using reduce is valid because we know that sums are not dependent on the order of operations (i.e. 1+2+3 == 3+2+1 == 3+1+2 == 6):
def col2sum(x: Array[Int], y: Array[Int]): Array[Int] = {
  x.zipAll(y, 0, 0).map(pair => pair._1 + pair._2)
}

def colsum(a: Array[Array[Int]]): Array[Int] = {
  a.reduce(col2sum)
}
val z = Array(Array(1, 2, 3, 4, 5), Array(2, 4, 6, 8, 10), Array(1, 9));
colsum(z)
--> Array[Int] = Array(4, 15, 9, 12, 15)
scala> val arr = Array(Array(1, 100), Array(2, 200), Array(3, 300 ))
arr: Array[Array[Int]] = Array(Array(1, 100), Array(2, 200), Array(3, 300))
scala> arr.flatten.zipWithIndex.groupBy(c => (c._2 + 1) % 2)
         .map(a => a._1 -> a._2.foldLeft(0)((sum, i) => sum + i._1))
res40: scala.collection.immutable.Map[Int,Int] = Map(1 -> 6, 0 -> 600)
Flatten the array, use zipWithIndex to attach each element's index, group by index modulo the number of columns so that each group holds one column, then foldLeft to sum each group (here key 1 is the first column and key 0 the second).
