I need a read-only structure with fast indexed access and minimal overhead. The structure will be queried quite often by the application. So, as suggested on the net, I tried to use Arrays and cast them to IndexedSeq:
scala> val wa : IndexedSeq[Int] = Array(1,2,3)
wa: IndexedSeq[Int] = WrappedArray(1, 2, 3)
So far, so good. But I need to use nested Arrays, and that's where the problem lies.
scala> val wa2d : IndexedSeq[IndexedSeq[Int]] = Array(Array(1,2), Array(3), Array())
<console>:8: error: type mismatch;
found : Array[Array[_ <: Int]]
required: IndexedSeq[IndexedSeq[Int]]
val wa2d : IndexedSeq[IndexedSeq[Int]] = Array(Array(1,2), Array(3), Array())
The Scala compiler could not apply the implicit conversion recursively.
scala> val wa2d : IndexedSeq[IndexedSeq[Int]] = Array(Array[Int](1,2) : IndexedSeq[Int], Array[Int](3) : IndexedSeq[Int], Array[Int]() : IndexedSeq[Int])
wa2d: IndexedSeq[IndexedSeq[Int]] = WrappedArray(WrappedArray(1, 2), WrappedArray(3), WrappedArray())
That worked as expected, but this form is too verbose: for each subarray I have to specify the type twice, and I would like to avoid that completely. So I tried another approach:
scala> val wa2d : IndexedSeq[IndexedSeq[Int]] = Array(Array(1,2), Array(3), Array()).map(_.to[IndexedSeq])
wa2d: IndexedSeq[IndexedSeq[Int]] = ArraySeq(Vector(1, 2), Vector(3), Vector())
But all the WrappedArrays mysteriously disappeared and were replaced with ArraySeq and Vector.
So what is the least obscure way to define nested WrappedArrays?
Here is how you do it:
scala> def wrap[T](a: Array[Array[T]]): IndexedSeq[IndexedSeq[T]] = { val x = a.map(x => x: IndexedSeq[T]); x }
scala> wrap(Array(Array(1,2), Array(3,4)))
res13: IndexedSeq[IndexedSeq[Int]] = WrappedArray(WrappedArray(1, 2), WrappedArray(3, 4))
If you want to use implicit conversions, use this:
def wrap[T](a: Array[Array[T]]): IndexedSeq[IndexedSeq[T]] = { val x = a.map(x => x: IndexedSeq[T]); x }
implicit def nestedArrayIsNestedIndexedSeq[T](x: Array[Array[T]]): IndexedSeq[IndexedSeq[T]] = wrap(x)
val x: IndexedSeq[IndexedSeq[Int]] = Array(Array(1,2),Array(3,4))
And here is why you might not want to do it:
val th = ichi.bench.Thyme.warmed()
val a = (0 until 100).toArray
val b = a: IndexedSeq[Int]
def sumArray(a: Array[Int]): Int = { var i = 0; var sum = 0; while(i < a.length) { sum += a(i); i += 1 }; sum }
def sumIndexedSeq(a: IndexedSeq[Int]): Int = { var i = 0; var sum = 0; while(i < a.length) { sum += a(i); i += 1 }; sum }
scala> th.pbenchOff("")(sumArray(a))(sumIndexedSeq(b))
Benchmark comparison (in 439.6 ms)
Significantly different (p ~= 0)
Time ratio: 3.18875 95% CI 3.06446 - 3.31303 (n=30)
First 65.12 ns 95% CI 62.69 ns - 67.54 ns
Second 207.6 ns 95% CI 205.2 ns - 210.1 ns
res15: Int = 4950
The bottom line is that once you access your Array[Int] indirectly via WrappedArray[Int], primitives get boxed. So things get much slower. If you really need the full performance of arrays, you have to use them directly. And if you don't, just use a Vector and stop worrying about it.
I would just go with Vector for prototyping and move to Array once/if you are sure that this is actually a performance bottleneck. Use a type alias so you can quickly switch from Vector to Array.
Somewhere in your package object:
type Vec[T] = Vector[T]
val Vec = Vector
// type Vec[T] = Array[T]
// val Vec = Array
Then you can write code like this:
val grid = Vec(Vec(1,2), Vec(3,4))
and switch quickly to an array version in case you measure that this is actually a performance bottleneck.
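As a quick sanity check (hypothetical REPL values, not from the original post), the same code compiles under either alias:

val grid = Vec(Vec(1, 2), Vec(3, 4))
grid(1)(0)       // 3 -- indexed access works for both Vector and Array
grid.map(_.sum)  // Vector(3, 7) under the Vector alias, Array(3, 7) under the Array alias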
Related
I have a simple class I always implement when working with a new language: MergeSort. My implementation of it with type Int looks great. Then I wanted to genericize it. I started with a simple implementation over T, but I noticed that I needed to reflect the ClassTag. How do I combine the reflected ClassTag with extending?
class MergeSort[T: scala.reflect.ClassTag] {
  var array: Array[T] = Array[T]()
  var length: Int = 0
  var tempArray: Array[T] = new Array(length)

  def sort(data: Array[T]): Unit = {
    array = data
    length = data.length
    tempArray = new Array[T](length)
    // sort(0, length - 1)
  }
}
Now this looks nice! It works, but when I do the sort and the rest of the functionality, I need to be able to compare two items of type T. The "Java" way was to just make sure the object has a compareTo method. So I was thinking: [T extends Comparable]
but in Scala I am already binding T with a ClassTag, and
class MergeSort[T: scala.reflect.ClassTag extends Comparable] {}, for example, errors with:
']' expected, but 'extends' found.
I was thinking this would sort of be the way to do things, but I am not sure what's going on here.
The end state is to implement the merge portion of the class:
def merge(lower: Int, center: Int, upper: Int): Unit = {
  // ...
  // loop
  // if (tempArr(i) <= tempArr(j)) {}             // old way, since the first attempt was with Int
  // if (tempArr(i).compareTo(tempArr(j)) < 0) {} // modified way with Comparable
}
Is this the Scala way of implementing it? I noticed people mentioning Ordering, but I thought Comparable made sense.
The Scala way of implementing merge sort is using List, vals, and the Ordering trait. The advantage of Ordering (the equivalent of Java's Comparator) is that Scala gives you implicit orderings for all standard library types by default.
import scala.annotation.tailrec

def msort[T: Ordering](xs: List[T]): List[T] = {
  @tailrec
  def merge(xs: List[T], ys: List[T], acc: List[T] = Nil): List[T] =
    (xs, ys) match {
      case (Nil, _) => acc.reverse ++ ys
      case (_, Nil) => acc.reverse ++ xs
      case (x :: xs1, y :: ys1) =>
        if (implicitly[Ordering[T]].lt(x, y))
          merge(xs1, ys, x :: acc)
        else
          merge(xs, ys1, y :: acc)
    }

  xs match {
    case Nil | _ :: Nil => xs
    case _ =>
      val (xs1, xs2) = xs.splitAt(xs.length / 2)
      merge(msort(xs1), msort(xs2))
  }
}
msort(List(4, 23, 1, 2, 5, 76, 3, 142, 4321, 213, 42323))
// List(1, 2, 3, 4, 5, 23, 76, 142, 213, 4321, 42323)
msort(List("John", "Chris", "Helen", "Danny", "Michelle"))
// List(Chris, Danny, Helen, John, Michelle)
Another advantage over Ordered is that Scala provides an implicit conversion from Ordered[A] to Ordering[A], which means custom types that mix in Ordered will work with msort without you having to define implicit orderings.
Finally, the last advantage over Ordered shows up with the numeric types: Int, Double, etc. do not mix in Ordered, so you cannot sort elements of those types with Ordered; this is why most people use Ordering instead.
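For instance (a quick sketch with a made-up Version type, not from the original post), a custom type that mixes in Ordered sorts with msort out of the box:

case class Version(major: Int, minor: Int) extends Ordered[Version] {
  def compare(that: Version): Int =
    if (major != that.major) major compare that.major
    else minor compare that.minor
}

msort(List(Version(2, 1), Version(1, 9), Version(2, 0)))
// List(Version(1,9), Version(2,0), Version(2,1))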
I'm well aware this variant is not in-place, but it does not require a ClassTag at all to implement.
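For completeness, here is how the original in-place array version can keep its shape: context bounds simply stack, so the class header becomes MergeSort[T: ClassTag: Ordering]. A minimal sketch along the lines of the asker's class (the helper names and recursion structure are mine, not a fixed API):

import scala.reflect.ClassTag

// Context bounds stack: T needs a ClassTag (to allocate arrays)
// and an Ordering (to compare elements).
class MergeSort[T: ClassTag: Ordering] {
  private val ord = implicitly[Ordering[T]]

  def sort(data: Array[T]): Unit = {
    val temp = new Array[T](data.length) // needs the ClassTag

    def merge(lower: Int, center: Int, upper: Int): Unit = {
      Array.copy(data, lower, temp, lower, upper - lower + 1)
      var i = lower; var j = center + 1; var k = lower
      while (k <= upper) {
        // take from the left run while it is non-empty and its head compares <=
        if (j > upper || (i <= center && ord.lteq(temp(i), temp(j)))) {
          data(k) = temp(i); i += 1
        } else {
          data(k) = temp(j); j += 1
        }
        k += 1
      }
    }

    def go(lower: Int, upper: Int): Unit =
      if (lower < upper) {
        val center = (lower + upper) / 2
        go(lower, center)
        go(center + 1, upper)
        merge(lower, center, upper)
      }

    go(0, data.length - 1)
  }
}

new MergeSort[Int].sort(Array(3, 1, 2)) then sorts in place for any element type with an implicit Ordering, without touching Comparable.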
Disclaimer: I'm VERY new to Spark and Scala. I am working on a document similarity project in Scala with Spark. I have a dataframe which looks like this:
+--------+--------------------+------------------+
| text| shingles| hashed_shingles|
+--------+--------------------+------------------+
| qwerty|[qwe, wer, ert, rty]| [-4, -6, -1, -9]|
|qwerasfg|[qwe, wer, era, r...|[-4, -6, 6, -2, 2]|
+--------+--------------------+------------------+
Where I split the document text into shingles and computed a hash value for each one.
Imagine I have a hash_function(integer, seed) -> integer.
Now I want to apply n different hash functions of this form to the hashed_shingles arrays. I.e. obtain an array of n arrays such that each array is hash_function(hashed_shingles, seed) with seed from 1 to n.
I'm trying something like this, but I cannot get it to work:
val n = 3
df = df.withColumn("tmp", array_repeat($"hashed_shingles", n)) // Repeat minhashes
val minhash_expr = "transform(tmp,(x,i) -> hash_function(x, i))"
df = df.withColumn("tmp", expr(minhash_expr)) // Apply hash to each array
I know how to do it with a UDF, but as I understand it, UDFs are not optimized and should be avoided, so I am trying to do everything with org.apache.spark.sql.functions.
Any ideas on how to approach it without udf?
The UDF which achieves the same goal is this:
// Family of hashing functions
class Hasher(seed: Int, max_val: Int, p: Int = 104729) {
  private val random_generator = new scala.util.Random(seed)
  val a = 1 + 2 * random_generator.nextInt((p - 2) / 2) // a odd in [1, p-1]
  val b = 1 + random_generator.nextInt(p - 2)           // b in [1, p-1]
  def getHash(x: Int): Int = ((a * x + b) % p) % max_val
}
// Compute a list of minhashes from a list of hashers given a set of ids
class MinHasher(hashes: List[Hasher]) {
  def getMinHash(set: Seq[Int])(hasher: Hasher): Int = set.map(hasher.getHash).min
  def getMinHashes(set: Seq[Int]): Seq[Int] = hashes.map(getMinHash(set))
}
// Minhasher
val minhash_len = 100
val hashes = List.tabulate(minhash_len)(n => new Hasher(n, shingle_bins))
val minhasher = new MinHasher(hashes)
// Compute Minhashes
val minhasherUDF = udf[Seq[Int], Seq[Int]](minhasher.getMinHashes)
df = df.withColumn("minhashes", minhasherUDF('hashed_shingles))
I have two arrays of strings, say
A = ('abc', 'joia', 'abas8', '09ma09', 'oiam0')
and
B = ('gfdg', '89jkjj', '09ma09', 'asda', '45645ghf', 'dgfdg', 'yui345gd', '6456ds', '456dfs3', 'abas8', 'sfgds').
What I want to do is simply count, for each string in A, how many times it appears in B (if at all). For example, the resulting array here should be: C = (0, 0, 1, 1, 0). How can I do that?
Try this:
A.map(x => B.count(y => y == x))
You can do it the way idursun suggested and it will work, but it may not be as efficient as preparing the intersection first. If B is much bigger than A, this gives a massive speedup: the intersect method has better big-O complexity than doing a linear search through B for each element of A.
val A = Array("abc", "joia", "abas8", "09ma09", "oiam0")
val B = Array("gfdg", "89jkjj", "09ma09", "asda", "45645ghf", "dgfdg", "yui345gd", "6456ds", "456dfs3", "abas8", "sfgds")
val intersectCounts: Map[String, Int] =
  A.intersect(B).map(s => s -> B.count(_ == s)).toMap
val count = A.map(intersectCounts.getOrElse(_, 0))
println(count.toSeq)
Result
(0, 0, 1, 1, 0)
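A related sketch in the same spirit (same A and B as above): precompute all of B's counts once with groupBy, so each element of A becomes a single map lookup:

val countsInB: Map[String, Int] =
  B.groupBy(identity).map { case (s, occurrences) => s -> occurrences.length }
val C = A.map(countsInB.getOrElse(_, 0))
// C: Array[Int] = Array(0, 0, 1, 1, 0)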
Use a foldLeft over B to build the yield for each element of A:
val A = List("a","b")
val B = List("b","b")
val C = for (a <- A)
  yield B.foldLeft(0) { case (totalc: Int, w: String) =>
    totalc + (if (w == a) 1 else 0)
  }
And the result:
C: List[Int] = List(0, 2)
I'm trying to translate the following Matlab logical-indexing pattern into Scala code:
% x is an [Nx1] array of Int32
% y is an [Nx1] array of Int32
% myExpensiveFunction() processes batches of unique x.
ux = unique(x);
z = nan(size(x));
for i = 1:length(ux)
idx = x == ux(i);
z(idx) = myExpensiveFunction(x(idx), y(idx));
end
Assume I'm working with val x: Array[Int] in Scala. What is the best way to do this?
Edit: To clarify, I'm looking to process batches of (x,y) at a time, grouped by unique x, and return a result (z) with an order corresponding to the initial input. I'm open to sorting x, but eventually need to get back to the original unsorted order. My primary requirement is to handle all the indexing/mapping/sorting in a clear and reasonably efficient way.
Most of this is pretty straightforward in Scala; the only thing that's a bit out of the ordinary is the unique x indices. In Scala you'd do that with a groupBy. Since this is a really index-heavy method, I'm just going to give in and go with indices all the way:
val z = Array.fill(x.length)(Double.NaN)
x.indices.groupBy(i => x(i)).foreach{ case (xi, is) =>
  is.foreach(i => z(i) = myExpensiveFunction(xi, y(i)))
}
z
assuming you can live with a lack of vectors going to myExpensiveFunction. If not,
val z = Array.fill(x.length)(Double.NaN)
x.indices.groupBy(i => x(i)).foreach{ case (xi, is) =>
  val xs = Array.fill(is.length)(xi)
  val ys = is.map(i => y(i)).toArray
  val zs = myExpensiveFunction(xs, ys)
  // zs is indexed by position within the batch, not by the original index
  is.zipWithIndex.foreach{ case (i, j) => z(i) = zs(j) }
}
z
This isn't the most natural way to do the computation in Scala, or the most efficient, but you don't care about efficiency if your expensive function is expensive, and it's the closest I can come to a literal translation.
(Translating your Matlab algorithms into almost anything else involves a certain amount of pain or rethinking, since the "natural" computations in Matlab are not like those in most other languages.)
The important point is to get Matlab's unique right. A simple solution would be to use a Set to determine the unique values:
val z = Array.fill(x.length)(Double.NaN)
val occurringValues = x.toSet
occurringValues.foreach{ value =>
  val indices = x.indices.filter(i => x(i) == value)
  for (i <- indices) {
    z(i) = myExpensiveFunction(x(i), y(i))
  }
}
Note: I assume here that it is possible to change myExpensiveFunction to an element-wise operation...
scala> def process(xs: Array[Int], ys: Array[Int], f: (Seq[Int], Seq[Int]) => Double): Array[Double] = {
| val ux = xs.distinct
| val zs = Array.fill(xs.size)(Double.NaN)
| for(x <- ux) {
| val idx = xs.indices.filter{ i => xs(i) == x }
| val res = f(idx.map(xs), idx.map(ys))
| idx foreach { i => zs(i) = res }
| }
| zs
| }
process: (xs: Array[Int], ys: Array[Int], f: (Seq[Int], Seq[Int]) => Double)Array[Double]
scala> val xs = Array(1,2,1,2,3)
xs: Array[Int] = Array(1, 2, 1, 2, 3)
scala> val ys = Array(1,2,3,4,5)
ys: Array[Int] = Array(1, 2, 3, 4, 5)
scala> val f = (a: Seq[Int], b: Seq[Int]) => a.sum/b.sum.toDouble
f: (Seq[Int], Seq[Int]) => Double = <function2>
scala> process(xs, ys, f)
res0: Array[Double] = Array(0.5, 0.6666666666666666, 0.5, 0.6666666666666666, 0.6)
Let's consider a simple mapping example:
val a = Array("One", "Two", "Three")
val b = a.map(s => myFn(s))
What I need is to use not myFn(s: String): String here, but myFn(s: String, n: Int): String, where n would be the index of s in a. In this particular case myFn would expect the second argument to be 0 for s == "One", 1 for s == "Two" and 2 for s == "Three". How can I achieve this?
Depends whether you want convenience or speed.
Slow:
a.zipWithIndex.map{ case (s,i) => myFn(s,i) }
Faster:
for (i <- a.indices) yield myFn(a(i),i)
{ var i = -1; a.map{ s => i += 1; myFn(s,i) } }
Possibly fastest:
Array.tabulate(a.length){ i => myFn(a(i),i) }
If not, this surely is:
val b = new Array[Whatever](a.length)
var i = 0
while (i < a.length) {
  b(i) = myFn(a(i), i)
  i += 1
}
(In Scala 2.10.1 with Java 1.6u37, if "possibly fastest" is declared to take 1x time for a trivial string operation (truncation of a long string to a few characters), then "slow" takes 2x longer, "faster" each take 1.3x longer, and "surely" takes only 0.5x the time.)
A general tip: use the .iterator method liberally to avoid creating intermediate collections and thus speed up your computation. (Only when performance requirements demand it. Otherwise don't.)
scala> def myFun(s: String, i: Int) = s + i
myFun: (s: String, i: Int)java.lang.String
scala> Array("nami", "zoro", "usopp")
res17: Array[java.lang.String] = Array(nami, zoro, usopp)
scala> res17.iterator.zipWithIndex
res19: java.lang.Object with Iterator[(java.lang.String, Int)]{def idx: Int; def idx_=(x$1: Int): Unit} = non-empty iterator
scala> res19 map { case (k, v) => myFun(k, v) }
res22: Iterator[java.lang.String] = non-empty iterator
scala> res22.toArray
res23: Array[java.lang.String] = Array(nami0, zoro1, usopp2)
Keep in mind that iterators are mutable, and hence once consumed cannot be used again.
An aside: the map call above involves de-tupling and then function application, which forces the use of some local variables. You can avoid that using some higher-order sorcery: convert a regular function to one accepting a tuple, and then pass it to map.
scala> Array("nami", "zoro", "usopp").zipWithIndex.map(Function.tupled(myFun))
res24: Array[java.lang.String] = Array(nami0, zoro1, usopp2)
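Equivalently, the function's own tupled method does the same conversion (assuming myFun is eta-expanded with myFun _):

Array("nami", "zoro", "usopp").zipWithIndex.map((myFun _).tupled)
// Array(nami0, zoro1, usopp2)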
What about this? I think it should be fast and it's pretty, but I'm no expert on Scala speed...
// acc.length doubles as the current index, and the results are kept
val b = a.foldLeft(Vector.empty[String]) { (acc, x) => acc :+ myFn(x, acc.length) }
The index can also be accessed via the second element of the tuples generated by the zipWithIndex method:
val a = Array("One", "Two", "Three")
val b = a.zipWithIndex.map(s => myFn(s._1, s._2))