Disclaimer: I'm VERY new to Spark and Scala. I am working on a document similarity project in Scala with Spark. I have a dataframe which looks like this:
+--------+--------------------+------------------+
| text| shingles| hashed_shingles|
+--------+--------------------+------------------+
| qwerty|[qwe, wer, ert, rty]| [-4, -6, -1, -9]|
|qwerasfg|[qwe, wer, era, r...|[-4, -6, 6, -2, 2]|
+--------+--------------------+------------------+
Here I split the document text into shingles and computed a hash value for each one.
Imagine I have a hash_function(integer, seed) -> integer.
Now I want to apply n different hash functions of this form to the hashed_shingles arrays. I.e. obtain an array of n arrays such that each array is hash_function(hashed_shingles, seed) with seed from 1 to n.
I'm trying something like this, but I cannot get it to work:
val n = 3
df = df.withColumn("tmp", array_repeat($"hashed_shingles", n)) // Repeat minhashes
val minhash_expr = "transform(tmp,(x,i) -> hash_function(x, i))"
df = df.withColumn("tmp", expr(minhash_expr)) // Apply hash to each array
I know how to do it with a udf, but as I understand they are not optimized and I should try to avoid using them, so I try to do everything with org.apache.spark.sql.functions.
Any ideas on how to approach it without udf?
The udf which achieves the same goal is this:
// Family of hashing functions
class Hasher(seed: Int, max_val : Int, p : Int = 104729) {
private val random_generator = new scala.util.Random(seed)
val a = 1 + 2*random_generator.nextInt((p-2)/2) // a odd, 1 <= a <= p-4 (for odd p)
val b = 1 + random_generator.nextInt(p - 2) // b in [1, p-2]
def getHash(x : Int) : Int = ((a*x + b) % p) % max_val
}
// Compute a list of minhashes from a list of hashers given a set of ids
class MinHasher(hashes : List[Hasher]) {
def getMinHash(set : Seq[Int])(hasher : Hasher) : Int = set.map(hasher.getHash).min
def getMinHashes(set: Seq[Int]) : Seq[Int] = hashes.map(getMinHash(set))
}
// Minhasher
val minhash_len = 100
val hashes = List.tabulate(minhash_len)(n => new Hasher(n, shingle_bins))
val minhasher = new MinHasher(hashes)
// Compute Minhashes
val minhasherUDF = udf[Seq[Int], Seq[Int]](minhasher.getMinHashes)
df = df.withColumn("minhashes", minhasherUDF('hashed_shingles))
Say I have two arrays. One array A of a set of integers - all distinct. Another array B of a list of integers, all appearing in array A, but not necessarily distinct. For example:
A could be Array(123, 456, 789)
B could be Array(123, 123, 456, 123, 789, 456)
I want to create an array C, which tells us the frequency of each element (from array A) appearing in array B. In this case, C would be Array(3, 2, 1) because 123 appears 3 times, 456 appears 2 times, and 789 appears 1 time.
What is an efficient way to do this in Scala?
My attempt is
val C: Array[Int] = Array.fill(3)(0)
var idx = 0
for (i <- A) {
  for (j <- B) {
    if (j == i) C(idx) += 1
  }
  idx += 1
}
for(i <- C){println(i)}
But I understand that this is probably inefficient, and would take a long time if I am dealing with a much larger array A and array B. But I am restricted to for loops and if statements since I am only a beginner with Scala. Is there a more efficient way to do this?
Let's say that n is the length of array A and m is the length of array B.
As of now, your solution is O(n * m).
You can improve this to O(n + m) by using a mutable HashMap, at the cost of O(n) extra space.
import scala.collection.mutable
val a = Array(123, 456, 789)
val b = Array(123, 123, 456, 123, 789, 456)
val countMap = mutable.HashMap.empty[Int, Int]
// add all integers in `a` with count 0
for (i <- a) {
countMap.put(i, 0)
}
// iterate on b
// and update the count in countMap (if exists)
for (i <- b) {
countMap.get(i).foreach(c => countMap.put(i, c + 1))
}
// fill your array `c`
val c = Array.ofDim[Int](a.length)
for ((i, index) <- a.zipWithIndex) {
c(index) = countMap.getOrElse(i, 0)
}
println(c.mkString(", "))
// 3, 2, 1
Keep in mind that for loops over Scala collections have their own overhead; you can improve this further by using while loops.
import scala.collection.mutable
val a = Array(123, 456, 789)
val b = Array(123, 123, 456, 123, 789, 456)
val countMap = mutable.HashMap.empty[Int, Int]
// to use with our while loops
var i = 0
// add all integers in `a` with count 0
i = 0
while (i < a.length) {
countMap.put(a(i), 0)
i = i + 1
}
// iterate on b
// and update the count in countMap (if exists)
i = 0
while (i < b.length) {
if (countMap.contains(b(i))) {
countMap.put(b(i), countMap(b(i)) + 1)
}
i = i + 1
}
// fill your array `c`
val c = Array.ofDim[Int](a.length)
i = 0
while (i < a.length) {
c(i) = countMap.getOrElse(a(i), 0)
i = i + 1
}
println(c.mkString(", "))
// 3, 2, 1
I have a function that I would like to apply to each element of the cartesian product of a linear space with itself. I know how to do it for a function of one variable defined with a lambda, namely by using map. Here is an example:
import numpy as np
xpts = np.linspace(0, 1, 5)
fun = lambda p: p**2
arr = np.array(list(map(fun , xpts)))
But with a multivariate function I did not manage to use the map function. Here is an example of what I am doing instead, which is slow:
import itertools
def fun(x, y):
    return 2*x + y
xpts = np.linspace(0, 1, 5)
# make dict of indexes
count = 0
dic = dict()
for j in xpts:
    dic[j] = count
    count += 1
# preallocate array
arr = np.empty([len(xpts)]*2)
for tup in itertools.product(xpts, xpts):
    ind1 = dic[tup[0]]
    ind2 = dic[tup[1]]
    val1 = tup[0]
    val2 = tup[1]
    arr[ind1, ind2] = fun(val1, val2)
As my function is complicated and the space is large, I am looking for an efficient/scalable way.
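One possible direction, as a minimal sketch rather than a definitive answer: if fun is built only from elementwise arithmetic (an assumption; it holds for the toy 2*x+y above but may not hold for a more complicated function), NumPy broadcasting can evaluate it on the whole grid in a single call. The names arr_loop, i and j are introduced here for illustration. The second half shows a scalar fallback that gets the indices from enumerate instead of a lookup dict.

```python
import numpy as np

def fun(x, y):
    # same toy function as in the question
    return 2 * x + y

xpts = np.linspace(0, 1, 5)

# Broadcasting: a column of x values against a row of y values
# evaluates fun on every (x, y) pair in one vectorized call.
# Assumes fun only uses elementwise operations.
arr = fun(xpts[:, None], xpts[None, :])

# Fallback for a function that only accepts scalars: enumerate
# yields the indices directly, so no index dict is needed.
arr_loop = np.empty((len(xpts), len(xpts)))
for i, x in enumerate(xpts):
    for j, y in enumerate(xpts):
        arr_loop[i, j] = fun(x, y)
```

Both arrays have shape (5, 5), with arr[i, j] equal to fun(xpts[i], xpts[j]). The broadcast version avoids the Python-level loop entirely, which is usually where the time goes on large grids.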
I have a situation where I have two arrays and I need to partition them so that I end up with 3 arrays:
elements that are only in A
elements that are only in B
elements that are in both A and B
Example:
A = [1, 4, 3, 2]
B = [2, 6, 5, 3]
3part(A,B) => [[1,4], [6,5], [2,3]] # the order of the elements in each array doesn't matter
I've come up with a correct solution, but wonder if it could be quicker. It is (in pseudocode):
3part(A,B) =>
    a_only = []
    b_only = []
    intersect = []
    foreach a in A
        if B.contains(a)
            intersect.push(a)
        else
            a_only.push(a)
    foreach b in B
        if not intersect.contains(b)
            b_only.push(b)
    return [a_only, b_only, intersect]
In my case at hand, A & B will each contain up to 5000 complex structures (instead of ints) and it runs in about 1.5 sec. It gets used as part of a user interaction which can happen frequently, so ideally it would take < 0.5 sec.
BTW, is there a name for this operation as a whole, other than "difference-difference-intersection"?
Thanks!
EDIT:
Based on the suggestions to use a hash, my updated code runs in under 40ms :-D
Here is the pseudocode:
(say that "key" is the element that I am using for comparison)
array_to_hash(A, Key)
    h = {}
    foreach a in A
        h[a[Key]] = a
    return h
3part(A,B) =>
    a_only = []
    b_only = []
    intersect = {} // note this is now a hash instead of array
    Ah = array_to_hash(A, 'key')
    Bh = array_to_hash(B, 'key')
    foreach ak in Ah.keys()
        if Bh.hasKey(ak)
            intersect[ak] = Ah[ak]
        else
            a_only.push(Ah[ak])
    foreach bk in Bh.keys()
        if not intersect.hasKey(bk)
            b_only.push(Bh[bk])
    return [a_only, b_only, intersect.values]
Thank you all for the suggestions.
If your arrays are sortable, then you could do one of two things:
To check whether a value is in the other array, do a binary search on the other (sorted) array. Complexity: O(n log m + m log n).
Or you could merge the arrays into the 3 result arrays using 2 pointers. Since the arrays are sorted: if the current elements are equal, add the value to the intersection and advance both pointers. If they are not equal and the element in A < the element in B, add the element in A to the A-only array and advance the pointer in A, so the 2nd element of A is now compared with the first element of B.
Do the same, mirrored, if the element in B is less than the element in A.
Complexity: O(n + m), not counting the sort itself if the arrays are not already sorted.
The 2 pointers keep track of which element of each array we are currently looking at.
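The two-pointer merge described above can be sketched like this, in Python purely for illustration. The function name three_way_partition_sorted is made up here, and the sketch assumes both inputs are already sorted and contain no duplicates.

```python
def three_way_partition_sorted(a, b):
    """Split two sorted, duplicate-free lists into (a_only, b_only, both)."""
    a_only, b_only, both = [], [], []
    i = j = 0  # the two pointers: i walks a, j walks b
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # present in both arrays: record once, advance both pointers
            both.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # a[i] is smaller than everything left in b, so it is A-only
            a_only.append(a[i])
            i += 1
        else:
            # b[j] is smaller than everything left in a, so it is B-only
            b_only.append(b[j])
            j += 1
    # whatever remains in either array has no partner in the other
    a_only.extend(a[i:])
    b_only.extend(b[j:])
    return a_only, b_only, both


a_only, b_only, both = three_way_partition_sorted(
    sorted([1, 4, 3, 2]), sorted([2, 6, 5, 3])
)
print(a_only, b_only, both)  # [1, 4] [5, 6] [2, 3]
```

For complex structures you would compare on the sort key instead of the elements themselves; the loop itself stays the same.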
Your pseudocode is good IF you use a hashed set data structure for all the "arrays".
I guess all current programming environments have decent collections support, including hash-based sets. Here are two examples of how to do it in Java, running in more or less O(n+m). In Java, for hashed collections to function properly, it's important that your complex objects implement the hashCode() and equals() methods in a compliant fashion (they can often be auto-generated by your IDE).
The first version completely relies on the set-algebra implementation of your library, which should result in O(n) if the library is OK:
private static void test1() {
Integer[] a = {1, 4, 3, 2};
Integer[] b = {2, 6, 5, 3};
Set<Integer> aOnly = new HashSet<>(Arrays.asList(a));
Set<Integer> bOnly = new HashSet<>(Arrays.asList(b));
Set<Integer> ab = new HashSet<>(aOnly);
ab.retainAll(bOnly);
aOnly.removeAll(ab);
bOnly.removeAll(ab);
System.out.println("A only: " + aOnly);
System.out.println("A and B: " + ab);
System.out.println("B only: " + bOnly);
}
The second one uses the fact that in Java, the remove() method returns true if the element was present before removing it. If your library doesn't do that, you have to check for membership yourself (e.g. with contains()) before removing.
private static void test2() {
Integer[] a = {1, 4, 3, 2};
Integer[] b = {2, 6, 5, 3};
Set<Integer> aOnly = new HashSet<>(Arrays.asList(a));
Set<Integer> bOnly = new HashSet<>();
Set<Integer> ab = new HashSet<>();
for (int bElem : b) {
if (aOnly.remove(bElem)) {
ab.add(bElem);
} else {
bOnly.add(bElem);
}
}
System.out.println("A only: " + aOnly);
System.out.println("A and B: " + ab);
System.out.println("B only: " + bOnly);
}
I sense that you're allowing yourself to be handicapped by assumptions of primitive operations. Current hardware includes excellent hashing support and GEMM operations.
Hash the values of A and B into a single space, with a range on the order of |A + B|.
Convert both arrays to one-hot encoding in that range.
Apply vector AND-OR-NOT operations to obtain your three results.
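A rough sketch of these three steps, in Python and only to illustrate the idea (it is not hardware-accelerated). Assumptions made here: a dict stands in for a collision-free hash into a range of size |A + B|, a plain Python integer serves as the one-hot bit vector, and the names three_way_partition_bits, one_hot and decode are invented for this example.

```python
def three_way_partition_bits(a, b):
    # Step 1: map every distinct value to its own slot. The dict plays the
    # role of a collision-free hash into range(len(index)) (an assumption).
    index = {}
    for value in list(a) + list(b):
        index.setdefault(value, len(index))
    values = list(index)  # slot -> value (dicts preserve insertion order)

    # Step 2: one-hot encode each array as a bitmask over those slots.
    def one_hot(items):
        mask = 0
        for item in items:
            mask |= 1 << index[item]
        return mask

    mask_a, mask_b = one_hot(a), one_hot(b)

    # Turn a bitmask back into the list of values whose bits are set.
    def decode(mask):
        return [values[k] for k in range(len(values)) if (mask >> k) & 1]

    # Step 3: AND / AND-NOT on whole bit vectors give the three parts.
    return (
        decode(mask_a & ~mask_b),  # only in A
        decode(mask_b & ~mask_a),  # only in B
        decode(mask_a & mask_b),   # in both
    )


result = three_way_partition_bits([1, 4, 3, 2], [2, 6, 5, 3])
print(result)  # ([1, 4], [6, 5], [3, 2])
```

With real hashing, collisions would have to be handled separately; the dict sidesteps that here.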
I need a read-only structure with fast indexed access and minimum overhead. That structure will be queried quite often by the application. So, as suggested on the net, I tried to use Arrays and cast them to IndexedSeq
scala> val wa : IndexedSeq[Int] = Array(1,2,3)
wa: IndexedSeq[Int] = WrappedArray(1, 2, 3)
So far, so good. But I need to use nested Arrays and there the problem lies.
val wa2d : IndexedSeq[IndexedSeq[Int]] = Array(Array(1,2), Array(3), Array())
<console>:8: error: type mismatch;
found : Array[Array[_ <: Int]]
required: IndexedSeq[IndexedSeq[Int]]
val wa2d : IndexedSeq[IndexedSeq[Int]] = Array(Array(1,2), Array(3), Array())
The Scala compiler cannot apply the implicit conversion recursively, so I annotated each subarray explicitly:
scala> val wa2d : IndexedSeq[IndexedSeq[Int]] = Array(Array[Int](1,2) : IndexedSeq[Int], Array[Int](3) : IndexedSeq[Int], Array[Int]() : IndexedSeq[Int])
wa2d: IndexedSeq[IndexedSeq[Int]] = WrappedArray(WrappedArray(1, 2), WrappedArray(3), WrappedArray())
That worked as expected, but this form is too verbose, for each subarray I need to specify types twice. And I would like to avoid it completely. So I've tried another approach
scala> val wa2d : IndexedSeq[IndexedSeq[Int]] = Array(Array(1,2), Array(3), Array()).map(_.to[IndexedSeq])
wa2d: IndexedSeq[IndexedSeq[Int]] = ArraySeq(Vector(1, 2), Vector(3), Vector())
But all the WrappedArrays mysteriously disappeared and were replaced with ArraySeq and Vector.
So what is a less obscure way to define nested WrappedArrays?
Here is how you do it:
scala> def wrap[T](a: Array[Array[T]]): IndexedSeq[IndexedSeq[T]] = { val x = a.map(x => x: IndexedSeq[T]); x }
scala> wrap(Array(Array(1,2), Array(3,4)))
res13: IndexedSeq[IndexedSeq[Int]] = WrappedArray(WrappedArray(1, 2), WrappedArray(3, 4))
If you want to use implicit conversions, use this:
def wrap[T](a: Array[Array[T]]): IndexedSeq[IndexedSeq[T]] = { val x = a.map(x => x: IndexedSeq[T]); x }
implicit def nestedArrayIsNestedIndexedSeq[T](x: Array[Array[T]]): IndexedSeq[IndexedSeq[T]] = wrap(x)
val x: IndexedSeq[IndexedSeq[Int]] = Array(Array(1,2),Array(3,4))
And here is why you might not want to do it:
val th = ichi.bench.Thyme.warmed()
val a = (0 until 100).toArray
val b = a: IndexedSeq[Int]
def sumArray(a: Array[Int]): Int = { var i = 0; var sum = 0; while(i < a.length) { sum += a(i); i += 1 }; sum }
def sumIndexedSeq(a: IndexedSeq[Int]): Int = { var i = 0; var sum = 0; while(i < a.length) { sum += a(i); i += 1 }; sum }
scala> th.pbenchOff("")(sumArray(a))(sumIndexedSeq(b))
Benchmark comparison (in 439.6 ms)
Significantly different (p ~= 0)
Time ratio: 3.18875 95% CI 3.06446 - 3.31303 (n=30)
First 65.12 ns 95% CI 62.69 ns - 67.54 ns
Second 207.6 ns 95% CI 205.2 ns - 210.1 ns
res15: Int = 4950
The bottom line is that once you access your Array[Int] indirectly via WrappedArray[Int], primitives get boxed. So things get much slower. If you really need the full performance of arrays, you have to use them directly. And if you don't, just use a Vector and stop worrying about it.
I would just go with Vector for prototyping and then go to Array once/if you are sure that this is actually a performance bottleneck. Use a type alias so you can quickly switch from Vector to Array.
Somewhere in your package object:
type Vec[T] = Vector[T]
val Vec = Vector
// type Vec[T] = Array[T]
// val Vec = Array
Then you can write code like this
val grid = Vec(Vec(1,2), Vec(3,4))
and switch quickly to an array version in case you measure that this is actually a performance bottleneck.
I have two arrays of strings, say
A = ('abc', 'joia', 'abas8', '09ma09', 'oiam0')
and
B = ('gfdg', '89jkjj', '09ma09', 'asda', '45645ghf', 'dgfdg', 'yui345gd', '6456ds', '456dfs3', 'abas8', 'sfgds').
What I want to do is simply count how many times each string in A appears in B (if at all). For example, the resulting array here should be: C = (0, 0, 1, 1, 0). How can I do that?
try this:
A.map(x => B.count(y => y == x))
You can do it the way idursun suggested and it will work, but it may be less efficient than computing the intersection first. If B is much bigger than A, that can give a massive speedup. The 'intersect' method has better big-O complexity than doing a linear search for each element of A in B, so elements of A that never occur in B no longer cost a full scan of B.
val A = Array("abc", "joia", "abas8", "09ma09", "oiam0")
val B = Array("gfdg", "89jkjj", "09ma09", "asda", "45645ghf", "dgfdg", "yui345gd", "6456ds", "456dfs3", "abas8", "sfgds")
val intersectCounts: Map[String, Int] =
A.intersect(B).map(s => s -> B.count(_ == s)).toMap
val count = A.map(intersectCounts.getOrElse(_, 0))
println(count.toSeq)
Result
(0, 0, 1, 1, 0)
Use a foldLeft over B as the value yielded for each element of A:
val A = List("a","b")
val B = List("b","b")
val C = for (a <- A)
yield B.foldLeft(0) { case (totalc : Int, w : String) =>
totalc + (if (w == a) 1 else 0)
}
And the result:
C: List[Int] = List(0, 2)