Related
I am familiar with this approach; case in point, an example from "How to obtain the average of an array-type column in scala-spark over all row entries per entry?":
val array_size = 3
val avgAgg = for (i <- 0 to array_size -1) yield avg($"value".getItem(i))
df.select(array(avgAgg: _*).alias("avg_value")).show(false)
However, the 3 is hard-coded, whereas in reality the array size varies.
No matter how hard I try to avoid a UDF, I cannot do this kind of thing dynamically, based on the size of an array column already present in the data frame. E.g.:
...
val z = for (i <- 1 to size($"sortedCol") ) yield array (element_at($"sortedCol._2", i), element_at($"sortedCol._3", i) )
...
...
.withColumn("Z", array(z: _*) )
I am looking for how this can be done by applying it to an existing array column of variable length. transform? expr? Not sure.
Full code as per request:
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
case class abc(year: Int, month: Int, item: String, quantity: Int)
val df0 = Seq(abc(2019, 1, "TV", 8),
abc(2019, 7, "AC", 10),
abc(2018, 1, "TV", 2),
abc(2018, 2, "AC", 3),
abc(2019, 2, "CO", 10)).toDS()
val df1 = df0.toDF()
// Generate some data; could be done more simply, but that's not the point.
val itemsList= collect_list(struct("month", "item", "quantity"))
// This hard-coded nn works.
val nn = 3
val z = for (i <- 1 to nn) yield array (element_at($"sortedCol.item", i), element_at($"sortedCol.quantity", i) )
// But want this.
//val z = for (i <- 1 to size($"sortedCol") ) yield array (element_at($"sortedCol.item", i), element_at($"sortedCol.quantity", i) )
val df2 = df1.groupBy($"year")
.agg(itemsList as "items")
.withColumn("sortedCol", sort_array($"items", asc = true))
.withColumn("S", size($"sortedCol")) // cannot use this either
.withColumn("Z", array(z: _*) )
.drop("items")
.orderBy($"year".desc)
df2.show(false)
// Col Z is the output I want, but without the null-valued arrays
UPD
In "In apache spark SQL, how to remove the duplicate rows when using collect_list in window function?" I solved this with a very simple UDF, but here I was looking for a way without a UDF, and in particular for a way to set the `to` value of the for loop dynamically. The answer proves that certain constructs are not possible, which was the verification being sought.
If I correctly understand your need, you can simply use the transform function like this:
val df2 = df1.groupBy($"year")
.agg(itemsList as "items")
.withColumn("sortedCol", sort_array($"items", asc = true))
val transform_expr = "transform(sortedCol, x -> array(x.item, x.quantity))"
df2.withColumn("Z", expr(transform_expr)).show(false)
//+----+--------------------------------------+--------------------------------------+-----------------------------+
//|year|items |sortedCol |Z |
//+----+--------------------------------------+--------------------------------------+-----------------------------+
//|2018|[[1, TV, 2], [2, AC, 3]] |[[1, TV, 2], [2, AC, 3]] |[[TV, 2], [AC, 3]] |
//|2019|[[1, TV, 8], [7, AC, 10], [2, CO, 10]]|[[1, TV, 8], [2, CO, 10], [7, AC, 10]]|[[TV, 8], [CO, 10], [AC, 10]]|
//+----+--------------------------------------+--------------------------------------+-----------------------------+
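As a side note, if you are on Spark 3.0+ the same thing can be written without an SQL string, since a transform(column, function) variant also exists in the Scala DSL. A minimal sketch, assuming Spark 3.0 or later (the import rename avoids a clash with Dataset.transform):
// Sketch assuming Spark 3.0+, where transform(column, fn) exists in the Scala DSL.
import org.apache.spark.sql.functions.{transform => transformCol}
val df3 = df2.withColumn(
  "Z",
  transformCol($"sortedCol", x => array(x.getField("item"), x.getField("quantity")))
)
df3.show(false)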
I have two numpy arrays, arr1 of shape (~140000, 3) and arr2 of shape (~450000, 10). The first 3 elements of each row, for both arrays, are coordinates (z, y, x). I want to find the rows of arr2 that have the same coordinates as the rows of arr1 (which can be considered a subset of arr2).
For example:
arr1 = [[1,2,3],[1,2,5],[1,7,8],[5,6,7]]
arr2 = [[1,2,3,7,66,4,3,44,8,9],[1,3,9,6,7,8,3,4,5,2],[1,5,8,68,7,8,13,4,53,2],[5,6,7,6,67,8,63,4,5,20], ...]
I want to find common coordinates (same first 3 elements):
list_arr = [[1,2,3,7,66,4,3,44,8,9], [5,6,7,6,67,8,63,4,5,20], ...]
At the moment I'm doing this double loop, which is extremely slow:
list_arr = []
for i in arr1:
    for j in arr2:
        if i[0] == j[0] and i[1] == j[1] and i[2] == j[2]:
            list_arr.append(j)
I also tried to create (after the first loop) a subarray of arr2, filtering it on the value of i[0] (arr2_filt = [el for el in arr2 if el[0] == i[0]]). This speeds up the operation a bit, but it still remains really slow.
Can you help me with this?
Approach #1
Here's a vectorized one with views -
import numpy as np

# https://stackoverflow.com/a/45313353/ #Divakar
def view1D(a, b):  # a, b are 2D arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()
a,b = view1D(arr1,arr2[:,:3])
out = arr2[np.in1d(b,a)]
Approach #2
Another with dimensionality-reduction for ints -
d = np.maximum(arr2[:,:3].max(0), arr1.max(0)) + 1  # sizes must exceed the max coordinate to avoid collisions
s = np.r_[1,d[:-1].cumprod()]
a,b = arr1.dot(s),arr2[:,:3].dot(s)
out = arr2[np.in1d(b,a)]
Improvement #1
We could use np.searchsorted to replace np.in1d for both of the approaches listed earlier -
unq_a = np.unique(a)
idx = np.searchsorted(unq_a, b)
idx[idx == len(unq_a)] = 0  # clip out-of-range indices (b values beyond unq_a's last element)
out = arr2[unq_a[idx] == b]
Improvement #2
For the last improvement using np.searchsorted, which also relied on np.unique, we could use argsort instead -
sidx = a.argsort()
idx = np.searchsorted(a,b,sorter=sidx)
idx[idx==len(a)] = 0
out = arr2[a[sidx[idx]]==b]
You can do it with the help of a set:
import numpy as np

arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
arr2 = np.array([[7,8,9,11,14,34],[23,12,11,10,12,13],[1,2,3,4,5,6]])
# take only the first 3 columns of each row of arr2
temp = [i[:3] for i in arr2]
aset = set([tuple(x) for x in arr])
bset = set([tuple(x) for x in temp])
np.array([x for x in aset & bset])
Output
array([[7, 8, 9],
       [1, 2, 3]])
Edit
Use a list comprehension:
l = [list(i) for i in arr2 if (i[:3] == arr).all(1).any()]  # row-wise match; a plain `in` on a 2D array matches element-wise, not row-wise
print(l)
Output:
[[7, 8, 9, 11, 14, 34], [1, 2, 3, 4, 5, 6]]
For integers Divakar already gave an excellent answer. If you want to compare floats you have to consider e.g. the following:
>>> 1. + 1e-15 == 1.
False
>>> 1. + 1e-16 == 1.
True
If this behaviour could lead to problems in your code, I would recommend performing a nearest-neighbour search, and probably checking whether the distances are within a specified threshold.
import numpy as np
from scipy import spatial

def get_indices_of_nearest_neighbours(arr1, arr2):
    tree = spatial.cKDTree(arr2[:, 0:3])
    # You can check here whether the distance is small enough and otherwise raise an error
    dist, ind = tree.query(arr1, k=1)
    return ind
I am trying to find the common strings in a map and an array, so as to output the corresponding values from the map (the map being Map[key -> value]) in Scala, without using any loops. Example:
Input:
Array("Ash","Garcia","Mac") Map("Ash" -> 5, "Mac" -> 4, "Lucas" -> 3)
Output:
Array(5,4)
The output is an array with 5 and 4, because Ash and Mac are common to both data structures.
What constitutes a loop?
def common(arr: Array[String], m: Map[String,Int]): Array[Int] =
  arr flatMap m.get // m.get returns an Option[Int]; flatMap keeps only the hits
Usage:
common(Array("Ash","Garcia","Mac")
,Map("Ash" -> 5, "Mac" -> 4, "Lucas" -> 3))
// res0: Array[Int] = Array(5, 4)
This is the most elegant solution, I think, but the results may not fit your requirements if there are duplicates in the array.
yourArray.collect(yourMap) // Array(5,4)
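For reference, this works because Map[String, Int] is itself a PartialFunction[String, Int], so collect can use it to filter and transform in one pass. A minimal sketch:
val yourArray = Array("Ash", "Garcia", "Mac")
val yourMap = Map("Ash" -> 5, "Mac" -> 4, "Lucas" -> 3)
// collect keeps only the keys the map defines and looks each one up
yourArray.collect(yourMap) // Array(5, 4)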
Use .filter to keep only the matching entries, then get the values of your filtered map.
Given
scala> val names = Array("Ash","Garcia","Mac")
names: Array[String] = Array(Ash, Garcia, Mac)
scala> val nameToNumber = Map("Ash" -> 5, "Mac" -> 4, "Lucas" -> 3)
nameToNumber: scala.collection.immutable.Map[String,Int] = Map(Ash -> 5, Mac -> 4, Lucas -> 3)
.filter.map
scala> nameToNumber.filter(x => names.contains(x._1)).map(_._2)
res3: scala.collection.immutable.Iterable[Int] = List(5, 4)
Alternatively, you can use collect,
scala> nameToNumber.collect{case kv if names.contains(kv._1) => kv._2}
res6: scala.collection.immutable.Iterable[Int] = List(5, 4)
Your complexity here is O(n²), since names.contains performs a linear scan for every map entry.
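If that matters, converting the array to a Set first makes each membership test effectively constant time. A sketch of the same collect against a Set:
val nameSet = names.toSet
// one pass over the map, with O(1) lookups instead of a linear scan
nameToNumber.collect { case (k, v) if nameSet(k) => v } // List(5, 4)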
Quite easy with Scala's elegant syntax:
val a = Array("Ash","Garcia","Mac")
val m = Map("Ash" -> 5, "Mac" -> 4, "Lucas" -> 3)
val res = m.filter { case (k, _) => a.contains(k) }.values.toArray
println(res.mkString("Array(", ", ", ")")) // printing the Array directly would show its reference, not its contents
Here is the solution!
Hi, I've tried to insert elements into an RDD[Array[String]] in Spark, using Scala.
Here is an example:
val data: RDD[Array[String]] = sc.parallelize(Seq(Array("1","2","3"), Array("1","2","3","4"), Array("1","2")))
I want to make every array in this data have length 4.
If an array's length is less than 4, I want to pad it with the string "NULL".
Here is the code I tried:
val newData = data.map(x =>
  if (x.length < 4) {
    for (i <- x.length until 4) {
      x.union("NULL")
    }
  } else {
    x
  }
)
But the result is Array[Any] = Array((), Array(1, 2, 3, 4), ()).
So I tried another way, using yield in the for loop.
val newData = data.map(x =>
  if (x.length < 4) {
    for (i <- x.length until 4) yield {
      x.union("NULL")
    }
  } else {
    x
  }
)
The result is Array[Object] = Array(Vector(Array(1, 2, 3, N, U, L, L)), Array(1, 2, 3, 4), Vector(Array(1, 2, N, U, L, L), Array(1, 2, N, U, L, L)))
These are not what I want. I want a result like this:
RDD[Array[String]] = Array(Array(1,2,3,NULL), Array(1,2,3,4), Array(1,2,NULL,NULL)).
What should I do?
Is there a method to solve it?
union is a functional operation: it doesn't change the array x. You don't need a loop here anyway, and any loop implementation will probably be slower -- it's much better to create one new collection with all the NULL values than to mutate something every time you add a null. Here's a function that should work for you:
def fillNull(x: Array[String], desiredLength: Int): Array[String] = {
  // concatenate the original values with however many "NULL"s are missing
  x ++ Array.fill(desiredLength - x.length)("NULL")
}
val newData = data.map(fillNull(_, 4))
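Collecting newData should then give the padded arrays from the question; a quick check, assuming the sample data sketched above:
newData.collect()
// contents: Array(Array(1, 2, 3, NULL), Array(1, 2, 3, 4), Array(1, 2, NULL, NULL))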
I solved your use case with the following code:
val initialRDD = sparkContext.parallelize(Array(
  Array[Any](1, 2, 3),
  Array[Any](1, 2, 3, 4),
  Array[Any](1, 2)))

val transformedRDD = initialRDD.map(array =>
  if (array.length < 4) {
    // Array[Any], since the padded arrays mix Ints and the String "NULL":
    // start from a length-4 array full of "NULL", then copy the original values over the front
    val transformedArray = Array.fill[Any](4)("NULL")
    Array.copy(array, 0, transformedArray, 0, array.length)
    transformedArray
  } else {
    array
  }
)
val result = transformedRDD.collect()
I have an Array[Array[String]] like this:
Array(Array("A","1","2"),
Array("A","3","4"),
Array("A","5","6"),
Array("B","7","8"),
Array("B","9","10"))
I would like to group by the first element of each sub-array and get a Map like this:
var A: Map[String, Array[String]] = Map()
A += ("A" -> Array("1", "2"))
A += ("A" -> Array("3", "4"))
A += ("A" -> Array("5", "6"))
A += ("B" -> Array("7", "8"))
A += ("B" -> Array("9", "10"))
I don't know how to manipulate groupby to get this result.
Do you have any idea?
Try this.
val arr = Array(Array("A","1","2"),
Array("A","3","4"),
Array("A","5","6"),
Array("B","7","8"),
Array("B","9","10"))
val result = arr.groupBy(_.head).map{case (k,v) => k -> v.flatMap(_.tail)}
result("A") // Array(1, 2, 3, 4, 5, 6)
result("B") // Array(7, 8, 9, 10)
Basically, after grouping you need to remove the head of each sub-array (that's the tail part), and you need to flatten the sub-arrays into a single array (that's the flatMap part).
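To see what the map step receives, groupBy on its own keys the untouched sub-arrays by their head. Conceptually (key order may vary):
arr.groupBy(_.head)
// Map(A -> Array(Array(A, 1, 2), Array(A, 3, 4), Array(A, 5, 6)),
//     B -> Array(Array(B, 7, 8), Array(B, 9, 10)))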
Warning: this will throw a runtime exception if any of the sub-arrays are empty. Here's a version that will take care of that.
val result = arr.groupBy(_.headOption).collect { case (Some(k), v) => k -> v.flatMap(_.tail) }
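For example, with an empty sub-array added to the input (hypothetical data, just to illustrate), the safe version simply ignores it:
val withEmpty = arr :+ Array.empty[String]
val safeResult = withEmpty.groupBy(_.headOption)
  .collect { case (Some(k), v) => k -> v.flatMap(_.tail) }
safeResult("A") // Array(1, 2, 3, 4, 5, 6) -- the empty sub-array was dropped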