I have two dataframes, and I need to use the values of one dataframe to filter the second dataframe.
For example, below are the datasets
import pyspark
from pyspark.sql import Row

cust = spark.createDataFrame([Row(city='hyd', cust_id=100),
                              Row(city='blr', cust_id=101),
                              Row(city='chen', cust_id=102),
                              Row(city='mum', cust_id=103)])

item = spark.createDataFrame([Row(item='fish', geography=['london', 'a', 'b', 'hyd']),
                              Row(item='chicken', geography=['a', 'hyd', 'c']),
                              Row(item='rice', geography=['a', 'b', 'c', 'blr']),
                              Row(item='soup', geography=['a', 'kol', 'simla']),
                              Row(item='pav', geography=['a', 'del']),
                              Row(item='kachori', geography=['a', 'guj']),
                              Row(item='fries', geography=['a', 'chen']),
                              Row(item='noodles', geography=['a', 'mum'])])
cust dataset output:
+----+-------+
|city|cust_id|
+----+-------+
| hyd| 100|
| blr| 101|
|chen| 102|
| mum| 103|
+----+-------+
item dataset output:
+-------+-------------------+
|   item|          geography|
+-------+-------------------+
|   fish|[london, a, b, hyd]|
|chicken|        [a, hyd, c]|
|   rice|     [a, b, c, blr]|
|   soup|    [a, kol, simla]|
|    pav|           [a, del]|
|kachori|           [a, guj]|
|  fries|          [a, chen]|
|noodles|           [a, mum]|
+-------+-------------------+
I need to use the city column values from the cust dataframe to get the matching items from the item dataframe. The final output should be:
+----+---------------+-------+
|city| items|cust_id|
+----+---------------+-------+
| hyd|[fish, chicken]| 100|
| blr| [rice]| 101|
|chen| [fries]| 102|
| mum| [noodles]| 103|
+----+---------------+-------+
Before the join, I would explode the array column. Then a collect_list aggregation can gather all the items for each city into one list.
from pyspark.sql import functions as F
df = cust.join(item.withColumn('city', F.explode('geography')), 'city', 'left')
df = (df.groupBy('city', 'cust_id')
      .agg(F.collect_list('item').alias('items'))
      .select('city', 'items', 'cust_id'))
df.show(truncate=False)
#+----+---------------+-------+
#|city|items |cust_id|
#+----+---------------+-------+
#|blr |[rice] |101 |
#|chen|[fries] |102 |
#|hyd |[fish, chicken]|100 |
#|mum |[noodles] |103 |
#+----+---------------+-------+
from pyspark.sql.functions import col, collect_list, explode

new = (
    # join the two dataframes on city
    item.withColumn('city', explode(col('geography')))
        .join(cust, how='left', on='city')
        # drop rows with no matching customer and the unwanted column
        .dropna().drop('geography')
        # group by for the outcome
        .groupby('city', 'cust_id').agg(collect_list('item').alias('items'))
)
new.show()
+----+---------------+-------+
|city| items|cust_id|
+----+---------------+-------+
| blr| [rice]| 101|
|chen| [fries]| 102|
| hyd|[fish, chicken]| 100|
| mum| [noodles]| 103|
+----+---------------+-------+
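Both answers above explode the geography array before the join. For reference, the explode step can also be skipped by joining on an array-membership condition instead; a rough sketch reusing the cust and item frames above, with the condition written through expr so the unqualified column names resolve across the join:
from pyspark.sql import functions as F

# Sketch: join on membership of city in the geography array, then aggregate.
# collect_list skips the nulls produced by unmatched cities, so such cities
# would end up with an empty items list.
alt = (
    cust.join(item, F.expr("array_contains(geography, city)"), "left")
        .groupBy("city", "cust_id")
        .agg(F.collect_list("item").alias("items"))
        .select("city", "items", "cust_id")
)
alt.show(truncate=False)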
I have a DataFrame with one column of array[string] type.
scala> df.printSchema
root
|-- user: string (nullable = true) ### this is a unique key
|-- items: array (nullable = true)
| |-- element: string (containsNull = true)
Due to some limitations on the consumer's side, I need to limit the number of elements in the items column, e.g. to a maximum of 1000 elements. The output DataFrame would have the same schema, except that the user column is no longer unique. For example, with max elements = 3:
Input DataFrame:
+----+----------------------+
|user|items |
+----+----------------------+
|u1 |[a, b, cc, d, e, f, g]|
|u2 |[h, ii] |
|u3 |[j, kkkk, m, nn, o] |
+----+----------------------+
Output DataFrame:
+----+------------+
|user|items |
+----+------------+
|u1 |[a, f, g] |
|u1 |[b, cc, d] |
|u1 |[e] |
|u2 |[h, ii] |
|u3 |[j, nn, m] |
|u3 |[kkkk, o] |
+----+------------+
The order of items is not important.
The value of each item is just a string of alphanumeric chars, but the size of each item is not fixed.
Performance is not an issue; the DataFrame is small, but we need the solution in SparkSQL.
This can be worked out without higher-order functions, in three easy steps:
1. posexplode the arrays of items
2. take the integral part of dividing each item's pos by N, the desired number of elements per subarray
3. collect_list the new arrays, grouping by user and pos.
For N=3:
>>> df = spark.createDataFrame([
... {'user':'u1','items':['a', 'b', 'cc', 'd', 'e', 'f', 'g']},
... {'user':'u2','items':['h', 'ii']},
... {'user':'u3','items':['j', 'kkkk', 'm', 'nn', 'o']}
... ])
>>> from pyspark.sql.functions import *
>>> df1 = df.select(posexplode(df.items),df.user)
>>> df2 = df1.select(floor(df1.pos/3).alias('pos'),df1.col.alias('item'),df1.user)
>>> df3 = df2.groupby([df2.user,df2.pos]).agg(collect_list(df2.item)).drop('pos')
>>> df3.show(truncate=False)
+----+------------------+
|user|collect_list(item)|
+----+------------------+
|u2 |[h, ii] |
|u1 |[a, b, cc] |
|u1 |[d, e, f] |
|u1 |[g] |
|u3 |[nn, o] |
|u3 |[j, kkkk, m] |
+----+------------------+
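Since the question asks for the solution in SparkSQL, the same three steps can also be written as a single SQL query. A minimal sketch, assuming the DataFrame above is registered as a temp view (the view name chunks_in is made up for illustration) and N=3:
# the view name "chunks_in" is arbitrary, chosen only for this example
df.createOrReplaceTempView("chunks_in")
spark.sql("""
    SELECT `user`, collect_list(item) AS items
    FROM (
        SELECT `user`, floor(pos / 3) AS chunk, col AS item
        FROM chunks_in
        LATERAL VIEW posexplode(items) t AS pos, col
    ) exploded
    GROUP BY `user`, chunk
""").show(truncate=False)
The backticks around user are only a precaution, since user can also be parsed as a function name in some Spark versions.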
I have the following dataframe
+-----+-----+------+
|group|label|active|
+-----+-----+------+
| a| 1| y|
| a| 2| y|
| a| 1| n|
| b| 1| y|
| b| 1| n|
+-----+-----+------+
I would like to group by the "group" column and collect the "label" values, while filtering on the value in the active column.
The expected result would be
+-----+--------+-------+----------+
|group|labelyes|labelno|difference|
+-----+--------+-------+----------+
|a    |[1,2]   |[1]    |[2]       |
|b    |[1]     |[1]    |[]        |
+-----+--------+-------+----------+
I can easily filter for the "y" value with
val dfyes = df.filter($"active" === "y").groupBy("group").agg(collect_set("label"))
and similarly for the "n" value
val dfno = df.filter($"active" === "n").groupBy("group").agg(collect_set("label"))
but I don't know whether it's possible to do both aggregations at once while filtering, nor how to get the difference of the two sets.
You can do a pivot, and use some array functions to get the difference:
val df2 = df.groupBy("group").pivot("active").agg(collect_list("label"))
  .withColumn(
    "difference",
    array_union(
      array_except(col("n"), col("y")),
      array_except(col("y"), col("n"))
    )
  )
df2.show
+-----+---+------+----------+
|group| n| y|difference|
+-----+---+------+----------+
| b|[1]| [1]| []|
| a|[1]|[1, 2]| [2]|
+-----+---+------+----------+
Thanks @mck for the help. I have found an alternative way to solve the question, namely filtering with when during the aggregation:
df
  .groupBy("group")
  .agg(
    collect_set(when($"active" === "y", $"label")).as("labelyes"),
    collect_set(when($"active" === "n", $"label")).as("labelno")
  )
  .withColumn("diff", array_except($"labelyes", $"labelno"))
I have a list column containing a varying number of elements
Emp list
101 [a,b,c,d,e]
102 [q,w,e]
103 [z,x,w,t,e,q,s]
I need the result to be split across 3 columns
Emp col1 col2 col3
101 a b c
101 d e
102 q w e
103 z x w
103 t e q
103 s
Check this out:
scala> val df = Seq((101,Array("a","b","c","d","e")),(102,Array("q","w","e")),(103,Array("z","x","w","t","e","q","s"))).toDF("emp","list")
df: org.apache.spark.sql.DataFrame = [emp: int, list: array<string>]
scala> df.show(false)
+---+---------------------+
|emp|list |
+---+---------------------+
|101|[a, b, c, d, e] |
|102|[q, w, e] |
|103|[z, x, w, t, e, q, s]|
+---+---------------------+
scala> val udf_slice = udf( (x:Seq[String]) => x.grouped(3).toList )
udf_slice: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(StringType,true))))
scala> df.select(col("*"), explode(udf_slice($"list")).as("newlist")).select($"emp", $"newlist"(0).as("col1"), $"newlist"(1).as("col2"), $"newlist"(2).as("col3") ).show(false)
+---+----+----+----+
|emp|col1|col2|col3|
+---+----+----+----+
|101|a |b |c |
|101|d |e |null|
|102|q |w |e |
|103|z |x |w |
|103|t |e |q |
|103|s |null|null|
+---+----+----+----+
Spark 2.4 - I also tried to implement this without UDFs, but the slice() function does not accept other columns as parameters for the range, so a small UDF is still needed here (see the expr-based sketch after the results below for a possible way around this).
val df = Seq((101,Array("a","b","c","d","e")),(102,Array("q","w","e")),(103,Array("z","x","w","t","e","q","s"))).toDF("emp","list")
df.show(false)
val df2 = df.withColumn("list_size_arr", array_repeat(lit(1), ceil(size('list)/3).cast("int")) )
val df3 = df2.select(col("*"),posexplode('list_size_arr))
val udf_slice = udf( (x:Seq[String],start:Int, end:Int ) => x.slice(start,end) )
df3.withColumn("newlist",udf_slice('list,'pos*3, ('pos+1)*3 )).select($"emp", $"newlist").show(false)
Results:
+---+---------------------+
|emp|list |
+---+---------------------+
|101|[a, b, c, d, e] |
|102|[q, w, e] |
|103|[z, x, w, t, e, q, s]|
+---+---------------------+
+---+---------+
|emp|newlist |
+---+---------+
|101|[a, b, c]|
|101|[d, e] |
|102|[q, w, e]|
|103|[z, x, w]|
|103|[t, e, q]|
|103|[s] |
+---+---------+
To get the result in separate columns:
val df4 = df3.withColumn("newlist",udf_slice('list,'pos*3, ('pos+1)*3 )).select($"emp", $"newlist")
df4.select($"emp", $"newlist"(0).as("col1"), $"newlist"(1).as("col2"), $"newlist"(2).as("col3") ).show(false)
+---+----+----+----+
|emp|col1|col2|col3|
+---+----+----+----+
|101|a |b |c |
|101|d |e |null|
|102|q |w |e |
|103|z |x |w |
|103|t |e |q |
|103|s |null|null|
+---+----+----+----+
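As hinted above, I believe the range limitation only applies to the typed slice() API: the underlying SQL function slice(array, start, length) evaluates its arguments per row, so expr() can reference the pos column directly (note that start is 1-based). A rough sketch of that idea in PySpark, assuming Spark 2.4+; the equivalent expr() call should also work from Scala:
from pyspark.sql import functions as F

# Assumed input: same emp/list data as above.
df = spark.createDataFrame(
    [(101, ["a", "b", "c", "d", "e"]),
     (102, ["q", "w", "e"]),
     (103, ["z", "x", "w", "t", "e", "q", "s"])],
    ["emp", "list"])

# One row per chunk index, then slice the original list inside a SQL expression
# instead of a UDF (slice's start argument is 1-based).
chunks = (
    df.select("emp", "list",
              F.posexplode(F.expr("array_repeat(1, cast(ceil(size(list) / 3) as int))")))
      .withColumn("newlist", F.expr("slice(list, pos * 3 + 1, 3)"))
      .select("emp", "newlist")
)
chunks.select("emp",
              F.col("newlist")[0].alias("col1"),
              F.col("newlist")[1].alias("col2"),
              F.col("newlist")[2].alias("col3")).show(truncate=False)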
Another approach not using a UDF is as follows - note that sliding can be used as well, but it does involve a conversion to RDD and back again:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
// No use of UDF means conversion to RDD and back again.
val data = List( (102, Array("a", "b", "c")), (103, Array("1", "2", "3", "4", "5", "6", "7", "8")), (104, Array("r")) )
val rdd = sc.parallelize(data)
val df = rdd.toDF("k", "v")
// Make groups of 3 as requested, methods possible.
val rddX = df.as[(Int, List[String])].rdd // This avoids Row and Any issues that typically crop up.
//val rddY = rddX.map(x => (x._1, x._2.grouped(3).toArray))
val rddY = rddX.map(x => (x._1, x._2.sliding(3,3).toArray))
// Get k,v's with v the set of 3 and make single columns.
val df2 = rddY.toDF("k", "v")
val df3 = df2.select($"k", explode($"v").as("v_3"))
val df4 = df3.select($"k", $"v_3"(0).as("v_3_1"), $"v_3"(1).as("v_3_2"), $"v_3"(2).as("v_3_3") )
df4.show(false)
returns:
+---+-----+-----+-----+
|k |v_3_1|v_3_2|v_3_3|
+---+-----+-----+-----+
|102|a |b |c |
|103|1 |2 |3 |
|103|4 |5 |6 |
|103|7 |8 |null |
|104|r |null |null |
+---+-----+-----+-----+
There is probably a better solution, but I came up with this:
import java.util.Arrays
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val employees = Array((101, Array("a", "b", "c", "d", "e")),
                      (102, Array("q", "w", "e")),
                      (103, Array("z", "x", "w", "t", "e", "q", "s")))

// Build one Row per chunk, prepending the emp id as a string.
def f(emp: Int, num: Array[String]): Row = {
  Row.fromSeq(s"${emp}" +: num)
}

// sliding(3, 3) yields the chunks; Arrays.copyOf pads the last chunk with nulls.
val rowArray = for {
  x <- employees
  z <- x._2.sliding(3, 3)
} yield f(x._1, Arrays.copyOf(z, 3))

val schema = StructType(
  List(StructField("emp", StringType, false),
       StructField("col1", StringType, true),
       StructField("col2", StringType, true),
       StructField("col3", StringType, true)))

val sqlContext = new SQLContext(sc)
val dfFromArray = sqlContext.createDataFrame(sc.parallelize(rowArray), schema)
dfFromArray.show
It will return something like this:
+---+----+----+----+
|emp|col1|col2|col3|
+---+----+----+----+
|101| a| b| c|
|101| d| e|null|
|102| q| w| e|
|103| z| x| w|
|103| t| e| q|
|103| s|null|null|
+---+----+----+----+
This is the answer that I should have posted if not using a UDF. Here we use the newer Dataset API, which makes it easier to apply Scala functions directly to the Dataset's fields, much like with an RDD - that is the point here. Datasets are indeed the best of both worlds, and point taken that no UDF is used, as in the other answer, though sometimes performance is an issue. In any event, the results are the same, so only the Dataset approach is shown. Note that the Dataset and DataFrame definitions alternate - this is indicated in the val names.
case class X(k: Integer, v: List[String])

import org.apache.spark.sql.functions._

val df = Seq((102, Array("a", "b", "c")),
             (103, Array("1", "2", "3", "4", "5", "6", "7", "8")),
             (104, Array("r"))).toDF("k", "v")

// Typed Dataset so each list can be chunked with plain Scala sliding, no UDF.
val ds = df.as[X]
val df2 = ds.map(x => (x.k, x.v.sliding(3, 3).toArray))
            .withColumnRenamed("_1", "k")
            .withColumnRenamed("_2", "v")

val df3 = df2.select($"k", explode($"v").as("v_3"))
             .select($"k", $"v_3"(0).as("v_3_1"), $"v_3"(1).as("v_3_2"), $"v_3"(2).as("v_3_3"))
df3.show(false)
results again in:
+---+-----+-----+-----+
|k |v_3_1|v_3_2|v_3_3|
+---+-----+-----+-----+
|102|a |b |c |
|103|1 |2 |3 |
|103|4 |5 |6 |
|103|7 |8 |null |
|104|r |null |null |
+---+-----+-----+-----+
Trying to do an insert based on whether a value already exists or not in a JSON blob in a Postgres table.
| a | b | c | d | metadata                     |
------------------------------------------------
| 1 | 2 | 3 | 4 | {"other-key": 1}             |
| 2 | 1 | 4 | 4 | {"key": 99}                  |
| 3 | 1 | 4 | 4 | {"key": 99, "other-key": 33} |
Currently I am trying to use something like this:
INSERT INTO mytable (a, b, c, d, metadata)
SELECT :a, :b, :c, :d, :metadata::JSONB
WHERE
(:metadata->>'key'::TEXT IS NULL
OR :metadata->>'key'::TEXT NOT IN (SELECT :metadata->>'key'::TEXT
FROM mytable));
But I keep getting an ERROR: operator does not exist: unknown ->> boolean
Hint: No operator matches the given name and argument type(s). You might need to add explicit type casts.
It was just a simple casting issue - the :metadata parameter has to be cast to JSONB before ->> is applied:
INSERT INTO mytable (a, b, c, d, metadata)
SELECT :a, :b, :c, :d, :metadata::JSONB
WHERE
    ((:metadata::JSONB->>'key') IS NULL OR
     :metadata::JSONB->>'key' NOT IN (
        SELECT metadata->>'key'
        FROM mytable
        WHERE metadata->>'key' = :metadata::JSONB->>'key'));