Join on element inside array

I have two dataframes, and I need to use a column value from one of them to filter the other.
For example, given the datasets below:
from pyspark.sql import Row, SparkSession

# Reuse the active session (the pyspark shell already provides `spark`)
spark = SparkSession.builder.getOrCreate()

cust = spark.createDataFrame([
    Row(city='hyd', cust_id=100),
    Row(city='blr', cust_id=101),
    Row(city='chen', cust_id=102),
    Row(city='mum', cust_id=103)])
item = spark.createDataFrame([
    Row(item='fish', geography=['london', 'a', 'b', 'hyd']),
    Row(item='chicken', geography=['a', 'hyd', 'c']),
    Row(item='rice', geography=['a', 'b', 'c', 'blr']),
    Row(item='soup', geography=['a', 'kol', 'simla']),
    Row(item='pav', geography=['a', 'del']),
    Row(item='kachori', geography=['a', 'guj']),
    Row(item='fries', geography=['a', 'chen']),
    Row(item='noodles', geography=['a', 'mum'])])
cust dataset output:
+----+-------+
|city|cust_id|
+----+-------+
| hyd|    100|
| blr|    101|
|chen|    102|
| mum|    103|
+----+-------+
item dataset output:
+-------+-------------------+
|   item|          geography|
+-------+-------------------+
|   fish|[london, a, b, hyd]|
|chicken|        [a, hyd, c]|
|   rice|     [a, b, c, blr]|
|   soup|    [a, kol, simla]|
|    pav|           [a, del]|
|kachori|           [a, guj]|
|  fries|          [a, chen]|
|noodles|           [a, mum]|
+-------+-------------------+
I need to use the city column values from cust dataframe to get the items from the item dataset. The final output should be:
+----+---------------+-------+
|city|          items|cust_id|
+----+---------------+-------+
| hyd|[fish, chicken]|    100|
| blr|         [rice]|    101|
|chen|        [fries]|    102|
| mum|      [noodles]|    103|
+----+---------------+-------+

Before the join, I would explode the array column so that each geography element becomes its own row. A collect_list aggregation can then gather the matching items back into one list per customer.
from pyspark.sql import functions as F

df = cust.join(item.withColumn('city', F.explode('geography')), 'city', 'left')
df = (df.groupBy('city', 'cust_id')
      .agg(F.collect_list('item').alias('items'))
      .select('city', 'items', 'cust_id')
      )
df.show(truncate=False)
#+----+---------------+-------+
#|city|items          |cust_id|
#+----+---------------+-------+
#|blr |[rice]         |101    |
#|chen|[fries]        |102    |
#|hyd |[fish, chicken]|100    |
#|mum |[noodles]      |103    |
#+----+---------------+-------+
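To make the data flow of the explode, join, and collect_list steps easier to follow, here is a minimal plain-Python sketch of the same idea. It does not use Spark at all; the names cust_rows, item_rows, exploded, and items_by_city are made up for illustration, and the empty-list behaviour for unmatched customers reflects my understanding that collect_list skips nulls.

```python
from collections import defaultdict

# Hypothetical in-memory copies of the two datasets from the question
cust_rows = [('hyd', 100), ('blr', 101), ('chen', 102), ('mum', 103)]
item_rows = [
    ('fish', ['london', 'a', 'b', 'hyd']),
    ('chicken', ['a', 'hyd', 'c']),
    ('rice', ['a', 'b', 'c', 'blr']),
    ('soup', ['a', 'kol', 'simla']),
    ('pav', ['a', 'del']),
    ('kachori', ['a', 'guj']),
    ('fries', ['a', 'chen']),
    ('noodles', ['a', 'mum']),
]

# "explode": one (city, item) pair per element of each geography array
exploded = [(city, item) for item, geography in item_rows for city in geography]

# Group the exploded pairs by city, which plays the role of collect_list
items_by_city = defaultdict(list)
for city, item in exploded:
    items_by_city[city].append(item)

# "left join" from cust: every customer keeps a row, matched or not
result = [(city, items_by_city.get(city, []), cust_id) for city, cust_id in cust_rows]
for row in result:
    print(row)
```

Note that this sketch preserves input order, whereas Spark makes no ordering promise for collect_list or for the grouped rows.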

from pyspark.sql.functions import col, collect_list, explode

new = (
    # explode geography into a city column, then join the two dataframes on city
    item.withColumn('city', explode(col('geography')))
    .join(cust, how='left', on='city')
    # drop rows with no matching customer, then the unwanted column
    .dropna().drop('geography')
    # group by customer to build the list of items
    .groupby('city', 'cust_id').agg(collect_list('item').alias('items'))
)
new.show()
+----+---------------+-------+
|city|          items|cust_id|
+----+---------------+-------+
| blr|         [rice]|    101|
|chen|        [fries]|    102|
| hyd|[fish, chicken]|    100|
| mum|      [noodles]|    103|
+----+---------------+-------+

Related

Pyspark loop over dataframe and decrement column value

I need help looping row by row over a PySpark dataframe:
E.g:
df1
+---+-----+
| id|value|
+---+-----+
|  a|  100|
|  b|  100|
|  c|  100|
+---+-----+
I need to loop and decrease the value based on another dataframe:
df2
+---+-----+----------------+
| id|value|       timestamp|
+---+-----+----------------+
|  a|   20|2020-01-02 01:30|
|  a|   50|2020-01-02 05:30|
|  b|   50|2020-01-15 07:30|
|  b|   80|2020-02-01 09:30|
|  c|   50|2020-02-01 09:30|
+---+-----+----------------+
Expected output, based on a UDF or function such as
customFunction(df1(row_n))
df1
+---+-----+
| id|value|
+---+-----+
|  a|   30|  (100 - 20), then (80 - 50)
|  b|   50|  (100 - 50), then skip (50 - 80) since lhs < rhs
|  c|   50|  (100 - 50)
+---+-----+
How do I achieve this in PySpark? Also, the dataframes will have more than 50k rows.
You can achieve this by joining the two dataframes and using groupBy to aggregate the values from df2, then comparing the original value against that aggregate to decide how much to subtract.
Combining DataFrames
from pyspark.sql import functions as F

# `sql` is assumed to be an existing SparkSession (or SQLContext)
input_str1 = """
|a|100
|b|100
|c|100
""".split("|")
input_values1 = list(map(lambda x: x.strip(), input_str1))[1:]
input_list1 = [(x, y) for x, y in zip(input_values1[0::2], input_values1[1::2])]
sparkDF1 = sql.createDataFrame(input_list1, ['id', 'value'])
input_str2 = """
|a|20 |2020-01-02 01:30
|a|50 |2020-01-02 05:30
|b|50 |2020-01-15 07:30
|b|80 |2020-02-01 09:30
|c|50 |2020-02-01 09:30
""".split("|")
input_values2 = list(map(lambda x: x.strip(), input_str2))[1:]
input_list2 = [(x, y, z) for x, y, z in zip(input_values2[0::3],
                                            input_values2[1::3],
                                            input_values2[2::3])]
sparkDF2 = sql.createDataFrame(input_list2, ['id', 'value', 'timestamp'])
# keep every df1 column and bring in the df2 value under a distinct name
finalDF = (sparkDF1.join(sparkDF2,
                         sparkDF1['id'] == sparkDF2['id'],
                         'inner'
                         ).select(sparkDF1["*"], sparkDF2['value'].alias('value_right')))
finalDF.show()
+---+-----+-----------+
| id|value|value_right|
+---+-----+-----------+
|  c|  100|         50|
|  b|  100|         50|
|  b|  100|         80|
|  a|  100|         20|
|  a|  100|         50|
+---+-----+-----------+
GroupBy
agg_lst = [
    F.first(F.col('value')).alias('value'),
    F.sum(F.col('value_right')).alias('sum_value_right'),
    F.min(F.col('value_right')).alias('min_value_right'),
]
finalDF = finalDF.groupBy('id').agg(*agg_lst).orderBy('id')
finalDF = finalDF.withColumn(
    'final_value',
    F.when(F.col('value') > F.col('sum_value_right'),
           F.col('value') - F.col('sum_value_right'))
     .otherwise(F.col('value') - F.col('min_value_right'))
)
finalDF.show()
+---+-----+---------------+---------------+-----------+
| id|value|sum_value_right|min_value_right|final_value|
+---+-----+---------------+---------------+-----------+
|  a|  100|           70.0|             20|       30.0|
|  b|  100|          130.0|             50|       50.0|
|  c|  100|           50.0|             50|       50.0|
+---+-----+---------------+---------------+-----------+
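Note: if the above logic does not hold for your entire dataset (it compares against the total and the minimum rather than applying each decrement in timestamp order), implementing a UDF with your custom logic, combined with groupBy, would be the ideal solution.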
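As a rough idea of what such custom logic could look like, here is a minimal plain-Python sketch of the rule described in the question: apply each decrement in timestamp order and skip any decrement larger than the running value. The function name apply_decrements and the in-memory inputs are hypothetical; wiring this into a UDF or a grouped operation is left out, and I have not tested it at the 50k-row scale mentioned.

```python
from collections import defaultdict

def apply_decrements(start_values, decrements):
    """Apply decrements per id in timestamp order, skipping any that exceed the running value."""
    by_id = defaultdict(list)
    for row_id, amount, timestamp in decrements:
        by_id[row_id].append((timestamp, amount))

    result = {}
    for row_id, value in start_values.items():
        # 'YYYY-MM-DD HH:MM' strings sort chronologically, so sorted() orders by time
        for _, amount in sorted(by_id.get(row_id, [])):
            if value >= amount:  # skip the operation when lhs < rhs
                value -= amount
        result[row_id] = value
    return result

# Hypothetical in-memory copies of df1 and df2 from the question
df1_rows = {'a': 100, 'b': 100, 'c': 100}
df2_rows = [
    ('a', 20, '2020-01-02 01:30'),
    ('a', 50, '2020-01-02 05:30'),
    ('b', 50, '2020-01-15 07:30'),
    ('b', 80, '2020-02-01 09:30'),
    ('c', 50, '2020-02-01 09:30'),
]
final_values = apply_decrements(df1_rows, df2_rows)
print(final_values)  # {'a': 30, 'b': 50, 'c': 50}
```

Unlike the sum/min aggregation, this keeps a running value, so it also handles ids with three or more decrements where only some are skipped.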

Spark: Removing first array from Nested Array in Scala

I have a DataFrame with 2 columns. I want to remove the first array of the nested array in every record. For example, I have a DF like this:
+---+-----------------------------------------+
|id |arrayField                               |
+---+-----------------------------------------+
|1  |[[Akash,Kunal],[Sonu,Monu],[Ravi,Kishan]]|
|2  |[[Kunal, Mrinal],[Priya,Diya]]           |
|3  |[[Adi,Sadi]]                             |
+---+-----------------------------------------+
and I want my output like this:
+---+---------------------------+
|id |arrayField                 |
+---+---------------------------+
|1  |[[Sonu,Monu],[Ravi,Kishan]]|
|2  |[[Priya,Diya]]             |
|3  |null                       |
+---+---------------------------+
From Spark 2.4 onward, use the slice function.
Example:
df.show(10,false)
/*
+------------------------+
|arrayField |
+------------------------+
|[[A, k], [s, m], [R, k]]|
|[[k, M], [c, z]] |
|[[A, b]] |
+------------------------+
*/
import org.apache.spark.sql.functions._
df.withColumn("sliced", expr("slice(arrayField, 2, size(arrayField))")).
  // === builds a Column comparison; plain == would yield a Boolean, which when() does not accept
  withColumn("arrayField", when(size(col("sliced")) === 0, lit(null)).otherwise(col("sliced"))).
  drop("sliced").
  show()
/*
+----------------+
| arrayField|
+----------------+
|[[s, m], [R, k]]|
| [[c, z]]|
| null|
+----------------+
*/
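For readers less familiar with slice, the sketch below mimics in plain Python what the expression above does: slice takes a 1-based start and a length, so starting at 2 drops the first inner array, and an empty remainder is turned into null. The helper names spark_like_slice and drop_first_array are made up for illustration, and the emulation only covers a positive start.

```python
def spark_like_slice(values, start, length):
    """Emulate slice(values, start, length) for a positive, 1-based start only."""
    if start < 1 or length < 0:
        raise ValueError("this sketch only handles start >= 1 and length >= 0")
    return values[start - 1:start - 1 + length]

def drop_first_array(nested):
    rest = spark_like_slice(nested, 2, len(nested))
    # Mirror the when/otherwise step: an empty remainder becomes None (null)
    return rest if rest else None

rows = [
    [['A', 'k'], ['s', 'm'], ['R', 'k']],
    [['k', 'M'], ['c', 'z']],
    [['A', 'b']],
]
sliced_rows = [drop_first_array(r) for r in rows]
print(sliced_rows)  # [[['s', 'm'], ['R', 'k']], [['c', 'z']], None]
```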

Get the most common element of an array using Pyspark

How can I get the most common element of an array after concatenating two columns using PySpark?
from pyspark.sql import functions as F

df = spark.createDataFrame([
    [['a', 'a', 'b'], ['a']],
    [['c', 'd', 'd'], ['']],
    [['e'], ['e', 'f']],
    [[''], ['']]
]).toDF("arr_1", "arr_2")
df_new = df.withColumn('arr', F.concat(F.col('arr_1'), F.col('arr_2')))
Expected output:
+-----+---------+-------+
| arr | arr_1   | arr_2 |
+-----+---------+-------+
| [a] | [a,a,b] | [a]   |
| [d] | [c,d,d] | []    |
| [e] | [e]     | [e,f] |
| []  | []      | []    |
+-----+---------+-------+
Try this:
from pyspark.sql import Window
from pyspark.sql.functions import (col, concat, desc, explode,
                                   monotonically_increasing_id, rank)

# tag each row with an id so the per-row counts can be joined back
df1 = df.select('arr_1', 'arr_2', monotonically_increasing_id().alias('id'), concat('arr_1', 'arr_2').alias('arr'))
df1.select('id', explode('arr')).\
    groupBy('id', 'col').count().\
    select('id', 'col', 'count', rank().over(Window.partitionBy('id').orderBy(desc('count'))).alias('rank')).\
    filter(col('rank') == 1).\
    join(df1, 'id').\
    select(col('col').alias('arr'), 'arr_1', 'arr_2').show()
+---+---------+------+
|arr|    arr_1| arr_2|
+---+---------+------+
|  a|[a, a, b]|   [a]|
|   |       []|    []|
|  e|      [e]|[e, f]|
|  d|[c, d, d]|    []|
+---+---------+------+
You can explode the array, count occurrences with a groupBy, and then use a Window to pick the most frequent element.
Example:
from pyspark.sql.functions import *
from pyspark.sql import *

df = spark.createDataFrame([
    [['a', 'a', 'b'], ['a']],
    [['c', 'd', 'd'], ['']],
    [['e'], ['e', 'f']],
    [[''], ['']]
]).toDF("arr_1", "arr_2")
df_new = df.withColumn('arr_concat', concat(col('arr_1'), col('arr_2')))
# row id used to join the winning element back to its source row
df1 = df_new.withColumn("mid", monotonically_increasing_id())
df2 = df1.selectExpr("explode(arr_concat) as arr", "mid").groupBy("mid", "arr").agg(count(lit("1")).alias("cnt"))
w = Window.partitionBy("mid").orderBy(desc("cnt"))
# row_number keeps exactly one element per row, even when counts tie
df3 = df2.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop(*["rn", "cnt"])
df3.join(df1, ['mid'], 'inner').drop(*['mid', 'arr_concat']).withColumn("arr", array(col("arr"))).show()
#+---+---------+------+
#|arr|    arr_1| arr_2|
#+---+---------+------+
#|[d]|[c, d, d]|    []|
#|[e]|      [e]|[e, f]|
#|[a]|[a, a, b]|   [a]|
#| []|       []|    []|
#+---+---------+------+
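The per-row logic behind the explode, count, and window steps can also be written as a small plain-Python function, which may be handy for checking results on a few rows or as the body of a UDF. This is only a sketch: the name most_common_element is hypothetical, and ties are resolved in favour of the element seen first, which may differ from what the window functions above pick.

```python
from collections import Counter

def most_common_element(arr_1, arr_2):
    """Return a one-element list holding the most frequent value of the concatenated arrays."""
    combined = list(arr_1) + list(arr_2)
    if not combined:
        return []
    # most_common(1) yields the highest count; on ties the first-seen element wins
    element, _ = Counter(combined).most_common(1)[0]
    return [element]

rows = [
    (['a', 'a', 'b'], ['a']),
    (['c', 'd', 'd'], ['']),
    (['e'], ['e', 'f']),
    ([''], ['']),
]
most_common = [most_common_element(a, b) for a, b in rows]
print(most_common)  # [['a'], ['d'], ['e'], ['']]
```

The last row yields a list containing an empty string, which Spark's show() prints as [].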

compare String with Array[String] in scala dataframe?

How can I compare a String with an Array[String] in Scala?
For example, check whether "a" belongs to ["a", "b", "c"].
I have a dataframe df:
col1 col2
a [a,b,c]
d [a,b,c]
Expected output
col1 col2 status
a [a,b,c] present
d [a,b,c] missing
I wrote the following script in Scala:
val arrayContains = udf( (col1: String, col2: Array[String]) =>
if(col2.contains(col1) ) "present" else "missing" )
I then add a new "status" column to my dataframe as follows:
df.withColumn("status", arrayContains($"col1", $"col2" )).show()
but it gives me the following error:
(run-main-0) org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (string, array) => string)
How can I resolve this issue?
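Here you go. Spark hands an array column to a Scala UDF as a Seq (a mutable.WrappedArray here), not as an Array[String], so the UDF parameter has to be declared accordingly: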
import org.apache.spark.sql.functions._
import scala.collection.mutable
+----+---------+
|col1|     col2|
+----+---------+
|   a|[a, b, c]|
|   d|[a, b, c]|
|   e|[a, b, c]|
|  aa|[a, b, c]|
|   c|[a, b, c]|
|   f|       []|
+----+---------+
// note: matches treats each array element as a regular expression
def compareStrAgainstArray() = udf((str: String, lst: mutable.WrappedArray[String]) =>
  if (lst.exists(str.matches(_))) "present" else "missing")
df.withColumn("status", compareStrAgainstArray()($"col1", $"col2")).show()
+----+---------+-------+
|col1|     col2| status|
+----+---------+-------+
|   a|[a, b, c]|present|
|   d|[a, b, c]|missing|
|   e|[a, b, c]|missing|
|  aa|[a, b, c]|missing|
|   c|[a, b, c]|present|
|   f|       []|missing|
+----+---------+-------+
Hope this helps!
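One thing worth knowing about the UDF above: str.matches(_) treats each array element as a regular expression that must match the whole string. For plain letters that behaves like equality, but it can differ once an element contains regex metacharacters. The plain-Python sketch below illustrates the difference; re.fullmatch is used here only as a stand-in for String.matches, and both function names are hypothetical.

```python
import re

def status_by_regex(value, candidates):
    # Stand-in for lst.exists(str.matches(_)): each candidate is used as a pattern
    return "present" if any(re.fullmatch(c, value) for c in candidates) else "missing"

def status_by_equality(value, candidates):
    # Plain membership test, comparable to lst.contains(str)
    return "present" if value in candidates else "missing"

# Plain letters: both approaches agree
print(status_by_regex("a", ["a", "b", "c"]), status_by_equality("a", ["a", "b", "c"]))  # present present
# "." is a regex wildcard, so the regex version reports a match that equality does not
print(status_by_regex("d", [".", "b"]), status_by_equality("d", [".", "b"]))  # present missing
```

If exact membership is what you want, a plain contains check (or the built-in array_contains function) avoids this surprise.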

Postgres Insert based on JSON data

I am trying to do an insert based on whether a value already exists in a JSON blob in a Postgres table.
| a | b | c | d | metadata                     |
________________________________________________
| 1 | 2 | 3 | 4 | {"other-key": 1}             |
| 2 | 1 | 4 | 4 | {"key": 99}                  |
| 3 | 1 | 4 | 4 | {"key": 99, "other-key": 33} |
I am currently trying to use something like this:
INSERT INTO mytable (a, b, c, d, metadata)
SELECT :a, :b, :c, :d, :metadata::JSONB
WHERE
    (:metadata->>'key'::TEXT IS NULL
    OR :metadata->>'key'::TEXT NOT IN (SELECT :metadata->>'key'::TEXT
                                       FROM mytable));
But I keep getting: ERROR: operator does not exist: unknown ->> boolean
Hint: No operator matches the given name and argument type(s). You might need to add explicit type casts.
It was just a simple casting issue: :metadata has to be cast to JSONB before ->> is applied, and the extraction is parenthesized so that IS NULL tests its result.
INSERT INTO mytable (a, b, c, d, metadata)
SELECT :a, :b, :c, :d, :metadata::JSONB
WHERE
    ((:metadata::JSONB->>'key') IS NULL OR
    :metadata::JSONB->>'key' NOT IN (
        SELECT metadata->>'key' FROM mytable
        WHERE
            metadata->>'key' = :metadata::JSONB->>'key'));

Resources