PySpark: loop over a dataframe and decrement a column value

I need help with looping row by row over a PySpark dataframe. For example:
df1
+---+-----+
| id|value|
+---+-----+
|  a|  100|
|  b|  100|
|  c|  100|
+---+-----+
I need to loop through and decrease the value based on another dataframe:
df2
+---+-----+----------------+
| id|value|       timestamp|
+---+-----+----------------+
|  a|   20|2020-01-02 01:30|
|  a|   50|2020-01-02 05:30|
|  b|   50|2020-01-15 07:30|
|  b|   80|2020-02-01 09:30|
|  c|   50|2020-02-01 09:30|
+---+-----+----------------+
Expected output, based on a UDF or function customFunction(df1(row_n)):
df1
+---+-----+
| id|value|
+---+-----+
|  a|   30|   (100 - 20), then (80 - 50)
|  b|   50|   (100 - 50), then skip (50 - 80) since lhs < rhs
|  c|   50|   (100 - 50)
+---+-----+
How do I achieve this in PySpark? Also, the dataframes will have more than 50k rows.

You can achieve this by joining the two dataframes and using groupBy to aggregate the values from df2, then comparing df1's value against those aggregates.
Combining DataFrames
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate df1 from the sample data in the question
input_str1 = """
|a|100
|b|100
|c|100
""".split("|")
input_values1 = [x.strip() for x in input_str1][1:]
input_list1 = list(zip(input_values1[0::2], input_values1[1::2]))
sparkDF1 = spark.createDataFrame(input_list1, ['id', 'value'])

# Recreate df2 from the sample data in the question
input_str2 = """
|a|20 |2020-01-02 01:30
|a|50 |2020-01-02 05:30
|b|50 |2020-01-15 07:30
|b|80 |2020-02-01 09:30
|c|50 |2020-02-01 09:30
""".split("|")
input_values2 = [x.strip() for x in input_str2][1:]
input_list2 = list(zip(input_values2[0::3], input_values2[1::3], input_values2[2::3]))
sparkDF2 = spark.createDataFrame(input_list2, ['id', 'value', 'timestamp'])

# Inner join on id, keeping df1's columns and df2's value as value_right
finalDF = (sparkDF1.join(sparkDF2, sparkDF1['id'] == sparkDF2['id'], 'inner')
                   .select(sparkDF1["*"], sparkDF2['value'].alias('value_right')))
finalDF.show()
+---+-----+-----------+
| id|value|value_right|
+---+-----+-----------+
| c| 100| 50|
| b| 100| 50|
| b| 100| 80|
| a| 100| 20|
| a| 100| 50|
+---+-----+-----------+
GroupBy
agg_lst = [
    F.first(F.col('value')).alias('value'),
    F.sum(F.col('value_right')).alias('sum_value_right'),
    F.min(F.col('value_right')).alias('min_value_right'),
]

finalDF = finalDF.groupBy('id').agg(*agg_lst).orderBy('id')

# If value covers the total of the decrements, subtract the total;
# otherwise fall back to subtracting only the smallest decrement
finalDF = finalDF.withColumn(
    'final_value',
    F.when(F.col('value') > F.col('sum_value_right'),
           F.col('value') - F.col('sum_value_right'))
     .otherwise(F.col('value') - F.col('min_value_right'))
)
finalDF.show()
+---+-----+---------------+---------------+-----------+
| id|value|sum_value_right|min_value_right|final_value|
+---+-----+---------------+---------------+-----------+
| a| 100| 70.0| 20| 30.0|
| b| 100| 130.0| 50| 50.0|
| c| 100| 50.0| 50| 50.0|
+---+-----+---------------+---------------+-----------+
Note: if the above logic does not hold for your entire dataset, implementing a UDF with your custom logic, combined with groupBy, would be the ideal solution.
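For reference, here is a minimal sketch of that UDF route, assuming the intended rule is to replay the decrements in timestamp order and skip any step that would push the value negative (apply_decrements is just an illustrative name):
from pyspark.sql import functions as F

# Replay the decrements in timestamp order; skip a step if it would go negative
@F.udf("int")
def apply_decrements(start, decrements):
    remaining = int(start)
    for d in decrements:
        if remaining >= int(d):
            remaining -= int(d)
    return remaining

result = (sparkDF1
          .join(sparkDF2.withColumnRenamed('value', 'value_right'), 'id')
          .groupBy('id')
          .agg(F.first('value').alias('value'),
               F.sort_array(F.collect_list(F.struct('timestamp', 'value_right'))).alias('events'))
          .withColumn('final_value', apply_decrements('value', F.col('events.value_right'))))

result.select('id', 'final_value').show()
For the sample data this gives 30, 50 and 50 for a, b and c. The events are collected per id, so the job stays distributed and should cope with 50k+ rows, provided no single id has a huge number of rows in df2.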

Related

Join on element inside array

I have two dataframes, and I have to use values from one dataframe to filter the second dataframe.
For example, below are the datasets:
import pyspark
from pyspark.sql import Row

cust = spark.createDataFrame([Row(city='hyd', cust_id=100),
                              Row(city='blr', cust_id=101),
                              Row(city='chen', cust_id=102),
                              Row(city='mum', cust_id=103)])

item = spark.createDataFrame([Row(item='fish', geography=['london', 'a', 'b', 'hyd']),
                              Row(item='chicken', geography=['a', 'hyd', 'c']),
                              Row(item='rice', geography=['a', 'b', 'c', 'blr']),
                              Row(item='soup', geography=['a', 'kol', 'simla']),
                              Row(item='pav', geography=['a', 'del']),
                              Row(item='kachori', geography=['a', 'guj']),
                              Row(item='fries', geography=['a', 'chen']),
                              Row(item='noodles', geography=['a', 'mum'])])
cust dataset output:
+----+-------+
|city|cust_id|
+----+-------+
| hyd| 100|
| blr| 101|
|chen| 102|
| mum| 103|
+----+-------+
item dataset output:
+-------+-------------------+
|   item|          geography|
+-------+-------------------+
|   fish|[london, a, b, hyd]|
|chicken|        [a, hyd, c]|
|   rice|     [a, b, c, blr]|
|   soup|    [a, kol, simla]|
|    pav|           [a, del]|
|kachori|           [a, guj]|
|  fries|          [a, chen]|
|noodles|           [a, mum]|
+-------+-------------------+
I need to use the city column values from the cust dataframe to get the items from the item dataset. The final output should be:
+----+---------------+-------+
|city| items|cust_id|
+----+---------------+-------+
| hyd|[fish, chicken]| 100|
| blr| [rice]| 101|
|chen| [fries]| 102|
| mum| [noodles]| 103|
+----+---------------+-------+
Before the join, I would explode the array column. Then a collect_list aggregation can gather all items into one list.
from pyspark.sql import functions as F

df = cust.join(item.withColumn('city', F.explode('geography')), 'city', 'left')
df = (df.groupBy('city', 'cust_id')
        .agg(F.collect_list('item').alias('items'))
        .select('city', 'items', 'cust_id'))
df.show(truncate=False)
#+----+---------------+-------+
#|city|items |cust_id|
#+----+---------------+-------+
#|blr |[rice] |101 |
#|chen|[fries] |102 |
#|hyd |[fish, chicken]|100 |
#|mum |[noodles] |103 |
#+----+---------------+-------+
from pyspark.sql.functions import col, explode, collect_list

new = (
    # join the two dataframes on city
    item.withColumn('city', explode(col('geography')))
        .join(cust, how='left', on='city')
        # drop null rows and the unwanted column
        .dropna().drop('geography')
        # group by for the outcome
        .groupby('city', 'cust_id').agg(collect_list('item').alias('items'))
)
new.show()
+----+---------------+-------+
|city| items|cust_id|
+----+---------------+-------+
| blr| [rice]| 101|
|chen| [fries]| 102|
| hyd|[fish, chicken]| 100|
| mum| [noodles]| 103|
+----+---------------+-------+

Spark: GroupBy and collect_list while filtering by another column

I have the following dataframe
+-----+-----+------+
|group|label|active|
+-----+-----+------+
| a| 1| y|
| a| 2| y|
| a| 1| n|
| b| 1| y|
| b| 1| n|
+-----+-----+------+
I would like to group by the "group" column and collect the "label" column, while filtering on the value in the active column.
The expected result would be
+-----+--------+-------+----------+
|group|labelyes|labelno|difference|
+-----+--------+-------+----------+
|    a|  [1, 2]|    [1]|       [2]|
|    b|     [1]|    [1]|        []|
+-----+--------+-------+----------+
I can easily filter for the "y" value with
val dfyes = df.filter($"active" === "y").groupBy("group").agg(collect_set("label"))
and similarly for the "n" value
val dfno = df.filter($"active" === "n").groupBy("group").agg(collect_set("label"))
but I don't know whether it's possible to filter and aggregate simultaneously, or how to get the difference of the two sets.
You can do a pivot, and use some array functions to get the difference:
val df2 = df.groupBy("group").pivot("active").agg(collect_list("label")).withColumn(
  "difference",
  array_union(
    array_except(col("n"), col("y")),
    array_except(col("y"), col("n"))
  )
)
df2.show
+-----+---+------+----------+
|group| n| y|difference|
+-----+---+------+----------+
| b|[1]| [1]| []|
| a|[1]|[1, 2]| [2]|
+-----+---+------+----------+
Thanks @mck for the help. I found an alternative way to solve the question, namely filtering with when during the aggregation:
df
  .groupBy("group")
  .agg(
    collect_set(when($"active" === "y", $"label")).as("labelyes"),
    collect_set(when($"active" === "n", $"label")).as("labelno")
  )
  .withColumn("diff", array_except($"labelyes", $"labelno"))
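For anyone needing the same thing in PySpark, here is a rough equivalent of the when-inside-collect_set idea, including the symmetric difference from the pivot answer (a sketch, assuming a DataFrame df with the same group/label/active columns):
from pyspark.sql import functions as F

result = (df.groupBy("group")
            .agg(F.collect_set(F.when(F.col("active") == "y", F.col("label"))).alias("labelyes"),
                 F.collect_set(F.when(F.col("active") == "n", F.col("label"))).alias("labelno"))
            .withColumn("difference",
                        F.array_union(F.array_except("labelyes", "labelno"),
                                      F.array_except("labelno", "labelyes"))))
result.show()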

Compare rows of an array column with the headers of another data frame using Scala and Spark

I am using Scala and Spark.
I have two data frames.
The first one looks like this:
+------+------+-----------+
| num1 | num2 | arr |
+------+------+-----------+
| 25 | 10 | [a,c] |
| 35 | 15 | [a,b,d] |
+------+------+-----------+
In the second one the data frame headers are
num1, num2, a, b, c, d
I have created a case class by adding all the possible header columns.
Now, by matching the columns num1 and num2, I have to check whether the array in the arr column contains the headers of the second data frame. If it does, the value should be 1, otherwise 0.
So the required output is:
+------+------+---+---+---+---+
| num1 | num2 | a | b | c | d |
+------+------+---+---+---+---+
| 25 | 10 | 1 | 0 | 1 | 0 |
| 35 | 15 | 1 | 1 | 0 | 1 |
+------+------+---+---+---+---+
If I understand correctly, you want to transform the array column arr into one column per possible value, indicating whether or not the array contains that value.
If so, you can use the array_contains function like this:
val df = Seq((25, 10, Seq("a", "c")), (35, 15, Seq("a", "b", "d")))
  .toDF("num1", "num2", "arr")

val values = Seq("a", "b", "c", "d")

df.select(Seq("num1", "num2").map(col) ++
    values.map(x => array_contains('arr, x).cast("int") as x): _*)
  .show
+----+----+---+---+---+---+
|num1|num2| a| b| c| d|
+----+----+---+---+---+---+
| 25| 10| 1| 0| 1| 0|
| 35| 15| 1| 1| 0| 1|
+----+----+---+---+---+---+
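The same approach in PySpark, for reference (a sketch; the cast to int is what turns the booleans from array_contains into the 1/0 shown above):
from pyspark.sql import functions as F

df = spark.createDataFrame([(25, 10, ["a", "c"]), (35, 15, ["a", "b", "d"])],
                           ["num1", "num2", "arr"])
values = ["a", "b", "c", "d"]

df.select("num1", "num2",
          *[F.array_contains("arr", v).cast("int").alias(v) for v in values]).show()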

Spark: Removing first array from Nested Array in Scala

I have a DataFrame with 2 columns. I want to remove the first array of the nested array in every record. For example, I have a DF like this:
+---+-----------------------------------------+
|id |arrayField                               |
+---+-----------------------------------------+
|1  |[[Akash,Kunal],[Sonu,Monu],[Ravi,Kishan]]|
|2  |[[Kunal,Mrinal],[Priya,Diya]]            |
|3  |[[Adi,Sadi]]                             |
+---+-----------------------------------------+
and I want my output like this:
+---+---------------------------+
|id |arrayField                 |
+---+---------------------------+
|1  |[[Sonu,Monu],[Ravi,Kishan]]|
|2  |[[Priya,Diya]]             |
|3  |null                       |
+---+---------------------------+
From Spark 2.4 onwards, use the slice function.
Example:
df.show(10,false)
/*
+------------------------+
|arrayField |
+------------------------+
|[[A, k], [s, m], [R, k]]|
|[[k, M], [c, z]] |
|[[A, b]] |
+------------------------+
*/
import org.apache.spark.sql.functions._

df.withColumn("sliced", expr("slice(arrayField,2,size(arrayField))"))
  .withColumn("arrayField", when(size(col("sliced")) === 0, lit(null)).otherwise(col("sliced")))
  .drop("sliced")
  .show()
/*
+----------------+
| arrayField|
+----------------+
|[[s, m], [R, k]]|
| [[c, z]]|
| null|
+----------------+
*/
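For PySpark users, the same slice-based approach translates roughly to the following sketch (assuming the column is called arrayField, as above):
from pyspark.sql import functions as F

result = (df.withColumn("sliced", F.expr("slice(arrayField, 2, size(arrayField))"))
            .withColumn("arrayField",
                        F.when(F.size("sliced") == 0, F.lit(None)).otherwise(F.col("sliced")))
            .drop("sliced"))
result.show()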

Apache pyspark How to create a column with array containing n elements

I have a dataframe with 1 column of type integer.
I want to create a new column with an array containing n elements (n being the number from the first column).
For example:
from pyspark.sql.types import StructType, StructField, IntegerType

x = spark.createDataFrame([(1,), (2,)],
                          StructType([StructField("myInt", IntegerType(), True)]))
+-----+
|myInt|
+-----+
| 1|
| 2|
| 3|
+-----+
I need the resulting data frame to look like this:
+-----+---------+
|myInt| myArr|
+-----+---------+
| 1| [1]|
| 2| [2, 2]|
| 3|[3, 3, 3]|
+-----+---------+
Note: it doesn't actually matter what the values inside the arrays are; it's just the count that matters.
It would be fine if the resulting data frame looked like this:
+-----+------------------+
|myInt| myArr|
+-----+------------------+
| 1| [item]|
| 2| [item, item]|
| 3|[item, item, item]|
+-----+------------------+
It is preferable to avoid UDFs if possible because they are less efficient. You can use array_repeat instead.
import pyspark.sql.functions as F
x.withColumn('myArr', F.array_repeat(F.col('myInt'), F.col('myInt'))).show()
+-----+---------+
|myInt| myArr|
+-----+---------+
| 1| [1]|
| 2| [2, 2]|
| 3|[3, 3, 3]|
+-----+---------+
Use a UDF:
from pyspark.sql.functions import udf

@udf("array<int>")
def rep_(x):
    return [x for _ in range(x)]

x.withColumn("myArr", rep_("myInt")).show()
# +-----+------+
# |myInt| myArr|
# +-----+------+
# | 1| [1]|
# | 2|[2, 2]|
# +-----+------+
