Append a column to an array in a PySpark DataFrame

I have a DataFrame containing 2 columns:
| VPN  | UPC   |
+------+-------+
| 1    | [4,2] |
| 2    | [1,2] |
| null | [4,7] |
I need a result column with the value of VPN (string) appended to the array UPC. The result should look something like this:
| result  |
+---------+
| [4,2,1] |
| [1,2,2] |
| [4,7,]  |

One option is to use concat + array. First use array to wrap the VPN column in a single-element array, then concatenate the two array columns with the concat function:
df = spark.createDataFrame([(1, [4, 2]), (2, [1, 2]), (None, [4, 7])], ['VPN', 'UPC'])
df.show()
+----+------+
| VPN| UPC|
+----+------+
| 1|[4, 2]|
| 2|[1, 2]|
|null|[4, 7]|
+----+------+
df.selectExpr('concat(UPC, array(VPN)) as result').show()
+---------+
| result|
+---------+
|[4, 2, 1]|
|[1, 2, 2]|
| [4, 7,]|
+---------+
Or, more Pythonic, using the DataFrame API functions:
from pyspark.sql.functions import array, concat
df.select(concat('UPC', array('VPN')).alias('result')).show()
+---------+
| result|
+---------+
|[4, 2, 1]|
|[1, 2, 2]|
| [4, 7,]|
+---------+
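Note that because VPN is null in the last row, the concatenated array ends with a null element ([4, 7,]), which is what the question asks for. If the trailing null is not wanted, one option (a sketch, assuming Spark 2.4+ for higher-order functions) is to drop null elements with the filter SQL function:
from pyspark.sql.functions import expr
# Sketch: concatenate as above, then remove null elements so the last row becomes [4, 7]
df.select(
    expr("filter(concat(UPC, array(VPN)), x -> x is not null)").alias("result")
).show()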

Related

Iterating through tables with map function and function that queries other dataframes

I have two tables: Table A
|Group ID | User ids in group|
| -------- | -------------- |
| 11 | [45,46,47,48] |
| 20 | [49,10,11,12] |
| 31 | [55,7,48,43] |
and Table B:
| User ids| Related Id |
| ------- | -------------- |
| 1 | [5,6,7,8] |
| 2 | [6, 9, 10,11] |
| 3 | [1, 2, 5, 7] |
And I have a reference table with the following info:
| User ids | Group ID |
| -------- | -------------- |
| 1 | 11 |
| 2 | 20 |
| 3 | 31 |
This is just a minimal sample; I have this situation with millions of rows in each table. I am trying to use PySpark (or SQL, but I haven't figured out a way to do it there) to iterate through the User ids column in the reference table and get the intersection between the lists User ids in group from Table A and Related Id from Table B.
So in the end, I would like to have a table of the form:
| User ids | Intersection |
| -------- | -------------- |
| 2 | [10, 11] |
| 3 | [7] |
In PySpark I'd have a function of the form:
def test_function(user_id, ref_df, tableB_df, tableA_df):
    group_id = int(ref_df.filter(ref_df.userID == user_id).collect()[0][1])
    group_list = tableA_df.filter(tableA_df.groupID == group_id)
    related_id_list = tableB_df.filter(tableB_df.userID == user_id)
    return group_list.intsersection(related_id_list)

abc = ref_df.rdd.map(lambda x: test_function(x, ref_df, tableB_df, tableA_df))
However, when I run this function I am getting the following error:
An error was encountered:
Could not serialize object: TypeError: can't pickle _thread.RLock objects
Can anyone suggest how to solve this problem, or how to modify my approach? Since my tables have millions of rows, I want to use PySpark's parallelization as much as possible. Thanks for all your help.
First join the reference table with Table A on Group ID, then join the resulting table with Table B on User ids. This will give you a DataFrame that looks like this:
+--------+--------+-----------------+--------------+
|User ids|Group ID|User ids in group| Related Id|
+--------+--------+-----------------+--------------+
| 1| 11| [45, 46, 47, 48]| [5, 6, 7, 8]|
| 2| 20| [49, 10, 11, 12]|[6, 9, 10, 11]|
| 3| 31| [55, 7, 48, 43]| [1, 2, 5, 7]|
+--------+--------+-----------------+--------------+
Then compute the intersection of the columns User ids in group and Related Id. This gives you the columns you want; you only need to filter out the rows where the intersection is empty.
The code snippet below does all of that in PySpark:
import pyspark.sql.functions as F

# Init example tables
table_a = spark.createDataFrame(
    [(11, [45, 46, 47, 48]), (20, [49, 10, 11, 12]), (31, [55, 7, 48, 43])],
    ["Group ID", "User ids in group"],
)
table_b = spark.createDataFrame(
    [(1, [5, 6, 7, 8]), (2, [6, 9, 10, 11]), (3, [1, 2, 5, 7])],
    ["User ids", "Related Id"],
)
reference_table = spark.createDataFrame(
    [(1, 11), (2, 20), (3, 31)], ["User ids", "Group ID"]
)

# Relevant code
joined_df = reference_table.join(table_a, on="Group ID").join(table_b, on="User ids")
intersected_df = joined_df.withColumn(
    "Intersection", F.array_intersect("User ids in group", "Related Id")
)
intersected_df.select("User ids", "Intersection").filter(F.size("Intersection") > 0).show()
Output:
+--------+------------+
|User ids|Intersection|
+--------+------------+
| 2| [10, 11]|
| 3| [7]|
+--------+------------+
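Since the question mentions millions of rows, one possible refinement (a sketch under the assumption that the reference and group tables are much smaller than Table B) is to hint Spark to broadcast the small side of the joins:
from pyspark.sql.functions import broadcast
# Sketch: broadcast the smaller table(s) so those joins avoid a full shuffle;
# if all tables are large, leave the join strategy to Spark.
joined_df = (
    reference_table
    .join(broadcast(table_a), on="Group ID")
    .join(table_b, on="User ids")
)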

Spark: GroupBy and collect_list while filtering by another column

I have the following dataframe
+-----+-----+------+
|group|label|active|
+-----+-----+------+
| a| 1| y|
| a| 2| y|
| a| 1| n|
| b| 1| y|
| b| 1| n|
+-----+-----+------+
I would like to group by the "group" column and collect the "label" values into a list, while filtering on the value in the active column.
The expected result would be
+-----+---------+---------+----------+
|group| labelyes| labelno |difference|
+-----+---------+---------+----------+
|a | [1,2] | [1] | [2] |
|b | [1] | [1] | [] |
+-----+---------+---------+----------+
I can easily filter for the "y" value with
val dfyes = df.filter($"active" === "y").groupBy("group").agg(collect_set("label"))
and similarly for the "n" value
val dfno = df.filter($"active" === "n").groupBy("group").agg(collect_set("label"))
but I don't know whether it's possible to do both aggregations at once while filtering, or how to get the difference of the two sets.
You can do a pivot, and use some array functions to get the difference:
val df2 = df.groupBy("group").pivot("active").agg(collect_list("label")).withColumn(
  "difference",
  array_union(
    array_except(col("n"), col("y")),
    array_except(col("y"), col("n"))
  )
)
df2.show
+-----+---+------+----------+
|group| n| y|difference|
+-----+---+------+----------+
| b|[1]| [1]| []|
| a|[1]|[1, 2]| [2]|
+-----+---+------+----------+
Thanks @mck for his help. I found an alternative way to solve the question, namely filtering with when inside the aggregation:
df
  .groupBy("group")
  .agg(
    collect_set(when($"active" === "y", $"label")).as("labelyes"),
    collect_set(when($"active" === "n", $"label")).as("labelno")
  )
  .withColumn("diff", array_except($"labelyes", $"labelno"))

Compare rows of an array column with the headers of another data frame using Scala and Spark

I am using Scala and Spark.
I have two data frames.
The first one is like the following:
+------+------+-----------+
| num1 | num2 | arr |
+------+------+-----------+
| 25 | 10 | [a,c] |
| 35 | 15 | [a,b,d] |
+------+------+-----------+
In the second data frame the headers are
num1, num2, a, b, c, d
I have created a case class by adding all the possible header columns.
Now what I want is, by matching the columns num1 and num2, to check whether the array in the arr column contains each of the headers of the second data frame. If it does, the value should be 1, otherwise 0.
So the required output is:
+------+------+---+---+---+---+
| num1 | num2 | a | b | c | d |
+------+------+---+---+---+---+
| 25 | 10 | 1 | 0 | 1 | 0 |
| 35 | 15 | 1 | 1 | 0 | 1 |
+------+------+---+---+---+---+
If I understand correctly, you want to transform the array column arr into one column per possible value, each indicating whether or not the array contains that value.
If so, you can use the array_contains function like this:
val df = Seq((25, 10, Seq("a", "c")), (35, 15, Seq("a", "b", "d")))
  .toDF("num1", "num2", "arr")
val values = Seq("a", "b", "c", "d")
df
  .select(Seq("num1", "num2").map(col) ++
    // cast the boolean to an int so the output shows 1/0 as required
    values.map(x => array_contains('arr, x).cast("int") as x): _*)
  .show
.show
+----+----+---+---+---+---+
|num1|num2| a| b| c| d|
+----+----+---+---+---+---+
| 25| 10| 1| 0| 1| 0|
| 35| 15| 1| 1| 0| 1|
+----+----+---+---+---+---+
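If the candidate headers should come from the second data frame's columns rather than being hard-coded, a PySpark-style sketch of the same idea (df2 is an assumed name for that second data frame):
import pyspark.sql.functions as F
# Derive the header values from the second DataFrame's columns (assumption: it is called df2),
# then build one 0/1 indicator column per value.
values = [c for c in df2.columns if c not in ("num1", "num2")]
df.select(
    "num1",
    "num2",
    *[F.array_contains("arr", v).cast("int").alias(v) for v in values],
).show()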

Spark: Removing first array from Nested Array in Scala

I have a DataFrame with 2 columns. I want to remove the first array of the nested array in every record. For example, I have a DF like this:
+---+------------------------------------------+
|id |arrayField                                |
+---+------------------------------------------+
|1  |[[Akash,Kunal],[Sonu,Monu],[Ravi,Kishan]] |
|2  |[[Kunal,Mrinal],[Priya,Diya]]             |
|3  |[[Adi,Sadi]]                              |
+---+------------------------------------------+
and I want my output like this:
+---+----------------------------+
|id |arrayField                  |
+---+----------------------------+
|1  |[[Sonu,Monu],[Ravi,Kishan]] |
|2  |[[Priya,Diya]]              |
|3  |null                        |
+---+----------------------------+
From Spark 2.4 onwards, use the slice function.
Example:
df.show(10,false)
/*
+------------------------+
|arrayField |
+------------------------+
|[[A, k], [s, m], [R, k]]|
|[[k, M], [c, z]] |
|[[A, b]] |
+------------------------+
*/
import org.apache.spark.sql.functions._

df.withColumn("sliced", expr("slice(arrayField, 2, size(arrayField))"))
  .withColumn("arrayField", when(size(col("sliced")) === 0, lit(null)).otherwise(col("sliced")))
  .drop("sliced")
  .show()
/*
+----------------+
| arrayField|
+----------------+
|[[s, m], [R, k]]|
| [[c, z]]|
| null|
+----------------+
*/
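A PySpark version of the same idea, for reference (a sketch assuming Spark 2.4+):
import pyspark.sql.functions as F
# Drop the first inner array with slice, then replace empty results with null.
sliced = F.expr("slice(arrayField, 2, size(arrayField))")
df.withColumn(
    "arrayField",
    F.when(F.size(sliced) == 0, F.lit(None)).otherwise(sliced),
).show()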

Apache PySpark: How to create a column with an array containing n elements

I have a DataFrame with 1 column of type integer.
I want to create a new column with an array containing n elements (n being the number from the first column).
For example:
from pyspark.sql.types import StructType, StructField, IntegerType
x = spark.createDataFrame([(1,), (2,), (3,)], StructType([StructField("myInt", IntegerType(), True)]))
x.show()
+-----+
|myInt|
+-----+
| 1|
| 2|
| 3|
+-----+
I need the resulting data frame to look like this:
+-----+---------+
|myInt| myArr|
+-----+---------+
| 1| [1]|
| 2| [2, 2]|
| 3|[3, 3, 3]|
+-----+---------+
Note: it doesn't actually matter what the values inside the arrays are; it's just the count that matters.
It'd be fine if the resulting data frame looked like this:
+-----+------------------+
|myInt| myArr|
+-----+------------------+
| 1| [item]|
| 2| [item, item]|
| 3|[item, item, item]|
+-----+------------------+
It is preferable to avoid UDFs if possible because they are less efficient. You can use array_repeat instead.
import pyspark.sql.functions as F
x.withColumn('myArr', F.array_repeat(F.col('myInt'), F.col('myInt'))).show()
+-----+---------+
|myInt| myArr|
+-----+---------+
| 1| [1]|
| 2| [2, 2]|
| 3|[3, 3, 3]|
+-----+---------+
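If the literal-filled variant from the question is preferred, array_repeat also works with a constant value (a sketch):
import pyspark.sql.functions as F
# Repeat a fixed placeholder value myInt times instead of repeating myInt itself
x.withColumn('myArr', F.array_repeat(F.lit('item'), F.col('myInt'))).show()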
Alternatively, use a udf:
from pyspark.sql.functions import udf

@udf("array<int>")
def rep_(x):
    return [x for _ in range(x)]

x.withColumn("myArr", rep_("myInt")).show()
# +-----+---------+
# |myInt|    myArr|
# +-----+---------+
# |    1|      [1]|
# |    2|   [2, 2]|
# |    3|[3, 3, 3]|
# +-----+---------+
