Let us assume a DataFrame df as follows:
df.show()
Output:
+------+----------------+
|letter| list_of_numbers|
+------+----------------+
| A| [3, 1, 2, 3]|
| B| [1, 2, 1, 1]|
+------+----------------+
What I want to do is count the number of occurrences of a specific element in the column list_of_numbers. Something like this:
+------+----------------+----+
|letter| list_of_numbers|ones|
+------+----------------+----+
| A| [3, 1, 2, 3]| 1|
| B| [1, 2, 1, 1]| 3|
+------+----------------+----+
I have so far tried creating a udf, and it works perfectly, but I'm wondering if I can do it without defining any udf.
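For reference, the udf I tried was roughly along these lines (a sketch only; my exact code isn't shown here):
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# counts how many times 1 occurs in each array
count_ones = udf(lambda arr: arr.count(1), IntegerType())
df.withColumn("ones", count_ones("list_of_numbers")).show()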
You can explode the array and filter the exploded values for 1. Then groupBy and count:
from pyspark.sql.functions import col, count, explode
df.select("*", explode("list_of_numbers").alias("exploded"))\
.where(col("exploded") == 1)\
.groupBy("letter", "list_of_numbers")\
.agg(count("exploded").alias("ones"))\
.show()
#+------+---------------+----+
#|letter|list_of_numbers|ones|
#+------+---------------+----+
#| A| [3, 1, 2, 3]| 1|
#| B| [1, 2, 1, 1]| 3|
#+------+---------------+----+
In order to keep all rows, even when the count is 0, you can convert the exploded column into an indicator variable. Then groupBy and sum.
from pyspark.sql.functions import col, count, explode, sum as sum_
df.select("*", explode("list_of_numbers").alias("exploded"))\
.withColumn("exploded", (col("exploded") == 1).cast("int"))\
.groupBy("letter", "list_of_numbers")\
.agg(sum_("exploded").alias("ones"))\
.show()
Note, I have imported pyspark.sql.functions.sum as sum_ so as not to overwrite the builtin sum function.
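Alternatively, the aliasing can be avoided entirely by importing the functions module itself; the same logic, sketched with an F prefix:
import pyspark.sql.functions as F

df.select("*", F.explode("list_of_numbers").alias("exploded"))\
    .withColumn("exploded", (F.col("exploded") == 1).cast("int"))\
    .groupBy("letter", "list_of_numbers")\
    .agg(F.sum("exploded").alias("ones"))\
    .show()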
From pyspark 3+, we can use array transformations.
https://mungingdata.com/spark-3/array-exists-forall-transform-aggregate-zip_with/
https://medium.com/expedia-group-tech/deep-dive-into-apache-spark-array-functions-720b8fbfa729
import pyspark.sql.functions as F
df = spark_session.createDataFrame(
    [
        ['A', [3, 1, 2, 3]],
        ['B', [1, 2, 1, 1]]
    ],
    ['letter', 'list_of_numbers'])
df1 = df.selectExpr('*','filter(list_of_numbers, x->x=1) as ones_array')
df2 = df1.selectExpr('*', 'size(ones_array) as ones')
df2.show()
+------+---------------+----------+----+
|letter|list_of_numbers|ones_array|ones|
+------+---------------+----------+----+
| A| [3, 1, 2, 3]| [1]| 1|
| B| [1, 2, 1, 1]| [1, 1, 1]| 3|
+------+---------------+----------+----+
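For reference, the same idea can be written with the Python DataFrame API instead of SQL expressions; a sketch assuming Spark 3.1+, where filter and size are available as Python functions:
import pyspark.sql.functions as F

# filter keeps only the 1s, size counts them (Python API, Spark 3.1+)
df_counts = df.withColumn("ones", F.size(F.filter("list_of_numbers", lambda x: x == 1)))
df_counts.show()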
Assuming that the length of the list is constant, one way I can think of is:
from operator import add
from functools import reduce
import pyspark.sql.functions as F
df = sql.createDataFrame(
    [
        ['A', [3, 1, 2, 3]],
        ['B', [1, 2, 1, 1]]
    ],
    ['letter', 'list_of_numbers'])
expr = reduce(add,[F.when(F.col('list_of_numbers').getItem(x)==1, 1)\
.otherwise(0) for x in range(4)])
df = df.withColumn('ones', expr)
df.show()
+------+---------------+----+
|letter|list_of_numbers|ones|
+------+---------------+----+
| A| [3, 1, 2, 3]| 1|
| B| [1, 2, 1, 1]| 3|
+------+---------------+----+
There was a comment above from Ala Tarighati that the solution did not work for arrays with different lengths. The following variation will solve that problem:
from operator import add
from functools import reduce
import pyspark.sql.functions as F
df = sql.createDataFrame(
    [
        ['A', [3, 1, 2, 3]],
        ['B', [1, 2, 1, 1]]
    ],
    ['letter', 'list_of_numbers'])
df_ones = (
    df.withColumn(
        'ones',
        reduce(
            add,
            [
                # getItem(x) returns null past the end of a shorter array, so the
                # when() condition simply isn't met for those positions
                F.when(
                    F.col("list_of_numbers").getItem(x) == F.lit("1"), 1
                ).otherwise(0)
                # len("drivers") == 7 is just a hard-coded upper bound on the array length
                for x in range(len("drivers"))
            ],
        ),
    )
)
df_ones.show()
+------+---------------+----+
|letter|list_of_numbers|ones|
+------+---------------+----+
| A| [3, 1, 2, 3]| 1|
| B| [1, 2, 1, 1]| 3|
+------+---------------+----+
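Alternatively, if you are on Spark 2.4+, the filter/size idea shown in another answer avoids hard-coding an upper bound altogether; a sketch assuming the same df:
import pyspark.sql.functions as F

# size(filter(...)) counts the matching elements regardless of the array length (Spark 2.4+)
df_ones = df.withColumn("ones", F.expr("size(filter(list_of_numbers, x -> x = 1))"))
df_ones.show()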
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[( 1, 'aa', [None, 9]),
( 1, None, [ 9, 1]),
( 1, 'bb', [ 1, 4]),
( 1, 'cc', [ 4, 5]),
( 2, 'ee', [None, 2]),
( 2, None, [ 2, 8]),
( 2, 'dd', [ 8, 7]),
( 2, None, [ 7, 1])],
['col_id', 'col_val', 'col_arr'])
Desired result - I want to group by col_id and return the last non-null item from col_val:
+------+-------+
|col_id|col_val|
+------+-------+
| 1| cc|
| 2| dd|
+------+-------+
The problem is the ordering column, col_arr. It's an array whose last element is repeated as the first element of the following row. In the above example, the order of col_id=2 goes:
[None, 2], [2, 8], [8, 7], [7, 1].
Since col_val of [7, 1] is null, the result of [8, 7] should be returned, i.e. 'dd'. The ordering always starts with null (None).
I've tried
df = (df
.filter(~F.isnull('col_val'))
.groupBy('col_id')
.agg(F.max_by('col_val', F.col('col_arr')[1]))
)
df.show()
# +------+---------------------------+
# |col_id|max_by(col_val, col_arr[1])|
# +------+---------------------------+
# | 1| aa|
# | 2| dd|
# +------+---------------------------+
It's not successful, as my order column does not follow a simple ascending / descending order.
So, after some decent thinking, I have found a working approach. The steps:
collecting modified rows (as structs) for every col_id into lists
creating a map for every col_id with first elements of the inner lists as keys
sequential lookup in maps, "looping" through elements in array to create ordered lists
removing nulls and extracting the last item
from pyspark.sql import functions as F, Window as W
# replace nulls inside the arrays with a sentinel value so they can act as map keys
df = df.withColumn('col_arr', F.transform('col_arr', lambda x: F.coalesce(x, F.lit(-9))))

# struct of (col_val, last array element); the last element points at the next row
inner_struct = F.struct('col_val', F.col('col_arr')[1].alias('last'))
# per col_id, collect (first array element, inner struct) pairs
c = F.collect_set(F.struct(F.col('col_arr')[0], inner_struct))

df = df.groupBy('col_id').agg(
    F.element_at(F.filter(F.aggregate(
        c,
        F.expr("array(struct(string(null) col_val, -9L last))"),
        # walk the chain: append the entry whose key equals the previous element's 'last'
        lambda acc, x: F.array_union(
            acc,
            F.array(F.map_from_entries(c)[F.element_at(acc, -1)['last']])
        )
    ), lambda x: x.col_val.isNotNull()), -1).col_val.alias('col_val')
)
df.show()
# +------+-------+
# |col_id|col_val|
# +------+-------+
# | 1| cc|
# | 2| dd|
# +------+-------+
I have the following table:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
cols = [ 'a1', 'a2']
data = [([2, 3], [4, 5]),
([1, 3], [2, 4])]
df = spark.createDataFrame(data, cols)
df.show()
# +------+------+
# | a1| a2|
# +------+------+
# |[2, 3]|[4, 5]|
# |[1, 3]|[2, 4]|
# +------+------+
I know how to multiply an array by a scalar. But how do I multiply the members of one array with the corresponding members of another array?
Desired result:
# +------+------+-------+
# | a1| a2| res|
# +------+------+-------+
# |[2, 3]|[4, 5]|[8, 15]|
# |[1, 3]|[2, 4]|[2, 12]|
# +------+------+-------+
Similarly to your example, you can access the 2nd array from the transform function. This assumes that both arrays have the same length:
from pyspark.sql.functions import expr
cols = [ 'a1', 'a2']
data = [([2, 3], [4, 5]),
([1, 3], [2, 4])]
df = spark.createDataFrame(data, cols)
df = df.withColumn("res", expr("transform(a1, (x, i) -> a2[i] * x)"))
# +------+------+-------+
# | a1| a2| res|
# +------+------+-------+
# |[2, 3]|[4, 5]|[8, 15]|
# |[1, 3]|[2, 4]|[2, 12]|
# +------+------+-------+
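Alternatively, if you prefer not to index with i, the zip_with higher-order function (available in Spark SQL since 2.4) expresses the same element-wise product; a sketch on the same df:
from pyspark.sql.functions import expr

# pairwise product via zip_with; positions missing from the shorter array come back as null
df = df.withColumn("res", expr("zip_with(a1, a2, (x, y) -> x * y)"))
df.show()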
Assuming you can have arrays with different sizes:
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
spark = SparkSession.builder.getOrCreate()
cols = ['a1', 'a2']
data = [([2, 3], [4, 5]),
([1, 3], [2, 4]),
([1, 3], [2, 4, 6])]
df = spark.createDataFrame(data, cols)
df = df.withColumn("res", expr("transform(arrays_zip(a1, a2), x -> coalesce(x.a1 * x.a2, 0))"))
df.show(truncate=False)
# +------+---------+----------+
# |a1 |a2 |res |
# +------+---------+----------+
# |[2, 3]|[4, 5] |[8, 15] |
# |[1, 3]|[2, 4] |[2, 12] |
# |[1, 3]|[2, 4, 6]|[2, 12, 0]|
# +------+---------+----------+
Use a User Defined Function (UDF) to perform the multiplication, then call that function:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def multiply(x, y):  # element-wise product of two length-2 arrays
    return [x[0] * y[0], x[1] * y[1]]

multiply_cols = udf(multiply, ArrayType(IntegerType()))
df1 = df.withColumn("res", multiply_cols('a1', 'a2'))
df1.show()
+------+------+-------+
| a1| a2| res|
+------+------+-------+
|[2, 3]|[4, 5]|[8, 15]|
|[1, 3]|[2, 4]|[2, 12]|
+------+------+-------+
https://docs.databricks.com/spark/latest/spark-sql/udf-python.html
There is a function typedLit in the Scala API for Spark to add an Array or Map as a column value.
import org.apache.spark.sql.functions.typedLit
val df1 = Seq((1, 0), (2, 3)).toDF("a", "b")
df1.withColumn("seq", typedLit(Seq(1,2,3)))
.show(truncate=false)
+---+---+---------+
|a |b |seq |
+---+---+---------+
|1 |0 |[1, 2, 3]|
|2 |3 |[1, 2, 3]|
+---+---+---------+
I couldn't find the equivalent in PySpark. How can we create a column in PySpark with an Array as its value?
There isn't an equivalent function in pyspark yet, but you can have an array column as shown below:
from pyspark.sql.functions import array, lit
df = sc.parallelize([[1,2], [3,4]]).toDF(['a', 'b'])
df.withColumn('seq', array([lit(i) for i in [1,2,3]])).show()
Output:
+---+---+---------+
| a| b| seq|
+---+---+---------+
| 1| 2|[1, 2, 3]|
| 3| 4|[1, 2, 3]|
+---+---+---------+
Using expr and array looks the most elegant to me:
df = df.withColumn('seq', F.expr('array(1,2,3)'))
Test results:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,0), (2,3)], ['a', 'b'])
df = df.withColumn('seq', F.expr('array(1,2,3)'))
df.show()
# +---+---+---------+
# | a| b| seq|
# +---+---+---------+
# | 1| 0|[1, 2, 3]|
# | 2| 3|[1, 2, 3]|
# +---+---+---------+
Use F.expr('sequence(1, 3)') if the array numbers need to form a sequence.
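For example, on the same df as above, this is a sketch of what that would look like:
from pyspark.sql import functions as F

df = df.withColumn('seq', F.expr('sequence(1, 3)'))  # seq becomes [1, 2, 3] on every row
df.show()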
You can use .cast() directly after the lit() call to type the Column:
import pyspark.sql.functions as sf
from pyspark.sql.types import LongType
df1.withColumn("long", sf.lit(1).cast(LongType()))
The same works for array():
import pyspark.sql.functions as sf
from pyspark.sql.types import LongType, ArrayType
df1.withColumn("pirate", sf.array([sf.lit(x).cast(LongType()) for x in [1, 2, 3]]))
df1.withColumn("pirate", sf.array([sf.lit(x) for x in [1, 2, 3]]).cast(ArrayType(LongType())))
and if you really like text and typing but hate types, you could use:
df1.withColumn("pirate", sf.array(sf.lit("1"), sf.lit("2")).cast("array<int>"))
;)
PS Also consider using map with sf.lit instead of the for comprehension.
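For example, that map-based variant might look roughly like this (same result as the comprehension):
import pyspark.sql.functions as sf

# same array column as before, built with map instead of a list comprehension
df1.withColumn("pirate", sf.array(*map(sf.lit, [1, 2, 3])))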
I have this dataframe:
+-----+---------------------+
|Index|flagArray |
+-----+---------------------+
|1 |[A, S, A, E, Z, S, S]|
|2 |[A, Z, Z, E, Z, S, S]|
+-----+---------------------+
I want to represent array elements with their corresponding numeric values.
A - 0
F - 1
S - 2
E - 3
Z - 4
So the output dataframe should look like this:
+-----+---------------------+---------------------+
|Index|flagArray |finalArray |
+-----+---------------------+---------------------+
|1 |[A, S, A, E, Z, S, S]|[0, 2, 0, 3, 4, 2, 2]|
|2 |[A, Z, Z, E, Z, S, S]|[0, 4, 4, 3, 4, 2, 2]|
+-----+---------------------+---------------------+
I have written a udf in PySpark where I achieve this with some if/else statements. Is there any better way to handle this?
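(For context, my current udf looks roughly like this; it's only a sketch, since the exact code isn't included here:)
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def flags_to_nums(arr):
    out = []
    for x in arr:
        if x == 'A':
            out.append(0)
        elif x == 'F':
            out.append(1)
        elif x == 'S':
            out.append(2)
        elif x == 'E':
            out.append(3)
        else:  # 'Z'
            out.append(4)
    return out

flags_to_nums_udf = udf(flags_to_nums, ArrayType(IntegerType()))
df = df.withColumn("finalArray", flags_to_nums_udf("flagArray"))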
For Spark 2.4+, you can simply use the transform function to loop through each element of the flagArray array and look up its mapped value with element_at, using a map column built from that mapping:
mappings = {"A": 0, "F": 1, "S": 2, "E": 3, "Z": 4}
mapping_col = map_from_entries(array(*[struct(lit(k), lit(v)) for k, v in mappings.items()]))
df = df.withColumn("mappings", mapping_col) \
.withColumn("finalArray", expr(""" transform(flagArray, x -> element_at(mappings, x))""")) \
.drop("mappings")
df.show(truncate=False)
#+-----+---------------------+---------------------+
#|Index|flagArray |finalArray |
#+-----+---------------------+---------------------+
#|1 |[A, S, A, E, Z, S, S]|[0, 2, 0, 3, 4, 2, 2]|
#|2 |[A, Z, Z, E, Z, S, S]|[0, 4, 4, 3, 4, 2, 2]|
#+-----+---------------------+---------------------+
There doesn't seem to be a built-in function to map array elements, so here's perhaps an alternative udf, different from yours in that it uses a list comprehension:
from pyspark.sql import functions as f

dic = {'A': 0, 'F': 1, 'S': 2, 'E': 3, 'Z': 4}
map_array = f.udf(lambda a: [dic[k] for k in a], 'array<int>')
df.withColumn('finalArray', map_array(df['flagArray'])).show(truncate=False)
Output:
+------+---------------------+---------------------+
|Index |flagArray |finalArray |
+------+---------------------+---------------------+
|1 |[A, S, A, E, Z, S, S]|[0, 2, 0, 3, 4, 2, 2]|
|2 |[A, Z, Z, E, Z, S, S]|[0, 4, 4, 3, 4, 2, 2]|
+------+---------------------+---------------------+
For Spark 3.1+, you could call pyspark.sql.functions.transform and pyspark.sql.functions.element_at to do the job:
import pyspark.sql.functions as F
# create example DataFrame
df = spark.createDataFrame([
(1, "A S A E Z S S".split(" ")),
(2, "A Z Z E Z S S".split(" ")),
], ["Index", "flagArray"])
# make mapping column
mappings = {"A": 0, "F": 1, "S": 2, "E": 3, "Z": 4}
mapping_col = F.map_from_entries(F.array(*[F.struct(F.lit(k), F.lit(v)) for k, v in mappings.items()]))
# transform DataFrame
df = df.withColumn("mappings", mapping_col) \
.withColumn("finalArray", F.transform("flagArray", lambda x: F.element_at("mappings", x))) \
.drop("mappings")
After trying out multiple ways, I successfully created a VectorUDT column with DoubleType values.
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.ml.functions import array_to_vector
# assumed mapping, taken from the question above
mapping_dict = {"A": 0.0, "F": 1.0, "S": 2.0, "E": 3.0, "Z": 4.0}

def mapper_func(val):
    r = []
    for i, x in enumerate(val):
        v = str(mapping_dict.get(x, 0.0))  # default replace with 0.0
        r.append(float(v))
    return r
my_udf = lambda z: mapper_func(z)
label_udf = F.udf(my_udf, T.ArrayType(T.DoubleType()))
df = df.withColumn('finalArray', array_to_vector(label_udf(df.flagArray)))
This will work even for values not present in the dictionary/map; the default replacement is 0.0.
Spark 3.1+: transform is available directly in the Python API.
d = {"A": 0, "F": 1, "S": 2, "E": 3, "Z": 4}
map_col = F.create_map([F.lit(x) for i in d.items() for x in i])
df = df.withColumn("finalArray", F.transform('flagArray', lambda x: map_col[x]))
df.show(truncate=0)
# +-----+---------------------+---------------------+
# |Index|flagArray |finalArray |
# +-----+---------------------+---------------------+
# |1 |[A, S, A, E, Z, S, S]|[0, 2, 0, 3, 4, 2, 2]|
# |2 |[A, Z, Z, E, Z, S, S]|[0, 4, 4, 3, 4, 2, 2]|
# +-----+---------------------+---------------------+
So I need to create an array of numbers enumerating from 1 to 100, as the value of an extra column on each row.
Using the array() function with a bunch of literal values works, but surely there's a way to use / convert a Scala Range(a to b) instead of listing each number individually?
spark.sql("SELECT key FROM schema.table")
.otherCommands
.withColumn("range", array(lit(1), lit(2), ..., lit(100)))
To something like:
withColumn("range", array(1 to 100))
From Spark 2.4 you can use the sequence function.
If you have this dataframe:
df.show()
+--------+
|column_1|
+--------+
| 1|
| 2|
| 3|
| 0|
+--------+
If you use the sequence function from 0 to column_1, you get this:
df.withColumn("range", sequence(lit(0), col("column_1"))).show()
+--------+------------+
|column_1| range|
+--------+------------+
| 1| [0, 1]|
| 2| [0, 1, 2]|
| 3|[0, 1, 2, 3]|
| 0| [0]|
+--------+------------+
For your case, where both bounds are constant (1 to 100), pass both values with lit:
df.withColumn("range", sequence(lit(1), lit(100)))
You can map the lit built-in function over the range inside the array function:
df.withColumn("range", array((1 to 100).map(lit(_)): _*))
For Spark 2.2+ a new function typedLit was introduced that easily solves this problem without using .map(lit(_)) on the array. From the documentation:
The difference between this function and lit is that this function can handle parameterized scala types e.g.: List, Seq and Map.
Use as follows:
import org.apache.spark.sql.functions.typedLit
df.withColumn("range", typedLit((1 to 100).toList))
In the case of PySpark (here for 1 to 10; use range(1, 101) for 1 to 100):
from pyspark.sql import functions as F
DF.withColumn("range", F.array([F.lit(i) for i in range(1, 11)]))
I hope the above answer is useful.
Tested this solution with spark version 2.2.0
Please try this simple way for the same thing:
val df = spark.range(5).toDF("id")
df.withColumn("range", lit(1 to 10 toArray)).show(false)
The output of the code:
+---+-------------------------------+
|id |range |
+---+-------------------------------+
|0 |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
|1 |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
|2 |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
|3 |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
|4 |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
+---+-------------------------------+