Spark: how to union two array columns without removing duplicates

There is a function array_union that unions two arrays and removes duplicates. How can I union two array columns without removing duplicates?
+---------+---------+
|field    |field1   |
+---------+---------+
|[1, 2, 2]|[1, 2, 2]|
+---------+---------+
.withColumn("union", array_union(col("field"), col("field1")))
Desired result:
+---------+---------+------------------+
|field    |field1   |union             |
+---------+---------+------------------+
|[1, 2, 2]|[1, 2, 2]|[1, 2, 2, 1, 2, 2]|
+---------+---------+------------------+

Just use concat instead,
import org.apache.spark.sql.functions.{concat}
df1.withColumn("NewArr", concat("Array1","Array2")).show()
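In PySpark the same idea works, since concat accepts array columns from Spark 2.4 onward. A minimal sketch applied to the data from the question (my example, not part of the original answer):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([1, 2, 2], [1, 2, 2])], ['field', 'field1'])

# concat keeps every element of both arrays, duplicates included
df.withColumn('union', F.concat('field', 'field1')).show()
# +---------+---------+------------------+
# |    field|   field1|             union|
# +---------+---------+------------------+
# |[1, 2, 2]|[1, 2, 2]|[1, 2, 2, 1, 2, 2]|
# +---------+---------+------------------+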

Related

Sort array column where the last item equals the first of the next row

Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 'aa', [None, 9]),
     (1, None, [9, 1]),
     (1, 'bb', [1, 4]),
     (1, 'cc', [4, 5]),
     (2, 'ee', [None, 2]),
     (2, None, [2, 8]),
     (2, 'dd', [8, 7]),
     (2, None, [7, 1])],
    ['col_id', 'col_val', 'col_arr'])
Desired result - I want to group by col_id and return the last non-null item from col_val:
+------+-------+
|col_id|col_val|
+------+-------+
|     1|     cc|
|     2|     dd|
+------+-------+
The problem is the order column. It's an array where its last element is repeated as the first element of the following row. In the above example, the order of col_id=2 goes:
[None, 2], [2, 8], [8, 7], [7, 1].
Since col_val of [7, 1] is null, the result of [8, 7] should be returned, i.e. 'dd'. The ordering always starts with null (None).
I've tried
df = (df
      .filter(~F.isnull('col_val'))
      .groupBy('col_id')
      .agg(F.max_by('col_val', F.col('col_arr')[1]))
)
df.show()
# +------+---------------------------+
# |col_id|max_by(col_val, col_arr[1])|
# +------+---------------------------+
# |     1|                         aa|
# |     2|                         dd|
# +------+---------------------------+
It's not successful, as my order column does not follow a simple ascending / descending order.
So, after some decent thinking, I have found a working approach. The steps:
1. collecting modified rows (as structs) for every col_id into lists
2. creating a map for every col_id with first elements of the inner lists as keys
3. sequential lookup in maps, "looping" through elements in array to create ordered lists
4. removing nulls and extracting the last item
from pyspark.sql import functions as F, Window as W

df = df.withColumn('col_arr', F.transform('col_arr', lambda x: F.coalesce(x, F.lit(-9))))
inner_struct = F.struct('col_val', F.col('col_arr')[1].alias('last'))
c = F.collect_set(F.struct(F.col('col_arr')[0], inner_struct))
df = df.groupBy('col_id').agg(
    F.element_at(F.filter(F.aggregate(
        c,
        F.expr("array(struct(string(null) col_val, -9L last))"),
        lambda acc, x: F.array_union(
            acc,
            F.array(F.map_from_entries(c)[F.element_at(acc, -1)['last']])
        )
    ), lambda x: x.col_val.isNotNull()), -1).col_val.alias('col_val')
)
df.show()
# +------+-------+
# |col_id|col_val|
# +------+-------+
# |     1|     cc|
# |     2|     dd|
# +------+-------+
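To make the lookup idea concrete, here is a plain-Python sketch of the same chain-following logic for col_id=2 (my illustration of the approach, not part of the Spark solution):
# (first, col_val, last) triples for col_id = 2, mirroring the structs collected above
rows = [(None, 'ee', 2), (2, None, 8), (8, 'dd', 7), (7, None, 1)]

# map each row's first element to (col_val, last), like map_from_entries does
lookup = {first: (val, last) for first, val, last in rows}

# walk the chain starting from the null (None) head, collecting col_val in order
ordered, key = [], None
while key in lookup:
    val, key = lookup[key]
    ordered.append(val)

# drop nulls and take the last item
print([v for v in ordered if v is not None][-1])  # 'dd'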

PySpark equivalent of function "typedLit" from Scala API

We have the function typedLit in the Scala API for Spark to add an Array or Map as a column value.
import org.apache.spark.sql.functions.typedLit
val df1 = Seq((1, 0), (2, 3)).toDF("a", "b")
df1.withColumn("seq", typedLit(Seq(1,2,3)))
.show(truncate=false)
+---+---+---------+
|a  |b  |seq      |
+---+---+---------+
|1  |0  |[1, 2, 3]|
|2  |3  |[1, 2, 3]|
+---+---+---------+
I couldn't find the equivalent in PySpark. How can we create a column in PySpark with Array as a column value?
There isn't an equivalent function in pyspark yet, but you can have an array column as shown below:
from pyspark.sql.functions import array, lit
df = sc.parallelize([[1,2], [3,4]]).toDF(['a', 'b'])
df.withColumn('seq', array([lit(i) for i in [1,2,3]])).show()
Output:
+---+---+---------+
|  a|  b|      seq|
+---+---+---------+
|  1|  2|[1, 2, 3]|
|  3|  4|[1, 2, 3]|
+---+---+---------+
Using expr and array looks the most elegant to me:
df = df.withColumn('seq', F.expr('array(1,2,3)'))
Test results:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,0), (2,3)], ['a', 'b'])
df = df.withColumn('seq', F.expr('array(1,2,3)'))
df.show()
# +---+---+---------+
# |  a|  b|      seq|
# +---+---+---------+
# |  1|  0|[1, 2, 3]|
# |  2|  3|[1, 2, 3]|
# +---+---+---------+
Use F.expr('sequence(1,3)') if array numbers need to go in a sequence.
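A sketch of the equivalent DataFrame-API call (my addition, assuming Spark 2.4+ where sequence is available):
from pyspark.sql import functions as F

df = df.withColumn('seq', F.sequence(F.lit(1), F.lit(3)))  # [1, 2, 3] on every row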
You can use .cast() directly after the lit() call to type the Column:
import pyspark.sql.functions as sf
from pyspark.sql.types import LongType
df1.withColumn("long", sf.lit(1).cast(LongType()))
The same works for array():
import pyspark.sql.functions as sf
from pyspark.sql.types import LongType, ArrayType
df1.withColumn("pirate", sf.array([sf.lit(x).cast(LongType()) for x in [1, 2, 3]]))
df1.withColumn("pirate", sf.array([sf.lit(x) for x in [1, 2, 3]]).cast(ArrayType(LongType())))
and if you really like text and typing but hate types, you could use:
df1.withColumn("pirate", sf.array(sf.lit("1"), sf.lit("2")).cast("array<int>"))
;)
P.S. Also consider using map with sf.lit instead of the list comprehension.
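For example, a sketch of that map-based variant (my addition, reusing df1 from above):
import pyspark.sql.functions as sf

# map sf.lit over the values instead of writing a comprehension, then cast the array
df1.withColumn("pirate", sf.array(*map(sf.lit, [1, 2, 3])).cast("array<long>"))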

Pyspark dataframe: Count elements in array or list

Let us assume dataframe df as:
df.show()
Output:
+------+----------------+
|letter| list_of_numbers|
+------+----------------+
|     A|    [3, 1, 2, 3]|
|     B|    [1, 2, 1, 1]|
+------+----------------+
What I want to do is count the number of occurrences of a specific element in the column list_of_numbers. Something like this:
+------+----------------+----+
|letter| list_of_numbers|ones|
+------+----------------+----+
|     A|    [3, 1, 2, 3]|   1|
|     B|    [1, 2, 1, 1]|   3|
+------+----------------+----+
I have so far tried creating a udf and it works perfectly, but I'm wondering if I can do it without defining any udf.
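For context, a minimal sketch of the kind of UDF the question alludes to (hypothetical code, not the asker's):
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# count occurrences of 1 in each array with a plain Python UDF
count_ones = F.udf(lambda xs: xs.count(1), IntegerType())
df.withColumn('ones', count_ones('list_of_numbers')).show()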
You can explode the array and filter the exploded values for 1. Then groupBy and count:
from pyspark.sql.functions import col, count, explode

df.select("*", explode("list_of_numbers").alias("exploded"))\
    .where(col("exploded") == 1)\
    .groupBy("letter", "list_of_numbers")\
    .agg(count("exploded").alias("ones"))\
    .show()
#+------+---------------+----+
#|letter|list_of_numbers|ones|
#+------+---------------+----+
#|     A|   [3, 1, 2, 3]|   1|
#|     B|   [1, 2, 1, 1]|   3|
#+------+---------------+----+
In order to keep all rows, even when the count is 0, you can convert the exploded column into an indicator variable. Then groupBy and sum.
from pyspark.sql.functions import col, count, explode, sum as sum_

df.select("*", explode("list_of_numbers").alias("exploded"))\
    .withColumn("exploded", (col("exploded") == 1).cast("int"))\
    .groupBy("letter", "list_of_numbers")\
    .agg(sum_("exploded").alias("ones"))\
    .show()
Note, I have imported pyspark.sql.functions.sum as sum_ so as not to overwrite the built-in sum function.
From pyspark 3+, we can use array transformations.
https://mungingdata.com/spark-3/array-exists-forall-transform-aggregate-zip_with/
https://medium.com/expedia-group-tech/deep-dive-into-apache-spark-array-functions-720b8fbfa729
import pyspark.sql.functions as F
df = spark_session.createDataFrame(
    [
        ['A', [3, 1, 2, 3]],
        ['B', [1, 2, 1, 1]]
    ],
    ['letter', 'list_of_numbers'])
df1 = df.selectExpr('*','filter(list_of_numbers, x->x=1) as ones_array')
df2 = df1.selectExpr('*', 'size(ones_array) as ones')
df2.show()
+------+---------------+----------+----+
|letter|list_of_numbers|ones_array|ones|
+------+---------------+----------+----+
|     A|   [3, 1, 2, 3]|       [1]|   1|
|     B|   [1, 2, 1, 1]| [1, 1, 1]|   3|
+------+---------------+----------+----+
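The two steps can also be collapsed into a single expression with the DataFrame-API versions of filter and size (available from PySpark 3.1; my sketch, not part of the original answer):
import pyspark.sql.functions as F

# filter keeps only the 1s, size counts them
df.withColumn('ones', F.size(F.filter('list_of_numbers', lambda x: x == 1))).show()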
Assuming that the length of the list is constant, one way I can think of is:
from operator import add
from functools import reduce
import pyspark.sql.functions as F
df = sql.createDataFrame(
    [
        ['A', [3, 1, 2, 3]],
        ['B', [1, 2, 1, 1]]
    ],
    ['letter', 'list_of_numbers'])

expr = reduce(add, [F.when(F.col('list_of_numbers').getItem(x) == 1, 1)
                      .otherwise(0) for x in range(4)])
df = df.withColumn('ones', expr)
df.show()
+------+---------------+----+
|letter|list_of_numbers|ones|
+------+---------------+----+
|     A|   [3, 1, 2, 3]|   1|
|     B|   [1, 2, 1, 1]|   3|
+------+---------------+----+
There was a comment above from Ala Tarighati that the solution did not work for arrays with different lengths. The following variant of the reduce approach handles that, since getItem past the end of a shorter array returns null and therefore counts as 0:
from operator import add
from functools import reduce
import pyspark.sql.functions as F
df = sql.createDataFrame(
    [
        ['A', [3, 1, 2, 3]],
        ['B', [1, 2, 1, 1]]
    ],
    ['letter', 'list_of_numbers'])
max_len = 7  # upper bound on the array length; getItem past the end returns null
df_ones = df.withColumn(
    'ones',
    reduce(
        add,
        [
            F.when(F.col("list_of_numbers").getItem(x) == 1, 1).otherwise(0)
            for x in range(max_len)
        ],
    ),
)
df_ones.show()
+------+---------------+----+
|letter|list_of_numbers|ones|
+------+---------------+----+
|     A|   [3, 1, 2, 3]|   1|
|     B|   [1, 2, 1, 1]|   3|
+------+---------------+----+

Create new column with an array of range of numbers

So I need to create an array of the numbers 1 to 100 as the value of an extra column for each row.
Using the array() function with a bunch of literal values works, but surely there's a way to use / convert a Scala Range(a to b) instead of listing each number individually?
spark.sql("SELECT key FROM schema.table")
.otherCommands
.withColumn("range", array(lit(1), lit(2), ..., lit(100)))
To something like:
withColumn("range", array(1 to 100))
From Spark 2.4 you can use the sequence function.
If you have this dataframe:
df.show()
+--------+
|column_1|
+--------+
|       1|
|       2|
|       3|
|       0|
+--------+
If you use the sequence function from 0 to column_1 you get this:
df.withColumn("range", sequence(lit(0), col("column_1"))).show()
+--------+------------+
|column_1| range|
+--------+------------+
|       1|      [0, 1]|
|       2|   [0, 1, 2]|
|       3|[0, 1, 2, 3]|
|       0|         [0]|
+--------+------------+
For this case, set both values with lit:
df.withColumn("range", sequence(lit(0), lit(100)))
You can map the lit built-in function over the range inside the array function:
df.withColumn("range", array((1 to 100).map(lit(_)): _*))
For Spark 2.2+ a new function typedLit was introduced that easily solves this problem without using .map(lit(_)) on the array. From the documentation:
The difference between this function and lit is that this function can handle parameterized scala types e.g.: List, Seq and Map.
Use as follows:
import org.apache.spark.sql.functions.typedLit
df.withColumn("range", typedLit((1 to 100).toList))
In case of PySpark:
from pyspark.sql import functions as F
DF.withColumn("range",F.array([F.lit(i) for i in range(1,11)]))
I hope the above answer is useful.
Tested this solution with spark version 2.2.0
Please try this simple way for the same thing:
val df = spark.range(5).toDF("id")
df.withColumn("range", lit(1 to 10 toArray)).show(false)
The output of the code:
+---+-------------------------------+
|id |range                          |
+---+-------------------------------+
|0  |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
|1  |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
|2  |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
|3  |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
|4  |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
+---+-------------------------------+

Using PowerShell, how can I count the occurrences of each element in an array?

If I have an array:
1 1 1 2 2 3 4 4 4 4 5 5
How can I use PowerShell to tell me how many of each element there are in that array?
To be a little more clear, I should have a separate count for each array element:
Element:Count
1:3
2:2
3:1
4:4
5:2
You can use the Group-Object cmdlet:
PS> 1,1,1,2,2,3,4,4,4,4,5,5 | group
Count Name Group
----- ---- -----
    3 1    {1, 1, 1}
    2 2    {2, 2}
    1 3    {3}
    4 4    {4, 4, 4, 4}
    2 5    {5, 5}
If you want a hashtable for the items and their counts you just need a little ForEach-Object after it:
$array | group | % { $h = @{} } { $h[$_.Name] = $_.Count } { $h }
You can adjust the output and format it as you like:
PS> $ht= 1,1,1,2,2,3,4,4,4,4,5,5 | Group-Object -AsHashTable -AsString
PS> $ht
Name Value
---- -----
2    {2, 2}
4    {4, 4, 4, 4}
5    {5, 5}
1    {1, 1, 1}
3    {3}
PS> $ht['1']
1
1
1
Joey's helpful answer provides the crucial pointer: use the Group-Object cmdlet to group input objects by equality.
(To group them by one or more of their property values instead, use -Property).
Group-Object outputs [Microsoft.PowerShell.Commands.GroupInfo] objects that each represent a group of equal input objects, whose notable properties are:
.Values ... the value(s) that define the group, as a [System.Collections.ArrayList] instance (which has just 1 element in the case at hand, since the input objects as a whole are used to form the groups).
.Count ... the count of objects in that group.
If, as in this case, there is no need to collect the individual group members as part of each group, -NoElement can be used for efficiency.
You're free to further process the group objects as needed; to get the specific output format stipulated in your question, you can use Select-Object with a calculated property.
To put it all together:
PS> 1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 5, 5 | Group-Object -NoElement |
      Select-Object @{ n='Element:Count'; e={ '{0}:{1}' -f $_.Values[0], $_.Count } }
Element:Count
-------------
1:3
2:2
3:1
4:4
5:2
