I need to add an extra column whose value, for every row, is an array of the numbers 1 to 100.
Using the array() function with a bunch of literal values works, but surely there's a way to use / convert a Scala Range(a to b) instead of listing each number individually?
spark.sql("SELECT key FROM schema.table")
.otherCommands
.withColumn("range", array(lit(1), lit(2), ..., lit(100)))
I'd like to replace that with something like:
withColumn("range", array(1 to 100))
From Spark 2.4 you can use the sequence function.
If you have this dataframe:
df.show()
+--------+
|column_1|
+--------+
| 1|
| 2|
| 3|
| 0|
+--------+
If you use the sequence function from 0 to column_1 you get this:
df.withColumn("range", sequence(lit(0), col("column_1"))).show()
+--------+------------+
|column_1| range|
+--------+------------+
| 1| [0, 1]|
| 2| [0, 1, 2]|
| 3|[0, 1, 2, 3]|
| 0| [0]|
+--------+------------+
For your case (an array from 1 to 100), set both bounds with lit:
df.withColumn("range", sequence(lit(1), lit(100)))
You can map the lit function over the range inside the array function:
df.withColumn("range", array((1 to 100).map(lit(_)): _*))
For Spark 2.2+ a new function typedLit was introduced that easily solves this problem without using .map(lit(_)) on the array. From the documentation:
The difference between this function and lit is that this function can handle parameterized scala types e.g.: List, Seq and Map.
Use as follows:
import org.apache.spark.sql.functions.typedLit
df.withColumn("range", typedLit((1 to 100).toList))
In PySpark:
from pyspark.sql import functions as F
DF.withColumn("range", F.array([F.lit(i) for i in range(1, 101)]))
I hope the above answer is useful.
Tested this solution with Spark version 2.2.0.
Another simple way to do the same thing:
val df = spark.range(5).toDF("id")
df.withColumn("range", lit(1 to 10 toArray)).show(false)
The output of the code:
+---+-------------------------------+
|id |range |
+---+-------------------------------+
|0 |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
|1 |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
|2 |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
|3 |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
|4 |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
+---+-------------------------------+
Related
There is a function array_union that unions two arrays without duplicates. How can I union two arrays without removing duplicates?
+---------+---------+
|field |field1 |
+---------+---------+
|[1, 2, 2]|[1, 2, 2]|
+---------+---------+
.withColumn("union", array_union(col("field"), col("field1")))
Result:
+---------+---------+------------------+
|field |field1 |union |
+---------+---------+------------------+
|[1, 2, 2]|[1, 2, 2]|[1, 2, 2, 1, 2, 2]|
+---------+---------+------------------+
Just use concat instead, which keeps the duplicates:
import org.apache.spark.sql.functions.{col, concat}
df.withColumn("union", concat(col("field"), col("field1"))).show()
For the row above this gives [1, 2, 2, 1, 2, 2].
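A rough PySpark equivalent (a sketch, assuming the same field / field1 columns on a dataframe df) would be:
from pyspark.sql import functions as F
df.withColumn("union", F.concat(F.col("field"), F.col("field1"))).show()
pyspark.sql.functions.concat also concatenates array columns without dropping duplicates.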
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[( 1, 'aa', [None, 9]),
( 1, None, [ 9, 1]),
( 1, 'bb', [ 1, 4]),
( 1, 'cc', [ 4, 5]),
( 2, 'ee', [None, 2]),
( 2, None, [ 2, 8]),
( 2, 'dd', [ 8, 7]),
( 2, None, [ 7, 1])],
['col_id', 'col_val', 'col_arr'])
Desired result - I want to group by col_id and return the last non-null item from col_val:
+------+-------+
|col_id|col_val|
+------+-------+
| 1| cc|
| 2| dd|
+------+-------+
The problem is the order column. It's an array where its last element is repeated as the first element of the following row. In the above example, the order of col_id=2 goes:
[None, 2], [2, 8], [8, 7], [7, 1].
Since col_val of [7, 1] is null, the result of [8, 7] should be returned, i.e. 'dd'. The ordering always starts with null (None).
I've tried
df = (df
.filter(~F.isnull('col_val'))
.groupBy('col_id')
.agg(F.max_by('col_val', F.col('col_arr')[1]))
)
df.show()
# +------+---------------------------+
# |col_id|max_by(col_val, col_arr[1])|
# +------+---------------------------+
# | 1| aa|
# | 2| dd|
# +------+---------------------------+
It's not successful, as my order column does not follow a simple ascending / descending order.
So, after some decent thinking, I have found a working approach. The steps:
1. collecting modified rows (as structs) for every col_id into lists
2. creating a map for every col_id with first elements of the inner lists as keys
3. sequential lookup in the maps, "looping" through elements in the array to create ordered lists
4. removing nulls and extracting the last item
from pyspark.sql import functions as F

# replace nulls in the order array with the sentinel -9
df = df.withColumn('col_arr', F.transform('col_arr', lambda x: F.coalesce(x, F.lit(-9))))

# struct of (col_val, last element of the order array)
inner_struct = F.struct('col_val', F.col('col_arr')[1].alias('last'))
# per col_id: set of (first element of the order array, inner struct) pairs
c = F.collect_set(F.struct(F.col('col_arr')[0], inner_struct))

df = df.groupBy('col_id').agg(
    F.element_at(                  # take the last item of the filtered chain
        F.filter(                  # keep only entries whose col_val is not null
            F.aggregate(           # walk the chain: look up the previous entry's 'last' in the map
                c,
                F.expr("array(struct(string(null) col_val, -9L last))"),
                lambda acc, x: F.array_union(
                    acc,
                    F.array(F.map_from_entries(c)[F.element_at(acc, -1)['last']])
                )
            ),
            lambda x: x.col_val.isNotNull()
        ),
        -1
    ).col_val.alias('col_val')
)
df.show()
# +------+-------+
# |col_id|col_val|
# +------+-------+
# | 1| cc|
# | 2| dd|
# +------+-------+
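To see why the map-lookup loop recovers the right order, here is a plain-Python sketch of the same idea for the col_id=2 rows (illustrative only, not Spark code):
# rows for col_id == 2, with None in the order array replaced by the sentinel -9
rows = [('ee', [-9, 2]), (None, [2, 8]), ('dd', [8, 7]), (None, [7, 1])]

# map keyed by the first element of each order array
chain = {arr[0]: (val, arr[1]) for val, arr in rows}

# follow the chain, starting from the sentinel
ordered, key = [], -9
while key in chain:
    val, key = chain[key]
    ordered.append(val)                            # ['ee', None, 'dd', None]

# drop the nulls and take the last item
print([v for v in ordered if v is not None][-1])   # dd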
Spark's Scala API has a typedLit function for adding an Array or Map as a column value.
import org.apache.spark.sql.functions.typedLit
val df1 = Seq((1, 0), (2, 3)).toDF("a", "b")
df1.withColumn("seq", typedLit(Seq(1,2,3)))
.show(truncate=false)
+---+---+---------+
|a |b |seq |
+---+---+---------+
|1 |0 |[1, 2, 3]|
|2 |3 |[1, 2, 3]|
+---+---+---------+
I couldn't find an equivalent in PySpark. How can we create a column in PySpark with an array as its value?
There isn't an equivalent function in pyspark yet, but you can have an array column as shown below:
from pyspark.sql.functions import array, lit
df = sc.parallelize([[1,2], [3,4]]).toDF(['a', 'b'])
df.withColumn('seq', array([lit(i) for i in [1,2,3]])).show()
Output:
+---+---+---------+
| a| b| seq|
+---+---+---------+
| 1| 2|[1, 2, 3]|
| 3| 4|[1, 2, 3]|
+---+---+---------+
Using expr and array looks the most elegant to me:
df = df.withColumn('seq', F.expr('array(1,2,3)'))
Test results:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,0), (2,3)], ['a', 'b'])
df = df.withColumn('seq', F.expr('array(1,2,3)'))
df.show()
# +---+---+---------+
# | a| b| seq|
# +---+---+---------+
# | 1| 0|[1, 2, 3]|
# | 2| 3|[1, 2, 3]|
# +---+---+---------+
Use F.expr('sequence(1, 3)') if the array should contain a consecutive sequence of numbers.
You can use .cast() directly after the lit() call to type the Column:
import pyspark.sql.functions as sf
from pyspark.sql.types import LongType
df1.withColumn("long", sf.lit(1).cast(LongType()))
The same works for array():
import pyspark.sql.functions as sf
from pyspark.sql.types import LongType, ArrayType
df1.withColumn("pirate", sf.array([sf.lit(x).cast(LongType()) for x in [1, 2, 3]]))
df1.withColumn("pirate", sf.array([sf.lit(x) for x in [1, 2, 3]]).cast(ArrayType(LongType())))
and if you really like text and typing but hate types, you could use:
df1.withColumn("pirate", sf.array(sf.lit("1"), sf.lit("2")).cast("array<int>"))
;)
PS Also consider using map with sf.lit instead of the list comprehension.
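That suggestion would look roughly like this (the same array literal, built with map instead of a comprehension):
import pyspark.sql.functions as sf
df1.withColumn("pirate", sf.array(*map(sf.lit, [1, 2, 3])))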
I'm trying to remove an int value from a jsonb array of ints stored in a jsonb column in the database.
I have a JSON value like:
select '{"a":[1,3]}'::jsonb->'a'
?column?
----------
[1, 3]
(1 row)
And I want to remove 1 from it, the way array_remove works for a regular array:
select array_remove(ARRAY[1,1,2,3], 1);
array_remove
--------------
{2,3}
(1 row)
But if I do something like this:
select array_remove(array('{"a":[1,3]}'::jsonb->'a'), 1);
or:
select array_remove('{"a":[1,3]}'::jsonb->'a'::text::int[], 1);
I get an error like:
operator does not exist: jsonb -> integer[]
How can I cast the JSON value to an array so I can remove a value from it?
For now, you basically have to expand the array to individual values, filter those values, and then create a new array:
with data(original) as (VALUES ('{"a":[1,1,2,3]}'::jsonb), ('{"a":[2,3,4,5,1]}'::jsonb))
select original->'a', filtered
from data
JOIN LATERAL (SELECT jsonb_agg(value) as filtered
from jsonb_array_elements(original->'a') j(value)
WHERE value::int != 1) as filtered ON TRUE;
?column? | filtered
-----------------+--------------
[1, 1, 2, 3] | [2, 3]
[2, 3, 4, 5, 1] | [2, 3, 4, 5]
In Postgres 12, you can use the jsonb_path functions. The syntax is a bit difficult to parse, but it doesn't involve any joins:
with data(original) as (VALUES ('{"a":[1,1,2,3]}'::jsonb), ('{"a":[2,3,4,5,1]}'::jsonb))
select original->'a',
jsonb_path_query_array(original, '$.a[*] ? (# != $filter)', '{"filter":1}')
from data;
?column? | jsonb_path_query_array
-----------------+------------------------
[1, 1, 2, 3] | [2, 3]
[2, 3, 4, 5, 1] | [2, 3, 4, 5]
(2 rows)
Let us assume a dataframe df as follows:
df.show()
Output:
+------+----------------+
|letter| list_of_numbers|
+------+----------------+
| A| [3, 1, 2, 3]|
| B| [1, 2, 1, 1]|
+------+----------------+
What I want to do is count the number of occurrences of a specific element in the column list_of_numbers. Something like this:
+------+----------------+----+
|letter| list_of_numbers|ones|
+------+----------------+----+
| A| [3, 1, 2, 3]| 1|
| B| [1, 2, 1, 1]| 3|
+------+----------------+----+
So far I have tried creating a udf, and it works perfectly, but I'm wondering if I can do it without defining any udf.
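For reference, the udf approach described above might look roughly like this (a sketch, not the asker's actual code):
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

count_ones = F.udf(lambda nums: nums.count(1), IntegerType())
df.withColumn("ones", count_ones("list_of_numbers")).show()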
You can explode the array and filter the exploded values for 1. Then groupBy and count:
from pyspark.sql.functions import col, count, explode
df.select("*", explode("list_of_numbers").alias("exploded"))\
.where(col("exploded") == 1)\
.groupBy("letter", "list_of_numbers")\
.agg(count("exploded").alias("ones"))\
.show()
#+------+---------------+----+
#|letter|list_of_numbers|ones|
#+------+---------------+----+
#| A| [3, 1, 2, 3]| 1|
#| B| [1, 2, 1, 1]| 3|
#+------+---------------+----+
In order to keep all rows, even when the count is 0, you can convert the exploded column into an indicator variable. Then groupBy and sum.
from pyspark.sql.functions import col, count, explode, sum as sum_
df.select("*", explode("list_of_numbers").alias("exploded"))\
.withColumn("exploded", (col("exploded") == 1).cast("int"))\
.groupBy("letter", "list_of_numbers")\
.agg(sum_("exploded").alias("ones"))\
.show()
Note that I have imported pyspark.sql.functions.sum as sum_ so as not to overwrite the builtin sum function.
From Spark 2.4 onwards, we can use higher-order array transformations (here via SQL expressions).
https://mungingdata.com/spark-3/array-exists-forall-transform-aggregate-zip_with/
https://medium.com/expedia-group-tech/deep-dive-into-apache-spark-array-functions-720b8fbfa729
import pyspark.sql.functions as F
df = spark_session.createDataFrame(
[
['A',[3, 1, 2, 3]],
['B',[1, 2, 1, 1]]
],
['letter','list_of_numbers'])
df1 = df.selectExpr('*','filter(list_of_numbers, x->x=1) as ones_array')
df2 = df1.selectExpr('*', 'size(ones_array) as ones')
df2.show()
+------+---------------+----------+----+
|letter|list_of_numbers|ones_array|ones|
+------+---------------+----------+----+
| A| [3, 1, 2, 3]| [1]| 1|
| B| [1, 2, 1, 1]| [1, 1, 1]| 3|
+------+---------------+----------+----+
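On PySpark 3.1+ the same filter/size combination is also available directly through the DataFrame API; a minimal sketch:
import pyspark.sql.functions as F

df.withColumn("ones", F.size(F.filter("list_of_numbers", lambda x: x == 1))).show()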
Assuming that the length of the list is constant, one way I can think of is:
from operator import add
from functools import reduce
import pyspark.sql.functions as F
df = sql.createDataFrame(
[
['A',[3, 1, 2, 3]],
['B',[1, 2, 1, 1]]
],
['letter','list_of_numbers'])
expr = reduce(add,[F.when(F.col('list_of_numbers').getItem(x)==1, 1)\
.otherwise(0) for x in range(4)])
df = df.withColumn('ones', expr)
df.show()
+------+---------------+----+
|letter|list_of_numbers|ones|
+------+---------------+----+
| A| [3, 1, 2, 3]| 1|
| B| [1, 2, 1, 1]| 3|
+------+---------------+----+
There was a comment above from Ala Tarighati that the solution did not work for arrays with different lengths. The following variation handles that, because getItem on an out-of-range index returns null, which fails the when condition and falls through to otherwise(0); just make sure the range covers the longest array:
from operator import add
from functools import reduce
import pyspark.sql.functions as F
df = sql.createDataFrame(
[
['A',[3, 1, 2, 3]],
['B',[1, 2, 1, 1]]
],
['letter','list_of_numbers'])
max_len = 4  # upper bound on the array length in this data; increase if your arrays are longer
df_ones = (
    df.withColumn(
        'ones',
        reduce(
            add,
            [
                F.when(F.col("list_of_numbers").getItem(x) == 1, 1).otherwise(0)
                for x in range(max_len)
            ],
        ),
    )
)
df_ones.show()
+------+---------------+----+
|letter|list_of_numbers|ones|
+------+---------------+----+
| A| [3, 1, 2, 3]| 1|
| B| [1, 2, 1, 1]| 3|
+------+---------------+----+