Filter dataframe on non-empty WrappedArray - arrays

My problem is that I have to find the entries in a list that are not empty. When I use the filter function with is not null, I still get every row back.
My program code looks like this:
...
val csc = new CassandraSQLContext(sc)
val df = csc.sql("SELECT * FROM test").toDF()
val wrapped = df.select("fahrspur_liste")
wrapped.printSchema
The column fahrspur_liste contains the wrapped arrays, and this is the column I have to analyze. When I run the code, I get this structure for my wrapped array and these entries:
root
|-- fahrspur_liste: array (nullable = true)
| |-- element: long (containsNull = true)
+--------------+
|fahrspur_liste|
+--------------+
| []|
| []|
| [56]|
| []|
| [36]|
| []|
| []|
| [34]|
| []|
| []|
| []|
| []|
| []|
| []|
| []|
| [103]|
| []|
| [136]|
| []|
| [77]|
+--------------+
only showing top 20 rows
Now I want to filter these rows so that I only have the entries [56], [36], [34], [103], ...
How can I write a filter function so that I get only the rows that contain a number?

I don't think you need a UDF here.
You can just use the size method and filter out all the rows whose array size is 0:
df.filter(""" size(fahrspur_liste) != 0 """)

You can do this with a UDF in Spark:
val removeEmpty = udf((array: Seq[Long]) => !array.isEmpty)
val df2 = df.filter(removeEmpty($"fahrspur_liste"))
Here the UDF checks whether the array is empty. The filter then keeps only the rows for which the UDF returns true, i.e. the rows with non-empty arrays.

Related

Reading complex nested json file in pyspark

I have been having trouble for some days trying to resolve this.
I have a nested JSON file with a complex schema (arrays inside structs, structs inside arrays) and I need to put the data into a dataframe.
What I have in input is this (as an example):
+-----+----------------+-----------------------------------+---------+
| id | name | detail | item |
+-----+----------------+-----------------------------------+---------+
| 100 | Peter Castle | [[D100A, Credit],[D100B, Debit]] | [10,31] |
| 101 | Quino Yukimori | [[D101A, Credit],[D101B, Credit]] | [55,49] |
+-----+----------------+-----------------------------------+---------+
I should read it like this:
+-----+----------------+-----------+--------+-----------+
| id | name | detail_id | type | item_qty |
+-----+----------------+-----------+--------+-----------+
| 100 | Peter Castle | D100A | Credit | 10 |
| 100 | Peter Castle | D100B | Debit | 31 |
| 101 | Quino Yukimori | D101A | Credit | 55 |
| 101 | Quino Yukimori | D101B | Credit | 49 |
+-----+----------------+-----------+--------+-----------+
But with
df.withColumn('detail', explode('detail')).withColumn('item', explode('item'))
what I get is this:
+-----+----------------+-----------+--------+-----------+
| id | name | detail_id | type | item_qty |
+-----+----------------+-----------+--------+-----------+
| 100 | Peter Castle | D100A | Credit | 10 |
| 100 | Peter Castle | D100A | Debit | 10 |
| 100 | Peter Castle | D100B | Credit | 31 |
| 100 | Peter Castle | D100B | Debit | 31 |
| 101 | Quino Yukimori | D101A | Credit | 55 |
| 101 | Quino Yukimori | D101A | Credit | 55 |
| 101 | Quino Yukimori | D101B | Credit | 49 |
| 101 | Quino Yukimori | D101B | Credit | 49 |
+-----+----------------+-----------+--------+-----------+
I have tried combining the columns with arrays_zip and then exploding, but the problem is that there are arrays inside arrays, and if I explode the detail array columns, exploding the item array columns multiplies the data.
Any idea how I can implement that?
Sorry about my English, it is not my native language.
UPDATED
Here is my schema; the multiple nested arrays are what make it complicated to read:
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- detail: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- detail_id: string (nullable = true)
| | |-- type: string (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- item_qty: long (nullable = true)
|-- deliveryTrack: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- date: string (nullable = true)
| | |-- track: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- time: string (nullable = true)
| | | | |-- driver: string (nullable = true)
Use explode only once after you zip both arrays with arrays_zip. After that, use the expr function to get the data.
from pyspark.sql.functions import explode, arrays_zip, col, expr
df1 = (df
.withColumn('buffer', explode(arrays_zip(col('detail'), col('item'))))
.withColumn('detail_id', expr("buffer.detail.detail_id"))
.withColumn('type', expr("buffer.detail.type"))
.withColumn('item_qty', expr("buffer.item.item_qty"))
.drop(*['detail', 'item', 'buffer'])
)
df1.show()
+---+--------------+---------+------+--------+
|id |name |detail_id|type |item_qty|
+---+--------------+---------+------+--------+
|100|Peter Castle |D100A |Credit|10 |
|100|Peter Castle |D100B |Debit |31 |
|101|Quino Yukimori|D101A |Credit|55 |
|101|Quino Yukimori|D101B |Credit|49 |
+---+--------------+---------+------+--------+

Get the most common element of an array using Pyspark

How can I get the most common element of an array after concatenating two columns using PySpark?
from pyspark.sql import functions as F

df = spark.createDataFrame([
    [['a','a','b'],['a']],
    [['c','d','d'],['']],
    [['e'],['e','f']],
    [[''],['']]
]).toDF("arr_1","arr_2")
df_new = df.withColumn('arr', F.concat(F.col('arr_1'), F.col('arr_2')))
expected output:
+-----+---------+-------+
| arr | arr_1   | arr_2 |
+-----+---------+-------+
| [a] | [a,a,b] | [a]   |
| [d] | [c,d,d] | []    |
| [e] | [e]     | [e,f] |
| []  | []      | []    |
+-----+---------+-------+
Try this:
from pyspark.sql.functions import col, concat, desc, explode, monotonically_increasing_id, rank
from pyspark.sql.window import Window

df1 = df.select('arr_1','arr_2',monotonically_increasing_id().alias('id'),concat('arr_1','arr_2').alias('arr'))
df1.select('id',explode('arr')).\
    groupBy('id','col').count().\
    select('id','col','count',rank().over(Window.partitionBy('id').orderBy(desc('count'))).alias('rank')).\
    filter(col('rank')==1).\
    join(df1,'id').\
    select(col('col').alias('arr'), 'arr_1', 'arr_2').show()
+---+---------+------+
|arr| arr_1| arr_2|
+---+---------+------+
| a|[a, a, b]| [a]|
| | []| []|
| e| [e]|[e, f]|
| d|[c, d, d]| []|
+---+---------+------+
You can explode the array, then with a group-by count and a Window we can get the most frequently occurring element.
Example:
from pyspark.sql.functions import *
from pyspark.sql import *

df = spark.createDataFrame([
    [['a','a','b'],['a']],
    [['c','d','d'],['']],
    [['e'],['e','f']],
    [[''],['']]
]).toDF("arr_1","arr_2")
df_new = df.withColumn('arr_concat', concat(col('arr_1'), col('arr_2')))
df1 = df_new.withColumn("mid", monotonically_increasing_id())
df2 = df1.selectExpr("explode(arr_concat) as arr", "mid").groupBy("mid", "arr").agg(count(lit("1")).alias("cnt"))
w = Window.partitionBy("mid").orderBy(desc("cnt"))
df3 = df2.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop(*["rn", "cnt"])
df3.join(df1, ['mid'], 'inner').drop(*['mid', 'arr_concat']).withColumn("arr", array(col("arr"))).show()
#+---+---------+------+
#|arr| arr_1| arr_2|
#+---+---------+------+
#|[d]|[c, d, d]| []|
#|[e]| [e]|[e, f]|
#|[a]|[a, a, b]| [a]|
#| []| []| []|
#+---+---------+------+

Mark cells that contain words from a FILTER list

I have a list of words in a basket which I want to pre-select through a FILTER list, color-coding the words in the basket that appear in the FILTER sheet. The challenge is that the cells in the basket do not contain exactly the same word, but rather contain it as part of a longer string. Here is an example:
FILTER:
+--------+
| Apple |
+--------+
| Banana |
+--------+
Basket:
+--------------+---+
| Apple//Cake | x |
+--------------+---+
| Water | |
+--------------+---+
| Coke bottle | |
+--------------+---+
| Banana split | x |
+--------------+---+
use:
=(A2<>"")*(INDEX(REGEXMATCH(LOWER(A2),
TEXTJOIN("|", 1, LOWER(INDIRECT("FILTER!A:A"))))))
You can use the following formula as a rule in Conditional formatting
=REGEXMATCH(A2,""&JOIN("|",INDIRECT("'FILTER'!A1:A3"))&"")

Add JSON object field to a JSON array field in the dataframe using scala

Is there any method where I can add a JSON object to an already existing JSON object array?
I have a dataframe:
+-------------------------+---------------------------------------------------------+------------+
| name | hit_songs | column1 |
+-------------------------+---------------------------------------------------------+------------+
|{"HomePhone":"34567002"} | [{"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}] | value1 |
|{"HomePhone":"34567011"} | [{"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}] | value2 |
+-------------------------+---------------------------------------------------------+------------+
I want a resulting dataframe as:
+-----------------------------------------------------------------------------------+---------+
| name                                                                                | column1 |
+-----------------------------------------------------------------------------------+---------+
| [{"HomePhone":"34567002"},{"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}]   | value1  |
| [{"HomePhone":"34567011"},{"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}]   | value2  |
+-----------------------------------------------------------------------------------+---------+
Use the array_union function.
name is of type string; to convert this column to an array type, wrap it with array.
Check the code below.
scala> df.show(false)
+------------------------+-------------------------------------------------------+
|name |hit_songs |
+------------------------+-------------------------------------------------------+
|{"HomePhone":"34567002"}|[{"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}]|
|{"HomePhone":"34567011"}|[{"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}]|
+------------------------+-------------------------------------------------------+
scala> df.withColumn("name",array_union(array($"name"),$"hit_songs")).show(false) // Use array_union function, to join name string column with hit_songs array column, first convert name to array(name).
+---------------------------------------------------------------------------------+-------------------------------------------------------+
|name |hit_songs |
+---------------------------------------------------------------------------------+-------------------------------------------------------+
|[{"HomePhone":"34567002"}, {"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}]|[{"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}]|
|[{"HomePhone":"34567011"}, {"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}]|[{"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}]|
+---------------------------------------------------------------------------------+-------------------------------------------------------+
The same approach works with additional columns by chaining array_union calls. For example, with an extra dammy column:
scala> df.show(false)
+------------------------+-------------+-------------------------------------------------------+
|name |dammy |hit_songs |
+------------------------+-------------+-------------------------------------------------------+
|{"HomePhone":"34567002"}|{"aaa":"aaa"}|[{"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}]|
|{"HomePhone":"34567011"}|{"bbb":"bbb"}|[{"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}]|
+------------------------+-------------+-------------------------------------------------------+
scala> df.printSchema
root
|-- name: string (nullable = true)
|-- dammy: string (nullable = true)
|-- hit_songs: array (nullable = true)
| |-- element: string (containsNull = true)
scala> df.withColumn("name",array_union(array_union(array($"name"),$"hit_songs"),array($"dammy"))).show(false)
+---------------------------------------------------------------------------------+-------------+-------------------------------------------------------+
|name |dammy |hit_songs |
+---------------------------------------------------------------------------------+-------------+-------------------------------------------------------+
|[{"HomePhone":"34567002"}, {"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}]|{"aaa":"aaa"}|[{"Phonetypecode":"PTC001"},{"Phonetypecode":"PTC002"}]|
|[{"HomePhone":"34567011"}, {"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}]|{"bbb":"bbb"}|[{"Phonetypecode":"PTC021"},{"Phonetypecode":"PTC022"}]|
+---------------------------------------------------------------------------------+-------------+-------------------------------------------------------+

How to iterate over Presto ARRAY(MAP(VARCHAR, VARCHAR))

I have an array of maps (unordered key-value pairs), and would like to filter out any map items in the array that do not have either a created or a modified date before 2019-01-01. Is there a way to accomplish this in Presto without nested tables (I have to iterate over multiple columns that are structured in this way)?
BEFORE
+-----------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Category1 | Count_Items | Item_Details                                                                                                                                                |
+===========+=============+=============================================================================================================================================================+
| Fruit     | 3           | [{"created":"2019-09-15","color":"red","name":"apples"},{"name":"bananas","created":"2018-08-20"},{"modified":"2019-02-01","name":"kiwi","color":"green"}] |
| Vegetable | 2           | [{"color":"green","modified":"2018-01-01","created":"2019-03-31","name":"kale"},{"name":"cauliflower","created":"2019-01-02"}]                             |
+-----------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
AFTER
+-----------+-------------+----------------------------------------------------------------------------------+
| Category1 | Count_Items | Item_Details                                                                     |
+===========+=============+==================================================================================+
| Fruit     | 1           | [{"name":"bananas","created":"2018-08-20"}]                                      |
| Vegetable | 1           | [{"color":"green","modified":"2018-01-01","created":"2019-03-31","name":"kale"}] |
+-----------+-------------+----------------------------------------------------------------------------------+
You need to use the filter function on the array -- you have an array(map) and want an array(map) back. For this, you construct the filter condition for filter as a lambda.
(Let me know if you need more detailed instructions.)
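For reference, here is a minimal sketch of what that could look like, assuming a hypothetical table named my_table with the columns from the example; filter takes the array and a lambda, element_at returns NULL for missing map keys, and cardinality recomputes the item count:
SELECT
  Category1,
  cardinality(filtered_items) AS Count_Items,
  filtered_items              AS Item_Details
FROM (
  SELECT
    Category1,
    filter(
      Item_Details,
      x -> coalesce(try_cast(element_at(x, 'created')  AS date) < DATE '2019-01-01', false)
        OR coalesce(try_cast(element_at(x, 'modified') AS date) < DATE '2019-01-01', false)
    ) AS filtered_items
  FROM my_table -- hypothetical table name
) t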
