Get the most common element of an array using Pyspark - arrays

How can I get the most common element of an array after concatenating two columns using PySpark?
from pyspark.sql import functions as F

df = spark.createDataFrame([
[['a','a','b'],['a']],
[['c','d','d'],['']],
[['e'],['e','f']],
[[''],['']]
]).toDF("arr_1","arr_2")
df_new = df.withColumn('arr', F.concat(F.col('arr_1'), F.col('arr_2')))
expected output:
+-----+---------+-------+
|arr  |arr_1    |arr_2  |
+-----+---------+-------+
|[a]  |[a,a,b]  |[a]    |
|[d]  |[c,d,d]  |[]     |
|[e]  |[e]      |[e,f]  |
|[]   |[]       |[]     |
+-----+---------+-------+

Try this:
from pyspark.sql.functions import *
from pyspark.sql import Window

df1 = df.select('arr_1', 'arr_2', monotonically_increasing_id().alias('id'), concat('arr_1', 'arr_2').alias('arr'))
df1.select('id', explode('arr')).\
    groupBy('id', 'col').count().\
    select('id', 'col', 'count', rank().over(Window.partitionBy('id').orderBy(desc('count'))).alias('rank')).\
    filter(col('rank') == 1).\
    join(df1, 'id').\
    select(col('col').alias('arr'), 'arr_1', 'arr_2').show()
+---+---------+------+
|arr| arr_1| arr_2|
+---+---------+------+
| a|[a, a, b]| [a]|
| | []| []|
| e| [e]|[e, f]|
| d|[c, d, d]| []|
+---+---------+------+

You can explode the array, then use a groupBy/count together with a Window function to pick the most frequently occurring element per row.
Example:
from pyspark.sql.functions import *
from pyspark.sql.window import Window

df = spark.createDataFrame([
[['a','a','b'],['a']],
[['c','d','d'],['']],
[['e'],['e','f']],
[[''],['']]
]).toDF("arr_1","arr_2")

# Concatenate the two array columns and tag each row with a unique id.
df_new = df.withColumn('arr_concat', concat(col('arr_1'), col('arr_2')))
df1 = df_new.withColumn("mid", monotonically_increasing_id())

# Explode the concatenated array and count occurrences of each element per row.
df2 = df1.selectExpr("explode(arr_concat) as arr", "mid").groupBy("mid", "arr").agg(count(lit("1")).alias("cnt"))

# Keep only the most frequent element per row.
w = Window.partitionBy("mid").orderBy(desc("cnt"))
df3 = df2.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop(*["rn", "cnt"])

# Join back to the original rows and wrap the winning element in an array.
df3.join(df1, ['mid'], 'inner').drop(*['mid', 'arr_concat']).withColumn("arr", array(col("arr"))).show()
#+---+---------+------+
#|arr| arr_1| arr_2|
#+---+---------+------+
#|[d]|[c, d, d]| []|
#|[e]| [e]|[e, f]|
#|[a]|[a, a, b]| [a]|
#| []| []| []|
#+---+---------+------+
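
As a side note (not part of the original answer): if you are on Spark 3.4 or later, the built-in mode aggregate can replace the Window step. A minimal sketch, reusing df1 from the example above and assuming pyspark.sql.functions.mode is available:

from pyspark.sql import functions as F

# Sketch, assuming Spark >= 3.4 (where F.mode was added). On ties it returns
# one of the most frequent elements, which may differ from the rank()/row_number() pick.
per_row_mode = df1.selectExpr("explode(arr_concat) as elem", "mid").\
    groupBy("mid").\
    agg(F.mode("elem").alias("arr"))
per_row_mode.join(df1, "mid").select(F.array("arr").alias("arr"), "arr_1", "arr_2").show()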

Related

Spark: Removing first array from Nested Array in Scala

I have a DataFrame with 2 columns. I want to remove the first array of the nested array in every record. Example: I have a DF like this
+---+------------------------------------------+
|id |arrayField                                |
+---+------------------------------------------+
|1  |[[Akash,Kunal],[Sonu,Monu],[Ravi,Kishan]] |
|2  |[[Kunal, Mrinal],[Priya,Diya]]            |
|3  |[[Adi,Sadi]]                              |
+---+------------------------------------------+
and I want my output like this:
+---+----------------------------+
|id |arrayField                  |
+---+----------------------------+
|1  |[[Sonu,Monu],[Ravi,Kishan]] |
|2  |[[Priya,Diya]]              |
|3  |null                        |
+---+----------------------------+
From Spark 2.4 you can use the slice function.
Example:
df.show(10,false)
/*
+------------------------+
|arrayField |
+------------------------+
|[[A, k], [s, m], [R, k]]|
|[[k, M], [c, z]] |
|[[A, b]] |
+------------------------+
*/
import org.apache.spark.sql.functions._

df.withColumn("sliced", expr("slice(arrayField, 2, size(arrayField))")).
  withColumn("arrayField", when(size(col("sliced")) === 0, lit(null)).otherwise(col("sliced"))).
  drop("sliced").
  show()
/*
+----------------+
| arrayField|
+----------------+
|[[s, m], [R, k]]|
| [[c, z]]|
| null|
+----------------+
*/
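
For anyone following along from the PySpark question above, roughly the same slice-based approach can be written in PySpark. This is only a sketch, assuming Spark 2.4+ and a PySpark DataFrame df with the same arrayField column:

from pyspark.sql import functions as F

# Drop the first inner array; replace the result with null when nothing remains.
sliced = F.expr("slice(arrayField, 2, size(arrayField))")
df.withColumn("arrayField",
              F.when(F.size(sliced) == 0, F.lit(None)).otherwise(sliced)).show()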

DataFrame column (Array type) contains Null values and empty array (len =0). How to convert Null to empty array?

I have a Spark DataFrame with an array column (StringType).
Sample DataFrame:
df = spark.createDataFrame([
[None],
[[]],
[['foo']]
]).toDF("a")
Current Output:
+-----+
| a|
+-----+
| null|
| []|
|[foo]|
+-----+
Desired Output:
+-----+
| a|
+-----+
| []|
| []|
|[foo]|
+-----+
I need to convert the Null values to an empty Array to concat with another array column.
I already tried this, but it's not working:
df.withColumn("a",F.coalesce(F.col("a"),F.from_json(F.lit("[]"), T.ArrayType(T.StringType()))))
Use the array function.
df = spark.createDataFrame([
[None],
[[]],
[['foo']]
]).toDF("a")
import pyspark.sql.functions as F
df.withColumn('a', F.coalesce(F.col('a'), F.array(F.lit(None)))).show(10, False)
+-----+
|a |
+-----+
|[] |
|[] |
|[foo]|
+-----+
The result is now array(string), so there are no more null rows. Please check the results:
temp = spark.sql("SELECT a FROM table WHERE a is NULL")
temp.show(10, False)
temp = spark.sql("SELECT a FROM table WHERE a = array(NULL)")
temp.show(10, False)
temp = spark.sql("SELECT a FROM table")
temp.show(10, False)
+---+
|a |
+---+
+---+
+---+
|a |
+---+
|[] |
+---+
+-----+
|a |
+-----+
|[] |
|[] |
|[foo]|
+-----+
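
If you need a genuinely empty array rather than an array containing a single null, a possible variant (just a sketch, not the original answer; it assumes the element type is string) is to coalesce with an empty array cast to the right type:

import pyspark.sql.functions as F

# Sketch: null rows become a truly empty array<string> instead of array(null).
df.withColumn("a", F.coalesce(F.col("a"), F.array().cast("array<string>"))).show()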

Having trouble with Postgres unnest array syntax

I am looking for guidance on the best way to do this insert. I am trying to create 11 entries for role_id 58385 while looping through the values of each of these arrays. I am new to PostgreSQL and need some guidance as to what I am doing wrong in this instance.
INSERT INTO public.acls (role_id, acl_id, update, can_grant, retrieve, create, archive) VALUES (
'58385',
unnest(array[1,14,20,21,22,24,25,26,36,300,302]),
unnest(array[f,f,t,t,f,f,f,t,f,t,t]),
unnest(array[f,f,f,f,f,f,f,f,f,f,f]),
unnest(array[t,t,t,t,t,t,t,t,t,t,t]),
unnest(array[f,f,t,t,f,f,f,t,f,t,t]),
unnest(array[f,f,f,f,f,f,f,f,f,f,f])
)
Do I need a SELECT subquery for each of the arrays? Or could I make one array from the six and insert them?
A single SELECT will do it for you, but t and f will need to be written as true and false:
select '58385',
unnest(array[1,14,20,21,22,24,25,26,36,300,302]),
unnest(array[false,false,true,true,false,false,false,true,false,true,true]),
unnest(array[false,false,false,false,false,false,false,false,false,false,false]),
unnest(array[true,true,true,true,true,true,true,true,true,true,true]),
unnest(array[false,false,true,true,false,false,false,true,false,true,true]),
unnest(array[false,false,false,false,false,false,false,false,false,false,false])
;
?column? | unnest | unnest | unnest | unnest | unnest | unnest
----------+--------+--------+--------+--------+--------+--------
58385 | 1 | f | f | t | f | f
58385 | 14 | f | f | t | f | f
58385 | 20 | t | f | t | t | f
58385 | 21 | t | f | t | t | f
58385 | 22 | f | f | t | f | f
58385 | 24 | f | f | t | f | f
58385 | 25 | f | f | t | f | f
58385 | 26 | t | f | t | t | f
58385 | 36 | f | f | t | f | f
58385 | 300 | t | f | t | t | f
58385 | 302 | t | f | t | t | f
(11 rows)

How to return first not empty cell from importrange values?

My Google Sheets document contains data like this:
+---+---+---+---+---+---+
| | A | B | C | D | E |
+---+---+---+---+---+---+
| 1 | | c | | x | |
+---+---+---+---+---+---+
| 2 | | r | | 4 | |
+---+---+---+---+---+---+
| 3 | | | | m | |
+---+---+---+---+---+---+
| 4 | | | | | |
+---+---+---+---+---+---+
Columns B and D contain data provided by the IMPORTRANGE function, which is stored in different files.
I would like to fill column A with the first non-empty value in each row; in other words, the desired result must look like this:
+---+---+---+---+---+---+
| | A | B | C | D | E |
+---+---+---+---+---+---+
| 1 | c | c | | x | |
+---+---+---+---+---+---+
| 2 | r | r | | 4 | |
+---+---+---+---+---+---+
| 3 | m | | | m | |
+---+---+---+---+---+---+
| 4 | | | | | |
+---+---+---+---+---+---+
I tried the ISBLANK function, but apparently if a column is imported then, even if the value is empty, it is not blank, so this function doesn't work for my case. Then I tried the QUERY function in 2 different variants:
1) =QUERY({B1;D1}; "select Col1 where Col1 is not null limit 1"; 0), but the result in this case is wrong when the row contains cells with numbers. The result with this query is the following:
+---+---+---+---+---+---+
| | A | B | C | D | E |
+---+---+---+---+---+---+
| 1 | c | c | | x | |
+---+---+---+---+---+---+
| 2 | 4 | r | | 4 | |
+---+---+---+---+---+---+
| 3 | m | | | m | |
+---+---+---+---+---+---+
| 4 | | | | | |
+---+---+---+---+---+---+
2) =QUERY({B1;D1};"select Col1 where Col1 <> '' limit 1"; 0) / =QUERY({B1;D1};"select Col1 where Col1 != '' limit 1"; 0), and this doesn't work at all; the result is always #N/A.
Also, I would like to avoid using nested IFs and JavaScript scripts if possible, as a solution with the QUERY function suits my case best due to easy expansion to other columns without any deeper programming knowledge. Is there any way to do this simply, just with QUERY (am I just missing something), or do I have to use IFs/JavaScript?
try:
=ARRAYFORMULA(SUBSTITUTE(INDEX(IFERROR(SPLIT(TRIM(TRANSPOSE(QUERY(
TRANSPOSE(SUBSTITUTE(B:G, " ", "♦")),,99^99))), " ")),,1), "♦", " "))

Filter dataframe on non-empty WrappedArray

My problem is that I have to find, in a list, the entries which are not empty. When I use the filter function is not null, I still get every row.
My program code looks like this:
...
val csc = new CassandraSQLContext(sc)
val df = csc.sql("SELECT * FROM test").toDF()
val wrapped = df.select("fahrspur_liste")
wrapped.printSchema
The column fahrspur_liste contains the wrapped arrays, and this is the column I have to analyze. When I run the code, I get this structure for my wrapped array and these entries:
root
|-- fahrspur_liste: array (nullable = true)
| |-- element: long (containsNull = true)
+--------------+
|fahrspur_liste|
+--------------+
| []|
| []|
| [56]|
| []|
| [36]|
| []|
| []|
| [34]|
| []|
| []|
| []|
| []|
| []|
| []|
| []|
| [103]|
| []|
| [136]|
| []|
| [77]|
+--------------+
only showing top 20 rows
Now I want to filter these rows so that I have only the entries [56], [36], [34], [103], ...
How can I write a filter function so that I get only the rows which contain a number?
I don't think you need to use a UDF here.
You can just use the size method and filter out all the rows with array size = 0:
df.filter(""" size(fahrspur_liste) != 0 """)
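
For reference, an equivalent filter in PySpark (a sketch, assuming a PySpark DataFrame df with the same column) would be:

from pyspark.sql import functions as F

# Keep only the rows whose array column is non-empty.
df.filter(F.size(F.col("fahrspur_liste")) > 0).show()
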
You can do this with a UDF in Spark:
val removeEmpty = udf((array: Seq[Long]) => !array.isEmpty)
val df2 = df.filter(removeEmpty($"fahrspur_liste"))
Here the UDF checks whether the array is empty. The filter function then keeps the rows for which the UDF returns true.
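
A comparable PySpark version of this UDF-based filter (again only a sketch; the built-in size approach above is usually preferable) could look like this:

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# Hypothetical PySpark counterpart of the Scala UDF above:
# returns True for rows whose array is non-empty.
remove_empty = F.udf(lambda arr: arr is not None and len(arr) > 0, BooleanType())
df.filter(remove_empty(F.col("fahrspur_liste"))).show()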
