How to map over an Spark array without exploding it? - arrays

MY case is that I have an array column that I'd like to filter. Consider the following:
+------------------------------------------------------+
| column|
+------------------------------------------------------+
|[prefix1-whatever, prefix2-whatever, prefix4-whatever]|
|[prefix1-whatever, prefix2-whatever, prefix3-whatever]|
|[prefix1-whatever, prefix2-whatever, prefix5-whatever]|
|[prefix1-whatever, prefix2-whatever, prefix3-whatever]|
+------------------------------------------------------+
I'd like to filter only columns containing prefix-4, prefix-5, prefix-6, prefix-7, [...]. So,using an "or" statement is not scalable here.
Of course, I can just:
val prefixesList = List("prefix-4", "prefix-5", "prefix-6", "prefix-7")
df
.withColumn("prefix", explode($"column"))
.withColumn("prefix", split($"prefix", "\\-").getItem(0))
.withColumn("filterColumn", $"prefix".inInCollection(prefixesList))
But that involves exploding, which I want to avoid. My plan right now is to define an array column from prefixesList, and then use array_intersect to filter it - for this to work, though, I have to get rid of the -whatever part (which is, obviously, different for each entry). Was this a Scala array, I could easily do a map over it. But, being it a Spark Array, I don't know if that is possible.
TL;DR I have a dataframe containing an array column. I'm trying to manipulate it and filter it without exploding (because, if I do explode, I'll have to manipulate it later to reverse the explode, and I'd like to avoid it).
Can I achieve that without exploding? If so, how?

Not sure if I understood your question correctly: you want to keep all lines that do not contain any of the prefixes in prefixesList?
If so, you can write your own filter function:
def filterPrefixes (row: Row) : Boolean = {
for( s <- row.getSeq[String](0)) {
for( p <- Seq("prefix4", "prefix5", "prefix6", "prefix7")) {
if( s.startsWith(p) ) {
return false
}
}
}
return true
}
and then use it as argument for the filter call:
df.filter(filterPrefixes _)
.show(false)
prints
+------------------------------------------------------+
|column |
+------------------------------------------------------+
|[prefix1-whatever, prefix2-whatever, prefix3-whatever]|
|[prefix1-whatever, prefix2-whatever, prefix3-whatever]|
+------------------------------------------------------+

It's relatively trivial to convert the Dataframe to a Dataset[Array[String]], and map over those arrays as whole elements. The basic idea is that you can iterate over your list of arrays easily, without having to flatten the entire dataset.
val df = Seq(Seq("prefix1-whatever", "prefix2-whatever", "prefix4-whatever"),
Seq("prefix1-whatever", "prefix2-whatever", "prefix3-whatever"),
Seq("prefix1-whatever", "prefix2-whatever", "prefix5-whatever"),
Seq("prefix1-whatever", "prefix2-whatever", "prefix3-whatever")
).toDF("column")
val pl = List("prefix4", "prefix5", "prefix6", "prefix7")
val df2 = df.as[Array[String]].map(a => {
a.flatMap(s => {
val start = s.split("-")(0)
if(pl.contains(start)) {
Some(s)
} else {
None
}
})
}).toDF("column")
df2.show(false)
The above code results in:
+------------------+
|column |
+------------------+
|[prefix4-whatever]|
|[] |
|[prefix5-whatever]|
|[] |
+------------------+
I'm not entirely sure how this would compare performance wise to actually flattening and recombining the data set. Doing this misses any catalyst optimizations, but avoids a lot of unnecessary shuffling of data.
P.S. I corrected for a minor issue in your prefix list, since "prefix-N" didn't match the data's pattern.

You can achieve it using SQL API. If you want to keep only rows that contain any of values prefix-4, prefix-5, prefix-6, prefix-7 you could use arrays_overlap function. Otherwise, if you want to keep rows that contain all of your values you could try array_intersect and then check if its size is equal to count of your values.
val df = Seq(
Seq("prefix1-a", "prefix2-b", "prefix3-c", "prefix4-d"),
Seq("prefix4-e", "prefix5-f", "prefix6-g", "prefix7-h", "prefix8-i"),
Seq("prefix6-a", "prefix7-b", "prefix8-c", "prefix9-d"),
Seq("prefix8-d", "prefix9-e", "prefix10-c", "prefix12-a")
).toDF("arr")
val schema = StructType(Seq(
StructField("arr", ArrayType.apply(StringType)),
StructField("arr2", ArrayType.apply(StringType))
))
val encoder = RowEncoder(schema)
val df2 = df.map(s =>
(s.getSeq[String](0).toArray, s.getSeq[String](0).map(s => s.substring(0, s.indexOf("-"))).toArray)
).map(s => RowFactory.create(s._1, s._2))(encoder)
val prefixesList = Array("prefix4", "prefix5", "prefix6", "prefix7")
val prefixesListSize = prefixesList.size
val prefixesListCol = lit(prefixesList)
df2.select('arr,'arr2,
arrays_overlap('arr2,prefixesListCol).as("OR"),
(size(array_intersect('arr2,prefixesListCol)) === prefixesListSize).as("AND")
).show(false)
it will give you:
+-------------------------------------------------------+---------------------------------------------+-----+-----+
|arr |arr2 |OR |AND |
+-------------------------------------------------------+---------------------------------------------+-----+-----+
|[prefix1-a, prefix2-b, prefix3-c, prefix4-d] |[prefix1, prefix2, prefix3, prefix4] |true |false|
|[prefix4-e, prefix5-f, prefix6-g, prefix7-h, prefix8-i]|[prefix4, prefix5, prefix6, prefix7, prefix8]|true |true |
|[prefix6-a, prefix7-b, prefix8-c, prefix9-d] |[prefix6, prefix7, prefix8, prefix9] |true |false|
|[prefix8-d, prefix9-e, prefix10-c, prefix12-a] |[prefix8, prefix9, prefix10, prefix12] |false|false|
+-------------------------------------------------------+---------------------------------------------+-----+-----+
so finally you can use:
df2.filter(size(array_intersect('arr2,prefixesListCol)) === prefixesListSize).show(false)
and you will get below result:
+-------------------------------------------------------+---------------------------------------------+
|arr |arr2 |
+-------------------------------------------------------+---------------------------------------------+
|[prefix4-e, prefix5-f, prefix6-g, prefix7-h, prefix8-i]|[prefix4, prefix5, prefix6, prefix7, prefix8]|
+-------------------------------------------------------+---------------------------------------------+

Related

How to compare two array of string columns in Pyspark

I want to compare two arrays and filter the data frame
condition_1 = AAA
condition_2 = ["AAA","BBB","CCC"]
My spark data frame has a column with array of strings
df = df.withColumn("array_column", F.lit(["XXX","YYY","AAA"]))
# to filter a string condition_1 with the array column
df = df.filter(
F.col('array_column').isin(condition_1) &
# second filter here
)
But how can I filter condition_2 in in a similar way? since they are both arrays?
Code I tried:
df = df.filter(
F.col('array_column').isin(condition_1) &
any(x in condition_2 for x in F.col('array_column'))
)
But I get an error - Column is not iterable.
I also tried - bool(set(F.col('array_column')).intersection(condition_2))
But still have the same error. Can anyone help me with this?
Hope I got your question right. It wasnt as clear. Use pyspark's array functions
Data
condition_1 = 'AAA'
condition_2 = ["AAA","BBB","CCC"]
df=spark.createDataFrame([('1A', '3412asd','value-1', ['XXX', 'YYY', 'AAA']),
('2B', '2345tyu','value-2', ['DDD', 'YFFFYY', 'GGG']),
('3C', '9800bvd', 'value-3', ['AAA']),
('3C', '9800bvd', 'value-1', ['AAA', 'YYY', 'CCCC'])],
('ID', 'Company_Id', 'value' ,'array_column'))
df.show()
+---+----------+-------+------------------+
| ID|Company_Id| value| array_column|
+---+----------+-------+------------------+
| 1A| 3412asd|value-1| [XXX, YYY, AAA]|
| 2B| 2345tyu|value-2|[DDD, YFFFYY, GGG]|
| 3C| 9800bvd|value-3| [AAA]|
| 3C| 9800bvd|value-1| [AAA, YYY, CCCC]|
+---+----------+-------+------------------+
Code
df.where((array_contains(col('array_column'), lit(condition_1)))&(size(array_intersect(col('array_column'),array([lit(x) for x in condition_2])))!=0)).show(truncate=False)
Outcome
+---+----------+-------+----------------+
|ID |Company_Id|value |array_column |
+---+----------+-------+----------------+
|1A |3412asd |value-1|[XXX, YYY, AAA] |
|3C |9800bvd |value-3|[AAA] |
|3C |9800bvd |value-1|[AAA, YYY, CCCC]|
+---+----------+-------+----------------+
How it works
condition_1 ; get a boolean selection of where column contains string
array_contains(col('array_column'), lit(condition_1))
condition_2 ; This happens in stages
Intersect column with the list
array_intersect(col('array_column'),array([lit(x) for x in condition_2]))
get the size of the outcome of 1 above
size(array_intersect(col('array_column'),array([lit(x) for x in` condition_2])))
Check that the intersection contains at least one item
size(array_intersect(col('array_column'),array([lit(x) for x in condition_2])))!=0
Finally, chain condition_1 and condition_2 using operant & and pass into the df.where() or df.filter() methods

Compare two arrays from two different dataframes in Pyspark

I have two dataframes ecah has an array(string) columns.
I am trying to create a new data frame that only filters rows where one of the array element in a row matches with other.
#first dataframe
main_df = spark.createDataFrame([('1', ['YYY', 'MZA']),
('2', ['XXX','YYY']),
('3',['QQQ']),
('4', ['RRR', 'ZZZ', 'BBB1'])],
('No', 'refer_array_col'))
#second dataframe
df = spark.createDataFrame([('1A', '3412asd','value-1', ['XXX', 'YYY', 'AAA']),
('2B', '2345tyu','value-2', ['DDD', 'YFFFYY', 'GGG', '1']),
('3C', '9800bvd', 'value-3', ['AAA']),
('3C', '9800bvd', 'value-1', ['AAA', 'YYY', 'CCCC'])],
('ID', 'Company_Id', 'value' ,'array_column'))
df.show()
+---+----------+-------+--------------------+
| ID|Company_Id| value| array_column |
+---+----------+-------+--------------------+
| 1A| 3412asd|value-1| [XXX, YYY, AAA] |
| 2B| 2345tyu|value-2|[DDD, YFFFYY, GGG, 1]|
| 3C| 9800bvd|value-3| [AAA] |
| 3C| 9800bvd|value-1| [AAA, YYY, CCCC] |
+---+----------+-------+---------------------+
Code I tried:
The main idea is to use rdd.toLocalIterator() as there are some other functions inside the same for loop that are depending on this filters
for x in main_df.rdd.toLocalIterator:
a = main_df["refer_array_col"]
b = main_df["No"]
some_x_filter = F.col('array_coulmn').isin(b)
final_df = df.filter(
# filter 1
some_x_filter &
# second filter is to compare 'a' with array_column - i tried using F.array_contains
(F.array_contains(F.col('array_column'), F.lit(a)))
)
some_x_filter is also working in a similar way
some_x_filter is comparing a string value in a array of strings column.
But now a contains a list of strings and I am unable to compare it with array_column
With my code I am getting an error for array contains
Error
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList ['YYY', 'MZA']
Can anyone tell me what can i use at the second filter alternatively?
From what I understood based on our conversation in the comments.
Essentially your requirement is to compare an array column with a Python List.
Thus, this would do the job
df.withColumn("asArray", F.array(*[F.lit(x) for x in b]))

How to create a new array of substrings from string array column in a spark dataframe

I have a spark dataframe. One of the columns is an array type consisting of an array of text strings of varying lengths. I am looking for a way to add a new column that is an array of the unique left 8 characters of those strings.
df.printSchema()
root
(...)
|-- arr_agent: array (nullable = true)
| |-- element: string (containsNull = true)
example data from column "arr_agent":
["NRCANL2AXXX", "NRCANL2A"]
["UTRONL2U", "BKRBNL2AXXX", "BKRBNL2A"]
["NRCANL2A"]
["UTRONL2U", "REUWNL2A002", "BKRBNL2A", "REUWNL2A", "REUWNL2N"]
["UTRONL2U", "UTRONL2UXXX", "BKRBNL2A"]
["MQBFDEFFYYY", "MQBFDEFFZZZ", "MQBFDEFF" ]
What I need to have in the new column:
["NRCANL2A"]
["UTRONL2U", "BKRBNL2A"]
["NRCANL2A"]
["UTRONL2U", "BKRBNL2A", "REUWNL2A", "REUWNL2N"]
["UTRONL2U", "BKRBNL2A"]
["MQBFDEFF" ]
I already tried to define a udf that does it for me.
from pyspark.sql import functions as F
from pyspark.sql import types as T
def make_list_of_unique_prefixes(text_array, prefix_length=8):
out_arr = set(t[0:prefix_length] for t in text_array)
return(out_arr)
make_list_of_unique_prefixes_udf = F.udf(lambda x,y=8: make_list_of_unique_prefixes(x,y), T.ArrayType(T.StringType()))
But then calling:
df.withColumn("arr_prefix8s", F.collect_set( make_list_of_unique_prefixes_udf(F.col("arr_agent") )))
Throws an error
AnalysisException: grouping expressions sequence is empty,
Any tips would be appreciated.
thanks
You can solve this using higher order functions available from spark 2.4+ using transform and substring and then take array distinct:
from pyspark.sql import functions as F
n = 8
out = df.withColumn("New",F.expr(f"array_distinct(transform(arr_agent,x->substring(x,0,{n})))"))
out.show(truncate=False)
+-----------------------------------------------------+----------------------------------------+
|arr_agent |New |
+-----------------------------------------------------+----------------------------------------+
|[NRCANL2AXXX, NRCANL2A] |[NRCANL2A] |
|[UTRONL2U, BKRBNL2AXXX, BKRBNL2A] |[UTRONL2U, BKRBNL2A] |
|[NRCANL2A] |[NRCANL2A] |
|[UTRONL2U, REUWNL2A002, BKRBNL2A, REUWNL2A, REUWNL2N]|[UTRONL2U, REUWNL2A, BKRBNL2A, REUWNL2N]|
|[UTRONL2U, UTRONL2UXXX, BKRBNL2A] |[UTRONL2U, BKRBNL2A] |
|[MQBFDEFFYYY, MQBFDEFFZZZ, MQBFDEFF] |[MQBFDEFF] |
+-----------------------------------------------------+----------------------------------------+

Pyspark Array Column - Replace Empty Elements with Default Value

I have a dataframe with a column which is an array of strings. Some of the elements of the array may be missing like so:
-------------|-------------------------------
ID |array_list
---------------------------------------------
38292786 |[AAA,, JLT] |
38292787 |[DFG] |
38292788 |[SHJ, QKJ, AAA, YTR, CBM] |
38292789 |[DUY, ANK, QJK, POI, CNM, ADD] |
38292790 |[] |
38292791 |[] |
38292792 |[,,, HKJ] |
I would like to replace the missing elements with a default value of "ZZZ". Is there a way to do that? I tried the following code, which is using a transform function and a regular expression:
import pyspark.sql.functions as F
from pyspark.sql.dataframe import DataFrame
def transform(self, f):
return f(self)
DataFrame.transform = transform
df = df.withColumn("array_list2", F.expr("transform(array_list, x -> regexp_replace(x, '', 'ZZZ'))"))
This doesn't give an error but it is producing nonsense. I'm thinking I just don't know the correct way to identify the missing elements of the array - can anyone help me out?
In production our data has around 10 million rows and I am trying to avoid using explode or a UDF (not sure if it's possible to avoid using both though, just need the code run as efficiently as possible). I'm using Spark 2.4.4
This is what I would like my output to look like:
-------------|-------------------------------|-------------------------------
ID |array_list | array_list2
---------------------------------------------|-------------------------------
38292786 |[AAA,, JLT] |[AAA, ZZZ, JLT]
38292787 |[DFG] |[DFG]
38292788 |[SHJ, QKJ, AAA, YTR, CBM] |[SHJ, QKJ, AAA, YTR, CBM]
38292789 |[DUY, ANK, QJK, POI, CNM, ADD] |[DUY, ANK, QJK, POI, CNM, ADD]
38292790 |[] |[ZZZ]
38292791 |[] |[ZZZ]
38292792 |[,,, HKJ] |[ZZZ, ZZZ, ZZZ, HKJ]
The regex_replace works at character level.
I could not get it to work with transform either, but with help from the first answerer I used a UDF - not that easy.
Here is my example with my data, you can tailor.
%python
from pyspark.sql.types import StringType, ArrayType
from pyspark.sql.functions import udf, col
concat_udf = udf(
lambda con_str, arr: [
x if x is not None else con_str for x in arr or [None]
],
ArrayType(StringType()),
)
arrayData = [
('James',['Java','Scala']),
('Michael',['Spark','Java',None]),
('Robert',['CSharp','']),
('Washington',None),
('Jefferson',['1','2'])]
df = spark.createDataFrame(data=arrayData, schema = ['name','knownLanguages'])
df = df.withColumn("knownLanguages", concat_udf(lit("ZZZ"), col("knownLanguages")))
df.show()
returns:
+----------+------------------+
| name| knownLanguages|
+----------+------------------+
| James| [Java, Scala]|
| Michael|[Spark, Java, ZZZ]|
| Robert| [CSharp, ]|
|Washington| [ZZZ]|
| Jefferson| [1, 2]|
+----------+------------------+
Quite tough this, had some help from the first answerer.
I'm thinking of something, but i'm not sure if it is efficient.
from pyspark.sql import functions as F
df.withColumn("array_list2", F.split(F.array_join("array_list", ",", "ZZZ"), ","))
First I concatenate the values as a string with a delimiter , (hoping you don't have it in your string but you can use something else). I use the null_replacement option to fill the null values. Then I split according to the same delimiter.
EDIT: Based on #thebluephantom comment, you can try this solution :
df.withColumn(
"array_list_2", F.expr(" transform(array_list, x -> coalesce(x, 'ZZZ'))")
).show()
SQL built-in transform is not working for me, so I couldn't try it but hopefully you'll have the result you wanted.

Scala: Delete empty array values from a Spark DataFrame

I'm a new learner of Scala. Now given a DataFrame named df as follows:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [null]| [0.0]| [0.0]| [null]|
| [IND1]| [5.0]| [6.0]| [A]|
| [IND2]| [7.0]| [8.0]| [B]|
| []| []| []| []|
+-------+-------+-------+-------+
I'd like to delete rows if all columns is an empty array (4th row).
For example I might expect the result is:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [null]| [0.0]| [0.0]| [null]|
| [IND1]| [5.0]| [6.0]| [A]|
| [IND2]| [7.0]| [8.0]| [B]|
+-------+-------+-------+-------+
I'm trying to use isNotNull (like val temp=df.filter(col("Column1").isNotNull && col("Column2").isNotNull && col("Column3").isNotNull && col("Column4").isNotNull).show()
) but still show all rows.
I found python solution of using a Hive UDF from link, but I had hard time trying to convert to a valid scala code. I would like use scala command similar to the following code:
val query = "SELECT * FROM targetDf WHERE {0}".format(" AND ".join("SIZE({0}) > 0".format(c) for c in ["Column1", "Column2", "Column3","Column4"]))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.sql(query)
Any help would be appreciated. Thank you.
Using the isNotNull or isNull will not work because it is looking for a 'null' value in the DataFrame. Your example DF does not contain null values but empty values, there is a difference there.
One option: You could create a new column that has the length of of the array and filter for if the array is zero.
val dfFil = df
.withColumn("arrayLengthColOne", size($"Column1"))
.withColumn("arrayLengthColTwo", size($"Column2"))
.withColumn("arrayLengthColThree", size($"Column3"))
.withColumn("arrayLengthColFour", size($"Column4"))
.filter($"arrayLengthColOne" =!= 0 && $"arrayLengthColTwo" =!= 0
&& $"arrayLengthColThree" =!= 0 && $"arrayLengthColFour" =!= 0)
.drop("arrayLengthColOne", "arrayLengthColTwo", "arrayLengthColThree", "arrayLengthColFour")
Original DF:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [A]| [B]| [C]| [d]|
| []| []| []| []|
+-------+-------+-------+-------+
New DF:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [A]| [B]| [C]| [d]|
+-------+-------+-------+-------+
You could also create a function that will map across all the columns and do it.
Another approach (in addition to accepted answer) would be using Datasets.
For example, by having a case class:
case class MyClass(col1: Seq[String],
col2: Seq[Double],
col3: Seq[Double],
col4: Seq[String]) {
def isEmpty: Boolean = ...
}
You can represent your source as a typed structure:
import spark.implicits._ // needed to provide an implicit encoder/data mapper
val originalSource: DataFrame = ... // provide your source
val source: Dataset[MyClass] = originalSource.as[MyClass] // convert/map it to Dataset
So you could do filtering like following:
source.filter(element => !element.isEmpty) // calling class's instance method

Resources