Split an array column into chunks of max size

I have a DataFrame with one column of array[string] type.
scala> df.printSchema
root
|-- user: string (nullable = true) ### this is a unique key
|-- items: array (nullable = true)
| |-- element: string (containsNull = true)
Due to some limitations on the consumer's side, I need to limit the number of elements in the items column, e.g. to a maximum of 1000 elements. The resulting DataFrame would have the same schema, except the user column is no longer unique. For example, with max elements = 3:
Input DataFrame:
+----+----------------------+
|user|items |
+----+----------------------+
|u1 |[a, b, cc, d, e, f, g]|
|u2 |[h, ii] |
|u3 |[j, kkkk, m, nn, o] |
+----+----------------------+
Output DataFrame:
+----+------------+
|user|items |
+----+------------+
|u1 |[a, f, g] |
|u1 |[b, cc, d] |
|u1 |[e] |
|u2 |[h, ii] |
|u3 |[j, nn, m] |
|u3 |[kkkk, o] |
+----+------------+
The order of items is not important.
The value of each item is just a string of alphanumeric chars, but the size of each item is not fixed.
Performance is not an issue; the DataFrame is small, but we need the solution in Spark SQL.

This can be worked out without higher-order functions, in three easy steps:
1. posexplode the array of items;
2. take the integral part of dividing each item's pos by N, the desired maximum number of elements per subarray;
3. collect_list the new arrays, grouping by user and that group index.
For N=3:
>>> df = spark.createDataFrame([
... {'user':'u1','items':['a', 'b', 'cc', 'd', 'e', 'f', 'g']},
... {'user':'u2','items':['h', 'ii']},
... {'user':'u3','items':['j', 'kkkk', 'm', 'nn', 'o']}
... ])
>>> from pyspark.sql.functions import *
>>> df1 = df.select(posexplode(df.items),df.user)
>>> df2 = df1.select(floor(df1.pos/3).alias('pos'),df1.col.alias('item'),df1.user)
>>> df3 = df2.groupby([df2.user,df2.pos]).agg(collect_list(df2.item)).drop('pos')
>>> df3.show(truncate=False)
+----+------------------+
|user|collect_list(item)|
+----+------------------+
|u2 |[h, ii] |
|u1 |[a, b, cc] |
|u1 |[d, e, f] |
|u1 |[g] |
|u3 |[nn, o] |
|u3 |[j, kkkk, m] |
+----+------------------+
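Since the question asks for a solution in Spark SQL, the same three steps can also be written as a single SQL query. This is a minimal sketch, assuming the DataFrame is registered as a temporary view named input_df (the view name is purely illustrative) and N=3:
>>> df.createOrReplaceTempView('input_df')
>>> spark.sql("""
...     SELECT user, collect_list(item) AS items
...     FROM (
...         SELECT user, floor(pos / 3) AS grp, item
...         FROM input_df
...         LATERAL VIEW posexplode(items) t AS pos, item
...     ) exploded
...     GROUP BY user, grp
... """).show(truncate=False)
Here floor(pos / 3) plays the same role as floor(df1.pos/3) above, and grouping by user and grp rebuilds the subarrays.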

Related

PostgreSQL / TypeORM: search array in array column - return only the highest arrays' intersection

Let's say we have 2 edges in a graph. Each edge has many events observed on it, and each event has one or several tags associated with it.
Say the first edge had 7 events with these tags: ABC, ABC, AC, BC, A, A, B.
The second edge had 3 events: BC, BC, C.
We want the user to be able to search how many events occurred on each edge, filtered by a set of given tags; the tags are not mutually exclusive, nor do they have a strict hierarchical relationship.
We represent this schema with 2 pre-aggregated tables:
Edges table:
+----+
| id |
+----+
| 1 |
| 2 |
+----+
EdgeStats table (which relates to the Edges table via edge_id):
+------+---------+-----------+---------------+
| id | edge_id | tags | metric_amount |
+------+---------+-----------+---------------+
| 1 | 1 | [A, B, C] | 7 |
| 2 | 1 | [A, B] | 7 |
| 3 | 1 | [B, C] | 5 |
| 4 | 1 | [A, C] | 6 |
| 5 | 1 | [A] | 5 |
| 6 | 1 | [B] | 4 |
| 7 | 1 | [C] | 4 |
| 8 | 1 | null | 7 | //null represents aggregated stats for given edge, not important here.
| 9 | 2 | [B, C] | 3 |
| 10 | 2 | [B] | 2 |
| 11 | 2 | [C] | 3 |
| 12 | 2 | null | 3 |
+------+---------+-----------+---------------+
Note that when the table has tags [A, B], for example, the row represents the number of events that had at least one of these tags associated with them. So A OR B, or both.
Because the user can filter by any combination of these tags, DataTeam populated the EdgeStats table with all permutations of tags observed per given edge (edges are completely independent of each other; however, I am looking for a way to query all edges with one query).
I need to filter this table by the tags the user selected, let's say [A, C, D]. The problem is we don't have tag D in the data. The expected result is:
+------+---------+-----------+---------------+
| id | edge_id | tags | metric_amount |
+------+---------+-----------+---------------+
| 4 | 1 | [A, C] | 6 |
| 11 | 2 | [C] | 3 |
+------+---------+-----------+---------------+
i.e. for each edge, the largest matching subset between what the user searched for and what we have in the tags column. Rows with id 5 and 7 were not returned because the information in them is already contained in row 4.
Why return [A, C] for an [A, C, D] search? Since there is no data on edge 1 with tag D, the metric amount for [A, C] equals the one for [A, C, D].
How do I write a query that returns this?
If you can just answer the question above, you can ignore what's below:
If I needed to filter by [A], [B], or [A, B], the problem would be trivial: I could just search for an exact array match:
query.where("edge_stats.tags = :filter",
{
filter: [A, B],
}
)
However, the EdgeStats table doesn't contain every tag combination the user could search by (there would be too many), so I need a more clever solution.
Here is a list of a few possible solutions, all imperfect:
1) Try an exact match for all subsets of the user's search term: if the user searches by tags [A, C, D], first query for [A, C, D]; if there is no exact match, try [C, D], [A, D], [A, C], and voila, we got the match!
2) Use the <@ ("is contained by") operator:
.where(
"edge_stats.tags <# :tags",
{
tags:[A, C, D],
}
)
This will return all rows which contain either A, C, or D, so rows 1, 2, 3, 4, 5, 7, 11, 13. Then it would be possible to filter out all but the highest subset match in the application code. But using this approach we couldn't use SUM and similar functions, and returning too many rows is not good practice.
3) An approach built on 2) and inspired by this answer:
.where(
"edge_stats.tags <# :tags",
{
tags: [A, C, D],
}
)
.addOrderBy("edge.id")
.addOrderBy("CARDINALITY(edge_stats.tags)", "DESC")
.distinctOn(["edge.id"]);
What it does is: for every edge, find all rows whose tags contain either A, C, or D, and keep the highest match (highest meaning the longest array), thanks to ordering by cardinality and selecting only one row per edge.
So the returned rows are indeed 4 and 11.
This approach is great, but when I use it as one filtering step of a much larger query, I need to add a bunch of groupBy statements, and it essentially adds a bit more complexity than I would like.
I wonder if there is a simpler approach that simply gets the highest match between the array in the table's column and the array in the query argument?
Your approach #3 should be fine, especially if you have an index on CARDINALITY(edge_stats.tags). However,
DataTeam populated EdgeStats table with all permutations of tags observed per given edge
If you're using a pre-aggregation approach instead of running your queries on the raw data, I would recommend also recording the "tags observed per given edge" in the Edges table.
That way, you can
SELECT s.edge_id, s.tags, s.metric_amount
FROM "EdgeStats" s
JOIN "Edges" e ON s.edge_id = e.id
WHERE s.tags = array_intersect(e.observed_tags, $1)
using the array_intersect function from here.

Extract value from complex array of map type to string

I have a dataframe like below.
No comp_value
1 [[ -> 10]]
2 [[ -> 35]]
The schema of the comp_value column is:
comp_value: array (nullable = true)
element: map(containsNull = true)
key: string
value: long (valueContainsNull = true)
I would like to convert the comp_value from complex type to string using PySpark. Is there a way to achieve this?
Expected output:
No comp_value
1 10
2 35
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(1,' [[ -> 10]]'),
(2, '[[ -> 35]]')],
['No', 'v'])
df.show()
Replace the square brackets, trim the surrounding whitespace, split by whitespace to get a list, and take the element you want by indexing into the list:
new = df.withColumn('comp_value', F.split(F.trim(F.regexp_replace('v', r'\[|\]', '')), r'\s')[1])
new.show()
+---+-----------+----------+
| No| v|comp_value|
+---+-----------+----------+
| 1| [[ -> 10]]| 10|
| 2| [[ -> 35]]| 35|
+---+-----------+----------+
I will assume your data looks like this:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(1, 10),
(2, 35)],
['No', 'v'])
df = df.select('No', F.array(F.create_map(F.lit(''), 'v')).alias('comp_value'))
df.show()
# +---+----------+
# |No |comp_value|
# +---+----------+
# |1 |[{ -> 10}]|
# |2 |[{ -> 35}]|
# +---+----------+
You can extract values inside an array by referencing them by index (in this case [0]), and values inside a map by referencing their key (in this case ['']):
df2 = df.select('No', F.col('comp_value')[0][''].cast('string').alias('comp_value'))
df2.show()
# +---+----------+
# |No |comp_value|
# +---+----------+
# | 1| 10|
# | 2| 35|
# +---+----------+
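The map key in the real data may not always be an empty string. A variant that ignores the key entirely and simply takes the first value of the map is sketched below, assuming Spark 2.4+ for element_at and reusing the df built just above:
# Take the first map in the array, list its values, and pick the first one.
new = df.select(
    'No',
    F.element_at(F.map_values(F.col('comp_value')[0]), 1).cast('string').alias('comp_value')
)
new.show()
The result is the same as df2 above.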

Join on element inside array

I have two DataFrames, and I have to use the values of a column in one DataFrame to filter the second DataFrame.
For example, below are the datasets
import pyspark
from pyspark.sql import Row
cust = spark.createDataFrame([Row(city='hyd',cust_id=100),
Row(city='blr',cust_id=101),
Row(city='chen',cust_id=102),
Row(city='mum',cust_id=103)])
item = spark.createDataFrame([Row(item='fish',geography=['london','a','b','hyd']),
Row(item='chicken',geography=['a','hyd','c']),
Row(item='rice',geography=['a','b','c','blr']),
Row(item='soup',geography=['a','kol','simla']),
Row(item='pav',geography=['a','del']),
Row(item='kachori',geography=['a','guj']),
Row(item='fries',geography=['a','chen']),
Row(item='noodles',geography=['a','mum'])])
cust dataset output:
+----+-------+
|city|cust_id|
+----+-------+
| hyd| 100|
| blr| 101|
|chen| 102|
| mum| 103|
+----+-------+
item dataset output:
+-------+------------------+
| item| geography|
+-------+------------------+
| fish|[london, a, b, hyd]|
|chicken| [a, hyd, c]|
| rice| [a, b, c, blr]|
| soup| [a, kol, simla]|
| pav| [a, del]|
|kachori| [a, guj]|
| fries| [a, chen]|
|noodles| [a, mum]|
+-------+------------------+
I need to use the city column values from cust dataframe to get the items from the item dataset. The final output should be:
+----+---------------+-------+
|city| items|cust_id|
+----+---------------+-------+
| hyd|[fish, chicken]| 100|
| blr| [rice]| 101|
|chen| [fries]| 102|
| mum| [noodles]| 103|
+----+---------------+-------+
Before the join, I would explode the array column. Then a collect_list aggregation can gather all items into one list per city.
from pyspark.sql import functions as F
df = cust.join(item.withColumn('city', F.explode('geography')), 'city', 'left')
df = (df.groupBy('city', 'cust_id')
.agg(F.collect_list('item').alias('items'))
.select('city', 'items', 'cust_id')
)
df.show(truncate=False)
#+----+---------------+-------+
#|city|items |cust_id|
#+----+---------------+-------+
#|blr |[rice] |101 |
#|chen|[fries] |102 |
#|hyd |[fish, chicken]|100 |
#|mum |[noodles] |103 |
#+----+---------------+-------+
from pyspark.sql.functions import col, collect_list, explode

new = (
    # join the two dataframes on city
    item.withColumn('city', explode(col('geography')))
        .join(cust, how='left', on='city')
        # drop null rows and the now-unneeded geography column
        .dropna().drop('geography')
        # group by for the outcome
        .groupby('city', 'cust_id').agg(collect_list('item').alias('items'))
)
new.show()
+----+---------------+-------+
|city| items|cust_id|
+----+---------------+-------+
| blr| [rice]| 101|
|chen| [fries]| 102|
| hyd|[fish, chicken]| 100|
| mum| [noodles]| 103|
+----+---------------+-------+
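As an alternative to exploding the array, the membership test can be pushed into the join condition itself with array_contains. A minimal sketch, assuming the same cust and item DataFrames as above:
from pyspark.sql import functions as F

# Keep an (item, customer) pair whenever the customer's city appears in the
# item's geography array. expr() is used because older PySpark versions only
# accept a literal as the second argument of array_contains.
joined = cust.join(item, F.expr('array_contains(geography, city)'), 'left')

result = (joined.groupBy('city', 'cust_id')
          .agg(F.collect_list('item').alias('items'))
          .select('city', 'items', 'cust_id'))
result.show(truncate=False)
Note that a join on an arbitrary condition like this usually ends up as a broadcast nested-loop join, which is fine for small inputs such as these.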

How to split the elements of list into specific number of columns in Spark Scala?

I have a list containing a varying number of elements per employee:
Emp list
101 [a,b,c,d,e]
102 [q,w,e]
103 [z,x,w,t,e,q,s]
I need the result to be split across 3 columns:
Emp col1 col2 col3
101 a b c
101 d e
102 q w e
103 z x w
103 t e q
103 s
Check this out:
scala> val df = Seq((101,Array("a","b","c","d","e")),(102,Array("q","w","e")),(103,Array("z","x","w","t","e","q","s"))).toDF("emp","list")
df: org.apache.spark.sql.DataFrame = [emp: int, list: array<string>]
scala> df.show(false)
+---+---------------------+
|emp|list |
+---+---------------------+
|101|[a, b, c, d, e] |
|102|[q, w, e] |
|103|[z, x, w, t, e, q, s]|
+---+---------------------+
scala> val udf_slice = udf( (x:Seq[String]) => x.grouped(3).toList )
udf_slice: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(StringType,true))))
scala> df.select(col("*"), explode(udf_slice($"list")).as("newlist")).select($"emp", $"newlist"(0).as("col1"), $"newlist"(1).as("col2"), $"newlist"(2).as("col3") ).show(false)
+---+----+----+----+
|emp|col1|col2|col3|
+---+----+----+----+
|101|a |b |c |
|101|d |e |null|
|102|q |w |e |
|103|z |x |w |
|103|t |e |q |
|103|s |null|null|
+---+----+----+----+
Spark 2.4 - I also tried to implement this without UDFs, but the slice() function does not accept other columns as parameters for the range (it takes only literal start and length values), so a small UDF is still needed for the slicing:
val df = Seq((101,Array("a","b","c","d","e")),(102,Array("q","w","e")),(103,Array("z","x","w","t","e","q","s"))).toDF("emp","list")
df.show(false)
val df2 = df.withColumn("list_size_arr", array_repeat(lit(1), ceil(size('list)/3).cast("int")) )
val df3 = df2.select(col("*"),posexplode('list_size_arr))
val udf_slice = udf( (x:Seq[String],start:Int, end:Int ) => x.slice(start,end) )
df3.withColumn("newlist",udf_slice('list,'pos*3, ('pos+1)*3 )).select($"emp", $"newlist").show(false)
Results:
+---+---------------------+
|emp|list |
+---+---------------------+
|101|[a, b, c, d, e] |
|102|[q, w, e] |
|103|[z, x, w, t, e, q, s]|
+---+---------------------+
+---+---------+
|emp|newlist |
+---+---------+
|101|[a, b, c]|
|101|[d, e] |
|102|[q, w, e]|
|103|[z, x, w]|
|103|[t, e, q]|
|103|[s] |
+---+---------+
To get in separate columns
val df4 = df3.withColumn("newlist",udf_slice('list,'pos*3, ('pos+1)*3 )).select($"emp", $"newlist")
df4.select($"emp", $"newlist"(0).as("col1"), $"newlist"(1).as("col2"), $"newlist"(2).as("col3") ).show(false)
+---+----+----+----+
|emp|col1|col2|col3|
+---+----+----+----+
|101|a |b |c |
|101|d |e |null|
|102|q |w |e |
|103|z |x |w |
|103|t |e |q |
|103|s |null|null|
+---+----+----+----+
Another approach, not using a UDF, is as follows. Note that sliding can be used as well, but this does involve a conversion to an RDD and back again:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
// No use of UDF means conversion to RDD and back again.
val data = List( (102, Array("a", "b", "c")), (103, Array("1", "2", "3", "4", "5", "6", "7", "8")), (104, Array("r")) )
val rdd = sc.parallelize(data)
val df = rdd.toDF("k", "v")
// Make groups of 3 as requested, methods possible.
val rddX = df.as[(Int, List[String])].rdd // This avoids Row and Any issues that typically crop up.
//val rddY = rddX.map(x => (x._1, x._2.grouped(3).toArray))
val rddY = rddX.map(x => (x._1, x._2.sliding(3,3).toArray))
// Get k,v's with v the set of 3 and make single columns.
val df2 = rddY.toDF("k", "v")
val df3 = df2.select($"k", explode($"v").as("v_3"))
val df4 = df3.select($"k", $"v_3"(0).as("v_3_1"), $"v_3"(1).as("v_3_2"), $"v_3"(2).as("v_3_3") )
df4.show(false)
returns:
+---+-----+-----+-----+
|k |v_3_1|v_3_2|v_3_3|
+---+-----+-----+-----+
|102|a |b |c |
|103|1 |2 |3 |
|103|4 |5 |6 |
|103|7 |8 |null |
|104|r |null |null |
+---+-----+-----+-----+
There is probably a better solution, but I came up with this:
import java.util.Arrays
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val employees = Array((101,Array("a","b","c","d","e")),(102,Array("q","w","e")),(103,Array("z","x","w","t","e","q","s")))
def f(emp: Int, num: Array[String]): Row = {
  Row.fromSeq(s"${emp}" +: num)
}
val rowArray = for {
  x <- employees
  z <- x._2.sliding(3, 3)
} yield f(x._1, Arrays.copyOf(z, 3))
import spark.implicits._
val schema = StructType(
List(StructField("emp", StringType, false),
StructField("col1", StringType, true),
StructField("col2", StringType, true),
StructField("col3", StringType, true)))
val sqlContext=new SQLContext(sc)
val dfFromArray = sqlContext.createDataFrame(sc.parallelize(rowArray), schema)
dfFromArray.show
It will return something like this:
+---+----+----+----+
|emp|col1|col2|col3|
+---+----+----+----+
|101| a| b| c|
|101| d| e|null|
|102| q| w| e|
|103| z| x| w|
|103| t| e| q|
|103| s|null|null|
+---+----+----+----+
This is the answer I should have posted if not using a UDF.
Here we use the newer Dataset API, which lets us apply Scala functions directly to the typed fields, much like with an RDD. That's the point here.
Indeed, Datasets give the best of both worlds; point taken that no UDF is used here, unlike in the other answer, but sometimes performance is an issue.
In any event, the results are the same, so only the Dataset approach is shown. Note that the Dataset and DataFrame definitions change along the way; this is indicated in the val names.
case class X(k: Integer, v: List[String])
import org.apache.spark.sql.functions._
val df = Seq( (102, Array("a", "b", "c")),
(103, Array("1", "2", "3", "4", "5", "6", "7", "8")),
(104, Array("r")) ).toDF("k", "v")
val ds = df.as[X]
val df2 = ds.map(x => (x.k, x.v.sliding(3,3).toArray))
.withColumnRenamed ("_1", "k" )
.withColumnRenamed ("_2", "v")
val df3 = df2.select($"k", explode($"v").as("v_3")).select($"k", $"v_3"(0).as("v_3_1"),
$"v_3"(1).as("v_3_2"), $"v_3"(2).as("v_3_3") )
df3.show(false)
results again in:
+---+-----+-----+-----+
|k |v_3_1|v_3_2|v_3_3|
+---+-----+-----+-----+
|102|a |b |c |
|103|1 |2 |3 |
|103|4 |5 |6 |
|103|7 |8 |null |
|104|r |null |null |
+---+-----+-----+-----+

compare String with Array[String] in scala dataframe?

How do I compare a String with an Array[String] in Scala?
For example, to check whether "a" belongs to ["a", "b", "c"].
I have a DataFrame df:
col1 col2
a [a,b,c]
d [a,b,c]
Expected output
col1 col2 status
a [a,b,c] present
d [a,b,c] missing
I wrote the following script in Scala:
val arrayContains = udf( (col1: String, col2: Array[String]) =>
if(col2.contains(col1) ) "present" else "missing" )
I append the new "status" column to my DataFrame as follows:
df.withColumn("status", arrayContains($"col1", $"col2" )).show()
but it throws the following error:
(run-main-0) org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (string, array) => string)
How can I resolve this issue?
Spark passes an array column to a Scala UDF as a Seq (concretely a WrappedArray), not as an Array, so declare the UDF parameter accordingly. Here you go:
import org.apache.spark.sql.functions._
import scala.collection.mutable
+----+---------+
|col1| col2|
+----+---------+
| a|[a, b, c]|
| d|[a, b, c]|
| e|[a, b, c]|
| aa|[a, b, c]|
| c|[a, b, c]|
| f| []|
+----+---------+
def compareStrAgainstArray() = udf((str: String, lst: mutable.WrappedArray[String]) =>
  if (lst.exists(str.matches(_))) "present" else "missing")
df.withColumn("status",compareStrAgainstArray()($"col1",$"col2")).show()
+----+---------+-------+
|col1| col2| status|
+----+---------+-------+
| a|[a, b, c]|present|
| d|[a, b, c]|missing|
| e|[a, b, c]|missing|
| aa|[a, b, c]|missing|
| c|[a, b, c]|present|
| f| []|missing|
+----+---------+-------+
Hope this helps!
