I am trying to implement pivoting in Spark, similar to pivoting in SQL Server.
As of now, I'm using sqlContext and applying all the transformations within SQL.
I would like to know whether I can pull the data directly from SQL Server and implement the pivot function using Spark.
Below is an example of what I'm trying to achieve.
SQL Server queries:
create table #temp(ID Int, MonthPrior int, Amount float);
insert into #temp
values (100,1,10),(100,2,20),(100,3,30),(100,4,10),(100,5,20),(100,6,60),(200,1,10),(200,2,20),(200,3,30),(300,4,10),(300,5,20),(300,6,60);
select * from #temp;
|ID |MonthPrior|Amount|
|-------|----------|------|
|100 |1 |10|
|100 |2 |20|
|100 |3 |30|
|100 |4 |10|
|100 |5 |20|
|100 |6 |60|
|200 |1 |10|
|200 |2 |20|
|200 |3 |30|
|300 |4 |10|
|300 |5 |20|
|300 |6 |60|
Select ID,
       coalesce([1], 0) as Amount1Mth,
       coalesce([1], 0) + coalesce([2], 0) + coalesce([3], 0) as Amount1to3Mth,
       coalesce([1], 0) + coalesce([2], 0) + coalesce([3], 0) + coalesce([4], 0) + coalesce([5], 0) + coalesce([6], 0) as Amount_AllMonths
from (select * from #temp) A
pivot
( sum(Amount) for MonthPrior in ([1],[2],[3],[4],[5],[6]) ) as Pvt
|ID |Amount1Mth |Amount1to3Mth |Amount_AllMonths|
|-------|-------|-------|---|
|100 |10 |60 |150|
|200 |10 |60 |60|
|300 |0 |0 |90|
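For reference, a similar pivot can be expressed directly in Spark, with the source table pulled from SQL Server over JDBC. The snippet below is only a sketch, assuming Spark 2.x (use sqlContext.read on 1.x) with the Microsoft JDBC driver on the classpath; the connection options, table name and credentials are placeholders, not taken from the original setup:

import org.apache.spark.sql.functions._

// Hypothetical JDBC pull from SQL Server (url, dbtable, user, password are placeholders)
val src = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://myhost:1433;databaseName=mydb")
  .option("dbtable", "dbo.MyTable")
  .option("user", "myuser")
  .option("password", "mypassword")
  .load()  // expected columns: ID, MonthPrior, Amount

// Pivot MonthPrior into columns 1..6, then build the cumulative sums
val pvt = src.groupBy("ID").pivot("MonthPrior", 1 to 6).agg(sum("Amount")).na.fill(0.0)

val result = pvt.select(
  $"ID",
  $"1".as("Amount1Mth"),
  ($"1" + $"2" + $"3").as("Amount1to3Mth"),
  ($"1" + $"2" + $"3" + $"4" + $"5" + $"6").as("Amount_AllMonths")
)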
If your Amount column is of Decimal type, it would be best to use java.math.BigDecimal as the corresponding UDF argument type. Note that the methods + and sum are no longer applicable, hence they are replaced with add and reduce, respectively.
import org.apache.spark.sql.functions._
import java.math.BigDecimal
val df = Seq(
(100, 1, new BigDecimal(10)),
(100, 2, new BigDecimal(20)),
(100, 3, new BigDecimal(30)),
(100, 4, new BigDecimal(10)),
(100, 5, new BigDecimal(20)),
(100, 6, new BigDecimal(60)),
(200, 1, new BigDecimal(10)),
(200, 2, new BigDecimal(20)),
(200, 3, new BigDecimal(30)),
(300, 4, new BigDecimal(10)),
(300, 5, new BigDecimal(20)),
(300, 6, new BigDecimal(60))
).toDF("ID", "MonthPrior", "Amount")
// UDF to combine 2 array-type columns to map
def arrayToMap = udf(
(a: Seq[Int], b: Seq[BigDecimal]) => (a zip b).toMap
)
// Create array columns which get zipped into a map
val df2 = df.groupBy("ID").agg(
collect_list(col("MonthPrior")).as("MonthList"),
collect_list(col("Amount")).as("AmountList")
).select(
col("ID"), arrayToMap(col("MonthList"), col("AmountList")).as("MthAmtMap")
)
// UDF to sum map values for keys from 1 thru n (0 for all)
def sumMapValues = udf(
(m: Map[Int, BigDecimal], n: Int) =>
if (n > 0)
m.collect{ case (k, v) => if (k <= n) v else new BigDecimal(0) }.reduce(_ add _)
else
m.collect{ case (k, v) => v }.reduce(_ add _)
)
val df3 = df2.withColumn( "Amount1Mth", sumMapValues(col("MthAmtMap"), lit(1)) ).
withColumn( "Amount1to3Mth", sumMapValues(col("MthAmtMap"), lit(3)) ).
withColumn( "Amount_AllMonths", sumMapValues(col("MthAmtMap"), lit(0)) ).
select( col("ID"), col("Amount1Mth"), col("Amount1to3Mth"), col("Amount_AllMonths") )
df3.show(truncate=false)
+---+--------------------+--------------------+--------------------+
| ID| Amount1Mth| Amount1to3Mth| Amount_AllMonths|
+---+--------------------+--------------------+--------------------+
|300| 0E-18| 0E-18|90.00000000000000...|
|100|10.00000000000000...|60.00000000000000...|150.0000000000000...|
|200|10.00000000000000...|60.00000000000000...|60.00000000000000...|
+---+--------------------+--------------------+--------------------+
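The long fractional tails in the output (e.g. 0E-18) appear to come from the wide default decimal scale Spark infers for a BigDecimal returned by a UDF. If a narrower display is preferred, the result columns can simply be cast back; a minimal sketch, where the precision and scale (18, 2) are arbitrary choices for illustration:

import org.apache.spark.sql.types.DecimalType

// Sketch only: (18, 2) is an assumed precision/scale; adjust to your data
val df4 = df3.select(
  $"ID",
  $"Amount1Mth".cast(DecimalType(18, 2)).as("Amount1Mth"),
  $"Amount1to3Mth".cast(DecimalType(18, 2)).as("Amount1to3Mth"),
  $"Amount_AllMonths".cast(DecimalType(18, 2)).as("Amount_AllMonths")
)
df4.show(truncate = false)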
One approach would be to create a map-type column from arrays of MonthPrior and Amount, and apply a UDF that sums the map values based on an integer parameter:
val df = Seq(
(100, 1, 10),
(100, 2, 20),
(100, 3, 30),
(100, 4, 10),
(100, 5, 20),
(100, 6, 60),
(200, 1, 10),
(200, 2, 20),
(200, 3, 30),
(300, 4, 10),
(300, 5, 20),
(300, 6, 60)
).toDF("ID", "MonthPrior", "Amount")
import org.apache.spark.sql.functions._
// UDF to combine 2 array-type columns to map
def arrayToMap = udf(
(a: Seq[Int], b: Seq[Int]) => (a zip b).toMap
)
// Aggregate columns into arrays and apply arrayToMap UDF to create map column
val df2 = df.groupBy("ID").agg(
collect_list(col("MonthPrior")).as("MonthList"),
collect_list(col("Amount")).as("AmountList")
).select(
col("ID"), arrayToMap(col("MonthList"), col("AmountList")).as("MthAmtMap")
)
// UDF to sum map values for keys from 1 thru n (0 for all)
def sumMapValues = udf(
(m: Map[Int, Int], n: Int) =>
if (n > 0)
m.collect{ case (k, v) => if (k <= n) v else 0 }.sum
else
m.collect{ case (k, v) => v }.sum
)
// Apply sumMapValues UDF to the map column
val df3 = df2.withColumn( "Amount1Mth", sumMapValues(col("MthAmtMap"), lit(1)) ).
withColumn( "Amount1to3Mth", sumMapValues(col("MthAmtMap"), lit(3)) ).
withColumn( "Amount_AllMonths", sumMapValues(col("MthAmtMap"), lit(0)) ).
select( col("ID"), col("Amount1Mth"), col("Amount1to3Mth"), col("Amount_AllMonths") )
df3.show
+---+----------+-------------+----------------+
| ID|Amount1Mth|Amount1to3Mth|Amount_AllMonths|
+---+----------+-------------+----------------+
|300| 0| 0| 90|
|100| 10| 60| 150|
|200| 10| 60| 60|
+---+----------+-------------+----------------+
Thanks @LeoC, the above solution worked. I also tried the following:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
// Distinct MonthPrior values, sorted ascending
lazy val months = (((df select ($"MonthPrior") distinct) sort
  ($"MonthPrior".asc)).rdd map (_.getAs[Int](0)) collect).toList

// (start, end, alias) slices over the sorted month list
lazy val sliceSpec = List((0, 2, "1-2"), (0, 3, "1-3"), (0, 4, "1-4"), (0, 5, "1-5"), (0, 6, "1-6"))

// Build a column that sums the pivoted month columns within a slice
lazy val createGroup: List[Any] => ((Int, Int, String) => Column) = sliceMe => (start, finish, aliasName) =>
  sliceMe slice (start, finish) map (value => col(value.toString)) reduce (_ + _) as aliasName

lazy val grouper = createGroup(months).tupled
lazy val groupedCols = sliceSpec map (group => grouper(group))

// Pivot on MonthPrior and append the sliced sum columns
val pivoted = df groupBy ($"ID") pivot ("MonthPrior") agg (sum($"Amount"))
val writeMe = pivoted select ((pivoted.columns map col) ++ (groupedCols): _*)
z.show(writeMe sort ($"ID".asc))
Related question:
I want to apply logic to an array field of variable length (0-4000) and split it into its own columns. A UDF with explode, creating new columns, and renaming the columns will do the work, but I am not sure how to apply it iteratively as a UDF. The UDF will take the variable-length array field and return the set of new columns (0-4000) to the dataframe. A sample input dataframe is shown below:
+--------------------+--------------------+
| hashval| dec_spec (array|
+--------------------+--------------------+
|3c65252a67546832d...|[8.02337424829602...|
|f5448c29403c80ea7...|[7.50372884795069...|
|94ff32cd2cfab9919...|[5.85195317398756...|
+--------------------+--------------------+
The output should look like:
+--------------------+--------------------+
| hashval| dec_spec (array| ftr_1 | ftr_2 | ftr_3 |...
+--------------------+--------------------+-----------+---------+--------+
|3c65252a67546832d...|[8.02337424829602...| 8.023 | 3.21 | 4.23.....
|f5448c29403c80ea7...|[7.50372884795069...| 7.502 | 8.23 |2.125
|94ff32cd2cfab9919...|[5.85195317398756...|
+--------------------+--------------------+
The UDF can take some of the logic like the code below:
from pyspark.sql import functions as F

df_grp = df2.withColumn("explode_col", F.explode_outer("dec_spec"))
df_grp = df_grp.groupBy("hashval").pivot("explode_col").agg(F.avg("explode_col"))
Below is the code for renaming the columns:
count = 1
for col in df_grp.columns:
    if col != "hashval":
        df_grp = df_grp.withColumnRenamed(col, "ftr" + str(count))
        count = count + 1
Any help is appreciated.
P.S. For the code above, I have taken help from others on the forum here.
Mock sample data:
from pyspark.sql import functions as sf
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType, StructType, StructField, StringType
sdf1 = sc.parallelize([["aaa", "1,2,3"],["bbb", "1,2,3,4,5"]]).toDF(["hash_val", "arr_str"])
sdf2 = sdf1.withColumn("arr", sf.split("arr_str", ","))
sdf2.show()
+--------+---------+---------------+
|hash_val| arr_str| arr|
+--------+---------+---------------+
| aaa| 1,2,3| [1, 2, 3]|
| bbb|1,2,3,4,5|[1, 2, 3, 4, 5]|
+--------+---------+---------------+
UDF to make all arrays the same length:
schema = ArrayType(StringType())
def fill_list(input_list, input_length):
    fill_len = input_length - len(input_list)
    if fill_len > 0:
        input_list += [None] * fill_len
    return input_list[0:input_length]
fill_list_udf = udf(fill_list, schema)
sdf3 = sdf2.withColumn("arr1", fill_list_udf(sf.col("arr"), sf.lit(3)))
sdf3.show()
+--------+---------+---------------+---------+
|hash_val| arr_str| arr| arr1|
+--------+---------+---------------+---------+
| aaa| 1,2,3| [1, 2, 3]|[1, 2, 3]|
| bbb|1,2,3,4,5|[1, 2, 3, 4, 5]|[1, 2, 3]|
+--------+---------+---------------+---------+
Expand them into columns:
sdf3.select("hash_val", *[sf.col("arr1")[i] for i in range(3)]).show()
+--------+-------+-------+-------+
|hash_val|arr1[0]|arr1[1]|arr1[2]|
+--------+-------+-------+-------+
| aaa| 1| 2| 3|
| bbb| 1| 2| 3|
+--------+-------+-------+-------+
Every approach I have tried leaves me with a sum of the entire column. Each row has an array filled with doubles. What I need is a column of sums for each row.
So you start with a dataframe that looks like this:
id c2 c3
-------------------------
1 1 [2.0, 1.0, 0.0]
2 2 [0.0, 0.0, 0.0]
And as a result I want this:
id c2 c3sum
-------------------------
1 1 3.0
2 2 0.0
I tried using the sum method after doing a groupBy on "id". I also tried using a udf:
def mySum(arr:Seq[Double]):Double=arr.reduceLeft(_+_)
val dfsum = df.withColumn("c3sum", mySum($"c3"))
These and other variants of the udf have always returned the sum of everything in the column. As a test I also tried using array.max to just get the maximum number for each array instead of summing them, and it returned the max for the entire column. Therefore it is likely some basic syntax issue I am not understanding.
Thank you in advance for your help.
You might want to consider leveraging Dataset's map with sum instead of relying on UDF:
import org.apache.spark.sql.functions._
val df = Seq(
(1, 1, Array(2.0, 1.0, 0.0)),
(2, 2, Array(0.0, 0.0, 0.0))
).toDF("id", "c2", "c3")
df.
withColumn("c3", coalesce($"c3", lit(Array[Double]()))).
as[(Int, Int, Array[Double])].
map{ case (id, c2, c3) => (id, c2, c3.sum) }.
toDF("id", "c2", "c3sum").
show
// +---+---+-----+
// | id| c2|c3sum|
// +---+---+-----+
// | 1| 1| 3.0|
// | 2| 2| 0.0|
// +---+---+-----+
Note that before transforming to a Dataset, coalesce is applied to c3 to replace null (if any) with an empty Array[Double].
One possible solution is using a UDF (as you have tried). For it to work, you need to import and use org.apache.spark.sql.functions.udf to create the UDF. Working example:
import org.apache.spark.sql.functions.udf
val df = Seq(
(1, 1, Seq(2.0, 1.0, 0.0)),
(2, 2, Seq(0.0, 0.0, 0.0)),
(3, 3, Seq(0.0, 1.0, 0.0))
).toDF("id", "c2", "c3")
val mySum = udf((arr: Seq[Double]) => arr.sum)
val dfsum = df.withColumn("c3sum", mySum($"c3"))
Will give:
+---+---+---------------+-----+
| id| c2| c3|c3sum|
+---+---+---------------+-----+
| 1| 1|[2.0, 1.0, 0.0]| 3.0|
| 2| 2|[0.0, 0.0, 0.0]| 0.0|
| 3| 3|[0.0, 1.0, 0.0]| 1.0|
+---+---+---------------+-----+
I'm pretty new to Spark/Scala. I am wondering if there is an easy way to aggregate an Array[Double] in a column-wise fashion. Here is an example:
c1 c2 c3
-------------------------
1 1 [1.0, 1.0, 3.4]
1 2 [1.0, 0.0, 4.3]
2 1 [0.0, 0.0, 0.0]
2 3 [1.2, 1.1, 1.1]
Then, upon aggregation, I would end with a table that looks like:
c1 c3prime
-------------
1 [2.0, 1.0, 7.7]
2 [1.2, 1.1, 1.1]
I'm looking at UDAFs now, but was wondering if I need to write custom code at all?
Thanks for your consideration.
Assuming the array values of c3 are of the same size, you can sum the column element-wise by means of a UDF like below:
import org.apache.spark.sql.functions._

val df = Seq(
(1, 1, Seq(1.0, 1.0, 3.4)),
(1, 2, Seq(1.0, 0.0, 4.3)),
(2, 1, Seq(0.0, 0.0, 0.0)),
(2, 3, Seq(1.2, 1.1, 1.1))
).toDF("c1", "c2", "c3")
def elementSum = udf(
(a: Seq[Seq[Double]]) => {
val zeroSeq = Seq.fill[Double](a(0).size)(0.0)
a.foldLeft(zeroSeq)(
(a, x) => (a zip x).map{ case (u, v) => u + v }
)
}
)
val df2 = df.groupBy("c1").agg(
elementSum(collect_list("c3")).as("c3prime")
)
df2.show(truncate=false)
// +---+-----------------------------+
// |c1 |c3prime |
// +---+-----------------------------+
// |1 |[2.0, 1.0, 7.699999999999999]|
// |2 |[1.2, 1.1, 1.1] |
// +---+-----------------------------+
Here's one without a UDF. It utilizes Spark's Window functions. Not sure how efficient it is, since it involves multiple groupBys.
df.show
// +---+---+---------------+
// | c1| c2| c3|
// +---+---+---------------+
// | 1| 1|[1.0, 1.0, 3.4]|
// | 1| 2|[1.0, 0.0, 4.3]|
// | 2| 1|[0.0, 0.0, 0.0]|
// | 2| 2|[1.2, 1.1, 1.1]|
// +---+---+---------------+
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy($"c1", $"c2").orderBy($"c1", $"c2")
df.withColumn("c3", explode($"c3") )
.withColumn("rn", row_number() over window)
.groupBy($"c1", $"rn").agg(sum($"c3").as("c3") )
.orderBy($"c1", $"rn")
.groupBy($"c1")
.agg(collect_list($"c3").as("c3prime") ).show
// +---+--------------------+
// | c1| c3prime|
// +---+--------------------+
// | 1|[2.0, 1.0, 7.6999...|
// | 2| [1.2, 1.1, 1.1]|
// +---+--------------------+
You can combine built-in functions such as groupBy, agg, sum, array, and alias (as) to get the desired final dataframe.
import org.apache.spark.sql.functions._
df.groupBy("c1")
.agg(sum($"c3"(0)).as("c3_1"), sum($"c3"(1)).as("c3_2"), sum($"c3"(2)).as("c3_3"))
.select($"c1", array("c3_1","c3_2","c3_3").as("c3prime"))
I hope the answer is helpful.
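If the arrays are longer but still of a known, fixed length, the same built-in functions can be applied programmatically rather than hard-coding each index. A sketch, where the length n = 3 is an assumption for illustration:

import org.apache.spark.sql.functions._

val n = 3  // assumed, known array length
// One sum expression per array index, aliased c3_0, c3_1, ...
val perIndexSums = (0 until n).map(i => sum($"c3"(i)).as(s"c3_$i"))

df.groupBy("c1")
  .agg(perIndexSums.head, perIndexSums.tail: _*)
  .select($"c1", array((0 until n).map(i => col(s"c3_$i")): _*).as("c3prime"))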
I have a table
id int | data json
With data:
1 | [1,2,3,2]
2 | [2,3,4]
I want to modify the rows to delete the array element (int) 2.
Expected result:
1 | [1,3]
2 | [3,4]
As a_horse_with_no_name suggests in his comment, the proper data type in this case is int[]. However, you can transform the json array to int[], use array_remove(), and transform the result back to json:
with my_table(id, data) as (
values
(1, '[1,2,3,2]'::json),
(2, '[2,3,4]')
)
select id, to_json(array_remove(translate(data::text, '[]', '{}')::int[], 2))
from my_table;
id | to_json
----+---------
1 | [1,3]
2 | [3,4]
(2 rows)
Another possibility is to unnest the arrays with json_array_elements(), eliminate unwanted elements and aggregate the result:
select id, json_agg(elem)
from (
select id, elem
from my_table,
lateral json_array_elements(data) elem
where elem::text::int <> 2
) s
group by 1;
A web application can send an array of arrays to a function, like:
[
[
[1,2],
[3,4]
],
[
[],
[4,5,6]
]
]
The outer array length is > 0. The middle arrays are of constant length (2 in this example), and the inner arrays' lengths are >= 0.
I could build it with string concatenation like this:
with t(a, b) as (
values (1, 4), (2, 3), (1, 4), (7, 3), (7, 4)
)
select distinct a, b
from t
where
(a = any(array[1,2]) or array_length(array[1,2],1) is null)
and
(b = any(array[3,4]) or array_length(array[3,4],1) is null)
or
(a = any(array[]::int[]) or array_length(array[]::int[],1) is null)
and
(b = any(array[4,5,6]) or array_length(array[4,5,6],1) is null)
;
a | b
---+---
7 | 4
1 | 4
2 | 3
But I think I can do better, like this:
with t(a, b) as (
values (1, 4), (2, 3), (1, 4), (7, 3), (7, 4)
), u as (
select unnest(a)::text[] as a
from (values
(
array[
'{"{1,2}", "{3,4}"}',
'{"{}", "{4,5,6}"}'
]::text[]
)
) s(a)
), s as (
select a[1]::int[] as a1, a[2]::int[] as a2
from u
)
select distinct a, b
from
t
inner join
s on
(a = any(a1) or array_length(a1, 1) is null)
and
(b = any(a2) or array_length(a2, 1) is null)
;
a | b
---+---
7 | 4
2 | 3
1 | 4
Notice that a text array was passed and then cast inside the function. That was necessary because PostgreSQL can only deal with arrays of matching dimensions, and the passed inner arrays can vary in dimension. I could "fix" them before passing by adding some special value like zero to make them all the same length as the longest one, but I think it is cleaner to deal with that inside the function.
Am I missing something? Is it the best approach?
I like your second approach.
SELECT DISTINCT t.*
FROM (VALUES (1, 4), (5, 1), (2, 3), (1, 4), (7, 3), (7, 4)) AS t(a, b)
JOIN (
SELECT arr[1]::int[] AS a1
,arr[2]::int[] AS b1
FROM (
SELECT unnest(ARRAY['{"{1,2}", "{3,4}"}'
,'{"{}" , "{4,5,6}"}'
,'{"{5}" , "{}"}' -- added element to 1st dimension
])::text[] AS arr -- 1d text array
) sub
) s ON (a = ANY(a1) OR a1 = '{}')
AND (b = ANY(b1) OR b1 = '{}')
;
Suggesting only minor improvements:

- Subqueries instead of CTEs for slightly better performance.
- Simplified test for empty array: checking against the literal '{}' instead of a function call.
- One less subquery level for unwrapping the array.
Result:
a | b
--+---
2 | 3
7 | 4
1 | 4
5 | 1
For the casual reader: wrapping the multi-dimensional integer array is necessary, since Postgres demands that (quoting the error message):
multidimensional arrays must have array expressions with matching dimensions
An alternate route would be with a 2-dimensional text array and unnest it using generate_subscripts():
WITH a(arr) AS (SELECT '{{"{1,2}", "{3,4}"}
,{"{}", "{4,5,6}"}
,{"{5}", "{}"}}'::text[] -- 2d text array
)
SELECT DISTINCT t.*
FROM (VALUES (1, 4), (5, 1), (2, 3), (1, 4), (7, 3), (7, 4)) AS t(a, b)
JOIN (
SELECT arr[i][1]::int[] AS a1
,arr[i][2]::int[] AS b1
FROM a, generate_subscripts(a.arr, 1) i -- using implicit LATERAL
) s ON (t.a = ANY(s.a1) OR s.a1 = '{}')
AND (t.b = ANY(s.b1) OR s.b1 = '{}');
Might be faster, can you test?
In versions before 9.3 one would use an explicit CROSS JOIN instead of lateral cross joining.