I'm pretty new to Spark/Scala. I am wondering if there is an easy way to aggregate an Array[Double] column in a column-wise (element-wise) fashion. Here is an example:
c1 c2 c3
-------------------------
1 1 [1.0, 1.0, 3.4]
1 2 [1.0, 0.0, 4.3]
2 1 [0.0, 0.0, 0.0]
2 3 [1.2, 1.1, 1.1]
Then, upon aggregation, I would end up with a table that looks like this:
c1 c3prime
-------------
1 [2.0, 1.0, 7.7]
2 [1.2, 1.1, 1.1]
I'm looking at UDAFs now, but I was wondering whether I need to write custom code at all.
Thanks for your consideration.
Assuming the array values of c3 are all of the same size, you can sum the column element-wise with a UDF like the one below:
import org.apache.spark.sql.functions._

val df = Seq(
  (1, 1, Seq(1.0, 1.0, 3.4)),
  (1, 2, Seq(1.0, 0.0, 4.3)),
  (2, 1, Seq(0.0, 0.0, 0.0)),
  (2, 3, Seq(1.2, 1.1, 1.1))
).toDF("c1", "c2", "c3")

// Element-wise sum of a list of equal-length arrays
def elementSum = udf(
  (a: Seq[Seq[Double]]) => {
    val zeroSeq = Seq.fill[Double](a(0).size)(0.0)
    a.foldLeft(zeroSeq)(
      (acc, x) => (acc zip x).map{ case (u, v) => u + v }
    )
  }
)

val df2 = df.groupBy("c1").agg(
  elementSum(collect_list("c3")).as("c3prime")
)

df2.show(truncate=false)
// +---+-----------------------------+
// |c1 |c3prime                      |
// +---+-----------------------------+
// |1  |[2.0, 1.0, 7.699999999999999]|
// |2  |[1.2, 1.1, 1.1]              |
// +---+-----------------------------+
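If you are on Spark 2.4 or later, the built-in higher-order functions can likely do the same element-wise sum with no custom UDF at all. A minimal sketch, assuming (as above) that all arrays in c3 have the same length:

import org.apache.spark.sql.functions._

// Collect the arrays per group, then fold them element-wise using the
// aggregate / zip_with higher-order functions (available since Spark 2.4)
val df3 = df
  .groupBy("c1")
  .agg(collect_list("c3").as("c3s"))
  .withColumn("c3prime",
    expr("aggregate(c3s, array_repeat(CAST(0 AS DOUBLE), size(c3s[0])), (acc, x) -> zip_with(acc, x, (u, v) -> u + v))"))
  .select("c1", "c3prime")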
Here's one without a UDF. It utilizes Spark's Window functions. I'm not sure how efficient it is, since it involves multiple groupBys.
df.show
// +---+---+---------------+
// | c1| c2|             c3|
// +---+---+---------------+
// |  1|  1|[1.0, 1.0, 3.4]|
// |  1|  2|[1.0, 0.0, 4.3]|
// |  2|  1|[0.0, 0.0, 0.0]|
// |  2|  2|[1.2, 1.1, 1.1]|
// +---+---+---------------+
import org.apache.spark.sql.expressions.Window

val window = Window.partitionBy($"c1", $"c2").orderBy($"c1", $"c2")

df.withColumn("c3", explode($"c3"))
  .withColumn("rn", row_number() over window)
  .groupBy($"c1", $"rn").agg(sum($"c3").as("c3"))
  .orderBy($"c1", $"rn")
  .groupBy($"c1")
  .agg(collect_list($"c3").as("c3prime")).show
// +---+--------------------+
// | c1|             c3prime|
// +---+--------------------+
// |  1|[2.0, 1.0, 7.6999...|
// |  2|     [1.2, 1.1, 1.1]|
// +---+--------------------+
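A related variant (a sketch, untested): posexplode exposes each element's position directly, which avoids the row_number window while keeping the same double groupBy structure:

// pos comes from posexplode, so no Window/row_number is needed
df.select($"c1", posexplode($"c3"))
  .groupBy($"c1", $"pos").agg(sum($"col").as("c3"))
  .orderBy($"c1", $"pos")
  .groupBy($"c1")
  .agg(collect_list($"c3").as("c3prime"))
  .show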
You can combine built-in functions such as groupBy, agg, sum, array, and alias (as) to get the desired final dataframe.
import org.apache.spark.sql.functions._
df.groupBy("c1")
.agg(sum($"c3"(0)).as("c3_1"), sum($"c3"(1)).as("c3_2"), sum($"c3"(2)).as("c3_3"))
.select($"c1", array("c3_1","c3_2","c3_3").as("c3prime"))
I hope the answer is helpful.
I am working on a Julia project and I am trying to write code for saving Matrix data as a .json file.
The problem I am facing is that when I read the JSON string back from the file and parse it, the Matrix has been turned into a "vector of vectors".
In [1]:
n=10
mat = zeros(n,n)
data = Dict(
"n" => n,
"mat" => mat
)
using JSON
output_text = JSON.json(data)
out = open("jsonTest.json","w")
println(out,output_text)
close(out)
In [2]:
data
Out[2]:
Dict{String, Any} with 2 entries:
"mat" => [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.…
"n" => 10
In [3]:
op = JSON.parsefile("jsonTest.json")
Out[3]:
Dict{String, Any} with 2 entries:
"mat" => Any[Any[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Any[0.0, …
"n" => 10
According to this question, one simple solution is to use the hcat function to convert the "vector of vectors" back into a matrix when reading the JSON.
It works, but it is a bit messy (and might be time-consuming) to take care of each matrix in the JSON this way. Is there a simpler way?
Any information would be appreciated.
Since JSON (the spec) does not know about matrices, you need to either use nested vectors (as you are doing right now) or reshape.
I personally prefer reshape, because its memory layout is the same as a concrete Matrix in Julia, and reshape has no allocation and less overhead than hcat:
julia> a = rand(2,3)
2×3 Matrix{Float64}:
0.534246 0.282277 0.140581
0.841056 0.443697 0.427142
julia> as = JSON3.write(a)
"[0.5342463881378705,0.8410557102859995,0.2822771326129221,0.44369703601566,0.1405805564055571,0.4271417199755423]"
julia> reshape(JSON3.read(as), (2,3))
2×3 reshape(::JSON3.Array{Float64, Base.CodeUnits{UInt8, String}, Vector{UInt64}}, 2, 3) with eltype Float64:
0.534246 0.282277 0.140581
0.841056 0.443697 0.427142
I have different time series ts_values turned into lists, and I want to predict the next items using an ARIMA model, but it seems not to take the zeroes into account:
row['shop_id']: 5 row['item_id']: 5037
[2599.0, 2599.0, 3998.0, 3998.0, 1299.0, 1499.0, 1499.0, 2997.5, 749.5, 0.0, 0.0, 0.0, 0.0]
predicted: 2599.019975890905
-------------------
row['shop_id']: 5 row['item_id']: 5320
predicted: 0
-------------------
row['shop_id']: 5 row['item_id']: 5233
[2697.0, 1198.0, 599.0, 2997.0, 1199.0, 0.0]
predicted: 2697.000099353263
-------------------
row['shop_id']: 5 row['item_id']: 5232
predicted: 0
-------------------
row['shop_id']: 5 row['item_id']: 5268
predicted: 0
-------------------
row['shop_id']: 5 row['item_id']: 5039
[5198.0, 6597.0, 2599.0, 5197.0, 749.5, 1499.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
predicted: 5198.0926378541535
So I wondered what I did wrong there.
Here is my code:
import statsmodels.tsa.arima.model as smt

ranges = range(1, 5)
for difference in ranges:
    # try:
    tmp_model = smt.ARIMA(ts_values, order=(0, 1, 0), trend='t').fit()
    tmp_aic = tmp_model.aic
    if tmp_aic < best_aic:
        best_aic = tmp_aic
        best_difference = difference
        best_model = tmp_model
    # except Exception as e:
    #     print(e)
    #     continue

if best_model is not None:
    y_hat = best_model.forecast()[0]
I know difference is of no use there; it was meant to be used for the ARIMA order. But I've been told that since my lists have a maximum size of 32, I should use a simple forecasting order of (0, 1, 0).
Edit: version 0.12.2 of Statsmodels has been released, so your original code should now work.
Unfortunately, this is a known bug with ARIMA in v0.12.1. Until v0.12.2 is released, one option is to use SARIMAX instead:
from statsmodels.tsa.statespace import sarimax as smt
...
tmp_model = smt.SARIMAX(ts_values, order=(0, 1, 0), trend='t').fit()
Apologies for the clumsy wording; I am struggling with how to describe this problem.
My goal is to write a function that takes in three variables and outputs a 2D array with this pattern:
var foo = function(x, y, z) {
    var array = [
        [x + 8, y + 16, z + 35],
        [x + 6, y + 8,  z + 30],
        [x + 4, y + 4,  z + 20],
        [x + 2, y + 2,  z + 10],
        [x,     y,      z     ],
        [x - 2, y + 2,  z - 10],
        [x - 4, y + 4,  z - 20],
        [x - 6, y + 8,  z - 30],
        [x - 8, y + 16, z - 35]
    ];
    return array;
}
Obviously, this way of writing the function seems pretty inefficient.
One way I tried to solve this is with a loop. But my solution introduces three arrays and is also pretty inelegant.
var x_mod = [8, 6, 4, 2, 0, -2, -4, -6, -8];
var y_mod = [16, 8, 4, 2, 0, 2, 4, 8, 16];
var z_mod = [35, 30, 20, 10, 0, -10, -20, -30, -35];
var array = [];

for (let i = 0; i < 9; i++) {
    array[i] = [x + x_mod[i], y + y_mod[i], z + z_mod[i]];
}
Is there a better way of writing this algorithm? I would also appreciate any clues as to what this kind of problem is called, or what I should study to solve it.
Thank you!
EDIT
This is an example of the kind of optimization I was thinking of.
The following function
var bar = function(x, y, z) {
    var array = [
        [x + 1, y + 2, z + 3],
        [x + 2, y + 4, z + 6],
        [x + 3, y + 6, z + 9]
    ];
    return array;
}
could also be written in the following way:
var bar = function(x, y, z) {
    var array = [];
    for (var i = 1; i < 4; i++)
        array.push([x + i, y + i * 2, z + i * 3]);
    return array;
}
This is the kind of "optimization" that I wanted to apply to my original problem. Again, I apologize that I lack the vocabulary to adequately describe this problem.
Is this what you are looking for (in C# code)?
static class Program
{
    static void Main(string[] args)
    {
        var m_2 = GenerateMatrix(2, 0.0, 0.0, 0.0);
        // result:
        // |  2.0   2.0  10.0 | + span = 2
        // |  0.0   0.0   0.0 | +
        // | -2.0  -2.0 -10.0 |

        var m_3 = GenerateMatrix(3, 0.0, 0.0, 0.0);
        // result:
        // |  4.0   4.0  20.0 | +
        // |  2.0   2.0  10.0 | | span = 3
        // |  0.0   0.0   0.0 | +
        // | -2.0  -2.0 -10.0 |
        // | -4.0  -4.0 -20.0 |

        var m_5 = GenerateMatrix(5, 0.0, 0.0, 0.0);
        // result:
        // |  8.0  16.0  40.0 | +
        // |  6.0   8.0  30.0 | |
        // |  4.0   4.0  20.0 | | span = 5
        // |  2.0   2.0  10.0 | |
        // |  0.0   0.0   0.0 | +
        // | -2.0  -2.0 -10.0 |
        // | -4.0  -4.0 -20.0 |
        // | -6.0  -8.0 -30.0 |
        // | -8.0 -16.0 -40.0 |
    }

    static double[][] GenerateMatrix(int span, double x, double y, double z)
    {
        var result = new double[2 * (span - 1) + 1][];
        result[span - 1] = new double[] { x, y, z };
        for (int i = 0; i < span - 1; i++)
        {
            result[span - 2 - i] = new double[] { x + 2 * (i + 1), y + (2 << i), z + 10 * (i + 1) };
            result[span + i] = new double[] { x - 2 * (i + 1), y - (2 << i), z - 10 * (i + 1) };
        }
        return result;
    }
}
I am using the following rules (with counter = 1..span-1). The rows are set symmetrically from the middle, since the two halves follow the same pattern with only the sign as the difference:
x values are multiples of two: x + 2*counter and x - 2*counter
y values are powers of two: pow(2, counter), computed as 2 << (counter-1) in the code
z values are multiples of ten: z + 10*counter and z - 10*counter
While I think that your first definition is the best, the formulas might be defined as:
diff = (4 - i)
ad = abs(diff)
x + diff * 2
y + (1 << abs(ad)) - trunc((4 - ad) / 4)
//using bit shift to compose power of two if possible
z + 10 * diff - 5 * trunc(diff / 4)
//rounding towards zero!
Python check:
import math

for i in range(0, 9):
    diff = (4 - i)
    ad = abs(diff)
    print(i, diff * 2, (1 << ad) - (4 - ad) // 4, 10 * diff - 5 * math.trunc(diff / 4))
0 8 16 35
1 6 8 30
2 4 4 20
3 2 2 10
4 0 0 0
5 -2 2 -10
6 -4 4 -20
7 -6 8 -30
8 -8 16 -35
You can use a recursive approach for your solution. Here the per-row offsets are kept in small lookup tables so that each level of recursion can push one row from the top half before the recursive call and its mirrored row from the bottom half after it:
var your_array = [];

function myFun(x, y, z, count) {
    // offsets per depth: count = 0 is the outermost pair of rows
    var xOff = [8, 6, 4, 2];
    var yOff = [16, 8, 4, 2];
    var zOff = [35, 30, 20, 10];
    // base case: the middle row
    if (count === 4) {
        your_array.push([x, y, z]);
        return;
    }
    // head recursion: row from the top half
    your_array.push([x + xOff[count], y + yOff[count], z + zOff[count]]);
    myFun(x, y, z, count + 1);
    // tail recursion: mirrored row from the bottom half
    your_array.push([x - xOff[count], y + yOff[count], z - zOff[count]]);
}

// call as: myFun(x, y, z, 0);
Every approach I have tried leaves me with a sum of the entire column. Each row has an array filled with doubles. What I need is a column of sums for each row.
So you start with a dataframe that looks like this:
id c2 c3
-------------------------
1 1 [2.0, 1.0, 0.0]
2 2 [0.0, 0.0, 0.0]
And as a result I want this:
id c2 c3sum
-------------------------
1 1 3.0
2 2 0.0
I tried using the sum method after doing a groupBy on "id". I also tried using a udf:
def mySum(arr:Seq[Double]):Double=arr.reduceLeft(_+_)
val dfsum = df.withColumn("c3sum", mySum($"c3"))
These and other variants of the udf have always returned the sum of everything in the column. As a test I also tried using array.max to just get the maximum number for each array instead of summing them, and it returned the max for the entire column. Therefore it is likely some basic syntax issue I am not understanding.
Thank you in advance for your help.
You might want to consider leveraging Dataset's map with sum instead of relying on a UDF:
import org.apache.spark.sql.functions._

val df = Seq(
  (1, 1, Array(2.0, 1.0, 0.0)),
  (2, 2, Array(0.0, 0.0, 0.0))
).toDF("id", "c2", "c3")

df.
  withColumn("c3", coalesce($"c3", lit(Array[Double]()))).
  as[(Int, Int, Array[Double])].
  map{ case (id, c2, c3) => (id, c2, c3.sum) }.
  toDF("id", "c2", "c3sum").
  show
// +---+---+-----+
// | id| c2|c3sum|
// +---+---+-----+
// |  1|  1|  3.0|
// |  2|  2|  0.0|
// +---+---+-----+
Note that before transforming to a Dataset, coalesce is applied to c3 to replace null (if any) with an empty Array[Double].
One possible solution is using a UDF (as you have tried). For it to work, you need to import and use org.apache.spark.sql.functions.udf to create the UDF. Working example:
import org.apache.spark.sql.functions.udf

val df = Seq(
  (1, 1, Seq(2.0, 1.0, 0.0)),
  (2, 2, Seq(0.0, 0.0, 0.0)),
  (3, 3, Seq(0.0, 1.0, 0.0))
).toDF("id", "c2", "c3")

val mySum = udf((arr: Seq[Double]) => arr.sum)
val dfsum = df.withColumn("c3sum", mySum($"c3"))
Will give:
+---+---+---------------+-----+
| id| c2|             c3|c3sum|
+---+---+---------------+-----+
|  1|  1|[2.0, 1.0, 0.0]|  3.0|
|  2|  2|[0.0, 0.0, 0.0]|  0.0|
|  3|  3|[0.0, 1.0, 0.0]|  1.0|
+---+---+---------------+-----+
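As a side note, if you are on Spark 2.4 or later, the per-row sum can likely be done without a UDF at all via the aggregate higher-order function. A minimal sketch:

import org.apache.spark.sql.functions.expr

// Fold each array with an accumulator starting at 0.0 (Spark 2.4+)
val dfsum2 = df.withColumn("c3sum",
  expr("aggregate(c3, CAST(0 AS DOUBLE), (acc, x) -> acc + x)"))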
I was trying to implement pivoting similar to SQL Server in Spark.
As of now, I'm using sqlContext and applying all the transformations within SQL.
I would like to know if I can do a direct pull from SQL Server and implement the pivot function using Spark.
Below is an example of what I'm trying to achieve. The SQL Server queries are below:
create table #temp(ID Int, MonthPrior int, Amount float);
insert into #temp
values (100,1,10),(100,2,20),(100,3,30),(100,4,10),(100,5,20),(100,6,60),(200,1,10),(200,2,20),(200,3,30),(300,4,10),(300,5,20),(300,6,60);
select * from #temp;
|ID |MonthPrior|Amount|
|-------|----------|------|
|100 |1 |10|
|100 |2 |20|
|100 |3 |30|
|100 |4 |10|
|100 |5 |20|
|100 |6 |60|
|200 |1 |10|
|200 |2 |20|
|200 |3 |30|
|300 |4 |10|
|300 |5 |20|
|300 |6 |60|
Select ID,
       coalesce([1],0) as Amount1Mth,
       coalesce([1],0) + coalesce([2],0) + coalesce([3],0) as Amount1to3Mth,
       coalesce([1],0) + coalesce([2],0) + coalesce([3],0) + coalesce([4],0) + coalesce([5],0) + coalesce([6],0) as Amount_AllMonths
from (select * from #temp) A
pivot
( sum(Amount) for MonthPrior in ([1],[2],[3],[4],[5],[6]) ) as Pvt
|ID |Amount1Mth |Amount1to3Mth |Amount_AllMonths|
|-------|-------|-------|---|
|100 |10 |60 |150|
|200 |10 |60 |60|
|300 |0 |0 |90|
If your Amount column is of Decimal type, it would be best to use java.math.BigDecimal as the corresponding argument type. Note that the methods + and sum are no longer applicable and are replaced with add and reduce, respectively.
import org.apache.spark.sql.functions._
import java.math.BigDecimal

val df = Seq(
  (100, 1, new BigDecimal(10)),
  (100, 2, new BigDecimal(20)),
  (100, 3, new BigDecimal(30)),
  (100, 4, new BigDecimal(10)),
  (100, 5, new BigDecimal(20)),
  (100, 6, new BigDecimal(60)),
  (200, 1, new BigDecimal(10)),
  (200, 2, new BigDecimal(20)),
  (200, 3, new BigDecimal(30)),
  (300, 4, new BigDecimal(10)),
  (300, 5, new BigDecimal(20)),
  (300, 6, new BigDecimal(60))
).toDF("ID", "MonthPrior", "Amount")

// UDF to combine 2 array-type columns to map
def arrayToMap = udf(
  (a: Seq[Int], b: Seq[BigDecimal]) => (a zip b).toMap
)

// Create array columns which get zipped into a map
val df2 = df.groupBy("ID").agg(
  collect_list(col("MonthPrior")).as("MonthList"),
  collect_list(col("Amount")).as("AmountList")
).select(
  col("ID"), arrayToMap(col("MonthList"), col("AmountList")).as("MthAmtMap")
)

// UDF to sum map values for keys from 1 thru n (0 for all)
def sumMapValues = udf(
  (m: Map[Int, BigDecimal], n: Int) =>
    if (n > 0)
      m.collect{ case (k, v) => if (k <= n) v else new BigDecimal(0) }.reduce(_ add _)
    else
      m.collect{ case (k, v) => v }.reduce(_ add _)
)

val df3 = df2.
  withColumn( "Amount1Mth", sumMapValues(col("MthAmtMap"), lit(1)) ).
  withColumn( "Amount1to3Mth", sumMapValues(col("MthAmtMap"), lit(3)) ).
  withColumn( "Amount_AllMonths", sumMapValues(col("MthAmtMap"), lit(0)) ).
  select( col("ID"), col("Amount1Mth"), col("Amount1to3Mth"), col("Amount_AllMonths") )

df3.show(truncate=false)
+---+--------------------+--------------------+--------------------+
| ID|          Amount1Mth|       Amount1to3Mth|    Amount_AllMonths|
+---+--------------------+--------------------+--------------------+
|300|               0E-18|               0E-18|90.00000000000000...|
|100|10.00000000000000...|60.00000000000000...|150.0000000000000...|
|200|10.00000000000000...|60.00000000000000...|60.00000000000000...|
+---+--------------------+--------------------+--------------------+
One approach would be to create a map-type column from arrays of MonthPrior and Amount, and apply a UDF that sums the map values based on an integer parameter:
val df = Seq(
  (100, 1, 10),
  (100, 2, 20),
  (100, 3, 30),
  (100, 4, 10),
  (100, 5, 20),
  (100, 6, 60),
  (200, 1, 10),
  (200, 2, 20),
  (200, 3, 30),
  (300, 4, 10),
  (300, 5, 20),
  (300, 6, 60)
).toDF("ID", "MonthPrior", "Amount")

import org.apache.spark.sql.functions._

// UDF to combine 2 array-type columns to map
def arrayToMap = udf(
  (a: Seq[Int], b: Seq[Int]) => (a zip b).toMap
)

// Aggregate columns into arrays and apply arrayToMap UDF to create map column
val df2 = df.groupBy("ID").agg(
  collect_list(col("MonthPrior")).as("MonthList"),
  collect_list(col("Amount")).as("AmountList")
).select(
  col("ID"), arrayToMap(col("MonthList"), col("AmountList")).as("MthAmtMap")
)

// UDF to sum map values for keys from 1 thru n (0 for all)
def sumMapValues = udf(
  (m: Map[Int, Int], n: Int) =>
    if (n > 0) m.collect{ case (k, v) => if (k <= n) v else 0 }.sum
    else m.collect{ case (k, v) => v }.sum
)

// Apply sumMapValues UDF to the map column
val df3 = df2.
  withColumn( "Amount1Mth", sumMapValues(col("MthAmtMap"), lit(1)) ).
  withColumn( "Amount1to3Mth", sumMapValues(col("MthAmtMap"), lit(3)) ).
  withColumn( "Amount_AllMonths", sumMapValues(col("MthAmtMap"), lit(0)) ).
  select( col("ID"), col("Amount1Mth"), col("Amount1to3Mth"), col("Amount_AllMonths") )

df3.show
+---+----------+-------------+----------------+
| ID|Amount1Mth|Amount1to3Mth|Amount_AllMonths|
+---+----------+-------------+----------------+
|300|         0|            0|              90|
|100|        10|           60|             150|
|200|        10|           60|              60|
+---+----------+-------------+----------------+
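As a side note, on Spark 2.4+ the arrayToMap UDF can likely be replaced with the built-in map_from_arrays function. A sketch, untested against your data:

import org.apache.spark.sql.functions._

// Build the map column directly from the two collected arrays (Spark 2.4+)
val df2 = df.groupBy("ID").agg(
  map_from_arrays(
    collect_list(col("MonthPrior")),
    collect_list(col("Amount"))
  ).as("MthAmtMap")
)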
Thanks @LeoC, the above solution worked. I also tried the following:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
lazy val months = (((df select ($"MonthPrior") distinct) sort
($"MonthPrior".asc)).rdd map (_.getAs[Int](0)) collect).toList
lazy val sliceSpec = List((0, 2, "1-2"), (0, 3, "1-3"), (0, 4, "1-4"), (0, 5, "1-5"), (0, 6, "1-6"))
lazy val createGroup: List[Any] => ((Int, Int, String) => Column) = sliceMe => (start, finish, aliasName) =>
sliceMe slice (start, finish) map (value => col(value.toString)) reduce (_ + _) as aliasName
lazy val grouper = createGroup(months).tupled
lazy val groupedCols = sliceSpec map (group => grouper(group))
val pivoted = df groupBy ($"ID") pivot ("MonthPrior") agg (sum($"Amount"))
val writeMe = pivoted select ((pivoted.columns map col) ++ (groupedCols): _*)
z.show(writeMe sort ($"ID".asc))
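For comparison, a more compact variant of the same pivot idea. This is only a sketch, assuming MonthPrior takes no values other than 1 through 6:

import org.apache.spark.sql.functions._

// Pivot on the known month values, fill missing months with 0,
// then build the cumulative amounts the same way the SQL query does
val pvt = df.groupBy("ID")
  .pivot("MonthPrior", Seq(1, 2, 3, 4, 5, 6))
  .agg(sum($"Amount"))
  .na.fill(0)

pvt.select(
  $"ID",
  col("1").as("Amount1Mth"),
  (col("1") + col("2") + col("3")).as("Amount1to3Mth"),
  (col("1") + col("2") + col("3") + col("4") + col("5") + col("6")).as("Amount_AllMonths")
)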