How do I pass variable arguments to the cube function in Spark SQL, and also to the agg function of the cube?
I have a list of columns, and I want to apply the cube function to those columns along with an aggregation function.
For example:
val columnsInsideCube = List("data", "product","country")
val aggColumns = List("revenue")
I want something like this:
dataFrame.cube(columns:String*).agg(aggcolumns:String*)
This is not the same as passing a Scala array to cube.
cube is a predefined method on DataFrame; we have to pass the columns to it in the proper form.
You could use
Spark's built-in pyspark.sql.DataFrame.cube (new in version 1.4):
df.cube("name", df.age).count().orderBy("name", "age").show()
see also How to use "cube" only for specific fields on Spark dataframe?
or HiveSQL
GROUP BY a, b, c WITH CUBE
which is equivalent to
GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ( ))
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup#space-menu-link-content
or you could use other libraries like
import com.activeviam.sparkube._
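Coming back to the lists in the question: in PySpark you can unpack such lists straight into cube and agg with the * operator. This is only a minimal sketch of my own (not part of the answers above), assuming a DataFrame df that actually contains those columns:
import pyspark.sql.functions as F
columns_inside_cube = ["data", "product", "country"]
agg_columns = ["revenue"]
# Unpack the column list into cube() and build one aggregate per agg column.
result = df.cube(*columns_inside_cube).agg(
    *[F.sum(c).alias(f"sum_{c}") for c in agg_columns]
)
result.show()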
I have a Google Sheet and I want to use a QUERY to dynamically pull some data and calculate the sum of some particular columns. I can collect the data fine, but when I want to add the SUM of the columns I get an error. Below is the code I used:
={QUERY(data;"SELECT A, B, C, D, E, F, G ORDER BY E DESC";1);{
"TOTAL";
SUM('Données'!$E$1:$E)
}}
Can you help me? I would like to have the sum of columns D, E, F on the last line.
Thank you.
try:
={QUERY(data; "select A,B,C,D,E,F,G where A is not null order by E desc"; 1);
"Total"\""\SUM(Données!C2:C)\SUM(Données!D2:D)\SUM(Données!E2:E)\""\""}
The array sizes are different, so you cannot group them without "appending" the missing columns:
={QUERY(data,"SELECT A, B, C, D, E, F, G ORDER BY E DESC",1);{
"TOTAL",,
SUM('Données'!$C$1:$C),
SUM('Données'!$D$1:$D),
SUM('Données'!$E$1:$E),,
}}
Note: This question can apply to any programming language, for example Python or JavaScript.
How would you shuffle an array of elements deterministically with a seed, but where the following is also guaranteed:
If you add an additional element to the array before shuffling, the relative order of the original elements remains the same as when shuffling the original array.
I can probably explain this better with an example:
Let's say the array [a, b, c] is shuffled with seed 123, and this results in the output [c, a, b].
As you can see, b comes after a, and a comes after c.
We add an additional element to the end of the array, [a, b, c, d], and proceed to shuffle with seed 123.
This time, b must still come after a, and a must still come after c.
The output might be [c, a, d, b] or [d, c, a, b], but cannot be [b, a, c, d].
The same must apply if we continue to add more elements.
Edit: The positions of each element in the shuffled list should be completely random (certain positions should not be biased for a certain element), if mathematically possible.
(I am pretty new to Python.)
You are adding the additional element at the end, but you can also insert it at a random position after the shuffle:
import random
# Shuffle the original three elements with a fixed seed.
x = ['a', 'b', 'c']
random.Random(123).shuffle(x)
print(x)
# Same seeded shuffle, then insert the new element at a position drawn from
# the same seeded generator, so the whole result stays deterministic.
rng = random.Random(123)
x = ['a', 'b', 'c']
rng.shuffle(x)
x.insert(rng.randint(0, len(x)), 'd')
print(x)
But this will become problematic if more elements are added.
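One way around that limitation (a sketch of my own, not part of the answer above): give every element a stable pseudorandom sort key derived from the seed and the element's own value, and sort by those keys. The keys of existing elements never change, so their relative order is preserved no matter how many elements are added; this assumes the element values are unique.
import hashlib
def stable_shuffle(items, seed):
    # Each element gets a deterministic pseudorandom key derived from the
    # seed and its own value; sorting by the keys shuffles the list, and the
    # keys of existing elements never change when new elements are added.
    def key(item):
        return hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
    return sorted(items, key=key)
print(stable_shuffle(['a', 'b', 'c'], 123))
print(stable_shuffle(['a', 'b', 'c', 'd'], 123))  # a, b, c keep their relative order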
I have a DataFrame with a lot of columns. Some of these columns are of type array<string>.
I need to export a sample to CSV, and CSV doesn't support arrays.
Right now I'm doing this for every array column (and sometimes I miss one or more):
df_write = df\
.withColumn('col_a', F.concat_ws(',', 'col_a'))\
.withColumn('col_g', F.concat_ws(',', 'col_g'))\
....
Is there a way to use a loop and do this for every array column without specifying them one by one?
You can check the type of each column and do a list comprehension:
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType
# Collect the names of all array-typed columns from the schema.
arr_col = [
    i.name
    for i in df.schema
    if isinstance(i.dataType, ArrayType)
]
# Join array columns into comma-separated strings and keep their original
# names; leave all other columns untouched.
df_write = df.select([
    F.concat_ws(',', c).alias(c)
    if c in arr_col
    else F.col(c)
    for c in df.columns
])
Actually, you don't need to use concat_ws. You can just cast all columns to string type before writing to CSV, e.g.
df_write = df.select([F.col(c).cast('string') for c in df.columns])
You can also check the types using df.dtypes:
from pyspark.sql import functions as F
array_cols = [c for c, t in df.dtypes if t == "array<string>"]
df.select(*[
F.array_join(c, ",").alias(c) if c in array_cols else F.col(c)
for c in df.columns
])
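Whichever variant you use, the converted DataFrame can then be written out. A minimal sketch, where the sample size, output path and writer options are only placeholders:
# Reuse F and array_cols from above; alias keeps the original column names.
df_write = df.select([
    F.array_join(c, ",").alias(c) if c in array_cols else F.col(c)
    for c in df.columns
])
# Write a sample to CSV (path and options are illustrative only).
df_write.limit(1000).write.csv("/tmp/df_sample", header=True, mode="overwrite")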
I have 3 sets of CT data, each in its own 700x700x512 array, and I want to merge them into a single array.
I had a look at the cat() function but didn't really understand how to set the dim argument, i.e. for two simple 3x3x3 arrays, A and B, can I use AB_merge = cat(dim, A, B);?
Thanks for any help!
The dim argument sets the dimension along which the arrays are concatenated.
So if you want them 'on top of each other', i.e. stacked along the 3rd dimension:
AB_merge = cat(3, A, B);
If you want them joined along the first dimension (rows) instead:
AB_merge = cat(1, A, B);
etc.
These days I'm solving Project Euler problems in Erlang.
Since I come from a C++ background, I sometimes really want to code using two-dimensional arrays.
One of my ideas is to use tuples and lists like this:
List = [{X, 0} || X <- lists:seq(1, 3)].
% returns [{1,0},{2,0},{3,0}]
Is there a nice way to implement multidimensional arrays in Erlang?
See the array module, but for multidimensional access you have to write your own wrapper. If any of your dimensions is short and access is mostly reads, you can use tuples together with erlang:element and erlang:setelement. A wrapper of your own is recommended in any case.
Try a dict with {X, Y, Z} as the key. It looks just like a 3D array ;)
I wrote a small wrapper over the array module for 2D arrays:
-module(array_2d).
-export([new/2, get/3, set/4]).

%% Create a Rows x Cols array of arrays.
new(Rows, Cols) ->
    A = array:new(Rows),
    array:map(fun(_X, _T) -> array:new(Cols) end, A).

%% Read the element at (RowI, ColI).
get(RowI, ColI, A) ->
    Row = array:get(RowI, A),
    array:get(ColI, Row).

%% Write Ele at (RowI, ColI), returning the updated outer array.
set(RowI, ColI, Ele, A) ->
    Row = array:get(RowI, A),
    Row2 = array:set(ColI, Ele, Row),
    array:set(RowI, Row2, A).