BigQuery - How to zip two arrays into one? - arrays

I have two arrays in BigQuery that I know are of equal size. How can I zip them into one array of structs, an array of two-element arrays, or similar?
The following query gives me all possible combinations of x and y, which is not what I want.
WITH test AS (
  SELECT
    ['a', 'b', 'c'] as xs,
    [1, 2, 3] as ys
)
SELECT struct(x, y) as pairs
FROM test, unnest(xs) as x, unnest(ys) as y
I would like to get something like this:
+--------+--------+
| pair.x | pair.y |
+--------+--------+
| a      | 1      |
| b      | 2      |
| c      | 3      |
+--------+--------+

Use WITH OFFSET and the bracket operator:
WITH test AS (
  SELECT
    ['a', 'b', 'c'] as xs,
    [1, 2, 3] as ys
)
SELECT struct(x, ys[OFFSET(off)] as y) as pairs
FROM test, unnest(xs) as x WITH OFFSET off;
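Conceptually, the OFFSET lookup reproduces what zip does over two parallel lists; here is the pairing in plain Python, just to illustrate the intended result (not BigQuery code):
xs = ['a', 'b', 'c']
ys = [1, 2, 3]
pairs = [{'x': x, 'y': y} for x, y in zip(xs, ys)]
print(pairs)  # [{'x': 'a', 'y': 1}, {'x': 'b', 'y': 2}, {'x': 'c', 'y': 3}]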

Related

Apply UDF on an array field (variable length) and split it into columns in PySpark

I want to apply logic to an array field of variable length (0-4000) and split it into its columns. A UDF with explode, creating new columns, and renaming the columns will do the work, but I am not sure how to apply it iteratively as a UDF. The UDF will take the variable-length array field and return the set of new columns (0-4000) to the dataframe. A sample input dataframe is shown below:
+--------------------+--------------------+
| hashval            | dec_spec (array)   |
+--------------------+--------------------+
|3c65252a67546832d...|[8.02337424829602...|
|f5448c29403c80ea7...|[7.50372884795069...|
|94ff32cd2cfab9919...|[5.85195317398756...|
+--------------------+--------------------+
The output should look like:
+--------------------+--------------------+-------+-------+-------+
| hashval            | dec_spec (array)   | ftr_1 | ftr_2 | ftr_3 |...
+--------------------+--------------------+-------+-------+-------+
|3c65252a67546832d...|[8.02337424829602...| 8.023 | 3.21  | 4.23 ...
|f5448c29403c80ea7...|[7.50372884795069...| 7.502 | 8.23  | 2.125
|94ff32cd2cfab9919...|[5.85195317398756...|
+--------------------+--------------------+
The UDF can take some of the logic, like this below:
df_grp = df2.withColumn("explode_col", F.explode_outer("dec_spec"))
df_grp = df_grp.groupBy("hashval").pivot("explode_col").agg(F.avg("explode_col"))
And below, for renaming the columns:
count = 1
for col in df_grp.columns:
    if col != "hashval":
        df_grp = df_grp.withColumnRenamed(col, "ftr" + str(count))
        count = count + 1
Any help is appreciated.
PS: For the code above, I have taken help from others on the forum here.
Mock sample data:
from pyspark.sql import functions as sf
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType, StructType, StructField, StringType
sdf1 = sc.parallelize([["aaa", "1,2,3"],["bbb", "1,2,3,4,5"]]).toDF(["hash_val", "arr_str"])
sdf2 = sdf1.withColumn("arr", sf.split("arr_str", ","))
sdf2.show()
+--------+---------+---------------+
|hash_val| arr_str| arr|
+--------+---------+---------------+
| aaa| 1,2,3| [1, 2, 3]|
| bbb|1,2,3,4,5|[1, 2, 3, 4, 5]|
+--------+---------+---------------+
A UDF to make all arrays the same length:
schema = ArrayType(StringType())

def fill_list(input_list, input_length):
    fill_len = input_length - len(input_list)
    if fill_len > 0:
        input_list += [None] * fill_len
    return input_list[0:input_length]

fill_list_udf = udf(fill_list, schema)
sdf3 = sdf2.withColumn("arr1", fill_list_udf(sf.col("arr"), sf.lit(3)))
sdf3.show()
+--------+---------+---------------+---------+
|hash_val| arr_str| arr| arr1|
+--------+---------+---------------+---------+
| aaa| 1,2,3| [1, 2, 3]|[1, 2, 3]|
| bbb|1,2,3,4,5|[1, 2, 3, 4, 5]|[1, 2, 3]|
+--------+---------+---------------+---------+
Expand them:
sdf3.select("hash_val", *[sf.col("arr1")[i] for i in range(3)]).show()
+--------+-------+-------+-------+
|hash_val|arr1[0]|arr1[1]|arr1[2]|
+--------+-------+-------+-------+
| aaa| 1| 2| 3|
| bbb| 1| 2| 3|
+--------+-------+-------+-------+
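Applying the same fill-and-expand pattern to the question's dec_spec column might look like the sketch below (MAX_LEN is an assumed upper bound taken from the question, and the UDF's return schema is switched to doubles since dec_spec holds numeric values):
from pyspark.sql.types import DoubleType

MAX_LEN = 4000  # assumed maximum array length from the question
fill_list_dbl_udf = udf(fill_list, ArrayType(DoubleType()))

df3 = df2.withColumn("arr1", fill_list_dbl_udf(sf.col("dec_spec"), sf.lit(MAX_LEN)))
df4 = df3.select(
    "hashval",
    "dec_spec",
    *[sf.col("arr1")[i].alias("ftr_" + str(i + 1)) for i in range(MAX_LEN)]
)
Selecting 4000 columns works, but such a wide plan can be slow to build, so it may be worth batching the columns if planning time becomes an issue.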

Finding the Difference By Row Between 2 Columns that are Both Arrays in SnowSQL

I have a dataset that is comprised of a date and two other columns that are in array format. I am trying to find all the values in array_1 that are not in array_2.
Date | Array_1 | Array_2
-------------------------
1/20 | [1,2,3] | [1,2]
2/20 | [4,5,6] | [1,2,4]
Desired Output:
Date | Array_1
--------------
1/20 | [3]
2/20 | [5,6]
The idea is:
1. Unnest ("flatten") the values into two tables.
2. Use set functions for the operation you want.
3. Re-aggregate to an array.
I don't have Snowflake on hand, but I think this is how it works:
select t.*, x.array_3
from t left join lateral
    (select array_agg(el) as array_3
     from (select a1.value as el
           from table(flatten(input => t.array_1)) a1
           except
           select a2.value
           from table(flatten(input => t.array_2)) a2
          ) s
    ) x
Just a reminder that if you were to use application code for this, it might be as simple as (using PHP for this example):
$array1 = array(4,5,6);
$array2 = array(5,6,7);
print_r(array_diff($array1, $array2));
Outputs: Array ( [0] => 4 )
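The same idea in Python, preserving order and duplicates from the first array (hypothetical values matching the PHP snippet):
array_1 = [4, 5, 6]
array_2 = [5, 6, 7]
excluded = set(array_2)
print([x for x in array_1 if x not in excluded])  # [4]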

Modify JSON array element

I have a table
id int | data json
With data:
1 | [1,2,3,2]
2 | [2,3,4]
I want to modify the rows to delete the array element (int) 2.
Expected result:
1 | [1,3]
2 | [3,4]
As a_horse_with_no_name suggests in his comment, the proper data type in this case is int[]. However, you can transform the json array to int[], use array_remove(), and transform the result back to json:
with my_table(id, data) as (
  values
    (1, '[1,2,3,2]'::json),
    (2, '[2,3,4]')
)
select id, to_json(array_remove(translate(data::text, '[]', '{}')::int[], 2))
from my_table;
id | to_json
----+---------
1 | [1,3]
2 | [3,4]
(2 rows)
Another possibility is to unnest the arrays with json_array_elements(), eliminate unwanted elements, and aggregate the result:
select id, json_agg(elem)
from (
  select id, elem
  from my_table,
  lateral json_array_elements(data) elem
  where elem::text::int <> 2
) s
group by 1;
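And if the filtering is done in application code instead, the removal is a one-liner in Python (a sketch, assuming the json column arrives as text):
import json

data = '[1,2,3,2]'
print(json.dumps([x for x in json.loads(data) if x != 2]))  # [1, 3]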

Find product of integers at interval of X and update value at position 'i' in an array for N queries

I am given an array of integers of length up to 10^5, and I want to perform the following operations on the array:
1. Update the value of the array at any position i (1 <= i <= n).
2. Get the product of the numbers at indexes 0, X, 2X, 3X, 4X, ... (J * X <= n).
The number of operations will be up to 10^5.
Is there any O(log n) approach to answer queries and update values?
(The original thought was to use a segment tree, but I think it is not needed...)
Let N = 10^5 and A := the original array of size N.
We use 0-based indexing below.
Make a new array B of integers of length up to M = N lg N:
The first integer equals A[0];
The next N integers are those at indexes 1, 2, 3, ..., N of A; I call this group 1;
The next N/2 integers are those at indexes 2, 4, 6, ...; I call this group 2;
The next N/3 integers are those at indexes 3, 6, 9, ...; I call this group 3.
Here is a visualized example of B:
B = [A[0] | A[1], A[2], A[3], A[4] | A[2], A[4] | A[3] | A[4]]
I think the original thought can be used without a segment tree.
(A segment tree is overkill: for operation 2 we always query a specific range of B rather than an arbitrary range, so we do not need that much flexibility, nor the complexity of maintaining the data structure.)
You can create the new array B described above, and also another array C with one entry per group, where C[g] := product of group g.
For operation 1, spend O(# divisors of i) to see which group(s) you need to update, and update the values in both B and C (i.e. C[g] = C[g] / old B[y] * new B[y]).
For operation 2, just output the corresponding C[X]; a sketch follows the worked example below.
I am not sure if I am wrong, but this should be even faster, and it should pass the judge if the original idea was correct but got TLE.
The OP has added a new condition: for operation 2 we need to multiply by A[0] as well, so we can handle it specially. My first thought was:
Declare a new variable z = A[0]; for operation 1, if index 0 is being updated, update this variable; for operation 2, query using the same method above and multiply by z afterwards.
I have since updated the answer so that the first element of B simply represents A[0].
Example
A = {1,4,6,2,8,7}
B = {1 | 4,6,2,8,7 | 6,8 | 2 | 8 | 7 } // O(N lg N)
C = {1 | 2688 | 48 | 2 | 8 | 7 } // O(N lg N)
Factorize all possible indexes X (X is an index, so X <= N) // O(N*sqrt(N))
Operation 1:
Update A[4] to 5: divisors = 1, 2, 4 // number of divisors of the index, ~O(sqrt(N))
which means updating groups 1, 2, 4, i.e. the corresponding elements in B and C.
Locating the corresponding elements in B and C may be a bit tricky,
but that should not increase the complexity.
B = {1 | 4,6,2,5,7 | 6,5 | 2 | 5 | 7 } // O(sqrt(N))
C = {1 | 2688/8*5 | 48/8*5 | 2 | 8/8*5 | 7 } // O(sqrt(N))
Update A[0] to 2:
B = {2 | 4,6,2,5,7 | 6,5 | 2 | 5 | 7 } // O(1)
C = {2 | 2688/8*5 | 48/8*5 | 2 | 8/8*5 | 7 } // O(1)
// Now A is actually {2,4,6,2,5,7}
Operation 2:
X = 3
C[3] * C[0] = 2*2 = 4 // O(1)
X = 2
C[2] * C[0] = 30*2 = 60 // O(1)
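A minimal Python sketch of this scheme (hypothetical class name; it assumes nonzero values so group products can be maintained by exact division, and it keeps A itself instead of materializing B, since A plus the per-group products C carry the same information):
class IntervalProduct:
    def __init__(self, a):
        self.a = list(a)
        n = len(a)
        # c[0] mirrors A[0]; c[g] is the product of group g: A[g]*A[2g]*...
        self.c = [a[0]] + [1] * (n - 1)
        for g in range(1, n):                      # O(N lg N) build (harmonic sum)
            for i in range(g, n, g):
                self.c[g] *= a[i]

    def update(self, i, v):                        # operation 1
        if i == 0:
            self.c[0] = v
        else:
            divisors = set()                       # every group g with g | i holds A[i]
            for d in range(1, int(i ** 0.5) + 1):  # ~O(sqrt(i)) divisor search
                if i % d == 0:
                    divisors.update((d, i // d))
            for g in divisors:                     # C[g] = C[g] / old value * new value
                self.c[g] = self.c[g] // self.a[i] * v
        self.a[i] = v

    def query(self, x):                            # operation 2: A[0]*A[x]*A[2x]*...
        return self.c[0] * self.c[x]

ip = IntervalProduct([1, 4, 6, 2, 8, 7])
ip.update(4, 5)                                    # A[4] -> 5
ip.update(0, 2)                                    # A[0] -> 2
print(ip.query(3), ip.query(2))                    # 4 60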

Convert Julia array to dataframe

I have an array x that I'd like to convert to a dataframe. Upon recommendation from the web, I tried converting it to a dataframe and got the following error:
julia> y = convert(DataFrame, x)
ERROR: `convert` has no method matching convert(::Type{DataFrame}, ::Array{Float64,2})
 in convert at base.jl:13
When I try DataFrame(x), the conversion works, but I get a complaint that the conversion is deprecated:
julia> DataFrame(x)
WARNING: DataFrame(::Matrix, ::Vector) is deprecated, use convert(DataFrame, Matrix) instead in DataFrame at /Users/Matthew/.julia/v0.3/DataFrames/src/deprecated.jl:54 (repeats 2 times)
Is there another method I should be aware of to keep my code consistent?
EDIT:
Julia 0.3.2,
DataFrames 0.5.10
OSX 10.9.5
julia> x=rand(4,4)
4x4 Array{Float64,2}:
0.467882 0.466358 0.28144 0.0151388
0.22354 0.358616 0.669564 0.828768
0.475064 0.187992 0.584741 0.0543435
0.0592643 0.345138 0.704496 0.844822
julia> convert(DataFrame, x)
ERROR: `convert` has no method matching convert(::Type{DataFrame}, ::Array{Float64,2})
 in convert at base.jl:13
This works for me:
julia> using DataFrames
julia> x = rand(4, 4)
4x4 Array{Float64,2}:
0.790912 0.0367989 0.425089 0.670121
0.243605 0.62487 0.582498 0.302063
0.785159 0.0083891 0.881153 0.353925
0.618127 0.827093 0.577815 0.488565
julia> convert(DataFrame, x)
4x4 DataFrame
| Row | x1 | x2 | x3 | x4 |
|-----|----------|-----------|----------|----------|
| 1 | 0.790912 | 0.0367989 | 0.425089 | 0.670121 |
| 2 | 0.243605 | 0.62487 | 0.582498 | 0.302063 |
| 3 | 0.785159 | 0.0083891 | 0.881153 | 0.353925 |
| 4 | 0.618127 | 0.827093 | 0.577815 | 0.488565 |
Are you trying something different?
If that doesn't work, try posting a bit more code so we can help you better.
Since this is the first thing that comes up when you Google it: for more recent versions of DataFrames.jl, you can just use the DataFrame() constructor now:
julia> x = rand(4,4)
4×4 Matrix{Float64}:
0.920406 0.738911 0.994401 0.9954
0.18791 0.845132 0.277577 0.231483
0.361269 0.918367 0.793115 0.988914
0.725052 0.962762 0.413111 0.328261
julia> DataFrame(x, :auto)
4×4 DataFrame
Row │ x1 x2 x3 x4
│ Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────
1 │ 0.920406 0.738911 0.994401 0.9954
2 │ 0.18791 0.845132 0.277577 0.231483
3 │ 0.361269 0.918367 0.793115 0.988914
4 │ 0.725052 0.962762 0.413111 0.328261
I've been confounded by the same issue a number of times, and eventually realized it is often related to the orientation of the array, and is easily resolved by simply transposing the array prior to conversion.
In short, I recommend:
julia> convert(DataFrame, x')
# convert a Matrix{Any} with a header row of col name strings to a DataFrame
# e.g. mat2df(["a" "b" "c"; 1 2 3; 4 5 6])
mat2df(mat) = convert(DataFrame, Dict(mat[1,:],
    [mat[2:end,i] for i in 1:size(mat,2)]))

# convert a Matrix{Any} (mat) and a list of col name strings (headerstrs)
# to a DataFrame, e.g. matnms2df([1 2 3; 4 5 6], ["a","b","c"])
matnms2df(mat, headerstrs) = convert(DataFrame,
    Dict(zip(headerstrs, [mat[:,i] for i in 1:size(mat,2)])))
A little late, but with the update to the DataFrame() function, I created a custom function that would take a matrix (e.g. an XLSX imported dataset) and convert it into a DataFrame using the first row as column headers. Saves me a ton of time and, hopefully, it helps you too.
function MatrixToDataFrame(mat)
    DF_mat = DataFrame(
        mat[2:end, 1:end],
        string.(mat[1, 1:end])
    )
    return DF_mat
end
So I found this online and honestly felt dumb:
using CSV
WhatIWant = DataFrame(WhatIHave)
This was adapted from an R guide, but it works, so heck.
DataFrame([1 2 3 4; 5 6 7 8; 9 10 11 12], :auto)
This works, as per ?DataFrame in the REPL help.
