How to extract data from an Array looking JSON in BigQuery

I am trying to extract data from the column metric_data, which looks like an array but is actually JSON.
This is an example:
[{"segmentName":"control","values":[[1588636800000.0,101],[1588723200000.0,546],[1588809600000.0,1195],[1591056000000.0,129]]},{"segmentName":"experiment","values":[[1588636800000.0,91],[1588723200000.0,680],[1588809600000.0,1214],[1588896000000.0,1269],.0,290],[1589760000000.0,248],[1589846400000.0,173],[1589932800000.0,167],[1590019200000.0,178],[1590105600000.0,131],[1590192000000.0,110]]}]
I am specifically trying to sum up the second element of the sub-arrays associated with the key "values" so that I have a row for each segmentName and the sum of its values. I only got as far as transforming it into an array.
SELECT
array(select
x
FROM UNNEST(JSON_EXTRACT_ARRAY(metric_data, '$')) x
) extracted
FROM temp

Based on my understanding, you would like to get the sum for each "segmentName". Two possibilities are to sum everything (both array elements) or to get the sum per element. If my understanding is wrong, please let me know so I can edit or delete my answer.
You can consider the queries below illustrating these two possibilities:
Sum of values
with sample_data as (
select '[{"segmentName":"control","values":[[1588636800000.0,101],[1588723200000.0,546],[1588809600000.0,1195],[1591056000000.0,129]]},{"segmentName":"experiment","values":[[1588636800000.0,91],[1588723200000.0,680],[1588809600000.0,1214],[1588896000000.0,1269],[1588896000000.0,290],[1589760000000.0,248],[1589846400000.0,173],[1589932800000.0,167],[1590019200000.0,178],[1590105600000.0,131],[1590192000000.0,110]]}]' as json_string
)
-- Sum all values
select
json_value(js,'$.segmentName') as segment_name, -- json_value returns the scalar without the surrounding JSON quotes
sum(cast(arr_val as NUMERIC)) as sum_of_values
from sample_data
,unnest(json_query_array(json_string, '$')) js
,unnest(json_query_array(js,'$.values')) val
,unnest(json_query_array(val,'$')) arr_val
group by 1
Sum per element of values
with sample_data as (
select '[{"segmentName":"control","values":[[1588636800000.0,101],[1588723200000.0,546],[1588809600000.0,1195],[1591056000000.0,129]]},{"segmentName":"experiment","values":[[1588636800000.0,91],[1588723200000.0,680],[1588809600000.0,1214],[1588896000000.0,1269],[1588896000000.0,290],[1589760000000.0,248],[1589846400000.0,173],[1589932800000.0,167],[1590019200000.0,178],[1590105600000.0,131],[1590192000000.0,110]]}]' as json_string
)
,cte as (
select
json_value(js,'$.segmentName') as segment_name, -- json_value returns the scalar without the surrounding JSON quotes
split(regexp_extract(val,r'\[(\d+\.?\d+?,\d+)\]'),',') as new_value
from sample_data
,unnest(json_query_array(json_string, '$')) js
,unnest(json_query_array(js,'$.values')) val
)
select
segment_name,
sum(cast(new_value[offset(0)] as numeric)) as elem1,
sum(cast(new_value[offset(1)] as numeric)) as elem2
from cte
group by segment_name
NOTE: Your JSON string is missing some brackets and values, hence I created a dummy value.
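As a cross-check outside BigQuery, here is a minimal Python sketch of the same idea: parse the JSON, walk each segment's "values" pairs, and add up only the second element. It uses the repaired sample from this answer (including the dummy value); the function name sum_second_elements is just an illustrative choice, not part of any library.

```python
import json
from collections import defaultdict

# Repaired sample from the answer above, including its dummy [1588896000000.0,290] pair.
METRIC_DATA = (
    '[{"segmentName":"control","values":[[1588636800000.0,101],[1588723200000.0,546],'
    '[1588809600000.0,1195],[1591056000000.0,129]]},'
    '{"segmentName":"experiment","values":[[1588636800000.0,91],[1588723200000.0,680],'
    '[1588809600000.0,1214],[1588896000000.0,1269],[1588896000000.0,290],'
    '[1589760000000.0,248],[1589846400000.0,173],[1589932800000.0,167],'
    '[1590019200000.0,178],[1590105600000.0,131],[1590192000000.0,110]]}]'
)


def sum_second_elements(metric_data):
    """Return {segmentName: sum of the second element of each [timestamp, value] pair}."""
    totals = defaultdict(int)
    for segment in json.loads(metric_data):
        for _timestamp, value in segment["values"]:
            totals[segment["segmentName"]] += value
    return dict(totals)


if __name__ == "__main__":
    for name, total in sum_second_elements(METRIC_DATA).items():
        print(name, total)
```

The totals per segment should agree with the elem2 column of the "sum per element" query, which makes this a handy sanity check for the regular expression used there.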

Related

ABAP - Group employees by cost center and calculate sum

I have an internal table with employees. Each employee is assigned to a cost center. In another column is the salary. I want to group the employees by cost center and get the total salary per cost center. How can I do it?
At first I have grouped them as follows:
Loop at itab assigning field-symbol(<c>)
group by <c>-kostl ascending.
Write: / <c>-kostl.
This gives me a list of all cost-centers. In the next step I would like to calculate the sum of the salaries per cost center (the sum for all employees with the same cost-center).
How can I do it? Can I use collect?
Update:
I have tried the following code, but I get the error "The syntax for a method specification is "objref->method" or "class=>method"" on the line lv_sum_salary = sum( <p>-salary ).
loop at i_download ASSIGNING FIELD-SYMBOL(<c>)
GROUP BY <c>-kostl ascending.
Write: / <c>-kostl, <c>-salary.
data: lv_sum_salary type p DECIMALS 2.
Loop at group <c> ASSIGNING FIELD-SYMBOL(<p>).
lv_sum_salary = sum( <p>-salary ).
Write: /' ',<p>-pernr, <p>-salary.
endloop.
Write: /' ', lv_sum_salary.
endloop.

I am not sure where you got the sum function from, but there is no such built-in function. If you want to calculate a sum in a group-by loop, then you have to do it yourself.
" make sure the sum is reset to 0 for each group
CLEAR lv_sum_salary.
" Do a loop over the members of this group
LOOP AT GROUP <c> ASSIGNING FIELD-SYMBOL(<p>).
" Add the salary of the current group-member to the sum
lv_sum_salary = lv_sum_salary + <p>-salary.
ENDLOOP.
" Now we have the sum of all members
WRITE |The sum of cost center { <c>-kostl } is { lv_sum_salary }.|.

Generally speaking, to group and sum, there are these 4 possibilities (code snippets provided below):
1. SQL with an internal table as source: SELECT ... SUM( ... ) ... FROM @itab ... GROUP BY ... (since ABAP 7.52, HANA database only); NB: beware the possible performance overhead.
2. The classic way, everything coded:
   - Sort by cost center
   - Loop at the lines
   - At each line, add the salary to the total
   - If the cost center is different in the next line, process the total
3. LOOP AT with GROUP BY, and LOOP AT GROUP
4. VALUE with FOR GROUPS and GROUP BY, and REDUCE and FOR ... IN GROUP for the sum
Note that only the option with the explicit sorting will sort by cost center, the other ones won't provide a result sorted by cost center.
All the below examples have in common these declarative and initialization parts:
TYPES: BEGIN OF ty_itab_line,
kostl TYPE c LENGTH 10,
pernr TYPE c LENGTH 10,
salary TYPE p LENGTH 8 DECIMALS 2,
END OF ty_itab_line,
tt_itab TYPE STANDARD TABLE OF ty_itab_line WITH EMPTY KEY,
BEGIN OF ty_total_salaries_by_kostl,
kostl TYPE c LENGTH 10,
total_salaries TYPE p LENGTH 10 DECIMALS 2,
END OF ty_total_salaries_by_kostl,
tt_total_salaries_by_kostl TYPE STANDARD TABLE OF ty_total_salaries_by_kostl WITH EMPTY KEY.
DATA(itab) = VALUE tt_itab( ( kostl = 'CC1' pernr = 'E1' salary = '4000.00' )
( kostl = 'CC1' pernr = 'E2' salary = '3100.00' )
( kostl = 'CC2' pernr = 'E3' salary = '2500.00' ) ).
DATA(total_salaries_by_kostl) = VALUE tt_total_salaries_by_kostl( ).
and the expected result will be:
ASSERT total_salaries_by_kostl = VALUE tt_total_salaries_by_kostl(
( kostl = 'CC1' total_salaries = '7100.00' )
( kostl = 'CC2' total_salaries = '2500.00' ) ).
Examples for each possibility:
SQL on internal table:
SELECT kostl, SUM( salary ) AS total_salaries
FROM @itab AS itab ##DB_FEATURE_MODE[ITABS_IN_FROM_CLAUSE]
GROUP BY kostl
INTO TABLE @total_salaries_by_kostl.
Classic way:
SORT itab BY kostl.
DATA(total_line) = VALUE ty_total_salaries_by_kostl( ).
LOOP AT itab REFERENCE INTO DATA(line).
DATA(next_kostl) = VALUE #( itab[ sy-tabix + 1 ]-kostl OPTIONAL ).
total_line-total_salaries = total_line-total_salaries + line->salary.
IF next_kostl <> line->kostl.
total_line-kostl = line->kostl.
APPEND total_line TO total_salaries_by_kostl.
CLEAR total_line.
ENDIF.
ENDLOOP.
EDIT: I don't cover AT NEW and AT END OF because I'm not a fan of them: they don't explicitly define the possible multiple fields, but implicitly consider all the fields before the mentioned field plus that field itself. I also ignore ON CHANGE OF, which is obsolete.
LOOP AT with GROUP BY:
LOOP AT itab REFERENCE INTO DATA(line)
GROUP BY ( kostl = line->kostl )
REFERENCE INTO DATA(kostl_group).
DATA(total_line) = VALUE ty_total_salaries_by_kostl(
kostl = kostl_group->kostl ).
LOOP AT GROUP kostl_group REFERENCE INTO line.
total_line-total_salaries = total_line-total_salaries + line->salary.
ENDLOOP.
APPEND total_line TO total_salaries_by_kostl.
ENDLOOP.
VALUE with FOR and GROUP BY, and REDUCE for the sum:
total_salaries_by_kostl = VALUE #(
FOR GROUPS <kostl_group> OF <line> IN itab
GROUP BY ( kostl = <line>-kostl )
( kostl = <kostl_group>-kostl
total_salaries = REDUCE #( INIT sum = 0
FOR <line_2> IN GROUP <kostl_group>
NEXT sum = sum + <line_2>-salary ) ) ).
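For readers who want to compare the control flow outside ABAP, here is a rough Python sketch of two of the approaches: the classic "sort, accumulate, flush when the next key differs" loop, and a group-then-reduce variant. It reuses the sample rows from the declarations above; the function names are illustrative only. One difference to keep in mind is that itertools.groupby only groups adjacent rows, so the sketch sorts first, whereas the ABAP GROUP BY addition does not need sorted input.

```python
from decimal import Decimal
from itertools import groupby
from operator import itemgetter

# Same sample rows as the ABAP examples: (kostl, pernr, salary).
ITAB = [
    ("CC1", "E1", Decimal("4000.00")),
    ("CC1", "E2", Decimal("3100.00")),
    ("CC2", "E3", Decimal("2500.00")),
]


def totals_classic(rows):
    """Sort by cost center, add up salaries, and flush the total when the next key differs."""
    ordered = sorted(rows, key=itemgetter(0))
    totals = []
    running = Decimal("0")
    for index, (kostl, _pernr, salary) in enumerate(ordered):
        running += salary
        # Peek at the next row; None plays the role of "no next line".
        next_kostl = ordered[index + 1][0] if index + 1 < len(ordered) else None
        if next_kostl != kostl:
            totals.append((kostl, running))
            running = Decimal("0")
    return totals


def totals_grouped(rows):
    """Group the rows by cost center, then reduce each group to the sum of its salaries."""
    ordered = sorted(rows, key=itemgetter(0))  # groupby only merges adjacent rows
    return [
        (kostl, sum((salary for _k, _p, salary in group), Decimal("0")))
        for kostl, group in groupby(ordered, key=itemgetter(0))
    ]


if __name__ == "__main__":
    print(totals_classic(ITAB))
    print(totals_grouped(ITAB))
```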

Matching at least 1 of 2 values of an array

I've got an array [1,2,3,4,5] as a field value
and I'd like to calculate the sum of all fields where the array contains either 1 or 2
ex:
array [2,3,4,5,6]
2
array [2,1,3]
1,2
array [5,8,9]
false

Arrays_Overlap() is a super efficient function - way faster than checking things one by one.
-- SET UP PLAY RAYS
WITH CTE AS (
SELECT [1,2,3,4,5] RAY
UNION ALL SELECT [10,20,30,40,50]
UNION ALL SELECT [1,2,3,4,5]
)
SELECT SUM(VALUE) FROM CTE,TABLE(FLATTEN(INPUT=>RAY)) WHERE ARRAYS_OVERLAP([1,2], RAY)
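To make explicit what that query computes, here is a small Python sketch of the same logic, under the assumption that "sum of all fields" means adding up every element of each array that shares at least one value with [1,2]. The function name sum_overlapping is made up for the illustration.

```python
# Same play arrays as the CTE above.
ROWS = [
    [1, 2, 3, 4, 5],
    [10, 20, 30, 40, 50],
    [1, 2, 3, 4, 5],
]


def sum_overlapping(rows, wanted):
    """Sum every element of each row that shares at least one value with `wanted`."""
    wanted = set(wanted)
    return sum(
        value
        for row in rows
        if wanted.intersection(row)  # the overlap test, applied once per row
        for value in row             # the flatten step
    )


if __name__ == "__main__":
    print(sum_overlapping(ROWS, [1, 2]))
```

If the goal is instead to count matching rows or to sum only the matching elements, the overlap filter stays the same and only the aggregated expression changes.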

Weighted Average w/ Array Formula & Query That Pulls From A Separate Sheet

So I've got an array formula which I've included below. I need to adjust this so that it becomes a weighted average based on variables stored on a sheet titled Variables.
Current Formula:
=ARRAYFORMULA(QUERY(
{PROPER(ADP!A3:A),ADP!E3:S;
PROPER(ADP!J3:J),ADP!S3:S;
PROPER(ADP!Z3:Z),ADP!AG3:AG},
"select Col1, Sum(Col2)
where
Col2 is not null and
Col1 is not null
group by Col1
order by Sum(Col2)
label
Col1 'PLAYER',
Sum(Col2) 'ADP AVG'"))
Here's what I thought would work but doesn't:
=ARRAYFORMULA(QUERY(
{PROPER(ADP!A3:A),ADP!E3:E*(Variables!$F$11/Variables!$F$14);
PROPER(ADP!J3:J),ADP!S3:S*(Variables!$F$12/Variables!$F$14);
PROPER(ADP!Z3:Z),ADP!AG3:AG*(Variables!$F$13/Variables!$F$14)},
"select Col1, Sum(Col2)
where
Col2 is not null and
Col1 is not null
group by Col1
order by Sum(Col2)
label
Col1 'PLAYER',
Sum(Col2) 'ADP AVG'"))
What I'm trying to get is the value pulled in K multiplied by the value in Variables!F11, the value pulled in Y multiplied by the value in Variables!F12, and the value pulled in AL multiplied by the value in Variables!F13, with that numerator then divided by the value in Variables!F14.

After our extensive chat, I'm providing here the answer we came up with, just on the chance it might somehow help someone else. But the issue in your case was less about the technicalities of the formula, and more about the structuring of multiple data sources, and the associated logic to pull the data together.
Here is the main formula:
={"Adjusted
Ranking
by " & Variables!F21;
arrayformula(
if(A2:A<>"",
( if(((D2:D>0) * Source1Used),D2:D,Variables!$F$21)*Variables!$F$12
+ if(((F2:F>0) * Source2Used),F2:F,Variables!$F$21)*Variables!$F$13
+ if(((H2:H>0) * Source3Used),H2:H,Variables!$F$21)*Variables!$F$14
+ if(((J2:J>0) * Source4Used),J2:J,Variables!$F$21)*Variables!$F$15
+ if(((L2:L>0) * Source5Used),L2:L,Variables!$F$21)*Variables!$F$16
+ if(((N2:N>0) * Source6Used),N2:N,Variables!$F$21)*Variables!$F$17 )) / Variables!$F$18) }
A2:A is the list of players' names. The D2:D>0 is a test of whether that player has a rating obtained from a particular data source.
Source1Used is a named range for a tickbox cell, where the user can indicate whether that data source is to be included in the calculations.
This formula creates an average value, using from 1 to 6 possible sources, user selectable.
The formula that gave the rating value for one specific source is as follows:
={"Rating in
Source1";ArrayFormula(if(A2:A<>"",if(C2:C,vlookup(A2:A,indirect("ADP!$" & ADP!E3 & "$10:" & ADP!E5),ADP!E6-ADP!E4+1,0),0),""))}
This takes a name in column A, checks if it is listed in a specific source's data, and if so, it pulls back the rating value from the data source. INDIRECT is used since the column locations for each data source may vary, but are obtained from a fixed table, in cells ADP!E3 and E5. E4 and E6 are the numeric values of the column letters.
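The per-player arithmetic of the main formula can be sketched in Python as follows. This is only an illustration of the logic as described above: each source contributes its rating if the rating is positive and the source is ticked, otherwise a default value (the role played by Variables!$F$21), each contribution is multiplied by its weight, and the sum is divided by a divisor (the role played by Variables!$F$18). The function and parameter names are hypothetical.

```python
def adjusted_ranking(ratings, used, weights, default, divisor):
    """Weighted average over several sources for a single player.

    ratings: rating per source (0 when the player is missing from that source)
    used:    tickbox state per source
    weights: weight per source
    default: value substituted when a rating is missing or the source is unticked
    divisor: value the weighted sum is divided by
    """
    total = 0.0
    for rating, is_used, weight in zip(ratings, used, weights):
        value = rating if (rating > 0 and is_used) else default
        total += value * weight
    return total / divisor


if __name__ == "__main__":
    # Made-up numbers: three sources, the second has no rating, the third is unticked.
    print(adjusted_ranking([10, 0, 30], [True, True, False], [2, 1, 1], default=50, divisor=4))
```

Note that, as in the sheet formula, an unticked or missing source still contributes its weight (applied to the default value) rather than being dropped from the average.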

How to parse a table with a JSON array field in PostgreSQL into rows?

I have a table that contains a json array. Here is a sample of the contents of the field from:
SELECT json_array FROM table LIMIT 5;
Result:
[{"key1":"value1"}, {"key1":"value2"}, ..., {"key2":"value3"}]
[]
[]
[]{"key1":"value1"}
[]
How can I retrieve all the values and count how many of each value was found?
I am using PostgreSQL 9.5.14, and I have tried the solutions here Querying a JSON array of objects in Postgres
and the ones suggested to me by another generous Stack Overflow user in my last question: How can I parse JSON arrays in postgresql?
I tried:
SELECT
value -> 'key1'
FROM
table,
json_array_elements(json_array);
which sadly does not work for me due to receiving the error: cannot call json_array_elements on a scalar
This error is raised when the JSON value passed to json_array_elements is a scalar (such as a string, number, or null) rather than an array.
Another solution I tried was:
SELECT json_array as json, (json_array->0),
coalesce(
case
when (json_array->0) IS NULL then null
else (json_array->0->>'key1')
end,
'No value') AS "Value"
FROM table;
which only returned null values for the "Value"
Referencing Querying a JSON array of objects in Postgres I attempted to use this solution as well:
WITH json_test (col) AS (
values (json_arrays)
)
SELECT
y.x->'key1' "key1"
FROM json_test jt,
LATERAL (SELECT json_array_elements(jt.col) x) y;
But I would need to be able to fit all the elements of the json_arrays into json_test
So far I have only attempted to list all the values in the all json arrays, but my ideal end-result for the query resembles this:
Value | Amount
---------------
value1 | 48
value2 | 112
value3 | 93
value4 | 0
Yet again I am grateful for any help with this, thank you in advance.

SELECT
each.value,
COUNT(*)
FROM
data,
json_array_elements(json_array) elems, -- 1
json_each_text(elems) each -- 2
GROUP BY each.value -- 3
1. Expand the array into one row for each array element
2. Split the key/value pairs into two columns
3. Group by the new value column and count
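The same three steps can be sketched in Python to check the expected counts on a small sample. The rows below are hypothetical, modeled on the sample in the question, and count_values is just an illustrative name.

```python
import json
from collections import Counter

# Hypothetical rows modeled on the question's sample: one JSON array per table row.
ROWS = [
    '[{"key1":"value1"}, {"key1":"value2"}, {"key2":"value3"}]',
    '[]',
    '[{"key1":"value1"}]',
]


def count_values(rows):
    """Count how often each value occurs across all JSON arrays."""
    counts = Counter()
    for raw in rows:
        for element in json.loads(raw):          # 1: one item per array element
            for _key, value in element.items():  # 2: split each object into key/value pairs
                counts[value] += 1               # 3: group by value and count
    return counts


if __name__ == "__main__":
    for value, amount in count_values(ROWS).most_common():
        print(value, amount)
```

As with the SQL query, a value that never occurs (such as value4 in the desired output) does not appear in the result; listing it with a count of 0 would need a separate list of expected values to join against.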

Make a new array with items derived from another array

Given a PostgreSQL ARRAY of items of one type, how can I create a new array where each item is derived from the items in the initial array?
Example: I have an array of INTERVAL values. I want a new array where each item is a NUMERIC(10, 1) that is the total number of seconds in the corresponding INTERVAL value.
I know how to convert one INTERVAL value:
foo=> SELECT '00:01:20.000'::INTERVAL AS duration_interval;
duration_interval
-------------------
00:01:20
(1 row)
foo=> SELECT extract(EPOCH FROM date_trunc('second', '00:01:20.000'::INTERVAL))
::NUMERIC(10, 1) AS duration_seconds;
duration_seconds
------------------
80.0
(1 row)
The array does not exist in a table – this is a value returned from another function call – so the conversion code needs to operate on it as an array.
How can I convert an array of INTERVAL values to an array of corresponding NUMERIC values?

You need to unnest() the array, do the conversion and then aggregate back into an array.
Assuming you want to do this on a real table with a primary key:
SELECT pk, array_agg(extract(epoch from dur_int)::numeric(10,1)
ORDER BY ordinality) AS duration_seconds
FROM my_table, unnest(duration_interval) WITH ORDINALITY d(dur_int)
GROUP BY pk;
If you have a single array, such as the result from a function call:
SELECT array_agg(extract(epoch from dur_int)::numeric(10,1)
ORDER BY ordinality) AS duration_seconds
FROM unnest(function(...)) WITH ORDINALITY d(dur_int);
Note that you need the WITH ORDINALITY clause when unnesting the array. This adds a column ordinality to the result, so that every row has two columns: (dur_int interval, ordinality bigint). When putting the array back together with seconds instead of intervals, you order the rows by the ordinality column. That way you ensure that the order in the resulting array of seconds is the same as in the original array of intervals. (In general, SQL row sources have no specific ordering; the server may present rows in any order it prefers.)
If you have access to the function and you are not breaking other uses of it, you might be better off by changing the function such that you can use its result directly.
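For comparison, the unnest / convert / re-aggregate pattern corresponds to a plain element-wise map in a language with ordered lists. A minimal Python sketch, assuming the intervals arrive as timedelta values and that one decimal place with half-up rounding is an acceptable stand-in for NUMERIC(10, 1) (the function name is made up):

```python
from datetime import timedelta
from decimal import ROUND_HALF_UP, Decimal


def intervals_to_seconds(intervals):
    """Map each interval to its total seconds with one decimal place, keeping the order."""
    return [
        Decimal(str(interval.total_seconds())).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
        for interval in intervals
    ]


if __name__ == "__main__":
    durations = [timedelta(minutes=1, seconds=20), timedelta(seconds=30)]
    print(intervals_to_seconds(durations))
```

A list comprehension preserves element order by construction, which is the guarantee that WITH ORDINALITY plus ORDER BY has to provide explicitly in SQL.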

If there is a primary key, then @Patrick's answer is enough. If not, use row_number() to get something to aggregate on:
with i(i) as (values
(array['00:01:20.000','00:00:30.000']::interval[]),
(array['00:02:10.000','00:01:30.000']::interval[])
)
select array_agg(extract(epoch from a)::numeric(10,1))
from (
select i, row_number() over() as r
from i
) s, unnest(i) a (a)
group by r
;
array_agg
--------------
{80.0,30.0}
{130.0,90.0}