BigQuery standard SQL: how to group by an ARRAY field

My table has two columns, id and a. Column id contains a number; column a contains an array of strings. I want to count the number of unique ids for a given array, with equality between arrays defined as "same size, same string at each index".
When using GROUP BY a, I get the error "Grouping by expressions of type ARRAY is not allowed". I can use something like GROUP BY ARRAY_TO_STRING(a, ","), but then the two arrays ["a,b"] and ["a","b"] are grouped together, and I lose the "real" value of my array (so if I want to use it later in another query, I have to split the string).
The values in this array field come from users, so I can't assume that some character will simply never appear there (and use it as a separator).

Instead of GROUP BY ARRAY_TO_STRING(a, ","), use GROUP BY TO_JSON_STRING(a), so your query will look like below:
#standardsql
SELECT
TO_JSON_STRING(a) arr,
COUNT(DISTINCT id) cnt
FROM `project.dataset.table`
GROUP BY arr
You can test it with dummy data like below
#standardsql
WITH `project.dataset.table` AS (
SELECT 1 id, ["a,b", "c"] a UNION ALL
SELECT 1, ["a","b,c"]
)
SELECT
TO_JSON_STRING(a) arr,
COUNT(DISTINCT id) cnt
FROM `project.dataset.table`
GROUP BY arr
with the result:
Row arr cnt
1 ["a,b","c"] 1
2 ["a","b,c"] 1
Update based on #Ted's comment
#standardsql
SELECT
ANY_VALUE(a) a,
COUNT(DISTINCT id) cnt
FROM `project.dataset.table`
GROUP BY TO_JSON_STRING(a)
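For example, run against the same dummy data as above, this version returns the real ARRAY value instead of its JSON string:
#standardsql
WITH `project.dataset.table` AS (
SELECT 1 id, ["a,b", "c"] a UNION ALL
SELECT 1, ["a","b,c"]
)
SELECT
ANY_VALUE(a) a,
COUNT(DISTINCT id) cnt
FROM `project.dataset.table`
GROUP BY TO_JSON_STRING(a)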

Alternatively, you can use a separator other than a comma:
ARRAY_TO_STRING(a,"|")

Related

How to perform sql aggregation on Snowflake array and output multiple arrays?

I have a Snowflake array as in the rows below as input. I want to check each value in the array and split the output into multiple arrays based on the value's length: values with 5 digits as one column, and values with 6 digits as another column.
ID_COL,ARRAY_COL_VALUE
1,[22,333,666666]
2,[1,55555,999999999]
3,[22,444]
Output table:
ID_COL,FIVE_DIGIT_COL,SIX_DIGIT_COL
1,[],[666666]
2,[55555],[]
3,[],[]
Please let me know if it's possible to iterate through each array value, check its length, and aggregate the results into separate output columns. Doing it in SQL would be great, but UDFs in JavaScript or Python would also be an option.
Using SQL and FLATTEN:
CREATE OR REPLACE TABLE t(ID_COL INT,ARRAY_COL_VALUE VARIANT)
AS
SELECT 1,[22,333,666666] UNION ALL
SELECT 2,[1,55555,999999999] UNION ALL
SELECT 3,[22,444];
Query:
SELECT ID_COL,
ARRAY_AGG(CASE WHEN s.value BETWEEN 10000 AND 99999 THEN s.value END) AS FIVE_DIGIT_COL,
ARRAY_AGG(CASE WHEN s.value BETWEEN 100000 AND 999999 THEN s.value END) AS SIX_DIGIT_COL
FROM t, TABLE(FLATTEN(ARRAY_COL_VALUE)) AS s
GROUP BY ID_COL;
And Python UDF:
create or replace function filter_arr(arr variant, num_digits INT)
returns variant
language python
runtime_version = '3.8'
handler = 'main'
as $$
def main(arr, num_digits):
    return [x for x in arr if len(str(x)) == num_digits]
$$;
SELECT ID_COL,
ARRAY_COL_VALUE,
filter_arr(ARRAY_COL_VALUE, 5),
filter_arr(ARRAY_COL_VALUE, 6)
FROM t;
This should return the five-digit and six-digit values as separate array columns for each ID, matching the expected output above.
If you're dealing strictly with numbers, here's another way:
with cte (id, array_col) as
(select 1,[22,333,666666,666666] union all
select 2,[1,22222,55555,999999999] union all
select 3,[22,444])
select *,
concat(',',array_to_string(array_col,',,'),',') as str_col,
regexp_substr_all(str_col,',([^,]{5}),',1,1,'e') as len_5,
regexp_substr_all(str_col,',([^,]{6}),',1,1,'e') as len_6
from cte;
The basic idea is to turn the array into a string and keep every value surrounded by , so that we can extract the pattern using regexp_substr_all.
If you're dealing with strings, you can modify it to use a delimiter that won't show up in your data.
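For example, here is a minimal sketch of the same trick for strings, using '|' as the delimiter (assuming '|' never appears in the data; the sample values are made up, and the length checks now count characters rather than digits):
with cte (id, array_col) as
(select 1,['abcde','xy','abcdef'])
select *,
concat('|',array_to_string(array_col,'||'),'|') as str_col,
regexp_substr_all(str_col,'\\|([^|]{5})\\|',1,1,'e') as len_5,
regexp_substr_all(str_col,'\\|([^|]{6})\\|',1,1,'e') as len_6
from cte;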

Preserve order while converting string array into int array in hive

I'm trying to convert a string array to an int array while keeping the original order.
Here is a sample of what my data looks like:
id attribut string_array
id1 attribut1, 10283:990000 ["10283","990000"]
id2 attribut2, 10283:36741000 ["10283","36741000"]
id3 attribut3, 10283:37871000 ["10283","37871000"]
id4 attribut4, 3215:90451000 ["3215","90451000"]
and here's how I convert the field "string_array" into an array of integers:
select
id,
attribut,
string_array,
collect_list(cast(array_explode as int)) as int_array
from table
lateral view outer explode(string_array) r as array_explode
group by id, attribut, string_array
it gives me:
id attribut string_array int_array
id1 attribut1,10283:990000 ["10283","990000"] [990000,10283]
id2 attribut2,10283:36741000 ["10283","36741000"] [10283,36741000]
id3 attribut3,10283:37871000 ["10283","37871000"] [37871000,10283]
id4 attribut4,3215:90451000 ["3215","90451000"] [90451000,3215]
As you can see, the order in "string_array" has not been preserved in "int_array", and I need it to be exactly the same as in "string_array".
Does anyone know how to achieve this?
Any help would be much appreciated.
For Hive: use posexplode and, in a subquery before the collect_list, do distribute by the group key and sort by the original position:
select
id,
attribut,
string_array,
collect_list(cast(element as int)) as int_array
from
(select *
from table t
lateral view outer posexplode(string_array) e as pos,element
distribute by t.id, attribut, string_array -- distribute by group key
sort by pos -- sort by initial position
) t
group by id, attribut, string_array
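As a quick illustration, posexplode returns each element together with its original index, and it is that pos that the sort by above relies on:
select id, pos, element
from table
lateral view outer posexplode(string_array) e as pos, element
-- for id1 this yields the pairs (0, '10283') and (1, '990000')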
Another way is to extract the substring from your attribut column and split it, without exploding (as you asked in the comment):
select split(regexp_extract(attribut, '[^,]+,(.*)$',1),':')
Regexp '[^,]+,(.*)$' means:
[^,]+ - not a comma 1+ times
, - comma
(.*)$ - everything else after the comma, captured in group 1, till the end of the string
Demo:
select split(regexp_extract('attribut3,10283:37871000', '[^,]+,(.*)$',1),':')
Result:
["10283","37871000"]

Hive concatenation of string and array<struct> columns

I have a couple of string columns and an array column. My requirement is to convert the array to a string and concatenate it with the other string columns, so I can run the MD5 function over the concatenated string column.
But casting an array to a string is not possible, and I have tried to use the explode and inline functions as well to extract the array contents, but with no luck so far.
Any idea on how to achieve this?
Explode the array and get the struct elements, build the string you need from the struct elements and collect an array of strings, use concat_ws to convert it to a single string, and then concatenate it with the other columns. Like this:
with mydata as (
select ID, my_array
from
( --some array<struct> example
select 1 ID, array(named_struct("city","Hudson","state","NY"),named_struct("city","San Jose","state","CA"),named_struct("city","Albany","state","NY")) as my_array
union all
select 2 ID, array(named_struct("city","San Jose","state","CA"),named_struct("city","San Diego","state","CA")) as my_array
)s
)
select ID, concat(ID,'-', --'-' is a delimiter
concat_ws(',',collect_list(element)) --collect array of strings and concatenate it using ',' delimiter
) as my_string --concatenate with ID column also
from
(
select s.ID, concat_ws(':',a.mystruct.city, a.mystruct.state) as element --concatenate struct fields using : as a delimiter, or concatenate in some other way
from mydata s
lateral view explode(s.my_array) a as mystruct
)s
group by ID
;
Returns:
OK
1 1-Hudson:NY,San Jose:CA,Albany:NY
2 2-San Jose:CA,San Diego:CA
Time taken: 63.368 seconds, Fetched: 2 row(s)
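Since the original requirement is to run MD5 over the concatenated value, you can hash the result directly. A sketch reusing the same mydata CTE and inner query as above (md5() is built into Hive 1.3+):
-- prepend the same "with mydata as ( ... )" CTE from the example above
select ID, md5(concat(ID,'-',
concat_ws(',',collect_list(element))
)) as row_md5
from
(
select s.ID, concat_ws(':',a.mystruct.city, a.mystruct.state) as element
from mydata s
lateral view explode(s.my_array) a as mystruct
)s
group by ID
;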
Using INLINE, you can get the struct elements exploded:
with mydata as (
select ID, my_array
from
( --some array<struct> example
select 1 ID, array(named_struct("city","Hudson","state","NY"),named_struct("city","San Jose","state","CA"),named_struct("city","Albany","state","NY")) as my_array
union all
select 2 ID, array(named_struct("city","San Jose","state","CA"),named_struct("city","San Diego","state","CA")) as my_array
)s
)
select s.ID, a.city, a.state
from mydata s
lateral view inline(s.my_array) a as city, state
;
And then concatenate them however you want into a string again: collect an array, use concat_ws, etc.
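For example, here is a minimal sketch that rebuilds the same string on top of INLINE (again reusing the mydata CTE from above):
-- prepend the same "with mydata as ( ... )" CTE from the example above
select s.ID,
concat(s.ID,'-',concat_ws(',',collect_list(concat_ws(':',a.city,a.state)))) as my_string
from mydata s
lateral view inline(s.my_array) a as city, state
group by s.ID
;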

Exploding a list in Hive SQL to identify blanks

I have a column called part_no_list of type array<string> in a Hive table. Apparently that column has blanks, and I want to update those with a '-'. The code sort of does that, but a group-by still shows 42 rows having blanks. I tried to check individual records but was not successful. Here is the Hive SQL. Is there something wrong in this SQL?
SELECT order_id, exploded_part_nos
FROM sales.order_detail
LATERAL VIEW explode(part_no_list) part_nos AS exploded_part_nos
WHERE sale_type IN ('POS', 'OTC', 'CCC')
AND exploded_part_nos = ''
However, this group-by SQL shows 42 blanks:
SELECT * FROM (
SELECT explo, COUNT(*) AS uni_explo_cnt
FROM sales.order_detail
LATERAL VIEW explode(split(concat_ws("##", part_no_list), '##')) yy AS explo
WHERE sale_type IN ('POS', 'OTC', 'CCC')
GROUP BY explo
ORDER BY explo ASC
) DD
Here is what the Hive table looks like:
Id Part_no_list
1 ["OTC","POS","CCC"]
2 ["OTC","POS"]
4 NULL
5
6 ["-"]
7 ["OTC","POS","CCC"]
Thanks in advance
To test if an exploded array element is empty, use this:
select * from(
select explode( array("OTC","POS","CCC","")) as explo
) s where explo=''
Result is one empty string.
If you want to identify an array containing an empty element, use array_contains:
select * from(
select array("OTC","POS","CCC","") as a
) s where array_contains(a,'')
Result:
["OTC","POS","CCC",""]
If you want to find an array containing only one element, an empty string, use size(array)=1 combined with array_contains(array,'').
But there is also such a thing as an empty array.
It may be displayed the same as an array containing an empty element, but it is not the same.
And to find an empty array, use size(array)=0.
Example:
select * from(
select array() as a
) s where size(a)=0
Returns []
Run all these queries on your data and you will become enlightened. I think it is empty arrays, not empty elements, in your case.
An empty array is not NULL, because it is still an array object of zero size:
select * from(
select array() as a
) s where a is null
Returns no rows
Better to query the array only, without explode, and use array_contains and size to find empty arrays and empty elements. Use LATERAL VIEW OUTER to generate rows even when the LATERAL VIEW would otherwise not generate a row; LATERAL VIEW without the OUTER keyword works like an INNER JOIN (see the docs on LATERAL VIEW OUTER).
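Putting those checks together against the table from the question (a sketch; the table and column names are taken from the queries above):
select order_id,
part_no_list,
case
when part_no_list is null then 'NULL'
when size(part_no_list) = 0 then 'empty array' -- size(NULL) returns -1, so check NULL first
when array_contains(part_no_list, '') then 'contains an empty string element'
else 'ok'
end as diagnosis
from sales.order_detail
where sale_type in ('POS', 'OTC', 'CCC')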

Check if a group contains all the ids in any of the arrays supplied

I pass a 2D array to a procedure. This array contains multiple arrays of ids. I want to:
group a table by group_id
for each group, for each array in the 2d array
IF this group has all the ids within this iteration array, then return it
I read here about issues with 2d arrays:
postgres, contains-operator for multidimensional arrays performs flatten before comparing?
I think I'm nearly there, but I am unsure how to get around the problem. I understand why the following code produces the error "Subquery can only return one column", but I can't work out how to fix it.
DEALLOCATE my_proc;
PREPARE my_proc (bigint[][]) AS
WITH cte_arr AS (select $1 AS arr),
cte_s AS (select generate_subscripts(arr,1) AS subscript,
arr from cte_arr),
grouped AS (SELECT ufs.user_id, array_agg(entity_id)
FROM table_A AS ufs
GROUP BY ufs.user_id)
SELECT *
FROM grouped
WHERE (select arr[subscript:subscript] #> array_agg AS sub,
arr[subscript:subscript]
from cte_s);
EXECUTE my_proc(array[array[1, 2], array[1,3]]);
You can create a row for each group and each array in the parameter with a cross join:
PREPARE stmt (bigint[][]) AS
with grouped as
(
select user_id
, array_agg(entity_id) as user_groups
from table_A
group by
user_id
)
select user_id
, user_groups
, $1[subscript:subscript] as matches
from grouped
cross join
generate_subscripts($1, 1) as gen(subscript)
where user_groups @> $1[subscript:subscript]
;
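And then execute it with the asker's sample parameter:
EXECUTE stmt(array[array[1, 2], array[1, 3]]);
-- one row per (user_id, inner array) pair whose aggregated entity_ids contain all ids of that inner array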
Example at SQL Fiddle
