Hive concatenation of string and array<struct> columns - arrays

I have a couple of string columns and an array<struct> column. My requirement is to convert the array to a string and concatenate it with the other string columns, so that I can run the MD5 function over the concatenated string.
But casting an array to a string is not possible, and I have tried explode and the inline function to extract the array contents, with no luck so far.
Any idea how to achieve this?

Explode the array and get the struct elements, build the string you need from the struct elements, collect an array of those strings, use concat_ws to convert it to a single string, and then concatenate it with the other columns. Like this:
with mydata as (
  select ID, my_array
  from
  ( --some array<struct> example
    select 1 ID, array(named_struct("city","Hudson","state","NY"),named_struct("city","San Jose","state","CA"),named_struct("city","Albany","state","NY")) as my_array
    union all
    select 2 ID, array(named_struct("city","San Jose","state","CA"),named_struct("city","San Diego","state","CA")) as my_array
  )s
)
select ID, concat(ID,'-', --'-' is a delimiter
                  concat_ws(',',collect_list(element)) --collect array of strings and concatenate it using ',' delimiter
           ) as my_string --concatenate with ID column also
from
(
  select s.ID, concat_ws(':',a.mystruct.city, a.mystruct.state) as element --concatenate struct using ':' as a delimiter, or concatenate in some other way
  from mydata s
  lateral view explode(s.my_array) a as mystruct
)s
group by ID
;
Returns:
OK
1 1-Hudson:NY,San Jose:CA,Albany:NY
2 2-San Jose:CA,San Diego:CA
Time taken: 63.368 seconds, Fetched: 2 row(s)
Using INLINE, you can get the struct elements exploded:
with mydata as (
  select ID, my_array
  from
  ( --some array<struct> example
    select 1 ID, array(named_struct("city","Hudson","state","NY"),named_struct("city","San Jose","state","CA"),named_struct("city","Albany","state","NY")) as my_array
    union all
    select 2 ID, array(named_struct("city","San Jose","state","CA"),named_struct("city","San Diego","state","CA")) as my_array
  )s
)
select s.ID, a.city, a.state
from mydata s
lateral view inline(s.my_array) a as city, state
;
And then concatenate them into whatever string you need, collect the array, apply concat_ws, and so on.
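To close the loop with the original MD5 requirement, here is a minimal sketch that feeds the INLINE variant into md5(), assuming Hive 1.3.0+ where the md5() built-in is available (otherwise substitute your own hashing UDF). Note that collect_list order is not guaranteed; if a deterministic hash matters, apply the posexplode / sort by technique discussed in the related question below.
with mydata as (
  select 1 ID, array(named_struct("city","Hudson","state","NY"),named_struct("city","Albany","state","NY")) as my_array
)
select ID,
       md5(concat(ID,'-',concat_ws(',',collect_list(concat_ws(':',city,state))))) as md5_of_concatenated
from
(
  select s.ID, a.city, a.state
  from mydata s
  lateral view inline(s.my_array) a as city, state
)s
group by ID
;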

Related

Union null values to an array of struct in BigQuery

I have the following table in BigQuery, which has an array-of-struct column. I have to perform a union operation with a simple table and want to add null values in place of the nested columns.
Actual nested table example -
Simple table (which needs to be unioned):
acc    date        count
acc_6  11/29/2022  2
acc_8  11/30/2022  3
I tried the following query, but it gives an error of incompatible types on the nested columns:
select * from actual_table
union all
select acc, date, count,
  array_agg(struct(cast(null as string) as device_id, cast(null as date) as to_date, cast(null as string) as from_date)) as d
from simple_table
The resultant table should look like this -
Since d has a type of array of struct<string, string, string>, you need to write a null struct like below.
SELECT * FROM actual_table
UNION ALL
SELECT *, [STRUCT(CAST(null AS STRING), CAST(null AS STRING), CAST(null AS STRING))] FROM simple_table;
[] is an array literal; see Using array literals.
Field names in the null struct are optional because they are already declared in actual_table before the UNION ALL.
You can use STRING(null) instead of CAST(null AS STRING), which is a bit more concise.
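For instance, based on that note, the same query with the STRING(null) shorthand would look like this (a sketch, assuming the same table names):
SELECT * FROM actual_table
UNION ALL
SELECT *, [STRUCT(STRING(null), STRING(null), STRING(null))] FROM simple_table;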
Query results

Preserve order while converting string array into int array in hive

I'm trying to convert a string array to an int array while keeping the original order.
Here is a sample of what my data looks like:
id   attribut                   string_array
id1  attribut1, 10283:990000    ["10283","990000"]
id2  attribut2, 10283:36741000  ["10283","36741000"]
id3  attribut3, 10283:37871000  ["10283","37871000"]
id4  attribut4, 3215:90451000   ["3215","90451000"]
And here is how I convert the field "string_array" into an array of integers:
select
  id,
  attribut,
  string_array,
  collect_list(cast(array_explode as int))
from table
lateral view outer explode(string_array) r as array_explode
group by id, attribut, string_array
It gives me:
id attribut string_array int_array
id1 attribut1,10283:990000 ["10283","990000"] [990000,10283]
id2 attribut2,10283:36741000 ["10283","36741000"] [10283,36741000]
id3 attribut3,10283:37871000 ["10283","37871000"] [37871000,10283]
id4 attribut4,3215:90451000 ["3215","90451000"] [90451000,3215]
As you can see, the order in "string_array" has not been preserved in "int_array", and I need it to be exactly the same as in "string_array".
Does anyone know how to achieve this? Any help would be much appreciated.
For Hive: use posexplode and, in a subquery before collect_list, distribute by the group key and sort by the original position:
select
  id,
  attribut,
  string_array,
  collect_list(cast(element as int))
from
(
  select *
  from table t
  lateral view outer posexplode(string_array) e as pos, element
  distribute by t.id, attribut, string_array -- distribute by group key
  sort by pos                                -- sort by initial position
) t
group by id, attribut, string_array
Another way is to extract the substring from your attribut column and split it, without exploding (as you asked in the comment):
select split(regexp_extract(attribut, '[^,]+,(.*)$',1),':')
Regexp '[^,]+,(.*)$' means:
[^,]+ - not a comma, one or more times
, - a comma
(.*)$ - everything else after the comma till the end of the string, in capturing group 1
Demo:
select split(regexp_extract('attribut3,10283:37871000', '[^,]+,(.*)$',1),':')
Result:
["10283","37871000"]

Exploding a list in Hive SQL to identify blanks

I have a column called part_no_list of type array<string> in a Hive table. Apparently that column has blanks, and I want to replace them with a '-'. The code below sort of does that, but a GROUP BY shows 42 rows as blanks. I tried to check individual records, but was not successful. Here is the Hive SQL. Is there something wrong in this SQL?
SELECT order_id, exploded_part_nos
FROM sales.order_detail
LATERAL VIEW explode(part_no_list) part_nos AS exploded_part_nos
WHERE sale_type in ('POS', 'OTC', 'CCC')
  AND exploded_part_nos = ''
However, this GROUP BY SQL shows 42 as blanks:
select * from (
  SELECT explo, count(*) as uni_explo_cnt
  FROM sales.order_detail
  LATERAL VIEW explode(split(concat_ws("##", part_no_list), '##')) yy AS explo
  WHERE sale_type in ('POS', 'OTC', 'CCC')
  GROUP BY explo
  ORDER BY explo asc
) DD
Here is what the Hive table looks like:
Id Part_no_list
1 ["OTC","POS","CCC"]
2 ["OTC","POS"]
4 NULL
5
6 ["-"]
7 ["OTC","POS","CCC"]
Thanks in advance
To test if an exploded array element is empty, use this:
select * from(
select explode( array("OTC","POS","CCC","")) as explo
) s where explo=''
The result is one empty string.
If you want to identify an array containing an empty element, use array_contains:
select * from(
select array("OTC","POS","CCC","") as a
) s where array_contains(a,'')
Result:
["OTC","POS","CCC",""]
If you want to find an array containing only one element, an empty string, use size(array)=1 together with array_contains(array,'').
But there is also such a thing as an empty array.
It is displayed the same as an array containing an empty element, but it is not the same.
And to find an empty array, use size()=0.
Example:
select * from(
select array() as a
) s where size(a)=0
Returns []
Run all these queries on your data and you will become enlightened. I think it is empty arrays, not empty elements, in your case.
An empty array is not NULL, because it is still an array object of zero size:
select * from(
select array() as a
) s where a is null
Returns no rows
Better to query the array column itself without explode, and use array_contains and size to find empty arrays and empty elements. Use LATERAL VIEW OUTER to generate rows even when a LATERAL VIEW would otherwise not generate a row; LATERAL VIEW without the OUTER keyword works like an INNER JOIN. See the docs about LATERAL VIEW OUTER.
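Putting that advice together, here is a minimal diagnostic sketch (assuming the sales.order_detail table and part_no_list column from the question) that classifies each row without exploding:
select order_id, part_no_list,
       case
         when part_no_list is null            then 'NULL column'
         when size(part_no_list) = 0          then 'empty array'
         when array_contains(part_no_list,'') then 'contains empty element'
         else 'ok'
       end as diagnosis
from sales.order_detail
where sale_type in ('POS', 'OTC', 'CCC');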

BigQuery standard SQL: how to group by an ARRAY field

My table has two columns, id and a. Column id contains a number, and column a contains an array of strings. I want to count the number of unique ids for a given array, equality between arrays being defined as "same size, same string for each index".
When using GROUP BY a, I get Grouping by expressions of type ARRAY is not allowed. I can use something like GROUP BY ARRAY_TO_STRING(a, ","), but then the two arrays ["a,b"] and ["a","b"] are grouped together, and I lose the "real" value of my array (so if I want to use it later in another query, I have to split the string).
The values in this array field come from the user, so I can't assume that some character is simply never going to be there (and use it as a separator).
Instead of GROUP BY ARRAY_TO_STRING(a, ","), use GROUP BY TO_JSON_STRING(a),
so your query will look like below:
#standardsql
SELECT
TO_JSON_STRING(a) arr,
COUNT(DISTINCT id) cnt
FROM `project.dataset.table`
GROUP BY arr
You can test it with dummy data like below
#standardsql
WITH `project.dataset.table` AS (
SELECT 1 id, ["a,b", "c"] a UNION ALL
SELECT 1, ["a","b,c"]
)
SELECT
TO_JSON_STRING(a) arr,
COUNT(DISTINCT id) cnt
FROM `project.dataset.table`
GROUP BY arr
with the result:
Row arr cnt
1 ["a,b","c"] 1
2 ["a","b,c"] 1
Update based on #Ted's comment
#standardsql
SELECT
ANY_VALUE(a) a,
COUNT(DISTINCT id) cnt
FROM `project.dataset.table`
GROUP BY TO_JSON_STRING(a)
Alternatively, you can use a separator other than a comma:
ARRAY_TO_STRING(a,"|")

Check if a group contains all the ids in any of the arrays supplied

I pass a 2D array to a procedure. This array contains multiple arrays of ids. I want to:
group a table by group_id
for each group, for each array in the 2D array
IF the group has all the ids within that array, then return it
I read here about issues with 2d arrays:
postgres, contains-operator for multidimensional arrays performs flatten before comparing?
I think I'm nearly there, but I am unsure how to get around the problem. I understand why the following code produces the error "Subquery can only return one column", but I can't work out how to fix it.
DEALLOCATE my_proc;
PREPARE my_proc (bigint[][]) AS
WITH cte_arr AS (select $1 AS arr),
     cte_s AS (select generate_subscripts(arr,1) AS subscript, arr
               from cte_arr),
     grouped AS (SELECT ufs.user_id, array_agg(entity_id)
                 FROM table_A AS ufs
                 GROUP BY ufs.user_id)
SELECT *
FROM grouped
WHERE (select arr[subscript:subscript] #> array_agg AS sub,
              arr[subscript:subscript]
       from cte_s);
EXECUTE my_proc(array[array[1, 2], array[1,3]]);
You can create a row for each group and each array in the parameter with a cross join:
PREPARE stmt (bigint[][]) AS
with grouped as
(
  select user_id
       , array_agg(entity_id) as user_groups
  from table_A
  group by user_id
)
select user_id
     , user_groups
     , $1[subscript:subscript] as matches
from grouped
cross join generate_subscripts($1, 1) as gen(subscript)
where user_groups #> $1[subscript:subscript]
;
Example at SQL Fiddle
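For completeness, here is a hypothetical usage sketch (the sample table_A rows are invented purely for illustration), executed the same way as the original attempt:
-- assume table_A(user_id bigint, entity_id bigint) contains:
--   (1,1), (1,2), (2,1), (2,3)
EXECUTE stmt(array[array[1, 2], array[1, 3]]);
-- expected: user 1 is matched by sub-array {1,2}, user 2 by sub-array {1,3}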
