Exploding a list in Hive SQL to identify blanks

Exploding a list in Hive SQL to identify blanks - arrays

I have a column called part_nos_list as array<\string> in a hive table. Apparently that column has blank and I want to update that with a '-'. The code sort of does that but 42 rows in a group by has shown having blanks. I tried to check individual records but am not successful. Here is the hive sql. Is there something wrong here in this sql
SELECT order_id, exploded_part_nos
FROM sales.order_detail LATERAL VIEW explode(part_no_list) part_nos AS exploded_part_nos where sale_type in ('POS', 'OTC' , 'CCC') and exploded_part_nos = ''
However this group by sql shows 42 as blanks
select * from (SELECT explo,count(*) as uni_explo_cnt
FROM sales.order_detail
LATERAL VIEW explode(split(concat_ws("##", part_no_list),'##')) yy AS explo where sale_type in ('POS', 'OTC' , 'CCC') group by explo order by explo asc) DD
Here is what hive table looks like
Id Part_no_list
1 ["OTC","POS","CCC"]
2 ["OTC","POS"]
4 NULL
5
6 ["-"]
7 ["OTC","POS","CCC"]
Thanks in advance

To test if exploded array element is empty, use this:
select * from(
select explode( array("OTC","POS","CCC","")) as explo
) s where explo=''
Result is one empty string.
If you want to identify array containing empty element use array_contains:
select * from(
select array("OTC","POS","CCC","") as a
) s where array_contains(a,'')
Result:
["OTC","POS","CCC",""]
If you want to find array containing only one element - empty string use size(array)=1 or array_contains(array,'')
But there is also such thing as an empty array.
It displayed the same as array containing empty element but it is not the same.
And to find empty array, use size()=0
Example:
select * from(
select array() as a
) s where size(a)=0
Returns []
Run all these queries on your data and you will become enlightened. I think it is empty arrays, not empty element in your case
Empty array is not NULL, because it is still an array object of zero size:
select * from(
select array() as a
) s where a is null
Returns no rows
Better query array only without explode and use array_contains and size to find empty arrays and empty elements. Use LATERAL VIEW OUTER to generate rows even when a LATERAL VIEW usually would not generate a row. LATERAL VIEW without OUTER word works as INNER JOIN, see docs about LATERAL VIEW OUTER

Related

Preserve order while converting string array into int array in hive

I'm trying to convert a string array to int array by keeping the original order
here is a sample of what my data looks like:
id attribut string_array
id1 attribut1, 10283:990000 ["10283","990000"]
id2 attribut2, 10283:36741000 ["10283","36741000"]
id3 attribut3, 10283:37871000 ["10283","37871000"]
id4 attribut4, 3215:90451000 ["3215","90451000"]
and here's how i convert the field "string_array" into an array of integers
select
id,
attribut,
string_array,
collect_list(cast(array_explode as int)),
from table
lateral view outer explode(string_array) r as array_explode
it gives me:
id attribut string_array int_array
id1 attribut1,10283:990000 ["10283","990000"] [990000,10283]
id2 attribut2,10283:36741000 ["10283","36741000"] [10283,36741000]
id3 attribut3,10283:37871000 ["10283","37871000"] [37871000,10283]
id4 attribut4,3215:90451000 ["3215","90451000"] [90451000,3215]
As you can see, the order in "string array" has not been preserved in "int_array" and I need it to be exactly the same as in "string_array".
anyone know how to achieve this ?
Any help would be much appreciated

For Hive: Use posexplode, in a subquery before collect_list do distribute by id sort by position
select
id,
attribut,
string_array,
collect_list(cast(element as int)),
from
(select *
from table t
lateral view outer posexplode(string_array) e as pos,element
distribute by t.id, attribut, string_array -- distribute by group key
sort by pos -- sort by initial position
) t
group by id, attribut, string_array
Another way is to extract substring from your attributes and split without exploding (as you asked in the comment)
select split(regexp_extract(attribut, '[^,]+,(.*)$',1),':')
Regexp '[^,]+,(.*)$' means:
[^,]+ - not a comma 1+ times
, - comma
(.*)$ - everything else in catpturing group 1 after comma till the end of the string
Demo:
select split(regexp_extract('attribut3,10283:37871000', '[^,]+,(.*)$',1),':')
Result:
["10283","37871000"]

SQL to split a column values into rows in Netezza

I have data in the below way in a column. The data within the column is separated by two spaces.
4EG C6CC C6DE 6MM C6LL L3BC C3
I need to split it into as beloW. I tried using REGEXP_SUBSTR to do it but looks like it's not in the SQL toolkit. Any suggestions?
1. 4EG
2. C6CC
3. C6DE
4. 6MM
5. C6LL
6. L3BC
7. C3

This has ben answered here: http://nz2nz.blogspot.com/2016/09/netezza-transpose-delimited-string-into.html?m=1
Please note the comment at the button about the best performing way of use if array functions. I have measured the use of regexp_extract_all_sp() versus repeated regex matches and the benefit can be quite large

The examples from nz2nz.blogpost.com are hard to follow. I was able to piece together this method:
with
n_rows as (--update on your end
select row_number() over(partition by 1 order by some_field) as seq_num
from any_table_with_more_rows_than_delimited_values
)
, find_values as ( -- fake data
select 'A' as id, '10,20,30' as orig_values
union select 'B', '5,4,3,2,1'
)
select
id,
seq_num,
orig_values,
array_split(orig_values, ',') as array_list,
get_value_varchar(array_list, seq_num) as value
from
find_values
cross join n_rows
where
seq_num <= regexp_match_count(orig_values, ',') + 1 -- one row for each value in list
order by
id,
seq_num

BigQuery standard SQL: how to group by an ARRAY field

My table has two columns, id and a. Column id contains a number, column a contains an array of strings. I want to count the number of unique id for a given array, equality between arrays being defined as "same size, same string for each index".
When using GROUP BY a, I get Grouping by expressions of type ARRAY is not allowed. I can use something like GROUP BY ARRAY_TO_STRING(a, ","), but then the two arrays ["a,b"] and ["a","b"] are grouped together, and I lose the "real" value of my array (so if I want to use it later in another query, I have to split the string).
The values in this field array come from the user, so I can't assume that some character is simply never going to be there (and use it as a separator).

Instead of GROUP BY ARRAY_TO_STRING(a, ",") use GROUP BY TO_JSON_STRING(a)
so your query will look like below
#standardsql
SELECT
TO_JSON_STRING(a) arr,
COUNT(DISTINCT id) cnt
FROM `project.dataset.table`
GROUP BY arr
You can test it with dummy data like below
#standardsql
WITH `project.dataset.table` AS (
SELECT 1 id, ["a,b", "c"] a UNION ALL
SELECT 1, ["a","b,c"]
)
SELECT
TO_JSON_STRING(a) arr,
COUNT(DISTINCT id) cnt
FROM `project.dataset.table`
GROUP BY arr
with result as
Row arr cnt
1 ["a,b","c"] 1
2 ["a","b,c"] 1
Update based on #Ted's comment
#standardsql
SELECT
ANY_VALUE(a) a,
COUNT(DISTINCT id) cnt
FROM `project.dataset.table`
GROUP BY TO_JSON_STRING(a)

Alternatively, you can use another separator than comma
ARRAY_TO_STRING(a,"|")

Sql find Row with longest String and delete the Rest

I am currently working on a table with approx. 7.5mio rows and 16 columns. One of the rows is an internal identifier (let's call it ID) we use at my university. Another column contains a string.
So, ID is NOT the unique index for a row, so it is possible that one identifier appears more than once in the table - the only difference between the two rows being the string.
I need to find all rows with ID and just keep the one with the longest string and deleting every other row from the original table. Unfortunately I am more of a SQL Novice, and I am really stuck at this point. So if anyone could help, this would be really nice.

Take a look at this sample:
SELECT * INTO #sample FROM (VALUES
(1, 'A'),
(1,'Long A'),
(2,'B'),
(2,'Long B'),
(2,'BB')
) T(ID,Txt)
DELETE S FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY LEN(Txt) DESC) RN
FROM #sample) S
WHERE RN!=1
SELECT * FROM #sample
Results:
ID Txt
-- ------
1 Long A
2 Long B

It might be possible just in SQL, but the way I know how to do it would be a two-pass approach using application code - I assume you have an application you are writing.
The first pass would be something like:
SELECT theid, count(*) AS num, MAX(LEN(thestring)) AS keepme FROM thetable WHERE num > 1 GROUP BY theid
Then you'd loop through the results in whatever language you're using and delete anything with that ID except the one matching the string returned. The language I know is PHP, so I'll use it for my example, but the method would be the same in any language (for brevity, I'm skipping error checking, prepared statements, and such, and not testing - please use carefully):
$sql = 'SELECT theid, count(*) AS num, MAX(LEN(thestring)) AS keepme FROM thetable WHERE num > 1 GROUP BY theid';
$result = sqlsrv_query($resource, $sql);
while ($row = sqlsrv_fetch_object($result)) {
$sql = 'DELETE FROM thetable WHERE theid = '.$row->theid.' AND NOT thestring = '.$row->keepme;
$result = sqlsrv_query($resource, $sql);
}
You didn't say what you would want to do if two strings are the same length, so this solution does not deal with that at all - I'm assuming that each ID will only have one longest string.

Check if a group contains all the ids in any of the arrays supplied

I pass a 2d array to a procedure. This array contains multiple arrays of ids. I want to
group a table by group_id
for each group, for each array in the 2d array
IF this group has all the ids within this iteration array, then return it
I read here about issues with 2d arrays:
postgres, contains-operator for multidimensional arrays performs flatten before comparing?
I think I'm nearly there, but I am unsure how to get around the problem. I understand why the following code produces the error "Subquery can only return one column", but I cant work out how to fix it
DEALLOCATE my_proc;
PREPARE my_proc (bigint[][]) AS
WITH cte_arr AS (select $1 AS arr),
cte_s AS (select generate_subscripts(arr,1) AS subscript,
arr from cte_arr),
grouped AS (SELECT ufs.user_id, array_agg(entity_id)
FROM table_A AS ufs
GROUP BY ufs.user_id)
SELECT *
FROM grouped
WHERE (select arr[subscript:subscript] #> array_agg AS sub,
arr[subscript:subscript]
from cte_s);
EXECUTE my_proc(array[array[1, 2], array[1,3]]);

You can create a row for each group and each array in the parameter with a cross join:
PREPARE stmt (bigint[][]) AS
with grouped as
(
select user_id
, array_agg(entity_id) as user_groups
from table_A
group by
user_id
)
select user_id
, user_groups
, $1[subscript:subscript] as matches
from grouped
cross join
generate_subscripts($1, 1) as gen(subscript)
where user_groups #> $1[subscript:subscript]
;
Example at SQL Fiddle

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Exploding a list in Hive SQL to identify blanks - arrays

Related

Preserve order while converting string array into int array in hive

SQL to split a column values into rows in Netezza

BigQuery standard SQL: how to group by an ARRAY field

Sql find Row with longest String and delete the Rest

Check if a group contains all the ids in any of the arrays supplied

Categories

Resources