Lateral Flatten two columns with different array length in snowflake - snowflake-cloud-data-platform

I am new to Snowflake and currently learning to use LATERAL FLATTEN.
I currently have a dummy table which looks like this:
The data type used for "Customer_Number" & "Cities" is array.
I have managed to understand and apply the Flatten concept to explode the data using the following sql statement:
select c.customer_id, c.last_name, f.value as cust_num, f1.value as city
from customers as c,
lateral flatten(input => c.customer_number) f,
lateral flatten(input => c.cities) f1
where f.index = f1.index
order by customer_id;
The output shown is:
As the dummy table shows, customer_id 104 (row 4) has three numbers. I would like to see all three of them in my output, and where there is no matching index value in cities I would like to see NULL in "City".
My expected output is:
Is this possible?

The trick is to remove the second lateral, and use the index from the first to choose values from the second array:
select c.customer_id, c.last_name, f.value as cust_num, c.cities[f.index] as city
from customers as c,
lateral flatten(input => c.customer_number) f
order by customer_id;
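To see why the index lookup keeps every customer number, here is a minimal Python sketch of the same logic. The sample values and the helper name `pair_by_index` are made up for illustration, since the original table was not preserved in the post.

```python
from typing import Any, List, Optional, Tuple


def pair_by_index(numbers: List[Any], cities: List[Any]) -> List[Tuple[Any, Optional[Any]]]:
    """Mimic flattening `numbers` and looking up cities[index] for each element.

    An index past the end of `cities` yields None, standing in for the NULL
    that an out-of-range array lookup produces in the SQL version.
    """
    return [
        (number, cities[i] if i < len(cities) else None)
        for i, number in enumerate(numbers)
    ]


# Hypothetical row: three customer numbers but only two cities.
rows = pair_by_index([111, 222, 333], ["Paris", "Oslo"])
print(rows)
```

Every element of the first list produces one output row, which is why nothing is lost when the second list is shorter. If the second list is longer, its extra elements are never visited.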

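As long as you can be sure the second list is never longer than the first, you can do the following (this version assumes the columns hold comma-separated strings, hence the split calls):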
select customer_id, last_name, list1_table.value::varchar as customer_number,
split(cities,',')[list1_table.index]::varchar as city
from customers, lateral flatten(input=>split(customer_number, ',')) list1_table;
Otherwise you would have to union the two sets of records (note that a plain UNION eliminates duplicates, while UNION ALL keeps them).
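When either list may be the longer one, the pairing has to run to the length of the longer list. The sketch below shows that behaviour in Python with `itertools.zip_longest`; it is not the UNION approach itself, only an illustration of the result it should produce, and the helper name `pair_full` and sample values are hypothetical.

```python
from itertools import zip_longest
from typing import Any, List, Optional, Tuple


def pair_full(numbers: List[Any], cities: List[Any]) -> List[Tuple[Optional[Any], Optional[Any]]]:
    """Pair elements position by position, padding the shorter list with None."""
    return list(zip_longest(numbers, cities, fillvalue=None))


# Hypothetical rows: one where the numbers are longer, one where the cities are.
more_numbers = pair_full([111, 222, 333], ["Paris"])
more_cities = pair_full([444], ["Rome", "Lima"])
print(more_numbers)
print(more_cities)
```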

You may want to use a LEFT OUTER JOIN for this task, but you need to build a rowset version of the cities first.
select c.customer_id, c.last_name, f.value as cust_num, f1.value as city
from customers as c
cross join lateral flatten(input => c.customer_number) f
left outer join (select * from customers, lateral flatten(input => cities)) f1
on f1.customer_id = c.customer_id and f1.index = f.index
order by c.customer_id, f.index;

Related

Difference between LATERAL FLATTEN(...) and TABLE(FLATTEN(...)) in Snowflake

What is the difference between the use of LATERAL FLATTEN(...) and TABLE(FLATTEN(...)) in Snowflake? I checked the documentation on FLATTEN, LATERAL and TABLE and cannot make heads or tails of a functional difference between the following queries.
select
id as account_id,
account_regions.value::string as region
from
salesforce.accounts,
lateral flatten(split(salesforce.accounts.regions, ', ')) account_regions;
select
id as account_id,
account_regions.value::string as region
from
salesforce.accounts,
table(flatten(split(salesforce.accounts.regions, ', '))) account_regions
In the queries shown there is no difference: the lateral join is implicit, because the table is created dynamically from values that come out of each row.
The real need for the lateral keyword shows up in queries like this:
select *
from departments as d
, lateral (
select *
from employees as e
where e.department_id = d.department_id
) as iv2
order by employee_id;
-- https://docs.snowflake.com/en/sql-reference/constructs/join-lateral.html
Without the lateral keyword for this join, you get an Error: invalid identifier 'D.DEPARTMENT_ID'.
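One way to picture what LATERAL does is as a nested loop in which the right-hand side is re-evaluated for each left-hand row. The Python sketch below models that idea only; the sample rows and the helper name `lateral_join` are assumptions for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Hypothetical sample data.
departments: List[Row] = [
    {"department_id": 1, "department_name": "Engineering"},
    {"department_id": 2, "department_name": "Support"},
]
employees: List[Row] = [
    {"employee_id": 101, "department_id": 1},
    {"employee_id": 102, "department_id": 1},
    {"employee_id": 103, "department_id": 2},
]


def lateral_join(left_rows: List[Row], right_for_row: Callable[[Row], List[Row]]) -> List[Row]:
    """For each left row, evaluate the right side with that row in scope."""
    joined = []
    for left in left_rows:
        for right in right_for_row(left):
            joined.append({**left, **right})
    return joined


# The inner "subquery" refers to the outer row d, like the inline view above.
result = lateral_join(
    departments,
    lambda d: [e for e in employees if e["department_id"] == d["department_id"]],
)
result.sort(key=lambda row: row["employee_id"])
print(result)
```

The lambda can only be evaluated once a department row `d` is available, which is the same dependency that makes the SQL fail without the lateral keyword.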

Flatten and aggregate two columns of arrays via distinct in Snowflake

Table structure is
+------------+---------+
| Animals | Herbs |
+------------+---------+
| [Cat, Dog] | [Basil] |
| [Dog, Lion]| [] |
+------------+---------+
Desired output (don't care about sorting of this list):
unique_things
+------------+
[Cat, Dog, Lion, Basil]
First attempt was something like
SELECT ARRAY_CAT(ARRAY_AGG(DISTINCT(animals)), ARRAY_AGG(herbs))
But this produces
[[Cat, Dog], [Dog, Lion], [Basil], []]
This happens because DISTINCT operates on each array as a whole, rather than on the individual elements across all arrays.
If I understand your requirements correctly, and assuming a source table populated like this:
insert into tabarray select array_construct('cat', 'dog'), array_construct('basil');
insert into tabarray select array_construct('lion', 'dog'), null;
I would say the query would look like this:
select array_agg(distinct value) from
(
select
value from tabarray
, lateral flatten( input => col1 )
union all
select
value from tabarray
, lateral flatten( input => col2 ))
;
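The flatten-then-aggregate idea (explode every array into single values, then collect the distinct values) can be sketched in Python as follows. The helper name `unique_things` is hypothetical, and a missing array is represented here as None.

```python
from typing import Iterable, List, Optional, Sequence, Tuple

ArrayPair = Tuple[Optional[Sequence[str]], Optional[Sequence[str]]]


def unique_things(rows: Iterable[ArrayPair]) -> List[str]:
    """Flatten both array columns of every row and keep each value once."""
    seen = {}  # a dict keeps first-seen order while de-duplicating
    for col1, col2 in rows:
        for array in (col1, col2):
            for value in array or []:  # treat a missing array as empty
                seen.setdefault(value, None)
    return list(seen)


# Same shape as the sample inserts above: the second row has no herbs.
table = [(["cat", "dog"], ["basil"]), (["lion", "dog"], None)]
result = unique_things(table)
print(result)
```

The order of the result here is simply first-seen order; the question states that ordering does not matter.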
UPDATE
It is possible without using FLATTEN, by using ARRAY_UNION_AGG:
Returns an ARRAY that contains the union of the distinct values from the input ARRAYs in a column.
For sample data:
CREATE OR REPLACE TABLE t AS
SELECT ['Cat', 'Dog'] AS Animals, ['Basil'] AS Herbs
UNION SELECT ['Dog', 'Lion'], [];
Query:
SELECT ARRAY_UNION_AGG(ARRAY_CAT(Animals, Herbs)) AS Result
FROM t
or:
SELECT ARRAY_UNION_AGG(Animals) AS Result
FROM (SELECT Animals FROM t
UNION ALL
SELECT Herbs FROM t);
Output:
You could flatten the combined array and then aggregate back:
SELECT ARRAY_AGG(DISTINCT F."VALUE") AS unique_things
FROM tab, TABLE(FLATTEN(ARRAY_CAT(tab.Animals, tab.Herbs))) f
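Here is another variation that handles NULLs in case they appear in the data set. COALESCE replaces a NULL array with an empty one before ARRAY_CAT, and ARRAY_COMPACT drops NULL elements:
SELECT ARRAY_AGG(DISTINCT a.VALUE) AS unique_things
FROM tab, TABLE(FLATTEN(ARRAY_COMPACT(ARRAY_CAT(COALESCE(tab.Animals, ARRAY_CONSTRUCT()), COALESCE(tab.Herbs, ARRAY_CONSTRUCT()))))) a;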

Want to see query result order as written in query

I have written a query and want the employees returned in the order in which they are listed in the query. The query is as follows:
select * from employeemaster where employeename in
('Sachin','Gaurav','Vinay','Shiv','Sandeep','Vaibhav','Prashant')
I want the query result to display Sachin first and then the others. The IDs of the employees are not in sequence; for example, Sachin's ID can be 4 and Vinay's ID can be 1. But since I have written Sachin in first place, I want Sachin to come first in the result.
You can use a CTE with IDs to do the sorting and the filtering with an inner join.
WITH cte as (
SELECT *
FROM (VALUES
(1,'Sachin')
,(2,'Gaurav')
,(3,'Vinay')
,(4,'Shiv')
,(5,'Sandeep')
,(6,'Vaibhav')
,(7,'Prashant')
) a (id, [name])
)
SELECT em.*
FROM employeemaster em
JOIN cte
ON em.employeename = cte.[name]
ORDER BY cte.id
select * from employeemaster
join (values
('Sachin',1)
,('Gaurav',2)
,('Vinay',3)
,('Shiv',4)
,('Sandeep',5)
,('Vaibhav',6)
,('Prashant',7)) a(employeename ,_order) on a.employeename = employeemaster.employeename
order by a.[_order]
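The same idea (join to a list of values that carries an explicit position, then sort by that position) can be tried out with Python's built-in sqlite3 module. This is a sketch in SQLite syntax rather than the T-SQL above; the sample rows and the `wanted` list are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employeemaster (id INTEGER PRIMARY KEY, employeename TEXT);
    INSERT INTO employeemaster VALUES
        (1, 'Vinay'), (2, 'Gaurav'), (4, 'Sachin'), (9, 'Amit');
    """
)

# The order of this list is the order we want back.
wanted = ["Sachin", "Gaurav", "Vinay"]

# One "(?, ?)" pair per name, bound to (position, name).
placeholders = ", ".join("(?, ?)" for _ in wanted)
params = [item for pair in enumerate(wanted, start=1) for item in pair]

query = f"""
    WITH wanted(pos, name) AS (VALUES {placeholders})
    SELECT em.employeename
    FROM employeemaster AS em
    JOIN wanted ON em.employeename = wanted.name
    ORDER BY wanted.pos
"""
ordered = [row[0] for row in conn.execute(query, params)]
print(ordered)
```

The inner join does the filtering that the IN list did in the original query, so 'Amit' is excluded, and the position column does the sorting.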

only display one row when key field is the same

I have created a key field (C) by joining two columns (A & B). I want to run an SQL query that says: if the value in column C is repeated, take only the top row.
Sample data:-
A B C D
10022 Blue 10022Blue Buggy
10300 Red 10300Red Noodle
10300 Red 10300Red Sammy
So I only want one row to show for 10300Red.
Cheers
One way to do it is with a cte and ROW_NUMBER():
;WITH CTE AS
(
SELECT A,
B,
C,
D,
ROW_NUMBER() OVER(PARTITION BY C ORDER BY (SELECT NULL)) rn
FROM Table
)
SELECT A, B, C, D
FROM CTE
WHERE rn = 1
Note: You did say you want the "first" record, but you didn't specify the order of the records. Since tables in a relational database are unsorted by nature, "first" is simply an arbitrary row, hence "order by (select null)"
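The ROW_NUMBER() pattern can be checked against the sample data with Python's built-in sqlite3 module. This is a sketch in SQLite syntax, assuming a SQLite build with window-function support (3.25 or later); the table name `items` is hypothetical, and the sketch orders by D so that the chosen row is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE items (A INTEGER, B TEXT, C TEXT, D TEXT);
    INSERT INTO items VALUES
        (10022, 'Blue', '10022Blue', 'Buggy'),
        (10300, 'Red',  '10300Red',  'Noodle'),
        (10300, 'Red',  '10300Red',  'Sammy');
    """
)

query = """
    WITH cte AS (
        SELECT A, B, C, D,
               -- number the rows inside each key; ORDER BY D picks which is "first"
               ROW_NUMBER() OVER (PARTITION BY C ORDER BY D) AS rn
        FROM items
    )
    SELECT A, B, C, D
    FROM cte
    WHERE rn = 1
    ORDER BY C
"""
deduplicated = conn.execute(query).fetchall()
print(deduplicated)
```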
If column D is not needed in the output, DISTINCT is enough:
select distinct A, B, C from tablename
You can find the result set by grouping it, then join it with the main table.
SELECT
A.*
FROM
YourTable A INNER JOIN
(
SELECT
G.C,
MAX(G.D) D
FROM
YourTable G
GROUP BY
G.C
) B ON A.C = B.C AND A.D = B.D

OVER (ORDER BY Col) generates 'Sort' operation

I'm working on a query that needs to do filtering, ordering and paging according to the user's input. I'm now testing a case that is really slow; on inspecting the query plan, a 'Sort' operation is taking 96% of the time.
The datamodel is really not that complicated, the following query should be clear enough to understand what's happening:
WITH OrderedRecords AS (
SELECT
A.Id
, A.col2
, ...
, B.Id
, B.col1
, ROW_NUMBER() OVER (ORDER BY B.col1 ASC) AS RowNumber
FROM A
LEFT JOIN B ON (B.SomeThing IS NULL) AND (A.BId = B.Id)
WHERE (A.col2 IN (...)) AND (B.Id IN (...))
)
SELECT
*
FROM OrderedRecords WHERE RowNumber Between x AND y
A is a table containing about 100k records, but it will grow to tens of millions in the field, while B is a category-type table with 5 items (and this will never grow beyond perhaps a few more). There are clustered indexes on A.Id and B.Id.
Performance is really dreadful and I'm wondering if it's possible to remedy this somehow. If, for example, the ordering is on A.Id instead of B.col1, everything is pretty fast. Perhaps I can optimize B.col1 with some sort of index.
I already tried putting an index on the field itself, but this didn't help, probably because the number of distinct items in table B is very small (both in itself and compared to A).
Any ideas?
I think this may be part of the problem:
LEFT JOIN B ON (B.SomeThing IS NULL) AND (A.BId = B.Id)
WHERE (A.col2 IN (...)) AND (B.Id IN (...))
Your LEFT JOIN is going to logically act like an INNER JOIN because of the WHERE clause you have in place, since only certain B.ID rows are going to be returned. If that's your intent, then go ahead and use an inner join, which may help the optimizer realize that you are looking for a restricted number of rows.
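The effect is easy to reproduce on a small scale. The sketch below uses Python's built-in sqlite3 module with made-up rows; the table and column names follow the question, but the data and the simplified join condition are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE B (Id INTEGER PRIMARY KEY, col1 TEXT);
    CREATE TABLE A (Id INTEGER PRIMARY KEY, BId INTEGER);
    INSERT INTO B VALUES (1, 'x'), (2, 'y');
    -- the third A row has no matching B row
    INSERT INTO A VALUES (1, 1), (2, 2), (3, NULL);
    """
)

left_unfiltered = conn.execute(
    "SELECT A.Id, B.col1 FROM A LEFT JOIN B ON A.BId = B.Id ORDER BY A.Id"
).fetchall()

# A WHERE condition on B.Id rejects the NULL-extended rows of the outer join.
left_filtered = conn.execute(
    "SELECT A.Id, B.col1 FROM A LEFT JOIN B ON A.BId = B.Id "
    "WHERE B.Id IN (1, 2) ORDER BY A.Id"
).fetchall()

inner_filtered = conn.execute(
    "SELECT A.Id, B.col1 FROM A JOIN B ON A.BId = B.Id "
    "WHERE B.Id IN (1, 2) ORDER BY A.Id"
).fetchall()

print(left_unfiltered)
print(left_filtered)
print(inner_filtered)
```

Rows of A without a match get NULL for B.Id, and NULL never satisfies the IN condition, so the filtered outer join returns the same rows as the inner join.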
I suggest trying the following.
For the B table create index:
create index IX_B_1 on B (col1, Id, SomeThing)
For the A table create index:
create index IX_A_1 on A (col2, BId) include (Id, ...)
In the include list, put all the other columns of table A that are listed in the SELECT of the OrderedRecords CTE.
However, as you can see, index IX_A_1 takes a lot of space and can be about as large as the table data itself.
So, as an alternative, you may try omitting the extra columns from the include part of the index:
create index IX_A_2 on A (col2, BId) include (Id)
but in this case you will have to slightly modify your query:
;WITH OrderedRecords AS (
SELECT
AId = A.Id
, A.col2
-- remove other A columns from here
, bid = B.Id
, B.col1
, ROW_NUMBER() OVER (ORDER BY B.col1 ASC) AS RowNumber
FROM A
LEFT JOIN B ON (B.SomeThing IS NULL) AND (A.BId = B.Id)
WHERE (A.col2 IN (...)) AND (B.Id IN (...))
)
SELECT
R.*, A.OtherColumns
FROM OrderedRecords R
join A on A.Id = R.AId
WHERE R.RowNumber Between x AND y
