Creating a merge statement from a conditional INSERT select (Snowflake)

I'm trying to create a merge statement where I keep all the rows in my FINAL_TABLE whose DATE is before today's date,
and insert new rows from today's date onward from my LANDING_TABLE.
The working example with separate DELETE and INSERT statements can be seen here:
DELETE FROM FINAL_TABLE
WHERE "DATE" >= CURRENT_DATE();
INSERT INTO FINAL_TABLE
SELECT X, y.value :: string AS Y_SPLIT, "DATE", "PUBLIC"
FROM LANDING_TABLE, LATERAL FLATTEN (INPUT => STRTOK_TO_ARRAY(LANDING_TABLE.column, ', '), OUTER => TRUE) y
WHERE "PUBLIC" ILIKE 'TRUE' AND "DATE" >= CURRENT_DATE();
I'd like to keep the FLATTEN statement and the WHERE conditions while having the whole statement in a single MERGE statement.
Is it possible, or should I first create a temporary table with the values I want to insert and then use that in the merge statement?

The MERGE statement can use a subquery or CTE as its source:
MERGE INTO <target_table> USING <source>
ON <join_expr> { matchedClause | notMatchedClause } [ ... ]
source:
Specifies the table or subquery to join with the target table.
MERGE INTO FINAL_TABLE
USING (
SELECT X, Y.value :: string AS Y_SPLIT, "DATE" AS col1, "PUBLIC" AS col2
FROM LANDING_TABLE
,LATERAL FLATTEN(INPUT=>STRTOK_TO_ARRAY(LANDING_TABLE.column, ', '), OUTER=>TRUE) y
WHERE "PUBLIC" ILIKE 'TRUE' AND "DATE" >= CURRENT_DATE()
) AS SRC
ON ...
WHEN ...;
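For completeness, a minimal sketch of how the finished statement might look; the join key and FINAL_TABLE's column list are assumptions, since they aren't given in the question. Note one semantic difference from the DELETE + INSERT version: rows already in FINAL_TABLE for today that no longer appear in the source are kept, not deleted.
MERGE INTO FINAL_TABLE AS TGT
USING (
    SELECT X, y.value :: string AS Y_SPLIT, "DATE", "PUBLIC"
    FROM LANDING_TABLE
        ,LATERAL FLATTEN(INPUT => STRTOK_TO_ARRAY(LANDING_TABLE.column, ', '), OUTER => TRUE) y
    WHERE "PUBLIC" ILIKE 'TRUE' AND "DATE" >= CURRENT_DATE()
) AS SRC
ON TGT.X = SRC.X AND TGT."DATE" = SRC."DATE"   -- hypothetical key; replace with the real one
WHEN MATCHED THEN UPDATE SET
    Y_SPLIT = SRC.Y_SPLIT,
    "PUBLIC" = SRC."PUBLIC"
WHEN NOT MATCHED THEN INSERT (X, Y_SPLIT, "DATE", "PUBLIC")
    VALUES (SRC.X, SRC.Y_SPLIT, SRC."DATE", SRC."PUBLIC");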

Related

How to transform semicolon separated strings of column names and values into a new table

I am relatively new to Snowflake and struggle a bit with setting up a transformation for a semi-structured dataset. I have several log data batches, where each batch (a table row in Snowflake) has the following columns: LOG_ID, COLUMN_NAMES, and LOG_ENTRIES.
COLUMN_NAMES contains a semicolon-separated list of column names, e.g.:
“TIMESTAMP;Sensor A;Sensor B”, “TIMESTAMP;Sensor B;Sensor C”
LOG_ENTRIES contains a semicolon-separated list of values, e.g.:
“2020-02-11 09:08:19; 99.24;12.25”
The COLUMN_NAMES string can differ between log batches (Snowflake rows), but the names, in the order they appear, describe the LOG_ENTRIES values of the same row. My goal is to transform the data into a table that has a column for every unique name present in COLUMN_NAMES, e.g.:
LOG_ID | TIMESTAMP           | Sensor A | Sensor B | Sensor C
1      | 2020-02-11 09:08:19 | 99.24    | 12.25    | NaN
2      | 2020-02-11 09:10:44 | NaN      | 13.32    | 0.947
Can this be achieved with a Snowflake script, and if so, how? :)
Best regards,
Johan
You should use the SPLIT_TO_TABLE function, split the two values, and join them by index.
After that, all you have to do is use PIVOT to rotate the rows into columns.
Sample data:
create or replace table splittable (LOG_ID int, COLUMN_NAMES varchar, LOG_ENTRIES varchar);
insert into splittable (LOG_ID, COLUMN_NAMES, LOG_ENTRIES)
values (1, 'TIMESTAMP;Sensor A;Sensor B', '2020-02-11 09:08:19;99.24;12.25'),
(2, 'TIMESTAMP;Sensor B;Sensor C', '2020-02-11 09:10:44;13.32;0.947');
Solution proposal:
WITH src AS (
select LOG_ID, cn.VALUE as COLUMN_NAMES, le.VALUE as LOG_ENTRIES
from splittable as st,
lateral split_to_table(st.COLUMN_NAMES, ';') as cn,
lateral split_to_table(st.LOG_ENTRIES, ';') as le
where cn.INDEX = le.INDEX
)
select * from src
pivot (min(LOG_ENTRIES) for COLUMN_NAMES in ('TIMESTAMP','Sensor A','Sensor B','Sensor C'))
order by LOG_ID;
Reference: SPLIT_TO_TABLE, PIVOT
If the column list is variable and you can't define it up front, you'll have to generate the statement dynamically; maybe this will help: CREATE A DYNAMIC PIVOT IN SNOWFLAKE
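As a hedged sketch of where such a generator could start, the quoted IN list can be aggregated from the data itself and then spliced into the PIVOT statement via dynamic SQL (splittable is the sample table above):
-- builds 'TIMESTAMP','Sensor A','Sensor B','Sensor C' from the distinct names
select listagg(distinct '''' || cn.VALUE || '''', ',') as pivot_list
from splittable as st,
     lateral split_to_table(st.COLUMN_NAMES, ';') as cn;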
You could transform the data into an ACTUAL semi-structured data type that you can then natively query using Snowflake SQL.
WITH x AS (
SELECT column_names, log_entries
FROM (VALUES ('TIMESTAMP_;SENSOR1','2021-02-01'||';1.2')) x (column_names, log_entries)
),
y AS (
SELECT *
FROM x,
LATERAL FLATTEN(input => split(column_names,';')) f
),
z AS (
SELECT *
FROM x,
LATERAL FLATTEN(input => split(log_entries,';')) f
)
SELECT listagg(('"'||y.value||'":"'||z.value||'"'),',') as cnt
, parse_json('{'||cnt||'}') as var
FROM y
JOIN z
ON y.seq = z.seq
AND y.index = z.index
GROUP BY y.seq;
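From there the usual dot/bracket VARIANT syntax applies. A small usage sketch, assuming the output above is saved to a hypothetical table LOG_VARS with a VARIANT column named var:
-- keys come straight from the data; cast values as needed
select var:"TIMESTAMP_"::timestamp as ts,
       var:"SENSOR1"::float as sensor1
from LOG_VARS;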

Why is TRY_PARSE so slow?

I have this query that basically returns (right now) only 10 rows as results:
select *
FROM Table1 as o
inner join Table2 as t on t.Field1 = o.Field2
where Code = 123456 and t.FakeData is not null
Now, if I want to parse the field FakeData (which, unfortunately, can contain different types of data, from datetimes to surnames; it is an nvarchar(70)) for display and/or filtering:
select *, TRY_PARSE(t.FakeData as date USING 'en-GB') as RealDate
FROM Table1 as o
inner join Table2 as t on t.Field1 = o.Field2
where Code = 123456 and t.FakeData is not null
The query takes 10x longer to execute.
What am I doing wrong? How can I speed it up?
I can't edit the database; I'm just a customer who reads data.
The T-SQL documentation for TRY_PARSE makes the following observation:
Keep in mind that there is a certain performance overhead in parsing the string value.
NB: I am assuming your typical date format would be dd/mm/yyyy.
The following is something of a shot in the dark that might help. By progressively assessing whether the nvarchar column is a candidate date, it is possible to reduce the number of calls to that function. Note that a value established in one APPLY can be referenced in a subsequent APPLY:
CREATE TABLE mytable(
FakeData NVARCHAR(60) NOT NULL
);
INSERT INTO mytable(FakeData) VALUES (N'oiwsuhd ouhw dcouhw oduch woidhc owihdc oiwhd cowihc');
INSERT INTO mytable(FakeData) VALUES (N'9603200-0297r2-0--824');
INSERT INTO mytable(FakeData) VALUES (N'12/03/1967');
INSERT INTO mytable(FakeData) VALUES (N'12/3/2012');
INSERT INTO mytable(FakeData) VALUES (N'3/3/1812');
INSERT INTO mytable(FakeData) VALUES (N'ohsw dciuh iuh pswiuh piwsuh cpiuwhs dcpiuhws ipdcu wsiu');
select
    t.FakeData, oa3.RealDate
from mytable as t
outer apply (
    -- capture the length once so it can be reused below
    select len(FakeData) as fd_len
) oa1
outer apply (
    -- cheap checks: a dd/mm/yyyy value is at most 10 characters
    -- and contains exactly two slashes
    select case when oa1.fd_len > 10 then 0
                when len(replace(FakeData,'/','')) + 2 = oa1.fd_len then 1
                else 0
           end as is_candidate
) oa2
outer apply (
    -- only candidate rows pay the TRY_PARSE cost
    select case when oa2.is_candidate = 1 then TRY_PARSE(t.FakeData as date USING 'en-GB') end as RealDate
) oa3;
FakeData                                                 | RealDate
oiwsuhd ouhw dcouhw oduch woidhc owihdc oiwhd cowihc     | null
9603200-0297r2-0--824                                    | null
12/03/1967                                               | 1967-03-12
12/3/2012                                                | 2012-03-12
3/3/1812                                                 | 1812-03-03
ohsw dciuh iuh pswiuh piwsuh cpiuwhs dcpiuhws ipdcu wsiu | null
db<>fiddle here
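As a separate, hedged suggestion: since the expected format is dd/mm/yyyy, TRY_CONVERT with style 103 may be worth benchmarking against TRY_PARSE, because TRY_PARSE relies on the .NET CLR while CONVERT is implemented natively:
-- Style 103 = British/French dd/mm/yyyy; returns NULL when parsing fails
select FakeData, TRY_CONVERT(date, FakeData, 103) as RealDate
from mytable;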

Inserting multiple rows with Merge NOT MATCHED

MERGE tbl_target t
USING tbl_source s
ON t.itemnum = s.itemnum
WHEN NOT MATCHED
INSERT (itemnum, minqty, maxqty, parent)
VALUES (s.itemnum,0,99,10),(s.itemnum,0,99,80);
I'm trying to insert two rows into the target table if an item does not exist on target but does exist on the source. Every time I try, SQL Server gives an error on the ',' between the VALUES:
A MERGE statement must be terminated by a semi-colon (;)
Is it possible to do multi-row inserts in a MERGE statement?
It is possible by tweaking the USING clause to return multiple rows per tbl_source.itemnum value:
MERGE tbl_target t
USING (
select s.itemnum,
0 as minqty,
99 as maxqty,
p.parent
from tbl_source s
cross join (
select 10 as parent
union all
select 80 as parent) p
) s
ON t.itemnum = s.itemnum
WHEN NOT MATCHED THEN
INSERT (itemnum, minqty, maxqty, parent)
VALUES (s.itemnum,s.minqty,s.maxqty,s.parent);
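Equivalently, the two parent rows can come from a VALUES table constructor instead of a UNION ALL; this is just a stylistic sketch of the same technique:
MERGE tbl_target t
USING (
    select s.itemnum, 0 as minqty, 99 as maxqty, p.parent
    from tbl_source s
    cross join (values (10), (80)) as p(parent)
) s
ON t.itemnum = s.itemnum
WHEN NOT MATCHED THEN
    INSERT (itemnum, minqty, maxqty, parent)
    VALUES (s.itemnum, s.minqty, s.maxqty, s.parent);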
What I understand from MSDN is that you can only insert one row for each non-matching record. Do you need to use MERGE? If not, the following will work:
WITH CTE (Sitemnum)
AS
(
SELECT s.itemnum
FROM tbl_source s
LEFT JOIN tbl_target t ON (s.itemnum = t.itemnum)
WHERE t.itemnum IS NULL
)
INSERT tbl_target
SELECT Sitemnum,0,99,10
FROM CTE
UNION
SELECT Sitemnum,0,99,80
FROM CTE

How do I search for an item in an array in Hive?

Using Hive I've created a table with the following fields:
ID BIGINT,
MSISDN STRING,
DAY TINYINT,
MONTH TINYINT,
YEAR INT,
GENDER TINYINT,
RELATIONSHIPSTATUS TINYINT,
EDUCATION STRING,
LIKES_AND_PREFERENCES STRING
This was filled with data via the following SQL command:
Insert overwrite table temp_output
Select a.ID, a.MSISDN, a.DAY, a.MONTH, a.YEAR, a.GENDER, a.RELATIONSHIPSTATUS, b.NAME, COLLECT_SET(c.NAME)
FROM temp_basic_info a
JOIN temp_education b ON (a.ID = b.ID)
JOIN likes_and_music c ON (c.ID = b.ID)
GROUP BY a.ID, a.MSISDN, a.DAY, a.MONTH, a.YEAR, a.Gender, a.RELATIONSHIPSTATUS, b.NAME;
Likes and Preferences is an array, but I was not foresighted enough to specify it as such (it's a string, instead). How would I go about selecting records that have a specific item in the array?
Is it as simple as:
select * from table_result where LIKES_AND_PREFERENCES = "item"
Or will that have some unforeseen issues?
I tried the query above, and it does seem to output only the rows where 'item' is the only element in the array, though.
Maybe you should try something like this:
select * from (
    select col1, col2, ..., coln, new_column
    from table_name
    lateral view explode(array_column_name) exploded_table as new_column
) t
where t.new_column = '<value of items to be searched>'
Hope this helps!
Use the array_contains UDF in the following manner:
select *
from mytable
where array_contains(likes_and_preferences,'item') = TRUE
array_contains will return a Boolean that you can predicate on.
You are correct that the query you used will return only records where the array has exactly one element with the value "item".
You need to use the array_contains function.
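Since LIKES_AND_PREFERENCES was declared as a STRING rather than a true array, a hedged sketch of one way to predicate on it is to rebuild an array first; the delimiter here is an assumption and must match how the values were actually serialized:
select *
from table_result
-- split() turns the delimited string back into an array<string>
where array_contains(split(LIKES_AND_PREFERENCES, ','), 'item');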

Select data from one table where a field is greater than that of another field in another table

I want to be able to select data from TableA where Field1 is greater than Field2 in TableB.
In my head I imagine it to be something like this:
Select TableA.*
from TableA
Join TableB
On TableA.PK = TableB.FK
WHERE TableA.Field1 > TableB.Field2
I am using SQL Server 2005, and TableA.Field1 and TableB.Field2 look like:
2004102881010 - data type - varchar
My PK and FK look like:
0908232 - data type - nvarchar
The problem is that when this query is run, ALL the data is displayed, not just the rows where Field1 is greater.
Cheers:)
Seems to be working correctly for this demo code. Perhaps I'm not understanding the problem or data.
;
with TABLEA (PK, Field1) AS
(
-- Sample row that is filtered out
SELECT CAST('0908232' AS nvarchar(10)), CAST('2004102881010' AS varchar(50))
-- This is bigger than what's in B
UNION ALL SELECT CAST('0908232' AS nvarchar(10)), CAST('2005102881010' AS varchar(50))
)
, TABLEB(FK, Field2) AS
(
-- This matches row 1 above and will be excluded
SELECT CAST('0908232' AS nvarchar(10)), CAST('2004102881010' AS varchar(50))
)
SELECT TableA.*
FROM TableA
INNER JOIN TableB
ON TableA.PK = TableB.FK
WHERE TableA.Field1 > TableB.Field2
Results:
PK      | Field1
0908232 | 2005102881010
This seems like a problem with missing zeroes: 2004102881010 appears to be missing a zero and should be 20041028081010.
There is nothing wrong with your query, but with your data.
Consider 2001-01-01 01:01:01: without zero-padding it would be stored as 200111111, when it should be stored as 20010101010101.
Comparison operators (>, <) used on strings (varchar, nvarchar, etc.) work alphabetically. For example, '9' > '11' is true. You might try doing a data type conversion (bigint rather than int, since 13-digit values such as 2004102881010 overflow int)...
WHERE cast(A.Field1 as bigint) > cast(B.Field2 as bigint)
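A quick self-contained demonstration of that pitfall (demo values only, not from the question):
-- As strings, '9' > '11' because comparison runs character by character;
-- casting to a numeric type compares by value instead.
SELECT CASE WHEN '9' > '11' THEN 'true' ELSE 'false' END AS as_strings,
       CASE WHEN CAST('9' AS bigint) > CAST('11' AS bigint) THEN 'true' ELSE 'false' END AS as_numbers;
-- as_strings = true, as_numbers = false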
