Streams + tasks missing inserts?

We've set up a stream on a Snowflake table that is continuously loaded via Snowpipe.
We consume this data with a task that runs every minute and merges into another table. Because duplicate keys are possible, we use a ROW_NUMBER() window function ordered by the file-created timestamp descending and keep only row_num = 1, so we always take the latest insert.
Initially we used a standard task with the MERGE statement, but we noticed that in some instances we were updating rows with older data, since Snowpipe does not guarantee loading files in the order they were staged. So on the WHEN MATCHED clause we added a condition: only update the row when the incoming file-created timestamp is newer than the existing one.
However, since we did that, reconciliation checks show that some new inserts are missing. I don't know for sure why changing the MATCHED clause would interfere with the NOT MATCHED clause.
My theory was that the extra clause added a bit of time to the task run, so that some runs were skipped or the next run started almost immediately after the last one completed. The idea is that the missing rows arrived in the middle and the stream offset advanced before they could be consumed.
So we changed the task to call a stored procedure that uses an explicit transaction, because the docs seem to suggest that a transaction will lock the stream. However, even with this we can see that new inserts are still missing. We're talking very small numbers, e.g. 8 rows out of hundreds of thousands.
Any ideas what might be happening?
Example task code below (not the sp version)
WAREHOUSE = TASK_WH
SCHEDULE = '1 minute'
WHEN SYSTEM$STREAM_HAS_DATA('my_stream')
AS
MERGE INTO processed_data pd USING (
    SELECT
        ms.*,
        CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END AS pending_count,
        CASE WHEN ms.status = 'COMPLETE' THEN 1/mv.count ELSE NULL END AS completed_count
    FROM my_stream ms
    JOIN my_view mv ON mv.id = ms.id
    QUALIFY
        ROW_NUMBER() OVER (
            PARTITION BY id
            ORDER BY file_created DESC
        ) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3, ...)
    VALUES (ms.col1, ms.col2, ms.col3, ...)
WHEN MATCHED AND ms.file_created >= pd.file_created THEN
    UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ...
;
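For reference, a rough sketch of the stored-procedure variant mentioned above, using Snowflake Scripting. The procedure and task names are hypothetical and the column lists mirror the placeholders in the task above; this is only an illustration of the shape, not the exact code used.

CREATE OR REPLACE PROCEDURE merge_from_stream()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
    BEGIN TRANSACTION;
    MERGE INTO processed_data pd USING (
        SELECT
            ms.*,
            CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END AS pending_count,
            CASE WHEN ms.status = 'COMPLETE' THEN 1/mv.count ELSE NULL END AS completed_count
        FROM my_stream ms
        JOIN my_view mv ON mv.id = ms.id
        QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY file_created DESC) = 1
    ) ms ON ms.id = pd.id
    WHEN NOT MATCHED THEN INSERT (col1, col2, col3) VALUES (ms.col1, ms.col2, ms.col3)
    WHEN MATCHED AND ms.file_created >= pd.file_created THEN
        UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3;
    COMMIT;                    -- the stream offset only advances when this commits
    RETURN 'merged';
END;
$$;

-- The task then simply calls the procedure
CREATE OR REPLACE TASK merge_task
    WAREHOUSE = TASK_WH
    SCHEDULE = '1 minute'
    WHEN SYSTEM$STREAM_HAS_DATA('my_stream')
AS
    CALL merge_from_stream();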

I am not fully sure what is going wrong here, but Snowflake gives a recommendation about the file-created time somewhere in its docs: the file-created timestamp is calculated by the cloud service and may be slightly different from what you expect. There is another recommendation related to Snowpipe and data ingestion: the queue service can take up to a minute to consume data from the pipe, so if a lot of data is flowing in within that minute you can end up with this issue. Review your implementation, test whether pushing data at one-minute intervals solves the problem, and don't rely on the file-created time.
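One way to act on that advice is to stop ordering by the file-created time and instead capture an ingestion timestamp when the pipe loads each row, then order the dedup window by that. A rough sketch, with hypothetical stage, pipe and table names (verify that the functions you need are supported inside COPY transformations):

-- Hypothetical pipe: stamp each row with the moment Snowpipe actually loaded it
CREATE OR REPLACE PIPE my_pipe AUTO_INGEST = TRUE AS
COPY INTO raw_table (id, status, file_created, source_file, load_ts)
FROM (
    SELECT $1:id,
           $1:status,
           $1:file_created,
           METADATA$FILENAME,      -- which staged file the row came from
           CURRENT_TIMESTAMP()     -- when the row was loaded
    FROM @my_stage
)
FILE_FORMAT = (TYPE = 'JSON');

-- ...and in the MERGE subquery, order the window by load_ts instead:
-- ROW_NUMBER() OVER (PARTITION BY id ORDER BY load_ts DESC) = 1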

The condition "AND ms.file_created >= pd.file_created" seems to have been added as a mechanism to avoid updating the same row multiple times.
An alternative approach could be to use IS DISTINCT FROM to compare the source columns against the target columns (except id):
MERGE INTO processed_data pd USING (
    SELECT
        ms.*,
        CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END AS pending_count,
        CASE WHEN ms.status = 'COMPLETE' THEN 1/mv.count ELSE NULL END AS completed_count
    FROM my_stream ms
    JOIN my_view mv ON mv.id = ms.id
    QUALIFY
        ROW_NUMBER() OVER (
            PARTITION BY id
            ORDER BY file_created DESC
        ) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3, ...)
    VALUES (ms.col1, ms.col2, ms.col3, ...)
WHEN MATCHED
    AND (pd.col1, pd.col2, ..., pd.coln) IS DISTINCT FROM (ms.col1, ms.col2, ..., ms.coln)
THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ...;
This approach will also prevent updating a row when nothing has changed.

Related

Joining continuous queries in Flink SQL

I'm trying to join two continuous queries, but keep running into the following error:
Rowtime attributes must not be in the input rows of a regular join. As a workaround you can cast the time attributes of input tables to TIMESTAMP before.
Please check the documentation for the set of currently supported SQL features.
Here's the table definition:
CREATE TABLE `Combined` (
    `machineID` STRING,
    `cycleID` BIGINT,
    `start` TIMESTAMP(3),
    `end` TIMESTAMP(3),
    WATERMARK FOR `end` AS `end` - INTERVAL '5' SECOND,
    `sensor1` FLOAT,
    `sensor2` FLOAT
)
and the insert query
INSERT INTO `Combined`
SELECT
    a.`MachineID`,
    a.`cycleID`,
    MAX(a.`start`) `start`,
    MAX(a.`end`) `end`,
    MAX(a.`sensor1`) `sensor1`,
    MAX(m.`sensor2`) `sensor2`
FROM `Aggregated` a, `MachineStatus` m
WHERE
    a.`MachineID` = m.`MachineID` AND
    a.`cycleID` = m.`cycleID` AND
    a.`start` = m.`timestamp`
GROUP BY a.`MachineID`, a.`cycleID`, SESSION(a.`start`, INTERVAL '1' SECOND)
In the source tables Aggregated and MachineStatus, the start and timestamp columns are time attributes with a watermark.
I've tried casting the input rows of the join to timestamps, but that didn't fix the issue and would mean that I cannot use SESSION, which is supposed to ensure that only one data point gets recorded per cycle.
Any help is greatly appreciated!
I investigated this a little further and noticed that the GROUP BY doesn't make sense in that context.
Furthermore, the SESSION window can be replaced by a time-windowed (interval) join, which is the more idiomatic approach:
INSERT INTO `Combined`
SELECT
    a.`MachineID`,
    a.`cycleID`,
    a.`start`,
    a.`end`,
    a.`sensor1`,
    m.`sensor2`
FROM `Aggregated` a, `MachineStatus` m
WHERE
    a.`MachineID` = m.`MachineID` AND
    a.`cycleID` = m.`cycleID` AND
    m.`timestamp` BETWEEN a.`start` AND a.`start` + INTERVAL '0' SECOND
To understand the different ways to join dynamic tables, I found the Ververica SQL training extremely helpful.
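For reference, the query above stays streaming because it is an interval join: both time attributes are bounded relative to each other, so Flink does not have to treat it as a regular (unbounded) join. The general shape allows a lower and an upper bound; the five-second tolerance below is only an illustrative value:

SELECT
    a.`MachineID`,
    a.`cycleID`,
    a.`start`,
    a.`end`,
    a.`sensor1`,
    m.`sensor2`
FROM `Aggregated` a, `MachineStatus` m
WHERE
    a.`MachineID` = m.`MachineID` AND
    a.`cycleID` = m.`cycleID` AND
    -- interval join: the status row may arrive up to 5 seconds before or after the cycle start
    m.`timestamp` BETWEEN a.`start` - INTERVAL '5' SECOND
                      AND a.`start` + INTERVAL '5' SECOND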

Set values on a new table from a historical table, before adding new table back into historical table

I would like to create a historical view of alerts in an application. To do this, I am grabbing all events, timestamping them, and uploading them into an MS SQL table. I would also like to be able to exempt certain objects from the total count by flagging either the finding (to exclude the finding across all systems) or the object within the finding (to exclude that object from all findings).
The idea is that all previous alerts live in the main table, and I set an 'exemptobject' or 'exemptfinding' bit column on the relevant rows. When I re-run the script weekly, I upload the results into a temporary table. I then want to compare each object or finding in the temporary table against the main table's 'object' or 'finding' and set the respective 'exemptobject' or 'exemptfinding' bit. Once the exemption bits are set on the temporary table's rows, I insert the temporary table into the main table and drop it, keeping a historical record.
This will give me duplicate findings and objects, so I am having difficulty with the merge command:
BEGIN TRANSACTION
MERGE INTO [dbo].[temp_table]
USING [dbo].[historical]
    ON [dbo].[temp_table].[object] = [dbo].[historical].[object] OR
       [dbo].[temp_table].[finding] = [dbo].[historical].[finding]
WHEN MATCHED THEN
    UPDATE
    SET [exemptfinding] = [dbo].[historical].[exemptfinding]
       ,[exemptobject] = [dbo].[historical].[exemptobject]
       ,[exemptdate] = [dbo].[historical].[exemptdate]
       ,[comments] = [dbo].[historical].[comments];
COMMIT
This seems to do what I want, but I see that the results are going to grow exponentially and I think it won't be sustainable for long.
BEGIN TRANSACTION
UPDATE temp
SET [exemptfinding] = [historical].[exemptfinding]
   ,[exemptobject] = [historical].[exemptobject]
   ,[exemptdate] = [historical].[exemptdate]
   ,[comments] = [historical].[comments]
FROM [dbo].[temp] temp
INNER JOIN [dbo].[historical] historical
    ON (
           [temp].[finding] = [historical].[finding] OR
           [temp].[object] = [historical].[object]
       ) AND
       (
           [historical].[exemptfinding] = 1 OR
           [historical].[exemptobject] = 1
       )
COMMIT
I feel like I need to normalize the database, but I can't think of a way to separate things out and still be able to:
See a count of each finding based on the date the script was run
Drill down into each day and see all the findings, objects and recommendations for each
Control the count shown for each finding by removing 'exempted' findings OR objects
I feel like there's something obvious I'm missing or I'm thinking about this incorrectly. Any help would be greatly appreciated!
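For illustration only, one possible split that would support the three requirements above. All table and column names here are hypothetical, not taken from the actual schema:

-- Findings recorded per scan run; exemptions live in their own tables
CREATE TABLE dbo.ScanResult (
    ScanResultID BIGINT IDENTITY(1,1) PRIMARY KEY,
    ScanDate     DATE          NOT NULL,
    Finding      NVARCHAR(200) NOT NULL,
    ObjectName   NVARCHAR(200) NOT NULL
);

CREATE TABLE dbo.FindingExemption (
    Finding    NVARCHAR(200) PRIMARY KEY,
    ExemptDate DATE,
    Comments   NVARCHAR(500)
);

CREATE TABLE dbo.ObjectExemption (
    ObjectName NVARCHAR(200) PRIMARY KEY,
    ExemptDate DATE,
    Comments   NVARCHAR(500)
);

-- Count per finding per day, excluding exempt findings or objects
SELECT sr.ScanDate, sr.Finding, COUNT(*) AS FindingCount
FROM dbo.ScanResult sr
WHERE NOT EXISTS (SELECT 1 FROM dbo.FindingExemption fe WHERE fe.Finding = sr.Finding)
  AND NOT EXISTS (SELECT 1 FROM dbo.ObjectExemption oe WHERE oe.ObjectName = sr.ObjectName)
GROUP BY sr.ScanDate, sr.Finding;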
EDIT - The following seems to do what I want, but as soon as I add an additional WHERE condition to the final result, the query time goes from 7 seconds to 90 seconds, so I fear it will not scale.
BEGIN TRANSACTION
UPDATE [dbo].[temp]
SET [dbo].[temp].[exemptrecommendation] = [historical].[exemptrecommendation]
   ,[dbo].[temp].[exemptfinding] = [historical].[exemptfinding]
   ,[dbo].[temp].[exemptobject] = [historical].[exemptobject]
   ,[dbo].[temp].[exemptdate] = [historical].[exemptdate]
   ,[dbo].[temp].[comments] = [historical].[comments]
FROM (
    SELECT *
    FROM historical h
    WHERE EXISTS (
        SELECT 1
        FROM temp t
        WHERE t.id = h.id OR
              t.recommendation = h.recommendation
    )
) historical
WHERE [dbo].[temp].[recommendation] = [historical].[recommendation] OR
      [dbo].[temp].[id] = [historical].[id]
COMMIT

Is there something faster than the Enumerable.Except<TSource> method?

I have a program that downloads data from a server database to a client database. The server database has been growing recently.
In that program there is an option to download all data OR download data for a specific time period (the user can select how many days back from today). If the user selects all, the program truncates the client table and inserts all data using bulk copy. That part is OK.
The problem is when the user selects a specific time period (each record has a created date-time): the program has to compare the two tables and split the server's records into two sets, records that already exist on the client and records that do not.
Non-existing records are inserted directly into the client DB (using bulk insert), and existing records are bulk-copied into a temporary table that is then used to update the client's table. My actual problem occurs when splitting the server's table. This is how I did it:
updateTable = (From c In dt_from_server.AsEnumerable()
               Join o In Dt_from_client.AsEnumerable()
                   On c.Field(Of String)("BARCODE").Trim() Equals o.Field(Of String)("BARCODE").Trim()
                   And c.Field(Of String)("ITEM_CODE").Trim() Equals o.Field(Of String)("ITEM_CODE").Trim()
               Select c).CopyToDataTable()

insertTable = dt_server.AsEnumerable() _
              .Except(updateTable.AsEnumerable(), DataRowComparer.Default) _
              .CopyToDataTable()
(Normally there are over 1 million records in the server table.)
When there are over 1 million records, the update part takes an acceptable time, around 10 minutes (yes, it takes 5 GB of RAM, but that is acceptable considering the performance).
But the insert part seems to take days, just to assign insertTable (a DataTable). This is the issue.
The AsEnumerable().Except() part takes a very long time and I could not find a way to speed it up. I am not sure I explained this correctly. Could anyone give me some advice?
Since you have commented that dt_from_server and dt_server are actually the same DataTable, you don't need to compare all values of all DataRows with each other, which is what DataRowComparer.Default does. You can use Except without the second comparer parameter; then only references are compared, which is much faster.
You also don't need two CopyToDataTable calls, which create two additional big DataTables in memory; process the rows one after the other.
Here is a different approach using LINQ's left outer join, which is more efficient:
Dim query = From rServ In dt_from_server.AsEnumerable()
            Group Join rClient In Dt_from_client.AsEnumerable()
                On New With {
                    Key .BarCode = rServ.Field(Of String)("BARCODE").Trim(),
                    Key .ItemCode = rServ.Field(Of String)("ITEM_CODE").Trim()
                } Equals New With {
                    Key .BarCode = rClient.Field(Of String)("BARCODE").Trim(),
                    Key .ItemCode = rClient.Field(Of String)("ITEM_CODE").Trim()
                } Into Group
            From client In Group.DefaultIfEmpty()
            Select New With {.ServerRow = rServ, .InsertRow = client Is Nothing}

Dim insertOrUpdateRows = query.ToLookup(Function(x) x.InsertRow, Function(x) x.ServerRow)
Dim insertRows = insertOrUpdateRows(True).CopyToDataTable()  ' CopyToDataTable redundant if you process rows immediately now
Dim updateRows = insertOrUpdateRows(False).CopyToDataTable() ' CopyToDataTable redundant if you process rows immediately now
But in general the most scalable and efficient approach would be not to load everything into memory at once and process it all, but to use database paging (or a stored procedure) to process only part of it in memory at a time; otherwise it is likely that you will encounter an OutOfMemoryException sooner or later.
C# as requested:
var query = from rServ in dt_from_server.AsEnumerable()
            join rClient in Dt_from_client.AsEnumerable()
                on new { BarCode = rServ.Field<string>("BARCODE").Trim(), ItemCode = rServ.Field<string>("ITEM_CODE").Trim() }
                equals new { BarCode = rClient.Field<string>("BARCODE").Trim(), ItemCode = rClient.Field<string>("ITEM_CODE").Trim() }
                into clientGroup
            from client in clientGroup.DefaultIfEmpty()
            select new { ServerRow = rServ, InsertRow = client == null };

var insertOrUpdateRows = query.ToLookup(x => x.InsertRow, x => x.ServerRow);
var insertRows = insertOrUpdateRows[true].CopyToDataTable();   // CopyToDataTable redundant if you process rows immediately now
var updateRows = insertOrUpdateRows[false].CopyToDataTable();  // CopyToDataTable redundant if you process rows immediately now
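As a rough illustration of the database-paging idea above, the client could pull the server rows in batches using keyset paging instead of materializing everything at once. This assumes a SQL Server source with an indexed key column; table, column and parameter names are hypothetical:

-- Fetch the next batch of rows in the requested period, resuming after the
-- last key the client has already processed.
SELECT TOP (10000) *
FROM dbo.ServerTable
WHERE CreatedDateTime >= @FromDate
  AND Id > @LastProcessedId
ORDER BY Id;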

Merge statement optimization

I have two tables in SQL Server, one of which is the source for a MERGE operation into the other.
The source table has 30 million records.
The target table has 180 million records. Both tables have 227 columns.
I do have SSIS, but I'm told a MERGE statement is the better option in this case. Below is a shortened version of it:
;WITH MySource AS (
    SELECT * FROM [STAGE].[dbo].[STAGE_TABLE]
)
MERGE [EDW].[dbo].[TARGET_TABLE] AS MyTarget
USING MySource
    ON MySource.[ID_FIELD] = MyTarget.[ID_FIELD]
    AND MySource.[LoadDate] >= MyTarget.[LoadDate]
WHEN MATCHED THEN UPDATE SET
    <<Target Column>> = MySource.<<Source Column>> -- 227 columns
WHEN NOT MATCHED THEN INSERT
(
    [ID_FIELD],
    [LoadDate],
    <<225 other columns>>
)
VALUES (
    MySource.[ID_FIELD],
    MySource.[LoadDate],
    MySource.<<225 other columns>>
);
The only change I made to the script above is truncating the list of columns to keep the code block short.
My problem is that the execution hangs. The profiler screen shows a CXPACKET suspension with the description: cwaitpipenewrow, node=2.
How do I troubleshoot this? Thank you.
CXPACKET waits with a suspended state generally mean that threads which have already completed their work are waiting on other threads that have not finished yet.
Please check the links below. The query needs to update up to 1 billion values in the table, so slow-running queries are to be expected.
https://dba.stackexchange.com/questions/96346/cxpacket-suspended-and-null-wait-type
https://www.sqlshack.com/troubleshooting-the-cxpacket-wait-type-in-sql-server/
Hope these articles might help you debug.
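As a starting point for troubleshooting, you could look at what the individual worker threads of the MERGE session are actually waiting on; long CXPACKET waits concentrated on a few tasks usually point at a skewed or blocked branch of the parallel plan. The session_id below is hypothetical, replace it with the one running the MERGE:

SELECT r.session_id,
       r.status,
       r.wait_type,
       r.wait_time,
       wt.wait_type        AS task_wait_type,
       wt.wait_duration_ms,
       wt.resource_description
FROM sys.dm_exec_requests r
JOIN sys.dm_os_waiting_tasks wt ON wt.session_id = r.session_id
WHERE r.session_id = 55;  -- hypothetical session id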

SQL Server - Multiple select queries hit performance

Recently I ran into an issue where multiple concurrent client requests were causing a performance problem in the database. I set up a test scenario and, as it turned out, when I run the same SELECT query 6 to 7 times concurrently (it gets worse with more), performance degrades and execution takes a long time. This is the query:
SELECT TOP (100) COUNT(DISTINCT([Doc_Number])) AS "Expression"
FROM (
    SELECT *
    FROM "dbo"."Dummy_Table" "table_alias"
    WHERE ((CAST("table_alias"."ID" AS NVARCHAR)) NOT IN
        (
            SELECT "PrimaryKey" AS ExceptionKey
            FROM dbo.exceptions inner_exceptionStatus
            LEFT JOIN dbo.Workflow inner_workflowStates ON
                (inner_exceptionStatus."Status" = inner_workflowStates."UUID" AND
                 inner_exceptionStatus."UUID" = 'CA1662D6-73A2-4692-A765-E7E3EDB66062')
            WHERE ("inner_workflowStates"."RemoveFromRecordSet" = 1 AND
                   "inner_workflowStates"."IsDeleted" = 0) AND
                  ("inner_exceptionStatus"."IsArchived" IS NULL OR
                   "inner_exceptionStatus"."IsArchived" = 0)
        ))
) wrapperQuery
When the query runs alone it takes around 1 second of execution time. But if we run it in parallel, each query takes a weirdly long amount of time or leads to a timeout.
The only thing that bothers me here is that a SELECT query should be non-blocking, and even with shared locks the queries should get along easily.
I am not sure if there is anything wrong in the query that makes the situation worse.
Any help is deeply appreciated!
Try this way
SELECT Count(DISTINCT( [Doc_Number] )) AS Expression
FROM dbo.Dummy_Table table_alias
WHERE NOT EXISTS (SELECT 1
FROM dbo.exceptions inner_exceptionStatus
INNER JOIN dbo.Workflow inner_workflowStates
ON ( inner_exceptionStatus.Status = inner_workflowStates.UUID
AND inner_exceptionStatus.UUID = 'CA1662D6-73A2-4692-A765-E7E3EDB66062' )
WHERE inner_workflowStates.RemoveFromRecordSet = 1
AND inner_workflowStates.IsDeleted = 0
AND ( inner_exceptionStatus.IsArchived IS NULL
OR inner_exceptionStatus.IsArchived = 0 )
AND table_alias.ID = PrimaryKey)
I made a couple of changes:
Changed NOT IN to NOT EXISTS.
Removed the CAST on "table_alias"."ID", because converting the column prevents any index on that column from being used. If the conversion is really required then add it back.
Removed TOP (100): since there is no GROUP BY, the query returns a single record anyway.
If the query is still running slow, post the execution plan and make sure the statistics are up to date.
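For reference, a minimal way to refresh statistics on the tables involved (table names taken from the query above):

UPDATE STATISTICS dbo.Dummy_Table;
UPDATE STATISTICS dbo.exceptions;
UPDATE STATISTICS dbo.Workflow;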
You can simplify your query like this:
SELECT COUNT(DISTINCT(Doc_Number)) AS Expression
FROM dbo.Dummy_Table dmy
WHERE not exists
(
SELECT *
FROM dbo.exceptions ies
INNER JOIN dbo.Workflow iws ON ies.Status= iws.UUID AND ies.UUID= 'CA1662D6-73A2-4692-A765-E7E3EDB66062'
WHERE iws.RemoveFromRecordSet = 1 AND iws.IsDeleted = 0 AND (ies.IsArchived IS NULL OR ies.IsArchived = 0)
and dmy.ID=PrimaryKey
)
As prdp says:
Changed NOT IN to NOT EXISTS.
Removed the CAST on "table_alias"."ID", because converting the column prevents any index on that column from being used. If the conversion is really required then add it back.
Removed TOP (100): since there is no GROUP BY, the query returns a single record anyway.
I add:
Removed your derived table wrapperQuery.
You can use an INNER JOIN, because the WHERE clause tests RemoveFromRecordSet = 1, which already removes the NULL rows a LEFT JOIN would produce.
Removed unnecessary quotes, brackets and parentheses in the WHERE clause.
