Replaying outstanding Snowpipe notifications/messages in Snowflake

When a pipe is re-created, there is a chance of missing some notifications. Is there any way to replay these missed notifications? Refreshing the pipe is dangerous (so not an option): the load history is lost when the pipe is re-created, so a refresh could ingest the same files twice and create duplicate records.
Snowflake has documented a process for re-creating pipes with automated data loading (link). Unfortunately, any new notifications arriving between step 1 (pause the pipe) and step 3 (re-create the pipe) can be missed. Automating the process with a procedure shrinks that window but does not eliminate it; I have confirmed this with multiple tests. Even without pausing the previous pipe, there's still a slim chance for this to happen.
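For reference, the documented re-creation flow boils down to roughly the following (a sketch only; the pipe name and definition are placeholders rather than the exact commands from the linked page):
-- Step 1: pause the existing pipe so it stops pulling from the notification queue
ALTER PIPE my_db.my_schema.my_pipe SET PIPE_EXECUTION_PAUSED = TRUE;
-- Step 2: wait until the pipe reports no pending work
SELECT SYSTEM$PIPE_STATUS('my_db.my_schema.my_pipe');
-- Step 3: re-create the pipe (this is where the notification gap can open)
CREATE OR REPLACE PIPE my_db.my_schema.my_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO my_db.my_schema.my_table
  FROM @my_db.my_schema.my_stage;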
However, Snowflake is aware of the notifications, as the notification queue is separate from the pipes (and shared for the entire account). But the notifications received at the "wrong" time are just never processed (which I guess makes sense if there's no active pipe to process them at the time).
I think we can see those notifications in the numOutstandingMessagesOnChannel property of the pipe status, but I can't find much more information about this, nor how to get those notifications processed. I think they might just become lost when the pipe is replaced. 😞
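For what it's worth, that property can be inspected from SQL, since SYSTEM$PIPE_STATUS returns a JSON string (the pipe name is a placeholder):
SELECT PARSE_JSON(SYSTEM$PIPE_STATUS('my_db.my_schema.my_pipe')):numOutstandingMessagesOnChannel::INT
  AS outstanding_messages;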
Note: This is related to another question I asked about preserving the load history when re-creating pipes in Snowflake (link).

Assuming there's no way to replay outstanding notifications, I've instead created a procedure that detects files that failed to load. A benefit of this approach is that it can catch any file that failed to load for any reason, not only because of missed notifications.
The procedure can be called like this:
CALL verify_pipe_load(
  'my_db.my_schema.my_pipe',  -- Pipe name
  'my_db.my_schema.my_stage', -- Stage name
  'my_db.my_schema.my_table', -- Table name
  '/YYYY/MM/DD/HH/',          -- File prefix
  'YYYY-MM-DD',               -- Start time for the loads
  'ERROR'                     -- Mode
);
Here's how it works, at a high level:
First, it lists all the files in the stage that match the specified prefix (using the LIST command), excluding files modified too recently, to give Snowpipe time to finish ingesting them.
Then, out of those files, it finds the ones that have no record in COPY_HISTORY.
Finally, it handles those missing file loads in one of three ways, depending on the mode:
The 'ERROR' mode will abort the procedure by throwing an exception. This is useful for continuously monitoring pipes and making sure no files are missed. Just hook it up to your automation tool of choice (we use DBT + DBT Cloud), or schedule it with a Snowflake task as sketched after the procedure code below.
The 'INGEST' mode will automatically re-queue the files for ingestion by Snowpipe using the REFRESH command for those specific files only.
The 'RETURN' mode will simply return the list of files in the response.
Here is the code for the procedure:
-- Returns a list of files missing from the destination table (separated by new lines).
-- Returns NULL if there are no missing files.
CREATE OR REPLACE PROCEDURE verify_pipe_load(
-- The FQN of the pipe (used to auto ingest):
PIPE_FQN STRING,
-- Stage to get the files from (same as the pipe definition):
STAGE_NAME STRING,
-- Destination table FQN (same as the pipe definition):
TABLE_FQN STRING,
-- File prefix (to filter files):
-- This should be based on a timestamp (ex: /YYYY/MM/DD/HH/)
-- in order to restrict files to a specific time interval
PREFIX STRING,
-- The time to get the loaded files from (should match the prefix):
START_TIME STRING,
-- What to do with the missing files (if any):
-- 'RETURN': Return the list of missing files.
-- 'INGEST': Automatically ingest the missing files (and return the list).
-- 'ERROR': Make the procedure fail by throwing an exception.
MODE STRING
)
RETURNS STRING
LANGUAGE JAVASCRIPT
EXECUTE AS CALLER
AS
$$
MODE = MODE.toUpperCase();
if (!['RETURN', 'INGEST', 'ERROR'].includes(MODE)) {
throw `Exception: Invalid mode '${MODE}'. Must be one of 'RETURN', 'INGEST' or 'ERROR'`;
}
let tableDB = TABLE_FQN.split('.')[0];
let [pipeDB, pipeSchema, pipeName] = PIPE_FQN.split('.')
.map(name => name.startsWith('"') && name.endsWith('"')
? name.slice(1, -1)
: name.toUpperCase()
);
let listQueryId = snowflake.execute({sqlText: `
LIST @${STAGE_NAME}${PREFIX};
`}).getQueryId();
let missingFiles = snowflake.execute({sqlText: `
WITH staged_files AS (
SELECT
"name" AS name,
TO_TIMESTAMP_NTZ(
"last_modified",
'DY, DD MON YYYY HH24:MI:SS GMT'
) AS last_modified,
-- Add a minute per GB, to account for larger file size = longer ingest time
ROUND("size" / 1024 / 1024 / 1024) AS ingest_delay,
-- Estimate the time by which the ingest should be done (default 5 minutes)
DATEADD(minute, 5 + ingest_delay, last_modified) AS ingest_done_ts
FROM TABLE(RESULT_SCAN('${listQueryId}'))
-- Ignore files that may not be done being ingested yet
WHERE ingest_done_ts < CONVERT_TIMEZONE('UTC', CURRENT_TIMESTAMP())::TIMESTAMP_NTZ
), loaded_files AS (
SELECT stage_location || file_name AS name
FROM TABLE(
${tableDB}.information_schema.copy_history(
table_name => '${TABLE_FQN}',
start_time => '${START_TIME}'::TIMESTAMP_LTZ
)
)
WHERE pipe_catalog_name = '${pipeDB}'
AND pipe_schema_name = '${pipeSchema}'
AND pipe_name = '${pipeName}'
), stage AS (
SELECT DISTINCT stage_location
FROM TABLE(
${tableDB}.information_schema.copy_history(
table_name => '${TABLE_FQN}',
start_time => '${START_TIME}'::TIMESTAMP_LTZ
)
)
WHERE pipe_catalog_name = '${pipeDB}'
AND pipe_schema_name = '${pipeSchema}'
AND pipe_name = '${pipeName}'
), missing_files AS (
SELECT REPLACE(name, stage_location) AS prefix
FROM staged_files
CROSS JOIN stage
WHERE name NOT IN (
SELECT name FROM loaded_files
)
)
SELECT LISTAGG(prefix, '\n') AS "missing_files"
FROM missing_files;
`});
if (!missingFiles.next()) return null;
missingFiles = missingFiles.getColumnValue('missing_files');
if (missingFiles.length == 0) return null;
if (MODE == 'ERROR') {
throw `Exception: Found missing files:\n'${missingFiles}'`;
}
if (MODE == 'INGEST') {
missingFiles
.split('\n')
.forEach(file => snowflake.execute({sqlText: `
ALTER PIPE ${PIPE_FQN} REFRESH prefix='${file}';
`}));
}
return missingFiles;
$$
;
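As mentioned under the 'ERROR' mode, the check can also be scheduled inside Snowflake itself with a task that calls the procedure. This is only a sketch; the task name, warehouse and schedule are assumptions, and in practice the prefix and start time would be computed for the current period rather than hard-coded:
CREATE OR REPLACE TASK monitor_pipe_load   -- task name assumed
  WAREHOUSE = my_wh                        -- warehouse assumed
  SCHEDULE = '60 MINUTE'
AS
  CALL verify_pipe_load(
    'my_db.my_schema.my_pipe',  -- Pipe name
    'my_db.my_schema.my_stage', -- Stage name
    'my_db.my_schema.my_table', -- Table name
    '/YYYY/MM/DD/HH/',          -- File prefix
    'YYYY-MM-DD',               -- Start time for the loads
    'ERROR'                     -- Mode: fail the task if files are missing
  );

-- Tasks are created suspended; resume it to start the schedule.
ALTER TASK monitor_pipe_load RESUME;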

Related

Streams + tasks missing inserts?

We've set up a stream on a table that is continuously loaded via Snowpipe.
We're consuming this data with a task that runs every minute and merges into another table. Since duplicate keys are possible, we use a ROW_NUMBER() window function ordered by the file-created timestamp descending and keep only row_num = 1. This way we always get the latest insert.
Initially we used a standard task with the merge statement, but we noticed in some instances that, since Snowpipe does not guarantee loading files in the order they were staged, we were updating rows with older data. So in the WHEN MATCHED clause we added a condition to only update the row when the incoming file-created timestamp is greater than the existing one.
However, since we did that, reconciliation checks show that some new inserts are missing. I don't know for sure why changing the MATCHED clause would interfere with the NOT MATCHED clause.
My theory was that the extra clause added a bit of time to the task run, so that some runs were skipped or the next run started almost immediately after the last one completed, and the missing rows were caught in the middle: the stream offset advanced before they could be consumed.
So we changed the task to call a stored procedure that uses an explicit transaction, because the docs seem to suggest that a transaction will lock the stream. However, even with this, we can see that new inserts are still missing. We're talking very small numbers, e.g. 8 out of 100,000s.
Any ideas what might be happening?
Example task code below (not the sp version)
WAREHOUSE = TASK_WH
SCHEDULE = '1 minute'
WHEN SYSTEM$stream_has_data('my_stream')
AS
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED AND ms.file_created >= pd.file_created THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....
;
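For reference, the stored-procedure variant mentioned above (wrapping the MERGE in an explicit transaction) could look roughly like the sketch below. The procedure name is made up, the my_view join and count columns are omitted, and the point is simply that the stream offset only advances when the transaction commits:
CREATE OR REPLACE PROCEDURE merge_my_stream()
RETURNS STRING
LANGUAGE SQL
AS
$$
BEGIN
  BEGIN TRANSACTION;
  MERGE INTO processed_data pd USING (
    SELECT ms.*
    FROM my_stream ms
    QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY file_created DESC) = 1
  ) ms ON ms.id = pd.id
  WHEN NOT MATCHED THEN INSERT (col1, col2, col3)
    VALUES (ms.col1, ms.col2, ms.col3)
  WHEN MATCHED AND ms.file_created >= pd.file_created THEN
    UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3;
  COMMIT;
  RETURN 'done';
END;
$$;
-- and the task body becomes: CALL merge_my_stream();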
I am not fully sure what is going wrong here, but there is a recommendation Snowflake gives somewhere in its documentation about file-created times: the file-created timestamp is calculated by the cloud service and may be a bit different from what you expect. There is another recommendation related to Snowpipe and data ingestion: the queue service takes about a minute to consume data from the pipe, and if a lot of data flows in within that minute, you may end up with this issue. Review your implementation and test whether pushing data at one-minute intervals solves the problem, and don't rely on the file-created time.
The condition "AND ms.file_created >= pd.file_created" seems to be added as a mechanism to avoid updating the same row multiple times.
An alternative approach could be to use IS DISTINCT FROM to compare the source against the target columns (except id):
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED
AND (pd.col1, pd.col2,..., pd.coln) IS DISTINCT FROM (ms.col1, ms.col2,..., ms.coln)
THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....;
This approach will also prevent updating a row when nothing has changed.
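The reason to prefer IS DISTINCT FROM over a plain <> comparison here is NULL handling; a quick illustration:
SELECT NULL <> 1               AS plain_compare,     -- NULL (unknown), so the row would not be updated
       NULL IS DISTINCT FROM 1 AS null_safe_compare; -- TRUE, so the change is detected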

How to view timestamp when pipe finished copying data from stage to table?

I've created a pipe from an S3 stage, and with a Python script I'm recording the timestamps of when the data from a streaming service is written into file batches. I would also like to add the timestamp of when the files were actually copied into the table from the S3 stage. I've found some documentation regarding the PIPE_USAGE_HISTORY function, but although I've already run quite a few tests over the past days, the query below returns an empty table. What am I doing wrong?
select * from table(information_schema.pipe_usage_history(
date_range_start=>dateadd('day',-14,current_date()),
date_range_end=>current_date())
)
I found the answer. There is another table function I should be using: copy_history.
The above query would be rewritten as follows:
select * from table(information_schema.copy_history(
table_name => '{replace with your schema.table}',
start_time => dateadd(days, -14, current_timestamp()),
end_time => current_timestamp())
)
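COPY_HISTORY also exposes the per-file completion time directly, so the timestamp asked about above can be read from the LAST_LOAD_TIME column (the table name is still a placeholder):
select file_name, last_load_time, row_count, status
from table(information_schema.copy_history(
  table_name => '{replace with your schema.table}',
  start_time => dateadd(days, -14, current_timestamp())))
order by last_load_time desc;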

How do I set the correct transaction level?

I am using Dapper on ADO.NET. So at present I am doing the following:
using (IDbConnection conn = new SqlConnection("MyConnectionString"))
{
    conn.Open();
    using (IDbTransaction transaction = conn.BeginTransaction())
    {
        // ...
However, there are various transaction isolation levels that can be set (I believe these are the available settings).
My first question is how do I set the transaction level (where I am using Dapper)?
My second question is what is the correct level for each of the following cases? In each of these cases we have multiple instances of a web worker (Azure) service running that will be hitting the DB at the same time.
I need to run monthly charges on subscriptions. So in a transaction I need to read a record and if it's due for a charge create the invoice record and mark the record as processed. Any other read of that record for the same purpose needs to fail. But any other reads of that record that are just using it to verify that it is active need to succeed.
So what transaction do I use for the access that will be updating the processed column? And what transaction do I use for the other access that just needs to verify that the record is active?
In this case it's fine if a conflict causes the charge to not be run (we'll get it the next day). But it is critical that we not charge someone twice. And it is critical that the read to verify that the record is active succeed immediately while the other operation is in its transaction.
I need to update a record where I am setting just a couple of columns. One use case is I set a new password hash for a user record. It's fine if other access occurs during this except for deleting the record (I think that's the only problem use case). If another web service is also updating that's the user's problem for doing this in 2 places simultaneously.
But it's key that the record stays consistent. And this includes the use case of "set NumUses = NumUses + @ParamNum", so it needs to treat the read, calculation, and write of the column value as an atomic action. And if I am setting 3 column values, they all get written together.
1) Assuming that the invoicing process is an SP with multiple statements, your best bet is to create a separate "lock" table to store the fact that the invoicing job is already running, e.g.:
CREATE TABLE InvoicingJob( JobStarted DATETIME, IsRunning BIT NOT NULL )
-- Table will only ever have one record
INSERT INTO InvoicingJob
SELECT NULL, 0
EXEC InvoicingProcess
ALTER PROCEDURE InvoicingProcess
AS
BEGIN
DECLARE @InvoicingJob TABLE( IsRunning BIT )
-- Try to acquire lock
UPDATE InvoicingJob WITH( TABLOCK )
SET JobStarted = GETDATE(), IsRunning = 1
OUTPUT INSERTED.IsRunning INTO @InvoicingJob( IsRunning )
WHERE IsRunning = 0
-- job has been running for more than a day i.e. likely crashed without releasing the lock
-- OR ( IsRunning = 1 AND JobStarted <= DATEADD( DAY, -1, GETDATE()))
IF NOT EXISTS( SELECT * FROM @InvoicingJob )
BEGIN
PRINT 'Another Job is already running'
RETURN
END
ELSE
RAISERROR( 'Start Job', 0, 0 ) WITH NOWAIT
-- Do invoicing tasks
WAITFOR DELAY '00:01:00' -- to simulate execution time
-- Release lock
UPDATE InvoicingJob
SET IsRunning = 0
END
2) Read about how transactions work: https://learn.microsoft.com/en-us/sql/t-sql/language-elements/transactions-transact-sql?view=sql-server-2017
https://learn.microsoft.com/en-us/sql/t-sql/statements/set-transaction-isolation-level-transact-sql?view=sql-server-2017
Your second question is quite broad.
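On the second case specifically: a single UPDATE statement already performs its read, calculation and write atomically, even at the default READ COMMITTED level, so no special isolation level is needed for the NumUses increment. A minimal sketch of the SQL you would send from Dapper (the table name and password-hash column are assumed, not taken from the question's schema):
-- The increment is computed and written inside one statement, so concurrent
-- callers cannot interleave between the read and the write of NumUses.
UPDATE dbo.Users                        -- table name assumed
SET NumUses      = NumUses + @ParamNum,
    PasswordHash = @NewPasswordHash     -- column name assumed
WHERE UserId = @UserId;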

Concurrent updates on a single staging table

I am developing a service application (VB.NET) which pulls information from a source and imports it to a SQL Server database
The process can involve one or more “batches” of information at a time (the number and size of batches in any given “run” is arbitrary based on a queue maintained elsewhere)
Each batch is assigned an identifier (BatchID) so that the set of records in the staging table which belong to that batch can be easily identified
The ETL process for each batch is sequential in nature; the raw data is bulk inserted to a staging table and then a series of stored procedures perform updates on a number of columns until the data is ready for import
These stored procedures are called in sequence by the service and are generally simple UPDATE commands
Each SP takes the BatchID as an input parameter and specifies this as the criterion for inclusion in each UPDATE, à la:
UPDATE dbo.stgTable
SET FieldOne = (CASE
WHEN S.[FieldOne] IS NULL
THEN T1.FieldOne
ELSE
S.[FieldOne]
END
)
, FieldTwo = (CASE
WHEN S.[FieldTwo] IS NULL
THEN T2.FieldTwo
ELSE
S.[FieldTwo]
END
)
FROM dbo.stgTable AS S
LEFT JOIN dbo.someTable T1 ON S.[SomeField] = T1.[SomeField]
LEFT JOIN dbo.someOtherTable T2 ON S.[SomeOtherField] = T2.[SomeOtherField]
WHERE S.BatchID = @BatchID
Some of the SP’s also refer to functions (both scalar and table-valued) and all incorporate a TRY / CATCH structure so I can tell from the output parameters if a particular SP has failed
The final SP is a MERGE operation to move the enriched data from the staging table into the production table (again, specific to the provided BatchID)
I would like to thread this process in the service so that a large batch doesn’t hold up smaller batches in the same run
I figured there should be no issue with this as no thread could ever attempt to process records in the staging table that could be targeted by another thread (no race conditions)
However, I’ve noticed that, when I do thread the process, arbitrary steps on arbitrary batches seem to fail (but no error is recorded from the output of the SP)
The failures are inconsistent; e.g. sometimes batches 2, 3 & 5 will fail (on SP’s 3, 5 & 7 respectively), other times it will be different batches, each at different steps in the sequence
When I import the batches sequentially, they all import perfectly fine – always!
I can’t figure out if this is an issue on the service side (VB.NET) – e.g. is each thread opening an independent connection to the DB or could they be sharing the same one (I’ve set it up that each one should be independent…)
Or if the issue is on the SQL Server side – e.g. is it not feasible for concurrent SP calls to manipulate data on the same table, even though, as described above, no thread/batch will ever touch records belonging to another thread/batch
(On this point – I tried using CTE’s to create subsets of data from the staging table based on the BatchID and apply the UPDATE’s to those instead but the exact same behaviour occurred)
WITH CTE AS (
SELECT *
FROM dbo.stgTable
WHERE BatchID = @BatchID
)
UPDATE CTE...
Or maybe the problem is that multiple SP’s are calling the same function at the same time and that is why one or more of them are failing (I don’t see why that would be a problem though?)
Any suggestions would be very gratefully received – I’ve been playing around with this all week and I can’t for the life of me determine precisely what the problem might be!
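One thing worth checking, given that each SP reports failure only through its output parameters: make sure the CATCH block actually captures the error, otherwise a deadlock victim (error 1205) between two concurrent batch updates looks like a step that silently did nothing. The sketch below is illustrative only; the procedure shape and update logic are assumptions, not the original SPs:
CREATE OR ALTER PROCEDURE dbo.usp_UpdateStep_Example  -- name assumed
    @BatchID      INT,
    @ErrorNumber  INT            = NULL OUTPUT,
    @ErrorMessage NVARCHAR(4000) = NULL OUTPUT
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRY
        UPDATE S
        SET FieldOne = COALESCE(S.FieldOne, T1.FieldOne)
        FROM dbo.stgTable AS S
        LEFT JOIN dbo.someTable AS T1 ON S.SomeField = T1.SomeField
        WHERE S.BatchID = @BatchID;
    END TRY
    BEGIN CATCH
        -- 1205 = chosen as a deadlock victim; surfacing the number makes
        -- "random" step failures diagnosable from the service side.
        SET @ErrorNumber  = ERROR_NUMBER();
        SET @ErrorMessage = ERROR_MESSAGE();
    END CATCH
END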
Update to include sample service code
This is the code in the service class where the threading is initiated
For Each ItemInScope In ScopedItems
With ItemInScope
_batches(_batchCount) = New Batch(.Parameter1, .Parameter2, .ParameterX)
With _batches(_batchCount)
If .Initiate() Then
_doneEvents(_batchCount) = New ManualResetEvent(False)
Dim _batchWriter = New BatchWriter(_batches(_batchCount), _doneEvents(_batchCount))
ThreadPool.QueueUserWorkItem(AddressOf _batchWriter.ThreadPoolCallBack, _batchCount)
Else
_doneEvents(_batchCount) = New ManualResetEvent(True)
End If
End With
End With
_batchCount += 1
Next
WaitHandle.WaitAll(_doneEvents)
Here is the BatchWriter class
Public Class BatchWriter
Private _batch As Batch
Private _doneEvent As ManualResetEvent
Public Sub New(ByRef batch As Batch, ByVal doneEvent As ManualResetEvent)
_batch = batch
_doneEvent = doneEvent
End Sub
Public Sub ThreadPoolCallBack(ByVal threadContext As Object)
Dim threadIndex As Integer = CType(threadContext, Integer)
With _batch
If .PrepareBatch() Then
If .WriteTextOutput() Then
.ProcessBatch()
End If
End If
End With
_doneEvent.Set()
End Sub
End Class
The PrepareBatch and WriteTextOutput functions of the Batch class are entirely contained within the service application - it is only the ProcessBatch function where the service starts to interact with the database (via Entity Framework)
Here is that function
Public Sub ProcessScan()
' Confirm that a file is ready for import
If My.Computer.FileSystem.FileExists(_filePath) Then
Dim dbModel As New DatabaseModel
With dbModel
' Pass the batch to the staging table in the database
If .StageBatch(_batchID, _filePath) Then
' First update (results recorded for event log)
If .UpdateOne(_batchID) Then
_stepOneUpdates = .RetUpdates.Value
' Second update (results recorded for event log)
If .UpdateTwo(_batchID) Then
_stepTwoUpdates = .RetUpdates.Value
' Third update (results recorded for event log)
If .UpdateThree(_batchID) Then
_stepThreeUpdates = .RetUpdates.Value
....
End Sub

Multiple file upload blueimp “sequentialUploads = true” not working

I have to store multiple files in a database. In order to do that I am using the jQuery Multiple File Upload control written by blueimp. On the server side I use ASP.NET with an .ashx handler, which accepts each file and stores it in the DB.
In my DB I have a stored procedure which returns the document id of the currently uploaded file and then stores it into the DB.
This is a code fragment from it which shows that getting this id is a 3 step procedure:
SELECT TOP 1 @docIdOut = tblFile.docId,
             @creator = creator,
             @created = created
FROM [tblFile]
WHERE friendlyname LIKE @friendlyname
  AND @categoryId = categoryId
  AND @objectId = objectId
  AND @objectTypeId = objectTypeId
ORDER BY tblFile.version DESC

IF @docIdOut = 0
BEGIN
    -- get the next docId
    SELECT TOP 1 @docIdOut = docId + 1 FROM [tblFile] ORDER BY docId DESC
END

IF @docIdOut = 0
BEGIN
    -- set to 1
    SET @docIdOut = 1
END
If more than one call to that stored procedure is executed at the same time, there will be a problem due to inconsistency of the data; but if I add a transaction, the upload of some files gets cancelled.
https://dl.dropboxusercontent.com/u/13878369/2013-05-20_1731.png
Is there a way to call the stored procedure again with the same parameters when the execution is blocked by a transaction?
Another option is to use the blueimp plugin synchronously with the option "sequentialUploads = true".
This option works for all browsers except Firefox, and I read that it doesn't work in Safari either.
Any suggestions would be helpful. I tried enabling the option for selecting only one file at a time, which is not ideal, but it stops the DB problems (stored procedure) and saves the files correctly.
Thanks,
Best Regards,
Lyuubomir
set singleFileUploads: true, sequentialUploads: false in the main.js.
singleFileUploads: true means each file of a selection is uploaded using an individual request for XHR type uploads. You can then get the information for each individual file and call the stored procedure with the information you have just received.
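If the server side also needs to tolerate concurrent uploads, one common pattern (a sketch of an alternative, not the original procedure and nothing to do with the plugin) is to take the next id under UPDLOCK/HOLDLOCK inside a short transaction, so two concurrent calls cannot read the same value:
DECLARE @docIdOut INT;
BEGIN TRANSACTION;
-- UPDLOCK + HOLDLOCK keeps the top of the docId index locked until commit,
-- so a second concurrent call waits here instead of reading the same id.
SELECT TOP 1 @docIdOut = docId + 1
FROM [tblFile] WITH (UPDLOCK, HOLDLOCK)
ORDER BY docId DESC;
IF @docIdOut IS NULL
    SET @docIdOut = 1;  -- table was empty
-- ... INSERT the new file row here using @docIdOut ...
COMMIT TRANSACTION;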
