singleton pattern implementation in Snowflake? - snowflake-cloud-data-platform

singleton pattern implementation in Snowflake? - snowflake-cloud-data-platform

We need to implement some singleton pattern to ensure a stored procedure cannot be run several times simultaneously.
As I cannot see this functionality in place, I thought about implementing this via a "Lock" table.
We are in a "batch" environment so waiting a few seconds is no problem.
SHARED.LOCK(LOCK_NAME STRING NOT NULL PRIMARY KEY
,SESSION_ID STRING NOT NULL
,ACQUIRED_AT TIMESTAMP_NTZ
)
LOCK_NAME is forced to upper case and used as a Primary Key
SESSION_ID is the current session
ACQUIRED_AT is just useful information
I then create a stored proc to "acquire" the lock $LOCK_NAME that tries to update the lock record with its own session id as long as it is not "locked" already
UPDATE SHARED.LOCK
SET LOAD_ID = $LOAD_ID
,SESSION_ID = CURRENT_SESSION()
,ACQUIRED_AT = CURRENT_TIMESTAMP()
WHERE LOCK_NAME = $LOCK_NAME
AND SESSION_ID IS NULL;
To avoid Snowflake optimistic locking side effects, I would ensure that this stored procedure is not called as part of an explicit transaction.
I then check whether I successfully "acquired" this lock
SELECT 1
FROM SHARED.LOCK
WHERE LOCK_NAME = $LOCK_NAME
AND LOAD_ID = $SESSION_ID;
If I get a record, then I have the lock.
Otherwise, I could wait X seconds and try again later, up to a certain number of attempts.
Once I am done, I can release the lock with a simple Update statement
UPDATE SHARED.LOCK
SET SESSION_ID = NULL
,ACQUIRED_AT = NULL
WHERE LOCK_NAME = $LOCK_NAME
AND SESSION_ID = $SESSION_ID;
And of course we'll have to do something about locks not released within a certain amount of time or locked by a session that is not live anymore, etc...
I think this should work... but maybe there is a simpler way to implement a singleton in Snowflake?
Any better ideas?

Depending on requirements, if the stored procedure is going to be run on schedule TASK could be used, which has OVERLAP protection built-in:
CREATE OR REPLACE TASK my_task
WAREHOUSE = compute_wh
SCHEDULE = '1 minute'
ALLOW_OVERLAPPING_EXECUTION = FALSE
AS
CALL procedure_call();
CREATE TASK - ALLOW_OVERLAPPING_EXECUTION :
ALLOW_OVERLAPPING_EXECUTION = TRUE | FALSE
Specifies whether to allow multiple instances of the task tree to run concurrently
FALSE ensures only one instance of a particular tree of tasks is allowed to run at a time.
Demo:
CREATE TABLE log(id INT NOT NULL IDENTITY(1,1), d TIMESTAMP);
CREATE OR REPLACE procedure insert_log()
returns string
language javascript
execute as owner
as
$$
snowflake.execute ({sqlText: "INSERT INTO log (d) SELECT CURRENT_TIMESTAMP()"});
snowflake.execute ({sqlText: "CALL SYSTEM$WAIT(2, 'MINUTES')"});
return "Succeeded.";
$$
;
ALTER TASK my_task RESUME;
SELECT * FROM log;

Related

Streams + tasks missing inserts?

We've setup a stream on a table that is continuously loaded via snowpipe.
We're consuming this data with a task that runs every minute where we merge into another table. There is a possibility of duplicate keys so we use a ROW_NUMBER() window function, ordered by the file created timestamp descending where row_num=1. This way we always get the latest insert
Initially we used a standard task with the merge statement but we noticed that in some instances, since snowpipe does not guarantee loading in order of when the files were staged, we were updating rows with older data. As such, on the WHEN MATCHED section we added a condition so only when the file created ts > existing, to update the row
However, since we did that, reconciliation checks show that some new inserts are missing. I don't know for sure why changing the matched clause would interfere with the not matched clause.
My theory was that the extra clause added a bit of time to the task run where some runs were skipped or the next run happened almost immediately after the last one completed. The idea being that the missing rows were caught up in the middle and the offset changed before they could be consumed
As such, we changed the task to call a stored procedure which uses an explicit transaction. We did this because the docs seem to suggest that using a transaction will lock the stream. However even with this we can see that new inserts are still missing. We're talking very small numbers e.g. 8 out of 100,000s
Any ideas what might be happening?
Example task code below (not the sp version)
WAREHOUSE = TASK_WH
SCHEDULE = '1 minute'
WHEN SYSTEM$stream_has_data('my_stream')
AS
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED AND ms.file_created >= pd.file_created THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....
;

I am not fully sure what is going wrong here, but the file created time related recommendation is given by Snowflake somewhere. It suggest that the file created timestamp is calculated in cloud service and it may be bit different than you think. There is another recommendation related to snowpipe and data ingestion. The queue service takes a min to consume the data from pipe and if you have lot of data being flown inside with in a min, you may end up this issue. Look you implementation and simulate if pushing data in 1min interval solve that issue and don't rely on file create time.

The condition "AND ms.file_created >= pd.file_created" seems to be added as a mechanism to avoid updating the same row multiple times.
Alternative approach could be using IS DISTINCT FROM to compare source against target columns(except id):
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED
AND (pd.col1, pd.col2,..., pd.coln) IS DISTINCT FROM (ms.col1, ms.col2,..., ms.coln)
THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....;
This approach will also prevent updating row when nothing has changed.

How do I set the correct transaction level?

I am using Dapper on ADO.NET. So at present I am doing the following:
using (IDbConnection conn = new SqlConnection("MyConnectionString")))
{
conn.Open());
using (IDbTransaction transaction = conn.BeginTransaction())
{
// ...
However, there are various levels of transactions that can be set. I think this is the various settings.
My first question is how do I set the transaction level (where I am using Dapper)?
My second question is what is the correct level for each of the following cases? In each of these cases we have multiple instances of a web worker (Azure) service running that will be hitting the DB at the same time.
I need to run monthly charges on subscriptions. So in a transaction I need to read a record and if it's due for a charge create the invoice record and mark the record as processed. Any other read of that record for the same purpose needs to fail. But any other reads of that record that are just using it to verify that it is active need to succeed.
So what transaction do I use for the access that will be updating the processed column? And what transaction do I use for the other access that just needs to verify that the record is active?
In this case it's fine if a conflict causes the charge to not be run (we'll get it the next day). But it is critical that we not charge someone twice. And it is critical that the read to verify that the record is active succeed immediately while the other operation is in its transaction.
I need to update a record where I am setting just a couple of columns. One use case is I set a new password hash for a user record. It's fine if other access occurs during this except for deleting the record (I think that's the only problem use case). If another web service is also updating that's the user's problem for doing this in 2 places simultaneously.
But it's key that the record stay consistent. And this includes the use case of "set NumUses = NumUses + #ParamNum" so it needs to treat the read, calculation, write of the column value as an atomic action. And if I am setting 3 column values, they all get written together.

1) Assuming that Invoicing process is an SP with multiple statements your best bet is to create another "lock" table to store the fact that invoicing job is already running e.g.
CREATE TABLE InvoicingJob( JobStarted DATETIME, IsRunning BIT NOT NULL )
-- Table will only ever have one record
INSERT INTO InvoicingJob
SELECT NULL, 0
EXEC InvoicingProcess
ALTER PROCEDURE InvoicingProcess
AS
BEGIN
DECLARE #InvoicingJob TABLE( IsRunning BIT )
-- Try to aquire lock
UPDATE InvoicingJob WITH( TABLOCK )
SET JobStarted = GETDATE(), IsRunning = 1
OUTPUT INSERTED.IsRunning INTO #InvoicingJob( IsRunning )
WHERE IsRunning = 0
-- job has been running for more than a day i.e. likely crashed without releasing a lock
-- OR ( IsRunning = 1 AND JobStarted <= DATEADD( DAY, -1, GETDATE())
IF NOT EXISTS( SELECT * FROM #InvoicingJob )
BEGIN
PRINT 'Another Job is already running'
RETURN
END
ELSE
RAISERROR( 'Start Job', 0, 0 ) WITH NOWAIT
-- Do invoicing tasks
WAITFOR DELAY '00:01:00' -- to simulate execution time
-- Release lock
UPDATE InvoicingJob
SET IsRunning = 0
END
2) Read about how transactions work: https://learn.microsoft.com/en-us/sql/t-sql/language-elements/transactions-transact-sql?view=sql-server-2017
https://learn.microsoft.com/en-us/sql/t-sql/statements/set-transaction-isolation-level-transact-sql?view=sql-server-2017
You second question is quite broad.

SQL CLR Trigger - get source table

I am creating a DB synchronization engine using SQL CLR Triggers in Microsoft SQL Server 2012. These triggers do not call a stored procedure or function (and thereby have access to the INSERTED and DELETED pseudo-tables but do not have access to the ##procid).
Differences here, for reference.
This "sync engine" uses mapping tables to determine what the table and field maps are for this sync job. In order to determine the target table and fields (from my mapping table) I need to get the source table name from the trigger itself. I have come across many answers on Stack Overflow and other sites that say that this isn't possible. But, I've found one website that provides a clue:
Potential Solution:
using (SqlConnection lConnection = new SqlConnection(#"context connection=true")) {
SqlCommand cmd = new SqlCommand("SELECT object_name(resource_associated_entity_id) FROM sys.dm_tran_locks WHERE request_session_id = ##spid and resource_type = 'OBJECT'", lConnection);
cmd.CommandType = CommandType.Text;
var obj = cmd.ExecuteScalar();
}
This does in fact return the correct table name.
Question:
My question is, how reliable is this potential solution? Is the ##spid actually limited to this single trigger execution? Or is it possible that other simultaneous triggers will overlap within this process id? Will it stand up to multiple executions of the same and/or different triggers within the database?
From these sites, it seems the process Id is in fact limited to the open connection, which doesn't overlap: here, here, and here.
Will this be a safe method to get my source table?
Why?
As I've noticed similar questions, but all without a valid answer for my specific situation (except that one). Most of the comments on those sites ask "Why?", and in order to preempt that, here is why:
This synchronization engine operates on a single DB and can push changes to target tables, transforming the data with user-defined transformations, automatic source-to-target type casting and parsing and can even use the CSharpCodeProvider to execute methods also stored in those mapping tables for transforming data. It is already built, quite robust and has good performance metrics for what we are doing. I'm now trying to build it out to allow for 1:n table changes (including extension tables requiring the same Id as the 'master' table) and am trying to "genericise" the code. Previously each trigger had a "target table" definition hard coded in it and I was using my mapping tables to determine the source. Now I'd like to get the source table and use my mapping tables to determine all the target tables. This is used in a medium-load environment and pushes changes to a "Change Order Book" which a separate server process picks up to finish the CRUD operation.
Edit
As mentioned in the comments, the query listed above is quite "iffy". It will often (after a SQL Server restart, for example) return system objects like syscolpars or sysidxstats. But, it seems that in the dm_tran_locks table there's always an associated resource_type of 'RID' (Row ID) with the same object_name. My current query which works reliably so far is the following (will update if this changes or doesn't work under high load testing):
select t1.ObjectName FROM (
SELECT object_name(resource_associated_entity_id) as ObjectName
FROM sys.dm_tran_locks WHERE resource_type = 'OBJECT' and request_session_id = ##spid
) t1 inner join (
SELECT OBJECT_NAME(partitions.OBJECT_ID) as ObjectName
FROM sys.dm_tran_locks
INNER JOIN sys.partitions ON partitions.hobt_id = dm_tran_locks.resource_associated_entity_id
WHERE resource_type = 'RID'
) t2 on t1.ObjectName = t2.ObjectName
If this is always the case, I'll have to find that out during testing.

How reliable is this potential solution?
While I do not have time to set up a test case to show it not working, I find this approach (even taking into account the query in the Edit section) "iffy" (i.e. not guaranteed to always be reliable).
The main concerns are:
cascading (whether recursive or not) Trigger executions
User (i.e. Explicit / Implicit) transactions
Sub-processes (i.e. EXEC and sp_executesql)
These scenarios allow for multiple objects to be locked, all at the same time.
Is the ##SPID actually limited to this single trigger execution? Or is it possible that other simultaneous triggers will overlap within this process id?
and (from a comment on the question):
I think I can join my query up with the sys.partitions and get a dm_trans_lock that has a type of 'RID' with an object name that will match up to the one in my original query.
And here is why it shouldn't be entirely reliable: the Session ID (i.e. ##SPID) is constant for all of the requests on that Connection). So all sub-processes (i.e. EXEC calls, sp_executesql, Triggers, etc) will all be on the same ##SPID / session_id. So, between sub-processes and User Transactions, you can very easily get locks on multiple resources, all on the same Session ID.
The reason I say "resources" instead of "OBJECT" or even "RID" is that locks can occur on: rows, pages, keys, tables, schemas, stored procedures, the database itself, etc. More than one thing can be considered an "OBJECT", and it is possible that you will have page locks instead of row locks.
Will it stand up to multiple executions of the same and/or different triggers within the database?
As long as these executions occur in different Sessions, then they are a non-issue.
ALL THAT BEING SAID, I can see where simple testing would show that your current method is reliable. However, it should also be easy enough to add more detailed tests that include an explicit transaction that first does some DML on another table, or have a trigger on one table do some DML on one of these tables, etc.
Unfortunately, there is no built-in mechanism that provides the same functionality that ##PROCID does for T-SQL Triggers. I have come up with a scheme that should allow for getting the parent table for a SQLCLR Trigger (that takes into account these various issues), but haven't had a chance to test it out. It requires using a T-SQL trigger, set as the "first" trigger, to set info that can be discovered by the SQLCLR Trigger.
A simpler form can be constructed using CONTEXT_INFO, if you are not already using it for something else (and if you don't already have a "first" Trigger set). In this approach you would still create a T-SQL Trigger, and then set it as the "first" Trigger using sp_settriggerorder. In this Trigger you SET CONTEXT_INFO to the table name that is the parent of ##PROCID. You can then read CONTEXT_INFO() on a Context Connection in a SQLCLR Trigger. If there are multiple levels of Triggers then the value of CONTEXT INFO will get overwritten, so reading that value must be the first thing you do in each SQLCLR Trigger.

This is an old thread, but it is an FAQ and I think I have a better solution. Essentially it uses the schema of the inserted or deleted table to find the base table by doing a hash of the column names and comparing the hash with the hashes of tables with a CLR trigger on them.
Code snippet below - at some point I will probably put the whole solution on Git (it sends a message to Azure Service Bus when the trigger fires).
private const string colqry = "select top 1 * from inserted union all select top 1 * from deleted";
private const string hashqry = "WITH cols as ( "+
"select top 100000 c.object_id, column_id, c.[name] "+
"from sys.columns c "+
"JOIN sys.objects ot on (c.object_id= ot.parent_object_id and ot.type= 'TA') " +
"order by c.object_id, column_id ) "+
"SELECT s.[name] + '.' + o.[name] as 'TableName', CONVERT(NCHAR(32), HASHBYTES('MD5',STRING_AGG(CONVERT(NCHAR(32), HASHBYTES('MD5', cols.[name]), 2), '|')),2) as 'MD5Hash' " +
"FROM cols "+
"JOIN sys.objects o on (cols.object_id= o.object_id) "+
"JOIN sys.schemas s on (o.schema_id= s.schema_id) "+
"WHERE o.is_ms_shipped = 0 "+
"GROUP BY s.[name], o.[name]";
public static void trgSendSBMsg()
{
string table = "";
SqlCommand cmd;
SqlDataReader rdr;
SqlTriggerContext trigContxt = SqlContext.TriggerContext;
SqlPipe p = SqlContext.Pipe;
using (SqlConnection con = new SqlConnection("context connection=true"))
{
try
{
con.Open();
string tblhash = "";
using (cmd = new SqlCommand(colqry, con))
{
using (rdr = cmd.ExecuteReader(CommandBehavior.SingleResult))
{
if (rdr.Read())
{
MD5 hash = MD5.Create();
StringBuilder hashstr = new StringBuilder(250);
for (int i=0; i < rdr.FieldCount; i++)
{
if (i > 0) hashstr.Append("|");
hashstr.Append(GetMD5Hash(hash, rdr.GetName(i)));
}
tblhash = GetMD5Hash(hash, hashstr.ToString().ToUpper()).ToUpper();
}
rdr.Close();
}
}
using (cmd = new SqlCommand(hashqry, con))
{
using (rdr = cmd.ExecuteReader(CommandBehavior.SingleResult))
{
while (rdr.Read())
{
string hash = rdr.GetString(1).ToUpper();
if (hash == tblhash)
{
table = rdr.GetString(0);
break;
}
}
rdr.Close();
}
}
if (table.Length == 0)
{
p.Send("Error: Unable to find table that CLR trigger is on. Message not sent!");
return;
}
….
HTH

postgresql deadlock

Sometimes postgresql raise error deadlocks.
In trigger for table setted FOR UPDATE.
Table comment:
http://pastebin.com/L1a8dbn4
Log (INSERT sentences is cutted):
2012-01-26 17:21:06 MSK ERROR: deadlock detected
2012-01-26 17:21:06 MSK DETAIL: Process 2754 waits for ExclusiveLock on tuple (40224,15) of relation 735493 of database 734745; blocked by process 2053.
Process 2053 waits for ShareLock on transaction 25162240; blocked by process 2754.
Process 2754: INSERT INTO comment (user_id, content_id, reply_id, text) VALUES (1756235868, 935967, 11378142, 'text1') RETURNING comment.id;
Process 2053: INSERT INTO comment (user_id, content_id, reply_id, text) VALUES (4071267066, 935967, 11372945, 'text2') RETURNING comment.id;
2012-01-26 17:21:06 MSK HINT: See server log for query details.
2012-01-26 17:21:06 MSK CONTEXT: SQL statement "SELECT comments_count FROM content WHERE content.id = NEW.content_id FOR UPDATE"
PL/pgSQL function "increase_comment_counter" line 5 at SQL statement
2012-01-26 17:21:06 MSK STATEMENT: INSERT INTO comment (user_id, content_id, reply_id, text) VALUES (1756235868, 935967, 11378142, 'text1') RETURNING comment.id;
And trigger on table comment:
CREATE OR REPLACE FUNCTION increase_comment_counter() RETURNS TRIGGER AS $$
DECLARE
comments_count_var INTEGER;
BEGIN
SELECT INTO comments_count_var comments_count FROM content WHERE content.id = NEW.content_id FOR UPDATE;
UPDATE content SET comments_count = comments_count_var + 1, last_comment_dt = now() WHERE content.id = NEW.content_id;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER increase_comment_counter_trigger AFTER INSERT ON comment FOR EACH ROW EXECUTE PROCEDURE increase_comment_counter();
Why it can happens?
Thanks!

These are two comments being inserted with the same content_id. Merely inserting the comment will take out a SHARE lock on the content row, in order to stop another transaction deleting that row until the first transaction has completed.
However, the trigger then goes on to upgrade the lock to EXCLUSIVE, and this can be blocked by a concurrent transaction performing the same process. Consider the following sequence of events:
Txn 2754 Txn 2053
Insert Comment
Insert Comment
Lock Content#935967 SHARE
(performed by fkey)
Lock Content#935967 SHARE
(performed by fkey)
Trigger
Lock Content#935967 EXCLUSIVE
(blocks on 2053's share lock)
Trigger
Lock Content#935967 EXCLUSIVE
(blocks on 2754's share lock)
So- deadlock.
One solution is to immediately take an exclusive lock on the content row before inserting the comment. i.e.
SELECT 1 FROM content WHERE content.id = 935967 FOR UPDATE
INSERT INTO comment(.....)
Another solution is simply to avoid this "cached counts" pattern completely, except where you can prove it is necessary for performance. If so, consider keeping the cached count somewhere other than the content table-- e.g. a dedicated table for the counter. That will also cut down on the update traffic to the content table every time a comment gets added. Or maybe just re-select the count and use memcached in the application. There's no getting round the fact that wherever you store this cached count is going to be a choke point, it has to be updated safely.

Why does an UPDATE take much longer than a SELECT?

I have the following select statement that finishes almost instantly.
declare #weekending varchar(6)
set #weekending = 100103
select InvoicesCharges.orderaccnumber, Accountnumbersorders.accountnumber
from Accountnumbersorders, storeinformation, routeselecttable,InvoicesCharges, invoice
where InvoicesCharges.pubid = Accountnumbersorders.publication
and Accountnumbersorders.actype = 0
and Accountnumbersorders.valuezone = 'none'
and storeinformation.storeroutename = routeselecttable.istoreroutenumber
and storeinformation.storenumber = invoice.store_number
and InvoicesCharges.invoice_number = invoice.invoice_number
and convert(varchar(6),Invoice.bill_to,12) = #weekending
However, the equivalent update statement takes 1m40s
declare #weekending varchar(6)
set #weekending = 100103
update InvoicesCharges
set InvoicesCharges.orderaccnumber = Accountnumbersorders.accountnumber
from Accountnumbersorders, storeinformation, routeselecttable,InvoicesCharges, invoice
where InvoicesCharges.pubid = Accountnumbersorders.publication
and Accountnumbersorders.actype = 0
and dbo.Accountnumbersorders.valuezone = 'none'
and storeinformation.storeroutename = routeselecttable.istoreroutenumber
and storeinformation.storenumber = invoice.store_number
and InvoicesCharges.invoice_number = invoice.invoice_number
and convert(varchar(6),Invoice.bill_to,12) = #weekending
Even if I add:
and InvoicesCharges.orderaccnumber <> Accountnumbersorders.accountnumber
at the end of the update statement reducing the number of writes to zero, it takes the same amount of time.
Am I doing something wrong here? Why is there such a huge difference?

transaction log file writes
index updates
foreign key lookups
foreign key cascades
indexed views
computed columns
check constraints
locks
latches
lock escalation
snapshot isolation
DB mirroring
file growth
other processes reading/writing
page splits / unsuitable clustered index
forward pointer/row overflow events
poor indexes
statistics out of date
poor disk layout (eg one big RAID for everything)
Check constraints with UDFs that have table access
...
Although, the usual suspect is a trigger...
Also, your condition extra has no meaning: How does SQL Server know to ignore it? An update is still generated with most of the baggage... even the trigger will still fire. Locks must be held while rows are searched for the other conditions for example
Edited Sep 2011 and Feb 2012 with more options

The update has to lock and modify the data in the table, and also log the changes to the transaction log. The select does not have to do any of those things.

Because reading does not affect indices, triggers, and what have you?

In Slow servers or large database i usually use UPDATE DELAYED, that waits for a "break" to update the database itself.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight