SNOWFLAKE Schedule Task to run just once - snowflake-cloud-data-platform

I am looking for a way in Snowflake to schedule a task to run JUST ONCE.
Any suggestions? Some CRON implementations allow specifying a year, but I don't think Snowflake tasks allow that.

A task could run a Snowflake Scripting block whose last step drops or suspends the task itself:
CREATE OR REPLACE TASK SchemaName.MyTask
  WAREHOUSE = 'COMPUTE_WH'
  SCHEDULE = 'USING CRON 0 0 1 1 * UTC'
AS
BEGIN
    -- some code
    -- then either drop the task or suspend it (keep only one of the two):
    DROP TASK SchemaName.MyTask;
    -- ALTER TASK SchemaName.MyTask SUSPEND;
END;
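Note that a newly created task is suspended by default, so after creating it you still need to resume it once for the schedule to fire (using the name from the example above):
ALTER TASK SchemaName.MyTask RESUME;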

Related

Run parallel transaction while executing code in another session

Is there any possibility to run an update at a specific point in time, but in a different/parallel session? In the example below I want a specific update to be run at the moment I run WAITFOR. Currently I use the WAITFOR block to give myself time to switch to another SSMS (or other tool) window/tab and run the update manually while it waits for 10 seconds. Logically, the only thing that needs to happen is that a transaction is started at that point in time.
EXEC dbo.p_sync_from_accounts_ext_test @enable_snapshot_isolation = 1
    , @run_update_flag = NULL
    , @run_wait_for_10 = NULL
    , @acc = @acc;
WAITFOR DELAY '00:00:10'; -- execute the update in a parallel transaction here
-- the table update should be performed in that parallel transaction
EXEC dbo.p_finish_sync_attributes;
Yes, you can do it.
Method 1: a loop-back linked server (a linked server that points to your current server) that does not have DTC enabled. Call your SP through that linked server.
Method 2: create a SQL Server Agent job and start the job programmatically.
Note that in the first case your update statement must be wrapped in an SP. In the second case that is advisable but not necessary.
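A minimal sketch of Method 2, assuming the update has been wrapped in an Agent job (the job name here is hypothetical). sp_start_job returns immediately, so the job runs in its own session while the caller continues:
DECLARE @job_name sysname = N'RunParallelUpdate'; -- hypothetical job that performs the update
EXEC msdb.dbo.sp_start_job @job_name = @job_name; -- returns immediately; the job runs in a separate session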

How to properly truncate a staging table in an ETL pipeline?

We have an ETL pipeline that runs for each CSV uploaded into a storage account (Azure). It runs some transformations on the CSV, writes the output to another location (also as CSV), and calls a stored procedure on the database (SQL Azure) which ingests (BULK INSERT) the resulting CSV into a staging table.
This pipeline can have concurrent executions, since multiple resources can be uploading files to the storage. Hence, the staging table gets data inserted pretty often.
Then we have a scheduled SQL job (Elastic Job) that triggers an SP which moves the data from the staging table into the final table.
At this point, we would want to truncate/empty the staging table so that we do not re-insert the same rows in the next execution of the job.
The problem is that we cannot be sure that, between the load from the staging table to the final table and the truncate command, no new data has been written into the staging table that would be truncated without first being inserted into the final table.
Is there a way to lock the staging table while we're copying the data into the final table, so that the SP (called from the ETL pipeline) trying to write to it will just wait until the lock is released? Is this achievable by using transactions or maybe some manual lock commands?
If not, what's the best approach to handle this?
I would propose a solution with two identical staging tables. Let's name them StageLoading and StageProcessing.
The load process would have the following steps:
1. At the beginning both tables are empty.
2. We load some data into the StageLoading table (I assume each load is a transaction).
3. When the Elastic job starts it will:
- ALTER TABLE ... SWITCH to move all data from StageLoading to StageProcessing (see the sketch below). This leaves StageLoading empty and ready for the next loads. It is a metadata-only operation, so it takes milliseconds, and it is fully blocking, so it will happen between loads.
- load the data from StageProcessing into the final tables.
- truncate the StageProcessing table.
4. Now we are ready for the next Elastic job.
If we try to do the SWITCH while StageProcessing is not empty, the ALTER will fail, which means the last load process failed.
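A minimal sketch of the switch step, assuming both staging tables live in dbo, have identical schemas, and StageProcessing (the target) is empty:
-- metadata-only move of all rows from StageLoading into StageProcessing
ALTER TABLE dbo.StageLoading SWITCH TO dbo.StageProcessing;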
I like sp_getapplock and use this method myself in a few places for its flexibility and because you have full control over the locking logic and wait times.
The only problem that I see is that in your case concurrent processes are not all equal.
You have SP1 that moves data from the staging table into the main table. Your system never tries to run several instances of this SP.
Another SP2, which inserts data into the staging table, can be run several times simultaneously, and that is fine.
It is easy to implement locking that prevents any concurrent run of any combination of SP1 or SP2. In other words, it is easy if the locking logic is the same for SP1 and SP2 and they are treated equally. But then you can't have several instances of SP2 running simultaneously.
It is not obvious how to implement locking that prevents a concurrent run of SP1 and SP2 while still allowing several instances of SP2 to run simultaneously.
There is another approach that doesn't attempt to prevent concurrent runs of the SPs, but embraces and expects that simultaneous runs are possible.
One way to do it is to add an IDENTITY column to the staging table. Or an automatically populated datetime, if you can guarantee that it is unique and never decreases, which can be tricky. Or a rowversion column.
The logic inside SP2 that inserts data into the staging table doesn't change.
The logic inside SP1 that moves data from the staging table into the main table needs to use these identity values.
First, read the current maximum identity value from the staging table and remember it in a variable, say @MaxID. All subsequent SELECTs, UPDATEs and DELETEs from the staging table in SP1 should include the filter WHERE ID <= @MaxID.
This ensures that if a new row happens to be added to the staging table while SP1 is running, that row is not processed and remains in the staging table until the next run of SP1.
The drawback of this approach is that you can't use TRUNCATE; you need to use DELETE with WHERE ID <= @MaxID (a sketch follows below).
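A minimal sketch of SP1's body under this approach, assuming a staging table dbo.StagingTable with an IDENTITY column ID and a final table dbo.FinalTable with matching columns (all names and columns are illustrative):
DECLARE @MaxID int;

-- remember the highest ID currently present in the staging table
SELECT @MaxID = MAX(ID) FROM dbo.StagingTable;

BEGIN TRANSACTION;

-- move only the rows that existed when @MaxID was read
INSERT INTO dbo.FinalTable (Col1, Col2)
SELECT Col1, Col2
FROM dbo.StagingTable
WHERE ID <= @MaxID;

-- DELETE instead of TRUNCATE, so rows added after @MaxID was read survive
DELETE FROM dbo.StagingTable
WHERE ID <= @MaxID;

COMMIT TRANSACTION;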
If you are OK with several instances of SP2 waiting for each other (and for SP1), then you can use sp_getapplock similar to the following. I have this code in my stored procedure. You should put this logic into both SP1 and SP2.
I'm not calling sp_releaseapplock explicitly here, because the lock owner is set to Transaction and the engine will release the lock automatically when the transaction ends.
You don't have to put the retry logic in the stored procedure; it can be in the external code that runs these stored procedures. In any case, your code should be ready to retry.
CREATE PROCEDURE SP2 -- or SP1
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;

    BEGIN TRANSACTION;
    BEGIN TRY
        -- Maximum number of retries
        DECLARE @VarCount int = 10;

        WHILE (@VarCount > 0)
        BEGIN
            SET @VarCount = @VarCount - 1;

            DECLARE @VarLockResult int;
            EXEC @VarLockResult = sp_getapplock
                @Resource = 'StagingTable_app_lock',
                -- this resource name should be the same in SP1 and SP2
                @LockMode = 'Exclusive',
                @LockOwner = 'Transaction',
                @LockTimeout = 60000,
                -- I'd set this timeout to be about twice the time
                -- you expect the SP to run normally
                @DbPrincipal = 'public';

            IF @VarLockResult >= 0
            BEGIN
                -- Acquired the lock

                -- for SP2:
                -- INSERT INTO StagingTable ...

                -- for SP1:
                -- SELECT FROM StagingTable ...
                -- TRUNCATE StagingTable ...

                -- don't retry any more
                BREAK;
            END ELSE BEGIN
                -- wait for 5 seconds and retry
                WAITFOR DELAY '00:00:05';
            END;
        END;

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        ROLLBACK TRANSACTION;
        -- log the error
    END CATCH;
END
This code guarantees that only one procedure is working with the staging table at any given moment. There is no concurrency. All other instances will wait.
Obviously, if you access the staging table other than through SP1 or SP2 (which acquire the lock first), such access will not be blocked.
Is there a way to lock the staging table while we're copying the data into the final table, so that the SP (called from the ETL pipeline) trying to write to it will just wait until the lock is released? Is this achievable by using transactions or maybe some manual lock commands?
It looks like you are searching for a mechanism that is broader than transaction scope. SQL Server/Azure SQL DB has one, and it is called an application lock:
sp_getapplock
Places a lock on an application resource.
Locks placed on a resource are associated with either the current transaction or the current session. Locks associated with the current transaction are released when the transaction commits or rolls back. Locks associated with the session are released when the session is logged out. When the server shuts down for any reason, all locks are released.
Locks can be explicitly released with sp_releaseapplock. When an application calls sp_getapplock multiple times for the same lock resource, sp_releaseapplock must be called the same number of times to release the lock. When a lock is opened with the Transaction lock owner, that lock is released when the transaction is committed or rolled back.
It basically means that your ETL tool should open a single session to the DB, acquire the lock, and release it when finished. Other sessions, before trying to do anything, should try to acquire the lock (they cannot while it is already taken), wait until it is released, and then continue their work.
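A minimal sketch of that pattern with a session-owned lock (the resource name matches the earlier example; error handling and retries omitted):
-- run once per ETL session, before touching the staging table
DECLARE @res int;
EXEC @res = sp_getapplock
    @Resource    = 'StagingTable_app_lock',
    @LockMode    = 'Exclusive',
    @LockOwner   = 'Session',   -- held until released or the session ends
    @LockTimeout = 60000;       -- wait up to 60 seconds for the lock

IF @res >= 0
BEGIN
    -- ... work with the staging table here ...

    EXEC sp_releaseapplock
        @Resource  = 'StagingTable_app_lock',
        @LockOwner = 'Session';
END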
Assuming you have a single outbound job:
- Add an OutboundProcessing BIT DEFAULT 0 column to the table.
- In the job, SET OutboundProcessing = 1 WHERE OutboundProcessing = 0 (claim the rows).
- For the ETL, incorporate WHERE OutboundProcessing = 1 in the query that sources the data (transfer the rows).
- After the ETL, DELETE FROM TABLE WHERE OutboundProcessing = 1 (remove the rows you transferred).
- If the ETL fails, SET OutboundProcessing = 0 WHERE OutboundProcessing = 1.
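A rough sketch of those steps, assuming a staging table dbo.StagingTable and a final table dbo.FinalTable (names and columns are illustrative):
-- 1) claim the rows currently in the staging table
UPDATE dbo.StagingTable
SET OutboundProcessing = 1
WHERE OutboundProcessing = 0;

-- 2) transfer only the claimed rows
INSERT INTO dbo.FinalTable (Col1, Col2)
SELECT Col1, Col2
FROM dbo.StagingTable
WHERE OutboundProcessing = 1;

-- 3) remove the rows that were transferred
DELETE FROM dbo.StagingTable
WHERE OutboundProcessing = 1;

-- if the ETL fails, release the claim instead:
-- UPDATE dbo.StagingTable SET OutboundProcessing = 0 WHERE OutboundProcessing = 1;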
I always prefer to "ID" each file I receive. If you can do this, you can associate the records from a given file throughout your load process. You haven't called out a need for this, but just saying.
However, with each file having an identity (just an int/bigint identity value should do), you can then dynamically create as many load tables as you like from a "template" load table (see the sketch after these steps):
- When a file arrives, create a new load table named with the ID of the file.
- Process your data from the load table into the final table.
- Drop the load table for the file being processed.
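A rough sketch of creating a per-file load table from a template table (dbo.LoadTemplate and the file ID value are illustrative):
DECLARE @FileId bigint = 42;  -- identity assigned to the incoming file
DECLARE @sql nvarchar(max) =
    N'SELECT TOP (0) * INTO dbo.Load_' + CAST(@FileId AS nvarchar(20)) +
    N' FROM dbo.LoadTemplate;';  -- copies the template's columns into a new, empty load table
EXEC sys.sp_executesql @sql;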
This is somewhat similar to the other solution about using two tables (load and stage), but even in that solution you are still limited to having two files "loaded" (and you're still only applying one file to the final table at a time, though?).
Lastly, it is not clear whether your "Elastic Job" is detached from the actual "load" pipeline/processing or included in it. Being a job, I assume it is not included; and as a job, you can only run a single instance at a time? So it's not clear why it's important to load multiple files at once if you can only move one from load to final at a time. Why the rush to get files into load?

Locking for SQL Server concurrent accessing and modifying one record

I have a table that saves a list of completed jobs. Each job is run and inserted into that table after completion. There are multiple users who can fetch and run the same jobs, but before running, a job should be checked (against the completed jobs table I've just mentioned) to ensure that it hasn't already been run by anyone.
In fact the job is inserted into that table right before it is run; if the job fails, it is removed from the table later. I have a stored procedure to check whether a job exists in the table, but I'm not really sure about the situation where multiple users accidentally run the same job.
Here is the basic logic (for each user's app):
Step 1: check if job A already exists in the completed jobs table:
if exists(select * from CompletedJobs where JobId = JobA_Id)
select 1
else select 0
If job A already exists (it is being run or has completed), the current user's action should stop here. Otherwise, the current user continues with step 2 by first inserting job A into the completed jobs table:
insert into CompletedJobs(...) values(...)
Then it can just continue to actually run the job, and if the job fails, job A will be deleted from the table.
In multi-threaded code I could use a lock to ensure that no other user's action comes between the check and the insert (a kind of completion marker), so it should work safely. But in SQL Server I'm not so sure how that could be done. For example, what if two users both pass step 1 (and both get the result 0, meaning the job is free to run)?
I guess both would then continue running the same job, and that should be avoided. Unless, at the point of inserting the job (the beginning of step 2), I somehow take advantage of a unique or primary key constraint to make SQL Server throw an exception so that only one of them can continue successfully. But that feels a bit hacky and not like a nice solution. Are there better (and more standard) solutions to this issue?
I think the primary/unique key approach is a valid one. But there are other options; for example, you can try to lock the completed-job row, and if that succeeds, run the job and insert it into the completed jobs table. You can lock the row even if it doesn't exist yet.
Here is the code:
DECLARE @job_id int = 1;

SET LOCK_TIMEOUT 100;

BEGIN TRANSACTION;
BEGIN TRY
    -- Try to exclusively lock the row. If that succeeds, the
    -- lock will be held for the duration of the transaction.
    -- If the row is already locked, wait for 100 ms before failing
    -- with error 1222.
    IF EXISTS (SELECT * FROM completed_jobs WITH (ROWLOCK, HOLDLOCK, XLOCK) WHERE job_id = @job_id)
    BEGIN
        SELECT 1;
        COMMIT;
        RETURN;
    END

    SET LOCK_TIMEOUT -1;

    -- execute the job, insert it into the completed_jobs table, then COMMIT

    SELECT 0;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK;
    SET LOCK_TIMEOUT -1;
    -- 1222: Lock request time out period exceeded.
    IF ERROR_NUMBER() = 1222 SELECT 2;
    ELSE THROW;
END CATCH
The script returns:
SELECT 0 if it completes the job
SELECT 1 if the job is already completed
SELECT 2 if the job is being run by someone else
Two connections can run this script concurrently as long as @job_id is different.
If two connections run this script at the same time with the same @job_id and the job is not completed yet, one of them completes the job and the other one sees it either as a completed job (SELECT 1) or as a running job (SELECT 2).
If one connection A executes SELECT * FROM completed_jobs WHERE job_id = @job_id while another connection B is executing this script with the same @job_id, then connection A will be blocked until B completes the script. This is true only if A runs under the READ COMMITTED, REPEATABLE READ or SERIALIZABLE isolation levels. If A runs under READ UNCOMMITTED, READ COMMITTED SNAPSHOT or SNAPSHOT, it won't be blocked, and it will see the job as uncompleted.
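For completeness, the primary/unique key approach mentioned in the question could look roughly like this (a sketch, assuming a PRIMARY KEY or UNIQUE constraint on CompletedJobs.JobId; the StartedAt column is illustrative):
DECLARE @job_id int = 1;

BEGIN TRY
    -- the key on JobId makes the second concurrent INSERT fail
    -- instead of letting two users claim the same job
    INSERT INTO CompletedJobs (JobId, StartedAt)
    VALUES (@job_id, SYSUTCDATETIME());

    -- ... run the job here; DELETE the row from CompletedJobs if it fails ...
    SELECT 0;
END TRY
BEGIN CATCH
    -- 2627/2601: violation of a PRIMARY KEY or UNIQUE constraint/index
    IF ERROR_NUMBER() IN (2627, 2601)
        SELECT 1;  -- someone else already claimed or completed the job
    ELSE
        THROW;
END CATCH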

How can I measure the duration of a package execution?

I have an SSIS job that I would like to run from a procedure, and then, using the job start and end times, execute a select statement.
Getting the job start time is easy: just save the current time right before you call the job.
How can I get the end time? Can I use @endTime = GETDATE()? Does starting the job wait for it to end?
Is that true in general for calls inside SQL procedures?
EDIT:
As people asked, I wanted to call an SSIS job using this code, which I found here:
declare @execution_id bigint
exec ssisdb.catalog.create_execution
     @folder_name = 'mssqltips'
    ,@project_name = 'exec-ssis-stored-proc-ssis-sample'
    ,@package_name = 'Sample.dtsx'
    ,@execution_id = @execution_id output
exec ssisdb.catalog.start_execution @execution_id
SSIS already logs package execution durations and events, including step durations. You don't need to use GETDATE().
You can query the catalog.executions view of the SSISDB database to retrieve the execution status, start time and end time, e.g.:
select status, start_time, end_time, datediff(s, start_time, end_time) as duration
from catalog.executions
where execution_id = @execution_id
Or, for historical data:
select status, start_time, end_time, datediff(s, start_time, end_time) as duration
from catalog.executions
where project_name = 'exec-ssis-stored-proc-ssis-sample'
and package_name = 'Sample.dtsx'
order by execution_id
It depends on how you run SSIS from the SP. Is it a SQL Agent job or a package execution (catalog)?
If you run it as a package execution, it can run synchronously or asynchronously.
In asynchronous mode, the SP just starts the SSIS package and doesn't wait.
In synchronous mode, it waits.
The mode depends on the SYNCHRONIZED parameter. This parameter must be set BEFORE the execution starts; see the link below for how to set it.
https://learn.microsoft.com/en-us/sql/integration-services/system-stored-procedures/catalog-set-execution-parameter-value-ssisdb-database
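Based on that documentation, setting the execution to synchronous before starting it might look like this (building on the create_execution snippet from the question; @object_type = 50 denotes a system parameter):
exec ssisdb.catalog.set_execution_parameter_value
     @execution_id    = @execution_id
    ,@object_type     = 50                 -- system parameter
    ,@parameter_name  = N'SYNCHRONIZED'
    ,@parameter_value = 1                  -- 1 = start_execution waits for the package to finish
exec ssisdb.catalog.start_execution @execution_id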
If you run a SQL Agent job from the SP and that job executes the SSIS package, then the SP does not wait; it just starts the SQL Agent job.

Status of Stored Procedure call from Agent job, when job is stopped

We have a clean-up job which calls a stored procedure, which in turn deletes one day's worth of records from a log table. This job runs every five minutes and usually completes in less than 10 seconds. Sometimes it takes much longer, as long as 15 minutes. During such instances, the log table gets locked and subsequent transactions time out until the job completes.
In order to address this, we came up with this solution -
1) Remove the scheduling of the existing job
2) Create a new job, to call the original job
3) Schedule the new job to run every 5 minutes
4) See the code of the new job below
DECLARE @JobToRun NVARCHAR(128) = N'OM_EDU_Purge logs';

EXEC msdb.dbo.sp_start_job @JobToRun;

WAITFOR DELAY '00:00:20';

IF EXISTS (SELECT 1
           FROM msdb.dbo.sysjobs J
           JOIN msdb.dbo.sysjobactivity A
               ON A.job_id = J.job_id
           WHERE J.name = @JobToRun
             AND A.run_requested_date IS NOT NULL
             AND A.stop_execution_date IS NULL
          )
BEGIN -- job is running or finishing (not idle)
    EXEC msdb.dbo.sp_stop_job @job_name = @JobToRun;
    -- could log info, raise an error, send an email etc. here
END
This seems to work fine and stops the job if it is still running after 20 seconds. However, since the job calls a stored procedure, here is my question:
When the job is stopped, will it also terminate the stored procedure that is executing?
I think your query gets stuck because the log table is being updated, or has records inserted into it, concurrently with your delete statement. So you might try to lock the table for the duration of the delete statement. Update the query inside your procedure, for example: DELETE FROM logs WITH (TABLOCK)
Here, a stored proc is just calling another, nested stored proc. So no, the stored proc won't be stopped; control will return to the calling stored proc. You should have sufficient error handling in the proc to take care of scenarios where the called sproc errors out.
