I would have thought that code like the following would be atomic: if DeleteMe exists before running this transaction, it should be dropped and recreated. Otherwise it should simply be created:
BEGIN TRANSACTION
IF OBJECT_ID('DeleteMe') IS NOT NULL
DROP TABLE DeleteMe
SELECT query.*
INTO DeleteMe
FROM (SELECT 1 AS Value) AS query
COMMIT TRANSACTION
However, it appears that executing this code multiple times concurrently can cause various combinations of the errors:
Cannot drop the table 'DeleteMe', because it does not exist or you do not have permission.
There is already an object named 'DeleteMe' in the database.
Here's a LINQPad Script to show what I mean.
var sql = #"
BEGIN TRANSACTION
IF OBJECT_ID('DeleteMe') IS NOT NULL
DROP TABLE DeleteMe
SELECT query.*
INTO DeleteMe
FROM (SELECT 1 AS Value) AS query
COMMIT TRANSACTION
";
await Task.WhenAll(Enumerable.Range(1, 50)
    .Select(async i =>
    {
        // Each task opens its own connection and runs the drop-and-recreate script.
        using var connection = new SqlConnection(this.Connection.ConnectionString);
        await connection.OpenAsync();
        await connection.ExecuteAsync(sql);
    }).Dump());
And an example of its output:
If I use SQL Server 2016's DROP TABLE IF EXISTS feature, that part at least appears to be atomic, but then another concurrent command can apparently still create the DeleteMe table between the time this one gets dropped and the time it gets created again.
Question: Is there any way to atomically drop, create, and populate a table, such that there's no time during which that table won't exist from the perspective of another concurrent connection?
Is there any way to atomically drop, create, and populate a table, such that there's no time during which that table won't exist from the perspective of another concurrent connection?
Sure. It's like any transaction that needs to be consistent: you have to take an exclusive lock with the very first statement. In your transaction, two sessions can run IF OBJECT_ID('DeleteMe') IS NOT NULL at the same time. Then they both try to drop the object, and only one succeeds.
DROP TABLE IF EXISTS also performs the existence check before taking the exclusive schema lock on the object that would be necessary to drop it.
A simple and reliable way to get an exclusive lock is to use sp_getapplock.
e.g.:
BEGIN TRANSACTION
exec sp_getapplock 'dropandcreate_DeleteMe', 'exclusive'
DROP TABLE IF EXISTS DeleteMe
SELECT query.*
INTO DeleteMe
FROM (SELECT 1 AS Value) AS query
COMMIT TRANSACTION
The biggest problem I see you encountering is that by dropping the object you want to lock (you can lock an object, but not the 'name' of an object), you have nothing left to lock.
Proposals that involve finding something else to lock only resolve half the issue: the process stops racing itself, but any other process that references the DeleteMe table can still race with this process. For example:
- Run 10 concurrent instances of the process from the question, each using sp_getapplock.
- Those 10 concurrent instances no longer race each other.
- Now run 1 other process that only does SELECT * FROM DeleteMe and does not use sp_getapplock.
- That process CAN still fail by racing with whichever DROP/SELECT INTO process is currently active.
That leads me to conclude that NOT dropping objects is better, so that the table in use remains in existence and CAN be locked...
BEGIN TRANSACTION
TRUNCATE TABLE DeleteMe
INSERT INTO DeleteMe SELECT 1 AS Value
COMMIT TRANSACTION
The TRUNCATE takes a schema modification (Sch-M) lock on the table, so a secondary process that reads from this table is blocked for the duration and never sees it as empty.
Related
We have an ETL pipeline that runs for each CSV uploaded into a storage account (Azure). It runs some transformations on the CSV, writes the outputs to another location (also as CSV), and calls a stored procedure on the database (SQL Azure) which ingests (BULK INSERT) the resulting CSV into a staging table.
This pipeline can have concurrent executions as multiple resources can be uploading files to the storage. Hence, the staging table is getting data inserted pretty often.
Then, we have a scheduled SQL job (Elastic Job) that triggers an SP that moves the data from the staging table into the final table.
At this point, we would want to truncate/empty the staging table so that we do not re-insert those rows in the next execution of the job.
The problem is, we cannot be sure that between the load from the staging table to the final table and the truncate command, no new data has been written into the staging table that could be truncated without first being inserted into the final table.
Is there a way to lock the staging table while we're copying the data into the final table so that the SP (called from the ETL pipeline) trying to write to it will just wait until the lock is released? Is this achievable by using transactions or maybe some manual lock commands?
If not, what's the best approach to handle this?
I would propose a solution with two identical staging tables. Let's name them StageLoading and StageProcessing.
The load process would have the following steps:
1. At the beginning both tables are empty.
2. We load some data into StageLoading table (I assume each load is a transaction).
3. When Elastic job starts it will do:
- ALTER TABLE ... SWITCH to move all data from StageLoading to StageProcessing. It will make StageLoading empty and ready for the next loads. It is a metadata-only operation, so it takes milliseconds, and it is fully blocking, so it will be done between loads.
- load the data from StageProcessing to final tables.
- truncate table StageProcessing.
4. Now we are ready for next Elastic job.
If we try to do the SWITCH when StageProcessing is not empty, the ALTER will fail, and that will mean the last load process failed.
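A rough sketch of step 3 (the final table and its Value column are placeholders; SWITCH requires both staging tables to have identical structure on the same filegroup):
BEGIN TRANSACTION;

-- Metadata-only: instantly moves all rows from StageLoading to StageProcessing.
-- Fails if StageProcessing is not empty, which signals that the last run failed.
ALTER TABLE StageLoading SWITCH TO StageProcessing;

COMMIT TRANSACTION;

-- Load from the processing copy into the final table.
INSERT INTO FinalTable (Value)
SELECT Value FROM StageProcessing;

-- Clear the processing table so the next SWITCH can succeed.
TRUNCATE TABLE StageProcessing;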
I like sp_getapplock and use this method myself in a few places, for its flexibility and because you have full control over the locking logic and wait times.
The only problem that I see is that in your case concurrent processes are not all equal.
You have SP1 that moves data from the staging table into the main table. Your system never tries to run several instances of this SP.
Another SP2 that inserts data into the staging table can be run several times simultaneously and it is fine to do it.
It is easy to implement locking that prevents any concurrent run of any combination of SP1 and SP2. In other words, it is easy if the locking logic is the same for SP1 and SP2 and they are treated as equals. But then you can't have several instances of SP2 running simultaneously.
It is not obvious how to implement the locking that would prevent concurrent run of SP1 and SP2, while allowing several instances of SP2 to run simultaneously.
There is another approach that doesn't attempt to prevent concurrent run of SPs, but embraces and expects that simultaneous runs are possible.
One way to do it is to add an IDENTITY column to the staging table. Or an automatically populated datetime, if you can guarantee that it is unique and never decreases, which can be tricky. Or a rowversion column.
The logic inside SP2 that inserts data into the staging table doesn't change.
The logic inside SP1 that moves data from the staging table into the main table needs to use these identity values.
First, read the current maximum identity value from the staging table and remember it in a variable, say @MaxID. All subsequent SELECTs, UPDATEs, and DELETEs from the staging table in SP1 should include the filter WHERE ID <= @MaxID.
This ensures that if a new row happens to be added to the staging table while SP1 is running, that row will not be processed and will remain in the staging table until the next run of SP1.
The drawback of this approach is that you can't use TRUNCATE; you need to use DELETE with WHERE ID <= @MaxID.
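A rough sketch of SP1's body under this approach (the staging table, main table, and column names are placeholders):
DECLARE @MaxID int;

-- Snapshot the high-water mark before moving anything.
SELECT @MaxID = MAX(ID) FROM StagingTable;

BEGIN TRANSACTION;

-- Move only the rows that existed when the snapshot was taken.
INSERT INTO MainTable (Value)
SELECT Value FROM StagingTable WHERE ID <= @MaxID;

-- Rows inserted after @MaxID was read stay behind for the next run.
DELETE FROM StagingTable WHERE ID <= @MaxID;

COMMIT TRANSACTION;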
If you are OK with several instances of SP2 waiting for each other (and SP1), then you can use sp_getapplock similar to the following. I have this code in my stored procedure. You should put this logic into both SP1 and SP2.
I'm not calling sp_releaseapplock explicitly here, because the lock owner is set to Transaction and the engine will release the lock automatically when the transaction ends.
You don't have to put the retry logic in the stored procedure; it can be in the external code that runs these stored procedures. In any case, your code should be ready to retry.
CREATE PROCEDURE SP2 -- or SP1
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;

    BEGIN TRANSACTION;
    BEGIN TRY
        -- Maximum number of retries
        DECLARE @VarCount int = 10;

        WHILE (@VarCount > 0)
        BEGIN
            SET @VarCount = @VarCount - 1;

            DECLARE @VarLockResult int;
            EXEC @VarLockResult = sp_getapplock
                @Resource = 'StagingTable_app_lock',
                -- this resource name should be the same in SP1 and SP2
                @LockMode = 'Exclusive',
                @LockOwner = 'Transaction',
                @LockTimeout = 60000,
                -- I'd set this timeout to be about twice the time
                -- you expect the SP to run normally
                @DbPrincipal = 'public';

            IF @VarLockResult >= 0
            BEGIN
                -- Acquired the lock

                -- for SP2
                -- INSERT INTO StagingTable ...

                -- for SP1
                -- SELECT FROM StagingTable ...
                -- TRUNCATE StagingTable ...

                -- don't retry any more
                BREAK;
            END ELSE BEGIN
                -- wait for 5 seconds and retry
                WAITFOR DELAY '00:00:05';
            END;
        END;

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        ROLLBACK TRANSACTION;
        -- log error
    END CATCH;
END
This code guarantees that only one procedure is working with the staging table at any given moment. There is no concurrency. All other instances will wait.
Obviously, if you access the staging table through anything other than SP1 or SP2 (which acquire the lock first), then that access will not be blocked.
Is there a way to lock the staging table while we're copying the data into the final table so that the SP (called from the ETL pipeline) trying to write to it will just wait until the lock is released? Is this achievable by using transactions or maybe some manual lock commands?
It looks like you are searching for a mechanism that is wider than transaction scope. SQL Server / Azure SQL DB has one, and it is called an application lock:
sp_getapplock
Places a lock on an application resource.
Locks placed on a resource are associated with either the current transaction or the current session. Locks associated with the current transaction are released when the transaction commits or rolls back. Locks associated with the session are released when the session is logged out. When the server shuts down for any reason, all locks are released.
Locks can be explicitly released with sp_releaseapplock. When an application calls sp_getapplock multiple times for the same lock resource, sp_releaseapplock must be called the same number of times to release the lock. When a lock is opened with the Transaction lock owner, that lock is released when the transaction is committed or rolled back.
It basically means that your ETL tool should open a single session to the database, acquire the lock, and release it when finished. Other sessions should try to acquire the lock before doing anything; they will wait (because it is already taken) until it is released, and then continue their work.
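For example, a rough sketch (the resource name 'StagingTableETL' is made up and just has to be the same string in every participating session):
DECLARE @rc int;

-- Take a session-owned application lock before touching the staging table.
EXEC @rc = sp_getapplock
    @Resource    = 'StagingTableETL',
    @LockMode    = 'Exclusive',
    @LockOwner   = 'Session',
    @LockTimeout = 60000;   -- wait up to 60 s for other sessions to finish

IF @rc >= 0
BEGIN
    -- ... copy from the staging table to the final table, then truncate ...

    EXEC sp_releaseapplock
        @Resource  = 'StagingTableETL',
        @LockOwner = 'Session';
END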
Assuming you have a single outbound job
Add an OutboundProcessing BIT DEFAULT 0 to the table
In the job, SET OutboundProcessing = 1 WHERE OutboundProcessing = 0 (claim the rows)
For the ETL, incorporate WHERE OutboundProcessing = 1 in the query that sources the data (transfer the rows)
After the ETL, DELETE FROM TABLE WHERE OutboundProcessing = 1 (remove the rows you transferred)
If the ETL fails, SET OutboundProcessing = 0 WHERE OutboundProcessing = 1
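A rough sketch of those steps (the table and column names other than OutboundProcessing are placeholders):
-- One-time setup: the claim flag.
ALTER TABLE StagingTable ADD OutboundProcessing BIT NOT NULL DEFAULT 0;

-- Claim the rows that were present when the job started.
UPDATE StagingTable SET OutboundProcessing = 1 WHERE OutboundProcessing = 0;

-- Transfer only the claimed rows.
INSERT INTO FinalTable (Value)
SELECT Value FROM StagingTable WHERE OutboundProcessing = 1;

-- Remove the rows that were transferred.
DELETE FROM StagingTable WHERE OutboundProcessing = 1;

-- If the ETL fails, release the claim instead:
-- UPDATE StagingTable SET OutboundProcessing = 0 WHERE OutboundProcessing = 1;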
I always prefer to "ID" each file I receive. If you can do this, you can associate the records from a given file throughout your load process. You haven't called out a need for this, but jus sayin.
However, with each file having an identity (just a int/bigint identity value should do) you can then dynamically create as many load tables as you like from a "template" load table.
1. When a file arrives, create a new load table named with the ID of the file.
2. Process your data from the load table to the final table.
3. Drop the load table for the file being processed.
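For example, a rough sketch of creating a per-file load table from a "template" table (the template name dbo.LoadTemplate and the @FileID value are made up):
DECLARE @FileID int = 42;  -- identity assigned to the incoming file
DECLARE @sql nvarchar(max) =
    N'SELECT * INTO dbo.Load_' + CAST(@FileID AS nvarchar(20)) +
    N' FROM dbo.LoadTemplate WHERE 1 = 0;';  -- copies the column layout, no rows

EXEC sp_executesql @sql;

-- ... BULK INSERT the file into dbo.Load_42, apply it to the final table,
-- then DROP TABLE dbo.Load_42 ...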
This is somewhat similar to the other solution about using 2 tables (load and stage), but even in that solution you are still limited to having 2 files "loaded" (you're still only applying one file to the final table at a time, though?).
Last, it is not clear if your "Elastic Job" is detached from the actual "load" pipeline/processing or if it is included. Being a job, I assume it is not included; as a job, you can only run a single instance at a time? So it's not clear why it's important to load multiple files at once if you can only move one from load to final at a time. Why the rush to get files into load?
I have a database table with thousands of entries. I have multiple worker threads, each of which picks up one row at a time and does some work (taking roughly one second per row). While picking up a row, each thread updates a flag on the database row (like a timestamp) so that the other threads do not pick it up. But the problem is that I end up in a scenario where multiple threads pick up the same row.
My general question is: what design approach should I follow here to ensure that each thread picks up unique rows and does its task independently?
Note: Multiple threads are running in parallel to hasten the processing of the database rows, so I would like the critical section or exclusive lock to be as small as possible.
Just to give some context, below is the stored proc which picks up the rows from the table after it has updated the flag on the row. Please note that the stored proc is not compilable as I have removed unnecessary portions from it. But generally that's the structure of it.
The problem happens when multiple threads execute the stored proc in parallel. The change made by the UPDATE statement (note that the update is done after taking a lock) in one thread is not visible to the other threads until the transaction is committed. And since there is a SELECT statement (which takes around 50 ms) between the UPDATE and the COMMIT TRANSACTION, in about 20% of cases the UPDATE statement in a thread picks up a row which has already been processed.
I hope I am clear enough here.
USE [mydatabase]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE [dbo].[GetRequest]
AS
BEGIN
    -- some variable declaration here (including @RequestID)

    BEGIN TRANSACTION

    -- check if there are blocking rows in the request table
    -- FM: Remove records that don't qualify for operation.
    -- delete operation on the table to remove rows we don't want to process
    DELETE FROM request WHERE somecondition = 1

    -- Identify the requests to process
    DECLARE @TmpTableVar table(TmpRequestId int NULL);

    UPDATE TOP(1) request
    WITH (ROWLOCK)
    SET Lock = DATEADD(mi, 5, GETDATE())
    OUTPUT INSERTED.ID INTO @TmpTableVar
    FROM request tur
    WHERE (Lock IS NULL OR GETDATE() > Lock) -- not locked or lock expired
        AND GETDATE() > NextRetry            -- next in the queue

    IF (@@ROWCOUNT = 0)
    BEGIN
        ROLLBACK TRANSACTION
        RETURN
    END

    SELECT @RequestID = TmpRequestId FROM @TmpTableVar

    -- Get details about the request that has been just updated
    SELECT somerows
    FROM request
    WHERE somecondition = 1

    COMMIT TRANSACTION
END
The analog of a critical section in SQL Server is sp_getapplock, which is simple to use. Alternatively you can SELECT the row to update with (UPDLOCK,READPAST,ROWLOCK) table hints. Both of these require a multi-statement transaction to control the duration of the exclusive locking.
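For example, a rough sketch of the hint-based variant, reusing the column names from your procedure (adjust to taste):
BEGIN TRANSACTION;

DECLARE @RequestID int;

-- Lock one available row; READPAST makes other workers skip rows that are
-- already locked instead of blocking on them or re-reading them.
SELECT TOP (1) @RequestID = ID
FROM request WITH (UPDLOCK, READPAST, ROWLOCK)
WHERE (Lock IS NULL OR GETDATE() > Lock)
    AND GETDATE() > NextRetry;

IF @RequestID IS NOT NULL
BEGIN
    UPDATE request
    SET Lock = DATEADD(mi, 5, GETDATE())
    WHERE ID = @RequestID;

    -- fetch the details of the claimed request while the row lock is held
    SELECT *
    FROM request
    WHERE ID = @RequestID;
END

COMMIT TRANSACTION;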
You need to set a transaction isolation level in SQL to isolate your row, but this can impact your performance.
Look at this sample:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
GO
BEGIN TRANSACTION
GO
SELECT ID, NAME, FLAG FROM SAMPLE_TABLE WHERE FLAG=0
GO
UPDATE SAMPLE_TABLE SET FLAG=1 WHERE ID=1
GO
COMMIT TRANSACTION
Finally, there is no single best isolation level to use. You need to analyze the positives and negatives of each isolation level and test your system's performance.
More information:
https://learn.microsoft.com/en-us/sql/t-sql/statements/set-transaction-isolation-level-transact-sql
http://www.besttechtools.com/articles/article/sql-server-isolation-levels-by-example
https://en.wikipedia.org/wiki/Isolation_(database_systems)
I'm starting to work with a SQL Server database, and I'm having a hard time trying to understand transaction isolation levels and how they lock data.
I'm trying to accomplish the following simple task:
Accept a pair of integers [ID, counter] in a SQL stored procedure
Determine whether ID exists in a certain table: SELECT COUNT(*) FROM MyTable WHERE Id = {idParam}
If the previous COUNT statement returns 0, insert this ID and counter:
INSERT INTO MyTable(Id, Counter) VALUES({idParam}, {counterParam})
If the COUNT statement returns 1, update the existing record: UPDATE MyTable SET Counter = Counter + {counterParam} WHERE Id = {idParam}
Now, I understand I have to wrap this whole stored procedure in a transaction, and according to this MS article the appropriate isolation level would be SERIALIZABLE (it says: No other transactions can modify data that has been read by the current transaction until the current transaction completes). Please correct me if I'm wrong here.
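In procedure form, what I have in mind is roughly this sketch (the procedure name is made up; table and column names are as above):
CREATE PROCEDURE dbo.AddOrIncrement  -- hypothetical name
    @idParam int,
    @counterParam int
AS
BEGIN
    SET NOCOUNT ON;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

    BEGIN TRANSACTION;

    -- Check-then-write, protected by the SERIALIZABLE range locks.
    IF (SELECT COUNT(*) FROM MyTable WHERE Id = @idParam) = 0
        INSERT INTO MyTable (Id, Counter) VALUES (@idParam, @counterParam);
    ELSE
        UPDATE MyTable SET Counter = Counter + @counterParam WHERE Id = @idParam;

    COMMIT TRANSACTION;
END;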
Suppose I called the procedure with ID=1, so the first query would be SELECT COUNT(*) FROM MyTable WHERE SomeId=1 (1st transaction begins). Then, immediately after this query was executed, the procedure is called with ID=2 (2nd transaction begins).
What I fail to understand is how much data would be locked during the execution of my stored procedure in this case:
If the 1st query of the 1st transaction returns 0 records, does this mean that the 1st transaction locks nothing and other transactions are able to INSERT ID=1 before the 1st transaction tries it?
Or does the 1st transaction lock the whole table, making the 2nd transaction wait even though those 2 transactions can never try to read/update the same row?
Or does the 1st transaction somehow forbid anyone else from reading/writing only records with ID=1 until it is completed?
If your filter is on an index, that's what's going to get locked. So regardless of whether the row already exists or not, it's locked for the duration of the transaction. Take care, though - it's very easy to turn a row lock into something nastier, especially full table locks. And of course, it's easy to introduce deadlocks this way :)
However, I'd suggest a different approach. First, try to do an insert. If it works, you're done - if it doesn't, you know you can safely do an atomic update. Very fast, very cheap, very reliable :)
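For example, a rough sketch of that insert-first approach, assuming Id has a primary key or unique constraint (the procedure name is made up):
CREATE PROCEDURE dbo.UpsertCounter  -- hypothetical name
    @idParam int,
    @counterParam int
AS
BEGIN
    SET NOCOUNT ON;

    BEGIN TRY
        -- Try the insert first; if Id already exists, the key constraint
        -- (assumed here) makes it fail immediately.
        INSERT INTO MyTable (Id, Counter) VALUES (@idParam, @counterParam);
    END TRY
    BEGIN CATCH
        -- Anything other than a duplicate-key error (2627 = primary key,
        -- 2601 = unique index violation) is re-raised.
        IF ERROR_NUMBER() NOT IN (2627, 2601)
            THROW;

        -- The row already exists, so the update is safe and atomic on its own.
        UPDATE MyTable
        SET Counter = Counter + @counterParam
        WHERE Id = @idParam;
    END CATCH;
END;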
When I perform a select/Insert query, does SQL Server automatically create an implicit transaction and thus treat it as one atomic operation?
Take the following query that inserts a value into a table if it isn't already there:
INSERT INTO Table1 (FieldA)
SELECT 'newvalue'
WHERE NOT EXISTS (Select * FROM Table1 where FieldA='newvalue')
Is there any possibility of 'newvalue' being inserted into the table by another user between the evaluation of the WHERE clause and the execution of the INSERT clause if it isn't explicitly wrapped in a transaction?
You are confusing transactions with locking. A transaction reverts your data back to the original state if there is any error; if not, it moves the data to the new state. You will never have your data in an intermediate state when the operations are transacted. Locking, on the other hand, is what allows or prevents multiple users from accessing the data simultaneously. To answer your question: the SELECT...INSERT is atomic, and as long as no granular locks are explicitly requested, no other user will be able to insert while the SELECT...INSERT is in progress.
John, the answer to this depends on your current isolation level. If you're set to READ UNCOMMITTED you could be looking for trouble, but with a higher isolation level, you should not get additional records in the table between the select and insert. With a READ COMMITTED (the default), REPEATABLE READ, or SERIALIZABLE isolation level, you should be covered.
Using SSMS 2016, it can be verified that the Select/Insert statement requests a lock (and so most likely operates atomically):
Open a new query/connection for the following transaction and set a break-point on ROLLBACK TRANSACTION before starting the debugger:
BEGIN TRANSACTION
INSERT INTO Table1 (FieldA) VALUES ('newvalue');
ROLLBACK TRANSACTION --[break-point]
While at the above break-point, execute the following from a separate query window to show any locks (may take a few seconds to register any output):
SELECT * FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID()
AND resource_associated_entity_id = OBJECT_ID(N'dbo.Table1');
There should be a single lock associated with the BEGIN TRANSACTION/INSERT above (since, by default, it runs at an isolation level of READ COMMITTED):
OBJECT ** ********** * IX LOCK GRANT 1
From another instance of SSMS, open up a new query and run the following (while still stopped at the above break-point):
INSERT INTO Table1 (FieldA)
SELECT 'newvalue'
WHERE NOT EXISTS (Select * FROM Table1 where FieldA='newvalue')
This should hang, with the string "(Executing)..." displayed in the tab title of the query window (since @@LOCK_TIMEOUT is -1 by default).
Re-run the sys.dm_tran_locks query from above.
Another lock corresponding to the Select/Insert should now show:
OBJECT ** ********** 0 IX LOCK GRANT 1
OBJECT ** ********** 0 IX LOCK GRANT 1
ref: How to check which locks are held on a table
The following (sanitized) code sometimes produces these errors:
Cannot drop the table 'database.dbo.Table', because it does not exist or you do not have permission.
There is already an object named 'Table' in the database.
begin transaction
if exists (select 1 from database.Sys.Tables where name ='Table')
begin drop table database.dbo.Table end
Select top 3000 *
into database.dbo.Table
from OtherTable
commit
select * from database.dbo.Table
The code can be run multiple times simultaneously. Anyone know why it breaks?
Can I ask why you're doing this in the first place? You should really consider using temporary tables or coming up with another solution.
I'm not positive that DDL statements behave the same way in transactions as DML statements, and I have seen a blog post describing weird behavior when creating stored procedures within a transaction.
Aside from that, you might want to verify your transaction isolation level and set it to SERIALIZABLE.
Edit
Based on a quick test, I ran the same sql in two different connections, and when I created the table but didn't commit the transaction, the second transaction blocked. So it looks like this should work. I would still caution against this type of design.
In what part of the code are you preventing multiple accesses to this resource?
begin transaction
if exists (select 1 from database.Sys.Tables where name ='Table')
begin drop table database.dbo.Table end
Select top 3000 *
into database.dbo.Table
from OtherTable
commit
Begin transaction isn't doing it. It's only setting up for a commit/rollback scenario on any rows added to tables.
The (if exists, drop) is a race condition, along with the re-creation of the table with (select...into). Multiple people dropping into that code all at once will most certainly cause all kinds of errors: some creating tables that others have just destroyed, others dropping tables that don't exist anymore, and others dropping tables that some are busy inserting into. UGH!
Consider the temp table suggestions of others, or using an application lock to block others from entering this code at all if the critical resource is busy. Transactions on drop/create are not what you want.
If you are just using this table during this process, I would suggest using a temp table or, depending on how much data there is, a RAM table. I use RAM tables frequently to avoid transaction costs and save on disk activity.
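For example, a rough equivalent of the code in the question using a temp table (the #Working name is made up); each connection gets its own private copy, so there is nothing to race over:
-- Private to this connection: concurrent runs cannot collide on the name.
SELECT TOP 3000 *
INTO #Working
FROM OtherTable;

SELECT * FROM #Working;

DROP TABLE #Working;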