Why Is My Azure SQL Database Table Permanently Locked?

I have an isolated Azure SQL test database that has no active connections except my development machine through SSMS and a development web application instance. I am the only one using this database.
I am running some tests on a table of ~1M records where we need to do a large UPDATE that touches nearly all of those records.
DECLARE @BatchSize INT = 1000

WHILE @BatchSize > 0
BEGIN
    UPDATE TOP (@BatchSize) [MyTable]
    SET [Data] = [Data] + ' a change'
    WHERE [Data] IS NOT NULL

    SET @BatchSize = @@ROWCOUNT

    RAISERROR('Updated %d records', 0, 1, @BatchSize) WITH NOWAIT
END
This query works fine, and I can see my data being updated 1000 records at a time every few seconds.
Performing additional INSERT/UPDATE/DELETE commands on MyTable seems to be somewhat affected by the batch query running, but these operations do execute within a few seconds when run. I assume this is because locks are being taken on MyTable and my other commands execute in between the batch query's loop iterations.
This behavior is all expected.
However, every so often while the batch query is running, I notice that additional INSERT/UPDATE/DELETE commands on MyTable will no longer execute. They always time out and never finish. I assume some type of lock has occurred on MyTable, but it seems that the lock is never released. Further, even if I cancel the long-running batch update query, I still cannot run any INSERT/UPDATE/DELETE commands on MyTable. Even after 10-15 minutes of the database sitting idle with nothing happening on it, I cannot execute write commands on MyTable. The only way I have found to "free up" the database from whatever is occurring is to scale it up and down to a new pricing tier. I assume this pricing tier change is recycling/rebooting the instance or something.
I have reproduced this behavior multiple times during my testing today.
What is going on here?

Scaling the tier up or down rolls back all open transactions and disconnects server logins.
As for what you are seeing, it looks like lock escalation. Try serializing access to the table using sp_getapplock. You can also try lock hints.
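As a hedged sketch (not the asker's actual code), the batch loop and any other writers could each wrap their work in an application lock on an agreed resource name; the resource name below is hypothetical:
BEGIN TRAN;

DECLARE @LockResult INT;
EXEC @LockResult = sp_getapplock
    @Resource = 'MyTable_writes',   -- hypothetical resource name shared by all writers
    @LockMode = 'Exclusive',
    @LockOwner = 'Transaction',
    @LockTimeout = 30000;           -- wait up to 30 seconds for the lock

IF @LockResult >= 0
BEGIN
    -- do one batch of work against MyTable here, e.g.
    UPDATE TOP (1000) [MyTable]
    SET [Data] = [Data] + ' a change'
    WHERE [Data] IS NOT NULL;
END

COMMIT TRAN;   -- the transaction-owned app lock is released here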

SELECT statement is not blocked by an existing exclusive table lock

For testing, I am trying to simulate a condition in which a query from our web application to our SQL Server backend would time out. The web application is configured so this happens if the query runs longer than 30 seconds. I felt the easiest way to do this would be to take and hold an exclusive lock on the table that the web application wants to query. As I understand it, an exclusive lock should prevent any additional locks (even the shared locks taken by a SELECT statement).
I used the following methodology:
CREATE A LONG-HELD LOCK
Open a first query window in SSMS and run
BEGIN TRAN;
SELECT * FROM MyTable WITH (TABLOCKX);
WAITFOR DELAY '00:02:00';
ROLLBACK;
(see https://stackoverflow.com/a/25274225/2824445 )
CONFIRM THE LOCK
I can EXEC sp_lock and see results with ObjId matching MyTable, Type of TAB, Mode of X
TRY TO GET BLOCKED BY THE LOCK
Open a second query window in SSMS and run SELECT * FROM MyTable
I would expect this to sit and wait, not returning any results until after the lock is released by the first query. Instead, the second query returns with full results immediately.
STUFF I TRIED
In the second query window, if I SET TRANSACTION ISOLATION LEVEL SERIALIZABLE, then the second query waits until the first completes as expected. However, the point is to simulate a timeout in our web application, and I do not have any easy way to alter the transaction isolation level of the web application's connections away from the default of READ COMMITTED.
In the first window, I tried modifying the table's values inside the transaction. In this case, when the second query returns immediately, the values it shows are the unmodified values.
Figured it out. We had READ_COMMITTED_SNAPSHOT turned on, which is how the second query was able to return the previous, unmodified values in part 2 of "Stuff I tried". I was able to determine this with SELECT is_read_committed_snapshot_on FROM sys.databases WHERE name = 'MyDatabase'. Once it was turned off with ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT OFF, I began to see the expected behavior in which the second query would wait for the first to complete.
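For reference, the check and the switch mentioned above as runnable statements:
-- check whether READ_COMMITTED_SNAPSHOT is enabled for the database
SELECT is_read_committed_snapshot_on FROM sys.databases WHERE name = 'MyDatabase';

-- turn it off so readers block behind the exclusive table lock
ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT OFF;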

How to properly truncate a staging table in an ETL pipeline?

We have an ETL pipeline that runs for each CSV uploaded into a storage account (Azure). It runs some transformations on the CSV and writes the output to another location, also as CSV, and calls a stored procedure on the database (SQL Azure) which ingests (BULK INSERT) the resulting CSV into a staging table.
This pipeline can have concurrent executions as multiple resources can be uploading files to the storage. Hence, the staging table is getting data inserted pretty often.
Then, we have a scheduled SQL job (Elastic Job) that triggers an SP that moves the data from the staging table into the final table.
At this point, we want to truncate/empty the staging table so that we do not re-insert those rows in the next execution of the job.
The problem is that we cannot be sure that, between the load from the staging table to the final table and the truncate command, no new data has been written into the staging table that could be truncated without first being inserted into the final table.
Is there a way to lock the staging table while we're copying the data into the final table so that the SP (called from the ETL pipeline) trying to write to it will just wait until the lock is released? Is this achievable by using transactions or maybe some manual lock commands?
If not, what's the best approach to handle this?
I would propose a solution with two identical staging tables. Let's name them StageLoading and StageProcessing.
The load process would have the following steps:
1. At the beginning both tables are empty.
2. We load some data into the StageLoading table (I assume each load is a transaction).
3. When the Elastic Job starts it will:
- ALTER TABLE ... SWITCH to move all data from StageLoading to StageProcessing. This leaves StageLoading empty and ready for the next loads. It is a metadata-only operation, so it takes milliseconds, and it is fully blocking, so it will happen in between loads.
- load the data from StageProcessing into the final tables.
- truncate table StageProcessing.
4. Now we are ready for the next Elastic Job run.
If we try to do the SWITCH when StageProcessing is not empty, the ALTER will fail, and that will mean the last load process failed.
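A minimal sketch of step 3, assuming the two tables have identical schemas and sit on the same filegroup (a requirement for SWITCH):
-- move every row from StageLoading into StageProcessing as a metadata-only operation;
-- fails if StageProcessing is not empty, which signals the previous run did not finish
ALTER TABLE StageLoading SWITCH TO StageProcessing;

-- load from StageProcessing into the final tables here ...

-- clear the processing table for the next run
TRUNCATE TABLE StageProcessing;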
I like the sp_getapplock approach and use this method myself in a few places for its flexibility and because you have full control over the locking logic and wait times.
The only problem that I see is that in your case concurrent processes are not all equal.
You have SP1 that moves data from the staging table into the main table. Your system never tries to run several instances of this SP.
Another SP2 that inserts data into the staging table can be run several times simultaneously and it is fine to do it.
It is easy to implement locking that prevents any concurrent run of any combination of SP1 and SP2. In other words, it is easy if the locking logic is the same for SP1 and SP2 and they are treated as equals. But then you can't have several instances of SP2 running simultaneously.
It is not obvious how to implement locking that prevents a concurrent run of SP1 and SP2 while still allowing several instances of SP2 to run simultaneously.
There is another approach that doesn't attempt to prevent concurrent run of SPs, but embraces and expects that simultaneous runs are possible.
One way to do it is to add an IDENTITY column to the staging table; or an automatically populated datetime column, if you can guarantee that it is unique and never decreases, which can be tricky; or a rowversion column.
The logic inside SP2 that inserts data into the staging table doesn't change.
The logic inside SP1 that moves data from the staging table into the main table needs to use these identity values.
First, read the current maximum identity value from the staging table and remember it in a variable, say @MaxID. All subsequent SELECTs, UPDATEs and DELETEs from the staging table in SP1 should include the filter WHERE ID <= @MaxID.
This ensures that if a new row happens to be added to the staging table while SP1 is running, that row will not be processed and will remain in the staging table until the next run of SP1.
The drawback of this approach is that you can't use TRUNCATE; you need to use DELETE with WHERE ID <= @MaxID.
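A minimal sketch of SP1 using this approach; the table and column names (StagingTable, FinalTable, ID, Payload) are placeholders, not from the original post:
DECLARE @MaxID bigint;
SELECT @MaxID = MAX(ID) FROM StagingTable;   -- remember the high-water mark first

-- move only the rows that existed when this run started
INSERT INTO FinalTable (ID, Payload)
SELECT ID, Payload
FROM StagingTable
WHERE ID <= @MaxID;

-- TRUNCATE is not safe here; delete only what was copied
DELETE FROM StagingTable WHERE ID <= @MaxID;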
If you are OK with several instances of SP2 waiting for each other (and for SP1), then you can use sp_getapplock similar to the following. I have this code in one of my stored procedures. You should put this logic into both SP1 and SP2.
I'm not calling sp_releaseapplock explicitly here, because the lock owner is set to Transaction and the engine will release the lock automatically when the transaction ends.
You don't have to put the retry logic in the stored procedure; it can live in the external code that runs these stored procedures. In any case, your code should be ready to retry.
CREATE PROCEDURE SP2 -- or SP1
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;

    BEGIN TRANSACTION;
    BEGIN TRY
        -- Maximum number of retries
        DECLARE @VarCount int = 10;

        WHILE (@VarCount > 0)
        BEGIN
            SET @VarCount = @VarCount - 1;

            DECLARE @VarLockResult int;
            EXEC @VarLockResult = sp_getapplock
                @Resource = 'StagingTable_app_lock',
                -- this resource name should be the same in SP1 and SP2
                @LockMode = 'Exclusive',
                @LockOwner = 'Transaction',
                @LockTimeout = 60000,
                -- I'd set this timeout to be about twice the time
                -- you expect the SP to run normally
                @DbPrincipal = 'public';

            IF @VarLockResult >= 0
            BEGIN
                -- Acquired the lock

                -- for SP2:
                -- INSERT INTO StagingTable ...

                -- for SP1:
                -- SELECT FROM StagingTable ...
                -- TRUNCATE StagingTable ...

                -- don't retry any more
                BREAK;
            END ELSE BEGIN
                -- wait for 5 seconds and retry
                WAITFOR DELAY '00:00:05';
            END;
        END;

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        ROLLBACK TRANSACTION;
        -- log the error
    END CATCH;
END
This code guarantees that only one procedure is working with the staging table at any given moment. There is no concurrency. All other instances will wait.
Obviously, if you access the staging table through something other than SP1 or SP2 (which acquire the lock first), that access will not be blocked.
Is there a way to lock the staging table while we're copying the data into the final table so that the SP (called from the ETL pipeline) trying to write to it will just wait until the lock is released? Is this achievable by using transactions or maybe some manual lock commands?
It looks like you are searching for a mechanism that is broader than transaction scope. SQL Server/Azure SQL DB has one, and it is called an application lock:
sp_getapplock
Places a lock on an application resource.
Locks placed on a resource are associated with either the current transaction or the current session. Locks associated with the current transaction are released when the transaction commits or rolls back. Locks associated with the session are released when the session is logged out. When the server shuts down for any reason, all locks are released.
Locks can be explicitly released with sp_releaseapplock. When an application calls sp_getapplock multiple times for the same lock resource, sp_releaseapplock must be called the same number of times to release the lock. When a lock is opened with the Transaction lock owner, that lock is released when the transaction is committed or rolled back.
It basically means that your ETL tool should open a single session to the DB, acquire the lock, and release it when finished. Before doing anything, other sessions should try to acquire the lock (they cannot while it is already taken), wait until it is released, and then continue to work.
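In its simplest form, the session-owned variant of that pattern could look like the sketch below; the resource name is just an example carried over from the code above:
-- acquire the application lock for this session; blocks for up to 60 seconds
DECLARE @rc int;
EXEC @rc = sp_getapplock
    @Resource = 'StagingTable_app_lock',
    @LockMode = 'Exclusive',
    @LockOwner = 'Session',
    @LockTimeout = 60000;

-- ... work with the staging table here ...

-- release explicitly, because the lock is owned by the session, not a transaction
EXEC sp_releaseapplock
    @Resource = 'StagingTable_app_lock',
    @LockOwner = 'Session';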
Assuming you have a single outbound job (a rough sketch of these steps follows the list):
Add an OutboundProcessing BIT DEFAULT 0 to the table
In the job, SET OutboundProcessing = 1 WHERE OutboundProcessing = 0 (claim the rows)
For the ETL, incorporate WHERE OutboundProcessing = 1 in the query that sources the data (transfer the rows)
After the ETL, DELETE FROM TABLE WHERE OutboundProcessing = 1 (remove the rows you transferred)
If the ETL fails, SET OutboundProcessing = 0 WHERE OutboundProcessing = 1
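As a rough T-SQL sketch of the steps above; StagingTable, FinalTable, and the column list are placeholders:
-- one-time schema change
ALTER TABLE StagingTable ADD OutboundProcessing BIT NOT NULL DEFAULT 0;

-- claim the rows at the start of the job
UPDATE StagingTable SET OutboundProcessing = 1 WHERE OutboundProcessing = 0;

-- transfer only the claimed rows
INSERT INTO FinalTable (Col1, Col2)
SELECT Col1, Col2 FROM StagingTable WHERE OutboundProcessing = 1;

-- remove the rows that were transferred
DELETE FROM StagingTable WHERE OutboundProcessing = 1;

-- if the ETL fails, release the claim instead:
-- UPDATE StagingTable SET OutboundProcessing = 0 WHERE OutboundProcessing = 1;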
I always prefer to "ID" each file I receive. If you can do this, you can associate the records from a given file throughout your load process. You haven't called out a need for this, but just saying.
However, with each file having an identity (an int/bigint identity value should do), you can then dynamically create as many load tables as you like from a "template" load table.
When a file arrives, create a new load table named with the ID of the file.
Process your data from load to final table.
Drop the load table for the file being processed.
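A hedged sketch of that flow using dynamic SQL; the template table name, the Load_ prefix, and the file ID are illustrative only, and SELECT INTO copies columns but not indexes or constraints:
DECLARE @FileId bigint = 12345;                       -- identity assigned to the incoming file
DECLARE @LoadTable sysname = CONCAT('Load_', @FileId);
DECLARE @sql nvarchar(max);

-- create an empty load table for this file from the template
SET @sql = 'SELECT TOP (0) * INTO dbo.' + QUOTENAME(@LoadTable) + ' FROM dbo.LoadTemplate;';
EXEC sys.sp_executesql @sql;

-- BULK INSERT the file into the new load table, move it to the final table, then:
SET @sql = 'DROP TABLE dbo.' + QUOTENAME(@LoadTable) + ';';
EXEC sys.sp_executesql @sql;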
This is somewhat similar to the other solution about using two tables (load and stage), but even in that solution you are still limited to having two files "loaded" (and you are still only applying one file to the final table at a time).
Last, it is not clear whether your "Elastic Job" is detached from the actual "load" pipeline/processing or included in it. Being a job, I assume it is not included, and as a job you can presumably only run a single instance at a time. So it's not clear why it's important to load multiple files at once if you can only move one from load to final at a time. Why the rush to get files into load?

Corrupt Azure SQL stored procedure could only be fixed by drop recreate

Pardon me if this is a duplicate. The closest I could find was Random timeout running a stored proc - drop recreate fixes but I'm not certain the answers there about recompiling the stored procedure apply.
I have an Azure SQL database, latest version, that has a lot of traffic from an Azure web app front end. I have a nightly remote job that runs a batch to rebuild indexes on the Azure SQL database as that seems to help greatly with controlling database size and performance.
Normally, the rebuilding of indexes takes about 20 minutes. Last night it timed out after 2 hours. The error handler in that batch did not log any errors.
Soon after the index rebuild started, one particular stored procedure started timing out for every client calling it. Other stored procedures using the same tables were not having any issues. When I discovered the problem, I could alleviate all the timeouts and suspended processes by altering the stored procedure to immediately return. When I altered the stored procedure again to behave normally, the issues reappeared immediately. My understanding is that altering the stored procedure forced it to recompile, but that didn't fix it.
Ultimately, I completely dropped and recreated the procedure with the original code and the issue was resolved.
This procedure and the schema it uses have been completely stable for many months. The procedure itself is quite simple:
CREATE Procedure [dbo].[uspActivityGet] (@databaseid uniqueidentifier) AS
begin
    SET NOCOUNT ON;

    -- There may be writing activity to the table happening asynchronously; do not use
    -- nolock on tblActivity - the ActivityBlob might be null in a dirty read.
    select top 100 a.Id, h.HandsetNumber, a.ActivityBlob, a.ActivityReceived
    from dbo.tblDatabases d with(nolock)
    join dbo.tblHandsets h with(nolock) on d.DatabaseId = h.DatabaseId
    join dbo.tblActivity a on h.Id = a.HandsetId
    where d.DatabaseId = @databaseid and a.ActivitySent is null
    order by a.ActivityReceived
end
While the procedure would hang and time out with something like this:
exec dbo.uspActivityGet 'AF3EA01B-DB22-4A39-9E1C-D096D2DF1215'
Running the identical select in a query window would return promptly and successfully:
declare @databaseid uniqueidentifier;
set @databaseid = 'AF3EA01B-DB22-4A39-9E1C-D096D2DF1215';

select top 100 a.Id, h.HandsetNumber, a.ActivityBlob, a.ActivityReceived
from dbo.tblDatabases d with(nolock)
join dbo.tblHandsets h with(nolock) on d.DatabaseId = h.DatabaseId
join dbo.tblActivity a on h.Id = a.HandsetId
where d.DatabaseId = @databaseid and a.ActivitySent is null
order by a.ActivityReceived
Any ideas how I can prevent this from happening in the future? Thank you.
Edit - Adding execution plan screenshot
Edit - Adding the query used to view running processes. There were many, approximately 150 at a guess, in the suspended state, and they were all for the same stored procedure - uspActivityGet. Also, Data IO percentage was maxed out the whole time, when it normally runs at 20-40% in peak demand times. I don't recall what the wait type was. Here is the query used to view that.
select * from sys.dm_Exec_requests r with(nolock) CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) order by r.total_elapsed_time desc
Edit - It happened again tonight. Here is the execution plan of the same procedure during the issue. After dropping and creating the procedure again, the execution plan returned to normal and the issue was resolved.
During the issue, sp_executesql with the identical query took about 5 minutes to execute and I believe that is representative of what was happening. There were about 50 instances of uspActivityGet suspended with wait type SLEEP_TASK or IO_QUEUE_LIMIT.
Perhaps the next question is why index rebuilding or other nightly maintenance is doing this to the execution plan?
The clues are in the query and the troublesome execution plan. See Poor Performance with Parallelism and Top
The normal execution plan seems quite efficient and shouldn't need to be recompiled as long as the relevant schema doesn't change. I also want to avoid parallelism in this query. I added the following two options to the query for assurance on both points, and all is happy again.
OPTION (KEEPFIXED PLAN, MAXDOP 1)
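For clarity, the hint sits at the end of the SELECT inside the procedure (same query as above):
select top 100 a.Id, h.HandsetNumber, a.ActivityBlob, a.ActivityReceived
from dbo.tblDatabases d with(nolock)
join dbo.tblHandsets h with(nolock) on d.DatabaseId = h.DatabaseId
join dbo.tblActivity a on h.Id = a.HandsetId
where d.DatabaseId = @databaseid and a.ActivitySent is null
order by a.ActivityReceived
OPTION (KEEPFIXED PLAN, MAXDOP 1);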

SQL Server lock escalation

My application runs a nightly purge process to delete old records from the primary tables in my OLTP application. I was experiencing lock escalation during the purge process which was blocking concurrent inserts into the table, so I modified the purge procedure to loop through and delete records in blocks of 4900 which should be well below SQL Server's lock escalation threshold of 5000. While lock escalation was much reduced, SQL Server Profiler still reports occasional lock escalation on the following DELETE statement in the loop:
-- outer loop increments @BatchMinId and @BatchMaxId variables
BEGIN TRAN;

-- @limit is set to 4900
DELETE TOP (@limit) h
OUTPUT DELETED.ChildTable1Id,
       DELETED.ChildTable2Id,
       DELETED.ChildTable3Id,
       DELETED.ChildTable4Id
INTO #ChildRecordsToDelete
FROM MainTable h WITH (ROWLOCK)
WHERE h.Id >= @BatchMinId AND h.Id <= @BatchMaxId AND h.Id < @MaxId
  AND NOT EXISTS (SELECT 1 FROM OtherTable ot WHERE ot.Id = h.Id);

-- delete from ChildTables 1-4 (no additional references to MainTable)

COMMIT TRAN;
-- end loop
The "IntegerData2" column in SQL Server Profiler for the reported lock escalation events (which is supposed to be the escalated lock count) ranges from 10197 to 10222 which does not look close to any multiple of 4900 (my purge batch size) plus any multiple of 1250 (number of additional locks SQL Server may take before attempting escalation).
Given that I am explicitly limiting the DELETE statement to 4900 rows, how are more locks ever being taken, especially to the point that SQL Server is escalating to a table lock? I would like to understand this before I resort to disabling lock escalation altogether on this table.
I can't comment on your question since I don't have enough reputation on this web site, so I'm commenting here.
I had a similar issue with a cleanup task running at night. The delete statement was being blocked by the "GHOST CLEANUP" process.
Here, have a look at this:
SQL Server Lock Timeout Exceeded Deleting Records in a Loop
Hope this helps.
One weird solution that I found at the time was:
1) Insert the records to keep into another table with the same structure (a copy).
2) Truncate the table to clean.
3) Insert the data to keep back from the copy into the now-empty table.
4) Truncate the copy table to release the space.
This trick was faster than the delete itself, because the deletion was done in a split second thanks to TRUNCATE. Somehow the cost of the insertions was lower than the cost of the deletions.
But still, I would recommend avoiding this solution. You could also reduce the chunk size to somewhere between 100 and 500. This increases the time the cleanup takes, but you are less likely to hit lock escalation.
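A rough sketch of the copy/truncate sequence described above; the table names and the keep condition are placeholders, not from the original post:
-- 1) copy the rows to keep into an identically structured table
INSERT INTO MainTable_Copy
SELECT * FROM MainTable
WHERE CreatedDate >= DATEADD(DAY, -30, GETDATE());   -- hypothetical keep condition

-- 2) wipe the original instantly (TRUNCATE deallocates pages rather than deleting rows)
TRUNCATE TABLE MainTable;

-- 3) put the kept rows back (an IDENTITY column would need SET IDENTITY_INSERT ON)
INSERT INTO MainTable
SELECT * FROM MainTable_Copy;

-- 4) release the space used by the copy
TRUNCATE TABLE MainTable_Copy;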

Does anyone see a performance issue with my logon Trigger?

Does anyone see a performance issue with my logon Trigger?
I'm trying to reduce the overhead and prevent any performance issues before I push this trigger to my production SQL Server.
I currently have the logon trigger working on my development SQL Server. I let it run over the past weekend and it put 50,000+ rows into my audit log table. I noticed that 95% of the records were for the login 'NT AUTHORITY\SYSTEM'. So I decided to filter out anything matching 'NT AUTHORITY%' and just not insert those records. My thinking is that if I filter out these 'NT AUTHORITY' records, the resources I'll save on those inserts will make up for the cost of the IF statement check. I have also been watching Perfmon and don't see anything unusual while the trigger is enabled, but then again my development server doesn't see the same amount of activity as production.
USE [MASTER]
GO

CREATE TRIGGER AuditServerAuthentication
ON ALL SERVER
WITH EXECUTE AS SELF
FOR LOGON
AS
BEGIN
    DECLARE @Event XML, @Logon_Name VARCHAR(100)

    SET @Event = EVENTDATA()
    SET @Logon_Name = CAST(@Event.query('/EVENT_INSTANCE/LoginName/text()') AS VARCHAR(100))

    IF @Logon_Name NOT LIKE 'NT AUTHORITY%'
    BEGIN
        INSERT INTO Auditing.Audit.Authentication_Log
            (Post_Time, Event_Type, Login_Name, Client_Host, Application_Name, Event_Data)
        VALUES
        (
            CAST(CAST(@Event.query('/EVENT_INSTANCE/PostTime/text()') AS VARCHAR(64)) AS DATETIME),
            CAST(@Event.query('/EVENT_INSTANCE/EventType/text()') AS VARCHAR(100)),
            CAST(@Event.query('/EVENT_INSTANCE/LoginName/text()') AS VARCHAR(100)),
            CAST(@Event.query('/EVENT_INSTANCE/ClientHost/text()') AS VARCHAR(100)),
            APP_NAME(),
            @Event
        )
    END
END
GO
I use a very similar trigger on my servers, and I haven't experienced any performance issues. The production DB gets about 10 logins per second. This creates a huge amount of data over time, which translates to bigger backups, etc.
For some servers I created a table with the users whose logins shouldn't be logged; with this it is also possible to refuse logins according to working hours.
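One possible shape for that filter table and check (the table name and schema below are hypothetical, not from the original post):
-- hypothetical lookup table of accounts whose logins should not be audited
CREATE TABLE Auditing.Audit.Excluded_Logins
(
    Login_Name VARCHAR(100) NOT NULL PRIMARY KEY
);

-- inside the trigger, gate the existing INSERT on this lookup instead of a LIKE check:
-- IF NOT EXISTS (SELECT 1 FROM Auditing.Audit.Excluded_Logins WHERE Login_Name = @Logon_Name)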
The difference with my trigger is that I've created a DB for auditing purposes, in which I created some stored procedures that I call from the trigger. The trigger looks like this:
ALTER TRIGGER [tr_AU_LogonLog] ON ALL SERVER
WITH EXECUTE AS 'AUDITUSER'
FOR LOGON
AS
BEGIN
    DECLARE @data XML, @rc INT

    SET @data = EVENTDATA()
    EXEC @rc = AuditDB.dbo.LogonLog @data
END;
EDIT: Oh, I forgot: it's recommended to create a specific user for the trigger; EXECUTE AS SELF can be dangerous in some scenarios.
EDIT2: There's some useful information about the EXECUTE AS statement here. Also, be careful while implementing logon triggers: you can accidentally lock yourself out, so I recommend keeping a connection open just in case :)
It doesn't look like a costly IF statement at all to me (it's not like you are selecting anything from the database) and, as you say, it would be far less costly than performing an INSERT that isn't necessary 95% of the time. However, I should add that I am just a database programmer and not a DBA, so I am open to being corrected here.
I am, though, slightly curious as to why you are doing this? Doesn't SQL Server already have a built-in mechanism for Login Auditing that you can use?
There's nothing immediately obvious. It's effectively an IF that protects an INSERT. The only thing I would validate is how expensive the XML parsing is - I've not used it in SQL Server yet, so it's an unknown to me.
Admittedly, it would seem odd for Microsoft to supply an easy way to get metadata (EVENTDATA()) yet make it expensive to parse, but stranger things have happened...
