MSSQL Backup Immediately Suspended [BULKOP_BACKUP_DB] - sql-server

Our production database server has stopped running the backup maintenance plans.
The server has plenty of space. After googling and further investigation, it appears the backups are being halted the moment they are started.
Running "SELECT * FROM sys.dm_exec_requests WHERE command = 'BACKUP DATABASE'" shows the backup is suspended on the wait resource 'DATABASE: 5 [BULKOP_BACKUP_DB]' with a wait type of 'LCK_M_U'.
I have tried searching for an answer to this, but nothing seems to apply in this case.
As I stated, the server has plenty of disk space and has been restarted, yet the backups are still immediately suspended.
I'm all out of ideas and would appreciate any input to help me fix this issue.
Update: It seems it is waiting for session 77 to finish, which is another backup. That session is in 'KILLED/ROLLBACK' with a percentage complete of 76% (it doesn't appear to be changing) and a wait time of 267328092. Trying to kill this process results in 'SPID 77: transaction rollback in progress. Estimated rollback completion: 76%. Estimated time remaining: 86124 seconds.'
Update: After installing updates and trying some fixes, it appeared to be fixed. However, on its second attempt the backup stopped processing again, this time with a wait type of ASYNC_IO_COMPLETION.
Any ideas?
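The blocking chain described in the question (a new backup waiting on an older session that is stuck rolling back) can be inspected with a narrower query than SELECT *. This is only a sketch of one way to look at it, not a fix; the column list is a selection from sys.dm_exec_requests, and the filter values are assumptions about what the commands look like on the affected server.

```sql
-- Sketch: show backup requests and sessions that are rolling back,
-- together with the session that blocks them and what they wait on.
SELECT r.session_id,
       r.command,
       r.status,
       r.blocking_session_id,   -- 0 means the request is not blocked
       r.wait_type,
       r.wait_time,             -- reported in milliseconds
       r.wait_resource,
       r.percent_complete
FROM sys.dm_exec_requests AS r
WHERE r.command LIKE N'BACKUP%'
   OR r.command = N'KILLED/ROLLBACK';
```

If blocking_session_id on the suspended backup points at the session in KILLED/ROLLBACK, the new backup cannot proceed until that rollback finishes.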

Related

SQL Server - difference between stopping and disabling Agent Jobs

Can someone explain (I didn't find it in MS's docs) the difference in behaviour between EXEC dbo.sp_stop_job and EXEC dbo.sp_update_job @enabled = 0? I'm preparing for an AWS RDS reboot and need to turn off any job/SSIS/DMS that points to or from my RDS instance.
The goal is to stop any activity around RDS without a brutal, forced connection break.
If you stop a job, you stop the currently running agent job. If you disable a job, the job will not start at its next scheduled time.
Take, for example, a job that starts every hour, on the hour, and takes 10 minutes to complete. At 13:05 you stop the job; the process currently running is terminated (likely triggering rollbacks for any open transactions). The job will then run again at 14:00.
For the same job, at 16:01 you disable it. I believe (though I could not find it documented) that the run in progress will continue to completion. However, the job will not start at 17:00, nor at any later scheduled time, until it is enabled again.
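The two calls discussed above can be combined when preparing for a maintenance window. The following is a minimal sketch, assuming the job lives in msdb on the instance you are connected to; the job name is hypothetical.

```sql
-- Hypothetical job name; pick a real one from msdb.dbo.sysjobs.
DECLARE @job sysname = N'Hourly RDS export';

-- Disable first, so the schedule cannot start a new run
-- while you are dealing with the current one.
EXEC msdb.dbo.sp_update_job @job_name = @job, @enabled = 0;

-- Then stop the run that is in progress, if any.
-- This call may raise an error when the job is not currently running.
EXEC msdb.dbo.sp_stop_job @job_name = @job;

-- After the maintenance window, re-enable the job.
-- EXEC msdb.dbo.sp_update_job @job_name = @job, @enabled = 1;
```

If a gentler shutdown is preferred, the stop call can be skipped so that only future scheduled runs are prevented.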

How to resolve ERROR: could not start WAL streaming: ERROR: replication slot "xxx" is active for PID 124563

PROBLEM!!
After setting up my logical replication, with everything running smoothly, I wanted to dig into the logs to confirm there were no errors. But when I ran tail -f postgresql.log, I found the following error recurring: ERROR: could not start WAL streaming: ERROR: replication slot "sub" is active for PID 124898
SOLUTION!!
This is the simple solution: I went into my postgresql.conf file and searched for wal_sender_timeout on the master and wal_receiver_timeout on the slave. Both values were 120s, and I changed both to 300s (5 minutes). Then remember to reload both servers; a restart is not required. Wait about 5 to 10 minutes and the error is gone.
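The same change can be sketched in SQL instead of editing postgresql.conf by hand. This is an alternative to the method described above, not what the poster did: it assumes a superuser connection on each server and that ALTER SYSTEM is permitted there (managed services may not allow it). The 300s value is the one from the post.

```sql
-- On the publisher (the post's "master"):
ALTER SYSTEM SET wal_sender_timeout = '300s';
SELECT pg_reload_conf();   -- reload the configuration, no restart

-- On the subscriber (the post's "slave"):
ALTER SYSTEM SET wal_receiver_timeout = '300s';
SELECT pg_reload_conf();

-- On the publisher: check which backend PID currently holds each slot.
SELECT slot_name, active, active_pid
FROM pg_replication_slots;
```

The last query shows whether the slot named in the error is still held by the PID from the log message.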
We had an identical error message in our logs and tried this fix, but unfortunately our case was much more diabolical. I'm putting the notes here for the next poor soul. In our case, the publishing instance was an AWS managed RDS server, and it managed (ha ha) to build up such a WAL backlog that it went into catchup state, processed the WAL, and ran out of memory (getting killed by the OS every time) before it caught up. The experience on the client side was exactly what you see here: timeouts and failed WAL streaming. The fix was kind of nasty: we had to drop the whole replication link and rebuild it (fortunately it was a test database, so no harm done, but it's a situation you want to avoid). The cause was obvious once we looked at the logs on the publisher side, but from the subscription side it was more mysterious.

SQL Server: DB STARTUP blocking processes

I am trying to run
DBCC CHECKTABLE
(or CHECKDB, same result), but I keep getting this error:
Check statement aborted. Database contains deferred transactions.
I did some research and found that a process with SPID 5 and command DB STARTUP is blocking everything. This process has been running for a few days already, but neither DBCC OPENTRAN nor DBCC INPUTBUFFER(5) shows anything.
It looks like it just sits there and does nothing.
I've checked the logs for that database, and it seems the recovery process went fine (the last records say that step 3 of 3 was running and that over 500K transactions were rolled back, so I assume it's done).
I've tried some advice from Google, but none of it helped. Setting the database to SINGLE_USER, EMERGENCY, and even OFFLINE changed nothing; all of those attempts were blocked in one way or another. I can't restore from an earlier backup for various reasons, and I have run out of useful advice from Google.
Please help.

SQL Server scheduled stored procedure was working, but now runs very slowly

We have a stored procedure being called by a daily job schedule that used to work within a reasonable time frame but is now performing very badly and failing. Whatever the cause, it seems to also be bogging down our whole system. We tried rebooting the computer, but the problem persisted.
The stored procedure acts as an ETL to import data from one database to another and make some updates. When called from the job, it used to run within an hour, but then about 7 days ago it started taking 10-15 hours to run. Then the last 3 days it has failed altogether. Today I let it run for 10 hours and then cancelled it.
The error message for the failed runs showed that it was failing because the log file is out of space. So I tried to shrink the log file using the code below. It ran without errors, but it didn't reduce the file size at all. Since the code didn't help, I tried shrinking using SSMS, but that failed with the error:
Lock request time out period exceeded
I ran sp_who2 and, without knowing for sure (I'm a developer not a DBA), found the following which seemed relevant:
SPID: 63, Status: Suspended, Command: Delete, CPU Time: 1142382, DiskIO: 1254258
I thought that could be the issue so I tried to end that transaction using Kill 63. However, it appears that didn't work because if I run sp_who2 it now reads
SPID: 63, Status: Suspended, Command: Killed/Rollback, CPU Time: 1142803, DiskIO: 1261601
Any help to resolve the issue would be appreciated! Specifically:
Any ideas what could be causing the bad performance all of a sudden?
How can I shrink the log file? Could that be causing the bad performance?
Here's the code I tried:
USE MyDatabase;
GO
ALTER DATABASE MyDatabase
SET RECOVERY SIMPLE;
GO
DBCC SHRINKFILE (MyDatabase_log, 1);
GO
ALTER DATABASE MyDatabase
SET RECOVERY FULL;
GO
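Before shrinking, it can help to check why the log cannot be truncated, since a shrink only releases space that is no longer in use. The following is a rough diagnostic sketch; MyDatabase is the name used in the question, and the interpretation in the comments is a general rule of thumb rather than a diagnosis of this particular server.

```sql
-- Why is the log not being reused? ACTIVE_TRANSACTION, for example,
-- suggests an open or rolling-back transaction is pinning the log.
SELECT name,
       recovery_model_desc,
       log_reuse_wait_desc
FROM sys.databases
WHERE name = N'MyDatabase';

-- Log size and percentage in use for every database on the instance.
DBCC SQLPERF(LOGSPACE);
```

Note that switching to SIMPLE recovery and back to FULL, as in the code above, breaks the log backup chain until a new full or differential backup is taken.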
This ended up being a hard disk problem. Initially the hardware team insisted it was not, but after further testing and reviews it in fact was a bad disk.

Process/SPID is being blocked by itself, how to clear/kill without restarting SQL Server

We have a process that was running for 4 hours. Because it was running so long, it was causing other issues in the database, so it was decided to kill the process.
Now, the process is in a suspended state. It also states that it's being blocked by itself after querying sp_who2.
In activity monitor, here's the waitresource information:
objectlock lockPartition=0 objid=xxx subresource=FULL dbid=2 id=lockyyyy mode=X associatedObjectid=xxx
You'll notice that the objid and associatedObjectId are the same value.
Querying the sys.objects table shows NO results for that object id.
Is SQL Server waiting for a lock on an object that doesn't exist anymore? How can I get rid of this process without restarting SQL Server? (Our DBAs are not responding to help requests.)
Keep in mind, this is a test environment, but it is stopping all development/testing because we are unable to deploy any changes to our database, because one of those changes is affecting one of the objects that the process was accessing.
Edit: more info from activity monitor:
Command = 'KILLED/ROLLBACK'
TASK STATE = 'SUSPENDED'
I have experienced this many times. When you kill a large INSERT/UPDATE/DELETE statement, it can take hours to recover from this state (if it ever does recover).
Run KILL <spid> WITH STATUSONLY.
It will give you a percentage and an estimated remaining time for the ROLLBACK process.
Sometimes it says 0% or 100% and 0 estimated time. If you are patient, it may recover eventually. If you restart the server, the rollback is completed as part of database recovery; the database will show the IN RECOVERY state, and this is usually faster than waiting for the running server to finish the rollback.
Be aware that users won't be able to use the database until the recovery process ends. But if the SPID in KILLED/ROLLBACK state is blocking other processes, a restart might be an option.
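The progress check described in this answer can be sketched as follows. The session id 63 is a placeholder; substitute the SPID that shows KILLED/ROLLBACK on your server.

```sql
-- Report rollback progress for a session that has already been killed.
-- WITH STATUSONLY only reports; it does not issue a new kill.
KILL 63 WITH STATUSONLY;

-- The same information, plus wait details, from the requests DMV.
SELECT session_id,
       command,
       status,
       percent_complete,
       estimated_completion_time,   -- reported in milliseconds
       wait_type,
       blocking_session_id
FROM sys.dm_exec_requests
WHERE session_id = 63;
```

Running the query a few minutes apart shows whether percent_complete is moving or the rollback is stalled.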
Well, this seems to be a lock caused by parallel processing inside tempdb.
You can try KILL [processid], if you have the rights to do so.
Another way is to get more detailed process information with this:
SELECT * FROM sys.sysprocesses WHERE spid = YOURSPID
As the process runs in database id 2 (tempdb), try this:
SELECT * FROM tempdb.sys.all_objects WHERE object_id = OBJECTID
I see you have edited your question. If the SPID is in KILLED/ROLLBACK, you have to wait until the transaction is rolled back. After that, the process will be killed and removed. You can't do anything else, because transactional consistency must be preserved.
