Oracle database "log file sync" wait event

We are having an issue on our production database where the blocked sessions suddenly spike for a few minutes and everything stops working. Looking at the blocked session history, I can see the log writer process (LGWR) is causing most of the blocking and the ADDM report shows the following.
Waits on event "log file sync" while performing COMMIT and ROLLBACK operations
were consuming significant database time.
Is there a way to find out which sessions issued a high number of commits during a given time period, causing LGWR to block other sessions while it wrote to the redo log files?
Note that we have already moved the redo log files to a separate disk and restarted the database, which has improved things, but we're struggling to find which SQL or PL/SQL code causes the excessive commits that periodically send LGWR haywire.
Thanks.

Check v$active_session_history for the LGWR process to see what it was stuck on. If it is events such as "log file parallel write", it may well be an I/O issue. However, LGWR could itself be the "victim" here: it might be waiting on another background process for some reason.
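As a rough sketch of that check, the queries below look at what LGWR was waiting on, which sessions were most often sampled waiting on "log file sync", and which currently connected sessions have committed the most. The time window is a placeholder, and the first two queries assume you are licensed for the Diagnostics Pack, which ASH requires. ASH keeps only recent samples in memory; for older periods, dba_hist_active_sess_history can be queried in a similar way.

```sql
-- 1. What was LGWR itself waiting on? (time window is a placeholder)
--    On recent versions, log writer worker processes (LGnn) may need
--    to be included as well.
SELECT NVL(event, 'ON CPU') AS event,
       COUNT(*)             AS samples
FROM   v$active_session_history
WHERE  program LIKE '%(LGWR)%'
AND    sample_time BETWEEN TIMESTAMP '2024-01-01 10:00:00'
                       AND TIMESTAMP '2024-01-01 10:15:00'
GROUP  BY NVL(event, 'ON CPU')
ORDER  BY samples DESC;

-- 2. Which sessions were most often sampled waiting on "log file sync"?
SELECT session_id,
       session_serial#,
       module,
       sql_id,
       COUNT(*) AS samples
FROM   v$active_session_history
WHERE  event = 'log file sync'
AND    sample_time BETWEEN TIMESTAMP '2024-01-01 10:00:00'
                       AND TIMESTAMP '2024-01-01 10:15:00'
GROUP  BY session_id, session_serial#, module, sql_id
ORDER  BY samples DESC;

-- 3. Cumulative commit counts for sessions that are still connected
SELECT s.sid,
       s.username,
       s.program,
       st.value AS user_commits
FROM   v$sesstat  st
JOIN   v$statname n ON n.statistic# = st.statistic#
JOIN   v$session  s ON s.sid = st.sid
WHERE  n.name = 'user commits'
ORDER  BY st.value DESC;
```

The third query counts commits since each session connected, so it only hints at the heavy committers; sampling it twice and comparing the values gives a rate for the interval in between.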

Related

Many KILLED/ROLLBACK tasks running on SQL Server with 0% progress after 2 days

I created a stored procedure which sends an email and accidentally called the stored procedure within itself, creating an endless loop. Within a few seconds of executing the stored procedure I realized what I had done and fixed the loop, but it had already created 517 processes. I killed all the SPIDs, but they are stuck in a KILLED/ROLLBACK state.
This code shows me the processes:
select session_id,handle.percent_complete, *
from sys.dm_exec_requests handle
outer apply sys.fn_get_sql(handle.sql_handle) spname
where cast(handle.start_time as date) = '2022-01-10'
spname.text is showing 'xp_sysmail_format_query' for all the SPIDs. It's been two days, and all 517 processes have been stuck in this rollback state with 0% progress. We are still able to use all our business applications and execute queries, with the exception of EXEC msdb.dbo.sp_send_dbmail, which gets stuck executing even for a test email and has to be cancelled. This is not good, because auto-generated email warnings will not be sent and all other SQL email functions are blocked. I'm not sure what other jobs are being blocked at this time.
This is a huge problem and I cannot find a solution. I've read every post I can find about this and tried everything I can think of except restarting SQL Server. Some posts state that restarting SQL Server can fix this; others say not to restart it, or that the tasks will just resume in the KILLED/ROLLBACK state after the restart. I tried killing the SPIDs again WITH STATUSONLY, but that just informs me that they are in a rollback state with 0% complete.
Should I restart the server, and will this fix anything? Is there a solution other than restoring the DB from a backup that is more than 2 days old and losing all the work the entire business has done in the last couple of days?
Any assistance will be greatly appreciated.
As robust as it is, SQL Server sometimes (fortunately rarely) leaves us no choice, when a killed process does not complete its rollback in a timely fashion, but to adopt the IT mantra of turning it off and on again.
This is more prevalent when a transaction enlists external methods or functions; email in particular is notorious for this.
As unwelcome as a restart is, it is often the least expensive option in terms of time and should be considered early in the diagnosis process, once the low-hanging fruit has been exhausted.
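One common way to judge whether waiting is still worthwhile (a sketch, not a guarantee) is to run a query like the following a few times, a minute or so apart. If cpu_time, reads and writes do not change between runs, the rollback is probably not doing any work, and waiting longer is unlikely to help.

```sql
-- Requests that have been killed and are (nominally) rolling back
SELECT session_id,
       command,
       status,
       percent_complete,
       wait_type,
       last_wait_type,
       cpu_time,   -- milliseconds of CPU used by the request
       reads,
       writes
FROM   sys.dm_exec_requests
WHERE  command = 'KILLED/ROLLBACK'
ORDER  BY session_id;
```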

SQL Server ROLLBACK transaction took forever. Why?

We have a huge DML script that opens a transaction, performs a lot of changes, and only then commits.
Recently I triggered this script (through an app), and as it was taking quite a long time, I killed the session, which triggered a ROLLBACK.
The problem is that this ROLLBACK took forever and, moreover, hogged a lot of CPU (100% utilization). While monitoring the session (using the exec DMVs), I saw a lot of I/O-related waits (IO_COMPLETION, PAGEIOLATCH_*, etc.).
So my questions are:
1. Why does a rollback take so much time? Is it because it needs to write every reverted change to the log file? And could the I/O waits I saw be related to I/O operations against this log file?
2. Are there any online resources that I can find, that explains how ROLLBACK mechanism works?
Thank You
Based on another article on the DBA side of SO, ROLLBACKs are slower for at least two reasons. First, the original SQL can be multithreaded, whereas the rollback is single-threaded. Second, a commit confirms work that is already complete, whereas a rollback must not only identify the log action to reverse but also locate the affected row.
https://dba.stackexchange.com/questions/5233/is-rollback-a-fast-operation
This is what I have found out about why a ROLLBACK operation in SQL Server can be time-consuming and why it can produce a lot of I/O.
Background Knowledge (Open Tran/Log mechanism):
When a lot of changes are made to the database as part of an open transaction, they modify the data pages in memory (dirty pages), and the log records generated (grouped into structures called log blocks) are initially written to the in-memory log buffer. The dirty pages are flushed to disk either by a recurring checkpoint operation or by the lazy writer process. In accordance with SQL Server's write-ahead logging mechanism, the log records describing these changes must be flushed to disk before the dirty pages themselves are.
With this background in mind, rolling back a transaction is almost like a recovery operation: all the changes that were written to disk have to be undone. So the heavy I/O we experienced probably came from this, as there were a lot of data changes to undo.
Information Source: https://app.pluralsight.com/library/courses/sqlserver-logging/table-of-contents
This course has a very deep and detailed explanation of how logging and recovery work in SQL Server.
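To get a feel for how much work a rollback would involve, a query along these lines shows how much log each open transaction has generated so far; broadly, the more log a transaction has written, the more there is to undo. This is only an illustrative sketch using the transaction DMVs, not something taken from the course.

```sql
-- Log generated by each open transaction, largest first
SELECT st.session_id,
       DB_NAME(dt.database_id)                  AS database_name,
       dt.database_transaction_begin_time       AS begin_time,
       dt.database_transaction_log_record_count AS log_records,
       dt.database_transaction_log_bytes_used   AS log_bytes_used
FROM   sys.dm_tran_database_transactions AS dt
JOIN   sys.dm_tran_session_transactions  AS st
       ON st.transaction_id = dt.transaction_id
ORDER  BY dt.database_transaction_log_bytes_used DESC;
```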

Log suspend reason unknown

I am not a DBA but a programmer. Recently we have been getting a LOG SUSPEND issue daily on our production system. I am unable to catch the scenario, as it is not reproducible on my local system.
A file uploaded on production fails with a log suspend, while the same file uploaded locally works fine. Also, when the same file is uploaded again after some time, it works fine in production too.
I am really confused as to why this is happening.
Log Suspend indicates that the transaction log is filling up and may not be properly sized for the transaction rate you are supporting. Have the DBA/System Administrator add log device space to the database that is having issues. If possible, you may also want to break up any large transactions to lower the possibility of the log filling up.
As for a cause, it depends very much on how the system is set up. First check the database settings.
sp_helpdb will print out the list of databases on the server, as well as any options that may be set for each database.
If you don't see trunc log on chkpt, then the database is set up for maximum recoverability: the log space will only free up after a backup is run or after the transaction log is dumped. This allows for up-to-the-second recovery in the event of a failure, at the expense of using more log space.
If you DO see trunc log on chkpt, then the database will automatically truncate the log after a checkpoint occurs in the database. Checkpoints are issued by the database itself as part of routine processing, but the command can also be issued manually. If this option is set and the database still goes into log suspend, then you may have a transaction that was never properly closed (whether by committing or rolling back). You can check the master..syslogshold table to find long-running transactions.
A third possibility is that if the system is using SAP/Sybase Replication Server, there is actually a secondary truncation point used as part of the replication processes. The system will not truncate the transaction log until after a transaction has been read by the RepAgent process, so this too can cause a system to go into log suspend.
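The checks above could look roughly like this in isql. The database name my_database is a placeholder, and the note about the replication row reflects how the truncation point usually appears in syslogshold, so treat it as a hint rather than a rule.

```sql
-- Show the options (e.g. "trunc log on chkpt") for one database
sp_helpdb my_database
go

-- Oldest open transaction per database; a long-lived row here is a
-- likely reason the log cannot be truncated. A row with spid 0 usually
-- represents the replication truncation point, not a user session.
select db_name(dbid) as database_name,
       spid,
       starttime,
       name
from   master..syslogshold
order  by starttime
go
```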

Restarting due to SPID stuck on RUNNING, KILLED/ROLLEDBACK status

I'm a new "accidental" DBA and I'm currently trying to resolve a lockup caused by a trigger I created on a production database supporting a front end application.
I created a trigger, and then I decided I'd be best off creating a job to do the work instead, so tried to delete the trigger in object explorer. The delete failed with the message:
An exception occurred while executing a Transact-SQL statement or batch.
Lock request time out period exceeded.
I then tried to manually drop it and it failed at 0%, 0s left to go. I checked for the longest running transaction and then tried to kill the process in activity monitor. Since then the process has been stuck on "Task State:RUNNING and Command:KILLED/ROLLBACK". After some googling it sounds like I have two options.
Option 1: Restart DTC on the SQL server.... didn't work, still stuck.
Option 2: Restart the SQL service. Uh-oh.
This is the first time I've ever had to do anything like this and I'm pretty nervous being the only SQL guy in the office. Please can anyone let me know what the potential implications of restarting the service are, in terms of data loss and impact to front end users? Am I better off waiting to restart after business hours?
Thanks, and apologies if I've asked this question badly, first time for everything.
Cheers
Wait. It's rolling back and has to finish the rollback. Don't restart SQL Server; that will just result in the rollback continuing after the restart, possibly with the database offline.
If this is a production system and you do bounce the database, all users of your user interface will get weird and wonderful errors. Unless your application can handle it, your users will have a bad experience and then you will start getting phone calls from the boss....
As a side note, check for locking/blocking processes. The message in the question, "Lock request time out period exceeded.", suggests that locking/blocking is happening.
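A minimal sketch of such a blocking check, using the standard request DMV (the column selection is just one reasonable choice):

```sql
-- Requests that are currently blocked, and the session blocking each one
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,      -- milliseconds spent in the current wait
       r.wait_resource,
       t.text AS sql_text
FROM   sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE  r.blocking_session_id <> 0
ORDER  BY r.wait_time DESC;
```

Following blocking_session_id up the chain until it reaches a session that is not itself blocked identifies the head blocker.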

If I create a TransactionScope, is there a chance of it blocking the database if I stop it whilst debugging?

I'm debugging this error, which I suddenly started getting when writing a row to a table.
Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. The statement has been terminated.
Does it have something to do with transaction scopes? I was wondering: if I create a transaction scope, debug through it, and stop the application in the middle of debugging before it has reached the end of the transaction scope, is there a chance it will block the database?
If so, how do I unblock it?
NOTE - This might help: right now, I'm having trouble inserting rows into tables but have no trouble accessing and updating existing rows.
UPDATE - Well, I restarted the SQL Server service and it seems to have done the trick. Still, I'm curious to hear how it could have locked up in the first place. I do not want some part of my code doing that in production.
If opening a DB transaction would lock the entire database, that would be massively disappointing. Generally speaking, SQL Server locks on a per-row basis and then escalates these locks as necessary (I'm simplifying matters here significantly).
Each transaction has a timeout it's given to complete. If this time elapses and you neither commit nor roll back, you'll get that "Timeout expired" exception.
As far as "unblocking" goes, you don't usually have to worry. Everything's unlocked as soon as you close the connection.
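If you suspect that a connection from an interrupted debugging session is still holding a transaction open (an assumption about the cause, not something the question confirms), a query like this sketch lists the sessions that currently have open transactions, oldest first:

```sql
-- Sessions with an open transaction, and when that transaction began
SELECT s.session_id,
       s.login_name,
       s.host_name,
       s.program_name,
       s.status,
       atx.transaction_begin_time
FROM   sys.dm_tran_session_transactions AS st
JOIN   sys.dm_tran_active_transactions  AS atx
       ON atx.transaction_id = st.transaction_id
JOIN   sys.dm_exec_sessions             AS s
       ON s.session_id = st.session_id
ORDER  BY atx.transaction_begin_time;
```

A sleeping session from your development machine with an old transaction_begin_time would be the one to look at; closing that connection, or killing the session, releases its locks.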
