I am doing a sql server profiler on deadlocks because users are getting query time outs. In the profiler the eventClass column shows Lock:Escalation and Lock:Cancel. How do I find out what would cause a query to be canceled? Basically the same queries are being run by bunch of users and things zoom right through, but off and on throughout the day users are timing out. I am run sqldiag also; however, unfortunately I am not DBA and muddling my way through to discover the problem. Any suggestions?
thanks community
nick
Query timeouts and deadlocks are pretty much mutually exclusive.
A deadlock situation will be discovered very quickly by deadlock monitor background thread and dealt with in a manner that one of the deadlocked processes (usually the one with lower cost of rolling back) will be chosen as a deadlock victim and it's work up to the point rolled back.
A query timeout could happen with livelocks, with high number of cuncurrent processes trying to access the same resource and thus blocking one another. When the time elapsed exceeds the timeout value (set by the client) the query will be canceled (and this is the reason you're seeing the Lock:Canceled events in the trace).
It is very important that client handles this condition, because all the resorces taken inside a transaction which timed out will remain taken as long as the connection is alive or the transaction is not rolled back.
To diagnose blocking situations, you can do several things.
If you happen to be monitoring at the time when a process is blocked, run the following query to find out the head of the blocking chain so you can investigate further:
select r.session_id, r.host_name, r.program_name,
r.login_name, r.nt_domain, r.nt_user_name,
r.total_elapsed_time/1000 as total_elapsed_time_sec, getdate() as vrijeme,
(select text from sys.dm_exec_sql_text(c.most_recent_sql_handle)) as sql_text
from sys.dm_exec_connections c
inner join sys.dm_exec_sessions r on r.session_id = c.session_id
where r.is_user_process = 1
and exists (
select *
from sys.dm_os_waiting_tasks r2
where r2.blocking_session_id = r.session_id
)
and not exists (
select *
from sys.dm_os_waiting_tasks r3
where r3.session_id = r.session_id
)
and r.total_elapsed_time/1000 > 10
This query has a 10 seconds treshold.
Firthermore you can use the Profiler to capture the blocking process event and then analyze it later on. Check this link for detailed explanation:
https://www.simple-talk.com/sql/sql-tools/how-to-identify-blocking-problems-with-sql-profiler/
There will usually be a handful of queries responsible for large majority of blocking. Identify those and try to optimize them (rewrite, indexing..). Besides, you can set up read committed snapshot isolation level for the database to avoid readers waiting on writers.
Related
I'm running into an interesting issue in production, which I cannot replicate in our QA/Staging environments.
I have a query that is doing dirty reads on a fairly large table (around 6 million rows, but we only keep the last 90 days of data in it, older records are warehoused in a different database). This table has lots of writes to it, as it logs page views, but only occasionally is data read from the table.
Recently I noticed that when one specific query is running, SQL Server 2019 starts generating a ton of WRITELOG waits and appears to hold up any other requests that are trying to write to the database.
Now the query itself has nolock hints on all the tables, because it's okay if dirty data is returned. We use the nolock hints because the writes to the table are extremely frequently and queries to this table can be slow because there are a lot of page scans required.
The query itself looks something like this:
select
clt.ViewDate, clt.UserId, clt.RemoteAddress, clt.LibraryId, clt.Parameters
, u.Fullname
, cl.Id as VideoId, cl.Title
-- we need a compound key for each row, so we can count the unique rows
, case
when clt.ViewDate is null then null
else row_number() over (order by clt.ViewDate, clt.UserId, clt.LibraryId, clt.Parameters)
end as compoundKey
from
ContentLibrary as cl (nolock)
left join
(
ContentLibraryTracking as clt (nolock)
inner join
[User] as u (nolock)
on
clt.UserId = u.UserId
)
on
clt.ViewDate between #startDate and #endDate
and
clt.Parameters like #filter
where
1 = 1
and
cl.ContentType = #contentType
order by
clt.ViewDate
The problem table appears to be the ContentLibraryTracking. This is the table that has millions of rows and has lots of inserts and we warehouse rows nightly, so there can be a lot of page fragmentation. We do defrag the indices and stats weekly on the table.
When this query is running, sp_BlitzWho will report the query has entered into a CXCONSUMER. I will then see SQL Server 2019 starting to queue processes with a WRITELOG wait. This processed remain in this state until the query has finished running.
Since our application has some kind of write transaction with every page view, this means this query is holding up execution for entire application, which is obviously bad.
While I know have page scans is bad for a query plan, the query requires searching patterns in a varchar column, which is why the page scans happen. Since the reads are very infrequently, the table is optimized for writes since those are extremely frequent. And while the query could perform better, considering the work it's doing even when it's slow it runs within 15 seconds or so.
One thing I do see from the sp_BlitzWho results is the query is using parallelism and it also states the Transaction Isolation Level is Read Committed (which I would unexpected Read Uncommitted since all the tables have a nolock hint).
What would cause a query with dirty reads to be forcing the database to queue up WRITELOG events?
I could see this happening if the query was altering data and causing it's own transaction log entries, but that should not be happening with the query. That's the whole reason we are using the nolock hint on the tables.
Also, our database, log files and tempdb are all on their own logical storage devices, so reads from the database should not be causing a IO problems writing to the transaction log files.
A couple of notes on the environment:
We are running Microsoft SQL Server 2019 (RTM-CU8-GDR) (KB4583459) - 15.0.4083.2 (X64))
The database is running in a VM
We backup transaction logs every 5 minutes (could this be the issue?)
Memory and CPU usage appear fine with the query runs
SQL Monitor 11 only really shows spikes in the log flushes and waits (which would match the behavior). Page splits, buffer cache & page are all normal. I do see the "disk read bytes/sec" go up on the logic drive that has the database on it, but the writes on all drives (including the transaction logs) look okay.
Any thoughts would be greatly appreciated as I'm really scratching my head over this issue.
Right after I posted my question I started looking at the sp_BlitzWho results in more detail. I noticed the parallelism was using all the CPUs. So I changed the MAXDOP to half the CPU/cores and this appears to have resolved the issue. I'm going to keep monitoring the situation, but looks like an instance where the MAXDOP was not set correctly.
It make sense that if a query is eating up all the available cores, that other threads would be waiting. I was just thrown off by the WRITELOG waits.
Environment is SQL Server 2014, 64gb RAM, 6 processors. 2TB disk, with almost 400gb free space.
I have a procedure that is called by job. It creates temp table then joins several dimension tables to that table and inserts into fact table. It worked cleanly until monday running between 2 and 10 minutes. On monday it lasted nearly 5 hours without doing anything. idles process was at 98%, no reads no writes, state is suspended. There are no locks, no blocking sessions, literally nothing that I can pin down as culprit.
As soon as it's called it immediately goes to suspended state and I cannot find out why. It's supposed to be waiting for something, but I can't find what it's waiting for. It's blocking entire process and no data is being loaded.
I would really appreciate help.
A process goes into suspended mode because its waiting for a system resource become available. What specifically that resource is in your case, I'm not sure. If you re-run it and it continues to happen, I'd run a profiler trace on the procedure and see what it's doing at the moment it becomes suspended.
#XObi Mark,
In short, you'll need to look at the wait types and the query plan. Here's a query to capture the details of the query plan:
SELECT dm_ws.wait_duration_ms,
dm_ws.wait_type,
dm_es.status,
dm_t.TEXT,
dm_qp.query_plan,
dm_ws.session_ID,
dm_es.cpu_time,
dm_es.memory_usage,
dm_es.logical_reads,
dm_es.total_elapsed_time,
dm_es.program_name,
DB_NAME(dm_r.database_id) DatabaseName,
-- Optional columns
dm_ws.blocking_session_id,
dm_r.wait_resource,
dm_es.login_name,
dm_r.command,
dm_r.last_wait_type
FROM sys.dm_os_waiting_tasks dm_ws
INNER JOIN sys.dm_exec_requests dm_r ON dm_ws.session_id = dm_r.session_id
INNER JOIN sys.dm_exec_sessions dm_es ON dm_es.session_id = dm_r.session_id
CROSS APPLY sys.dm_exec_sql_text (dm_r.sql_handle) dm_t
CROSS APPLY sys.dm_exec_query_plan (dm_r.plan_handle) dm_qp
WHERE dm_es.is_user_process = 1
To analyze wait types, follow the advice on this link from Marcello Miorelli and steoleary.
How to find out why the status of a spid is suspended? What resources the spid is waiting for?
Tnx all for your input, I tried the above query and checked the helpful link you provided. It did give me a lot of info.
I finally traced the cause. It appears that procedure in question had a very inefficient way of getting dates from date table. It used
SELECT max(date) FROM d_date WHERE date_id = #myDateFrom
In the table date_id is integer, while date is, well, date.
This max was paralyzing the query. I realize that normally it's used to insure only a single row is returned from table, but in this case even without max only one row should be retrieved from d_date table. Removing the max from query returned execution time to roughly previous values.
Thank you all for your effort.
Obi Mark
I am in between the development of a report and the main data set of the report is being populated using a stored procedure. One particular update statement is as given below
UPDATE T1
SET
T1.Status=T2.Status,
T1.ErrorMessage=T2.ErrorMessage,
T1.IssueStatus=T2.IssueStatus
FROM
#tmptblOtherSongs T1
INNER JOIN #tmpOtherSongStatus T2 ON
T1.SongCode= T2.SongCode AND
T1.SocietyCode= T2.SocietyCode AND
T1.TableName=T2.TableName
WHERE
T1.SessionID='TRYFBGHk' AND
T2.Status IS NOT NULL
The sp goes into a suspended state when it reaches this particular update query. I tried running the query alone, but the result was the same.It runs fine for small amount of data, as in thousands but the issue occurs for more amount of data.
Is there any way in which I can prevent this. I have no clue why this is happening.
When the process is in suspended state, it is waiting for something. Quite often it's either waiting for I/O (reading or writing data, can be also tempdb related), or waiting for blocking to end.
If there is no blocking, it's most likely I/O operations. You should look at execution plan if there's something that doesn't look right and / or set statistics io on and check what tables are causing most of the I/O.
I/O can most often be improved by adding indexes, and in your case I would start looking at T1.SessionID and after that T1 & T2.SongCode and SocietyCode + maybe TableName depending on how many rows there are in the table per single value.
I am using Entity Framework, and I am inserting records into our database which include a blob field. The blob field can be up to 5 MB of data.
When inserting a record into this table, does it lock the whole table?
So if you are querying any data from the table, will it block until the insert is done (I realise there are ways around this, but I am talking by default)?
How long will it take before it causes a deadlock? Will that time depend on how much load is on the server, e.g. if there is not much load, will it take longer to cause a deadlock?
Is there a way to monitor and see what is locked at any particular time?
If each thread is doing queries on single tables, is there then a case where blocking can occur? So isn't it the case that a deadlock can only occur if you have a query which has a join and is acting on multiple tables?
This is taking into account that most of my code is just a bunch of select statements, not heaps of long running transactions or anything like that.
Holy cow, you've got a lot of questions in here, heh. Here's a few answers:
When inserting a record into this table, does it lock the whole table?
Not by default, but if you use the TABLOCK hint or if you're doing certain kinds of bulk load operations, then yes.
So if you are querying any data from the table will it block until the insert is done (I realise there are ways around this, but I am talking by default)?
This one gets a little trickier. If someone's trying to select data from a page in the table that you've got locked, then yes, you'll block 'em. You can work around that with things like the NOLOCK hint on a select statement or by using Read Committed Snapshot Isolation. For a starting point on how isolation levels work, check out Kendra Little's isolation levels poster.
How long will it take before it causes a deadlock? Will that time depend on how much load is on the server, e.g. if there is not much load will it take longer to cause a deadlock?
Deadlocks aren't based on time - they're based on dependencies. Say we've got this situation:
Query A is holding a bunch of locks, and to finish his query, he needs stuff that's locked by Query B
Query B is also holding a bunch of locks, and to finish his query, he needs stuff that's locked by Query A
Neither query can move forward (think Mexican standoff) so SQL Server calls it a draw, shoots somebody's query in the back, releases his locks, and lets the other query keep going. SQL Server picks the victim based on which one will be less expensive to roll back. If you want to get fancy, you can use SET DEADLOCK_PRIORITY LOW on particular queries to paint targets on their back, and SQL Server will shoot them first.
Is there a way to monitor and see what is locked at any particular time?
Absolutely - there's Dynamic Management Views (DMVs) you can query like sys.dm_tran_locks, but the easiest way is to use Adam Machanic's free sp_WhoIsActive stored proc. It's a really slick replacement for sp_who that you can call like this:
sp_WhoIsActive #get_locks = 1
For each running query, you'll get a little XML that describes all of the locks it holds. There's also a Blocking column, so you can see who's blocking who. To interpret the locks being held, you'll want to check the Books Online descriptions of lock types.
If each thread is doing queries on single tables, is there then a case where blocking can occur? So isn't it the case that a deadlock can only occur if you have a query which has a join and is acting on multiple tables?
Believe it or not, a single query can actually deadlock itself, and yes, queries can deadlock on just one table. To learn even more about deadlocks, check out The Difficulty with Deadlocks by Jeremiah Peschka.
If you have direct control over the SQL, you can force row level locking using:
INSERT INTO WITH (ROWLOCK) MyTable(Id, BigColumn)
VALUES(...)
These two answers might be helpful:
Is it possible to force row level locking in SQL Server?
Locking a table with a select in Entity Framework
To view current held locks in Management Studio, look under the server, then under Management/Activity Monitor. It has a section for locks by object, so you should be able to see whether the inserts are really causing a problem.
Deadlock errors generally return quite quickly. Deadlock states do not occur as a result of a timeout error occurring while waiting for a lock. Deadlock is detected by SQL Server by looking for cycles in the lock requests.
The best answer I can come up with is: It depends.
The best way to check is to find your connection SPID and use sp_lock SPID to check if the lock mode is X on the TAB type. You can also verify the table name with SELECT OBJECT_NAME(objid). I also like to use the below query to check for locking.
SELECT RESOURCE_TYPE,RESOURCE_SUBTYPE,DB_NAME(RESOURCE_DATABASE_ID) AS 'DATABASE',resource_database_id DBID,
RESOURCE_DESCRIPTION,RESOURCE_ASSOCIATED_ENTITY_ID,REQUEST_MODE,REQUEST_SESSION_ID,
CASE WHEN RESOURCE_TYPE = 'OBJECT' THEN OBJECT_NAME(RESOURCE_ASSOCIATED_ENTITY_ID,RESOURCE_DATABASE_ID) ELSE '' END OBJETO
FROM SYS.DM_TRAN_LOCKS (NOLOCK)
WHERE REQUEST_SESSION_ID = --SPID here
In SQL Server 2008 (and later) you can disable the lock escalation on the table and enforce a WITH (ROWLOCK) in your insert clause effectively forcing a rowlock. This can't be done prior to SQL Server 2008 (you can write WITH ROWLOCK, but SQL Server can choose to ignore it).
I'm speaking generals here, and I don't have much experience with BLOBs as I usually advise developers to avoid them, especially if larger than 1 MB.
I'm using SQL Server 2008 on Windows Server 2008 R2, all sp'd up.
I'm getting occasional issues with SQL Server hanging with the CPU usage on 100% on our live server. It seems all the wait time on SQL Sever when this happens is given to SOS_SCHEDULER_YIELD.
Here is the Stored Proc that causes the hang. I've added the "WITH (NOLOCK)" in an attempt to fix what seems to be a locking issue.
ALTER PROCEDURE [dbo].[MostPopularRead]
AS
BEGIN
SET NOCOUNT ON;
SELECT
c.ForeignId , ct.ContentSource as ContentSource
, sum(ch.HitCount * hw.Weight) as Popularity
, (sum(ch.HitCount * hw.Weight) * 100) / #Total as Percent
, #Total as TotalHits
from
ContentHit ch WITH (NOLOCK)
join [Content] c WITH (NOLOCK) on ch.ContentId = c.ContentId
join HitWeight hw WITH (NOLOCK) on ch.HitWeightId = hw.HitWeightId
join ContentType ct WITH (NOLOCK) on c.ContentTypeId = ct.ContentTypeId
where
ch.CreatedDate between #Then and #Now
group by
c.ForeignId , ct.ContentSource
order by
sum(ch.HitCount * hw.HitWeightMultiplier) desc
END
The stored proc reads from the table "ContentHit", which is a table that tracks when content on the site is clicked (it gets hit quite frequently - anything from 4 to 20 hits a minute). So its pretty clear that this table is the source of the problem. There is a stored proc that is called to add hit tracks to the ContentHit table, its pretty trivial, it just builds up a string from the params passed in, which involves a few selects from some lookup tables, followed by the main insert:
BEGIN TRAN
insert into [ContentHit]
(ContentId, HitCount, HitWeightId, ContentHitComment)
values
(#ContentId, isnull(#HitCount,1), isnull(#HitWeightId,1), #ContentHitComment)
COMMIT TRAN
The ContentHit table has a clustered index on its ID column, and I've added another index on CreatedDate since that is used in the select.
When I profile the issue, I see the Stored proc executes for exactly 30 seconds, then the SQL timeout exception occurs. If it makes a difference the web application using it is ASP.NET, and I'm using Subsonic (3) to execute these stored procs.
Can someone please advise how best I can solve this problem? I don't care about reading dirty data...
EDIT:
The MostPopularRead stored proc is called very infrequently - its called on the home page of the site, but the results are cached for a day. The pattern of events that I am seeing is when I clear the cache, multiple requests come in for the home site, and they all hit the stored proc because it hasn't yet been cached. SQL Server then maxes out, and can only be resolved by restarting the sql server process. When I do this, usually the proc will execute OK (in about 200 ms) and put the data back in the cache.
EDIT 2:
I've checked the execution plan, and the query looks quite sound. As I said earlier when it does run it only takes around 200ms to execute. I've added MAXDOP 1 to the select statement to force it to use only one CPU core, but I still see the issue. When I look at the wait times I see that XE_DISPATCHER_WAIT, ONDEMAND_TASK_QUEUE, BROKER_TRANSMITTER, KSOURCE_WAKEUP and BROKER_EVENTHANDLER are taking up a massive amount of wait time.
EDIT 3:
I previously thought that this was related to Subsonic, our ORM, but having switched to ADO.NET, the erros is still live.
The issue is likely concurrency, not locking. SOS_SCHEDULER_YIELD occurs when a task voluntarily yields the scheduler for other tasks to execute. During this wait the task is waiting for its quantum to be renewed.
How often is [MostPopularRead] SP called and how long does it take to execute?
The aggregation in your query might be rather CPU-intensive, especially if there are lots of data and/or ineffective indexes. So, you might end up with high CPU pressure - basically, a demand for CPU time is too high.
I'd consider the following:
Check what other queries are executing while CPU is 100% busy? Look at sys.dm_os_waiting_tasks, sys.dm_os_tasks, sys.dm_exec_requests.
Look at the query plan of [MostPopularRead], try to optimize the query. Quite often an ineffective query is the root cause of a performance problem, and query optimization is much more straightforward than other performance improvement techniques.
If the query plan is parallel and the query is often called by multiple clients simultaneously, forcing a single-thread plan with MAXDOP=1 hint might help (abundant use of parallel plans is usually indicated by SOS_SCHEDULER_YIELD and CXPACKET waits).
Also, have a look at this paper: Performance tuning with wait statistics. It gives a pretty good summary of different wait types and their impact on performance.
P.S. It is easier to use SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED before a query instead of adding (nolock) to each table.
Remove the NOLOCK hint.
Open a query in SSMS, run SET STATISTICSIO ON and run the query in the procedure. Let it finish and post here the IO stats messages. Then post the table definitions and all indexes defined on them. Then somebody will be able to reply with the proper indexes you need.
As with all SQL performance problem, the text of the query is largely irrelevant without complete schema definition.
A guesstimate covering index would be:
create index ContentHitCreatedDate
on ContentHit (CreatedDate)
include (HitCount, ContentId, HitWeightId);
Update
XE_DISPATCHER_WAIT, ONDEMAND_TASK_QUEUE, BROKER_TRANSMITTER, KSOURCE_WAKEUP and BROKER_EVENTHANDLER: you can safely ignore all these waits. They show up because they represent threads parked and waiting to dispatch XEvents, Service Broker or internal SQL thread pool work items. As they spend most of their time parked and waiting, they get accounted for unrealistic wait times. Ignore them.
If you believe ContentHit to be the source of your problem, you could add a Covering Index
CREATE INDEX IX_CONTENTHIT_CONTENTID_HITWEIGHTID_HITCOUNT
ON dbo.ContentHit (ContentID, HitWeightID, HitCount)
Take a look at the Query Plan if you want to be certain about the bottleneck in your query.
By default settings sql server uses all the core/cpu for all queries (max DoP setting> advanced property, DoP= Degree of Parallelism), which can lead to 100% CPU even if only one core is actually waiting for some I/O.
If you search the net or this site you will find resource explaining it better than me (like monitoring your I/o despite you see a CPU-bound problem).
On one server we couldn't change the application with a bad query that locked down all resources (CPU) but by setting DoP to the half of the number of core we managed to avoid that the server get "stopped". The effect on the queries being less parallel was negligible in our case.
--
Dom
Thanks to all who posted, I got some great SQL Server perf tuning tips.
In the end we ran out time to resolve this mystery - we found a more effecient way to collect this information and cache it in the database, so this solved the problem for us.