Environment is SQL Server 2014, 64 GB RAM, 6 processors, 2 TB disk with almost 400 GB of free space.
I have a procedure that is called by a job. It creates a temp table, then joins several dimension tables to that table and inserts into a fact table. It ran cleanly until Monday, taking between 2 and 10 minutes. On Monday it ran for nearly 5 hours without doing anything: the idle process was at 98%, no reads, no writes, and the state was suspended. There are no locks, no blocking sessions, literally nothing that I can pin down as the culprit.
As soon as it's called it immediately goes into a suspended state and I cannot find out why. It's supposed to be waiting for something, but I can't find what it's waiting for. It's blocking the entire process and no data is being loaded.
I would really appreciate help.
A process goes into a suspended state because it's waiting for a system resource to become available. What specifically that resource is in your case, I'm not sure. If you re-run it and it continues to happen, I'd run a Profiler trace on the procedure and see what it's doing at the moment it becomes suspended.
@Obi Mark,
In short, you'll need to look at the wait types and the query plan. Here's a query to capture the details of the query plan:
SELECT dm_ws.wait_duration_ms,
dm_ws.wait_type,
dm_es.status,
dm_t.TEXT,
dm_qp.query_plan,
dm_ws.session_ID,
dm_es.cpu_time,
dm_es.memory_usage,
dm_es.logical_reads,
dm_es.total_elapsed_time,
dm_es.program_name,
DB_NAME(dm_r.database_id) DatabaseName,
-- Optional columns
dm_ws.blocking_session_id,
dm_r.wait_resource,
dm_es.login_name,
dm_r.command,
dm_r.last_wait_type
FROM sys.dm_os_waiting_tasks dm_ws
INNER JOIN sys.dm_exec_requests dm_r ON dm_ws.session_id = dm_r.session_id
INNER JOIN sys.dm_exec_sessions dm_es ON dm_es.session_id = dm_r.session_id
CROSS APPLY sys.dm_exec_sql_text (dm_r.sql_handle) dm_t
CROSS APPLY sys.dm_exec_query_plan (dm_r.plan_handle) dm_qp
WHERE dm_es.is_user_process = 1
To analyze wait types, follow the advice on this link from Marcello Miorelli and steoleary.
How to find out why the status of a spid is suspended? What resources the spid is waiting for?
Thanks all for your input. I tried the above query and checked the helpful link you provided. It gave me a lot of info.
I finally traced the cause. It appears that the procedure in question had a very inefficient way of getting dates from the date table. It used
SELECT max(date) FROM d_date WHERE date_id = @myDateFrom
In the table, date_id is an integer, while date is, well, a date.
This max was paralyzing the query. I realize that normally it's used to ensure only a single row is returned from the table, but in this case only one row should be retrieved from d_date even without the max. Removing the max from the query returned execution time to roughly its previous values.
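For reference, a minimal sketch of the rewritten lookup, assuming (as described above) that date_id uniquely identifies a row in d_date, so the aggregate is unnecessary:
-- Before: the MAX() forced an aggregation step that crippled the plan
-- SELECT max(date) FROM d_date WHERE date_id = @myDateFrom;
-- After: a plain single-row lookup on the (unique) date_id
SELECT date
FROM d_date
WHERE date_id = @myDateFrom;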
Thank you all for your effort.
Obi Mark
I'm running into an interesting issue in production, which I cannot replicate in our QA/Staging environments.
I have a query that is doing dirty reads on a fairly large table (around 6 million rows, but we only keep the last 90 days of data in it, older records are warehoused in a different database). This table has lots of writes to it, as it logs page views, but only occasionally is data read from the table.
Recently I noticed that when one specific query is running, SQL Server 2019 starts generating a ton of WRITELOG waits and appears to hold up any other requests that are trying to write to the database.
Now the query itself has nolock hints on all the tables, because it's okay if dirty data is returned. We use the nolock hints because writes to the table are extremely frequent and queries against this table can be slow, since a lot of page scans are required.
The query itself looks something like this:
select
clt.ViewDate, clt.UserId, clt.RemoteAddress, clt.LibraryId, clt.Parameters
, u.Fullname
, cl.Id as VideoId, cl.Title
-- we need a compound key for each row, so we can count the unique rows
, case
when clt.ViewDate is null then null
else row_number() over (order by clt.ViewDate, clt.UserId, clt.LibraryId, clt.Parameters)
end as compoundKey
from
ContentLibrary as cl (nolock)
left join
(
ContentLibraryTracking as clt (nolock)
inner join
[User] as u (nolock)
on
clt.UserId = u.UserId
)
on
clt.ViewDate between @startDate and @endDate
and
clt.Parameters like @filter
where
1 = 1
and
cl.ContentType = @contentType
order by
clt.ViewDate
The problem table appears to be ContentLibraryTracking. This is the table that has millions of rows and lots of inserts, and we warehouse rows nightly, so there can be a lot of page fragmentation. We defragment the indexes and update statistics on the table weekly.
When this query is running, sp_BlitzWho reports that the query has entered a CXCONSUMER wait. I will then see SQL Server 2019 start to queue processes with a WRITELOG wait. These processes remain in that state until the query has finished running.
Since our application performs some kind of write transaction with every page view, this means the query is holding up execution for the entire application, which is obviously bad.
While I know that page scans are bad for a query plan, the query requires searching for patterns in a varchar column, which is why the page scans happen. Since reads are very infrequent, the table is optimized for writes, which are extremely frequent. And while the query could perform better, considering the work it's doing, even when it's slow it runs within 15 seconds or so.
One thing I do see from the sp_BlitzWho results is that the query is using parallelism, and it also states the Transaction Isolation Level is Read Committed (which I would expect to be Read Uncommitted, since all the tables have a nolock hint).
What would cause a query with dirty reads to be forcing the database to queue up WRITELOG events?
I could see this happening if the query were altering data and generating its own transaction log entries, but that should not be happening with this query. That's the whole reason we are using the nolock hint on the tables.
Also, our database, log files and tempdb are all on their own logical storage devices, so reads from the database should not be causing I/O problems for writes to the transaction log files.
A couple of notes on the environment:
We are running Microsoft SQL Server 2019 (RTM-CU8-GDR) (KB4583459) - 15.0.4083.2 (X64)
The database is running in a VM
We backup transaction logs every 5 minutes (could this be the issue?)
Memory and CPU usage appear fine while the query runs
SQL Monitor 11 only really shows spikes in log flushes and waits (which would match the behavior). Page splits, buffer cache and page metrics are all normal. I do see "disk read bytes/sec" go up on the logical drive that holds the database, but the writes on all drives (including the transaction logs) look okay.
Any thoughts would be greatly appreciated as I'm really scratching my head over this issue.
Right after I posted my question I started looking at the sp_BlitzWho results in more detail. I noticed the parallelism was using all of the CPUs, so I changed MAXDOP to half the number of CPUs/cores and this appears to have resolved the issue. I'm going to keep monitoring the situation, but it looks like an instance where MAXDOP was not set correctly.
It makes sense that if a query is eating up all the available cores, other threads would be waiting. I was just thrown off by the WRITELOG waits.
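For anyone landing here later, a minimal sketch of changing the instance-wide MAXDOP setting (the value 4 is purely illustrative; pick whatever matches your core count and workload):
-- Instance-wide setting; 'show advanced options' must be enabled first
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 4;  -- e.g. half of an 8-core box
RECONFIGURE;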
I am in the middle of developing a report, and the main data set of the report is populated by a stored procedure. One particular update statement is given below:
UPDATE T1
SET
T1.Status=T2.Status,
T1.ErrorMessage=T2.ErrorMessage,
T1.IssueStatus=T2.IssueStatus
FROM
#tmptblOtherSongs T1
INNER JOIN #tmpOtherSongStatus T2 ON
T1.SongCode= T2.SongCode AND
T1.SocietyCode= T2.SocietyCode AND
T1.TableName=T2.TableName
WHERE
T1.SessionID='TRYFBGHk' AND
T2.Status IS NOT NULL
The sp goes into a suspended state when it reaches this particular update query. I tried running the query on its own, but the result was the same. It runs fine for small amounts of data (in the thousands of rows), but the issue occurs with larger volumes.
Is there any way I can prevent this? I have no clue why this is happening.
When a process is in a suspended state, it is waiting for something. Quite often it's either waiting for I/O (reading or writing data, which can also be tempdb related), or waiting for blocking to end.
If there is no blocking, it's most likely I/O. You should look at the execution plan to see if there's something that doesn't look right, and/or SET STATISTICS IO ON and check which tables are causing most of the I/O.
I/O can most often be improved by adding indexes. In your case I would start with T1.SessionID, and after that SongCode and SocietyCode on T1 and T2, plus maybe TableName, depending on how many rows there are in the table per single value.
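As an illustration only (column choices taken from the join and WHERE clause above, not from knowledge of your data distribution), indexing the temp tables might look something like this:
-- Hypothetical indexes on the temp tables to support the filter and the join;
-- verify with SET STATISTICS IO ON and the actual plan before keeping them.
CREATE INDEX IX_tmptblOtherSongs_Session
    ON #tmptblOtherSongs (SessionID, SongCode, SocietyCode, TableName);

CREATE INDEX IX_tmpOtherSongStatus_Song
    ON #tmpOtherSongStatus (SongCode, SocietyCode, TableName)
    INCLUDE (Status, ErrorMessage, IssueStatus);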
I am running a SQL Server Profiler trace on deadlocks because users are getting query timeouts. In the trace, the EventClass column shows Lock:Escalation and Lock:Cancel. How do I find out what would cause a query to be canceled? Basically the same queries are being run by a bunch of users and things zoom right through, but on and off throughout the day users are timing out. I am running sqldiag as well; however, unfortunately I am not a DBA and am muddling my way through trying to discover the problem. Any suggestions?
thanks community
nick
Query timeouts and deadlocks are pretty much mutually exclusive.
A deadlock situation will be discovered very quickly by the deadlock monitor background thread and dealt with: one of the deadlocked processes (usually the one with the lower cost of rolling back) will be chosen as the deadlock victim and its work up to that point rolled back.
A query timeout can happen with livelocks, with a high number of concurrent processes trying to access the same resource and thus blocking one another. When the time elapsed exceeds the timeout value (set by the client), the query is canceled (and this is the reason you're seeing the Lock:Cancel events in the trace).
It is very important that the client handles this condition, because all the resources taken inside a transaction which timed out will remain taken as long as the connection is alive or until the transaction is rolled back.
To diagnose blocking situations, you can do several things.
If you happen to be monitoring at the time when a process is blocked, run the following query to find out the head of the blocking chain so you can investigate further:
select r.session_id, r.host_name, r.program_name,
r.login_name, r.nt_domain, r.nt_user_name,
r.total_elapsed_time/1000 as total_elapsed_time_sec, getdate() as vrijeme,
(select text from sys.dm_exec_sql_text(c.most_recent_sql_handle)) as sql_text
from sys.dm_exec_connections c
inner join sys.dm_exec_sessions r on r.session_id = c.session_id
where r.is_user_process = 1
and exists (
select *
from sys.dm_os_waiting_tasks r2
where r2.blocking_session_id = r.session_id
)
and not exists (
select *
from sys.dm_os_waiting_tasks r3
where r3.session_id = r.session_id
)
and r.total_elapsed_time/1000 > 10
This query has a 10-second threshold.
Furthermore, you can use Profiler to capture the blocked process report event and analyze it later on. Check this link for a detailed explanation:
https://www.simple-talk.com/sql/sql-tools/how-to-identify-blocking-problems-with-sql-profiler/
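Note that the blocked process report described in the linked article is only emitted once the corresponding server option is set; a minimal sketch (the value 10 is just an example, in seconds):
-- Emit a blocked process report for anything blocked longer than 10 seconds
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'blocked process threshold', 10;
RECONFIGURE;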
There will usually be a handful of queries responsible for the large majority of the blocking. Identify those and try to optimize them (rewriting, indexing, ...). Besides that, you can enable the read committed snapshot isolation level for the database so that readers don't wait on writers.
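A minimal sketch of enabling read committed snapshot isolation (the database name is a placeholder; the ALTER needs exclusive access, hence the ROLLBACK IMMEDIATE clause or a quiet maintenance window):
-- YourDatabase is a placeholder; with RCSI, readers see the last committed
-- version of a row instead of waiting for writers to release their locks.
ALTER DATABASE YourDatabase SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;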
I have an NHibernate Query (which is populating an EXTJS grid)
It's firing 2 queries off to the database, one to get the record count (for paging purposes) and the other to get the top N rows to fill the grid with.
From code, I'm consistently getting an exception on the Select count(*) statement.
NHibernate.Exceptions.GenericADOException:
Failed to execute multi criteria[SQL:
SELECT count(*) as y0_ FROM RecordView this_ inner join ProcessesView
process1_ on this_.ProcessId=process1_.Id inner join BusinessModelsView
businessmo3_ on process1_.BusinessModelId=businessmo3_.Id inner join BatchesView
batch2_ on this_.BatchId=batch2_.Id WHERE this_.ProcessId = ?;
] ---> System.Data.SqlClient.SqlException: Timeout expired.
The timeout period elapsed prior to completion of the operation or the server
is not responding.
However, if I take that exact query, drop it into an SSMS window and run it, it executes in under a second.
Is NHibernate doing anything "funny" under the hood here? Are there execution plan/cache issues? I'm at a complete loss as to why this is occurring.
Whenever I encountered this error, the reason was locking (never performance). There were two sessions open (accidentally). Both started transactions, and one of them locked the table.
The problem could be a session that was never disposed, or an "unintended" singleton... holding an open session.
This answer is not as straightforward as I would wish, but I am sure about the direction, because I experienced the same thing (and was the guilty party).
BTW: as Oskar Berggren found out from you, the 30-second timeout would be related to the <property name="command_timeout">30</property> setting. I am sure that if you provide 60, 120, ... it will still not be enough, because of the lock.
Your two queries are not handled in the same way by SQL Server:
your NH query has been compiled on its first execution, based on table statistics and on the first value of the parameter. The generated query plan will then be used for all subsequent calls, without considering the parameter value
your SQL query (where, I guess, you replace the ? with an actual value) gets a different compilation for each value, based on statistics, and on the value.
Your first NH compilation might have produced a query plan that is effective for the first value, but not in the general case.
First, I would suggest that:
you count on a projection (say, the main table's id), as it is slightly more efficient than count(*), allowing the DB to work only on indexes where possible
you check that you don't miss any index necessary to your query
you check that all your table statistics are up to date
If this does not improve execution time, this post offers some options (recompile might be the right one):
Query executed from Nhibernate is slow, but from ADO.NET is fast
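For illustration, a hedged sketch of what the recompile option looks like in raw T-SQL, based on the generated count query above (the @processId variable and its value are hypothetical stand-ins for the "?" parameter; getting NHibernate to emit the hint would require an interceptor or a named SQL query, which is not shown here):
DECLARE @processId int = 42;  -- hypothetical value; NHibernate binds this as the "?" parameter

SELECT count(*) as y0_
FROM RecordView this_
inner join ProcessesView process1_ on this_.ProcessId = process1_.Id
inner join BusinessModelsView businessmo3_ on process1_.BusinessModelId = businessmo3_.Id
inner join BatchesView batch2_ on this_.BatchId = batch2_.Id
WHERE this_.ProcessId = @processId
OPTION (RECOMPILE);  -- compile a plan for this parameter value instead of reusing a sniffed one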
I'm using SQL Server 2008 on Windows Server 2008 R2, all sp'd up.
I'm getting occasional issues with SQL Server hanging at 100% CPU usage on our live server. It seems all the wait time on SQL Server when this happens is attributed to SOS_SCHEDULER_YIELD.
Here is the Stored Proc that causes the hang. I've added the "WITH (NOLOCK)" in an attempt to fix what seems to be a locking issue.
ALTER PROCEDURE [dbo].[MostPopularRead]
AS
BEGIN
SET NOCOUNT ON;
SELECT
c.ForeignId , ct.ContentSource as ContentSource
, sum(ch.HitCount * hw.Weight) as Popularity
, (sum(ch.HitCount * hw.Weight) * 100) / @Total as Percent
, @Total as TotalHits
from
ContentHit ch WITH (NOLOCK)
join [Content] c WITH (NOLOCK) on ch.ContentId = c.ContentId
join HitWeight hw WITH (NOLOCK) on ch.HitWeightId = hw.HitWeightId
join ContentType ct WITH (NOLOCK) on c.ContentTypeId = ct.ContentTypeId
where
ch.CreatedDate between @Then and @Now
group by
c.ForeignId , ct.ContentSource
order by
sum(ch.HitCount * hw.HitWeightMultiplier) desc
END
The stored proc reads from the table "ContentHit", which tracks when content on the site is clicked (it gets hit quite frequently - anything from 4 to 20 hits a minute), so it's pretty clear that this table is the source of the problem. There is a stored proc that is called to add hit records to the ContentHit table; it's pretty trivial, it just builds up a string from the params passed in, which involves a few selects from some lookup tables, followed by the main insert:
BEGIN TRAN
insert into [ContentHit]
(ContentId, HitCount, HitWeightId, ContentHitComment)
values
(@ContentId, isnull(@HitCount,1), isnull(@HitWeightId,1), @ContentHitComment)
COMMIT TRAN
The ContentHit table has a clustered index on its ID column, and I've added another index on CreatedDate since that is used in the select.
When I profile the issue, I see the stored proc executes for exactly 30 seconds, then the SQL timeout exception occurs. If it makes a difference, the web application using it is ASP.NET, and I'm using Subsonic (3) to execute these stored procs.
Can someone please advise how best I can solve this problem? I don't care about reading dirty data...
EDIT:
The MostPopularRead stored proc is called very infrequently - it's called on the home page of the site, but the results are cached for a day. The pattern of events that I am seeing is: when I clear the cache, multiple requests come in for the home page, and they all hit the stored proc because the result hasn't yet been cached. SQL Server then maxes out, and the situation can only be resolved by restarting the SQL Server process. When I do this, the proc usually executes OK (in about 200 ms) and the data goes back into the cache.
EDIT 2:
I've checked the execution plan, and the query looks quite sound. As I said earlier, when it does run it only takes around 200 ms to execute. I've added MAXDOP 1 to the select statement to force it to use only one CPU core, but I still see the issue. When I look at the wait times I see that XE_DISPATCHER_WAIT, ONDEMAND_TASK_QUEUE, BROKER_TRANSMITTER, KSOURCE_WAKEUP and BROKER_EVENTHANDLER are taking up a massive amount of wait time.
EDIT 3:
I previously thought that this was related to Subsonic, our ORM, but having switched to ADO.NET, the error still occurs.
The issue is likely concurrency, not locking. SOS_SCHEDULER_YIELD occurs when a task voluntarily yields the scheduler for other tasks to execute. During this wait the task is waiting for its quantum to be renewed.
How often is [MostPopularRead] SP called and how long does it take to execute?
The aggregation in your query might be rather CPU-intensive, especially if there is a lot of data and/or ineffective indexes. So you might end up with high CPU pressure - basically, the demand for CPU time is too high.
I'd consider the following:
Check what other queries are executing while CPU is 100% busy? Look at sys.dm_os_waiting_tasks, sys.dm_os_tasks, sys.dm_exec_requests.
Look at the query plan of [MostPopularRead], try to optimize the query. Quite often an ineffective query is the root cause of a performance problem, and query optimization is much more straightforward than other performance improvement techniques.
If the query plan is parallel and the query is often called by multiple clients simultaneously, forcing a single-threaded plan with the MAXDOP=1 hint might help (abundant use of parallel plans is usually indicated by SOS_SCHEDULER_YIELD and CXPACKET waits); see the sketch below.
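For clarity, the hint goes in an OPTION clause at the very end of the statement; a trimmed sketch based on the procedure above (using the procedure's @Then/@Now variables):
select c.ForeignId, sum(ch.HitCount * hw.Weight) as Popularity
from ContentHit ch
join [Content] c on ch.ContentId = c.ContentId
join HitWeight hw on ch.HitWeightId = hw.HitWeightId
where ch.CreatedDate between @Then and @Now
group by c.ForeignId
option (maxdop 1);  -- serial plan for this statement only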
Also, have a look at this paper: Performance tuning with wait statistics. It gives a pretty good summary of different wait types and their impact on performance.
P.S. It is easier to use SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED before a query instead of adding (nolock) to each table.
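A minimal sketch of that, assuming it goes at the top of the procedure body:
-- One statement at the top of the procedure replaces every per-table (nolock) hint
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
-- ...then run the select against ContentHit / Content / HitWeight / ContentType
-- with all of the WITH (NOLOCK) hints removed.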
Remove the NOLOCK hint.
Open a query window in SSMS, run SET STATISTICS IO ON and run the query from the procedure. Let it finish and post the I/O statistics messages here. Then post the table definitions and all the indexes defined on them. Then somebody will be able to reply with the proper indexes you need.
As with all SQL performance problems, the text of the query is largely irrelevant without the complete schema definition.
A guesstimate covering index would be:
create index ContentHitCreatedDate
on ContentHit (CreatedDate)
include (HitCount, ContentId, HitWeightId);
Update
XE_DISPATCHER_WAIT, ONDEMAND_TASK_QUEUE, BROKER_TRANSMITTER, KSOURCE_WAKEUP and BROKER_EVENTHANDLER: you can safely ignore all of these waits. They show up because they represent threads parked and waiting to dispatch XEvents, Service Broker messages or internal SQL thread pool work items. Because they spend most of their time parked and waiting, they accumulate unrealistically large wait times. Ignore them.
If you believe ContentHit to be the source of your problem, you could add a Covering Index
CREATE INDEX IX_CONTENTHIT_CONTENTID_HITWEIGHTID_HITCOUNT
ON dbo.ContentHit (ContentID, HitWeightID, HitCount)
Take a look at the Query Plan if you want to be certain about the bottleneck in your query.
With its default settings, SQL Server uses all cores/CPUs for every query (the max degree of parallelism setting, found under advanced properties; DoP = Degree of Parallelism), which can lead to 100% CPU even if only one core is actually waiting for some I/O.
If you search the net or this site you will find resources explaining it better than I can (like monitoring your I/O even though you see a CPU-bound problem).
On one server we couldn't change the application with a bad query that consumed all resources (CPU), but by setting DoP to half the number of cores we managed to stop the server from getting "stopped". The effect of the queries being less parallel was negligible in our case.
--
Dom
Thanks to all who posted, I got some great SQL Server perf tuning tips.
In the end we ran out of time to resolve this mystery - we found a more efficient way to collect this information and cache it in the database, so this solved the problem for us.