I'm using SQL Server 2008 on Windows Server 2008 R2, all sp'd up.
I'm getting occasional issues with SQL Server hanging with CPU usage at 100% on our live server. It seems all the wait time on SQL Server when this happens is attributed to SOS_SCHEDULER_YIELD.
Here is the Stored Proc that causes the hang. I've added the "WITH (NOLOCK)" in an attempt to fix what seems to be a locking issue.
ALTER PROCEDURE [dbo].[MostPopularRead]
AS
BEGIN
SET NOCOUNT ON;
SELECT
c.ForeignId , ct.ContentSource as ContentSource
, sum(ch.HitCount * hw.Weight) as Popularity
, (sum(ch.HitCount * hw.Weight) * 100) / @Total as Percent
, @Total as TotalHits
from
ContentHit ch WITH (NOLOCK)
join [Content] c WITH (NOLOCK) on ch.ContentId = c.ContentId
join HitWeight hw WITH (NOLOCK) on ch.HitWeightId = hw.HitWeightId
join ContentType ct WITH (NOLOCK) on c.ContentTypeId = ct.ContentTypeId
where
ch.CreatedDate between @Then and @Now
group by
c.ForeignId , ct.ContentSource
order by
sum(ch.HitCount * hw.HitWeightMultiplier) desc
END
The stored proc reads from the table "ContentHit", which is a table that tracks when content on the site is clicked (it gets hit quite frequently - anything from 4 to 20 hits a minute). So it's pretty clear that this table is the source of the problem. There is a stored proc that is called to add hit tracks to the ContentHit table. It's pretty trivial: it just builds up a string from the params passed in, which involves a few selects from some lookup tables, followed by the main insert:
BEGIN TRAN
insert into [ContentHit]
(ContentId, HitCount, HitWeightId, ContentHitComment)
values
(@ContentId, isnull(@HitCount,1), isnull(@HitWeightId,1), @ContentHitComment)
COMMIT TRAN
The ContentHit table has a clustered index on its ID column, and I've added another index on CreatedDate since that is used in the select.
When I profile the issue, I see the Stored proc executes for exactly 30 seconds, then the SQL timeout exception occurs. If it makes a difference the web application using it is ASP.NET, and I'm using Subsonic (3) to execute these stored procs.
Can someone please advise how best I can solve this problem? I don't care about reading dirty data...
EDIT:
The MostPopularRead stored proc is called very infrequently - it's called on the home page of the site, but the results are cached for a day. The pattern of events that I am seeing is that when I clear the cache, multiple requests come in for the home page, and they all hit the stored proc because the result hasn't yet been cached. SQL Server then maxes out, and this can only be resolved by restarting the SQL Server process. When I do this, usually the proc will execute OK (in about 200 ms) and put the data back in the cache.
EDIT 2:
I've checked the execution plan, and the query looks quite sound. As I said earlier when it does run it only takes around 200ms to execute. I've added MAXDOP 1 to the select statement to force it to use only one CPU core, but I still see the issue. When I look at the wait times I see that XE_DISPATCHER_WAIT, ONDEMAND_TASK_QUEUE, BROKER_TRANSMITTER, KSOURCE_WAKEUP and BROKER_EVENTHANDLER are taking up a massive amount of wait time.
EDIT 3:
I previously thought that this was related to Subsonic, our ORM, but having switched to ADO.NET, the error still occurs.
The issue is likely concurrency, not locking. SOS_SCHEDULER_YIELD occurs when a task voluntarily yields the scheduler for other tasks to execute. During this wait the task is waiting for its quantum to be renewed.
How often is [MostPopularRead] SP called and how long does it take to execute?
The aggregation in your query might be rather CPU-intensive, especially if there are lots of data and/or ineffective indexes. So, you might end up with high CPU pressure - basically, a demand for CPU time is too high.
I'd consider the following:
Check what other queries are executing while the CPU is 100% busy. Look at sys.dm_os_waiting_tasks, sys.dm_os_tasks, sys.dm_exec_requests.
Look at the query plan of [MostPopularRead], try to optimize the query. Quite often an ineffective query is the root cause of a performance problem, and query optimization is much more straightforward than other performance improvement techniques.
If the query plan is parallel and the query is often called by multiple clients simultaneously, forcing a single-thread plan with MAXDOP=1 hint might help (abundant use of parallel plans is usually indicated by SOS_SCHEDULER_YIELD and CXPACKET waits).
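As a rough sketch of the first suggestion, something like the following could be run from a separate connection while the CPU is pegged. It only uses documented columns of sys.dm_exec_requests and sys.dm_exec_sql_text; which columns to return and how to sort them are my own choices, not part of the original advice.

```sql
-- List currently executing requests, most CPU-hungry first,
-- together with the text of the batch each one is running.
SELECT
    r.session_id,
    r.status,
    r.wait_type,
    r.wait_time,          -- ms spent in the current wait
    r.cpu_time,           -- ms of CPU used by the request so far
    r.logical_reads,
    t.text AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID   -- skip this diagnostic query itself
ORDER BY r.cpu_time DESC;
```

If several rows show the same MostPopularRead statement at once, that would support the concurrency theory rather than a locking problem.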
Also, have a look at this paper: Performance tuning with wait statistics. It gives a pretty good summary of different wait types and their impact on performance.
P.S. It is easier to use SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED before a query instead of adding (nolock) to each table.
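A minimal sketch of that P.S., reusing two of the table names from the question. The SELECT itself is only an illustration, not the original procedure body.

```sql
-- Set once per session (or at the top of the procedure) instead of
-- repeating WITH (NOLOCK) on every table reference.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT COUNT(*) AS HitRows
FROM ContentHit ch
JOIN [Content] c ON ch.ContentId = c.ContentId;

-- Restore the default level if the session goes on to do other work.
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
```

When the SET is issued inside a stored procedure, the isolation level goes back to the caller's level once the procedure returns, so the final statement is only needed for ad hoc sessions.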
Remove the NOLOCK hint.
Open a query in SSMS, run SET STATISTICS IO ON and run the query in the procedure. Let it finish and post here the IO stats messages. Then post the table definitions and all indexes defined on them. Then somebody will be able to reply with the proper indexes you need.
As with all SQL performance problems, the text of the query is largely irrelevant without the complete schema definition.
A guesstimate covering index would be:
create index ContentHitCreatedDate
on ContentHit (CreatedDate)
include (HitCount, ContentId, HitWeightId);
Update
XE_DISPATCHER_WAIT, ONDEMAND_TASK_QUEUE, BROKER_TRANSMITTER, KSOURCE_WAKEUP and BROKER_EVENTHANDLER: you can safely ignore all these waits. They show up because they represent threads parked and waiting to dispatch XEvents, Service Broker or internal SQL thread pool work items. As they spend most of their time parked and waiting, they accumulate unrealistically large wait times. Ignore them.
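To look at the wait statistics without that noise, the five wait types named above can simply be filtered out. This is only a sketch: the exclusion list covers just the waits mentioned here, and a real server will have other benign background waits worth excluding too.

```sql
-- Top waits since the last restart (or since the stats were cleared),
-- ignoring the background waits discussed above.
SELECT TOP (10)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms   -- time spent waiting for a scheduler after being signaled
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (
    N'XE_DISPATCHER_WAIT',
    N'ONDEMAND_TASK_QUEUE',
    N'BROKER_TRANSMITTER',
    N'KSOURCE_WAKEUP',
    N'BROKER_EVENTHANDLER'
)
ORDER BY wait_time_ms DESC;
```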
If you believe ContentHit to be the source of your problem, you could add a Covering Index
CREATE INDEX IX_CONTENTHIT_CONTENTID_HITWEIGHTID_HITCOUNT
ON dbo.ContentHit (ContentID, HitWeightID, HitCount)
Take a look at the Query Plan if you want to be certain about the bottleneck in your query.
With default settings, SQL Server can use all cores/CPUs for a single query (the "max degree of parallelism" setting under the server's advanced properties; DoP = Degree of Parallelism), which can lead to 100% CPU even if only one core is actually waiting for some I/O.
If you search the net or this site you will find resources explaining it better than I can (such as monitoring your I/O even though you see a CPU-bound problem).
On one server we couldn't change the application, which had a bad query that locked down all resources (CPU), but by setting DoP to half the number of cores we managed to keep the server from getting "stopped". The effect of the queries being less parallel was negligible in our case.
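For reference, the server-wide setting can also be changed from T-SQL rather than through the properties dialog. The value 4 below is just an assumption for an 8-core machine ("half the number of cores"); use whatever fits the actual hardware.

```sql
-- 'max degree of parallelism' is an advanced option, so expose it first.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Assumed value: 4, i.e. half of a hypothetical 8-core server.
EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;
```

The change takes effect without a restart, and individual queries can still override it with an OPTION (MAXDOP n) hint.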
--
Dom
Thanks to all who posted, I got some great SQL Server perf tuning tips.
In the end we ran out of time to resolve this mystery - we found a more efficient way to collect this information and cache it in the database, so this solved the problem for us.
Related
I'm running into an interesting issue in production, which I cannot replicate in our QA/Staging environments.
I have a query that is doing dirty reads on a fairly large table (around 6 million rows, but we only keep the last 90 days of data in it, older records are warehoused in a different database). This table has lots of writes to it, as it logs page views, but only occasionally is data read from the table.
Recently I noticed that when one specific query is running, SQL Server 2019 starts generating a ton of WRITELOG waits and appears to hold up any other requests that are trying to write to the database.
Now the query itself has nolock hints on all the tables, because it's okay if dirty data is returned. We use the nolock hints because the writes to the table are extremely frequent and queries to this table can be slow because there are a lot of page scans required.
The query itself looks something like this:
select
clt.ViewDate, clt.UserId, clt.RemoteAddress, clt.LibraryId, clt.Parameters
, u.Fullname
, cl.Id as VideoId, cl.Title
-- we need a compound key for each row, so we can count the unique rows
, case
when clt.ViewDate is null then null
else row_number() over (order by clt.ViewDate, clt.UserId, clt.LibraryId, clt.Parameters)
end as compoundKey
from
ContentLibrary as cl (nolock)
left join
(
ContentLibraryTracking as clt (nolock)
inner join
[User] as u (nolock)
on
clt.UserId = u.UserId
)
on
clt.ViewDate between @startDate and @endDate
and
clt.Parameters like @filter
where
1 = 1
and
cl.ContentType = @contentType
order by
clt.ViewDate
The problem table appears to be the ContentLibraryTracking. This is the table that has millions of rows and has lots of inserts and we warehouse rows nightly, so there can be a lot of page fragmentation. We do defrag the indices and stats weekly on the table.
When this query is running, sp_BlitzWho will report the query has entered a CXCONSUMER wait. I will then see SQL Server 2019 starting to queue processes with a WRITELOG wait. These processes remain in this state until the query has finished running.
Since our application has some kind of write transaction with every page view, this means this query is holding up execution for the entire application, which is obviously bad.
While I know having page scans is bad for a query plan, the query requires searching for patterns in a varchar column, which is why the page scans happen. Since the reads are very infrequent, the table is optimized for writes, since those are extremely frequent. And while the query could perform better, considering the work it's doing, even when it's slow it runs within 15 seconds or so.
One thing I do see from the sp_BlitzWho results is the query is using parallelism and it also states the Transaction Isolation Level is Read Committed (I would have expected Read Uncommitted, since all the tables have a nolock hint).
What would cause a query with dirty reads to be forcing the database to queue up WRITELOG events?
I could see this happening if the query was altering data and causing its own transaction log entries, but that should not be happening with the query. That's the whole reason we are using the nolock hint on the tables.
Also, our database, log files and tempdb are all on their own logical storage devices, so reads from the database should not be causing IO problems when writing to the transaction log files.
A couple of notes on the environment:
We are running Microsoft SQL Server 2019 (RTM-CU8-GDR) (KB4583459) - 15.0.4083.2 (X64)
The database is running in a VM
We backup transaction logs every 5 minutes (could this be the issue?)
Memory and CPU usage appear fine when the query runs
SQL Monitor 11 only really shows spikes in the log flushes and waits (which would match the behavior). Page splits, buffer cache & page are all normal. I do see the "disk read bytes/sec" go up on the logical drive that has the database on it, but the writes on all drives (including the transaction logs) look okay.
Any thoughts would be greatly appreciated as I'm really scratching my head over this issue.
Right after I posted my question I started looking at the sp_BlitzWho results in more detail. I noticed the parallelism was using all the CPUs. So I changed the MAXDOP to half the CPU/cores and this appears to have resolved the issue. I'm going to keep monitoring the situation, but looks like an instance where the MAXDOP was not set correctly.
It makes sense that if a query is eating up all the available cores, other threads would be waiting. I was just thrown off by the WRITELOG waits.
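For anyone wanting to apply the same fix to a single database rather than the whole instance, SQL Server 2016 and later (so 2019 qualifies) also support a database-scoped MAXDOP. A minimal sketch, with 4 as an assumed value standing in for "half the cores":

```sql
-- Run in the context of the affected database.
-- Assumed value: 4, i.e. half of a hypothetical 8-core VM.
ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 4;

-- Confirm what the database is now using.
SELECT name, value
FROM sys.database_scoped_configurations
WHERE name = N'MAXDOP';
```

A query-level OPTION (MAXDOP n) hint on just the reporting query would be a narrower alternative if the rest of the workload benefits from parallelism.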
I have 1.2 million rows in Azure data table. The following command:
DELETE FROM _PPL_DETAIL WHERE RunId <> 229
is painfully slow.
There is an index on RunId.
I am deleting most of the data.
229 is a small number of records.
It has been running for an hour now
Should it take this long?
I am pretty sure it will finish.
Is there anything I can do to make operations like this faster?
The database does have a PK, although it is a dummy PK (not used). I already saw that as an optimization need to help this problem, but it still takes way too long (SQL Server treats a table without a PK differently -- much less efficient). It is still taking 1+ hour.
How about trying something like below
BEGIN TRAN
SELECT * INTO #T FROM _PPL_DETAIL WHERE RunId = 229
TRUNCATE TABLE _PPL_DETAIL
INSERT INTO _PPL_DETAIL
SELECT * FROM #T
COMMIT TRAN
Without knowing which service tier the database is running on, it is hard to give a specific answer. However, here is how the system works, so that you can make this determination with a bit more investigation yourself.
Currently the log commit rate is limited by the tier the database has. Deletes are fundamentally limited by the ability to write out log records (and replicate them to multiple machines in case your main machine dies). When you select records, you don't have to go over the network to N machines and you may not even need to go to the local disk if the records are preserved in memory, so selects are generally expected to be faster than inserts/updates/deletes because of the need to harden the log for you. You can read about the specific limits for different reservation sizes here: DTU Limits and vCore Limits.
One common problem is to do individual operations in a loop (like a cursor or driven from the client). This implies that each statement has a single row updated and thus has to harden each log record serially because the app has to wait for the statement to return before submitting the next statement. You are not hitting that since you are running a big delete as a single statement. That could be slow for other reasons such as:
Locking - if you have other users doing operations on the table, it could block the progress of the delete statement. You can potentially see this by looking at sys.dm_exec_requests to see if your statement is blocking on other locks.
Query Plan choice. If you have to scan a lot of rows to delete a small fraction, you could be blocked on the IO to find them. Looking at the query plan shape will help here, as will set statistics time on (We suggest you change the query to do TOP 100 or similar to get a sense of whether you are doing lots of logical read IOs vs. actual logical writes). This could imply that your on-disk layout is suboptimal for this problem. The general solutions would be to either pick a better indexing strategy or to use partitioning to help you quickly drop groups of rows instead of having to delete all the rows explicitly.
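A quick way to check the locking possibility is to run something like this from a second connection while the delete is in progress. It is only a sketch using documented columns of sys.dm_exec_requests.

```sql
-- Requests that are currently waiting on a lock held by another session.
SELECT
    session_id,
    blocking_session_id,   -- the session holding the lock
    wait_type,
    wait_time,             -- ms spent waiting so far
    wait_resource,
    command
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;
```

If the session running the DELETE shows up here, the slowness is coming from blocking rather than from log throughput or the plan.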
An additional strategy to have better performance with deletes is to perform batching.
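A minimal sketch of a batched delete for the statement in the question. The batch size of 10000 is an arbitrary assumption and would need tuning against the tier's log rate.

```sql
DECLARE @BatchSize int = 10000;   -- assumed value, tune as needed
DECLARE @RowsDeleted int = 1;

-- Delete in small chunks so each transaction (and its log flush) stays short.
WHILE @RowsDeleted > 0
BEGIN
    DELETE TOP (@BatchSize)
    FROM _PPL_DETAIL
    WHERE RunId <> 229;

    SET @RowsDeleted = @@ROWCOUNT;   -- 0 once nothing is left to delete
END
```

Each iteration commits on its own, so locks are released between batches and a failure part-way through only rolls back the current batch.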
As far as I know, there was a change and the default DOP on these servers is now 1, so running the query with OPTION (MAXDOP 0) could help.
Try this:
DELETE FROM _PPL_DETAIL
WHERE RunId <> 229
OPTION (MAXDOP 0);
I am using Entity Framework, and I am inserting records into our database which include a blob field. The blob field can be up to 5 MB of data.
When inserting a record into this table, does it lock the whole table?
So if you are querying any data from the table, will it block until the insert is done (I realise there are ways around this, but I am talking by default)?
How long will it take before it causes a deadlock? Will that time depend on how much load is on the server, e.g. if there is not much load, will it take longer to cause a deadlock?
Is there a way to monitor and see what is locked at any particular time?
If each thread is doing queries on single tables, is there then a case where blocking can occur? So isn't it the case that a deadlock can only occur if you have a query which has a join and is acting on multiple tables?
This is taking into account that most of my code is just a bunch of select statements, not heaps of long running transactions or anything like that.
Holy cow, you've got a lot of questions in here, heh. Here's a few answers:
When inserting a record into this table, does it lock the whole table?
Not by default, but if you use the TABLOCK hint or if you're doing certain kinds of bulk load operations, then yes.
So if you are querying any data from the table will it block until the insert is done (I realise there are ways around this, but I am talking by default)?
This one gets a little trickier. If someone's trying to select data from a page in the table that you've got locked, then yes, you'll block 'em. You can work around that with things like the NOLOCK hint on a select statement or by using Read Committed Snapshot Isolation. For a starting point on how isolation levels work, check out Kendra Little's isolation levels poster.
How long will it take before it causes a deadlock? Will that time depend on how much load is on the server, e.g. if there is not much load will it take longer to cause a deadlock?
Deadlocks aren't based on time - they're based on dependencies. Say we've got this situation:
Query A is holding a bunch of locks, and to finish his query, he needs stuff that's locked by Query B
Query B is also holding a bunch of locks, and to finish his query, he needs stuff that's locked by Query A
Neither query can move forward (think Mexican standoff) so SQL Server calls it a draw, shoots somebody's query in the back, releases his locks, and lets the other query keep going. SQL Server picks the victim based on which one will be less expensive to roll back. If you want to get fancy, you can use SET DEADLOCK_PRIORITY LOW on particular queries to paint targets on their back, and SQL Server will shoot them first.
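A small sketch of the DEADLOCK_PRIORITY idea. The table dbo.MyTable and its Processed column are hypothetical placeholders, not anything from the question.

```sql
-- Volunteer this session as the deadlock victim if it ever ends up in a cycle.
SET DEADLOCK_PRIORITY LOW;

BEGIN TRAN;
    -- Hypothetical low-importance work
    UPDATE dbo.MyTable
    SET Processed = 1
    WHERE Id = 1;
COMMIT TRAN;

-- Back to the default for anything else this session does.
SET DEADLOCK_PRIORITY NORMAL;
```

The session that loses gets error 1205, so the calling code should be ready to catch that and retry.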
Is there a way to monitor and see what is locked at any particular time?
Absolutely - there's Dynamic Management Views (DMVs) you can query like sys.dm_tran_locks, but the easiest way is to use Adam Machanic's free sp_WhoIsActive stored proc. It's a really slick replacement for sp_who that you can call like this:
sp_WhoIsActive @get_locks = 1
For each running query, you'll get a little XML that describes all of the locks it holds. There's also a Blocking column, so you can see who's blocking who. To interpret the locks being held, you'll want to check the Books Online descriptions of lock types.
If each thread is doing queries on single tables, is there then a case where blocking can occur? So isn't it the case that a deadlock can only occur if you have a query which has a join and is acting on multiple tables?
Believe it or not, a single query can actually deadlock itself, and yes, queries can deadlock on just one table. To learn even more about deadlocks, check out The Difficulty with Deadlocks by Jeremiah Peschka.
If you have direct control over the SQL, you can force row level locking using:
INSERT INTO MyTable WITH (ROWLOCK) (Id, BigColumn)
VALUES(...)
These two answers might be helpful:
Is it possible to force row level locking in SQL Server?
Locking a table with a select in Entity Framework
To view current held locks in Management Studio, look under the server, then under Management/Activity Monitor. It has a section for locks by object, so you should be able to see whether the inserts are really causing a problem.
Deadlock errors generally return quite quickly. Deadlock states do not occur as a result of a timeout error occurring while waiting for a lock. Deadlock is detected by SQL Server by looking for cycles in the lock requests.
The best answer I can come up with is: It depends.
The best way to check is to find your connection SPID and use sp_lock SPID to check if the lock mode is X on the TAB type. You can also verify the table name with SELECT OBJECT_NAME(objid). I also like to use the below query to check for locking.
SELECT RESOURCE_TYPE,RESOURCE_SUBTYPE,DB_NAME(RESOURCE_DATABASE_ID) AS 'DATABASE',resource_database_id DBID,
RESOURCE_DESCRIPTION,RESOURCE_ASSOCIATED_ENTITY_ID,REQUEST_MODE,REQUEST_SESSION_ID,
CASE WHEN RESOURCE_TYPE = 'OBJECT' THEN OBJECT_NAME(RESOURCE_ASSOCIATED_ENTITY_ID,RESOURCE_DATABASE_ID) ELSE '' END OBJETO
FROM SYS.DM_TRAN_LOCKS (NOLOCK)
WHERE REQUEST_SESSION_ID = @@SPID -- replace @@SPID with the SPID you want to inspect
In SQL Server 2008 (and later) you can disable the lock escalation on the table and enforce a WITH (ROWLOCK) in your insert clause effectively forcing a rowlock. This can't be done prior to SQL Server 2008 (you can write WITH ROWLOCK, but SQL Server can choose to ignore it).
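Roughly, that combination looks like the following. dbo.MyTable, its columns and the inserted values are hypothetical placeholders.

```sql
-- Stop SQL Server from escalating row/page locks on this table to a table lock.
ALTER TABLE dbo.MyTable SET (LOCK_ESCALATION = DISABLE);

-- Ask for row-level locks on the insert itself.
INSERT INTO dbo.MyTable WITH (ROWLOCK) (Id, BigColumn)
VALUES (1, 0x00);
```

Disabling escalation means a very large statement can hold a very large number of individual locks, so it is worth keeping an eye on lock memory after making this change.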
I'm speaking in general terms here, and I don't have much experience with BLOBs as I usually advise developers to avoid them, especially if larger than 1 MB.
Using SQL Server Management Studio.
How can I test the performance of a large select (say 600k rows) without the results window impacting my test? All things being equal it doesn't really matter, since the two queries will both be outputting to the same place. But I'd like to speed up my testing cycles and I'm thinking that the output settings of SQL Server Management Studio are getting in my way. Output to text is what I'm using currently, but I'm hoping for a better alternative.
I think this is impacting my numbers because the database is on my local box.
Edit: Had a question about doing WHERE 1=0 here (thinking that the join would happen but no output), but I tested it and it didn't work -- not a valid indicator of query performance.
You could do SET ROWCOUNT 1 before your query. I'm not sure it's exactly what you want but it will avoid having to wait for lots of data to be returned and therefore give you accurate calculation costs.
However, if you add Client Statistics to your query, one of the numbers is Wait time on server replies which will give you the server calculation time not including the time it takes to transfer the data over the network.
You can SET STATISTICS TIME ON to get a measurement of the time on server. And you can use the Query/Include Client Statistics (Shift+Alt+S) on SSMS to get detail information about the client time usage. Note that SQL queries don't run and then return the result to the client when finished, but instead they run as they return results and even suspend execution if the communication channel is full.
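As a small illustration, wrapping a query like this makes SSMS print CPU time, elapsed time and logical reads on the Messages tab. The SELECT against master..spt_values is just a stand-in for the query being tested.

```sql
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Stand-in for the query under test
SELECT name, type
FROM master..spt_values;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The CPU time figure is the one least affected by how quickly the client consumes the rows; elapsed time still includes any stalls while the server waits on the client.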
The only context under which a query completely ignores sending the result packets back to the client is activation. But then the time to return the output to the client should be also considered when you measure your performance. Are you sure your own client will be any faster than SSMS?
SET ROWCOUNT 1 will stop processing after the first row is returned which means unless the plan happens to have a blocking operator the results will be useless.
Taking a trivial example
SELECT * FROM TableX
The cost of this query in practice will heavily depend on the number of rows in TableX.
Using SET ROWCOUNT 1 won't show any of that. Irrespective of whether TableX has 1 row or 1 billion rows it will stop executing after the first row is returned.
I often assign the SELECT results to variables to be able to look at things like logical reads without being slowed down by SSMS displaying the results.
SET STATISTICS IO ON
DECLARE @name nvarchar(35),
@type nchar(3)
SELECT @name = name,
@type = type
FROM master..spt_values
There is a related Connect Item request Provide "Discard results at server" option in SSMS and/or TSQL
The best thing you can do is to check the Query Execution Plan (press Ctrl+L) for the actual query. That will give you the best guesstimate for performance available.
I'd think that the where clause of WHERE 1=0 is definitely happening on the SQL Server side, and not Management Studio. No results would be returned.
Is your DB engine on the same machine that you're running the Mgmt Studio on?
You could :
Output to Text or
Output to File.
Close the Query Results pane.
That'd just move the cycles spent on drawing the grid in Mgmt Studio. Perhaps Results to Text would be more performant on the whole. Hiding the pane would save the cycles Mgmt Studio spends on drawing the data. It's still being returned to Mgmt Studio, so it really isn't saving a lot of cycles.
How can you test performance of your query if you don't output the results? Speeding up the testing is pointless if the testing doesn't tell you anything about how the query is going to perform. Do you really want to find out this dog of a query takes ten minutes to return data after you push it to prod?
And of course it's going to take some time to return 600,000 records. It will in your user interface as well, and it will probably take longer there than in your query window because the info has to go across the network.
There are a lot of more correct answers here, but I assume the real question is the one I asked myself when I stumbled upon this question:
I have a query A and a query B on the same test data. Which is faster? And I want to check quick and dirty. For me the answer is - temp tables (overhead of creating temp table here is easy to ignore). This is to be done on perf/testing/dev server only!
Query A:
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS -- clear the buffer cache so data is read from disk again
SELECT * INTO #temp1 FROM ...
Query B
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
SELECT * INTO #temp2 FROM ...
I have been working on a stored procedure performance problem for over a week now, and it is related to my other post on Stack Overflow here. Let me give you some background information.
We have a nightly process which runs and is started by a stored procedure which calls many, many other stored procedures. Lots of the called stored procedures call others, etc. I have looked at some of the called procs and there is all sorts of frighteningly complicated stuff in there such as XML string processing, unnecessary over-use of cursors, NOLOCK hints over-used, rare use of set-based processing, etc - the list goes on, it's quite horrendous.
This nightly process in our production environment takes on average 1:15 to run. It sometimes takes 2 hours to run which is unacceptable. I have created a test environment on identical hardware to production and run the proc. It took 45 minutes the first time I ran it. If I restore the database to the exact same point and run it again, it takes longer: indeed, if I repeat this action several times (restoring and re-running), the proc takes progressively longer until it plateaus at around 2 hours. This really puzzles me because I restore the database to the exact same point every time. There are no other user databases on the server.
I thought of two lines of investigation to pursue:
Query plans and parameter sniffing
Tempdb
As a test, I restarted SQL Server to clear out both the cache and tempdb and re-ran the proc with the same database restore. The proc took 45 minutes. I repeated this several times to ensure that it was repeatable - again it took 45 minutes each time. I then embarked on several tests to try and isolate the puzzling increase in run times when SQL Server does not get restarted:
Run the initial stored procedure WITH RECOMPILE
Before running the procedure, execute DBCC FREEPROCCACHE to clear out the procedure cache
Before running the procedure, execute CHECKPOINT followed by DBCC DROPCLEANBUFFERS to ensure that the cache was empty and clean
Executed the following script to ensure all stored procedures were marked for recompilation:
DECLARE @proc_schema SYSNAME
DECLARE @proc_name SYSNAME
DECLARE prcCsr CURSOR local
FOR SELECT specific_schema,
specific_name
FROM INFORMATION_SCHEMA.routines
WHERE routine_type = 'PROCEDURE'
OPEN prcCsr
FETCH NEXT FROM prcCsr INTO @proc_schema, @proc_name
DECLARE @stmt NVARCHAR(MAX)
WHILE @@FETCH_STATUS = 0
BEGIN
-- Build and run: exec sp_recompile '[schema].[proc]'
SET @stmt = N'exec sp_recompile ''[' + @proc_schema + '].['
+ @proc_name + ']'''
-- PRINT @stmt -- DEBUG
EXEC ( @stmt
)
FETCH NEXT FROM prcCsr INTO @proc_schema, @proc_name
END
-- Release the cursor once every procedure has been marked
CLOSE prcCsr
DEALLOCATE prcCsr
In all the above tests, the procedure takes longer and longer to run with the same database restore. I am really at a loss now as to what to try. Looking into the code at this point is an option but realistically its going to take 3-6 months to get that optimised as there is lots of room for improvement there. What I am really interested in getting to the bottom of, is why does the proc execution time get longer each time when a database restore has been performed even when the procedure and buffer caches have been cleaned?
I did also investigate tempdb and tried to clear out old tables in there as described in my other Stack Overflow post, but I am unable to manually clear out temp tables that were created from table variables, and they don't seem to want to disappear on their own (even after leaving them for 24 hours).
Any insight or suggestions for further testing would be greatly appreciated. I am running SQL Server 2005 SP3 64-bit Enterprise edition on a Windows 2003 R2 Ent. edition cluster.
Regards,
Mark.
One thing that could cause this is if the process is leaking XML documents. That would cause SQL Server to use more memory, and parts of that might be written to a page file on disk, causing the process to slow down.
Code that creates an XML document looks like:
EXEC sp_xml_preparedocument @idoc OUTPUT, @strXML
It leaks if there is no corresponding:
EXEC sp_xml_removedocument @idoc
XML documents are COM objects stored outside the configured SQL Server memory. Even if you set SQL Server to use max 5 GB, leaking XML documents grows memory usage beyond that.
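Put together, a well-behaved use of the pair looks roughly like this. The XML payload and the OPENXML query are made-up examples; the point is only that every prepare is matched by a remove.

```sql
DECLARE @idoc int;
DECLARE @strXML nvarchar(max);
SET @strXML = N'<root><item id="1"/><item id="2"/></root>';  -- made-up sample document

-- Parse the document and get a handle to it
EXEC sp_xml_preparedocument @idoc OUTPUT, @strXML;

-- Use the handle (flag 1 = attribute-centric mapping)
SELECT id
FROM OPENXML(@idoc, N'/root/item', 1) WITH (id int);

-- Always release the handle, otherwise the parsed document stays in memory
EXEC sp_xml_removedocument @idoc;
```

When auditing the nightly procedures, it would be worth searching their definitions for sp_xml_preparedocument and checking that every code path, including error paths, reaches the matching sp_xml_removedocument.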
Reviewing all posts to-date and your related question, it certainly sounds like your strongest lead is the mystery behind those tempdb objects. Some leading questions:
After a fresh start, after the process is run how many objects are in tempdb? Is it the same number after every fresh start?
Do the numbers grow after “successive” runs? Do they grow at the same rate?
Can you determine if they occupy space?
For that matter, do your tempdb files grow with each successive run of your process?
I followed the links, but didn’t find any reference discussing the actual problem. You might want to raise the issue on the Microsoft SQL Technet forums here -- they can be pretty good with the abstract stuff. (If all else fails, you can open a case with MS technical support. It might take days, but odds are very good that they will figure things out. And if it is an MS bug, they refund your money!)
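Two queries along these lines could be run after each iteration to answer those questions. This is a sketch; the column aliases are my own.

```sql
-- How many user tables (temp tables and table variables included) are in tempdb?
SELECT COUNT(*) AS user_table_count
FROM tempdb.sys.objects
WHERE type = 'U';

-- How much tempdb space is reserved by each kind of object? (pages are 8 KB)
SELECT
    SUM(user_object_reserved_page_count) * 8     AS user_objects_kb,
    SUM(internal_object_reserved_page_count) * 8 AS internal_objects_kb,
    SUM(version_store_reserved_page_count) * 8   AS version_store_kb,
    SUM(unallocated_extent_page_count) * 8       AS free_kb
FROM tempdb.sys.dm_db_file_space_usage;
```

Recording both results after every run should show whether the object count and the reserved space climb together with the run time.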
You've said that rewriting the code is not an option. However, if temp table abuse is a factor, identifying and refactoring those parts of the code first might help a lot. To find which those may be, run SQL Profiler while your process executes. This kind of work is, alas, subjective and highly iterative (meaning you hardly ever get just the right set of counters on the first pass). Some thoughts:
Start with tracking SP:Starting, to track which stored procedures are being called.
SQL Profiler can be used to group data; it’s awkward and I’m not sure how to describe it in mere text, but configured properly you’ll get a Profiler display showing the number of times each procedure was called. Ideally, this would show the most frequently called procs, and you can analyze them for temp table abuse and refactor as necessary.
If nothing jumps out there, you can trace SP:StmtStarting and do the same thing for individual statements. The problem here is that in a 2+/- hour spaghetti-code run, you might run out of disk space, and analyzing 100s of MB of trace data can be a nightmare. (Hint: load it in a table, build indexes, then carefully delete out the cruft.) Again, the goal would be to identify overly used/abused temp table code to be refactored.
Mark-
So it might take 3-6 months to totally re-write this procedure, but that doesn't mean you can't do some relatively quick performance optimization.
Some of the routines I have to support run 30hrs+, I would be ecstatic to get them to run in 2hrs!! The kind of optimization that you do on these routines is a little different than your normal OLTP database:
Capture a trace of the entire process, making sure to capture SP:StmtCompleted and SQL:StmtCompleted events. Make sure to put a filter on Duration (>10ms or something) to eliminate all the quick, unimportant statements.
Pull this trace into a table, and do some filtering/sorting/grouping, focusing on Duration and Reads. You will likely end up with one of two situations:
(A) A handful of individual queries/statements are responsible for the bulk of the time of the procedure (good news)
(B) A whole lot of similar statements each take a short amount of time, but together they add up to a long time.
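One way to tell (A) from (B) is to aggregate the trace by statement text. The query below is only a sketch: it assumes the trace file was already loaded into a table (for example with sys.fn_trace_gettable), and dbo.TraceData is a hypothetical name for that table.

```sql
-- Assumes the trace was loaded into a hypothetical table dbo.TraceData.
-- EventClass 41 = SQL:StmtCompleted, 45 = SP:StmtCompleted.
SELECT TOP (50)
    CAST(TextData AS nvarchar(4000)) AS StatementText,
    COUNT(*)                         AS Executions,
    SUM(Duration) / 1000             AS TotalDurationMs, -- trace files store Duration in microseconds
    SUM(Reads)                       AS TotalReads
FROM dbo.TraceData
WHERE EventClass IN (41, 45)
GROUP BY CAST(TextData AS nvarchar(4000))
ORDER BY TotalDurationMs DESC;
```

A few rows with a huge TotalDurationMs and a low Executions count suggest (A); many rows with a high Executions count and a small per-execution cost suggest (B).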
In scenario (A), just focus your attention on these queries. Optimize them using indexes, or using other standard techniques. I highly recommend Dan Tow's book "SQL Tuning" for a powerful technique to optimize queries, especially messy ones with complicated joins.
In scenario (B), step back a bit and look at the set of statements as a whole. Are they all similar in some way? Can you add an index on a key, common table that will improve them all? Can you eliminate a loop that executes 10,000 dynamic queries, and instead do a single set-based query?
Still two other possibilities, I suppose:
(C) 15,000 totally different dynamic SQL statements, each requiring its own painstaking optimization. In this case, try to focus on server-level optimizations, such as I/O based improvements that will benefit them all.
(D) Something else weird going on with TempDB or something mis-configured on the server. Not much else I can say here, other than find the problem, and fix it!
Hope this helps.
Can you try the following scenario on the test server:
Make two copies of the database on the server: [A] and [B]. [A] is the database in question, [B] is the copy.
Restart server
Run your process
Drop the database [A]
Rename [B] to [A]
Run your process
This would be like a hot database swap. If the second run takes longer, something on the server level is happening (tempdb, memory, I/O, etc). If the second run takes about the same time, then the problem is on the database level (locks, index fragmentation, etc).
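For reference, the copy and the swap could be scripted roughly as follows. This is a sketch only: the backup path, the data file paths, and the logical file names A_Data and A_Log are placeholders, not names taken from your server.

```sql
USE master;

-- Before the first run: create [B] as a copy of [A].
-- Paths and logical file names are hypothetical.
BACKUP DATABASE [A] TO DISK = N'C:\Backups\A.bak' WITH INIT;
RESTORE DATABASE [B] FROM DISK = N'C:\Backups\A.bak'
    WITH MOVE N'A_Data' TO N'C:\Data\B.mdf',
         MOVE N'A_Log'  TO N'C:\Data\B_log.ldf';

-- After the first run: drop [A] and rename [B] to take its place.
ALTER DATABASE [A] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DROP DATABASE [A];
ALTER DATABASE [B] MODIFY NAME = [A];
```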
Good luck!
Run the following script at start of test and then after each iteration:
select sum(single_pages_kb) as sum_bp_kb
, sum(multi_pages_kb) as sum_va_kb
, type
from sys.dm_os_memory_clerks
group by type
having sum(single_pages_kb+multi_pages_kb) > 16
order by sum(single_pages_kb+multi_pages_kb) desc;
select sum(total_pages), type_desc
from tempdb.sys.allocation_units
group by type_desc;
select * from sys.dm_os_performance_counters
where counter_name in (
'Log Truncations'
,'Log Growths'
,'Log Shrinks'
,'Data File(s) Size (KB)'
,'Log File(s) Size (KB)'
,'Active Temp Tables');
If the results are not self-evident, you can post them somewhere and place a link here, I can look into them and see if something strikes as odd.
What does the overall process do, what is the purpose of the operation being performed?
I would assume that executing the process results in data modification within the database. Is this the case?
If this is the case, then each time you run the process the data being considered is different, so a different execution plan may be produced, and with it a different execution time.
Assuming that the database data is being modified, you should also investigate:
Updating relevant database statistics between each process run.
Reviewing the level of index fragmentation between each process run to determine whether defragmentation could prove beneficial.
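The two checks above could look roughly like this when run in the database in question. The fragmentation and page-count thresholds are illustrative assumptions, not recommendations.

```sql
-- Refresh statistics for all user tables in the current database.
EXEC sp_updatestats;

-- List indexes with notable fragmentation.
-- The thresholds (10 percent, 100 pages) are arbitrary starting points.
SELECT OBJECT_NAME(ips.object_id)        AS TableName,
       i.name                            AS IndexName,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id
 AND i.index_id  = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 10
  AND ips.page_count > 100
ORDER BY ips.avg_fragmentation_in_percent DESC;
```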
Apparently you want to try anything except what you really have to do which is fix the process. Start by getting rid of the cursors. If it takes two hours right now, without the cursors I'll bet you can get it down to less than ten minutes.
I would log each step, and the time it took to run, into a log table. That will help you narrow down the issue and also help you progressively improve the process by tackling it one step at a time (starting with the procs that take the longest).
Best way is to simply insert at the beginning and the end of each proc.
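Something along these lines, as a minimal sketch. The table dbo.ProcessLog and the procedure dbo.SomeStep are hypothetical names used only to show the pattern.

```sql
-- Hypothetical logging table.
CREATE TABLE dbo.ProcessLog (
    LogId     int IDENTITY(1,1) PRIMARY KEY,
    ProcName  sysname  NOT NULL,
    StartTime datetime NOT NULL,
    EndTime   datetime NULL
);
GO

-- Pattern to repeat in each procedure (dbo.SomeStep is a stand-in).
CREATE PROCEDURE dbo.SomeStep
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @LogId int;

    -- Log the start of this procedure.
    INSERT INTO dbo.ProcessLog (ProcName, StartTime)
    VALUES (OBJECT_NAME(@@PROCID), GETDATE());
    SET @LogId = SCOPE_IDENTITY();

    -- ... existing procedure body goes here ...

    -- Log the end of this procedure.
    UPDATE dbo.ProcessLog SET EndTime = GETDATE() WHERE LogId = @LogId;
END;
GO

-- After a run: total time per procedure, slowest first.
SELECT ProcName,
       COUNT(*)                                AS Runs,
       SUM(DATEDIFF(ms, StartTime, EndTime))   AS TotalMs
FROM dbo.ProcessLog
GROUP BY ProcName
ORDER BY TotalMs DESC;
```

Comparing the TotalMs figures between the first and second run shows which procedures account for the slowdown.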
Cursors are not performance boosters; others have addressed that. (not your decision)
Look into the temp tables use/management. Are they global temp tables or session/local temp tables? The fact that they are hanging around looks interesting. The tempdb is locked when temp tables are created which might be part of the issue.
Local temp tables (#mytable syntax) should go away when the session goes out of scope, but you SHOULD have dropped these (release early) to free up resources.
Use of local temp tables in transaction then cancel without COMMIT/ROLLBACK can increase locking in tempdb causing performance issues.
Speaking of transactions: creating temp tables inside a transaction causes locks on syscolumns, sysindexes, etc., so other executions are blocked from using the same query.
Called procedures that use temp tables created by their callers point to a design problem; rethink the logic and try to use relational structures instead.
IF you need temp tables (to eliminate cursors :) then avoid SELECT INTO - to avoid system objects locks.
Use of global temp tables (##myglobaltable syntax) should be avoided, as multiple sessions accessing them can be an issue (the table hangs around until all sessions clear), and for me at least they add no logical value (look into the use of a permanent table instead). If they are global, are there blocking procedures?
Are there a lot of sparse temp tables (grow with large data, but have smaller data sets in them?)
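To get a feel for the questions above, two quick checks can be run while the process is executing. This is a sketch, not a full diagnosis: it only lists what is currently sitting in tempdb and which requests are blocked at that moment.

```sql
-- Temp tables currently present in tempdb
-- (names starting with # are local, ## are global).
SELECT name, create_date
FROM tempdb.sys.tables
WHERE name LIKE '#%'
ORDER BY create_date;

-- Requests that are currently blocked, and the session blocking them.
SELECT session_id, blocking_session_id, wait_type, wait_time, wait_resource
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;
```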
From Microsoft SQL Server Books Online:
“Consider using table variables instead of temporary tables. Temporary tables are useful in cases when indexes need to be created explicitly on them, or when the table values need to be visible across multiple stored procedures or functions. In general, table variables contribute to more efficient query processing.”
Of course, if the temp table needs indexes, table variables are not an option.
I don't have the answer but some ideas of what I would do to isolate issues like this.
First, I would take snapshots of sys.dm_os_wait_stats before and after each execution. Subtract the two snapshots to get the deltas and see if any particular WAIT is prominent or gets worse with each run. An easy way to calculate deltas is to copy the sys.dm_os_wait_stats values into Excel worksheets and use VLOOKUP() to subtract corresponding values. I've used this investigation technique hundreds of times. You don't know what aspect SQL Server is hung up on?! Let SQL Server "tell" you via sys.dm_os_wait_stats !
The other thing I might try is to adjust the behavior of the loop to understand whether the subsequent slower executions exhibit constant throughput for all records from beginning to end, or whether it only slows down for particular sproc(s) in INFORMATION_SCHEMA.routines ... Two techniques for exploring this are:
1) Add a "top N" clause to the SQL SELECT such as "top 100" or "top 1000" (create an artificial limit) to see if you get subsequent slowdowns for all record count scenarios ... or ... do you only get the slowdowns when the cursor resultset is large enough to include the offending sproc.
2) Instead of adding "top N", you can add more print statements (instrumentation) to calculate the throughput as it is processing.
Of course, you can do a combination of both.
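If you would rather skip Excel, the delta can be computed in T-SQL. A minimal sketch, assuming both snapshots are taken from the same session (the temp table name #waits_before is arbitrary):

```sql
-- Snapshot before the run.
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
INTO #waits_before
FROM sys.dm_os_wait_stats;

-- ... run the process here (from any connection) ...

-- Delta after the run: which waits grew, largest first.
SELECT a.wait_type,
       a.wait_time_ms        - b.wait_time_ms        AS wait_time_ms_delta,
       a.signal_wait_time_ms - b.signal_wait_time_ms AS signal_wait_ms_delta,
       a.waiting_tasks_count - b.waiting_tasks_count AS waiting_tasks_delta
FROM sys.dm_os_wait_stats AS a
JOIN #waits_before AS b
  ON b.wait_type = a.wait_type
WHERE a.wait_time_ms - b.wait_time_ms > 0
ORDER BY wait_time_ms_delta DESC;

DROP TABLE #waits_before;
```

The counters are server-wide, so on a busy server the deltas include waits from other activity as well.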
Maybe these diagnostics will get you closer to the root cause.
Edited to add: Btw, SQL2008 has a new performance monitor that makes it easy to "eyeball" the numbers of sys.dm_os_wait_stats. However for SQL2005, you'll have to manually calculate the deltas via Excel or a script.
These are long shots:
Quickly look through all of the stored procedures for things that are unusual and SQL Server should not really be doing, for example sending email or writing files, etc. SQL trying to send email to a non-existent email server could cause delays.
The other thing to keep in mind is that as you restore the database before each test possibly your disk is getting more fragmented (not really sure about this though). So that may explain why run times get longer each time until they plateau.
Firstly, thanks to everyone for some really great help. I much appreciate your time and expertise in helping me to solve this very strange issue. I have an update.
I started a server-side trace to try and isolate the stored procs that were running slower between iterations. What I found surprised me. 96 stored procedures are involved in the process. Most of these stored procedures ran slower the second time around - about 50 of them. The rest were very quick to run and didn't influence the overall time at all, and in fact some of these ran a little quicker (as would be expected).
I failed over the database instance to another node in my cluster and ran the tests there with the exact same results - so I can rule out any OS differences between cluster nodes - when building the clusters I was very conscious to build them identically.
1100 temp tables get created during the process and persist after it has finished - these are all table variables and I found a way to remove them. Running sp_recompile on every proc and function in the database caused all the temp tables to get cleared up. However this did not improve the run times at all. The only thing that helps the run times is a restart of the SQL Server service. Unfortunately I am out of time now to investigate this further - I have other work to do, but would like to persist with it. Perhaps I will come back to it later if I get a spare few hours. In the meantime however, I have to admit defeat with no solution and no bounty to give.
Thanks again everyone.