For months we have been plagued by an issue where a database server that serves two web servers has its CPU shoot up to 100% and stay there, for hours if we let it, across all 6 processors. This happens every few days at different times of day. The CPU usage is coming from sqlserver.exe.
This is not a general SQL Server performance issue ("how do I make my queries more efficient"). When there is an incident, CPU goes from its typical 20% up to 100% and stays there until a server reboot.
We are on SQL Server 2016 SP2 cumulative update 6.
We've added some logging and saw that during the latest CPU incident the number of spins per collision on the OPT_IDX_STATS spinlock shot up to 5,775,813. We're not sure whether that's a cause or a symptom. (See the collection sketch after the snapshots below.)
Before the CPU 100% incident:

name           collisions  spins      spins_per_collision  sleep_time  backoffs
----           ----------  ---------  -------------------  ----------  --------
OPT_IDX_STATS  787         200250     254.4473             0           5
LOCK_HASH      2137398     630970500  295.205              1410        52938

One minute later:

name           collisions  spins      spins_per_collision  sleep_time  backoffs
----           ----------  ---------  -------------------  ----------  --------
OPT_IDX_STATS  12          69309750   5775813              7           27
LOCK_HASH      17292       49187101   2844.5               47          555
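For reference, a minimal sketch of the kind of collection query that can produce snapshots like the ones above; the logging table name dbo.SpinlockSnapshot is a placeholder, and sys.dm_os_spinlock_stats is the documented DMV on SQL Server 2016.

-- Log a snapshot of the two spinlocks of interest (dbo.SpinlockSnapshot is hypothetical).
IF OBJECT_ID(N'dbo.SpinlockSnapshot') IS NULL
    CREATE TABLE dbo.SpinlockSnapshot
    (
        capture_time        DATETIME2      NOT NULL,
        name                NVARCHAR(256)  NOT NULL,
        collisions          BIGINT         NOT NULL,
        spins               BIGINT         NOT NULL,
        spins_per_collision REAL           NOT NULL,
        sleep_time          BIGINT         NOT NULL,
        backoffs            BIGINT         NOT NULL
    );

INSERT INTO dbo.SpinlockSnapshot
SELECT SYSDATETIME(), name, collisions, spins, spins_per_collision, sleep_time, backoffs
FROM   sys.dm_os_spinlock_stats
WHERE  name IN (N'OPT_IDX_STATS', N'LOCK_HASH');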
We see around 40 queries running when an incident hits. They are typically instances of the same two LINQ queries. No query ever has an elapsed time longer than 20,000 ms, so it's not one long-running query that's crushing the CPU. They are expensive queries, but they seem to be a symptom of the problem rather than a cause: we see those queries piling up because the DB is running so slowly because the CPU is so high. Those same queries (along with others) are being executed all the time, including after the DB server is rebooted, and they don't cause a problem after a reboot.
The server has 36 GB of memory and we don't see usage going higher than 22%.
Some other interesting information: killing the currently running queries lets the CPU drop, but only briefly (it shoots up again as the web servers send more queries). Pausing the DB to let the queries finish lets the CPU drop for as long as it's paused, but it shoots back up when the DB is resumed. Rebooting the database server always fixes the issue. Before and after the database reboot the web servers should be sending the same types of queries, which points to a problem with SQL Server itself - otherwise, why would a reboot fix the problem?
Update: I wrote a PowerShell script that clears the plan cache if the CPU is above 95% for 45 seconds, and that seems to have worked around the problem. We still don't know what the underlying issue is, though.
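For anyone curious, the T-SQL that a workaround like this ultimately issues is just a plan cache flush; this is a sketch, not the exact script:

-- Flush the whole plan cache (what the workaround runs when CPU stays high).
DBCC FREEPROCCACHE;

-- On SQL Server 2016+, a narrower alternative is to clear only the current
-- database's plans:
-- ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE;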
Copying comments to an answer as requested:
What is the memory configuration for the SQL Server? Do you have it set to correctly limit the amount of memory SQL Server will try to claim for itself? I've seen people leave it at the default and then get into pathological situations where SQL Server claims more memory than is available, causing it and the OS to swap and cratering performance. This is always the first thing to check. There are guides out there for the best value of this setting for your memory, OS, and configuration. A good rule of thumb for 80% of normal configurations is to take installed memory, subtract 4 GB, and use that value for SQL Server.
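As a sketch only, applying that rule of thumb to the 36 GB server described in the question (36 GB minus 4 GB = 32 GB, i.e. 32768 MB) would look something like this; the exact value is an assumption to tune for your environment:

-- Cap SQL Server's memory; the value is in MB (36 GB installed minus 4 GB for the OS).
EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure 'max server memory (MB)', 32768;
RECONFIGURE;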
The next thing to check is your plan cache and the like. If you have hard-coded SQL queries (not parameterized) that vary with each request, you could have a horribly polluted plan cache. Try turning on the "optimize for ad hoc workloads" option under advanced options. Try clearing all the caches and see whether that affects performance (something short of a reboot).
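A minimal sketch of both suggestions in T-SQL (turning the setting on, and clearing the caches short of a reboot):

-- Keep single-use ad hoc plans from bloating the plan cache.
EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;

-- Clear the caches without a reboot (more aggressive than FREEPROCCACHE).
DBCC FREESYSTEMCACHE ('ALL');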
You can look at using Resource Governor; I've had to use it in a similar situation where I had to share the database with some resource hogs:
https://learn.microsoft.com/en-us/sql/relational-databases/resource-governor/resource-governor?view=sql-server-2017
The link above is for the SQL Server 2017 documentation, but Resource Governor is still relevant in SQL 2016; I just didn't easily find the 2016-specific link.
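To make the suggestion concrete, here is a minimal sketch of a Resource Governor setup that caps CPU for one login; the pool, group, and login names are hypothetical, and the limits would need tuning for your workload:

-- Cap CPU for sessions from a hypothetical resource-hungry login.
USE master;
GO
CREATE RESOURCE POOL HogPool WITH (MAX_CPU_PERCENT = 50);
CREATE WORKLOAD GROUP HogGroup USING HogPool;
GO
-- Classifier function routes incoming sessions into the capped group.
CREATE FUNCTION dbo.rg_classifier() RETURNS SYSNAME WITH SCHEMABINDING
AS
BEGIN
    RETURN (CASE WHEN SUSER_SNAME() = N'hog_login'   -- assumed login name
                 THEN N'HogGroup'
                 ELSE N'default' END);
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.rg_classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;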
Related
SQL Server 2008 Enterprise SP4 (build 10.0.6547.0), x64.
Running on Windows Server 2012 R2, patched current.
A VM running on Cisco UCM blades, on VMware 6.0 Update 3 plus patches.
A Nimble CS700 SAN for the storage.
This is a large OLTP server with 12 vCPUs. Normal CPU usage hovers around 6-11%.
What happens is that, without warning, the IO stall times go through the roof (2000-1000ms) and most queries stop returning results. Adam Machanic's sp_whoisactive shows dozens of active queries. CPU is at 90%+.
SAN shows almost zero activity and all other VMs on the same SAN are operating optimally.
We see massive blocking as the stalled processes hold locks, with some timing out and sleeping while still holding locks on the SPID. Killing the SPIDs in question provides temporary relief, but seconds later we are right back where we started.
The only thing that provides relief is a reboot of the server.
Management is rightly demanding an actual root cause. When this happened last summer, with visibility to the CEO level, we engaged Microsoft support, who were dumbfounded and offered no actual root cause.
What I can't do is upgrade SQL Server. The machine hosts a packaged application, and the publisher refuses to support their software if we move to any newer SQL Server version. I desperately want to go to 2014/2016/2017, and I feel an upgrade would solve this problem and others.
In any event, I searched the bug reports and did not see anything that matched.
Has anyone run into this issue? If so, did you suss out a root cause? My gut feeling is that there is a bug in SQL 2008, Windows 2012 R2, or how they interact, but I don't want to write that into the RCA without some corroboration.
Would appreciate any pointers.
Here is my approach:
1.) Try to eliminate storage issues. We once had a storage (SAN) issue where the root cause turned out to be an HBA. You can check whether your storage is performing within acceptable limits.
Start with the counters below and see whether they stay under 15 ms (a DMV-based check appears after step 3 below):
Avg. Disk sec/Read - the average time, in seconds, of a read of data from the disk.
Avg. Disk sec/Write - the average time, in seconds, of a write of data to the disk.
There is more info here: https://www.mssqltips.com/sqlservertip/2460/perfmon-counters-to-identify-sql-server-disk-bottlenecks/
2.) Once you have eliminated storage issues, check whether SQL Server is the only process causing IO spikes or whether other applications are driving IO as well. You can use Resource Monitor to find this out.
3.) If you have reached this point, SQL Server may be the culprit. Work through the steps below in the same sequence and check whether the problem persists after each one.
Remember that high IO can be caused by:
Stale statistics and missing indexes: you might not be updating statistics regularly, or some queries might need more frequent index rebuilds/statistics updates.
Gather the queries causing high IO and try tuning them; look at the number of reads they do and try adding indexes to minimize those reads.
Also check memory pressure: sometimes memory pressure flushes the buffer pool, and queries then have to go to disk. Look at the Page Life Expectancy (PLE) counter and work out what a healthy value is for your environment (see the sketch below).
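As a sketch only (DMV-based, so you don't need Perfmon access): one way to check per-file latency against the ~15 ms guideline from step 1, find the cached queries doing the most reads, and read the current PLE value. Only standard DMVs are used here.

-- Step 1: per-file average read/write latency (ms).
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM   sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN   sys.master_files AS mf
       ON mf.database_id = vfs.database_id
      AND mf.file_id     = vfs.file_id
ORDER BY avg_read_ms DESC;

-- Step 3: cached queries doing the most logical reads (tuning/index candidates).
SELECT TOP (10)
       qs.total_logical_reads,
       qs.execution_count,
       SUBSTRING(st.text, 1, 200) AS query_text
FROM   sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_logical_reads DESC;

-- Step 3: current Page Life Expectancy (seconds), per NUMA node.
SELECT [object_name], instance_name, cntr_value AS ple_seconds
FROM   sys.dm_os_performance_counters
WHERE  counter_name LIKE 'Page life expectancy%';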
Further research pointed to VMware. The machine was allocated 304 GB of RAM, 264 GB of which was assigned to SQL Server. However, the underlying host was heavily overcommitted on RAM. We suspect thrashing as page life expectancy drops and other VMs compete for physical RAM.
Thanks
John.
We are running Dynamics GP 2010 on two load-balanced Citrix servers. For the past three weeks we have had severe performance hits when users run Fixed Assets reporting.
The database is large in size, but when I run the reports locally on the SQL server, they run great. The SQL server seems to be performing adequately even when users are seeing slow performance.
Any ideas?
Just because your DB seems unstressed does not mean that it is fine; it could contain other bottlenecks. Typically, if a DB server is not maxing out its CPUs occasionally, it means the bottleneck is somewhere else.
Standard process for troubleshooting performance problems on a data driven app go like this:
Tune DB indexes. The Database Engine Tuning Advisor (launched from SSMS) is a great starting point if you haven't tried it yet; a sketch of the missing-index DMVs follows this list.
Check resource utilization: CPU, RAM. If your CPU is maxed out, then consider adding/upgrading CPUs, optimizing code, or splitting your tiers. If your RAM is maxed out, then consider adding RAM or splitting your tiers.
Check HDD usage: if your queue length goes above 1 very often (more than once per 10 seconds), upgrade disk bandwidth or scale out your disks (RAID, multiple MDF/LDF files, DB partitioning).
Check network bandwidth
Check for problems on your app (Dynamics) server
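As a lighter-weight alternative (or complement) to the Tuning Advisor mentioned above, here is a sketch of the missing-index DMVs; treat the output as hints to evaluate, not indexes to create blindly:

-- Missing-index suggestions ranked by estimated benefit (standard DMVs only).
SELECT TOP (20)
       migs.avg_user_impact,
       migs.user_seeks,
       mid.statement AS table_name,
       mid.equality_columns,
       mid.inequality_columns,
       mid.included_columns
FROM   sys.dm_db_missing_index_group_stats AS migs
JOIN   sys.dm_db_missing_index_groups      AS mig ON mig.index_group_handle = migs.group_handle
JOIN   sys.dm_db_missing_index_details     AS mid ON mid.index_handle = mig.index_handle
ORDER BY migs.avg_user_impact * migs.user_seeks DESC;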
Shared report dictionaries are the bane of reporting in GP. They do tend to slow things down. Also, modifying reports becomes impossible because somebody always has the dictionary open.
Use local report dictionaries and have a system to keep them synced with a "master" reports.dic.
I'm running some stored procedures in SQL Server 2012 under Windows Server 2012 in a dedicated server with 32 GB of RAM and 8 CPU cores. The CPU usage is always below 10% and the RAM usage is at 80% because SQL Server has 20 GB (of 32 GB) assigned.
There are some stored procedures that take 4 hours on some days and, on other days with almost the same data, take 7 or 8 hours.
I'm using the least restrictive isolation level so I think this should not be a locking problem. The database size is around 100 GB and the biggest table has around 5 million records.
The processes do bulk inserts, updates, and deletes (in some cases I can use TRUNCATE to minimize logging and save some time). I'm also running some full-text search queries against one table.
I have full control of the server so I can change any configuration parameter.
I have a few questions:
Is it possible to improve the performance of the queries using parallelism?
Why is the CPU usage so low?
What are the best practices for configuring SQL Server?
What are the best free tools for auditing the server? I tried one from Microsoft called SQL Server 2012 BPA, but the report is always empty with no warnings.
EDIT:
I checked the log and I found this:
03/18/2015 11:09:25,spid26s,Unknown,SQL Server has encountered 82 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [C:\Program Files\Microsoft SQL Server\MSSQL11.HLSQLSERVER\MSSQL\DATA\templog.ldf] in database [tempdb] (2). The OS file handle is 0x0000000000000BF8. The offset of the latest long I/O is: 0x00000001fe4000
Bump up max server memory to 24 GB.
Move tempdb off the C: drive and consider multiple tempdb files, with autogrowth of at least 128 MB or 256 MB.
Install the Performance Dashboard and run its reports to see what queries are running and to check waits.
If you are using 10% autogrowth on user data and log files, change that to a fixed size similar to the tempdb growth above (see the sketch at the end of this answer).
Using the Performance Dashboard, check for obvious missing indexes that predict a 95% or higher improvement impact.
Disregard the naysayers who say not to do what I'm suggesting. If you do these five things and you're still having trouble, post some of the results from the Performance Dashboard, which, by the way, is free.
One more thing that may be helpful: download and install the sp_whoisactive stored procedure, run it, and see what processes are running. Research the queries that you find after running sp_whoisactive.
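A minimal sketch of the autogrowth change suggested above; the database and logical file names (YourDb, YourDb_log) are placeholders you would replace with the output of sp_helpfile:

-- Switch 10% autogrowth to a fixed 256 MB increment (names are hypothetical).
ALTER DATABASE YourDb MODIFY FILE (NAME = YourDb,     FILEGROWTH = 256MB);
ALTER DATABASE YourDb MODIFY FILE (NAME = YourDb_log, FILEGROWTH = 256MB);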
"query taking hours but using low CPU"
You say that as if CPU mattered for most DB operations. Hint: it does not.
Databases need IO. RAM in some cases helps mitigate this, but in the end it comes down to IO.
And you know what I see in your question? CPU and memory (somehow assuming 32 GB is impressive), but not a word on disk layout.
And that is what matters: disks, and the distribution of files to spread the load.
If you look at the performance counters, you will see latency being super high on the disks - because whatever "pathetic" (in SQL Server terms) disk layout you have there, it simply is not up to the task.
Time to start buying. SSDs are a lot cheaper than spinning disks. You may ask, "How are they cheaper?" Well, you do not buy gigabytes - you buy IO. And last time I checked, SSDs did not cost 100 times the price of disks, but they deliver 100 times or more the IO - and we are always talking about random IO.
Then isolate tempdb on a separate SSD - tempdb either does very little or a ton of work, and you want to see which.
Then isolate the log file.
Create multiple data files for the database and for tempdb (particularly tempdb - up to as many as you have cores); see the sketch at the end of this answer.
And yes, this will cost money. But in the end you need IO, and like most developers you bought CPU. Bad for a database.
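To illustrate the tempdb point, a sketch of adding a data file; the file path, size, and target drive are assumptions to adjust for your own SSDs, and you would repeat the statement up to the core count:

-- Add an extra tempdb data file on a dedicated SSD volume (path and size assumed).
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2,
          FILENAME = 'S:\tempdb\tempdev2.ndf',
          SIZE = 4GB,
          FILEGROWTH = 256MB);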
I am using SQL Server 2008 R2.
The process is actually like this:
First, about 2 million records are pulled from a remote server,
then a join is done locally,
the final result is thousands of records.
The time cost varies from less than 1 minute to 30 minutes.
And after I experienced the 30-minute delay, the subsequent runs all seem to take only around 3 minutes.
It is the same data, same SP.
What could cause this drastic difference?
Update
I deleted the SP, restarted the SQL Server service, and re-created the SP. The execution took only 50 seconds!
What's wrong?
The behaviour you describe seems extreme - but (if you exclude the client), there are 3 logical places to look.
The first is the query execution on the database server. It's worth using the Query Analyzer tool to see if it's using any indices - by far the most common reason for variable performance of database queries is that the query is not using (the right) indices, and that therefore the impact of the query cache plays a big part. SQL Server will cache a lot of data, and the first run of your proc populates that cache; the second run is faster because it hits the cache. After a while, the cache goes stale, and running the proc slows down again.
The second possibility is that the database server is wobbly - it may just not be powerful enough to do all the work it's supposed to do. In that case, one moment you get lucky, have all the server resources to yourself; the next, someone else is running a query and yours slows down. That would make all queries slow, not just this one - so it doesn't sound likely.
Third possibility is networking weirdness - as Phil says, "thousands of records" is nothing too scary, but if they're big, and your network is saturated with pictures of kittens, it might have an impact. Again, that would manifest in general network slowness, and is unlikely to explain a delay of 30 minutes...
Fourth, is anything going on at the same time?
Fifth, does your SP use dynamically generated SQL statements? This would cause the SP not to be pre-compiled. If possible, separate such statements into child SPs.
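If the dynamic SQL can't be moved into child procs, parameterizing it at least lets its plan be cached and reused; a minimal sketch, where the table and parameter names are made up:

-- Parameterized dynamic SQL gets a reusable cached plan, unlike a concatenated literal.
DECLARE @sql NVARCHAR(MAX) =
    N'SELECT * FROM dbo.Orders WHERE CustomerId = @CustomerId;';  -- hypothetical table

EXEC sys.sp_executesql
     @sql,
     N'@CustomerId INT',
     @CustomerId = 42;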
Our primary database server is an 8-core box with 8 GB of RAM. The CPU is a Xeon E7330 @ 2.4 GHz. It runs Windows Server 2003 R2 (x64 edition) and SQL Server 2005.
I wanted to do some testing, so I set up SQL Server 2005 on another brand-new server, an 8-core box with 4 GB of RAM. It has a Xeon X5460 @ 3.16 GHz and runs Windows Server 2003 R2 Standard. I installed SQL Server 2005 out of the box, restored a backup of the primary database onto it, and ran UPDATE STATISTICS on all the tables.
The process I was testing executes the same stored proc many times. I was astounded to find from the Profiler that this proc, which executes with duration = 0 or 1 on the primary server, was consistently executing with durations in excess of 130. This essentially makes the secondary server useless for testing, because it's just too slow.
No other apps run on either of these two boxes, just SQL server. And unlike the primary database server, the test server only had me accessing it.
I can't believe the difference in spec between these two machines explains this colossal difference in performance. Can anybody suggest any settings I may need to change?
Updates in answers to questions:
Second server is 32 bit Windows
I'm inquiring now about the disk arrays and how comparable they are
On the primary server, the data and logs are on the same drive (!) and it works fine
Looking in Task Manager on the test server, the CPU is running at about 10%, with only one core even showing activity
Task manager on the test server (4GB RAM) shows "PF Usage 2.01GB" with SQL Server running. On the primary server (8GB RAM) it shows "PF Usage 6.67GB". How would I make SQL Server on the test box use more of the RAM? Maybe that would make a difference
Another update:
The primary server has a RAID-5 with 15,000 RPM drives. The test box has a RAID-5 with 10,000 RPM drives.
A 32-bit OS means a 2 GB virtual address space for your processes, and a Standard edition OS means no AWE extensions either. So your test machine will be severely RAM-deprived compared with the production one. Your buffer pool will suffer from premature eviction of pages, your execution plans will not have the option to choose hash joins for a lot of queries, and so on and so forth. I doubt this explains the entire difference; I'm sure there must be something more at play. You say only 10% CPU usage during the query - is your MAXDOP setting 1 by any chance on the test server? Have you compared the output of sp_configure on the two machines? (Make sure you enable 'advanced options' too.)
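A sketch of the comparison suggested above; run it on both servers and diff the output (standard system procedures only):

-- Show every instance-level setting, including advanced ones, for comparison.
EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure;                               -- full settings dump
EXEC sys.sp_configure 'max degree of parallelism';   -- MAXDOP specifically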
Can you run the same problem query on the two machines, from an SSMS query window, with SET STATISTICS IO ON and SET STATISTICS TIME ON? Run it 2-3 times on each and write down the results. Does it show the same number of logical reads but a vastly different number of physical reads? That would point to the RAM being insufficient to cache the needed pages. Is the number of logical reads very different? That probably means you are getting a bad execution plan on test.
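For example, a minimal sketch of that test run, where dbo.YourProc stands in for the actual stored procedure:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

EXEC dbo.YourProc;   -- hypothetical proc name; run 2-3 times on each server

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;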
Is the query write-intensive by any chance? If so, did you pre-grow the test database, or is your execution blocked by log-growth and database-growth events?
There are plenty of places to look to narrow down the issue, like the SQL performance counters and sys.dm_os_wait_stats; also check the wait_type and wait_resource columns in sys.dm_exec_requests.
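A sketch of those two DMV checks (standard DMVs only, nothing assumed):

-- What user sessions are waiting on right now.
SELECT session_id, status, command, wait_type, wait_time, wait_resource, blocking_session_id
FROM   sys.dm_exec_requests
WHERE  session_id > 50;

-- Cumulative top waits since the last restart.
SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms
FROM   sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;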
Was the data in the memory cache yet, or was it all read from disk?
You either have a different plan being generated or some hardware differences. For hardware, you can check disk seconds/read and disk seconds/write (edit to clarify - you do this in Perfmon) and see whether you have some massive differences from caching (e.g. a high-performance RAID controller).
For the plan difference just check out the execution plans.
Also, run with SET STATISTICS IO ON and see whether you are getting physical reads instead of logical reads. Maybe the memory difference is keeping your dataset from fitting in memory on the secondary machine but not on the primary.
Although you may not be able to use AWE on your 32-bit server, you can give SQL Server a little more memory by adding the /3GB switch to the boot.ini file. Check out Books Online; it should give you more information.