SQL Server 2008 sudden high IO Stall and queries dead in water - sql-server

SQL Server 2008 Enterprise SP4 (10.0.6547.0) x64
Running on Windows Server 2012 R2, patched current.
A VM running on Cisco UCS blades under VMware ESXi 6.0 Update 3 plus patches.
A Nimble CS700 SAN for the storage.
This is a large OLTP server with 12 vCPUs. Normal CPU usage hovers around 6-11%.
What happens is that, without warning, the IO stall times go through the roof (1,000-2,000 ms) and most queries stop returning results. Adam Machanic's sp_whoisactive shows dozens of active queries. CPU is at 90%+.
SAN shows almost zero activity and all other VMs on the same SAN are operating optimally.
We see massive blocking as the stalled processes hold their locks, with some timing out and left sleeping while still holding locks on the SPID. Killing the SPIDs in question provides temporary relief, but seconds later we are right back where we started.
The only thing that provides relief is a reboot of the server.
Management is rightly demanding an actual root cause. When this happened last summer, with visibility to the CEO level, we engaged Microsoft support, who were dumbfounded and offered no actual root cause.
What I can't do is upgrade the SQL server. The machine hosts a packaged application and the package publisher refuses to support their software if we implement any newer SQL Server version. I desperately want to go to 2014/2016/2017, and I feel it would solve this problem and others.
In any event, I searched the bug reports and did not see anything that matched.
Has anyone run into this issue? If so, did you suss out a root cause? I have a gut feeling that there is a bug in either SQL 2008, Windows 2012 R2, or how they interact, but I don't want to write that into the RCA without having some corroboration.
Would appreciate any pointers.

Here is my approach
1.) Try to eliminate storage issues. We once had a storage (SAN) issue and the root cause turned out to be a faulty HBA. You can check whether your storage is performing within acceptable limits.
Start with the counters below and see whether they stay under 15 ms:
Avg. Disk sec/Read - the average time, in seconds, of a read of data from the disk.
Avg. Disk sec/Write - the average time, in seconds, of a write of data to the disk.
There is more info here: https://www.mssqltips.com/sqlservertip/2460/perfmon-counters-to-identify-sql-server-disk-bottlenecks/
2.) Once you have eliminated storage issues, check whether SQL Server is the only thing causing IO spikes or whether other applications are driving IO as well. You can use Resource Monitor to find this out.
3.) If you have reached this point, SQL Server may be the culprit. Go through the steps below in the same sequence and see whether the problem persists after each one.
Remember, high IO can be caused by:
Stale stats and missing indexes: you might not be updating stats regularly, or some types of queries might need more frequent index rebuilds/stats updates.
Gather the queries causing high IO and try tuning them; look at the number of reads they do and try adding indexes to minimize it.
Also check for memory pressure; sometimes high memory usage flushes the buffer pool, and queries then have to go to disk. Look at the Page Life Expectancy (PLE) counter and learn what is healthy for your environment.
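For steps 1 and 3, a quick way to sanity-check IO stalls and memory pressure from inside SQL Server is to query the file-stats and performance-counter DMVs. A minimal sketch, assuming you just want per-file average stall times and the current Page Life Expectancy to compare against your own baseline:

-- Per-file IO stalls, worst average read stall first
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0)   AS avg_read_stall_ms,
       vfs.num_of_writes,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
ORDER BY avg_read_stall_ms DESC;

-- Page Life Expectancy from the buffer manager counters
SELECT object_name, counter_name, cntr_value AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy';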

Further research pointed to VMware. The machine was allocated 304 GB of RAM, 264 GB of which was assigned to SQL Server. However, the underlying host was overcommitted on RAM by a large amount. We suspect thrashing as Page Life Expectancy drops and other VMs also need real RAM.
Thanks
John.

Related

SQL Server CPU Permanently stuck at 100%

For months we have been plagued with an issue where a database which serves two web servers has its CPU shoot up to 100% and stay there, for hours if we let it. All 6 processors. This happens every few days at different times of the day. The CPU usage is due to the sqlserver.exe.
This is not a general SQL Server performance issue ("how do I make my queries more efficient"). When there is an incident, CPU goes from its typical 20% up to 100% and stays there until a server reboot.
We are on SQL Server 2016 SP2 cumulative update 6.
We've added some logging and see that during the latest CPU incident, the OPT_IDX_STATS spinlock shot up to 5,775,813 spins per collision. Not sure if that's the cause or a symptom?
Before CPU 100% incident
name collisions spins spins_per_collision sleep_time backoffs
---- ---------- ----- ------------------- ---------- --------
OPT_IDX_STATS 787 200250 254.4473 0 5
LOCK_HASH 2137398 630970500 295.205 1410 52938
1 minute later
name collisions spins spins_per_collision sleep_time backoffs
---- ---------- ----- ------------------- ---------- --------
OPT_IDX_STATS 12 69309750 5775813 7 27
LOCK_HASH 17292 49187101 2844.5 47 555
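For reference, these snapshots come straight from sys.dm_os_spinlock_stats; a minimal sketch of the query, which you could dump into a table on a schedule to get before/after deltas (the column names match the output above):

SELECT name, collisions, spins, spins_per_collision, sleep_time, backoffs
FROM sys.dm_os_spinlock_stats
WHERE name IN ('OPT_IDX_STATS', 'LOCK_HASH')
ORDER BY spins DESC;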
We see around 40 queries running when an incident hits. They are typically instances of the same two LINQ queries. No query ever has an elapsedMS of longer than 20,000ms, so it's not a long running query that's crushing the CPU. They are expensive queries, but they seem to be a symptom of the problem, not a cause - we see those queries piling up because the DB is running so slowly because CPU is so high. Those same queries (along with others) are being executed all the time, including after the DB server is rebooted, and they don't cause a problem after a reboot.
The server has 36 GB of memory and we don't see usage going higher than 22%.
Some other interesting information, killing the currently running queries lets the CPU drop, but only briefly (shoots up again as the web servers send more queries). Pausing the DB to let the queries finish lets the CPU drop for as long as it's paused, but then it shoots up when the DB is resumed. Rebooting the database server always fixes the issue. Before and after the database reboot the webservers should be sending the same types of queries, which points to a problem with SQL Server - otherwise why would a reboot fix the problem?
Update: I wrote a PowerShell script that clears the plan cache if the CPU is > 95% for 45 seconds and that seems to have worked around the problem. Still don't know what the issue is though.
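For what it's worth, the core of that workaround is just clearing the plan cache; a minimal sketch of the T-SQL such a script would issue (the commented-out plan handle is a placeholder, not a real value):

-- The heavy hammer: empty the whole plan cache (every plan recompiles afterwards,
-- so expect a brief CPU bump right after)
DBCC FREEPROCCACHE;

-- Narrower alternative: evict a single plan found via sys.dm_exec_query_stats
-- DBCC FREEPROCCACHE (0x06000500A1B2C3D4);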
Copying comments to an answer as requested:
What is the memory configuration for the SQL Server? Do you have it set to correctly limit the amount of memory SQL Server will try to claim for itself? I've seen people leave it at the default, and then get into pathological situations where SQL Server claims more memory than is available, causing it and the OS to swap, cratering performance. This is always the first thing to check. There are guides out there for the best value for this particular setting for your memory, OS, and configuration. A good rule of thumb for 80% of normal configurations is take installed memory, subtract 4GB, and use that value for SQL Server.
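As a concrete illustration of that rule of thumb on this box (36 GB installed minus roughly 4 GB for the OS), the setting would look like the following; the 32 GB figure is only an example and should be sized for your own memory and configuration:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- 36 GB installed - 4 GB for the OS = 32768 MB (example value only)
EXEC sp_configure 'max server memory (MB)', 32768;
RECONFIGURE;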
The next thing to check is your plan cache and the like. If you have hard-coded SQL queries (not parameterized) that vary with each request, you could have a horribly polluted plan cache. Try turning the "optimize for ad hoc workloads" option on under advanced options. Try clearing all caches and see if that affects performance (something short of a reboot).
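A sketch of both suggestions, using the narrower cache flush rather than a full DBCC FREEPROCCACHE; the exact sp_configure name is 'optimize for ad hoc workloads':

-- Turn the setting on (it is an advanced option)
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;

-- Clear just the ad hoc/prepared plan cache rather than everything
DBCC FREESYSTEMCACHE ('SQL Plans');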
You can look at using Resource Governor, I've had to do it in a similar situation where I HAD to share the database with some resource hogs:
https://learn.microsoft.com/en-us/sql/relational-databases/resource-governor/resource-governor?view=sql-server-2017
That link is for SQL Server 2017, but Resource Governor is just as relevant in SQL Server 2016; I didn't easily find the 2016 version of the page.
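If you do go the Resource Governor route, a minimal sketch looks something like the following; the pool, group, login, and function names are made up for illustration, and the classifier function has to live in master:

-- Run in master: a throttled pool and group for the resource hogs
CREATE RESOURCE POOL ReportPool WITH (MAX_CPU_PERCENT = 30);
CREATE WORKLOAD GROUP ReportGroup USING ReportPool;
GO
-- Classifier routes a known heavy login into the throttled group
CREATE FUNCTION dbo.rg_classifier() RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    IF SUSER_SNAME() = N'report_user'
        RETURN N'ReportGroup';
    RETURN N'default';
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.rg_classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;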

Dynamics GP 2010 Awful Report Performance

We are running Dynamics GP 2010 on 2 load balanced citrix servers. For the past 3 weeks we have had severe performance hits when users are running Fixed Assets reporting.
The database is large in size, but when I run the reports locally on the SQL server, they run great. The SQL server seems to be performing adequately even when users are seeing slow performance.
Any ideas?
Just because your DB server seems unstressed does not mean it is fine. It could contain other bottlenecks. Typically, if a DB server is not maxing out its CPUs occasionally, it means there is a much bigger problem.
The standard process for troubleshooting performance problems on a data-driven app goes like this:
Tune DB indexes. If you haven't tried it yet, the Tuning Wizard in SSMS is a great starting point (see the missing-index sketch after this list).
Check resource utilization: CPU, RAM. If your CPU is maxed-out, then consider adding/upgrading CPU or optimize code or split your tiers. If your RAM is maxed-out, then consider adding RAM or split your tiers.
Check HDD usage: if your disk queue length goes above 1 very often (more than once per 10 seconds), upgrade disk bandwidth or scale out your disks (RAID, multiple MDF/LDF files, DB partitioning).
Check network bandwidth
Check for problems on your app (Dynamics) server
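As mentioned in the first item, here is a quick way to see what the missing-index DMVs think before reaching for the Tuning Wizard; treat the suggestions as hints, not gospel:

SELECT TOP (20)
       migs.avg_user_impact,
       migs.user_seeks + migs.user_scans AS times_wanted,
       mid.statement AS table_name,
       mid.equality_columns,
       mid.inequality_columns,
       mid.included_columns
FROM sys.dm_db_missing_index_group_stats AS migs
JOIN sys.dm_db_missing_index_groups AS mig ON mig.index_group_handle = migs.group_handle
JOIN sys.dm_db_missing_index_details AS mid ON mid.index_handle = mig.index_handle
ORDER BY migs.avg_user_impact * (migs.user_seeks + migs.user_scans) DESC;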
Shared report dictionaries are the bane of reporting in GP. They do tend to slow things down. Also, modifying reports becomes impossible, as somebody always has the dictionary open.
Use local report dictionaries and have a system to keep them synced with a "master" reports.dic.

SQL Server long running query taking hours but using low CPU

I'm running some stored procedures in SQL Server 2012 under Windows Server 2012 in a dedicated server with 32 GB of RAM and 8 CPU cores. The CPU usage is always below 10% and the RAM usage is at 80% because SQL Server has 20 GB (of 32 GB) assigned.
There are some stored procedures that are taking 4 hours some days and other days, with almost the same data, are taking 7 or 8 hours.
I'm using the least restrictive isolation level so I think this should not be a locking problem. The database size is around 100 GB and the biggest table has around 5 million records.
The processes have bulk inserts, updates and deletes (in some cases I can use truncate to avoid generating logs and save some time). I'm making some full-text-search queries in one table.
I have full control of the server so I can change any configuration parameter.
I have a few questions:
Is it possible to improve the performance of the queries using parallelism?
Why is the CPU usage so low?
What are the best practices for configuring SQL Server?
What are the best free tools for auditing the server? I tried one from Microsoft called SQL Server 2012 BPA, but the report is always empty with no warnings.
EDIT:
I checked the log and I found this:
03/18/2015 11:09:25,spid26s,Unknown,SQL Server has encountered 82 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [C:\Program Files\Microsoft SQL Server\MSSQL11.HLSQLSERVER\MSSQL\DATA\templog.ldf] in database [tempdb] (2). The OS file handle is 0x0000000000000BF8. The offset of the latest long I/O is: 0x00000001fe4000
Bump up max memory to 24 GB.
Move tempdb off the C: drive and consider multiple tempdb files, with autogrowth set to at least 128 MB or 256 MB (see the sketch at the end of this answer).
Install the Performance Dashboard and run its reports to see what queries are running and to check waits.
If you are using 10% autogrowth on user data and log files, change that to something similar to the tempdb growth above.
Using the Performance Dashboard, check for obvious missing indexes that predict a 95% or higher improvement impact.
Disregard all the naysayers who say not to do what I'm suggesting. If you do these 5 things and you're still having trouble, post some of the results from the Performance Dashboard, which, by the way, is free.
One more thing that may be helpful: download and install the sp_whoisactive stored proc, run it, and see what processes are running. Research the queries you find after running sp_whoisactive.
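A sketch of the tempdb and autogrowth changes from the list above; the drive letters, file names, and sizes are examples only, and moving the physical files only takes effect after a service restart:

-- Fixed-size growth instead of percentage growth, and tempdb off the C: drive
ALTER DATABASE tempdb MODIFY FILE
    (NAME = tempdev, FILENAME = 'D:\TempDB\tempdb.mdf', FILEGROWTH = 256MB);
ALTER DATABASE tempdb MODIFY FILE
    (NAME = templog, FILENAME = 'D:\TempDB\templog.ldf', FILEGROWTH = 256MB);
-- An extra data file, sized and grown the same way
ALTER DATABASE tempdb ADD FILE
    (NAME = tempdev2, FILENAME = 'D:\TempDB\tempdb2.ndf', SIZE = 1024MB, FILEGROWTH = 256MB);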
query taking hours but using low CPU
You say that as if CPU mattered for most DB operations. HINT: It does not.
Databases need IO. RAM in some cases helps mitigate this, but in the end it comes down to IO.
And you know what I see in your question? CPU, memory (somehow assuming 32 GB is impressive), but NO WORD ON DISC LAYOUT.
And that is what matters: discs, and the distribution of files to spread the load.
If you look at the performance counters you will see latency being super high on the discs - because whatever "pathetic" (in SQL Server terms) disc layout you have there, it simply is not up to the task. (A wait-stats check is sketched at the end of this answer.)
Time to start buying. SSDs are a LOT cheaper than spinning discs. You may ask "how are they cheaper?". Well, you do not buy GB - you buy IO. And last time I checked, SSDs did not cost 100 times the price of discs - but they have 100 times or more the IO, and we are always talking about random IO.
Then isolate tempdb on a separate SSD - tempdb either does almost nothing or does a TON, and you want to see which.
Then isolate the log file.
Make multiple data files, for the database and tempdb (particularly tempdb - as many as you have cores).
And yes, this will cost money. But in the end you need IO, and like most developers you got CPU instead. Bad for a database.
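If you want numbers instead of gut feel before buying SSDs, a hedged check of the cumulative wait stats will show whether the instance really spends its time waiting on IO:

SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGEIOLATCH%' OR wait_type IN ('WRITELOG', 'IO_COMPLETION')
ORDER BY wait_time_ms DESC;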

Debugging SQL Server Slowness: Same Database, Different Servers

For a while now we've been having anecdotal slowness on our newly-minted (VMWare-based) SQL Server 2005 database servers. Recently the problem has come to a head and I've started looking for the root cause of the issue.
Here's the weird part: on the stored procedure that I'm using as a performance test case, I get a 30x difference in execution speed depending on which DB server I run it on. This is using the same database (mdf) and log (ldf) files, detached, copied, and reattached from the slow server to the fast one. This doesn't appear to be a (virtualized) hardware issue: the slow server has 4x the CPU capacity and 2x the memory of the fast one.
As best as I can tell, the problem lies in the environment/configuration of the servers (either operating system or SQL Server installation). However, I've checked a bunch of variables (SQL Server config options, running services, disk fragmentation) and found nothing that has made a difference in testing.
What things should I be looking at? What tools can I use to investigate why this is happening?
Blindly checking variables and settings won't get you very far. You need to approach this methodically.
Are the two procedures executed the same way? Namely, is the plan different? A quick check is to SET STATISTICS IO ON and run the two cases. Is the number of logical reads the same? Is the number of physical reads the same? Is the number of writes the same? Differences in logical reads or writes would indicate a different plan. Differences in physical reads (while logical reads are similar) indicate cache and memory problems. If the plans are different, you need to investigate further what is different in the actual execution plan. Does one plan use a different degree of parallelism? Does one use different join types? Different access paths?
If the plans are similar yet the execution is still different, and you cannot blame the IO subsystem, then you need to check contention. Use SET STATISTICS TIME ON and compare the elapsed time and worker time in the two cases. Similar worker time but different elapsed time indicate that there is more waiting in one case. Use the wait_type and wait_resource info in sys.dm_exec_requests to identify the cause of contention.
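A minimal sketch of that comparison; the procedure name is a placeholder for whatever stored procedure you are using as the test case:

-- Run on each server against the same copy of the database
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
EXEC dbo.MySlowProcedure;   -- placeholder name for the test-case procedure
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

-- While the slow run is in flight, check from another session what it is waiting on
SELECT session_id, status, wait_type, wait_resource, wait_time, cpu_time, total_elapsed_time
FROM sys.dm_exec_requests
WHERE session_id <> @@SPID;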
The methodology of investigation is discussed in more detail in the Waits and Queues whitepaper.
Run SQL Server Profiler to gather information about running processes within SQL Server. This is probably the best start. This will give you a good idea of the things that are consuming a lot of resources.
If you still have issues after Indexing / Rebuilding Indexes, or rewriting queries, then the next step would be to run PerfMon.

When can I host IIS and SQL Server on the same machine?

I've read that it's unwise to install SQL Server and IIS on the same machine, but I haven't seen any evidence for that. Has anybody tried this, and if so, what were the results? At what point is it necessary to separate them? Is any tuning necessary? I'm concerned specifically with IIS7 and SQL Server 2008.
If somebody can provide numbers showing when it makes more sense to go to two machines, that would be most helpful.
It is unwise to run SQL Server with any other product, including another instance of SQL Server. The reason for this recommendation is the nature of how SQL Server uses OS resources. SQL Server runs on a user-mode memory management and processor scheduling infrastructure called SQLOS. SQL Server is designed to run at peak performance and assumes it is the only server on the OS. As such, SQLOS reserves all RAM on the machine for the SQL process and creates a scheduler for each CPU core, allocating tasks to all schedulers and using all the CPU it can get when it needs it. Because SQL reserves all memory, other processes that need memory will cause SQL to see memory pressure, and the response to memory pressure is to evict pages from the buffer pool and compiled plans from the plan cache. And since SQL is the only server that actually leverages the memory notification API (there are rumors that the next Exchange will too), SQL is the only process that actually shrinks to give room to other processes (like leaky, buggy ASP pools). This behavior is also explained in BOL: Dynamic Memory Management.
A similar pattern happens with CPU scheduling, where other processes steal CPU time from the SQL schedulers. On high-end systems and on Opteron machines things get worse, because SQL uses NUMA locality to full advantage, but other processes are usually not NUMA-aware and, as much as the OS tries to preserve locality of allocations, they end up allocating all over the physical RAM and reduce the overall throughput of the system as the CPUs idle waiting on cross-NUMA-boundary page access. There are other things to consider too, like the increase in TLB and L2 misses caused by other processes taking up CPU cycles.
So to sum up, you can run other servers alongside SQL Server, but it is not recommended. If you must, then make sure you isolate the two servers to the best of your ability. Use CPU affinity masks for both SQL and IIS/ASP to keep them on separate cores, configure SQL to reserve less RAM so that it leaves free memory for IIS/ASP, and configure your app pools to recycle aggressively to prevent application pool growth.
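A rough sketch of the SQL Server side of that isolation via sp_configure; the memory cap and the affinity mask value (15 = cores 0-3) are illustrative only and need to be sized for the machine:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 4096;   -- leave the remaining RAM for IIS/ASP
EXEC sp_configure 'affinity mask', 15;              -- 0x0F: pin SQL Server to cores 0-3
RECONFIGURE;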
Yes, it is possible and many do it.
It tends to be a question of security and/or performance.
Security is questioned as your attack surface is increased on a box that has both. Perhaps not an issue for you.
Performance is questioned as now your server is serving web and DB requests. Again, perhaps not an issue in your case.
Test vs. Production....
Many may feel fine in test environments but not production....
Again, your team's call. I like my test and production environments being as similar as possible if possible but that's my preference.
It's possible, yes.
A good idea for a production environment, no.
The problem that you're going to run in to is that a SQL Server database under substantial load is, more than likely, going to be doing heavy disk I/O and have a large memory footprint. That combination is going to tie up the machine, and you're going to see a performance hit in IIS as it tries to serve up the pages.
It's unwise in certain contexts... totally wise in others.
If your machine is underutilized and won't experience heavy loads, then there is an advantage to installing the database on the same machine, because you simply won't have to transfer anything across the network.
On the other hand, if one or both of IIS or the database will be under heavy load, they will likely start to interfere, and the performance gain of dedicated hardware for each will probably outstrip the loss of having to go over the network.
Don't forget the maintenance issue... you can't reboot/patch one without taking down the other. If they are on two boxes, you can give your users a better experience than no response from the web server while you are maintaining the SQL box.
Not highest on the list, but should be noted.
You certainly can. You will run into performance issues if, for example, you have a large user base or a lot of heavy queries being run against the DB. I have worked on several sites, usually hosted at 1and1, that run IIS and SQL Server (Express!) on the same box with thousands of users (hundreds concurrent) and millions of records in poorly designed tables, accessed via poorly written stored procedures, and the user experience was certainly tolerable. It all comes down to how hard you plan on hitting the server.
