SQL Server Latches and their indication of performance issues - sql-server

I am trying to understand a potential performance issue with our database (SQL Server 2008), and in particular one performance counter, SQLServer:Latches\Total Latch Wait Time (ms). We are seeing a slowdown in DB response times, and the only correlating spike I can match it with is a spike in Total Latch Wait Time and Latch Waits/sec. I am not seeing any particular bottleneck in disk IO, CPU usage or memory.
The common explanation of a SQL Server latch is that it is a lightweight lock, but I am trying to get a more detailed understanding of what a latch is, how it differs from a lock, and what the high number of them that I am seeing may be an indicator of.

This may be a really basic point to a professional DBA... but this is what we found with our high latch problem, and this thread ranks very high in search results, so I thought I'd share our bit in case it helps someone else.
On newer dual/multi-processor servers using a NUMA memory architecture, the max degree of parallelism should be set to the actual core count per processor. In our case we had dual Xeons with 4 cores each, and with hyper-threading they appear as 16 logical processors to SQL Server.
Changing this value from the default of 0 to 4 immediately cut the high latch waits on some queries.
Our latch waits ran from 1,000 ms up to 30,000 ms on some occasions.
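A quick, hedged way to see what SQL Server itself reports about the CPU layout (note that hyperthread_ratio is per physical socket, so the cores-per-socket figure still has to come from the hardware spec):
SELECT cpu_count,                                    -- logical processors visible to SQL Server
       hyperthread_ratio,                            -- logical processors per physical socket
       cpu_count / hyperthread_ratio AS socket_count -- number of physical sockets
FROM sys.dm_os_sys_info;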

I recommend you look into sys.dm_os_latch_stats and see which latch types have increased contention and wait times compared to a previous baseline.
If you see a spike in the BUFFER latch type, it means the contention is driven by updates conflicting to modify the same page. Other latch types also have short explanations in MSDN and can guide you toward the root cause of the problem. For those marked 'internal use only', you're going to have to open a support case with Microsoft, as a detailed explanation of what they mean is on the verge of NDA.
You should also look into sys.dm_os_wait_stats. If you see an increase in PAGELATCH_* waits, it is the same problem as the BUFFER latch type above: contention trying to modify the same page, also known as an update hot-spot. If you see an increase in PAGEIOLATCH_* waits, then your problem is the I/O subsystem; it takes too long to load pages into memory when they are needed.
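A quick way to check those two wait families (meaningful mostly when compared against a baseline):
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms, max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGELATCH%'
   OR wait_type LIKE 'PAGEIOLATCH%'
ORDER BY wait_time_ms DESC;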

Reference taken from this blog:
Using sys.dm_db_index_operational_stats:
SELECT
    OBJECT_NAME(object_id) AS object_name
    ,page_latch_wait_count
    ,page_latch_wait_in_ms
    ,tree_page_latch_wait_count
    ,tree_page_latch_wait_in_ms
    ,page_io_latch_wait_count
    ,page_io_latch_wait_in_ms
FROM sys.dm_db_index_operational_stats (DB_ID(), NULL, NULL, NULL)
Using sys.dm_os_latch_stats:
SELECT * FROM sys.dm_os_latch_stats
WHERE latch_class = 'buffer'

sp_configure 'max degree of parallelism', 8
go
reconfigure
go

Related

What does CPU utilisation in databases actually mean?

There are two kinds of queries that I ran:
1. A purposely introduced query that sorts (ORDER BY) on about 10 columns. This uses CPU, since sorting is a CPU-intensive operation.
The scenario involved running the query, which took 30 seconds, about 100 times using simultaneous connections on 100 different tables. CPU usage on a 32-core machine was about 85% on all 32 cores, and all 100 queries ran in parallel.
2. Inserting a million rows into a table.
I don't understand why this would consume CPU, since it seems like it should be purely disk I/O. I inserted 1 million rows into a single table using 100 simultaneous connections/threads, with no indexes on those tables. INSERT is not the fastest way to load data, but the point here is that it consumed about 32% CPU on about 10 cores. This is far less than the above, but I am still curious.
I could be wrong because WAL archiving and query logging were on - do these contribute to CPU? I am assuming not, since those are also disk I/O.
There was no other process/application running/installed on this machine other than postgres.
Many different things:
CPU time for query planning and the logic in the executor for query execution
Transforming text representations of tuples into their on-disk format. Parsing dates, and so on.
Log output
Processing the transaction logs
Writing to shared_buffers when inserting pages to write, and scanning shared_buffers for pages to write out
Interprocess communication for lock management
Scanning through in-memory cached copies of indexes when checking uniqueness, inserting new keys in an index, etc
....
If you really want to know the juicy details, fire up perf with stack traces enabled to see where CPU time is spent.
If your table had a primary key, then it has an implicit index.
It may also be true that if the table had a primary key, then it would be stored as a b-tree and not a simple flat table; I'm not clear on this point since my postgres-fu has weakened over the years, but many DBMSes use the primary key as a default clustering key for a b-tree and just store everything in the b-tree. Managing that b-tree requires plenty of CPU.
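A hypothetical way to measure that difference yourself, assuming PostgreSQL; the table names and row counts are made up:
-- Same data loaded into a plain heap vs. a table whose PRIMARY KEY creates an implicit b-tree index
CREATE TABLE load_test_heap (id integer, payload text);
CREATE TABLE load_test_pk   (id integer PRIMARY KEY, payload text);

-- With \timing on in psql, compare elapsed/CPU time of the two loads
INSERT INTO load_test_heap
SELECT g, repeat('x', 100) FROM generate_series(1, 1000000) AS g;

INSERT INTO load_test_pk
SELECT g, repeat('x', 100) FROM generate_series(1, 1000000) AS g;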
Additionally, if you're inserting from 100 threads and connections, then postgres has to perform locking in order to keep its internal data structures consistent. Fighting for locks can consume a ton of CPU, and is especially difficult to do efficiently on machines with many CPUs - acquiring a single mutex requires the cooperation of every CPU in the system via the cache coherency protocol.
You may want to experiment with different numbers of threads, while measuring overall runtime and cpu usage - you may find that with, say, 8 threads, the total CPU utilized is 1/10th of your current usage, but still gets the job done within 110-150% of the original time. This would be a sure sign that lock contention is killing your CPU usage.

MaxDOP = 8, does that mean 8 threads on all parallel operators?

I've been reading a bit on MaxDOP and have run into a question that I can't seem to find an answer for. If MaxDOP is set to a value, let's say 8, does that mean that SQL Server will always spin up 8 threads for the parallel activities in the query, or could it decide to use fewer threads for a particular operator?
It boils down to: are too many threads a performance concern if the workload is small (OLTP) and MaxDOP has been set too high?
A hint to the correct DMV would be nice. I got lost in DMV land, again.
The short answer is: SQL Server will dynamically decide whether to use a parallel execution of the query, but will not exceed the maximum degree of parallelism (MAXDOP) that you have indicated.
The following article has some more detailed information: How It Works: Maximizing Max Degree Of Parallelism (MAXDOP). I'll just cite a part of it here:
There are several stages to determining the degree of parallelism (MAXDOP) a query can utilize.
Stage 1 – Compile
During compilation SQL Server considers the hints, sp_configure and resource workgroup settings to see if a parallel plan should even be considered. Only if the query operations allow parallel execution:
If hint is present and > 1 then build a parallel plan
else if no hint or hint (MAXDOP = 0)
if sp_configure setting is 1 but workload group > 1 then build a parallel plan
else if sp_configure setting is 0 or > 1 then build parallel plan
Stage 2 – Query Execution
When the query begins execution, the runtime degree of parallelism is determined. This involves many factors, already outlined in SQL Server Books Online: http://technet.microsoft.com/en-US/library/ms178065(v=SQL.105).aspx
Before SQL Server looks at the idle workers and other factors it determines the target for the degree of parallelism.
[... see details in article ...]
If still 0 after the detailed calculations it is set to 64 (default max for SQL Server as documented in Books Online.) [...] SQL Server hard codes the 64 CPU target when the runtime target of MAXDOP is still 0 (default.)
The MAXDOP target is now adjusted for:
Actual CPU count (affinity settings from sp_configure and the resource pool).
Certain query types (index build for example) look at the partitions
Other query type limitations that may exist
Now SQL Server takes a look at the available workers (free workers for query execution.) You can loosely calculate the free worker count on a scheduler using (Free workers = Current_workers_count – current_tasks_count) from sys.dm_os_schedulers.
Once the target is calculated the actual is determined by looking at the available resources to support a parallel execution. This involves determining the node(s) and CPUs with available workers.
[...]
The worker location information is then used to target an appropriate set of CPUs to assign the parallel task to.
Using XEvents you can monitor the MAXDOP decision logic. For example:
XeSqlPkg::calculate_dop_begin
XeSqlPkg::calculate_dop
You can monitor the number of parallel workers by querying: sys.dm_os_tasks
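A minimal sketch of both checks mentioned above (free workers per scheduler from sys.dm_os_schedulers, and parallel workers per request from sys.dm_os_tasks); the filters are just illustrative:
-- Rough free-worker count per scheduler
SELECT scheduler_id,
       current_workers_count,
       current_tasks_count,
       current_workers_count - current_tasks_count AS approx_free_workers
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';

-- Requests currently running with more than one task (i.e. in parallel)
SELECT session_id, request_id, COUNT(*) AS task_count
FROM sys.dm_os_tasks
GROUP BY session_id, request_id
HAVING COUNT(*) > 1
ORDER BY task_count DESC;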
It is only used to limit the max number of threads allowed per request:
https://msdn.microsoft.com/en-us/library/ms189094.aspx
So if SQL thinks using one thread is fastest it will just use one.
Generally on an OLTP system you will keep this on the low side. On large warehouse DBs you may want to use a higher number.
I wouldn't change it unless you are seeing specific problems and are confident of the outcome.
SQL Server can also decide to use fewer threads; you can see this in the actual plan, which shows the number of rows handled by each thread. The maximum number of threads also applies to each parallel section separately, and one query can have more than one such section.
In addition to MAXDOP there is the 'cost threshold for parallelism' setting, which decides whether a parallel plan is even considered for a query.
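A hedged sketch of raising that setting; the value 50 is just an example, not a recommendation:
EXEC sp_configure 'show advanced options', 1;   -- needed before advanced options can be changed
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 50;
RECONFIGURE;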

SQL Server 2008 Activity Monitor Resource Wait Category: Does Latch include CPU or just disk IO?

In SQL Server 2008 Activity Monitor, I see Wait Time on Wait Category "Latch" (not Buffer Latch) spike above 10,000ms/sec at times. Average Waiter Count is under 10, but this is by far the highest area of waits in a very busy system. Disk IO is almost zero and page life expectancy is over 80,000, so I know it's not slowed down by disk hardware and assume it's not even touching SAN cache. Does this mean SQL Server is waiting on CPU (i.e. resolving a bajillion locks) or waiting to transfer data from the local server's cache memory for processing?
Background: System is a 48-core running SQL Server 2008 Enterprise w/ 64GB of RAM. Queries are under 100ms in response time - for now - but I'm trying to understand the bottlenecks before they get to 100x that level.
Class                           Count        Sum Time (ms)   Max Time (ms)
ACCESS_METHODS_DATASET_PARENT   649629086    3683117221      45600
BUFFER                          20280535     23445826        8860
NESTING_TRANSACTION_READONLY    22309954     102483312       187
NESTING_TRANSACTION_FULL        7447169      123234478       265
Some latches are IO, some are CPU, some are other resources. It really depends on which particular latch type you're seeing. sys.dm_os_latch_stats will show which latches are hot in your deployment.
I wouldn't worry about the last three items. The two NESTING_TRANSACTION ones look very healthy (low average, low max). BUFFER is also OK, more or less, although the 8s max time is a bit high.
The AM_DS_PARENT (ACCESS_METHODS_DATASET_PARENT) latch is related to parallel queries/parallel scans. Its average is OK, but the max of 45s is rather high. Without going into too much detail, I can tell you that long wait times on this latch type indicate that your IO subsystem can encounter spikes (and the max 8s BUFFER latch waits corroborate this).
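For example, a rough query to see which latch classes account for the most wait time (again, most useful compared against a baseline):
SELECT latch_class, waiting_requests_count, wait_time_ms, max_wait_time_ms
FROM sys.dm_os_latch_stats
ORDER BY wait_time_ms DESC;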

Terrible SQL reads performance (culprit update stats?)

I'm running on SQL Server 2008 R2 and am trying to fine-tune performance. I did everything I could from:
Code review of SQL code
Create or remove indexes as I think appropriate
Auto create stats ON
Auto update stats ON
Auto update stats async ON
I have a 24/7 system that constantly stores data. Sometimes we do reads and that's where the issue is. Sometimes the reads take a couple of seconds or less (which would be expected and acceptable to us). Other times, the reads take several seconds that could amount to a minute before the stored procedure completes and we render data on the UI.
If we do the read again, it would be faster. The SQL profiler would trace the particular stored procedure or query that took several seconds. We would zoom into that stored procedure, and do everything we can do to optimize it if we can.
I also traced the auto stats event and the recompile event. It's hard to tell if a stat is being updated causing the read to take a long time, or if a recompile caused it. Sometimes, I see that the profiler traced a recompile of the read query that took several unacceptable minutes, other times it doesn't trace a recompile.
I tried to prevent the query optimizer from blocking the read until it recompiles or updates stats by using the OPTION (USE PLAN N'...') hint, etc. But I ran into compile errors complaining that the query plan XML isn't valid; that could be true, because the query is quite involved: a SELECT plus joins that involve a local table variable. I sort of hacked the XML, and maybe that's why it was deemed invalid. So I gave up on using the plan hint.
We tried periodically (every 15 minutes) running UPDATE STATISTICS manually in order to keep the stats as up-to-date as possible, but that hurt performance. UPDATE STATISTICS blocks writes, and I'm sure reads as well; it seemed to maintain a bunch of statistics and on average took around 80-90 seconds. A read that waits that long is unacceptable.
So the idea is to let the reads happen and prevent a situation where a recompile/stats update blocks them, correct? Does it make sense to disable auto statistics altogether? Or perhaps disable auto create statistics after deleting all the auto-created stats?
This goes against Microsoft recommendations perhaps, since they enable auto create statistics and auto update statistics by default, and performance may suffer, but any ideas/hints you can give would be appreciated.
From what you are explaining, it looks like the below (all or some) might be happening.
You are doing physical reads. The quick way to avoid this is by increasing the amount of RAM you throw at the box. You haven't mentioned the hardware specs of your server. Please add details.
If you trace the SQL calls then you can easily figure out why the RECOMPILE happened. Look at the EventSubClass to figure out the reason and work towards resolving that.
ref: http://msdn.microsoft.com/en-us/library/ms187105.aspx
You mentioned table variables. These are notorious for causing performance issues when NOT used in the right place. If you use table variables in a JOIN, a parallel plan is out of the question, and there are no statistics either. I am NOT sure how and where you are using them, but try replacing them with temp tables (see the sketch below). And starting from SQL Server 2005, you will get only statement-level recompilation at best, NOT the complete stored procedure recompile that happened in SQL Server 2000.
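A minimal sketch of the kind of swap being suggested; the object names are made up, and whether it helps depends on the actual query:
-- Table variable: no column statistics, which limits the optimizer
DECLARE @recent_ids TABLE (object_id int PRIMARY KEY);
INSERT INTO @recent_ids (object_id)
SELECT object_id FROM sys.objects;                 -- illustrative source only

-- Temp table: statistics can be created, and parallel plans remain possible
CREATE TABLE #recent_ids (object_id int PRIMARY KEY);
INSERT INTO #recent_ids (object_id)
SELECT object_id FROM sys.objects;

SELECT o.name, o.type_desc
FROM sys.objects AS o
JOIN #recent_ids AS r ON r.object_id = o.object_id;

DROP TABLE #recent_ids;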
You mentioned the Update Stats ASYNC option; that one won't block the query.
What are the TOP WAIT STATS on this server? Have you identified the expensive procedures based on CPU, Logical reads & execution count?
Have you looked at the Page Life Expectancy and the amount of IO using the virtual file stats DMV? (See the sketch after this list.)
Updating Stats every 15 minutes is NOT a good plan. How often is data inserted into the system? What is the sample rate you are using? What is your index maintenance strategy?
Have you looked at the missing indexes DMV?
There are a bunch of good queries to identify problems in more granular fashion using the below queries.
ref: http://dl.dropbox.com/u/13748067/SQL%20Server%202008%20Diagnostic%20Information%20Queries%20%28April%202011%29.sql
There are so many other things to look at but the above is a good starting point.
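To go with the wait stats, virtual file stats, and missing index questions above, a few hedged starting-point queries (meaningful mostly against a baseline):
-- Top waits on the instance
SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

-- I/O stalls per database file
SELECT DB_NAME(vfs.database_id) AS database_name,
       vfs.file_id,
       vfs.num_of_reads, vfs.io_stall_read_ms,
       vfs.num_of_writes, vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs;

-- Missing index suggestions (treat these as hints, not commandments)
SELECT TOP (10) mid.statement AS table_name,
       migs.avg_user_impact, migs.user_seeks,
       mid.equality_columns, mid.inequality_columns, mid.included_columns
FROM sys.dm_db_missing_index_group_stats AS migs
JOIN sys.dm_db_missing_index_groups AS mig ON mig.index_group_handle = migs.group_handle
JOIN sys.dm_db_missing_index_details AS mid ON mid.index_handle = mig.index_handle
ORDER BY migs.avg_user_impact * migs.user_seeks DESC;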
OK, here is my IMHO catch on this:
DBCC INDEXDEFRAG is worth trying; it is an ONLINE operation, so it can be used on a live system (see the sketch after this list).
You could be reaching the maximum capacity of your architectural design. You can scale up, which always helps, but more likely you will have to change the architecture to achieve better scalability, sacrificing simplicity.
A common trick is partitioning. You are writing to a table whose index distribution looks nothing like it did a few hours ago, hence the degrading performance. With such massive writes, the table could be divided into a current-day partition for writes and the rest of the data, with nightly batches moving data across.
More and more people are converting to CQRS. You might be next. It solves the problem by separating reads from writes (a very simplistic explanation).
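A hedged sketch of the online defrag mentioned in the first point; the table and index names are hypothetical, and on SQL Server 2005+ the non-deprecated equivalent is ALTER INDEX ... REORGANIZE:
DBCC INDEXDEFRAG (0, 'dbo.Orders', 'IX_Orders_OrderDate');   -- 0 = current database

-- Preferred modern form, also an online operation:
ALTER INDEX IX_Orders_OrderDate ON dbo.Orders REORGANIZE;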

What to monitor on SQL Server

I have been asked to monitor SQL Server (2005 & 2008) and am wondering what are good metrics to look at? I can access WMI counters but am slightly lost as to how much depth is going to be useful.
Currently I have on my list:
user connections
logins per second
latch waits per second
total latch wait time
deadlocks per second
errors per second
Log and data file sizes
I am looking to be able to monitor values that will indicate a degradation of performance on the machine or a potential serious issue. To this end I am also wondering at what values some of these things would be considered normal vs problematic?
As I reckon it would probably be a really good question to have answered for the general community I thought I'd court some of you DBA experts out there (I am certainly not one of them!)
Apologies if a rather open ended question.
Ry
I would also monitor page life expectancy and your buffer cache hit ratio, see Use sys.dm_os_performance_counters to get your Buffer cache hit ratio and Page life expectancy counters for details
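A hedged sketch of pulling those two counters from the DMV mentioned in that article (ratio counters have to be divided by their matching 'base' counter):
-- Page life expectancy (seconds)
SELECT cntr_value AS page_life_expectancy_sec
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name = 'Page life expectancy';

-- Buffer cache hit ratio (%)
SELECT 100.0 * r.cntr_value / NULLIF(b.cntr_value, 0) AS buffer_cache_hit_ratio_pct
FROM sys.dm_os_performance_counters AS r
JOIN sys.dm_os_performance_counters AS b
  ON b.object_name = r.object_name
WHERE r.object_name LIKE '%Buffer Manager%'
  AND r.counter_name = 'Buffer cache hit ratio'
  AND b.counter_name = 'Buffer cache hit ratio base';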
Late answer but can be of interest to other readers
One of my colleagues had a similar problem, and used this thread to help get him started.
He also ran into a blog post describing common causes of performance issues and instructions on which metrics should be monitored, besides the ones already mentioned here. These other metrics are:
• %Disk Time:
This counter indicates a disk problem, but must be observed in conjunction with the Current Disk Queue Length counter to be truly informative. Recall also that the disk could be a bottleneck prior to the %Disk Time reaching 100%.
• %Disk Read Time and the %Disk Write Time:
The %Disk Read Time and %Disk Write Time metrics are similar to %Disk Time, just showing the operations read from or written to disk, respectively. They are actually the Average Disk Read Queue Length and Average Disk Write Queue Length values presented in percentages.
• %Idle Time:
Measures the percentage of time the disk was idle during the sample interval. If this counter falls below 20 percent, the disk system is saturated. You may consider replacing the current disk system with a faster disk system.
• %Free Space:
Measures the percentage of free space on the selected logical disk drive. Take note if this falls below 15 percent, as you risk running out of free space for the OS to store critical files. One obvious solution here is to add more disk space.
If you would like to read the whole post, you may find it here:
http://www.sqlshack.com/sql-server-disk-performance-metrics-part-2-important-disk-performance-measures/
Use SQL Profiler to identify your Top 10 (or more) queries. Create a baseline performance for these queries. Review current average execution times vs. your baseline, and alert if significantly above your baseline. You can also use this list to identify queries for possible optimization.
This attacks the problem at a higher level than just reviewing detailed stats, although those stats can also be useful. I have found this approach to work on any DBMS, including MySQL and Oracle. If your top query times start to go up, you can bet you are starting to run into performance issues, which you can then start to drill into in more detail.
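If you prefer pulling this from the DMVs rather than a Profiler trace, a hedged sketch of a top-queries-by-average-duration query (the plan cache only covers what is currently cached):
SELECT TOP (10)
    qs.total_elapsed_time / qs.execution_count AS avg_elapsed_time_us,
    qs.execution_count,
    SUBSTRING(st.text, qs.statement_start_offset / 2 + 1,
        (CASE qs.statement_end_offset WHEN -1 THEN DATALENGTH(st.text)
         ELSE qs.statement_end_offset END - qs.statement_start_offset) / 2 + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY avg_elapsed_time_us DESC;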
Budget permitting, it's worth looking at some 3rd party tools to help. We use Idera's SQL Diagnostic Manager to monitor server health and Confio's Ignite to keep an eye on query performance. Both products have served us well in our shop.
Percent CPU utilization and Average disk queue lengths are also pretty standard. CPUs consistently over 80% indicates you may need more or better CPUs (and servers to house them); Consistently over 2 on any disk queue indicates you have a disk I/O bottleneck on that drive.
You should monitor the total pages allocated to a particular session. You can get that information by querying the system DMVs, for example:
SELECT                                             -- column list added for completeness; adjust to taste
    s.session_id,
    s.login_name,
    s.host_name,
    tsu.user_objects_alloc_page_count,
    tsu.internal_objects_alloc_page_count,
    TSQL.text AS sql_text
FROM sys.dm_exec_sessions s
LEFT JOIN sys.dm_exec_connections c
    ON s.session_id = c.session_id
LEFT JOIN sys.dm_db_task_space_usage tsu
    ON tsu.session_id = s.session_id
LEFT JOIN sys.dm_os_tasks t
    ON t.session_id = tsu.session_id
    AND t.request_id = tsu.request_id
LEFT JOIN sys.dm_exec_requests r
    ON r.session_id = tsu.session_id
    AND r.request_id = tsu.request_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) TSQL
The following post explains really well how you can use it to monitor your server when nothing else works:
http://tsqltips.blogspot.com/2012/06/monitor-current-sql-server-processes.html
Besides the performance metrics suggested above, I strongly recommend monitoring available memory, Batch Requests/sec, SQL Compilations/sec, and SQL Recompilations/sec. All are available in the sys.dm_os_performance_counters view and in Windows Performance Monitor.
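A hedged sketch of reading those counters from the DMV (per-second counters are cumulative, so sample twice and take the difference over the interval):
SELECT RTRIM(counter_name) AS counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%SQL Statistics%'
  AND (counter_name = 'Batch Requests/sec' OR counter_name LIKE '%Compilations/sec%');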
As for
ideally I'd like to organise monitored items into 3 categories, say 'FYI', 'Warning' & 'Critical'
There are many third party monitoring tools that enable you to create alerts of different severity level, so once you determine what to monitor and what are recommended values for your environment, you can set low, medium, and high alerts.
Check Brent Ozar's article on not so useful metrics here.
