I have been asked to monitor SQL Server (2005 & 2008) and am wondering what are good metrics to look at? I can access WMI counters but am slightly lost as to how much depth is going to be useful.
Currently I have on my list:
user connections
logins per second
latch waits per second
total latch wait time
deadlocks per second
errors per second
Log and data file sizes
I am looking to be able to monitor values that will indicate a degradation of performance on the machine or a potential serious issue. To this end I am also wondering at what values some of these things would be considered normal vs problematic?
As I reckon it would probably be a really good question to have answered for the general community, I thought I'd court some of you DBA experts out there (I am certainly not one of them!).
Apologies if this is a rather open-ended question.
Ry
I would also monitor page life expectancy and your buffer cache hit ratio; see "Use sys.dm_os_performance_counters to get your Buffer cache hit ratio and Page life expectancy counters" for details.
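A minimal sketch of what that looks like, assuming the standard counter names (note that 'Buffer cache hit ratio' is a ratio counter and must be divided by its base counter):

-- Page life expectancy, in seconds:
SELECT cntr_value AS page_life_expectancy_s
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name = 'Page life expectancy';

-- Buffer cache hit ratio, as a percentage:
SELECT 100.0 * r.cntr_value / NULLIF(b.cntr_value, 0) AS buffer_cache_hit_ratio_pct
FROM sys.dm_os_performance_counters AS r
JOIN sys.dm_os_performance_counters AS b
    ON r.object_name = b.object_name
WHERE r.counter_name = 'Buffer cache hit ratio'
  AND b.counter_name = 'Buffer cache hit ratio base';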
Late answer, but it may be of interest to other readers.
One of my colleagues had a similar problem and used this thread to help get him started.
He also ran into a blog post describing common causes of performance issues and instructions on which metrics should be monitored, besides the ones already mentioned here. These other metrics are:
• %Disk Time:
This counter indicates a disk problem, but must be observed in conjunction with the Current Disk Queue Length counter to be truly informative. Recall also that the disk could be a bottleneck prior to the %Disk Time reaching 100%.
• %Disk Read Time and the %Disk Write Time:
The %Disk Read Time and %Disk Write Time metrics are similar to %Disk Time, just showing the operations read from or written to disk, respectively. They are actually the Average Disk Read Queue Length and Average Disk Write Queue Length values presented in percentages.
• %Idle Time:
Measures the percentage of time the disk was idle during the sample interval. If this counter falls below 20 percent, the disk system is saturated. You may consider replacing the current disk system with a faster disk system.
• %Free Space:
Measures the percentage of free space on the selected logical disk drive. Take note if this falls below 15 percent, as you risk running out of free space for the OS to store critical files. One obvious solution here is to add more disk space.
If you would like to read the whole post, you may find it here:
http://www.sqlshack.com/sql-server-disk-performance-metrics-part-2-important-disk-performance-measures/
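These disk counters live in Windows Performance Monitor rather than inside SQL Server. As a rough server-side complement, per-file I/O latency can be sampled from the virtual file stats DMV; a sketch (the io_stall columns are cumulative since startup, so compare deltas between samples):

SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0) AS avg_read_latency_ms,
       vfs.num_of_writes,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id
   AND mf.file_id = vfs.file_id;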
Use SQL Profiler to identify your Top 10 (or more) queries. Create a performance baseline for these queries. Review current average execution times vs. your baseline, and alert if significantly above it. You can also use this list to identify queries for possible optimization.
This attacks the problem at a higher level than just reviewing detailed stats, although those stats can also be useful. I have found this approach to work on any DBMS, including MySQL and Oracle. If your top query times start to go up, you can bet you are starting to run into performance issues, which you can then start to drill into in more detail.
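If you'd rather not run Profiler continuously, a sketch of the same idea against the plan cache (SQL Server 2005 and later; times are in microseconds):

SELECT TOP 10
    qs.execution_count,
    qs.total_elapsed_time / qs.execution_count AS avg_elapsed_time_us,
    SUBSTRING(st.text, qs.statement_start_offset / 2 + 1,
              (CASE qs.statement_end_offset
                    WHEN -1 THEN DATALENGTH(st.text)
                    ELSE qs.statement_end_offset
               END - qs.statement_start_offset) / 2 + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY avg_elapsed_time_us DESC;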
Budget permitting, it's worth looking at some 3rd party tools to help. We use Idera's SQL Diagnostic Manager to monitor server health and Confio's Ignite to keep an eye on query performance. Both products have served us well in our shop.
Percent CPU utilization and average disk queue lengths are also pretty standard. CPUs consistently over 80% indicate you may need more or better CPUs (and servers to house them); consistently over 2 on any disk queue indicates you have a disk I/O bottleneck on that drive.
You should monitor the total pages allocated to a particular process. You can get that information by querying the system DMVs:
-- Column list chosen to show pages allocated per session alongside the current statement
SELECT s.session_id,
       s.login_name,
       tsu.user_objects_alloc_page_count,
       tsu.internal_objects_alloc_page_count,
       TSQL.text AS current_sql
FROM sys.dm_exec_sessions s
LEFT JOIN sys.dm_exec_connections c
    ON s.session_id = c.session_id
LEFT JOIN sys.dm_db_task_space_usage tsu
    ON tsu.session_id = s.session_id
LEFT JOIN sys.dm_os_tasks t
    ON t.session_id = tsu.session_id
    AND t.request_id = tsu.request_id
LEFT JOIN sys.dm_exec_requests r
    ON r.session_id = tsu.session_id
    AND r.request_id = tsu.request_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) TSQL;
The following post explains really well how you can use it to monitor your server when nothing else works:
http://tsqltips.blogspot.com/2012/06/monitor-current-sql-server-processes.html
Besides the performance metrics suggested above, I strongly recommend monitoring available memory, Batch Requests/sec, SQL Compilations/sec, and SQL Recompilations/sec. All are available in the sys.dm_os_performance_counters view and in Windows Performance Monitor.
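One caveat: per-second counters such as Batch Requests/sec are stored in the DMV as cumulative values, so you have to sample twice and take the delta. A minimal sketch with a 10-second window:

DECLARE @v1 BIGINT, @v2 BIGINT;
SELECT @v1 = cntr_value FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec';
WAITFOR DELAY '00:00:10';
SELECT @v2 = cntr_value FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec';
SELECT (@v2 - @v1) / 10 AS batch_requests_per_sec;  -- delta over the 10-second window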
As for:
"ideally I'd like to organise monitored items into 3 categories, say 'FYI', 'Warning' & 'Critical'"
There are many third-party monitoring tools that enable you to create alerts of different severity levels, so once you determine what to monitor and what the recommended values are for your environment, you can set low, medium, and high alerts.
Check Brent Ozar's article on not-so-useful metrics here.
In SQL Server, I would like to know which statistical metrics, similar to Oracle's 'SQL Service Response Time' or 'Response Time Per Txn', can evaluate overall database performance.
Please tell me the names of these statistical metrics and how to collect them using SQL.
SQL Server does not accumulate statistics about transactions, but execution stats are available for free in all editions for queries, procedures, triggers and UDFs, in DMVs like:
SELECT * FROM sys.dm_exec_query_stats;
SELECT * FROM sys.dm_exec_procedure_stats;
SELECT * FROM sys.dm_exec_trigger_stats;
SELECT * FROM sys.dm_exec_function_stats; -- SQL Server 2016 and later
The metrics to consider include the following:
execution_count
total_worker_time
total_elapsed_time
...
For example, to get a mean execution time, divide the total time by execution_count.
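For instance, a sketch ranking stored procedures by mean elapsed time (DMV times are in microseconds):

SELECT TOP 20
    OBJECT_NAME(object_id, database_id) AS procedure_name,
    execution_count,
    total_elapsed_time / execution_count AS avg_elapsed_time_us,
    total_worker_time / execution_count AS avg_cpu_time_us
FROM sys.dm_exec_procedure_stats
ORDER BY avg_elapsed_time_us DESC;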
You're looking for Windows performance counters; there is a range of them. See this example:
https://www.brentozar.com/archive/2006/12/dba-101-using-perfmon-for-sql-performance-tuning/
These can be read programmatically.
This is a big topic, but if this is what you need, please describe what problem you want to address, as it dictates which part of Windows is interesting to that end.
Generally I look for:
batch requests per second
lock wait time
deadlocks
cache hit ratio
target vs. actual memory relation (see the sketch after this list)
available memory
context switches per second
CPU utilization
What you need to act on is these values changing away from the normal picture.
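For the target vs. actual memory relation, for example, a sketch against the memory manager counters:

SELECT counter_name, cntr_value / 1024 AS value_mb
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Memory Manager%'
  AND counter_name LIKE '%Server Memory%';  -- 'Target Server Memory' vs. 'Total Server Memory'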
I am working with a Postgres database that is being monitored by Icinga 2, and one of our monitors looks at the commit ratio of a database:
select
round(100.*sd.xact_commit/(sd.xact_commit+sd.xact_rollback), 2) AS dcommitratio,
d.datname,
r.rolname AS rolname
FROM pg_stat_database sd
JOIN pg_database d ON (d.oid=sd.datid)
JOIN pg_roles r ON (r.oid=d.datdba)
WHERE sd.xact_commit+sd.xact_rollback<>0;
The problem is that an application recently had a bug (now fixed!) that increased the rollback count considerably, so the commit ratio is now only 78%, and it is triggering alarms every day.
I could run pg_stat_reset(), but is there a way to clear out only these two counters? I don't want to inadvertently clear out other necessary stats, like any being used by autovacuum or the query optimizer. Or is pg_stat_reset() considered safe to run?
Unfortunately it is all-or-nothing with resetting PostgreSQL statistics.
But I'd say that your monitoring system is monitoring the wrong thing anyway. Rather than monitoring the absolute values of xact_commit and xact_rollback, you should monitor the changes in the values since the last check.
Otherwise you will not detect a potential problem in a timely fashion: if there have been many months of normal operation, it will take a long time of misbehavior to change the ratio perceptibly.
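A sketch of that approach, assuming a snapshot table (xact_snapshot is a made-up name for illustration) that keeps the last-seen counter values, so each check computes the ratio over the interval only:

-- One-time setup:
CREATE TABLE IF NOT EXISTS xact_snapshot (
    datname       name PRIMARY KEY,
    xact_commit   bigint,
    xact_rollback bigint,
    taken_at      timestamptz DEFAULT now()
);

-- At every check: ratio since the last snapshot ...
SELECT sd.datname,
       round(100.0 * (sd.xact_commit - snap.xact_commit)
             / NULLIF((sd.xact_commit - snap.xact_commit)
                      + (sd.xact_rollback - snap.xact_rollback), 0), 2)
           AS commit_ratio_since_last_check
FROM pg_stat_database sd
JOIN xact_snapshot snap USING (datname);

-- ... then refresh the snapshot:
UPDATE xact_snapshot snap
SET xact_commit   = sd.xact_commit,
    xact_rollback = sd.xact_rollback,
    taken_at      = now()
FROM pg_stat_database sd
WHERE sd.datname = snap.datname;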
When my cluster has been running for a certain time (maybe a day, maybe two), some queries become very slow, taking about 2-10 minutes to finish. When this happens I need to restart the whole cluster and the queries return to normal, but after some time the very slow queries appear again.
The query response time depends on multiple factors, including:
1. Table size: if the table size grows with time, then response time will also increase
2. If it is the open source version, time spent in GC pauses, which in turn depends on the number of objects/garbage present in the JVM heap
3. Number of concurrent queries being run
4. Amount of data overflowed to disk
You will need to describe your usage pattern of SnappyData in detail; only then will it be possible to characterise the issue.
Some of the questions that should be answered are:
1. What is cluster size?
2. What are the table sizes?
3. Are writes happening continuously on the tables, or are only queries being executed?
You can engage us on the Slack channel to provide us with information about your cluster.
I'm running on SQL Server 2008 R2 and am trying to fine-tune performance. I did everything I could, including:
Code review of SQL code
Create or remove indexes as I think appropriate
Auto create stats ON
Auto update stats ON
Auto update stats async ON
I have a 24/7 system that constantly stores data. Sometimes we do reads and that's where the issue is. Sometimes the reads take a couple of seconds or less (which would be expected and acceptable to us). Other times, the reads take several seconds that could amount to a minute before the stored procedure completes and we render data on the UI.
If we do the read again, it would be faster. The SQL profiler would trace the particular stored procedure or query that took several seconds. We would zoom into that stored procedure, and do everything we can do to optimize it if we can.
I also traced the auto stats event and the recompile event. It's hard to tell if a stat is being updated causing the read to take a long time, or if a recompile caused it. Sometimes, I see that the profiler traced a recompile of the read query that took several unacceptable minutes, other times it doesn't trace a recompile.
I tried to prevent the query optimizer from blocking the read until it recompiles or updates stats by using the USE PLAN hint with plan XML, etc. But I ran into compile errors complaining that the query plan XML isn't valid; that could be true, because the query is quite involved: a SELECT plus joins that involve a local table variable. I sort of hacked the XML, and maybe that's why it was deemed invalid. So I gave up on the plan hint.
We tried periodically (every 15 minutes) running UPDATE STATISTICS manually in order to keep stats as up to date as we can, but that hurt performance. UPDATE STATISTICS blocks writes, and I'm sure reads too; it seemed to maintain a bunch of statistics, and on average it was taking around 80-90 seconds. A read that waits that long is unacceptable.
So the idea is to let the reads happen and prevent a situation where a recompile/stats update blocks them, correct? Does it make sense to disable auto statistics altogether? Or perhaps disable auto create statistics after deleting all the auto-created stats?
This perhaps goes against Microsoft's recommendations, since they enable auto create statistics and auto update statistics by default and performance may suffer otherwise, but any ideas/hints you can give would be appreciated.
From what you are explaining, it looks like some or all of the below might be happening.
You are doing physical reads. The quick way to avoid this is by increasing the amount of RAM you throw at the box. You haven't mentioned the hardware specs of your server; please add details.
If you trace the SQL calls, you can easily figure out why the RECOMPILE happened. Look at the EventSubClass to figure out the reason and work towards resolving it.
ref: http://msdn.microsoft.com/en-us/library/ms187105.aspx
You mentioned table variables. These are notorious for causing performance issues when NOT used in the right place. If you use table variables in a JOIN, a parallel plan is out of the question, and there are no stats either. I am NOT sure how and where you are using them, but try replacing them with temp tables (see the sketch below). And starting from SQL Server 2005, you will get only statement-level recompilation at best, NOT the complete SP recompile that happened in SQL Server 2000.
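A hypothetical before/after illustration (dbo.Orders and the column names are made up): the temp table version gives the optimizer row counts and column statistics on the joined key, which the table variable version does not.

-- Before: table variable, no statistics, row estimates default very low
-- DECLARE @recent_ids TABLE (order_id INT PRIMARY KEY);

-- After: temp table, statistics available, parallel plans possible
CREATE TABLE #recent_ids (order_id INT PRIMARY KEY);

INSERT INTO #recent_ids (order_id)
SELECT order_id FROM dbo.Orders WHERE order_date >= '20110101';

SELECT o.order_id, o.order_date, o.total
FROM dbo.Orders AS o
JOIN #recent_ids AS r
    ON r.order_id = o.order_id;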
You mentioned the Update Stats ASYNC option; this won't block the query.
What are the TOP wait stats on this server? Have you identified the expensive procedures based on CPU, logical reads and execution count?
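A common starting-point sketch for the top waits (cumulative since instance startup; the excluded wait types are typical benign ones):

SELECT TOP 10
    wait_type,
    wait_time_ms,
    waiting_tasks_count,
    wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'BROKER_TASK_STOP', 'SQLTRACE_BUFFER_FLUSH',
                        'LAZYWRITER_SLEEP', 'WAITFOR', 'BROKER_TO_FLUSH')
ORDER BY wait_time_ms DESC;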
Have you looked at the Page Life Expectancy and the amount of I/O, using the virtual file stats DMV?
Updating stats every 15 minutes is NOT a good plan. How often is data inserted into the system? What sample rate are you using? What is your index maintenance strategy?
Have you looked at the missing indexes DMV?
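If not, a sketch ranking the suggestions by a rough estimated impact (treat the output as hints to evaluate, not as commands):

SELECT TOP 10
    mid.statement AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks,
    migs.avg_user_impact
FROM sys.dm_db_missing_index_group_stats AS migs
JOIN sys.dm_db_missing_index_groups AS mig
    ON migs.group_handle = mig.index_group_handle
JOIN sys.dm_db_missing_index_details AS mid
    ON mig.index_handle = mid.index_handle
ORDER BY migs.user_seeks * migs.avg_user_impact DESC;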
There are a bunch of good queries in the script below to identify problems in a more granular fashion.
ref: http://dl.dropbox.com/u/13748067/SQL%20Server%202008%20Diagnostic%20Information%20Queries%20%28April%202011%29.sql
There are so many other things to look at but the above is a good starting point.
OK, here is my take on this, IMHO:
DBCC INDEXDEFRAG is worth trying; it is an ONLINE operation, hence it can be used on a live system.
You could be reaching the maximum capacity of your architectural design. You can scale up, which always helps, but more likely you will have to change the architecture to achieve better scalability, sacrificing simplicity.
A common trick is partitioning. You are writing to a table whose index distribution looks nothing like it did a few hours ago, hence degrading performance. A massively written table like this could be divided into today's writes and the rest of the data, with nightly batches moving stuff across.
More and more, people are converting to CQRS. You might be next. This solves the problem by separating reads from writes (a very simplistic explanation).
I am trying to understand a potential performance issue with our database (SQL 2008) and in particular one performance counter, SQLServer:Latches\Total Latch Wait Time (ms). We are seeing a slowdown in DB response times, and the only correlating spike I can match it with is a spike in Total Latch Wait Time and Latch Waits/sec. I am not seeing any particular bottleneck in disk I/O, CPU usage or memory.
The common explanation of a SQL Server latch is that it is a lightweight lock, but I am trying to get a more detailed understanding of what a latch is, how it differs from a lock, and what the high number of them I am seeing may be an indicator of.
This may be a really basic error to a professional DBA... but this is what I found with our high latch problem, and this thread ranks very high in search results, so I thought I'd share our bit in case it helps someone else.
On newer dual/multi-processor servers using the NUMA memory architecture, the max degree of parallelism should be set to the actual core count per processor. In our example we had dual Xeons with 4 cores each, and with hyper-threading it appears to SQL as 16 logical processors.
Locking this value down from the default of 0 to 4 immediately cut the high latch waits on some queries.
Our latch waits ran 1,000 ms+, up to 30,000 ms on some occasions.
I recommend you look into sys.dm_os_latch_stats and see which latch classes have increased contention and wait times, compared to a previous baseline.
If you see a spike in the BUFFER type latches, it means it is driven by updates conflicting to modify the same page. Other latch types also have a short explanation in MSDN and can guide you toward the problem's root cause. For those marked 'internal use only', you're going to have to open a support case with MS, as a detailed explanation of what they mean is on the verge of NDA.
You should also look into sys.dm_os_wait_stats. If you see an increase in PAGELATCH_*, then it is the same problem as the BUFFER type latch above: contention trying to modify the same page, aka an update hot spot. If you see an increase in PAGEIOLATCH_*, then your problem is the I/O subsystem: it takes too long to load pages into memory when they are needed.
Reference taken from this blog:
Using sys.dm_db_index_operational_stats:
SELECT
    OBJECT_NAME(object_id) AS object_name
    ,page_latch_wait_count
    ,page_latch_wait_in_ms
    ,tree_page_latch_wait_count
    ,tree_page_latch_wait_in_ms
    ,page_io_latch_wait_count
    ,page_io_latch_wait_in_ms
FROM sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL);
Using sys.dm_os_latch_stats:
SELECT * FROM sys.dm_os_latch_stats
WHERE latch_class = 'BUFFER';
-- 'max degree of parallelism' is an advanced option, so enable advanced options first:
sp_configure 'show advanced options', 1;
RECONFIGURE;
sp_configure 'max degree of parallelism', 8;
RECONFIGURE;