The instance in question has maximum server memory set to 6GB, but only seems to be using half a GB. I checked the query plan cache by using the query on this page:
https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-exec-cached-plans-transact-sql?view=sql-server-ver16
SELECT usecounts, cacheobjtype, objtype, text
FROM sys.dm_exec_cached_plans
CROSS APPLY sys.dm_exec_sql_text(plan_handle)
WHERE usecounts > 1
ORDER BY usecounts DESC;
GO
After running that, I only see about 3 plans. When I run the application that uses this database, sometimes there will be 300-400 plans, but about 30 seconds later the same query will only show about 3 plans in the cache.
I've run SQL Profiler and can't find anything running DBCC FREEPROCCACHE.
There are 3 other instances on this server that are consuming their allocated memory just fine. One in particular is allowed to eat 2GB and has consumed the entire amount with over 500 plans consistently in its cache.
Other than a scheduled task running DBCC FREEPROCCACHE every 30-60 seconds, is there anything that would cause SQL Server 2019 to behave in this way?
Multiple facets of SQL Server will 'compete' for buffer cache, including:
Data
Plans
Clerks (i.e., other caches)
Memory Grants
etc
The amount of space that Plans can consume is dictated by thresholds defined here:
https://learn.microsoft.com/en-us/previous-versions/tn-archive/cc293624(v=technet.10)
https://www.sqlskills.com/blogs/erin/sql-server-plan-cache-limits/
And, once plans start to exceed those thresholds, the SQLOS will begin to 'eagerly clean up/clip/evict' less frequently used plans.
Likewise, if OTHER clerks (caches for things like schemas, objects, and permissions-caches against those objects - i.e., TOKENPERMS) exceed certain internal cache thresholds, they TOO can cause the SQLOS to start scavenging ALL caches - including the plan cache.
For example:
https://learn.microsoft.com/en-us/archive/blogs/psssql/query-performance-issues-associated-with-a-large-sized-security-cache
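If you want to see which clerks/caches are actually holding memory at a given moment, something like the following can help - a minimal sketch against sys.dm_os_memory_clerks (pages_kb requires SQL Server 2012 or later, so it applies to your 2019 instance):
-- Rough sketch: memory held by each clerk/cache store, largest first
-- (CACHESTORE_SQLCP / CACHESTORE_OBJCP are the ad hoc and object plan cache stores)
SELECT [type], name, SUM(pages_kb) / 1024.0 AS size_mb
FROM sys.dm_os_memory_clerks
GROUP BY [type], name
ORDER BY size_mb DESC;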
Likewise, Memory Grants can/will use buffer cache during query processing. For example, if you're querying a huge table and the engine expects to get back (or hang on to for further processing) roughly 1KB for each of 10 million rows, you're going to need potentially 9GB of buffer space for said query to process. (Or, there are mechanics LIKE this in play with memory grants - the example I've cited is WAY too simplistic - to the point of not being even close to accurate.)
The point being, however, that these grants can/will be given RAM directly from the overall buffer cache and can/will cause INTERNAL memory pressure against the plan-cache (and all other caches for that matter).
In short, memory grants can be a huge problem with SOME workloads.
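If you want to trap any big grants as they happen, here's a quick, hedged sketch against sys.dm_exec_query_memory_grants showing anything currently holding or waiting on a grant:
-- Sketch: queries currently holding or waiting on memory grants, largest first
SELECT session_id,
       requested_memory_kb / 1024.0 AS requested_mb,
       granted_memory_kb / 1024.0   AS granted_mb,
       used_memory_kb / 1024.0      AS used_mb,
       wait_time_ms
FROM sys.dm_exec_query_memory_grants
ORDER BY requested_memory_kb DESC;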
Otherwise, external factors (other apps - especially memory-hungry apps) can/will cause the OS to tell SQL Server to 'cough up' memory it has been using. (You can prevent this by granting the Lock_Pages_In_Memory User Right to the SQL Server service account - just be sure you know what you're doing here.)
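As a quick sanity check, on SQL Server 2016 SP1 and later you can confirm whether Lock Pages in Memory is actually in effect - a small sketch:
-- LOCK_PAGES = LPIM in use; CONVENTIONAL = buffer pool pages can be paged out
SELECT sql_memory_model_desc
FROM sys.dm_os_sys_info;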
In your case, with 4x distinct instances running, I'd assume you're likely running into 'external' memory pressure against the instance in question.
That said, you can query sys.dm_os_ring_buffers to get insight into whether or not memory pressure is happening - as per posts like the following:
https://learn.microsoft.com/en-us/archive/blogs/psssql/how-it-works-what-are-the-ring_buffer_resource_monitor-telling-me
https://learn.microsoft.com/en-us/archive/blogs/mvpawardprogram/using-sys-dm_os_ring_buffers-to-diagnose-memory-issues-in-sql-server
https://www.sqlskills.com/blogs/jonathan/identifying-external-memory-pressure-with-dm_os_ring_buffers-and-ring_buffer_resource_monitor/
Along those lines, I use the following query/diagnostic to check for memory pressure:
WITH core AS (
SELECT
EventTime,
record.value('(/Record/ResourceMonitor/Notification)[1]', 'varchar(max)') as [Type],
record.value('(/Record/ResourceMonitor/IndicatorsProcess)[1]', 'int') as [IndicatorsProcess],
record.value('(/Record/ResourceMonitor/IndicatorsSystem)[1]', 'int') as [IndicatorsSystem],
record.value('(/Record/ResourceMonitor/IndicatorsPool)[1]', 'int') as [IndicatorsPool],
record.value('(/Record/MemoryNode/@id)[1]', 'int') as [MemoryNode],
record.value('(/Record/MemoryRecord/AvailablePhysicalMemory)[1]', 'bigint') AS [Avail Phys Mem, Kb],
record.value('(/Record/MemoryRecord/AvailableVirtualAddressSpace)[1]', 'bigint') AS [Avail VAS, Kb],
record
FROM (
SELECT
DATEADD (ss, (-1 * ((cpu_ticks / CONVERT (float, ( cpu_ticks / ms_ticks ))) - [timestamp])/1000), GETDATE()) AS EventTime,
CONVERT (xml, record) AS record
FROM sys.dm_os_ring_buffers
CROSS JOIN sys.dm_os_sys_info
WHERE ring_buffer_type = 'RING_BUFFER_RESOURCE_MONITOR') AS tab
)
SELECT
EventTime,
[Type],
IndicatorsProcess,
IndicatorsSystem,
IndicatorsPool,
MemoryNode,
CAST([Avail Phys Mem, Kb] / (1024.0 * 1024.0) AS decimal(20,2)) [Avail Phys Mem (GB)],
CAST([Avail VAS, Kb] / (1024.0 * 1024.0) AS decimal(20,2)) [Avail VAS (GB)]
,record
FROM
core
WHERE
[Type] = N'RESOURCE_MEMPHYSICAL_LOW'
ORDER BY
EventTime DESC;
As in, if you run that against effectively ANY SQL Server instance, you REALLY don't want to see ANY results from this query. Or, if you do, they should be at times when you're running REALLY heavy workloads (ugly data-loading/population jobs or other huge processing operations) that you're already aware are issues/problems from a performance perspective.
Otherwise, the occasional entry/hiccup (i.e., set of results) isn't necessarily a reason to worry about major problems, but if you're routinely seeing entries/rows/results from the above with regular workloads, you'll want to investigate things like all of the details listed above (cache and clerk sizes/thresholds, trap for any large memory grants, check plan-cache sizing based on overall RAM, etc.) AND/OR start looking into cache clock hands to see exactly where memory is being scavenged:
https://learn.microsoft.com/en-us/archive/blogs/slavao/q-and-a-clock-hands-what-are-they-for
In SQL Server, I would like to know which statistical metrics, similar to Oracle's 'SQL Service Response Time' or 'Response Time Per Txn', can be used to evaluate overall database performance.
Please tell me the names of these statistical metrics and how to collect them using SQL.
SQL Server does not accumulate statistics about transactions, but execution stats are available for free in all editions for queries, procedures, triggers and UDFs, in DMVs like:
SELECT * FROM sys.dm_exec_query_stats;
SELECT * FROM sys.dm_exec_procedure_stats;
SELECT * FROM sys.dm_exec_trigger_stats;
SELECT * FROM sys.dm_exec_function_stats;
The metrics to consider are the following:
execution_count
total_worker_time
total_elapsed_time
...
As an example, to get a mean execution time, divide the total time by execution_count.
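For instance, a minimal sketch of that calculation against sys.dm_exec_query_stats (the TOP and ORDER BY choices here are just one option):
-- Average CPU and elapsed time per statement (the totals are in microseconds)
SELECT TOP (20)
       qs.execution_count,
       qs.total_worker_time  / qs.execution_count / 1000.0 AS avg_cpu_ms,
       qs.total_elapsed_time / qs.execution_count / 1000.0 AS avg_elapsed_ms,
       st.text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY avg_elapsed_ms DESC;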
You're looking for Windows performance counters; there is a range of them, see this example:
https://www.brentozar.com/archive/2006/12/dba-101-using-perfmon-for-sql-performance-tuning/
These can be read by code.
This is a big topic, but if this is what you need, please describe what problem you want to address, as that dictates which part of Windows is interesting to that end.
Generally I look for:
batch requests per second
lock wait time
deadlocks
cache hit ratio
target/ actual memory relation
available memory
context switches per second
CPU utilization
What we need to act on is values moving away from the normal picture.
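Most of these also have SQL Server-side counterparts in sys.dm_os_performance_counters if you'd rather query them than use PerfMon; a hedged sketch (counter names assumed from the standard SQL Server counter objects):
-- Current values for a few of the counters above
-- (Batch Requests/sec is cumulative: sample it twice and divide the difference by the interval;
--  the lock counters return one row per lock type, with '_Total' as the overall value)
SELECT [object_name], counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Batch Requests/sec',
                       N'Lock Wait Time (ms)',
                       N'Number of Deadlocks/sec',
                       N'Buffer cache hit ratio',
                       N'Target Server Memory (KB)',
                       N'Total Server Memory (KB)');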
What does it mean when COMPILATION_TIME, QUEUED_PROVISIONING_TIME, or both are longer than usual?
I have a query that runs every couple of minutes; it usually takes less than 200 milliseconds for compilation and 0 for provisioning. In the last couple of days there have been 2 instances where the values were more than 4000 for compilation and more than 100000 for provisioning.
Does that mean the warehouse was being resumed and there was a hiccup?
COMPILATION_TIME:
The SQL is parsed and simplified, and the tables' metadata is loaded. Thus a compile for select a,b,c from table_name will be fractionally faster than select * from table_name, because the metadata is not needed from every partition to know the final shape.
Super-fragmented tables can give poor compile performance, as there is more metadata to load. Fragmentation comes from many small writes/deletes/updates.
Doing very large INSERT statements can give horrible compile performance. We did a lift-and-shift and did all data loading via INSERT; just avoid it.
QUEUED_PROVISIONING_TIME is the amount of time needed to set up the hardware. This occurs for two main reasons: you are turning on 3X, 4X, 5X, 6X servers, and it can take minutes just to allocate that volume of servers.
Or there is a failure: sometimes around releases there can be a little instability, where a query fails on the "new" release and the query is rolled back to older instances, which you would see in the profile as 1, 1001. But sometimes there have been problems in the provisioning infrastructure (I've not seen it for a few years, but am not monitoring for it presently).
But I would think you will mostly see this on an ongoing basis for the first reason.
The compilation process involves query parsing, semantic checks, query rewrite components, reading object metadata, table pruning, evaluating certain heuristics such as filter push-downs, plan generation based upon cost-based optimization, etc., which in total accounts for the COMPILATION_TIME.
QUEUED_PROVISIONING_TIME refers to the time (in milliseconds) spent in the warehouse queue, waiting for warehouse compute resources to provision, due to warehouse creation, resume, or resize.
https://docs.snowflake.com/en/sql-reference/functions/query_history.html
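For reference, a small sketch (Snowflake SQL) that pulls both columns from the QUERY_HISTORY table function so you can compare the slow runs against normal ones; the text filter is just a hypothetical placeholder:
-- Compile vs. provisioning time for recent runs of one query
SELECT query_id,
       start_time,
       compilation_time,
       queued_provisioning_time,
       total_elapsed_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 1000))
WHERE query_text ILIKE '%my_query_marker%'   -- hypothetical filter for the query in question
ORDER BY start_time DESC;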
To understand in detail why the query has recently been taking a long time, the query ID needs to be analysed. You can raise a support case with Snowflake support, providing the problematic query ID, to have the details checked.
I want to calculate the page life expectancy of my SQL Server.
If I query the PLE with the following query, I get the value 46.000:
SELECT [object_name],
[counter_name],
[cntr_value] FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Manager%'
AND [counter_name] = 'Page life expectancy'
I think this value isn't the final value because of the high amount. Do I have to calculate this value with a specific formula?
Thanks
Although some counters reported by sys.dm_os_performance_counters are cumulative, PLE reflects the current value so no calculation is necessary.
Whether the value of 46 seconds is a cause for concern depends much on the workload and storage system. This value would be a concern on a high-volume OLTP system with local spinning disk media, due to the multi-millisecond latency incurred for each physical IO and IOPS of roughly 200 per spindle. Conversely, the same workload with high-performance local SSD may be fine because the storage is capable of well over 100K IOPS.
I have been asked to monitor SQL Server (2005 & 2008) and am wondering what are good metrics to look at? I can access WMI counters but am slightly lost as to how much depth is going to be useful.
Currently I have on my list:
user connections
logins per second
latch waits per second
total latch wait time
deadlocks per second
errors per second
Log and data file sizes
I am looking to be able to monitor values that will indicate a degradation of performance on the machine or a potential serious issue. To this end I am also wondering at what values some of these things would be considered normal vs problematic?
As I reckon it would probably be a really good question to have answered for the general community, I thought I'd court some of you DBA experts out there (I am certainly not one of them!).
Apologies if this is a rather open-ended question.
Ry
I would also monitor page life expectancy and your buffer cache hit ratio, see Use sys.dm_os_performance_counters to get your Buffer cache hit ratio and Page life expectancy counters for details
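For reference, a rough sketch of that kind of query (note that Buffer cache hit ratio has to be divided by its 'base' counter to give a percentage):
-- Page life expectancy (seconds) and buffer cache hit ratio (%)
SELECT p.cntr_value AS page_life_expectancy_sec,
       r.cntr_value * 100.0 / NULLIF(b.cntr_value, 0) AS buffer_cache_hit_ratio_pct
FROM sys.dm_os_performance_counters AS p
JOIN sys.dm_os_performance_counters AS r
  ON r.[object_name] = p.[object_name]
 AND r.counter_name = N'Buffer cache hit ratio'
JOIN sys.dm_os_performance_counters AS b
  ON b.[object_name] = p.[object_name]
 AND b.counter_name = N'Buffer cache hit ratio base'
WHERE p.counter_name = N'Page life expectancy'
  AND p.[object_name] LIKE N'%Buffer Manager%';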
Late answer but can be of interest to other readers
One of my colleagues had a similar problem and used this thread to help get started.
He also ran into a blog post describing common causes of performance issues and instructions on what metrics should be monitored, besides the ones already mentioned here. These other metrics are:
• %Disk Time:
This counter indicates a disk problem, but must be observed in conjunction with the Current Disk Queue Length counter to be truly informative. Recall also that the disk could be a bottleneck prior to the %Disk Time reaching 100%.
• %Disk Read Time and the %Disk Write Time:
The %Disk Read Time and %Disk Write Time metrics are similar to %Disk Time, just showing the operations read from or written to disk, respectively. They are actually the Average Disk Read Queue Length and Average Disk Write Queue Length values presented in percentages.
• %Idle Time:
Measures the percentage of time the disk was idle during the sample interval. If this counter falls below 20 percent, the disk system is saturated. You may consider replacing the current disk system with a faster disk system.
• %Free Space:
Measures the percentage of free space on the selected logical disk drive. Take note if this falls below 15 percent, as you risk running out of free space for the OS to store critical files. One obvious solution here is to add more disk space.
If you would like to read the whole post, you may find it here:
http://www.sqlshack.com/sql-server-disk-performance-metrics-part-2-important-disk-performance-measures/
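Not one of the PerfMon counters from the post, but as a complementary SQL-side check you can also get per-file latency from sys.dm_io_virtual_file_stats; a rough sketch:
-- Average read/write latency per database file since the last restart
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       CASE WHEN vfs.num_of_reads  = 0 THEN 0 ELSE vfs.io_stall_read_ms  / vfs.num_of_reads  END AS avg_read_latency_ms,
       CASE WHEN vfs.num_of_writes = 0 THEN 0 ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id
ORDER BY avg_read_latency_ms DESC;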
Use SQL Profiler to identify your Top 10 (or more) queries. Create a baseline performance for these queries. Review current average execution times vs. your baseline, and alert if significantly above your baseline. You can also use this list to identify queries for possible optimization.
This attacks the problem at a higher level than just reviewing detailed stats, although those stats can also be useful. I have found this approach to work on any DBMS, including MySQL and Oracle. If your top query times start to go up, you can bet you are starting to run into performance issues, which you can then start to drill into in more detail.
Budget permitting, it's worth looking at some 3rd party tools to help. We use Idera's SQL Diagnostic Manager to monitor server health and Confio's Ignite to keep an eye on query performance. Both products have served us well in our shop.
Percent CPU utilization and Average disk queue lengths are also pretty standard. CPUs consistently over 80% indicates you may need more or better CPUs (and servers to house them); Consistently over 2 on any disk queue indicates you have a disk I/O bottleneck on that drive.
You should monitor the total pages allocated to a particular process. You can get that information by querying the system DMVs, for example:
-- The SELECT list here is a hedged example; pick whichever columns you care about
SELECT s.session_id,
       s.login_name,
       s.host_name,
       r.status,
       tsu.user_objects_alloc_page_count,
       tsu.internal_objects_alloc_page_count,
       TSQL.text AS sql_text
FROM sys.dm_exec_sessions s
LEFT JOIN sys.dm_exec_connections c
    ON s.session_id = c.session_id
LEFT JOIN sys.dm_db_task_space_usage tsu
    ON tsu.session_id = s.session_id
LEFT JOIN sys.dm_os_tasks t
    ON t.session_id = tsu.session_id
    AND t.request_id = tsu.request_id
LEFT JOIN sys.dm_exec_requests r
    ON r.session_id = tsu.session_id
    AND r.request_id = tsu.request_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) TSQL;
The following post explains really well how you can use it to monitor your server when nothing else works:
http://tsqltips.blogspot.com/2012/06/monitor-current-sql-server-processes.html
Besides the performance metrics suggested above, I strongly recommend monitoring available memory, Batch Requests/sec, SQL Compilations/sec, and SQL Recompilations/sec. All are available in the sys.dm_os_performance_counters view and in Windows Performance Monitor.
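Since Batch Requests/sec and the compilation counters are cumulative in that view, one hedged way to turn them into actual per-second rates is to sample twice and take the difference (a sketch; the 10-second window is arbitrary):
-- Two-sample approach: snapshot, wait, then compute per-second rates
DECLARE @before TABLE (counter_name nvarchar(128), cntr_value bigint);

INSERT INTO @before (counter_name, cntr_value)
SELECT RTRIM(counter_name), cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Batch Requests/sec', N'SQL Compilations/sec', N'SQL Re-Compilations/sec');

WAITFOR DELAY '00:00:10';

SELECT RTRIM(p.counter_name) AS counter_name,
       (p.cntr_value - b.cntr_value) / 10.0 AS per_second
FROM sys.dm_os_performance_counters AS p
JOIN @before AS b
  ON b.counter_name = RTRIM(p.counter_name)
WHERE p.counter_name IN (N'Batch Requests/sec', N'SQL Compilations/sec', N'SQL Re-Compilations/sec');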
As for
ideally I'd like to organise monitored items into 3 categories, say 'FYI', 'Warning' & 'Critical'
There are many third party monitoring tools that enable you to create alerts of different severity level, so once you determine what to monitor and what are recommended values for your environment, you can set low, medium, and high alerts.
Check Brent Ozar's article on not so useful metrics here.
My SQL Server CPU has been at around 90% for the most part of today.
I am not in a position to be able to restart it due to it being in constant use.
Is it possible to find out what within SQL is causing such a CPU overload?
I have run SQL Profiler but so much is going on it's difficult to tell if anything in particular is causing it.
I have run sp_who2 but am not sure what everything means exactly and if it is possible to identify possible problems in here.
To pre-empt any "it's probably just being used a lot" responses, this has only kicked in today from perfectly normal activity levels.
I'm after any way of finding what is causing CPU grief within SQL.
This query uses DMVs to identify the most costly queries by CPU:
SELECT TOP 20
qs.sql_handle,
qs.execution_count,
qs.total_worker_time AS Total_CPU,
total_CPU_inSeconds = --Converted from microseconds
qs.total_worker_time/1000000,
average_CPU_inSeconds = --Converted from microseconds
(qs.total_worker_time/1000000) / qs.execution_count,
qs.total_elapsed_time,
total_elapsed_time_inSeconds = --Converted from microseconds
qs.total_elapsed_time/1000000,
st.text,
qp.query_plan
FROM
sys.dm_exec_query_stats AS qs
CROSS APPLY
sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY
sys.dm_exec_query_plan (qs.plan_handle) AS qp
ORDER BY
qs.total_worker_time DESC
For a complete explanation see: How to identify the most costly SQL Server queries by CPU
I assume due diligence here that you confirmed the CPU is actually consumed by SQL process (perfmon Process category counters would confirm this). Normally for such cases you take a sample of the relevant performance counters and you compare them with a baseline that you established in normal load operating conditions. Once you resolve this problem I recommend you do establish such a baseline for future comparisons.
You can find exactly where SQL is spending every single CPU cycle. But knowing where to look takes a lot of know-how and experience. Is it SQL 2005/2008 or 2000?
Fortunately, for 2005 and newer there are a couple of off-the-shelf solutions. You already got a couple of good pointers here with John Samson's answer. I'd like to add a recommendation to download and install the SQL Server Performance Dashboard Reports. Some of those reports include top queries by time or by I/O, most used data files and so on, and you can quickly get a feel for where the problem is. The output is both numerical and graphical, so it is more useful for a beginner.
I would also recommend using Adam's Who is Active script, although that is a bit more advanced.
And last but not least I recommend you download and read the MS SQL Customer Advisory Team white paper on performance analysis: SQL 2005 Waits and Queues.
My recommendation is also to look at I/O. If you added a load to the server that thrashes the buffer pool (i.e., it needs so much data that it evicts the cached data pages from memory), the result would be a significant increase in CPU (sounds surprising, but is true). The culprit is usually a new query that scans a big table end-to-end.
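To hunt for that kind of culprit, one hedged option is to sort the cached query stats by reads rather than CPU (a sketch along the same lines as the CPU query above):
-- Top statements by total logical reads - likely candidates for big scans
SELECT TOP (10)
       qs.execution_count,
       qs.total_logical_reads,
       qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
       st.text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_logical_reads DESC;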
You can find some useful query here:
Investigating the Cause of SQL Server High CPU
For me this helped a lot:
SELECT s.session_id,
r.status,
r.blocking_session_id 'Blk by',
r.wait_type,
wait_resource,
r.wait_time / (1000 * 60) 'Wait M',
r.cpu_time,
r.logical_reads,
r.reads,
r.writes,
r.total_elapsed_time / (1000 * 60) 'Elaps M',
Substring(st.TEXT,(r.statement_start_offset / 2) + 1,
((CASE r.statement_end_offset
WHEN -1
THEN Datalength(st.TEXT)
ELSE r.statement_end_offset
END - r.statement_start_offset) / 2) + 1) AS statement_text,
Coalesce(Quotename(Db_name(st.dbid)) + N'.' + Quotename(Object_schema_name(st.objectid, st.dbid)) + N'.' +
Quotename(Object_name(st.objectid, st.dbid)), '') AS command_text,
r.command,
s.login_name,
s.host_name,
s.program_name,
s.last_request_end_time,
s.login_time,
r.open_transaction_count
FROM sys.dm_exec_sessions AS s
JOIN sys.dm_exec_requests AS r
ON r.session_id = s.session_id
CROSS APPLY sys.Dm_exec_sql_text(r.sql_handle) AS st
WHERE r.session_id != @@SPID
ORDER BY r.cpu_time desc
In the status, wait_type and cpu_time columns you can find the most CPU-consuming task that is running right now.
Run either of these a few seconds apart; you'll detect the high-CPU connection.
Or: store the CPU values, WAITFOR DELAY, then compare the stored and current CPU values (a sketch of that delta approach follows after the queries below).
select * from master..sysprocesses
where status = 'runnable' --comment this out
order by CPU desc

select * from master..sysprocesses
order by CPU desc
It may not be the most elegant, but it's effective and quick.
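For the delta approach mentioned above, a rough sketch (using a temp table rather than a local variable, and summing per spid because parallel queries show one row per worker):
-- Snapshot cumulative CPU per spid, wait, then compare
SELECT spid, SUM(cpu) AS cpu
INTO #cpu_snapshot
FROM master..sysprocesses
GROUP BY spid;

WAITFOR DELAY '00:00:10';

SELECT p.spid,
       SUM(p.cpu) - MIN(s.cpu) AS cpu_delta
FROM master..sysprocesses AS p
JOIN #cpu_snapshot AS s
  ON s.spid = p.spid
GROUP BY p.spid
ORDER BY cpu_delta DESC;

DROP TABLE #cpu_snapshot;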
You can run the SQL Profiler, and filter by CPU or Duration so that you're excluding all the "small stuff". Then it should be a lot easier to determine if you have a problem like a specific stored proc that is running much longer than it should (could be a missing index or something).
Two caveats:
If the problem is massive amounts of tiny transactions, then the filter I describe above would exclude them, and you'd miss this.
Also, if the problem is a single, massive job (like an 8-hour analysis job or a poorly designed select that has to cross-join a billion rows) then you might not see this in the profiler until it is completely done, depending on what events you're profiling (sp:completed vs sp:statementcompleted).
But normally I start with the Activity Monitor or sp_who2.
For a GUI approach I would take a look at Activity Monitor under Management and sort by CPU.