SQL Server: Calculating Page Life Expectancy - sql-server

I want to calculate the page life expectancy of my SQL Server.
If I query the PLE with the following query, I get the value 46.000:
SELECT [object_name],
       [counter_name],
       [cntr_value]
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Manager%'
  AND [counter_name] = 'Page life expectancy'
I don't think this can be the final value because it is so high. Do I have to compute the actual value with a specific formula?
Thanks

Although some counters reported by sys.dm_os_performance_counters are cumulative, PLE reflects the current value, so no calculation is necessary; the counter is already expressed in seconds.
Whether a PLE of 46 seconds is a cause for concern depends largely on the workload and storage system. That value would be a concern on a high-volume OLTP system with local spinning media, due to the multi-millisecond latency incurred for each physical IO and the roughly 200 IOPS each spindle can sustain. Conversely, the same workload on high-performance local SSDs may be fine because that storage is capable of well over 100K IOPS.
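On NUMA hardware it can also be worth checking PLE per buffer node, since the instance-wide number is an average that can hide pressure on a single node. A minimal sketch against the same DMV (the object_name prefix varies with the instance name, hence the LIKE):
SELECT [object_name],
       instance_name AS numa_node,
       cntr_value    AS ple_seconds  -- already in seconds, reported per node
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Buffer Node%'
  AND counter_name = 'Page life expectancy';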

Related

SQL Server 2019 instance seems to randomly dump its query plan cache

The instance in question has maximum server memory set to 6GB, but only seems to be using half a GB. I checked the query plan cache by using the query on this page:
https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-exec-cached-plans-transact-sql?view=sql-server-ver16
SELECT usecounts, cacheobjtype, objtype, text
FROM sys.dm_exec_cached_plans
CROSS APPLY sys.dm_exec_sql_text(plan_handle)
WHERE usecounts > 1
ORDER BY usecounts DESC;
GO
After running that, I only see about 3 plans. When I run the application that uses this database, sometimes there will be 300-400 plans, but about 30 seconds later the same query will only show about 3 plans in the cache.
I've run SQL Profiler and can't find anything running DBCC FREEPROCCACHE.
There are 3 other instances on this server that are consuming their allocated memory just fine. One in particular is allowed to eat 2GB and has consumed the entire amount with over 500 plans consistently in its cache.
Other than a scheduled task running DBCC FREEPROCCACHE every 30-60 seconds, is there anything that would cause SQL Server 2019 to behave in this way?
Multiple facets of SQL Server will 'compete' for buffer cache, including:
Data
Plans
Clerks (i.e., other caches)
Memory Grants
etc.
The amount of space that Plans can consume is dictated by thresholds defined here:
https://learn.microsoft.com/en-us/previous-versions/tn-archive/cc293624(v=technet.10)
https://www.sqlskills.com/blogs/erin/sql-server-plan-cache-limits/
And, once plans start to exceed those thresholds, SQLOS will begin to eagerly clean up/evict less frequently used plans.
Likewise, if OTHER clerks (caches for things like schemas, objects, and the permissions caches against those objects - i.e., TOKENPERMS) exceed certain internal cache thresholds, they TOO can cause SQLOS to start scavenging ALL caches - including cached plans.
For example:
https://learn.microsoft.com/en-us/archive/blogs/psssql/query-performance-issues-associated-with-a-large-sized-security-cache
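If you want to see which clerks are actually holding memory, a quick sanity check is to rank them by size. A sketch, not anything definitive (pages_kb exists on SQL Server 2012 and later; older versions split it across single_pages_kb/multi_pages_kb):
SELECT TOP (10)
       [type], name, SUM(pages_kb) AS pages_kb  -- total size per clerk/cache
FROM sys.dm_os_memory_clerks
GROUP BY [type], name
ORDER BY SUM(pages_kb) DESC;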
Likewise, Memory Grants can/will use buffer cache during query processing. For example, if you're querying a huge table and the engine expects to get back (or hang on to for further processing) roughly 1KB for each of 10 million rows, you're going to need potentially 9GB of buffer space for that query to process. (There are mechanics LIKE this in play with memory grants - the example I've cited is deliberately oversimplified.)
The point, however, is that these grants can/will be given RAM directly from the overall buffer cache and can/will cause INTERNAL memory pressure against the plan cache (and all other caches for that matter).
In short, memory grants can be a huge problem with SOME workloads.
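To see whether large grants are in play right now, you can peek at the active grant requests. A sketch (sizes are in KB; rows with a NULL grant_time are still waiting in the queue):
SELECT session_id,
       requested_memory_kb,
       granted_memory_kb,
       used_memory_kb,
       wait_time_ms          -- time spent waiting for the grant
FROM sys.dm_exec_query_memory_grants
ORDER BY requested_memory_kb DESC;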
Otherwise, external factors (other apps - especially memory-hungry apps) can/will cause the OS to tell SQL Server to 'cough up' memory it has been using. (You can prevent this by granting the Lock Pages in Memory user right to the SQL Server service account - just be sure you know what you're doing here.)
In your case, with four distinct instances running, I'd assume you're likely running into 'external' memory pressure against the instance in question.
That said, you can query sys.dm_os_ring_buffers to get insight into whether memory pressure is happening - as per posts like the following:
https://learn.microsoft.com/en-us/archive/blogs/psssql/how-it-works-what-are-the-ring_buffer_resource_monitor-telling-me
https://learn.microsoft.com/en-us/archive/blogs/mvpawardprogram/using-sys-dm_os_ring_buffers-to-diagnose-memory-issues-in-sql-server
https://www.sqlskills.com/blogs/jonathan/identifying-external-memory-pressure-with-dm_os_ring_buffers-and-ring_buffer_resource_monitor/
Along those lines, I use the following query/diagnostic to check for memory pressure:
WITH core AS (
    SELECT
        EventTime,
        record.value('(/Record/ResourceMonitor/Notification)[1]', 'varchar(max)') AS [Type],
        record.value('(/Record/ResourceMonitor/IndicatorsProcess)[1]', 'int') AS [IndicatorsProcess],
        record.value('(/Record/ResourceMonitor/IndicatorsSystem)[1]', 'int') AS [IndicatorsSystem],
        record.value('(/Record/ResourceMonitor/IndicatorsPool)[1]', 'int') AS [IndicatorsPool],
        record.value('(/Record/MemoryNode/@id)[1]', 'int') AS [MemoryNode],
        record.value('(/Record/MemoryRecord/AvailablePhysicalMemory)[1]', 'bigint') AS [Avail Phys Mem, Kb],
        record.value('(/Record/MemoryRecord/AvailableVirtualAddressSpace)[1]', 'bigint') AS [Avail VAS, Kb],
        record
    FROM (
        SELECT
            DATEADD(ss, (-1 * ((cpu_ticks / CONVERT(float, (cpu_ticks / ms_ticks))) - [timestamp]) / 1000), GETDATE()) AS EventTime,
            CONVERT(xml, record) AS record
        FROM sys.dm_os_ring_buffers
        CROSS JOIN sys.dm_os_sys_info
        WHERE ring_buffer_type = 'RING_BUFFER_RESOURCE_MONITOR'
    ) AS tab
)
SELECT
    EventTime,
    [Type],
    IndicatorsProcess,
    IndicatorsSystem,
    IndicatorsPool,
    MemoryNode,
    CAST([Avail Phys Mem, Kb] / (1024.0 * 1024.0) AS decimal(20,2)) AS [Avail Phys Mem (GB)],
    CAST([Avail VAS, Kb] / (1024.0 * 1024.0) AS decimal(20,2)) AS [Avail VAS (GB)],
    record
FROM core
WHERE [Type] = N'RESOURCE_MEMPHYSICAL_LOW'
ORDER BY EventTime DESC;
As in, if you run that against effectively ANY SQL Server instance, you REALLY don't want to see ANY results from this query. Or, if you do, they should be at times when you're running REALLY heavy workloads (ugly data-loading/population jobs or other huge processing operations) that you already know are problems from a performance perspective.
Otherwise, the occasional entry/hiccup (i.e., set of results) isn't necessarily a reason to worry about major problems. But if you're routinely seeing rows from the above under regular workloads, you'll want to investigate the details listed above (cache and clerk sizes/thresholds, trap for any large memory grants, check plan-cache sizing based on overall RAM, etc.) and/or start looking into cache clock hands to see exactly where memory is being scavenged:
https://learn.microsoft.com/en-us/archive/blogs/slavao/q-and-a-clock-hands-what-are-they-for
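As a starting point on the clock-hands angle, there is a DMV that exposes them directly. A hedged sketch - caches whose hands have swept and evicted entries show nonzero counts:
SELECT name,
       [type],
       clock_hand,               -- HAND_EXTERNAL vs. HAND_INTERNAL pressure
       clock_status,
       rounds_count,             -- how many sweeps this hand has made
       removed_all_rounds_count  -- entries evicted across all sweeps
FROM sys.dm_os_memory_cache_clock_hands
WHERE rounds_count > 0
ORDER BY removed_all_rounds_count DESC;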

How could database have worse benchmark results on faster disk?

I'm benchmarking comparable (2 vCPU, 2GB RAM) servers (Ubuntu 18.04) from DigitalOcean (DO) and AWS EC2 (t3a.small).
The disk benchmark (fio) is in line with the results of https://dzone.com/articles/iops-benchmarking-disk-io-aws-vs-digitalocean
In summary:
DO --
READ: bw=218MiB/s (229MB/s), 218MiB/s-218MiB/s (229MB/s-229MB/s), io=3070MiB (3219MB), run=14060-14060msec
WRITE: bw=72.0MiB/s (76.5MB/s), 72.0MiB/s-72.0MiB/s (76.5MB/s-76.5MB/s), io=1026MiB (1076MB), run=14060-14060msec
EC2 --
READ: bw=9015KiB/s (9232kB/s), 9015KiB/s-9015KiB/s (9232kB/s-9232kB/s), io=3070MiB (3219MB), run=348703-348703msec
WRITE: bw=3013KiB/s (3085kB/s), 3013KiB/s-3013KiB/s (3085kB/s-3085kB/s), io=1026MiB (1076MB), run=348703-348703msec
which shows the DO disk to be more than 10 times faster than the EC2 instance's EBS volume.
However, sysbench (following https://severalnines.com/database-blog/how-benchmark-postgresql-performance-using-sysbench, using the PostgreSQL 11 default configuration and the read-write test on oltp_legacy/oltp.lua) shows DO slower than EC2:
DO --
transactions: 14704 (243.87 per sec.)
Latency (ms):
min: 9.06
avg: 261.77
max: 2114.04
95th percentile: 383.33
EC2 --
transactions: 20298 (336.91 per sec.)
Latency (ms):
min: 5.85
avg: 189.47
max: 961.27
95th percentile: 215.44
What could be the explanation?
Sequential read/write throughput matters for large sequential scans, stuff like data warehousing, loading a large backup, etc.
Your benchmark is OLTP which does lots of small quick queries. For this sequential throughput is irrelevant.
For reads (SELECTs) the most important factor is having enough RAM to keep your working set in cache and not do any actual IO. Failing that, it is read random access time.
For writes (UPDATE, INSERT), the most important factor is fsync latency - the time required to commit data to stable storage - since the database only finishes a COMMIT once the data has been durably written.
Most likely the EC2 instance has better random access and fsync performance; maybe it uses SSDs or a battery-backed cache.
Sequential bandwidth and latency/IOPS are independent parameters. Some workloads (like databases) depend on latency for lots of small IOs, or on throughput for lots of small IO operations - IOPS (IOs per second).
In addition to the IOPS vs. throughput distinction others have mentioned, I also want to point out that the two results are fairly close: 240 tps vs. 330 tps. You could add or subtract almost that much just by running VACUUM or ANALYZE, or by letting the system sit idle for a while.
There could be other factors too: different CPU speeds, burst performance vs. throttling of a heavy user, the presence or absence of huge_pages, different cache timings, memory speeds, or different NVMe drivers. The point is that 240 is not as far below 330 as you might think.
Update: something else to point out is that OLTP read/write transactions aren't necessarily bottlenecked by disk performance; with synchronous commit off, they really aren't.
I don't know exactly what the sysbench legacy OLTP read-write test is doing, but I suspect it's more like a bank transaction touching multiple records and using indexes - probably not some sort of raw maximum insertion rate or maximum CRUD operation rate benchmark.
I get 1,000 tps on my desktop in the write-heavy benchmark against PostgreSQL 13, but I can insert something like 50K records per second (each around 100 bytes) from a single-process Python client during bulk loads, and nearly 100K with synchronous commit off.
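For reference, "sync off" here refers to PostgreSQL's synchronous_commit setting. A minimal sketch of toggling it (the database name is illustrative; turning it off risks losing the last few transactions on a crash, though not data corruption):
-- Per session, e.g. for a bulk load:
SET synchronous_commit = off;
-- Or persisted for a whole (hypothetical) database:
ALTER DATABASE benchdb SET synchronous_commit = off;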

Oracle Database CPU, Memory utilization history

I need to determine the workload of our database instance each week. An AWR report provides many details, but it's very hard to break the data down.
I need a query that produces a data set representing each snap_id with the following values:
CPU utilization
Memory Utilization
Read/Write operations
Using this set I will be able to create a histogram showing the CPU, memory, and read/write utilization during the week, by hour.
You can try querying the DBA_HIST_SYSMETRIC_SUMMARY view to get CPU utilization, memory utilization, and read/write operations at the SNAP_ID level.
Sample query provided below:
select *
from dba_hist_sysmetric_summary
where snap_id = <snap_id>
  and metric_name in ('Host CPU Utilization (%)',
                      'I/O Megabytes per Second',
                      'I/O Requests per Second',
                      'Total PGA Allocated');
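To cover a whole week in one pass rather than one snapshot at a time, you can join to DBA_HIST_SNAPSHOT so each row carries its interval start time. A sketch, assuming your AWR retention covers the window:
select s.snap_id,
       s.begin_interval_time,
       m.metric_name,
       m.average
from dba_hist_sysmetric_summary m
join dba_hist_snapshot s
  on s.snap_id = m.snap_id
 and s.dbid = m.dbid
 and s.instance_number = m.instance_number
where m.metric_name in ('Host CPU Utilization (%)',
                        'I/O Megabytes per Second',
                        'I/O Requests per Second',
                        'Total PGA Allocated')
  and s.begin_interval_time >= sysdate - 7
order by s.snap_id, m.metric_name;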

Low cost way to host a large table yet keep the performance scalable?

I have a growing table storing time-series data: 500M rows now, with 200K new records arriving every day. The total size is around 15GB at the moment.
My clients query the table mostly via a PHP script, and the size of the result set is around 10K records (not very large):
select * from T where timestamp > X and timestamp < Y and additionFilters
And I want this operation cheap.
Currently the table is hosted in Postgres 7 on a single box with 16GB of memory, and I would love some good suggestions for hosting this at low cost while still being able to scale up for performance if needed.
The table serves:
1. Query: 90%
2. Insert: 9.9%
3. Update: 0.1% <-- very rare.
PostgreSQL 9.2 supports partitioning and partial indexes. If there are a few hot partitions, and you can put those partitions or their indexes on a solid-state disk, you should be able to run rings around your current configuration.
There may or may not be a low cost, scalable option. It depends on what low cost and scalable mean to you.
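To make the partial-index suggestion concrete, here is a minimal sketch against the table from the question (the index name and cutoff date are illustrative). The idea is to index only the hot, recent range so the index stays small enough to live in cache:
CREATE INDEX t_recent_ts_idx
    ON T ("timestamp")
    WHERE "timestamp" >= DATE '2013-01-01';
The predicate must be a constant, so an index like this needs to be recreated periodically with a newer cutoff as the hot range moves forward.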

What to monitor on SQL Server

I have been asked to monitor SQL Server (2005 & 2008) and am wondering what good metrics to look at are. I can access WMI counters but am slightly lost as to how much depth is going to be useful.
Currently I have on my list:
user connections
logins per second
latch waits per second
total latch wait time
deadlocks per second
errors per second
Log and data file sizes
I am looking to be able to monitor values that will indicate a degradation of performance on the machine or a potential serious issue. To this end, I am also wondering what values for some of these would be considered normal vs. problematic.
As I reckon it would probably be a really good question to have answered for the general community, I thought I'd court some of you DBA experts out there (I am certainly not one of them!).
Apologies if this is a rather open-ended question.
Ry
I would also monitor page life expectancy and your buffer cache hit ratio; see Use sys.dm_os_performance_counters to get your Buffer cache hit ratio and Page life expectancy counters for details.
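A minimal sketch for pulling both counters from the DMV. One caveat worth knowing: the hit ratio is stored as two raw counters, and the usable percentage is the ratio value divided by its base value:
SELECT [object_name], counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Buffer Manager%'
  AND counter_name IN ('Page life expectancy',
                       'Buffer cache hit ratio',
                       'Buffer cache hit ratio base');
-- hit ratio % = 100.0 * [Buffer cache hit ratio] / [Buffer cache hit ratio base]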
Late answer but can be of interest to other readers
One of my colleagues had a similar problem and used this thread to help get started.
He also ran into a blog post describing common causes of performance issues and instructions on which metrics should be monitored, besides the ones already mentioned here. These other metrics are:
• %Disk Time:
This counter indicates a disk problem, but must be observed in conjunction with the Current Disk Queue Length counter to be truly informative. Recall also that the disk could be a bottleneck prior to the %Disk Time reaching 100%.
• %Disk Read Time and the %Disk Write Time:
The %Disk Read Time and %Disk Write Time metrics are similar to %Disk Time, just showing the operations read from or written to disk, respectively. They are actually the Average Disk Read Queue Length and Average Disk Write Queue Length values presented as percentages.
• %Idle Time:
Measures the percentage of time the disk was idle during the sample interval. If this counter falls below 20 percent, the disk system is saturated. You may consider replacing the current disk system with a faster disk system.
• %Free Space:
Measures the percentage of free space on the selected logical disk drive. Take note if this falls below 15 percent, as you risk running out of free space for the OS to store critical files. One obvious solution here is to add more disk space.
If you would like to read the whole post, you may find it here:
http://www.sqlshack.com/sql-server-disk-performance-metrics-part-2-important-disk-performance-measures/
Use SQL Profiler to identify your Top 10 (or more) queries. Create a baseline performance for these queries. Review current average execution times vs. your baseline, and alert if significantly above your baseline. You can also use this list to identify queries for possible optimization.
This attacks the problem at a higher level than just reviewing detailed stats, although those stats can also be useful. I have found this approach to work on any DBMS, including MySQL and Oracle. If your top query times start to go up, you can bet you are starting to run into performance issues, which you can then start to drill into in more detail.
Budget permitting, it's worth looking at some 3rd party tools to help. We use Idera's SQL Diagnostic Manager to monitor server health and Confio's Ignite to keep an eye on query performance. Both products have served us well in our shop.
Percent CPU utilization and average disk queue length are also pretty standard. CPU consistently over 80% indicates you may need more or better CPUs (and servers to house them); a queue length consistently over 2 on any disk indicates a disk I/O bottleneck on that drive.
You should monitor the total pages allocated to a particular process. You can get that information by querying the system DMVs:
SELECT s.session_id,
       s.login_name,
       tsu.user_objects_alloc_page_count,
       tsu.internal_objects_alloc_page_count,
       TSQL.[text] AS sql_text  -- column list is illustrative; the original snippet omitted it
FROM sys.dm_exec_sessions s
LEFT JOIN sys.dm_exec_connections c
    ON s.session_id = c.session_id
LEFT JOIN sys.dm_db_task_space_usage tsu
    ON tsu.session_id = s.session_id
LEFT JOIN sys.dm_os_tasks t
    ON t.session_id = tsu.session_id
   AND t.request_id = tsu.request_id
LEFT JOIN sys.dm_exec_requests r
    ON r.session_id = tsu.session_id
   AND r.request_id = tsu.request_id
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) TSQL;
The following post explains really well how you can use it to monitor your server when nothing else works:
http://tsqltips.blogspot.com/2012/06/monitor-current-sql-server-processes.html
Besides the performance metrics suggested above, I strongly recommend monitoring available memory, Batch Requests/sec, SQL Compilations/sec, and SQL Recompilations/sec. All are available in the sys.dm_os_performance_counters view and in Windows Performance Monitor.
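Note that Batch Requests/sec and the compilation counters are cumulative since instance start, so to get a true per-second rate from the DMV you need to sample twice and divide by the interval. A sketch (the counter names below match the DMV's spelling as far as I know; the 10-second interval is arbitrary):
DECLARE @before TABLE (counter_name nvarchar(128), v bigint);
INSERT @before
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Batch Requests/sec',
                       N'SQL Compilations/sec',
                       N'SQL Re-Compilations/sec');

WAITFOR DELAY '00:00:10';  -- sample interval

SELECT p.counter_name,
       (p.cntr_value - b.v) / 10.0 AS per_second
FROM sys.dm_os_performance_counters p
JOIN @before b ON b.counter_name = p.counter_name;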
As for
ideally I'd like to organise monitored items into 3 categories, say 'FYI', 'Warning' & 'Critical'
There are many third party monitoring tools that enable you to create alerts of different severity level, so once you determine what to monitor and what are recommended values for your environment, you can set low, medium, and high alerts.
Check Brent Ozar's article on not so useful metrics here.
