High Paging file % Usage while memory is not full - sql-server

I was handed a server hosting SQL Server and was asked to find the causes of its poor performance.
While monitoring with PerfMon, I found:
Paging file: % Usage = 25% average for 3 days.
Memory: Pages/sec > 1 average for 3 days.
What I know is that if % Usage is greater than 2%, then there is too much paging due to memory pressure and a lack of memory. However, when I opened the Memory tab in Resource Monitor, I found:
- 26 GB in use (out of 32 GB total RAM)
- 2 GB standby
- 4 GB free
If there are 4 GB of free memory, why the paging? And, most importantly, why is the paging file % Usage so high?
Can someone please explain this situation and how the paging file % Usage can be lowered to normal?
Note that SQL Server's max server memory is set to 15 GB.

Page file usage on its own isn't a major red flag. The OS will tend to use a page file even when there's plenty of RAM available, because it allows it to dump the relevant parts of memory from RAM when needed - don't think of the page file usage as memory moved from RAM to HDD - it's just a copy. All the accesses will still use RAM, the OS is simply preparing for a contingency - if it didn't have the memory pre-written to the page file, the memory requests would have to wait for "old" memory to be dumped before freeing the RAM for other uses.
Also, it seems you're a bit confused about how paging works. All user-space memory is always paged; this has nothing to do with the page file itself - it simply means you're using virtual memory. The metric you're looking for is Hard faults per second (EDIT: I misread which counter you're reading - Pages/sec is how many hard faults there are; the rest still applies), which tells you how often the OS had to actually read data from the page file. Even then, 1 per second is extremely low. You will rarely see any impact until that number goes above fifty per second or so, and much higher for SSDs (on my particular system, I can get thousands of hard faults with no noticeable memory lag - this varies a lot based on the actual HDD, your chipset, and drivers).
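If you want to watch this programmatically instead of through Resource Monitor, a minimal sketch using the Windows PDH API to sample the \Memory\Pages/sec counter could look like the following (the counter path and the 1-second interval are just illustrative; link against pdh.lib):

    /* Minimal sketch: sample \Memory\Pages/sec via the Windows PDH API.
       Counter path and 1-second interval are illustrative. Link with pdh.lib. */
    #include <windows.h>
    #include <pdh.h>
    #include <stdio.h>

    int main(void)
    {
        PDH_HQUERY query;
        PDH_HCOUNTER counter;
        PDH_FMT_COUNTERVALUE value;

        if (PdhOpenQuery(NULL, 0, &query) != ERROR_SUCCESS)
            return 1;

        /* Locale-independent path for the paging counter discussed above. */
        if (PdhAddEnglishCounterW(query, L"\\Memory\\Pages/sec", 0, &counter) != ERROR_SUCCESS)
            return 1;

        PdhCollectQueryData(query);   /* first sample establishes a baseline          */
        Sleep(1000);                  /* rate counters need two samples to compute    */
        PdhCollectQueryData(query);

        if (PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, NULL, &value) == ERROR_SUCCESS)
            printf("Pages/sec: %.2f\n", value.doubleValue);

        PdhCloseQuery(query);
        return 0;
    }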
Finally, there are far too many ways SQL Server performance can suffer. If you don't have a real DBA (or at least someone with plenty of DB experience), you're in trouble. Most of your lines of inquiry will lead to dead ends - something as complex and optimized as a DB engine is always hard to diagnose properly. Identify the signs - is there high CPU usage? Is there high RAM usage? Are there queries with high I/O usage? Are there specific queries giving you trouble, or does the whole DB suffer? Are your indexes and tables properly maintained? Those are just the very basics. Once you have some extra information like this, try DBA.StackExchange.com - SO isn't really the right place to ask for DBA advice :)

Just some shots in the dark, really; these might be a little random, but nothing jumps out straight away:
might there be processes that select uselessly large data sets or run operations too frequently? (e.g. the awful application-developer practice of using SELECT * everywhere, fetching all the data and then filtering it at the application level, or running DB queries in loops instead of fetching record sets once, etc.)
is indexing proper? (e.g. are leaf-level elements employed where possible to reduce key lookup operations; are heavy queries backed by proper indexes to avoid table and index scans, etc.)
how is data population managed? (e.g. is it possible that there are too many page splits due to improper clustered indexes or parallel inserts; are there index rebuilds taking place, etc.)

Related

Is 2ms a realistic performance goal for small writes to LMDB on a Raspberry Pi 3?

I'm about to begin work on a new project on a Raspberry Pi 3. It will be controlled with a complex GUI so interactive performance is important.
I decided to use LMDB for persistence because - performance aside - its special traits make my system a lot simpler, and for me there's no downside.
The application will be written in Rust, using the lmdb crate.
The critical path will be a single-threaded section where I get the current timestamp and write a total of 16 or 20 bytes (not sure yet - is 16 bytes really better?) to the database under a key I have already computed.
For doing so (beginning with the timestamp and ending after the write transaction has been committed) I have a performance budget of 2 milliseconds.
From what I've heard, writes are LMDB's weak point, and these are very small random writes, so this is probably the worst possible workload. That thought made me ask this question.
Further information:
as this is a GUI application, this path will be called at most 100 times per second, and never more than 1000 times per hour (this is a rewrite from scratch, and those are 10 times the numbers measured in the current system).
I do not understand much about how LMDB works, but AFAIU it uses memory-mapped files. I was hoping this means the OS will write back the pages in aggregate.
Is 2ms a realistic goal for such an application? What would I need to consider to keep myself inside this window? Do I need to cache those writes manually?
It depends on how much data you have in your LMDB.
2 ms is definitely achievable for a small DB. I have gotten over 100K writes per second with SSDs and a small DB (<1 GB). But if the DB is larger than your memory, you probably should not use LMDB if you are worried about write performance.
To wrap up: if your DB is smaller than memory, you can go ahead and use LMDB; 2 ms is definitely enough for a write operation. If your DB is larger than memory, then you are better off using LSM-based key-value stores like LevelDB or RocksDB.
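If you want to measure this on the actual Pi hardware before committing to a design, here is a minimal sketch using the LMDB C API (which the Rust lmdb crate wraps); the database path, map size, key, and 20-byte value are illustrative assumptions:

    /* Rough sketch: time one small put + commit with the LMDB C API.
       Path, map size, key, and value size are illustrative assumptions.
       Build with -llmdb; the ./testdb directory must already exist. */
    #include <lmdb.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        MDB_env *env;
        MDB_txn *txn;
        MDB_dbi dbi;
        MDB_val key, val;
        unsigned char kbuf[8] = {0}, vbuf[20] = {0};
        struct timespec t0, t1;

        mdb_env_create(&env);
        mdb_env_set_mapsize(env, 64UL * 1024 * 1024);  /* 64 MB map, plenty for a small DB */
        mdb_env_open(env, "./testdb", 0, 0664);

        clock_gettime(CLOCK_MONOTONIC, &t0);

        mdb_txn_begin(env, NULL, 0, &txn);
        mdb_dbi_open(txn, NULL, 0, &dbi);
        key.mv_size = sizeof kbuf; key.mv_data = kbuf;
        val.mv_size = sizeof vbuf; val.mv_data = vbuf;
        mdb_put(txn, dbi, &key, &val, 0);
        mdb_txn_commit(txn);                           /* the durability cost is paid here */

        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("put+commit: %.3f ms\n", ms);

        mdb_env_close(env);
        return 0;
    }

Note that by default each commit is synced to storage, and on an SD card that sync will likely dominate the 2 ms budget; LMDB offers environment flags such as MDB_NOSYNC that trade durability for commit latency.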

Is there a way to avoid cache misses _completely_?

I read the very basics on how the cache works here: How and when to align to cache line size? and here: What is "cache-friendly" code? , but none of these posts answered my question: is there a way to execute some code entirely within the cache, i.e., without using any access to RAM (beyond perhaps during the initial process of reading the file from the HDD)? As far as I understand the bottleneck in computation nowadays is mostly memory bandwidth, and "as long as you are within the CPU, you are just fine".
Is there a way to load a program into the cache and keep it there until it terminates? So let's say I have a 1 MB compiled C program, which does some scientific computation with a memory requirement of another 1 MB, and runs for 5 days. Is there a way to flag this code so that it does not leave the cache during execution? I am thinking of giving this code higher priority, or something similar, during execution.
In other words, how much cache is used by an idling computer, which loads its OS (say Ubuntu) and then does nothing? Is there excessive cache use during idling? Should I expect my small program to always be in the cache if the OS does not do anything besides executing it? Let's say after 5 minutes the screensaver starts. Does this lead to massive cache misses (and hence a drastic reduction in performance), since it now competes with my program for the cache space? My experience says that running several non-demanding programs (like the screensaver, or a simple audio player, PDF reader, etc.) at the same time does not significantly decrease the performance of my scientific program, even though I would expect it to be going in and out of the cache all the time. The question is: why isn't its speed affected? Would it make sense to use an absolutely minimalistic OS (if so, which one?) to improve (or rather: maintain) the speed of the computation?
Just for clarity, we can assume that the code is something very simple, say it is a bunch of nested for loops where the innermost part sums up all the increment variables modulo 97. The point is that it is small enough to be put and executed in the cache.
There are different types of CPU cache misses: compulsory, conflict, capacity, coherence.
Compulsory misses can't be avoided, as they happen on the first reference to a location in memory. So no, you definitely can't avoid cache misses completely.
Besides that, typical L1 cache sizes today are 32 KB or 64 KB per core, and L2 cache sizes are around 256 KB per core. So 1 MB of data would also cause either capacity or conflict misses, depending on the cache's associativity.
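As a rough, machine-dependent illustration of the capacity effect, you could time a traversal of a working set that fits in L1 against one that is far larger than L2 (the sizes and pass counts below are illustrative):

    /* Rough illustration of capacity misses: sum an array repeatedly and compare
       a working set that fits in L1 with one far larger than L2/L3.
       Sizes and pass counts are illustrative; absolute numbers depend on the machine. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static volatile long sink;   /* prevents the compiler from discarding the loops */

    static double ns_per_element(size_t n, int passes)
    {
        int *a = malloc(n * sizeof *a);
        long sum = 0;
        struct timespec t0, t1;

        if (!a) return 0.0;
        for (size_t i = 0; i < n; i++)
            a[i] = (int)i;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int p = 0; p < passes; p++)
            for (size_t i = 0; i < n; i++)
                sum += a[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        sink = sum;
        free(a);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
        return ns / ((double)n * passes);
    }

    int main(void)
    {
        /* 4K ints = 16 KB: fits in L1.  16M ints = 64 MB: far beyond L2/L3. */
        printf("16 KB working set: %.3f ns/element\n", ns_per_element(4UL * 1024, 10000));
        printf("64 MB working set: %.3f ns/element\n", ns_per_element(16UL * 1024 * 1024, 3));
        return 0;
    }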
No, on most standard architectures, CPU cache is not addressable.*
And even if you could, what kind of performance improvement are you anticipating here? What percentage of your program's execution time do you believe is being spent loading from main memory into (L3) cache? You should profile your program to determine where it's actually spending its time, rather than dreaming up solutions to problems that don't exist!
* I think x86 CPUs might have a hardware configuration which allows them to operate without attached RAM, but that's basically irrelevant.
Short answer: no. The cache is managed by the CPU and OS, and it is a bad idea to allow programs to force themselves to stay in the cache. Say you have two programs running at the same time and both try to force themselves to stay in the cache; chaos would ensue, wouldn't it?
Newer Intel CPUs have added "Cache Allocation Technology" (CAT) under the general rubric of their Resource Director Technology. This allows software directives to reserve certain cache (and other) resources for particular computational units (application, container, VM, etc). So, if the process in question has enough cache space set aside for it under CAT, it should experience only its initial compulsory misses (to bring its code and data into cache) and self-induced conflict misses, avoiding capacity misses and conflict misses created by other processes.
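On Linux, CAT is usually driven through the resctrl filesystem rather than from the application itself. A rough sketch, assuming a kernel with RDT/CAT support and resctrl mounted at /sys/fs/resctrl (the group name, cache id, and capacity bitmask are illustrative), might look like:

    /* Rough sketch: reserve part of the L3 for one process via Linux resctrl.
       Assumes resctrl is mounted at /sys/fs/resctrl and the kernel supports CAT.
       The group name, cache id (0) and capacity bitmask (0xf) are illustrative. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        FILE *f;

        /* 1. Create a resource group (a directory under /sys/fs/resctrl). */
        mkdir("/sys/fs/resctrl/sci_app", 0755);

        /* 2. Restrict the group to a subset of L3 ways on cache id 0. */
        f = fopen("/sys/fs/resctrl/sci_app/schemata", "w");
        if (!f) { perror("schemata"); return 1; }
        fprintf(f, "L3:0=f\n");
        fclose(f);

        /* 3. Assign this process to the group; its cache fills now use only those ways. */
        f = fopen("/sys/fs/resctrl/sci_app/tasks", "w");
        if (!f) { perror("tasks"); return 1; }
        fprintf(f, "%d\n", (int)getpid());
        fclose(f);

        /* ... run the cache-sensitive computation here ... */
        return 0;
    }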
I am not sure whether this will answer your questions.
is there a way to execute some code entirely within the cache, i.e., without using any access to RAM?
Is there a way to load a program into the cache, and keep it there until it terminates?
It is possible to use a fully associative cache (or, for example, tightly coupled memories), which has single-cycle access times. (This is realistic only in very small embedded systems.) It is general practice to use TCMs in embedded systems for time-critical code, as they provide predictability.
In the case of set-associative caches, it is possible to lock down cache lines or ways (for example, using CP15 on ARM), so that the eviction algorithm doesn't consider them as victims for a cache fill.
As a side note, it is also sometimes useful to use cache-as-RAM for bring-up of non-booting boards when the caches are in debug mode.
(http://www.asset-intertech.com/Products/Processor-Controlled-Test/PCT-Software/Cache-as-RAM-for-board-bring-up-of-non-boothing-ci)

madvise managing memory mapped files

I have a program that processes a large dataset consisting of a large number (300+) of sizable (40 MB+) memory-mapped files. All the files are needed together, though they are accessed in a sequential way. At the moment I am memory-mapping the files and then using madvise with MADV_SEQUENTIAL, since I don't want the thing to be any more of a memory hog than it needs to be (without any madvise the consumption becomes a problem). The problem is that the program runs much slower (like 50x slower) than the disk I/O of the system would indicate it should, and gets worse faster than linearly as the number of files involved increases. Processing 100 files is more than 10x faster than 300 files, despite the 300 files being only 3x the data. I suspect that the memory-mapped files are generating a page fault every time a 4 KB page boundary is crossed, with the net result that disk seek time is greater than disk transfer time.
Can anyone think of a better way than using madvise with MADV_WILLNEED and MADV_DONTNEED every so often, and if this is the best way, any ideas as to how far to look ahead?
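In case it helps as a starting point, here is a rough sketch of that sliding-window idea for a single mapped file (the chunk and read-ahead sizes are illustrative and worth tuning against your disks):

    /* Rough sketch: process one memory-mapped file sequentially while keeping a
       bounded window resident: MADV_WILLNEED a few MB ahead, MADV_DONTNEED behind.
       Chunk and read-ahead sizes are illustrative and worth tuning. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define CHUNK      (1UL << 20)     /* process 1 MB at a time        */
    #define READAHEAD  (8UL << 20)     /* ask the kernel for 8 MB ahead */

    static void process(const unsigned char *p, size_t n) { (void)p; (void)n; }

    int process_file(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;

        struct stat st;
        fstat(fd, &st);
        size_t len = (size_t)st.st_size;

        unsigned char *base = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);
        if (base == MAP_FAILED) return -1;

        madvise(base, len, MADV_SEQUENTIAL);

        for (size_t off = 0; off < len; off += CHUNK) {
            size_t chunk = (len - off < CHUNK) ? len - off : CHUNK;

            /* Prefetch a bounded window ahead of the current chunk. */
            if (off + chunk < len) {
                size_t ahead = len - (off + chunk);
                if (ahead > READAHEAD) ahead = READAHEAD;
                madvise(base + off + chunk, ahead, MADV_WILLNEED);
            }

            process(base + off, chunk);

            /* Drop pages we are done with so the resident set stays small. */
            madvise(base + off, chunk, MADV_DONTNEED);
        }

        munmap(base, len);
        return 0;
    }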

Will writing a million times to a file spoil my hard disk?

I have an I/O-intensive simulation program that logs the simulation trace/data to a file at every iteration. As the simulation runs for millions of iterations or more and logs the data to a file on disk (overwriting the file each time), I am curious whether that would wear out the hard disk, since most storage devices have an upper limit on write/erase cycles (e.g. flash disks allow up to 100,000 write/erase cycles). Would splitting the data into multiple files be a better option?
You need to recognize that a million write calls to a single file may only write to each block of the disk once, which doesn't cause any harm to magnetic disks or SSD devices. If you overwrite the first block of the file one million times, you run a greater risk of wearing things out, but there are lots of mitigating factors. First, if it is a single run of a program, the o/s is likely to keep the disk image in memory without writing to disk at all in the interim — unless, perhaps, you're using a journalled file system. If it is a journalled file system, then the actual writing will be spread over lots of different blocks.
If you manage to write to the same block on a magnetic spinning hard disk a million times, you are still not at serious risk of wearing the disk out.
A Google search on 'hard disk write cycles' shows a lot of informative articles (more particularly, perhaps, about SSD), and the related searches may also help you out.
On an SSD, there is a limited number of writes (or erase cycles, to be more accurate) to any particular block. It's probably in the range of 100K to 1 million for any given block, and SSDs use "wear leveling" to avoid unnecessary writes to the same block every time. SSDs can only write zeros, so when you "reset" a bit to one, you have to erase the whole block. [One could put an inverter on the cell to make it the other way around, but you get one or t'other, so it doesn't help much.]
Real hard disks are mechanical devices, so there isn't so much of a problem with how many times you write to the same place; the wear comes more from head movements.
I wouldn't worry too much about it. Writing one file should be fine; it makes little difference whether you have one file or many.

Reading discrete set of pages from a disk at one go

The problem is as follows:
I have a certain file on disk which is huge (say a terabyte). I want to read, say, N pages (discrete and not contiguous, with a huge spread) from this file with the minimum number of disk reads (or rather, I want to minimize the time taken to read these N pages by minimizing the rotational and seek delays of the disk). Ideally, I would start reading from a page and all the reads would complete before one disk rotation ends. The difference in the positions of the pages is huge, so I cannot simply issue one read command starting from the first page and ending at the last page, covering all N pages; that would take an immense amount of memory to store. (Extra: I was going through some material and bumped into the "list prefetching" mechanism in a database. I read through it and found that such an implementation could solve my problem.)
Can somebody please help me solve this problem in C? Thanks in advance!
You are going to need something like a page replacement algorithm, with prefetch. You didn't tell us how you will operate on the pages, how long you will need them in memory, etc., but I suppose you will have to handle the situation where memory is full and you need to release some of the pages. Look at the algorithms mentioned (LRU, MRU, etc.); that is what OSes use for swapping.
You could also consider using the OS's memory-mapped files - they have page replacement algorithms already implemented, though I'm not sure about prefetch (it depends on the OS; I suppose Linux will be much more advanced than Windows in this area). You can save a lot of work this way, but it might not be perfectly optimized for your case.
Regarding disk access optimizations, try reading some theory on how OSes do it. Look at disk scheduling algorithms like SCAN or C-SCAN, e.g. at this link.
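As a simple starting point in the spirit of list prefetching, you could sort the page offsets so the reads sweep the disk in one direction, hint the kernel with posix_fadvise, and then pread the pages in ascending order (the page size, the pairing of buffers with the sorted offsets, and the error handling below are illustrative):

    /* Rough sketch: read N discrete pages with one elevator-style sweep.
       Sort the offsets, tell the kernel they will be needed (posix_fadvise),
       then pread them in ascending order. Page size and handling are illustrative.
       Note: offsets is sorted in place, so bufs[i] receives the page at the
       i-th smallest offset, not the original i-th request. */
    #define _XOPEN_SOURCE 600
    #define _FILE_OFFSET_BITS 64      /* terabyte-sized files need 64-bit offsets */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define PAGE_SZ 4096UL

    static int cmp_off(const void *a, const void *b)
    {
        off_t x = *(const off_t *)a, y = *(const off_t *)b;
        return (x > y) - (x < y);
    }

    int read_pages(const char *path, off_t *offsets, size_t n, unsigned char **bufs)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;

        /* 1. Sort so the reads sweep the disk in one direction (SCAN-like order). */
        qsort(offsets, n, sizeof offsets[0], cmp_off);

        /* 2. Hint the kernel so it can batch/prefetch the upcoming reads. */
        for (size_t i = 0; i < n; i++)
            posix_fadvise(fd, offsets[i], PAGE_SZ, POSIX_FADV_WILLNEED);

        /* 3. Issue the reads in ascending offset order. */
        for (size_t i = 0; i < n; i++) {
            if (pread(fd, bufs[i], PAGE_SZ, offsets[i]) != (ssize_t)PAGE_SZ) {
                close(fd);
                return -1;
            }
        }

        close(fd);
        return 0;
    }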
