malloc fragmentation in lots of 64MB arenas - c

I am struggling to profile what looks like internal malloc memory fragmentation in a database server application. To rule out a leak, all malloc, realloc and free calls are wrapped with our own accounting, which prepends a header to each allocation so we can keep a running balance, and the code is run under Valgrind with quite a big test suite. Moreover, most of the time we use our custom allocator, directly mmaping pools of memory and doing our own administration.
glibc's malloc is used only for some small stuff that doesn't fit the scheme of our own allocator.
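For illustration, a minimal sketch of such an accounting wrapper (the names acct_malloc, acct_free and g_balance are made up for this sketch, not our actual code; realloc is handled the same way):

    #include <stdlib.h>
    #include <stdatomic.h>
    #include <stddef.h>

    /* Hypothetical accounting header prepended to every allocation.
       The union keeps the pointer we hand out suitably aligned. */
    typedef union {
        size_t size;
        max_align_t align;
    } acct_hdr;

    static _Atomic size_t g_balance;   /* bytes currently outstanding */

    static void *acct_malloc(size_t n)
    {
        acct_hdr *h = malloc(sizeof *h + n);
        if (!h)
            return NULL;
        h->size = n;
        atomic_fetch_add(&g_balance, n);
        return h + 1;                  /* user memory starts after the header */
    }

    static void acct_free(void *p)
    {
        if (!p)
            return;
        acct_hdr *h = (acct_hdr *)p - 1;
        atomic_fetch_sub(&g_balance, h->size);
        free(h);
    }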
Running a test for a few days that just keeps allocating and freeing a lot of memory in the server (lots of short connections coming and going, lots of DDL operations modifying global catalogs) results in the "RES" memory creeping up and staying up, way above our internal accounting.
After these few days of testing, we count a total of about 400TB of memory being malloced/freed, with the balance reported by our accounting varying around a few hundred megabytes to 2-3 GB most of the time (with spikes up to 15GB). The "RES" memory of the process however never goes down below 8.3-8.4GB.
Parsing /proc/$PID/maps, practically all of it is in "rw-p" mappings of exactly 64MB (or "rw-p" plus a "---p" reserved "tail") - in a captured snapshot, 143 such arenas account almost exactly for those 8.3-8.4 GB of resident memory.
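For reference, counting those mappings can be done with a small helper along these lines (a simplified sketch, not the exact tool I used; it only tallies anonymous rw-p mappings of exactly 64MB and ignores the "---p" tails):

    #include <stdio.h>
    #include <string.h>

    /* Count anonymous rw-p mappings of exactly 64 MB in /proc/self/maps. */
    int main(void)
    {
        FILE *f = fopen("/proc/self/maps", "r");
        if (!f)
            return 1;

        char line[512];
        unsigned long long total = 0;
        int arenas = 0;

        while (fgets(line, sizeof line, f)) {
            unsigned long long start, end;
            char perms[8], path[256] = "";
            if (sscanf(line, "%llx-%llx %7s %*s %*s %*s %255s",
                       &start, &end, perms, path) < 3)
                continue;
            if (strcmp(perms, "rw-p") == 0 && path[0] == '\0'
                && (end - start) == (64ULL << 20)) {
                arenas++;
                total += end - start;
            }
        }
        fclose(f);
        printf("%d rw-p mappings of exactly 64MB, %llu bytes total\n",
               arenas, total);
        return 0;
    }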
Googling around tells me that malloc allocates memory in such 64MB arenas, and that multiple arenas can cause excessive "VIRT" memory:
https://infobright.com/blog/malloc_arena_max/
https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en
However, in my case most of the arenas are full and actually count towards RES, not VIRT (only 9 out of the 143 arenas have a "---p" tail of more than 1 MB).
In this case it is just a few GB of memory, but on actual production systems we've seen the discrepancy grow to numbers like 40-50 GB (on a 512 GB RAM server).
Is there a way that I could get more insight into this fragmentation? malloc_info output seems to be somewhat corrupted, reporting some odd numbers like:
<unsorted from="321" to="847883883078550" total="140585643867701" count="847883883078876"/>
- the exact same line (same "to", "total" and "count") repeats in every heap.
I'm going to test the behaviour of different allocators (jemalloc, tcmalloc) in a similar fashion.
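In the meantime, a glibc-specific probe along these lines might give a bit more visibility (just a sketch; malloc_stats, malloc_info, malloc_trim and mallopt(M_ARENA_MAX, ...) are all glibc extensions, and capping the arena count mostly addresses VIRT rather than the fragmentation itself):

    #include <malloc.h>   /* glibc: malloc_info, malloc_stats, malloc_trim, mallopt */
    #include <stdio.h>
    #include <stdlib.h>

    /* Dump glibc allocator state so it can be compared against RES over time. */
    static void dump_malloc_state(void)
    {
        malloc_stats();            /* per-arena system/in-use bytes, on stderr */
        malloc_info(0, stdout);    /* XML dump of all arenas and their bins */
        malloc_trim(0);            /* give back whatever free tail pages it can */
    }

    int main(void)
    {
        /* Limit the number of arenas before any threads start allocating;
           equivalent to setting MALLOC_ARENA_MAX in the environment. */
        mallopt(M_ARENA_MAX, 2);

        /* ... run the workload here ... */
        void *p = malloc(1 << 20);
        free(p);

        dump_malloc_state();
        return 0;
    }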

Related

High Paging file % Usage while memory is not full

I was handed a server hosting SQL Server, and I was asked to find the causes of its bad performance.
While monitoring PerfMon I found that:
Paging file: % Usage = 25% average for 3 days.
Memory: Pages/sec > 1 average for 3 days.
What I know is that if % Usage is > 2%, then there is too much paging because of memory pressure and a lack of memory space. However, when I opened the Memory tab of Resource Monitor, I found:
- 26 GB in use (out of 32 GB total RAM)
- 2 GB standby
- 4 GB of memory free!
If there is 4 GB of free memory, why the paging? And, most importantly, why is the paging file % usage so high?
Someone please explain this situation and how paging file % usage can be lowered to normal.
Note that SQL Server Max. memory is set to 15GB
Page file usage on its own isn't a major red flag. The OS will tend to use a page file even when there's plenty of RAM available, because it allows it to dump the relevant parts of memory from RAM when needed - don't think of the page file usage as memory moved from RAM to HDD - it's just a copy. All the accesses will still use RAM, the OS is simply preparing for a contingency - if it didn't have the memory pre-written to the page file, the memory requests would have to wait for "old" memory to be dumped before freeing the RAM for other uses.
Also, it seems you're a bit confused about how paging works. All user-space memory is always paged; this has nothing to do with the page file itself - it simply means you're using virtual memory. The metric you're looking for is Hard faults per second (EDIT: uh, I misread which one you're reading - Pages/sec is how many hard faults there are; the rest still applies), which tells you how often the OS had to actually read data from the page file. Even then, 1 per second is extremely low. You will rarely see anything until that number goes above fifty per second or so, and much higher for SSDs (on my particular system, I can get thousands of hard faults with no noticeable memory lag - this varies a lot based on the actual HDD and your chipset and drivers).
Finally, there are way too many ways SQL Server performance can suffer. If you don't have a real DBA (or at least someone with plenty of DB experience), you're in trouble. Most of your lines of inquiry will lead you to dead-ends - something as complex and optimized as a DB engine is always hard to diagnose properly. Identify signs - is there high CPU usage? Is there high RAM usage? Are there queries with high I/O usage? Are there specific queries that are giving you trouble, or does the whole DB suffer? Are your indices and tables properly maintained? Those are just the very basics. Once you have some extra information like this, try DBA.StackExchange.com - SO isn't really the right place to ask for DBA advice :)
Just some shots in the dark really; they might be a little random, but I couldn't spot anything straight away:
might there be processes that select uselessly large data sets or run operations too frequently? (e.g. the awful app-developer practice of using SELECT * everywhere, or fetching all data and then filtering it at the application level, or running DB queries in loops instead of getting record sets once, etc.)
is indexing proper? (e.g. are leaf elements employed where possible to reduce the key lookup operations, are heavy queries backed up with proper indices to avoid table & index scans etc.)
how is data population managed? (e.g. is it possible that there are too many page splits due to improper clustered indices or parallel inserts, are there some index rebuilds taking place etc.)

Uploading Large (8GB) File Issue using Weka

I am trying to upload an 8GB file to Weka to use the Apriori algorithm. The server configuration is as follows:
It's an 8-processor server with 4 cores each; the physical address space is 40 bits and the virtual address space is 48 bits. It's a 64-bit processor.
Physical memory = 26GB and swap = 27GB
The JVM is 64-bit. We have allocated 32GB for the JVM heap using the -Xmx option. Our concern is that loading such a huge file is taking a very long time (around 8 hours); Java is using 107% CPU and 91% memory, it has not thrown an out-of-memory error, and Weka still shows it is reading from the file.
Please help me: how do I handle such a huge file, and what exactly is happening here?
Regards,
Aniket
I can't speak to Weka; I don't know your data set or how many elements are in it. The number of elements matters because in a 64-bit JVM the pointers are huge, and they add up.
But do NOT create a JVM larger than physical RAM. Swap is simply not an option for Java. A swapping JVM is a dead JVM. Swap is for idle, rarely used processes.
Also note that the Xmx value and the physical heap size are not the same; the physical size will always be larger than the Xmx value.
You should pre-allocate your JVM heap (Xms == Xmx) and try out various values until MOST of your physical RAM is consumed. This will limit full GCs and memory fragmentation. It also helps (a little) to do this on a fresh system if you're allocating such a large portion of the total memory space.
But whatever you do, do not let Java swap. Swapping and Garbage Collectors do not mix.

Why is memory fragmentation an issue on a 64-bit machine?

In a 32-bit machine each process gets a 4GB virtual space. In this case one can worry that we might face trouble due to fragmentation. But in the case of a 64-bit machine we theoretically have a huge addressable virtual memory, so why is memory fragmentation still an issue (if it is) in a 64-bit machine?
Each virtual address that you try to access is mapped by the operating system to physical memory. Physical memory is allocated in pages (e.g. 4K in size). If you manage to allocate a byte at offset 1000000*n and do it for n from 1 to 1000000 (I think you could do that with mmap), then the OS will have to back that with a million pages of physical memory, which is something like 4G. That physical memory will not be available for anything else. If you had allocated the bytes contiguously, you'd only need about 1M of physical memory (256 pages) for your million bytes.
You can get in a similar bad situation if you allocate 4G for legitimate reasons, and then deallocate parts of it, keeping a bit of every page allocated. The OS will not be able to actually reuse the freed memory for anything else because there are no physical pages that are fully free. So that's a fragmentation problem.
In theory, you could imagine that virtual addresses 1000000 and 2000000 would map to the same page of physical memory, avoiding the fragmentation. But in practice, and for good reasons, the virtual memory mapping is done on a page by page basis. You can read more about it here: http://en.wikipedia.org/wiki/Page_table.
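A small sketch of the scattered mmap scenario described a couple of paragraphs above (scaled down to 1 GB of address space, touching one byte per megabyte; each write faults in a whole 4K page, so the mapping's resident size ends up roughly 4096 times the bytes actually used):

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t span   = (size_t)1 << 30;    /* 1 GB of address space */
        size_t stride = (size_t)1 << 20;    /* touch one byte per MB  */

        unsigned char *p = mmap(NULL, span, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        for (size_t off = 0; off < span; off += stride)
            p[off] = 1;                     /* each write faults in a 4K page */

        /* ~1024 bytes "used", but ~1024 pages (~4 MB) are now resident. */
        printf("touched %zu bytes, expect ~%zu KB resident for this mapping\n",
               span / stride, (span / stride) * 4);
        getchar();                          /* pause here: inspect with pmap/ps */
        munmap(p, span);
        return 0;
    }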
Because all that memory is "wasted", consider an application with a lot of internal fragmentation. That process requires more pages in memory because its working set is scattered, which means its memory footprint is much higher. If this application is contending for physical slots in RAM (machines still really only have about 4-8 GB of RAM in a typical home setup), it causes more page swapping. Generally you want to reduce your application's memory footprint to avoid memory pressure and contention with other applications.
There are cases, though, where it doesn't really matter: it won't kill you to use an extra megabyte here or there, but it all adds up in larger applications. Whether it is important to keep fragmentation as low as possible depends on what you're coding and what the aim of your project is.

How much data can be malloced at a time? What is the limit in a modern OS such as Linux?

How much data can be malloced, and how is the limit determined? I am writing an algorithm in C that repeatedly uses some data stored in arrays. My idea is to keep this data in dynamically allocated arrays, but I am not sure whether it's possible to have such amounts malloced.
I use 200 arrays of size 2046, holding complex data of 8 bytes each. I use these throughout the process, so I do not wish to recalculate them over and over.
What are your thoughts about feasibility of such an approach?
Thanks
Mir
How much memory malloc() can allocate depends on:
How much memory your program can address directly
How much physical memory is available
How much swap space is available
On a modern, flat-memory-model 32-bit system, your program can address 4 gigabytes, but some of the address space (usually 2 gigabytes, sometimes 1 gigabyte) is reserved for the kernel. So, as a rule of thumb, you should be able to allocate almost two gigabytes at once, assuming you have the physical memory and swap space to back it up.
On a 64-bit system, running a 64-bit OS and a 64-bit program, your addressable memory is essentially unlimited.
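If you just want to see the practical limit on a given system, you can probe it (a sketch that binary-searches the largest single malloc() that currently succeeds and frees it immediately; note that with Linux's optimistic overcommit the result can be far larger than the memory you could actually touch):

    #include <stdio.h>
    #include <stdlib.h>

    /* Binary-search the largest single malloc() that succeeds right now. */
    int main(void)
    {
        size_t lo = 0;
        size_t hi = (size_t)1 << 46;   /* ~64 TB; assumes a 64-bit build */

        while (lo + 1 < hi) {
            size_t mid = lo + (hi - lo) / 2;
            void *p = malloc(mid);
            if (p) {
                free(p);
                lo = mid;              /* mid worked, try bigger */
            } else {
                hi = mid;              /* mid failed, try smaller */
            }
        }
        printf("largest single malloc: %zu bytes (~%zu MB)\n", lo, lo >> 20);
        return 0;
    }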
200 arrays of 2046 entries of 8 bytes each is only about 3.3 MB, which is tiny by today's standards and may even fit in the CPU cache.
A 32-bit OS has a limit of 4GB of address space per process; typically some of it (up to half on Win32) is reserved for the operating system - mapping the address space of graphics card memory, etc.
Linux supports 64GB of physical memory (using Intel's 36-bit PAE) in its 32-bit versions.
EDIT: although each process is still limited to 4GB of address space.
The main problem with allocating large amounts of memory is if you need it to be locked in RAM - then you obviously need a lot of RAM. Or if you need it all to be contiguous - it's much easier to get four 1GB chunks of memory than a single 4GB chunk with nothing else in the way.
A common approach is to allocate all the memory you need at the start of the program so you can be sure that if the app isn't going to be possible it will fail instantly rather than when it's done 90% of the work.
Don't run other memory intensive apps at the same time.
There are also a bunch of flags you can use to suggest to the kernel that this app should get priority in memory or keep memory locked in RAM - sorry, it's been too long since I did HPC on Linux and I'm probably out of date with modern kernels.
I think that on most modern (64-bit) systems you can allocate 4GB at a time with a single malloc(size_t) call if that much memory is available. How big is each of those 'complex data' entries? If they are 256 bytes each, then you'll only need to allocate 100MB:
256 bytes × 200 arrays × 2048 entries = 104857600 bytes
104857600 bytes / 1024 / 1024 = 100MB
So even at 4096 bytes each that's still only 1600MB, or roughly 1.6GB, which is feasible on most systems today; my four-year-old laptop has 3GB of internal memory. Sometimes I do image manipulation with GIMP and it takes up over 2GB of memory.
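For the sizes actually given in the question (200 arrays of 2046 complex values of 8 bytes each, roughly 3.3 MB in total), a plain allocation per array is trivially feasible; here is a sketch using C99 float complex, which is 8 bytes on common platforms:

    #include <complex.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_ARRAYS  200
    #define NUM_ENTRIES 2046

    int main(void)
    {
        /* float complex is 8 bytes (two 4-byte floats) on common platforms. */
        float complex *data[NUM_ARRAYS];

        for (int i = 0; i < NUM_ARRAYS; i++) {
            data[i] = calloc(NUM_ENTRIES, sizeof *data[i]);
            if (!data[i]) {
                perror("calloc");
                exit(EXIT_FAILURE);
            }
        }

        printf("allocated %zu bytes in total\n",
               (size_t)NUM_ARRAYS * NUM_ENTRIES * sizeof(float complex));

        /* ... use the arrays throughout the computation ... */

        for (int i = 0; i < NUM_ARRAYS; i++)
            free(data[i]);
        return 0;
    }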
With some implementations of malloc(), the regions are not actually backed by memory until they really get used so you can in theory carry on forever (though in practice of course the list of allocated regions assigned to your process in the kernel takes up space, so you might find you can only call malloc() a few million times even if it never actually gives you any memory). It's called "optimistic allocation" and is the strategy used by Linux (which is why it then has the OOM killer, for when it was over-optimistic).
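A tiny way to see that optimistic behaviour (a sketch: the 1 GB malloc returns immediately, but the process's resident size only grows once the pages are actually written - compare RSS before and after the memset with ps or /proc/self/status):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t n = (size_t)1 << 30;      /* ask for 1 GB */
        char *p = malloc(n);
        if (!p)
            return 1;

        puts("allocated but untouched - RSS is still small; press Enter");
        getchar();

        memset(p, 1, n);                 /* now every page has to be backed */
        puts("touched - RSS has grown by ~1 GB; press Enter");
        getchar();

        free(p);
        return 0;
    }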

overhead for an empty heap arena

My tools are Linux, gcc and pthreads. When my program calls new/delete from several threads, and when there is contention for the heap, 'arena's are created (see the following link for reference http://www.bozemanpass.com/info/linux/malloc/Linux_Heap_Contention.html). My program runs 24x7, and arenas are still occasionally being created after 2 weeks. I think there may eventually be as many arenas as threads. ps(1) shows alarming memory consumption, but I suspect that only a small portion of it is actually mapped.
What is the 'overhead' for an empty arena? (How much more memory per arena is used than if all allocation was confined to the traditional heap? )
Is there any way to force the creation in advance of n arenas? Is there any way to force the destruction of empty arenas?
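For what it's worth, the arena creation itself is easy to reproduce with a small contention loop (a sketch in C with pthreads, compiled with -pthread; after the threads finish, glibc's malloc_stats() prints one "Arena N" block per arena that was created - it doesn't answer the pre-creation or destruction part):

    #include <malloc.h>    /* glibc: malloc_stats */
    #include <pthread.h>
    #include <stdlib.h>

    #define THREADS 8

    /* Each thread hammers malloc/free so it collides with the others and
       gets migrated to (or triggers the creation of) its own arena. */
    static void *hammer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            void *p = malloc(64 + (i & 1023));
            free(p);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[THREADS];

        for (int i = 0; i < THREADS; i++)
            pthread_create(&t[i], NULL, hammer, NULL);
        for (int i = 0; i < THREADS; i++)
            pthread_join(t[i], NULL);

        malloc_stats();   /* prints one "Arena N:" block per arena, on stderr */
        return 0;
    }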
struct malloc_state (aka mstate, aka arena descriptor) has the following size:
glibc-2.2: (256+18)*4 bytes =~ 1 KB for 32-bit mode and ~2 KB for 64-bit mode.
glibc-2.3: (256+256/32+11+NFASTBINS)*4 =~ 1.1-1.2 KB in 32-bit mode and 2.4-2.5 KB in 64-bit mode.
See the glibc-x.x.x/malloc/malloc.c file, struct malloc_state.
Destruction of arenas... I don't know yet, but there is this text (briefly: it says NO to the possibility of destroying/trimming that memory) from an analysis, http://www.citi.umich.edu/techreports/reports/citi-tr-00-5.pdf, from 2000 (a bit outdated). Please name your glibc version.
Ptmalloc maintains a linked list of subheaps. To reduce lock contention, ptmalloc searches for the first unlocked subheap and grabs memory from it to fulfill a malloc() request. If ptmalloc doesn't find an unlocked heap, it creates a new one. This is a simple way to grow the number of subheaps as appropriate without adding complicated schemes for hashing on thread or processor ID, or maintaining workload statistics. However, there is no facility to shrink the subheap list and nothing stops the heap list from growing without bound.
from malloc.c (glibc 2.3.5) line 1546
/*
-------------------- Internal data structures --------------------
All internal state is held in an instance of malloc_state defined
below.
...
Beware of lots of tricks that minimize the total bookkeeping space
requirements. **The result is a little over 1K bytes** (for 4byte
pointers and size_t.)
*/
That matches the result I calculated above for 32-bit mode: a little over 1 KB.
Consider using TCMalloc from google-perftools. It is just better suited for threaded and long-living applications. And it is very FAST.
Take a look at http://goog-perftools.sourceforge.net/doc/tcmalloc.html, especially at the graphs (higher is better). TCMalloc is twice as fast as ptmalloc.
In our application, the main cost of multiple arenas has been "dark" memory: memory allocated from the OS that we don't have any references to.
The pattern you can see is:
Thread X goes to alloc, hits a collision, creates a new arena.
Thread X makes some large allocations.
Thread X makes some small allocation(s).
Thread X stops allocating.
The large allocations are freed, but the whole arena, up to the high-water mark of the last currently active allocation, is still using up VMEM, and other threads won't use this arena unless they hit contention in the main arena.
Basically it's a contributor to "memory fragmentation", since there are multiple places memory can be available, but needing to grow an arena is not a reason to look in other arenas. At least I think that's the cause; the point is that your application can end up with a bigger VM footprint than you think it should have. This mostly hits you if you have limited swap, since, as you say, most of this ends up paged out.
Our (memory-hungry) application can have tens of percent of its memory "wasted" in this way, and it can really bite in some situations.
I'm not sure why you would want to create empty arenas in advance. If allocations and frees happen in the same thread, then I think over time you will tend towards all of them being in the same thread-specific arena with no contention. You may have some small blips while you get there, so maybe that's a reason.
