.NET - different DebugDiag and perfmon GC Heap size results - heap-memory

I am using DebugDiag 1.2 and perfmon.exe to monitor memory usage for a .NET application.
DebugDiag shows the GC Heap Size as 35.51 MB, while for the same instant perfmon shows the #Bytes in all heaps as 4.5 MB.
Why are the two values different? Don't they represent the same thing?

"Bytes in all heaps" represents memory used by all the .NET objects which are currently in use by the application where as GC Heap size is actually the memory committed in the .NET heap. The reason why GC heap size will be greater is because it includes the memory used by the objects that are marked as "FREE". .NET does not immediately return back all the memory that is marked as FREE back to the OS and that is freed on the next garbage collection so that memory is still marked as commit and is still in .net heap but not really in use by anything in the application.
For more background, see http://blogs.msdn.com/b/tess/archive/2005/11/25/496973.aspx and search for the word "Free" in that post.

Related

Handle memory properly with a pool of structs

I have a program with three pools of structs. For each of them I keep a list of used structs and another one for the unused structs. During execution the program consumes structs and returns them to the pool on demand. There is also a garbage collector that cleans up the "zombie" structs and returns them to the pool.
At the beginning of the execution, the virtual memory, as expected, shows around 10 GB* of memory allocated, and as the program uses the pool, the RSS memory increases.
Although the used structs are returned to the pool and marked as unused, the RSS does not decrease. I expected this, because the OS doesn't know what I'm doing with the memory and cannot tell whether I'm really using it or just managing a pool.
What I would like to do is force the unused memory out of RSS and back to being merely reserved virtual memory whenever I want, for example when RSS grows above X GB.
Is there any way, given a memory pointer, to mark a memory area so that it is moved back out of physical memory? I know this is the operating system's responsibility, but maybe there is a way to force it.
Maybe I shouldn't care about this. What do you think?
Thanks in advance.
Note 1: This program is used in High Performance Computing; that's why it uses this amount of memory.
I've included a picture of pool usage vs. memory usage for a few files. As you can see, the sudden drops in pool usage are due to the garbage collector; what I would like to see is that drop reflected in the memory usage.
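For reference, the setup described above looks roughly like this. It is only a minimal sketch, assuming one large anonymous mmap'd block of identical structs; all names (item_t, pool_t, pool_get, ...) are hypothetical.

#include <stddef.h>
#include <sys/mman.h>

typedef struct item {
    struct item *next;      /* intrusive link for the unused (free) list */
    char payload[120];      /* application data lives here */
} item_t;

typedef struct {
    item_t *slots;          /* backing storage obtained with mmap */
    item_t *unused;         /* head of the unused list */
    size_t  count;
} pool_t;

static int pool_init(pool_t *p, size_t count)
{
    p->slots = mmap(NULL, count * sizeof(item_t), PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p->slots == MAP_FAILED)
        return -1;
    p->count = count;
    p->unused = NULL;
    for (size_t i = 0; i < count; i++) {   /* thread every slot onto the unused list */
        p->slots[i].next = p->unused;
        p->unused = &p->slots[i];
    }
    return 0;
}

static item_t *pool_get(pool_t *p)          /* consume a struct from the pool */
{
    item_t *it = p->unused;
    if (it)
        p->unused = it->next;
    return it;
}

static void pool_put(pool_t *p, item_t *it) /* return a struct to the pool */
{
    it->next = p->unused;
    p->unused = it;
}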
You can do this as long as you allocate your memory via mmap rather than malloc. You want the madvise function with the MADV_DONTNEED advice (the portable equivalent is posix_madvise with POSIX_MADV_DONTNEED).
Just remember to call madvise with MADV_WILLNEED before using the pages again to ensure there is actually memory behind them.
This does not guarantee the pages will be swapped out, but it gives the kernel a strong hint to do so when it has time.
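A minimal sketch of that approach, assuming an anonymous mmap'd pool; the size and names are illustrative. Note that on Linux, madvise(MADV_DONTNEED) on anonymous private memory actually discards the page contents, so only apply it to regions whose contents you no longer need.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t pool_len = 64UL * 1024 * 1024;              /* 64 MB pool (illustrative) */
    void *pool_base = mmap(NULL, pool_len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (pool_base == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* ... use the pool; touching pages makes RSS grow ... */

    /* When RSS crosses your threshold, tell the kernel the region's
       contents are no longer needed; RSS for this range should drop. */
    if (madvise(pool_base, pool_len, MADV_DONTNEED) != 0)
        perror("madvise(MADV_DONTNEED)");

    /* Before reusing the region, hint the kernel to fault it back in. */
    if (madvise(pool_base, pool_len, MADV_WILLNEED) != 0)
        perror("madvise(MADV_WILLNEED)");

    munmap(pool_base, pool_len);
    return 0;
}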
Git 2.19 (Q3 2018) offers an example of a memory pool of structs.
For a large tree, the index needs to hold many cache entries allocated on the heap.
These cache entries are now allocated out of a dedicated memory pool to amortize malloc(3) overhead.
See commit 8616a2d, commit 8e72d67, commit 0e58301, commit 158dfef, commit 8fb8e3f, commit a849735, commit 825ed4d, commit 768d796 (02 Jul 2018) by Jameson Miller (jamill).
(Merged by Junio C Hamano -- gitster -- in commit ae533c4, 02 Aug 2018)
block alloc: allocate cache entries from mem_pool
When reading large indexes from disk, a significant portion of the time is spent in malloc() calls.
This can be mitigated by allocating a large block of memory up front and managing it ourselves via memory pools.
This change moves cache entry allocation on top of memory pools.
Design:
The index_state struct will gain a notion of an associated memory_pool from which cache_entries will be allocated.
When reading in the index from disk, we have information on the number of entries and their size, which can guide us in deciding how large our initial memory allocation should be.
When an index is discarded, the associated memory_pool will be discarded as well - so the lifetime of a cache_entry is tied to the lifetime of the index_state that it was allocated for.
In the case of a split index, the following rules are followed. First, some terminology:
'the_index': represents the logical view of the index
'split_index': represents the "base" cache entries.
Read from the split index file.
'the_index' can reference a single split_index, as well as cache_entries from the split_index. the_index will be discarded before the split_index is.
This means that when we are allocating cache_entries in the presence of a split index, we need to allocate the entries from the split_index's memory pool.
This allows us to follow the pattern that the_index can reference cache_entries from the split_index, and that the cache_entries will not be freed while they are still being referenced.
Managing transient cache_entry structs:
Cache entries are usually allocated for an index, but this is not always the case. Cache entries are sometimes allocated because this is the type that the existing checkout_entry function works with.
Because of this, the existing code needs to handle cache entries associated with an index / memory pool, and those that only exist transiently.
Several strategies were contemplated around how to handle this.
Chosen approach:
An extra field was added to the cache_entry type to track whether the cache_entry was allocated from a memory pool or not.
This is currently an int field, as there are no more available bits in the existing ce_flags bit field. If / when more bits are needed, this new field can be turned into a proper bit field.
We decided that tracking and iterating over known memory pool regions was less desirable than adding an extra field to track this state.
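To make the block-allocation idea concrete, here is a simplified sketch of a pool that hands out entries from large malloc'd blocks. It is only an illustration of the technique under the lifetime rules described above, not Git's actual mem_pool API; the struct and function names are hypothetical.

#include <stdlib.h>

struct mp_block {
    struct mp_block *next;
    size_t used, size;
    unsigned char data[];       /* flexible array member: the block's storage */
};

struct mem_pool {
    struct mp_block *blocks;    /* list of large blocks, one malloc each */
    size_t block_size;          /* default size to request for a new block */
};

static void *mem_pool_alloc(struct mem_pool *pool, size_t len)
{
    /* round the request up so returned chunks stay pointer-aligned */
    len = (len + sizeof(void *) - 1) & ~(sizeof(void *) - 1);

    struct mp_block *b = pool->blocks;
    if (!b || b->used + len > b->size) {
        size_t size = len > pool->block_size ? len : pool->block_size;
        b = malloc(sizeof(*b) + size);   /* one malloc amortized over many entries */
        if (!b)
            return NULL;
        b->next = pool->blocks;
        b->used = 0;
        b->size = size;
        pool->blocks = b;
    }
    void *p = b->data + b->used;
    b->used += len;
    return p;
}

static void mem_pool_discard(struct mem_pool *pool)
{
    /* Discarding the pool frees every entry allocated from it, mirroring
       the rule that a cache_entry's lifetime is tied to its index_state. */
    struct mp_block *b = pool->blocks;
    while (b) {
        struct mp_block *next = b->next;
        free(b);
        b = next;
    }
    pool->blocks = NULL;
}

int main(void)
{
    struct mem_pool pool = { NULL, 64 * 1024 };     /* 64 KB blocks, for example */
    void *entry = mem_pool_alloc(&pool, 88);        /* one cache-entry-sized chunk */
    (void)entry;
    mem_pool_discard(&pool);
    return 0;
}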

malloc in Release vs Debug (VC 2012)

I wrote a quick and dirty program to leak memory by calling malloc repeatedly. I noticed that when I run my program in the Debug configuration (in VS 2012), it consumes gigabytes of memory as expected and keeps going until the page file is full (Windows Task Manager reports a high Working Set size). However, when I run the program in Release mode, the Working Set size of my program remains tiny but the Commit size keeps growing. There is also markedly less disk thrashing and page faulting.
The MSDN documentation states that when in Debug mode, malloc is mapped to _malloc_dbg, but the documentation also states that both allocate memory on the heap, only _malloc_dbg allocates extra memory for debugging information - there is no mention of different heap behaviour (i.e. why it doesn't show up in Private Working Set in Release mode).
Pray tell, what's going on?
When a virtual memory page is committed, no physical memory is allocated until the page is accessed.
The debug malloc fills the newly-allocated memory with a known pattern, whereas the release malloc doesn't initialize it.
The initialization results in more pages of physical RAM (and more thrashing) required in debug than in release.
If you were to actually touch every page of the allocated memory, I'd expect most of the difference between the two builds to disappear.
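A small sketch of that difference in portable C (the sizes are arbitrary and the 0xCD fill byte mimics the MSVC debug-heap pattern): allocating alone mostly grows the commit charge, while touching the pages grows the working set.

#include <stdlib.h>
#include <string.h>

int main(void)
{
    enum { N = 100 };
    size_t sz = 16u * 1024 * 1024;           /* 16 MB per block, arbitrary */
    void *blocks[N];

    for (int i = 0; i < N; i++) {
        blocks[i] = malloc(sz);              /* commit grows; working set barely moves */
        if (!blocks[i])
            return 1;
    }

    /* Touching every page (as the debug heap's fill pattern does) forces
       physical pages to be backed, so the working set grows as well. */
    for (int i = 0; i < N; i++)
        memset(blocks[i], 0xCD, sz);         /* 0xCD mimics the CRT debug fill byte */

    for (int i = 0; i < N; i++)
        free(blocks[i]);
    return 0;
}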

Why doesn't the memory usage decrease after I free the data I allocated?

I created a linked list with 1,000,000 items, which took about 16 MB of memory. Then I removed and freed half of them. I thought the memory usage would decrease, but it didn't.
Why is that?
I'm checking the memory usage via Activity Monitor on Mac OS X 10.8.2.
In case you want to check my code, here it is.
Generally speaking, free doesn't release memory back to the OS. The memory is still allocated to the process, so the OS reports it as allocated. From your program's point of view, it's available to satisfy any new allocations you make.
Be aware that since you freed every other node, your memory is almost certainly very fragmented now. The free memory sits in small chunks with allocated memory in between, so it can only be used to satisfy small allocations. If you make a larger allocation, the process will go to the OS for more memory.
Since the process gets memory from the OS one page at a time, it can't release such fragmented memory back to the OS even if it wanted to: you're still using part of every page.
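A sketch of the scenario, assuming a node layout similar to the one described (the exact struct is hypothetical); the comments note what RSS in Activity Monitor would typically show at each step.

#include <stdlib.h>

struct node {
    struct node *next;
    int value;
};

int main(void)
{
    enum { N = 1000000 };
    struct node **nodes = malloc(N * sizeof(*nodes));
    if (!nodes)
        return 1;

    for (int i = 0; i < N; i++) {            /* RSS grows as each node is touched */
        nodes[i] = malloc(sizeof(struct node));
        if (!nodes[i])
            return 1;
        nodes[i]->value = i;
        nodes[i]->next = NULL;
    }

    for (int i = 0; i < N; i += 2) {         /* free every other node: RSS stays roughly */
        free(nodes[i]);                      /* the same, since every heap page still holds a live node */
        nodes[i] = NULL;
    }

    /* The freed chunks can still satisfy new small allocations without
       asking the OS for more pages. */
    for (int i = 0; i < N; i += 2)
        nodes[i] = malloc(sizeof(struct node));

    return 0;
}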

OutOfMemoryError while reading Excel data

I am trying to read data from an Excel file (xlsx format) which is 100 MB in size. While reading the Excel data I get an OutOfMemoryError. I tried increasing the JVM heap size to 1024 MB, but it didn't help, and I can't increase the size beyond that. I also tried running garbage collection, but that didn't help either. Can anyone help me resolve this issue?
Thanks
Pavan Kumar O V S.
By default a JVM places an upper limit on the amount of memory available to the current process in order to prevent runaway processes gobbling system resources and making the machine grind to a halt. When reading or writing large spreadsheets, the JVM may require more memory than has been allocated to the JVM by default - this normally manifests itself as a java.lang.OutOfMemoryError.
For command line processes, you can allocate more memory to the JVM using the -Xms and -Xmx options, e.g. to set an initial heap of 10 MB with 100 MB as the upper bound you can use:
java -Xms10m -Xmx100m -classpath jxl.jar spreadsheet.xls
See http://www.andykhan.com/jexcelapi/tutorial.html#introduction for further details.

overhead for an empty heap arena

My tools are Linux, gcc and pthreads. When my program calls new/delete from several threads, and when there is contention for the heap, 'arena's are created (see the following link for reference http://www.bozemanpass.com/info/linux/malloc/Linux_Heap_Contention.html). My program runs 24x7, and arenas are still occasionally being created after 2 weeks. I think there may eventually be as many arenas as threads. ps(1) shows alarming memory consumption, but I suspect that only a small portion of it is actually mapped.
What is the 'overhead' for an empty arena? (How much more memory per arena is used than if all allocation was confined to the traditional heap?)
Is there any way to force the creation in advance of n arenas? Is there any way to force the destruction of empty arenas?
struct malloc_state (aka mstate, aka the arena descriptor) has the following size:
glibc-2.2: (256+18)*4 bytes =~ 1 KB in 32-bit mode and ~2 KB in 64-bit mode.
glibc-2.3: (256+256/32+11+NFASTBINS)*4 =~ 1.1-1.2 KB in 32-bit mode and 2.4-2.5 KB in 64-bit mode.
See struct malloc_state in the glibc-x.x.x/malloc/malloc.c file.
As for destruction of arenas... I don't know yet, but there is this text (briefly: it says NO to the possibility of destroying/trimming arena memory) from the analysis at http://www.citi.umich.edu/techreports/reports/citi-tr-00-5.pdf from 2000 (a bit outdated). Please name your glibc version.
Ptmalloc maintains a linked list of subheaps. To reduce lock contention, ptmalloc searches for the first unlocked subheap and grabs memory from it to fulfill a malloc() request. If ptmalloc doesn't find an unlocked heap, it creates a new one. This is a simple way to grow the number of subheaps as appropriate without adding complicated schemes for hashing on thread or processor ID, or maintaining workload statistics. However, there is no facility to shrink the subheap list and nothing stops the heap list from growing without bound.
from malloc.c (glibc 2.3.5) line 1546
/*
-------------------- Internal data structures --------------------
All internal state is held in an instance of malloc_state defined
below.
...
Beware of lots of tricks that minimize the total bookkeeping space
requirements. The result is a little over 1K bytes (for 4byte
pointers and size_t.)
*/
That is the same result I got for 32-bit mode: a little over 1K bytes.
Consider using TCMalloc from google-perftools. It is better suited for threaded and long-living applications, and it is very fast.
Take a look at http://goog-perftools.sourceforge.net/doc/tcmalloc.html, especially at the graphs (higher is better); in those benchmarks TCMalloc is roughly twice as fast as ptmalloc.
In our application the main cost of multiple arenas has been "dark" memory: memory allocated from the OS that we don't have any references to.
The pattern you can see is:
Thread X goes to alloc, hits a collision, and creates a new arena.
Thread X makes some large allocations.
Thread X makes some small allocation(s).
Thread X stops allocating.
Large allocations are freed. But the whole arena, up to the high-water mark of the last currently active allocation, is still using up VMEM, and other threads won't use this arena unless they hit contention in the main arena.
Basically it's a contributor to "memory fragmentation", since there are multiple places memory can be available, but needing to grow an arena is not a reason to look in other arenas. At least I think that's the cause; the point is that your application can end up with a bigger VM footprint than you think it should have. This mostly hits you if you have limited swap, since, as you say, most of this ends up paged out.
Our (memory hungry) application can have tens of percent of its memory "wasted" in this way, and it can really bite in some situations.
I'm not sure why you would want to create empty arenas. If allocations and frees happen on the same thread, then I think over time they will all tend to end up in the same thread-specific arena with no contention. You may see some small blips while you get there, so maybe that's a reason.
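If you want to experiment, glibc offers a couple of knobs that are relevant here, although neither pre-creates nor destroys arenas: mallopt(M_ARENA_MAX, n) (glibc >= 2.10, also settable via the MALLOC_ARENA_MAX environment variable) caps how many arenas get created, and malloc_stats() prints per-arena usage to stderr. A minimal, glibc-specific sketch:

#include <malloc.h>
#include <stdlib.h>

#ifndef M_ARENA_MAX
#define M_ARENA_MAX -8      /* not exported by some older malloc.h headers */
#endif

int main(void)
{
    /* Cap the process at 2 arenas; on contention, threads wait on an
       existing arena instead of creating another one.
       (Equivalently: export MALLOC_ARENA_MAX=2 before starting.) */
    mallopt(M_ARENA_MAX, 2);

    void *p = malloc(1 << 20);

    /* Prints per-arena "system bytes" vs "in use bytes" to stderr,
       showing how much each arena holds but isn't actually using. */
    malloc_stats();

    free(p);
    return 0;
}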
