!heap –s displays “Lock contention” - heap-memory

I’m analyzing a dump for a native memory leak, and I noticed “Lock contention”
in the !heap –s output. I can’t remember seeing this before.
What does this mean?

This is lock contention for the heap manager. High lock contention is generally caused by a high number of concurrent allocation requests against the same heap. If the lock contention is high, it is recommended to create separate heaps to reduce the overall lock contention. The function HeapCreate can be used to create new heaps (http://msdn.microsoft.com/en-us/library/aa366599%28v=vs.85%29.aspx).

Related

Does MADV_REMOVE result in a TLB shootdown?

I have a tmpfs with a number of large file-backed shared memory mappings. I'd like to be able to punch out any unused pages in the process that writes the files without causing major detriment to other concurrent processes. I understand that MADV_FREE and MADV_DONTNEED can/will result in a TLB shootdown, but I cannot find anything describing any potential ill effects of MADV_REMOVE.

Modern System Architecture?

What could happen if we used Peterson's solution to the critical section problem on a modern computer? It is my understanding that systems with multiple CPUs can run into difficulty because of the ordering of memory reads and writes with respect to other reads and writes in memory, but is this the problem with most modern systems? Are there any advantages to using semaphores VS mutex locks?
Hey, interesting question! To answer it, it helps to pin down the terms first. The critical section is just the part of a program that should not be executed concurrently by more than one of that program's processes or threads at a time. Multiple concurrent accesses are not allowed, so only one process can be inside the critical section at any moment. Typically this "critical section" accesses a shared resource such as a data structure or a network connection.
Mutual Exclusion or mutex just describes the requirement that only one concurrent process is in the critical section at a time, so concurrent access to shared data must ensure this "mutual exclusion".
So this introduces the problem! How do we ensure that processes do not interfere with each other inside these sections? In other words, how do we ensure "atomic access" to the various critical sections by the threads?
There are a few solutions to the "critical-section problem" but the one you mention is Peterson's solution so we will discuss that.
Peterson's algorithm is designed for mutual exclusion and allows two tasks to share a single-use resource, using only shared memory for communication.
In the algorithm, two tasks compete for the critical section. You'll have to look into mutual exclusion, bounded waiting and other properties a bit more for a full understanding, but the gist of it is that in Peterson's method a process waits at most one turn to enter the critical section: if it gives priority to the other task, that task runs its critical section to completion, thereby allowing the first process to enter.
That is the original solution proposed.
However, this has no guarantee of working on today's multiprocessor architectures, and it only works for two concurrent tasks. It gets messy on modern computers because CPUs (and compilers) can execute memory reads and writes out of order, so operations written sequentially may become visible to the other task in a different order, which breaks the algorithm unless that reordering is prevented. I suggest you also take a look at locks. Hope that helps :)
Can anyone else think of anything to add that I might have missed?
It is my understanding that systems with multiple CPUs can run into difficulty because of the ordering of memory reads and writes with respect to other reads and writes in memory, but is this the problem with most modern systems?
No. Any modern systems with "less strict" memory ordering will have ways to make the memory ordering more strict where it matters (e.g. fences).
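To illustrate the point about making the ordering stricter where it matters, here is a minimal sketch of Peterson's algorithm for two threads written with C11 atomics. It relies on the default sequentially consistent ordering of atomic_store/atomic_load instead of explicit fence instructions. The names peterson_lock, peterson_unlock, peterson_worker and peterson_demo are made up for this example, and it assumes a C11 compiler with pthreads available.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Shared state for exactly two threads, with ids 0 and 1. */
static atomic_int interested[2];
static atomic_int turn;
static long shared_counter;        /* only touched inside the critical section */
static int iterations_per_thread;

static void peterson_lock(int self)
{
    int other = 1 - self;
    atomic_store(&interested[self], 1);  /* announce that we want to enter */
    atomic_store(&turn, other);          /* give priority to the other thread */
    /* Sequentially consistent accesses keep the stores above from being
       reordered after these loads, which plain variables would allow. */
    while (atomic_load(&interested[other]) && atomic_load(&turn) == other)
        sched_yield();                   /* be polite while busy-waiting */
}

static void peterson_unlock(int self)
{
    atomic_store(&interested[self], 0);
}

static void *peterson_worker(void *arg)
{
    int self = *(int *)arg;
    for (int i = 0; i < iterations_per_thread; i++) {
        peterson_lock(self);
        shared_counter++;                /* the critical section */
        peterson_unlock(self);
    }
    return NULL;
}

/* Runs two workers and returns the final counter value. */
static long peterson_demo(int iterations)
{
    static int ids[2] = { 0, 1 };
    pthread_t threads[2];

    atomic_store(&interested[0], 0);
    atomic_store(&interested[1], 0);
    atomic_store(&turn, 0);
    shared_counter = 0;
    iterations_per_thread = iterations;

    for (int i = 0; i < 2; i++)
        pthread_create(&threads[i], NULL, peterson_worker, &ids[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(threads[i], NULL);
    return shared_counter;
}
```

If the three shared variables were plain ints, the compiler or CPU would be free to let the loads in the while loop pass the earlier stores, and both threads could end up in the critical section together. Note that this is still a busy-waiting lock limited to two threads, so it is only meant as an illustration.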
Are there any advantages to using semaphores VS mutex locks?
Mutexes are typically simpler and faster (in the same way that a boolean is simpler than a counter); but ignoring overhead a mutex is equivalent to a semaphore with "resource count = 1".
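As a small sketch of that equivalence, the following wraps a POSIX unnamed semaphore with an initial count of 1 so that it behaves like a lock. The type sem_mutex and its helper functions are hypothetical names for this example, and it assumes a platform where sem_init is supported (e.g. Linux).

```c
#include <errno.h>
#include <semaphore.h>
#include <stdbool.h>

/* A lock built from a counting semaphore whose resource count is 1. */
typedef struct {
    sem_t sem;
} sem_mutex;

static int sem_mutex_init(sem_mutex *m)
{
    /* pshared = 0: shared between threads of one process; count = 1 */
    return sem_init(&m->sem, 0, 1);
}

static void sem_mutex_lock(sem_mutex *m)
{
    while (sem_wait(&m->sem) == -1 && errno == EINTR)
        ;   /* retry if the wait was interrupted by a signal */
}

static bool sem_mutex_trylock(sem_mutex *m)
{
    return sem_trywait(&m->sem) == 0;   /* fails when the count is already 0 */
}

static void sem_mutex_unlock(sem_mutex *m)
{
    sem_post(&m->sem);                  /* count goes back to 1 */
}
```

One difference worth keeping in mind: a semaphore has no notion of an owner, so any thread may call sem_post, whereas a mutex is meant to be unlocked by the thread that locked it.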
What could happen if we used Peterson's solution to the critical section problem on a modern computer?
The big problem here is that most modern operating systems support some kind of multi-tasking (e.g. multiple processes, where each process can have multiple threads), there's usually 100 other processes (just for the OS alone), and modern hardware has power management (where you try to avoid power consumption by putting CPUs to sleep when they can't do useful work). This means that (unbounded) spinning/busy waiting is a horrible idea (e.g. you can have N CPUs being wasted spinning/trying to acquire a lock while the task that currently holds the lock isn't running on any CPU because the scheduler decided that 1234 other tasks should get 10 ms of CPU time each).
Instead; to avoid (excessive) spinning you want to ask the scheduler to block your task until/unless the lock actually can be acquired; and (especially for heavily contended locks) you probably want "fairness" (to avoid the risk of timing problems that lead to some tasks being repeatedly lucky while other tasks starve and make no progress).
This ends up being "no spinning", or "brief spinning" (to avoid scheduler overhead in cases where the task holding the lock actually can/does release it quickly); followed by the task being put on a FIFO queue and the scheduler giving the CPU to a different task or putting the CPU to sleep; where if the lock is released the scheduler wakes up the first task on the FIFO queue. Of course it's never that simple (e.g. for performance you want to do as much as you can in user-space; and you need special care and cooperation between user-space and kernel to avoid race conditions, such as the lock being released before a task is put on the wait queue).
Fortunately modern systems also provide simpler ways to implement locks (e.g. "atomic compare and swap"), so there's no need to resort to Peterson's algorithm (even if it's just for insertion/removal of tasks from the real lock's FIFO queue).
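For illustration only, here is a rough sketch of a lock built on atomic compare and swap with the "brief spinning" idea from above. The names cas_lock, cas_trylock and so on, and the SPIN_LIMIT value, are assumptions made up for this example. A real lock would ask the kernel to block the task on a wait queue after the spin phase; this sketch only calls sched_yield, so it is not fair and is not a substitute for a proper mutex.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SPIN_LIMIT 100   /* arbitrary number of tries before yielding */

typedef struct {
    atomic_int held;     /* 0 = free, 1 = held */
} cas_lock;

static void cas_lock_init(cas_lock *l)
{
    atomic_init(&l->held, 0);
}

/* One compare-and-swap attempt: succeeds only if the lock was free. */
static bool cas_trylock(cas_lock *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->held, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

static void cas_lock_acquire(cas_lock *l)
{
    int tries = 0;
    while (!cas_trylock(l)) {
        if (++tries >= SPIN_LIMIT) {
            /* Stand-in for blocking in the kernel: give up the CPU. */
            sched_yield();
            tries = 0;
        }
    }
}

static void cas_unlock(cas_lock *l)
{
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```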

Using many mutex locks

I have a large tree structure on which several threads are working at the same time. Ideally, I would like to have an individual mutex lock for each cell.
I looked at the definition of pthread_mutex_t in bits/pthreadtypes.h and it is fairly short, so the memory usage should not be an issue in my case.
However, is there any performance penalty when using many (let's say a few thousand) different pthread_mutex_ts for only 8 threads?
If you are locking and unlocking very frequently, there can be a penalty, since obtaining and releasing locks does take some time, and can take a fair amount of time if the locks are contended.
When using many locks in a structure like this, you will have to be very specific about what each lock actually locks, and make sure you are careful of AB-BA deadlocks. For example, if you are changing the tree's structure during a locking operation, you will need to lock all the nodes that will be changed, in a consistent order, and make sure that threads working on descendants do not become confused.
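One common way to get a consistent order is to always lock the node with the lower address first. The sketch below shows that idea for two nodes; tree_node, lock_pair, unlock_pair and swap_values are hypothetical names, and a real tree might prefer to order by depth or by key instead.

```c
#include <pthread.h>
#include <stdint.h>

typedef struct tree_node {
    pthread_mutex_t lock;
    int value;
} tree_node;

/* Lock two nodes in address order so that every thread acquires
   them in the same order, which rules out AB-BA deadlocks. */
static void lock_pair(tree_node *a, tree_node *b)
{
    if (a == b) {
        pthread_mutex_lock(&a->lock);
        return;
    }
    if ((uintptr_t)a > (uintptr_t)b) {
        tree_node *tmp = a;
        a = b;
        b = tmp;
    }
    pthread_mutex_lock(&a->lock);
    pthread_mutex_lock(&b->lock);
}

static void unlock_pair(tree_node *a, tree_node *b)
{
    pthread_mutex_unlock(&a->lock);
    if (a != b)
        pthread_mutex_unlock(&b->lock);
}

/* Example operation that needs both nodes locked at once. */
static void swap_values(tree_node *a, tree_node *b)
{
    lock_pair(a, b);
    int tmp = a->value;
    a->value = b->value;
    b->value = tmp;
    unlock_pair(a, b);
}
```

With this, one thread calling swap_values(x, y) and another calling swap_values(y, x) both take the two mutexes in the same order, so neither can end up holding one lock while waiting for the other.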
If you have a very large number of locks, spread out across memory, caching issues could cause performance problems, depending on the architecture, as locking operations will generally invalidate at least some part of the cache.
Your best bet is probably to implement a simple locking structure, then profile it, then refine it to improve performance, if necessary. I'm not sure what you're doing with the tree, but a good place to start might be a single reader-writer lock for the whole tree, if you expect to read much more than you update.
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
-- Donald Knuth
Your locking/access patterns need to be stated in order to evaluate this properly. If each thread only holds one or a few locks at a time, and the probability that two or more threads want the same lock at the same time is low (either a random access pattern, or 8 runners at different positions on a circular track running at roughly the same speed, or other more complicated things), then you will mostly avoid the worst case where a thread has to sleep to get a lock (or in some cases has to get the OS involved to decide who wins), because you have so few threads and so many locks.
If each thread might want hundreds or thousands of locks at any one time then things will start to change.
I won't touch deadlock avoidance because I don't know anything about the container that you are using, but you need to be aware of the need to avoid them.

Is lock free multithreaded programming making anything easier?

I only read a little bit about this topic, but it seems that the only benefit is to get around contention problems but it will not have any important effect on the deadlock problem as the code which is lock free is so small and fundamental (fifos, lifos, hash) that there was never a deadlock problem.
So it's all about performance - is this right?
Lock-free programming is (as far as I can see) always about performance, otherwise using a lock is in most cases much simpler, and therefore preferable.
Note however that with lock-free programming you can end up trading deadlock for live-lock, which is a lot harder to diagnose since no tools that I know of are designed to diagnose it (although I could be wrong there).
I'd say, only go down the path of lock-free if you have to; that is, you have a scenario where you have a heavily contended lock that is hurting your performance. (If it ain't broke, don't fix it).
Couple of issues.
We will soon be facing desktop systems with 64, 128 and 256 cores. Parallelism in this domain is unlike our current experience of 2, 4, 8 cores; the algorithms which run successfully on such small systems will run slower on highly parallel systems due to contention.
In this sense, lock-free is important since it contributes strongly to scalability.
There are also some very specific areas where lock-free is extremely convenient, such as the Windows kernel, where there are modes of execution where sleeps of any kind (such as waits) are forbidden, which obviously is very limiting with regard to data structures, but where lock-free provides a good solution.
Also, lock-free data structures often do not have failure modes; they cannot actually fail, where lock-based data structures can of course fail to obtain their locks. Not having to worry about failures simplifies code.
I've written a library of lock free data structures which I'll be releasing soon. I think if a developer can get hold of a well-proven API, then he can just use it - doesn't matter if it's lock-free or not, he doesn't need to worry about the complexity in the underlying implementation - and that's the way to go.
It's also about scalability. In order to get performance gains these days, you'll have to parallelise the problems you're working on so you can scale them across multiple cores - the more, the merrier.
The traditional way of doing this is by locking data structures that require parallel access, but the more threads you can run truly in parallel, the bigger a bottleneck this becomes.
So yes, it is about performance...
For preemptive threading, threads suspended while holding a lock can block threads that would otherwise be making forward progress. Lock-free doesn't have that problem since by Herlihy's definition, some other thread can always make forward progress.
For non-preemptive threading, it doesn't matter that much since even spin lock based solutions are lock-free by Herlihy's definition.
This is about performance, but also about the ability to handle multi-threaded loads:
locks grant an exclusive access to a portion of code: while a thread has a lock, other threads are spinning (looping while trying to acquire the lock) or blocked, sleeping until the lock is released (which usually happens if spinning lasts too long);
atomic operations grant an exclusive access to a resource (usually a word-sized variable or a pointer) by using uninterruptible intrinsic CPU instructions.
As locks BLOCK other threads' execution, a program is slowed down.
As atomic operations execute serially (one after another), there is no blocking*.
(*) as long as the number of concurrent CPUs trying to access the same resource do not create a bottleneck - but we don't have enough CPU Cores yet to see this as a problem.
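To make the contrast concrete, here is a sketch of what "atomic operations on a pointer" can look like in practice: a stack where push is a compare-and-swap retry loop and the consumer detaches the whole list with one atomic exchange. The names lf_stack, lf_push and lf_pop_all are made up for this example. It deliberately has no single-element pop, because that is where the well-known ABA problem shows up.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct lf_node {
    struct lf_node *next;
    int value;
} lf_node;

typedef struct {
    _Atomic(lf_node *) head;
} lf_stack;

static void lf_stack_init(lf_stack *s)
{
    atomic_init(&s->head, NULL);
}

/* Push without taking a lock: if another thread changed head in the
   meantime, the CAS fails, 'old' is refreshed, and we simply retry. */
static void lf_push(lf_stack *s, lf_node *n)
{
    lf_node *old = atomic_load_explicit(&s->head, memory_order_relaxed);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &s->head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Detach the whole list in one atomic step; the caller then owns it. */
static lf_node *lf_pop_all(lf_stack *s)
{
    return atomic_exchange_explicit(&s->head, (lf_node *)NULL,
                                    memory_order_acquire);
}
```

No thread ever waits for another one here: a failed compare-and-swap only means that some other thread's push succeeded, which is the kind of forward progress that lock-free refers to.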
I have worked on the matter to write a wait-free (lock-free without wait states) Key-Value store for the server I am working on.
Libraries like Tokyo Cabinet (even TC-FIXED, a simple array) rely on locks to preserve the integrity of a database:
"while a writing thread is operating the database, other reading threads and writing threads are blocked" (Tokyo Cabinet documentation)
The results of a test without concurrency (a one-thread test):
SQLite time: 56.4 ms (a B-tree)
TC time: 10.7 ms (a hash table)
TC-FIXED time: 1.3 ms (an array)
G-WAN KV time: 0.4 ms (something new which works, but I am not sure a name is needed)
With concurrency (several threads writing and reading in the same DB), only the G-WAN KV survived the same test because (by contrast with the others) it never ever blocks.
So, yes, this KV store makes things easier for developers since they do not have to care about threading issues. Making it work this way was not trivial, however.
I believe I saw an article that mathematically proved that any algorithm can be written in a wait free manner (which basically means that you can be assured of each thread always making progress towards its goal). This means that it can be applied to any large scale application (after all, a program is just an algorithm with many, many parameters) and because wait free ensures that neither dead/live-lock occurs within it (as long as it doesn't have bugs which preclude it from being truly wait free), it does simplify that side of the program. On the other hand, a mathematical proof is a far cry from actually implementing the code itself (AFAIK, there isn't even a fully lock-free linked list that can run on PCs, I've seen ones that cover most parts, but they usually either can't handle some common functions, or some functions require the structure to be locked).
On a side note, I've also found another proof that showed any lock-free algorithm can actually be considered wait-free due to the laws of probability and various other factors.
Scalability is a really important issue in efficient multi/many-core programming. The greatest limiting factor is actually the section of code that has to be executed serially (see Amdahl's Law). However, contention on locks is also very problematic.
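For reference, Amdahl's Law says that if a fraction p of the work can run in parallel on n cores, the best possible speedup is 1 / ((1 - p) + p / n). A tiny helper makes the limit easy to play with; the function name amdahl_speedup is just for this example.

```c
/* Amdahl's Law: upper bound on speedup when a fraction of the work
   (parallel_fraction, between 0 and 1) is spread over 'cores' cores
   and the rest stays serial. */
static double amdahl_speedup(double parallel_fraction, unsigned cores)
{
    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / (double)cores);
}
```

With 90% of the work parallel, this gives roughly 4.7x on 8 cores and still less than 10x on 256 cores, because the serial 10% dominates.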
Lock-free algorithms address the scalability problem that traditional locks have. So I would say lock-free is mostly about performance, not about reducing the possibility of deadlock.
However, keep in mind that with the current x86 architecture, writing general lock-free algorithms is impossible. This is because we can't atomically exchange data of arbitrary size on current x86 (this is also true for other architectures, except for Sun's ROCK). So current lock-free data structures are quite limited and very specialized for specific uses.
I think current lock-free data structures will no longer be used a decade from now. I strongly expect that a hardware-assisted general lock-free mechanism (yes, that is transactional memory, TM) will be implemented within a decade. If any kind of TM is implemented, though it can't perfectly solve the problems of locks, many problems (including priority inversion and deadlock) will be eliminated. However, implementing TM in hardware is still very challenging, and for x86 only a draft has been proposed so far.
That's still too long, so here is a two-sentence summary.
Lock-free data structures are not a panacea for lock-based multithreaded programming (even TM is not). If you seriously need scalability and have trouble with lock contention, then consider lock-free data structures.

overhead for an empty heap arena

My tools are Linux, gcc and pthreads. When my program calls new/delete from several threads, and when there is contention for the heap, 'arena's are created (see the following link for reference http://www.bozemanpass.com/info/linux/malloc/Linux_Heap_Contention.html). My program runs 24x7, and arenas are still occasionally being created after 2 weeks. I think there may eventually be as many arenas as threads. ps(1) shows alarming memory consumption, but I suspect that only a small portion of it is actually mapped.
What is the 'overhead' for an empty arena? (How much more memory per arena is used than if all allocation was confined to the traditional heap? )
Is there any way to force the creation in advance of n arenas? Is there any way to force the destruction of empty arenas?
struct malloc_state (aka mstate, aka arena descriptor) has the following size:
glibc-2.2: (256+18)*4 bytes =~ 1 KB in 32-bit mode and ~2 KB in 64-bit mode.
glibc-2.3: (256+256/32+11+NFASTBINS)*4 =~ 1.1-1.2 KB in 32-bit mode and 2.4-2.5 KB in 64-bit mode.
See glibc-x.x.x/malloc/malloc.c file, struct malloc_state
Destruction of arenas... I don't know yet, but there is the following text (in brief, it says NO to the possibility of destroying arenas or trimming their memory) in an analysis from 2000 (so a bit outdated): http://www.citi.umich.edu/techreports/reports/citi-tr-00-5.pdf. Please name your glibc version.
Ptmalloc maintains a linked list of subheaps. To reduce lock contention, ptmalloc searches for the first unlocked subheap and grabs memory from it to fulfill a malloc() request. If ptmalloc doesn’t find an unlocked heap, it creates a new one. This is a simple way to grow the number of subheaps as appropriate without adding complicated schemes for hashing on thread or processor ID, or maintaining workload statistics. However, there is no facility to shrink the subheap list and nothing stops the heap list from growing without bound.
from malloc.c (glibc 2.3.5) line 1546
/*
-------------------- Internal data structures --------------------
All internal state is held in an instance of malloc_state defined
below.
...
Beware of lots of tricks that minimize the total bookkeeping space
requirements. **The result is a little over 1K bytes** (for 4byte
pointers and size_t.)
*/
That matches the result I got for 32-bit mode: a little over 1K bytes.
Consider using TCMalloc from google-perftools. It is just better suited for threaded and long-lived applications, and it is very fast.
Take a look at http://goog-perftools.sourceforge.net/doc/tcmalloc.html, especially the graphs (higher is better). In those graphs tcmalloc comes out about twice as good as ptmalloc.
In our application the main cost of multiple arenas has been "dark" memory: memory allocated from the OS that we no longer hold any references to.
The pattern you can see is
Thread X goes to alloc, hits a collision, creates a new arena.
Thread X makes some large allocations.
Thread X makes some small allocation(s).
Thread X stops allocating.
Large allocations are freed. But the whole arena at the high water mark of the last currently active allocation is still using up VMEM, and other threads won't use this arena unless they hit contention in the main arena.
Basically it's a contributor to "memory fragmentation", since there are multiple places memory can be available, but needing to grow an arena is not a reason to look in other arenas. At least I think that's the cause, the point is your application can end up with a bigger VM footprint than you think it should have. This mostly hits you if you have limited swap, since as you say most of this ends up paged out.
Our (memory hungry) application can have 10s of percent of memory "wasted" in this way, and it can really bite in some situations.
I'm not sure why you would want to create empty arenas. If allocations and frees are in the same thread as each other, then I think over time you will tend to all of them being in the same thread-specific arena with no contention. You may have some small blips while you get there, so maybe that's a reason.

Resources