I'm writing a program with a consumer thread and a producer thread. Queue synchronization now seems to be a big overhead in the program, so I looked for lock-free queue implementations, but found only Lamport's version and an improved version from PPoPP '08:
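    /* Producer side: a NULL slot means "free". */
    int enqueue_nonblock(void *data) {
        if (NULL != buffer[head]) {
            return EWOULDBLOCK;   /* queue is full */
        }
        buffer[head] = data;
        head = NEXT(head);
        return 0;
    }

    /* Consumer side: a NULL slot means "nothing to read yet". */
    int dequeue_nonblock(void **data) {
        *data = buffer[tail];
        if (NULL == *data) {
            return EWOULDBLOCK;   /* queue is empty */
        }
        buffer[tail] = NULL;
        tail = NEXT(tail);
        return 0;
    }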
Both versions require a pre-allocated array for the data. My question: is there any single-consumer, single-producer lock-free queue implementation that uses malloc() to allocate space dynamically?
A related question: how can I measure the exact overhead of queue synchronization, such as how much time pthread_mutex_lock() takes?
If you are worried about performance, adding malloc() to the mix won't help things. And if you are not worried about performance, why not simply control access to the queue via a mutex? Have you actually measured the performance of such an implementation? It sounds to me as though you are going down the familiar route of premature optimisation.
The algorithm you show manages to work because although the two threads share the resource (i.e., the queue), they share it in a very particular way. Because only one thread ever alters the head index of the queue (the producer), and only one thread ever alters the tail index (the consumer, of course), you can't get an inconsistent state of the shared object. It's also important that the producer puts the actual data in before updating the head index, and that the consumer reads the data it wants before updating the tail index.
It works as well as it does because the array is quite static; both threads can count on the storage for the elements being there. You probably can't replace the array entirely, but what you can do is change what the array is used for.
I.e., instead of keeping the data in the array, use it to keep pointers to the data. Then you can malloc() and free() the data items, while passing references (pointers) to them between your threads via the array.
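A minimal sketch of that idea, assuming a C11 compiler with <stdatomic.h>: the fixed array holds only pointers, and the pointed-to items are allocated with malloc() by the producer and released with free() by the consumer. The names ptr_ring_t, ring_put, ring_get and send_copy are made up for this illustration, and it is only valid for exactly one producer thread and one consumer thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define RING_SLOTS 8u  /* illustrative capacity; must be a power of two */

typedef struct {
    void *slot[RING_SLOTS];
    atomic_size_t head;  /* next slot to fill; written only by the producer */
    atomic_size_t tail;  /* next slot to drain; written only by the consumer */
} ptr_ring_t;

void ring_init(ptr_ring_t *r) {
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer only. Returns 0 on success, -1 if the ring is full. */
int ring_put(ptr_ring_t *r, void *item) {
    assert(item != NULL);
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return -1;
    r->slot[head % RING_SLOTS] = item;
    /* Release: the slot write above becomes visible before the new head. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}

/* Consumer only. Returns the oldest item, or NULL if the ring is empty. */
void *ring_get(ptr_ring_t *r) {
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return NULL;
    void *item = r->slot[tail % RING_SLOTS];
    /* Release: the slot read above completes before the slot is handed back. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return item;
}

/* Producer helper: copy a string into a malloc'd block and enqueue it.
   The consumer owns the block after ring_get() and must free() it. */
int send_copy(ptr_ring_t *r, const char *text) {
    size_t n = strlen(text) + 1;
    char *copy = malloc(n);
    if (copy == NULL)
        return -1;
    memcpy(copy, text, n);
    if (ring_put(r, copy) != 0) {
        free(copy);
        return -1;
    }
    return 0;
}
```

The ring itself never allocates; only the payloads do, so the cost of malloc() and free() stays outside the part the two threads share.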
Also, POSIX does support reading a nanosecond clock (clock_gettime()), although the actual precision is system dependent. You can read this high-resolution clock before and after the operation and just subtract.
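As a rough sketch of that measurement, assuming CLOCK_MONOTONIC is available: time a large number of lock/unlock pairs and divide. The function name avg_lock_unlock_ns is invented here, and this measures only the uncontended case in a single thread; contended cost will be higher and needs a multi-threaded test.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Difference between two timespec values, in nanoseconds. */
static double elapsed_ns(struct timespec a, struct timespec b) {
    return (double)(b.tv_sec - a.tv_sec) * 1e9 + (double)(b.tv_nsec - a.tv_nsec);
}

/* Average cost of one uncontended lock/unlock pair, in nanoseconds.
   The loop overhead and the clock reads are included in the figure. */
double avg_lock_unlock_ns(long iterations) {
    assert(iterations > 0);
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        pthread_mutex_lock(&m);
        pthread_mutex_unlock(&m);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    pthread_mutex_destroy(&m);
    return elapsed_ns(start, end) / (double)iterations;
}
```

Averaging over many iterations matters because a single lock/unlock pair may be shorter than the clock's real resolution.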
Yes.
There exist a number of lock-free multiple-reader multiple-writer queues.
I have implemented one, by Michael and Scott, from their 1996 paper.
I will (after some more testing) be releasing a small library of lock-free data structures (in C) which will include this queue.
You should look at the FastFlow library.
I recall seeing one that looked interesting a few years ago, though I can't seem to find it now. :( The lock-free implementation that was proposed did require use of a CAS primitive, though even the locking implementation (if you didn't want to use the CAS primitive) had pretty good performance characteristics: the locks only prevented multiple readers or multiple producers from hitting the queue at the same time, and the producer still never raced with the consumer.
I do remember that the fundamental concept behind the queue was to create a linked list that always had one extra "empty" node in it. This extra node meant that the head and the tail pointers of the list would only ever refer to the same data when the list was empty. I wish I could find the paper, I'm not doing the algorithm justice with my explanation...
AH-ha!
I've found someone who transcribed the algorithm without the remainder of the article. This could be a useful starting point.
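The "one extra empty node" idea can be sketched roughly as follows for the single-producer, single-consumer case. This is not the algorithm from the paper mentioned above, only an illustration of the dummy-node technique, assuming C11 atomics; the names lq_t, lq_push and lq_pop are invented. Each element gets its own malloc'd node, so the queue is unbounded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct node {
    void *value;
    _Atomic(struct node *) next;
} node_t;

typedef struct {
    node_t *head;  /* consumer-owned: always points at the dummy node */
    node_t *tail;  /* producer-owned: always points at the last node */
} lq_t;

int lq_init(lq_t *q) {
    node_t *dummy = malloc(sizeof *dummy);
    if (dummy == NULL)
        return -1;
    dummy->value = NULL;
    atomic_init(&dummy->next, NULL);
    q->head = q->tail = dummy;  /* head == tail only when the list is empty */
    return 0;
}

/* Producer only. */
int lq_push(lq_t *q, void *value) {
    node_t *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->value = value;
    atomic_init(&n->next, NULL);
    /* Release: the node's contents are visible before it is linked in. */
    atomic_store_explicit(&q->tail->next, n, memory_order_release);
    q->tail = n;
    return 0;
}

/* Consumer only. Returns 0 and stores the value, or -1 if empty. */
int lq_pop(lq_t *q, void **out) {
    assert(out != NULL);
    node_t *dummy = q->head;
    node_t *next = atomic_load_explicit(&dummy->next, memory_order_acquire);
    if (next == NULL)
        return -1;
    *out = next->value;
    q->head = next;  /* the node just read becomes the new dummy */
    free(dummy);     /* the producer no longer touches the old dummy */
    return 0;
}

/* Call only after both threads are done; frees any remaining nodes. */
void lq_destroy(lq_t *q) {
    node_t *n = q->head;
    while (n != NULL) {
        node_t *next = atomic_load_explicit(&n->next, memory_order_relaxed);
        free(n);
        n = next;
    }
}
```

Because of the dummy node, the producer only ever writes through tail and the consumer only through head, so the two never modify the same pointer.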
I've worked with a fairly simple queue implementation that meets most of your criteria. It used a static, maximum-size pool of bytes, and we implemented messages within that. There was a head pointer that one process would move, and a tail pointer that the other process would move.
Locks were still required, but we used Peterson's 2-Processor Algorithm, which is pretty lightweight since it doesn't involve system calls. The lock is only required for a very small, well-bounded area: a few CPU cycles at most, so you never block for long.
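For reference, Peterson's two-party lock can be sketched like this. It is an illustration rather than the code from that project, and it assumes C11 atomics with the default sequentially consistent ordering, which the algorithm needs on modern CPUs; plain variables would not be safe. The type and function names are invented.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool flag[2];  /* flag[i]: party i wants to enter */
    atomic_int turn;      /* which party must wait if both want in */
} peterson_t;

void peterson_init(peterson_t *p) {
    atomic_init(&p->flag[0], false);
    atomic_init(&p->flag[1], false);
    atomic_init(&p->turn, 0);
}

/* self is 0 for one thread and 1 for the other. */
void peterson_lock(peterson_t *p, int self) {
    assert(self == 0 || self == 1);
    int other = 1 - self;
    atomic_store(&p->flag[self], true);  /* announce intent */
    atomic_store(&p->turn, other);       /* give priority to the other side */
    /* Spin only while the other side wants in and it is its turn. */
    while (atomic_load(&p->flag[other]) && atomic_load(&p->turn) == other) {
        /* busy-wait; the critical section is expected to be tiny */
    }
}

void peterson_unlock(peterson_t *p, int self) {
    assert(self == 0 || self == 1);
    atomic_store(&p->flag[self], false);
}
```

The lock never enters the kernel, which is why it is cheap, but it also only works for exactly two participants with fixed identities.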
I think the allocator can be a performance problem. You can try a custom multithreaded memory allocator that uses a linked list to maintain freed blocks. If your blocks are not (nearly) the same size, you can implement a "buddy system" memory allocator, which is very fast. You have to synchronise your queue (ring buffer) with a mutex.
To avoid too much synchronisation, you can try write/read multiple values to/from the queue at each access.
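One way to read the batching suggestion, as a sketch: take the mutex once and move several values in or out while holding it. The queue type, its capacity and the function names below are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define Q_CAP 64  /* illustrative capacity */

typedef struct {
    pthread_mutex_t lock;
    int items[Q_CAP];
    size_t head;   /* index of the oldest item */
    size_t count;  /* number of stored items */
} batch_queue_t;

void bq_init(batch_queue_t *q) {
    pthread_mutex_init(&q->lock, NULL);
    q->head = 0;
    q->count = 0;
}

/* Enqueue up to n values under a single lock; returns how many fit. */
size_t bq_push_batch(batch_queue_t *q, const int *src, size_t n) {
    assert(src != NULL || n == 0);
    pthread_mutex_lock(&q->lock);
    size_t room = Q_CAP - q->count;
    if (n > room)
        n = room;
    for (size_t i = 0; i < n; i++)
        q->items[(q->head + q->count + i) % Q_CAP] = src[i];
    q->count += n;
    pthread_mutex_unlock(&q->lock);
    return n;
}

/* Dequeue up to max values under a single lock; returns how many were read. */
size_t bq_pop_batch(batch_queue_t *q, int *dst, size_t max) {
    assert(dst != NULL || max == 0);
    pthread_mutex_lock(&q->lock);
    size_t n = q->count < max ? q->count : max;
    for (size_t i = 0; i < n; i++)
        dst[i] = q->items[(q->head + i) % Q_CAP];
    q->head = (q->head + n) % Q_CAP;
    q->count -= n;
    pthread_mutex_unlock(&q->lock);
    return n;
}
```

With a batch of k items the locking cost per item drops to roughly 1/k of the single-item case, at the price of some added latency for the first item in each batch.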
If you still want to use lock-free algorithms, then you must use pre-allocated data or a lock-free allocator.
There is a paper about a lock-free allocator, "Scalable Lock-Free Dynamic Memory Allocation", and an implementation, Streamflow.
Before starting with lock-free stuff, look at: Circular lock-free buffer
Adding malloc would kill any performance gain you may make, and a lock-based structure would be just as effective. This is because malloc requires some sort of CAS lock over the heap, and some forms of malloc have their own lock, so you may end up locking in the memory manager.
To use malloc you would need to pre-allocate all the nodes and manage them with another queue...
Note that you can make some form of expandable array, which would need to lock whenever it is expanded.
Also, while interlocked operations are lock-free on the CPU, they do place a memory lock and block memory for the duration of the instruction, and they often stall the pipeline.
This implementation uses C++'s new and delete which can trivially be ported to the C standard library using malloc and free:
http://www.drdobbs.com/parallel/writing-lock-free-code-a-corrected-queue/210604448?pgno=2
Related
As I have been writing some multi-threaded code for fun, I came up with the following situation:
A thread claims a single resource unit from a memory pool, processes it, and sends a pointer to this data to another thread for further operation using a circular buffer (1R / 1W case).
The latter must inform the former thread whenever it is done with the data it received, so that the memory can be recycled.
I wonder whether it is better, performance-wise, to implement this "Freelist" as another circular buffer holding the addresses of free resources, or to go the lock-free stack way (implementing DCAS on x86-64).
Generally speaking, what could be the pros and cons of the two different approaches?
Just in case: there is a difference between lock-free and wait-free. Lock-free means there is no locking and at least one thread always makes progress, but an individual thread can still busy-spin (for example, retrying a CAS) without making progress itself. Wait-free means that every thread always makes progress, with no locking or busy-spinning.
With one reader and one writer, a lock-free and wait-free FIFO circular buffer is trivial to implement.
I hear that a LIFO stack can also be made wait-free, but I'm not so sure about a FIFO list. And it sounds like you need a queue here rather than a stack.
The main difference is the circular buffer will be bounded, while the stack will not.
It's hard to make a performance judgement on things like this without testing. On the one hand, the circular buffer is backed by a contiguous array. If the reader and writer indices remain "near" each other, you'll have each thread constantly invalidating a shared cache line.
On the other hand, with a stack you can have contention for the top-of-stack pointer, resulting in threads sometimes spinning in the CAS loop.
My guess would be that the best choice is workload-dependent.
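To make the stack option concrete, here is a sketch of a CAS-based free list over pool indices. Instead of DCAS on a pointer pair, it packs a 32-bit index and a 32-bit change counter into one 64-bit atomic word, which is one common way to sidestep the ABA problem; that substitution, the pool size and all names are assumptions of this example, not something from the question.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define POOL_SIZE 8u       /* illustrative number of pool entries */
#define NIL UINT32_MAX     /* "no entry" marker */

typedef struct {
    _Atomic uint64_t head;            /* high 32 bits: tag, low 32 bits: top index */
    _Atomic uint32_t next[POOL_SIZE]; /* next[i]: entry below i on the stack */
} freelist_t;

static uint64_t pack(uint32_t tag, uint32_t idx) {
    return ((uint64_t)tag << 32) | idx;
}

/* Start with every pool entry on the free list: 0 on top, then 1, 2, ... */
void fl_init(freelist_t *f) {
    for (uint32_t i = 0; i < POOL_SIZE; i++)
        atomic_init(&f->next[i], i + 1 < POOL_SIZE ? i + 1 : NIL);
    atomic_init(&f->head, pack(0, 0));
}

/* Take a free entry; returns its index, or NIL if none are free. */
uint32_t fl_pop(freelist_t *f) {
    uint64_t old = atomic_load_explicit(&f->head, memory_order_acquire);
    for (;;) {
        uint32_t idx = (uint32_t)old;
        if (idx == NIL)
            return NIL;
        uint32_t below = atomic_load_explicit(&f->next[idx], memory_order_relaxed);
        uint64_t desired = pack((uint32_t)(old >> 32) + 1, below);
        /* The tag changes on every update, so a stale 'old' cannot match. */
        if (atomic_compare_exchange_weak_explicit(&f->head, &old, desired,
                memory_order_acq_rel, memory_order_acquire))
            return idx;
    }
}

/* Return an entry to the free list. */
void fl_push(freelist_t *f, uint32_t idx) {
    assert(idx < POOL_SIZE);
    uint64_t old = atomic_load_explicit(&f->head, memory_order_relaxed);
    for (;;) {
        atomic_store_explicit(&f->next[idx], (uint32_t)old, memory_order_relaxed);
        uint64_t desired = pack((uint32_t)(old >> 32) + 1, idx);
        if (atomic_compare_exchange_weak_explicit(&f->head, &old, desired,
                memory_order_release, memory_order_relaxed))
            return;
    }
}
```

The CAS loops are where the spinning mentioned above happens: under contention a thread may retry, whereas the 1R/1W ring never retries but is bounded in size.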
I have a shared memory pool from which many different threads may request an allocation. Requesting an allocation from this will occur a LOT in every thread, however the number of threads is likely to be small, often with only 1 thread running. I am not sure which of the following ways to handle this is better.
Ultimately I may need to implement both and see which produces more favorable results... I also fear that even thinking of #2 may be premature optimization at this point as I don't actually have the code that uses this shared resource written yet. But the problem is so darn interesting that it continues to distract me from the other work.
1) Create a mutex and have a thread attempt to lock it before obtaining the allocation, then unlocking it.
2) Have each thread register a request slot. When it needs an allocation, it puts the request in the slot, then blocks (while (result == NULL) { usleep() }) waiting for the request slot to have a result. A single thread continuously iterates over the request slots, making the allocations and assigning them to the result in the request slot.
Number 1 is the simple solution, but a single thread could potentially hog the lock if the timing is right. The second is more complex, but ensures fairness among threads when pulling from the resource. However it still blocks the requesting threads, and if there are many threads the iteration could burn cycles without doing any actual allocations until it finds a request to fulfill.
NOTE: C on Linux using pthreads
Solution 2 is bogus. It's an ugly hack and it does not ensure memory synchronization.
I would say go with solution 1, but I'm a little bit skeptical of the fact that you mentioned "memory pool" to begin with. Are you just trying to allocate memory, or is there some other resource you're managing (e.g. slots in some special kind of memory, memory-mapped file, textures in video memory, etc.)?
If you are just allocating memory, then you're completely right to be worried about premature optimization. The whole problem is premature optimization, and the system malloc will do as well as or better than your memory pool will do. (Or if your code will be running on one of the few systems with a pathologically broken malloc like some video game consoles, just drop in a replacement only on those known-broken systems.)
If you really do have a special resource you need to manage, start with solution 1 and see how it works. If you have problems, you might find you can improve it with a condition variable where the resource manager notifies you when a slot can be allocated, but I really doubt this will be necessary.
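A minimal sketch of solution 1 with the optional condition variable, assuming the managed resource can be identified by a small integer slot id. The pool size and the names slot_pool_t, pool_acquire, pool_try_acquire and pool_release are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define POOL_SLOTS 4  /* illustrative number of managed slots */

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t available;    /* signalled whenever a slot is released */
    int free_ids[POOL_SLOTS];    /* stack of currently free slot ids */
    int free_count;
} slot_pool_t;

void pool_init(slot_pool_t *p) {
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->available, NULL);
    for (int i = 0; i < POOL_SLOTS; i++)
        p->free_ids[i] = i;
    p->free_count = POOL_SLOTS;
}

/* Non-blocking: returns a slot id, or -1 if none is free right now. */
int pool_try_acquire(slot_pool_t *p) {
    int id = -1;
    pthread_mutex_lock(&p->lock);
    if (p->free_count > 0)
        id = p->free_ids[--p->free_count];
    pthread_mutex_unlock(&p->lock);
    return id;
}

/* Blocking: sleeps on the condition variable until a slot is free. */
int pool_acquire(slot_pool_t *p) {
    pthread_mutex_lock(&p->lock);
    while (p->free_count == 0)
        pthread_cond_wait(&p->available, &p->lock);
    int id = p->free_ids[--p->free_count];
    pthread_mutex_unlock(&p->lock);
    return id;
}

void pool_release(slot_pool_t *p, int id) {
    assert(id >= 0 && id < POOL_SLOTS);
    pthread_mutex_lock(&p->lock);
    p->free_ids[p->free_count++] = id;
    pthread_cond_signal(&p->available);
    pthread_mutex_unlock(&p->lock);
}
```

Unlike the usleep() polling loop in solution 2, the mutex and condition variable also provide the memory synchronization between the releasing and the acquiring thread.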
I'd like to implement a lockless single-producer, single-consumer circular queue between two pthreads; in C, on ARM Linux.
The queue will hold bytes, the producer will memcpy() things in, and the consumer will write() them out to file.
Is it naive to think I can store head and tail offsets in ints and everything will just work?
I am wondering about such things as compiler optimisations meaning my head/tail writes sit in registers and are not visible to the other thread, or needing a memory barrier somewhere.
The memory consistency model of pthreads doesn't offer you any assistance for building lockless algorithms. You're on your own - you will have to use whatever atomic instructions and memory barriers are provided and required by your architecture. You'll also have to consult your compiler documentation to determine how to request a compiler barrier.
You are almost certainly better off using a normal queue implementation protected by a mutex and condition variable - if the queue simply stores pointers to the buffers that are being written out to file (rather than the data itself), then lock contention shouldn't be a problem, as the lock will only have to be held while a pointer is added or removed from the queue.
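If a lockless byte queue is still wanted, one hedged way to get the required barriers without architecture-specific code is C11's <stdatomic.h>, assuming the toolchain supports it. The sketch below is a single-producer, single-consumer byte ring; the buffer size and the names byte_ring_t, rb_write and rb_read are invented for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define RB_SIZE 16u  /* tiny for illustration; must be a power of two */

typedef struct {
    unsigned char buf[RB_SIZE];
    atomic_size_t head;  /* total bytes written; updated only by the producer */
    atomic_size_t tail;  /* total bytes read; updated only by the consumer */
} byte_ring_t;

void rb_init(byte_ring_t *r) {
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer only: copies in as many bytes as fit and returns that count. */
size_t rb_write(byte_ring_t *r, const void *src, size_t len) {
    assert(src != NULL);
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    size_t room = RB_SIZE - (head - tail);
    if (len > room)
        len = room;
    size_t off = head & (RB_SIZE - 1);
    size_t first = RB_SIZE - off;  /* bytes before the wrap point */
    if (first > len)
        first = len;
    memcpy(r->buf + off, src, first);
    memcpy(r->buf, (const unsigned char *)src + first, len - first);
    /* Release: the copied bytes are visible before the new head. */
    atomic_store_explicit(&r->head, head + len, memory_order_release);
    return len;
}

/* Consumer only: copies out up to len bytes and returns that count. */
size_t rb_read(byte_ring_t *r, void *dst, size_t len) {
    assert(dst != NULL);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    size_t avail = head - tail;
    if (len > avail)
        len = avail;
    size_t off = tail & (RB_SIZE - 1);
    size_t first = RB_SIZE - off;
    if (first > len)
        first = len;
    memcpy(dst, r->buf + off, first);
    memcpy((unsigned char *)dst + first, r->buf, len - first);
    /* Release: the reads finish before the space is handed back. */
    atomic_store_explicit(&r->tail, tail + len, memory_order_release);
    return len;
}
```

The atomic loads and stores address both worries in the question: the compiler cannot keep head and tail in registers, and the acquire/release pairs order the memcpy() against the index updates.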
I'm starting development of a series of image processing algorithms, some of them with intensive use of queues. Do you guys know a good benchmark for those data structures?
To narrow the scope, I'm using C mostly, but I can use C++, stl and any library.
I've got a few hits on data structure libraries, such as GLib and C-Generic-Library, and of course the containers of STL. Also, if any of you developed/know a faster queue than those, please advise :)
Also, the queue will have lots of enqueues and dequeues operations, so it better have a smart way to manage memory.
For a single threaded application you can often get around having to use any type of queue at all simply by processing the next item as it comes in, but there are many applications where this isn't the case (queuing up data for output, for instance).
Without the need to lock the queue (no other threads to worry about) a simple circular buffer is going to be hard to beat for performance. If for some reason the queue needs to grow after creation this is a little bit more difficult, but you shouldn't have a hard time finding a circular buffer queue implementation (or building your own). If either inserting or extracting is done in a signal handler (or interrupt service routine) then you may actually need to protect the read and/or write position indexes, but if you know your target well you may be able to determine that this is not the case (when in doubt protect, though). Protection would be by temporarily blocking the signals or interrupts that could put things in your queue. (You would really need to block these if you needed to resize the queue.)
If whatever you are putting in the queue has to be dynamically allocated anyway then you might want to just tack on a pointer and turn the thing into a list node. A singly linked list where the list master holds a pointer to the head and the last node is sufficient. Extract from the head and insert at the tail. Here protecting the inserts and extractions from race conditions is pretty much independent and you only need to worry about things when the length of the list is very low. If you truly do have a single threaded application then you don't have to worry about it at all.
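For the single-threaded case, a growable circular buffer might look like the sketch below. It holds ints for simplicity and doubles its storage when full, copying the elements out in queue order so the wrap-around is undone; the names are made up, and nothing here is safe against signal handlers or other threads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    int *items;
    size_t cap;    /* allocated slots */
    size_t head;   /* index of the oldest element */
    size_t count;  /* number of stored elements */
} ring_t;

int ring_init(ring_t *r, size_t cap) {
    assert(cap > 0);
    r->items = malloc(cap * sizeof *r->items);
    if (r->items == NULL)
        return -1;
    r->cap = cap;
    r->head = 0;
    r->count = 0;
    return 0;
}

/* Double the capacity, copying elements so the oldest lands at index 0. */
static int ring_grow(ring_t *r) {
    size_t new_cap = r->cap * 2;
    int *bigger = malloc(new_cap * sizeof *bigger);
    if (bigger == NULL)
        return -1;
    for (size_t i = 0; i < r->count; i++)
        bigger[i] = r->items[(r->head + i) % r->cap];
    free(r->items);
    r->items = bigger;
    r->cap = new_cap;
    r->head = 0;
    return 0;
}

int ring_push(ring_t *r, int value) {
    if (r->count == r->cap && ring_grow(r) != 0)
        return -1;
    r->items[(r->head + r->count) % r->cap] = value;
    r->count++;
    return 0;
}

int ring_pop(ring_t *r, int *out) {
    assert(out != NULL);
    if (r->count == 0)
        return -1;
    *out = r->items[r->head];
    r->head = (r->head + 1) % r->cap;
    r->count--;
    return 0;
}

void ring_free(ring_t *r) {
    free(r->items);
    r->items = NULL;
    r->cap = r->head = r->count = 0;
}
```

Push and pop stay O(1) except when a grow happens, and doubling keeps the amortized cost of growth constant per element.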
I don't have any actual benchmarks and can't make any suggestions about any library implementations, but both methods are O(1) for both insert and extract. The first is more cache (and memory pager) friendly unless your queue size is much larger than it needs to be. The second method is less cache friendly since each member of the queue can be in a different area of RAM.
Hope this helps you evaluate or create your own queue.
As far as I know each thread gets a distinct stack when the thread is created by the operating system. I wonder if each thread has a heap distinct to itself also?
No. All threads share a common heap.
Each thread has a private stack, which it can quickly add and remove items from. This makes stack based memory fast, but if you use too much stack memory, as occurs in infinite recursion, you will get a stack overflow.
Since all threads share the same heap, access to the allocator/deallocator must be synchronized. There are various methods and libraries for avoiding allocator contention.
Some languages allow you to create private pools of memory, or individual heaps, which you can assign to a single thread.
By default, C has only a single heap.
That said, some allocators that are thread aware will partition the heap so that each thread has its own area to allocate from. The idea is that this should make the heap scale better.
One example of such a heap is Hoard.
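To illustrate the per-thread-area idea in miniature (this is not how Hoard or any particular allocator is implemented), here is a toy bump allocator whose storage is declared with C11 _Thread_local, so each thread allocates from its own region without any locking. The arena size and function names are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define ARENA_BYTES 4096u  /* illustrative per-thread arena size */

/* Each thread gets its own copy of these two objects. */
static _Thread_local _Alignas(max_align_t) unsigned char arena[ARENA_BYTES];
static _Thread_local size_t arena_used;

/* Allocate n bytes from the calling thread's arena, rounded up to 16.
   Returns NULL when the arena is exhausted. No lock is needed because
   no other thread ever updates this thread's arena_used. */
void *arena_alloc(size_t n) {
    assert(arena_used <= ARENA_BYTES);
    size_t rounded = (n + 15u) & ~(size_t)15u;
    if (rounded < n || rounded > ARENA_BYTES - arena_used)
        return NULL;
    void *p = arena + arena_used;
    arena_used += rounded;
    return p;
}

/* Discard everything the calling thread has allocated from its arena. */
void arena_reset(void) {
    arena_used = 0;
}
```

The returned pointers are still ordinary addresses in the shared address space, so another thread can read the memory if handed a pointer; only the allocation bookkeeping is private.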
Depends on the OS. The standard C runtime on Windows and Unix systems uses a heap shared across threads. This means locking on every malloc/free.
On Symbian, for example, each thread comes with its own heap, although threads can share pointers to data allocated in any heap. Symbian's design is better in my opinion since it not only eliminates the need for locking during alloc/free, but also encourages clean specification of data ownership among threads. Also in that case when a thread dies, it takes all the objects it allocated along with it - i.e. it cannot leak objects that it has allocated, which is an important property to have in mobile devices with constrained memory.
Erlang also follows a similar design where a "process" acts as a unit of garbage collection. All data is communicated between processes by copying, except for binary blobs which are reference counted (I think).
Each thread has its own stack and call stack.
Each thread shares the same heap.
It depends on what exactly you mean when saying "heap".
All threads share the address space, so heap-allocated objects are accessible from all threads. Technically, stacks are shared as well in this sense, i.e. nothing prevents you from accessing other thread's stack (though it would almost never make any sense to do so).
On the other hand, there are heap structures used to allocate memory. That is where all the bookkeeping for heap memory allocation is done. These structures are sophisticatedly organized to minimize contention between the threads - so some threads might share a heap structure (an arena), and some might use distinct arenas.
See the following thread for an excellent explanation of the details: How does malloc work in a multithreaded environment?
Typically, threads share the heap and other resources, however there are thread-like constructions that don't. Among these thread-like constructions are Erlang's lightweight processes, and UNIX's full-on processes (created with a call to fork()). You might also be working on multi-machine concurrency, in which case your inter-thread communication options are considerably more limited.
Generally speaking, all threads use the same address space and therefore usually have just one heap.
However, it can be a bit more complicated. You might be looking for Thread Local Storage (TLS), but it stores single values only.
Windows-Specific:
TLS-space can be allocated using TlsAlloc and freed using TlsFree (Overview here). Again, it's not a heap, just DWORDs.
Strangely, Windows supports multiple heaps per process. One can store a heap's handle in TLS, which gives something like a "thread-local heap". However, only the handle is hidden from the other threads; they can still access the heap's memory through pointers, since it is all the same address space.
EDIT: Some memory allocators (specifically jemalloc on FreeBSD) use TLS to assign "arenas" to threads. This is done to optimize allocation for multiple cores by reducing synchronization overhead.
On the FreeRTOS operating system, tasks (threads) share the same heap, but each one has its own stack. This comes in very handy on low-power, low-RAM architectures, because the same pool of memory can be shared by several threads. It comes with a small catch: the developer needs a mechanism for synchronizing malloc and free, so some type of lock, for example a semaphore or a mutex, is necessary when allocating or freeing memory on the heap.