Lockless circular queue with pthreads. Anything to watch out for? - c

I'd like to implement a lockless single-producer, single-consumer circular queue between two pthreads; in C, on ARM Linux.
The queue will hold bytes, the producer will memcpy() things in, and the consumer will write() them out to file.
Is it naive to think I can store head and tail offsets in ints and everything will just work?
I am wondering about things such as compiler optimisations keeping my head/tail writes in registers so they are not visible to the other thread, or whether I need a memory barrier somewhere.

The memory consistency model of pthreads doesn't offer you any assistance for building lockless algorithms. You're on your own - you will have to use whatever atomic instructions and memory barriers are provided and required by your architecture. You'll also have to consult your compiler documentation to determine how to request a compiler barrier.
You are almost certainly better off using a normal queue implementation protected by a mutex and condition variable - if the queue simply stores pointers to the buffers that are being written out to file (rather than the data itself), then lock contention shouldn't be a problem, as the lock will only have to be held while a pointer is added or removed from the queue.
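For illustration, here is a minimal sketch of that approach (the names are mine, not from any particular library): a fixed-size ring of buffer pointers guarded by a mutex, with a condition variable for the consumer to sleep on. The lock is held only while a pointer is added or removed, never during the write() itself.

    #include <pthread.h>
    #include <stddef.h>

    #define QCAP 64

    typedef struct {
        void           *items[QCAP];      /* pointers to buffers, not the data */
        size_t          head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t  not_empty;
    } ptr_queue;

    /* Producer side. For brevity this sketch assumes the queue never
     * fills; a real version would also wait on a not_full condition. */
    void ptr_queue_push(ptr_queue *q, void *buf)
    {
        pthread_mutex_lock(&q->lock);
        q->items[q->head] = buf;
        q->head = (q->head + 1) % QCAP;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    /* Consumer side: sleeps until the producer signals. */
    void *ptr_queue_pop(ptr_queue *q)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)
            pthread_cond_wait(&q->not_empty, &q->lock);
        void *buf = q->items[q->tail];
        q->tail = (q->tail + 1) % QCAP;
        q->count--;
        pthread_mutex_unlock(&q->lock);
        return buf;
    }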

Related

Is memcpy() a sleeping function?

I would like to copy the contents of an array without using a for loop. The copy is made while holding a spinlock.
Is there any chance that memcpy() can sleep?
Things that might happen with memcpy (or really with any memory access in general):
If part of the source or destination is inaccessible (invalid) memory, memcpy could crash your process, which might leave a shared spinlock in a bad state.
If part of the source memory needs to be paged in, memcpy can block while the kernel grabs the memory for you.
If part of the source or destination is memory-mapped to I/O, memcpy might block while the kernel performs that I/O. (In extreme cases, like memory-mapped network files, memcpy might block indefinitely).
The kernel is also free to swap your process out at any point during the copy, which means the copy could take arbitrarily long to actually complete.
However, memcpy does not do anything that a regular memory access wouldn't do. So, using it with a spinlock should be safe (as safe as accessing the memory normally would be, anyway).
I detect some inconsistency in your question; let me explain myself.
A spinlock (or busy lock in general) keeps the waiting process or thread on the CPU rather than yielding it to another. This means a very fast unlock-and-resume handover when the lock is freed, but a very expensive model for long wait times.
That said, if you are using a spinlock, it should be because the loop that checks whether the lock has been freed executes no more than three or four times; otherwise, CPU time is wasted checking over and over whether the lock has been released.
This completely discourages blocking operations like the one you ask about (a memory copy rarely has to deal with a non-resident memory page, but when it does, your spinlock will go into a loop of millions of checks).
Spinlocks were designed to protect very small chunks of memory, where access means at most two or three memory accesses. In that case, a spinlock solves the problem, as putting a thread to sleep and rescheduling it would be millions of times slower than the spin. But this is in clear opposition to the use of the memcpy(3) function, which is a general copy routine that allows large memory copies in one shot. The time the resource stays locked for one thread can mean millions of checks by another thread (on a different core, which is the other reason to use a spinlock: a different core only has to wait two or three accesses to see the lock released).
In my opinion, the only use a spinlock has is to protect a semaphore's counter, or the internals of a condition variable or mutex, never to guard a general memory copy or a large resource. In those cases it is better to use a normal, sleeping lock. If you plan to use memcpy(3), the only thing I can assume is that you are using the lock to protect a large amount of memory while it is being copied; that is better handled with a semaphore or a mutex.
In modern kernels, waking a process is so efficient that it makes user-mode spinlocks almost unusable.
In conclusion, my guess is that it is not the use of memcpy() inside the protected region you should reconsider, but the use of a spinlock to do the protection at all. In most cases it will be a waste of resources and will make your system heavier and slower.

Thread (and IRQ) safe dynamic memory handler in C

I'm looking for hints on using a dynamic memory handler safely in a multi-threaded system. Details of the issue:
written in C, it will run on a Cortex-M3 processor, with an RTOS (CooCox OS),
the TLSF memory allocator will be used (other allocators might be used if I find them better suited, as long as they are free and open-source),
The solution I'm looking for is making the memory allocator safe to call from both OS tasks and interrupts.
So far I have thought of two possible approaches, each with a few details that are still unknown to me:
disable and enable interrupts when calling allocator functions. Problem: if I'm not mistaken, I can't play with disabling and enabling interrupts in normal mode, only in privileged mode (so, if I'm not mistaken, only in interrupts), and I need to do it from normal runtime as well, to prevent interrupts and task switching during memory handler operations.
call the allocator from an SWI. This one is still very unclear to me. First, is SWI the same as FIQ (and if so, is it true that FIQ code needs to be written in assembly, since the allocator is written in C)? I also still have a few doubts about calling FIQ from IRQ (that scenario would happen, though not often), but most likely this part will not cause issues.
So any ideas on possible solutions for this situation?
Regarding your suggestions 1 and 2:
On Cortex-M3 you can enable and disable interrupts at any time in privileged-level code through the CMSIS intrinsics __disable_irq()/__enable_irq(). Privileged level is not restricted to handler mode; thread-mode code can run at privileged level too (and in many small RTOSes that is the default).
SWI and FIQ are concepts from legacy ARM architectures. They do not exist in Cortex-M3.
You would not ideally want to perform memory allocation in an interrupt handler; even if the allocator is deterministic, it may still take a significant amount of time, and I can think of few reasons you would want to do that.
The best approach is to modify the tlsf code to use an RTOS mutex for each of the calls with external linkage. Other libraries I have used have stubs already in the library that normally do nothing, but which you can override with your own implementation to map it to any RTOS.
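A rough sketch of that approach, assuming the allocator exposes tlsf_malloc()/tlsf_free() as in common TLSF distributions, and using placeholder rtos_mutex_* calls that you would map to your RTOS's actual mutex API:

    #include <stddef.h>

    /* Placeholder RTOS mutex API: substitute your RTOS's real calls. */
    extern void rtos_mutex_lock(void);
    extern void rtos_mutex_unlock(void);

    /* Assumed to come from the TLSF distribution. */
    extern void *tlsf_malloc(size_t size);
    extern void  tlsf_free(void *ptr);

    void *safe_malloc(size_t size)
    {
        rtos_mutex_lock();               /* serialise every allocator entry point */
        void *p = tlsf_malloc(size);
        rtos_mutex_unlock();
        return p;
    }

    void safe_free(void *ptr)
    {
        rtos_mutex_lock();
        tlsf_free(ptr);
        rtos_mutex_unlock();
    }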
Now you cannot of course use a mutex in an ISR, but as I said you should probably not allocate memory there either. If you really must perform allocation in an interrupt handler, then enabling/disabling interrupts is your only option, but you are then undermining the real-time deterministic behaviour that an RTOS provides. A better solution is to have your ISR do no more than signal an event flag or semaphore to a thread-context handler. This allows you to use all RTOS services and scheduling, and the context-switch time from ISR to a high-priority thread will be insignificant compared to the memory allocation time.
Another possibility would be to not use this allocator at all, but instead use a fixed-block allocator using RTOS queues. You pre-allocate blocks of memory (statically or dynamically), post pointers to the start of each block onto a queue, then to allocate you simply receive a pointer from the queue, and to free you post back to the queue. If memory is exhausted (queue is empty), you can baulk or block on the queue (do not block in an ISR though). You can create multiple queues for different sized blocks, and use the one appropriate to your needs (ensuring you post back to the same queue of course!)
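A minimal sketch of that fixed-block scheme, again with placeholder rtos_queue_* calls standing in for whatever queue primitives your RTOS provides:

    /* Placeholder RTOS queue API: map to your RTOS's real primitives. */
    typedef struct rtos_queue rtos_queue_t;            /* opaque */
    extern rtos_queue_t *rtos_queue_create(int capacity);
    extern void  rtos_queue_post(rtos_queue_t *q, void *item);
    extern void *rtos_queue_receive(rtos_queue_t *q);  /* blocks while empty */

    #define BLOCK_SIZE 128
    #define NUM_BLOCKS 32

    static unsigned char  pool[NUM_BLOCKS][BLOCK_SIZE]; /* statically pre-allocated */
    static rtos_queue_t  *free_blocks;

    void pool_init(void)
    {
        free_blocks = rtos_queue_create(NUM_BLOCKS);
        for (int i = 0; i < NUM_BLOCKS; i++)
            rtos_queue_post(free_blocks, pool[i]);      /* seed with every block */
    }

    /* Allocate: receive a free block's pointer (do not block in an ISR). */
    void *block_alloc(void)
    {
        return rtos_queue_receive(free_blocks);
    }

    /* Free: post the pointer back to the same queue. */
    void block_free(void *p)
    {
        rtos_queue_post(free_blocks, p);
    }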

Circular buffer vs. lock-free stack to implement a free list

As I have been writing some multi-threaded code for fun, I came up with the following situation:
a thread claims a single resource unit from a memory pool, it processes it and sends a pointer to this data to another thread for further operation using a circular buffer (1R / 1W case).
The latter must inform the former thread whenever it is done with the data it received, so that the memory can be recycled.
I wonder whether it is better - performance-wise - to implement this "Freelist" as another circular buffer - holding the addresses of free resources - or choose the lock-free stack way (implementing DCAS on x86-64).
Generally speaking, what could be the pros and the cons of the two different approaches ?
Just in case: there is a difference between lock-free and wait-free. The former means there is no locking, but a thread can still busy-spin without making progress. The latter means that every thread always makes progress, with no locking or busy-spinning.
With one reader and one writer, a lock-free and wait-free FIFO circular buffer is trivial to implement.
I hear that a LIFO stack can also be made wait-free, but I'm not so sure about a FIFO list. And it sounds like you need a queue here rather than a stack.
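For reference, a minimal sketch of such a single-producer/single-consumer ring using C11 atomics; the names are mine, and the capacity must be a power of two so the index masking works:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define CAP 1024                       /* must be a power of two */

    typedef struct {
        void *slots[CAP];
        _Atomic size_t head;               /* written only by the producer */
        _Atomic size_t tail;               /* written only by the consumer */
    } spsc_ring;

    bool spsc_push(spsc_ring *q, void *item)
    {
        size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
        if (head - tail == CAP)
            return false;                  /* full */
        q->slots[head & (CAP - 1)] = item;
        /* release: the slot write must become visible before the new head */
        atomic_store_explicit(&q->head, head + 1, memory_order_release);
        return true;
    }

    bool spsc_pop(spsc_ring *q, void **item)
    {
        size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
        if (head == tail)
            return false;                  /* empty */
        *item = q->slots[tail & (CAP - 1)];
        atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
        return true;
    }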
The main difference is the circular buffer will be bounded, while the stack will not.
It's hard to make a performance judgement on things like this without testing. On the one hand, the circular buffer is backed by a contiguous array. If the reader and writer indices remain "near" each other, you'll have each thread constantly invalidating a shared cache line.
On the other hand, with a stack you can have contention for the top-of-stack pointer, resulting in threads sometimes spinning in the CAS loop.
My guess would be that the best choice is workload-dependent.

Queues implementation benchmark

I'm starting development of a series of image processing algorithms, some of them with intensive use of queues. Do you guys know a good benchmark for those data structures?
To narrow the scope, I'm using C mostly, but I can use C++, stl and any library.
I've got a few hits on data structure libraries, such as GLib and C-Generic-Library, and of course the containers of STL. Also, if any of you developed/know a faster queue than those, please advise :)
Also, the queue will see lots of enqueue and dequeue operations, so it had better have a smart way to manage memory.
For a single threaded application you can often get around having to use any type of queue at all simply by processing the next item as it comes in, but there are many applications where this isn't the case (queuing up data for output, for instance).
Without the need to lock the queue (no other threads to worry about) a simple circular buffer is going to be hard to beat for performance. If for some reason the queue needs to grow after creation this is a little bit more difficult, but you shouldn't have a hard time finding a circular buffer queue implementation (or building your own). If either inserting or extracting are done in a signal handler (or interrupt service routine) then you may actually need to protect the read and/or write position indexes, but if you know your target well you may be able to determine that this is not the case (when in doubt protect, though). Protection would be by either temporarily blocking the signals or interrupts that could put things in your queue. (You would really need to block this if you were to need to resize the queue)
If whatever you are putting in the queue has to be dynamically allocated anyway, then you might want to just tack on a pointer and turn the thing into a list node. A singly linked list, where the list header holds a pointer to the head node and to the last node, is sufficient. Extract from the head and insert at the tail. Here, protecting the inserts and extractions from race conditions is pretty much independent, and you only need to worry about things when the length of the list is very low. If you truly do have a single-threaded application then you don't have to worry about it at all.
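A minimal single-threaded sketch of that linked-node approach (names are mine):

    #include <stddef.h>

    /* The payload carries its own link, so no separate allocation is
     * needed for queue bookkeeping. */
    typedef struct node {
        struct node *next;
        /* ...payload fields... */
    } node;

    typedef struct {
        node *head;   /* extract here */
        node *tail;   /* insert here */
    } list_queue;

    void lq_push(list_queue *q, node *n)
    {
        n->next = NULL;
        if (q->tail)
            q->tail->next = n;
        else
            q->head = n;          /* queue was empty */
        q->tail = n;
    }

    node *lq_pop(list_queue *q)
    {
        node *n = q->head;
        if (n) {
            q->head = n->next;
            if (!q->head)
                q->tail = NULL;   /* queue is now empty */
        }
        return n;
    }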
I don't have any actual benchmarks and can't make any suggestions about any library implementations, but both methods are O(1) for both insert and extract. The first is more cache (and memory pager) friendly unless your queue size is much larger than it needs to be. The second method is less cache friendly since each member of the queue can be in a different area of RAM.
Hope this helps you evaluate or create your own queue.

Any single-consumer single-producer lock free queue implementation in C?

I'm writing a program with a consumer thread and a producer thread, and now it seems queue synchronization is a big overhead in the program. I looked for some lock-free queue implementations, but only found Lamport's version and an improved version from PPoPP '08:
enqueue_nonblock(data) {
    if (NULL != buffer[head]) {
        return EWOULDBLOCK;      /* slot still occupied: queue is full */
    }
    buffer[head] = data;         /* write the data first... */
    head = NEXT(head);           /* ...then advance the producer index */
    return 0;
}

dequeue_nonblock(data) {
    data = buffer[tail];
    if (NULL == data) {
        return EWOULDBLOCK;      /* empty slot: queue is empty */
    }
    buffer[tail] = NULL;         /* mark the slot free... */
    tail = NEXT(tail);           /* ...then advance the consumer index */
    return 0;
}
Both versions require a pre-allocated array for the data; my question is: is there any single-consumer, single-producer lock-free queue implementation which uses malloc() to allocate space dynamically?
And another related question is, how can I measure exact overhead in queue synchronization? Such as how much time it takes of pthread_mutex_lock(), etc.
If you are worried about performance, adding malloc() to the mix won't help things. And if you are not worried about performance, why not simply control access to the queue via a mutex? Have you actually measured the performance of such an implementation? It sounds to me as though you are going down the familiar route of premature optimisation.
The algorithm you show manages to work because, although the two threads share the resource (i.e., the queue), they share it in a very particular way. Because only one thread ever alters the head index of the queue (the producer), and only one thread ever alters the tail index (the consumer, of course), you can't get an inconsistent state of the shared object. It's also important that the producer puts the actual data in before updating the head index, and that the consumer reads the data it wants before updating the tail index.
It works as well as it does because the array is quite static; both threads can count on the storage for the elements being there. You probably can't replace the array entirely, but what you can do is change what the array is used for.
I.e., instead of keeping the data in the array, use it to keep pointers to the data. Then you can malloc() and free() the data items, while passing references (pointers) to them between your threads via the array.
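Applying that idea to the algorithm quoted above, a sketch might look like this (NULL doubles as the empty-slot marker, so NULL can never be a valid item; as the first answer notes, real code on a weakly ordered machine also needs the appropriate memory barriers, which are omitted here):

    #define QSIZE 256
    #define NEXT(i) (((i) + 1) % QSIZE)

    static void * volatile buffer[QSIZE];   /* NULL marks an empty slot */
    static unsigned head, tail;             /* producer-only / consumer-only */

    /* Producer: malloc() an item, fill it in, then publish the pointer. */
    int enqueue_item(void *item)
    {
        if (buffer[head] != NULL)
            return -1;                      /* full */
        buffer[head] = item;
        head = NEXT(head);
        return 0;
    }

    /* Consumer: take the pointer, use the data, then free() it. */
    void *dequeue_item(void)
    {
        void *item = buffer[tail];
        if (item == NULL)
            return NULL;                    /* empty */
        buffer[tail] = NULL;
        tail = NEXT(tail);
        return item;
    }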
Also, POSIX does support reading a nanosecond-resolution clock, although the actual precision is system dependent. You can read this high-resolution clock before and after and just subtract.
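For example, a sketch that times an uncontended pthread_mutex_lock()/pthread_mutex_unlock() pair this way:

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
        struct timespec t0, t1;
        const int iters = 1000000;           /* average over many iterations */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++) {
            pthread_mutex_lock(&m);
            pthread_mutex_unlock(&m);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        /* Note: this measures the uncontended case only; contention costs more. */
        printf("%.1f ns per lock/unlock pair\n", ns / iters);
        return 0;
    }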
Yes.
There exist a number of lock-free multiple-reader multiple-writer queues.
I have implemented one, by Michael and Scott, from their 1996 paper.
I will (after some more testing) be releasing a small library of lock-free data structures (in C) which will include this queue.
You should look at the FastFlow library.
I recall seeing one that looked interesting a few years ago, though I can't seem to find it now. :( The lock-free implementation that was proposed did require use of a CAS primitive, though even the locking implementation (if you didn't want to use the CAS primitive) had pretty good performance characteristics: the locks only prevented multiple readers or multiple producers from hitting the queue at the same time; the producer still never raced with the consumer.
I do remember that the fundamental concept behind the queue was to create a linked list that always had one extra "empty" node in it. This extra node meant that the head and the tail pointers of the list would only ever refer to the same data when the list was empty. I wish I could find the paper, I'm not doing the algorithm justice with my explanation...
AH-ha!
I've found someone who transcribed the algorithm without the remainder of the article. This could be a useful starting point.
I've worked with a fairly simple queue implementation that meets most of your criteria. It used a static maximum-size pool of bytes, and then we implemented messages within that. There was a head pointer that one process would move, and a tail pointer that the other process would move.
Locks were still required, but we used Peterson's 2-Processor Algorithm, which is pretty lightweight since it doesn't involve system calls. The lock is only required for very small, well-bounded area: a few CPU cycles at most, so you never block for long.
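For reference, a minimal sketch of Peterson's algorithm for two threads, written with C11 sequentially consistent atomics (with plain variables the compiler and CPU could reorder the accesses, which breaks the algorithm):

    #include <stdatomic.h>

    static _Atomic int flag[2];   /* flag[i]: thread i wants the lock */
    static _Atomic int turn;      /* whose turn it is to wait */

    void peterson_lock(int i)     /* i is 0 or 1 */
    {
        int other = 1 - i;
        atomic_store(&flag[i], 1);
        atomic_store(&turn, other);   /* politely let the other thread go first */
        while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
            ;                         /* spin: well-bounded, a few cycles at most */
    }

    void peterson_unlock(int i)
    {
        atomic_store(&flag[i], 0);
    }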
I think the allocator can be a performance problem. You can try to use a custom multithreaded memory allocator that uses a linked list for maintaining freed blocks. If your blocks are not (nearly) the same size, you can implement a "buddy system" memory allocator, which is very fast. You have to synchronise your queue (ring buffer) with a mutex.
To avoid too much synchronisation, you can try to write/read multiple values to/from the queue on each access.
If you still want to use lock-free algorithms, then you must use pre-allocated data or use a lock-free allocator.
There is a paper about a lock-free allocator, "Scalable Lock-Free Dynamic Memory Allocation", and an implementation, Streamflow.
Before starting with lock-free stuff, look at: Circular lock-free buffer
Adding malloc() would kill any performance gain you might make, and a lock-based structure would be just as effective. This is because malloc() requires some sort of CAS lock over the heap; indeed, some malloc() implementations have their own lock, so you may end up locking inside the memory manager anyway.
To use malloc() you would need to pre-allocate all the nodes and manage them with another queue...
Note that you can make some form of expandable array, which would need to lock when it is expanded.
Also, while interlocked operations are lock-free at the CPU level, they do place a memory lock, blocking that memory for the duration of the instruction, and they often stall the pipeline.
This implementation uses C++'s new and delete, which can trivially be ported to the C standard library using malloc and free:
http://www.drdobbs.com/parallel/writing-lock-free-code-a-corrected-queue/210604448?pgno=2
