Mitigate effects of polling ring buffers - c

I have a single-producer, multiple-consumer program with a thread for each role. I am thinking of implementing a circular buffer for TCP on each of the consumers, letting the producer keep pointers to the circular buffers' memory and hand out pointer space for TCP to offload data into.
My problem: how do the consumer threads know when data has arrived?
I am thinking of busy-waiting, checking the pointer location for something other than 0; I don't mind being a CPU hog.
I should mention that each thread is pinned with cpuset and runs soft real-time under SCHED_FIFO, and of course everything is implemented in C.

In my experience, the problem with multiple-consumer data structures is handling concurrency properly while avoiding false sharing and excessive waste of CPU cycles.
So if your problem allows it, I would use pipe() to create a pipe to each consumer and put items into these pipes in round-robin fashion. The consumers can then use epoll to watch the file descriptors. This avoids having to implement and optimize a concurrent data structure, and you won't burn CPU cycles needlessly. The cost is that you have to go through syscalls.
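The pipe-per-consumer idea can be sketched roughly as follows. This is only an illustration for Linux: `struct item`, `dispatch_round_robin`, `make_consumer_epoll` and `wait_for_item` are hypothetical names, and the item is assumed to be a small fixed-size record so that each write and read transfers exactly one item.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical work item; small fixed-size records (at most PIPE_BUF bytes)
   are written to a pipe atomically. */
struct item { uint32_t id; };

/* Producer side: hand the item to consumer number (seq % n) through its pipe.
   Returns 0 on success, -1 on a failed or short write. */
int dispatch_round_robin(const int write_fds[], int n, unsigned seq, struct item it)
{
    assert(n > 0);
    ssize_t w = write(write_fds[seq % (unsigned)n], &it, sizeof it);
    return w == (ssize_t)sizeof it ? 0 : -1;
}

/* Consumer side, setup: register the read end of the pipe with a new epoll
   instance. Returns the epoll descriptor, or -1 on error. */
int make_consumer_epoll(int read_fd)
{
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = read_fd };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, read_fd, &ev) < 0) {
        close(epfd);
        return -1;
    }
    return epfd;
}

/* Consumer side, loop body: sleep in the kernel until the pipe is readable,
   then read one item. Returns 0 on success, -1 on timeout or error.
   A timeout_ms of -1 waits indefinitely. */
int wait_for_item(int epfd, struct item *out, int timeout_ms)
{
    struct epoll_event ev;
    if (epoll_wait(epfd, &ev, 1, timeout_ms) != 1)
        return -1;
    ssize_t r = read(ev.data.fd, out, sizeof *out);
    return r == (ssize_t)sizeof *out ? 0 : -1;
}
```

Each consumer thread would call make_consumer_epoll() once on its own read end and then loop on wait_for_item(), so it consumes no CPU while its pipe is empty.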
If you want to do everything yourself with polling to avoid syscalls, you can build a circular buffer, but you have to make sure that only one consumer reads an item at a time, and only after the item has been completely written. Usually this is done with 4 pointers and proper mutexes.
This article about Xen's I/O ringbuffers might be of interest.
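If every consumer gets its own ring, as the question proposes, each ring has exactly one writer and one reader. Under that assumption a polling ring can be sketched with two indices and C11 atomics instead of the four pointers and mutexes mentioned above; this is a swapped-in technique, not the scheme the answer describes. The names `spsc_ring`, `ring_push`, `ring_pop` and the capacity `RING_SLOTS` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 1024u   /* assumed capacity; must be a power of two */
static_assert((RING_SLOTS & (RING_SLOTS - 1u)) == 0,
              "RING_SLOTS must be a power of two");

struct spsc_ring {
    _Atomic size_t head;       /* count of items pushed; written only by the producer */
    _Atomic size_t tail;       /* count of items popped; written only by the consumer */
    void *slot[RING_SLOTS];
};

/* Producer only. Returns false if the ring is full. */
bool ring_push(struct spsc_ring *r, void *item)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return false;
    r->slot[head & (RING_SLOTS - 1u)] = item;
    /* Release: the slot contents become visible before the new head does. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer only. Returns false if the ring is empty, so a busy-waiting
   consumer simply calls this in a loop. */
bool ring_pop(struct spsc_ring *r, void **item)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *item = r->slot[tail & (RING_SLOTS - 1u)];
    /* Release: the slot is handed back to the producer only after it was read. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

Polling the head index rather than checking a data location for "something other than 0" avoids having to reserve a sentinel value and having to clear slots after reading them.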

Related

POSIX shared memory - method for automatic client notification

I am investigating POSIX shared memory for IPC in place of a POSIX message queue. I plan to make a shared memory area large enough to hold 50 messages of 750 bytes each. The messages will be sent at random intervals from several cores (servers) to one core (client) that receives the messages and takes action based on the message content.
I have three questions about POSIX shared memory:
(1) is there a method for automatic client notification when new data are available, like the methods available with POSIX pipes and message queues?
(2) What problems would arise using shared memory without a lock where the data are write-once, read-once?
(3) I have read that shared memory is the fastest IPC method because it has the highest bandwidth and data become available in both server and client cores immediately. However, with message queues and pipes the server cores can send the messages and continue with their work without waiting for a lock. Does the need for a lock slow the performance of shared memory over message queues and pipes in the type of scenario described above?
(1) There is no automatic mechanism to notify threads/processes that data was written to a memory location. You'd have to use some other mechanism for notifications.
(2) You have a multiple-producer/single-consumer (MPSC) setup. Implementing a lockless MPSC queue is not trivial. You would have to pay careful attention to doing atomic compare-and-swap (CAS) operations in the right order with the correct memory ordering, and you should know how to avoid false cache line sharing. See https://en.cppreference.com/w/c/atomic for the atomic operations support in C11 and read up about memory barriers. Another good read is the paper on Disruptor at http://lmax-exchange.github.io/disruptor/files/Disruptor-1.0.pdf.
(3) Your data size (50*750) is small. Chances are that it all fits in cache and you'll have no bandwidth issues accessing it. Lock vs. pipe vs. message queue: none of these is free at times of contention and when the queue is full or empty.
One benefit of lockless queues is that they can work entirely in user-space. This is a huge benefit when extremely low latency is desired.
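To give an idea of what the CAS ordering in (2) involves, here is a rough sketch of a bounded multiple-producer/single-consumer queue that uses a per-slot sequence number, a design widely attributed to Dmitry Vyukov. It is an illustration under assumptions, not a vetted implementation: the names `msg_queue`, `queue_send` and `queue_receive` are hypothetical, the 50 messages are rounded up to 64 slots, every message is copied as a fixed 750-byte block, and `_Atomic size_t` is assumed to be lock-free so that it also works across processes in shared memory.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MSG_BYTES   750u
#define QUEUE_SLOTS 64u    /* assumption: 50 rounded up to a power of two */
static_assert((QUEUE_SLOTS & (QUEUE_SLOTS - 1u)) == 0, "power of two required");

struct slot {
    _Atomic size_t seq;               /* tells whether the slot is free or filled */
    unsigned char msg[MSG_BYTES];
};

struct msg_queue {
    struct slot slots[QUEUE_SLOTS];
    _Atomic size_t enqueue_pos;       /* shared by all producers */
    _Atomic size_t dequeue_pos;       /* used by the single consumer */
};

void queue_init(struct msg_queue *q)
{
    for (size_t i = 0; i < QUEUE_SLOTS; i++)
        atomic_init(&q->slots[i].seq, i);
    atomic_init(&q->enqueue_pos, 0);
    atomic_init(&q->dequeue_pos, 0);
}

/* Any producer. Returns false if the queue is full. */
bool queue_send(struct msg_queue *q, const unsigned char msg[MSG_BYTES])
{
    size_t pos = atomic_load_explicit(&q->enqueue_pos, memory_order_relaxed);
    struct slot *s;
    for (;;) {
        s = &q->slots[pos & (QUEUE_SLOTS - 1u)];
        size_t seq = atomic_load_explicit(&s->seq, memory_order_acquire);
        intptr_t diff = (intptr_t)seq - (intptr_t)pos;
        if (diff == 0) {
            /* Slot is free: try to claim position pos. On failure the CAS
               reloads pos with the current value and the loop retries. */
            if (atomic_compare_exchange_weak_explicit(&q->enqueue_pos, &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed))
                break;
        } else if (diff < 0) {
            return false;             /* slot still holds an unread message */
        } else {
            /* Another producer already claimed this position. */
            pos = atomic_load_explicit(&q->enqueue_pos, memory_order_relaxed);
        }
    }
    memcpy(s->msg, msg, MSG_BYTES);
    /* Publish: the consumer only reads the slot once seq == pos + 1. */
    atomic_store_explicit(&s->seq, pos + 1, memory_order_release);
    return true;
}

/* Single consumer only. Returns false if no complete message is available. */
bool queue_receive(struct msg_queue *q, unsigned char msg[MSG_BYTES])
{
    size_t pos = atomic_load_explicit(&q->dequeue_pos, memory_order_relaxed);
    struct slot *s = &q->slots[pos & (QUEUE_SLOTS - 1u)];
    if (atomic_load_explicit(&s->seq, memory_order_acquire) != pos + 1)
        return false;
    memcpy(msg, s->msg, MSG_BYTES);
    atomic_store_explicit(&q->dequeue_pos, pos + 1, memory_order_relaxed);
    /* Hand the slot back to the producers for the next lap. */
    atomic_store_explicit(&s->seq, pos + QUEUE_SLOTS, memory_order_release);
    return true;
}
```

The queue still gives no notification: the client has to poll queue_receive() or be woken by some separate mechanism, as noted in (1).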

Best way to synchronise threads and measure performance at sub-microsecond frequency

I'm working on a standard x86 six core SMP machine, 3.6GHz clock speed, plain C code.
I have a threaded producer/consumer scheme in which my "producer" thread is reading from file at roughly 1,000,000 lines/second, and handing the data it reads off to either two or four "consumer" threads which do a bit of work on it and then stick it into a database. While they are consuming it is busy reading the next line.
So both producer and consumers have to have some means of synchronisation which works at sub-microsecond frequency, for which I use a "busy spin wait" loop, because all the normal synchronisation mechanisms I can find are just too slow. In pseudo code terms:
Producer thread
while(something in file)
{
    read a line
    populate 1/2 of data double buffer
    wait for consumers to idle
    set some key data
    set memory fence
    swap buffers
}
And the consumer threads likewise
while(not told to die)
{
    wait for key data change event
    consume data
}
At both sides the "wait" loop is coded:
while(waiting)
{
    _mm_pause(); /* Intel say this is a good hint to processor that this is a spin wait */
    if(#iterations > 1000) yield_thread(); /* Sleep(0) on Windows, pthread_yield() on Linux */
}
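For reference, the wait loop above might look something like this in compilable C on Linux. This is a sketch under assumptions: the "key data" is taken to be a 32-bit atomic flag, `spin_until_changed` and `cpu_relax` are hypothetical names, sched_yield() stands in for yield_thread(), and the returned iteration count is an addition that gives a cheap per-wait figure without calling a timer.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* spin-wait hint to the processor */
#else
#define cpu_relax() ((void)0)     /* no hint on other architectures in this sketch */
#endif

#define SPINS_BEFORE_YIELD 1000u

/* Spin until *flag no longer equals `seen`. After SPINS_BEFORE_YIELD
   iterations every further iteration also yields the CPU, as in the
   pseudocode. Returns the number of iterations spent waiting. */
uint64_t spin_until_changed(_Atomic uint32_t *flag, uint32_t seen)
{
    uint64_t spins = 0;
    while (atomic_load_explicit(flag, memory_order_acquire) == seen) {
        cpu_relax();
        if (++spins > SPINS_BEFORE_YIELD)
            sched_yield();
    }
    return spins;
}
```

Summing the returned counts per thread gives a rough measure of how much each thread waits, independent of what the profiler attributes to the loop.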
This all works, and I get some quite nice speed-ups compared to the equivalent serial code, but my profiler (Intel's VTune Amplifier) shows that I am spending a horrendous amount of time in my busy wait loops, and the ratio of "spin" to "useful work done" is depressingly high. Given the way the profiler concentrates its feedback on the busiest sections this also means that the lines of code doing useful work tend not to be reported, since (relatively speaking) their %age of total cpu is down at the noise level ... or at least that is what the profiler is saying. They must be doing something otherwise I wouldn't see any speed up!
I can and do time things, but it is hard to distinguish between delays imposed by disk latency in the producer thread, and delays spent while the threads synchronise.
So is there a better way to measure what is actually going on? By which I mean just how much time are these threads really spending waiting for one another? Measuring time accurately is really hard at sub-microsecond resolution, the profiler doesn't seem to give me much help, and I am struggling to optimise the scheme.
Or maybe my spin wait scheme is rubbish, but I can't seem to find a better solution for sub-microsecond synchronisation.
Any hints would be really welcome :-)
Even better than fast locks is not locking at all. Try switching to a lock-free queue, such as the ones in liblfds. Producers and consumers then never have to wait for a lock.
Lock-free data structures are process, thread and interrupt safe (i.e. the same data structure instance can be safely used concurrently and simultaneously across cores, processes, threads and both inside and outside of interrupt handlers), never sleep (and so are safe for kernel use when sleeping is not permitted), operate without context switches, cannot fail (no need to handle error cases, as there are none), perform and scale literally orders of magnitude better than locking data structures, and liblfds itself (as of release 7.0.0) is implemented such that it performs no allocations (and so works with NUMA, stack, heap and shared memory) and compiles not just on a freestanding C89 implementation, but on a bare C89 implementation.
Thank you to all who commented above, the suggestion of making the quantum of work bigger was the key. I have now implemented a queue (1000 entry long rotating buffer) for my consumer threads, so the producer only has to wait if that queue is full, rather than waiting for its half of the double buffer in my previous scheme. So its synchronisation time is now sub-millisecond instead of sub-microsecond - well that's a surmise, but it's definitely 1000x longer than before!
If the producer hits "queue full" I can now yield its thread immediately, instead of spin waiting, safe in the knowledge that any time slice it loses will be used gainfully by the consumer threads. This does indeed show up as a small amount of sleep/spin time in the profiler. The consumer threads benefit too since they have a more even workload.
Net outcome is a 10% reduction in the overall time to read a file, and given that only part of the file is able to be processed in a threaded manner that suggests that the threaded part of the process is around 15% or more faster.

Strategies to mitigate polling effects in ring buffers

I am using a canonical ring buffer implementation in a 1Reader thread/1Writer thread setting.
Since the reader loops when the buffer is empty [the writer loops when the buffer is full] and continuously polls the control variables, I call pthread_yield (which in my case is only a wrapper around sched_yield) to give priority to other threads in the system. I am not using any mutex because it is not needed for proper functioning.
Is there a better way to mitigate the polling effects (a.k.a. CPU burning)? I was thinking of pthread condition variables, since I mostly want to block the thread when there is no data [no space], but I am afraid of the overhead they could introduce.
Thanks
Use condition variables, the overhead is much lower than busy waiting, and using mutexes correctly ensures that your data is actually there when you expect it to be (since they enforce ordering).
In addition, if you really don't need the mutex for the general case, lock contention should be low to non-existent.
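A minimal sketch of a ring buffer that blocks on condition variables instead of polling could look like this. The names `cv_ring`, `cv_ring_put`, `cv_ring_get` and the capacity are hypothetical, and the payload is assumed to be a plain int for brevity.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define CV_RING_SLOTS 256u   /* assumed capacity; power of two */
static_assert((CV_RING_SLOTS & (CV_RING_SLOTS - 1u)) == 0, "power of two required");

struct cv_ring {
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;   /* signalled after every put */
    pthread_cond_t  not_full;    /* signalled after every get */
    size_t head, tail;           /* head: items written so far; tail: items read so far */
    int slot[CV_RING_SLOTS];
};

void cv_ring_init(struct cv_ring *r)
{
    pthread_mutex_init(&r->lock, NULL);
    pthread_cond_init(&r->not_empty, NULL);
    pthread_cond_init(&r->not_full, NULL);
    r->head = r->tail = 0;
}

/* Writer: sleeps while the ring is full instead of spinning. */
void cv_ring_put(struct cv_ring *r, int value)
{
    pthread_mutex_lock(&r->lock);
    while (r->head - r->tail == CV_RING_SLOTS)      /* loop guards against spurious wakeups */
        pthread_cond_wait(&r->not_full, &r->lock);
    r->slot[r->head++ % CV_RING_SLOTS] = value;
    pthread_cond_signal(&r->not_empty);
    pthread_mutex_unlock(&r->lock);
}

/* Reader: sleeps while the ring is empty instead of spinning. */
int cv_ring_get(struct cv_ring *r)
{
    pthread_mutex_lock(&r->lock);
    while (r->head == r->tail)
        pthread_cond_wait(&r->not_empty, &r->lock);
    int value = r->slot[r->tail++ % CV_RING_SLOTS];
    pthread_cond_signal(&r->not_full);
    pthread_mutex_unlock(&r->lock);
    return value;
}
```

With one reader and one writer the mutex is held only for a few instructions per item, so contention stays low, which is the point made above.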

How to optimize two threads running heavy loops

I have more of a conceptual question.
Assume one program running two threads.
Both threads are running loops all the time.
One thread is responsible for streaming data and the other thread is responsible for receiving the file that the first thread has to stream.
So the file transfer thread loops to receive the data, which it writes to a file, and the streaming thread reads that data from the file as it needs it and streams it.
The problem I see here is how to avoid starvation when the file transfer takes too many CPU cycles for itself and thus makes the streaming thread lag.
How can I share the CPU effectively between those two threads, knowing that the streamer streams data far more slowly than the file transfer receives it?
I thank you for your advice.
Quite often this kind of problem is solved by using some kind of flow control:
Block the sender when the receiver is busy.
This can also cause problems: if your program must be able to fast-forward (seek forward), then this is not a good idea.
In your case, you could block the file transfer thread when there is more than 2 MB of unstreamed data in the file, and resume it when there is less than 1 MB of unstreamed data.
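The two thresholds form a simple hysteresis, which can be sketched as below. The names `flow_gate` and `flow_gate_update` are hypothetical, and how the transfer thread is actually blocked (for example on a condition variable) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HIGH_WATERMARK (2u * 1024u * 1024u)   /* pause above 2 MB of unstreamed data */
#define LOW_WATERMARK  (1u * 1024u * 1024u)   /* resume below 1 MB of unstreamed data */

struct flow_gate {
    bool paused;   /* true while the file transfer thread should stay blocked */
};

/* Call whenever the amount of unstreamed data changes.
   Returns true if the file transfer thread should be blocked. */
bool flow_gate_update(struct flow_gate *g, size_t unstreamed_bytes)
{
    if (!g->paused && unstreamed_bytes > HIGH_WATERMARK)
        g->paused = true;
    else if (g->paused && unstreamed_bytes < LOW_WATERMARK)
        g->paused = false;
    return g->paused;
}
```

Using two different thresholds keeps the transfer thread from being paused and resumed on every small change around a single limit.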
See if pthread_setschedparam() helps you balance out the threads' usage of the CPU.
According to the man page for pthread_setschedparam, you can change thread priorities:
int pthread_setschedparam(pthread_t thread, int policy,
                          const struct sched_param *param);
struct sched_param {
    int sched_priority; /* Scheduling priority */
};
As can be seen, only one scheduling parameter is supported. For details of the permitted ranges for scheduling priorities in each scheduling policy, see sched_setscheduler(2).
Also, regarding
"the file transfer is taking too much CPU cycles for it's own":
If you read this SO post, it seems to suggest that changing thread priorities may not help, because the file transfer thread consumes more CPU cycles simply because it needs them. But in your case, it is OK if the file transfer is slowed down, since the streamer thread cannot keep up with it anyway. Hence my suggestion to change the priorities and deprive the file transfer thread of some cycles even if it needs them.
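A call to pthread_setschedparam() might be wrapped as in the sketch below. The helper names `clamp_priority` and `set_thread_priority` are hypothetical. Note that switching a thread to a real-time policy such as SCHED_FIFO or SCHED_RR normally requires privileges, so the call is expected to fail with EPERM for an ordinary user.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Clamp a requested priority into the range the given policy permits. */
int clamp_priority(int policy, int requested)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);
    if (requested < lo)
        return lo;
    if (requested > hi)
        return hi;
    return requested;
}

/* Set the policy and priority of one thread.
   Returns 0 on success or an errno value such as EPERM. */
int set_thread_priority(pthread_t thread, int policy, int requested)
{
    struct sched_param param = { .sched_priority = clamp_priority(policy, requested) };
    return pthread_setschedparam(thread, policy, &param);
}
```

For example, the streaming thread could be given a higher SCHED_FIFO priority than the file transfer thread, so that the transfer thread only runs when the streamer has nothing to do.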

if using shared memory, are there still advantages for processes over threading?

I have written a Linux application in which the main 'consumer' process forks off a bunch of 'reader' processes (~16) which read data from the disk and pass it to the 'consumer' for display. The data is passed over a socket which was created before the fork using socketpair.
I originally wrote it with this process boundary for 3 reasons:
(1) The consumer process has real-time constraints, so I wanted to avoid any memory allocations in the consumer. The readers are free to allocate memory as they wish, or even be written in another language (e.g. with garbage collection), and this doesn't interrupt the consumer, which has FIFO priority. Also, disk access or other IO in the reader process won't interrupt the consumer. I figured that with threads I couldn't get such guarantees.
(2) Using processes will stop me, the programmer, from doing stupid things like using global variables and clobbering other processes' memory.
(3) I figured forking off a bunch of workers would be the best way to utilize multiple CPU architectures, and I figured using processes instead of threads would generally be safer.
Not all readers are always active, however, those that are active are constantly sending large amounts of data. Lately I was thinking that to optimize this by avoiding memory copies associated with writing and reading the socket, it would be nice to just read the data directly into a shared memory buffer (shm_open/mmap). Then only an index into this shared memory would be passed over the socket, and the consumer would read directly from it before marking it as available again.
Anyways, one of the biggest benefits of processes over threads is to avoid clobbering another thread's memory space. Do you think that switching to shared memory would destroy any advantages I have in this architecture? Is there still any advantage to using processes in this context, or should I just switch my application to using threads?
Your assumption that you cannot meet your realtime constraints with threads is mistaken. IO or memory allocation in the reader threads cannot stall the consumer thread as long as the consumer thread is not using malloc itself (which could of course lead to lock contention). I would recommend reading what POSIX has to say on the matter if you're unsure.
As for the other reasons to use processes instead of threads (safety, possibility of writing the readers in a different language, etc.), these are perfectly legitimate. As long as your consumer process treats the shared memory buffer as potentially-unsafe external data, I don't think you lose any significant amount of safety by switching from pipes to shared memory.
Yes, exactly for the reason you gave. It's better to have each process's memory protected and to share only what really needs to be shared. That way each consumer can allocate and use its resources without bothering with locking.
As for passing the index between your tasks, note that you could use an area in your shared memory for that, protected by a mutex, as this is likely lighter than socket communication. File descriptor communication (sockets, pipes, files, etc.) always involves the kernel, whereas shared memory with mutex locks or semaphores involves it only when there is contention.
One point to be aware of when programming with shared memory in a multiprocessor environment is to avoid false dependencies between variables (false sharing). This happens when two unrelated objects share the same cache line. When one is modified, it also "dirties" the other, so if another processor accesses the other object, it triggers a cache synchronisation between the CPUs. This can lead to poor scaling. Aligning the objects to the cache line size (usually 64 bytes, but it can differ from architecture to architecture) easily avoids that.
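In C11 the alignment can be requested with alignas. The sketch below assumes a 64-byte cache line; the struct `shared_indices` and its fields are hypothetical stand-ins for whatever index area is placed in the shared memory.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Each index is updated by a different process, so each gets its own
   cache line; a write to one no longer invalidates the other. */
struct shared_indices {
    alignas(CACHE_LINE) _Atomic size_t write_index;   /* updated by a reader process */
    alignas(CACHE_LINE) _Atomic size_t read_index;    /* updated by the consumer */
};

/* Compile-time checks that the layout is what the comment above claims. */
static_assert(offsetof(struct shared_indices, read_index) -
              offsetof(struct shared_indices, write_index) >= CACHE_LINE,
              "indices must not share a cache line");
static_assert(sizeof(struct shared_indices) % CACHE_LINE == 0,
              "struct must cover whole cache lines");
```

The start of the shared mapping itself must also be cache-line aligned for this to hold; memory returned by mmap is page aligned, which satisfies that.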
In my experience, the main reason to replace processes with threads is efficiency.
If your processes use a lot of code or unshared memory that could be shared under multithreading, then you could gain a lot of performance on highly threaded CPUs such as Sun SPARC CPUs with 64 or more threads per CPU. In this case the CPU cache, especially for the code, will be used much more efficiently by a multithreaded process (the cache is small on SPARC).
If you see that your software does not run faster on new hardware with more CPU threads, then you should consider multithreading. Otherwise, your arguments for avoiding it seem good to me.
I have not run into this issue on Intel processors yet, but it could happen in the future when they add more cores per CPU.

Resources