Overhead of Spin Loop in terms of cache coherence - c

Say a thread on one core is spinning on a variable that will be updated by a thread running on another core. My question is what the overhead is at the cache level. Will the waiting thread cache the variable, and therefore cause no traffic on the bus until the writing thread writes to that variable?
How can this overhead be reduced? Does the x86 pause instruction help?

I believe all modern x86 CPUs use the MESI protocol or a close variant of it (MESIF on Intel, MOESI on AMD). So the spinning "reader" thread will likely have a cached copy of the data in either the "exclusive" or the "shared" state, generating no memory bus traffic while it spins.
It is only when the other core writes to the location that it will have to perform cross-core communication.
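To make the read-only spin concrete, here is a minimal sketch in C11, not a definitive implementation. The names ready, payload, publish and wait_for_payload are hypothetical, and the cpu_relax macro is an assumption that maps to the x86 PAUSE instruction where available and to a no-op elsewhere. The reader only issues loads, so as long as nobody writes the line it can typically keep hitting its own cached copy; PAUSE is Intel's documented hint for spin-wait loops.

```c
#include <stdatomic.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* x86 PAUSE: spin-wait hint */
#else
#define cpu_relax() ((void)0)     /* assumption: no-op fallback on other CPUs */
#endif

static atomic_bool ready = false; /* hypothetical flag the reader spins on */
static int payload;               /* data published before the flag is set */

/* Writer: store the data, then publish it with a release store. */
static void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader: spin using loads only; no write is issued while waiting. */
static int wait_for_payload(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        cpu_relax();
    return payload;
}
```

In a real program publish would run on one thread and wait_for_payload on another; the acquire/release pair is what makes the payload visible once the flag is seen.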
[update]
A "spinlock" like this is only a good idea if you will not be spinning for very long. If it may be a while before the variable gets updated, use a mutex + condition variable instead, which will put your thread to sleep so that it adds no overhead while it waits.
(Incidentally, I suspect a lot of people -- including me -- are wondering "what are you actually trying to do?")
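For the longer-wait case, the mutex + condition variable pattern mentioned above could look roughly like this. It is a sketch using POSIX threads; the names set_value, wait_for_value, updated and shared_value are made up for illustration.

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t changed = PTHREAD_COND_INITIALIZER;
static bool updated = false;   /* predicate protected by lock */
static int shared_value;       /* hypothetical variable being waited on */

/* Writer: update the variable and wake one waiter. */
static void set_value(int v)
{
    pthread_mutex_lock(&lock);
    shared_value = v;
    updated = true;
    pthread_cond_signal(&changed);
    pthread_mutex_unlock(&lock);
}

/* Reader: sleeps inside pthread_cond_wait instead of spinning. */
static int wait_for_value(void)
{
    pthread_mutex_lock(&lock);
    while (!updated)                        /* loop guards against spurious wakeups */
        pthread_cond_wait(&changed, &lock);
    int v = shared_value;
    pthread_mutex_unlock(&lock);
    return v;
}
```

The waiting thread uses no CPU while blocked, at the cost of a wake-up latency that is much larger than a spin.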

If you spin for short intervals you are usually fine. However, there is a timer interrupt on Linux (and I assume something similar on other OSes), so if you spin for 10 ms or close to it you will see a cache disturbance.
I have heard it's possible to modify the Linux kernel to prevent all interrupts on specific cores, which makes this disturbance go away, but I don't know what is involved in doing this.

In the case of two threads the overhead may be negligible, but it is still a good idea to write a simple benchmark. For instance, if you implement spinlocks, measure how much time each thread spends spinning.
This effect on the cache is called cache line bouncing.

I tested this extensively in this post. The overhead in general is incurred by the bus-locking component of the spinlock, usually the instruction "xchg reg,mem" or some variant of it. Since that particular overhead cannot be avoided you have the options of economizing on the frequency with which you invoke the spinlock and performing the absolute minimum amount of work necessary - once the lock is in place - before releasing it.
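One common way to invoke the locked exchange less often is a "test and test-and-set" lock. Note that this is a technique swapped in here for illustration, not something the answer above spells out. The sketch below uses C11 atomics, and the ttas_lock type and function names are hypothetical. The expensive atomic exchange is only attempted when a plain load suggests the lock is free.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } ttas_lock;   /* hypothetical lock type */

static void ttas_acquire(ttas_lock *l)
{
    for (;;) {
        /* The locked exchange (xchg on x86): the costly step. */
        if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
            return;
        /* Lock is held: wait on plain loads until it looks free again. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
    }
}

static void ttas_release(ttas_lock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

Keeping the work between ttas_acquire and ttas_release as small as possible remains the other half of the advice.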

Related

Best way to synchronise threads and measure performance at sub-microsecond frequency

I'm working on a standard x86 six core SMP machine, 3.6GHz clock speed, plain C code.
I have a threaded producer/consumer scheme in which my "producer" thread is reading from file at roughly 1,000,000 lines/second, and handing the data it reads off to either two or four "consumer" threads which do a bit of work on it and then stick it into a database. While they are consuming it is busy reading the next line.
So both producer and consumers have to have some means of synchronisation which works at sub-microsecond frequency, for which I use a "busy spin wait" loop, because all the normal synchronisation mechanisms I can find are just too slow. In pseudo code terms:
Producer thread
While(something in file)
{
read a line
populate 1/2 of data double buffer
wait for consumers to idle
set some key data
set memory fence
swap buffers
}
And the consumer threads likewise
while(not told to die)
{
wait for key data change event
consume data
}
At both sides the "wait" loop is coded:
while(waiting)
{
_mm_pause(); /* Intel say this is a good hint to processor that this is a spin wait */
if(#iterations > 1000) yield_thread(); /* Sleep(0) on Windows, pthread_yield() on Linux */
}
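Written out as compilable C, the wait loop above might look like the following sketch. Assumptions: the "waiting" condition is an atomic flag cleared by the other thread, sched_yield() stands in for yield_thread() on Linux, and cpu_relax is a hypothetical macro that falls back to a no-op off x86. Returning the iteration count is an addition that gives a cheap per-wait measurement without going through a timer.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()
#else
#define cpu_relax() ((void)0)   /* assumption: no-op fallback on other CPUs */
#endif

#define SPIN_LIMIT 1000         /* threshold from the loop above; tune by measurement */

/* Spins while *waiting is true; returns how many iterations were spent. */
static unsigned long wait_while_set(atomic_bool *waiting)
{
    unsigned long iterations = 0;
    while (atomic_load_explicit(waiting, memory_order_acquire)) {
        cpu_relax();
        if (++iterations > SPIN_LIMIT)
            sched_yield();      /* give up the CPU once the spin budget is used */
    }
    return iterations;
}
```

Summing the returned counts per thread gives a rough picture of how much of the run is spent waiting, independent of what the profiler reports.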
This all works, and I get some quite nice speed-ups compared to the equivalent serial code, but my profiler (Intel's VTune Amplifier) shows that I am spending a horrendous amount of time in my busy wait loops, and the ratio of "spin" to "useful work done" is depressingly high. Given the way the profiler concentrates its feedback on the busiest sections this also means that the lines of code doing useful work tend not to be reported, since (relatively speaking) their %age of total cpu is down at the noise level ... or at least that is what the profiler is saying. They must be doing something otherwise I wouldn't see any speed up!
I can and do time things, but it is hard to distinguish between delays imposed by disk latency in the producer thread, and delays spent while the threads synchronise.
So is there a better way to measure what is actually going on? By which I mean just how much time are these threads really spending waiting for one another? Measuring time accurately is really hard at sub-microsecond resolution, the profiler doesn't seem to give me much help, and I am struggling to optimise the scheme.
Or maybe my spin wait scheme is rubbish, but I can't seem to find a better solution for sub-microsecond synchronisation.
Any hints would be really welcome :-)
Even better than fast locks is not locking at all. Try switching to a lock-free queue. Producers and consumers wouldn't need to wait at all.
Lock-free data structures are process, thread and interrupt safe (i.e. the same data structure instance can be safely used concurrently and simultaneously across cores, processes, threads and both inside and outside of interrupt handlers), never sleep (and so are safe for kernel use when sleeping is not permitted), operate without context switches, cannot fail (no need to handle error cases, as there are none), perform and scale literally orders of magnitude better than locking data structures, and liblfds itself (as of release 7.0.0) is implemented such that it performs no allocations (and so works with NUMA, stack, heap and shared memory) and compiles not just on a freestanding C89 implementation, but on a bare C89 implementation.
Thank you to all who commented above, the suggestion of making the quantum of work bigger was the key. I have now implemented a queue (1000 entry long rotating buffer) for my consumer threads, so the producer only has to wait if that queue is full, rather than waiting for its half of the double buffer in my previous scheme. So its synchronisation time is now sub-millisecond instead of sub-microsecond - well that's a surmise, but it's definitely 1000x longer than before!
If the producer hits "queue full" I can now yield its thread immediately, instead of spin waiting, safe in the knowledge that any time slice it loses will be used gainfully by the consumer threads. This does indeed show up as a small amount of sleep/spin time in the profiler. The consumer threads benefit too since they have a more even workload.
Net outcome is a 10% reduction in the overall time to read a file, and given that only part of the file is able to be processed in a threaded manner that suggests that the threaded part of the process is around 15% or more faster.
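For readers who want to try the same idea, a rotating buffer between one producer and one consumer can be sketched in C11 as below. This is not the code from the post: the spsc_queue type, the int payload and the 1024-slot size (a power of two close to the 1000 entries mentioned) are all assumptions, and the sketch is only safe with exactly one pushing thread and one popping thread per queue.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_SLOTS 1024            /* assumption: power of two near 1000 */

typedef struct {
    int items[QUEUE_SLOTS];         /* hypothetical payload type */
    atomic_size_t head;             /* next slot to read; written by the consumer */
    atomic_size_t tail;             /* next slot to write; written by the producer */
} spsc_queue;

/* Producer side. Returns false when the queue is full so the caller can yield. */
static bool spsc_push(spsc_queue *q, int item)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QUEUE_SLOTS)
        return false;
    q->items[tail % QUEUE_SLOTS] = item;
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false when the queue is empty. */
static bool spsc_pop(spsc_queue *q, int *item)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *item = q->items[head % QUEUE_SLOTS];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

With several consumers, one such queue per consumer keeps the single-producer/single-consumer assumption intact.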

Modern System Architecture?

What could happen if we used Peterson's solution to the critical section problem on a modern computer? It is my understanding that systems with multiple CPUs can run into difficulty because of the ordering of memory reads and writes with respect to other reads and writes in memory, but is this the problem with most modern systems? Are there any advantages to using semaphores VS mutex locks?
Interesting question! To answer it, it helps to pin down the terms first. The critical section is the part of a program that must not be executed concurrently by more than one of that program's processes or threads. In other words, only one process or thread may be inside it at a time. Typically the critical section accesses a shared resource such as a data structure or a network connection.
Mutual Exclusion or mutex just describes the requirement that only one concurrent process is in the critical section at a time, so concurrent access to shared data must ensure this "mutual exclusion".
So this introduces the problem! How do we assure that processes run completely independently of other processes, in other words, how do we ensure "atomic access" to the various critical sections by the threads?
There are a few solutions to the "critical-section problem" but the one you mention is Peterson's solution so we will discuss that.
Peterson's algorithm is designed for mutual exclusion and allows two tasks to share a single-use resource. They use shared memory for communicating.
In the algorithm, two tasks compete for the critical section; you'll have to look into mutual exclusion, bounded waiting and the other properties a bit more for a full understanding, but the gist of it is that in Peterson's method a process waits at most one turn to enter the critical section: if it gives priority to the other task, that task runs its critical section to completion and then allows the first one to enter.
That is the original solution proposed.
However, this has no guarantee of working on today's multiprocessor architectures, and it only works for two concurrent tasks. Modern CPUs (and compilers) may execute and reorder memory reads and writes out of program order, so operations that look sequential in the source can become visible to the other core in a different order, which breaks the algorithm unless memory fences are added. I suggest you also take a look at locks. Hope that helps :)
Can anyone else think of anything to add that I might have missed?
It is my understanding that systems with multiple CPUs can run into difficulty because of the ordering of memory reads and writes with respect to other reads and writes in memory, but is this the problem with most modern systems?
No. Any modern systems with "less strict" memory ordering will have ways to make the memory ordering more strict where it matters (e.g. fences).
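As an illustration of making the ordering strict where it matters, here is a sketch of Peterson's algorithm for two tasks using C11 sequentially consistent atomics. The names wants, turn, peterson_lock and peterson_unlock are hypothetical. The default atomic_store and atomic_load are seq_cst, which is what keeps the store to wants[self] from being reordered after the load of wants[other].

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool wants[2];   /* wants[i]: task i wants to enter */
static atomic_int turn;        /* which task currently has priority */

/* self must be 0 or 1. */
static void peterson_lock(int self)
{
    int other = 1 - self;
    atomic_store(&wants[self], true);   /* seq_cst store */
    atomic_store(&turn, other);         /* give priority to the other task */
    /* Wait while the other task wants in and still has priority. */
    while (atomic_load(&wants[other]) && atomic_load(&turn) == other)
        ;
}

static void peterson_unlock(int self)
{
    atomic_store(&wants[self], false);
}
```

With plain (non-atomic) variables the same code can fail on a multiprocessor, because the hardware or compiler is free to reorder the accesses.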
Are there any advantages to using semaphores VS mutex locks?
Mutexes are typically simpler and faster (in the same way that a boolean is simpler than a counter); but ignoring overhead a mutex is equivalent to a semaphore with "resource count = 1".
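The equivalence can be seen side by side in a small POSIX sketch; the function names and the counter are made up for illustration, and unnamed semaphores via sem_init are assumed to be available (they are on Linux).

```c
#include <pthread.h>
#include <semaphore.h>

static sem_t binary_sem;                                  /* semaphore with count 1 */
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static int counter;                                       /* hypothetical shared data */

/* Initialise the semaphore with a resource count of 1. Returns 0 on success. */
static int locks_init(void)
{
    return sem_init(&binary_sem, 0, 1);
}

static void increment_with_semaphore(void)
{
    sem_wait(&binary_sem);     /* count 1 -> 0: enter */
    counter++;
    sem_post(&binary_sem);     /* count 0 -> 1: leave */
}

static void increment_with_mutex(void)
{
    pthread_mutex_lock(&mutex);
    counter++;
    pthread_mutex_unlock(&mutex);
}
```

One practical difference is ownership: a mutex is meant to be unlocked by the thread that locked it, while any thread may post a semaphore.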
What could happen if we used Peterson's solution to the critical section problem on a modern computer?
The big problem here is that most modern operating systems support some kind of multi-tasking (e.g. multiple processes, where each process can have multiple threads), there's usually 100 other processes (just for the OS alone), and modern hardware has power management (where you try to avoid power consumption by putting CPUs to sleep when they can't do useful work). This means that (unbounded) spinning/busy waiting is a horrible idea (e.g. you can have N CPUs being wasted spinning/trying to acquire a lock while the task that currently holds the lock isn't running on any CPU because the scheduler decided that 1234 other tasks should get 10 ms of CPU time each).
Instead; to avoid (excessive) spinning you want to ask the scheduler to block your task until/unless the lock actually can be acquired; and (especially for heavily contended locks) you probably want "fairness" (to avoid the risk of timing problems that lead to some tasks being repeatedly lucky while other tasks starve and make no progress).
This ends up being "no spinning", or "brief spinning" (to avoid scheduler overhead in cases where the task holding the lock actually can/does release it quickly); followed by the task being put on a FIFO queue and the scheduler giving the CPU to a different task or putting the CPU to sleep; where if the lock is released the scheduler wakes up the first task on the FIFO queue. Of course it's never that simple (e.g. for performance you want to do as much as you can in user-space; and you need special care and cooperation between user-space and kernel to avoid race conditions - the lock being released before a task is put on the wait queue).
Fortunately modern systems also provide simpler ways to implement locks (e.g. "atomic compare and swap"), so there's no need to resort to Peterson's algorithm (even if it's just for insertion/removal of tasks from the real lock's FIFO queue).

Linux Interrupt vs. Polling

I am developing a system with a DSP and an ARM. The ARM runs Linux. The DSP sends data to the ARM, and on the Linux side there is a kernel module which reads the data received from the DSP. The kernel module is woken up to read the data by a hardware interrupt from the DSP to the ARM.
I want to write a user-space app that will read the data from kernel space (the kernel module) each time new data arrives from the DSP.
The question is:
Which is the better approach: a software interrupt from the kernel to user space, or polling from user space (reading a known memory address shared with the kernel) every 10 ms?
Knowing that:
The data from the DSP to the kernel must arrive in very short time - 100us.
The data from the kernel to the user-space can take 10ms to 30ms.
The data that is being read is regarded small - around 100 bytes.
I would create a device and have the userland program block on read. No need to wait 10ms in between, this is handled efficiently by blocking.
Polling in a sense of using poll (yes, I understood that's not what you meant) would work fine, but there is no reason to call two functions (first poll and then read) when one function can do it anyway. No need to do it every 10ms, you can immediately call poll again after having processed what you got from your last read.
Polling in a sense of checking a known memory location every 10ms is not advisable. Not only is this an ugly hack and more complicated than you think (you will have to map the page containing that memory location to user space), and a form of busy waiting which needlessly consumes CPU, it also has an average latency of 5ms and a worst case latency of 10ms, which is entirely unnecessary. Average and worst case latency of read is approximately zero (well, not quite, but nearly so... it's as fast as waking a blocked task goes).
Interrupts (i.e. signals) are very efficient but make the program a lot more complicated/contorted compared to simply reading and blocking (have to write a signal handler, may not use certain functions in handlers, must communicate to main app, etc.). While technically a good solution, I would advise against them because a program needs not be more complicated than necessary.
Polling has no advantage over waiting. The process still has to be scheduled and switched to, and part of the time its poll finds nothing and is wasted.
Linux runs the scheduler when returning from interrupts, so when you wake up the waiting task in the in-kernel interrupt handler and it has a high priority set (you should give it real-time priority, obviously), the task will be scheduled immediately. You won't beat that with polling.
The standard interface of (character) device files is reasonably fast, so just implement blocking read, poll (which is a blocking system call, not polling anything really) and possibly asynchronous read (uses real-time signal), but I suspect performance of dedicated thread waiting in read system call will be better than AIO. And it's easier to write too. You should find enough examples in kernel sources.
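On the user-space side, the dedicated thread waiting in read could be as small as the sketch below. The helper name read_record and the 100-byte record size are assumptions taken from the numbers in the question, and the driver is assumed to implement a blocking read.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

#define RECORD_SIZE 100   /* assumption: roughly 100-byte records, as in the question */

/* Blocks in read() until the driver has data.
   Returns bytes read, 0 on end of file, -1 on error. */
static ssize_t read_record(int fd, unsigned char *buf, size_t len)
{
    ssize_t n;
    do {
        n = read(fd, buf, len);          /* sleeps in the kernel until data arrives */
    } while (n < 0 && errno == EINTR);   /* retry if interrupted by a signal */
    return n;
}
```

The app would open the character device once (the device path depends on the driver and is not given here) and then call read_record in a loop, handling each record as soon as the call returns.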
You don't seem to mention any hard time constraints, so you could really go with either approach. However, as Martin James said, polling introduces some overhead to the application, which you probably don't want.
Personally, I'd go with an interrupt or event flag triggered by the kernel. While you may not have hard timing constraints, I assume you want something that's more deterministic rather than less. A kernel interrupt will get you closer to that.

Mutex vs busy wait for tcp io

I do not care about being a cpu hog as I have one thread assigned to each core and the system threads blocked off to their own set. My understanding is that mutex is of use when other tasks are to run, in this case that is not important so I am considering having a consumer thread loop on an address in memory waiting for its value to be non zero - as in the single producer thread that is looping recv()ing with TCP_NONBLOCK set just deposited information and it is now non zero.
Is my implementation a smart one given my circumstances, or should I be using a mutex or custom interrupt even though no other tasks will run?
In addition to points by @ugoren and comments by others:
Even if you have a valid use-case for busy-waiting and burning a core, which are admittedly rare, you need to:
Protect the data shared between threads. This is where locks come into play - you need mutual exclusion when accessing any complex shared data structure. People tend to look into lock-free algorithms here, but these are way-way not obvious and error-prone and are still considered deep black magic. Don't even try these until you have a solid understanding of concurrency.
Notify threads about changed state. This is where you'd use condition variables or monitors. There are other methods too, eventfd(2) on Linux, for example.
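As a rough, Linux-only illustration of the eventfd(2) route, the sketch below wraps the three calls involved. The notifier_* names are hypothetical. The producer adds 1 to the eventfd counter, and the consumer blocks in read until the counter is non-zero.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create the notification object; returns a file descriptor or -1. */
static int notifier_create(void)
{
    return eventfd(0, 0);
}

/* Producer: add 1 to the counter, waking a thread blocked in notifier_wait. */
static int notifier_signal(int efd)
{
    uint64_t one = 1;
    return write(efd, &one, sizeof one) == (ssize_t)sizeof one ? 0 : -1;
}

/* Consumer: block until the counter is non-zero, then return it and reset it.
   Returns 0 on error. */
static uint64_t notifier_wait(int efd)
{
    uint64_t count = 0;
    if (read(efd, &count, sizeof count) != (ssize_t)sizeof count)
        return 0;
    return count;
}
```

Because it is a file descriptor, the same object can also be watched with select, poll or epoll alongside the TCP socket.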
Here are some links for you to show that it's much harder than you seem to think:
Memory Ordering
Out-of-order execution
ABA problem
Cache coherence
Busy-wait can give you a lower latency and somewhat better performance in some cases.
Letting other threads use the CPU is the obvious reason not to do it, but there are others:
You consume more power. An idle CPU goes into a low power state, reducing consumption very significantly. Power consumption is a major issue in data centers, and any serious application must not waste power.
If your code runs in a virtual machine (and everything is being virtualized these days), your machine competes for CPU with others. Consuming 100% CPU leaves less for the others, and may cause the hypervisor to give your machine less CPU when it's really needed.
You should always stick to mainstream methods, unless there's a good reason not to. In this case, the mainstream is to use select or poll (or epoll). This lets you do other stuff while waiting, if you want, and doesn't waste CPU time. Is the performance difference large enough to justify busy wait?
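For comparison with the busy-wait, a poll-based receive could be sketched like this. The helper name recv_when_ready and its timeout parameter are assumptions for illustration; the thread sleeps in poll until data arrives or the timeout expires.

```c
#include <poll.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Wait up to timeout_ms for data on sock, then receive it.
   Returns bytes received, 0 on timeout (or if the peer closed the
   connection), -1 on error. */
static ssize_t recv_when_ready(int sock, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = sock, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);
    if (ready <= 0)
        return ready;                /* 0 = timed out, -1 = poll failed */
    return recv(sock, buf, len, 0);
}
```

Measuring this against the spinning version on the real workload is the only way to tell whether the latency difference justifies burning a core.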

Implementing critical section

What way is better and faster to create a critical section?
With a binary semaphore, between sem_wait and sem_post.
Or with atomic operations:
#include <sched.h>
#include <stdbool.h>
void critical_code(){
    static volatile bool lock = false;
    //Enter critical section
    while ( !__sync_bool_compare_and_swap (&lock, false, true ) ){
        sched_yield();
    }
    //...
    //Leave critical section
    lock = false; // plain volatile store, not a release operation
}
Regardless of what method you use, the worst performance problem with your code has nothing to do with what type of lock you use, but the fact that you're locking code rather than data.
With that said, there is no reason to roll your own spinlocks like that. Either use pthread_spin_lock if you want a spinlock, or else pthread_mutex_lock or sem_wait (with a binary semaphore) if you want a lock that can yield to other processes when contended. The code you have written is the worst of both worlds in how it uses sched_yield. The call to sched_yield will ensure that the lock waits at least a few milliseconds (and probably a whole scheduling timeslice) in the case where there's both lock contention and cpu load, and it will burn 100% cpu when there's contention but no cpu load (due to the lock-holder being blocked in IO, for instance). If you want to get any of the benefits of a spin lock, you need to be spinning without making any syscalls. If you want any of the benefits of yielding the cpu, you should be using a proper synchronization primitive which will use (on Linux) futex (or equivalent) operations to yield exactly until the lock is available - no shorter and no longer.
And if by chance all that went over your head, don't even think about writing your own locks.
Spin-locks perform better if there is little contention for the lock and/or it is never held for a long period of time. Otherwise you are better off with a lock that blocks rather than spins. There are of course hybrid locks which will spin a few times, and if the lock cannot be acquired, then they will block.
Which is better for you depends on your application. Only you can answer that question.
You didn't look deep enough in the gcc documentation. The correct builtins for such type of lock are __sync_lock_test_and_set and __sync_lock_release. These have exactly the guarantees that you need for such a thing. In terms of the new C11 standard this would be the type atomic_flag with operations atomic_flag_test_and_set and atomic_flag_clear.
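In C11 terms, a sketch of the same critical section with atomic_flag might look like this. The cs_lock and shared_counter names are hypothetical, and the increment just stands in for the "//..." part of the original function.

```c
#include <stdatomic.h>

static atomic_flag cs_lock = ATOMIC_FLAG_INIT;
static long shared_counter;   /* hypothetical data protected by the lock */

static void critical_code(void)
{
    /* Enter: test_and_set returns the previous value, so keep
       spinning while the flag was already set by someone else. */
    while (atomic_flag_test_and_set_explicit(&cs_lock, memory_order_acquire))
        ;                      /* pure spin, no system call */

    shared_counter++;          /* the critical section itself */

    /* Leave: clear with release semantics so the writes above are visible. */
    atomic_flag_clear_explicit(&cs_lock, memory_order_release);
}
```

Unlike the plain assignment in the question, the clear here is a proper release operation, which is the guarantee __sync_lock_release provides in the GCC builtin version.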
As R. already indicates, putting sched_yield into the loop is really a bad idea.
If the code inside the critical section takes only a few cycles, the probability that its execution falls across the boundary of a scheduling slice is small. The number of threads that will be blocked, actively spinning, will be at most the number of processors minus one. None of this holds if you yield execution as soon as you fail to obtain the lock. If you have real contention on your lock and yield, you will have a multitude of context switches, which will bring your system almost to a halt.
As others have pointed out, it's not really about how fast the locking code is. This is because once a lock sequence is initiated using "xchg reg,mem", a lock signal is sent down through the caches and out to the devices on all buses. Only when the last device has acknowledged that it will hold off, which may take hundreds if not a thousand clock cycles, is the actual exchange performed. If your slowest device is a classic PCI card it will have a bus speed of 33 MHz, which is about one hundredth of the CPU's internal clock. And the PCI device (if active) will need several clock cycles (at 33 MHz) to respond. During that time the CPU will be waiting for the acknowledgement to come back.
Most spinlocks are probably used in device drivers where the routine won't be pre-empted by the OS but might be interrupted by a higher-level driver.
A critical section is really just a spin-lock but with interfacing to the OS because it may be pre-empted.

Resources