Modern System Architecture?

What could happen if we used Peterson's solution to the critical section problem on a modern computer? It is my understanding that systems with multiple CPUs can run into difficulty because of the ordering of memory reads and writes with respect to other reads and writes in memory, but is this the problem with most modern systems? Are there any advantages to using semaphores vs. mutex locks?

Hey, interesting question! To answer it, we first need to pin down the terms involved. The critical section is just the part of a program that must not be executed concurrently by more than one of that program's processes or threads at a time. Multiple concurrent accesses are not allowed, which simply means that only one process may be working on the shared resource at any given moment. Typically this "critical section" accesses a resource like a data structure, a device, or a network connection.
Mutual exclusion (the term behind "mutex") describes the requirement that only one concurrent process is in the critical section at a time, so concurrent access to shared data must preserve this mutual exclusion.
So this introduces the problem: how do we keep processes from interfering with each other in these sections? In other words, how do we ensure "atomic access" to the various critical sections by the threads?
There are a few solutions to the critical-section problem, but the one you mention is Peterson's solution, so we will discuss that.
Peterson's algorithm is designed for mutual exclusion and allows two tasks to share a single-use resource, using only shared memory for communication.
In the algorithm, two tasks compete for the critical section; you'll have to look into mutual exclusion, bounded waiting and the other required properties a bit more for a full understanding, but the gist of it is that in Peterson's method a process waits at most one turn to gain entrance to the critical section: it first gives priority to the other task, and once that task has run its critical section to completion, the waiting process is allowed to enter.
That is the original solution proposed.
However, there is no guarantee that this works on today's modern multiprocessor architectures, and it only works for two concurrent tasks. Things get messy on modern computers when it comes to reading and writing because CPUs (and compilers) reorder memory operations and buffer stores, so operations that look sequential in the code can become visible to other processors in a different order unless explicit barriers are used, and that breaks the algorithm's assumptions. I suggest you also take a look at locks. Hope that helps :)
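For reference, here is a minimal sketch of the two-task algorithm in C, written with C11 atomics so that the sequentially consistent loads and stores supply the ordering guarantees that a plain-variable version would lack on modern hardware (the names are illustrative):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool flag[2];   /* flag[i] is set while task i wants to enter */
static atomic_int  turn;      /* which task should wait if both want in     */

void peterson_lock(int id)    /* id is 0 or 1 */
{
    int other = 1 - id;
    atomic_store(&flag[id], true);   /* announce intent                     */
    atomic_store(&turn, other);      /* give the other task priority        */
    /* spin while the other task wants in and it is its turn */
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
        ;                            /* busy wait                           */
}

void peterson_unlock(int id)
{
    atomic_store(&flag[id], false);  /* leave the critical section          */
}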
Can anyone else think of anything to add that I might have missed?

It is my understanding that systems with multiple CPUs can run into difficulty because of the ordering of memory reads and writes with respect to other reads and writes in memory, but is this the problem with most modern systems?
No. Any modern system with "less strict" memory ordering will have ways to make the memory ordering stricter where it matters (e.g. fences).
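As a small illustrative sketch (my addition, not part of the original answer): on a weakly ordered machine you can publish data behind a flag with an explicit release fence, so that the flag store can never become visible before the data store:

#include <stdatomic.h>

static int        payload;   /* ordinary data                                */
static atomic_int flag;      /* readers poll this with an acquire load/fence */

void publish(int value)
{
    payload = value;                                        /* plain store      */
    atomic_thread_fence(memory_order_release);              /* order it first   */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);  /* then raise flag  */
}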
Are there any advantages to using semaphores vs. mutex locks?
Mutexes are typically simpler and faster (in the same way that a boolean is simpler than a counter); but, overhead aside, a mutex is equivalent to a semaphore with a "resource count" of 1.
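A quick sketch of that equivalence (error handling omitted; the names are illustrative): a POSIX semaphore initialised with a count of 1 guards a critical section the same way a mutex does:

#include <pthread.h>
#include <semaphore.h>

static sem_t binary_sem;                                 /* "resource count = 1" */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

void init_binary_sem(void) { sem_init(&binary_sem, 0, 1); }

void with_semaphore(void)
{
    sem_wait(&binary_sem);        /* acquire */
    /* ...critical section... */
    sem_post(&binary_sem);        /* release */
}

void with_mutex(void)
{
    pthread_mutex_lock(&mtx);     /* acquire */
    /* ...critical section... */
    pthread_mutex_unlock(&mtx);   /* release */
}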
What could happen if we used Peterson's solution to the critical section problem on a modern computer?
The big problem here is that most modern operating systems support some kind of multi-tasking (e.g. multiple processes, where each process can have multiple threads), there are usually 100 other processes (just for the OS alone), and modern hardware has power management (where you reduce power consumption by putting CPUs to sleep when they have no useful work to do). This means that (unbounded) spinning/busy waiting is a horrible idea (e.g. you can have N CPUs being wasted spinning/trying to acquire a lock while the task that currently holds the lock isn't running on any CPU because the scheduler decided that 1234 other tasks should get 10 ms of CPU time each).
Instead; to avoid (excessive) spinning you want to ask the scheduler to block your task until/unless the lock actually can be acquired; and (especially for heavily contended locks) you probably want "fairness" (to avoid the risk of timing problems that lead to some tasks being repeatedly lucky while other tasks starve and make no progress).
This ends up being "no spinning", or "brief spinning" (to avoid scheduler overhead in cases where the task holding the lock actually can/does release it quickly); followed by the task being put on a FIFO queue and the scheduler giving the CPU to a different task or putting the CPU to sleep; where, if the lock is released, the scheduler wakes up the first task on the FIFO queue. Of course it's never that simple (e.g. for performance you want to do as much as you can in user-space; and you need special care and cooperation between user-space and kernel to avoid race conditions - the lock being released before a task is put on the wait queue).
Fortunately modern systems also provide simpler ways to implement locks (e.g. "atomic compare and swap"), so there's no need to resort to Peterson's algorithm (even if it's just for insertion/removal of tasks from the real lock's FIFO queue).

Related

Mutex vs busy wait for tcp io

I do not care about being a CPU hog, as I have one thread assigned to each core and the system threads blocked off to their own set. My understanding is that a mutex is of use when other tasks are to run; in this case that is not important, so I am considering having a consumer thread loop on an address in memory, waiting for its value to become non-zero - meaning the single producer thread, which is looping on recv() with TCP_NONBLOCK set, has just deposited information there.
Is my implementation a smart one given my circumstances, or should I be using a mutex or a custom interrupt, even though no other tasks will run?
In addition to the points by @ugoren and the comments by others:
Even if you have a valid use-case for busy-waiting and burning a core, which is admittedly rare, you need to:
Protect the data shared between threads. This is where locks come into play - you need mutual exclusion when accessing any complex shared data structure. People tend to look into lock-free algorithms here, but these are far from obvious, error-prone, and still considered deep black magic. Don't even try these until you have a solid understanding of concurrency.
Notify threads about changed state. This is where you'd use condition variables or monitors. There are other methods too, eventfd(2) on Linux, for example; see the sketch below.
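A minimal sketch of the mutex-plus-condition-variable pattern for "notify threads about changed state" (POSIX threads; the names are illustrative):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static bool data_available = false;

void producer_publish(void)
{
    pthread_mutex_lock(&lock);
    data_available = true;            /* change shared state under the lock */
    pthread_cond_signal(&ready);      /* wake one waiting consumer          */
    pthread_mutex_unlock(&lock);
}

void consumer_wait(void)
{
    pthread_mutex_lock(&lock);
    while (!data_available)           /* loop guards against spurious wakeups */
        pthread_cond_wait(&ready, &lock);
    data_available = false;           /* consume the state                    */
    pthread_mutex_unlock(&lock);
}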
Here are some links for you, to show that it's much harder than you seem to think:
Memory Ordering
Out-of-order execution
ABA problem
Cache coherence
Busy-wait can give you a lower latency and somewhat better performance in some cases.
Letting other threads use the CPU is the obvious reason not to do it, but there are others:
You consume more power. An idle CPU goes into a low-power state, reducing consumption very significantly. Power consumption is a major issue in data centers, and any serious application must not waste power.
If your code runs in a virtual machine (and everything is being virtualized these days), your machine competes for CPU with others. Consuming 100% CPU leaves less for the others, and may cause the hypervisor to give your machine less CPU when it's really needed.
You should always stick to mainstream methods unless there's a good reason not to. In this case, the mainstream is to use select or poll (or epoll). This lets you do other stuff while waiting, if you want, and doesn't waste CPU time. Is the performance difference large enough to justify busy waiting?
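A sketch of that mainstream approach, assuming sockfd is the already-connected socket from the question (error handling trimmed):

#include <poll.h>
#include <sys/types.h>
#include <sys/socket.h>

ssize_t wait_and_recv(int sockfd, void *buf, size_t len)
{
    struct pollfd pfd = { .fd = sockfd, .events = POLLIN };

    /* Sleep in the kernel until data arrives; -1 means no timeout. */
    if (poll(&pfd, 1, -1) < 0)
        return -1;

    if (pfd.revents & POLLIN)
        return recv(sockfd, buf, len, 0);   /* will not block: data is ready */

    return -1;                              /* POLLERR/POLLHUP etc.          */
}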

Implementing critical section

Which way is better and faster for creating a critical section?
With a binary semaphore, between sem_wait and sem_post.
Or with atomic operations:
#include <sched.h>
#include <stdbool.h>

void critical_code(void)
{
    static volatile bool lock = false;

    // Enter critical section
    while (!__sync_bool_compare_and_swap(&lock, false, true)) {
        sched_yield();
    }

    // ...

    // Leave critical section
    lock = false;
}
Regardless of what method you use, the worst performance problem with your code has nothing to do with what type of lock you use, but the fact that you're locking code rather than data.
With that said, there is no reason to roll your own spinlocks like that. Either use pthread_spin_lock if you want a spinlock, or else pthread_mutex_lock or sem_wait (with a binary semaphore) if you want a lock that can yield to other processes when contended. The code you have written is the worst of both worlds in how it uses sched_yield. The call to sched_yield will ensure that the lock waits at least a few milliseconds (and probably a whole scheduling timeslice) in the case where there's both lock contention and cpu load, and it will burn 100% cpu when there's contention but no cpu load (due to the lock-holder being blocked in IO, for instance). If you want to get any of the benefits of a spin lock, you need to be spinning without making any syscalls. If you want any of the benefits of yielding the cpu, you should be using a proper synchronization primitive which will use (on Linux) futex (or equivalent) operations to yield exactly until the lock is available - no shorter and no longer.
And if by chance all that went over your head, don't even think about writing your own locks.
Spin-locks perform better if there is little contention for the lock and/or it is never held for a long period of time. Otherwise you are better off with a lock that blocks rather than spins. There are of course hybrid locks which will spin a few times, and if the lock cannot be acquired, then they will block.
Which is better for you depends on your application. Only you can answer that question.
You didn't look deep enough into the GCC documentation. The correct builtins for this type of lock are __sync_lock_test_and_set and __sync_lock_release. These have exactly the guarantees that you need for such a thing. In terms of the new C11 standard, this would be the type atomic_flag with the operations atomic_flag_test_and_set and atomic_flag_clear.
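In C11 terms, the pure-spin version described here (no syscall inside the loop) would look roughly like this:

#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void spin_lock(void)
{
    /* test_and_set returns the previous value; loop until it was clear */
    while (atomic_flag_test_and_set(&lock_flag))
        ;                                /* spin - no sched_yield() here    */
}

void spin_unlock(void)
{
    atomic_flag_clear(&lock_flag);       /* release with the proper barrier */
}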
As R. already indicates, putting sched_yield into the loop is a really bad idea.
If the code inside the critical section is only a few cycles long, the probability that its execution falls across the boundary of a scheduling slice is small, and the number of threads blocked spinning actively will be at most the number of processors minus one. None of this holds if you yield execution as soon as you fail to obtain the lock immediately: if you have real contention on your lock and you yield, you will get a multitude of context switches, which will bring your system almost to a halt.
As others have pointed out, it's not really about how fast the locking code is. Once a locked sequence is initiated using "xchg reg,mem", a lock signal is sent down through the caches and out to the devices on all buses, and only when the last device has acknowledged that it will hold off - which may take hundreds if not thousands of clock cycles - is the actual exchange performed. If your slowest device is a classic PCI card, it has a bus speed of 33 MHz, about one hundredth of the CPU's internal clock, and the PCI device (if active) will need several of its own clock cycles (at 33 MHz) to respond. During that time the CPU is waiting for the acknowledgement to come back.
Most spinlocks are probably used in device drivers where the routine won't be pre-empted by the OS but might be interrupted by a higher-level driver.
A critical section is really just a spin-lock, but one that interfaces with the OS because it may be pre-empted.

How can I evaluate the performance of a lockless queue?

I have implemented a lockless queue using the hazard pointer methodology explained in http://www.research.ibm.com/people/m/michael/ieeetpds-2004.pdf using GCC CAS instructions for the implementation and pthread local storage for thread local structures.
I'm now trying to evaluate the performance of the code I have written, in particular I'm trying to do a comparison between this implementation and the one that uses locks (pthread mutexes) to protect the queue.
I'm asking this question here because I tried comparing it with the "locked" queue and found that the locked one performs better than the lockless implementation. The only test I tried was creating 4 threads on a 4-core x86_64 machine doing 10,000,000 random operations on the queue, and the locked version is significantly faster than the lockless one.
I want to know if you can suggest an approach to follow, i.e. what kind of operations I should test on the queue and what kind of tools I can use to see where my lockless code is wasting its time.
I also want to understand whether it is possible that the performance is worse for the lockless queue simply because 4 threads are not enough to see a major improvement...
Thanks
First point: lock-free programming doesn't necessarily improve speed. Lock-free programming (when done correctly) guarantees forward progress. When you use locks, it's possible for one thread to crash (e.g., go into an infinite loop) while holding a mutex. When/if that happens, no other thread waiting on that mutex can make any more progress. If that mutex is central to normal operation, you may easily have to restart the entire process before any more work can be done at all. With lock-free programming, no such circumstance can arise. Other threads can make forward progress, regardless of what happens in any one thread1.
That said, yes, one of the things you hope for is often better performance -- but to see it, you'll probably need more than four threads. Somewhere in the range of dozens to hundreds of threads would give your lock-free code a much better chance of showing improved performance over a lock-based queue. To really do a lot of good, however, you not only need more threads, but more cores as well -- at least based on what I've seen so far, with four cores and well-written code, there's unlikely to be enough contention over a lock for lock-free programming to show much (if any) performance benefit.
Bottom line: More threads (at least a couple dozen) will improve the chances of the lock-free queue showing a performance benefit, but with only four cores, it won't be terribly surprising if the lock-based queue still keeps up. If you add enough threads and cores, it becomes almost inevitable that the lock-free version will win. The exact number of threads and cores necessary is hard to predict, but you should be thinking in terms of dozens at a minimum.
1 At least with respect to something like a mutex. Something like a fork-bomb that just ate all the system resources might be able to deprive the other threads of enough resources to get anything done -- but some care with things like quotas can usually prevent that as well.
The question is really what workloads you are optimizing for. If contention is rare, the lock structures on a modern OS are probably not too bad. They mainly use CAS instructions under the hood as long as they are on the fast path. Since these are quite optimized, it will be difficult to beat them with your own code.
Your own implementation can only win substantially in the contended case. Just doing random operations on the queue (you are not very precise in your question) will probably not achieve this if the average queue length is much larger than the number of threads that hack on it in parallel. So you must ensure that the queue stays short, perhaps by introducing a bias in which random operation is chosen when the queue is too long or too short. Then I would also load the system with at least twice as many threads as there are cores. This ensures that wait times (for memory) don't play in favor of the lock-based version.
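To make that concrete, here is a rough benchmark sketch along those lines; queue_push/queue_pop are placeholder stubs standing in for whichever implementation (lock-based or lock-free) is under test, not a real API:

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 8          /* try at least twice the number of cores */
#define OPS      1000000

static void queue_push(int v) { (void)v;  /* replace with the queue under test */ }
static int  queue_pop(void)   { return 0; /* replace with the queue under test */ }

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < OPS; i++) {
        queue_push(i);                /* push/pop pairs keep the queue short */
        (void)queue_pop();
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d threads, %d ops each: %.3f s\n", NTHREADS, OPS, secs);
    return 0;
}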
The best way, in my opinion, is to identify the hotspots around locks in your application by profiling the code, introduce the lockless mechanism, and measure the same thing again. As other posters have already mentioned, there may not be a significant improvement at a lower scale (number of threads, application scale, number of cores), but you might see throughput improvements as you scale up the system. This is because deadlock situations have been eliminated and threads are always making forward progress.
Another way of looking at the advantage of lockless schemes is that, to some extent, they decouple system state from application performance, because there is no kernel/scheduler involvement and much of the code runs in userland, except for the CAS, which is a hardware instruction.
With locks that are heavily contended, threads block and are scheduled only once the locks are obtained, which basically means they are placed at the end of the run queue (for a specific priority level). Inadvertently this ties the application to system state, and the response time for the app now depends on the run queue length.
Just my 2 cents.

if using shared memory, are there still advantages for processes over threading?

I have written a Linux application in which the main 'consumer' process forks off a bunch of 'reader' processes (~16) which read data from the disk and pass it to the 'consumer' for display. The data is passed over a socket which was created before the fork using socketpair.
I originally wrote it with this process boundary for 3 reasons:
The consumer process has real-time constraints, so I wanted to avoid any memory allocations in the consumer. The readers are free to allocate memory as they wish, or even be written in another language (e.g. with garbage collection), and this doesn't interrupt the consumer, which has FIFO priority. Also, disk access or other IO in the reader process won't interrupt the consumer. I figured that with threads I couldn't get such guarantees.
Using processes will stop me, the programmer, from doing stupid things like using global variables and clobbering other processes' memory.
I figured forking off a bunch of workers would be the best way to utilize multi-CPU architectures, and I figured using processes instead of threads would generally be safer.
Not all readers are always active; however, those that are active are constantly sending large amounts of data. Lately I have been thinking of optimizing this by avoiding the memory copies associated with writing and reading the socket: it would be nice to read the data directly into a shared memory buffer (shm_open/mmap). Then only an index into this shared memory would be passed over the socket, and the consumer would read directly from it before marking it as available again.
Anyways, one of the biggest benefits of processes over threads is to avoid clobbering another thread's memory space. Do you think that switching to shared memory would destroy any advantages I have in this architecture? Is there still any advantage to using processes in this context, or should I just switch my application to using threads?
Your assumption that you cannot meet your realtime constraints with threads is mistaken. IO or memory allocation in the reader threads cannot stall the consumer thread as long as the consumer thread is not using malloc itself (which could of course lead to lock contention). I would recommend reading what POSIX has to say on the matter if you're unsure.
As for the other reasons to use processes instead of threads (safety, possibility of writing the readers in a different language, etc.), these are perfectly legitimate. As long as your consumer process treats the shared memory buffer as potentially-unsafe external data, I don't think you lose any significant amount of safety by switching from pipes to shared memory.
Yes, exactly for the reason you gave. It's better to have each process's memory protected and to share only what really needs to be shared. That way each consumer can allocate and use its resources without worrying about the locking.
As for the index communication between your tasks, note that you could use an area of your shared memory for that, protected by a mutex for the accesses, as this is likely less heavyweight than the socket communication. File descriptor communication (sockets, pipes, files, etc.) always involves the kernel; shared memory with mutex locks or semaphores involves it only when there is contention.
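A sketch of that arrangement, with the shared-memory name, the struct layout and the omitted error handling all being illustrative assumptions:

#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

struct shared_region {
    pthread_mutex_t lock;          /* protects ready_index                 */
    size_t          ready_index;   /* the index passed instead of the data */
};

struct shared_region *create_region(void)
{
    int fd = shm_open("/my_region", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(struct shared_region));
    struct shared_region *r = mmap(NULL, sizeof *r,
                                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&r->lock, &attr);   /* lockable from both processes */
    return r;
}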
One point to be aware of when programming with shared memory in a multiprocessor environment is to avoid false sharing between variables. This happens when two unrelated objects share the same cache line: when one is modified it also "dirties" the other, which means that if another processor accesses the other object, it triggers a cache synchronisation between the CPUs. This can lead to bad scaling. By aligning the objects to the cache line size (usually 64 bytes, but it can differ from architecture to architecture) you can easily avoid that.
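A sketch of that alignment trick using C11, assuming a 64-byte cache line:

#include <stdalign.h>

#define CACHE_LINE 64

/* Each hot object gets its own cache line, so a write by one process never
 * dirties the line another process is reading. */
struct counter_slot {
    alignas(CACHE_LINE) unsigned long value;
};

static struct counter_slot slots[16];   /* sizeof(struct counter_slot) == 64 */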
The main reason I have met in my experience to replace processes with threads is efficiency.
If your processes use a lot of code or unshared memory that could be shared via multithreading, then you could gain a lot of performance on highly threaded CPUs like Sun SPARC CPUs with 64 or more hardware threads per CPU. In this case, the CPU cache, especially for code, will be much more efficient with a multithreaded process (the cache is small on SPARC).
If you see that your software is not running faster when running on new hardware with more CPU threads, then you should consider multi-threading. Otherwise, your arguments to avoid it seem good to me.
I have not met this issue on Intel processors yet, but it could happen in the future as they add more cores per CPU.

Is lock free multithreaded programming making anything easier?

I have only read a little about this topic, but it seems that the only benefit is getting around contention problems; it would not have any important effect on the deadlock problem, since the code that is lock-free is so small and fundamental (FIFOs, LIFOs, hash tables) that there was never a deadlock problem there.
So it's all about performance - is this right?
Lock-free programming is (as far as I can see) always about performance, otherwise using a lock is in most cases much simpler, and therefore preferable.
Note however that with lock-free programming you can end up trading deadlock for live-lock, which is a lot harder to diagnose since no tools that I know of are designed to diagnose it (although I could be wrong there).
I'd say, only go down the path of lock-free if you have to; that is, you have a scenario where you have a heavily contended lock that is hurting your performance. (If it ain't broke, don't fix it).
A couple of issues.
We will soon be facing desktop systems with 64, 128 and 256 cores. Parallelism in this domain is unlike our current experience of 2, 4 or 8 cores; the algorithms which run successfully on such small systems will run slower on highly parallel systems due to contention.
In this sense, lock-free is important since it contributes strongly to solving scalability.
There are also some very specific areas where lock-free is extremely convenient, such as the Windows kernel, where there are modes of execution where sleeps of any kind (such as waits) are forbidden, which obviously is very limiting with regard to data structures, but where lock-free provides a good solution.
Also, lock-free data structures often do not have failure modes; they cannot actually fail, where lock-based data structures can of course fail to obtain their locks. Not having to worry about failures simplifies code.
I've written a library of lock free data structures which I'll be releasing soon. I think if a developer can get hold of a well-proven API, then he can just use it - doesn't matter if it's lock-free or not, he doesn't need to worry about the complexity in the underlying implementation - and that's the way to go.
It's also about scalability. In order to get performance gains these days, you'll have to parallelise the problems you're working on so you can scale them across multiple cores - the more, the merrier.
The traditional way of doing this is by locking data structures that require parallel access, but the more threads you can run truly in parallel, the bigger a bottleneck this becomes.
So yes, it is about performance...
For preemptive threading, threads suspended while holding a lock can block threads that would otherwise be making forward progress. Lock-free doesn't have that problem since by Herlihy's definition, some other thread can always make forward progress.
For non-preemptive threading, it doesn't matter that much since even spin lock based solutions are lock-free by Herlihy's definition.
This is about performance - but also about the ability to sustain multi-threaded loads:
locks grant an exclusive access to a portion of code: while a thread has a lock, other threads are spinning (looping while trying to acquire the lock) or blocked, sleeping until the lock is released (which usually happens if spinning lasts too long);
atomic operations grant an exclusive access to a resource (usually a word-sized variable or a pointer) by using uninterruptible intrinsic CPU instructions.
As locks BLOCK other threads' execution, a program is slowed-down.
As atomic operations execute serially (one after another), there is no blocking*.
(*) as long as the number of concurrent CPUs trying to access the same resource does not create a bottleneck - but we don't have enough CPU cores yet to see this as a problem.
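A tiny sketch of such an atomic operation in C11: a shared counter bumped by one uninterruptible read-modify-write, with no lock and no blocking:

#include <stdatomic.h>

static atomic_long hits;          /* shared counter, no lock needed   */

void record_hit(void)
{
    atomic_fetch_add(&hits, 1);   /* a single indivisible instruction */
}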
I have worked on the matter to write a wait-free (lock-free without wait states) Key-Value store for the server I am working on.
Libraries like Tokyo Cabinet (even TC-FIXED, a simple array) rely on locks to preserve the integrity of a database:
"while a writing thread is operating the database, other reading threads and writing threads are blocked" (Tokyo Cabinet documentation)
The results of a test without concurrency (a one-thread test):
SQLite time: 56.4 ms (a B-tree)
TC time: 10.7 ms (a hash table)
TC-FIXED time: 1.3 ms (an array)
G-WAN KV time: 0.4 ms (something new which works, but I am not sure a name is needed)
With concurrency (several threads writing and reading in the same DB), only the G-WAN KV survived the same test because (by contrast with the others) it never ever blocks.
So, yes, this KV store makes it easier for developers, since they do not have to care about threading issues. Making it work this way was not trivial, however.
I believe I saw an article that mathematically proved that any algorithm can be written in a wait-free manner (which basically means that you can be assured of each thread always making progress towards its goal). This means it can be applied to any large-scale application (after all, a program is just an algorithm with many, many parameters), and because wait-freedom ensures that neither dead- nor live-lock occurs within it (as long as it doesn't have bugs which preclude it from being truly wait-free), it does simplify that side of the program. On the other hand, a mathematical proof is a far cry from actually implementing the code itself: AFAIK there isn't even a fully lock-free linked list that can run on PCs; I've seen ones that cover most parts, but they usually either can't handle some common operations, or some operations require the structure to be locked.
On a side note, I've also found another proof that showed any lock-free algorithm can actually be considered wait-free due to the laws of probability and various other factors.
Scalability is a really important issue in efficient multi/many-core programming. The greatest limiting factor is actually the code section that must be executed serially (see Amdahl's Law). However, contention on locks is also very problematic.
Lock-free algorithms address the scalability problem that legacy locking has. So, I would say lock-free is mostly for performance, not for decreasing the possibility of deadlock.
However, keep in mind that with the current x86 architecture, writing a general lock-free algorithm is impossible. This is because we can't atomically exchange data of arbitrary size on current x86 (and this is also true for other architectures, except for Sun's ROCK). So, current lock-free data structures are quite limited and very specialized for specific uses.
I think current lock-free data structures will no longer be used in a decade. I strongly expect a hardware-assisted general lock-free mechanism (yes, that is transactional memory, TM) to be implemented within a decade. If any kind of TM is implemented, then, although it can't perfectly solve the problems of locks, many problems (including priority inversion and deadlock) will be eliminated. However, implementing TM in hardware is still very challenging, and for x86 only a draft has been proposed so far.
Still too long? A two-sentence summary:
Lock-free data structures are not a panacea for lock-based multithreaded programming (even TM is not). If you seriously need scalability and have trouble with lock contention, consider a lock-free data structure.
