Linux Shared Memory Synchronization - c

I have implemented two applications that share data using the POSIX shared memory API (i.e. shm_open). One process updates data stored in the shared memory segment and another process reads it. I want to synchronize access to the shared memory region using some sort of mutex or semaphore. What is the most efficient way of doing this? Some mechanisms I am considering are:
A POSIX mutex stored in the shared memory segment (Setting the PTHREAD_PROCESS_SHARED attribute would be required)
Creating a System V semaphore using semget

Rather than a System V semaphore, I would go with a POSIX named semaphore using sem_open(), etc.

Might as well make this an answer.
You can use sem_init with pshared true to create a POSIX semaphore in your shared memory space. I have used this successfully in the past.
As for whether this is faster or slower than a shared mutex and condition variable, only profiling can tell you. On Linux I suspect they are all pretty similar since they rely on the "futex" machinery.
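The sem_init approach can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: it uses an anonymous MAP_SHARED mapping and fork() instead of the shm_open segment from the question so that it fits in one file, and the names shared_region and run_semaphore_demo are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical layout: the semaphore lives next to the data it guards. */
struct shared_region {
    sem_t ready;   /* posted by the writer once value is valid */
    int   value;
};

/* Forks a writer child; returns the value the parent reads, or -1 on error. */
int run_semaphore_demo(void)
{
    struct shared_region *r = mmap(NULL, sizeof *r, PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (r == MAP_FAILED)
        return -1;

    /* Second argument (pshared) is nonzero: shared between processes. */
    if (sem_init(&r->ready, 1, 0) != 0) {
        munmap(r, sizeof *r);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        sem_destroy(&r->ready);
        munmap(r, sizeof *r);
        return -1;
    }
    if (pid == 0) {              /* child: the writer */
        r->value = 42;
        sem_post(&r->ready);
        _exit(0);
    }

    /* parent: the reader; retry if a signal interrupts the wait */
    while (sem_wait(&r->ready) != 0 && errno == EINTR)
        ;
    int seen = r->value;

    waitpid(pid, NULL, 0);
    sem_destroy(&r->ready);
    munmap(r, sizeof *r);
    return seen;
}
```

Calling run_semaphore_demo() should return the value written by the child. With a real shm_open segment the same sem_init call applies, but only one of the processes should initialize the semaphore.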

If efficiency is important, I would go with process-shared mutexes and condition variables.
AFAIR, each operation with a semaphore requires a syscall, so uncontended mutex should be faster than the semaphore [ab]used in mutex-like manner.
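A process-shared mutex could be set up along these lines. This is a sketch, not a definitive implementation: it again assumes an anonymous shared mapping and fork() in place of shm_open, and shared_counter and run_mutex_demo are names made up for the example.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical layout: mutex and protected data share the same mapping. */
struct shared_counter {
    pthread_mutex_t lock;
    long            count;
};

/* Parent and child each add `iterations` to the counter under the mutex.
   Returns the final count, or -1 on error. */
long run_mutex_demo(int iterations)
{
    struct shared_counter *s = mmap(NULL, sizeof *s, PROT_READ | PROT_WRITE,
                                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (s == MAP_FAILED)
        return -1;

    pthread_mutexattr_t attr;
    if (pthread_mutexattr_init(&attr) != 0 ||
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) != 0 ||
        pthread_mutex_init(&s->lock, &attr) != 0) {
        munmap(s, sizeof *s);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);
    s->count = 0;

    pid_t pid = fork();
    if (pid < 0) {
        pthread_mutex_destroy(&s->lock);
        munmap(s, sizeof *s);
        return -1;
    }

    /* Both processes run this loop against the same mutex. */
    for (int i = 0; i < iterations; i++) {
        pthread_mutex_lock(&s->lock);
        s->count++;
        pthread_mutex_unlock(&s->lock);
    }
    if (pid == 0)
        _exit(0);

    waitpid(pid, NULL, 0);
    long total = s->count;
    pthread_mutex_destroy(&s->lock);
    munmap(s, sizeof *s);
    return total;
}
```

If the mutex works across the process boundary, run_mutex_demo(n) returns 2 * n. A production version would probably also consider robust mutexes, so that a process dying while holding the lock does not block the other one forever.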

First, benchmark to find out whether performance actually matters here. The cost of these things is often overestimated. If access to the control structure does not turn out to cost the same order of magnitude as the writes themselves, just pick whichever construct is semantically the best fit for your use case. That is usually the case if you write some 100 bytes per access to the control structure.
Otherwise, if the control structure is the bottleneck, you should perhaps avoid using one altogether. C11 introduces _Atomic types and operations that can be used where accesses to data would otherwise race. C11 is not yet widely implemented, but probably all modern compilers have extensions that already implement these features.
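As an illustration of the lock-free route, here is a small sketch using <stdatomic.h>. It assumes that atomic_long is lock-free on the target (checked at compile time), because only then is it reasonable to expect the atomic to work across processes; run_atomic_demo is a hypothetical name and the anonymous mapping again stands in for shm_open.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: a lock-free atomic_long, so no hidden per-process lock is used. */
static_assert(ATOMIC_LONG_LOCK_FREE == 2, "atomic_long must be lock-free");

/* Parent and child each increment a shared atomic counter `iterations` times.
   Returns the final value, or -1 on error. */
long run_atomic_demo(int iterations)
{
    atomic_long *counter = mmap(NULL, sizeof *counter, PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (counter == MAP_FAILED)
        return -1;
    atomic_init(counter, 0);

    pid_t pid = fork();
    if (pid < 0) {
        munmap(counter, sizeof *counter);
        return -1;
    }

    /* No mutex: each increment is a single atomic read-modify-write. */
    for (int i = 0; i < iterations; i++)
        atomic_fetch_add(counter, 1);
    if (pid == 0)
        _exit(0);

    waitpid(pid, NULL, 0);
    long total = atomic_load(counter);
    munmap(counter, sizeof *counter);
    return total;
}
```

This only covers a single word of state. Anything larger than one atomic object still needs a protocol on top, or a lock.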


Is volatile necessary for the resource used in a critical section?

I am curious about whether volatile is necessary for the resources used in a critical section. Consider that I have two threads executing on two CPUs, competing for a shared resource. I know I need a locking mechanism to make sure only one thread is performing operations on that shared resource. Below is the pseudocode that will be executed on those two threads.
// Read shared resource.
// Write something to shared resource.
I am wondering if I need to make that shared resource volatile to make sure that when one thread reads the shared resource, it won't just get the value from registers but will actually read from the shared resource. Or maybe I should use accessor functions that make the access to that shared resource volatile, with some memory barrier operations, instead of making the shared resource itself volatile?
I am curious about whether volatile is necessary for the resources used in a critical section. Consider that I have two threads executing on two CPUs, competing for a shared resource. I know I need a locking mechanism to make sure only one thread is performing operations on that shared resource.
Making sure that only one thread accesses a shared resource at a time is only part of what a locking mechanism adequate for the purpose will do. Among other things, such a mechanism will also ensure that all writes to shared objects performed by thread Ti before it releases lock L are visible to all other threads Tj after they subsequently acquire lock L. And that holds in terms of the C semantics of the program, notwithstanding any questions of compiler optimization, register usage, CPU instruction reordering, or similar.
When such a locking mechanism is used, volatile does not provide any additional benefit for making threads' writes to shared objects be visible to each other. When such a locking mechanism is not used, volatile does not provide a complete substitute.
C's built-in (since C11) mutexes provide a suitable locking mechanism, at least when using C's built-in threads. So do pthreads mutexes, Sys V and POSIX semaphores, and various other, similar synchronization objects available in various environments, each with respect to corresponding multithreading systems. These semantics are pretty consistent across C-like multithreading implementations, extending at least as far as Java. The semantic requirements for C's built-in multithreading are described in section of the current (C17) language spec.
volatile is for indicating that an object might be accessed outside the scope of the C semantics of the program. That may happen to produce properties that interact with multithreaded execution in a way that is taken to be desirable, but that is not the purpose or intended use of volatile. If it were, or if volatile were sufficient for such purposes, then we would not also need _Atomic objects and operations.
The previous remarks focus on language-level semantics, and that is sufficient to answer the question. However, inasmuch as the question asks specifically about accessing variables' values from registers, I observe that compilers don't actually have to do anything much multithreading-specific in that area as long as acquiring and releasing locks requires calling functions.
In particular, if an execution E of function f writes to an object o that is visible to other functions or other executions of f, then the C implementation must ensure that that write is actually performed on memory before E evaluates any subsequent function call (such as is needed to release a lock). This is necessary because the value written must be visible to the execution of the called function, regardless of any other threads.
Similarly, if E uses the value of o after return from a function call (such as is needed to acquire a lock) then it must load that value from memory to ensure that it sees the effect of any write that the function may have performed.
The only thing special to multithreading in this regard is that the implementation must ensure that interprocedural analysis optimizations or similar do not subvert the needed memory reads and writes around the lock and unlock functions. In practice, this rarely requires special attention.
The answer is no; volatile is not necessary (assuming the critical-section functions you are using were implemented correctly, and you are using them correctly, of course). Any proper critical-section API's implementation will include the memory-barriers necessary to handle flushing registers, etc, and therefore avoid the need for the volatile keyword.
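To make this concrete, here is a small sketch in which two threads update a plain, non-volatile variable under a pthreads mutex. The names shared_total, counting_worker and run_two_workers are invented for the example.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_total;          /* deliberately NOT volatile */

/* Each worker adds 1 to shared_total n times, always under the lock. */
static void *counting_worker(void *arg)
{
    long n = *(const long *)arg;
    for (long i = 0; i < n; i++) {
        pthread_mutex_lock(&total_lock);
        shared_total++;            /* read-modify-write inside the critical section */
        pthread_mutex_unlock(&total_lock);
    }
    return NULL;
}

/* Runs two workers concurrently; returns the final total, or -1 on error. */
long run_two_workers(long n)
{
    pthread_t a, b;
    shared_total = 0;
    if (pthread_create(&a, NULL, counting_worker, &n) != 0)
        return -1;
    if (pthread_create(&b, NULL, counting_worker, &n) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return shared_total;
}
```

The lock and unlock calls supply both the mutual exclusion and the visibility guarantees, so run_two_workers(n) yields 2 * n without any volatile qualifier.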
volatile is normally used to inform the compiler that data might be changed by something else (an interrupt, DMA, another CPU, ...) so that it does not apply unexpected optimizations.
So in your case you may or may not need it:
If the thread has no while loop that polls the shared resource for a value change, you don't really need volatile.
If the source code contains a wait such as while (shareVal == 0), you need to tell the compiler explicitly with the volatile qualifier.
With two CPUs there may also be a cache issue, where a CPU only reads the value from its cache. Consider configuring the memory attributes of the shared resource properly.
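For the polling case, an alternative to volatile is a C11 atomic flag with release/acquire ordering. Note that this is a swapped-in technique rather than what the answer above describes. It is sketched here because, unlike volatile, it also orders the surrounding data accesses. The names share_val, payload, flag_producer and wait_for_flag are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static atomic_int share_val;   /* the polled flag, modeled on shareVal */
static int payload;            /* ordinary data published before the flag */

static void *flag_producer(void *arg)
{
    (void)arg;
    payload = 123;
    /* release: the write to payload becomes visible before the flag does */
    atomic_store_explicit(&share_val, 1, memory_order_release);
    return NULL;
}

/* Spins until the producer sets the flag; returns the payload seen, or -1. */
int wait_for_flag(void)
{
    pthread_t t;
    payload = 0;
    atomic_store(&share_val, 0);
    if (pthread_create(&t, NULL, flag_producer, NULL) != 0)
        return -1;

    /* acquire: once the flag reads nonzero, payload is safe to read */
    while (atomic_load_explicit(&share_val, memory_order_acquire) == 0)
        sched_yield();

    int seen = payload;
    pthread_join(t, NULL);
    return seen;
}
```

Because the load is atomic, the compiler cannot hoist it out of the loop, which is the effect the volatile qualifier was meant to achieve.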

Do mutexes only function correctly if all relevant threads attempt to acquire the locks they should be acquiring, prior to utilizing a resource?

I'm just learning about locks for the first time prior to taking an OS class for the first time. I originally thought that locks would literally "lock some resource" where you would need to specify the resource (perhaps by pointer to the address of the resource in memory), but after reading through a couple really basic implementations of spin-locks (say, the unix-like training OS "xv6"'s version):
As well as this previous stack overflow question: (What part of memory does a mutex lock? (pthreads))
I think I had it all wrong.
It seems to me instead that locks are effectively just a boolean flag like variable that temporarily (or indefinitely) blocks execution of some code that would utilize a resource, but only where another thread actually also attempts to acquire the lock (where in that second thread attempting to acquire the lock as well, that blocking of the second thread has the side effect of that second thread not being able to utilize the resource until the lock is released by the first thread). So now I'm wondering instead: if a poorly designed thread that uses no mutexes and simply attempts to utilize a resource that another well designed thread held a lock on, is the poorly designed thread able to access the resource regardless (by simply ignoring the mutex -- which I'm now thinking acts as a flag a thread should look at, but has the opportunity to ignore)?
If that's the case, then why do we implement locks as sophisticated boolean variables such that all threads must use the locks as opposed to a lock that instead prevents access to a memory region?
Since I'm relatively new to all this, I appreciate any reasonable terminology edit recommendations if I'm stating my question incorrectly, as well as an answer!
Thank you very much!
--edit, Thank you all for the prompt and helpful responses!
If that's the case, then why do we implement locks as sophisticated boolean variables such that all threads must use the locks as opposed to a lock that instead prevents access to a memory region?
A lot of reasons:
What if the thing you're controlling access to isn't a memory region? What if it's a file or a network connection?
How would the compiler know when it was going to access a region of protected memory? Would the compiler have to assume that any memory access anywhere might synchronize with other threads? That would make many optimizations impossible, including storing possibly shared variables in registers which is pretty critical.
Would hardware have to support locking memory on any granularity? How would it know what memory is associated with an object? Consider a linked list. Would you have to lock every bit of memory associated with that linked list and every object in it? When you add or remove an object from the list, do you have to change what memory is protected? Won't that be both expensive and extremely difficult to use?
How would it know when to release a lock? Say you access some area of memory that needs protection and then later you access some other area of memory. How would the implementation know whether other threads could be allowed to access that area in-between those two accesses? The implementation would need to know whether the code accessing that region was or wasn't relying on a consistent view of the shared state over those two accesses. How could it know that? Get it wrong by keeping the lock and concurrency suffers. Get it wrong by releasing the lock in-between the two accesses, and the code can behave unpredictably.
And so on.
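The point that a mutex only constrains threads that actually try to acquire it can be demonstrated with a small sketch. The main thread holds the lock throughout; a "polite" thread that tries to take the lock is refused, while a "rude" thread that ignores the lock writes to the resource anyway. All names here are invented for the illustration.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t resource_lock = PTHREAD_MUTEX_INITIALIZER;
static int resource;

/* Respects the convention: only writes if it can take the lock. */
static void *polite_writer(void *arg)
{
    int *got_lock = arg;
    if (pthread_mutex_trylock(&resource_lock) == 0) {
        resource = 7;
        *got_lock = 1;
        pthread_mutex_unlock(&resource_lock);
    } else {
        *got_lock = 0;          /* lock is held elsewhere: stay away */
    }
    return NULL;
}

/* Ignores the convention: nothing stops this write. */
static void *rude_writer(void *arg)
{
    (void)arg;
    resource = 99;
    return NULL;
}

/* Returns the value of resource observed while the lock was held the whole
   time by the caller, or -1 on error. */
int run_advisory_demo(int *polite_got_lock)
{
    pthread_t t;
    int seen = -1;

    resource = 1;
    pthread_mutex_lock(&resource_lock);

    if (pthread_create(&t, NULL, polite_writer, polite_got_lock) == 0) {
        pthread_join(t, NULL);
        if (pthread_create(&t, NULL, rude_writer, NULL) == 0) {
            pthread_join(t, NULL);
            seen = resource;    /* changed even though we hold the lock */
        }
    }

    pthread_mutex_unlock(&resource_lock);
    return seen;
}
```

The threads are joined one after another, so the demo itself has no data race. In real code, the rude writer running concurrently with a lock holder is exactly the bug the locking convention exists to prevent.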

Do I need a mutex to protect a int value which could be get/set via sysfs?

Multiple user-space processes could access this value at the same time, so I guess we should use a lock or memory barriers to be safe, but I found quite a lot of Linux driver code that doesn't, or that only protects the write case.
Do we really need a mutex for both read case and write case?
It depends on the CPU and the system on which the code is executed. You can actually do this without synchronization if the operation is atomic. If you are not sure about that, it is better to use a synchronization object. For int/dword values, most of the time people do this without a sync object.
Read this article
and also a similar question: Are C++ Reads and Writes of an int Atomic?

Mutual exclusion implementation in C for shared memory environments

I would like to implement (in C) a producer/consumer communication mechanism based on shared memory. It replaces stream socket communication between a client and a remote server. Nodes in the network share a pool of memory to communicate with each other. The server can write data (produce) into a memory region and the client should read it (consume).
My software currently uses one thread for reading (client side) and one thread for writing (server side). The threads reside on different machines (distributed).
What is the best and fast way to implement a mutual exclusion to access the shared memory region? (memory is external to both machines and just referred)
The server should atomically produce data (write) if client is not reading; client should atomically consume data (read) if server is not writing.
It is clear I need a pthread-mutex-like mechanism. Threads are in this case waiting to be unlocked via local kernel interrupts.
Would a pthread implementation also work in this distributed scenario (lock variable placed in shared memory, with the PTHREAD_PROCESS_SHARED option set)?
How else can I implement a fast and reliable mutex that makes the client thread and the server thread access the shared region in turn, ensuring data consistency?
So the short answer is: you can use pthread mutex mechanism so long as pthreads knows about your particular system. Otherwise you'll need to look at the specific hardware/operating system for help.
This long answer is going to be somewhat general because the question does not provide a lot of details about the exact implementation of distributed shared memory that is being used. I will try to explain what is possible, but how to do it will be implementation-dependent.
As @Rod suggests, a producer-consumer system can be implemented with one or more mutex locks, and the question is how to implement a mutex.
A mutex can be considered an object with two states {LOCKED, UNLOCKED} and two atomic operations:
Lock: if state is LOCKED, block until UNLOCKED. Set state to LOCKED and return.
Unlock: set state to UNLOCKED and return.
Often mutexes are provided by the operating system kernel by implementing these operations on an abstract mutex object. For example, some variants of Unix implement mutexes and semaphores as operations on file descriptors. On those systems, pthreads would make use of the kernel facilities.
The advantage of this approach is that user-space programs don't have to care how it's implemented. The disadvantage is that each operation requires a call into the kernel and therefore it can be relatively slow compared to the next option:
A mutex can also be implemented as a memory location (let's say 1 byte long) that stores either the value 0 or 1 to indicate UNLOCKED and LOCKED. It can be accessed with standard memory read/write instructions. We can use the following (hypothetical) atomic operations to implement Lock and Unlock:
1. Compare-and-set: if the memory location has the value 0, set it to the value 1, otherwise fail.
2. Conditional-wait: block until the memory location has the value 0.
3. Atomic write: set the memory location to the value 0.
Generally speaking, #1 and #3 are implemented using special CPU instructions and #2 requires some kernel support. This is pretty much how pthread_mutex_lock is implemented.
This approach provides a speed advantage because a kernel call is necessary only when the mutex is contended (someone else has the lock).
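The memory-location mutex described above can be sketched with C11 atomics. This is a toy under stated assumptions, not how any particular pthreads implementation is written: compare-and-set maps to atomic_compare_exchange, atomic write maps to atomic_store, and the kernel-assisted conditional wait is replaced by a simple sched_yield() loop. It also only covers threads on one machine, not the distributed case. The toy_* names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* 0 = UNLOCKED, 1 = LOCKED */
typedef struct { atomic_int state; } toy_mutex;

static void toy_lock(toy_mutex *m)
{
    int expected = 0;
    /* Compare-and-set: succeed only if the state is currently 0. */
    while (!atomic_compare_exchange_weak_explicit(&m->state, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed)) {
        expected = 0;     /* a failed exchange overwrote it with the current value */
        sched_yield();    /* stand-in for a real conditional wait in the kernel */
    }
}

static void toy_unlock(toy_mutex *m)
{
    /* Atomic write: publish the critical section, then mark UNLOCKED. */
    atomic_store_explicit(&m->state, 0, memory_order_release);
}

static toy_mutex toy;         /* static storage starts zeroed, i.e. UNLOCKED */
static long toy_total;

static void *toy_worker(void *arg)
{
    long n = *(const long *)arg;
    for (long i = 0; i < n; i++) {
        toy_lock(&toy);
        toy_total++;
        toy_unlock(&toy);
    }
    return NULL;
}

/* Two threads increment toy_total under the toy mutex; returns the total. */
long run_toy_mutex_demo(long n)
{
    pthread_t a, b;
    toy_total = 0;
    if (pthread_create(&a, NULL, toy_worker, &n) != 0)
        return -1;
    if (pthread_create(&b, NULL, toy_worker, &n) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return toy_total;
}
```

In the uncontended case toy_lock completes with a single atomic instruction and no system call, which is the speed advantage mentioned above. On Linux, a real implementation would typically block in the kernel via futex instead of yielding.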

if using shared memory, are there still advantages for processes over threading?

I have written a Linux application in which the main 'consumer' process forks off a bunch of 'reader' processes (~16) which read data from the disk and pass it to the 'consumer' for display. The data is passed over a socket which was created before the fork using socketpair.
I originally wrote it with this process boundary for 3 reasons:
The consumer process has real-time constraints, so I wanted to avoid any memory allocations in the consumer. The readers are free to allocate memory as they wish, or even be written in another language (e.g. with garbage collection), and this doesn't interrupt the consumer, which has FIFO priority. Also, disk access or other IO in the reader process won't interrupt the consumer. I figured that with threads I couldn't get such guarantees.
Using processes will stop me, the programmer, from doing stupid things like using global variables and clobbering other processes' memory.
I figured forking off a bunch of workers would be the best way to utilize multiple CPU architectures, and I figured using processes instead of threads would generally be safer.
Not all readers are always active; however, those that are active are constantly sending large amounts of data. Lately I have been thinking that, to avoid the memory copies associated with writing to and reading from the socket, it would be nice to read the data directly into a shared memory buffer (shm_open/mmap). Then only an index into this shared memory would be passed over the socket, and the consumer would read directly from it before marking it as available again.
Anyways, one of the biggest benefits of processes over threads is to avoid clobbering another thread's memory space. Do you think that switching to shared memory would destroy any advantages I have in this architecture? Is there still any advantage to using processes in this context, or should I just switch my application to using threads?
Your assumption that you cannot meet your realtime constraints with threads is mistaken. IO or memory allocation in the reader threads cannot stall the consumer thread as long as the consumer thread is not using malloc itself (which could of course lead to lock contention). I would recommend reading what POSIX has to say on the matter if you're unsure.
As for the other reasons to use processes instead of threads (safety, possibility of writing the readers in a different language, etc.), these are perfectly legitimate. As long as your consumer process treats the shared memory buffer as potentially-unsafe external data, I don't think you lose any significant amount of safety by switching from pipes to shared memory.
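The scheme from the question, a shared buffer with only an index sent over the socket, might look roughly like this. It is a sketch under assumptions: an anonymous shared mapping replaces shm_open, there is a single reader, and SLOT_COUNT, SLOT_SIZE and run_index_demo are made-up names. The consumer bounds-checks the index and the copy, in line with treating the shared buffer as untrusted input.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SLOT_COUNT 4      /* hypothetical number of buffer slots */
#define SLOT_SIZE  64     /* hypothetical size of each slot in bytes */

/* A forked reader fills one slot and sends its index over a socketpair.
   The consumer copies that slot into `out` and returns the index, or -1. */
int run_index_demo(char *out, size_t out_len)
{
    if (out == NULL || out_len == 0)
        return -1;

    char (*slots)[SLOT_SIZE] = mmap(NULL, SLOT_COUNT * SLOT_SIZE,
                                    PROT_READ | PROT_WRITE,
                                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (slots == MAP_FAILED)
        return -1;

    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        munmap(slots, SLOT_COUNT * SLOT_SIZE);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        munmap(slots, SLOT_COUNT * SLOT_SIZE);
        return -1;
    }
    if (pid == 0) {                       /* reader process */
        close(sv[0]);
        uint32_t idx = 2;
        strcpy(slots[idx], "block from reader");   /* data goes into shared memory */
        ssize_t w = write(sv[1], &idx, sizeof idx); /* only the index is sent */
        _exit(w == (ssize_t)sizeof idx ? 0 : 1);
    }

    close(sv[1]);                         /* consumer process */
    uint32_t idx = SLOT_COUNT;
    int result = -1;
    if (read(sv[0], &idx, sizeof idx) == (ssize_t)sizeof idx && idx < SLOT_COUNT) {
        /* Bounded copy: never trust the slot to be NUL-terminated. */
        size_t n = out_len - 1 < SLOT_SIZE ? out_len - 1 : SLOT_SIZE;
        memcpy(out, slots[idx], n);
        out[n] = '\0';
        result = (int)idx;
    }

    close(sv[0]);
    waitpid(pid, NULL, 0);
    munmap(slots, SLOT_COUNT * SLOT_SIZE);
    return result;
}
```

A fuller version would also send the index back, or flip a flag in the slot, to mark the slot as available again.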
Yes, exactly for the reason you gave. It's better to keep each process's memory protected and share only what really needs to be shared. That way each consumer can allocate and use its own resources without worrying about locking.
As for passing the index between your tasks, note that you could use an area of the shared memory for that as well, protected by a mutex, which is likely to be lighter than socket communication. File-descriptor-based communication (sockets, pipes, files, etc.) always involves the kernel, whereas shared memory with mutexes or semaphores involves it only when there is contention.
One point to be aware of when programming with shared memory in a multiprocessor environment is false dependencies between variables (false sharing). This happens when two unrelated objects share the same cache line. Modifying one also "dirties" the other, so when another processor accesses the other object, it triggers cache synchronisation between the CPUs. This can lead to poor scaling. Aligning the objects to the cache line size (usually 64 bytes, but it differs between architectures) easily avoids that.
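In C11, the alignment can be requested with alignas. The sketch below assumes a 64-byte cache line, which is common but not universal. CACHE_LINE and padded_counters are names invented for the example.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumption: adjust for the target architecture */

/* Two counters written by different CPUs, each on its own cache line. */
struct padded_counters {
    alignas(CACHE_LINE) long produced;   /* touched only by the producer */
    alignas(CACHE_LINE) long consumed;   /* touched only by the consumer */
};

/* Compile-time checks that the layout really separates the two fields. */
static_assert(offsetof(struct padded_counters, consumed) == CACHE_LINE,
              "consumed must start on the next cache line");
static_assert(sizeof(struct padded_counters) == 2 * CACHE_LINE,
              "struct must occupy exactly two cache lines");
```

The struct is four times larger than the unpadded version would be on a typical 64-bit target, which is the price of removing the false dependency.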
In my experience, the main reason to replace processes with threads has been efficiency.
If your processes use a lot of code or unshared memory that could be shared under multithreading, you could gain a lot of performance on highly threaded CPUs such as Sun SPARC CPUs with 64 or more threads per CPU. In that case the CPU cache, especially for code, is used much more efficiently by a multithreaded process (the cache is small on SPARC).
If you see that your software does not run faster on new hardware with more CPU threads, you should consider multithreading. Otherwise, your arguments for avoiding it seem good to me.
I have not run into this issue on Intel processors yet, but it could happen in the future as they add more cores per CPU.
