I am investigating POSIX shared memory for IPC in place of a POSIX message queue. I plan to make a shared memory area large enough to hold 50 messages of 750 bytes each. The messages will be sent at random intervals from several cores (servers) to one core (client) that receives the messages and takes action based on the message content.
I have three questions about POSIX shared memory:
(1) Is there a method for automatic client notification when new data are available, like the methods available with POSIX pipes and message queues?
(2) What problems would arise using shared memory without a lock where the data are write-once, read-once?
(3) I have read that shared memory is the fastest IPC method because it has the highest bandwidth and data become available in both server and client cores immediately. However, with message queues and pipes the server cores can send the messages and continue with their work without waiting for a lock. Does the need for a lock slow the performance of shared memory relative to message queues and pipes in the type of scenario described above?
(1) There is no automatic mechanism to notify threads/processes that data was written to a memory location. You'd have to use some other mechanism for notifications.
(2) You have a multiple-producer/single-consumer (MPSC) setup. Implementing a lockless MPSC queue is not trivial. You would have to pay careful attention to doing atomic compare-and-swap (CAS) operations in the right order with the correct memory ordering, and you should know how to avoid false sharing of cache lines. See https://en.cppreference.com/w/c/atomic for the atomic operations support in C11 and read up about memory barriers. Another good read is the paper on Disruptor at http://lmax-exchange.github.io/disruptor/files/Disruptor-1.0.pdf.
(3) Your data size (50*750) is small. Chances are that it all fits in cache and you'll have no bandwidth issues accessing it. Lock vs. pipe vs. message queue: none of these is free at times of contention and when the queue is full or empty.
One benefit of lockless queues is that they can work entirely in user-space. This is a huge benefit when extremely low latency is desired.
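To make point (2) more concrete, here is a minimal sketch of a bounded MPSC queue built on C11 atomics. It follows the well-known per-slot sequence-number scheme rather than anything from the Disruptor paper itself. The names (`mpsc_queue_t`, `mpsc_push`, `mpsc_pop`) are made up for illustration, the slot count is rounded up from 50 to 64 so that index masking works, and a 64-byte cache line is assumed. Placing the structure in a POSIX shared memory segment additionally assumes that `atomic_size_t` is lock-free on your platform.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MSG_SIZE  750
#define QUEUE_LEN 64   /* assumption: power of two >= 50 so masking works */

typedef struct {
    _Alignas(64) atomic_size_t seq;   /* slot ticket; each slot starts on its own line */
    size_t len;
    unsigned char data[MSG_SIZE];
} slot_t;

typedef struct {
    _Alignas(64) atomic_size_t head;  /* next position a producer may claim */
    _Alignas(64) size_t tail;         /* next position to read; single consumer only */
    slot_t slots[QUEUE_LEN];
} mpsc_queue_t;

void mpsc_init(mpsc_queue_t *q) {
    atomic_init(&q->head, 0);
    q->tail = 0;
    for (size_t i = 0; i < QUEUE_LEN; i++)
        atomic_init(&q->slots[i].seq, i);
}

/* Called by any producer. Returns false if the queue is full. */
bool mpsc_push(mpsc_queue_t *q, const void *msg, size_t len) {
    if (len > MSG_SIZE) return false;
    size_t pos = atomic_load_explicit(&q->head, memory_order_relaxed);
    for (;;) {
        slot_t *s = &q->slots[pos & (QUEUE_LEN - 1)];
        size_t seq = atomic_load_explicit(&s->seq, memory_order_acquire);
        ptrdiff_t diff = (ptrdiff_t)(seq - pos);
        if (diff == 0) {
            /* Slot is free: try to claim position pos with a CAS. */
            if (atomic_compare_exchange_weak_explicit(&q->head, &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed)) {
                memcpy(s->data, msg, len);
                s->len = len;
                /* Release: the payload becomes visible before the ticket. */
                atomic_store_explicit(&s->seq, pos + 1, memory_order_release);
                return true;
            }
            /* CAS failed: pos now holds the current head, so retry. */
        } else if (diff < 0) {
            return false;             /* consumer has not freed this slot yet */
        } else {
            pos = atomic_load_explicit(&q->head, memory_order_relaxed);
        }
    }
}

/* Called by the single consumer. Returns false if nothing is ready. */
bool mpsc_pop(mpsc_queue_t *q, void *out, size_t *len) {
    size_t pos = q->tail;
    slot_t *s = &q->slots[pos & (QUEUE_LEN - 1)];
    if (atomic_load_explicit(&s->seq, memory_order_acquire) != pos + 1)
        return false;                 /* empty, or the producer is still writing */
    memcpy(out, s->data, s->len);
    *len = s->len;
    /* Hand the slot back to producers for the next lap. */
    atomic_store_explicit(&s->seq, pos + QUEUE_LEN, memory_order_release);
    q->tail = pos + 1;
    return true;
}
```

Note that a producer that stalls after claiming a slot holds up the consumer at that slot, and the consumer still has to poll or be woken by some other mechanism, as point (1) says.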
Related
Let's suppose that there are two threads, A and B. There is also a shared array: float X[100].
Thread A writes to the array one element at a time, in order. Every 10 steps it updates a shared variable index (in a safe way) that indicates the current index, and it also sends a signal to thread B.
As soon as thread B receives the signal, it reads index in a safe way, and then proceeds to read the elements of X up to position index.
Is it safe to do this? Does thread A really update the array, or just a copy in cache?
Every sane way of one thread sending a signal to another provides the assurance that anything written by a thread before sending a signal is guaranteed to be visible to a thread after it receives that signal. So as long as you sent the signal through some means that provided this guarantee, which they pretty much all do, you are safe.
Note that attempting to use a condition variable without a predicate protected by a mutex is not a sane way of one thread sending a signal to another! Among other things, it doesn't guarantee that the thread that you think received the signal actually received the signal. You do need to make sure the thread that does the reads in fact received the very signal sent by the thread that does the writes.
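As an illustration of a condition variable used with a mutex-protected predicate, here is a minimal sketch of the scenario from the question. The predicate is "index has advanced past what the reader has already seen". The names `publish_index`, `wait_past` and `run_demo` are invented for this example.

```c
#include <pthread.h>

static float X[100];

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             index;   /* elements X[0..index-1] are ready to read */
} progress_t;

static progress_t progress = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0
};

/* Thread A: update the predicate under the mutex, then signal. */
void publish_index(int new_index) {
    pthread_mutex_lock(&progress.lock);
    progress.index = new_index;
    pthread_cond_signal(&progress.cond);
    pthread_mutex_unlock(&progress.lock);
}

/* Thread B: block until index has moved past `seen`; return the new index.
 * The while loop re-checks the predicate, so spurious wakeups are harmless. */
int wait_past(int seen) {
    pthread_mutex_lock(&progress.lock);
    while (progress.index <= seen)
        pthread_cond_wait(&progress.cond, &progress.lock);
    int upto = progress.index;
    pthread_mutex_unlock(&progress.lock);
    return upto;
}

static void *writer(void *arg) {
    (void)arg;
    for (int i = 0; i < 100; i++) {
        X[i] = (float)i;
        if ((i + 1) % 10 == 0)
            publish_index(i + 1);      /* every 10 elements, as in the question */
    }
    return NULL;
}

/* Runs writer (thread A) and reads as thread B; returns the sum of X. */
float run_demo(void) {
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    int seen = 0;
    float sum = 0.0f;
    while (seen < 100) {
        int upto = wait_past(seen);
        for (int i = seen; i < upto; i++)
            sum += X[i];               /* only elements published before the signal */
        seen = upto;
    }
    pthread_join(t, NULL);
    return sum;
}
```

The reader never touches elements at or beyond the index it obtained under the mutex, so it never races with the writer's later stores.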
Is it safe to do this?
Provided your data modification is made safe and protected by critical sections, locks or similar, this kind of access is perfectly safe as far as the hardware is concerned.
Does thread A really update the array, or just a copy in cache?
Just a copy in cache. Most caches are presently write-back and just write data back to memory when a line is ejected from the cache if it has been modified. This largely improves memory bandwidth, especially in a multicore context.
BUT all happens as if the memory had been updated.
For shared-memory processors, there are generally cache coherency protocols (except in some processors for real-time applications). The basic idea of these protocols is that a state is associated with every cache line.
The state describes the status of the line in the caches of the different processors.
These states indicate, for instance, whether the line is present only in the current cache, is shared by several caches, is in sync with memory, is invalid, and so on. See for instance this description of the popular MESI cache coherence protocol.
So what happens, when a cache line is written and is also present in another processor?
Thanks to the state, the cache knows that one or more other processors also have a copy of the line, and it will send an invalidate signal. The line will be invalidated in the other caches, and when they want to read or write it, they have to reload its content. In practice, this reload will be served by the cache that has the valid copy, to limit memory accesses.
This way, whilst data is only written in the cache, the behavior is similar to a situation where data would have been written to memory.
BUT, although the hardware ensures that the transfer is functionally correct, one must take the existence of the cache into account to avoid performance degradation.
Assume cache A is updating a line and cache B is reading it. Whenever cache A writes, the line in cache B is invalidated. And whenever cache B wants to read it, if the line has been invalidated, it must fetch it from cache A. This can lead to many transfers of the line between the caches and make the memory system inefficient.
So concerning your example, signalling every 10 elements is probably not a good idea, and you should use information about the caches to improve the exchanges between sender and receiver.
For instance, if you are on a Pentium with 64-byte cache lines, you should declare X as
_Alignas(64) float X[100];
This way the starting address of X will be a multiple of 64 and will line up with cache line boundaries. The _Alignas specifier has existed since C11, and by including stdalign.h you can also write alignas(64). Before C11, most compilers offered extensions for aligned placement.
And of course, you should tell thread B to read data only when a full 64-byte line (16 floats) has been written.
This way, when thread B accesses the data, the cache line will no longer be modified by thread A, and only one initial transfer between caches A and B will take place. This reduction in the number of transfers between the caches may have a significant impact on performance, depending on your program.
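The batching rule above can be sketched as follows, assuming a 64-byte line and 4-byte floats. The helper names `publishable_count` and `x_is_line_aligned` are hypothetical; the idea is only that the writer advertises an index that is a multiple of 16 (or the full array length once the last, partial line is done).

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE      64                            /* assumed line size in bytes */
#define FLOATS_PER_LINE (CACHE_LINE / sizeof(float))  /* 16 when float is 4 bytes */

_Alignas(CACHE_LINE) float X[100];

/* Given how many elements have been written so far, return how many the
 * writer should advertise so that the reader only touches finished lines. */
size_t publishable_count(size_t written) {
    if (written >= 100)
        return 100;               /* the final partial line is complete */
    return written - (written % FLOATS_PER_LINE);
}

/* Sanity check that the array really starts on a cache line boundary. */
int x_is_line_aligned(void) {
    return ((uintptr_t)X % CACHE_LINE) == 0;
}
```

For example, after 37 elements have been written, only the first 32 (two full lines) would be advertised.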
If you're using a variable that tracks readiness to read the index, the variable is protected by a mutex, and the signalling is done via a pthread condition variable that thread B waits on under the mutex, then yes.
If you're using POSIX signals, then I believe you need a synchronization mechanism on top of that. Writing to an atomic variable with memory_order_release in thread A, and reading it with memory_order_acquire in thread B should guarantee in the most lightweight fashion that writes in A preceding the write to the atomic should be visible in B after it has read the atomic.
For best performance, the array sharing should also be done in such a way that the shared parts of the array do not cross cache-line boundaries (or else your performance might degrade due to false sharing).
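Here is a minimal sketch of the release/acquire pairing mentioned above. To keep it short, the reader simply polls the atomic instead of waiting for a signal, which is an assumption of this example and not part of the original scheme. The names `published` and `consume_all` are illustrative.

```c
#include <pthread.h>
#include <stdatomic.h>

static float X[100];
static atomic_int published;   /* X[0..published-1] are safe to read; starts at 0 */

/* Thread A: plain stores to X, then a release store of the new index. */
static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < 100; i++) {
        X[i] = i * 0.5f;
        if ((i + 1) % 10 == 0)
            atomic_store_explicit(&published, i + 1, memory_order_release);
    }
    return NULL;
}

/* Thread B: an acquire load of the index makes every write to X that
 * preceded the matching release store visible here. Returns the sum of X. */
float consume_all(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    int seen = 0;
    float sum = 0.0f;
    while (seen < 100) {
        int upto = atomic_load_explicit(&published, memory_order_acquire);
        for (; seen < upto; seen++)
            sum += X[seen];
    }
    pthread_join(t, NULL);
    return sum;
}
```

X itself does not need to be atomic, because the reader only touches elements below the index it acquired.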
I have a block of shared memory that multiple processes access.
To this block of memory, I have one process that writes/updates information (which I'm calling a Publisher), and I have more than one process that is reading this data (which I'm calling Subscribers).
This leads me to believe that, because I don't want the Subscribers to read in the middle of a write/update from the Publisher, I need to implement access control, to guarantee that the data currently in shared memory is fully updated before the Subscribers take it (no reading in the middle of a write).
This is the behavior I'm trying to design:
Publisher may modify shared memory, but only when no other Subscriber is currently reading from the memory.
Any Subscriber may read from shared memory, so long as the Publisher is not currently modifying it.
Subscribers may not modify shared memory, only read; therefore, Subscribers are allowed to read concurrently (assuming the Publisher is not modifying the shared memory).
The first solution I thought of is a simple mutex, or semaphore of size 1. This would mean that every time the Subscribers want to fetch new information, they would need to wait for the memory to be updated by the Publisher. However, this has the unintended consequences of Subscribers having to wait for other Subscribers, and the possibility that the Publisher gets delayed or locked out of the ability to publish new data if enough Subscribers exist on the system.
The second solution I thought of was looking into shm, where I found SHM_LOCK and SHM_UNLOCK. These seem useful for enforcing the Publisher and Subscriber roles, but otherwise they just seem to reinforce what each role can do, not necessarily when it can do it.
Alternatively, I have the reverse situation elsewhere, where the Subscribers from above become Publishers, each of which may or may not set a block of shared memory to a specific value. (They are not guaranteed to write to the block of memory, but the value is guaranteed to be the same across Publishers if they do write.) The Publisher from above becomes a Subscriber.
Addendum:
Each Publisher and Subscriber is an individual process.
'Shared memory' in my question represents multiple different caches of memory, not a single unit. I do not want all shared memory locked out from Subscriber(s) when my Publisher(s) issue an update to just one of N data units.
The Publisher (from the first part) is a daemon. My logic is that I want the daemon to be doing a timely action, putting data somewhere; I don't want the daemon disturbed to any great extent by Subscribers.
My questions:
Is there a control scheme that can properly encode the logic above?
(Publisher sets and removes access, Subscribers read when accessible.)
In this context, are there better methods of publishing information to multiple processes? Or is shared memory the way to go in this situation?
What you need is referred to as a read-write lock.
These are natively supported by pthreads through the pthread_rwlock_* functions in pthread.h. Normally pthreads would be used for threads.
In the case of multiple processes you could implement a read-write lock with semaphores. With a little more reading and research, it should be easy enough to figure out the rest on your own.
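As an alternative to building the lock from semaphores, POSIX also allows a pthread read-write lock to be marked process-shared and placed in the shared memory itself. The following is a minimal sketch under some assumptions: it uses an anonymous shared mapping and fork() to stay self-contained (a real Publisher/Subscriber pair would map a named segment instead), MAP_ANONYMOUS is a widely available extension and not part of POSIX, and all function names are made up for this example.

```c
#define _DEFAULT_SOURCE   /* glibc: expose MAP_ANONYMOUS alongside POSIX.1-2008 */
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef struct {
    pthread_rwlock_t lock;   /* lives inside the shared region */
    int value;               /* stands in for one of the N data units */
} shared_block_t;

shared_block_t *shared_block_create(void) {
    shared_block_t *b = mmap(NULL, sizeof *b, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (b == MAP_FAILED) return NULL;
    pthread_rwlockattr_t attr;
    pthread_rwlockattr_init(&attr);
    /* Without this attribute the lock is only valid within one process. */
    pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    int rc = pthread_rwlock_init(&b->lock, &attr);
    pthread_rwlockattr_destroy(&attr);
    if (rc != 0) { munmap(b, sizeof *b); return NULL; }
    b->value = 0;
    return b;
}

/* Publisher: exclusive access while updating. */
void publish(shared_block_t *b, int v) {
    pthread_rwlock_wrlock(&b->lock);
    b->value = v;
    pthread_rwlock_unlock(&b->lock);
}

/* Subscriber: shared access, so several readers can hold the lock at once. */
int read_value(shared_block_t *b) {
    pthread_rwlock_rdlock(&b->lock);
    int v = b->value;
    pthread_rwlock_unlock(&b->lock);
    return v;
}

/* Child process publishes, parent reads back. Returns -1 on setup failure. */
int rwlock_demo(void) {
    shared_block_t *b = shared_block_create();
    if (b == NULL) return -1;
    int v = -1;
    pid_t pid = fork();
    if (pid == 0) {
        publish(b, 42);
        _exit(0);
    }
    if (pid > 0) {
        waitpid(pid, NULL, 0);
        v = read_value(b);
    }
    pthread_rwlock_destroy(&b->lock);
    munmap(b, sizeof *b);
    return v;
}
```

Using one such lock per data unit would keep an update to one unit from blocking readers of the others. Whether waiting writers get priority over new readers is left to the implementation.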
Normally, you need two mutexes for that (or, more exactly, two condition variables, which can share the same mutex). The reason is that guarding access with a single complex condition is prone to a problem where readers continuously overlap and block access for writers. With two condition variables, you can give priority to the queue of writers and stop new readers from acquiring the resource while a writer is waiting. I'm assuming here that the number of writers is far smaller than the number of readers; otherwise you can hit the opposite problem, where overlapping writers block the readers.
The most flexible approach is probably to let writers and readers take turns (readers can run in parallel) using a flip-flop, preparing the switch as soon as there is a worker waiting for access on the other side.
Anyway, take a look at the read-write lock suggested in the other responses.
I would like to implement (in C) a producer/consumer communication mechanism based on shared memory. It replaces stream socket communication between a client and a remote server. Nodes in the network share a pool of memory to communicate with each other. The server can write data (produce) in a memory region and the client should read it (consume).
My software currently uses a thread for reading (client side) and a thread for writing (server side). The threads reside on different machines (distributed).
What is the best and fast way to implement a mutual exclusion to access the shared memory region? (memory is external to both machines and just referred)
The server should atomically produce data (write) if client is not reading; client should atomically consume data (read) if server is not writing.
It is clear I need a pthread-mutex-like mechanism. Threads are in this case waiting to be unlocked via local kernel interrupts.
Would a pthread implementation also work in this distributed scenario (lock variable placed in shared memory, option PTHREAD_PROCESS_SHARED set)?
How can I differently implement a fast and reliable mutex which makes client thread and server thread access the shared region in turn, ensuring data consistency?
So the short answer is: you can use pthread mutex mechanism so long as pthreads knows about your particular system. Otherwise you'll need to look at the specific hardware/operating system for help.
This long answer is going to be somewhat general because the question does not provide a lot of details about the exact implementation of distributed shared memory that is being used. I will try to explain what is possible, but how to do it will be implementation-dependent.
As @Rod suggests, a producer-consumer system can be implemented with one or more mutex locks, and the question is how to implement a mutex.
A mutex can be considered an object with two states {LOCKED, UNLOCKED} and two atomic operations:
Lock: if state is LOCKED, block until UNLOCKED. Set state to LOCKED and return.
Unlock: set state to UNLOCKED and return.
Often mutexes are provided by the operating system kernel by implementing these operations on an abstract mutex object. For example, some variants of Unix implement mutexes and semaphores as operations on file descriptors. On those systems, pthreads would make use of the kernel facilities.
The advantage of this approach is that user-space programs don't have to care how it's implemented. The disadvantage is that each operation requires a call into the kernel and therefore it can be relatively slow compared to the next option:
A mutex can also be implemented as a memory location (let's say 1 byte long) that stores either the value 0 or 1 to indicate UNLOCKED and LOCKED. It can be accessed with standard memory read/write instructions. We can use the following (hypothetical) atomic operations to implement Lock and Unlock:
Compare-and-set: if the memory location has the value 0, set it to the value 1, otherwise fail.
Conditional-wait: block until the memory location has the value 0.
Atomic write: set the memory location to the value 0.
Generally speaking, #1 and #3 are implemented using special CPU instructions and #2 requires some kernel support. This is pretty much how pthread_mutex_lock is implemented.
This approach provides a speed advantage because a kernel call is necessary only when the mutex is contended (someone else has the lock).
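The memory-location mutex described above can be sketched with C11 atomics. One simplification to note: this sketch has no kernel-assisted conditional wait, so a contended Lock just yields the CPU and retries. That is an assumption made to keep the example portable, and it is not how pthread_mutex_lock behaves. The names `spin_mutex_t`, `spin_lock` and `spin_unlock` are illustrative.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

typedef struct {
    atomic_int state;   /* 0 = UNLOCKED, 1 = LOCKED */
} spin_mutex_t;

/* Lock: compare-and-set 0 -> 1; on failure, yield and try again. */
void spin_lock(spin_mutex_t *m) {
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(&m->state, &expected, 1,
               memory_order_acquire, memory_order_relaxed)) {
        expected = 0;    /* the failed CAS overwrote it with the current value */
        sched_yield();   /* stand-in for the conditional wait */
    }
}

/* Unlock: atomic write of 0 with release ordering. */
void spin_unlock(spin_mutex_t *m) {
    atomic_store_explicit(&m->state, 0, memory_order_release);
}

static spin_mutex_t m;   /* zero-initialized, so it starts UNLOCKED */
static long counter;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        spin_lock(&m);
        counter++;       /* protected by the lock */
        spin_unlock(&m);
    }
    return NULL;
}

/* Four threads increment the counter under the lock; returns the total. */
long run_workers(void) {
    pthread_t t[4];
    counter = 0;
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return counter;
}
```

The uncontended path never enters the kernel, which is where the speed advantage comes from. Whether such atomic operations behave correctly on a given distributed shared memory system is exactly the implementation-dependent part.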
I have a single-producer, multiple-consumer program with a thread for each role. I am thinking of implementing a circular buffer for TCP on each of the consumers, letting the producer keep pointers to the circular buffers' memory and then hand out pointer space for TCP to offload data into.
My problem: how do the consumer threads know when data is in?
I am thinking of busy-waiting, checking the pointer location for something other than 0; I don't mind being a CPU hog.
I should mention that each thread is pinned with cpuset and is soft real-time under SCHED_FIFO, and of course everything is implemented in C.
In my experience, the problem with multiple-consumer data structures is handling concurrency properly while avoiding false sharing and excessive waste of CPU cycles.
So if your problem allows it, I would use pipe to create a pipe to each consumer and put items into these pipes in round-robin fashion. The consumers can then use epoll to watch the file descriptors. This avoids having to implement and optimize a concurrent data structure, and you won't burn CPU cycles needlessly. The cost is that you have to go through syscalls.
If you want to do everything yourself with polling to avoid syscalls, you can build a circular buffer, but you have to make sure that only one consumer reads an item at a time and only after the item has been written. Usually this is done with 4 pointers and proper mutexes.
This article about Xen's I/O ringbuffers might be of interest.
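The pipe-plus-epoll suggestion can be sketched as below for a single consumer on Linux. The helper names `consumer_watch` and `consumer_next` are invented for this example, the item is a plain int, and the one-second timeout is arbitrary. A real producer would hold one pipe per consumer and rotate over the write ends.

```c
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an epoll instance that watches the read end of a pipe.
 * Returns the epoll fd, or -1 on error. */
int consumer_watch(int read_fd) {
    int epfd = epoll_create1(0);
    if (epfd < 0) return -1;
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = read_fd };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, read_fd, &ev) != 0) {
        close(epfd);
        return -1;
    }
    return epfd;
}

/* Sleep until an item arrives (or 1000 ms pass), then read it.
 * Returns 0 on success, -1 on timeout or error. */
int consumer_next(int epfd, int *item) {
    struct epoll_event ev;
    int n = epoll_wait(epfd, &ev, 1, 1000);
    if (n <= 0) return -1;
    ssize_t got = read(ev.data.fd, item, sizeof *item);
    return got == (ssize_t)sizeof *item ? 0 : -1;
}

/* Single-threaded walk-through: the "producer" writes one item,
 * the "consumer" is woken by epoll and reads it back. */
int pipe_demo(void) {
    int fds[2];
    if (pipe(fds) != 0) return -1;
    int item = 7, got = -1;
    int epfd = consumer_watch(fds[0]);
    if (epfd >= 0) {
        if (write(fds[1], &item, sizeof item) == (ssize_t)sizeof item)
            consumer_next(epfd, &got);
        close(epfd);
    }
    close(fds[0]);
    close(fds[1]);
    return got;
}
```

Writes of this size are well below PIPE_BUF, so each item is written to the pipe atomically.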
I have written a Linux application in which the main 'consumer' process forks off a bunch of 'reader' processes (~16) which read data from the disk and pass it to the 'consumer' for display. The data is passed over a socket which was created before the fork using socketpair.
I originally wrote it with this process boundary for 3 reasons:
The consumer process has real-time constraints, so I wanted to avoid any memory allocations in the consumer. The readers are free to allocate memory as they wish, or even be written in another language (e.g. with garbage collection), and this doesn't interrupt the consumer, which has FIFO priority. Also, disk access or other IO in the reader process won't interrupt the consumer. I figured that with threads I couldn't get such guarantees.
Using processes will stop me, the programmer, from doing stupid things like using global variables and clobbering other processes' memory.
I figured forking off a bunch of workers would be the best way to utilize multiple CPU architectures, and I figured using processes instead of threads would generally be safer.
Not all readers are always active, however, those that are active are constantly sending large amounts of data. Lately I was thinking that to optimize this by avoiding memory copies associated with writing and reading the socket, it would be nice to just read the data directly into a shared memory buffer (shm_open/mmap). Then only an index into this shared memory would be passed over the socket, and the consumer would read directly from it before marking it as available again.
Anyways, one of the biggest benefits of processes over threads is to avoid clobbering another thread's memory space. Do you think that switching to shared memory would destroy any advantages I have in this architecture? Is there still any advantage to using processes in this context, or should I just switch my application to using threads?
Your assumption that you cannot meet your realtime constraints with threads is mistaken. IO or memory allocation in the reader threads cannot stall the consumer thread as long as the consumer thread is not using malloc itself (which could of course lead to lock contention). I would recommend reading what POSIX has to say on the matter if you're unsure.
As for the other reasons to use processes instead of threads (safety, possibility of writing the readers in a different language, etc.), these are perfectly legitimate. As long as your consumer process treats the shared memory buffer as potentially-unsafe external data, I don't think you lose any significant amount of safety by switching from pipes to shared memory.
Yes, exactly for the reason you gave. It's better to have each process's memory protected and to share only what really needs to be shared. That way each consumer can allocate and use its own resources without bothering with locking.
As for passing the index between your tasks, note that you could use an area in your shared memory for that, with a mutex guarding the accesses, as this is likely less heavy than socket communication. File descriptor communication (sockets, pipes, files, etc.) always involves the kernel; shared memory with mutex locks or semaphores involves it only when there is contention.
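One way to hand the index over through shared memory is sketched below. It uses a process-shared POSIX semaphore in place of the mutex mentioned above, since a semaphore also wakes the waiting consumer; that substitution is this example's choice. The struct and function names are hypothetical, and an anonymous shared mapping with fork() stands in for the shm_open/mmap segment from the question.

```c
#define _DEFAULT_SOURCE   /* glibc: expose MAP_ANONYMOUS alongside POSIX.1-2008 */
#include <semaphore.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef struct {
    sem_t filled;        /* posted by the reader once buffer_index is valid */
    int   buffer_index;  /* which shared buffer holds fresh data */
} handoff_t;

/* A forked "reader" publishes index 5; the "consumer" (parent) waits on
 * the semaphore and picks it up. Returns the index, or -1 on failure. */
int handoff_demo(void) {
    handoff_t *h = mmap(NULL, sizeof *h, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (h == MAP_FAILED) return -1;
    /* Second argument 1 = semaphore is shared between processes. */
    if (sem_init(&h->filled, 1, 0) != 0) {
        munmap(h, sizeof *h);
        return -1;
    }
    int got = -1;
    pid_t pid = fork();
    if (pid == 0) {
        h->buffer_index = 5;     /* write the index first ... */
        sem_post(&h->filled);    /* ... then announce it */
        _exit(0);
    }
    if (pid > 0) {
        if (sem_wait(&h->filled) == 0)
            got = h->buffer_index;
        waitpid(pid, NULL, 0);
    }
    sem_destroy(&h->filled);
    munmap(h, sizeof *h);
    return got;
}
```

With 16 readers you would need a small queue of indices, not a single slot, but the post/wait pattern stays the same.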
One point to be aware of when programming with shared memory in a multiprocessor environment is to avoid false dependencies between variables. This happens when two unrelated objects share the same cache line. When one is modified, it also "dirties" the other, which means that if another processor accesses the other object, it will trigger a cache synchronisation between the CPUs. This can lead to bad scaling. By aligning the objects to the cache line size (usually 64 bytes, but it can differ from architecture to architecture), one can easily avoid that.
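A minimal sketch of that alignment, assuming a 64-byte cache line; the struct names are made up for illustration. Each counter is forced onto its own line, so a process updating one does not invalidate the line holding the other.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumption: check your target architecture */

/* alignas on the member raises the alignment of the whole struct,
 * and its size is padded up to a multiple of that alignment. */
struct padded_counter {
    alignas(CACHE_LINE) unsigned long value;
};

/* Two unrelated counters that would otherwise sit in the same line. */
struct shared_stats {
    struct padded_counter produced;   /* written by the reader processes */
    struct padded_counter consumed;   /* written by the consumer process */
};
```

The cost is memory: each counter now occupies a full cache line.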
In my experience, the main reason to replace processes with threads has been efficiency.
If your processes use a lot of code or unshared memory that could be shared under multithreading, then you could gain a lot of performance on highly threaded CPUs such as Sun SPARC CPUs with 64 or more threads per CPU. In this case, the CPU cache, especially for the code, will be used much more efficiently by a multithreaded process (the cache is small on SPARC).
If you see that your software is not running faster when running on new hardware with more CPU threads, then you should consider multi-threading. Otherwise, your arguments to avoid it seem good to me.
I have not encountered this issue on Intel processors yet, but it could happen in the future when they add more cores per CPU.