Mutual exclusion implementation in C for shared memory environments

I would like to implement (in C) a producer/consumer communication mechanism based on shared memory. It replaces a stream-socket communication between a client and a remote server. Nodes in the network share a pool of memory to communicate with each other. The server can write data (produce) into a memory region and the client should read it (consume).
My software currently uses a thread for reading (client side) and a thread for writing (server side). The threads reside on different machines (distributed).
What is the best and fastest way to implement mutual exclusion for access to the shared memory region? (The memory is external to both machines and only referenced.)
The server should atomically produce data (write) if the client is not reading; the client should atomically consume data (read) if the server is not writing.
It is clear I need a pthread-mutex-like mechanism. Threads in this case wait to be unlocked via local kernel interrupts.
Would a pthread implementation also work in this distributed scenario (lock variable placed in shared memory, with the PTHREAD_PROCESS_SHARED attribute set)?
How else could I implement a fast and reliable mutex that makes the client thread and the server thread access the shared region in turn, ensuring data consistency?

So the short answer is: you can use the pthread mutex mechanism as long as pthreads knows about your particular system. Otherwise you'll need to look at the specific hardware/operating system for help.
This long answer is going to be somewhat general because the question does not provide a lot of details about the exact implementation of distributed shared memory that is being used. I will try to explain what is possible, but how to do it will be implementation-dependent.
As @Rod suggests, a producer-consumer system can be implemented with one or more mutex locks, and the question is how to implement a mutex.
A mutex can be considered an object with two states {LOCKED, UNLOCKED} and two atomic operations:
Lock: if state is LOCKED, block until UNLOCKED. Set state to LOCKED and return.
Unlock: set state to UNLOCKED and return.
Often mutexes are provided by the operating system kernel by implementing these operations on an abstract mutex object. For example, some variants of Unix implement mutexes and semaphores as operations on file descriptors. On those systems, pthreads would make use of the kernel facilities.
The advantage of this approach is that user-space programs don't have to care how it's implemented. The disadvantage is that each operation requires a call into the kernel and can therefore be relatively slow compared to the next option:
A mutex can also be implemented as a memory location (let's say 1 byte long) that stores either the value 0 or 1 to indicate UNLOCKED and LOCKED. It can be accessed with standard memory read/write instructions. We can use the following (hypothetical) atomic operations to implement Lock and Unlock:
1. Compare-and-set: if the memory location has the value 0, set it to 1; otherwise fail.
2. Conditional-wait: block until the memory location has the value 0.
3. Atomic write: set the memory location to the value 0.
Generally speaking, #1 and #3 are implemented using special CPU instructions and #2 requires some kernel support. This is pretty much how pthread_mutex_lock is implemented.
This approach provides a speed advantage because a kernel call is necessary only when the mutex is contended (someone else has the lock).
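As a rough illustration of operations #1 and #3 (with a crude busy-wait standing in for #2), here is a minimal sketch of such a user-space lock using C11 atomics. It yields the CPU instead of sleeping in the kernel, so it is not a real futex-based mutex, and the names (my_mutex and friends) are made up for the example.

/* Minimal sketch of a user-space mutex built on C11 atomics.
   On contention it yields instead of sleeping in the kernel
   (a real futex-based lock would block in the kernel instead). */
#include <stdatomic.h>
#include <sched.h>

typedef struct { atomic_int state; } my_mutex;   /* 0 = UNLOCKED, 1 = LOCKED; zero-initialize */

static void my_mutex_lock(my_mutex *m)
{
    int expected = 0;
    /* Compare-and-set: try to move 0 -> 1 atomically. */
    while (!atomic_compare_exchange_weak_explicit(&m->state, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
    {
        expected = 0;      /* a failed CAS overwrote it with the current value */
        sched_yield();     /* crude stand-in for a conditional wait */
    }
}

static void my_mutex_unlock(my_mutex *m)
{
    /* Atomic write: set the state back to UNLOCKED. */
    atomic_store_explicit(&m->state, 0, memory_order_release);
}

Whether plain C11 atomics are even meaningful across machines depends entirely on how the distributed shared memory is implemented; on a single cache-coherent machine they behave as expected.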

Related

POSIX shared memory - method for automatic client notification

I am investigating POSIX shared memory for IPC in place of a POSIX message queue. I plan to make a shared memory area large enough to hold 50 messages of 750 bytes each. The messages will be sent at random intervals from several cores (servers) to one core (client) that receives the messages and takes action based on the message content.
I have three questions about POSIX shared memory:
(1) is there a method for automatic client notification when new data are available, like the methods available with POSIX pipes and message queues?
(2) What problems would arise using shared memory without a lock where the data are write-once, read-once?
(3) I have read that shared memory is the fastest IPC method because it has the highest bandwidth and data become available in both the server and client cores immediately. However, with message queues and pipes the server cores can send the messages and continue with their work without waiting for a lock. Does the need for a lock slow the performance of shared memory relative to message queues and pipes in the type of scenario described above?
(1) There is no automatic mechanism to notify threads/processes that data was written to a memory location. You'd have to use some other mechanism for notifications.
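One such mechanism (just a sketch of the idea, not something from the answer) is to pair the shared memory segment with a named POSIX semaphore that the servers post after each write, so the client can block in sem_wait() instead of polling; an eventfd or a pipe would serve the same purpose. The semaphore name below is made up.

#include <semaphore.h>
#include <fcntl.h>

/* server side, after a message has been written to shared memory: */
void notify_client(void)
{
    sem_t *sem = sem_open("/msg_available", O_CREAT, 0600, 0);
    sem_post(sem);
    sem_close(sem);
}

/* client side, blocking until the next message is available: */
void wait_for_message(void)
{
    sem_t *sem = sem_open("/msg_available", O_CREAT, 0600, 0);
    sem_wait(sem);      /* returns once some server has posted */
    sem_close(sem);
}

In a real program you would open the semaphore once and keep it open rather than reopening it per message.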
(2) You have a multiple-producer/single-consumer (MPSC) setup. Implementing a lockless MPSC queue is not trivial. You would have to pay careful attention to doing atomic compare-and-swap (CAS) operations in the right order with correct memory ordering, and you should know how to avoid false cache-line sharing. See https://en.cppreference.com/w/c/atomic for the atomic operations support in C11 and read up on memory barriers. Another good read is the paper on the Disruptor at http://lmax-exchange.github.io/disruptor/files/Disruptor-1.0.pdf.
(3) Your data size (50*750 bytes) is small. Chances are that it all fits in cache and you'll have no bandwidth issues accessing it. Lock vs. pipe vs. message queue: none of these is free under contention, or when the queue is full or empty.
One benefit of lockless queues is that they can work entirely in user-space. This is a huge benefit when extremely low latency is desired.
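To give a flavor of the atomics involved, here is a hedged sketch of the much simpler single-producer/single-consumer (SPSC) case with C11 atomics; a true multiple-producer queue is considerably harder, as noted above. Sizes and names are invented for the example.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SLOTS 64                      /* power of two, > the 50 messages needed */
#define MSG_SIZE 750

struct spsc_queue {
    char slots[SLOTS][MSG_SIZE];
    atomic_size_t head;               /* next slot to write; touched only by the producer */
    atomic_size_t tail;               /* next slot to read; touched only by the consumer */
};

bool spsc_push(struct spsc_queue *q, const char *msg)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head - tail == SLOTS)
        return false;                 /* full */
    memcpy(q->slots[head % SLOTS], msg, MSG_SIZE);
    /* release: publish the copied message before advancing head */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

bool spsc_pop(struct spsc_queue *q, char *msg)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail == head)
        return false;                 /* empty */
    memcpy(msg, q->slots[tail % SLOTS], MSG_SIZE);
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}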

Shared semaphore between user and kernel spaces

Short version
Is it possible to share a semaphore (or any other synchronization lock) between user space and kernel space? Named POSIX semaphores have kernel persistence, which is why I was wondering whether it is also possible to create and/or access them from kernel context.
Searching the internet didn't help much due to the sea of information on normal usage of POSIX semaphores.
Long version
I am developing a unified interface to real-time systems in which I have some added bookkeeping to take care of, protected by a semaphore. This bookkeeping is done on resource allocation and deallocation, which happens in a non-real-time context.
With RTAI, however, the thread waiting on and posting a semaphore needs to be in real-time context. This means that using RTAI's named semaphores would mean switching between real-time and non-real-time context on every wait/post in user space, and worse, creating a short-lived real-time thread for every semaphore wait in kernel space.
What I am looking for is a way to share a normal Linux or POSIX semaphore between kernel and user spaces so that I can safely wait/post it in non-real-time context.
Any information on this subject would be greatly appreciated. If this is not possible, do you have any other ideas how this task could be accomplished? [1]
[1] One way would be to add a system call, keep the semaphore in kernel space, and have user-space processes invoke that system call so that the semaphore is managed entirely in kernel space. I would be happier if I didn't have to patch the kernel just for this, though.
Well, you were headed in the right direction, but not quite:
Linux named POSIX semaphores are based on the futex, which stands for Fast Userspace muTex. As the name implies, while their implementation is assisted by the kernel, a big chunk of it is done by user code. Sharing such a semaphore between kernel and user space would require re-implementing this infrastructure in the kernel. Possible, but certainly not easy.
SysV semaphores, on the other hand, are implemented completely in the kernel and are only accessible to user space via standard system calls (e.g. semtimedop() and friends).
This means that every SysV-related operation (semaphore creation, taking, or release) is actually implemented in the kernel, and you can simply call the underlying kernel function from your code to take the same semaphore from the kernel when needed.
Thus, your user code will simply call semtimedop(). That's the easy part.
The kernel part is just a little bit more tricky: you have to find the code that implements semtimedop() and related calls in the kernel (they are all in the file ipc/sem.c) and create a replica of each of the functions that does what the original function does, without the calls to copy_from_user(...), copy_to_user(...) and friends.
The reason for this is that those kernel functions expect to be called from a system call with a pointer to a user buffer, while you want to call them with parameters in kernel buffers.
Take for example semtimedop() - the relevant kernel function is sys_semtimedop() in ipc/sem.c (see here: http://lxr.free-electrons.com/source/ipc/sem.c#L1537). If you copy this function into your kernel code, remove the parts that do copy_from_user() and copy_to_user(), and simply use the passed pointers (since you'll call them from kernel space), you'll get kernel-equivalent functions that can take a SysV semaphore from kernel space, alongside user space - so long as you call them from process context in the kernel (if you don't know what this last sentence means, I highly recommend reading Linux Device Drivers, 3rd edition).
Best of luck.
One solution I can think of is to have a /proc (or /sys or whatever) file on a main kernel module where writing 0/1 to it (or read from/write to it) would cause it to issue an up/down on a semaphore. Exporting that semaphore allows other kernel modules to directly access it while user applications would go through the /proc file system.
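For what it's worth, here is a rough, untested sketch of what such a module could look like. It uses the older file_operations-based /proc interface (kernels since 5.6 want struct proc_ops instead), and all names are invented; treat it as an outline rather than working code.

#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/semaphore.h>
#include <linux/uaccess.h>

static struct semaphore my_sem;
EXPORT_SYMBOL(my_sem);               /* other kernel modules can take it directly */

static ssize_t my_sem_write(struct file *f, const char __user *buf,
                            size_t count, loff_t *ppos)
{
    char c;

    if (count < 1 || copy_from_user(&c, buf, 1))
        return -EFAULT;

    if (c == '1') {                  /* writing "1" posts (up) */
        up(&my_sem);
    } else if (c == '0') {           /* writing "0" waits (down) */
        if (down_interruptible(&my_sem))
            return -ERESTARTSYS;
    } else {
        return -EINVAL;
    }
    return count;
}

static const struct file_operations my_sem_fops = {
    .owner = THIS_MODULE,
    .write = my_sem_write,
};

static int __init my_sem_init(void)
{
    sema_init(&my_sem, 0);
    if (!proc_create("my_sem", 0222, NULL, &my_sem_fops))
        return -ENOMEM;
    return 0;
}

static void __exit my_sem_exit(void)
{
    remove_proc_entry("my_sem", NULL);
}

module_init(my_sem_init);
module_exit(my_sem_exit);
MODULE_LICENSE("GPL");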
I'd still wait to see if the original question has an answer.
I'm not really experienced in this by any means, but here's my take. If you look at glibc's implementation of sem_open and sem_wait, it's really just creating a file in /dev/shm, mmap'ing a struct from it, and using atomic operations on it. If you want to access the named semaphore from kernel space, you would probably have to patch the tmpfs subsystem. However, I think this would be difficult, as it wouldn't be straightforward to determine whether a file is meant to be a named semaphore.
An easier way would probably be to just reuse the kernel's semaphore implementation and have the kernel manage the semaphore for user-space processes. To do this, you would write a kernel module which you associate with a device file, then define two ioctls for the device file: one for wait and one for post. Here is a good tutorial on writing kernel modules, including setting up a device file and adding I/O operations for it: http://www.freesoftwaremagazine.com/articles/drivers_linux. I don't know exactly how to implement an ioctl operation, but I think you can just assign a function to the ioctl member of the file_operations struct. Not sure what the function signature should be, but you could probably figure it out by digging around in the kernel source.
As I'm sure you know, even the best working solution to this would likely be very ugly. If I were in your place, I would simply concede the battle and use rendezvous points to sync the processes.
I have read your project's README and I have the following observations. Apologies in advance:
Firstly, there already is a universal interface to real-time systems. It is called POSIX; certainly VxWorks, Integrity and QNX are POSIX compliant, and in my experience there are very few problems with portability if you develop within the POSIX API. Whether POSIX is sane or not is another matter, but it's the one we all use.
[The reason most RTOSes are POSIX compliant is that one of their big markets is defence equipment, and the US DoD won't let you use an OS for their non-IT equipment (e.g. radars) unless it is POSIX compliant... This has pretty much made it commercially impossible to do an RTOS without giving it POSIX.]
Secondly, Linux itself can be made into a pretty good real-time OS by applying the PREEMPT_RT patch set. Of all the RTOSes out there this is probably the best one at the moment from the point of view of making efficient use of all these multi-core CPUs. However, it's not quite as hard-realtime as the others, so it's a quid pro quo.
RTAI takes a different approach of in effect placing their own RTOS underneath Linux and making Linux nothing more than one task running in their OS. This approach is ok up to a point, but the big penalty of RTAI is that the real time bit is now (as far as I can tell) not POSIX compliant (though the API looks like they've just stuck rt_ on the front of some POSIX function names) and interaction with other things is now, as you're discovering, quite complicated.
PREEMPT_RT is a much more intrusive patch set than RTAI, but the payback is that everything else (like POSIX and valgrind) stays completely normal. Plus nice things like Ftrace are available. Bookkeeping is then a case of merely using existing tools, not having to write new ones. Also, it looks like PREEMPT_RT is gradually worming its way into the mainline Linux kernel anyway, which would render other patch sets like RTAI pretty much pointless.
So Linux + PREEMPT_RT gives us realtime POSIX plus a bunch of tools, just like all the other RTOSes out there; commonality across the board. Which kinda sounds like the goal of your project.
I apologise for not helping with the "how" of your project, and it is highly ungentlemanly of me to query the "why" of it too. But I feel it is important to know that there are established things out there that seem to overlap heavily with what you're trying to do. Unseating King POSIX is going to be difficult.
I would like to answer this differently: you don't want to do this. There are good reasons why there is no interface to do this kind of thing and there are good reasons why all other kernel subsystems are designed and implemented to never need a lock shared between user and kernel space. The complexity of lock ordering and implicit locking in unexpected places will quickly get out of hand if you start playing around with userland that can prevent the kernel from doing certain things.
Let me recall a very long debugging session I did around 15 years ago, to shed at least some light on what complex problems you can run into. I was involved in developing a file system where a large portion of the code was in userland, something like FUSE.
The kernel would do a filesystem operation, package it into a message and send it to the userland daemon and wait for a reply. The userland daemon reads the message, does stuff and writes a reply to the kernel which wakes up and continues with the operation. Simple concept.
One thing you need to understand about filesystems is locking. When you're looking up the name of a file, for example "foo/bar", the kernel somehow gets the node for the directory "foo", then locks it and asks it if it has the file "bar". The filesystem code somehow finds "bar", locks it and then unlocks "foo". The locking protocol is quite straightforward (unless you're doing a rename): the parent always gets locked before the child, and the child is locked before the parent lock is released. The lookup message for the file is what would get sent to our userland daemon while the directory was still locked; when the daemon replied, the kernel would proceed to first lock "bar" and then unlock "foo".
I don't even remember the symptoms we were debugging, but I remember the issue was not trivially reproducible; it required hours and hours of filesystem torture programs until it manifested itself. But after a few weeks we figured out what was going on. Let's say that the full path to our file was "/a/b/c/foo/bar". We're in the process of doing a lookup on "bar", which means that we're holding the lock on "foo". The daemon is a normal userland process, so some operations it does can block, and it can be preempted too. It's actually talking over the network, so it can block for a long time. While we're waiting for the userland daemon, some other process wants to look up "foo" for some reason. To do this, it has the node for "c", locked of course, and asks it to look up "foo". It manages to find it and attempts to lock it (it has to be locked before we can release the lock on "c") and waits for the lock on "foo" to be released. Another process comes in and wants to look up "c"; it of course ends up waiting for that lock while holding the lock on "b". Another process waits for "b" and holds "a". Yet another process wants "a" and holds the lock on "/".
This is not a problem, not yet. This sometimes happens in normal filesystems too: locks can cascade all the way up to the root, you wait for a while for a slow disk, the disk responds, the congestion eases up and everyone gets their locks and everything keeps running fine. In our case, though, the reason for holding the lock for a long time was that the remote server for our distributed filesystem didn't respond. X seconds later the userland daemon times out, and just before responding to the kernel that the lookup operation on "bar" has failed, it logs a message to syslog with a timestamp. One of the things that the timestamp needs is the timezone information, so it needs to open "/etc/localtime"; of course, to do that, it needs to start looking up "/etc", and for that it needs to lock "/". "/" is already locked by someone else, so the userland daemon waits for that someone else to unlock "/", while that someone else waits through a chain of five processes and locks for the daemon to respond. The system ends up in a total deadlock.
Now, maybe your code will not have problems like this. You're talking about a real-time system so there might be a level of control you have that normal kernels don't. But I'm not sure if adding an unexpected layer of locking complexity would even let you keep real time properties of the system, or really make sure that nothing you do in userland will ever create a deadlock cascade. If you don't page, if you never touch any file descriptor, if you never do memory operations and a bunch of other things I can't really think of right now you could get away with a lock shared between userland and kernel, but it will be hard and you'll probably find unexpected problems.
Multiple solutions exist in Linux/glibc, but none of them permits explicitly sharing a semaphore between user and kernel space.
The kernel provides solutions to suspend threads/processes, and the most efficient is the futex. Here are some details about the state of the art of the current implementations used to synchronize user-space applications.
SYSV services
The Linux System V (SysV) semaphores are a legacy of the eponymous Unix OS. They are based on system calls to lock/unlock semaphores. The corresponding services are:
semget() to get an identifier
semop() to make operations on the semaphores (e.g. incrementation/decrementation)
semctl() to make some control operations on the semaphores (e.g. destruction)
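As a quick, hedged illustration of how these three calls fit together in user space (not taken from the answer; error handling kept minimal):

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <stdio.h>

union semun { int val; struct semid_ds *buf; unsigned short *array; };

int main(void)
{
    /* Create (or get) a set containing one semaphore. */
    key_t key = ftok("/tmp", 'S');
    int semid = semget(key, 1, IPC_CREAT | 0600);
    if (semid < 0) { perror("semget"); return 1; }

    union semun arg = { .val = 1 };
    semctl(semid, 0, SETVAL, arg);            /* initialize the counter to 1 */

    struct sembuf lock   = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
    struct sembuf unlock = { .sem_num = 0, .sem_op = +1, .sem_flg = 0 };

    semop(semid, &lock, 1);                   /* P(): decrement, may block */
    /* ... critical section ... */
    semop(semid, &unlock, 1);                 /* V(): increment */

    semctl(semid, 0, IPC_RMID);               /* destroy the set */
    return 0;
}

Each of those calls is a system call, which is exactly the overhead discussed below.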
The GLIBC (e.g. version 2.31) does not provide any added value on top of those services; the library service directly invokes the eponymous system call. For example, semop() (in sysdeps/unix/sysv/linux/semtimedop.c) directly invokes the corresponding system call:
int
__semtimedop (int semid, struct sembuf *sops, size_t nsops,
              const struct timespec *timeout)
{
  /* semtimedop wire-up syscall is not exported for 32-bit ABIs (they have
     semtimedop_time64 instead with uses a 64-bit time_t).  */
#if defined __ASSUME_DIRECT_SYSVIPC_SYSCALLS && defined __NR_semtimedop
  return INLINE_SYSCALL_CALL (semtimedop, semid, sops, nsops, timeout);
#else
  return INLINE_SYSCALL_CALL (ipc, IPCOP_semtimedop, semid,
                              SEMTIMEDOP_IPC_ARGS (nsops, sops, timeout));
#endif
}
weak_alias (__semtimedop, semtimedop)
Nowadays, SysV semaphores (as well as the other SysV IPC mechanisms like shared memory and message queues) are considered deprecated because they need a system call for each operation, which slows down the calling processes with systematic context switches. New applications should use the POSIX-compliant services available through the GLIBC.
POSIX services
POSIX semaphores are based on fast user-space mutexes (futexes). The principle is to increment/decrement the semaphore counter in user space with atomic operations as long as there is no contention. But when there is contention (multiple threads/processes want to "lock" the semaphore at the same time), a futex() system call is made, either to wake up waiting threads/processes when the semaphore is "unlocked" or to suspend threads/processes waiting for the semaphore to be released. From a performance point of view, this makes a big difference compared to the above SysV services, which systematically require a system call for any operation. The POSIX services are implemented in the GLIBC for the user-space part of the operations (atomic operations), with a switch into kernel space only when there is contention.
For example, in GLIBC 2.31, the service to lock a semaphore is located in nptl/sem_waitcommon.c. It checks the value of the semaphore to decrement it with an atomic operation (in __new_sem_wait_fast()) and invokes the futex() system call (in __new_sem_wait_slow()) to suspend the calling thread only if the semaphore was equal to 0 before the attempt to decrement it.
static int
__new_sem_wait_fast (struct new_sem *sem, int definitive_result)
{
[...]
  uint64_t d = atomic_load_relaxed (&sem->data);
  do
    {
      if ((d & SEM_VALUE_MASK) == 0)
        break;
      if (atomic_compare_exchange_weak_acquire (&sem->data, &d, d - 1))
        return 0;
    }
  while (definitive_result);
  return -1;
[...]
}
[...]
static int
__attribute__ ((noinline))
__new_sem_wait_slow (struct new_sem *sem, clockid_t clockid,
                     const struct timespec *abstime)
{
  int err = 0;
[...]
  uint64_t d = atomic_fetch_add_relaxed (&sem->data,
                                         (uint64_t) 1 << SEM_NWAITERS_SHIFT);

  pthread_cleanup_push (__sem_wait_cleanup, sem);

  /* Wait for a token to be available.  Retry until we can grab one.  */
  for (;;)
    {
      /* If there is no token available, sleep until there is.  */
      if ((d & SEM_VALUE_MASK) == 0)
        {
          err = do_futex_wait (sem, clockid, abstime);
[...]
The POSIX services based on the futex are, for example:
sem_init() to create a semaphore
sem_wait() to lock a semaphore
sem_post() to unlock a semaphore
sem_destroy() to destroy a semaphore
To manage mutexes (i.e. binary semaphores), it is possible to use the pthread services. They are also based on the futex. For example:
pthread_mutex_init() to create/initialize a mutex
pthread_mutex_lock/unlock() to lock/unlock a mutex
pthread_mutex_destroy() to destroy a mutex
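Tying this back to the first question about PTHREAD_PROCESS_SHARED: here is a hedged sketch of how such a futex-based pthread mutex can be placed in a shm_open()'d region and shared by two processes. The region name and layout are made up, and error handling is omitted.

#include <pthread.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

struct shm_region {
    pthread_mutex_t lock;
    char data[4096];
};

static struct shm_region *map_region(void)
{
    int fd = shm_open("/demo_region", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(struct shm_region));
    return mmap(NULL, sizeof(struct shm_region),
                PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(void)
{
    struct shm_region *r = map_region();

    /* Only one side (say, the server) should run this initialization. */
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&r->lock, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&r->lock);
    /* ... read or write r->data ... */
    pthread_mutex_unlock(&r->lock);
    return 0;
}

Note the caveat: this works because both processes run on the same machine and the futex lives in memory the kernel knows about; it does not carry over to memory shared across machines.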
I was thinking about ways in which the kernel and userland share things directly, i.e. without the syscall/copy-in/copy-out cost. One thing I remembered was the RDMA model, where the kernel writes/reads directly from user space, with synchronization of course. You may want to explore that model and see if it works for your purpose.

Linux Shared Memory Synchronization

I have implemented two applications that share data using the POSIX shared memory API (i.e. shm_open). One process updates data stored in the shared memory segment and another process reads it. I want to synchronize access to the shared memory region using some sort of mutex or semaphore. What is the most efficient way to do this? Some mechanisms I am considering are:
A POSIX mutex stored in the shared memory segment (Setting the PTHREAD_PROCESS_SHARED attribute would be required)
Creating a System V semaphore using semget
Rather than a System V semaphore, I would go with a POSIX named semaphore using sem_open(), etc.
Might as well make this an answer.
You can use sem_init with pshared true to create a POSIX semaphore in your shared memory space. I have used this successfully in the past.
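For reference, a minimal hedged sketch of what that looks like: an unnamed semaphore created with sem_init(..., pshared != 0, ...) in a shared mapping, here an anonymous MAP_SHARED mapping set up before fork(). Error handling is omitted.

#include <semaphore.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    sem_t *sem = mmap(NULL, sizeof(sem_t), PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    sem_init(sem, 1 /* pshared */, 0 /* initial value */);

    if (fork() == 0) {          /* child: consumer */
        sem_wait(sem);          /* blocks until the parent posts */
        _exit(0);
    }

    /* parent: producer */
    /* ... update the shared data ... */
    sem_post(sem);              /* wake the consumer */
    wait(NULL);
    sem_destroy(sem);
    return 0;
}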
As for whether this is faster or slower than a shared mutex and condition variable, only profiling can tell you. On Linux I suspect they are all pretty similar since they rely on the "futex" machinery.
If efficiency is important, I would go with process-shared mutexes and condition variables.
AFAIR, each operation on a semaphore requires a syscall, so an uncontended mutex should be faster than a semaphore [ab]used in a mutex-like manner.
First, really benchmark to find out whether performance matters here. The cost of these things is often overestimated. If you find that access to the control structure is not of the same order of magnitude as the writes themselves, just take whatever construct is semantically best for your use case. That would usually be the case if you write some 100 bytes per access to the control structure.
Otherwise, if the control structure is the bottleneck, you should perhaps avoid using them. C11 has the new concept of _Atomic types and operations that can be used where there are races in access to data. C11 is not yet widely implemented, but probably all modern compilers have extensions that already implement these features.
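As a small, hedged illustration of the _Atomic approach (names invented): the writer publishes data and sets a flag with release ordering, and the reader checks the flag with acquire ordering, so no mutex or semaphore is involved.

#include <stdatomic.h>
#include <stdbool.h>

struct shared {
    int payload;                 /* the actual data */
    atomic_bool ready;           /* publication flag, initially false */
};

void producer(struct shared *s, int value)
{
    s->payload = value;
    /* release: the payload write becomes visible before the flag */
    atomic_store_explicit(&s->ready, true, memory_order_release);
}

bool consumer(struct shared *s, int *out)
{
    /* acquire: if the flag is seen, the payload is visible too */
    if (!atomic_load_explicit(&s->ready, memory_order_acquire))
        return false;
    *out = s->payload;
    return true;
}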

if using shared memory, are there still advantages for processes over threading?

I have written a Linux application in which the main 'consumer' process forks off a bunch of 'reader' processes (~16) which read data from the disk and pass it to the 'consumer' for display. The data is passed over a socket which was created before the fork using socketpair.
I originally wrote it with this process boundary for 3 reasons:
The consumer process has real-time constraints, so I wanted to avoid any memory allocations in the consumer. The readers are free to allocate memory as they wish, or even be written in another language (e.g. with garbage collection), and this doesn't interrupt the consumer, which has FIFO priority. Also, disk access or other IO in the reader process won't interrupt the consumer. I figured that with threads I couldn't get such guarantees.
Using processes will stop me, the programmer, from doing stupid things like using global variables and clobbering other processes' memory.
I figured forking off a bunch of workers would be the best way to utilize multiple CPU architectures, and I figured using processes instead of threads would generally be safer.
Not all readers are always active; however, those that are active are constantly sending large amounts of data. Lately I was thinking that, to optimize this by avoiding the memory copies associated with writing and reading the socket, it would be nice to just read the data directly into a shared memory buffer (shm_open/mmap). Then only an index into this shared memory would be passed over the socket, and the consumer would read directly from it before marking it as available again.
Anyway, one of the biggest benefits of processes over threads is avoiding clobbering another thread's memory space. Do you think that switching to shared memory would destroy any advantages I have in this architecture? Is there still any advantage to using processes in this context, or should I just switch my application to using threads?
Your assumption that you cannot meet your realtime constraints with threads is mistaken. IO or memory allocation in the reader threads cannot stall the consumer thread as long as the consumer thread is not using malloc itself (which could of course lead to lock contention). I would recommend reading what POSIX has to say on the matter if you're unsure.
As for the other reasons to use processes instead of threads (safety, possibility of writing the readers in a different language, etc.), these are perfectly legitimate. As long as your consumer process treats the shared memory buffer as potentially-unsafe external data, I don't think you lose any significant amount of safety by switching from pipes to shared memory.
Yes, exactly for the reason you gave. It's better to have each process's memory protected and to share only what really needs to be shared. That way each process can allocate and use its resources without bothering with locking.
As for the index communication between your tasks, note that you could use an area in your shared memory for that and protect it with a mutex, as that is likely lighter-weight than the socket communication. File descriptor communication (sockets, pipes, files, etc.) always involves the kernel; shared memory with mutex locks or semaphores involves it only when there is contention.
One point to be aware of when programming with shared memory in a multiprocessor environment is to avoid false sharing (false dependencies) between variables. This happens when two unrelated objects share the same cache line: when one is modified it also "dirties" the other, which means that if another processor accesses the other object it triggers a cache synchronisation between the CPUs. This can lead to bad scaling. By aligning the objects to the cache-line size (usually 64 bytes, but it can differ from architecture to architecture) one can easily avoid this, as the sketch below shows.
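A hedged sketch of that alignment advice (C11, names invented): keep the producer's and the consumer's hot fields on separate cache lines so that writes to one do not keep invalidating the other.

#include <stdalign.h>
#include <stdatomic.h>

#define CACHE_LINE 64   /* typical, but architecture-dependent */

struct ring_indices {
    alignas(CACHE_LINE) atomic_size_t head;   /* written by the producer */
    alignas(CACHE_LINE) atomic_size_t tail;   /* written by the consumer */
};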
The main reason I met in my experience to replace processes by threads was efficiency.
If your processes use a lot of code or unshared memory that could be shared when multithreading, then you could win a lot of performance on highly threaded CPUs, like Sun SPARC CPUs with 64 or more threads per CPU. In this case the CPU cache, especially for code, will be used much more efficiently by a multithreaded process (the cache is small on SPARC).
If you see that your software is not running faster on new hardware with more CPU threads, then you should consider multi-threading. Otherwise your arguments for avoiding it seem good to me.
I have not met this issue on Intel processors yet, but it could happen in the future when they add more cores per CPU.

What's the best linux kernel locking mechanism for a specific scenario

I need to solve a locking problem for this scenario:
A multi CPU system.
All of the CPU's use a common (software) resource.
Read only access to the resource is very common. (Processing of incoming network packets)
Write access is a lot less frequent. (Pretty much configuration changes only).
Currently I use the read_lock_bh, write_lock_bh (spinlocks) mechanism.
The problem is that the more CPU's, the more I get soft lockups in a writer context.
I read the concurrency chapter in this book, but couldn't quite understand whether the reader or the writer will get priority when using spin locks.
So the questions are:
Does the Linux spinlock mechanism give priority to the reader, the writer, or neither?
Is there a better mechanism I can use in order to avoid those soft lockups in my scenario, or maybe a way for me to give priority to the writer whenever it tries to obtain the lock, while using my current solution?
Thanks,
Nir
Here's a direct quote from Essential Linux Device Drivers which might be what you're looking for. It seems the part dealing with RCU at the end may be what you're interested in.
Reader-Writer Locks
Another specialized concurrency regulation mechanism is a reader-writer variant of spinlocks. If the usage of a critical section is such that separate threads either read from or write to a shared data structure, but don't do both, these locks are a natural fit. Multiple reader threads are allowed inside a critical region simultaneously.
Reader spinlocks are defined as follows:
rwlock_t myrwlock = RW_LOCK_UNLOCKED;
read_lock(&myrwlock); /* Acquire reader lock */
/* ... Critical Region ... */
read_unlock(&myrwlock); /* Release lock */
However, if a writer thread enters a critical section, other reader or writer threads are not allowed inside. To use writer spinlocks, you would write this:
rwlock_t myrwlock = RW_LOCK_UNLOCKED;
write_lock(&myrwlock); /* Acquire writer lock */
/* ... Critical Region ... */
write_unlock(&myrwlock); /* Release lock */
Look at the IPX routing code present in net/ipx/ipx_route.c for a real-life example of a reader-writer spinlock. A reader-writer lock called ipx_routes_lock protects the IPX routing table from simultaneous access. Threads that need to look up the routing table to forward packets request reader locks. Threads that need to add or delete entries from the routing table acquire writer locks. This improves performance because there are usually far more instances of routing table lookups than routing table updates.
Like regular spinlocks, reader-writer locks also have corresponding irq variants, namely read_lock_irqsave(), read_unlock_irqrestore(), write_lock_irqsave(), and write_unlock_irqrestore(). The semantics of these functions are similar to those of regular spinlocks.
Sequence locks or seqlocks, introduced in the 2.6 kernel, are reader-writer locks where writers are favored over readers. This is useful if write operations on a variable far outnumber read accesses. An example is the jiffies_64 variable discussed earlier in this chapter. Writer threads do not wait for readers who may be inside a critical section. Because of this, reader threads may discover that their entry inside a critical section has failed and may need to retry:
u64 get_jiffies_64(void)   /* Defined in kernel/time.c */
{
    unsigned long seq;
    u64 ret;

    do {
        seq = read_seqbegin(&xtime_lock);
        ret = jiffies_64;
    } while (read_seqretry(&xtime_lock, seq));
    return ret;
}
Writers protect critical regions using write_seqlock() and write_sequnlock().
The 2.6 kernel introduced another mechanism called Read-Copy Update (RCU), which yields improved performance when readers far outnumber writers. The basic idea is that reader threads can execute without locking. Writer threads are more complex. They perform update operations on a copy of the data structure and replace the pointer that readers see. The original copy is maintained until the next context switch on all CPUs to ensure completion of all ongoing read operations. Be aware that using RCU is more involved than using the primitives discussed thus far and should be used only if you are sure that it's the right tool for the job. RCU data structures and interface functions are defined in include/linux/rcupdate.h. There is ample documentation in Documentation/RCU/*.
For an RCU usage example, look at fs/dcache.c. On Linux, each file is associated with directory entry information (stored in a structure called dentry), metadata information (stored in an inode), and actual data (stored in data blocks). Each time you operate on a file, the components in the file path are parsed, and the corresponding dentries are obtained. The dentries are kept cached in a data structure called the dcache, to speed up future operations. At any time, the number of dcache lookups is much more than dcache updates, so references to the dcache are protected using RCU primitives.
Isn't this the sort of use case RCU is designed to handle? See http://lwn.net/Articles/262464/ for a good write-up on its use.
If the work you do while holding the lock is small, you can try a normal (non reader-writer) mutex. It's more efficient.
