C, XPMEM and locks

I'm trying to share a portion of the virtual memory between different processes using XPMEM.
This segment of memory contains a shared data structure and I would like to use a lock to order accesses to it.
Can I use an existing lock available in C that avoids busy-waiting on a single variable?
If not, what kind of lock should I implement to reduce the impact of cache-line ping-pong?
Thanks a lot.

You will probably have to implement your own locks, because existing locks (pthread mutexes) probably won't work, for the same reason that POSIX semaphores and pthread condition variables won't work. The limitation comes from how XPMEM maps pages from one application into another. The mapping is done by page frame number (PFN) only, meaning there are no user pages on the side that called xpmem_attach on the memory region. This causes problems for any code that calls futex_wake, because it relies on get_user_pages_fast returning page info. If a page does not exist (as is the case with a memory segment returned by xpmem_attach), futex_wake will return EFAULT and will not wake any thread waiting on the memory region.
I am trying to figure out if there is a way to work around this issue.
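A minimal sketch of one such hand-written lock follows: a test-and-test-and-set spinlock built only on C11 atomics, so it never enters the futex path described above. Waiters spin on plain loads, which is one common way to limit cache-line ping-pong, although it does not remove it. Everything here is an assumption made for illustration: the names (struct shared_region, region_lock, region_unlock, demo_two_processes) are hypothetical, an anonymous shared mapping inherited across fork() stands in for the segment returned by xpmem_attach, and atomic_int is assumed to be lock-free on the target.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Atomics shared between processes are only safe when they are lock-free. */
static_assert(ATOMIC_INT_LOCK_FREE == 2, "atomic_int must be lock-free");

/* Hypothetical layout of the shared segment: a lock word plus the data. */
struct shared_region {
    atomic_int lock;   /* 0 = free, 1 = held */
    long counter;      /* data protected by the lock */
};

static void region_lock(struct shared_region *r)
{
    for (;;) {
        /* Test: wait on plain loads so the cache line can stay shared. */
        while (atomic_load_explicit(&r->lock, memory_order_relaxed) != 0)
            sched_yield();
        /* Test-and-set: only now issue the write that claims the line. */
        if (atomic_exchange_explicit(&r->lock, 1, memory_order_acquire) == 0)
            return;
    }
}

static void region_unlock(struct shared_region *r)
{
    atomic_store_explicit(&r->lock, 0, memory_order_release);
}

/* Two processes increment the shared counter; returns the final value.
 * The anonymous shared mapping is a stand-in for an XPMEM segment. */
static long demo_two_processes(int iterations)
{
    struct shared_region *r = mmap(NULL, sizeof *r, PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (r == MAP_FAILED)
        return -1;
    atomic_init(&r->lock, 0);
    r->counter = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(r, sizeof *r);
        return -1;
    }
    for (int i = 0; i < iterations; i++) {   /* runs in parent and child */
        region_lock(r);
        r->counter++;
        region_unlock(r);
    }
    if (pid == 0)
        _exit(0);
    waitpid(pid, NULL, 0);

    long total = r->counter;
    munmap(r, sizeof *r);
    return total;
}
```

Calling demo_two_processes(100000) should return 200000 if the lock works. The sched_yield() call is a design choice for the case where both processes may share a core; on dedicated cores a CPU pause hint would be the usual alternative.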

Related

Do mutexes only function correctly if all relevant threads attempt to acquire the locks they should be acquiring, prior to utilizing a resource?

I'm just learning about locks for the first time prior to taking an OS class for the first time. I originally thought that locks would literally "lock some resource" where you would need to specify the resource (perhaps by pointer to the address of the resource in memory), but after reading through a couple really basic implementations of spin-locks (say, the unix-like training OS "xv6"'s version):
http://pages.cs.wisc.edu/~skobov/cs537/P3/xv6/kernel/spinlock.h
http://pages.cs.wisc.edu/~skobov/cs537/P3/xv6/kernel/spinlock.c
As well as this previous stack overflow question: (What part of memory does a mutex lock? (pthreads))
I think I had it all wrong.
It seems to me instead that locks are effectively just a boolean flag like variable that temporarily (or indefinitely) blocks execution of some code that would utilize a resource, but only where another thread actually also attempts to acquire the lock (where in that second thread attempting to acquire the lock as well, that blocking of the second thread has the side effect of that second thread not being able to utilize the resource until the lock is released by the first thread). So now I'm wondering instead: if a poorly designed thread that uses no mutexes and simply attempts to utilize a resource that another well designed thread held a lock on, is the poorly designed thread able to access the resource regardless (by simply ignoring the mutex -- which I'm now thinking acts as a flag a thread should look at, but has the opportunity to ignore)?
If that's the case, then why do we implement locks as sophisticated boolean variables such that all threads must use the locks as opposed to a lock that instead prevents access to a memory region?
Since I'm relatively new to all this, I appreciate any reasonable terminology edit recommendations if I'm stating my question incorrectly, as well as an answer!
Thank you very much!
--edit, Thank you all for the prompt and helpful responses!
If that's the case, then why do we implement locks as sophisticated boolean variables such that all threads must use the locks as opposed to a lock that instead prevents access to a memory region?
A lot of reasons:
What if the thing you're controlling access to isn't a memory region? What if it's a file or a network connection?
How would the compiler know when it was going to access a region of protected memory? Would the compiler have to assume that any memory access anywhere might synchronize with other threads? That would make many optimizations impossible, including storing possibly shared variables in registers which is pretty critical.
Would hardware have to support locking memory on any granularity? How would it know what memory is associated with an object? Consider a linked list. Would you have to lock every bit of memory associated with that linked list and every object in it? When you add or remove an object from the list, do you have to change what memory is protected? Won't that be both expensive and extremely difficult to use?
How would it know when to release a lock? Say you access some area of memory that needs protection and then later you access some other area of memory. How would the implementation know whether other threads could be allowed to access that area in-between those two accesses? The implementation would need to know whether the code accessing that region was or wasn't relying on a consistent view of the shared state over those two accesses. How could it know that? Get it wrong by keeping the lock and concurrency suffers. Get it wrong by releasing the lock in-between the two accesses, and the code can behave unpredictably.
And so on.
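To make the "flag that a thread can ignore" point concrete, here is a small sketch using a pthread mutex. The names (balance, polite_thread, rogue_thread, demo_advisory_lock) are made up for this example. One thread holds the lock; a cooperative thread that asks for the lock is refused, while a thread that never asks simply writes to the data.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t balance_lock = PTHREAD_MUTEX_INITIALIZER;
static int balance = 100;

/* Cooperative thread: asks for the lock and backs off when it is taken. */
static void *polite_thread(void *result)
{
    int rc = pthread_mutex_trylock(&balance_lock);
    if (rc == 0) {
        balance += 1;
        pthread_mutex_unlock(&balance_lock);
    }
    *(int *)result = rc;   /* EBUSY while another thread holds the lock */
    return NULL;
}

/* Uncooperative thread: never looks at the lock, so nothing stops it. */
static void *rogue_thread(void *unused)
{
    (void)unused;
    balance = -1;
    return NULL;
}

/* Runs both threads while the caller holds the lock.
 * Returns the balance observed afterwards. */
static int demo_advisory_lock(int *polite_rc)
{
    pthread_t t;
    int rc;

    balance = 100;
    pthread_mutex_lock(&balance_lock);

    rc = pthread_create(&t, NULL, polite_thread, polite_rc);
    assert(rc == 0);
    pthread_join(t, NULL);   /* polite thread was refused, balance untouched */

    rc = pthread_create(&t, NULL, rogue_thread, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(t, NULL);   /* rogue thread wrote despite the held lock */

    int seen = balance;
    pthread_mutex_unlock(&balance_lock);
    return seen;
}
```

Here demo_advisory_lock returns -1 and the polite thread reports EBUSY: the mutex protected the data only from the thread that chose to consult it.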

Is memcpy() a sleeping function?

I would like to copy the contents of an array without using a for loop. The copy is made while holding a spinlock.
Is there any chance that memcpy() can sleep?
Things that might happen with memcpy (or with really any memory access in general):
If part of the source or destination is inaccessible (invalid) memory, memcpy could crash your process, which might leave a shared spinlock in a bad state.
If part of the source memory needs to be paged in, memcpy can block while the kernel grabs the memory for you.
If part of the source or destination is memory-mapped to I/O, memcpy might block while the kernel performs that I/O. (In extreme cases, like memory-mapped network files, memcpy might block indefinitely).
The kernel is also free to swap your process out at any point during the copy, which means the copy could take arbitrarily long to actually complete.
However, memcpy does not do anything that a regular memory access wouldn't do. So, using it with a spinlock should be safe (as safe as accessing the memory normally would be, anyway).
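As a rough illustration, assuming a user-space pthread spinlock (the question does not say which spinlock API is in use), a short memcpy inside the critical section could look like the sketch below. The names samples, samples_publish and samples_snapshot are hypothetical, and the buffer is deliberately tiny so the lock is held only briefly.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define N_SAMPLES 8

static pthread_spinlock_t samples_lock;
static int samples[N_SAMPLES];   /* small array protected by samples_lock */

/* Must be called once before any other function below. Returns 0 on success. */
static int samples_init(void)
{
    return pthread_spin_init(&samples_lock, PTHREAD_PROCESS_PRIVATE);
}

/* Replace the protected array with the caller's N_SAMPLES values. */
static void samples_publish(const int *src)
{
    assert(src != NULL);
    pthread_spin_lock(&samples_lock);
    memcpy(samples, src, sizeof samples);   /* plain memory accesses only */
    pthread_spin_unlock(&samples_lock);
}

/* Copy the protected array out into the caller's buffer. */
static void samples_snapshot(int *dst)
{
    assert(dst != NULL);
    pthread_spin_lock(&samples_lock);
    memcpy(dst, samples, sizeof samples);
    pthread_spin_unlock(&samples_lock);
}
```

Both buffers here are ordinary, already-touched memory, which keeps the chance of a page fault inside the critical section low; that is the condition under which the reasoning above applies.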
I detect some inconsistency in your question. Let me explain.
A spinlock, or a busy lock in general, keeps the process (or thread) that is waiting for the lock running, without releasing the CPU to another process (or thread). This means very fast unlocking and rescheduling when the lock is freed, but a very expensive model for long wait times.
That said, if you are using a spinlock, the reason must be that the loop the process or thread uses to check whether the lock has been freed should not execute more than three or four times; otherwise the CPU is wasted checking over and over whether the lock has been freed.
This strongly discourages doing blocking operations like the one you ask about. (It is unusual for a memory copy to have to deal with a non-present resource, such as a memory page that is not resident, but when it does, the thread spinning on your lock will go into a loop of millions of checks.)
Spinlocks were designed to protect very small chunks of memory, where the critical section amounts to at most two or three memory accesses. In that case a spinlock solves the problem, since spinning briefly is far faster than putting the thread to sleep and waking it up again. But this is at odds with the use of the memcpy(3) function, as it is a general copy function that allows large memory copies in one shot. This means the time the resource is locked by one thread can amount to millions of checks by another thread (on a different core, which is another reason to use a spinlock: when a different core will only have to check the lock two or three times before seeing it unlocked).
In my opinion, the only use for a spinlock is to protect a semaphore's counter, or to protect access to a condition variable or a mutex, never to protect a general memory copy or a large resource. In those cases, it is better to use a normal, sleeping lock. If you plan to use memcpy(3), the only thing I can assume is that you use the lock to protect large amounts of memory while they are copied into, and that is better handled with a semaphore or a mutex.
In modern kernels, waking a process is so efficient that user-mode spinlocks are rarely worthwhile.
In conclusion, my guess is that the question to consider is not whether to use memcpy() on the protected shared memory region, but whether to use a spinlock for the protection at all. In most cases it will be a waste of resources and will make your system heavier and slower.

Modify read-only memory at low overhead

Assume that I have a page of memory that is read-only (e.g., set through mmap/mprotect). How do I modify one word (8 bytes) on this page at the lowest possible overhead?
Some context: I assume x86-64, Linux as my runtime environment. The modifications happen rarely but frequently enough so that I have to worry about overhead. The page is read-only to protect some important data that must be read by the program frequently against rogue/illegal modifications. There are only a few places that are allowed to modify the data on the page, and I know all the locations of these places and the address of the page statically. The problem I'm trying to solve is protecting some data against memory safety bugs in the program with a few authorized places where I need to make modifications to the data. The modifications are not frequent but frequent enough so that several kernel round trips (through system calls) are too costly.
So far, I thought of the following solutions:
mprotect
ptrace
shared memory
new system call
mprotect
mprotect(addr, 4096, PROT_WRITE | PROT_READ);
addr[12] = 0xc0fec0fe;
mprotect(addr, 4096, PROT_READ);
The mprotect solution is clean, simple, and straight-forward. Unfortunately, it involves two round trips into the kernel and will result in some overhead. In addition, the whole page will be writable during that time frame, allowing for some other thread to modify that memory area concurrently.
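For reference, a self-contained version of the mprotect approach might look like the sketch below. The helper names guarded_page_create and guarded_write are invented for this example, and the page is assumed to come from an anonymous private mmap; the question does not say how the real page is allocated.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate one zero-filled page that is read-only from the start.
 * Returns NULL on failure. */
static uint64_t *guarded_page_create(void)
{
    long size = sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, (size_t)size, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Write one 8-byte word: open the page, store, close the page again.
 * Costs two system calls per write. Returns 0 on success, -1 on error. */
static int guarded_write(uint64_t *page, size_t index, uint64_t value)
{
    long size = sysconf(_SC_PAGESIZE);

    assert(page != NULL);
    if (index >= (size_t)size / sizeof *page)
        return -1;
    if (mprotect(page, (size_t)size, PROT_READ | PROT_WRITE) != 0)
        return -1;
    page[index] = value;   /* the whole page is writable in this window */
    return mprotect(page, (size_t)size, PROT_READ);
}
```

A call such as guarded_write(page, 12, 0xc0fec0fe) mirrors the three-line snippet above; any other write to the page outside this helper faults with SIGSEGV.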
ptrace
Unfortunately, ptracing yourself is no longer possible (as a ptraced process needs to be stopped). So the solution is to fork, ptrace the child process, then use PTRACE_POKETEXT to write to the child process's memory.
This option has the drawback of spawning a parent process and will result in problems if the tracee uses multiple processes. The overhead per write is at least one ptrace system call plus the required synchronization between the processes.
shared memory
Shared memory is similar to the ptrace solution except that it avoids the system call. Both processes set up shared memory with different permissions (RW in the child, R in the parent). The two processes still need to synchronize on each write, which is then carried out by the parent. Shared memory has similar drawbacks in complexity as the ptrace solution and incompatibilities with multiple communicating processes.
new system call
Adding a new system call to the kernel would solve the problem and would only require a single system call to modify one word in the process without having to change the page tables or the requirement to set up multiple communicating processes.
Is there anything that is faster than the 4 discussed/sketched solutions? Could I rely on any debug features? Are there any other neat low-level systems tricks?

Mutual exclusion implementation in C for shared memory environments

I would like to implement (in C) a producer/consumer communication mechanism based on shared memory. It replaces a stream socket communication between a client and a remote server. Nodes in the network share a pool of memory to communicate with each other. The server can write data (produce) in a memory region and the client should read it (consume).
My software currently uses a thread for reading (client side) and a thread for writing (server side). The threads reside on different machines (distributed).
What is the best and fast way to implement a mutual exclusion to access the shared memory region? (memory is external to both machines and just referred)
The server should atomically produce data (write) if client is not reading; client should atomically consume data (read) if server is not writing.
It is clear I need a pthread mutex-like mechanism. Threads are in this case waiting to be unlocked via local kernel interrupts.
Would a pthread implementation also work in this distributed scenario (lock variable placed in shared memory, PTHREAD_PROCESS_SHARED option set)?
How can I differently implement a fast and reliable mutex which makes client thread and server thread access the shared region in turn, ensuring data consistency?
So the short answer is: you can use pthread mutex mechanism so long as pthreads knows about your particular system. Otherwise you'll need to look at the specific hardware/operating system for help.
This long answer is going to be somewhat general because the question does not provide a lot of details about the exact implementation of distributed shared memory that is being used. I will try to explain what is possible, but how to do it will be implementation-dependent.
As @Rod suggests, a producer-consumer system can be implemented with one or more mutex locks, and the question is how to implement a mutex.
A mutex can be considered an object with two states {LOCKED, UNLOCKED} and two atomic operations:
Lock: if state is LOCKED, block until UNLOCKED. Set state to LOCKED and return.
Unlock: set state to UNLOCKED and return.
Often mutexes are provided by the operating system kernel by implementing these operations on an abstract mutex object. For example, some variants of Unix implement mutexes and semaphores as operations on file descriptors. On those systems, pthreads would make use of the kernel facilities.
The advantage of this approach is that user-space programs don't have to care how it's implemented. The disadvantage is that each operation requires a call into the kernel and therefore it can be relatively slow compared to the next option:
A mutex can also be implemented as a memory location (let's say 1 byte long) that stores either the value 0 or 1 to indicate UNLOCKED and LOCKED. It can be accessed with standard memory read/write instructions. We can use the following (hypothetical) atomic operations to implement Lock and Unlock:
1. Compare-and-set: if the memory location has the value 0, set it to the value 1, otherwise fail.
2. Conditional-wait: block until the memory location has the value 0.
3. Atomic write: set the memory location to the value 0.
Generally speaking, #1 and #3 are implemented using special CPU instructions and #2 requires some kernel support. This is pretty much how pthread_mutex_lock is implemented.
This approach provides a speed advantage because a kernel call is necessary only when the mutex is contended (someone else has the lock).
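On Linux, the three operations above can be sketched with C11 atomics and the futex system call. This is only an illustration under stated assumptions, not how glibc actually implements pthread_mutex_lock: the type word_mutex and the functions below are hypothetical names, the lock word is a 32-bit int rather than one byte because futex operates on 32-bit words, and this simple two-state version issues a wake call on every unlock. Real implementations keep track of waiters so that the uncontended path makes no kernel call at all.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

static_assert(sizeof(atomic_int) == 4, "futex needs a 32-bit word");

typedef atomic_int word_mutex;   /* 0 = UNLOCKED, 1 = LOCKED */

static void word_mutex_lock(word_mutex *m)
{
    int expected = 0;
    /* Compare-and-set: try to move the word from 0 to 1. */
    while (!atomic_compare_exchange_strong(m, &expected, 1)) {
        /* Conditional wait: the kernel sleeps only if *m is still 1,
         * so an unlock that slipped in between is not missed. */
        syscall(SYS_futex, m, FUTEX_WAIT, 1, NULL, NULL, 0);
        expected = 0;
    }
}

static void word_mutex_unlock(word_mutex *m)
{
    atomic_store(m, 0);                                   /* atomic write */
    syscall(SYS_futex, m, FUTEX_WAKE, 1, NULL, NULL, 0);  /* wake one waiter */
}

static word_mutex demo_mutex;
static long demo_counter;

static void *demo_worker(void *arg)
{
    long n = *(long *)arg;
    for (long i = 0; i < n; i++) {
        word_mutex_lock(&demo_mutex);
        demo_counter++;
        word_mutex_unlock(&demo_mutex);
    }
    return NULL;
}

/* Two threads each add n to the counter; returns the total, or -1 on error. */
static long demo_run(long n)
{
    pthread_t a, b;

    demo_counter = 0;
    atomic_store(&demo_mutex, 0);
    if (pthread_create(&a, NULL, demo_worker, &n) != 0)
        return -1;
    if (pthread_create(&b, NULL, demo_worker, &n) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return demo_counter;
}
```

With a working lock, demo_run(50000) returns 100000. Whether a word like this can synchronize two machines depends on the distributed shared memory system providing atomic instructions and a wait primitive across nodes, which a local futex does not.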

Linux Shared Memory Synchronization

I have implemented two applications that share data using the POSIX shared memory API (i.e. shm_open). One process updates data stored in the shared memory segment and another process reads it. I want to synchronize the access to the shared memory region using some sort of mutex or semaphore. What is the most efficient way to do this? Some mechanisms I am considering are:
A POSIX mutex stored in the shared memory segment (Setting the PTHREAD_PROCESS_SHARED attribute would be required)
Creating a System V semaphore using semget
Rather than a System V semaphore, I would go with a POSIX named semaphore using sem_open(), etc.
Might as well make this an answer.
You can use sem_init with pshared true to create a POSIX semaphore in your shared memory space. I have used this successfully in the past.
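A sketch of that approach, with some assumptions: the struct shm_box layout and the function demo_sem_handoff are invented for the example, and an anonymous shared mapping inherited across fork() stands in for the shm_open segment so the example needs no segment name. The child acts as the updating process and the parent as the reader.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical layout: the semaphore lives next to the data it guards. */
struct shm_box {
    sem_t ready;   /* posted by the writer once value is valid */
    int value;
};

/* The child writes value, the parent waits and reads it back.
 * Returns the value seen by the parent, or -1 on error. */
static int demo_sem_handoff(int value)
{
    struct shm_box *box = mmap(NULL, sizeof *box, PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (box == MAP_FAILED)
        return -1;
    /* Second argument 1 = pshared: usable from more than one process. */
    if (sem_init(&box->ready, 1, 0) != 0) {
        munmap(box, sizeof *box);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        sem_destroy(&box->ready);
        munmap(box, sizeof *box);
        return -1;
    }
    if (pid == 0) {                 /* writer process */
        box->value = value;
        sem_post(&box->ready);
        _exit(0);
    }

    int seen = -1;
    int rc;
    do {                            /* reader process: retry if interrupted */
        rc = sem_wait(&box->ready);
    } while (rc != 0 && errno == EINTR);
    if (rc == 0)
        seen = box->value;

    int status = 0;
    waitpid(pid, &status, 0);
    assert(WIFEXITED(status));
    sem_destroy(&box->ready);
    munmap(box, sizeof *box);
    return seen;
}
```

With a real shm_open segment, only the process that creates the segment should call sem_init; the other process just maps the segment and uses the semaphore in place.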
As for whether this is faster or slower than a shared mutex and condition variable, only profiling can tell you. On Linux I suspect they are all pretty similar since they rely on the "futex" machinery.
If efficiency is important, I would go with process-shared mutexes and condition variables.
AFAIR, each operation with a semaphore requires a syscall, so uncontended mutex should be faster than the semaphore [ab]used in mutex-like manner.
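For comparison, a process-shared mutex placed in the shared region could be set up roughly as follows. As before, struct shm_counter and demo_process_shared_mutex are hypothetical names, and an anonymous shared mapping plus fork() is used as a stand-in for two applications mapping the same shm_open segment.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical layout: mutex and protected data in the same segment. */
struct shm_counter {
    pthread_mutex_t lock;
    long value;
};

/* Parent and child each add `iterations` to the shared counter.
 * Returns the final value, or -1 on error. */
static long demo_process_shared_mutex(int iterations)
{
    assert(iterations >= 0);

    struct shm_counter *c = mmap(NULL, sizeof *c, PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (c == MAP_FAILED)
        return -1;

    /* Without PTHREAD_PROCESS_SHARED the mutex is only valid in one process. */
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&c->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    c->value = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(c, sizeof *c);
        return -1;
    }
    for (int i = 0; i < iterations; i++) {   /* runs in parent and child */
        pthread_mutex_lock(&c->lock);
        c->value++;
        pthread_mutex_unlock(&c->lock);
    }
    if (pid == 0)
        _exit(0);
    waitpid(pid, NULL, 0);

    long total = c->value;
    pthread_mutex_destroy(&c->lock);
    munmap(c, sizeof *c);
    return total;
}
```

If one of the processes can die while holding the lock, a robust mutex (pthread_mutexattr_setrobust) is worth considering; that is left out here to keep the sketch short.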
First, really benchmark to know if performance is important. The cost of these things is often overestimated. So unless you find that the cost of accessing the control structure is of the same order of magnitude as the cost of the writes themselves, just take whichever construct is semantically the best fit for your use case. That is usually the situation if you write some 100 bytes per access to the control structure.
Otherwise, if the control structure is the bottleneck, you should perhaps avoid using such constructs. C11 has the new concept of _Atomic types and operations that can be used in cases where there are races in access to data. C11 is not yet widely implemented but probably all modern compilers have extensions that implement these features already.
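As one possible illustration of the C11 route, the sketch below hands a small record from a writer to a reader using a single atomic flag with release/acquire ordering and no mutex or semaphore. The names (struct mailbox, producer, consume_sum, demo_atomic_handoff) are hypothetical, and two threads are used instead of two processes to keep the example short; the same pattern would apply to a lock-free atomic placed in the shared segment.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Sharing atomics across processes is only safe when they are lock-free. */
static_assert(ATOMIC_INT_LOCK_FREE == 2, "atomic_int must be lock-free");

struct mailbox {
    int payload[4];     /* plain data written before the flag is set */
    atomic_int ready;   /* 0 = empty, 1 = payload is valid */
};

static struct mailbox box;

static void *producer(void *arg)
{
    int base = *(int *)arg;
    for (int i = 0; i < 4; i++)
        box.payload[i] = base + i;
    /* Release: all payload writes become visible before ready == 1. */
    atomic_store_explicit(&box.ready, 1, memory_order_release);
    return NULL;
}

static int consume_sum(void)
{
    /* Acquire: once ready == 1 is seen, the payload writes are visible too. */
    while (atomic_load_explicit(&box.ready, memory_order_acquire) == 0)
        sched_yield();

    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += box.payload[i];
    return sum;
}

/* Returns the sum of the four values produced, or -1 on error. */
static int demo_atomic_handoff(int base)
{
    pthread_t t;

    atomic_store(&box.ready, 0);
    if (pthread_create(&t, NULL, producer, &base) != 0)
        return -1;
    int sum = consume_sum();
    pthread_join(t, NULL);
    return sum;
}
```

This covers a one-shot handoff only. A writer that updates the record repeatedly while the reader is active needs something more, such as a sequence counter or a second flag for the reader to acknowledge consumption.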
