Is memcpy() a sleeping function? - c

I would like to copy the content of a an array without using a for loop. The copy is made when owning a spinlock.
Is there any chance that memcpy() can sleep?

Things that might happen with memcpy (or with really any memory access in general):
If part of the source or destination is inaccessible (invalid) memory, memcpy could crash your process, which might leave a shared spinlock in a bad state.
If part of the source memory needs to be paged in, memcpy can block while the kernel grabs the memory for you.
If part of the source or destination is memory-mapped to I/O, memcpy might block while the kernel performs that I/O. (In extreme cases, like memory-mapped network files, memcpy might block indefinitely).
The kernel is also free to swap your process out at any point during the copy, which means the copy could take arbitrarily long to actually complete.
However, memcpy does not do anything that a regular memory access wouldn't do. So, using it with a spinlock should be safe (as safe as accessing the memory normally would be, anyway).

I detect some inconsitency in your question. I'll explain myself.
A spinlock or a busy lock in general, maintains the process (or thread) that is waiting for the lock to be acquired without releasing the cpu to another process (or thread) This means a very fast unlocking and reschedule mechanism when the lock is freed, but a very expensive model for long wait times...
Once said this.... if you are using a spinlock, the reason must be that the loop the process or thread is using to check when the lock is freed should not execute more than three or four times, or the cpu will be wasted just checking once after another time if the lock has been freed.
This completely discourages doing blocking operations like the one you ask for (a memory copy normally is strange that has to deal with a non-present resource ---memory page---, but when it does, your spinlock will go into a loop of millions of checks)
spinlocks where designed to protect very small chuncks of memory, where access could signify at most two or three accesses to memory. In that case, a spinlock is going to solve the problem, as putting the thread to wait and rescheduling it will be milion times faster with the spinlock than with the wait/awake process. But this is in clear antagony to the use of memcpy(3) function, as it is a general copy function that allows for large memory copies in one shot. This means the time the resource is locked for one thread, can signify millions of checks of another thread (in a different core, as this is another reason to use a spinlock, when you have a different core that is going to wait two or three accesses to the lock to see it unlocked)
In my opinion, the only use a spinlock can have is to protect a semaphore's counter, or to protect the access to a cond variable or a mutex, but never to be used as a general memory copy or large resource protection. In those cases, it is better to use a normal, sleeping lock. If you plan to use memcpy(3) the only thing I can assume is that you use the lock to protect large amounts of memory while they are copied into.... that's better handler with a sempahore or a mutex.
In modern kernels, the awakening of a process is so efficient that makes user mode spinlocks almost unusable at all.
As a conclussion, my guess is that you don't have to consider the use of memcpy() to protect a shared memory region... but to consider to use a spinlock itself to do the protection. In most cases it will be a lost of resources, and will make your system heavier and slower.

Related

Do mutexes only function correctly if all relevant threads attempt to acquire the locks they should be acquiring, prior to utilizing a resource?

I'm just learning about locks for the first time prior to taking an OS class for the first time. I originally thought that locks would literally "lock some resource" where you would need to specify the resource (perhaps by pointer to the address of the resource in memory), but after reading through a couple really basic implementations of spin-locks (say, the unix-like training OS "xv6"'s version):
http://pages.cs.wisc.edu/~skobov/cs537/P3/xv6/kernel/spinlock.h
http://pages.cs.wisc.edu/~skobov/cs537/P3/xv6/kernel/spinlock.c
As well as this previous stack overflow question: (What part of memory does a mutex lock? (pthreads))
I think I had it all wrong.
It seems to me instead that locks are effectively just a boolean flag like variable that temporarily (or indefinitely) blocks execution of some code that would utilize a resource, but only where another thread actually also attempts to acquire the lock (where in that second thread attempting to acquire the lock as well, that blocking of the second thread has the side effect of that second thread not being able to utilize the resource until the lock is released by the first thread). So now I'm wondering instead: if a poorly designed thread that uses no mutexes and simply attempts to utilize a resource that another well designed thread held a lock on, is the poorly designed thread able to access the resource regardless (by simply ignoring the mutex -- which I'm now thinking acts as a flag a thread should look at, but has the opportunity to ignore)?
If that's the case, then why do we implement locks as sophisticated boolean variables such that all threads must use the locks as opposed to a lock that instead prevents access to a memory region?
Since I'm relatively new to all this, I appreciate any reasonable terminology edit recommendations if I'm stating my question incorrectly as well an answer!
Thank you very much!
--edit, Thank you all for the prompt and helpful responses!
If that's the case, then why do we implement locks as sophisticated boolean variables such that all threads must use the locks as opposed to a lock that instead prevents access to a memory region?
A lot of reasons:
What if the thing you're controlling access to isn't a memory region? What if it's a file or a network connection?
How would the compiler know when it was going to access a region of protected memory? Would the compiler have to assume that any memory access anywhere might synchronize with other threads? That would make many optimizations impossible, including storing possibly shared variables in registers which is pretty critical.
Would hardware have to support locking memory on any granularity? How would it know what memory is associated with an object? Consider a linked list. Would you have to lock every bit of memory associated with that linked list and every object in it? When you add or remove an object from the list, do you have to change what memory is protected? Won't that be both expensive and extremely difficult to use?
How would it know when to release a lock? Say you access some area of memory that needs protection and then later you access some other area of memory. How would the implementation know whether other threads could be allowed to access that area in-between those two accesses? The implementation would need to know whether the code accessing that region was or wasn't relying on a consistent view of the shared state over those two accesses. How could it know that? Get it wrong by keeping the lock and concurrency suffers. Get it wrong by releasing the lock in-between the two accesses, and the code can behave unpredictably.
And so on.

how can I ensure "nice" (or at least "ok") cache behavior with a volatile mutex'd array?

I have two threads which share a large array of data. One thread writes to it, and the other reads from it. Because the array cannot be in an "incompletely-updated" state when read, I mutex all array operations (reads/writes).
I also try to "play nicely with the cache"- when I read/write large amounts of data, I get one mutex, and read/write as much as required in sequence, then relinquish the mutex.
[edit: to clarify, this is the cache behavior I would like to preserve. if you read/write large swathes of memory in sequence, then the cache can pull in large lines of data from memory (slow!) only once, then operate on the cache (fast!) without again hitting memory until that cache line is exhausted.]
One thing I would like to protect against is "writing to a small part of the array in one thread, and then in another thread (after receiving the mutex) reading from that small part of the array which hasn't yet been flushed to memory (out of the first thread/core's cache), resulting in an outdated read". So the solution would be to mark the array as "volatile" (right?).
Am I correct to worry that "marking the array as volatile" will totally kill my ability to read/write large chunks in accordance with a well-behaved cache? Or will every read/write be called to/from memory?
In a perfect world, what I think I'd want is the ability to: 1. grab a mutex, 2. load data from memory (as though it were volatile), 3. read/write to array (as though it weren't volatile- should be safe to rely on own cache bc mutex), 4.(in the case of write) flush any remaining cache to memory. 5. relinquish mutex
Can I accomplish this? Are there any glaring misunderstandings here on my part?

C , XPMEM and locks

I'm trying to share a portion of the virtual memory between different processes using XPMEM.
This segment of memory contains a shared data structure and I would like to use a lock to order accesses to it.
Can I use already existing locks provided in C that avoids to busy wait on a single variable?
If not, what kind of lock should I implement to reduce the impact of the cache ping pong?
Thanks a lot.
You will probably have to implement your own locks as existing locks (pthread mutex) probably won't work (the same reason posix semaphores and pthread condition variables won't work). The limitation comes from how XPMEM maps pages from one application into another. The mapping is done by page frame number (PFN) only meaning there are no user pages on the side that called xpmem_attach on the memory region. This causes problems for any code that calls futex_wake because it relies on get_user_pages_fast returning page info. If a page does not exist (as is the case with a memory segment returned by xpmem_attach) futex_wake will return EFAULT and not wake any thread waiting on the memory region.
I am trying to figure out if there is a way to work around this issue.

Better to lock on a shared resource, or have a thread to fulfill requests?

I have a shared memory pool from which many different threads may request an allocation. Requesting an allocation from this will occur a LOT in every thread, however the amount of threads is likely to be small, often with only 1 thread running. I am not sure which of the following ways to handle this are better.
Ultimately I may need to implement both and see which produces more favorable results... I also fear that even thinking of #2 may be premature optimization at this point as I don't actually have the code that uses this shared resource written yet. But the problem is so darn interesting that it continues to distract me from the other work.
1) Create a mutex and have a thread attempt to lock it before obtaining the allocation, then unlocking it.
2) Have each thread register a request slot, when it needs an allocation it puts the request in the slot, then blocks(while (result == NULL) { usleep() }) waiting for the request slot to have a result. A single thread continuously iterates request slots making the allocations and assigning them to the result in the request slot.
Number 1 is the simple solution, but a single thread could potentially hog the lock if the timing is right. The second is more complex, but ensures fairness among threads when pulling from the resource. However it still blocks the requesting threads, and if there are many threads the iteration could burn cycles without doing any actual allocations until it finds a request to fulfill.
NOTE: C on Linux using pthreads
Solution 2 is bogus. It's an ugly hack and it does not ensure memory synchronization.
I would say go with solution 1, but I'm a little bit skeptical of the fact that you mentioned "memory pool" to begin with. Are you just trying to allocate memory, or is there some other resource you're managing (e.g. slots in some special kind of memory, memory-mapped file, textures in video memory, etc.)?
If you are just allocating memory, then you're completely right to be worried about premature optimization. The whole problem is premature optimization, and the system malloc will do as well as or better than your memory pool will do. (Or if your code will be running on one of the few systems with a pathologically broken malloc like some video game consoles, just drop in a replacement only on those known-broken systems.)
If you really do have a special resource you need to manage, start with solution 1 and see how it works. If you have problems, you might find you can improve it with a condition variable where the resource manager notifies you when a slot can be allocated, but I really doubt this will be necessary.

if using shared memory, are there still advantages for processes over threading?

I have written a Linux application in which the main 'consumer' process forks off a bunch of 'reader' processes (~16) which read data from the disk and pass it to the 'consumer' for display. The data is passed over a socket which was created before the fork using socketpair.
I originally wrote it with this process boundary for 3 reasons:
The consumer process has real-time constraints, so I wanted to avoid any memory allocations in the consumer. The readers are free to allocate memory as they wish, or even be written in another language (e.g. with garbage collection), and this doesn't interrupt the consumer, which has FIFO priority. Also, disk access or other IO in the reader process won't interrupt the consumer. I figured that with threads I couldn't get such guarantees.
Using processes will stop me, the programmer, from doing stupid things like using global variables and clobbering other processes' memory.
I figured forking off a bunch of workers would be the best way to utilize multiple CPU architectures, and I figured using processes instead of threads would generally be safer.
Not all readers are always active, however, those that are active are constantly sending large amounts of data. Lately I was thinking that to optimize this by avoiding memory copies associated with writing and reading the socket, it would be nice to just read the data directly into a shared memory buffer (shm_open/mmap). Then only an index into this shared memory would be passed over the socket, and the consumer would read directly from it before marking it as available again.
Anyways, one of the biggest benefits of processes over threads is to avoid clobbering another thread's memory space. Do you think that switching to shared memory would destroy any advantages I have in this architecture? Is there still any advantage to using processes in this context, or should I just switch my application to using threads?
Your assumption that you cannot meet your realtime constraints with threads is mistaken. IO or memory allocation in the reader threads cannot stall the consumer thread as long as the consumer thread is not using malloc itself (which could of course lead to lock contention). I would recommend reading what POSIX has to say on the matter if you're unsure.
As for the other reasons to use processes instead of threads (safety, possibility of writing the readers in a different language, etc.), these are perfectly legitimate. As long as your consumer process treats the shared memory buffer as potentially-unsafe external data, I don't think you lose any significant amount of safety by switching from pipes to shared memory.
Yes, exactly for the reason you told. It's better to have each processes memory protected and only share what is really necessary to share. So each consumer can allocate and use its resources without bothering with the locking.
As for your index communication between your task, it should be noted that you could then use an area in your shared memory for that and using mutex for the accesses, as it is likely less heavy than the socket communication. File descriptor communication (sockets, pipes, files etc) always involves the kernel, shared memory with mutex locks or semaphores only when there is contention.
One point to be aware of when programming with shared memory in a multiprocessor environment, is to avoid false dependencies on variables. This happens when two unrelated objects share the same cache line. When one is modified it "dirties" also the other, which means that if other processor access the other object it will trigger a cache synchronisation between the CPUs. This can lead to bad scaling. By aligning the objects to the cache line size (64 byte usually but can differ from architecture to architecture) one can easily avoid that.
The main reason I met in my experience to replace processes by threads was efficiency.
If your processes are using a lot of code or unshared memory that could be shared in multithreading, then you could win a lot of performance on highly threaded CPUs like SUN Sparc CPUs having 64 or more threads per CPU. In this case, the CPU cache, especially for the code, will be much more efficient with multithreaded process (cache is small on Sparc).
If you see that your software is not running faster when running on new hardware with more CPU threads, then you should consider multi-threading. Otherwise, your arguments to avoid it seem good to me.
I did not meet this issue on Intel processors yet, but it could happen in the future when they add more cores per CPU.

Resources