if using shared memory, are there still advantages for processes over threading? - c

I have written a Linux application in which the main 'consumer' process forks off a bunch of 'reader' processes (~16) which read data from the disk and pass it to the 'consumer' for display. The data is passed over a socket which was created before the fork using socketpair.
I originally wrote it with this process boundary for 3 reasons:
The consumer process has real-time constraints, so I wanted to avoid any memory allocations in the consumer. The readers are free to allocate memory as they wish, or even be written in another language (e.g. with garbage collection), and this doesn't interrupt the consumer, which has FIFO priority. Also, disk access or other IO in the reader process won't interrupt the consumer. I figured that with threads I couldn't get such guarantees.
Using processes will stop me, the programmer, from doing stupid things like using global variables and clobbering other processes' memory.
I figured forking off a bunch of workers would be the best way to utilize multi-CPU architectures, and I figured using processes instead of threads would generally be safer.
Not all readers are always active, however, those that are active are constantly sending large amounts of data. Lately I was thinking that to optimize this by avoiding memory copies associated with writing and reading the socket, it would be nice to just read the data directly into a shared memory buffer (shm_open/mmap). Then only an index into this shared memory would be passed over the socket, and the consumer would read directly from it before marking it as available again.
Anyways, one of the biggest benefits of processes over threads is to avoid clobbering another thread's memory space. Do you think that switching to shared memory would destroy any advantages I have in this architecture? Is there still any advantage to using processes in this context, or should I just switch my application to using threads?

Your assumption that you cannot meet your realtime constraints with threads is mistaken. IO or memory allocation in the reader threads cannot stall the consumer thread as long as the consumer thread is not using malloc itself (which could of course lead to lock contention). I would recommend reading what POSIX has to say on the matter if you're unsure.
As for the other reasons to use processes instead of threads (safety, possibility of writing the readers in a different language, etc.), these are perfectly legitimate. As long as your consumer process treats the shared memory buffer as potentially-unsafe external data, I don't think you lose any significant amount of safety by switching from pipes to shared memory.
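A minimal sketch of the index-passing scheme from the question, with the consumer validating the index as untrusted input. The pool size, slot size and the name demo_index_passing are hypothetical. An anonymous MAP_SHARED mapping created before fork() stands in for shm_open/mmap to keep the example short.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

#define SLOT_COUNT 4      /* hypothetical pool size */
#define SLOT_SIZE  4096   /* hypothetical slot size in bytes */

/* Runs one reader/consumer round trip. Returns 0 if the consumer saw the
 * reader's data through the shared pool, -1 on any failure. */
int demo_index_passing(void)
{
    size_t pool_len = (size_t)SLOT_COUNT * SLOT_SIZE;

    /* Shared mapping created before fork(), so both processes see the
     * same pages. A named object from shm_open() would behave alike. */
    unsigned char *pool = mmap(NULL, pool_len, PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (pool == MAP_FAILED)
        return -1;

    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        munmap(pool, pool_len);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        munmap(pool, pool_len);
        return -1;
    }
    if (pid == 0) {
        /* Reader: fill slot 2 in place, then send only its index. */
        close(sv[0]);
        uint32_t idx = 2;
        memset(pool + (size_t)idx * SLOT_SIZE, 0xAB, SLOT_SIZE);
        ssize_t n = write(sv[1], &idx, sizeof idx);
        _exit(n == (ssize_t)sizeof idx ? 0 : 1);
    }

    /* Consumer: receive the index and range-check it before use. */
    close(sv[1]);
    uint32_t idx = 0;
    int ok = read(sv[0], &idx, sizeof idx) == (ssize_t)sizeof idx
             && idx < SLOT_COUNT
             && pool[(size_t)idx * SLOT_SIZE] == 0xAB
             && pool[(size_t)idx * SLOT_SIZE + SLOT_SIZE - 1] == 0xAB;

    int status = 0;
    waitpid(pid, &status, 0);
    close(sv[0]);
    munmap(pool, pool_len);
    return (ok && WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

The range check on idx is the part that keeps the safety argument intact: a buggy reader can corrupt slot contents, but it cannot make the consumer index outside the pool.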

Yes, exactly for the reason you gave. It is better to keep each process's memory protected and to share only what really needs to be shared. Each process can then allocate and use its own resources without worrying about locking.
As for passing the index between your tasks, note that you could use an area of the shared memory for that as well, guarded by a mutex, since this is likely lighter than socket communication. File descriptor communication (sockets, pipes, files, etc.) always involves the kernel; shared memory with mutexes or semaphores involves it only when there is contention.
One point to be aware of when programming with shared memory on a multiprocessor is false sharing. This happens when two unrelated objects share the same cache line. Modifying one also "dirties" the other, so when another processor accesses the other object it triggers cache synchronisation between the CPUs. This can lead to poor scaling. Aligning the objects to the cache line size (usually 64 bytes, though it differs between architectures) easily avoids the problem.
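The cache-line alignment described above could look roughly like this in C11. The 64-byte line size and the struct names are assumptions for illustration; the real line size should be checked for the target CPU.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size; verify for your target CPU */

/* Each counter is aligned to (and therefore padded out to) its own cache
 * line, so a writer on one core does not invalidate the line holding a
 * neighbouring counter. */
struct padded_counter {
    alignas(CACHE_LINE) atomic_ulong value;
};

/* Hypothetical per-reader statistics block placed in shared memory. */
struct per_reader_stats {
    struct padded_counter bytes_read[16];
};

static_assert(sizeof(struct padded_counter) == CACHE_LINE,
              "each counter should occupy exactly one cache line");
```

The cost is memory: each counter now occupies a full line, which is usually acceptable for a handful of hot, frequently written fields.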

In my experience, the main reason to replace processes with threads has been efficiency.
If your processes use a lot of code or unshared memory that could be shared under multithreading, you can gain a lot of performance on highly threaded CPUs such as Sun SPARC CPUs with 64 or more hardware threads per CPU. In that case the CPU cache, especially for code, is used much more efficiently by a multithreaded process (the cache is small on SPARC).
If your software does not run faster on new hardware with more CPU threads, you should consider multithreading. Otherwise, your arguments for avoiding it seem good to me.
I have not met this issue on Intel processors yet, but it could happen in the future as they add more cores per CPU.

Related

Is memcpy() a sleeping function?

I would like to copy the contents of an array without using a for loop. The copy is made while holding a spinlock.
Is there any chance that memcpy() can sleep?
Things that might happen with memcpy (or with really any memory access in general):
If part of the source or destination is inaccessible (invalid) memory, memcpy could crash your process, which might leave a shared spinlock in a bad state.
If part of the source memory needs to be paged in, memcpy can block while the kernel grabs the memory for you.
If part of the source or destination is memory-mapped to I/O, memcpy might block while the kernel performs that I/O. (In extreme cases, like memory-mapped network files, memcpy might block indefinitely).
The kernel is also free to swap your process out at any point during the copy, which means the copy could take arbitrarily long to actually complete.
However, memcpy does not do anything that a regular memory access wouldn't do. So, using it with a spinlock should be safe (as safe as accessing the memory normally would be, anyway).
I detect some inconsistency in your question. Let me explain.
A spinlock, or a busy lock in general, keeps the process (or thread) that is waiting for the lock on the CPU instead of releasing the CPU to another process (or thread). This gives a very fast path from unlock to resumption when the lock is freed, but it is a very expensive model for long wait times.
That said, if you are using a spinlock, the reason must be that the loop checking whether the lock has been freed is expected to run no more than three or four times; otherwise the CPU is wasted checking over and over whether the lock has been freed.
This strongly discourages doing potentially blocking operations like the one you ask about. (A memory copy rarely has to deal with a non-present resource such as a memory page, but when it does, the thread waiting on your spinlock will go through millions of checks.)
Spinlocks were designed to protect very small chunks of memory, where the protected operation amounts to at most two or three memory accesses. In that case a spinlock solves the problem, because spinning is far faster than putting the thread to sleep and waking it up again. But this is clearly at odds with the memcpy(3) function, which is a general copy function that allows large memory copies in one shot. The time the resource is locked by one thread can then mean millions of checks by another thread (on a different core, which is the other precondition for using a spinlock: the waiter runs on a different core and expects to see the lock released within two or three checks).
In my opinion, the only good use of a spinlock is to protect a semaphore's counter, or the internals of a condition variable or a mutex, never a general memory copy or a large resource. In those cases it is better to use a normal, sleeping lock. If you plan to use memcpy(3), the only thing I can assume is that you use the lock to protect large amounts of memory while they are copied into; that is better handled with a semaphore or a mutex.
In modern kernels, waking a process is so efficient that user-mode spinlocks are almost never worthwhile.
In conclusion, my guess is that the real question is not whether memcpy() is safe under the lock that protects a shared memory region, but whether a spinlock should be doing that protection at all. In most cases it will be a waste of resources and will make your system heavier and slower.
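As a rough illustration of the sleeping-lock alternative, a bulk copy guarded by a pthread mutex might look like this. The struct and function names are hypothetical, and the 4096-byte buffer size is arbitrary.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical region whose contents are replaced by bulk copies. */
struct shared_region {
    pthread_mutex_t lock;
    unsigned char data[4096];
};

/* Initialise the region; returns 0 on success. */
int region_init(struct shared_region *r)
{
    memset(r->data, 0, sizeof r->data);
    return pthread_mutex_init(&r->lock, NULL);
}

/* Copy len bytes into the region under the mutex. A waiter sleeps
 * instead of spinning if the copy stalls (for example on a page fault).
 * Returns 0 on success, -1 if len is too large or locking fails. */
int region_write(struct shared_region *r, const void *src, size_t len)
{
    if (len > sizeof r->data)
        return -1;
    if (pthread_mutex_lock(&r->lock) != 0)
        return -1;
    memcpy(r->data, src, len);
    pthread_mutex_unlock(&r->lock);
    return 0;
}
```

An uncontended pthread mutex is typically acquired in user space without a system call, so the common case stays cheap while the contended case no longer burns a core.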

Modern System Architecture?

What could happen if we used Peterson's solution to the critical section problem on a modern computer? It is my understanding that systems with multiple CPUs can run into difficulty because of the ordering of memory reads and writes with respect to other reads and writes in memory, but is this the problem with most modern systems? Are there any advantages to using semaphores VS mutex locks?
Interesting question! To answer it, it helps to be clear about the terms first. The critical section is the part of a program that must not be executed concurrently by more than one of that program's processes or threads. Concurrent access is not allowed, which means only one process at a time may be working with the shared resource. Typically the critical section accesses a resource such as a data structure or a network connection.
Mutual Exclusion or mutex just describes the requirement that only one concurrent process is in the critical section at a time, so concurrent access to shared data must ensure this "mutual exclusion".
So this introduces the problem! How do we assure that processes run completely independently of other processes, in other words, how do we ensure "atomic access" to the various critical sections by the threads?
There are a few solutions to the "critical-section problem" but the one you mention is Peterson's solution so we will discuss that.
Peterson's algorithm is designed for mutual exclusion and allows two tasks to share a single-use resource. They use shared memory for communicating.
In the algorithm, two tasks compete for the critical section; you will have to look into mutual exclusion, bounded waiting and the other properties for a full understanding, but the gist of it is that in Peterson's method a process waits at most one turn to enter the critical section: if it gives priority to the other task, that task runs its critical section to completion and then lets the first one enter.
That is the original solution proposed.
However, this has no guarantee of working on today's multiprocessor architectures, and it only works for two concurrent tasks. Modern CPUs execute out of order and may reorder memory reads and writes, so operations that are sequential in the program can take effect in a different order, which breaks the algorithm's assumptions. I suggest you also take a look at locks. Hope that helps :)
Can anyone else think of anything to add that I might have missed?
It is my understanding that systems with multiple CPUs can run into difficulty because of the ordering of memory reads and writes with respect to other reads and writes in memory, but is this the problem with most modern systems?
No. Any modern system with "less strict" memory ordering has ways to make the memory ordering more strict where it matters (e.g. fences).
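To make that concrete, here is a sketch of Peterson's algorithm written with C11 sequentially consistent atomics, which supply the ordering that plain loads and stores do not. The function names, the iteration count and the sched_yield() call in the wait loop are choices made for this example only.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static atomic_bool wants[2];   /* wants[i]: task i wants to enter */
static atomic_int turn;        /* which task must wait on a tie */
static long shared_counter;    /* plain variable protected by the lock */

static void peterson_lock(int self)
{
    int other = 1 - self;
    /* Default atomic_store/atomic_load are sequentially consistent, so
     * these two stores cannot be reordered after the loads below. */
    atomic_store(&wants[self], true);
    atomic_store(&turn, other);
    while (atomic_load(&wants[other]) && atomic_load(&turn) == other)
        sched_yield();         /* be polite while busy-waiting */
}

static void peterson_unlock(int self)
{
    atomic_store(&wants[self], false);
}

static void *peterson_worker(void *arg)
{
    int self = (int)(intptr_t)arg;
    for (int i = 0; i < 50000; i++) {
        peterson_lock(self);
        shared_counter++;      /* critical section */
        peterson_unlock(self);
    }
    return NULL;
}

/* Runs two tasks through the lock and returns the final counter,
 * or -1 if a thread could not be started. */
long run_peterson_demo(void)
{
    pthread_t t[2];
    shared_counter = 0;
    for (int i = 0; i < 2; i++)
        if (pthread_create(&t[i], NULL, peterson_worker,
                           (void *)(intptr_t)i) != 0)
            return -1;
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return shared_counter;
}
```

With plain int and bool variables instead of atomics, the compiler or the CPU may let the load of wants[other] pass the earlier stores, and both tasks can then enter the critical section together.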
Are there any advantages to using semaphores VS mutex locks?
Mutexes are typically simpler and faster (in the same way that a boolean is simpler than a counter); but ignoring overhead a mutex is equivalent to a semaphore with "resource count = 1".
What could happen if we used Peterson's solution to the critical section problem on a modern computer?
The big problem here is that most modern operating systems support some kind of multi-tasking (e.g. multiple processes, where each process can have multiple threads), there's usually 100 other processes (just for the OS alone), and modern hardware has power management (where you try to avoid power consumption by putting CPUs to sleep when they can't do useful work). This means that (unbounded) spinning/busy waiting is a horrible idea (e.g. you can have N CPUs being wasted spinning/trying to acquire a lock while the task that currently holds the lock isn't running on any CPU because the scheduler decided that 1234 other tasks should get 10 ms of CPU time each).
Instead; to avoid (excessive) spinning you want to ask the scheduler to block your task until/unless the lock actually can be acquired; and (especially for heavily contended locks) you probably want "fairness" (to avoid the risk of timing problems that lead to some tasks being repeatedly lucky while other tasks starve and make no progress).
This ends up being "no spinning", or "brief spinning" (to avoid scheduler overhead in cases where the task holding the lock actually can/does release it quickly); followed by the task being put on a FIFO queue and the scheduler giving the CPU to a different task or putting the CPU to sleep; where if the lock is released the scheduler wakes up the first task on the FIFO queue. Of course it's never that simple (e.g. for performance you want to do as much as you can in user-space; and you need special care and cooperation between user-space and kernel to avoid race conditions - the lock being released before a task is put on the wait queue).
Fortunately modern systems also provide simpler ways to implement locks (e.g. "atomic compare and swap"), so there's no need to resort to Peterson's algorithm (even if it's just for insertion/removal of tasks from the real lock's FIFO queue).
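A minimal sketch of a lock built on atomic compare-and-swap, for any number of threads. This is not how a production mutex is implemented: a real one would block in the kernel and queue waiters fairly, whereas this example simply calls sched_yield() when the lock is taken. All names here are made up for the example.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

typedef struct { atomic_int held; } cas_lock;   /* 0 = free, 1 = held */

static cas_lock demo_lock;
static long demo_counter;

static void cas_lock_acquire(cas_lock *l)
{
    for (;;) {
        int expected = 0;
        /* Atomically: if held == 0, set it to 1 and take the lock. */
        if (atomic_compare_exchange_weak_explicit(
                &l->held, &expected, 1,
                memory_order_acquire, memory_order_relaxed))
            return;
        sched_yield();  /* stand-in for blocking on a wait queue */
    }
}

static void cas_lock_release(cas_lock *l)
{
    atomic_store_explicit(&l->held, 0, memory_order_release);
}

static void *cas_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 50000; i++) {
        cas_lock_acquire(&demo_lock);
        demo_counter++;
        cas_lock_release(&demo_lock);
    }
    return NULL;
}

/* Runs four threads through the lock and returns the final counter,
 * or -1 if a thread could not be started. */
long run_cas_lock_demo(void)
{
    pthread_t t[4];
    demo_counter = 0;
    for (int i = 0; i < 4; i++)
        if (pthread_create(&t[i], NULL, cas_worker, NULL) != 0)
            return -1;
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return demo_counter;
}
```

Unlike Peterson's algorithm, the same few lines work for any number of contenders, which is one reason hardware atomics displaced the classic software-only solutions.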

Modify read-only memory at low overhead

Assume that I have a page of memory that is read-only (e.g., set through mmap/mprotect). How do I modify one word (8 bytes) on this page at the lowest possible overhead?
Some context: I assume x86-64, Linux as my runtime environment. The modifications happen rarely but frequently enough so that I have to worry about overhead. The page is read only to protect some important data that must be read by the program frequently against rogue/illegal modifications. There are only few places that are allowed to modify the data on the page and I know all the locations of these places and the address of the page statically. The problem I'm trying to solve is protecting some data against memory safety bugs in the program with a few authorized places where I need to make modifications to the data. The modifications are not frequent but frequent enough so that several kernel-roundtrips (through system calls) are too costly.
So far, I thought of the following solutions:
mprotect
ptrace
shared memory
new system call
mprotect
mprotect(addr, 4096, PROT_WRITE | PROT_READ);
addr[12] = 0xc0fec0fe;
mprotect(addr, 4096, PROT_READ);
The mprotect solution is clean, simple, and straight-forward. Unfortunately, it involves two round trips into the kernel and will result in some overhead. In addition, the whole page will be writable during that time frame, allowing for some other thread to modify that memory area concurrently.
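For reference, the mprotect approach wrapped into a helper with error checking might look like this. The function name write_protected_word is hypothetical, and the sketch does nothing about the concurrent-writer window mentioned above.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write one 64-bit word into a read-only page by briefly opening a
 * read-write window. `page` must be page-aligned and `index` counts
 * 8-byte words. Returns 0 on success, -1 on failure.
 * Cost: two system calls per write, and the whole page is writable by
 * any thread between them. */
int write_protected_word(uint64_t *page, size_t index, uint64_t value)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);

    if (index >= page_size / sizeof *page)
        return -1;
    if (mprotect(page, page_size, PROT_READ | PROT_WRITE) != 0)
        return -1;
    page[index] = value;
    return mprotect(page, page_size, PROT_READ);
}
```

A call such as write_protected_word(addr, 12, 0xc0fec0fe) corresponds to the three-line snippet above, assuming addr points to 8-byte words.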
ptrace
Unfortunately, ptracing yourself is not possible (a ptraced process needs to be stopped). So the solution is to fork, ptrace the child process, then use PTRACE_POKETEXT to write to the child process's memory.
This option has the drawback of requiring a separate parent (tracer) process and will result in problems if the tracee uses multiple processes. The overhead per write is at least one system call for PTRACE plus the required synchronization between the processes.
shared memory
Shared memory is similar to the ptrace solution except that it avoids the system call on each write. Both processes set up shared memory with different permissions (RW in the child, R in the parent). The two processes still need to synchronize on each write that is then carried out by the parent. Shared memory has similar drawbacks in complexity as the ptrace solution and incompatibilities with multiple communicating processes.
new system call
Adding a new system call to the kernel would solve the problem and would only require a single system call to modify one word in the process without having to change the page tables or the requirement to set up multiple communicating processes.
Is there anything that is faster than the 4 discussed/sketched solutions? Could I rely on any debug features? Are there any other neat low-level systems tricks?

mmap thread safety in a multi-core and multi-cpu environment

I am a little confused as to the real issues between multi-core and multi-cpu environments when it comes to shared memory, with particular reference to mmap in C.
I have an application that utilizes mmap to share multiple segments of memory between 2 processes. Each process has access to:
A Status and Control memory segment
Raw data (up to 8 separate raw data buffers)
The Status and Control segment is used essentially as an IPC. IE, it may convey that buffer 1 is ready to receive data, or buffer 3 is ready for processing or that the Status and Control memory segment is locked whilst being updated by either parent or child etc etc.
My understanding is, and PLEASE correct me if I am wrong, is that in a multi-core CPU environment on a single boarded PC type infrastructure, mmap is safe. That is, regardless of the number of cores in the CPU, RAM is only ever accessed by a single core (or process) at any one time.
Does this assumption of single-process RAM access also apply to multi-cpu systems? That is, a single PC style board with multiple CPU's (and I guess, multiple cores within each CPU).
If not, I will need to seriously rethink my logic to allow for multi-cpu'd single-boarded machines!
Any thoughts would be greatly appreciated!
PS - by single boarded I mean a single, standalone PC style system. This excludes mainframes and the like ... just to clarify :)
RAM is only ever accessed by a single core (or process) at any one time.
Take a step back and think about what your assumption means. Theoretically, yes, this statement is true, but I don't think it means what you think it means. There are no practical conclusions you can draw from this other than maybe "the memory will not catch fire if two CPUs write to the same address at the same time". Let me explain.
If one CPU/process writes to a memory location, then a different CPU/process writes to the same location, the memory writes will not happen at the same time, they will happen one at a time. You can't generally reason about which write will happen before the other, you can't reason about whether a read from one CPU will happen before the write from the other CPU, and on some older CPUs you can't even reason about whether multi-byte (multi-word, actually) values will be stored/accessed one byte at a time or multiple bytes at a time (which means that reads and writes to multibyte values can get interleaved between CPUs or processes).
The only thing multiple CPUs change here is the order of memory reads and writes. On a single CPU you can be pretty sure that your reads from memory will see earlier writes to the same memory (provided no other hardware is reading or writing the memory; otherwise all bets are off). On multiple CPUs the order of reads and writes to different memory locations will surprise you (CPU 1 writes to address 1 and then 2, but CPU 2 might just see the new value at address 2 and the old value at address 1).
So unless you have specific documentation from your operating system and/or CPU manufacturer you can't make any assumptions (except that when two writes to the same memory location happen one will happen before the other). This is why you should use libraries like pthreads or stdatomic.h from C11 for proper locking and synchronization, or really dig deep down into the most complex parts of the CPU documentation to actually understand what will happen. The locking primitives in pthreads not only provide locking, they also guarantee that memory is properly synchronized. stdatomic.h is another way to guarantee memory synchronization, but you should carefully read the C11 standard to see what it promises and what it doesn't promise.
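The "address 1, then address 2" scenario can be made safe with a release store and an acquire load from stdatomic.h. The sketch below uses two threads for brevity and hypothetical names; the same pattern is commonly used between processes sharing an mmap'ed segment, provided the atomic type is lock-free on the platform.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static int payload;        /* "address 1": ordinary data */
static atomic_int ready;   /* "address 2": publication flag */

static void *publisher(void *arg)
{
    (void)arg;
    payload = 42;
    /* Release: every write above becomes visible to whoever observes
     * ready == 1 through an acquire load. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

/* Starts the publisher, waits for the flag and returns the payload the
 * waiting thread observed, or -1 if the thread could not be started. */
int run_publish_demo(void)
{
    pthread_t t;
    payload = 0;
    atomic_store(&ready, 0);
    if (pthread_create(&t, NULL, publisher, NULL) != 0)
        return -1;

    /* Acquire: once the flag is seen, the payload write is seen too. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        sched_yield();
    int seen = payload;

    pthread_join(t, NULL);
    return seen;
}
```

If ready were a plain int, nothing would stop the compiler or the CPU from making the flag visible before the payload, which is exactly the surprise described above.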
One potential issue is that each core has its own cache (usually just level 1, as level 2 and level 3 caches are usually shared). Each CPU would also have its own cache. However, most systems ensure cache coherency, so this isn't the issue (except for the performance impact of constantly invalidating caches due to writes to the same memory shared in a cache line by each core or processor).
The real issue is that there is no guarantee against reordering of reads and writes due to optimizations by the compiler and/or the hardware. You need to use a Memory Barrier to flush out any pending memory operations to synchronize the state of the threads or shared memory of processes. The memory barrier will occur if you use one of the synchronization types such as an event, mutex, semaphore, ... . Not all of the shared memory reads and writes need to be atomic, but you need to use synchronization between threads and/or processes before accessing any shared memory possibly updated by another thread and/or process.
This does not sound right to me. Two processes on two different cores can both load and store data to RAM at the same time. In addition to this, caching strategies can result in all kinds of strange behaviour. So please make sure all access to shared memory is properly synchronized using (interprocess) synchronization objects.
My understanding is, and PLEASE correct me if I am wrong, is that in a multi-core CPU environment on a single boarded PC type infrastructure, mmap is safe. That is, regardless of the number of cores in the CPU, RAM is only ever accessed by a single core (or process) at any one time.
Even if this holds true for some particular architecture, such an assumption is entirely wrong in general. You should have proper synchronisation between the processes that modify the shared memory segment, unless atomic intrinsics are used and the algorithm itself is lock-free.
I would advise you to put a pthread_mutex_t in the shared memory segment (shared across all processes). You will have to initialise it with the PTHREAD_PROCESS_SHARED attribute:
pthread_mutexattr_t mutex_attr;
pthread_mutexattr_init(&mutex_attr);
pthread_mutexattr_setpshared(&mutex_attr, PTHREAD_PROCESS_SHARED);
pthread_mutex_init(mutex, &mutex_attr);
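Putting the pieces together, a self-contained sketch could look like the following. The struct layout and function name are hypothetical, and an anonymous MAP_SHARED mapping inherited across fork() stands in for whatever mmap setup the application already has.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical layout of the status-and-control segment. */
struct shared_state {
    pthread_mutex_t lock;
    long counter;
};

/* Parent and child each increment the shared counter `iterations`
 * times under the process-shared mutex. Returns the final counter as
 * seen by the parent, or -1 on failure. */
long run_shared_mutex_demo(int iterations)
{
    struct shared_state *s = mmap(NULL, sizeof *s, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (s == MAP_FAILED)
        return -1;

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&s->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    s->counter = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(s, sizeof *s);
        return -1;
    }

    /* Both processes run this loop against the same mutex and counter. */
    for (int i = 0; i < iterations; i++) {
        pthread_mutex_lock(&s->lock);
        s->counter++;
        pthread_mutex_unlock(&s->lock);
    }

    if (pid == 0)
        _exit(0);

    waitpid(pid, NULL, 0);
    long total = s->counter;
    pthread_mutex_destroy(&s->lock);
    munmap(s, sizeof *s);
    return total;
}
```

The mutex must be initialised exactly once, before the second process starts using it; here that is guaranteed by doing it before fork().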

having database in memory - C

I am programming a server daemon from which users can query data in C. The data can also be modified from clients.
I thought about keeping the data in memory.
For every new connection I do a fork().
The first problem I thought of is that this will create a copy of the database every time a connection takes place, which is a waste of memory.
The second problem is that I don't know how to modify the database in the parent process.
What concepts are there to solve these problems?
Shared memory and multi-threading are two ways of sharing memory between multiple execution units. Check out POSIX Threads for multi-threading, and don't forget to use mutexes and/or semaphores to lock the memory areas from writing when someone is reading.
All this is part of the bigger problem of concurrency. There are multiple books and entire university courses about the problems of concurrency so maybe you need to sit down and study it a bit if you find yourself lost. It's very easy to introduce deadlocks and race conditions into concurrent C programs if you are not careful.
What concepts are there to solve these problems?
Just a few observations:
fork() only clones the memory of the process as it is at the time of the call. If you haven't opened or loaded your database at this stage, it won't be cloned into the child processes.
Shared memory - that is, memory mapped with mmap() and MAP_SHARED will be shared between processes and will not be duplicated.
The general term for communicating between processes is Interprocess communication of which there are several types and varieties, depending on your needs.
Aside: On modern Linux systems, fork() implements copy-on-write copying of process memory. Actually, you won't end up with two copies of a process in memory - you'll end up with one copy that believes it has been copied twice. If you write to any of the memory, then it will be copied. This is an efficiency saving that makes use of the fact that the majority of processes alter only a small fraction of their memory as they run, so in fact even if you went for the copy-the-whole-database approach, you might find the memory usage less than you expect - although of course that wouldn't fix your synchronisation problems!
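A small sketch of the difference between ordinary (copy-on-write) memory and a MAP_SHARED mapping after fork(). The variable and function names are made up for the example.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

static int private_value;  /* ordinary memory: copy-on-write after fork */

/* The child modifies both an ordinary variable and a MAP_SHARED one.
 * The values the parent sees afterwards are stored in *private_seen and
 * *shared_seen. Returns 0 on success, -1 on failure. */
int run_fork_demo(int *private_seen, int *shared_seen)
{
    int *shared_value = mmap(NULL, sizeof *shared_value,
                             PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared_value == MAP_FAILED)
        return -1;

    private_value = 1;
    *shared_value = 1;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(shared_value, sizeof *shared_value);
        return -1;
    }
    if (pid == 0) {
        private_value = 2;   /* lands in the child's private copy */
        *shared_value = 2;   /* lands in the page both processes map */
        _exit(0);
    }

    waitpid(pid, NULL, 0);
    *private_seen = private_value;
    *shared_seen = *shared_value;
    munmap(shared_value, sizeof *shared_value);
    return 0;
}
```

After the call, the parent still sees its own private_value unchanged, while the write to the shared mapping is visible. This is why a database kept in ordinary memory cannot be updated for the parent from a forked child, and why one kept in a shared mapping needs explicit synchronisation.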
