Does the C/++ memory model apply to the atomic operation itself? - c

I'm left confused about when the C/++ memory model is relevant, even after reading the GCC wiki.
My code is an IO library that allows taking/returning a buffer from a pool and using it for async IO. However, even after the buffer is returned to the pool, it isn't free unless the actual IO operation has also completed.
Each buffer has a structure that has status flags:
#define IO_FLAG_IN_USE 1 // a consumer has taken ownership of the buffer
#define IO_FLAG_IN_FLIGHT 2 // the buffer is in use by the system for async IO
A consumer requests a buffer with io_getbuf and waits using sem_wait. There are two ways a buffer can become available:
When the consumer calls io_putbuf and the IO has already completed, or when IO completes and the buffer has already been returned. This can cause a race, of course. I want to solve it using atomics, like this:
void io_completion(struct bufinfo *buf) {
    if(!__atomic_or_fetch(&buf->flags, ~IO_FLAG_IN_FLIGHT, ...))
        sem_post(semaphore);
}
void io_putbuf(struct bufinfo *buf) {
    if(!__atomic_or_fetch(&buf->flags, ~IO_FLAG_IN_USE, ...))
        sem_post(semaphore);
}
But I'm not sure which memory model to specify - does it matter?
tl;dr
Does the memory model apply to the atomic operations themselves (the load->or->return), or is it only relevant for operations preceding/following the atomic built-ins?

I take you to be asking what memory order property(-ies) to use, and to be asking in particular about the GCC builtin __atomic_and_fetch() (note spelling), which atomically modifies the specified memory location via a bitwise and operation on a scalar having non-_Atomic type and returns the result (atomic read / modify / write). The memory-order alternatives and resulting behavior correspond to the C++ memory model.
Do note well that that's a GCC-ism. C has had atomic types and atomic operations on them since C11, with the same set of memory-order alternatives as C++, but the __atomic_or_fetch() builtin and its siblings are separate from that and GCC-specific.
I'm not sure which memory model to specify - does it matter?
Yes, of course, else there wouldn't be alternatives to choose among.
Does the memory model apply to the atomic operations themselves (the load->or->return) or only relevant for operations preceding/following the atomic built-ins?
The memory order property describes the relationships, if any, between the read and write performed by the atomic operation on one hand and not-necessarily-atomic reads and writes of the same and other memory locations on the other. If ever you are uncertain what memory order to use then you should use __ATOMIC_SEQ_CST. That provides the strongest constraints, and it corresponds to the default memory order for C++ atomic operations.
Other alternatives relax memory-order constraints in various ways, which may afford performance improvements under some circumstances. However, those relaxations are likely to also cause your program to manifest intermittent misbehavior if in fact it requires the stronger constraints, and that analysis involves a holistic evaluation of your threads' use of shared variables and synchronization.
I am not confident that enough information has been presented to fully perform that analysis, but we can see at least that each of the atomic operations must observe the effect of the other, therefore each one needs both acquire and release semantics with respect to the affected memory location. That is, you need at least __ATOMIC_ACQ_REL ordering, which is only one step below __ATOMIC_SEQ_CST. Use the latter, because it's safer, especially given that you are uncertain about the intricacies involved. This is for an I/O subsystem, so even if the former would be sufficient, any performance gain you might see from using that is unlikely to be noticeable anyway.
UPDATE:
Since apparently the above is not clear, again:
Does the memory model apply to the atomic operations themselves (the load->or->return) or only relevant for operations preceding/following the atomic built-ins?
And again: the memory order property describes the relationships, if any, between the read and write performed by the atomic operation on one hand and not-necessarily-atomic reads and writes of the same and other memory locations on the other.
That is, the chosen memory order parameter affects
whether there are happens-before relationships between writes to the affected memory location and the atomic op's read of that location by other threads;
whether there are happens-before relationships between the atomic op's write to the affected storage location and other reads of that location by other threads;
whether there are happens-before relationships between the atomic op's read of the affected storage location and other actions performed by the same thread; and
whether there are happens-before relationships between the atomic op's write of the affected storage location and other actions performed by the same thread.
Note that I say "affects", not "determines". Those factors are also affected by the memory order of other atomic operations, by the use of synchronization objects and functions, by details of the other statements executed and expressions evaluated by all threads, and by the vagaries of thread scheduling during any particular run (at least).
All of that speaks to the guarantees upon which one can rely and the conclusions one can draw about relationships among the values of shared variables throughout the program observed by each thread.
The full scope of the guarantees you require is unclear, but you at least need happens-before relationships between the reads and writes of the affected location by one function and those by the other function, with no specific order imposed on the calls to those functions. That requires at minimum __ATOMIC_ACQ_REL ordering, but, again, one step up to __ATOMIC_SEQ_CST is probably a better choice, not least because it might be a necessary one in light of other program code not shown.
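To make the recommendation concrete, here is a minimal sketch of the two functions using __atomic_and_fetch() with __ATOMIC_SEQ_CST. It assumes flags is a plain unsigned int, and it replaces sem_post(semaphore) with a hypothetical counter, available_count, so that the effect is easy to observe; neither detail comes from the question. Whichever call clears the last flag is the one that releases the buffer.

```c
#include <assert.h>

#define IO_FLAG_IN_USE    1u  /* a consumer has taken ownership of the buffer */
#define IO_FLAG_IN_FLIGHT 2u  /* the system is using the buffer for async IO */

struct bufinfo {
    unsigned flags;           /* assumed layout; only the flags matter here */
};

/* Hypothetical stand-in for sem_post(semaphore): counts released buffers. */
static unsigned available_count;

static void mark_available(void)
{
    __atomic_add_fetch(&available_count, 1u, __ATOMIC_SEQ_CST);
}

/* Atomically clear one flag. The caller that observes the result 0
 * cleared the last flag and is the only one to release the buffer. */
static void clear_flag(struct bufinfo *buf, unsigned flag)
{
    assert(flag == IO_FLAG_IN_USE || flag == IO_FLAG_IN_FLIGHT);
    if (__atomic_and_fetch(&buf->flags, ~flag, __ATOMIC_SEQ_CST) == 0)
        mark_available();
}

void io_completion(struct bufinfo *buf) { clear_flag(buf, IO_FLAG_IN_FLIGHT); }
void io_putbuf(struct bufinfo *buf)     { clear_flag(buf, IO_FLAG_IN_USE); }
```

Because the read-modify-write is a single atomic step, the two calls can arrive in either order and exactly one of them sees zero.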

With regard to the atomic variable itself, the __ATOMIC_RELAXED memory order is always sufficient between two atomic operations. It guarantees that there will be some fixed order between them, and that they won't use cached values.
That's why you can use __ATOMIC_RELAXED for the simple incrementing of a counter; see the Relaxed ordering entry at cppreference.com.
In the example given, if the buffers involved in the example code are 'simple' buffers that carry no state (they are fixed-sized etc.) and the consuming thread (io_getbuf) does not care about the contents of the buffer (it is just using it for a new read) then there are no other dependencies and you can use __ATOMIC_RELAXED.
The need for memory models is when there are other dependencies involved - if the buffers contain metadata, like the buffer size, and that size might have been changed (e.g. realloc) by the releasing thread then __ATOMIC_RELAXED does not guarantee that the consuming thread will see the update to the size field. Similarly, if this wasn't just a thread pool, but a producer/consumer setup, then the consumer needs to be sure that the contents of the buffer are actually synchronized and have been written before consuming the buffer - that would require a different memory model.
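A rough sketch of that distinction, using the GCC builtins: a bare counter gets by with __ATOMIC_RELAXED, while metadata that must be visible to the consumer is published with a release store and picked up with an acquire load. The struct buffer layout and the function names here are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct buffer {
    size_t size;   /* plain metadata written by the releasing thread */
    int    ready;  /* flag touched only through the __atomic builtins */
};

static unsigned long counter;

/* No other data depends on the counter, so relaxed ordering is enough. */
void count_event(void)
{
    __atomic_fetch_add(&counter, 1ul, __ATOMIC_RELAXED);
}

unsigned long read_count(void)
{
    return __atomic_load_n(&counter, __ATOMIC_RELAXED);
}

/* Releasing side: write the metadata first, then publish with release. */
void publish(struct buffer *b, size_t new_size)
{
    assert(b != NULL);
    b->size = new_size;
    __atomic_store_n(&b->ready, 1, __ATOMIC_RELEASE);
}

/* Consuming side: the acquire load pairs with the release store, so a
 * consumer that sees ready == 1 also sees the updated size. */
int try_consume(struct buffer *b, size_t *size_out)
{
    if (!__atomic_load_n(&b->ready, __ATOMIC_ACQUIRE))
        return 0;
    *size_out = b->size;
    return 1;
}
```

With __ATOMIC_RELAXED on the ready flag instead, the consumer could see ready set and still read a stale size.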

Related

Is volatile necessary for the resource used in a critical section?

I am curious about whether volatile is necessary for the resources used in a critical section. Consider that I have two threads executing on two CPUs, competing for a shared resource. I know I need a locking mechanism to make sure only one thread is performing operations on that shared resource. Below is the pseudocode that will be executed on those two threads.
take_lock();
// Read shared resource.
read_shared_resource();
// Write something to shared resource.
write_shared_resource();
release_lock();
I am wondering if I need to make that shared resource volatile to make sure that when one thread is reading the shared resource, it won't just get the value from registers but will actually read from the shared resource. Or maybe I should use accessor functions that make the access to the shared resource volatile, with some memory barrier operations, instead of making the shared resource itself volatile?
I am curious about whether volatile is necessary for the resources used in a critical section. Consider that I have two threads executing on two CPUs, competing for a shared resource. I know I need a locking mechanism to make sure only one thread is performing operations on that shared resource.
Making sure that only one thread accesses a shared resource at a time is only part of what a locking mechanism adequate for the purpose will do. Among other things, such a mechanism will also ensure that all writes to shared objects performed by thread Ti before it releases lock L are visible to all other threads Tj after they subsequently acquire lock L. It guarantees that in terms of the C semantics of the program, notwithstanding any questions of compiler optimization, register usage, CPU instruction reordering, or similar.
When such a locking mechanism is used, volatile does not provide any additional benefit for making threads' writes to shared objects be visible to each other. When such a locking mechanism is not used, volatile does not provide a complete substitute.
C's built-in (since C11) mutexes provide a suitable locking mechanism, at least when using C's built-in threads. So do pthreads mutexes, Sys V and POSIX semaphores, and various other, similar synchronization objects available in various environments, each with respect to corresponding multithreading systems. These semantics are pretty consistent across C-like multithreading implementations, extending at least as far as Java. The semantic requirements for C's built-in multithreading are described in section 5.1.2.4 of the current (C17) language spec.
volatile is for indicating that an object might be accessed outside the scope of the C semantics of the program. That may happen to produce properties that interact with multithreaded execution in a way that is taken to be desirable, but that is not the purpose or intended use of volatile. If it were, or if volatile were sufficient for such purposes, then we would not also need _Atomic objects and operations.
The previous remarks focus on language-level semantics, and that is sufficient to answer the question. However, inasmuch as the question asks specifically about accessing variables' values from registers, I observe that compilers don't actually have to do anything much multithreading-specific in that area as long as acquiring and releasing locks requires calling functions.
In particular, if an execution E of function f writes to an object o that is visible to other functions or other executions of f, then the C implementation must ensure that that write is actually performed on memory before E evaluates any subsequent function call (such as is needed to release a lock). This is necessary because the value written must be visible to the execution of the called function, regardless of any other threads.
Similarly, if E uses the value of o after return from a function call (such as is needed to acquire a lock) then it must load that value from memory to ensure that it sees the effect of any write that the function may have performed.
The only thing special to multithreading in this regard is that the implementation must ensure that interprocedural analysis optimizations or similar do not subvert the needed memory reads and writes around the lock and unlock functions. In practice, this rarely requires special attention.
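As an illustration of the pattern under discussion, here is a small sketch with a pthreads mutex and a shared object that is deliberately not volatile. The names lock, shared_resource and update_shared are invented for the example; take_lock() and release_lock() from the question are assumed to map onto pthread_mutex_lock() and pthread_mutex_unlock().

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_resource;   /* deliberately NOT volatile */

/* Read-modify-write of the shared object inside the critical section.
 * The lock and unlock calls provide the visibility guarantees, so the
 * plain accesses in between need no volatile qualifier. */
long update_shared(long delta)
{
    int rc = pthread_mutex_lock(&lock);
    assert(rc == 0);
    long value = shared_resource;   /* read shared resource */
    value += delta;
    shared_resource = value;        /* write shared resource */
    rc = pthread_mutex_unlock(&lock);
    assert(rc == 0);
    return value;
}
```

Any thread that later calls update_shared() acquires the same mutex and therefore sees the value stored by the previous holder.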
The answer is no; volatile is not necessary (assuming the critical-section functions you are using were implemented correctly, and you are using them correctly, of course). Any proper critical-section API's implementation will include the memory-barriers necessary to handle flushing registers, etc, and therefore avoid the need for the volatile keyword.
volatile is normally used to inform the compiler that the data might be changed by something else (an interrupt, DMA, another CPU, ...), to prevent unexpected compiler optimizations.
So in your case you may or may not need it:
If no thread has a loop that waits for a value in the shared resource to change, you don't really need volatile.
If you have a wait such as while (shareVal == 0) in the source code, you need to tell the compiler explicitly with the volatile qualifier.
With 2 CPUs there may also be a cache issue, where a CPU only reads the value from its cache. Please consider configuring the memory attributes of the shared resource properly.

A thread only reads and a thread only modifies. Does this variable also need a mutex with linux c? [duplicate]

There are 2 threads: one only reads the signal, the other only sets the signal.
Is it necessary to create a mutex for the signal, and if so, why?
UPDATE
All I care about is whether it'll crash if the two threads read and set it at the same time.
You will probably want to use atomic variables for this, though a mutex would work as well.
The problem is that there is no guarantee that data will stay in sync between threads, but using atomic variables ensures that as soon as one thread updates that variable, other threads immediately read its updated value.
A problem could occur if one thread updates the variable in cache, and a second thread reads the variable from memory. That second thread would read an out-of-date value for the variable, if the cache had not yet been flushed to memory. Atomic variables ensure that the value of the variable is consistent across threads.
If you are not concerned with timely variable updates, you may be able to get away with a single volatile variable.
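For the one-writer, one-reader case, a sketch with C11 atomics might look like the following. The names signal_flag, set_signal and signal_is_set are placeholders, and the release/acquire pairing is one reasonable choice, not the only one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Shared flag; static storage means it starts out false. */
static atomic_bool signal_flag;

/* Called only by the thread that sets the signal. */
void set_signal(atomic_bool *flag)
{
    assert(flag != NULL);
    atomic_store_explicit(flag, true, memory_order_release);
}

/* Called only by the thread that reads the signal. */
bool signal_is_set(atomic_bool *flag)
{
    assert(flag != NULL);
    return atomic_load_explicit(flag, memory_order_acquire);
}
```

The reader would poll signal_is_set(&signal_flag) while the writer calls set_signal(&signal_flag) once; no mutex is involved.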
It depends. If writes are atomic then you don't need a mutual exclusion lock. If writes are not atomic, then you do need a lock.
There is also the issue of compilers caching variables in the CPU cache, which may cause the copy in main memory to not get updated on every write. Some languages have ways of telling the compiler not to cache a variable in the CPU like that (the volatile keyword in Java), or to tell the compiler to sync any cached values with main memory (the synchronized keyword in Java). But mutexes in general don't solve this problem.
If all you need is synchronization between threads (one thread must complete something before the other can begin something else) then mutual exclusion should not be necessary.
Mutual exclusion is only necessary when threads are sharing some resource that could be corrupted if they both run through the critical section at roughly the same time. Think of two people who share a bank account and are at two different ATMs at the same time.
Depending on your language/threading library, you may use the same mechanism for synchronization as you do for mutual exclusion: either a semaphore or a monitor. So, if you are using Pthreads, someone here could post an example of synchronization and another of mutual exclusion. If it's Java, there would be another example. Perhaps you can tell us what language/library you're using.
If, as you've said in your edit, you only want to assure against a crash, then you don't need to do much of anything (at least as a rule). If you get a collision between threads, about the worst that will happen is that the data will be corrupted -- e.g., the reader might get a value that's been partially updated, and doesn't correspond directly to any value the writing thread ever wrote. The classic example would be a multi-byte number that you added something to, and there was a carry, (for example) the old value was 0x3f ffff, which was being incremented. It's possible the reading thread could see 0x3f 0000, where the lower 16 bits have been incremented, but the carry to the upper 16 bits hasn't happened (yet).
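The arithmetic of that example can be checked with a few lines of C. The helper torn_read() below is purely illustrative: it models a reader that picks up the new low 16 bits together with the old high 16 bits of a 32-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a torn 32-bit read: the low half comes from the new value,
 * the high half still comes from the old value. */
uint32_t torn_read(uint32_t old_value, uint32_t new_value)
{
    uint32_t result = (old_value & 0xFFFF0000u) | (new_value & 0x0000FFFFu);
    /* The two halves can never overlap. */
    assert((result & 0xFFFF0000u) == (old_value & 0xFFFF0000u));
    return result;
}
```

For old_value 0x003FFFFF and new_value 0x00400000, torn_read() yields 0x003F0000, a value the writer never stored.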
On a modern machine, an increment on that small of a data item will normally be atomic, but there will be some size (and alignment) where it's not -- typically if part of the variable is in one cache line, and part in another, it'll no longer be atomic. The exact size and alignment for that varies somewhat, but the basic idea remains the same -- it's mostly just a matter of the number having enough digits for it to happen.
Of course, if you're not careful, something like that could cause your code to deadlock or something on that order -- it's impossible to guess what might happen without knowing anything about how you plan to use the data.

Is it possible to achieve 2 lines of code to always occur in a order in a multithreaded program without locks?

atomic_compare_exchange_strong_explicit(mem, old, new, <mem_order>, <mem_order>);
ftruncate(fd, <size>);
All I want is that these two lines of code always occur without any interference (WITHOUT USING LOCKS). Immediately after that CAS, ftruncate(2) should be called. I read a small description about memory orders, although I don’t understand them much. But they seemed to make this possible. Is there any way around?
Your title asks for the things to occur in order. That's easy, and C basically does that automatically with memory_order_seq_cst; all visible side-effects of the CAS will appear before any from ftruncate.
(Not strictly required by the ISO C standard, but in practice real implementations implement seq-cst with a full barrier, except AArch64 where STLR doesn't stall to drain the store buffer unless/until there's a LDAR while the seq-cst store is still in the store buffer. But a system call is definitely going to also include a full barrier.)
Within the thread doing the operation, the atomic is Sequenced Before the system call.
What kind of interference are you worried about? Some other thread changing the size of the file? You can't prevent that race condition.
There's no way to combine some operation on memory + a system call into a single atomic transaction. You would need to use a hypothetical system call that atomically does what you want. (Presumably it would have to do locking inside the kernel to make a file operation and a memory modification appear as one atomic transaction.) e.g. the Linux futex system call atomically does a couple things, but of course there's nothing like this for any other operations.
Or you need locking. (Or to suspend all other threads of your process somehow.)
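What can be had without locks is just the ordering, roughly as in the sketch below. The function name cas_then_truncate and its return convention are invented here. The CAS and the ftruncate call run in program order, but another thread can still act between them, so this is not an atomic transaction.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the CAS failed (someone else changed *mem first),
 * 0 if the CAS succeeded and ftruncate succeeded,
 * -1 if the CAS succeeded but ftruncate failed (errno is set). */
int cas_then_truncate(atomic_int *mem, int expected, int desired,
                      int fd, off_t size)
{
    assert(mem != NULL);
    if (!atomic_compare_exchange_strong_explicit(
            mem, &expected, desired,
            memory_order_seq_cst, memory_order_seq_cst))
        return 1;
    /* Sequenced after the CAS within this thread. Other threads may
     * still run between the two operations. */
    return ftruncate(fd, size) == 0 ? 0 : -1;
}
```

If the pair really has to look indivisible to other threads, a mutex around both calls is the straightforward option, as the answer says.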

Is it true that "volatile" in a userspace program tends to indicate a bug?

When I was googling "volatile" and its user space usage, I found mails between Theodore Tso and Linus Torvalds. According to these great masters, use of "volatile" in userspace is probably a bug? Check the discussion here.
Although they give some explanations, I really couldn't understand them. Could anyone explain in simple language why they said so? Are we not supposed to use volatile in user space?
volatile tells the compiler that every read and write has an observable side effect; thus, the compiler can't make any assumptions about two reads or two writes in a row having the same effect.
For instance, normally, the following code:
int a = *x;
int b = *x;
if (a == b)
    printf("Hi!\n");
Could be optimized into:
printf("Hi!\n");
What volatile does is tell the compiler that those values might be coming from somewhere outside of the program's control, so it has to actually read those values and perform the comparison.
A lot of people have made the mistake of thinking that they could use volatile to build lock-free data structures, which would allow multiple threads to share values, and they could observe the effects of those values in other threads.
However, volatile says nothing about how different threads interact, and could be applied to values that could be cached with different values on different cores, or could be applied to values that can't be atomically written in a single operation, and so if you try to write multi-threaded or multi-core code using volatile, you can run into a lot of problems.
Instead, you need to either use locks or some other standard concurrency mechanism to communicate between threads, or use memory barriers, or use C11/C++11 atomic types and atomic operations. Locks ensure that an entire region of code has exclusive access to a variable, which can work if you have a value that is too large, too small, or not aligned to be atomically written in a single operation, while memory barriers and the atomic types and operations provide guarantees about how they work with the CPU to ensure that caches are synchronized or reads and writes happen in particular orders.
Basically, volatile winds up mostly being useful when you're interfacing with a single hardware register, which can vary outside the program's control but may not require any special atomic operations to access. Or it can be used in signal handlers, where, because a thread could be interrupted, the handler run, and then control returned within the same thread, you need to use a volatile value if you want to communicate a flag to the interrupted code.
But if you're doing any kind of synchronization between threads, you should be using locks or some other concurrency primitives provided by a standard library, or really know what you're doing with regard to memory ordering and use memory barriers or atomic operations.
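The signal-handler case mentioned above is one of the few userspace uses that standard C spells out. A minimal sketch, with the names got_signal, install_handler and signal_seen chosen for illustration:

```c
#include <assert.h>
#include <signal.h>

/* volatile sig_atomic_t is the type the C standard designates for a
 * flag written by a signal handler and read by the interrupted code. */
static volatile sig_atomic_t got_signal;

static void on_signal(int signo)
{
    (void)signo;
    got_signal = 1;
}

/* Install the handler for SIGINT; returns 0 on success, -1 on failure. */
int install_handler(void)
{
    return signal(SIGINT, on_signal) == SIG_ERR ? -1 : 0;
}

/* Polled by the main flow of the same thread. */
int signal_seen(void)
{
    assert(got_signal == 0 || got_signal == 1);
    return got_signal != 0;
}
```

This communicates between a handler and the code it interrupted in the same thread; it is not a substitute for atomics or locks between threads.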

In C, how do I make sure that a memory load is performed only once?

I am programming two processes that communicate by posting messages to each other in a segment of shared memory. Although the messages are not accessed atomically, synchronization is achieved by protecting the messages with shared atomic objects accessed with store-releases and load-acquires.
My problem is about security. The processes do not trust each other. Upon receiving a message, a process makes no assumption about the message being well formed; it first copies the message from shared memory to private memory, then performs some validation on this private copy and, if valid, proceeds to handle this same private copy. Making this private copy is crucial, as it prevents a TOC/TOU attack in which the other process would modify the message between validation and use.
My question is the following: does the standard guarantee that a clever C compiler will never decide that it can read the original instead of the copy? Imagine the following scenario, in which the message is a simple integer:
int private = *pshared; // pshared points to the message in shared memory
...
if (is_valid(private)) {
    ...
    handle(private);
}
If the compiler runs out of registers and temporarily needs to spill private, could it decide, instead of spilling it to the stack, that it can simply discard its value and reload it from *pshared later, provided that an alias analysis ensures that this thread has not changed *pshared?
My guess is that such a compiler optimization would not preserve the semantics of the source program, and would therefore be illegal: pshared does not point to an object that is provably reachable from this thread only (such as an object allocated on the stack whose address has not leaked), therefore the compiler cannot rule out that some other thread might concurrently modify *pshared. By contrast, the compiler may eliminate redundant loads, because one of the possible behaviors is that no other thread runs between the redundant loads, therefore the current thread must be ready to deal with this particular behavior.
Could anyone confirm or refute that guess and possibly provide references to the relevant parts of the standard?
(By the way: I assume that the message type has no trap representations, so that loads are always defined.)
UPDATE
Several posters have commented on the need for synchronization, which I did not intend to get into, since I believe that I already have this covered. But since people are pointing that out, it is only fair that I provide more details.
I am implementing a low-level asynchronous communication system between two entities that do not trust each other. I run tests with processes, but will eventually move to virtual machines on top of a hypervisor. I have two basic ingredients at my disposal: shared memory and a notification mechanism (typically, injecting an IRQ into the other virtual machine).
I have implemented a generic circular buffer structure with which the communicating entities can produce messages, then send the aforementioned notifications to let each other know when there is something to consume. Each entity maintains its own private state that tracks what it has produced/consumed, and there is a shared state in shared memory composed of message slots and atomic integers tracking the bounds of the regions holding pending messages. The protocol unambiguously identifies which message slots are to be exclusively accessed by which entity at any time. When it needs to produce a message, an entity writes a message (non atomically) to the appropriate slot, then performs an atomic store-release to the appropriate atomic integer to transfer the ownership of the slot to the other entity, then waits until memory writes have completed, then sends a notification to wake up the other entity. Upon receiving a notification, the other entity is expected to perform an atomic load-acquire on the appropriate atomic integer, determine how many pending messages there are, then consume them.
The load of *pshared in my code snippet is just an example of what consuming a trivial (int) message looks like. In a realistic setting, the message would be a structure. Consuming a message does not need any particular atomicity or synchronization, since, as specified by the protocol, it only happens when the consuming entity has synchronized with the other one and knows that it owns the message slot. As long as both entities follow the protocol, everything works flawlessly.
Now, I do not want the entities to have to trust each other. Their implementation must be robust against a malicious entity that would disregard the protocol and write all over the shared memory segment at any time. If this were to happen, the only thing the malicious entity should be able to achieve would be to disrupt the communication. Think of a typical server, which must be prepared to handle ill-formed requests from a malicious client without letting such misbehavior cause buffer overflows or out-of-bounds accesses.
So, while the protocol relies on synchronization for normal operation, the entities must be prepared for the contents of shared memory to change at any time. All I need is a way to make sure that, after an entity makes a private copy of a message, it validates and uses that same copy, and never accesses the original any more.
I have an implementation that copies the message using a volatile read, thus making it clear to the compiler that the shared memory does not have ordinary memory semantics. I believe that this is sufficient; I wonder whether it is necessary.
You should inform the compiler that the shared memory can change at any moment by using the volatile qualifier.
volatile int *pshared;
...
int private = *pshared; // pshared points to the message in shared memory
...
if (is_valid(private)) {
    ...
    handle(private);
}
As *pshared is declared to be volatile, the compiler can no longer assume that *pshared and private keep the same value.
Per your edit, it is now clear that we all agree that a volatile qualifier on the shared memory is sufficient to guarantee that the compiler will honour the ordering of all accesses to that shared memory.
Anyway, draft N1256 for C99 is explicit about it in 5.1.2.3 Program execution (emphasis mine)
2 Accessing a volatile object, modifying an object, modifying a file, or calling a function
that does any of those operations are all side effects, which are changes in the state of
the execution environment. Evaluation of an expression may produce side effects. At
certain specified points in the execution sequence called sequence points, all side effects
of previous evaluations shall be complete and no side effects of subsequent evaluations
shall have taken place.
5 The least requirements on a conforming implementation are:
— At sequence points, volatile objects are stable in the sense that previous accesses are
complete and subsequent accesses have not yet occurred
— At program termination, all data written into files shall be identical to the result that
execution of the program according to the abstract semantics would have produced.
That suggests that even if pshared is not qualified as volatile, the value of private must have been loaded from *pshared before the evaluation of is_valid, and as the abstract machine has no reason to change it before the evaluation of handle, a conforming implementation should not change it. At most it could remove the call to handle if it contained no side effects, which is unlikely to happen.
Anyway, this is only an academic discussion, because I cannot imagine a real use case where shared memory would not need the volatile qualifier. If you do not use it, the compiler is free to believe that the previous value is still valid, so on a second access you will still get the first value. So even if the answer to this question is that it is not necessary, you still have to use volatile int *pshared;.
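For a structure-sized message, the same idea might be sketched as follows. Everything named here (struct message, copy_once_from_shared, is_valid, receive, the length limit of 64) is hypothetical; the point is only that each byte of shared memory is read exactly once through a volatile-qualified pointer, and that validation and use both work on the private copy.

```c
#include <assert.h>
#include <stddef.h>

struct message {
    int kind;
    int length;
};

/* Byte-wise copy through a volatile-qualified source, so every byte of
 * the shared slot is read exactly once and never re-fetched later. */
static void copy_once_from_shared(void *dst, const volatile void *src, size_t n)
{
    unsigned char *d = dst;
    const volatile unsigned char *s = src;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
}

/* Hypothetical validation rule, standing in for the real protocol check. */
static int is_valid(const struct message *m)
{
    return m->length >= 0 && m->length <= 64;
}

/* Returns 0 and fills *out on success, -1 if the message is rejected. */
int receive(const volatile struct message *pshared, struct message *out)
{
    assert(pshared != NULL && out != NULL);
    struct message private_copy;
    copy_once_from_shared(&private_copy, pshared, sizeof private_copy);
    if (!is_valid(&private_copy))
        return -1;
    *out = private_copy;   /* only the validated private copy is used */
    return 0;
}
```

After copy_once_from_shared() returns, nothing in receive() refers to pshared again, so later changes to the shared slot cannot affect the validated data.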
It's hard to answer your question as posted. Note that you must use a synchronization object to prevent concurrent accesses, unless you are only reading units which are atomic on the platform.
I am assuming that you intend to ask about (pseudocode):
lock_shared_area();
int private = *pshared;
unlock_shared_area();
if (is_valid(private))
and that the other process also uses the same lock. (If not, it would be good to update your question to be a bit more specific about your synchronization).
This code guarantees to read *pshared at most once. Using the name private means to read the variable private, not the object *pshared. The compiler "knows" that the call to unlock the area acts as a memory fence and it won't reorder operations past the fence.
Since C doesn't have any concept of interprocess communication, there is nothing you can do to inform the compiler that there is another process that might be modifying the memory.
Thus, I believe there is no way to prevent a sufficiently clever, malicious, but conforming build system from invoking the "as if" rule to allow it to do the Wrong Thing.
To get something that is 'guaranteed' to work, you need to work with whatever guarantees are given by the specific compiler and/or shared memory library you're using.

Resources