Is volatile necessary for the resource used in a critical section?

I am curious about whether volatile is necessary for the resources used in a critical section. Consider two threads executing on two CPUs, competing for a shared resource. I know I need a locking mechanism to make sure only one thread performs operations on that shared resource at a time. Below is the pseudo code that will be executed on those two threads.
take_lock();
// Read shared resource.
read_shared_resource();
// Write something to shared resource.
write_shared_resource();
release_lock();
I am wondering if I need to make that shared resource volatile to ensure that when one thread reads the shared resource, it won't just get the value from a register but will actually read from memory. Or should I instead use accessor functions that perform volatile accesses to that shared resource, with some memory barrier operations, rather than making the shared resource itself volatile?

I am curious about whether volatile is necessary for the resources used in a critical section. Consider two threads executing on two CPUs, competing for a shared resource. I know I need a locking mechanism to make sure only one thread performs operations on that shared resource at a time.
Making sure that only one thread accesses a shared resource at a time is only part of what a locking mechanism adequate for the purpose will do. Among other things, such a mechanism will also ensure that all writes to shared objects performed by thread Ti before it releases lock L are visible to all other threads Tj after they subsequently acquire lock L. And that holds in terms of the C semantics of the program, notwithstanding any questions of compiler optimization, register usage, CPU instruction reordering, or similar.
When such a locking mechanism is used, volatile does not provide any additional benefit for making threads' writes to shared objects be visible to each other. When such a locking mechanism is not used, volatile does not provide a complete substitute.
C's built-in (since C11) mutexes provide a suitable locking mechanism, at least when using C's built-in threads. So do pthreads mutexes, Sys V and POSIX semaphores, and various other, similar synchronization objects available in various environments, each with respect to corresponding multithreading systems. These semantics are pretty consistent across C-like multithreading implementations, extending at least as far as Java. The semantic requirements for C's built-in multithreading are described in section 5.1.2.4 of the current (C17) language spec.
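For illustration, a minimal sketch using a C11 mtx_t (the names init, increment, and shared_counter are mine, not from the question). The mutex alone provides the visibility guarantees; note that the shared object is not volatile:
#include <threads.h>

static mtx_t lock;
static int shared_counter;      // the shared resource; note: not volatile

void init(void) {               // call once, before the threads start
    mtx_init(&lock, mtx_plain);
}

int increment(void) {
    int result;
    mtx_lock(&lock);            // acquire: writes made before other threads' unlocks are visible
    result = ++shared_counter;
    mtx_unlock(&lock);          // release: this write is visible to the next thread that locks
    return result;
}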
volatile is for indicating that an object might be accessed outside the scope of the C semantics of the program. That may happen to produce properties that interact with multithreaded execution in a way that is taken to be desirable, but that is not the purpose or intended use of volatile. If it were, or if volatile were sufficient for such purposes, then we would not also need _Atomic objects and operations.
The previous remarks focus on language-level semantics, and that is sufficient to answer the question. However, inasmuch as the question asks specifically about accessing variables' values from registers, I observe that compilers don't actually have to do anything much multithreading-specific in that area as long as acquiring and releasing locks requires calling functions.
In particular, if an execution E of function f writes to an object o that is visible to other functions or other executions of f, then the C implementation must ensure that that write is actually performed on memory before E evaluates any subsequent function call (such as is needed to release a lock). This is necessary because the value written must be visible to the execution of the called function, regardless of any other threads.
Similarly, if E uses the value of o after return from a function call (such as is needed to acquire a lock) then it must load that value from memory to ensure that it sees the effect of any write that the function may have performed.
The only thing special to multithreading in this regard is that the implementation must ensure that interprocedural analysis optimizations or similar do not subvert the needed memory reads and writes around the lock and unlock functions. In practice, this rarely requires special attention.
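As a sketch of that point, suppose take_lock() and release_lock() are defined in another translation unit, so the compiler cannot see into them:
extern void take_lock(void);    // opaque: defined elsewhere
extern void release_lock(void);

int shared;                     // visible to other functions, hence possibly to the callees

void update(void) {
    take_lock();
    shared = shared + 1;        // cannot be kept only in a register: release_lock()
                                // might read `shared`, so the store must reach
    release_lock();             // memory before the call is evaluated
}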

The answer is no; volatile is not necessary (assuming, of course, that the critical-section functions you are using were implemented correctly and that you are using them correctly). Any proper critical-section API's implementation will include the memory barriers necessary to handle flushing registers, etc., and therefore avoids the need for the volatile keyword.

volatile is normally used to inform the compiler that the data might be changed by something else (an interrupt, DMA, another CPU, ...), in order to prevent unexpected compiler optimizations.
So in your case you may or may not need it:
If the thread has no loop that watches the shared resource for a value change, you don't really need volatile.
If you have a busy-wait like while (shareVal == 0) in the source code, you need to tell the compiler explicitly with the volatile qualifier, as sketched below.
For the two-CPU case, there is also a possible issue with caching, where a CPU reads the value only from its cache memory. Please consider configuring the memory attributes for the shared resource appropriately.
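A minimal sketch of the busy-wait case from the second point (shareVal is assumed to be written by another thread or an interrupt handler). Note that C11 atomics would be the portable way to also get atomicity and ordering guarantees here:
volatile int shareVal;          // set by another thread / interrupt handler

void wait_for_signal(void) {
    while (shareVal == 0)       // without volatile, the compiler may hoist the load
        ;                       // out of the loop and spin on a stale register value
}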

Related

Does the C/C++ memory model apply to the atomic operation itself?

I'm left confused about when the C/C++ memory model is relevant, even after reading the GCC wiki.
My code is an IO library that allows taking/returning a buffer from a pool and using it for async IO. However, even after the buffer is returned to the pool, it isn't free unless the actual IO operation has also completed.
Each buffer has a structure that has status flags:
#define IO_FLAG_IN_USE 1 // a consumer has taken ownership of the buffer
#define IO_FLAG_IN_FLIGHT 2 // the buffer is in use by the system for async IO
A consumer requests a buffer with io_getbuf and waits using sem_wait. There are two ways a buffer can become available:
When the consumer calls io_putbuf and the IO has already completed, or when IO completes and the buffer has already been returned. This can cause a race, of course. I want to solve it using atomics, like this:
void io_completion(struct bufinfo *buf) {
    if (!__atomic_or_fetch(&buf->flags, ~IO_FLAG_IN_FLIGHT, ...))
        sem_post(semaphore);
}

void io_putbuf(struct bufinfo *buf) {
    if (!__atomic_or_fetch(&buf->flags, ~IO_FLAG_IN_USE, ...))
        sem_post(semaphore);
}
But I'm not sure which memory model to specify - does it matter?
tl;dr
Does the memory model apply to the atomic operations themselves (the load->or->return), or is it only relevant for operations preceding/following the atomic built-ins?
I take you to be asking what memory order property(-ies) to use, and to be asking in particular about the GCC builtin __atomic_and_fetch() (note spelling), which atomically modifies the specified memory location via a bitwise and operation on a scalar having non-_Atomic type and returns the result (atomic read / modify / write). The memory-order alternatives and resulting behavior correspond to the C++ memory model.
Do note well that that's a GCC-ism. C has had atomic types and atomic operations on them since C11, with the same set of memory-order alternatives as C++, but the __atomic_or_fetch() builtin and its siblings are separate from that and GCC-specific.
I'm not sure which memory model to specify - does it matter?
Yes, of course, else there wouldn't be alternatives to choose among.
Does the memory model apply to the atomic operations themselves (the load->or->return), or is it only relevant for operations preceding/following the atomic built-ins?
The memory order property describes the relationships, if any, between the read and write performed by the atomic operation on one hand and not-necessarily-atomic reads and writes of the same and other memory locations on the other. If ever you are uncertain what memory order to use then you should use __ATOMIC_SEQ_CST. That provides the strongest constraints, and it corresponds to the default memory order for C++ atomic operations.
Other alternatives relax memory-order constraints in various ways, which may afford performance improvements under some circumstances. However, those relaxations are likely to also cause your program to manifest intermittent misbehavior if in fact it requires the stronger constraints, and that analysis involves a holistic evaluation of your threads' use of shared variables and synchronization.
I am not confident that enough information has been presented to fully perform that analysis, but we can see at least that each of the atomic operations must observe the effect of the other, therefore each one needs both acquire and release semantics with respect to the affected memory location. That is, you need at least __ATOMIC_ACQ_REL ordering, which is only one step below __ATOMIC_SEQ_CST. Use the latter, because it's safer, especially given that you are uncertain about the intricacies involved. This is for an I/O subsystem, so even if the former would be sufficient, any performance gain you might see from using that is unlikely to be noticeable anyway.
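For concreteness, here is what the question's two functions might look like with the spelling corrected to __atomic_and_fetch and the conservative ordering recommended above. This is only a sketch mirroring the question's code, under the same assumptions about bufinfo and semaphore:
void io_completion(struct bufinfo *buf) {
    // atomically clear IN_FLIGHT; the result is 0 only if IN_USE was already clear
    if (!__atomic_and_fetch(&buf->flags, ~IO_FLAG_IN_FLIGHT, __ATOMIC_SEQ_CST))
        sem_post(semaphore);
}

void io_putbuf(struct bufinfo *buf) {
    // atomically clear IN_USE; the result is 0 only if IN_FLIGHT was already clear
    if (!__atomic_and_fetch(&buf->flags, ~IO_FLAG_IN_USE, __ATOMIC_SEQ_CST))
        sem_post(semaphore);
}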
UPDATE:
Since apparently the above is not clear, again:
Does the memory model apply to the atomic operations themselves (the load->or->return), or is it only relevant for operations preceding/following the atomic built-ins?
And again: the memory order property describes the relationships, if any, between the read and write performed by the atomic operation on one hand and not-necessarily-atomic reads and writes of the same and other memory locations on the other.
That is, the chosen memory order parameter affects
whether there are happens-before relationships between writes to the affected memory location and the atomic op's read of that location by other threads;
whether there are happens-before relationships between the atomic op's write to the affected storage location and other reads of that location by other threads;
whether there are happens-before relationships between the atomic op's read of the affected storage location and other actions performed by the same thread; and
whether there are happens-before relationships between the atomic op's write of the affected storage location and other actions performed by the same thread.
Note that I say "affects", not "determines". Those factors are also affected by the memory order of other atomic operations, by the use of synchronization objects and functions, by details of the other statements executed and expressions evaluated by all threads, and by the vagaries of thread scheduling during any particular run (at least).
All of that speaks to the guarantees upon which one can rely and the conclusions one can draw about relationships among the values of shared variables throughout the program observed by each thread.
The full scope of the guarantees you require is unclear, but you at least need happens-before relationships between the reads and writes of the affected location by one function and those by the other function, with no specific order imposed on the calls to those functions. That requires at minimum __ATOMIC_ACQ_REL ordering, but, again, one step up to __ATOMIC_SEQ_CST is probably a better choice, not least because it might be a necessary one in light of other program code not shown.
With regard to the atomic variable itself, the __ATOMIC_RELAXED memory model is always sufficient between two atomic operations. It guarantees that there will be some fixed order between them and that they won't use cached values.
That's why you can use __ATOMIC_RELAXED for the simple incrementing of a counter, see the Relaxed ordering entry at cppreference.com
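A minimal sketch of that counter case (the names are illustrative):
static unsigned long counter;   // only the final total matters, so relaxed is enough

void count_event(void) {
    __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
}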
In the example given, if the buffers involved in the example code are 'simple' buffers that carry no state (they are fixed-size, etc.) and the consuming thread (io_getbuf) does not care about the contents of the buffer (it is just using it for a new read), then there are no other dependencies and you can use __ATOMIC_RELAXED.
The need for memory models is when there are other dependencies involved - if the buffers contain metadata, like the buffer size, and that size might have been changed (e.g. realloc) by the releasing thread then __ATOMIC_RELAXED does not guarantee that the consuming thread will see the update to the size field. Similarly, if this wasn't just a thread pool, but a producer/consumer setup, then the consumer needs to be sure that the contents of the buffer are actually synchronized and have been written before consuming the buffer - that would require a different memory model.
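As a sketch of that producer/consumer case, where a release/acquire pair is the usual choice (all names here are illustrative):
static char payload[64];        // the data being handed over
static int  ready;              // 0 = not yet published

void publish(void) {            // producer: fill the buffer, then publish it
    payload[0] = 42;            // ... write the payload ...
    __atomic_store_n(&ready, 1, __ATOMIC_RELEASE);  // prior writes become visible
}

int try_consume(void) {         // consumer: read the payload only after acquiring the flag
    if (__atomic_load_n(&ready, __ATOMIC_ACQUIRE)) {
        return payload[0];      // safe: the acquire load synchronizes with the
    }                           // release store, so the payload writes are visible
    return -1;                  // nothing published yet
}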

Is it true that "volatile" in a userspace program tends to indicate a bug?

When googling about volatile and its user-space usage, I found mails between Theodore Tso and Linus Torvalds. According to these great masters, use of volatile in userspace is probably a bug? Check the discussion here.
Although they give some explanations, I really couldn't understand them. Could anyone explain in simple language why they said so? Are we not supposed to use volatile in user space?
volatile tells the compiler that every read and write has an observable side effect; thus, the compiler can't make any assumptions about two reads or two writes in a row having the same effect.
For instance, normally, the following code:
int a = *x;
int b = *x;
if (a == b)
printf("Hi!\n");
Could be optimized into:
printf("Hi!\n");
What volatile does is tell the compiler that those values might be coming from somewhere outside of the program's control, so it has to actually read those values and perform the comparison.
A lot of people have made the mistake of thinking that they could use volatile to build lock-free data structures, which would allow multiple threads to share values, and they could observe the effects of those values in other threads.
However, volatile says nothing about how different threads interact, and could be applied to values that could be cached with different values on different cores, or could be applied to values that can't be atomically written in a single operation, and so if you try to write multi-threaded or multi-core code using volatile, you can run into a lot of problems.
Instead, you need to either use locks or some other standard concurrency mechanism to communicate between threads, or use memory barriers, or use C11/C++11 atomic types and atomic operations. Locks ensure that an entire region of code has exclusive access to a variable, which can work if you have a value that is too large, too small, or not aligned to be atomically written in a single operation, while memory barriers and the atomic types and operations provide guarantees about how they work with the CPU to ensure that caches are synchronized or reads and writes happen in particular orders.
Basically, volatile winds up being useful mostly when you're interfacing with a single hardware register, which can vary outside the program's control but may not require any special atomic operations to access. Or it can be used in signal handlers: because a thread can be interrupted, the handler run, and control then returned within the same thread, you need to use a volatile value if you want to communicate a flag to the interrupted code.
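The signal-handler case typically looks something like this minimal sketch:
#include <signal.h>

static volatile sig_atomic_t got_signal;   // the one classic user-space use of volatile

static void handler(int sig) {
    (void)sig;
    got_signal = 1;             // async-signal-safe: just set the flag
}

int main(void) {
    signal(SIGINT, handler);
    while (!got_signal)         // volatile forces a fresh read on each iteration
        ;                       // (a real program would do work or sleep here)
    return 0;
}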
But if you're doing any kind of synchronization between threads, you should be using locks or some other concurrency primitives provided by a standard library, or really know what you're doing with regard to memory ordering and use memory barriers or atomic operations.

How to allow two threads to share a global variable in WinAPI?

I have two threads that are created using CreateThread(), and I have a global variable that one thread writes to, and the other thread reads from.
Now based on my understanding, the compiler and/or the CPU can do all sorts of optimizations, which could mean for example that when I write a value to the variable, the value can be written in some cache and not written directly to memory (and hence the other thread will not be able to see it).
I have read that I can wrap the code that accesses the variable in a critical section, but the documentation says that a critical section only enforces mutual exclusion; it does not say anything about enforcing writes directly to memory and reads directly from memory.
Note that I do not wish to use the volatile keyword; I want to know how this is done in pure WinAPI (as I might use a language other than C at a later time).
MSDN explicitly states that critical sections are memory barriers. https://msdn.microsoft.com/en-us/library/windows/desktop/ms686355(v=vs.85).aspx
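So a plain critical section suffices; a minimal sketch (the function names are mine) might look like:
#include <windows.h>

CRITICAL_SECTION cs;
int shared_value;               // the global shared between the two threads

void init(void) {               // call once, before creating the threads
    InitializeCriticalSection(&cs);
}

void writer(int v) {
    EnterCriticalSection(&cs);  // acts as a memory barrier on entry
    shared_value = v;
    LeaveCriticalSection(&cs);  // the write is visible to the next locker
}

int reader(void) {
    int v;
    EnterCriticalSection(&cs);
    v = shared_value;
    LeaveCriticalSection(&cs);
    return v;
}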

In C, how do I make sure that a memory load is performed only once?

I am programming two processes that communicate by posting messages to each other in a segment of shared memory. Although the messages are not accessed atomically, synchronization is achieved by protecting the messages with shared atomic objects accessed with store-releases and load-acquires.
My problem is about security. The processes do not trust each other. Upon receiving a message, a process makes no assumption about the message being well formed; it first copies the message from shared memory to private memory, then performs some validation on this private copy and, if valid, proceeds to handle this same private copy. Making this private copy is crucial, as it prevents a TOC/TOU attack in which the other process would modify the message between validation and use.
My question is the following: does the standard guarantee that a clever C compiler will never decide that it can read the original instead of the copy? Imagine the following scenario, in which the message is a simple integer:
int private = *pshared; // pshared points to the message in shared memory
...
if (is_valid(private)) {
    ...
    handle(private);
}
If the compiler runs out of registers and temporarily needs to spill private, could it decide, instead of spilling it to the stack, that it can simply discard its value and reload it from *pshared later, provided that an alias analysis ensures that this thread has not changed *pshared?
My guess is that such a compiler optimization would not preserve the semantics of the source program, and would therefore be illegal: pshared does not point to an object that is provably reachable from this thread only (such as an object allocated on the stack whose address has not leaked), therefore the compiler cannot rule out that some other thread might concurrently modify *pshared. By contrast, the compiler may eliminate redundant loads, because one of the possible behaviors is that no other thread runs between the redundant loads, therefore the current thread must be ready to deal with this particular behavior.
Could anyone confirm or refute that guess and possibly provide references to the relevant parts of the standard?
(By the way: I assume that the message type has no trap representations, so that loads are always defined.)
UPDATE
Several posters have commented on the need for synchronization, which I did not intend to get into, since I believe that I already have this covered. But since people are pointing that out, it is only fair that I provide more details.
I am implementing a low-level asynchronous communication system between two entities that do not trust each other. I run tests with processes, but will eventually move to virtual machines on top of a hypervisor. I have two basic ingredients at my disposal: shared memory and a notification mechanism (typically, injecting an IRQ into the other virtual machine).
I have implemented a generic circular buffer structure with which the communicating entities can produce messages, then send the aforementioned notifications to let each other know when there is something to consume. Each entity maintains its own private state that tracks what it has produced/consumed, and there is a shared state in shared memory composed of message slots and atomic integers tracking the bounds of the regions holding pending messages. The protocol unambiguously identifies which message slots are to be exclusively accessed by which entity at any time. When it needs to produce a message, an entity writes a message (non atomically) to the appropriate slot, then performs an atomic store-release to the appropriate atomic integer to transfer the ownership of the slot to the other entity, then waits until memory writes have completed, then sends a notification to wake up the other entity. Upon receiving a notification, the other entity is expected to perform an atomic load-acquire on the appropriate atomic integer, determine how many pending messages there are, then consume them.
The load of *pshared in my code snippet is just an example of what consuming a trivial (int) message looks like. In a realistic setting, the message would be a structure. Consuming a message does not need any particular atomicity or synchronization, since, as specified by the protocol, it only happens when the consuming entity has synchronized with the other one and knows that it owns the message slot. As long as both entites follow the protocol, everything works flawlessly.
Now, I do not want the entities to have to trust each other. Their implementation must be robust against a malicious entity that would disregard the protocol and write all over the shared memory segment at any time. If this were to happen, the only thing the malicious entity should be able to achieve would be to disrupt the communication. Think of a typical server, which must be prepared to handle ill-formed requests by a malicious client without letting such misbehavior cause buffer overflows or out-of-bound accesses.
So, while the protocol relies on synchronization for normal operation, the entities must be prepared for the contents of shared memory to change at any time. All I need is a way to make sure that, after an entity makes a private copy of a message, it validates and uses that same copy, and never accesses the original any more.
I have an implementation that copies the message using a volatile read, thus making it clear to the compiler that the shared memory does not have ordinary memory semantics. I believe that this is sufficient; I wonder whether it is necessary.
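For reference, one plausible shape of such a volatile read, for the int message from the snippet above:
// The cast forces exactly one load, at this point, from shared memory.
int private = *(volatile const int *)pshared;

if (is_valid(private)) {
    handle(private);            // only the private copy is used from here on
}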
You should inform the compiler that the shared memory can change at any moment by using the volatile modifier.
volatile int *pshared;
...
int private = *pshared; // pshared points to the message in shared memory
...
if (is_valid(private)) {
    ...
    handle(private);
}
As *pshared is declared to be volatile, the compiler can no longer assume that *pshared and private keep same value.
Per your edit, it is now clear that a volatile modifier on the shared memory is sufficient to guarantee that the compiler will honour the temporality of all accesses to that shared memory.
Anyway, draft N1256 for C99 is explicit about it in 5.1.2.3 Program execution (emphasis mine):
2 Accessing a volatile object, modifying an object, modifying a file, or calling a function
that does any of those operations are all side effects, which are changes in the state of
the execution environment. Evaluation of an expression may produce side effects. At
certain specified points in the execution sequence called sequence points, all side effects
of previous evaluations shall be complete and no side effects of subsequent evaluations
shall have taken place.
5 The least requirements on a conforming implementation are:
— At sequence points, volatile objects are stable in the sense that previous accesses are
complete and subsequent accesses have not yet occurred
— At program termination, all data written into files shall be identical to the result that
execution of the program according to the abstract semantics would have produced.
That suggests that even if pshared is not qualified as volatile, the value of private must have been loaded from *pshared before the evaluation of is_valid, and as the abstract machine has no reason to change it before the evaluation of handle, a conforming implementation should not change it. At most it could remove the call to handle if it contained no side effects, which is unlikely.
Anyway, this is only an academic discussion, because I cannot imagine a real use case where shared memory would not need the volatile modifier. If you do not use it, the compiler is free to believe that the previous value is still valid, so on a second access you would still get the first value. So even if the answer to this question is that it is not necessary, you still have to use volatile int *pshared;.
It's hard to answer your question as posted. Note that you must use a synchronization object to prevent concurrent accesses, unless you are only reading units which are atomic on the platform.
I am assuming that you intend to ask about (pseudocode):
lock_shared_area();
int private = *pshared;
unlock_shared_area();
if (is_valid(private))
and that the other process also uses the same lock. (If not, it would be good to update your question to be a bit more specific about your synchronization).
This code guarantees to read *pshared at most once. Using the name private means to read the variable private, not the object *pshared. The compiler "knows" that the call to unlock the area acts as a memory fence and it won't reorder operations past the fence.
Since C doesn't have any concept of interprocess communication, there is nothing you can do to inform the compiler that there is another process that might be modifying the memory.
Thus, I believe there is no way to prevent a sufficiently clever, malicious, but conforming build system from invoking the "as if" rule to allow it to do the Wrong Thing.
To get something that is 'guaranteed' to work, you need to work with whatever guarantees are given by your specific compiler and/or the shared memory library you're using.

Linux Shared Memory Synchronization

I have implemented two applications that share data using the POSIX shared memory API (i.e. shm_open). One process updates data stored in the shared memory segment and another process reads it. I want to synchronize the access to the shared memory region using some sort of mutex or semaphore. What is the most efficient way to do this? Some mechanisms I am considering are:
A POSIX mutex stored in the shared memory segment (Setting the PTHREAD_PROCESS_SHARED attribute would be required)
Creating a System V semaphore using semget
Rather than a System V semaphore, I would go with a POSIX named semaphore using sem_open(), etc.
Might as well make this an answer.
You can use sem_init with pshared true to create a POSIX semaphore in your shared memory space. I have used this successfully in the past.
As for whether this is faster or slower than a shared mutex and condition variable, only profiling can tell you. On Linux I suspect they are all pretty similar since they rely on the "futex" machinery.
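A minimal sketch of that approach, assuming the sem_t is placed at the start of the mapped segment (the struct layout is illustrative):
#include <semaphore.h>

struct shared_area {            // lives inside the shm_open'd mapping
    sem_t lock;
    // ... the shared data ...
};

int setup(struct shared_area *area) {
    // second argument nonzero: the semaphore is shared between processes
    return sem_init(&area->lock, 1, 1);   // initial value 1 => usable as a mutex
}

void with_lock(struct shared_area *area) {
    sem_wait(&area->lock);
    // ... access the shared data ...
    sem_post(&area->lock);
}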
If efficiency is important, I would go with process-shared mutexes and condition variables.
AFAIR, each operation on a semaphore requires a syscall, so an uncontended mutex should be faster than a semaphore [ab]used in a mutex-like manner.
First, really benchmark to find out whether performance matters at all. The cost of these things is often overestimated. So if you don't find that access to the control structure is of the same order of magnitude as the writes, just take whatever construct is semantically best for your use case. That would usually be the case if you write some 100 bytes per access to the control structure.
Otherwise, if the control structure is the bottleneck, you should perhaps avoid using these constructs. C11 has the new concept of _Atomic types and operations that can be used where there are races in access to data. C11 is not yet widely implemented, but probably all modern compilers have extensions that implement these features already.
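A minimal sketch of what that looks like with standard C11 <stdatomic.h> (the names are illustrative):
#include <stdatomic.h>

static _Atomic unsigned long hits;      // concurrent updates are fine on an _Atomic object

void record_hit(void) {
    atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}

unsigned long read_hits(void) {
    return atomic_load_explicit(&hits, memory_order_relaxed);
}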
