Should I mutex lock a single variable? - c

If a single 32-bit variable is shared between multiple threads, should I put a mutex lock around the variable? For example, suppose 1 thread writes to a 32-bit counter and a 2nd thread reads it. Is there any chance the 2nd thread could read a corrupted value?
I'm working on a 32-bit ARM embedded system. The compiler always seems to align 32-bit variables so they can be read or written with a single instruction. If the 32-bit variable was not aligned, then the read or write would be broken down into multiple instructions and the 2nd thread could read a corrupted value.
Does the answer to this question change if I move to a multiple-core system in the future and the variable is shared between cores? (assuming a shared cache between cores)
Thanks!

A mutex protects you from more than just tearing - for example some ARM implementations use out-of-order execution, and a mutex will include memory (and compiler) barriers that may be necessary for your algorithm's correctness.
It is safer to include the mutex, then figure out a way to optimise it later if it shows as a performance problem.
Note also that if your compiler is GCC-based, you may have access to the GCC atomic builtins.
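For instance, a shared 32-bit counter could be handled with the __atomic builtins (available in GCC 4.7+ and Clang). A minimal sketch; the function names are illustrative:
#include <stdint.h>

static uint32_t counter;   /* shared between threads */

/* Writer thread: atomic read-modify-write, also acts as a compiler barrier. */
void counter_increment(void)
{
    __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
}

/* Reader thread: atomic load, so the value can't tear and the
   compiler can't cache it in a register across calls. */
uint32_t counter_read(void)
{
    return __atomic_load_n(&counter, __ATOMIC_RELAXED);
}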

If all the writing is done from one thread (i.e. other threads are only reading), then no you don't need a mutex. If more than one thread may be writing, then you do.

You don't need a mutex.
On 32-bit ARM, a single aligned 32-bit read or write is an atomic operation (regardless of the number of cores).
Of course, you should declare the variable as volatile.

On a 32-bit system, reads and writes of 32-bit variables are atomic. However, it depends on what else you are doing with the variable. E.g. if you manipulate it somehow (e.g. add a value), then this requires a read, a manipulation and a write. If the CPU and compiler do not support an atomic operation for this, then you will need to use a mutex to protect this multi-operation sequence.
There are other, lock-free techniques which can reduce the need for mutexes.
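As a sketch of the mutex approach for such a read-modify-write (Pthreads; the names are illustrative):
#include <pthread.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned counter;

void counter_add(unsigned value)
{
    pthread_mutex_lock(&counter_lock);
    counter += value;   /* read + manipulate + write, now indivisible */
    pthread_mutex_unlock(&counter_lock);
}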

Related

A thread only reads and a thread only modifies. Does this variable also need a mutex with Linux C?

There are 2 threads: one only reads the signal, the other only sets the signal.
Is it necessary to create a mutex for the signal, and why?
UPDATE
All I care about is whether it'll crash if two threads read/set it at the same time.
You will probably want to use atomic variables for this, though a mutex would work as well.
The problem is that there is no guarantee that data will stay in sync between threads, but using atomic variables ensures that as soon as one thread updates that variable, other threads immediately read its updated value.
A problem could occur if one thread updates the variable in cache, and a second thread reads the variable from memory. That second thread would read an out-of-date value for the variable, if the cache had not yet been flushed to memory. Atomic variables ensure that the value of the variable is consistent across threads.
If you are not concerned with timely variable updates, you may be able to get away with a single volatile variable.
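A minimal sketch of the atomic-variable approach using C11 stdatomic.h (the identifiers are illustrative), with one thread setting the flag and another polling it:
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool signal_flag;   /* zero-initialised to false */

/* Setter thread */
void set_signal(void)
{
    atomic_store(&signal_flag, true);
}

/* Reader thread */
bool signal_is_set(void)
{
    return atomic_load(&signal_flag);
}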
It depends. If writes are atomic then you don't need a mutual exclusion lock. If writes are not atomic, then you do need a lock.
There is also the issue of compilers caching variables in CPU registers, which may cause the copy in main memory not to get updated on every write. Some languages have ways of telling the compiler not to cache a variable like that (the volatile keyword in Java), or of telling the compiler to sync any cached values with main memory (the synchronized keyword in Java). But mutexes in general don't solve this problem.
If all you need is synchronization between threads (one thread must complete something before the other can begin something else) then mutual exclusion should not be necessary.
Mutual exclusion is only necessary when threads are sharing some resource that could be corrupted if they both run through the critical section at roughly the same time. Think of two people who share a bank account and are at two different ATMs at the same time.
Depending on your language/threading library, you may use the same mechanism for synchronization as you do for mutual exclusion - either a semaphore or a monitor. So, if you are using Pthreads, someone here could post an example of synchronization and another of mutual exclusion; if it's Java, there would be another example. Perhaps you can tell us what language/library you're using.
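To illustrate the distinction in Pthreads, here is a minimal sketch; the bank-balance and flag names are invented for the example:
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Mutual exclusion: only one thread at a time may touch the balance. */
static long balance;

void deposit(long amount)
{
    pthread_mutex_lock(&lock);
    balance += amount;   /* critical section */
    pthread_mutex_unlock(&lock);
}

/* Synchronization: one thread must finish a step before another begins. */
static pthread_cond_t done_cond = PTHREAD_COND_INITIALIZER;
static bool done;

void finish_step(void)   /* called by the first thread */
{
    pthread_mutex_lock(&lock);
    done = true;
    pthread_cond_signal(&done_cond);
    pthread_mutex_unlock(&lock);
}

void wait_for_step(void)   /* called by the second thread */
{
    pthread_mutex_lock(&lock);
    while (!done)
        pthread_cond_wait(&done_cond, &lock);
    pthread_mutex_unlock(&lock);
}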
If, as you've said in your edit, you only want to assure against a crash, then you don't need to do much of anything (at least as a rule). If you get a collision between threads, about the worst that will happen is that the data will be corrupted -- e.g., the reader might get a value that's been partially updated, and doesn't correspond directly to any value the writing thread ever wrote. The classic example would be a multi-byte number being incremented across a carry boundary: say the old value was 0x3f ffff. It's possible the reading thread could see 0x3f 0000, where the lower 16 bits have been incremented (wrapping around to zero), but the carry to the upper 16 bits hasn't happened (yet).
On a modern machine, an increment on that small of a data item will normally be atomic, but there will be some size (and alignment) where it's not -- typically if part of the variable is in one cache line, and part in another, it'll no longer be atomic. The exact size and alignment for that varies somewhat, but the basic idea remains the same -- it's mostly just a matter of the number having enough digits for it to happen.
Of course, if you're not careful, something like that could cause your code to deadlock or something on that order -- it's impossible to guess what might happen without knowing anything about how you plan to use the data.

Is it true that "volatile" in a userspace program tends to indicate a bug?

While googling about "volatile" and its user-space usage, I found mails between Theodore Tso and Linus Torvalds. According to these great masters, use of "volatile" in userspace is probably a bug? Check the discussion here.
Although they give some explanations, I really couldn't understand them. Could anyone explain in simple language why they said so? Are we not supposed to use volatile in user space?
volatile tells the compiler that every read and write has an observable side effect; thus, the compiler can't make any assumptions about two reads or two writes in a row having the same effect.
For instance, normally, the following code:
int a = *x;
int b = *x;
if (a == b)
    printf("Hi!\n");
Could be optimized into:
printf("Hi!\n");
What volatile does is tell the compiler that those values might be coming from somewhere outside of the program's control, so it has to actually read those values and perform the comparison.
A lot of people have made the mistake of thinking that they could use volatile to build lock-free data structures, which would allow multiple threads to share values, and they could observe the effects of those values in other threads.
However, volatile says nothing about how different threads interact, and could be applied to values that could be cached with different values on different cores, or could be applied to values that can't be atomically written in a single operation, and so if you try to write multi-threaded or multi-core code using volatile, you can run into a lot of problems.
Instead, you need to either use locks or some other standard concurrency mechanism to communicate between threads, or use memory barriers, or use C11/C++11 atomic types and atomic operations. Locks ensure that an entire region of code has exclusive access to a variable, which can work if you have a value that is too large, too small, or not aligned to be atomically written in a single operation, while memory barriers and the atomic types and operations provide guarantees about how they work with the CPU to ensure that caches are synchronized or reads and writes happen in particular orders.
Basically, volatile winds up mostly being useful when you're interfacing with a single hardware register, which can vary outside the program's control but may not require any special atomic operations to access. Or it can be used in signal handlers, where, because a thread could be interrupted, the handler run, and then control returned within the same thread, you need to use a volatile value if you want to communicate a flag to the interrupted code.
But if you're doing any kind of synchronization between threads, you should be using locks or some other concurrency primitives provided by a standard library, or really know what you're doing with regards to memory ordering and use memory barriers or atomic operations.
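As a sketch of the legitimate signal-handler case mentioned above (a single flag shared between a handler and the interrupted code; sigaction would be more robust than signal in real code):
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal;

static void handler(int sig)
{
    (void)sig;
    got_signal = 1;   /* setting the flag is all the handler does */
}

int main(void)
{
    signal(SIGINT, handler);
    while (!got_signal)   /* volatile forces a fresh read each iteration */
        pause();
    printf("caught SIGINT, exiting\n");
    return 0;
}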

xv6: reading ticks directly without taking the ticks lock?

I'm working on an assignment in an operating systems course on Xv6. I need to implement a status data structure for a process, recording its creation time, termination time, sleep time, etc...
For now I have decided to read the ticks variable directly, without taking tickslock, because it seems like a bad idea to take a lock and slow down the system for such a low-priority objective.
Since the ticks variable is only ever updated like so: ticks++, is there a way I could retrieve the current number of ticks and get a wrong number?
I don't mind the number being off by ±10 ticks, but could it be really off? For example, when the number 01111111111111111 is incremented, 2 bytes need to change. So my question is this: is it possible that the CPU stores data in stages, and that another CPU could fetch the data at that memory location between the start and the completion of the store operation?
So as I see it, whether the compiler emits a mov instruction or an inc instruction, what I want to know is whether the store can be observed between its start and its end.
There's no problem in asm: aligned loads/stores done with a single instruction on x86 are atomic up to qword (8-byte) width. Why is integer assignment on a naturally aligned variable atomic on x86?
(On 486, the guarantee is only for 4-byte aligned values, and maybe not even that for 386, so possibly this is why Xv6 uses locking? I'm not sure if it's supposed to be multi-core safe on 386; my understanding is that the rare 386 SMP machines didn't exactly implement the modern x86 memory model (memory ordering and so on).)
But C is not asm. Using a plain non-atomic variable from multiple "threads" at once is undefined behaviour, unless all threads are only reading. This means compilers can assume that a normal C variable isn't changed asynchronously by other threads.
Using ticks in a loop in C will let the compiler read it once and keep using the same value repeatedly. You need a READ_ONCE macro like the Linux kernel uses, e.g. *(volatile int*)&ticks. Or simply declare it as volatile unsigned ticks;
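A simplified sketch of such a macro (the kernel's real READ_ONCE is more elaborate; typeof is a GNU extension):
#define READ_ONCE(x) (*(const volatile typeof(x) *)&(x))

unsigned now = READ_ONCE(ticks);   /* forces a fresh load from memory */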
For a variable narrow enough to fit in one integer register, it's probably safe to assume that a sane compiler will write it with a single dword store, whether that's a mov or a memory-destination inc or add dword [mem], 1. (You can't assume that a compiler will use a memory-destination inc/add, though, so you can't depend on an increment being single-core-atomic with respect to interrupts.)
With one writer and multiple readers, yes the readers can simply read it without any need for any kind of locking, as long as they use volatile.
Even in portable ISO C, volatile sig_atomic_t has some very limited guarantees of working safely when written by a signal handler and read by the thread that ran the signal handler. (Not necessarily by other threads, though: in ISO C volatile doesn't avoid data-race UB. But in practice on x86 with non-hostile compilers it's fine.)
(POSIX signals are the user-space equivalent of interrupts.)
See also Can num++ be atomic for 'int num'?
For one thread to publish a wider counter in two halves, you'd usually use a SeqLock. With 1 writer and multiple readers, there's no actual locking, just retry by the readers if a write overlapped with their read. See Implementing 64 bit atomic counter with 32 bit atomics
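A minimal single-writer SeqLock sketch in C11 (illustrative, not the actual Xv6 code): the writer bumps a sequence number to odd around its update, and readers retry if they observe a change or an odd value.
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t seq;      /* even: stable, odd: write in progress */
static _Atomic uint32_t lo, hi;   /* the two halves of the 64-bit counter */

void counter_write(uint64_t v)    /* single writer only */
{
    uint32_t s = atomic_load_explicit(&seq, memory_order_relaxed);
    atomic_store_explicit(&seq, s + 1, memory_order_relaxed);   /* now odd */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&lo, (uint32_t)v, memory_order_relaxed);
    atomic_store_explicit(&hi, (uint32_t)(v >> 32), memory_order_relaxed);
    atomic_store_explicit(&seq, s + 2, memory_order_release);   /* even again */
}

uint64_t counter_read(void)       /* any number of readers, no locking */
{
    for (;;) {
        uint32_t s1 = atomic_load_explicit(&seq, memory_order_acquire);
        uint32_t l  = atomic_load_explicit(&lo, memory_order_relaxed);
        uint32_t h  = atomic_load_explicit(&hi, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        uint32_t s2 = atomic_load_explicit(&seq, memory_order_relaxed);
        if (s1 == s2 && !(s1 & 1))   /* no write overlapped our read */
            return ((uint64_t)h << 32) | l;
    }
}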
First, whether to use locks isn't a matter of whether your objective is low priority or not, but a matter of avoiding a race condition.
Second, in the specific case you describe it is safe to read the ticks variable without any locks. This is not a harmful race, because access to the same RAM location cannot be made by 2 separate CPUs simultaneously, and because writes to ticks only increment the value by 1 rather than making any larger change you would really miss.

C volatile, and issues with hardware caching

I've read similar answers on this site, and elsewhere, but am still confused in a few circumstances.
I'm aware of what the standard actually guarantees us, I understand the intended use of the keyword, and I'm well aware of the difference between compiler caching and L1/L2/etc. caching; it's more for curiosity's sake that I want to understand the other cases.
Say I have a variable declared volatile in C. Four scenarios:
Signal handlers, single-threaded (as intended): This is the problem the keyword was meant to solve. My process gets a signal callback from the OS, and I modify some volatile variable outside the normal execution of my process. Since it was declared volatile, the normal process won't keep this value in a CPU register, and will always do a load from memory. Because the signal handler shares the same address space as the normal process, even if the volatile variable was previously cached in hardware (i.e. L1, L2), we are guaranteed the main process will load the correct, updated value. Perfect, everyone is happy.
DMA transfers, single-threaded: Say the volatile variable is mapped to a region of memory into which a DMA write is taking place. As before, the compiler won't keep the volatile variable in a CPU register, and will always do a load from memory; however, if that variable exists in the hardware cache, then the load request will never reach main memory. If the DMA controller updates main memory behind our backs, we'll never get the up-to-date value. In a preemptive OS, we are saved by the fact that eventually we'll probably be context-switched out, and the next time our process resumes the cache will be cold, so we'll actually have to reload from main memory - we'll get the correct functionality... eventually (our own process could potentially evict that cache line too - but again, we might waste valuable cycles before that happens). Is there standardized HW support or OS support that notifies the hardware caches when main memory is updated via the DMA controller? Or do we have to explicitly flush the cache to guarantee we aren't reading a stale value? (Is this even possible on the architectures listed?)
Memory-mapped registers, single-threaded: Same as #2, except the volatile variable is mapped to a memory-mapped register (or an explicit IO port). I would imagine this is a more difficult problem than #2, since at least the DMA controller will signal the CPU when it's done transferring, which gives the OS or HW a chance to do something.
Multithreaded: If I have a volatile variable, is there any guarantee of cache coherency between multiple threads running on separate physical cores? Sure, the compiler is still issuing load instructions from memory, but if the value is cached in one core's cache, is there any guarantee that the same value must exist in the other cores' caches? (I would imagine it's not an issue at all for hyperthreaded threads on different logical cores of the same physical core, since they share physical cache memory.) My overwhelming intuition says no, but I thought I'd list the case here anyway.
If possible, differentiate between x64 and ARMv6/7/8 architectures, and kernel vs user land solutions.
For 2 and 3, no there's no standardized way this would work.
Normally when doing DMA transfers one flushes the cache in a platform-dependent manner. Normally there are quite straightforward instructions for doing that (since nowadays the caches are integrated into the CPU).
When accessing memory-mapped registers, on the other hand, the behavior often depends on the order of writes. For example, suppose you have a UART port and write characters to it: you'll need to make sure that there is an actual write to the port each time you write to it from C.
While it might work with flushing the cache between each write, it's not what one normally does. The normal way (for ARM at least) is to set up the MMU so that writes to certain regions of address space happen uncached and in correct sequence.
This approach can also be used for memory used for DMA transfers; one could for example set up dedicated regions for use as DMA buffers and set up the MMU so that reads and writes to that region happen uncached.
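As a sketch of what such a register access looks like from C (the register address is hypothetical; a real one would come from the SoC datasheet, and the region must be mapped as uncached device memory as described above):
#include <stdint.h>

/* hypothetical memory-mapped UART data register */
#define UART_DR (*(volatile uint32_t *)0x4000C000u)

void uart_puts(const char *s)
{
    while (*s)
        UART_DR = (uint32_t)*s++;   /* volatile: one real bus write per character */
}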
On the other hand, the language guarantees that all memory (well, what you get from declaring variables or allocating memory) will behave in certain ways. It should make no difference whether multiple threads or signals are involved. Note that the C90 and C99 standards don't mention threads (C11 does), but they are supposed to work this way. The implementation has to make sure that the CPUs and caches are used in a way that is consistent with this (as a consequence, the OS might not be able to schedule different threads on different cores if this can't be accomplished). Consequently you should not need to flush caches in order to share data between threads, but you do need to synchronize threads and of course use volatile-qualified data. The same is true for signal handlers, even if the implementation happens to schedule them on a different core.

mmap thread safety in a multi-core and multi-cpu environment

I am a little confused as to the real issues between multi-core and multi-cpu environments when it comes to shared memory, with particular reference to mmap in C.
I have an application that utilizes mmap to share multiple segments of memory between 2 processes. Each process has access to:
A Status and Control memory segment
Raw data (up to 8 separate raw data buffers)
The Status and Control segment is used essentially as an IPC mechanism. I.e., it may convey that buffer 1 is ready to receive data, that buffer 3 is ready for processing, or that the Status and Control memory segment is locked whilst being updated by either parent or child, etc.
My understanding is, and PLEASE correct me if I am wrong, that in a multi-core CPU environment on a single-boarded PC type infrastructure, mmap is safe. That is, regardless of the number of cores in the CPU, RAM is only ever accessed by a single core (or process) at any one time.
Does this assumption of single-process RAM access also apply to multi-CPU systems? That is, a single PC style board with multiple CPUs (and, I guess, multiple cores within each CPU).
If not, I will need to seriously rethink my logic to allow for multi-cpu'd single-boarded machines!
Any thoughts would be greatly appreciated!
PS - by single boarded I mean a single, standalone PC style system. This excludes mainframes and the like ... just to clarify :)
RAM is only ever accessed by a single core (or process) at any one time.
Take a step back and think about what your assumption means. Theoretically, yes, this statement is true, but I don't think it means what you think it means. There are no practical conclusions you can draw from this other than maybe "the memory will not catch fire if two CPUs write to the same address at the same time". Let me explain.
If one CPU/process writes to a memory location and then a different CPU/process writes to the same location, the memory writes will not happen at the same time; they will happen one at a time. But you can't generally reason about which write will happen before the other, you can't reason about whether a read from one CPU will happen before the write from the other CPU, and on some older CPUs you can't even reason about whether multi-byte (multi-word, actually) values will be stored/accessed one byte at a time or multiple bytes at a time (which means that reads and writes of multibyte values can get interleaved between CPUs or processes).
The only thing multiple CPUs change here is the order of memory reads and writes. On a single CPU reading memory you can be pretty sure that your reads from memory will see earlier writes to the same memory (if other hardware is reading/writing the memory, then all bets are off). On multiple CPUs, the order of reads and writes to different memory locations will surprise you (CPU 1 writes to address 1 and then 2, but CPU 2 might see the new value at address 2 and the old value at address 1).
So unless you have specific documentation from your operating system and/or CPU manufacturer you can't make any assumptions (except that when two writes to the same memory location happen one will happen before the other). This is why you should use libraries like pthreads or stdatomic.h from C11 for proper locking and synchronization or really dig deep down into the most complex parts of the CPU documentation to actually understand what will happen. The locking primitives in pthreads not only provide locking, they are also guarantee that memory is properly synchronized. stdatomic.h is another way to guarantee memory synchronization, but you should carefully read the C11 standard to see what it promises and what it doesn't promise.
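A minimal sketch with C11 stdatomic.h of the "address 1 / address 2" scenario above, using release/acquire ordering to make the publication safe (the names are illustrative):
#include <stdatomic.h>

static int payload;        /* "address 1" */
static atomic_int ready;   /* "address 2" */

void producer(void)
{
    payload = 42;   /* plain write */
    atomic_store_explicit(&ready, 1, memory_order_release);   /* publish */
}

int consumer(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;           /* spin until published */
    return payload; /* guaranteed to see 42, not a stale value */
}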
One potential issue is that each core has its own cache (usually just level 1, as the level 2 and level 3 caches are usually shared). Each CPU would also have its own cache. However most systems ensure cache coherency, so this isn't the issue (except for the performance impact of constantly invalidating caches due to writes to the same memory shared in a cache line by each core or processor).
The real issue is that there is no guarantee against reordering of reads and writes due to optimizations by the compiler and/or the hardware. You need to use a Memory Barrier to flush out any pending memory operations to synchronize the state of the threads or shared memory of processes. The memory barrier will occur if you use one of the synchronization types such as an event, mutex, semaphore, ... . Not all of the shared memory reads and writes need to be atomic, but you need to use synchronization between threads and/or processes before accessing any shared memory possibly updated by another thread and/or process.
This does not sound right to me. Two processes on two different cores can both load and store data to RAM at the same time. In addition, caching strategies can result in all kinds of strangeness. So please make sure all access to shared memory is properly synchronized using (interprocess) synchronization objects.
My understanding is, and PLEASE correct me if I am wrong, that in a multi-core CPU environment on a single boarded PC type infrastructure, mmap is safe. That is, regardless of the number of cores in the CPU, RAM is only ever accessed by a single core (or process) at any one time.
Even if this holds true for some particular architecture, such an assumption is entirely wrong in general. You should have proper synchronisation between the processes that modify the shared memory segment, unless atomic intrinsics are used and the algorithm itself is lock-free.
I would advise you to put a pthread_mutex_t in the shared memory segment (shared across all processes). You will have to initialise it with the PTHREAD_PROCESS_SHARED attribute:
pthread_mutexattr_t mutex_attr;

pthread_mutexattr_init(&mutex_attr);
pthread_mutexattr_setpshared(&mutex_attr, PTHREAD_PROCESS_SHARED);
/* mutex must point to a pthread_mutex_t that lives inside the
   shared segment, so that all processes use the same lock */
pthread_mutex_init(mutex, &mutex_attr);
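For completeness, a minimal sketch of the whole pattern with an anonymous shared mapping and fork() (error handling omitted; the struct layout is invented for the example):
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct shared {
    pthread_mutex_t mutex;
    int counter;
};

int main(void)
{
    /* anonymous shared mapping, visible to the child after fork() */
    struct shared *shm = mmap(NULL, sizeof *shm, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&shm->mutex, &attr);

    if (fork() == 0) {   /* child process */
        pthread_mutex_lock(&shm->mutex);
        shm->counter++;
        pthread_mutex_unlock(&shm->mutex);
        _exit(0);
    }

    pthread_mutex_lock(&shm->mutex);   /* parent process */
    shm->counter++;
    pthread_mutex_unlock(&shm->mutex);

    wait(NULL);
    printf("counter = %d\n", shm->counter);   /* always 2 */
    return 0;
}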
