A quick question I've been wondering about for some time: does the CPU assign values atomically, or bit by bit (say, for example, a 32-bit integer)?
If it's bit by bit, could another thread accessing this exact location get a "part" of the to-be-assigned value?
Think of this:
I have two threads and one shared "unsigned int" variable (call it "g_uiVal").
Both threads loop.
One prints "g_uiVal" with printf("%u\n", g_uiVal).
The second just increments it.
Will the printing thread ever print something that is not, or is only part of, "g_uiVal"'s value?
In code:
#include <stdio.h>

unsigned int g_uiVal;

void thread_writer()
{
    while(1)
        g_uiVal++;
}

void thread_reader()
{
    while(1)
        printf("%u\n", g_uiVal);
}
Depends on the bus widths of the CPU and memory. In a PC context, with anything other than a really ancient CPU, accesses of up to 32 bits are atomic; 64-bit accesses may or may not be. In the embedded space, many (most?) CPUs are 32 bits wide and there is no provision for anything wider, so your int64_t is guaranteed to be non-atomic.
I believe the only correct answer is "it depends". On what, you may ask?
Well, for starters, which CPU. Also, some CPUs are atomic for writing word-width values, but only when they are aligned. It really is not something you can guarantee at the C language level.
Many compilers offer "intrinsics" to emit correct atomic operations. These are extensions which act like functions, but emit the correct code for your target architecture to get the needed atomic operations. For example: http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html
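As a rough sketch of what using one of those intrinsics might look like for the increment in the question (assuming GCC's legacy __sync builtins are available on your target):
/* Sketch only: assumes GCC's __sync builtins are available on the target. */
unsigned int g_uiVal;

void thread_writer(void)
{
    /* Atomic read-modify-write: no torn or lost updates. */
    __sync_fetch_and_add(&g_uiVal, 1);
}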
You said "bit-by-bit" in your question. I don't think any architecture does operations a bit at a time, except with some specialized serial protocol busses. Standard memory read/writes are done with 8, 16, 32, or 64 bits of granularity. So it is POSSIBLE the operation in your example is atomic.
However, the answer is heavily platform dependent.
It depends on the CPU's capabilities. Can the hardware do an atomic 32-bit operation? Here's a hint: if the variable you are working on is larger than the native register size (e.g. a 64-bit int on a 32-bit system), it's definitely NOT atomic.
It depends on how the compiler generates the machine code. It could have turned your 32-bit variable access into 4x 8-bit memory reads.
It gets tricky if the address of what you are accessing is not aligned to the machine's natural word boundary. You can hit a cache fault or page fault.
It is VERY POSSIBLE that you would see a corrupt or unexpected value using the code example that you posted.
Your platform probably provides some method of doing atomic operations. In the case of a Windows platform, it is via the Interlocked functions. In the case of Linux/Unix, look at the atomic_t type.
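For illustration, a minimal sketch of the writer using the Windows Interlocked functions (note that they operate on LONG, so the counter type changes):
#include <windows.h>

volatile LONG g_lVal;               /* Interlocked functions operate on LONG */

void thread_writer(void)
{
    InterlockedIncrement(&g_lVal);  /* atomic increment with a full barrier */
}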
To add to what has been said so far - another potential concern is caching. CPUs tend to work with the local (on die) memory cache which may or may not be immediately flushed back to the main memory. If the box has more than one CPU, it is possible that another CPU will not see the changes for some time after the modifying CPU made them - unless there is some synchronization command informing all CPUs that they should synchronize their on-die caches. As you can imagine such synchronization can considerably slow the processing down.
Don't forget that the compiler assumes a single thread when optimizing, and this whole thing could just go away (the reader's load of g_uiVal could be hoisted out of the loop, for example).
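To illustrate (a sketch of typical optimizer behaviour, not a guarantee): the reader loop from the question may be compiled as if g_uiVal were loaded only once, unless something like volatile (or, better, a C11 _Atomic type) forces a load on every iteration.
#include <stdio.h>

/* Without the volatile qualifier, an optimizer that assumes a single thread
 * may hoist the load of g_uiVal out of the loop and print one value forever.
 * volatile forces a fresh load each iteration; it says nothing about
 * atomicity or ordering. */
volatile unsigned int g_uiVal;

void thread_reader(void)
{
    while (1)
        printf("%u\n", g_uiVal);
}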
C (and POSIX) defines the special type sig_atomic_t, which guarantees that writes to it are atomic with respect to signals; that also makes it atomic from the point of view of other threads, as you want. The standards don't specifically define an atomic cross-thread type like this, since thread communication is expected to be mediated by mutexes or other synchronization primitives.
Considering modern microprocessors (and ignoring microcontrollers), the 32-bit assignment is atomic, not bit-by-bit.
However, going slightly beyond your question's topic: the printing thread could still print something unexpected because of the lack of synchronization in this example, due to instruction reordering and to multiple cores each holding their own copy of g_uiVal in their caches.
I am trying to do an atomic increment on a 64-bit variable on a 32-bit system. I am trying to use atomic_fetch_add_explicit(&system_tick_counter_us, 1, memory_order_relaxed);
But the compiler throws out an error - warning: large atomic operation may incur significant performance penalty; the access size (8 bytes) exceeds the max lock-free size (4 bytes) [-Watomic-alignment]
My question is how I can achieve atomicity without using critical sections.
How can I achieve atomicity without using critical sections?
On an object that is larger than a single memory "word"? You probably can't. End of story. Use a mutex. Or, use an _Atomic type (std::atomic<> in C++) and accept that the library will use a lock on your behalf.
You can't even do this with 32-bit data on systems that have to read the memory, modify the value, and write it back (RMW). Almost all (if not all) RISC processors have no instructions that modify memory in a single instruction. That includes all ARM Cortex micros, RISC-V, and many other processors.
Many of them have special hardware mechanisms that help achieve atomic access (at least by preventing other contexts from accessing the data). Cortex-M cores have the LDREX/STREX instructions, and some parts have hardware mutexes or semaphores, but they still require the programmer to provide atomic (or at least mutually exclusive) access to the memory location.
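To make the retry nature of that explicit, here is a sketch in C11 atomics of what an LDREX/STREX-style read-modify-write boils down to (assuming the toolchain supports <stdatomic.h>):
#include <stdatomic.h>

_Atomic unsigned int counter;

void increment(void)
{
    unsigned int old = atomic_load(&counter);
    /* On a Cortex-M part this typically lowers to an LDREX/STREX loop:
     * retry if another context touched the location between load and store. */
    while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
        ;   /* 'old' is refreshed with the current value on failure */
}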
If you need to read the 64-bit value while the program is running then you probably can't do this safely without a mutex as others have said, but on the off-chance that you only need to read this value after all of the threads have finished, then you can implement this with an array of 2 32-bit atomic variables.
Since your system can only guarantee atomicity of this type on 4-byte memory regions, you should use those instead to maximize performance, for instance:
#include <stdio.h>
#include <stdint.h>
#include <threads.h>
#include <stdatomic.h>

_Atomic uint32_t system_tick_counter_us[2];
Then increment one of those two 4-byte atomic variables whenever you want to increment an 8-byte one, then check if it overflowed, and if it did, atomically increment the other. Keep in mind that atomic_fetch_add_explicit returns the value of the atomic variable before it was incremented, so it's important to check for the value that will cause the overflow, not zero.
if (atomic_fetch_add_explicit(&system_tick_counter_us[0], 1, memory_order_relaxed) == (uint32_t)0 - 1)
    atomic_fetch_add_explicit(&system_tick_counter_us[1], 1, memory_order_relaxed);
However, as I mentioned, this can cause a race condition in the case that the 64-bit variable is constructed between system_tick_counter_us[0] overflowing and that same thread incrementing system_tick_counter_us[1] but if you can find a way to guarantee that all threads are done executing the two lines above, then this is a safe solution.
The 64-bit value can be constructed as ((uint64_t)system_tick_counter_us[1] << 32) | (uint64_t)system_tick_counter_us[0] once you're sure the memory is no longer being modified
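Putting those pieces together, a sketch of both sides (the increment and the after-the-fact read), using the same names as above:
#include <stdint.h>
#include <stdatomic.h>

_Atomic uint32_t system_tick_counter_us[2];

void tick(void)
{
    /* Increment the low word; if it just wrapped, carry into the high word. */
    if (atomic_fetch_add_explicit(&system_tick_counter_us[0], 1,
                                  memory_order_relaxed) == (uint32_t)0 - 1)
        atomic_fetch_add_explicit(&system_tick_counter_us[1], 1,
                                  memory_order_relaxed);
}

/* Only valid once all writer threads are known to have finished. */
uint64_t read_when_quiescent(void)
{
    return ((uint64_t)system_tick_counter_us[1] << 32)
         | (uint64_t)system_tick_counter_us[0];
}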
In a non-OOP programming language like C, if we only allow local variables to be mutated in every possible way (changing internal fields, re-assigning, ...) but disallow mutation of function arguments, will that help us prevent shared mutable state?
Note that in this case, function main can start 10 threads (functions) and each of those 10 threads will receive an immutable reference to the same variable (defined in main). But the main function can still change the value of that shared variable. So can this cause problems in concurrent/parallel software?
I hope the question is clear, but let me know if it's not.
P.S. Can "software transactional memory (STM)" solve the potential problems? Like what Clojure offers?
Yes and no... this depends on the platform, the CPU, the size of the shared variable and the compiler.
On an NVIDIA forum, in relation to GPU operations, a similar question was very neatly answered:
When multiple threads are writing or reading to/from a naturally aligned location in global memory, and the datatype being read or written is the same by all threads, and the datatype corresponds to one of the supported types for single-instruction thread access ...
(Many GPU single instructions can handle 16-byte (128-bit) words when the size is known in advance, but most CPUs have single-instruction limits of 32 or 64 bits.)
I'm leaving aside the chance that threads might read from CPU registers instead of the actual memory (and so miss updates to the data); such issues are mostly solvable using the volatile keyword in C.
However, conflicts and memory corruption can still happen.
Some memory storage operations are handled internally (by the CPU) or by your compiler (the machine code) using a number of storage calls.
In these cases, mostly on multi-core machines (but not only), there's the risk that the "reader" will receive information that was partially updated and has no meaning whatsoever (i.e., half of a pointer is valid and the other isn't).
Variables larger than 32 or 64 bits will usually get updated one CPU "word" (not an OS word) at a time (32 or 64 bits).
Byte-sized variables are super safe; that's why they are often used as flags. But they should probably still be handled using the atomic_* store/write operations provided by the OS or the compiler.
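For instance, a small sketch using C11's <stdatomic.h> for such a byte-sized flag (assuming the toolchain provides it):
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool data_ready;   /* byte-sized flag with explicit atomic access */

void signal_ready(void) { atomic_store(&data_ready, true); }
bool check_ready(void)  { return atomic_load(&data_ready); }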
Recently I've peeked into the Linux kernel implementation of an atomic read and write and a few questions came up.
First the relevant code from the ia64 architecture:
typedef struct {
    int counter;
} atomic_t;

#define atomic_read(v)     (*(volatile int *)&(v)->counter)
#define atomic64_read(v)   (*(volatile long *)&(v)->counter)

#define atomic_set(v,i)    (((v)->counter) = (i))
#define atomic64_set(v,i)  (((v)->counter) = (i))
For both read and write operations, it seems that the direct approach was taken to read from or write to the variable. Unless there is another trick somewhere, I do not understand what guarantees exist that this operation will be atomic in the assembly domain. I guess an obvious answer will be that such an operation translates to one assembly opcode, but even so, how is that guaranteed when taking into account the different memory cache levels (or other optimizations)?
On the read macros, the volatile qualifier is used in a casting trick. Does anyone have a clue how this affects atomicity here? (Note that it is not used in the write operation.)
I think you are misunderstanding the (very vague) usage of the words "atomic" and "volatile" here. Atomic only really means that the words will be read or written atomically (in one step, guaranteeing that the contents of this memory position will always be one write or the other, and never something in between). And the volatile keyword tells the compiler never to assume it already knows the data in that location because of an earlier read/write (basically, never optimize away the read).
What the words "atomic" and "volatile" do NOT mean here is that there's any form of memory synchronization. Neither implies ANY read/write barriers or fences. Nothing is guaranteed with regards to memory and cache coherence. These functions are basically atomic only at the software level, and the hardware can optimize/lie however it deems fit.
Now as to why simply reading is enough: the memory models for each architecture are different. Many architectures can guarantee atomic reads or writes for data aligned to a certain byte offset, or of a certain width in words, etc., and this varies from CPU to CPU. The Linux kernel contains many defines for the different architectures that let it do without any atomic calls (CMPXCHG, basically) on platforms that guarantee (sometimes only in practice, even if their spec says they don't actually guarantee it) atomic reads/writes.
As for the volatile, while there is no need for it in general unless you're accessing memory-mapped IO, it all depends on when/where/why the atomic_read and atomic_write macros are being called. Many compilers will (though it is not set in the C spec) generate memory barriers/fences for volatile variables (GCC, off the top of my head, is one. MSVC does for sure.). While this would normally mean that all reads/writes to this variable are now officially exempt from just about any compiler optimizations, in this case by creating a "virtual" volatile variable only this particular instance of a read/write is off-limits for optimization and re-ordering.
The reads are atomic on most major architectures, as long as they are aligned to a multiple of their size (and aren't bigger than the read size of a given type); see the Intel architecture manuals. Writes, on the other hand, may be different: Intel states that under x86, single-byte writes and aligned writes may be atomic, while under IPF (IA64) everything uses acquire and release semantics, which makes it guaranteed atomic, see this.
The volatile prevents the compiler from caching the value locally, forcing it to be retrieved wherever there is an access to it.
If you write for a specific architecture, you can make assumptions specific to it.
I guess IA-64 does compile these things to a single instruction.
The cache shouldn't be an issue, unless the counter crosses a cache line boundary. But if 4/8-byte alignment is required, this can't happen.
A "real" atomic instruction is required when a machine instruction translates into two memory accesses. This is the case for increments (read, increment, write) or compare&swap.
volatile affects the optimizations the compiler can do.
For example, it prevents the compiler from converting multiple reads into one read.
But on the machine instruction level, it does nothing.
I'm using a 32-bit microcontroller (STR91x). I'm concurrently accessing (from the ISR and the main loop) a struct member of type enum. Access is limited to writing to that enum field in the ISR and checking it in the main loop. The enum's underlying type is not larger than an integer (32-bit).
I would like to make sure that I'm not missing anything and I can safely do it.
Provided that 32-bit reads and writes are atomic, which is almost certainly the case (you might want to make sure that your enum is word-aligned), then what you've described will be just fine.
As paxdiablo & David Knell said, generally speaking this is fine. Even if your bus is < 32 bits, chances are the instruction's multiple bus cycles won't be interrupted, and you'll always read valid data.
What you stated, and what we all know, but it bears repeating, is that this is fine for a single-writer, N-reader situation. If you had more than one writer, all bets are off unless you have a construct to protect the data.
If you want to make sure, find the compiler switch that generates an assembly listing and examine the assembly for the write in the ISR and the read in the main loop. Even if you are not familiar with ARM assembly, I'm sure you could quickly and easily be able to discern whether or not the reads and writes are atomic.
ARM supports 32-bit aligned reads that are atomic as far as interrupts are concerned. However, make sure your compiler doesn't try to cache the value in a register! Either mark it as a volatile, or use an explicit memory barrier - on GCC this can be done like so:
int tmp = yourvariable;
__sync_synchronize(yourvariable);
Note, however, that current versions of GCC perform a full memory barrier for __sync_synchronize, rather than just for the one variable, so volatile is probably better for your needs.
Further, note that your variable will be aligned automatically unless you are doing something weird (i.e., explicitly specifying the location of the struct in memory, or requesting a packed struct). Unaligned variables on ARM cannot be read atomically, so make sure it's aligned, or disable interrupts while reading.
Well, it depends entirely on your hardware but I'd be surprised if an ISR could be interrupted by the main thread.
So probably the only thing you have to watch out for is if the main thread could be interrupted halfway through a read (so it may get part of the old value and part of the new).
It should be a simple matter of consulting the specs to ensure that interrupts are only processed between instructions (this is likely since the alternative would be very complex) and that your 32-bit load is a single instruction.
An aligned 32 bit access will generally be atomic (unless it were a particularly ludicrous compiler!).
However, the rock-solid solution (and one generally applicable to non-32-bit targets too) is simply to disable the interrupt temporarily while accessing the data outside of the interrupt. The most robust way to do this is through an access function to statically scoped data, rather than making the data global; with a global you have no single point of access and therefore no way of enforcing an atomic access mechanism when needed.
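A sketch of that access-function approach; ENTER_CRITICAL/EXIT_CRITICAL are placeholders for whatever interrupt disable/restore mechanism your particular part and toolchain actually provide:
#include <stdint.h>

/* Placeholders: map these to the real interrupt disable/restore primitives
 * of your toolchain (e.g. saving and restoring the interrupt mask). */
#define ENTER_CRITICAL()   /* disable interrupts, remember previous state */
#define EXIT_CRITICAL()    /* restore previous interrupt state */

static volatile uint32_t s_shared_value;    /* statically scoped, not global */

/* Called from the ISR: a single aligned 32-bit store. */
void shared_value_set_from_isr(uint32_t v)
{
    s_shared_value = v;
}

/* Called from the main loop: read inside a short critical section, so the
 * same accessor stays correct even on targets without atomic 32-bit loads. */
uint32_t shared_value_get(void)
{
    uint32_t v;
    ENTER_CRITICAL();
    v = s_shared_value;
    EXIT_CRITICAL();
    return v;
}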
If I have a multi-threaded program that reads a cache-type memory by reference, can I change this pointer from the main thread without risking any of the other threads reading unexpected values?
As I see it, if the change is atomic the other threads will either read the older value or the newer value; never random memory (or null pointers), right?
I am aware that I should probably use synchronisation methods anyway, but I'm still curious.
Are pointer changes atomic?
Update: My platform is 64-bit Linux (2.6.29), although I'd like a cross-platform answer as well :)
As others have mentioned, there is nothing in the C language that guarantees this, and it is dependent on your platform.
On most contemporary desktop platforms, the read/write to a word-sized, aligned location will be atomic. But that really doesn't solve your problem, due to processor and compiler re-ordering of reads and writes.
For example, the following code is broken:
Thread A:
DoWork();
workDone = 1;
Thread B:
while(workDone == 0);
ReceiveResultsOfWork();
Although the write to workDone is atomic, on many systems there is no guarantee by the processor that the write to workDone will be visible to other processors before writes done via DoWork() are visible. The compiler may also be free to re-order the write to workDone to before the call to DoWork(). In both cases, ReceiveResultsOfWork() might start working on incomplete data.
Depending on your platform, you may need to insert memory fences and so on to ensure proper ordering. This can be very tricky to get right.
Or just use locks. Much simpler, much easier to verify as correct, and in most cases more than performant enough.
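For the workDone example above, a sketch using a POSIX mutex (assuming pthreads is available; DoWork/ReceiveResultsOfWork are the placeholders from the example). The lock both orders the operations and makes the flag's update visible to the other thread:
#include <pthread.h>

void DoWork(void);
void ReceiveResultsOfWork(void);

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int workDone = 0;

void thread_a(void)                  /* producer */
{
    DoWork();                        /* results written before the flag */
    pthread_mutex_lock(&lock);
    workDone = 1;
    pthread_mutex_unlock(&lock);
}

void thread_b(void)                  /* consumer */
{
    for (;;) {
        pthread_mutex_lock(&lock);
        int done = workDone;
        pthread_mutex_unlock(&lock);
        if (done)
            break;
    }
    ReceiveResultsOfWork();          /* guaranteed to see DoWork()'s writes */
}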
The C language says nothing about whether any operations are atomic. I've worked on microcontrollers with 8 bit buses and 16-bit pointers; any pointer operation on these systems would potentially be non-atomic. I think I remember Intel 386s (some of which had 16-bit buses) raising similar concerns. Likewise, I can imagine systems that have 64-bit CPUs, but 32-bit data buses, which might then entail similar concerns about non-atomic pointer operations. (I haven't checked to see whether any such systems actually exist.)
EDIT: Michael's answer is well worth reading. Bus size vs. pointer size is hardly the only consideration regarding atomicity; it was simply the first counterexample that came to mind for me.
You didn't mention a platform. So I think a slightly more accurate question would be
Are pointer changes guaranteed to be atomic?
The distinction is necessary because different C/C++ implementations may vary in this behavior. It's possible for a particular platform to guarantee atomic assignments and still be within the standard.
As to whether or not this is guaranteed overall in C/C++, the answer is No. The C standard makes no such guarantees. The only way to guarantee a pointer assignment is atomic is to use a platform specific mechanism to guarantee the atomicity of the assignment. For instance the Interlocked methods in Win32 will provide this guarantee.
Which platform are you working on?
The cop-out answer is that the C spec does not require a pointer assignment to be atomic, so you can't count on it being atomic.
The actual answer would be that it probably depends on your platform, compiler, and possibly the alignment of the stars on the day you wrote the program.
"Normal" pointer modification isn't guaranteed to be atomic.
Check "Compare and Swap" (CAS) and other atomic operations; they're not part of the C standard, but most compilers give some access to the processor primitives. In the GNU GCC case, there are several built-in functions.
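As an illustration (the cache type and function name here are made up), swapping a shared pointer with one of GCC's legacy built-ins might look like this:
struct cache;                       /* whatever the cached data actually is */

struct cache *g_cache;              /* pointer shared between threads */

/* Publish a new cache only if nobody swapped it in the meantime;
 * returns nonzero on success. */
int publish_cache(struct cache *expected_old, struct cache *new_cache)
{
    return __sync_bool_compare_and_swap(&g_cache, expected_old, new_cache);
}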
The only thing guaranteed by the standard is the sig_atomic_t type.
As you've seen from the other answers, it is likely to be OK when targeting generic x86 architecture, but very risky with more "specialty" hardware.
If you're really desperate to know, you can compare sizeof(sig_atomic_t) to sizeof(int*) and see what they are on your target system.
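A throwaway check along those lines (just a sketch):
#include <signal.h>
#include <stdio.h>

int main(void)
{
    printf("sig_atomic_t: %zu bytes, pointer: %zu bytes\n",
           sizeof(sig_atomic_t), sizeof(int *));
    return 0;
}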
It turns out to be quite a complex question. I asked a similar question and read everything I got pointed to. I learned a lot about how caching works in modern architectures, and didn't find anything that was definitive. As others have said, if the bus width is smaller than the pointer bit-width, you might be in trouble. Specifically if the data falls across a cache-line boundary.
A prudent architecture will use a lock.