I am trying to do an atomic increment on a 64-bit variable on a 32-bit system, using atomic_fetch_add_explicit(&system_tick_counter_us, 1, memory_order_relaxed);
But the compiler emits this diagnostic: warning: large atomic operation may incur significant performance penalty; the access size (8 bytes) exceeds the max lock-free size (4 bytes) [-Watomic-alignment]
My question is how I can achieve atomicity without using critical sections.
How can I achieve atomicity without using critical sections?
On an object that is larger than a single memory "word"? You probably can't. End of story. Use a mutex, or use _Atomic anyway and accept that the library will take a lock on your behalf.
You can't even do this with 32-bit data on systems that have to read the memory, modify the value and store it back (read-modify-write, RMW). Almost all (if not all) RISC processors lack instructions that modify memory in a single instruction; that includes all ARM Cortex-M micros, RISC-V and many, many other processors.
Many of them have special hardware mechanisms which can help achieve atomic access (at least by preventing other contexts from accessing the data). Cortex-M cores have the LDREX/STREX instructions, and some parts have hardware mutexes or semaphores, but they still require the programmer to provide atomic (or at least mutually exclusive) access to the memory location.
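As an illustration of how such a read-modify-write is typically expressed in portable C11, here is a minimal sketch of a compare-exchange loop; on Cortex-M3 and later this usually compiles down to an LDREX/STREX retry loop. The function name fetch_add_u32 is made up for the example, and a plain atomic_fetch_add_explicit on a 32-bit _Atomic variable would do the same job.

#include <stdatomic.h>
#include <stdint.h>

/* Atomically add 'delta' to *p and return the previous value.
 * On ARM Cortex-M3 and later this typically becomes an
 * LDREX/STREX retry loop. */
static uint32_t fetch_add_u32(_Atomic uint32_t *p, uint32_t delta)
{
    uint32_t old = atomic_load_explicit(p, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(
               p, &old, old + delta,
               memory_order_relaxed, memory_order_relaxed)) {
        /* the compare failed: 'old' now holds the current value, retry */
    }
    return old;
}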
If you need to read the 64-bit value while the program is running, then you probably can't do this safely without a mutex, as others have said. But on the off-chance that you only need to read this value after all of the threads have finished, you can implement it with an array of two 32-bit atomic variables.
Since your system can only guarantee atomicity of this type on 4-byte memory regions, you should use those instead to maximize performance, for instance:
#include <stdio.h>
#include <threads.h>
#include <stdint.h>
#include <stdatomic.h>
_Atomic uint32_t system_tick_counter_us[2];
Then increment one of those two 4-byte atomic variables whenever you want to increment the 8-byte one, then check whether it wrapped around, and if it did, atomically increment the other. Keep in mind that atomic_fetch_add_explicit returns the value of the atomic variable before it was incremented, so it's important to check for the value that causes the wraparound (UINT32_MAX), not zero.
if (atomic_fetch_add_explicit(&system_tick_counter_us[0], 1, memory_order_relaxed) == UINT32_MAX)
    atomic_fetch_add_explicit(&system_tick_counter_us[1], 1, memory_order_relaxed);
However, as I mentioned, this can cause a race condition if the 64-bit value is constructed between system_tick_counter_us[0] wrapping around and that same thread incrementing system_tick_counter_us[1]; but if you can find a way to guarantee that all threads are done executing the two lines above, then this is a safe solution.
The 64-bit value can be constructed as ((uint64_t)system_tick_counter_us[1] << 32) | (uint64_t)system_tick_counter_us[0] once you're sure the memory is no longer being modified.
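As a minimal sketch of that reconstruction (assuming, as above, that all incrementing threads have finished before it is called; the function name read_tick_counter_us is made up for illustration):

#include <stdatomic.h>
#include <stdint.h>

extern _Atomic uint32_t system_tick_counter_us[2];

/* Only safe once no thread is still incrementing the counter. */
static uint64_t read_tick_counter_us(void)
{
    uint32_t lo = atomic_load_explicit(&system_tick_counter_us[0], memory_order_relaxed);
    uint32_t hi = atomic_load_explicit(&system_tick_counter_us[1], memory_order_relaxed);
    return ((uint64_t)hi << 32) | lo;
}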
I'd like to be able to use something like this to make access to my ports clearer:
typedef struct {
    unsigned rfid_en: 1;
    unsigned lcd_en: 1;
    unsigned lcd_rs: 1;
    unsigned lcd_color: 3;
    unsigned unused: 2;
} portc_t;
extern volatile portc_t *portc;
But is it safe? It works for me, but...
1) Is there a chance of race conditions?
2) Does gcc generate read-modify-write cycles for code that modifies a single field?
3) Is there a safe way to update multiple fields?
4) Is the bit packing and order guaranteed? (I don't care about portability in this case, so gcc-specific options to make it Do What I Mean are fine.)
Handling race conditions must be done with operating-system-level calls (which will indeed use read-modify-write sequences); GCC won't do that for you.
Same as above; and no, GCC does not generate atomic read-modify-write instructions for volatile. However, a CPU will normally do the final store atomically (simply because it's one instruction). This holds true if the bit-field stays within an int, for example, but it is CPU/implementation dependent; some may guarantee this up to an 8-byte value, others only up to 4-byte values. Under that condition, bits can't be mixed up (i.e. a few bits written from one thread and others from another thread won't occur).
The only way to set multiple fields at the same time is to set these values in an intermediate variable and then assign this variable to the volatile object (see the sketch after this answer).
The C standard specifies that bits are packed together (it seems that there might be exceptions when you start mixing types, but I've never seen that; everyone always uses unsigned ...).
Note: declaring something volatile does not cause the compiler to generate read-modify-writes. What volatile does is tell the compiler that every access to that address must actually be performed and may not be optimised away.
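A minimal sketch of that intermediate-variable technique, reusing the portc_t type from the question (the function name and field values are made up; note that the read and the write are still two separate accesses, so this only guarantees that the fields are written together in one store, not that the whole update is atomic with respect to an interrupt):

/* Update several fields with a single write to the port register.
 * Assumes portc_t and portc are declared as in the question. */
void set_lcd_state(unsigned rs, unsigned color)
{
    portc_t tmp = *portc;     /* one volatile read of the port           */
    tmp.lcd_en = 1;           /* modify several fields in the local copy */
    tmp.lcd_rs = rs & 1u;
    tmp.lcd_color = color & 7u;
    *portc = tmp;             /* one volatile write back to the port     */
}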
Here's another post about the same subject matter, and there are quite a few other places where you can find more details.
The keyword volatile has nothing to do with race conditions or with which thread is accessing the code. It tells the compiler not to cache the value in registers, and to generate code so that every access goes to the location allocated to the variable, because each access may see a different value. This is the case with memory-mapped peripherals. It doesn't help if your MPU has its own cache; there are usually special instructions or uncached areas of the memory map to ensure the location, and not a cached copy, is read.
As for being thread safe, just remember that even a plain memory access may not be thread safe if it is done in two instructions. E.g. in 8051 assembler you have to fetch a 16-bit value one byte at a time. The instruction sequence can be interrupted by an IRQ or another thread between the two bytes, so the value read or written may be corrupted.
Recently I've peeked into the Linux kernel implementation of an atomic read and write and a few questions came up.
First the relevant code from the ia64 architecture:
typedef struct {
    int counter;
} atomic_t;

#define atomic_read(v)      (*(volatile int *)&(v)->counter)
#define atomic64_read(v)    (*(volatile long *)&(v)->counter)
#define atomic_set(v,i)     (((v)->counter) = (i))
#define atomic64_set(v,i)   (((v)->counter) = (i))
For both read and write operations, it seems that the direct approach was taken to read from or write to the variable. Unless there is another trick somewhere, I do not understand what guarantees exist that this operation will be atomic at the assembly level. I guess an obvious answer would be that such an operation translates to a single assembly opcode, but even so, how is that guaranteed when taking into account the different memory cache levels (or other optimizations)?
On the read macros, the volatile type is used in a casting trick. Does anyone have a clue how this affects atomicity here? (Note that it is not used in the write operation.)
I think you are misunderstanding the (very much vague) usage of the words "atomic" and "volatile" here. Atomic only really means that the word will be read or written atomically (in one step, guaranteeing that the contents of this memory location will always be one write or the other, and not something in between). And the volatile keyword tells the compiler never to assume it already knows the data at that location because of an earlier read/write (basically, never optimize away the read).
What the words "atomic" and "volatile" do NOT mean here is that there's any form of memory synchronization. Neither implies ANY read/write barriers or fences. Nothing is guaranteed with regards to memory and cache coherence. These functions are basically atomic only at the software level, and the hardware can optimize/lie however it deems fit.
Now as to why simply reading is enough: the memory models for each architecture are different. Many architectures can guarantee atomic reads or writes for data aligned to a certain byte offset, or x words in length, etc., and this varies from CPU to CPU. The Linux kernel contains many defines for the different architectures that let it do without special atomic instructions (CMPXCHG, basically) on platforms that guarantee atomic reads/writes (sometimes only in practice, even if their spec says they don't actually guarantee it).
As for the volatile, while there is no need for it in general unless you're accessing memory-mapped IO, it all depends on when/where/why the atomic_read and atomic_write macros are being called. Many compilers will (though it is not required by the C spec) generate memory barriers/fences for volatile variables (GCC, off the top of my head, might be one; MSVC does for sure). While this would normally mean that all reads/writes to this variable are officially exempt from just about any compiler optimization, in this case, by creating a "virtual" volatile access, only this particular read/write is off-limits for optimization and re-ordering.
Reads are atomic on most major architectures, so long as they are aligned to a multiple of their size (and aren't bigger than the natively supported read size); see the Intel Architecture manuals. Writes, on the other hand, may be different: Intel states that under x86, single-byte writes and aligned writes may be atomic; under IPF (IA-64), everything uses acquire and release semantics, which would make it guaranteed atomic, see this.
The volatile prevents the compiler from caching the value locally, forcing it to be re-read wherever it is accessed.
If you write for a specific architecture, you can make assumptions specific to it.
I guess IA-64 does compile these things to a single instruction.
The cache shouldn't be an issue, unless the counter crosses a cache line boundary. But if 4/8-byte alignment is required, this can't happen.
A "real" atomic instruction is required when an operation translates into two memory accesses. This is the case for increments (read, increment, write) or compare-and-swap.
volatile affects the optimizations the compiler can do.
For example, it prevents the compiler from converting multiple reads into one read.
But on the machine instruction level, it does nothing.
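A small sketch of that last point (assuming a flag that an interrupt handler sets): without volatile the compiler may read data_ready once and spin forever on a cached register value; with volatile every iteration performs a real load, but no ordering or atomicity of other data is implied.

#include <stdbool.h>

volatile bool data_ready;            /* set from an ISR, for example */

void wait_for_data(void)
{
    while (!data_ready) {            /* re-read from memory each iteration */
        /* spin (or sleep / WFI in real code) */
    }
}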
I have multiple threads reading the same int variable, and one thread writing the value.
I don't care about the race condition.
My only concern is: is writing and reading the int value at the same time memory safe, and will it not result in any application crash?
Yes, that should be all right. The only way I can envision that crashing is if one of the threads deallocates the memory backing that integer. For best results I would also make sure the integers are aligned at sizeof(int) boundaries. (Some CPUs cannot access integers at all without this alignment. Others provide weaker guarantees of atomicity for unaligned access.)
Yes, on x86 and x86-64, as long as the value you're reading is aligned properly. 32-bit ints need to be aligned on a 4-byte boundary in order for reads and writes to be atomic, which will almost always be the case unless you go out of your way to create unaligned ints (say, by using a packed structure or by doing casting/pointer arithmetic on byte buffers).
You probably also want to declare your variable as volatile so that the compiler will generate code that will re-fetch the variable from memory every time it's accessed. That will prevent it from making optimizations such as caching it in a register when it might be altered by another thread.
On all Linux platforms that I know of, reads and writes of aligned ints are atomic and safe. You will never read a value that wasn't written (no word tearing). You will never cause a fault or crash.
I am trying to understand atomic and non-atomic operations, with respect to the operating system and also with respect to C.
As per the wikipedia page here
Consider a simple counter which different processes can increment.
Non-atomic
The naive, non-atomic implementation:
reads the value in the memory location;
adds one to the value;
writes the new value back into the memory location.
Now, imagine two processes are running incrementing a single, shared memory location:
the first process reads the value in memory location;
the first process adds one to the value;
but before it can write the new value back to the memory location it is suspended, and the second process is allowed to run:
the second process reads the value in memory location, the same value that the first process read;
the second process adds one to the value;
the second process writes the new value into the memory location.
How can the above operation be made an atomic operation?
My understanding of an atomic operation is that anything which executes without interruption is atomic.
So for example
int b=1000;
b+=1000;
should be an atomic operation as per my understanding, because both statements executed without an interruption. However, I learned from someone that in C there is no such thing as an atomic operation, so both statements above are non-atomic.
So what I want to understand is: how is atomicity different when it comes to programming languages versus operating systems?
C99 doesn't have any way to make variables atomic with respect to other threads. C99 has no concept of multiple threads of execution. Thus, you need to use compiler-specific extensions, and/or CPU-level instructions to achieve atomicity.
The next C standard, currently known as C1x, will include atomic operations.
Even then, mere atomicity just guarantees that an operation is atomic; it doesn't guarantee when that operation becomes visible to other CPUs. To achieve visibility guarantees in C99, you would need to study your CPU's memory model and possibly use a special kind of CPU instruction known as a fence or memory barrier. You also need to tell the compiler about it, using some compiler-specific compiler barrier. C1x defines several memory orderings, and when you use an atomic operation you can decide which memory ordering to use.
Some examples:
/* NOT atomic */
b += 1000;

/* GCC extension, only in newish GCCs;
 * atomicity requirements on b's loads are CPU-specific
 */
__sync_add_and_fetch(&b, 1000);

/* GCC extension + x86 assembly;
 * b should be aligned to its size (natural alignment),
 * or the access will not be atomic
 */
__asm__ __volatile__("lock addl $1000, %0" : "+m"(b));

/* C1x */
#include <stdatomic.h>
atomic_int b = ATOMIC_VAR_INIT(1000);
int r = atomic_fetch_add(&b, 1000) + 1000;
All of this is as complex as it seems, so you should normally stick to mutexes, which makes things easier.
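For reference, a minimal sketch of the mutex approach using C11 threads (the same idea works with pthread_mutex_t; the names b and b_lock are just examples):

#include <threads.h>

static int b = 1000;
static mtx_t b_lock;               /* initialise once: mtx_init(&b_lock, mtx_plain) */

void add_1000(void)
{
    mtx_lock(&b_lock);             /* no other thread can touch b in here */
    b += 1000;
    mtx_unlock(&b_lock);
}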
int b = 1000;
b+=1000;
gets turned into multiple operations at the instruction level: at the very least, preparing a register or memory location, assigning 1000, then getting the contents of that register/memory, adding 1000 to the contents, and storing the new value (2000) back. Without locking, the OS can suspend the process/thread at any point in that sequence. In addition, on multiprocessor systems, a different processor could access that memory (it wouldn't be a register in this case) while your operation is in progress.
When you take a lock out (which is how you would make this atomic), you are, in part, informing the OS that it is not ok to suspend this process/thread, and that this memory should not be accessed by other processes.
Now the above code would probably be optimized by the compiler to a simple assignment of 2000 to the memory location for b, but I'm ignoring that for the purposes of this answer.
b+=1000 is compiled, on all systems that I know, to multiple instructions. Thus it is not atomic.
Even b=1000 can be non atomic although you have to work hard to construct a situation where it is not atomic.
In fact C has no concept of threads and so there is nothing that is atomic in C. You need to rely on implementation specific details of your compiler and tools.
The above statements are non-atomic because they become a move instruction to load b into a register (if it isn't already in one), then an add of 1000 to it, and then a store back into memory. Many instruction sets allow for atomicity through an atomic increment, the easiest being x86 with lock addl src, dest; some other instruction sets use cmpxchg to achieve the same result.
So what I want to understand is: how is atomicity different when it comes to programming languages versus operating systems?
I'm a bit confused by this question. What do you mean exactly? The atomicity concept is the same in programming languages and in operating systems.
Regarding atomicity and language, here is for example a link about atomicity in JAVA, that might give you a different perspective: What operations in Java are considered atomic?
A quick question I've been wondering about for some time: does the CPU assign values atomically, or is it done bit by bit (say, for example, a 32-bit integer)?
If it's bit by bit, could another thread accessing this exact location get a "part" of the to-be-assigned value?
Think of this:
I have two threads and one shared "unsigned int" variable (call it "g_uiVal").
Both threads loop.
One is printing "g_uiVal" with printf("%u\n", g_uiVal).
The second just increases this number.
Will the printing thread ever print something that is not, or is only part of, "g_uiVal"'s value?
In code:
#include <stdio.h>

unsigned int g_uiVal;

void thread_writer()
{
    while (1)
        g_uiVal++;
}

void thread_reader()
{
    while (1)
        printf("%u\n", g_uiVal);
}
Depends on the bus widths of the CPU and memory. In a PC context, with anything other than a really ancient CPU, accesses of up to 32 bits are atomic; 64-bit accesses may or may not be. In the embedded space, many (most?) CPUs are 32 bits wide and there is no provision for anything wider, so your int64_t is guaranteed to be non-atomic.
I believe the only correct answer is "it depends". On what, you may ask?
Well, for starters, which CPU. Also, some CPUs are atomic for writing word-width values, but only when aligned. It really is not something you can guarantee at the C language level.
Many compilers offer "intrinsics" to emit correct atomic operations. These are extensions which act like functions, but emit the correct code for your target architecture to get the needed atomic operations. For example: http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html
You said "bit-by-bit" in your question. I don't think any architecture does operations a bit at a time, except with some specialized serial protocol busses. Standard memory read/writes are done with 8, 16, 32, or 64 bits of granularity. So it is POSSIBLE the operation in your example is atomic.
However, the answer is heavily platform dependent.
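For example, with the GCC builtins linked above, the writer side of the question's example could be made a true atomic increment; a sketch, assuming GCC or a compatible compiler:

unsigned int g_uiVal;

void thread_writer(void)
{
    /* atomic read-modify-write; returns the value before the add */
    __sync_fetch_and_add(&g_uiVal, 1);
}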
It depends on the CPU's capabilities.
Can the hardware do an atomic 32-bit operation? Here's a hint: if the variable you are working on is larger than the native register size (e.g. a 64-bit int on a 32-bit system), it's definitely NOT atomic.
It depends on how the compiler generates the machine code. It could have turned your 32-bit variable access into 4x 8-bit memory reads.
It gets tricky if the address of what you are accessing is not aligned to the machine's natural word boundary. You can hit a cache fault or page fault.
It is VERY POSSIBLE that you would see a corrupt or unexpected value using the code example that you posted.
Your platform probably provides some method of doing atomic operations. In the case of a Windows platform, it is via the Interlocked functions. In the case of Linux/Unix, look at the atomic_t type.
To add to what has been said so far - another potential concern is caching. CPUs tend to work with the local (on die) memory cache which may or may not be immediately flushed back to the main memory. If the box has more than one CPU, it is possible that another CPU will not see the changes for some time after the modifying CPU made them - unless there is some synchronization command informing all CPUs that they should synchronize their on-die caches. As you can imagine such synchronization can considerably slow the processing down.
Don't forget that the compiler assumes a single thread when optimizing, and this whole thing could just be optimized away.
POSIX defines the special type sig_atomic_t, which guarantees that writes to it are atomic with respect to signals, which should also make it atomic from the point of view of other threads, as you want. It doesn't specifically define an atomic cross-thread type like this, since thread communication is expected to be mediated by mutexes or other synchronization primitives.
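A typical (sketched) use of that type is a flag shared between a signal handler and the main loop:

#include <signal.h>

static volatile sig_atomic_t got_signal = 0;

static void handler(int signo)
{
    (void)signo;
    got_signal = 1;                /* atomic with respect to the interrupted code */
}

/* elsewhere, in the main loop:  if (got_signal) { ... handle it ... } */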
Considering modern microprocessors (and ignoring microcontrollers), the 32-bit assignment is atomic, not bit-by-bit.
However, now slightly off your question's topic: the printing thread could still print something unexpected because of the lack of synchronization in this example, due to instruction reordering and to multiple cores each having their own copy of g_uiVal in their caches.
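If the example from the question were written against C11, a hedged sketch of how to avoid both torn accesses and the compiler caching the value would be:

#include <stdatomic.h>
#include <stdio.h>

_Atomic unsigned int g_uiVal;

void thread_writer(void)
{
    for (;;)
        atomic_fetch_add_explicit(&g_uiVal, 1, memory_order_relaxed);
}

void thread_reader(void)
{
    for (;;)
        printf("%u\n", atomic_load_explicit(&g_uiVal, memory_order_relaxed));
}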